Stock Price Forecasting: From Classical Time Series to Deep Learning Methods

Time series data is everywhere, and we meet it often. You can see it in the number of airplane passengers, weather predictions, and stock price indexes, and during the pandemic it showed up in the data on the number of people exposed to COVID-19.

OK, so you are probably familiar with it. But what exactly is a time series?

A time series can be defined as a collection of observations recorded sequentially over time. Observations are usually correlated: the current observation Xt is influenced by the previous observation Xt−1. The order of the observations is therefore important.
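To make the idea of "the current observation depends on the previous one" concrete, here is a minimal sketch on synthetic data (not the stock data used later). The coefficient 0.8, the series length, and the helper name lag1_corr are arbitrary choices for illustration. It compares the correlation between neighbouring observations before and after shuffling the series.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic series where each value depends on the previous one:
# x[t] = 0.8 * x[t-1] + noise   (0.8 is an arbitrary illustrative coefficient)
n = 2000
noise = rng.normal(size=n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.8 * x[t - 1] + noise[t]

def lag1_corr(series):
    """Correlation between the series and itself shifted by one step."""
    return np.corrcoef(series[:-1], series[1:])[0, 1]

ordered = lag1_corr(x)                     # neighbours are strongly related
shuffled = lag1_corr(rng.permutation(x))   # shuffling destroys the relationship

print(f"lag-1 correlation, original order: {ordered:.2f}")
print(f"lag-1 correlation, shuffled order: {shuffled:.2f}")
```

The values are the same in both cases; only the order changes, and that alone is enough to remove the dependence between neighbouring observations.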

The examples above are univariate time series because each consists of only one type of observation at each time point. There are also multivariate time series, where, as the name implies, there is more than one observation at each point in time. For example, at a given point in time you might record not only the weather but also the temperature and humidity, and weather, temperature, and humidity may influence each other.

One of the purposes of time series analysis is to forecast the future. The methods available range from classical statistics to deep learning. This blog works through a practical analysis, focusing only on univariate time series. We want to forecast the stock price of PT. Bank Central Asia, Tbk. (BBCA.JK).

Import Data

The data used is the BBCA daily closing stock price from Yahoo Finance for the period 2018 - 2022. I use Google Colab for the analysis, so I first need to install and import the required libraries.

!pip install pmdarima
!pip install prophet

!pip install chart-studio
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.io as pio

from pmdarima.arima.utils import ndiffs
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller, kpss
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima.model import ARIMA

from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error 
from sklearn.preprocessing import StandardScaler, MinMaxScaler

import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator
from prophet import Prophet

Next, import the data. Take the Date and Close columns for modeling. Perform data preprocessing by changing the Date data format to datetime format.

df = pd.read_csv("/content/BBCA.JK.csv")
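df = df[['Date', 'Close']].copy()  # copy to avoid modifying a view of the original frame

# Assign the whole column so that its dtype becomes datetime64
df['Date'] = pd.to_datetime(df['Date'])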

Dataset Splitting

Sequential data cannot be split the way data is split in general modeling, for example with train_test_split from scikit-learn, which shuffles the rows by default. Sequential data depends on the order of the observations, so the data must not be shuffled when it is split. We will split the data without shuffling as follows:

split_time = 1000
train = df[:split_time]
test = df[split_time:]

train.shape, test.shape

((1000, 2), (125, 2))
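If you prefer to split by proportion rather than by a fixed row index, the same idea can be sketched as a small helper. This is only an illustration: the function name chronological_split, the 0.8 fraction, and the toy data are hypothetical and not part of the analysis above.

```python
import pandas as pd

def chronological_split(frame, train_fraction=0.8):
    """Split a DataFrame with a 'Date' column into train/test without shuffling."""
    ordered = frame.sort_values("Date").reset_index(drop=True)
    cut = int(len(ordered) * train_fraction)
    # Everything before the cut is training data, everything after is test data
    return ordered.iloc[:cut], ordered.iloc[cut:]

# Tiny made-up example (not the BBCA data)
toy = pd.DataFrame({
    "Date": pd.date_range("2022-01-03", periods=10, freq="D"),
    "Close": range(100, 110),
})
toy_train, toy_test = chronological_split(toy, train_fraction=0.8)

# Every training date should come before every test date
print(toy_train["Date"].max() < toy_test["Date"].min())
print(toy_train.shape, toy_test.shape)
```

Sorting by date first guards against a file whose rows are not already in chronological order.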

Before doing the analysis, let's see what the data looks like by plotting it:

# plot 1
pio.templates.default = "plotly_white"

st_fig = go.Figure()

line1 = go.Scatter(x = train['Date'], y = train['Close'], mode = 'lines', name = 'Train')
line2 = go.Scatter(x = test['Date'], y = test['Close'], mode = 'lines', name= 'Test')

# Add both lines to the figure so they are actually drawn
st_fig.add_trace(line1)
st_fig.add_trace(line2)

st_fig.update_layout(title='Daily Close Stock Price BBCA Period 2018-2022')
st_fig.show()

The plot shows that, in general, the data trends upward throughout the period, although there are some fluctuations caused by unexpected factors, such as the COVID-19 pandemic in early 2020, which pushed stock prices down. To see the structure more clearly, the data can be decomposed as follows:

# plot 2
sd = df.copy().dropna()
sd.set_index('Date', inplace=True)

# On Matplotlib 3.6+ this style is named "seaborn-v0_8-whitegrid"
plt.style.use("seaborn-whitegrid")
plt.rc("figure", figsize=(10,10))

fig = seasonal_decompose(sd, model='multiplicative', period=1).plot()
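For intuition, a multiplicative decomposition treats the series as trend × seasonal × residual. The sketch below imitates that idea with plain NumPy on a synthetic series: a moving average estimates the trend, and the ratio to the trend is averaged by position in the cycle. It is a simplified illustration, not the exact statsmodels implementation, and the period of 5, the trend, and the seasonal pattern are all made-up values.

```python
import numpy as np

period = 5                                   # hypothetical cycle length
n = 20 * period
t = np.arange(n)

# Synthetic multiplicative series: linear trend times a repeating pattern
trend = 100 + 0.5 * t
pattern = np.array([0.96, 0.98, 1.00, 1.02, 1.04])   # averages to 1
y = trend * np.tile(pattern, n // period)

# Step 1: estimate the trend with a centered moving average over one full period
kernel = np.ones(period) / period
trend_hat = np.convolve(y, kernel, mode="valid")

# Step 2: divide the series by the trend (multiplicative model)
half = period // 2
aligned = slice(half, half + len(trend_hat))          # points that have a trend estimate
ratio = y[aligned] / trend_hat

# Step 3: average the ratios by position in the cycle to get the seasonal factors
phase = t[aligned] % period
seasonal_hat = np.array([ratio[phase == k].mean() for k in range(period)])

# Step 4: whatever is left after removing trend and seasonality is the residual
residual = ratio / seasonal_hat[phase]

print("estimated seasonal factors:", np.round(seasonal_hat, 3))
```

With period=1, as in the call above, there is no cycle to average over, so the seasonal component is flat and the decomposition mainly separates trend from residual.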

ARIMA (Autoregressive Integrated Moving Average)

ARIMA is one of the classical statistical methods used for time series data. This method assumes that the data is stationary. In simple terms, this means the series has a constant mean and variance over time. Because the plot shows fluctuations and an upward trend, this data is probably not stationary. To check the stationarity of the data, you can use the following ADF Test or KPSS Test:

ADF Test

  • H0: The series has a unit root (not stationary)

  • H1: The series is stationary

KPSS Test

  • H0: The series is stationary

  • H1: The series has a unit root (not stationary)

result = adfuller(train['Close'])
print(f'ADF Statistic : {result[0]}')
print(f'P-value : {result[1]}')

result_kpss = kpss(train['Close'])
print(f'KPSS Statistic : {result_kpss[0]}')
print(f'P-value : {result_kpss[1]}')

ADF Statistic : -1.531829948563764
P-value : 0.5177109001310636
KPSS Statistic : 3.7931630312874
P-value : 0.01

The ADF Test gives a p-value > α = 0.05, so we fail to reject H0, which means the data is not stationary. Likewise, the KPSS Test gives a p-value < α = 0.05, so we reject H0, which also means the data is not stationary. Both tests lead to the same conclusion: the data is not stationary. To meet the ARIMA assumption, the data needs to be made stationary.
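Because the two tests have opposite null hypotheses, it is easy to mix up which p-value means what. The decision rule used above can be sketched as a small helper. The function name interpret_stationarity and the "inconclusive" label for disagreeing tests are my own simplifications, not part of statsmodels.

```python
def interpret_stationarity(adf_pvalue, kpss_pvalue, alpha=0.05):
    """Combine ADF and KPSS p-values into a single plain-language verdict."""
    # ADF: H0 = unit root, so a small p-value is evidence FOR stationarity
    adf_says_stationary = adf_pvalue < alpha
    # KPSS: H0 = stationary, so a small p-value is evidence AGAINST stationarity
    kpss_says_stationary = kpss_pvalue >= alpha

    if adf_says_stationary and kpss_says_stationary:
        return "stationary"
    if not adf_says_stationary and not kpss_says_stationary:
        return "not stationary"
    return "inconclusive"  # the tests disagree; inspect the series more closely

# p-values reported by the two tests above
verdict = interpret_stationarity(0.5177109001310636, 0.01)
print(verdict)
```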

How can we make the data stationary? You can transform the data or difference it. For this analysis, we will difference the data.

Determining the differencing order can be done by looking at the Autocorrelation Function (ACF) plot and the Partial Autocorrelation Function (PACF) plot.
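For intuition about what to look for in these plots, here is a rough sketch on a synthetic random walk with drift (not the BBCA data). The drift of 0.1, the series length, and the helper name lag1_autocorr are arbitrary choices. It compares the lag-1 autocorrelation of the raw series, the series differenced once, and the series differenced twice.

```python
import numpy as np

rng = np.random.default_rng(42)

# Random walk with drift: a simple non-stationary series with an upward trend
walk = np.cumsum(0.1 + rng.normal(size=5000))

def lag1_autocorr(series):
    """Sample autocorrelation at lag 1."""
    centered = series - series.mean()
    return float(np.sum(centered[:-1] * centered[1:]) / np.sum(centered ** 2))

walk_diff1 = np.diff(walk, n=1)   # first differencing
walk_diff2 = np.diff(walk, n=2)   # second differencing

level_ac = lag1_autocorr(walk)
diff1_ac = lag1_autocorr(walk_diff1)
diff2_ac = lag1_autocorr(walk_diff2)

print(f"lag-1 autocorrelation, raw series:        {level_ac:.2f}")
print(f"lag-1 autocorrelation, differenced once:  {diff1_ac:.2f}")
print(f"lag-1 autocorrelation, differenced twice: {diff2_ac:.2f}")
```

In this synthetic case, one round of differencing is enough to remove the trend, and the second round pushes the lag-1 autocorrelation clearly below zero, which is the usual sign of over-differencing.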

# Plot ACF
fig, ax = plt.subplots(2, 2, figsize=(18,8))

diff_once = train['Close'].diff()
ax[0,0].plot(diff_once)  # differenced series
ax[0,0].set_title('First Differencing')

plot_acf(diff_once.dropna(), ax=ax[0,1])

diff_twice = train['Close'].diff().diff()
ax[1,0].plot(diff_twice)  # twice-differenced series
ax[1,0].set_title('Second Differencing')

plot_acf(diff_twice.dropna(), ax=ax[1,1])

# Plot PACF
fig, ax = plt.subplots(2, 2, figsize=(18,8))

diff_once = train['Close'].diff()
ax[0,0].plot(diff_once)  # differenced series
ax[0,0].set_title('First Differencing')

plot_pacf(diff_once.dropna(), ax=ax[0,1])

diff_twice = train['Close'].diff().diff()
ax[1,0].plot(diff_twice)  # twice-differenced series
ax[1,0].set_title('Second Differencing')

plot_pacf(diff_twice.dropna(), ax=ax[1,1])

From the ACF and PACF plots above, it can be seen that when the data is differenced twice, the value at lag 1 becomes more negative. This is a sign of over-differencing, so it is better to difference only once. To be sure, the differencing order can also be determined with the following ADF Test:

ndiffs(train['Close'], test='adf')


The ADF Test results show a valu