Stationary Time Series Model¶
Introduction¶
- Two types of classical decomposition model, with trend ($T$), seasonal ($S$), cyclical ($C$) and irregular/residual ($R$) components
- Additive: $Y_{t} =T + S + C + R$
- Multiplicative: $Y_{t} =T \cdot S \cdot C \cdot R$
- Exponential Smoothing
- Knowledge gets updated as time passes
- Predictions are made based on past experience
- Improvements over unobserved-components models
- More recent (past) values are given more weight than distant (past) values when predicting/forecasting future values.
- Modern Time Series
- Later developments from the 1980s onwards, using
- Measures of central tendency
- mean: sensitive to extreme values
- median: robust to outliers
- mode: useful for clustering
- Econometric Modelling
- Data Collection (sampling)
- Construction of model
- Estimation
- Validation
- Forecasting
- A time series model is typically univariate, while econometric modelling is multivariate, with the same objective (forecasting)
- A time series has a temporal flow (dynamics), while econometric modelling is static
A realization is a sequence of observations of a given variable recorded over a period of time, i.e. a single sample path of the underlying random process.
- A collection of realizations over the same time period is called an ensemble.
- A realization is a sample from stochastic process
History¶
- Klein-Goldberger Model
- KK-Pandit
- Granger and Newbold (Granger later a Nobel Laureate) said that all such models were misleading!
- If we use any time series data that is non-stationary, all of our inferences will be invalid.
- Everyone then started checking whether the data is stationary before modelling.
- Rule of thumb: $R^2 \gt\text{DW}$ $\implies$ Almost surely the model is wrong. (1976)
What is Stationary?¶
Let's look at the Cobweb model:
$$ P_{t} = \alpha P_{t-1} + c $$
A difference equation expresses the value of a variable in terms of its own lagged values (i.e. observations of the same variable at earlier points in time), possibly plus other terms.
For example, the first order difference equation would look like
$$ P_{t} - P_{t-1} = (\alpha-1)P_{t-1} + c $$
$$ \underbrace{y_{0},y_{1},y_{2},y_{3},\dots,y_{t-1},y_{t}}_{\text{Realization taken}}, \underbrace{y_{t+1}, y_{t+2},\dots y_{t+7}}_{\text{To Predict}} $$ Can take multiple lags too, $$ y_{t} = \alpha_{1} y_{t-1} + \alpha_{2}y_{t-2} + c $$
We take a realization, and use it to predict the future.
- In multivariate regression, we have many variables but in time series, we have the same variable with different lags.
- We can model a time series as an autoregression, which is mathematically a difference equation.
Types of Stationarity¶
- Strong Form (strict)
- Weak Form (Covariance Stationary)
Tests¶
- Several tests are available to check the significance of differences in MSPE (Mean Squared Prediction Error) between competing forecasts
- F-test
- Granger-Newbold test
- Diebold-Mariano test
Notation¶
- Conditional expectation: $E_{t}y_{t+2}$ is the conditional expectation of $y_{t+2}$ given information at time $t$
- Forecast error (for an AR(1) process): $e_{t}(j) = \epsilon_{t+j} +\sum_{i=1}^{j-1} a_{1}^{i}\epsilon_{t+j-i}$
- Proof HINT: $e_{t}(j) \equiv y_{t+j} - E_{t}y_{t+j}$
- Backwards operator: $\nabla f(x) = f(x) - f(x-h)$
- Forward operator: $\Delta f(x) = f(x+h) - f(x)$
Stochastic Difference Equation Models¶
Deterministic vs Stochastic (concept)¶
- Stochastic: A discrete variable $y$ is said to be random if, for any real number $r$, there exists a probability $p(y \leq r)$ that $y$ will take on a value equal to $r$ or less.
- Deterministic: If there exists some $r$ for which $p(y=r)=1$, then $y$ is deterministic rather than stochastic/random.
- Elements of a time series $\{y_0, y_1, \dots, y_t\}$ are realisations of a stochastic process.
- Since we cannot predict GDP exactly, it is stochastic.
$$ m_{t} = \rho (1.03)^tm_{0}^* + (1-\rho) m_{t-1} +\epsilon_{t} $$
- Properties of the above:
- This is a linear stochastic difference equation (since $\epsilon_{t}$ is stochastic)
- If we knew the distribution of $\epsilon_{t}$, we can know the distribution of $\{ m_{t} \}$, and as they are linked across time, it would be possible to calculate their joint probabilities.
- Having observed the first $t$ observations, we can make forecasts of $m_{t+1},m_{t+2},\dots$ like so
$$E_{t}(m_{t+1}) = \rho(1.03)^{t+1}m_{0}^* + (1-\rho)m_{t}$$
White Noise Process¶
$$ \epsilon_{t} \stackrel{\text{iid}}{\sim} \text{WN}(0,\sigma^2) $$
A sequence $\{ \epsilon_{t} \}$ is a white noise process if
- each value has mean of zero $E(\epsilon_{t}) = 0$
- constant variance $E(\epsilon_{t}^2) = \sigma^2$
- uncorrelated with all other realizations: $E(\epsilon_{t} \epsilon_{t-s}) = E(\epsilon_{t-j}\epsilon_{t-j-s}) = 0$ for all $j$ and all $s \neq 0$
This process is stationary
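As a quick illustration, the sketch below simulates a white-noise sequence with NumPy and checks the three properties on the sample (the seed, sample size and $\sigma^2=1$ are arbitrary choices, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(42)                  # arbitrary seed for reproducibility
T, sigma = 500, 1.0
eps = rng.normal(loc=0.0, scale=sigma, size=T)   # iid N(0, sigma^2) draws

print("sample mean:", eps.mean())                # ~ 0
print("sample variance:", eps.var())             # ~ sigma^2
print("lag-1 autocorrelation:",
      np.corrcoef(eps[:-1], eps[1:])[0, 1])      # ~ 0 (uncorrelated across time)
```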
Random Walk¶
A random walk is a Markov process (i.e. the future state of the process depends only on its current state and not on the sequence of events that led to that state), where $y_0$ is the starting position, given by:
$$ y_{t} = y_{t-1} + \epsilon_{t} $$
This, with a drift parameter $c$,
$$ y_{t} = y_{t-1} + c + \epsilon_{t} $$
is called a random walk with drift.
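A minimal simulation of a random walk with drift, assuming illustrative values for $c$ and $y_0$; it also verifies the closed-form solution $y_t = y_0 + ct + \sum_{i=1}^t \epsilon_i$:

```python
import numpy as np

rng = np.random.default_rng(0)
T, c, y0 = 200, 0.1, 0.0               # assumed drift and starting position
eps = rng.normal(size=T)

y = np.empty(T)
y[0] = y0 + c + eps[0]                 # first step from the starting position
for t in range(1, T):
    y[t] = y[t - 1] + c + eps[t]       # y_t = y_{t-1} + c + eps_t

# closed form: y_t = y_0 + c*t + cumulative sum of shocks up to t
y_closed = y0 + c * np.arange(1, T + 1) + np.cumsum(eps)
assert np.allclose(y, y_closed)
```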
Moving Average¶
A sequence formed in this manner:
$$ x_{t} = \sum_{i=0}^{q} \beta_{i}\epsilon_{t-i} $$
where $\{ \epsilon_{t} \}$ is a White Noise process, is called a moving average process.
- Denoted by $\text{MA}(q)$
- e.g. win \$1 when a fair coin shows heads, and lose \$1 when it shows tails. The average payoff on the last four tosses is
$$ \dfrac{1}{4}\epsilon_{t} + \dfrac{1}{4}\epsilon_{t-1} + \dfrac{1}{4}\epsilon_{t-2} + \dfrac{1}{4}\epsilon_{t-3} $$
thus, $\beta_{i} = 0.25$ for $i\leq 3$ and 0 otherwise
NOTE: A white-noise process cannot have $\beta_i \neq 0$ for more than one $i$; otherwise the third criterion (non-correlation) of white noise is not satisfied.
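The coin-toss example can be simulated directly; the sketch below builds the moving-average payoff series from $\pm 1$ shocks (the sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
T = 1000
eps = rng.choice([1.0, -1.0], size=T)      # +1 for heads, -1 for tails (fair coin)

beta = np.array([0.25, 0.25, 0.25, 0.25])  # beta_i = 0.25 for i = 0..3
x = np.convolve(eps, beta, mode="valid")   # x_t = average payoff on the last four tosses

print("mean of x:", x.mean())              # ~ 0
print("variance of x:", x.var())           # ~ 4 * 0.25^2 = 0.25
```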
ARMA Models¶
Combine an $\text{MA}(q)$ and an $\text{AR}(p)$ process to get an $\text{ARMA}(p,q)$
$$ y_{t} = a_{0} + \sum_{i=1}^p a_{i}y_{t-i} + x_{t} $$ Where, $\{ x_{t} \}$ is an $\text{MA}(q)$ process, so we get $$ y_{t} = a_{0} + \sum_{i=1}^p a_{i}y_{t-i} + \sum_{i=0}^q \beta_{i}\epsilon_{t-i} $$
- Condition: If the characteristic roots are all in the unit circle, $\{ y_{t} \}$ is called an $\text{ARMA}(p,q)$ model for $y$.
- $q=0$ implies a pure autoregressive process $\text{AR}(p)$
- $p=0$ implies a pure moving average process $\text{MA}(q)$
If one or more characteristic roots of $y_{t}$ are $\geq 1$, then the $\{ y_{t} \}$ sequence is said to be an integrated process, modelled as an $\text{ARIMA}$ model.
Examples¶
- ARMA(1,1)
- $y_{t} = \alpha_{0} + \alpha_{1}y_{t-1}+\epsilon_{t} + \beta_{1}\epsilon_{t-1}$
- ARMA(2,1)
- $y_{t} = \alpha_{0} + \alpha_{1}y_{t-1}+ \alpha_{2}y_{t-2}+\epsilon_{t} + \beta_{1}\epsilon_{t-1}$
- ARMA(1,0) is a pure AR process
- ARMA(0,1) is a pure MA process
- ARMA(2,1) is a mixed process.
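Processes like these are easy to simulate with statsmodels' `ArmaProcess`; the sketch below generates an ARMA(2,1) sample with illustrative coefficients (not taken from the notes). Note the sign convention: the AR polynomial is entered as $1 - a_1 L - a_2 L^2$.

```python
import numpy as np
from statsmodels.tsa.arima_process import ArmaProcess

# ARMA(2,1): y_t = 0.5 y_{t-1} + 0.2 y_{t-2} + eps_t + 0.4 eps_{t-1}
ar = np.array([1, -0.5, -0.2])     # coefficients of 1 - a1*L - a2*L^2
ma = np.array([1, 0.4])            # coefficients of 1 + b1*L
process = ArmaProcess(ar, ma)

print("stationary?", process.isstationary)   # checks the characteristic roots
y = process.generate_sample(nsample=500)     # one simulated realization
print(y[:5])
```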
Moving Average Representation¶
Solving the AR(1) process for $y_{t}$ in terms of the $\{ \epsilon_{t} \}$ sequence, we get
$$ y_{t} = \dfrac{a_{0}}{1-a_{1}} + \sum_{i=0}^\infty a_{1}^i \epsilon_{t-i} $$
- This is based on the sum of an infinite geometric series
- This expansion yields an $\text{MA}(\infty)$ process. Whether such an expansion converges is the key question, since convergence ensures the stability of the stochastic difference equation.
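To see where this comes from, substitute the AR(1) equation into itself repeatedly (assuming $|a_1|<1$ so the geometric terms converge and the influence of the initial condition dies out):
$$ \begin{align*} y_{t} &= a_{0} + a_{1}y_{t-1} + \epsilon_{t} \\ &= a_{0} + a_{1}(a_{0} + a_{1}y_{t-2} + \epsilon_{t-1}) + \epsilon_{t} \\ &= a_{0}(1+a_{1}) + a_{1}^{2}y_{t-2} + \epsilon_{t} + a_{1}\epsilon_{t-1} \\ &\;\;\vdots \\ &= a_{0}\sum_{i=0}^{n}a_{1}^{i} + a_{1}^{n+1}y_{t-n-1} + \sum_{i=0}^{n}a_{1}^{i}\epsilon_{t-i} \;\xrightarrow{\;n\to\infty\;}\; \dfrac{a_{0}}{1-a_{1}} + \sum_{i=0}^{\infty}a_{1}^{i}\epsilon_{t-i} \end{align*} $$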
Stationarity¶
We wish to find the mean, variance and autocorrelation of a time series. Suppose we have four realizations of the same process observed over the same time period, giving observations ($y_{1t}, y_{2t},y_{3t}, y_{4t}$) at each $t$. Then the ensemble mean at time $t$ is
$$ \bar{y}_{t} = \sum_{i=1}^4 y_{it}/4 $$
- Ensemble: multiple time-series (realizations) of the same process over the same time period.
- If we had an ensemble, we would be able to find the mean and variance (even if they were time dependent). But that is not a luxury we usually have, so we bet on the stationarity of the series $\{ y_{t} \}$: if a series is stationary, we can approximate the mean, variance and autocorrelation by sufficiently long time averages based on a single realization (instead of needing an ensemble).
Thus, using a single realization of, say, 20 observations, the long time average of the mean would be:
$$ \bar{y} \cong \sum_{t=1}^{20} y_{1t}/20 $$
We assume that the mean is the same for each period.
- Covariance stationary: a stochastic process having a finite, constant mean and variance for all $t$, with covariances between $y_t$ and $y_{t-s}$ depending only on $s$.
- Autocovariance: covariance between $y_t$ and $y_{t-s}$ (same series)
- Cross-covariance: covariance between one series and another.
Types of Stationarity¶
Strict Stationary¶
The joint probability distribution of $\{ y_{1},y_{2},\dots,y_{t} \}$ must be invariant to shifts in time (the same for every realization).
- All moments must be the same!
- That's a very tight condition!
Weak Stationary¶
A more liberal (weaker) form of strict stationarity.
- $E(y_{t})=\mu$ (constant)
- $\gamma_{l} = \text{Cov}(y_{t},y_{t-l})$ depends only on the lag $l$, not on $t$
- $\gamma_{-l} = \text{Cov}(y_{t},y_{t+l}) = \text{Cov}(y_{t+l},y_{t}) = \gamma_{l}$
- In particular ($l=0$), the variance is constant: "Homoskedasticity"
Thus, if (1) and (2) hold, the series is weakly stationary, since only the first two moments are required to be constant over time. It is therefore also known as moment stationary. The variance-covariance matrix is symmetric:
$$ \begin{bmatrix} V(x_{1}) & \text{Cov}(x_{1},x_{2}) & \cdots & \text{Cov}(x_{1},x_{k}) \\ \text{Cov}(x_{2},x_{1}) & V(x_{2}) & \cdots & \text{Cov}(x_{2},x_{k}) \\ \vdots & \vdots & \ddots & \vdots \\ \text{Cov}(x_{k},x_{1}) & \text{Cov}(x_{k},x_{2}) & \cdots & V(x_{k}) \end{bmatrix} $$
- So, the mean and variance are time-independent.
Methods¶
...to find if the series is stationary or not.
Graphical Method: Just by looking at the time-series plot.
Correlograms
- ACF
- PACF
- Portmanteau tests
Unit Root Tests
- DF/ADF
- PP
- KPSS
Difference Method¶
If we find that our series is non-stationary, then we can apply differencing to our time series
$$ y_{t} - y_{t-1} = \alpha_{0}t + \epsilon_{t} $$
Thus,
$$ Z_{t} = \Delta y_{t} = \alpha_{0}t + \epsilon_{t} $$
If $Z_t$ is stationary at levels, we say that our original series $\{y_t\}$ is first order stationary (integrated of order one, $I(1)$).
Second Difference¶
If we apply differencing again, i.e.
$$ \Delta(\Delta y_{t}) $$
and try to analyze if the original series is second order stationary or not.
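In practice, first and second differences are one-liners with pandas; the sketch below differences an illustrative trend-plus-random-walk series (the data-generating process is assumed for the example):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
t = np.arange(200)
# illustrative non-stationary series: linear trend plus a random-walk component
y = pd.Series(0.5 * t + np.cumsum(rng.normal(size=200)))

z1 = y.diff().dropna()          # first difference:  Z_t = y_t - y_{t-1}
z2 = y.diff().diff().dropna()   # second difference: Delta(Delta y_t)

print("variance of y, z1, z2:", y.var(), z1.var(), z2.var())
```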
Stochastic Process¶
A random process associated with some series $\{ y_{t}: t \in T \}$, where each time-variable is a Random Variable.
In an econometric model
$$ Y = AX + \mathcal{E} $$
$\mathcal{E}$ is the stochastic portion, whereas $AX$ is the deterministic portion.
- Regression is the average relationship, with causal inferences.
- Correlation is the exact relationship: it tells us the exact degree (strength and direction) of association between two variables, without causal inference.
Stationary Restrictions¶
Stationarity restrictions are the conditions required for the series to be stationary.
A homogeneous solution is one without the noise term.
Covariance & Correlation¶
- Covariance measures the degree to which two variables change together. It indicates whether an increase in one variable corresponds to increase or decrease in another variable.
$$ Cov(X,Y) = E[(X-E(X))(Y-E(Y))] $$
which can range from $-\infty \to +\infty$
- Correlation is the strength and direction of linear relationship between two variables, normalized to be unitless and bounded.
$$ \rho_{XY} = \dfrac{Cov(X,Y)}{\sigma_{X}\sigma_{Y}} $$
which ranges from $-1 \to +1$
For time-variables let $\rho_{l}$ represent the $\rho_{y_{t}, y_{t-l}}$
$$ \rho_{l} = \dfrac{Cov(y_{t},y_{t-l})}{\sqrt{ V(y_{t}) V(y_{t-l}) }} $$
and since we are looking at weak-stationary variables $Cov(y_{t},y_{t-l}) =Cov(y_{t+l}, y_{t})$ and $V(y_{t-l})=V(y_{t+l})$. Thus,
$$ \rho_{l} = \rho_{-l} $$
and,
$$ \rho_{l} = \dfrac{\gamma_{l}}{\sqrt{ V({y_{t}})V(y_{t-l}) }} = \dfrac{\gamma_{l}}{\gamma_{0}} $$
Hence, we can find the autocorrelations for every lag for a weakly-stationary time-variable like so:
$$ \begin{align*} \rho_{1} & = \dfrac{\gamma_{1}}{\gamma_{0}} \\ \rho_{2} & = \dfrac{\gamma_{2}}{\gamma_{0}} \\ \vdots \\ \rho_{k} & = \dfrac{\gamma_{k}}{\gamma_{0}} \end{align*} $$
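A small sketch of how these sample autocovariances and autocorrelations would be computed from a single realization, here of an AR(1) with an assumed coefficient of 0.8:

```python
import numpy as np

def sample_autocorr(y, max_lag):
    """Sample autocorrelations rho_l = gamma_l / gamma_0 from one realization."""
    y = np.asarray(y, dtype=float) - np.mean(y)
    gamma0 = np.mean(y * y)                       # sample gamma_0 (variance)
    return np.array([np.mean(y[l:] * y[:-l]) / gamma0 for l in range(1, max_lag + 1)])

# one realization of an AR(1) with assumed coefficient a1 = 0.8
rng = np.random.default_rng(7)
eps = rng.normal(size=1000)
y = np.zeros(1000)
for t in range(1, 1000):
    y[t] = 0.8 * y[t - 1] + eps[t]

print(sample_autocorr(y, 5))   # should decay roughly like 0.8**l
```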
AR(1)¶
$$ y_{t} = a_{0} + a_{1}y_{t-1} + \epsilon_{t} $$
- Mean is time dependent ($Ey_t \neq Ey_{t+s}$) so, sequence is not stationary
- $a_1$ is called the characteristic root
- But we can find a limiting value of $y_t$ for sufficiently large $t$: if $|a_{1}| \lt 1$, then as $t \to \infty$,
$$ \lim_{ t \to \infty } y_{t} = \dfrac{a_{0}}{1-a_{1}} + \sum_{i=0}^\infty a_{1}^i \epsilon_{t-i} $$
- The expected value is thus $\dfrac{a_{0}}{1-a_{1}}$
- And so this sequence $\{y_t\}$ would then be stationary! (THE KEY ASSUMPTION THAT ECONOMETRICIANS MAKE is that the data generating process has been occurring for an infinitely long time)
- So be careful about data generated from a "new" process.
For this reason, the initial value $y_0$ must be known. If not, then the solution would have an extra $A(a_1)^t$ term, which has to be zero for the series to be stationary. So either of two things should hold:
- $A = 0$ (The process should be in equilibrium. Since, $A$ is interpreted as the deviation from the long-run equilibrium)
- Sequence started infinitely long ago (thus the effect of the initial value, nullifies over time)
These two, alongside the condition that $|a_1| \lt 1$.
ACF¶
The ACF measures the correlation of a time series with its own lagged values. It quantifies how similar a time series is to itself at different time lags, helping to identify patterns like seasonality or trends in the data.
For a time series $y_t$, the autocorrelation function at lag $l$ is defined as
$$ \rho_{l} = \dfrac{Cov(y_{t},y_{t-l})}{\sqrt{ V(y_{t}) \times V(y_{t-l}) }} $$
Test for stationarity¶
A series is stationary
- if the ACF drops off quickly to zero as the lag increases.
- if it shows no significant long-term patterns or trends
- it may exhibit significant spikes at specific lags if the series has periodic or seasonal components, but these spikes are consistent and do not persist across many lags.
A series is non-stationary
- decays slowly or remains high for many lags
- may not approach zero even at higher lags
- shows irregular patterns
Details for certain models¶
For AR(1), $\gamma_{0}=\dfrac{\sigma^2}{(1-a_{1}^2)}$, $\gamma_{s}=\dfrac{\sigma^2(a_{1})^s}{(1-a_{1}^2)}$
- Divide $\dfrac{\gamma_{s}}{\gamma_{0}}$ to get the ACF or correlogram, should converge to zero if the series is stationary.
- $|a_{1}|\lt 1$ for stationarity… convergence is direct if $a_{1}$ is positive, damped oscillatory around zero if $a_{1}$ is negative.
For AR(2)
- "If the roots of the inverse characteristic equation lie outside the unit circle [i.e., if the roots of the homogeneous form of (2.22) lie inside the unit circle] and if the $\{x_t\}$ sequence is stationary, the $\{y_t\}$ sequence will be stationary." (pdf) (NOT UNDERSTOOD YET)
- For this to be stationarity the roots of $(1-a_{1}L - a_{2}L^2)$ should be outside the unit circle.
- Yule-Walker equations and 'method of undetermined coefficients' can be used to derive the autocovariances of an ARMA(2,1) process.
- We get $\gamma_{0} = a_{1}\gamma_{1}+a_{2}\gamma_{2}+\sigma^2$ and $\gamma_{s} = a_{1}\gamma_{s-1}+a_{2}\gamma_{s-2}$ for $s \geq 1$
For MA(1), $\rho_{0}=1$, $\rho_{1} = \dfrac{\beta}{1+\beta^2}$ and $\rho_{s} = 0$ for $s\gt1$
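The sketch below compares these theoretical ACFs with sample ACFs computed by `statsmodels.tsa.stattools.acf` on simulated AR(1) and MA(1) data (parameter values are illustrative):

```python
import numpy as np
from statsmodels.tsa.stattools import acf
from statsmodels.tsa.arima_process import ArmaProcess

a1, beta = 0.7, 0.5                                         # illustrative parameters

# theoretical ACFs from the formulas above
ar1_theory = a1 ** np.arange(6)                             # rho_s = a1**s
ma1_theory = np.r_[1.0, beta / (1 + beta**2), np.zeros(4)]  # MA(1) cuts off after lag 1

# sample ACFs from simulated series
y_ar1 = ArmaProcess(np.array([1, -a1]), np.array([1.0])).generate_sample(5000)
y_ma1 = ArmaProcess(np.array([1.0]), np.array([1, beta])).generate_sample(5000)

print("AR(1) sample:", np.round(acf(y_ar1, nlags=5), 2), "theory:", np.round(ar1_theory, 2))
print("MA(1) sample:", np.round(acf(y_ma1, nlags=5), 2), "theory:", np.round(ma1_theory, 2))
```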
Correlograms¶
Rules of identifying stationary or non-stationarity using a correlogram¶
- Box and Pierce developed a test based on the sample autocorrelations.
- But the problem is that it can only be used for large samples.
$$ Q\text{-statistic} \sim \chi^2(s) $$
where $s$ is the number of autocorrelations being tested.
- Ljung-Box
- Modified $Q\text{-statistic} \sim \chi^2$, valid for both small and large samples
- These are collectively called Portmanteau tests
PACF¶
- The indirect correlation between $y_t$ and $y_{t-2}$ in an AR(1) process is due to the correlation $\rho_2 = \rho_1 \times \rho_1$ i.e. the product of correlations between $y_t \to y_{t-1}$ and $y_{t-1} \to y_{t-2}$.
- The partial autocorrelation eliminates these indirect effects. How do we find it?
- Subtract mean $\mu$ of the series from each observation $y_t^*\equiv y_{t}-\mu$
- Regress $y_{t}^* =\phi_{11}y_{t-1}^* +e_{t}$ where $\{ e_{t} \}$ may be any error process (may not be white noise)
- Regress $y_{t}^* =\phi_{21}y_{t-1}^* + \phi_{22}y_{t-2}^* + e_{t}$ where $\{ e_{t} \}$ may be any error process (may not be white noise). So we get $\phi_{22}$ as the partial autocorrelation between $y_t$ and $y_{t-2}$
- Using Yule-Walker equations
- $\phi_{11} = \rho_1$
- $\phi_{22} =\dfrac{\rho_{2}- \rho_{1}^2}{1-\rho_{1}^2}$
Thus, PACF can help us identify the AR(p) model.
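A quick check of the Yule-Walker formulas for $\phi_{11}$ and $\phi_{22}$ against `statsmodels`' `pacf` on a simulated AR(2) series (coefficients are illustrative; the two estimates should agree approximately):

```python
import numpy as np
from statsmodels.tsa.stattools import acf, pacf
from statsmodels.tsa.arima_process import ArmaProcess

# simulate an AR(2) series with assumed coefficients 0.6 and 0.2
y = ArmaProcess(np.array([1, -0.6, -0.2]), np.array([1.0])).generate_sample(5000)

rho = acf(y, nlags=2)
phi_11 = rho[1]                                       # phi_11 = rho_1
phi_22 = (rho[2] - rho[1] ** 2) / (1 - rho[1] ** 2)   # phi_22 via Yule-Walker

print("manual:      ", round(phi_11, 3), round(phi_22, 3))
print("statsmodels: ", np.round(pacf(y, nlags=2)[1:], 3))
```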
- For an MA(1) process, PACF will not jump to zero... since $y_t$ will be correlated with all of its own lags. Why?
- $y_{t} =\epsilon_{t}+\beta \epsilon_{t-1}$ can be written as $\dfrac{y_{t}}{1+\beta L}=\epsilon_{t}$
- $y_{t} - \beta y_{t-1}+\beta^2 y_{t-2} - \beta^3y_{t-3}+\cdots = \epsilon_{t}$
- Instead, it shows a geometric decay (whose form depends on the sign of $\beta$: $\beta \lt 0$ gives direct decay, $\beta \gt 0$ gives an oscillating decay)
- A negative $\beta$ makes all the signs in the above equation positive.
- Positive $\beta$ creates an alternating series.
For a general ARMA(p,q) process
- The ACF will begin to decay after lag $q$. From then onwards the $\rho_i$ satisfy the difference equation of the AR(p) part...
- The PACF will begin to decay after lag $p$. From then onwards the coefficients $\phi_{ss}$ will mimic the ACF coefficients $\rho_s$ from the model $\dfrac{y_{t}}{1+\beta_{1}L+\beta_{2}L^2+\dots+\beta_{q}L^q}$
Visual Identification¶
- (a) Mean level is increasing, so mean is time-dependent, $E(y_{t}) \propto t$ $\implies$ NON-STATIONARY
- (b) Mean is time-independent but variance is not $Var(y_{t}) \propto \dfrac{1}{t}$ $\implies$ NON-STATIONARY
- (c) Covariance can be interpreted as the horizontal width between the crests/troughs. Here, we observe that the covariance varies with time $\implies$ NON-STATIONARY
- (d-f) are all STATIONARY. Notably,
- (d) Mean is constant
- (e) Variance is constant
- (f) Covariance is constant
Data Preparation¶
How to prepare the observed past for forecasting?
- Collect data based on objective
- Even if the available data is large, one should prioritize data appropriate to the system under study (and thus not choose it right away).
- Check for any kind of discrepancies in the data (e.g. outliers, which can affect the mean adversely = Black Swan events)
- For such discrepancies, scrutinize the data, pinpoint the problem and take appropriate steps.
- General behavior of the whole model is to be understood by a mathematical formula
- AR - Yule (1926-27)
- MA - Slutsky (1938)
- ARMA - Wold (last)
- Modern ARMA
- Box Jenkins (1976) model
Box-Jenkins ARIMA Model¶
A rigorous time series model given by Box and Jenkins. Though the ARMA model was originally due to Wold, its mathematical treatment was not complete (it had to evolve).
- This is the same ARMA model, but made mathematically rigorous.
- Box and Jenkins wanted a "stationary" model.
So, the improvement over this problem is the Autoregressive Integrated Moving Average $ARIMA(p,d,q)$ process,
- where $d$ represents the number of times we have to difference the data in order to transform the non-stationary series into a stationary one.
Examples¶
- ARIMA(2,0,1) is stationary at levels $I(0)$
- ARIMA(1,1,1) is stationary at the first order or Integrated at first order $I(1)$
- ARIMA(3,1,0) is a pure AR(3) process with first differencing.
Box-Jenkins Model Selection¶
Three-stage method for selecting an appropriate model for estimating and forecasting a univariate time series.
- Identification
- Visually examine the time plot: outliers, missing values, structural breaks
- Look for trends or meandering without a constant long-run mean or variance
- Look at the correlograms i.e., ACF and PACF: to suggest plausible models by comparing them with theoretical ACF and PACF plots
- Perform Unit-root tests (DF, ADF, PP, KPSS)
- Estimation
- Fit each of the tentative models, examine the $\alpha_i$ and $\beta_i$ coefficients
- Select a stationary and parsimonious model that has a good fit.
- Diagnostic Checking
- Ensure that the residuals from the estimated model mimic white-noise.
Diagnostic Checks¶
- $R^2$ and $\bar{R}^{2}$ are traditionally used in regression for diagnostic checking.
- But here we use the Root Mean Squared Error (RMSE), the square root of the Mean Squared Error (MSE), which is better suited to forecasting than $R^{2}$ and $\bar{R}^{2}$
- We split the dataset into two parts
- 2/3 of the data would be used for training
- We will perform in-sample forecast
- We will reserve the remaining 1/3 for out-of-sample forecast to check how our model does on an unseen piece of data
- if RMSE is good, our forecasting model is good.
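A minimal sketch of this train/hold-out RMSE check using `statsmodels`' ARIMA implementation, on an assumed ARMA(1,1) series:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.arima_process import ArmaProcess

# assumed data: a stationary ARMA(1,1) series
y = ArmaProcess(np.array([1, -0.6]), np.array([1, 0.3])).generate_sample(300)

split = int(len(y) * 2 / 3)               # ~2/3 for training, ~1/3 held out
train, test = y[:split], y[split:]

res = ARIMA(train, order=(1, 0, 1)).fit()

in_sample = res.predict(start=0, end=split - 1)    # in-sample (fitted) forecasts
out_sample = res.forecast(steps=len(test))         # out-of-sample forecasts

rmse_in = np.sqrt(np.mean((train - in_sample) ** 2))
rmse_out = np.sqrt(np.mean((test - out_sample) ** 2))
print(f"RMSE in-sample: {rmse_in:.3f}  out-of-sample: {rmse_out:.3f}")
```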
Parsimony¶
- Additional coefficients increase fit ($R^2$ increases)
- But reduces degrees of freedom $d.f.$
Box and Jenkins argued: $\text{parsimonious} \gg\text{overparameterized}$ model
- Aim to approximate the true data-generating process but not to pin down the exact process.
By cancelling common factors in the lag polynomials (which is not so easy to do in practice) we can obtain a more parsimonious form, with the needless (unnecessary) parameters eliminated.
Unit Root Process¶
For an AR(1) process,
$$ a_{t} = \phi a_{t-1} + \epsilon_{t} $$ If we find the closed form solutions to this recursive relation, we get
$$ \begin{align*} a_{t} &= \phi(\phi a_{t-2} + \epsilon_{t-1}) + \epsilon_{t}\\ &= \phi^2 a_{t-2} + \phi\epsilon_{t-1} + \epsilon_{t} \\ &= \cdots \\ &= \phi^t a_{0} + \sum_{k=1}^{t} \phi^{t-k} \epsilon_{k} \end{align*} $$
Thus, the expected value would be:
$$ E(a_{t}) = \phi E(a_{t-1}) = \dots = \phi^t a_{0} $$
And the variance would be:
$$ Var(a_{t}) = \sigma^2[\phi^0 + \phi^2 + \dots + \phi^{2(t-1)}] = \dfrac{1-\phi^{2t}}{1-\phi^2} \sigma^2 $$
Now, what are the possible values of $\phi$?
- CASE 1 $|\phi| \lt 1$
- $\lim_{ t \to \infty }E(a_{t}) = 0$
- $\lim_{ t \to \infty }Var(a_{t}) = \dfrac{1-0}{1-\phi^2}\sigma^2 = \dfrac{\sigma^2}{1-\phi^2}$, which is time-independent.
- Thus, the series is stationary
- CASE 2 $|\phi| \gt 1$
- The series explodes because $\lim_{ t \to \infty }E(a_{t}) = \pm \infty$.
- Thus, non-stationary
- CASE 3 $|\phi|= 1$ or $\phi = \pm 1$
- This is the problematic case, because at first glance the time series may appear stationary (we don't know for sure).
- $E(a_{t}) = a_{0}$ (a constant mean), which doesn't violate stationarity.
- But, $Var(a_{t}) = t\sigma^2$, the variance is time dependent and gets bigger
- Thus, non-stationary.
General solution of a linear stochastic model:
$$ y_{t} = \text{trend} + \text{stationary component} + \text{noise} $$
- We need to remove the unit root from our time series, so that we can fit it to one of our known models (ARIMA)
Random walk¶
$$ y_{t} = y_{t-1} + \epsilon_{t}\quad (\text{or } \Delta y_{t} = \epsilon_{t}) $$
- $y_{t} = y_{0} +\sum_{i=1}^t \epsilon_{i}$
- $E_{t}y_{t+1}=E_{t}(y_{t}+\epsilon_{t+1}) = y_{t}$
- $E_{t}y_{t+s} = y_{t}+E_{t}\sum_{i=1}^s \epsilon_{t+i} = y_{t}$
- $var(y_{t}) = var(\epsilon_{t} + \epsilon_{t-1} + \dots + \epsilon_{1}) = t\sigma^2$
Random Walk Plus Drift¶
Note that $a_0$ denotes a constant trend
$$ y_{t} = y_{t-1} + a_{0} + \epsilon_{t} $$
General solution
$$ y_{t} = y_{0} + a_{0}t + \sum_{i=1}^t\epsilon_{i} $$
Augmented Dickey-Fuller (ADF) tests¶
Note that $a_0$ is a constant trend, while $a_2t$ represents a time trend
DF Test¶
Three regressions that can be used for testing for presence of a unit root:
- Pure random walk model: $\Delta y_{t} = \gamma y_{t-1} + \epsilon_{t}$
- Add an intercept (drift) term: $\Delta y_{t} = a_{0} + \gamma y_{t-1}+\epsilon_{t}$
- Add drift and linear time trend: $\Delta y_{t} = a_{0} + \gamma y_{t-1} + a_{2}t + \epsilon_{t}$
$$ H_{0}: \gamma = 0 \quad (\text{Sequence contains a unit root.}) $$
Test Statistic¶
$$ (A)DF = \dfrac{\hat{\gamma} - 0}{SE(\hat{\gamma})} $$
is a $t\text{-statistic}$.
Steps¶
- Fit one or more of the equations to obtain the estimated value of $\gamma$ and associated error.
- Compare the resulting $t\text{-statistic}$ with the appropriate value reported in the Dickey-Fuller tables.
- For $\hat{\gamma}=-0.0454$ with $se(\hat{\gamma}) = 0.030$, $t = \dfrac{-0.0454}{0.030}=-1.5133$
- The critical value depends on the presence of an intercept and/or time trend.
- Without intercept and trend terms ($a_{0}=a_{2}=0$), use section $\tau$
- Including an intercept but not a trend term (only $a_{2}=0$), use section $\tau_{\mu}$
- With both intercept and trend term, use section $\tau_{\tau}$
ADF Test¶
This is an extension to DF test that works for higher-order autoregressive processes to handle serial correlations in the residuals.
- The augmented version of DF test includes lagged changes.
The equations are replaced by
- $\Delta y_{t}= \gamma y_{t-1} + \sum_{i=2}^p \beta_{i}\Delta y_{t-i+1} + \epsilon_{t}$
- $\Delta y_{t}= a_{0} + \gamma y_{t-1} + \sum_{i=2}^p \beta_{i}\Delta y_{t-i+1} + \epsilon_{t}$
- $\Delta y_{t}= a_{0} + \gamma y_{t-1} + a_{2}t + \sum_{i=2}^p \beta_{i}\Delta y_{t-i+1} + \epsilon_{t}$
For these autoregressive processes,
- The same $\tau$, $\tau_{\mu}$ and $\tau_{\tau}$ statistics are used to test the $H_{0}: \gamma =0$
- Dickey and Fuller provided three additional $F\text{-statistics}$ ($\phi_{1}, \phi_{2}$ and $\phi_{3}$) to test joint hypotheses on the coefficients
- $\phi_{1}\to H_{0}: a_{0}=\gamma=0$, using equation (2)
- $\phi_{2}\to H_{0}: a_{0}=\gamma=a_{2}=0$, using equation (3)
- $\phi_{3} \to H_{0}: \gamma = a_{2}=0$, using equation (3)
$$ \phi_{i} = \dfrac{[SSR(\text{restricted})- SSR(\text{unrestricted})]/r}{SSR(\text{unrestricted})/(T-k)} $$
where,
- $r:$ number of restrictions
- $T:$ number of usable observations
- $k:$ number of parameters estimated
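In practice the ADF regression, lag selection and t-statistic are wrapped in `statsmodels.tsa.stattools.adfuller`; the sketch below runs it on a simulated random walk and on white noise (the `regression` argument picks between the three specifications above):

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(11)
random_walk = np.cumsum(rng.normal(size=500))            # unit-root series
white_noise = rng.normal(size=500)                       # stationary series

for name, series in [("random walk", random_walk), ("white noise", white_noise)]:
    # regression="c" includes an intercept; "ct" adds a trend; "n" has neither
    stat, pvalue, usedlag, nobs, crit, _ = adfuller(series, regression="c", autolag="AIC")
    print(f"{name}: ADF stat = {stat:.2f}, p-value = {pvalue:.3f}, lags used = {usedlag}")
```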
Phillip Perron (PP) test¶
Source: https://faculty.washington.edu/ezivot/econ584/notes/unitroot.pdf
$D_{t}$ is a vector of deterministic terms (constant and time trend etc)
- The PP unit root test differs from the ADF test mainly in how it deals with serial correlation and heteroskedasticity in the errors.
- ADF uses a parametric autoregression to approximate the ARMA structure of the errors in the test regression.
- PP ignores any serial correlation in the test regression, and uses
$$ \Delta y_{t} = \beta' D_{t} + \gamma y_{t-1} + u_{t} $$
where $u_t$ is $I(0)$ and may be heteroskedastic.
- PP test corrects for any serial correlation and heteroskedasticity in the errors $u_{t}$ by directly modifying the test statistic $t_{\gamma=0}$ and $T \hat{\gamma}$ (which is the normalized coefficient statistic[^1])
$$ Z = \tau \cdot\left(\dfrac{\hat{\sigma}^{2}}{\hat{\lambda}^{2}}\right)^{1/2} - \dfrac{1}{2} (\hat{\lambda}^{2}-\hat{\sigma}^{2}) \cdot \dfrac{T \cdot SE(\hat{\gamma})}{\hat{\lambda}^{2}} $$
where,
- $\tau:$ DF test statistic
- $\hat{\sigma}^{2}:$ variance of residuals
- $\hat{\lambda}^{2}:$ consistent estimate of long-run variance
- $T:$ sample size
Advantage¶
- PP tests are robust to general forms of heteroskedasticity in $u_{t}$
- User doesn't have to specify a lag length for the test regression.
[^1]: To match the convergence rate, resulting in a stable (non-normal) Dickey-Fuller distribution in the unit root case. (Read on Standard Error and Convergence)
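A hedged sketch of running the PP test, assuming the third-party `arch` package (its `PhillipsPerron` class) is installed; `statsmodels` itself does not provide a PP test:

```python
import numpy as np
from arch.unitroot import PhillipsPerron   # third-party `arch` package (assumed installed)

rng = np.random.default_rng(5)
y = np.cumsum(rng.normal(size=500))        # random walk: H0 (unit root) should not be rejected

pp = PhillipsPerron(y, trend="c")          # "c" = constant only; "ct" adds a time trend
print(pp.summary())                        # test statistic, p-value, critical values
```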
Kwiatkowski-Phillips-Schmidt-Shin (KPSS) Test¶
Tests whether a time series is stationary around the mean or a linear trend (trend stationary or level stationary)
$$ KPSS = \dfrac{1}{T^{2}}\sum_{t=1}^{T} \dfrac{S_{t}^{2}}{\hat{\sigma}^{2}} $$
- $S_{t}= \sum_{i=1}^t \hat{e}_{i}$ is the cumulative sum of residuals from a regression of $y_{t}$ on a constant (or on a constant and trend).
- $\hat{\sigma}^2$ is a consistent estimate of the long-run variance of the residuals.
- $T$ is the sample size
Then compare the test statistic to the KPSS-specific critical values.
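The KPSS test is available as `statsmodels.tsa.stattools.kpss`; note that, unlike DF/ADF/PP, its null hypothesis is stationarity, so a small p-value is evidence against stationarity. A minimal sketch:

```python
import numpy as np
from statsmodels.tsa.stattools import kpss

rng = np.random.default_rng(8)
white_noise = rng.normal(size=500)                 # stationary
random_walk = np.cumsum(rng.normal(size=500))      # non-stationary

for name, series in [("white noise", white_noise), ("random walk", random_walk)]:
    # regression="c" tests level stationarity; "ct" tests trend stationarity
    stat, pvalue, lags, crit = kpss(series, regression="c", nlags="auto")
    print(f"{name}: KPSS stat = {stat:.3f}, p-value = {pvalue:.3f}")
```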
Selection of Lag-length¶
The selection of lag length is another problem, as it needs to be specified for even performing the parametric tests like ADF.
We can use information criteria for this
- Akaike Information Criteria (AIC)
- AICC
- SBC or SC or SBIC
- Hannan-Quinn (HQ)
- Breusch-Pagan (Autocorrelation) (BP)
- Ljung-Box ($\theta$)
- $R^2$ or $\bar{R^2}$ (adjusted)
Akaike Information Criteria (AIC)¶
It is used to select the optimal lag length in time series models
$$ AIC = -2 \ln(L) + 2k $$
- $\ln(L)$ is the log-likelihood of the model
- $k$ is the number of estimated parameters
The model with the lowest AIC is preferred.
Schwartz Bayesian Criterion (SBC) a.k.a Bayesian Info. Criterion (BIC)¶
Fun fact: The name "Schwarz" comes from the statistician Gideon Schwarz, who developed the criterion in 1978. Some authors and software packages (like SAS) use "SBC" to give credit to its originator, while "BIC" is the more common name used in most modern textbooks and software (like R and Python).
$$ BIC = -2 \ln(L) + k\ln(T) $$
- $\ln(L)$ is the log-likelihood of the model
- $k$ is the number of estimated parameters
- $T$ is the sample size
The model with the lowest BIC is preferred.
AIC with Correction (AICC)¶
$$ AICC = AIC + \dfrac{2k(k+1)}{T-k-1} $$
The added term corrects for small sample bias, increasing the penalty for complex models when the $T$ is small.
This will converge to AIC results as the sample size increases,
$$ \lim_{ T \to \infty } AICC = AIC $$
The model with the lowest AICC is preferred.
Hannan-Quinn Info. Criterion (HQ)¶
$$ HQ = -2\ln(L) + 2k\ln(\ln(T)) $$
Practical advice¶
Most economic time series will be stationary by $I(2)$, i.e. after at most two differences (this is not necessarily true of financial time series).
| MA╲AR | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | AIC | AIC | AIC | AIC | AIC |
| 1 | AIC | AIC (for each combination) | AIC | AIC | AIC |
| 2 | AIC | AIC | AIC | AIC | AIC |
| $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ |
- The cell where AIC is minimum corresponds to the lag-length that should be considered.
- If two cells have the same AIC, choose the simpler process. E.g. ARMA(2,0) should be selected over ARMA(1,2)
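A sketch of filling such a grid programmatically: fit ARMA(p, q) models over a small range of orders and pick the one with the lowest AIC (the data here are simulated from an assumed ARMA(2,1) process):

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.arima_process import ArmaProcess

# assumed data: simulated from an ARMA(2,1) process
y = ArmaProcess(np.array([1, -0.5, -0.3]), np.array([1, 0.4])).generate_sample(500)

results = {}
for p in range(4):                          # AR orders 0..3
    for q in range(4):                      # MA orders 0..3
        try:
            fit = ARIMA(y, order=(p, 0, q)).fit()
            results[(p, q)] = fit.aic       # could also compare fit.bic or fit.hqic
        except Exception:
            continue                        # skip orders that fail to estimate

best = min(results, key=results.get)
print("lowest-AIC (p, q):", best, "AIC:", round(results[best], 1))
```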
Structural Changes¶
- When there are structural breaks, DF test statistics are biased towards the nonrejection of the unit root (i.e. the test isn't powerful)
If there is a structural change in the data from the 50th observation (where the mean shifts from 0 to 6, after the 50th observation), we can model that sequence with
$$ y_t = 0.5 y_{t-1} + \epsilon_t +D_L $$
But the Dickey-Fuller test will assume this OLS equation:
$$ y_{t} = a_{0} + a_{2}t + e_{t} $$
which is the same line in the figure above, with a negative intercept $a_0$ and a positive slope $a_2$ and is the line of best-fit.
The proper way to estimate the equation, however, is to fit a simple AR(1) model and allow the intercept to vary by including a dummy variable $D_L$. But suppose we still decide to fit the regression equation.
$$ y_t = a_0 + a_1 y_{t-1} + e_t $$
We find that $a_1$ is biased towards unity (because $\hat{a}_1$ tries to capture the fact that low values of $y_t$ (near 0) are followed by high values (near 6)). Thus, the DF test accepts $H_0$ of a unit root, even though the series is stationary within each subperiod.
But a unit-root process can also exhibit a structural change. This example was constructed with
- $y_0 = 2$
- The same $\epsilon_t$ sequence
- $D_P(51) = 4$ else $D_P = 0$
- where $P$ refers to the fact that there is a single pulse (a jump) in the dummy at $t=51$
- which is equivalent to an $\epsilon_{51}$ shock of four extra units, which the model captures.
- This one-time shock to $D_P(51)$ has a permanent effect on the mean of the sequence.
Perron's Test for Structural Change¶
Background¶
- One approach is to split the sample into two parts and use the DF test on each part.
- Problem 1: Degrees of freedom for each resulting regression is diminished.
- Problem 2: We may not know exactly when the structural break appears.
So, prefer to have a single test on the full sample
Formulation¶
$$ \begin{align} H_{0} & : y_t = a_{0}+y_{t-1}+\mu_{1}D_{P} + \epsilon_{t} \\ A_{1} & : y_t = a_{0}+a_{2}t+\mu_{2}D_{L} + \epsilon_{t} \end{align} $$
- In words:
- One-time jump in the level of a unit-root process against
- One-time change in the intercept of a trend stationary process.
- where
- $D_{P}$ = pulse dummy variable ($D_{P}=1$ if $t=\tau+1$ else $D_{P}=0$)
- $D_{L}$ = level dummy variable ($D_{L}=1$ if $t \gt \tau$ else $D_{L}=0$)
- Observation
- Up to $t=\tau$, $\{ y_{t} \}$ is stationary around $a_{0} +a_{2}t$
- Thereafter, $\{ y_{t} \}$ is stationary around $a_{0}+a_{2}t+\mu_{2}$
- One time change in the intercept of the trend if $\mu_{2}\gt 0$
Steps¶
- Combine the null and alternative hypothesis
- $y_{t} = a_{0}+a_{1}y_{t-1} + a_{2}t+\mu_{1}D_{P}+\mu_{2}D_{L} + \epsilon_{t}$
- Estimate the regression,
- Under $H_{0}$, $a_{1}=1$
- When residuals are i.i.d., the distribution of $a_{1}$ depends on $\lambda = \dfrac{\tau}{T}$ where $T$ is the total number of observations, and $\tau$ is number of observations prior to the break.
- Perform diagnostic checks to determine if residuals are serially correlated
- If so, use augmented form of the regression
- Calculate $t\text{-statistic}$ for the null $a_{1}=1$.
- Compare with critical values calculated by Perron
- Using values of $\lambda$ from 0 to 1 ($0.1,0.2\dots0.9,1$)
- No structural change unless $0 \lt \lambda \lt 1$
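A rough sketch of the combined Perron regression on simulated data with a known break point $\tau$ (the data-generating process, break size and sample size are all assumed for illustration); the resulting t-statistic would be compared with Perron's critical values for $\lambda = \tau/T$, not the standard Dickey-Fuller tables:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
T, tau = 100, 50                            # assumed sample size and (known) break date

t = np.arange(1, T + 1)
D_L = (t > tau).astype(float)               # level dummy: 1 after the break
D_P = (t == tau + 1).astype(float)          # pulse dummy: 1 only at t = tau + 1
# assumed DGP: trend-stationary series with a level shift of 6 after the break
y = 0.02 * t + 6 * D_L + rng.normal(size=T)

# combined regression: y_t = a0 + a1*y_{t-1} + a2*t + mu1*D_P + mu2*D_L + e_t
X = sm.add_constant(np.column_stack([y[:-1], t[1:], D_P[1:], D_L[1:]]))
res = sm.OLS(y[1:], X).fit()

a1_hat, se_a1 = res.params[1], res.bse[1]
t_stat = (a1_hat - 1) / se_a1               # compare with Perron's critical value for lambda = tau/T
print(f"a1_hat = {a1_hat:.3f},  t-statistic for H0: a1 = 1 -> {t_stat:.2f}")
```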