14 Dynamic Regression
Thus far, we have discussed how to model an individual time series by itself. In practice, however, we usually observe several concurrent time series that may be related to each other. For example, economists interested in understanding the economy of a given country usually have access to multiple economic indicators, such as GDP per capita, the unemployment rate, and household wealth (see, for instance, the datasets global_economy and hh_budget). This means that other concurrent time series may be useful in helping to forecast a time series of interest. The precise causal relationships between different time-varying quantities may also be of interest: companies want to understand how certain business decisions, such as changing advertising spend or product pricing, may affect future revenue. Prediction and causal analysis have long been studied by applying linear regression to sample data from some population. In this chapter, we will learn how to extend linear regression to a time series context.
14.1 Linear regression
14.1.1 Linear regression recap
We begin with a short overview of linear regression. These days, students are likely to first encounter linear regression as a prediction model applied to a given dataset. That is, given a response vector \(\bY = (Y_1,Y_2,\ldots,Y_n)^T\) and a covariate matrix \(\bX \in \R^{n \times p}\), the linear regression model is of the form \(\hat f(\bx) = \hat\bbeta^T\bx\), where \[ \hat\bbeta = \arg\min_{\bbeta}\left\|\bY - \bX\bbeta \right\|^2 = (\bX^T\bX)^{-1}\bX^T\bY. \tag{14.1}\]
This prediction algorithm can be derived from a statistical model, which yields additional insight on estimation and predictive uncertainties. There are two flavors, depending on whether we treat the covariates as fixed (called a fixed design), or themselves as random variables (called a random design).1 Assuming a fixed design for now, the response vector is generated according to the formula \[ \bY = \bX\bbeta^* + \beps, \tag{14.2}\] where \(\bbeta^*\) is the true regression vector and \(\beps = (\varepsilon_1,\varepsilon_2,\ldots,\varepsilon_n)^T\) has entries drawn i.i.d. from \(\mathcal{N}(0,\sigma^2)\) for some noise variance \(\sigma^2\).
Note that the only randomness is in the noise vector \(\beps\). Under these assumptions, one can check that the estimator in Equation 14.1, called the ordinary least squares (OLS) estimator, is equivalent to the maximum likelihood estimate. In addition, it is an unbiased estimate of \(\bbeta^*\) and its deviation from the true regression vector satisfies \[ \hat\bbeta - \bbeta^* \sim \mathcal{N}(0, \sigma^2(\bX^T\bX)^{-1}). \] Furthermore, by the Gauss-Markov theorem, the OLS estimator is optimal, in the sense that it has the minimum variance among all linear unbiased estimators of \(\bbeta^*\).
In a random design, the rows of \(\bX\) are assumed to be drawn i.i.d. from a fixed distribution on \(\R^p\) with covariance matrix \(\bSigma\). Under this assumption, we additionally have the asymptotic result \[ \sqrt{n}(\hat\bbeta - \bbeta^*) \to_d \mathcal{N}(0, \sigma^2\bSigma^{-1}). \]
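As a concrete illustration (a sketch not drawn from the original text; the simulated design and coefficients below are arbitrary choices), we can verify numerically that the closed-form estimator in Equation 14.1 agrees with R's built-in lm():

# Simulate a small fixed-design regression problem, as in Equation 14.2
set.seed(2024)
n <- 100
X <- cbind(1, rnorm(n), rnorm(n))   # design matrix with an intercept column
beta_true <- c(0.5, 2, -1)
y <- drop(X %*% beta_true + rnorm(n, sd = 0.5))

# OLS via the closed-form solution (X^T X)^{-1} X^T y
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# The same fit via lm(); the two sets of coefficients agree up to numerical error
fit <- lm(y ~ X - 1)
cbind(closed_form = drop(beta_hat), lm = coef(fit))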
14.1.2 Interpretations of regression coefficients
A linear model is highly interpretable as a predictive (statistical) model. For each \(k=1,2,\ldots,p\), \(\beta_k^*\) is the slope of the prediction surface in the \(k\)-th coordinate direction. In other words, it is the amount by which our prediction increases if we increase the \(k\)-th covariate by 1 unit.
In addition, these coefficient values can sometimes have scientific (also called causal) interpretations. Indeed, one of the earliest uses of linear regression was in randomized experiments. Here, each row in the covariate matrix \(\bX\) corresponds to a particular experimental setting. Equation 14.2 then becomes a causal model, with \(\beta_k^*\) being the effect on the response of increasing the \(k\)-th covariate by 1 unit.
Of course, randomized experiments are not always possible, especially in the social sciences. Economists and sociologists, for instance, usually have to work with non-experimental data coming from census or administrative records. Even so, Equation 14.2 can sometimes be justified as a causal model using domain knowledge. The distinction between predictive and causal models is subtle, and will be discussed again later in this chapter.
14.1.3 Simple linear regression for time series
In the setting of multiple time series, we can treat the time series of interest as a response vector \(\bY\), and the remaining time series as columns in a covariate matrix \(\bX\). However, note that because the rows of \(\bX\) are measurements of the same time series at different time points, they are rarely i.i.d. As such, a random design assumption is unrealistic and we have to use a fixed design interpretation. Focusing on just a single predictor for now, we model \[ Y_t = \alpha + \beta X_t + W_t \tag{14.3}\] for \(t=1,2,\ldots,n\), where \((W_t)\) is a Gaussian white noise process, and \((X_t)\) is a fixed (non-random) time series. In the economics literature, \((X_t)\) is sometimes called an exogenous time series, because its dynamics are not modeled.
Let us apply this model to the dataset us_change, which comprises percentage changes in quarterly personal consumption expenditure, personal disposable income, production, savings, and the unemployment rate for the US from 1970 Q1 to 2019 Q2. We begin by plotting the consumption and income series over time, followed by a scatter plot of one against the other.

us_change |>
  pivot_longer(c(Consumption, Income), names_to = "Series") |>
  autoplot(value) +
  labs(y = "% change")

us_change |>
  ggplot(aes(x = Income, y = Consumption)) +
  labs(y = "Consumption (quarterly % change)",
       x = "Income (quarterly % change)") +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
From the time plot and scatter plot shown above, we see that change in income is predictive of change in consumption. This leads us to fit the model in Equation 14.3 with these time series as \((X_t)\) and \((Y_t)\) respectively. We do so using the TSLM() function in the fable package.
lr_fit <- us_change |>
  model(TSLM(Consumption ~ Income))
lr_fit |> report()
Series: Consumption
Model: TSLM
Residuals:
Min 1Q Median 3Q Max
-2.58236 -0.27777 0.01862 0.32330 1.42229
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.54454 0.05403 10.079 < 2e-16 ***
Income 0.27183 0.04673 5.817 2.4e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5905 on 196 degrees of freedom
Multiple R-squared: 0.1472, Adjusted R-squared: 0.1429
F-statistic: 33.84 on 1 and 196 DF, p-value: 2.4022e-08
From the printout, we gather that \(\hat\beta \approx 0.27\) and \(\hat\alpha \approx 0.54\). This can be interpreted as saying that each additional 1% of income growth increases the predicted growth in consumption by 0.27 percentage points, and that when income does not change, the predicted growth in consumption is 0.54%.
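For example, in a quarter where income grows by 1%, the predicted growth in consumption is \(0.54 + 0.27 \times 1 \approx 0.82\)%.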
We now check the goodness of fit of the model via a residual analysis.
lr_fit |> gg_tsresiduals()
These plots provide evidence that the residuals of the model are not white noise, which violates the modeling assumption in Equation 14.3. This means that we should be suspicious of any conclusions obtained using this model, such as prediction intervals and coefficient interpretations.
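To quantify this impression (a sketch, not part of the original analysis), we can apply a Ljung-Box test to the residuals; a small p-value indicates significant residual autocorrelation.

# Ljung-Box test on the residuals (the number of lags, 8, is an arbitrary choice)
lr_fit |>
  augment() |>
  features(.innov, ljung_box, lag = 8)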
14.1.4 Multiple linear regression for time series
Given multiple fixed time series \((X_{t1}),(X_{t2}),\ldots,(X_{tp})\), we can define the multiple linear regression model \[ Y_t = \alpha + \beta_1 X_{t1} + \cdots + \beta_p X_{tp} + W_t \] for \(t = 1,2,\ldots,n\).
Since us_change contains other time series, we can try regressing the change in consumption on multiple predictors. We first make a pairs plot (pairwise scatter plot) to inspect the degree of pairwise marginal correlation between change in consumption and the predictors.
us_change |>
  GGally::ggpairs(columns = 2:6)
It seems that change in consumption is positively correlated with change in income and change in production, and negatively correlated with change in savings and change in unemployment. Let us now fit the multiple linear regression model.
mlr_fit <- us_change |>
  model(TSLM(Consumption ~ Income + Production + Savings + Unemployment))
mlr_fit |> report()
Series: Consumption
Model: TSLM
Residuals:
Min 1Q Median 3Q Max
-0.90555 -0.15821 -0.03608 0.13618 1.15471
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.253105 0.034470 7.343 5.71e-12 ***
Income 0.740583 0.040115 18.461 < 2e-16 ***
Production 0.047173 0.023142 2.038 0.0429 *
Savings -0.052890 0.002924 -18.088 < 2e-16 ***
Unemployment -0.174685 0.095511 -1.829 0.0689 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3102 on 193 degrees of freedom
Multiple R-squared: 0.7683, Adjusted R-squared: 0.7635
F-statistic: 160 on 4 and 193 DF, p-value: < 2.22e-16
The resulting model is \[ \begin{split} \Delta\text{Consumption}_t & = 0.74\Delta\text{Income}_t + 0.05\Delta\text{Production}_t - 0.05\Delta\text{Savings}_t \\ & \quad - 0.17\Delta\text{Unemployment}_t + 0.25 + W_t. \end{split} \tag{14.4}\]
Notice that the regression coefficient of change of income in Equation 14.4 is different from that of the same predictor in the simple linear regression earlier. Mathematically, this can be explained as the effect of correlations between the predictors. If we believe that Equation 14.4 is the true statistical model, then the difference between the two coefficients is called omitted variable bias. That is, by omitting the other predictors previously, we have biased our estimate of the effect of change of income on change in consumption.
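To see where this bias comes from in the simplest setting, suppose the true model has two fixed predictors, \[
Y_t = \alpha + \beta_1 X_{t1} + \beta_2 X_{t2} + W_t,
\] but we regress \(Y_t\) on \(X_{t1}\) alone. A standard calculation (sketched here for intuition) shows that the resulting OLS slope satisfies \[
\mathbb{E}[\hat\beta_1] = \beta_1 + \beta_2 \hat\delta,
\] where \(\hat\delta\) is the slope obtained by regressing \(X_{t2}\) on \(X_{t1}\). The bias term \(\beta_2\hat\delta\) vanishes only when the omitted predictor is (empirically) uncorrelated with the included one, which is why the coefficient on change in income moves once the other predictors enter the model.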
As usual, we check the goodness of fit of this model via a residual analysis.
mlr_fit |> gg_tsresiduals()
Compared to the simple linear regression model, it seems that the multiple linear regression model in Equation 14.4 has a better fit, which, at face value, suggests that we can trust its conclusions more. Nonetheless, one should still be wary. First, there is a risk that our model has overfit to the observed data and may not remain a good fit at future time points. This can be checked via some form of cross-validation, as sketched below. Second, even if the model does not overfit, the phenomenon being modeled may not be stationary, and there is no a priori guarantee that its future behavior will resemble the past. For instance, the relationships between the economic variables in us_change exist under the current economic framework of the country, and may change as its economy, politics, and technology evolve.
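As a rough illustration of such a check (a sketch, not part of the original analysis; the split date below is an arbitrary choice), we can hold out the last few years of us_change, refit both regressions on the earlier data, and compare their out-of-sample accuracy. The observed predictor values in the test set are supplied through new_data, so this evaluates the regressions themselves rather than a full forecasting pipeline.

# Hold out the final years as a test set (split date chosen arbitrarily)
train <- us_change |> filter_index(. ~ "2014 Q4")
test  <- us_change |> filter_index("2015 Q1" ~ .)

fits <- train |>
  model(slr = TSLM(Consumption ~ Income),
        mlr = TSLM(Consumption ~ Income + Production + Savings + Unemployment))

# Forecast over the test period using the observed predictor values,
# then compare out-of-sample accuracy of the two models
fits |>
  forecast(new_data = test) |>
  accuracy(us_change)

A full time series cross-validation would repeat this over a sequence of expanding training windows (e.g. with stretch_tsibble()), but the idea is the same.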
A further issue is that using the regression model to forecast the target time series requires knowledge of the future values of the predictors, which are usually unavailable. If we attempt to use the fitted model in mlr_fit to forecast, R will throw an error warning us that the required predictor values cannot be found.
mlr_fit |> forecast(h = 5)
Error in `mutate()`:
ℹ In argument: `TSLM(Consumption ~ Income + Production + Savings +
Unemployment) = (function (object, ...) ...`.
Caused by error in `value[[3L]]()`:
! object 'Income' not found
Unable to compute required variables from provided `new_data`.
Does your model require extra variables to produce forecasts?
To resolve this, these predictors can themselves be forecasted, or we may choose to use deterministic predictors instead.
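For instance, we can supply assumed future values for the predictors through the new_data argument of forecast(). A minimal sketch (the choice of holding each predictor at its historical mean for five quarters is ours, not from the original text):

# Assumed future predictor values: each series held at its historical mean
future_scenario <- new_data(us_change, 5) |>
  mutate(Income = mean(us_change$Income),
         Production = mean(us_change$Production),
         Savings = mean(us_change$Savings),
         Unemployment = mean(us_change$Unemployment))

mlr_fit |>
  forecast(new_data = future_scenario) |>
  autoplot(us_change)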
14.1.5 Useful predictors
We now discuss useful deterministic predictors that can be used in time series regression. These comprise simple functions in the time index \(t\) that describe commonly seen structure in real-world time series.
Trend: We have already described the linear trend model \(Y_t = \beta_0 + \beta_1 t + W_t\). This is a regression model, where \(t\) is used as the sole predictor. More generally, one can apply polynomial functions of \(t\) to model potentially nonlinear trends. For instance, we can fit the model \(Y_t = \beta_0 + \beta_1 t + \beta_2 t^2 + W_t\) via the code snippet TSLM(Y ~ trend() + I(trend()^2)), where the I() wrapper is needed because ^ (and **) have a special meaning inside R model formulas.
Seasonal dummy variables: If the time series is seasonal with period \(m\), we can model it as \[
Y_t = \beta_0 + \beta_1 d_{1,t} + \beta_2 d_{2,t} + \cdots + \beta_{m-1}d_{m-1,t} + W_t,
\] where \(d_{k, t} = 1\) if \((t - 1)~\text{mod}~m = k - 1\) (i.e. the season of \(t\) is \(k\)) and \(d_{k,t} = 0\) otherwise. The variables \(d_{1,t},\ldots,d_{m-1,t}\) are called seasonal dummies. In this parametrization, \(\beta_0\) is the mean value of \(Y_t\) during the \(m\)-th season, while \(\beta_0 + \beta_k\) is the mean value during the \(k\)-th season for \(k = 1,\ldots,m-1\). Note that a dummy for the last season is not necessary, because \(d_{m,t} = 1 - \sum_{k=1}^{m-1}d_{k,t}\). We can fit this model via the code snippet TSLM(Y ~ season()); a small example combining trend and seasonal terms is sketched below.
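For example (a sketch using the aus_production dataset from the tsibbledata package, which is not otherwise used in this chapter), we can combine a linear trend with quarterly seasonal dummies to model Australian beer production:

# Linear trend plus quarterly seasonal dummies
aus_production |>
  model(TSLM(Beer ~ trend() + season())) |>
  report()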
Outlier dummy variables: If certain observations are outliers, such as when there is a public holiday for a daily time series, they may affect the parameters of our fitted model in an undesirable way. To counteract this, we may use a dummy for that time point as a predictor. In fable, we can create a new column that is equal to 1 for that time point and equal to 0 otherwise, and add that as a predictor in TSLM(), as in the sketch below.
14.2 Dynamic regression
14.2.1 Definition
There are two obvious problems with the time series regression model Equation 14.4. First, it is not able to model lagged effects of the predictors on the response time series. We have already noted in Chapter 12 that lagged effects are a common feature of time series analysis and are a motivation for the class of MA(\(q\)) models. Second, it does not allow for the possibility that the regression residuals have nonzero autocorrelation. When there is indeed such autocorrelation, naively applying time series regression can lead to issues such as inefficient estimation of the regression coefficients and unreliable standard errors, which in turn invalidate the resulting prediction intervals and p-values.
These problems are addressed by performing dynamic regression. Given a single predictor time series \((X_t)\), this corresponds to the model \[ Y_t = \beta_0 + \beta_{1,0}X_t + \beta_{1,1}X_{t-1} + \cdots + \beta_{1,r}X_{t-r} + \eta_t, \] where \((\eta_t)\) is drawn from an ARIMA model.
Such a model can be fit by specifying \((X_t)\) and its lags as additional predictors in the ARIMA() function in fable. While the desired lags need to be explicitly specified, the precise ARIMA model for the errors can be automatically selected from the data, as with vanilla ARIMA modeling. We demonstrate this by fitting two dynamic regression models using the us_change dataset.
consumption_fit <- us_change |>
  model(dr1 = ARIMA(Consumption ~ Income),
        dr2 = ARIMA(Consumption ~ Income + lag(Income, 1)))
consumption_fit |> select(dr1) |> report()
Series: Consumption
Model: LM w/ ARIMA(1,0,2) errors
Coefficients:
ar1 ma1 ma2 Income intercept
0.7070 -0.6172 0.2066 0.1976 0.5949
s.e. 0.1068 0.1218 0.0741 0.0462 0.0850
sigma^2 estimated as 0.3113: log likelihood=-163.04
AIC=338.07 AICc=338.51 BIC=357.8
Above, we report the fitted parameters under the first model, dr1. We see that the fitted model can be written as \[
\begin{split}
\Delta\text{Consumption}_t & = 0.59 + 0.20\Delta\text{Income}_t + \eta_t, \\
\eta_t & = 0.71\eta_{t-1} + W_t - 0.62W_{t-1} + 0.21 W_{t-2}, \\
W_t & \sim N(0, 0.31).
\end{split}
\]
Compared to the simple linear regression model earlier, we see that we get different estimates for both the intercept and the regression coefficient of \(\Delta\text{Income}_t\). We now perform a residual analysis.
consumption_fit |> select(dr1) |> gg_tsresiduals()
This seems to suggest a better fit. However, there is still some skewness in the residuals, as well as some outlier values, such as in 1980 Q1.
As with vanilla linear regression, dynamic regression can be extended to involve multiple predictor time series.
14.2.2 Parameter estimation and model selection
The parameters of a dynamic regression model can be estimated using maximum likelihood as before. However, one needs to be careful about potential multicollinearity affecting the stability of this estimation process. Specifically, if any of the predictor time series is non-stationary, it can be highly correlated with the ARIMA error process, which leads to inefficiency and even unidentifiability when estimating the parameters. As such, it is good practice to take enough differences to ensure that all predictors (and the response) are stationary before applying dynamic regression.
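For example, we can check how many differences each series requires using the unit-root features from the feasts package (a sketch; an output of zero required differences for every series would be consistent with treating them all as stationary):

# Number of differences required for stationarity, per series
us_change |>
  pivot_longer(c(Consumption, Income, Production, Savings, Unemployment),
               names_to = "Series") |>
  features(value, unitroot_ndiffs)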
Likewise, model selection can be done by making use of AIC and AICc. Unlike with ARIMA models, however, the fable package does not automatically select from models with different choices of predictors, or the number of lags to include in the regression. One has to do this manually by simultaneously fitting multiple candidate models and comparing their reported AICc values.
Let us illustrate this again with the us_change dataset. In the following, we fit a few models to predict change in consumption and compare their relative AICc values. Note that all time series in the dataset already appear to be stationary, so no further differencing is required.
consumption_fit <- us_change |>
  model(dr1 = ARIMA(Consumption ~ Income + lag(Income, 1)),
        dr2 = ARIMA(Consumption ~ Income + lag(Income, 1) + Production +
                      lag(Production, 1)),
        dr3 = ARIMA(Consumption ~ Income + lag(Income, 1) + Production +
                      lag(Production, 1) + Unemployment + lag(Unemployment, 1)),
        dr4 = ARIMA(Consumption ~ Income + lag(Income, 1) + Production +
                      lag(Production, 1) + Unemployment + lag(Unemployment, 1) +
                      Savings + lag(Savings, 1)),
        dr5 = ARIMA(Consumption ~ Income + lag(Income, 1) + Production +
                      lag(Production, 1) + Unemployment + lag(Unemployment, 1) +
                      Savings + lag(Savings, 1) + lag(Income, 2) + lag(Production, 2) +
                      lag(Unemployment, 2) + lag(Savings, 2)))
consumption_fit |> glance()
# A tibble: 5 × 8
.model sigma2 log_lik AIC AICc BIC ar_roots ma_roots
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <list>
1 dr1 0.291 -154. 327. 328. 356. <cpl [11]> <cpl [0]>
2 dr2 0.241 -137. 287. 287. 307. <cpl [0]> <cpl [0]>
3 dr3 0.228 -131. 277. 278. 304. <cpl [0]> <cpl [0]>
4 dr4 0.0966 -43.4 117. 119. 166. <cpl [12]> <cpl [0]>
5 dr5 0.0915 -36.3 111. 115. 173. <cpl [12]> <cpl [0]>
The most complicated model, which includes the current value and the first two lags of all the possible predictor time series, has the smallest AICc value.
consumption_fit |> select(dr5) |> gg_tsresiduals()
A residual analysis seems to suggest that the model is a good fit.
14.2.3 Vector autoregression (VAR) models
Instead of following a two-stage process whereby we first forecast the predictor time series and then use their forecasted values to forecast the regression target, one may model all time series simultaneously via a vector autoregression (VAR) model. In this case, we concatenate all time series of interest into a vector \(\bX_t = (X_{1t}, X_{2t},\ldots,X_{kt})\). A vector autoregression model of order \(p\), denoted VAR(\(p\)), has the form
\[ \bX_t = \balpha + \bPhi_1 \bX_{t-1} + \bPhi_2 \bX_{t-2} + \cdots + \bPhi_p\bX_{t-p} + \bW_t, \] where \(\bPhi_1,\bPhi_2,\ldots,\bPhi_p\) are \(k \times k\) coefficient matrices, and \(\bW_t \sim N(0, \bSigma)\) for some covariance matrix \(\bSigma\), with \(\bW_s\) and \(\bW_t\) independent whenever \(s \neq t\).
Modeling with VAR helps to streamline forecasting because one can simply forecast recursively. VAR models are similar to AR(\(p\)) models in other ways, with most concepts for AR(\(p\)) models having natural analogues. For example, stationarity of a VAR model has to do with the unit roots of the VAR polynomial. Furthermore, VAR models can be extended to vector ARMA models to gain even more flexibility.
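For example, having estimated the coefficients of a VAR(\(p\)) model from data \(\bX_1,\ldots,\bX_n\), forecasts are generated recursively, with earlier forecasts substituted for unobserved lags: \[
\hat\bX_{n+1} = \hat\balpha + \hat\bPhi_1 \bX_n + \hat\bPhi_2 \bX_{n-1} + \cdots + \hat\bPhi_p\bX_{n-p+1},
\] \[
\hat\bX_{n+2} = \hat\balpha + \hat\bPhi_1 \hat\bX_{n+1} + \hat\bPhi_2 \bX_n + \cdots + \hat\bPhi_p\bX_{n-p+2},
\] and so on, exactly as for univariate AR(\(p\)) models.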
The downside of using a VAR model is that it often requires a large number of parameters. For example, if we had modeled each of the \(k\) time series in \((\bX_t)\) individually as an AR(\(p\)) model, we would need to estimate \(kp\) parameters (not including the constant coefficient and noise variances). Modeling them using a VAR(\(p\)) model requires \(k^2p\) parameters instead. Having too many parameters makes estimation difficult as it inflates the sampling variance of any reasonable estimator and makes it more likely for any fitted model to be overfit. Therefore, a common practice is to set some (or even most) entries of the coefficient matrices to be zero. Deciding which entries are allowed to be nonzero is part of the problem of model selection for VAR(\(p\)) models, and can be done by comparing AICc values as before.
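For instance, with \(k = 5\) series and \(p = 2\) lags, separate AR(2) models involve \(kp = 10\) autoregressive coefficients in total, whereas a VAR(2) model has \(k^2p = 50\), in addition to the entries of the noise covariance matrix \(\bSigma\).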
In the fable package, a VAR model can be fit using the VAR() function. We illustrate how to do this using the us_change dataset as before. In the following code snippet, we model the two time series measuring change in consumption and change in income jointly. Note how they are concatenated symbolically using the vars() function.
consumption_fit <- us_change |>
  model(VAR(vars(Consumption, Income)))
consumption_fit
# A mable: 1 x 1
`VAR(vars(Consumption, Income))`
<model>
1 <VAR(5) w/ mean>
Because we did not indicate which order \(p\) to use, it was automatically selected using AICc, and turned out to be \(p=5\). Displaying the fitted coefficients below, note that all coefficients of the VAR(\(5\)) model were estimated. The package does not appear to provide functionality for constraining certain coefficients to be zero.
consumption_fit |>
  tidy() |> select(-.model) |> print(n = 22)
# A tibble: 22 × 6
term .response estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 lag(Consumption,1) Consumption 0.186 0.0771 2.42 0.0166
2 lag(Income,1) Consumption 0.112 0.0535 2.09 0.0379
3 lag(Consumption,2) Consumption 0.166 0.0792 2.09 0.0378
4 lag(Income,2) Consumption -0.0117 0.0563 -0.207 0.836
5 lag(Consumption,3) Consumption 0.302 0.0789 3.82 0.000182
6 lag(Income,3) Consumption -0.0315 0.0551 -0.571 0.569
7 lag(Consumption,4) Consumption -0.0291 0.0818 -0.356 0.722
8 lag(Income,4) Consumption -0.0754 0.0552 -1.37 0.173
9 lag(Consumption,5) Consumption -0.0492 0.0781 -0.630 0.529
10 lag(Income,5) Consumption -0.0433 0.0529 -0.819 0.414
11 constant Consumption 0.350 0.0870 4.02 0.0000852
12 lag(Consumption,1) Income 0.402 0.109 3.70 0.000288
13 lag(Income,1) Income -0.299 0.0754 -3.97 0.000104
14 lag(Consumption,2) Income 0.102 0.112 0.917 0.361
15 lag(Income,2) Income -0.0449 0.0793 -0.566 0.572
16 lag(Consumption,3) Income 0.469 0.111 4.21 0.0000397
17 lag(Income,3) Income -0.148 0.0777 -1.91 0.0578
18 lag(Consumption,4) Income 0.259 0.115 2.24 0.0262
19 lag(Income,4) Income -0.252 0.0778 -3.24 0.00144
20 lag(Consumption,5) Income -0.126 0.110 -1.15 0.253
21 lag(Income,5) Income -0.197 0.0746 -2.64 0.00891
22 constant Income 0.575 0.123 4.69 0.00000543
To inspect the goodness of fit, we plot the ACF of the model residuals for both change in consumption and change in income. Since gg_tsresiduals() no longer works for a multivariate model, we use the following code snippet. It seems that the model is a relatively good fit, but one should also check the time plots and density plots of the residuals for further verification.
consumption_fit |>
  augment() |>
  ACF(.innov) |>
  autoplot()
Finally, we can use the fitted model to forecast.
consumption_fit |>
  forecast(h = 20) |>
  autoplot(us_change)
14.2.4 Dynamic harmonic regression
14.3 Causality
This terminology seems to have come from the literature on design of experiments.