## Wednesday, January 20, 2021

### Automatic Factor Selection: Working with FRED-MD Data

This is the first of two posts devoted to automatic factor selection and panel unit root tests with cross-sectional dependence. Both features were recently released with EViews 12. Here, we summarize and work with two seminal contributions to automatic factor selection by Bai and Ng (2002) and Ahn and Horenstein (2013).

1. Introduction
2. Overview of Automatic Factor Selection
3. Working with FRED-MD
4. Files
5. References

### Introduction

Recent trends in empirical economics (particularly in macroeconomics) indicate increased use of, and demand for, large dimensional datasets. Since the temporal dimension ($T$) is typically thought to be large anyway, the term large dimensional here refers to the number of variables ($N$), otherwise referred to as cross-sectional units. This is in contrast with traditional paradigms where the variables are few but the temporal dimension is long. This paradigm shift is markedly the result of theoretical advancements in dimension-aware techniques such as factor-augmented and panel models.

At the heart of all dimension-aware methods is factor selection, or the correct specification (estimation) of the number of factors. Traditionally, this parameter was often assumed. Recently, however, several contributions have offered data-driven (semi-)autonomous factor selection methods, most notably those of Bai and Ng (2002) and Ahn and Horenstein (2013).

These automatic factor selection techniques have come to play important roles in factor augmented (vector auto)regressions, panel unit root tests with cross-sectional dependence, and data manipulation. A particularly important example of the latter is FRED-MD, a regularly updated and freely distributed macroeconomic database designed for the empirical analysis of big data. What is notable here is that the dataset collects a vast number of important macroeconomic variables whose dimensionality is then optimally reduced using the Bai and Ng (2002) factor selection procedure.

In this post, we will demonstrate how to perform this dimensionality reduction using EViews' native Bai and Ng (2002) and Ahn and Horenstein (2013) factor selection procedures. The latter were introduced with the release of EViews 12. In particular, we will download the raw FRED-MD data, transform each series according to the FRED-MD instructions, and then proceed to perform dimensionality reduction. We will next estimate a traditional factor model with the optimally selected factors, and then proceed to forecast industrial production.

We pause briefly in the next section to provide a quick overview of the aforementioned factor selection procedures.

### Overview of Automatic Factor Selection

Recall that the maximum number of factors cannot exceed the number of observable variables; factor selection is therefore often used as a dimension reduction technique. In other words, the goal is to optimally select the smallest number of the most representative or principal variables in a set. Since dimensional principality (or importance) is typically quantified in terms of eigenvalues, virtually all dimension reduction techniques in this literature go through principal component analysis (PCA). For detailed theoretical and empirical discussions of PCA, please refer to our blog entries: Principal Component Analysis: Part I (Theory) and Principal Component Analysis: Part II (Practice).

Although PCA can identify which dimensions are most principal in a set, it is not designed to offer guidance on how many dimensions to retain. As a result, traditionally, this parameter was often assumed rather than driven by the data. To address this inadequacy, Bai and Ng (2002) proposed to cast the problem of factor selection as a model selection problem whereas Ahn and Horenstein (2013) achieve automatic factor selection by maximizing over ratios of two adjacent eigenvalues. In either case, optimal factor selection is data driven.

#### Bai and Ng (2002)

Bai and Ng (2002) handle the problem of optimal factor selection as the more familiar model selection problem. In particular, criteria are judged as a tradeoff between goodness of fit and parsimony. To formalize matters, consider the traditional factor augmented model: $$Y_{i,t} = \mathbf{\lambda}_{i}^{\top} \mathbf{F}_{t} + e_{i,t}$$ where $\mathbf{F}_{t}$ is a vector of $r$ common factors, $\mathbf{\lambda}_{i}$ denotes a vector of factor loadings, and $e_{i,t}$ is the idiosyncratic component, which is cross-sectionally independent provided $\mathbf{F}_{t}$ accounts for all inter-cross-sectional correlations. When the $e_{i,t}$ are not cross-sectionally independent, the factor model governing $e_{i,t}$ is said to be approximate.

The objective here is to identify the optimal number of factors. In particular, $\mathbf{\lambda}_{i}$ and $\mathbf{F}_{t}$ are estimated through the optimization problem: \begin{align} \min_{\mathbf{\Lambda}, \mathbf{F}}\frac{1}{NT} \sum_{i=1}^{N}\sum_{t=1}^{T}\left( Y_{i,t} - \mathbf{\lambda}_{i}^{\top}\mathbf{F}_{t} \right)^{2} \label{eq1} \end{align} subject to the normalization $\frac{1}{T}\mathbf{F}^{\top}\mathbf{F} = \mathbf{I}$ where $\mathbf{I}$ is the identity matrix.

Traditionally, the estimated factors $\widehat{\mathbf{F}}_{t}$ are proportional to the $T \times \min(N,T)$ matrix of eigenvectors associated with all eigenvalues of the $T\times T$ matrix $\mathbf{Y}\mathbf{Y}^{\top}$. This generates the full set of $\min(N,T)$ factors. The objective then is to choose $r < \min(N,T)$ factors that best capture the variation in $\mathbf{Y}$.
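This construction is easy to illustrate numerically. The NumPy sketch below (the data matrix and dimensions are arbitrary stand-ins of our own, not taken from the post) extracts the full factor set from the eigenvectors of $\mathbf{Y}\mathbf{Y}^{\top}$ and verifies the normalization $\frac{1}{T}\mathbf{F}^{\top}\mathbf{F} = \mathbf{I}$:

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 200, 50
Y = rng.standard_normal((T, N))  # stand-in for the (demeaned) data matrix

# eigen-decompose the symmetric T x T matrix Y Y', eigenvalues descending
eigval, eigvec = np.linalg.eigh(Y @ Y.T)
order = np.argsort(eigval)[::-1]
eigval, eigvec = eigval[order], eigvec[:, order]

# the full set of min(N, T) factors, scaled so that F'F / T = I
F = np.sqrt(T) * eigvec[:, : min(N, T)]

print(np.allclose(F.T @ F / T, np.eye(min(N, T))))  # True
```

Because the eigenvectors returned by `eigh` are orthonormal, scaling them by $\sqrt{T}$ delivers the required normalization exactly.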

Since the minimization problem in \eqref{eq1} is linear, once the factor matrix is estimated (observed), estimation of the factor loadings reduces to an ordinary least squares problem for a given set of regressors (factors). In particular, let $\widehat{\mathbf{F}}^{r}$ denote the factors associated with the $r$ largest eigenvalues of $\mathbf{Y}\mathbf{Y}^{\top}$, and let $\mathbf{\lambda}_{i}^{r}$ denote the associated factor loadings. Then, the problem of estimating $\mathbf{\lambda}_{i}^{r}$ is cast as: $$V\left( r, \widehat{\mathbf{F}}^{r} \right) = \min_{\mathbf{\Lambda}}\frac{1}{NT} \sum_{i=1}^{N}\sum_{t=1}^{T}\left( Y_{i,t} - \mathbf{\lambda}_{i}^{r\top}\widehat{\mathbf{F}}_{t}^{r} \right)^{2}$$ Since a model with $r+1$ factors can fit no worse than a model with $r$ factors, although efficiency is a decreasing function of the number of regressors, the problem of optimally selecting $r$ becomes a classical problem of model selection. Furthermore, observe that $V\left( r, \widehat{\mathbf{F}}^{r} \right)$ is the (scaled) sum of squared residuals from a regression of $\mathbf{Y}_{i}$ on the $r$ factors, for all $i$. Thus, to determine $r$ optimally, one can use a loss function $L_{r}$ of the form $$L_{r} = V\left( r, \widehat{\mathbf{F}}^{r} \right) + r\,g(N,T)$$ where $g(N,T)$ is a penalty for overfitting. Bai and Ng (2002) propose 6 such loss functions that yield consistent estimates, labeled PCP 1 through 3 and ICP 1 through 3. The optimal number of factors then derives as the minimizer of the penalized criterion across $r \leq r_{\text{max}} < \min(N,T)$, where $r_{\text{max}}$ is some known number of maximum factors under consideration. In other words: $$r^{\star} \equiv \operatorname*{argmin}_{1 \leq r \leq r_{\text{max}}} \left\{ V\left( r, \widehat{\mathbf{F}}^{r} \right) + r\,g(N,T) \right\}$$ Note that since $r_{\text{max}}$ must be specified a priori, its choice will play a role in the optimization.
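To make the selection rule concrete, here is a minimal NumPy sketch. The simulation design (dimensions, seed, noise scale) and the specific $IC_{p2}$-style penalty $g(N,T) = \frac{N+T}{NT}\log\min(N,T)$ are our own illustrative assumptions, not taken from the post; the point is only to show the mechanics of minimizing a penalized $V(r)$:

```python
import numpy as np

rng = np.random.default_rng(1)
T, N, r0 = 200, 100, 3
F0 = rng.standard_normal((T, r0))
L0 = rng.standard_normal((N, r0))
Y = F0 @ L0.T + 0.5 * rng.standard_normal((T, N))  # panel with 3 true factors

# principal-component factors from the T x T matrix Y Y'
eigval, eigvec = np.linalg.eigh(Y @ Y.T)
eigvec = eigvec[:, np.argsort(eigval)[::-1]]

def V(r):
    """Sum-of-squared-residuals objective V(r, F_hat^r), scaled by 1/(NT)."""
    F = np.sqrt(T) * eigvec[:, :r]   # first r factors, normalized F'F/T = I
    Lam = Y.T @ F / T                # OLS loadings given the factors
    return np.sum((Y - F @ Lam.T) ** 2) / (N * T)

rmax = 8
# IC-style criterion: log(V) plus penalty r * g(N, T)
g = (N + T) / (N * T) * np.log(min(N, T))
ic = [np.log(V(r)) + r * g for r in range(1, rmax + 1)]
r_star = 1 + int(np.argmin(ic))
print(r_star)  # 3: the penalized criterion recovers the true factor count
```

Note that because $\frac{1}{T}\mathbf{F}^{\top}\mathbf{F} = \mathbf{I}$, the OLS loadings collapse to $\widehat{\mathbf{\Lambda}} = \mathbf{Y}^{\top}\widehat{\mathbf{F}}/T$, which is why no explicit regression is needed.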

#### Ahn and Horenstein (2013)

In contrast to Bai and Ng (2002), Ahn and Horenstein (2013) exploit the fact that the $r$ largest eigenvalues of some matrix grow unboundedly as the rank of said matrix increases, whereas the other eigenvalues remain bounded. The optimization strategy is then simply the maximum of the ratio of two adjacent eigenvalues. One of the advantages of this contribution is that it is far less sensitive to the choice of $r_{\text{max}}$ than Bai and Ng (2002). Furthermore, the procedure is significantly easier to compute, requiring only eigenvalues.

To further the discussion, let $\psi_{r}$ denote the $r^{\text{th}}$ largest eigenvalue of the positive semi-definite matrix $\mathbf{Q} \equiv \mathbf{Y}\mathbf{Y}^{\top}$ or $\mathbf{Q} \equiv \mathbf{Y}^{\top}\mathbf{Y}$. Furthermore, define: $$\tilde{\mu}_{NT,\, r} \equiv \frac{1}{NT}\psi_{r}$$ Ahn and Horenstein (2013) propose the following two estimators of the number of factors. For some $1 \leq r_{\text{max}} < \min(N,T)$, the optimal number of factors $r^{\star}$ is derived as:
• Eigenvalue Ratio (ER) $$r^{\star} \equiv \displaystyle \operatorname*{argmax}_{r \leq r_{\text{max}}} ER(r), \qquad ER(r) \equiv \frac{\tilde{\mu}_{NT,\, r}}{\tilde{\mu}_{NT,\, r + 1}}$$
• Growth Ratio (GR) $$r^{\star} \equiv \displaystyle \operatorname*{argmax}_{r \leq r_{\text{max}}} GR(r), \qquad GR(r) \equiv \frac{\log\left( 1 + \widehat{\mu}_{NT,\, r} \right)}{\log\left( 1 + \widehat{\mu}_{NT,\, r + 1} \right)}$$ where $$\widehat{\mu}_{NT,\, r} \equiv \frac{\tilde{\mu}_{NT,\, r}}{\displaystyle \sum_{k=r+1}^{\min(N,T)}\tilde{\mu}_{NT,\, k}}$$
Finally, we note that Ahn and Horenstein (2013) suggest demeaning the data in both the time dimension and the cross-section dimension. While not strictly necessary for consistency, this step is extremely useful in small samples.
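Both ratio estimators are easy to sketch in NumPy. The simulation below (our own illustrative design, with three true factors) applies the ER and GR rules to the scaled eigenvalues $\tilde{\mu}_{NT,\,r}$:

```python
import numpy as np

rng = np.random.default_rng(2)
T, N, r0 = 200, 100, 3
Y = rng.standard_normal((T, r0)) @ rng.standard_normal((r0, N)) \
    + 0.5 * rng.standard_normal((T, N))  # panel with 3 true factors

# eigenvalues of Y Y', scaled by 1/(NT) and sorted descending
mu = np.sort(np.linalg.eigvalsh(Y @ Y.T))[::-1] / (N * T)

rmax = 8
m = min(N, T)
# Eigenvalue Ratio: ER(r) = mu_r / mu_{r+1}
er = mu[:rmax] / mu[1 : rmax + 1]
# Growth Ratio: GR(r) uses mu_r scaled by the sum of the remaining eigenvalues
tail = np.array([mu[r:m].sum() for r in range(1, rmax + 2)])  # sum_{k=r+1..m}
gr = np.log1p(mu[:rmax] / tail[:rmax]) / np.log1p(mu[1 : rmax + 1] / tail[1 : rmax + 1])

r_er = 1 + int(np.argmax(er))
r_gr = 1 + int(np.argmax(gr))
print(r_er, r_gr)  # both ratios peak at the 3 true factors
```

Only eigenvalues are needed here (no eigenvectors), which is the computational advantage noted above.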

### Working with FRED-MD Data

The FRED-MD data is a large dimensional dataset updated in real time and publicly distributed by the Federal Reserve Bank of St. Louis. In its raw form, it consists of 128 time series at either quarterly or monthly frequency. Here, we will work with the monthly frequency, which can be downloaded in its raw flavour from current.csv. Furthermore, associated with the raw dataset is a set of instructions on how to process each variable in the dataset for empirical work. This can be obtained from Appendix_Tables_Update.pdf.
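For orientation outside EViews, the row immediately below the header of current.csv carries the transformation code for each series. A small pandas sketch, using a made-up two-series stand-in for the top of the file, separates the codes from the observations:

```python
import io
import pandas as pd

# a tiny stand-in for the top of current.csv: the row after the header
# holds the FRED-MD transform code for each series
csv = io.StringIO(
    "sasdate,RPI,INDPRO\n"
    "Transform:,5,5\n"
    "1/1/1959,2583.6,21.9665\n"
    "2/1/1959,2593.9,22.3966\n"
)
raw = pd.read_csv(csv)
tcodes = raw.iloc[0, 1:].astype(int)        # transform codes per series
data = raw.iloc[1:].reset_index(drop=True)  # the observations themselves
print(tcodes.tolist())  # [5, 5]
```

The EViews program below relies on the same structure: the "Transform:" row becomes a series attribute on import.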

As a first step, we will write a brief EViews program to open the raw dataset and process each variable according to the aforementioned instructions. The program is given below:

'documentation on the data:
'https://s3.amazonaws.com/files.fred.stlouisfed.org/fred-md/Appendix_Tables_Update.pdf

close @wf

'get the latest data (monthly only):
wfopen current.csv	'a local copy of the raw FRED-MD file
pagecontract if sasdate<>na
pagestruct @date(sasdate)

'perform transformations
%serlist = @wlookup("*", "series")
for %j {%serlist}
	%tform = {%j}.@attr("Transform:")
	if @len(%tform) then
		if %tform="1" then
			series temp = {%j}	'no transform
		endif
		if %tform="2" then
			series temp = d({%j})	'first difference
		endif
		if %tform="3" then
			series temp = d({%j},2)	'second difference
		endif
		if %tform="4" then
			series temp = log({%j})	'log
		endif
		if %tform="5" then
			series temp = dlog({%j})	'log difference
		endif
		if %tform="6" then
			series temp = dlog({%j},2)	'log second difference
		endif
		if %tform="7" then
			series temp = d({%j}/{%j}(-1) -1)	'difference of percent change
		endif

		{%j} = temp
		{%j}.clearhistory
		d temp
	endif
next

'collect the series in a group and drop non-data members
group grp *
grp.drop resid
grp.drop sasdate

smpl 1960:03 @last

This program processes and collects the variables in a group which we've labeled here GRP. Additionally, we've dropped the variable SASDATE from this group since it is a date variable. In other words, GRP is a collection of 127 variables. Furthermore, as suggested by the FRED-MD paper, the sample under consideration should start from March 1960, and so the final line of the code above sets that sample.
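For readers who prefer to replicate the transformation step outside EViews, the seven FRED-MD transformation codes can be sketched in pandas as follows (the function name and toy series are our own illustrative choices):

```python
import numpy as np
import pandas as pd

def fredmd_transform(x: pd.Series, tcode: int) -> pd.Series:
    """Apply a FRED-MD transformation code (1-7) to one series."""
    if tcode == 1:
        return x                                # no transform
    if tcode == 2:
        return x.diff()                         # first difference
    if tcode == 3:
        return x.diff().diff()                  # second difference
    if tcode == 4:
        return np.log(x)                        # log
    if tcode == 5:
        return np.log(x).diff()                 # log difference
    if tcode == 6:
        return np.log(x).diff().diff()          # log second difference
    if tcode == 7:
        return (x / x.shift(1) - 1.0).diff()    # difference of percent change
    raise ValueError(f"unknown transform code: {tcode}")

s = pd.Series([100.0, 102.0, 103.0, 105.0])
print(fredmd_transform(s, 5).round(4).tolist())
```

Each branch mirrors the corresponding case in the EViews program above.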

A brief glance at the variables indicates that certain variables have missing values. Unfortunately, neither the Bai and Ng (2002) nor the Ahn and Horenstein (2013) procedure handles missing values particularly well. Accordingly, as suggested in the original FRED-MD paper, missing values are initially set to the mean of the non-missing observations for any given series. This is easily achieved with a quick program as follows:

'impute missing values with mean of non-missing observations
for !k=1 to grp.count
	'compute mean of non-missing observations
	series tmp = grp(!k)
	!mu = @mean(tmp)

	'set missing observations to mean
	grp(!k) = @nan(grp(!k), !mu)

	'clean up before next series
	smpl 1960:03 @last
	d tmp
next
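For reference, the same mean imputation can be sketched in one line of pandas (toy data of our own):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 2.0, 4.0]})
# replace missing values with each column's mean of non-missing observations
df_imputed = df.fillna(df.mean())
print(df_imputed["a"].tolist())  # [1.0, 2.0, 3.0]
```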

The original FRED-MD paper next suggests a second-stage updating of missing observations. Nevertheless, for the sake of simplicity, we will skip this step and proceed to estimating the optimal number of factors.

Although we will later estimate a factor model which will handle factor selection within its scope, here we demonstrate automatic factor selection as a standalone exercise. To do so, we will proceed through the principal component dialog. In particular, we open the group GRP, and then proceed to click on View/Principal Components....

Notice that the principal components dialog here is changed from previous versions. This is to allow for the additional selection procedures we've introduced in EViews 12. Because of these changes, we briefly pause to explain the options available to users. In particular, the method dropdown offers several factor selection procedures. The first two, Bai and Ng and Ahn and Horenstein, are automatic selection procedures. The remaining two, Simple and User, are legacy principal component methods that were available in EViews versions prior to 12.

Next, associated with each method is a criterion to use in selection. In the case of Bai and Ng, this offers seven possibilities: one for each of the 6 criteria, and the default Average of criteria, which reports each of the 6 criteria along with their average.

Also, associated with each method is a dropdown which governs how the maximum number of factors is determined. Here, EViews offers 5 possibilities, the specifics of which can be found in the EViews manual. Recall that both the Bai and Ng (2002) and the Ahn and Horenstein (2013) methods require the specification of this parameter. Although EViews offers several automatic selection mechanisms, in keeping with the suggestions in the FRED-MD paper, the exercises below will use a user-defined value of 8.

Finally, EViews offers the option of demeaning and standardizing the dataset across both time and factor dimension. In fact, since the FRED-MD paper suggests that data should be demeaned and standardized, exercises below will proceed by demeaning and standardizing each of the variables. We next demonstrate how to obtain the Bai and Ng (2002) estimate of the optimal number of factors.

#### Factor Selection using Bai and Ng (2002)

From the open principal component dialog, we proceed as follows:

1. Change the Method dropdown to Bai and Ng.
2. Set the User maximum factors to 8.
3. Check the Time-demean box.
4. Check the Time-standardize box.
5. Click on OK.

 Figure 1: Principal Components Dialog

Hitting OK, EViews produces a spool output. The first part of this output is a summary of the principal component analysis.
 Figure 2a: Bai and Ng Summary: PCA Results Figure 2b: Bai and Ng Summary: Factor Selection Results

The second part of the output, Component Selection Results, displays the summary of the Bai and Ng factor selection procedure. In particular, we see that each of the 6 selection criteria selected 8 factors. Naturally, the average number of selected factors is also 8. This result corresponds to the findings in the original FRED-MD paper, although the latter insists on using the PCP2 criterion. Accordingly, we can repeat the exercise above and show the specifics of the PCP2 selection. To do so, from the open group window, we again click on View/Principal Components..., and proceed as follows:
1. Change the Method dropdown to Bai and Ng.
2. Change the Criterion dropdown to PCP2.
3. Set the User maximum factors to 8.
4. Check the Time-demean box.
5. Check the Time-standardize box.
6. Click on OK.

 Figure 3: Bai and Ng PCP2: Factor Selection Results

The output above is a detailed look at the selection procedure. In particular, for each number of factors from 1 to 8, EViews displays the PCP2 statistic. Clearly, the minimum is achieved with 8 factors where the statistic equals 0.904325. Again, the number of factors selected matches that obtained in the FRED-MD paper.

#### Factor Selection using Ahn and Horenstein (2013)

Similar steps can be undertaken to obtain the Ahn and Horenstein (2013) factor selection results. From the open principal component dialog, we proceed as follows:

1. Change the Method dropdown to Ahn and Horenstein.
2. Set the User maximum factors to 8.
3. Check the Time-demean box.
4. Check the Time-standardize box.
5. Check the Cross-demean box.
6. Check the Cross-standardize box.
7. Click on OK.

 Figure 4a: Ahn and Horenstein: PCA Results Figure 4b: Ahn and Horenstein: Factor Selection Results

The results of the Ahn and Horenstein (2013) procedure are markedly different. Unlike the preceding Bai and Ng exercises, here we have chosen to demean the factor (cross-sectional) dimension in addition to demeaning and standardizing the time dimension. This is in keeping with the suggestion in Ahn and Horenstein (2013) who suggest that the cross-sectional dimension should be demeaned to achieve superior results. In particular, the optimal number of factors selected is 1 using both the Eigenvalue Ratio and the Growth Ratio statistics. Clearly, this is very different from the 8 selected factors in the previous exercises.

#### Factor Model Estimation

Typically, the objective of factor selection mechanisms is not to find the number of factors outside of some context. Rather, it is a precursor to some form of estimation, such as a factor model or second generation panel unit root tests. Here, we estimate a factor model using the full FRED-MD dataset and specify that the number of factors should be selected with the Bai and Ng (2002) procedure.

We start by creating a factor object. This is easily done by issuing the following command:

factor fact

This will create a factor object in the workfile called FACT. We double click it to open it and then proceed to click on the Estimate button to bring up the estimation dialog.
 Figure 5a: Factor Dialog: Data Tab Figure 5b: Factor Dialog: Estimation Tab

The rest of the steps proceed as follows:
1. Under the Data tab, enter GRP.
2. Click on the Estimation tab.
3. From the Number of factors group, set the Method dropdown to Bai and Ng.
4. From the Max. Factors dropdown select User.
5. In the User maximum factors textbox write 8.
6. Check the Time-demean box.
7. Check the Time-standardize box.
8. Click on OK.

This tells EViews to estimate a factor model of at most 8 factors, with the number of factors chosen from the full FRED-MD set of variables using the Bai and Ng (2002) procedure. The output is reproduced below:

 Figure 6a: Factor Estimation: Part 1 Figure 6b: Factor Estimation: Part 2

#### Forecasting Industrial Production

Having estimated a factor model, we now replicate the exercise of forecasting industrial production considered in the original FRED-MD paper, where the forecast dynamics are summarized as follows: $$y_{t+h} = \alpha_h + \beta_h(L)\hat{f}_t + \gamma_h(L)y_t$$ In other words, this is an $h$-step-ahead AR forecast with a constant and estimated factors as exogenous variables. In particular, to maintain comparability with the original exercise, we consider an 11-month-ahead forecast where $\hat{f}_t$ is obtained from the previously estimated factor model. In other words, we will forecast over the period of available data in 2020. This exercise is repeated for the first estimated factor, the sum of the first two estimated factors, and no estimated factors, respectively.
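The direct $h$-step regression above can be sketched in NumPy; everything below (the data-generating process, a single lag of each regressor, the seed) is an illustrative assumption of our own rather than the paper's exact specification:

```python
import numpy as np

rng = np.random.default_rng(3)
T, h = 300, 11
f = rng.standard_normal(T)           # stand-in for the estimated factor
y = np.zeros(T)
for t in range(1, T):                # y depends on its own lag and the factor
    y[t] = 0.5 * y[t - 1] + 0.3 * f[t - 1] + 0.1 * rng.standard_normal()

# direct h-step regression: y_{t+h} on a constant, f_t and y_t
X = np.column_stack([np.ones(T - h), f[: T - h], y[: T - h]])
beta, *_ = np.linalg.lstsq(X, y[h:], rcond=None)

# forecast h steps beyond the sample from the last observed values
y_hat = beta @ np.array([1.0, f[-1], y[-1]])
print(round(float(y_hat), 4))
```

Estimating the projection directly at horizon $h$ (rather than iterating a one-step model forward) is what makes this a direct forecast.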

As a first step in this exercise, we must extract the estimated factors. Although the factors are unobserved, they may be estimated from the estimated factor model as scores. In particular, proceed as follows:
1. From the open factor model, click on Proc and then Make Scores....
2. Under the Output specification enter 1 2.
3. Click on OK.

This will produce two series in the workfile: F1 and F2.

Next, let's forecast industrial production by leveraging the EViews native autoregressive forecast engine. To do so, double click on the series INDPRO to open it. Next, click on Proc/Automatic ARIMA Forecasting... to open the dialog. We now proceed with the following steps:
1. In the Estimation sample textbox, enter 1960M03 2019M12.
2. Under Forecast length enter 11.
3. Under the Regressors textbox, enter C F1.
4. Click on the Options tab.
5. Under the Output forecast name, enter INDPRO_F1.
6. Ensure the Forecast comparison graph is checked.
7. Click on OK.

 Figure 8a: Forecast Dialog: Specification Figure 8b: Forecast Dialog: Options

The options above specify that we wish to forecast the last 11 months of available data. Since our available sample runs from March 1960 to November 2020, we will estimate on the sample March 1960 through December 2019, and forecast out to November 2020.

 Figure 9a: Forecast: Actuals vs Forecast Figure 9b: Forecast: Forecast Comparison Graph

For comparison, the same type of forecast is produced using C (F1 + F2) as exogenous variables, and C as the only exogenous variable. All three forecasts are superimposed on top of the original curve for comparison. This is reproduced below.
 Figure 10: Forecast Comparison

### References

1. Bai J and Ng S (2002), "Determining the Number of Factors in Approximate Factor Models", Econometrica, Vol. 70, pp. 191-221. Wiley Online Library.
2. Ahn SC and Horenstein AR (2013), "Eigenvalue Ratio Test for the Number of Factors", Econometrica, Vol. 81, pp. 1203-1227. Wiley Online Library.
3. McCracken MW and Ng S (2016), "FRED-MD: A Monthly Database for Macroeconomic Research", Journal of Business & Economic Statistics, Vol. 34, pp. 574-589. Taylor & Francis.