Introduction
Recent trends in empirical economics (particularly in macroeconomics) indicate increased use of and demand for large-dimensional datasets. Since the temporal dimension ($T$) is typically thought to be large anyway, the term large-dimensional here refers to the number of variables ($N$), otherwise referred to as factors or cross-sectional units. This is in contrast with traditional paradigms where the number of variables is small but the temporal dimension is long. This paradigm shift is markedly the result of theoretical advancements in dimension-aware techniques such as factor-augmented and panel models.

At the heart of all dimension-aware methods is factor selection, or the correct specification (estimation) of the number of factors. Traditionally, this parameter was often assumed. Recently, however, several contributions have offered data-driven, (semi-)automatic factor selection methods, most notably those of Bai and Ng (2002) and Ahn and Horenstein (2013).
These automatic factor selection techniques have come to play important roles in factor-augmented (vector auto)regressions, panel unit root tests with cross-sectional dependence, and data manipulation. A particularly important example of the latter is FRED-MD, a regularly updated and freely distributed macroeconomic database designed for the empirical analysis of big data. What is notable here is that the dataset is leveraged by collecting a vast number of important macroeconomic variables which are then optimally reduced in dimensionality using the Bai and Ng (2002) factor selection procedure.
In this post, we will demonstrate how to perform this dimensionality reduction using EViews' native Bai and Ng (2002) and Ahn and Horenstein (2013) factor selection procedures, both of which were introduced with the release of EViews 12. In particular, we will download the raw FRED-MD data, transform each series according to the FRED-MD instructions, and then perform dimensionality reduction. We will then estimate a traditional factor model with the optimally selected factors and proceed to forecast industrial production.
We pause briefly in the next section to provide a quick overview of the aforementioned factor selection procedures.
Overview of Automatic Factor Selection
Recall that the maximum number of factors cannot exceed the number of observable variables. Factor selection is therefore often used as a dimension reduction technique. In other words, the goal is always to optimally select the smallest number of the most representative or principal variables in a set. Since dimensional principality (or importance) is typically quantified in terms of eigenvalues, virtually all dimension reduction techniques in this literature go through principal component analysis (PCA). For detailed theoretical and empirical discussions of PCA, please refer to our blog entries: Principal Component Analysis: Part I (Theory) and Principal Component Analysis: Part II (Practice).

Although PCA can identify which dimensions are most principal in a set, it is not designed to offer guidance on how many dimensions to retain. As a result, this parameter was traditionally assumed rather than driven by the data. To address this inadequacy, Bai and Ng (2002) proposed to cast the problem of factor selection as a model selection problem, whereas Ahn and Horenstein (2013) achieve automatic factor selection by maximizing the ratio of two adjacent eigenvalues. In either case, optimal factor selection is data-driven.
Bai and Ng (2002)
Bai and Ng (2002) handle the problem of optimal factor selection as the more familiar model selection problem. In particular, criteria are judged as a tradeoff between goodness of fit and parsimony. To formalize matters, consider the traditional factor-augmented model: $$ Y_{i,t} = \boldsymbol{\lambda}_{i}^{\top} \mathbf{F}_{t} + e_{i,t} $$ where $ \mathbf{F}_{t} $ is a vector of $ r $ common factors, $ \boldsymbol{\lambda}_{i} $ denotes a vector of factor loadings, and $ e_{i,t} $ is the idiosyncratic component, which is cross-sectionally independent provided $ \mathbf{F}_{t} $ accounts for all cross-sectional correlations. When the $ e_{i,t} $ are not cross-sectionally independent, the factor model is said to be approximate.

The objective here is to identify the optimal number of factors. In particular, $ \boldsymbol{\lambda}_{i} $ and $ \mathbf{F}_{t} $ are estimated through the optimization problem: \begin{align} \min_{\mathbf{\Lambda}, \mathbf{F}}\frac{1}{NT} \sum_{i=1}^{N}\sum_{t=1}^{T}\left( Y_{i,t} - \boldsymbol{\lambda}_{i}^{\top}\mathbf{F}_{t} \right)^{2} \label{eq1} \end{align} subject to the normalization $ \frac{1}{T}\mathbf{F}^{\top}\mathbf{F} = \mathbf{I} $, where $ \mathbf{I} $ is the identity matrix.
Traditionally, the estimated factors $\widehat{\mathbf{F}}_{t}$ are proportional to the $T \times \min(N,T)$ matrix of eigenvectors associated with all eigenvalues of the $T\times T$ matrix $\mathbf{Y}\mathbf{Y}^{\top}$. This generates the full set of $ \min(N,T) $ factors. The objective then is to choose $ r < \min(N,T) $ factors that best capture the variation in $ \mathbf{Y} $.
Since the minimization problem in \eqref{eq1} is linear, once the factor matrix is estimated (observed), estimation of the factor loadings reduces to an ordinary least squares problem for a given set of regressors (factors). In particular, let $ \mathbf{F}^{r} $ denote the factors associated with the $ r $ largest eigenvalues of $ \mathbf{Y}\mathbf{Y}^{\top} $, and let $ \boldsymbol{\lambda}_{i}^{r} $ denote the associated factor loadings. Then, the problem of estimating $ \boldsymbol{\lambda}_{i}^{r} $ is cast as: $$ V\left( r, \widehat{\mathbf{F}}^{r} \right) = \min_{\mathbf{\Lambda}}\frac{1}{NT} \sum_{i=1}^{N}\sum_{t=1}^{T}\left( Y_{i,t} - \boldsymbol{\lambda}_{i}^{r\top}\widehat{\mathbf{F}}_{t}^{r} \right)^{2} $$ Since a model with $ r+1 $ factors can fit no worse than a model with $ r $ factors, although efficiency is a decreasing function of the number of regressors, the problem of optimally selecting $ r $ becomes a classical problem of model selection. Furthermore, observe that $ V\left( r, \widehat{\mathbf{F}}^{r} \right) $ is the (scaled) sum of squared residuals from a regression of $ \mathbf{Y}_{i} $ on the $ r $ factors, for all $ i $. Thus, to determine $ r $ optimally, one can use a loss function $ L_{r} $ of the form $$ L_{r} = V\left( r, \widehat{\mathbf{F}}^{r} \right) + r\, g(N,T) $$ where $ g(N,T) $ is a penalty for overfitting. Bai and Ng (2002) propose six such loss functions that yield consistent estimates, labeled PC 1 through 3 and IC 1 through 3. The optimal number of factors then derives as the minimizer of $ L_{r} $ across $ 1 \leq r \leq r_{\text{max}} < \min(N,T) $, where $ r_{\text{max}} $ is some known maximum number of factors under consideration. In other words: $$ r^{\star} \equiv \underset{1 \leq r \leq r_{\text{max}}}{\arg\min}\; L_{r} $$ Note that since $ r_{\text{max}} $ must be specified a priori, its choice will play a role in the optimization.
Ahn and Horenstein (2013)
In contrast to Bai and Ng (2002), Ahn and Horenstein (2013) exploit the fact that the $ r $ largest eigenvalues of some matrix grow unboundedly as the rank of said matrix increases, whereas the other eigenvalues remain bounded. The optimization strategy is then simply to maximize the ratio of two adjacent eigenvalues. One of the advantages of this contribution is that it is far less sensitive to the choice of $ r_{\text{max}} $ than Bai and Ng (2002). Furthermore, the procedure is significantly easier to compute, requiring only eigenvalues.

To further the discussion, let $ \psi_{r} $ denote the $ r^{\text{th}} $ largest eigenvalue of some positive semidefinite matrix $ \mathbf{Q} \equiv \mathbf{Y}\mathbf{Y}^{\top} $ or $ \mathbf{Q} \equiv \mathbf{Y}^{\top}\mathbf{Y} $. Furthermore, define: $$ \tilde{\mu}_{NT,\, r} \equiv \frac{1}{NT}\psi_{r} $$ Ahn and Horenstein (2013) propose the following two estimators of the number of factors. For some $ 1 \leq r_{\text{max}} < \min(N,T) $, the optimal number of factors $ r^{\star} $ is derived as:
Eigenvalue Ratio (ER) $$ r^{\star} \equiv \underset{1 \leq r \leq r_{\text{max}}}{\arg\max}\; ER(r), \qquad ER(r) \equiv \frac{\tilde{\mu}_{NT,\, r}}{\tilde{\mu}_{NT,\, r + 1}} $$
Growth Ratio (GR) $$ r^{\star} \equiv \underset{1 \leq r \leq r_{\text{max}}}{\arg\max}\; GR(r), \qquad GR(r) \equiv \frac{\log \left( 1 + \widehat{\mu}_{NT,\, r} \right)}{\log \left( 1 + \widehat{\mu}_{NT,\, r + 1} \right)} $$ where $$ \widehat{\mu}_{NT,\, r} \equiv \frac{\tilde{\mu}_{NT,\, r}}{\displaystyle \sum_{k=r+1}^{\min(N,T)}\tilde{\mu}_{NT,\, k}} $$
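Purely for illustration, both estimators can be computed from eigenvalues alone. The following Python sketch (the helper name is ours) mirrors the formulas above, and assumes $r_{\text{max}}$ is safely below $\min(N,T) - 1$:

```python
import numpy as np

def ahn_horenstein(Y, r_max):
    """Eigenvalue Ratio (ER) and Growth Ratio (GR) estimators of the
    number of factors, after Ahn and Horenstein (2013). Y is T x N;
    assumes r_max < min(N, T) - 1."""
    T, N = Y.shape
    m = min(N, T)
    # Nonzero eigenvalues of YY' and Y'Y coincide; use the smaller matrix.
    Q = Y @ Y.T if T <= N else Y.T @ Y
    mu = np.sort(np.linalg.eigvalsh(Q))[::-1][:m] / (N * T)   # mu_tilde
    er = mu[:-1] / mu[1:]                        # ER(r) for r = 1..m-1
    tail = np.cumsum(mu[::-1])[::-1]             # tail[i] = sum_{k>=i} mu_k
    mu_hat = mu / np.concatenate([tail[1:], [np.nan]])  # mu_r / sum_{k>r} mu_k
    gr = np.log1p(mu_hat[:-1]) / np.log1p(mu_hat[1:])   # GR(r)
    r_er = int(np.argmax(er[:r_max])) + 1
    r_gr = int(np.argmax(gr[:r_max])) + 1
    return r_er, r_gr
```

On a panel driven by two strong factors, both ratios spike at $r = 2$, since the second-to-third eigenvalue ratio is the first one where an unbounded eigenvalue sits over a bounded one.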
Working with FREDMD Data
The FRED-MD data is a large-dimensional dataset updated in real time and publicly distributed by the Federal Reserve Bank of St. Louis. In its raw form, it consists of 128 time series available in either quarterly or monthly frequency. Here, we will work with the monthly frequency, which can be downloaded in its raw form from current.csv. Furthermore, associated with the raw dataset is a set of instructions on how to process each variable in the dataset for empirical work. These can be obtained from Appendix_Tables_Update.pdf.

As a first step, we will write a brief EViews program to download the raw dataset and process each variable according to the aforementioned instructions. The program is summarized below:
'documentation on the data:
'https://s3.amazonaws.com/files.fred.stlouisfed.org/fredmd/Appendix_Tables_Update.pdf
close @wf
'get the latest data (monthly only):
wfopen https://s3.amazonaws.com/files.fred.stlouisfed.org/fredmd/monthly/current.csv colhead=2 namepos=firstatt
pagecontract if sasdate<>na
pagestruct @date(sasdate)
'perform transformations
%serlist = @wlookup("*", "series")
for %j {%serlist}
%tform = {%j}.@attr("Transform:")
if @len(%tform) then
if %tform="1" then
series temp = {%j} 'no transform
endif
if %tform="2" then
series temp = d({%j}) 'first difference
endif
if %tform="3" then
series temp = d({%j},2) 'second difference
endif
if %tform="4" then
series temp = log({%j}) 'log
endif
if %tform="5" then
series temp = dlog({%j}) 'log difference
endif
if %tform="6" then
series temp = dlog({%j},2) 'log second difference
endif
if %tform="7" then
series temp = d({%j}/{%j}(-1) - 1) 'first difference of percent change
endif
{%j} = temp
{%j}.clearhistory
d temp
endif
next
'drop
group grp *
grp.drop resid
grp.drop sasdate
smpl 1960:03 @last
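For readers who prefer to replicate the transformation step outside EViews, a rough Python/pandas equivalent of the transform codes above might look as follows. The helper name is ours, and the tcode mapping follows the FRED-MD appendix:

```python
import numpy as np
import pandas as pd

def apply_tcode(x, tcode):
    """Apply a FRED-MD transformation code to a single series x (pd.Series)."""
    if tcode == 1:
        return x                              # no transform
    if tcode == 2:
        return x.diff()                       # first difference
    if tcode == 3:
        return x.diff().diff()                # second difference
    if tcode == 4:
        return np.log(x)                      # log
    if tcode == 5:
        return np.log(x).diff()               # log first difference
    if tcode == 6:
        return np.log(x).diff().diff()        # log second difference
    if tcode == 7:
        return (x / x.shift(1) - 1.0).diff()  # first difference of percent change
    raise ValueError(f"unknown transform code {tcode}")
```

Applied column by column, this reproduces the same stationarity-inducing transformations as the EViews loop above.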
This program processes and collects the variables in a group which we've labeled GRP. Additionally, we've dropped the variable SASDATE from this group since it is a date variable. In other words, GRP is a collection of 127 variables. Furthermore, as suggested by the FRED-MD paper, the sample under consideration should start from March 1960, and so the final line of the code above sets that sample.

A brief glance at the variables indicates that certain variables have missing values. Unfortunately, neither the Bai and Ng (2002) nor the Ahn and Horenstein (2013) procedure handles missing values particularly well. Accordingly, as suggested in the original FRED-MD paper, missing values are initially set to the mean of non-missing observations for any given series. This is easily achieved with a quick program as follows:
'impute missing values with mean of nonmissing observations
for !k=1 to grp.count
'compute mean of nonmissing observations
series tmp = grp(!k)
!mu = @mean(tmp)
'set missing observations to mean
grp(!k) = @nan(grp(!k), !mu)
'clean up before next series
smpl 1960:03 @last
d tmp
next
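The same first-pass imputation can be sketched in Python; the helper below (name ours, not EViews code) fills each column's missing entries with that column's mean over the observed values:

```python
import numpy as np

def impute_column_means(Y):
    """Replace missing values in each column of Y with that column's
    mean over non-missing observations (first-pass FRED-MD imputation)."""
    Y = np.asarray(Y, dtype=float).copy()
    col_means = np.nanmean(Y, axis=0)    # per-column means ignoring NaNs
    idx = np.where(np.isnan(Y))          # (row, column) positions of NaNs
    Y[idx] = np.take(col_means, idx[1])  # fill each NaN with its column mean
    return Y
```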
The original FRED-MD paper next suggests a second-stage updating of missing observations. Nevertheless, for the sake of simplicity, we will skip this step and proceed to estimating the optimal number of factors.

Although we will later estimate a factor model which handles factor selection within its scope, here we demonstrate automatic factor selection as a standalone exercise. To do so, we will proceed through the principal components dialog. In particular, we open the group GRP, and then click on View/Principal Components....
Notice that the principal components dialog has changed from previous versions. This is to allow for the additional selection procedures introduced in EViews 12. Because of these changes, we briefly pause to explain the options available to users. In particular, the Method dropdown offers several factor selection procedures. The first two, Bai and Ng and Ahn and Horenstein, are automatic selection procedures. The remaining two, Simple and User, are legacy principal component methods that were available in EViews versions prior to 12.
Next, associated with each method is a criterion to use in selection. In the case of Bai and Ng, this offers seven possibilities: one for each of the six criteria, and the default Average of criteria, which provides a summary of each of the six criteria as well as their average.
Also associated with each method is a dropdown which determines how the maximum number of factors is determined. Here EViews offers five possibilities, the specifics of which can be obtained by referring to the EViews manual. Recall that both the Bai and Ng (2002) and the Ahn and Horenstein (2013) methods require the specification of this parameter. Although EViews offers several automatic selection mechanisms, in keeping with the suggestions in the FRED-MD paper, the exercises below will use a user-defined value of 8.
Finally, EViews offers the option of demeaning and standardizing the dataset across both the time and the factor dimension. In fact, since the FRED-MD paper suggests that the data should be demeaned and standardized, the exercises below will proceed by demeaning and standardizing each of the variables. We next demonstrate how to obtain the Bai and Ng (2002) estimate of the optimal number of factors.
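For reference, time-demeaning and time-standardizing amounts to the following simple column-wise operation (illustrative Python, helper name ours):

```python
import numpy as np

def time_standardize(Y):
    """Demean and standardize each series (column) over the time
    dimension, as suggested in the FRED-MD paper."""
    Y = np.asarray(Y, dtype=float)
    return (Y - Y.mean(axis=0)) / Y.std(axis=0)
```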
Factor Selection using Bai and Ng (2002)
From the open principal components dialog, we proceed as follows:
 Change the Method dropdown to Bai and Ng.
 Set the User maximum factors to 8.
 Check the Timedemean box.
 Check the Timestandardize box.
 Click on OK.
To view the results for a single criterion, say PCP2, we proceed again as follows:
 Change the Method dropdown to Bai and Ng.
 Change the Criterion dropdown to PCP2.
 Set the User maximum factors to 8.
 Check the Timedemean box.
 Check the Timestandardize box.
 Click on OK.
Factor Selection using Ahn and Horenstein (2013)
Similar steps can be undertaken to obtain the Ahn and Horenstein (2013) factor selection results. From the open principal components dialog, we proceed as follows:
 Change the Method dropdown to Ahn and Horenstein.
 Set the User maximum factors to 8.
 Check the Timedemean box.
 Check the Timestandardize box.
 Check the Crossdemean box.
 Check the Crossstandardize box.
 Click on OK.
Factor Model Estimation
Typically, the objective of factor selection mechanisms is not to find the number of factors outside of some context. Rather, it is a precursor to some form of estimation, such as a factor model or second-generation panel unit root tests. Here, we estimate a factor model using the full FRED-MD dataset and specify that the number of factors should be selected with the Bai and Ng (2002) procedure.

We start by creating a factor object. This is easily done by issuing the following command:
factor fact
This will create a factor object in the workfile called FACT. We double-click it to open it and then click on the Estimate button to bring up the estimation dialog.
 Under the Data tab, enter GRP.
 Click on the Estimation tab.
 From the Number of factors group, set the Method dropdown to Bai and Ng.
 From the Max. Factors dropdown select User.
 In the User maximum factors textbox write 8.
 Check the Timedemean box.
 Check the Timestandardize box.
 Click on OK.
This tells EViews to estimate a factor model of at most 8 factors, with the number of factors chosen from the full FREDMD set of variables using the Bai and Ng (2002) procedure. The output is reproduced below:
Forecasting Industrial Production
Having estimated a factor model, we now repeat the exercise of forecasting industrial production. The exercise is considered in the original FRED-MD paper, where the forecast dynamics are summarized as follows: $$ y_{t+h} = \alpha_h + \beta_h(L)\hat{f}_t + \gamma_h(L)y_t $$ In other words, this is an $h$-step-ahead AR forecast with a constant and estimated factors as exogenous variables. In particular, to maintain comparability with the original exercise, we consider an 11-month-ahead forecast where $\hat{f}_t$ is obtained from the previously estimated factor model. In other words, we'll forecast the period of available data in 2020. This exercise is repeated for the first estimated factor, the sum of the first two estimated factors, and no estimated factors, respectively.

As a first step in this exercise, we must extract the estimated factors. Although the factors are unobserved, they may be estimated from the estimated factor model as scores. In particular, proceed as follows:
 From the open factor model, click on Proc and then Make Scores....
 Under the Output specification enter 1 2.
 Click on OK.
This will produce two series in the workfile: F1 and F2.
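For readers curious about the mechanics, the direct $h$-step-ahead regression $y_{t+h} = \alpha_h + \beta_h(L)\hat{f}_t + \gamma_h(L)y_t$ can be approximated with a small Python sketch. The function below is our own simplified illustration (with $p$ lags of both the factor and the dependent variable), not the EViews automatic ARIMA engine:

```python
import numpy as np

def direct_forecast(y, f, h, p=1):
    """Direct h-step-ahead forecast: regress y_{t+h} on a constant,
    p lags of the estimated factor f, and p lags of y, then forecast
    from the last observed values."""
    y, f = np.asarray(y, float), np.asarray(f, float)
    T = len(y)
    rows = []
    for t in range(p - 1, T - h):
        # regressor vector: [1, f_t, ..., f_{t-p+1}, y_t, ..., y_{t-p+1}]
        rows.append(np.concatenate(([1.0],
                                    f[t - p + 1:t + 1][::-1],
                                    y[t - p + 1:t + 1][::-1])))
    X = np.array(rows)
    target = y[p - 1 + h:]                      # y_{t+h} aligned with rows
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    # forecast from the most recent observations
    x_last = np.concatenate(([1.0], f[T - p:][::-1], y[T - p:][::-1]))
    return float(x_last @ coef)
```

For example, feeding it a series that follows $y_t = 2 + 0.5\,y_{t-1}$ exactly (with a zero factor) recovers the one-step forecast $2 + 0.5\,y_T$.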
Next, let's forecast industrial production by leveraging the EViews native autoregressive forecast engine. To do so, double click on the series INDPRO to open it. Next, click on Proc/Automatic ARIMA Forecasting... to open the dialog. We now proceed with the following steps:
 In the Estimation sample textbox, enter 1960M03 2019M12.
 Under Forecast length enter 11.
 Under the Regressors textbox, enter C F1.
 Click on the Options tab.
 Under the Output forecast name, enter INDPRO_F1.
 Ensure the Forecast comparison graph is checked.
 Click on OK.
Files
References
 Bai, J. and Ng, S. (2002), "Determining the Number of Factors in Approximate Factor Models", Econometrica, Vol. 70, pp. 191-221.
 Ahn, S. C. and Horenstein, A. R. (2013), "Eigenvalue Ratio Test for the Number of Factors", Econometrica, Vol. 81, pp. 1203-1227.
 McCracken, M. W. and Ng, S. (2016), "FRED-MD: A Monthly Database for Macroeconomic Research", Journal of Business & Economic Statistics, Vol. 34, pp. 574-589.