Lasso variable selection, new to EViews 12, uses Lasso as a variable selection technique and then estimates ordinary least squares on the selected variables. The approach is also known as the Lasso-OLS hybrid, post-Lasso OLS, post-estimation OLS, or (under certain conditions) the relaxed Lasso.
Background
In today’s data-rich environment it is useful to have methods of extracting information from complex datasets with large numbers of variables. A popular way of doing this is with dimension reduction techniques such as principal components analysis or dynamic factor models. By reducing the number of variables in a model, we can reduce overfitting, make the model simpler and easier to interpret, and decrease computation time. However, dimension reduction methods risk losing useful information contained in variables that are left out of the reduced set, and may have poorer predictive power.

Lasso is useful because it is a shrinkage estimator: it shrinks the size of the coefficients of the independent variables depending on their predictive power. Some coefficients may shrink all the way to zero, allowing us to restrict the model to the variables with nonzero coefficients.
Lasso is just one method out of a family of penalized least squares estimators (other members include ridge regression and elastic net). Starting with the linear regression cost function:

\begin{align*} J = \frac{1}{2m}\sum_{i=1}^{m}\left(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\right)^{2} \end{align*}

where $y_i$ is the dependent variable, $x_{ij}$ are the independent variables, $\beta_j$ are the coefficients, $m$ is the number of data points, and $p$ the number of independent variables, we obtain the coefficients $\beta_j$ by minimizing $J$. If the model based on linear regression is overfit and does not make good predictions on new data, then one solution is to construct a Lasso model by adding a penalty term:

\begin{align*} J = \frac{1}{2m}\sum_{i=1}^{m}\left(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\right)^{2} {\color{red}{+\,\lambda\sum_{j=1}^{p}|\beta_j|}} \end{align*}

where the parameters are the same as before with the addition of the regularization parameter $\lambda$. Adding the penalty term increases the cost of large $\beta_j$, so to minimize the cost function the values of $\beta_j$ have to be reduced. Smaller values of $\beta_j$ "smooth out" the function so it fits the data less tightly, leaving it more likely to generalize well to new data. The regularization parameter $\lambda$ determines how much the cost of $\beta_j$ is increased. Lasso estimation in EViews can automatically select an appropriate value with cross-validation, a data-driven method of choosing $\lambda$ based on its predictive ability.
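As a concrete illustration of the penalty and of cross-validation for $\lambda$, here is a minimal sketch in Python with scikit-learn (not EViews code); scikit-learn calls the regularization parameter `alpha`, and the data below are simulated purely for illustration.

```python
# Minimal sketch (Python/scikit-learn, not EViews): cross-validation chooses the
# Lasso penalty, which scikit-learn calls alpha rather than lambda.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                              # 10 candidate regressors
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=200)    # only two of them matter

model = LassoCV(cv=5).fit(X, y)      # 5-fold cross-validation over a grid of penalties
print("selected penalty:", model.alpha_)
print("coefficients:", model.coef_)  # most are shrunk exactly to zero
```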
If we have a dataset with many independent variables, ordinary least squares models may produce estimates with large variances and therefore unstable forecasts. By applying Lasso regression to the data and removing variables that have been shrunk to zero, then applying OLS to the reduced number of variables, we may be able to improve forecasting performance. In this way we can perform dimension reduction on our data based on the predictive accuracy of our model.
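To make the two-stage idea concrete, here is a rough sketch of the Lasso-then-OLS pipeline in scikit-learn terms. EViews performs the equivalent steps for you; the `post_lasso_ols` helper and the simulated data below are purely illustrative.

```python
# Sketch of post-Lasso OLS: Lasso for selection, then an unpenalized OLS refit
# on the variables whose Lasso coefficients are nonzero.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

def post_lasso_ols(X, y, lam):
    lasso = Lasso(alpha=lam).fit(X, y)
    keep = np.flatnonzero(lasso.coef_)            # columns with nonzero coefficients
    ols = LinearRegression().fit(X[:, keep], y)   # OLS on the selected columns only
    return keep, ols

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 20))
y = 2.0 * X[:, 3] - 1.5 * X[:, 7] + rng.normal(size=150)

keep, ols = post_lasso_ols(X, y, lam=0.1)
print("selected columns:", keep)
print("OLS coefficients on those columns:", ols.coef_)
```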
Dataset
In the table below we show part of the data used for this example.
[Table: excerpt of the dataset]
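The description in this post (442 observations, ten baseline variables, a 64-variable extension) matches the diabetes data of Efron et al. Assuming that is indeed the dataset used here, a copy ships with scikit-learn, so readers who want to follow along outside EViews can load it as follows.

```python
# Load the diabetes dataset that (we assume) corresponds to the data shown above.
from sklearn.datasets import load_diabetes

data = load_diabetes(as_frame=True)
print(data.frame.head())    # first few rows of the predictors and the target
print(data.frame.shape)     # (442, 11): 10 predictors plus the target column
```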
Analysis
We first perform an OLS regression on the dataset to give us a baseline for comparison.
[Figure: least squares estimation output]
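For readers following along in Python, a rough analogue of this baseline (again assuming the diabetes data, and using statsmodels rather than EViews) looks like this; the summary reports coefficients, R-squared, and standard errors much like an EViews equation output.

```python
# OLS baseline on the full dataset (a statsmodels stand-in, not the EViews output).
import statsmodels.api as sm
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True, as_frame=True)
ols = sm.OLS(y, sm.add_constant(X)).fit()   # add_constant supplies the intercept
print(ols.summary())
```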
Next, we run a Lasso regression over the same dataset and look at the plot of the coefficients against the L1 norm of the coefficient vector. This gives us a sense of how each coefficient contributes to the dependent variable. We can see that as the degree of regularization decreases (and the L1 norm increases), more coefficients enter the model.
[Figures: Lasso estimation output and coefficient path plot]

We then estimate the same specification using Lasso variable selection.

[Figure: Lasso variable selection output]
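The coefficient-path plot can be reproduced approximately outside EViews with scikit-learn's `lasso_path` and matplotlib; the sketch below draws the same kind of graph (coefficients against the L1 norm of the coefficient vector), not the EViews graph itself.

```python
# Lasso coefficient paths plotted against the L1 norm of the coefficient vector.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.linear_model import lasso_path

X, y = load_diabetes(return_X_y=True)
alphas, coefs, _ = lasso_path(X, y)        # coefs has shape (n_features, n_alphas)
l1_norm = np.sum(np.abs(coefs), axis=0)    # L1 norm at each value of the penalty

plt.plot(l1_norm, coefs.T)                 # one line per coefficient
plt.xlabel("L1 norm of coefficients")
plt.ylabel("coefficient value")
plt.title("Lasso coefficient paths")
plt.show()
```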
You may have noticed that the set of nonzero coefficients here is different from that of the Lasso example earlier. That's because Lasso variable selection selects the preferred model using a different measure (AIC) than the cross-validation used by Lasso. AIC is the same measure used for the other variable selection methods in EViews.
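A rough analogue of this AIC-based selection in Python is scikit-learn's `LassoLarsIC`, which picks the point on the Lasso path that minimizes an information criterion; this is not the EViews implementation, just an illustration of the idea.

```python
# Pick the Lasso penalty by minimizing AIC along the path (illustrative only).
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LassoLarsIC

X, y = load_diabetes(return_X_y=True)
aic_model = LassoLarsIC(criterion="aic").fit(X, y)
print("penalty chosen by AIC:", aic_model.alpha_)
print("number of nonzero coefficients:", int((aic_model.coef_ != 0).sum()))
```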
What about out-of-sample predictive power? We have randomly labeled each of the 442 observations as either a training or a test data point (the split is 70% training, 30% test). After estimating least squares and Lasso variable selection on the training data, we use Series->View->Forecast Evaluation to compare the two sets of forecasts over the test set:
[Figure: forecast evaluation over the test set]
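As a stand-in for the EViews forecast evaluation view, the sketch below performs its own 70/30 split and compares test-set RMSE for least squares against AIC-based Lasso selection followed by an OLS refit; the random split will not match the one used in this post, so the numbers will differ.

```python
# Out-of-sample comparison: OLS versus Lasso-selected OLS on a 70/30 split.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression, LassoLarsIC
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

ols = LinearRegression().fit(X_tr, y_tr)                  # least squares baseline

selector = LassoLarsIC(criterion="aic").fit(X_tr, y_tr)   # AIC picks a point on the path
keep = np.flatnonzero(selector.coef_)                     # variables that survived
post = LinearRegression().fit(X_tr[:, keep], y_tr)        # OLS refit on the survivors

def rmse(model, X_eval):
    return mean_squared_error(y_te, model.predict(X_eval)) ** 0.5

print("least squares test RMSE:   ", rmse(ols, X_te))
print("Lasso-selection test RMSE: ", rmse(post, X_te[:, keep]))
```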
This is all mildly interesting. But the real power of variable selection techniques comes when you have a larger dataset and want to reduce the variables under consideration to a more manageable set. To this end, we use the “extended” dataset provided by the authors, which includes the ten original variables plus the squares of nine variables and forty-five interaction terms, for a total of sixty-four variables.
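If the authors' extended file is not to hand, something close to it can be built from the ten original variables with scikit-learn's `PolynomialFeatures`. Note that a full degree-2 expansion produces all ten squares, giving 65 columns rather than the 64 described above; the published extension presumably omits the square of the binary sex variable.

```python
# Build an extended design matrix: original variables, squares, and pairwise interactions.
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import PolynomialFeatures

X, y = load_diabetes(return_X_y=True)
poly = PolynomialFeatures(degree=2, include_bias=False)
X_ext = poly.fit_transform(X)   # 10 originals + 10 squares + 45 interactions = 65 columns
print(X_ext.shape)              # (442, 65)
```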
First, we repeat the OLS regression from earlier with the new extended dataset:
[Figure: least squares estimation output, extended dataset]
Next, let’s go straight to Lasso variable selection on the extended dataset.
[Figure: Lasso variable selection output, extended dataset]
The in-sample R-squared and errors have moved in a modest but promising direction. What about out-of-sample prediction? We again compare the forecasts for least squares and Lasso variable selection over the test set:
[Figure: forecast evaluation over the test set, extended dataset]