Nonparametric Estimation
Traditional least squares regression is parametric in nature. It confines relationships between the dependent variable $ Y_{t} $ and independent variables (regressors) $ X_{1,t}, X_{2,t}, \ldots $ to be, in expectation, linear in the parameters. For instance, if the true data generating process (DGP) for $ Y_{t} $ derives from $ p $ regressors, the least squares regression model postulates that: $$ Y_{t} = m(x_{1}, \ldots, x_{p}) \equiv E(Y_t | X_{1,t} = x_{1}, \ldots, X_{p,t} = x_{p}) = \beta_0 + \sum_{k=1}^{p}{\beta_k x_{k}} $$ Since this relationship holds only in expectation, a statistically equivalent form of this statement is: \begin{align} Y_t &= m\left(X_{1,t}, \ldots, X_{p,t}\right) + \epsilon_{t} \nonumber \\ &=\beta_0 + \sum_{k=1}^{p}{\beta_k X_{k,t}} + \epsilon_t \label{eq.1.1} \end{align} where the error term $ \epsilon_{t} $ has mean zero, and parameter estimates are solutions to the minimization problem: $$ \arg\!\min_{\hspace{-1em}\beta_{0}, \ldots, \beta_{p}} E\left(Y_{t} - \beta_0 - \sum_{k=1}^{p}{\beta_k X_{k,t}}\right)^{2} $$ While this framework is sufficient for many applications, and is appealing and intuitive, when the true but unknown DGP is in fact non-linear, inference is rendered unreliable.

On the other hand, nonparametric modelling remains agnostic about functional forms. Relationships are, in expectation, simply functionals $ m(\cdot) $, and if the true DGP for $ Y_{t} $ is a function of $ p $ regressors, then: $$ Y_t = m\left(X_{1,t}, \ldots, X_{p,t}\right) + \epsilon_{t} $$ Here, estimators of $ m(\cdot) $ can generally be cast as minimization problems of the form: \begin{align} \arg\!\min_{\hspace{-1em} m\in \mathcal{M}} E\left(Y_{t} - m\left(X_{1,t}, \ldots, X_{p,t}\right)\right)^{2} \label{eq.1.2} \end{align} where $ \mathcal{M} $ is now a function space. In this regard, a nonparametric estimator can be thought of as a solution to a search problem over functions as opposed to parameters.
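To fix ideas, here is a minimal sketch (in Python/NumPy rather than the EViews programs used throughout this post) of the parametric benchmark in \eqref{eq.1.1}: a linear model is fit by least squares to data whose true DGP is nonlinear, so the fitted line necessarily misses the curvature. The sine DGP, seed, and sample size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 300
x = rng.uniform(-3, 3, T)
y = np.sin(2 * x) + 0.3 * rng.standard_normal(T)   # nonlinear DGP (illustrative)

X = np.column_stack([np.ones(T), x])               # regressors: constant and x
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # solves the least squares problem above
resid = y - X @ beta_hat

print("estimated (beta_0, beta_1):", beta_hat)
print("in-sample RMSE of the linear fit:", np.sqrt(np.mean(resid ** 2)))
```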
The problem in \eqref{eq.1.2}, however, is infeasible as stated: the function space $ \mathcal{M} $ is infinite dimensional, so it cannot be searched directly. Moreover, without further restrictions, solutions need not be identified since different functions in $ \mathcal{M} $ can produce the same fitted values. Accordingly, general practice is to reduce $ \mathcal{M} $ to a lower-dimensional space and optimize over it. This typically implies a reduction of the problem to a parametric framework so that the problem in \eqref{eq.1.2} is cast into: \begin{align} \arg\!\min_{\hspace{-1em} h\in \mathcal{H}} E\left(Y_{t} - h\left(X_{1,t}, \ldots, X_{p,t}; \mathbf{\Theta} \right)\right)^{2} \label{eq.1.3} \end{align} where $ h(\cdot; \mathbf{\Theta}) \in \mathcal{H} $ is a function with associated parameters $ \mathbf{\Theta} \in \mathbf{R}^{q} $ and $ \mathcal{H} $ is a function space which is dense in $ \mathcal{M} $; formally, $ h^{\star} \in \mathcal{H} \rightarrow m^{\star} \in \mathcal{M} $, where $ \rightarrow $ denotes asymptotic convergence. In other words, any feasible estimate $ h^{\star} $ must become arbitrarily close to the infeasible estimate $ m^{\star} $ as the space $ \mathcal{H} $ grows to asymptotic equivalence with $ \mathcal{M} $. In this regard, nonparametric estimators are typically classified into either global or local kinds.
Global Methods
Global estimators, generally synonymous with the class of sieve estimators introduced by Grenander (1981), approximate arbitrary functions by simpler functions which are uniformly dense in the target space $ \mathcal{M} $. A particularly important class of such estimators are linear sieves, which are constructed as linear combinations of popular basis functions. The latter include Bernstein polynomials, Chebyshev polynomials, Hermite polynomials, Fourier series, polynomial splines, B-splines, and wavelets. Formally, when the function $ m(\cdot) $ is univariate, linear sieves assume the following general structure: \begin{align} \mathcal{H}_{J} = \left\{h \in \mathcal{M}: h(x; \mathbf{\Theta}) = \sum_{j=1}^{J}\theta_{j}f_{j}(x)\right\} \label{eq.1.4} \end{align} where $ \theta_{j} \in \mathbf{\Theta} $, $ f_{j}(\cdot) $ is one of the aforementioned basis functions, and $ J \rightarrow \infty $. For instance, if the sieve exploits the Stone-Weierstrass approximation theorem, which states that any continuous function on a compact interval can be uniformly approximated on that interval by a polynomial to any desired accuracy, then $ f_{j}(x) = x^{j-1} $. In particular, if the unknown function of interest is $ m(x) $, then choosing to approximate the latter with a polynomial of some finite degree $ J = J^{\star} < \infty $ reduces the problem in \eqref{eq.1.3} to: $$ \arg\!\min_{\hspace{-1em}\theta_{0}, \ldots, \theta_{J^{\star}}} E\left(Y_{t} - \theta_{0} - \sum_{j=1}^{J^{\star}}\theta_{j}X_{t}^{j} \right)^{2} $$ where $ Y_{t} $ are the values we observe from the theoretical function $ m(x) $, and $ X_{t} $ is the regressor we're using to estimate it. Usual least squares now yields $ \widehat{\theta}_{j} $ for $ j=0,\ldots, J^{\star} $. Furthermore, $ m(x) $ can be approximated as $$ m(x) \approx \widehat{\theta}_{0} + \sum_{j=1}^{J^{\star}}\widehat{\theta}_{j}x^{j} $$ where $ x $ is evaluated on a grid over some interval $ [a,b] $ of arbitrary fineness, or even on the original regressor values so that $ x \equiv X_{t} $.
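As a concrete sketch of this polynomial sieve (written in Python/NumPy; the function names are our own and not part of the post's EViews code), the basis $ f_{j}(x) = x^{j-1} $ is assembled into a regressor matrix and the $ \theta_{j} $ are estimated by ordinary least squares:

```python
import numpy as np

def polynomial_sieve_fit(y, x, J):
    """Estimate (theta_0, ..., theta_J) of a degree-J polynomial sieve by least squares."""
    F = np.vander(x, N=J + 1, increasing=True)        # columns: 1, x, x^2, ..., x^J
    theta_hat, *_ = np.linalg.lstsq(F, y, rcond=None)
    return theta_hat

def polynomial_sieve_predict(theta_hat, x_grid):
    """Evaluate the fitted sieve on an arbitrary grid (or on the original regressor)."""
    F = np.vander(x_grid, N=len(theta_hat), increasing=True)
    return F @ theta_hat
```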
To demonstrate the procedure, define the true but unknown function $ m(x) $ as: \begin{align} m(x) = \sin(x)\cos\left(\frac{1}{x}\right) + \log\left(x + \sqrt{x^2+1}\right) \quad x \in [-6,6]\label{eq.1.5} \end{align} Furthermore, generate observable data from $ m(x) $ as $ Y_{t} = m(x) + 0.5\epsilon_{t} $, and generate the regressor data as $ X_{t} = x - 0.5 + \eta_{t} $, where $ \epsilon_{t} $ and $ \eta_{t} $ are mutually independent standard normal and standard uniform random variables, respectively. Estimation is now summarized for polynomial degrees 1, 5, and 15, respectively.
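For completeness, the simulation design can be sketched as follows (Python/NumPy; the seed and sample size are assumptions, and the post's own implementation is the accompanying EViews program). Note that np.polyfit is simply least squares on the polynomial basis.

```python
import numpy as np

rng = np.random.default_rng(12345)
T = 500
x = np.linspace(-6, 6, T)                            # grid happens to exclude x = 0
m = np.sin(x) * np.cos(1.0 / x) + np.log(x + np.sqrt(x ** 2 + 1))   # equation (1.5)
Y = m + 0.5 * rng.standard_normal(T)                 # Y_t = m(x) + 0.5 * eps_t
X = x - 0.5 + rng.uniform(0.0, 1.0, T)               # X_t = x - 0.5 + eta_t

for J in (1, 5, 15):
    theta_hat = np.polyfit(X, Y, deg=J)              # least squares polynomial fit of degree J
    fit = np.polyval(theta_hat, X)
    print(f"degree {J:2d}: in-sample RMSE = {np.sqrt(np.mean((Y - fit) ** 2)):.3f}")
```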
Figure 1: Polynomial Sieve Estimation
Figure 2: Gaussian Sieve Estimation
Optimal Sieve Length
Given the examples explored above, it is evident that sieve length plays a major role in fitting accuracy. For instance, estimation with a low sieve length resulted in severe underfitting, while a higher sieve length resulted in a better fit. The question, of course, is whether an optimal length can be determined. Li (1987) studied three well-known procedures, all of which are based on the mean squared forecast error of the estimated function over a search grid $ \mathcal{J} \equiv \left\{J_{min},\ldots, J_{max}\right\} $, and all of which are asymptotically equivalent. In particular, let $ J^{\star} $ denote the optimal sieve length and consider:
- $ C_{p} $ method due to Mallows (1973): $$ J^{\star} = \arg\!\min_{\hspace{-1em}J \in \mathcal{J}} \frac{1}{T}\sum_{t=1}^{T}\left(Y_{t} - \widehat{m}(X_{t})\right)^{2} + 2\widehat{\sigma}^{2}\frac{J}{T} $$ where $ \widehat{\sigma}^{2} = \frac{1}{T}\sum_{t=1}^{T}\left(Y_{t} - \widehat{m}(X_{t})\right)^{2}$
- Generalized cross-validation method due to Craven and Wahba (1979): $$ J^{\star} = \arg\!\min_{\hspace{-1em}J \in \mathcal{J}} \frac{1}{\left(1 - J/T\right)^{2}}\frac{1}{T}\sum_{t=1}^{T}\left(Y_{t} - \widehat{m}(X_{t})\right)^{2} $$
- Leave-one-out cross-validation method due to Stone (1974): $$ J^{\star} = \arg\!\min_{\hspace{-1em}J \in \mathcal{J}} \frac{1}{T}\sum_{t=1}^{T}\left(Y_{t} - \widehat{m}_{\setminus t}(X_{t})\right)^{2} $$ where the subscript notation $ \setminus t $ indicates estimation after dropping observation $ t $. In practice, the leave-one-out procedure iterates the following steps over each sieve length $ J \in \mathcal{J} $:
- For each observation $ t^{\star} \in \left\{1, \ldots, T \right\} $:
- Solve the optimization problem in \eqref{eq.1.3} using data from the pairs $ (Y_{t}, X_{t})_{t \neq t^{\star}} $, and derive the estimated model as follows: $$ \widehat{m}_{J,\setminus t^{\star}}(x) \equiv \widehat{\theta}_{_{J,\setminus t^{\star}}0} + \sum_{j=1}^{J}\widehat{\theta}_{_{J,\setminus t^{\star}}j}f_{j}(x)$$ where the subscript $ J,\setminus t^{\star} $ indicates that parameters are estimated using sieve length $ J $, after dropping observation $ t^{\star} $.
- Derive the forecast error for the dropped observation as follows: $$ e_{_{J}t^{\star}} \equiv Y_{t^{\star}} - \widehat{m}_{J,\setminus t^{\star}}(X_{t^{\star}}) $$
- Derive the cross-validation mean squared error for sieve length $ J $ as follows: $$ MSE_{J} = \frac{1}{T}\sum_{t=1}^{T} e_{_{J}t}^{2} $$
- Determine the optimal sieve length $ J^{\star} $ as the length that minimizes $ MSE_{J} $ across $ \mathcal{J} $. In other words, $$ J^{\star} = \arg\!\min_{\hspace{-1em}J\in\mathcal{J}} MSE_{J} $$ A code sketch of this search follows the list.
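The following is a sketch of this leave-one-out search for a polynomial sieve (Python/NumPy; the helper name and the use of np.polyfit are our own choices, not the post's EViews code):

```python
import numpy as np

def loocv_polynomial_degree(Y, X, J_grid):
    """Return (J_star, mse_by_degree) from a leave-one-out search over J_grid."""
    T = len(Y)
    mse = {}
    for J in J_grid:
        errors = np.empty(T)
        for t_star in range(T):
            keep = np.arange(T) != t_star                       # drop observation t_star
            theta = np.polyfit(X[keep], Y[keep], deg=J)         # re-estimate without it
            errors[t_star] = Y[t_star] - np.polyval(theta, X[t_star])  # forecast error
        mse[J] = np.mean(errors ** 2)                           # cross-validation MSE_J
    J_star = min(mse, key=mse.get)                              # degree minimizing MSE_J
    return J_star, mse
```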
Figure 3: Sieve Regression with Optimized Sieve Length Selection
Critiques
While global nonparametric estimators are easy to work with, they exhibit several well recognized drawbacks. First, they leave little room for fine-tuning estimation. For instance, in the case of polynomial sieves, the polynomial degree is not continuous. In other words, if estimation underfits when the sieve length is $ J $, but overfits when the sieve length is $ J+1 $, then there is no intermediate polynomial degree $ J < J^{\star} < J+1 $ to choose.

Second, global estimators are often subject to numerical infeasibility when regressor values are not sufficiently small. Increasing the sieve length raises the regressors to ever higher powers, which can cause the entries of the regressor cross-moment matrix to become extremely large. In turn, this can render the matrix nearly singular, so that its inverse, and by extension the estimates themselves, become numerically unreliable or infeasible. In other words, at some point, increasing the polynomial degree further does not lead to improved estimates.
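A quick way to see this numerically (a sketch; the degrees shown are arbitrary) is to compute the condition number of the cross-moment matrix $ F'F $ of the polynomial basis as the degree grows:

```python
import numpy as np

x = np.linspace(-6, 6, 500)
for J in (3, 6, 10, 15):
    F = np.vander(x, N=J + 1, increasing=True)       # basis 1, x, ..., x^J
    cond = np.linalg.cond(F.T @ F)                   # condition number of F'F
    print(f"degree {J:2d}: cond(F'F) = {cond:.2e}")  # explodes as J grows
```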
Lastly, it is worth pointing out that global estimators fit curves by smoothing (averaging) over the entire domain. As such, they can have difficulties handling observations with strong influence, such as outliers and regime switches. This is because outlying observations are averaged with the rest of the data, resulting in a curve that significantly under- or over-fits these observations. To illustrate this point, consider a modification of equation \eqref{eq.1.5} with outliers when $ -1 < x \leq 1 $: \begin{align} m(x) = \begin{cases} \sin(x)\cos(\frac{1}{x}) + \log\left(x + \sqrt{x^2+1}\right) & \text{if } x\in [-6,-1]\\ \sin(x)\cos(\frac{1}{x}) + \log\left(x + \sqrt{x^2+1}\right) + 4 & \text{if } x \in (-1,1]\\ \sin(x)\cos(\frac{1}{x}) + \log\left(x + \sqrt{x^2+1}\right) - 2 & \text{if } x \in (1,6] \end{cases}\label{eq.1.6} \end{align} We generate $ Y_{t} $ and $ X_{t} $ as before, and estimate this model using both polynomial and Gaussian sieves based on cross-validated sieve length selection.
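A sketch of the corresponding data generation (Python/NumPy; seed and sample size are again assumptions):

```python
import numpy as np

def m_shifted(x):
    """The piecewise curve of equation (1.6)."""
    base = np.sin(x) * np.cos(1.0 / x) + np.log(x + np.sqrt(x ** 2 + 1))
    shift = np.where(x <= -1, 0.0, np.where(x <= 1, 4.0, -2.0))  # 0 on [-6,-1], +4 on (-1,1], -2 on (1,6]
    return base + shift

rng = np.random.default_rng(12345)
T = 500
x = np.linspace(-6, 6, T)
Y = m_shifted(x) + 0.5 * rng.standard_normal(T)
X = x - 0.5 + rng.uniform(0.0, 1.0, T)
```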
Figure 4: Sieve Regression with Optimized Sieve Length Selection and Outliers
It turns out that a number of these shortcomings can be mitigated by averaging locally instead of globally. In this regard, we turn to the idea of local estimation next.
Local Methods
The general idea behind local nonparametric estimators is local averaging. The procedure partitions the domain of the explanatory variable $ x $ into bins of a particular size, and estimates $ m(x) $ as a linear interpolation of the average values of the dependent variable at the midpoint of each bin. We demonstrate the procedure when $ m(x) $ is the function in \eqref{eq.1.5}. In particular, define $ Y_{t} $ as before, but let $ X_{t} = x $. In other words, we consider deterministic regressors. We will relax the latter assumption later, but this is momentarily more instructive as it leads to contiguous partitions of the explanatory variable $ X_{t} $. Finally, define the bins as quantiles of $ x $ and consider the procedure with bin partitions equal to 2, 5, 15, and 30, respectively.
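A sketch of quantile-bin local averaging (Python/NumPy; the function name is our own):

```python
import numpy as np

def local_average_quantile_bins(Y, x, n_bins):
    """Average Y within quantile bins of x; return bin midpoints and bin means."""
    edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1))     # quantile bin edges
    bin_id = np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)  # bin index of each x
    midpoints = 0.5 * (edges[:-1] + edges[1:])                    # middle of each bin
    averages = np.array([Y[bin_id == b].mean() for b in range(n_bins)])
    return midpoints, averages

# e.g. estimate the curve with 2, 5, 15 and 30 bins, as in Figure 5:
# for n in (2, 5, 15, 30):
#     mid, avg = local_average_quantile_bins(Y, x, n)
```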
Figure 5: Local Averaging with Quantiles
Figure 6: Local Averaging with Quantiles and Outliers
It's worth pointing out here that unlike sieve estimation, which can suffer from infeasibility as sieve length increases, local estimation in principle places no limit on how finely we define the bins. Nevertheless, as is evident from the visuals, while increasing the number of bins will reduce bias, it will also increase variance. In other words, smoothness is sacrificed for accuracy. This is of course the bias-variance tradeoff, and it is precisely the mechanism by which fine-tuning the estimator is possible.
Localized Kernel Regression
The idea of local averaging can be extended to accommodate various bin types and sizes. The most popular approaches leverage information about the points at which estimates of $ m(x) $ are desired. For instance, if estimates of $ m(x) $ are desired at a set of points $ \left(x_{1}, \ldots, x_{J} \right) $, then the estimate $ \widehat{m}(x_{j}) $ can be the average of $ Y_{t} $ for each point $ X_{t} $ in some neighborhood of $ x_{j} $ for $ j=1,\ldots, J $. In other words, bins are defined as neighborhoods centered around the points $ x_{j} $, with the size of each neighborhood determined by some distance metric. Then, to gain control over the bias-variance tradeoff, neighborhood size can be exploited with a penalization scheme. In particular, penalization introduces a weight function which disadvantages those $ X_{t} $ that are too far from $ x_{j} $ in any direction. In other words, those $ X_{t} $ close to $ x_{j} $ (in the neighborhood) are assigned larger weights, whereas those $ X_{t} $ far from $ x_{j} $ (outside the neighborhood) are weighted down.

Formally, when the function $ m(\cdot) $ is univariate, local kernel estimators solve optimization problems of the form: \begin{align} \arg\!\min_{\hspace{-1em} \beta_{0}} E\left[\left(Y_{t} - \beta_{0}\right)^{2}K_{h}\left(X_{t} - x_{j}\right)\right] \quad \forall j \in \left\{1, \ldots, J\right\}\label{eq.1.7} \end{align} Here we use the traditional notation $ K_{h}(X_{t} - x_{j}) \equiv K\left(\frac{|X_{t} - x_{j}|}{h}\right) $ where $ K(\cdot) $ is a distributional weight function, otherwise known as a kernel, $ |\cdot| $ denotes a distance metric (typically Euclidean), $ h $ denotes the size of the local neighbourhood (bin), otherwise known as a bandwidth, and $ \beta_{0} \equiv \beta_{0}(x_{j}) $ due to its dependence on the evaluation point $ x_{j} $.
To gain further insight, it is easiest to think of $ K(\cdot) $ as a probability density function with support on $ [-1,1] $. For instance, consider the famous Epanechnikov kernel: $$ K(u) = \frac{3}{4}\left(1 - u^{2}\right) \quad \text{for} \quad |u| \leq 1 $$ or the cosine kernel specified by: $$ K(u) = \frac{\pi}{4}\cos(\frac{\pi}{2}u) \quad \text{for} \quad |u| \leq 1 $$
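Written out as code (a sketch), both kernels integrate to one on $ [-1,1] $ and return zero outside that support:

```python
import numpy as np

def epanechnikov(u):
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1, 0.75 * (1.0 - u ** 2), 0.0)

def cosine_kernel(u):
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1, (np.pi / 4.0) * np.cos(np.pi * u / 2.0), 0.0)
```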
To demonstrate the mechanics, consider a kernel estimator based on the $ k- $nearest neighbouring points, or the weighted $ k-NN $ estimator. In particular, this estimator defines the neighbourhood of an evaluation point $ x_{j} $ as all points $ X_{t} $ whose distance to $ x_{j} $ is no greater than the distance of the $ k^{\text{th}} $ nearest point $ X_{t} $ to the same evaluation point $ x_{j} $. When used in the optimization problem \eqref{eq.1.7}, the resulting estimator is also sometimes referred to as LOWESS - LOcally WEighted Scatterplot Smoothing.
The algorithm used in the demonstration is relatively simple. First, define $ k^{\star} $ as the number of neighbouring points to be considered and define a grid $ \mathcal{X} \equiv \{x_{1}, \ldots, x_{J}\} $ of points at which an estimate of $ m(\cdot) $ is desired. Next, define a kernel function $ K(\cdot) $. Finally, for each $ j \in \{1, \ldots, J\} $, execute the following steps (a code sketch of the full procedure follows the list):
- For each $ t \in \{1,\ldots, T\} $, compute $ d_{t} = |X_{t} - x_{j}| $ -- the Euclidean distance between $ X_{t} $ and $ x_{j} $.
- Order the $ d_{t} $ in ascending order to form the ordered set $ \{d_{(1)} \leq d_{(2)} \leq \ldots \leq d_{(T)}\} $.
- Set the bandwidth as $ h = d_{(k^{\star})} $.
- For each $ t \in \{1,\ldots, T\} $, compute a weight $ w_{t} \equiv K_{h}(X_{t} - x_{j}) $.
- Solve the optimization problem: $$ \arg\!\min_{\hspace{-1em} \beta_{0}} \sum_{t=1}^{T}\left(Y_{t} - \beta_{0}\right)^{2}w_{t} $$ to derive the parameter estimate: $$ \widehat{m}(x_{j}) \equiv \widehat{\beta}_{0}(x_{j}) = \frac{\sum_{t=1}^{T}w_{t}Y_{t}}{\sum_{t=1}^{T}w_{t}} $$
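Putting the five steps together, a sketch of the weighted $ k-NN $ (local constant) estimator might look as follows (Python/NumPy; the Epanechnikov kernel and function names are our own choices). Note that with a kernel supported on $ [-1,1] $, the $ k^{\text{th}} $ neighbour itself receives weight zero, so in practice $ k $ should be at least two.

```python
import numpy as np

def epanechnikov(u):
    return np.where(np.abs(u) <= 1, 0.75 * (1.0 - np.asarray(u, dtype=float) ** 2), 0.0)

def knn_local_constant(Y, X, x_grid, k, kernel=epanechnikov):
    """Estimate m(.) on x_grid using the k nearest neighbours of each evaluation point."""
    Y, X = np.asarray(Y, dtype=float), np.asarray(X, dtype=float)
    m_hat = np.empty(len(x_grid))
    for j, xj in enumerate(x_grid):
        d = np.abs(X - xj)                     # step 1: distances to the evaluation point
        h = np.sort(d)[k - 1]                  # steps 2-3: bandwidth = k-th smallest distance
        w = kernel(d / h)                      # step 4: kernel weights K_h(X_t - x_j)
        m_hat[j] = np.sum(w * Y) / np.sum(w)   # step 5: weighted local average
    return m_hat
```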
For instance, suppose $ m(x) $ is the curve defined in \eqref{eq.1.6}, the evaluation grid $ \mathcal{X} $ consists of points in the interval $ [-6,6] $, and $ K(\cdot) $ is the Epanechnikov kernel. Furthermore, suppose $ Y_{t} = m(x) + 0.5\epsilon_{t} $ and $ X_{t} = x - 0.5 + \eta_{t} $. Notice that we're back to treating the regressor as a stochastic variable. Then, the $ k-NN $ estimator of $ m(\cdot) $ with 15, 40, 100, and 200 nearest neighbour points, respectively, is illustrated below.
Figure 8: k-NN Regression
Bandwidth Selection
While we will cover optimal bandwidth selection in greater detail in Part II of this series, it is not difficult to draw similarities between the role of bandwidth size in local estimation and sieve length in global methods. In fact, similar methods for optimal bandwidth selection exist in the context of local kernel regression and, analogous to sieve methods, are also typically grid searches. In this regard, in order to avoid complicated theoretical discourse, consider momentarily the optimization problem in \eqref{eq.1.7}. It is not difficult to demonstrate that the estimator $ \widehat{\beta}_{0}(x) $ satisfies: \begin{align*} \widehat{\beta}_{0}(x) &= \frac{T^{-1}\sum_{t=1}^{T}K_{h}\left(X_{t} - x\right)Y_{t}}{T^{-1}\sum_{t=1}^{T}K_{h}\left(X_{t} - x\right)}\\ &=\frac{1}{T}\sum_{t=1}^{T}\left(\frac{K_{h}\left(X_{t} - x\right)}{T^{-1}\sum_{i=1}^{T}K_{h}\left(X_{i} - x\right)}\right)Y_{t} \end{align*} Accordingly, if $ h\rightarrow 0 $, then $ \frac{K_{h}\left(X_{t} - x\right)}{T^{-1}\sum_{i=1}^{T}K_{h}\left(X_{i} - x\right)} \rightarrow T $ and the weight is nonzero only at $ x = X_{t} $. In other words, as the bandwidth approaches zero, $ \widehat{\beta}_{0}(X_{t}) \rightarrow Y_{t} $, and the estimator is effectively an interpolation of the data. Naturally, this estimator has very small bias since it reproduces every data point in $ Y_{t} $, but also has very large variance for the same reason.
Alternatively, should $ h \rightarrow \infty $, then $ \frac{K_{h}\left(X_{t} - x\right)}{T^{-1}\sum_{i=1}^{T}K_{h}\left(X_{i} - x\right)} \rightarrow 1 $ for all values of $ x $, and $ \widehat{\beta}_{0}(x) \rightarrow T^{-1}\sum_{t=1}^{T}Y_{t} $. That is, $ \widehat{\beta}_{0}(x) $ is a constant function equal to the mean of $ Y_{t} $, and therefore has zero variance, but suffers from very large modelling bias since it ignores all variation in the data around that mean.
Between these two extremes is an entire spectrum of models $ \left\{\mathcal{M}_{h} : h \in \left(0, \infty\right) \right\} $ ranging from the most complex $ \mathcal{M}_{0} $, to the least complex $ \mathcal{M}_{\infty} $. In other words, the bandwidth parameter $ h $ governs model complexity. Thus, the optimal bandwidth selection problem selects an $ h^{\star} $ to generate a model $ \mathcal{M}_{h^{\star}} $ best suited for the data under consideration. In other words, it reduces to the classical bias-variance tradeoff.
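A small numerical check of the two limiting cases (a sketch; the simulated data are an illustrative assumption): with a tiny bandwidth the local constant estimator essentially reproduces the observed $ Y_{t} $ at the sample points, while with a huge bandwidth it collapses to the sample mean of $ Y_{t} $.

```python
import numpy as np

def epanechnikov(u):
    return np.where(np.abs(u) <= 1, 0.75 * (1.0 - np.asarray(u, dtype=float) ** 2), 0.0)

def local_constant(Y, X, x, h):
    w = epanechnikov((X - x) / h)
    return np.sum(w * Y) / np.sum(w)

rng = np.random.default_rng(0)
X = rng.uniform(-6, 6, 200)
Y = np.sin(X) + 0.5 * rng.standard_normal(200)
x0 = X[0]                                    # evaluate at one of the sample points

print("Y at x0:             ", Y[0])
print("estimate with h=1e-3:", local_constant(Y, X, x0, h=1e-3))   # ~ reproduces Y at x0
print("estimate with h=1e6: ", local_constant(Y, X, x0, h=1e6))    # ~ sample mean of Y
print("sample mean of Y:    ", Y.mean())
```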
To demonstrate these principles, we close this section by returning to the leave-one-out cross-validation procedure discussed earlier. As a matter of fact, the algorithm also applies to local kernel regression, and we demonstrate it here in the context of $ k-NN $ regression, also discussed earlier.
In particular, define a search grid $ \mathcal{K} \equiv \{k_{min}, \ldots, k_{max}\} $ for the number of neighbouring points, select a kernel function $ K(\cdot) $, and iterate the following steps over $ k \in \mathcal{K} $ (a code sketch of the full search follows the list):
- For each observation $ t^{\star} \in \left\{1, \ldots, T \right\} $:
- For each $ t \neq t^{\star} \in \{1,\ldots, T\} $, compute $ d_{t \neq t^{\star}} = |X_{t} - X_{t^{\star}}| $.
- Order the $ d_{t \neq t^{\star}} $ in ascending order to form the ordered set $ \{d_{t \neq t^{\star} (1)} \leq d_{t \neq t^{\star} (2)} \leq \ldots \leq d_{t \neq t^{\star} (T-1)}\} $.
- Set the bandwidth as $ h_{\setminus t^{\star}} = d_{t \neq t^{\star} (k)} $.
- For each $ t \neq t^{\star} \in \{1,\ldots, T\} $ , compute a weight $ w_{_{\setminus t^{\star}}t} \equiv K_{h_{\setminus t^{\star}}}(X_{t} - X_{t^{\star}}) $.
- Solve the optimization problem: $$ \arg\!\min_{\hspace{-1em} \beta_{0}} \sum_{t\neq t^{\star}}\left(Y_{t} - \beta_{0}\right)^{2}w_{_{\setminus t^{\star}}t} $$ to derive the parameter estimate: $$ \widehat{m}_{k,\setminus t^{\star}}(X_{t^{\star}}) \equiv \widehat{\beta}_{_{k,\setminus t^{\star}}0}(X_{t^{\star}}) = \frac{\sum_{t\neq t^{\star}}w_{_{\setminus t^{\star}}t}Y_{t}}{\sum_{t\neq t^{\star}}w_{_{\setminus t^{\star}}t}} $$ where we use the subscript $ k,\setminus t^{\star} $ to denote explicit dependence on the number of neighbouring points $ k $ and the dropped observation $ t^{\star} $.
- Derive the forecast error for the dropped observation as follows: $$ e_{_{k}t^{\star}} \equiv Y_{t^{\star}} - \widehat{m}_{k,\setminus t^{\star}}(X_{t^{\star}}) $$
- Derive the cross-validation mean squared error when using $ k $ nearest neighbouring points : $$ MSE_{k} = \frac{1}{T}\sum_{t=1}^{T} e_{_{k}t}^{2} $$
- Determine the optimal number of neighbouring points $ k^{\star} $ as the value that minimizes $ MSE_{k} $ across $ \mathcal{K} $. In other words, $$ k^{\star} = \arg\!\min_{\hspace{-1em}k\in\mathcal{K}} MSE_{k} $$
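A sketch of this search (Python/NumPy; the helper names are our own, and $ k \geq 2 $ is assumed so that at least one neighbour receives positive weight under a compactly supported kernel):

```python
import numpy as np

def epanechnikov(u):
    return np.where(np.abs(u) <= 1, 0.75 * (1.0 - np.asarray(u, dtype=float) ** 2), 0.0)

def loocv_knn(Y, X, k_grid, kernel=epanechnikov):
    """Return (k_star, mse_by_k) from a leave-one-out search over k_grid."""
    Y, X = np.asarray(Y, dtype=float), np.asarray(X, dtype=float)
    T = len(Y)
    mse = {}
    for k in k_grid:
        errors = np.empty(T)
        for t_star in range(T):
            keep = np.arange(T) != t_star                  # drop observation t_star
            d = np.abs(X[keep] - X[t_star])                # distances to the dropped point
            h = np.sort(d)[k - 1]                          # bandwidth = k-th smallest distance
            w = kernel(d / h)                              # kernel weights
            errors[t_star] = Y[t_star] - np.sum(w * Y[keep]) / np.sum(w)
        mse[k] = np.mean(errors ** 2)                      # cross-validation MSE_k
    k_star = min(mse, key=mse.get)                         # k minimizing MSE_k
    return k_star, mse
```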
Figure 9: k-NN Regression with Optimized k
Conclusion
Given the recent introduction of functional coefficient estimation in EViews 11, our aim in this multi-part blog series is to complement this feature release with a theoretical and practical overview. As a first step in this regard, we've dedicated this Part I of the series to gently introducing readers to the principles of nonparametric estimation, and have illustrated them using EViews programs. In particular, we've covered the principles of sieve and kernel estimation, as well as optimal sieve length and bandwidth selection. In Part II, we'll extend the principles discussed here and cover the theory underlying functional coefficient estimation in greater detail.

Files
The workfile and program files can be downloaded here.

References
- Peter Craven and Grace Wahba. Estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik, 31:377–403, 1979.
- Ulf Grenander. Abstract inference. Technical report, 1981.
- Ker-Chau Li. Asymptotic optimality for $C_p$, $C_L$, cross-validation and generalized cross-validation: Discrete index set. The Annals of Statistics, 15(3):958–975, 1987.
- Colin L. Mallows. Some comments on $C_p$. Technometrics, 15(4):661–675, 1973.
- Mervyn Stone. Cross-validation and multinomial prediction. Biometrika, 61(3):509–515, 1974.