Monday, May 13, 2019

Functional Coefficient Estimation: Part I (Nonparametric Estimation)

Recently, EViews 11 introduced several new nonparametric techniques. One of those features is the ability to estimate functional coefficient models. To help familiarize users with this important technique, we're launching a multi-part blog series on nonparametric estimation, with a particular focus on the theoretical and practical aspects of functional coefficient estimation. Before delving into the subject matter however, in this Part I of the series, we give a brief and gentle introduction to some of the most important principles underlying nonparametric estimation, and illustrate them using EViews programs.

Table of Contents

  1. Nonparametric Estimation
  2. Global Methods
    1. Optimal Sieve Length
    2. Critiques
  3. Local Methods
    1. Localized Kernel Regression
    2. Bandwidth Selection
  4. Conclusion
  5. Files
  6. References

Nonparametric Estimation

Traditional least squares regression is parametric in nature. It confines relationships between the dependent variable $ Y_{t} $ and independent variables (regressors) $ X_{1,t}, X_{2,t}, \ldots $ to be, in expectation, linear in the parameter space. For instance, if the true data generating process (DGP) for $ Y_{t} $ derives from $ p $ regressors, the least squares regression model postulates that: $$ Y_{t} = m(x_{1}, \ldots, x_{p}) \equiv E(Y_t | X_{1,t} = x_{1}, \ldots, X_{p,t} = x_{p}) = \beta_0 + \sum_{k=1}^{p}{\beta_k x_{k}} $$ Since this relationship holds only in expectation, a statistically equivalent form of this statement is: \begin{align} Y_t &= m\left(X_{1,t}, \ldots, X_{p,t}\right) + \epsilon_{t} \nonumber \\ &=\beta_0 + \sum_{k=1}^{p}{\beta_k X_{k,t}} + \epsilon_t \label{eq.1.1} \end{align} where the error term $ \epsilon_{t} $ has mean zero, and parameter estimates are solutions to the minimization problem: $$ \arg\!\min_{\hspace{-1em}\beta_{0}, \ldots, \beta_{p}} E\left(Y_{t} - \beta_0 + \sum_{k=1}^{p}{\beta_k X_{k,t}}\right)^{2} $$ Nevertheless, while this framework is typically sufficient for most applications, and is obviously very appealing and intuitive, when the true but unknown DGP is in fact non-linear, inference is rendered unreliable.

On the other hand, nonparametric modelling prefers to remain agnostic about functional forms. Relationships are, in expectation, simply functionals $ m(\cdot) $, and if the true DGP for $ Y_{t} $ is a function of $ p $ regressors, then: $$ Y_t = m\left(X_{1,t}, \ldots, X_{p,t}\right) + \epsilon_{t} $$ Here, estimators of $ m(\cdot) $ can generally be cast as minimization problems of the form: \begin{align} \arg\!\min_{\hspace{-1em} m\in \mathcal{M}} E\left(Y_{t} - m\left(X_{1,t}, \ldots, X_{p,t}\right)\right)^{2} \label{eq.1.2} \end{align} where $ \mathcal{M} $ is now a function space. In this regard, a nonparametric estimator can be thought of as a solution to a search problem over functions as opposed to parameters.

The problem in \eqref{eq.1.2}, however, is infeasible. It turns out the function space is effectively uncountable. In fact, even if arguing to the contrary, solutions would be unidentified since different functions in $ \mathcal{M} $ can map to the same range. Accordingly, general practice is to reduce $ \mathcal{M} $ to a lower dimensional countable space and optimize over it. This typically implies a reduction of the problem to a parametric framework so that the problem in \eqref{eq.1.2} is cast into: \begin{align} \arg\!\min_{\hspace{-1em} m\in \mathcal{M}} E\left(Y_{t} - h\left(X_{1,t}, \ldots, X_{p,t}; \mathbf{\Theta} \right)\right)^{2} \label{eq.1.3} \end{align} where $ h(\cdot; \mathbf{\Theta}) \in \mathcal{H} $ is a function with associated parameters $ \mathbf{\Theta} \in \mathbf{R}^{q} $ and $ \mathcal{H} $ is a function space which is dense in $ \mathcal{M} $; formally, $ h^{\star} \in \mathcal{H} \rightarrow m^{\star} \in \mathcal{M} $ where $ \rightarrow $ denotes asymptotic convergence. Recall that this means that any feasible estimate $ h^{\star} $ must become arbitrarily close to the unfeasible estimate $ m^{\star} $ as the space $ \mathcal{H} $ grows to asymptotic equivalence with $ \mathcal{M} $. In this regard, nonparametric estimators are typically classified into either global or local kinds.

Global Methods

Global estimators, generally synonymous with the class of sieve estimators introduced by Grenander (1981), approximate arbitrary functions by simpler functions which are uniformly dense in the target space $ \mathcal{M} $. A particularly important class of such estimators are linear sieves which are constructed as linear combinations of popular basis functions. The latter include Bernstein polynomials, Chebychev polynomials, Hermite polynomials, Fourier series, polynomial splines, B-splines, and wavelets. Formally, when the function $ m(\cdot) $ is univariate, linear sieves assume the following general structure: \begin{align} \mathcal{H}_{J} = \left\{h \in \mathcal{M}: h(x; \mathbf{\Theta}) = \sum_{j=1}^{J}\theta_{k}f_{j}(x)\right\} \label{eq.1.4} \end{align} where $ \theta_{j} \in \mathbf{\Theta} $, $ f_{j}(\cdot) $ is one of the aforementioned basis functions, and $ J \rightarrow \infty$.

For instance, if the sieve exploits the Stone-Weierstrass' Approximation Theorem which claims that any continuously differentiable function over a compact interval, can be uniformly approximated on that interval by a polynomial to any degree, then $ f_{j}(x) = x^{j-1} $. In particular, if the unknown function of interest is $ m(x) $, then choosing to approximate the latter with a polynomial of degree $ J = J^{\star} < \infty $ (some integer), reduces the problem in \eqref{eq.1.4} to: $$ \arg\!\min_{\hspace{-1em} m\in \mathcal{M}} E\left(Y_{t} - \theta_{0} + \sum_{j=1}^{J^{\star}}\theta_{j}X_{t}^{j} \right)^{2} $$ where $ Y_{t} $ are the values we observe from the theoretical function $ m(x) $, and $ X_{t} $ is the regressor we're using to estimate it. Usual least squares now yields $ \widehat{\theta}_{j} $ for $ j=1,\ldots, J^{\star} $. Furthermore, $ m(x) $ can be approximated as $$ m(x) \approx \widehat{\theta}_{0} + \sum_{j=1}^{J^{\star}}\widehat{\theta}_{j}x^{j} $$ where $ x $ is evaluated on some grid $ [a,b] $, where it can have arbitrary length, or even on the original regressor values so that $ x \equiv X_{t} $.

To demonstrates the procedure, define the true but unknown function $ m(x) $ as: \begin{align} m(x) = \sin(x)\cos(\frac{1}{x}) + \log\left(x + \sqrt{x^2+1}\right) \quad x \in [-6,6]\label{eq.1.5} \end{align} Furthermore, generate observable data from $ m(x) $ as $ Y_{t} = m(x) + 0.5\epsilon_{t} $ and generate the regressor data as $ X_{t} = x - 0.5 + \eta_{t} $ where $ \epsilon_{t} $ and $ \eta_{t} $ are mutually independent respectively standard normal and standard uniform random variables. Estimation is now summarized for polynomial degrees 1, 5, and 15, respectively.

Figure 1: Polynomial Sieve Estimation

Alternatively, if the sieve exploits Hermite polynomials, one can construct the Gaussian sieve which reduces the problem in \eqref{eq.1.4} to: $$ \arg\!\min_{\hspace{-1em} m\in \mathcal{M}} E\left(Y_{t} - \theta_{0} + \sum_{j=1}^{J^{\star}}\theta_{j}\phi(X_{t})H_{j}(X_{t}) \right)^{2} $$ where $ \phi(\cdot) $ is the standard normal density and $ H_{j}(\cdot) $ are Hermite polynomials of degree $ j $. The figure below demonstrates the procedure using sieve lengths 1, 3, and 10, respectively.

Figure 2: Gaussian Sieve Estimation

Clearly, both sieve estimators are very similar. So how does one select an optimal sieve? There really isn't a prescription for such optimization. Each sieve has its advantages and disadvantages, but the general rule of thumb is to choose a sieve that most closely resembles the function of interest $ m(\cdot) $. For instance, if the function is polynomial, then using a polynomial sieve is probably best. Alternatively, if the function is expected to be smooth and concentrated around its mean, a Gaussian sieve will work well. On the other hand, the question of optimal sieve length lends itself to more concrete advice.

Optimal Sieve Length

Given the examples explored above, it is evident that sieve length plays a major role in fitting accuracy. For instance, estimation with a low sieve length resulted in severe underfitting, while a higher sieve length resulted in better fit. The question of course is whether an optimal length can be determined.

Li et. al. (1987) studied three well-known procedures, all of which are based on the mean squared forecast error of the estimated function over a search grid $ \mathcal{J} \equiv \left\{J_{min},\ldots, J_{max}\right\} $, and all of which are asymptotically equivalent. In particular, let $ J^{\star} $ the optimal sieve length and consider:

  1. $ C_{p} $ method due to Mallows (1973): $$ J^{\star} = \min_{J \in \mathcal{J}} \frac{1}{T}\sum_{t=1}^{T}\left(Y_{t} - \widehat{m}(X_{t})\right)^{2} - 2\widehat{\sigma}^{2}\frac{J}{T} $$ where $ \widehat{\sigma}^{2} = \frac{1}{n}\sum_{t=1}^{T}\left(Y_{t} - \widehat{m}(X_{t})\right)^{2}$
  2. Generalized cross-validation method due to Craven and Wahba (1979): $$ J^{\star} = \min_{J \in \mathcal{J}} \frac{1}{(1 - (J/2))^{2}T}\sum_{t=1}^{T}\left(Y_{t} - \widehat{m}(X_{t})\right)^{2} $$
  3. Leave-one-out cross validation method due to Stone (1974): $$ J^{\star} = \min_{J \in \mathcal{J}} \frac{1}{T}\sum_{t=1}^{T}\left(Y_{t} - \widehat{m}_{\setminus t^{\star}}(X_{t})\right)^{2} $$ where the subscript notation $ \setminus t^{\star} $ indicates estimation after dropping observation $ t^{\star} $.
Here we discuss the algorithm for the last of the three procedures. In particular, with the search grid $ \mathcal{J} $ defined as before, iterate the following steps over $ J \in \mathcal{J} $:
  1. For each observation $ t^{\star} \in \left\{1, \ldots, T \right\} $:
    1. Solve the optimization problem in \eqref{eq.1.4} using data from the pair $ (Y_{t}, X_{t})_{t \neq t^{\star}} $, and derive the estimated model as follows: $$ \widehat{m}_{J,\setminus t^{\star}}(x) \equiv \widehat{\theta}_{_{J,\setminus t^{\star}}0} + \sum_{j=1}^{J}\widehat{\theta}_{_{J,\setminus t^{\star}}j}f_{j}(x)$$ where the subscript $ J,\setminus t^{\star} $ indicates that parameters are estimated using sieve length $ J $, after dropping observation $ t^{\star} $.
    2. Derive the forecast error for the dropped observation as follows: $$ e_{_{J}t^{\star}} \equiv Y_{t^{\star}} - \widehat{m}_{J,\setminus t^{\star}}(X_{t^{\star}}) $$
  2. Derive the cross-validation mean squared error for sieve length $ J $ as follows: $$ MSE_{J} = \frac{1}{T}\sum_{t=1}^{T} e_{_{J}t}^{2} $$
  3. Determine the optimal sieve length $ J^{\star} $ as the minimum $ MSE_{J} $ across $ \mathcal{J} $. In other words $$ J^{\star} = \min_{J\in\mathcal{J}} MSE_{J} $$
In words, the algorithm moves across the sieve search grid $ \mathcal{J} $ and computes an out-of-sample forecast error for each observation. The optimal sieve length is that which minimizes the average mean squared error across the search grid. We demonstrate the selection criteria and accompanied estimation when using a grid search from 1 to 15.

Figure 3: Sieve Regression with Optimized Sieve Length Selection

Evidently, both the polynomial and Gaussian sieve models ought to use a sieve length of 15.


While global nonparametric estimators are easy to work with, they exhibit several well recognized drawbacks. First, they leave little room for fine-tuning estimation. For instance, in the case of polynomial sieves, the polynomial degree is not continuous. In other words, if estimation underfits when sieve length is $ J $, but overfits when sieve length is $ J+1 $, then there is no polynomial degree $ J < J^{\star} < J+1 $.

Second, global estimators are often subject to infeasibility since regressor values may not be sufficiently small. This is because increased sieve lengths can result in the values of the regressor covariance matrix to become extremely large. In turn, this can render the inverse of the covariance matrix nearly singular, and by extension, render estimation infeasible. In other words, at some point, increasing the polynomial degree further does not lead to estimate improvements.

Lastly, it is worth pointing out that global estimators fit curves by smoothing (averaging) over the entire domain. As such, they can have difficulties handling observations with strong influences such as outliers and regime switches. This is due to the fact that outlying observations will be averaged with the rest of the data, resulting in a curve that significantly under- or over- fits these observations. To illustrate this point, consider a modification of equation \eqref{eq.1.5} with outliers when $ -1 < x \leq 1 $ : \begin{align} m(x) = \begin{cases} \sin(x)\cos(\frac{1}{x}) + \log\left(x + \sqrt{x^2+1}\right) & \text{if } x\in [-6,1]\\ \sin(x)\cos(\frac{1}{x}) + \log\left(x + \sqrt{x^2+1}\right) + 4 & \text{if } x \in (-1,1]\\ \sin(x)\cos(\frac{1}{x}) + \log\left(x + \sqrt{x^2+1}\right) - 2 & \text{if } x \in (1,6] \end{cases}\label{eq.1.6} \end{align} We generate $ Y_{t} $ and $ X_{t} $ as before, and estimate this model using both polynomial and Gaussian sieves based on cross-validated sieve length selection.

Figure 4: Sieve Regression with Optimized Sieve Length Selection and Outliers

Clearly, both procedures have a difficult time handling jumps in the domain region $ -1 < x \leq 1 $. Nevertheless, it is evident that the Gaussian sieve does significantly better than polynomial regression. This is further corroborated by the leave-one-out cross-validation MSE values which indicate that the Gaussian sieve minimum MSE is roughly 6 times as small as the polynomial sieve minimum MSE.

It turns out that a number of these shortcomings can be mitigated by averaging locally instead of globally. In this regard, we turn to the idea of local estimation next.

Local Methods

The general idea behind local nonparametric estimators is local averaging. The procedure partitions the functional variable $ x $ into bins of a particular size, and estimates $ m(x) $ as a linear interpolation of the average values of the dependent variable at the middle of each bin. We demonstrate the procedure when $ m(x) $ is the function in \eqref{eq.1.5}.

In particular, define $ Y_{t} $ as before, but let $ X_{t} = x $. In other words, we consider deterministic regressors. We will relax the latter assumption later, but this is momentarily more instructive as it leads to contiguous partitions of the explanatory variable $ X_{t} $. At last, define the bins as quantiles of $ x $ and consider the procedure with bin partitions equal to 2, 5, 15, and 30, respectively.

Figure 5: Local Averaging with Quantiles

Clearly, when the number of bins is 2, the estimate is a straight line and severely underfits the objective function. Nevertheless, as the number of bins increases, so does the accuracy of the estimate. Indeed, local estimation here is shown to be significantly more accurate than global estimation used earlier on the same function $ m(x) $. This is of course a consequence of local averaging which performs piecemeal smoothing on only those observations restricted to each bin. Naturally, high leverage observations and outliers are better accommodated as they are averaged only with those observations in the immediate vicinity which also fall in the same bin. In fact, we can demonstrate this using the function $ m(x) $ in \eqref{eq.1.6}.

Figure 6: Local Averaging with Quantiles and Outliers

Evidently, increasing the number of bins leads to increasingly better adaptation to the presence of outlying observations.

It's worth pointing out here that unlike sieve estimation which can suffer from infeasibility with increased sieve length, in local estimation, there is in principle no limit to how finely we wish to define the bin width. Nevertheless, as is evident from the visuals, while increasing the number of bins will reduce bias, it will also introduce variance. In other words, smoothness is sacrificed at the expense of accuracy. This is of course the bias-variance tradeoff and is precisely the mechanism by which fine-tuning the estimator is possible.

Localized Kernel Regression

The idea of local averaging can be extended to accommodate various bin types and sizes. The most popular approaches leverage information of the points at which estimates of $ m(x) $ are desired. For instance, if estimates of $ m(x) $ are desired at a set of points $ \left(x_{1}, \ldots, x_{J} \right) $, then the estimate $ \widehat{m}(x_{j}) $ can be the average of $ Y_{t} $ for each point $ X_{t} $ in some neighborhood of $ x_{j} $ for $ j=1,\ldots, J $. In other words, bins are defined as neighborhoods centered around the points $ x_{j} $, with the size of the neighborhood determined by some distance metric. Then, to gain control over the bias-variance tradeoff, neighborhood size can be exploited with a penalization scheme. In particular, penalization introduces a weight function which disadvantages those $ X_{t} $ that are too far from $ x_{j} $ in any direction. In other words, those $ X_{t} $ close to $ x_{j} $ (in the neighborhood) are assigned larger weights, whereas those $ X_{t} $ far from $ x_{j} $ (outside the neighborhood) are weighed down.

Formally, when the function $ m(\cdot) $ is univariate, local kernel estimators solve optimization problems of the form: \begin{align} \arg\!\min_{\hspace{-1em} \beta_{0}} E\left(Y_{t} - \beta_{0}\right)^{2}K_{h}\left(X_{t} - x_{j}\right) \quad \forall j \in \left\{1, \ldots, J\right\}\label{eq.1.7} \end{align} Here we use the traditional notation $ K_{h}(X_{t} - x_{j}) \equiv K\left(\frac{|X_{t} - x_{j}|}{h}\right) $ where $ K(\cdot) $ is a distributional weight function, otherwise known as a kernel, $ |\cdot| $ denotes a distance metric (typically Euclidean), $ h $ denotes the size of the local neighbourhood (bin), otherwise known as a bandwidth, and $ \beta_{0} \equiv \beta_{0}(x_{j}) $ due to its dependence on the evaluation point $ x_{j} $.

To gain further insight, it is easiest to think of $ K(\cdot) $ as a probability density function with support on $ [-1,1] $. For instance, consider the famous Epanechnikov kernel: $$ K(u) = \frac{3}{4}\left(1 - u^{2}\right) \quad \text{for} \quad |u| \leq 1 $$ or the cosine kernel specified by: $$ K(u) = \frac{\pi}{4}\cos(\frac{\pi}{2}u) \quad \text{for} \quad |u| \leq 1 $$

Figure 7A: Epanechnikov Kernel

Figure 7B: Cosine Kernel

Now, if $ |X_{t} - x| > h $, it is clear that $ K(\cdot) = 0 $. In other words, if the distance between $ X_{t} $ and $ x $ is larger than the bandwidth (neighborhood size), then $ X_{t} $ lies outside the neighborhood and its importance will be weighed down to zero. Alternatively, if $ |X_{t} - x| = 0 $, then $ X_{t} = x $ and $ X_{t} $ will be assigned the highest weight, which in the case of the Epanechnikov and cosine kernels, is 0.75 and 0.8, respectively

To demonstrate the mechanics, consider a kernel estimator based on $ k- $nearest neighbouring points, or the weighted $ k-NN $ estimator. In particular, this estimator defines the neighbourhood as all points $ X_{t} $, the distance of which to an evaluation point $ x_{j} $, are no greater than the distance of the $ k^{\text{th}} $ nearest point $ X_{t} $ to the same evaluation point $ x_{j} $. When used in the optimization problem \eqref{eq.1.7}, the resulting estimator is also sometimes referred to as LOWESS - LOcally Weighted Estimated Scatterplot Smoothing.

The algorithm used in the demonstration is relatively simple. First, define $ k^{\star} $ as the number of neighbouring points to be considered and define a grid $ \mathcal{X} \equiv \{x_{1}, \ldots, x_{J}\} $ of points at which an estimate of $ m(\cdot) $ is desired. Next, define a kernel function $ K(\cdot) $. Finally, for each $ j \in \{1, \ldots, J\}, $, execute the following:
  1. For each $ t \in \{1,\ldots, T\} $, compute $ d_{t} = |X_{t} - x_{j}| $ -- the Euclidean distance between $ X_{t} $ and $ x_{j} $.
  2. Order the $ d_{t} $ in ascending order to form the ordered set $ \{d_{(1)} \leq d_{(2)} \leq \ldots \leq d_{(T)}\} $.
  3. Set the bandwidth as $ h = d_{(k^{\star})} $.
  4. For each $ t \in \{1,\ldots, T\} $, compute a weight $ w_{t} \equiv K_{h}(X_{t} - x_{j}) $.
  5. Solve the optimization problem: $$ \arg\!\min_{\hspace{-1em} \beta_{0}} E\left(Y_{t} - \beta_{0}\right)^{2}w_{t} $$ to derive the parameter estimate: $$ \widehat{m}_{\setminus t^{\star}}(x_{j}) \equiv \widehat{\beta}_{0}(x_{j}) = \frac{\sum_{t=1}^{T}w_{t}Y_{t}}{\sum_{t=1}^{T}w_{t}} $$
An estimate of $ m(x) $ along the domain $ \mathcal{X} $, is now the linear interpolation of the points $ \{\widehat{\beta}_{0}(x_{1}), \ldots, \widehat{\beta}_{0}(x_{J})\} $.

For instance, suppose $ m(x) $ is the curve defined in \eqref{eq.1.6}, the evaluation grid $ \mathcal{X} $ consists of points in the interval $ [-6,6] $, and $ K(\cdot) $ is the Epanechnikov kernel. Furthermore, suppose $ Y_{t} = m(x) + 0.5\epsilon_{t} $ and $ X_{t} = x - 0.5 + \eta_{t} $. Notice that we're back to treating the regressor as a stochastic variable. Then, the $ k-NN $ estimator of $ m(\cdot) $ with 15, 40, 100, and 200 nearest neighbour points, respectively, is illustrated below.

Figure 8: k-NN Regression

Clearly, the estimator can be very adaptive to the nuances of outlying points but can suffer from both underfitting and overfitting. In this regard, observe that the number of neighbouring points is directly proportional to neighbourhood (bandwidth) size. In other words, as the number of neighbouring points increases, the bandwidth increases. This is evidenced by a very volatile estimator when the number of neighbouring points is 15, and a significantly smoother estimator when th number of neighbouring points is 200. Therefore, there must be some optimal middle ground between undersmoothing and oversmoothing. In general, notice that apart from the lower zero bound, the bandwidth is not bounded above. Thus, there is an extensive range of bandwidth possibilities. So how does one define what constitutes an optimal bandwidth?

Bandwidth Selection

While we will cover optimal bandwidth selection in greater detail in Part II of this series, it is not difficult to draw similarities between the role of bandwidth size in local estimation and sieve length in global methods. In fact similar methods for optimal bandwidth selection exist in the context of local kernel regression, and analogous to sieve methods, are also typically grid searches. In this regard, in order to avoid complicated theoretical discourse, consider momentarily the optimization problem in \eqref{eq.1.7}.

It is not difficult to demonstrate that the estimator $ \widehat{\beta}_{0}(x) $ satisfies: \begin{align*} \widehat{\beta}_{0}(x) &= \frac{T^{-1}\sum_{t=1}^{T}K_{h}\left(X_{t} - x\right)Y_{t}}{T^{-1}\sum_{t=1}^{T}K_{h}\left(X_{t} - x\right)}\\ &=\frac{1}{T}\sum_{t=1}^{T}\left(\frac{K_{h}\left(X_{t} - x\right)}{T^{-1}\sum_{i=1}^{T}K_{h}\left(X_{i} - x\right)}\right)Y_{t} \end{align*} Accordingly, if $ h\rightarrow 0 $, then $ \frac{K_{h}\left(X_{t} - x\right)}{T^{-1}\sum_{i=1}^{T}K_{h}\left(X_{i} - x\right)} \rightarrow T $ and is only defined on $ x = X_{t} $. In other words, as the bandwidth approaches zero, $ \widehat{\beta}_{0}(x) \equiv \widehat{\beta}_{0}(X_{t}) \rightarrow Y_{t} $, and the estimator is effectively an interpolation of the data. Naturally, this estimator has very small bias since it picks up every data point in $ Y_{t} $, but also has very large variance for the same reason.

Alternatively, should $ h \rightarrow \infty $, then $ \frac{K_{h}\left(X_{t} - x\right)}{T^{-1}\sum_{i=1}^{T}K_{h}\left(X_{i} - x\right)} \rightarrow 1 $ for all values of $ x $, and $ \widehat{\beta}_{0}(x) \rightarrow T^{-1}\sum_{t=1}^{T}Y_{t} $. That is, $ \widehat{\beta}_{0}(x) $ is a constant function equal to the mean of $ Y_{t} $, and therefore has zero variance, but suffers from very large modelling bias since it picks up only those points equal to the average.

Between these two extremes is an entire spectrum of models $ \left\{\mathcal{M}_{h} : h \in \left(0, \infty\right) \right\} $ ranging from the most complex $ \mathcal{M}_{0} $, to the least complex $ \mathcal{M}_{\infty} $. In other words, the bandwidth parameter $ h $ governs model complexity. Thus, the optimal bandwidth selection problem selects an $ h^{\star} $ to generate a model $ \mathcal{M}_{h^{\star}} $ best suited for the data under consideration. In other words, it reduces to the classical bias-variance tradeoff.

To demonstrate certain principles, we close this section by returning to the leave-one-out cross-validation procedure discussed earlier. As a matter of fact, the algorithm also applies to local kernel regression and we do so in the context of $ k-NN $ regression, also discussed earlier.

In particular, define a search grid $ \mathcal{K} \equiv \{k_{min}, \ldots, k_{max}\} $ of the number of neighbouring points, select a kernel function $ K(\cdot) $, and iterate the following steps over $ k \in \mathcal{K} $:
  1. For each observation $ t^{\star} \in \left\{1, \ldots, T \right\} $:
    1. For each $ t \neq t^{\star} \in \{1,\ldots, T\} $, compute $ d_{t \neq t^{\star}} = |X_{t} - X_{t^{\star}}| $.
    2. Order the $ d_{t \neq t^{\star}} $ in ascending order to form the ordered set $ \{d_{t \neq t^{\star} (1)} \leq d_{t \neq t^{\star} (2)} \leq \ldots d_{t \neq t^{\star} (T-1)}\} $.
    3. Set the bandwidth as $ h_{\setminus t^{\star}} = d_{t \neq t^{\star} (k)} $.
    4. For each $ t \neq t^{\star} \in \{1,\ldots, T\} $ , compute a weight $ w_{_{\setminus t^{\star}}t} \equiv K_{h_{\setminus t^{\star}}}(X_{t} - X_{t^{\star}}) $.
    5. Solve the optimization problem: $$ \arg\!\min_{\hspace{-1em} \beta_{0}} E\left(Y_{t} - \beta_{0}\right)^{2}w_{_{\setminus t^{\star}}t} $$ to derive the parameter estimate: $$ \widehat{m}_{k,\setminus t^{\star}}(X_{t^{\star}}) \equiv \widehat{\beta}_{_{k,\setminus t^{\star}}0}(X_{t^{\star}}) = \frac{\sum_{t\neq t^{\star}}^{T}w_{_{\setminus t^{\star}}t}Y_{t}}{\sum_{t\neq t^{\star}}^{T}w_{_{\setminus t^{\star}}t}} $$ where we use the subscript $ k,\setminus t^{\star} $ to denote explicit dependence on the number of neighbouring points $ k $ and the dropped observation $ t^{\star} $.
    6. Derive the forecast error for the dropped observation as follows: $$ e_{_{k}t^{\star}} \equiv Y_{t^{\star}} - \widehat{m}_{k,\setminus t^{\star}}(X_{t^{\star}}) $$
  2. Derive the cross-validation mean squared error when using $ k $ nearest neighbouring points : $$ MSE_{k} = \frac{1}{T}\sum_{t=1}^{T} e_{_{k}t}^{2} $$
  3. Determine the optimal number of neighbouring points $ k^{\star} $ as the minimum $ MSE_{k} $ across $ \mathcal{K} $. In other words $$ k^{\star} = \min_{k\in\mathcal{K}} MSE_{k} $$
We close this section and blog entry with an illustration of the procedure. In particular, we again consider the function in \eqref{eq.1.6}, and use the cosine kernel to search for the optimal number of neighbouring points over the search grid $ \mathcal{K} \equiv \{40, \ldots, 80\} $.

Figure 9: k-NN Regression with Optimized k


Given the recent introduction of functional coefficient estimation in EViews 11, our aim in this multi-part blog series is to complement this feature release with a theoretical and practical overview. As first step in this regard, we've dedicated this Part I of the series to gently introducing readers to the principles of nonparametric estimation, and illustrated them using EViews programs. In particular, we've covered principles of sieve and kernel estimation, as well as optimal sieve length and bandwidth selection. In Part II, we'll extend the principles discussed here and cover the theory underlying functional coefficient estimation in greater detail.


The workfile and program files can be downloaded here.


  1. Peter Craven and Grace Wahba. Estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik, 31:377–403, 1979.
  2. Ulf Grenander. Abstract inference. Technical report, 1981.
  3. Ker-Chau Li and others. Asymptotic optimality for csub>p, csub>l , cross-validation and generalized cross-validation: Discrete index set. The Annals of Statistics, 15(3):958–975, 1987.
  4. Colin L Mallows. Some comments on c p. Technometrics, 15(4):661–675, 1973.
  5. Mervyn Stone. Cross-validation and multinomial prediction. Biometrika, 61(3):509–515, 1974.

No comments:

Post a Comment