Table of Contents
- Introduction to Wavelets
- Wavelet Transforms
- Practical Considerations
- Wavelet Thresholding
- Conclusion
- References
Introduction to Wavelets
What characterizes most economic time series are time-varying features such as non-stationarity, volatility, seasonality, and structural discontinuities. Wavelet analysis is a natural framework for analyzing these phenomena without imposing simplifying assumptions such as stationarity. In particular, wavelet filters can decompose and reconstruct a time series (as well as its correlation structure) across timescales so that constituent elements at one scale are uncorrelated with those at another. This is clearly useful in isolating features which materialize only at certain timescales.
Wavelet analysis is also, in many respects, like Fourier spectral analysis. Both methods represent a time series in a different space by re-expressing the signal as a linear combination of basis functions. In Fourier analysis, these basis functions are sines and cosines. While such basis functions approximate global variation well, they are poorly adapted to capturing local variation, otherwise known as time variation in time series analysis. To see this, observe that trigonometric basis functions are sinusoids of the form: $$ R\cos\left(2\pi(\omega t + \phi)\right) $$ where $ R $ is the amplitude, $ \omega $ is the frequency (in cycles per unit time) with associated period $ \frac{1}{\omega} $ (in units of time), and $ \phi $ is the phase. Accordingly, if the time variable $ t $ is shifted and scaled to $ u = \frac{t - a}{b} $, so that $ t = a + bu $, the associated sinusoid becomes: $$ R\cos\left(2\pi(\omega^{\star} u + \phi^{\star})\right) $$ where $ \omega^{\star} = \omega b $ and $ \phi^{\star} = \phi + \omega a $.
Evidently, the amplitude $ R $ is invariant to shifts in location and scale. Furthermore, notice that if $ b > 1 $, the frequency $ \omega^{\star} $ increases while the time variable $ u $ is compressed, and vice versa. Accordingly, frequency information is gained when time information is lost, and vice versa.
Ultimately, trigonometric functions are ideally adapted to stationary processes characterized by impulses which wane with time, but are poorly adapted to discontinuous, non-linear, and non-stationary processes whose impulses persist and evolve with time. To surmount this fixed time-frequency relationship, a new set of basis functions is needed.
In contrast to Fourier transforms, wavelet transforms rely on a reference basis function called the mother wavelet. The latter is stretched (scaled) and shifted (translated) across time to capture time-dependent features. Thus, wavelet basis functions are localized both in scale and in time. In this sense, the scale of a wavelet basis function is the analogue of frequency in Fourier transforms. The fact that the wavelet basis function is also shifted across time implies that wavelet analysis is similar in spirit to performing a Fourier transform on moving, overlapping windows of the entire time series signal.
In particular, the mother wavelet function $ \psi(t) $ is any function satisfying: $$ \int_{-\infty}^{\infty} \psi(x) dx = 0 \qquad\qquad \int_{-\infty}^{\infty} \psi(x)^{2} dx = 1 $$ In other words, wavelets are functions that have mean zero and unit energy. Here, the term energy originates from the signal processing literature and is formalized as $ \int_{-\infty}^{\infty} |f(t)|^{2} dt $ for some function $ f(t) $. In fact, for non-complex, mean-zero functions the concept is interchangeable with the idea of variance.
From the mother wavelet, the wavelet basis functions are now derived as: $$ \psi_{a,b}(t) = \frac{1}{\sqrt{b}}\psi\left(\frac{t - a}{b}\right) $$ where $ a $ is the location constant, whereas $ b $ is the scaling factor which corresponds to the notion of frequency in Fourier analysis. Observe further that the analogue of the amplitude $ R $ in Fourier analysis, here captured by the term $ \frac{1}{\sqrt{b}} $, is in fact a function of the scale $ b $. Accordingly, wavelet basis functions will adapt to scale-dependent phenomena much better than their trigonometric counterparts.
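As a quick sanity check of these definitions, the following Python snippet (a minimal sketch using NumPy; the Haar mother wavelet used here is formally introduced below) verifies numerically that the Haar function has mean zero and unit energy, and that the $ \frac{1}{\sqrt{b}} $ factor preserves unit energy after shifting and scaling:

```python
import numpy as np

def haar(t):
    """Haar mother wavelet: +1 on [0, 0.5), -1 on [0.5, 1), 0 elsewhere."""
    return np.where((t >= 0) & (t < 0.5), 1.0,
                    np.where((t >= 0.5) & (t < 1.0), -1.0, 0.0))

def psi_ab(t, a, b):
    """Wavelet basis function psi_{a,b}(t) = b^{-1/2} * psi((t - a) / b)."""
    return haar((t - a) / b) / np.sqrt(b)

t = np.linspace(-5.0, 5.0, 1_000_001)
dt = t[1] - t[0]
print(haar(t).sum() * dt)                         # ~0: zero mean
print((haar(t) ** 2).sum() * dt)                  # ~1: unit energy
print((psi_ab(t, a=1.0, b=2.0) ** 2).sum() * dt)  # ~1: energy preserved
```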
Since wavelet basis functions are de facto location and scale transformations of a single function, they are also an ideal tool for multiresolution analysis (MRA) -- the ability to analyze a signal at different frequencies with varying resolutions. In fact, MRA is in some sense the inverse of the wavelet transform. It can derive representations of the original time-series data using only those features which are characteristic at a given timescale. For instance, a highly noisy but persistent time series can be decomposed into a portion which represents only the noise (features captured at high frequencies), and a portion which represents only the persistent signal (features captured at low frequencies). Thus, moving along the time domain, MRA allows one to zoom to a desired level of detail such that high (low) frequencies yield good (poor) time resolution and poor (good) frequency resolution. Since economic time series often exhibit multiscale features, wavelet techniques can effectively decompose these series into constituent processes associated with different timescales.
Wavelet Transforms
In the context of continuous functions, the continuous wavelet transform (CWT) of a time series $ y(t) $ is defined as: $$ W(a, b) = \int_{-\infty}^{\infty} y(t)\psi_{a,b}(t) \,dt $$ Moreover, up to a normalizing constant that depends on $ \psi $, the inverse transformation reconstructing the original process is given as: $$ y(t) = \int_{-\infty}^{\infty} \int_{0}^{\infty} W(a,b)\psi_{a,b}(t) \,da \,db $$ See Percival and Walden (2000) for a detailed discussion.
Since continuous functions are rarely observed, the CWT is rarely exploited empirically, and a discretized analogue known as the discrete wavelet transform (DWT) is used instead. In its most basic form, the series length $ T = 2^{M} $ for $ M \geq 0 $ is assumed dyadic (a power of 2), and the DWT manifests as a collection of CWT slices at nodes $ (a, b) \equiv (a_{k}, b_{j}) $ such that $ a_{k} = 2^{j}k $ and $ b_{j} = 2^{j} $ where $ j = 1, \ldots, M $. In other words, the discrete wavelet basis functions assume the form: $$ \psi_{k,j}(t) = 2^{-j/2}\psi\left( 2^{-j}t - k \right) $$ Unlike the CWT, which is highly redundant in both location and scale, the DWT can be designed as an orthonormal transformation. If the location discretization is restricted to the index $ k = 1, \ldots, 2^{-j}T $ at each scale $ \lambda_{j} = 2^{j - 1} $, half the available observations are lost at each scale in exchange for orthonormality. This is the classical DWT framework. Alternatively, if the location index ranges over the full set of available observations, $ k = 1, \ldots, T $, the discretized transform is no longer orthonormal, but does not suffer from observation loss. The latter framework is typically referred to as the maximal overlap discrete wavelet transform (MODWT), and sometimes as the non-decimated DWT. Since the DWT is formally characterized by wavelet filters, we devote some time to those next.
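To illustrate the time localization the CWT provides, here is a brief sketch using the PyWavelets package (assuming `pywt` is installed; the Morlet wavelet and the scale grid are arbitrary choices for illustration):

```python
import numpy as np
import pywt

# A signal whose dominant period changes halfway through
t = np.arange(512)
y = np.where(t < 256, np.sin(2 * np.pi * t / 32), np.sin(2 * np.pi * t / 64))

# CWT with a Morlet wavelet over a range of scales
scales = np.arange(1, 65)
coefs, freqs = pywt.cwt(y, scales, "morl")
print(coefs.shape)  # (64, 512): one row of time-localized coefficients per scale
```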
Discrete Wavelet Filters
Formally, the DWT is characterized via $h = \rbrace{h_{0}, \ldots, h_{L-1}}$ and $g = \rbrace{ g_{0}, \ldots, g_{L-1} }$ -- the wavelet (high-pass) and scaling (low-pass) filters of length $L$, respectively, for some $ L \geq 1 $. Recall that low- and high-pass filters are defined in the context of frequency response functions, otherwise known as transfer functions, which are Fourier transforms of impulse response functions. Whereas the impulse response function describes, in the time domain, the evolution (response) of a time series signal to a given stimulus (impulse), the transfer function describes that same response in the frequency domain. In this regard, when the magnitude of the transfer function, otherwise known as the gain function, is large at low frequencies and small at high frequencies, the associated filter is said to be a low-pass filter. Conversely, when the gain function is small at low frequencies but large at high frequencies, the transfer function is associated with a high-pass filter.
Like traditional time series filters which are used to extract features (e.g. trends, seasonalities, business cycles, noise), wavelet filters perform a similar role. They are designed to capture low and high frequencies, and have a particular length. This length governs how much of the original series information is used to extract low- and high-frequency phenomena. This is very similar to the role of the autoregressive (AR) order in traditional time series models, where higher AR orders imply that more historical observations influence the present.
The simplest and shortest wavelet filter is of length $ L = 2 $ and is called the Haar wavelet. Formally, it is characterized by its high-pass filter definition: \begin{align*} h_{l} = \begin{cases} \frac{1}{\sqrt{2}} \quad \text{if} \quad l = 0\\ \frac{-1}{\sqrt{2}} \quad \text{if} \quad l = 1 \end{cases} \end{align*} This is a sequence of rescaled rectangular functions and is therefore ideally suited to analyzing signals with sudden and discontinuous changes. In this regard, it is ideally suited for outlier detection. Unfortunately, this filter is typically too simple for most other applications.
To help mitigate the limitations of the Haar filter, Daubechies (1992) introduced a family of filters (known as daublets) of even length that are indexed by their number of vanishing moments, which determines the polynomial degree they are able to capture: a filter with $ p $ vanishing moments can capture polynomial behavior up to degree $ p - 1 $. Thus, the Haar filter, which is of length 2 and has a single vanishing moment, can only capture constants. The Daubechies wavelet filter of length 4 has two vanishing moments and can capture constant and linear behavior, and so on. Accordingly, higher filter lengths are associated with higher smoothness. Unlike the Haar filter, which has a closed-form expression in the time domain, the Daubechies family of wavelet filters has a closed-form expression only in the frequency domain.
Unfortunately, Daubechies filters are typically not symmetric. If a more symmetric version of the daublet filters is required, then the class known as least asymmetric, or symmlets, is used. The latter define a family of wavelet filters which are as close to symmetric as possible.
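In practice, these filters are rarely typed in by hand. As a hedged illustration, the following Python sketch pulls the Haar, daublet, and symmlet filter coefficients from the PyWavelets package (note that PyWavelets indexes daublets and symmlets by the number of vanishing moments, so `db2` and `sym4` denote the length-4 and length-8 filters, respectively) and verifies the zero-mean and unit-energy properties of the high-pass filter:

```python
import numpy as np
import pywt

for name in ["haar", "db2", "db4", "sym4"]:
    w = pywt.Wavelet(name)
    h = np.asarray(w.dec_hi)  # wavelet (high-pass) filter
    g = np.asarray(w.dec_lo)  # scaling (low-pass) filter
    print(f"{name}: L = {w.dec_len}, "
          f"sum(h) = {h.sum():+.2e}, "         # ~0: zero mean
          f"energy(h) = {(h ** 2).sum():.6f}")  # ~1: unit energy
```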
Mallat's Pyramid Algorithm
In practice, DWT coefficients are derived through the pyramid algorithm of Mallat (1989). In the case of the classical DWT with $T=2^{M}$, let $\mathbf{y} = \series{y}{t}{1}{T}$ and define $\mathbf{W} = \sbrace{\mathbf{W}_{1}, \ldots, \mathbf{W}_{M}, \mathbf{V}_{M}}^{\top}$ as the matrix of DWT coefficients. Here, $\mathbf{W}_{j}$ is a vector of wavelet coefficients of length $T/2^{j}$ and is associated with changes on a scale of length $\lambda_{j} = 2^{j-1}$. Moreover, $\mathbf{V}_{M}$ is a vector of scaling coefficients of length $T/2^{M}$ and is associated with averages on a scale of length $\lambda_{M} = 2^{M-1}$. $\mathbf{W}$ now follows from $\mathbf{W} = \mathcal{W}\mathbf{y}$ where $\mathcal{W}$ is some $T\times T$ orthonormal matrix generating the DWT coefficients. The algorithm can now be formalized as follows.
If $\mathbf{W}_{j} = \rbrace{W_{1,j} \ldots W_{T/2^{j},j}}^{\top}$ and $\mathbf{V}_{j} = \rbrace{V_{1,j} \ldots V_{T/2^{j},j}}^{\top}$, the $j^{th}$ iteration of the algorithm convolves an input signal with the filters $h$ and $g$ to derive the $j^{th}$ level DWT matrix $\sbrace{\mathbf{W}_{1}, \ldots \mathbf{W}_{j}, \mathbf{V}_{j}}^{\top}$. Explicitly, the convolution is formalized as: \begin{align*} W_{t,1} &= \xsum{l}{0}{L-1}{h_{l}y_{2t-l\hspace{-5pt}\mod T}} && V_{t,1} = \xsum{l}{0}{L-1}{g_{l} y_{2t-l\hspace{-5pt}\mod T}} && j=1\\ W_{t,j} &= \xsum{l}{0}{L-1}{h_{l} V_{2t-l\hspace{-5pt}\mod T_{j-1},j-1}} && V_{t,j} = \xsum{l}{0}{L-1}{g_{l} V_{2t-l\hspace{-5pt}\mod T_{j-1},j-1}} && j=2,\ldots,M \end{align*} where $t=1,\ldots,T/2^{j}$ and $T_{j-1} \equiv T/2^{j-1}$ is the length of the stage-$(j-1)$ scaling coefficient vector. Each iteration therefore convolves the scaling coefficients from the preceding iteration, namely $V_{t,j-1}$, with both the high- and low-pass filters, and the input signal in the first iteration is $y_{t}$. The entire algorithm continues until the $M^{th}$ iteration, although it can be stopped earlier.
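The following is a minimal NumPy sketch of the pyramid algorithm for the classical DWT with circular filtering (a didactic implementation rather than an optimized one; in practice a library routine would be used). It also checks the energy-preservation property discussed further below:

```python
import numpy as np

def dwt_pyramid(y, h, g):
    """Full DWT via Mallat's pyramid algorithm with circular filtering.

    y : input series of dyadic length T = 2^M
    h, g : wavelet (high-pass) and scaling (low-pass) filters of length L
    Returns ([W_1, ..., W_M], V_M).
    """
    y = np.asarray(y, dtype=float)
    T = y.size
    M = int(np.log2(T))
    assert 2 ** M == T, "input length must be dyadic"
    h = np.asarray(h, dtype=float)
    g = np.asarray(g, dtype=float)
    L = h.size
    W, V = [], y
    for _ in range(M):
        N = V.size                 # current input length (halves each level)
        Wj = np.zeros(N // 2)
        Vj = np.zeros(N // 2)
        for t in range(N // 2):    # 0-based version of t = 1, ..., N/2
            for l in range(L):
                idx = (2 * t + 1 - l) % N   # circular index: 2t - l mod N
                Wj[t] += h[l] * V[idx]
                Vj[t] += g[l] * V[idx]
        W.append(Wj)
        V = Vj
    return W, V

# Haar filters from the previous section
h = np.array([1.0, -1.0]) / np.sqrt(2.0)
g = np.array([1.0, 1.0]) / np.sqrt(2.0)

rng = np.random.default_rng(0)
y = rng.standard_normal(16)
W, VM = dwt_pyramid(y, h, g)

# Energy preservation: ||y||^2 = sum_j ||W_j||^2 + ||V_M||^2
lhs = (y ** 2).sum()
rhs = sum((Wj ** 2).sum() for Wj in W) + (VM ** 2).sum()
print(np.allclose(lhs, rhs))  # True
```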
In effect, at each scale, the DWT algorithm partitions the frequency spectrum of its input into two equal halves -- the low and high frequencies. At the first scale, low-frequency phenomena of the original signal $ \mathbf{y} $ are captured by $ \mathbf{V}_{1} $, whereas high-frequency phenomena are captured by $ \mathbf{W}_{1} $. At scale 2, the same procedure is performed not on the original time series signal, but on the low-frequency components $ \mathbf{V}_{1} $. This in turn generates $ \mathbf{V}_{2} $, which captures those phenomena residing in the first quarter of the frequency spectrum, as well as $ \mathbf{W}_{2} $ -- the high-frequency components at scale 2, or those phenomena residing in the second quarter of the frequency range. This recursive halving of the remaining low-frequency band continues as the scale increases. In this regard, increasing the scale isolates increasingly persistent (lower frequency) features of the original time-series signal, with the wavelet coefficients $ \mathbf{W}_{j} $ capturing the remaining ``noisy'' features at each scale.
Boundary Conditions
It's important to note that both the DWT and the MODWT make use of circular filtering. When a filtering operation reaches the beginning or end of an input series, otherwise known as the boundaries, the filter treats the input time series as periodic with period $ T $. In other words, we assume that $ y_{T}, y_{T-1}, \ldots $ are useful surrogates for the unobserved values $ y_{0}, y_{-1}, \ldots $. The wavelet coefficients affected by this assumption are known as boundary coefficients. Note that the number of boundary coefficients depends on the filter length $ L $ (and the level $ j $) but, apart from being capped by $ T $, is independent of the input series length; it increases with the filter length $ L $. In particular, the number of boundary coefficients for the DWT and MODWT, respectively, is given by: \begin{align*} \kappa_{\text{DWT}, j} &\equiv L_{j}^{\prime}\\ \kappa_{\text{MODWT}, j} &\equiv \min \cbrace{L_{j}, T} \end{align*} where $ L_{j}^{\prime} = \left\lceil (L - 2)\rbrace{1 - \frac{1}{2^{j}}} \right\rceil $ and $ L_{j} = (2^{j} - 1)(L - 1) + 1 $ is the level-$j$ equivalent filter length. Furthermore, both DWT and MODWT boundary coefficients appear at the beginning of $ \mathbf{W}_{j} $ and $ \mathbf{V}_{j} $. Refer to Percival and Walden (2000) for further details.
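These counts are easy to compute. A small sketch, taking the formulas above as given ($ L_{j} $ denotes the level-$j$ equivalent filter length):

```python
import math

def n_boundary_dwt(L, j):
    """DWT boundary coefficients at level j: L'_j = ceil((L - 2)(1 - 2^-j))."""
    return math.ceil((L - 2) * (1.0 - 2.0 ** (-j)))

def n_boundary_modwt(L, j, T):
    """MODWT boundary coefficients at level j: min(L_j, T)."""
    Lj = (2 ** j - 1) * (L - 1) + 1  # level-j equivalent filter length
    return min(Lj, T)

# Example: a length-8 filter on a series of T = 512 observations
for j in (1, 2, 3, 4):
    print(j, n_boundary_dwt(8, j), n_boundary_modwt(8, j, 512))
```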
Variance Decomposition
The orthonormality of the DWT generating matrix $\mathcal{W}$ has important implications. First, $\mathcal{W}^{\top}\mathcal{W} = I_{T}$, the identity matrix of dimension $T$. More importantly, $\norm{\mathbf{y}}^{2} = \norm{\mathbf{W}}^{2}$. To see this, recall that $\mathbf{y} = \mathcal{W}^{\top}\mathbf{W}$, so that $\norm{\mathbf{y}}^{2} = \mathbf{y}^{\top}\mathbf{y} = \mathbf{W}^{\top}\mathcal{W}\mathcal{W}^{\top}\mathbf{W} = \mathbf{W}^{\top}\mathbf{W} = \norm{\mathbf{W}}^{2}$. The DWT is therefore an energy (variance) preserving transformation. Coupled with this preservation of energy is also the decomposition of energy on a scale-by-scale basis. The latter formalizes as: \begin{align} \norm{\mathbf{y}}^{2} = \xsum{j}{1}{M}{\norm{\mathbf{W}_{j}}^{2}} + \norm{\mathbf{V}_{M}}^{2} \label{eq2.5.1} \end{align} where $\norm{\mathbf{W}_{j}}^{2} = \xsum{t}{1}{T/2^{j}}{W^{2}_{t,j}}$ and $\norm{\mathbf{V}_{M}}^{2} = \xsum{t}{1}{T/2^{M}}{V^{2}_{t,M}}$. Thus, $\norm{\mathbf{W}_{j}}^{2}$ quantifies the energy of $ y_{t} $ accounted for at scale $\lambda_{j}$. This decomposition is known as the wavelet power spectrum (WPS) and is arguably the most insightful of the properties of the DWT.
The WPS bears a resemblance to the spectral density function (SDF) used in Fourier analysis. Whereas the SDF decomposes the variance of an input series across frequencies, in wavelet analysis, the variance of an input series is decomposed across scales $ \lambda_{j} $. One of the advantages of the WPS over the SDF is that the latter requires an estimate of the input series mean, whereas the former does not. In particular, note that the total variance in $ \mathbf{y} $ can be decomposed as: $$ \xsum{j}{1}{\infty}{\nu^{2}(\lambda_{j})} = \var(\mathbf{y}) $$ where $ \nu^{2}(\lambda_{j}) $ is the contribution to $ \var(\mathbf{y}) $ due to scale $ \lambda_{j} $ and is estimated as: $$ \hat{\nu}^{2}(\lambda_{j}) \equiv \frac{1}{T} \xsum{t}{1}{T}{W_{t,j}^{2}} $$ Note that $ \hat{\nu}^{2}(\lambda_{j}) $ is the energy of $ y_{t} $ at scale $ \lambda_{j} $ divided by the number of observations. Unfortunately, this estimator is biased due to the presence of boundary coefficients. To derive an unbiased estimate, boundary coefficients should be dropped from consideration. Accordingly, an unbiased estimate of the variance contributed at scale $ \lambda_{j} $ is given by: $$ \tilde{\nu}^{2}(\lambda_{j}) \equiv \frac{1}{M_{j}} \xsum{t}{\kappa_{j} + 1}{T}{W_{t,j}^{2}}$$ where $ M_{j} = T - \kappa_{j}$ and $ \kappa_{j} \equiv L_{j}^{\prime} $ when wavelet coefficients are derived using the DWT, whereas $ \kappa_{j} \equiv L_{j} $ when wavelet coefficients derive from the MODWT.
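A minimal sketch of the unbiased estimator, assuming the vector `Wj` holds the wavelet coefficients at scale $ \lambda_{j} $ and that the `kappa_j` boundary coefficients sit at its beginning, as noted above:

```python
import numpy as np

def wavelet_variance(Wj, kappa_j):
    """Unbiased variance contribution at one scale: drop the kappa_j boundary
    coefficients at the start, then average the squares of the rest."""
    interior = np.asarray(Wj, dtype=float)[kappa_j:]
    return (interior ** 2).sum() / interior.size
```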
It is also possible to derive confidence intervals for the contribution to the overall variance at each scale. In particular, working with the unbiased estimator $ \tilde{\nu}^{2}(\lambda_{j}) $ and a significance level $ \alpha \in (0,1) $, a confidence interval for $ \nu^{2}(\lambda_{j}) $ with coverage $ 1 - 2\alpha $ is given by: \begin{align*} \sbrace{\tilde{\nu}^{2}(\lambda_{j}) - \Phi^{-1}(1 - \alpha) \rbrace{\frac{2A_{j}}{M_{j}}}^{1/2} \quad ,\quad \tilde{\nu}^{2}(\lambda_{j}) + \Phi^{-1}(1 - \alpha) \rbrace{\frac{2A_{j}}{M_{j}}}^{1/2}} \end{align*} Above, $ A_{j} $ is the integral of the squared spectral density function of the wavelet coefficients $ \mathbf{W}_{j} $, excluding any boundary coefficients. As shown in Percival and Walden (2000), $ A_{j} $ can be estimated from the sum of squared serial covariances among $ \mathbf{W}_{j} $, excluding any boundary coefficients. In other words: $$ \hat{A}_{j} = \frac{\hat{s}_{j,0}^{2}}{2} + \xsum{\tau}{1}{M_{j} - 1}{\hat{s}_{j,\tau}^{2}} \quad \text{where} \quad \hat{s}_{j,\tau} = \frac{1}{M_{j}}\xsum{t}{\kappa_{j} + 1}{T - \tau}{W_{t,j}W_{t+\tau,j}} $$ Unfortunately, as argued in Priestley (1981), nothing prevents the lower bound of the confidence interval above from becoming negative. Accordingly, Percival and Walden (2000) suggest the approximation: $$ \frac{\eta \tilde{\nu}^{2}(\lambda_{j})}{\nu^{2}(\lambda_{j})} \stackrel{d}{=} \chi^{2}_{\eta} $$ where $ \eta $ is known as the equivalent degrees of freedom (EDOF) and is formalized as: $$ \eta = \frac{2 E\rbrace{\tilde{\nu}^{2}(\lambda_{j})}^{2}}{\var \rbrace{\tilde{\nu}^{2}(\lambda_{j})}} $$ The confidence interval of interest with coverage $ 1 - 2\alpha $ can now be stated as: \begin{align*} \sbrace{\frac{\eta \tilde{\nu}^{2}(\lambda_{j})}{Q_{\eta}(1 - \alpha)} \,,\, \frac{\eta \tilde{\nu}^{2}(\lambda_{j})}{Q_{\eta}(\alpha)}} \end{align*} where $ Q_{\eta}(\alpha) $ denotes the $ \alpha $-quantile of the $ \chi^{2}_{\eta} $ distribution.
Remaining is the issue of EDOF estimation. Percival and Walden (2000) offer two suggestions: \begin{align*} \eta_{1} &\equiv \frac{M_{j}\tilde{\nu}^{4}(\lambda_{j})}{\hat{A}_{j}}\\ \eta_{2} &\equiv \max \cbrace{2^{-j}M_{j} \, , \, 1} \end{align*} The first estimate relies on large sample theory and in practice requires a sample of at least $ T = 128 $ to yield a decent approximation. The second assumes that the SDF of the wavelet coefficients at scale $ \lambda_{j} $ is that of an ideal band-pass filter. See Percival and Walden (2000) for details.
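As a sketch, the chi-square interval above can be computed with SciPy. Here we use the band-pass EDOF $ \eta_{2} $, taking the unbiased estimate $ \tilde{\nu}^{2}(\lambda_{j}) $ and $ M_{j} $ as inputs (the function name and example numbers are illustrative):

```python
from scipy.stats import chi2

def wvar_confidence_interval(nu2_tilde, Mj, j, alpha=0.025):
    """Coverage 1 - 2*alpha interval for nu^2(lambda_j) via the chi-square
    approximation, using the band-pass EDOF eta_2 = max(M_j / 2^j, 1)."""
    eta = max(Mj / 2 ** j, 1.0)
    lower = eta * nu2_tilde / chi2.ppf(1.0 - alpha, eta)
    upper = eta * nu2_tilde / chi2.ppf(alpha, eta)
    return lower, upper

print(wvar_confidence_interval(0.8, Mj=250, j=3))
```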
Multiresolution Analysis
Similar to Fourier, spline, and linear approximations, a principal feature of the DWT is the ability to approximate an input series as a function of the wavelet basis functions. In wavelet theory this is known as multiresolution analysis (MRA) and refers to the approximation of an input series at each scale (and up to all scales) $ \lambda_{j} $.
To formalize matters, recall that $ \mathbf{W} = \mathcal{W}\mathbf{y} $ and partition the rows of $ \mathcal{W} $ commensurate with the row partition of $ \mathbf{W} $ into $ \mathbf{W}_{1}, \ldots, \mathbf{W}_{M} $ and $ \mathbf{V}_{M} $. In other words, let $ \mathcal{W} = \sbrace{\mathcal{W}_{1}, \ldots, \mathcal{W}_{M}, \mathcal{V}_{M}}^{\top} $, where $ \mathcal{W}_{j} $ has dimension $ 2^{-j}T \times T $ and $ \mathcal{V}_{M} $ has dimension $ 2^{-M}T \times T $. Then, note that for any $ m \in \cbrace{1, \ldots, M} $: \begin{align*} \mathbf{y} &= \mathcal{W}^{\top}\mathbf{W}\\ &= \xsum{j}{1}{m}{\mathcal{W}^{\top}_{j}\mathbf{W}_{j}} + \mathcal{V}^{\top}_{m}\mathbf{V}_{m}\\ &= \xsum{j}{1}{m}{\mathcal{D}_{j}} + \mathcal{S}_{m} \end{align*} where $ \mathcal{D}_{j} = \mathcal{W}^{\top}_{j} \mathbf{W}_{j} $ and $ \mathcal{S}_{m} = \mathcal{V}^{\top}_{m} \mathbf{V}_{m} $ are $ T- $dimensional vectors, respectively called the $ j^{\text{th}} $ level detail and $ m^{\text{th}} $ level smooth series. Furthermore, since the high-pass (low-pass) wavelet coefficients are associated with changes (averages) at scale $ \lambda_{j} $, the detail and smooth series are associated with changes and averages, respectively, at scale $ \lambda_{j} $ in the input series $ \mathbf{y} $.
The MRA is typically used to derive approximations of the original series from its lower and upper frequency components. Since upper frequency components are associated with transient features and are captured by the wavelet coefficients, the detail series extract those features of the original series which are typically associated with ``noise''. Conversely, since lower frequency components are associated with persistent features and are captured by the scaling coefficients, the smooth series extract those features of the original series which are typically associated with the ``signal''.
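A brief MRA sketch using PyWavelets (the `sym4` filter, level 4, and periodic boundary handling are illustrative assumptions): each detail and the smooth are reconstructed by zeroing out all coefficient blocks except one, and together they add back up to the original series:

```python
import numpy as np
import pywt

rng = np.random.default_rng(1)
t = np.arange(512)
y = np.sin(2 * np.pi * t / 128) + 0.3 * rng.standard_normal(t.size)

# Partial DWT to level 4; coeffs = [V4, W4, W3, W2, W1]
coeffs = pywt.wavedec(y, "sym4", mode="periodization", level=4)

# Smooth S4: keep only the scaling coefficients, zero every detail block
smooth = [coeffs[0]] + [np.zeros_like(c) for c in coeffs[1:]]
S4 = pywt.waverec(smooth, "sym4", mode="periodization")

# Details D4, ..., D1: keep one block of wavelet coefficients at a time
details = []
for k in range(1, len(coeffs)):
    only_k = [np.zeros_like(c) for c in coeffs]
    only_k[k] = coeffs[k]
    details.append(pywt.waverec(only_k, "sym4", mode="periodization"))

# MRA identity: y = sum of details + smooth
print(np.allclose(S4 + sum(details), y))  # True
```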
It's worth noting that because wavelet filtering can result in boundary coefficients, the detail and smooth series will have observations affected by them as well. The affected observations are: \begin{align*} \text{DWT} &\quad t = \begin{cases} 1, \ldots, 2^{j}L_{j}^{\prime} &\quad \text{lower portion}\\ T - \rbrace{L_{j} + 1 - 2^{j}} + 1, \ldots, T &\quad \text{upper portion} \end{cases}\\ \\ \text{MODWT} &\quad t = \begin{cases} 1, \ldots, L_{j} &\quad \text{lower portion}\\ T - L_{j} + 1, \ldots, T &\quad \text{upper portion} \end{cases} \end{align*}
Practical Considerations
The exposition above introduces the basic theory underlying wavelet analysis. Nevertheless, there are several practical (empirical) considerations which should be addressed. We focus here on three in particular:
- Wavelet filter selection
- Handling boundary conditions
- Non-dyadic series length adjustments
Choice of Wavelet Filter
The type of wavelet filter is typically chosen to mimic the data to which it is applied. Longer filters approximate the ideal band-pass filter better than shorter ones, but they also expose more coefficients to boundary-condition effects. If the data derive from piecewise constant functions, the Haar wavelet or other short wavelets may be more appropriate; alternatively, if the underlying data are smooth, longer filters may be preferable. Accordingly, the rule-of-thumb strategy is to use the filter with the smallest length that gives reasonable results. Furthermore, since the MODWT is not orthogonal and its wavelet coefficients are correlated, the choice of wavelet filter is not as vital as in the case of the orthogonal DWT. Nevertheless, if alignment in time is important (i.e. zero-phase filters), the least asymmetric family of filters may be a good choice.
Handling Boundary Conditions
As previously mentioned, wavelet filters exhibit boundary conditions due to circular recycling of observations. Although this may be an appropriate assumption for some series, such as those naturally exhibiting cyclical effects, it is not appropriate in all circumstances. In this regard, another popular approach is to reflect the original series to generate a series of length $ 2T $. In other words, wavelet filtering proceeds on the observations $ y_{1}, \ldots, y_{T}, y_{T}, y_{T-1}, \ldots, y_{1} $. In either case, any proper wavelet analysis ought, at the very least, to quantify how many wavelet coefficients are affected by boundary conditions.
Adjusting Non-dyadic Length Time Series
Recall that the DWT requires an input series of dyadic length. Naturally, this condition is rarely satisfied in practice. In this regard, there are two broad strategies: either shorten the input series to dyadic length at the expense of losing observations, or ``pad'' the input series with observations to achieve dyadic length. In the context of the latter strategy, although the choice of padding values is ultimately arbitrary, there are three popular choices, none of which has proven superior (a padding sketch follows the list):
- Pad with zeros
- Pad with mean
- Pad with median
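A minimal padding helper in Python (the function name and the right-side padding convention are illustrative choices; the original length is returned so the padding can be discarded after analysis):

```python
import numpy as np

def pad_to_dyadic(y, how="mean"):
    """Pad y on the right to the next power-of-two length with zeros/mean/median."""
    y = np.asarray(y, dtype=float)
    T = y.size
    target = 1 << (T - 1).bit_length()  # smallest power of 2 >= T
    fill = {"zeros": 0.0, "mean": y.mean(), "median": np.median(y)}[how]
    return np.concatenate([y, np.full(target - T, fill)]), T

padded, T = pad_to_dyadic(np.arange(100.0), how="median")
print(padded.size, T)  # 128 100
```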
Wavelet Thresholding
A key objective in any empirical work is to discriminate noise from useful information. In this regard, suppose the observed time series is $ y_{t} = x_{t} + \epsilon_{t} $, where $ x_{t} $ is an unknown signal of interest obscured by the presence of unwanted noise $ \epsilon_{t} $. Traditionally, signal discernment was achieved using discrete Fourier transforms. Naturally, this assumes that any signal is an infinite superposition of sinusoidal functions; a strong assumption in empirical econometrics, where most data exhibit unit roots, jumps, kinks, and various other non-linearities.
The principle behind wavelet-based signal extraction, otherwise known as wavelet shrinkage, is to shrink any wavelet coefficients not exceeding some threshold to zero, and then exploit the MRA to synthesize the signal of interest using the modified wavelet coefficients. In other words, only those wavelet coefficients associated with very pronounced spectra are retained, with the additional benefit of deriving a very sparse wavelet coefficient matrix.
To formalize the idea, let $ \mathbf{x} = \series{x}{t}{1}{T} $ and $ \mathbf{\epsilon} = \series{\epsilon}{t}{1}{T} $. Next, recall that the DWT can be represented as a $ T\times T $ orthonormal matrix $ \mathcal{W} $, yielding: $$ \mathbf{z} \equiv \mathcal{W}\mathbf{y} = \mathcal{W}\mathbf{x} + \mathcal{W}\mathbf{\epsilon} $$ where, if $ \mathbf{\epsilon} \sim N(0, \sigma^{2}_{\epsilon}I_{T}) $, orthonormality implies $ \mathcal{W}\mathbf{\epsilon} \sim N(0, \sigma^{2}_{\epsilon}I_{T}) $ as well. The idea now is to shrink to zero any coefficients of $ \mathbf{z} $ not surpassing a threshold.
Thresholding Rule
While there are several thresholding rules, by far the two most popular are the following (a code sketch follows the list):
- Hard Thresholding Rule (``kill/keep'' strategy), formalized as: $$ \delta_{\eta}^{H}(x) = \begin{cases} x \quad \text{if } |x| > \eta\\ 0 \quad \text{otherwise} \end{cases} $$
- Soft Thresholding Rule, formalized as: $$ \delta_{\eta}^{S}(x) = \sign(x)\max\cbrace{0 \,,\, |x| - \eta} $$
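Both rules are one-liners in practice. A minimal NumPy sketch (function names are illustrative):

```python
import numpy as np

def threshold_hard(w, eta):
    """Keep coefficients whose magnitude exceeds eta; kill the rest."""
    w = np.asarray(w, dtype=float)
    return np.where(np.abs(w) > eta, w, 0.0)

def threshold_soft(w, eta):
    """Shrink magnitudes toward zero by eta: sign(w) * max(|w| - eta, 0)."""
    w = np.asarray(w, dtype=float)
    return np.sign(w) * np.maximum(np.abs(w) - eta, 0.0) + 0.0  # +0.0 clears -0.0

w = np.array([-3.0, -0.5, 0.2, 1.5])
print(threshold_hard(w, 1.0))  # [-3.   0.   0.   1.5]
print(threshold_soft(w, 1.0))  # [-2.   0.   0.   0.5]
```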
Optimal Threshold
The threshold value $ \eta $ is key to wavelet shrinkage. In particular, optimal thresholding is achieved when $ \eta = \sigma_{\epsilon} $ where $ \sigma_{\epsilon} $ is the standard deviation of the noise process $ \mathbf{\epsilon} $. In this regard, several threshold strategies have emerged over the years (code sketches for the universal and FDR thresholds follow the list):
- Universal Threshold, proposed in Donoho and Johnstone (1994) and formalized as:
$$ \eta^{\text{U}} = \hat{\sigma}_{\epsilon} \sqrt{2\log(T)} $$
where $ \hat{\sigma}_{\epsilon} $ is estimated using wavelet coefficients only at scale $ \lambda_{1} $, regardless of what scale is under consideration. When this threshold rule is coupled with soft thresholding, the combination is commonly referred to as VisuShrink.
- Adaptive Universal Threshold is identical to the universal threshold above, but estimates $ \hat{\sigma}_{\epsilon} $ using those wavelet coefficients associated with the scale under consideration. In other words:
$$ \eta^{\text{AU}} = \hat{\sigma}_{\epsilon, j} \sqrt{2\log(T)} $$
where $ \hat{\sigma}_{\epsilon, j} $ is the standard deviation of the wavelet coefficients at scale $ \lambda_{j} $.
- Minimax Estimation, proposed in Donoho and Johnstone (1994), is formalized as the solution to:
$$ \inf_{\hat{\mathbf{x}}}\sup_{\mathbf{x}} R(\hat{\mathbf{x}}, \mathbf{x}) $$
where $ R(\hat{\mathbf{x}}, \mathbf{x}) $ denotes the risk (expected loss) of the estimate $ \hat{\mathbf{x}} $. Unfortunately, a closed-form solution is not available, although tabulated values exist. Furthermore, when this threshold is coupled with soft thresholding, the combination is commonly referred to as RiskShrink.
- Stein's Unbiased Risk Estimate (SURE), formalized as the solution to:
$$ \min_{\hat{\mathbf{\mu}}} \norm{\mathbf{\mu} - \hat{\mathbf{\mu}}}^{2} $$
where $ \mathbf{\mu} = (\mu_{1}, \ldots, \mu_{s})^{\top} $ and $ \mu_{k} $ is the mean of some variable of interest $ q_{k} \sim N(\mu_{k}, 1) $, for $ k = 1, \ldots, s $. In the framework of wavelet coefficients, $ q_{k} $ would represent the standardized wavelet coefficients at a given scale.
Furthermore, while the optimal threshold $ \eta $ based on this rule depends on the thresholding rule used, the solution may not be unique, in which case the SURE threshold value is taken to be the minimum such $ \eta $. For the soft thresholding rule, the solution was proposed in Donoho and Johnstone (1995). Alternatively, for the hard thresholding rule, the solution was proposed in Jansen (2010).
- False Discovery Rate (FDR), proposed in Abramovich and Benjamini (1995), determines the threshold value through a multiple hypotheses testing problem. The procedure is summarized in the following algorithm:
1. For each $ W_{t,j} \in \mathbf{W}_{j} $, consider the hypothesis $ H_{t,j}: W_{t,j} = 0 $ and its associated two-sided $ p $-value:
$$ p_{t,j} = 2\rbrace{1 - \Phi\rbrace{\frac{|W_{t,j}|}{\sigma_{\epsilon, j}}}} $$
where, as before, $ \sigma_{\epsilon, j} $ is the standard deviation of the wavelet coefficients at scale $ \lambda_{j} $ and $ \Phi(\cdot) $ is the standard Gaussian CDF.
2. Sort the $ p_{t,j} $ in ascending order so that:
$$ p_{(1)} \leq p_{(2)} \leq \ldots \leq p_{(m_{j})} $$
where $ m_{j} $ denotes the cardinality (number of elements) in $ \mathbf{W}_{j} $. For instance, when $ \mathbf{W}_{j} $ are derived from a DWT, then $ m_{j} = T/2^{j} $.
3. Let $ \alpha $ define the significance level of the hypothesis tests and let $ i^{\star} $ denote the largest $ i \in \cbrace{1, \ldots, m_{j}} $ such that $ p_{(i)} \leq (\frac{i}{m_{j}})\alpha $. For this $ i^{\star} $, the quantity:
$$ \eta^{\text{FDR}}_{j} = \sigma_{\epsilon, j}\Phi^{-1}\rbrace{1 - \frac{p_{(i^{\star})}}{2}} $$
is the optimal threshold for wavelet coefficients at scale $ \lambda_{j} $.
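The following Python sketch implements two of the strategies above: the universal threshold (with the noise standard deviation estimated via the Gaussian median rule discussed in the next section) and the FDR threshold. Function names are illustrative, and SciPy's `norm` supplies $ \Phi $ and $ \Phi^{-1} $:

```python
import numpy as np
from scipy.stats import norm

def universal_threshold(W1, T):
    """Universal (VisuShrink) threshold: sigma_hat * sqrt(2 log T), with
    sigma_hat estimated from the scale-1 coefficients via the median rule."""
    sigma_hat = np.median(np.abs(np.asarray(W1, dtype=float))) / 0.6745
    return sigma_hat * np.sqrt(2.0 * np.log(T))

def fdr_threshold(Wj, sigma_j, alpha=0.05):
    """FDR threshold of Abramovich and Benjamini (1995) for one scale."""
    Wj = np.asarray(Wj, dtype=float)
    m = Wj.size
    p = 2.0 * (1.0 - norm.cdf(np.abs(Wj) / sigma_j))  # step 1: two-sided p-values
    p_sorted = np.sort(p)                             # step 2: order statistics
    passing = np.nonzero(p_sorted <= np.arange(1, m + 1) / m * alpha)[0]
    if passing.size == 0:
        return np.inf                                 # no rejections: kill all coefficients
    p_star = p_sorted[passing[-1]]                    # step 3: largest i with p_(i) <= (i/m) alpha
    return sigma_j * norm.ppf(1.0 - p_star / 2.0)
```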
Wavelet Coefficient Variance
Before summarizing the entire threshold procedure, there remains the issue of how to estimate the standard deviation of the wavelet coefficients, $ \sigma_{\epsilon, j} $. If the assumption is that the observed data $ \mathbf{y} $ are obscured by some noise process $ \mathbf{\epsilon} $, the usual (mean-based) estimator of variance will exhibit extreme sensitivity to noisy observations. Accordingly, let $ \mu_{j} $ and $ \zeta_{j} $ denote the mean and median, respectively, of the wavelet coefficients $ \mathbf{W}_{j} $ at scale $ \lambda_{j} $, and let $ m_{j} $ denote its cardinality (total number of coefficients at said scale). Several common estimators have been proposed in the literature (sketched in code after the list):
- Mean Absolute Deviation, formalized as:
$$ \hat{\sigma}_{\epsilon, j} = \frac{1}{m_{j}}\xsum{i}{1}{m_{j}}{|W_{i, j} -\mu_{j}|} $$
- Median Absolute Deviation, formalized as:
$$ \hat{\sigma}_{\epsilon, j} = \med\rbrace{|W_{1, j} -\zeta_{j}|, \ldots, |W_{m_{j}, j} -\zeta_{j}|} $$
- Mean Median Absolute Deviation, formalized as:
$$ \hat{\sigma}_{\epsilon, j} = \frac{1}{m_{j}}\xsum{i}{1}{m_{j}}{|W_{i, j} -\zeta_{j}|} $$
- Median (Gaussian), formalized as:
$$ \hat{\sigma}_{\epsilon, j} = \frac{\med\rbrace{|W_{1, j}|, \ldots, |W_{m_{j}, j}|}}{0.6745} $$
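All four estimators are straightforward to sketch in NumPy (function names are illustrative; each takes the vector of wavelet coefficients at one scale):

```python
import numpy as np

def sigma_mean_abs_dev(Wj):
    """Mean absolute deviation about the mean."""
    Wj = np.asarray(Wj, dtype=float)
    return np.mean(np.abs(Wj - Wj.mean()))

def sigma_median_abs_dev(Wj):
    """Median absolute deviation about the median."""
    Wj = np.asarray(Wj, dtype=float)
    return np.median(np.abs(Wj - np.median(Wj)))

def sigma_mean_median_abs_dev(Wj):
    """Mean absolute deviation about the median."""
    Wj = np.asarray(Wj, dtype=float)
    return np.mean(np.abs(Wj - np.median(Wj)))

def sigma_median_gaussian(Wj):
    """Median of |W| rescaled by the Gaussian consistency factor 0.6745."""
    Wj = np.asarray(Wj, dtype=float)
    return np.median(np.abs(Wj)) / 0.6745
```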
Thresholding Implementation
The previous sections were devoted to describing thresholding rules and optimal threshold values. Here the focus is on summarizing the thresholding implementation. Effectively all wavelet thresholding procedures follow the algorithm below (an end-to-end sketch in Python follows the algorithm):
1. Compute a wavelet transformation of the original data up to some scale $ J^{\star} < M $. In other words, derive a partial wavelet transform yielding the wavelet and scaling coefficients $ \mathbf{W}_{1}, \ldots, \mathbf{W}_{J^{\star}}, \mathbf{V}_{J^{\star}} $.
2. Select an optimal threshold $ \eta $ using one of the methods discussed earlier.
3. Threshold the coefficients at each scale $ \lambda_{j} $ for $ j \in \cbrace{1, \ldots, J^{\star}} $ using the threshold value selected in step 2 and some thresholding rule (hard or soft). This will generate a set of modified (thresholded) wavelet coefficients $ \mathbf{W}^{\text{(T)}}_{1}, \ldots, \mathbf{W}^{\text{(T)}}_{J^{\star}} $. Observe that the scaling coefficients $ \mathbf{V}_{J^{\star}} $ are not thresholded.
4. Use the MRA with the thresholded coefficients to reconstruct the signal (original data) as follows:
\begin{align*}
\hat{\mathbf{y}} &= \xsum{j}{1}{J^{\star}}{\mathcal{W}^{\top}_{j}\mathbf{W}^{\text{(T)}}_{j}} + \mathcal{V}^{\top}_{J^{\star}}\mathbf{V}_{J^{\star}}\\
&= \xsum{j}{1}{J^{\star}}{\mathcal{D}^{\text{(T)}}_{j}} + \mathcal{S}_{J^{\star}}
\end{align*}
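Putting the pieces together, here is a hedged end-to-end sketch using PyWavelets (the `sym4` filter, level 4, periodic boundaries, and the universal threshold are illustrative choices rather than prescriptions; `pywt.threshold` implements the hard and soft rules):

```python
import numpy as np
import pywt

def wavelet_denoise(y, wavelet="sym4", level=4, rule="soft"):
    """Universal-threshold wavelet shrinkage: partial DWT, threshold the detail
    coefficients (scaling coefficients are left untouched), then reconstruct."""
    coeffs = pywt.wavedec(y, wavelet, mode="periodization", level=level)
    sigma_hat = np.median(np.abs(coeffs[-1])) / 0.6745  # noise sd from scale 1
    eta = sigma_hat * np.sqrt(2.0 * np.log(len(y)))
    thresholded = [coeffs[0]] + [
        pywt.threshold(c, eta, mode=rule) for c in coeffs[1:]
    ]
    return pywt.waverec(thresholded, wavelet, mode="periodization")

rng = np.random.default_rng(2)
t = np.arange(1024)
signal = np.sin(2 * np.pi * t / 256)
noisy = signal + 0.4 * rng.standard_normal(t.size)
denoised = wavelet_denoise(noisy)

# Root mean squared error before and after shrinkage
print(np.sqrt(np.mean((noisy - signal) ** 2)),
      np.sqrt(np.mean((denoised - signal) ** 2)))
```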
Conclusion
In this first entry of our series on wavelets, we provided a theoretical overview of the most important aspects of wavelet analysis. In Part II, we will see how to apply these concepts using the new wavelet features released with EViews 12.
References
- Abramovich F and Benjamini Y (1995), "Thresholding of wavelet coefficients as multiple hypotheses testing procedure", in Wavelets and Statistics, pp. 5-14. Springer.
- Daubechies I (1992), "Ten Lectures on Wavelets", CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM.
- Donoho DL and Johnstone IM (1994), "Ideal spatial adaptation by wavelet shrinkage", Biometrika, Vol. 81(3), pp. 425-455. Oxford University Press.
- Donoho DL and Johnstone IM (1995), "Adapting to unknown smoothness via wavelet shrinkage", Journal of the American Statistical Association, Vol. 90(432), pp. 1200-1224. Taylor & Francis Group.
- Donoho DL, Johnstone IM and others (1998), "Minimax estimation via wavelet shrinkage", The Annals of Statistics, Vol. 26(3), pp. 879-921. Institute of Mathematical Statistics.
- Gençay R, Selçuk F and Whitcher BJ (2001), "An Introduction to Wavelets and Other Filtering Methods in Finance and Economics". Academic Press.
- Jansen M (2010), "Minimum risk methods in the estimation of unknown sparsity", Technical report.
- Mallat S (1989), "A theory for multiresolution signal decomposition: the wavelet representation", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 11(7), pp. 674-693.
- Percival DB and Walden AT (2000), "Wavelet Methods for Time Series Analysis". Cambridge University Press.
- Priestley MB (1981), "Spectral Analysis and Time Series". Academic Press.