Monday, October 15, 2018

Principal Component Analysis: Part I (Theory)

Most students of econometrics are taught to appreciate the value of data. We are generally taught that more data is better than less, and that throwing data away is almost "taboo". While this is generally good practice when it concerns the number of observations per variable, it is not always recommended when it concerns the number of variables under consideration. In fact, as the number of variables increases, it becomes increasingly difficult to rank the importance (impact) of any given variable, and the analysis becomes prone to problems ranging from basic overfitting to more serious issues such as multicollinearity or model invalidity. In this regard, selecting the smallest number of the most meaningful variables -- otherwise known as dimensionality reduction -- is not a trivial problem. It has become a staple of modern data analytics and a motivation for many modern techniques. One such technique is Principal Component Analysis (PCA).

Variance Decomposition

Consider a linear statistical system -- a random matrix (multidimensional set of random variables) $ \mathbf{X} $ of size $ n \times m $ where the first dimension denotes observations and the second variables. Moreover, recall that linear statistical systems are characterized by two inefficiencies: 1) noise and 2) redundancy. The former is commonly measured through the signal (desirable information) to noise (undesirable information) ratio $ \text{SNR} = \sigma^{2}_{\text{signal}} / \sigma^{2}_{\text{noise}} $, and implies that systems with larger signal variances $ \sigma^{2}_{\text{signal}} $ relative to their noise counterparts are more informative. Assuming that noise is a nuisance equally present in observing each of the $ m $ variables of our system, it stands to reason that variables with larger variances have larger SNRs, therefore carry relatively richer signals, and are in this regard relatively more important, or principal.

Whereas relative importance reduces to relative variances across system variables, redundancy, or relative uniqueness of information, is captured by system covariances. Recall that covariances (or normalized covariances called correlations) are measures of variable dependency or co-movement (direction and magnitude of joint variability). In other words, variables with overlapping (redundant) information will typically move in the same direction with similar magnitudes, and will therefore have non-zero covariances. Conversely, when variables share little to no overlapping information, they exhibit small to zero linear dependency, although statistical dependence could still manifest nonlinearly.

Together, system variances and covariances quantify the amount of information afforded by each variable, and how much of that information is truly unique. In fact, the two are typically derived together using the familiar variance-covariance matrix formula: $$ \mathbf{\Sigma}_{X} = E \left( \mathbf{X}^{\top}\mathbf{X} \right) $$ where $ \mathbf{\Sigma}_{X} $ is an $ m\times m $ square symmetric matrix with (off-)diagonal elements as (co)variances, and where we have a priori assumed that all variables in $ \mathbf{X} $ have been demeaned. Thus, systems where all variables are unique will result in a diagonal $ \mathbf{\Sigma}_{X} $, whereas those exhibiting redundancy will have non-zero off-diagonal elements. In this regard, systems with zero redundancy have a particularly convenient feature known as variance decomposition. Since covariance terms in these systems are zero, total system variation (and therefore information) is the sum of all variance terms, and the proportion of total system information contributed by a variable is the ratio of its variance to total system variation.
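To make this concrete, the sketch below (in Python/NumPy, purely for illustration; the simulated data and variable names are invented here) computes a sample analogue of the variance-covariance matrix from demeaned data and reads off variances (diagonal) and covariances (off-diagonal).

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate n observations of m = 3 variables with some redundancy:
# x3 is largely a noisy copy of x1, so it carries overlapping information.
n = 1000
x1 = rng.normal(0.0, 2.0, n)
x2 = rng.normal(0.0, 1.0, n)
x3 = 0.9 * x1 + rng.normal(0.0, 0.5, n)
X = np.column_stack([x1, x2, x3])

# Demean each variable (assumed throughout the text).
X = X - X.mean(axis=0)

# Sample variance-covariance matrix: diagonal entries are variances,
# off-diagonal entries are covariances (non-zero => redundancy).
S_X = (X.T @ X) / n
print("Variances:", np.diag(S_X))
print("Covariance of x1 and x3:", S_X[0, 2])   # far from zero by construction
```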

Although the variance-covariance matrix is typically not diagonal, suppose there exists a way to diagonalize $ \mathbf{\Sigma}_{X} $, and by extension transform $ \mathbf{X} $, while simultaneously preserving information. If such transformation exists, one is guaranteed a new set of at most $ m $ variables (some variables may be perfectly correlated with others) which are uncorrelated, and therefore linearly independent. Accordingly, discarding any one of those new variables would have no linear statistical impact on the $ m-1 $ remaining variables, and would reduce dimensionality at the cost of losing information to the extent contained in the discarded variables. In this regard, if one could also quantify the amount of information captured by each of the new variables, order the latter in descending order of information quantity, one could discard variables from the back until sufficient dimensionality reduction is achieved, while maintaining the maximum amount of information within the preserved variables. We summarize these objectives below:
  1. Diagonalize $ \mathbf{\Sigma}_{X} $.
  2. Preserve information.
  3. Identify principal (important) information.
  4. Reduce dimensionality.
So how does one realize these objectives? It is precisely this question which motivates the subject of this entry.

Principal Component Analysis

Recall that associated with every matrix $ \mathbf{X} $ is a basis -- a set (matrix) of linearly independent vectors such that every row vector in $ \mathbf{X} $ is a linear combination of the vectors in the basis. In other words, the row vectors of $ \mathbf{X} $ are projections onto the column vectors of the basis matrix, which we will denote $ \mathbf{B} $. Since the covariance matrix contains all noise and redundancy information associated with a matrix, the idea driving principal component analysis is to re-express the original covariance matrix using a basis that results in a new, diagonal covariance matrix -- in other words, one in which the off-diagonal elements of the original covariance matrix are driven to zero and redundancy is eliminated.

Change of Basis

The starting point of PCA is the change of basis relationship. In particular, if $ \mathbf{B} $ is an $ m\times p $ matrix of geometric transformations with $ p \leq m $, the $ n\times p $ matrix $ \mathbf{Q}=\mathbf{XB} $ is a projection of the $ n\times m $ matrix $ \mathbf{X} = [\mathbf{X}_{1}^{\top}, \ldots, \mathbf{X}_{n}^{\top}]^{\top}$ onto $ \mathbf{B} $. In other words, the rows of $ \mathbf{X} $ are linear combinations of the column vectors in $ \mathbf{B} = [\mathbf{B}_{1}, \ldots, \mathbf{B}_{p}]$. Formally, \begin{align*} \mathbf{Q} & = \begin{bmatrix} \mathbf{X}_{1}\\ \vdots\\ \mathbf{X}_{n} \end{bmatrix} \begin{bmatrix} \mathbf{B}_{1} &\cdots &\mathbf{B}_{p} \end{bmatrix}\\ &= \begin{bmatrix} \mathbf{X}_{1}\mathbf{B}_{1} &\cdots &\mathbf{X}_{1}\mathbf{B}_{p}\\ \vdots &\ddots &\vdots\\ \mathbf{X}_{n}\mathbf{B}_{1} &\cdots &\mathbf{X}_{n}\mathbf{B}_{p} \end{bmatrix} \end{align*} More importantly, if the column vectors $ \left\{ \mathbf{B}_{1}, \ldots, \mathbf{B}_{p} \right\} $ are also linearly independent, then $ \mathbf{B} $, by definition, characterizes a matrix of basis vectors for $ \mathbf{X} $. Furthermore, the covariance matrix of this transformation formalizes as: \begin{align} \mathbf{\Sigma}_{Q} = E\left( \mathbf{Q}^{\top}\mathbf{Q} \right) = E\left( \mathbf{B}^{\top}\mathbf{X}^{\top}\mathbf{XB} \right) = \mathbf{B}^{\top}\mathbf{\Sigma}_{X}\mathbf{B} \label{eq1} \end{align} It is important to reflect here on the dimensionality of $ \mathbf{\Sigma}_{Q} $, which, unlike $ \mathbf{\Sigma}_{X} $, is of dimension $ p\times p $ where $ p \leq m $. In other words, the covariance matrix under the transformation $ \mathbf{B} $ is at most the size of the original covariance matrix, and possibly smaller. Since dimensionality reduction is clearly one of our objectives, the transformation above is certainly poised to do so. However, the careful reader may remark here: if the objective is simply dimensionality reduction, then any matrix $ \mathbf{B} $ of size $ m \times p $ with $ p\leq m $ will suffice; so why especially does $ \mathbf{B} $ have to characterize a basis?
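The change-of-basis mechanics are easy to verify numerically. The following sketch (invented data and an arbitrary, hypothetical basis matrix, chosen only for illustration) projects $ \mathbf{X} $ onto the columns of $ \mathbf{B} $ and confirms that the covariance of the projection equals $ \mathbf{B}^{\top}\mathbf{S}_{X}\mathbf{B} $.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 500, 3
X = rng.normal(size=(n, m))
X = X - X.mean(axis=0)                      # demeaned, as assumed in the text

# Any m x p matrix with linearly independent columns is a candidate basis.
B = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])                  # p = 2 <= m = 3

Q = X @ B                                   # projection (change of basis), n x p

S_X = (X.T @ X) / n
S_Q = (Q.T @ Q) / n                         # p x p, possibly smaller than m x m
print(np.allclose(S_Q, B.T @ S_X @ B))      # True: S_Q = B' S_X B
```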

The answer is simple: dimensionality reduction is not the only objective; it sits alongside preservation of information and identification of important information. As to the former, we recall that what makes a set of basis vectors special is that they characterize entirely the space on which an associated matrix takes values, and therefore span the multidimensional space on which that matrix resides. Accordingly, if $ \mathbf{B} $ characterizes a basis, then information contained in $ \mathbf{X} $ is never lost during the transformation to $ \mathbf{Q} $. Furthermore, recall that the channel for dimensionality reduction that motivated our discussion earlier was never intended to go through a sparser basis. Rather, the mechanism of interest was a diagonalization of the covariance matrix followed by variable exclusion. Accordingly, any dimension reduction that reflects basis sparsity via $ p \leq m $ is a consequence of perfect collinearity (correlation) among some of the original system variables. In other words, $ p = \text{rk}\left( \mathbf{X} \right) $, where $ \text{rk}(\cdot) $ denotes the matrix rank, or the number of its linearly independent columns (or rows).

Diagonalization

We argued earlier that any transformation from $ \mathbf{X} $ to $ \mathbf{Q} $ that preserves information must operate through a basis transformation $ \mathbf{B} $. Suppose momentarily that we have in fact found such $ \mathbf{B} $. Our next objective would be to ensure that $ \mathbf{B} $ also produces a diagonal $ \mathbf{\Sigma}_{Q} $. In this regard, we remind the reader of two famous results in linear algebra:
  1. [Thm. 1:] A matrix is symmetric if and only if it is orthogonally diagonalizable.
    • In other words, if a matrix $ \mathbf{A} $ is symmetric, there exists a diagonal matrix $ \mathbf{D} $ and an orthogonal matrix $ \mathbf{E} $ which diagonalizes $ \mathbf{A} $, such that $ \mathbf{A} = \mathbf{EDE}^{\top} $. The converse statement holds as well.
  2. [Thm. 2:] A symmetric matrix is diagonalized by a matrix of its orthonormal eigenvectors.
    • Extending the result above, if a $ q\times q $ matrix $ \mathbf{A} $ is symmetric, then the diagonalizing matrix is $ \mathbf{E} = [\mathbf{E}_{1}, \ldots, \mathbf{E}_{q}]$, the diagonal matrix is $ \mathbf{D} = \text{diag} [\lambda_{1}, \ldots, \lambda_{q}] $, where $ \mathbf{E}_{i} $ and $ \lambda_{i} $ are respectively the $ i^{\text{th}} $ eigenvector and associated eigenvalue of $ \mathbf{A} $.
    • Note that a set of vectors is orthonormal if each vector is of length unity and orthogonal to all other vectors in the set. Accordingly, if $ \mathbf{V} = [\mathbf{V}_{1}, \ldots, \mathbf{V}_{q}]$ is orthonormal, then $ \mathbf{V}_{j}^{\top}\mathbf{V}_{j} = 1 $ and $ \mathbf{V}_{j}^{\top}\mathbf{V}_{k} = 0 $ for all $ j \neq k $. Furthermore, $ \mathbf{V}^{\top}\mathbf{V} = \mathbf{I}_{q} $ where $ \mathbf{I}_{q} $ is the identity matrix of size $ q $, and therefore, $ \mathbf{V}^{\top} = \mathbf{V}^{-1} $.
    • Recall further that eigenvectors of a linear transformation are those vectors which only change magnitude but not direction when subject to said transformation. Since any matrix is effectively a linear transformation, if $ \mathbf{v} $ is an eigenvector of some matrix $ \mathbf{A} $, it satisfies the relationship $ \mathbf{Av} = \lambda \mathbf{v} $. Here, associated with each eigenvector is the eigenvalue $ \lambda $ quantifying the resulting change in magnitude.
    • Finally, observe that matrix rank determines the maximum number of eigenvectors (eigenvalues) one can extract for said matrix. In particular, if $ \text{rk}(\mathbf{A}) = r \leq q $, there are in fact only $ r $ orthonormal eigenvectors associated with non-zero eigenvalues of $ \mathbf{A} $. To see this, use a geometric interpretation and note that $ q- $dimensional objects reside in spaces with $ q $ orthogonal directions. Since any $ n\times q $ matrix is effectively a $ q- $dimensional object of vectors, the maximum number of orthogonal directions that characterize these vectors is $ q $. Nevertheless, if the (column) rank of this matrix is in fact $ r \leq q $, then $ q - r $ of the $ q $ orthogonal directions are never used. For instance, think of 2$ d $ drawings in 3$ d $ spaces. It makes no difference whether the drawing is characterized in the $ xy $, the $ xz $, or the $ yz $ plane -- the drawing still has 2 dimensions, and in any such configuration the coordinates along the left-out direction are simply a linear combination of the coordinates along the two directions that are used. In other words, one of the three dimensions is never needed, although it exists and can be characterized if necessary. Along the same lines, if $ \mathbf{A} $ indeed has rank $ r \leq q $, we can construct $ q - r $ additional orthogonal eigenvectors to ensure dimensional equality in the diagonalization $ \mathbf{A} = \mathbf{EDE}^{\top} $, although their associated eigenvalues will in fact be 0, essentially negating their presence.
    • By extension of the previous point, since $ \mathbf{A} $ is a $ q- $dimensional object of $ q- $dimensional column vectors, it can afford at most $ q $ orthogonal directions to characterize its space. Since all $ q $ such vectors are collected in $ \mathbf{E} $, we are guaranteed that $ \mathbf{E} $ is a spanning set and therefore constitutes an eigenbasis.
Since $ \mathbf{\Sigma}_{X} $ is a symmetric matrix by construction, the $ 1^{\text{st}} $ result above affords a re-expression of equation (\ref{eq1}) as follows: \begin{align} \mathbf{\Sigma}_{Q} &= \mathbf{B}^{\top} \mathbf{\Sigma}_{X} \mathbf{B} \notag \\ &= \mathbf{B}^{\top}\mathbf{E}_{X}\mathbf{D}_{X}\mathbf{E}_{X}^{\top} \mathbf{B} \label{eq2} \end{align} where $ \mathbf{E}_{X} = [\mathbf{E}_{1}, \ldots, \mathbf{E}_{m}] $ is the orthonormal matrix of eigenvectors of $ \mathbf{\Sigma}_{X} $ and $ \mathbf{D}_{X} = \text{diag} [\lambda_{1}, \ldots, \lambda_{m}] $ is the diagonal matrix of associated eigenvalues.

Now, since we require $ \mathbf{\Sigma}_{Q} $ to be diagonal, we can set $ \mathbf{B}^{\top} = \mathbf{E}_{X}^{-1} $ in order to reduce $ \mathbf{\Sigma}_{Q} $ to the diagonal matrix $ \mathbf{D}_{X} $. Since the $ 2^{\text{nd}} $ linear algebra result above guarantees that $ \mathbf{E}_{X} $ is orthonormal, we know that $ \mathbf{E}_{X}^{-1} = \mathbf{E}_{X}^{\top} $. Accordingly, \begin{align} \mathbf{\Sigma}_{Q} = \mathbf{D}_{X} \quad \text{if and only if} \quad \mathbf{B} = \mathbf{E}_{X} \label{eq3} \end{align} The entire idea is visualized below in Figures 1 and 2. In particular, Figure 1 demonstrates the ``data perspective'' view of the system in relation to an alternate basis. That is, two alternate basis axes, labeled ``Principal Direction 1'' and ``Principal Direction 2'', are superimposed on the familiar $ x $ and $ y $ axes. Since the vectors of this basis (the eigenvectors) are mutually orthogonal, the principal direction axes are naturally drawn at $ 90 $° angles. Alternatively, Figure 2 demonstrates the view of the system when the perspective uses the principal directions as the reference axes.
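A minimal numerical check of equation (\ref{eq3}), using invented data and sample quantities: taking $ \mathbf{B} $ to be the orthonormal eigenvectors of the covariance matrix produces a diagonal covariance for the transformed data. The sketch uses `numpy.linalg.eigh`, which is appropriate for symmetric matrices.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 2000, 3
# Correlated data: a latent factor plus idiosyncratic noise (invented for illustration).
f = rng.normal(size=(n, 1))
X = f @ np.array([[1.0, 0.8, -0.5]]) + 0.3 * rng.normal(size=(n, m))
X = X - X.mean(axis=0)

S_X = (X.T @ X) / n
eigvals, E = np.linalg.eigh(S_X)           # eigh: symmetric matrices; E is orthonormal

Q = X @ E                                  # change of basis with B = E
S_Q = (Q.T @ Q) / n

off_diag = S_Q - np.diag(np.diag(S_Q))
print(np.allclose(off_diag, 0.0))          # True: S_Q is diagonal
print(np.allclose(np.diag(S_Q), eigvals))  # True: diagonal entries are the eigenvalues
```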




Consistency

In practice, $ \mathbf{\Sigma}_{X} $, and by extension $ \mathbf{\Sigma}_{Q}, \mathbf{E}_{X}, $ and $ \mathbf{D}_{X} $, are typically not observed. Nevertheless, we can apply the analysis above using sample covariance matrices $$ \mathbf{S}_{Q} = \frac{1}{n}\mathbf{Q}^{\top}\mathbf{Q} \xrightarrow[n \to \infty]{p} \mathbf{\Sigma}_{Q} \quad \text{and} \quad \mathbf{S}_{X} = \frac{1}{n}\mathbf{X}^{\top}\mathbf{X} \xrightarrow[n \to \infty]{p} \mathbf{\Sigma}_{X} $$ where $ \xrightarrow{p} $ indicates convergence in probability to the asymptotic counterparts. In this regard, the result analogous to equation (\ref{eq2}) for estimated $ 2^{\text{nd}} $ moment matrices states that \begin{align} \mathbf{S}_{Q} = \widehat{\mathbf{E}}_{X}^{\top} \mathbf{S}_{X} \widehat{\mathbf{E}}_{X} = \widehat{\mathbf{E}}_{X}^{\top} \left( \widehat{\mathbf{E}}_{X}\widehat{\mathbf{D}}_{X}\widehat{\mathbf{E}}_{X}^{\top} \right) \widehat{\mathbf{E}}_{X} = \widehat{\mathbf{D}}_{X} \label{eq4} \end{align} where $ \widehat{\mathbf{E}}_{X} $ and $ \widehat{\mathbf{D}}_{X} $ now represent the eigenbasis and respective eigenvalues associated with the square symmetric matrix $ \mathbf{S}_{X} $. It is important to understand here that while $ \widehat{\mathbf{E}}_{X} \neq \mathbf{E}_{X} $ and $ \widehat{\mathbf{D}}_{X} \neq \mathbf{D}_{X} $, there is a long-standing literature, far beyond the scope of this entry, which guarantees that $ \widehat{\mathbf{E}}_{X} $ and $ \widehat{\mathbf{D}}_{X} $ are consistent estimators of $ \mathbf{E}_{X} $ and $ \mathbf{D}_{X} $, provided $ m/n \to 0 $ as $ n \to \infty $. In other words, as in classical regression paradigms, consistency of PCA holds only under the usual ``large $ n $ and small $ m $'' framework. There are modern results which address the case $ m/n \to c > 0 $, but they too are beyond the scope of this text. Proceeding, in order to contain notational complexity, and unless otherwise stated, we will let $ \mathbf{E}_{X} $ and $ \mathbf{D}_{X} $ denote the eigenbasis and respective eigenvalues associated with the square symmetric matrix $ \mathbf{S}_{X} $.
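The consistency claim can be illustrated with a small simulation (entirely illustrative; the population covariance below is invented, and the formal rates and conditions belong to the literature mentioned above): as $ n $ grows with $ m $ fixed, the sample eigenvalues approach their population counterparts.

```python
import numpy as np

rng = np.random.default_rng(3)
m = 3
# Population covariance (chosen arbitrarily for illustration) and its eigenvalues.
Sigma = np.array([[4.0, 1.5, 0.5],
                  [1.5, 2.0, 0.3],
                  [0.5, 0.3, 1.0]])
true_eigvals = np.linalg.eigvalsh(Sigma)

L = np.linalg.cholesky(Sigma)
for n in (100, 1_000, 100_000):
    X = rng.normal(size=(n, m)) @ L.T      # draws with covariance Sigma
    X = X - X.mean(axis=0)
    S_X = (X.T @ X) / n
    est_eigvals = np.linalg.eigvalsh(S_X)
    print(n, np.max(np.abs(est_eigvals - true_eigvals)))  # shrinks as n grows
```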

Preservation of Information

In addition to diagonalizing $ \mathbf{S}_{Q} $, we also require preservation of information. For this we need to guarantee that $ \mathbf{B} $ is a basis. Here, we recall the final remark under the $ 2^{\text{nd}} $ linear algebra result above, which argues that $ \mathbf{S}_{X} $ affords at most $ m $ orthonormal eigenvectors and associated eigenvalues, with the former also forming an eigenbasis. Since all $ m $ eigenvectors are collected in $ \mathbf{E}_{X} = \mathbf{B} $, we are guaranteed that $ \mathbf{B} $ is indeed a basis. In this regard, we transform $ \mathbf{X} $ into $ m $ statistically uncorrelated, but exhaustive, directions. We are careful not to use the word variables (although technically they are), since the transformation $ \mathbf{Q} = \mathbf{XE}_{X} $ does not preserve variable interpretation. That is, the $ j^{\text{th}} $ column of $ \mathbf{Q} $ no longer retains the interpretation of the $ j^{\text{th}} $ variable (column) in $ \mathbf{X} $. In fact, the $ j^{\text{th}} $ column of $ \mathbf{Q} $ is a projection (linear combination) of all $ m $ variables in $ \mathbf{X} $, in the direction of the $ j^{\text{th}} $ eigenvector $ \mathbf{E}_{j} $. Accordingly, we can interpret $ \mathbf{XE}_{X} $ as $ m $ orthogonal weighted averages of the $ m $ variables in $ \mathbf{X} $. Furthermore, since $ \mathbf{E}_{X} $ is an eigenbasis, the total variation (information) of the original system $ \mathbf{X} $, as summarized by $ \mathbf{S}_{X} $, is preserved in the transformation to $ \mathbf{Q} $. Unlike $ \mathbf{S}_{X} $ however, $ \mathbf{S}_{Q} = \mathbf{D}_{X} $ is diagonal, and total variation in $ \mathbf{X} $ is now distributed across $ \mathbf{Q} $ without redundancy.
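A short numerical check of this preservation claim, again with invented data: the transformation $ \mathbf{Q} = \mathbf{XE}_{X} $ leaves total variation, $ \text{tr}(\mathbf{S}_{X}) $, unchanged while removing all covariances.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 1000, 4
X = rng.normal(size=(n, m)) @ rng.normal(size=(m, m))   # arbitrary correlated data
X = X - X.mean(axis=0)

S_X = (X.T @ X) / n
_, E = np.linalg.eigh(S_X)
Q = X @ E
S_Q = (Q.T @ Q) / n

# Total variation (trace) is preserved; redundancy (off-diagonals) is eliminated.
print(np.isclose(np.trace(S_Q), np.trace(S_X)))          # True
print(np.allclose(S_Q - np.diag(np.diag(S_Q)), 0.0))     # True
```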

Principal Directions

Since preservation of information is guaranteed under the transformation $ \mathbf{Q} = \mathbf{XE}_{X} $, the amount of variation in $ \mathbf{X} $ associated with the $ j^{\text{th}} $ column of $ \mathbf{Q} $ is in fact $ \lambda_{j} $. In other words, the $ j^{\text{th}} $ column of $ \mathbf{Q} $ has variance $ \lambda_{j} $ and standard deviation $ \sqrt{\lambda_{j}} $. Moreover, since $ \mathbf{S}_{Q} $ is diagonal and information redundancy is not an issue, it stands to reason that the total amount of system variation is the sum of variations due to each column in $ \mathbf{Q} $. In other words, total system variation is $ \text{tr}\left( \mathbf{S}_{Q} \right) = \lambda_{1} + \ldots + \lambda_{m} $, where $ \text{tr}(\cdot) $ denotes the matrix trace operator, and the $ j^{\text{th}} $ orthogonalized direction contributes $$ \frac{\lambda_{j}}{\lambda_{1} + \ldots + \lambda_{m}} \times 100 \% $$ of total system variation (information). If we now arrange the columns of $ \mathbf{Q} $, or equivalently those of $ \mathbf{E}_{X} $, according to the order $ \lambda_{(1)} \geq \lambda_{(2)} \geq \ldots \geq \lambda_{(m)} $, where $ \lambda_{(j)} $ are the ordered versions of their counterparts $ \lambda_{j} $, we are guaranteed to have the directions arranged from most principal to least, as measured by the proportion of total system variation contributed by each direction.
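The ordering step amounts to sorting the eigenvalues (and the corresponding eigenvectors) in descending order and reading off each direction's share of the trace. A minimal sketch with invented data follows.

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 1000, 4
X = rng.normal(size=(n, m)) @ rng.normal(size=(m, m))
X = X - X.mean(axis=0)

S_X = (X.T @ X) / n
eigvals, E = np.linalg.eigh(S_X)

# eigh returns eigenvalues in ascending order; re-order from most to least principal.
order = np.argsort(eigvals)[::-1]
eigvals, E = eigvals[order], E[:, order]

share = eigvals / eigvals.sum()           # lambda_(j) / (lambda_1 + ... + lambda_m)
print(np.round(100 * share, 2))           # percent of total variation per direction
print(np.round(100 * share.cumsum(), 2))  # cumulative percent
```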

Another useful feature of the vectors in $ \mathbf{E}_{X} $ is that they quantify the proportion of directionality each original variable contributes toward the overall direction of that vector. In particular, let $ e_{i,j} $ denote the $ i^{\text{th}} $ element in $ \mathbf{E}_{j} = [e_{1,j}, \ldots, e_{m,j} ]$, where $ i \in \{1, \ldots, m\} $, and observe that since the $ \mathbf{E}_{j} $ are the eigenvectors of $ \mathbf{S}_{X} $, each element $ e_{i,j} $ is in fact associated with the $ i^{\text{th}} $ variable (column) of $ \mathbf{X} $. Furthermore, since the vectors $ \mathbf{E}_{j} $ each have unit length due to (ortho)normality, we know that they lie on the unit (hyper)sphere, that each squared element satisfies $ e_{i,j}^{2} \leq 1 $, and that $ e_{i,j}^{2} \times 100 \% $ of the direction $ \mathbf{E}_{j} $ is due to variable $ i $. In other words, we can quantify how principal each variable is in each direction.

Principal Components

Principal directions, the eigenvectors in $ \mathbf{E}_{X} $, are often mistakenly called principal components. Nevertheless, the literature properly reserves the term principal components for the projections of the original system variables onto the principal directions. That is, principal components refer to the column vectors in $ \mathbf{Q} = [\mathbf{Q}_{1}, \ldots, \mathbf{Q}_{m}] = \mathbf{XE}_{X} $, and are sometimes also referred to as scores. Like their principal direction counterparts, principal components possess several important properties worth observing.

As a direct consequence of the diagonalization properties discussed earlier, the variance of each principal component is in fact the eigenvalue associated with the underlying principal direction, and principal components are mutually uncorrelated. To see this formally, let $ \mathbf{C}_{j} = [0, \ldots, 0, \underbrace{1}_j, 0, \ldots, 0 ]^{\top} $ denote the canonical basis vector in the $ j^{\text{th}} $ dimension. Then, using the result in equation (\ref{eq4}), the covariance between the $ j^{\text{th}} $ and $ k^{\text{th}} $ principal components $ \mathbf{Q}_{j} = \mathbf{QC}_{j} $ and $ \mathbf{Q}_{k} = \mathbf{QC}_{k} $, respectively, is: \begin{align*} s_{Q_{j}, Q_{k}} &= \frac{1}{n}\mathbf{Q}_{j}^{\top}\mathbf{Q}_{k} \\ &= \mathbf{C}_{j}^{\top} \left( \frac{1}{n} \mathbf{Q}^{\top}\mathbf{Q} \right) \mathbf{C}_{k} \\ &= \mathbf{C}_{j}^{\top} \mathbf{S}_{Q} \mathbf{C}_{k} \\ &= \mathbf{C}_{j}^{\top} \mathbf{D}_{X} \mathbf{C}_{k} \\ \end{align*} which equals $ \lambda_{j} $ when $ j = k $ and $ 0 $ otherwise.

Moreover, we can quantify how (co)related the original variables are with the principal components. In particular, consider the covariance between the $ i^{\text{th}} $ variable $ \mathbf{X}_{i}=\mathbf{XC}_{i} $ and the $ j^{\text{th}} $ principal component $ \mathbf{Q}_{j} $, formalized as: \begin{align} \mathbf{S}_{X_{i}Q_{j}} & = \frac{1}{n} \mathbf{X}_{i}^{\top}\mathbf{Q}_{j} \notag\\ &= \mathbf{C}_{i}^{\top} \left( \frac{1}{n}\mathbf{X}^{\top}\mathbf{Q} \right) \mathbf{C}_{j}\notag\\ &= \mathbf{C}_{i}^{\top} \left( \frac{1}{n}\mathbf{X}^{\top}\mathbf{X}\mathbf{E}_{X} \right) \mathbf{C}_{j}\notag\\ &= \mathbf{C}_{i}^{\top} \mathbf{S}_{X} \mathbf{E}_{X} \mathbf{C}_{j}\notag\\ &= \mathbf{C}_{i}^{\top} \mathbf{E}_{X}\mathbf{D}_{X} \mathbf{E}_{X}^{\top} \mathbf{E}_{X} \mathbf{C}_{j}\notag\\ &= \mathbf{C}_{i}^{\top} \mathbf{E}_{X}\mathbf{D}_{X} \mathbf{C}_{j}\notag\\ &= e_{i,j} \lambda_{j} \label{eq5} \end{align} where the antepenultimate line applies Theorem 1 to $ \mathbf{S}_{X} $, the cancellation to the identity in the penultimate line follows from Theorem 2 and the orthonormality of $ \mathbf{E}_{X} $, and the ultimate line is the product of the $ i^{\text{th}} $ element of the principal direction $ \mathbf{E}_{j} $ and the $ j^{\text{th}} $ principal eigenvalue.
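Equation (\ref{eq5}) is easy to verify numerically. The sketch below (invented data) checks that the sample covariance between each original variable and each principal component equals $ e_{i,j}\lambda_{j} $.

```python
import numpy as np

rng = np.random.default_rng(6)
n, m = 1000, 3
X = rng.normal(size=(n, m)) @ rng.normal(size=(m, m))
X = X - X.mean(axis=0)

S_X = (X.T @ X) / n
eigvals, E = np.linalg.eigh(S_X)
Q = X @ E

# Cross-covariances between original variables and principal components.
S_XQ = (X.T @ Q) / n
print(np.allclose(S_XQ, E * eigvals))   # True: entry (i, j) equals e_{i,j} * lambda_j
```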

Dimension Reduction

At last, we arrive at the issue of dimensionality reduction. Assuming that the columns of $ \mathbf{Q} $ are arranged in decreasing order of importance (more principal columns come first), we can discard the $ g < m $ least principal columns of $ \mathbf{Q} $ until sufficient dimension reduction is achieved, and rest assured that the remaining (first) $ m - g $ columns are in fact the most principal. In other words, the $ m - g $ directions which are retained contribute $$ \frac{ \sum \limits_{j=1}^{m-g}\lambda_{(j)}}{\lambda_{1} + \ldots + \lambda_{m}} \times 100 \% $$ of the original variation in $ \mathbf{X} $. Since directions are ordered in decreasing order of importance, the first few directions will capture the majority of variation, leaving the less principal directions to contribute information only marginally. Accordingly, one can significantly reduce dimensionality whilst retaining the majority of information. This is particularly important when we want to measure the complexity of our data set. In particular, if the $ r $ most principal directions account for the majority of variance, it stands to reason that our underlying data set is in fact only $ r- $dimensional, with the remaining $ m-r $ dimensions being noise. In other words, dimensionality reduction naturally leads to data denoising.

So how does one select how many principal directions to retain? There are several approaches, but we list only a few below; a short code sketch illustrating some of them follows the list:
  1. A very popular approach is to use a scree plot -- a plot of the ordered eigenvalues from most to least principal. The idea here is to look for a sharp drop in the function, and select the bend or elbow as the cutoff value, retaining all eigenvalues (and by extension principal directions) to the left of this value.
  2. Another popular alternative is to use the cumulative proportion of variation explained by the first $ r $ principal directions. In other words, select the smallest $ r $ such that $ \frac{ \sum \limits_{j=1}^{r}\lambda_{(j)}}{\lambda_{1} + \ldots + \lambda_{m}} \geq 1 - \alpha $, where $ \alpha \in [0,1] $. Typical uses set $ \alpha = 0.1 $ in order to retain the most principal directions that capture at least 90\% of the system variation.
  3. A more data driven approach is known as the Guttman-Kaiser (Guttman (1954), Kaiser (1960), Kaiser (1961)) criterion. This criterion advocates the retention of all eigenvalues, and by extension the associated principal directions, that exceed the average of all eigenvalues. In other words, select the largest $ r $ such that $ \lambda_{(r)} \geq \bar{\lambda} $, where $ \bar{\lambda} = \frac{1}{m} \sum\limits_{j = 1}^{m}\lambda_{j} $.
  4. An entirely data-driven approach akin to classical information criteria selection methods borrows from the Bai and Ng (2002) paper on factor models. In this regard, consider $$ \mathbf{X}_{j} = \beta_{1}\mathbf{Q}_{1} + \ldots + \beta_{r}\mathbf{Q}_{r} + \mathbf{U}(j,r) $$ as the regression of the $ j^{\text{th}} $ variable in $ \mathbf{X} $ on the first $ r $ principal components, and let $ \widehat{\mathbf{U}}(j,r) $ denote the corresponding residual vector. Furthermore, define $ SSR(j,r) = \frac{1}{n} \widehat{\mathbf{U}}(j,r)^{\top} \widehat{\mathbf{U}}(j,r) $ as the sum of squared residuals from said regression, and define $ SSR(r) = \frac{1}{m}\sum \limits_{j=1}^{m}SSR(j,r) $ as the average of $ SSR(j,r) $ across all variables $ j $ for a given $ r $. We can then select the $ r $ that minimizes a penalized objective. In other words, the problem reduces to: $$ \min\limits_{r} \left\{ \ln\left( SSR(r) \right) + rg(n,m) \right\} $$ where $ g(n,m) $ is a penalty term which leads to one of several criteria proposed in Bai and Ng (2002). For instance, when $ n > m $, one such option is the $ IC_{p2}(r) $ criterion, and the problem above formalizes as: $$ \min\limits_{r} \left\{ \ln\left( SSR(r) \right) + r\left( \frac{n + m}{nm} \right) \ln(m) \right\} $$
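The sketch below (invented two-factor data; the 90\% threshold, the maximum number of candidate directions, and the simplified handling of the Bai and Ng (2002) objective are all illustrative choices, not a definitive implementation) computes the cumulative-variation rule, the Guttman-Kaiser rule, and an $ IC_{p2} $-style criterion as written above.

```python
import numpy as np

rng = np.random.default_rng(7)
n, m = 300, 6
# Two-factor structure plus noise, so roughly two "real" dimensions (invented setup).
F = rng.normal(size=(n, 2))
X = F @ rng.normal(size=(2, m)) + 0.5 * rng.normal(size=(n, m))
X = X - X.mean(axis=0)

S_X = (X.T @ X) / n
eigvals, E = np.linalg.eigh(S_X)
order = np.argsort(eigvals)[::-1]
eigvals, E = eigvals[order], E[:, order]
Q = X @ E

# (2) Cumulative proportion of variation: smallest r capturing at least 90%.
share = eigvals / eigvals.sum()
r_cumulative = int(np.argmax(share.cumsum() >= 0.90)) + 1

# (3) Guttman-Kaiser: retain directions whose eigenvalue exceeds the average eigenvalue.
r_kaiser = int(np.sum(eigvals > eigvals.mean()))

# (4) A simplified IC_p2-style criterion (n > m case), following the formula above.
r_max = 4                                           # cap candidates below m, as in practice
ic = []
for r in range(1, r_max + 1):
    Qr = Q[:, :r]
    beta, *_ = np.linalg.lstsq(Qr, X, rcond=None)   # regress every X_j on the first r PCs
    U = X - Qr @ beta                               # residuals for all variables at once
    ssr = np.mean(np.sum(U**2, axis=0) / n)         # average of SSR(j, r) over j
    ic.append(np.log(ssr) + r * ((n + m) / (n * m)) * np.log(m))
r_ic = int(np.argmin(ic)) + 1

print(r_cumulative, r_kaiser, r_ic)
```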
Of course, it goes without saying that discarding information comes at its own cost, although, if dimensionality reduction is desired, it may well be a price worth paying.

Inference

Although PCA is deeply rooted in linear algebra, it is also a very visual experience. In this regard, a particularly convenient feature is the ability to visualize multidimensional structures across two-dimensional summaries. In particular, comparing two principal directions provides a wealth of information that is typically inaccessible in traditional multidimensional contexts.

Loading Plots

A powerful inferential tool unique to PCA is element-wise comparison of two principal directions. In particular, consider two principal directions $ \mathbf{E}_{j} = [e_{1,j}, \ldots, e_{m,j}]$ and $ \mathbf{E}_{k} = [e_{1,k}, \ldots, e_{m,k}]$, and let $ \left\{ \mathbf{V}_{1,j,k}, \ldots, \mathbf{V}_{m,j,k} \right\}$ denote the set of vectors from the origin $ (0,0) $ to $ \left( e_{i,j}, e_{i,k} \right) $ for $ i \in \{1, \ldots, m\} $. In other words, $ \mathbf{V}_{i,j,k} = \left( e_{i,j}, e_{i,k} \right)^{\top}$. Then, for any pair of principal directions $ (j,k) $, a single plot of all $ m $ vectors $ \mathbf{V}_{i,j,k} $, $ i \in \{1, \ldots, m\} $, is called a loading plot.

There is an important connection between the vectors $ \mathbf{V}_{i,j,k} $ and the original variable covariances. In particular, consider $ \mathbf{S}_{X_{i},X_{s}} $ -- the finite sample covariance between $ \mathbf{X}_{i} $ and $ \mathbf{X}_{s} $ -- and, assuming we have ordered eigenvalues from most principal to least, note that: \begin{align*} \mathbf{S}_{X_{i},X_{s}} &= \mathbf{C}_{i}^{\top} \mathbf{S}_{X} \mathbf{C}_{s}\\ &= \mathbf{C}_{i}^{\top} \mathbf{E}_{X} \mathbf{D}_{X} \mathbf{E}_{X}^{\top} \mathbf{C}_{s}\\ &= \lambda_{(1)}e_{i,1}e_{s,1} + \lambda_{(2)}e_{i,2}e_{s,2} + \ldots + \lambda_{(m)}e_{i,m}e_{s,m}\\ &= \mathbf{V}_{i,1,2}^{\top}\mathbf{L}_{1,2}\mathbf{V}_{s,1,2} + \ldots + \mathbf{V}_{i,m-1,m}^{\top}\mathbf{L}_{m-1,m}\mathbf{V}_{s,m-1,m} \end{align*} where $ \mathbf{L}_{j,k} = \text{diag} \left[\lambda_{(j)}, \lambda_{(k)} \right] $ denotes the appropriate scaling matrix, and the last line pairs consecutive directions (assuming for simplicity that $ m $ is even). In other words, for any pair of principal directions $ (j,k) $, $ \mathbf{V}_{i,j,k}^{\top} \mathbf{L}_{j,k} \mathbf{V}_{s,j,k} $ explains a portion of the covariance $ \mathbf{S}_{X_{i},X_{s}} $. Accordingly, when $ \mathbf{X}_{i} $ and $ \mathbf{X}_{s} $ are highly correlated, we can expect $ \mathbf{V}_{i,j,k}^{\top} \mathbf{L}_{j,k} \mathbf{V}_{s,j,k} $ to be large. In this regard, let $ \theta_{i,s,j,k} $ denote the angle between any two vectors $ \mathbf{V}_{i,j,k} $ and $ \mathbf{V}_{s,j,k} $, and recall that \begin{align*} \cos \theta_{i,s,j,k} &= \frac{\mathbf{V}_{i,j,k}^{\top}\mathbf{V}_{s,j,k}}{\norm{\mathbf{V}_{i,j,k}} \norm{\mathbf{V}_{s,j,k}}} \end{align*} To accommodate the scaling matrices $ \mathbf{L}_{j,k} $, observe that since $ \mathbf{L}_{j,k} $ is diagonal with non-negative entries, we can write: \begin{align} \mathbf{V}_{i,j,k}^{\top} \mathbf{L}_{j,k} \mathbf{V}_{s,j,k} = \left( \mathbf{L}_{j,k}^{1/2}\mathbf{V}_{i,j,k} \right)^{\top} \left( \mathbf{L}_{j,k}^{1/2}\mathbf{V}_{s,j,k} \right) = \norm{\mathbf{L}_{j,k}^{1/2}\mathbf{V}_{i,j,k}} \norm{\mathbf{L}_{j,k}^{1/2}\mathbf{V}_{s,j,k}} \cos \widetilde{\theta}_{i,s,j,k} \label{eq6} \end{align} where $ \widetilde{\theta}_{i,s,j,k} $ is the angle between the scaled vectors, and coincides with $ \theta_{i,s,j,k} $ whenever $ \lambda_{(j)} = \lambda_{(k)} $. Now, when $ \theta_{i,s,j,k} $ is small, the two vectors point in similar directions, the cosine term is close to one, and we can expect $ \mathbf{V}_{i,j,k}^{\top} \mathbf{L}_{j,k} \mathbf{V}_{s,j,k} $ to be large, and by extension, $ \mathbf{X}_{i} $ and $ \mathbf{X}_{s} $ to be more correlated. In other words, vectors that are close to one another in a loading plot indicate stronger correlations between their underlying variables. Figure 3 below gives a visual representation.


It is important to realize here that since $ \theta_{i,s,j,k} $ is in fact the angle between $ \mathbf{V}_{i,j,k} $ and $ \mathbf{V}_{s,j,k} $, the interpretation of how exhibitive $ \theta_{i,s,j,k} $ is of the underlying correlation $ \mathbf{S}_{X_{i}, X_{s}} $ is made more complicated by the presence of $ \mathbf{L}_{j,k} $ in equation (\ref{eq6}). Accordingly, to ease interpretation, the vectors $ \mathbf{V}_{i,j,k} $ are sometimes scaled appropriately, or loaded with scaling information, leading to the term loadings. In this regard, consider the vectors $ \widetilde{\mathbf{V}}_{i,j,k} = \mathbf{L}_{j,k}^{1/2} \mathbf{V}_{i,j,k} $. Here, loading is done via $ \mathbf{L}_{j,k}^{1/2} $, and we have: $$ \mathbf{S}_{X_{i}, X_{s}} = \widetilde{\mathbf{V}}_{i,1,2}^{\top}\widetilde{\mathbf{V}}_{s,1,2} + \ldots + \widetilde{\mathbf{V}}_{i,m-1,m}^{\top}\widetilde{\mathbf{V}}_{s,m-1,m} $$ and $$ \widetilde{\mathbf{V}}_{i,j,k}^{\top}\widetilde{\mathbf{V}}_{s,j,k} = \norm{\widetilde{\mathbf{V}}_{i,j,k}} \norm{\widetilde{\mathbf{V}}_{s,j,k}} \cos \widetilde{\theta}_{i,s,j,k} $$ As such, $ \widetilde{\theta}_{i,s,j,k} $ more closely exhibits the true angle between $ \mathbf{X}_{i} $ and $ \mathbf{X}_{s} $ than $ \theta_{i,s,j,k} $ does, and loading plots using $ \widetilde{\mathbf{V}}_{i,j,k} $ tend to be more exhibitive of the underlying correlations $ \mathbf{S}_{X_{i}, X_{s}} $ than those based on $ \mathbf{V}_{i,j,k} $. Of course, one does not have to resort to the use of $ \mathbf{L}_{j,k}^{1/2} $ as the loading matrix. In principle, one can use $ \mathbf{L}_{j,k}^{\alpha} $ for some $ 0 \leq \alpha \leq 1 $, although the underlying interpretation of what such a loading means ought to be understood first.

Of course, it is not difficult to see that $ \widetilde{\mathbf{V}}_{i,j,k} = \mathbf{L}_{j,k}^{\alpha} \mathbf{V}_{i,j,k} $ is in fact the $ i^{\text{th}} $ "XY"-pair between $ \mathbf{E}_{j}\lambda_j^{\alpha} $ and $ \mathbf{E}_{k}\lambda_k^{\alpha} $. In other words, it is the $ i^{\text{th}} $ "XY"-pair using the "loaded" $ j^{\text{th}} $ and $ k^{\text{th}} $ principal directions. Accordingly, the term loading vector is sometimes used to denote a loaded principal direction. In particular, the entire matrix of loading vectors $ \widetilde{\mathbf{E}}_X $ can be obtained as follows: $$ \widetilde{\mathbf{E}}_X = \mathbf{E}_X \mathbf{D}_X^{\alpha} $$ Figure 4 below demonstrates the impact of using a loading weight. In particular, the vectors in Figure 3 are superimposed on the set of loaded vectors, where the loading factor is $ \mathbf{D}_{X}^{1/2} $. Clearly, the loaded vectors are much better aligned with the general shape of the data, as represented by the ellipse.
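The loading-vector construction is just a column-wise rescaling of the eigenvector matrix. The sketch below (invented data, with the choice $ \alpha = 1/2 $) also verifies that, with this choice, the loading vectors reproduce the sample covariances exactly.

```python
import numpy as np

rng = np.random.default_rng(8)
n, m = 1000, 4
X = rng.normal(size=(n, m)) @ rng.normal(size=(m, m))
X = X - X.mean(axis=0)

S_X = (X.T @ X) / n
eigvals, E = np.linalg.eigh(S_X)
order = np.argsort(eigvals)[::-1]
eigvals, E = eigvals[order], E[:, order]

# Loading vectors: each principal direction scaled by its eigenvalue to the power alpha.
alpha = 0.5
loadings = E * eigvals**alpha            # column j is E_j * lambda_(j)^alpha

# With alpha = 1/2, the sum of products over all directions recovers the covariances:
# S_X[i, s] = sum_j loadings[i, j] * loadings[s, j].
print(np.allclose(loadings @ loadings.T, S_X))   # True
```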


Scores Plots

A score plot across a principal direction pair $ (j,k) $ is essentially a scatter plot of the principal component vector $ \mathbf{Q}_{j} $ vs. $ \mathbf{Q}_{k} $. In fact, it is the analogue of the loading plot, but for observations as opposed to variables. In this regard, whereas the angle between two loading vectors is exhibitive of the underlying correlation between variables, the distance between points in a score plot is exhibitive of how homogeneous the underlying observations are. Accordingly, observations which cluster together tend to move together, and one typically looks to identify important clusters when conducting inference.

Recall also the expression derived in the last line of (\ref{eq5}), namely, $ \mathbf{S}_{X_{i}Q_{j}} = e_{i,j} \lambda_{j} = \left(e_{i,j} \lambda_{j}^{1/2}\right)\lambda_{j}^{1/2} $. Notice that the latter expression states that the covariance between the $ i^{\text{th}} $ variable and the $ j^{\text{th}} $ score vector is in fact the product of the $ i^{\text{th}} $ element of the loaded $ j^{\text{th}} $ principal direction (the $ j^{\text{th}} $ loading vector) and $ \lambda_{j}^{1/2} $. Accordingly, in order to achieve a more natural interpretation, one can proceed in a manner analogous to the creation of loading vectors, and either scale or entirely remove the remaining scaling factor. This leads to the idea of loaded score vectors. In particular, using the context above, if one wishes to interpret the covariance between the $ i^{\text{th}} $ variable and the $ j^{\text{th}} $ score vector as just an element of a loading vector, without the additional factor $ \lambda_{j}^{1/2} $, then doing so is as simple as computing $$ \mathbf{S}_{X_{i},\, Q_{j}\lambda_j^{-1/2}} = e_{i,j} \lambda_{j}^{1/2} $$ where we now interpret $ \mathbf{Q}_{j}\lambda_j^{-1/2} $ as a loaded score vector. Of course, an infinite array of such scaling options is achievable using $ \mathbf{Q}_{j}\lambda_j^{-\alpha} $, although, as before, their interpretation ought to be understood first.
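A quick numerical check of the loaded-score relationship above, again with invented data: dividing each score vector by $ \lambda_{j}^{1/2} $ leaves covariances with the original variables equal to $ e_{i,j}\lambda_{j}^{1/2} $.

```python
import numpy as np

rng = np.random.default_rng(9)
n, m = 1000, 3
X = rng.normal(size=(n, m)) @ rng.normal(size=(m, m))
X = X - X.mean(axis=0)

S_X = (X.T @ X) / n
eigvals, E = np.linalg.eigh(S_X)
Q = X @ E

Q_loaded = Q / np.sqrt(eigvals)          # loaded (unit-variance) score vectors
S_X_Qloaded = (X.T @ Q_loaded) / n
print(np.allclose(S_X_Qloaded, E * np.sqrt(eigvals)))   # True: entry (i, j) = e_{i,j} * sqrt(lambda_j)
```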

Outlier Detection

An important application of PCA is to outlier detection. The general principle exploits the fact that the first few principal directions explain the majority of variation in the original system, and uses data reconstruction to generate an approximation of the original system from the first few principal components.

Formally, if we start from the matrix of all principal components $ \mathbf{Q} $, it is trivial to reconstruct the original system $ \mathbf{X} $ using the inverse transformation: $$ \mathbf{Q}\mathbf{E}_{X}^{\top} = \mathbf{X}\mathbf{E}_{X}\mathbf{E}_{X}^{\top} = \mathbf{X}$$ On the other hand, if we restrict our principal components to the first $ r \ll m $ most principal directions, then $ \widetilde{\mathbf{Q}}\widetilde{\mathbf{E}}_{X}^{\top} = \widetilde{\mathbf{X}} \approx \mathbf{X} $, where $ \widetilde{\mathbf{Q}} $ and $ \widetilde{\mathbf{E}}_{X} $ are respectively the matrices $ \mathbf{Q} $ and $ \mathbf{E}_{X} $ with the last $ m - r $ columns removed, and $ \approx $ denotes an approximation. Then, the difference $$ \mathbf{\xi} = \widetilde{\mathbf{X}} - \mathbf{X} $$ is known as the reconstruction error, and if the first $ r $ principal directions explain the original variation well, we can expect $ \norm{\mathbf{\xi}}_{D} \approx 0 $, where $ \norm{\cdot}_{D} $ denotes some measure of distance.
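A sketch of the reconstruction step (invented, approximately two-dimensional data): keeping the first $ r $ directions reproduces $ \mathbf{X} $ approximately, while keeping all $ m $ reproduces it exactly.

```python
import numpy as np

rng = np.random.default_rng(10)
n, m, r = 500, 6, 2
F = rng.normal(size=(n, r))
X = F @ rng.normal(size=(r, m)) + 0.1 * rng.normal(size=(n, m))   # ~2-dimensional signal
X = X - X.mean(axis=0)

S_X = (X.T @ X) / n
eigvals, E = np.linalg.eigh(S_X)
order = np.argsort(eigvals)[::-1]
eigvals, E = eigvals[order], E[:, order]
Q = X @ E

X_full = Q @ E.T                   # exact reconstruction using all m directions
X_trunc = Q[:, :r] @ E[:, :r].T    # approximate reconstruction using the r most principal

print(np.allclose(X_full, X))      # True
xi = X_trunc - X                   # reconstruction error
print(np.mean(xi**2))              # small if the r retained directions suffice
```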

We would now like to define a statistic associated with outlier identification, and as in usual regression analysis, the reconstruction error (residuals) plays a key role. In particular, we follow the contributions of Jackson and Mudholkar (1979) and define $$ \mathbf{SPE} = \mathbf{\xi} \mathbf{\xi}^{\top} $$ as the squared prediction error, most resembling the usual sum of squared residuals. Noting that the $ i^{\text{th}} $ diagonal element of $ \mathbf{SPE} $, namely $ \mathbf{SPE}_{ii} = \mathbf{C}_{i}^{\top} (\mathbf{SPE}) \mathbf{C}_{i} $, is associated with the $ i^{\text{th}} $ observation, Jackson and Mudholkar (1979) show that if the observations (row vectors) in $ \mathbf{X} $ are independent and identically distributed Gaussian random variables, each $ \mathbf{SPE}_{ii} $ has the following distribution $$ \mathbf{SPE}_{ii} \sim \sum\limits_{j=r+1}^{m}\lambda_{(j)}Z_{j}^{2} \equiv \Psi(r) $$ where the $ Z_{j} $ are independent standard normal random variables, so that the $ Z_{j}^{2} $ are independent $ \chi^{2}_{1} $ variables ($ \chi^{2}_{p} $ denoting the $ \chi^{2}- $distribution with $ p $ degrees of freedom), and $ r $ is the number of retained principal directions. We can now derive a rule for outlier detection. In particular, should $ \mathbf{SPE}_{ii} $, for any $ i $, exceed the critical value defined by the upper $ (1 - \alpha) $ quantile of $ \Psi(r) $, that observation would be considered an outlier.
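A sketch of this rule follows (invented data with a planted outlier in the first row; for simplicity, the critical value is obtained by simulating the weighted chi-square distribution directly rather than through the closed-form approximation derived in Jackson and Mudholkar (1979)).

```python
import numpy as np

rng = np.random.default_rng(11)
n, m, r = 500, 6, 2
F = rng.normal(size=(n, r))
X = F @ rng.normal(size=(r, m)) + 0.1 * rng.normal(size=(n, m))
X[0] += 5.0                          # plant an outlying observation in row 0
X = X - X.mean(axis=0)

S_X = (X.T @ X) / n
eigvals, E = np.linalg.eigh(S_X)
order = np.argsort(eigvals)[::-1]
eigvals, E = eigvals[order], E[:, order]

# Per-observation squared prediction error (the diagonal of xi xi') from the
# reconstruction based on the r most principal directions.
X_trunc = (X @ E[:, :r]) @ E[:, :r].T
spe = np.sum((X - X_trunc)**2, axis=1)

# Stand-in critical value: simulate Psi(r) = sum_{j>r} lambda_(j) Z_j^2 with Z_j
# standard normal, and take its 99th percentile.
Z = rng.normal(size=(100_000, m - r))
psi_draws = Z**2 @ eigvals[r:]
cutoff = np.quantile(psi_draws, 0.99)

print(np.where(spe > cutoff)[0])     # indices flagged as potential outliers
```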

Closing Remarks

Principal component analysis is an extremely important multivariate statistical technique that is often misunderstood and abused. The hope is that in reading this entry you will have found the intuition one often seeks in complicated subject matters, with just enough mathematical rigour to ease any serious future undertakings. In Part II of this series, we will use EViews to exhibit a PCA case study and demonstrate just how easy this is with a few clicks.


References

[1] Jushan Bai and Serena Ng. Determining the number of factors in approximate factor models. Econometrica, 70(1):191--221, 2002.
[2] Louis Guttman. Some necessary conditions for common-factor analysis. Psychometrika, 19(2):149--161, 1954.
[3] J. Edward Jackson and Govind S. Mudholkar. Control procedures for residuals associated with principal component analysis. Technometrics, 21(3):341--349, 1979.
[4] Henry F. Kaiser. The application of electronic computers to factor analysis. Educational and Psychological Measurement, 20(1):141--151, 1960.
[5] Henry F. Kaiser. A note on Guttman's lower bound for the number of common factors. British Journal of Statistical Psychology, 14(1):1--2, 1961.
