 Summarize and describe the dataset under consideration.
 Extract all principal (important) directions (features).
 Quantify how much variation (information) is explained by each principal direction.
 Determine how much variation each variable contributes in each principal direction.
 Reduce data dimensionality.
 Identify which variables are correlated and which correlations are more principal.
 Identify which observations are correlated with which variables.
Principal Component Analysis of US Crime Data
We will use PCA to study US crime data. In particular, our dataset summarizes the number of arrests per 100,000 residents in each of the 50 US states in 1973. The data contains four variables, three of which pertain to arrests associated with (and naturally named) MURDER, ASSAULT, and RAPE, whereas the last, named URBANPOP, contains the percentage of the population living in urban centers.Data Summary
To understand our data, we will first create a group object with the variables of interest. We can do this by selecting all four variables in the workfile by clicking on each while holding down the Ctrl button, rightclicking on any of the highlighted variables, moving the mouse pointer over Open in the context menu, and finally clicking on as Group. This will open a group object in a spreadsheet with the four variables placed in columns. The steps are reproduced in Figures 1a and 1b.








Principal Component Analysis of Crime Data
Doing PCA in EViews is trivial. From our group object window, click on View and click on Principal Components.... This opens the main PCA dialog. See Figure 5a and 5b below.


Summary of Fundamentals
As a first step, we are interested in summarizing PCA fundamentals. In particular, we seek an overview of eigenvalues and eigenvectors that result from applying the principal component decomposition to the covariance or correlation matrix associated with our variables of interest. To do so, consider the Display group, and select Table. The latter produces three tables summarizing the covariance (correlation) matrix, and the associated eigenvectors and eigenvalues.Associated to this output are several important options under the Component selection group. These include:
 Maximum number: This defaults to the theoretical maximum number of eigenvalues possible, which is the total number of variables in the group under consideration. In our case, this number is 4.
 Minimum eigenvalue: This defaults to 0. Nevertheless, selecting a positive value requests that all eigenvectors associated with eigenvalues less than this value are not displayed.
 Cumulative proportion: This defaults to 1. Choosing a value $\alpha < 1$ however, requests that only the most principal $ k $ eigenvalues and eigenvectors associated with explaining $ \alpha*100 \% $ of the variation are retained. Naturally, choosing $ \alpha=1 $ requests that all eigenvalues are displayed. See section Dimension Reduction in Part I of this series for further details.
Furthermore, consider momentarily the Calculation tab. Here, the Type dropdown offers the choice to apply the principal component decomposition either to the correlation or covariance matrix. For details, see sections Variance Decomposition and Change of Basis in Part I of this series. The choice essentially reduces to whether or not the variables under consideration exhibit similar scales. In other words, if variances of the underlying variables of interest are similar, then conducting PCA on the covariance matrix is certainly justified. Nevertheless, if the variances are widely different, then selecting the correlation matrix is more appropriate if interpretability and comparability are desired. EViews errs on the side of caution and defaults to using the correlation matrix. Since the table of summary statistics we produced in figure 3b clearly shows a lack of uniformity in standard deviations across the four variables of interest, we will stick with the default and use the correlation matrix. Hit OK.
Figure 6: PCA Table Output
The second (middle) table summarizes the eigenvectors associated with each of the principal eigenvalues. Naturally, the eigenvectors are also arranged in order of principality. Furthermore, whereas the eigenvalues highlight how much of the overall information is extracted in each principal direction, the eigenvectors reveal how much weight each variable has in each direction.
Recall from Part I of this series that all eigenvectors have length unity. Accordingly, the relative importance of any variable in a given principal direction is effectively the proportion of the eigenvector length (unity) attributed to that variable. For instance, in the case of the first eigenvector, $ [0.535899, 0.583184, 0.543432, 0.278191]^{\top} $, MURDER accounts for $ 0.535899^{2} \times 100\% = 0.287188\% $ of the overall direction length. Similarly, ASSAULT accounts for 0.340103% of the direction, and RAPE contributes 0.295318%. Evidently, the least important variable in the first principal direction is URBANPOP, which accounts for only 0.077390% of the direction length.
On the other hand, in the second principal direction, it is URBANPOP that carries most weight, contributing $ 0.872806 \times 100\% = 0.761790\% $ to the direction length. Accordingly, if feature extraction is the goal, it is clear (and rather obvious) that the first principal direction is roughly equally dominated by MURDER, ASSAULT, and RAPE, whereas the second principal direction is almost entirely governed by URBANPOP. For a theoretical exposition, see section Principal Components in Part I of this series.
At last, the third table is just the correlation matrix to which the eigendecomposition is applied. The latter, while important, is provided only as a reference.
Eigenvalue Plots and Dimensionality
Now that we have a rough picture of PCA fundamentals associated with our dataset, it is natural to ask whether we can proceed with dimensionality reduction in a more formal manner. One such way (albeit arbitrary, but widely popular) is to look at several eigenvalue plots and visually identify how many eigenvalues to retain.From the previous PCA output, click again on View, then Principal Components..., and select Eigenvalue Plots under the Display group. This is summarized in Figure 7 below.
Figure 7: PCA Dialog: Eigenvalue Plots
Figure 8: Eigenvalue Plots Output
A slightly more numeric approach discards all eigenvalues significantly below the eigenvalue average. Referring to the first table in Figure 6, we see that the average of the eigenvalues is 1, and the 2nd eigenvalue is in fact just below this cutoff. Since the 2nd value is so close to this average, while using the visual support we mentioned in the previous paragraph, it is safe to conclude that the scree plot analysis indicates that only the first two eigenvalues ought to be retained.
The second graph plots a line graph of the differences between successive eigenvalues. Superimposed on this graph is another horizontal line, this time with a value equal to the average of the differences of successive eigenvalues. Although EViews does not report this number, using the top table in Figure 6, it is not difficult to show that the average in question is $ (1.490476+0.633202+0.183133)/3 = 0.768937 $. The idea here is to retain all eigenvalues whose differences are above this threshold. Clearly, only the first two eigenvalues satisfy this criterion.
The final graph is a line graph of the cumulative proportion of information explained by successive principal eigenvalues. Superimposed on this graph is a line with a slope equal to the average of the eigenvalues, namely 1. The idea here is to retain those eigenvalues that form segments of the cumulative curve whose slopes are at least as steep as the line with slope 1. In our case, only two eigenvalues seem to form such a segment: eigenvalues 1 and 2.
All three graphical approaches indicate that one ought to retain the first two eigenvalues and their associated eigenvectors. There is however an entirely data driven methodology adapted from Bai and Ng (2002). We discussed this approach in section Dimension Reduction in Part I of this series. Nevertheless, EViews currently doesn't support its implementation via dialogs and it must be programmed manually. In this regard, we temporarily move away from our dialogbased exposition, and offer a code snippet which implements the aforementioned protocol.
'  Bai and Ng (2002) Protocol  group crime murder assault rape urbanpop ' create group with all 4 variables !obz = murder.@obs ' get number of observations !numvar = @columns(crime) ' get number of variables equation eqjr ' equation object to hold regression matrix(!numvar, !numvar) SSRjr' matrix to store SSR from each regression eqjr crime.makepcomp(cov=corr) s1 s2 s3 s4 ' get all score series for !j = 1 to !numvar for !r = 1 to !numvar %scrstr = "" ' holds score specification to extract ' generate string to specify which scores to use in regression for !r2 = 1 to !r %scrstr = %scrstr + " s" + @str(!r2) next eqjr.ls crime(!j) {%scrstr} ' estimate regression SSRjr(!j, !r) = (eqjr.@ssr)/!obz ' take average of SSR next next ' get column means of SSRjr. namely, get r means, averaging across regressions j. vector SSRr = @cmean(SSRjr) vector(!numvar) IC ' stores critical values for !r = 1 to !numvar IC(!r) = @log(SSRr(!r)) + !r*(!obz + !numvar)/(!obz*!numvar)*@log(!numvar) next ' take the index of the minimum value of IC as number of principal components to retain scalar numpc = @imin(IC)Unlike our graphical analysis, the protocol above suggests that the number of retained eigenvalues is 1. Nevertheless, for sake of greater analytical exposition below, we will stick with the original suggestion of retaining the first two principal directions instead.
Principal Direction Analysis
The next step in our analysis is to look at what, if any, meaningful patterns emerge by studying the principal directions themselves. To do so, we again bring up the main principal component dialog and this time select Variable Loading Plots under the Display group. See Figure 9 below.Figure 9: PCA Dialog: Variable Loading Plots
 Normalize Loadings: In this case, scaling is unity and loading vectors are in fact the eigenvectors themselves.
 Normalize Scores: Here, the scaling factor is the square root of the eigenvalue vector. In other words, the $ k^{\text{th}} $ element of the $ i^{\text{th}} $ loading vector is the $ k^{\text{th}} $ element of the $ i^{\text{th}} $ eigenvector, multiplied by the square root of the $ k^{\text{th}} $ eigenvalue.
 Symmetric Weights: In this scenario, the scaling factor is the quartic (fourth) root of the eigenvalue vector. Namely, the $ k^{\text{th}} $ element of the $ i^{\text{th}} $ loading vector is the $ k^{\text{th}} $ element of the $ i^{\text{th}} $ eigenvector, multiplied by the fourth root of the $ k^{\text{th}} $ eigenvalue.
 User Loading Weight: If $ 0 \leq \omega \leq 1 $ denotes the user defined scaling factor, then the loading vectors are formed by scaling the $ k^{\text{th}} $ element of the corresponding eigenvector by the $ k^{\text{th}} $ eigenvalue raised to the power $ \omega/2 $ .
Figure 10: Variable Loading Plots Output
Furthermore, all three variables, MURDER, ASSAULT, and RAPE, are strongly correlated with the first principal direction, whereas URBANPOP is strongly correlated with the second principal direction. In fact, looking at vector lengths, we can also see that MURDER, ASSAULT, and RAPE are roughly equally dominant in the first direction, whereas URBANPOP is significantly more dominant than either of the former three, albeit in the second direction. Of course, this simply confirms our preliminary analysis of the middle table in Figure 6.
Above, we started with the basic loading vector with scale unity. We could have, of course, resorted to other scaling options such as normalizing to the score vectors, using symmetric weights, or using some other custom weighting. Since each of these would yield a different but similar perspective, we won't delve further into details. Nevertheless, as an exercise in exhibiting the steps involved, we provide below small snippets of code to manually generate loading vectors using only the eigenvalues and eigenvectors associated with the underlying correlation matrix. This is done for each of the four scaling protocols. These manually generated vectors are then compared to the loading vectors generated by EViews' internal code and shown to be identical.
'  Verify Loading Plot Vectors  group crime murder assault rape urbanpop ' create group with all 4 variables ' make eigenvalues and eigenvectors based on the corr. matrix crime.pcomp(eigval=eval, eigvec=evec, cov=corr) 'normalize loadings crime.makepcomp(loading=load, cov=corr) s1 s2 s3 s4 ' EViews generated loading vectors matrix evaldiag = @makediagonal(eval) ' create diagonal matrix of eigenvalues matrix loadverify = evec*evaldiag ' manually create loading vector with scaling unity matrix loaddiff = loadverify  load ' get difference between custom and eviews output show loaddiff ' display results 'normalize scores crime.makepcomp(scale=normscores, loading=load, cov=corr) s1 s2 s3 s4 loadverify = evec*@epow(evaldiag, 0.5) loaddiff = loadverify  load show loaddiff 'symmetric weights crime.makepcomp(scale=symmetrics, loading=load, cov=corr) s1 s2 s3 s4 loadverify = evec*@epow(evaldiag, 0.25) loaddiff = loadverify  load show loaddiff 'user weights crime.makepcomp(scale=0.36, loading=load, cov=corr) s1 s2 s3 s4 loadverify = evec*@epow(evaldiag, 0.18) loaddiff = loadverify  load show loaddiff
Score Analysis
Whereas loading vectors reveal information on which variables dominate (and by how much) each principal direction, it is only when they are used to create the principal component vectors (score vectors) that they are truly useful in a data exploratory sense. In this regard, we again open the main principal component dialog and select Component scores plots in the Display group of options. We capture this in Figure 11 below.Figure 11: PCA Dialog: Component Scores Plots
 Normalize Loadings: Score vectors are scaled by unity. In other words, no scaling occurs.
 Normalize Scores: The $ k^{\text{th}} $ score vector is scaled by the inverse of the square root of the $ k^{\text{th}} $ eigenvalue.
 Symmetric Weights: The $ k^{\text{th}} $ score vector is scaled by the inverse of the quartic root of the $ k^{\text{th}} $ eigenvalue.
 User Loading Weight: If $ 0 \leq \omega \leq 1 $ denotes the user defined scaling factor, the $ k^{\text{th}} $ score vector is scaled by the $ k^{\text{th}} $ eigenvalue raised to the power $ \omega/2 $.
Figure 12: Component Scores Plots Output
First, the further east of the zero vertical axis a state is located, the more positively correlated it is with the first principal direction. Since the latter is dominated positively (east of the zero vertical axis) by the three crime categories MURDER, ASSAULT, and RAPE (see Figure 11), we conclude that such states are positively correlated with said crimes. Naturally, converse conclusions hold as well. In particular, we see that CALIFORNIA, NEVADA, and FLORIDA are most positively correlated with the three crimes under consideration. If this is indeed the case, then it is little surprise that most Hollywood productions typically involve crime thrillers set in these three states. Conversely, NORTH DAKOTA and VERMONT are typically least associated with the crimes under consideration.
Second, the further north of the zero horizontal axis a state is located, the more positively correlated it is with the second principal direction. Since the latter is dominated positively (north of the zero horizontal axis) by the variable URBNAPOP (see Figure 11), we conclude that such states are positively correlated with urbanization. Again, the converse conclusions hold as well. In particular, HAWAII, CALIFORNIA, RHODE ISLAND, MASSACHUSETTS, UTAH, NEW JERSEY are states most positively associated with urbanization, whereas those least so are SOUTH CAROLINA, NORTH CAROLINA, and MISSISSIPPI.
Lastly, it is worth recalling that like loading vectors, score vectors can also be scaled. In this regard, we provide code snippets below to show how to manually compute scaled score vectors, exposing the algorithm that EViews uses to do same in its internal computations.
'  Verify Score Vectors  ' make eigenvalues and eigenvectors based on the corr. matrix crime.pcomp(eigval=eval, eigvec=evec, cov=corr) matrix evaldiag = @makediagonal(eval) ' create diagonal matrix of eigenvalues stom(crime, crimemat) ' create matrix from crime group vector means = @cmean(crimemat) ' get column means vector popsds = @cstdevp(crimemat) ' get population standard deviations ' initialize matrix for normalized crimemat matrix(@rows(crimemat), @columns(crimemat)) crimematnorm ' normalize (remove mean and divide by pop. s.d.) every column of crimemat for !k = 1 to @columns(crimemat) colplace(crimematnorm,(@columnextract(crimemat,!k)  means(!k))/popsds(!k),!k) next 'normalize loadings crime.makepcomp(cov=corr) s1 s2 s3 s4 ' get score series group scores s1 s2 s3 s4 ' put scores into group stom(scores, scoremat) ' put scores group into matrix matrix scoreverify = crimematnorm*evec ' create custom score matrix matrix scorediff = scoreverify  scoremat ' get difference between custom and eviews output show scorediff 'normalize scores crime.makepcomp(scale=normscores, cov=corr) s1 s2 s3 s4 group scores s1 s2 s3 s4 stom(scores, scoremat) scoreverify = crimematnorm*evec*@inverse(@epow(evaldiag, 0.5)) scorediff = scoreverify  scoremat show scorediff 'symmetric weights crime.makepcomp(scale=symmetrics, cov=corr) s1 s2 s3 s4 group scores s1 s2 s3 s4 stom(scores, scoremat) scoreverify = crimematnorm*evec*@inverse(@epow(evaldiag, 0.25)) scorediff = scoreverify  scoremat show scorediff 'user weights crime.makepcomp(scale=0.36, cov=corr) s1 s2 s3 s4 group scores s1 s2 s3 s4 stom(scores, scoremat) scoreverify = crimematnorm*evec*@inverse(@epow(evaldiag, 0.18)) scorediff = scoreverify  scoremat show scorediffAbove, observe that we derived eigenvalues and eigenvectors of the correlation matrix. Accordingly, to derive the score vectors manually, we needed to standardize the original variables first. In this regard, when using the covariance matrix instead, one need only to demean the original variables and disregard scaling information. We leave this as an exercise to interested readers.
Biplot Analysis
As a last exercise, we superimpose the loading vectors and score vectors onto a single graph called the biplot. To do this, again, bring up the main principal component dialog and under the Display group select Biplot (scores & loadings). As in the previous exercise, under the Graph options group, select Label all obs. from the Obs. labels dropdwon, and hit OK. We summarize these steps in Figure 13.Figure 13: PCA Dialog: Biplots (scores & loadings)
Figure 14: Biplots (scores & loadings) Output
Concluding Remarks
In Part I of this series we laid out the theoretical foundations underlying PCA. Here, we used EViews to conduct a brief data exploratory implementation of PCA on serious crimes across 50 US states. Our aim was to illustrate the use of numerous PCA tools available in EViews with brief interpretations associated with each.In closing, we would like to point out that apart from the main principal component dialog we used above, EViews also offers a Make Principal Components... proc function which provides a unified framework for producing vectors and matrices of the most important objects related to PCA. These include the vector of eigenvalues, the matrix of eigenvectors, the matrix of loading vectors, as well as the matrix of scores. To access this function, open the crime group from the workfile, click on Proc and click on Make Principal Components.... We summarize this in Figures 15a and 15b below.


Files
The EViews workfile can be downloaded here: usarrests.wf1The EViews program file can be downloaded here: usarrests.prg
References
[1]  Jushan Bai and Serena Ng. Determining the number of factors in approximate factor models. Econometrica, 70(1):191221, 2002. 
No comments:
Post a Comment