Monday, October 15, 2018

Principal Component Analysis: Part I (Theory)

Most students of econometrics are taught to appreciate the value of data. We are generally taught that more data is better than less, and that throwing data away is almost "taboo". While this is generally good practice when it concerns the number of observations per variable, it is not always recommended when it concerns the number of variables under consideration. In fact, as the number of variables increases, it becomes increasingly difficult to rank the importance (impact) of any given variable, and the model becomes prone to problems ranging from basic overfitting to more serious issues such as multicollinearity or model invalidity. In this regard, selecting the smallest number of the most meaningful variables -- otherwise known as dimensionality reduction -- is not a trivial problem. It has become a staple of modern data analytics and a motivation for many modern techniques. One such technique is Principal Component Analysis (PCA).