Simplifying development of predictive models by identification of key components and their impact
Analysis of complex data often involves dealing with a multidimensional dataset. Most of the time, there’s a lot of noise in the data and one has to navigate through this noise to identify the useful information, or signal. Further, it is an arduous task to visualize data that exceeds three dimensions.
Principal Component Analysis (PCA) is a technique for reducing the dimensionality of our dataset while retaining most of the meaningful variation in the data. The aim is to reduce the complexity of the data while losing as little of its information as possible. This is known as ‘Dimensionality Reduction’.
Typically, features with a lot of variance tend to contain meaningful information. Predictive algorithms attempt to learn these variations and are consequently able to identify patterns in the data. Having features with high variance is therefore a basic requirement for building accurate predictive models. PCA captures these variations in the data and can help predictive algorithms produce good results.
PCA is also able to generate features that are uncorrelated with each other. This matters because, while building a predictive model, it is advisable to select features that are independent of each other in order to avoid multicollinearity.
Multicollinearity occurs when we have features in our dataset that are correlated with each other. The idea when selecting features is to have a set of independent features through which the value of a dependent feature or target variable can be predicted. If we have features that are correlated, a slight variation in one feature might begin to affect other features and consequently lead to reduced precision of our model.
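To get a feel for what correlated features look like in numbers, here is a minimal sketch using NumPy. The two feature arrays are made-up values chosen purely for illustration; they are not taken from any real dataset.

```python
import numpy as np

# Hypothetical measurements of two features for six samples (illustration only).
feature_a = np.array([10.0, 11.0, 8.0, 3.0, 2.0, 1.0])
feature_b = np.array([6.0, 4.0, 5.0, 3.0, 2.8, 1.0])

# Pearson correlation matrix; rows and columns are ordered (feature_a, feature_b).
correlation = np.corrcoef(feature_a, feature_b)
r = float(correlation[0, 1])

# A value close to +1 or -1 signals strongly correlated features.
print(f"correlation between feature_a and feature_b: {r:.2f}")
```

A correlation this far from zero is the kind of situation in which multicollinearity becomes a concern.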
Another reason for using PCA is to reduce the number of features that we select for our predictive model. The more features we select, the more our model grows in complexity and the more likely it is to overfit the data. Overfitting can lead to excellent model performance on our training set but considerably poorer performance on our test set. The idea behind using PCA is to retain the information in our dataset while also reducing the model complexity.
To understand Principal Component Analysis, let’s take an example of a mouse and genes dataset.
Let’s consider plotting a graph taking only one feature from our dataset. We are taking that feature to be ‘Gene_1’. Considering that we have only one feature, we’ll be having only a single dimension in our graph.
Now, let’s take two features to plot our graph, namely ‘Gene_1’ and ‘Gene_2’. In this case, we’ll be dealing with two dimensions.
Continuing with the same trend, let’s now take three features into consideration, ‘Gene_1’, ‘Gene_2’ and ‘Gene_3’. Now, we’ll be having a three dimensional plot.
Note: The samples of data that are larger in value for ‘Gene_3’ are smaller in size as they’ll be located away from the origin, while those smaller in value will be nearer the origin.
Now, if we want to plot a graph with a fourth feature, a four dimensional plot is required which is not possible to draw.
Principal Component Analysis (PCA) can take these four or more dimensional data and can squeeze them in a two dimensional plot which can make it easier for us to analyze.
Let’s try to understand how this is possible…
Let’s return to our previous two dimensional plot to gain an intuition behind it.
The point in the middle denoted by ‘X’ represents the center of the data, i.e. the average of ‘Gene_1’ and the average of ‘Gene_2’. Once we are aware of the center, we shift the data such that the center coincides with the origin of the axes.
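The centering step can be sketched in a few lines of NumPy. The gene values below are hypothetical numbers for six mice, used only to make the example runnable.

```python
import numpy as np

# Hypothetical expression values for six mice (illustration only).
gene_1 = np.array([10.0, 11.0, 8.0, 3.0, 2.0, 1.0])
gene_2 = np.array([6.0, 4.0, 5.0, 3.0, 2.8, 1.0])
data = np.column_stack([gene_1, gene_2])  # one row per mouse, one column per gene

# The center of the data is the mean of each column (the 'X' in the plot).
center = data.mean(axis=0)

# Subtracting the center shifts the data so that the center sits on the origin.
centered = data - center

print("center:", center)
print("mean after centering:", centered.mean(axis=0))
```

Shifting the data does not change how the points are positioned relative to each other; it only moves the whole cloud.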
We then pass a line through the origin and project the data points onto it.
In order for the line passing through the origin to be the best fit line for the data, we need to minimize the distances between the data points and the line, measured along the projections (the thin black lines). Equivalently, we can maximize the distances of the projected points from the origin (the red lines).
To understand this concept of minimization and maximization, let’s build an intuition.
As we try to get a line that better fits our data, the length of the projected lines (the thin black lines) gradually decreases, while the distance of the projected points from the origin (the red lines) gradually increases.
To understand this mathematically:
Let’s assume that we have a single data point in our x-y plane. The length of the projected line is denoted by ‘b’, the distance of the projected point from the origin is denoted by ‘c’ and the distance of the data point from the origin is denoted by ‘a’.
We are well aware that while projecting a point onto a line, the projected line is perpendicular to that line.
Since this property is true, we are presented with a right angled triangle and ‘Pythagoras’ theorem’ can be applied here: a² = b² + c².
Since the data point itself does not move, the value of a² remains constant even if we change the orientation of the blue line. The only quantities that vary are b² and c².
We can also observe that as we fit the blue line more closely to our data, the value of ‘b’ decreases. Consequently, the value of ‘b²’ also decreases.
According to ‘Pythagoras’ theorem’, if ‘b²’ decreases and ‘a²’ remains constant, the only way we can maintain the equality is by increasing the value of ‘c²’.
Intuitively, it is ‘b’ that has to be minimized but it is actually ‘c’ that is easier to calculate.
PCA therefore attempts to maximize the value of ‘c’.
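A quick numerical check of this trade-off, written as a small sketch. The data point (4, 1) and the list of angles are arbitrary choices for illustration, and `decompose` is a hypothetical helper name.

```python
import math

def decompose(point, angle_deg):
    """Return (a, b, c) for a point and a line through the origin at angle_deg."""
    x, y = point
    theta = math.radians(angle_deg)
    ux, uy = math.cos(theta), math.sin(theta)  # unit vector along the line
    c = x * ux + y * uy        # signed distance of the projected point from the origin
    b = abs(x * uy - y * ux)   # perpendicular distance from the point to the line
    a = math.hypot(x, y)       # distance of the point from the origin
    return a, b, c

point = (4.0, 1.0)  # arbitrary example point
for angle in (0, 5, 10, 14, 30):
    a, b, c = decompose(point, angle)
    # a² never changes; whatever b² loses, c² gains.
    print(f"angle={angle:>2}  a²={a * a:.3f}  b²={b * b:.3f}  c²={c * c:.3f}")
```

Whatever the angle, b² and c² always add up to the same a², which is why minimizing ‘b’ and maximizing ‘c’ lead to the same line.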
Now that we are clear with this picture in mind, we shall attempt to apply this method for the data we have in our hand.
PCA performs the following steps:-
- It projects the points onto the line that is to be fit.
- It measures the distance of each projected point from the origin. We denote this by dₓ, where ‘x’ is the index of the data point.
- It then squares each distance so that negative values do not cancel out the positive ones.
- Finally, it sums up all the squared distances.
And thus we are presented with this: d₁² + d₂² + d₃² + d₄²
This quantity is known as Sum of squared distances. We’ll be representing this by ‘SS’.
We’ll now rotate the blue line by a small angle while calculating the sum of the squared distances (SS) in an attempt to find the best fit line to the data. This process is repeated until we achieve the maximum value of SS.
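The rotate-and-measure procedure can be sketched as a brute-force search over angles. The gene values are hypothetical, the 0.1 degree step is an arbitrary choice, and `sum_of_squares` is a name made up for this example. Real PCA implementations do not search like this; the sketch only mirrors the intuition.

```python
import numpy as np

# Hypothetical expression values for six mice (illustration only).
gene_1 = np.array([10.0, 11.0, 8.0, 3.0, 2.0, 1.0])
gene_2 = np.array([6.0, 4.0, 5.0, 3.0, 2.8, 1.0])
data = np.column_stack([gene_1, gene_2])
centered = data - data.mean(axis=0)

def sum_of_squares(points, angle_deg):
    """SS of the projected distances for a line through the origin at angle_deg."""
    theta = np.radians(angle_deg)
    direction = np.array([np.cos(theta), np.sin(theta)])  # unit vector along the line
    distances = points @ direction                         # signed distances d_x
    return float(np.sum(distances ** 2))

# Rotate the line in small steps and keep the angle with the largest SS.
angles = np.arange(0.0, 180.0, 0.1)
ss_values = [sum_of_squares(centered, angle) for angle in angles]
best_index = int(np.argmax(ss_values))
best_angle = float(angles[best_index])
best_ss = ss_values[best_index]

print(f"best angle: {best_angle:.1f} degrees, SS: {best_ss:.2f}")
```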
We’ve finally achieved the best fit line for the data in the above graph. For this line, we have the largest value for SS. This best fit line is also called the ‘Principal Component 1’ or ‘PC1’.
Let’s assume that for this data distribution, we have a PC1 slope of ‘0.25’.
This means that for every 4 units we move along the ‘X-axis’ (‘Gene_1’), we move 1 unit along the ‘Y-axis’ (‘Gene_2’). What we can understand from this is that the data is mostly spread out along the ‘Gene_1’ axis and comparatively less spread out along the ‘Gene_2’ axis.
This is analogous to a food recipe. In order to make PC1, we’ll first add 4 parts of ‘Gene_1’ and 1 part ‘Gene_2’. This tells us that for PC1, ‘Gene_1’ is more important than ‘Gene_2’ relatively speaking.
This combination of different features to make PC1 is called ‘Linear Combination’.
So, for PC1 we have a linear combination of:-
- 4 parts of Gene_1
- 1 part of Gene_2
By following the ‘Linear Combination’ of variables we get a vector interpretation of PC1.
Using Pythagoras’ theorem, we can measure the magnitude of the PC1 vector: √(4² + 1²) = √17 ≈ 4.12.
When PCA is performed with Singular Value Decomposition (SVD), the PC1 vector is scaled to a unit vector, i.e. each of its components is divided by this magnitude. When we apply PCA with SVD we are presented with the following results.
We can observe that even though the values for the Linear Combination have changed for the PC1 unit vector, the ratio remains the same for individual components.
This unit vector for PC1 is known as the ‘Eigenvector’ or ‘Singular Vector’. The proportions of Gene_1 and Gene_2 in it are called the ‘Loading Scores’. The Loading Scores are instrumental in telling us which feature is more important. For example, for PC1, in terms of the data projected onto it, the Loading Scores tell us that Gene_1 is 4 times more important than Gene_2.
The Loading Scores for PC1 are:-
- 0.97 parts of Gene_1
- 0.242 parts of Gene_2
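The scaling from the ‘4 parts, 1 part’ recipe to the Loading Scores is plain arithmetic, sketched here with the standard library only. The variable names are made up for the example.

```python
import math

# The PC1 recipe from the text: 4 parts Gene_1, 1 part Gene_2.
recipe = (4.0, 1.0)

# Magnitude of the recipe vector via Pythagoras' theorem: sqrt(4² + 1²) = sqrt(17).
magnitude = math.hypot(*recipe)

# Dividing every component by the magnitude gives a vector of length 1.
loading_scores = tuple(part / magnitude for part in recipe)

print(f"magnitude: {magnitude:.3f}")
print(f"loading scores: Gene_1={loading_scores[0]:.3f}, Gene_2={loading_scores[1]:.3f}")
print(f"ratio Gene_1 / Gene_2: {loading_scores[0] / loading_scores[1]:.1f}")
```

The scaled values come out at roughly 0.97 and 0.24, and their ratio is still 4 to 1.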
PCA calls the Sum of the Squared distances or the SS distance for the best fit line the ‘Eigenvalue’ for PC1 and the square root of the Eigenvalue is called the ‘Singular value for PC1’.
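Here is a minimal sketch of how these quantities relate when computed with NumPy’s `np.linalg.svd`. The gene values are hypothetical. Note that the sketch follows the naming used in this post (Eigenvalue = SS); many libraries instead report SS divided by n-1 as the eigenvalue, so treat the names as an assumption of this example.

```python
import numpy as np

# Hypothetical expression values for six mice (illustration only).
gene_1 = np.array([10.0, 11.0, 8.0, 3.0, 2.0, 1.0])
gene_2 = np.array([6.0, 4.0, 5.0, 3.0, 2.8, 1.0])
data = np.column_stack([gene_1, gene_2])
centered = data - data.mean(axis=0)

# SVD of the centered data: rows of vt are the unit vectors for PC1, PC2, ...
u, s, vt = np.linalg.svd(centered, full_matrices=False)
pc1 = vt[0]

# Project every point onto PC1 and sum the squared distances from the origin.
distances = centered @ pc1
ss_pc1 = float(np.sum(distances ** 2))

eigenvalue_pc1 = ss_pc1                         # 'Eigenvalue' in this post's terms
singular_value_pc1 = float(np.sqrt(ss_pc1))     # square root of the Eigenvalue

print(f"SS for PC1: {ss_pc1:.3f}")
print(f"singular value from SS: {singular_value_pc1:.3f}, from SVD: {s[0]:.3f}")
```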
Principal Component 2 or PC2 is simply a line through the origin that is perpendicular to PC1.
This is done in view of the principle of orthogonality. Two vectors are considered orthogonal if they are perpendicular to each other. Thus any change or shift of data in one axis or movement of one vector does not generate any corresponding change or shift in an orthogonal axis or the orthogonal vector. The two vectors are essentially independent of each other.
The data is projected onto PC2 in a similar fashion to PC1.
Since PC2 is orthogonal to PC1, the recipe for PC2 is:-
- -1 part Gene_1
- 4 parts Gene_2
After scaling through SVD, the Loading Scores for PC2 are:-
- -0.242 parts of Gene_1
- 0.97 parts of Gene_2
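We can verify that the two recipes really are perpendicular by checking that their dot product is zero. This is a small standard-library sketch; `to_unit` is a hypothetical helper name.

```python
import math

pc1_recipe = (4.0, 1.0)    # 4 parts Gene_1, 1 part Gene_2
pc2_recipe = (-1.0, 4.0)   # -1 part Gene_1, 4 parts Gene_2

# Two vectors are orthogonal exactly when their dot product is zero.
dot_product = sum(p * q for p, q in zip(pc1_recipe, pc2_recipe))

def to_unit(vector):
    """Scale a vector to length 1 by dividing by its magnitude."""
    magnitude = math.hypot(*vector)
    return tuple(component / magnitude for component in vector)

pc2_loading_scores = to_unit(pc2_recipe)

print("dot product of the recipes:", dot_product)
print(f"PC2 loading scores: Gene_1={pc2_loading_scores[0]:.3f}, "
      f"Gene_2={pc2_loading_scores[1]:.3f}")
```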
PC1 and PC2 are then rotated such that PC1 is horizontal. We then use the projected points on the Principal Components to plot the data points in the new dimensional plane.
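In code, ‘rotating so that PC1 is horizontal’ amounts to expressing every point by its position along PC1 and PC2. The sketch below does this with NumPy on hypothetical gene values; the name `scores` for the new coordinates is my own choice.

```python
import numpy as np

# Hypothetical expression values for six mice (illustration only).
gene_1 = np.array([10.0, 11.0, 8.0, 3.0, 2.0, 1.0])
gene_2 = np.array([6.0, 4.0, 5.0, 3.0, 2.8, 1.0])
data = np.column_stack([gene_1, gene_2])
centered = data - data.mean(axis=0)

# Rows of vt are the unit vectors for PC1 and PC2.
_, _, vt = np.linalg.svd(centered, full_matrices=False)

# New coordinates: column 0 is the position along PC1, column 1 along PC2.
scores = centered @ vt.T

for mouse, (along_pc1, along_pc2) in enumerate(scores, start=1):
    print(f"mouse {mouse}: PC1={along_pc1:+.2f}  PC2={along_pc2:+.2f}")
```

The rotation keeps every point at the same distance from the origin; only the axes we read the coordinates from have changed.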
We now have an entirely new frame in which to analyze our data. The original axes serve no further purpose and can be dropped.
We observe that the data points are spread out mostly along PC1. This brings us to an important observation. There is a larger amount of variation in the data along PC1 than along PC2. In fact, PC1 exhibits the largest amount of variation of any Principal Component and is thus an important feature in the analysis of our data.
This is because it is the variation in data that machine learning algorithms attempt to learn in order to make predictions in the future.
- SS(distances for PC1) = Eigenvalue for PC1
- SS(distances for PC2) = Eigenvalue for PC2
The Eigenvalues are used to calculate how much of the variation each Principal Component accounts for.
- SS(distances for PC1)/(n-1) = Variation for PC1
- SS(distances for PC2)/(n-1) = Variation for PC2
Where ‘n’ represents the sample size or, in simple words, the number of data points.
- Variation for PC1 = 15
- Variation for PC2 = 3
Therefore, total variation = 18
- PC1 accounts for (15/18) = 0.83 = 83% of total variation of the PCs
- PC2 accounts for (3/18) = 0.17 = 17% of total variation of the PCs
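The same arithmetic as a short sketch. The variation values 15 and 3 come from the example above; the sample size of 6 and the SS values of 75 and 15 are assumptions picked only so that SS/(n-1) reproduces those numbers.

```python
def variation(ss, n):
    """Variation for a Principal Component: SS of its distances divided by n - 1."""
    return ss / (n - 1)

n = 6                                   # assumed number of data points
ss = {"PC1": 75.0, "PC2": 15.0}         # assumed SS values (illustration only)

variations = {name: variation(value, n) for name, value in ss.items()}
total_variation = sum(variations.values())

# Share of the total variation that each Principal Component accounts for.
shares = {name: value / total_variation for name, value in variations.items()}

for name in variations:
    print(f"{name}: variation={variations[name]:.0f}, share={shares[name]:.0%}")
```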
When we have three or more features, the concept of PCA remains pretty much the same.
- The center of the data is found.
- A best fitting line is drawn through the origin.
- The best fitting line is called PC1.
- A line is drawn orthogonal to PC1 and is called PC2.
- Similarly, a third line is drawn orthogonal to both PC1 and PC2. This is called PC3.
Therefore, all PCs are perpendicular to each other.
Note: In theory, there is one Principal Component per variable or feature, but in practice, the number of PCs is either the number of variables or the number of samples, whichever is smaller.
Once we have all PCs drawn out, we can use the Eigenvalues i.e. SS distances to determine the proportion of variation each PC accounts for.
So, let’s assume that in this case,
- PC1 accounts for 79%
- PC2 accounts for 15%
- PC3 accounts for 6%
We can draw a Scree plot of this PCA.
We can observe that PC1 and PC2 together account for 94% of the variation in the data. Therefore, we can omit PC3 from our analysis and use only PC1 and PC2.
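Deciding how many PCs to keep can be sketched as a running total over the percentages. The 90% cut-off below is an arbitrary threshold chosen for illustration, not a rule from this post, and the bars are a rough text stand-in for a Scree plot.

```python
# Percentage of variation per Principal Component, from the example above.
shares = {"PC1": 79, "PC2": 15, "PC3": 6}

# A rough text version of the Scree plot: one '#' per 2% of variation.
for name, percent in shares.items():
    print(f"{name} {'#' * (percent // 2):<40} {percent}%")

threshold = 90  # assumed cut-off: keep enough PCs to cover 90% of the variation

# Running total of the variation covered by PC1, PC1+PC2, PC1+PC2+PC3.
cumulative = []
running_total = 0
for percent in shares.values():
    running_total += percent
    cumulative.append(running_total)

# Smallest number of PCs whose cumulative share reaches the threshold.
kept = next(i + 1 for i, total in enumerate(cumulative) if total >= threshold)

print("cumulative:", cumulative)
print("number of PCs kept:", kept)
```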
This is how ‘Dimensionality Reduction’ comes into play. We effectively reduced the number of dimensions that our data depended upon from three to two.
However, sometimes we get a Scree plot in which the Principal Components contribute almost equally.
In such a case, taking only a few PCs into consideration will not be enough to account for most of the variation. However, even a noisy PCA plot can help in identifying clusters in the data.
Employing Principal Component Analysis in the analysis of our data has its fair share of benefits. However, we do face a few drawbacks while using this method.
Once we run our features through PCA, it effectively creates a new set of features that holds the information present in the original features. These new features are the Principal Components. Since each one is a mix of the original features, without domain expertise we really can’t tell what the Principal Components represent. Thus we have a general loss of interpretability.
This, however, is not a major setback for our predictive model. If we are able to retain the important information in our data with fewer features, PCA remains a useful technique that also lowers the risk of the model overfitting the data.
Thanks for reading this blog. I would appreciate hearing your thoughts on this.
- Josh Starmer, StatQuest: Principal Component Analysis (PCA), Step-by-Step. https://www.youtube.com/watch?v=FgakZw6K1QQ
- Tony Yiu, Understanding PCA (Principal Components Analysis). https://towardsdatascience.com/understanding-pca-fae3e243731d