## Detailed theoretical explanation and scikit-learn example

Data has become more valuable than ever with the tremendous advancement in data science. Real-life datasets usually have many features (columns). Some of the features may be uninformative or correlated with other features. However, we may not know this beforehand, so we tend to collect as much data as possible. In some cases, it is possible to accomplish the task without using all the features. For computational and performance reasons, it is desirable to do a task with fewer features if possible. Uninformative features do not provide any predictive power and also add a computational burden.

Let’s assume we are trying to predict the shooting accuracy of basketball players. The dataset includes the distance to the basket, the angle of the shot, the position of the defender, the accuracy of previous shots and the color of the ball. It is glaringly obvious that the color of the ball has no relation to shooting accuracy, so we can just remove it. Real-life cases are rarely that obvious, and we need to do some preprocessing to determine which features are uninformative. The correlation among features, or between a feature and the target variable, can easily be calculated using software packages.
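As a rough illustration of that last point, here is a minimal sketch using pandas. The column names and all the numbers are made up for the basketball example (they are not real data), and the ball color is assumed to be encoded as 0 or 1.

```python
import pandas as pd

# Hypothetical toy data for the basketball example (all values are invented).
shots = pd.DataFrame({
    "distance_m": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "ball_color": [0, 1, 0, 1, 1, 0],   # assumed encoding: 0 = orange, 1 = blue
    "accuracy":   [0.90, 0.82, 0.71, 0.65, 0.52, 0.45],
})

# Pairwise Pearson correlation between all columns.
corr = shots.corr()

# Absolute correlation of each feature with the target, strongest first.
target_corr = corr["accuracy"].drop("accuracy").abs().sort_values(ascending=False)
print(target_corr)
```

In this toy data the distance is strongly correlated with accuracy while the ball color is not. Keep in mind that Pearson correlation only captures linear relationships, so a low value is a hint rather than proof that a feature is uninformative.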

Some tasks inherently come with a high number of features. For example, an image classification task with 8x8 pixel images has 64 features. We can find a way to represent these images with fewer features without losing a considerable amount of information. Depending on the field you work in, you may even encounter datasets with more than a thousand features. In such cases, reducing the number of features is a challenging yet very beneficial task.

As the number of features increases, the performance of a classifier tends to decrease after some point. More features result in more combinations that the model needs to learn in order to accurately predict the target. Therefore, with the same number of observations (rows), models tend to perform better on datasets with fewer features. Moreover, a high number of features increases the risk of overfitting.

There are two main methods to reduce the number of features. The first one is **feature selection**, which aims to find the most informative features or eliminate uninformative ones. Feature selection can be done manually or using software tools. The second way is to derive new features from the existing ones while keeping as much information as possible. This process is called **feature extraction** or **dimensionality reduction**.

What do I mean by “keeping as much information as possible”? How do we measure the amount of information? The answer is **variance**, which is a measure of how much a variable is spread out. If the variance of a variable (feature) is very low, it does not tell us much when building a model. The figure below shows the distribution of two variables, x and y. As you can see, x ranges from 1 to 6 while y values are between 1 and 2. In this case, x has a much higher variance than y. If these are the only two features used to predict a target variable, x plays a much bigger role in the prediction than y.
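To make the comparison concrete, here is a small sketch with NumPy. The values of x and y are hypothetical; I only chose them to match the ranges described above (1 to 6 and 1 to 2).

```python
import numpy as np

# Hypothetical values chosen to match the ranges described in the text.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # spans 1 to 6
y = np.array([1.0, 1.2, 1.4, 1.6, 1.8, 2.0])   # spans 1 to 2

# Sample variance (ddof=1 divides by n - 1).
var_x = np.var(x, ddof=1)
var_y = np.var(y, ddof=1)
print(var_x, var_y)
```

Note that variance depends on the unit of measurement, which is one reason the features are standardized before PCA is applied later in this post.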

As much of the variance in the dataset as possible must be retained during dimensionality reduction. There are many ways to do dimensionality reduction. In this post, I will cover one of the most widely used dimensionality reduction algorithms: **Principal Component Analysis (PCA)**.

PCA is an unsupervised learning algorithm which finds the relations among features within a dataset. It is also widely used as a preprocessing step for supervised learning algorithms.

*Note: PCA is a linear dimensionality reduction algorithm. There are also non-linear methods available.*

We first need to shift the data points so that the center of the data is at the origin. Although the positions of individual data points change, their relative positions do not. For example, the point with the highest feature 1 value still has the highest feature 1 value. PCA then fits a line through the origin that minimizes the sum of squared perpendicular distances from the data points to the line, which is equivalent to maximizing the variance of the points projected onto it.
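These two steps can be sketched with plain NumPy. This is only an illustration under my own assumptions: it uses the ten data points from the example later in this post (without standardizing them), and it finds the line through an eigendecomposition of the covariance matrix, which is one common way to compute it.

```python
import numpy as np

# The ten (feature_a, feature_b) points used in the example later in this post.
X = np.array([
    [2.0, 1.0], [1.5, 1.2], [2.0, 2.0], [2.5, 1.5], [3.0, 3.0],
    [2.5, 2.4], [3.7, 3.5], [2.8, 2.8], [1.8, 1.5], [3.3, 2.5],
])

# Step 1: shift the data so that its center is at the origin.
X_centered = X - X.mean(axis=0)

# Step 2: the best-fitting line through the origin points along the
# eigenvector of the covariance matrix with the largest eigenvalue.
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)   # eigenvalues in ascending order
pc1 = eigenvectors[:, -1]                         # unit vector along PC1

# Sum of squared perpendicular distances from the points to that line.
projections = X_centered @ pc1
residuals = X_centered - np.outer(projections, pc1)
sum_sq_dist = np.sum(residuals ** 2)
print(pc1, sum_sq_dist)
```

Any other direction through the origin gives a sum of squared distances that is at least as large, which is what "best-fitting" means here.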

This red line is the new axis or first principal component (PC1). Most of the variance of the dataset can be explained by PC1. The second principal component explains the remaining variance, which lies in the direction perpendicular to PC1.

The short red line is the second principal component (PC2). The order of principal components is determined by the fraction of the variance of the original dataset they explain. It is clear that PC1 explains much more variance than PC2.

Then the principal components and data points are rotated so that PC1 becomes the new x-axis and PC2 becomes the new y-axis. The relative positions of the data points do not change. Principal components are orthogonal to each other and thus linearly independent.

The principal components are linear combinations of the features of original dataset.
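Here is a minimal NumPy sketch of the rotation and of what "linear combination" means in practice. It again uses the ten data points from the example later in this post and an eigendecomposition of the covariance matrix, which is an assumption on my part about how to compute the components by hand.

```python
import numpy as np

# The ten (feature_a, feature_b) points used in the example later in this post.
X = np.array([
    [2.0, 1.0], [1.5, 1.2], [2.0, 2.0], [2.5, 1.5], [3.0, 3.0],
    [2.5, 2.4], [3.7, 3.5], [2.8, 2.8], [1.8, 1.5], [3.3, 2.5],
])
X_centered = X - X.mean(axis=0)

# Principal directions, sorted so that the first row is PC1.
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_centered, rowvar=False))
order = np.argsort(eigenvalues)[::-1]
components = eigenvectors[:, order].T     # rows: PC1, PC2

# Rotation: the coordinates of every point along PC1 and PC2.
scores = X_centered @ components.T

# The same PC1 coordinate written out as a weighted sum of the features.
w_a, w_b = components[0]
pc1_manual = w_a * X_centered[:, 0] + w_b * X_centered[:, 1]

print(components @ components.T)          # close to the identity matrix
```

The sign of each component is arbitrary, so the weights may come out with flipped signs. This mirrors the axes but does not change the distances between points or the variance along each component.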

The advantage of PCA is that a significant amount of the variance of the original dataset is retained using a much smaller number of features than the original dataset. Principal components are ordered according to the amount of variance they represent.

Let’s go over an example using scikit-learn. Scikit-learn is a machine learning library that provides simple and efficient tools for predictive data analysis.

To be consistent, I will use the datapoints that I have been showing since the beginning. It is a very simple example yet enough to grasp the concept.

We create a DataFrame using these datapoints and assign a class for each one.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        'feature_a': [2, 1.5, 2, 2.5, 3, 2.5, 3.7, 2.8, 1.8, 3.3],
        'feature_b': [1, 1.2, 2, 1.5, 3, 2.4, 3.5, 2.8, 1.5, 2.5],
        'target': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'a', 'b']})

So it is a binary classification task with two independent variables.

Before applying PCA, we need to standardize the data so that each feature has a mean of 0 and a variance of 1. Scikit-learn provides **StandardScaler()** in the sklearn.preprocessing module for this:

    from sklearn.preprocessing import StandardScaler

    df_features = df[['feature_a', 'feature_b']]
    df_features = StandardScaler().fit_transform(df_features)
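A quick sanity check of what the scaler does, written as a self-contained sketch. The variable names `scaled` and `manual` are mine; the data is the same as above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'feature_a': [2, 1.5, 2, 2.5, 3, 2.5, 3.7, 2.8, 1.8, 3.3],
    'feature_b': [1, 1.2, 2, 1.5, 3, 2.4, 3.5, 2.8, 1.5, 2.5]})

scaled = StandardScaler().fit_transform(df)

# Each column should now have mean ~0 and standard deviation ~1.
print(scaled.mean(axis=0))
print(scaled.std(axis=0))

# The same result by hand: subtract the mean, divide by the
# population standard deviation (ddof=0), which is what the scaler uses.
manual = (df - df.mean()) / df.std(ddof=0)
```

The result of `fit_transform` is a NumPy array, not a DataFrame, so the column names are not carried over.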

Then we create a PCA() object and fit the data points to it.

    from sklearn.decomposition import PCA

    pca = PCA(n_components=2)
    PCs = pca.fit_transform(df_features)
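As an alternative, the scaling and PCA steps can be chained so that they are always applied together. The sketch below uses scikit-learn's `make_pipeline`; this is not what the rest of this post does, just another way to arrive at the same principal components.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'feature_a': [2, 1.5, 2, 2.5, 3, 2.5, 3.7, 2.8, 1.8, 3.3],
    'feature_b': [1, 1.2, 2, 1.5, 3, 2.4, 3.5, 2.8, 1.5, 2.5]})

# Standardize first, then project onto the principal components.
pipeline = make_pipeline(StandardScaler(), PCA(n_components=2))
PCs_pipeline = pipeline.fit_transform(df)

# The fitted PCA step is still accessible by its (auto-generated) name.
fitted_pca = pipeline.named_steps['pca']
print(fitted_pca.explained_variance_ratio_)
```

This becomes handy when the same preprocessing has to be applied to new data, since `pipeline.transform` reuses the mean, scale and components learned during fitting.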

Then we create a new dataframe using principal components:

    # Data visualization libraries
    import seaborn as sns
    import matplotlib.pyplot as plt
    %matplotlib inline

    # Create DataFrame (a list keeps the column order; a set does not)
    df_new = pd.DataFrame(data=PCs, columns=['PC1', 'PC2'])
    df_new['target'] = df['target']  # targets do not change

We can draw a scatter plot to see the new data points:

    fig = plt.figure(figsize=(8, 4))
    ax = fig.add_subplot()
    ax.set_xlabel('PC1')
    ax.set_ylabel('PC2')

    targets = ['a', 'b']
    colors = ['r', 'b']

    # Plot each class separately so that it gets its own color
    for target, color in zip(targets, colors):
        rows = df_new['target'] == target
        ax.scatter(df_new.loc[rows, 'PC1'],
                   df_new.loc[rows, 'PC2'],
                   c=color)

    ax.legend(targets)

Let’s also draw the scatter plot of original data points so that you can clearly see how data points are transformed:
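A plot like that could be drawn along the following lines. This is a sketch that mirrors the plotting code above; the figure size and colors are assumptions, and in a notebook the figure is rendered automatically at the end of the cell.

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    'feature_a': [2, 1.5, 2, 2.5, 3, 2.5, 3.7, 2.8, 1.8, 3.3],
    'feature_b': [1, 1.2, 2, 1.5, 3, 2.4, 3.5, 2.8, 1.5, 2.5],
    'target': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'a', 'b']})

fig = plt.figure(figsize=(8, 4))
ax = fig.add_subplot()
ax.set_xlabel('feature_a')
ax.set_ylabel('feature_b')

# One scatter call per class so that each class gets its own color and label.
for target, color in zip(['a', 'b'], ['r', 'b']):
    rows = df['target'] == target
    ax.scatter(df.loc[rows, 'feature_a'],
               df.loc[rows, 'feature_b'],
               c=color, label=target)

ax.legend()
```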

As you can see on the principal components graph, the two classes can be separated using only PC1 instead of using both feature_a and feature_b. PC1 also explains most of the variance. To be exact, we can check the fraction of the variance explained by each principal component. A fitted PCA object exposes these values through its **explained_variance_ratio_** attribute:

    pca.explained_variance_ratio_

    array([0.93606831, 0.06393169])

PC1 explains 93.6% of the variance and PC2 explains 6.4%.
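With only two standardized features there is a neat way to double-check these numbers. The covariance matrix of the standardized data is the correlation matrix, whose eigenvalues are 1 + |r| and 1 - |r|, where r is the Pearson correlation between the two features. The explained variance ratios are therefore (1 + |r|) / 2 and (1 - |r|) / 2. A short sketch with NumPy (the variable names are mine):

```python
import numpy as np

feature_a = np.array([2, 1.5, 2, 2.5, 3, 2.5, 3.7, 2.8, 1.8, 3.3])
feature_b = np.array([1, 1.2, 2, 1.5, 3, 2.4, 3.5, 2.8, 1.5, 2.5])

# Pearson correlation between the two features.
r = np.corrcoef(feature_a, feature_b)[0, 1]

# Eigenvalues of [[1, r], [r, 1]] are 1 + |r| and 1 - |r|; divide by their sum (2).
ratio_pc1 = (1 + abs(r)) / 2
ratio_pc2 = (1 - abs(r)) / 2
print(r, ratio_pc1, ratio_pc2)
```

This shortcut only works for two features. With more features, the ratios come from the eigenvalues of the full correlation matrix.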

*Note: Principal components are a linear combination of original features.*

This example is a very simple case but it explains the concept. When doing PCA on datasets with many more features, we just follow the same steps.
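As a sketch of what that looks like, here are the same steps applied to the 8x8 digit images that ship with scikit-learn (64 features per image). The 95% threshold is an arbitrary choice of mine for illustration; passing a float between 0 and 1 as `n_components` asks PCA to keep just enough components to explain that fraction of the variance.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 1797 images of handwritten digits, each flattened to 64 pixel features.
X, y = load_digits(return_X_y=True)

# Same steps as before: standardize, then apply PCA.
# n_components=0.95 keeps enough components to explain 95% of the variance.
pipeline = make_pipeline(StandardScaler(), PCA(n_components=0.95))
X_reduced = pipeline.fit_transform(X)

pca_digits = pipeline.named_steps['pca']
print(X.shape, X_reduced.shape)
print(pca_digits.explained_variance_ratio_.sum())
```

The reduced dataset has fewer than 64 columns while still retaining at least 95% of the variance of the standardized data.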

Thank you for reading. Please let me know if you have any feedback.
