Dimensionality Reduction with PCA
Dimensionality Reduction
Dimensionality reduction is a data preparation technique performed on data prior to modeling. It is typically applied after data cleaning and data scaling and before training a predictive model. Let's understand its significance.
We can consider the columns of a dataset as dimensions of an n-dimensional feature space and the rows of the dataset as points in that space. This is a useful geometric interpretation of a dataset.
The number of input variables or features for a dataset is referred to as its dimensionality.
And dimensionality reduction refers to techniques that reduce the number of input variables in a dataset.
The Curse of Dimensionality
The phrase, attributed to Richard Bellman, was coined to express the difficulty of using brute force (a.k.a. grid search) to optimize a function with too many input variables.
- The curse of dimensionality refers to phenomena that occur when classifying, organizing, and analyzing high-dimensional data but do not occur in low-dimensional spaces, most notably the issue of data sparsity.
As dimensions are added, the amount of data needed to maintain the same density grows exponentially. This causes high sparsity in the dataset and unnecessarily increases storage space and processing time for the modelling algorithm. The value added by an additional dimension is much smaller than the overhead it adds to the algorithm.
- Fewer input dimensions often mean correspondingly fewer parameters or a simpler structure in the machine learning model, referred to as degrees of freedom. A model with too many degrees of freedom is likely to overfit the training dataset and therefore may not perform well on new data.
It is desirable to have simple models that generalize well, and in turn, input data with few input variables. This is particularly true for linear models where the number of inputs and the degrees of freedom of the model are often closely related.
Techniques for Dimensionality Reduction
1. Feature Selection Methods
Perhaps the most common are feature selection techniques, which use scoring or statistical methods to decide which features to keep and which to delete. The two main classes of feature selection techniques are filter methods and wrapper methods.
Filter methods: Filter methods are generally used as a preprocessing step. The selection of features is independent of any machine learning algorithm; instead, features are selected on the basis of their scores in various statistical tests of their correlation with the outcome variable.
One thing to keep in mind is that filter methods do not remove multicollinearity, so you must deal with multicollinearity among features before training models on your data.
Wrapper methods: In wrapper methods, we try a subset of features and train a model using them. Based on the inferences we draw from that model, we decide to add or remove features from the subset. The problem is essentially reduced to a search problem, and these methods are usually computationally very expensive. Common examples of wrapper methods are forward feature selection, backward feature elimination, and recursive feature elimination.
Embedded methods: Embedded methods combine the qualities of filter and wrapper methods. They are implemented by algorithms that have their own built-in feature selection; the most popular examples are LASSO and Ridge regression, which have inbuilt penalization functions to reduce overfitting. A minimal sketch of a filter method appears below.
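As a minimal sketch of a filter method with scikit-learn, the example below scores each feature of the iris dataset with an ANOVA F-test and keeps the two best-scoring ones; the dataset and the choice of k are assumptions for illustration, not part of the original article.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=f_classif, k=2)   # keep the 2 best-scoring features
X_reduced = selector.fit_transform(X, y)
print(selector.scores_)    # ANOVA F-score of each original feature
print(X_reduced.shape)     # only two columns remain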
2. Matrix Factorization
Matrix factorization covers techniques from linear algebra that reduce a dataset matrix into its constituent parts. Examples include the eigendecomposition (used in PCA) and the singular value decomposition (SVD).
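As a quick illustration, the sketch below factorizes a small made-up matrix with NumPy's SVD and reconstructs it from the factors:

import numpy as np

A = np.array([[1., 2.], [3., 4.], [5., 6.]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U @ diag(s) @ Vt
print(np.allclose(A, U @ np.diag(s) @ Vt))         # True: the factors reproduce A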
3. Manifold Learning
In mathematics, a projection is a kind of function or mapping that transforms data in some way. These techniques are sometimes referred to as "manifold learning" and are used to create a low-dimensional projection of high-dimensional data, often for the purposes of data visualization.
Examples of manifold learning techniques include: Kohonen Self-Organizing Map (SOM), Sammon's Mapping, Multidimensional Scaling (MDS), and t-distributed Stochastic Neighbor Embedding (t-SNE).
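For instance, a two-dimensional t-SNE projection for visualization could be computed as in the sketch below (the dataset and parameters are assumptions for illustration only):

from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

X, _ = load_iris(return_X_y=True)
embedding = TSNE(n_components=2, random_state=42).fit_transform(X)
print(embedding.shape)   # a 2-D projection of the data, suitable for plotting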
4. Autoencoder Methods
Deep learning neural networks can also be constructed to perform dimensionality reduction. A popular approach is the autoencoder, which frames a self-supervised learning problem where the model must reproduce its own input correctly.
An autoencoder is a kind of unsupervised neural network used for dimensionality reduction and feature discovery. More precisely, an autoencoder is a feedforward neural network that is trained to predict its own input.
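A minimal sketch of such a network, assuming TensorFlow/Keras is available (the layer sizes and names are arbitrary, for illustration only):

from tensorflow import keras

inputs = keras.Input(shape=(4,))                              # 4 input features
code = keras.layers.Dense(2, activation="relu")(inputs)       # encoder: 2-dimensional bottleneck
outputs = keras.layers.Dense(4, activation="linear")(code)    # decoder: reconstruct the 4 inputs
autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(X, X, epochs=50)   # trained to reproduce its own input X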
In this article we will be concentrating on PCA. So let's go.
Principal Component Analysis ( PCA )
- Principal Component Analysis (PCA) is an unsupervised, non-parametric statistical technique primarily used for dimensionality reduction in machine learning.
- Principal component analysis is a technique for feature extraction, not feature elimination (where we drop some of the features): it combines our input variables in a specific way, and then we can drop the "least important" variables while still retaining the most valuable parts of all of them.
- This is done by transforming the variables into a new set of variables, known as the principal components (or simply, the PCs), which are orthogonal and ordered such that the variation retained from the original variables decreases as we move down the order. In this way, the first principal component retains the maximum variation that was present in the original variables. The principal components are the eigenvectors of the covariance matrix, and hence they are orthogonal.
PCA is a statistical procedure to convert observations of possibly correlated variables to ‘principal components’ such that:
They are uncorrelated with each other.
They are linear combinations of the original variables.
They help in capturing maximum information in the data set.
Fundamental building block of PCA: Basis
Essentially, ‘basis’ is a unit in which we express the vectors of a matrix.
So for example, when you say that an object has a length of 23 cm, what you are essentially saying is that the object’s length is 23×1 cm. Here, 1 cm is the unit in which you are expressing the length of the object.
Similarly, vectors in any dimensional space or matrix can be represented as a linear combination of basis vectors.
Any point in the 2-D plane is a linear combination of i and j. i and j are orthogonal (perpendicular) vectors, and hence one cannot be represented by the other.
The basic definition of basis vectors is that they’re a certain set of vectors whose linear combination is able to explain any other vector in that space.
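For instance, in the 2-D plane (a tiny NumPy illustration):

import numpy as np

i, j = np.array([1.0, 0.0]), np.array([0.0, 1.0])   # the standard basis vectors
v = 3 * i + 4 * j                                    # the point (3, 4) as a linear combination of i and j
print(v)                                             # [3. 4.]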
The goal of principal component analysis is to compute the most meaningful basis, one in which most of the variance in the data lies along the axes. The hope is that this new basis will filter out the noise and reveal the hidden dynamics.
Change of Basis — different basis vectors can be used to represent the same observations, just like you can represent the weight of a patient in kilograms or pounds.
The key idea that connects basis vectors to dimensionality reduction is this: we can use different basis vectors to represent the same points.
A vector isn’t tied to the coordinate axes set used to describe it originally. We can re-describe it using a new set of coordinate axes or basis vectors. This is an important concept in Linear Algebra. It allows us to move basis of a vector or data item, which moves the numbers in the vector.
To change from one basis to another, we use matrices: if the columns of a matrix B are the new basis vectors, the coordinates of a point in the new basis are obtained by solving B times (new coordinates) = (old coordinates).
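Assuming NumPy, here is a minimal numerical sketch of such a change of basis (the point and the new basis are made up for the example):

import numpy as np

B = np.array([[1.0, -1.0],
              [1.0,  1.0]])      # columns are the new basis vectors b1 = (1, 1), b2 = (-1, 1)
v = np.array([3.0, 4.0])         # coordinates of a point in the standard basis
v_new = np.linalg.solve(B, v)    # coordinates of the same point in the new basis
print(v_new)                     # [3.5 0.5], i.e. v = 3.5*b1 + 0.5*b2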
Ideal basis vectors
As we know, PCA changes the basis in such a way that the new basis vectors capture the maximum variance, or information.
(variance ⬆ = information ⬆)
The ideal basis vectors should have the following properties:
- They explain the directions of maximum variance
- When used as the new set of basis vectors, the transformed dataset is now suitable for dimensionality reduction.
- These directions explaining the maximum variance are called the Principal Components of our data.
STEPS INVOLVED IN PCA
1. The first step is to normalize the data we have so that PCA works properly.
2. Compute the covariance matrix of the normalized data.
We could use the variance of a particular column to measure its importance, but the reason you need the covariance matrix is to capture all the information in the dataset: the variance along each column as well as the inter-column covariances. Using this matrix, we'll be able to deduce the directions of maximum variance.
3. Eigendecomposition of the covariance matrix
You need to find the eigenvectors of the covariance matrix using a process called eigendecomposition. These eigenvectors become the new set of basis vectors, and in their representation the covariance matrix is diagonalized (the process of transforming the covariance matrix so that it has non-zero values only on the diagonal and zeros everywhere else is known as diagonalization).
In general, if a matrix A has size n (i.e. it is an n x n matrix), then at most n eigenvectors can be formed.
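To make steps 1 to 3 concrete, here is a minimal NumPy sketch on the iris data (the dataset and variable names are assumptions for illustration; the scikit-learn code used later does all of this internally):

import numpy as np
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)   # step 1: normalize the data
cov = np.cov(X_std, rowvar=False)              # step 2: covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)         # step 3: eigendecomposition
order = np.argsort(eigvals)[::-1]              # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
print(eigvals)   # variance along each new basis vector (eigenvector)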
Now you must be wondering “How does finding new basis vectors where the covariance matrix only has non-zero values along the diagonal and 0 elsewhere help us?”
Well now,
- Your new basis vectors are all uncorrelated and independent of each other.
- Since variance is now explained only by the new basis vectors themselves, you can find the directions of maximum variance by checking which diagonal value is numerically larger than the rest. There is no correlation to take into account this time; all the information in the dataset is explained by the columns themselves.
So now, your new basis vectors are uncorrelated, hence linearly independent, and explain the directions of maximum variance.
Now, we wanted to find a new set of basis vectors in which the covariance matrix gets diagonalized. It turns out that this new set of basis vectors is in fact the set of eigenvectors of the covariance matrix.
And therefore these eigenvectors are the Principal Components of our original dataset. In other words, these eigenvectors are the directions that capture maximum variance.
In conclusion, you saw that, by using PCA you’re able to convert your original data to a new basis and then take a decision on which directions/features to keep and therefore perform dimensionality reduction.
To summarize, PCA relies on three ingredients: a measure of how the variables are associated with one another (the covariance matrix), the directions in which the data are dispersed (the eigenvectors), and the relative importance of those directions (the eigenvalues). PCA combines our predictors and allows us to drop the eigenvectors that are relatively unimportant.
Principal Component
A principal component is a normalized linear combination of the original predictors in a data set.
The first principal component is a linear combination of the original predictor variables that captures the maximum variance in the data set. It determines the direction of highest variability in the data. The larger the variability captured by the first component, the more information it captures. No other component can have variability higher than the first principal component.
The first principal component results in a line which is closest to the data, i.e. it minimizes the sum of squared distances between the data points and the line. Similarly, we can compute the second principal component.
The second principal component is also a linear combination of the original predictors; it captures the remaining variance in the data set and is uncorrelated with the first principal component. In other words, the correlation between the first and second components is zero. Since the two components are uncorrelated, their directions are orthogonal.
All succeeding principal components follow a similar concept, i.e. they capture the remaining variation without being correlated with the previous components. In general, for n × p dimensional data, min(n-1, p) principal components can be constructed.
The directions of these components are identified in an unsupervised way, i.e. the response variable (Y) is not used to determine the component directions.
Let's see how to do PCA in Python.
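The snippets in the rest of this section assume a setup along these lines: the iris dataset, standardized into a matrix x, and a PCA object called pca (the dataset and variable names are assumptions consistent with the numbers quoted below, not code reproduced from the original article):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
x = StandardScaler().fit_transform(X)   # step 1: normalize the data
pca = PCA(random_state=42)
pca.fit(x)                              # find the principal components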
pca.fit(x): This function does both steps: finding the covariance matrix and performing an eigendecomposition of it to obtain the eigenvectors, which are nothing but the principal components of the original dataset.
Executing the above code computes the principal components of the original dataset; there will be as many of them as there are original variables. Our task now is to choose the optimal number of principal components.
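The fitted components can be inspected directly (a one-line sketch, assuming the pca object from the setup above):

print(pca.components_)   # one row per principal component; as many rows as original variables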
We can check how much variance is being explained by each principal component using code along the following lines:
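A one-line sketch, again assuming the pca object fitted above:

print(np.round(pca.explained_variance_ratio_, 3))   # roughly [0.73, 0.23, 0.036, 0.005] for the standardized iris data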
So as you can see, the first PC, i.e. Principal Component 1 ([0.52 -0.26 0.58 0.56]), explains the maximum information in the dataset, followed by PC2 at 23% and PC3 at 3.6%. In general, when you perform PCA, the principal components are formed in decreasing order of the information they explain: the first principal component always explains the highest variance, followed by the second, and so on. This ordering helps us in our dimensionality reduction exercise, since we now know which directions are more important than others.
But what happens when there are hundreds of columns? The above process would be cumbersome, since you'd need to look at all the PCs and keep adding their variances up to find the total variance captured. This is where the scree plot comes to the rescue.
Scree-Plot
An elegant solution here is to simply plot a "cumulative variance explained" chart, where against each number of components we have the total variance explained by all the components up to that point.
If you plot the number of components on the X-axis and the total variance explained on the Y-axis, the resultant plot is also known as a Scree-Plot.
Now, this is a better representation of variance and the number of components needed to explain that much variance.
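A possible way to draw such a chart with matplotlib, assuming the pca object fitted earlier (the plotting code is an illustration, not from the original article):

import numpy as np
import matplotlib.pyplot as plt

cum_var = np.cumsum(pca.explained_variance_ratio_)          # running total of variance explained
plt.plot(range(1, len(cum_var) + 1), cum_var, marker="o")
plt.xlabel("Number of principal components")
plt.ylabel("Cumulative variance explained")
plt.show()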
From the scree plot, you decide to keep ~95% of the information in the data, and for that you need only 2 components. Hence you instantiate a new PCA object with the number of components set to 2. This will perform the dimensionality reduction on our dataset and reduce the number of columns from 4 to 2.
pc2 = PCA(n_components=2, random_state=42)
Now you simply transform the original dataset into the new one, whose columns are given by the principal components. Here you've finally performed dimensionality reduction on the dataset, reducing the number of columns from 4 to 2 while still retaining ~95% of the information. The code used to perform this step is as follows:
newdata = pc2.fit_transform(x)
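A quick sanity check of the claims above (assuming newdata and pc2 from the snippets in this section):

print(newdata.shape)                          # (n_rows, 2): two columns instead of four
print(pc2.explained_variance_ratio_.sum())    # roughly 0.95-0.96 of the variance is retained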
PCA helped us solve the problem of multicollinearity (and thus model instability), avoid the loss of information that comes with dropping variables, and skip iterative feature selection procedures. Also, our model becomes much faster because it runs on a smaller dataset.
To sum it up, if you're doing any sort of modelling activity on a large dataset containing lots of variables, it is good practice to perform PCA on that dataset first, reduce the dimensionality, and then go ahead and create the model you wanted to make in the first place.
Advantages of PCA
Avoiding overfitting is a major motivation for performing dimensionality reduction: the fewer features our training data has, the fewer assumptions our model makes and the simpler it will be. But that is not all; dimensionality reduction has several more advantages to offer:
- Less misleading data means model accuracy improves.
- Fewer dimensions mean less computing, and less data means that algorithms train faster.
- Data visualisation and EDA become easier.
- Less data means less storage space required.
- Fewer dimensions allow the use of algorithms that are unsuitable for a large number of dimensions.
- Removes redundant features and noise.
- For creating uncorrelated features: having a lot of correlated features leads to the multicollinearity problem, and iteratively removing features is time-consuming and also leads to some information loss.
- Finding latent themes in the data
Shortcomings of PCA
Some important shortcomings of PCA are:
- PCA is limited to linearity, though we can use non-linear techniques such as t-SNE as well
- PCA needs the components to be perpendicular, though in some cases that may not be the best solution. An alternative technique is Independent Component Analysis (ICA).
- PCA assumes that columns with low variance are not useful, which might not be true in prediction setups (especially classification problems with high class imbalance).
- The data needs to be balanced and free of outliers, since PCA is sensitive to outliers.
That's all for now.
If you liked the article, show your support by clapping for it. This article is essentially a compilation of many articles from Medium, Analytics Vidhya, upGrad material, etc.
If you are also learning machine learning like me, follow me for more articles. Let's go on this trip together :)
You can also follow me on Linkedin