Enriching your PCA results
Among the Principal Component Analysis (PCA) routines available in base R and in contributed packages, I could not find one fully satisfactory in terms of the naming and content of its outputs. So I decided to build upon one of those routines, prcomp() from the stats package, and add some further outputs. I am going to show how I did it.
First, let us import the dataset to be processed by PCA.
data <- read.csv("dataset.csv", header=TRUE)
head(data, n=10)
## source1 source2 source3 source4 source5
## 1 0 0 0 0 0
## 2 951 1103 1355 1415 119
## 3 951 1103 1355 3979 119
## 4 951 1103 1355 3979 119
## 5 1970 2508 2925 4748 950
## 6 3072 3785 4631 6697 1198
## 7 3072 3785 4631 6697 1198
## 8 3072 3785 4631 6697 1198
## 9 4147 5123 5875 8472 2852
## 10 5048 6145 7538 10645 4384
dim(data)
## [1] 130 5
As we can see, it consists of sampled data collected from five sources. Plotting them suggests a common underlying pattern.
plot(data[,1], type='l', col=1, xlab='time', ylab='value')
for(i in 2:ncol(data)) {lines(data[,i], type='l', col=i)}
Let us investigate the correlations among them.
cor(data)
## source1 source2 source3 source4 source5
## source1 1.0000 0.9998 0.9991 0.9978 0.9915
## source2 0.9998 1.0000 0.9991 0.9976 0.9922
## source3 0.9991 0.9991 1.0000 0.9971 0.9941
## source4 0.9978 0.9976 0.9971 1.0000 0.9946
## source5 0.9915 0.9922 0.9941 0.9946 1.0000
They are highly correlated, so we expect PCA to do a good job of dimensionality reduction. Let us scale the data and run the prcomp routine. Since I need the scaled data later on, I scale explicitly instead of using prcomp's center and scale. arguments.
scaled.data <- scale(data)
pca.prcomp <- prcomp(scaled.data, retx = TRUE, center = FALSE, scale = FALSE)
Let us discover class and mode of the PCA output.
class(pca.prcomp)
## [1] "prcomp"
mode(pca.prcomp)
## [1] "list"
The mode reveals that the object is, underneath, a list. Hence we can easily add new attributes to it.
summary(pca.prcomp)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5
## Standard deviation 2.233 0.10553 0.05447 0.02360 0.01219
## Proportion of Variance 0.997 0.00223 0.00059 0.00011 0.00003
## Cumulative Proportion 0.997 0.99927 0.99986 0.99997 1.00000
A quick look at the PCA summary shows that the first principal component accounts for more than 99% of the total variance. We will verify that later.
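As a sanity check, the Proportion of Variance row is simply each component's variance over the total variance. Continuing from the pca.prcomp object above, we can reproduce it:

```r
# Each component's share of total variance, from the standard deviations
prop.var <- pca.prcomp$sdev^2 / sum(pca.prcomp$sdev^2)
round(prop.var, 5)  # matches the Proportion of Variance row
cumsum(prop.var)    # matches the Cumulative Proportion row
```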
Out of the box, the attributes of the prcomp object are:
names(pca.prcomp)
## [1] "sdev" "rotation" "center" "scale" "x"
The attributes of the prcomp summary reveal a little surprise:
summary <- summary(pca.prcomp)
summary
## Importance of components:
## PC1 PC2 PC3 PC4 PC5
## Standard deviation 2.233 0.10553 0.05447 0.02360 0.01219
## Proportion of Variance 0.997 0.00223 0.00059 0.00011 0.00003
## Cumulative Proportion 0.997 0.99927 0.99986 0.99997 1.00000
names(summary)
## [1] "sdev" "rotation" "center" "scale" "x"
## [6] "importance"
class(summary)
## [1] "summary.prcomp"
summary$importance
## PC1 PC2 PC3 PC4 PC5
## Standard deviation 2.233 0.10553 0.05447 0.02360 0.01219
## Proportion of Variance 0.997 0.00223 0.00059 0.00011 0.00003
## Cumulative Proportion 0.997 0.99927 0.99986 0.99997 1.00000
class(summary$importance)
## [1] "matrix"
The summary.prcomp class has all the attributes of the prcomp class plus importance, which I would like to have available in my PCA output. Hence:
pca.prcomp$importance <- summary(pca.prcomp)$importance
pca.prcomp$importance
## PC1 PC2 PC3 PC4 PC5
## Standard deviation 2.233 0.10553 0.05447 0.02360 0.01219
## Proportion of Variance 0.997 0.00223 0.00059 0.00011 0.00003
## Cumulative Proportion 0.997 0.99927 0.99986 0.99997 1.00000
The rotation output (also known as the loadings) contains the eigenvectors of the correlation matrix, up to sign: the sign of each eigenvector is arbitrary, so individual columns may come out flipped.
pca.prcomp$rotation
## PC1 PC2 PC3 PC4 PC5
## source1 0.4475 0.38662 -0.02948 0.1002 0.7996
## source2 0.4475 0.33354 -0.13813 0.6491 -0.4981
## source3 0.4476 0.14886 -0.49266 -0.6855 -0.2548
## source4 0.4474 -0.02534 0.83967 -0.2516 -0.1757
## source5 0.4461 -0.84645 -0.17968 0.1885 0.1294
-(eigen(cor(scaled.data))$vectors)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.4475 0.38662 -0.02948 0.1002 -0.7996
## [2,] 0.4475 0.33354 -0.13813 0.6491 0.4981
## [3,] 0.4476 0.14886 -0.49266 -0.6855 0.2548
## [4,] 0.4474 -0.02534 0.83967 -0.2516 0.1757
## [5,] 0.4461 -0.84645 -0.17968 0.1885 -0.1294
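Note how the fifth column flips sign between the two computations, a consequence of the arbitrary eigenvector sign. A sign-insensitive comparison confirms that the loadings and the eigenvectors match:

```r
# Compare loadings and eigenvectors entry by entry, ignoring sign
ev <- eigen(cor(scaled.data))$vectors
all.equal(abs(unname(pca.prcomp$rotation)), abs(ev))
```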
The x output contains the scores, i.e. the projection of the original (scaled) data onto the rotation matrix. You could compute the scores yourself as:
scores <- scaled.data %*% pca.prcomp$rotation
head(scores, n=10)
## PC1 PC2 PC3 PC4 PC5
## [1,] -3.997 0.1300 -2.188e-02 -0.01297 0.0145761
## [2,] -3.916 0.1654 -2.326e-02 -0.01669 0.0119880
## [3,] -3.887 0.1638 3.123e-02 -0.03302 0.0005886
## [4,] -3.887 0.1638 3.123e-02 -0.03302 0.0005886
## [5,] -3.787 0.1780 4.653e-03 -0.02252 -0.0003674
## [6,] -3.684 0.2151 6.456e-03 -0.03099 -0.0052133
## [7,] -3.684 0.2151 6.456e-03 -0.03099 -0.0052133
## [8,] -3.684 0.2151 6.456e-03 -0.03099 -0.0052133
## [9,] -3.561 0.1944 2.608e-05 -0.01365 0.0008479
## [10,] -3.439 0.1725 -1.737e-03 -0.01929 0.0005706
head(pca.prcomp$x, n=10)
## PC1 PC2 PC3 PC4 PC5
## [1,] -3.997 0.1300 -2.188e-02 -0.01297 0.0145761
## [2,] -3.916 0.1654 -2.326e-02 -0.01669 0.0119880
## [3,] -3.887 0.1638 3.123e-02 -0.03302 0.0005886
## [4,] -3.887 0.1638 3.123e-02 -0.03302 0.0005886
## [5,] -3.787 0.1780 4.653e-03 -0.02252 -0.0003674
## [6,] -3.684 0.2151 6.456e-03 -0.03099 -0.0052133
## [7,] -3.684 0.2151 6.456e-03 -0.03099 -0.0052133
## [8,] -3.684 0.2151 6.456e-03 -0.03099 -0.0052133
## [9,] -3.561 0.1944 2.608e-05 -0.01365 0.0008479
## [10,] -3.439 0.1725 -1.737e-03 -0.01929 0.0005706
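Rather than comparing the two printouts by eye, the equality can be checked programmatically:

```r
# Confirm the manually projected scores equal prcomp's x output
all.equal(scores, pca.prcomp$x)
```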
Now, let us determine the number of principal components to retain. Typically I use the parallel analysis available in the psych package for this purpose.
require(MASS)
## Loading required package: MASS
require(psych)
## Loading required package: psych
fa.par <- fa.parallel(data, fa = 'PC', n.iter = 100)
## Parallel analysis suggests that the number of factors = 1 and the number of components = 1
As we can see, the parallel analysis confirms that just the first component should be retained. So I want my PCA results to store that information:
pca.prcomp$nfact <- fa.par$nfact
names(pca.prcomp)
## [1] "sdev" "rotation" "center" "scale" "x"
## [6] "importance" "nfact"
Moreover, I want my PCA results to store the eigenvalues, which are the squares of the standard deviations:
pca.prcomp$eigenvalues <- pca.prcomp$sdev^2
pca.prcomp$eigenvalues
## [1] 4.9851909 0.0111367 0.0029667 0.0005570 0.0001486
eigen(cor(scaled.data))$values
## [1] 4.9851909 0.0111367 0.0029667 0.0005570 0.0001486
I also like to reconstruct the dataset from the retained principal components and store it inside the PCA results, together with the scaled data and the residuals.
pca.prcomp$scaled <- scaled.data
nfact <- pca.prcomp$nfact
pca.prcomp$reconstructed <- pca.prcomp$x[, 1:nfact, drop=FALSE] %*% solve(pca.prcomp$rotation)[1:nfact, , drop=FALSE]
pca.prcomp$residual <- pca.prcomp$scaled - pca.prcomp$reconstructed
One comment about that. The scores represent the projection of the original (scaled) data onto the correlation matrix eigenvectors. So
pca.prcomp$x[, 1:nfact, drop=FALSE]
are the first nfact columns of the projected data. On the other hand,
solve(pca.prcomp$rotation)[1:nfact, , drop=FALSE]
are the first nfact rows of the inverse of the rotation matrix (inverse courtesy of the solve() call). Matrix multiplication between those two matrices yields the data reconstructed from the first nfact principal components.
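As a side note, the rotation matrix is orthogonal (its columns are orthonormal eigenvectors), so its inverse is simply its transpose; t() could replace the more expensive solve() call:

```r
# For an orthogonal matrix, the inverse equals the transpose
all.equal(unname(solve(pca.prcomp$rotation)), unname(t(pca.prcomp$rotation)))

# Equivalent, cheaper reconstruction
reconstructed.t <- pca.prcomp$x[, 1:nfact, drop=FALSE] %*% t(pca.prcomp$rotation)[1:nfact, , drop=FALSE]
```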
It is interesting to compute the Mean Squared Error (MSE) of the actual data with respect to the reconstructed data, to verify how good the reconstruction is. A mse() routine is available in the Metrics package; the reconstructed data plays the role of its predicted input parameter. We compute the MSE for an increasing number of retained principal components:
require(Metrics)
## Loading required package: Metrics
nc <- ncol(data)
msev <- vector(mode="numeric", length=nc)
for (i in 1:nc) {
  reconstructed <- pca.prcomp$x[, 1:i, drop=FALSE] %*% solve(pca.prcomp$rotation)[1:i, , drop=FALSE]
  msev[i] <- mse(pca.prcomp$scaled, reconstructed)
}
msev
## [1] 2.939e-03 7.288e-04 1.400e-04 2.950e-05 8.300e-32
That is just the variance of the residuals, and it could also have been computed from the importance matrix row named Cumulative Proportion, this way:
rep(1,nc) - pca.prcomp$importance[3,]
## PC1 PC2 PC3 PC4 PC5
## 0.00296 0.00073 0.00014 0.00003 0.00000
In conclusion, we have seen how much valuable information a PCA brings with it, and how much more we can get if we elaborate on its outputs a little.
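All of the enrichment steps above can be collected into a single helper routine. The sketch below is my own consolidation (the enrich.pca name is hypothetical, not part of any package), with the parallel analysis made optional so the number of components to retain can also be passed in directly:

```r
# Sketch: run prcomp on scaled data and attach the extra attributes
# discussed in this post (importance, nfact, eigenvalues, scaled data,
# reconstruction from the retained components, and residuals).
enrich.pca <- function(data, nfact = NULL) {
  scaled <- scale(data)
  pca <- prcomp(scaled, retx = TRUE, center = FALSE, scale = FALSE)
  pca$importance <- summary(pca)$importance
  if (is.null(nfact)) {
    # Parallel analysis (psych package) to suggest how many components to keep
    nfact <- psych::fa.parallel(data, fa = 'pc', n.iter = 100, plot = FALSE)$ncomp
  }
  pca$nfact <- nfact
  pca$eigenvalues <- pca$sdev^2
  pca$scaled <- scaled
  pca$reconstructed <- pca$x[, 1:nfact, drop = FALSE] %*%
    solve(pca$rotation)[1:nfact, , drop = FALSE]
  pca$residual <- pca$scaled - pca$reconstructed
  pca
}
```

With it, the whole walkthrough reduces to a single call such as enrich.pca(data), or enrich.pca(data, nfact = 1) to skip the parallel analysis.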