Monday, September 8, 2014

More outputs for your PCA.

Enriching your PCA results

Among all the Principal Component Analysis (PCA) routines available in R, both in the base distribution and in contributed packages, I could not find one fully satisfactory in terms of the naming and the values of its output results. So I decided to build on one of those PCA routines, prcomp() from the stats package, and add some more outputs to it. I am going to show how I did it.

First, let us import the dataset to be processed by PCA.

data <- read.csv("dataset.csv", header=TRUE)
head(data, n=10)
##    source1 source2 source3 source4 source5
## 1        0       0       0       0       0
## 2      951    1103    1355    1415     119
## 3      951    1103    1355    3979     119
## 4      951    1103    1355    3979     119
## 5     1970    2508    2925    4748     950
## 6     3072    3785    4631    6697    1198
## 7     3072    3785    4631    6697    1198
## 8     3072    3785    4631    6697    1198
## 9     4147    5123    5875    8472    2852
## 10    5048    6145    7538   10645    4384
dim(data)
## [1] 130   5

As we can see, it consists of sampled data collected from five sources. Plotting them together suggests a common pattern among the sources.

plot(data[,1], type='l', col=1, xlab='time', ylab='value')  # first source
for(i in 2:ncol(data)) {lines(data[,i], type='l', col=i)}   # overlay the other sources

[Figure: time plot of the five sources, which follow a common trend]

Let us investigate the correlations among them.

cor(data)
##         source1 source2 source3 source4 source5
## source1  1.0000  0.9998  0.9991  0.9978  0.9915
## source2  0.9998  1.0000  0.9991  0.9976  0.9922
## source3  0.9991  0.9991  1.0000  0.9971  0.9941
## source4  0.9978  0.9976  0.9971  1.0000  0.9946
## source5  0.9915  0.9922  0.9941  0.9946  1.0000

They appear highly correlated, hence we expect PCA to do a good job in terms of dimensionality reduction. Let us scale the data and run the prcomp() routine. Since I need to make use of the scaled data later, I am not taking advantage of the center and scale flags here.

scaled.data <- scale(data)
pca.prcomp <- prcomp(scaled.data, retx = TRUE, center = FALSE, scale = FALSE)
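
As an aside, an equivalent run would let prcomp() do the centering and scaling internally via its center and scale. arguments (note the trailing dot in scale.); a minimal sketch of that alternative:

pca.alt <- prcomp(data, retx = TRUE, center = TRUE, scale. = TRUE)
all.equal(pca.alt$sdev, pca.prcomp$sdev)  # the standard deviations should match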

Let us discover the class and mode of the PCA output.

class(pca.prcomp)
## [1] "prcomp"
mode(pca.prcomp)
## [1] "list"

The mode reveals that the underlying data structure is a plain list. Hence we can easily add new named elements to it.
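
To illustrate the point with a toy example (the object and element names below are made up for demonstration):

toy <- list(a = 1)
class(toy) <- "toyclass"  # an S3 class is just an attribute set on the list
toy$b <- 2                # new named elements can be added at will
names(toy)
## [1] "a" "b"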

summary(pca.prcomp)
## Importance of components:
##                          PC1     PC2     PC3     PC4     PC5
## Standard deviation     2.233 0.10553 0.05447 0.02360 0.01219
## Proportion of Variance 0.997 0.00223 0.00059 0.00011 0.00003
## Cumulative Proportion  0.997 0.99927 0.99986 0.99997 1.00000

A quick look at the PCA summary shows that the first principal component alone accounts for more than 99% of the total variance. We will verify that later.
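
As a quick sanity check, that row of the summary can be recomputed by hand from the sdev output:

pca.prcomp$sdev^2 / sum(pca.prcomp$sdev^2)  # matches the Proportion of Variance row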

Originally, the attributes of the prcomp object are:

names(pca.prcomp)
## [1] "sdev"     "rotation" "center"   "scale"    "x"

The attributes of the prcomp summary reveal a little surprise:

summary <- summary(pca.prcomp)
summary
## Importance of components:
##                          PC1     PC2     PC3     PC4     PC5
## Standard deviation     2.233 0.10553 0.05447 0.02360 0.01219
## Proportion of Variance 0.997 0.00223 0.00059 0.00011 0.00003
## Cumulative Proportion  0.997 0.99927 0.99986 0.99997 1.00000
names(summary)
## [1] "sdev"       "rotation"   "center"     "scale"      "x"         
## [6] "importance"
class(summary)
## [1] "summary.prcomp"
summary$importance
##                          PC1     PC2     PC3     PC4     PC5
## Standard deviation     2.233 0.10553 0.05447 0.02360 0.01219
## Proportion of Variance 0.997 0.00223 0.00059 0.00011 0.00003
## Cumulative Proportion  0.997 0.99927 0.99986 0.99997 1.00000
class(summary$importance)
## [1] "matrix"

The summary.prcomp class has all the attributes of the prcomp class plus importance, which I would like to have available in my PCA output as well. Hence:

pca.prcomp$importance <- summary(pca.prcomp)$importance
pca.prcomp$importance
##                          PC1     PC2     PC3     PC4     PC5
## Standard deviation     2.233 0.10553 0.05447 0.02360 0.01219
## Proportion of Variance 0.997 0.00223 0.00059 0.00011 0.00003
## Cumulative Proportion  0.997 0.99927 0.99986 0.99997 1.00000

The rotation output (also known as the loadings) contains the eigenvectors of the correlation matrix, up to a sign flip per column, since eigenvector signs are arbitrary.

pca.prcomp$rotation
##            PC1      PC2      PC3     PC4     PC5
## source1 0.4475  0.38662 -0.02948  0.1002  0.7996
## source2 0.4475  0.33354 -0.13813  0.6491 -0.4981
## source3 0.4476  0.14886 -0.49266 -0.6855 -0.2548
## source4 0.4474 -0.02534  0.83967 -0.2516 -0.1757
## source5 0.4461 -0.84645 -0.17968  0.1885  0.1294
-(eigen(cor(scaled.data))$vectors)
##        [,1]     [,2]     [,3]    [,4]    [,5]
## [1,] 0.4475  0.38662 -0.02948  0.1002 -0.7996
## [2,] 0.4475  0.33354 -0.13813  0.6491  0.4981
## [3,] 0.4476  0.14886 -0.49266 -0.6855  0.2548
## [4,] 0.4474 -0.02534  0.83967 -0.2516  0.1757
## [5,] 0.4461 -0.84645 -0.17968  0.1885 -0.1294
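
Since eigenvector signs are arbitrary (the fifth column above comes out flipped), a sign-insensitive comparison confirms that the two matrices match:

ev <- eigen(cor(scaled.data))$vectors
max(abs(abs(pca.prcomp$rotation) - abs(ev)))  # numerically zero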

The x output contains the scores, i.e. the projection of the original (scaled) data onto the rotation matrix. You could compute the scores yourself as:

scores <- scaled.data %*% pca.prcomp$rotation
head(scores, n=10)
##          PC1    PC2        PC3      PC4        PC5
##  [1,] -3.997 0.1300 -2.188e-02 -0.01297  0.0145761
##  [2,] -3.916 0.1654 -2.326e-02 -0.01669  0.0119880
##  [3,] -3.887 0.1638  3.123e-02 -0.03302  0.0005886
##  [4,] -3.887 0.1638  3.123e-02 -0.03302  0.0005886
##  [5,] -3.787 0.1780  4.653e-03 -0.02252 -0.0003674
##  [6,] -3.684 0.2151  6.456e-03 -0.03099 -0.0052133
##  [7,] -3.684 0.2151  6.456e-03 -0.03099 -0.0052133
##  [8,] -3.684 0.2151  6.456e-03 -0.03099 -0.0052133
##  [9,] -3.561 0.1944  2.608e-05 -0.01365  0.0008479
## [10,] -3.439 0.1725 -1.737e-03 -0.01929  0.0005706
head(pca.prcomp$x, n=10)
##          PC1    PC2        PC3      PC4        PC5
##  [1,] -3.997 0.1300 -2.188e-02 -0.01297  0.0145761
##  [2,] -3.916 0.1654 -2.326e-02 -0.01669  0.0119880
##  [3,] -3.887 0.1638  3.123e-02 -0.03302  0.0005886
##  [4,] -3.887 0.1638  3.123e-02 -0.03302  0.0005886
##  [5,] -3.787 0.1780  4.653e-03 -0.02252 -0.0003674
##  [6,] -3.684 0.2151  6.456e-03 -0.03099 -0.0052133
##  [7,] -3.684 0.2151  6.456e-03 -0.03099 -0.0052133
##  [8,] -3.684 0.2151  6.456e-03 -0.03099 -0.0052133
##  [9,] -3.561 0.1944  2.608e-05 -0.01365  0.0008479
## [10,] -3.439 0.1725 -1.737e-03 -0.01929  0.0005706

Now, let us run a parallel analysis to determine the number of principal components to be retained. Typically I take advantage of the fa.parallel() routine available inside the psych package for this purpose.

require(MASS)
## Loading required package: MASS
require(psych)
## Loading required package: psych
fa.par <- fa.parallel(data, fa = 'PC', n.iter = 100)

[Figure: fa.parallel scree plot of actual versus simulated eigenvalues]

## Parallel analysis suggests that the number of factors =  1  and the number of components =  1

As we can see, the parallel analysis confirms that just the first component should be retained. So, I want my PCA results to store that information:

pca.prcomp$nfact <- fa.par$nfact
names(pca.prcomp)
## [1] "sdev"       "rotation"   "center"     "scale"      "x"         
## [6] "importance" "nfact"

Moreover, I want my PCA results to store the eigenvalues, which are the squares of the standard deviations:

pca.prcomp$eigenvalues <- pca.prcomp$sdev^2
pca.prcomp$eigenvalues
## [1] 4.9851909 0.0111367 0.0029667 0.0005570 0.0001486
eigen(cor(scaled.data))$values
## [1] 4.9851909 0.0111367 0.0029667 0.0005570 0.0001486
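
As a side note, because the data were standardized, the eigenvalues sum to the number of variables, i.e. the trace of the correlation matrix:

sum(pca.prcomp$eigenvalues)  # equals ncol(data) = 5 for correlation-based PCA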

I also like to reconstruct the dataset based upon the retained principal components and store it inside the PCA results together with the scaled data and the residuals.

pca.prcomp$scaled <- scaled.data
nfact <- pca.prcomp$nfact
# project retained scores back through the matching rows of the inverse rotation
pca.prcomp$reconstructed <- pca.prcomp$x[, 1:nfact, drop=FALSE] %*% solve(pca.prcomp$rotation)[1:nfact, , drop=FALSE]
pca.prcomp$residual <- pca.prcomp$scaled - pca.prcomp$reconstructed

One comment about that. The scores represent the projection of the original (scaled) data onto the correlation matrix eigenvectors. So

pca.prcomp$x[, 1:nfact, drop=FALSE]

are the first nfact columns of such projected data. On the other hand,

solve(pca.prcomp$rotation)[1:nfact, , drop=FALSE]

are the first nfact rows of the inverse of the rotation matrix (inverse because of the solve() call). Matrix multiplication of those two matrices yields the data reconstructed from the first nfact principal components.
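
Incidentally, since the rotation matrix is orthonormal, its inverse equals its transpose, so solve() could be replaced by the cheaper t(); a quick check:

max(abs(solve(pca.prcomp$rotation) - t(pca.prcomp$rotation)))  # numerically zero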

It is interesting to compute the Mean Squared Error (MSE) of the actual data with respect to the reconstructed data, to verify how good the reconstruction is. An mse() routine is available inside the Metrics package; the reconstructed data plays the role of its predicted input parameter. We compute the Mean Squared Error for an increasing number of retained principal components:

require(Metrics)
## Loading required package: Metrics
nc <- ncol(data)
msev <- vector(mode="numeric", length=nc)
for(i in 1:nc) {
   # reconstruct from the first i components, then score the reconstruction
   reconstructed <- pca.prcomp$x[, 1:i, drop=FALSE] %*% solve(pca.prcomp$rotation)[1:i, , drop=FALSE]
   msev[i] <- mse(pca.prcomp$scaled, reconstructed)
}
msev
## [1] 2.939e-03 7.288e-04 1.400e-04 2.950e-05 8.300e-32

That is essentially the variance of the residuals, and it could have been computed from the importance matrix row named Cumulative Proportion, this way:

rep(1,nc) - pca.prcomp$importance[3,]
##     PC1     PC2     PC3     PC4     PC5 
## 0.00296 0.00073 0.00014 0.00003 0.00000
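
More precisely, the two quantities agree up to a factor of (n-1)/n, because mse() divides by the number of samples n while the variance underlying scale() divides by n-1:

n <- nrow(data)
as.numeric(1 - pca.prcomp$importance[3,]) * (n - 1) / n  # reproduces msev up to rounding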

In conclusion, we have seen how much valuable information PCA provides, and even more so if we elaborate its outputs a little.
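
To wrap up, here is a minimal sketch that bundles all the enrichment steps into a single call; the function name enrich.pca and its signature are mine, not from any package:

require(psych)
enrich.pca <- function(data, n.iter = 100) {
  scaled <- scale(data)
  pca <- prcomp(scaled, retx = TRUE, center = FALSE, scale. = FALSE)
  pca$importance  <- summary(pca)$importance  # variance table
  pca$eigenvalues <- pca$sdev^2               # eigenvalues of the correlation matrix
  pca$nfact <- fa.parallel(data, fa = 'PC', n.iter = n.iter)$nfact
  pca$scaled <- scaled
  k <- pca$nfact
  pca$reconstructed <- pca$x[, 1:k, drop=FALSE] %*% solve(pca$rotation)[1:k, , drop=FALSE]
  pca$residual <- pca$scaled - pca$reconstructed
  pca
}
enriched <- enrich.pca(data)  # usage: same enriched outputs as built step by step above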
