Posted to user@spark.apache.org by Amlan Jyoti <am...@tcs.com> on 2017/02/14 10:45:43 UTC

Different Results When Performing PCA with Spark and R

Dear all,

I was exploring a use case of PCA and found that the results of 
Spark ML and R are different. 

More clearly,
 1) eigenMatrix_Spark EQUALS-TO eigenMatrix_R
 2) transformedData_Spark NOT-EQUALS-TO transformedData_R
 
Sample Spark Code
----------------------------------
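                // Assumes imports from org.apache.spark.ml.feature (PCA, PCAModel),
                // org.apache.spark.ml.linalg (DenseMatrix) and org.apache.spark.sql (Dataset, Row).
                PCAModel pca = new PCA()
                        .setInputCol("features")
                        .setOutputCol("pcaFeatures")
                        .setK(numberOfCol)
                        .fit(inputDataset);
                DenseMatrix eigenMatrix_Spark = pca.pc();
                Dataset<Row> transformedData_Spark =
                        pca.transform(inputDataset.select("features"));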

Sample R Code
--------------------------------- 
                pc <- prcomp(mydata)            # center = TRUE by default
                eigenMatrix_R <- pc$rotation    # loadings (note the lowercase name)
                transformedData_R <- pc$x       # scores of the centered data

********************************************************************************************************************************************************************************************** 
 
After further analysis, I found out that:

        - By default, R's prcomp mean-centers the input dataset (parameter 
'center = TRUE') and then uses this centered dataset to calculate both the 
Eigen Matrix and the Transformed Data.
 
        - Spark, on the other hand, appears to mean-center the input data 
only when calculating the Eigen Matrix, and uses the original (uncentered) 
dataset to compute the Transformed Data. [Generally, Transformed Data = 
Dataset * Eigen Matrix, with one observation per row.]
 
That is why the Eigen Matrix is the same for Spark and R, whereas the 
Transformed Data differs between the two.
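
To make the difference concrete, here is a minimal sketch in plain Java (no 
Spark dependency). The class name, method names, sample data and loading 
matrix are all made up for illustration; this is not Spark's or R's actual 
implementation. It only shows that, if the same Eigen Matrix is applied to 
the raw data instead of the centered data, every transformed row is shifted 
by the same constant vector (column means * Eigen Matrix).

```java
import java.util.Arrays;

// Illustrative only: hypothetical helper class, not part of Spark or R.
class PcaCenteringDemo {

    // Column means of a row-major matrix (one observation per row).
    static double[] columnMeans(double[][] x) {
        double[] means = new double[x[0].length];
        for (double[] row : x) {
            for (int j = 0; j < row.length; j++) {
                means[j] += row[j] / x.length;
            }
        }
        return means;
    }

    // Returns a copy of x with the column means subtracted from every row.
    static double[][] center(double[][] x) {
        double[] means = columnMeans(x);
        double[][] out = new double[x.length][x[0].length];
        for (int i = 0; i < x.length; i++) {
            for (int j = 0; j < x[i].length; j++) {
                out[i][j] = x[i][j] - means[j];
            }
        }
        return out;
    }

    // Matrix product x * v: projects each row of x onto the columns of v.
    static double[][] project(double[][] x, double[][] v) {
        double[][] out = new double[x.length][v[0].length];
        for (int i = 0; i < x.length; i++) {
            for (int k = 0; k < v[0].length; k++) {
                for (int j = 0; j < v.length; j++) {
                    out[i][k] += x[i][j] * v[j][k];
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample data: 3 observations, 2 features.
        double[][] data = {{2, 0}, {4, 2}, {6, 4}};

        // Assumed Eigen Matrix (one principal component per column).
        double s = Math.sqrt(0.5);
        double[][] v = {{s, -s}, {s, s}};

        double[][] raw = project(data, v);              // uncentered input
        double[][] centered = project(center(data), v); // centered input
        double[] offset = project(new double[][] {columnMeans(data)}, v)[0];

        System.out.println("raw      = " + Arrays.deepToString(raw));
        System.out.println("centered = " + Arrays.deepToString(centered));
        System.out.println("offset   = " + Arrays.toString(offset));
    }
}
```

In this sketch, subtracting "offset" from each row of "raw" gives "centered", 
so the two transformed datasets differ only by a constant shift per component.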

So, can anyone please explain why Spark does not use the mean-centered 
input data when calculating the Transformed Data (although it does when 
calculating the Eigen Matrix), as opposed to R?
 [Mean-centering the input data first is recommended for a good PCA 
analysis, as pointed out by many technical papers and as done by default in R.]


With Best Regards
Amlan Jyoti