Posted to user@spark.apache.org by Amlan Jyoti <am...@tcs.com> on 2017/02/14 10:45:43 UTC
Different Results When Performing PCA with Spark and R
Dear all,
I was exploring a use case for PCA and found that Spark ML and R
produce different results.
More specifically:
1) eigenMatrix_Spark EQUALS-TO eigenMatrix_R
2) transformedData_Spark NOT-EQUALS-TO transformedData_R
Sample Spark Code
----------------------------------
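// Fit PCA on the "features" column, keeping numberOfCol components.
PCAModel pca = new PCA()
    .setInputCol("features")
    .setOutputCol("pcaFeatures")
    .setK(numberOfCol)
    .fit(inputDataset);
// pc() returns the principal components, one per column.
DenseMatrix eigenMatrix_Spark = pca.pc();
Dataset<Row> transformedData_Spark = pca.transform(inputDataset.select("features"));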
Sample R Code
---------------------------------
pc <- prcomp(mydata)              # center = TRUE by default
eigenMatrix_R <- pc$rotation      # R is case-sensitive: 'rotation', not 'Rotation'
transformedData_R <- pc$x
**********************************************************************************************************************************************************************************************
After further analysis, I found that:
- By default, R first mean-centers the input dataset and then uses the
centered dataset to calculate both the Eigen Matrix and the Transformed
Data. [The 'center = TRUE' parameter of prcomp controls the
mean-centering.]
- Spark, on the other hand, appears to mean-center the input data only
when calculating the Eigen Matrix, and to use the original (uncentered)
dataset when computing the Transformed Data. [In general, Transformed
Data = Dataset * Eigen Matrix, with one observation per row.]
That is why the Eigen Matrix is the same in Spark and R, whereas the
Transformed Data differs between the two.
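To illustrate the suspected difference, here is a minimal plain-Java sketch with no Spark dependency. It assumes the behaviour described above: the same Eigen Matrix is applied once to the original rows (as Spark seems to do) and once to the mean-centered rows (as R does). The class name CenteringDemo, its helper methods, the sample data and the 2 x 2 matrix are all hypothetical and chosen only for illustration; the matrix is not computed from the data. By linearity the two projections differ by a constant offset, the projection of the column means, so subtracting that offset from the uncentered result gives the centered one.

```java
import java.util.Arrays;

public class CenteringDemo {

    // Column means of a row-major data matrix (one observation per row).
    public static double[] columnMeans(double[][] x) {
        double[] means = new double[x[0].length];
        for (double[] row : x) {
            for (int j = 0; j < row.length; j++) {
                means[j] += row[j] / x.length;
            }
        }
        return means;
    }

    // Copy of x with the column means subtracted from every row.
    public static double[][] center(double[][] x) {
        double[] means = columnMeans(x);
        double[][] out = new double[x.length][];
        for (int i = 0; i < x.length; i++) {
            out[i] = new double[x[i].length];
            for (int j = 0; j < x[i].length; j++) {
                out[i][j] = x[i][j] - means[j];
            }
        }
        return out;
    }

    // Row-wise projection x * pc, where pc has one component per column.
    public static double[][] project(double[][] x, double[][] pc) {
        int k = pc[0].length;
        double[][] out = new double[x.length][k];
        for (int i = 0; i < x.length; i++) {
            for (int c = 0; c < k; c++) {
                for (int j = 0; j < pc.length; j++) {
                    out[i][c] += x[i][j] * pc[j][c];
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample data and an illustrative orthonormal matrix.
        double[][] data = {{2.0, 0.0}, {4.0, 2.0}, {6.0, 4.0}};
        double s = Math.sqrt(0.5);
        double[][] pc = {{s, -s}, {s, s}};

        double[][] uncentered = project(data, pc);          // suspected Spark behaviour
        double[][] centered = project(center(data), pc);    // R's default behaviour
        double[] offset = project(new double[][] {columnMeans(data)}, pc)[0];

        // Each uncentered row minus the offset should match the centered row.
        for (int i = 0; i < data.length; i++) {
            System.out.println(Arrays.toString(uncentered[i]) + " - "
                + Arrays.toString(offset) + " ~ " + Arrays.toString(centered[i]));
        }
    }
}
```

If Spark does behave this way, the same idea could be applied to its output: either center the "features" column before calling transform, or subtract the projected column means afterwards. Individual components may also differ in sign between the two tools, since the sign of an eigenvector is arbitrary.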
So, can anyone please explain why Spark does not use the mean-centered
input data when calculating the Transformed Data [although it does when
calculating the Eigen Matrix], as opposed to R?
[Mean-centering the input data first is recommended for a sound PCA
analysis by many technical papers, and it is also what R does by default.]
With Best Regards
Amlan Jyoti