Posted to user@spark.apache.org by Marek Wiewiorka <ma...@gmail.com> on 2017/02/06 15:59:04 UTC

PCA slow in comparison with single-threaded R version

Hi All,
I hit performance issues when running PCA on a matrix with a larger number
of features (2.5k x 15k):

import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg.DenseVector
import org.apache.spark.mllib.linalg.Vectors

val sampleCnt = 2504
val featureCnt = 15000
val gen = sc.parallelize((1 to sampleCnt).map { r =>
  val rnd = new scala.util.Random()
  Vectors.dense((1 to featureCnt).map(k => rnd.nextInt(2).toDouble).toArray)
})
val rowMat = new RowMatrix(gen)
val pc: Matrix = rowMat.computePrincipalComponents(10)

I'm running the above code on a standalone Spark cluster of 4 nodes with
128 cores in total.
From what I observed, there is a final stage of the algorithm that is
executed on the driver using a single thread, and it seems to be the
bottleneck here - is there any way of tuning it?
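One workaround I'm considering (not verified yet) is computing a truncated SVD instead, since RowMatrix.computeSVD only needs the top k singular vectors and may distribute more of the work than the covariance-based path. A rough sketch, continuing from the rowMat defined above:

```scala
// Untested sketch: top-10 principal directions via truncated SVD.
// Assumes rowMat from the snippet above. Note that for PCA proper the
// columns should be mean-centered first; this skips centering, so the
// result is only an approximation of computePrincipalComponents(10).
val svd = rowMat.computeSVD(10, computeU = false)
val pcApprox = svd.V // featureCnt x 10 matrix of right singular vectors
```

No idea yet whether this avoids the single-threaded driver stage in practice, though.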

It takes ages (actually I was forced to kill it after 30 min or so),
whereas the same computation in R finishes in ~6.5 minutes on my
laptop (single-threaded):
> a<-replicate(2504, rnorm(5000))
> nrow(a)
[1] 5000
> ncol(a)
[1] 2504
> system.time(b<-prcomp(a))
   user  system elapsed
190.284   0.392 191.150
> a<-replicate(2504, rnorm(15000))
> system.time(b<-prcomp(a))
   user  system elapsed
386.520   0.384 386.933

I've compiled Spark with support for native matrix libs using the
-Pnetlib-lgpl switch.
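To double-check that the native implementation is actually picked up at runtime (and that I'm not silently falling back to pure Java), I print the resolved BLAS/LAPACK backends in spark-shell - seeing F2jBLAS here would mean netlib-java did not load the native libs:

```scala
// Which backends netlib-java resolved at runtime.
// NativeSystemBLAS / NativeRefBLAS => native libs loaded;
// F2jBLAS => pure-Java fallback despite -Pnetlib-lgpl.
println(com.github.fommil.netlib.BLAS.getInstance().getClass.getName)
println(com.github.fommil.netlib.LAPACK.getInstance().getClass.getName)
```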

Has anyone experienced such problems with mllib version of PCA?

Thanks,
Marek