Posted to user@spark.apache.org by Marek Wiewiorka <ma...@gmail.com> on 2017/02/06 15:59:04 UTC
PCA slow in comparison with single-threaded R version
Hi All,
I hit performance issues running PCA on a matrix with a large number of
features (2,504 samples x 15,000 features):
import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg.DenseVector
import org.apache.spark.mllib.linalg.Vectors
val sampleCnt = 2504
val featureCnt = 15000
val gen = sc.parallelize((1 to sampleCnt).map { r =>
  val rnd = new scala.util.Random()
  Vectors.dense((1 to featureCnt).map(k => rnd.nextInt(2).toDouble).toArray)
})
val rowMat = new RowMatrix(gen)
val pc: Matrix = rowMat.computePrincipalComponents(10)
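One thing I have been sketching as a possible workaround (untested for speed, and assuming the bottleneck really is the local decomposition of the 15,000 x 15,000 covariance matrix on the driver) is to ask for a truncated SVD instead, since RowMatrix.computeSVD can run in a distributed mode for large problems:

```scala
// Sketch only: the top-k right singular vectors of a *column-centered*
// RowMatrix span the same subspace as the top-k principal components.
// Note this snippet does not center the columns, so it is not an exact
// drop-in replacement for computePrincipalComponents.
val svd = rowMat.computeSVD(10, computeU = false)
val topDirections = svd.V // 15000 x 10 matrix of right singular vectors
```

I have not verified whether this path avoids the single-threaded driver stage, so treat it as an experiment rather than a fix.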
I'm running the above code on a standalone Spark cluster of 4 nodes with
128 cores in total.
From what I observed, there is a final stage of the algorithm that
executes on the driver using a single thread, and it seems to be the
bottleneck here - is there any way of tuning it?
It takes ages (I was actually forced to kill the job after 30 minutes or so),
whereas the same computation in R executes in ~6.5 minutes on my
laptop (single-threaded):
> a<-replicate(2504, rnorm(5000))
> nrow(a)
[1] 5000
> ncol(a)
[1] 2504
> system.time(b<-prcomp(a))
user system elapsed
190.284 0.392 191.150
> a<-replicate(2504, rnorm(15000))
> system.time(b<-prcomp(a))
user system elapsed
386.520 0.384 386.933
I've compiled Spark with support for native matrix libraries using the
-Pnetlib-lgpl switch.
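To double-check that the native libraries are actually picked up at runtime (and that Spark is not silently falling back to the pure-JVM F2J implementation), something like this should print the BLAS backend that netlib-java resolved - I'm assuming the com.github.fommil.netlib classes are on the classpath:

```scala
// Prints the BLAS implementation netlib-java loaded at runtime;
// a class name containing "F2j" means the native libs were NOT found.
import com.github.fommil.netlib.BLAS
println(BLAS.getInstance().getClass.getName)
```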
Has anyone experienced such problems with the MLlib version of PCA?
Thanks,
Marek