Posted to user@spark.apache.org by Yann-Aël Le Borgne <ya...@gmail.com> on 2016/07/31 13:14:29 UTC
Spark R 2.0 dapply very slow
Hello all,
I have just started testing SparkR 2.0, and I find the execution of dapply
very slow.
For example, in plain R the following code

set.seed(2)
random_DF <- data.frame(matrix(rnorm(1000000), 100000, 10))
system.time(dummy_res <- random_DF[random_DF[, 1] > 1, ])
   user  system elapsed
  0.005   0.000   0.006

runs in about 6 ms.
Now, if I create a Spark DataFrame with 4 partitions and run on 4 cores, I get:
sparkR.session(master = "local[4]")
random_DF_Spark <- repartition(createDataFrame(random_DF), 4)
subset_DF_Spark <- dapply(
  random_DF_Spark,
  function(x) {
    y <- x[x[, 1] > 1, ]
    y
  },
  schema(random_DF_Spark))
system.time(dummy_res_Spark <- collect(subset_DF_Spark))
   user  system elapsed
  2.003   0.119  62.919
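
For reference, here is a sketch of how the one-time cost of distributing the
data could be separated from the dapply step itself (assuming cache() and
count() force materialization before the timed collect):

# Sketch: materialize the distributed data first, then time only
# dapply + collect (count() triggers the first, expensive pass)
cache(random_DF_Spark)
count(random_DF_Spark)
system.time(dummy_res_Spark <- collect(dapply(
  random_DF_Spark,
  function(x) x[x[, 1] > 1, ],
  schema(random_DF_Spark))))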
That is about one minute, which seems abnormally slow. Am I missing something?
I also get a warning:

16/07/31 15:07:02 WARN TaskSetManager: Stage 64 contains a task of very
large size (16411 KB). The maximum recommended task size is 100 KB.

Why is the recommended task size limit so low?
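
The large tasks presumably come from createDataFrame serializing the local
data.frame into the tasks themselves. A sketch of a possible workaround,
assuming a scratch path of my own choosing, would be to write the data to
disk and let the executors read it back with read.df:

# Sketch (the /tmp path is made up): let the executors read the data
# from disk instead of shipping the local data.frame inside each task
write.csv(random_DF, "/tmp/random_DF.csv", row.names = FALSE)
random_DF_Spark <- repartition(
  read.df("/tmp/random_DF.csv", source = "csv",
          header = "true", inferSchema = "true"),
  4)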
I am using R 3.3.0 on Mac OS 10.10.5
Any insight welcome,
Best,
Yann-Aël
--
=========================================
Yann-Aël Le Borgne
Machine Learning Group
Université Libre de Bruxelles
http://mlg.ulb.ac.be
http://www.ulb.ac.be/di/map/yleborgn
=========================================