Posted to user@spark.apache.org by Yann-Aël Le Borgne <ya...@gmail.com> on 2016/07/31 13:14:29 UTC

SparkR 2.0 dapply very slow

Hello all,

I have just started testing SparkR 2.0, and I find dapply to be very slow.

For example, in plain R, the following code

set.seed(2)
# 100,000 rows x 10 columns of standard normal values
random_DF <- data.frame(matrix(rnorm(1000000), 100000, 10))
# keep the rows whose first column is greater than 1
system.time(dummy_res <- random_DF[random_DF[, 1] > 1, ])
   user  system elapsed
  0.005   0.000   0.006

runs in about 6 ms.

Now, if I create a Spark DataFrame with 4 partitions and run on 4 local cores, I get:

sparkR.session(master = "local[4]")

# the same data as a Spark DataFrame, split across 4 partitions
random_DF_Spark <- repartition(createDataFrame(random_DF), 4)

# apply the same row filter to each partition; the output schema is unchanged
subset_DF_Spark <- dapply(
  random_DF_Spark,
  function(x) {
    y <- x[x[1] > 1, ]
    y
  },
  schema(random_DF_Spark))

system.time(dummy_res_Spark <- collect(subset_DF_Spark))
   user  system elapsed
  2.003   0.119  62.919

That is about one minute, which seems abnormally slow. Am I missing something?
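
In case it helps narrow things down, here is a rough sketch of a breakdown I
could run to see how much time goes into createDataFrame/repartition on the R
side versus the job that collect actually triggers (same variable names as
above; dapply itself is lazy, so the filtering only happens at collect):

  system.time(random_DF_Spark <- repartition(createDataFrame(random_DF), 4))

  subset_DF_Spark <- dapply(
    random_DF_Spark,
    function(x) { x[x[1] > 1, ] },
    schema(random_DF_Spark))

  # the filter job only runs when collect() forces evaluation
  system.time(dummy_res_Spark <- collect(subset_DF_Spark))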

I also get a warning:

16/07/31 15:07:02 WARN TaskSetManager: Stage 64 contains a task of very large
size (16411 KB). The maximum recommended task size is 100 KB.

Why is the recommended 100 KB task size so low?
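
For reference, the local data frame itself should only be on the order of 8 MB
in memory (1,000,000 doubles at 8 bytes each), which I can check with:

  print(object.size(random_DF), units = "Mb")

and compare with the 16411 KB task size reported in the warning.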

I am using R 3.3.0 on Mac OS X 10.10.5.

Any insight welcome,
Best,
Yann-Aël

-- 
=========================================
Yann-Aël Le Borgne
Machine Learning Group
Université Libre de Bruxelles

http://mlg.ulb.ac.be
http://www.ulb.ac.be/di/map/yleborgn
=========================================