Posted to dev@systemml.apache.org by Rajarshi Bhadra <bh...@gmail.com> on 2017/07/26 11:32:51 UTC

Implementation of Parallelized process in Standalone Spark Cluster using SystemML

Hi,

I have been using SystemML for some time and I am finding it extremely
useful for scaling up my algorithm using Spark. However, there are a few
aspects I do not fully understand and would like some clarification on.

My system configuration: 244 GB RAM, 32 cores.
My Spark configuration:
    spark.executor.cores          4
    spark.driver.memory           80g
    spark.executor.memory         20g
    spark.memory.fraction         0.75
    spark.worker.cleanup.enabled  true
    spark.default.parallelism     1

I have a process in R which I am trying to implement in SystemML. The
process is similar to randomForest and involves growing trees. In R I
parallelize it using parLapply, so that n trees are grown in n parallel
processes. I have implemented the algorithm in DML in an identical way
and run it inside a parfor loop (a rough sketch follows the list below).
There are two main issues I am facing:

1. In R with ncore = 16 I get 30 trees in 10 minutes, but in Spark via
SystemML the same process takes 1 hour.
2. I have also noticed that if one tree takes 2 minutes to run, 5 trees
take 7-8 minutes. It seems I am unable to parallelize the process across
trees in SystemML.
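
To make the structure concrete, here is a minimal DML sketch of my loop;
grow_tree, the dummy data, and the bootstrap sampling below are simplified
placeholders rather than my actual code:

    # simplified stand-in for my real tree-growing function;
    # it returns one flattened tree model as a row vector
    grow_tree = function(Matrix[Double] X, Matrix[Double] y)
        return (Matrix[Double] model) {
        # ... real splitting logic goes here ...
        model = matrix(0, rows=1, cols=ncol(X))
    }

    # dummy data just for illustration
    X = rand(rows=100000, cols=10)
    y = rand(rows=100000, cols=1)

    n_trees = 30
    models = matrix(0, rows=n_trees, cols=ncol(X))
    parfor (i in 1:n_trees) {
        # bootstrap sample of the training rows for tree i
        s  = rand(rows=nrow(X), cols=1, min=0, max=1) <= 0.63
        Xi = removeEmpty(target=X, margin="rows", select=s)
        yi = removeEmpty(target=y, margin="rows", select=s)
        # each iteration writes a disjoint row, so iterations are independent
        models[i,] = grow_tree(Xi, yi)
    }

Should I be passing explicit parfor hints here (e.g. par or
mode=REMOTE_SPARK) to get tree-level parallelism, or should the parfor
optimizer pick that up on its own?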

It would be great if someone could help me out with this.

Thank you
Rajarshi