Posted to user@spark.apache.org by 陈哲 <cz...@gmail.com> on 2016/10/20 10:21:20 UTC

Spark Random Forest training takes the same time on YARN as in standalone mode

I'm training a random forest model using Spark 2.0 on YARN, with a command like:

$SPARK_HOME/bin/spark-submit \
  --class com.netease.risk.prediction.HelpMain \
  --master yarn --deploy-mode client \
  --driver-cores 1 --num-executors 32 --executor-cores 2 \
  --driver-memory 10g --executor-memory 6g \
  --conf spark.rpc.askTimeout=3000 --conf spark.rpc.lookupTimeout=3000 \
  --conf spark.rpc.message.maxSize=2000 --conf spark.driver.maxResultSize=0 \
....
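
For context, a minimal Spark 2.0 random forest training job looks roughly like the sketch below (the object name, data path, and tree parameters are illustrative only, not the actual HelpMain code):

import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.sql.SparkSession

object RandomForestSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rf-training").getOrCreate()

    // A DataFrame with "label" and "features" columns; the real job
    // would build features from raw data first.
    val training = spark.read.format("libsvm").load("data/training.libsvm")

    val rf = new RandomForestClassifier()
      .setNumTrees(100)
      .setMaxDepth(10)

    // fit() distributes the tree training across the executors
    val model = rf.fit(training)
    model.write.overwrite().save("models/rf")

    spark.stop()
  }
}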

The training process takes almost 8 hours.

I also tried training the model on a single machine with --master local[4],
and the whole process still took 8-9 hours.

My question is: why doesn't running on YARN save any time? Isn't this
supposed to be distributed across 32 executors? Am I missing something, and
what can I do to improve this and save time?

Thanks

Re: Spark Random Forest training takes the same time on YARN as in standalone mode

Posted by Xi Shen <da...@gmail.com>.
If you are running locally, I do not see the point of starting 32 executors
with 2 cores each: in local[4] mode everything runs in a single JVM with 4
threads, so those settings are ignored.
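
That said, one thing worth checking on the YARN run: MLlib's tree training parallelism is bounded by the number of partitions of the input data, so 32 executors cannot help if the DataFrame only has a few partitions. A quick check, assuming a training DataFrame and classifier named as in the sketch above:

// How many partitions does the training data have?
println(training.rdd.getNumPartitions)

// If the count is small, repartition to roughly 2-3x the total core count
// (32 executors * 2 cores = 64 cores here) before fitting:
val model = rf.fit(training.repartition(128))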

Also, you can check the Spark web UI to find out where the time is spent.
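
In client mode the driver logs the web UI address when it starts; you can also print it from the job itself (Spark 2.0+):

// uiWebUrl is an Option[String] on SparkContext since Spark 2.0
spark.sparkContext.uiWebUrl.foreach(println)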

Also, you may want to read:
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/



Thanks,
David S.