Posted to issues@spark.apache.org by "zhengruifeng (JIRA)" <ji...@apache.org> on 2018/08/13 06:49:00 UTC

[jira] [Commented] (SPARK-24581) Design: BarrierTaskContext.barrier()

    [ https://issues.apache.org/jira/browse/SPARK-24581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16577893#comment-16577893 ] 

zhengruifeng commented on SPARK-24581:
--------------------------------------

It may be meaningful to support a resettable iterator in BarrierTaskContext when the RDD is cached.

In a barrier task, other distributed systems such as MPI may be applied, and it is common for them to iterate over the partition many times. Currently, mapPartitions after barrier does not support repeated iteration, so it is up to users to cache the partition themselves.

An example is XGBoost on Spark: https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j-spark/src/main/scala/ml/dmlc/xgboost4j/scala/spark/XGBoost.scala#L124

XGBoost has to create a temporary file for external-memory storage, even if the entire dataset is already cached.
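
To make the single-pass limitation concrete, here is a minimal sketch in plain Python with no Spark dependency. It only assumes that the iterator handed to a partition function can be consumed once. The names single_pass, naive_on_partition and train_on_partition are hypothetical and are not part of Spark or XGBoost; the "training" is just a sum per round.

```python
from typing import Iterable, Iterator, List


def single_pass(records: Iterable[float]) -> Iterator[float]:
    """Stand-in for the one-shot iterator a partition function receives."""
    return iter(records)


def naive_on_partition(partition: Iterator[float], num_rounds: int) -> List[float]:
    # Reads the same iterator in every round. Only the first round sees
    # any records; later rounds find the iterator already exhausted.
    return [sum(partition) for _ in range(num_rounds)]


def train_on_partition(partition: Iterator[float], num_rounds: int) -> List[float]:
    # Hypothetical workaround: materialize the partition once so that
    # every round can start again from the first record. This is the
    # extra copy that users currently have to manage themselves.
    buffered = list(partition)
    return [sum(buffered) for _ in range(num_rounds)]


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0]
    print(naive_on_partition(single_pass(data), 3))
    print(train_on_partition(single_pass(data), 3))
```

A resettable iterator would let the second function skip the buffering step when the underlying data is already cached.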

> Design: BarrierTaskContext.barrier()
> ------------------------------------
>
>                 Key: SPARK-24581
>                 URL: https://issues.apache.org/jira/browse/SPARK-24581
>             Project: Spark
>          Issue Type: Story
>          Components: ML, Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Xiangrui Meng
>            Assignee: Jiang Xingbo
>            Priority: Major
>
> We need to provide a communication barrier function to users to help coordinate tasks within a barrier stage. This is very similar to the MPI_Barrier function in MPI. This story is for its design.
>  
> Requirements:
>  * Low-latency. The tasks should be unblocked soon after all tasks have reached this barrier. The latency is more important than CPU cycles here.
>  * Support unlimited timeout with proper logging. DL tasks might take a very long time to converge, so we should support unlimited timeout with proper logging so that users know why a task is waiting.
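
The two requirements above can be sketched with threads in plain Python. This is only an illustration under stated assumptions, not the design proposed in this issue: LoggingBarrier and run_tasks are hypothetical names, the barrier is single-use, and threads stand in for Spark tasks. Waiters are woken as soon as the last task arrives, and the wait itself never times out; the interval only controls how often a waiting task logs.

```python
import logging
import threading
import time
from typing import List, Tuple


class LoggingBarrier:
    """Hypothetical single-use barrier: waits indefinitely, logs periodically."""

    def __init__(self, parties: int, log_interval_s: float = 1.0) -> None:
        self._parties = parties
        self._arrived = 0
        self._cond = threading.Condition()
        self._log_interval_s = log_interval_s

    def wait(self, task_id: int) -> None:
        with self._cond:
            self._arrived += 1
            if self._arrived == self._parties:
                # Last arrival: release every waiter immediately.
                self._cond.notify_all()
                return
            start = time.monotonic()
            while self._arrived < self._parties:
                # Condition.wait returns False when the interval elapses
                # without a notification; log and keep waiting.
                notified = self._cond.wait(timeout=self._log_interval_s)
                if not notified and self._arrived < self._parties:
                    logging.info(
                        "task %d has waited %.2fs at the barrier (%d/%d arrived)",
                        task_id, time.monotonic() - start,
                        self._arrived, self._parties,
                    )


def run_tasks(num_tasks: int) -> List[Tuple[str, int]]:
    barrier = LoggingBarrier(num_tasks, log_interval_s=0.05)
    events: List[Tuple[str, int]] = []
    lock = threading.Lock()

    def task(task_id: int) -> None:
        time.sleep(0.02 * task_id)  # tasks reach the barrier at different times
        with lock:
            events.append(("before", task_id))
        barrier.wait(task_id)
        with lock:
            events.append(("after", task_id))

    threads = [threading.Thread(target=task, args=(i,)) for i in range(num_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    print(run_tasks(4))
```

In this sketch every "before" event is recorded ahead of any "after" event, because no task leaves the barrier until all tasks have entered it.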


