You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Yongjia Wang (JIRA)" <ji...@apache.org> on 2015/10/18 22:55:05 UTC

[jira] [Updated] (SPARK-11175) Concurrent execution of JobSet within a batch in Spark streaming

     [ https://issues.apache.org/jira/browse/SPARK-11175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yongjia Wang updated SPARK-11175:
---------------------------------
    Description: 
Spark StreamingContext can register multiple independent Input DStreams (such as from different Kafka topics) that results in multiple independent jobs for each batch. These jobs should better be run concurrently to maximally take advantage of available resources. 
I went through a few hacks:
1.  launch the rdd action into a new thread from the function passed to foreachRDD. However, it will mess up with streaming statistics since the batch will finish immediately even the jobs it launched are still running in another thread. This can further affect resuming from checkpoint, since all batches are completed right away even the actual threaded jobs may fail and checkpoint only resume the last batch.
2. It's possible by just using foreachRDD and the available APIs to block the Jobset to wait for all threads to join, but doing this would mess up with closure serialization, and make checkpoint not usable.
Therefore, I would propose to make the default behavior to just run all jobs of the current batch concurrently, and mark batch completion when all the jobs complete.

  was:
Spark StreamingContext can register multiple independent Input DStreams (such as from different Kafka topics) that results in multiple independent jobs for each batch. These jobs should better be run concurrently to maximally take advantage of available resources. 
I went through a few hacks:
1.  launch the rdd action into a new thread from the function passed to foreachRDD to achieve. However, it will mess up with streaming statistics since the batch will finish immediately even the jobs it launched are still running in another thread. This can further affect resuming from checkpoint, since all batches are completed right away even the actual threaded jobs may fail and checkpoint only resume the last batch.
2. It's possible by just using foreachRDD and the available APIs to block the Jobset to wait for all threads to join, but doing this would mess up with closure serialization, and make checkpoint not usable.
Therefore, I would propose to make the default behavior to just run all jobs of the current batch concurrently, and mark batch completion when all the jobs complete.


> Concurrent execution of JobSet within a batch in Spark streaming
> ----------------------------------------------------------------
>
>                 Key: SPARK-11175
>                 URL: https://issues.apache.org/jira/browse/SPARK-11175
>             Project: Spark
>          Issue Type: Improvement
>          Components: Streaming
>            Reporter: Yongjia Wang
>
> Spark StreamingContext can register multiple independent Input DStreams (such as from different Kafka topics) that results in multiple independent jobs for each batch. These jobs should better be run concurrently to maximally take advantage of available resources. 
> I went through a few hacks:
> 1.  launch the rdd action into a new thread from the function passed to foreachRDD. However, it will mess up with streaming statistics since the batch will finish immediately even the jobs it launched are still running in another thread. This can further affect resuming from checkpoint, since all batches are completed right away even the actual threaded jobs may fail and checkpoint only resume the last batch.
> 2. It's possible by just using foreachRDD and the available APIs to block the Jobset to wait for all threads to join, but doing this would mess up with closure serialization, and make checkpoint not usable.
> Therefore, I would propose to make the default behavior to just run all jobs of the current batch concurrently, and mark batch completion when all the jobs complete.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org