Posted to issues@spark.apache.org by "Sungho Ham (JIRA)" <ji...@apache.org> on 2016/12/30 08:49:58 UTC
[jira] [Updated] (SPARK-19036) Merging delayed micro batches
[ https://issues.apache.org/jira/browse/SPARK-19036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sungho Ham updated SPARK-19036:
-------------------------------
Summary: Merging delayed micro batches (was: Merging delayed micro batches for parallelism)
> Merging delayed micro batches
> -----------------------------
>
> Key: SPARK-19036
> URL: https://issues.apache.org/jira/browse/SPARK-19036
> Project: Spark
> Issue Type: New Feature
> Components: Structured Streaming
> Reporter: Sungho Ham
> Priority: Minor
>
> Efficiency of parallel execution gets worse when data is unevenly distributed across both time and messages. These skews delay streaming batches despite sufficient resources.
> Merging small delayed batches could improve the efficiency of micro batching.
> Here is an example.
> ||batch_time || messages ||
> |4 | 1 <-- current time |
> |3 | 1 |
> |2 | 1 |
> |1 | 1000 <-- processing |
> After the long-running batch (t=1), each of the three following batches has only one message. These batches cannot utilize Spark's parallelism. By merging the stream RDDs from t=2 to t=4 into one RDD, Spark can process them 3 times faster.
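The 3x figure can be checked with simple arithmetic. A minimal sketch in plain Python (not Spark API), assuming each message costs one time unit and the cluster has at least three free executor slots:

```python
import math

UNIT = 1.0    # assumed cost of processing one message, in time units
SLOTS = 4     # assumed number of free executor slots

batches = [1, 1, 1]  # messages in the delayed batches at t=2, t=3, t=4

# Unmerged: each micro batch runs on its own, one after another,
# so three scheduling waves are needed even though slots sit idle.
sequential = sum(math.ceil(n / SLOTS) * UNIT for n in batches)

# Merged: one RDD whose messages spread across the free slots in one wave.
merged = math.ceil(sum(batches) / SLOTS) * UNIT

print(sequential / merged)  # 3.0: three waves of work collapse into one
```

The speedup is exactly the number of merged batches, as long as the merged message count still fits in the free slots.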
> If the processing time of each message is also highly skewed, more than parallelism is at stake. Suppose the batches from t=2 to t=4 each consist of one BIG message and ten small messages. Then merging the three RDDs could still considerably improve efficiency.
> ||batch_time || messages ||
> | 4 | 1+10 <-- current time |
> | 3 | 1+10 |
> | 2 | 1+10 |
> | 1 | 1000 <-- processing |
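The same arithmetic applies to the skewed case. A hedged sketch in plain Python (BIG = 10 time units and small = 1 are assumed figures; ample free slots are assumed so the longest task dominates each scheduling wave):

```python
BIG, SMALL, SLOTS = 10.0, 1.0, 40   # assumed task costs and free slot count

batch = [BIG] + [SMALL] * 10        # one delayed batch: 1 BIG + 10 small messages

def wall_time(tasks, slots):
    # With enough free slots every task starts at once, so the wall time
    # of one scheduling wave is simply the longest task in it.
    assert len(tasks) <= slots
    return max(tasks)

# Unmerged: the three batches run back to back; each waits on its BIG message.
unmerged = 3 * wall_time(batch, SLOTS)   # 30.0

# Merged: the three BIG messages run concurrently in a single wave.
merged = wall_time(batch * 3, SLOTS)     # 10.0
```

Here each unmerged batch already uses 11 slots, so raw parallelism is not the bottleneck; the gain comes from overlapping the three BIG messages instead of serializing them.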
> Two parameters could describe the merging behavior.
> - delay_time_limit: when to start merging
> - merge_record_limit: when to stop merging
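One way the two parameters could interact, sketched in plain Python (the `Batch` record, the `batches_to_merge` name, and the oldest-first policy are assumptions for illustration, not an existing Spark API):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Batch:
    batch_time: int   # the micro batch's scheduled time
    records: int      # number of messages in the batch

def batches_to_merge(pending: List[Batch], current_time: int,
                     delay_time_limit: int,
                     merge_record_limit: int) -> List[Batch]:
    """Pick delayed batches to merge, oldest first.

    delay_time_limit:   merging starts only once the oldest pending batch
                        has waited at least this long (when to start merge).
    merge_record_limit: merging stops before the merged batch would exceed
                        this many records (when to stop merge).
    """
    if not pending:
        return []
    oldest = min(b.batch_time for b in pending)
    if current_time - oldest < delay_time_limit:
        return []                       # not delayed enough yet
    picked, total = [], 0
    for b in sorted(pending, key=lambda b: b.batch_time):
        if picked and total + b.records > merge_record_limit:
            break                       # stop: record cap would be exceeded
        picked.append(b)
        total += b.records
    return picked
```

With the first example above (batches at t=2 to t=4 of one message each, current time 4, delay_time_limit=2, merge_record_limit=100), all three delayed batches would be picked for one merged run; lowering merge_record_limit to 2 would cut the merge off after two batches.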
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org