Posted to issues@spark.apache.org by "Josh Rosen (JIRA)" <ji...@apache.org> on 2015/06/24 19:51:05 UTC

[jira] [Updated] (SPARK-7441) Implement microbatch functionality so that Spark Streaming can process a large backlog of existing files discovered in batch in smaller batches

     [ https://issues.apache.org/jira/browse/SPARK-7441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen updated SPARK-7441:
------------------------------
    Target Version/s: 1.5.0  (was: 1.4.1)

> Implement microbatch functionality so that Spark Streaming can process a large backlog of existing files discovered in batch in smaller batches
> -----------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-7441
>                 URL: https://issues.apache.org/jira/browse/SPARK-7441
>             Project: Spark
>          Issue Type: Improvement
>          Components: Streaming
>            Reporter: Emre Sevinç
>              Labels: performance
>
> Implement microbatch functionality so that Spark Streaming can process a huge backlog of existing files discovered in batch in smaller batches.
> Spark Streaming can process files that already exist in a directory, and depending on the value of {{spark.streaming.minRememberDuration}} (60 seconds by default; see SPARK-3276 for details), a Spark Streaming application may receive thousands, or even hundreds of thousands, of files within the first batch interval. This, in turn, 'floods' the streaming application, which then has to deal with a huge number of existing files in a single batch interval.
> We propose a very simple change to {{org.apache.spark.streaming.dstream.FileInputDStream}}, controlled by a configuration property such as {{spark.streaming.microbatch.size}}: when the property has its default value of {{0}}, the current behavior is kept (all files discovered as new in the current batch interval are processed at once); otherwise, new files are processed in groups of at most {{spark.streaming.microbatch.size}} (e.g. in groups of 100); see the sketch below.
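> The following is a minimal sketch of that selection rule, not the actual patch: the real change would live inside {{FileInputDStream}}'s file-discovery path, and the helper object and parameter names here are hypothetical.
> {code:scala}
> object MicrobatchSelector {
>   /**
>    * Splits the files discovered in one batch interval into the group to
>    * process now and the backlog deferred to later intervals.
>    * microbatchSize = 0 keeps the current behavior: process everything.
>    */
>   def select(newFiles: Seq[String], microbatchSize: Int): (Seq[String], Seq[String]) = {
>     if (microbatchSize <= 0) (newFiles, Seq.empty)
>     else newFiles.splitAt(microbatchSize)
>   }
> }
> {code}
> For example, with 50,000 backlogged files and {{spark.streaming.microbatch.size=100}}, {{select}} would return the first 100 files for the current batch and defer the remaining 49,900, which are then drained over the following batch intervals.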
> We have tested this patch at one of our customers, and it has been running successfully for weeks. For example, there were cases where our Spark Streaming application was stopped, tens of thousands of files were created in the directory in the meantime, and the application had to process those existing files once it was restarted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org