You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Tathagata Das (JIRA)" <ji...@apache.org> on 2015/06/01 23:17:20 UTC

[jira] [Resolved] (SPARK-1312) Batch should read based on the batch interval provided in the StreamingContext

     [ https://issues.apache.org/jira/browse/SPARK-1312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tathagata Das resolved SPARK-1312.
----------------------------------
    Resolution: Fixed

> Batch should read based on the batch interval provided in the StreamingContext
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-1312
>                 URL: https://issues.apache.org/jira/browse/SPARK-1312
>             Project: Spark
>          Issue Type: Bug
>          Components: Streaming
>    Affects Versions: 0.9.0
>            Reporter: Sanjay Awatramani
>            Assignee: Tathagata Das
>            Priority: Critical
>              Labels: sliding, streaming, window
>
> This problem primarily affects sliding window operations in spark streaming.
> Consider the following scenario:
> - a DStream is created from any source. (I've checked with file and socket)
> - No actions are applied to this DStream
> - Sliding Window operation is applied to this DStream and an action is applied to the sliding window.
> In this case, Spark will not even read the input stream in the batch in which the sliding interval isn't a multiple of batch interval. Put another way, it won't read the input when it doesn't have to apply the window function. This is happening because all transformations in Spark are lazy.
> How to fix this or workaround it (see line#3):
> JavaStreamingContext stcObj = new JavaStreamingContext(confObj, new Duration(1 * 60 * 1000));
> JavaDStream<String> inputStream = stcObj.textFileStream("/Input");
> inputStream.print(); // This is the workaround
> JavaDStream<String> objWindow = inputStream.window(new Duration(windowLen*60*1000), new Duration(slideInt*60*1000));
> objWindow.dstream().saveAsTextFiles("/Output", "");
> The "Window operations" example on the streaming guide implies that Spark will read the stream in every batch, which is not happening because of the lazy transformations.
> Wherever sliding window would be used, in most of the cases, no actions will be taken on the pre-window batch, hence my gut feeling was that Streaming would read every batch if any actions are being taken in the windowed stream.
> As per Tathagata,
> "Ideally every batch should read based on the batch interval provided in the StreamingContext."
> Refer the original thread on http://apache-spark-user-list.1001560.n3.nabble.com/Sliding-Window-operations-do-not-work-as-documented-tp2999.html for more details, including Tathagata's conclusion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org