You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Imran Rashid (JIRA)" <ji...@apache.org> on 2015/01/29 00:17:34 UTC

[jira] [Created] (SPARK-5467) DStreams should provide windowing based on timestamps from the data (as opposed to wall clock time)

Imran Rashid created SPARK-5467:
-----------------------------------

             Summary: DStreams should provide windowing based on timestamps from the data (as opposed to wall clock time)
                 Key: SPARK-5467
                 URL: https://issues.apache.org/jira/browse/SPARK-5467
             Project: Spark
          Issue Type: New Feature
          Components: Streaming
            Reporter: Imran Rashid


DStreams currently only let you window based on wall clock time.  This doesn't work very well when you're loading historical logs that are already sitting around, because they'll all go into one window.  DStreams should provide a way for you to window based on a field of the incoming data.  This would be useful if you want to either (1) bootstrap a streaming app from some logs or (2) test out the behavior of your app on historical logs, eg. for correctness or performance.

I think there are some open questions here, such as whether the input data sources need to be sorted by time, how batches get triggered etc., but it seems like an important use case.

This just came up on the mailing list: http://apache-spark-user-list.1001560.n3.nabble.com/reduceByKeyAndWindow-but-using-log-timestamps-instead-of-clock-seconds-td21405.html

And I think it is also what was this Jira was getting at: https://issues.apache.org/jira/browse/SPARK-4427



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org