You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by Laeeq Ahmed <la...@yahoo.com> on 2014/07/09 15:56:32 UTC

Re: controlling the time in spark-streaming

Hi,

For QueueRDD, have a look here. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/QueueStream.scala

  

Regards,
Laeeq



On Friday, May 23, 2014 10:33 AM, Mayur Rustagi <ma...@gmail.com> wrote:
 


Well its hard to use text data as time of input. 
But if you are adament here's what you would do. 
Have a Dstream object which works in on a folder using filestream/textstream
Then have another process (spark streaming or cron) read through the files you receive & push them into the folder in order of time. Mostly your data would be produced at t, you would get it at t + say 5 sec, & you can push it in & get processed at t + say 10 sec. Then you can timeshift your calculations. If you are okay with broad enough time frame you should be fine.


Another way is to use queue processing.
Queue<JavaRDD<Integer>> rddQueue = new LinkedList<JavaRDD<Integer>>();
Create a Dstream to consume from Queue of RDD, keep looking into the folder of files & create rdd's from them at a min level & push them into thee queue. This would cause you to go through your data atleast twice & just provide order guarantees , processing time is still going to vary with how quickly you can process your RDD.

Regards
Mayur


Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi



On Thu, May 22, 2014 at 9:08 PM, Ian Holsman <ia...@holsman.com.au> wrote:

Hi.
>
>
>I'm writing a pilot project, and plan on using spark's streaming app for it.
>
>
>To start with I have a dump of some access logs with their own timestamps, and am using the textFileStream and some old files to test it with.
>
>
>One of the issues I've come across is simulating the windows. I would like use the timestamp from the access logs as the 'system time' instead of the real clock time.
>
>
>I googled a bit and found the 'manual' clock which appears to be used for testing the job scheduler.. but I'm not sure what my next steps should be.
>
>
>I'm guessing I'll need to do something like
>
>
>1. use the textFileStream to create a 'DStream'
>2. have some kind of DStream that runs on top of that that creates the RDDs based on the timestamps Instead of the system time
>3. the rest of my mappers.
>
>
>Is this correct? or do I need to create my own 'textFileStream' to initially create the RDDs and modify the system clock inside of it.
>
>
>I'm not too concerned about out-of-order messages, going backwards in time, or being 100% in sync across workers.. as this is more for 'development'/prototyping.
>
>
>Are there better ways of achieving this? I would assume that controlling the windows RDD buckets would be a common use case.
>
>
>TIA
>Ian
>
>-- 
>
>Ian Holsman
>ian@holsman.com.au 
>PH: + 61-3-9028 8133 / +1-(425) 998-7083