Posted to user@spark.apache.org by Jason Nerothin <ja...@gmail.com> on 2019/03/26 00:04:44 UTC
streaming - absolute maximum
Hello,
I wish to calculate the most recent event time from a Stream.
Something like this:
val timestamped = records.withColumn("ts_long", unix_timestamp($"eventTime"))

val lastReport = timestamped
  .withWatermark("eventTime", "4 hours")
  .groupBy(col("eventTime"),
           window(col("eventTime"), "10 minutes", "5 minutes"))
  .max("ts_long")
  .writeStream
  .foreach(new LastReportUpdater(stationId))
  .start()
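To make the intended semantics concrete, here is a plain-Scala sketch (no Spark) of what a 10-minute sliding window with a 5-minute slide would compute over epoch-second timestamps: the maximum event time per window. The object and method names are hypothetical, and window starts are clamped at 0 for simplicity:

```scala
// Plain-Scala sketch of the aggregation semantics: for each 10-minute
// sliding window (5-minute slide), keep the maximum event timestamp.
// Timestamps are epoch seconds; windows are labeled by their start time.
object MaxEventTimeSketch {
  val windowLen = 600L // 10 minutes
  val slide     = 300L // 5 minutes

  // Every sliding window (identified by start time) containing ts.
  def windowsFor(ts: Long): Seq[Long] = {
    val lastStart = ts - (ts % slide)
    (lastStart to math.max(0L, lastStart - windowLen + slide) by -slide)
      .filter(start => ts < start + windowLen)
  }

  // Max event time per window: the batch equivalent of the query above.
  def maxPerWindow(events: Seq[Long]): Map[Long, Long] =
    events
      .flatMap(ts => windowsFor(ts).map(_ -> ts))
      .groupBy(_._1)
      .map { case (w, pairs) => w -> pairs.map(_._2).max }
}
```

Note that because each event falls into two overlapping windows here, one event can update the maximum for both.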
During normal execution, I expect to receive a few events per minute, at
most.
So now for the problem: During system initiation, I batch load a longer
history of data (stretching back months). Because the volume is higher
during initiation, records arrive with lots of time skew.
I'm saving the result off to a database and want to update it in real time
during streaming operation.
Do I write two flavors of the query - one as a static Dataset for initiation
and another for realtime? Is my logic incorrect?
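One way to avoid maintaining two divergent flavors is to factor the aggregation into a single shared function and drive it from both the backfill path (once, over the loaded history) and the live path (incrementally, per micro-batch). A Spark-free sketch of that structure, with hypothetical names and (key, epoch-second) events standing in for the real records:

```scala
object SharedAggregation {
  // The shared logic: latest event time per key (e.g. per station).
  def latestPerKey(events: Seq[(String, Long)]): Map[String, Long] =
    events.groupBy(_._1).map { case (k, es) => k -> es.map(_._2).max }

  // Batch "flavor": run once over the loaded history.
  def backfill(history: Seq[(String, Long)]): Map[String, Long] =
    latestPerKey(history)

  // Streaming "flavor": fold each arriving micro-batch into the current
  // state, keeping whichever event time is newer. Because max is
  // order-insensitive, late or skewed records cannot move the result
  // backwards.
  def update(state: Map[String, Long],
             batch: Seq[(String, Long)]): Map[String, Long] =
    latestPerKey(batch).foldLeft(state) { case (s, (k, ts)) =>
      s.updated(k, s.get(k).fold(ts)(math.max(_, ts)))
    }
}
```

With this shape, initiation is `backfill(history)` written to the database once, and the streaming query then applies `update` per batch against the same stored state.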
Thanks,
Jason