Posted to user@spark.apache.org by karthikjay <as...@gmail.com> on 2018/05/23 06:46:04 UTC
[Beginner][StructuredStreaming] Using Spark aggregation - withWatermark on old data
I am doing the following aggregation on the data:

import org.apache.spark.sql.functions.{col, count, window}

val channelChangesAgg = tunerDataJsonDF
  .withWatermark("ts2", "10 seconds")
  .groupBy(
    window(col("ts2"), "10 seconds"),
    col("env"),
    col("servicegroupid"))
  .agg(count("linetransactionid") as "count1")
The only constraint here is that the data is backdated; even though the data
is chronologically ordered, ts2 will be an old date. Given this condition,
will the watermarking and aggregation still work?
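A self-contained way to check this is to replay backdated rows through
MemoryStream, Spark's in-memory streaming source (an internal testing API in
org.apache.spark.sql.execution.streaming). This is only a sketch: apart from
the column names above, everything here (session setup, sample values, sink)
is an assumption.

import java.sql.Timestamp
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.streaming.MemoryStream
import org.apache.spark.sql.functions.{col, count, window}

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("watermark-backdated-sketch")
  .getOrCreate()
import spark.implicits._
implicit val sqlCtx: SQLContext = spark.sqlContext

// In-memory stand-in for tunerDataJsonDF with the columns used above.
val source = MemoryStream[(Timestamp, String, String, String)]
val tunerDataJsonDF = source.toDS()
  .toDF("ts2", "env", "servicegroupid", "linetransactionid")

val channelChangesAgg = tunerDataJsonDF
  .withWatermark("ts2", "10 seconds")
  .groupBy(window(col("ts2"), "10 seconds"), col("env"), col("servicegroupid"))
  .agg(count("linetransactionid") as "count1")

val query = channelChangesAgg.writeStream
  .outputMode("update")
  .format("console")
  .start()

// The watermark trails the maximum ts2 seen in the data, not the system
// clock, so backdated event times are aggregated like any others.
source.addData(
  (Timestamp.valueOf("2018-05-01 00:02:50"), "dev", "123", "tx-1"),
  (Timestamp.valueOf("2018-05-01 00:02:55"), "dev", "123", "tx-2"))
query.processAllAvailable()
query.stop()

In update mode each batch prints its running counts right away; a window is
only finalized and dropped from state once a later ts2 pushes the watermark
past the window's end.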
Re: [Beginner][StructuredStreaming] Using Spark aggregation - withWatermark on old data
Posted by karthikjay <as...@gmail.com>.
My data looks like this:
{
  "ts2": "2018/05/01 00:02:50.041",
  "serviceGroupId": "123",
  "userId": "avv-0",
  "stream": "",
  "lastUserActivity": "00:02:50",
  "lastUserActivityCount": "0"
}
{
  "ts2": "2018/05/01 00:09:02.079",
  "serviceGroupId": "123",
  "userId": "avv-0",
  "stream": "",
  "lastUserActivity": "00:09:02",
  "lastUserActivityCount": "0"
}
{
  "ts2": "2018/05/01 00:09:02.086",
  "serviceGroupId": "123",
  "userId": "avv-2",
  "stream": "",
  "lastUserActivity": "00:09:02",
  "lastUserActivityCount": "0"
}
...
And my aggregation is:

import org.apache.spark.sql.functions.{col, count, window}

val sdvTuneInsAgg1 = df
  .withWatermark("ts2", "10 seconds")
  .groupBy(window(col("ts2"), "10 seconds"))
  .agg(count("*") as "count")
  .as[CountMetric1]
The only anomaly is that the current date is 2018/05/24, but the ts2 in the
records carries old dates. Will the aggregation / count work in this scenario?
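One thing worth double-checking first (an assumption drawn from the sample
records, not something confirmed in this thread): ts2 arrives as a JSON
string, and withWatermark needs an actual TimestampType column, so a parse
step along these lines may be required before the watermark. A sketch with a
hypothetical input path; the schema just mirrors the sample records:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, to_timestamp, window}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("ts2-parse-sketch")
  .getOrCreate()

// All fields in the sample records are JSON strings, including ts2.
val schema = StructType(Seq(
  StructField("ts2", StringType),
  StructField("serviceGroupId", StringType),
  StructField("userId", StringType),
  StructField("stream", StringType),
  StructField("lastUserActivity", StringType),
  StructField("lastUserActivityCount", StringType)))

// Hypothetical input directory; any streaming JSON source works the same way.
val df = spark.readStream.schema(schema).json("/tmp/tuner-data")

val sdvTuneInsAgg1 = df
  // Pattern matches the sample records, e.g. "2018/05/01 00:02:50.041".
  .withColumn("ts2", to_timestamp(col("ts2"), "yyyy/MM/dd HH:mm:ss.SSS"))
  .withWatermark("ts2", "10 seconds")
  .groupBy(window(col("ts2"), "10 seconds"))
  .agg(count("*") as "count")
  // .as[CountMetric1] can be re-applied here once the result columns
  // match that case class.

If ts2 parses cleanly, the watermark question is the same as in the first
message: it trails the maximum parsed event time rather than today's date, so
old dates are not a problem by themselves.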