You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by SRK <sw...@gmail.com> on 2017/05/30 21:36:55 UTC
Checkpointing fro reduceByKeyAndWindow with a window size of 1 hour
and 24 hours
Hi,
What happens if I dont specify checkpointing on a DStream that has
reduceByKeyAndWindow with no inverse function? Would it cause the memory to
be overflown? My window sizes are 1 hour and 24 hours.
I cannot provide an inserse function for this as it is based on HyperLogLog.
My code looks like something like the following:
val logsByPubGeo = messages.map(_._2).filter(_.geo !=
Constants.UnknownGeo).map {
log =>
val key = PublisherGeoKey(log.publisher, log.geo)
val agg = AggregationLog(
timestamp = log.timestamp,
sumBids = log.bid,
imps = 1,
uniquesHll = hyperLogLog(log.cookie.getBytes(Charsets.UTF_8))
)
(key, agg)
}
val aggLogs = logsByPubGeo.reduceByKeyAndWindow(reduceAggregationLogs,
BatchDuration)
private def reduceAggregationLogs(aggLog1: AggregationLog, aggLog2:
AggregationLog) = {
aggLog1.copy(
timestamp = math.min(aggLog1.timestamp, aggLog2.timestamp),
sumBids = aggLog1.sumBids + aggLog2.sumBids,
imps = aggLog1.imps + aggLog2.imps,
uniquesHll = aggLog1.uniquesHll + aggLog2.uniquesHll
)
}
Please let me know.
Thanks!
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Checkpointing-fro-reduceByKeyAndWindow-with-a-window-size-of-1-hour-and-24-hours-tp28722.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org