You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@spark.apache.org by SRK <sw...@gmail.com> on 2017/05/30 21:36:55 UTC

Checkpointing fro reduceByKeyAndWindow with a window size of 1 hour and 24 hours

Hi,

What happens if I dont specify checkpointing on a DStream that has
reduceByKeyAndWindow  with no inverse function? Would it cause the memory to
be overflown? My window sizes are 1 hour and 24 hours.
I cannot provide an inserse function for this as it is based on HyperLogLog.

My code looks like something like the following:

  val logsByPubGeo = messages.map(_._2).filter(_.geo !=
Constants.UnknownGeo).map {
    log =>
      val key = PublisherGeoKey(log.publisher, log.geo)
      val agg = AggregationLog(
        timestamp = log.timestamp,
        sumBids = log.bid,
        imps = 1,
        uniquesHll = hyperLogLog(log.cookie.getBytes(Charsets.UTF_8))
      )
      (key, agg)
  }


 val aggLogs = logsByPubGeo.reduceByKeyAndWindow(reduceAggregationLogs,
BatchDuration)

   private def reduceAggregationLogs(aggLog1: AggregationLog, aggLog2:
AggregationLog) = {
    aggLog1.copy(
      timestamp = math.min(aggLog1.timestamp, aggLog2.timestamp),
      sumBids = aggLog1.sumBids + aggLog2.sumBids,
      imps = aggLog1.imps + aggLog2.imps,
      uniquesHll = aggLog1.uniquesHll + aggLog2.uniquesHll
    )
  }


Please let me know.

Thanks!



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Checkpointing-fro-reduceByKeyAndWindow-with-a-window-size-of-1-hour-and-24-hours-tp28722.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org