Posted to user@spark.apache.org by Dimitris Kouzis - Loukas <lo...@gmail.com> on 2015/08/05 19:46:58 UTC

Streaming and calculated-once semantics

Hello, here's a simple program that demonstrates my problem:


from operator import add

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 1)
ssc.checkpoint("/tmp/spark-checkpoint")  # updateStateByKey needs a checkpoint directory; any writable path works

input = [[("k1", 12), ("k2", 14)], [("k1", 22)]]

rawData = ssc.queueStream([sc.parallelize(d, 1) for d in input])

# keep a running sum per key across batches
runningRawData = rawData.updateStateByKey(lambda nv, prev: sum(nv, prev or 0))

def stats(rdd):
    # average over all values, then centre every value on that average
    keyavg = rdd.values().reduce(add) / rdd.count()
    return rdd.mapValues(lambda i: i - keyavg)

runningRawData.transform(stats).pprint()


I have a feeling this will calculate "keyavg = rdd.values().reduce(add) /
rdd.count()" inside stats quite a few times, depending on the number of
partitions of the current RDD.
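
Would persisting the batch RDD inside stats avoid that? Something like the
following (just a sketch, I have not tested it), so that the reduce, the
count and the later mapValues all reuse the same materialised partitions
instead of re-evaluating the lineage for each action:

def stats(rdd):
    rdd.cache()  # materialise the batch once so the actions below reuse it
    keyavg = rdd.values().reduce(add) / rdd.count()
    return rdd.mapValues(lambda i: i - keyavg)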

What would be an alternative way to do this two-step computation without
calculating the average many times?
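
For example, would computing the average in a single pass with RDD.mean()
(again an untested sketch) drop the separate job triggered by rdd.count()?

def stats(rdd):
    keyavg = rdd.values().mean()  # one job for the average instead of separate reduce() and count()
    return rdd.mapValues(lambda i: i - keyavg)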