Posted to user@spark.apache.org by Lin Zhao <li...@exabeam.com> on 2016/01/15 18:48:19 UTC

Spark Streaming: routing by key without groupByKey

I have a requirement to route a paired DStream through a series of map and flatMap operations such that entries with the same key go to the same thread within the same batch. The closest I can come up with is groupByKey().flatMap(_._2), but this cuts throughput by 50%.

When I think about it, groupByKey does more than I need. With groupByKey, the same thread sees all events with key Alice together, and only Alice. For my requirement, it's still OK if events for Bob or Charlie are interleaved in between. This seems to be a common routing requirement and shouldn't cause a 50% performance hit. Is there a way to construct the stream to get this routing that I'm not aware of?

I have read https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html but reduceByKey isn't the solution since we are not doing aggregation. Our stream is a chain of map and flatMap[withState] operations.
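
To make the weaker guarantee concrete (same key, same task within a batch, with other keys allowed in between), here is a minimal sketch of one possible direction: hash-partitioning each batch RDD by key instead of grouping it. The object name KeyRouting, the helper routeByKey, and the numPartitions parameter are hypothetical names for illustration, not part of Spark. Note that partitionBy still shuffles the data; it only skips building the per-key collections that groupByKey materializes, so whether it recovers the lost throughput is untested here.

```scala
import scala.reflect.ClassTag

import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.dstream.DStream

// Hypothetical helper for illustration; not part of Spark.
object KeyRouting {

  // Repartition every batch so that all records sharing a key land in the
  // same partition. Each partition of a batch is processed by a single task,
  // so records with the same key are seen by the same thread in that batch.
  // Unlike groupByKey, records for different keys stay interleaved.
  def routeByKey[K: ClassTag, V: ClassTag](
      stream: DStream[(K, V)],
      numPartitions: Int): DStream[(K, V)] = {
    val partitioner = new HashPartitioner(numPartitions)
    // transform applies an RDD-to-RDD function to each batch of the stream.
    stream.transform(rdd => rdd.partitionBy(partitioner))
  }
}
```

Assuming a paired stream named events, KeyRouting.routeByKey(events, 8) could be followed directly by the existing map and flatMap chain. Those operations are narrow and do not move records between partitions, so the co-location by key should hold until the next shuffle.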