You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@spark.apache.org by Sandip Khanzode <sa...@nutanix.com> on 2022/04/13 09:40:21 UTC

Streaming partition-by data locality for state lookupon executor

Hello,

If I have a Kinesis stream split into multiple shards (say 10), can I have, say 3 executors, subscribed to those shards? I assume that automatic re-balancing will be enabled as we add/remove executors for scale up/down or simply failures …

If so, can I specify a partition key? If I specify a partition key on the Kinesis producer, it will always send (Key=A) to say Shard 4 and (Key=B) to Shard 8 and this will be consistent I assume so long as the executors are up and no rebalancing occurs.

How can I map the payloads in the first Spark stage/task that receives the payload from Kinesis? What I would want to finally achieve is that the flatMapGroupWithState() that I would call later in the pipeline should have the same (partition) key internally for key lookups in the (RocksDB) state so that data locality can be achieved.

Is this redundant or implicit or not possible or am I missing something? Your response would be greatly helpful. If there is some documentation or examples around this, that would be good too!

Thanks,
Sandip Khanzode