Posted to user@spark.apache.org by Soumitra Johri <so...@gmail.com> on 2016/01/06 02:21:23 UTC

UpdateStateByKey : Partitioning and Shuffle

Hi,

I am relatively new to Spark and am using the updateStateByKey()
operation to maintain state in my Spark Streaming application. The input
data comes in through a Kafka topic.

   1. How are DStreams partitioned?
   2. How does partitioning work with the mapWithState() or
   updateStateByKey() method?
   3. In updateStateByKey(), are the old state and the new values for a
   given key processed on the same node?
   4. How frequent is the shuffle for the updateStateByKey() method?

The state I have to maintain contains ~100,000 keys, and I want to
avoid a shuffle every time I update the state. Any tips on how to do
that?

Warm Regards
Soumitra

Re: UpdateStateByKey : Partitioning and Shuffle

Posted by Tathagata Das <td...@databricks.com>.
Both mapWithState and updateStateByKey use the HashPartitioner by
default, hashing the key of the key-value DStream on which the state
operation is applied. The new data and the state are partitioned with
the exact same partitioner, so keys from the new data (from the input
DStream) get shuffled and colocated with the already partitioned state
RDDs. The new data is thus brought to the corresponding old state on the
same machine, and then the state mapping/updating function is applied.
The state is not shuffled every time; only the new data is shuffled in
every batch.
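
As a rough sketch of how that looks in code (a minimal example assuming
a socket source and a word-count state; the poster's actual Kafka input
and state type are not shown here), you can pass an explicit
HashPartitioner to updateStateByKey to make the partitioning contract
visible:

import org.apache.spark.{HashPartitioner, SparkConf}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StatefulWordCount")
val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("/tmp/spark-checkpoint") // stateful ops need checkpointing

// Assumed input; any DStream of (key, value) pairs partitions the same way.
val words = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" "))
  .map(word => (word, 1))

// Called per key per batch: all new values for the key plus its
// previous state arrive together, already colocated on one node.
val updateFunc = (newValues: Seq[Int], state: Option[Int]) =>
  Some(newValues.sum + state.getOrElse(0))

// HashPartitioner is the default; passing it explicitly shows that only
// the new batch is shuffled into this partitioning, not the state RDD.
val counts = words.updateStateByKey[Int](
  updateFunc,
  new HashPartitioner(ssc.sparkContext.defaultParallelism))

counts.print()
ssc.start()
ssc.awaitTermination()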




On Tue, Jan 5, 2016 at 5:21 PM, Soumitra Johri <soumitra.siddharth@gmail.com> wrote:

> Hi,
>
> I am relatively new to Spark and am using the updateStateByKey()
> operation to maintain state in my Spark Streaming application. The
> input data comes in through a Kafka topic.
>
>    1. How are DStreams partitioned?
>    2. How does partitioning work with the mapWithState() or
>    updateStateByKey() method?
>    3. In updateStateByKey(), are the old state and the new values for
>    a given key processed on the same node?
>    4. How frequent is the shuffle for the updateStateByKey() method?
>
> The state I have to maintain contains ~100,000 keys, and I want to
> avoid a shuffle every time I update the state. Any tips on how to do
> that?
>
> Warm Regards
> Soumitra
>
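
For mapWithState (new in Spark 1.6), the partitioner is set through
StateSpec. A similar minimal sketch, reusing the assumed words DStream
from the sketch above and an arbitrary made-up partition count, not the
poster's actual job:

import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.{State, StateSpec}

// Mapping function: gets the key, the new value (if any) for this
// batch, and a handle to the key's state, colocated in one partition.
val mappingFunc = (word: String, one: Option[Int], state: State[Int]) => {
  val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(sum)
  (word, sum)
}

// StateSpec.partitioner plays the same role as the partitioner argument
// to updateStateByKey: state stays put, new data is shuffled to match.
val spec = StateSpec
  .function(mappingFunc)
  .partitioner(new HashPartitioner(8)) // 8 partitions is arbitrary

val stateDStream = words.mapWithState(spec)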
