Posted to user@spark.apache.org by Manish Malhotra <ma...@gmail.com> on 2016/11/10 16:42:49 UTC

Spark Streaming: question on sticky session across batches ?

Hello Spark Devs/Users,

I'm trying to solve a use case with Spark Streaming 1.6.2 where, for every
batch (say 2 mins), data needs to go to the same reducer node after
grouping by key.
The underlying storage is Cassandra, not HDFS.

This is a map-reduce job, where we are also trying to use the partitions of
the Cassandra table to batch the data for the same partition.

The requirement for a sticky session/partition across batches is that the
operation we need to do reads the data for every key and then merges it
with the current batch's aggregate values. Currently, with no stickiness
across batches, we have to read for every key, merge, and then write back,
and the reads are very expensive. With a sticky session, we could avoid the
read in every batch and instead keep a cache of the aggregates up to the
last batch.
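
To make that concrete, here is a rough sketch of the per-batch pattern today
(simplified; readAggregate/writeAggregate are made-up stand-ins for our
per-key Cassandra read/write, which are the expensive part):

import org.apache.spark.streaming.dstream.DStream

object CurrentPattern extends Serializable {
  // Hypothetical stand-ins for the per-key Cassandra read/write; these are
  // the expensive calls we would like to avoid repeating every batch.
  def readAggregate(key: String): Long = 0L
  def writeAggregate(key: String, value: Long): Unit = ()

  def process(stream: DStream[(String, Long)]): Unit =
    stream.foreachRDD { batchRdd =>
      batchRdd
        .reduceByKey(_ + _)                           // aggregate within the batch
        .foreachPartition { iter =>
          iter.foreach { case (key, batchAgg) =>
            val previous = readAggregate(key)         // read per key, every batch
            writeAggregate(key, previous + batchAgg)  // merge and write back
          }
        }
    }
}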

So, there are a few options I can think of:

1. Change the TaskSchedulerImpl, since it uses Random to pick the node for
a mapper/reducer before starting the batch/phase.
Not sure if there is a custom-scheduler way of achieving this?

2. A custom RDD might help map key --> node; there is a
getPreferredLocations() method.
But I'm not sure whether this mapping will be stable or can vary in some
edge cases (rough sketch of what I mean below).
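
To make option 2 concrete, here is a rough, untested sketch of the kind of
RDD wrapper I have in mind. The host list is made up, and preferred
locations are only hints to the scheduler, not guarantees:

import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

// Pins partition i of the parent RDD to hosts(i % hosts.size). The scheduler
// may still place a task elsewhere (e.g. if the host is busy or lost), so
// this is best-effort only, which is exactly the edge case I am unsure about.
class NodePinnedRDD[T: scala.reflect.ClassTag](
    prev: RDD[T],
    hosts: Seq[String])
  extends RDD[T](prev) {

  override def getPartitions: Array[Partition] = prev.partitions

  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    prev.iterator(split, context)

  override def getPreferredLocations(split: Partition): Seq[String] =
    Seq(hosts(split.index % hosts.size))
}

Usage would be something like new NodePinnedRDD(keyedRdd, Seq("node-1",
"node-2")), but I don't know how stable those placements are in practice,
hence the question.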

Thanks in advance for your help and time!

Regards,
Manish

Re: Spark Streaming: question on sticky session across batches ?

Posted by Manish Malhotra <ma...@gmail.com>.
Thanks!
On Tue, Nov 15, 2016 at 1:07 AM Takeshi Yamamuro <li...@gmail.com>
wrote:

> - dev
>
> Hi,
>
> AFAIK, if you use RDDs only, you can control the partition mapping to some
> extent by keying the data as RDD[(key, data)]; a defined partitioner then
> decides which partition each key's data goes to.
> For a good example of controlling partitions, see the GraphX code:
>
> https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L291
>
> GraphX holds a `PartitionId` in its edge RDDs to control which partition the
> edge data end up in.
>
> // maropu
>
>
> On Tue, Nov 15, 2016 at 5:19 AM, Manish Malhotra <
> manish.malhotra.work@gmail.com> wrote:
>
> sending again.
> any help is appreciated!
>
> thanks in advance.
>
> On Thu, Nov 10, 2016 at 8:42 AM, Manish Malhotra <
> manish.malhotra.work@gmail.com> wrote:
>
> Hello Spark Devs/Users,
>
> I'm trying to solve a use case with Spark Streaming 1.6.2 where, for every
> batch (say 2 mins), data needs to go to the same reducer node after
> grouping by key.
> The underlying storage is Cassandra, not HDFS.
>
> This is a map-reduce job, where we are also trying to use the partitions of
> the Cassandra table to batch the data for the same partition.
>
> The requirement for a sticky session/partition across batches is that the
> operation we need to do reads the data for every key and then merges it
> with the current batch's aggregate values. Currently, with no stickiness
> across batches, we have to read for every key, merge, and then write back,
> and the reads are very expensive. With a sticky session, we could avoid the
> read in every batch and instead keep a cache of the aggregates up to the
> last batch.
>
> So, there are a few options I can think of:
>
> 1. Change the TaskSchedulerImpl, since it uses Random to pick the node for
> a mapper/reducer before starting the batch/phase.
> Not sure if there is a custom-scheduler way of achieving this?
>
> 2. A custom RDD might help map key --> node; there is a
> getPreferredLocations() method.
> But I'm not sure whether this mapping will be stable or can vary in some
> edge cases.
>
> Thanks in advance for your help and time!
>
> Regards,
> Manish
>
>
>
>
>
> --
> ---
> Takeshi Yamamuro
>

Re: Spark Streaming: question on sticky session across batches ?

Posted by Takeshi Yamamuro <li...@gmail.com>.
- dev

Hi,

AFAIK, if you use RDDs only, you can control the partition mapping to some
extent by keying the data as RDD[(key, data)]; a defined partitioner then
decides which partition each key's data goes to.
For a good example of controlling partitions, see the GraphX code:
https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L291

GraphX holds a `PartitionId` in its edge RDDs to control which partition the
edge data end up in.
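
A minimal sketch of the partitioner approach (simplified; the number of
partitions is arbitrary, and note this only fixes key -> partition, not
partition -> physical node, which the scheduler still decides):

import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// As long as every batch uses a partitioner with the same numPartitions and
// the same getPartition rule, a given key always lands in the same partition.
class KeyPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = {
    val h = key.hashCode % numPartitions
    if (h < 0) h + numPartitions else h   // keep the result non-negative
  }
}

def keyToFixedPartition(batch: RDD[(String, Long)]): RDD[(String, Long)] =
  batch.partitionBy(new KeyPartitioner(16))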

// maropu


On Tue, Nov 15, 2016 at 5:19 AM, Manish Malhotra <
manish.malhotra.work@gmail.com> wrote:

> sending again.
> any help is appreciated!
>
> thanks in advance.
>
> On Thu, Nov 10, 2016 at 8:42 AM, Manish Malhotra <
> manish.malhotra.work@gmail.com> wrote:
>
>> Hello Spark Devs/Users,
>>
>> I'm trying to solve a use case with Spark Streaming 1.6.2 where, for every
>> batch (say 2 mins), data needs to go to the same reducer node after
>> grouping by key.
>> The underlying storage is Cassandra, not HDFS.
>>
>> This is a map-reduce job, where we are also trying to use the partitions of
>> the Cassandra table to batch the data for the same partition.
>>
>> The requirement for a sticky session/partition across batches is that the
>> operation we need to do reads the data for every key and then merges it
>> with the current batch's aggregate values. Currently, with no stickiness
>> across batches, we have to read for every key, merge, and then write back,
>> and the reads are very expensive. With a sticky session, we could avoid the
>> read in every batch and instead keep a cache of the aggregates up to the
>> last batch.
>>
>> So, there are a few options I can think of:
>>
>> 1. Change the TaskSchedulerImpl, since it uses Random to pick the node for
>> a mapper/reducer before starting the batch/phase.
>> Not sure if there is a custom-scheduler way of achieving this?
>>
>> 2. A custom RDD might help map key --> node; there is a
>> getPreferredLocations() method.
>> But I'm not sure whether this mapping will be stable or can vary in some
>> edge cases.
>>
>> Thanks in advance for your help and time!
>>
>> Regards,
>> Manish
>>
>
>


-- 
---
Takeshi Yamamuro

Re: Spark Streaming: question on sticky session across batches ?

Posted by Manish Malhotra <ma...@gmail.com>.
sending again.
any help is appreciated!

thanks in advance.

On Thu, Nov 10, 2016 at 8:42 AM, Manish Malhotra <
manish.malhotra.work@gmail.com> wrote:

> Hello Spark Devs/Users,
>
> I'm trying to solve a use case with Spark Streaming 1.6.2 where, for every
> batch (say 2 mins), data needs to go to the same reducer node after
> grouping by key.
> The underlying storage is Cassandra, not HDFS.
>
> This is a map-reduce job, where we are also trying to use the partitions of
> the Cassandra table to batch the data for the same partition.
>
> The requirement for a sticky session/partition across batches is that the
> operation we need to do reads the data for every key and then merges it
> with the current batch's aggregate values. Currently, with no stickiness
> across batches, we have to read for every key, merge, and then write back,
> and the reads are very expensive. With a sticky session, we could avoid the
> read in every batch and instead keep a cache of the aggregates up to the
> last batch.
>
> So, there are a few options I can think of:
>
> 1. Change the TaskSchedulerImpl, since it uses Random to pick the node for
> a mapper/reducer before starting the batch/phase.
> Not sure if there is a custom-scheduler way of achieving this?
>
> 2. A custom RDD might help map key --> node; there is a
> getPreferredLocations() method.
> But I'm not sure whether this mapping will be stable or can vary in some
> edge cases.
>
> Thanks in advance for your help and time!
>
> Regards,
> Manish
>
