Posted to user@spark.apache.org by N B <nb...@gmail.com> on 2015/08/28 20:38:47 UTC

Dynamic lookup table

Hi all,

I have the following use case that I wanted to get some insight on how to
go about doing in Spark Streaming.

Every batch is processed through the pipeline and, at the end, it has to
update some statistics. This updated info should be available to the next
batch of the DStream, e.g. for looking up the relevant stat, which in turn
refines the stats further. It has to keep doing this for every batch
processed. The first batch in the DStream can work with an empty stats
lookup without issue. Essentially, we are trying to build a feedback loop.

What is a good pattern to apply for something like this? Some approaches
that I considered are:

1. Use updateStateByKey(). But this produces a new DStream that I cannot
refer back to earlier in the pipeline, so it seems like a no-go, though I
would be happy to be proven wrong.

2. Use broadcast variables to maintain this state, in a Map for example, and
keep re-broadcasting it after every batch (rough sketch below, after this
list). I am not sure whether this has performance implications or whether it
is even a good idea.

3. IndexedRDD? It looked promising initially, but I quickly realized that it
might have the same issue as the updateStateByKey() approach, i.e. it is not
available in the pipeline before it is created.

4. Any other obvious ideas that I am missing?
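
To make option 2 concrete, here is a rough, untested sketch of the
re-broadcast idea (the socket input and the Stats type are placeholders; our
real stats object is richer):

import org.apache.spark.SparkConf
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FeedbackLoopSketch {
  type Stats = Map[String, Double]   // placeholder for the real stats object

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("feedback"), Seconds(10))
    val lines = ssc.socketTextStream("localhost", 9999)   // placeholder input DStream

    // Driver-side handle to the currently broadcast stats; starts empty.
    var statsBc: Broadcast[Stats] = ssc.sparkContext.broadcast(Map.empty[String, Double])

    lines.foreachRDD { rdd =>
      val lookup = statsBc   // stable reference for the executor-side closure
      // Use the previous batch's stats while processing this batch.
      val keyed = rdd.map(line => (line, lookup.value.getOrElse(line, 0.0)))

      // Derive the refined stats on the driver (assumed small enough to collect).
      val newStats = keyed.countByKey().map { case (k, c) => (k, c.toDouble) }.toMap

      // Re-broadcast so the next batch sees the update.
      statsBc.unpersist()
      statsBc = ssc.sparkContext.broadcast(newStats)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}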

Thanks
Nikunj

Re: Dynamic lookup table

Posted by N B <nb...@gmail.com>.
Hi Jason,

Thanks for the response. I believe I can look into a Redis-based solution
for storing this state externally. However, would it be possible to refresh
it from the store with every batch, i.e. what code could be written inside
the pipeline to fetch this info from the external store? It also seems like
a waste, since all that info should already be available within Spark. The
number of keys and the amount of data will be quite limited in this case; it
just needs to be updated periodically.
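
Just to make the question concrete, is something along these lines what you
had in mind? (Rough, untested sketch assuming the Jedis client, a Redis hash
named "stats", and inputStream standing in for our actual DStream.)

import scala.collection.JavaConverters._
import redis.clients.jedis.Jedis

// transform() runs its body on the driver once per batch, so the lookup is
// refreshed from Redis before each batch is processed.
val withStats = inputStream.transform { rdd =>
  val jedis = new Jedis("localhost", 6379)
  val stats =
    try jedis.hgetAll("stats").asScala.toMap    // small map: stat name -> value
    finally jedis.close()
  val statsBc = rdd.sparkContext.broadcast(stats)
  rdd.map(record => (record, statsBc.value))    // stats visible alongside each record
}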

The Accumulator-based solution only works for simple counting, and we have a
larger stats object that includes things like averages, variance, etc. The
Accumulable interface could potentially be implemented for this, but the
other restriction seems to be that the value of such an accumulator can only
be accessed on the driver/master. We want this info to be available during
processing of the next batch, as a lookup for example. Do you know of a way
to make that possible with Accumulable?
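
For reference, this is roughly what I imagine the Accumulable would look
like for our stats object (Spark 1.x API, untested sketch, with ssc being
the StreamingContext); the open question remains how to feed statsAcc.value
back into the next batch's processing:

import org.apache.spark.AccumulableParam

case class Stats(count: Long, sum: Double, sumSq: Double) {
  def mean: Double = if (count == 0) 0.0 else sum / count
  def variance: Double = if (count == 0) 0.0 else sumSq / count - mean * mean
}

object StatsParam extends AccumulableParam[Stats, Double] {
  def addAccumulator(s: Stats, x: Double): Stats =
    Stats(s.count + 1, s.sum + x, s.sumSq + x * x)
  def addInPlace(a: Stats, b: Stats): Stats =
    Stats(a.count + b.count, a.sum + b.sum, a.sumSq + b.sumSq)
  def zero(initial: Stats): Stats = Stats(0L, 0.0, 0.0)
}

val statsAcc = ssc.sparkContext.accumulable(Stats(0L, 0.0, 0.0))(StatsParam)
// executors: statsAcc += value      driver only: statsAcc.value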

Thanks
Nikunj



Re: Dynamic lookup table

Posted by Jason <Ja...@jasonknight.us>.
Hi Nikunj,

Depending on what kind of stats you want to accumulate, you may want to
look into the Accumulator/Accumulable API, or, if you need more control, you
can store these things in an external key-value store (HBase, Redis, etc.)
and do careful updates there. Make sure your updates are atomic
(transactions or CAS semantics), or you could run into race-condition
problems.
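
For example, with Redis the check-and-set flavor looks roughly like this
(untested sketch using the Jedis client; adapt the value encoding to your
actual stats object):

import redis.clients.jedis.Jedis

// Optimistic check-and-set: WATCH the key, read, compute, then MULTI/EXEC.
// If another writer touched the key in between, exec() returns null and we retry.
def updateStat(jedis: Jedis, key: String, delta: Double): Unit = {
  var done = false
  while (!done) {
    jedis.watch(key)
    val current = Option(jedis.get(key)).map(_.toDouble).getOrElse(0.0)
    val tx = jedis.multi()
    tx.set(key, (current + delta).toString)
    done = tx.exec() != null   // null means the watched key changed underneath us
  }
}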

Jason
