You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@beam.apache.org by Praveen K Viswanathan <ha...@gmail.com> on 2020/06/26 14:28:45 UTC

Caching data inside DoFn

Hi All - I have a DoFn which generates data (KV pair) for each element that
it is processing. It also has to read from that KV for other elements based
on a key which means, the KV has to retain all the data that's getting
added to it while processing every element. I was thinking about the
"slow-caching side input pattern" but it is more of caching outside the
DoFn and then using it inside. It doesn't update the cache inside a DoFn.
Please share if anyone has thoughts on how to approach this case.

Element 1 > Add a record to a KV > ..... Element 5 > Used the value from KV
if there is a match in the key

-- 
Thanks,
Praveen K Viswanathan

Re: Caching data inside DoFn

Posted by Praveen K Viswanathan <ha...@gmail.com>.
Thank you Luke. I will work on implementing my use case with Stateful ParDo
itself and come back if I have any questions.

Appreciate your help.

On Fri, Jun 26, 2020 at 8:14 AM Luke Cwik <lc...@google.com> wrote:

> Use a stateful DoFn and buffer the elements in a bag state. You'll want to
> use a key that contains enough data to match your join condition you are
> trying to match. For example, if your trying to match on a customerId then
> you would do something like:
> element 1 -> ParDo(extract customer id) -> KV<customer id, element 1> ->
> stateful ParDo(buffer element 1 in bag state)
> ...
> element 5 -> ParDo(extract customer id) -> KV<customer id, element 5> ->
> stateful ParDo(output all element in bag)
>
> If you are matching on cudomerId and eventId then you would use a
> composite key (customerId, eventId).
>
> You can always use a single global key but you will lose all parallelism
> during processing (for small pipelines this likely won't matter).
>
> On Fri, Jun 26, 2020 at 7:29 AM Praveen K Viswanathan <
> harish.praveen@gmail.com> wrote:
>
>> Hi All - I have a DoFn which generates data (KV pair) for each element
>> that it is processing. It also has to read from that KV for other elements
>> based on a key which means, the KV has to retain all the data that's
>> getting added to it while processing every element. I was thinking
>> about the "slow-caching side input pattern" but it is more of caching
>> outside the DoFn and then using it inside. It doesn't update the cache
>> inside a DoFn. Please share if anyone has thoughts on how to approach this
>> case.
>>
>> Element 1 > Add a record to a KV > ..... Element 5 > Used the value from
>> KV if there is a match in the key
>>
>> --
>> Thanks,
>> Praveen K Viswanathan
>>
>

-- 
Thanks,
Praveen K Viswanathan

Re: Caching data inside DoFn

Posted by Luke Cwik <lc...@google.com>.
Use a stateful DoFn and buffer the elements in a bag state. You'll want to
use a key that contains enough data to match your join condition you are
trying to match. For example, if your trying to match on a customerId then
you would do something like:
element 1 -> ParDo(extract customer id) -> KV<customer id, element 1> ->
stateful ParDo(buffer element 1 in bag state)
...
element 5 -> ParDo(extract customer id) -> KV<customer id, element 5> ->
stateful ParDo(output all element in bag)

If you are matching on cudomerId and eventId then you would use a composite
key (customerId, eventId).

You can always use a single global key but you will lose all parallelism
during processing (for small pipelines this likely won't matter).

On Fri, Jun 26, 2020 at 7:29 AM Praveen K Viswanathan <
harish.praveen@gmail.com> wrote:

> Hi All - I have a DoFn which generates data (KV pair) for each element
> that it is processing. It also has to read from that KV for other elements
> based on a key which means, the KV has to retain all the data that's
> getting added to it while processing every element. I was thinking
> about the "slow-caching side input pattern" but it is more of caching
> outside the DoFn and then using it inside. It doesn't update the cache
> inside a DoFn. Please share if anyone has thoughts on how to approach this
> case.
>
> Element 1 > Add a record to a KV > ..... Element 5 > Used the value from
> KV if there is a match in the key
>
> --
> Thanks,
> Praveen K Viswanathan
>