Posted to user@beam.apache.org by Kenneth Knowles <ke...@apache.org> on 2021/03/31 17:20:52 UTC

Re: Global window + stateful transformation

Great question!

Moving this to user@beam.apache.org

Kenn

On Wed, Mar 31, 2021 at 10:19 AM Hemali Sutaria <
hsutaria@paloaltonetworks.com> wrote:

> Beam Developers,
>
> I have a Dataflow job with a global window and per-key-and-window
> stateful processing. Do I need a GroupByKey in my transform? Thank you.
>
>
> https://cloud.google.com/blog/products/gcp/writing-dataflow-pipelines-with-scalability-in-mind
>
> https://beam.apache.org/documentation/programming-guide/#transforms
>
>
> https://beam.apache.org/blog/timely-processing/
>
>
> Thanks,
> Hemali Sutaria
>
>

Re: Global window + stateful transformation

Posted by Hemali Sutaria <hs...@paloaltonetworks.com>.
Thank you Kenneth.

Thanks,
Hemali Sutaria



On Wed, Mar 31, 2021 at 10:23 AM Kenneth Knowles <ke...@apache.org> wrote:

>
> On Wed, Mar 31, 2021 at 10:20 AM Kenneth Knowles <ke...@apache.org> wrote:
>
>>
>> On Wed, Mar 31, 2021 at 10:19 AM Hemali Sutaria <
>> hsutaria@paloaltonetworks.com> wrote:
>>
>>> I have a Dataflow job with a global window and per-key-and-window
>>> stateful processing. Do I need a GroupByKey in my transform? Thank you.
>>>
>>
> No, you do not need a GroupByKey. When you use a stateful DoFn, the Beam
> runner automatically partitions the data by key and window.
>
> Kenn
>
>
>>
>>>
>>> https://cloud.google.com/blog/products/gcp/writing-dataflow-pipelines-with-scalability-in-mind
>>>
>>> https://beam.apache.org/documentation/programming-guide/#transforms
>>>
>>>
>>> https://beam.apache.org/blog/timely-processing/
>>>
>>>
>>> Thanks,
>>> Hemali Sutaria
>>>
>>>

Re: Global window + stateful transformation

Posted by Amit Ziv-Kenet <am...@gmail.com>.
Hi Hemali,

AFAIK you are correct - all elements with the same key will be processed by
the same instance of the stateful DoFn (same machine, same thread). However,
that also holds for PCollections which have a window applied - all elements
with the same key+window combination will be processed by the same DoFn
instance. Keep in mind that this inherently limits the runner's ability to
parallelize the stateful DoFn, which might cause a processing bottleneck,
depending on the cardinality of the keys.

Regards,
Amit.
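
To make the cardinality point concrete, here is a toy model in plain Python.
It is only a sketch and does not reflect how any real Beam runner schedules
work: the function names, the round-robin key assignment, and the sample keys
are all assumptions made for illustration. It shows that when every key is
pinned to one worker, the number of busy workers can never exceed the number
of distinct keys.

```python
from collections import Counter


def assign_keys_to_workers(keys, num_workers):
    """Pin each distinct key to exactly one worker (toy round-robin model)."""
    distinct_keys = sorted(set(keys))
    return {key: index % num_workers for index, key in enumerate(distinct_keys)}


def worker_load(keys, num_workers):
    """Count how many elements each worker receives under that assignment."""
    assignment = assign_keys_to_workers(keys, num_workers)
    return Counter(assignment[key] for key in keys)


# Hypothetical input: eight elements but only two distinct keys.
keys = ["tenant-a"] * 6 + ["tenant-b"] * 2
load = worker_load(keys, num_workers=8)
print(f"busy workers: {len(load)} of 8, load per worker: {dict(load)}")
```

With two distinct keys, six of the eight workers in this model stay idle, and
the worker holding the hot key receives most of the elements.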



On Wed, Mar 31, 2021 at 8:33 PM Hemali Sutaria <
hsutaria@paloaltonetworks.com> wrote:

> My understanding is: stateful transformations are thread safe. In the case
> of a global window + stateful transformation, Beam makes sure that all
> values for a given key are processed on the same machine, in fact on the
> same thread. Only if you have a session/time window do you need to add a
> GroupByKey. Is that correct?
>
>
>
> Thanks,
> Hemali Sutaria
>
>
>
> On Wed, Mar 31, 2021 at 10:23 AM Kenneth Knowles <ke...@apache.org> wrote:
>
>>
>> On Wed, Mar 31, 2021 at 10:20 AM Kenneth Knowles <ke...@apache.org> wrote:
>>
>>>
>>> On Wed, Mar 31, 2021 at 10:19 AM Hemali Sutaria <
>>> hsutaria@paloaltonetworks.com> wrote:
>>>
>>>> I have a Dataflow job with a global window and per-key-and-window
>>>> stateful processing. Do I need a GroupByKey in my transform? Thank you.
>>>>
>>>
>> No, you do not need a GroupByKey. When you use a stateful DoFn, the Beam
>> runner automatically partitions the data by key and window.
>>
>> Kenn
>>
>>
>>>
>>>>
>>>> https://cloud.google.com/blog/products/gcp/writing-dataflow-pipelines-with-scalability-in-mind
>>>>
>>>> https://beam.apache.org/documentation/programming-guide/#transforms
>>>>
>>>>
>>>> https://beam.apache.org/blog/timely-processing/
>>>>
>>>>
>>>> Thanks,
>>>> Hemali Sutaria
>>>>
>>>>

Re: Global window + stateful transformation

Posted by Hemali Sutaria <hs...@paloaltonetworks.com>.
My understanding is: stateful transformations are thread safe. In the case
of a global window + stateful transformation, Beam makes sure that all
values for a given key are processed on the same machine, in fact on the
same thread. Only if you have a session/time window do you need to add a
GroupByKey. Is that correct?



Thanks,
Hemali Sutaria



On Wed, Mar 31, 2021 at 10:23 AM Kenneth Knowles <ke...@apache.org> wrote:

>
> On Wed, Mar 31, 2021 at 10:20 AM Kenneth Knowles <ke...@apache.org> wrote:
>
>>
>> On Wed, Mar 31, 2021 at 10:19 AM Hemali Sutaria <
>> hsutaria@paloaltonetworks.com> wrote:
>>
>>> I have a Dataflow job with a global window and per-key-and-window
>>> stateful processing. Do I need a GroupByKey in my transform? Thank you.
>>>
>>
> No, you do not need a GroupByKey. When you use a stateful DoFn, the Beam
> runner automatically partitions the data by key and window.
>
> Kenn
>
>
>>
>>>
>>> https://cloud.google.com/blog/products/gcp/writing-dataflow-pipelines-with-scalability-in-mind
>>>
>>> https://beam.apache.org/documentation/programming-guide/#transforms
>>>
>>>
>>> https://beam.apache.org/blog/timely-processing/
>>>
>>>
>>> Thanks,
>>> Hemali Sutaria
>>>
>>>

Re: Global window + stateful transformation

Posted by Kenneth Knowles <ke...@apache.org>.
On Wed, Mar 31, 2021 at 10:20 AM Kenneth Knowles <ke...@apache.org> wrote:

>
> On Wed, Mar 31, 2021 at 10:19 AM Hemali Sutaria <
> hsutaria@paloaltonetworks.com> wrote:
>
>> I have a Dataflow job with a global window and per-key-and-window
>> stateful processing. Do I need a GroupByKey in my transform? Thank you.
>>
>
No, you do not need a GroupByKey. When you use a stateful DoFn, the Beam
runner automatically partitions the data by key and window.

Kenn


>
>>
>> https://cloud.google.com/blog/products/gcp/writing-dataflow-pipelines-with-scalability-in-mind
>>
>> https://beam.apache.org/documentation/programming-guide/#transforms
>>
>>
>> https://beam.apache.org/blog/timely-processing/
>>
>>
>> Thanks,
>> Hemali Sutaria
>>
>>