You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@beam.apache.org by Alex Amato <aj...@google.com> on 2020/08/26 15:55:26 UTC

How does groupIntoBatches behave when there are too few elements for a key?

How does groupIntoBatches behave when there are too few elements for a key
(less than the provided batch size)?

Based on how its described
<https://beam.apache.org/releases/javadoc/2.0.0/org/apache/beam/sdk/transforms/GroupIntoBatches.html>.
Its not clear to me that the elements will ever emit. Can this cause
stuckness in this case?

Re: How does groupIntoBatches behave when there are too few elements for a key?

Posted by Siyuan Chen <sy...@google.com>.
Thanks! I created a PR for Java here:
https://github.com/apache/beam/pull/12726/commits. Will also work on the
python version afterwards.
--
Best regards,
Siyuan


On Wed, Aug 26, 2020 at 9:41 AM Reuven Lax <re...@google.com> wrote:

> Yes.- you can just add a new timer spec to the DoFn.
>
> On Wed, Aug 26, 2020 at 9:26 AM Siyuan Chen <sy...@google.com> wrote:
>
>> I have been preparing a PR to add the timeout option. I had a dumb
>> question - seems to me that the timeout should be set in processing time
>> while the existing timer fired at the window expiration is in event time.
>> Is there a way to have timers in different time domains?
>> --
>> Best regards,
>> Siyuan
>>
>>
>> On Wed, Aug 26, 2020 at 9:15 AM Reuven Lax <re...@google.com> wrote:
>>
>>> Seems reasonable to add an optional timeout to GroupIntoBatches to flush
>>> records.
>>>
>>> On Wed, Aug 26, 2020 at 9:04 AM Robert Bradshaw <ro...@google.com>
>>> wrote:
>>>
>>>> GroupIntoBatches sets a timer to flush the batches at the end of the
>>>> window [1] no matter how many elements there are. This could cause a
>>>> problem for the GlobalWindow if no more data ever comes in.
>>>>
>>>> [1]
>>>> https://github.com/apache/beam/blob/release-2.23.0/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/GroupIntoBatches.java#L116
>>>>
>>>> On Wed, Aug 26, 2020 at 8:55 AM Alex Amato <aj...@google.com> wrote:
>>>> >
>>>> > How does groupIntoBatches behave when there are too few elements for
>>>> a key (less than the provided batch size)?
>>>> >
>>>> > Based on how its described. Its not clear to me that the elements
>>>> will ever emit. Can this cause stuckness in this case?
>>>>
>>>

Re: How does groupIntoBatches behave when there are too few elements for a key?

Posted by Reuven Lax <re...@google.com>.
Yes.- you can just add a new timer spec to the DoFn.

On Wed, Aug 26, 2020 at 9:26 AM Siyuan Chen <sy...@google.com> wrote:

> I have been preparing a PR to add the timeout option. I had a dumb
> question - seems to me that the timeout should be set in processing time
> while the existing timer fired at the window expiration is in event time.
> Is there a way to have timers in different time domains?
> --
> Best regards,
> Siyuan
>
>
> On Wed, Aug 26, 2020 at 9:15 AM Reuven Lax <re...@google.com> wrote:
>
>> Seems reasonable to add an optional timeout to GroupIntoBatches to flush
>> records.
>>
>> On Wed, Aug 26, 2020 at 9:04 AM Robert Bradshaw <ro...@google.com>
>> wrote:
>>
>>> GroupIntoBatches sets a timer to flush the batches at the end of the
>>> window [1] no matter how many elements there are. This could cause a
>>> problem for the GlobalWindow if no more data ever comes in.
>>>
>>> [1]
>>> https://github.com/apache/beam/blob/release-2.23.0/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/GroupIntoBatches.java#L116
>>>
>>> On Wed, Aug 26, 2020 at 8:55 AM Alex Amato <aj...@google.com> wrote:
>>> >
>>> > How does groupIntoBatches behave when there are too few elements for a
>>> key (less than the provided batch size)?
>>> >
>>> > Based on how its described. Its not clear to me that the elements will
>>> ever emit. Can this cause stuckness in this case?
>>>
>>

Re: How does groupIntoBatches behave when there are too few elements for a key?

Posted by Siyuan Chen <sy...@google.com>.
I have been preparing a PR to add the timeout option. I had a dumb question
- seems to me that the timeout should be set in processing time while the
existing timer fired at the window expiration is in event time. Is there a
way to have timers in different time domains?
--
Best regards,
Siyuan


On Wed, Aug 26, 2020 at 9:15 AM Reuven Lax <re...@google.com> wrote:

> Seems reasonable to add an optional timeout to GroupIntoBatches to flush
> records.
>
> On Wed, Aug 26, 2020 at 9:04 AM Robert Bradshaw <ro...@google.com>
> wrote:
>
>> GroupIntoBatches sets a timer to flush the batches at the end of the
>> window [1] no matter how many elements there are. This could cause a
>> problem for the GlobalWindow if no more data ever comes in.
>>
>> [1]
>> https://github.com/apache/beam/blob/release-2.23.0/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/GroupIntoBatches.java#L116
>>
>> On Wed, Aug 26, 2020 at 8:55 AM Alex Amato <aj...@google.com> wrote:
>> >
>> > How does groupIntoBatches behave when there are too few elements for a
>> key (less than the provided batch size)?
>> >
>> > Based on how its described. Its not clear to me that the elements will
>> ever emit. Can this cause stuckness in this case?
>>
>

Re: How does groupIntoBatches behave when there are too few elements for a key?

Posted by Reuven Lax <re...@google.com>.
Seems reasonable to add an optional timeout to GroupIntoBatches to flush
records.

On Wed, Aug 26, 2020 at 9:04 AM Robert Bradshaw <ro...@google.com> wrote:

> GroupIntoBatches sets a timer to flush the batches at the end of the
> window [1] no matter how many elements there are. This could cause a
> problem for the GlobalWindow if no more data ever comes in.
>
> [1]
> https://github.com/apache/beam/blob/release-2.23.0/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/GroupIntoBatches.java#L116
>
> On Wed, Aug 26, 2020 at 8:55 AM Alex Amato <aj...@google.com> wrote:
> >
> > How does groupIntoBatches behave when there are too few elements for a
> key (less than the provided batch size)?
> >
> > Based on how its described. Its not clear to me that the elements will
> ever emit. Can this cause stuckness in this case?
>

Re: How does groupIntoBatches behave when there are too few elements for a key?

Posted by Robert Bradshaw <ro...@google.com>.
GroupIntoBatches sets a timer to flush the batches at the end of the
window [1] no matter how many elements there are. This could cause a
problem for the GlobalWindow if no more data ever comes in.

[1] https://github.com/apache/beam/blob/release-2.23.0/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/GroupIntoBatches.java#L116

On Wed, Aug 26, 2020 at 8:55 AM Alex Amato <aj...@google.com> wrote:
>
> How does groupIntoBatches behave when there are too few elements for a key (less than the provided batch size)?
>
> Based on how its described. Its not clear to me that the elements will ever emit. Can this cause stuckness in this case?

Re: How does groupIntoBatches behave when there are too few elements for a key?

Posted by Luke Cwik <lc...@google.com>.
GroupIntoBatches should always emit any buffered elements on window
expiration.

On Wed, Aug 26, 2020 at 8:55 AM Alex Amato <aj...@google.com> wrote:

> How does groupIntoBatches behave when there are too few elements for a key
> (less than the provided batch size)?
>
> Based on how its described
> <https://beam.apache.org/releases/javadoc/2.0.0/org/apache/beam/sdk/transforms/GroupIntoBatches.html>.
> Its not clear to me that the elements will ever emit. Can this cause
> stuckness in this case?
>