Posted to dev@beam.apache.org by Daniel Oliveira <da...@google.com> on 2020/05/29 22:34:26 UTC

Is this SO question showing a bug in Java Reshuffle? Can someone take a look?

Hi dev list,

While answering Stack Overflow questions I stumbled onto this:
https://stackoverflow.com/questions/62017572/beam-java-dataflow-bigquery-streaming-insert-groupbykey-reducing-elements

The user's pipeline seems to have a Reshuffle outputting fewer elements than
it received, inside a BigQuery streaming insert. This looks like a bug to
me since I assume Reshuffle should always be outputting unchanged elements,
and I read through the code and as far as I can tell this shouldn't be
happening. But I'm not too familiar with the code in question so I was
hoping someone else with more context on it could help confirm.

Thanks,
Daniel Oliveira

Re: Is this SO question showing a bug in Java Reshuffle? Can someone take a look?

Posted by Daniel Oliveira <da...@google.com>.
I asked the user to check if it was just the GBK or the entire Reshuffle,
and they confirmed it was the entire Reshuffle. Their pipeline's final
output was also missing some of the expected elements. I'm still
asking the user for more info to make sure this isn't a bug on the Dataflow
side.

On Fri, May 29, 2020 at 4:32 PM Robert Bradshaw <ro...@google.com> wrote:

> Reshuffle should be emitting exactly the same number of elements that it
> gets. The GBK inside Reshuffle may output slightly fewer due to key
> collisions, but the ExpandIterable step should take care of this. Do we
> have counts for that output? (I will say that seems to be an
> extraordinarily high number of collisions.)

Re: Is this SO question showing a bug in Java Reshuffle? Can someone take a look?

Posted by Robert Bradshaw <ro...@google.com>.
Reshuffle should be emitting exactly the same number of elements that it
gets. The GBK inside Reshuffle may output slightly fewer due to key
collisions, but the ExpandIterable step should take care of this. Do we
have counts for that output? (I will say that seems to be an
extraordinarily high number of collisions.)
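
To make the counting argument above concrete, here is a minimal sketch in
plain Java. It is not Beam's actual Reshuffle code: the class and method
names are hypothetical, and it assumes elements are grouped under keys drawn
from a small key space so that collisions are common. It shows that the
grouping step can emit fewer outputs than it received, while expanding each
group restores the original element count.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical illustration only; this does not use or reproduce Beam's classes.
public class ReshuffleCountSketch {

    // Group elements under random keys, loosely mimicking a GBK step.
    // Colliding keys put several elements into one group, so the number of
    // groups can be smaller than the number of input elements.
    static <T> Map<Integer, List<T>> groupByRandomKey(List<T> elements, int keySpace, long seed) {
        Random random = new Random(seed);
        Map<Integer, List<T>> groups = new HashMap<>();
        for (T element : elements) {
            int key = random.nextInt(keySpace);
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(element);
        }
        return groups;
    }

    // Flatten every group back into individual elements, as an
    // expand-iterable step would. No element is dropped or duplicated.
    static <T> List<T> expandIterables(Map<Integer, List<T>> groups) {
        List<T> out = new ArrayList<>();
        for (List<T> group : groups.values()) {
            out.addAll(group);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> input = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            input.add(i);
        }
        Map<Integer, List<Integer>> groups = groupByRandomKey(input, 100, 42L);
        List<Integer> output = expandIterables(groups);
        System.out.println("input elements:   " + input.size());
        System.out.println("grouped outputs:  " + groups.size());
        System.out.println("expanded outputs: " + output.size());
    }
}
```

In this sketch the grouped count is at most the size of the key space, but
the expanded count always equals the input count. If the expanded count in
the real pipeline is lower than the input count, key collisions alone would
not explain it.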
