You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@beam.apache.org by Etienne Chauchot <ec...@apache.org> on 2020/03/02 10:07:40 UTC

Re: GroupIntoBatches not Working properly for Direct Runner Java

Hi,

+1 to what Kenn asked: your pipeline is in streaming mode and GIB 
preserves windowing, the elements are buffered until one of these 
conditions are true: batchsize reached or end of window. I your case I 
think it is the second one.

Best

Etienne

On 28/02/2020 19:15, Kenneth Knowles wrote:
> What are the timestamps on the elements?
>
> On Fri, Feb 28, 2020 at 8:36 AM Vasu Gupta <dev.vasugupta@gmail.com 
> <ma...@gmail.com>> wrote:
>
>     Edit: Issue is on Direct Runner(Not Direction Runner - mistyped)
>     Issue Details:
>     Input data: 7 key-value Packets like: a-1, a-4, b-3, c-5, d-1,
>     e-4, e-5
>     Batch Size: 5
>     Expected output: a-1,4, b-3, c-5, d-1, e-4,5
>     Getting Packets with irregular size like a-1, b-5, e-4,5 OR a-1,4,
>     c-5 etc
>     But i always got correct number of packets with BATCH_SIZE = 1
>
>     On 2020/02/27 20:40:16, Kenneth Knowles <kenn@apache.org
>     <ma...@apache.org>> wrote:
>     > Can you share some more details? What is the expected output and
>     what
>     > output are you seeing?
>     >
>     > On Thu, Feb 27, 2020 at 9:39 AM Vasu Gupta
>     <dev.vasugupta@gmail.com <ma...@gmail.com>> wrote:
>     >
>     > > Hey folks, I am using Apache beam Framework in Java with
>     Direction Runner
>     > > for local testing purposes. When using GroupIntoBatches with
>     batch size 1
>     > > it works perfectly fine i.e. the output of the transform is
>     consistent and
>     > > as expected. But when using with batch size > 1 the output
>     Pcollection has
>     > > less data than it should be.
>     > >
>     > > Pipeline flow:
>     > > 1. A Transform for reading from pubsub
>     > > 2. Transform for making a KV out of the data
>     > > 3. A Fixed Window transform of 1 second
>     > > 4. Applying GroupIntoBatches transform
>     > > 5. And last, Logging the resulting Iterables.
>     > >
>     > > Weird thing is that it batch_size > 1 works great when running on
>     > > DataflowRunner but not with DirectRunner. I think the issue
>     might be with
>     > > Timer Expiry since GroupIntoBatches uses BagState internally.
>     > >
>     > > Any help will be much appreciated.
>     > >
>     >
>