Posted to dev@beam.apache.org by Josh Cogan <jo...@google.com.INVALID> on 2016/11/14 21:57:06 UTC

Batcher DoFn

Hi Dev,

After offline discussions with Gus, I'd like to propose we include a Batcher
function in contrib/.  This would be a DoFn that behaves like this:

[1,2,3,4,5] -> Batcher(max_size=2) -> [[1,2],[3,4],[5]]

It's simple code, but it also shows off that values can still be yielded
from finish_bundle(), and lots of people found it useful for the internal
Google version too.
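
A minimal sketch of the idea, written in plain Python so it runs without
Beam installed. The class only mirrors the DoFn lifecycle method names
(start_bundle, process, finish_bundle); it is not the proposed contrib/ code,
and run_bundle is a hypothetical driver standing in for a runner.

```python
class Batcher:
    """Plain-Python stand-in for a batching DoFn (does not import Beam)."""

    def __init__(self, max_size):
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        self.max_size = max_size
        self._buffer = []

    def start_bundle(self):
        # Start every bundle with an empty buffer.
        self._buffer = []

    def process(self, element):
        # Buffer the element and emit a batch once it is full.
        self._buffer.append(element)
        if len(self._buffer) >= self.max_size:
            batch, self._buffer = self._buffer, []
            yield batch

    def finish_bundle(self):
        # Emit whatever is left over at the end of the bundle.
        if self._buffer:
            batch, self._buffer = self._buffer, []
            yield batch


def run_bundle(fn, elements):
    """Hypothetical driver: calls the lifecycle methods for a single bundle."""
    out = []
    fn.start_bundle()
    for element in elements:
        out.extend(fn.process(element))
    out.extend(fn.finish_bundle())
    return out


batches = run_bundle(Batcher(max_size=2), [1, 2, 3, 4, 5])
print(batches)
```

With all five elements in one bundle, this reproduces the example above; the
trailing [5] comes from finish_bundle().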

LMK what you think.  Thanks!

Josh

-- 
joshgc
:wq

Re: Batcher DoFn

Posted by Kenneth Knowles <kl...@google.com.INVALID>.
Hi Josh,

I think you probably mean something like buffering elements in a field on
the DoFn, emitting batches as appropriate, and emitting the remainder in
finishBundle.

Unfortunately there are two issues:

 - in the presence of windowing the DoFn might be invoked in different
windows, so you'll garble the contents between windows
 - when data is streamed in small bundles, way smaller than batch size, the
results might be unintuitive
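
To make the second issue concrete, here is a rough plain-Python simulation
(no Beam involved). batch_per_bundle is a hypothetical helper that batches
each bundle independently and flushes the remainder at the end of the bundle,
which is what a buffer-in-a-field DoFn would effectively do.

```python
def batch_per_bundle(bundles, max_size):
    """Batch each bundle on its own, flushing the remainder at bundle end."""
    out = []
    for bundle in bundles:
        buffer = []
        for element in bundle:
            buffer.append(element)
            if len(buffer) == max_size:
                out.append(buffer)
                buffer = []
        if buffer:
            # Leftover elements are emitted as an undersized batch.
            out.append(buffer)
    return out


# All five elements arrive in a single bundle.
one_bundle = batch_per_bundle([[1, 2, 3, 4, 5]], max_size=2)

# The same elements arrive as five single-element bundles.
tiny_bundles = batch_per_bundle([[1], [2], [3], [4], [5]], max_size=2)

print(one_bundle)
print(tiny_bundles)
```

In the second case no batch ever reaches max_size, so every output batch
holds a single element.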

The solution to both is the State API which I am hard at work on. Then you
buffer in state, which is per-window and cross-bundle, and output as
appropriate, emitting the remainder from a callback invoked once the window
has expired (exceeded the allowed lateness).
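
As a rough illustration only (this is not the State API, whose shape is
still being worked out), the same buffering can be simulated in plain Python
with a dictionary keyed by window. WindowedBatcher and on_window_expired are
hypothetical names chosen for the sketch.

```python
from collections import defaultdict


class WindowedBatcher:
    """Simulates per-window buffering that survives across bundles."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._state = defaultdict(list)  # window -> buffered elements

    def process(self, element, window):
        # Buffer per window so contents of different windows never mix.
        buffer = self._state[window]
        buffer.append(element)
        if len(buffer) >= self.max_size:
            self._state[window] = []
            return [(window, buffer)]
        return []

    def on_window_expired(self, window):
        # Stand-in for the expiry callback: flush the remainder, if any.
        buffer = self._state.pop(window, [])
        return [(window, buffer)] if buffer else []


batcher = WindowedBatcher(max_size=2)
events = [(1, "w1"), (10, "w2"), (2, "w1"), (3, "w1"), (11, "w2"), (12, "w2")]

output = []
for element, window in events:
    output.extend(batcher.process(element, window))
output.extend(batcher.on_window_expired("w1"))
output.extend(batcher.on_window_expired("w2"))
print(output)
```

Elements from "w1" and "w2" are interleaved on input but never share a
batch, and each window's remainder is emitted only when that window expires.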

Kenn
