You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@beam.apache.org by Dean Hiller <de...@orderlyhealth.com> on 2021/11/01 19:52:24 UTC

Streaming but with window containing each file's contents

I keep digging, searching and can't quite find this use case anywhere.  We
receive a stream of 'file received' events and these files can be received
at the same time.  These files vary in size from 10 rows(simple) to 800,000
rows.

Ideally, we would like each file to end up in it's own window(I think) and
we would like to join each file with itself before sending downstream.
Some rows in the file error before we join and those go down an error
notification path.  The other rows, we join together.

Is there any sort of example on how to do this?  It is much like doing a
batch in the middle of streaming ideally.

thanks,
Dean Hiller (he/him <https://www.mypronouns.org/what-and-why>)
CTO & VP of Engineering
Orderly Health <http://orderlyhealth.com/>
Connect with me on LinkedIn! <https://www.linkedin.com/in/deanhiller/>

Re: Streaming but with window containing each file's contents

Posted by Luke Cwik <lc...@google.com>.
Do you receive the rows from the files or do you receive a file name and
then you read the entire file from within the pipeline?

If it is the former, then take a look at session windows with a large
enough gap duration and use the filename as the key when performing a
grouping operation or use a stateful DoFn to buffer the records and use
timers to choose when to produce output.

The latter is much simpler since it is easier to know when you are at the
end of the file.


On Mon, Nov 1, 2021 at 12:52 PM Dean Hiller <de...@orderlyhealth.com> wrote:

> I keep digging, searching and can't quite find this use case anywhere.  We
> receive a stream of 'file received' events and these files can be received
> at the same time.  These files vary in size from 10 rows(simple) to 800,000
> rows.
>
> Ideally, we would like each file to end up in it's own window(I think) and
> we would like to join each file with itself before sending downstream.
> Some rows in the file error before we join and those go down an error
> notification path.  The other rows, we join together.
>
> Is there any sort of example on how to do this?  It is much like doing a
> batch in the middle of streaming ideally.
>
> thanks,
> Dean Hiller (he/him <https://www.mypronouns.org/what-and-why>)
> CTO & VP of Engineering
> Orderly Health <http://orderlyhealth.com/>
> Connect with me on LinkedIn! <https://www.linkedin.com/in/deanhiller/>
>
>
>