Posted to user@beam.apache.org by Cory Tucker <co...@gmail.com> on 2016/08/19 19:38:55 UTC

Saving results of GroupByKey?

I have a fairly large data set that I need to perform a GroupByKey on.
This is by far the most time-consuming part of my pipeline, and I'm looking
for ways to optimize it.  The data is somewhat static and only changes
periodically, so it pains me to wait on the GBK every time I run the
pipeline.  Is there any way to cache the result of the operation and load
the data already grouped on each run?

thanks
--Cory

Re: Saving results of GroupByKey?

Posted by Lukasz Cwik <lc...@google.com>.

Split your pipeline into two pipelines: one that runs the GBK and saves its
results to a set of files using AvroIO, and another that reads those records
back in using AvroIO.
Use a file location that both pipelines can access.

You will exchange the cost of doing the GBK for the cost of reading from
disk.
This does make organizing and executing the pipelines more complicated than
having one pipeline that does everything.
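
The split described above can be sketched in plain Java. This is only an illustration of the pattern, not Beam code: the class GroupedCache, its method names, and the tab-separated text file are all hypothetical stand-ins, with the text file taking the place of the Avro files that AvroIO would write and read. It assumes keys and values are non-empty and contain no tabs, commas, or newlines.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the two-pipeline split; this is NOT Beam code.
class GroupedCache {

    // Stand-in for the expensive GroupByKey step: collect values per key.
    static Map<String, List<String>> groupByKey(List<String[]> pairs) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] kv : pairs) {
            grouped.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }
        return grouped;
    }

    // "Pipeline 1": write one line per key, formatted as key<TAB>v1,v2,...
    // Assumes no tabs, commas, or newlines inside keys or values.
    static void save(Map<String, List<String>> grouped, Path file) throws IOException {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            lines.add(e.getKey() + "\t" + String.join(",", e.getValue()));
        }
        Files.write(file, lines, StandardCharsets.UTF_8);
    }

    // "Pipeline 2": read the already-grouped records back, skipping the grouping.
    static Map<String, List<String>> load(Path file) throws IOException {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            String[] parts = line.split("\t", 2);
            grouped.put(parts[0], new ArrayList<>(Arrays.asList(parts[1].split(","))));
        }
        return grouped;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical shared location; a temp file keeps the sketch self-contained.
        Path file = Files.createTempFile("grouped", ".txt");
        List<String[]> pairs = Arrays.asList(
            new String[] {"a", "1"}, new String[] {"b", "2"}, new String[] {"a", "3"});
        save(groupByKey(pairs), file);   // run once, or whenever the data changes
        System.out.println(load(file));  // run every time, without regrouping
    }
}
```

In this arrangement the save step only needs to be rerun when the underlying data changes; every other run starts from load.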
