Posted to dev@beam.apache.org by Yi En Ong <yi...@gmail.com> on 2022/10/04 16:02:35 UTC

Help on Apache Beam Pipeline Optimization

Hi,


I am trying to optimize my Apache Beam pipeline on Google Cloud Platform
Dataflow, and I would really appreciate your help and advice.


Background information: I am trying to read data from PubSub messages and
aggregate them over 3 time windows: 1 min, 5 min, and 60 min. The
aggregations consist of summing, averaging, finding the maximum or
minimum, etc. For example, for all data collected from 12:00 to 12:01, I
want to aggregate it and write the output into BigTable's 1-min column
family. For all data collected from 12:00 to 12:05, I want to similarly
aggregate it and write the output into BigTable's 5-min column family. The
same goes for 60 min.
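
[Editor's note: the fixed-window grouping described above amounts to
truncating each event timestamp down to a window boundary. A minimal
sketch in plain Java (not Beam; the class and method names are just for
illustration, using the 1/5/60-minute sizes from this thread):

```java
public class FixedWindowStart {
    // Start (in epoch seconds) of the fixed window containing
    // tsEpochSeconds, for a window of sizeSeconds: truncate the
    // timestamp down to the nearest window boundary.
    static long windowStart(long tsEpochSeconds, long sizeSeconds) {
        return (tsEpochSeconds / sizeSeconds) * sizeSeconds;
    }

    public static void main(String[] args) {
        long ts = 3_661L; // 01:01:01 after the epoch
        System.out.println(windowStart(ts, 60));   // 1-min window starts at 3660 (01:01:00)
        System.out.println(windowStart(ts, 300));  // 5-min window starts at 3600 (01:00:00)
        System.out.println(windowStart(ts, 3600)); // 60-min window starts at 3600 (01:00:00)
    }
}
```

The same event therefore lands in one 1-min, one 5-min, and one 60-min
window, which is why the three aggregations can all be fed from a single
read of the stream.]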


The current approach I took is to run 3 separate Dataflow jobs (i.e. 3
separate Beam pipelines), each with a different window duration (1 min, 5
min, and 60 min). See
https://beam.apache.org/releases/javadoc/2.0.0/org/apache/beam/sdk/transforms/windowing/Window.html.
The outputs of all 3 Dataflow jobs are written to the same BigTable, but
to different column families. Other than that, the transforms and
aggregations are the same across the 3 jobs.


However, this seems very inefficient, both computationally and in cost, as
the 3 jobs perform essentially the same work; the only differences are the
window duration and the output column family.


One challenge we faced is that, from the documentation, it seems we are
unable to create multiple windows of different durations in a single
Dataflow job. Also, when we write the final data into BigTable, we have to
define the table, column family, column, and row key. And unfortunately,
the column family is a fixed property (i.e. it cannot be redefined or
changed based on the window duration).


Hence, I am writing to ask whether there is a way to use just 1 Dataflow
job to fulfil the objective of this project: to aggregate data over
different window durations and write the results to different column
families of the same BigTable.


Thank you

Re: Help on Apache Beam Pipeline Optimization

Posted by Evan Galpin <eg...@apache.org>.
If I’m not mistaken, you could create a PCollection from the Pub/Sub read
operation and then apply 3 different windowing strategies in different
“chains” of the graph, e.g.:

PCollection<PubsubMessage> msgs = p.apply(PubsubIO.readMessages().fromSubscription(…));

msgs.apply(Window.<PubsubMessage>into(FixedWindows.of(Duration.standardMinutes(1)))).apply(allMyTransforms);

msgs.apply(Window.<PubsubMessage>into(FixedWindows.of(Duration.standardMinutes(5)))).apply(allMyTransforms);

msgs.apply(Window.<PubsubMessage>into(FixedWindows.of(Duration.standardMinutes(60)))).apply(allMyTransforms);


Similarly this could be done with a loop if preferred.
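
[Editor's note: one way to structure that loop (a sketch only; the
column-family names are hypothetical, and the Beam calls appear as
comments since running them needs the SDK and a live subscription) is to
drive the three branches from a single map of window duration to
destination column family:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class WindowBranches {
    // Window duration (minutes) -> BigTable column family to write into.
    // The family names below are hypothetical placeholders.
    static Map<Integer, String> branches() {
        Map<Integer, String> m = new LinkedHashMap<>();
        m.put(1, "agg_1min");
        m.put(5, "agg_5min");
        m.put(60, "agg_60min");
        return m;
    }

    public static void main(String[] args) {
        for (Map.Entry<Integer, String> e : branches().entrySet()) {
            // In the real pipeline each iteration would do roughly:
            //   msgs.apply(Window.into(FixedWindows.of(Duration.standardMinutes(e.getKey()))))
            //       .apply(allMyTransforms)
            //       .apply(writeToBigTable(e.getValue())); // family chosen per branch
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Since the destination column family is part of each mutation built at
write time rather than a property of the whole job, a single pipeline can
fan out to all three families this way.]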
