Posted to user@beam.apache.org by Evan Galpin <eg...@apache.org> on 2022/10/11 23:25:16 UTC

Re: Help on Apache Beam Pipeline Optimization

If I’m not mistaken, you could create a PCollection from the pubsub read
operation and then apply 3 different windowing strategies in different
“chains” of the graph. For example:

import org.joda.time.Duration; // Beam windowing uses joda-time durations

PCollection<PubsubMessage> msgs =
    pipeline.apply(PubsubIO.readMessages().fromSubscription(…));

msgs.apply(Window.into(FixedWindows.of(Duration.standardMinutes(1))))
    .apply(allMyTransforms);

msgs.apply(Window.into(FixedWindows.of(Duration.standardMinutes(5))))
    .apply(allMyTransforms);

msgs.apply(Window.into(FixedWindows.of(Duration.standardMinutes(60))))
    .apply(allMyTransforms);


Similarly, this could be done with a loop if preferred; a rough sketch follows.
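
A minimal sketch of the loop variant, assuming the msgs and allMyTransforms
placeholders from above, plus a hypothetical writeToBigtable(family) sink
helper and illustrative "-min" column family names (none of these are part
of the Beam API):

// Sketch only: allMyTransforms and writeToBigtable(family) are
// placeholders; the named applies keep transform names unique.
for (int minutes : new int[] {1, 5, 60}) {
  String family = minutes + "-min"; // e.g. target BigTable column family
  msgs
      .apply("window-" + minutes + "m",
          Window.into(FixedWindows.of(Duration.standardMinutes(minutes))))
      .apply("aggregate-" + minutes + "m", allMyTransforms)
      .apply("write-" + minutes + "m", writeToBigtable(family));
}

Giving each apply a unique name avoids transform-name collisions when the
same steps are applied several times in one pipeline, and passing the column
family into the sink would let each chain write to its own family in the
same table.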

On Tue, Oct 4, 2022 at 14:15 Yi En Ong <yi...@gmail.com> wrote:

> Hi,
>
>
> I am trying to optimize my Apache Beam pipeline on Google Cloud Platform
> Dataflow, and I would really appreciate your help and advice.
>
>
> Background information: I am trying to read data from PubSub messages and
> aggregate them based on 3 time windows: 1 min, 5 min, and 60 min. The
> aggregations consist of summing, averaging, finding the maximum or
> minimum, etc. For example, for all data collected from 1200 to 1201, I want
> to aggregate them and write the output into BigTable's 1-min column family.
> And for all data collected from 1200 to 1205, I want to similarly aggregate
> them and write the output into BigTable's 5-min column family. The same
> goes for 60 min.
>
>
> The current approach I took is to have 3 separate dataflow jobs (i.e. 3
> separate Beam Pipelines), each one having a different window duration
> (1min, 5min and 60min). See
> https://beam.apache.org/releases/javadoc/2.0.0/org/apache/beam/sdk/transforms/windowing/Window.html.
> The outputs of all 3 dataflow jobs are written to the same BigTable, but
> to different column families. Other than that, the function and
> aggregations of the data are the same for the 3 jobs.
>
>
> However, this seems very computationally and cost inefficient, as the 3
> jobs are essentially performing the same function, with the only
> differences being the window duration and the output column family.
>
>
>
> Some challenges and limitations we faced were that, from the documentation,
> it seems we are unable to create multiple windows of different periods in a
> single dataflow job. Also, when we write the final data into BigTable, we
> have to define the table, column family, column, and row key. Unfortunately,
> the column family is a fixed property (i.e. it cannot be redefined or
> changed based on the window period).
>
>
> Hence, I am writing to ask if there is a way to use only 1 dataflow job
> that fulfils the objective of this project: to aggregate data over
> different window periods, and write the results to different column
> families of the same BigTable.
>
>
> Thank you
>