Posted to user@beam.apache.org by TAREK ALSALEH <ta...@outlook.com.au> on 2020/06/12 07:51:59 UTC

Continuous Read pipeline

Hi,

I am using the Python SDK with Dataflow as my runner. I am looking at implementing a streaming pipeline that will continuously monitor a GCS bucket for incoming files and, depending on which regex the file matches, launch a set of transforms and write the final output back to Parquet for each file I read.

Excuse me as I am new to Beam, especially the streaming side, as I have done the above in batch mode, but we are limited by Dataflow allowing only 100 jobs per 24 hours.

I was looking into a couple of options:

  1.  Have a Beam pipeline running in streaming mode listening to a Pub/Sub topic, with a message published whenever a file lands in GCS. I am planning to use the WriteToFiles transform, but it seems to have a limitation: "it currently does not have support for multiple trigger firings on the same window."
     *   So what windowing strategy and trigger should I use? (A rough sketch of what I am attempting is below.)
     *   Which transform should I use, given that there are two ReadFromPubSub transforms, one in the io.gcp subpackage and another in external.gcp?
  2.  Use TextIO.Read<https://beam.apache.org/releases/javadoc/2.22.0/org/apache/beam/sdk/io/TextIO.Read.html> with watchForNewFiles<https://beam.apache.org/releases/javadoc/2.22.0/org/apache/beam/sdk/io/TextIO.Read.html#watchForNewFiles-org.joda.time.Duration-org.apache.beam.sdk.transforms.Watch.Growth.TerminationCondition-> from the Java SDK within a Python pipeline, as I understand there is some support for cross-language transforms?
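
For reference, here is a rough sketch of what I am attempting (untested; the topic name is a placeholder, process_file stands in for the real per-file transforms, and the 60-second fixed windows are just a first guess):

import apache_beam as beam
from apache_beam.io.fileio import WriteToFiles
from apache_beam.io.gcp.pubsub import ReadFromPubSub
from apache_beam.options.pipeline_options import PipelineOptions


def process_file(payload):
    # Placeholder: decode the Pub/Sub message and run the transforms
    # that match the file's regex.
    yield payload.decode('utf-8')


options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as pipeline:
    _ = (
        pipeline
        | 'ReadNotifications' >> ReadFromPubSub(
            topic='projects/my-project/topics/gcs-uploads')
        | 'Window' >> beam.WindowInto(beam.window.FixedWindows(60))
        | 'Process' >> beam.FlatMap(process_file)
        | 'Write' >> WriteToFiles(path='gs://my-bucket/output/'))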

Regards,
Tarek

Re: Continuous Read pipeline

Posted by Chamikara Jayalath <ch...@google.com>.
On Fri, Jun 12, 2020 at 12:52 AM TAREK ALSALEH <ta...@outlook.com.au>
wrote:

> Hi,
>
> I am using the Python SDK with Dataflow as my runner. I am looking at
> implementing a streaming pipeline that will continuously monitor a GCS
> bucket for incoming files and, depending on which regex the file matches,
> launch a set of transforms and write the final output back to Parquet for
> each file I read.
>
> Excuse me as I am new to Beam, especially the streaming side, as I have
> done the above in batch mode, but we are limited by Dataflow allowing only
> 100 jobs per 24 hours.
>
> I was looking into a couple of options:
>
>    1. Have a Beam pipeline running in streaming mode listening to a
>    Pub/Sub topic, with a message published whenever a file lands in GCS. I
>    am planning to use the WriteToFiles transform, but it seems to have a
>    limitation: "it currently does not have support for multiple trigger
>    firings on the same window."
>       1. So what windowing strategy and trigger should I use?
>       2. Which transform should I use, given that there are two
>       ReadFromPubSub transforms, one in the io.gcp subpackage and another
>       in external.gcp?
>    2. Use TextIO.Read
>    <https://beam.apache.org/releases/javadoc/2.22.0/org/apache/beam/sdk/io/TextIO.Read.html> with
>    watchForNewFiles
>    <https://beam.apache.org/releases/javadoc/2.22.0/org/apache/beam/sdk/io/TextIO.Read.html#watchForNewFiles-org.joda.time.Duration-org.apache.beam.sdk.transforms.Watch.Growth.TerminationCondition-> from
>    the Java SDK within a Python pipeline, as I understand there is some
>    support for cross-language transforms?
>
>
Sounds like the watchForNewFiles transform is exactly what you are looking
for, but we don't have that in the Python SDK yet. Also, the above
transforms haven't been tested with the cross-language transforms framework
yet, and we don't have cross-language Python wrappers for them yet.

Probably the best solution for the Python SDK today will be to use some
sort of GCS to Cloud Pub/Sub mapping to publish events for new files, and
then read those files using a Beam pipeline that reads from Cloud Pub/Sub.
For example, the following (I haven't tested this):
https://cloud.google.com/storage/docs/pubsub-notifications

For Dataflow, use the Pub/Sub connector in the io/gcp submodule.
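
As a rough sketch of how these pieces could fit together (untested; the
subscription name is a placeholder and process_file stands in for your
per-file transforms and Parquet write):

import json

import apache_beam as beam
from apache_beam.io.gcp.pubsub import ReadFromPubSub
from apache_beam.options.pipeline_options import PipelineOptions


def to_gcs_path(payload):
    # GCS Pub/Sub notifications carry the new object's metadata as JSON,
    # including its bucket and name.
    obj = json.loads(payload.decode('utf-8'))
    return 'gs://%s/%s' % (obj['bucket'], obj['name'])


def process_file(path):
    # Placeholder for the per-file transforms and the Parquet write.
    print('Processing %s' % path)


options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as pipeline:
    _ = (
        pipeline
        | 'ReadNotifications' >> ReadFromPubSub(
            subscription='projects/my-project/subscriptions/gcs-uploads')
        | 'ToGcsPath' >> beam.Map(to_gcs_path)
        | 'ProcessFile' >> beam.Map(process_file))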

Thanks,
Cham


>
> Regards,
> Tarek
>