Posted to user@beam.apache.org by TAREK ALSALEH <ta...@outlook.com.au> on 2020/06/12 07:51:59 UTC
Continuous Read pipeline
Hi,
I am using the Python SDK with Dataflow as my runner. I am looking at implementing a streaming pipeline that will continuously monitor a GCS bucket for incoming files and, depending on the regex the file matches, launch a set of transforms and write the final output back to Parquet for each file I read.
Excuse me as I am new to Beam, especially the streaming bit, as I have done the above in batch mode, but we are limited by Dataflow allowing only 100 jobs per 24 hours.
I was looking into a couple of options:
1. Have a Beam pipeline running in streaming mode listening to a Pub/Sub topic. Once a file lands in GCS, a message is published. I am planning to use the WriteToFiles transform, but it seems there is a limitation: "it currently does not have support for multiple trigger firings on the same window."
* So what windowing strategy and trigger should I use?
* Which transform should I use, since there are two ReadFromPubSub transforms, one in the io.gcp subpackages and another in external.gcp?
2. Using TextIO.Read<https://beam.apache.org/releases/javadoc/2.22.0/org/apache/beam/sdk/io/TextIO.Read.html> and watchForNewFiles<https://beam.apache.org/releases/javadoc/2.22.0/org/apache/beam/sdk/io/TextIO.Read.html#watchForNewFiles-org.joda.time.Duration-org.apache.beam.sdk.transforms.Watch.Growth.TerminationCondition-> from the Java SDK within a Python pipeline, as I understand there is some support for cross-language transforms?
Regards,
Tarek
Re: Continuous Read pipeline
Posted by Chamikara Jayalath <ch...@google.com>.
On Fri, Jun 12, 2020 at 12:52 AM TAREK ALSALEH <ta...@outlook.com.au>
wrote:
> Hi,
>
> I am using the Python SDK with Dataflow as my runner. I am looking at
> implementing a streaming pipeline that will continuously monitor a GCS
> bucket for incoming files and depending on the regex of the file, launch a
> set of transforms and write the final output back to parquet for each file
> I read.
>
> Excuse me as I am new to Beam, especially the streaming bit, as I have
> done the above in batch mode, but we are limited by Dataflow allowing only
> 100 jobs per 24 hours.
>
> I was looking into a couple of options:
>
> 1. Have a beam pipeline running in streaming mode listening to a
> pubsub topic. Once a file lands in GCS a message is published. I am
> planning to use the WriteToFiles transform but it seems that there is
> a limitation where: "it currently does not have support for multiple
> trigger firings on the same window."
> 1. So what windowing strategy and trigger should I use?
> 2. Which transform should I use since there are two ReadFromPubSub
> transforms one in the io.gcp Subpackages and another one in the
> external.gcp?
>
>
> 2. Using the TextIO.Read
> <https://beam.apache.org/releases/javadoc/2.22.0/org/apache/beam/sdk/io/TextIO.Read.html> and
> the watchForNewFiles
> <https://beam.apache.org/releases/javadoc/2.22.0/org/apache/beam/sdk/io/TextIO.Read.html#watchForNewFiles-org.joda.time.Duration-org.apache.beam.sdk.transforms.Watch.Growth.TerminationCondition-> from
> the Java SDK within a python pipeline as I understand there is some support
> for cross-language transforms?
>
>
Sounds like the watchForNewFiles transform is exactly what you are looking
for, but we don't have that in the Python SDK yet.
Also, the above transforms haven't been tested with the cross-language
transforms framework yet, and we don't have cross-language Python wrappers
for them yet.
Probably the best solution for the Python SDK today will be to use some sort
of GCS to Cloud Pub/Sub mapping to publish events regarding new files, and
to read those files using a Beam pipeline that reads from Cloud Pub/Sub. For
example, the following (I haven't tested this):
https://cloud.google.com/storage/docs/pubsub-notifications
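For reference, per that page the notification setup itself is a one-time command (e.g. `gsutil notification create -t my-topic -f json gs://my-bucket`; the topic and bucket names here are placeholders), and with the JSON payload format each notification carries the object's "bucket" and "name" fields. Extracting a full gs:// path from the message is then a few lines of standard-library code. A hypothetical sketch, not tested against a live subscription:

```python
import json


def gcs_path_from_notification(payload: bytes) -> str:
    """Build a gs:// path from a GCS Pub/Sub notification payload.

    Per the GCS Pub/Sub notification docs, the JSON payload of an
    OBJECT_FINALIZE event includes "bucket" and "name" fields.
    """
    data = json.loads(payload.decode("utf-8"))
    return "gs://{}/{}".format(data["bucket"], data["name"])


# Example payload, trimmed to the two fields used here:
sample = json.dumps(
    {"bucket": "my-bucket", "name": "incoming/data_2020.csv"}
).encode("utf-8")
print(gcs_path_from_notification(sample))  # gs://my-bucket/incoming/data_2020.csv
```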
For Dataflow, use the Pub/Sub connector in the io/gcp submodule.
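To make that concrete, here is a rough, untested sketch of what such a streaming pipeline might look like with the io.gcp ReadFromPubSub connector. The subscription name, output path, `.csv` regex, and 60-second window are all placeholder assumptions, and note the WriteToFiles triggering limitation quoted above still applies; a per-destination Parquet sink would need more work than shown here:

```python
import json
import re

# Placeholder values -- not from the thread, adjust for your project:
SUBSCRIPTION = "projects/my-project/subscriptions/gcs-new-files"
OUTPUT_PATH = "gs://my-bucket/output/"
CSV_PATTERN = re.compile(r"\.csv$")


def wanted(path: str) -> bool:
    """Route files by regex, as described in the original question."""
    return bool(CSV_PATTERN.search(path))


def gcs_path(payload: bytes) -> str:
    """Extract gs://bucket/name from a GCS notification's JSON payload."""
    data = json.loads(payload.decode("utf-8"))
    return "gs://{}/{}".format(data["bucket"], data["name"])


def build_pipeline():
    # Imported here so the pure-Python helpers above have no Beam dependency.
    import apache_beam as beam
    from apache_beam.io.gcp.pubsub import ReadFromPubSub
    from apache_beam.options.pipeline_options import PipelineOptions

    opts = PipelineOptions(streaming=True)
    p = beam.Pipeline(options=opts)
    (
        p
        | "ReadNotifications" >> ReadFromPubSub(subscription=SUBSCRIPTION)
        | "ToPath" >> beam.Map(gcs_path)
        | "RouteByRegex" >> beam.Filter(wanted)
        | "ReadFileLines" >> beam.io.ReadAllFromText()
        | "Window" >> beam.WindowInto(beam.window.FixedWindows(60))
        | "Write" >> beam.io.fileio.WriteToFiles(path=OUTPUT_PATH)
    )
    return p


# build_pipeline().run() would launch it; that requires apache_beam installed
# and live GCP resources, so it is not invoked here.
```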
Thanks,
Cham
>
> Regards,
> Tarek
>