Posted to user@beam.apache.org by Luke Cwik <lc...@google.com> on 2021/08/03 16:23:17 UTC

Re: Beam Pipeline: storing files in CloudBucket / overriding

One option is to always write to a temporary location and add a ParDo
after the write transform that renames the files (and possibly deletes the
old ones) once the write completes. This lets you group the outputs of
multiple writers and minimizes the window during which files from a
previous run and the current run coexist. You should be able to do this
with any IO-based transform that returns a WriteFilesResult, such as
AvroIO [1].
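
A minimal sketch of that rename step. The bucket paths, the MyRecord
class, the "records" PCollection, and the temp-to-final naming scheme are
all placeholders for illustration, not anything from your pipeline:

import java.io.IOException;
import java.util.Collections;
import org.apache.beam.sdk.io.AvroIO;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.WriteFilesResult;
import org.apache.beam.sdk.io.fs.MoveOptions;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Values;

// Write to a temporary prefix and capture the finalized filenames.
WriteFilesResult<?> result =
    records.apply(
        AvroIO.write(MyRecord.class)
            .to("gs://my-bucket/tmp/output")  // placeholder temp prefix
            .withOutputFilenames());

// The filenames are only emitted once the write has finalized its files,
// so the rename runs after the write completes.
result
    .getPerDestinationOutputFilenames()
    .apply(Values.<String>create())
    .apply(
        "RenameToFinalLocation",
        ParDo.of(
            new DoFn<String, Void>() {
              @ProcessElement
              public void processElement(@Element String tempPath) throws IOException {
                // Assumed naming scheme: swap the temp prefix for the final one.
                String finalPath = tempPath.replace("/tmp/", "/final/");
                ResourceId src = FileSystems.matchNewResource(tempPath, false /* isDirectory */);
                ResourceId dst = FileSystems.matchNewResource(finalPath, false /* isDirectory */);
                // On GCS the copy backing the rename replaces any existing
                // object at the destination, which gives you the overwrite.
                FileSystems.rename(
                    Collections.singletonList(src),
                    Collections.singletonList(dst),
                    MoveOptions.StandardMoveOptions.IGNORE_MISSING_FILES);
              }
            }));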

Another option is to delete all of the existing files at the beginning of
the pipeline instead of selectively overwriting them.
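
If you go that route, a sketch of the upfront cleanup, run from the driver
program (e.g. in main(), before p.run()); the output prefix is again a
placeholder:

import java.util.List;
import java.util.stream.Collectors;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.MatchResult;
import org.apache.beam.sdk.io.fs.MoveOptions;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

// Register the filesystems (GCS among them) before using FileSystems
// outside of a running pipeline.
FileSystems.setDefaultPipelineOptions(PipelineOptionsFactory.create());

// Find everything the previous run left under the output prefix ...
MatchResult match = FileSystems.match("gs://my-bucket/final/output-*");
List<ResourceId> staleFiles =
    match.metadata().stream()
        .map(MatchResult.Metadata::resourceId)
        .collect(Collectors.toList());

// ... and delete it before the new run starts writing.
FileSystems.delete(staleFiles, MoveOptions.StandardMoveOptions.IGNORE_MISSING_FILES);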

1:
https://beam.apache.org/releases/javadoc/2.31.0/org/apache/beam/sdk/io/AvroIO.TypedWrite.html#expand-org.apache.beam.sdk.values.PCollection-

On Wed, Jul 21, 2021 at 7:14 AM Sofia’s World <mm...@gmail.com> wrote:

> Hi all,
>   if I remember correctly, when my pipeline writes a file to a GCP bucket
> and the file already exists, the default behaviour is that the file is
> not overwritten.
> What is the pattern to follow if I want the existing file to be
> overwritten every time the pipeline runs and stores the file to the GCP
> bucket?
>
> kind regards
>  Marco
>