Posted to user@beam.apache.org by Lina Mårtensson <li...@camus.energy> on 2022/04/26 21:28:42 UTC

Streaming writes to GCS

Hi Beamers,

I'm testing out streaming writes to GCS in an attempt to see if Beam will
work well for our needs, but I'm having trouble making this work.

I set up my pipeline similarly to the one described in
https://stackoverflow.com/questions/56960622/writetotext-is-only-writing-to-temp-files

p | 'Read PubSub metadata' >> ReadFromPubSub(
        subscription=known_args.input_subscription)
  | 'Convert Message to JSON' >> beam.Map(json.loads)
  | 'Extract File Names' >> beam.ParDo(ExtractFn())
  | 'Read Files' >> beam.io.ReadAllFromText()
  | 'Window' >> beam.WindowInto(
        beam.window.FixedWindows(30),
        trigger=beam.transforms.trigger.AfterProcessingTime(),
        accumulation_mode=beam.transforms.trigger.AccumulationMode.DISCARDING)
  | 'Combine' >> beam.CombineGlobally(CombineFn()).without_defaults()
  | 'Output' >> beam.io.WriteToText(known_args.output)

But, just like the Stack Overflow poster, my output never gets written to
the final GCS files. (There are temp files containing data, but it looks
like that might be all of the input data rather than the computed output.)

There's an answer there saying "From engaging with the Apache Beam Python
guys, streaming writes to GCS (or local filesystem) is not yet supported in
Python, hence why the streaming write does not occur; only unbounded
targets are currently supported (e.g. Big Query tables).
Apparently this will be supported in the upcoming release of Beam for
Python v2.14.0."

I'm on 2.37.0 - was support for streaming file writes ever added?

Thanks!