Posted to user@beam.apache.org by Luke Cwik <lc...@google.com> on 2021/08/09 23:10:19 UTC

Re: [2.28.0] [Java] [Dataflow] ParquetIO writeDynamic stuck in Garbage Collection when writing ~125K files to dynamic destinations

It seems like you're hitting an OOM where Apache Beam is loading too much
into memory rather than paginating over some data. It would be good if you
could narrow it down with some heap dumps.

Until then, try partitioning the data via the Window associated with each
record into N different PCollections. Then apply the WriteDynamic transform
to each of them. This should help cut down on the working set size by a
factor of N for whatever you're hitting with WriteDynamic.
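
A rough sketch of the idea, assuming a PCollection<GenericRecord> named
"records" and using Partition from the Beam Java SDK; eventTimeMillis() and
parquetWriteDynamic() are placeholders for your own timestamp extraction and
your existing FileIO.writeDynamic/ParquetIO transform:

    import org.apache.avro.generic.GenericRecord;
    import org.apache.beam.sdk.transforms.Partition;
    import org.apache.beam.sdk.values.PCollectionList;

    // Split the records into N partitions keyed off each element's 15-minute
    // interval, so each writeDynamic only tracks a fraction of the ~125K
    // destinations at a time.
    int numPartitions = 8;  // N; tune to how much working set one write can hold

    PCollectionList<GenericRecord> parts = records.apply("PartitionByInterval",
        Partition.of(numPartitions, (GenericRecord rec, int n) -> {
          // eventTimeMillis(rec) is a stand-in for however you derive the
          // record's timestamp / window.
          long interval = eventTimeMillis(rec) / (15 * 60 * 1000L);
          return (int) (interval % n);
        }));

    for (int i = 0; i < numPartitions; i++) {
      // parquetWriteDynamic() stands in for your existing writeDynamic transform.
      parts.get(i).apply("WriteDynamic" + i, parquetWriteDynamic());
    }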

On Wed, Jul 21, 2021 at 11:30 AM Robert Bradshaw <ro...@google.com>
wrote:

> Windowing works fine for batch jobs; that's one of the main points of
> the unified batch/streaming model. (One difference is that in batch
> there's no expectation that data be temporally ordered, even
> "kind-of" ordered like it is for streaming.)
>
> On Wed, Jul 21, 2021 at 10:39 AM Andrew Kettmann <ak...@evolve24.com>
> wrote:
> >
> > Oof, none of the documentation around windowing I have read has said
> anything about it not working in a batch job. Where did you find that info,
> if you remember?
> > ________________________________
> > From: Vincent Marquez <vi...@gmail.com>
> > Sent: Wednesday, July 21, 2021 12:21 PM
> > To: user <us...@beam.apache.org>
> > Subject: Re: [2.28.0] [Java] [Dataflow] ParquetIO writeDynamic stuck in
> Garbage Collection when writing ~125K files to dynamic destinations
> >
> > Windowing doesn't work with Batch jobs.  You could dump your BQ data to
> pubsub and then use a streaming job to window.
> > ~Vincent
> >
> >
> > On Wed, Jul 21, 2021 at 10:13 AM Andrew Kettmann <ak...@evolve24.com>
> wrote:
> >
> > Worker machines are n1-standard-2s (2 vCPUs and 7.5 GB of RAM).
> >
> > The pipeline is simple, but it produces a large number of output files:
> ~125K temp files written in at least one case.
> >
> > Scan Bigtable (NoSQL DB)
> > Transform with business logic
> > Convert to GenericRecord
> > WriteDynamic to a Google Cloud Storage bucket as Parquet files partitioned
> by 15-minute intervals.
> (gs://bucket/root_dir/CATEGORY/YEAR/MONTH/DAY/HOUR/MINUTE_FLOOR_15/FILENAME.parquet)
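> >
> > Roughly, the writeDynamic step looks like the sketch below (simplified;
> > getCategory()/getEventTime() stand in for the real accessors, "schema" is
> > the Avro schema for the Parquet files, and the FileIO, ParquetIO,
> > org.joda.time, and StringUtf8Coder imports are omitted):
> >
> >     // Route each record to a destination "directory" under root_dir,
> >     // flooring the minute down to the nearest 15-minute boundary.
> >     records.apply(FileIO.<String, GenericRecord>writeDynamic()
> >         .by(rec -> {
> >           DateTime ts = new DateTime(getEventTime(rec), DateTimeZone.UTC);
> >           int minuteFloor15 = (ts.getMinuteOfHour() / 15) * 15;
> >           return String.format("%s/%04d/%02d/%02d/%02d/%02d",
> >               getCategory(rec), ts.getYear(), ts.getMonthOfYear(),
> >               ts.getDayOfMonth(), ts.getHourOfDay(), minuteFloor15);
> >         })
> >         .via(ParquetIO.sink(schema))
> >         .to("gs://bucket/root_dir")
> >         .withDestinationCoder(StringUtf8Coder.of())
> >         .withNaming(dest -> FileIO.Write.defaultNaming(dest + "/part", ".parquet")));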
> >
> >
> > Everything works fine until I get to the writeDynamic. When it hits the
> groupByKey
> (FileIO.Write/WriteFiles/GatherTempFileResults/Reshuffle.ViaRandomKey/Reshuffle/GroupByKey),
> the Stackdriver logs show a ton of allocation-failure-triggered GC that
> frees up essentially zero space. The step never progresses, hits a
> "The worker lost contact with the service." error four times, and then
> fails. Also worth noting that Dataflow scales down to a single worker during
> this time, so it is trying to do it all at once.
> >
> > I am likely not hitting GC alerts because I am using a snippet I pulled
> from a GCP Dataflow template that queries Bigtable; it appears to disable
> the GC thrashing monitoring, since Bigtable creates at least 5 objects per
> row scanned.
> >
> >     // Allowing 100% of a period to be spent in GC effectively disables
> >     // the GC thrashing check.
> >     DataflowPipelineDebugOptions debugOptions =
> >         options.as(DataflowPipelineDebugOptions.class);
> >     debugOptions.setGCThrashingPercentagePerPeriod(100.00);
> >
> > What are my options for splitting this up so that it can be processed in
> smaller chunks? I tried adding windowing but it didn't seem to help; maybe I
> needed to do something else beyond just the windowing, but I don't really
> have a key to group it by here.
> >
> > Andrew Kettmann
> > DevOps Engineer
> > P: 1.314.596.2836
> >
>