Posted to user@beam.apache.org by Eugene Kirpichov <ki...@google.com> on 2017/11/29 22:14:40 UTC

Re: Using JDBC IO read transform, running out of memory on DataflowRunner.

Hi,
I think you're hitting something that can be fixed by configuring the
Redshift JDBC driver:
http://docs.aws.amazon.com/redshift/latest/dg/queries-troubleshooting.html#set-the-JDBC-fetch-size-parameter
*By default, the JDBC driver collects all the results for a query at one
time. As a result, when you attempt to retrieve a large result set over a
JDBC connection, you might encounter a client-side out-of-memory error. To
enable your client to retrieve result sets in batches instead of in a
single all-or-nothing fetch, set the JDBC fetch size parameter in your
client application.*
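
At the plain-JDBC level, the change the docs describe looks roughly like
the sketch below. The URL, credentials, query and batch size are
placeholders, the Redshift driver has to be on the classpath, and some
PostgreSQL-derived drivers only honor the fetch size when autocommit is
off, so the sketch disables it:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class FetchSizeSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder URL and credentials; substitute your own cluster details.
    try (Connection conn = DriverManager.getConnection(
        "jdbc:redshift://example-cluster:5439/dev", "user", "password")) {
      // Some PostgreSQL-derived drivers only honor the fetch size when
      // autocommit is off, so disable it before running the query.
      conn.setAutoCommit(false);
      try (PreparedStatement stmt =
          conn.prepareStatement("SELECT id, name FROM big_table")) {
        // Ask the driver to stream results in batches of 10,000 rows
        // instead of materializing the whole result set in memory.
        stmt.setFetchSize(10000);
        try (ResultSet rs = stmt.executeQuery()) {
          while (rs.next()) {
            // process one row at a time
          }
        }
      }
    }
  }
}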

On Wed, Nov 29, 2017 at 1:41 PM Chet Aldrich <ch...@postmates.com>
wrote:

> Hey all,
>
> I’m running a Dataflow job that uses the JDBC IO transform to pull in a
> bunch of data (20 million rows, for reference) from Redshift, and I’m noticing
> that I’m getting an OutOfMemoryError on the Dataflow workers once I reach
> around 4 million rows.
>
> It seems, given the code that I’m reading inside JDBC IO and the guide
> here (
> https://beam.apache.org/documentation/io/authoring-overview/#read-transforms),
> that it’s just pulling the data in from the result set one row at a time
> and then emitting each output. Considering that this is sort of a limitation
> of the driver, this makes sense, but is there a way I can get around the
> memory limitation somehow? It seems like Dataflow repeatedly tries to create
> more workers to handle the work, but it can’t, which is part of the problem.
>
> If more info is needed to help me sort out what I could do to avoid
> running into the memory limitations, I’m happy to provide it.
>
>
> Thanks,
>
> Chet
>
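
The row-by-row read described above would plug the fetch size in roughly
like this: a minimal sketch, assuming a Beam release whose JdbcIO.Read
exposes withFetchSize, with placeholder driver class, connection details,
query and coder.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.coders.VarIntCoder;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.KV;

public class RedshiftReadWithFetchSize {
  public static void main(String[] args) {
    Pipeline pipeline =
        Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    pipeline.apply(
        JdbcIO.<KV<Integer, String>>read()
            .withDataSourceConfiguration(
                JdbcIO.DataSourceConfiguration.create(
                        "com.amazon.redshift.jdbc.Driver",   // placeholder driver class
                        "jdbc:redshift://example-cluster:5439/dev")
                    .withUsername("user")
                    .withPassword("password"))
            .withQuery("SELECT id, name FROM big_table")
            // Each row of the result set is mapped and emitted individually.
            .withRowMapper(rs -> KV.of(rs.getInt(1), rs.getString(2)))
            .withCoder(KvCoder.of(VarIntCoder.of(), StringUtf8Coder.of()))
            // Passed through to PreparedStatement.setFetchSize so the driver
            // streams rows in batches rather than buffering the full result set.
            .withFetchSize(10000));

    pipeline.run().waitUntilFinish();
  }
}

On releases without withFetchSize, calling setFetchSize on the statement
from a withStatementPreparator callback should have the same effect, though
I haven't verified that against the Redshift driver.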

Re: Using JDBC IO read transform, running out of memory on DataflowRunner.

Posted by Chet Aldrich <ch...@postmates.com>.
Hey Eugene, 

Thanks for this, didn’t realize this was a parameter I could tune. Fixed my problems straight away. 

Chet
