Posted to user@beam.apache.org by Chet Aldrich <ch...@postmates.com> on 2017/11/29 21:41:04 UTC

Using JDBC IO read transform, running out of memory on DataflowRunner.

Hey all, 

I’m running a Dataflow job that uses the JDBC IO transform to pull in a large amount of data (roughly 20 million rows, for reference) from Redshift, and I’m noticing that I’m getting an OutOfMemoryError on the Dataflow workers once I reach around 4 million rows. 

From the code I’m reading inside JDBC IO and the guide here (https://beam.apache.org/documentation/io/authoring-overview/#read-transforms), it seems like the transform just pulls the data in from the result set one by one and then emits each output. Considering that this is largely a limitation of the driver, that makes sense, but is there a way I can get around the memory limitation somehow? It seems like Dataflow repeatedly tries to create more workers to handle the work, but it can’t, which is part of the problem. 
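For reference, the read is wired up roughly like this (a minimal sketch rather than the actual job; the driver class, connection string, credentials, query, and row mapper are all placeholders):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class RedshiftReadSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // JdbcIO runs the query on a worker and maps/emits ResultSet rows one by one.
    PCollection<String> rows =
        p.apply(
            JdbcIO.<String>read()
                .withDataSourceConfiguration(
                    JdbcIO.DataSourceConfiguration.create(
                            "com.amazon.redshift.jdbc.Driver",          // placeholder driver class
                            "jdbc:redshift://example-cluster:5439/dev") // placeholder cluster URL
                        .withUsername("user")
                        .withPassword("password"))
                .withQuery("SELECT id, payload FROM events")            // placeholder query
                .withCoder(StringUtf8Coder.of())
                .withRowMapper(rs -> rs.getString("id") + "," + rs.getString("payload")));

    p.run();
  }
}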

If more info is needed to help me figure out how to avoid the memory limitation, I’m happy to provide it. 


Thanks,

Chet 

Re: Using JDBC IO read transform, running out of memory on DataflowRunner.

Posted by Chet Aldrich <ch...@postmates.com>.
Hey Eugene, 

Thanks for this, didn’t realize this was a parameter I could tune. Fixed my problems straight away. 
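For anyone who hits this later, the change was roughly the following (a sketch only, reusing the placeholder connection details and imports from my first message; the 10,000 value is arbitrary, and whether your Beam version exposes withFetchSize directly is worth checking, since older releases may need the fetch size set on the PreparedStatement instead):

JdbcIO.<String>read()
    .withDataSourceConfiguration(
        JdbcIO.DataSourceConfiguration.create(
                "com.amazon.redshift.jdbc.Driver",
                "jdbc:redshift://example-cluster:5439/dev")
            .withUsername("user")
            .withPassword("password"))
    .withQuery("SELECT id, payload FROM events")
    .withCoder(StringUtf8Coder.of())
    .withRowMapper(rs -> rs.getString("id") + "," + rs.getString("payload"))
    // Ask the driver to stream the result set in batches instead of buffering
    // every row client-side before the first element is emitted.
    .withFetchSize(10_000);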

Chet

> On Nov 29, 2017, at 2:14 PM, Eugene Kirpichov <ki...@google.com> wrote:
> 
> Hi,
> I think you're hitting something that can be fixed by configuring the Redshift driver:
> http://docs.aws.amazon.com/redshift/latest/dg/queries-troubleshooting.html#set-the-JDBC-fetch-size-parameter
> By default, the JDBC driver collects all the results for a query at one time. As a result, when you attempt to retrieve a large result set over a JDBC connection, you might encounter a client-side out-of-memory error. To enable your client to retrieve result sets in batches instead of in a single all-or-nothing fetch, set the JDBC fetch size parameter in your client application.


Re: Using JDBC IO read transform, running out of memory on DataflowRunner.

Posted by Eugene Kirpichov <ki...@google.com>.
Hi,
I think you're hitting something that can be fixed by configuring the Redshift driver:
http://docs.aws.amazon.com/redshift/latest/dg/queries-troubleshooting.html#set-the-JDBC-fetch-size-parameter
*By default, the JDBC driver collects all the results for a query at one time. As a result, when you attempt to retrieve a large result set over a JDBC connection, you might encounter a client-side out-of-memory error. To enable your client to retrieve result sets in batches instead of in a single all-or-nothing fetch, set the JDBC fetch size parameter in your client application.*
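
Outside of Beam, the driver-level behavior that page describes looks roughly like this (a plain-JDBC sketch with placeholder connection details and query; for the PostgreSQL-compatible driver the fetch size is generally only honored when autocommit is off):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class FetchSizeSketch {
  public static void main(String[] args) throws Exception {
    try (Connection conn =
        DriverManager.getConnection(
            "jdbc:redshift://example-cluster:5439/dev", "user", "password")) {
      // Cursor-based fetching: without this, the driver buffers the whole result set.
      conn.setAutoCommit(false);
      try (PreparedStatement stmt =
          conn.prepareStatement("SELECT id, payload FROM events")) {
        stmt.setFetchSize(10_000); // retrieve ~10k rows per round trip
        try (ResultSet rs = stmt.executeQuery()) {
          while (rs.next()) {
            System.out.println(rs.getString("id")); // placeholder processing
          }
        }
      }
    }
  }
}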
