Posted to user@drill.apache.org by Vitalii Diravka <vi...@gmail.com> on 2018/08/13 17:33:29 UTC

Re: Requesting guidance. Having trouble generating parquet files from jdbc connection to PostgreSQL. "java.lang.OutOfMemoryError: GC overhead limit exceeded"

Hi Reid,

Looks like your issue is similar to DRILL-4177 [1]. That one was related to
a MySQL connection, but PostgreSQL appears to have the same problem.
Looking at the Postgres documentation, the code needs to explicitly set the
connection's autocommit mode to false, e.g. conn.setAutoCommit(false) [2].
For data sizes of 10 million rows or more, this is a must.

You could disable the "Auto Commit" option as a session option [3],
or do it within the plugin config URL with the following property:
defaultAutoCommit=false [4]

[1] https://issues.apache.org/jira/browse/DRILL-4177
[2]
https://jdbc.postgresql.org/documentation/93/query.html#fetchsize-example
[3] https://www.postgresql.org/docs/9.3/static/ecpg-sql-set-autocommit.html
[4] https://jdbc.postgresql.org/documentation/head/ds-cpds.html
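
For illustration, here is roughly what that looks like in plain JDBC, per
the Postgres docs above (just a sketch with hypothetical connection
details and table name, not Drill's internal code):

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;

  public class CursorFetchSketch {
      public static void main(String[] args) throws Exception {
          // hypothetical connection details, for illustration only
          try (Connection conn = DriverManager.getConnection(
                  "jdbc:postgresql://myhost.mydomain.com/mydb", "user", "password")) {
              // autocommit must be off so the driver can use a server-side cursor
              conn.setAutoCommit(false);
              try (Statement st = conn.createStatement()) {
                  // fetch rows in batches of 10000 instead of buffering the whole result
                  st.setFetchSize(10000);
                  try (ResultSet rs = st.executeQuery("SELECT * FROM mytable")) {
                      while (rs.next()) {
                          // process one row at a time here
                      }
                  }
              }
          }
      }
  }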

Kind regards
Vitalii


On Mon, Aug 13, 2018 at 3:03 PM Reid Thompson <Re...@omnicell.com>
wrote:

> My standalone host is configured with 16GB RAM, 8 cpus.  Using
> drill-embedded (single host standalone), I am attempting to pull data
> from PostgreSQL tables to parquet files via CTAS. Smaller datasets work
> fine, but larger data sets fail (for example ~11GB) with
> "java.lang.OutOfMemoryError: GC overhead limit exceeded"  Can someone
> advise on how to get past this?
>
> Is there a way to have drill stream this data from PostgreSQL to parquet
> files on disk, or does the data set have to be completely loaded into
> memory before it can be written to disk?  The documentation indicates
> that drill will spill to disk to avoid memory issues, so I had hoped
> that it would be straightforward to extract from the DB to disk.
>
> Should I not be attempting this via CTAS?  What are the other options?
>
>
> thanks,
> reid
>
>
>
>

Re: Requesting guidance. Having trouble generating parquet files from jdbc connection to PostgreSQL. "java.lang.OutOfMemoryError: GC overhead limit exceeded"

Posted by Robert Hou <rh...@mapr.com>.
I'm wondering if you can export the JSON from Postgres to a JSON document,
and then write it to parquet using Drill.  This link may have some ideas:


https://hashrocket.com/blog/posts/create-quick-json-data-dumps-from-postgresql
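
If it helps, here is a minimal sketch of that idea in JDBC terms
(hypothetical table and file names; the post above shows the psql-based
variant): stream row_to_json() output from Postgres into a file with one
JSON object per line, which Drill could then write to parquet with a CTAS
over the dfs plugin.

  import java.io.FileWriter;
  import java.io.PrintWriter;
  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;

  public class JsonDumpSketch {
      public static void main(String[] args) throws Exception {
          try (Connection conn = DriverManager.getConnection(
                  "jdbc:postgresql://myhost.mydomain.com/mydb", "user", "password")) {
              conn.setAutoCommit(false);            // needed for cursor-based fetching
              try (Statement st = conn.createStatement()) {
                  st.setFetchSize(10000);           // stream rows instead of buffering them all
                  try (ResultSet rs = st.executeQuery(
                               "SELECT row_to_json(t)::text FROM mytable t");
                       PrintWriter out = new PrintWriter(new FileWriter("/tmp/mytable.json"))) {
                      while (rs.next()) {
                          out.println(rs.getString(1));   // one JSON object per line
                      }
                  }
              }
          }
      }
  }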

Thanks.

--Robert

On Wed, Aug 15, 2018 at 10:16 AM, Reid Thompson <Re...@omnicell.com>
wrote:

> Thanks for your help.
>
> Yes, it solves the out of memory failure, but then I ran into the json
> issue.
>
> I'm not sure that the ticket represents what I currently need.  I.E.  At
> this point, I'm attempting to do a simple pull of data from PostgreSQL
> and write to parquet, nothing more.   I think that the ticket indicates
> wanting to be able to directly query for content in json(b) columns from
> drill.
>
> thanks,
> Reid
>
> On Wed, 2018-08-15 at 19:27 +0300, Vitalii Diravka wrote:
> > [EXTERNAL SOURCE]
> >
> > Glad to see that it helps and you've solved the issue.
> >
> > I have seen you asked about PostgreSQL JSONB in another topic,
> > but looks like it is not supported by now and should be implemented in
> context of DRILL-5087
> >
> > Kind regards
> > Vitalii
> >
> >
> > On Wed, Aug 15, 2018 at 3:36 PM Reid Thompson <
> Reid.Thompson@omnicell.com> wrote:
> > > Vitalii,
> > >
> > > yes. Per https://jdbc.postgresql.org/documentation/head/connect.html
> > > (page lists numerous settings available)
> > >
> > > defaultRowFetchSize = int
> > >
> > > Determine the number of rows fetched in ResultSet by one fetch with
> trip to the database. Limiting the number of rows are fetch with each trip
> to the database allow avoids unnecessary memory
> > > consumption and as a consequence OutOfMemoryException.
> > >
> > > The default is zero, meaning that in ResultSet will be fetch all rows
> at once. Negative number is not available.
> > >
> > >
> > > on another topic,
> > > is there any way to have drill properly recognize postgresql's json and
> > > jsonb types?  I have tables with both, and am getting this error
> > >
> > >  org.apache.drill.common.exceptions.UserException:
> UNSUPPORTED_OPERATION
> > >  ERROR: A column you queried has a data type that is not currently
> > >  supported by the JDBC storage plugin. The column's name was actionjson
> > >  and its JDBC data type was OTHER.
> > >
> > >
> > > thanks,
> > > reid
> > >
> > > On Wed, 2018-08-15 at 14:44 +0300, Vitalii Diravka wrote:
> > > > [EXTERNAL SOURCE]
> > > >
> > > > Hi Reid,
> > > >
> > > > Am I right, defaultRowFetchSize=10000 property in URL solves that
> OOM issue?
> > > > If so possibly it can be useful to have this information in Drill
> docs [1].
> > > >
> > > > [1] https://drill.apache.org/docs/rdbms-storage-plugin/
> > > >
> > > > Kind regards
> > > > Vitalii
> > > >
> > > >
> > > > On Tue, Aug 14, 2018 at 4:17 PM Reid Thompson <
> Reid.Thompson@omnicell.com> wrote:
> > > > > using the below parameters in the URL and looking in the defined
> logfile
> > > > > indicates that the fetch size is being set to 10000, as expected.
> > > > >
> > > > > just to note that it appears that the param defaultRowFetchSize
> sets the
> > > > > fetch size and signifies that a cursor should be used.  It is
> different
> > > > > from the originally noted defaultFetchSize param, and it appears
> that
> > > > > postgresql doesn't require the useCursorFetch=true or the
> defaultAutoCommit=false.
> > > > >
> > > > > ...snip..
> > > > >   "url": "jdbc:postgresql://myhost.mydomain.com/mydb?
> useCursorFetch=true&defaultAutoCommit=false&loggerLevel=TRACE&loggerFile=/
> tmp/jdbc.log&defaultRowFetchSize=10000",
> > > > > ...snip..
> > > > >
> > > > >
> > > > >
> > > > > On Tue, 2018-08-14 at 07:26 -0400, Reid Thompson wrote:
> > > > > > attempting with the below still fails.
> > > > > > looking at pg_stat_activity it doesn't appear that a cursor is
> being
> > > > > > created.  It's still attempting to pull all the data at once.
> > > > > >
> > > > > > thanks,
> > > > > > reid
> > > > > > On Mon, 2018-08-13 at 14:18 -0400, Reid Thompson wrote:
> > > > > > > Vitalii,
> > > > > > >
> > > > > > > Ok, thanks, I had found that report, but didn't note the
> option related
> > > > > > > to defaultAutoCommit.
> > > > > > > > [1] https://issues.apache.org/jira/browse/DRILL-4177
> > > > > > >
> > > > > > >
> > > > > > > so, something along the lines of
> > > > > > >
> > > > > > > ..snip..
> > > > > > >   "url": "jdbc:postgresql://myhost.mydomain.com/ateb?
> useCursorFetch=true&defaultFetchSize=10000&defaultAutoCommit=false",
> > > > > > > ..snip..
> > > > > > >
> > > > > > >
> > > > > > > thanks,
> > > > > > > reid
> > > > > > >
> > > > > > > On Mon, 2018-08-13 at 20:33 +0300, Vitalii Diravka wrote:
> > > > > > > > [EXTERNAL SOURCE]
> > > > > > > >
> > > > > > > > Hi Reid,
> > > > > > > >
> > > > > > > > Look like your issue is similar to DRILL-4177 [1].
> > > > > > > > It was related to MySQL connection. Looks like the similar
> issue is with PostgreSQL.
> > > > > > > > Looking at the Postgres documentation, the code needs to
> explicitly set the connection autocommit mode
> > > > > > > > to false e.g. conn.setAutoCommit(false) [2]. For data size
> of 10 million plus, this is a must.
> > > > > > > >
> > > > > > > > You could disable "Auto Commit" option as session option [3]
> > > > > > > > or to do it within plugin config URL with the following
> property: defaultAutoCommit=false [4]
> > > > > > > >
> > > > > > > > [1] https://issues.apache.org/jira/browse/DRILL-4177
> > > > > > > > [2] https://jdbc.postgresql.org/documentation/93/query.html#fetchsize-example
> > > > > > > > [3] https://www.postgresql.org/docs/9.3/static/ecpg-sql-set-autocommit.html
> > > > > > > > [4] https://jdbc.postgresql.org/documentation/head/ds-cpds.html
> > > > > > > >
> > > > > > > > Kind regards
> > > > > > > > Vitalii
> > > > > > > >
> > > > > > > >
> > > > > > > > On Mon, Aug 13, 2018 at 3:03 PM Reid Thompson <
> Reid.Thompson@omnicell.com> wrote:
> > > > > > > > > My standalone host is configured with 16GB RAM, 8 cpus.
> Using
> > > > > > > > > drill-embedded (single host standalone), I am attempting
> to pull data
> > > > > > > > > from PostgreSQL tables to parquet files via CTAS. Smaller
> datasets work
> > > > > > > > > fine, but larger data sets fail (for example ~11GB) with
> > > > > > > > > "java.lang.OutOfMemoryError: GC overhead limit exceeded"
> Can someone
> > > > > > > > > advise on how to get past this?
> > > > > > > > >
> > > > > > > > > Is there a way to have drill stream this data from
> PostgreSQL to parquet
> > > > > > > > > files on disk, or does the data set have to be completely
> loaded into
> > > > > > > > > memory before it can be written to disk?  The
> documentation indicates
> > > > > > > > > that drill will spill to disk to avoid memory issues, so I
> had hoped
> > > > > > > > > that it would be straightforward to extract from the DB to
> disk.
> > > > > > > > >
> > > > > > > > > Should I not be attempting this via CTAS?  What are the
> other options?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > thanks,
> > > > > > > > > reid
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > >
> > >
>
>

Re: Requesting guidance. Having trouble generating parquet files from jdbc connection to PostgreSQL. "java.lang.OutOfMemoryError: GC overhead limit exceeded"

Posted by Reid Thompson <Re...@omnicell.com>.
Thanks for your help.

Yes, it solves the out of memory failure, but then I ran into the json
issue.

I'm not sure that the ticket represents what I currently need.  At this
point, I'm attempting to do a simple pull of data from PostgreSQL and
write it to parquet, nothing more.  I think the ticket is about being
able to query content in json(b) columns directly from Drill.

thanks,
Reid

On Wed, 2018-08-15 at 19:27 +0300, Vitalii Diravka wrote:
> [EXTERNAL SOURCE]
>  
> Glad to see that it helps and you've solved the issue.
> 
> I have seen you asked about PostgreSQL JSONB in another topic, 
> but looks like it is not supported by now and should be implemented in context of DRILL-5087
> 
> Kind regards
> Vitalii
> 
> 
> On Wed, Aug 15, 2018 at 3:36 PM Reid Thompson <Re...@omnicell.com> wrote:
> > Vitalii,
> > 
> > yes. Per https://jdbc.postgresql.org/documentation/head/connect.html 
> > (page lists numerous settings available)
> > 
> > defaultRowFetchSize = int
> > 
> > Determine the number of rows fetched in ResultSet by one fetch with trip to the database. Limiting the number of rows are fetch with each trip to the database allow avoids unnecessary memory
> > consumption and as a consequence OutOfMemoryException.
> > 
> > The default is zero, meaning that in ResultSet will be fetch all rows at once. Negative number is not available.
> > 
> > 
> > on another topic,
> > is there any way to have drill properly recognize postgresql's json and
> > jsonb types?  I have tables with both, and am getting this error
> > 
> >  org.apache.drill.common.exceptions.UserException: UNSUPPORTED_OPERATION
> >  ERROR: A column you queried has a data type that is not currently
> >  supported by the JDBC storage plugin. The column's name was actionjson
> >  and its JDBC data type was OTHER.
> > 
> > 
> > thanks,
> > reid
> > 
> > On Wed, 2018-08-15 at 14:44 +0300, Vitalii Diravka wrote:
> > > [EXTERNAL SOURCE]
> > >  
> > > Hi Reid,
> > > 
> > > Am I right, defaultRowFetchSize=10000 property in URL solves that OOM issue?
> > > If so possibly it can be useful to have this information in Drill docs [1].
> > > 
> > > [1] https://drill.apache.org/docs/rdbms-storage-plugin/
> > > 
> > > Kind regards
> > > Vitalii
> > > 
> > > 
> > > On Tue, Aug 14, 2018 at 4:17 PM Reid Thompson <Re...@omnicell.com> wrote:
> > > > using the below parameters in the URL and looking in the defined logfile
> > > > indicates that the fetch size is being set to 10000, as expected. 
> > > > 
> > > > just to note that it appears that the param defaultRowFetchSize sets the
> > > > fetch size and signifies that a cursor should be used.  It is different
> > > > from the originally noted defaultFetchSize param, and it appears that
> > > > postgresql doesn't require the useCursorFetch=true or the defaultAutoCommit=false.
> > > > 
> > > > ...snip..
> > > >   "url": "jdbc:postgresql://myhost.mydomain.com/mydb?useCursorFetch=true&defaultAutoCommit=false&loggerLevel=TRACE&loggerFile=/tmp/jdbc.log&defaultRowFetchSize=10000",
> > > > ...snip..
> > > > 
> > > > 
> > > > 
> > > > On Tue, 2018-08-14 at 07:26 -0400, Reid Thompson wrote:
> > > > > attempting with the below still fails.
> > > > > looking at pg_stat_activity it doesn't appear that a cursor is being
> > > > > created.  It's still attempting to pull all the data at once.
> > > > > 
> > > > > thanks,
> > > > > reid
> > > > > On Mon, 2018-08-13 at 14:18 -0400, Reid Thompson wrote:
> > > > > > Vitalii,
> > > > > > 
> > > > > > Ok, thanks, I had found that report, but didn't note the option related
> > > > > > to defaultAutoCommit. 
> > > > > > > [1] https://issues.apache.org/jira/browse/DRILL-4177
> > > > > > 
> > > > > > 
> > > > > > so, something along the lines of
> > > > > > 
> > > > > > ..snip..
> > > > > >   "url": "jdbc:postgresql://myhost.mydomain.com/ateb?useCursorFetch=true&defaultFetchSize=10000&defaultAutoCommit=false",
> > > > > > ..snip..
> > > > > > 
> > > > > > 
> > > > > > thanks,
> > > > > > reid
> > > > > > 
> > > > > > On Mon, 2018-08-13 at 20:33 +0300, Vitalii Diravka wrote:
> > > > > > > [EXTERNAL SOURCE]
> > > > > > >  
> > > > > > > Hi Reid,
> > > > > > > 
> > > > > > > Look like your issue is similar to DRILL-4177 [1].
> > > > > > > It was related to MySQL connection. Looks like the similar issue is with PostgreSQL.
> > > > > > > Looking at the Postgres documentation, the code needs to explicitly set the connection autocommit mode 
> > > > > > > to false e.g. conn.setAutoCommit(false) [2]. For data size of 10 million plus, this is a must.
> > > > > > > 
> > > > > > > You could disable "Auto Commit" option as session option [3] 
> > > > > > > or to do it within plugin config URL with the following property: defaultAutoCommit=false [4]
> > > > > > > 
> > > > > > > [1] https://issues.apache.org/jira/browse/DRILL-4177
> > > > > > > [2] https://jdbc.postgresql.org/documentation/93/query.html#fetchsize-example
> > > > > > > [3] https://www.postgresql.org/docs/9.3/static/ecpg-sql-set-autocommit.html
> > > > > > > [4] https://jdbc.postgresql.org/documentation/head/ds-cpds.html
> > > > > > > 
> > > > > > > Kind regards
> > > > > > > Vitalii
> > > > > > > 
> > > > > > > 
> > > > > > > On Mon, Aug 13, 2018 at 3:03 PM Reid Thompson <Re...@omnicell.com> wrote:
> > > > > > > > My standalone host is configured with 16GB RAM, 8 cpus.  Using
> > > > > > > > drill-embedded (single host standalone), I am attempting to pull data
> > > > > > > > from PostgreSQL tables to parquet files via CTAS. Smaller datasets work
> > > > > > > > fine, but larger data sets fail (for example ~11GB) with
> > > > > > > > "java.lang.OutOfMemoryError: GC overhead limit exceeded"  Can someone
> > > > > > > > advise on how to get past this?
> > > > > > > > 
> > > > > > > > Is there a way to have drill stream this data from PostgreSQL to parquet
> > > > > > > > files on disk, or does the data set have to be completely loaded into
> > > > > > > > memory before it can be written to disk?  The documentation indicates
> > > > > > > > that drill will spill to disk to avoid memory issues, so I had hoped
> > > > > > > > that it would be straightforward to extract from the DB to disk.
> > > > > > > > 
> > > > > > > > Should I not be attempting this via CTAS?  What are the other options?
> > > > > > > > 
> > > > > > > > 
> > > > > > > > thanks,
> > > > > > > > reid
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > 
> > > > > > 
> > > > > 
> > > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > 
> > 


Re: Requesting guidance. Having trouble generating parquet files from jdbc connection to PostgreSQL. "java.lang.OutOfMemoryError: GC overhead limit exceeded"

Posted by Vitalii Diravka <vi...@gmail.com>.
Glad to see that it helped and that you've solved the issue.

I saw you asked about PostgreSQL JSONB in another topic,
but it looks like it is not supported for now and should be implemented in
the context of DRILL-5087 <https://issues.apache.org/jira/browse/DRILL-5087>



Kind regards
Vitalii


On Wed, Aug 15, 2018 at 3:36 PM Reid Thompson <Re...@omnicell.com>
wrote:

> Vitalii,
>
> yes. Per https://jdbc.postgresql.org/documentation/head/connect.html
> (page lists numerous settings available)
>
> defaultRowFetchSize = int
>
> Determine the number of rows fetched in ResultSet by one fetch with trip
> to the database. Limiting the number of rows are fetch with each trip to
> the database allow avoids unnecessary memory
> consumption and as a consequence OutOfMemoryException.
>
> The default is zero, meaning that in ResultSet will be fetch all rows at
> once. Negative number is not available.
>
>
> on another topic,
> is there any way to have drill properly recognize postgresql's json and
> jsonb types?  I have tables with both, and am getting this error
>
>  org.apache.drill.common.exceptions.UserException: UNSUPPORTED_OPERATION
>  ERROR: A column you queried has a data type that is not currently
>  supported by the JDBC storage plugin. The column's name was actionjson
>  and its JDBC data type was OTHER.
>
>
> thanks,
> reid
>
> On Wed, 2018-08-15 at 14:44 +0300, Vitalii Diravka wrote:
> > [EXTERNAL SOURCE]
> >
> > Hi Reid,
> >
> > Am I right, defaultRowFetchSize=10000 property in URL solves that OOM
> issue?
> > If so possibly it can be useful to have this information in Drill docs
> [1].
> >
> > [1] https://drill.apache.org/docs/rdbms-storage-plugin/
> >
> > Kind regards
> > Vitalii
> >
> >
> > On Tue, Aug 14, 2018 at 4:17 PM Reid Thompson <
> Reid.Thompson@omnicell.com> wrote:
> > > using the below parameters in the URL and looking in the defined
> logfile
> > > indicates that the fetch size is being set to 10000, as expected.
> > >
> > > just to note that it appears that the param defaultRowFetchSize sets
> the
> > > fetch size and signifies that a cursor should be used.  It is different
> > > from the originally noted defaultFetchSize param, and it appears that
> > > postgresql doesn't require the useCursorFetch=true or the
> defaultAutoCommit=false.
> > >
> > > ...snip..
> > >   "url": "jdbc:postgresql://
> myhost.mydomain.com/mydb?useCursorFetch=true&defaultAutoCommit=false&loggerLevel=TRACE&loggerFile=/tmp/jdbc.log&defaultRowFetchSize=10000
> ",
> > > ...snip..
> > >
> > >
> > >
> > > On Tue, 2018-08-14 at 07:26 -0400, Reid Thompson wrote:
> > > > attempting with the below still fails.
> > > > looking at pg_stat_activity it doesn't appear that a cursor is being
> > > > created.  It's still attempting to pull all the data at once.
> > > >
> > > > thanks,
> > > > reid
> > > > On Mon, 2018-08-13 at 14:18 -0400, Reid Thompson wrote:
> > > > > Vitalii,
> > > > >
> > > > > Ok, thanks, I had found that report, but didn't note the option
> related
> > > > > to defaultAutoCommit.
> > > > > > [1] https://issues.apache.org/jira/browse/DRILL-4177
> > > > >
> > > > >
> > > > > so, something along the lines of
> > > > >
> > > > > ..snip..
> > > > >   "url": "jdbc:postgresql://
> myhost.mydomain.com/ateb?useCursorFetch=true&defaultFetchSize=10000&defaultAutoCommit=false
> ",
> > > > > ..snip..
> > > > >
> > > > >
> > > > > thanks,
> > > > > reid
> > > > >
> > > > > On Mon, 2018-08-13 at 20:33 +0300, Vitalii Diravka wrote:
> > > > > > [EXTERNAL SOURCE]
> > > > > >
> > > > > > Hi Reid,
> > > > > >
> > > > > > Look like your issue is similar to DRILL-4177 [1].
> > > > > > It was related to MySQL connection. Looks like the similar issue
> is with PostgreSQL.
> > > > > > Looking at the Postgres documentation, the code needs to
> explicitly set the connection autocommit mode
> > > > > > to false e.g. conn.setAutoCommit(false) [2]. For data size of 10
> million plus, this is a must.
> > > > > >
> > > > > > You could disable "Auto Commit" option as session option [3]
> > > > > > or to do it within plugin config URL with the following
> property: defaultAutoCommit=false [4]
> > > > > >
> > > > > > [1] https://issues.apache.org/jira/browse/DRILL-4177
> > > > > > [2]
> https://jdbc.postgresql.org/documentation/93/query.html#fetchsize-example
> > > > > > [3]
> https://www.postgresql.org/docs/9.3/static/ecpg-sql-set-autocommit.html
> > > > > > [4] https://jdbc.postgresql.org/documentation/head/ds-cpds.html
> > > > > >
> > > > > > Kind regards
> > > > > > Vitalii
> > > > > >
> > > > > >
> > > > > > On Mon, Aug 13, 2018 at 3:03 PM Reid Thompson <
> Reid.Thompson@omnicell.com> wrote:
> > > > > > > My standalone host is configured with 16GB RAM, 8 cpus.  Using
> > > > > > > drill-embedded (single host standalone), I am attempting to
> pull data
> > > > > > > from PostgreSQL tables to parquet files via CTAS. Smaller
> datasets work
> > > > > > > fine, but larger data sets fail (for example ~11GB) with
> > > > > > > "java.lang.OutOfMemoryError: GC overhead limit exceeded"  Can
> someone
> > > > > > > advise on how to get past this?
> > > > > > >
> > > > > > > Is there a way to have drill stream this data from PostgreSQL
> to parquet
> > > > > > > files on disk, or does the data set have to be completely
> loaded into
> > > > > > > memory before it can be written to disk?  The documentation
> indicates
> > > > > > > that drill will spill to disk to avoid memory issues, so I had
> hoped
> > > > > > > that it would be straightforward to extract from the DB to
> disk.
> > > > > > >
> > > > > > > Should I not be attempting this via CTAS?  What are the other
> options?
> > > > > > >
> > > > > > >
> > > > > > > thanks,
> > > > > > > reid
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > >
> > > > >
> > > >
> > > >
> > >
> > >
> > >
> > >
>
>

Re: Requesting guidance. Having trouble generating parquet files from jdbc connection to PostgreSQL. "java.lang.OutOfMemoryError: GC overhead limit exceeded"

Posted by Reid Thompson <Re...@omnicell.com>.
Vitalii,

yes. Per https://jdbc.postgresql.org/documentation/head/connect.html 
(page lists numerous settings available)

defaultRowFetchSize = int

Determines the number of rows fetched in a ResultSet by one fetch, i.e. one trip to the database. Limiting the number of rows fetched with each trip to the database avoids unnecessary memory
consumption and, as a consequence, OutOfMemoryException.

The default is zero, meaning that the ResultSet will fetch all rows at once. Negative numbers are not allowed.
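
For reference, a minimal sketch of setting that property programmatically
(hypothetical connection details; with Drill it simply goes into the
storage plugin URL, as in the config snippets quoted below):

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.util.Properties;

  public class RowFetchSizeSketch {
      public static void main(String[] args) throws Exception {
          Properties props = new Properties();
          props.setProperty("user", "user");
          props.setProperty("password", "password");
          // fetch 10000 rows per round trip instead of the whole result set
          props.setProperty("defaultRowFetchSize", "10000");

          try (Connection conn = DriverManager.getConnection(
                  "jdbc:postgresql://myhost.mydomain.com/mydb", props)) {
              // statements on this connection now default to the configured fetch size
          }
      }
  }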


on another topic,
is there any way to have drill properly recognize postgresql's json and
jsonb types?  I have tables with both, and am getting this error

 org.apache.drill.common.exceptions.UserException: UNSUPPORTED_OPERATION
 ERROR: A column you queried has a data type that is not currently
 supported by the JDBC storage plugin. The column's name was actionjson
 and its JDBC data type was OTHER.


thanks,
reid

On Wed, 2018-08-15 at 14:44 +0300, Vitalii Diravka wrote:
> [EXTERNAL SOURCE]
>  
> Hi Reid,
> 
> Am I right, defaultRowFetchSize=10000 property in URL solves that OOM issue?
> If so possibly it can be useful to have this information in Drill docs [1].
> 
> [1] https://drill.apache.org/docs/rdbms-storage-plugin/
> 
> Kind regards
> Vitalii
> 
> 
> On Tue, Aug 14, 2018 at 4:17 PM Reid Thompson <Re...@omnicell.com> wrote:
> > using the below parameters in the URL and looking in the defined logfile
> > indicates that the fetch size is being set to 10000, as expected. 
> > 
> > just to note that it appears that the param defaultRowFetchSize sets the
> > fetch size and signifies that a cursor should be used.  It is different
> > from the originally noted defaultFetchSize param, and it appears that
> > postgresql doesn't require the useCursorFetch=true or the defaultAutoCommit=false.
> > 
> > ...snip..
> >   "url": "jdbc:postgresql://myhost.mydomain.com/mydb?useCursorFetch=true&defaultAutoCommit=false&loggerLevel=TRACE&loggerFile=/tmp/jdbc.log&defaultRowFetchSize=10000",
> > ...snip..
> > 
> > 
> > 
> > On Tue, 2018-08-14 at 07:26 -0400, Reid Thompson wrote:
> > > attempting with the below still fails.
> > > looking at pg_stat_activity it doesn't appear that a cursor is being
> > > created.  It's still attempting to pull all the data at once.
> > > 
> > > thanks,
> > > reid
> > > On Mon, 2018-08-13 at 14:18 -0400, Reid Thompson wrote:
> > > > Vitalii,
> > > > 
> > > > Ok, thanks, I had found that report, but didn't note the option related
> > > > to defaultAutoCommit. 
> > > > > [1] https://issues.apache.org/jira/browse/DRILL-4177
> > > > 
> > > > 
> > > > so, something along the lines of
> > > > 
> > > > ..snip..
> > > >   "url": "jdbc:postgresql://myhost.mydomain.com/ateb?useCursorFetch=true&defaultFetchSize=10000&defaultAutoCommit=false",
> > > > ..snip..
> > > > 
> > > > 
> > > > thanks,
> > > > reid
> > > > 
> > > > On Mon, 2018-08-13 at 20:33 +0300, Vitalii Diravka wrote:
> > > > > [EXTERNAL SOURCE]
> > > > >  
> > > > > Hi Reid,
> > > > > 
> > > > > Look like your issue is similar to DRILL-4177 [1].
> > > > > It was related to MySQL connection. Looks like the similar issue is with PostgreSQL.
> > > > > Looking at the Postgres documentation, the code needs to explicitly set the connection autocommit mode 
> > > > > to false e.g. conn.setAutoCommit(false) [2]. For data size of 10 million plus, this is a must.
> > > > > 
> > > > > You could disable "Auto Commit" option as session option [3] 
> > > > > or to do it within plugin config URL with the following property: defaultAutoCommit=false [4]
> > > > > 
> > > > > [1] https://issues.apache.org/jira/browse/DRILL-4177
> > > > > [2] https://jdbc.postgresql.org/documentation/93/query.html#fetchsize-example
> > > > > [3] https://www.postgresql.org/docs/9.3/static/ecpg-sql-set-autocommit.html
> > > > > [4] https://jdbc.postgresql.org/documentation/head/ds-cpds.html
> > > > > 
> > > > > Kind regards
> > > > > Vitalii
> > > > > 
> > > > > 
> > > > > On Mon, Aug 13, 2018 at 3:03 PM Reid Thompson <Re...@omnicell.com> wrote:
> > > > > > My standalone host is configured with 16GB RAM, 8 cpus.  Using
> > > > > > drill-embedded (single host standalone), I am attempting to pull data
> > > > > > from PostgreSQL tables to parquet files via CTAS. Smaller datasets work
> > > > > > fine, but larger data sets fail (for example ~11GB) with
> > > > > > "java.lang.OutOfMemoryError: GC overhead limit exceeded"  Can someone
> > > > > > advise on how to get past this?
> > > > > > 
> > > > > > Is there a way to have drill stream this data from PostgreSQL to parquet
> > > > > > files on disk, or does the data set have to be completely loaded into
> > > > > > memory before it can be written to disk?  The documentation indicates
> > > > > > that drill will spill to disk to avoid memory issues, so I had hoped
> > > > > > that it would be straightforward to extract from the DB to disk.
> > > > > > 
> > > > > > Should I not be attempting this via CTAS?  What are the other options?
> > > > > > 
> > > > > > 
> > > > > > thanks,
> > > > > > reid
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > 
> > > > 
> > > 
> > > 
> > 
> > 
> > 
> > 


Re: Requesting guidance. Having trouble generating parquet files from jdbc connection to PostgreSQL. "java.lang.OutOfMemoryError: GC overhead limit exceeded"

Posted by Vitalii Diravka <vi...@gmail.com>.
Hi Reid,

Am I right that the defaultRowFetchSize=10000 property in the URL solves that
OOM issue?
If so, it could be useful to have this information in the Drill docs [1].

[1] https://drill.apache.org/docs/rdbms-storage-plugin/

Kind regards
Vitalii


On Tue, Aug 14, 2018 at 4:17 PM Reid Thompson <Re...@omnicell.com>
wrote:

> using the below parameters in the URL and looking in the defined logfile
> indicates that the fetch size is being set to 10000, as expected.
>
> just to note that it appears that the param defaultRowFetchSize sets the
> fetch size and signifies that a cursor should be used.  It is different
> from the originally noted defaultFetchSize param, and it appears that
> postgresql doesn't require the useCursorFetch=true or the
> defaultAutoCommit=false.
>
> ...snip..
>   "url": "jdbc:postgresql://
> myhost.mydomain.com/mydb?useCursorFetch=true&defaultAutoCommit=false&loggerLevel=TRACE&loggerFile=/tmp/jdbc.log&defaultRowFetchSize=10000
> ",
> ...snip..
>
>
>
> On Tue, 2018-08-14 at 07:26 -0400, Reid Thompson wrote:
> > attempting with the below still fails.
> > looking at pg_stat_activity it doesn't appear that a cursor is being
> > created.  It's still attempting to pull all the data at once.
> >
> > thanks,
> > reid
> > On Mon, 2018-08-13 at 14:18 -0400, Reid Thompson wrote:
> > > Vitalii,
> > >
> > > Ok, thanks, I had found that report, but didn't note the option related
> > > to defaultAutoCommit.
> > > > [1] https://issues.apache.org/jira/browse/DRILL-4177
> > >
> > >
> > > so, something along the lines of
> > >
> > > ..snip..
> > >   "url": "jdbc:postgresql://
> myhost.mydomain.com/ateb?useCursorFetch=true&defaultFetchSize=10000&defaultAutoCommit=false
> ",
> > > ..snip..
> > >
> > >
> > > thanks,
> > > reid
> > >
> > > On Mon, 2018-08-13 at 20:33 +0300, Vitalii Diravka wrote:
> > > > [EXTERNAL SOURCE]
> > > >
> > > > Hi Reid,
> > > >
> > > > Look like your issue is similar to DRILL-4177 [1].
> > > > It was related to MySQL connection. Looks like the similar issue is
> with PostgreSQL.
> > > > Looking at the Postgres documentation, the code needs to explicitly
> set the connection autocommit mode
> > > > to false e.g. conn.setAutoCommit(false) [2]. For data size of 10
> million plus, this is a must.
> > > >
> > > > You could disable "Auto Commit" option as session option [3]
> > > > or to do it within plugin config URL with the following property:
> defaultAutoCommit=false [4]
> > > >
> > > > [1] https://issues.apache.org/jira/browse/DRILL-4177
> > > > [2]
> https://jdbc.postgresql.org/documentation/93/query.html#fetchsize-example
> > > > [3]
> https://www.postgresql.org/docs/9.3/static/ecpg-sql-set-autocommit.html
> > > > [4] https://jdbc.postgresql.org/documentation/head/ds-cpds.html
> > > >
> > > > Kind regards
> > > > Vitalii
> > > >
> > > >
> > > > On Mon, Aug 13, 2018 at 3:03 PM Reid Thompson <
> Reid.Thompson@omnicell.com> wrote:
> > > > > My standalone host is configured with 16GB RAM, 8 cpus.  Using
> > > > > drill-embedded (single host standalone), I am attempting to pull
> data
> > > > > from PostgreSQL tables to parquet files via CTAS. Smaller datasets
> work
> > > > > fine, but larger data sets fail (for example ~11GB) with
> > > > > "java.lang.OutOfMemoryError: GC overhead limit exceeded"  Can
> someone
> > > > > advise on how to get past this?
> > > > >
> > > > > Is there a way to have drill stream this data from PostgreSQL to
> parquet
> > > > > files on disk, or does the data set have to be completely loaded
> into
> > > > > memory before it can be written to disk?  The documentation
> indicates
> > > > > that drill will spill to disk to avoid memory issues, so I had
> hoped
> > > > > that it would be straightforward to extract from the DB to disk.
> > > > >
> > > > > Should I not be attempting this via CTAS?  What are the other
> options?
> > > > >
> > > > >
> > > > > thanks,
> > > > > reid
> > > > >
> > > > >
> > > > >
> > > > >
> > >
> > >
> >
> >
>
>
>
>

Re: Requesting guidance. Having trouble generating parquet files from jdbc connection to PostgreSQL. "java.lang.OutOfMemoryError: GC overhead limit exceeded"

Posted by Reid Thompson <Re...@omnicell.com>.
using the below parameters in the URL and looking in the defined logfile
indicates that the fetch size is being set to 10000, as expected. 

just to note that it appears that the param defaultRowFetchSize sets the
fetch size and signifies that a cursor should be used.  It is different
from the originally noted defaultFetchSize param, and it appears that
postgresql doesn't require the useCursorFetch=true or the defaultAutoCommit=false.

...snip..
  "url": "jdbc:postgresql://myhost.mydomain.com/mydb?useCursorFetch=true&defaultAutoCommit=false&loggerLevel=TRACE&loggerFile=/tmp/jdbc.log&defaultRowFetchSize=10000",
...snip..



On Tue, 2018-08-14 at 07:26 -0400, Reid Thompson wrote:
> attempting with the below still fails.
> looking at pg_stat_activity it doesn't appear that a cursor is being
> created.  It's still attempting to pull all the data at once.
> 
> thanks,
> reid
> On Mon, 2018-08-13 at 14:18 -0400, Reid Thompson wrote:
> > Vitalii,
> > 
> > Ok, thanks, I had found that report, but didn't note the option related
> > to defaultAutoCommit. 
> > > [1] https://issues.apache.org/jira/browse/DRILL-4177
> > 
> > 
> > so, something along the lines of
> > 
> > ..snip..
> >   "url": "jdbc:postgresql://myhost.mydomain.com/ateb?useCursorFetch=true&defaultFetchSize=10000&defaultAutoCommit=false",
> > ..snip..
> > 
> > 
> > thanks,
> > reid
> > 
> > On Mon, 2018-08-13 at 20:33 +0300, Vitalii Diravka wrote:
> > > [EXTERNAL SOURCE]
> > >  
> > > Hi Reid,
> > > 
> > > Look like your issue is similar to DRILL-4177 [1].
> > > It was related to MySQL connection. Looks like the similar issue is with PostgreSQL.
> > > Looking at the Postgres documentation, the code needs to explicitly set the connection autocommit mode 
> > > to false e.g. conn.setAutoCommit(false) [2]. For data size of 10 million plus, this is a must.
> > > 
> > > You could disable "Auto Commit" option as session option [3] 
> > > or to do it within plugin config URL with the following property: defaultAutoCommit=false [4]
> > > 
> > > [1] https://issues.apache.org/jira/browse/DRILL-4177
> > > [2] https://jdbc.postgresql.org/documentation/93/query.html#fetchsize-example
> > > [3] https://www.postgresql.org/docs/9.3/static/ecpg-sql-set-autocommit.html
> > > [4] https://jdbc.postgresql.org/documentation/head/ds-cpds.html
> > > 
> > > Kind regards
> > > Vitalii
> > > 
> > > 
> > > On Mon, Aug 13, 2018 at 3:03 PM Reid Thompson <Re...@omnicell.com> wrote:
> > > > My standalone host is configured with 16GB RAM, 8 cpus.  Using
> > > > drill-embedded (single host standalone), I am attempting to pull data
> > > > from PostgreSQL tables to parquet files via CTAS. Smaller datasets work
> > > > fine, but larger data sets fail (for example ~11GB) with
> > > > "java.lang.OutOfMemoryError: GC overhead limit exceeded"  Can someone
> > > > advise on how to get past this?
> > > > 
> > > > Is there a way to have drill stream this data from PostgreSQL to parquet
> > > > files on disk, or does the data set have to be completely loaded into
> > > > memory before it can be written to disk?  The documentation indicates
> > > > that drill will spill to disk to avoid memory issues, so I had hoped
> > > > that it would be straightforward to extract from the DB to disk.
> > > > 
> > > > Should I not be attempting this via CTAS?  What are the other options?
> > > > 
> > > > 
> > > > thanks,
> > > > reid
> > > > 
> > > > 
> > > > 
> > > > 
> > 
> > 
> 
> 




Re: Requesting guidance. Having trouble generating parquet files from jdbc connection to PostgreSQL. "java.lang.OutOfMemoryError: GC overhead limit exceeded"

Posted by Reid Thompson <Re...@omnicell.com>.
attempting with the below still fails.
looking at pg_stat_activity it doesn't appear that a cursor is being
created.  It's still attempting to pull all the data at once.

thanks,
reid
On Mon, 2018-08-13 at 14:18 -0400, Reid Thompson wrote:
> Vitalii,
> 
> Ok, thanks, I had found that report, but didn't note the option related
> to defaultAutoCommit. 
> > [1] https://issues.apache.org/jira/browse/DRILL-4177
> 
> 
> so, something along the lines of
> 
> ..snip..
>   "url": "jdbc:postgresql://myhost.mydomain.com/ateb?useCursorFetch=true&defaultFetchSize=10000&defaultAutoCommit=false",
> ..snip..
> 
> 
> thanks,
> reid
> 
> On Mon, 2018-08-13 at 20:33 +0300, Vitalii Diravka wrote:
> > [EXTERNAL SOURCE]
> >  
> > Hi Reid,
> > 
> > Look like your issue is similar to DRILL-4177 [1].
> > It was related to MySQL connection. Looks like the similar issue is with PostgreSQL.
> > Looking at the Postgres documentation, the code needs to explicitly set the connection autocommit mode 
> > to false e.g. conn.setAutoCommit(false) [2]. For data size of 10 million plus, this is a must.
> > 
> > You could disable "Auto Commit" option as session option [3] 
> > or to do it within plugin config URL with the following property: defaultAutoCommit=false [4]
> > 
> > [1] https://issues.apache.org/jira/browse/DRILL-4177
> > [2] https://jdbc.postgresql.org/documentation/93/query.html#fetchsize-example
> > [3] https://www.postgresql.org/docs/9.3/static/ecpg-sql-set-autocommit.html
> > [4] https://jdbc.postgresql.org/documentation/head/ds-cpds.html
> > 
> > Kind regards
> > Vitalii
> > 
> > 
> > On Mon, Aug 13, 2018 at 3:03 PM Reid Thompson <Re...@omnicell.com> wrote:
> > > My standalone host is configured with 16GB RAM, 8 cpus.  Using
> > > drill-embedded (single host standalone), I am attempting to pull data
> > > from PostgreSQL tables to parquet files via CTAS. Smaller datasets work
> > > fine, but larger data sets fail (for example ~11GB) with
> > > "java.lang.OutOfMemoryError: GC overhead limit exceeded"  Can someone
> > > advise on how to get past this?
> > > 
> > > Is there a way to have drill stream this data from PostgreSQL to parquet
> > > files on disk, or does the data set have to be completely loaded into
> > > memory before it can be written to disk?  The documentation indicates
> > > that drill will spill to disk to avoid memory issues, so I had hoped
> > > that it would be straightforward to extract from the DB to disk.
> > > 
> > > Should I not be attempting this via CTAS?  What are the other options?
> > > 
> > > 
> > > thanks,
> > > reid
> > > 
> > > 
> > > 
> > > 
> 
> 


Re: Requesting guidance. Having trouble generating parquet files from jdbc connection to PostgreSQL. "java.lang.OutOfMemoryError: GC overhead limit exceeded"

Posted by Reid Thompson <Re...@omnicell.com>.
Vitalii,

Ok, thanks, I had found that report, but didn't note the option related
to defaultAutoCommit. 
> [1] https://issues.apache.org/jira/browse/DRILL-4177


so, something along the lines of

..snip..
  "url": "jdbc:postgresql://myhost.mydomain.com/ateb?useCursorFetch=true&defaultFetchSize=10000&defaultAutoCommit=false",
..snip..


thanks,
reid

On Mon, 2018-08-13 at 20:33 +0300, Vitalii Diravka wrote:
> [EXTERNAL SOURCE]
>  
> Hi Reid,
> 
> Look like your issue is similar to DRILL-4177 [1].
> It was related to MySQL connection. Looks like the similar issue is with PostgreSQL.
> Looking at the Postgres documentation, the code needs to explicitly set the connection autocommit mode 
> to false e.g. conn.setAutoCommit(false) [2]. For data size of 10 million plus, this is a must.
> 
> You could disable "Auto Commit" option as session option [3] 
> or to do it within plugin config URL with the following property: defaultAutoCommit=false [4]
> 
> [1] https://issues.apache.org/jira/browse/DRILL-4177
> [2] https://jdbc.postgresql.org/documentation/93/query.html#fetchsize-example
> [3] https://www.postgresql.org/docs/9.3/static/ecpg-sql-set-autocommit.html
> [4] https://jdbc.postgresql.org/documentation/head/ds-cpds.html
> 
> Kind regards
> Vitalii
> 
> 
> On Mon, Aug 13, 2018 at 3:03 PM Reid Thompson <Re...@omnicell.com> wrote:
> > My standalone host is configured with 16GB RAM, 8 cpus.  Using
> > drill-embedded (single host standalone), I am attempting to pull data
> > from PostgreSQL tables to parquet files via CTAS. Smaller datasets work
> > fine, but larger data sets fail (for example ~11GB) with
> > "java.lang.OutOfMemoryError: GC overhead limit exceeded"  Can someone
> > advise on how to get past this?
> > 
> > Is there a way to have drill stream this data from PostgreSQL to parquet
> > files on disk, or does the data set have to be completely loaded into
> > memory before it can be written to disk?  The documentation indicates
> > that drill will spill to disk to avoid memory issues, so I had hoped
> > that it would be straightforward to extract from the DB to disk.
> > 
> > Should I not be attempting this via CTAS?  What are the other options?
> > 
> > 
> > thanks,
> > reid
> > 
> > 
> > 
> >