Posted to dev@beam.apache.org by Jonathan Perron <jo...@lumapps.com> on 2019/01/14 08:51:00 UTC

Connection leaks with PostgreSQL instance

Hello !

My question may be mainly GCP-oriented, so I apologize if it is not
fully relevant to the Beam community.

We have a streaming pipeline running on Dataflow which writes data to a
PostgreSQL instance hosted on Cloud SQL. This database suffers from
connection leak spikes on a regular basis:

<ofbkcnmdfbgcoooc.png>

The connections are kept alive until the pipeline is canceled or drained:

<gngklddbhnckgpni.png>

We are writing to the database with:

- individual DoFns where we open and close the connection using the
standard JDBC try/catch (SQLException)/finally pattern (a simplified
sketch follows this list);

- a Pipeline.apply(JdbcIO.<SessionData>write()) operation.
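
For concreteness, the first pattern looks roughly like this (a
simplified sketch only; the JDBC URL, credentials, table and column
names are placeholders, and SessionData is our own type):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import org.apache.beam.sdk.transforms.DoFn;

    public class WriteSessionFn extends DoFn<SessionData, Void> {
      @ProcessElement
      public void processElement(ProcessContext c) {
        Connection conn = null;
        try {
          // one connection per element; placeholder URL and credentials
          conn = DriverManager.getConnection(
              "jdbc:postgresql://127.0.0.1:5432/mydb", "user", "password");
          PreparedStatement stmt =
              conn.prepareStatement("INSERT INTO sessions (payload) VALUES (?)");
          stmt.setString(1, c.element().toString());
          stmt.executeUpdate();
          stmt.close();
        } catch (SQLException ex) {
          // rethrow so the runner retries the bundle
          throw new RuntimeException(ex);
        } finally {
          if (conn != null) {
            try {
              conn.close();
            } catch (SQLException ignored) {
              // nothing more we can do here
            }
          }
        }
      }
    }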

I observed that these spikes happen most of the time after I/O errors
with the database. Has anyone observed the same situation?

I have several questions/observations; please correct me if I am wrong
(I do not come from the Java world, so some may seem pretty trivial):

- Does the apply method handle SQLException or I/O errors?

- Would the use of a connection pool prevent such behaviour? If so,
how would one implement it so that all workers can use it? Could it be
implemented with JDBC connection pooling?

I am worried about serialization if one were to pass a Connection
object as an argument to a DoFn.
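
To make these last two points concrete, here is the kind of pattern I
have in mind: the pool is marked transient so it is never serialized
with the DoFn, and is created in @Setup. A sketch only, assuming
commons-dbcp2 is available; the URL, credentials and pool size are
placeholders:

    import java.sql.Connection;
    import java.sql.SQLException;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.commons.dbcp2.BasicDataSource;

    public class PooledWriteFn extends DoFn<SessionData, Void> {
      // transient: rebuilt on each worker by @Setup, never serialized
      private transient BasicDataSource pool;

      @Setup
      public void setup() {
        pool = new BasicDataSource();
        pool.setUrl("jdbc:postgresql://127.0.0.1:5432/mydb"); // placeholder
        pool.setUsername("user");
        pool.setPassword("password");
        pool.setMaxTotal(8); // cap the connections held per worker
      }

      @ProcessElement
      public void processElement(ProcessContext c) throws SQLException {
        // try-with-resources returns the connection to the pool
        // even when the write throws
        try (Connection conn = pool.getConnection()) {
          // ... prepare and execute the INSERT for c.element() ...
        }
      }

      @Teardown
      public void teardown() throws SQLException {
        if (pool != null) {
          pool.close(); // closes the pooled connections
        }
      }
    }

Is something along these lines the recommended approach?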

Thank you in advance for your comments and reactions.

Best regards,

Jonathan


Re: Connection leaks with PostgreSQL instance

Posted by Jonathan Perron <jo...@lumapps.com>.
Hi Kenneth,

Thank you for your reply. I found out that the leak was somehow coming
from my individual DoFns. I replaced all the per-DoFn connections with
a connection pool and I haven't seen connection leaks since. I will
keep monitoring the pipeline state, and if I see new leaks I will
investigate and provide more details on the JIRA board.
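
For the record, what I ended up with is roughly one pool per worker
JVM, shared by all DoFn instances through a static holder (a sketch;
the class name, URL, credentials and pool size are placeholders):

    import org.apache.commons.dbcp2.BasicDataSource;

    public final class WorkerConnectionPool {
      // static: one pool per worker JVM; never serialized with the DoFns
      private static BasicDataSource pool;

      public static synchronized BasicDataSource get() {
        if (pool == null) {
          pool = new BasicDataSource();
          pool.setUrl("jdbc:postgresql://127.0.0.1:5432/mydb"); // placeholder
          pool.setUsername("user");
          pool.setPassword("password");
          pool.setMaxTotal(8); // cap the connections held per worker
        }
        return pool;
      }

      private WorkerConnectionPool() {}
    }

Each DoFn then borrows with
try (Connection conn = WorkerConnectionPool.get().getConnection()) { ... },
so the connection is returned to the pool even when an exception is thrown.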

Best regards,

Jonathan


Re: Connection leaks with PostgreSQL instance

Posted by Jonathan Perron <jo...@lumapps.com>.
Dear all,

It happened again on Friday morning. The monitoring graph shows a
baseline in the connection count from the 16th to the 18th.

Looking at pg_stat_activity, all connections are in use, even when the
pipeline is not running at 100 % (my use case is processing data from a
platform which is not used much during the weekend).
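
For anyone wanting to check the same thing, I inspected the open
connections with a small JDBC query along these lines (a sketch; point
the URL and credentials at your own instance):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ListOpenConnections {
      public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:postgresql://127.0.0.1:5432/postgres", "user", "password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT pid, state, application_name, query_start "
                     + "FROM pg_stat_activity")) {
          while ( {
            // one row per server-side connection
            System.out.printf("%d\t%s\t%s\t%s%n",
                rs.getInt("pid"), rs.getString("state"),
                rs.getString("application_name"),
                rs.getTimestamp("query_start"));
          }
        }
      }
    }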

I opened a ticket on the project JIRA:
https://issues.apache.org/jira/browse/BEAM-6475. I added as much
information as I could gather, but let me know if anything else is required.

Best regards,

Jonathan


Re: Connection leaks with PostgreSQL instance

Posted by Kenneth Knowles <kl...@google.com>.
My mistake - using @Teardown in this way is a good approach. It may
sometimes not be executed, but as Reuven says, that means the process died.

Kenn

On Thu, Jan 17, 2019 at 9:31 AM Jean-Baptiste Onofré <jb...@nanthrax.net>
wrote:

> Hi,
>
> I don't think we have a connection leak in normal behavior.
>
> The actual SQL statement is executed in @FinishBundle, where the
> connection is closed.
>
> The processElement just adds records to the batch.
>
> Does this mean that an Exception occurs during the batch addition?
>
> Regards
> JB

Re: Connection leaks with PostgreSQL instance

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Hi,

I don't think we have a connection leak in normal behavior.

The actual SQL statement is executed in @FinishBundle, where the
connection is closed.

The processElement just adds records to the batch.

Does this mean that an Exception occurs during the batch addition?
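
To make the flow concrete, the pattern is essentially the following (a
simplified sketch, not the exact JdbcIO code; the URL and statement are
placeholders):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.commons.dbcp2.BasicDataSource;

    class BatchedWriteFn extends DoFn<String, Void> {
      private transient BasicDataSource dataSource;
      private transient Connection connection;
      private transient PreparedStatement statement;

      @Setup
      public void setup() {
        dataSource = new BasicDataSource();
        dataSource.setUrl("jdbc:postgresql://127.0.0.1:5432/mydb"); // placeholder
      }

      @StartBundle
      public void startBundle() throws SQLException {
        connection = dataSource.getConnection();
        connection.setAutoCommit(false);
        statement = connection.prepareStatement("INSERT INTO t (v) VALUES (?)");
      }

      @ProcessElement
      public void processElement(ProcessContext c) throws SQLException {
        // processElement only adds the record to the current batch
        statement.setString(1, c.element());
        statement.addBatch();
      }

      @FinishBundle
      public void finishBundle() throws SQLException {
        // the SQL is actually executed and the connection closed here;
        // an exception thrown before this point leaves the connection open
        statement.executeBatch();
        connection.commit();
        statement.close();
        connection.close();
      }

      @Teardown
      public void teardown() throws SQLException {
        if (dataSource != null) {
          dataSource.close();
        }
      }
    }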

Regards
JB


-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: Connection leaks with PostgreSQL instance

Posted by Reuven Lax <re...@google.com>.
But I think it will be called unless the process crashes.

On Thu, Jan 17, 2019, 7:05 AM Ismaël Mejía <iemejia@gmail.com> wrote:

> Couldn't this be related also to the fact that @Teardown is
> best-effort in Dataflow?

Re: Connection leaks with PostgreSQL instance

Posted by Ismaël Mejía <ie...@gmail.com>.
Couldn't this also be related to the fact that @Teardown is
best-effort in Dataflow?


Re: Connection leaks with PostgreSQL instance

Posted by Alexey Romanenko <ar...@gmail.com>.
Kenn,

I'm not sure that we have a connection leak in JdbcIO, since new
connections are obtained from an instance of javax.sql.DataSource
(created in @Setup), which is org.apache.commons.dbcp2.BasicDataSource
by default. BasicDataSource uses a connection pool and closes all idle
connections in "close()".

In turn, JdbcIO calls DataSource.close() in @Teardown, so all idle
connections should be closed and released there in case of failures.
Though, potentially, some connections that had been handed out to
clients and were not properly returned to the pool could be leaked…
Anyway, I think it could be a good idea to call "connection.close()"
(returning the connection to the pool) explicitly in case any exception
happens during bundle processing.
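
Concretely, I mean something like this for the bundle finalization, so
the connection always goes back to the pool even when the batch fails
(a sketch, not the current JdbcIO code; connection and statement stand
for the DoFn's bundle-scoped fields):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    final class SafeFinishBundle {
      static void executeAndRelease(Connection connection,
          PreparedStatement statement) throws SQLException {
        try {
          statement.executeBatch();
          connection.commit();
        } finally {
          // always return the connection to the pool, even on failure
          try {
            statement.close();
          } catch (SQLException ignored) {
            // keep the original exception, if any, from the try block
          }
          connection.close();
        }
      }

      private SafeFinishBundle() {}
    }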

Probably JB may provide more details as original author of JdbcIO.



Re: Connection leaks with PostgreSQL instance

Posted by Kenneth Knowles <ke...@apache.org>.
Hi Jonathan,

JdbcIO.write() just invokes this DoFn:
https://github.com/apache/beam/blob/master/sdks/java/io/jdbc/src/main/java/org/apache/beam/sdk/io/jdbc/JdbcIO.java#L765

It establishes a connection in @StartBundle and then in @FinishBundle it
commits a batch and closes the connection. If an error happens
in @StartBundle or @ProcessElement there will be a retry with a fresh
instance of the DoFn, which will establish a new connection. It looks like
neither @StartBundle nor @ProcessElement closes the connection, so I'm
guessing that the old connection sticks around because the worker process
was not terminated. So the Beam runner and Dataflow service are working as
intended and this is an issue with JdbcIO, unless I've made a mistake in my
reading or analysis.
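
In code terms, the failure mode I am describing looks like this (a
sketch of my reading, not the actual JdbcIO source):

    import java.sql.Connection;
    import java.sql.SQLException;
    import javax.sql.DataSource;
    import org.apache.beam.sdk.transforms.DoFn;

    class LeakSketchFn extends DoFn<String, Void> {
      private transient DataSource dataSource; // assume created in @Setup
      private transient Connection connection;

      @StartBundle
      public void startBundle() throws SQLException {
        // a fresh connection per bundle
        connection = dataSource.getConnection();
        connection.setAutoCommit(false);
      }

      @ProcessElement
      public void processElement(ProcessContext c) throws SQLException {
        // if this throws, the runner retries with a *fresh* DoFn instance;
        // nothing ever closes this instance's connection
        addToBatch(c.element());
      }

      @FinishBundle
      public void finishBundle() throws SQLException {
        // commit and close happen only on the happy path
        connection.commit();
        connection.close();
      }

      private void addToBatch(String element) {
        // placeholder for the batched INSERT
      }
    }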

Would you mind reporting these details to
https://issues.apache.org/jira/projects/BEAM/ ?

Kenn
