Posted to dev@tinkerpop.apache.org by "Johannes Carlsen (Jira)" <ji...@apache.org> on 2020/08/04 21:24:00 UTC

[jira] [Commented] (TINKERPOP-2369) Connections in ConnectionPool are not replaced in background when underlying channel is closed

    [ https://issues.apache.org/jira/browse/TINKERPOP-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171130#comment-17171130 ] 

Johannes Carlsen commented on TINKERPOP-2369:
---------------------------------------------

Hi [~divijvaidya],

Thanks for looking into this!

Regarding the following step in the sequence you mentioned:

> 2. When the channel is closed, `channelInactive` method for all the handlers is called including for `GremlinResponseHandler`. In GremlinResponseHandler, on Channel Inactive, [we mark all existing requests on that channel as completed exceptionally|https://github.com/apache/tinkerpop/blob/HEAD/gremlin-driver/src/main/java/org/apache/tinkerpop/gremlin/driver/Handler.java#L207].

This will only mark _pending_ requests on that channel as completed exceptionally. If there are no pending requests on that channel, the close frame from the server will not trigger the exceptional callback for `Connection`. As a result, Step #3 only occurs when the driver attempts to serve the next incoming request and fails to write to the (already closed) channel, which is what finally triggers cleanup of the dead connection. This is not the behavior we would expect: the connection should be removed from the pool as soon as its underlying channel is closed, regardless of whether any requests are pending on it.
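To make the gap concrete, here is a minimal plain-Java simulation (class and field names are hypothetical, not the driver's actual types): a channel-inactive callback that only fails *pending* requests does nothing observable when the channel closes while idle.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the driver's per-channel pending-request map;
// names here are illustrative only.
class ChannelState {
    final Map<Long, CompletableFuture<String>> pending = new ConcurrentHashMap<>();
    boolean removedFromPool = false;

    // Mirrors what a channelInactive handler does: fail *pending* requests only.
    void onChannelInactive() {
        pending.values().forEach(f ->
                f.completeExceptionally(new IllegalStateException("channel closed")));
        pending.clear();
        // Nothing here removes the connection from the pool, so an idle
        // channel (no pending requests) closes without the pool noticing.
    }
}

public class IdleCloseDemo {
    public static void main(String[] args) {
        // Busy channel: the pending request is failed, so callers find out.
        ChannelState busy = new ChannelState();
        CompletableFuture<String> req = new CompletableFuture<>();
        busy.pending.put(1L, req);
        busy.onChannelInactive();
        System.out.println("pending request failed: " + req.isCompletedExceptionally());

        // Idle channel: no future to fail, no callback fires, pool unchanged.
        ChannelState idle = new ChannelState();
        idle.onChannelInactive();
        System.out.println("idle close noticed by pool: " + idle.removedFromPool);
    }
}
```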

As for this:

> One change that we could make is to keep trying out different connections first from the same connection pool and then from separate hosts on failing to write to a connection.

This approach wouldn't really fix the issue we are experiencing, since we are working with a single host and a single connection in the pool. We implemented retry logic as a workaround, but the delay between retries has to be > 1 second because that is how long it takes the driver to reconnect to the host by creating a new connection, which is not ideal. We would expect the driver to replace a connection automatically when its underlying channel is closed, rather than waiting until an actual Gremlin query arrives and fails because the only connection in the only host's pool is dead.
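The behavior we would expect can be sketched in plain Java (hypothetical names, not the driver's API): register a close callback per connection that removes it from the pool and creates a replacement immediately. In the real driver this would hook Netty's channel close future rather than a Runnable.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical minimal connection with a close callback; the real driver
// would hook Netty's channel.closeFuture() instead of this Runnable.
class Conn {
    private Runnable onClose = () -> {};
    void setOnClose(Runnable r) { onClose = r; }
    void close() { onClose.run(); }   // e.g. the server closes the websocket
}

// Pool that replaces a connection in the background the moment it closes,
// instead of waiting for the next query to fail a write on the dead channel.
class ProactivePool {
    final Queue<Conn> conns = new ConcurrentLinkedQueue<>();

    Conn open() {
        Conn c = new Conn();
        c.setOnClose(() -> {
            conns.remove(c);
            open();   // replacement is created immediately, not on next use
        });
        conns.add(c);
        return c;
    }
}

public class ProactiveReplaceDemo {
    public static void main(String[] args) {
        ProactivePool pool = new ProactivePool();
        Conn c = pool.open();
        c.close();   // server closes the channel while the pool is idle
        System.out.println("pool size after close: " + pool.conns.size());
        System.out.println("dead conn still pooled: " + pool.conns.contains(c));
    }
}
```

With this shape, a query arriving after the 36-hour credential expiry would find a fresh connection already in the pool instead of hitting a dead channel.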

 

[~cvilladiegoa] have you enabled keep-alive? If keep-alive is enabled, connections shouldn't go idle. There was also a bug, fixed in v3.4.6 of the driver, that prevented keep-alive from working properly, so you might want to make sure you are using v3.4.6 or later: https://issues.apache.org/jira/browse/TINKERPOP-2266.
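For reference, keep-alive is configured on the driver's Cluster builder; the endpoint and interval below are placeholders, not recommended values:

```java
import org.apache.tinkerpop.gremlin.driver.Cluster;

// Keep-alive configuration sketch; host, port, and interval are placeholders.
Cluster cluster = Cluster.build()
        .addContactPoint("your-neptune-endpoint")   // placeholder
        .port(8182)
        .enableSsl(true)
        .keepAliveInterval(180_000)   // ms of idle time before a keep-alive is sent
        .create();
```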

 

> Connections in ConnectionPool are not replaced in background when underlying channel is closed
> ----------------------------------------------------------------------------------------------
>
>                 Key: TINKERPOP-2369
>                 URL: https://issues.apache.org/jira/browse/TINKERPOP-2369
>             Project: TinkerPop
>          Issue Type: Bug
>          Components: driver
>    Affects Versions: 3.4.1
>            Reporter: Johannes Carlsen
>            Priority: Major
>
> Hi Tinkerpop team!
>  
> We are using the Gremlin Java Driver to connect to an Amazon Neptune cluster. We are using the IAM authentication feature provided by Neptune, which means that individual websocket connections are closed by the server every 36 hours, when their credentials expire. The current implementation of the driver does not handle this situation well, as the Connection whose channel has been closed by the server remains in the ConnectionPool. The connection is only reported as dead and replaced when it is later chosen by the LoadBalancingStrategy to serve a client request, which inevitably fails when the connection attempts to write to the closed channel.
> A fix for this bug would cause the connection pool to be automatically refreshed in the background, either by the keep-alive mechanism, which should replace a connection if a keep-alive request fails, or by adding a listener on the underlying channel for the close frame, which would replace the connection. Without a fix, the only way to recover from a stale connection is to retry the request at the cluster level, which allows the request to be directed to a different connection.
> I noticed a PR out for the .NET client to fix this behavior: https://github.com/apache/tinkerpop/pull/1279. We are hoping for something similar in the Gremlin Java Driver.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)