Posted to dev@tinkerpop.apache.org by "Johannes Carlsen (Jira)" <ji...@apache.org> on 2020/05/15 16:31:00 UTC
[jira] [Comment Edited] (TINKERPOP-2369) Connections in ConnectionPool are not replaced in background when underlying channel is closed

    [ https://issues.apache.org/jira/browse/TINKERPOP-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108434#comment-17108434 ] 

Johannes Carlsen edited comment on TINKERPOP-2369 at 5/15/20, 4:30 PM:
-----------------------------------------------------------------------

I did some digging and wasn't able to find a Jira issue. Were you able to find one?

If the listener approach is not possible for some reason, would the keep-alive approach be tenable as a second-best solution?

Also, I'd like your opinion, if possible, on the following two workarounds for this issue that we are considering:
 # Add retry logic for *ClosedChannelException* and *RemoteConnectionException*.
 # Re-create the driver *Cluster* every 36 hours so that all the connections are re-created.
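
A minimal sketch of workaround #1, under stated assumptions: the class *StaleConnectionRetry* and its methods are hypothetical helpers, not part of the Gremlin driver, and only the JDK's *ClosedChannelException* is checked here so the example stays self-contained. The cause chain is walked because the channel failure may arrive wrapped in another exception; the caller-supplied predicate could be extended to match the driver's *RemoteConnectionException* as well.

```java
import java.nio.channels.ClosedChannelException;
import java.util.concurrent.Callable;
import java.util.function.Predicate;

// Hypothetical helper, not part of the Gremlin driver.
final class StaleConnectionRetry {

    // True if the throwable, or anything in its cause chain, is of the given type.
    static boolean causedBy(Throwable t, Class<? extends Throwable> type) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (type.isInstance(c)) {
                return true;
            }
        }
        return false;
    }

    // Runs the request up to maxAttempts times. A failure is retried only when
    // the predicate accepts it; any other failure is rethrown immediately.
    static <T> T callWithRetry(Callable<T> request,
                               int maxAttempts,
                               Predicate<Throwable> retryable) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return request.call();
            } catch (Exception e) {
                if (!retryable.test(e)) {
                    throw e;
                }
                last = e; // retryable: loop again, which lets another connection be picked
            }
        }
        throw last; // every attempt failed with a retryable error
    }
}
```

A call site might look like `callWithRetry(() -> g.V().limit(1).toList(), 3, e -> causedBy(e, ClosedChannelException.class))`, where `g` is an existing traversal source. Retrying is probably only safe for requests that are idempotent, since a write could have reached the server before the failure surfaced.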

The second option is better for our use case, since we want to avoid retries if possible, but it seems more "hacky" to implement. It involves subclassing *GraphTraversalSource* to create some sort of *SafeGraphTraversalSource* which, instead of using its own *RemoteConnection* for execution (which could have been closed since the *GraphTraversalSource* was initialized), holds a reference to a *RemoteConnectionProvider* that is guaranteed to return the active connection. We then have to create a custom *TraversalStrategy* within the *SafeGraphTraversalSource* that provides a custom *SafeRemoteStep*. That step is initialized with the reference to the *RemoteConnectionProvider* and submits the query using the *RemoteConnection* the provider returns at execution time, while also blocking the *RemoteConnectionProvider* from refreshing the underlying *Cluster* until the request has finished. That seems like a lot of complexity for a workaround.
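
The "block the refresh until the request has finished" part of this design can be sketched with a read-write lock, independently of any TinkerPop types. This is an illustration under assumptions, not the implementation described above: *RefreshableResource* is a hypothetical class, the resource type `C` stands in for whatever is being re-created (for example a *Cluster* plus its connection), and requests are assumed to complete synchronously inside `withCurrent`.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical holder, not a TinkerPop class.
final class RefreshableResource<C extends AutoCloseable> {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final Supplier<C> factory;
    private C current;

    RefreshableResource(Supplier<C> factory) {
        this.factory = factory;
        this.current = factory.get();
    }

    // Runs a request against the current resource. Many requests may hold the
    // read lock at once; refresh() cannot proceed until all of them return.
    <T> T withCurrent(Function<C, T> request) {
        lock.readLock().lock();
        try {
            return request.apply(current);
        } finally {
            lock.readLock().unlock();
        }
    }

    // Swaps in a freshly created resource and closes the old one. The write
    // lock waits for in-flight requests and holds back new ones during the swap.
    void refresh() throws Exception {
        lock.writeLock().lock();
        try {
            C old = current;
            current = factory.get(); // if this throws, the old resource stays in place
            old.close();
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

`refresh()` could be driven by a `ScheduledExecutorService` on whatever interval fits the credential lifetime. Requests stall briefly while the swap is in progress, which is the trade-off for never submitting on a resource that is about to be closed.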

 

Until the driver is able to replace stale connections in the background, do you know of a better way to go about #2 that doesn't involve so much customization of the *GraphTraversalSource* object? No worries if not, just curious. Do clients usually just implement retry logic to get around stale connection issues, since the driver has no way of knowing whether the server has closed a connection until a request fails?



> Connections in ConnectionPool are not replaced in background when underlying channel is closed
> ----------------------------------------------------------------------------------------------
>
>                 Key: TINKERPOP-2369
>                 URL: https://issues.apache.org/jira/browse/TINKERPOP-2369
>             Project: TinkerPop
>          Issue Type: Bug
>          Components: driver
>    Affects Versions: 3.4.1
>            Reporter: Johannes Carlsen
>            Priority: Major
>
> Hi Tinkerpop team!
>  
> We are using the Gremlin Java Driver to connect to an Amazon Neptune cluster. We are using the IAM authentication feature provided by Neptune, which means that individual websocket connections are closed by the server every 36 hours, when their credentials expire. The current implementation of the driver does not handle this situation well, as the Connection whose channel has been closed by the server remains in the ConnectionPool. The connection is only reported as dead and replaced when it is later chosen by the LoadBalancingStrategy to serve a client request, which inevitably fails when the connection attempts to write to the closed channel.
> A fix for this bug would cause the connection pool to be automatically refreshed in the background by either the keep-alive mechanism, which should replace a connection if a keep-alive request fails, or by adding a listener for the close frame being sent to the underlying channel to replace the connection. Without a fix, the only way to recover from a stale connection is to retry the request at the cluster level, which will allow the request to be directed to a different connection.
> I noticed a PR out for the .NET client to fix this behavior: https://github.com/apache/tinkerpop/pull/1279. We are hoping for something similar in the Gremlin Java Driver.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)