You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@tinkerpop.apache.org by "Yigit Kiran (JIRA)" <ji...@apache.org> on 2019/02/27 18:44:00 UTC

[jira] [Comment Edited] (TINKERPOP-2169) Responses exceeding maxContentLength cause subsequent queries to hang

    [ https://issues.apache.org/jira/browse/TINKERPOP-2169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16779633#comment-16779633 ] 

Yigit Kiran edited comment on TINKERPOP-2169 at 2/27/19 6:43 PM:
-----------------------------------------------------------------

Thanks for the quick response, Stephen. Yes, this is only a problem for the client-side maxContentLength.  There might be more generic solutions than the suggestions above. More investigation on this issue lead to these observations;

1) Connection-borrowing logic: Connection class has a borrow logic [1], where borrowed field gets incremented by reuse. When Connection gets replaced, the definitelyDestroyConnection method expects there are no borrowers left to delete the connection [2], but borrowed value never gets decremented in case of the aforementioned Exceptions. This sounds like a bug and would love to hear if you have any suggestions on what this cleanup logic should look like.

2) Instead of the second suggestion above, we can clear the pending messages when closeAsync method is called, because if there are things in the pending queue, connection waits.

 
{code:java}
final class Connection {
    ...
    public synchronized CompletableFuture<Void> closeAsync() {
        if (isClosing()) return closeFuture.get();

        final CompletableFuture<Void> future = new CompletableFuture<>();
        closeFuture.set(future);

        // stop any pings being sent at the server for keep-alive
        final ScheduledFuture keepAlive = keepAliveFuture.get();
        if (keepAlive != null) keepAlive.cancel(true);

        // FIX HERE - clear the pending messages
        pending.values().forEach(val -> val.markError(new RuntimeException("Connection closed because of erroneous requests")));
        pending.clear();{code}
 

3) When write fails due to a ClosedChannel exception, host is marked as bad. Since same connection can be borrowed by multiple requests, it is possible for the first request to close the channel causing other requests to fail the write. But this probably doesn't mean the host is dead, more generally failing write on channel doesn't mean host is dead (one possibility is that server closed the channel for some reason). So instead, this should just replaceConnection and if during the connection creation, we fail to connect then we need to mark host as dead.

Also, the following can prevent this particular issue in the server side;

4) Client can issue a bunch of requests and due to some error on client side it can send CloseWebsocketFrame to server. Server actually closes the channel when it receives this but the executors trying to write response might get stuck in the loop where it checks for watermark to clear (it will not clear in these cases since client has already given up on these requests). We can check channel.isActive() too, and throw an Exception if it is inactive, before [3].

I can send a PR after we decide which fixes we want to move forward with.

[1] [https://github.com/apache/tinkerpop/blob/c588d4f8ea411bf89f0d1a5dcdb707af3e87d1df/gremlin-driver/src/main/java/org/apache/tinkerpop/gremlin/driver/Connection.java#L80]

[2] [https://github.com/apache/tinkerpop/blob/9e4078983bb41a66f17cd33d599168a8ec78295b/gremlin-driver/src/main/java/org/apache/tinkerpop/gremlin/driver/ConnectionPool.java#L327] 

[3] [https://github.com/apache/tinkerpop/blob/d1a3fa147d1f009ae57274827c9b59426dfc6e58/gremlin-server/src/main/java/org/apache/tinkerpop/gremlin/server/op/traversal/TraversalOpProcessor.java#L533] 


was (Author: yigitk):
Thanks for the quick response, Stephen. Yes, this is only a problem for the client-side maxContentLength.  There might be more generic solutions than the suggestions above. More investigation on this issue lead to these observations;

1) Connection-borrowing logic: Connection class has a borrow logic [1], where borrowed field gets incremented by reuse. When Connection gets replaced, the definitelyDestroyConnection method expects there are no borrowers left to delete the connection [2], but borrowed value never gets decremented in case of the aforementioned Exceptions. This sounds like a bug and would love to hear if you have any suggestions on what this cleanup logic should look like.

2) Instead of the second suggestion above, we can clear the pending messages when closeAsync method is called, because if there are things in the pending queue, connection waits.

 
{code:java}
final class Connection {
    ...
    public synchronized CompletableFuture<Void> closeAsync() {
        if (isClosing()) return closeFuture.get();

        final CompletableFuture<Void> future = new CompletableFuture<>();
        closeFuture.set(future);

        // stop any pings being sent at the server for keep-alive
        final ScheduledFuture keepAlive = keepAliveFuture.get();
        if (keepAlive != null) keepAlive.cancel(true);

        // FIX HERE - clear the pending messages
        pending.values().forEach(val -> val.markError(new RuntimeException("Connection closed because of crap requests")));
        pending.clear();{code}
 

3) When write fails due to a ClosedChannel exception, host is marked as bad. Since same connection can be borrowed by multiple requests, it is possible for the first request to close the channel causing other requests to fail the write. But this probably doesn't mean the host is dead, more generally failing write on channel doesn't mean host is dead (one possibility is that server closed the channel for some reason). So instead, this should just replaceConnection and if during the connection creation, we fail to connect then we need to mark host as dead.

Also, the following can prevent this particular issue in the server side;

4) Client can issue a bunch of requests and due to some error on client side it can send CloseWebsocketFrame to server. Server actually closes the channel when it receives this but the executors trying to write response might get stuck in the loop where it checks for watermark to clear (it will not clear in these cases since client has already given up on these requests). We can check channel.isActive() too, and throw an Exception if it is inactive, before [3].

I can send a PR after we decide which fixes we want to move forward with.

[1] [https://github.com/apache/tinkerpop/blob/c588d4f8ea411bf89f0d1a5dcdb707af3e87d1df/gremlin-driver/src/main/java/org/apache/tinkerpop/gremlin/driver/Connection.java#L80]

 [2] [https://github.com/apache/tinkerpop/blob/9e4078983bb41a66f17cd33d599168a8ec78295b/gremlin-driver/src/main/java/org/apache/tinkerpop/gremlin/driver/ConnectionPool.java#L327] 

[3] [https://github.com/apache/tinkerpop/blob/d1a3fa147d1f009ae57274827c9b59426dfc6e58/gremlin-server/src/main/java/org/apache/tinkerpop/gremlin/server/op/traversal/TraversalOpProcessor.java#L533] 

> Responses exceeding maxContentLength cause subsequent queries to hang
> ---------------------------------------------------------------------
>
>                 Key: TINKERPOP-2169
>                 URL: https://issues.apache.org/jira/browse/TINKERPOP-2169
>             Project: TinkerPop
>          Issue Type: Bug
>          Components: driver
>    Affects Versions: 3.4.0
>         Environment: Client: EC2 Amazon Linux t3.medium
> Server: Amazon Neptune r4.2xlarge
>            Reporter: Yigit Kiran
>            Priority: Major
>
> Gremlin Driver replaces connections on the channel when it receives Exceptions that are instances of IOException or CodecException (including CorruptedFrameException). When CorruptedFrameException is thrown because response length is greater than the maxContentLength value (32kb by default), driver thinks the host might be unavailable and tries to replace Connection.
> If Connection is shared among multiple requests (its pending queue is > 1), other WSConnection goes stale after connection replacement, while keeping the server executor threads busy.
> Keeping the exec threads busy for stale connections prevents server from picking up new tasks for subsequent requests from the request queue. Additionally since there is a new connection added in Client, it can accept more requests and similar errors can lead to a build up in request queue. When many concurrent requests gets into this situation server become unresponsive to the new requests.
> h3. Steps to repro
>  
> 1. Have a gremlin server
> 2. Connect it using java driver with setting the maxContentLength pretty low, i.e. using the config below:
>  
> {code:java}
> Cluster.Builder builder = Cluster.build();
>         builder.addContactPoint(endpoint[0]);
>         builder.port(8182);
>         builder.maxConnectionPoolSize(100);
>         builder.maxSimultaneousUsagePerConnection(100);
>         builder.maxInProcessPerConnection(50);
>         builder.maxContentLength(32); // <-- this is reduced from 32k 
>         builder.keepAliveInterval(0);
> {code}
>  
> 3. Issue concurrent requests using the cluster, where the response would be greater than 32 bytes.
> h3. Ideas on a possible solution
> One possible solution to this is to not consider channel as dead when request length exceeds maxContentFrame length. 
> {{}}
> {code:java}
> final class Connection {
>     ...
>     public ChannelPromise write(final RequestMessage requestMessage, final CompletableFuture<ResultSet> future) {
>     ...
>         // FIX HERE: Do not consider CorruptedFrameException as non-recoverable exception.
>         if ((t instanceof IOException || t instanceof CodecException) && (! (t instanceof CorruptedFrameException))) {
>         ...
>         }
>     }
> }
> {code}
> Another fix could be the request can be deleted from the Connections' pending request map, and if there are other pending requests on the connection, close them before replacing the connection, or not replace the connection at all: 
>  
> {code:java}
> final class Connection {void replaceConnection(final Connection connection) {
>     ...
>     // FIX HERE: Do not replace connection if there are pending requests on it. 
>     if (!connection.getPending().isEmpty()) {
>         return; // prevent replacing the connection while there are pending requests.
>     }
>     considerNewConnection();
>     definitelyDestroyConnection(connection);
>     }
> }{code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)