Posted to dev@tinkerpop.apache.org by "stephen mallette (JIRA)" <ji...@apache.org> on 2015/09/29 22:24:04 UTC

[jira] [Closed] (TINKERPOP3-814) ConnectionPool can fill with dead Connections

     [ https://issues.apache.org/jira/browse/TINKERPOP3-814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stephen mallette closed TINKERPOP3-814.
---------------------------------------
    Resolution: Fixed

The solution in the description wasn't too far off from the heart of the problem.  First of all, the {{Host}} was getting marked unavailable too early for issues that might have been recoverable.  For example, a serialization problem might occur for one query but not for another (e.g. one query might return a Titan {{RelationIdentifier}}, which requires a serializer, and another might not).  In cases like that, where there is a chance that future requests to the {{Host}} will succeed, it's probably not a good idea to kill the channel and mark the {{Host}} dead.  That part was resolved in TINKERPOP3-816.
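To make that distinction concrete, here's a minimal, hypothetical sketch of the idea (the {{FailureKind}} enum and method below are assumptions for illustration, not the actual driver API): per-request problems leave the {{Host}} alone, while channel-level problems are what should drive unavailability.

{noformat}
// A minimal, hypothetical sketch (not the actual driver code) of the
// distinction described above: request-level failures such as serialization
// errors are surfaced to the caller but do not condemn the Host, while
// channel-level failures do.
final class ErrorHandlingSketch {

    // Assumed failure categories, for illustration only.
    enum FailureKind { SERIALIZATION, CHANNEL_CLOSED, TIMEOUT }

    static boolean shouldMarkHostUnavailable(final FailureKind kind) {
        switch (kind) {
            case SERIALIZATION:
                // e.g. a Titan RelationIdentifier with no registered serializer;
                // the next query may succeed, so keep the channel and the Host
                return false;
            case CHANNEL_CLOSED:
            case TIMEOUT:
                // the connection itself is suspect, so unavailability is on the table
                return true;
            default:
                return false;
        }
    }
}
{noformat}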

Then, when the {{Host}} was marked unavailable, nothing was being done to clear the connections in the pool, and (for some reason) the driver wasn't respecting the pool size on reconnect: new connections were just being created and added to the pool whenever a reconnect to the {{Host}} was attempted.  So, the driver now destroys connections when the {{Host}} is marked as unavailable.  On reconnect attempts it creates a new connection (through the standard methods that respect pool size, timeouts, etc.) and then issues a "ping" to see if the host is online.  The "ping" is pretty crude, just a gremlin script with an empty string that should evaluate back as an empty string, but it does the job.  Unfortunately, it's not easy to rely on the websocket ping given the way the driver is set up now, plus we have to stick to the subprotocol's capabilities since this code needs to work over the NIO channel and other transports, so doing websocket-specific things won't be so good.  Perhaps a "ping" could be built into the subprotocol itself.
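As a rough illustration of that flow, here's a hypothetical sketch (the types and method names below are assumptions for the sketch, not the real {{ConnectionPool}} API): the pool is emptied when the {{Host}} goes unavailable, and a reconnect attempt borrows a connection through the normal pool limits and issues the empty-script "ping".

{noformat}
// Hypothetical sketch of the reconnect flow described above; the types and
// method names are illustrative, not the actual driver API.
final class ReconnectSketch {

    interface Connection {
        String eval(String gremlin) throws Exception;    // submit a script, block for the result
    }

    interface Pool {
        void destroyAllConnections();                    // invoked when the Host is marked unavailable
        Connection borrowConnection() throws Exception;  // honors pool size and borrow timeouts
        void returnConnection(Connection c);
    }

    // Returns true if the host answered the "ping", i.e. the empty script
    // round-tripped without an error.
    static boolean tryReconnect(final Pool pool) {
        try {
            final Connection c = pool.borrowConnection();
            try {
                c.eval("");       // the "ping" described above
                return true;
            } finally {
                pool.returnConnection(c);
            }
        } catch (Exception e) {
            return false;         // still unreachable; leave the Host marked unavailable
        }
    }
}
{noformat}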

It would be nice if I had a clean unit test here to prevent regressions, but I'm not completely sure how to write it; I don't think our integration testing infrastructure is advanced enough for the scenario involved.

> ConnectionPool can fill with dead Connections
> ---------------------------------------------
>
>                 Key: TINKERPOP3-814
>                 URL: https://issues.apache.org/jira/browse/TINKERPOP3-814
>             Project: TinkerPop 3
>          Issue Type: Bug
>          Components: driver
>    Affects Versions: 3.0.1-incubating
>         Environment: "Cluster" of a single gremlin-server/Titan node.  Single client.  minConnectionPoolSize = 2; maxConnectionPoolSize = 10.
>            Reporter: Kieran Sherlock
>            Assignee: stephen mallette
>             Fix For: 3.1.0-incubating, 3.0.2-incubating
>
>
> Client gets a GryoMessageSerializerV1d0 decoding error on a couple of requests (different unfiled issue).   ConnectionPool gets into the situation where there are 2 dead connections being maintained in the pool, but no new connections are being issued.
> A sample of the log output...
> {noformat}
> 2015-08-27 09:57:18,122 [Grizzly-worker(15)] DEBUG org.apache.tinkerpop.gremlin.driver.ConnectionPool  - Borrowing connection from pool on Host{address=localhost/127.0.0.1:18182, hostUri=ws://localhost:18182/gremlin} - timeout in 3000 MILLISECONDS
> 2015-08-27 09:57:18,122 [Grizzly-worker(15)] DEBUG org.apache.tinkerpop.gremlin.driver.ConnectionPool  - Return least used Connection{host=Host{address=localhost/127.0.0.1:18182, hostUri=ws://localhost:18182/gremlin}, isDead=false, borrowed=1, pending=0} on Host{address=localhost/127.0.0.1:18182, hostUri=ws://localhost:18182/gremlin}
> 2015-08-27 09:57:18,122 [Grizzly-worker(15)] DEBUG org.apache.tinkerpop.gremlin.driver.Client  - Submitted RequestMessage{, requestId=0349c1ce-19dc-4d54-9ca9-30965c9314ae, op='eval', processor='', args={<redacted>}} to - Connection{host=Host{address=localhost/127.0.0.1:18182, hostUri=ws://localhost:18182/gremlin}}
> 2015-08-27 09:57:18,125 [gremlin-driver-worker-6] DEBUG org.apache.tinkerpop.gremlin.driver.ConnectionPool  - Attempting to return Connection{host=Host{address=localhost/127.0.0.1:18182, hostUri=ws://localhost:18182/gremlin}} on Host{address=localhost/127.0.0.1:18182, hostUri=ws://localhost:18182/gremlin}
> 2015-08-27 09:57:18,126 [gremlin-driver-worker-6] DEBUG org.apache.tinkerpop.gremlin.driver.ConnectionPool  - On Host{address=localhost/127.0.0.1:18182, hostUri=ws://localhost:18182/gremlin} pool size of 3 > minPoolSize 2 and borrowed of 0 <= minSimultaneousUsagePerConnection 8 so destroy Connection{host=Host{address=localhost/127.0.0.1:18182, hostUri=ws://localhost:18182/gremlin}, isDead=false, borrowed=0, pending=0}
> 2015-08-27 09:57:18,126 [gremlin-driver-worker-6] DEBUG org.apache.tinkerpop.gremlin.driver.ConnectionPool  - Connection{host=Host{address=localhost/127.0.0.1:18182, hostUri=ws://localhost:18182/gremlin}, isDead=false, borrowed=0, pending=0} destroyed
> 2015-08-27 09:57:18,126 [Grizzly-worker(15)] DEBUG org.apache.tinkerpop.gremlin.driver.ConnectionPool  - Borrowing connection from pool on Host{address=localhost/127.0.0.1:18182, hostUri=ws://localhost:18182/gremlin} - timeout in 3000 MILLISECONDS
> 2015-08-27 09:57:18,126 [Grizzly-worker(15)] DEBUG org.apache.tinkerpop.gremlin.driver.ConnectionPool  - Pool was initialized but a connection could not be selected earlier - waiting for connection on Host{address=localhost/127.0.0.1:18182, hostUri=ws://localhost:18182/gremlin}
> 2015-08-27 09:57:18,126 [Grizzly-worker(15)] DEBUG org.apache.tinkerpop.gremlin.driver.ConnectionPool  - Wait 3000 MILLISECONDS for an available connection on Host{address=localhost/127.0.0.1:18182, hostUri=ws://localhost:18182/gremlin} with Thread[Grizzly-worker(15),5,main]
> 2015-08-27 09:57:21,127 [Grizzly-worker(15)] DEBUG org.apache.tinkerpop.gremlin.driver.ConnectionPool  - Continue to wait for connection on -1 if {} > 0
> 2015-08-27 09:57:21,127 [Grizzly-worker(15)] DEBUG org.apache.tinkerpop.gremlin.driver.ConnectionPool  - Timed-out waiting for connection on Host{address=localhost/127.0.0.1:18182, hostUri=ws://localhost:18182/gremlin} - possibly unavailable
> 2015-08-27 09:57:21,127 [gremlin-driver-worker-3] DEBUG org.apache.tinkerpop.gremlin.driver.Host  - Trying to reconnect to dead host at Host{address=localhost/127.0.0.1:18182, hostUri=ws://localhost:18182/gremlin}
> 2015-08-27 09:57:21,128 [Grizzly-worker(15)] DEBUG org.apache.tinkerpop.gremlin.driver.Client  - Submitted RequestMessage{, requestId=67761973-7ae7-4a6a-8e62-443fdb55142d, op='eval', processor='', args=<redacted>} to - connection not initialized
> 2015-08-27 09:57:21,128 [gremlin-driver-worker-3] DEBUG org.apache.tinkerpop.gremlin.driver.ConnectionPool  - Trying to re-establish connection on Host{address=localhost/127.0.0.1:18182, hostUri=ws://localhost:18182/gremlin}
> 2015-08-27 09:57:21,128 [Grizzly-worker(15)] ERROR EdnaMetaDataServiceGremlinImpl  - Failed man lookupMerchantAccountName
> java.lang.RuntimeException: java.lang.RuntimeException: java.util.concurrent.TimeoutException
> {noformat}
> This behavior continues with each subsequent request.  The gremlin-server is available at all times.
> As a workaround I have applied the following patch, but I am pretty sure it is not the correct solution:
> {noformat}
> diff --git a/gremlin-driver/src/main/java/org/apache/tinkerpop/gremlin/driver/ConnectionPool.java b/gremlin-driver/src/main/java/org/apache/tinkerpop/gremlin/driver/ConnectionPool.java
> index 30cd71f..cb8ebfa 100644
> --- a/gremlin-driver/src/main/java/org/apache/tinkerpop/gremlin/driver/ConnectionPool.java
> +++ b/gremlin-driver/src/main/java/org/apache/tinkerpop/gremlin/driver/ConnectionPool.java
> @@ -352,7 +352,7 @@ final class ConnectionPool {
>              }
>              remaining = to - TimeUtil.timeSince(start, unit);
> -            logger.debug("Continue to wait for connection on {} if {} > 0", remaining);
> +            logger.debug("Continue to wait for connection on {} if {} > 0", host, remaining);
>          } while (remaining > 0);
>          logger.debug("Timed-out waiting for connection on {} - possibly unavailable", host);
> @@ -371,13 +371,16 @@ final class ConnectionPool {
>          // let the load-balancer know that the host is acting poorly
>          this.cluster.loadBalancingStrategy().onUnavailable(host);
> -
>      }
>      private boolean tryReconnect(final Host h) {
>          logger.debug("Trying to re-establish connection on {}", host);
>          try {
> +            // remove dead connections
> +            for (Connection c : connections)
> +                if (c.isDead())
> +                    definitelyDestroyConnection(c);
>              connections.add(new Connection(host.getHostUri(), this, cluster, settings().maxInProcessPerConnection));
>              this.open.set(connections.size());
> {noformat}


