Posted to dev@tinkerpop.apache.org by "Florian Hockmann (Jira)" <ji...@apache.org> on 2021/04/14 15:30:00 UTC

[jira] [Commented] (TINKERPOP-2390) Connections not released when closed abruptly in the server side

    [ https://issues.apache.org/jira/browse/TINKERPOP-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17321093#comment-17321093 ] 

Florian Hockmann commented on TINKERPOP-2390:
---------------------------------------------

I just tried to reproduce the scenario, but I don't see anything wrong. Here is what I did (a rough code sketch follows the list):
 # Start the server with a {{gremlinPool}} of 1 as described above (_TinkerpopServer configured not to provide any concurrent service, i.e., all the queries were processed sequentially_).
 # Connect from Gremlin.Net (I used the version from current {{master}} and also tried it with the version from {{3.4-dev}}) with default settings ({{PoolSize}} of 4 and {{MaxInProcessPerConnection}} of 32).
 # Send 10 requests that simply sleep for 3 seconds, each with a custom evaluation timeout of 1 ms.
 # Result:
 ## All requests get a {{ResponseException}} with a timeout on the server side.
 ## 4 connections in state {{ESTABLISHED}} on the server side.
 # Send 1 request to verify that both the driver and the server are still in a valid state. -> Receive the expected result.
 # Dispose the {{GremlinClient}} instance.
 # Result:
 ## All 4 connections in state {{TIME_WAIT}} on the server
 ## After 1 min: connections completely closed

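For reference, this is roughly what my reproduction looks like in Gremlin.Net. It is only a minimal sketch: the host/port and the {{Thread.sleep}} script are assumptions of mine, not details from the original report.

{code}
using System;
using System.Threading.Tasks;
using Gremlin.Net.Driver;
using Gremlin.Net.Driver.Exceptions;
using Gremlin.Net.Driver.Messages;

public static class TimeoutRepro
{
    public static async Task Main()
    {
        // Default pool settings: 4 connections, up to 32 in-flight requests per connection.
        var poolSettings = new ConnectionPoolSettings { PoolSize = 4, MaxInProcessPerConnection = 32 };
        using var client = new GremlinClient(
            new GremlinServer("localhost", 8182), connectionPoolSettings: poolSettings);

        for (var i = 0; i < 10; i++)
        {
            // The script sleeps for 3 s but only gets 1 ms of evaluation time on the server.
            var msg = RequestMessage.Build(Tokens.OpsEval)
                .AddArgument(Tokens.ArgsGremlin, "Thread.sleep(3000)")
                .AddArgument("evaluationTimeout", 1) // "scriptEvaluationTimeout" on older servers
                .Create();
            try
            {
                await client.SubmitAsync<object>(msg);
            }
            catch (ResponseException e)
            {
                // Expected: a server-side timeout for every request.
                Console.WriteLine($"Request {i}: {e.Message}");
            }
        }

        // A normal request afterwards still succeeds, so driver and server are in a valid state.
        var sum = await client.SubmitWithSingleResultAsync<int>("1 + 1");
        Console.WriteLine(sum);
    } // Disposing the client here closes all 4 connections (TIME_WAIT, then fully closed).
}
{code}
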
The server is still responsive after this. From my limited knowledge of TCP, the {{TIME_WAIT}} state is expected: connections are not closed completely right away so that packets arriving late or out of order can still be handled. They are then fully closed after a timeout, which seems to be one minute on my machine.

What I really don't understand here is why the server should close the connection just because one request ran into a timeout. That doesn't make much sense, as multiple requests can be processed on the same connection, so the connection shouldn't be affected by a single failing request (failing here in the sense of timing out).

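To illustrate what I mean (again only a sketch, with an assumed localhost endpoint): with a {{PoolSize}} of 1 the driver multiplexes several in-flight requests over the single connection, so one of them timing out should not tear down the connection for the others.

{code}
using System;
using System.Linq;
using System.Threading.Tasks;
using Gremlin.Net.Driver;

// PoolSize = 1 forces all traffic onto a single connection; MaxInProcessPerConnection
// still allows many requests to be in flight on that one connection at the same time.
var settings = new ConnectionPoolSettings { PoolSize = 1, MaxInProcessPerConnection = 32 };
using var client = new GremlinClient(
    new GremlinServer("localhost", 8182), connectionPoolSettings: settings);

var tasks = Enumerable.Range(0, 8)
    .Select(i => client.SubmitWithSingleResultAsync<int>($"{i} + 1"));
var results = await Task.WhenAll(tasks); // all eight answered over the same TCP connection
Console.WriteLine(string.Join(", ", results));
{code}
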
[~Bobed] Could you please provide more information on this, ideally a setup to reproduce the problem deterministically? Otherwise, I'm inclined to close this issue as we cannot reproduce it.

> Connections not released when closed abruptly in the server side
> ----------------------------------------------------------------
>
>                 Key: TINKERPOP-2390
>                 URL: https://issues.apache.org/jira/browse/TINKERPOP-2390
>             Project: TinkerPop
>          Issue Type: Bug
>          Components: dotnet
>    Affects Versions: 3.4.7
>         Environment: Tinkerpop 3.4.7 + Janusgraph 0.5.1 (optional opencypher 1.0.0) 
>            Reporter: Carlos
>            Priority: Major
>
> We have developed a web service to query a gremlin-server (JanusGraph 0.5.1) using the .NET driver. Using the opencypher plugin has allowed us to see a behaviour where the server gets completely blocked after a timeout on the server side. We thought this might be related to issue https://issues.apache.org/jira/browse/TINKERPOP-2288, so we have moved our driver version to the master one (3.4-dev, which includes the PR solving that issue). However, when a timeout occurs (always on the server side, which is the one raising the exception), quite a lot of connections get stuck in CLOSE_WAIT status and the server becomes unusable.
> I've been digging around other bugs and issues, and from what I've read, some similar behaviour happened with Cosmos DB (although in that case it seems to have been caused by connection leaks, whereas here it is the timeout). We have traced the problem down to the driver itself after isolating all the components involved (optimizing the cypher query results in a non-timeout situation where everything is ok; forcing the timeout from pure gremlin replicates the behaviour).
> We have set the connection pool params to 16 / 4096 (we are expecting quite a high concurrency load).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)