Posted to issues@solr.apache.org by "Chris M. Hostetter (Jira)" <ji...@apache.org> on 2022/03/14 23:26:00 UTC

[jira] [Commented] (SOLR-16099) HTTP Client threads can hang in Jetty's InputStreamResponseListener when using HTTP2 - impacts intra-node communication

    [ https://issues.apache.org/jira/browse/SOLR-16099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17506622#comment-17506622 ] 

Chris M. Hostetter commented on SOLR-16099:
-------------------------------------------

I'm filing this issue for tracking purposes and to raise visibility of the symptoms and workaround - I don't know if/how it can be "fixed" in Solr.

This popped up for me when dealing with a customer who noticed they were running out of disk space on nodes hosting PULL replicas of a large collection that was getting lots of document updates. The hung threads were holding {{SolrIndexSearcher}} references, which had open file handles to segment files that were no longer in use by the "current" index but couldn't be reclaimed by the filesystem because of those open file handles.

This problem could manifest in other ways for other people – the most likely (beyond the obvious failed or "slow" request metrics) being high memory usage, since these threads and the objects they hold references to take up heap space indefinitely.
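For anyone trying to confirm whether a node is affected: the hung threads are easy to spot in a thread dump (see the example trace in the issue description below). A rough programmatic equivalent, using the standard {{ThreadMXBean}} API (the class name here is made up for illustration, and the same data is available remotely over JMX), would be something like...
{code:java}
// Rough diagnostic sketch: scan the current JVM's threads for ones that are WAITING
// on an InputStreamResponseListener monitor (the signature of this hang).
import java.lang.management.LockInfo;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class HungResponseListenerCheck {
  public static void main(String[] args) {
    ThreadMXBean threads = ManagementFactory.getThreadMXBean();
    for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
      LockInfo lock = info.getLockInfo();
      if (info.getThreadState() == Thread.State.WAITING
          && lock != null
          && lock.getClassName().endsWith("InputStreamResponseListener")) {
        System.out.println("Possibly hung request thread: " + info.getThreadName());
      }
    }
  }
}
{code}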

 
----
 

While investigating open jetty issues that might pertain to this, I discovered that Ishan & Noble have apparently [been aware of this problem|https://github.com/fullstorydev/lucene-solr/pull/121] [for a while|https://github.com/fullstorydev/lucene-solr/pull/106], and already have [an open ticket with the Jetty project (#7192)|https://github.com/eclipse/jetty.project/issues/7192] regarding these symptoms.

(Ishan alluded to this problem in SOLR-15840, but that Jira is a general "Task" about the performance of HTTP2 in Solr; it doesn't explicitly call out that this "Bug" exists and can hang threads, hence I've opened this new Jira.)

So far the only known [steps to reliably reproduce|https://github.com/eclipse/jetty.project/issues/7192#issuecomment-989727192] are fairly onerous (they crashed my machine when I attempted to run them), which makes it hard to identify a root cause, but the [latest analysis from Jetty dev|https://github.com/eclipse/jetty.project/issues/7192#issuecomment-991068558] {{sbordet}} is...
{quote}We were able to get a failure, so we instrumented the client to understand why it was waiting, and found that it was waiting for a request whose headers arrived fine (with a 200 status code) but the client was waiting for the response body (no errors).

We tried to track this request on the server logs, but we could not find any information.
Any attempt to increment the log level to DEBUG makes the failure non-reproducible, so we are a bit stuck trying to understand what's going on.

From our point of view, the client received response headers and it's waiting for the content, but it never arrives.
We cannot explain how the response headers arrived, but not the content, but we don't have information about the server – whether there was an error on the server or similar, because the logs don't show any error.

At this point I'm throwing the ball in your court: are you sure that you are sending the response content?

...

If we can track the tcpdump and see that the HTTP/2 frames reach the client, then it's a Jetty client problem;
otherwise the server did not send the HTTP/2 frames and this could be either for a Jetty server problem, or because the application did not send any data.
In this way we're trying to restrict the problem.
{quote}
The issue seems to have stalled out ~2 months ago while attempting to provide a usable {{tcpdump}} trace of the affected requests.

[~ichattopadhyaya] / [~noblepaul] : Do you have any more info on this that you can share? If you can reliably reproduce this, can you update the jetty ticket w/ usable {{tcpdump}} logs?

 
----
 

One thing we may want to consider (in Solr) is replacing our usage of {{InputStreamResponseListener}} with a variant implementation that uses a "timeout" instead of an unlimited {{wait()}} (along the lines of a [spin-off jetty enhancement issue|https://github.com/eclipse/jetty.project/issues/7259] one of the jetty devs filed). We could probably (with some effort) tweak the impacted Solr APIs to propagate the (remaining) {{timeAllowed}} (if that option was specified) down to this class – and/or have an "extreme" default (ie: 30min) just to prevent threads from sticking around forever.
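To make that idea concrete, here's a rough, self-contained sketch of the kind of "bounded wait" read loop such a variant could use. The class and method names are made up for illustration, and Jetty's real {{InputStreamResponseListener}} obviously does a lot more (flow control, failure propagation, etc.); the only point is that the {{wait()}} is bounded, so a request whose response content never arrives fails with an exception instead of parking the thread forever...
{code:java}
// Illustrative only: a chunk-queue InputStream whose read() gives up after
// maxWaitMillis instead of blocking in wait() forever when the expected
// notify() never arrives.
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayDeque;
import java.util.Queue;

public class TimedChunkInputStream extends InputStream {
  private final Queue<byte[]> chunks = new ArrayDeque<>();
  private final long maxWaitMillis;
  private byte[] current;
  private int pos;
  private boolean finished;

  public TimedChunkInputStream(long maxWaitMillis) {
    this.maxWaitMillis = maxWaitMillis;
  }

  /** Called by the transport callback when a chunk of response content arrives. */
  public synchronized void offer(byte[] chunk) {
    chunks.add(chunk);
    notifyAll();
  }

  /** Called by the transport callback when the response is complete. */
  public synchronized void complete() {
    finished = true;
    notifyAll();
  }

  @Override
  public synchronized int read() throws IOException {
    long deadline = System.currentTimeMillis() + maxWaitMillis;
    while (current == null || pos >= current.length) {
      if (!chunks.isEmpty()) {
        current = chunks.poll();
        pos = 0;
        continue;
      }
      if (finished) {
        return -1; // normal end of stream
      }
      long remaining = deadline - System.currentTimeMillis();
      if (remaining <= 0) {
        // Surface the stall to the caller instead of hanging indefinitely.
        throw new IOException("No response content received within " + maxWaitMillis + "ms");
      }
      try {
        wait(remaining); // bounded wait, unlike an unconditional wait()
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IOException("Interrupted while waiting for response content", e);
      }
    }
    return current[pos++] & 0xFF;
  }
}
{code}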

> HTTP Client threads can hang in Jetty's InputStreamResponseListener when using HTTP2 - impacts intra-node communication
> -----------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-16099
>                 URL: https://issues.apache.org/jira/browse/SOLR-16099
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud, SolrJ
>            Reporter: Chris M. Hostetter
>            Priority: Major
>
> There appears to be a Jetty HttpClient bug that makes it possible for a request thread to hang indefinitely while waiting to parse the response from remote jetty servers. The thread hangs because it calls {{.wait()}} on a monitor lock that _should_ be notified by another (internal jetty client) thread when a chunk of data is available from the wire – but in some cases this evidently may not happen.
> In the case of {{distrib=true}} requests processed by Solr (aggregating multiple per-shard responses from other nodes) this can manifest with stack traces that look like the following (taken from Solr 8.8.2)...
> {noformat}
>  "thread",{
>    "id":14253,
>    "name":"httpShardExecutor-7-thread-13819-...",
>    "state":"WAITING",
>    "lock":"org.eclipse.jetty.client.util.InputStreamResponseListener@12b59075",
>    "lock-waiting":{
>      "name":"org.eclipse.jetty.client.util.InputStreamResponseListener@12b59075",
>      "owner":null},
>    "synchronizers-locked":["java.util.concurrent.ThreadPoolExecutor$Worker@1ec1aed0"],
>    "cpuTime":"65.4882ms",
>    "userTime":"60.0000ms",
>    "stackTrace":["java.base@11.0.14/java.lang.Object.wait(Native Method)",
>      "java.base@11.0.14/java.lang.Object.wait(Unknown Source)",
>      "org.eclipse.jetty.client.util.InputStreamResponseListener$Input.read(InputStreamResponseListener.java:318)",
>      "org.apache.solr.common.util.FastInputStream.readWrappedStream(FastInputStream.java:90)",
>      "org.apache.solr.common.util.FastInputStream.refill(FastInputStream.java:99)",
>      "org.apache.solr.common.util.FastInputStream.readByte(FastInputStream.java:217)",
>      "org.apache.solr.common.util.JavaBinCodec._init(JavaBinCodec.java:211)",
>      "org.apache.solr.common.util.JavaBinCodec.initRead(JavaBinCodec.java:202)",
>      "org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:195)",
>      "org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:51)",
>      "org.apache.solr.client.solrj.impl.Http2SolrClient.processErrorsAndResponse(Http2SolrClient.java:696)",
>      "org.apache.solr.client.solrj.impl.Http2SolrClient.request(Http2SolrClient.java:412)",
>      "org.apache.solr.client.solrj.impl.Http2SolrClient.request(Http2SolrClient.java:761)",
>      "org.apache.solr.client.solrj.impl.LBSolrClient.doRequest(LBSolrClient.java:369)",
>      "org.apache.solr.client.solrj.impl.LBSolrClient.request(LBSolrClient.java:297)",
>      "org.apache.solr.handler.component.HttpShardHandlerFactory.makeLoadBalancedRequest(HttpShardHandlerFactory.java:371)",
>      "org.apache.solr.handler.component.ShardRequestor.call(ShardRequestor.java:132)",
>      "org.apache.solr.handler.component.ShardRequestor.call(ShardRequestor.java:41)",
>      "java.base@11.0.14/java.util.concurrent.FutureTask.run(Unknown Source)",
>      "java.base@11.0.14/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)",
>      "java.base@11.0.14/java.util.concurrent.FutureTask.run(Unknown Source)",
>      "com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:180)",
>      "org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:218)",
>      "org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$$Lambda$175/0x0000000840243c40.run(Unknown Source)",
>      "java.base@11.0.14/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)",
>      "java.base@11.0.14/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)",
>      "java.base@11.0.14/java.lang.Thread.run(Unknown Source)"]},
> {noformat}
> ...these {{httpShardExecutor}} threads can stay hung, tying up system resources, indefinitely (unless they get a spurious {{notify()}} from the JVM). (In one case, it seems to have caused a request to hang for {*}10.3 hours{*}.)
> Anecdotally:
>  * There is some evidence that this problem did _*NOT*_ affect Solr 8.6.3, but does affect later versions
>  ** suggesting the bug didn't exist in Jetty until _after_ 9.4.27.v20200227
>  * Forcing the Jetty HttpClient to use HTTP/1.1 transport seems to prevent this problem from happening
>  ** In Solr this can be done by setting the {{"solr.http1"}} system property
>  ** Or by using the {{Http2SolrClient.Builder.useHttp1_1()}} method in client application code (see the sketch below)
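> For client application code, that builder-based workaround looks roughly like the following; the base URL and collection name are placeholders, and the Solr nodes themselves can get the same effect via the {{solr.http1}} system property mentioned above.
> {code:java}
> // Rough example of the client-side workaround: force Http2SolrClient to use
> // HTTP/1.1 transport via its Builder. URL and collection name are placeholders.
> import org.apache.solr.client.solrj.SolrQuery;
> import org.apache.solr.client.solrj.impl.Http2SolrClient;
> import org.apache.solr.client.solrj.response.QueryResponse;
> 
> public class Http1FallbackExample {
>   public static void main(String[] args) throws Exception {
>     try (Http2SolrClient client =
>         new Http2SolrClient.Builder("http://localhost:8983/solr")
>             .useHttp1_1(true) // fall back to HTTP/1.1 instead of HTTP/2
>             .build()) {
>       // Requests work exactly as before; only the underlying transport changes.
>       QueryResponse rsp = client.query("my_collection", new SolrQuery("*:*"));
>       System.out.println("numFound: " + rsp.getResults().getNumFound());
>     }
>   }
> }
> {code}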


