Posted to dev@lucene.apache.org by "Jason Gerlowski (JIRA)" <ji...@apache.org> on 2018/12/03 21:27:00 UTC

[jira] [Commented] (SOLR-13038) Overseer actions fail with NoHttpResponseException following a node restart

    [ https://issues.apache.org/jira/browse/SOLR-13038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16707814#comment-16707814 ] 

Jason Gerlowski commented on SOLR-13038:
----------------------------------------

You can reproduce this behavior pretty regularly with the JUnit test below that uses SolrCloudTestCase as its base:

{code}
  @Test
  public void testOtherReplicasAreNotActive() throws Exception {
    final String collection = "collection1";
    // Create a collection with 1 shard and 2 replicas, then wait for it to become active.
    CollectionAdminRequest
        .createCollection(collection, "config", 1, 2)
        .process(cluster.getSolrClient());
    cluster.waitForActiveCollection(collection, 1, 2);

    // getNonLeader(...) is a helper defined outside this snippet.
    Slice shard = getCollectionState(collection).getSlice("shard1");
    JettySolrRunner otherReplicaJetty = cluster.getReplicaJetty(getNonLeader(shard));

    // Restart the node hosting the non-leader replica.
    otherReplicaJetty.stop();
    cluster.waitForJettyToStop(otherReplicaJetty);
    waitForState("Timeout waiting for replica to go down", collection,
        (liveNodes, collectionState) ->
            getNonLeader(collectionState.getSlice("shard1")).getState() != Replica.State.ACTIVE);
    otherReplicaJetty.start();
    cluster.waitForNode(otherReplicaJetty, 30);
    waitForState("Timeout waiting for replica to come up", collection,
        (liveNodes, collectionState) ->
            getNonLeader(collectionState.getSlice("shard1")).getState() == Replica.State.ACTIVE);

    // Run an overseer operation right after the restart; this is the call that fails intermittently.
    CollectionAdminResponse response =
        CollectionAdminRequest.deleteCollection(collection).process(cluster.getSolrClient());
    assertNull("Expected collection-delete to fully succeed", response.getResponse().get("failure"));
  }
{code}

> Overseer actions fail with NoHttpResponseException following a node restart
> ---------------------------------------------------------------------------
>
>                 Key: SOLR-13038
>                 URL: https://issues.apache.org/jira/browse/SOLR-13038
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>    Affects Versions: master (8.0)
>            Reporter: Jason Gerlowski
>            Assignee: Jason Gerlowski
>            Priority: Major
>
> I noticed recently that a lot of overseer operations fail if they're executed right after a restart of a Solr node.  The failure returns a message like "org.apache.solr.client.solrj.SolrServerException:IOException occured when talking to server at: https://127.0.0.1:62253/solr".  The logs are a bit more helpful:
> {code}
> org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: https://127.0.0.1:62253/solr
>     at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:657) ~[java/:?]
>     at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:255) ~[java/:?]
>     at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:244) ~[java/:?]
>     at org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1260) ~[java/:?]
>     at org.apache.solr.handler.component.HttpShardHandler.lambda$submit$0(HttpShardHandler.java:172) ~[java/:?]
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_172]
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_172]
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_172]
>     at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176) ~[metrics-core-3.2.6.jar:3.2.6]
>     at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209) ~[java/:?]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_172]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_172]
>     at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]
> Caused by: org.apache.http.NoHttpResponseException: 127.0.0.1:62253 failed to respond
>     at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:141) ~[httpclient-4.5.6.jar:4.5.6]
>     at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56) ~[httpclient-4.5.6.jar:4.5.6]
>     at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259) ~[httpcore-4.4.10.jar:4.4.10]
>     at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163) ~[httpcore-4.4.10.jar:4.4.10]
>     at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:165) ~[httpclient-4.5.6.jar:4.5.6]
>     at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273) ~[httpcore-4.4.10.jar:4.4.10]
>     at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125) ~[httpcore-4.4.10.jar:4.4.10]
>     at org.apache.solr.util.stats.InstrumentedHttpRequestExecutor.execute(InstrumentedHttpRequestExecutor.java:120) ~[java/:?]
>     at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272) ~[httpclient-4.5.6.jar:4.5.6]
>     at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185) ~[httpclient-4.5.6.jar:4.5.6]
>     at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) ~[httpclient-4.5.6.jar:4.5.6]
>     at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[httpclient-4.5.6.jar:4.5.6]
>     at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[httpclient-4.5.6.jar:4.5.6]
>     at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[httpclient-4.5.6.jar:4.5.6]
>     at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) ~[httpclient-4.5.6.jar:4.5.6]
>     at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:542) ~[java/:?]
>     ... 12 more
> {code}
> After a bit of debugging I was able to confirm the problem: when some non-overseer node gets restarted, the overseer never notices that its connections are invalid and will try to reuse them for subsequent requests that happen right after the restart.
> There are a few ways we might be able to tackle this:
> * We could look at adding logic to {{SolrHttpRequestRetryHandler}} to retry when this happens.  SHRRH already retries NoHttpResponseException generally, but has other logic which prevents any retries on collection/core-admin APIs.  Maybe we could refine this a bit.
> * We could add retry logic to the {{HttpShardHandler}} code that makes these requests.  We could do this across the board, or more selectively for only the overseer commands that are "retry-able".
> * We could tweak how our connection pool is managed so that it evicts these idle connections more aggressively.  It seems like something similar has already been tried (without success) on SOLR-6944.
> Not sure what the right approach is.  Seems like intermittent NoHttpResponseExceptions have been a problem in Solr (and its tests) going back at least 5 years or so.  Several JIRAs suggested adding retries for NHRE in the past, but they were killed because not all APIs are idempotent.  Other JIRAs have been concerned with fixing this at the (very broad) SolrClient level.
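
To make the second option in the description a bit more concrete, here is a rough sketch of a selective retry wrapper. It is not Solr code: {{RetryOnStaleConnection}}, {{call}}, and {{StaleConnectionException}} are hypothetical names, and {{StaleConnectionException}} is only a stand-in for {{org.apache.http.NoHttpResponseException}} so the example compiles with nothing but the JDK. The caller decides which failures count as retryable and should only wrap requests it knows are safe to repeat.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.function.Predicate;

/** Hypothetical helper (not part of Solr): retries a call when a retryable failure occurs. */
final class RetryOnStaleConnection {

  /** Stand-in for org.apache.http.NoHttpResponseException, so no HttpClient jar is needed. */
  static final class StaleConnectionException extends IOException {
    StaleConnectionException(String message) {
      super(message);
    }
  }

  private RetryOnStaleConnection() {}

  /**
   * Runs the request up to maxAttempts times. A failure is retried only if the exception
   * or one of its causes matches isRetryable; anything else is rethrown immediately.
   */
  static <T> T call(Callable<T> request, int maxAttempts, Predicate<Throwable> isRetryable)
      throws Exception {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    Exception last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return request.call();
      } catch (Exception e) {
        if (!causedByRetryable(e, isRetryable)) {
          throw e; // e.g. a non-idempotent or unrelated failure: do not retry
        }
        last = e; // remember the failure and try again with the next attempt
      }
    }
    throw last; // all attempts failed with a retryable error
  }

  /** Walks the cause chain, since the stale-connection error is usually wrapped. */
  static boolean causedByRetryable(Throwable t, Predicate<Throwable> isRetryable) {
    for (Throwable cur = t; cur != null; cur = cur.getCause()) {
      if (isRetryable.test(cur)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) throws Exception {
    int[] calls = {0};
    // The first attempt simulates a request sent over a dead pooled connection.
    String result = call(() -> {
      if (calls[0]++ == 0) {
        throw new StaleConnectionException("127.0.0.1:62253 failed to respond");
      }
      return "ok";
    }, 2, t -> t instanceof StaleConnectionException);
    System.out.println(result + " after " + calls[0] + " attempts"); // prints "ok after 2 attempts"
  }
}
```

The cause-chain check is there because, as in the stack trace above, the NHRE shows up wrapped in a SolrServerException. Whether a given overseer command is actually safe to repeat is a separate question this sketch does not answer.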


