You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by "Andy Throgmorton (Jira)" <ji...@apache.org> on 2021/03/09 00:09:00 UTC

[jira] [Created] (SOLR-15228) Single host in a bad state can block collection creation for the cluster with autoscaling enabled

Andy Throgmorton created SOLR-15228:
---------------------------------------

             Summary: Single host in a bad state can block collection creation for the cluster with autoscaling enabled
                 Key: SOLR-15228
                 URL: https://issues.apache.org/jira/browse/SOLR-15228
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
          Components: AutoScaling
    Affects Versions: 8.2
            Reporter: Andy Throgmorton


We configured a SolrCloud cluster (running 8.2) with this cluster autoscaling policy:
{noformat}
{
  "set-cluster-preferences":[
    {
      "minimize":"cores",
      "precision":5
    },
    {
      "maximize":"freedisk",
      "precision":25
    },
    {
      "minimize":"sysLoadAvg",
      "precision":10
    }],
  "set-cluster-policy":[
    {
      "replica": "<2",
      "node": "#ANY"
    }],
  "set-trigger": {
    "name":".auto_add_replicas",
    "event":"nodeLost",
    "waitFor":"10m",
    "enabled":true,
    "actions":[
      {
        "name":"auto_add_replicas_plan",
        "class":"solr.AutoAddReplicasPlanAction"},
      {
        "name":"execute_plan",
        "class":"solr.ExecutePlanAction"}]
  }
}{noformat}
A node was rebooted at one point, and when that node came back, it had trouble establishing a connection with ZK when it was initializing the CoreContainer. As a result, it returns 404s for (I think?) all admin requests.

Now, any call to create a collection in that cluster throw an error, with this stacktrace:
{noformat}
2021-03-04 12:47:03.615 ERROR (OverseerThreadFactory-141-thread-4-processing-n:HOST_REDACTED:8983_solr) [   ] o.a.s.c.a.c.OverseerCollectionMessageHandler Collection: COLLECTON_REDACTED operation: create failed:org.apache.solr.common.SolrException: Error getting replica locations : unable to get autoscaling policy session
    at org.apache.solr.cloud.api.collections.CreateCollectionCmd.call(CreateCollectionCmd.java:195)
    at org.apache.solr.cloud.api.collections.OverseerCollectionMessageHandler.processMessage(OverseerCollectionMessageHandler.java:264)
    at org.apache.solr.cloud.OverseerTaskProcessor$Runner.run(OverseerTaskProcessor.java:505)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: org.apache.solr.common.SolrException: unable to get autoscaling policy session
    at org.apache.solr.client.solrj.cloud.autoscaling.PolicyHelper.getReplicaLocations(PolicyHelper.java:129)
    at org.apache.solr.cloud.api.collections.Assign.getPositionsUsingPolicy(Assign.java:382)
    at org.apache.solr.cloud.api.collections.Assign$PolicyBasedAssignStrategy.assign(Assign.java:630)
    at org.apache.solr.cloud.api.collections.CreateCollectionCmd.buildReplicaPositions(CreateCollectionCmd.java:410)
    at org.apache.solr.cloud.api.collections.CreateCollectionCmd.call(CreateCollectionCmd.java:190)
    ... 6 more
Caused by: org.apache.solr.common.SolrException: org.apache.solr.common.SolrException: Error getting remote info
    at org.apache.solr.common.cloud.rule.ImplicitSnitch.getTags(ImplicitSnitch.java:78)
    at org.apache.solr.client.solrj.impl.SolrClientNodeStateProvider.fetchTagValues(SolrClientNodeStateProvider.java:139)
    at org.apache.solr.client.solrj.impl.SolrClientNodeStateProvider.getNodeValues(SolrClientNodeStateProvider.java:128)
    at org.apache.solr.client.solrj.cloud.autoscaling.Row.<init>(Row.java:71)
    at org.apache.solr.client.solrj.cloud.autoscaling.Policy$Session.<init>(Policy.java:575)
    at org.apache.solr.client.solrj.cloud.autoscaling.Policy.createSession(Policy.java:396)
    at org.apache.solr.client.solrj.cloud.autoscaling.Policy.createSession(Policy.java:358)
    at org.apache.solr.client.solrj.cloud.autoscaling.PolicyHelper$SessionRef.createSession(PolicyHelper.java:492)
    at org.apache.solr.client.solrj.cloud.autoscaling.PolicyHelper$SessionRef.get(PolicyHelper.java:457)
    at org.apache.solr.client.solrj.cloud.autoscaling.PolicyHelper.getSession(PolicyHelper.java:513)
    at org.apache.solr.client.solrj.cloud.autoscaling.PolicyHelper.getReplicaLocations(PolicyHelper.java:127)
    ... 10 more
Caused by: org.apache.solr.common.SolrException: Error getting remote info
    at org.apache.solr.client.solrj.impl.SolrClientNodeStateProvider$AutoScalingSnitch.getRemoteInfo(SolrClientNodeStateProvider.java:364)
    at org.apache.solr.common.cloud.rule.ImplicitSnitch.getTags(ImplicitSnitch.java:76)
    ... 20 more
Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at https://HOSTNAME_REDACTED:8983/solr: Expected mime type application/octet-stream but got text/html. <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404</h2>
<p>Problem accessing /solr/admin/metrics. Reason:
<pre>    Not Found</pre></p><h3>Caused by:</h3><pre>javax.servlet.ServletException: javax.servlet.UnavailableException: Error processing the request. CoreContainer is either not initialized or shutting down.
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:168)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.eclipse.jetty.server.Server.handle(Server.java:505)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:370)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:267)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
    at org.eclipse.jetty.io.ssl.SslConnection$DecryptedEndPoint.onFillable(SslConnection.java:427)
    at org.eclipse.jetty.io.ssl.SslConnection.onFillable(SslConnection.java:321)
    at org.eclipse.jetty.io.ssl.SslConnection$2.succeeded(SslConnection.java:159)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
    at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
    at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:781)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:917)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: javax.servlet.UnavailableException: Error processing the request. CoreContainer is either not initialized or shutting down.
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:369)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:350)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1602)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1711)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1347)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1678)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1249)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:152)
    ... 21 more
</pre>
<h3>Caused by:</h3><pre>javax.servlet.UnavailableException: Error processing the request. CoreContainer is either not initialized or shutting down.
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:369)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:350)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1602)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1711)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1347)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1678)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1249)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:152)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.eclipse.jetty.server.Server.handle(Server.java:505)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:370)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:267)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
    at org.eclipse.jetty.io.ssl.SslConnection$DecryptedEndPoint.onFillable(SslConnection.java:427)
    at org.eclipse.jetty.io.ssl.SslConnection.onFillable(SslConnection.java:321)
    at org.eclipse.jetty.io.ssl.SslConnection$2.succeeded(SslConnection.java:159)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
    at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
    at org.eclipse.jetty....{noformat}
I looked through the Solr code and to me it looks like:
 * Client asks to create collection (CreateCollectionCmd)
 * PolicyHelper.getReplicaLocations tries to build a map of where every replica is
 * To do that, it creates a SessionRef, which needs to populate its cache first
 * SessionRef attempts to collect all metrics, [including metrics from every node|https://github.com/apache/lucene-solr/blob/branch_8_8/solr/solrj/src/java/org/apache/solr/client/solrj/cloud/autoscaling/Policy.java#L583] (or {{Row}})
 * SolrClientNodeStateProvider$AutoScalingSnitch.getRemoteInfo makes the remote call
 ** It will retry on certain errors (see below), but not for this error ({{HttpSolrClient$RemoteSolrException}}), which bubbles up and fails the request
 ** [https://github.com/apache/lucene-solr/blob/branch_8_8/solr/solrj/src/java/org/apache/solr/client/solrj/impl/SolrClientNodeStateProvider.java#L310-L338]

 

I realize this autoscaling code is gone in 9.x, but at least wanted to report this issue for documentation purposes, in case others see this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org