Posted to issues@lucene.apache.org by "Andy Throgmorton (Jira)" <ji...@apache.org> on 2021/03/09 00:11:00 UTC

[jira] [Resolved] (SOLR-15228) Single host in a bad state can block collection creation for the cluster with autoscaling enabled

     [ https://issues.apache.org/jira/browse/SOLR-15228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Throgmorton resolved SOLR-15228.
-------------------------------------
    Resolution: Duplicate

I guess Jira created a duplicate issue when I hit refresh?

> Single host in a bad state can block collection creation for the cluster with autoscaling enabled
> -------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-15228
>                 URL: https://issues.apache.org/jira/browse/SOLR-15228
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: AutoScaling
>    Affects Versions: 8.2
>            Reporter: Andy Throgmorton
>            Priority: Minor
>
> We configured a SolrCloud cluster (running 8.2) with this cluster autoscaling policy:
> {noformat}
> {
>   "set-cluster-preferences":[
>     {
>       "minimize":"cores",
>       "precision":5
>     },
>     {
>       "maximize":"freedisk",
>       "precision":25
>     },
>     {
>       "minimize":"sysLoadAvg",
>       "precision":10
>     }],
>   "set-cluster-policy":[
>     {
>       "replica": "<2",
>       "node": "#ANY"
>     }],
>   "set-trigger": {
>     "name":".auto_add_replicas",
>     "event":"nodeLost",
>     "waitFor":"10m",
>     "enabled":true,
>     "actions":[
>       {
>         "name":"auto_add_replicas_plan",
>         "class":"solr.AutoAddReplicasPlanAction"},
>       {
>         "name":"execute_plan",
>         "class":"solr.ExecutePlanAction"}]
>   }
> }{noformat}
> A node was rebooted at one point, and when it came back, it had trouble establishing a connection with ZK while initializing the CoreContainer. As a result, it returns 404s for (I think?) all admin requests.
> Now, any call to create a collection in that cluster throws an error, with this stack trace:
> {noformat}
> 2021-03-04 12:47:03.615 ERROR (OverseerThreadFactory-141-thread-4-processing-n:HOST_REDACTED:8983_solr) [   ] o.a.s.c.a.c.OverseerCollectionMessageHandler Collection: COLLECTON_REDACTED operation: create failed:org.apache.solr.common.SolrException: Error getting replica locations : unable to get autoscaling policy session
>     at org.apache.solr.cloud.api.collections.CreateCollectionCmd.call(CreateCollectionCmd.java:195)
>     at org.apache.solr.cloud.api.collections.OverseerCollectionMessageHandler.processMessage(OverseerCollectionMessageHandler.java:264)
>     at org.apache.solr.cloud.OverseerTaskProcessor$Runner.run(OverseerTaskProcessor.java:505)
>     at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>     at java.base/java.lang.Thread.run(Thread.java:834)
> Caused by: org.apache.solr.common.SolrException: unable to get autoscaling policy session
>     at org.apache.solr.client.solrj.cloud.autoscaling.PolicyHelper.getReplicaLocations(PolicyHelper.java:129)
>     at org.apache.solr.cloud.api.collections.Assign.getPositionsUsingPolicy(Assign.java:382)
>     at org.apache.solr.cloud.api.collections.Assign$PolicyBasedAssignStrategy.assign(Assign.java:630)
>     at org.apache.solr.cloud.api.collections.CreateCollectionCmd.buildReplicaPositions(CreateCollectionCmd.java:410)
>     at org.apache.solr.cloud.api.collections.CreateCollectionCmd.call(CreateCollectionCmd.java:190)
>     ... 6 more
> Caused by: org.apache.solr.common.SolrException: org.apache.solr.common.SolrException: Error getting remote info
>     at org.apache.solr.common.cloud.rule.ImplicitSnitch.getTags(ImplicitSnitch.java:78)
>     at org.apache.solr.client.solrj.impl.SolrClientNodeStateProvider.fetchTagValues(SolrClientNodeStateProvider.java:139)
>     at org.apache.solr.client.solrj.impl.SolrClientNodeStateProvider.getNodeValues(SolrClientNodeStateProvider.java:128)
>     at org.apache.solr.client.solrj.cloud.autoscaling.Row.<init>(Row.java:71)
>     at org.apache.solr.client.solrj.cloud.autoscaling.Policy$Session.<init>(Policy.java:575)
>     at org.apache.solr.client.solrj.cloud.autoscaling.Policy.createSession(Policy.java:396)
>     at org.apache.solr.client.solrj.cloud.autoscaling.Policy.createSession(Policy.java:358)
>     at org.apache.solr.client.solrj.cloud.autoscaling.PolicyHelper$SessionRef.createSession(PolicyHelper.java:492)
>     at org.apache.solr.client.solrj.cloud.autoscaling.PolicyHelper$SessionRef.get(PolicyHelper.java:457)
>     at org.apache.solr.client.solrj.cloud.autoscaling.PolicyHelper.getSession(PolicyHelper.java:513)
>     at org.apache.solr.client.solrj.cloud.autoscaling.PolicyHelper.getReplicaLocations(PolicyHelper.java:127)
>     ... 10 more
> Caused by: org.apache.solr.common.SolrException: Error getting remote info
>     at org.apache.solr.client.solrj.impl.SolrClientNodeStateProvider$AutoScalingSnitch.getRemoteInfo(SolrClientNodeStateProvider.java:364)
>     at org.apache.solr.common.cloud.rule.ImplicitSnitch.getTags(ImplicitSnitch.java:76)
>     ... 20 more
> Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at https://HOSTNAME_REDACTED:8983/solr: Expected mime type application/octet-stream but got text/html. <html>
> <head>
> <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
> <title>Error 404 Not Found</title>
> </head>
> <body><h2>HTTP ERROR 404</h2>
> <p>Problem accessing /solr/admin/metrics. Reason:
> <pre>    Not Found</pre></p><h3>Caused by:</h3><pre>javax.servlet.ServletException: javax.servlet.UnavailableException: Error processing the request. CoreContainer is either not initialized or shutting down.
>     at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:168)
>     at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
>     at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
>     at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
>     at org.eclipse.jetty.server.Server.handle(Server.java:505)
>     at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:370)
>     at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:267)
>     at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
>     at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
>     at org.eclipse.jetty.io.ssl.SslConnection$DecryptedEndPoint.onFillable(SslConnection.java:427)
>     at org.eclipse.jetty.io.ssl.SslConnection.onFillable(SslConnection.java:321)
>     at org.eclipse.jetty.io.ssl.SslConnection$2.succeeded(SslConnection.java:159)
>     at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
>     at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
>     at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
>     at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
>     at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
>     at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
>     at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
>     at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:781)
>     at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:917)
>     at java.base/java.lang.Thread.run(Thread.java:834)
> Caused by: javax.servlet.UnavailableException: Error processing the request. CoreContainer is either not initialized or shutting down.
>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:369)
>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:350)
>     at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1602)
>     at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540)
>     at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)
>     at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
>     at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
>     at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
>     at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1711)
>     at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
>     at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1347)
>     at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
>     at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480)
>     at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1678)
>     at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
>     at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1249)
>     at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
>     at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220)
>     at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:152)
>     ... 21 more
> </pre>
> <h3>Caused by:</h3><pre>javax.servlet.UnavailableException: Error processing the request. CoreContainer is either not initialized or shutting down.
>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:369)
>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:350)
>     at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1602)
>     at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540)
>     at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)
>     at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
>     at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
>     at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
>     at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1711)
>     at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
>     at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1347)
>     at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
>     at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480)
>     at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1678)
>     at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
>     at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1249)
>     at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
>     at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220)
>     at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:152)
>     at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
>     at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
>     at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
>     at org.eclipse.jetty.server.Server.handle(Server.java:505)
>     at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:370)
>     at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:267)
>     at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
>     at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
>     at org.eclipse.jetty.io.ssl.SslConnection$DecryptedEndPoint.onFillable(SslConnection.java:427)
>     at org.eclipse.jetty.io.ssl.SslConnection.onFillable(SslConnection.java:321)
>     at org.eclipse.jetty.io.ssl.SslConnection$2.succeeded(SslConnection.java:159)
>     at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
>     at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
>     at org.eclipse.jetty....{noformat}
> I looked through the Solr code and to me it looks like:
>  * Client asks to create collection (CreateCollectionCmd)
>  * PolicyHelper.getReplicaLocations tries to build a map of where every replica is
>  * To do that, it creates a SessionRef, which needs to populate its cache first
>  * SessionRef attempts to collect all metrics, [including metrics from every node|https://github.com/apache/lucene-solr/blob/branch_8_8/solr/solrj/src/java/org/apache/solr/client/solrj/cloud/autoscaling/Policy.java#L583] (each node is a {{Row}} in the code)
>  * SolrClientNodeStateProvider$AutoScalingSnitch.getRemoteInfo makes the remote call
>  ** It will retry on certain errors (see below), but not for this error ({{HttpSolrClient$RemoteSolrException}}), which bubbles up and fails the request
>  ** [https://github.com/apache/lucene-solr/blob/branch_8_8/solr/solrj/src/java/org/apache/solr/client/solrj/impl/SolrClientNodeStateProvider.java#L310-L338]
>  
> I realize this autoscaling code is gone in 9.x, but at least wanted to report this issue for documentation purposes, in case others see this.
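
The failure mode described in the list above can be illustrated with a minimal Java sketch. It is not Solr's code: the class NodeMetricsSketch, the exception RemoteNodeException, and the methods collectStrict, collectTolerant and fakeFetch are all hypothetical names made up for this example. It only shows why one node that fails a metrics fetch aborts the whole operation when the error propagates, and what a skip-the-bad-node alternative might look like, assuming that leaving such a node out of placement would be acceptable.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class NodeMetricsSketch {

    /** Hypothetical stand-in for the non-retried remote error (e.g. a 404 from /admin/metrics). */
    static class RemoteNodeException extends RuntimeException {
        RemoteNodeException(String message) {
            super(message);
        }
    }

    /** Strict collection: the first failing node aborts everything, as in the report. */
    static Map<String, Integer> collectStrict(List<String> nodes,
                                              Function<String, Integer> fetchCores) {
        Map<String, Integer> rows = new LinkedHashMap<>();
        for (String node : nodes) {
            // Any exception thrown by the fetch propagates to the caller.
            rows.put(node, fetchCores.apply(node));
        }
        return rows;
    }

    /** Tolerant collection: a failing node is skipped and the remaining nodes are still returned. */
    static Map<String, Integer> collectTolerant(List<String> nodes,
                                                Function<String, Integer> fetchCores) {
        Map<String, Integer> rows = new LinkedHashMap<>();
        for (String node : nodes) {
            try {
                rows.put(node, fetchCores.apply(node));
            } catch (RemoteNodeException e) {
                // Assumption: excluding the unreachable node from placement is acceptable.
            }
        }
        return rows;
    }

    /** Hypothetical fetcher: "bad" simulates the node whose CoreContainer never initialized. */
    static int fakeFetch(String node) {
        if (node.equals("bad")) {
            throw new RemoteNodeException("Error 404 Not Found from " + node);
        }
        return node.length(); // placeholder value standing in for a real metric
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("node1", "bad", "node22");
        System.out.println("tolerant: " + collectTolerant(nodes, NodeMetricsSketch::fakeFetch));
        try {
            collectStrict(nodes, NodeMetricsSketch::fakeFetch);
        } catch (RemoteNodeException e) {
            System.out.println("strict failed: " + e.getMessage());
        }
    }
}
```

In this sketch the strict variant mirrors the reported behavior (one bad host blocks the request for the whole cluster), while the tolerant variant returns values for the two healthy nodes only. Whether skipping a node is safe for a real placement policy is a separate question that the sketch does not answer.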


