You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by "Andy Throgmorton (Jira)" <ji...@apache.org> on 2021/03/09 00:09:00 UTC
[jira] [Created] (SOLR-15228) Single host in a bad state can block
collection creation for the cluster with autoscaling enabled
Andy Throgmorton created SOLR-15228:
---------------------------------------
Summary: Single host in a bad state can block collection creation for the cluster with autoscaling enabled
Key: SOLR-15228
URL: https://issues.apache.org/jira/browse/SOLR-15228
Project: Solr
Issue Type: Bug
Security Level: Public (Default Security Level. Issues are Public)
Components: AutoScaling
Affects Versions: 8.2
Reporter: Andy Throgmorton
We configured a SolrCloud cluster (running 8.2) with this cluster autoscaling policy:
{noformat}
{
"set-cluster-preferences":[
{
"minimize":"cores",
"precision":5
},
{
"maximize":"freedisk",
"precision":25
},
{
"minimize":"sysLoadAvg",
"precision":10
}],
"set-cluster-policy":[
{
"replica": "<2",
"node": "#ANY"
}],
"set-trigger": {
"name":".auto_add_replicas",
"event":"nodeLost",
"waitFor":"10m",
"enabled":true,
"actions":[
{
"name":"auto_add_replicas_plan",
"class":"solr.AutoAddReplicasPlanAction"},
{
"name":"execute_plan",
"class":"solr.ExecutePlanAction"}]
}
}{noformat}
A node was rebooted at one point, and when that node came back, it had trouble establishing a connection with ZK when it was initializing the CoreContainer. As a result, it returns 404s for (I think?) all admin requests.
Now, any call to create a collection in that cluster throw an error, with this stacktrace:
{noformat}
2021-03-04 12:47:03.615 ERROR (OverseerThreadFactory-141-thread-4-processing-n:HOST_REDACTED:8983_solr) [ ] o.a.s.c.a.c.OverseerCollectionMessageHandler Collection: COLLECTON_REDACTED operation: create failed:org.apache.solr.common.SolrException: Error getting replica locations : unable to get autoscaling policy session
at org.apache.solr.cloud.api.collections.CreateCollectionCmd.call(CreateCollectionCmd.java:195)
at org.apache.solr.cloud.api.collections.OverseerCollectionMessageHandler.processMessage(OverseerCollectionMessageHandler.java:264)
at org.apache.solr.cloud.OverseerTaskProcessor$Runner.run(OverseerTaskProcessor.java:505)
at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: org.apache.solr.common.SolrException: unable to get autoscaling policy session
at org.apache.solr.client.solrj.cloud.autoscaling.PolicyHelper.getReplicaLocations(PolicyHelper.java:129)
at org.apache.solr.cloud.api.collections.Assign.getPositionsUsingPolicy(Assign.java:382)
at org.apache.solr.cloud.api.collections.Assign$PolicyBasedAssignStrategy.assign(Assign.java:630)
at org.apache.solr.cloud.api.collections.CreateCollectionCmd.buildReplicaPositions(CreateCollectionCmd.java:410)
at org.apache.solr.cloud.api.collections.CreateCollectionCmd.call(CreateCollectionCmd.java:190)
... 6 more
Caused by: org.apache.solr.common.SolrException: org.apache.solr.common.SolrException: Error getting remote info
at org.apache.solr.common.cloud.rule.ImplicitSnitch.getTags(ImplicitSnitch.java:78)
at org.apache.solr.client.solrj.impl.SolrClientNodeStateProvider.fetchTagValues(SolrClientNodeStateProvider.java:139)
at org.apache.solr.client.solrj.impl.SolrClientNodeStateProvider.getNodeValues(SolrClientNodeStateProvider.java:128)
at org.apache.solr.client.solrj.cloud.autoscaling.Row.<init>(Row.java:71)
at org.apache.solr.client.solrj.cloud.autoscaling.Policy$Session.<init>(Policy.java:575)
at org.apache.solr.client.solrj.cloud.autoscaling.Policy.createSession(Policy.java:396)
at org.apache.solr.client.solrj.cloud.autoscaling.Policy.createSession(Policy.java:358)
at org.apache.solr.client.solrj.cloud.autoscaling.PolicyHelper$SessionRef.createSession(PolicyHelper.java:492)
at org.apache.solr.client.solrj.cloud.autoscaling.PolicyHelper$SessionRef.get(PolicyHelper.java:457)
at org.apache.solr.client.solrj.cloud.autoscaling.PolicyHelper.getSession(PolicyHelper.java:513)
at org.apache.solr.client.solrj.cloud.autoscaling.PolicyHelper.getReplicaLocations(PolicyHelper.java:127)
... 10 more
Caused by: org.apache.solr.common.SolrException: Error getting remote info
at org.apache.solr.client.solrj.impl.SolrClientNodeStateProvider$AutoScalingSnitch.getRemoteInfo(SolrClientNodeStateProvider.java:364)
at org.apache.solr.common.cloud.rule.ImplicitSnitch.getTags(ImplicitSnitch.java:76)
... 20 more
Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at https://HOSTNAME_REDACTED:8983/solr: Expected mime type application/octet-stream but got text/html. <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404</h2>
<p>Problem accessing /solr/admin/metrics. Reason:
<pre> Not Found</pre></p><h3>Caused by:</h3><pre>javax.servlet.ServletException: javax.servlet.UnavailableException: Error processing the request. CoreContainer is either not initialized or shutting down.
at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:168)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
at org.eclipse.jetty.server.Server.handle(Server.java:505)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:370)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:267)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
at org.eclipse.jetty.io.ssl.SslConnection$DecryptedEndPoint.onFillable(SslConnection.java:427)
at org.eclipse.jetty.io.ssl.SslConnection.onFillable(SslConnection.java:321)
at org.eclipse.jetty.io.ssl.SslConnection$2.succeeded(SslConnection.java:159)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:781)
at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:917)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: javax.servlet.UnavailableException: Error processing the request. CoreContainer is either not initialized or shutting down.
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:369)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:350)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1602)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1711)
at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1347)
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1678)
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1249)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220)
at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:152)
... 21 more
</pre>
<h3>Caused by:</h3><pre>javax.servlet.UnavailableException: Error processing the request. CoreContainer is either not initialized or shutting down.
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:369)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:350)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1602)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1711)
at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1347)
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1678)
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1249)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220)
at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:152)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
at org.eclipse.jetty.server.Server.handle(Server.java:505)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:370)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:267)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
at org.eclipse.jetty.io.ssl.SslConnection$DecryptedEndPoint.onFillable(SslConnection.java:427)
at org.eclipse.jetty.io.ssl.SslConnection.onFillable(SslConnection.java:321)
at org.eclipse.jetty.io.ssl.SslConnection$2.succeeded(SslConnection.java:159)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
at org.eclipse.jetty....{noformat}
I looked through the Solr code and to me it looks like:
* Client asks to create collection (CreateCollectionCmd)
* PolicyHelper.getReplicaLocations tries to build a map of where every replica is
* To do that, it creates a SessionRef, which needs to populate its cache first
* SessionRef attempts to collect all metrics, [including metrics from every node|https://github.com/apache/lucene-solr/blob/branch_8_8/solr/solrj/src/java/org/apache/solr/client/solrj/cloud/autoscaling/Policy.java#L583] (or {{Row}})
* SolrClientNodeStateProvider$AutoScalingSnitch.getRemoteInfo makes the remote call
** It will retry on certain errors (see below), but not for this error ({{HttpSolrClient$RemoteSolrException}}), which bubbles up and fails the request
** [https://github.com/apache/lucene-solr/blob/branch_8_8/solr/solrj/src/java/org/apache/solr/client/solrj/impl/SolrClientNodeStateProvider.java#L310-L338]
I realize this autoscaling code is gone in 9.x, but at least wanted to report this issue for documentation purposes, in case others see this.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org