You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Hoss Man (Jira)" <ji...@apache.org> on 2019/08/22 01:35:00 UTC

[jira] [Updated] (SOLR-13709) Race condition on core reload while core is still loading?

     [ https://issues.apache.org/jira/browse/SOLR-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man updated SOLR-13709:
----------------------------
    Attachment: apache_Lucene-Solr-Tests-8.x_449.log.txt
        Status: Open  (was: Open)

I've attached the logs from the jenkins run in question...

Interestingly: Even though the logs indicate several problems in trying to reload/unload the SolrCore, the test itself didn't seem to care enough about the state of the collection to notice the problems – the only junit failure recorded was a suite level failure from the ObjectTracker due to unreleased threads/objects.

The first sign of trouble in the logs is this WARN from a ZK watcher registered to monitor the schema for changes (in schemaless mode) in order to re-load the SolrCore – it fails with an NullPointerException...
{noformat}
   [junit4]   2> 4309877 WARN  (Thread-7591) [n:localhost:38920_solr c:testCloudExamplePrompt s:shard2 r:core_node7 x:testCloudExamplePrompt_shard2_replica_n4 ] o.a.s.c.ZkController liste
ner throws error
   [junit4]   2>           => org.apache.solr.common.SolrException: Unable to reload core [testCloudExamplePrompt_shard2_replica_n6]
   [junit4]   2>        at org.apache.solr.core.CoreContainer.reload(CoreContainer.java:1557)
   [junit4]   2> org.apache.solr.common.SolrException: Unable to reload core [testCloudExamplePrompt_shard2_replica_n6]
   [junit4]   2>        at org.apache.solr.core.CoreContainer.reload(CoreContainer.java:1557) ~[java/:?]
   [junit4]   2>        at org.apache.solr.core.SolrCore.lambda$getConfListener$21(SolrCore.java:3099) ~[java/:?]
   [junit4]   2>        at org.apache.solr.cloud.ZkController.lambda$fireEventListeners$14(ZkController.java:2514) ~[java/:?]
   [junit4]   2>        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]
   [junit4]   2> Caused by: java.lang.NullPointerException
   [junit4]   2>        at org.apache.solr.core.CoreDescriptor.<init>(CoreDescriptor.java:172) ~[java/:?]
   [junit4]   2>        at org.apache.solr.core.SolrCore.reload(SolrCore.java:683) ~[java/:?]
   [junit4]   2>        at org.apache.solr.core.CoreContainer.reload(CoreContainer.java:1507) ~[java/:?]
   [junit4]   2>        ... 3 more
{noformat}
...AFAICT the only way this NPE is possily is if the CoreDescriptor for the _original_ SolrCore is NULL at the time the watcher fires, and the only concievable way that seems to be possible is if the original SolrCore hadn't completley finished loading.

Aparently as a result of this failure to reload, a SolrCoreInitializationException is recorded for the core name, and that ultimately causes a fast-failure response when trying to unload the core...
{noformat}
   [junit4]   2> 4310314 ERROR (qtp373709619-50629) [n:localhost:38920_solr    x:testCloudExamplePrompt_shard2_replica_n6 ] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException
: Error unregistering core [testCloudExamplePrompt_shard2_replica_n6] from cloud state
   [junit4]   2>        at org.apache.solr.core.CoreContainer.unload(CoreContainer.java:1672)
   [junit4]   2>        at org.apache.solr.handler.admin.CoreAdminOperation.lambda$static$1(CoreAdminOperation.java:105)
   [junit4]   2>        at org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:360)
   [junit4]   2>        at org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:397)
   [junit4]   2>        at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:181)
   [junit4]   2>        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:200)
   [junit4]   2>        at org.apache.solr.servlet.HttpSolrCall.handleAdmin(HttpSolrCall.java:820)
   [junit4]   2>        at org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:786)
   [junit4]   2>        at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:546)
   [junit4]   2>        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:423)
   [junit4]   2>        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:350)
   [junit4]   2>        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)
   [junit4]   2>        at org.apache.solr.client.solrj.embedded.JettySolrRunner$DebugFilter.doFilter(JettySolrRunner.java:165)
   [junit4]   2>        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)
   [junit4]   2>        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540)
   [junit4]   2>        at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
   [junit4]   2>        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1711)
   [junit4]   2>        at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
   [junit4]   2>        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1347)
   [junit4]   2>        at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
   [junit4]   2>        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480)
   [junit4]   2>        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1678)
   [junit4]   2>        at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
   [junit4]   2>        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1249)
   [junit4]   2>        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
   [junit4]   2>        at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:703)
   [junit4]   2>        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
   [junit4]   2>        at org.eclipse.jetty.server.Server.handle(Server.java:505)
   [junit4]   2>        at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:370)
   [junit4]   2>        at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:267)
   [junit4]   2>        at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
   [junit4]   2>        at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
   [junit4]   2>        at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
   [junit4]   2>        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:781)
   [junit4]   2>        at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:917)
   [junit4]   2>        at java.lang.Thread.run(Thread.java:748)
   [junit4]   2> Caused by: org.apache.solr.core.SolrCoreInitializationException: SolrCore 'testCloudExamplePrompt_shard2_replica_n6' is not available due to init failure: null
   [junit4]   2>        at org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:1740)
   [junit4]   2>        at org.apache.solr.cloud.ShardLeaderElectionContext.cancelElection(ElectionContext.java:294)
   [junit4]   2>        at org.apache.solr.cloud.ZkController.unregister(ZkController.java:1658)
   [junit4]   2>        at org.apache.solr.cloud.ZkController.unregister(ZkController.java:1637)
   [junit4]   2>        at org.apache.solr.core.CoreContainer.unload(CoreContainer.java:1665)
   [junit4]   2>        ... 35 more
   [junit4]   2> Caused by: java.lang.NullPointerException
   [junit4]   2>        at org.apache.solr.core.CoreDescriptor.<init>(CoreDescriptor.java:172)
   [junit4]   2>        at org.apache.solr.core.SolrCore.reload(SolrCore.java:683)
   [junit4]   2>        at org.apache.solr.core.CoreContainer.reload(CoreContainer.java:1507)
   [junit4]   2>        at org.apache.solr.core.SolrCore.lambda$getConfListener$21(SolrCore.java:3099)
   [junit4]   2>        at org.apache.solr.cloud.ZkController.lambda$fireEventListeners$14(ZkController.java:2514)
   [junit4]   2>        ... 1 more
   [junit4]   2> 
{noformat}
As i mentioned before, none of this caused any failures directly from {{TestSolrCLIRunExample.testInteractiveSolrCloudExample}} (the minimal checks it does that the collection was created and the one doc indexed can be found succeeded, likeley because they were handled by other cores) but the main point i want to draw attention to is that this situation even seems to be possible.
----
FWIW: I briefly looked into the code to try and understand where/why a SolrCore could be allowed to "reload" itself if it hadn't finished loading (regardless of wether SolrCloud was in use) and the first thing thta jumped out at me was the javadocs for {{SolrCores.getCoreDescriptor(String)}} (which is what {{SolrCore.getCoreDescriptor()}} ultimately delegates too, and would be the source of the 'null' that causes the NPE: {{SolrCore.reload(...)}} calls {{getCoreDescriptor()}} and passes the – evidently null – result to the copy constructor of {{new CoreDescriptor(...)}}) ...
{code:java}
  /**
   * Return the CoreDescriptor corresponding to a given core name.
   * Blocks if the SolrCore is still loading until it is ready.
   * @param coreName the name of the core
   * @return the CoreDescriptor
   */
  public CoreDescriptor getCoreDescriptor(String coreName) {
    synchronized (modifyLock) {
      if (residentDesciptors.containsKey(coreName))
        return residentDesciptors.get(coreName);
      return getTransientCacheHandler().getTransientDescriptor(coreName);
    }
  }
{code}
Note the javadocs say this method will "Blocks if the SolrCore is still loading until it is ready." – but there doesn't seem to be any (obvious) evidence of that, unless the expecation is that {{modifyLock}} will be held for the durration of a SolrCore initialization (but i don't see any sign/intent of that either).

[~markrmiller@gmail.com] : you added that comment to the javadocs as part of SOLR-7361, but made no other changes to the method so i'm not clear what/where it was being enforced at the time you wrote it.

[~erickerickson] : pinging you as well since you've done a lot of work on (transient) SolrCore loading in the past few years, and wondering if you have any insight into where/how this could happen.

> Race condition on core reload while core is still loading?
> ----------------------------------------------------------
>
>                 Key: SOLR-13709
>                 URL: https://issues.apache.org/jira/browse/SOLR-13709
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>            Priority: Major
>         Attachments: apache_Lucene-Solr-Tests-8.x_449.log.txt
>
>
> A recent jenkins failure from {{TestSolrCLIRunExample}} seems to suggest that there may be a race condition when attempting to re-load a SolrCore while the core is currently in the process of (re)loading that can leave the SolrCore in an unusable state.
> Details to follow...



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org