You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@solr.apache.org by "Eduard Moraru (Jira)" <ji...@apache.org> on 2021/03/12 15:23:00 UTC

[jira] [Commented] (SOLR-15254) Random deadlock during cores initialization

    [ https://issues.apache.org/jira/browse/SOLR-15254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17300376#comment-17300376 ] 

Eduard Moraru commented on SOLR-15254:
--------------------------------------

On a separate discussion, [~tmortagne] pointed out that those threads are in {{RUNNABLE}} state, but with {{in Object.wait() [0x00007fb5717ec000]}}.

Following this lead, I've come to [http://ternarysearch.blogspot.com/2013/07/static-initialization-deadlock.html] which explains static initialization deadlocks and I am suspecting (without properly knowing the Solr code) that this might be the case here as well.

The second clue is the fact that {{"coreLoadExecutor-9-thread-2"}} is featuring this segment in its stracktrace {{org.apache.lucene.analysis.util.TokenFilterFactory.<clinit>(TokenFilterFactory.java:33)}} which would correspond to a static initialization block.

Tracking this down, I can see it leading to this code, as part of Solr 8.5.1:
{noformat}
  private static final AnalysisSPILoader<TokenFilterFactory> loader =
      new AnalysisSPILoader<>(TokenFilterFactory.class,
          new String[] { "TokenFilterFactory", "FilterFactory" });
{noformat}
(https://github.com/apache/solr/blob/edb9fc409398f2c3446883f9f80595c884d245d0/lucene/analysis/common/src/java/org/apache/lucene/analysis/util/TokenFilterFactory.java#L33-L35)

.. and then we can find further down the code an example of a subclass of {{TokenFilterFactory}} that might contribute to the static initialization deadlock if initialized by a different thread:
{noformat}
public abstract class BaseManagedTokenFilterFactory extends TokenFilterFactory 
  implements ResourceLoaderAware, ManagedResourceObserver
{noformat}
(https://github.com/apache/solr/blob/edb9fc409398f2c3446883f9f80595c884d245d0/solr/core/src/java/org/apache/solr/rest/schema/analysis/BaseManagedTokenFilterFactory.java#L35)

As we can see from the thread stacks, all 3 threads are stuck at trying to access either fields of a class ({{java.lang.reflect.Field.get(java.base@11.0.9.1/Field.java:418)}}) or to actually get a Class object ({{java.lang.Class.forName(java.base@11.0.9.1/Class.java:398)}}) so the chances of dealing with a static initialization deadlock seem pretty high.

Looking at the latest version of Solr, I see the static block in {{TokenFilterFactory}} has been modified to no longer use those "prefixes" (and also some code was moved around in different repositories):
{noformat}
  private static final AnalysisSPILoader<TokenFilterFactory> loader =
      new AnalysisSPILoader<>(TokenFilterFactory.class);
{noformat}
(https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/analysis/TokenFilterFactory.java)
... but I am not sure if this helps in any way for dealing with the potential static init deadlock issue. I am guessing it's not, since the behavior is pretty much the same.

Could anyone with more knowledge of the Solr code confirm this?

Thanks!

> Random deadlock during cores initialization
> -------------------------------------------
>
>                 Key: SOLR-15254
>                 URL: https://issues.apache.org/jira/browse/SOLR-15254
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>    Affects Versions: 8.5.1
>         Environment: openjdk 11.0.9.1
>            Reporter: Thomas Mortagne
>            Priority: Major
>
> At XWiki we are using Solr in an embbedded way with 5 cores, and sometimes we end up with what appears to be a race condition leading to a deadlock between several of the cores's configuration loading while loading some required classes.
> Here is an extract of a thread dump (also attached is a more complete version of the thread dump) we get in such a case:
> {noformat}
> "coreLoadExecutor-9-thread-1" #41 prio=5 os_prio=0 cpu=305.06ms elapsed=101.24s tid=0x00007fb5cfc8f800 nid=0x57a in Object.wait()  [0x00007fb5717ec000]
>    java.lang.Thread.State: RUNNABLE
> 	at java.lang.Class.forName0(java.base@11.0.9.1/Native Method)
> 	at java.lang.Class.forName(java.base@11.0.9.1/Class.java:398)
> 	at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:554)
> 	at org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:639)
> 	at org.apache.solr.schema.FieldTypePluginLoader$3.create(FieldTypePluginLoader.java:339)
> 	at org.apache.solr.schema.FieldTypePluginLoader$3.create(FieldTypePluginLoader.java:332)
> 	at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:152)
> 	at org.apache.solr.schema.FieldTypePluginLoader.readAnalyzer(FieldTypePluginLoader.java:356)
> 	at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:90)
> 	at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:48)
> 	at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:152)
> 	at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:502)
> 	at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:178)
> 	at org.apache.solr.schema.ManagedIndexSchema.<init>(ManagedIndexSchema.java:105)
> 	at org.apache.solr.schema.ManagedIndexSchemaFactory.create(ManagedIndexSchemaFactory.java:178)
> 	at org.apache.solr.schema.ManagedIndexSchemaFactory.create(ManagedIndexSchemaFactory.java:46)
> 	at org.apache.solr.core.ConfigSetService.createIndexSchema(ConfigSetService.java:145)
> 	at org.apache.solr.core.ConfigSetService.loadConfigSet(ConfigSetService.java:83)
> 	at org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1302)
> 	at org.apache.solr.core.CoreContainer.lambda$load$13(CoreContainer.java:802)
> 	at org.apache.solr.core.CoreContainer$$Lambda$569/0x0000000840bc7040.call(Unknown Source)
> 	at org.glowroot.agent.plugin.executor.CallableWrapper.call(CallableWrapper.java:38)
> 	at com.codahale.metrics.InstrumentedExecutorService$InstrumentedCallable.call(InstrumentedExecutorService.java:202)
> 	at java.util.concurrent.FutureTask.run(java.base@11.0.9.1/FutureTask.java:264)
> 	at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:210)
> 	at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$$Lambda$438/0x0000000840b29440.run(Unknown Source)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.9.1/ThreadPoolExecutor.java:1128)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.9.1/ThreadPoolExecutor.java:628)
> 	at java.lang.Thread.run(java.base@11.0.9.1/Thread.java:834)
> "coreLoadExecutor-9-thread-3" #43 prio=5 os_prio=0 cpu=294.37ms elapsed=101.24s tid=0x00007fb5cfcdf800 nid=0x57c in Object.wait()  [0x00007fb5715ea000]
>    java.lang.Thread.State: RUNNABLE
> 	at java.lang.Class.forName0(java.base@11.0.9.1/Native Method)
> 	at java.lang.Class.forName(java.base@11.0.9.1/Class.java:398)
> 	at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:554)
> 	at org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:639)
> 	at org.apache.solr.schema.FieldTypePluginLoader$3.create(FieldTypePluginLoader.java:339)
> 	at org.apache.solr.schema.FieldTypePluginLoader$3.create(FieldTypePluginLoader.java:332)
> 	at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:152)
> 	at org.apache.solr.schema.FieldTypePluginLoader.readAnalyzer(FieldTypePluginLoader.java:356)
> 	at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:90)
> 	at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:48)
> 	at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:152)
> 	at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:502)
> 	at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:178)
> 	at org.apache.solr.schema.ManagedIndexSchema.<init>(ManagedIndexSchema.java:105)
> 	at org.apache.solr.schema.ManagedIndexSchemaFactory.create(ManagedIndexSchemaFactory.java:178)
> 	at org.apache.solr.schema.ManagedIndexSchemaFactory.create(ManagedIndexSchemaFactory.java:46)
> 	at org.apache.solr.core.ConfigSetService.createIndexSchema(ConfigSetService.java:145)
> 	at org.apache.solr.core.ConfigSetService.loadConfigSet(ConfigSetService.java:83)
> 	at org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1302)
> 	at org.apache.solr.core.CoreContainer.lambda$load$13(CoreContainer.java:802)
> 	at org.apache.solr.core.CoreContainer$$Lambda$569/0x0000000840bc7040.call(Unknown Source)
> 	at org.glowroot.agent.plugin.executor.CallableWrapper.call(CallableWrapper.java:38)
> 	at com.codahale.metrics.InstrumentedExecutorService$InstrumentedCallable.call(InstrumentedExecutorService.java:202)
> 	at java.util.concurrent.FutureTask.run(java.base@11.0.9.1/FutureTask.java:264)
> 	at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:210)
> 	at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$$Lambda$438/0x0000000840b29440.run(Unknown Source)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.9.1/ThreadPoolExecutor.java:1128)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.9.1/ThreadPoolExecutor.java:628)
> 	at java.lang.Thread.run(java.base@11.0.9.1/Thread.java:834)
> "coreLoadExecutor-9-thread-2" #42 prio=5 os_prio=0 cpu=388.81ms elapsed=101.24s tid=0x00007fb5cfcde800 nid=0x57b in Object.wait()  [0x00007fb5716eb000]
>    java.lang.Thread.State: RUNNABLE
> 	at jdk.internal.misc.Unsafe.ensureClassInitialized0(java.base@11.0.9.1/Native Method)
> 	at jdk.internal.misc.Unsafe.ensureClassInitialized(java.base@11.0.9.1/Unsafe.java:1042)
> 	at jdk.internal.reflect.UnsafeFieldAccessorFactory.newFieldAccessor(java.base@11.0.9.1/UnsafeFieldAccessorFactory.java:43)
> 	at jdk.internal.reflect.ReflectionFactory.newFieldAccessor(java.base@11.0.9.1/ReflectionFactory.java:186)
> 	at java.lang.reflect.Field.acquireFieldAccessor(java.base@11.0.9.1/Field.java:1105)
> 	at java.lang.reflect.Field.getFieldAccessor(java.base@11.0.9.1/Field.java:1086)
> 	at java.lang.reflect.Field.get(java.base@11.0.9.1/Field.java:418)
> 	at org.apache.lucene.analysis.util.AbstractAnalysisFactory.lookupSPIName(AbstractAnalysisFactory.java:358)
> 	at org.apache.lucene.analysis.util.AnalysisSPILoader.reload(AnalysisSPILoader.java:90)
> 	- locked <0x00000000e98e20a0> (a org.apache.lucene.analysis.util.AnalysisSPILoader)
> 	at org.apache.lucene.analysis.util.AnalysisSPILoader.<init>(AnalysisSPILoader.java:66)
> 	at org.apache.lucene.analysis.util.AnalysisSPILoader.<init>(AnalysisSPILoader.java:52)
> 	at org.apache.lucene.analysis.util.TokenFilterFactory.<clinit>(TokenFilterFactory.java:33)
> 	at org.apache.solr.core.SolrResourceLoader.reloadLuceneSPI(SolrResourceLoader.java:256)
> 	at org.apache.solr.core.SolrResourceLoader.<init>(SolrResourceLoader.java:196)
> 	at org.apache.solr.core.ConfigSetService$Standalone.createCoreResourceLoader(ConfigSetService.java:207)
> 	at org.apache.solr.core.ConfigSetService.loadConfigSet(ConfigSetService.java:66)
> 	at org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1302)
> 	at org.apache.solr.core.CoreContainer.lambda$load$13(CoreContainer.java:802)
> 	at org.apache.solr.core.CoreContainer$$Lambda$569/0x0000000840bc7040.call(Unknown Source)
> 	at org.glowroot.agent.plugin.executor.CallableWrapper.call(CallableWrapper.java:38)
> 	at com.codahale.metrics.InstrumentedExecutorService$InstrumentedCallable.call(InstrumentedExecutorService.java:202)
> 	at java.util.concurrent.FutureTask.run(java.base@11.0.9.1/FutureTask.java:264)
> 	at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:210)
> 	at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$$Lambda$438/0x0000000840b29440.run(Unknown Source)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.9.1/ThreadPoolExecutor.java:1128)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.9.1/ThreadPoolExecutor.java:628)
> 	at java.lang.Thread.run(java.base@11.0.9.1/Thread.java:834)
> "main" #1 prio=5 os_prio=0 cpu=25108.09ms elapsed=133.50s tid=0x00007fb5cc018800 nid=0x541 waiting on condition  [0x00007fb5d2f73000]
>    java.lang.Thread.State: TIMED_WAITING (parking)
> 	at jdk.internal.misc.Unsafe.park(java.base@11.0.9.1/Native Method)
> 	- parking to wait for  <0x00000000ea480000> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> 	at java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.9.1/LockSupport.java:234)
> 	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(java.base@11.0.9.1/AbstractQueuedSynchronizer.java:2123)
> 	at java.util.concurrent.ThreadPoolExecutor.awaitTermination(java.base@11.0.9.1/ThreadPoolExecutor.java:1454)
> 	at com.codahale.metrics.InstrumentedExecutorService.awaitTermination(InstrumentedExecutorService.java:163)
> 	at org.apache.solr.common.util.ExecutorUtil.awaitTermination(ExecutorUtil.java:85)
> 	at org.apache.solr.common.util.ExecutorUtil.shutdownAndAwaitTermination(ExecutorUtil.java:77)
> 	at org.apache.solr.core.CoreContainer.load(CoreContainer.java:843)
> "h2sc-4-thread-1" #38 prio=5 os_prio=0 cpu=0.20ms elapsed=101.89s tid=0x00007fb5cfc16800 nid=0x577 runnable  [0x00007fb571cf0000]
>    java.lang.Thread.State: RUNNABLE
> 	at sun.nio.ch.EPoll.wait(java.base@11.0.9.1/Native Method)
> 	at sun.nio.ch.EPollSelectorImpl.doSelect(java.base@11.0.9.1/EPollSelectorImpl.java:120)
> 	at sun.nio.ch.SelectorImpl.lockAndDoSelect(java.base@11.0.9.1/SelectorImpl.java:124)
> 	- locked <0x00000000e9819cf0> (a sun.nio.ch.Util$2)
> 	- locked <0x00000000e9819c98> (a sun.nio.ch.EPollSelectorImpl)
> 	at sun.nio.ch.SelectorImpl.select(java.base@11.0.9.1/SelectorImpl.java:141)
> 	at org.eclipse.jetty.io.ManagedSelector.nioSelect(ManagedSelector.java:149)
> 	at org.eclipse.jetty.io.ManagedSelector.select(ManagedSelector.java:156)
> 	at org.eclipse.jetty.io.ManagedSelector$SelectorProducer.select(ManagedSelector.java:572)
> 	at org.eclipse.jetty.io.ManagedSelector$SelectorProducer.produce(ManagedSelector.java:509)
> 	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produceTask(EatWhatYouKill.java:360)
> 	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:184)
> 	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171)
> 	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produce(EatWhatYouKill.java:135)
> 	at org.eclipse.jetty.io.ManagedSelector$$Lambda$437/0x0000000840b28c40.run(Unknown Source)
> 	at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:210)
> 	at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$$Lambda$438/0x0000000840b29440.run(Unknown Source)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.9.1/ThreadPoolExecutor.java:1128
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.9.1/ThreadPoolExecutor.java:628)
> 	at java.lang.Thread.run(java.base@11.0.9.1/Thread.java:834)
> {noformat}
> Notice the coreLoadExecutor-9-thread-* thread which despite their "RUNNABLE" state are actually internally blocked wait on some unknown object. The main thread is stuck because it's waiting for the cores's initialization to finish.
> I included "h2sc-4-thread-1" to have all the threads doing stuff but it does not seems to have much to do with the deadlock (other than being a Solr thing too).
> It's not very easy to debug since we cannot always reproduce but maybe you guys have a clue for where this issue might be comming from based on those informations.
> When we hit this problem we can "workaround" it by restarting a few times until it pass the Solr init.
> We never had this issue back when we only had 1 or 2 cores but having 5 of those is very recent for us so I cannot really tell if it's some recent Solr regression or an old bug we did not had before because we did not had enough cores to trigger it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)