Posted to issues@lucene.apache.org by "Chris M. Hostetter (Jira)" <ji...@apache.org> on 2020/05/01 01:14:00 UTC

[jira] [Created] (SOLR-14452) "classloading deadlock" issue with DocSet/SortedIntDocSet

Chris M. Hostetter created SOLR-14452:
-----------------------------------------

             Summary: "classloading deadlock" issue with DocSet/SortedIntDocSet
                 Key: SOLR-14452
                 URL: https://issues.apache.org/jira/browse/SOLR-14452
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
            Reporter: Chris M. Hostetter


While beasting some facet-related cloud tests on master, I noticed a pattern of occasional failures that seemed to crop up...
 * the test ultimately fails due to a timeout (usually the client threads time out waiting for a server response)
 * if I notice my CPU isn't spinning very hard _before_ the test fails, I can capture a jstack and inspect some threads
 * there will be multiple jetty/solr request threads (ex: {{"qtp82184175-145"}}) whose stack traces show various stages of DocSet collection, and which are reported as {{"... in Object.wait()"}} but also {{RUNNABLE}}

...this isn't a thread summary+state combination that I'm used to seeing when looking at thread dumps, and some research into when/why this might happen led me to:
 * [https://stackoverflow.com/questions/28631656/runnable-thread-state-but-in-object-wait]
 ** [https://stackoverflow.com/a/28776438/689372]
 *** [http://ternarysearch.blogspot.com/2013/07/static-initialization-deadlock.html]
 *** [https://bugs.openjdk.java.net/browse/JDK-8037567]

...while the comments/status of JDK-8037567 suggest "nothing wrong here", the symptoms described in the SO answer and the linked blog, and their summation that this is essentially a "deadlock" situation in the class loader, do seem to correlate with some of the specifics I can see in the stack traces when this happens while running solr tests...
 * at least one "RUNNABLE / Object.wait" thread trying to do class init; class: DocSet...
{noformat}
"qtp1535326437-68" #68 prio=5 os_prio=0 cpu=72.48ms elapsed=241.69s tid=0x00007fc08c0a4000 nid=0x864 in Object.wait()  [0x00007fc0adedd000]
   java.lang.Thread.State: RUNNABLE
	at org.apache.solr.search.DocSet.<clinit>(DocSet.java:118)
	at org.apache.solr.search.DocSetCollector.getDocSet(DocSetCollector.java:90) // "new BitDocSet(..)"
	at org.apache.solr.search.DocSetUtil.getDocSet(DocSetUtil.java:93)
	at org.apache.solr.search.SolrIndexSearcher.getDocListAndSetNC(SolrIndexSearcher.java:1730)
{noformat}

 * other "RUNNABLE / Object.wait" threads are on lines that involve instantiating a subclass of DocSet:
 ** 
{noformat}
"qtp1535326437-67" #67 prio=5 os_prio=0 cpu=801.44ms elapsed=241.69s tid=0x00007fc08c0a1800 nid=0x863 in Object.wait()  [0x00007fc0adfdf000]
   java.lang.Thread.State: RUNNABLE
	at org.apache.solr.search.DocSetCollector.getDocSet(DocSetCollector.java:90) // "new BitDocSet(..)"
	at org.apache.solr.search.DocSetUtil.getDocSet(DocSetUtil.java:93)
	at org.apache.solr.search.SolrIndexSearcher.getDocListAndSetNC(SolrIndexSearcher.java:1730)
{noformat}

 ** 
{noformat}
"qtp82184175-65" #65 prio=5 os_prio=0 cpu=137.76ms elapsed=241.69s tid=0x00007fc088092000 nid=0x860 in Object.wait()  [0x00007fc0ae2e2000]
   java.lang.Thread.State: RUNNABLE
	at org.apache.solr.search.DocSetCollector.getDocSet(DocSetCollector.java:84) // "new SortedIntDocSet(..)"
	at org.apache.solr.search.DocSetUtil.getDocSet(DocSetUtil.java:93)
	at org.apache.solr.search.SolrIndexSearcher.getDocListAndSetNC(SolrIndexSearcher.java:1730)
	at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1433)
{noformat}

 ** etc...
 * DocSet has a static reference to a concrete subclass...
 ** {{public static final DocSet EMPTY = new SortedIntDocSet(new int[0], 0);}}
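
The pattern in the list above (a base class whose static initializer instantiates one of its own subclasses, while another thread is instantiating that subclass) can be sketched in isolation. This is only an illustrative reproduction, not Solr code: {{ClassInitDeadlockDemo}}, {{Base}}, {{Sub}}, {{baseInitStarted}} and {{deadlocks}} are hypothetical names, and the {{Thread.sleep}} in the static initializer is an artificial device to force the interleaving that presumably only happens occasionally in the real tests.

```java
import java.util.concurrent.CountDownLatch;

public class ClassInitDeadlockDemo {
    // Signals that the first thread has entered Base's static initializer.
    static final CountDownLatch baseInitStarted = new CountDownLatch(1);

    // Hypothetical stand-in for a base class with a static reference
    // to an instance of its own subclass (like DocSet.EMPTY).
    static class Base {
        static final Base EMPTY;
        static {
            baseInitStarted.countDown();
            try {
                // Artificial pause: gives the other thread time to begin
                // initializing Sub before this thread asks for it.
                Thread.sleep(500);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            // Requires Sub to be initialized, but Sub's initialization is
            // in progress on the other thread, which is waiting for Base.
            EMPTY = new Sub();
        }
    }

    // Hypothetical stand-in for a concrete subclass (like SortedIntDocSet).
    static class Sub extends Base {}

    /** Returns true if both threads are still stuck after the timeout. */
    static boolean deadlocks(long timeoutMillis) throws InterruptedException {
        // Thread A triggers initialization of Base first.
        Thread a = new Thread(() -> {
            Object unused = Base.EMPTY;
        }, "init-base");
        // Thread B starts initializing Sub once A is inside Base.<clinit>;
        // initializing Sub requires its superclass Base to be initialized.
        Thread b = new Thread(() -> {
            try {
                baseInitStarted.await();
            } catch (InterruptedException e) {
                return;
            }
            new Sub();
        }, "init-sub");
        // Daemon threads so the JVM can still exit while they are stuck.
        a.setDaemon(true);
        b.setDaemon(true);
        a.start();
        b.start();
        a.join(timeoutMillis);
        b.join(timeoutMillis);
        return a.isAlive() && b.isAlive();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("deadlocked: " + deadlocks(3000));
    }
}
```

Running {{main}} reports whether both threads were still blocked after the timeout; taking a jstack of the process while it waits is one way to compare the two stuck threads against the dumps above.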

----

I should point out:
* While this particular "class loading deadlock" issue seems more likely to happen in a "test" situation where the JVMs/classloaders are short-lived, there's no reason to assume this type of failure couldn't happen in a production solr instance when handling a burst of queries right after startup.
* This type of failure (either specifically due to "DocSet vs SortedIntDocSet", or due to similar patterns in other classes) may also be the root cause of various other hard-to-reproduce "timed out" test failures we've seen over the years.


