Posted to issues@lucene.apache.org by "David Smiley (Jira)" <ji...@apache.org> on 2020/05/01 03:22:00 UTC

[jira] [Commented] (SOLR-14452) "classloading deadlock" issue with DocSet/SortedIntDocSet

    [ https://issues.apache.org/jira/browse/SOLR-14452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17097138#comment-17097138 ] 

David Smiley commented on SOLR-14452:
-------------------------------------

I think your master branch is out of date.  On March 17th, in SOLR-14256, I fixed this bug, which had been around for a month.

> "classloading deadlock" issue with DocSet/SortedIntDocSet
> ---------------------------------------------------------
>
>                 Key: SOLR-14452
>                 URL: https://issues.apache.org/jira/browse/SOLR-14452
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Chris M. Hostetter
>            Priority: Major
>
> While beasting some facet related cloud tests on master, I noticed a pattern of occasional failures that seemed to crop up...
>  * test ultimately fails due to a time out (usually the client threads time out waiting for a server response)
>  * if I notice my CPU isn't spinning very hard _before_ the test fails, I can capture a jstack and inspect some threads
>  * there will be multiple jetty/solr request threads (ex: {{"qtp82184175-145"}}) whose stack traces show various stages of DocSet collection, reported as {{"... in Object.wait()"}} but also {{RUNNABLE}}
> ...this isn't a thread summary+state combination that I'm used to seeing when looking at thread dumps, and some research into when/why this might happen led me to:
>  * [https://stackoverflow.com/questions/28631656/runnable-thread-state-but-in-object-wait]
>  ** [https://stackoverflow.com/a/28776438/689372]
>  *** [http://ternarysearch.blogspot.com/2013/07/static-initialization-deadlock.html]
>  *** [https://bugs.openjdk.java.net/browse/JDK-8037567]
> ...while the comments/status of JDK-8037567 suggest "nothing wrong here", the symptoms described in the SO answer and the linked blog, and their summation that this is essentially a "deadlock" situation in the class loader, do seem to correlate with some of the specifics I can see in the stack traces when this happens while running Solr tests...
>  * at least one "RUNNABLE / Object.wait" thread trying to do class init; class: DocSet...
> {noformat}
> "qtp1535326437-68" #68 prio=5 os_prio=0 cpu=72.48ms elapsed=241.69s tid=0x00007fc08c0a4000 nid=0x864 in Object.wait()  [0x00007fc0adedd000]
>    java.lang.Thread.State: RUNNABLE
> 	at org.apache.solr.search.DocSet.<clinit>(DocSet.java:118)
> 	at org.apache.solr.search.DocSetCollector.getDocSet(DocSetCollector.java:90) // "new BitDocSet(..)"
> 	at org.apache.solr.search.DocSetUtil.getDocSet(DocSetUtil.java:93)
> 	at org.apache.solr.search.SolrIndexSearcher.getDocListAndSetNC(SolrIndexSearcher.java:1730)
> {noformat}
>  * other "RUNNABLE / Object.wait" threads are on lines that involve instantiating a subclass of DocSet:
>  ** 
> {noformat}
> "qtp1535326437-67" #67 prio=5 os_prio=0 cpu=801.44ms elapsed=241.69s tid=0x00007fc08c0a1800 nid=0x863 in Object.wait()  [0x00007fc0adfdf000]
>    java.lang.Thread.State: RUNNABLE
> 	at org.apache.solr.search.DocSetCollector.getDocSet(DocSetCollector.java:90) // "new BitDocSet(..)"
> 	at org.apache.solr.search.DocSetUtil.getDocSet(DocSetUtil.java:93)
> 	at org.apache.solr.search.SolrIndexSearcher.getDocListAndSetNC(SolrIndexSearcher.java:1730)
> {noformat}
>  ** 
> {noformat}
> "qtp82184175-65" #65 prio=5 os_prio=0 cpu=137.76ms elapsed=241.69s tid=0x00007fc088092000 nid=0x860 in Object.wait()  [0x00007fc0ae2e2000]
>    java.lang.Thread.State: RUNNABLE
> 	at org.apache.solr.search.DocSetCollector.getDocSet(DocSetCollector.java:84) // "new SortedIntDocSet(..)"
> 	at org.apache.solr.search.DocSetUtil.getDocSet(DocSetUtil.java:93)
> 	at org.apache.solr.search.SolrIndexSearcher.getDocListAndSetNC(SolrIndexSearcher.java:1730)
> 	at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1433)
> {noformat}
>  ** etc...
>  * DocSet has a static reference to a concrete subclass...
>  ** {{public static final DocSet EMPTY = new SortedIntDocSet(new int[0], 0);}}
> ----
> I should point out:
> * While this particular "class loading deadlock" issue seems more likely to happen in a "test" situation where the JVMs/classloaders are short lived, there's no reason to assume this type of failure couldn't happen in a production Solr instance when handling a burst of queries right after startup.
> * This type of failure (either specifically due to "DocSet vs SortedIntDocSet", or due to similar patterns in other classes) may also be the root cause of various other hard-to-reproduce "timed out" test failures we've seen over the years.
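
For readers unfamiliar with the pattern described in the issue above, here is a minimal sketch of how a superclass whose static initializer instantiates its own subclass can leave two threads waiting on each other during class initialization. All names (ClassInitCycleSketch, Base, Sub, SafeBase, SafeSub) are hypothetical and this is not Solr code. The sleep exists only to widen the race window so the demo is repeatable. The "safe" variant shows one common way to remove the cycle (keeping the shared instance in the subclass); it is an assumption for illustration, not necessarily what SOLR-14256 changed.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical names throughout; a sketch of the pattern, not Solr code.
class ClassInitCycleSketch {

    // Released once thread A is inside Base's static initializer.
    static final CountDownLatch baseInitStarted = new CountDownLatch(1);

    // Hazardous shape: the superclass initializer instantiates its own subclass.
    static abstract class Base {
        static final Base EMPTY;
        static {
            baseInitStarted.countDown();
            pause(300);        // widen the race window (demo only)
            EMPTY = new Sub(); // needs Sub initialized, but Sub first needs Base
        }
    }

    static final class Sub extends Base { }

    // Safer shape: the superclass has no static dependency on its subclass.
    static abstract class SafeBase {
        static SafeBase empty() { return SafeSub.EMPTY; }
    }

    static final class SafeSub extends SafeBase {
        static final SafeSub EMPTY = new SafeSub();
    }

    static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Returns true if both threads are still stuck in class initialization
    // after the join timeouts. Daemon threads are used so the JVM can exit.
    static boolean demonstrateCycle() throws InterruptedException {
        // Thread A starts initializing Base, then needs Sub.
        Thread a = new Thread(() -> { Object unused = Base.EMPTY; }, "init-Base");
        // Thread B starts initializing Sub while A is mid-way, then needs Base.
        Thread b = new Thread(() -> {
            try {
                baseInitStarted.await();
            } catch (InterruptedException e) {
                return;
            }
            new Sub();
        }, "init-Sub");
        a.setDaemon(true);
        b.setDaemon(true);
        a.start();
        b.start();
        a.join(1000);
        b.join(1000);
        return a.isAlive() && b.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("threads stuck in class init: " + demonstrateCycle());
        System.out.println("safe variant returns one instance: "
                + (SafeBase.empty() == SafeBase.empty()));
    }
}
```

In the safe variant, initializing SafeSub still waits for SafeBase (its superclass), but initializing SafeBase never waits for SafeSub, so there is no cycle for two threads to get caught in.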



--
This message was sent by Atlassian Jira
(v8.3.4#803005)