You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Hoss Man (JIRA)" <ji...@apache.org> on 2014/01/13 19:33:00 UTC

[jira] [Commented] (SOLR-5628) Cloud test harness manifesting reproducible failures in TestDistribDocBasedVersion

    [ https://issues.apache.org/jira/browse/SOLR-5628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13869802#comment-13869802 ] 

Hoss Man commented on SOLR-5628:
--------------------------------

Comments from McCandless based on his investigation...

{quote}
OK, this seems to be a test bug: somehow, the main thread is removing
the entire index directory from the filesystem, before another thread
calls IW.close.

I can see the test opens 5 IW instances, and then things run fine, and
then the 5 IWs are closed in sequence in a thread, but in the main
thread is returning (randomized testing prints "OK      20.1s |
TestDistribDocBasedVersion.testDistribSearch") before the last 2 IWs
are closed.

So somewhere in the main thread, something is concurrently removing
the index directory before the last 2 IWs are closed.

I tried passing -Dtests.leaveTemporary=true and the test still fails,
so I don't think it's test-framework that's removing the index dir; I
think it's something somewhere in Solr's test infrastructure?

Probably there is just a missing call somewhere to cleanly shutdown
all cores before cleaning up?
{quote}
{quote}
OK, indeed, I tried passing -Dsolr.test.leavedatadir=true (thanks
Rob!) and that makes the test pass.

So somewhere this test just needs to properly close things before
removing filesystem directories.
{quote}

However, when i attempted to add {{-Dsolr.test.leavedatadir=true -Dtests.leaveTemporary=true}} I could still reproduce the failure.

Meanwhile, the specific behavior of the failure seems to have changed slightly between  r1556643 and r1557798

Here's the failure on branch_4x as of r1556643 ...

{noformat}
   [junit4]    > Throwable #1: com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=137, name=Lucene Merge Thread #0, state=RUNNABLE, group=TGRP-TestDistribDocBasedVersion]
   [junit4]    > Caused by: org.apache.lucene.index.MergePolicy$MergeException: java.lang.AssertionError
   [junit4]    > 	at __randomizedtesting.SeedInfo.seed([791402573DC76F3C]:0)
   [junit4]    > 	at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:545)
   [junit4]    > 	at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:518)
   [junit4]    > Caused by: java.lang.AssertionError
   [junit4]    > 	at org.apache.lucene.index.IndexWriter._mergeInit(IndexWriter.java:3807)
   [junit4]    > 	at org.apache.lucene.index.IndexWriter.mergeInit(IndexWriter.java:3791)
   [junit4]    > 	at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3644)
   [junit4]    > 	at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
   [junit4]    > 	at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)Throwable #2: com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=138, name=Lucene Merge Thread #0, state=RUNNABLE, group=TGRP-TestDistribDocBasedVersion]
   [junit4]    > Caused by: org.apache.lucene.index.MergePolicy$MergeException: java.lang.AssertionError
   [junit4]    > 	at __randomizedtesting.SeedInfo.seed([791402573DC76F3C]:0)
   [junit4]    > 	at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:545)
   [junit4]    > 	at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:518)
   [junit4]    > Caused by: java.lang.AssertionError
   [junit4]    > 	at org.apache.lucene.index.IndexWriter._mergeInit(IndexWriter.java:3807)
   [junit4]    > 	at org.apache.lucene.index.IndexWriter.mergeInit(IndexWriter.java:3791)
   [junit4]    > 	at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3644)
   [junit4]    > 	at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
   [junit4]    > 	at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
   [junit4] Completed in 29.76s, 1 test, 2 errors <<< FAILURES!
   [junit4] 
   [junit4] 
   [junit4] Tests with failures:
   [junit4]   - org.apache.solr.cloud.TestDistribDocBasedVersion (suite)
   [junit4] 
   [junit4] 
   [junit4] JVM J0:     0.41 ..    31.03 =    30.62s
   [junit4] Execution time total: 31 seconds
   [junit4] Tests summary: 1 suite, 1 test, 2 suite-level errors
{noformat}

... but if you then svn update to r1557798...

{noformat}
   [junit4]   2> 167814 T9 ccr.ThreadLeakControl.checkThreadLeaks SEVERE 2 threads leaked from SUITE scope at org.apache.solr.cloud.TestDistribDocBasedVersion: 
   [junit4]   2> 	   1) Thread[id=113, name=searcherExecutor-42-thread-1, state=WAITING, group=TGRP-TestDistribDocBasedVersion]
   [junit4]   2> 	        at sun.misc.Unsafe.park(Native Method)
   [junit4]   2> 	        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
   [junit4]   2> 	        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
   [junit4]   2> 	        at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:386)
   [junit4]   2> 	        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1069)
   [junit4]   2> 	        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1131)
   [junit4]   2> 	        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   [junit4]   2> 	        at java.lang.Thread.run(Thread.java:701)
   [junit4]   2> 	   2) Thread[id=93, name=searcherExecutor-32-thread-1, state=WAITING, group=TGRP-TestDistribDocBasedVersion]
   [junit4]   2> 	        at sun.misc.Unsafe.park(Native Method)
   [junit4]   2> 	        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
   [junit4]   2> 	        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
   [junit4]   2> 	        at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:386)
   [junit4]   2> 	        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1069)
   [junit4]   2> 	        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1131)
   [junit4]   2> 	        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   [junit4]   2> 	        at java.lang.Thread.run(Thread.java:701)

...

   [junit4] ERROR   0.00s | TestDistribDocBasedVersion (suite) <<<
   [junit4]    > Throwable #1: java.lang.AssertionError: ERROR: SolrIndexSearcher opens=19 closes=17
   [junit4]    > 	at __randomizedtesting.SeedInfo.seed([791402573DC76F3C]:0)
   [junit4]    > 	at org.apache.solr.SolrTestCaseJ4.endTrackingSearchers(SolrTestCaseJ4.java:335)
   [junit4]    > 	at org.apache.solr.SolrTestCaseJ4.afterClass(SolrTestCaseJ4.java:139)
   [junit4]    > 	at java.lang.Thread.run(Thread.java:701)Throwable #2: com.carrotsearch.randomizedtesting.ThreadLeakError: 2 threads leaked from SUITE scope at org.apache.solr.cloud.TestDistribDocBasedVersion: 
   [junit4]    >    1) Thread[id=113, name=searcherExecutor-42-thread-1, state=WAITING, group=TGRP-TestDistribDocBasedVersion]
   [junit4]    >         at sun.misc.Unsafe.park(Native Method)
   [junit4]    >         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
   [junit4]    >         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
   [junit4]    >         at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:386)
   [junit4]    >         at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1069)
   [junit4]    >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1131)
   [junit4]    >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   [junit4]    >         at java.lang.Thread.run(Thread.java:701)
   [junit4]    >    2) Thread[id=93, name=searcherExecutor-32-thread-1, state=WAITING, group=TGRP-TestDistribDocBasedVersion]
   [junit4]    >         at sun.misc.Unsafe.park(Native Method)
   [junit4]    >         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
   [junit4]    >         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
   [junit4]    >         at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:386)
   [junit4]    >         at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1069)
   [junit4]    >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1131)
   [junit4]    >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   [junit4]    >         at java.lang.Thread.run(Thread.java:701)
   [junit4]    > 	at __randomizedtesting.SeedInfo.seed([791402573DC76F3C]:0)Throwable #3: com.carrotsearch.randomizedtesting.ThreadLeakError: There are still zombie threads that couldn't be terminated:
   [junit4]    >    1) Thread[id=113, name=searcherExecutor-42-thread-1, state=WAITING, group=TGRP-TestDistribDocBasedVersion]
   [junit4]    >         at sun.misc.Unsafe.park(Native Method)
   [junit4]    >         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
   [junit4]    >         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
   [junit4]    >         at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:386)
   [junit4]    >         at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1069)
   [junit4]    >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1131)
   [junit4]    >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   [junit4]    >         at java.lang.Thread.run(Thread.java:701)
   [junit4]    >    2) Thread[id=93, name=searcherExecutor-32-thread-1, state=WAITING, group=TGRP-TestDistribDocBasedVersion]
   [junit4]    >         at sun.misc.Unsafe.park(Native Method)
   [junit4]    >         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
   [junit4]    >         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
   [junit4]    >         at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:386)
   [junit4]    >         at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1069)
   [junit4]    >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1131)
   [junit4]    >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   [junit4]    >         at java.lang.Thread.run(Thread.java:701)
   [junit4]    > 	at __randomizedtesting.SeedInfo.seed([791402573DC76F3C]:0)Throwable #4: com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=137, name=Lucene Merge Thread #0, state=RUNNABLE, group=TGRP-TestDistribDocBasedVersion]
   [junit4]    > Caused by: org.apache.lucene.index.MergePolicy$MergeException: java.lang.AssertionError
   [junit4]    > 	at __randomizedtesting.SeedInfo.seed([791402573DC76F3C]:0)
   [junit4]    > 	at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:545)
   [junit4]    > 	at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:518)
   [junit4]    > Caused by: java.lang.AssertionError
   [junit4]    > 	at org.apache.lucene.index.IndexWriter._mergeInit(IndexWriter.java:3817)
   [junit4]    > 	at org.apache.lucene.index.IndexWriter.mergeInit(IndexWriter.java:3801)
   [junit4]    > 	at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3654)
   [junit4]    > 	at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
   [junit4]    > 	at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)Throwable #5: com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=138, name=Lucene Merge Thread #0, state=RUNNABLE, group=TGRP-TestDistribDocBasedVersion]
   [junit4]    > Caused by: org.apache.lucene.index.MergePolicy$MergeException: java.lang.AssertionError
   [junit4]    > 	at __randomizedtesting.SeedInfo.seed([791402573DC76F3C]:0)
   [junit4]    > 	at org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:545)
   [junit4]    > 	at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:518)
   [junit4]    > Caused by: java.lang.AssertionError
   [junit4]    > 	at org.apache.lucene.index.IndexWriter._mergeInit(IndexWriter.java:3817)
   [junit4]    > 	at org.apache.lucene.index.IndexWriter.mergeInit(IndexWriter.java:3801)
   [junit4]    > 	at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3654)
   [junit4]    > 	at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
   [junit4]    > 	at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
   [junit4] Completed in 171.33s, 1 test, 1 failure, 4 errors <<< FAILURES!
   [junit4] 
   [junit4] 
   [junit4] Tests with failures:
   [junit4]   - org.apache.solr.cloud.TestDistribDocBasedVersion (suite)
   [junit4] 
   [junit4] 
   [junit4] JVM J0:     0.41 ..   172.56 =   172.15s
   [junit4] Execution time total: 2 minutes 52 seconds
   [junit4] Tests summary: 1 suite, 1 test, 5 suite-level errors

{noformat}

The key question being: what changed on 4x between these two versions that causes these zombie threads on top of the existing problem?


> Cloud test harness manifesting reproducible failures in TestDistribDocBasedVersion
> ----------------------------------------------------------------------------------
>
>                 Key: SOLR-5628
>                 URL: https://issues.apache.org/jira/browse/SOLR-5628
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Hoss Man
>
> Jenkins uncovered a test seed that causes a reproducible IndexWriter assertion failure in TestDistribDocBasedVersion on the 4x branch.  
> McCandless helped dig in and believe that something in the way the solr test framework is setup is causing the test to delete the index dirs before the IndexWriter is being closed.  Meanwhile, it appears that recent changes to 4x have caused the nature of the failure to change, so that now -- in addition to the IndexWriter assertion failure -- the test cleanup also stalls out and the test runner has to terminate some stalled threads.
> details to following in comment, but here's the reproduce line...
> {noformat}
> ant test  -Dtestcase=TestDistribDocBasedVersion -Dtests.seed=791402573DC76F3C -Dtests.multiplier=3 -Dtests.slow=true
> -Dtests.locale=ar_IQ -Dtests.timezone=Antarctica/Rothera -Dtests.file.encoding=US-ASCII
> {noformat}
> And the mail thread regarding this...
> https://mail-archives.apache.org/mod_mbox/lucene-dev/201401.mbox/%3Calpine.DEB.2.02.1401100930260.20275@frisbee%3E



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org