You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@solr.apache.org by "Mike Drob (Jira)" <ji...@apache.org> on 2022/02/22 17:14:00 UTC

[jira] [Commented] (SOLR-16043) HDFS tests - "Command processor" thread leak

    [ https://issues.apache.org/jira/browse/SOLR-16043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17496238#comment-17496238 ] 

Mike Drob commented on SOLR-16043:
----------------------------------

DataNode.shutdown will call [blockPoolManager.shutDownAll|https://github.com/apache/hadoop/blob/rel/release-3.3.1/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java#L2152] which eventually calls [BPServiceActor.stop|https://github.com/apache/hadoop/blob/rel/release-3.3.1/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPServiceActor.java#L592] which will interrupt the command processing thread, which _is_ a daemon thread.

In the logs we see...

{noformat}
  2> 31835 ERROR (Command processor) [] o.a.h.h.s.d.DataNode Command processor encountered interrupt and exit.
  2> 31835 WARN  (Command processor) [] o.a.h.h.s.d.DataNode Ending command processor service for: Thread[Command processor,5,TGRP-HdfsBackupRepositoryIntegrationTest]
{noformat}
which despite error status looks like the normal mode of shutting things down. However... that's only one of the data nodes, and the second one is the one that got stuck. It would be great if the thread names were unique as that would have given us the hint faster.

Maybe we can try reducing to one datanode instead of our current default of 2 and see if that helps the tests? We're not really interested in testing HDFS resiliency here anyway, so I think it should be fine for test purposes. Additional bonus is that startup/shutdown times might reduce a bit?
 

 

> HDFS tests - "Command processor" thread leak
> --------------------------------------------
>
>                 Key: SOLR-16043
>                 URL: https://issues.apache.org/jira/browse/SOLR-16043
>             Project: Solr
>          Issue Type: Task
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: hdfs, Tests
>            Reporter: Kevin Risden
>            Priority: Major
>         Attachments: Solr-main-MACOSX-678_consoleText.txt.gz
>
>
> This is caused by SOLR-15942. Some background:
> * originally discussed in 3.3.1 PR - https://github.com/apache/solr/pull/553#discussion_r792068327
> * Failures so far from jenkins
> ** https://jenkins.thetaphi.de/job/Solr-main-Linux/3116/consoleText
> ** https://jenkins.thetaphi.de/job/Solr-main-Linux/3109/consoleText
> ** https://jenkins.thetaphi.de/job/Solr-main-MacOSX/678/consoleText
> It looks like it might be trying to shut down:
> {code:java}
>  2> 31835 ERROR (Command processor) [] o.a.h.h.s.d.DataNode Command processor encountered interrupt and exit.
>   2> 31835 WARN  (BP-1491942626-127.0.0.1-1645493532707 heartbeating to localhost/127.0.0.1:51572) [] o.a.h.h.s.d.IncrementalBlockReportManager IncrementalBlockReportManager interrupted
>   2> 31835 WARN  (Command processor) [] o.a.h.h.s.d.DataNode Ending command processor service for: Thread[Command processor,5,TGRP-HdfsBackupRepositoryIntegrationTest]
> {code}
> the error message looks like (this is from https://jenkins.thetaphi.de/job/Solr-main-MacOSX/678/consoleText which is attached  [^Solr-main-MACOSX-678_consoleText.txt.gz] ):
> {code:java}
>    2> 31948 WARN  (Listener at localhost/51675) [] o.a.h.h.s.d.f.i.FsDatasetAsyncDiskService AsyncDiskService has already shut down.
>   2> 31948 WARN  (Listener at localhost/51675) [] o.a.h.h.s.d.f.i.RamDiskAsyncLazyPersistService AsyncLazyPersistService has already shut down.
>   2> 31948 INFO  (Listener at localhost/51675) [] o.a.h.h.s.d.DataNode Shutdown complete.
>   2> 32290 INFO  (Listener at localhost/51675) [] o.a.s.u.ErrorLogMuter Closing ErrorLogMuter-regex-1 after mutting 0 log messages
>   2> 32290 INFO  (Listener at localhost/51675) [] o.a.s.u.ErrorLogMuter Creating ErrorLogMuter-regex-2 for ERROR logs matching regex: ignore_exception
>   2> Feb 22, 2022 1:32:34 AM com.carrotsearch.randomizedtesting.ThreadLeakControl checkThreadLeaks
>   2> WARNING: Will linger awaiting termination of 1 leaked thread(s).
>   2> Feb 22, 2022 1:32:35 AM com.carrotsearch.randomizedtesting.ThreadLeakControl checkThreadLeaks
>   2> SEVERE: 1 thread leaked from SUITE scope at org.apache.solr.core.backup.repository.HdfsBackupRepositoryIntegrationTest: 
>   2>    1) Thread[id=90, name=Command processor, state=WAITING, group=TGRP-HdfsBackupRepositoryIntegrationTest]
>   2>         at java.base@16.0.2/jdk.internal.misc.Unsafe.park(Native Method)
>   2>         at java.base@16.0.2/java.util.concurrent.locks.LockSupport.park(LockSupport.java:341)
>   2>         at java.base@16.0.2/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionNode.block(AbstractQueuedSynchronizer.java:505)
>   2>         at java.base@16.0.2/java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3137)
>   2>         at java.base@16.0.2/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1614)
>   2>         at java.base@16.0.2/java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:435)
>   2>         at app//org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1331)
>   2>         at app//org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1315)
>   2> Feb 22, 2022 1:32:35 AM com.carrotsearch.randomizedtesting.ThreadLeakControl tryToInterruptAll
>   2> INFO: Starting to interrupt leaked threads:
>   2>    1) Thread[id=90, name=Command processor, state=WAITING, group=TGRP-HdfsBackupRepositoryIntegrationTest]
>   2> 33504 ERROR (Command processor) [] o.a.h.h.s.d.DataNode Command processor encountered interrupt and exit.
>   2> 33504 WARN  (Command processor) [] o.a.h.h.s.d.DataNode Ending command processor service for: Thread[Command processor,5,TGRP-HdfsBackupRepositoryIntegrationTest]
>   2> Feb 22, 2022 1:32:35 AM com.carrotsearch.randomizedtesting.ThreadLeakControl tryToInterruptAll
>   2> INFO: All leaked threads terminated.
>    >     com.carrotsearch.randomizedtesting.ThreadLeakError: 1 thread leaked from SUITE scope at org.apache.solr.core.backup.repository.HdfsBackupRepositoryIntegrationTest: 
>    >        1) Thread[id=90, name=Command processor, state=WAITING, group=TGRP-HdfsBackupRepositoryIntegrationTest]
>    >             at java.base@16.0.2/jdk.internal.misc.Unsafe.park(Native Method)
>    >             at java.base@16.0.2/java.util.concurrent.locks.LockSupport.park(LockSupport.java:341)
>    >             at java.base@16.0.2/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionNode.block(AbstractQueuedSynchronizer.java:505)
>    >             at java.base@16.0.2/java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3137)
>    >             at java.base@16.0.2/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1614)
>    >             at java.base@16.0.2/java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:435)
>    >             at app//org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1331)
>    >             at app//org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1315)
>    >         at __randomizedtesting.SeedInfo.seed([1A5FAFA6CE7F7FF0]:0)
>   2> NOTE: test params are: codec=Asserting(Lucene90): {}, docValues:{}, maxPointsInLeafNode=757, maxMBSortInHeap=5.99773708873607, sim=Asserting(RandomSimilarity(queryNorm=false): {}), locale=sah-RU, timezone=Asia/Macau
>   2> NOTE: Mac OS X 10.14.6 x86_64/Eclipse Foundation 16.0.2 (64-bit)/cpus=6,threads=40,free=159527368,total=271581184
>   2> NOTE: All tests run in this JVM: [HdfsBackupRepositoryIntegrationTest]
>   2> NOTE: reproduce with: gradlew test --tests HdfsBackupRepositoryIntegrationTest -Dtests.seed=1A5FAFA6CE7F7FF0 -Dtests.slow=true -Dtests.locale=sah-RU -Dtests.timezone=Asia/Macau -Dtests.asserts=true -Dtests.file.encoding=UTF-8
> {code}
> Some related code pointers:
> * https://github.com/apache/hadoop/blame/rel/release-3.3.1/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPServiceActor.java#L1336
> * https://github.com/apache/hadoop/blame/rel/release-3.3.1/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java#L2017
> PS: there is an error about datanode mbean - but it is unrelated as far as I can tell - https://issues.apache.org/jira/browse/HDFS-11041



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@solr.apache.org
For additional commands, e-mail: issues-help@solr.apache.org