You are viewing a plain text version of this content. The canonical link for it is here.
Posted to hdfs-dev@hadoop.apache.org by "Michael Stack (Jira)" <ji...@apache.org> on 2022/05/26 00:05:00 UTC

[jira] [Resolved] (HDFS-16586) Purge FsDatasetAsyncDiskService threadgroup; it causes BPServiceActor$CommandProcessingThread IllegalThreadStateException 'fatal exception and exit'

     [ https://issues.apache.org/jira/browse/HDFS-16586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Stack resolved HDFS-16586.
----------------------------------
    Fix Version/s: 3.4.0
                   3.2.4
                   3.3.4
     Hadoop Flags: Reviewed
       Resolution: Fixed

Merged to branch-3, branch-3.3, and to branch-3.2. Thank you for the review [~hexiaoqiao] 

> Purge FsDatasetAsyncDiskService threadgroup; it causes BPServiceActor$CommandProcessingThread IllegalThreadStateException 'fatal exception and exit' 
> -----------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-16586
>                 URL: https://issues.apache.org/jira/browse/HDFS-16586
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 3.3.0, 3.2.3
>            Reporter: Michael Stack
>            Assignee: Michael Stack
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.4.0, 3.2.4, 3.3.4
>
>          Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> The below failed block finalize is causing a downstreamer's test to fail when it uses hadoop 3.2.3 or 3.3.0+:
> {code:java}
> 2022-05-19T18:21:08,243 INFO  [Command processor] impl.FsDatasetAsyncDiskService(234): Scheduling blk_1073741840_1016 replica FinalizedReplica, blk_1073741840_1016, FINALIZED
>   getNumBytes()     = 52
>   getBytesOnDisk()  = 52
>   getVisibleLength()= 52
>   getVolume()       = /Users/stack/checkouts/hbase.apache.git/hbase-server/target/test-data/d544dd1e-b42d-8fae-aa9a-99e3eb52f61c/cluster_e8660d1b-733a-b023-2e91-dc3f951cf189/dfs/data/data2
>   getBlockURI()     = file:/Users/stack/checkouts/hbase.apache.git/hbase-server/target/test-data/d544dd1e-b42d-8fae-aa9a-99e3eb52f61c/cluster_e8660d1b-733a-b023-2e91-dc3f951cf189/dfs/data/data2/current/BP-62743752-127.0.0.1-1653009535881/current/finalized/subdir0/subdir0/blk_1073741840 for deletion
> 2022-05-19T18:21:08,243 DEBUG [IPC Server handler 0 on default port 54774] metrics.TopMetrics(134): a metric is reported: cmd: delete user: stack.hfs.0 (auth:SIMPLE)
> 2022-05-19T18:21:08,243 DEBUG [IPC Server handler 0 on default port 54774] top.TopAuditLogger(78): ------------------- logged event for top service: allowed=true ugi=stack.hfs.0 (auth:SIMPLE) ip=/127.0.0.1 cmd=delete  src=/user/stack/test-data/b8167d53-bcd7-c682-a767-55faaf7f3e96/data/default/t1/4499521075f51d5138fe4f1916daf92d/.tmp  dst=null  perm=null
> 2022-05-19T18:21:08,243 DEBUG [PacketResponder: BP-62743752-127.0.0.1-1653009535881:blk_1073741830_1006, type=LAST_IN_PIPELINE] datanode.BlockReceiver$PacketResponder(1645): PacketResponder: BP-62743752-127.0.0.1-1653009535881:blk_1073741830_1006, type=LAST_IN_PIPELINE, replyAck=seqno: 901 reply: SUCCESS downstreamAckTimeNanos: 0 flag: 0
> 2022-05-19T18:21:08,243 DEBUG [PacketResponder: BP-62743752-127.0.0.1-1653009535881:blk_1073741830_1006, type=LAST_IN_PIPELINE] datanode.BlockReceiver$PacketResponder(1327): PacketResponder: BP-62743752-127.0.0.1-1653009535881:blk_1073741830_1006, type=LAST_IN_PIPELINE: seqno=-2 waiting for local datanode to finish write.
> 2022-05-19T18:21:08,243 ERROR [Command processor] datanode.BPServiceActor$CommandProcessingThread(1276): Command processor encountered fatal exception and exit.
> java.lang.IllegalThreadStateException: null
>   at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:865) ~[?:?]
>   at java.lang.Thread.<init>(Thread.java:430) ~[?:?]
>   at java.lang.Thread.<init>(Thread.java:704) ~[?:?]
>   at java.lang.Thread.<init>(Thread.java:525) ~[?:?]
>   at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService$1.newThread(FsDatasetAsyncDiskService.java:113) ~[hadoop-hdfs-3.2.3.jar:?]
>   at java.util.concurrent.ThreadPoolExecutor$Worker.<init>(ThreadPoolExecutor.java:623) ~[?:?]
>   at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:912) ~[?:?]
>   at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1343) ~[?:?]
>   at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:189) ~[hadoop-hdfs-3.2.3.jar:?]
>   at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:238) ~[hadoop-hdfs-3.2.3.jar:?]
>   at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2184) ~[hadoop-hdfs-3.2.3.jar:?]
>   at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2103) ~[hadoop-hdfs-3.2.3.jar:?]
>   at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:736) ~[hadoop-hdfs-3.2.3.jar:?]
>   at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:682) ~[hadoop-hdfs-3.2.3.jar:?]
>   at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1318) ~[hadoop-hdfs-3.2.3.jar:?]
>   at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.lambda$enqueue$2(BPServiceActor.java:1364) ~[hadoop-hdfs-3.2.3.jar:?]
>   at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1291) ~[hadoop-hdfs-3.2.3.jar:?]
>   at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1274) ~[hadoop-hdfs-3.2.3.jar:?]
> 2022-05-19T18:21:08,243 DEBUG [DataXceiver for client DFSClient_NONMAPREDUCE_1036441577_16 at /127.0.0.1:54879 [Receiving block BP-62743752-127.0.0.1-1653009535881:blk_1073741830_1006]] datanode.BlockReceiver(532): Receiving one packet for block BP-62743752-127.0.0.1-1653009535881:blk_1073741830_1006: PacketHeader with packetLen=750 header data: offsetInBlock: 495104
> seqno: 902
> lastPacketInBlock: false
> dataLen: 7382022-05-19T18:21:08,243 WARN  [Command processor] datanode.BPServiceActor$CommandProcessingThread(1279): Ending command processor service for: Thread[Command processor,5,FailOnTimeoutGroup] {code}
> Above, we are trying to run a FINALIZED 'command' asynchronously using a thread from the volume's dedicated executor. The executor is trying to create a thread because it has none-to-hand as it has not seen action in over {color:#660e7a}THREADS_KEEP_ALIVE_SECONDS{color} (60seconds) so all live threads have been let go (The volume executor as configured allows core threads to die also). The thread creation fails because the FsDatasetAsyncDiskService#threadGroup we are trying to associate the new thread with is 'destroyed'. Threadgroups auto-'destroy' themselves after their last member has died|[https://github.com/openjdk/jdk11/blob/master/src/java.base/share/classes/java/lang/ThreadGroup.java#L846]|https://github.com/openjdk/jdk11/blob/master/src/java.base/share/classes/java/lang/ThreadGroup.java#L846].] which can happen given the way this volume executor is configured. This FsDatasetAsyncDiskService#threadGroup goes unexploited. If no thread group is supplied, the executor makes a default that internally associates any created threads with a dedicated executor 'threadgroup' but given how our executor is configured to let all threads die in the name of economizing on total unning threads, a threadgroup-for-each-volume-executor is likely to have the same issue as this  FsDatasetAsyncDiskService#threadGroup one. Let me put up a PR to remove it and undo the cause of the above ITSE.
> Of interest perhaps is that this IllegalThreadStateException is not new to 3.2.3/3.0.0+. See the below from 3.2.2 when I run the same downstream test:
> {{2022-05-19T17:06:11,140 ERROR [BP-1924619834-10.0.0.192-1653005021952 heartbeating to localhost/127.0.0.1:54360] datanode.BPServiceActor(856): Exception in BPOfferService for Block pool BP-1924619834-10.0.0.192-1653005021952 (Datanode Uuid bb5219c0-a444-4e5a-9455-6176ced7ffe7) service to localhost/127.0.0.1:54360}}
> {{java.lang.IllegalThreadStateException: null}}
> {{  at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:865) ~[?:?]}}
> {{  at java.lang.Thread.<init>(Thread.java:430) ~[?:?]}}
> {{  at java.lang.Thread.<init>(Thread.java:704) ~[?:?]}}
> {{  at java.lang.Thread.<init>(Thread.java:525) ~[?:?]}}
> {{  at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService$1.newThread(FsDatasetAsyncDiskService.java:105) ~[hadoop-hdfs-3.2.2.jar:?]}}
> {{  at java.util.concurrent.ThreadPoolExecutor$Worker.<init>(ThreadPoolExecutor.java:623) ~[?:?]}}
> {{  at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:912) ~[?:?]}}
> {{  at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1343) ~[?:?]}}
> {{  at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:181) ~[hadoop-hdfs-3.2.2.jar:?]}}
> {{  at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:230) ~[hadoop-hdfs-3.2.2.jar:?]}}
> {{  at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2155) ~[hadoop-hdfs-3.2.2.jar:?]}}
> {{  at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2074) ~[hadoop-hdfs-3.2.2.jar:?]}}
> {{  at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:736) ~[hadoop-hdfs-3.2.2.jar:?]}}
> {{  at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:682) ~[hadoop-hdfs-3.2.2.jar:?]}}
> {{  at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:888) ~[hadoop-hdfs-3.2.2.jar:?]}}
> {{  at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:678) ~[hadoop-hdfs-3.2.2.jar:?]}}
> {{  at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:854) ~[hadoop-hdfs-3.2.2.jar:?]}}
> {{  at java.lang.Thread.run(Thread.java:829) ~[?:?]}}
> It too runs into the IllegalThreadStateException.
> The difference is HDFS-14997 which nicely made command processing async. It changed how commands work (of course). Now on an ITSE, the processor shuts down completely with ERROR. In my test case, the FINALIZED of the block fails and the procedure waiting on its completion times out, in turn timing out the test. In the old days, before HDFS-14997, the then synchronus heartbeats would pick up the ITSE earlier and would just keep retrying in spite of ITSE – the block finalize happened to complete and the test passed (I did not do deep analysis of how command-processnig behavior has changed. If interest, I'd be up for digging in).
> {color:#660e7a} {color}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-dev-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-help@hadoop.apache.org