Posted to issues@ozone.apache.org by "Glen Geng (Jira)" <ji...@apache.org> on 2020/10/30 02:23:00 UTC

[jira] [Updated] (HDDS-4408) Datanode State Machine Thread should keep alive during the whole lifetime of DatanodeStateMachine

     [ https://issues.apache.org/jira/browse/HDDS-4408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Glen Geng updated HDDS-4408:
----------------------------
    Summary: Datanode State Machine Thread should keep alive during the whole lifetime of DatanodeStateMachine  (was: Datanode State Machine Thread needs handle OutOfMemoryError)

> Datanode State Machine Thread should keep alive during the whole lifetime of DatanodeStateMachine
> -------------------------------------------------------------------------------------------------
>
>                 Key: HDDS-4408
>                 URL: https://issues.apache.org/jira/browse/HDDS-4408
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>          Components: Ozone Datanode
>    Affects Versions: 1.1.0
>            Reporter: Glen Geng
>            Priority: Major
>              Labels: pull-request-available
>
> In Tencent's internal production environment, we hit several dead DNs that never come back without a restart.
>  
> We found that the thread "Datanode State Machine Thread - 0" no longer exists in the jstack output, so no HeartbeatEndpointTask is created; the DN soon becomes dead and cannot recover unless it is restarted (see the keep-alive sketch at the end of this description).
>  
> After checking the .out log, we saw that an OOM occurred in the "Datanode State Machine Thread", which is most likely responsible for this issue:
> {code:java}
> 114370.799: Total time for which application threads were stopped: 1.0883622 seconds, Stopping threads took: 0.0002926 seconds
> Exception in thread "Datanode State Machine Thread - 0" java.lang.OutOfMemoryError: GC overhead limit exceeded 114370.810: Application time: 0.0115941 seconds {Heap before GC invocations=2946 (full 2680): PSYoungGen total 3170304K, used 2846720K [0x00000006eab00000, 0x00000007c0000000, 0x00000007c0000000) eden space 2846720K, 100% used [0x00000006eab00000,0x0000000798700000,0x0000000798700000) from space 323584K, 0% used [0x00000007ac400000,0x00000007ac400000,0x00000007c0000000) to space 324096K, 0% used [0x0000000798700000,0x0000000798700000,0x00000007ac380000) ParOldGen total 6990848K, used 6990627K [0x0000000540000000, 0x00000006eab00000, 0x00000006eab00000) object space 6990848K, 99% used [0x0000000540000000,0x00000006eaac8c90,0x00000006eab00000) Metaspace used 60721K, capacity 63446K, committed 64128K, reserved 1105920K class space used 6583K, capacity 7031K, committed 7296K, reserved 1048576K
> {code}
>  
> {code:java}
> 300010.579: Total time for which application threads were stopped: 3.0848769 seconds, Stopping threads took: 0.0000943 seconds
> Exception in thread "Datanode State Machine Thread - 0" java.lang.OutOfMemoryError: Java heap space
> 300010.579: Application time: 0.0001554 seconds
> 300010.580: Total time for which application threads were stopped: 0.0015600 seconds, Stopping threads took: 0.0002747 seconds
> 300010.581: Application time: 0.0004684 seconds
> {Heap before GC invocations=13766 (full 11664):
>  PSYoungGen total 3441664K, used 3388416K [0x000000072ab00000, 0x0000000800000000, 0x0000000800000000)
>  eden space 3388416K, 100% used [0x000000072ab00000,0x00000007f9800000,0x00000007f9800000)
>  from space 53248K, 0% used [0x00000007fcc00000,0x00000007fcc00000,0x0000000800000000)
>  to space 53248K, 0% used [0x00000007f9800000,0x00000007f9800000,0x00000007fcc00000)
>  ParOldGen total 6990848K, used 6990848K [0x0000000580000000, 0x000000072ab00000, 0x000000072ab00000)
>  object space 6990848K, 100% used [0x0000000580000000,0x000000072ab00000,0x000000072ab00000)
>  Metaspace used 55150K, capacity 57816K, committed 59224K, reserved 1101824K
>  class space used 5922K, capacity 6372K, committed 6744K, reserved 1048576K{code}
>  
> BTW, after running the DN for more than a week, we see a lot of "java.lang.OutOfMemoryError: GC overhead limit exceeded" entries in the DN's log. Since we have configured a dead Recon, we guess this could be evidence for HDDS-4404.
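>  
> For illustration only, a minimal sketch of the idea behind the new summary ("the state machine thread should keep alive"): a run loop that catches Throwable, so an OutOfMemoryError thrown during one iteration (e.g. while building a heartbeat task) does not silently kill the thread. The class and method names below are assumptions for the sketch, not the actual DatanodeStateMachine code or the HDDS-4408 patch:
> {code:java}
> // Sketch (not Ozone code): a state-machine loop that survives unexpected
> // Throwables so the thread stays alive for the lifetime of its owner.
> public class ResilientStateMachineLoop implements Runnable {
>   private volatile boolean running = true;
>
>   @Override
>   public void run() {
>     while (running) {
>       try {
>         executeNextState();          // hypothetical per-iteration work
>       } catch (InterruptedException e) {
>         Thread.currentThread().interrupt();
>         running = false;             // honor shutdown requests
>       } catch (Throwable t) {
>         // Catching Throwable (not just Exception) keeps the loop alive even
>         // after an OutOfMemoryError in a single iteration; log and retry.
>         System.err.println("State machine iteration failed, will retry: " + t);
>       }
>     }
>   }
>
>   private void executeNextState() throws InterruptedException {
>     // Placeholder for the real per-state work (e.g. building and sending a
>     // HeartbeatEndpointTask); the sleep stands in for the heartbeat interval.
>     Thread.sleep(1000);
>   }
>
>   public void stop() {
>     running = false;
>   }
>
>   public static void main(String[] args) throws InterruptedException {
>     ResilientStateMachineLoop loop = new ResilientStateMachineLoop();
>     Thread t = new Thread(loop, "Datanode State Machine Thread - demo");
>     t.start();
>     Thread.sleep(5000);
>     loop.stop();
>     t.interrupt();
>     t.join();
>   }
> }
> {code}
> Whether swallowing OutOfMemoryError is safe is debatable in general; the point of the sketch is only that an uncaught Throwable must not terminate the thread that drives heartbeats.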



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org