Posted to issues@ozone.apache.org by "Arpit Agarwal (Jira)" <ji...@apache.org> on 2020/06/02 18:35:00 UTC

[jira] [Resolved] (HDDS-1687) Datanode process shutdown due to OOME

     [ https://issues.apache.org/jira/browse/HDDS-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arpit Agarwal resolved HDDS-1687.
---------------------------------
    Resolution: Done

We've tuned the DN settings recently, so I'm not sure this is still an issue.

Also, the DN needs some of its out-of-the-box GC settings changed to accommodate the retry cache. Let's reopen if we still see this error.
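
One way to check whether a tuned DN is still close to its direct buffer limit is to read the JVM's buffer pool statistics. The sketch below is only an illustration using the standard java.lang.management API; it is not part of Ozone or Ratis, and the class name DirectBufferUsage is made up for this example. It assumes a HotSpot JVM, where the pool backing ByteBuffer.allocateDirect is named "direct" and its ceiling is set with -XX:MaxDirectMemorySize.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;
import java.util.List;

// Hypothetical helper, not part of Ozone or Ratis.
public class DirectBufferUsage {

    // Returns the bytes currently held by the JVM's "direct" buffer pool,
    // or -1 if no pool with that name is exposed by this JVM.
    public static long directBytesUsed() {
        List<BufferPoolMXBean> pools =
            ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
        for (BufferPoolMXBean pool : pools) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1L;
    }

    public static void main(String[] args) {
        long before = directBytesUsed();
        // Allocate 1 MiB off-heap, the same kind of allocation that failed
        // in the reported stack trace (ByteBuffer.allocateDirect).
        ByteBuffer buffer = ByteBuffer.allocateDirect(1 << 20);
        long after = directBytesUsed();
        System.out.println("direct pool before: " + before + " bytes");
        System.out.println("direct pool after allocating "
            + buffer.capacity() + " bytes: " + after + " bytes");
    }
}
```

The same figures are exposed over JMX under java.nio:type=BufferPool,name=direct, so they can also be watched on a running process without code changes.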

> Datanode process shutdown due to OOME
> -------------------------------------
>
>                 Key: HDDS-1687
>                 URL: https://issues.apache.org/jira/browse/HDDS-1687
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>    Affects Versions: 0.5.0
>            Reporter: Rakesh Radhakrishnan
>            Priority: Major
>         Attachments: baseline test - datanode error logs.0.5.0.rar
>
>
> Ran the Freon benchmark on a three-node cluster. With more parallel writer threads, the datanode daemon hit an OOME and shut down. The worker nodes used HDD as the storage type.
> +Freon with the args:+
> --numOfBuckets=10 --numOfKeys=8 --keySize=67108864 --numOfVolumes=100 --numOfThreads=100
> *DN-2*: Process was killed during the test due to OOME.
> {code}
> 2019-06-13 00:48:11,976 ERROR org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker: Terminating with exit status 1: a0cb8914-b51c-41b1-b5d2-59313cf38c0b-SegmentedRaftLogWorker:Storage Directory /data/datab/ozone/metadir/ratis/cbf29739-cbd1-4b00-8a21-2db750004dc7 failed.
> java.lang.OutOfMemoryError: Direct buffer memory
>                at java.nio.Bits.reserveMemory(Bits.java:694)
>                at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
>                at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
>                at org.apache.ratis.server.raftlog.segmented.BufferedWriteChannel.<init>(BufferedWriteChannel.java:44)
>                at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogOutputStream.<init>(SegmentedRaftLogOutputStream.java:70)
>                at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker$StartLogSegment.execute(SegmentedRaftLogWorker.java:481)
>                at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker.run(SegmentedRaftLogWorker.java:234)
>                at java.lang.Thread.run(Thread.java:748)
> {code}
> *DN-3*: Process was killed during the test due to OOME. I could see many NPEs in the datanode logs.
> {code}
> 2019-06-13 00:44:44,581 INFO org.apache.ratis.grpc.server.GrpcLogAppender: 83232f1f-4469-4a4d-b369-c131c8432ae9: follower 07ace812-3883-47d3-ac95-3d55de5fab5c:10.243.61.192:9858's next index is 0, log's start index is 10062, need to notify follower to install snapshot
> 2019-06-13 00:44:44,582 INFO org.apache.ratis.grpc.server.GrpcLogAppender: 83232f1f-4469-4a4d-b369-c131c8432ae9->07ace812-3883-47d3-ac95-3d55de5fab5c: follower responses installSnapshot Completed
> 2019-06-13 00:44:44,582 INFO org.apache.ratis.grpc.server.GrpcLogAppender: 83232f1f-4469-4a4d-b369-c131c8432ae9: follower 07ace812-3883-47d3-ac95-3d55de5fab5c:10.243.61.192:9858's next index is 0, log's start index is 10062, need to notify follower to install snapshot
> 2019-06-13 00:44:44,587 ERROR org.apache.ratis.server.impl.LogAppender: org.apache.ratis.server.impl.LogAppender$AppenderDaemon@554415fe unexpected exception
> java.lang.NullPointerException: 83232f1f-4469-4a4d-b369-c131c8432ae9->07ace812-3883-47d3-ac95-3d55de5fab5c: Previous TermIndex not found for firstIndex = 10062
>                at java.util.Objects.requireNonNull(Objects.java:290)
>                at org.apache.ratis.server.impl.LogAppender.assertProtos(LogAppender.java:234)
>                at org.apache.ratis.server.impl.LogAppender.createRequest(LogAppender.java:221)
>                at org.apache.ratis.grpc.server.GrpcLogAppender.appendLog(GrpcLogAppender.java:169)
>                at org.apache.ratis.grpc.server.GrpcLogAppender.runAppenderImpl(GrpcLogAppender.java:113)
>                at org.apache.ratis.server.impl.LogAppender$AppenderDaemon.run(LogAppender.java:80)
>                at java.lang.Thread.run(Thread.java:748)
> OOME log messages are present in the *.out file:
> Exception in thread "org.apache.ratis.server.impl.LogAppender$AppenderDaemon$$Lambda$267/386355867@1d9c10b3" java.lang.OutOfMemoryError: unable to create new native thread
>                at java.lang.Thread.start0(Native Method)
>                at java.lang.Thread.start(Thread.java:717)
>                at org.apache.ratis.server.impl.LogAppender$AppenderDaemon.start(LogAppender.java:68)
>                at org.apache.ratis.server.impl.LogAppender.startAppender(LogAppender.java:153)
>                at java.util.ArrayList.forEach(ArrayList.java:1257)
>                at org.apache.ratis.server.impl.LeaderState.addAndStartSenders(LeaderState.java:372)
>                at org.apache.ratis.server.impl.LeaderState.restartSender(LeaderState.java:394)
>                at org.apache.ratis.server.impl.LogAppender$AppenderDaemon.run(LogAppender.java:97)
>                at java.lang.Thread.run(Thread.java:748)
> {code}
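
For a sense of scale, the Freon arguments quoted in the report can be turned into a total data volume. The sketch below assumes that --numOfBuckets is counted per volume and --numOfKeys per bucket, which is how I understand Freon's random key generator; treat that as an assumption rather than something the report states. The class and method names are made up for this example, and replication is not included in the figure.

```java
// Hypothetical helper for sizing the workload in the report; not part of Freon.
public class FreonWorkloadSize {

    // Total keys, assuming buckets are per volume and keys are per bucket.
    public static long totalKeys(long volumes, long bucketsPerVolume, long keysPerBucket) {
        return Math.multiplyExact(Math.multiplyExact(volumes, bucketsPerVolume), keysPerBucket);
    }

    // Total bytes written by the client, before replication.
    public static long totalBytes(long keys, long keySizeBytes) {
        return Math.multiplyExact(keys, keySizeBytes);
    }

    public static void main(String[] args) {
        // Values taken from the reported command line.
        long keys = totalKeys(100, 10, 8);          // 8000 keys
        long bytes = totalBytes(keys, 67_108_864L); // 64 MiB per key
        System.out.println(keys + " keys, " + bytes + " bytes, "
            + (bytes >> 30) + " GiB");
    }
}
```

Under that assumption the run writes 8000 keys of 64 MiB each, about 500 GiB before replication, driven by 100 client threads.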



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
