Posted to common-dev@hadoop.apache.org by jiang yu <ji...@gmail.com> on 2015/09/22 04:29:43 UTC

huge editlog segment size make standby start failure

Hi everyone,
My SNN failed two days ago, so it stopped triggering edit log rolls on the ANN, and the current edit log segment has grown to as much as 10 GB. After I restarted the SNN, it failed to fetch the edit log because the segment is too large. The log is below:
2015-09-22 00:23:07,338 ERROR org.apache.hadoop.hdfs.server.namenode.EditLogInputStream: Got error reading edit log input stream http://**********:8480/getJournal?jid=ns1&segmentTxId=19034359098&storageInfo=-56%3A200185119%3A1401352022932%3ACID-3c312573-1381-44f2-9e8b-fa2529f043d7&ugi=hadoop; failing over to edit log http://*******:8480/getJournal?jid=ns1&segmentTxId=19034359098&storageInfo=-56%3A200185119%3A1401352022932%3ACID-3c312573-1381-44f2-9e8b-fa2529f043d7&ugi=hadoop
java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:129)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
        at java.io.FilterInputStream.read(FilterInputStream.java:116)
        at sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.read(HttpURLConnection.java:2707)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
        at java.io.FilterInputStream.read(FilterInputStream.java:66)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader$PositionTrackingInputStream.read(FSEditLogLoader.java:1105)
        at java.io.FilterInputStream.read(FilterInputStream.java:66)
        at java.util.zip.CheckedInputStream.read(CheckedInputStream.java:42)

I don’t think it is a good idea to set the connection timeout in URLFactory, which defaults to 1 minute.
For now I can’t restart the SNN, so the ANN rolls its edit log only once per day, and each segment is so large that the SNN cannot be restarted.
I am currently developing a utility to resolve this problem:
1. Use RPC to ask the ANN to roll its edit log, as the EditLogTailer does.
2. Copy all the metadata from SNN to ANN, read the newest FSImage file and the edit log files from the local file system, apply them to the FSNamesystem, and then save the namespace to form a new FSImage file.
3. After that, restart the SNN and hope everything goes well.


Any ideas? I would appreciate your reply. Thank you.

RE: huge editlog segment size make standby start failure

Posted by Brahma Reddy Battula <br...@huawei.com>.

Hi jiang yu

Assuming the ANN is working fine, you can just issue saveNamespace on the ANN and then start the SNN with bootstrapStandby; that should be OK.

You can do it as follows:

1) Put the ANN into safe mode, i.e. hdfs dfsadmin -safemode enter
2) Execute saveNamespace, i.e. hdfs dfsadmin -saveNamespace
3) Leave safe mode, i.e. hdfs dfsadmin -safemode leave
4) Start the SNN using bootstrap, i.e. hdfs namenode -bootstrapStandby
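
The four steps above could be wrapped in a small script along these lines. This is only a sketch under some assumptions: the `run` helper, the `DRY_RUN` variable, the function names, and the `active`/`standby` arguments are made up for illustration; steps 1-3 are assumed to run on the ANN host and step 4 on the SNN host. By default it only prints the commands instead of executing them.

```shell
#!/usr/bin/env bash
# Sketch: checkpoint on the active NameNode, then re-seed the standby.
# DRY_RUN=1 (the default here) only prints each command; DRY_RUN=0 executes it.
set -euo pipefail

DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Steps 1-3, to be run on the ANN host.
checkpoint_active() {
  local status=0
  run hdfs dfsadmin -safemode enter
  # Remember a saveNamespace failure, but still leave safe mode afterwards.
  run hdfs dfsadmin -saveNamespace || status=$?
  run hdfs dfsadmin -safemode leave
  return "$status"
}

# Step 4, to be run on the SNN host.
bootstrap_standby() {
  run hdfs namenode -bootstrapStandby
}

# Dispatch on the first argument; with no argument nothing is executed.
case "${1:-}" in
  active)  checkpoint_active ;;
  standby) bootstrap_standby ;;
esac
```

If this were saved as, say, recover_standby.sh (a hypothetical name), `./recover_standby.sh active` on the ANN would print the three dfsadmin commands, and `DRY_RUN=0 ./recover_standby.sh active` would actually run them. Note that -bootstrapStandby only copies the namespace to the standby; the SNN process still has to be started afterwards in whatever way the cluster normally starts it.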

To avoid such problems in the future and have rolling triggered automatically, you can reduce the value of dfs.namenode.edit.log.autoroll.multiplier.threshold (the default is 2.0, which means an automatic roll is triggered every 2M txns).
But this applies only if the version is > 2.3.0.
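
To make the "2M txns" figure concrete: as far as I know, the auto-roll threshold is the multiplier times dfs.namenode.checkpoint.txns, which defaults to 1000000. The snippet below is just that arithmetic as a sketch; the function name `autoroll_threshold` is made up, and the value 0.5 is only an example of a reduced setting, not a recommendation.

```shell
#!/usr/bin/env bash
# Sketch: auto-roll threshold (in transactions) =
#   dfs.namenode.edit.log.autoroll.multiplier.threshold * dfs.namenode.checkpoint.txns
autoroll_threshold() {
  local multiplier="$1" checkpoint_txns="$2"
  # awk is used because the multiplier is a floating-point value.
  awk -v m="$multiplier" -v c="$checkpoint_txns" 'BEGIN { printf "%d\n", m * c }'
}

autoroll_threshold 2.0 1000000   # defaults: prints 2000000
autoroll_threshold 0.5 1000000   # example of a reduced multiplier: prints 500000
```

On a live cluster the two inputs can be read with `hdfs getconf -confKey <key>` rather than typed in by hand.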

Thanks & Regards
 Brahma Reddy Battula