Posted to hdfs-user@hadoop.apache.org by "rongshen.long" <ro...@baifendian.com> on 2012/10/22 16:55:35 UTC

Backup node crashed with NPE and failed to restart

hi,
I tried to run a backup node on HDFS 0.21, but the daemon crashed with an NPE (stack trace below) and 
left an 'edits.new' file in the $dfs.namenode.name.dir/current directory. After that, both the namenode and the backup node failed to restart with the same exception. 
Could anyone help me recover the cluster? Although the NN can be restarted by creating an empty 'edits' file, much data would be lost.

12/10/09 15:32:45 ERROR namenode.Checkpointer: Throwable Exception in doCheckpoint: 
java.lang.NullPointerException
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedSetTimes(FSDirectory.java:1765)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedSetTimes(FSDirectory.java:1753)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:708)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:411)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:378)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1209)
        at org.apache.hadoop.hdfs.server.namenode.BackupStorage.loadCheckpoint(BackupStorage.java:158)
        at org.apache.hadoop.hdfs.server.namenode.Checkpointer.doCheckpoint(Checkpointer.java:243)
        at org.apache.hadoop.hdfs.server.namenode.Checkpointer.run(Checkpointer.java:141)
12/10/09 15:32:45 WARN namenode.FSNamesystem: ReplicationMonitor thread received InterruptedException.java.lang.InterruptedException: sleep interrupted
12/10/09 15:32:45 WARN namenode.DecommissionManager: Monitor interrupted: java.lang.InterruptedException: sleep interrupted
12/10/09 15:32:45 INFO namenode.FSNamesystem: Number of transactions: 24 Total time for transactions(ms): 4Number of transactions batched in Syncs: 0 Number of syncs: 25 SyncTimes(ms): 239 
12/10/09 15:32:45 INFO ipc.Server: Stopping server on 50100




2012-10-22



rongshen.long

problem with upgrading from HDFS 0.21 to HDFS 1.0.4

Posted by "rongshen.long" <ro...@baifendian.com>.
hi all,
It seems upgrading Hadoop from 0.21 to the stable version 1.0.4 is not supported. The 'linkBlocks' function in 'DataStorage.java' (v1.0.4) cannot work properly, because the datanode storage layout of the former differs from the latter: in 0.21 there are 'finalized' and 'rbw' directories under $dfs.datanode.data.dir/current.
Do you have any suggestions for dealing with this problem?

2012-11-20



rongshen.long


Re: Backup node crashed with NPE and failed to restart

Posted by Harsh J <ha...@cloudera.com>.
Hi,

First off, do not use 0.21, it is unsupported/unmaintained. Use 2.x if
you want HA-NN capabilities. See
http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailability.html.

Second, the BackupNode/CheckpointNode is also not actively maintained,
and may soon be removed in favor of HA NameNodes and (when not using
HA) the SecondaryNameNode.

Regarding your metadata: if your NN is still up, issue a "dfsadmin
-saveNamespace" to recreate a good copy of the image and edits from
memory. If your NN was taken down and fails to start anymore, try to
restore from an older checkpoint - do you have one?
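[Editor's note: a rough sketch of the recovery path described above. The
paths are placeholders, not values from this thread - substitute your own
dfs.namenode.name.dir and checkpoint directory. On 0.21 the admin
front-end is `hadoop dfsadmin`.]

```shell
# Sketch only -- directory paths below are placeholders; adjust them to
# your own dfs.namenode.name.dir and checkpoint directory settings.

# Case 1: the NameNode is still running. saveNamespace requires safe
# mode, and writes a fresh fsimage plus a reset edits log from memory.
hadoop dfsadmin -safemode enter
hadoop dfsadmin -saveNamespace
hadoop dfsadmin -safemode leave

# Case 2: the NameNode is down and will not start. Preserve the damaged
# metadata directory first, then copy in the newest good checkpoint
# before restarting the NameNode. Edits made after that checkpoint are
# lost, but the namespace up to it is recovered.
cp -a /data/dfs/name /data/dfs/name.bak.$(date +%F)
cp -a /data/dfs/namesecondary/current/. /data/dfs/name/current/
```

Note that in case 2, any transactions logged after the last checkpoint
cannot be recovered, which is why case 1 is preferable whenever the NN
process is still alive.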

On Mon, Oct 22, 2012 at 8:25 PM, rongshen.long
<ro...@baifendian.com> wrote:
> hi,
> I tried to run a backup node on hdfs 0.21 , however the daemon crashed with
> NPE (stack trace as below) and
> left an 'edits.new' file in the $dfs.namenode.name.dir/current diretory .
> After that , I failed to restart the namenode and the backup node because of
> the same exception.
> Could anyone give me a help to recovery the cluster?  Although the NN can be
> restarted by creating an empty 'edits' file ,much data would be lost .
>
> 12/10/09 15:32:45 ERROR namenode.Checkpointer: Throwable Exception in
> doCheckpoint:
> java.lang.NullPointerException
>         at
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedSetTimes(FSDirectory.java:1765)
>         at
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedSetTimes(FSDirectory.java:1753)
>         at
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:708)
>         at
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:411)
>         at
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:378)
>         at
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1209)
>         at
> org.apache.hadoop.hdfs.server.namenode.BackupStorage.loadCheckpoint(BackupStorage.java:158)
>         at
> org.apache.hadoop.hdfs.server.namenode.Checkpointer.doCheckpoint(Checkpointer.java:243)
>         at
> org.apache.hadoop.hdfs.server.namenode.Checkpointer.run(Checkpointer.java:141)
> 12/10/09 15:32:45 WARN namenode.FSNamesystem: ReplicationMonitor thread
> received InterruptedException.java.lang.InterruptedException: sleep
> interrupted
> 12/10/09 15:32:45 WARN namenode.DecommissionManager: Monitor interrupted:
> java.lang.InterruptedException: sleep interrupted
> 12/10/09 15:32:45 INFO namenode.FSNamesystem: Number of transactions: 24
> Total time for transactions(ms): 4Number of transactions batched in Syncs: 0
> Number of syncs: 25 SyncTimes(ms): 239
> 12/10/09 15:32:45 INFO ipc.Server: Stopping server on 50100
>
>
>
> 2012-10-22
> ________________________________
> rongshen.long



-- 
Harsh J
