Posted to mapreduce-user@hadoop.apache.org by Robert Dyer <ps...@gmail.com> on 2013/02/16 21:38:40 UTC

Namenode failures

I am just about at my wits' end here.  Every single time I restart the
namenode, I get this crash:

2013-02-16 14:32:42,616 INFO org.apache.hadoop.hdfs.server.common.Storage:
Image file of size 168058 loaded in 0 seconds.
2013-02-16 14:32:42,618 ERROR
org.apache.hadoop.hdfs.server.namenode.NameNode:
java.lang.NullPointerException
    at
org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1099)
    at
org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1111)
    at
org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:1014)
    at
org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:208)
    at
org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:631)
    at
org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1021)
    at
org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:839)
    at
org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:377)
    at
org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:100)
    at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:388)
    at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:362)
    at
org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:276)
    at
org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:496)
    at
org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1279)
    at
org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1288)

I am following best practices here, as far as I know.  I have the namenode
writing into 3 directories (2 local, 1 NFS).  All 3 of these dirs have the
exact same files in them.
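
For context, the relevant bit of my hdfs-site.xml looks roughly like this
(the paths below are placeholders, not my actual directories):

  <property>
    <name>dfs.name.dir</name>
    <value>/data1/dfs/name,/data2/dfs/name,/mnt/nfs/dfs/name</value>
  </property>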

I also run a secondary checkpoint node.  It appears to have started failing a
week ago, so checkpoints have *not* been done since then.  I can get the NN up
and running, but only with week-old data!

What is going on here?  Why does my NN data *always* wind up causing this
exception over time?  Is there some easy way to get notified when the
checkpointing starts to fail?
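
To explain what I mean by "notified": the rough kind of check I have in mind
is just a cron'd script like the sketch below.  It assumes the 2NN drops its
merged image under ${fs.checkpoint.dir}/current/fsimage and that an image
older than an hour means checkpointing has stalled; the paths and the alert
address are placeholders.

  #!/bin/sh
  # Alert if the secondary's most recent fsimage is older than 60 minutes.
  CHECKPOINT_DIR=/path/to/fs.checkpoint.dir    # placeholder
  IMG="$CHECKPOINT_DIR/current/fsimage"
  if [ -z "$(find "$IMG" -mmin -60 2>/dev/null)" ]; then
    echo "HDFS checkpoint is stale or missing: $IMG" \
      | mail -s "HDFS checkpoint alert" admin@example.com
  fi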

Re: Namenode failures

Posted by Robert Dyer <rd...@iastate.edu>.
On Sun, Feb 17, 2013 at 4:41 PM, Mohammad Tariq <do...@gmail.com> wrote:

> Hello Robert,
>
>          It seems that your edit logs and fsimage have got
> corrupted somehow. It looks somewhat similar to this one
> https://issues.apache.org/jira/browse/HDFS-686
>

Similar, but the trace is different.


> Have you made any changes to the 'dfs.name.dir' directory
> lately?
>

No.


> Do you have enough space where metadata is getting
> stored?
>

Yes.  All 3 locations have plenty of space (hundreds of GB).


> You can make use of the offline image viewer to diagnose
> the fsimage file.
>
> Warm Regards,
> Tariq
> https://mtariq.jux.com/
> cloudfront.blogspot.com
>
>
> On Mon, Feb 18, 2013 at 3:31 AM, Robert Dyer <ps...@gmail.com> wrote:
>
>> It just happened again.  This was after a fresh format of HDFS/HBase and
>> I am attempting to re-import the (backed up) data.
>>
>>   http://pastebin.com/3fsWCNQY
>>
>> So now if I restart the namenode, I will lose data from the past 3 hours.
>>
>> What is causing this?  How can I avoid it in the future?  Is there an
>> easy way to monitor (other than a script grep'ing the logs) the checkpoints
>> to see when this happens?
>>
>>
>> On Sat, Feb 16, 2013 at 2:39 PM, Robert Dyer <ps...@gmail.com> wrote:
>>
>>> Forgot to mention: Hadoop 1.0.4
>>>
>>>
>>> On Sat, Feb 16, 2013 at 2:38 PM, Robert Dyer <ps...@gmail.com> wrote:
>>>
>>>> I am at a bit of wits end here.  Every single time I restart the
>>>> namenode, I get this crash:
>>>>
>>>> 2013-02-16 14:32:42,616 INFO
>>>> org.apache.hadoop.hdfs.server.common.Storage: Image file of size 168058
>>>> loaded in 0 seconds.
>>>> 2013-02-16 14:32:42,618 ERROR
>>>> org.apache.hadoop.hdfs.server.namenode.NameNode:
>>>> java.lang.NullPointerException
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1099)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1111)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:1014)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:208)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:631)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1021)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:839)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:377)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:100)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:388)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:362)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:276)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:496)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1279)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1288)
>>>>
>>>> I am following best practices here, as far as I know.  I have the
>>>> namenode writing into 3 directories (2 local, 1 NFS).  All 3 of these dirs
>>>> have the exact same files in them.
>>>>
>>>> I also run a secondary checkpoint node.  This one appears to have
>>>> started failing a week ago.  So checkpoints were *not* being done since
>>>> then.  Thus I can get the NN up and running, but with a week old data!
>>>>
>>>>  What is going on here?  Why does my NN data *always* wind up causing
>>>> this exception over time?  Is there some easy way to get notified when the
>>>> checkpointing starts to fail?
>>>>
>>>
>>>
>>>
>>> --
>>>
>>> Robert Dyer
>>> rdyer@iastate.edu
>>>
>>
>>
>>
>> --
>>
>> Robert Dyer
>> rdyer@iastate.edu
>>
>
>


-- 

Robert Dyer
rdyer@iastate.edu

Re: Namenode failures

Posted by Robert Dyer <rd...@iastate.edu>.
On Sun, Feb 17, 2013 at 5:08 PM, Harsh J <ha...@cloudera.com> wrote:

> Hi Robert,
>
> Are you by any chance adding files carrying unusual encoding?


I don't believe so.  The only files I push to HDFS are SequenceFiles (with
protobuf objects in them) and HBase's regions, which again are just protobuf
objects.  I don't use any special encodings in the protobufs.


> If its
> possible, can we be sent a bundle of the corrupted log set (all of the
> dfs.name.dir contents) to inspect what seems to be causing the
> corruption?
>

I can give the logs, dfs data dir(s), and 2nn dirs.

https://www.dropbox.com/s/heijq65pmb3esvd/hdfs-bug.tar.gz


> The only identified (but rarely occurring) bug around this part in
> 1.0.4 would be https://issues.apache.org/jira/browse/HDFS-4423. The
> other major corruption bug I know of is already fixed in your version,
> being https://issues.apache.org/jira/browse/HDFS-3652 specifically.
>
> We've not had this report from other users so having a reproduced file
> set (data not required) would be most helpful. If you have logs
> leading to the shutdown and crash as well, that'd be good to have too.
>
> P.s. How exactly are you shutting down the NN each time? A kill -9 or
> a regular SIGTERM shutdown?
>

I shut down the NN with 'bin/stop-dfs.sh'.
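
(If it helps rule out an unclean stop: my plan is to grep the NN log for the
normal shutdown banner after each stop -- a sketch only, with a placeholder
log path, and assuming the daemon logs its usual SHUTDOWN_MSG block on a
clean SIGTERM:)

  # Confirm the last stop was a clean one by looking for the shutdown banner.
  grep "SHUTDOWN_MSG" /path/to/logs/hadoop-*-namenode-*.log | tail -n 2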


>  On Mon, Feb 18, 2013 at 4:31 AM, Robert Dyer <rd...@iastate.edu> wrote:
> > On Sun, Feb 17, 2013 at 4:41 PM, Mohammad Tariq <do...@gmail.com>
> wrote:
> >>
> >> You can make use of the offline image viewer to diagnose
> >> the fsimage file.
> >
> >
> > Is this not included in the 1.0.x branch?  All of the documentation I
> find
> > for it says to run 'bin/hdfs oev' but I do not have a 'bin/hdfs'.
> >
> >>
> >> Warm Regards,
> >> Tariq
> >> https://mtariq.jux.com/
> >> cloudfront.blogspot.com
> >>
> >>
> >> On Mon, Feb 18, 2013 at 3:31 AM, Robert Dyer <ps...@gmail.com> wrote:
> >>>
> >>> It just happened again.  This was after a fresh format of HDFS/HBase
> and
> >>> I am attempting to re-import the (backed up) data.
> >>>
> >>>   http://pastebin.com/3fsWCNQY
> >>>
> >>> So now if I restart the namenode, I will lose data from the past 3
> hours.
> >>>
> >>> What is causing this?  How can I avoid it in the future?  Is there an
> >>> easy way to monitor (other than a script grep'ing the logs) the
> checkpoints
> >>> to see when this happens?
> >>>
> >>>
> >>> On Sat, Feb 16, 2013 at 2:39 PM, Robert Dyer <ps...@gmail.com>
> wrote:
> >>>>
> >>>> Forgot to mention: Hadoop 1.0.4
> >>>>
> >>>>
> >>>> On Sat, Feb 16, 2013 at 2:38 PM, Robert Dyer <ps...@gmail.com>
> wrote:
> >>>>>
> >>>>> I am at a bit of wits end here.  Every single time I restart the
> >>>>> namenode, I get this crash:
> >>>>>
> >>>>> 2013-02-16 14:32:42,616 INFO
> >>>>> org.apache.hadoop.hdfs.server.common.Storage: Image file of size
> 168058
> >>>>> loaded in 0 seconds.
> >>>>> 2013-02-16 14:32:42,618 ERROR
> >>>>> org.apache.hadoop.hdfs.server.namenode.NameNode:
> >>>>> java.lang.NullPointerException
> >>>>>     at
> >>>>>
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1099)
> >>>>>     at
> >>>>>
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1111)
> >>>>>     at
> >>>>>
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:1014)
> >>>>>     at
> >>>>>
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:208)
> >>>>>     at
> >>>>>
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:631)
> >>>>>     at
> >>>>>
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1021)
> >>>>>     at
> >>>>>
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:839)
> >>>>>     at
> >>>>>
> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:377)
> >>>>>     at
> >>>>>
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:100)
> >>>>>     at
> >>>>>
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:388)
> >>>>>     at
> >>>>>
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:362)
> >>>>>     at
> >>>>>
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:276)
> >>>>>     at
> >>>>>
> org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:496)
> >>>>>     at
> >>>>>
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1279)
> >>>>>     at
> >>>>>
> org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1288)
> >>>>>
> >>>>> I am following best practices here, as far as I know.  I have the
> >>>>> namenode writing into 3 directories (2 local, 1 NFS).  All 3 of
> these dirs
> >>>>> have the exact same files in them.
> >>>>>
> >>>>> I also run a secondary checkpoint node.  This one appears to have
> >>>>> started failing a week ago.  So checkpoints were *not* being done
> since
> >>>>> then.  Thus I can get the NN up and running, but with a week old
> data!
> >>>>>
> >>>>> What is going on here?  Why does my NN data *always* wind up causing
> >>>>> this exception over time?  Is there some easy way to get notified
> when the
> >>>>> checkpointing starts to fail?
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>>
> >>>> Robert Dyer
> >>>> rdyer@iastate.edu
> >>>
> >>>
> >>>
> >>>
> >>> --
> >>>
> >>> Robert Dyer
> >>> rdyer@iastate.edu
> >>
> >>
> >
> >
> >
> > --
> >
> > Robert Dyer
> > rdyer@iastate.edu
>
>
>
> --
> Harsh J
>



-- 

Robert Dyer
rdyer@iastate.edu

Re: Namenode failures

Posted by Harsh J <ha...@cloudera.com>.
Hi Robert,

Are you by any chance adding files carrying unusual encoding? If it's
possible, could you send us a bundle of the corrupted log set (all of the
dfs.name.dir contents) so we can inspect what seems to be causing the
corruption?
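
(Something along these lines would be enough -- just a sketch, with
placeholder paths standing in for wherever your dfs.name.dir entries live:)

  # Bundle only the NameNode metadata directories; no block data is needed.
  tar czf nn-metadata.tar.gz /data1/dfs/name /data2/dfs/name /mnt/nfs/dfs/name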

The only identified (but rarely occurring) bug around this part of the code
in 1.0.4 would be https://issues.apache.org/jira/browse/HDFS-4423. The other
major corruption bug I know of, https://issues.apache.org/jira/browse/HDFS-3652,
is already fixed in your version.

We've not had this report from other users, so having a file set that
reproduces the problem (data not required) would be most helpful. If you have
the logs leading up to the shutdown and crash as well, those would be good to
have too.

P.S. How exactly are you shutting down the NN each time? A kill -9, or a
regular SIGTERM shutdown?

On Mon, Feb 18, 2013 at 4:31 AM, Robert Dyer <rd...@iastate.edu> wrote:
> On Sun, Feb 17, 2013 at 4:41 PM, Mohammad Tariq <do...@gmail.com> wrote:
>>
>> You can make use of the offline image viewer to diagnose
>> the fsimage file.
>
>
> Is this not included in the 1.0.x branch?  All of the documentation I find
> for it says to run 'bin/hdfs oev' but I do not have a 'bin/hdfs'.
>
>>
>> Warm Regards,
>> Tariq
>> https://mtariq.jux.com/
>> cloudfront.blogspot.com
>>
>>
>> On Mon, Feb 18, 2013 at 3:31 AM, Robert Dyer <ps...@gmail.com> wrote:
>>>
>>> It just happened again.  This was after a fresh format of HDFS/HBase and
>>> I am attempting to re-import the (backed up) data.
>>>
>>>   http://pastebin.com/3fsWCNQY
>>>
>>> So now if I restart the namenode, I will lose data from the past 3 hours.
>>>
>>> What is causing this?  How can I avoid it in the future?  Is there an
>>> easy way to monitor (other than a script grep'ing the logs) the checkpoints
>>> to see when this happens?
>>>
>>>
>>> On Sat, Feb 16, 2013 at 2:39 PM, Robert Dyer <ps...@gmail.com> wrote:
>>>>
>>>> Forgot to mention: Hadoop 1.0.4
>>>>
>>>>
>>>> On Sat, Feb 16, 2013 at 2:38 PM, Robert Dyer <ps...@gmail.com> wrote:
>>>>>
>>>>> I am at a bit of wits end here.  Every single time I restart the
>>>>> namenode, I get this crash:
>>>>>
>>>>> 2013-02-16 14:32:42,616 INFO
>>>>> org.apache.hadoop.hdfs.server.common.Storage: Image file of size 168058
>>>>> loaded in 0 seconds.
>>>>> 2013-02-16 14:32:42,618 ERROR
>>>>> org.apache.hadoop.hdfs.server.namenode.NameNode:
>>>>> java.lang.NullPointerException
>>>>>     at
>>>>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1099)
>>>>>     at
>>>>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1111)
>>>>>     at
>>>>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:1014)
>>>>>     at
>>>>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:208)
>>>>>     at
>>>>> org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:631)
>>>>>     at
>>>>> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1021)
>>>>>     at
>>>>> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:839)
>>>>>     at
>>>>> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:377)
>>>>>     at
>>>>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:100)
>>>>>     at
>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:388)
>>>>>     at
>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:362)
>>>>>     at
>>>>> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:276)
>>>>>     at
>>>>> org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:496)
>>>>>     at
>>>>> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1279)
>>>>>     at
>>>>> org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1288)
>>>>>
>>>>> I am following best practices here, as far as I know.  I have the
>>>>> namenode writing into 3 directories (2 local, 1 NFS).  All 3 of these dirs
>>>>> have the exact same files in them.
>>>>>
>>>>> I also run a secondary checkpoint node.  This one appears to have
>>>>> started failing a week ago.  So checkpoints were *not* being done since
>>>>> then.  Thus I can get the NN up and running, but with a week old data!
>>>>>
>>>>> What is going on here?  Why does my NN data *always* wind up causing
>>>>> this exception over time?  Is there some easy way to get notified when the
>>>>> checkpointing starts to fail?
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Robert Dyer
>>>> rdyer@iastate.edu
>>>
>>>
>>>
>>>
>>> --
>>>
>>> Robert Dyer
>>> rdyer@iastate.edu
>>
>>
>
>
>
> --
>
> Robert Dyer
> rdyer@iastate.edu



--
Harsh J

Re: Namenode failures

Posted by Robert Dyer <rd...@iastate.edu>.
On Sun, Feb 17, 2013 at 4:41 PM, Mohammad Tariq <do...@gmail.com> wrote:

> You can make use of the offline image viewer to diagnose
> the fsimage file.
>

Is this not included in the 1.0.x branch?  All of the documentation I find
for it says to run 'bin/hdfs oev', but I do not have a 'bin/hdfs'.
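
(One thing I may try: 1.x only ships 'bin/hadoop', which can also run an
arbitrary class, so if the viewer classes exist in the 1.0.4 jar at all they
could be launched that way.  I have not confirmed they are actually present
in 1.0.4, so the below is only a guess:)

  # See whether the offline image/edits viewer classes ship in the 1.0.4 jar.
  jar tf $HADOOP_HOME/hadoop-core-1.0.4.jar \
    | grep -i -e offlineImageViewer -e offlineEditsViewer

  # If a viewer class shows up, it can be run via the generic class runner, e.g.
  #   bin/hadoop <fully.qualified.ViewerClassName> -i <image or edits file> -o out.txt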


> Warm Regards,
> Tariq
> https://mtariq.jux.com/
> cloudfront.blogspot.com
>
>
> On Mon, Feb 18, 2013 at 3:31 AM, Robert Dyer <ps...@gmail.com> wrote:
>
>> It just happened again.  This was after a fresh format of HDFS/HBase and
>> I am attempting to re-import the (backed up) data.
>>
>>   http://pastebin.com/3fsWCNQY
>>
>> So now if I restart the namenode, I will lose data from the past 3 hours.
>>
>> What is causing this?  How can I avoid it in the future?  Is there an
>> easy way to monitor (other than a script grep'ing the logs) the checkpoints
>> to see when this happens?
>>
>>
>> On Sat, Feb 16, 2013 at 2:39 PM, Robert Dyer <ps...@gmail.com> wrote:
>>
>>> Forgot to mention: Hadoop 1.0.4
>>>
>>>
>>> On Sat, Feb 16, 2013 at 2:38 PM, Robert Dyer <ps...@gmail.com> wrote:
>>>
>>>> I am at a bit of wits end here.  Every single time I restart the
>>>> namenode, I get this crash:
>>>>
>>>> 2013-02-16 14:32:42,616 INFO
>>>> org.apache.hadoop.hdfs.server.common.Storage: Image file of size 168058
>>>> loaded in 0 seconds.
>>>> 2013-02-16 14:32:42,618 ERROR
>>>> org.apache.hadoop.hdfs.server.namenode.NameNode:
>>>> java.lang.NullPointerException
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1099)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1111)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:1014)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:208)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:631)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1021)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:839)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:377)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:100)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:388)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:362)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:276)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:496)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1279)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1288)
>>>>
>>>> I am following best practices here, as far as I know.  I have the
>>>> namenode writing into 3 directories (2 local, 1 NFS).  All 3 of these dirs
>>>> have the exact same files in them.
>>>>
>>>> I also run a secondary checkpoint node.  This one appears to have
>>>> started failing a week ago.  So checkpoints were *not* being done since
>>>> then.  Thus I can get the NN up and running, but with a week old data!
>>>>
>>>>  What is going on here?  Why does my NN data *always* wind up causing
>>>> this exception over time?  Is there some easy way to get notified when the
>>>> checkpointing starts to fail?
>>>>
>>>
>>>
>>>
>>> --
>>>
>>> Robert Dyer
>>> rdyer@iastate.edu
>>>
>>
>>
>>
>> --
>>
>> Robert Dyer
>> rdyer@iastate.edu
>>
>
>


-- 

Robert Dyer
rdyer@iastate.edu

Re: Namenode failures

Posted by Robert Dyer <rd...@iastate.edu>.
On Sun, Feb 17, 2013 at 4:41 PM, Mohammad Tariq <do...@gmail.com> wrote:

> You can make use of offine image viewer to diagnose
> the fsimage file.
>

Is this not included in the 1.0.x branch?  All of the documentation I find
for it says to run 'bin/hdfs oev' but I do not have a 'bin/hdfs'.


> Warm Regards,
> Tariq
> https://mtariq.jux.com/
> cloudfront.blogspot.com
>
>
> On Mon, Feb 18, 2013 at 3:31 AM, Robert Dyer <ps...@gmail.com> wrote:
>
>> It just happened again.  This was after a fresh format of HDFS/HBase and
>> I am attempting to re-import the (backed up) data.
>>
>>   http://pastebin.com/3fsWCNQY
>>
>> So now if I restart the namenode, I will lose data from the past 3 hours.
>>
>> What is causing this?  How can I avoid it in the future?  Is there an
>> easy way to monitor (other than a script grep'ing the logs) the checkpoints
>> to see when this happens?
>>
>>
>> On Sat, Feb 16, 2013 at 2:39 PM, Robert Dyer <ps...@gmail.com> wrote:
>>
>>> Forgot to mention: Hadoop 1.0.4
>>>
>>>
>>> On Sat, Feb 16, 2013 at 2:38 PM, Robert Dyer <ps...@gmail.com> wrote:
>>>
>>>> I am at a bit of wits end here.  Every single time I restart the
>>>> namenode, I get this crash:
>>>>
>>>> 2013-02-16 14:32:42,616 INFO
>>>> org.apache.hadoop.hdfs.server.common.Storage: Image file of size 168058
>>>> loaded in 0 seconds.
>>>> 2013-02-16 14:32:42,618 ERROR
>>>> org.apache.hadoop.hdfs.server.namenode.NameNode:
>>>> java.lang.NullPointerException
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1099)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1111)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:1014)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:208)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:631)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1021)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:839)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:377)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:100)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:388)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:362)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:276)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:496)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1279)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1288)
>>>>
>>>> I am following best practices here, as far as I know.  I have the
>>>> namenode writing into 3 directories (2 local, 1 NFS).  All 3 of these dirs
>>>> have the exact same files in them.
>>>>
>>>> I also run a secondary checkpoint node.  This one appears to have
>>>> started failing a week ago.  So checkpoints were *not* being done since
>>>> then.  Thus I can get the NN up and running, but with a week old data!
>>>>
>>>>  What is going on here?  Why does my NN data *always* wind up causing
>>>> this exception over time?  Is there some easy way to get notified when the
>>>> checkpointing starts to fail?
>>>>
>>>
>>>
>>>
>>> --
>>>
>>> Robert Dyer
>>> rdyer@iastate.edu
>>>
>>
>>
>>
>> --
>>
>> Robert Dyer
>> rdyer@iastate.edu
>>
>
>


-- 

Robert Dyer
rdyer@iastate.edu

Re: Namenode failures

Posted by Robert Dyer <rd...@iastate.edu>.
On Sun, Feb 17, 2013 at 4:41 PM, Mohammad Tariq <do...@gmail.com> wrote:

> You can make use of the offline image viewer to diagnose
> the fsimage file.
>

Is this not included in the 1.0.x branch?  All of the documentation I find
for it says to run 'bin/hdfs oev' but I do not have a 'bin/hdfs'.
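
For reference, the documentation that mentions 'bin/hdfs oev' appears to be for the newer
(0.23/2.x) layout, which ships a separate bin/hdfs script. There the two viewers are invoked
roughly as below (the paths are placeholders, and whether the tools are present in a 1.0.x
tarball at all is exactly the open question here):

    # dump a namenode image to XML with the Offline Image Viewer
    bin/hdfs oiv -i /path/to/dfs.name.dir/current/fsimage -o /tmp/fsimage.xml -p XML

    # dump an edits file to XML with the Offline Edits Viewer
    bin/hdfs oev -i /path/to/dfs.name.dir/current/edits -o /tmp/edits.xml

On 1.0.4 everything goes through bin/hadoop instead, so if the viewer classes are present at
all they would have to be reached through that script.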


> Warm Regards,
> Tariq
> https://mtariq.jux.com/
> cloudfront.blogspot.com
>
>
> On Mon, Feb 18, 2013 at 3:31 AM, Robert Dyer <ps...@gmail.com> wrote:
>
>> It just happened again.  This was after a fresh format of HDFS/HBase and
>> I am attempting to re-import the (backed up) data.
>>
>>   http://pastebin.com/3fsWCNQY
>>
>> So now if I restart the namenode, I will lose data from the past 3 hours.
>>
>> What is causing this?  How can I avoid it in the future?  Is there an
>> easy way to monitor (other than a script grep'ing the logs) the checkpoints
>> to see when this happens?
>>
>>
>> On Sat, Feb 16, 2013 at 2:39 PM, Robert Dyer <ps...@gmail.com> wrote:
>>
>>> Forgot to mention: Hadoop 1.0.4
>>>
>>>
>>> On Sat, Feb 16, 2013 at 2:38 PM, Robert Dyer <ps...@gmail.com> wrote:
>>>
>>>> I am at a bit of wits end here.  Every single time I restart the
>>>> namenode, I get this crash:
>>>>
>>>> 2013-02-16 14:32:42,616 INFO
>>>> org.apache.hadoop.hdfs.server.common.Storage: Image file of size 168058
>>>> loaded in 0 seconds.
>>>> 2013-02-16 14:32:42,618 ERROR
>>>> org.apache.hadoop.hdfs.server.namenode.NameNode:
>>>> java.lang.NullPointerException
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1099)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1111)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:1014)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:208)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:631)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1021)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:839)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:377)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:100)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:388)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:362)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:276)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:496)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1279)
>>>>     at
>>>> org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1288)
>>>>
>>>> I am following best practices here, as far as I know.  I have the
>>>> namenode writing into 3 directories (2 local, 1 NFS).  All 3 of these dirs
>>>> have the exact same files in them.
>>>>
>>>> I also run a secondary checkpoint node.  This one appears to have
>>>> started failing a week ago.  So checkpoints were *not* being done since
>>>> then.  Thus I can get the NN up and running, but with a week old data!
>>>>
>>>>  What is going on here?  Why does my NN data *always* wind up causing
>>>> this exception over time?  Is there some easy way to get notified when the
>>>> checkpointing starts to fail?
>>>>
>>>
>>>
>>>
>>> --
>>>
>>> Robert Dyer
>>> rdyer@iastate.edu
>>>
>>
>>
>>
>> --
>>
>> Robert Dyer
>> rdyer@iastate.edu
>>
>
>


-- 

Robert Dyer
rdyer@iastate.edu

Re: Namenode failures

Posted by Mohammad Tariq <do...@gmail.com>.
Hello Robert,

         It seems that your edit logs and fsimage have got
corrupted somehow. It looks somewhat similar to this one
https://issues.apache.org/jira/browse/HDFS-686

Have you made any changes to the 'dfs.name.dir' directory
lately? Do you have enough space where metadata is getting
stored? You can make use of the offline image viewer to diagnose
the fsimage file.
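
If the image under 'dfs.name.dir' really is damaged but the secondary still holds an older
sane copy, the usual recovery path is to import that checkpoint. A sketch, assuming the
default directory layout; it only succeeds when 'dfs.name.dir' no longer contains a legal
image:

    # stop the namenode, move the damaged dfs.name.dir contents aside, then:
    bin/hadoop namenode -importCheckpoint

This reads the image from 'fs.checkpoint.dir' and saves it into 'dfs.name.dir', which is also
why a silently failing secondary hurts so much: the imported copy can only be as fresh as the
last successful checkpoint.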

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com


On Mon, Feb 18, 2013 at 3:31 AM, Robert Dyer <ps...@gmail.com> wrote:

> It just happened again.  This was after a fresh format of HDFS/HBase and I
> am attempting to re-import the (backed up) data.
>
>   http://pastebin.com/3fsWCNQY
>
> So now if I restart the namenode, I will lose data from the past 3 hours.
>
> What is causing this?  How can I avoid it in the future?  Is there an easy
> way to monitor (other than a script grep'ing the logs) the checkpoints to
> see when this happens?
>
>
> On Sat, Feb 16, 2013 at 2:39 PM, Robert Dyer <ps...@gmail.com> wrote:
>
>> Forgot to mention: Hadoop 1.0.4
>>
>>
>> On Sat, Feb 16, 2013 at 2:38 PM, Robert Dyer <ps...@gmail.com> wrote:
>>
>>> I am at a bit of wits end here.  Every single time I restart the
>>> namenode, I get this crash:
>>>
>>> 2013-02-16 14:32:42,616 INFO
>>> org.apache.hadoop.hdfs.server.common.Storage: Image file of size 168058
>>> loaded in 0 seconds.
>>> 2013-02-16 14:32:42,618 ERROR
>>> org.apache.hadoop.hdfs.server.namenode.NameNode:
>>> java.lang.NullPointerException
>>>     at
>>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1099)
>>>     at
>>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1111)
>>>     at
>>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:1014)
>>>     at
>>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:208)
>>>     at
>>> org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:631)
>>>     at
>>> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1021)
>>>     at
>>> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:839)
>>>     at
>>> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:377)
>>>     at
>>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:100)
>>>     at
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:388)
>>>     at
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:362)
>>>     at
>>> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:276)
>>>     at
>>> org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:496)
>>>     at
>>> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1279)
>>>     at
>>> org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1288)
>>>
>>> I am following best practices here, as far as I know.  I have the
>>> namenode writing into 3 directories (2 local, 1 NFS).  All 3 of these dirs
>>> have the exact same files in them.
>>>
>>> I also run a secondary checkpoint node.  This one appears to have
>>> started failing a week ago.  So checkpoints were *not* being done since
>>> then.  Thus I can get the NN up and running, but with a week old data!
>>>
>>>  What is going on here?  Why does my NN data *always* wind up causing
>>> this exception over time?  Is there some easy way to get notified when the
>>> checkpointing starts to fail?
>>>
>>
>>
>>
>> --
>>
>> Robert Dyer
>> rdyer@iastate.edu
>>
>
>
>
> --
>
> Robert Dyer
> rdyer@iastate.edu
>

Re: Namenode failures

Posted by Robert Dyer <ps...@gmail.com>.
It just happened again.  This was after a fresh format of HDFS/HBase and I
am attempting to re-import the (backed up) data.

  http://pastebin.com/3fsWCNQY

So now if I restart the namenode, I will lose data from the past 3 hours.

What is causing this?  How can I avoid it in the future?  Is there an easy
way to monitor (other than a script grep'ing the logs) the checkpoints to
see when this happens?
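
One low-tech option is a cron job that checks how old the newest image written by the
secondary is and complains if it exceeds a couple of checkpoint periods. A rough sketch only:
the directory, the 2-hour threshold, and the mail address are all placeholders to adapt to
the local fs.checkpoint.dir and alerting setup:

    #!/bin/sh
    # alert if the secondary has not written a checkpoint within the last 2 hours
    CKPT_DIR=/path/to/fs.checkpoint.dir/current
    if [ -z "$(find "$CKPT_DIR" -name fsimage -mmin -120 2>/dev/null)" ]; then
      echo "HDFS checkpoint looks stale in $CKPT_DIR" \
        | mail -s "secondary namenode checkpoint stale" admin@example.com
    fi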


On Sat, Feb 16, 2013 at 2:39 PM, Robert Dyer <ps...@gmail.com> wrote:

> Forgot to mention: Hadoop 1.0.4
>
>
> On Sat, Feb 16, 2013 at 2:38 PM, Robert Dyer <ps...@gmail.com> wrote:
>
>> I am at a bit of wits end here.  Every single time I restart the
>> namenode, I get this crash:
>>
>> 2013-02-16 14:32:42,616 INFO
>> org.apache.hadoop.hdfs.server.common.Storage: Image file of size 168058
>> loaded in 0 seconds.
>> 2013-02-16 14:32:42,618 ERROR
>> org.apache.hadoop.hdfs.server.namenode.NameNode:
>> java.lang.NullPointerException
>>     at
>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1099)
>>     at
>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1111)
>>     at
>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:1014)
>>     at
>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:208)
>>     at
>> org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:631)
>>     at
>> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1021)
>>     at
>> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:839)
>>     at
>> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:377)
>>     at
>> org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:100)
>>     at
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:388)
>>     at
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:362)
>>     at
>> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:276)
>>     at
>> org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:496)
>>     at
>> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1279)
>>     at
>> org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1288)
>>
>> I am following best practices here, as far as I know.  I have the
>> namenode writing into 3 directories (2 local, 1 NFS).  All 3 of these dirs
>> have the exact same files in them.
>>
>> I also run a secondary checkpoint node.  This one appears to have started
>> failing a week ago.  So checkpoints were *not* being done since then.  Thus
>> I can get the NN up and running, but with a week old data!
>>
>>  What is going on here?  Why does my NN data *always* wind up causing
>> this exception over time?  Is there some easy way to get notified when the
>> checkpointing starts to fail?
>>
>
>
>
> --
>
> Robert Dyer
> rdyer@iastate.edu
>



-- 

Robert Dyer
rdyer@iastate.edu

Re: Namenode failures

Posted by Robert Dyer <ps...@gmail.com>.
Forgot to mention: Hadoop 1.0.4


On Sat, Feb 16, 2013 at 2:38 PM, Robert Dyer <ps...@gmail.com> wrote:

> I am at a bit of wits end here.  Every single time I restart the namenode,
> I get this crash:
>
> 2013-02-16 14:32:42,616 INFO org.apache.hadoop.hdfs.server.common.Storage:
> Image file of size 168058 loaded in 0 seconds.
> 2013-02-16 14:32:42,618 ERROR
> org.apache.hadoop.hdfs.server.namenode.NameNode:
> java.lang.NullPointerException
>     at
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1099)
>     at
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1111)
>     at
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:1014)
>     at
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:208)
>     at
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:631)
>     at
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1021)
>     at
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:839)
>     at
> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:377)
>     at
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:100)
>     at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:388)
>     at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:362)
>     at
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:276)
>     at
> org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:496)
>     at
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1279)
>     at
> org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1288)
>
> I am following best practices here, as far as I know.  I have the namenode
> writing into 3 directories (2 local, 1 NFS).  All 3 of these dirs have the
> exact same files in them.
>
> I also run a secondary checkpoint node.  This one appears to have started
> failing a week ago.  So checkpoints were *not* being done since then.  Thus
> I can get the NN up and running, but with a week old data!
>
> What is going on here?  Why does my NN data *always* wind up causing this
> exception over time?  Is there some easy way to get notified when the
> checkpointing starts to fail?
>
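
Given the three 'dfs.name.dir' copies described above (2 local, 1 NFS), one quick sanity
check before a restart is to confirm the copies really are byte-identical. A sketch, with
placeholder paths standing in for the actual configured directories:

    # compare image, edits and fstime across the three dfs.name.dir copies
    for d in /data1/dfs/name /data2/dfs/name /mnt/nfs/dfs/name; do
      md5sum "$d"/current/fsimage "$d"/current/edits "$d"/current/fstime
    done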



-- 

Robert Dyer
rdyer@iastate.edu
