Posted to common-user@hadoop.apache.org by praveenesh kumar <pr...@gmail.com> on 2012/01/17 07:38:07 UTC

Best practices to recover from Corrupt Namenode

Hi guys,

I just faced a weird situation in which one of the hard disks on a DN went
down.
When I restarted the namenode afterwards, some blocks were reported missing
and it said my namenode was CORRUPT and in safe mode, which doesn't allow
you to add or delete any files on HDFS.

I know we can force our way out of the safe mode part.
The problem is how to deal with the Corrupt Namenode problem itself in this
case -- best practices.
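
For reference, the safe mode part is just this on a 1.x-era cluster (a
sketch only -- it does nothing about the corruption itself):

    # Check whether the namenode is still in safe mode
    hadoop dfsadmin -safemode get
    # Force it out -- only after the missing blocks are dealt with
    hadoop dfsadmin -safemode leave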

In my case I was lucky that all the missing blocks belonged to outputs of
M/R jobs I had run previously.
So I just deleted all the files with missing blocks from HDFS, which took
the filesystem from CORRUPT --> HEALTHY.
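
Roughly, that cleanup looks like this on 1.x (the grep pattern and output
path below are just an illustration):

    # Find the files that own missing blocks
    hadoop fsck / -files -blocks | grep -i -B1 "MISSING"
    # Remove the expendable job output (illustrative path)
    hadoop fs -rmr /user/hadoop/old-job-output
    # Re-check until fsck reports HEALTHY
    hadoop fsck /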

But had the missing blocks belonged to large input data files, deleting
those files would not have been a good solution.

So I wanted to know: what are the best practices for dealing with this kind
of problem, to go from CORRUPT NAMENODE --> HEALTHY NAMENODE?

Thanks,
Praveenesh

Re: Best practices to recover from Corrupt Namenode

Posted by praveenesh kumar <pr...@gmail.com>.
Thanks a lot, guys, for such an illustrative explanation. I will go through
the links you sent and get back with any doubts I have.

Thanks,
Praveenesh


Re: Best practices to recover from Corrupt Namenode

Posted by Sameer Farooqui <sa...@blueplastic.com>.
Hey Praveenesh,

Here's a good article on HDFS by some senior Yahoo!, Facebook, HortonWorks
and eBay engineers that you might find helpful:
http://www.aosabook.org/en/hdfs.html

You may already know that "each block replica on a DataNode is represented
by two files in the DataNode's local, native filesystem (usually ext3). The
first file contains the data itself and the second file records the block's
metadata including checksums for the data and the generation stamp."
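
On disk that looks something like this on a 1.x-era DataNode (block ID and
generation stamp are invented for illustration):

    $ ls ${dfs.data.dir}/current/
    blk_8213044321560084621             # the block data itself
    blk_8213044321560084621_1042.meta   # checksums + generation stamp (1042)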

In section 8.3.5, the article above describes a Block Scanner that runs on
each DataNode and "periodically scans its block replicas and verifies that
stored checksums match the block data."
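
The scan period is configurable; on 1.x-era releases the knob is
dfs.datanode.scan.period.hours in hdfs-site.xml (a sketch -- 504 hours,
i.e. three weeks, is the usual default):

    <!-- hdfs-site.xml (illustrative) -->
    <property>
      <name>dfs.datanode.scan.period.hours</name>
      <value>504</value>  <!-- verify every replica at least once per 3 weeks -->
    </property>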

More copy+paste from the article: "Whenever a read client or a block
scanner detects a corrupt block, it notifies the NameNode. The NameNode
marks the replica as corrupt, but does not schedule deletion of the replica
immediately. Instead, it starts to replicate a good copy of the block. Only
when the good replica count reaches the replication factor of the block the
corrupt replica is scheduled to be removed. This policy aims to preserve
data as long as possible. So even if all replicas of a block are corrupt,
the policy allows the user to retrieve its data from the corrupt replicas."

Like Harsh J was saying in an email before, this doesn't sound like
NameNode corruption yet. The article also describes how the periodic block
reports (aka metadata) from the DataNode are sent to the NameNode. "A block
report contains the block ID, the generation stamp and the length for each
block replica the server hosts." In the NameNode's RAM, "the inodes and the
list of blocks that define the metadata of the name system are called the *
image*. The persistent record of the image stored in the NameNode's local
native filesystem is called a checkpoint. The NameNode records changes to
HDFS in a write-ahead log called the journal in its local native
filesystem."

You can check the NameNode's checkpoint (fsimage) and journal (edits) files
for errors if you suspect real NameNode corruption.
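
On a 1.x-era NameNode those live under ${dfs.name.dir}/current (layout
from memory, so verify against your release):

    $ ls ${dfs.name.dir}/current/
    VERSION  edits  fsimage  fstime
    # fsimage = the latest checkpoint of the namespace
    # edits   = the write-ahead journal of changes since that checkpoint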

If you're wondering how often the Block Scanner runs and how long it takes
to scan over the entire dataset in HDFS: "In each scan period, the block
scanner adjusts the read bandwidth in order to complete the verification in
a configurable period. If a client reads a complete block and checksum
verification succeeds, it informs the DataNode. The DataNode treats it as a
verification of the replica."

"The verification time of each block is stored in a human-readable log
file. At any time there are up to two files in the top-level DataNode
directory, the current and previous logs. New verification times are
appended to the current file. Correspondingly, each DataNode has an
in-memory scanning list ordered by the replica's verification time."

Can you maybe check the verification time for the blocks that went corrupt
in the log file? If you're a human you should be able to read it. Try
checking both the current and previous logs.
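
Assuming the naming that 1.x-era DataNodes use, the files to look for in
each data directory are:

    $ ls ${dfs.data.dir}/dncp_block_verification.log.*
    dncp_block_verification.log.curr   # verification times being appended now
    dncp_block_verification.log.prev   # the previous, rolled-over log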

To dive deeper, here is a document by Tom White/Cloudera, but it's from
2008, so a lot could be outdated:
http://www.cloudera.com/wp-content/uploads/2010/03/HDFS_Reliability.pdf

One good bit of info from Tom's doc is that you can view the DataNode's
Block Scanner reports at: http://datanode:50075/blockScannerReport
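
For example (the listblocks parameter adds per-block detail, if your
release supports it):

    # Summary report from a DataNode's embedded web server
    curl 'http://datanode:50075/blockScannerReport'
    # Per-block verification status
    curl 'http://datanode:50075/blockScannerReport?listblocks'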

And if you could post the filesystem check output (from the hadoop fsck
command), I'm sure someone could help you further. It would also help to
know which version of Hadoop and HDFS you're running.
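
Something like this would give people plenty to go on (flags as on 1.x):

    hadoop version
    hadoop fsck / -files -blocks -locations -racks > fsck-report.txt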

Also, don't you think it's weird that all the missing blocks belonged to
the outputs of your M/R jobs? The NameNode should have been distributing
them evenly across the hard drives of your cluster. If the output of the
jobs is set to replication factor = 2, then each block should have been
replicated over the network to at least one other DataNode, so it should
take at least 2 hard drive failures in the cluster for you to lose a block
completely. HDFS should be very robust here. With Yahoo's r=3, "for a large
cluster, the probability of losing a block during one year is less than
0.005"

- Sameer



Re: Best practices to recover from Corrupt Namenode

Posted by praveenesh kumar <pr...@gmail.com>.
Hi everyone,
Any ideas on how to tackle this kind of situation?

Thanks,
Praveenesh


Re: Best practices to recover from Corrupt Namenode

Posted by praveenesh kumar <pr...@gmail.com>.
I have a replication factor of 2 because I cannot afford 3 replicas on my
cluster.
The fsck output was reporting missing block replicas for some files, which
is what was making the filesystem show up as corrupt.
I don't have the output with me, but the issue was that block replicas were
missing. How can we tackle that?

Is there an internal mechanism for re-creating replicas if they are found
missing -- some kind of refresh command or something?


Thanks,
Praveenesh


Re: Best practices to recover from Corrupt Namenode

Posted by Harsh J <ha...@cloudera.com>.
You ran into a corrupt-files issue, not namenode corruption (which generally refers to the fsimage or edits files getting corrupted).

Did your files not have adequate replication to withstand the loss of one DN's disk? What exactly did fsck output? Did all block replicas go missing for your files?


--
Harsh J
Customer Ops. Engineer, Cloudera