Posted to common-user@hadoop.apache.org by C G <pa...@yahoo.com> on 2008/05/12 05:23:39 UTC

HDFS corrupt...how to proceed?

Hi All:
   
  We had a primary node failure over the weekend.  When we brought the node back up and I ran Hadoop fsck, it reported the file system as corrupt.  I'm unsure how best to proceed.  Any advice is greatly appreciated.  If I've missed a Wiki page or documentation somewhere, please feel free to tell me to RTFM and let me know where to look.
   
  Specific question: how do we clear under- and over-replicated files?  Is the correct procedure to copy the file locally, delete it from HDFS, and then copy it back to HDFS?
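   
  For concreteness, the procedure I have in mind would be something like the following, with /data/example standing in for a real file:
   
    hadoop dfs -copyToLocal /data/example /tmp/example    # save a local copy
    hadoop dfs -rm /data/example                          # remove the damaged copy
    hadoop dfs -copyFromLocal /tmp/example /data/example  # write it back fresh
   
  ...but that feels heavy-handed given ~15,000 under-replicated blocks, so I suspect there's a better way.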
   
  The fsck output is long, but the final summary is:
   
   Total size:    4899680097382 B
 Total blocks:  994252 (avg. block size 4928006 B)
 Total dirs:    47404
 Total files:   952070
  ********************************
  CORRUPT FILES:        2
  MISSING BLOCKS:       24
  MISSING SIZE:         1501009630 B
  ********************************
 Over-replicated blocks:        1 (1.0057812E-4 %)
 Under-replicated blocks:       14958 (1.5044476 %)
 Target replication factor:     3
 Real replication factor:       2.9849212
  
The filesystem under path '/' is CORRUPT

       

Re: HDFS corrupt...how to proceed?

Posted by C G <pa...@yahoo.com>.
Yes, several of our logging apps had accumulated backlogs of data and were "eager" to write to HDFS....

Dhruba Borthakur <dh...@gmail.com> wrote:

> Is it possible that new files were being created by running
> applications between the first and second fsck runs?


Re: HDFS corrupt...how to proceed?

Posted by Dhruba Borthakur <dh...@gmail.com>.
Is it possible that new files were being created by running
applications between the first and second fsck runs?

thanks,
dhruba



Re: HDFS corrupt...how to proceed?

Posted by C G <pa...@yahoo.com>.
The system hosting the namenode experienced an OS panic and shut down, we subsequently rebooted it.  Currently we don't believe there is/was a bad disk or other hardware problem.
   
  Something interesting: I've run fsck twice; the first time it gave the result I posted.  The second time it still declared the FS to be corrupt, but said:
  [many rows of periods deleted]
  ..........Status: CORRUPT
 Total size:    4900076384766 B
 Total blocks:  994492 (avg. block size 4927215 B)
 Total dirs:    47404
 Total files:   952310
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       0 (0.0 %)
 Target replication factor:     3
 Real replication factor:       3.0
  
The filesystem under path '/' is CORRUPT

  So it seems like it's fixing some problems on its own?
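   
  For what it's worth, I'm keeping an eye on the recovery with a crude loop that just re-runs fsck and pulls out the summary lines (the interval here is arbitrary):
   
    while true; do
      hadoop fsck / | grep -E 'Status|Under-replicated|CORRUPT'
      sleep 300    # re-check every five minutes
    done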
   
  Thanks,
  C G
  
Dhruba Borthakur <dh...@gmail.com> wrote:

> Did one datanode fail or did the namenode fail? By "fail" do you mean
> that the system was rebooted or was there a bad disk that caused the
> problem?


Re: HDFS corrupt...how to proceed?

Posted by Dhruba Borthakur <dh...@gmail.com>.
Did one datanode fail or did the namenode fail? By "fail" do you mean
that the system was rebooted or was there a bad disk that caused the
problem?

thanks,
dhruba


Re: HDFS corrupt...how to proceed?

Posted by C G <pa...@yahoo.com>.
Thanks to everyone who responded.  Things are back on the air now: all the replication issues seem to have gone away.  I am wading through the detailed fsck output, looking for specific problems on a file-by-file basis.
   
  Just in case anybody is interested: we mirror our master nodes using DRBD, and it performed very well in this first "real world" test.  If there is interest I can write up how we protect our master nodes in more detail and share it with the community.
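   
  (Rough sketch of the idea, with the resource and mount names below as placeholders: the namenode's dfs.name.dir sits on a DRBD-mirrored volume, so bringing up the standby is roughly:
   
    drbdadm primary nn-meta          # promote the mirror on the standby
    mount /dev/drbd0 /hadoop/name    # mount the mirrored metadata directory
    bin/start-dfs.sh                 # start the namenode there
   
  The full write-up would cover the fencing and heartbeat details.)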
   
  Thanks,
  C G


Re: HDFS corrupt...how to proceed?

Posted by Ted Dunning <td...@veoh.com>.

You don't need to correct over-replicated files.

The under-replicated files should cure themselves, but there is a problem in
old versions where that doesn't happen quite right.

You can use 'hadoop fsck /' to get a list of the files that are broken, and
there are options to copy what remains of them to lost+found or to delete
them.
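
In concrete terms (with /some/path as a placeholder, and assuming your
release has these options):

    hadoop fsck / -files -blocks -locations   # per-file detail on what's broken
    hadoop fsck / -move       # salvage remains of corrupt files to /lost+found
    hadoop fsck / -delete     # or drop the corrupt files entirely

And if the under-replicated count isn't draining on its own, you can
re-assert the target replication on a path to nudge the namenode:

    hadoop dfs -setrep 3 /some/path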

Other than that, things should correct themselves fairly quickly.

