Posted to common-user@hadoop.apache.org by ch huang <ju...@gmail.com> on 2013/12/10 01:32:38 UTC

how to handle the corrupt block in HDFS?

hi, maillist:
            my Nagios has been alerting me all day that there is a corrupt block in HDFS,
but I do not know how to remove it. Will HDFS handle this automatically?
And will removing the corrupt block cause any data loss? Thanks

RE: how to handle the corrupt block in HDFS?

Posted by Vinayakumar B <vi...@huawei.com>.
Hi ch huang,

It may seem strange, but the fact is:
CorruptBlocks through JMX or http://NNIP:50070/jmx means "Number of blocks with corrupt replicas". Not all replicas are necessarily corrupt. You can check the metric's description through jconsole.

Whereas "Corrupt blocks" through fsck means blocks with all replicas corrupt (non-recoverable) or missing.

In your case it may be that one replica is corrupt, not all replicas of the same block. That corrupt replica will be deleted automatically once another datanode is available in your cluster and the block has been re-replicated to it.
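For example, the two views can be compared side by side; a minimal sketch, using the same NameNode port and fsck option that appear elsewhere in this thread:

# JMX view: blocks that currently have at least one corrupt replica
$ curl -s http://NNIP:50070/jmx | grep CorruptBlocks

# fsck view: blocks whose replicas are all corrupt or missing
$ sudo -u hdfs hdfs fsck / | grep "Corrupt blocks"

# list the files owning non-recoverable corrupt blocks, if any
$ sudo -u hdfs hdfs fsck / -list-corruptfileblocks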


Regarding the replication of 10: as Peter Marron said, some of the important files of a MapReduce job are written with a replication factor of 10, to make them accessible faster and to launch map tasks faster.
Anyway, if the job succeeds these files will be deleted automatically. I think it is only in some cases, when a job is killed in between, that these files remain in HDFS showing under-replicated blocks.
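If leftover staging files are what keeps showing up as under-replicated, one option (a sketch, using the staging path from the listing quoted further down in this message) is to lower their replication by hand:

# see what is left in the job's staging directory
$ sudo -u hdfs hadoop fs -ls /data/hisstage/helen/.staging/job_1385542328307_0915

# drop a leftover file back to the cluster default of 3 and wait for it to take effect
$ sudo -u hdfs hadoop fs -setrep -w 3 /data/hisstage/helen/.staging/job_1385542328307_0915/job.jar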

Thanks and Regards,
Vinayakumar B

From: ch huang [mailto:justlooks@gmail.com]
Sent: 11 December 2013 06:48
To: user@hadoop.apache.org
Subject: Re: how to handle the corrupt block in HDFS?

"By default this higher replication level is 10. "
is this value can be control via some option or variable? i only hive a 5-worknode cluster,and i think 5 replicas should be better,because every node can get a local replica.

another question is ,why hdfs fsck check the cluster is healthy and no corrupt block,but i see one corrupt block though checking NN metrics?
curl http://NNIP:50070/jmx<http://nnip:50070/jmx> ,thanks

On Tue, Dec 10, 2013 at 4:48 PM, Peter Marron <Pe...@trilliumsoftware.com>> wrote:
Hi,

I am sure that there are others who will answer this better, but anyway.
The default replication level for files in HDFS is 3 and so most files that you
see will have a replication level of 3. However when you run a Map/Reduce
job the system knows in advance that every node will need a copy of
certain files. Specifically the job.xml and the various jars containing
classes that will be needed to run the mappers and reducers. So the
system arranges that some of these files have a higher replication level. This increases
the chances that a copy will be found locally.
By default this higher replication level is 10.
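As far as I know this default comes from a client-side job-submission setting rather than anything on the NameNode: mapreduce.client.submit.file.replication in MRv2, or mapred.submit.replication in older MRv1 setups, though please verify the property name against your Hadoop version. A sketch of overriding it for a single job (the jar, class and paths are placeholders, and the -D form assumes the driver uses ToolRunner):

# submit a job with the staging-file replication lowered to 5
$ hadoop jar my-job.jar com.example.MyJobDriver \
    -D mapreduce.client.submit.file.replication=5 \
    /input /output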

This can seem a little odd on a cluster where you only have, say, 3 nodes.
Because it means that you will almost always have some blocks that are marked
under-replicated. I think that there was some discussion a while back to change
this to make the replication level something like min(10, #number of nodes)
However, as I recall, the general consensus was that this was extra
complexity that wasn't really worth it. If it ain't broke...

Hope that this helps.

Peter Marron
Senior Developer, Research & Development

Office: +44 (0) 118-940-7609  peter.marron@trilliumsoftware.com
Theale Court First Floor, 11-13 High Street, Theale, RG7 5AH, UK

www.trilliumsoftware.com

Be Certain About Your Data. Be Trillium Certain.

From: ch huang [mailto:justlooks@gmail.com]
Sent: 10 December 2013 01:21
To: user@hadoop.apache.org
Subject: Re: how to handle the corrupt block in HDFS?

Even more strange: in my HDFS cluster every block should have three replicas, but I find some files have ten replicas. Why?

# sudo -u hdfs hadoop fs -ls /data/hisstage/helen/.staging/job_1385542328307_0915
Found 5 items
-rw-r--r--   3 helen hadoop          7 2013-11-29 14:01 /data/hisstage/helen/.staging/job_1385542328307_0915/appTokens
-rw-r--r--  10 helen hadoop    2977839 2013-11-29 14:01 /data/hisstage/helen/.staging/job_1385542328307_0915/job.jar
-rw-r--r--  10 helen hadoop       3696 2013-11-29 14:01 /data/hisstage/helen/.staging/job_1385542328307_0915/job.split
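For any one of these files the per-block replica count and datanode locations can be inspected directly, using the same fsck options that Adam Kawa suggests later in this thread:

# show blocks, replication and locations for one of the staging files
$ sudo -u hdfs hdfs fsck /data/hisstage/helen/.staging/job_1385542328307_0915/job.jar -files -blocks -locations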
On Tue, Dec 10, 2013 at 9:15 AM, ch huang <ju...@gmail.com>> wrote:
The strange thing is that when I use the following command I find 1 corrupt block:

#  curl -s http://ch11:50070/jmx |grep orrupt
    "CorruptBlocks" : 1,
but when I run hdfs fsck / I get none; everything seems fine:

# sudo -u hdfs hdfs fsck /
........

....................................Status: HEALTHY
 Total size:    1479728140875 B (Total open files size: 1677721600 B)
 Total dirs:    21298
 Total files:   100636 (Files currently being written: 25)
 Total blocks (validated):      119788 (avg. block size 12352891 B) (Total open file blocks (not validated): 37)
 Minimally replicated blocks:   119788 (100.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       166 (0.13857816 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    3
 Average block replication:     3.0027633
 Corrupt blocks:                0
 Missing replicas:              831 (0.23049656 %)
 Number of data-nodes:          5
 Number of racks:               1
FSCK ended at Tue Dec 10 09:14:48 CST 2013 in 3276 milliseconds

The filesystem under path '/' is HEALTHY
On Tue, Dec 10, 2013 at 8:32 AM, ch huang <ju...@gmail.com>> wrote:
hi, maillist:
            my Nagios has been alerting me all day that there is a corrupt block in HDFS, but I do not know how to remove it. Will HDFS handle this automatically? And will removing the corrupt block cause any data loss? Thanks




Re: how to handle the corrupt block in HDFS?

Posted by ch huang <ju...@gmail.com>.
"By default this higher replication level is 10. "
is this value can be control via some option or variable? i only hive a
5-worknode cluster,and i think 5 replicas should be better,because every
node can get a local replica.

another question is ,why hdfs fsck check the cluster is healthy and no
corrupt block,but i see one corrupt block though checking NN metrics?
curl http://NNIP:50070/jmx <http://nnip:50070/jmx> ,thanks


On Tue, Dec 10, 2013 at 4:48 PM, Peter Marron <
Peter.Marron@trilliumsoftware.com> wrote:

>  Hi,
>
>
>
> I am sure that there are others who will answer this better, but anyway.
>
> The default replication level for files in HDFS is 3 and so most files
> that you
>
> see will have a replication level of 3. However when you run a Map/Reduce
>
> job the system knows in advance that every node will need a copy of
>
> certain files. Specifically the job.xml and the various jars containing
>
> classes that will be needed to run the mappers and reducers. So the
>
> system arranges that some of these files have a higher replication level.
> This increases
>
> the chances that a copy will be found locally.
>
> By default this higher replication level is 10.
>
>
>
> This can seem a little odd on a cluster where you only have, say, 3 nodes.
>
> Because it means that you will almost always have some blocks that are
> marked
>
> under-replicated. I think that there was some discussion a while back to
> change
>
> this to make the replication level something like min(10, #number of nodes)
>
> However, as I recall, the general consensus was that this was extra
>
> complexity that wasn’t really worth it. If it ain’t broke…
>
>
>
> Hope that this helps.
>
>
>
> *Peter Marron*
>
> Senior Developer, Research & Development
>
>
>
> Office: +44 *(0) 118-940-7609*  peter.marron@trilliumsoftware.com
>
> Theale Court First Floor, 11-13 High Street, Theale, RG7 5AH, UK
>
>   <https://www.facebook.com/pages/Trillium-Software/109184815778307>
>
>  <https://twitter.com/TrilliumSW>
>
>  <http://www.linkedin.com/company/17710>
>
>
>
> *www.trilliumsoftware.com <http://www.trilliumsoftware.com/>*
>
> Be Certain About Your Data. Be Trillium Certain.
>
>
>
> *From:* ch huang [mailto:justlooks@gmail.com]
> *Sent:* 10 December 2013 01:21
> *To:* user@hadoop.apache.org
> *Subject:* Re: how to handle the corrupt block in HDFS?
>
>
>
> more strange , in my HDFS cluster ,every block has three replicas,but i
> find some one has ten replicas ,why?
>
>
>
> # sudo -u hdfs hadoop fs -ls
> /data/hisstage/helen/.staging/job_1385542328307_0915
> Found 5 items
> -rw-r--r--   3 helen hadoop          7 2013-11-29 14:01
> /data/hisstage/helen/.staging/job_1385542328307_0915/appTokens
> -rw-r--r--  10 helen hadoop    2977839 2013-11-29 14:01
> /data/hisstage/helen/.staging/job_1385542328307_0915/job.jar
> -rw-r--r--  10 helen hadoop       3696 2013-11-29 14:01
> /data/hisstage/helen/.staging/job_1385542328307_0915/job.split
>
>  On Tue, Dec 10, 2013 at 9:15 AM, ch huang <ju...@gmail.com> wrote:
>
> the strange thing is when i use the following command i find 1 corrupt
> block
>
>
>
> #  curl -s http://ch11:50070/jmx |grep orrupt
>     "CorruptBlocks" : 1,
>
> but when i run hdfs fsck / , i get none ,everything seems fine
>
>
>
> # sudo -u hdfs hdfs fsck /
>
> ........
>
>
>
> ....................................Status: HEALTHY
>  Total size:    1479728140875 B (Total open files size: 1677721600 B)
>  Total dirs:    21298
>  Total files:   100636 (Files currently being written: 25)
>  Total blocks (validated):      119788 (avg. block size 12352891 B) (Total
> open file blocks (not validated): 37)
>  Minimally replicated blocks:   119788 (100.0 %)
>  Over-replicated blocks:        0 (0.0 %)
>  Under-replicated blocks:       166 (0.13857816 %)
>  Mis-replicated blocks:         0 (0.0 %)
>  Default replication factor:    3
>  Average block replication:     3.0027633
>  Corrupt blocks:                0
>  Missing replicas:              831 (0.23049656 %)
>  Number of data-nodes:          5
>  Number of racks:               1
> FSCK ended at Tue Dec 10 09:14:48 CST 2013 in 3276 milliseconds
>
>
> The filesystem under path '/' is HEALTHY
>
>   On Tue, Dec 10, 2013 at 8:32 AM, ch huang <ju...@gmail.com> wrote:
>
> hi,maillist:
>
>             my nagios alert me that there is a corrupt block in HDFS all
> day,but i do not know how to remove it,and if the HDFS will handle this
> automaticlly? and if remove the corrupt block will cause any data
> lost?thanks
>
>
>
>
>

Re: how to handle the corrupt block in HDFS?

Posted by shashwat shriparv <dw...@gmail.com>.
How many nodes do you have?
If fsck is giving you a healthy status, there is no need to worry.
From the replication of 10 I might conclude that you have 10 listed
datanodes, and so 10 replicated jar files for the job to run.
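A quick way to confirm the actual datanode count (a small sketch; the exact wording of the report varies a little between Hadoop versions):

# show the datanodes the NameNode currently knows about
$ sudo -u hdfs hdfs dfsadmin -report | grep -i datanode

# or count the per-node entries in the report
$ sudo -u hdfs hdfs dfsadmin -report | grep -c '^Name:'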

*Thanks & Regards    *

∞
Shashwat Shriparv



On Tue, Dec 10, 2013 at 3:50 PM, Vinayakumar B <vi...@huawei.com>wrote:

>  Hi ch huang,
>
>
>
> It may seem strange, but the fact is,
>
> *CorruptBlocks* through JMX means *“Number of blocks with corrupt
> replicas”. May not be all replicas are corrupt.  *This you can check
> though jconsole for description.
>
>
>
> Where as *Corrupt blocks* through fsck means, *blocks with all replicas
> corrupt(non-recoverable)/ missing.*
>
>
>
> In your case, may be one of the replica is corrupt, not all replicas of
> same block. This corrupt replica will be deleted automatically if one more
> datanode available in your cluster and block replicated to that.
>
>
>
>
>
> Related to replication 10, As Peter Marron said, *some of the important
> files of the mapreduce job will set the replication of 10, to make it
> accessible faster and launch map tasks faster. *
>
> Anyway, if the job is success these files will be deleted auomatically. I
> think only in some cases if the jobs are killed in between these files will
> remain in hdfs showing underreplicated blocks.
>
>
>
> Thanks and Regards,
>
> Vinayakumar B
>
>
>
> *From:* Peter Marron [mailto:Peter.Marron@trilliumsoftware.com]
> *Sent:* 10 December 2013 14:19
> *To:* user@hadoop.apache.org
> *Subject:* RE: how to handle the corrupt block in HDFS?
>
>
>
> Hi,
>
>
>
> I am sure that there are others who will answer this better, but anyway.
>
> The default replication level for files in HDFS is 3 and so most files
> that you
>
> see will have a replication level of 3. However when you run a Map/Reduce
>
> job the system knows in advance that every node will need a copy of
>
> certain files. Specifically the job.xml and the various jars containing
>
> classes that will be needed to run the mappers and reducers. So the
>
> system arranges that some of these files have a higher replication level.
> This increases
>
> the chances that a copy will be found locally.
>
> By default this higher replication level is 10.
>
>
>
> This can seem a little odd on a cluster where you only have, say, 3 nodes.
>
> Because it means that you will almost always have some blocks that are
> marked
>
> under-replicated. I think that there was some discussion a while back to
> change
>
> this to make the replication level something like min(10, #number of nodes)
>
> However, as I recall, the general consensus was that this was extra
>
> complexity that wasn’t really worth it. If it ain’t broke…
>
>
>
> Hope that this helps.
>
>
>
> *Peter Marron*
>
> Senior Developer, Research & Development
>
>
>
> Office: +44 *(0) 118-940-7609*  peter.marron@trilliumsoftware.com
>
> Theale Court First Floor, 11-13 High Street, Theale, RG7 5AH, UK
>
>    <https://www.facebook.com/pages/Trillium-Software/109184815778307>
>
>  <https://twitter.com/TrilliumSW>
>
>  <http://www.linkedin.com/company/17710>
>
>
>
> *www.trilliumsoftware.com <http://www.trilliumsoftware.com/>*
>
> Be Certain About Your Data. Be Trillium Certain.
>
>
>
> *From:* ch huang [mailto:justlooks@gmail.com <ju...@gmail.com>]
> *Sent:* 10 December 2013 01:21
> *To:* user@hadoop.apache.org
> *Subject:* Re: how to handle the corrupt block in HDFS?
>
>
>
> more strange , in my HDFS cluster ,every block has three replicas,but i
> find some one has ten replicas ,why?
>
>
>
> # sudo -u hdfs hadoop fs -ls
> /data/hisstage/helen/.staging/job_1385542328307_0915
> Found 5 items
> -rw-r--r--   3 helen hadoop          7 2013-11-29 14:01
> /data/hisstage/helen/.staging/job_1385542328307_0915/appTokens
> -rw-r--r--  10 helen hadoop    2977839 2013-11-29 14:01
> /data/hisstage/helen/.staging/job_1385542328307_0915/job.jar
> -rw-r--r--  10 helen hadoop       3696 2013-11-29 14:01
> /data/hisstage/helen/.staging/job_1385542328307_0915/job.split
>
> On Tue, Dec 10, 2013 at 9:15 AM, ch huang <ju...@gmail.com> wrote:
>
> the strange thing is when i use the following command i find 1 corrupt
> block
>
>
>
> #  curl -s http://ch11:50070/jmx |grep orrupt
>     "CorruptBlocks" : 1,
>
> but when i run hdfs fsck / , i get none ,everything seems fine
>
>
>
> # sudo -u hdfs hdfs fsck /
>
> ........
>
>
>
> ....................................Status: HEALTHY
>  Total size:    1479728140875 B (Total open files size: 1677721600 B)
>  Total dirs:    21298
>  Total files:   100636 (Files currently being written: 25)
>  Total blocks (validated):      119788 (avg. block size 12352891 B) (Total
> open file blocks (not validated): 37)
>  Minimally replicated blocks:   119788 (100.0 %)
>  Over-replicated blocks:        0 (0.0 %)
>  Under-replicated blocks:       166 (0.13857816 %)
>  Mis-replicated blocks:         0 (0.0 %)
>  Default replication factor:    3
>  Average block replication:     3.0027633
>  Corrupt blocks:                0
>  Missing replicas:              831 (0.23049656 %)
>  Number of data-nodes:          5
>  Number of racks:               1
> FSCK ended at Tue Dec 10 09:14:48 CST 2013 in 3276 milliseconds
>
>
> The filesystem under path '/' is HEALTHY
>
> On Tue, Dec 10, 2013 at 8:32 AM, ch huang <ju...@gmail.com> wrote:
>
> hi,maillist:
>
>             my nagios alert me that there is a corrupt block in HDFS all
> day,but i do not know how to remove it,and if the HDFS will handle this
> automaticlly? and if remove the corrupt block will cause any data
> lost?thanks
>
>
>
>
>

Re: how to handle the corrupt block in HDFS?

Posted by ch huang <ju...@gmail.com>.
And does fsck report data from the BlockPoolSliceScanner? It seems to run once every
3 weeks.
Can I restart the DNs one by one without interrupting the job which is running?
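The 3-week figure matches the default datanode block scanner period, dfs.datanode.scan.period.hours = 504 (21 days), at least in the versions I have looked at. A rough sketch for checking it and restarting datanodes one at a time; the service name assumes a CDH-style packaged install:

# check the configured block scanner period (504 hours = 3 weeks by default)
$ hdfs getconf -confKey dfs.datanode.scan.period.hours

# restart one datanode at a time; clients should fail over to the remaining replicas
$ sudo service hadoop-hdfs-datanode restart

# then re-run fsck to see whether the corrupt replica is now reported
$ sudo -u hdfs hdfs fsck / -list-corruptfileblocks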

On Thu, Dec 12, 2013 at 2:33 AM, Adam Kawa <ka...@gmail.com> wrote:

>  I have only 1-node cluster, so I am not able to verify it when
> replication factor is bigger than 1.
>
>  I run the fsck on a file that consists of 3 blocks, and 1 block has a
> corrupt replica. fsck told that the system is HEALTHY.
>
> When I restarted the DN, then the block scanner (BlockPoolSliceScanner)
> started and it detected a corrupted replica. Then I run fsck again on that
> file, and it told me that the system is CORRUPT.
>
> If you have a small (and non-production) cluster, can you restart your
> datanodes and run fsck again?
>
>
>
> 2013/12/11 ch huang <ju...@gmail.com>
>
>> thanks for the reply, but if a block has just 1 corrupt replica, hdfs fsck
>> cannot tell you which block of which file has the corrupted replica;
>> fsck is only useful when all of a block's replicas are bad
>>
>> On Wed, Dec 11, 2013 at 10:01 AM, Adam Kawa <ka...@gmail.com> wrote:
>>
>>> When you identify a file with corrupt block(s), then you can locate the
>>> machines that stores its block by typing
>>> $ sudo -u hdfs hdfs fsck <path-to-file> -files -blocks -locations
>>>
>>>
>>> 2013/12/11 Adam Kawa <ka...@gmail.com>
>>>
>>>> Maybe this can work for you
>>>> $ sudo -u hdfs hdfs fsck / -list-corruptfileblocks
>>>> ?
>>>>
>>>>
>>>> 2013/12/11 ch huang <ju...@gmail.com>
>>>>
>>>>> thanks for reply, what i do not know is how can i locate the block
>>>>> which has the corrupt replica,(so i can observe how long the corrupt
>>>>> replica will be removed and a new health replica replace it,because i get
>>>>> nagios alert for three days,i do not sure if it is the same corrupt replica
>>>>> cause the alert ,and i do not know the interval of hdfs check corrupt
>>>>> replica and clean it)
>>>>>
>>>>>
>>>>> On Tue, Dec 10, 2013 at 6:20 PM, Vinayakumar B <
>>>>> vinayakumar.b@huawei.com> wrote:
>>>>>
>>>>>>  Hi ch huang,
>>>>>>
>>>>>>
>>>>>>
>>>>>> It may seem strange, but the fact is,
>>>>>>
>>>>>> *CorruptBlocks* through JMX means *“Number of blocks with corrupt
>>>>>> replicas”. May not be all replicas are corrupt.  *This you can check
>>>>>> though jconsole for description.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Where as *Corrupt blocks* through fsck means, *blocks with all
>>>>>> replicas corrupt(non-recoverable)/ missing.*
>>>>>>
>>>>>>
>>>>>>
>>>>>> In your case, may be one of the replica is corrupt, not all replicas
>>>>>> of same block. This corrupt replica will be deleted automatically if one
>>>>>> more datanode available in your cluster and block replicated to that.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Related to replication 10, As Peter Marron said, *some of the
>>>>>> important files of the mapreduce job will set the replication of 10, to
>>>>>> make it accessible faster and launch map tasks faster. *
>>>>>>
>>>>>> Anyway, if the job is success these files will be deleted
>>>>>> auomatically. I think only in some cases if the jobs are killed in between
>>>>>> these files will remain in hdfs showing underreplicated blocks.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks and Regards,
>>>>>>
>>>>>> Vinayakumar B
>>>>>>
>>>>>>
>>>>>>
>>>>>> *From:* Peter Marron [mailto:Peter.Marron@trilliumsoftware.com]
>>>>>> *Sent:* 10 December 2013 14:19
>>>>>> *To:* user@hadoop.apache.org
>>>>>> *Subject:* RE: how to handle the corrupt block in HDFS?
>>>>>>
>>>>>>
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>
>>>>>>
>>>>>> I am sure that there are others who will answer this better, but
>>>>>> anyway.
>>>>>>
>>>>>> The default replication level for files in HDFS is 3 and so most
>>>>>> files that you
>>>>>>
>>>>>> see will have a replication level of 3. However when you run a
>>>>>> Map/Reduce
>>>>>>
>>>>>> job the system knows in advance that every node will need a copy of
>>>>>>
>>>>>> certain files. Specifically the job.xml and the various jars
>>>>>> containing
>>>>>>
>>>>>> classes that will be needed to run the mappers and reducers. So the
>>>>>>
>>>>>> system arranges that some of these files have a higher replication
>>>>>> level. This increases
>>>>>>
>>>>>> the chances that a copy will be found locally.
>>>>>>
>>>>>> By default this higher replication level is 10.
>>>>>>
>>>>>>
>>>>>>
>>>>>> This can seem a little odd on a cluster where you only have, say, 3
>>>>>> nodes.
>>>>>>
>>>>>> Because it means that you will almost always have some blocks that
>>>>>> are marked
>>>>>>
>>>>>> under-replicated. I think that there was some discussion a while back
>>>>>> to change
>>>>>>
>>>>>> this to make the replication level something like min(10, #number of
>>>>>> nodes)
>>>>>>
>>>>>> However, as I recall, the general consensus was that this was extra
>>>>>>
>>>>>> complexity that wasn’t really worth it. If it ain’t broke…
>>>>>>
>>>>>>
>>>>>>
>>>>>> Hope that this helps.
>>>>>>
>>>>>>
>>>>>>
>>>>>> *Peter Marron*
>>>>>>
>>>>>> Senior Developer, Research & Development
>>>>>>
>>>>>>
>>>>>>
>>>>>> Office: +44 *(0) 118-940-7609*  peter.marron@trilliumsoftware.com
>>>>>>
>>>>>> Theale Court First Floor, 11-13 High Street, Theale, RG7 5AH, UK
>>>>>>
>>>>>>   <https://www.facebook.com/pages/Trillium-Software/109184815778307>
>>>>>>
>>>>>>  <https://twitter.com/TrilliumSW>
>>>>>>
>>>>>>  <http://www.linkedin.com/company/17710>
>>>>>>
>>>>>>
>>>>>>
>>>>>> *www.trilliumsoftware.com <http://www.trilliumsoftware.com/>*
>>>>>>
>>>>>> Be Certain About Your Data. Be Trillium Certain.
>>>>>>
>>>>>>
>>>>>>
>>>>>> *From:* ch huang [mailto:justlooks@gmail.com <ju...@gmail.com>]
>>>>>> *Sent:* 10 December 2013 01:21
>>>>>> *To:* user@hadoop.apache.org
>>>>>> *Subject:* Re: how to handle the corrupt block in HDFS?
>>>>>>
>>>>>>
>>>>>>
>>>>>> more strange , in my HDFS cluster ,every block has three replicas,but
>>>>>> i find some one has ten replicas ,why?
>>>>>>
>>>>>>
>>>>>>
>>>>>> # sudo -u hdfs hadoop fs -ls
>>>>>> /data/hisstage/helen/.staging/job_1385542328307_0915
>>>>>> Found 5 items
>>>>>> -rw-r--r--   3 helen hadoop          7 2013-11-29 14:01
>>>>>> /data/hisstage/helen/.staging/job_1385542328307_0915/appTokens
>>>>>> -rw-r--r--  10 helen hadoop    2977839 2013-11-29 14:01
>>>>>> /data/hisstage/helen/.staging/job_1385542328307_0915/job.jar
>>>>>> -rw-r--r--  10 helen hadoop       3696 2013-11-29 14:01
>>>>>> /data/hisstage/helen/.staging/job_1385542328307_0915/job.split
>>>>>>
>>>>>> On Tue, Dec 10, 2013 at 9:15 AM, ch huang <ju...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> the strange thing is when i use the following command i find 1
>>>>>> corrupt block
>>>>>>
>>>>>>
>>>>>>
>>>>>> #  curl -s http://ch11:50070/jmx |grep orrupt
>>>>>>     "CorruptBlocks" : 1,
>>>>>>
>>>>>> but when i run hdfs fsck / , i get none ,everything seems fine
>>>>>>
>>>>>>
>>>>>>
>>>>>> # sudo -u hdfs hdfs fsck /
>>>>>>
>>>>>> ........
>>>>>>
>>>>>>
>>>>>>
>>>>>> ....................................Status: HEALTHY
>>>>>>  Total size:    1479728140875 B (Total open files size: 1677721600 B)
>>>>>>  Total dirs:    21298
>>>>>>  Total files:   100636 (Files currently being written: 25)
>>>>>>  Total blocks (validated):      119788 (avg. block size 12352891 B)
>>>>>> (Total open file blocks (not validated): 37)
>>>>>>  Minimally replicated blocks:   119788 (100.0 %)
>>>>>>  Over-replicated blocks:        0 (0.0 %)
>>>>>>  Under-replicated blocks:       166 (0.13857816 %)
>>>>>>  Mis-replicated blocks:         0 (0.0 %)
>>>>>>  Default replication factor:    3
>>>>>>  Average block replication:     3.0027633
>>>>>>  Corrupt blocks:                0
>>>>>>  Missing replicas:              831 (0.23049656 %)
>>>>>>  Number of data-nodes:          5
>>>>>>  Number of racks:               1
>>>>>> FSCK ended at Tue Dec 10 09:14:48 CST 2013 in 3276 milliseconds
>>>>>>
>>>>>>
>>>>>> The filesystem under path '/' is HEALTHY
>>>>>>
>>>>>> On Tue, Dec 10, 2013 at 8:32 AM, ch huang <ju...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> hi,maillist:
>>>>>>
>>>>>>             my nagios alert me that there is a corrupt block in HDFS
>>>>>> all day,but i do not know how to remove it,and if the HDFS will handle this
>>>>>> automaticlly? and if remove the corrupt block will cause any data
>>>>>> lost?thanks
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: how to handle the corrupt block in HDFS?

Posted by ch huang <ju...@gmail.com>.
The alert is from my production env; I will test on my benchmark env. Thanks

On Thu, Dec 12, 2013 at 2:33 AM, Adam Kawa <ka...@gmail.com> wrote:

>  I have only 1-node cluster, so I am not able to verify it when
> replication factor is bigger than 1.
>
>  I run the fsck on a file that consists of 3 blocks, and 1 block has a
> corrupt replica. fsck told that the system is HEALTHY.
>
> When I restarted the DN, then the block scanner (BlockPoolSliceScanner)
> started and it detected a corrupted replica. Then I run fsck again on that
> file, and it told me that the system is CORRUPT.
>
> If you have a small (and non-production) cluster, can you restart your
> datanodes and run fsck again?
>
>
>
> 2013/12/11 ch huang <ju...@gmail.com>
>
>> thanks for the reply, but if a block has just 1 corrupt replica, hdfs fsck
>> cannot tell you which block of which file has the corrupted replica;
>> fsck is only useful when all of a block's replicas are bad
>>
>> On Wed, Dec 11, 2013 at 10:01 AM, Adam Kawa <ka...@gmail.com> wrote:
>>
>>> When you identify a file with corrupt block(s), then you can locate the
>>> machines that stores its block by typing
>>> $ sudo -u hdfs hdfs fsck <path-to-file> -files -blocks -locations
>>>
>>>
>>> 2013/12/11 Adam Kawa <ka...@gmail.com>
>>>
>>>> Maybe this can work for you
>>>> $ sudo -u hdfs hdfs fsck / -list-corruptfileblocks
>>>> ?
>>>>
>>>>
>>>> 2013/12/11 ch huang <ju...@gmail.com>
>>>>
>>>>> thanks for reply, what i do not know is how can i locate the block
>>>>> which has the corrupt replica,(so i can observe how long the corrupt
>>>>> replica will be removed and a new health replica replace it,because i get
>>>>> nagios alert for three days,i do not sure if it is the same corrupt replica
>>>>> cause the alert ,and i do not know the interval of hdfs check corrupt
>>>>> replica and clean it)
>>>>>
>>>>>
>>>>> On Tue, Dec 10, 2013 at 6:20 PM, Vinayakumar B <
>>>>> vinayakumar.b@huawei.com> wrote:
>>>>>
>>>>>>  Hi ch huang,
>>>>>>
>>>>>>
>>>>>>
>>>>>> It may seem strange, but the fact is,
>>>>>>
>>>>>> *CorruptBlocks* through JMX means *“Number of blocks with corrupt
>>>>>> replicas”. May not be all replicas are corrupt.  *This you can check
>>>>>> though jconsole for description.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Where as *Corrupt blocks* through fsck means, *blocks with all
>>>>>> replicas corrupt(non-recoverable)/ missing.*
>>>>>>
>>>>>>
>>>>>>
>>>>>> In your case, may be one of the replica is corrupt, not all replicas
>>>>>> of same block. This corrupt replica will be deleted automatically if one
>>>>>> more datanode available in your cluster and block replicated to that.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Related to replication 10, As Peter Marron said, *some of the
>>>>>> important files of the mapreduce job will set the replication of 10, to
>>>>>> make it accessible faster and launch map tasks faster. *
>>>>>>
>>>>>> Anyway, if the job is success these files will be deleted
>>>>>> auomatically. I think only in some cases if the jobs are killed in between
>>>>>> these files will remain in hdfs showing underreplicated blocks.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks and Regards,
>>>>>>
>>>>>> Vinayakumar B
>>>>>>
>>>>>>
>>>>>>
>>>>>> *From:* Peter Marron [mailto:Peter.Marron@trilliumsoftware.com]
>>>>>> *Sent:* 10 December 2013 14:19
>>>>>> *To:* user@hadoop.apache.org
>>>>>> *Subject:* RE: how to handle the corrupt block in HDFS?
>>>>>>
>>>>>>
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>
>>>>>>
>>>>>> I am sure that there are others who will answer this better, but
>>>>>> anyway.
>>>>>>
>>>>>> The default replication level for files in HDFS is 3 and so most
>>>>>> files that you
>>>>>>
>>>>>> see will have a replication level of 3. However when you run a
>>>>>> Map/Reduce
>>>>>>
>>>>>> job the system knows in advance that every node will need a copy of
>>>>>>
>>>>>> certain files. Specifically the job.xml and the various jars
>>>>>> containing
>>>>>>
>>>>>> classes that will be needed to run the mappers and reducers. So the
>>>>>>
>>>>>> system arranges that some of these files have a higher replication
>>>>>> level. This increases
>>>>>>
>>>>>> the chances that a copy will be found locally.
>>>>>>
>>>>>> By default this higher replication level is 10.
>>>>>>
>>>>>>
>>>>>>
>>>>>> This can seem a little odd on a cluster where you only have, say, 3
>>>>>> nodes.
>>>>>>
>>>>>> Because it means that you will almost always have some blocks that
>>>>>> are marked
>>>>>>
>>>>>> under-replicated. I think that there was some discussion a while back
>>>>>> to change
>>>>>>
>>>>>> this to make the replication level something like min(10, #number of
>>>>>> nodes)
>>>>>>
>>>>>> However, as I recall, the general consensus was that this was extra
>>>>>>
>>>>>> complexity that wasn’t really worth it. If it ain’t broke…
>>>>>>
>>>>>>
>>>>>>
>>>>>> Hope that this helps.
>>>>>>
>>>>>>
>>>>>>
>>>>>> *Peter Marron*
>>>>>>
>>>>>> Senior Developer, Research & Development
>>>>>>
>>>>>>
>>>>>>
>>>>>> Office: +44 *(0) 118-940-7609*  peter.marron@trilliumsoftware.com
>>>>>>
>>>>>> Theale Court First Floor, 11-13 High Street, Theale, RG7 5AH, UK
>>>>>>
>>>>>>   <https://www.facebook.com/pages/Trillium-Software/109184815778307>
>>>>>>
>>>>>>  <https://twitter.com/TrilliumSW>
>>>>>>
>>>>>>  <http://www.linkedin.com/company/17710>
>>>>>>
>>>>>>
>>>>>>
>>>>>> *www.trilliumsoftware.com <http://www.trilliumsoftware.com/>*
>>>>>>
>>>>>> Be Certain About Your Data. Be Trillium Certain.
>>>>>>
>>>>>>
>>>>>>
>>>>>> *From:* ch huang [mailto:justlooks@gmail.com <ju...@gmail.com>]
>>>>>> *Sent:* 10 December 2013 01:21
>>>>>> *To:* user@hadoop.apache.org
>>>>>> *Subject:* Re: how to handle the corrupt block in HDFS?
>>>>>>
>>>>>>
>>>>>>
>>>>>> more strange , in my HDFS cluster ,every block has three replicas,but
>>>>>> i find some one has ten replicas ,why?
>>>>>>
>>>>>>
>>>>>>
>>>>>> # sudo -u hdfs hadoop fs -ls
>>>>>> /data/hisstage/helen/.staging/job_1385542328307_0915
>>>>>> Found 5 items
>>>>>> -rw-r--r--   3 helen hadoop          7 2013-11-29 14:01
>>>>>> /data/hisstage/helen/.staging/job_1385542328307_0915/appTokens
>>>>>> -rw-r--r--  10 helen hadoop    2977839 2013-11-29 14:01
>>>>>> /data/hisstage/helen/.staging/job_1385542328307_0915/job.jar
>>>>>> -rw-r--r--  10 helen hadoop       3696 2013-11-29 14:01
>>>>>> /data/hisstage/helen/.staging/job_1385542328307_0915/job.split
>>>>>>
>>>>>> On Tue, Dec 10, 2013 at 9:15 AM, ch huang <ju...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> the strange thing is when i use the following command i find 1
>>>>>> corrupt block
>>>>>>
>>>>>>
>>>>>>
>>>>>> #  curl -s http://ch11:50070/jmx |grep orrupt
>>>>>>     "CorruptBlocks" : 1,
>>>>>>
>>>>>> but when i run hdfs fsck / , i get none ,everything seems fine
>>>>>>
>>>>>>
>>>>>>
>>>>>> # sudo -u hdfs hdfs fsck /
>>>>>>
>>>>>> ........
>>>>>>
>>>>>>
>>>>>>
>>>>>> ....................................Status: HEALTHY
>>>>>>  Total size:    1479728140875 B (Total open files size: 1677721600 B)
>>>>>>  Total dirs:    21298
>>>>>>  Total files:   100636 (Files currently being written: 25)
>>>>>>  Total blocks (validated):      119788 (avg. block size 12352891 B)
>>>>>> (Total open file blocks (not validated): 37)
>>>>>>  Minimally replicated blocks:   119788 (100.0 %)
>>>>>>  Over-replicated blocks:        0 (0.0 %)
>>>>>>  Under-replicated blocks:       166 (0.13857816 %)
>>>>>>  Mis-replicated blocks:         0 (0.0 %)
>>>>>>  Default replication factor:    3
>>>>>>  Average block replication:     3.0027633
>>>>>>  Corrupt blocks:                0
>>>>>>  Missing replicas:              831 (0.23049656 %)
>>>>>>  Number of data-nodes:          5
>>>>>>  Number of racks:               1
>>>>>> FSCK ended at Tue Dec 10 09:14:48 CST 2013 in 3276 milliseconds
>>>>>>
>>>>>>
>>>>>> The filesystem under path '/' is HEALTHY
>>>>>>
>>>>>> On Tue, Dec 10, 2013 at 8:32 AM, ch huang <ju...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> hi,maillist:
>>>>>>
>>>>>>             my nagios alert me that there is a corrupt block in HDFS
>>>>>> all day,but i do not know how to remove it,and if the HDFS will handle this
>>>>>> automaticlly? and if remove the corrupt block will cause any data
>>>>>> lost?thanks
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: how to handle the corrupt block in HDFS?

Posted by ch huang <ju...@gmail.com>.
The alert is from my production environment, so I will test this on my benchmark environment first. Thanks.
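
For the benchmark test, a rough sketch of one way to reproduce a single
corrupt replica (the file name, data directory and block id below are only
placeholders, not values from this thread; pick a real block from the
fsck -files -blocks -locations output):

# put a small test file and find where its replicas live
$ sudo -u hdfs hdfs dfs -put /etc/hosts /tmp/corrupt-test.txt
$ sudo -u hdfs hdfs fsck /tmp/corrupt-test.txt -files -blocks -locations
# on one of the reported datanodes, damage that replica's block file on disk
$ sudo sh -c 'echo garbage >> /data/dfs/dn/current/BP-xxxx/current/finalized/blk_1073741825'
# then watch the namenode metric until the corrupt replica is detected and re-replicated
$ curl -s http://ch11:50070/jmx | grep CorruptBlocks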

On Thu, Dec 12, 2013 at 2:33 AM, Adam Kawa <ka...@gmail.com> wrote:

>  I have only 1-node cluster, so I am not able to verify it when
> replication factor is bigger than 1.
>
>  I run the fsck on a file that consists of 3 blocks, and 1 block has a
> corrupt replica. fsck told that the system is HEALTHY.
>
> When I restarted the DN, then the block scanner (BlockPoolSliceScanner)
> started and it detected a corrupted replica. Then I run fsck again on that
> file, and it told me that the system is CORRUPT.
>
> If you have a small (and non-production) cluster, can you restart your
> datandoes and run fsck again?
>
>
>
> 2013/12/11 ch huang <ju...@gmail.com>
>
>> thanks for reply,but if the block just has  1 corrupt replica,hdfs fsck
>> can not tell you which block of which file has a replica been
>> corrupted,fsck just useful on all of one block's replica bad
>>
>> On Wed, Dec 11, 2013 at 10:01 AM, Adam Kawa <ka...@gmail.com> wrote:
>>
>>> When you identify a file with corrupt block(s), then you can locate the
>>> machines that stores its block by typing
>>> $ sudo -u hdfs hdfs fsck <path-to-file> -files -blocks -locations
>>>
>>>
>>> 2013/12/11 Adam Kawa <ka...@gmail.com>
>>>
>>>> Maybe this can work for you
>>>> $ sudo -u hdfs hdfs fsck / -list-corruptfileblocks
>>>> ?
>>>>
>>>>
>>>> 2013/12/11 ch huang <ju...@gmail.com>
>>>>
>>>>> thanks for reply, what i do not know is how can i locate the block
>>>>> which has the corrupt replica,(so i can observe how long the corrupt
>>>>> replica will be removed and a new health replica replace it,because i get
>>>>> nagios alert for three days,i do not sure if it is the same corrupt replica
>>>>> cause the alert ,and i do not know the interval of hdfs check corrupt
>>>>> replica and clean it)
>>>>>
>>>>>
>>>>> On Tue, Dec 10, 2013 at 6:20 PM, Vinayakumar B <
>>>>> vinayakumar.b@huawei.com> wrote:
>>>>>
>>>>>>  Hi ch huang,
>>>>>>
>>>>>>
>>>>>>
>>>>>> It may seem strange, but the fact is,
>>>>>>
>>>>>> *CorruptBlocks* through JMX means *“Number of blocks with corrupt
>>>>>> replicas”. May not be all replicas are corrupt.  *This you can check
>>>>>> though jconsole for description.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Where as *Corrupt blocks* through fsck means, *blocks with all
>>>>>> replicas corrupt(non-recoverable)/ missing.*
>>>>>>
>>>>>>
>>>>>>
>>>>>> In your case, may be one of the replica is corrupt, not all replicas
>>>>>> of same block. This corrupt replica will be deleted automatically if one
>>>>>> more datanode available in your cluster and block replicated to that.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Related to replication 10, As Peter Marron said, *some of the
>>>>>> important files of the mapreduce job will set the replication of 10, to
>>>>>> make it accessible faster and launch map tasks faster. *
>>>>>>
>>>>>> Anyway, if the job is success these files will be deleted
>>>>>> auomatically. I think only in some cases if the jobs are killed in between
>>>>>> these files will remain in hdfs showing underreplicated blocks.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks and Regards,
>>>>>>
>>>>>> Vinayakumar B
>>>>>>
>>>>>>
>>>>>>
>>>>>> *From:* Peter Marron [mailto:Peter.Marron@trilliumsoftware.com]
>>>>>> *Sent:* 10 December 2013 14:19
>>>>>> *To:* user@hadoop.apache.org
>>>>>> *Subject:* RE: how to handle the corrupt block in HDFS?
>>>>>>
>>>>>>
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>
>>>>>>
>>>>>> I am sure that there are others who will answer this better, but
>>>>>> anyway.
>>>>>>
>>>>>> The default replication level for files in HDFS is 3 and so most
>>>>>> files that you
>>>>>>
>>>>>> see will have a replication level of 3. However when you run a
>>>>>> Map/Reduce
>>>>>>
>>>>>> job the system knows in advance that every node will need a copy of
>>>>>>
>>>>>> certain files. Specifically the job.xml and the various jars
>>>>>> containing
>>>>>>
>>>>>> classes that will be needed to run the mappers and reducers. So the
>>>>>>
>>>>>> system arranges that some of these files have a higher replication
>>>>>> level. This increases
>>>>>>
>>>>>> the chances that a copy will be found locally.
>>>>>>
>>>>>> By default this higher replication level is 10.
>>>>>>
>>>>>>
>>>>>>
>>>>>> This can seem a little odd on a cluster where you only have, say, 3
>>>>>> nodes.
>>>>>>
>>>>>> Because it means that you will almost always have some blocks that
>>>>>> are marked
>>>>>>
>>>>>> under-replicated. I think that there was some discussion a while back
>>>>>> to change
>>>>>>
>>>>>> this to make the replication level something like min(10, #number of
>>>>>> nodes)
>>>>>>
>>>>>> However, as I recall, the general consensus was that this was extra
>>>>>>
>>>>>> complexity that wasn’t really worth it. If it ain’t broke…
>>>>>>
>>>>>>
>>>>>>
>>>>>> Hope that this helps.
>>>>>>
>>>>>>
>>>>>>
>>>>>> *Peter Marron*
>>>>>>
>>>>>> Senior Developer, Research & Development
>>>>>>
>>>>>>
>>>>>>
>>>>>> Office: +44 *(0) 118-940-7609*  peter.marron@trilliumsoftware.com
>>>>>>
>>>>>> Theale Court First Floor, 11-13 High Street, Theale, RG7 5AH, UK
>>>>>>
>>>>>>   <https://www.facebook.com/pages/Trillium-Software/109184815778307>
>>>>>>
>>>>>>  <https://twitter.com/TrilliumSW>
>>>>>>
>>>>>>  <http://www.linkedin.com/company/17710>
>>>>>>
>>>>>>
>>>>>>
>>>>>> *www.trilliumsoftware.com <http://www.trilliumsoftware.com/>*
>>>>>>
>>>>>> Be Certain About Your Data. Be Trillium Certain.
>>>>>>
>>>>>>
>>>>>>
>>>>>> *From:* ch huang [mailto:justlooks@gmail.com <ju...@gmail.com>]
>>>>>> *Sent:* 10 December 2013 01:21
>>>>>> *To:* user@hadoop.apache.org
>>>>>> *Subject:* Re: how to handle the corrupt block in HDFS?
>>>>>>
>>>>>>
>>>>>>
>>>>>> more strange , in my HDFS cluster ,every block has three replicas,but
>>>>>> i find some one has ten replicas ,why?
>>>>>>
>>>>>>
>>>>>>
>>>>>> # sudo -u hdfs hadoop fs -ls
>>>>>> /data/hisstage/helen/.staging/job_1385542328307_0915
>>>>>> Found 5 items
>>>>>> -rw-r--r--   3 helen hadoop          7 2013-11-29 14:01
>>>>>> /data/hisstage/helen/.staging/job_1385542328307_0915/appTokens
>>>>>> -rw-r--r--  10 helen hadoop    2977839 2013-11-29 14:01
>>>>>> /data/hisstage/helen/.staging/job_1385542328307_0915/job.jar
>>>>>> -rw-r--r--  10 helen hadoop       3696 2013-11-29 14:01
>>>>>> /data/hisstage/helen/.staging/job_1385542328307_0915/job.split
>>>>>>
>>>>>> On Tue, Dec 10, 2013 at 9:15 AM, ch huang <ju...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> the strange thing is when i use the following command i find 1
>>>>>> corrupt block
>>>>>>
>>>>>>
>>>>>>
>>>>>> #  curl -s http://ch11:50070/jmx |grep orrupt
>>>>>>     "CorruptBlocks" : 1,
>>>>>>
>>>>>> but when i run hdfs fsck / , i get none ,everything seems fine
>>>>>>
>>>>>>
>>>>>>
>>>>>> # sudo -u hdfs hdfs fsck /
>>>>>>
>>>>>> ........
>>>>>>
>>>>>>
>>>>>>
>>>>>> ....................................Status: HEALTHY
>>>>>>  Total size:    1479728140875 B (Total open files size: 1677721600 B)
>>>>>>  Total dirs:    21298
>>>>>>  Total files:   100636 (Files currently being written: 25)
>>>>>>  Total blocks (validated):      119788 (avg. block size 12352891 B)
>>>>>> (Total open file blocks (not validated): 37)
>>>>>>  Minimally replicated blocks:   119788 (100.0 %)
>>>>>>  Over-replicated blocks:        0 (0.0 %)
>>>>>>  Under-replicated blocks:       166 (0.13857816 %)
>>>>>>  Mis-replicated blocks:         0 (0.0 %)
>>>>>>  Default replication factor:    3
>>>>>>  Average block replication:     3.0027633
>>>>>>  Corrupt blocks:                0
>>>>>>  Missing replicas:              831 (0.23049656 %)
>>>>>>  Number of data-nodes:          5
>>>>>>  Number of racks:               1
>>>>>> FSCK ended at Tue Dec 10 09:14:48 CST 2013 in 3276 milliseconds
>>>>>>
>>>>>>
>>>>>> The filesystem under path '/' is HEALTHY
>>>>>>
>>>>>> On Tue, Dec 10, 2013 at 8:32 AM, ch huang <ju...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> hi,maillist:
>>>>>>
>>>>>>             my nagios alert me that there is a corrupt block in HDFS
>>>>>> all day,but i do not know how to remove it,and if the HDFS will handle this
>>>>>> automaticlly? and if remove the corrupt block will cause any data
>>>>>> lost?thanks
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: how to handle the corrupt block in HDFS?

Posted by ch huang <ju...@gmail.com>.
Also, does the fsck report get its data from the BlockPoolSliceScanner? That
scanner seems to run only once every 3 weeks.
Can I restart the DataNodes one by one without interrupting the jobs that are currently running?
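
For reference, a rough sketch of how to check this (the 50075 port, the
blockScannerReport servlet and the CDH-style service name are assumptions
about a packaged Hadoop 2.x install, not something confirmed in this thread):

# the scan interval comes from dfs.datanode.scan.period.hours (504 hours = 3 weeks by default)
$ hdfs getconf -confKey dfs.datanode.scan.period.hours
# per-datanode block scanner status
$ curl -s http://dn1:50075/blockScannerReport
# rolling restart, one datanode at a time, waiting for it to re-register before the next one
$ sudo service hadoop-hdfs-datanode restart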

On Thu, Dec 12, 2013 at 2:33 AM, Adam Kawa <ka...@gmail.com> wrote:

>  I have only 1-node cluster, so I am not able to verify it when
> replication factor is bigger than 1.
>
>  I run the fsck on a file that consists of 3 blocks, and 1 block has a
> corrupt replica. fsck told that the system is HEALTHY.
>
> When I restarted the DN, then the block scanner (BlockPoolSliceScanner)
> started and it detected a corrupted replica. Then I run fsck again on that
> file, and it told me that the system is CORRUPT.
>
> If you have a small (and non-production) cluster, can you restart your
> datandoes and run fsck again?
>
>
>
> 2013/12/11 ch huang <ju...@gmail.com>
>
>> thanks for reply,but if the block just has  1 corrupt replica,hdfs fsck
>> can not tell you which block of which file has a replica been
>> corrupted,fsck just useful on all of one block's replica bad
>>
>> On Wed, Dec 11, 2013 at 10:01 AM, Adam Kawa <ka...@gmail.com> wrote:
>>
>>> When you identify a file with corrupt block(s), then you can locate the
>>> machines that stores its block by typing
>>> $ sudo -u hdfs hdfs fsck <path-to-file> -files -blocks -locations
>>>
>>>
>>> 2013/12/11 Adam Kawa <ka...@gmail.com>
>>>
>>>> Maybe this can work for you
>>>> $ sudo -u hdfs hdfs fsck / -list-corruptfileblocks
>>>> ?
>>>>
>>>>
>>>> 2013/12/11 ch huang <ju...@gmail.com>
>>>>
>>>>> thanks for reply, what i do not know is how can i locate the block
>>>>> which has the corrupt replica,(so i can observe how long the corrupt
>>>>> replica will be removed and a new health replica replace it,because i get
>>>>> nagios alert for three days,i do not sure if it is the same corrupt replica
>>>>> cause the alert ,and i do not know the interval of hdfs check corrupt
>>>>> replica and clean it)
>>>>>
>>>>>
>>>>> On Tue, Dec 10, 2013 at 6:20 PM, Vinayakumar B <
>>>>> vinayakumar.b@huawei.com> wrote:
>>>>>
>>>>>>  Hi ch huang,
>>>>>>
>>>>>>
>>>>>>
>>>>>> It may seem strange, but the fact is,
>>>>>>
>>>>>> *CorruptBlocks* through JMX means *“Number of blocks with corrupt
>>>>>> replicas”. May not be all replicas are corrupt.  *This you can check
>>>>>> though jconsole for description.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Where as *Corrupt blocks* through fsck means, *blocks with all
>>>>>> replicas corrupt(non-recoverable)/ missing.*
>>>>>>
>>>>>>
>>>>>>
>>>>>> In your case, may be one of the replica is corrupt, not all replicas
>>>>>> of same block. This corrupt replica will be deleted automatically if one
>>>>>> more datanode available in your cluster and block replicated to that.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Related to replication 10, As Peter Marron said, *some of the
>>>>>> important files of the mapreduce job will set the replication of 10, to
>>>>>> make it accessible faster and launch map tasks faster. *
>>>>>>
>>>>>> Anyway, if the job is success these files will be deleted
>>>>>> auomatically. I think only in some cases if the jobs are killed in between
>>>>>> these files will remain in hdfs showing underreplicated blocks.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks and Regards,
>>>>>>
>>>>>> Vinayakumar B
>>>>>>
>>>>>>
>>>>>>
>>>>>> *From:* Peter Marron [mailto:Peter.Marron@trilliumsoftware.com]
>>>>>> *Sent:* 10 December 2013 14:19
>>>>>> *To:* user@hadoop.apache.org
>>>>>> *Subject:* RE: how to handle the corrupt block in HDFS?
>>>>>>
>>>>>>
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>
>>>>>>
>>>>>> I am sure that there are others who will answer this better, but
>>>>>> anyway.
>>>>>>
>>>>>> The default replication level for files in HDFS is 3 and so most
>>>>>> files that you
>>>>>>
>>>>>> see will have a replication level of 3. However when you run a
>>>>>> Map/Reduce
>>>>>>
>>>>>> job the system knows in advance that every node will need a copy of
>>>>>>
>>>>>> certain files. Specifically the job.xml and the various jars
>>>>>> containing
>>>>>>
>>>>>> classes that will be needed to run the mappers and reducers. So the
>>>>>>
>>>>>> system arranges that some of these files have a higher replication
>>>>>> level. This increases
>>>>>>
>>>>>> the chances that a copy will be found locally.
>>>>>>
>>>>>> By default this higher replication level is 10.
>>>>>>
>>>>>>
>>>>>>
>>>>>> This can seem a little odd on a cluster where you only have, say, 3
>>>>>> nodes.
>>>>>>
>>>>>> Because it means that you will almost always have some blocks that
>>>>>> are marked
>>>>>>
>>>>>> under-replicated. I think that there was some discussion a while back
>>>>>> to change
>>>>>>
>>>>>> this to make the replication level something like min(10, #number of
>>>>>> nodes)
>>>>>>
>>>>>> However, as I recall, the general consensus was that this was extra
>>>>>>
>>>>>> complexity that wasn’t really worth it. If it ain’t broke…
>>>>>>
>>>>>>
>>>>>>
>>>>>> Hope that this helps.
>>>>>>
>>>>>>
>>>>>>
>>>>>> *Peter Marron*
>>>>>>
>>>>>> Senior Developer, Research & Development
>>>>>>
>>>>>>
>>>>>>
>>>>>> Office: +44 *(0) 118-940-7609*  peter.marron@trilliumsoftware.com
>>>>>>
>>>>>> Theale Court First Floor, 11-13 High Street, Theale, RG7 5AH, UK
>>>>>>
>>>>>>   <https://www.facebook.com/pages/Trillium-Software/109184815778307>
>>>>>>
>>>>>>  <https://twitter.com/TrilliumSW>
>>>>>>
>>>>>>  <http://www.linkedin.com/company/17710>
>>>>>>
>>>>>>
>>>>>>
>>>>>> *www.trilliumsoftware.com <http://www.trilliumsoftware.com/>*
>>>>>>
>>>>>> Be Certain About Your Data. Be Trillium Certain.
>>>>>>
>>>>>>
>>>>>>
>>>>>> *From:* ch huang [mailto:justlooks@gmail.com <ju...@gmail.com>]
>>>>>> *Sent:* 10 December 2013 01:21
>>>>>> *To:* user@hadoop.apache.org
>>>>>> *Subject:* Re: how to handle the corrupt block in HDFS?
>>>>>>
>>>>>>
>>>>>>
>>>>>> more strange , in my HDFS cluster ,every block has three replicas,but
>>>>>> i find some one has ten replicas ,why?
>>>>>>
>>>>>>
>>>>>>
>>>>>> # sudo -u hdfs hadoop fs -ls
>>>>>> /data/hisstage/helen/.staging/job_1385542328307_0915
>>>>>> Found 5 items
>>>>>> -rw-r--r--   3 helen hadoop          7 2013-11-29 14:01
>>>>>> /data/hisstage/helen/.staging/job_1385542328307_0915/appTokens
>>>>>> -rw-r--r--  10 helen hadoop    2977839 2013-11-29 14:01
>>>>>> /data/hisstage/helen/.staging/job_1385542328307_0915/job.jar
>>>>>> -rw-r--r--  10 helen hadoop       3696 2013-11-29 14:01
>>>>>> /data/hisstage/helen/.staging/job_1385542328307_0915/job.split
>>>>>>
>>>>>> On Tue, Dec 10, 2013 at 9:15 AM, ch huang <ju...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> the strange thing is when i use the following command i find 1
>>>>>> corrupt block
>>>>>>
>>>>>>
>>>>>>
>>>>>> #  curl -s http://ch11:50070/jmx |grep orrupt
>>>>>>     "CorruptBlocks" : 1,
>>>>>>
>>>>>> but when i run hdfs fsck / , i get none ,everything seems fine
>>>>>>
>>>>>>
>>>>>>
>>>>>> # sudo -u hdfs hdfs fsck /
>>>>>>
>>>>>> ........
>>>>>>
>>>>>>
>>>>>>
>>>>>> ....................................Status: HEALTHY
>>>>>>  Total size:    1479728140875 B (Total open files size: 1677721600 B)
>>>>>>  Total dirs:    21298
>>>>>>  Total files:   100636 (Files currently being written: 25)
>>>>>>  Total blocks (validated):      119788 (avg. block size 12352891 B)
>>>>>> (Total open file blocks (not validated): 37)
>>>>>>  Minimally replicated blocks:   119788 (100.0 %)
>>>>>>  Over-replicated blocks:        0 (0.0 %)
>>>>>>  Under-replicated blocks:       166 (0.13857816 %)
>>>>>>  Mis-replicated blocks:         0 (0.0 %)
>>>>>>  Default replication factor:    3
>>>>>>  Average block replication:     3.0027633
>>>>>>  Corrupt blocks:                0
>>>>>>  Missing replicas:              831 (0.23049656 %)
>>>>>>  Number of data-nodes:          5
>>>>>>  Number of racks:               1
>>>>>> FSCK ended at Tue Dec 10 09:14:48 CST 2013 in 3276 milliseconds
>>>>>>
>>>>>>
>>>>>> The filesystem under path '/' is HEALTHY
>>>>>>
>>>>>> On Tue, Dec 10, 2013 at 8:32 AM, ch huang <ju...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> hi,maillist:
>>>>>>
>>>>>>             my nagios alert me that there is a corrupt block in HDFS
>>>>>> all day,but i do not know how to remove it,and if the HDFS will handle this
>>>>>> automaticlly? and if remove the corrupt block will cause any data
>>>>>> lost?thanks
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: how to handle the corrupt block in HDFS?

Posted by Adam Kawa <ka...@gmail.com>.
I have only a 1-node cluster, so I am not able to verify this when the
replication factor is bigger than 1.

I ran fsck on a file that consists of 3 blocks, where 1 block has a corrupt
replica, and fsck reported that the filesystem is HEALTHY.

When I restarted the DN, the block scanner (BlockPoolSliceScanner) started and
detected the corrupted replica. Then I ran fsck again on that file, and it
reported that the filesystem is CORRUPT.

If you have a small (and non-production) cluster, can you restart your
datanodes and run fsck again?
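
Roughly, the steps as a sketch (the file path and the CDH-style init script
name are just placeholders for whatever matches your install):

$ sudo -u hdfs hdfs fsck /tmp/corrupt-test.txt -files -blocks -locations   # still HEALTHY
$ sudo service hadoop-hdfs-datanode restart                                # block scanner runs on startup
$ sudo -u hdfs hdfs fsck /tmp/corrupt-test.txt -files -blocks -locations   # now reports CORRUPT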



2013/12/11 ch huang <ju...@gmail.com>

> thanks for reply,but if the block just has  1 corrupt replica,hdfs fsck
> can not tell you which block of which file has a replica been
> corrupted,fsck just useful on all of one block's replica bad
>
> On Wed, Dec 11, 2013 at 10:01 AM, Adam Kawa <ka...@gmail.com> wrote:
>
>> When you identify a file with corrupt block(s), then you can locate the
>> machines that stores its block by typing
>> $ sudo -u hdfs hdfs fsck <path-to-file> -files -blocks -locations
>>
>>
>> 2013/12/11 Adam Kawa <ka...@gmail.com>
>>
>>> Maybe this can work for you
>>> $ sudo -u hdfs hdfs fsck / -list-corruptfileblocks
>>> ?
>>>
>>>
>>> 2013/12/11 ch huang <ju...@gmail.com>
>>>
>>>> thanks for reply, what i do not know is how can i locate the block
>>>> which has the corrupt replica,(so i can observe how long the corrupt
>>>> replica will be removed and a new health replica replace it,because i get
>>>> nagios alert for three days,i do not sure if it is the same corrupt replica
>>>> cause the alert ,and i do not know the interval of hdfs check corrupt
>>>> replica and clean it)
>>>>
>>>>
>>>> On Tue, Dec 10, 2013 at 6:20 PM, Vinayakumar B <
>>>> vinayakumar.b@huawei.com> wrote:
>>>>
>>>>>  Hi ch huang,
>>>>>
>>>>>
>>>>>
>>>>> It may seem strange, but the fact is,
>>>>>
>>>>> *CorruptBlocks* through JMX means *“Number of blocks with corrupt
>>>>> replicas”. May not be all replicas are corrupt.  *This you can check
>>>>> though jconsole for description.
>>>>>
>>>>>
>>>>>
>>>>> Where as *Corrupt blocks* through fsck means, *blocks with all
>>>>> replicas corrupt(non-recoverable)/ missing.*
>>>>>
>>>>>
>>>>>
>>>>> In your case, may be one of the replica is corrupt, not all replicas
>>>>> of same block. This corrupt replica will be deleted automatically if one
>>>>> more datanode available in your cluster and block replicated to that.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Related to replication 10, As Peter Marron said, *some of the
>>>>> important files of the mapreduce job will set the replication of 10, to
>>>>> make it accessible faster and launch map tasks faster. *
>>>>>
>>>>> Anyway, if the job is success these files will be deleted
>>>>> auomatically. I think only in some cases if the jobs are killed in between
>>>>> these files will remain in hdfs showing underreplicated blocks.
>>>>>
>>>>>
>>>>>
>>>>> Thanks and Regards,
>>>>>
>>>>> Vinayakumar B
>>>>>
>>>>>
>>>>>
>>>>> *From:* Peter Marron [mailto:Peter.Marron@trilliumsoftware.com]
>>>>> *Sent:* 10 December 2013 14:19
>>>>> *To:* user@hadoop.apache.org
>>>>> *Subject:* RE: how to handle the corrupt block in HDFS?
>>>>>
>>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>>
>>>>>
>>>>> I am sure that there are others who will answer this better, but
>>>>> anyway.
>>>>>
>>>>> The default replication level for files in HDFS is 3 and so most files
>>>>> that you
>>>>>
>>>>> see will have a replication level of 3. However when you run a
>>>>> Map/Reduce
>>>>>
>>>>> job the system knows in advance that every node will need a copy of
>>>>>
>>>>> certain files. Specifically the job.xml and the various jars containing
>>>>>
>>>>> classes that will be needed to run the mappers and reducers. So the
>>>>>
>>>>> system arranges that some of these files have a higher replication
>>>>> level. This increases
>>>>>
>>>>> the chances that a copy will be found locally.
>>>>>
>>>>> By default this higher replication level is 10.
>>>>>
>>>>>
>>>>>
>>>>> This can seem a little odd on a cluster where you only have, say, 3
>>>>> nodes.
>>>>>
>>>>> Because it means that you will almost always have some blocks that are
>>>>> marked
>>>>>
>>>>> under-replicated. I think that there was some discussion a while back
>>>>> to change
>>>>>
>>>>> this to make the replication level something like min(10, #number of
>>>>> nodes)
>>>>>
>>>>> However, as I recall, the general consensus was that this was extra
>>>>>
>>>>> complexity that wasn’t really worth it. If it ain’t broke…
>>>>>
>>>>>
>>>>>
>>>>> Hope that this helps.
>>>>>
>>>>>
>>>>>
>>>>> *Peter Marron*
>>>>>
>>>>> Senior Developer, Research & Development
>>>>>
>>>>>
>>>>>
>>>>> Office: +44 *(0) 118-940-7609*  peter.marron@trilliumsoftware.com
>>>>>
>>>>> Theale Court First Floor, 11-13 High Street, Theale, RG7 5AH, UK
>>>>>
>>>>>    <https://www.facebook.com/pages/Trillium-Software/109184815778307>
>>>>>
>>>>>  <https://twitter.com/TrilliumSW>
>>>>>
>>>>>  <http://www.linkedin.com/company/17710>
>>>>>
>>>>>
>>>>>
>>>>> *www.trilliumsoftware.com <http://www.trilliumsoftware.com/>*
>>>>>
>>>>> Be Certain About Your Data. Be Trillium Certain.
>>>>>
>>>>>
>>>>>
>>>>> *From:* ch huang [mailto:justlooks@gmail.com <ju...@gmail.com>]
>>>>> *Sent:* 10 December 2013 01:21
>>>>> *To:* user@hadoop.apache.org
>>>>> *Subject:* Re: how to handle the corrupt block in HDFS?
>>>>>
>>>>>
>>>>>
>>>>> more strange , in my HDFS cluster ,every block has three replicas,but
>>>>> i find some one has ten replicas ,why?
>>>>>
>>>>>
>>>>>
>>>>> # sudo -u hdfs hadoop fs -ls
>>>>> /data/hisstage/helen/.staging/job_1385542328307_0915
>>>>> Found 5 items
>>>>> -rw-r--r--   3 helen hadoop          7 2013-11-29 14:01
>>>>> /data/hisstage/helen/.staging/job_1385542328307_0915/appTokens
>>>>> -rw-r--r--  10 helen hadoop    2977839 2013-11-29 14:01
>>>>> /data/hisstage/helen/.staging/job_1385542328307_0915/job.jar
>>>>> -rw-r--r--  10 helen hadoop       3696 2013-11-29 14:01
>>>>> /data/hisstage/helen/.staging/job_1385542328307_0915/job.split
>>>>>
>>>>> On Tue, Dec 10, 2013 at 9:15 AM, ch huang <ju...@gmail.com> wrote:
>>>>>
>>>>> the strange thing is when i use the following command i find 1 corrupt
>>>>> block
>>>>>
>>>>>
>>>>>
>>>>> #  curl -s http://ch11:50070/jmx |grep orrupt
>>>>>     "CorruptBlocks" : 1,
>>>>>
>>>>> but when i run hdfs fsck / , i get none ,everything seems fine
>>>>>
>>>>>
>>>>>
>>>>> # sudo -u hdfs hdfs fsck /
>>>>>
>>>>> ........
>>>>>
>>>>>
>>>>>
>>>>> ....................................Status: HEALTHY
>>>>>  Total size:    1479728140875 B (Total open files size: 1677721600 B)
>>>>>  Total dirs:    21298
>>>>>  Total files:   100636 (Files currently being written: 25)
>>>>>  Total blocks (validated):      119788 (avg. block size 12352891 B)
>>>>> (Total open file blocks (not validated): 37)
>>>>>  Minimally replicated blocks:   119788 (100.0 %)
>>>>>  Over-replicated blocks:        0 (0.0 %)
>>>>>  Under-replicated blocks:       166 (0.13857816 %)
>>>>>  Mis-replicated blocks:         0 (0.0 %)
>>>>>  Default replication factor:    3
>>>>>  Average block replication:     3.0027633
>>>>>  Corrupt blocks:                0
>>>>>  Missing replicas:              831 (0.23049656 %)
>>>>>  Number of data-nodes:          5
>>>>>  Number of racks:               1
>>>>> FSCK ended at Tue Dec 10 09:14:48 CST 2013 in 3276 milliseconds
>>>>>
>>>>>
>>>>> The filesystem under path '/' is HEALTHY
>>>>>
>>>>> On Tue, Dec 10, 2013 at 8:32 AM, ch huang <ju...@gmail.com> wrote:
>>>>>
>>>>> hi,maillist:
>>>>>
>>>>>             my nagios alert me that there is a corrupt block in HDFS
>>>>> all day,but i do not know how to remove it,and if the HDFS will handle this
>>>>> automaticlly? and if remove the corrupt block will cause any data
>>>>> lost?thanks
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>

Re: how to handle the corrupt block in HDFS?

Posted by Adam Kawa <ka...@gmail.com>.
I have only 1-node cluster, so I am not able to verify it when replication
factor is bigger than 1.

I run the fsck on a file that consists of 3 blocks, and 1 block has a
corrupt replica. fsck told that the system is HEALTHY.

When I restarted the DN, then the block scanner (BlockPoolSliceScanner)
started and it detected a corrupted replica. Then I run fsck again on that
file, and it told me that the system is CORRUPT.

If you have a small (and non-production) cluster, can you restart your
datandoes and run fsck again?



2013/12/11 ch huang <ju...@gmail.com>

> thanks for reply,but if the block just has  1 corrupt replica,hdfs fsck
> can not tell you which block of which file has a replica been
> corrupted,fsck just useful on all of one block's replica bad
>
> On Wed, Dec 11, 2013 at 10:01 AM, Adam Kawa <ka...@gmail.com> wrote:
>
>> When you identify a file with corrupt block(s), then you can locate the
>> machines that stores its block by typing
>> $ sudo -u hdfs hdfs fsck <path-to-file> -files -blocks -locations
>>
>>
>> 2013/12/11 Adam Kawa <ka...@gmail.com>
>>
>>> Maybe this can work for you
>>> $ sudo -u hdfs hdfs fsck / -list-corruptfileblocks
>>> ?
>>>
>>>
>>> 2013/12/11 ch huang <ju...@gmail.com>
>>>
>>>> thanks for reply, what i do not know is how can i locate the block
>>>> which has the corrupt replica,(so i can observe how long the corrupt
>>>> replica will be removed and a new health replica replace it,because i get
>>>> nagios alert for three days,i do not sure if it is the same corrupt replica
>>>> cause the alert ,and i do not know the interval of hdfs check corrupt
>>>> replica and clean it)
>>>>
>>>>
>>>> On Tue, Dec 10, 2013 at 6:20 PM, Vinayakumar B <
>>>> vinayakumar.b@huawei.com> wrote:
>>>>
>>>>>  Hi ch huang,
>>>>>
>>>>>
>>>>>
>>>>> It may seem strange, but the fact is,
>>>>>
>>>>> *CorruptBlocks* through JMX means *“Number of blocks with corrupt
>>>>> replicas”. May not be all replicas are corrupt.  *This you can check
>>>>> though jconsole for description.
>>>>>
>>>>>
>>>>>
>>>>> Where as *Corrupt blocks* through fsck means, *blocks with all
>>>>> replicas corrupt(non-recoverable)/ missing.*
>>>>>
>>>>>
>>>>>
>>>>> In your case, may be one of the replica is corrupt, not all replicas
>>>>> of same block. This corrupt replica will be deleted automatically if one
>>>>> more datanode available in your cluster and block replicated to that.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Related to replication 10, As Peter Marron said, *some of the
>>>>> important files of the mapreduce job will set the replication of 10, to
>>>>> make it accessible faster and launch map tasks faster. *
>>>>>
>>>>> Anyway, if the job is success these files will be deleted
>>>>> auomatically. I think only in some cases if the jobs are killed in between
>>>>> these files will remain in hdfs showing underreplicated blocks.
>>>>>
>>>>>
>>>>>
>>>>> Thanks and Regards,
>>>>>
>>>>> Vinayakumar B
>>>>>
>>>>>
>>>>>
>>>>> *From:* Peter Marron [mailto:Peter.Marron@trilliumsoftware.com]
>>>>> *Sent:* 10 December 2013 14:19
>>>>> *To:* user@hadoop.apache.org
>>>>> *Subject:* RE: how to handle the corrupt block in HDFS?
>>>>>
>>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>>
>>>>>
>>>>> I am sure that there are others who will answer this better, but
>>>>> anyway.
>>>>>
>>>>> The default replication level for files in HDFS is 3 and so most files
>>>>> that you
>>>>>
>>>>> see will have a replication level of 3. However when you run a
>>>>> Map/Reduce
>>>>>
>>>>> job the system knows in advance that every node will need a copy of
>>>>>
>>>>> certain files. Specifically the job.xml and the various jars containing
>>>>>
>>>>> classes that will be needed to run the mappers and reducers. So the
>>>>>
>>>>> system arranges that some of these files have a higher replication
>>>>> level. This increases
>>>>>
>>>>> the chances that a copy will be found locally.
>>>>>
>>>>> By default this higher replication level is 10.
>>>>>
>>>>>
>>>>>
>>>>> This can seem a little odd on a cluster where you only have, say, 3
>>>>> nodes.
>>>>>
>>>>> Because it means that you will almost always have some blocks that are
>>>>> marked
>>>>>
>>>>> under-replicated. I think that there was some discussion a while back
>>>>> to change
>>>>>
>>>>> this to make the replication level something like min(10, #number of
>>>>> nodes)
>>>>>
>>>>> However, as I recall, the general consensus was that this was extra
>>>>>
>>>>> complexity that wasn’t really worth it. If it ain’t broke…
>>>>>
>>>>>
>>>>>
>>>>> Hope that this helps.
>>>>>
>>>>>
>>>>>
>>>>> *Peter Marron*
>>>>>
>>>>> Senior Developer, Research & Development
>>>>>
>>>>>
>>>>>
>>>>> Office: +44 *(0) 118-940-7609*  peter.marron@trilliumsoftware.com
>>>>>
>>>>> Theale Court First Floor, 11-13 High Street, Theale, RG7 5AH, UK
>>>>>
>>>>>    <https://www.facebook.com/pages/Trillium-Software/109184815778307>
>>>>>
>>>>>  <https://twitter.com/TrilliumSW>
>>>>>
>>>>>  <http://www.linkedin.com/company/17710>
>>>>>
>>>>>
>>>>>
>>>>> *www.trilliumsoftware.com <http://www.trilliumsoftware.com/>*
>>>>>
>>>>> Be Certain About Your Data. Be Trillium Certain.
>>>>>
>>>>>
>>>>>
>>>>> *From:* ch huang [mailto:justlooks@gmail.com <ju...@gmail.com>]
>>>>> *Sent:* 10 December 2013 01:21
>>>>> *To:* user@hadoop.apache.org
>>>>> *Subject:* Re: how to handle the corrupt block in HDFS?
>>>>>
>>>>>
>>>>>
>>>>> more strange , in my HDFS cluster ,every block has three replicas,but
>>>>> i find some one has ten replicas ,why?
>>>>>
>>>>>
>>>>>
>>>>> # sudo -u hdfs hadoop fs -ls
>>>>> /data/hisstage/helen/.staging/job_1385542328307_0915
>>>>> Found 5 items
>>>>> -rw-r--r--   3 helen hadoop          7 2013-11-29 14:01
>>>>> /data/hisstage/helen/.staging/job_1385542328307_0915/appTokens
>>>>> -rw-r--r--  10 helen hadoop    2977839 2013-11-29 14:01
>>>>> /data/hisstage/helen/.staging/job_1385542328307_0915/job.jar
>>>>> -rw-r--r--  10 helen hadoop       3696 2013-11-29 14:01
>>>>> /data/hisstage/helen/.staging/job_1385542328307_0915/job.split
>>>>>
>>>>> On Tue, Dec 10, 2013 at 9:15 AM, ch huang <ju...@gmail.com> wrote:
>>>>>
>>>>> the strange thing is when i use the following command i find 1 corrupt
>>>>> block
>>>>>
>>>>>
>>>>>
>>>>> #  curl -s http://ch11:50070/jmx |grep orrupt
>>>>>     "CorruptBlocks" : 1,
>>>>>
>>>>> but when i run hdfs fsck / , i get none ,everything seems fine
>>>>>
>>>>>
>>>>>
>>>>> # sudo -u hdfs hdfs fsck /
>>>>>
>>>>> ........
>>>>>
>>>>>
>>>>>
>>>>> ....................................Status: HEALTHY
>>>>>  Total size:    1479728140875 B (Total open files size: 1677721600 B)
>>>>>  Total dirs:    21298
>>>>>  Total files:   100636 (Files currently being written: 25)
>>>>>  Total blocks (validated):      119788 (avg. block size 12352891 B)
>>>>> (Total open file blocks (not validated): 37)
>>>>>  Minimally replicated blocks:   119788 (100.0 %)
>>>>>  Over-replicated blocks:        0 (0.0 %)
>>>>>  Under-replicated blocks:       166 (0.13857816 %)
>>>>>  Mis-replicated blocks:         0 (0.0 %)
>>>>>  Default replication factor:    3
>>>>>  Average block replication:     3.0027633
>>>>>  Corrupt blocks:                0
>>>>>  Missing replicas:              831 (0.23049656 %)
>>>>>  Number of data-nodes:          5
>>>>>  Number of racks:               1
>>>>> FSCK ended at Tue Dec 10 09:14:48 CST 2013 in 3276 milliseconds
>>>>>
>>>>>
>>>>> The filesystem under path '/' is HEALTHY
>>>>>
>>>>> On Tue, Dec 10, 2013 at 8:32 AM, ch huang <ju...@gmail.com> wrote:
>>>>>
>>>>> hi,maillist:
>>>>>
>>>>>             my nagios alert me that there is a corrupt block in HDFS
>>>>> all day,but i do not know how to remove it,and if the HDFS will handle this
>>>>> automaticlly? and if remove the corrupt block will cause any data
>>>>> lost?thanks
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>
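
A side note on the replication level of 10 discussed in the quoted replies
above: as far as I recall, the job-submission replication is controlled by
mapreduce.client.submit.file.replication in MRv2 (mapred.submit.replication
in MRv1), defaulting to 10, and it can be lowered in mapred-site.xml on small
clusters. Please verify the property name against the mapred-default.xml of
your Hadoop version. The jar, class and paths below are placeholders, and the
-D override only applies when the driver uses ToolRunner/GenericOptionsParser:

$ hadoop jar my-job.jar com.example.MyDriver \
    -D mapreduce.client.submit.file.replication=5 \
    /input/path /output/path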

Re: how to handle the corrupt block in HDFS?

Posted by Adam Kawa <ka...@gmail.com>.
I have only a 1-node cluster, so I am not able to verify it when the
replication factor is bigger than 1.

I ran fsck on a file that consists of 3 blocks, where 1 block has a
corrupt replica. fsck reported that the file system was HEALTHY.

When I restarted the DN, the block scanner (BlockPoolSliceScanner)
started and detected the corrupt replica. Then I ran fsck again on that
file, and it reported the file system as CORRUPT.

If you have a small (and non-production) cluster, can you restart your
datanodes and run fsck again?
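
For reference, a minimal sketch of that sequence of checks, assuming a
packaged install where the datanode runs as the hadoop-hdfs-datanode service
(the service name, NNIP and the file path are placeholders and may differ on
your cluster):

$ curl -s http://NNIP:50070/jmx | grep -i corrupt                   # NameNode still counts 1 corrupt replica
$ sudo -u hdfs hdfs fsck /path/to/file -files -blocks -locations    # before the restart: HEALTHY
$ sudo service hadoop-hdfs-datanode restart                         # block scanner runs on startup
$ sudo -u hdfs hdfs fsck /path/to/file -files -blocks -locations    # after the scanner flags the replica (1-node test): CORRUPT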



2013/12/11 ch huang <ju...@gmail.com>

> thanks for reply,but if the block just has  1 corrupt replica,hdfs fsck
> can not tell you which block of which file has a replica been
> corrupted,fsck just useful on all of one block's replica bad
>
> On Wed, Dec 11, 2013 at 10:01 AM, Adam Kawa <ka...@gmail.com> wrote:
>
>> When you identify a file with corrupt block(s), then you can locate the
>> machines that stores its block by typing
>> $ sudo -u hdfs hdfs fsck <path-to-file> -files -blocks -locations
>>
>>
>> 2013/12/11 Adam Kawa <ka...@gmail.com>
>>
>>> Maybe this can work for you
>>> $ sudo -u hdfs hdfs fsck / -list-corruptfileblocks
>>> ?
>>>
>>>
>>> 2013/12/11 ch huang <ju...@gmail.com>
>>>
>>>> thanks for reply, what i do not know is how can i locate the block
>>>> which has the corrupt replica,(so i can observe how long the corrupt
>>>> replica will be removed and a new health replica replace it,because i get
>>>> nagios alert for three days,i do not sure if it is the same corrupt replica
>>>> cause the alert ,and i do not know the interval of hdfs check corrupt
>>>> replica and clean it)
>>>>
>>>>
>>>> On Tue, Dec 10, 2013 at 6:20 PM, Vinayakumar B <
>>>> vinayakumar.b@huawei.com> wrote:
>>>>
>>>>>  Hi ch huang,
>>>>>
>>>>>
>>>>>
>>>>> It may seem strange, but the fact is,
>>>>>
>>>>> *CorruptBlocks* through JMX means *“Number of blocks with corrupt
>>>>> replicas”. May not be all replicas are corrupt.  *This you can check
>>>>> though jconsole for description.
>>>>>
>>>>>
>>>>>
>>>>> Where as *Corrupt blocks* through fsck means, *blocks with all
>>>>> replicas corrupt(non-recoverable)/ missing.*
>>>>>
>>>>>
>>>>>
>>>>> In your case, may be one of the replica is corrupt, not all replicas
>>>>> of same block. This corrupt replica will be deleted automatically if one
>>>>> more datanode available in your cluster and block replicated to that.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Related to replication 10, As Peter Marron said, *some of the
>>>>> important files of the mapreduce job will set the replication of 10, to
>>>>> make it accessible faster and launch map tasks faster. *
>>>>>
>>>>> Anyway, if the job is success these files will be deleted
>>>>> auomatically. I think only in some cases if the jobs are killed in between
>>>>> these files will remain in hdfs showing underreplicated blocks.
>>>>>
>>>>>
>>>>>
>>>>> Thanks and Regards,
>>>>>
>>>>> Vinayakumar B
>>>>>
>>>>>
>>>>>
>>>>> *From:* Peter Marron [mailto:Peter.Marron@trilliumsoftware.com]
>>>>> *Sent:* 10 December 2013 14:19
>>>>> *To:* user@hadoop.apache.org
>>>>> *Subject:* RE: how to handle the corrupt block in HDFS?
>>>>>
>>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>>
>>>>>
>>>>> I am sure that there are others who will answer this better, but
>>>>> anyway.
>>>>>
>>>>> The default replication level for files in HDFS is 3 and so most files
>>>>> that you
>>>>>
>>>>> see will have a replication level of 3. However when you run a
>>>>> Map/Reduce
>>>>>
>>>>> job the system knows in advance that every node will need a copy of
>>>>>
>>>>> certain files. Specifically the job.xml and the various jars containing
>>>>>
>>>>> classes that will be needed to run the mappers and reducers. So the
>>>>>
>>>>> system arranges that some of these files have a higher replication
>>>>> level. This increases
>>>>>
>>>>> the chances that a copy will be found locally.
>>>>>
>>>>> By default this higher replication level is 10.
>>>>>
>>>>>
>>>>>
>>>>> This can seem a little odd on a cluster where you only have, say, 3
>>>>> nodes.
>>>>>
>>>>> Because it means that you will almost always have some blocks that are
>>>>> marked
>>>>>
>>>>> under-replicated. I think that there was some discussion a while back
>>>>> to change
>>>>>
>>>>> this to make the replication level something like min(10, #number of
>>>>> nodes)
>>>>>
>>>>> However, as I recall, the general consensus was that this was extra
>>>>>
>>>>> complexity that wasn’t really worth it. If it ain’t broke…
>>>>>
>>>>>
>>>>>
>>>>> Hope that this helps.
>>>>>
>>>>>
>>>>>
>>>>> *Peter Marron*
>>>>>
>>>>> Senior Developer, Research & Development
>>>>>
>>>>>
>>>>>
>>>>> Office: +44 *(0) 118-940-7609*  peter.marron@trilliumsoftware.com
>>>>>
>>>>> Theale Court First Floor, 11-13 High Street, Theale, RG7 5AH, UK
>>>>>
>>>>>    <https://www.facebook.com/pages/Trillium-Software/109184815778307>
>>>>>
>>>>>  <https://twitter.com/TrilliumSW>
>>>>>
>>>>>  <http://www.linkedin.com/company/17710>
>>>>>
>>>>>
>>>>>
>>>>> *www.trilliumsoftware.com <http://www.trilliumsoftware.com/>*
>>>>>
>>>>> Be Certain About Your Data. Be Trillium Certain.
>>>>>
>>>>>
>>>>>
>>>>> *From:* ch huang [mailto:justlooks@gmail.com <ju...@gmail.com>]
>>>>> *Sent:* 10 December 2013 01:21
>>>>> *To:* user@hadoop.apache.org
>>>>> *Subject:* Re: how to handle the corrupt block in HDFS?
>>>>>
>>>>>
>>>>>
>>>>> more strange , in my HDFS cluster ,every block has three replicas,but
>>>>> i find some one has ten replicas ,why?
>>>>>
>>>>>
>>>>>
>>>>> # sudo -u hdfs hadoop fs -ls
>>>>> /data/hisstage/helen/.staging/job_1385542328307_0915
>>>>> Found 5 items
>>>>> -rw-r--r--   3 helen hadoop          7 2013-11-29 14:01
>>>>> /data/hisstage/helen/.staging/job_1385542328307_0915/appTokens
>>>>> -rw-r--r--  10 helen hadoop    2977839 2013-11-29 14:01
>>>>> /data/hisstage/helen/.staging/job_1385542328307_0915/job.jar
>>>>> -rw-r--r--  10 helen hadoop       3696 2013-11-29 14:01
>>>>> /data/hisstage/helen/.staging/job_1385542328307_0915/job.split
>>>>>
>>>>> On Tue, Dec 10, 2013 at 9:15 AM, ch huang <ju...@gmail.com> wrote:
>>>>>
>>>>> the strange thing is when i use the following command i find 1 corrupt
>>>>> block
>>>>>
>>>>>
>>>>>
>>>>> #  curl -s http://ch11:50070/jmx |grep orrupt
>>>>>     "CorruptBlocks" : 1,
>>>>>
>>>>> but when i run hdfs fsck / , i get none ,everything seems fine
>>>>>
>>>>>
>>>>>
>>>>> # sudo -u hdfs hdfs fsck /
>>>>>
>>>>> ........
>>>>>
>>>>>
>>>>>
>>>>> ....................................Status: HEALTHY
>>>>>  Total size:    1479728140875 B (Total open files size: 1677721600 B)
>>>>>  Total dirs:    21298
>>>>>  Total files:   100636 (Files currently being written: 25)
>>>>>  Total blocks (validated):      119788 (avg. block size 12352891 B)
>>>>> (Total open file blocks (not validated): 37)
>>>>>  Minimally replicated blocks:   119788 (100.0 %)
>>>>>  Over-replicated blocks:        0 (0.0 %)
>>>>>  Under-replicated blocks:       166 (0.13857816 %)
>>>>>  Mis-replicated blocks:         0 (0.0 %)
>>>>>  Default replication factor:    3
>>>>>  Average block replication:     3.0027633
>>>>>  Corrupt blocks:                0
>>>>>  Missing replicas:              831 (0.23049656 %)
>>>>>  Number of data-nodes:          5
>>>>>  Number of racks:               1
>>>>> FSCK ended at Tue Dec 10 09:14:48 CST 2013 in 3276 milliseconds
>>>>>
>>>>>
>>>>> The filesystem under path '/' is HEALTHY
>>>>>
>>>>> On Tue, Dec 10, 2013 at 8:32 AM, ch huang <ju...@gmail.com> wrote:
>>>>>
>>>>> hi,maillist:
>>>>>
>>>>>             my nagios alert me that there is a corrupt block in HDFS
>>>>> all day,but i do not know how to remove it,and if the HDFS will handle this
>>>>> automaticlly? and if remove the corrupt block will cause any data
>>>>> lost?thanks
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>

Re: how to handle the corrupt block in HDFS?

Posted by ch huang <ju...@gmail.com>.
Thanks for the reply, but if the block has just 1 corrupt replica, hdfs fsck
cannot tell you which block of which file has a corrupted replica; fsck is
only useful when all replicas of a block are bad.
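
One way to narrow it down while fsck still reports HEALTHY might be to go
through the datanodes directly. This is only a sketch: NNIP/DNIP and the log
path are placeholders, and the blockScannerReport servlet and the log
location depend on your Hadoop version and packaging, so treat them as
assumptions:

$ curl -s http://NNIP:50070/jmx | grep -i corrupt            # confirm the NameNode-side counter
$ curl -s http://DNIP:50075/blockScannerReport?listblocks    # per-datanode scan status of each blk_... id
$ grep -i checksum /var/log/hadoop-hdfs/*datanode*.log       # checksum errors in the datanode log usually name the bad block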

On Wed, Dec 11, 2013 at 10:01 AM, Adam Kawa <ka...@gmail.com> wrote:

> When you identify a file with corrupt block(s), then you can locate the
> machines that stores its block by typing
> $ sudo -u hdfs hdfs fsck <path-to-file> -files -blocks -locations
>
>
> 2013/12/11 Adam Kawa <ka...@gmail.com>
>
>> Maybe this can work for you
>> $ sudo -u hdfs hdfs fsck / -list-corruptfileblocks
>> ?
>>
>>
>> 2013/12/11 ch huang <ju...@gmail.com>
>>
>>> thanks for reply, what i do not know is how can i locate the block which
>>> has the corrupt replica,(so i can observe how long the corrupt replica will
>>> be removed and a new health replica replace it,because i get nagios alert
>>> for three days,i do not sure if it is the same corrupt replica cause the
>>> alert ,and i do not know the interval of hdfs check corrupt replica and
>>> clean it)
>>>
>>>
>>> On Tue, Dec 10, 2013 at 6:20 PM, Vinayakumar B <vinayakumar.b@huawei.com
>>> > wrote:
>>>
>>>>  Hi ch huang,
>>>>
>>>>
>>>>
>>>> It may seem strange, but the fact is,
>>>>
>>>> *CorruptBlocks* through JMX means *“Number of blocks with corrupt
>>>> replicas”. May not be all replicas are corrupt.  *This you can check
>>>> though jconsole for description.
>>>>
>>>>
>>>>
>>>> Where as *Corrupt blocks* through fsck means, *blocks with all
>>>> replicas corrupt(non-recoverable)/ missing.*
>>>>
>>>>
>>>>
>>>> In your case, may be one of the replica is corrupt, not all replicas of
>>>> same block. This corrupt replica will be deleted automatically if one more
>>>> datanode available in your cluster and block replicated to that.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Related to replication 10, As Peter Marron said, *some of the
>>>> important files of the mapreduce job will set the replication of 10, to
>>>> make it accessible faster and launch map tasks faster. *
>>>>
>>>> Anyway, if the job is success these files will be deleted auomatically.
>>>> I think only in some cases if the jobs are killed in between these files
>>>> will remain in hdfs showing underreplicated blocks.
>>>>
>>>>
>>>>
>>>> Thanks and Regards,
>>>>
>>>> Vinayakumar B
>>>>
>>>>
>>>>
>>>> *From:* Peter Marron [mailto:Peter.Marron@trilliumsoftware.com]
>>>> *Sent:* 10 December 2013 14:19
>>>> *To:* user@hadoop.apache.org
>>>> *Subject:* RE: how to handle the corrupt block in HDFS?
>>>>
>>>>
>>>>
>>>> Hi,
>>>>
>>>>
>>>>
>>>> I am sure that there are others who will answer this better, but anyway.
>>>>
>>>> The default replication level for files in HDFS is 3 and so most files
>>>> that you
>>>>
>>>> see will have a replication level of 3. However when you run a
>>>> Map/Reduce
>>>>
>>>> job the system knows in advance that every node will need a copy of
>>>>
>>>> certain files. Specifically the job.xml and the various jars containing
>>>>
>>>> classes that will be needed to run the mappers and reducers. So the
>>>>
>>>> system arranges that some of these files have a higher replication
>>>> level. This increases
>>>>
>>>> the chances that a copy will be found locally.
>>>>
>>>> By default this higher replication level is 10.
>>>>
>>>>
>>>>
>>>> This can seem a little odd on a cluster where you only have, say, 3
>>>> nodes.
>>>>
>>>> Because it means that you will almost always have some blocks that are
>>>> marked
>>>>
>>>> under-replicated. I think that there was some discussion a while back
>>>> to change
>>>>
>>>> this to make the replication level something like min(10, #number of
>>>> nodes)
>>>>
>>>> However, as I recall, the general consensus was that this was extra
>>>>
>>>> complexity that wasn’t really worth it. If it ain’t broke…
>>>>
>>>>
>>>>
>>>> Hope that this helps.
>>>>
>>>>
>>>>
>>>> *Peter Marron*
>>>>
>>>> Senior Developer, Research & Development
>>>>
>>>>
>>>>
>>>> Office: +44 *(0) 118-940-7609*  peter.marron@trilliumsoftware.com
>>>>
>>>> Theale Court First Floor, 11-13 High Street, Theale, RG7 5AH, UK
>>>>
>>>>   <https://www.facebook.com/pages/Trillium-Software/109184815778307>
>>>>
>>>>  <https://twitter.com/TrilliumSW>
>>>>
>>>>  <http://www.linkedin.com/company/17710>
>>>>
>>>>
>>>>
>>>> *www.trilliumsoftware.com <http://www.trilliumsoftware.com/>*
>>>>
>>>> Be Certain About Your Data. Be Trillium Certain.
>>>>
>>>>
>>>>
>>>> *From:* ch huang [mailto:justlooks@gmail.com <ju...@gmail.com>]
>>>> *Sent:* 10 December 2013 01:21
>>>> *To:* user@hadoop.apache.org
>>>> *Subject:* Re: how to handle the corrupt block in HDFS?
>>>>
>>>>
>>>>
>>>> more strange , in my HDFS cluster ,every block has three replicas,but i
>>>> find some one has ten replicas ,why?
>>>>
>>>>
>>>>
>>>> # sudo -u hdfs hadoop fs -ls
>>>> /data/hisstage/helen/.staging/job_1385542328307_0915
>>>> Found 5 items
>>>> -rw-r--r--   3 helen hadoop          7 2013-11-29 14:01
>>>> /data/hisstage/helen/.staging/job_1385542328307_0915/appTokens
>>>> -rw-r--r--  10 helen hadoop    2977839 2013-11-29 14:01
>>>> /data/hisstage/helen/.staging/job_1385542328307_0915/job.jar
>>>> -rw-r--r--  10 helen hadoop       3696 2013-11-29 14:01
>>>> /data/hisstage/helen/.staging/job_1385542328307_0915/job.split
>>>>
>>>> On Tue, Dec 10, 2013 at 9:15 AM, ch huang <ju...@gmail.com> wrote:
>>>>
>>>> the strange thing is when i use the following command i find 1 corrupt
>>>> block
>>>>
>>>>
>>>>
>>>> #  curl -s http://ch11:50070/jmx |grep orrupt
>>>>     "CorruptBlocks" : 1,
>>>>
>>>> but when i run hdfs fsck / , i get none ,everything seems fine
>>>>
>>>>
>>>>
>>>> # sudo -u hdfs hdfs fsck /
>>>>
>>>> ........
>>>>
>>>>
>>>>
>>>> ....................................Status: HEALTHY
>>>>  Total size:    1479728140875 B (Total open files size: 1677721600 B)
>>>>  Total dirs:    21298
>>>>  Total files:   100636 (Files currently being written: 25)
>>>>  Total blocks (validated):      119788 (avg. block size 12352891 B)
>>>> (Total open file blocks (not validated): 37)
>>>>  Minimally replicated blocks:   119788 (100.0 %)
>>>>  Over-replicated blocks:        0 (0.0 %)
>>>>  Under-replicated blocks:       166 (0.13857816 %)
>>>>  Mis-replicated blocks:         0 (0.0 %)
>>>>  Default replication factor:    3
>>>>  Average block replication:     3.0027633
>>>>  Corrupt blocks:                0
>>>>  Missing replicas:              831 (0.23049656 %)
>>>>  Number of data-nodes:          5
>>>>  Number of racks:               1
>>>> FSCK ended at Tue Dec 10 09:14:48 CST 2013 in 3276 milliseconds
>>>>
>>>>
>>>> The filesystem under path '/' is HEALTHY
>>>>
>>>> On Tue, Dec 10, 2013 at 8:32 AM, ch huang <ju...@gmail.com> wrote:
>>>>
>>>> hi,maillist:
>>>>
>>>>             my nagios alert me that there is a corrupt block in HDFS
>>>> all day,but i do not know how to remove it,and if the HDFS will handle this
>>>> automaticlly? and if remove the corrupt block will cause any data
>>>> lost?thanks
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>

Re: how to handle the corrupt block in HDFS?

Posted by Adam Kawa <ka...@gmail.com>.
When you identify a file with corrupt block(s), you can locate the
machines that store its blocks by typing
$ sudo -u hdfs hdfs fsck <path-to-file> -files -blocks -locations
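
For example, the two fsck invocations from this thread can be chained; the
path below is just an illustration:

$ sudo -u hdfs hdfs fsck / -list-corruptfileblocks                    # list the files that currently have corrupt blocks
$ sudo -u hdfs hdfs fsck /data/some/file -files -blocks -locations    # then show each blk_... id and the datanodes holding it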


2013/12/11 Adam Kawa <ka...@gmail.com>

> Maybe this can work for you
> $ sudo -u hdfs hdfs fsck / -list-corruptfileblocks
> ?
>
>
> 2013/12/11 ch huang <ju...@gmail.com>
>
>> thanks for reply, what i do not know is how can i locate the block which
>> has the corrupt replica,(so i can observe how long the corrupt replica will
>> be removed and a new health replica replace it,because i get nagios alert
>> for three days,i do not sure if it is the same corrupt replica cause the
>> alert ,and i do not know the interval of hdfs check corrupt replica and
>> clean it)
>>
>>
>> On Tue, Dec 10, 2013 at 6:20 PM, Vinayakumar B <vi...@huawei.com>wrote:
>>
>>>  Hi ch huang,
>>>
>>>
>>>
>>> It may seem strange, but the fact is,
>>>
>>> *CorruptBlocks* through JMX means *“Number of blocks with corrupt
>>> replicas”. May not be all replicas are corrupt.  *This you can check
>>> though jconsole for description.
>>>
>>>
>>>
>>> Where as *Corrupt blocks* through fsck means, *blocks with all replicas
>>> corrupt(non-recoverable)/ missing.*
>>>
>>>
>>>
>>> In your case, may be one of the replica is corrupt, not all replicas of
>>> same block. This corrupt replica will be deleted automatically if one more
>>> datanode available in your cluster and block replicated to that.
>>>
>>>
>>>
>>>
>>>
>>> Related to replication 10, As Peter Marron said, *some of the important
>>> files of the mapreduce job will set the replication of 10, to make it
>>> accessible faster and launch map tasks faster. *
>>>
>>> Anyway, if the job is success these files will be deleted auomatically.
>>> I think only in some cases if the jobs are killed in between these files
>>> will remain in hdfs showing underreplicated blocks.
>>>
>>>
>>>
>>> Thanks and Regards,
>>>
>>> Vinayakumar B
>>>
>>>
>>>
>>> *From:* Peter Marron [mailto:Peter.Marron@trilliumsoftware.com]
>>> *Sent:* 10 December 2013 14:19
>>> *To:* user@hadoop.apache.org
>>> *Subject:* RE: how to handle the corrupt block in HDFS?
>>>
>>>
>>>
>>> Hi,
>>>
>>>
>>>
>>> I am sure that there are others who will answer this better, but anyway.
>>>
>>> The default replication level for files in HDFS is 3 and so most files
>>> that you
>>>
>>> see will have a replication level of 3. However when you run a Map/Reduce
>>>
>>> job the system knows in advance that every node will need a copy of
>>>
>>> certain files. Specifically the job.xml and the various jars containing
>>>
>>> classes that will be needed to run the mappers and reducers. So the
>>>
>>> system arranges that some of these files have a higher replication
>>> level. This increases
>>>
>>> the chances that a copy will be found locally.
>>>
>>> By default this higher replication level is 10.
>>>
>>>
>>>
>>> This can seem a little odd on a cluster where you only have, say, 3
>>> nodes.
>>>
>>> Because it means that you will almost always have some blocks that are
>>> marked
>>>
>>> under-replicated. I think that there was some discussion a while back to
>>> change
>>>
>>> this to make the replication level something like min(10, #number of
>>> nodes)
>>>
>>> However, as I recall, the general consensus was that this was extra
>>>
>>> complexity that wasn’t really worth it. If it ain’t broke…
>>>
>>>
>>>
>>> Hope that this helps.
>>>
>>>
>>>
>>> *Peter Marron*
>>>
>>> Senior Developer, Research & Development
>>>
>>>
>>>
>>> Office: +44 *(0) 118-940-7609*  peter.marron@trilliumsoftware.com
>>>
>>> Theale Court First Floor, 11-13 High Street, Theale, RG7 5AH, UK
>>>
>>>    <https://www.facebook.com/pages/Trillium-Software/109184815778307>
>>>
>>>  <https://twitter.com/TrilliumSW>
>>>
>>>  <http://www.linkedin.com/company/17710>
>>>
>>>
>>>
>>> *www.trilliumsoftware.com <http://www.trilliumsoftware.com/>*
>>>
>>> Be Certain About Your Data. Be Trillium Certain.
>>>
>>>
>>>
>>> *From:* ch huang [mailto:justlooks@gmail.com <ju...@gmail.com>]
>>> *Sent:* 10 December 2013 01:21
>>> *To:* user@hadoop.apache.org
>>> *Subject:* Re: how to handle the corrupt block in HDFS?
>>>
>>>
>>>
>>> more strange , in my HDFS cluster ,every block has three replicas,but i
>>> find some one has ten replicas ,why?
>>>
>>>
>>>
>>> # sudo -u hdfs hadoop fs -ls
>>> /data/hisstage/helen/.staging/job_1385542328307_0915
>>> Found 5 items
>>> -rw-r--r--   3 helen hadoop          7 2013-11-29 14:01
>>> /data/hisstage/helen/.staging/job_1385542328307_0915/appTokens
>>> -rw-r--r--  10 helen hadoop    2977839 2013-11-29 14:01
>>> /data/hisstage/helen/.staging/job_1385542328307_0915/job.jar
>>> -rw-r--r--  10 helen hadoop       3696 2013-11-29 14:01
>>> /data/hisstage/helen/.staging/job_1385542328307_0915/job.split
>>>
>>> On Tue, Dec 10, 2013 at 9:15 AM, ch huang <ju...@gmail.com> wrote:
>>>
>>> the strange thing is when i use the following command i find 1 corrupt
>>> block
>>>
>>>
>>>
>>> #  curl -s http://ch11:50070/jmx |grep orrupt
>>>     "CorruptBlocks" : 1,
>>>
>>> but when i run hdfs fsck / , i get none ,everything seems fine
>>>
>>>
>>>
>>> # sudo -u hdfs hdfs fsck /
>>>
>>> ........
>>>
>>>
>>>
>>> ....................................Status: HEALTHY
>>>  Total size:    1479728140875 B (Total open files size: 1677721600 B)
>>>  Total dirs:    21298
>>>  Total files:   100636 (Files currently being written: 25)
>>>  Total blocks (validated):      119788 (avg. block size 12352891 B)
>>> (Total open file blocks (not validated): 37)
>>>  Minimally replicated blocks:   119788 (100.0 %)
>>>  Over-replicated blocks:        0 (0.0 %)
>>>  Under-replicated blocks:       166 (0.13857816 %)
>>>  Mis-replicated blocks:         0 (0.0 %)
>>>  Default replication factor:    3
>>>  Average block replication:     3.0027633
>>>  Corrupt blocks:                0
>>>  Missing replicas:              831 (0.23049656 %)
>>>  Number of data-nodes:          5
>>>  Number of racks:               1
>>> FSCK ended at Tue Dec 10 09:14:48 CST 2013 in 3276 milliseconds
>>>
>>>
>>> The filesystem under path '/' is HEALTHY
>>>
>>> On Tue, Dec 10, 2013 at 8:32 AM, ch huang <ju...@gmail.com> wrote:
>>>
>>> hi,maillist:
>>>
>>>             my nagios alert me that there is a corrupt block in HDFS all
>>> day,but i do not know how to remove it,and if the HDFS will handle this
>>> automaticlly? and if remove the corrupt block will cause any data
>>> lost?thanks
>>>
>>>
>>>
>>>
>>>
>>
>>
>

Re: how to handle the corrupt block in HDFS?

Posted by Adam Kawa <ka...@gmail.com>.
When you identify a file with corrupt block(s), then you can locate the
machines that stores its block by typing
$ sudo -u hdfs hdfs fsck <path-to-file> -files -blocks -locations


2013/12/11 Adam Kawa <ka...@gmail.com>

> Maybe this can work for you
> $ sudo -u hdfs hdfs fsck / -list-corruptfileblocks
> ?
>
>
> 2013/12/11 ch huang <ju...@gmail.com>
>
>> thanks for reply, what i do not know is how can i locate the block which
>> has the corrupt replica,(so i can observe how long the corrupt replica will
>> be removed and a new health replica replace it,because i get nagios alert
>> for three days,i do not sure if it is the same corrupt replica cause the
>> alert ,and i do not know the interval of hdfs check corrupt replica and
>> clean it)
>>
>>
>> On Tue, Dec 10, 2013 at 6:20 PM, Vinayakumar B <vi...@huawei.com>wrote:
>>
>>>  Hi ch huang,
>>>
>>>
>>>
>>> It may seem strange, but the fact is,
>>>
>>> *CorruptBlocks* through JMX means *“Number of blocks with corrupt
>>> replicas”. May not be all replicas are corrupt.  *This you can check
>>> though jconsole for description.
>>>
>>>
>>>
>>> Where as *Corrupt blocks* through fsck means, *blocks with all replicas
>>> corrupt(non-recoverable)/ missing.*
>>>
>>>
>>>
>>> In your case, may be one of the replica is corrupt, not all replicas of
>>> same block. This corrupt replica will be deleted automatically if one more
>>> datanode available in your cluster and block replicated to that.
>>>
>>>
>>>
>>>
>>>
>>> Related to replication 10, As Peter Marron said, *some of the important
>>> files of the mapreduce job will set the replication of 10, to make it
>>> accessible faster and launch map tasks faster. *
>>>
>>> Anyway, if the job is success these files will be deleted auomatically.
>>> I think only in some cases if the jobs are killed in between these files
>>> will remain in hdfs showing underreplicated blocks.
>>>
>>>
>>>
>>> Thanks and Regards,
>>>
>>> Vinayakumar B
>>>
>>>
>>>
>>> *From:* Peter Marron [mailto:Peter.Marron@trilliumsoftware.com]
>>> *Sent:* 10 December 2013 14:19
>>> *To:* user@hadoop.apache.org
>>> *Subject:* RE: how to handle the corrupt block in HDFS?
>>>
>>>
>>>
>>> Hi,
>>>
>>>
>>>
>>> I am sure that there are others who will answer this better, but anyway.
>>>
>>> The default replication level for files in HDFS is 3 and so most files
>>> that you
>>>
>>> see will have a replication level of 3. However when you run a Map/Reduce
>>>
>>> job the system knows in advance that every node will need a copy of
>>>
>>> certain files. Specifically the job.xml and the various jars containing
>>>
>>> classes that will be needed to run the mappers and reducers. So the
>>>
>>> system arranges that some of these files have a higher replication
>>> level. This increases
>>>
>>> the chances that a copy will be found locally.
>>>
>>> By default this higher replication level is 10.
>>>
>>>
>>>
>>> This can seem a little odd on a cluster where you only have, say, 3
>>> nodes.
>>>
>>> Because it means that you will almost always have some blocks that are
>>> marked
>>>
>>> under-replicated. I think that there was some discussion a while back to
>>> change
>>>
>>> this to make the replication level something like min(10, #number of
>>> nodes)
>>>
>>> However, as I recall, the general consensus was that this was extra
>>>
>>> complexity that wasn’t really worth it. If it ain’t broke…
>>>
>>>
>>>
>>> Hope that this helps.
>>>
>>>
>>>
>>> *Peter Marron*
>>>
>>> Senior Developer, Research & Development
>>>
>>>
>>>
>>> Office: +44 *(0) 118-940-7609*  peter.marron@trilliumsoftware.com
>>>
>>> Theale Court First Floor, 11-13 High Street, Theale, RG7 5AH, UK
>>>
>>>    <https://www.facebook.com/pages/Trillium-Software/109184815778307>
>>>
>>>  <https://twitter.com/TrilliumSW>
>>>
>>>  <http://www.linkedin.com/company/17710>
>>>
>>>
>>>
>>> *www.trilliumsoftware.com <http://www.trilliumsoftware.com/>*
>>>
>>> Be Certain About Your Data. Be Trillium Certain.
>>>
>>>
>>>
>>> *From:* ch huang [mailto:justlooks@gmail.com <ju...@gmail.com>]
>>> *Sent:* 10 December 2013 01:21
>>> *To:* user@hadoop.apache.org
>>> *Subject:* Re: how to handle the corrupt block in HDFS?
>>>
>>>
>>>
>>> more strange , in my HDFS cluster ,every block has three replicas,but i
>>> find some one has ten replicas ,why?
>>>
>>>
>>>
>>> # sudo -u hdfs hadoop fs -ls
>>> /data/hisstage/helen/.staging/job_1385542328307_0915
>>> Found 5 items
>>> -rw-r--r--   3 helen hadoop          7 2013-11-29 14:01
>>> /data/hisstage/helen/.staging/job_1385542328307_0915/appTokens
>>> -rw-r--r--  10 helen hadoop    2977839 2013-11-29 14:01
>>> /data/hisstage/helen/.staging/job_1385542328307_0915/job.jar
>>> -rw-r--r--  10 helen hadoop       3696 2013-11-29 14:01
>>> /data/hisstage/helen/.staging/job_1385542328307_0915/job.split
>>>
>>> On Tue, Dec 10, 2013 at 9:15 AM, ch huang <ju...@gmail.com> wrote:
>>>
>>> the strange thing is when i use the following command i find 1 corrupt
>>> block
>>>
>>>
>>>
>>> #  curl -s http://ch11:50070/jmx |grep orrupt
>>>     "CorruptBlocks" : 1,
>>>
>>> but when i run hdfs fsck / , i get none ,everything seems fine
>>>
>>>
>>>
>>> # sudo -u hdfs hdfs fsck /
>>>
>>> ........
>>>
>>>
>>>
>>> ....................................Status: HEALTHY
>>>  Total size:    1479728140875 B (Total open files size: 1677721600 B)
>>>  Total dirs:    21298
>>>  Total files:   100636 (Files currently being written: 25)
>>>  Total blocks (validated):      119788 (avg. block size 12352891 B)
>>> (Total open file blocks (not validated): 37)
>>>  Minimally replicated blocks:   119788 (100.0 %)
>>>  Over-replicated blocks:        0 (0.0 %)
>>>  Under-replicated blocks:       166 (0.13857816 %)
>>>  Mis-replicated blocks:         0 (0.0 %)
>>>  Default replication factor:    3
>>>  Average block replication:     3.0027633
>>>  Corrupt blocks:                0
>>>  Missing replicas:              831 (0.23049656 %)
>>>  Number of data-nodes:          5
>>>  Number of racks:               1
>>> FSCK ended at Tue Dec 10 09:14:48 CST 2013 in 3276 milliseconds
>>>
>>>
>>> The filesystem under path '/' is HEALTHY
>>>
>>> On Tue, Dec 10, 2013 at 8:32 AM, ch huang <ju...@gmail.com> wrote:
>>>
>>> hi,maillist:
>>>
>>>             my nagios alert me that there is a corrupt block in HDFS all
>>> day,but i do not know how to remove it,and if the HDFS will handle this
>>> automaticlly? and if remove the corrupt block will cause any data
>>> lost?thanks
>>>
>>>
>>>
>>>
>>>
>>
>>
>
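
A minimal sketch that runs the two checks discussed above side by side, assuming
the NameNode web UI is on its default port 50070 and reusing the ch11 hostname
from earlier in the thread (substitute your own NameNode host):

$ # replica-level counter: can be non-zero even when only one replica of a block is bad
$ curl -s http://ch11:50070/jmx | grep '"CorruptBlocks"'
$ # block-level view: only reports blocks for which every replica is corrupt or missing
$ sudo -u hdfs hdfs fsck / -list-corruptfileblocks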

Re: how to handle the corrupt block in HDFS?

Posted by Adam Kawa <ka...@gmail.com>.
Maybe this can work for you
$ sudo -u hdfs hdfs fsck / -list-corruptfileblocks
?


2013/12/11 ch huang <ju...@gmail.com>

> thanks for reply, what i do not know is how can i locate the block which
> has the corrupt replica,(so i can observe how long the corrupt replica will
> be removed and a new health replica replace it,because i get nagios alert
> for three days,i do not sure if it is the same corrupt replica cause the
> alert ,and i do not know the interval of hdfs check corrupt replica and
> clean it)
>
>
> On Tue, Dec 10, 2013 at 6:20 PM, Vinayakumar B <vi...@huawei.com>wrote:
>
>>  Hi ch huang,
>>
>>
>>
>> It may seem strange, but the fact is,
>>
>> *CorruptBlocks* through JMX means *“Number of blocks with corrupt
>> replicas”. May not be all replicas are corrupt.  *This you can check
>> though jconsole for description.
>>
>>
>>
>> Where as *Corrupt blocks* through fsck means, *blocks with all replicas
>> corrupt(non-recoverable)/ missing.*
>>
>>
>>
>> In your case, may be one of the replica is corrupt, not all replicas of
>> same block. This corrupt replica will be deleted automatically if one more
>> datanode available in your cluster and block replicated to that.
>>
>>
>>
>>
>>
>> Related to replication 10, As Peter Marron said, *some of the important
>> files of the mapreduce job will set the replication of 10, to make it
>> accessible faster and launch map tasks faster. *
>>
>> Anyway, if the job is success these files will be deleted auomatically. I
>> think only in some cases if the jobs are killed in between these files will
>> remain in hdfs showing underreplicated blocks.
>>
>>
>>
>> Thanks and Regards,
>>
>> Vinayakumar B
>>
>>
>>
>> *From:* Peter Marron [mailto:Peter.Marron@trilliumsoftware.com]
>> *Sent:* 10 December 2013 14:19
>> *To:* user@hadoop.apache.org
>> *Subject:* RE: how to handle the corrupt block in HDFS?
>>
>>
>>
>> Hi,
>>
>>
>>
>> I am sure that there are others who will answer this better, but anyway.
>>
>> The default replication level for files in HDFS is 3 and so most files
>> that you
>>
>> see will have a replication level of 3. However when you run a Map/Reduce
>>
>> job the system knows in advance that every node will need a copy of
>>
>> certain files. Specifically the job.xml and the various jars containing
>>
>> classes that will be needed to run the mappers and reducers. So the
>>
>> system arranges that some of these files have a higher replication level.
>> This increases
>>
>> the chances that a copy will be found locally.
>>
>> By default this higher replication level is 10.
>>
>>
>>
>> This can seem a little odd on a cluster where you only have, say, 3 nodes.
>>
>> Because it means that you will almost always have some blocks that are
>> marked
>>
>> under-replicated. I think that there was some discussion a while back to
>> change
>>
>> this to make the replication level something like min(10, #number of
>> nodes)
>>
>> However, as I recall, the general consensus was that this was extra
>>
>> complexity that wasn’t really worth it. If it ain’t broke…
>>
>>
>>
>> Hope that this helps.
>>
>>
>>
>> *Peter Marron*
>>
>> Senior Developer, Research & Development
>>
>>
>>
>> Office: +44 *(0) 118-940-7609*  peter.marron@trilliumsoftware.com
>>
>> Theale Court First Floor, 11-13 High Street, Theale, RG7 5AH, UK
>>
>>    <https://www.facebook.com/pages/Trillium-Software/109184815778307>
>>
>>  <https://twitter.com/TrilliumSW>
>>
>>  <http://www.linkedin.com/company/17710>
>>
>>
>>
>> *www.trilliumsoftware.com <http://www.trilliumsoftware.com/>*
>>
>> Be Certain About Your Data. Be Trillium Certain.
>>
>>
>>
>> *From:* ch huang [mailto:justlooks@gmail.com <ju...@gmail.com>]
>> *Sent:* 10 December 2013 01:21
>> *To:* user@hadoop.apache.org
>> *Subject:* Re: how to handle the corrupt block in HDFS?
>>
>>
>>
>> more strange , in my HDFS cluster ,every block has three replicas,but i
>> find some one has ten replicas ,why?
>>
>>
>>
>> # sudo -u hdfs hadoop fs -ls
>> /data/hisstage/helen/.staging/job_1385542328307_0915
>> Found 5 items
>> -rw-r--r--   3 helen hadoop          7 2013-11-29 14:01
>> /data/hisstage/helen/.staging/job_1385542328307_0915/appTokens
>> -rw-r--r--  10 helen hadoop    2977839 2013-11-29 14:01
>> /data/hisstage/helen/.staging/job_1385542328307_0915/job.jar
>> -rw-r--r--  10 helen hadoop       3696 2013-11-29 14:01
>> /data/hisstage/helen/.staging/job_1385542328307_0915/job.split
>>
>> On Tue, Dec 10, 2013 at 9:15 AM, ch huang <ju...@gmail.com> wrote:
>>
>> the strange thing is when i use the following command i find 1 corrupt
>> block
>>
>>
>>
>> #  curl -s http://ch11:50070/jmx |grep orrupt
>>     "CorruptBlocks" : 1,
>>
>> but when i run hdfs fsck / , i get none ,everything seems fine
>>
>>
>>
>> # sudo -u hdfs hdfs fsck /
>>
>> ........
>>
>>
>>
>> ....................................Status: HEALTHY
>>  Total size:    1479728140875 B (Total open files size: 1677721600 B)
>>  Total dirs:    21298
>>  Total files:   100636 (Files currently being written: 25)
>>  Total blocks (validated):      119788 (avg. block size 12352891 B)
>> (Total open file blocks (not validated): 37)
>>  Minimally replicated blocks:   119788 (100.0 %)
>>  Over-replicated blocks:        0 (0.0 %)
>>  Under-replicated blocks:       166 (0.13857816 %)
>>  Mis-replicated blocks:         0 (0.0 %)
>>  Default replication factor:    3
>>  Average block replication:     3.0027633
>>  Corrupt blocks:                0
>>  Missing replicas:              831 (0.23049656 %)
>>  Number of data-nodes:          5
>>  Number of racks:               1
>> FSCK ended at Tue Dec 10 09:14:48 CST 2013 in 3276 milliseconds
>>
>>
>> The filesystem under path '/' is HEALTHY
>>
>> On Tue, Dec 10, 2013 at 8:32 AM, ch huang <ju...@gmail.com> wrote:
>>
>> hi,maillist:
>>
>>             my nagios alert me that there is a corrupt block in HDFS all
>> day,but i do not know how to remove it,and if the HDFS will handle this
>> automaticlly? and if remove the corrupt block will cause any data
>> lost?thanks
>>
>>
>>
>>
>>
>
>
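
A minimal sketch of the two-step check being suggested here; /path/to/suspect/file
is a placeholder for whatever path the first command actually reports:

$ # 1. list files that currently have non-recoverable blocks (all replicas corrupt or missing)
$ sudo -u hdfs hdfs fsck / -list-corruptfileblocks
$ # 2. for a file reported above, print each block and the datanodes holding its replicas
$ sudo -u hdfs hdfs fsck /path/to/suspect/file -files -blocks -locations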

Re: how to handle the corrupt block in HDFS?

Posted by ch huang <ju...@gmail.com>.
thanks for the reply. what i still do not know is how to locate the block that
has the corrupt replica, so i can observe how long it takes for the corrupt
replica to be removed and replaced by a healthy one. i have been getting the
nagios alert for three days, and i am not sure whether it is the same corrupt
replica causing the alert; i also do not know the interval at which hdfs checks
for corrupt replicas and cleans them up.

On Tue, Dec 10, 2013 at 6:20 PM, Vinayakumar B <vi...@huawei.com>wrote:

>  Hi ch huang,
>
>
>
> It may seem strange, but the fact is,
>
> *CorruptBlocks* through JMX means *“Number of blocks with corrupt
> replicas”. May not be all replicas are corrupt.  *This you can check
> though jconsole for description.
>
>
>
> Where as *Corrupt blocks* through fsck means, *blocks with all replicas
> corrupt(non-recoverable)/ missing.*
>
>
>
> In your case, may be one of the replica is corrupt, not all replicas of
> same block. This corrupt replica will be deleted automatically if one more
> datanode available in your cluster and block replicated to that.
>
>
>
>
>
> Related to replication 10, As Peter Marron said, *some of the important
> files of the mapreduce job will set the replication of 10, to make it
> accessible faster and launch map tasks faster. *
>
> Anyway, if the job is success these files will be deleted auomatically. I
> think only in some cases if the jobs are killed in between these files will
> remain in hdfs showing underreplicated blocks.
>
>
>
> Thanks and Regards,
>
> Vinayakumar B
>
>
>
> *From:* Peter Marron [mailto:Peter.Marron@trilliumsoftware.com]
> *Sent:* 10 December 2013 14:19
> *To:* user@hadoop.apache.org
> *Subject:* RE: how to handle the corrupt block in HDFS?
>
>
>
> Hi,
>
>
>
> I am sure that there are others who will answer this better, but anyway.
>
> The default replication level for files in HDFS is 3 and so most files
> that you
>
> see will have a replication level of 3. However when you run a Map/Reduce
>
> job the system knows in advance that every node will need a copy of
>
> certain files. Specifically the job.xml and the various jars containing
>
> classes that will be needed to run the mappers and reducers. So the
>
> system arranges that some of these files have a higher replication level.
> This increases
>
> the chances that a copy will be found locally.
>
> By default this higher replication level is 10.
>
>
>
> This can seem a little odd on a cluster where you only have, say, 3 nodes.
>
> Because it means that you will almost always have some blocks that are
> marked
>
> under-replicated. I think that there was some discussion a while back to
> change
>
> this to make the replication level something like min(10, #number of nodes)
>
> However, as I recall, the general consensus was that this was extra
>
> complexity that wasn’t really worth it. If it ain’t broke…
>
>
>
> Hope that this helps.
>
>
>
> *Peter Marron*
>
> Senior Developer, Research & Development
>
>
>
> Office: +44 *(0) 118-940-7609*  peter.marron@trilliumsoftware.com
>
> Theale Court First Floor, 11-13 High Street, Theale, RG7 5AH, UK
>
>   <https://www.facebook.com/pages/Trillium-Software/109184815778307>
>
>  <https://twitter.com/TrilliumSW>
>
>  <http://www.linkedin.com/company/17710>
>
>
>
> *www.trilliumsoftware.com <http://www.trilliumsoftware.com/>*
>
> Be Certain About Your Data. Be Trillium Certain.
>
>
>
> *From:* ch huang [mailto:justlooks@gmail.com <ju...@gmail.com>]
> *Sent:* 10 December 2013 01:21
> *To:* user@hadoop.apache.org
> *Subject:* Re: how to handle the corrupt block in HDFS?
>
>
>
> more strange , in my HDFS cluster ,every block has three replicas,but i
> find some one has ten replicas ,why?
>
>
>
> # sudo -u hdfs hadoop fs -ls
> /data/hisstage/helen/.staging/job_1385542328307_0915
> Found 5 items
> -rw-r--r--   3 helen hadoop          7 2013-11-29 14:01
> /data/hisstage/helen/.staging/job_1385542328307_0915/appTokens
> -rw-r--r--  10 helen hadoop    2977839 2013-11-29 14:01
> /data/hisstage/helen/.staging/job_1385542328307_0915/job.jar
> -rw-r--r--  10 helen hadoop       3696 2013-11-29 14:01
> /data/hisstage/helen/.staging/job_1385542328307_0915/job.split
>
> On Tue, Dec 10, 2013 at 9:15 AM, ch huang <ju...@gmail.com> wrote:
>
> the strange thing is when i use the following command i find 1 corrupt
> block
>
>
>
> #  curl -s http://ch11:50070/jmx |grep orrupt
>     "CorruptBlocks" : 1,
>
> but when i run hdfs fsck / , i get none ,everything seems fine
>
>
>
> # sudo -u hdfs hdfs fsck /
>
> ........
>
>
>
> ....................................Status: HEALTHY
>  Total size:    1479728140875 B (Total open files size: 1677721600 B)
>  Total dirs:    21298
>  Total files:   100636 (Files currently being written: 25)
>  Total blocks (validated):      119788 (avg. block size 12352891 B) (Total
> open file blocks (not validated): 37)
>  Minimally replicated blocks:   119788 (100.0 %)
>  Over-replicated blocks:        0 (0.0 %)
>  Under-replicated blocks:       166 (0.13857816 %)
>  Mis-replicated blocks:         0 (0.0 %)
>  Default replication factor:    3
>  Average block replication:     3.0027633
>  Corrupt blocks:                0
>  Missing replicas:              831 (0.23049656 %)
>  Number of data-nodes:          5
>  Number of racks:               1
> FSCK ended at Tue Dec 10 09:14:48 CST 2013 in 3276 milliseconds
>
>
> The filesystem under path '/' is HEALTHY
>
> On Tue, Dec 10, 2013 at 8:32 AM, ch huang <ju...@gmail.com> wrote:
>
> hi,maillist:
>
>             my nagios alert me that there is a corrupt block in HDFS all
> day,but i do not know how to remove it,and if the HDFS will handle this
> automaticlly? and if remove the corrupt block will cause any data
> lost?thanks
>
>
>
>
>
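
Given the question above about how long the cleanup takes, one simple way to watch
it is to poll the NameNode's counter until it drops back to 0 (again assuming the
ch11 NameNode host and default port 50070 used earlier in the thread):

$ while true; do date; curl -s http://ch11:50070/jmx | grep '"CorruptBlocks"'; sleep 60; done

Once a healthy replica has been re-replicated to another datanode and the corrupt
one deleted, the counter should return to 0.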

Re: how to handle the corrupt block in HDFS?

Posted by shashwat shriparv <dw...@gmail.com>.
How many nodes do you have?
If fsck is giving you a healthy status, there is no need to worry.
With the replication of 10, what I would conclude is that you have 10 listed
datanodes, so there are 10 replicated copies of the jar files for the job to run.
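
To double-check both points (just a sketch, assuming a 2.x cluster; the
staging path below is the one quoted later in this thread), you can count the
datanodes in the dfsadmin report and look at the replication that was actually
applied to one of the job files:

# sudo -u hdfs hdfs dfsadmin -report | grep -c '^Name:'
# sudo -u hdfs hdfs fsck /data/hisstage/helen/.staging/job_1385542328307_0915/job.jar -files -blocks -locations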

*Thanks & Regards    *

∞
Shashwat Shriparv



On Tue, Dec 10, 2013 at 3:50 PM, Vinayakumar B <vi...@huawei.com> wrote:

>  Hi ch huang,
>
>
>
> It may seem strange, but the fact is,
>
> *CorruptBlocks* through JMX means *“Number of blocks with corrupt
> replicas”. May not be all replicas are corrupt.  *This you can check
> though jconsole for description.
>
>
>
> Where as *Corrupt blocks* through fsck means, *blocks with all replicas
> corrupt(non-recoverable)/ missing.*
>
>
>
> In your case, may be one of the replica is corrupt, not all replicas of
> same block. This corrupt replica will be deleted automatically if one more
> datanode available in your cluster and block replicated to that.
>
>
>
>
>
> Related to replication 10, As Peter Marron said, *some of the important
> files of the mapreduce job will set the replication of 10, to make it
> accessible faster and launch map tasks faster. *
>
> Anyway, if the job is success these files will be deleted auomatically. I
> think only in some cases if the jobs are killed in between these files will
> remain in hdfs showing underreplicated blocks.
>
>
>
> Thanks and Regards,
>
> Vinayakumar B
>
>
>
> *From:* Peter Marron [mailto:Peter.Marron@trilliumsoftware.com]
> *Sent:* 10 December 2013 14:19
> *To:* user@hadoop.apache.org
> *Subject:* RE: how to handle the corrupt block in HDFS?
>
>
>
> Hi,
>
>
>
> I am sure that there are others who will answer this better, but anyway.
>
> The default replication level for files in HDFS is 3 and so most files
> that you
>
> see will have a replication level of 3. However when you run a Map/Reduce
>
> job the system knows in advance that every node will need a copy of
>
> certain files. Specifically the job.xml and the various jars containing
>
> classes that will be needed to run the mappers and reducers. So the
>
> system arranges that some of these files have a higher replication level.
> This increases
>
> the chances that a copy will be found locally.
>
> By default this higher replication level is 10.
>
>
>
> This can seem a little odd on a cluster where you only have, say, 3 nodes.
>
> Because it means that you will almost always have some blocks that are
> marked
>
> under-replicated. I think that there was some discussion a while back to
> change
>
> this to make the replication level something like min(10, #number of nodes)
>
> However, as I recall, the general consensus was that this was extra
>
> complexity that wasn’t really worth it. If it ain’t broke…
>
>
>
> Hope that this helps.
>
>
>
> *Peter Marron*
>
> Senior Developer, Research & Development
>
>
>
> Office: +44 *(0) 118-940-7609*  peter.marron@trilliumsoftware.com
>
> Theale Court First Floor, 11-13 High Street, Theale, RG7 5AH, UK
>
>    <https://www.facebook.com/pages/Trillium-Software/109184815778307>
>
>  <https://twitter.com/TrilliumSW>
>
>  <http://www.linkedin.com/company/17710>
>
>
>
> *www.trilliumsoftware.com <http://www.trilliumsoftware.com/>*
>
> Be Certain About Your Data. Be Trillium Certain.
>
>
>
> *From:* ch huang [mailto:justlooks@gmail.com <ju...@gmail.com>]
> *Sent:* 10 December 2013 01:21
> *To:* user@hadoop.apache.org
> *Subject:* Re: how to handle the corrupt block in HDFS?
>
>
>
> more strange , in my HDFS cluster ,every block has three replicas,but i
> find some one has ten replicas ,why?
>
>
>
> # sudo -u hdfs hadoop fs -ls
> /data/hisstage/helen/.staging/job_1385542328307_0915
> Found 5 items
> -rw-r--r--   3 helen hadoop          7 2013-11-29 14:01
> /data/hisstage/helen/.staging/job_1385542328307_0915/appTokens
> -rw-r--r--  10 helen hadoop    2977839 2013-11-29 14:01
> /data/hisstage/helen/.staging/job_1385542328307_0915/job.jar
> -rw-r--r--  10 helen hadoop       3696 2013-11-29 14:01
> /data/hisstage/helen/.staging/job_1385542328307_0915/job.split
>
> On Tue, Dec 10, 2013 at 9:15 AM, ch huang <ju...@gmail.com> wrote:
>
> the strange thing is when i use the following command i find 1 corrupt
> block
>
>
>
> #  curl -s http://ch11:50070/jmx |grep orrupt
>     "CorruptBlocks" : 1,
>
> but when i run hdfs fsck / , i get none ,everything seems fine
>
>
>
> # sudo -u hdfs hdfs fsck /
>
> ........
>
>
>
> ....................................Status: HEALTHY
>  Total size:    1479728140875 B (Total open files size: 1677721600 B)
>  Total dirs:    21298
>  Total files:   100636 (Files currently being written: 25)
>  Total blocks (validated):      119788 (avg. block size 12352891 B) (Total
> open file blocks (not validated): 37)
>  Minimally replicated blocks:   119788 (100.0 %)
>  Over-replicated blocks:        0 (0.0 %)
>  Under-replicated blocks:       166 (0.13857816 %)
>  Mis-replicated blocks:         0 (0.0 %)
>  Default replication factor:    3
>  Average block replication:     3.0027633
>  Corrupt blocks:                0
>  Missing replicas:              831 (0.23049656 %)
>  Number of data-nodes:          5
>  Number of racks:               1
> FSCK ended at Tue Dec 10 09:14:48 CST 2013 in 3276 milliseconds
>
>
> The filesystem under path '/' is HEALTHY
>
> On Tue, Dec 10, 2013 at 8:32 AM, ch huang <ju...@gmail.com> wrote:
>
> hi,maillist:
>
>             my nagios alert me that there is a corrupt block in HDFS all
> day,but i do not know how to remove it,and if the HDFS will handle this
> automaticlly? and if remove the corrupt block will cause any data
> lost?thanks
>
>
>
>
>

Re: how to handle the corrupt block in HDFS?

Posted by ch huang <ju...@gmail.com>.
Thanks for the reply. What I still do not know is how to locate the block
that has the corrupt replica, so that I can watch how long it takes for the
corrupt replica to be removed and a healthy replica to take its place. I have
been getting the Nagios alert for three days, and I am not sure whether it is
the same corrupt replica causing the alert, or how often HDFS checks for
corrupt replicas and cleans them up.
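
Is something like the following the right way to track it down? (just a
sketch on my side, assuming a 2.x cluster; NNIP and the metasave file name
are placeholders)

# sudo -u hdfs hdfs fsck / -list-corruptfileblocks
# sudo -u hdfs hdfs dfsadmin -metasave corrupt-check.txt
# curl -s 'http://NNIP:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem' | grep -i corrupt

As far as I understand, the first command only lists blocks whose replicas are
all corrupt or missing, the metasave dump (written under the namenode's log
directory) should also show blocks that still carry a corrupt replica, and the
JMX query lets me watch the CorruptBlocks counter until it drops back to 0.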

On Tue, Dec 10, 2013 at 6:20 PM, Vinayakumar B <vi...@huawei.com> wrote:

>  Hi ch huang,
>
>
>
> It may seem strange, but the fact is,
>
> *CorruptBlocks* through JMX means *“Number of blocks with corrupt
> replicas”. May not be all replicas are corrupt.  *This you can check
> though jconsole for description.
>
>
>
> Where as *Corrupt blocks* through fsck means, *blocks with all replicas
> corrupt(non-recoverable)/ missing.*
>
>
>
> In your case, may be one of the replica is corrupt, not all replicas of
> same block. This corrupt replica will be deleted automatically if one more
> datanode available in your cluster and block replicated to that.
>
>
>
>
>
> Related to replication 10, As Peter Marron said, *some of the important
> files of the mapreduce job will set the replication of 10, to make it
> accessible faster and launch map tasks faster. *
>
> Anyway, if the job is success these files will be deleted auomatically. I
> think only in some cases if the jobs are killed in between these files will
> remain in hdfs showing underreplicated blocks.
>
>
>
> Thanks and Regards,
>
> Vinayakumar B
>
>
>
> *From:* Peter Marron [mailto:Peter.Marron@trilliumsoftware.com]
> *Sent:* 10 December 2013 14:19
> *To:* user@hadoop.apache.org
> *Subject:* RE: how to handle the corrupt block in HDFS?
>
>
>
> Hi,
>
>
>
> I am sure that there are others who will answer this better, but anyway.
>
> The default replication level for files in HDFS is 3 and so most files
> that you
>
> see will have a replication level of 3. However when you run a Map/Reduce
>
> job the system knows in advance that every node will need a copy of
>
> certain files. Specifically the job.xml and the various jars containing
>
> classes that will be needed to run the mappers and reducers. So the
>
> system arranges that some of these files have a higher replication level.
> This increases
>
> the chances that a copy will be found locally.
>
> By default this higher replication level is 10.
>
>
>
> This can seem a little odd on a cluster where you only have, say, 3 nodes.
>
> Because it means that you will almost always have some blocks that are
> marked
>
> under-replicated. I think that there was some discussion a while back to
> change
>
> this to make the replication level something like min(10, #number of nodes)
>
> However, as I recall, the general consensus was that this was extra
>
> complexity that wasn’t really worth it. If it ain’t broke…
>
>
>
> Hope that this helps.
>
>
>
> *Peter Marron*
>
> Senior Developer, Research & Development
>
>
>
> Office: +44 *(0) 118-940-7609*  peter.marron@trilliumsoftware.com
>
> Theale Court First Floor, 11-13 High Street, Theale, RG7 5AH, UK
>
>   <https://www.facebook.com/pages/Trillium-Software/109184815778307>
>
>  <https://twitter.com/TrilliumSW>
>
>  <http://www.linkedin.com/company/17710>
>
>
>
> *www.trilliumsoftware.com <http://www.trilliumsoftware.com/>*
>
> Be Certain About Your Data. Be Trillium Certain.
>
>
>
> *From:* ch huang [mailto:justlooks@gmail.com <ju...@gmail.com>]
> *Sent:* 10 December 2013 01:21
> *To:* user@hadoop.apache.org
> *Subject:* Re: how to handle the corrupt block in HDFS?
>
>
>
> more strange , in my HDFS cluster ,every block has three replicas,but i
> find some one has ten replicas ,why?
>
>
>
> # sudo -u hdfs hadoop fs -ls
> /data/hisstage/helen/.staging/job_1385542328307_0915
> Found 5 items
> -rw-r--r--   3 helen hadoop          7 2013-11-29 14:01
> /data/hisstage/helen/.staging/job_1385542328307_0915/appTokens
> -rw-r--r--  10 helen hadoop    2977839 2013-11-29 14:01
> /data/hisstage/helen/.staging/job_1385542328307_0915/job.jar
> -rw-r--r--  10 helen hadoop       3696 2013-11-29 14:01
> /data/hisstage/helen/.staging/job_1385542328307_0915/job.split
>
> On Tue, Dec 10, 2013 at 9:15 AM, ch huang <ju...@gmail.com> wrote:
>
> the strange thing is when i use the following command i find 1 corrupt
> block
>
>
>
> #  curl -s http://ch11:50070/jmx |grep orrupt
>     "CorruptBlocks" : 1,
>
> but when i run hdfs fsck / , i get none ,everything seems fine
>
>
>
> # sudo -u hdfs hdfs fsck /
>
> ........
>
>
>
> ....................................Status: HEALTHY
>  Total size:    1479728140875 B (Total open files size: 1677721600 B)
>  Total dirs:    21298
>  Total files:   100636 (Files currently being written: 25)
>  Total blocks (validated):      119788 (avg. block size 12352891 B) (Total
> open file blocks (not validated): 37)
>  Minimally replicated blocks:   119788 (100.0 %)
>  Over-replicated blocks:        0 (0.0 %)
>  Under-replicated blocks:       166 (0.13857816 %)
>  Mis-replicated blocks:         0 (0.0 %)
>  Default replication factor:    3
>  Average block replication:     3.0027633
>  Corrupt blocks:                0
>  Missing replicas:              831 (0.23049656 %)
>  Number of data-nodes:          5
>  Number of racks:               1
> FSCK ended at Tue Dec 10 09:14:48 CST 2013 in 3276 milliseconds
>
>
> The filesystem under path '/' is HEALTHY
>
> On Tue, Dec 10, 2013 at 8:32 AM, ch huang <ju...@gmail.com> wrote:
>
> hi,maillist:
>
>             my nagios alert me that there is a corrupt block in HDFS all
> day,but i do not know how to remove it,and if the HDFS will handle this
> automaticlly? and if remove the corrupt block will cause any data
> lost?thanks
>
>
>
>
>

RE: how to handle the corrupt block in HDFS?

Posted by Vinayakumar B <vi...@huawei.com>.
Hi ch huang,

It may seem strange, but the fact is,
CorruptBlocks through JMX means "Number of blocks with corrupt replicas". Not all of the replicas need be corrupt. You can check the description of this metric through jconsole.

Whereas Corrupt blocks through fsck means blocks with all replicas corrupt (non-recoverable) or missing.

In your case, maybe one of the replicas is corrupt, but not all replicas of the same block. That corrupt replica will be deleted automatically once one more datanode is available in your cluster and the block has been re-replicated to it.


Related to replication 10: as Peter Marron said, some of the important files of the MapReduce job are written with a replication of 10, to make them accessible faster and to launch map tasks faster.
Anyway, if the job succeeds these files will be deleted automatically. I think it is only in some cases, when jobs are killed in between, that these files remain in HDFS showing under-replicated blocks.
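
If some of those staging files are still lying around from killed jobs,
something like the following can be used to check and clean them up (only a
sketch; the path is the one from your listing, and please make sure the job
is really dead before removing anything):

# sudo -u hdfs hadoop fs -ls /data/hisstage/helen/.staging
# sudo -u hdfs hadoop fs -setrep 3 /data/hisstage/helen/.staging/job_1385542328307_0915
# sudo -u hdfs hadoop fs -rm -r /data/hisstage/helen/.staging/job_1385542328307_0915

The setrep call just lowers the leftover files to the default replication
factor if you prefer to keep them; the rm is the normal cleanup.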

Thanks and Regards,
Vinayakumar B

From: Peter Marron [mailto:Peter.Marron@trilliumsoftware.com]
Sent: 10 December 2013 14:19
To: user@hadoop.apache.org
Subject: RE: how to handle the corrupt block in HDFS?

Hi,

I am sure that there are others who will answer this better, but anyway.
The default replication level for files in HDFS is 3 and so most files that you
see will have a replication level of 3. However when you run a Map/Reduce
job the system knows in advance that every node will need a copy of
certain files. Specifically the job.xml and the various jars containing
classes that will be needed to run the mappers and reducers. So the
system arranges that some of these files have a higher replication level. This increases
the chances that a copy will be found locally.
By default this higher replication level is 10.

This can seem a little odd on a cluster where you only have, say, 3 nodes.
Because it means that you will almost always have some blocks that are marked
under-replicated. I think that there was some discussion a while back to change
this to make the replication level something like min(10, #number of nodes)
However, as I recall, the general consensus was that this was extra
complexity that wasn't really worth it. If it ain't broke...

Hope that this helps.

Peter Marron
Senior Developer, Research & Development

Office: +44 (0) 118-940-7609  peter.marron@trilliumsoftware.com<ma...@trilliumsoftware.com>
Theale Court First Floor, 11-13 High Street, Theale, RG7 5AH, UK

www.trilliumsoftware.com<http://www.trilliumsoftware.com/>

Be Certain About Your Data. Be Trillium Certain.

From: ch huang [mailto:justlooks@gmail.com]
Sent: 10 December 2013 01:21
To: user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: Re: how to handle the corrupt block in HDFS?

more strange , in my HDFS cluster ,every block has three replicas,but i find some one has ten replicas ,why?

# sudo -u hdfs hadoop fs -ls /data/hisstage/helen/.staging/job_1385542328307_0915
Found 5 items
-rw-r--r--   3 helen hadoop          7 2013-11-29 14:01 /data/hisstage/helen/.staging/job_1385542328307_0915/appTokens
-rw-r--r--  10 helen hadoop    2977839 2013-11-29 14:01 /data/hisstage/helen/.staging/job_1385542328307_0915/job.jar
-rw-r--r--  10 helen hadoop       3696 2013-11-29 14:01 /data/hisstage/helen/.staging/job_1385542328307_0915/job.split
On Tue, Dec 10, 2013 at 9:15 AM, ch huang <ju...@gmail.com>> wrote:
the strange thing is when i use the following command i find 1 corrupt block

#  curl -s http://ch11:50070/jmx |grep orrupt
    "CorruptBlocks" : 1,
but when i run hdfs fsck / , i get none ,everything seems fine

# sudo -u hdfs hdfs fsck /
........

....................................Status: HEALTHY
 Total size:    1479728140875 B (Total open files size: 1677721600 B)
 Total dirs:    21298
 Total files:   100636 (Files currently being written: 25)
 Total blocks (validated):      119788 (avg. block size 12352891 B) (Total open file blocks (not validated): 37)
 Minimally replicated blocks:   119788 (100.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       166 (0.13857816 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    3
 Average block replication:     3.0027633
 Corrupt blocks:                0
 Missing replicas:              831 (0.23049656 %)
 Number of data-nodes:          5
 Number of racks:               1
FSCK ended at Tue Dec 10 09:14:48 CST 2013 in 3276 milliseconds

The filesystem under path '/' is HEALTHY
On Tue, Dec 10, 2013 at 8:32 AM, ch huang <ju...@gmail.com>> wrote:
hi,maillist:
            my nagios alert me that there is a corrupt block in HDFS all day,but i do not know how to remove it,and if the HDFS will handle this automaticlly? and if remove the corrupt block will cause any data lost?thanks



Re: how to handle the corrupt block in HDFS?

Posted by ch huang <ju...@gmail.com>.
"By default this higher replication level is 10. "
Can this value be controlled via some option or variable? I only have a
5-worker-node cluster, and I think 5 replicas would be better, because then
every node can get a local replica.

Another question: why does hdfs fsck report the cluster as healthy with no
corrupt blocks, while I see one corrupt block when checking the NN metrics
with curl http://NNIP:50070/jmx <http://nnip:50070/jmx> ? Thanks
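
Is it a client side property that I should be setting, something like the
following? I am only guessing the property name for MRv2 (I believe the older
MRv1 name was mapred.submit.replication), and the jar/class names below are
just placeholders:

hadoop jar my-job.jar MyDriver -D mapreduce.client.submit.file.replication=5 <other args>

or would I put mapreduce.client.submit.file.replication=5 into mapred-site.xml
on the client? (I assume the -D form only works if the driver goes through
ToolRunner.)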


On Tue, Dec 10, 2013 at 4:48 PM, Peter Marron <
Peter.Marron@trilliumsoftware.com> wrote:

>  Hi,
>
>
>
> I am sure that there are others who will answer this better, but anyway.
>
> The default replication level for files in HDFS is 3 and so most files
> that you
>
> see will have a replication level of 3. However when you run a Map/Reduce
>
> job the system knows in advance that every node will need a copy of
>
> certain files. Specifically the job.xml and the various jars containing
>
> classes that will be needed to run the mappers and reducers. So the
>
> system arranges that some of these files have a higher replication level.
> This increases
>
> the chances that a copy will be found locally.
>
> By default this higher replication level is 10.
>
>
>
> This can seem a little odd on a cluster where you only have, say, 3 nodes.
>
> Because it means that you will almost always have some blocks that are
> marked
>
> under-replicated. I think that there was some discussion a while back to
> change
>
> this to make the replication level something like min(10, #number of nodes)
>
> However, as I recall, the general consensus was that this was extra
>
> complexity that wasn’t really worth it. If it ain’t broke…
>
>
>
> Hope that this helps.
>
>
>
> *Peter Marron*
>
> Senior Developer, Research & Development
>
>
>
> Office: +44 *(0) 118-940-7609*  peter.marron@trilliumsoftware.com
>
> Theale Court First Floor, 11-13 High Street, Theale, RG7 5AH, UK
>
>   <https://www.facebook.com/pages/Trillium-Software/109184815778307>
>
>  <https://twitter.com/TrilliumSW>
>
>  <http://www.linkedin.com/company/17710>
>
>
>
> *www.trilliumsoftware.com <http://www.trilliumsoftware.com/>*
>
> Be Certain About Your Data. Be Trillium Certain.
>
>
>
> *From:* ch huang [mailto:justlooks@gmail.com]
> *Sent:* 10 December 2013 01:21
> *To:* user@hadoop.apache.org
> *Subject:* Re: how to handle the corrupt block in HDFS?
>
>
>
> more strange , in my HDFS cluster ,every block has three replicas,but i
> find some one has ten replicas ,why?
>
>
>
> # sudo -u hdfs hadoop fs -ls
> /data/hisstage/helen/.staging/job_1385542328307_0915
> Found 5 items
> -rw-r--r--   3 helen hadoop          7 2013-11-29 14:01
> /data/hisstage/helen/.staging/job_1385542328307_0915/appTokens
> -rw-r--r--  10 helen hadoop    2977839 2013-11-29 14:01
> /data/hisstage/helen/.staging/job_1385542328307_0915/job.jar
> -rw-r--r--  10 helen hadoop       3696 2013-11-29 14:01
> /data/hisstage/helen/.staging/job_1385542328307_0915/job.split
>
>  On Tue, Dec 10, 2013 at 9:15 AM, ch huang <ju...@gmail.com> wrote:
>
> the strange thing is when i use the following command i find 1 corrupt
> block
>
>
>
> #  curl -s http://ch11:50070/jmx |grep orrupt
>     "CorruptBlocks" : 1,
>
> but when i run hdfs fsck / , i get none ,everything seems fine
>
>
>
> # sudo -u hdfs hdfs fsck /
>
> ........
>
>
>
> ....................................Status: HEALTHY
>  Total size:    1479728140875 B (Total open files size: 1677721600 B)
>  Total dirs:    21298
>  Total files:   100636 (Files currently being written: 25)
>  Total blocks (validated):      119788 (avg. block size 12352891 B) (Total
> open file blocks (not validated): 37)
>  Minimally replicated blocks:   119788 (100.0 %)
>  Over-replicated blocks:        0 (0.0 %)
>  Under-replicated blocks:       166 (0.13857816 %)
>  Mis-replicated blocks:         0 (0.0 %)
>  Default replication factor:    3
>  Average block replication:     3.0027633
>  Corrupt blocks:                0
>  Missing replicas:              831 (0.23049656 %)
>  Number of data-nodes:          5
>  Number of racks:               1
> FSCK ended at Tue Dec 10 09:14:48 CST 2013 in 3276 milliseconds
>
>
> The filesystem under path '/' is HEALTHY
>
>   On Tue, Dec 10, 2013 at 8:32 AM, ch huang <ju...@gmail.com> wrote:
>
> hi,maillist:
>
>             my nagios alert me that there is a corrupt block in HDFS all
> day,but i do not know how to remove it,and if the HDFS will handle this
> automaticlly? and if remove the corrupt block will cause any data
> lost?thanks
>
>
>
>
>

RE: how to handle the corrupt block in HDFS?

Posted by Peter Marron <Pe...@trilliumsoftware.com>.
Hi,

I am sure that there are others who will answer this better, but anyway.
The default replication level for files in HDFS is 3 and so most files that you
see will have a replication level of 3. However when you run a Map/Reduce
job the system knows in advance that every node will need a copy of
certain files. Specifically the job.xml and the various jars containing
classes that will be needed to run the mappers and reducers. So the
system arranges that some of these files have a higher replication level. This increases
the chances that a copy will be found locally.
By default this higher replication level is 10.

This can seem a little odd on a cluster where you only have, say, 3 nodes.
Because it means that you will almost always have some blocks that are marked
under-replicated. I think that there was some discussion a while back to change
this to make the replication level something like min(10, #number of nodes)
However, as I recall, the general consensus was that this was extra
complexity that wasn't really worth it. If it ain't broke...

Hope that this helps.

Peter Marron
Senior Developer, Research & Development

Office: +44 (0) 118-940-7609  peter.marron@trilliumsoftware.com<ma...@trilliumsoftware.com>
Theale Court First Floor, 11-13 High Street, Theale, RG7 5AH, UK


www.trilliumsoftware.com<http://www.trilliumsoftware.com/>

Be Certain About Your Data. Be Trillium Certain.

From: ch huang [mailto:justlooks@gmail.com]
Sent: 10 December 2013 01:21
To: user@hadoop.apache.org
Subject: Re: how to handle the corrupt block in HDFS?

more strange , in my HDFS cluster ,every block has three replicas,but i find some one has ten replicas ,why?

# sudo -u hdfs hadoop fs -ls /data/hisstage/helen/.staging/job_1385542328307_0915
Found 5 items
-rw-r--r--   3 helen hadoop          7 2013-11-29 14:01 /data/hisstage/helen/.staging/job_1385542328307_0915/appTokens
-rw-r--r--  10 helen hadoop    2977839 2013-11-29 14:01 /data/hisstage/helen/.staging/job_1385542328307_0915/job.jar
-rw-r--r--  10 helen hadoop       3696 2013-11-29 14:01 /data/hisstage/helen/.staging/job_1385542328307_0915/job.split

On Tue, Dec 10, 2013 at 9:15 AM, ch huang <ju...@gmail.com>> wrote:
the strange thing is when i use the following command i find 1 corrupt block

#  curl -s http://ch11:50070/jmx |grep orrupt
    "CorruptBlocks" : 1,
but when i run hdfs fsck / , i get none ,everything seems fine

# sudo -u hdfs hdfs fsck /
........

....................................Status: HEALTHY
 Total size:    1479728140875 B (Total open files size: 1677721600 B)
 Total dirs:    21298
 Total files:   100636 (Files currently being written: 25)
 Total blocks (validated):      119788 (avg. block size 12352891 B) (Total open file blocks (not validated): 37)
 Minimally replicated blocks:   119788 (100.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       166 (0.13857816 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    3
 Average block replication:     3.0027633
 Corrupt blocks:                0
 Missing replicas:              831 (0.23049656 %)
 Number of data-nodes:          5
 Number of racks:               1
FSCK ended at Tue Dec 10 09:14:48 CST 2013 in 3276 milliseconds

The filesystem under path '/' is HEALTHY

On Tue, Dec 10, 2013 at 8:32 AM, ch huang <ju...@gmail.com>> wrote:
hi,maillist:
            my nagios alert me that there is a corrupt block in HDFS all day,but i do not know how to remove it,and if the HDFS will handle this automaticlly? and if remove the corrupt block will cause any data lost?thanks
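
For files already sitting in HDFS with the higher factor, the current
replication can be read off the listing and lowered in place. A small sketch
reusing the staging path quoted above, assuming the job and its staging
directory are genuinely stale; the second column of the listing is the
per-file replication factor:

# sudo -u hdfs hadoop fs -ls /data/hisstage/helen/.staging/job_1385542328307_0915

and -setrep -w lowers it and waits until the extra replicas are gone:

# sudo -u hdfs hadoop fs -setrep -w 3 /data/hisstage/helen/.staging/job_1385542328307_0915/job.jar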



Re: how to handle the corrupt block in HDFS?

Posted by Patai Sangbutsarakum <si...@gmail.com>.
The 10 copies of the job.jar and job.split files are controlled by the
mapred.submit.replication property, set at job-initialization time.



On Mon, Dec 9, 2013 at 5:20 PM, ch huang <ju...@gmail.com> wrote:

> more strange , in my HDFS cluster ,every block has three replicas,but i
> find some one has ten replicas ,why?
>
> # sudo -u hdfs hadoop fs -ls
> /data/hisstage/helen/.staging/job_1385542328307_0915
> Found 5 items
> -rw-r--r--   3 helen hadoop          7 2013-11-29 14:01
> /data/hisstage/helen/.staging/job_1385542328307_0915/appTokens
> -rw-r--r--  10 helen hadoop    2977839 2013-11-29 14:01
> /data/hisstage/helen/.staging/job_1385542328307_0915/job.jar
> -rw-r--r--  10 helen hadoop       3696 2013-11-29 14:01
> /data/hisstage/helen/.staging/job_1385542328307_0915/job.split
>
>
> On Tue, Dec 10, 2013 at 9:15 AM, ch huang <ju...@gmail.com> wrote:
>
>> the strange thing is when i use the following command i find 1 corrupt
>> block
>>
>> #  curl -s http://ch11:50070/jmx |grep orrupt
>>     "CorruptBlocks" : 1,
>> but when i run hdfs fsck / , i get none ,everything seems fine
>>
>> # sudo -u hdfs hdfs fsck /
>> ........
>>
>> ....................................Status: HEALTHY
>>  Total size:    1479728140875 B (Total open files size: 1677721600 B)
>>  Total dirs:    21298
>>  Total files:   100636 (Files currently being written: 25)
>>  Total blocks (validated):      119788 (avg. block size 12352891 B)
>> (Total open file blocks (not validated): 37)
>>  Minimally replicated blocks:   119788 (100.0 %)
>>  Over-replicated blocks:        0 (0.0 %)
>>  Under-replicated blocks:       166 (0.13857816 %)
>>  Mis-replicated blocks:         0 (0.0 %)
>>  Default replication factor:    3
>>  Average block replication:     3.0027633
>>  Corrupt blocks:                0
>>  Missing replicas:              831 (0.23049656 %)
>>  Number of data-nodes:          5
>>  Number of racks:               1
>> FSCK ended at Tue Dec 10 09:14:48 CST 2013 in 3276 milliseconds
>>
>> The filesystem under path '/' is HEALTHY
>>
>>
>>  On Tue, Dec 10, 2013 at 8:32 AM, ch huang <ju...@gmail.com> wrote:
>>
>>> hi,maillist:
>>>             my nagios alert me that there is a corrupt block in HDFS all
>>> day,but i do not know how to remove it,and if the HDFS will handle this
>>> automaticlly? and if remove the corrupt block will cause any data
>>> lost?thanks
>>>
>>
>>
>
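
A sketch of a per-job override; the jar and class names are made up for
illustration, and the driver is assumed to use ToolRunner so the -D generic
option is honoured. In Hadoop 2.x the same setting also exists under the newer
key mapreduce.client.submit.file.replication, and placing it in mapred-site.xml
applies it cluster-wide:

# hadoop jar my-job.jar com.example.MyJob \
      -D mapred.submit.replication=5 \
      /input /output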

Re: how to handle the corrupt block in HDFS?

Posted by ch huang <ju...@gmail.com>.
Even more strange: in my HDFS cluster every block should have three replicas,
but I find that some have ten replicas. Why?

# sudo -u hdfs hadoop fs -ls
/data/hisstage/helen/.staging/job_1385542328307_0915
Found 5 items
-rw-r--r--   3 helen hadoop          7 2013-11-29 14:01
/data/hisstage/helen/.staging/job_1385542328307_0915/appTokens
-rw-r--r--  10 helen hadoop    2977839 2013-11-29 14:01
/data/hisstage/helen/.staging/job_1385542328307_0915/job.jar
-rw-r--r--  10 helen hadoop       3696 2013-11-29 14:01
/data/hisstage/helen/.staging/job_1385542328307_0915/job.split


On Tue, Dec 10, 2013 at 9:15 AM, ch huang <ju...@gmail.com> wrote:

> the strange thing is when i use the following command i find 1 corrupt
> block
>
> #  curl -s http://ch11:50070/jmx |grep orrupt
>     "CorruptBlocks" : 1,
> but when i run hdfs fsck / , i get none ,everything seems fine
>
> # sudo -u hdfs hdfs fsck /
> ........
>
> ....................................Status: HEALTHY
>  Total size:    1479728140875 B (Total open files size: 1677721600 B)
>  Total dirs:    21298
>  Total files:   100636 (Files currently being written: 25)
>  Total blocks (validated):      119788 (avg. block size 12352891 B) (Total
> open file blocks (not validated): 37)
>  Minimally replicated blocks:   119788 (100.0 %)
>  Over-replicated blocks:        0 (0.0 %)
>  Under-replicated blocks:       166 (0.13857816 %)
>  Mis-replicated blocks:         0 (0.0 %)
>  Default replication factor:    3
>  Average block replication:     3.0027633
>  Corrupt blocks:                0
>  Missing replicas:              831 (0.23049656 %)
>  Number of data-nodes:          5
>  Number of racks:               1
> FSCK ended at Tue Dec 10 09:14:48 CST 2013 in 3276 milliseconds
>
> The filesystem under path '/' is HEALTHY
>
>
>  On Tue, Dec 10, 2013 at 8:32 AM, ch huang <ju...@gmail.com> wrote:
>
>> hi,maillist:
>>             my nagios alert me that there is a corrupt block in HDFS all
>> day,but i do not know how to remove it,and if the HDFS will handle this
>> automaticlly? and if remove the corrupt block will cause any data
>> lost?thanks
>>
>
>
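
To check whether any other files are still carrying a factor of 10, the
recursive listing can be filtered on its replication column. A rough sketch
(the path is only an example, and it assumes paths without embedded spaces):

# hadoop fs -ls -R /data/hisstage | awk '$2 == 10 {print $2, $NF}'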

Re: how to handle the corrupt block in HDFS?

Posted by ch huang <ju...@gmail.com>.
The strange thing is that when I use the following command, I find 1 corrupt block:

#  curl -s http://ch11:50070/jmx |grep orrupt
    "CorruptBlocks" : 1,
but when I run hdfs fsck /, it reports none; everything seems fine:

# sudo -u hdfs hdfs fsck /
........

....................................Status: HEALTHY
 Total size:    1479728140875 B (Total open files size: 1677721600 B)
 Total dirs:    21298
 Total files:   100636 (Files currently being written: 25)
 Total blocks (validated):      119788 (avg. block size 12352891 B) (Total
open file blocks (not validated): 37)
 Minimally replicated blocks:   119788 (100.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       166 (0.13857816 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    3
 Average block replication:     3.0027633
 Corrupt blocks:                0
 Missing replicas:              831 (0.23049656 %)
 Number of data-nodes:          5
 Number of racks:               1
FSCK ended at Tue Dec 10 09:14:48 CST 2013 in 3276 milliseconds

The filesystem under path '/' is HEALTHY


On Tue, Dec 10, 2013 at 8:32 AM, ch huang <ju...@gmail.com> wrote:

> hi,maillist:
>             my nagios alert me that there is a corrupt block in HDFS all
> day,but i do not know how to remove it,and if the HDFS will handle this
> automaticlly? and if remove the corrupt block will cause any data
> lost?thanks
>
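
To keep an eye on the JMX counter over time without waiting for the next
Nagios alert, the curl command above can simply be looped; a sketch:

# while true; do curl -s http://ch11:50070/jmx | grep orrupt; sleep 60; done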
