Posted to hdfs-dev@hadoop.apache.org by Uma Maheswara Rao G <ma...@huawei.com> on 2011/11/22 12:16:38 UTC

Blocks are getting corrupted under very high load

Hi All,



I have backported HDFS-1779 to our Hadoop version, which is based on the 0.20-append branch.

We are running a load test, as usual, to verify the reliability of the system under heavy load.
The cluster has 8 DataNodes and one Namenode.
Each machine has 16 CPUs and 12 hard disks of 2TB each.
Clients run on the same machines as the Datanodes.
Each client uploads tar files of 3-4 blocks each, from 50 threads.
The block size is 256MB and the replication factor is 3.

Everything looks fine under normal load.
When the load is increased, a lot of errors occur, including many pipeline failures.
All of that is expected, except for the strange behaviour of a few blocks.

The FSCK report shows that some blocks (around 30) are missing.
When I try to read those files, the read fails with "No Datanodes for this block".
Analysing the logs, we found that for these blocks a pipeline recovery happened and the write then completed successfully on a single Datanode.
The Datanode also reported the block to the Namenode in a blockReceived command.
Some time later (say, 30 minutes), that Datanode was restarted.
The BBW (BlocksBeingWritten) report sent by the DN immediately after restart includes these finalized blocks, which shows that the block files are still in the blocksBeingWritten folder.
In many cases, the generation timestamp reported in the BBW report is the old one.

The Namenode rejects these blocks from the BBW report, saying the file is already closed.
The Namenode then asks the Datanode to invalidate the blocks, and the Datanode does so.
While deleting the blocks, the Datanode also prints the path from the BlocksBeingWritten directory (and the previous generation timestamp).

This looks very strange to me.
Does this mean that the finalized block file and meta file (which are written into the current folder) are getting lost after the DN restart?
That would explain why the Namenode does not receive these blocks in the block report sent by the Datanodes.





Regards,

Uma

RE: Blocks are getting corrupted under very high load

Posted by Uma Maheswara Rao G <ma...@huawei.com>.
Thanks Todd.

We have also started suspecting that angle. We plan to capture the file details before and after the reboot;
with that analysis I can confirm whether or not it is the same issue.

One more thing to note: in some cases the gap between the last replica finalization and the reboot is ~1 hour.
Since the machine was rebooted because of kernel.hung_task_timeout_secs, that particular thread may never have got a chance in the OS to sync the data.

HDFS-1539 is a great one. I had merged all the bug fixes, but since that issue is classed as an improvement it did not make it onto my list :( .

I also found some OS-level options for doing filesystem operations synchronously, for example the dirsync mount option:
dirsync
    All directory updates within the filesystem are done synchronously. This affects the following system calls: creat, link, unlink, symlink, mkdir, rmdir, mknod and rename.
We mainly suspect that a rename was lost after the reboot, since the meta file and block file are renamed when a block is finalized from BBW to current (leaving the block size question aside for now).
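To make that concrete, here is a rough illustration (not our actual DN code; the class, method and path names are made up) of the finalize rename we are talking about, with an explicit fsync of the destination directory -- roughly the guarantee that a dirsync mount would give for the rename itself:

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

// Illustration only: move a finalized replica's block file and meta file from
// blocksBeingWritten/ to current/, then fsync the destination directory so the
// rename itself survives a crash or hard reboot.
public class FinalizeRenameSketch {
    static void finalizeReplica(Path bbwBlockFile, Path bbwMetaFile, Path currentDir)
            throws IOException {
        Files.move(bbwBlockFile, currentDir.resolve(bbwBlockFile.getFileName()),
                StandardCopyOption.ATOMIC_MOVE);
        Files.move(bbwMetaFile, currentDir.resolve(bbwMetaFile.getFileName()),
                StandardCopyOption.ATOMIC_MOVE);
        // On Linux a directory can be opened read-only and force()d to push the
        // new directory entries to disk; a dirsync mount makes this implicit.
        try (FileChannel dir = FileChannel.open(currentDir, StandardOpenOption.READ)) {
            dir.force(true);
        }
    }
}

Without that directory sync (or a dirsync mount), the new directory entries may live only in the page cache / ext3 journal and be lost on a hard reboot, which would match what we are seeing.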

Anyway, thanks a lot for spending your valuable time with us here. After checking the OS logs above, I will do a run with HDFS-1539.


Regards,
Uma

________________________________________
From: Todd Lipcon [todd@cloudera.com]
Sent: Thursday, November 24, 2011 5:07 AM
To: common-dev@hadoop.apache.org
Cc: hdfs-dev@hadoop.apache.org
Subject: Re: Blocks are getting corrupted under very high load

On Wed, Nov 23, 2011 at 1:23 AM, Uma Maheswara Rao G
<ma...@huawei.com> wrote:
> Yes, Todd,  block after restart is small and  genstamp also lesser.
>   Here complete machine reboot happend. The boards are configured like, if it is not getting any CPU cycles  for 480secs, it will reboot himself.
>  kernal.hung_task_timeout_secs = 480 sec.

So sounds like the following happened:
- while writing file, the pipeline got reduced down to 1 node due to
timeouts from the other two
- soon thereafter (before more replicas were made), that last replica
kernel-paniced without syncing the data
- on reboot, the filesystem lost some edits from its ext3 journal, and
the block got moved back into the RBW directly, with truncated data
- hdfs did "the right thing" - at least what the algorithms say it
should do, because it had gotten a commitment for a later replica

If you have a build which includes HDFS-1539, you could consider
setting dfs.datanode.synconclose to true, which would have prevented
this problem.

-Todd
--
Todd Lipcon
Software Engineer, Cloudera

Re: Blocks are getting corrupted under very high load

Posted by Todd Lipcon <to...@cloudera.com>.
On Wed, Nov 23, 2011 at 1:23 AM, Uma Maheswara Rao G
<ma...@huawei.com> wrote:
> Yes, Todd,  block after restart is small and  genstamp also lesser.
>   Here complete machine reboot happend. The boards are configured like, if it is not getting any CPU cycles  for 480secs, it will reboot himself.
>  kernal.hung_task_timeout_secs = 480 sec.

So it sounds like the following happened:
- while writing the file, the pipeline got reduced down to 1 node due to
timeouts from the other two
- soon thereafter (before more replicas were made), the node holding that
last replica kernel-panicked without syncing the data
- on reboot, the filesystem lost some edits from its ext3 journal, and
the block got moved back into the RBW directory, with truncated data
- hdfs did "the right thing" - at least what the algorithms say it
should do, because it had gotten a commitment for a later replica

If you have a build which includes HDFS-1539, you could consider
setting dfs.datanode.synconclose to true, which would have prevented
this problem.
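For example (just a sketch -- in practice you would set the property in hdfs-site.xml on every DataNode rather than in code), turning it on through the Configuration API looks something like:

import org.apache.hadoop.conf.Configuration;

// Minimal sketch: enable fsync-on-close for DataNodes on a build that
// includes HDFS-1539, so block data and metadata are flushed to disk when a
// block is finalized instead of sitting only in the page cache.
public class EnableSyncOnClose {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setBoolean("dfs.datanode.synconclose", true);
        System.out.println("dfs.datanode.synconclose = "
                + conf.getBoolean("dfs.datanode.synconclose", false));
    }
}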

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera

RE: Blocks are getting corrupted under very high load

Posted by Uma Maheswara Rao G <ma...@huawei.com>.
Yes, Todd, the block after restart is smaller and the genstamp is also older.
   A complete machine reboot happened here. The boards are configured such that if a task gets no CPU cycles for 480 secs, the machine reboots itself:
  kernel.hung_task_timeout_secs = 480
  Because of this timeout, the DN machine rebooted automatically.
Regards,
Uma

________________________________________
From: Todd Lipcon [todd@cloudera.com]
Sent: Wednesday, November 23, 2011 2:08 PM
To: hdfs-dev@hadoop.apache.org
Subject: Re: Blocks are getting corrupted under very high load

I noticed that the reported block after restart is also much smaller
than the block reported earlier.

Any chance when your DN restarted it actually lost power? ie do you
mean that just the DN JVM restarted, or that the whole machine crashed
and restarted?

On Tue, Nov 22, 2011 at 9:17 PM, Uma Maheswara Rao G
<ma...@huawei.com> wrote:
> Hi Todd,
> Thanks a lot for taking a look.
>
> Yes, I also suspect the same initially. But after analysing the logs, we found below client trace logs in DN, which means block finalization completed.
> ./hadoop-root-datanode-xx-xx-132-22.log.1:2011-11-20 18:18:45,498 INFO  DataNode.clienttrace (BlockReceiver.java:run(1130)) - src: /xx.xx.132.26:55882, dest: /xx.xx.132.22:10010, bytes: 255954944, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_827638122_13, srvID: DS-1518903564-158.1.132.22-10010-1321492867433, blockid: blk_1321803251510_85379
> ./hadoop-root-datanode-xx-xx-132-22.log.1:2011-11-20 18:18:45,498 INFO  datanode.DataNode (BlockReceiver.java:run(1185)) - PacketResponder 0 for block blk_1321803251510_85379 terminating
>
>
> blk_1321803251510_85379 is recent generationTimeStamp. Also NN logs clearly saying that addStoredBlock called for this block id.
>
> Below are the logs..
> ./hadoop-root-HANameNode-xx-xx-132-27.log.9:2011-11-20 18:18:15,836 INFO  namenode.FSNamesystem (FSNamesystem.java:commitBlockSynchronization(2415)) - commitBlockSynchronization(lastblock=blk_1321803251510_83627, newgenerationstamp=85379, newlength=246505984, newtargets=[xx.xx.132.22:10010], closeFile=false, deleteBlock=false)
> ./hadoop-root-HANameNode-xx-xx-132-27.log.9:2011-11-20 18:18:15,869 INFO  namenode.FSNamesystem (FSNamesystem.java:commitBlockSynchronization(2488)) - commitBlockSynchronization(blk_1321803251510_85379) successful
> ./hadoop-root-HANameNode-xx-xx-132-27.log.9:2011-11-20 18:18:45,496 WARN  namenode.FSNamesystem (FSNamesystem.java:addStoredBlock(3708)) - Inconsistent size for block blk_1321803251510_85379 reported from xx.xx.132.22:10010 current size is 246505984 reported size is 255954944
> ./hadoop-root-HANameNode-xx-xx-132-27.log.9:2011-11-20 18:18:45,496 WARN  hdfs.StateChange (FSNamesystem.java:addStoredBlock(3800)) - BLOCK* NameSystem.addStoredBlock: Redundant addStoredBlock request received for blk_1321803251510_85379 on xx.xx.132.22:10010 size 255954944
>
>
> After DN restart:
> ./hadoop-root-HANameNode-xx-xx-132-27.log.7:2011-11-20 18:55:54,844 INFO  hdfs.StateChange (FSNamesystem.java:rejectAddStoredBlock(3520)) - BLOCK* NameSystem.addStoredBlock: addStoredBlock request received for blk_1321803251510_83627 on xx.xx.132.22:10010 size 104428544 but was rejected: Reported as block being written but is a block of closed file.
>
> Regards,
> Uma
>
> ________________________________________
> From: Todd Lipcon [todd@cloudera.com]
> Sent: Wednesday, November 23, 2011 6:27 AM
> To: common-dev@hadoop.apache.org
> Cc: hdfs-dev@hadoop.apache.org
> Subject: Re: Blocks are getting corrupted under very high load
>
> Can you look on the DN in question and see whether it was succesfully
> finalized when the write finished? It doesn't sound like a successful
> write -- should have moved it out of the bbw directory into current/
>
> -Todd
>
> On Tue, Nov 22, 2011 at 3:16 AM, Uma Maheswara Rao G
> <ma...@huawei.com> wrote:
>> Hi All,
>>
>>
>>
>> I have backported HDFS-1779 to our Hadoop version which is based on 0.20-Append branch.
>>
>> We are running a load test, as usual. (We want to ensure the reliability of the system under heavy loads.)
>> My cluster has 8 DataNodes and a Namenode
>> Each machine has 16 CPUs and 12 hard disks, each having 2TB capacity.
>> Clients are running along with Datanodes.
>> Clients will upload some tar files containing 3-4 blocks, from 50 threads.
>> Each block size is 256MB. replication factor is 3.
>>
>> Everything looks to be fine on a normal load.
>> When the load is increased, lot of errors are happening.
>> Many pipeline failures are happening also.
>> All these are fine, except for the strange case of few blocks.
>>
>> Some blocks (around 30) are missing (FSCK report shows).
>> When I tried to read that files, it fails saying that No Datanodes for this block
>> Analysing the logs, we found that, for these blocks, pipeline recovery happened, write was successful to a single Datanode.
>> Also, Datanode reported the block to Namenode in a blockReceived command.
>> After some time (say, 30 minutes), the Datanode is getting restarted.
>> In the BBW (BlocksBeingWritten) report send by DN immediately after restart, these finalized blocks are also included. (Showing that these blocks are in blocksBeingWritten folder)
>> In many of the cases, the generation timestamp reported in the BBW report is the old timestamp.
>>
>> Namenode is rejecting that block in the BBW report by saying file is already closed.
>> Also, Namenode asks the Datanode to invlidate the blocks & Datanode is doing the same.
>> When deleting the blocks also, it is printing the path from BlocksBeingWritten directory. (Also the previous generation timestamp)
>>
>> Looks very strange for me.
>> Does this means that the finalized block file & meta file (which is written in current folder) is getting lost after DN restart
>> Due to which Namenode will not receive these block's information in the BLOCK REPORT send from the Datanodes.
>>
>>
>>
>>
>>
>> Regards,
>>
>> Uma
>>
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>



--
Todd Lipcon
Software Engineer, Cloudera

Re: Blocks are getting corrupted under very high load

Posted by Todd Lipcon <to...@cloudera.com>.
I noticed that the reported block after restart is also much smaller
than the block reported earlier.

Any chance that when your DN restarted it actually lost power? I.e., do you
mean that just the DN JVM restarted, or that the whole machine crashed
and restarted?

On Tue, Nov 22, 2011 at 9:17 PM, Uma Maheswara Rao G
<ma...@huawei.com> wrote:
> Hi Todd,
> Thanks a lot for taking a look.
>
> Yes, I also suspect the same initially. But after analysing the logs, we found below client trace logs in DN, which means block finalization completed.
> ./hadoop-root-datanode-xx-xx-132-22.log.1:2011-11-20 18:18:45,498 INFO  DataNode.clienttrace (BlockReceiver.java:run(1130)) - src: /xx.xx.132.26:55882, dest: /xx.xx.132.22:10010, bytes: 255954944, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_827638122_13, srvID: DS-1518903564-158.1.132.22-10010-1321492867433, blockid: blk_1321803251510_85379
> ./hadoop-root-datanode-xx-xx-132-22.log.1:2011-11-20 18:18:45,498 INFO  datanode.DataNode (BlockReceiver.java:run(1185)) - PacketResponder 0 for block blk_1321803251510_85379 terminating
>
>
> blk_1321803251510_85379 is recent generationTimeStamp. Also NN logs clearly saying that addStoredBlock called for this block id.
>
> Below are the logs..
> ./hadoop-root-HANameNode-xx-xx-132-27.log.9:2011-11-20 18:18:15,836 INFO  namenode.FSNamesystem (FSNamesystem.java:commitBlockSynchronization(2415)) - commitBlockSynchronization(lastblock=blk_1321803251510_83627, newgenerationstamp=85379, newlength=246505984, newtargets=[xx.xx.132.22:10010], closeFile=false, deleteBlock=false)
> ./hadoop-root-HANameNode-xx-xx-132-27.log.9:2011-11-20 18:18:15,869 INFO  namenode.FSNamesystem (FSNamesystem.java:commitBlockSynchronization(2488)) - commitBlockSynchronization(blk_1321803251510_85379) successful
> ./hadoop-root-HANameNode-xx-xx-132-27.log.9:2011-11-20 18:18:45,496 WARN  namenode.FSNamesystem (FSNamesystem.java:addStoredBlock(3708)) - Inconsistent size for block blk_1321803251510_85379 reported from xx.xx.132.22:10010 current size is 246505984 reported size is 255954944
> ./hadoop-root-HANameNode-xx-xx-132-27.log.9:2011-11-20 18:18:45,496 WARN  hdfs.StateChange (FSNamesystem.java:addStoredBlock(3800)) - BLOCK* NameSystem.addStoredBlock: Redundant addStoredBlock request received for blk_1321803251510_85379 on xx.xx.132.22:10010 size 255954944
>
>
> After DN restart:
> ./hadoop-root-HANameNode-xx-xx-132-27.log.7:2011-11-20 18:55:54,844 INFO  hdfs.StateChange (FSNamesystem.java:rejectAddStoredBlock(3520)) - BLOCK* NameSystem.addStoredBlock: addStoredBlock request received for blk_1321803251510_83627 on xx.xx.132.22:10010 size 104428544 but was rejected: Reported as block being written but is a block of closed file.
>
> Regards,
> Uma
>
> ________________________________________
> From: Todd Lipcon [todd@cloudera.com]
> Sent: Wednesday, November 23, 2011 6:27 AM
> To: common-dev@hadoop.apache.org
> Cc: hdfs-dev@hadoop.apache.org
> Subject: Re: Blocks are getting corrupted under very high load
>
> Can you look on the DN in question and see whether it was succesfully
> finalized when the write finished? It doesn't sound like a successful
> write -- should have moved it out of the bbw directory into current/
>
> -Todd
>
> On Tue, Nov 22, 2011 at 3:16 AM, Uma Maheswara Rao G
> <ma...@huawei.com> wrote:
>> Hi All,
>>
>>
>>
>> I have backported HDFS-1779 to our Hadoop version which is based on 0.20-Append branch.
>>
>> We are running a load test, as usual. (We want to ensure the reliability of the system under heavy loads.)
>> My cluster has 8 DataNodes and a Namenode
>> Each machine has 16 CPUs and 12 hard disks, each having 2TB capacity.
>> Clients are running along with Datanodes.
>> Clients will upload some tar files containing 3-4 blocks, from 50 threads.
>> Each block size is 256MB. replication factor is 3.
>>
>> Everything looks to be fine on a normal load.
>> When the load is increased, lot of errors are happening.
>> Many pipeline failures are happening also.
>> All these are fine, except for the strange case of few blocks.
>>
>> Some blocks (around 30) are missing (FSCK report shows).
>> When I tried to read that files, it fails saying that No Datanodes for this block
>> Analysing the logs, we found that, for these blocks, pipeline recovery happened, write was successful to a single Datanode.
>> Also, Datanode reported the block to Namenode in a blockReceived command.
>> After some time (say, 30 minutes), the Datanode is getting restarted.
>> In the BBW (BlocksBeingWritten) report send by DN immediately after restart, these finalized blocks are also included. (Showing that these blocks are in blocksBeingWritten folder)
>> In many of the cases, the generation timestamp reported in the BBW report is the old timestamp.
>>
>> Namenode is rejecting that block in the BBW report by saying file is already closed.
>> Also, Namenode asks the Datanode to invlidate the blocks & Datanode is doing the same.
>> When deleting the blocks also, it is printing the path from BlocksBeingWritten directory. (Also the previous generation timestamp)
>>
>> Looks very strange for me.
>> Does this means that the finalized block file & meta file (which is written in current folder) is getting lost after DN restart
>> Due to which Namenode will not receive these block's information in the BLOCK REPORT send from the Datanodes.
>>
>>
>>
>>
>>
>> Regards,
>>
>> Uma
>>
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>



-- 
Todd Lipcon
Software Engineer, Cloudera

RE: Blocks are getting corrupted under very high load

Posted by Uma Maheswara Rao G <ma...@huawei.com>.
Hi Todd,
Thanks a lot for taking a look.

Yes, that is what I also suspected initially. But after analysing the logs, we found the client trace logs below in the DN, which show that block finalization completed.
./hadoop-root-datanode-xx-xx-132-22.log.1:2011-11-20 18:18:45,498 INFO  DataNode.clienttrace (BlockReceiver.java:run(1130)) - src: /xx.xx.132.26:55882, dest: /xx.xx.132.22:10010, bytes: 255954944, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_827638122_13, srvID: DS-1518903564-158.1.132.22-10010-1321492867433, blockid: blk_1321803251510_85379
./hadoop-root-datanode-xx-xx-132-22.log.1:2011-11-20 18:18:45,498 INFO  datanode.DataNode (BlockReceiver.java:run(1185)) - PacketResponder 0 for block blk_1321803251510_85379 terminating


The generation stamp in blk_1321803251510_85379 (85379) is the recent one. The NN logs also clearly show that addStoredBlock was called for this block ID.

Below are the logs:
./hadoop-root-HANameNode-xx-xx-132-27.log.9:2011-11-20 18:18:15,836 INFO  namenode.FSNamesystem (FSNamesystem.java:commitBlockSynchronization(2415)) - commitBlockSynchronization(lastblock=blk_1321803251510_83627, newgenerationstamp=85379, newlength=246505984, newtargets=[xx.xx.132.22:10010], closeFile=false, deleteBlock=false)
./hadoop-root-HANameNode-xx-xx-132-27.log.9:2011-11-20 18:18:15,869 INFO  namenode.FSNamesystem (FSNamesystem.java:commitBlockSynchronization(2488)) - commitBlockSynchronization(blk_1321803251510_85379) successful
./hadoop-root-HANameNode-xx-xx-132-27.log.9:2011-11-20 18:18:45,496 WARN  namenode.FSNamesystem (FSNamesystem.java:addStoredBlock(3708)) - Inconsistent size for block blk_1321803251510_85379 reported from xx.xx.132.22:10010 current size is 246505984 reported size is 255954944
./hadoop-root-HANameNode-xx-xx-132-27.log.9:2011-11-20 18:18:45,496 WARN  hdfs.StateChange (FSNamesystem.java:addStoredBlock(3800)) - BLOCK* NameSystem.addStoredBlock: Redundant addStoredBlock request received for blk_1321803251510_85379 on xx.xx.132.22:10010 size 255954944


After DN restart:
./hadoop-root-HANameNode-xx-xx-132-27.log.7:2011-11-20 18:55:54,844 INFO  hdfs.StateChange (FSNamesystem.java:rejectAddStoredBlock(3520)) - BLOCK* NameSystem.addStoredBlock: addStoredBlock request received for blk_1321803251510_83627 on xx.xx.132.22:10010 size 104428544 but was rejected: Reported as block being written but is a block of closed file.

Regards,
Uma

________________________________________
From: Todd Lipcon [todd@cloudera.com]
Sent: Wednesday, November 23, 2011 6:27 AM
To: common-dev@hadoop.apache.org
Cc: hdfs-dev@hadoop.apache.org
Subject: Re: Blocks are getting corrupted under very high load

Can you look on the DN in question and see whether it was succesfully
finalized when the write finished? It doesn't sound like a successful
write -- should have moved it out of the bbw directory into current/

-Todd

On Tue, Nov 22, 2011 at 3:16 AM, Uma Maheswara Rao G
<ma...@huawei.com> wrote:
> Hi All,
>
>
>
> I have backported HDFS-1779 to our Hadoop version which is based on 0.20-Append branch.
>
> We are running a load test, as usual. (We want to ensure the reliability of the system under heavy loads.)
> My cluster has 8 DataNodes and a Namenode
> Each machine has 16 CPUs and 12 hard disks, each having 2TB capacity.
> Clients are running along with Datanodes.
> Clients will upload some tar files containing 3-4 blocks, from 50 threads.
> Each block size is 256MB. replication factor is 3.
>
> Everything looks to be fine on a normal load.
> When the load is increased, lot of errors are happening.
> Many pipeline failures are happening also.
> All these are fine, except for the strange case of few blocks.
>
> Some blocks (around 30) are missing (FSCK report shows).
> When I tried to read that files, it fails saying that No Datanodes for this block
> Analysing the logs, we found that, for these blocks, pipeline recovery happened, write was successful to a single Datanode.
> Also, Datanode reported the block to Namenode in a blockReceived command.
> After some time (say, 30 minutes), the Datanode is getting restarted.
> In the BBW (BlocksBeingWritten) report send by DN immediately after restart, these finalized blocks are also included. (Showing that these blocks are in blocksBeingWritten folder)
> In many of the cases, the generation timestamp reported in the BBW report is the old timestamp.
>
> Namenode is rejecting that block in the BBW report by saying file is already closed.
> Also, Namenode asks the Datanode to invlidate the blocks & Datanode is doing the same.
> When deleting the blocks also, it is printing the path from BlocksBeingWritten directory. (Also the previous generation timestamp)
>
> Looks very strange for me.
> Does this means that the finalized block file & meta file (which is written in current folder) is getting lost after DN restart
> Due to which Namenode will not receive these block's information in the BLOCK REPORT send from the Datanodes.
>
>
>
>
>
> Regards,
>
> Uma
>



--
Todd Lipcon
Software Engineer, Cloudera

Re: Blocks are getting corrupted under very high load

Posted by Todd Lipcon <to...@cloudera.com>.
Can you look on the DN in question and see whether the block was successfully
finalized when the write finished? It doesn't sound like a successful
write -- a successful write should have moved it out of the bbw directory into current/
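For example, to check where the replica ended up on that DN, a rough sketch like the following would do (the data directory path and block id here are placeholders -- substitute your own):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Rough sketch: walk a DN data directory and print every file whose name
// starts with the given block id, so you can see whether the replica sits
// under current/ or under blocksBeingWritten/.
public class FindReplica {
    public static void main(String[] args) throws IOException {
        Path dataDir = Paths.get(args.length > 0 ? args[0] : "/data1/dfs/data");
        String blockId = args.length > 1 ? args[1] : "blk_1321803251510";
        try (Stream<Path> files = Files.walk(dataDir)) {
            files.filter(p -> p.getFileName() != null
                           && p.getFileName().toString().startsWith(blockId))
                 .forEach(System.out::println);
        }
    }
}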

-Todd

On Tue, Nov 22, 2011 at 3:16 AM, Uma Maheswara Rao G
<ma...@huawei.com> wrote:
> Hi All,
>
>
>
> I have backported HDFS-1779 to our Hadoop version which is based on 0.20-Append branch.
>
> We are running a load test, as usual. (We want to ensure the reliability of the system under heavy loads.)
> My cluster has 8 DataNodes and a Namenode
> Each machine has 16 CPUs and 12 hard disks, each having 2TB capacity.
> Clients are running along with Datanodes.
> Clients will upload some tar files containing 3-4 blocks, from 50 threads.
> Each block size is 256MB. replication factor is 3.
>
> Everything looks to be fine on a normal load.
> When the load is increased, lot of errors are happening.
> Many pipeline failures are happening also.
> All these are fine, except for the strange case of few blocks.
>
> Some blocks (around 30) are missing (FSCK report shows).
> When I tried to read that files, it fails saying that No Datanodes for this block
> Analysing the logs, we found that, for these blocks, pipeline recovery happened, write was successful to a single Datanode.
> Also, Datanode reported the block to Namenode in a blockReceived command.
> After some time (say, 30 minutes), the Datanode is getting restarted.
> In the BBW (BlocksBeingWritten) report send by DN immediately after restart, these finalized blocks are also included. (Showing that these blocks are in blocksBeingWritten folder)
> In many of the cases, the generation timestamp reported in the BBW report is the old timestamp.
>
> Namenode is rejecting that block in the BBW report by saying file is already closed.
> Also, Namenode asks the Datanode to invlidate the blocks & Datanode is doing the same.
> When deleting the blocks also, it is printing the path from BlocksBeingWritten directory. (Also the previous generation timestamp)
>
> Looks very strange for me.
> Does this means that the finalized block file & meta file (which is written in current folder) is getting lost after DN restart
> Due to which Namenode will not receive these block's information in the BLOCK REPORT send from the Datanodes.
>
>
>
>
>
> Regards,
>
> Uma
>



-- 
Todd Lipcon
Software Engineer, Cloudera
