Posted to mapreduce-user@hadoop.apache.org by reena upadhyay <re...@outlook.com> on 2014/03/28 11:28:51 UTC

How checksums are generated for blocks in data nodes

I was going through this link http://stackoverflow.com/questions/9406477/data-integrity-in-hdfs-which-data-nodes-verifies-the-checksum . It says that in recent versions of Hadoop, only the last data node verifies the checksum, because the write happens in a pipeline fashion.
Now I have a question:
Assume my cluster has two data nodes, A and B. I have a file; half of its content is written on the first data node A and the remaining half is written on the second data node B, to take advantage of parallelism. My question is: will data node A not store the checksums for the blocks stored on it?

Going by the line "only the last data node verifies the checksum", it looks like only the last data node, in my case data node B, will generate the checksum. But if only data node B generates checksums, it will generate them only for the blocks stored on data node B. What about the checksums for the data blocks on data node A?
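To make the scenario concrete, here is a toy Python sketch of what I mean (not HDFS code; the names datanode_A/datanode_B are illustrative, and a single CRC32 per block stands in for the real per-chunk checksums HDFS keeps in a .meta file next to each block replica):

```python
import zlib

def block_meta(block: bytes) -> int:
    # one CRC32 over the whole block stands in for HDFS's per-chunk checksums
    return zlib.crc32(block)

# a file whose two halves land as separate blocks on separate data nodes
file_data = b"A" * 1000 + b"B" * 1000
block1, block2 = file_data[:1000], file_data[1000:]

# each block is written through its own pipeline (a single node each, here)
datanode_A = {"block1": (block1, block_meta(block1))}
datanode_B = {"block2": (block2, block_meta(block2))}

# does each node hold checksums for the block it stores?
print(datanode_A["block1"][1] == zlib.crc32(block1))  # True
print(datanode_B["block2"][1] == zlib.crc32(block2))  # True
```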

Re: How checksums are generated for blocks in data nodes

Posted by Wellington Chevreuil <we...@gmail.com>.
Hi Reena,

the pipeline is per block. If half of your file is on data node A only, that means the pipeline for that block had only one node (node A; probably because the replication factor is set to 1), and so data node A has the checksums for its block. The same applies to data node B.

All nodes will have checksums for the blocks they own. Checksums are passed along with the block as it goes through the pipeline, but since the last node in the pipeline receives the original checksums along with the block from the previous nodes, the validation only needs to happen on this last node: if it passes there, the block was not corrupted on any of the previous nodes either.
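A rough way to picture this is the sketch below (a toy model in Python, not HDFS code; HDFS actually checksums 512-byte chunks with CRC32, per dfs.bytes-per-checksum, and stores them in a per-block .meta file):

```python
import zlib

CHUNK_SIZE = 512  # HDFS checksums data in 512-byte chunks by default

def chunk_checksums(block: bytes) -> list:
    """Compute a CRC32 checksum for each 512-byte chunk of a block."""
    return [zlib.crc32(block[i:i + CHUNK_SIZE])
            for i in range(0, len(block), CHUNK_SIZE)]

def write_pipeline(block: bytes, pipeline: list) -> dict:
    """Toy write pipeline: the client computes checksums once, every node
    stores the block plus checksums, only the last node re-verifies."""
    checksums = chunk_checksums(block)  # computed once, up front
    stored = {}
    for i, node in enumerate(pipeline):
        stored[node] = (block, checksums)  # every node keeps its own copy
        if i == len(pipeline) - 1:
            # the last node re-computes and compares; a mismatch here would
            # mean the block was corrupted somewhere earlier in the pipeline
            assert chunk_checksums(block) == checksums, "corruption detected"
    return stored

replicas = write_pipeline(b"x" * 1500, ["nodeA", "nodeB", "nodeC"])
print(sorted(replicas))  # ['nodeA', 'nodeB', 'nodeC']
```

Every node in the pipeline ends up holding both the block and its checksums; the verification step simply does not need to run more than once, because a pass at the tail vouches for the whole chain.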

Cheers.

On 28 Mar 2014, at 10:28, reena upadhyay <re...@outlook.com> wrote:

> I was going through this link http://stackoverflow.com/questions/9406477/data-integrity-in-hdfs-which-data-nodes-verifies-the-checksum . Its written that in recent version of hadoop only the last data node verifies the checksum as the write happens in a pipeline fashion. 
> Now I have a question:
> Assuming my cluster has two data nodes A and B cluster, I have a file, half of the file content is written on first data node A and the other remaining half is written on the second data node B to take advantage of parallelism.  My question is:  Will data node A will not store the check sum for the blocks stored on it. 
> 
> As per the line "only the last data node verifies the checksum", it looks like only the  last data node in my case it will be data node B, will generate the checksum. But if only data node B generates checksum, then it will generate the check sum only for the blocks stored on data node B. What about the checksum for the data blocks on data node  machine A?

