Posted to common-user@hadoop.apache.org by Thamizh <tc...@yahoo.co.in> on 2011/04/08 14:17:46 UTC

Reg HDFS checksum

Hi All,

This is a question regarding "HDFS checksum" computation.

I understand that when a file is read from HDFS, the checksum is verified by default and the read fails if the file is corrupted. I also understand that the CRC data is internal to Hadoop.

Here are my questions:
1. How should the "hadoop dfs -get [-ignoreCrc] [-crc] <src> <localdst>" command be used? (A rough sketch of the programmatic equivalent, as I understand it, follows the questions.)

2. I used the "get" command on a .gz file with the -crc option ("hadoop dfs -get -crc input1/test.gz /home/hadoop/test/."). Does this check for the .crc file created in Hadoop? When I tried it, I got the error "-crc option is not valid when source file system does not have crc files. Automatically turn the option off." Does that mean Hadoop did not create a CRC file for this file? Is this correct?

3. How can I get Hadoop to create the CRC file?
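
To make question 1 concrete, here is the programmatic equivalent as I
understand it (a sketch only; the paths are just examples):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GetWithChecksum {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Checksum verification on read is the default; this just makes it explicit.
        fs.setVerifyChecksum(true);

        // Example paths only.
        Path src = new Path("input1/test.gz");
        Path dst = new Path("/home/hadoop/test/test.gz");

        // Copies the file to the local filesystem; the read fails if the data
        // coming back from the datanodes does not match the stored CRCs.
        fs.copyToLocalFile(src, dst);

        fs.close();
    }
}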

Regards,

  Thamizhannal P

Re: Reg HDFS checksum

Posted by Steve Loughran <st...@apache.org>.
On 12/04/2011 07:06, Josh Patterson wrote:
> [...]
>
> I say all of this to say: after having dug through HDFS's checksumming
> code, I am pretty confident that it's Good Stuff, although I don't
> claim to be a filesystem expert by any means. It may just be some
> simple error or oversight in your process, possibly?

Assuming it came down over HTTP, it's perfectly conceivable that
something went wrong on the way, especially if a proxy server got
involved. All HTTP checks is that the (optional) Content-Length is
consistent with what arrived; it relies on TCP checksums, which verify
that the network links work, but not the other parts of the system in
the way (such as any proxy server).
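
One way to rule the transfer itself out is an application-level digest of
the file as it sits on disk, compared against the same digest taken at the
source (this is separate from the HDFS-internal CRC scheme); a minimal
sketch:

import java.io.FileInputStream;
import java.io.InputStream;
import java.security.MessageDigest;

// Minimal sketch: MD5 the local copy so it can be compared with a digest
// computed wherever the file came from. TCP/HTTP alone won't catch a
// truncation introduced by an intermediary such as a proxy.
public class LocalMd5 {
    public static void main(String[] args) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        InputStream in = new FileInputStream(args[0]);
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            md5.update(buf, 0, n);
        }
        in.close();
        StringBuilder hex = new StringBuilder();
        for (byte b : md5.digest()) {
            hex.append(String.format("%02x", b));
        }
        System.out.println(hex);
    }
}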

Re: Reg HDFS checksum

Posted by Josh Patterson <jo...@cloudera.com>.
If you take a look at:

https://github.com/jpatanooga/IvoryMonkey/blob/master/src/tv/floe/IvoryMonkey/hadoop/fs/ExternalHDFSChecksumGenerator.java

you'll see a single-process version of what HDFS does under the hood
(HDFS itself does this in a highly distributed fashion). What's going
on here is that for every 512 bytes a CRC32 is calculated and saved at
each local datanode for that block. When the "checksum" is requested,
these CRC32s are pulled together and MD5 hashed, and that hash is sent
to the client process. The client process then MD5 hashes all of these
hashes together to produce a final hash.
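
In rough, single-process Java form, the scheme looks something like this
(an illustrative sketch only; the 512-byte chunking is as described, but
this is not the exact byte layout HDFS hashes):

import java.security.MessageDigest;
import java.util.List;
import java.util.zip.CRC32;

// Illustrative sketch of the MD5-of-MD5s-of-CRC32s scheme described above.
public class ChecksumSketch {

    static final int BYTES_PER_CRC = 512; // example chunk size

    // Per block: CRC32 every 512 bytes, feed the CRC bytes into one MD5.
    static byte[] blockMd5(byte[] block) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        CRC32 crc = new CRC32();
        for (int off = 0; off < block.length; off += BYTES_PER_CRC) {
            int len = Math.min(BYTES_PER_CRC, block.length - off);
            crc.reset();
            crc.update(block, off, len);
            long v = crc.getValue();
            // 4 CRC bytes, big-endian
            md5.update(new byte[] {
                    (byte) (v >>> 24), (byte) (v >>> 16), (byte) (v >>> 8), (byte) v });
        }
        return md5.digest();
    }

    // File level: MD5 over the concatenation of all the per-block MD5s.
    static byte[] fileChecksum(List<byte[]> blockMd5s) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        for (byte[] b : blockMd5s) {
            md5.update(b);
        }
        return md5.digest();
    }
}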

For some context: our purpose for this on the openPDC project was that
we had some legacy software writing to HDFS through an FTP proxy bridge:

https://openpdc.svn.codeplex.com/svn/Hadoop/Current%20Version/HdfsBridge/

Since the openPDC data was ultra-critical, in that we could not lose
*any* data, and the team wanted to use a simple FTP client library to
write to HDFS (the least amount of work for them, standard libraries),
we needed a way to make sure that no corruption occurred during the
"hop" through the FTP bridge (it acted as an intermediary to DFSClient;
something could fail and the file might be slightly truncated, yet this
would be hard to detect). In the FTP bridge we allowed a custom FTP
command to call the now-exposed "hdfs-checksum" command, and the
sending agent could then compute the hash locally (in the case of the
openPDC it was done in C#) and make sure the file had made it there
intact. This system has been in production for over a year now, storing
and maintaining smart grid data, and has been highly reliable.

I say all of this to say: after having dug through HDFS's checksumming
code, I am pretty confident that it's Good Stuff, although I don't
claim to be a filesystem expert by any means. It may just be some
simple error or oversight in your process, possibly?

On Tue, Apr 12, 2011 at 7:32 AM, Thamizh <tc...@yahoo.co.in> wrote:
>
> Thanks a lot, Josh.
>
> I have been given a .gz file and told that it was downloaded from HDFS.
>
> When I tried to check the integrity of that file using "gzip -t", it failed with "invalid compressed data--format violated", and "gzip -d" gave the same result.
>
> I am a bit worried about Hadoop's CRC checking mechanism, so I am looking into implementing an external CRC checker for Hadoop.
>
> Regards,
>
>  Thamizhannal P

-- 
Twitter: @jpatanooga
Solution Architect @ Cloudera
hadoop: http://www.cloudera.com
blog: http://jpatterson.floe.tv

Re: Reg HDFS checksum

Posted by Thamizh <tc...@yahoo.co.in>.
Thanks a lot, Josh.

I have been given a .gz file and told that it was downloaded from HDFS.

When I tried to check the integrity of that file using "gzip -t", it failed with "invalid compressed data--format violated", and "gzip -d" gave the same result.

I am a bit worried about Hadoop's CRC checking mechanism, so I am looking into implementing an external CRC checker for Hadoop.
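
For reference, the integrity test I ran is essentially the following,
expressed in Java (a sketch; GZIPInputStream throws an IOException on
corrupt or truncated data, much like "gzip -t"):

import java.io.FileInputStream;
import java.util.zip.GZIPInputStream;

// Sketch: stream the whole .gz through GZIPInputStream; corrupt or
// truncated input makes a read throw, similar to what "gzip -t" reports.
public class GzipCheck {
    public static void main(String[] args) throws Exception {
        GZIPInputStream in = new GZIPInputStream(new FileInputStream(args[0]));
        byte[] buf = new byte[8192];
        while (in.read(buf) != -1) {
            // discard the decompressed bytes; we only care that decoding succeeds
        }
        in.close();
        System.out.println("gzip stream OK");
    }
}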

Regards,

  Thamizhannal P

Re: Reg HDFS checksum

Posted by Josh Patterson <jo...@cloudera.com>.
Thamizh,
For a much older project I wrote a demo tool that computes the
Hadoop-style checksum locally:

https://github.com/jpatanooga/IvoryMonkey

The checksum generator is a single-threaded replica of Hadoop's
internal distributed hash-checksum mechanism.

What it actually does is save the CRC32 of every 512 bytes (per
block) and then compute an MD5 hash over those CRCs. When the
"getFileChecksum()" method is called, each block of the file sends its
MD5 hash to a collector, where the hashes are gathered together and an
MD5 hash is calculated over all of the block hashes.
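
For reference, asking the filesystem for that checksum through the
normal API looks roughly like this (a sketch; the path is just an
example, and non-HDFS filesystems may return null):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowFileChecksum {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Example path only.
        FileChecksum sum = fs.getFileChecksum(new Path("/user/hadoop/input1/test.gz"));

        if (sum != null) {
            // On HDFS this is the MD5-of-MD5s-of-CRC32s value described above.
            System.out.println(sum.getAlgorithmName() + ": " + sum);
        } else {
            System.out.println("This filesystem does not expose a file checksum");
        }
        fs.close();
    }
}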

My version includes code that can calculate the hash on the client
side (it breaks the data up in the same way that HDFS does and
calculates the hash the same way).

During development, we also discovered and filed:

https://issues.apache.org/jira/browse/HDFS-772

To invoke this method, use my shell wrapper:

https://github.com/jpatanooga/IvoryMonkey/blob/master/src/tv/floe/IvoryMonkey/hadoop/fs/Shell.java

Hope this provides some reference information for you.


-- 
Twitter: @jpatanooga
Solution Architect @ Cloudera
hadoop: http://www.cloudera.com
blog: http://jpatterson.floe.tv

Re: Reg HDFS checksum

Posted by Thamizh <tc...@yahoo.co.in>.
Hi Harsh,
Thanks a lot for the reference.
I would like to know how Hadoop computes the CRC for a file. If you have any references, please share them with me; it would be a great help.

Regards,

  Thamizhannal P

Re: Reg HDFS checksum

Posted by Harsh J <ha...@cloudera.com>.
Hello Thamizh,

Perhaps the discussion in the following link can shed some light on
this: http://getsatisfaction.com/cloudera/topics/hadoop_fs_crc

On Fri, Apr 8, 2011 at 5:47 PM, Thamizh <tc...@yahoo.co.in> wrote:
> Hi All,
>
> This is a question regarding "HDFS checksum" computation.

-- 
Harsh J