Posted to user@hadoop.apache.org by Ralph Soika <ra...@imixs.com> on 2017/09/18 13:53:36 UTC

Is Hadoop validating the checksum when reading only a part of a file?

Hi,

I have a question about how partial reads of a large data file behave.
I want to implement an archive solution where I append smaller XML files 
to a big archive file via WebHDFS.
For each newly added file, my client stores the offset and size of the 
XML file within the archive file.
When I later need to read an XML file from the big archive file, I use the 
'offset' and 'length' parameters to read only that part of the file:

http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=OPEN[&offset=<LONG>][&length=<LONG>]
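
To make the request above concrete, here is a minimal sketch in Python that 
builds such an OPEN URL for one archived entry. The function name, the host 
name, the port 50070 and the archive path are assumptions chosen for 
illustration; only 'op', 'offset' and 'length' come from the URL pattern above.

```python
from urllib.parse import quote, urlencode


def build_open_url(host, port, path, offset=None, length=None):
    """Build a WebHDFS OPEN URL, optionally restricted to a byte range.

    host, port and path are placeholders supplied by the caller; offset and
    length map to the optional query parameters of the OPEN operation.
    """
    params = {"op": "OPEN"}
    if offset is not None:
        params["offset"] = offset
    if length is not None:
        params["length"] = length
    # quote() keeps "/" intact, so the HDFS path stays readable in the URL.
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical index entry: the XML file starts at byte 1024 and is 2048 bytes long.
    print(build_open_url("namenode.example.org", 50070, "/archive/big.dat",
                         offset=1024, length=2048))
```

The client-side index then only needs to keep one (offset, length) pair per 
archived XML file and pass it to this function when reading.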


My question is: in this case, does Hadoop verify the checksum to 
guarantee the data integrity of the partial read?

I guess only the checksums of the affected blocks will be verified, not 
those of the complete archive file?
Or is a partial read a performance issue?

Thanks in advance for your help.

===
Ralph

-- 
*Imixs*...extends the way people work together
We are an open source company, read more at: www.imixs.org 
<http://www.imixs.org>
------------------------------------------------------------------------
Imixs Software Solutions GmbH
Agnes-Pockels-Bogen 1, 80992 München
*Web:* www.imixs.com <http://www.imixs.com>
*Office:* +49 (0)89-452136 16 *Mobil:* +49-177-4128245
Registergericht: Amtsgericht Muenchen, HRB 136045
Geschaeftsfuehrer: Gaby Heinle u. Ralph Soika


Re: Is Hadoop validating the checksum when reading only a part of a file?

Posted by Ralph Soika <ra...@imixs.com>.
Thanks a lot for your answer. That makes it clear to me; I expected 
Hadoop to work this way.

===
Ralph


On 20.09.2017 07:57, Harsh J wrote:
> Yes, checksums are verified on every form of read (unless explicitly 
> disabled). By default, a checksum is generated and stored for every 
> 512 bytes of data (io.bytes.per.checksum, named dfs.bytes-per-checksum 
> in current releases), so a partial read verifies only the checksum 
> chunks it touches rather than the whole file.


Re: Is Hadoop validating the checksum when reading only a part of a file?

Posted by Harsh J <ha...@cloudera.com>.
Yes, checksums are verified on every form of read (unless explicitly
disabled). By default, a checksum is generated and stored for every 512
bytes of data (io.bytes.per.checksum, named dfs.bytes-per-checksum in
current releases), so a partial read verifies only the checksum chunks
it touches rather than the whole file.
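
As a rough illustration of what "only the relevant parts" means, the sketch 
below computes which byte range is covered by the checksum chunks that a 
partial read overlaps. It is plain arithmetic based on the 512-byte default 
mentioned in this thread, not HDFS's actual implementation; the function 
name is hypothetical, and the sketch ignores block boundaries and a shorter 
final chunk at the end of the file.

```python
BYTES_PER_CHECKSUM = 512  # default chunk size stated in this thread


def checksum_span(offset, length, bytes_per_checksum=BYTES_PER_CHECKSUM):
    """Return (start, end) of the checksum chunks overlapping a read.

    start is inclusive and end is exclusive, both as byte positions in the
    file. Assumes every chunk is full-sized (no end-of-file handling).
    """
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    first_chunk = offset // bytes_per_checksum
    last_chunk = (offset + length - 1) // bytes_per_checksum
    return first_chunk * bytes_per_checksum, (last_chunk + 1) * bytes_per_checksum


if __name__ == "__main__":
    # A 100-byte read starting at byte 1000 overlaps chunks 1 and 2.
    print(checksum_span(1000, 100))  # (512, 1536)
```

Under these assumptions the verified range exceeds the requested range by 
less than two chunks, regardless of how large the archive file is.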
