Posted to common-dev@hadoop.apache.org by Doug Cutting <cu...@apache.org> on 2007/05/16 20:25:42 UTC

Re: Many Checksum Errors

[ Moving discussion to hadoop-dev.  -drc ]

Raghu Angadi wrote:
> This is good validation of how important ECC memory is. Currently the 
> HDFS client deletes a block when it notices a checksum error. After the 
> upcoming move to block-level CRCs, we should make the Datanode 
> re-validate the block before deciding to delete it.

It also emphasizes how important end-to-end checksums are.  Data should 
also be checksummed as soon as possible after it is generated, before it 
has a chance to be corrupted.

Ideally, the initial buffer that stores the data should be small, and 
data should be checksummed as this initial buffer is flushed.  In the 
current implementation, the small checksum buffer is the second buffer, 
while the initial buffer is the larger io.buffer.size buffer.  To provide 
maximum protection against memory errors, this situation should be reversed.

This is discussed in https://issues.apache.org/jira/browse/HADOOP-928. 
Perhaps a new issue should be filed to reverse the order of these 
buffers, so that data is checksummed before entering the larger, 
longer-lived buffer?
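
As a rough illustration of the two orderings (a toy sketch using only JDK 
classes; CheckedOutputStream merely stands in for the checksum layer, and 
none of this is the actual Hadoop code):

import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.CRC32;
import java.util.zip.CheckedOutputStream;

public class BufferOrderDemo {
  public static void main(String[] args) throws IOException {
    int ioBufferSize = 64 * 1024;   // stands in for the large io.buffer.size buffer

    // Current order: writes land in the big buffer first and are only
    // checksummed later, when that buffer flushes into the checksum layer,
    // so a bit flip while data sits in the big buffer is folded into the
    // checksum and goes undetected.
    OutputStream current =
        new BufferedOutputStream(
            new CheckedOutputStream(new ByteArrayOutputStream(), new CRC32()),
            ioBufferSize);

    // Proposed (reversed) order: data is checksummed immediately as it is
    // written, before it enters the big buffer, so later corruption in the
    // big buffer no longer matches the recorded checksum.
    OutputStream proposed =
        new CheckedOutputStream(
            new BufferedOutputStream(new ByteArrayOutputStream(), ioBufferSize),
            new CRC32());

    current.write("some data".getBytes());
    proposed.write("some data".getBytes());
    current.close();
    proposed.close();
  }
}

The only difference is which stream wraps which, but that determines 
whether corruption of bytes sitting in the larger, longer-lived buffer can 
still be caught by the checksum.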

Doug

Re: Many Checksum Errors

Posted by Raghu Angadi <ra...@yahoo-inc.com>.
Doug Cutting wrote:
> Raghu Angadi wrote:
>> In my implementation of block-level CRCs (does not affect 
>> ChecksumFileSystem in HADOOP-928), we don't buffer checksum data at all. 
> 
> That sounds like a good approach.  I look forward to seeing the patch.

I will prepare a temporary patch with the current changes and upload it 
to HADOOP-1134 by this weekend. It does not do upgrades, but works well 
otherwise.

Raghu.


Re: Many Checksum Errors

Posted by Doug Cutting <cu...@apache.org>.
Raghu Angadi wrote:
> In my implementation of block-level CRCs (does not affect 
> ChecksumFileSystem in HADOOP-928), we don't buffer checksum data at all. 

That sounds like a good approach.  I look forward to seeing the patch.

> We could remove 
> buffering altogether at the FileSystem level and let the FS 
> implementations decide how to buffer.

That's already been done, as of HADOOP-928.  FileSystem implementations 
now opt to use ChecksumFileSystem.  The buffer size defaults to 
io.buffer.size, but applications may pass an explicit buffer size to the 
FileSystem.  The FileSystem implementation is free to ignore that hint.
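
For example, something along these lines with the FileSystem API (a hedged 
sketch: the path is a placeholder and the exact create/open overloads may 
differ slightly between versions, but both accept an explicit bufferSize):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BufferSizeHint {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);   // local, checksummed FS with a default config
    Path p = new Path("/tmp/example.dat");

    // Pass an explicit buffer size rather than relying on the configured
    // default; the FileSystem implementation may honor or ignore the hint.
    int bufferSize = 4096;
    FSDataOutputStream out = fs.create(p, true /* overwrite */, bufferSize);
    out.write("hello".getBytes());
    out.close();

    FSDataInputStream in = fs.open(p, bufferSize);
    byte[] buf = new byte[5];
    in.readFully(buf);
    in.close();
  }
}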

Doug

Re: Many Checksum Errors

Posted by Raghu Angadi <ra...@yahoo-inc.com>.
Doug Cutting wrote:
> Raghu Angadi wrote:
>> But this will not fix the same problem with block-level checksums. 
>> Pretty soon, HDFS will not use ChecksumFileSystem at all.
> 
> I'd hope that block-level checksums do not replicate logic from 
> ChecksumFileSystem.  Rather they should probably share substantial 
> portions of their checksumming input and output stream implementations, 
> no?  So it could fix the same problem for block-level checksums, and 
> should if possible.

Nope. DFSClient, which implements the client side of block-level 
checksums, does not replicate or reuse any code from ChecksumFileSystem.

>> Ideally we should let the implementations decide how to buffer.
> 
> I'm not sure what you mean by this.  The buffer size is a parameter to 
> FileSystem's open() and create() methods. 

All FSOutputStreams (including DFS) go through a BufferedOutputStream. 
We cannot give a buffer size of 0, but maybe DistributedFileSystem can 
always provide a bufferSize of 1. I will see whether we can easily 
support an explicit option to disable buffering in FSOutputStream.
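
For reference, this is what java.io.BufferedOutputStream does at the 
extremes (a standalone JDK demo, not Hadoop code): a size of 0 is rejected 
by the constructor, while a size of 1 keeps at most one byte in the buffer 
and forwards array writes directly to the underlying stream.

import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;

public class TinyBufferDemo {
  public static void main(String[] args) throws Exception {
    OutputStream sink = new ByteArrayOutputStream();

    // A buffer size of 0 is rejected outright.
    try {
      new BufferedOutputStream(sink, 0);
    } catch (IllegalArgumentException expected) {
      System.out.println("size 0 rejected: " + expected.getMessage());
    }

    // A buffer size of 1 is accepted; any array write of length >= 1 first
    // flushes the (at most one) pending byte and then goes straight to the
    // underlying stream, so this is effectively unbuffered.
    BufferedOutputStream nearlyUnbuffered = new BufferedOutputStream(sink, 1);
    nearlyUnbuffered.write(new byte[] {1, 2, 3});
    nearlyUnbuffered.close();
  }
}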

> Whether checksums require 
> another level of buffering is a separate issue.

Yes. In my patch, DFS uses FSOutputStream with the default buffer, so DFS 
is affected by the same issue, just in a different place.

> Is it efficient to 
> invoke the CRC32 code as each byte is written, or is it faster to run it 
> in 512-byte or larger batches?

In terms of CPU cost, I think it is twice as fast to CRC32 larger buffers 
(> 512 bytes) than small ones. But I don't think it's a very noticeable 
overhead. Do we expect users to do many small writes?

Some measurements I did some time back:

Total size   Read size   Total overhead   Overhead/MB   MB/Overhead
  (MB)        (bytes)        (sec)         (msec/MB)      (MB/sec)
--------------------------------------------------------------------
   128            64          1.30           10.10          100
   128           128          0.98            7.65          130
   128           256          0.80            6.24          160
   128           512          0.71            5.52          180
   128          1024          0.68            5.25          190
   128         10240          0.65            5.12          195
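
For anyone who wants to reproduce the shape of these numbers, a standalone 
timing sketch along the following lines would do (it uses 
java.util.zip.CRC32; the data volume and chunk size are arbitrary, and 
this is not the benchmark that produced the table above):

import java.util.Random;
import java.util.zip.CRC32;

public class Crc32BatchCost {
  public static void main(String[] args) {
    byte[] data = new byte[16 * 1024 * 1024];   // 16 MB of pseudo-random data
    new Random(0).nextBytes(data);

    // Per-byte updates: one library call per byte.
    CRC32 perByte = new CRC32();
    long t0 = System.nanoTime();
    for (byte b : data) {
      perByte.update(b);
    }
    long perByteMs = (System.nanoTime() - t0) / 1000000;

    // Batched updates: one call per 512-byte chunk.
    CRC32 batched = new CRC32();
    long t1 = System.nanoTime();
    for (int off = 0; off < data.length; off += 512) {
      batched.update(data, off, Math.min(512, data.length - off));
    }
    long batchedMs = (System.nanoTime() - t1) / 1000000;

    System.out.println("per-byte: " + perByteMs + " ms, 512-byte batches: "
        + batchedMs + " ms, checksums match: "
        + (perByte.getValue() == batched.getValue()));
  }
}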



> Doug


Re: Many Checksum Errors

Posted by Doug Cutting <cu...@apache.org>.
Raghu Angadi wrote:
> But this will not fix the same problem with block-level checksums. 
> Pretty soon, HDFS will not use ChecksumFileSystem at all.

I'd hope that block-level checksums do not replicate logic from 
ChecksumFileSystem.  Rather they should probably share substantial 
portions of their checksumming input and output stream implementations, 
no?  So it could fix the same problem for block-level checksums, and 
should if possible.

> Ideally we 
> should let the implementations decide how to buffer.

I'm not sure what you mean by this.  The buffer size is a parameter to 
FileSystem's open() and create() methods.  Whether checksums require 
another level of buffering is a separate issue.  Is it efficient to 
invoke the CRC32 code as each byte is written, or is it faster to run it 
in 512-byte or larger batches?

Doug

Re: Many Checksum Errors

Posted by Raghu Angadi <ra...@yahoo-inc.com>.
Hairong Kuang wrote:
> What Doug suggested makes sense. We should make the initial buffer size 
> bytesPerChecksum and use the user-defined buffer size for the second 
> buffer. This will also solve most of the problems that I described in 
> HADOOP-1124.

But this will not fix the same problem with block-level checksums. 
Pretty soon, HDFS will not use ChecksumFileSystem at all. Ideally we 
should let the implementations decide how to buffer.

Raghu.
> Hairong

RE: Many Checksum Errors

Posted by Hairong Kuang <ha...@yahoo-inc.com>.
What Doug suggested makes sense. We should make the initial buffer size 
bytesPerChecksum and use the user-defined buffer size for the second 
buffer. This will also solve most of the problems that I described in 
HADOOP-1124.

Hairong

-----Original Message-----
From: Raghu Angadi [mailto:rangadi@yahoo-inc.com] 
Sent: Wednesday, May 16, 2007 11:39 AM
To: hadoop-dev@lucene.apache.org
Subject: Re: Many Checksum Errors

Doug Cutting wrote:
> [ Moving discussion to hadoop-dev.  -drc ]
> 
> Raghu Angadi wrote:
>> This is good validation of how important ECC memory is. Currently the 
>> HDFS client deletes a block when it notices a checksum error. After the 
>> upcoming move to block-level CRCs, we should make the Datanode 
>> re-validate the block before deciding to delete it.
> 
> It also emphasizes how important end-to-end checksums are.  Data 
> should also be checksummed as soon as possible after it is generated, 
> before it has a chance to be corrupted.
> 
> Ideally, the initial buffer that stores the data should be small, and 
> data should be checksummed as this initial buffer is flushed.

In my implementation of block-level CRCs (does not affect ChecksumFileSystem
in HADOOP-928), we don't buffer checksum data at all. 
As soon as io.bytes.per.checksum bytes are written, the checksum is written
directly to the backup stream. I have removed stream buffering in multiple
places in DFSClient. But it is still affected by the buffering issue you
mentioned below.

> In the
> current implementation, the small checksum buffer is the second 
> buffer, while the initial buffer is the larger io.buffer.size buffer.  To 
> provide maximum protection against memory errors, this situation 
> should be reversed.
> 
> This is discussed in https://issues.apache.org/jira/browse/HADOOP-928. 
> Perhaps a new issue should be filed to reverse the order of these 
> buffers, so that data is checksummed before entering the larger, 
> longer-lived buffer?

This reversal still does not help block-level CRCs. We could remove
buffering altogether at the FileSystem level and let the FS implementations
decide how to buffer.

Raghu.

> Doug



Re: Many Checksum Errors

Posted by Raghu Angadi <ra...@yahoo-inc.com>.
Doug Cutting wrote:
> [ Moving discussion to hadoop-dev.  -drc ]
> 
> Raghu Angadi wrote:
>> This is good validation of how important ECC memory is. Currently the 
>> HDFS client deletes a block when it notices a checksum error. After the 
>> upcoming move to block-level CRCs, we should make the Datanode 
>> re-validate the block before deciding to delete it.
> 
> It also emphasizes how important end-to-end checksums are.  Data should 
> also be checksummed as soon as possible after it is generated, before it 
> has a chance to be corrupted.
> 
> Ideally, the initial buffer that stores the data should be small, and 
> data should be checksummed as this initial buffer is flushed.

In my implementation of block-level CRCs (does not affect 
ChecksumFileSystem in HADOOP-928), we don't buffer checksum data at all. 
As soon as io.bytes.per.checksum bytes are written, the checksum is 
written directly to the backup stream. I have removed stream buffering 
in multiple places in DFSClient. But it is still affected by the 
buffering issue you mentioned below.
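
A minimal sketch of that write path (illustrative only; the class and 
method names are invented and this is not the DFSClient code): every time 
io.bytes.per.checksum bytes have accumulated, the chunk and its CRC go 
straight to the backup stream, with no separate checksum buffer in between.

import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.CRC32;

public class UnbufferedChunkChecksums {
  public static void copyWithChecksums(InputStream src, OutputStream backup,
                                       int bytesPerChecksum) throws IOException {
    DataOutputStream out = new DataOutputStream(backup);
    byte[] chunk = new byte[bytesPerChecksum];
    CRC32 crc = new CRC32();
    int n;
    while ((n = readChunk(src, chunk)) > 0) {
      crc.reset();
      crc.update(chunk, 0, n);
      out.write(chunk, 0, n);              // the chunk itself
      out.writeInt((int) crc.getValue());  // its CRC, written immediately
    }
    out.flush();
  }

  // Fill 'buf' as far as possible; returns the number of bytes read.
  private static int readChunk(InputStream in, byte[] buf) throws IOException {
    int total = 0;
    while (total < buf.length) {
      int r = in.read(buf, total, buf.length - total);
      if (r < 0) break;
      total += r;
    }
    return total;
  }
}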

> In the 
> current implementation, the small checksum buffer is the second buffer, 
> while the initial buffer is the larger io.buffer.size buffer.  To provide 
> maximum protection against memory errors, this situation should be 
> reversed.
> 
> This is discussed in https://issues.apache.org/jira/browse/HADOOP-928. 
> Perhaps a new issue should be filed to reverse the order of these 
> buffers, so that data is checksummed before entering the larger, 
> longer-lived buffer?

This reversal still does not help block-level CRCs. We could remove 
buffering altogether at the FileSystem level and let the FS 
implementations decide how to buffer.

Raghu.

> Doug