Posted to user@hbase.apache.org by Krishna Rao <kr...@gmail.com> on 2014/04/24 14:55:08 UTC

HBase checksum vs HDFS checksum

Hi all,

I understand that there is a significant performance gain from turning on
short-circuit reads, and a further gain from setting HBase to do checksums
rather than HDFS.

However, I'm a little confused by this: do I need to turn off checksums
within HDFS for the entire file system? We don't just use HBase on our
cluster, so this would seem to be a bad idea, right?

Cheers,

Krishna

Re: HBase checksum vs HDFS checksum

Posted by Stack <st...@duboce.net>.
On Tue, Apr 29, 2014 at 11:53 AM, Stack <st...@duboce.net> wrote:

> On Tue, Apr 29, 2014 at 1:54 AM, Krishna Rao <kr...@gmail.com> wrote:
>
>> Thank you for your reply Anoop.
>>
>> However, the confusion is, unfortunately, still there because of the
>> following (from
>> here<http://hbase.apache.org/book.html#perf.hdfs.configs.localread>
>> ):
>>
>> "For optimal performance when short-circuit reads are enabled, it is
>> recommended that HDFS checksums are disabled. To maintain data integrity
>> with HDFS checksums disabled, HBase can be configured to write its own
>> checksums into its datablocks and verify against these"
>>
>>
> The text is confusing.  If you read the next sentence and click on the
> description under hbase.regionserver.checksum.verify<http://hbase.apache.org/book.html#hbase.regionserver.checksum.verify> it
> should be a little more clear.
>
> The confusion comes of the little configuration dance that is necessary
> around hbase writing checksums optionally inline into hfiles
>

Correction: we seem to always write HBase checksums inline with the data.
See
http://hbase.apache.org/xref/org/apache/hadoop/hbase/regionserver/HStore.html#901
The HBase checksums are always present.  The flag, then, only controls
whether they are used at read time.  If so, at read time we ask HDFS for a
stream that does not validate checksums (if there is an error on this
stream, we reopen it, asking HDFS to do checksum validation).
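
A toy model of the read path described above may help. This is only a
sketch under stated assumptions: every class and method name below is
hypothetical (none of it is HBase or HDFS API), CRC32 stands in for whatever
checksum type is configured, and the "file system" is just a second stored
checksum. It shows the two points made above: both checksums are always
written, and the flag only decides which one is checked first at read time.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Toy model of the read path; hypothetical names, not HBase or HDFS code. */
public class ChecksumReadSketch {

    /** A stored block: payload plus the two independently kept checksums. */
    static final class StoredBlock {
        final byte[] data;
        final long inlineChecksum;     // written by HBase next to the data
        final long fileSystemChecksum; // kept by the file system regardless

        StoredBlock(byte[] data, long inlineChecksum, long fileSystemChecksum) {
            this.data = data;
            this.inlineChecksum = inlineChecksum;
            this.fileSystemChecksum = fileSystemChecksum;
        }
    }

    /** Which layer accepted the block. */
    enum VerifiedBy { HBASE_INLINE, FILE_SYSTEM }

    static long crc(byte[] bytes) {
        CRC32 c = new CRC32();
        c.update(bytes, 0, bytes.length);
        return c.getValue();
    }

    /** Write path: both checksums are recorded, whatever the read flag says. */
    static StoredBlock write(byte[] data) {
        long sum = crc(data);
        return new StoredBlock(data.clone(), sum, sum);
    }

    /** Read path: the flag only selects which verification is tried first. */
    static VerifiedBy read(StoredBlock block, boolean useHBaseChecksum) {
        if (useHBaseChecksum) {
            // Stream opened without file-system validation; check inline sum.
            if (crc(block.data) == block.inlineChecksum) {
                return VerifiedBy.HBASE_INLINE;
            }
            // Mismatch: fall through and "reopen" with file-system validation.
        }
        if (crc(block.data) != block.fileSystemChecksum) {
            throw new IllegalStateException("file-system checksum mismatch");
        }
        return VerifiedBy.FILE_SYSTEM;
    }

    public static void main(String[] args) {
        StoredBlock block = write("row-1".getBytes(StandardCharsets.UTF_8));
        System.out.println(read(block, true));
        System.out.println(read(block, false));
    }
}
```

In this model a reader that passes `false` (standing in for any non-HBase
client) is always verified by the file-system checksum, which is why the
HBase setting does not weaken integrity checks for other data on the cluster.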

St.Ack



> so they are available inline at read time and the interaction w/ native
> hdfs checksumming.  When running with hbase checksumming of hfiles, we want
> a means of telling HDFS to NOT validate the checksum -- i.e. to avoid double
> checksumming -- because hbase will be doing it (unless there is an error,
> and then we'll fall back to HDFS validation).  Let me try and clean up the
> docs.
>
> St.Ack
>
>
>
>> To me it implies that HDFS checksums need to be disabled, meaning that HDFS
>> wouldn't write checksums into its datablocks. But HBase would be fine by
>> writing its own checksum.

Re: HBase checksum vs HDFS checksum

Posted by Stack <st...@duboce.net>.
On Tue, Apr 29, 2014 at 1:54 AM, Krishna Rao <kr...@gmail.com> wrote:

> Thank you for your reply Anoop.
>
> However, the confusion is, unfortunately, still there because of the
> following (from
> here<http://hbase.apache.org/book.html#perf.hdfs.configs.localread>
> ):
>
> "For optimal performance when short-circuit reads are enabled, it is
> recommended that HDFS checksums are disabled. To maintain data integrity
> with HDFS checksums disabled, HBase can be configured to write its own
> checksums into its datablocks and verify against these"
>
>
The text is confusing.  If you read the next sentence and click on the
description under
hbase.regionserver.checksum.verify<http://hbase.apache.org/book.html#hbase.regionserver.checksum.verify>
it should be a little more clear.

The confusion comes of the little configuration dance that is necessary
around hbase writing checksums optionally inline into hfiles so they are
available inline at read time and the interaction w/ native hdfs
checksumming.  When running with hbase checksumming of hfiles, we want a
means of telling HDFS to NOT validate the checksum -- i.e. to avoid double
checksumming -- because hbase will be doing it (unless there is an error,
and then we'll fall back to HDFS validation).  Let me try and clean up the
docs.
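
For reference, the property linked above is set on the HBase side only. A
minimal fragment, assuming it goes in hbase-site.xml on the region servers
(the default value differs between HBase versions, so check the book section
for the release in use):

```xml
<!-- hbase-site.xml: have HBase verify its own inline hfile checksums at
     read time, falling back to HDFS checksum validation on a mismatch -->
<property>
  <name>hbase.regionserver.checksum.verify</name>
  <value>true</value>
</property>
```

Nothing here changes hdfs-site.xml or how HDFS writes its own checksums, so
other users of the same file system are not affected by it.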

St.Ack



> To me it implies that HDFS checksums need to be disabled, meaning that HDFS
> wouldn't write checksums into its datablocks. But HBase would be fine by
> writing its own checksum.

Re: HBase checksum vs HDFS checksum

Posted by Krishna Rao <kr...@gmail.com>.
Thank you for your reply Anoop.

However, the confusion is, unfortunately, still there because of the
following (from
here<http://hbase.apache.org/book.html#perf.hdfs.configs.localread>
):

"For optimal performance when short-circuit reads are enabled, it is
recommended that HDFS checksums are disabled. To maintain data integrity
with HDFS checksums disabled, HBase can be configured to write its own
checksums into its datablocks and verify against these"

To me it implies that HDFS checksums need to be disabled, meaning that HDFS
wouldn't write checksums into its datablocks. But HBase would be fine by
writing its own checksum.


On 29 April 2014 09:32, Anoop John <an...@gmail.com> wrote:

> HBase using its own checksum handling doesn't directly affect HDFS. HDFS will
> still maintain checksum info.  The difference is at read time: HBase will
> open the reader with checksum validation off and do the checksum
> validation on its own.  So using HBase-handled checksums in a cluster
> should not affect other data.  Does that resolve your doubt?
>
> -Anoop-

Re: HBase checksum vs HDFS checksum

Posted by Anoop John <an...@gmail.com>.
HBase using its own checksum handling doesn't directly affect HDFS. HDFS will
still maintain checksum info.  The difference is at read time: HBase will
open the reader with checksum validation off and do the checksum
validation on its own.  So using HBase-handled checksums in a cluster
should not affect other data.  Does that resolve your doubt?

-Anoop-

On Tue, Apr 29, 2014 at 1:58 PM, Krishna Rao <kr...@gmail.com> wrote:

> Hi Ted,
>
> I had read those, but I'm confused about how this will affect non-HBase
> HDFS data. With HDFS checksumming off won't it affect data integrity?
>
> Krishna

Re: HBase checksum vs HDFS checksum

Posted by Krishna Rao <kr...@gmail.com>.
Hi Ted,

I had read those, but I'm confused about how this will affect non-HBase
HDFS data. With HDFS checksumming off won't it affect data integrity?

Krishna


On 24 April 2014 15:54, Ted Yu <yu...@gmail.com> wrote:

> Please take a look at the following:
>
> http://hbase.apache.org/book.html#perf.hdfs.configs.localread
> http://hbase.apache.org/book.html#hbase.regionserver.checksum.verify

Re: HBase checksum vs HDFS checksum

Posted by Ted Yu <yu...@gmail.com>.
Please take a look at the following:

http://hbase.apache.org/book.html#perf.hdfs.configs.localread
http://hbase.apache.org/book.html#hbase.regionserver.checksum.verify


On Thu, Apr 24, 2014 at 5:55 AM, Krishna Rao <kr...@gmail.com> wrote:

> Hi all,
>
> I understand that there is a significant performance gain from turning on
> short-circuit reads, and a further gain from setting HBase to do checksums
> rather than HDFS.
>
> However, I'm a little confused by this: do I need to turn off checksums
> within HDFS for the entire file system? We don't just use HBase on our
> cluster, so this would seem to be a bad idea, right?
>
>  Cheers,
>
> Krishna
>
