Posted to user@hbase.apache.org by Otis Gospodnetic <ot...@yahoo.com> on 2011/02/16 13:30:35 UTC

Major compactions and OS cache

Hi,

Over on http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html I saw 
this bit:

"The most important factor is that HBase is not restarted frequently and that it 

performs house keeping on a regular basis. These so called compactions rewrite 
files as new data is added over time. All files in HDFS once written are 
immutable (for all sorts of reasons). Because of  that, data is written into 
new files and as their number grows HBase compacts them into another set of 
new, consolidated files. And here is  the kicker: HDFS is smart enough to put 
the data where it is needed!"

... and I always wondered what this does to the OS cache.  In some applications 
(non-HBase stuff, say full-text search), the OS cache plays a crucial role in 
how the system performs.  If you have to hit the disk too much, you're in 
trouble, so one of the things you avoid is making big changes to index files on 
disk in order to avoid invalidating data that's been nicely cached by the OS.

However, with HBase, and especially major compactions, what happens with the OS 
cache?  All gone, right?
Do people find this problematic?
Or does the OS cache simply not play such a significant role in systems running 
HBase simply because the data it holds and that needs to be accessed is much 
bigger than the OS cache could ever be, so even with the OS cache full and hot, 
other data would still have to be read from disk anyway?

Thanks,
Otis

Re: Major compactions and OS cache

Posted by Andrew Purtell <ap...@apache.org>.
OK, so in my fixed-up version of the patch the DN validates the block token before handing out the file location, so this is not arbitrary access. It does mean, however, that the hbase user and the hdfs user must both have read permission on the local DFS data directories for the sharing to work, and that elevates the hbase user to a special status indeed.

Best regards,

    - Andy

Problems worthy of attack prove their worth by hitting back.
  - Piet Hein (via Tom White)


--- On Wed, 2/16/11, Ryan Rawson <ry...@gmail.com> wrote:

> From: Ryan Rawson <ry...@gmail.com>
> Subject: Re: Major compactions and OS cache
> To: "Jason Rutherglen" <ja...@gmail.com>
> Cc: "Edward Capriolo" <ed...@gmail.com>, user@hbase.apache.org
> Date: Wednesday, February 16, 2011, 6:40 PM
> I can't say, I think there just isn't
> a push for it since mapreduce
> would not benefit from it as much as HBase. Furthermore, the
> patch
> proposals have to deal with HDFS security, and the one I'm
> testing
> just does not worry about security (and hence is a security
> hole
> itself).
> 
> HDFS is just a slow-moving project, alas.
> 
> On Wed, Feb 16, 2011 at 6:35 PM, Jason Rutherglen
> <ja...@gmail.com>
> wrote:
> >> There is a patch that causes us to evict the block
> cache on close of
> >> hfile, and populate the block cache during
> compaction write out.  This
> >> is included in 0.90.
> >
> > That's good!
> >
> >> HDFS-347, which is a huge
> >> clear win but still no plans to include it in any
> hadoop version.
> >
> > Why's that?  It seems to be fairly logical.  Does it
> affect the
> > 'over-the-wire' protocol?
> >
> > On Wed, Feb 16, 2011 at 6:23 PM, Ryan Rawson <ry...@gmail.com>
> wrote:
> >> There is a patch that causes us to evict the block
> cache on close of
> >> hfile, and populate the block cache during
> compaction write out.  This
> >> is included in 0.90.
> >>
> >> So that helps.  Fixing VFS issues is quite a bit
> longer term, since
> >> the on-wire format of HDFS rpc is kind of "fixed",
> petitioning for
> >> changes will be a little tricky. Again, see
> HDFS-347, which is a huge
> >> clear win but still no plans to include it in any
> hadoop version.
> >>
> >> -ryan
> >>
> >
> 



Re: Major compactions and OS cache

Posted by Ryan Rawson <ry...@gmail.com>.
I can't say, I think there just isn't a push for it since mapreduce
would not benefit from it as much as HBase. Furthermore, the patch
proposals have to deal with HDFS security, and the one I'm testing
just does not worry about security (and hence is a security hole
itself).

HDFS is just a slow-moving project, alas.

On Wed, Feb 16, 2011 at 6:35 PM, Jason Rutherglen
<ja...@gmail.com> wrote:
>> There is a patch that causes us to evict the block cache on close of
>> hfile, and populate the block cache during compaction write out.  This
>> is included in 0.90.
>
> That's good!
>
>> HDFS-347, which is a huge
>> clear win but still no plans to include it in any hadoop version.
>
> Why's that?  It seems to be fairly logical.  Does it affect the
> 'over-the-wire' protocol?
>
> On Wed, Feb 16, 2011 at 6:23 PM, Ryan Rawson <ry...@gmail.com> wrote:
>> There is a patch that causes us to evict the block cache on close of
>> hfile, and populate the block cache during compaction write out.  This
>> is included in 0.90.
>>
>> So that helps.  Fixing VFS issues is quite a bit longer term, since
>> the on-wire format of HDFS rpc is kind of "fixed", petitioning for
>> changes will be a little tricky. Again, see HDFS-347, which is a huge
>> clear win but still no plans to include it in any hadoop version.
>>
>> -ryan
>>
>

Re: Major compactions and OS cache

Posted by Jason Rutherglen <ja...@gmail.com>.
> There is a patch that causes us to evict the block cache on close of
> hfile, and populate the block cache during compaction write out.  This
> is included in 0.90.

That's good!

> HDFS-347, which is a huge
> clear win but still no plans to include it in any hadoop version.

Why's that?  It seems to be fairly logical.  Does it affect the
'over-the-wire' protocol?

On Wed, Feb 16, 2011 at 6:23 PM, Ryan Rawson <ry...@gmail.com> wrote:
> There is a patch that causes us to evict the block cache on close of
> hfile, and populate the block cache during compaction write out.  This
> is included in 0.90.
>
> So that helps.  Fixing VFS issues is quite a bit longer term, since
> the on-wire format of HDFS rpc is kind of "fixed", petitioning for
> changes will be a little tricky. Again, see HDFS-347, which is a huge
> clear win but still no plans to include it in any hadoop version.
>
> -ryan
>

Re: Major compactions and OS cache

Posted by Chris Tarnas <cf...@email.com>.
Count me as someone interested...

thanks!
-chris

On Feb 17, 2011, at 12:52 AM, Andrew Purtell wrote:

> On Wed, 2/16/11, Ryan Rawson <ry...@gmail.com> wrote:
>> Again, see HDFS-347, which is a huge
>> clear win but still no plans to include it in any hadoop
>> version.
> 
> I ported Ryan's patch for 0.20-append on HDFS-347 on top of CDH3B3 and it's going into preproduction. We might be a bit more aggressive than average (also have pooling patch from HDFS-918) but 40-50% latency reduction is... intriguing. I can report on our experiences if anyone in Cloudera-land is interested.
> 
> Best regards,
> 
>    - Andy
> 
> Problems worthy of attack prove their worth by hitting back.
>  - Piet Hein (via Tom White)


Re: Major compactions and OS cache

Posted by Andrew Purtell <ap...@apache.org>.
On Wed, 2/16/11, Ryan Rawson <ry...@gmail.com> wrote:
> Again, see HDFS-347, which is a huge
> clear win but still no plans to include it in any hadoop
> version.

I ported Ryan's patch for 0.20-append on HDFS-347 on top of CDH3B3 and it's going into preproduction. We might be a bit more aggressive than average (also have pooling patch from HDFS-918) but 40-50% latency reduction is... intriguing. I can report on our experiences if anyone in Cloudera-land is interested.

Best regards,

    - Andy

Problems worthy of attack prove their worth by hitting back.
  - Piet Hein (via Tom White)






Re: Major compactions and OS cache

Posted by Ryan Rawson <ry...@gmail.com>.
There is a patch that causes us to evict the block cache on close of
hfile, and populate the block cache during compaction write out.  This
is included in 0.90.

So that helps.  Fixing VFS issues is quite a bit longer term; since
the on-wire format of HDFS RPC is kind of "fixed", petitioning for
changes will be a little tricky. Again, see HDFS-347, which is a huge,
clear win but still has no plans to be included in any Hadoop version.
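In outline, the evict-on-close / cache-on-write behavior Ryan describes might look like the following (a hypothetical sketch with invented names, not the actual HBase implementation): blocks are keyed by hfile name plus offset, the compaction writer caches blocks as it writes them out, and closing an hfile evicts all of that file's blocks at once.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a block cache that supports cache-on-write
// during compaction and evict-on-close of an hfile. Names are invented.
class BlockCacheSketch {
    private final Map<String, byte[]> cache;

    BlockCacheSketch(final int maxBlocks) {
        // Access-ordered LinkedHashMap gives a simple LRU policy.
        this.cache = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> e) {
                return size() > maxBlocks;
            }
        };
    }

    private static String key(String hfile, long offset) {
        return hfile + "@" + offset;
    }

    // Compaction writer: cache each block as it is written, so the new
    // file is already hot when it replaces the old ones.
    void cacheOnWrite(String hfile, long offset, byte[] block) {
        cache.put(key(hfile, offset), block);
    }

    byte[] getBlock(String hfile, long offset) {
        return cache.get(key(hfile, offset));
    }

    // Close of an hfile after compaction: evict all of its blocks so
    // they do not linger as dead entries. Returns the eviction count.
    int evictBlocksByHfile(String hfile) {
        int evicted = 0;
        Iterator<String> it = cache.keySet().iterator();
        while (it.hasNext()) {
            if (it.next().startsWith(hfile + "@")) {
                it.remove();
                evicted++;
            }
        }
        return evicted;
    }

    int size() {
        return cache.size();
    }

    public static void main(String[] args) {
        BlockCacheSketch bc = new BlockCacheSketch(1000);
        bc.cacheOnWrite("old-hfile", 0, new byte[]{1});
        bc.cacheOnWrite("old-hfile", 65536, new byte[]{2});
        bc.cacheOnWrite("new-hfile", 0, new byte[]{3});
        // prints evicted=2 remaining=1
        System.out.println("evicted=" + bc.evictBlocksByHfile("old-hfile")
                + " remaining=" + bc.size());
    }
}
```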

-ryan

On Wed, Feb 16, 2011 at 6:21 PM, Edward Capriolo <ed...@gmail.com> wrote:
> On Wed, Feb 16, 2011 at 3:09 PM, Jason Rutherglen
> <ja...@gmail.com> wrote:
>> This comment https://issues.apache.org/jira/browse/HDFS-347?focusedCommentId=12991734&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12991734
>> is interesting as in Lucene the IO cache is relied on, one would
>> assume that HBase'd be the same?
>>
>> On Wed, Feb 16, 2011 at 11:48 AM, Ryan Rawson <ry...@gmail.com> wrote:
>>> That would be cool, I think we should probably also push for HDFS-347
>>> while we are at it as well. The situation for HDFS improvements has
>>> not been good, but might improve in the mid-future.
>>>
>>> Thanks for the pointer!
>>> -ryan
>>>
>>> On Wed, Feb 16, 2011 at 11:40 AM, Jason Rutherglen
>>> <ja...@gmail.com> wrote:
>>>>> One of my coworkers is reminding me that major compactions do have the
>>>>> well-known side effect of slowing down a busy system.
>>>>
>>>> I think where this is going is the system IO cache problem could be
>>>> solved with something like DirectIOLinuxDirectory:
>>>> https://issues.apache.org/jira/browse/LUCENE-2500  Of course the issue
>>>> would be integrating DIOLD or its underlying native implementation
>>>> into HDFS somehow?
>>>>
>>>
>>
>
> This seems to be a common issue across the "write once and compact"
> model; it tends to vaporize the page cache. Cassandra is working on
> similar trickery at the file system level. Another interesting idea is
> the concept of re-warming the cache after a compaction:
> https://issues.apache.org/jira/browse/CASSANDRA-1878.
>
> I would assume that users of HBase rely more on the HBase block cache
> than the VFS cache. Out of curiosity, do people run with, say, 24 GB of
> memory split as 4 GB Xmx DataNode, 16 GB Xmx RegionServer (block
> cache), and 4 GB VFS cache?
>
> I always suggest firing off the major compaction at a low-traffic time
> (if you have such a time) so it has the least impact.
>

Re: Major compactions and OS cache

Posted by Jason Rutherglen <ja...@gmail.com>.
> Another interesting idea is
> the concept of re-warming the cache after a compaction

That's probably the best approach for now.  O_DIRECT would only be
used for reading the old files, though in lieu of that we'd still
need/want to warm the new files.  E.g., the old files are probably still
being used, so warming the new files just prior to putting them online
should affect IO latency the least?
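A minimal sketch of that warming idea (hypothetical, not HBase code): stream once through the freshly compacted file before swapping it in, so its pages are faulted into the OS cache ahead of client reads.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: warm the OS page cache for a newly written file
// by reading through it sequentially before it goes online.
class FileWarmer {
    // Returns the number of bytes touched (pulled through the cache).
    static long warm(Path file) throws IOException {
        byte[] buf = new byte[64 * 1024]; // read in 64 KB chunks
        long touched = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                touched += n; // the read itself faults the pages in
            }
        }
        return touched;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("compacted-hfile", ".tmp");
        Files.write(tmp, new byte[256 * 1024]); // stand-in for the new hfile
        System.out.println("warmed " + warm(tmp) + " bytes");
        Files.delete(tmp);
    }
}
```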

On Wed, Feb 16, 2011 at 6:21 PM, Edward Capriolo <ed...@gmail.com> wrote:
> On Wed, Feb 16, 2011 at 3:09 PM, Jason Rutherglen
> <ja...@gmail.com> wrote:
>> This comment https://issues.apache.org/jira/browse/HDFS-347?focusedCommentId=12991734&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12991734
>> is interesting as in Lucene the IO cache is relied on, one would
>> assume that HBase'd be the same?
>>
>> On Wed, Feb 16, 2011 at 11:48 AM, Ryan Rawson <ry...@gmail.com> wrote:
>>> That would be cool, I think we should probably also push for HDFS-347
>>> while we are at it as well. The situation for HDFS improvements has
>>> not been good, but might improve in the mid-future.
>>>
>>> Thanks for the pointer!
>>> -ryan
>>>
>>> On Wed, Feb 16, 2011 at 11:40 AM, Jason Rutherglen
>>> <ja...@gmail.com> wrote:
>>>>> One of my coworkers is reminding me that major compactions do have the
>>>>> well-known side effect of slowing down a busy system.
>>>>
>>>> I think where this is going is the system IO cache problem could be
>>>> solved with something like DirectIOLinuxDirectory:
>>>> https://issues.apache.org/jira/browse/LUCENE-2500  Of course the issue
>>>> would be integrating DIOLD or its underlying native implementation
>>>> into HDFS somehow?
>>>>
>>>
>>
>
> This seems to be a common issue across the "write once and compact"
> model; it tends to vaporize the page cache. Cassandra is working on
> similar trickery at the file system level. Another interesting idea is
> the concept of re-warming the cache after a compaction:
> https://issues.apache.org/jira/browse/CASSANDRA-1878.
>
> I would assume that users of HBase rely more on the HBase block cache
> than the VFS cache. Out of curiosity, do people run with, say, 24 GB of
> memory split as 4 GB Xmx DataNode, 16 GB Xmx RegionServer (block
> cache), and 4 GB VFS cache?
>
> I always suggest firing off the major compaction at a low-traffic time
> (if you have such a time) so it has the least impact.
>

Re: Major compactions and OS cache

Posted by Edward Capriolo <ed...@gmail.com>.
On Wed, Feb 16, 2011 at 3:09 PM, Jason Rutherglen
<ja...@gmail.com> wrote:
> This comment https://issues.apache.org/jira/browse/HDFS-347?focusedCommentId=12991734&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12991734
> is interesting as in Lucene the IO cache is relied on, one would
> assume that HBase'd be the same?
>
> On Wed, Feb 16, 2011 at 11:48 AM, Ryan Rawson <ry...@gmail.com> wrote:
>> That would be cool, I think we should probably also push for HDFS-347
>> while we are at it as well. The situation for HDFS improvements has
>> not been good, but might improve in the mid-future.
>>
>> Thanks for the pointer!
>> -ryan
>>
>> On Wed, Feb 16, 2011 at 11:40 AM, Jason Rutherglen
>> <ja...@gmail.com> wrote:
>>>> One of my coworkers is reminding me that major compactions do have the
>>>> well-known side effect of slowing down a busy system.
>>>
>>> I think where this is going is the system IO cache problem could be
>>> solved with something like DirectIOLinuxDirectory:
>>> https://issues.apache.org/jira/browse/LUCENE-2500  Of course the issue
>>> would be integrating DIOLD or its underlying native implementation
>>> into HDFS somehow?
>>>
>>
>

This seems to be a common issue across the "write once and compact"
model; it tends to vaporize the page cache. Cassandra is working on
similar trickery at the file system level. Another interesting idea is
the concept of re-warming the cache after a compaction:
https://issues.apache.org/jira/browse/CASSANDRA-1878.

I would assume that users of HBase rely more on the HBase block cache
than the VFS cache. Out of curiosity, do people run with, say, 24 GB of
memory split as 4 GB Xmx DataNode, 16 GB Xmx RegionServer (block
cache), and 4 GB VFS cache?

I always suggest firing off the major compaction at a low-traffic time
(if you have such a time) so it has the least impact.

Re: Major compactions and OS cache

Posted by Jason Rutherglen <ja...@gmail.com>.
This comment https://issues.apache.org/jira/browse/HDFS-347?focusedCommentId=12991734&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12991734
is interesting: Lucene relies on the OS IO cache, so one would
assume HBase would be the same?

On Wed, Feb 16, 2011 at 11:48 AM, Ryan Rawson <ry...@gmail.com> wrote:
> That would be cool, I think we should probably also push for HDFS-347
> while we are at it as well. The situation for HDFS improvements has
> not been good, but might improve in the mid-future.
>
> Thanks for the pointer!
> -ryan
>
> On Wed, Feb 16, 2011 at 11:40 AM, Jason Rutherglen
> <ja...@gmail.com> wrote:
>>> One of my coworkers is reminding me that major compactions do have the
>>> well-known side effect of slowing down a busy system.
>>
>> I think where this is going is the system IO cache problem could be
>> solved with something like DirectIOLinuxDirectory:
>> https://issues.apache.org/jira/browse/LUCENE-2500  Of course the issue
>> would be integrating DIOLD or its underlying native implementation
>> into HDFS somehow?
>>
>

Re: Major compactions and OS cache

Posted by Ryan Rawson <ry...@gmail.com>.
That would be cool. I think we should probably also push for HDFS-347
while we are at it. The situation for HDFS improvements has
not been good, but might improve in the mid-future.

Thanks for the pointer!
-ryan

On Wed, Feb 16, 2011 at 11:40 AM, Jason Rutherglen
<ja...@gmail.com> wrote:
>> One of my coworkers is reminding me that major compactions do have the
>> well-known side effect of slowing down a busy system.
>
> I think where this is going is the system IO cache problem could be
> solved with something like DirectIOLinuxDirectory:
> https://issues.apache.org/jira/browse/LUCENE-2500  Of course the issue
> would be integrating DIOLD or its underlying native implementation
> into HDFS somehow?
>

Re: Major compactions and OS cache

Posted by Jason Rutherglen <ja...@gmail.com>.
> One of my coworkers is reminding me that major compactions do have the
> well-known side effect of slowing down a busy system.

I think where this is going is that the system IO cache problem could be
solved with something like DirectIOLinuxDirectory:
https://issues.apache.org/jira/browse/LUCENE-2500  Of course, the issue
would be integrating DIOLD or its underlying native implementation
into HDFS somehow?
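For what it's worth, newer JDKs can come close to O_DIRECT from Java without JNI: JDK 10+ exposes com.sun.nio.file.ExtendedOpenOption.DIRECT, which bypasses the page cache so a bulk, compaction-style read does not evict hot data. A hypothetical sketch (alignment handling is illustrative, and it falls back to a normal buffered read where the JDK or filesystem rejects direct I/O):

```java
import com.sun.nio.file.ExtendedOpenOption;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: read a whole file with O_DIRECT where supported
// (JDK 10+, filesystem permitting), so the read bypasses the OS page
// cache instead of evicting hot data. Falls back to a buffered read.
class DirectReadSketch {
    static byte[] readAll(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.READ, ExtendedOpenOption.DIRECT)) {
            // O_DIRECT requires buffer address and I/O size alignment.
            int align = (int) Files.getFileStore(file).getBlockSize();
            long size = Files.size(file); // sketch assumes it fits in int
            int cap = (int) ((size + align - 1) / align * align); // round up
            ByteBuffer buf = ByteBuffer.allocateDirect(cap + align)
                                       .alignedSlice(align);
            while (buf.hasRemaining()) {
                if (ch.read(buf) < 0) {
                    break; // EOF: the final block is a short read
                }
            }
            buf.flip();
            byte[] out = new byte[(int) size];
            buf.get(out, 0, out.length);
            return out;
        } catch (IOException | UnsupportedOperationException e) {
            // tmpfs and some JDKs/filesystems reject O_DIRECT
            return Files.readAllBytes(file);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("direct-read", ".bin");
        Files.write(tmp, "hello, page cache".getBytes());
        System.out.println(new String(readAll(tmp)));
        Files.delete(tmp);
    }
}
```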

Re: Major compactions and OS cache

Posted by Jean-Daniel Cryans <jd...@apache.org>.
One of my coworkers is reminding me that major compactions do have the
well-known side effect of slowing down a busy system.

Are we able to assert that the performance degradation is due to the
OS cache being invalidated? Or is it because of all the disk IO being
used? Or because the block cache gets screwed up? Or because it
also requires a full CPU to major compact?

The answer is probably "all of the above".

J-D

On Wed, Feb 16, 2011 at 10:03 AM, Jean-Daniel Cryans
<jd...@apache.org> wrote:
> Hi Otis,
>
> Excellent reflection; unfortunately I don't think anyone has benchmarked it
> to give a definitive answer.
>
> One thing I'm sure of is that worse than screwing up the OS cache, it
> also screws up the block cache! But this is the price to pay to clear
> up old versions and regroup all store files into 1. If you're not
> deleting a whole lot, or updating the same fields a ton, then maybe
> you should explore setting a larger window between each major
> compaction (current being once every 24h). I know some people just
> plain disable major compactions because they are never overwriting
> values.
>
> J-D
>
> On Wed, Feb 16, 2011 at 4:30 AM, Otis Gospodnetic
> <ot...@yahoo.com> wrote:
>> Hi,
>>
>> Over on http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html I saw
>> this bit:
>>
>> "The most important factor is that HBase is not restarted frequently and that it
>>
>> performs house keeping on a regular basis. These so called compactions rewrite
>> files as new data is added over time. All files in HDFS once written are
>> immutable (for all sorts of reasons). Because of  that, data is written into
>> new files and as their number grows HBase compacts them into another set of
>> new, consolidated files. And here is  the kicker: HDFS is smart enough to put
>> the data where it is needed!"
>>
>> ... and I always wondered what this does to the OS cache.  In some applications
>> (non-HBase stuff, say full-text search), the OS cache plays a crucial role in
>> how the system performs.  If you have to hit the disk too much, you're in
>> trouble, so one of the things you avoid is making big changes to index files on
>> disk in order to avoid invalidating data that's been nicely cached by the OS.
>>
>> However, with HBase, and especially major compactions, what happens with the OS
>> cache?  All gone, right?
>> Do people find this problematic?
>> Or does the OS cache simply not play such a significant role in systems running
>> HBase simply because the data it holds and that needs to be accessed is much
>> bigger than the OS cache could ever be, so even with the OS cache full and hot,
>> other data would still have to be read from disk anyway?
>>
>> Thanks,
>> Otis
>>
>

Re: Major compactions and OS cache

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Hi Otis,

Excellent reflection; unfortunately I don't think anyone has benchmarked it
to give a definitive answer.

One thing I'm sure of is that, worse than screwing up the OS cache, it
also screws up the block cache! But this is the price to pay to clear
out old versions and regroup all store files into one. If you're not
deleting a whole lot, or updating the same fields a ton, then maybe
you should explore setting a larger window between major
compactions (the current default being once every 24h). I know some
people just plain disable major compactions because they are never
overwriting values.
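For reference, the knob in question is hbase.hregion.majorcompaction in hbase-site.xml (in milliseconds; the interval value below is illustrative):

```xml
<!-- hbase-site.xml: widen or disable the automatic major compaction window -->
<property>
  <name>hbase.hregion.majorcompaction</name>
  <!-- default 86400000 (24h); 604800000 = 7 days; 0 disables time-based
       major compactions so they can be triggered manually instead -->
  <value>604800000</value>
</property>
```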

J-D

On Wed, Feb 16, 2011 at 4:30 AM, Otis Gospodnetic
<ot...@yahoo.com> wrote:
> Hi,
>
> Over on http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html I saw
> this bit:
>
> "The most important factor is that HBase is not restarted frequently and that it
>
> performs house keeping on a regular basis. These so called compactions rewrite
> files as new data is added over time. All files in HDFS once written are
> immutable (for all sorts of reasons). Because of  that, data is written into
> new files and as their number grows HBase compacts them into another set of
> new, consolidated files. And here is  the kicker: HDFS is smart enough to put
> the data where it is needed!"
>
> ... and I always wondered what this does to the OS cache.  In some applications
> (non-HBase stuff, say full-text search), the OS cache plays a crucial role in
> how the system performs.  If you have to hit the disk too much, you're in
> trouble, so one of the things you avoid is making big changes to index files on
> disk in order to avoid invalidating data that's been nicely cached by the OS.
>
> However, with HBase, and especially major compactions, what happens with the OS
> cache?  All gone, right?
> Do people find this problematic?
> Or does the OS cache simply not play such a significant role in systems running
> HBase simply because the data it holds and that needs to be accessed is much
> bigger than the OS cache could ever be, so even with the OS cache full and hot,
> other data would still have to be read from disk anyway?
>
> Thanks,
> Otis
>