Posted to user@hbase.apache.org by Mark Vigeant <ma...@riskmetrics.com> on 2009/12/21 21:55:57 UTC

Smaller Region Size?

Hey Everyone,

I would like to make my HRegion size smaller so that I can test how my jobs run when the tables are split across multiple region servers. Is this something I can set in the hbase-site config, or is this an HDFS thing?

Thanks a lot!

Mark Vigeant
RiskMetrics Group, Inc.
One Chase Manhattan Plaza
44th Floor
New York, NY 10005
(p) 646-778-4142


This email message and any attachments are for the sole use of the intended recipients and may contain proprietary and/or confidential information which may be privileged or otherwise protected from disclosure. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not an intended recipient, please contact the sender by reply email and destroy the original message and any copies of the message as well as any attachments to the original message.

Re: Smaller Region Size?

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Keeping the clocks synchronized will save you a lot of trouble, since by
default the version of a cell is set to System.currentTimeMillis() by the
region server. Let's say you delete a value, the region gets reassigned
minutes later to another region server whose clock is running 60 minutes
in the past, and then you do a Get on that cell with the default
timestamp. That timestamp translates to a time before the previous
delete, so you will get the deleted cell back (unless a major compaction
was run).

So a minor clock skew is OK, but more than 20 minutes is asking for
trouble. This requirement is documented in the Getting Started guide.

J-D
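
The scenario above can be sketched with a small toy model. This is only an
illustration, not HBase's actual read path: the ToyCell class, its methods,
and every timestamp below are hypothetical, and the model assumes that a
default Get reads "as of" the serving region server's clock and that a delete
marker hides every put at or before its own timestamp.

```python
from typing import List, Optional, Tuple

MINUTE_MS = 60 * 1000


class ToyCell:
    """Hypothetical stand-in for one cell: timestamped puts plus delete markers."""

    def __init__(self) -> None:
        self.puts: List[Tuple[int, str]] = []   # (timestamp_ms, value)
        self.deletes: List[int] = []            # delete marker timestamps

    def put(self, value: str, ts: int) -> None:
        self.puts.append((ts, value))

    def delete(self, ts: int) -> None:
        self.deletes.append(ts)

    def get(self, read_ts: int) -> Optional[str]:
        """Return the newest value visible at read_ts, or None if deleted/absent."""
        # Only markers at or before the read timestamp are taken into account.
        visible_deletes = [ts for ts in self.deletes if ts <= read_ts]
        newest_delete = max(visible_deletes, default=-1)
        # A put survives if it is newer than the newest visible delete marker.
        live = [(ts, v) for ts, v in self.puts if newest_delete < ts <= read_ts]
        return max(live)[1] if live else None


def read_after_reassignment(skew_ms: int) -> Optional[str]:
    """Put, delete, then read five minutes later on a server skew_ms behind."""
    t0 = 1_261_000_000_000                  # arbitrary wall-clock time in ms
    cell = ToyCell()
    cell.put("v1", t0 - 120 * MINUTE_MS)    # written two hours before the delete
    cell.delete(t0)                         # stamped by an accurate clock
    true_now = t0 + 5 * MINUTE_MS           # region has moved to another server
    return cell.get(true_now - skew_ms)     # that server's idea of "now"


print(read_after_reassignment(0))               # accurate clock: None
print(read_after_reassignment(60 * MINUTE_MS))  # 60 minutes behind: v1
```

With no skew the delete marker is visible and the read returns nothing; with
the clock 60 minutes behind, the marker lies in the reader's future and the
old value comes back.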

On Thu, Dec 24, 2009 at 8:17 AM, Dhruba Borthakur <dh...@gmail.com> wrote:
> Hi folks,
>
> Is it necessary to keep the clocks synchronized on all HBase region
> servers/master? I would appreciate it a lot if somebody can please explain
> if the HBase architecture depends on this fact.
>
> thanks,
> dhruba

Re: Smaller Region Size?

Posted by Dhruba Borthakur <dh...@gmail.com>.
Hi folks,

Is it necessary to keep the clocks synchronized on all HBase region
servers/master? I would appreciate it a lot if somebody can please explain
if the HBase architecture depends on this fact.

thanks,
dhruba





-- 
Connect to me at http://www.facebook.com/dhruba

RE: Smaller Region Size?

Posted by Mark Vigeant <ma...@riskmetrics.com>.
The clocks are all running in sync, though, shamefully, I am not using NTP. I should.

And no, I listed the log entries in reverse order; that's not how they showed up in the log, sorry, heh. I don't think the clocks run backwards.

-----Original Message-----
From: Andrew Purtell [mailto:apurtell@apache.org]
Sent: Wednesday, December 23, 2009 12:47 PM
To: hbase-user@hadoop.apache.org
Subject: Re: Smaller Region Size?

How do you have clocks set up on your systems, Mark? Are you using NTP to keep
them sane? Am I correct that they are sometimes running backward?


   - Andy




Re: Smaller Region Size?

Posted by Andrew Purtell <ap...@apache.org>.
How do you have clocks set up on your systems, Mark? Are you using NTP to keep
them sane? Am I correct that they are sometimes running backward?


   - Andy






      


RE: Smaller Region Size?

Posted by Mark Vigeant <ma...@riskmetrics.com>.
> The biggest legitimate reason to run a smaller region size is if your
> data set is small (let's say 400 MB) but highly accessed, so you want a
> good spread of regions across your cluster.

That's exactly it, my input dataset was 500MB total (~1,000,000 rows) and it was getting stored as just one region on one regionserver.

In response to St. Ack, I don't think my regions are performing too many splits: the regionserver logs mainly consist of the occasional ZooKeeper Connection error and these two repeatedly:

2009-12-22 15:21:50,415 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=6.556961MB (6875472), Free=792.61804MB (831120240), Max=799.175MB (837995712), Counts: Blocks=0, Access=25755, Hit=0, Miss=25755, Evictions=0, Evicted=0, Ratios: Hit Ratio=0.0%, Miss Ratio=100.0%, Evicted/Run=NaN

2009-12-22 15:20:35,073 DEBUG org.apache.hadoop.hbase.regionserver.Store: Skipping major compaction of Message because one (major) compacted file only and elapsedTime 339624149ms is < ttl=9223372036854775807

You're suggesting the performance would improve if the dataset were larger? What other parameters can be fine-tuned based on data size?

Thanks
-Mark
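
The numbers in the two log lines quoted above can be read off with a few lines
of arithmetic. This is a sketch under assumptions: the helper names are
hypothetical, and the reading of "Evicted/Run" as evicted blocks divided by
eviction runs is a guess from the field names, not something confirmed in the
thread.

```python
import math

LONG_MAX = 2**63 - 1  # Java's Long.MAX_VALUE


def describe_elapsed(elapsed_ms: int) -> str:
    """Render a millisecond duration as hours and days."""
    hours = elapsed_ms / 3_600_000
    return f"{hours:.1f} hours ({hours / 24:.2f} days)"


def ratio(numerator: int, denominator: int) -> float:
    """Plain ratio that yields NaN for 0/0, as floating-point division would in Java."""
    return numerator / denominator if denominator else float("nan")


# elapsedTime from the compaction line
print(describe_elapsed(339624149))        # 94.3 hours (3.93 days)

# ttl from the compaction line is the largest signed 64-bit value,
# which suggests no expiry is configured for that store
print(9223372036854775807 == LONG_MAX)    # True

# Hit=0 out of Access=25755 from the cache line
print(ratio(0, 25755))                    # 0.0

# Evicted=0 over Evictions=0 would explain the NaN
print(math.isnan(ratio(0, 0)))            # True
```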

Re: Smaller Region Size?

Posted by Ryan Rawson <ry...@gmail.com>.
The biggest legitimate reason to run a smaller region size is if your
data set is small (let's say 400 MB) but highly accessed, so you want a
good spread of regions across your cluster.

Conversely, a reason to run a larger region size is if you have a huge
table and you want to keep the absolute region count low. I am not 100%
sold on this yet.

I have a patch that keeps performance high on a highly split
table by using parallel puts. It has been shown to keep aggregate
performance really high, and I hope it will make it into 0.20.3.
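
As a back-of-envelope illustration of the "small but highly accessed" case,
the sketch below estimates how many regions a data set might end up in for a
given maximum file size, and which size to try for a target region count. Both
function names are hypothetical, and the model (one region per full max file
size of data) is a deliberate simplification: real split counts also depend on
flushes, compactions, compression, and the number of column families.

```python
import math

MB = 1024 * 1024


def estimate_regions(data_mb: int, max_filesize_bytes: int) -> int:
    """Rough region count: data size divided by the split threshold, rounded up."""
    return max(1, math.ceil(data_mb * MB / max_filesize_bytes))


def max_filesize_for(data_mb: int, target_regions: int) -> int:
    """Largest threshold (in bytes) that gives at least target_regions in this model."""
    return data_mb * MB // target_regions


# 400 MB of data with a 256 MB threshold stays in very few regions.
print(estimate_regions(400, 256 * MB))    # 2

# To spread the same data over roughly 8 regions, try about 50 MB.
size = max_filesize_for(400, 8)
print(size)                               # 52428800
print(estimate_regions(400, size))        # 8
```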


Re: Smaller Region Size?

Posted by stack <st...@duboce.net>.
On Tue, Dec 22, 2009 at 8:57 AM, Mark Vigeant
<ma...@riskmetrics.com>wrote:

> J-D,
>
> I noticed that performance for uploading data into tables got a lot better
> as I lowered the max file size -- but up until a certain point, where the
> performance began slowing down again.
>
>
Tell us more.  What kinda size changes did you make?  How many regions were
created?  Is the slowdown because the table is splitting all the time?  If you
study the regionserver logs, can you make out what the regionservers are
spending their time doing?



> Is there a rule of thumb/formula/notion to rely on when setting this
> parameter for optimal performance? Thanks!
>
>
We have the most experience running the defaults.  Generally folks go up from
the default size because they want to host more data in about the same number
of regions.  I've not seen much of going down from the default.

St.Ack

RE: Smaller Region Size?

Posted by Mark Vigeant <ma...@riskmetrics.com>.
J-D,

I noticed that performance for uploading data into tables got a lot better as I lowered the max file size -- but only up to a certain point, after which performance began slowing down again.

Is there a rule of thumb/formula/notion to rely on when setting this parameter for optimal performance? Thanks!

-Mark


RE: Smaller Region Size?

Posted by Mark Vigeant <ma...@riskmetrics.com>.
Thanks J-D!


Re: Smaller Region Size?

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Mark,

When you create a table you can set MAX_FILESIZE in the shell or in
the code. Set it to something smaller than 256MB.

J-D
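
For picking a concrete value, it may help to see the candidates as raw
numbers. The sketch below assumes MAX_FILESIZE is expressed in bytes and that
"MB" here means 1024 * 1024 bytes; the helper name is hypothetical and the
candidate sizes are only examples below the 256MB mentioned above.

```python
MB = 1024 * 1024


def mb_to_bytes(size_mb: int) -> int:
    """Convert a size in megabytes to the byte count used for the setting."""
    return size_mb * MB


# 256 MB for reference, followed by a few smaller candidates.
for size_mb in (256, 128, 64, 32):
    print(f"{size_mb:>3} MB -> {mb_to_bytes(size_mb)}")
```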

On Mon, Dec 21, 2009 at 12:55 PM, Mark Vigeant
<ma...@riskmetrics.com> wrote:
> Hey Everyone,
>
> I would like to make my HRegion size be smaller so that I can test out how my jobs run when the tables are split up across multiple region servers. Is this something I can set in the hbase-site config, or is this an hdfs thing?
>
> Thanks a lot!