Posted to user@hbase.apache.org by Oleg Ruchovets <or...@gmail.com> on 2012/09/10 10:19:33 UTC

bulk loading regions number

Hi ,
  I am using bulk loading to write my data to HBase.

It works fine, but the number of regions is growing very rapidly.
After entering ONE WEEK of data I already have 200 regions (and I am going
to store years of data).
As a result, the job which writes data to HBase has a REDUCER count equal
to the REGION count.
So after entering only one WEEK of data I have 200 reducers.

Questions:
   How can I control the constantly growing reducer count when using
bulk loading and the TotalOrderPartitioner?
 I have a 10-machine cluster and I think I should have ~30 reducers.

Thanks in advance.
Oleg.
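For context on why the reducer count tracks the region count: HBase's bulk-load setup (HFileOutputFormat.configureIncrementalLoad, which installs the TotalOrderPartitioner) assigns one reducer per region of the target table. A rough sketch of the arithmetic, where every number except the 200-regions-per-week figure from this thread is an illustrative assumption:

```python
# Rough arithmetic: how region count (and hence bulk-load reducer
# count) scales with ingested data for a given region split size.
# Only "200 regions per week" comes from the thread; the 256 MB and
# 4 GB split sizes are illustrative assumptions.

def regions_needed(total_bytes, max_region_bytes):
    """One region per max_region_bytes of data, rounded up."""
    return -(-total_bytes // max_region_bytes)  # ceiling division

MB, GB = 1024 ** 2, 1024 ** 3
week_of_data = 200 * 256 * MB   # 200 regions at a 256 MB split size ~= 50 GB/week

print(regions_needed(week_of_data, 4 * GB))        # same week at a 4 GB split size: 13 regions
print(regions_needed(52 * week_of_data, 4 * GB))   # a year at that rate: 650 regions
```

Raising the split size, as suggested in the replies, is therefore the lever that directly shrinks both the region count and the per-job reducer count.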

Re: bulk loading regions number

Posted by Harsh J <ha...@cloudera.com>.
The decision can be made based on the total number of regions you
want deployed across your 10 machines, and the total data size you
expect to reach before you have to expand the cluster.
On top of that, add a parallelism factor of, say, 5-10 (or more if
you want) regions of the same table per RegionServer, so that cluster
expansion is easy later.

The penalty of large HFile sizes (I'd call > 4 GB large) may be that
major compactions start taking a long time on full or nearly-full
regions, since a major compaction rewrites a single file of that
size. I don't think there's much impact on parallelism (the number of
regions independently serveable) or on random reads with the new
HFileV2 format, even with such big files.

If it suits your data ingest, go for bigger files.
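As a concrete sketch of "bigger files": the region split threshold is the hbase.hregion.max.filesize property in hbase-site.xml. The property name is real; the 4 GB value below simply mirrors the size discussed above and should be tuned to your own ingest rate.

```xml
<!-- hbase-site.xml: raise the size at which a region splits.
     4 GB here mirrors the size discussed above; adjust to taste. -->
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>4294967296</value>
</property>
```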

On Mon, Sep 10, 2012 at 2:15 PM, Oleg Ruchovets <or...@gmail.com> wrote:
> Great
>   That is actually what I am thinking about too.
> What is the best practice for choosing the HFile size?
> What is the penalty of making it very large?
>
> Thanks
> Oleg.
>
> On Mon, Sep 10, 2012 at 4:24 AM, Harsh J <ha...@cloudera.com> wrote:
>
>> Hi Oleg,
>>
>> If the root issue is a growing number of regions, why not control that
>> instead of a way to control the Reducer count? You could, for example,
>> raise the split-point sizes for HFiles, to not have it split too much,
>> and hence have larger but fewer regions?
>>
>> Given that you have 10 machines, I'd go this way rather than ending up
>> with a lot of regions causing issues with load.
>>
>> On Mon, Sep 10, 2012 at 1:49 PM, Oleg Ruchovets <or...@gmail.com>
>> wrote:
>> > Hi ,
>> >   I am using bulk loading to write my data to HBase.
>> >
>> > It works fine, but the number of regions is growing very rapidly.
>> > After entering ONE WEEK of data I already have 200 regions (and I am going
>> > to store years of data).
>> > As a result, the job which writes data to HBase has a REDUCER count equal
>> > to the REGION count.
>> > So after entering only one WEEK of data I have 200 reducers.
>> >
>> > Questions:
>> >    How can I control the constantly growing reducer count when using
>> > bulk loading and the TotalOrderPartitioner?
>> >  I have a 10-machine cluster and I think I should have ~30 reducers.
>> >
>> > Thanks in advance.
>> > Oleg.
>>
>>
>>
>> --
>> Harsh J
>>



-- 
Harsh J

Re: bulk loading regions number

Posted by Marcos Ortiz <ml...@uci.cu>.
Well, the default value for a region is 256 MB, so if you want to
store a lot of data, you should consider increasing that value.
You can also control this process by pre-splitting the table at
creation time.
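A minimal sketch of pre-splitting at table-creation time from the HBase shell (the table name, column family, and split keys below are hypothetical; choose split points that match your actual row-key distribution):

```
hbase> create 'events', 'cf', SPLITS => ['w02', 'w10', 'w20', 'w30', 'w40']
```

Created this way, the table starts out with six regions instead of one, so bulk loads spread across them from the first write.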

On 09/10/2012 04:45 AM, Oleg Ruchovets wrote:
> Great
>    That is actually what I am thinking about too.
> What is the best practice for choosing the HFile size?
> What is the penalty of making it very large?
>
> Thanks
> Oleg.
>
> On Mon, Sep 10, 2012 at 4:24 AM, Harsh J <ha...@cloudera.com> wrote:
>
>> Hi Oleg,
>>
>> If the root issue is a growing number of regions, why not control that
>> instead of a way to control the Reducer count? You could, for example,
>> raise the split-point sizes for HFiles, to not have it split too much,
>> and hence have larger but fewer regions?
>>
>> Given that you have 10 machines, I'd go this way rather than ending up
>> with a lot of regions causing issues with load.
>>
>> On Mon, Sep 10, 2012 at 1:49 PM, Oleg Ruchovets <or...@gmail.com>
>> wrote:
>>> Hi ,
>>>    I am using bulk loading to write my data to HBase.
>>>
>>> It works fine, but the number of regions is growing very rapidly.
>>> After entering ONE WEEK of data I already have 200 regions (and I am going
>>> to store years of data).
>>> As a result, the job which writes data to HBase has a REDUCER count equal
>>> to the REGION count.
>>> So after entering only one WEEK of data I have 200 reducers.
>>>
>>> Questions:
>>>     How can I control the constantly growing reducer count when using
>>> bulk loading and the TotalOrderPartitioner?
>>>   I have a 10-machine cluster and I think I should have ~30 reducers.
>>>
>>> Thanks in advance.
>>> Oleg.
>>
>>
>> --
>> Harsh J
>>
>
> 10th ANNIVERSARY OF THE FOUNDING OF THE UNIVERSIDAD DE LAS CIENCIAS INFORMATICAS...
> CONNECTED TO THE FUTURE, CONNECTED TO THE REVOLUTION
>
> http://www.uci.cu
> http://www.facebook.com/universidad.uci
> http://www.flickr.com/photos/universidad_uci

-- 

Marcos Luis Ortíz Valmaseda
*Data Engineer && Sr. System Administrator at UCI*
about.me/marcosortiz <http://about.me/marcosortiz>
My Blog <http://marcosluis2186.posterous.com>
Tumblr's blog <http://marcosortiz.tumblr.com/>
@marcosluis2186 <http://twitter.com/marcosluis2186>






Re: bulk loading regions number

Posted by Oleg Ruchovets <or...@gmail.com>.
Great
  That is actually what I am thinking about too.
What is the best practice for choosing the HFile size?
What is the penalty of making it very large?

Thanks
Oleg.

On Mon, Sep 10, 2012 at 4:24 AM, Harsh J <ha...@cloudera.com> wrote:

> Hi Oleg,
>
> If the root issue is a growing number of regions, why not control that
> instead of a way to control the Reducer count? You could, for example,
> raise the split-point sizes for HFiles, to not have it split too much,
> and hence have larger but fewer regions?
>
> Given that you have 10 machines, I'd go this way rather than ending up
> with a lot of regions causing issues with load.
>
> On Mon, Sep 10, 2012 at 1:49 PM, Oleg Ruchovets <or...@gmail.com>
> wrote:
> > Hi ,
> >   I am using bulk loading to write my data to HBase.
> >
> > It works fine, but the number of regions is growing very rapidly.
> > After entering ONE WEEK of data I already have 200 regions (and I am going
> > to store years of data).
> > As a result, the job which writes data to HBase has a REDUCER count equal
> > to the REGION count.
> > So after entering only one WEEK of data I have 200 reducers.
> >
> > Questions:
> >    How can I control the constantly growing reducer count when using
> > bulk loading and the TotalOrderPartitioner?
> >  I have a 10-machine cluster and I think I should have ~30 reducers.
> >
> > Thanks in advance.
> > Oleg.
>
>
>
> --
> Harsh J
>

Re: bulk loading regions number

Posted by Harsh J <ha...@cloudera.com>.
Hi Oleg,

If the root issue is a growing number of regions, why not control that
instead of a way to control the Reducer count? You could, for example,
raise the split-point sizes for HFiles, to not have it split too much,
and hence have larger but fewer regions?

Given that you have 10 machines, I'd go this way rather than ending up
with a lot of regions causing issues with load.

On Mon, Sep 10, 2012 at 1:49 PM, Oleg Ruchovets <or...@gmail.com> wrote:
> Hi ,
>   I am using bulk loading to write my data to HBase.
>
> It works fine, but the number of regions is growing very rapidly.
> After entering ONE WEEK of data I already have 200 regions (and I am going
> to store years of data).
> As a result, the job which writes data to HBase has a REDUCER count equal
> to the REGION count.
> So after entering only one WEEK of data I have 200 reducers.
>
> Questions:
>    How can I control the constantly growing reducer count when using
> bulk loading and the TotalOrderPartitioner?
>  I have a 10-machine cluster and I think I should have ~30 reducers.
>
> Thanks in advance.
> Oleg.



-- 
Harsh J