Posted to user@hbase.apache.org by Jianshi Huang <ji...@gmail.com> on 2014/08/15 07:53:43 UTC

How are split files distributed across Region servers?

Say I have 100 split files on 10 region servers, and I run a major compaction.

Will these split files be distributed like this:
reg1: [splits 1,2,..,10]
reg2: [splits 11,12,...,20]
...

Or like this:
reg1: [splits: 1, 11, 21, ... , 91]
reg2: [splits: 2, 12, 22, ... , 92]
...

And if I want to control the locality and the stride of split files, how
can I do that in HBase?
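
To make the two layouts in the question concrete, here is a minimal Python sketch. The names `contiguous_layout` and `strided_layout` are made up for illustration and are not HBase APIs; the sketch only enumerates which split numbers each server would hold under each scheme, assuming the split count divides evenly by the server count.

```python
def contiguous_layout(num_splits, num_servers):
    """Each server holds one consecutive run of split numbers."""
    per_server = num_splits // num_servers
    return {
        f"reg{s + 1}": list(range(s * per_server + 1, (s + 1) * per_server + 1))
        for s in range(num_servers)
    }


def strided_layout(num_splits, num_servers):
    """Each server holds every num_servers-th split, starting at its own index."""
    return {
        f"reg{s + 1}": list(range(s + 1, num_splits + 1, num_servers))
        for s in range(num_servers)
    }


contiguous = contiguous_layout(100, 10)
strided = strided_layout(100, 10)
print(contiguous["reg2"])  # splits 11 through 20
print(strided["reg2"])     # splits 2, 12, 22, ..., 92
```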


-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/

Re: How are split files distributed across Region servers?

Posted by Jianshi Huang <ji...@gmail.com>.

OK, I found some references. I was actually asking about the default load
balancer of HBase. From what I found by googling, it seems to only even out
the number of regions across region servers, while the placement of
individual regions is random.

I also found a good load balancer implementation:


https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/master/balancer/StochasticLoadBalancer.html

Thanks for the help JM! :)

Jianshi
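
As a rough illustration of "even counts, arbitrary placement", the sketch below shuffles the regions and deals them out round-robin. It is a toy model only, not the algorithm used by HBase's default balancer or by StochasticLoadBalancer; `balance_by_count` and its `seed` parameter are hypothetical names.

```python
import random


def balance_by_count(regions, servers, seed=None):
    """Give every server the same number of regions (within one), in random order."""
    rng = random.Random(seed)
    shuffled = list(regions)
    rng.shuffle(shuffled)  # which regions land together is arbitrary
    assignment = {server: [] for server in servers}
    for i, region in enumerate(shuffled):
        # Round-robin dealing keeps the per-server counts even.
        assignment[servers[i % len(servers)]].append(region)
    return assignment


regions = [f"region-{i:03d}" for i in range(100)]
servers = [f"rs{i}" for i in range(1, 11)]
assignment = balance_by_count(regions, servers, seed=42)
print({server: len(hosted) for server, hosted in assignment.items()})
```

Every server ends up with 10 regions, but nothing ties neighbouring rowkey ranges to the same server.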


On Tue, Aug 19, 2014 at 2:31 PM, lars hofhansl <la...@apache.org> wrote:

> I'd change the max file size to 20GB. That'd give you 5000 regions for
> 100TB.




Re: How are split files distributed across Region servers?

Posted by lars hofhansl <la...@apache.org>.

I'd change the max file size to 20GB. That'd give you 5000 regions for 100TB.
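
The arithmetic behind that figure can be sketched as below. This is a back-of-the-envelope estimate that assumes decimal units (1 TB = 1000 GB) and that every region has grown to the size limit; the helper names are hypothetical, and real region counts will differ because regions split before they are full and compression changes on-disk size.

```python
def estimated_regions(total_tb, max_region_gb):
    """Approximate region count if every region is at its maximum size."""
    total_gb = total_tb * 1000  # assumption: decimal units
    return round(total_gb / max_region_gb)


def regions_per_server(total_tb, max_region_gb, servers):
    """Average regions per region server when counts are balanced."""
    return estimated_regions(total_tb, max_region_gb) / servers


print(estimated_regions(100, 20))        # 20GB limit
print(estimated_regions(100, 10))        # 10GB limit
print(regions_per_server(100, 20, 100))  # spread over 100 servers
```

With a 20GB limit this gives 5000 regions (50 per server on 100 machines); with a 10GB limit it gives 10000 regions (100 per server).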



________________________________
 From: Jianshi Huang <ji...@gmail.com>
To: user@hbase.apache.org 
Sent: Monday, August 18, 2014 12:22 PM
Subject: Re: How are split files distributed across Region servers?
 

Hi JM,

By making the range bigger, you mean making it multiple regions/splits,
right?

I will probably have >100TB of data, and I think the default split file
size is 10GB. So can I assume each of my 100 machines will be assigned
100 *random* regions?

Where can I find the implementation details or settings for region
assignment?

Jianshi




Re: How are split files distributed across Region servers?

Posted by Jianshi Huang <ji...@gmail.com>.

Hi JM,

By making the range bigger, you mean making it multiple regions/splits,
right?

I will probably have >100TB of data, and I think the default split file
size is 10GB. So can I assume each of my 100 machines will be assigned
100 *random* regions?

Where can I find the implementation details or settings for region
assignment?

Jianshi



On Mon, Aug 18, 2014 at 8:48 PM, Jean-Marc Spaggiari <
jean-marc@spaggiari.org> wrote:

> Hi Jianshi,
>
> A region server can host more than one region. So if you pre-split your
> table correctly based on your access pattern, in the end all the servers
> should be used evenly.
>
> If about 30% of your range is not used, just make sure that this range is
> bigger, so that in the end it carries the same load as the others.
>
> JM




Re: How are split files distributed across Region servers?

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.

Hi Jianshi,

A region server can host more than one region. So if you pre-split your
table correctly based on your access pattern, in the end all the servers
should be used evenly.

If about 30% of your range is not used, just make sure that this range is
bigger, so that in the end it carries the same load as the others.

JM
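
One way to picture the advice above is to choose split points by expected load rather than by key count. The sketch below is only an illustration under stated assumptions: row keys are modelled as integers in `[0, key_end)`, the load per key bucket is known in advance, and `weighted_split_points` is a hypothetical helper, not an HBase API.

```python
def weighted_split_points(bucket_starts, bucket_loads, key_end, num_regions):
    """Return num_regions - 1 split keys so each region carries similar load.

    bucket_starts[i] is the first key of bucket i; bucket i ends where
    bucket i + 1 starts (the last bucket ends at key_end). Load is assumed
    to be spread uniformly inside each bucket.
    """
    target = sum(bucket_loads) / num_regions
    splits = []
    cumulative = 0.0
    next_cut = target
    for i, load in enumerate(bucket_loads):
        lo = bucket_starts[i]
        hi = bucket_starts[i + 1] if i + 1 < len(bucket_starts) else key_end
        # Place as many cuts as fall inside this bucket.
        while (load > 0 and len(splits) < num_regions - 1
               and cumulative + load >= next_cut):
            fraction = (next_cut - cumulative) / load
            splits.append(lo + round(fraction * (hi - lo)))
            next_cut += target
        cumulative += load
    return splits


# Keys 0..299 (30% of the range) are never read; keys 300..999 are hot.
splits = weighted_split_points([0, 300], [0, 70], key_end=1000, num_regions=4)
print(splits)  # [475, 650, 825]
```

The first region then spans keys 0 to 474, which is a bigger range than the other three, because the cold 30% contributes no load.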


2014-08-18 2:08 GMT-04:00 Jianshi Huang <ji...@gmail.com>:

> Hi JM,
>
> If the region boundaries will not change, does that mean the following?
>
> If my data access pattern is skewed (say a certain part (30%) of my data
> will almost never be used), will a proportion (30%) of my servers always
> be idle?
>
> Does a region server have to hold a contiguous rowkey range?
>
> Jianshi

Re: How are split files distributed across Region servers?

Posted by Jianshi Huang <ji...@gmail.com>.

Hi JM,

If the region boundaries will not change, does that mean the following?

If my data access pattern is skewed (say a certain part (30%) of my data
will almost never be used), will a proportion (30%) of my servers always
be idle?

Does a region server have to hold a contiguous rowkey range?

Jianshi




On Sat, Aug 16, 2014 at 2:46 AM, Jean-Marc Spaggiari <
jean-marc@spaggiari.org> wrote:

> Hi Jianshi,
>
> I'm not sure I understand your question.
>
> Can I rephrase it?
>
> So you have 10 regions, and each of those regions has 10 HFiles. Then you
> run a major compaction on the table. Correct?
>
> Then you will end up with:
>
> reg1:[files:1]
> reg2:[files:2]
> reg3:[files:3]
> ...
>
> Region boundaries will not change. But each region will now have a single
> underlying file.
>
> HTH,
>
> JM




Re: How are split files distributed across Region servers?

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.

Hi Jianshi,

I'm not sure I understand your question.

Can I rephrase it?

So you have 10 regions, and each of those regions has 10 HFiles. Then you
run a major compaction on the table. Correct?

Then you will end up with:

reg1:[files:1]
reg2:[files:2]
reg3:[files:3]
...

Region boundaries will not change. But each region will now have a single
underlying file.

HTH,

JM
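
The effect described above can be modelled in a few lines of Python. This is a toy simulation, not HBase code: each region is a list of files, each file is a list of row keys, and a single column family per region is assumed. `major_compact` is a hypothetical name.

```python
def major_compact(regions):
    """Merge all files of each region into one sorted file.

    The set of regions (and therefore their boundaries) is left untouched.
    """
    return {
        name: [sorted(key for hfile in files for key in hfile)]
        for name, files in regions.items()
    }


regions = {
    "reg1": [["a3", "a1"], ["a2"]],
    "reg2": [["m2"], ["m1"], ["m3"]],
}
compacted = major_compact(regions)
for name, files in compacted.items():
    print(name, "files:", len(files), files[0])
```

After the call, every region holds exactly one file containing all of its keys, and no key has moved to a different region.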


2014-08-15 1:53 GMT-04:00 Jianshi Huang <ji...@gmail.com>:

> Say I have 100 split files on 10 region servers, and I run a major
> compaction.
>
> Will these split files be distributed like this:
> reg1: [splits 1,2,..,10]
> reg2: [splits 11,12,...,20]
> ...
>
> Or like this:
> reg1: [splits: 1, 11, 21, ... , 91]
> reg2: [splits: 2, 12, 22, ... , 92]
> ...
>
> And if I want to control the locality and the stride of split files, how
> can I do that in HBase?