Posted to user@hbase.apache.org by Kevin O'dell <ke...@cloudera.com> on 2015/04/02 14:41:16 UTC

Re: introducing nodes w/ more storage

Hi Mike,

  Sorry for the delay here.

How does the HDFS load balancer impact the load balancing of HBase? <-- The
HDFS load balancer is not run automatically; it is a manual process that an
operator kicks off. It is not recommended to *ever run the HDFS balancer on
a cluster running HBase.  Just as HBase has no concept of or care about the
underlying storage, HDFS has no concept of or care about the region layout,
nor the locality we worked so hard to build through compactions.
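
For reference, the balancer is a stock Hadoop command-line tool, typically
kicked off by hand along the lines of:

  $ hdfs balancer -threshold 10

where -threshold is the allowed deviation, in percentage points, between a
datanode's utilization and the cluster average before blocks are moved (the
10 is just an illustrative value).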

Furthermore, once the HDFS balancer has saved us from running out of space
on the smaller nodes, we will run a major compaction, and re-write all of
the HBase data right back to where it was before.
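
As a minimal illustration (the table name 'mytable' is a placeholder), such
a major compaction is typically triggered from the HBase shell:

  hbase> major_compact 'mytable'

after which each region server rewrites its store files onto local disks;
that rewrite is what restores locality.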

one is the number of regions managed by a region server that’s HBase’s
load, right? And then there’s the data distribution of HBase files that is
really managed by the HDFS load balancer, right? <--- Right, until we run a
major compaction and "restore" locality by moving the data back.

Even still… eventually the data will be distributed equally across the
cluster. What’s happening with the HDFS balancer?  Is that heterogeneous or
homogeneous in terms of storage? <-- Not quite; as I said before, the HDFS
balancer is manual, so it is quite easy to build up a skew, especially if
you use a datanode as an edge node or Thrift gateway, etc.  Yes, the HDFS
balancer handles heterogeneous storage (it balances on percentage used, not
absolute bytes), but it doesn't play nice with HBase.
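
A quick way to check whether such a skew has built up is the standard DFS
admin report:

  $ hdfs dfsadmin -report

which prints, per datanode, the configured capacity, DFS used, and DFS used
percentage, so uneven utilization stands out immediately.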

*The use of the word "ever" should not be construed as an absolute; it is
shorthand for a best practice.  In many cases the HDFS balancer does need
to be run, especially on multi-tenant clusters with archive data.  If the
HDFS balancer is used, it is best to run a major compaction immediately
afterwards to restore HBase locality.

On Mon, Mar 23, 2015 at 10:50 AM, Michael Segel <mi...@hotmail.com>
wrote:

> @lars,
>
> How does the HDFS load balancer impact the load balancing of HBase?
>
> Of course there are two loads… one is the number of regions managed by a
> region server that’s HBase’s load, right?
> And then there’s the data distribution of HBase files that is really
> managed by HDFS load balancer, right?
>
> OP’s question is about a heterogeneous cluster where he would like to see
> a more even distribution of data/free space based on the capacity of the
> newer machines in the cluster.
>
> This is a storage question, not a memory/cpu core question.
>
> Or am I missing something?
>
>
> -Mike
>
> > On Mar 22, 2015, at 10:56 PM, lars hofhansl <la...@apache.org> wrote:
> >
> > Seems that it should not be too hard to add that to the stochastic load
> > balancer.
> > We could add a spaceCost or something.
> >
> >
> >
> > ----- Original Message -----
> > From: Jean-Marc Spaggiari <je...@spaggiari.org>
> > To: user <us...@hbase.apache.org>
> > Cc: Development <De...@mentacapital.com>
> > Sent: Thursday, March 19, 2015 12:55 PM
> > Subject: Re: introducing nodes w/ more storage
> >
> > You can extend the default balancer and assign the regions based on
> > that. But at the end, the replicated blocks might still go all over the
> > cluster and your "small" nodes are going to be full and will not be able
> > to get any more writes, even for the regions they are supposed to get.
> >
> > I'm not sure there is a good solution for what you are looking for :(
> >
> > I built my own balancer, but because of differences in the CPUs, not
> > because of differences in the storage space...
> >
> >
> > 2015-03-19 15:50 GMT-04:00 Nick Dimiduk <nd...@gmail.com>:
> >
> >> Seems more fantasy than fact, I'm afraid. The default load balancer [0]
> >> takes store file size into account, but has no concept of capacity. It
> >> doesn't know that nodes in a heterogeneous environment have different
> >> capacity.
> >>
> >> This would be a good feature to add though.
> >>
> >> [0]:
> >> https://github.com/apache/hbase/blob/branch-1.0/hbase-server/src/main/java/org/apache/hadoop/hbase/master/balancer/StochasticLoadBalancer.java
> >>
> >> On Tue, Mar 17, 2015 at 7:26 AM, Ted Tuttle <te...@mentacapital.com>
> wrote:
> >>
> >>> Hello-
> >>>
> >>> Sometime back I asked a question about introducing new nodes w/ more
> >>> storage than existing nodes.  I was told at the time that HBase will
> >>> not be able to utilize the additional storage; I assumed at the time
> >>> that regions are allocated to nodes in something like a round-robin
> >>> fashion and the node with the least storage sets the limit for how
> >>> much each node can utilize.
> >>>
> >>> My question this time around has to do with nodes w/ unequal numbers
> >>> of volumes: Does HBase allocate regions based on nodes or volumes on
> >>> the nodes?  I am hoping I can add a node with 8 volumes totaling 8X TB
> >>> and all the volumes will be filled.  This even though legacy nodes
> >>> have 5 volumes and total storage of 5X TB.
> >>>
> >>> Fact or fantasy?
> >>>
> >>> Thanks,
> >>> Ted
> >>>
> >>>
> >>
> >
>
> The opinions expressed here are mine, while they may reflect a cognitive
> thought, that is purely accidental.
> Use at your own risk.
> Michael Segel
> michael_segel (AT) hotmail.com
>
>
>
>
>
>


-- 
Kevin O'Dell
Field Enablement, Cloudera

Re: introducing nodes w/ more storage

Posted by Michael Segel <mi...@hotmail.com>.
I don’t know that it is such a good idea. 

Let me ask it this way… 

What are you balancing with the HBase load balancer? 
Locations of HFiles on HDFS or which RS is responsible for the HFile? 

-Mike



Re: introducing nodes w/ more storage

Posted by lars hofhansl <la...@apache.org>.
What Kevin says.
The best we can do is exclude HBase's directories from the HDFS balancer
(HDFS-6133). The HDFS balancer will destroy data locality for HBase. If you
don't care - maybe you have a fat network tree, and your network bandwidth
matches the aggregate disk throughput of each machine - you can run it.
Even then, as Kevin says, HBase will just happily rewrite the data back to
where it was.

Balancing of HBase data has to happen at the HBase level. Then we have to
decide what we use as a basis for distribution. CPU? RAM? Disk space?
IOPs? Disk throughput? It depends... So some configurable function of
those.
-- Lars
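
To make that concrete, here is an editor's illustrative sketch of such a
"spaceCost" term in plain Java. It is not the actual StochasticLoadBalancer
CostFunction API; all names and numbers here are made up:

  public class SpaceCostSketch {

      // Hypothetical capacity-aware cost: 0.0 when every node's share of
      // the used space matches its share of total capacity, approaching
      // 1.0 as the skew grows.
      static double spaceCost(long[] usedBytes, long[] capacityBytes) {
          long totalUsed = 0, totalCapacity = 0;
          for (int i = 0; i < usedBytes.length; i++) {
              totalUsed += usedBytes[i];
              totalCapacity += capacityBytes[i];
          }
          double cost = 0.0;
          for (int i = 0; i < usedBytes.length; i++) {
              double fairShare = (double) capacityBytes[i] / totalCapacity;
              double actualShare = (double) usedBytes[i] / totalUsed;
              cost += Math.abs(actualShare - fairShare);
          }
          return cost / 2.0; // total variation distance, bounded by 1.0
      }

      public static void main(String[] args) {
          long[] used = {400, 400, 200};        // GB used per node (made up)
          long[] capacity = {1000, 1000, 2000}; // GB capacity per node
          System.out.println(spaceCost(used, capacity)); // ~0.3: the big node is underused
      }
  }

A real implementation would plug a weighted term like this into the
balancer's overall cost function, next to the existing region-count and
store-file-size terms.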


Re: introducing nodes w/ more storage

Posted by Esteban Gutierrez <es...@cloudera.com>.
As usual, it all depends. Currently HDFS-6133 provides the functionality to
exclude paths from the balancer, which makes it possible in the near future
to exclude the HBase root directory from the balancer and avoid the loss of
data locality when the HDFS balancer runs. So you have the option to wait
until Hadoop 2.7 is out, or until a Hadoop distro backports this for its
users. In the meantime, as everybody has mentioned here, the only
reasonable way is to major compact any table that requires high data
locality after running the HDFS balancer.
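
As a hedged illustration of that last step ('events' and 'users' are
placeholder table names), the compactions can be scripted through the
standard HBase shell:

  for t in events users; do
    echo "major_compact '$t'" | hbase shell
  done

Each invocation queues a major compaction for the named table, and locality
recovers as the region servers rewrite their store files.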

cheers,
esteban.


--
Cloudera, Inc.



Re: introducing nodes w/ more storage

Posted by Kevin O'dell <ke...@cloudera.com>.
Mike,

  I agree with all of the above; I am just saying, from experience, that
even clusters that do not run HBase at all rarely run the HDFS balancer,
except when doing major overhauls such as adding nodes or racks.


And no, you do not use a data node as an edge node.
(Really saying that? C’mon, really? ) Never a good design. Ever. <--
Sometimes you have to make do with what you got :)



Re: introducing nodes w/ more storage

Posted by Michael Segel <mi...@hotmail.com>.

When you say … "It is not recommended to *ever run the HDFS balancer on a cluster running HBase." … that's a very scary statement.

Not really a good idea.  Unless you are building a cluster for a specific use case. 


When you look at the larger picture… in most use cases, the cluster will contain more data in flat files (HDFS) than it will inside HBase
(which you allude to in your last paragraph), so balancing is a good idea. (Even manual processes can be run from cron jobs ;-)
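
For instance, a purely hypothetical crontab entry (schedule and threshold
made up):

  # Run the HDFS balancer every Saturday at 02:00
  0 2 * * 6  hdfs balancer -threshold 10

On an HBase cluster you would pair this with a follow-up major compaction,
per the rest of this thread.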

And no, you do not use a data node as an edge node. 
(Really saying that? C’mon, really? ) Never a good design. Ever. 


I agree that you should run major compactions after running the HDFS balancer.
But the point I am trying to make is that, with respect to HBase, you still need to think about the cluster as a whole.



The opinions expressed here are mine, while they may reflect a cognitive thought, that is purely accidental. 
Use at your own risk. 
Michael Segel
michael_segel (AT) hotmail.com