Posted to user@hbase.apache.org by Wayne <wa...@gmail.com> on 2011/05/18 14:54:48 UTC

Max Table Count

How many tables can a cluster realistically handle or how many tables/node
can be supported? I am looking for a realistic idea of whether a 10 node
cluster can support 100 or even 500 tables. I realize it is recommended to
have a few tables at most (and to use the row key to add everything to one
table), but that is not an option for us at this point. What are the
settings that need to be tweaked and where are the issues going to occur in
terms of resource limitations, memory constraints, and OOM problems? Do most
resource limitations fall back to total active region count regardless of
the table count? Where do things get scary in terms of large numbers of
tables?

Thanks in advance for any advice that can be provided.
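
For reference, these are the kinds of knobs the question is asking about.
A minimal sketch, assuming a 0.90-era deployment; the property names are
the stock ones, and the values shown are just the defaults, not tuning
advice:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class TuningSketch {
        public static void main(String[] args) {
            Configuration conf = HBaseConfiguration.create();

            // Cap on the aggregate memstore footprint across all regions
            // hosted by one region server, as a fraction of heap.
            conf.setFloat("hbase.regionserver.global.memstore.upperLimit", 0.4f);

            // Per-region memstore size that triggers a flush; with many
            // active regions this is what adds up.
            conf.setLong("hbase.hregion.memstore.flush.size", 64L * 1024 * 1024);

            // Fraction of heap given to the block cache for reads.
            conf.setFloat("hfile.block.cache.size", 0.2f);
        }
    }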

Re: Max Table Count

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Yes, you will be wasting some I/O; this is a well-known bug in HBase,
but it's not because empty families get flushed. In HBase, if
something is empty it usually means it doesn't exist (that's why
sparse columns are free). But if you insert into 4 families in
different rows that all land in the same region, the region flushes
on the aggregate size of all the families instead of flushing each
family individually. Say you load them unevenly: you could end up
with 3 files of a few KB and one big 63MB file. Repeat that a few
times and you'll be compacting those small files with other small
files until you get bigger ones, and you will still be compacting
them with small files. That's where the waste is; you want to
flush/compact as little as possible.

J-D
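
A tiny simulation of the pattern described above; the class and the
sizes are made up for illustration, with 64MB standing in for the
default per-region flush threshold:

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class FlushSketch {
        // Default per-region flush threshold in 0.90-era HBase.
        static final long FLUSH_THRESHOLD = 64L * 1024 * 1024;

        public static void main(String[] args) {
            // Four families in one region, loaded unevenly (made-up sizes).
            Map<String, Long> memstoreBytes = new LinkedHashMap<String, Long>();
            memstoreBytes.put("cf1", 63L * 1024 * 1024); // nearly all the data
            memstoreBytes.put("cf2", 512L * 1024);
            memstoreBytes.put("cf3", 384L * 1024);
            memstoreBytes.put("cf4", 256L * 1024);

            long total = 0;
            for (long size : memstoreBytes.values()) {
                total += size;
            }

            // The flush decision is made on the aggregate size, but every
            // family writes its own store file: one ~63MB file plus three
            // tiny ones, which is exactly the compaction churn described.
            if (total >= FLUSH_THRESHOLD) {
                for (Map.Entry<String, Long> e : memstoreBytes.entrySet()) {
                    System.out.printf("flush %s -> %d KB store file%n",
                            e.getKey(), e.getValue() / 1024);
                }
            }
        }
    }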

On Thu, May 19, 2011 at 8:25 AM, Wayne <wa...@gmail.com> wrote:
> How about column families? We have 4 column families per table due to
> different settings (versions etc.). They are sparse in that a given row will
> only ever write to a single CF, and even regions usually have only one CF's
> data/store file, except at the border between row key naming conventions
> (each CF has its own convention). I recently read in the online book (see
> below) that more CFs are bad and you should stick with only one. Is this
> true given that there is only ever really data for one CF in a given region?
> Are we wasting disk I/O and memory because of empty CFs being flushed and
> compacted?
>
> Thanks as always Stack for your help.
> 8.2.  On the number of column families
>
> HBase currently does not do well with anything above two or three column
> families, so keep the number of column families in your schema low.
> Currently, flushing and compactions are done on a per-region basis, so if
> one column family is carrying the bulk of the data and bringing on flushes,
> the adjacent families will also be flushed even though the amount of data
> they carry is small. Compaction is currently triggered by the total number
> of files under a column family; it's not size-based. With many column
> families, the flushing and compaction interaction can make for a bunch of
> needless I/O loading (to be addressed by changing flushing and compaction
> to work on a per-column-family basis).
>
> Try to make do with one column family if you can in your schemas. Only
> introduce a second and third column family in the case where data access is
> usually column scoped, i.e. you query one column family or the other, but
> usually not both at the same time.

Re: Max Table Count

Posted by Wayne <wa...@gmail.com>.
How about column families? We have 4 column families per table due to
different settings (versions etc.). They are sparse in that a given row will
only ever write to a single CF, and even regions usually have only one CF's
data/store file, except at the border between row key naming conventions
(each CF has its own convention). I recently read in the online book (see
below) that more CFs are bad and you should stick with only one. Is this
true given that there is only ever really data for one CF in a given region?
Are we wasting disk I/O and memory because of empty CFs being flushed and
compacted?
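
For concreteness, a schema along those lines might be declared as below.
This is a sketch against the 0.90-era Java client; the table name, family
names, and version counts are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class SchemaSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);

            // One table, multiple families that differ only in settings.
            HTableDescriptor table = new HTableDescriptor("mytable");

            HColumnDescriptor latest = new HColumnDescriptor("latest");
            latest.setMaxVersions(1);      // keep the current value only

            HColumnDescriptor history = new HColumnDescriptor("history");
            history.setMaxVersions(1000);  // keep a deep version history

            table.addFamily(latest);
            table.addFamily(history);
            // ...the remaining families would follow the same pattern.

            admin.createTable(table);
        }
    }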

Thanks as always Stack for your help.
8.2.  On the number of column families

HBase currently does not do well with anything above two or three column
families, so keep the number of column families in your schema low.
Currently, flushing and compactions are done on a per-region basis, so if
one column family is carrying the bulk of the data and bringing on flushes,
the adjacent families will also be flushed even though the amount of data
they carry is small. Compaction is currently triggered by the total number
of files under a column family; it's not size-based. With many column
families, the flushing and compaction interaction can make for a bunch of
needless I/O loading (to be addressed by changing flushing and compaction
to work on a per-column-family basis).

Try to make do with one column family if you can in your schemas. Only
introduce a second and third column family in the case where data access is
usually column scoped, i.e. you query one column family or the other, but
usually not both at the same time.

On Wed, May 18, 2011 at 10:46 AM, Stack <st...@duboce.net> wrote:

> It's not the number of tables that is of import, it's the number of
> regions.  You can have your regions in as many tables as you like.  I
> do not believe there is a cost to having more tables.
>
> St.Ack

Re: Max Table Count

Posted by Stack <st...@duboce.net>.
It's not the number of tables that is of import, it's the number of
regions.  You can have your regions in as many tables as you like.  I
do not believe there is a cost to having more tables.

St.Ack
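
To put rough numbers on that, it is the regions-per-server count that
eventually bites. A back-of-envelope sketch; every figure below is an
assumption for illustration only:

    public class RegionMath {
        public static void main(String[] args) {
            int tables = 500;          // from the original question
            int regionsPerTable = 2;   // assume mostly small tables
            int servers = 10;

            int regionsPerServer = tables * regionsPerTable / servers; // 100

            // Each active region's memstore can grow to the flush
            // threshold, so the worst-case memstore footprint is:
            long flushSizeMb = 64;     // 0.90-era default flush size
            long worstCaseMb = regionsPerServer * flushSizeMb; // 6400 MB

            // Far more than a typical heap, which is why the global
            // memstore limit forces early flushes (and small files)
            // once the region count climbs.
            System.out.printf("%d regions/server, up to ~%d MB of memstore%n",
                    regionsPerServer, worstCaseMb);
        }
    }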

On Wed, May 18, 2011 at 5:54 AM, Wayne <wa...@gmail.com> wrote:
> How many tables can a cluster realistically handle or how many tables/node
> can be supported? I am looking for a realistic idea of whether a 10 node
> cluster can support 100 or even 500 tables. I realize it is recommended to
> have a few tables at most (and to use the row key to add everything to one
> table), but that is not an option for us at this point. What are the
> settings that need to be tweaked and where are the issues going to occur in
> terms of resource limitations, memory constraints, and OOM problems? Do most
> resource limitations fall back to total active region count regardless of
> the table count? Where do things get scary in terms of large numbers of
> tables?
>
> Thanks in advance for any advice that can be provided.
>