Posted to user@hbase.apache.org by Stack <st...@duboce.net> on 2010/08/11 17:11:37 UTC

Re: Do we need to split the table into two when there are too many rows in one table?

Inline below.

On Tue, Aug 10, 2010 at 10:55 PM, Yu Bady <ba...@gmail.com> wrote:
> Hi,
>
>
> We are going to use HBase to store our large volume of pretty structured
> data.
>
> Every day, we will have about 24 new rows added to one table. After three
> months, there will be about 4,000,000,000 new rows in the table.
>

Sounds fine.

> By the way, in the table, each row will have about 8 column families and
> each column family will have 2-3 columns. But each cell contains only about
> 20 bytes of data.
>

Why 8 column families?  Will you be doing accesses against individual
column families?  If you could do with fewer, that'd be better, but 8
should be fine.


>
> So I have the following questions:
>
> 1. How many rows can HBase support in one table?
>

I don't know.  I know of tables of 30B small rows.


> 2. After one year, there will be about 16,000,000,000 rows in the table. If
> the number of rows is too large, would it help to split the original table
> into several tables? How would we split one table into several tables?
>

How big are your cells?

As far as HBase is concerned, there is no real difference between
hosting many tables and hosting one.

> 3. Any other suggestions?
>

Tell us more about how you intend to access the table -- the kind of
queries -- otherwise, sounds fine.  Can you try things out in the
small first to learn the edge cases yourself?

St.Ack

Re: Do we need to split the table into two when there are too many rows in one table?

Posted by Stack <st...@duboce.net>.
On Wed, Aug 11, 2010 at 8:11 AM, Stack <st...@duboce.net> wrote:
>> 1. How many rows can HBase support in one table?
>>
>
> I don't know.  I know of tables of 30B small rows.
>

FYI, just heard offlist of a table that had 100B rows in it, made of
rows larger than those of the 30B SU table referenced above and with
more column families.  (The host of the 100B table didn't want to be
seen to be 'boasting' -- let me see if I can get them to write a blog
post on it.)

St.Ack

Re: Do we need to split the table into two when there are too many rows in one table?

Posted by Ted Yu <yu...@gmail.com>.
I think moving all of the column 2 values into another table would help
utilize the block cache more efficiently.

On Wed, Aug 11, 2010 at 4:45 PM, Yu Bady <ba...@gmail.com> wrote:

> What's your suggestion on the placement of column 2? Leave it as in the
> current design, or move it out into another table? If we move all column 2
> values into another table, will that increase space consumption?

Re: Do we need to split the table into two when there are too many rows in one table?

Posted by Yu Bady <ba...@gmail.com>.
Thanks very much, St.Ack, for the helpful answers.

Inline also.

On Wed, Aug 11, 2010 at 11:11 PM, Stack <st...@duboce.net> wrote:

> Inline below.
>
> On Tue, Aug 10, 2010 at 10:55 PM, Yu Bady <ba...@gmail.com> wrote:
> > Hi,
> >
> >
> > We are going to use HBase to store our large volume of pretty structured
> > data.
> >
> > Every day, we will have about 24 new rows added to one table. After three
> > months, there will be about 4,000,000,000 new rows in the table.
> >
>
> Sounds fine.
>
> > By the way, in the table, each row will have about 8 column families and
> > each column family will have 2-3 columns. But each cell contains only about
> > 20 bytes of data.
> >
>
> Why 8 column families?  Will you be doing accesses against individual
> column families?  If you could do with fewer, that'd be better, but 8
> should be fine.
>
>
> >
> > So I have the following questions:
> >
> > 1. How many rows can HBase support in one table?
> >
>
> I don't know.  I know of tables of 30B small rows.
>
>
> > 2. After one year, there will be about 16,000,000,000 rows in the table. If
> > the number of rows is too large, would it help to split the original table
> > into several tables? How would we split one table into several tables?
> >
>
> How big are your cells?
>


Each cell contains a string of less than 20 bytes. In fact, each cell holds
either an integer or a double. Quite a few cells will have no value, which
means the value is 0 or 0.0.
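
For a rough sense of scale, the figures quoted in this thread (4,000,000,000
rows after three months, 16,000,000,000 after a year, 8 column families, 2-3
columns each, at most 20 bytes per cell) can be multiplied out. The sketch
below is only back-of-envelope arithmetic: the 90-day length of "three months"
is an assumption, and it counts value bytes only. HBase also stores the row
key, column family, qualifier and timestamp with every cell, so the real
footprint would be larger, while cells that are never written take no space.

```python
# Back-of-envelope sizing from the numbers given in the thread.
ROWS_AFTER_3_MONTHS = 4_000_000_000
ROWS_AFTER_1_YEAR = 16_000_000_000
FAMILIES = 8
COLUMNS_PER_FAMILY = (2, 3)   # the thread says "2-3 columns"
MAX_VALUE_BYTES = 20          # upper bound per cell from the thread
DAYS_IN_3_MONTHS = 90         # assumption, not stated in the thread

# Implied ingest rate if rows arrive evenly over the three months.
rows_per_day = ROWS_AFTER_3_MONTHS // DAYS_IN_3_MONTHS


def value_bytes(rows, columns_per_family):
    """Bytes of cell values only, assuming every cell is populated."""
    return rows * FAMILIES * columns_per_family * MAX_VALUE_BYTES


low, high = (value_bytes(ROWS_AFTER_1_YEAR, c) for c in COLUMNS_PER_FAMILY)

print(f"rows per day: about {rows_per_day:,}")
print(f"value bytes after one year: {low / 1e12:.2f} to {high / 1e12:.2f} TB")
```

With values this small, the per-cell key overhead is likely to dominate, so
short row keys, family names and qualifiers matter more than the values.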


> As far as HBase is concerned, there is no real difference between
> hosting many tables and hosting one.
>
> > 3. Any other suggestions?
> >
>
> Tell us more about how you intend to access the table -- the kind of
> queries -- otherwise, sounds fine.  Can you try things out in the
> small first to learn the edge cases yourself?
>
>
Let me give an example here.  To simplify the description, suppose we only
have 2 column families instead of 8.

We have some logs. Each log line contains several fields as follows:
         user 1 | val_a | val_b | ....

After processing the logs, the values will be loaded into HBase by
map/reduce:

        |   column family a   |   column family b   |
        | column 1 | column 2 | column 1 | column 2 |
--------+----------+----------+----------+----------+
user 1  | val_a    |          | val_b    |          |

Then we will run map/reduce against the HBase table to aggregate the values
in column 1 of each column family, and the result will be written to column 2
of the same family. That is, for each column family, the map/reduce job reads
the value in column 1 and writes the result to column 2.

Queries against the HBase table will only read the value in column 2, but
they may read it from both column families at the same time.

Of course, we can merge the two column families into one as follows:

        |                 column family a-b                 |
        | column_a_1 | column_a_2 | column_b_1 | column_b_2 |
--------+------------+------------+------------+------------+
user 1  | val_a      |            | val_b      |            |

Would that improve performance? What are the rules of thumb for organizing
column families?

What's your suggestion on the placement of column 2? Leave it as in the
current design, or move it out into another table? If we move all column 2
values into another table, will that increase space consumption?
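
To make the access pattern above concrete, here is a rough sketch of the
two-family layout and the read-column-1, write-column-2 flow. It uses plain
Python dicts as a stand-in for the table, not an HBase client; the names
"column1" and "column2", the helper functions, and the doubling function used
as the "aggregation" are all hypothetical placeholders.

```python
# In-memory stand-in for the layout described in this message:
# row key -> column family -> column qualifier -> value.
# No HBase API is used; this only illustrates the access pattern.


def aggregate(table, families, func):
    """For each row and family, read 'column1' and write func(value) to 'column2'."""
    for row in table.values():
        for family in families:
            columns = row.get(family, {})
            if "column1" in columns:
                columns["column2"] = func(columns["column1"])
    return table


def query(table, row_key, families):
    """Read only 'column2', possibly from several families at once."""
    row = table.get(row_key, {})
    return {family: row.get(family, {}).get("column2") for family in families}


table = {
    "user1": {"a": {"column1": 3}, "b": {"column1": 4.5}},
}

# Placeholder aggregation: the thread does not say what is computed.
aggregate(table, ["a", "b"], lambda value: value * 2)
result = query(table, "user1", ["a", "b"])
print(result)
```

In this model the query path never touches column 1, which is the property
that the column family and table layout questions above revolve around.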





