Posted to user@hbase.apache.org by Mark <st...@gmail.com> on 2012/02/01 17:59:54 UTC

One table or multiple tables?

We would like to track all of our users' interactions, ordered by time:
product views, searches, logins, etc. There are (at least) two ways of
accomplishing this:

We could use one table, 'user_logs', with keys in the format
USER_ID/TYPE/TIMESTAMP, where TYPE is one of product view, search, login, etc.

Or we could have a separate table for each type: UserProductLogs,
UserSearchLogs, etc.

What are the pros/cons of each strategy and which one do you think I 
should employ?

- M

Re: One table or multiple tables?

Posted by Tom <fi...@gmail.com>.

I am assuming that your read pattern is based on user sessions, i.e.
your user logs in and then chances are that you will have to look at
various things for this user, such as his logs, his searches, etc.

I was investigating a similar problem, and from the info I collected 
this is the architecture I came up with:
-a single table,
-a single column family
-store all of the different types of data for a user under multiple
row keys that sort "close" together for that user (*).

Only this way can you be reasonably sure that all of a user's data is
co-located, i.e. likely to fall into the same or adjacent regions.

With this design and the right tuning, all of the data belonging to one
user is likely to sit on a single region server (as opposed to
being distributed over many region servers).
Keeping all kinds of session data on one region server has several
advantages: less overhead, fewer connections, and fewer users affected
when a region server goes down.

(*) of course, while also making sure that the whole set of keys will 
have a reasonable distribution.
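
To make the "close keys" idea concrete, here is a minimal sketch of one
way to encode the USER_ID/TYPE/TIMESTAMP key from the original question.
The fixed-width big-endian layout, the numeric type codes, and the
helper names are assumptions for illustration, not something the thread
prescribes. The point is only that a shared user prefix keeps a user's
rows adjacent in sorted order, so one prefix range covers them all.

```python
import struct

# Hypothetical numeric codes for the event types named in the question.
PRODUCT_VIEW, SEARCH, LOGIN = 1, 2, 3

def row_key(user_id, event_type, ts_millis):
    # Big-endian, fixed-width fields: byte-wise (lexicographic) order,
    # which is how HBase sorts rows, then matches numeric order.
    return struct.pack(">QBQ", user_id, event_type, ts_millis)

def user_scan_range(user_id):
    # Start (inclusive) and stop (exclusive) keys covering every row of
    # one user. Ignores overflow for the largest possible user_id.
    return struct.pack(">Q", user_id), struct.pack(">Q", user_id + 1)

keys = sorted(
    row_key(user, kind, ts)
    for user in (7, 8)
    for kind in (SEARCH, LOGIN)
    for ts in (1000, 2000)
)
start, stop = user_scan_range(7)
user_7_rows = [k for k in keys if start <= k < stop]
print(len(user_7_rows), "rows for user 7, all adjacent in sort order")
```

With TYPE ahead of TIMESTAMP, rows are time-ordered within each type but
not across types for a user. Sequential user IDs may also need hashing
or similar to get the reasonable distribution mentioned in (*).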

Re: One table or multiple tables?

Posted by tsuna <ts...@gmail.com>.

On Thu, Feb 2, 2012 at 4:47 PM, Bryan Beaudreault
<bb...@hubspot.com> wrote:
> I'd love to hear from an expert on the pros and cons of big tables vs many
> tables, when access patterns and simplicity are not a concern[1].  I
> haven't found much information regarding it, but I'd imagine the only
> benefit to many tables is the ability to configure each differently if that
> is helpful for the use case.

HBase doesn't offer a whole lot of configuration knobs per table.
Most tables I come across have the same configuration: single family,
LZO compression, some form of Bloom filter.  Maybe VERSIONS=>1.

If you need different configs, you can also consider using multiple
column families in a single table.

If you have somewhat related data and you're on the fence about
whether to store everything in a single table, I generally recommend
sticking to a single table.  From an operational
standpoint, it's easier to manage a single table for an application
than multiple ones.  You also generally end up with fewer, bigger
regions, which is almost always better.  This entails that your RS are
writing more data to fewer WALs, which leads to more sequential writes
across the board.  You'll end up with fewer HLogs, which is also a
good thing.

As others said, with a single table design, you can control data
locality, but as soon as you write to and read from multiple tables,
all bets are off.

If you use HBase's client (which is most likely the case as the only
other alternative is asynchbase), beware that you need to create one
HTable instance per table per thread in your application code.  If you
build an application with many tables, this rapidly becomes unwieldy.
If you use asynchbase you don't have this problem because it uses a
single HBaseClient object for your entire cluster, and it's
thread-safe.
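
The "one instance per table per thread" bookkeeping can be sketched as
follows. This is not HBase client code: TableHandle is a hypothetical
stand-in for a client object that must not be shared between threads,
and get_table and count_handles are illustrative helpers. The sketch
only shows how the number of live handles grows with tables times
threads.

```python
import threading

class TableHandle:
    """Hypothetical stand-in for a table client that is not thread-safe."""
    def __init__(self, name):
        self.name = name

_local = threading.local()      # per-thread cache of handles
_created = []                   # every handle ever created, for counting
_created_lock = threading.Lock()

def get_table(name):
    # Return this thread's handle for the table, creating it on first use.
    cache = getattr(_local, "tables", None)
    if cache is None:
        cache = _local.tables = {}
    if name not in cache:
        cache[name] = TableHandle(name)
        with _created_lock:
            _created.append(cache[name])
    return cache[name]

def _worker(table_names):
    for name in table_names:
        get_table(name)
        get_table(name)  # second call reuses the cached handle

def count_handles(num_threads, table_names):
    # Run num_threads workers that each touch every table once.
    _created.clear()
    threads = [
        threading.Thread(target=_worker, args=(table_names,))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(_created)

print(count_handles(4, ["user_logs"]))
print(count_handles(4, ["UserProductLogs", "UserSearchLogs"]))
```

With four threads, the single-table layout needs four handles and the
two-table layout needs eight; each additional table adds one more
handle per thread.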

-- 
Benoit "tsuna" Sigoure
Software Engineer @ www.StumbleUpon.com

Re: One table or multiple tables?

Posted by Bryan Beaudreault <bb...@hubspot.com>.

At the risk of sounding contrary, I'll actually give voice to the opposite.

Like Jean said, you haven't said much about your read patterns.  I'd say
understanding that is the first critical part of this.

I'd also argue that it is no less simple to put them all in the same table,
and possibly much more flexible.  I imagine you aren't always going to be
reporting on only a single metric at once.  In the multiple table layout,
you'll need to do multiple scans/gets to retrieve the data you need.  If
you put them all in a single table, you might be able to do a better job
returning them all (or a subset) at once.

I'm not sure what the value you would be storing is, but if it is
reasonable enough you might want to put the metrics as different columns
instead of different rows.  It all depends on the access patterns, but
having them all in the same table opens up more flexibility.  (Beware of
incrementing row keys though: http://hbase.apache.org/book.html#timeseries)
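
The warning about incrementing row keys can be illustrated with a small
simulation. It is a sketch under assumed numbers: four regions, made-up
split points, and 100 writes from 100 users. It models region lookup as
a binary search over sorted split keys and is not HBase code.

```python
import bisect
import struct
from collections import Counter

def region_for(key, split_points):
    # Regions are contiguous key ranges; a key equal to a split point
    # belongs to the region that starts at that split point.
    return bisect.bisect_right(split_points, key)

def writes_per_region(keys, split_points):
    return Counter(region_for(k, split_points) for k in keys)

# 100 new events: user i writes at time 4000 + i.
events = [(user, 4000 + user) for user in range(100)]

# Layout A: TIMESTAMP first. Splits come from older data (times 1000-3000).
time_splits = [struct.pack(">Q", t) for t in (1000, 2000, 3000)]
time_first = writes_per_region(
    [struct.pack(">QQ", ts, user) for user, ts in events], time_splits
)

# Layout B: USER_ID first. Splits fall on user boundaries.
user_splits = [struct.pack(">Q", u) for u in (25, 50, 75)]
user_first = writes_per_region(
    [struct.pack(">QQ", user, ts) for user, ts in events], user_splits
)

print("timestamp-first:", dict(time_first))
print("user-first:     ", dict(user_first))
```

In this setup every timestamp-first write lands in the last region,
while the user-first writes split evenly across all four.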

I'd love to hear from an expert on the pros and cons of big tables vs many
tables, when access patterns and simplicity are not a concern[1].  I
haven't found much information regarding it, but I'd imagine the only
benefit to many tables is the ability to configure each differently if that
is helpful for the use case.

[1] By this I mean, 2 or more different data sets where the row keys won't
conflict and will never be queried together.  Is there a benefit to putting
them in multiple tables vs a single, aside from config differences (e.g. #
of column families)?

On Thu, Feb 2, 2012 at 5:37 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:

> You're not telling us much about your read patterns and data
> distribution, but I would go with the former solution for the sake of
> simplicity. You'd want to write your row keys in the same format as
> OpenTSDB does: http://opentsdb.net/schema.html
>
> J-D

Re: One table or multiple tables?

Posted by Jean-Daniel Cryans <jd...@apache.org>.

You're not telling us much about your read patterns and data
distribution, but I would go with the former solution for the sake of
simplicity. You'd want to write your row keys in the same format as
OpenTSDB does: http://opentsdb.net/schema.html
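
As a rough illustration of what an OpenTSDB-like key could look like
for this use case: OpenTSDB's row keys, as far as I understand the
schema, carry a timestamp rounded down to the hour and store the offset
within that hour in the column qualifier. The sketch below adapts that
idea to USER_ID/TYPE. The field widths, the numeric type code, and the
function names are assumptions, and it leaves out details of the real
schema such as the flag bits in the qualifier.

```python
import struct

HOUR = 3600  # seconds per row bucket

def encode_cell(user_id, event_type, ts_seconds):
    # Row key holds the hour-aligned base time; the qualifier holds the
    # number of seconds past that base (0..3599 fits in two bytes).
    base = ts_seconds - (ts_seconds % HOUR)
    row = struct.pack(">QBI", user_id, event_type, base)
    qualifier = struct.pack(">H", ts_seconds - base)
    return row, qualifier

def decode_timestamp(row, qualifier):
    # Recover the full timestamp from a row key and qualifier.
    _, _, base = struct.unpack(">QBI", row)
    (offset,) = struct.unpack(">H", qualifier)
    return base + offset

ts = 1328115594
row_a, qual_a = encode_cell(42, 2, ts)
row_b, qual_b = encode_cell(42, 2, ts - 100)
print(row_a == row_b, qual_a != qual_b)
print(decode_timestamp(row_a, qual_a) == ts)
```

Events for the same user and type within one hour share a row and
differ only by qualifier, which keeps the number of rows down while
preserving time order inside each row.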

J-D
