Posted to user@hbase.apache.org by Bill de hOra <bi...@dehora.net> on 2009/02/12 00:50:23 UTC

usecase: tagged key/values

Hi,

I was wondering if hbase is a good fit for the following - storing 
arbitrary key/values tagged with a single identifier, eg:

"8904830324": {
   "url":"...",
   "stat":"...",
   ...
}


When I say arbitrary I mean across deployments. So while each deployment 
will have different sets of keys, tags within that deployment will tend 
to reuse same keys, hence there is an option to index via keys (eg find 
all tags where stat=1 above). It's similar I guess to what memcached-tag 
[1] does, but needs to be persisted.

Any thoughts?

Bill

[1] http://code.google.com/p/memcached-tag/wiki/MemcacheTagIntroduction

RE: usecase: tagged key/values

Posted by Jonathan Gray <jl...@streamy.com>.
Bill,

So out of the box, with a straightforward schema, you can store the data in
the way you want and efficiently answer the query "get all key/vals for this
identifier".

In order to also implement "get all identifiers which have key X = value Y"
queries, you'd need to store your data in the inverted manner I described.
There would be a table per key; the row key would be the value, and the
identifiers would be the column names.

There is rudimentary secondary indexing support (implemented as I described,
with an additional table for each index) provided by this issue:
https://issues.apache.org/jira/browse/HBASE-883 but I don't recommend it
because it uses the OCC subsystem, so index updates are done in a
transaction.  There is a large amount of overhead involved with this
implementation.

We do some secondary indexing here and decided to push the management of it
to the application.  We have loose constraints on consistency and assume
that in the case of a failure the secondary indexes are out of sync (and we
can quickly rebuild them with a special MapReduce job that builds secondary
indexes).
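
For illustration only, a rebuild job along those lines might look roughly
like this, assuming the newer org.apache.hadoop.hbase.mapreduce API (this is
a sketch, not our actual job; the table "PrimaryTable", index table
"PrimaryTablestat", family "kv", and key "stat" are all made-up names):

import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;

// Rebuild the index for one key ("stat") by scanning the primary table
// and emitting inverted rows: row key = the value, column = the identifier.
public class RebuildStatIndex extends TableMapper<ImmutableBytesWritable, Put> {

  private static final byte[] FAMILY = Bytes.toBytes("kv");
  private static final byte[] KEY = Bytes.toBytes("stat");

  @Override
  protected void map(ImmutableBytesWritable id, Result columns, Context ctx)
      throws IOException, InterruptedException {
    byte[] val = columns.getValue(FAMILY, KEY);
    if (val == null) return;                  // this identifier lacks "stat"
    Put inverted = new Put(val);              // index row key is the value
    inverted.addColumn(FAMILY, id.copyBytes(), new byte[0]);
    ctx.write(new ImmutableBytesWritable(val), inverted);
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(HBaseConfiguration.create(), "rebuild-stat-index");
    job.setJarByClass(RebuildStatIndex.class);
    TableMapReduceUtil.initTableMapperJob("PrimaryTable", new Scan(),
        RebuildStatIndex.class, ImmutableBytesWritable.class, Put.class, job);
    // null reducer = identity reducer; Puts go straight to the index table
    TableMapReduceUtil.initTableReducerJob("PrimaryTablestat", null, job);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}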

So your application might look like this (pseudo-code):


Put(id, keyvals):                      # keyvals = List[(key, val), ...]

  # forward table: row key = id, one column per (key, val)
  InsertToHbase(PrimaryTable, id, keyvals)

  # inverted indexes: table = key, row key = val, column = id
  For (key, val) in keyvals:
    InsertToHbase(key, val, id)


Where key is now the table name (you might have to add prefixes/suffixes so
you don't have duplication across your instances).  Here we just append the
name of the key onto the source table's name (PrimaryTable + key).
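
As a rough Java sketch of those two inserts, assuming the newer HBase
client API (the class plumbing and the family name "kv" are made up for
illustration):

import java.util.Map;

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DualWrite {

  private static final byte[] FAMILY = Bytes.toBytes("kv");  // made-up name

  // Forward row first, then one index row per key/val.  Nothing ties the
  // writes together, so a crash mid-loop leaves the indexes stale until
  // the rebuild job runs: the loose consistency described above.
  static void put(Connection conn, String id, Map<String, String> keyvals)
      throws Exception {
    try (Table primary = conn.getTable(TableName.valueOf("PrimaryTable"))) {
      Put forward = new Put(Bytes.toBytes(id));
      for (Map.Entry<String, String> e : keyvals.entrySet()) {
        forward.addColumn(FAMILY, Bytes.toBytes(e.getKey()),
            Bytes.toBytes(e.getValue()));
      }
      primary.put(forward);
    }
    for (Map.Entry<String, String> e : keyvals.entrySet()) {
      // index table name = PrimaryTable + key; row = val; column = id
      try (Table index =
          conn.getTable(TableName.valueOf("PrimaryTable" + e.getKey()))) {
        Put inverted = new Put(Bytes.toBytes(e.getValue()));
        inverted.addColumn(FAMILY, Bytes.toBytes(id), new byte[0]);
        index.put(inverted);
      }
    }
  }
}

Opening the tables on every call is just for brevity here; in practice
you'd reuse them.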

So you'll have a table for all identifiers.  And then a table for each
possible key.

It might sound like a lot, but each of these tables can scale to any number
of entries, and you can have quite a few tables, though I've not tested how
far that goes.

JG

> -----Original Message-----
> From: Bill de hOra [mailto:bill@dehora.net]
> Sent: Thursday, February 12, 2009 12:32 PM
> To: hbase-user@hadoop.apache.org
> Subject: Re: usecase: tagged key/values
> 
> Jonathan Gray wrote:
> > Bill,
> >
> > It's hard to say whether hbase is a good fit without knowing a bit
> > more.
> >
> > HBase is very well suited for storing data in the format you
> > describe.  If your primary problem is scaling the persistence of this
> > dataset, it can certainly do that.  You can have any number of
> > arbitrary key/vals for each row, and any number of rows.  The example
> > you show looks almost exactly like an HBase schema.
> >
> > Your row key would be "8904830324" and you would have a single
> > family that contained a column per key/val.  The column name is the
> > key, the column value is the val.  You could have one key/val in one
> > row, and 1000 in another row, this schema is not at all fixed.
> >
> > But I really need to better understand the expected dimensions of
> > your dataset and how you'd like to query it to know if that's the
> > right schema.
> >
> > Do you expect very high numbers of key/vals per identifier?  10, 100,
> > 1000, more?
> 
> I'd say in the range 5-20. The number of identifiers is at least 10s of
> millions.
> 
> 
> > And would they be consistent across the identifiers (within a
> > deployment, or table in this case) or would they vary greatly between
> > rows?
> 
> Reasonably consistent; not every identifier will have all values.
> 
> 
> > Also, are you going to be querying this in realtime and concurrently?
> > Will you be storing lots of data and processing it in batch?  Are you
> > write heavy or read heavy?
> 
> Read dominated; easily 80-85% of calls. The calls are realtime, but I
> have the option to cache that data heavily.
> 
> 
> > As you can see, you have to think carefully about how you're going
> > to be inserting and querying the data to determine how best to store
> > it.  I'm looking forward to hearing more details because it sounds
> > like an interesting (and potentially common) problem to solve.
> 
> So in this case each identifier is a user or a community key; as I said
> those are in the tens of millions. And they have some arbitrary
> key/values associated with them, 10-20 each, but typically the keys are
> common across users. In some cases there's a need to do a reverse
> lookup by the key's value, eg "find all users where foo=10", but they are a
> subset again of the total key set.
> 
> Another use case is being able to store semi-structured data for
> media, eg exif values or a controlled set of tags. Again there aren't
> that many keys, but the media count is big - 100s of millions of items.
> 
> In both cases reads outstrip writes, probably 10 to 1. In the media
> case, most writes are new data being put in. It's the kind of data that
> in RDBMSes winds up in extension tables, which become harder to manage
> as they get bigger.
> 
> Bill


Re: usecase: tagged key/values

Posted by Bill de hOra <bi...@dehora.net>.
Jonathan Gray wrote:
> Bill,
> 
> It's hard to say whether hbase is a good fit without knowing a bit more.
> 
> HBase is very well suited for storing data in the format you describe.  If
> your primary problem is scaling the persistence of this dataset, it can
> certainly do that.  You can have any number of arbitrary key/vals for each
> row, and any number of rows.  The example you show looks almost exactly
> like an HBase schema.
> 
> Your row key would be "8904830324" and you would have a single family that
> contained a column per key/val.  The column name is the key, the column
> value is the val.  You could have one key/val in one row, and 1000 in
> another row, this schema is not at all fixed.
> 
> But I really need to better understand the expected dimensions of your
> dataset and how you'd like to query it to know if that's the right schema.
> 
> Do you expect very high numbers of key/vals per identifier?  10, 100,
> 1000, more?  

I'd say in the range 5-20. The number of identifiers is at least 10s of 
millions.


> And would they be consistent across the identifiers (within a
> deployment, or table in this case) or would they vary greatly between
> rows?  

Reasonably consistent; not every identifier will have all values.


> Also, are you going to be querying this in realtime and concurrently? 
> Will you be storing lots of data and processing it in batch?  Are you
> write heavy or read heavy?

Read dominated; easily 80-85% of calls. The calls are realtime, but I 
have the option to cache that data heavily.


> As you can see, you have to think carefully about how you're going to be
> inserting and querying the data to determine how best to store it.  I'm
> looking forward to hearing more details because it sounds like an
> interesting (and potentially common) problem to solve.

So in this case each identifier is a user or a community key; as I said 
those are in the tens of millions. And they have some arbitrary 
key/values associated with them, 10-20 each, but typically the keys are 
common across users. In some cases there's a need to do a reverse
lookup by the key's value, eg "find all users where foo=10", but they are a
subset again of the total key set.

Another use case is being able to store semi-structured data for media, 
eg exif values or a controlled set of tags. Again there aren't that many 
keys, but the media count is big - 100s of millions of items.

In both cases reads outstrip writes, probably 10 to 1. In the media 
case, most writes are new data being put in. It's the kind of data that 
in RDBMSes winds up in extension tables, which become harder to manage 
as they get bigger.

Bill

Re: usecase: tagged key/values

Posted by Jonathan Gray <jl...@streamy.com>.
Bill,

It's hard to say whether hbase is a good fit without knowing a bit more.

HBase is very well suited for storing data in the format you describe.  If
your primary problem is scaling the persistence of this dataset, it can
certainly do that.  You can have any number of arbitrary key/vals for each
row, and any number of rows.  The example you show looks almost exactly
like an HBase schema.

Your row key would be "8904830324" and you would have a single family that
contained a column per key/val.  The column name is the key, the column
value is the val.  You could have one key/val in one row, and 1000 in
another row, this schema is not at all fixed.
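
As a rough sketch in Java, assuming the newer HBase client API (the table
name "identifiers" and family name "kv" are made up for illustration):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TaggedKeyValues {
  public static void main(String[] args) throws Exception {
    byte[] family = Bytes.toBytes("kv");       // single family
    byte[] id = Bytes.toBytes("8904830324");   // row key = identifier
    try (Connection conn =
             ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("identifiers"))) {

      // One column per key/val; a row can hold 1 or 1000 of them.
      Put put = new Put(id);
      put.addColumn(family, Bytes.toBytes("url"), Bytes.toBytes("http://..."));
      put.addColumn(family, Bytes.toBytes("stat"), Bytes.toBytes("1"));
      table.put(put);

      // "All key/vals for this identifier" is a single-row Get.
      Result result = table.get(new Get(id));
      result.getFamilyMap(family).forEach((k, v) ->
          System.out.println(Bytes.toString(k) + " = " + Bytes.toString(v)));
    }
  }
}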

But I really need to better understand the expected dimensions of your
dataset and how you'd like to query it to know if that's the right schema.

Do you expect very high numbers of key/vals per identifier?  10, 100,
1000, more?  And would they be consistent across the identifiers (within a
deployment, or table in this case) or would they vary greatly between
rows?  Because when it comes to efficiently querying hbase for that data,
this schema will likely not work for you.

You said you want queries like "return all identifiers that have the
following key" or "that have the following key=value".  In either case,
you'll need to remodel and/or denormalize the data in hbase to be able to
query for that efficiently.

How to attack that really depends on the details.  The most common
approach to secondary indexing is creating an additional table for each
"index".  Each table would signify an individual "key", the row key would
be the "val" and a single family would contain a list of "identifiers" as
column names.  Since you are only storing key/vals, this basically inverts
your schema and you wouldn't need to store the data keyed on identifier
anymore (unless you want to efficiently retrieve all key/vals for a
specific identifier, in which case you need both).
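
As a rough sketch of the read side, assuming the newer Java client API
(the index table name "identifiers_stat" and family name "ids" are made
up for illustration):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class FindByStat {
  public static void main(String[] args) throws Exception {
    try (Connection conn =
             ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table index = conn.getTable(TableName.valueOf("identifiers_stat"))) {
      // Row key is the value being matched ("find all where stat=1").
      Result row = index.get(new Get(Bytes.toBytes("1")));
      if (!row.isEmpty()) {
        // Each column qualifier in the row is a matching identifier.
        for (byte[] id : row.getFamilyMap(Bytes.toBytes("ids")).keySet()) {
          System.out.println(Bytes.toString(id));
        }
      }
    }
  }
}

(Note that a very common value means a very wide row here, so in practice
you might have to page through the columns rather than Get the whole row.)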

Also, are you going to be querying this in realtime and concurrently? 
Will you be storing lots of data and processing it in batch?  Are you
write heavy or read heavy?

As you can see, you have to think carefully about how you're going to be
inserting and querying the data to determine how best to store it.  I'm
looking forward to hearing more details because it sounds like an
interesting (and potentially common) problem to solve.

As far as memcached-tag goes, from what I can tell that's basically a
method for group invalidations/deletions.  You still store one global set
of key/vals.  With it, you'd tag different groups of key/vals so they
could be deleted together with a single command.

In your case, you could use it to associate key/vals with identifiers, but
any given key could only exist once globally and for a single identifier. 
Thanks for bringing that project to my attention, though; it does look
cool, and the lack of that ability is part of what steered me away from
memcached as a cache for hbase.  At the time, my cache design had per-row
invalidations, but there could be many entries for a single row.

Sorry for the drawn-out response.  Look forward to hearing back.

JG

On Wed, February 11, 2009 3:50 pm, Bill de hOra wrote:
> Hi,
>
>
> I was wondering if hbase is a good fit for the following - storing
> arbitrary key/values tagged with a single identifier, eg:
>
> "8904830324": {
> "url":"...",
> "stat":"...",
> ...
> }
>
>
>
> When I say arbitrary I mean across deployments. So while each deployment
> will have different sets of keys, tags within that deployment will tend to
> reuse same keys, hence there is an option to index via keys (eg find all
> tags where stat=1 above). It's similar I guess to what memcached-tag [1]
> does, but needs to be persisted.
>
> Any thoughts?
>
>
> Bill
>
>
> [1] http://code.google.com/p/memcached-tag/wiki/MemcacheTagIntroduction
>
>
>