Posted to user@hbase.apache.org by David Ross <dy...@klout.com> on 2012/01/24 08:48:18 UTC
Designing HBase schema to best support specific queries

Howdy,

I have an HBase schema-design related question. The problem is fairly
simple - I am storing "notifications" in hbase, each of which has a status
("new", "seen", or "read"). Here are the APIs I need to provide:

- Get all notifications for a user
- Get all "new" notifications for a user
- Get the count of all "new" notifications for a user
- Update status for a notification
- Update status for all of a user's notifications
- Get all "new" notifications across the database
- Notifications should be scannable in reverse chronological order and
allow pagination.

I have a few ideas, and I wanted to see if one of them is clearly best, or
if I have missed a good strategy entirely. Common to all three, I think
having one row per notification and having the user id in the rowkey is the
way to go. To get chronological ordering for pagination, I need to have a
reverse timestamp in there, too. I'd like to keep all notifs in one table
(so I don't have to merge sort for the "get all notifications for a user"
call) and don't want to write batch jobs for secondary index tables (since
updates to the count and status should be in real time).
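
As a rough sketch of the "userId_reverseTimestamp" idea (plain Python
rather than HBase client code; row_key, timestamp_of and LONG_MAX are
names made up for illustration), subtracting the timestamp from a large
constant and storing it as a fixed-width big-endian integer makes the
newest notification sort first under byte-wise ordering:

```python
import struct

LONG_MAX = 2**63 - 1  # assumed bound, same value as Java's Long.MAX_VALUE


def row_key(user_id: str, ts_ms: int) -> bytes:
    """Build '<userId>_<reverse timestamp>' as bytes.

    The reverse timestamp is packed as a fixed-width big-endian integer,
    so byte-wise (lexicographic) order matches numeric order.
    """
    reverse_ts = LONG_MAX - ts_ms
    return user_id.encode("utf-8") + b"_" + struct.pack(">Q", reverse_ts)


def timestamp_of(key: bytes) -> int:
    """Recover the original timestamp from the last 8 bytes of the key."""
    (reverse_ts,) = struct.unpack(">Q", key[-8:])
    return LONG_MAX - reverse_ts


# Three notifications for one user, inserted out of order.
keys = [row_key("u42", ts) for ts in (1000, 3000, 2000)]

# Sorting the raw bytes yields reverse chronological order.
newest_first = sorted(keys)
```

This assumes user ids never contain the "_" separator; a fixed-width or
hashed user id would avoid that restriction.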

The simplest way to do it would be (1) to make the row key
"userId_reverseTimestamp" and filter on status on the client side. This
seems naive, since we would be sending lots of unnecessary data over the
network.
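
To make the cost of (1) concrete, here is a small in-memory model (a
plain dict standing in for the table; scan_user, new_notifications and
the zero-padded key format are assumptions for illustration, not HBase
API). Pagination resumes after the last key of the previous page, and
the status filter runs only after every row for the user has been
fetched:

```python
from bisect import bisect_left, bisect_right

MAX_TS = 10**13 - 1  # assumed upper bound on millisecond timestamps


def row_key(user_id, ts_ms):
    # Zero-padded reverse timestamp so string order is newest first.
    return f"{user_id}_{MAX_TS - ts_ms:013d}"


def scan_user(table, user_id, start_after=None, limit=None):
    """Return (key, row) pairs for one user in key order (newest first)."""
    keys = sorted(table)
    prefix = user_id + "_"
    if start_after is None:
        lo = bisect_left(keys, prefix)
    else:
        lo = bisect_right(keys, start_after)  # resume after the last key seen
    rows = []
    for key in keys[lo:]:
        if not key.startswith(prefix):
            break  # left this user's key range
        rows.append((key, table[key]))
        if limit is not None and len(rows) == limit:
            break
    return rows


def new_notifications(table, user_id):
    """Client-side filtering: fetch everything, then keep only 'new' rows."""
    rows = scan_user(table, user_id)
    new_rows = [(k, v) for k, v in rows if v["status"] == "new"]
    return new_rows, len(rows)  # second value = rows sent to the client


table = {}
for ts, status in [(1000, "read"), (2000, "read"), (3000, "seen"),
                   (4000, "new"), (5000, "new")]:
    table[row_key("alice", ts)] = {"status": status, "body": f"n{ts}"}
table[row_key("bob", 1500)] = {"status": "new", "body": "other user"}

page1 = scan_user(table, "alice", limit=2)
page2 = scan_user(table, "alice", start_after=page1[-1][0], limit=2)
new_rows, transferred = new_notifications(table, "alice")
```

In this toy data set, five rows are transferred to return two "new"
notifications; the ratio gets worse as read notifications accumulate.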

The next possibility is to (2) encode the status into the rowkey as well,
i.e. "userId_reverseTimestamp_status", and then do rowkey regex filtering
on the scans. The first issue I see is needing to delete a row and copy the
notification data to a new row when the status changes (which, presumably,
should happen exactly twice per notification). Also, since the
status is the last part of the rowkey, for each user, we will be scanning
lots of extra rows. Is this a big performance hit? Finally, in order to
change status, I will need to know what the previous status was (to build
the row key) or else I will need to do another scan.
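
A toy model of (2), again with a plain dict instead of a real table
(change_status, find_status and scan_by_status are hypothetical helpers),
shows the two costs mentioned above: a status change becomes a delete
plus a put under a new key, and the caller must either know the old
status or probe for it. The regex scan still visits every row for the
user:

```python
import re

STATUSES = ("new", "seen", "read")


def row_key(user_id, reverse_ts, status):
    return f"{user_id}_{reverse_ts:013d}_{status}"


def find_status(table, user_id, reverse_ts):
    """Probe each candidate key when the previous status is unknown."""
    for status in STATUSES:
        if row_key(user_id, reverse_ts, status) in table:
            return status
    return None


def change_status(table, user_id, reverse_ts, old_status, new_status):
    """The status is part of the key, so an update moves the whole row."""
    data = table.pop(row_key(user_id, reverse_ts, old_status))  # delete
    table[row_key(user_id, reverse_ts, new_status)] = data      # re-insert


def scan_by_status(table, user_id, status):
    """Regex filter on the key; returns (matching keys, rows visited)."""
    pattern = re.compile(re.escape(user_id) + r"_\d{13}_" + re.escape(status))
    prefix = user_id + "_"
    visited = [k for k in sorted(table) if k.startswith(prefix)]
    return [k for k in visited if pattern.fullmatch(k)], len(visited)


table = {
    row_key("u1", 100, "new"): {"body": "hello"},
    row_key("u1", 200, "read"): {"body": "old"},
}
before, visited = scan_by_status(table, "u1", "new")

previous = find_status(table, "u1", 100)          # extra lookups
change_status(table, "u1", 100, previous, "seen")
after, _ = scan_by_status(table, "u1", "new")
```

In a real store the delete and the put touch two different row keys, so
the move would generally not be a single atomic operation.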

The last idea I had is to (3) have two column families, one for the static
notif data, and one as a flag for the status, i.e. "s:read" or "s:new" with
's' as the cf and the status as the qualifier. There would be exactly one
per row, and I can do a MultipleColumnPrefixFilter or SkipFilter w/
ColumnPrefixFilter against that cf. Here too, I would have to delete and
create columns on status change, but it should be much more lightweight
than copying whole rows. My only concern is the warning in the HBase book
that HBase doesn't do well with "more than 2 or 3 column families" -
perhaps if the system needs to be extended with more querying capabilities,
the multi-cf strategy won't scale.
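
For comparison, a sketch of (3) using the same kind of in-memory model
(each row is a dict keyed by (family, qualifier); the family names "d"
and "s" and all helper names are assumptions for illustration). The row
key never changes, and a status change only swaps one small flag column:

```python
def row_key(user_id, reverse_ts):
    return f"{user_id}_{reverse_ts:013d}"


def put_notification(table, key, body):
    # "d" holds the static notification data, "s" holds one status flag.
    table[key] = {("d", "body"): body, ("s", "new"): b""}


def set_status(table, key, new_status):
    """Clear whatever flag is in family 's', then write the new one.

    The caller does not need to know the previous status.
    """
    row = table[key]
    for column in [c for c in row if c[0] == "s"]:
        del row[column]
    row[("s", new_status)] = b""


def keys_with_status(table, user_id, status):
    """Keep only rows whose 's' family carries the given qualifier."""
    prefix = user_id + "_"
    return [k for k in sorted(table)
            if k.startswith(prefix) and ("s", status) in table[k]]


table = {}
for reverse_ts in (100, 200, 300):
    put_notification(table, row_key("u1", reverse_ts), f"notif {reverse_ts}")

set_status(table, row_key("u1", 200), "read")

new_keys = keys_with_status(table, "u1", "new")
new_count = len(new_keys)
```

The scan still walks every row for the user, but it only has to inspect
the small flag column rather than the full notification payload.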

So (1) seems like it would have too much network overhead, (2) seems like
it would waste effort copying data, and (3) might cause issues with too
many families. Between (2) and (3), which type of filter should give better
performance? In both cases, the scan will have to look at each row for a
user, and most of those rows will presumably be read notifications, so
which approach handles that better? I think I'm leaning towards (3). Are
there other options (or tweaks) that I have missed?

Thanks,

David