Posted to user@hbase.apache.org by Andrew Nguyen <an...@ucsfcti.org> on 2010/06/04 18:16:42 UTC

Re: Modeling column families

Ryan,

I went ahead and began modeling our data as you suggested below.  However, we just realized something about our compound key: we don't actually have access to the patient identifier at the level where the data collection is performed.  What we do know is the bed #.  We have a predetermined number of beds, so I was wondering whether there are better ways to model everything given this finite (and predetermined) set of compound keys.

Given this, would it be better to have a different table for each bed (with just the timestamp as the row key)?  What are the downsides to having hundreds of different tables that otherwise share the same "schema"?
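
For reference, here's a minimal sketch of the single-table alternative, with the bed # folded into the compound key following your <user><timestamp> pattern (the names here are made up):

byte[] row = Bytes.add(
  Bytes.toBytes(bedIdAsInt),        // 4 bytes, big-endian (placeholder name)
  Bytes.toBytes(timestampAsLong));  // 8 bytes, big-endian (placeholder name)

// all readings for one bed, in time order:
Scan scan = new Scan(Bytes.toBytes(bedId), Bytes.toBytes(bedId + 1));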

Thanks!

--Andrew

--
Andrew Nguyen
andrew@ucsfcti.org


On Apr 24, 2010, at 1:21 PM, Ryan Rawson wrote:

> Each column family acts like a different table, so each has strong
> data locality on disk; things you retrieve together should live in the
> same column family.  A case where you might use two is the classic
> 'webtable' example from the Bigtable paper, where they keep the
> original page text in one family and extracted features (such as
> outgoing links) in a 'meta' family.
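> 
> A minimal sketch of setting that up (0.20-era client API; the table and
> family names are just for illustration):
> 
> HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());
> HTableDescriptor desc = new HTableDescriptor("webtable");
> desc.addFamily(new HColumnDescriptor("content")); // original text
> desc.addFamily(new HColumnDescriptor("meta"));    // extracted features
> admin.createTable(desc);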
> 
> The row oriented solution works well with HBase's splitting model
> because it will allow you to spread your load evenly over more nodes
> for any given bed.  Splits are done by data size, so things generally
> work out really well.  I have yet to see a situation where the split
> didn't "do it right" and caused bad performance.
> 
> Most of the 'nosql' solutions tend to be focused on key-value data
> modeling with wide rows.  But this is not the only technique!  I think
> tall tables (more rows, narrower rows) with compound keys are a highly
> underrated and relatively unknown approach.
> 
> One of my tables has a key schema like so:
> <user><timestamp><eventid>
> 
> Where you create it like so:
> Bytes.add(
>  Bytes.toBytes(userAsInt),       // 4 bytes, big-endian
>  Bytes.toBytes(timestampAsLong), // 8 bytes, big-endian
>  Bytes.toBytes(eventIdAsInt));   // 4 bytes, big-endian
> 
> Each part of the key is a fixed width (4 bytes for the ints, 8 for the
> long) and stored in big-endian order, so the full key is a fixed 16
> bytes and rows sort by user, then timestamp, then event id.
> 
> So a user's events are stored in the order they happened.  The eventid
> allows multiple events per timestamp.  If you want all of a user's
> events you build a scan like so:
> 
> Scan scan = new Scan(Bytes.toBytes(userId), Bytes.toBytes(userId + 1));
> 
> Since the scan end row is exclusive you only get the events for the
> user in 'userId'.
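> 
> To actually run it, roughly ("events" is a made-up table name, and this
> assumes the 0.20-era client API):
> 
> HTable table = new HTable(new HBaseConfiguration(), "events");
> ResultScanner scanner = table.getScanner(scan);
> for (Result result : scanner) {
>   // each Result is one row, keyed <user><timestamp><eventid>
> }
> scanner.close();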
> 
> You can do this:
> 
> Scan scan = new Scan(
>     Bytes.add(Bytes.toBytes(userId), Bytes.toBytes(timestampToStart)),
>     Bytes.toBytes(userId + 1));
> 
> to get a partial date scan - from the timestamp to the end of the user data.
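> 
> And if you want to bound both ends of the time range, the stop row can
> carry a timestamp too (timestampToStop is a hypothetical name; note the
> stop timestamp ends up exclusive):
> 
> Scan scan = new Scan(
>     Bytes.add(Bytes.toBytes(userId), Bytes.toBytes(timestampToStart)),
>     Bytes.add(Bytes.toBytes(userId), Bytes.toBytes(timestampToStop)));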
> 
> In this schema the data is stored IN chronological order, so the
> oldest entries are at the beginning.  If you find reverse order more
> useful, do this instead when you build a key:
> 
> Bytes.add(
>  Bytes.toBytes(userAsInt),
>  Bytes.toBytes(Long.MAX_VALUE - timestampAsLong),
>  Bytes.toBytes(eventIdAsInt));
> 
> Note the MAX_VALUE subtraction: it makes the newest entries sort
> first, with rows stored going backwards in time.
> 
> To recover the original timestamp from a key, do:
> 
> // the timestamp starts after an int, and is a long
> long valueFromKey = Bytes.toLong(key, Bytes.SIZEOF_INT, Bytes.SIZEOF_LONG);
> long timestamp = Long.MAX_VALUE - valueFromKey;
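> 
> Decoding the other fixed-width parts is analogous; for example, the
> trailing event id (an int at offset 12):
> 
> int eventId = Bytes.toInt(key, Bytes.SIZEOF_INT + Bytes.SIZEOF_LONG);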
> 
> I hope this helps!
> 
> 
> On Sat, Apr 24, 2010 at 1:10 PM, Andrew Nguyen
> <an...@ucsfcti.org> wrote:
>> Ryan,
>> 
>> Exactly - eventually, we will be storing data continuously on N beds in the ICU.  So, if it's waveform data, it's probably going to be 125 Hz, which is about 3.9 billion points per bed per year, times N beds.  I've been trying to figure out what search terms to use to dive deeper into "compound keys" with respect to NoSQL solutions.
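>> 
>> (Back of the envelope: 125 samples/sec x 86,400 sec/day x 365 days is
>> about 3.94 billion samples per bed per year.)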
>> 
>> You mention tall tables - this sounds consistent with what Erik and Andrey have said.  Given that, just to clarify my understanding, I'm probably looking at a single table with only one column (the value, which Andrey names as "series"???) and billions of rows, right?
>> 
>> That said, the decision to break up the values into multiple column families is just a function of performance and how I want the data physically stored.  Are there any other major points to consider for determining what column families to have?  (I made this conclusion from your hbase-nosql presentation on slideshare.)
>> 
>> Thanks all!
>> 
>> --Andrew
>> 
>> On Apr 24, 2010, at 12:59 PM, Ryan Rawson wrote:
>> 
>>> For example, if you are storing timeseries data for a monitoring
>>> system, you might want to store it by row, since the number of points
>>> for a single system might be arbitrarily large (think: 2+ years of
>>> data).  In that case, if the expected data set size per row is larger
>>> than what a single machine could conceivably store, Cassandra would
>>> not work for you (since each row must be stored on a single node, or
>>> rather on N replica nodes).
>> 
>>