Posted to user@hbase.apache.org by Aji Janis <aj...@gmail.com> on 2013/07/02 15:32:37 UTC

When to expand vertically vs. horizontally in Hbase

The section on Rows vs. Columns at
http://hbase.apache.org/book/schema.smackdown.html talks about expanding
horizontally vs. vertically.

Can someone please explain when to choose rows vs. columns? The section
reads, "To be clear, this guideline is in the context is in extremely wide
cases, not in the standard use-case where one needs to store a few dozen or
hundred columns". So if I had 5 column families with 10 qualifiers each,
accessed mostly together, is this a case for a wider or a taller table?
Thanks in advance for any help.

Re: When to expand vertically vs. horizontally in Hbase

Posted by Michael Segel <mi...@hotmail.com>.
Ian, 

You still want to stick to your relational modeling.  :-(

You need to play around more with hierarchical models to get a better appreciation. 

If you model as if you're working with an RDBMS, then you will end up with a poor HBase table design.

In ERD models, you don't have the concept of a weak relationship.
A weak relationship means that the model itself defines no relationship between the entities. It's the application that manages that.

Imagine a reference or lookup table that has no association in the model. Using our example of an order entry system, it's the application that hits the customer lookup table to capture relevant information for the order. That's why I refer to it as a weak association.



On Jul 5, 2013, at 6:00 PM, Ian Varley <iv...@salesforce.com> wrote:

> Sure. Maybe it's useful to talk about the functional aspect of relationships in models. In an RDBMS, explicit relationships play a couple of roles:
> 
> - foreign key constraints: don't allow a tuple in relation A to point to a row in relation B that doesn't exist
> - join optimization: knowledge of how two relations are logically connected can help perform joins in a more optimal way
> 
> HBase, of course, provides neither of these features out of the box, so there is no difference between an implied (weakly coupled, to use your term) relationship and something stronger. 
> 
> Where it gets interesting is in the kind of denormalization you're talking about, where information that properly belongs to one entity is copied into another one for efficiency's sake, or to get some kind of atomicity protection. Your scenario below is doing this (duplicating customer info in the order records). 
> 
> To be fair, relational DBs also force this kind of behavior sometimes, again for efficiency reasons (we've all done it). HBase just starts there. :)
> 
> Ian
> 


Re: When to expand vertically vs. horizontally in Hbase

Posted by Ian Varley <iv...@salesforce.com>.
Sure. Maybe it's useful to talk about the functional aspect of relationships in models. In an RDBMS, explicit relationships play a couple of roles:

- foreign key constraints: don't allow a tuple in relation A to point to a row in relation B that doesn't exist
- join optimization: knowledge of how two relations are logically connected can help perform joins in a more optimal way

HBase, of course, provides neither of these features out of the box, so there is no difference between an implied (weakly coupled, to use your term) relationship and something stronger. 

Where it gets interesting is in the kind of denormalization you're talking about, where information that properly belongs to one entity is copied into another one for efficiency's sake, or to get some kind of atomicity protection. Your scenario below is doing this (duplicating customer info in the order records). 

To be fair, relational DBs also force this kind of behavior sometimes, again for efficiency reasons (we've all done it). HBase just starts there. :)

Ian

On Jul 5, 2013, at 4:22 PM, "Michael Segel" <mi...@hotmail.com> wrote:

> An entity is an entity. 
> When you couple them you are saying that there's a relationship to them in the model. 
> 
> What I am saying is that you can have an HBase model which is not a single table; however, when you look at your use case, you are querying data from a single table at a time.
> 
> Going back to the order entry system: you may have a customer table which maintains all of the information about your customer, yet you will also duplicate portions of that data into the order system. You still have other entities such as your orders, pick slips, shipping and invoices. There won't be a hard or strong relationship between the customer table and the order table.
> 
> When you go to your ERD tool, you wouldn't show a strong coupling of the data. 
> 
> Does that make sense? 
> 

Re: When to expand vertically vs. horizontally in Hbase

Posted by Michael Segel <mi...@hotmail.com>.
An entity is an entity. 
When you couple them you are saying that there's a relationship to them in the model. 

What I am saying is that you can have an HBase model which is not a single table; however, when you look at your use case, you are querying data from a single table at a time.

Going back to the order entry system: you may have a customer table which maintains all of the information about your customer, yet you will also duplicate portions of that data into the order system. You still have other entities such as your orders, pick slips, shipping and invoices. There won't be a hard or strong relationship between the customer table and the order table.

When you go to your ERD tool, you wouldn't show a strong coupling of the data. 

Does that make sense? 

On Jul 5, 2013, at 1:56 PM, Ian Varley <iv...@salesforce.com> wrote:

> Mike, what do you mean by "you can have entities, except that they are not coupled"? You mean, they have no relationship to each other? Or the relationship is defined elsewhere (e.g. application code)? The concept of "coupling" seems a little overloaded and not as concise here as "relationship". Two tuples in a database can have a wide variety of relationships to each other; the kinds of relationships that are actively supported differ between a traditional RDBMS and HBase, and proper HBase design requires understanding these limitations precisely.
> 
> I'm not trying to be an ER<http://en.wikipedia.org/wiki/Entity%E2%80%93relationship_model> apologist, there are a lot of ways in which it sucks. :) But if we want to evolve, we can't just pretend there's no history here to build on.
> 
> Ian
> 


Re: When to expand vertically vs. horizontally in Hbase

Posted by Ian Varley <iv...@salesforce.com>.
Mike, what do you mean by "you can have entities, except that they are not coupled"? You mean, they have no relationship to each other? Or the relationship is defined elsewhere (e.g. application code)? The concept of "coupling" seems a little overloaded and not as concise here as "relationship". Two tuples in a database can have a wide variety of relationships to each other; the kinds of relationships that are actively supported differ between a traditional RDBMS and HBase, and proper HBase design requires understanding these limitations precisely.

I'm not trying to be an ER<http://en.wikipedia.org/wiki/Entity%E2%80%93relationship_model> apologist, there are a lot of ways in which it sucks. :) But if we want to evolve, we can't just pretend there's no history here to build on.

Ian

On Jul 5, 2013, at 1:41 PM, Michael Segel wrote:

LOL...

Ian wrote:
"But, something just occurred to me: just because your physical implementation (HBase) doesn't support normalized entities and relationships doesn't mean your *problem* doesn't have entities and relationships. :) An Author is one entity, a Title is another, and a Genre is a third. Understanding how they interact is a prerequisite for translating into a physical model that works well in HBase. (ERD modeling is not categorically the only way to understand that, but I've yet to hear a credible alternative that doesn't boil down to either ERD or "do it in your head").
"

You can have entities, except that they are not coupled.

If you have a common key, then you may have a use for column families, it just depends on your data and how you access your data.

It's not rocket science, but it's a non-trivial matter. Not doing it right may mean that you are not going to get the most out of your system.





Re: When to expand vertically vs. horizontally in Hbase

Posted by Michael Segel <mi...@hotmail.com>.
LOL...

Ian wrote:
"But, something just occurred to me: just because your physical implementation (HBase) doesn't support normalized entities and relationships doesn't mean your *problem* doesn't have entities and relationships. :) An Author is one entity, a Title is another, and a Genre is a third. Understanding how they interact is a prerequisite for translating into a physical model that works well in HBase. (ERD modeling is not categorically the only way to understand that, but I've yet to hear a credible alternative that doesn't boil down to either ERD or "do it in your head").
"

You can have entities, except that they are not coupled. 

If you have a common key, then you may have a use for column families, it just depends on your data and how you access your data. 

It's not rocket science, but it's a non-trivial matter. Not doing it right may mean that you are not going to get the most out of your system.


On Jul 5, 2013, at 1:26 PM, Ian Varley <iv...@salesforce.com> wrote:

> But, something just occurred to me: just because your physical implementation (HBase) doesn't support normalized entities and relationships doesn't mean your *problem* doesn't have entities and relationships. :) An Author is one entity, a Title is another, and a Genre is a third. Understanding how they interact is a prerequisite for translating into a physical model that works well in HBase. (ERD modeling is not categorically the only way to understand that, but I've yet to hear a credible alternative that doesn't boil down to either ERD or "do it in your head").


Re: When to expand vertically vs. horizontally in Hbase

Posted by Ian Varley <iv...@salesforce.com>.
Mike and I get into good discussions about ERD modeling and HBase a lot ... :)

Mike's right that you should avoid a design that relies heavily on relationships when modeling data in HBase, because relationships are tricky (they're the first thing that gets thrown out the window in a database that can scale to huge data sets, because enforcing them is more trouble than it's worth; so is supporting normalization, joins, etc.). If you start with a traditional ERD, you're more likely to fall into this trap, because you're "used to" normalizing the crap out of your entities.

But, something just occurred to me: just because your physical implementation (HBase) doesn't support normalized entities and relationships doesn't mean your *problem* doesn't have entities and relationships. :) An Author is one entity, a Title is another, and a Genre is a third. Understanding how they interact is a prerequisite for translating into a physical model that works well in HBase. (ERD modeling is not categorically the only way to understand that, but I've yet to hear a credible alternative that doesn't boil down to either ERD or "do it in your head").

Once you understand what your entities really are, and how they relate to each other, you have pretty limited choices for how to represent multiple independent entities in HBase:

1) In unrelated tables. You just put authors in one table, titles in another, and genres in a third. You do all the work of joining and maintaining cross-entity integrity yourself (if needed). This is the default mode in HBase: "you worry about it". And that works great in many simple cases. This is appropriate if your "hard problem" is scaling a small set of simple entities to massive size, and you can take the hit for the application complexity that follows.

2) Scrunched into one table. You figure out the most important entity, and make that *the* table, with all other data stuffed into it. In simple cases, this could be columns that hold JSON; in advanced cases, you could use many columns to "nest" other entities in an intra-row version of denormalization. For example, have the row key of the HBase table be something like "Author ID", and then have a repeating column series for their titles, with column names like "title:1234", "title:5678", etc. This isn't a very common model, because you have to jump through some hoops in HBase (e.g. in this model, the way you would scan over authors differs from how you'd "scan over" titles for an author or across authors). The only real advantage to this over other forms of denormalization is that HBase guarantees intra-row ACID properties, so you're guaranteed to get all or none of the updates to the row (i.e. you don't have to reason about the failure cases). This can (but does *not* have to) use different column families for the different "entities" inside the row.

3) Denormalized across many tables. When you write to HBase, you write in multiple layouts: the Author table also contains a list of their titles, the Title table has author name & other info, etc. This basically equates to doing extra work at write time so you don't have to write code that does arbitrary joins and index usage at read time; in exchange, you get slower and more complex writes, but faster and simpler reads from different access paths. (It's still quite tricky, because you have to handle failure cases--what if one table gets written but the other doesn't?)

4) Normalized, with help from custom coprocessors. You could write your own suite of coprocessors to automatically do database-like things for you, such as joins and secondary indexing. I wouldn't recommend this route unless you're doing them in a general enough way to share. For example, Phoenix has an aggregation component that's built as a coprocessor and works really well; and it's applicable to anyone who wants to use Phoenix. You could build more stuff on this SQL framework, like indexes and joins and cascaded relationships and stuff. But that's a pretty massive undertaking for a single use case. :)
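
To make the difference between nesting titles inside an author row (option 2) and giving each title its own row more concrete, here is a minimal in-memory sketch in plain Java. It is not HBase client code: sorted TreeMaps only stand in for HBase's sorted row keys and qualifiers, and the class name, the "|" separator, and the sample IDs are assumptions made up for illustration. It also assumes IDs never contain the separator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical in-memory model of two layouts; TreeMap mimics sorted keys.
class AuthorTitleLayouts {

    // Wide: row key = author ID, one "title:<id>" qualifier per title.
    static void putWide(TreeMap<String, TreeMap<String, String>> table,
                        String authorId, String titleId, String titleName) {
        table.computeIfAbsent(authorId, k -> new TreeMap<>())
             .put("title:" + titleId, titleName);
    }

    // Tall: row key = "<authorId>|<titleId>", one row per title.
    static void putTall(TreeMap<String, String> table,
                        String authorId, String titleId, String titleName) {
        table.put(authorId + "|" + titleId, titleName);
    }

    // Wide read: fetch the single author row and walk its qualifiers.
    static List<String> titlesWide(TreeMap<String, TreeMap<String, String>> table,
                                   String authorId) {
        List<String> titles = new ArrayList<>();
        TreeMap<String, String> row = table.get(authorId);
        if (row != null) {
            titles.addAll(row.values());
        }
        return titles;
    }

    // Tall read: range scan over every row key starting with "<authorId>|".
    static List<String> titlesTall(TreeMap<String, String> table, String authorId) {
        String start = authorId + "|";
        String stop = authorId + "}"; // '}' is the character right after '|'
        return new ArrayList<>(table.subMap(start, true, stop, false).values());
    }

    public static void main(String[] args) {
        TreeMap<String, TreeMap<String, String>> wide = new TreeMap<>();
        TreeMap<String, String> tall = new TreeMap<>();
        putWide(wide, "author1", "1234", "First Title");
        putWide(wide, "author1", "5678", "Second Title");
        putTall(tall, "author1", "1234", "First Title");
        putTall(tall, "author1", "5678", "Second Title");
        putTall(tall, "author2", "9999", "Other Title");
        System.out.println(titlesWide(wide, "author1"));
        System.out.println(titlesTall(tall, "author1"));
    }
}
```

Both reads return the same titles; what differs is the access path. The wide layout reads one (possibly very large) row, while the tall layout scans a contiguous key range, which is why scanning over authors and "scanning over" titles look different in the nested model.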

Maybe there are others I'm not thinking of, but I think these are basically your only choices. Mike, can you think of other basic approaches to representing more than one entity in HBase (where entity is defined as some repeating element in your data storage where individual instances are uniquely identifiable, possibly with one or more additional attributes)?

Ian

On Jul 5, 2013, at 12:48 PM, Michael Segel wrote:

Sorry, but you missed the point.

(Note: This is why I keep trying to put a talk at Strata and the other conferences on Schema design yet for some reason... it just doesn't seem important enough or sexy enough... maybe if I worked for Cloudera/Intel/etc ...  ;-)

Look,

The issue is what column families are and how to use them.

Since each column family is stored in its own HFiles under the same row key, the question is why you need one and when you want to use it.

The answer, unfortunately, is a bit more complicated than the questions.

You have to ask yourself: when do you have a series of tables which have the same key value?
How do you access this data?

It gets more involved, but just looking at the answers to those two questions is a start.

Like I said, think about the order entry example and how the data is used in those column families.

Please also remember that you are NOT WORKING IN A RELATIONAL MODEL. Sorry to shout that last part, but it's a very important concept. You need to stop thinking in terms of ERD when there is no relationship. Column families tend to create a weak relationship... which makes them a bit more confusing....

On Jul 5, 2013, at 11:16 AM, Aji Janis <aj...@gmail.com> wrote:

I understand that there shouldn't be an unlimited number of column families. I
am using this example on purpose to see how it comes into play.


On Fri, Jul 5, 2013 at 12:07 PM, Michael Segel <mi...@hotmail.com> wrote:

Why do you have so many column families (CFs)?

It's not a question of physical limitations, but more an issue of
data design.

There aren't that many really good examples of where you would need
more than a handful of CFs.

When I teach or lecture, the example I use is an order entry system,
where you would have the same key on order entry, pick slips, shipping,
and invoice.

That's probably the best example of where CFs come into play.

I'd suggest that you go back and rethink the design if you're having more
than a handful.



On Jul 5, 2013, at 8:53 AM, Aji Janis <aj...@gmail.com> wrote:

Asaf,

I am using the Genre/Author stuff as an example but yes, at the moment I
only have 5 column families. However, over time I may have more (no upper
limit decided at this point). See below for more responses.


On Wed, Jul 3, 2013 at 3:42 PM, Asaf Mesika <as...@gmail.com> wrote:

Do you have only 5 static author names?
Keep in mind the column family name is defined when creating the table.

Regarding the tall vs. wide debate:
HBase is first and foremost a key-value database, so reads and writes happen
at the column-value level. So it doesn't really care about rows.
But that's not entirely true. Rows come into play in the following
situations:
Splitting a region is per row and not per column, so a row will be saved
as a whole on one region. If you have a really large row, the region size
granularity depends on it. That doesn't seem to be the case here.
Put/Delete takes a lock until finished. If you do intensive inserts
to the same row at the same time, this might be bad for you; keeping your
rows slimmer can reduce contention, but again, only if you make a lot of
concurrent modifications to the same row.


I expect batches of Put/Delete to the same row to happen by at most one
thread at a time based on user's current behavior. So locking shouldn't be
an issue. However, not sure if the saving row to a region with enough space
topic is really an issue I need to worry about (probably because I just
don't know much about it yet).


Filtering - if you need a filter which needs the whole row (there is a
method you override in Filter to mark that), then a fat row will be more
memory intensive. If you needed only 1/5 of your row, then maybe splitting
it into 5 rows to begin with would have made a better schema design in
terms of memory and I/O.


Currently, my access pattern is to get all data for a given row. It's
possible in the future we may want to apply (family/qualifier) filters.
There is a lot of uncertainty on use cases (client side) at this point,
which is why I am not entirely sure how things will look months from
now. I am not sure I follow this statement:

"if you need a filter which need all the row (there is a method you
override in Filter to mark that) than a far row will be more memory
intensive."

Can you please explain? Thank you for these suggestions btw, good food for
thought!



On Wednesday, July 3, 2013, Aji Janis wrote:

I have a major typo in the question so I apologize. I meant to say 5
families with 1000+ qualifiers each.

Let's work with an example (not the greatest example here, but still).
Let's say we have a Genre class like this:

class HistoryBooks {

    // One list per author; each author maps to a column family.
    ArrayList<Books> author1;
    ArrayList<Books> author2;
    ArrayList<Books> author3;
    ArrayList<Books> author4;
    ArrayList<Books> author5;

    ...
}

Each author is a column family (let's say we only allow 5 authors per
<T>Book class). Book per author ends up being the qualifier. In this case,
I know I have a max family count but my qualifiers have no upper limit. So
is this scenario a case for a tall or a wide table? Why? Thank you.


On Tue, Jul 2, 2013 at 9:56 AM, Bryan Beaudreault <bb...@hubspot.com> wrote:

If they are accessed mostly together they should all be a single column
family. The key with tall or wide is based on the total byte size of each
KeyValue. Your cells would need to be quite large for 50 to become a
problem. I still would recommend using a single CF though.
—
Sent from iPhone
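
As a rough illustration of the "total byte size of each KeyValue" point above, the sketch below estimates how many bytes a row occupies when every cell repeats its row key, family, and qualifier. The fixed per-cell overhead reflects my understanding of the classic KeyValue layout (key length, value length, row length, family length, timestamp, key type) and should be treated as an assumption to check against your HBase version; the class, method names, and sample values are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Back-of-the-envelope estimate only; not a call into any HBase API.
class KeyValueSizeSketch {

    // Assumed fixed bytes per cell: key length (4) + value length (4)
    // + row length (2) + family length (1) + timestamp (8) + key type (1).
    static final int FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1;

    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Size of one cell: the row key, family and qualifier are stored with every value.
    static long cellSize(String row, String family, String qualifier, int valueBytes) {
        return FIXED_OVERHEAD + utf8Length(row) + utf8Length(family)
                + utf8Length(qualifier) + valueBytes;
    }

    // Size of a row made of `cells` cells that all have similar name and value lengths.
    static long rowSize(String row, String family, String qualifier,
                        int valueBytes, int cells) {
        return cells * cellSize(row, family, qualifier, valueBytes);
    }

    public static void main(String[] args) {
        // 5 families x 10 qualifiers = 50 cells, as in the original question.
        System.out.println(rowSize("row-0001", "cf", "q01", 100, 50));
        // 5 families x 1000 qualifiers = 5000 cells, as in the corrected question.
        System.out.println(rowSize("row-0001", "cf", "q0001", 100, 5000));
    }
}
```

With small values like these, even a few thousand cells per row stay well under a megabyte, which is consistent with the advice that cells would need to be quite large before 50 columns become a problem.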






Re: When to expand vertically vs. horizontally in Hbase

Posted by Michael Segel <mi...@hotmail.com>.
Sorry, but you missed the point. 

(Note: This is why I keep trying to put a talk at Strata and the other conferences on Schema design yet for some reason... it just doesn't seem important enough or sexy enough... maybe if I worked for Cloudera/Intel/etc ...  ;-) 

Look, 

The issue is what column families are and how to use them.

Since each column family is stored in its own HFiles under the same row key, the question is why you need one and when you want to use it.

The answer, unfortunately, is a bit more complicated than the questions.

You have to ask yourself: when do you have a series of tables which have the same key value?
How do you access this data?

It gets more involved, but just looking at the answers to those two questions is a start. 

Like I said, think about the order entry example and how the data is used in those column families. 
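
One way to picture the order entry example is a set of separate sorted stores, one per column family, that all share the same row key. The sketch below is a plain-Java model of that idea, not HBase client code; the family names, the row key "order-1001", and the storesTouched counter are hypothetical and exist only to show which stores a read has to visit.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model: each column family is its own sorted store keyed by row key.
class OrderEntryFamilies {

    // family name -> (row key -> (qualifier -> value))
    private final Map<String, TreeMap<String, Map<String, String>>> stores =
            new LinkedHashMap<>();

    // Counts how many per-family stores have been read so far.
    int storesTouched = 0;

    // Families are fixed up front, mirroring how they are declared at table creation.
    OrderEntryFamilies(String... families) {
        for (String family : families) {
            stores.put(family, new TreeMap<>());
        }
    }

    void put(String family, String rowKey, String qualifier, String value) {
        TreeMap<String, Map<String, String>> store = stores.get(family);
        if (store == null) {
            throw new IllegalArgumentException("unknown family: " + family);
        }
        store.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(qualifier, value);
    }

    // Reading one family for a key visits only that family's store.
    Map<String, String> get(String family, String rowKey) {
        TreeMap<String, Map<String, String>> store = stores.get(family);
        if (store == null) {
            throw new IllegalArgumentException("unknown family: " + family);
        }
        storesTouched++;
        Map<String, String> cells = store.get(rowKey);
        return cells == null ? new TreeMap<String, String>() : cells;
    }

    // Reading the whole row visits every family's store.
    Map<String, Map<String, String>> getRow(String rowKey) {
        Map<String, Map<String, String>> row = new LinkedHashMap<>();
        for (String family : stores.keySet()) {
            row.put(family, get(family, rowKey));
        }
        return row;
    }

    public static void main(String[] args) {
        OrderEntryFamilies table =
                new OrderEntryFamilies("order", "pickslip", "shipping", "invoice");
        table.put("order", "order-1001", "customer", "ACME");
        table.put("shipping", "order-1001", "carrier", "UPS");
        System.out.println(table.get("shipping", "order-1001"));
        System.out.println(table.getRow("order-1001"));
        System.out.println(table.storesTouched);
    }
}
```

Under this model, separate families pay off when the shipping step only ever reads shipping data for a key. If everything is read together, every store is visited on each read, which matches the advice elsewhere in the thread to keep columns that are accessed together in a single family.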

Please also remember that you are NOT WORKING IN A RELATIONAL MODEL. Sorry to shout that last part, but it's a very important concept. You need to stop thinking in terms of ERD when there is no relationship. Column families tend to create a weak relationship... which makes them a bit more confusing....

On Jul 5, 2013, at 11:16 AM, Aji Janis <aj...@gmail.com> wrote:

> I understand that there shouldn't be an unlimited number of column families. I
> am using this example on purpose to see how it comes into play.
> 
> 
> On Fri, Jul 5, 2013 at 12:07 PM, Michael Segel <mi...@hotmail.com> wrote:
> 
>> Why do you have so many column families (CFs)?
>> 
>> It's not a question of physical limitations, but more an issue of
>> data design.
>> 
>> There aren't that many really good examples of where you would need
>> more than a handful of CFs.
>> 
>> When I teach or lecture, the example I use is an order entry system,
>> where you would have the same key on order entry, pick slips, shipping,
>> and invoice.
>> 
>> That's probably the best example of where CFs come into play.
>> 
>> I'd suggest that you go back and rethink the design if you're having more
>> than a handful.
>> 
>> 
>> 
>> On Jul 5, 2013, at 8:53 AM, Aji Janis <aj...@gmail.com> wrote:
>> 
>>> Asaf,
>>> 
>>> I am using the Genre/Author stuff as an example but yes, at the moment I
>>> only have 5 column families. However, over time I may have more (no upper
>>> limit decided at this point). See below for more responses.
>>> 
>>> 
>>> On Wed, Jul 3, 2013 at 3:42 PM, Asaf Mesika <as...@gmail.com> wrote:
>>> 
>>>> Do you have only 5 static author names?
>>>> Keep in mind the column family name is defined when creating the table.
>>>> 
>>>> Regarding the tall vs. wide debate:
>>>> HBase is first and foremost a key-value database, so reads and writes happen
>>>> at the column-value level. So it doesn't really care about rows.
>>>> But that's not entirely true. Rows come into play in the following
>>>> situations:
>>>> Splitting a region is per row and not per column, so a row will be saved
>>>> as a whole on one region. If you have a really large row, the region size
>>>> granularity depends on it. That doesn't seem to be the case here.
>>>> Put/Delete takes a lock until finished. If you do intensive inserts
>>>> to the same row at the same time, this might be bad for you; keeping your
>>>> rows slimmer can reduce contention, but again, only if you make a lot of
>>>> concurrent modifications to the same row.
>>>> 
>>> 
>>> I expect batches of Put/Delete to the same row to happen by at most one
>>> thread at a time based on user's current behavior. So locking shouldn't be
>>> an issue. However, not sure if the saving row to a region with enough space
>>> topic is really an issue I need to worry about (probably because I just
>>> don't know much about it yet).
>>> 
>>> 
>>>> Filtering - if you need a filter which need all the row (there is a
>> method
>>>> you override in Filter to mark that) than a far row will be more memory
>>>> intensive. If you needed only 1/5 of your row, than maybe splitting it
>> to 5
>>>> rows to begin with would have made a better schema design in terms of
>>>> memory and I/O.
>>>> 
>>> 
>>> Currently, my access pattern is to get all data for a given row. Its
>>> possible in the future we may want to apply (family/qualifier) filters.
>>> There is a lot of uncertainty on use cases (client side) at this point
>>> which is why I am not entirely sure on how things will look months from
>>> now. I am not sure I follow this statement
>>> 
>>> "if you need a filter which need all the row (there is a method you
>>> override in Filter to mark that) than a far row will be more memory
>>> intensive."
>>> 
>>> Can you please explain? Thank you for these suggestions btw, good food
>> for
>>> thought!
>>> 
>>> 
>>>> 
>>>> On Wednesday, July 3, 2013, Aji Janis wrote:
>>>> 
>>>>> I have a major typo in the question so I apologize. I meant to say 5
>>>>> families with 1000+ qualifiers each.
>>>>> 
>>>>> Lets work with an example, (not the greatest example here but still).
>>>> Lets
>>>>> say we have a Genre Class like this:
>>>>> 
>>>>> Class HistoryBooks{
>>>>> 
>>>>> ArrayList<Books> author1;
>>>>> ArrayList<Books> author2;
>>>>> ArrayList<Books> author3;
>>>>> ArrayList<Books> author4;
>>>>> ArrayList<Books> author5;
>>>>> 
>>>>> ...}
>>>>> 
>>>>> Each author is a column family (lets say we only allow 5 authors per
>>>>> <T>Book class. Book per author ends up being the qualifier. In this
>>>> case, I
>>>>> know I have a max family count but my qualifiers have no upper limit.
>> So
>>>> is
>>>>> this scenario a case for tall or wide table? Why? Thank you.
>>>>> 
>>>>> 
>>>>> On Tue, Jul 2, 2013 at 9:56 AM, Bryan Beaudreault
>>>>> <bbeaudreault@hubspot.com> wrote:
>>>>> 
>>>>>> If they are accessed mostly together they should all be a single
>> column
>>>>>> family. The key with tall or wide is based on the total byte size of
>>>> each
>>>>>> KeyValue. Your cells would need to be quite large for 50 to become a
>>>>>> problem. I still would recommend using a single CF though.
>>>>>> —
>>>>>> Sent from iPhone
>>>> 
>> 
>> 


Re: When to expand vertically vs. horizontally in Hbase

Posted by Aji Janis <aj...@gmail.com>.
I understand that there shouldn't be an unlimited number of column families. I
am using this example on purpose to see how it comes into play.


On Fri, Jul 5, 2013 at 12:07 PM, Michael Segel <mi...@hotmail.com>wrote:

> Why do you have so many column families (CF) ?
>
> Its not a question on the physical limitations, but more on the issue of
> data design.
>
> There aren't that many really good examples of where you would have
> multiple column families that would require more than a handful of CFs.
>
> When I teach or lecture, the example I use is an order entry system.
>  Where you would have the same key on Order entry, pick slips, shipping,
> and invoice.
>
> That's probably the best example of where CFs come in to play.
>
> I'd suggest that you go back and rethink the design if you're having more
> than a handful.
>
>
>
> On Jul 5, 2013, at 8:53 AM, Aji Janis <aj...@gmail.com> wrote:
>
> > Asaf,
> >
> > I am using the Genre/Author stuff as an example but yes at the moment I
> > only have 5 column families. However, over time I may have more (no upper
> > limit decided that this point). See below for more responses
> >
> >
> > On Wed, Jul 3, 2013 at 3:42 PM, Asaf Mesika <as...@gmail.com>
> wrote:
> >
> >> Do you have only 5 static author names?
> >> Keep in mind the column family name is defined when creating the table.
> >>
> >> Regarding tall vs wide debate:
> >> HBase is first and for most a Key Value database thus reads and writes
> in
> >> the column-value level. So it doesn't really care about rows.
> >> But it's not entirely true. Rows come into play in the following
> >> situations:
> >> Splitting a region is per row and not per column, thus a row will be
> saved
> >> as a whole on a region. If you have a really large row, the region size
> >> granularity is dependent on it. It doesn't seem to be the case here.
> >> Put/Delete creates a lock until finished. If you are intensive on
> inserts
> >> to the same row at the same time, thus might be bad for you, keeping
> your
> >> rows slimmer can reduce contention, but again, only if you make a lot
> >> concurrent modifications to the same row.
> >>
> >
> > I expect batches of Put/Delete to the same row to happen by at most one
> > thread at a time based on user's current behavior. So locking shouldn't
> be
> > an issue. However, not sure if the saving row to a region with enough
> space
> > topic is really an issue I need to worry about (probably because I just
> > don't know much about it yet).
> >
> >
> >> Filtering - if you need a filter which need all the row (there is a
> method
> >> you override in Filter to mark that) than a far row will be more memory
> >> intensive. If you needed only 1/5 of your row, than maybe splitting it
> to 5
> >> rows to begin with would have made a better schema design in terms of
> >> memory and I/O.
> >>
> >
> > Currently, my access pattern is to get all data for a given row. Its
> > possible in the future we may want to apply (family/qualifier) filters.
> > There is a lot of uncertainty on use cases (client side) at this point
> > which is why I am not entirely sure on how things will look months from
> > now. I am not sure I follow this statement
> >
> > "if you need a filter which need all the row (there is a method you
> > override in Filter to mark that) than a far row will be more memory
> > intensive."
> >
> > Can you please explain? Thank you for these suggestions btw, good food
> for
> > thought!
> >
> >
> >>
> >> On Wednesday, July 3, 2013, Aji Janis wrote:
> >>
> >>> I have a major typo in the question so I apologize. I meant to say 5
> >>> families with 1000+ qualifiers each.
> >>>
> >>> Lets work with an example, (not the greatest example here but still).
> >> Lets
> >>> say we have a Genre Class like this:
> >>>
> >>> Class HistoryBooks{
> >>>
> >>> ArrayList<Books> author1;
> >>> ArrayList<Books> author2;
> >>> ArrayList<Books> author3;
> >>> ArrayList<Books> author4;
> >>> ArrayList<Books> author5;
> >>>
> >>> ...}
> >>>
> >>> Each author is a column family (lets say we only allow 5 authors per
> >>> <T>Book class. Book per author ends up being the qualifier. In this
> >> case, I
> >>> know I have a max family count but my qualifiers have no upper limit.
> So
> >> is
> >>> this scenario a case for tall or wide table? Why? Thank you.
> >>>
> >>>
> >>> On Tue, Jul 2, 2013 at 9:56 AM, Bryan Beaudreault
> >>> <bbeaudreault@hubspot.com> wrote:
> >>>
> >>>> If they are accessed mostly together they should all be a single
> column
> >>>> family. The key with tall or wide is based on the total byte size of
> >> each
> >>>> KeyValue. Your cells would need to be quite large for 50 to become a
> >>>> problem. I still would recommend using a single CF though.
> >>>> —
> >>>> Sent from iPhone
> >>
>
>

Re: When to expand vertically vs. horizontally in Hbase

Posted by Michael Segel <mi...@hotmail.com>.
Why do you have so many column families (CFs)?

It's not a question of physical limitations, but more an issue of data design.

There aren't that many really good examples where you would need more than a handful of CFs.

When I teach or lecture, the example I use is an order entry system, where you would have the same key on order entry, pick slips, shipping, and invoice.

That's probably the best example of where CFs come into play.

I'd suggest that you go back and rethink the design if you have more than a handful.
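
A rough sketch of the order entry example, written as a toy model in plain Java rather than against the HBase client API. One row key (the order ID) carries a small, fixed set of column families, and each family can be read without touching the others. The class name OrderRow, the family names, and the qualifiers are all assumptions made up for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy model of one row: family -> (qualifier -> value). Not the HBase client API. */
final class OrderRow {
    // Hypothetical family names; the set is small and fixed in the table schema.
    static final String[] FAMILIES = {"order", "pick", "ship", "invoice"};

    private final String rowKey;
    private final Map<String, Map<String, String>> families = new TreeMap<>();

    OrderRow(String rowKey) {
        this.rowKey = rowKey;
        for (String family : FAMILIES) {
            families.put(family, new TreeMap<>());
        }
    }

    /** Writes one cell; rejects families that were not declared up front. */
    void put(String family, String qualifier, String value) {
        Map<String, String> cells = families.get(family);
        if (cells == null) {
            throw new IllegalArgumentException("unknown column family: " + family);
        }
        cells.put(qualifier, value);
    }

    /** Reads a single family; the other families are not consulted. */
    Map<String, String> getFamily(String family) {
        return families.get(family);
    }

    String rowKey() {
        return rowKey;
    }

    public static void main(String[] args) {
        OrderRow row = new OrderRow("order-000123");
        row.put("order", "customer", "ACME");
        row.put("ship", "carrier", "UPS");
        row.put("invoice", "total", "99.50");
        System.out.println(row.rowKey() + " ship -> " + row.getFamily("ship"));
    }
}
```

The point of the model is that the families map to distinct stages of the same entity under the same key, which is a different situation from using a family per author.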



On Jul 5, 2013, at 8:53 AM, Aji Janis <aj...@gmail.com> wrote:

> Asaf,
> 
> I am using the Genre/Author stuff as an example but yes at the moment I
> only have 5 column families. However, over time I may have more (no upper
> limit decided that this point). See below for more responses
> 
> 
> On Wed, Jul 3, 2013 at 3:42 PM, Asaf Mesika <as...@gmail.com> wrote:
> 
>> Do you have only 5 static author names?
>> Keep in mind the column family name is defined when creating the table.
>> 
>> Regarding tall vs wide debate:
>> HBase is first and for most a Key Value database thus reads and writes in
>> the column-value level. So it doesn't really care about rows.
>> But it's not entirely true. Rows come into play in the following
>> situations:
>> Splitting a region is per row and not per column, thus a row will be saved
>> as a whole on a region. If you have a really large row, the region size
>> granularity is dependent on it. It doesn't seem to be the case here.
>> Put/Delete creates a lock until finished. If you are intensive on inserts
>> to the same row at the same time, thus might be bad for you, keeping your
>> rows slimmer can reduce contention, but again, only if you make a lot
>> concurrent modifications to the same row.
>> 
> 
> I expect batches of Put/Delete to the same row to happen by at most one
> thread at a time based on user's current behavior. So locking shouldn't be
> an issue. However, not sure if the saving row to a region with enough space
> topic is really an issue I need to worry about (probably because I just
> don't know much about it yet).
> 
> 
>> Filtering - if you need a filter which need all the row (there is a method
>> you override in Filter to mark that) than a far row will be more memory
>> intensive. If you needed only 1/5 of your row, than maybe splitting it to 5
>> rows to begin with would have made a better schema design in terms of
>> memory and I/O.
>> 
> 
> Currently, my access pattern is to get all data for a given row. Its
> possible in the future we may want to apply (family/qualifier) filters.
> There is a lot of uncertainty on use cases (client side) at this point
> which is why I am not entirely sure on how things will look months from
> now. I am not sure I follow this statement
> 
> "if you need a filter which need all the row (there is a method you
> override in Filter to mark that) than a far row will be more memory
> intensive."
> 
> Can you please explain? Thank you for these suggestions btw, good food for
> thought!
> 
> 
>> 
>> On Wednesday, July 3, 2013, Aji Janis wrote:
>> 
>>> I have a major typo in the question so I apologize. I meant to say 5
>>> families with 1000+ qualifiers each.
>>> 
>>> Lets work with an example, (not the greatest example here but still).
>> Lets
>>> say we have a Genre Class like this:
>>> 
>>> Class HistoryBooks{
>>> 
>>> ArrayList<Books> author1;
>>> ArrayList<Books> author2;
>>> ArrayList<Books> author3;
>>> ArrayList<Books> author4;
>>> ArrayList<Books> author5;
>>> 
>>> ...}
>>> 
>>> Each author is a column family (lets say we only allow 5 authors per
>>> <T>Book class. Book per author ends up being the qualifier. In this
>> case, I
>>> know I have a max family count but my qualifiers have no upper limit. So
>> is
>>> this scenario a case for tall or wide table? Why? Thank you.
>>> 
>>> 
>>> On Tue, Jul 2, 2013 at 9:56 AM, Bryan Beaudreault
>>> <bbeaudreault@hubspot.com> wrote:
>>> 
>>>> If they are accessed mostly together they should all be a single column
>>>> family. The key with tall or wide is based on the total byte size of
>> each
>>>> KeyValue. Your cells would need to be quite large for 50 to become a
>>>> problem. I still would recommend using a single CF though.
>>>> —
>>>> Sent from iPhone
>> 


Re: When to expand vertically vs. horizontally in Hbase

Posted by Aji Janis <aj...@gmail.com>.
Asaf,

I am using the Genre/Author stuff as an example, but yes, at the moment I
only have 5 column families. However, over time I may have more (no upper
limit decided at this point). See below for more responses.


On Wed, Jul 3, 2013 at 3:42 PM, Asaf Mesika <as...@gmail.com> wrote:

> Do you have only 5 static author names?
> Keep in mind the column family name is defined when creating the table.
>
> Regarding tall vs wide debate:
> HBase is first and for most a Key Value database thus reads and writes in
> the column-value level. So it doesn't really care about rows.
> But it's not entirely true. Rows come into play in the following
> situations:
> Splitting a region is per row and not per column, thus a row will be saved
> as a whole on a region. If you have a really large row, the region size
> granularity is dependent on it. It doesn't seem to be the case here.
> Put/Delete creates a lock until finished. If you are intensive on inserts
> to the same row at the same time, thus might be bad for you, keeping your
> rows slimmer can reduce contention, but again, only if you make a lot
> concurrent modifications to the same row.
>

I expect batches of Put/Delete to the same row to come from at most one
thread at a time, based on users' current behavior, so locking shouldn't be
an issue. However, I'm not sure whether the point about a whole row being
saved on one region is something I need to worry about (probably because I
just don't know much about it yet).


> Filtering - if you need a filter which need all the row (there is a method
> you override in Filter to mark that) than a far row will be more memory
> intensive. If you needed only 1/5 of your row, than maybe splitting it to 5
> rows to begin with would have made a better schema design in terms of
> memory and I/O.
>

Currently, my access pattern is to get all data for a given row. It's
possible that in the future we may want to apply (family/qualifier) filters.
There is a lot of uncertainty about use cases (client side) at this point,
which is why I am not entirely sure how things will look months from
now. I am not sure I follow this statement:

"if you need a filter which need all the row (there is a method you
override in Filter to mark that) than a far row will be more memory
intensive."

Can you please explain? Thank you for these suggestions btw, good food for
thought!


>
> On Wednesday, July 3, 2013, Aji Janis wrote:
>
> > I have a major typo in the question so I apologize. I meant to say 5
> > families with 1000+ qualifiers each.
> >
> > Lets work with an example, (not the greatest example here but still).
> Lets
> > say we have a Genre Class like this:
> >
> > Class HistoryBooks{
> >
> >  ArrayList<Books> author1;
> >  ArrayList<Books> author2;
> >  ArrayList<Books> author3;
> >  ArrayList<Books> author4;
> >  ArrayList<Books> author5;
> >
> > ...}
> >
> > Each author is a column family (lets say we only allow 5 authors per
> > <T>Book class. Book per author ends up being the qualifier. In this
> case, I
> > know I have a max family count but my qualifiers have no upper limit. So
> is
> > this scenario a case for tall or wide table? Why? Thank you.
> >
> >
> > On Tue, Jul 2, 2013 at 9:56 AM, Bryan Beaudreault
> > <bbeaudreault@hubspot.com> wrote:
> >
> > > If they are accessed mostly together they should all be a single column
> > > family. The key with tall or wide is based on the total byte size of
> each
> > > KeyValue. Your cells would need to be quite large for 50 to become a
> > > problem. I still would recommend using a single CF though.
> > > —
> > > Sent from iPhone
>

Re: When to expand vertically vs. horizontally in Hbase

Posted by Asaf Mesika <as...@gmail.com>.
Do you have only 5 static author names?
Keep in mind that column family names are defined when creating the table.

Regarding the tall vs. wide debate:
HBase is first and foremost a key-value database, so reads and writes happen
at the column-value level and it doesn't really care about rows.
But that's not entirely true. Rows come into play in the following situations:
Splitting - a region is split per row and not per column, so a row is always
saved as a whole on one region. If you have a really large row, the region
size granularity depends on it. That doesn't seem to be the case here.
Locking - a Put/Delete holds a lock until it finishes. If you do intensive
inserts to the same row at the same time, this might be bad for you. Keeping
your rows slimmer can reduce contention, but again, only if you make a lot of
concurrent modifications to the same row.
Filtering - if you need a filter which needs the whole row (there is a method
you override in Filter to mark that), then a fat row will be more memory
intensive. If you need only 1/5 of your row, then maybe splitting it into 5
rows to begin with would make a better schema design in terms of memory and
I/O.
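
To make the filtering point concrete, here is a toy model in plain Java. It is not the HBase Filter API (the method referred to there is, as far as I know, hasFilterRow()); it only counts how many cells have to be held at once. The class RowFilterCost and its method names are hypothetical, and the cell counts are made-up numbers.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model: peak number of cells buffered while filtering one row. Not the HBase API. */
final class RowFilterCost {
    /** A per-cell filter decides on each cell as it streams by, so it holds one cell at a time. */
    static int peakCellsPerCellFilter(List<String> rowCells) {
        return rowCells.isEmpty() ? 0 : 1;
    }

    /** A whole-row filter must collect every cell of the row before it can decide. */
    static int peakCellsWholeRowFilter(List<String> rowCells) {
        List<String> buffer = new ArrayList<>();
        int peak = 0;
        for (String cell : rowCells) {
            buffer.add(cell);
            peak = Math.max(peak, buffer.size());
        }
        return peak;
    }

    /** Builds a row with the given number of placeholder cells. */
    static List<String> row(int cells) {
        List<String> result = new ArrayList<>(cells);
        for (int i = 0; i < cells; i++) {
            result.add("cell-" + i);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> fatRow = row(5000);   // one wide row holding everything
        List<String> slimRow = row(1000);  // one of five rows after splitting
        System.out.println("whole-row filter, fat row:  " + peakCellsWholeRowFilter(fatRow));
        System.out.println("whole-row filter, slim row: " + peakCellsWholeRowFilter(slimRow));
        System.out.println("per-cell filter, fat row:   " + peakCellsPerCellFilter(fatRow));
    }
}
```

In this model the whole-row filter's buffer grows with the width of the row, which is why splitting a fat row into several slimmer ones helps when only part of the data is needed.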

On Wednesday, July 3, 2013, Aji Janis wrote:

> I have a major typo in the question so I apologize. I meant to say 5
> families with 1000+ qualifiers each.
>
> Lets work with an example, (not the greatest example here but still). Lets
> say we have a Genre Class like this:
>
> Class HistoryBooks{
>
>  ArrayList<Books> author1;
>  ArrayList<Books> author2;
>  ArrayList<Books> author3;
>  ArrayList<Books> author4;
>  ArrayList<Books> author5;
>
> ...}
>
> Each author is a column family (lets say we only allow 5 authors per
> <T>Book class. Book per author ends up being the qualifier. In this case, I
> know I have a max family count but my qualifiers have no upper limit. So is
> this scenario a case for tall or wide table? Why? Thank you.
>
>
> On Tue, Jul 2, 2013 at 9:56 AM, Bryan Beaudreault
> <bbeaudreault@hubspot.com> wrote:
>
> > If they are accessed mostly together they should all be a single column
> > family. The key with tall or wide is based on the total byte size of each
> > KeyValue. Your cells would need to be quite large for 50 to become a
> > problem. I still would recommend using a single CF though.
> > —
> > Sent from iPhone
> >
> > On Tue, Jul 2, 2013 at 9:33 AM, Aji Janis <aji1705@gmail.com> wrote:
> >
> > > The section on Rows vs. Columns at
> > > http://hbase.apache.org/book/schema.smackdown.html talks about
> expanding
> > > horizontally vs. vertically.
> > > Can someone please explain to me when to choose rows vs. columns. The
> > > sections reads, "To be clear, this guideline is in the context is in
> > > extremely wide cases, not in the standard use-case where one needs to
> > store
> > > a few dozen or hundred columns" so if I had 5 column families with 10
> > > qualifiers each, accessed mostly together is this a case for wider or
> > > taller table? Thanks for any help in advance.
> >
>

Re: When to expand vertically vs. horizontally in Hbase

Posted by Aji Janis <aj...@gmail.com>.
I have a major typo in the question so I apologize. I meant to say 5
families with 1000+ qualifiers each.

Let's work with an example (not the greatest example here, but still). Let's
say we have a Genre class like this:

class HistoryBooks {

    ArrayList<Books> author1;
    ArrayList<Books> author2;
    ArrayList<Books> author3;
    ArrayList<Books> author4;
    ArrayList<Books> author5;

    ...
}

Each author is a column family (let's say we only allow 5 authors per
<T>Book class). Each book per author ends up being a qualifier. In this case, I
know I have a max family count but my qualifiers have no upper limit. So is
this scenario a case for a tall or a wide table? Why? Thank you.
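
The two layouts in question can be sketched roughly as follows, using sorted maps in plain Java as a stand-in for a table (this is not the HBase client API). The "|" delimiter, the "author:book" column naming, and the class TallVsWide are assumptions for illustration only.

```java
import java.util.SortedMap;
import java.util.TreeMap;

/** Toy comparison of a wide row and a tall table for the genre/author/book example. */
final class TallVsWide {
    /** Wide: one row per genre; every book is a column named "authorN:bookM". */
    static SortedMap<String, String> wideRow(int authors, int booksPerAuthor) {
        SortedMap<String, String> columns = new TreeMap<>();
        for (int a = 1; a <= authors; a++) {
            for (int b = 1; b <= booksPerAuthor; b++) {
                columns.put("author" + a + ":book" + b, "title-" + a + "-" + b);
            }
        }
        return columns;
    }

    /** Tall: one row per (genre, author, book); the row key carries all three parts. */
    static SortedMap<String, String> tallTable(String genre, int authors, int booksPerAuthor) {
        SortedMap<String, String> rows = new TreeMap<>();
        for (int a = 1; a <= authors; a++) {
            for (int b = 1; b <= booksPerAuthor; b++) {
                rows.put(genre + "|author" + a + "|book" + b, "title-" + a + "-" + b);
            }
        }
        return rows;
    }

    /** Returns only the rows whose key starts with the prefix, like a bounded range scan. */
    static SortedMap<String, String> prefixScan(SortedMap<String, String> table, String prefix) {
        return table.subMap(prefix, prefix + Character.MAX_VALUE);
    }

    public static void main(String[] args) {
        SortedMap<String, String> wide = wideRow(5, 1000);
        SortedMap<String, String> tall = tallTable("history", 5, 1000);
        System.out.println("columns in the single wide row: " + wide.size());
        System.out.println("rows scanned for one author:    "
                + prefixScan(tall, "history|author2|").size());
    }
}
```

In the tall layout the number of books per author only adds rows, and a read for one author is a range over a key prefix; in the wide layout the same growth makes a single row fatter.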


On Tue, Jul 2, 2013 at 9:56 AM, Bryan Beaudreault
<bb...@hubspot.com>wrote:

> If they are accessed mostly together they should all be a single column
> family. The key with tall or wide is based on the total byte size of each
> KeyValue. Your cells would need to be quite large for 50 to become a
> problem. I still would recommend using a single CF though.
> —
> Sent from iPhone
>
> On Tue, Jul 2, 2013 at 9:33 AM, Aji Janis <aj...@gmail.com> wrote:
>
> > The section on Rows vs. Columns at
> > http://hbase.apache.org/book/schema.smackdown.html talks about expanding
> > horizontally vs. vertically.
> > Can someone please explain to me when to choose rows vs. columns. The
> > sections reads, "To be clear, this guideline is in the context is in
> > extremely wide cases, not in the standard use-case where one needs to
> store
> > a few dozen or hundred columns" so if I had 5 column families with 10
> > qualifiers each, accessed mostly together is this a case for wider or
> > taller table? Thanks for any help in advance.
>

Re: When to expand vertically vs. horizontally in Hbase

Posted by Bryan Beaudreault <bb...@hubspot.com>.
If they are accessed mostly together, they should all be in a single column family. The choice between tall and wide is based on the total byte size of each KeyValue. Your cells would need to be quite large for 50 of them to become a problem. I would still recommend using a single CF though.
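
As a back-of-the-envelope check on the byte-size point, the sketch below estimates the serialized size of a row from the classic KeyValue layout: 4-byte key length, 4-byte value length, 2-byte row length, 1-byte family length, 8-byte timestamp, 1-byte type, plus the row key, family, qualifier, and value bytes. It ignores tags, compression, and block encoding, and the lengths used in main are made-up numbers, so treat it as an approximation only.

```java
/** Rough size estimate for cells in the classic KeyValue layout. Not an HBase API. */
final class KeyValueSize {
    // key length (4) + value length (4) + row length (2)
    // + family length (1) + timestamp (8) + key type (1)
    static final int FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1;

    /** Bytes for one cell; row key, family, and qualifier are repeated in every cell. */
    static long keyValueBytes(int rowKeyLen, int familyLen, int qualifierLen, int valueLen) {
        return (long) FIXED_OVERHEAD + rowKeyLen + familyLen + qualifierLen + valueLen;
    }

    /** Bytes for a row made of identical-sized cells. */
    static long rowBytes(int cells, int rowKeyLen, int familyLen, int qualifierLen, int valueLen) {
        return cells * keyValueBytes(rowKeyLen, familyLen, qualifierLen, valueLen);
    }

    public static void main(String[] args) {
        // Assumed sizes: 16-byte row key, 1-byte family, 8-byte qualifier, 100-byte value.
        System.out.println("50 cells:   " + rowBytes(50, 16, 1, 8, 100) + " bytes");
        System.out.println("5000 cells: " + rowBytes(5000, 16, 1, 8, 100) + " bytes");
    }
}
```

With these assumed lengths each cell is 145 bytes, so 50 cells come to 7,250 bytes and 5,000 cells to 725,000 bytes. Both are small next to typical region sizes unless the values themselves are large.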
—
Sent from iPhone

On Tue, Jul 2, 2013 at 9:33 AM, Aji Janis <aj...@gmail.com> wrote:

> The section on Rows vs. Columns at
> http://hbase.apache.org/book/schema.smackdown.html talks about expanding
> horizontally vs. vertically.
> Can someone please explain to me when to choose rows vs. columns. The
> sections reads, "To be clear, this guideline is in the context is in
> extremely wide cases, not in the standard use-case where one needs to store
> a few dozen or hundred columns" so if I had 5 column families with 10
> qualifiers each, accessed mostly together is this a case for wider or
> taller table? Thanks for any help in advance.

Re: When to expand vertically vs. horizontally in Hbase

Posted by Michael Segel <mi...@hotmail.com>.
Not offhand.

But it's something that I think I could cobble together over the next couple of days if my wife runs out of projects for me to do around the house. ;-)


On Jul 3, 2013, at 12:57 PM, Stack <st...@duboce.net> wrote:

> On Wed, Jul 3, 2013 at 7:08 AM, Michael Segel <mi...@hotmail.com>wrote:
> 
>> Really a bad title  for the section.
>> 
>> Schema Smackdown?  Really?
>> 
>> 6.10.1 isn't really valid is it? Rows version versions?
>> IMHO it should be Columns versus versions.  (Do you put a timestamp in the
>> column qualifier name versus having an enormous number of versions allowed?)
>> 
>> There's more, but I tend to be very conservative to my approach to schema
>> design, so what do I know? ;-)
>> 
> 
> Do you have a patch for the refguide Michael to improve the section?
> Thanks,
> St.Ack


Re: When to expand vertically vs. horizontally in Hbase

Posted by Stack <st...@duboce.net>.
On Wed, Jul 3, 2013 at 7:08 AM, Michael Segel <mi...@hotmail.com>wrote:

> Really a bad title  for the section.
>
> Schema Smackdown?  Really?
>
> 6.10.1 isn't really valid is it? Rows version versions?
> IMHO it should be Columns versus versions.  (Do you put a timestamp in the
> column qualifier name versus having an enormous number of versions allowed?)
>
> There's more, but I tend to be very conservative to my approach to schema
> design, so what do I know? ;-)
>

Do you have a patch for the refguide Michael to improve the section?
Thanks,
St.Ack

Re: When to expand vertically vs. horizontally in Hbase

Posted by Michael Segel <mi...@hotmail.com>.
Really a bad title for the section.

Schema Smackdown? Really?

6.10.1 isn't really valid, is it? Rows versus versions?
IMHO it should be Columns versus versions. (Do you put a timestamp in the column qualifier name versus having an enormous number of versions allowed?)

There's more, but I tend to be very conservative in my approach to schema design, so what do I know? ;-)

On Jul 2, 2013, at 8:32 AM, Aji Janis <aj...@gmail.com> wrote:

> http://hbase.apache.org/book/schema.smackdown.html