Posted to user@hbase.apache.org by Jason <ur...@gmail.com> on 2011/02/11 01:55:00 UTC

Parent/child relation - go vertical, horizontal, or many tables?

Hi all,

Let's say I have two entities, Parent and Child. One parent can have many children (from hundreds to tens of millions).
A child can only belong to one Parent.

Typical queries are:
-Fetch all children from a single parent
-Find a few children by their keys or values from a single parent
-Update a single child by child key and its parent key

And there are no cross-parent queries.

I am trying to figure out which schema approach is better from a performance/maintenance perspective:

1. One table with one Parent per row. Row key is a parent id. Children are stored in a single family each under separate qualifier (child id). Would it even work assuming all children may not fit in memory? 

2. One table. Compound row key parent id/child id. One child per row. 

3. Many tables - one per parent. Row key is a child id.

Thanks!

Re: Parent/child relation - go vertical, horizontal, or many tables?

Posted by Andrey Stepachev <oc...@gmail.com>.
In such a case I think you can use a tall table with parent:child keys and
filters or range scans to get the children.

Your queries will be:
-Fetch all children from a single parent

scan [parent:0, parent+1:0)

-Find a few children by their keys or values from a single parent

scan [parent:min_of_child_keys, parent:max_of_child_key + 1) + filterset (or
custom hash filter)
If there are too many keys, you can use HTable.getRegionLocation to split your
children into parallel scans on different regions.

-Update a single child by child key and its parent key

easy (in all cases, a simple put, or get+put if it is a true update and not an
overwrite)
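
The scan ranges above can be sketched without any HBase dependency. This is only an illustration of the key layout, not client code: the fixed-width big-endian encoding and the helper names (row_key, parent_scan_range, children_scan_range) are assumptions of mine, chosen so that the byte order of the keys matches the numeric order of the ids. It also assumes ids are unsigned 64-bit integers below the maximum value, so that "+ 1" does not overflow.

```python
import struct


def row_key(parent_id, child_id):
    """Compound key: 8-byte big-endian parent id followed by 8-byte child id.

    Fixed-width big-endian encoding makes byte-wise (lexicographic) order
    agree with numeric order, which is what a range scan relies on.
    """
    return struct.pack(">QQ", parent_id, child_id)


def parent_scan_range(parent_id):
    """Half-open range [parent:0, parent+1:0) covering all children of a parent."""
    return row_key(parent_id, 0), row_key(parent_id + 1, 0)


def children_scan_range(parent_id, child_ids):
    """Half-open range [parent:min, parent:max+1) bounding the wanted children.

    Rows inside the range that are not wanted still need a filter on top.
    """
    return (row_key(parent_id, min(child_ids)),
            row_key(parent_id, max(child_ids) + 1))


def in_range(key, start, stop):
    """True if key falls in the half-open byte range [start, stop)."""
    return start <= key < stop
```

The start key is inclusive and the stop key is exclusive, which is why the stop key is built from parent+1 (or max child + 1) rather than from the last wanted key itself.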



Re: LZO Compression

Posted by Todd Lipcon <to...@cloudera.com>.
Also sounds like you didn't install the libgplcompression.so shared library
which gets built by LZO. You need that on your java library path.

You can use this to build an RPM that puts it all in the right places when
you're using CDH
https://github.com/toddlipcon/hadoop-lzo-packager

-Todd




-- 
Todd Lipcon
Software Engineer, Cloudera

Re: LZO Compression

Posted by Ryan Rawson <ry...@gmail.com>.
The .so has to be the same machine arch as your Java binary, meaning
if you are using 64-bit Java your lib should also be 64-bit.

-ryan
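
One way to check the architecture of a shared library is to read the class byte of its ELF header (byte 4: 1 means 32-bit, 2 means 64-bit). The sketch below does only that; the function name elf_word_size is made up for illustration, and it assumes the library is an ELF file, as on Linux. Compare the result with whether your JVM is 32-bit or 64-bit.

```python
def elf_word_size(path):
    """Return 32 or 64 for an ELF file, based on the EI_CLASS byte of its header."""
    with open(path, "rb") as f:
        header = f.read(5)
    # Every ELF file starts with the magic bytes 0x7f 'E' 'L' 'F'.
    if len(header) < 5 or header[:4] != b"\x7fELF":
        raise ValueError("%s is not an ELF file" % path)
    ei_class = header[4]
    if ei_class == 1:
        return 32
    if ei_class == 2:
        return 64
    raise ValueError("%s has an unknown ELF class %d" % (path, ei_class))
```

Usage would look like elf_word_size("/path/to/libgplcompression.so"), where the path is a placeholder for wherever the library was actually installed.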


LZO Compression

Posted by Peter Haidinyak <ph...@local.com>.
HBase version: 0.89.20100924+28
Hadoop version: 0.20.2+737

Howdy,
   My boss compiled kevinweil-hadoop-lzo-0e70051 and created a jar 'hadoop-lzo-0.4.9.jar'.
He also gave me an rpm 'lzo-2.04-1.el4.rf.i386.rpm' which is installed into /usr/local/lib.
I have LD_LIBRARY_PATH pointing to this directory plus I added it to the /etc/ld.so.conf file.

When I run HBase and create a table that uses LZO compression I get the following error.

2011-02-11 10:50:33,317 ERROR com.hadoop.compression.lzo.GPLNativeCodeLoader: Could not load native gpl library
java.lang.UnsatisfiedLinkError: no gplcompression in java.library.path


My questions are

1.	Are the two items I was given compatible?
2.	If so, what could I be doing wrong?

Thanks

-Pete


Re: Parent/child relation - go vertical, horizontal, or many tables?

Posted by Jason <ur...@gmail.com>.
Thank you all for the great insight. Based on your thoughts I am going to try a hybrid approach: split the children into buckets based on id range and store one bucket per row.
The row key would then be parent-id:bucket-id, where bucket-id = child-id / n (integer division) and n is the bucket size, chosen specifically to prevent rows from getting too wide.
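
A minimal sketch of that bucketing, assuming integer ids and a fixed-width big-endian encoding. The bucket size of 1000 and the helper names (bucket_row_key, qualifier, locate, group_by_bucket) are hypothetical and only there to make the addressing concrete:

```python
import struct

BUCKET_SIZE = 1000  # hypothetical n; tune it so rows stay comfortably narrow


def bucket_row_key(parent_id, child_id, n=BUCKET_SIZE):
    """Row key parent-id:bucket-id, where bucket-id = child-id // n."""
    return struct.pack(">QQ", parent_id, child_id // n)


def qualifier(child_id):
    """Column qualifier for one child inside its bucket row."""
    return struct.pack(">Q", child_id)


def locate(parent_id, child_id, n=BUCKET_SIZE):
    """(row key, qualifier) pair that addresses a single child directly."""
    return bucket_row_key(parent_id, child_id, n), qualifier(child_id)


def group_by_bucket(parent_id, child_ids, n=BUCKET_SIZE):
    """Group wanted child ids by bucket row so each row is fetched only once."""
    rows = {}
    for cid in child_ids:
        rows.setdefault(bucket_row_key(parent_id, cid, n), []).append(cid)
    return rows
```

With n = 1 every child gets its own row, which is the compound-key layout (option 2); a very large n approaches one row per parent (option 1).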



On Feb 11, 2011, at 11:00 PM, Ryan Rawson <ry...@gmail.com> wrote:

> If you dont think a row would get beyond thousands of columns I'd go
> with wide columns.  Once you get to 10k, 100k, or millions, things
> might get a little weird.  Performance on huge rows is difficult
> because we have to materialize the entire row at a time.  There are
> options in scan to return partial rows though.  Also a region will
> eventually become a single row region unable to split.
> 
> But wide columns arent in general to be avoided, just if you cant
> predict the ultimate width.
> 
> -ryan
> 
> On Fri, Feb 11, 2011 at 12:59 PM, Michael Segel
> <mi...@hotmail.com> wrote:
>> 
>> Jonathan,
>> Thanks for the response.
>>> The fact that a row cannot cross a region boundary is a
>> consideration, but unless your rows will be many gigabytes each, I don't
>>  think this is that important.  Having to cross a region boundary to
>> fulfill the "get all children" query would be my primary worry.
>> 
>> That would be an issue if you have a tall table with many rows. Assuming you had enough children to break the wide row and the children were relatively big...
>> 
>>> Now besides those considerations above, the other two queries you
>> want (parent-child point lookups and parent-child additions) are
>> virtually identical in performance on the server-side starting with
>> HBase 0.90 and beyond.  We have the same block-seeking optimizations in
>> both schemas for the read case, and the write case is identical in both.
>> 
>> This is interesting.
>> 
>> So essentially the pat response these days is either "... it depends..." or "YMMV".
>> 
>> Because the OP didn't really say how wide or how frequent he would have wide rows... I'd still lean to wide rows...
>> But it is good to know about the improvements in 0.90
>> 
>> Thx
>> 
>> -Mike
>> 
>> 
>>> From: jgray@fb.com
>>> To: user@hbase.apache.org
>>> Subject: RE: Parent/child relation - go vertical, horizontal, or many tables?
>>> Date: Fri, 11 Feb 2011 20:48:51 +0000
>>> 
>>> Just to chime in with my usual take on this (seems like the tall vs. wide discussion happens every few weeks...)
>>> 
>>> For "get all children of a parent", doing a get() on the wide table vs. doing a scan() on the tall table (as long as you set scanner caching appropriately) will be almost identical.  I wouldn't expect any difference in performance if you are properly tuning parameters *EXCEPT* that today a Scan will always require more than one RPC because the API is such that you need to open the scanner first, and then do next() on it, and then close() it.  This is a current API limitation but we could implement an optimization to allow for single-RPC scans if the query can be fulfilled in a single response (start row, stop row, and scanner caching set appropriately).  A Get, on the server-side, does this exact same thing but in a single RPC (it opens a scanner, next() on it, and then close() it).
>>> 
>>> The fact that a row cannot cross a region boundary is a consideration, but unless your rows will be many gigabytes each, I don't think this is that important.  Having to cross a region boundary to fulfill the "get all children" query would be my primary worry.
>>> 
>>> Now besides those considerations above, the other two queries you want (parent-child point lookups and parent-child additions) are virtually identical in performance on the server-side starting with HBase 0.90 and beyond.  We have the same block-seeking optimizations in both schemas for the read case, and the write case is identical in both.
>>> 
>>> The only other thing to consider is what if all the children of one parent can't fit in memory at the same time.  This is not at all related to a region getting too big (there is no requirement of fitting a  region into memory) but is a consideration for reading it in a single RPC (both on the server-side and also receiving it in your client).  However, you would deal with this the same way in the tall or wide case.  In the tall case, you would appropriately set the scanner caching number.  In the wide case, you would set the intra-row scan limit.  In this case, you will be forced to use the Scan API regardless because if you need multiple RPCs for a single row, you need the Scanner next() semantics.
>>> 
>>> Many times, this decision comes down to a matter of personal preference.  I lean towards wide tables these days unless I expect extremely high numbers of children (so I want to split across regions and RPC requests) and I expect to frequently run the get-all-children query with high numbers of children.
>>> 
>>> JG
>>> 
>>>> -----Original Message-----
>>>> From: Michael Segel [mailto:michael_segel@hotmail.com]
>>>> Sent: Friday, February 11, 2011 12:23 PM
>>>> To: user@hbase.apache.org
>>>> Subject: RE: Parent/child relation - go vertical, horizontal, or many tables?
>>>> 
>>>> 
>>>> David,
>>>> 
>>>> First a caveat... You need to have a realistic notion of the data and its sizes
>>>> when considering your options...
>>>> With respect to the response, Here's what I said:
>>>> -=-
>>>> "With respect to your issue about a row being too large to fit in to memory...
>>>>  This would imply that the row would be too large to fit in to a single region.
>>>> Wouldn't that cause your HBase to die a horrible death?
>>>> 
>>>>  If this really is a potential situation, then you should consider the
>>>> parent_key, child_id compound row key..."
>>>> -=-
>>>> Now a correction. If you insert a row that is larger than a region, the region
>>>> will grow to fit the row and will not split. So until your row exceeds the size of
>>>> available disk... you can do it. So yeah you could fill up memory...
>>>> 
>>>> And that's the only reason why I would recommend option 2 over option 1.
>>>> So how real is this scenario?
>>>> 
>>>> Looking at the 3 stated use cases...  Doing a get() on the parent ID will give
>>>> you the entire set of children for the parent in a single fetch.
>>>> If you limit the columns to either a single column or a set of columns, you are
>>>> still going to be a single get().
>>>> 
>>>> This is going to be faster than doing a scan() on a series of rows starting with
>>>> parent_id and stopping at parent_id+1.
>>>> (At least in theory. I haven't mocked this out and tried it.)
>>>> 
>>>> Again the only advantage of option 2 is if you really are worried about your
>>>> data size blowing you out of the water.
>>>> If you do find yourself using a lot of memory to fetch your edge cases, then
>>>> you'd be better off with the second option.
>>>> 
>>>> Here you have the following:
>>>> 
>>>> 1) Fetching all of the children (scan() with a start and stop key)
>>>> 2) Fetching some of the rows... (scan() with a start and stop key and some
>>>> sort of filter);
>>>> 3) Fetching single child (get() using a combination of parent_id, child_id for
>>>> the key.)
>>>> 
>>>> So while you don't have to worry about the size of a row, you do not get the
>>>> same performance that you could with option 1.
>>>> 
>>>> Does that make sense?
>>>> 
>>>> -Mike
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> From: buttler1@llnl.gov
>>>>> To: user@hbase.apache.org
>>>>> Date: Fri, 11 Feb 2011 10:45:14 -0800
>>>>> Subject: RE: Parent/child relation - go vertical, horizontal, or many tables?
>>>>> 
>>>>> Michael,
>>>>> Thanks for the analysis.  The thought process you put into this seems
>>>> useful.  However, following along at home I came to a different conclusion
>>>> than you did.  I would prefer (sol. 2) over (sol. 3) for the reason you mention,
>>>> but I would also strongly prefer (sol. 2) over (sol. 1), also for the reason you
>>>> mention.
>>>>> 
>>>>> So, I don't see how you can not recommend (sol. 2).  It seems like (sol. 1)
>>>> would be very wasteful for use cases (u2) and (u3). The only time it would
>>>> help is in (u1).  And then it doesn't seem obvious to me that a single row is
>>>> better except in cases where there are very few children per parent.
>>>>> 
>>>>> Perhaps if the data is expected to have a power law distribution (fat tail,
>>>> zipfian), a hybrid approach would be better: go with (sol. 1) for any parent
>>>> that has fewer than (say 10) children.  But, after a parent fills up its first 10
>>>> children, start populating rows like (sol. 2).
>>>>> 
>>>>> This would definitely make the client code more complex, so it would only
>>>> make sense if there were huge savings to be had.
>>>>> Maybe a slightly better implementation of the hybrid would be to divide
>>>> the child key space up into buckets so that you can directly address any child,
>>>> but still have fewer calls in retrieving all children.  Then you can adjust your
>>>> bucket size based on your actual use case (with a bucket size of 1 being the
>>>> special case of (sol. 2)).
>>>>> 
>>>>> But the more I think about it, the more I suspect that the added complexity
>>>> will not be worth it, and he should just go with (sol. 2).
>>>>> 
>>>>> Dave
>>>>> 
>>>>> 
>>>>> -----Original Message-----
>>>>> From: Michael Segel [mailto:michael_segel@hotmail.com]
>>>>> Sent: Friday, February 11, 2011 5:51 AM
>>>>> To: user@hbase.apache.org
>>>>> Subject: RE: Parent/child relation - go vertical, horizontal, or many tables?
>>>>> 
>>>>> 
>>>>> Jason,
>>>>> 
>>>>> You have the following constraint:
>>>>> For each child there is one parent. A parent can have more than one child.
>>>>> 
>>>>> While you don't specify size of the child, when a parent can have tens of
>>>> millions, that could become an issue.
>>>>> Assuming that the child is relatively small...
>>>>> 
>>>>> You have 3 use cases: (Scan patterns)
>>>>> 
>>>>>> -Fetch all children from a single parent
>>>>>> -Find a few children by their keys or values from a single parent
>>>>>> -Update a single child by child key and its parent key
>>>>> 
>>>>> Your options...
>>>>> 
>>>>>> 1. One table with one Parent per row. Row key is a parent id.
>>>>> Children are stored in a single family each under separate qualifier
>>>>> (child id). Would it even work assuming all children may not fit in
>>>>> memory?
>>>>>> 
>>>>> While you raise an interesting point, let's look at the schema as a solution.
>>>>> This works well because you can fetch the entire row based on parent key.
>>>>> So all queries are get()s and not scan()s.
>>>>> 
>>>>> You can then pull all of the existing columns where each column represents
>>>> a child.
>>>>> 
>>>>> You can also do a get() of only those columns you want based on child_id as
>>>> the column name.
>>>>> 
>>>>> You can also do a get() or a put of a specific column (child_id) for a given
>>>> parent (row key).
>>>>> 
>>>>> 
>>>>> With respect to your issue about a row being too large to fit in to memory...
>>>>> This would imply that the row would be too large to fit in to a single region.
>>>> Wouldn't that cause your HBase to die a horrible death?
>>>>> 
>>>>> If this really is a potential situation, then you should consider the
>>>> parent_key, child_id compound row key...
>>>>> 
>>>>>> 2. One table. Compound row key parent id/child id. One child per row.
>>>>>> 
>>>>> Based on your use cases, I wouldn't recommend this. While it is a valid
>>>> schema, it is only 'optimal' for your 'Update a single child by child key and its
>>>> parent key'.
>>>>> 
>>>>>> 3. Many tables - one per parent. Row key is a child id.
>>>>> If you have a scenario where a parent has billions+ of children, this
>>>>> could be a valid choice; however, based on what you said (up to tens
>>>>> of millions) and the data set is unique and non-intersecting, you
>>>>> would be better off with a single table. (Too many tables is not a
>>>>> good thing in HBase.)
>>>>> 
>>>>> 
>>>>> HTH
>>>>> 
>>>>> -Mike
>>>>> 
>>>>> 
>>>>>> Subject: Parent/child relation - go vertical, horizontal, or many tables?
>>>>>> From: urgisb@gmail.com
>>>>>> Date: Thu, 10 Feb 2011 16:55:00 -0800
>>>>>> To: user@hbase.apache.org
>>>>>> 
>>>>>> Hi all,
>>>>>> 
>>>>>> Let's say I have two entities Parent and Child. There could be many
>>>>>> children in one parent (from hundreds to tens of millions) A child can only
>>>> belong to one Parent.
>>>>>> 
>>>>>> Typical queries are:
>>>>>> -Fetch all children from a single parent -Find a few children by
>>>>>> their keys or values from a single parent -Update a single child by
>>>>>> child key and it's parent key
>>>>>> 
>>>>>> And there are no cross-parent queries.
>>>>>> 
>>>>>> I am trying to figure out what is better schema approach from
>>>> performance/maintenance perspective:
>>>>>> 
>>>>>> 1. One table with one Parent per row. Row key is a parent id. Children are
>>>> stored in a single family each under separate qualifier (child id). Would it
>>>> even work assuming all children may not fit in memory?
>>>>>> 
>>>>>> 2. One table. Compound row key parent id/child id. One child per row.
>>>>>> 
>>>>>> 3. Many tables - one per parent. Row key is a child id.
>>>>>> 
>>>>>> Thanks!
>>>>> 
>>>> 
>> 

Re: Parent/child relation - go vertical, horizontal, or many tables?

Posted by Ryan Rawson <ry...@gmail.com>.
If you don't think a row would get beyond thousands of columns I'd go
with wide columns.  Once you get to 10k, 100k, or millions, things
might get a little weird.  Performance on huge rows is difficult
because we have to materialize the entire row at a time.  There are
options in scan to return partial rows though.  Also a region will
eventually become a single row region unable to split.

But wide columns aren't in general to be avoided, just if you can't
predict the ultimate width.

-ryan
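
To illustrate what returning partial rows buys you, here is a small sketch that hands out a wide row's cells in chunks of at most `batch` instead of all at once. It is a plain Python model, not HBase code; the function name iter_row_in_batches is made up. In the HBase Java client the corresponding knob is, as far as I know, Scan.setBatch, but that is my reading rather than something stated in this thread.

```python
from itertools import islice


def iter_row_in_batches(cells, batch):
    """Yield lists of at most `batch` cells from an iterable of one row's cells.

    Only one chunk is held in memory at a time, so the whole row never has
    to be materialized by the consumer.
    """
    if batch < 1:
        raise ValueError("batch must be at least 1")
    it = iter(cells)
    while True:
        chunk = list(islice(it, batch))
        if not chunk:
            return
        yield chunk
```

Each yielded chunk stands in for one partial result of the same row; the caller keeps pulling chunks until the row is exhausted.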


RE: Parent/child relation - go vertical, horizontal, or many tables?

Posted by Michael Segel <mi...@hotmail.com>.
Jonathan,
Thanks for the response.
> The fact that a row cannot cross a region boundary is a consideration, but unless your rows will be many gigabytes each, I don't think this is that important.  Having to cross a region boundary to fulfill the "get all children" query would be my primary worry.

That would be an issue if you have a tall table with many rows. Assuming you had enough children to break the wide row and the children were relatively big...

> Now besides those considerations above, the other two queries you want (parent-child point lookups and parent-child additions) are virtually identical in performance on the server-side starting with HBase 0.90 and beyond.  We have the same block-seeking optimizations in both schemas for the read case, and the write case is identical in both.

This is interesting.

So essentially the pat response these days is either "... it depends..." or "YMMV". 

Because the OP didn't really say how wide his rows would be or how often he would have wide rows, I'd still lean toward wide rows...
But it is good to know about the improvements in 0.90.

Thx

-Mike


> From: jgray@fb.com
> To: user@hbase.apache.org
> Subject: RE: Parent/child relation - go vertical, horizontal, or many tables?
> Date: Fri, 11 Feb 2011 20:48:51 +0000
> 
> Just to chime in with my usual take on this (seems like the tall vs. wide discussion happens every few weeks...)
> 
> For "get all children of a parent", doing a get() on the wide table vs. doing a scan() on the tall table (as long as you set scanner caching appropriately) will be almost identical.  I wouldn't expect any difference in performance if you are properly tuning parameters *EXCEPT* that today a Scan will always require more than one RPC because the API is such that you need to open the scanner first, and then do next() on it, and then close() it.  This is a current API limitation but we could implement an optimization to allow for single-RPC scans if the query can be fulfilled in a single response (start row, stop row, and scanner caching set appropriately).  A Get, on the server-side, does this exact same thing but in a single RPC (it opens a scanner, next() on it, and then close() it).
> 
> The fact that a row cannot cross a region boundary is a consideration, but unless your rows will be many gigabytes each, I don't think this is that important.  Having to cross a region boundary to fulfill the "get all children" query would be my primary worry.
> 
> Now besides those considerations above, the other two queries you want (parent-child point lookups and parent-child additions) are virtually identical in performance on the server-side starting with HBase 0.90 and beyond.  We have the same block-seeking optimizations in both schemas for the read case, and the write case is identical in both.
> 
> The only other thing to consider is what if all the children of one parent can't fit in memory at the same time.  This is not at all related to a region getting too big (there is no requirement of fitting a  region into memory) but is a consideration for reading it in a single RPC (both on the server-side and also receiving it in your client).  However, you would deal with this the same way in the tall or wide case.  In the tall case, you would appropriately set the scanner caching number.  In the wide case, you would set the intra-row scan limit.  In this case, you will be forced to use the Scan API regardless because if you need multiple RPCs for a single row, you need the Scanner next() semantics.
> 
> Many times, this decision comes down to a matter of personal preference.  I lean towards wide tables these days unless I expect extremely high numbers of children (so I want to split across regions and RPC requests) and I expect to frequently run the get-all-children query with high numbers of children.
> 
> JG
> 

RE: Parent/child relation - go vertical, horizontal, or many tables?

Posted by Jonathan Gray <jg...@fb.com>.
Just to chime in with my usual take on this (seems like the tall vs. wide discussion happens every few weeks...)

For "get all children of a parent", doing a get() on the wide table vs. doing a scan() on the tall table (as long as you set scanner caching appropriately) will be almost identical.  I wouldn't expect any difference in performance if you are properly tuning parameters *EXCEPT* that today a Scan will always require more than one RPC because the API is such that you need to open the scanner first, and then do next() on it, and then close() it.  This is a current API limitation but we could implement an optimization to allow for single-RPC scans if the query can be fulfilled in a single response (start row, stop row, and scanner caching set appropriately).  A Get, on the server-side, does this exact same thing but in a single RPC (it opens a scanner, next() on it, and then close() it).
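
To make the round-trip arithmetic above concrete, here is a rough Python model. It is only a sketch, not HBase client code: the function names get_rpcs and scan_rpcs are made up for illustration, and the model assumes exactly one RPC to open the scanner, one per next() batch of up to `caching` rows, and one to close it. A real client may issue an additional next() to discover the end of the scan.

```python
import math


def get_rpcs() -> int:
    """A Get is answered in one round trip (open, next, close happen server-side)."""
    return 1


def scan_rpcs(num_rows: int, caching: int) -> int:
    """Round trips for a Scan in this simplified model.

    One call opens the scanner, each next() returns up to `caching` rows,
    and one call closes the scanner.
    """
    if caching < 1:
        raise ValueError("caching must be at least 1")
    # Even an empty result needs one next() call to find that out.
    next_calls = max(1, math.ceil(num_rows / caching))
    return 1 + next_calls + 1


if __name__ == "__main__":
    for caching in (1, 100, 10_000):
        print(f"caching={caching}: get={get_rpcs()} scan={scan_rpcs(10_000, caching)}")
```

Under these assumptions, a scan whose caching covers the whole result still costs three round trips against one for the Get, which is the gap a single-RPC scan optimization would close.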

The fact that a row cannot cross a region boundary is a consideration, but unless your rows will be many gigabytes each, I don't think this is that important.  Having to cross a region boundary to fulfill the "get all children" query would be my primary worry.

Now besides those considerations above, the other two queries you want (parent-child point lookups and parent-child additions) are virtually identical in performance on the server-side starting with HBase 0.90 and beyond.  We have the same block-seeking optimizations in both schemas for the read case, and the write case is identical in both.

The only other thing to consider is what happens if all the children of one parent can't fit in memory at the same time.  This is not at all related to a region getting too big (there is no requirement of fitting a region into memory) but is a consideration for reading it in a single RPC (both on the server-side and also receiving it in your client).  However, you would deal with this the same way in the tall or wide case.  In the tall case, you would appropriately set the scanner caching number.  In the wide case, you would set the intra-row scan limit.  In that situation you will be forced to use the Scan API even for a wide row, because if you need multiple RPCs for a single row, you need the Scanner next() semantics.
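
The point that both schemas handle an oversized parent the same way can be sketched in plain Python. This is an illustration under assumptions, not the HBase API: in_chunks is a hypothetical helper, and its `limit` argument stands in for scanner caching (rows per response) in the tall case and for the intra-row limit (columns per response) in the wide case.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def in_chunks(items: Iterable[T], limit: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `limit` items, holding one chunk at a time."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, limit))
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    # Tall schema: each item is one child row under the parent's key prefix.
    tall_rows = (f"parent7:child{i}" for i in range(7))
    # Wide schema: each item is one child column inside the parent's single row.
    wide_columns = (f"child{i}" for i in range(7))

    for chunk in in_chunks(tall_rows, 3):      # plays the role of scanner caching
        print("rows   ", chunk)
    for chunk in in_chunks(wide_columns, 3):   # plays the role of the intra-row limit
        print("columns", chunk)
```

Either way the client only ever holds one chunk in memory, and it has to keep calling for the next chunk, which is why the wide case also ends up on scanner-style semantics.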

Many times, this decision comes down to a matter of personal preference.  I lean towards wide tables these days unless I expect extremely high numbers of children (so I want to split across regions and RPC requests) and I expect to frequently run the get-all-children query with high numbers of children.

JG

> -----Original Message-----
> From: Michael Segel [mailto:michael_segel@hotmail.com]
> Sent: Friday, February 11, 2011 12:23 PM
> To: user@hbase.apache.org
> Subject: RE: Parent/child relation - go vertical, horizontal, or many tables?
> 
> 
> David,
> 
> First a caveat... You need to have a realistic notion of the data and its sizes
> when considering your options...
> With respect to the response, Here's what I said:
> -=-
> "With respect to your issue about a row being too large to fit in to memory...
>  This would imply that the row would be too large to fit in to a single region.
> Wouldn't that cause your HBase to die a horrible death?
> 
>  If this really is a potential situation, then you should consider the
> parent_key, child_id compound row key..."
> -=-
> Now a correction. If you insert a row that is larger than a region, the region
> will grow to fit the row and will not split. So until your row exceeds the size of
> available disk... you can do it. So yeah you could fill up memory...
> 
> And that's the only reason why I would recommend option 2 over option 1.
> So how real is this scenario?
> 
> Looking at the 3 stated use cases...  Doing a get() on the parent ID will give
> you the entire set of children for the parent in a single fetch.
> If you limit the columns to either a single column or a set of columns, you are
> still going to be a single get().
> 
> This is going to be faster than doing a scan() on a series of rows starting at
> parent_id and stopping at parent_id+1.
> (At least in theory. I haven't mocked this out and tried it.)
> 
> Again the only advantage of option 2 is if you really are worried about your
> data size blowing you out of the water.
> If you do find yourself using a lot of memory to fetch your edge cases, then
> you'd be better off with the second option.
> 
> Here you have the following:
> 
> 1) Fetching all of the children (scan() with a start and stop key)
> 2) Fetching some of the rows... (scan() with a start and stop key and some
> sort of filter);
> 3) Fetching a single child (get() using a combination of parent_id, child_id for
> the key.)
> 
> So while you don't have to worry about the size of a row, you do not get the
> same performance that you could with option 1.
> 
> Does that make sense?
> 
> -Mike
> 
> 
> 
> 
> 

RE: Parent/child relation - go vertical, horizontal, or many tables?

Posted by Michael Segel <mi...@hotmail.com>.
David,

First a caveat... You need to have a realistic notion of the data and its sizes when considering your options...
With respect to the response, here's what I said:
-=-
"With respect to your issue about a row being too large to fit into memory...
This would imply that the row would be too large to fit into a single region. Wouldn't that cause your HBase to die a horrible death?

If this really is a potential situation, then you should consider the parent_key, child_id compound row key..."
-=-
Now a correction. If you insert a row that is larger than a region, the region will grow to fit the row and will not split. So until your row exceeds the size of available disk... you can do it. So yeah you could fill up memory...

And that's the only reason why I would recommend option 2 over option 1.
So how real is this scenario? 

Looking at the 3 stated use cases...  Doing a get() on the parent ID will give you the entire set of children for the parent in a single fetch.
If you limit the columns to either a single column or a set of columns, you are still going to be a single get().

This is going to be faster than doing a scan() on a series of rows starting at parent_id and stopping at parent_id+1.
(At least in theory. I haven't mocked this out and tried it.)

Again the only advantage of option 2 is if you really are worried about your data size blowing you out of the water.
If you do find yourself using a lot of memory to fetch your edge cases, then you'd be better off with the second option.

Here you have the following:

1) Fetching all of the children (scan() with a start and stop key)
2) Fetching some of the rows (scan() with a start and stop key and some sort of filter)
3) Fetching a single child (get() using a combination of parent_id, child_id for the key)
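
The three access patterns above can be sketched with an in-memory stand-in for a sorted key space. This is a minimal Python illustration, not HBase code: TallTable, row_key and parent_range are hypothetical names, and the sketch assumes non-negative integer ids padded to a fixed width so that string order matches numeric order.

```python
import bisect
from typing import Callable, List, Optional, Tuple

WIDTH = 10  # assumed fixed width; keeps lexicographic order equal to numeric order


def row_key(parent_id: int, child_id: int) -> str:
    """Compound key: parent id, separator, child id."""
    return f"{parent_id:0{WIDTH}d}:{child_id:0{WIDTH}d}"


def parent_range(parent_id: int) -> Tuple[str, str]:
    """Half-open [start, stop) range covering every child of one parent."""
    return f"{parent_id:0{WIDTH}d}:", f"{parent_id + 1:0{WIDTH}d}:"


class TallTable:
    """Sorted keys plus a dict of values, standing in for a tall table."""

    def __init__(self) -> None:
        self._keys: List[str] = []
        self._values: dict = {}

    def put(self, parent_id: int, child_id: int, value: str) -> None:
        key = row_key(parent_id, child_id)
        if key not in self._values:
            bisect.insort(self._keys, key)  # keep keys sorted on insert
        self._values[key] = value

    def get(self, parent_id: int, child_id: int) -> Optional[str]:
        return self._values.get(row_key(parent_id, child_id))

    def scan(self, start: str, stop: str,
             keep: Callable[[str, str], bool] = lambda key, value: True):
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)  # stop key is exclusive
        return [(k, self._values[k]) for k in self._keys[lo:hi]
                if keep(k, self._values[k])]


if __name__ == "__main__":
    table = TallTable()
    table.put(7, 1, "a")
    table.put(7, 2, "b")
    table.put(8, 1, "c")
    start, stop = parent_range(7)
    print(table.scan(start, stop))                                 # 1) all children
    print(table.scan(start, stop, keep=lambda k, v: v == "b"))     # 2) filtered
    print(table.get(7, 2))                                         # 3) single child
```

The stop key is simply the next parent's prefix, which is the parent_id+1 trick mentioned earlier in the thread.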

So while you don't have to worry about the size of a row, you do not get the same performance that you could with option 1.

Does that make sense?

-Mike





> From: buttler1@llnl.gov
> To: user@hbase.apache.org
> Date: Fri, 11 Feb 2011 10:45:14 -0800
> Subject: RE: Parent/child relation - go vertical, horizontal, or many tables?
> 
> Michael,
> Thanks for the analysis.  The thought process you put into this seems useful.  However, following along at home I came to a different conclusion than you did.  I would prefer (sol. 2) over (sol. 3) for the reason you mention, but I would also strongly prefer (sol. 2) over (sol. 1), also for the reason you mention.
> 
> So, I don't see how you can not recommend (sol. 2).  It seems like (sol. 1) would be very wasteful for use cases (u2) and (u3). The only time it would help is in (u1).  And then it doesn't seem obvious to me that a single row is better except in cases where there are very few children per parent.
> 
> Perhaps if the data is expected to have a power law distribution (fat tail, zipfian), a hybrid approach would be better: go with (sol. 1) for any parent that has fewer than (say 10) children.  But, after a parent fills up its first 10 children, start populating rows like (sol. 2).
> 
> This would definitely make the client code more complex, so it would only make sense if there were huge savings to be had.
> Maybe a slightly better implementation of the hybrid would be to divide the child key space up into buckets so that you can directly address any child, but still have fewer calls in retrieving all children.  Then you can adjust your bucket size based on your actual use case (with a bucket size of 1 being the special case of (sol. 2)).
> 
> But the more I think about it, the more I suspect that the added complexity will not be worth it, and he should just go with (sol. 2).
> 
> Dave
> 
> 

RE: Parent/child relation - go vertical, horizontal, or many tables?

Posted by "Buttler, David" <bu...@llnl.gov>.
Michael,
Thanks for the analysis.  The thought process you put into this seems useful.  However, following along at home I came to a different conclusion than you did.  I would prefer (sol. 2) over (sol. 3) for the reason you mention, but I would also strongly prefer (sol. 2) over (sol. 1), also for the reason you mention.

So, I don't see how you can not recommend (sol. 2).  It seems like (sol. 1) would be very wasteful for use cases (u2) and (u3). The only time it would help is in (u1).  And then it doesn't seem obvious to me that a single row is better except in cases where there are very few children per parent.

Perhaps if the data is expected to have a power law distribution (fat tail, zipfian), a hybrid approach would be better: go with (sol. 1) for any parent that has fewer than (say 10) children.  But, after a parent fills up its first 10 children, start populating rows like (sol. 2).

This would definitely make the client code more complex, so it would only make sense if there were huge savings to be had.
Maybe a slightly better implementation of the hybrid would be to divide the child key space up into buckets so that you can directly address any child, but still have fewer calls in retrieving all children.  Then you can adjust your bucket size based on your actual use case (with a bucket size of 1 being the special case of (sol. 2)).
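
A rough Python sketch of the bucketing idea, under stated assumptions: child ids are non-negative integers, a bucket is child_id // bucket_size, the row key is the parent id plus the bucket number, and the child id is the column qualifier. The names locate and rows_touched are made up for this illustration and are not part of any HBase API.

```python
from typing import Iterable, Tuple

WIDTH = 10  # assumed fixed width so keys sort in numeric order


def locate(parent_id: int, child_id: int, bucket_size: int) -> Tuple[str, str]:
    """Return (row key, column qualifier) for one child, computed directly."""
    if bucket_size < 1:
        raise ValueError("bucket_size must be at least 1")
    bucket = child_id // bucket_size
    row = f"{parent_id:0{WIDTH}d}:{bucket:0{WIDTH}d}"
    qualifier = f"{child_id:0{WIDTH}d}"
    return row, qualifier


def rows_touched(parent_id: int, child_ids: Iterable[int], bucket_size: int) -> int:
    """Number of distinct rows needed to hold the given children."""
    return len({locate(parent_id, c, bucket_size)[0] for c in child_ids})


if __name__ == "__main__":
    print(locate(7, 25, 10))                 # child 25 lives in bucket 2
    print(rows_touched(7, range(100), 10))   # 100 children spread over 10 rows
    print(rows_touched(7, range(100), 1))    # bucket size 1: one row per child
```

Any child is still directly addressable from its ids alone, while fetching all children touches fewer rows as the bucket size grows.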

But the more I think about it, the more I suspect that the added complexity will not be worth it, and Jason should just go with (sol. 2).

Dave


-----Original Message-----
From: Michael Segel [mailto:michael_segel@hotmail.com] 
Sent: Friday, February 11, 2011 5:51 AM
To: user@hbase.apache.org
Subject: RE: Parent/child relation - go vertical, horizontal, or many tables?


Jason,

You have the following constraint:
For each child there is one parent. A parent can have more than one child.

While you don't specify the size of a child, when a parent can have tens of millions of them, size could become an issue.
Assuming that the child is relatively small...

You have 3 use cases (scan patterns):

> -Fetch all children from a single parent
> -Find a few children by their keys or values from a single parent
> -Update a single child by child key and it's parent key

Your options...

> 1. One table with one Parent per row. Row key is a parent id. Children are stored in a single family each under separate qualifier (child id). Would it even work assuming all children may not fit in memory?
> 
While you raise an interesting point, let's look at the schema as a solution.
This works well because you can fetch the entire row based on parent key.
So all queries are get()s and not scan()s.

You can then pull all of the existing columns where each column represents a child.

You can also do a get() of only those columns you want based on child_id as the column name.

You can also do a get() or a put() of a specific column (child_id) for a given parent (row key).


With respect to your issue about a row being too large to fit into memory...
This would imply that the row would be too large to fit into a single region. Wouldn't that cause your HBase to die a horrible death?

If this really is a potential situation, then you should consider the parent_key, child_id compound row key...

> 2. One table. Compound row key parent id/child id. One child per row. 
> 
Based on your use cases, I wouldn't recommend this. While it is a valid schema, it is only 'optimal' for your 'Update a single child by child key and its parent key'. 

> 3. Many tables - one per parent. Row key is a child id.
If you have a scenario where a parent has billions of children, this could be a valid choice. However, based on what you said (up to tens of millions of children, and the data sets are unique and non-intersecting), you would be better off with a single table. (Too many tables is not a good thing in HBase.)


HTH

-Mike


> Subject: Parent/child relation - go vertical, horizontal, or many tables?
> From: urgisb@gmail.com
> Date: Thu, 10 Feb 2011 16:55:00 -0800
> To: user@hbase.apache.org
> 
> Hi all,
> 
> Let's say I have two entities Parent and Child. There could be many children in one parent (from hundreds to tens of millions)
> A child can only belong to one Parent.
> 
> Typical queries are:
> -Fetch all children from a single parent
> -Find a few children by their keys or values from a single parent
> -Update a single child by child key and it's parent key
> 
> And there are no cross-parent queries.
> 
> I am trying to figure out what is better schema approach from performance/maintenance perspective:
> 
> 1. One table with one Parent per row. Row key is a parent id. Children are stored in a single family each under separate qualifier (child id). Would it even work assuming all children may not fit in memory? 
> 
> 2. One table. Compound row key parent id/child id. One child per row. 
> 
> 3. Many tables - one per parent. Row key is a child id.
> 
> Thanks!

RE: Parent/child relation - go vertical, horizontal, or many tables?

Posted by Michael Segel <mi...@hotmail.com>.
Jason,

You have the following constraint:
For each child there is one parent. A parent can have more than one child.

While you don't specify the size of a child, when a parent can have tens of millions of them, size could become an issue.
Assuming that the child is relatively small...

You have 3 use cases (scan patterns):

> -Fetch all children from a single parent
> -Find a few children by their keys or values from a single parent
> -Update a single child by child key and it's parent key

Your options...

> 1. One table with one Parent per row. Row key is a parent id. Children are stored in a single family each under separate qualifier (child id). Would it even work assuming all children may not fit in memory?
> 
While you raise an interesting point, let's look at the schema as a solution.
This works well because you can fetch the entire row based on parent key.
So all queries are get()s and not scan()s.

You can then pull all of the existing columns where each column represents a child.

You can also do a get() of only those columns you want based on child_id as the column name.

You can also do a get() or a put() of a specific column (child_id) for a given parent (row key).
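
As a rough illustration of these three access patterns, here is a Python sketch in which a nested dictionary stands in for the table (row key -> qualifier -> value). The helper names are made up for the example and are not the HBase client API.

```python
# table[parent_id][child_id] = value: one row per parent, one column per child.
table = {"parent-1": {"child-a": "A", "child-b": "B", "child-c": "C"}}


def get_row(table, parent_id):
    # Fetch every child of a parent: the whole row comes back in one call.
    return dict(table.get(parent_id, {}))


def get_columns(table, parent_id, child_ids):
    # Fetch only the named children; columns that do not exist are simply absent.
    row = table.get(parent_id, {})
    return {c: row[c] for c in child_ids if c in row}


def put_column(table, parent_id, child_id, value):
    # Update a single child without touching its siblings.
    table.setdefault(parent_id, {})[child_id] = value


everything = get_row(table, "parent-1")
subset = get_columns(table, "parent-1", ["child-a", "child-z"])
put_column(table, "parent-1", "child-b", "B2")
```

Note that `get_row` hands the caller the entire row at once, which is where the memory concern for parents with very many children comes from.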


With respect to your issue about a row being too large to fit into memory...
This would imply that the row would be too large to fit into a single region. Wouldn't that cause your HBase to die a horrible death?

If this really is a potential situation, then you should consider the parent_key, child_id compound row key...

> 2. One table. Compound row key parent id/child id. One child per row. 
> 
Based on your use cases, I wouldn't recommend this. While it is a valid schema, it is only 'optimal' for your 'Update a single child by child key and its parent key'. 

> 3. Many tables - one per parent. Row key is a child id.
If you have a scenario where a parent has billions of children, this could be a valid choice. However, based on what you said (up to tens of millions of children, and the data sets are unique and non-intersecting), you would be better off with a single table. (Too many tables is not a good thing in HBase.)


HTH

-Mike


> Subject: Parent/child relation - go vertical, horizontal, or many tables?
> From: urgisb@gmail.com
> Date: Thu, 10 Feb 2011 16:55:00 -0800
> To: user@hbase.apache.org
> 
> Hi all,
> 
> Let's say I have two entities Parent and Child. There could be many children in one parent (from hundreds to tens of millions)
> A child can only belong to one Parent.
> 
> Typical queries are:
> -Fetch all children from a single parent
> -Find a few children by their keys or values from a single parent
> -Update a single child by child key and it's parent key
> 
> And there are no cross-parent queries.
> 
> I am trying to figure out what is better schema approach from performance/maintenance perspective:
> 
> 1. One table with one Parent per row. Row key is a parent id. Children are stored in a single family each under separate qualifier (child id). Would it even work assuming all children may not fit in memory? 
> 
> 2. One table. Compound row key parent id/child id. One child per row. 
> 
> 3. Many tables - one per parent. Row key is a child id.
> 
> Thanks!

Re: Parent/child relation - go vertical, horizontal, or many tables?

Posted by Ryan Rawson <ry...@gmail.com>.
You want to choose the schema that minimizes the # of RPCs you are doing.
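
One way to apply this is a back-of-envelope count of round trips per use case. The Python sketch below uses a deliberately simplified cost model that is an assumption, not a measurement: one RPC per get or put, and one RPC per batch of `caching` rows returned by a scan (scanner open and close calls are ignored). The function names are hypothetical.

```python
import math


def rpcs_wide_row_fetch_all() -> int:
    # Option 1: all children live in one row, so a single get returns them.
    return 1


def rpcs_tall_scan_fetch_all(num_children: int, caching: int) -> int:
    # Option 2: one row per child; a scan returns `caching` rows per round trip.
    return max(1, math.ceil(num_children / caching))


def rpcs_point_gets(num_keys: int) -> int:
    # A few children fetched by key, issued as separate gets.
    return num_keys


small_parent = rpcs_tall_scan_fetch_all(num_children=300, caching=1000)
large_parent = rpcs_tall_scan_fetch_all(num_children=10_000_000, caching=1000)
```

Under this model a parent with 300 children costs one round trip in either layout. A parent with ten million children needs 10,000 scan round trips at a caching value of 1000, whereas the single get in option 1 would have to return all ten million cells in one response.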

-ryan

On Thu, Feb 10, 2011 at 4:55 PM, Jason <ur...@gmail.com> wrote:
> Hi all,
>
> Let's say I have two entities Parent and Child. There could be many children in one parent (from hundreds to tens of millions)
> A child can only belong to one Parent.
>
> Typical queries are:
> -Fetch all children from a single parent
> -Find a few children by their keys or values from a single parent
> -Update a single child by child key and it's parent key
>
> And there are no cross-parent queries.
>
> I am trying to figure out what is better schema approach from performance/maintenance perspective:
>
> 1. One table with one Parent per row. Row key is a parent id. Children are stored in a single family each under separate qualifier (child id). Would it even work assuming all children may not fit in memory?
>
> 2. One table. Compound row key parent id/child id. One child per row.
>
> 3. Many tables - one per parent. Row key is a child id.
>
> Thanks!