Posted to user@hbase.apache.org by "Basmajian, Raffi" <rb...@oppenheimerfunds.com> on 2010/03/19 16:33:41 UTC

How to join tables in HBase 20.3

I am new to HBase and come from an RDBMS background. After looking at the
sample client code, it seems fairly easy to query a single table using
Get and Scan, but it's not so obvious how to join data across multiple
tables.
 
Are there any examples of how to read/join data across multiple tables?
 
Thank you
 
Raffi Basmajian
 

------------------------------------------------------------------------------
This e-mail transmission may contain information that is proprietary, privileged and/or confidential and is intended exclusively for the person(s) to whom it is addressed. Any use, copying, retention or disclosure by any person other than the intended recipient or the intended recipient's designees is strictly prohibited. If you are not the intended recipient or their designee, please notify the sender immediately by return e-mail and delete all copies. OppenheimerFunds may, at its sole discretion, monitor, review, retain and/or disclose the content of all email communications. 
==============================================================================

Re: How to join tables in HBase 20.3

Posted by Jeff Hammerbacher <ha...@cloudera.com>.
Specifically, you may want to follow
https://issues.apache.org/jira/browse/HIVE-1257, which is a ticket for
debugging the current implementation of joins over HBase tables using Hive.

On Fri, Mar 19, 2010 at 9:46 AM, Jonathan Gray <jg...@facebook.com> wrote:

> What you're asking for is a join.  You said you understand there isn't a
> mechanism to do it but then ask if there is functionality to provide
> combining the data.  They are equivalent.
>
> One thing to understand is that you're talking about a very traditional
> relational data model.  That fits very well into an RDBMS and less so into
> an HBase model.  However it is still possible to implement it in the same
> way as an RDBMS (by doing your own joining) or in a different way by
> denormalizing the data.
>
> To denormalize the data you would combine these things into a single table
> (or fewer than three), or in each table duplicate the data for the others.
>
> For example, let's say a customer can have any number of claims
> (1-to-many).  Rather than thinking of it like a relational database where
> each of these things is in a different table and they reference one another,
> you might just toss them into a single table.
>
> The customer table (keyed on customerid) would have a 'claims' family.  For
> each claim, you could insert a column with the claimid (or a composite
> column if you needed time sorting, prepended with a stamp for example).  The
> value would be the claim information in a serialized type.  If you wanted to
> not use a serialized type, you could still spread each claim over multiple
> columns by adding additional type information into the column qualifier.
>  For example:  <timestamp><claimid><fieldname> and in the value
> <fieldvalue>.  You have to use filters to get everything for a claimid,
> which is unfortunate (would actually be possible to implement start/stop
> keyvalues but currently not supported).  In that case, you might make the
> table tall instead of wide and push these things into the row key.
>  <customerid><policyid><timestamp><claimid> and then you could have column
> qualifiers -> values for each field.  This would allow you to do a Get for a
> single claim (you'd have to know the row key to do a Get), and would also
> allow you to do queries like "give me all policies and claims for this
> customer", "give me the 10 most recent claims for this customer's policy",
> etc.
>
> For your specific example, where you don't want to pivot on the customer
> first but rather the time of the claim, you might create a table with rows
> such as <claim_timestamp><claim_id>.  Then you could use scanners to grab
> any claims within any range of time (rows from now() - 1 month to now()).
>
> Whether you denormalize the claims and store their full content in the
> table is another question.  It depends on how much data there is and how
> many times you would need to duplicate it (you may need a new table for
> every query you want to support if each pivots on a different column:
> time, claim, customer, policy, and so on).  So the trade-off is:  if you
> denormalize, you get significantly faster reads at the expense of slower
> writes and data duplication.  If you join, you get better space efficiency
> and faster writes at the expense of slower reads.
>
> One of the advantages of HBase over an RDBMS is that you get to choose
> these trade-offs.  Often in an RDBMS (especially in "by the book"
> schema design) there is one way and you don't have this flexibility.
>
> Hope that helps more than it confuses :)
>
> JG
>

RE: How to join tables in HBase 20.3

Posted by Jonathan Gray <jg...@facebook.com>.
What you're asking for is a join.  You said you understand there isn't a mechanism to do it but then ask if there is functionality to provide combining the data.  They are equivalent.

One thing to understand is that you're talking about a very traditional relational data model.  That fits very well into an RDBMS and less so into an HBase model.  However it is still possible to implement it in the same way as an RDBMS (by doing your own joining) or in a different way by denormalizing the data.

To denormalize the data you would combine these things into a single table (or fewer than three), or in each table duplicate the data for the others.

For example, let's say a customer can have any number of claims (1-to-many).  Rather than thinking of it like a relational database where each of these things is in a different table and they reference one another, you might just toss them into a single table.

The customer table (keyed on customerid) would have a 'claims' family.  For each claim, you could insert a column with the claimid as the qualifier (or a composite qualifier if you needed time sorting, prepended with a timestamp for example).  The value would be the claim information in a serialized type.

If you did not want to use a serialized type, you could still spread each claim over multiple columns by adding additional type information (such as the field name) to the column qualifier.  For example:  <timestamp><claimid><fieldname> as the qualifier and <fieldvalue> as the value.  You would have to use filters to get everything for a claimid, which is unfortunate (it would actually be possible to implement start/stop keyvalues, but this is currently not supported).

In that case, you might make the table tall instead of wide and push these things into the row key:  <customerid><policyid><timestamp><claimid>, with column qualifiers -> values for each field.  This would allow you to do a Get for a single claim (you'd have to know the row key to do a Get), and would also allow you to do queries like "give me all policies and claims for this customer", "give me the 10 most recent claims for this customer's policy", etc.
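A minimal sketch of the tall layout, written in Python against an in-memory sorted map rather than the HBase client API.  The SortedTable class, the row_key helper, the fixed field widths and the sample data are all assumptions made for illustration.  The point it shows is that, because rows are kept sorted by key, a scan over a key prefix returns one customer's (or one policy's) claims together and in timestamp order.

```python
import bisect

# Stand-in for a table whose rows are kept sorted by row key.
class SortedTable:
    def __init__(self):
        self._keys = []   # sorted row keys
        self._rows = {}   # row key -> {qualifier: value}

    def put(self, key, columns):
        if key not in self._rows:
            bisect.insort(self._keys, key)
            self._rows[key] = {}
        self._rows[key].update(columns)

    def get(self, key):
        # Single-row lookup; the caller must know the full row key.
        return self._rows.get(key)

    def prefix_scan(self, prefix):
        # Start at the first key >= prefix, stop when keys stop matching.
        i = bisect.bisect_left(self._keys, prefix)
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            yield self._keys[i], self._rows[self._keys[i]]
            i += 1

def row_key(customer_id, policy_id, timestamp, claim_id):
    # Fixed-width, zero-padded fields so string order matches numeric order.
    return f"{customer_id:08d}{policy_id:08d}{timestamp:012d}{claim_id:08d}"

claims = SortedTable()
claims.put(row_key(7, 1, 1268000000, 501), {"amount": "1200", "status": "open"})
claims.put(row_key(7, 1, 1268900000, 502), {"amount": "300", "status": "paid"})
claims.put(row_key(7, 2, 1268500000, 503), {"amount": "950", "status": "open"})
claims.put(row_key(8, 9, 1268700000, 504), {"amount": "80", "status": "open"})

# "All policies and claims for this customer": scan on the customer prefix.
customer_7 = [k for k, _ in claims.prefix_scan(f"{7:08d}")]

# "Claims for this customer's policy", oldest first.
policy_7_1 = [cols["amount"] for _, cols in claims.prefix_scan(f"{7:08d}{1:08d}")]
```

To get the most recent claims first, one common convention is to store a reversed timestamp (a large constant minus the timestamp) in the key, so that newer rows sort ahead of older ones.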

For your specific example, where you don't want to pivot on the customer first but rather on the time of the claim, you might create a table with row keys such as <claim_timestamp><claim_id>.  Then you could use scanners to grab any claims within any range of time (for example, rows from now() - 1 month to now()).
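A minimal sketch of the time-keyed table, again in Python with a sorted list standing in for the table.  The claim_row_key and range_scan helpers, the key widths and the sample dates are assumptions for illustration, not HBase calls.  It shows that when the timestamp leads the row key, "claims in the past month" becomes a single range scan between a start key and a stop key.

```python
import bisect
from datetime import datetime, timedelta, timezone

def claim_row_key(claim_time, claim_id):
    # Epoch seconds, zero-padded, so lexicographic order equals time order.
    return f"{int(claim_time.timestamp()):012d}{claim_id:08d}"

def range_scan(sorted_keys, start_key, stop_key):
    # Half-open range [start_key, stop_key), like a scan with start/stop rows.
    lo = bisect.bisect_left(sorted_keys, start_key)
    hi = bisect.bisect_left(sorted_keys, stop_key)
    return sorted_keys[lo:hi]

now = datetime(2010, 3, 19, tzinfo=timezone.utc)
keys = sorted([
    claim_row_key(now - timedelta(days=45), 1),
    claim_row_key(now - timedelta(days=20), 2),
    claim_row_key(now - timedelta(days=3), 3),
])

# Scan from (now - 30 days) up to now; only the timestamp prefix is needed.
start = f"{int((now - timedelta(days=30)).timestamp()):012d}"
stop = f"{int(now.timestamp()):012d}"
last_month = range_scan(keys, start, stop)
recent_claim_ids = [int(k[12:]) for k in last_month]
```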

Whether you denormalize the claims and store their full content in the table is another question.  It depends on how much data there is and how many times you would need to duplicate it (you may need a new table for every query you want to support if each pivots on a different column: time, claim, customer, policy, and so on).  So the trade-off is:  if you denormalize, you get significantly faster reads at the expense of slower writes and data duplication.  If you join, you get better space efficiency and faster writes at the expense of slower reads.
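The denormalized side of that trade-off can be sketched as follows, in Python with plain dicts standing in for two query-specific tables.  The table names, the put_claim helper and the data are hypothetical.  Each claim is written once per table (two writes, twice the space), and in exchange each query is answered from a single table with no second lookup.

```python
# Two query-specific "tables" (plain dicts as stand-ins), one per access pattern.
claims_by_customer = {}   # key: (customer_id, claim_time, claim_id)
claims_by_time = {}       # key: (claim_time, claim_id)

def put_claim(customer_id, claim_time, claim_id, claim):
    # Denormalized write: the full claim is stored once per table.
    claims_by_customer[(customer_id, claim_time, claim_id)] = claim
    claims_by_time[(claim_time, claim_id)] = claim
    return 2  # number of writes issued for this one claim

writes = 0
writes += put_claim(7, 1268000000, 501, {"amount": 1200})
writes += put_claim(8, 1268700000, 504, {"amount": 80})

# Each query reads exactly one table.
for_customer_7 = [c for (cust, _, _), c in sorted(claims_by_customer.items())
                  if cust == 7]
since = 1268500000
recent = [c for (t, _), c in sorted(claims_by_time.items()) if t >= since]
```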

One of the advantages of HBase over an RDBMS is that you get to choose these trade-offs.  Often in an RDBMS (especially in "by the book" schema design) there is one way and you don't have this flexibility.

Hope that helps more than it confuses :)

JG 

> -----Original Message-----
> From: Basmajian, Raffi [mailto:rbasmajian@oppenheimerfunds.com]
> Sent: Friday, March 19, 2010 9:20 AM
> To: hbase-user@hadoop.apache.org
> Subject: RE: How to join tables in HBase 20.3
> 
> JG,
> 
> I understand that there is no built-in mechanism to do joins, but the
> essence of combining data to make it more useful remains the same
> regardless of whether it's an RDBMS, HBase, etc., so there must be
> something in HBase that provides this functionality.
> 
> Assume for the moment that in HBase I have the tables Customer, Policy,
> and Claim for an auto insurance business. Say I want to get a list of
> all customers that filed a claim on their auto policy in the past month.
> If I use Get and/or Scan then that allows me to pull information from
> each individual table, but I still need to combine the data to give me
> the list of policies based on my original query. Is there additional
> functionality in HBase that enables combining the data? I've been
> searching in the samples and I can't find a clear and simple example.
> 
> Thanks
> Raffi
> 
> 


RE: How to join tables in HBase 20.3

Posted by "Basmajian, Raffi" <rb...@oppenheimerfunds.com>.
It's not a complaint at all; I'm just trying to understand what, if any,
functionality HBase provides via its API to combine data. According to
your comments, it appears that data aggregation is performed
programmatically, after the data is retrieved, which is different from
how it's done in an RDBMS using joins. If that's the case then fine.

Thank you  

-----Original Message-----
From: Buttler, David [mailto:buttler1@llnl.gov] 
Sent: Friday, March 19, 2010 12:28 PM
To: hbase-user@hadoop.apache.org
Subject: RE: How to join tables in HBase 20.3

This particular query seems quite straightforward:
Scan the claim table with a filter that only returns entries from the
last month.  Get the customer id and policy id from the claim record
(i.e. the foreign keys).  Use Get to retrieve data from the customer and
policy tables.  Is the complaint that you have to write this yourself,
or that there is no referential integrity between the tables, or
something else?

Dave



RE: How to join tables in HBase 20.3

Posted by "Buttler, David" <bu...@llnl.gov>.
This particular query seems quite straightforward:
Scan the claim table with a filter that only returns entries from the last month.  Get the customer id and policy id from the claim record (i.e. the foreign keys).  Use Get to retrieve data from the customer and policy tables.  Is the complaint that you have to write this yourself, or that there is no referential integrity between the tables, or something else?
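The three steps can be sketched as a client-side join.  This is a minimal Python illustration with dicts standing in for the Customer, Policy and Claim tables; the field names, row keys and the customers_with_recent_claims function are assumptions, not part of any HBase API.

```python
# In-memory stand-ins for the three tables, keyed by row key.
customers = {"c1": {"name": "Ann"}, "c2": {"name": "Bob"}}
policies = {"p1": {"type": "auto"}, "p2": {"type": "auto"}}
claims = {
    "k1": {"customer_id": "c1", "policy_id": "p1", "filed": 1268700000},
    "k2": {"customer_id": "c2", "policy_id": "p2", "filed": 1262000000},
    "k3": {"customer_id": "c1", "policy_id": "p1", "filed": 1268800000},
}

def customers_with_recent_claims(since, until):
    results = {}
    # Step 1: "scan" the claim table, filtering on the filing time.
    for claim_id, claim in sorted(claims.items()):
        if not (since <= claim["filed"] < until):
            continue
        # Step 2: follow the foreign keys with one "get" per table.
        customer = customers.get(claim["customer_id"])
        policy = policies.get(claim["policy_id"])
        if customer is None or policy is None:
            continue  # nothing enforces referential integrity, so check
        # Step 3: combine in application code, de-duplicating by customer.
        entry = results.setdefault(claim["customer_id"],
                                   {"name": customer["name"], "claims": []})
        entry["claims"].append((claim_id, policy["type"]))
    return results

report = customers_with_recent_claims(since=1266364800, until=1268956800)
```

The missing-row check in step 2 is there because, as noted above, nothing guarantees that a foreign key stored in a claim still points at an existing customer or policy row.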

Dave




RE: How to join tables in HBase 20.3

Posted by "Basmajian, Raffi" <rb...@oppenheimerfunds.com>.
JG,

I understand that there is no built-in mechanism to do joins, but the
essence of combining data to make it more useful remains the same
regardless of whether it's an RDBMS, HBase, etc., so there must be
something in HBase that provides this functionality.

Assume for the moment that in HBase I have the tables Customer, Policy,
and Claim for an auto insurance business. Say I want to get a list of
all customers that filed a claim on their auto policy in the past month.
If I use Get and/or Scan then that allows me to pull information from
each individual table, but I still need to combine the data to give me
the list of policies based on my original query. Is there additional
functionality in HBase that enables combining the data? I've been
searching in the samples and I can't find a clear and simple example.

Thanks
Raffi
 

-----Original Message-----
From: Jonathan Gray [mailto:jgray@facebook.com] 
Sent: Friday, March 19, 2010 12:03 PM
To: hbase-user@hadoop.apache.org
Subject: RE: How to join tables in HBase 20.3

At some point joins may be necessary when denormalization is not
possible.

There is no built-in mechanism to do it.  It would be a series of
additional Get calls to the second table you are joining against.  This
would be helped significantly with a parallel MultiGet which will
hopefully make it to 0.21.
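A minimal sketch of why a parallel multi-get helps, in Python.  The get and multi_get functions below are hypothetical stand-ins (a dict plays the second table and a thread pool plays the parallel client); they are not the HBase API.  The idea is to collect the foreign keys from the first table, de-duplicate them, and issue the lookups concurrently rather than one after another.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the second table being joined against.
customers = {"c1": {"name": "Ann"}, "c2": {"name": "Bob"}, "c3": {"name": "Cy"}}

def get(row_key):
    # Stand-in for a single-row Get; returns None for a missing row.
    return row_key, customers.get(row_key)

def multi_get(row_keys, max_workers=4):
    # Fetch each distinct key once, with the lookups running concurrently.
    unique_keys = sorted(set(row_keys))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(get, unique_keys))

# Foreign keys collected from the first table (for example, a claims scan).
claim_customer_ids = ["c1", "c3", "c1", "c9"]
rows = multi_get(claim_customer_ids)

# Join: keep only claims whose customer row exists.
joined = [(cid, rows[cid]["name"]) for cid in claim_customer_ids if rows[cid]]
```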

JG

> -----Original Message-----
> From: TuX RaceR [mailto:tuxracer69@gmail.com]
> Sent: Friday, March 19, 2010 8:41 AM
> To: hbase-user@hadoop.apache.org
> Subject: Re: How to join tables in HBase 20.3
> 
> Hi Raffi,
> 
> When dealing with key-value stores, you need to think in a different
> way; see for instance:
> 
> http://wiki.apache.org/hadoop/Hbase/DataModel
> 
> "Getting high scalability from your relational database isn't done by
> simply adding more machines because its data model is based on a
> single-machine architecture. For example, a JOIN between two tables is
> done in memory and does not take into account the possibility that the
> data has to go over the wire."
> 
> JOIN simply does not scale in relational databases.
> 
> 
> see also
> 
> http://wiki.apache.org/hadoop/Hbase/FAQ#A20
> 
> *20 Are there any Schema Design examples?*
> 
> 
> Hope this helps,
> 
> Cheers
> TuX
> 
> 


Re: How to join tables in HBase 20.3

Posted by Tim Robertson <ti...@gmail.com>.
If your joining needs are for offline reporting (i.e. not real-time search),
then you can join using MapReduce, but these are long-running jobs.
I am using Hive, which gives you SQL but compiles it to MapReduce jobs.
I am running on tab-delimited files, but I have read that Hive now has HBase
input formats, meaning you can join HBase tables. It will not be a fast query,
but it will meet long-running join needs (e.g. reports).
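
For readers who have not seen how a join can be expressed as a MapReduce job, here is a minimal sketch of the general reduce-side join technique. It is plain Python running in one process, not Hive's actual query plan and not Hadoop code. The function names and the sample records used with it are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import product


def map_phase(table_name, rows, key_of):
    """Emit (join_key, (table_name, row)) so rows can be grouped by key."""
    for row in rows:
        yield key_of(row), (table_name, row)


def shuffle(pairs):
    """Group mapper output by join key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, tagged in pairs:
        groups[key].append(tagged)
    return groups


def reduce_phase(groups, left_name, right_name):
    """For each key, pair every left row with every right row (inner join)."""
    for key in sorted(groups):
        left = [row for name, row in groups[key] if name == left_name]
        right = [row for name, row in groups[key] if name == right_name]
        for l_row, r_row in product(left, right):
            yield key, {**l_row, **r_row}


def reduce_side_join(left_rows, right_rows, left_key, right_key):
    """Run the map, shuffle and reduce steps over two in-memory 'tables'."""
    pairs = list(map_phase("left", left_rows, left_key))
    pairs += list(map_phase("right", right_rows, right_key))
    return list(reduce_phase(shuffle(pairs), "left", "right"))
```

Every row of both inputs passes through the map and shuffle steps, even rows that end up with no match. That full pass over the data is one reason this kind of join suits reports rather than interactive queries.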

Tim


On Fri, Mar 19, 2010 at 5:03 PM, Jonathan Gray <jg...@facebook.com> wrote:

> At some point joins may be necessary when denormalization is not possible.
>
> There is no built-in mechanism to do it.  It would be a series of
> additional Get calls to the second table you are joining against.  This
> would be helped significantly with a parallel MultiGet which will hopefully
> make it to 0.21.
>
> JG
>
> > -----Original Message-----
> > From: TuX RaceR [mailto:tuxracer69@gmail.com]
> > Sent: Friday, March 19, 2010 8:41 AM
> > To: hbase-user@hadoop.apache.org
> > Subject: Re: How to join tables in HBase 20.3
> >
> > Hi Raffi,
> >
> > when dealing with key-value stores, you need to think in a different
> > way
> > see for instance:
> >
> > http://wiki.apache.org/hadoop/Hbase/DataModel
> >
> > "Getting high scalability from your relational database isn't done by
> > simply adding more machines because its data model is based on a
> > single-machine architecture. For example, a JOIN between two tables is
> > done in memory and does not take into account the possibility that the
> > data has to go over the wire."
> >
> > JOIN simply does not scale in relational databases.
> >
> >
> > see also
> >
> > http://wiki.apache.org/hadoop/Hbase/FAQ#A20
> >
> > *20 Are there any Schema Design examples?*
> >
> >
> > Hope this helps,
> >
> > Cheers
> > TuX
> >
> >
> > Basmajian, Raffi wrote:
> > > I am new to HBase and come from a rdbms background. After looking in the
> > > sample client code it seems fairly easy to query a single table using
> > > Get and Scan, but it's not so obvious how to join data across multiple
> > > tables.
> > >
> > > Are there any examples on how to read/join data across multiple tables?
> > >
> > > Thank you
> > >
> > > Raffi Basmajian
> > >

RE: How to join tables in HBase 20.3

Posted by Jonathan Gray <jg...@facebook.com>.
At some point joins may be necessary when denormalization is not possible.

There is no built-in mechanism for joins.  You would issue a series of additional Get calls against the second table you are joining against, one per row being joined.  A parallel MultiGet, which will hopefully make it into 0.21, would help significantly.
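
As a rough illustration of the "series of Get calls" pattern, here is a minimal sketch in plain Python. It does not use the HBase client API: the two dictionaries are hypothetical stand-ins for tables, `get` and `scan` only mimic a single-row Get and a Scan, and the thread pool is an assumed approximation of what a parallel MultiGet would buy you.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for two tables: row key -> {column: value}.
ORDERS = {
    "order-001": {"info:customer_id": "cust-7", "info:total": "19.99"},
    "order-002": {"info:customer_id": "cust-9", "info:total": "5.00"},
    "order-003": {"info:customer_id": "cust-7", "info:total": "42.50"},
}
CUSTOMERS = {
    "cust-7": {"info:name": "Ada"},
    "cust-9": {"info:name": "Linus"},
}


def get(table, row_key):
    """Stand-in for a single-row Get: returns the row's columns, or None."""
    return table.get(row_key)


def scan(table):
    """Stand-in for a Scan: yields (row_key, columns) in row-key order."""
    for row_key in sorted(table):
        yield row_key, table[row_key]


def join_with_gets(left, right, fk_column):
    """Scan `left` and issue one Get against `right` per row."""
    joined = []
    for row_key, columns in scan(left):
        match = get(right, columns[fk_column])
        if match is not None:  # inner join: skip rows with no match
            joined.append((row_key, {**columns, **match}))
    return joined


def join_with_parallel_gets(left, right, fk_column, workers=4):
    """Same join, but the Gets are issued concurrently."""
    rows = list(scan(left))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as `rows`.
        matches = list(pool.map(lambda r: get(right, r[1][fk_column]), rows))
    return [(k, {**cols, **m}) for (k, cols), m in zip(rows, matches) if m is not None]
```

In the sequential version each row of the first table costs one round trip to the second table, so latency grows with the number of rows. The parallel version overlaps those round trips.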

JG

> -----Original Message-----
> From: TuX RaceR [mailto:tuxracer69@gmail.com]
> Sent: Friday, March 19, 2010 8:41 AM
> To: hbase-user@hadoop.apache.org
> Subject: Re: How to join tables in HBase 20.3
> 
> Hi Raffi,
> 
> when dealing with key-value stores, you need to think in a different
> way
> see for instance:
> 
> http://wiki.apache.org/hadoop/Hbase/DataModel
> 
> "Getting high scalability from your relational database isn't done by
> simply adding more machines because its data model is based on a
> single-machine architecture. For example, a JOIN between two tables is
> done in memory and does not take into account the possibility that the
> data has to go over the wire."
> 
> JOIN simply does not scale in relational databases.
> 
> 
> see also
> 
> http://wiki.apache.org/hadoop/Hbase/FAQ#A20
> 
> *20 Are there any Schema Design examples?*
> 
> 
> Hope this helps,
> 
> Cheers
> TuX
> 
> 
> Basmajian, Raffi wrote:
> > I am new to HBase and come from a rdbms background. After looking in the
> > sample client code it seems fairly easy to query a single table using
> > Get and Scan, but it's not so obvious how to join data across multiple
> > tables.
> >
> > Are there any examples on how to read/join data across multiple tables?
> >
> > Thank you
> >
> > Raffi Basmajian
> >


Re: How to join tables in HBase 20.3

Posted by TuX RaceR <tu...@gmail.com>.
Hi Raffi,

When dealing with key-value stores, you need to think in a different way;
see for instance:

http://wiki.apache.org/hadoop/Hbase/DataModel

"Getting high scalability from your relational database isn't done by 
simply adding more machines because its data model is based on a 
single-machine architecture. For example, a JOIN between two tables is 
done in memory and does not take into account the possibility that the 
data has to go over the wire."

In other words, JOINs become hard to scale once a relational database's data
is spread across machines.


see also

http://wiki.apache.org/hadoop/Hbase/FAQ#A20

*20 Are there any Schema Design examples?*


Hope this helps,

Cheers
TuX


Basmajian, Raffi wrote:
> I am new to HBase and come from a rdbms background. After looking in the
> sample client code it seems fairly easy to query a single table using
> Get and Scan, but it's not so obvious how to join data across multiple
> tables. 
>  
> Are there any examples on how to read/join data across multiple tables?
>  
> Thank you
>  
> Raffi Basmajian