
Posted to user@hbase.apache.org by Marc Sturm <ma...@nyp.org> on 2014/12/03 21:31:20 UTC

question about composite rowKey and performance difference between getScanner() and get(Get[])

Hi,

I have a many-to-many relationship that I am trying to model in HBase, and I want to be sure I am not missing anything, so please let me know or point me to the right documentation.

Let's say I have a many-to-many relationship between A and B. The query takes A's unique id as its parameter and returns the unique ids of all the Bs related to that A, with their properties and values.

The first solution I found uses two tables. In the first, the rowKey is A's unique id and the column identifiers are the unique ids of the Bs related to that A. In the second, the rowKeys are B unique ids and the columns contain the property values. So the query takes two steps: it first does a get on the A table to collect all the B uniqueIds, and then does a second get on the B table, passing an array of B rowkeys as the parameter. When I run that second query, the first call can have a much longer latency, and subsequent calls with the same parameter have good, low latency. I believe that's a caching issue...
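A minimal sketch of the two-table layout and its two-step read, modeled with in-memory sorted maps rather than the HBase client. The class TwoTableSketch and its methods are hypothetical names used only for illustration; bIdsFor stands in for the single get on the A table, and multiGet stands in for the batch get on the B table.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory stand-in for the two-table design; this is NOT the HBase client API.
final class TwoTableSketch {
    // Table 1: rowKey = A id, column qualifiers = related B ids (cell values unused).
    final Map<String, TreeMap<String, String>> aToB = new TreeMap<>();
    // Table 2: rowKey = B id, columns = property name -> property value.
    final Map<String, Map<String, String>> bProps = new TreeMap<>();

    void relate(String aId, String bId) {
        aToB.computeIfAbsent(aId, k -> new TreeMap<>()).put(bId, "");
    }

    void putB(String bId, String property, String value) {
        bProps.computeIfAbsent(bId, k -> new TreeMap<>()).put(property, value);
    }

    // Step 1: one get on table 1; the B ids come back as the row's column qualifiers.
    List<String> bIdsFor(String aId) {
        TreeMap<String, String> row = aToB.get(aId);
        return row == null ? new ArrayList<>() : new ArrayList<>(row.keySet());
    }

    // Step 2: a batch of gets on table 2, one lookup per B rowkey.
    Map<String, Map<String, String>> multiGet(List<String> bIds) {
        Map<String, Map<String, String>> out = new TreeMap<>();
        for (String bId : bIds) {
            Map<String, String> row = bProps.get(bId);
            if (row != null) {
                out.put(bId, row);
            }
        }
        return out;
    }
}
```

Each B row is stored once here, so B's properties are not duplicated, at the cost of a second round of lookups keyed by ids that need not sit next to each other.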

The second solution uses one table with a composite rowkey equal to A uniqueid + B uniqueid, so I will then have duplicate B uniqueid rows. But when I do a scan on just the first part of the rowKey (A uniqueid), the response time is more consistent and better (lower).
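A minimal sketch of the composite-rowkey idea, again in plain Java rather than the HBase client. HBase keeps rows sorted in lexicographic byte order, so all rows sharing an A prefix are adjacent and can be read as one key range. The class CompositeKeySketch, the 0x00 separator byte, and the in-memory TreeMap table are assumptions made for illustration; the separator only works if ids never contain that byte (fixed-width ids would be another option).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

final class CompositeKeySketch {
    // Assumed separator; without one, prefix "a1" would also match rows of "a10".
    static final byte SEP = 0x00;

    // Bytes of the A id followed by the separator.
    static byte[] prefix(String aId) {
        byte[] a = aId.getBytes(StandardCharsets.UTF_8);
        byte[] out = Arrays.copyOf(a, a.length + 1);
        out[a.length] = SEP;
        return out;
    }

    // Composite rowkey: A id + separator + B id.
    static byte[] rowKey(String aId, String bId) {
        byte[] p = prefix(aId);
        byte[] b = bId.getBytes(StandardCharsets.UTF_8);
        byte[] out = Arrays.copyOf(p, p.length + b.length);
        System.arraycopy(b, 0, out, p.length, b.length);
        return out;
    }

    // Smallest key greater than every key that starts with the prefix:
    // drop trailing 0xFF bytes, then increment the last remaining byte.
    // An empty result would mean "no upper bound"; scanPrefix never hits
    // that case because its prefix always ends with SEP.
    static byte[] stopRow(byte[] prefix) {
        int end = prefix.length;
        while (end > 0 && prefix[end - 1] == (byte) 0xFF) {
            end--;
        }
        byte[] out = Arrays.copyOf(prefix, end);
        if (end > 0) {
            out[end - 1]++;
        }
        return out;
    }

    // In-memory table ordered by unsigned lexicographic byte comparison.
    static NavigableMap<byte[], String> newTable() {
        return new TreeMap<byte[], String>((x, y) -> Arrays.compareUnsigned(x, y));
    }

    // Range read: start row inclusive, stop row exclusive.
    static List<String> scanPrefix(NavigableMap<byte[], String> table, String aId) {
        byte[] start = prefix(aId);
        byte[] stop = stopRow(start);
        return new ArrayList<>(table.subMap(start, true, stop, false).values());
    }
}
```

With a real table, the two byte arrays computed in scanPrefix would serve as the start row (inclusive) and stop row (exclusive) of a Scan.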

So, my questions are threefold: 1) which way is best, 2) what is the performance difference between a scan and a get with multiple rowkeys (I think the scan is faster because the data is not, or is less, "distributed"), and 3) how can we make the get with multiple rowkeys more consistent?
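One way to picture the "distributed" intuition in question 2, as a toy model only: a table is split into regions by key range, so a batch of gets is served by whichever regions hold each key, while a prefix scan covers one contiguous range. The class RegionSpreadSketch, the region names, and the split points are all hypothetical, and keys are compared as plain ASCII strings for simplicity. The sketch counts regions touched; it does not measure or predict latency.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class RegionSpreadSketch {
    // Region start key -> region name; the empty string starts the first region.
    final TreeMap<String, String> regions = new TreeMap<>();

    // splitPoints are hypothetical region boundaries, given in ascending order.
    RegionSpreadSketch(String... splitPoints) {
        regions.put("", "region-0");
        for (int i = 0; i < splitPoints.length; i++) {
            regions.put(splitPoints[i], "region-" + (i + 1));
        }
    }

    // The region holding a key is the one with the greatest start key <= that key.
    String regionFor(String rowKey) {
        return regions.floorEntry(rowKey).getValue();
    }

    // Regions touched by a batch of individual gets.
    Set<String> regionsForGets(List<String> rowKeys) {
        Set<String> out = new TreeSet<>();
        for (String key : rowKeys) {
            out.add(regionFor(key));
        }
        return out;
    }

    // Regions touched by a range read [startInclusive, stopExclusive):
    // the region containing the start key, plus every region whose start
    // key falls strictly inside the range.
    Set<String> regionsForRange(String startInclusive, String stopExclusive) {
        Set<String> out = new TreeSet<>();
        out.add(regionFor(startInclusive));
        out.addAll(regions.subMap(startInclusive, false, stopExclusive, false).values());
        return out;
    }
}
```

In this model, gets on "b1", "b4" and "b8" against a table split at "b3" and "b6" touch three regions, whereas a range that does not cross a split point touches one. How much that matters in a real cluster depends on region sizes, caching, and key distribution.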

Thank you for your help,
Marc

This electronic message is intended to be for the use only of the named recipient, and may contain information that is confidential or privileged. If you are not the intended recipient, you are hereby notified that any disclosure, copying, distribution or use of the contents of this message is strictly prohibited. If you have received this message in error or are not the named recipient, please notify us immediately by contacting the sender at the electronic mail address noted above, and delete and destroy all copies of this message. Thank you.

RE: question about composite rowKey and performance difference between getScanner() and get(Get[])

Posted by Marc Sturm <ma...@nyp.org>.

I will read it. Thanks!
The size of the data that is not A or B uniqueIds is pretty small compared to the whole dataset, so I think that points to the single-table solution.
Marc

-----Original Message-----
From: Ted Yu [mailto:yuzhihong@gmail.com] 
Sent: Thursday, December 04, 2014 1:12 PM
To: user@hbase.apache.org
Subject: Re: question about composite rowKey and performance difference between getScanner() and get(Get[])

I assume you have read http://hbase.apache.org/book.html#schema.casestudies
(See 6.11.3)

What's the size of the data that is not A or B's uniqueIds? The answer is related to the amount of data redundancy that you are comfortable with in your design.

Cheers



Re: question about composite rowKey and performance difference between getScanner() and get(Get[])

Posted by Ted Yu <yu...@gmail.com>.

I assume you have read http://hbase.apache.org/book.html#schema.casestudies
(See 6.11.3)

What's the size of the data that is not A or B's uniqueIds? The answer is
related to the amount of data redundancy that you are comfortable with in
your design.

Cheers
