Posted to user@hbase.apache.org by Wilm Schumacher <wi...@cawoom.com> on 2013/12/16 16:34:34 UTC

Newbie question: Rowkey design

Hi,

I'm a newbie to HBase and have a question on rowkey design; I hope this
question isn't too newbie-like for this list. It is a question which
cannot be answered by knowledge of the code, but by experience with
large databases, hence this mail.

For the sake of explanation I'll create a small example. Suppose you
want to design a small "blogging" platform. You just want to store the
name of the user and a small text, and of course you want to be able to
fetch all postings of one user.

Furthermore we have 4 users, let's call them A, B, C, D (and you can
trust that the length of the username is fixed). Now let's say A, B and
C each have N postings, and D has 7*N postings. BUT: the data of A is
fetched 3 times more often than the data of each of the other users!

If you create an HBase cluster with 10 nodes, every node holds N
postings (of course I know that the data is held redundantly, but this
is not so important for the question).

Rowkey design #1:
The i-th posting of user X would have the rowkey "$X$i", e.g. "A003".
The table would simply be: "create 'postings' , 'text'"

For this rowkey design the first node would hold the data of A, the
second the data of B, the third the data of C, and the fourth to the
tenth node the data of D.

Fetching the data would be very easy, but half of the traffic would hit
the first node (A accounts for 3 of every 6 fetches).
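
To make that concrete: fetching all postings of A under design #1 is a
single scan over a contiguous key range. A minimal sketch against the
Java client (the class name and the "body" qualifier are made up for
illustration):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class FetchUserPostings {
        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "postings");
            // all keys of user A fall in the contiguous range ["A", "B")
            Scan scan = new Scan(Bytes.toBytes("A"), Bytes.toBytes("B"));
            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result r : scanner) {
                    // "body" is an assumed qualifier in the 'text' family
                    String text = Bytes.toString(
                            r.getValue(Bytes.toBytes("text"), Bytes.toBytes("body")));
                    System.out.println(Bytes.toString(r.getRow()) + ": " + text);
                }
            } finally {
                scanner.close();
                table.close();
            }
        }
    }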

Rowkey design #2:
The rowkey would be random, e.g. a UUID. The table design would now be:
"create 'postings' , 'user' , 'text'"

Fetching the data would be a "real" MapReduce job: check the user
column, emit the matching postings, and so on.

So every fetch costs more computation cycles and more IO, but in this
scenario the traffic would be spread across all 10 servers.
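
The map side of such a job could look roughly like this (a sketch only;
the "name" and "body" qualifiers and the hard-coded target user are
assumptions):

    import java.io.IOException;

    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.Text;

    // emits (user, text) for every row that belongs to the wanted user;
    // each region server scans its own share of the randomly keyed table
    public class PostingsByUserMapper extends TableMapper<Text, Text> {
        private static final byte[] USER = Bytes.toBytes("user");
        private static final byte[] TEXT = Bytes.toBytes("text");
        private static final String WANTED = "A"; // hypothetical target user

        @Override
        protected void map(ImmutableBytesWritable row, Result value,
                Context context) throws IOException, InterruptedException {
            // "name" and "body" are assumed qualifier names
            String user = Bytes.toString(value.getValue(USER, Bytes.toBytes("name")));
            if (WANTED.equals(user)) {
                String text = Bytes.toString(value.getValue(TEXT, Bytes.toBytes("body")));
                context.write(new Text(user), new Text(text));
            }
        }
    }

(For a single-user lookup one could also push the predicate into the
scan with a SingleColumnValueFilter instead of running a full MapReduce
job; the locality-vs-parallelism trade-off stays the same either way.)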

If N (the number of postings) is large enough that disk space becomes
critical, I'm also not able to adjust the key regions in a way that
e.g. the data of D sits only on the last server while the key space of
A spans the first 5 nodes, or to make replication very broad (e.g. 10
copies in this case).
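
(By adjusting the key regions I mean something like passing explicit
split points at table creation; a sketch against the 0.96-era admin
API, with one region boundary per user prefix:)

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CreatePresplitTable {
        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);
            HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("postings"));
            desc.addFamily(new HColumnDescriptor("text"));
            // each user's key range starts on its own region boundary
            byte[][] splits = { Bytes.toBytes("B"), Bytes.toBytes("C"), Bytes.toBytes("D") };
            admin.createTable(desc, splits);
            admin.close();
        }
    }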

So basically the question is: what's the better plan? Trying to avoid
the computation cycles of MapReduce by getting the key design right, or
trying to scale out the computation, at the cost of more IO?

I hope that the small example helped to make the question more vivid.

Best wishes

Wilm

Re: Newbie question: Rowkey design

Posted by Wilm Schumacher <wi...@cawoom.com>.
I was afraid of this answer and suspected it ;). I knew that the answer
would depend on the actual setting, but I had hoped there would be a
little hint.

Thanks a lot for your time and the answers. I will try it out with test
data (and a simple table design) and will share my results when the
experiments are done.

Thanks to all for HBase, and best wishes

Wilm

Re: Newbie question: Rowkey design

Posted by yonghu <yo...@gmail.com>.
In my opinion, it really depends on your queries.

The first one achieves data locality. There is no additional data
transmitted between different nodes. But this strategy sacrifices
parallelism, and the node which stores A will become a hot node if too
many applications try to access A.

The second approach gives you parallelism, but you somehow need to merge
the data together to generate the final results. So you can see there is
a trade-off between data locality and parallelism, and the performance
of a query will be influenced by the following factors:

1. data size;
2. data access frequency;
3. data access pattern, full scan or index scan;
4. network bandwidth.

So the best solution for one situation may not fit another.

Re: Newbie question: Rowkey design

Posted by Tao Xiao <xi...@gmail.com>.
Row key design is sometimes a trade-off between load balancing and
query convenience: if you design the row key such that you can query it
quickly and conveniently, the records may not be spread evenly across
the nodes; if you design the row key such that the records are spread
evenly across the nodes, it may be inconvenient to query, or impossible
to get a record through the row key directly (say you have a random
number as the row key's prefix).

You can have a look at secondary indexes. A secondary index is very helpful.
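
For example, a salted key could be built like this (a minimal sketch;
the bucket count and the formatting are arbitrary choices):

    import org.apache.hadoop.hbase.util.Bytes;

    public class SaltedKeys {
        static final int BUCKETS = 10; // hypothetical bucket count

        // prefixing with postingId % BUCKETS spreads one user's postings
        // over BUCKETS key ranges; reading all postings of a user then
        // needs one scan per salt prefix
        static byte[] rowKey(String user, long postingId) {
            int salt = (int) (postingId % BUCKETS);
            return Bytes.toBytes(String.format("%02d-%s-%09d", salt, user, postingId));
        }
    }

This spreads a hot user's rows over all buckets, but a read for one user
then costs BUCKETS scans instead of one.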
