Posted to user@hbase.apache.org by Jason Huang <ja...@icare.com> on 2013/07/08 16:19:17 UTC

Using separator/delimiter in HBase rowkey?

Hello,

I am trying to get some advice on pros/cons of using separator/delimiter as
part of HBase row key.

Currently one of our user activity tables has a rowkey design of
"UserID^TimeStamp" with a separator of "^". (UserID is a string that won't
include '^').

This is designed for the two common use cases in our system:
(1) If we come from a context where the UserID is known, we can do a scan
easily for all the user activities with a startRowKey and stopRowKey.
(2) If we come from an external networked table where the row key of this
user activity table is stored and can be retrieved as activityRowKey, then
we can use the following code to parse out the UserID and do the same scan
as in (1):

    String activityRowKeyStr = Bytes.toString(activityRowKey);
    // The UserID is everything before the first '^' separator.
    String userId =
        activityRowKeyStr.substring(0, activityRowKeyStr.indexOf("^"));

Then I can set startRowKey and stopRowKey for the scan based on userId.
Here we get the benefit of having the UserID as part of the row key with the
separator (compared with an alternative that stores the UserID in one of
the columns of the user activity table).
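
For use case (1), the start and stop keys follow directly from the separator. A minimal sketch, assuming UTF-8 encoded keys and HBase's usual unsigned byte-wise row ordering; the class and method names (UserScanBounds, startRow, stopRow, userIdOf, compare) are made up for illustration and are not part of HBase:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for the "UserID^TimeStamp" layout; not an HBase class.
final class UserScanBounds {
    static final byte SEPARATOR = '^'; // 0x5E

    // Inclusive start key: every row of this user begins with "userId^".
    static byte[] startRow(String userId) {
        byte[] id = userId.getBytes(StandardCharsets.UTF_8);
        byte[] key = Arrays.copyOf(id, id.length + 1);
        key[id.length] = SEPARATOR;
        return key;
    }

    // Exclusive stop key: the same prefix with the separator byte
    // incremented by one ('^' 0x5E becomes '_' 0x5F).
    static byte[] stopRow(String userId) {
        byte[] key = startRow(userId);
        key[key.length - 1]++;
        return key;
    }

    // Use case (2): recover the UserID from a stored activity row key.
    static String userIdOf(byte[] rowKey) {
        String s = new String(rowKey, StandardCharsets.UTF_8);
        int sep = s.indexOf('^');
        if (sep < 0) {
            throw new IllegalArgumentException("no separator in row key");
        }
        return s.substring(0, sep);
    }

    // Unsigned lexicographic comparison, the order in which rows are sorted.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }
}
```

The two arrays would then be passed to the scan as its start row (inclusive) and stop row (exclusive). Note that a key such as "ab1^..." sorts before "ab^..." because '1' (0x31) is smaller than '^' (0x5E); the range for user "ab" still excludes it.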

The reason I put a separator after the UserID is that we may not always get
a fixed-length UserID string. At one point I thought of hashing the UserID
with MD5 to make it fixed length; however, the possibility of collisions
and the overhead of applying the hash function made me pick the separator
"^" instead.

My questions:
(1) My argument is that using a separator is better than using an MD5 hash
value. Does that seem reasonable? Could you comment on other pros and cons
that I might have missed (as the basis for my argument)?

(2) When using a separator/delimiter, besides the requirement that it
shouldn't appear elsewhere in the rowkey, are there any other
requirements? Are any particular separators/delimiters better or worse
than others?

thanks!

Jason

Re: Using separator/delimiter in HBase rowkey?

Posted by Shahab Yunus <sh...@gmail.com>.

I'm not saying this is a solution or better in any way, just more food for
thought. Is there a maximum size limit for UserIDs? If so, you could also
pad the shorter UserIDs to that length. You use more space this way, though.
It can help with sorting as well.
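
The padding idea could be sketched as follows. This is only an illustration under stated assumptions: a maximum UserID width of 16 bytes, NUL (0x00) bytes as right-padding (so UserIDs must not contain NUL), and an 8-byte big-endian timestamp. The class name PaddedRowKey and the width are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical fixed-width layout: [UserID padded to 16 bytes][8-byte timestamp].
final class PaddedRowKey {
    static final int USER_ID_WIDTH = 16; // assumed maximum UserID length in bytes

    static byte[] build(String userId, long timestampMillis) {
        byte[] id = userId.getBytes(StandardCharsets.UTF_8);
        if (id.length > USER_ID_WIDTH) {
            throw new IllegalArgumentException(
                "UserID longer than " + USER_ID_WIDTH + " bytes");
        }
        ByteBuffer buf = ByteBuffer.allocate(USER_ID_WIDTH + Long.BYTES);
        buf.put(id);                  // the rest of the field stays 0x00
        buf.position(USER_ID_WIDTH);  // jump to the timestamp field
        buf.putLong(timestampMillis); // big-endian by default
        return buf.array();
    }

    // The UserID field always occupies the first USER_ID_WIDTH bytes,
    // so no separator search is needed; trailing padding is stripped.
    static String userIdOf(byte[] rowKey) {
        int end = USER_ID_WIDTH;
        while (end > 0 && rowKey[end - 1] == 0) {
            end--;
        }
        return new String(rowKey, 0, end, StandardCharsets.UTF_8);
    }

    static long timestampOf(byte[] rowKey) {
        return ByteBuffer.wrap(rowKey, USER_ID_WIDTH, Long.BYTES).getLong();
    }
}
```

With this layout every key has the same length (24 bytes here), at the cost of the extra padding bytes stored with each row.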

Regards,
Shahab



Re: Using separator/delimiter in HBase rowkey?

Posted by Ted Yu <yu...@gmail.com>.

In 0.94, we have src/main/java/org/apache/hadoop/hbase/util/MurmurHash.java

For hadoop 1, there is src/core/org/apache/hadoop/util/hash/MurmurHash.java

Cheers

On Mon, Jul 8, 2013 at 8:29 AM, Michael Segel <mi...@hotmail.com> wrote:

> Is murmur part of the standard java libraries?
>
> If not, you end up having to do a bit more maintenance of your cluster and
> that's going to be part of your tradeoff.
>
> On Jul 8, 2013, at 10:14 AM, Mike Axiak <mi...@axiak.net> wrote:
>
> > Hello Jason,
> >
> > Have you considered the following rowkey?
> >
> >  murmur_128(userId) + timestamp + userId ?
> >
> > This handles both of your cases as (1) murmur 128 is much faster than
> > md5 so will have very low overhead and (2) the userid at the end of
> > the key will ensure that no murmur collisions will cause issues. This
> > key also handle incrementing userIds well because close userIds will
> > likely be in separate regions.
> >
> > Cheers,
> > Mike
> >
> >
>
>

Re: Using separator/delimiter in HBase rowkey?

Posted by Jason Huang <ja...@icare.com>.

thanks for all these valuable comments.

Jason

On Mon, Jul 8, 2013 at 12:25 PM, Michael Segel <mi...@hotmail.com> wrote:

> Where is murmur?
>
> In your app?
> So then every app that wants to fetch that row must now use murmur.
>
> Added to Hadoop/HBase?
> Then when you do upgrades you have to make sure that the package is still
> in your class path. Note that different vendor's release management will
> mean YMMV as to what happens to your class paths and set up or if the jar
> gets blown out of the directory.
>
>
> Is the added cost in maintenance worth it?
>
> I don't know but I seriously doubt it.
>
>
> On Jul 8, 2013, at 11:00 AM, Mike Axiak <mi...@axiak.net> wrote:
>
> > I just don't understand.. every creation/interpretation of the key is
> > going to require code to be used. The murmur implementation is with
> > that code. How is there any extra burden?
> >
> > On Mon, Jul 8, 2013 at 11:54 AM, Michael Segel
> > <mi...@hotmail.com> wrote:
> >> You will need to put the jar into either every app that runs, or you
> will need to put it on every node. Every upgrade, you will need to make
> sure its still in your class path.
> >>
> >> More work for the admins. So how much faster is it over MD5? MD5 and
> SHA-1 are part of the Java libraries that ship w Sun/Oracle so you have
> them already installed and in your class path.
> >>
> >> Just saying... ;-)
> >>
> >> On Jul 8, 2013, at 10:36 AM, Mike Axiak <mi...@axiak.net> wrote:
> >>
> >>> On Mon, Jul 8, 2013 at 11:29 AM, Michael Segel
> >>> <mi...@hotmail.com> wrote:
> >>>> If not, you end up having to do a bit more maintenance of your
> cluster and that's going to be part of your tradeoff.
> >>>
> >>> How so?
> >>>
> >>> -Mike
> >>>
> >>
> >
>
>

Re: Using separator/delimiter in HBase rowkey?

Posted by Michael Segel <mi...@hotmail.com>.

Where is murmur? 

In your app? 
So then every app that wants to fetch that row must now use murmur. 

Added to Hadoop/HBase? 
Then when you do upgrades you have to make sure that the package is still in your class path. Note that different vendors' release management will mean YMMV as to what happens to your class paths and setup, or whether the jar gets blown out of the directory.

Is the added cost in maintenance worth it? 

I don't know but I seriously doubt it. 

On Jul 8, 2013, at 11:00 AM, Mike Axiak <mi...@axiak.net> wrote:

> I just don't understand.. every creation/interpretation of the key is
> going to require code to be used. The murmur implementation is with
> that code. How is there any extra burden?
> 
> On Mon, Jul 8, 2013 at 11:54 AM, Michael Segel
> <mi...@hotmail.com> wrote:
>> You will need to put the jar into either every app that runs, or you will need to put it on every node. Every upgrade, you will need to make sure its still in your class path.
>> 
>> More work for the admins. So how much faster is it over MD5? MD5 and SHA-1 are part of the Java libraries that ship w Sun/Oracle so you have them already installed and in your class path.
>> 
>> Just saying... ;-)
>> 
>> On Jul 8, 2013, at 10:36 AM, Mike Axiak <mi...@axiak.net> wrote:
>> 
>>> On Mon, Jul 8, 2013 at 11:29 AM, Michael Segel
>>> <mi...@hotmail.com> wrote:
>>>> If not, you end up having to do a bit more maintenance of your cluster and that's going to be part of your tradeoff.
>>> 
>>> How so?
>>> 
>>> -Mike
>>> 
>> 
>

Re: Using separator/delimiter in HBase rowkey?

Posted by Mike Axiak <mi...@axiak.net>.

I just don't understand.. every creation/interpretation of the key is
going to require code to be used. The murmur implementation is with
that code. How is there any extra burden?

On Mon, Jul 8, 2013 at 11:54 AM, Michael Segel
<mi...@hotmail.com> wrote:
> You will need to put the jar into either every app that runs, or you will need to put it on every node. Every upgrade, you will need to make sure its still in your class path.
>
> More work for the admins. So how much faster is it over MD5? MD5 and SHA-1 are part of the Java libraries that ship w Sun/Oracle so you have them already installed and in your class path.
>
> Just saying... ;-)
>
> On Jul 8, 2013, at 10:36 AM, Mike Axiak <mi...@axiak.net> wrote:
>
>> On Mon, Jul 8, 2013 at 11:29 AM, Michael Segel
>> <mi...@hotmail.com> wrote:
>>> If not, you end up having to do a bit more maintenance of your cluster and that's going to be part of your tradeoff.
>>
>> How so?
>>
>> -Mike
>>
>

Re: Using separator/delimiter in HBase rowkey?

Posted by Ted Yu <yu...@gmail.com>.

http://jsperf.com/murmur3-performance might be related.

On Mon, Jul 8, 2013 at 8:54 AM, Michael Segel <mi...@hotmail.com> wrote:

> You will need to put the jar into either every app that runs, or you will
> need to put it on every node. Every upgrade, you will need to make sure its
> still in your class path.
>
> More work for the admins. So how much faster is it over MD5? MD5 and SHA-1
> are part of the Java libraries that ship w Sun/Oracle so you have them
> already installed and in your class path.
>
> Just saying... ;-)
>
> On Jul 8, 2013, at 10:36 AM, Mike Axiak <mi...@axiak.net> wrote:
>
> > On Mon, Jul 8, 2013 at 11:29 AM, Michael Segel
> > <mi...@hotmail.com> wrote:
> >> If not, you end up having to do a bit more maintenance of your cluster
> and that's going to be part of your tradeoff.
> >
> > How so?
> >
> > -Mike
> >
>
>

Re: Using separator/delimiter in HBase rowkey?

Posted by Michael Segel <mi...@hotmail.com>.

You will need to put the jar into either every app that runs, or you will need to put it on every node. Every upgrade, you will need to make sure it's still in your class path.

More work for the admins. So how much faster is it over MD5? MD5 and SHA-1 are part of the Java libraries that ship with Sun/Oracle, so you have them already installed and in your class path.

Just saying... ;-) 

On Jul 8, 2013, at 10:36 AM, Mike Axiak <mi...@axiak.net> wrote:

> On Mon, Jul 8, 2013 at 11:29 AM, Michael Segel
> <mi...@hotmail.com> wrote:
>> If not, you end up having to do a bit more maintenance of your cluster and that's going to be part of your tradeoff.
> 
> How so?
> 
> -Mike
>

Re: Using separator/delimiter in HBase rowkey?

Posted by Mike Axiak <mi...@axiak.net>.

On Mon, Jul 8, 2013 at 11:29 AM, Michael Segel
<mi...@hotmail.com> wrote:
> If not, you end up having to do a bit more maintenance of your cluster and that's going to be part of your tradeoff.

How so?

-Mike

Re: Using separator/delimiter in HBase rowkey?

Posted by Michael Segel <mi...@hotmail.com>.

Is murmur part of the standard java libraries? 

If not, you end up having to do a bit more maintenance of your cluster and that's going to be part of your tradeoff. 

On Jul 8, 2013, at 10:14 AM, Mike Axiak <mi...@axiak.net> wrote:

> Hello Jason,
> 
> Have you considered the following rowkey?
> 
>  murmur_128(userId) + timestamp + userId ?
> 
> This handles both of your cases as (1) murmur 128 is much faster than
> md5 so will have very low overhead and (2) the userid at the end of
> the key will ensure that no murmur collisions will cause issues. This
> key also handle incrementing userIds well because close userIds will
> likely be in separate regions.
> 
> Cheers,
> Mike
> 
>

Re: Using separator/delimiter in HBase rowkey?

Posted by Mike Axiak <mi...@axiak.net>.

Hello Jason,

Have you considered the following rowkey?

  murmur_128(userId) + timestamp + userId ?

This handles both of your cases because (1) murmur 128 is much faster than
md5, so it will have very low overhead, and (2) the userId at the end of
the key ensures that murmur collisions cannot cause issues. This key also
handles incrementing userIds well, because adjacent userIds will likely
land in separate regions.
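
The layout above might be assembled like this. It is a sketch, not the proposed implementation: it swaps in MD5 from the JDK's java.security.MessageDigest as a stand-in for murmur 128 (both yield 16 bytes), and assumes an 8-byte big-endian timestamp and a UTF-8 userId. The class HashedRowKey and its methods are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical layout: [16-byte hash(userId)][8-byte timestamp][userId].
final class HashedRowKey {
    static final int HASH_LEN = 16;

    // MD5 used here only as a stand-in for a 128-bit murmur hash.
    static byte[] hash(String userId) {
        try {
            return MessageDigest.getInstance("MD5")
                .digest(userId.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    static byte[] build(String userId, long timestampMillis) {
        byte[] id = userId.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(HASH_LEN + Long.BYTES + id.length)
            .put(hash(userId))        // fixed-length scan prefix
            .putLong(timestampMillis) // big-endian by default
            .put(id)                  // disambiguates hash collisions
            .array();
    }

    // The hash and timestamp have fixed widths, so the userId is the rest.
    static String userIdOf(byte[] rowKey) {
        int offset = HASH_LEN + Long.BYTES;
        return new String(rowKey, offset, rowKey.length - offset,
            StandardCharsets.UTF_8);
    }

    // A prefix scan on hash(userId) could also return rows of a colliding
    // user; checking the trailing userId filters those out.
    static boolean belongsTo(byte[] rowKey, String userId) {
        byte[] prefix = Arrays.copyOf(rowKey, HASH_LEN);
        return Arrays.equals(prefix, hash(userId))
            && userIdOf(rowKey).equals(userId);
    }
}
```

All rows of one user share the 16-byte hash prefix, so a scan for that user can start at that prefix, and the userId can still be read back from any stored key without a separator.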

Cheers,
Mike
