Posted to user@hbase.apache.org by Fernando Padilla <fe...@alum.mit.edu> on 2009/07/27 21:01:46 UTC

key hashing?

So I will be generating lots of rows into the db keyed by userId, in 
userId order.

I have already learned through this mailing list that this use case is
not ideal, since it would mean most row inserts land on one region
server.  I know that some people suggest adding some randomization to
the keys so that the load is spread around, but I can't do that, since
I'll need to be able to do random-access lookups on the rows via userId.


But I'm wondering if I could map/hash the real userId into another
number that would spread the inserts around.  I could still do
random-access lookups given a real userId by recomputing the hash.
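
(Illustrative sketch, not from the thread: one way to do this is to
make the row key a pure function of the userId, say one hash-derived
salt byte followed by the big-endian userId.  Sequential userIds then
scatter across the keyspace, but a point lookup can always recompute
the same key.  The helper below is hypothetical.)

    import java.security.MessageDigest;

    public class HashedKeys {

        // Hypothetical helper: one salt byte derived from the id, then
        // the big-endian 8-byte userId.  The same userId always yields
        // the same key, so random-access lookups still work; full-table
        // scans, however, are no longer in userId order.
        static byte[] rowKey(long userId) throws Exception {
            byte[] id = new byte[8];
            for (int i = 0; i < 8; i++) {        // big-endian encoding
                id[i] = (byte) (userId >>> (8 * (7 - i)));
            }
            byte[] md5 = MessageDigest.getInstance("MD5").digest(id);
            byte[] key = new byte[9];
            key[0] = md5[0];                     // salt byte spreads inserts
            System.arraycopy(id, 0, key, 1, 8);
            return key;
        }
    }

One salt byte gives at most 256 distinct prefixes, which is plenty to
spread load across a handful of region servers; the trade-off is that
range scans over contiguous userIds are lost.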



1) I think I like this idea. Does anyone have any experience with this?

2) Assuming userId is an 8-byte long, what would be some good hashing
functions?  I would be lazy and just use little-endian byte order, but I
bet one of you could come up with something better. :)
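
(Two cheap candidates for an 8-byte long, as illustrative sketches
only; the little-endian idea above is essentially Long.reverseBytes.)

    // (a) Bit reversal: sequential ids differ in their low bits, which
    // become the high bits of the key, so neighbors scatter widely.
    // (Long.reverseBytes is the byte-level, little-endian variant.)
    long spreadA = Long.reverse(userId);

    // (b) Multiplicative (Fibonacci) hashing: multiply by a large odd
    // constant.  Odd multipliers are invertible mod 2^64, so the
    // mapping stays one-to-one and the id is recoverable if needed.
    long spreadB = userId * 0x9E3779B97F4A7C15L;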


Re: key hashing?

Posted by mike anderson <sa...@gmail.com>.
Not to take the thread off topic, but do you have any links to information
about importing directly into hfiles?

Thanks,
Mike



On Mon, Jul 27, 2009 at 3:08 PM, Ryan Rawson <ry...@gmail.com> wrote:

> Hi,
>
> You have to consider the difference between a bulk one-time import
> and a continuous row-insertion process.  The former often needs to
> achieve extremely high insert rates (150k ops/sec or more) to import
> a data set of several hundred million rows in any reasonable time
> frame.  But the latter tends to be fairly slow; unless you are
> planning on adding users faster than 20,000 a second, you probably
> don't need to hash userIds.
>
> It should be possible to randomly insert data from a pre-existing
> data set.  There is some work underway to import straight into
> HFiles, skipping the region server, but that would only really work
> for one-time imports into new tables.
>
>
> On Mon, Jul 27, 2009 at 12:01 PM, Fernando Padilla <fe...@alum.mit.edu> wrote:
> > So I will be generating lots of rows into the db keyed by userId, in
> > userId order.
> >
> > I have already learned through this mailing list that this use case
> > is not ideal, since it would mean most row inserts land on one region
> > server.  I know that some people suggest adding some randomization to
> > the keys so that the load is spread around, but I can't do that,
> > since I'll need to be able to do random-access lookups on the rows
> > via userId.
> >
> >
> > But I'm wondering if I could map/hash the real userId into another
> > number that would spread the inserts around.  I could still do
> > random-access lookups given a real userId by recomputing the hash.
> >
> >
> >
> > 1) I think I like this idea. Does anyone have any experience with this?
> >
> > 2) Assuming userId is an 8-byte long, what would be some good hashing
> > functions?  I would be lazy and just use little-endian byte order,
> > but I bet one of you could come up with something better. :)
> >
> >
>

Re: key hashing?

Posted by Ryan Rawson <ry...@gmail.com>.
Hi,

You have to consider the difference between a bulk one-time import and
a continuous row-insertion process.  The former often needs to achieve
extremely high insert rates (150k ops/sec or more) to import a data set
of several hundred million rows in any reasonable time frame.  But the
latter tends to be fairly slow; unless you are planning on adding users
faster than 20,000 a second, you probably don't need to hash userIds.

It should be possible to randomly insert data from a pre-existing data
set.  There is some work underway to import straight into HFiles,
skipping the region server, but that would only really work for
one-time imports into new tables.
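
(Illustrative sketch, not from this thread: in later HBase releases,
0.90 and onward, the HFile bulk-load path looks roughly like the driver
below.  The table name, paths, column family, and mapper are
hypothetical; HFileOutputFormat.configureIncrementalLoad wires the
job's partitioner and reducer to match the table's current regions.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class BulkImportDriver {

        // Hypothetical mapper: one "userId<TAB>value" text line in,
        // one Put out, keyed by the row so the job can sort/split it.
        static class UserMapper
                extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
            protected void map(LongWritable off, Text line, Context ctx)
                    throws java.io.IOException, InterruptedException {
                String[] f = line.toString().split("\t");
                byte[] row = Bytes.toBytes(Long.parseLong(f[0]));
                Put put = new Put(row);
                put.add(Bytes.toBytes("info"), Bytes.toBytes("v"),
                        Bytes.toBytes(f[1]));
                ctx.write(new ImmutableBytesWritable(row), put);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = new Job(conf, "bulk-import-users");
            job.setJarByClass(BulkImportDriver.class);
            job.setMapperClass(UserMapper.class);
            job.setMapOutputKeyClass(ImmutableBytesWritable.class);
            job.setMapOutputValueClass(Put.class);

            // Writes sorted HFiles per region instead of sending Puts
            // through the regionservers.
            HFileOutputFormat.configureIncrementalLoad(
                    job, new HTable(conf, "users"));

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            job.waitForCompletion(true);
            // Then hand the HFiles to the cluster, e.g.:
            //   hadoop jar hbase-VERSION.jar completebulkload <outdir> users
        }
    }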


On Mon, Jul 27, 2009 at 12:01 PM, Fernando Padilla <fe...@alum.mit.edu> wrote:
> So I will be generating lots of rows into the db keyed by userId, in
> userId order.
>
> I have already learned through this mailing list that this use case is
> not ideal, since it would mean most row inserts land on one region
> server.  I know that some people suggest adding some randomization to
> the keys so that the load is spread around, but I can't do that, since
> I'll need to be able to do random-access lookups on the rows via userId.
>
>
> But I'm wondering if I could map/hash the real userId into another
> number that would spread the inserts around.  I could still do
> random-access lookups given a real userId by recomputing the hash.
>
>
>
> 1) I think I like this idea. Does anyone have any experience with this?
>
> 2) Assuming userId is an 8-byte long, what would be some good hashing
> functions?  I would be lazy and just use little-endian byte order, but
> I bet one of you could come up with something better. :)
>
>