Posted to user@hbase.apache.org by Fernando Padilla <fe...@alum.mit.edu> on 2009/07/27 21:01:46 UTC
key hashing?
So I will be generating lots of rows into the db keyed by userId, in
userId order.
I have already learned through this mailing list that this use case is
not ideal, since it would mean most row inserts land on one region
server. I know some people suggest adding randomization to the keys so
that writes are spread around, but I can't do that, since I'll need to
do random-access lookups on the rows via userId.
But I'm wondering if I could map/hash the real userId into another
number that will spread the inserts around, while still being able to do
random-access lookups for a given userId by recalculating the hash.
1) I think I like this idea -- does anyone have experience with it?
2) Assuming userId is an 8-byte long, what would be some good hashing
functions? I would be lazy and use little-endian, but I bet one of you
could come up with something better. :)
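[One way to sketch this in Java -- my own illustration, not something from the thread. The mix function below is the 64-bit finalizer from MurmurHash3; because each step (xor-shift, multiply by an odd constant) is invertible, the mapping is bijective, so two distinct userIds can never produce the same row key, and a lookup just recomputes the key from the userId:]

```java
import java.nio.ByteBuffer;

public class KeyHash {
    // MurmurHash3 64-bit finalizer: cheap, well distributed, and bijective,
    // so distinct userIds never collide.
    static long mix(long k) {
        k ^= k >>> 33;
        k *= 0xff51afd7ed558ccdL;
        k ^= k >>> 33;
        k *= 0xc4ceb9fe1a85ec53L;
        k ^= k >>> 33;
        return k;
    }

    // Row key: the 8-byte big-endian encoding of the mixed value.
    // Deterministic, so random-access reads by userId just recompute it.
    static byte[] rowKey(long userId) {
        return ByteBuffer.allocate(8).putLong(mix(userId)).array();
    }

    public static void main(String[] args) {
        // Sequential ids map to widely scattered keys.
        for (long id : new long[] {1L, 2L, 3L}) {
            System.out.printf("%d -> %016x%n", id, mix(id));
        }
    }
}
```

[The trade-off, of course, is that range scans in userId order are gone; only point lookups survive the transformation.]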
Re: key hashing?
Posted by mike anderson <sa...@gmail.com>.
Not to take the thread off topic, but do you have any links to information
about importing directly into hfiles?
Thanks,
Mike
On Mon, Jul 27, 2009 at 3:08 PM, Ryan Rawson <ry...@gmail.com> wrote:
> Hi,
>
> You have to consider the difference between a bulk one time import and
> a continuous row insertion process. Often the former needs to achieve
> extremely high insert rates (150kops/sec + ) to import a large
> multi-100million data set in any reasonable time frame. But the
> latter tends to be fairly slow, unless you are planning on adding
> users faster than 20,000 a second, you probably don't need to hash
> userids.
>
> It should be possible to randomly insert data from a pre-existing data
> set. There is some work to directly import straight into hfiles and
> skipping the regionserver, but that would only really work on 1 time
> imports to new tables.
>
>
> On Mon, Jul 27, 2009 at 12:01 PM, Fernando Padilla<fe...@alum.mit.edu> wrote:
> > [snip]
Re: key hashing?
Posted by Ryan Rawson <ry...@gmail.com>.
Hi,
You have to consider the difference between a bulk one-time import and
a continuous row-insertion process. The former often needs to achieve
extremely high insert rates (150k ops/sec or more) to import a large,
multi-hundred-million-row data set in any reasonable time frame. The
latter tends to be fairly slow: unless you are planning on adding
users faster than 20,000 a second, you probably don't need to hash
userids.
It should be possible to randomly insert data from a pre-existing data
set. There is some work on importing straight into hfiles, skipping
the regionserver, but that would only really work for one-time imports
to new tables.
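[To put those rates in perspective -- my own back-of-the-envelope arithmetic, using a hypothetical 300-million-row data set alongside the 150k and 20k ops/sec figures from Ryan's post:]

```java
public class ImportMath {
    // Seconds needed to write `rows` rows at `opsPerSec` inserts per second.
    static long seconds(long rows, long opsPerSec) {
        return rows / opsPerSec;
    }

    public static void main(String[] args) {
        long rows = 300_000_000L;   // hypothetical bulk data-set size
        long bulkRate = 150_000L;   // bulk-import rate cited in the thread
        long steadyRate = 20_000L;  // continuous-insert rate cited in the thread
        System.out.println("bulk import:       " + seconds(rows, bulkRate) + " s");
        System.out.println("continuous insert: " + seconds(rows, steadyRate) + " s");
    }
}
```

[At 150k ops/sec the bulk import finishes in about 33 minutes; at 20k ops/sec the same volume takes over four hours, which is why the two workloads call for different strategies.]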
On Mon, Jul 27, 2009 at 12:01 PM, Fernando Padilla<fe...@alum.mit.edu> wrote:
> [snip]