Posted to user@hbase.apache.org by Weishung Chung <we...@gmail.com> on 2011/01/10 16:33:49 UTC

how to randomize the primary key which is a timestamp

What is a good way to randomize a primary key that is a timestamp in
HBase, to avoid hotspotting?
Thank you so much :)

Re: how to randomize the primary key which is a timestamp

Posted by Tost <nc...@gmail.com>.

How about the SecureRandom class?

You can generate the key from a seed.

See
http://download.oracle.com/javase/6/docs/api/java/security/SecureRandom.html

2011/1/11 Weishung Chung <we...@gmail.com>

> Thanks a lot, this will get me started :D

Re: how to randomize the primary key which is a timestamp

Posted by Weishung Chung <we...@gmail.com>.

Thanks a lot, this will get me started :D

On Mon, Jan 10, 2011 at 11:04 AM, Matt Corgan <mc...@hotpads.com> wrote:

> You could have prefix = timestamp % 64.  Then for a single key lookup, you
> could calculate the prefix and query just one shard.  For a scan, you have
> to query all shards and merge the results.

Re: how to randomize the primary key which is a timestamp

Posted by Matt Corgan <mc...@hotpads.com>.

You could have prefix = timestamp % 64.  Then for a single key lookup, you
could calculate the prefix and query just one shard.  For a scan, you have
to query all shards and merge the results.
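As a rough sketch of why no prefix lookup table is needed: the prefix is recomputed from the timestamp itself, so the writer and the reader derive the same row key. The key layout (one prefix byte followed by an 8-byte big-endian timestamp) and the names NUM_SHARDS and make_row_key are assumptions for illustration, not anything HBase prescribes.

```python
import struct

NUM_SHARDS = 64  # matches the "% 64" above; an illustrative choice


def make_row_key(timestamp_ms: int) -> bytes:
    """Build <1-byte shard prefix><8-byte big-endian timestamp> (assumed layout)."""
    prefix = timestamp_ms % NUM_SHARDS
    return struct.pack(">BQ", prefix, timestamp_ms)


# The writer and the reader call the same function, so a single-key lookup
# only has to touch the one shard that the timestamp maps to.
write_key = make_row_key(1294673629000)
lookup_key = make_row_key(1294673629000)
```

The resulting bytes would be passed as the row key of a put or get against the real table.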


On Mon, Jan 10, 2011 at 11:56 AM, Weishung Chung <we...@gmail.com> wrote:

> Thank you for your prompt response. I am a bit confused about the prefix.
> If I were to use a prefix for the timestamp key, how should I specify the
> row key to search for at query time? How do I know which prefix was used
> for a certain timestamp and needs to be prepended to the timestamp for
> querying?

Re: how to randomize the primary key which is a timestamp

Posted by Weishung Chung <we...@gmail.com>.

Thank you for your prompt response. I am a bit confused about the prefix.
If I were to use a prefix for the timestamp key, how should I specify the
row key to search for at query time? How do I know which prefix was used
for a certain timestamp and needs to be prepended to the timestamp for
querying?

On Mon, Jan 10, 2011 at 10:41 AM, Matt Corgan <mc...@hotpads.com> wrote:

> You can put them all in the same table.  If you prefix the keys when
> written, use a prefix filter when querying.  I would choose a prefix window
> that's about 4 times the number of nodes.

Re: how to randomize the primary key which is a timestamp

Posted by Matt Corgan <mc...@hotpads.com>.

You can put them all in the same table.  If you prefix the keys when
written, use a prefix filter when querying.  I would choose a prefix window
that's about 4 times the number of nodes.


On Mon, Jan 10, 2011 at 11:30 AM, Ted Dunning <td...@maprtech.com> wrote:

> If multiple tables have the same key distribution and count, then they will
> have similar split points for their regions, but the locations of the
> regions will be randomized.
>
> I wouldn't worry about this until you see evidence it is a problem.

Re: how to randomize the primary key which is a timestamp

Posted by Ted Dunning <td...@maprtech.com>.

If multiple tables have the same key distribution and count, then they will
have similar split points for their regions, but the locations of the
regions will be randomized.

I wouldn't worry about this until you see evidence it is a problem.

On Mon, Jan 10, 2011 at 8:20 AM, Weishung Chung <we...@gmail.com> wrote:

> Thank you for the replies.
> Most of the queries (70%) will scan a range of consecutive times, with
> some single-timestamp queries (30%).
> But there are multiple tables with the same range of timestamps. Will the
> same ranges of timestamps from multiple tables all be stored on the same
> region server, and if so, could that affect the performance of MapReduce
> jobs (operating on those tables over the same time periods)? Would
> hotspotting defeat the purpose of MapReduce?

Re: how to randomize the primary key which is a timestamp

Posted by Weishung Chung <we...@gmail.com>.

Thank you for the replies.
Most of the queries (70%) will scan a range of consecutive times, with
some single-timestamp queries (30%).
But there are multiple tables with the same range of timestamps. Will the
same ranges of timestamps from multiple tables all be stored on the same
region server, and if so, could that affect the performance of MapReduce
jobs (operating on those tables over the same time periods)? Would
hotspotting defeat the purpose of MapReduce?

On Mon, Jan 10, 2011 at 10:08 AM, Matt Corgan <mc...@hotpads.com> wrote:

> You can also add a random (or hashed) prefix to the beginning of the key.
>  If your prefix were one byte with values 0-63, you've divided the hot spot
> into 64 smaller ones, which is better for writing.  The downside is that if
> you want to read a range of values, you will have to query all 64 "shards"
> and merge the sorted values.  You can choose whatever prefix size is best
> for your scenario.

Re: how to randomize the primary key which is a timestamp

Posted by Matt Corgan <mc...@hotpads.com>.

You can also add a random (or hashed) prefix to the beginning of the key.
If your prefix were one byte with values 0-63, you would divide the hot spot
into 64 smaller ones, which is better for writing.  The downside is that if
you want to read a range of values, you will have to query all 64 "shards"
and merge the sorted values.  You can choose whatever prefix size is best
for your scenario.
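A minimal sketch of the write and read sides described above, using an in-memory sorted list as a stand-in for the table. The helper names (shard_of, salted_key, scan_range), the SHA-256 hash, and the key layout are all assumptions made for illustration; a real client would issue one ranged scan per shard against HBase instead of filtering a list.

```python
import hashlib
import heapq
import struct

NUM_SHARDS = 64  # one prefix byte with values 0-63


def shard_of(timestamp_ms: int) -> int:
    """Hash the timestamp into one of NUM_SHARDS buckets (hypothetical helper)."""
    digest = hashlib.sha256(struct.pack(">Q", timestamp_ms)).digest()
    return digest[0] % NUM_SHARDS


def salted_key(timestamp_ms: int) -> bytes:
    """Row key = <shard byte><8-byte big-endian timestamp> (assumed layout)."""
    return struct.pack(">BQ", shard_of(timestamp_ms), timestamp_ms)


def scan_range(sorted_keys, start_ms, stop_ms):
    """Return timestamps in [start_ms, stop_ms) by scanning every shard and merging."""
    per_shard = []
    for shard in range(NUM_SHARDS):
        lo = struct.pack(">BQ", shard, start_ms)
        hi = struct.pack(">BQ", shard, stop_ms)
        # Stand-in for one ranged scan per shard against the real table.
        rows = [k for k in sorted_keys if lo <= k < hi]
        per_shard.append([struct.unpack(">BQ", k)[1] for k in rows])
    # Each shard's rows are already sorted by timestamp, so a k-way merge
    # restores the global time order.
    return list(heapq.merge(*per_shard))


# Simulated table: keys are stored in byte order, as HBase stores rows.
table = sorted(salted_key(t) for t in range(1000, 1100))
rows_in_range = scan_range(table, 1010, 1020)
```

Writes for neighbouring timestamps land in different shards, while a range read costs NUM_SHARDS scans plus a merge, which is the trade-off described above.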

On Mon, Jan 10, 2011 at 11:05 AM, Chirstopher Tarnas <cf...@email.com> wrote:

> Some options that I am aware of:
>
> reverse the byte order of the timestamp
> use UUIDs rather than a timestamp
> use hashing; whether this works really depends on your requirements

Re: how to randomize the primary key which is a timestamp

Posted by Chirstopher Tarnas <cf...@email.com>.

Some options that I am aware of:

reverse the byte order of the timestamp
use UUIDs rather than a timestamp
use hashing; whether this works really depends on your requirements
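The first option can be sketched as follows, assuming the timestamp is a non-negative integer that fits in 8 bytes; the function names are hypothetical. Reversing the bytes puts the fastest-changing byte first, so consecutive timestamps sort far apart, and an exact lookup still works because the transformation is reversible.

```python
import struct


def reversed_key(timestamp_ms: int) -> bytes:
    """Pack the timestamp as 8 big-endian bytes, then reverse the byte order."""
    return struct.pack(">Q", timestamp_ms)[::-1]


def timestamp_from_key(key: bytes) -> int:
    """Undo the reversal to recover the original timestamp."""
    return struct.unpack(">Q", key[::-1])[0]


# Two consecutive timestamps now differ in their leading byte, so they no
# longer sort next to each other.
first = reversed_key(1294673629000)
second = reversed_key(1294673629001)
```

The cost is that rows are no longer stored in time order, so a range scan over consecutive timestamps cannot be served by a single contiguous scan.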

On Mon, Jan 10, 2011 at 9:33 AM, Weishung Chung <we...@gmail.com> wrote:

> What is a good way to randomize a primary key that is a timestamp in
> HBase, to avoid hotspotting?
> Thank you so much :)
>

Re: how to randomize the primary key which is a timestamp

Posted by Friso van Vollenhoven <fv...@xebia.com>.

Once the data is stored, how do you plan on querying it? If you want to scan for certain periods of time, having the order of timestamps randomized is not ideal.

If you are planning to do only exact lookups for individual timestamps (which might be the case), I guess you can reverse the byte order of the timestamp given that the granularity of the times is fine enough.

Friso

On 10 jan 2011, at 16:33, Weishung Chung wrote:

> What is a good way to randomize a primary key that is a timestamp in
> HBase, to avoid hotspotting?
> Thank you so much :)