Posted to user@hbase.apache.org by "Gangl, Michael E (388K)" <Mi...@jpl.nasa.gov> on 2010/11/12 23:44:17 UTC

Randomize key insert

It appears that I'm falling victim to sequential data writes in HBase. I have a MapReduce job which takes values like this (from disk):

Year  doy  time   value   x  y
2001  001  00:00  -44.22  1  1
2001  001  08:00  -44.22  1  1
2001  001  16:00  -44.22  1  2
2001  002  00:00  -44.22  1  1

(It's more complicated than this, but this is essentially what we're doing.)

And generating values like the following:

<key>            <value>
0010012001001    -44.45
0010012001002    -43.45
0010012001003    -42.45
0010012001004    -41.45

And so on, where the first three digits are an X coordinate, the next three are a Y coordinate, and the last N digits are a time frame; in the case above it's the YEAR/DAY-OF-YEAR combo.

So we're reducing multiple time-of-day measurements of a bunch of data down to a single value per day of year.
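
For concreteness, here's a minimal sketch of how a key like this gets built. The pad widths (3/3/4/3) are just what the examples above imply, not necessarily our real layout:

public class RowKeys {
  // X (3 digits) + Y (3 digits) + year (4 digits) + day-of-year (3 digits)
  static String composite(int x, int y, int year, int doy) {
    return String.format("%03d%03d%04d%03d", x, y, year, doy);
  }

  public static void main(String[] args) {
    System.out.println(composite(1, 1, 2001, 1));  // -> 0010012001001
  }
}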

When writing to 5 HBase region servers, I notice that at any given time a single region is getting all the requests. This leads me to think I'm writing the keys in sequential order, so only one region can be written to at a time, creating a large bottleneck on writes.

Is there any way to randomize the keys I'm writing to HBase, such that I wouldn't write 0010012001001, then 0010012001002, then 0010012001003? I was reading on some lists that someone had built a randomize.java function to do this, but all my searches have been in vain. Has anyone run into this problem?

-Mike

Re: Randomize key insert

Posted by Debashis Saha <de...@gmail.com>.
If you are using a reducer, none of the reducers will start processing until
all the mapping is done. This ensures that all the data emitted by the maps
is shuffled and sorted, so that all values with the same key go to the same
reducer. I'm not exactly sure what your processing logic is, but if you don't
need any sort or shuffle, you may consider not using a reducer. Use a map-only
job: read data with the map and write from the mapper itself. This gives
true parallelism, since the mappers are independent. The reducer is
great for automatic sorting and shuffling, but it has a lot of overhead.
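
Something like this map-only sketch is what I mean. It is untested, and the table name "measurements", the column family "d", and the line parsing are placeholders for whatever you actually use:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class MapOnlyLoad {

  // Parses "year doy time value x y" lines and writes one Put per line.
  static class LoadMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] f = line.toString().trim().split("\\s+");
      // Rowkey: X (3) + Y (3) + year (4) + day-of-year (3), as in the post.
      String key = String.format("%03d%03d%s%s",
          Integer.parseInt(f[4]), Integer.parseInt(f[5]), f[0], f[1]);
      Put put = new Put(Bytes.toBytes(key));
      put.add(Bytes.toBytes("d"), Bytes.toBytes("value"), Bytes.toBytes(f[3]));
      ctx.write(new ImmutableBytesWritable(put.getRow()), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    conf.set(TableOutputFormat.OUTPUT_TABLE, "measurements");
    Job job = new Job(conf, "map-only HBase load");
    job.setJarByClass(MapOnlyLoad.class);
    job.setMapperClass(LoadMapper.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TableOutputFormat.class);
    job.setOutputKeyClass(ImmutableBytesWritable.class);
    job.setOutputValueClass(Put.class);
    job.setNumReduceTasks(0);  // no shuffle/sort: mappers write directly
    FileInputFormat.addInputPath(job, new Path(args[0]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}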

Increasing the block size may not make a significant difference, but I'm not sure.

-Debashis


Re: Randomize key insert

Posted by "Gangl, Michael E (388K)" <Mi...@jpl.nasa.gov>.
I understand I can hash the input to distribute writes across regions, but I need to be able to use multiple scanners in parallel to read bunches of the data at once, in sequence.

So I guess I need to take the write performance hit to enable the read performance gains, unless there is a way to randomize the reducer data (obviously each reducer will still get all of the same key), but it seems they are getting the keys "in order."

Would increasing the block size in HDFS mean the reducers would get data with keys further away (lexicographically) from each other?
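
To show what I mean by parallel scanners, here is a rough, untested sketch of reading salted ranges concurrently. The one-digit salt prefix, the bucket count of 5, the table name "measurements", and the key range are all made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class SaltedScan {
  static final int BUCKETS = 5;  // assumed: one salt bucket per region server

  public static void main(String[] args) throws Exception {
    final Configuration conf = HBaseConfiguration.create();
    Thread[] readers = new Thread[BUCKETS];
    for (int b = 0; b < BUCKETS; b++) {
      final String salt = Integer.toString(b);
      readers[b] = new Thread(new Runnable() {
        public void run() {
          try {
            // HTable isn't thread-safe, so each reader gets its own.
            HTable table = new HTable(conf, "measurements");
            // One scanner per salted range; within a bucket the rows
            // still come back in key order.
            Scan scan = new Scan(Bytes.toBytes(salt + "0010012001001"),
                                 Bytes.toBytes(salt + "0010012001365"));
            ResultScanner rs = table.getScanner(scan);
            for (Result r : rs) {
              // ... process r ...
            }
            rs.close();
            table.close();
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      });
      readers[b].start();
    }
    for (Thread t : readers) {
      t.join();
    }
  }
}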

-Mike


Re: Randomize key insert

Posted by Debashis Saha <de...@gmail.com>.
I think you identified the problem: since the first part of most of your
rowkeys is the same, everything is hitting one region server. Randomizing the
rowkey is the solution, and you must ensure the following:

1. You have to preserve the uniqueness of the rowkey; one original key's
randomized value should not collide with another's. It should be a one-to-one
mapping.
2. You need easy retrieval logic for your application using the loaded data.

It is a very common problem, and there are two popular solutions:

1. Reversing the key; commonly used when the rowkey contains a website URL,
or when the first half of the characters are common but the rest are fairly
random.
2. Using a SHA-1 hash of the row key; this works for pretty much any
situation and provides high randomness.
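
A quick sketch of both transforms in plain Java, using the example key from your post (the hex-encoded hash would become the new rowkey, or a prefix to it). Note that reversal is exactly one-to-one, while the hash is one-to-one only in practice, since collisions are astronomically unlikely:

import java.security.MessageDigest;

public class KeyTransforms {

  // 1. Reverse the key so the varying tail leads.
  static String reverse(String key) {
    return new StringBuilder(key).reverse().toString();
  }

  // 2. SHA-1 the key for a near-uniform spread across regions.
  static String sha1Hex(String key) throws Exception {
    byte[] digest = MessageDigest.getInstance("SHA-1")
        .digest(key.getBytes("UTF-8"));
    StringBuilder hex = new StringBuilder();
    for (byte b : digest) {
      hex.append(String.format("%02x", b));
    }
    return hex.toString();
  }

  public static void main(String[] args) throws Exception {
    System.out.println(reverse("0010012001001"));  // -> 1001002100100
    System.out.println(sha1Hex("0010012001001"));  // 40 hex chars
  }
}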

-Debashis
