You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hbase.apache.org by Jon Bender <jo...@gmail.com> on 2012/02/06 05:56:05 UTC

HBase Read Performance - Multiget vs TableInputFormat Job

Hi,

I've got a question about batch read performance in HBase.  I've got a
nightly job that extracts HBase data (currently upwards of ~300k new rows)
added from the previous day.  The rows are spread out fairly evenly over
the key range, so inevitably we will have to read from most, if not all
regions, to retrieve this data, and these reads will not be sequential
across rows.

The two alternatives I am exploring are

   1. Running a TableInputFormat MR job that filters for data added in the
   past day (Scan on the internal timestamp range of the cells)
   2. Using a batched get (multiGet) with a list of the rows were written
   the previous day, most likely using a number of HBase client processes to
   read this data out in parallel.

Does anyone have any recommendations on which approach to take?  I haven't
used the new MultiGet operations so I figured I'd ask the pros before
diving in.

Cheers,
Jon

Re: HBase Read Performance - Multiget vs TableInputFormat Job

Posted by Stack <st...@duboce.net>.

On Mon, Feb 6, 2012 at 8:58 AM, Jon Bender <jo...@gmail.com> wrote:
> When you say it'll sort regions by you, does that mean I'll need to
> identify the regions before dividing up the maps?  Or just deal with the
> fact that multiple maps might read from the same regionserver?
>

If you do a multiget on N rows, internally HTable will sort the rows
by region so that the big multiget get turns into a as many
mini-multigets as there are regions present in the N rows.  HTable
then dispatches all in parallell and manages the returns, failures,
etc.

I was suggesting you run a client in the mapper and the map input
would be N rows for the client to handle.   Perhaps have each mapper
do 5 minutes worth of N multigets.

If in MR, your job gets distributed for you, retried (maybe you won't
want retries?), etc.
St.Ack

Re: HBase Read Performance - Multiget vs TableInputFormat Job

Posted by Jon Bender <jo...@gmail.com>.

Thanks for the responses!

>What percentage of total data is the 300k new rows?

A constantly shrinking percentage--we may retain upwards of 5 years of data
here, so running against the full table will get very expensive going
forward.  I think the second approach sounds best.

>If you have the list of the 300k, this could work.  You could write a
mapreduce job that divided the 300k into maps and in each mapper run a
client to do  multiget (it'll sort the gets by regions for you).

When you say it'll sort regions by you, does that mean I'll need to
identify the regions before dividing up the maps?  Or just deal with the
fact that multiple maps might read from the same regionserver?

--Jon

On Mon, Feb 6, 2012 at 8:21 AM, Stack <st...@duboce.net> wrote:

> On Sun, Feb 5, 2012 at 8:56 PM, Jon Bender <jo...@gmail.com>
> wrote:
> > The two alternatives I am exploring are
> >
> >   1. Running a TableInputFormat MR job that filters for data added in the
> >   past day (Scan on the internal timestamp range of the cells)
>
> You'll touch all your data when you do this.
>
> What percentage of total data is the 300k new rows?
>
> >   2. Using a batched get (multiGet) with a list of the rows were written
> >   the previous day, most likely using a number of HBase client processes
> to
> >   read this data out in parallel.
> >
>
> If you have the list of the 300k, this could work.  You could write a
> mapreduce job that divided the 300k into maps and in each mapper run a
> client to do  multiget (it'll sort the gets by regions for you).
>
> St.Ack
>