Posted to user@hbase.apache.org by Luke Forehand <lu...@networkedinsights.com> on 2010/08/03 17:40:49 UTC

Secondary Index versus Full Table Scan

Thanks to the help of people on this mailing list and Cloudera, our team has
managed to get our 3 data node cluster with HBase running like a top.  Our
import rate is now around 3 GB per job which takes about 10 minutes.  This is
great.  Now we are trying to tackle reading.

With our current setup, a map reduce job with 24 mappers performing a full table
scan of ~150 million records takes ~1 hour.  This won't work for our use case,
because not only are we continuing to add more data to this table, but we are
asking many more questions in a day.  To increase performance, the first thought
was to use a secondary index table, and do range scans of the secondary index
table, iteratively performing GET operations of the master table.  
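The two-step pattern described above (range scan of the secondary index, then one GET per hit against the master table) can be sketched with plain JDK collections standing in for the HBase tables — a sorted map for the index, since HBase keeps rows sorted by key. The table contents and the space-delimited index keys below are made up for illustration:

```java
import java.util.*;

public class IndexThenGet {
    // Range-scan the index for one value, then GET each master row.
    // Index row keys are "<indexedValue> <masterRowKey>"; assumes the
    // indexed value itself contains no spaces.
    static List<String> lookup(NavigableMap<String, String> indexTable,
                               Map<String, String> masterTable, String value) {
        List<String> results = new ArrayList<>();
        // Prefix scan: start at "value " (inclusive), stop at "value!"
        // (exclusive) -- '!' is the byte after ' ', so this bounds the prefix.
        for (String masterKey : indexTable.subMap(value + " ", value + "!").values()) {
            results.add(masterTable.get(masterKey)); // the random GET
        }
        return results;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> index = new TreeMap<>();
        index.put("politics row0001", "row0001");
        index.put("politics row0007", "row0007");
        index.put("sports row0002", "row0002");
        Map<String, String> master = new HashMap<>();
        master.put("row0001", "doc A");
        master.put("row0002", "doc B");
        master.put("row0007", "doc C");
        System.out.println(lookup(index, master, "politics"));
    }
}
```

Each index hit costs one random read of the master table, which is the source of the per-GET latency discussed below.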

In testing, the average GET operation took 37 milliseconds.  At that rate with 24
mappers it would take ~1.5 hours to scan 3 million rows.  This still seems like
a lot of time.  37 milliseconds per GET is nice for "real time" access from a
client, but not during massive GETs of data in a map reduce job.
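The back-of-the-envelope figure can be checked directly; the raw GET latency alone accounts for most of the estimate, with the rest presumably task overhead:

```java
public class GetEstimate {
    // Total GET time divided evenly across mappers, converted to hours.
    static double hoursFor(long rows, double msPerGet, int mappers) {
        return rows * msPerGet / mappers / 1000.0 / 3600.0;
    }

    public static void main(String[] args) {
        // 3 million GETs at 37 ms each, spread over 24 mappers
        System.out.printf("~%.2f hours%n", hoursFor(3_000_000L, 37.0, 24));
    }
}
```

This works out to roughly 1.3 hours of pure GET latency, consistent with the ~1.5 hour observation once job overhead is added.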

My question is, does it make sense to use secondary index tables in a map reduce
job of this scale?  Should we not be using HBase for input in these map reduce
jobs and go with raw SequenceFile?  Do we simply need more nodes?  

Here are the specs for each of our 3 data nodes:
2x CPU (2.5 GHz Nehalem EP quad-core)
24 GB RAM (4 GB per region server)
4x 1 TB hard drives

Region size: 1GB

Thanks,

Luke Forehand
Software Engineer
http://www.networkedinsights.com


Re: Secondary Index versus Full Table Scan

Posted by Luke Forehand <lu...@networkedinsights.com>.
Hegner, Travis <TH...@...> writes:

> 
> Going out on a limb, I think it will perform MUCH faster with multiple
> copies, as the data is already sitting in each mapper's memory, ready to
> be accessed locally. The time to process per mapper should be
> dramatically reduced. With that in mind, you only have to scale up as
> disk space requires it, and disk space is cheap.
> 
> With your current method, adding three more identical data nodes is only
> going to cut your time in half. So unless you have the budget to get the
> number of machines required, it's at least worth a try to have multiple
> copies; at least that only costs your time.
> 
> HTH,
> 
> Travis Hegner
> http://www.travishegner.com/
> 

Thanks Travis!  I am in the process of making a copy of our master table with a
composite rowKey of '<columnToIndex> <masterRowKey>'.

I'll be testing out range scans using the composite key shortly.
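A minimal sketch of that key scheme, assuming the space delimiter shown above (and that the indexed values themselves contain no spaces). In the real job the resulting start/stop bytes would feed the scan's row bounds; here it is plain JDK code:

```java
public class CompositeKey {
    // '<columnToIndex> <masterRowKey>', space-delimited as described above
    static String composite(String indexedValue, String masterRowKey) {
        return indexedValue + " " + masterRowKey;
    }

    // Start row for scanning every row whose indexed value equals v:
    // the prefix itself (inclusive)...
    static byte[] startRow(String v) {
        return (v + " ").getBytes();
    }

    // ...and the stop row is the prefix with its last byte incremented,
    // giving an exclusive upper bound just past the final matching row.
    static byte[] stopRow(String v) {
        byte[] b = (v + " ").getBytes();
        b[b.length - 1]++;
        return b;
    }

    public static void main(String[] args) {
        System.out.println(composite("politics", "row0001"));
    }
}
```

Because the trailing space (0x20) increments to '!' (0x21), every key of the form "politics <anything>" sorts between the start and stop rows.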

-Luke


RE: Secondary Index versus Full Table Scan

Posted by "Hegner, Travis" <TH...@trilliumit.com>.
Going out on a limb, I think it will perform MUCH faster with multiple copies, as the data is already sitting in each mapper's memory, ready to be accessed locally. The time to process per mapper should be dramatically reduced. With that in mind, you only have to scale up as disk space requires it, and disk space is cheap.

With your current method, adding three more identical data nodes is only going to cut your time in half. So unless you have the budget to get the number of machines required, it's at least worth a try to have multiple copies; at least that only costs your time.

HTH,

Travis Hegner
http://www.travishegner.com/


-----Original Message-----
From: Luke Forehand [mailto:luke.forehand@networkedinsights.com]
Sent: Tuesday, August 03, 2010 12:37 PM
To: user@hbase.apache.org
Subject: Re: Secondary Index versus Full Table Scan

Edward Capriolo <ed...@...> writes:

> Generally speaking: If you are doing full range scans of a table
> indexes will not help. Adding indexes will make the performance worse,
> it will take longer to load your data and now fetching the data will
> involve two lookups instead of one.
>
> If you are doing full range scans adding more nodes should result in
> linear scale up.
>
>

Edward,

Can you clarify what "full range scan" means?  I am not doing "full" range
scans, but I am doing relatively large range scans (3 million records), so I
think what you are saying applies.  Thanks for the insight.

We initially implemented the secondary index out of a need to have our main data
sorted by multiple dimensions for various use cases.  Now I'm thinking it may be
better to have multiple copies of our main data, sorted in multiple ways, to
avoid the two lookups.  So I'm faced with two options right now: multiple copies
of the data sorted in multiple ways to do range scans, or buy a lot more servers
and do full scans.  Given these two choices, do people have general
recommendations on which makes the most sense?

Thanks!
-Luke



Re: Secondary Index versus Full Table Scan

Posted by Luke Forehand <lu...@networkedinsights.com>.
Edward Capriolo <ed...@...> writes:

> Generally speaking: If you are doing full range scans of a table
> indexes will not help. Adding indexes will make the performance worse,
> it will take longer to load your data and now fetching the data will
> involve two lookups instead of one.
> 
> If you are doing full range scans adding more nodes should result in
> linear scale up.
> 
> 

Edward,

Can you clarify what "full range scan" means?  I am not doing "full" range
scans, but I am doing relatively large range scans (3 million records), so I
think what you are saying applies.  Thanks for the insight.  

We initially implemented the secondary index out of a need to have our main data
sorted by multiple dimensions for various use cases.  Now I'm thinking it may be
better to have multiple copies of our main data, sorted in multiple ways, to
avoid the two lookups.  So I'm faced with two options right now: multiple copies
of the data sorted in multiple ways to do range scans, or buy a lot more servers
and do full scans.  Given these two choices, do people have general
recommendations on which makes the most sense?

Thanks!
-Luke


Re: Secondary Index versus Full Table Scan

Posted by Edward Capriolo <ed...@gmail.com>.
On Tue, Aug 3, 2010 at 11:40 AM, Luke Forehand
<lu...@networkedinsights.com> wrote:
> Thanks to the help of people on this mailing list and Cloudera, our team has
> managed to get our 3 data node cluster with HBase running like a top.  Our
> import rate is now around 3 GB per job which takes about 10 minutes.  This is
> great.  Now we are trying to tackle reading.
>
> With our current setup, a map reduce job with 24 mappers performing a full table
> scan of ~150 million records takes ~1 hour.  This won't work for our use case,
> because not only are we continuing to add more data to this table, but we are
> asking many more questions in a day.  To increase performance, the first thought
> was to use a secondary index table, and do range scans of the secondary index
> table, iteratively performing GET operations of the master table.
>
> In testing the average GET operation took 37 milliseconds.  At that rate with 24
> mappers it would take ~1.5 hours to scan 3 million rows.  This still seems like
> a lot of time.  37 milliseconds per GET is nice for "real time" access from a
> client, but not during massive GETs of data in a map reduce job.
>
> My question is, does it make sense to use secondary index tables in a map reduce
> job of this scale?  Should we not be using HBase for input in these map reduce
> jobs and go with raw SequenceFile?  Do we simply need more nodes?
>
> Here are the specs for each of our 3 data nodes:
> 2x CPU (2.5 GHZ nehalem ep quad core)
> 24 GB RAM (4gb / region server )
> 4x 1tb hard drives
>
> Region size: 1GB
>
> Thanks,
>
> Luke Forehand
> Software Engineer
> http://www.networkedinsights.com
>
>

Generally speaking: if you are doing full range scans of a table,
indexes will not help. Adding indexes will make the performance worse:
it will take longer to load your data, and fetching the data will now
involve two lookups instead of one.

If you are doing full range scans, adding more nodes should result in
linear scale-up.

RE: Secondary Index versus Full Table Scan

Posted by Jonathan Gray <jg...@facebook.com>.
Also, seek/reseek hooks in the filters will allow skipping of blocks. For some queries (those returning a high percentage of the total data) this won't matter, but for sparser filters that want to jump ahead it can be significant.

These are being worked on by an intern here and should have some patches up in a couple weeks.

> -----Original Message-----
> From: Todd Lipcon [mailto:todd@cloudera.com]
> Sent: Wednesday, August 04, 2010 2:15 PM
> To: user@hbase.apache.org
> Subject: Re: Secondary Index versus Full Table Scan
> 
> On Wed, Aug 4, 2010 at 1:14 PM, Luke Forehand <
> luke.forehand@networkedinsights.com> wrote:
> 
> > Todd Lipcon <to...@...> writes:
> >
> > > The above is true if you assume you can only do one get at a time. In fact,
> > > you can probably pipeline gets, and there's actually a patch in the works
> > > for multiget support - HBASE-1845. I don't think it's being actively worked
> > > on at the moment, though, so you'll have to do it somewhat manually. I'd
> > > recommend using multithreading in each mapper so that the keys come off the
> > > scan into a small thread pool executor which performs the gets - this should
> > > get you some parallelism. Otherwise you'll find the mappers are mostly
> > > spending time waiting on the network and not doing work.
> >
> > Excellent!  We will definitely try multithreading the gets.
> >
> > > It highly depends on the selectivity - if you're able to cut out a very
> > > large percentage of the records using your secondary index, then you'll be
> > > saving time for sure. If not, then you've just turned your sequential IO
> > > (read: fast) into random IO (read: slow). It's better to do a few random IOs
> > > than a lot of sequential, but better to do a lot of sequential than a lot of
> > > random, if that makes any sense.
> >
> > Yes we are coming to terms with this very quickly :-)  It's easier to find the
> > balance now that we're working with some real data...
> >
> > > One thing that no one has raised yet is whether you're using the Filter API.
> > > If you're not already using Filters to apply a server side predicate, I'd
> > > recommend looking into it. This will allow you to reduce the amount of
> > > network traffic between the mappers and the region servers, and should
> > > improve performance noticeably.
> >
> > We are using the Filter API but our mappers are local to the region servers so
> > we didn't notice much of an improvement.  Does that make sense?
> >
> Yep, certainly - you avoid a few extra copies between processes by using
> filters, but probably not a huge difference there in that case.
> 
> What's really missing is the pushdown of the filters all the way to the
> storage layer - this is where something like bitmap indexes will likely
> help
> - we'll be able to avoid reading the data off disk when it doesn't
> match,
> and thus query time will be closer to linear with the number of matches
> instead of linear with the amount of data.
> 
> -Todd
> 
> --
> Todd Lipcon
> Software Engineer, Cloudera

Re: Secondary Index versus Full Table Scan

Posted by Todd Lipcon <to...@cloudera.com>.
On Wed, Aug 4, 2010 at 1:14 PM, Luke Forehand <
luke.forehand@networkedinsights.com> wrote:

> Todd Lipcon <to...@...> writes:
>
> > The above is true if you assume you can only do one get at a time. In fact,
> > you can probably pipeline gets, and there's actually a patch in the works
> > for multiget support - HBASE-1845. I don't think it's being actively worked
> > on at the moment, though, so you'll have to do it somewhat manually. I'd
> > recommend using multithreading in each mapper so that the keys come off the
> > scan into a small thread pool executor which performs the gets - this should
> > get you some parallelism. Otherwise you'll find the mappers are mostly
> > spending time waiting on the network and not doing work.
>
> Excellent!  We will definitely try multithreading the gets.
>
> > It highly depends on the selectivity - if you're able to cut out a very
> > large percentage of the records using your secondary index, then you'll be
> > saving time for sure. If not, then you've just turned your sequential IO
> > (read: fast) into random IO (read: slow). It's better to do a few random IOs
> > than a lot of sequential, but better to do a lot of sequential than a lot of
> > random, if that makes any sense.
>
> Yes we are coming to terms with this very quickly :-)  It's easier to find the
> balance now that we're working with some real data...
>
> > One thing that no one has raised yet is whether you're using the Filter API.
> > If you're not already using Filters to apply a server side predicate, I'd
> > recommend looking into it. This will allow you to reduce the amount of
> > network traffic between the mappers and the region servers, and should
> > improve performance noticeably.
>
> We are using the Filter API but our mappers are local to the region servers so
> we didn't notice much of an improvement.  Does that make sense?
>

Yep, certainly - you avoid a few extra copies between processes by using
filters, but probably not a huge difference there in that case.

What's really missing is the pushdown of the filters all the way to the
storage layer - this is where something like bitmap indexes will likely help
- we'll be able to avoid reading the data off disk when it doesn't match,
and thus query time will be closer to linear with the number of matches
instead of linear with the amount of data.

-Todd

-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Secondary Index versus Full Table Scan

Posted by Luke Forehand <lu...@networkedinsights.com>.
Todd Lipcon <to...@...> writes:

> The above is true if you assume you can only do one get at a time. In fact,
> you can probably pipeline gets, and there's actually a patch in the works
> for multiget support - HBASE-1845. I don't think it's being actively worked
> on at the moment, though, so you'll have to do it somewhat manually. I'd
> recommend using multithreading in each mapper so that the keys come off the
> scan into a small thread pool executor which performs the gets - this should
> get you some parallelism. Otherwise you'll find the mappers are mostly
> spending time waiting on the network and not doing work.

Excellent!  We will definitely try multithreading the gets.

> It highly depends on the selectivity - if you're able to cut out a very
> large percentage of the records using your secondary index, then you'll be
> saving time for sure. If not, then you've just turned your sequential IO
> (read: fast) into random IO (read: slow). It's better to do a few random IOs
> than a lot of sequential, but better to do a lot of sequential than a lot of
> random, if that makes any sense.

Yes we are coming to terms with this very quickly :-)  It's easier to find the
balance now that we're working with some real data...

> One thing that no one has raised yet is whether you're using the Filter API.
> If you're not already using Filters to apply a server side predicate, I'd
> recommend looking into it. This will allow you to reduce the amount of
> network traffic between the mappers and the region servers, and should
> improve performance noticeably.

We are using the Filter API but our mappers are local to the region servers so
we didn't notice much of an improvement.  Does that make sense?

Thanks for the sage advice Todd, I think our team is heading in the right
direction after hearing the advice from folks here.

-Luke


Re: Secondary Index versus Full Table Scan

Posted by Todd Lipcon <to...@cloudera.com>.
Hey Luke,

A couple comments inline below:

On Tue, Aug 3, 2010 at 8:40 AM, Luke Forehand <
luke.forehand@networkedinsights.com> wrote:

> Thanks to the help of people on this mailing list and Cloudera, our team
> has
> managed to get our 3 data node cluster with HBase running like a top.  Our
> import rate is now around 3 GB per job which takes about 10 minutes.  This
> is
> great.  Now we are trying to tackle reading.
>
> With our current setup, a map reduce job with 24 mappers performing a full
> table
> scan of ~150 million records takes ~1 hour.  This won't work for our use
> case,
> because not only are we continuing to add more data to this table, but we
> are
> asking many more questions in a day.  To increase performance, the first
> thought
> was to use a secondary index table, and do range scans of the secondary
> index
> table, iteratively performing GET operations of the master table.
>
> In testing the average GET operation took 37 milliseconds.  At that rate
> with 24
> mappers it would take ~1.5 hours to scan 3 million rows.  This still seems
> like
> a lot of time.  37 milliseconds per GET is nice for "real time" access from
> a
> client, but not during massive GETs of data in a map reduce job.
>
>
The above is true if you assume you can only do one get at a time. In fact,
you can probably pipeline gets, and there's actually a patch in the works
for multiget support - HBASE-1845. I don't think it's being actively worked
on at the moment, though, so you'll have to do it somewhat manually. I'd
recommend using multithreading in each mapper so that the keys come off the
scan into a small thread pool executor which performs the gets - this should
get you some parallelism. Otherwise you'll find the mappers are mostly
spending time waiting on the network and not doing work.
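A JDK-only sketch of that setup: keys coming off the index scan are handed to a small thread pool so a mapper is not blocked on one 37 ms round trip at a time. The HBase client's per-row get is stood in for by a plain function here so the sketch stays self-contained; the pool size and key/value types are placeholders:

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.function.Function;

public class ParallelGets {
    // Submit one task per key to a fixed pool, then collect results in
    // key order. 'get' stands in for the real per-row HBase lookup.
    static <K, V> List<V> fetchAll(List<K> keys, Function<K, V> get, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<V>> futures = new ArrayList<>();
            for (K key : keys) {
                futures.add(pool.submit(() -> get.apply(key)));
            }
            List<V> results = new ArrayList<>();
            for (Future<V> f : futures) {
                results.add(f.get()); // blocks, but all gets run in parallel
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stub "get": in the mapper this would be the table lookup.
        System.out.println(fetchAll(Arrays.asList("r1", "r2", "r3"),
                k -> "value-of-" + k, 4));
    }
}
```

With N threads, the per-mapper wall time for the gets drops toward (rows × latency) / N, since the threads overlap their waits on the network.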


> My question is, does it make sense to use secondary index tables in a map
> reduce
> job of this scale?  Should we not be using HBase for input in these map
> reduce
> jobs and go with raw SequenceFile?  Do we simply need more nodes?
>
>
It highly depends on the selectivity - if you're able to cut out a very
large percentage of the records using your secondary index, then you'll be
saving time for sure. If not, then you've just turned your sequential IO
(read: fast) into random IO (read: slow). It's better to do a few random IOs
than a lot of sequential, but better to do a lot of sequential than a lot of
random, if that makes any sense.

One thing that no one has raised yet is whether you're using the Filter API.
If you're not already using Filters to apply a server side predicate, I'd
recommend looking into it. This will allow you to reduce the amount of
network traffic between the mappers and the region servers, and should
improve performance noticeably.
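A toy illustration of why the server-side predicate matters: only rows that pass the filter cross the network. The real mechanism is an HBase Filter (for example, a column-value comparison) attached to the scan; here the "region server" is just a list and the predicate is plain Java:

```java
import java.util.*;
import java.util.function.Predicate;

public class PredicatePushdown {
    // Evaluate the predicate where the data lives, shipping only matches.
    static List<String> scanWithFilter(List<String> serverRows,
                                       Predicate<String> filter) {
        List<String> shipped = new ArrayList<>();
        for (String row : serverRows) {
            if (filter.test(row)) {   // applied server-side, before shipping
                shipped.add(row);
            }
        }
        return shipped;               // only these rows cost network traffic
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("politics:r1", "sports:r2", "politics:r3");
        System.out.println(scanWithFilter(rows, r -> r.startsWith("politics:")));
    }
}
```

As noted above, when the mapper is co-located with the region server the saving is copies between processes rather than network hops, so the effect is smaller.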

In the long run, I have some plans to implement bitmap indexes in HBase,
which would provide for very fast filter predicates during scans. It's not
on the road map for the next several months, unfortunately - probably
post-0.90.

Thanks
-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera