Posted to user@hbase.apache.org by Himanish Kushary <hi...@gmail.com> on 2011/05/12 07:36:27 UTC

Very slow Scan performance using Filters

Hi,

We have a table split across multiple regions (approx. 50-60 regions at a
64 MB split size) with the rowid schema
[ReverseTimestamp/itemtimestamp/customerid/itemid]. This stores the
activities for an item for a customer. We have lots of data for lots of
items per customer in this table.

When we try to look up the activities for an item over the last 30 days
from this table, we use a Scan with a RowFilter and a RegexStringComparator.
The scan takes a lot of time (almost 15-20 secs) to return the activities
for an item.

We are hooked up to the HBase tables directly from a web application, so a
response time of around 20 secs is unacceptable. We have also noticed that
whenever we do any scan-type operation, the response time is never in an
acceptable range for a web application.

Are we doing something wrong? If HBase scans are this slow, it would be
really hard to hook HBase up directly to any web application.

Could somebody please suggest how to improve this, or some other options
(design or architectural) to remedy this kind of issue when dealing with
lots of data?

Note: We have tried setCaching and SingleColumnValueFilter, with no
significant effect.
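
For concreteness, the scan we run looks roughly like this (a sketch; the
table name, regex, and row handling are illustrative rather than our exact
code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.RegexStringComparator;
import org.apache.hadoop.hbase.filter.RowFilter;

public class SlowActivityScan {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "activities");

    Scan scan = new Scan();
    scan.setCaching(1000); // fewer RPC round trips, but every row still gets read
    // The row filter is evaluated against every row in the table,
    // which is why this behaves like a full table scan.
    scan.setFilter(new RowFilter(CompareFilter.CompareOp.EQUAL,
        new RegexStringComparator(".*/CUST123/ITEM456$")));

    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result result : scanner) {
        // process one matching activity row
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}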

---------------------------
Thanks & Regards
Himanish

Re: Very slow Scan performance using Filters

Posted by Ryan Rawson <ry...@gmail.com>.
Don't forget that a Get is just a 1-row scan; they share the same code
path internally.  The only difference, of course, is that a Get just
returns that one row and is therefore fairly fast (unless your row is
huge, think hundreds of MBs).
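
For example, a point lookup is just this (table and key are made up for
illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class PointLookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "activities");
    // Internally this is a scan that starts at this exact key and stops
    // after one row, so it never touches the rest of the table.
    Get get = new Get(Bytes.toBytes("somerowkey"));
    Result result = table.get(get);
    table.close();
  }
}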

-ryan

On Thu, May 12, 2011 at 1:31 PM, Himanish Kushary <hi...@gmail.com> wrote:
> Thanks for your help. We are implementing our own secondary index table to
> get rid of the scans and replace those calls with Gets.
>
> One common trend that we are following, to ensure the frontend web
> application performs as we expect, is to always use Gets from the UI
> instead of Scans.
>
> Thanks
> Himanish
>
> On Thu, May 12, 2011 at 2:21 AM, Ryan Rawson <ry...@gmail.com> wrote:
>
>> Scans are serial.
>>
>> To use DB parlance, consider a Scan + filter the moral equivalent of a
>> "SELECT * FROM <> WHERE col='val'" with no index, so a full table
>> scan is engaged.
>>
>> The typical ways to solve this kind of performance issue are:
>> - arrange your data using the primary key so you can scan the smallest
>> portion of the table possible.
>> - use another table as an index. Unfortunately HBase doesn't help you here.
>>
>> -ryan
>>
>> On Wed, May 11, 2011 at 11:12 PM, Connolly Juhani <ju...@ninja.co.jp>
>> wrote:
>> > By naming rows from the timestamp, the rowids are all going to be
>> > sequential when inserting. So all new inserts will be going into the
>> > same region. When checking the last 30 days you will also be reading
>> > from the same region where all the writing is happening, i.e. the one
>> > that is already busy writing the edit log for all those entries. You
>> > might want to consider an alternative method of naming your rows that
>> > would result in more distributed reading/writing.
>> > However, since you are naming rows by timestamps, you should be able to
>> > restrict the scan by a start and end date. You are doing this, right?
>> > If you're not, you are scanning every row in the table when you only
>> > need the rows between start and end.
>> >
>> > Someone may need to correct me, but based on my memory of the
>> > implementation, scans are entirely sequential, so region a gets
>> > scanned, then b, then c. You could speed this up by scanning multiple
>> > regions in parallel processes and merging the results.
>> >
>> > On 12 May 2011 14:36, Himanish Kushary <hi...@gmail.com> wrote:
>> >
>> >> Hi,
>> >>
>> >> We have a table split across multiple regions (approx. 50-60 regions
>> >> at a 64 MB split size) with the rowid schema
>> >> [ReverseTimestamp/itemtimestamp/customerid/itemid]. This stores the
>> >> activities for an item for a customer. We have lots of data for lots
>> >> of items per customer in this table.
>> >>
>> >> When we try to look up the activities for an item over the last 30
>> >> days from this table, we use a Scan with a RowFilter and a
>> >> RegexStringComparator. The scan takes a lot of time (almost 15-20
>> >> secs) to return the activities for an item.
>> >>
>> >> We are hooked up to the HBase tables directly from a web application,
>> >> so a response time of around 20 secs is unacceptable. We have also
>> >> noticed that whenever we do any scan-type operation, the response
>> >> time is never in an acceptable range for a web application.
>> >>
>> >> Are we doing something wrong? If HBase scans are this slow, it would
>> >> be really hard to hook HBase up directly to any web application.
>> >>
>> >> Could somebody please suggest how to improve this, or some other
>> >> options (design or architectural) to remedy this kind of issue when
>> >> dealing with lots of data?
>> >>
>> >> Note: We have tried setCaching and SingleColumnValueFilter, with no
>> >> significant effect.
>> >>
>> >> ---------------------------
>> >> Thanks & Regards
>> >> Himanish
>> >>
>> >
>>
>
>
>
> --
> Thanks & Regards
> Himanish
>

Re: Very slow Scan performance using Filters

Posted by Himanish Kushary <hi...@gmail.com>.
Thanks for your help. We are implementing our own secondary index table to
get rid of the scans and replace those calls with Gets.

One common trend that we are following, to ensure the frontend web
application performs as we expect, is to always use Gets from the UI
instead of Scans.
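
For anyone interested, one possible shape for such an index table is
sketched below. The table and column names are our own illustration, not
an HBase feature, and keeping the index in sync with the main table is the
application's responsibility:

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class ActivityIndex {
  static final byte[] CF = Bytes.toBytes("ref");

  // Written alongside every insert into the main table: the index row is
  // keyed by [customerid/itemid], one column per activity, and the cell
  // value is the main-table row key.
  static void indexActivity(HTable index, String customerId, String itemId,
                            long reverseTs, byte[] mainRowKey) throws Exception {
    Put put = new Put(Bytes.toBytes(customerId + "/" + itemId));
    put.add(CF, Bytes.toBytes(Long.toString(reverseTs)), mainRowKey);
    index.put(put);
  }

  // The web tier then needs a single Get instead of a filtered Scan; each
  // returned cell points at one activity row in the main table.
  static Result lookupActivities(HTable index, String customerId,
                                 String itemId) throws Exception {
    return index.get(new Get(Bytes.toBytes(customerId + "/" + itemId)));
  }
}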

Thanks
Himanish

On Thu, May 12, 2011 at 2:21 AM, Ryan Rawson <ry...@gmail.com> wrote:

> Scans are serial.
>
> To use DB parlance, consider a Scan + filter the moral equivalent of a
> "SELECT * FROM <> WHERE col='val'" with no index, so a full table
> scan is engaged.
>
> The typical ways to solve this kind of performance issue are:
> - arrange your data using the primary key so you can scan the smallest
> portion of the table possible.
> - use another table as an index. Unfortunately HBase doesn't help you here.
>
> -ryan
>
> On Wed, May 11, 2011 at 11:12 PM, Connolly Juhani <ju...@ninja.co.jp>
> wrote:
> > By naming rows from the timestamp, the rowids are all going to be
> > sequential when inserting. So all new inserts will be going into the
> > same region. When checking the last 30 days you will also be reading
> > from the same region where all the writing is happening, i.e. the one
> > that is already busy writing the edit log for all those entries. You
> > might want to consider an alternative method of naming your rows that
> > would result in more distributed reading/writing.
> > However, since you are naming rows by timestamps, you should be able to
> > restrict the scan by a start and end date. You are doing this, right?
> > If you're not, you are scanning every row in the table when you only
> > need the rows between start and end.
> >
> > Someone may need to correct me, but based on my memory of the
> > implementation, scans are entirely sequential, so region a gets
> > scanned, then b, then c. You could speed this up by scanning multiple
> > regions in parallel processes and merging the results.
> >
> > On 12 May 2011 14:36, Himanish Kushary <hi...@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> We have a table split across multiple regions (approx. 50-60 regions
> >> at a 64 MB split size) with the rowid schema
> >> [ReverseTimestamp/itemtimestamp/customerid/itemid]. This stores the
> >> activities for an item for a customer. We have lots of data for lots
> >> of items per customer in this table.
> >>
> >> When we try to look up the activities for an item over the last 30
> >> days from this table, we use a Scan with a RowFilter and a
> >> RegexStringComparator. The scan takes a lot of time (almost 15-20
> >> secs) to return the activities for an item.
> >>
> >> We are hooked up to the HBase tables directly from a web application,
> >> so a response time of around 20 secs is unacceptable. We have also
> >> noticed that whenever we do any scan-type operation, the response
> >> time is never in an acceptable range for a web application.
> >>
> >> Are we doing something wrong? If HBase scans are this slow, it would
> >> be really hard to hook HBase up directly to any web application.
> >>
> >> Could somebody please suggest how to improve this, or some other
> >> options (design or architectural) to remedy this kind of issue when
> >> dealing with lots of data?
> >>
> >> Note: We have tried setCaching and SingleColumnValueFilter, with no
> >> significant effect.
> >>
> >> ---------------------------
> >> Thanks & Regards
> >> Himanish
> >>
> >
>



-- 
Thanks & Regards
Himanish

Re: Very slow Scan performance using Filters

Posted by Ryan Rawson <ry...@gmail.com>.
Scans are serial.

To use DB parlance, consider a Scan + filter the moral equivalent of a
"SELECT * FROM <> WHERE col='val'" with no index, so a full table
scan is engaged.

The typical ways to solve this kind of performance issue are:
- arrange your data using the primary key so you can scan the smallest
portion of the table possible (see the sketch below).
- use another table as an index. Unfortunately HBase doesn't help you here.
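
To sketch the first point: if the row key led with the field you query by,
say [itemid/reversetimestamp/...] rather than the schema in this thread,
the scan would touch only that item's slice of the table (names are
illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class PrefixScan {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "activities");
    Scan scan = new Scan();
    scan.setStartRow(Bytes.toBytes("ITEM456/"));
    // '0' is the byte right after '/', so this stop row ends the range
    // immediately after the last key carrying the ITEM456/ prefix.
    scan.setStopRow(Bytes.toBytes("ITEM4560"));
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result result : scanner) {
        // only this item's rows are ever read
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}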

-ryan

On Wed, May 11, 2011 at 11:12 PM, Connolly Juhani <ju...@ninja.co.jp> wrote:
> By naming rows from the timestamp, the rowids are all going to be
> sequential when inserting. So all new inserts will be going into the same
> region. When checking the last 30 days you will also be reading from the
> same region where all the writing is happening, i.e. the one that is
> already busy writing the edit log for all those entries. You might want
> to consider an alternative method of naming your rows that would result
> in more distributed reading/writing.
> However, since you are naming rows by timestamps, you should be able to
> restrict the scan by a start and end date. You are doing this, right? If
> you're not, you are scanning every row in the table when you only need
> the rows between start and end.
>
> Someone may need to correct me, but based on my memory of the
> implementation, scans are entirely sequential, so region a gets scanned,
> then b, then c. You could speed this up by scanning multiple regions in
> parallel processes and merging the results.
>
> On 12 May 2011 14:36, Himanish Kushary <hi...@gmail.com> wrote:
>
>> Hi,
>>
>> We have a table split across multiple regions (approx. 50-60 regions at
>> a 64 MB split size) with the rowid schema
>> [ReverseTimestamp/itemtimestamp/customerid/itemid]. This stores the
>> activities for an item for a customer. We have lots of data for lots of
>> items per customer in this table.
>>
>> When we try to look up the activities for an item over the last 30 days
>> from this table, we use a Scan with a RowFilter and a
>> RegexStringComparator. The scan takes a lot of time (almost 15-20 secs)
>> to return the activities for an item.
>>
>> We are hooked up to the HBase tables directly from a web application, so
>> a response time of around 20 secs is unacceptable. We have also noticed
>> that whenever we do any scan-type operation, the response time is never
>> in an acceptable range for a web application.
>>
>> Are we doing something wrong? If HBase scans are this slow, it would be
>> really hard to hook HBase up directly to any web application.
>>
>> Could somebody please suggest how to improve this, or some other options
>> (design or architectural) to remedy this kind of issue when dealing with
>> lots of data?
>>
>> Note: We have tried setCaching and SingleColumnValueFilter, with no
>> significant effect.
>>
>> ---------------------------
>> Thanks & Regards
>> Himanish
>>
>

Re: Very slow Scan performance using Filters

Posted by Connolly Juhani <ju...@ninja.co.jp>.
By naming rows from the timestamp, the rowids are all going to be
sequential when inserting. So all new inserts will be going into the same
region. When checking the last 30 days you will also be reading from the
same region where all the writing is happening, i.e. the one that is
already busy writing the edit log for all those entries. You might want to
consider an alternative method of naming your rows that would result in
more distributed reading/writing.
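
One common approach is to salt the key with a small hash-derived bucket
prefix; a rough sketch (the bucket count and key layout here are made up
for illustration, not a recommendation):

import org.apache.hadoop.hbase.util.Bytes;

public class SaltedKey {
  static final int NUM_BUCKETS = 16;

  // Prefixing with a bucket derived from the item id spreads consecutive
  // timestamps across NUM_BUCKETS regions instead of one hot region.
  static byte[] rowKey(long reverseTs, long itemTs, String customerId,
                       String itemId) {
    // Mask keeps the hash non-negative before taking the modulus.
    int bucket = (itemId.hashCode() & 0x7fffffff) % NUM_BUCKETS;
    // The trade-off: reading a time range then takes one scan per bucket,
    // though those scans can run in parallel.
    return Bytes.toBytes(String.format("%02d/%019d/%019d/%s/%s",
        bucket, reverseTs, itemTs, customerId, itemId));
  }
}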
However, since you are naming rows by timestamps, you should be able to
restrict the scan by a start and end date. You are doing this, right? If
you're not, you are scanning every row in the table when you only need the
rows between start and end.
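
A sketch of that restriction against the schema in this thread, assuming
the reverse timestamp is encoded at the front of the key as a zero-padded
decimal string (the encoding is an assumption; match however your keys are
actually serialized):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class LastThirtyDaysScan {
  public static void main(String[] args) throws Exception {
    long now = System.currentTimeMillis();
    long thirtyDaysAgo = now - 30L * 24 * 60 * 60 * 1000;
    // Reverse timestamps sort newest-first, so "now" is the smallest key
    // in the window and "thirty days ago" is the largest.
    String start = String.format("%019d", Long.MAX_VALUE - now);
    String stop = String.format("%019d", Long.MAX_VALUE - thirtyDaysAgo);

    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "activities");
    Scan scan = new Scan();
    scan.setStartRow(Bytes.toBytes(start));
    // The stop row is exclusive; widen it by one if the exact 30-day
    // boundary row matters.
    scan.setStopRow(Bytes.toBytes(stop));
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result result : scanner) {
        // only rows from the last 30 days ever reach the client
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}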

Someone may need to correct me, but based on my memory of the
implementation, scans are entirely sequential, so region a gets scanned,
then b, then c. You could speed this up by scanning multiple regions in
parallel processes and merging the results.
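
Something along these lines (the split points here are made up; in
practice the table's region boundaries would be the natural split points):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ParallelScan {
  public static void main(String[] args) throws Exception {
    final Configuration conf = HBaseConfiguration.create();
    String[][] ranges = { { "a", "h" }, { "h", "p" }, { "p", "z" } };

    ExecutorService pool = Executors.newFixedThreadPool(ranges.length);
    List<Future<List<Result>>> futures = new ArrayList<Future<List<Result>>>();
    for (final String[] range : ranges) {
      futures.add(pool.submit(new Callable<List<Result>>() {
        public List<Result> call() throws Exception {
          // HTable instances are not thread-safe; each worker opens its own.
          HTable table = new HTable(conf, "activities");
          Scan scan = new Scan(Bytes.toBytes(range[0]), Bytes.toBytes(range[1]));
          ResultScanner scanner = table.getScanner(scan);
          List<Result> rows = new ArrayList<Result>();
          try {
            for (Result r : scanner) {
              rows.add(r);
            }
          } finally {
            scanner.close();
            table.close();
          }
          return rows;
        }
      }));
    }

    List<Result> merged = new ArrayList<Result>();
    for (Future<List<Result>> future : futures) {
      merged.addAll(future.get()); // ranges come back in submission order
    }
    pool.shutdown();
  }
}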

On 12 May 2011 14:36, Himanish Kushary <hi...@gmail.com> wrote:

> Hi,
>
> We have a table split across multiple regions (approx. 50-60 regions at a
> 64 MB split size) with the rowid schema
> [ReverseTimestamp/itemtimestamp/customerid/itemid]. This stores the
> activities for an item for a customer. We have lots of data for lots of
> items per customer in this table.
>
> When we try to look up the activities for an item over the last 30 days
> from this table, we use a Scan with a RowFilter and a
> RegexStringComparator. The scan takes a lot of time (almost 15-20 secs)
> to return the activities for an item.
>
> We are hooked up to the HBase tables directly from a web application, so
> a response time of around 20 secs is unacceptable. We have also noticed
> that whenever we do any scan-type operation, the response time is never
> in an acceptable range for a web application.
>
> Are we doing something wrong? If HBase scans are this slow, it would be
> really hard to hook HBase up directly to any web application.
>
> Could somebody please suggest how to improve this, or some other options
> (design or architectural) to remedy this kind of issue when dealing with
> lots of data?
>
> Note: We have tried setCaching and SingleColumnValueFilter, with no
> significant effect.
>
> ---------------------------
> Thanks & Regards
> Himanish
>