Posted to user@hbase.apache.org by Gurjeet Singh <gu...@gmail.com> on 2012/08/12 08:04:43 UTC

Slow full-table scans

Hi,

I am trying to read all the data out of an HBase table using a scan
and it is extremely slow.

Here are some characteristics of the data:

1. The total table size is tiny (~200 MB)
2. The table has ~100 rows and ~200,000 columns in a SINGLE family.
Thus each cell is ~10 bytes and each row is
~2 MB
3. Currently scanning the whole table takes ~400s (both in a
distributed setting with 12 nodes or so and on a single node), thus
5sec/row
4. The row keys are unique 8 byte crypto hashes of sequential numbers
5. The scanner is set to fetch a FULL row at a time (scan.setBatch)
and is set to fetch 100MB of data at a time (scan.setCaching)
6. Changing the caching size seems to have no effect on the total scan
time at all
7. The column family is set up to keep a single version of the cells,
with no compression and no block cache.
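For reference, a quick stand-alone check of the arithmetic implied by the
numbers above (all figures are the approximate ones from this post):

```java
public class ScanNumbers {
    public static void main(String[] args) {
        long rows = 100;                      // ~100 rows
        long colsPerRow = 200_000;            // ~200,000 columns per row
        long totalBytes = 200L * 1024 * 1024; // ~200 MB table
        long scanSeconds = 400;               // ~400 s for a full scan

        long totalCells = rows * colsPerRow;          // 20,000,000 cells
        long bytesPerCell = totalBytes / totalCells;  // ~10 bytes/cell
        long bytesPerRow = totalBytes / rows;         // ~2 MB/row
        long cellsPerSec = totalCells / scanSeconds;  // ~50,000 cells/s

        System.out.println(totalCells + " cells, " + bytesPerCell + " B/cell, "
                + bytesPerRow + " B/row, " + cellsPerSec + " cells/s");
    }
}
```

So the scan is moving roughly 50,000 small KeyValues per second in series,
which frames the discussion below of where the time goes.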

Am I missing something? Is there a way to optimize this?

I guess a general question I have is whether HBase is a good datastore
for storing many medium-sized (~50 GB), dense datasets with lots of
columns when a lot of the queries require full-table scans?

Thanks!
Gurjeet

Re: Slow full-table scans

Posted by Gurjeet Singh <gu...@gmail.com>.
Hi Jacques,

I did consider that. That increases the on-disk size of my data by
3-4x (600-800 MB), but it still does not explain why reading one row
(~4 MB with overhead) takes 5 seconds. As for serialization/deserialization
on the client side: it happens on a different thread out of a buffer, and
most of the time that thread is just idling.

Gurjeet

On Sun, Aug 12, 2012 at 2:05 PM, Jacques <wh...@gmail.com> wrote:
> Something to consider is that HBase stores and retrieves the row key (8
> bytes in your case) + timestamp (8 bytes) + column qualifier (?) for every
> single value.  The schemaless nature of HBase generally means that this
> data has to be stored for each row (certain kinds of newer block level
> compression can make this less).  So depending on your column qualifiers,
> you're going to be looking at potentially a huge amount of overhead when
> you're dealing with 200,000 cells in a single row.  I also wonder whether
> you're dealing with a large amount of overhead simply on the
> serialization/deserialization/instantiation side if you're pulling back
> that many values.
>
> I'm not sure how many people are using that many cells in a single row and
> trying to read or write them all at once.
>
> Other's may have more thoughts.
>
> Jacques
>
>
>
> On Sun, Aug 12, 2012 at 7:23 AM, Gurjeet Singh <gu...@gmail.com> wrote:
>
>> Hi Ted,
>>
>> Yes, I am using the cloudera distribution 3.
>>
>> Gurjeet
>>
>> Sent from my iPad
>>
>> On Aug 12, 2012, at 7:11 AM, Ted Yu <yu...@gmail.com> wrote:
>>
>> > Gurjeet:
>> > Can you tell us which HBase version you are using ?
>> >
>> > Thanks
>> >
>> > On Sun, Aug 12, 2012 at 5:32 AM, Gurjeet Singh <gu...@gmail.com>
>> wrote:
>> >
>> >> Thanks for the reply Stack. My comments are inline.
>> >>
>> >>> You've checked out the perf section of the refguide?
>> >>>
>> >>> http://hbase.apache.org/book.html#performance
>> >>
>> >> Yes. HBase has 8GB RAM both on my cluster as well as my dev machine.
>> >> Both configurations are backed by SSDs and Hbase options are set to
>> >>
>> >> HBASE_OPTS="-ea -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode"
>> >>
>> >> The data that I am dealing with is static. The table never changes
>> >> after the first load.
>> >>
>> >> Even some of my GET requests are taking up to a full 60 seconds when
>> >> the row sizes reach ~10MB. In general, taking 5 seconds to fetch a
>> >> single row (~1MB) seems a extremely high to me.
>> >>
>> >> Thanks again for your help.
>> >>
>>

Re: Slow full-table scans

Posted by Jacques <wh...@gmail.com>.
Something to consider is that HBase stores and retrieves the row key (8
bytes in your case) + timestamp (8 bytes) + column qualifier (?) for every
single value.  The schemaless nature of HBase generally means that this
data has to be stored for each row (certain kinds of newer block level
compression can make this less).  So depending on your column qualifiers,
you're going to be looking at potentially a huge amount of overhead when
you're dealing with 200,000 cells in a single row.  I also wonder whether
you're dealing with a large amount of overhead simply on the
serialization/deserialization/instantiation side if you're pulling back
that many values.
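To put rough numbers on that: the (0.90-era) KeyValue serialization carries
length prefixes, the full row key, family, qualifier, timestamp, and a type
byte for every single cell. The family and qualifier lengths below are
assumptions (the thread doesn't state them), so treat this as a sketch rather
than exact accounting:

```java
public class KeyValueOverhead {
    public static void main(String[] args) {
        // Approximate per-cell fields in the KeyValue serialization:
        int lengthPrefixes = 4 + 4; // key-length + value-length ints
        int rowLenField    = 2;     // short: row key length
        int rowKey         = 8;     // 8-byte crypto-hash row keys (from the post)
        int familyLenField = 1;     // byte: family length
        int family         = 2;     // ASSUMED short family name, e.g. "cf"
        int qualifier      = 6;     // ASSUMED average qualifier length
        int timestamp      = 8;
        int type           = 1;
        int value          = 10;    // ~10-byte cells (from the post)

        int overhead = lengthPrefixes + rowLenField + rowKey
                     + familyLenField + family + qualifier + timestamp + type;
        int perCell = overhead + value;
        double blowup = (double) perCell / value;

        System.out.println("overhead/cell=" + overhead
                + " total/cell=" + perCell + " blowup=" + blowup + "x");
    }
}
```

Under these assumptions each ~10-byte value drags along ~36 bytes of
bookkeeping, a 4-5x blow-up, in the same ballpark as the 3-4x on-disk growth
Gurjeet reports; actual qualifier lengths and compression move the number
around.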

I'm not sure how many people are using that many cells in a single row and
trying to read or write them all at once.

Others may have more thoughts.

Jacques



On Sun, Aug 12, 2012 at 7:23 AM, Gurjeet Singh <gu...@gmail.com> wrote:

> Hi Ted,
>
> Yes, I am using the cloudera distribution 3.
>
> Gurjeet
>
> Sent from my iPad
>
> On Aug 12, 2012, at 7:11 AM, Ted Yu <yu...@gmail.com> wrote:
>
> > Gurjeet:
> > Can you tell us which HBase version you are using ?
> >
> > Thanks
> >
> > On Sun, Aug 12, 2012 at 5:32 AM, Gurjeet Singh <gu...@gmail.com>
> wrote:
> >
> >> Thanks for the reply Stack. My comments are inline.
> >>
> >>> You've checked out the perf section of the refguide?
> >>>
> >>> http://hbase.apache.org/book.html#performance
> >>
> >> Yes. HBase has 8GB RAM both on my cluster as well as my dev machine.
> >> Both configurations are backed by SSDs and Hbase options are set to
> >>
> >> HBASE_OPTS="-ea -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode"
> >>
> >> The data that I am dealing with is static. The table never changes
> >> after the first load.
> >>
> >> Even some of my GET requests are taking up to a full 60 seconds when
> >> the row sizes reach ~10MB. In general, taking 5 seconds to fetch a
> >> single row (~1MB) seems a extremely high to me.
> >>
> >> Thanks again for your help.
> >>
>

Re: Slow full-table scans

Posted by Gurjeet Singh <gu...@gmail.com>.
Hi Ted,

Yes, I am using the cloudera distribution 3.

Gurjeet

Sent from my iPad

On Aug 12, 2012, at 7:11 AM, Ted Yu <yu...@gmail.com> wrote:

> Gurjeet:
> Can you tell us which HBase version you are using ?
> 
> Thanks
> 
> On Sun, Aug 12, 2012 at 5:32 AM, Gurjeet Singh <gu...@gmail.com> wrote:
> 
>> Thanks for the reply Stack. My comments are inline.
>> 
>>> You've checked out the perf section of the refguide?
>>> 
>>> http://hbase.apache.org/book.html#performance
>> 
>> Yes. HBase has 8GB RAM both on my cluster as well as my dev machine.
>> Both configurations are backed by SSDs and Hbase options are set to
>> 
>> HBASE_OPTS="-ea -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode"
>> 
>> The data that I am dealing with is static. The table never changes
>> after the first load.
>> 
>> Even some of my GET requests are taking up to a full 60 seconds when
>> the row sizes reach ~10MB. In general, taking 5 seconds to fetch a
>> single row (~1MB) seems a extremely high to me.
>> 
>> Thanks again for your help.
>> 

Re: Slow full-table scans

Posted by Ted Yu <yu...@gmail.com>.
Gurjeet:
Can you tell us which HBase version you are using ?

Thanks

On Sun, Aug 12, 2012 at 5:32 AM, Gurjeet Singh <gu...@gmail.com> wrote:

> Thanks for the reply Stack. My comments are inline.
>
> > You've checked out the perf section of the refguide?
> >
> > http://hbase.apache.org/book.html#performance
>
> Yes. HBase has 8GB RAM both on my cluster as well as my dev machine.
> Both configurations are backed by SSDs and Hbase options are set to
>
> HBASE_OPTS="-ea -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode"
>
> The data that I am dealing with is static. The table never changes
> after the first load.
>
> Even some of my GET requests are taking up to a full 60 seconds when
> the row sizes reach ~10MB. In general, taking 5 seconds to fetch a
> single row (~1MB) seems a extremely high to me.
>
> Thanks again for your help.
>

Re: Slow full-table scans

Posted by Gurjeet Singh <gu...@gmail.com>.
Thanks for the reply Stack. My comments are inline.

> You've checked out the perf section of the refguide?
>
> http://hbase.apache.org/book.html#performance

Yes. HBase has 8 GB of RAM on both my cluster and my dev machine.
Both configurations are backed by SSDs, and the HBase options are set to

HBASE_OPTS="-ea -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode"

The data that I am dealing with is static. The table never changes
after the first load.

Even some of my GET requests are taking up to a full 60 seconds when
the row sizes reach ~10MB. In general, taking 5 seconds to fetch a
single row (~1MB) seems extremely high to me.

Thanks again for your help.

Re: Slow full-table scans

Posted by Stack <st...@duboce.net>.
On Sun, Aug 12, 2012 at 7:04 AM, Gurjeet Singh <gu...@gmail.com> wrote:
> Am I missing something ? Is there a way to optimize this ?
>

You've checked out the perf section of the refguide?

http://hbase.apache.org/book.html#performance

And have you read the postings by the GBIF lads starting with this one:

http://gbif.blogspot.ie/2012/02/performance-evaluation-of-hbase.html

The boys have done a few blog postings on what they did to get HBase
scans working fast enough for their needs. It's good reading because
they tell it like a detective story: figuring out where the frictions
were, measuring them, and then undoing them, one by one.

> I guess a general question I have is whether HBase is good datastore
> for storing many medium sized (~50GB), dense datasets with lots of
> columns when a lot of the queries require full table scans ?
>

Yes.

St.Ack

Re: Slow full-table scans

Posted by Jacques <wh...@gmail.com>.
HTable.getRegionLocations()

I didn't realize the KeyValue serialization/deserialization happened on a
separate thread in the HBase client code.
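getRegionLocations() (or getStartEndKeys()) hands back the regions with their
boundaries; turning those boundaries into per-region scan ranges is mechanical.
A stand-alone sketch of that step, using HBase's convention that an empty byte
array means an open-ended key (no HBase classes are needed for the range math
itself):

```java
import java.util.ArrayList;
import java.util.List;

public class RegionRanges {
    // Given a table's sorted region split points, build the [start, end)
    // range for each region. HBase convention: an empty byte[] means
    // "open-ended" on that side of the range.
    static List<byte[][]> ranges(List<byte[]> splitPoints) {
        List<byte[][]> out = new ArrayList<>();
        byte[] start = new byte[0];                   // first region: open start
        for (byte[] split : splitPoints) {
            out.add(new byte[][] { start, split });
            start = split;
        }
        out.add(new byte[][] { start, new byte[0] }); // last region: open end
        return out;
    }

    public static void main(String[] args) {
        List<byte[]> splits = List.of(new byte[]{0x40}, new byte[]{(byte) 0x80});
        List<byte[][]> r = ranges(splits);
        System.out.println(r.size()); // 2 split points -> 3 regions
    }
}
```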

J



On Sun, Aug 12, 2012 at 3:52 PM, Gurjeet Singh <gu...@gmail.com> wrote:

> Hi Mohammad,
>
> This is a great idea. Is there a API call to determine the start/end
> key for each region ?
>
> Thanks,
> Gurjeet
>
> On Sun, Aug 12, 2012 at 3:49 PM, Mohammad Tariq <do...@gmail.com>
> wrote:
> > Hello experts,
> >
> >        Would it be feasible to create a separate thread for each
> region??I
> > mean we can determine start and end key of each region and issue a scan
> for
> > each region in parallel.
> >
> > Regards,
> >     Mohammad Tariq
> >
> >
> >
> > On Mon, Aug 13, 2012 at 3:54 AM, lars hofhansl <lh...@yahoo.com>
> wrote:
> >
> >> Do you really have to retrieve all 200.000 each time?
> >> Scan.setBatch(...) makes no difference?! (note that batching is
> different
> >> and separate from caching).
> >>
> >> Also note that the scanner contract is to return sorted KVs, so a single
> >> scan cannot be parallelized across RegionServers (well not entirely
> true,
> >> it could be farmed off in parallel and then be presented to the client
> in
> >> the right order - but HBase is not doing that). That is why one vs 12
> RSs
> >> makes no difference in this scenario.
> >>
> >> In the 12 node case you'll see low CPU on all but one RS, and each RS
> will
> >> get its turn.
> >>
> >> In your case this is scanning 20.000.000 KVs serially in 400s, that's
> >> 50000 KVs/s, which - depending on hardware - is not too bad for HBase
> (but
> >> not great either).
> >>
> >> If you only ever expect to run a single query like this on top your
> >> cluster (i.e. your concern is latency not throughput) you can do
> multiple
> >> RPCs in parallel for a sub portion of your key range. Together with
> >> batching can start using value before all is streamed back from the
> server.
> >>
> >>
> >> -- Lars
> >>
> >>
> >>
> >> ----- Original Message -----
> >> From: Gurjeet Singh <gu...@gmail.com>
> >> To: user@hbase.apache.org
> >> Cc:
> >> Sent: Saturday, August 11, 2012 11:04 PM
> >> Subject: Slow full-table scans
> >>
> >> Hi,
> >>
> >> I am trying to read all the data out of an HBase table using a scan
> >> and it is extremely slow.
> >>
> >> Here are some characteristics of the data:
> >>
> >> 1. The total table size is tiny (~200MB)
> >> 2. The table has ~100 rows and ~200,000 columns in a SINGLE family.
> >> Thus the size of each cell is ~10bytes and the size of each row is
> >> ~2MB
> >> 3. Currently scanning the whole table takes ~400s (both in a
> >> distributed setting with 12 nodes or so and on a single node), thus
> >> 5sec/row
> >> 4. The row keys are unique 8 byte crypto hashes of sequential numbers
> >> 5. The scanner is set to fetch a FULL row at a time (scan.setBatch)
> >> and is set to fetch 100MB of data at a time (scan.setCaching)
> >> 6. Changing the caching size seems to have no effect on the total scan
> >> time at all
> >> 7. The column family is setup to keep a single version of the cells,
> >> no compression, and no block cache.
> >>
> >> Am I missing something ? Is there a way to optimize this ?
> >>
> >> I guess a general question I have is whether HBase is good datastore
> >> for storing many medium sized (~50GB), dense datasets with lots of
> >> columns when a lot of the queries require full table scans ?
> >>
> >> Thanks!
> >> Gurjeet
> >>
> >>
>

Re: Slow full-table scans

Posted by Gurjeet Singh <gu...@gmail.com>.
It seems like the client code just sits idle, waiting for data from
the regionservers.

Gurjeet

On Sun, Aug 12, 2012 at 4:13 PM, Jacques <wh...@gmail.com> wrote:
> I think the first question is where is the time spent.  Does your analysis
> show that all the time spent is on the regionservers or is a portion of the
> bottleneck on the client side?
>
> Jacques
>
>
>
> On Sun, Aug 12, 2012 at 4:00 PM, Mohammad Tariq <do...@gmail.com> wrote:
>
>> Methods getStartKey and getEndKey provided by  HRegionInfo class can used
>> for that purpose.
>> Also, please make sure, any HTable instance is not left opened once you are
>> are done with reads.
>> Regards,
>>     Mohammad Tariq
>>
>>
>>
>> On Mon, Aug 13, 2012 at 4:22 AM, Gurjeet Singh <gu...@gmail.com> wrote:
>>
>> > Hi Mohammad,
>> >
>> > This is a great idea. Is there a API call to determine the start/end
>> > key for each region ?
>> >
>> > Thanks,
>> > Gurjeet
>> >
>> > On Sun, Aug 12, 2012 at 3:49 PM, Mohammad Tariq <do...@gmail.com>
>> > wrote:
>> > > Hello experts,
>> > >
>> > >        Would it be feasible to create a separate thread for each
>> > region??I
>> > > mean we can determine start and end key of each region and issue a scan
>> > for
>> > > each region in parallel.
>> > >
>> > > Regards,
>> > >     Mohammad Tariq
>> > >
>> > >
>> > >
>> > > On Mon, Aug 13, 2012 at 3:54 AM, lars hofhansl <lh...@yahoo.com>
>> > wrote:
>> > >
>> > >> Do you really have to retrieve all 200.000 each time?
>> > >> Scan.setBatch(...) makes no difference?! (note that batching is
>> > different
>> > >> and separate from caching).
>> > >>
>> > >> Also note that the scanner contract is to return sorted KVs, so a
>> single
>> > >> scan cannot be parallelized across RegionServers (well not entirely
>> > true,
>> > >> it could be farmed off in parallel and then be presented to the client
>> > in
>> > >> the right order - but HBase is not doing that). That is why one vs 12
>> > RSs
>> > >> makes no difference in this scenario.
>> > >>
>> > >> In the 12 node case you'll see low CPU on all but one RS, and each RS
>> > will
>> > >> get its turn.
>> > >>
>> > >> In your case this is scanning 20.000.000 KVs serially in 400s, that's
>> > >> 50000 KVs/s, which - depending on hardware - is not too bad for HBase
>> > (but
>> > >> not great either).
>> > >>
>> > >> If you only ever expect to run a single query like this on top your
>> > >> cluster (i.e. your concern is latency not throughput) you can do
>> > multiple
>> > >> RPCs in parallel for a sub portion of your key range. Together with
>> > >> batching can start using value before all is streamed back from the
>> > server.
>> > >>
>> > >>
>> > >> -- Lars
>> > >>
>> > >>
>> > >>
>> > >> ----- Original Message -----
>> > >> From: Gurjeet Singh <gu...@gmail.com>
>> > >> To: user@hbase.apache.org
>> > >> Cc:
>> > >> Sent: Saturday, August 11, 2012 11:04 PM
>> > >> Subject: Slow full-table scans
>> > >>
>> > >> Hi,
>> > >>
>> > >> I am trying to read all the data out of an HBase table using a scan
>> > >> and it is extremely slow.
>> > >>
>> > >> Here are some characteristics of the data:
>> > >>
>> > >> 1. The total table size is tiny (~200MB)
>> > >> 2. The table has ~100 rows and ~200,000 columns in a SINGLE family.
>> > >> Thus the size of each cell is ~10bytes and the size of each row is
>> > >> ~2MB
>> > >> 3. Currently scanning the whole table takes ~400s (both in a
>> > >> distributed setting with 12 nodes or so and on a single node), thus
>> > >> 5sec/row
>> > >> 4. The row keys are unique 8 byte crypto hashes of sequential numbers
>> > >> 5. The scanner is set to fetch a FULL row at a time (scan.setBatch)
>> > >> and is set to fetch 100MB of data at a time (scan.setCaching)
>> > >> 6. Changing the caching size seems to have no effect on the total scan
>> > >> time at all
>> > >> 7. The column family is setup to keep a single version of the cells,
>> > >> no compression, and no block cache.
>> > >>
>> > >> Am I missing something ? Is there a way to optimize this ?
>> > >>
>> > >> I guess a general question I have is whether HBase is good datastore
>> > >> for storing many medium sized (~50GB), dense datasets with lots of
>> > >> columns when a lot of the queries require full table scans ?
>> > >>
>> > >> Thanks!
>> > >> Gurjeet
>> > >>
>> > >>
>> >
>>

Re: Slow full-table scans

Posted by Mohammad Tariq <do...@gmail.com>.
Also, give it a shot using HTablePool and see if it makes any significant
difference.

Regards,
    Mohammad Tariq



On Mon, Aug 13, 2012 at 4:43 AM, Jacques <wh...@gmail.com> wrote:

> I think the first question is where is the time spent.  Does your analysis
> show that all the time spent is on the regionservers or is a portion of the
> bottleneck on the client side?
>
> Jacques
>
>
>
> On Sun, Aug 12, 2012 at 4:00 PM, Mohammad Tariq <do...@gmail.com>
> wrote:
>
> > Methods getStartKey and getEndKey provided by  HRegionInfo class can used
> > for that purpose.
> > Also, please make sure, any HTable instance is not left opened once you
> are
> > are done with reads.
> > Regards,
> >     Mohammad Tariq
> >
> >
> >
> > On Mon, Aug 13, 2012 at 4:22 AM, Gurjeet Singh <gu...@gmail.com>
> wrote:
> >
> > > Hi Mohammad,
> > >
> > > This is a great idea. Is there a API call to determine the start/end
> > > key for each region ?
> > >
> > > Thanks,
> > > Gurjeet
> > >
> > > On Sun, Aug 12, 2012 at 3:49 PM, Mohammad Tariq <do...@gmail.com>
> > > wrote:
> > > > Hello experts,
> > > >
> > > >        Would it be feasible to create a separate thread for each
> > > region??I
> > > > mean we can determine start and end key of each region and issue a
> scan
> > > for
> > > > each region in parallel.
> > > >
> > > > Regards,
> > > >     Mohammad Tariq
> > > >
> > > >
> > > >
> > > > On Mon, Aug 13, 2012 at 3:54 AM, lars hofhansl <lh...@yahoo.com>
> > > wrote:
> > > >
> > > >> Do you really have to retrieve all 200.000 each time?
> > > >> Scan.setBatch(...) makes no difference?! (note that batching is
> > > different
> > > >> and separate from caching).
> > > >>
> > > >> Also note that the scanner contract is to return sorted KVs, so a
> > single
> > > >> scan cannot be parallelized across RegionServers (well not entirely
> > > true,
> > > >> it could be farmed off in parallel and then be presented to the
> client
> > > in
> > > >> the right order - but HBase is not doing that). That is why one vs
> 12
> > > RSs
> > > >> makes no difference in this scenario.
> > > >>
> > > >> In the 12 node case you'll see low CPU on all but one RS, and each
> RS
> > > will
> > > >> get its turn.
> > > >>
> > > >> In your case this is scanning 20.000.000 KVs serially in 400s,
> that's
> > > >> 50000 KVs/s, which - depending on hardware - is not too bad for
> HBase
> > > (but
> > > >> not great either).
> > > >>
> > > >> If you only ever expect to run a single query like this on top your
> > > >> cluster (i.e. your concern is latency not throughput) you can do
> > > multiple
> > > >> RPCs in parallel for a sub portion of your key range. Together with
> > > >> batching can start using value before all is streamed back from the
> > > server.
> > > >>
> > > >>
> > > >> -- Lars
> > > >>
> > > >>
> > > >>
> > > >> ----- Original Message -----
> > > >> From: Gurjeet Singh <gu...@gmail.com>
> > > >> To: user@hbase.apache.org
> > > >> Cc:
> > > >> Sent: Saturday, August 11, 2012 11:04 PM
> > > >> Subject: Slow full-table scans
> > > >>
> > > >> Hi,
> > > >>
> > > >> I am trying to read all the data out of an HBase table using a scan
> > > >> and it is extremely slow.
> > > >>
> > > >> Here are some characteristics of the data:
> > > >>
> > > >> 1. The total table size is tiny (~200MB)
> > > >> 2. The table has ~100 rows and ~200,000 columns in a SINGLE family.
> > > >> Thus the size of each cell is ~10bytes and the size of each row is
> > > >> ~2MB
> > > >> 3. Currently scanning the whole table takes ~400s (both in a
> > > >> distributed setting with 12 nodes or so and on a single node), thus
> > > >> 5sec/row
> > > >> 4. The row keys are unique 8 byte crypto hashes of sequential
> numbers
> > > >> 5. The scanner is set to fetch a FULL row at a time (scan.setBatch)
> > > >> and is set to fetch 100MB of data at a time (scan.setCaching)
> > > >> 6. Changing the caching size seems to have no effect on the total
> scan
> > > >> time at all
> > > >> 7. The column family is setup to keep a single version of the cells,
> > > >> no compression, and no block cache.
> > > >>
> > > >> Am I missing something ? Is there a way to optimize this ?
> > > >>
> > > >> I guess a general question I have is whether HBase is good datastore
> > > >> for storing many medium sized (~50GB), dense datasets with lots of
> > > >> columns when a lot of the queries require full table scans ?
> > > >>
> > > >> Thanks!
> > > >> Gurjeet
> > > >>
> > > >>
> > >
> >
>

Re: Slow full-table scans

Posted by Jacques <wh...@gmail.com>.
I think the first question is where the time is spent. Does your analysis
show that all of it is on the regionservers, or is a portion of the
bottleneck on the client side?

Jacques



On Sun, Aug 12, 2012 at 4:00 PM, Mohammad Tariq <do...@gmail.com> wrote:

> Methods getStartKey and getEndKey provided by  HRegionInfo class can used
> for that purpose.
> Also, please make sure, any HTable instance is not left opened once you are
> are done with reads.
> Regards,
>     Mohammad Tariq
>
>
>
> On Mon, Aug 13, 2012 at 4:22 AM, Gurjeet Singh <gu...@gmail.com> wrote:
>
> > Hi Mohammad,
> >
> > This is a great idea. Is there a API call to determine the start/end
> > key for each region ?
> >
> > Thanks,
> > Gurjeet
> >
> > On Sun, Aug 12, 2012 at 3:49 PM, Mohammad Tariq <do...@gmail.com>
> > wrote:
> > > Hello experts,
> > >
> > >        Would it be feasible to create a separate thread for each
> > region??I
> > > mean we can determine start and end key of each region and issue a scan
> > for
> > > each region in parallel.
> > >
> > > Regards,
> > >     Mohammad Tariq
> > >
> > >
> > >
> > > On Mon, Aug 13, 2012 at 3:54 AM, lars hofhansl <lh...@yahoo.com>
> > wrote:
> > >
> > >> Do you really have to retrieve all 200.000 each time?
> > >> Scan.setBatch(...) makes no difference?! (note that batching is
> > different
> > >> and separate from caching).
> > >>
> > >> Also note that the scanner contract is to return sorted KVs, so a
> single
> > >> scan cannot be parallelized across RegionServers (well not entirely
> > true,
> > >> it could be farmed off in parallel and then be presented to the client
> > in
> > >> the right order - but HBase is not doing that). That is why one vs 12
> > RSs
> > >> makes no difference in this scenario.
> > >>
> > >> In the 12 node case you'll see low CPU on all but one RS, and each RS
> > will
> > >> get its turn.
> > >>
> > >> In your case this is scanning 20.000.000 KVs serially in 400s, that's
> > >> 50000 KVs/s, which - depending on hardware - is not too bad for HBase
> > (but
> > >> not great either).
> > >>
> > >> If you only ever expect to run a single query like this on top your
> > >> cluster (i.e. your concern is latency not throughput) you can do
> > multiple
> > >> RPCs in parallel for a sub portion of your key range. Together with
> > >> batching can start using value before all is streamed back from the
> > server.
> > >>
> > >>
> > >> -- Lars
> > >>
> > >>
> > >>
> > >> ----- Original Message -----
> > >> From: Gurjeet Singh <gu...@gmail.com>
> > >> To: user@hbase.apache.org
> > >> Cc:
> > >> Sent: Saturday, August 11, 2012 11:04 PM
> > >> Subject: Slow full-table scans
> > >>
> > >> Hi,
> > >>
> > >> I am trying to read all the data out of an HBase table using a scan
> > >> and it is extremely slow.
> > >>
> > >> Here are some characteristics of the data:
> > >>
> > >> 1. The total table size is tiny (~200MB)
> > >> 2. The table has ~100 rows and ~200,000 columns in a SINGLE family.
> > >> Thus the size of each cell is ~10bytes and the size of each row is
> > >> ~2MB
> > >> 3. Currently scanning the whole table takes ~400s (both in a
> > >> distributed setting with 12 nodes or so and on a single node), thus
> > >> 5sec/row
> > >> 4. The row keys are unique 8 byte crypto hashes of sequential numbers
> > >> 5. The scanner is set to fetch a FULL row at a time (scan.setBatch)
> > >> and is set to fetch 100MB of data at a time (scan.setCaching)
> > >> 6. Changing the caching size seems to have no effect on the total scan
> > >> time at all
> > >> 7. The column family is setup to keep a single version of the cells,
> > >> no compression, and no block cache.
> > >>
> > >> Am I missing something ? Is there a way to optimize this ?
> > >>
> > >> I guess a general question I have is whether HBase is good datastore
> > >> for storing many medium sized (~50GB), dense datasets with lots of
> > >> columns when a lot of the queries require full table scans ?
> > >>
> > >> Thanks!
> > >> Gurjeet
> > >>
> > >>
> >
>

Re: Slow full-table scans

Posted by Mohammad Tariq <do...@gmail.com>.
The getStartKey and getEndKey methods provided by the HRegionInfo class can
be used for that purpose.
Also, please make sure no HTable instance is left open once you are
done with reads.
Regards,
    Mohammad Tariq



On Mon, Aug 13, 2012 at 4:22 AM, Gurjeet Singh <gu...@gmail.com> wrote:

> Hi Mohammad,
>
> This is a great idea. Is there a API call to determine the start/end
> key for each region ?
>
> Thanks,
> Gurjeet
>
> On Sun, Aug 12, 2012 at 3:49 PM, Mohammad Tariq <do...@gmail.com>
> wrote:
> > Hello experts,
> >
> >        Would it be feasible to create a separate thread for each
> region??I
> > mean we can determine start and end key of each region and issue a scan
> for
> > each region in parallel.
> >
> > Regards,
> >     Mohammad Tariq
> >
> >
> >
> > On Mon, Aug 13, 2012 at 3:54 AM, lars hofhansl <lh...@yahoo.com>
> wrote:
> >
> >> Do you really have to retrieve all 200.000 each time?
> >> Scan.setBatch(...) makes no difference?! (note that batching is
> different
> >> and separate from caching).
> >>
> >> Also note that the scanner contract is to return sorted KVs, so a single
> >> scan cannot be parallelized across RegionServers (well not entirely
> true,
> >> it could be farmed off in parallel and then be presented to the client
> in
> >> the right order - but HBase is not doing that). That is why one vs 12
> RSs
> >> makes no difference in this scenario.
> >>
> >> In the 12 node case you'll see low CPU on all but one RS, and each RS
> will
> >> get its turn.
> >>
> >> In your case this is scanning 20.000.000 KVs serially in 400s, that's
> >> 50000 KVs/s, which - depending on hardware - is not too bad for HBase
> (but
> >> not great either).
> >>
> >> If you only ever expect to run a single query like this on top your
> >> cluster (i.e. your concern is latency not throughput) you can do
> multiple
> >> RPCs in parallel for a sub portion of your key range. Together with
> >> batching can start using value before all is streamed back from the
> server.
> >>
> >>
> >> -- Lars
> >>
> >>
> >>
> >> ----- Original Message -----
> >> From: Gurjeet Singh <gu...@gmail.com>
> >> To: user@hbase.apache.org
> >> Cc:
> >> Sent: Saturday, August 11, 2012 11:04 PM
> >> Subject: Slow full-table scans
> >>
> >> Hi,
> >>
> >> I am trying to read all the data out of an HBase table using a scan
> >> and it is extremely slow.
> >>
> >> Here are some characteristics of the data:
> >>
> >> 1. The total table size is tiny (~200MB)
> >> 2. The table has ~100 rows and ~200,000 columns in a SINGLE family.
> >> Thus the size of each cell is ~10bytes and the size of each row is
> >> ~2MB
> >> 3. Currently scanning the whole table takes ~400s (both in a
> >> distributed setting with 12 nodes or so and on a single node), thus
> >> 5sec/row
> >> 4. The row keys are unique 8 byte crypto hashes of sequential numbers
> >> 5. The scanner is set to fetch a FULL row at a time (scan.setBatch)
> >> and is set to fetch 100MB of data at a time (scan.setCaching)
> >> 6. Changing the caching size seems to have no effect on the total scan
> >> time at all
> >> 7. The column family is setup to keep a single version of the cells,
> >> no compression, and no block cache.
> >>
> >> Am I missing something ? Is there a way to optimize this ?
> >>
> >> I guess a general question I have is whether HBase is good datastore
> >> for storing many medium sized (~50GB), dense datasets with lots of
> >> columns when a lot of the queries require full table scans ?
> >>
> >> Thanks!
> >> Gurjeet
> >>
> >>
>

Re: Slow full-table scans

Posted by Gurjeet Singh <gu...@gmail.com>.
Hi Mohammad,

This is a great idea. Is there an API call to determine the start/end
key for each region?

Thanks,
Gurjeet

On Sun, Aug 12, 2012 at 3:49 PM, Mohammad Tariq <do...@gmail.com> wrote:
> Hello experts,
>
>        Would it be feasible to create a separate thread for each region??I
> mean we can determine start and end key of each region and issue a scan for
> each region in parallel.
>
> Regards,
>     Mohammad Tariq
>
>
>
> On Mon, Aug 13, 2012 at 3:54 AM, lars hofhansl <lh...@yahoo.com> wrote:
>
>> Do you really have to retrieve all 200.000 each time?
>> Scan.setBatch(...) makes no difference?! (note that batching is different
>> and separate from caching).
>>
>> Also note that the scanner contract is to return sorted KVs, so a single
>> scan cannot be parallelized across RegionServers (well not entirely true,
>> it could be farmed off in parallel and then be presented to the client in
>> the right order - but HBase is not doing that). That is why one vs 12 RSs
>> makes no difference in this scenario.
>>
>> In the 12 node case you'll see low CPU on all but one RS, and each RS will
>> get its turn.
>>
>> In your case this is scanning 20.000.000 KVs serially in 400s, that's
>> 50000 KVs/s, which - depending on hardware - is not too bad for HBase (but
>> not great either).
>>
>> If you only ever expect to run a single query like this on top your
>> cluster (i.e. your concern is latency not throughput) you can do multiple
>> RPCs in parallel for a sub portion of your key range. Together with
>> batching can start using value before all is streamed back from the server.
>>
>>
>> -- Lars
>>
>>
>>
>> ----- Original Message -----
>> From: Gurjeet Singh <gu...@gmail.com>
>> To: user@hbase.apache.org
>> Cc:
>> Sent: Saturday, August 11, 2012 11:04 PM
>> Subject: Slow full-table scans
>>
>> Hi,
>>
>> I am trying to read all the data out of an HBase table using a scan
>> and it is extremely slow.
>>
>> Here are some characteristics of the data:
>>
>> 1. The total table size is tiny (~200MB)
>> 2. The table has ~100 rows and ~200,000 columns in a SINGLE family.
>> Thus the size of each cell is ~10bytes and the size of each row is
>> ~2MB
>> 3. Currently scanning the whole table takes ~400s (both in a
>> distributed setting with 12 nodes or so and on a single node), thus
>> 5sec/row
>> 4. The row keys are unique 8 byte crypto hashes of sequential numbers
>> 5. The scanner is set to fetch a FULL row at a time (scan.setBatch)
>> and is set to fetch 100MB of data at a time (scan.setCaching)
>> 6. Changing the caching size seems to have no effect on the total scan
>> time at all
>> 7. The column family is setup to keep a single version of the cells,
>> no compression, and no block cache.
>>
>> Am I missing something ? Is there a way to optimize this ?
>>
>> I guess a general question I have is whether HBase is good datastore
>> for storing many medium sized (~50GB), dense datasets with lots of
>> columns when a lot of the queries require full table scans ?
>>
>> Thanks!
>> Gurjeet
>>
>>

Re: Slow full-table scans

Posted by Mohammad Tariq <do...@gmail.com>.
Hello experts,

       Would it be feasible to create a separate thread for each region? I
mean we can determine the start and end key of each region and issue a scan
for each region in parallel.

Regards,
    Mohammad Tariq
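
[Editorial sketch] The per-region parallelism suggested here can be sketched without a live cluster. The per-region reads below are placeholder functions -- a real version would bound each Scan by the region boundaries (e.g. from HTable.getStartEndKeys() in the 0.9x-era client, an assumption about the API version in use) -- but the threading and the in-order merge are the point:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelRegionScan {

    // Stand-in for scanning one region. In real code this would open a
    // ResultScanner over a Scan bounded by that region's start/end keys.
    static List<String> scanRegion(int region) {
        List<String> rows = new ArrayList<String>();
        for (int r = 0; r < 3; r++) {
            rows.add("region" + region + "-row" + r);
        }
        return rows;
    }

    // Scan all regions in parallel, but collect results in region order
    // so the caller still sees rows in overall key order.
    static List<String> runAll(int numRegions) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(numRegions);
        List<Future<List<String>>> futures = new ArrayList<Future<List<String>>>();
        for (int i = 0; i < numRegions; i++) {
            final int region = i;
            futures.add(pool.submit(new Callable<List<String>>() {
                public List<String> call() { return scanRegion(region); }
            }));
        }
        List<String> all = new ArrayList<String>();
        for (Future<List<String>> f : futures) {
            all.addAll(f.get());   // blocks until that region's scan finishes
        }
        pool.shutdown();
        return all;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runAll(4).size());
    }
}
```

Because the futures are drained in region order, the client regains the sorted-row contract that a single server-side scan would have provided.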



On Mon, Aug 13, 2012 at 3:54 AM, lars hofhansl <lh...@yahoo.com> wrote:

> Do you really have to retrieve all 200.000 each time?
> Scan.setBatch(...) makes no difference?! (note that batching is different
> and separate from caching).
>
> Also note that the scanner contract is to return sorted KVs, so a single
> scan cannot be parallelized across RegionServers (well not entirely true,
> it could be farmed off in parallel and then be presented to the client in
> the right order - but HBase is not doing that). That is why one vs 12 RSs
> makes no difference in this scenario.
>
> In the 12 node case you'll see low CPU on all but one RS, and each RS will
> get its turn.
>
> In your case this is scanning 20.000.000 KVs serially in 400s, that's
> 50000 KVs/s, which - depending on hardware - is not too bad for HBase (but
> not great either).
>
> If you only ever expect to run a single query like this on top of your
> cluster (i.e. your concern is latency, not throughput) you can do multiple
> RPCs in parallel for sub-portions of your key range. Together with
> batching you can start consuming values before everything is streamed back
> from the server.
>
>
> -- Lars
>
>
>
> ----- Original Message -----
> From: Gurjeet Singh <gu...@gmail.com>
> To: user@hbase.apache.org
> Cc:
> Sent: Saturday, August 11, 2012 11:04 PM
> Subject: Slow full-table scans
>
> Hi,
>
> I am trying to read all the data out of an HBase table using a scan
> and it is extremely slow.
>
> Here are some characteristics of the data:
>
> 1. The total table size is tiny (~200MB)
> 2. The table has ~100 rows and ~200,000 columns in a SINGLE family.
> Thus the size of each cell is ~10bytes and the size of each row is
> ~2MB
> 3. Currently scanning the whole table takes ~400s (both in a
> distributed setting with 12 nodes or so and on a single node), thus
> 5sec/row
> 4. The row keys are unique 8 byte crypto hashes of sequential numbers
> 5. The scanner is set to fetch a FULL row at a time (scan.setBatch)
> and is set to fetch 100MB of data at a time (scan.setCaching)
> 6. Changing the caching size seems to have no effect on the total scan
> time at all
> 7. The column family is setup to keep a single version of the cells,
> no compression, and no block cache.
>
> Am I missing something ? Is there a way to optimize this ?
>
> I guess a general question I have is whether HBase is good datastore
> for storing many medium sized (~50GB), dense datasets with lots of
> columns when a lot of the queries require full table scans ?
>
> Thanks!
> Gurjeet
>
>

Re: Slow full-table scans

Posted by Gurjeet Singh <gu...@gmail.com>.
Okay, I just ran extensive tests with my minimal test case and you are
correct, the old and the new version do the scans in about the same
amount of time (although puts are MUCH faster in the packed scheme).

I guess my test case is too minimal. I will try to make a better
testcase since in my production code, there is still a 500x
difference.

Gurjeet
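
[Editorial sketch] The packed scheme described in this thread (hundreds of values per cell) can be illustrated with plain byte-buffer encoding. The segment layout below is an assumption for illustration, not the actual code from the linked gists:

```java
import java.nio.ByteBuffer;

public class PackedCells {

    // Pack a segment of double values into one byte[] cell value:
    // a 4-byte count followed by 8 bytes per value.
    static byte[] pack(double[] values) {
        ByteBuffer buf = ByteBuffer.allocate(4 + 8 * values.length);
        buf.putInt(values.length);          // number of packed columns
        for (double v : values) {
            buf.putDouble(v);
        }
        return buf.array();
    }

    // Unpack a cell value back into its logical columns.
    static double[] unpack(byte[] cell) {
        ByteBuffer buf = ByteBuffer.wrap(cell);
        double[] values = new double[buf.getInt()];
        for (int i = 0; i < values.length; i++) {
            values[i] = buf.getDouble();
        }
        return values;
    }
}
```

With segments of ~500 values, the per-KeyValue overhead (row key, column qualifier, timestamp, length fields) is paid once per segment rather than once per value, which matches the IO-overhead explanation given elsewhere in the thread.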

On Tue, Aug 21, 2012 at 10:00 PM, J Mohamed Zahoor <jm...@gmail.com> wrote:
> Try a quick TestDFSIO to see if things are okay.
>
> ./zahoor
>
> On Wed, Aug 22, 2012 at 6:26 AM, Mohit Anchlia <mo...@gmail.com>wrote:
>
>> It's possible that there is a bad or slower disk on Gurjeet's machine. I
>> think details of iostat and cpu would clear things up.
>>
>> On Tue, Aug 21, 2012 at 4:33 PM, lars hofhansl <lh...@yahoo.com>
>> wrote:
>>
>> > I get roughly the same (~1.8s) - 100 rows, 200.000 columns, segment size
>> > 100
>> >
>> >
>> >
>> > ________________________________
>> >  From: Gurjeet Singh <gu...@gmail.com>
>> > To: user@hbase.apache.org; lars hofhansl <lh...@yahoo.com>
>> > Sent: Tuesday, August 21, 2012 11:31 AM
>> >  Subject: Re: Slow full-table scans
>> >
>> > How does that compare with the newScanTable on your build ?
>> >
>> > Gurjeet
>> >
>> > On Tue, Aug 21, 2012 at 11:18 AM, lars hofhansl <lh...@yahoo.com>
>> > wrote:
>> > > Hmm... So I tried in HBase (current trunk).
>> > > I created 100 rows with 200.000 columns each (using your oldMakeTable).
>> > The creation took a bit, but scanning finished in 1.8s. (HBase in pseudo
>> > distributed mode - with your oldScanTable).
>> > >
>> > > -- Lars
>> > >
>> > >
>> > >
>> > > ----- Original Message -----
>> > > From: lars hofhansl <lh...@yahoo.com>
>> > > To: "user@hbase.apache.org" <us...@hbase.apache.org>
>> > > Cc:
>> > > Sent: Monday, August 20, 2012 7:50 PM
>> > > Subject: Re: Slow full-table scans
>> > >
>> > > Thanks Gurjeet,
>> > >
>> > > I'll (hopefully) have a look tomorrow.
>> > >
>> > > -- Lars
>> > >
>> > >
>> > >
>> > > ----- Original Message -----
>> > > From: Gurjeet Singh <gu...@gmail.com>
>> > > To: user@hbase.apache.org; lars hofhansl <lh...@yahoo.com>
>> > > Cc:
>> > > Sent: Monday, August 20, 2012 7:42 PM
>> > > Subject: Re: Slow full-table scans
>> > >
>> > > Hi Lars,
>> > >
>> > > Here is a testcase:
>> > >
>> > > https://gist.github.com/3410948
>> > >
>> > > Benchmarking code:
>> > >
>> > > https://gist.github.com/3410952
>> > >
>> > > Try running it with numRows = 100, numCols = 200000, segmentSize = 1000
>> > >
>> > > Gurjeet
>> > >
>> > >
>> > > On Thu, Aug 16, 2012 at 11:40 AM, Gurjeet Singh <gu...@gmail.com>
>> > wrote:
>> > >> Sure - I can create a minimal testcase and send it along.
>> > >>
>> > >> Gurjeet
>> > >>
>> > >> On Thu, Aug 16, 2012 at 11:36 AM, lars hofhansl <lh...@yahoo.com>
>> > wrote:
>> > >>> That's interesting.
>> > >>> Could you share your old and new schema. I would like to track down
>> > the performance problems you saw.
>> > >>> (If you had a demo program that populates your rows with 200.000
>> > columns in a way where you saw the performance issues, that'd be even
>> > better, but not necessary).
>> > >>>
>> > >>>
>> > >>> -- Lars
>> > >>>
>> > >>>
>> > >>>
>> > >>> ________________________________
>> > >>>  From: Gurjeet Singh <gu...@gmail.com>
>> > >>> To: user@hbase.apache.org; lars hofhansl <lh...@yahoo.com>
>> > >>> Sent: Thursday, August 16, 2012 11:26 AM
>> > >>> Subject: Re: Slow full-table scans
>> > >>>
>> > >>> Sorry for the delay guys.
>> > >>>
>> > >>> Here are a few results:
>> > >>>
>> > >>> 1. Regions in the table = 11
>> > >>> 2. The region servers don't appear to be very busy with the query ~5%
>> > >>> CPU (but with parallelization, they are all busy)
>> > >>>
>> > >>> Finally, I changed the format of my data, such that each cell in
>> HBase
>> > >>> contains a chunk of a row instead of the single value it had. So,
>> > >>> stuffing each Hbase cell with 500 columns of a row, gave me a
>> > >>> performance boost of 1000x. It seems that the underlying issue was IO
>> > >>> overhead per byte of actual data stored.
>> > >>>
>> > >>>
>> > >>> On Wed, Aug 15, 2012 at 5:16 PM, lars hofhansl <lh...@yahoo.com>
>> > wrote:
>> > >>>> Yeah... It looks OK.
>> > >>>> Maybe 2G of heap is a bit low when dealing with 200.000 column rows.
>> > >>>>
>> > >>>>
>> > >>>> If you can I'd like to know how busy your regionservers are during
>> > these operations. That would be an indication on whether the
>> > parallelization is good or not.
>> > >>>>
>> > >>>> -- Lars
>> > >>>>
>> > >>>>
>> > >>>> ----- Original Message -----
>> > >>>> From: Stack <st...@duboce.net>
>> > >>>> To: user@hbase.apache.org
>> > >>>> Cc:
>> > >>>> Sent: Wednesday, August 15, 2012 3:13 PM
>> > >>>> Subject: Re: Slow full-table scans
>> > >>>>
>> > >>>> On Mon, Aug 13, 2012 at 6:10 PM, Gurjeet Singh <gu...@gmail.com>
>> > wrote:
>> > >>>>> I am beginning to think that this is a configuration issue on my
>> > >>>>> cluster. Do the following configuration files seem sane ?
>> > >>>>>
>> > >>>>> hbase-env.sh    https://gist.github.com/3345338
>> > >>>>>
>> > >>>>
>> > >>>> Nothing wrong w/ this (Remove the -ea, you don't want asserts in
>> > >>>> production, and the -XX:+CMSIncrementalMode flag if >= 2 cores).
>> > >>>>
>> > >>>>
>> > >>>>> hbase-site.xml    https://gist.github.com/3345356
>> > >>>>>
>> > >>>>
>> >>>> This is all defaults effectively.   I don't see any of the configs
>> >>>> recommended by the performance section of the reference guide and/or
>> > >>>> those suggested by the GBIF blog.
>> > >>>>
>> > >>>> You don't answer LarsH's query about where you see the 4%
>> difference.
>> > >>>>
>> >>>> How many regions in your table?  What does the HBase Master UI look like
>> > >>>> when this scan is running?
>> > >>>> St.Ack
>> > >>>>
>> >
>>

Re: Slow full-table scans

Posted by J Mohamed Zahoor <jm...@gmail.com>.
Try a quick TestDFSIO to see if things are okay.

./zahoor

On Wed, Aug 22, 2012 at 6:26 AM, Mohit Anchlia <mo...@gmail.com>wrote:

> It's possible that there is a bad or slower disk on Gurjeet's machine. I
> think details of iostat and cpu would clear things up.
>
> On Tue, Aug 21, 2012 at 4:33 PM, lars hofhansl <lh...@yahoo.com>
> wrote:
>
> > I get roughly the same (~1.8s) - 100 rows, 200.000 columns, segment size
> > 100
> >
> >
> >
> > ________________________________
> >  From: Gurjeet Singh <gu...@gmail.com>
> > To: user@hbase.apache.org; lars hofhansl <lh...@yahoo.com>
> > Sent: Tuesday, August 21, 2012 11:31 AM
> >  Subject: Re: Slow full-table scans
> >
> > How does that compare with the newScanTable on your build ?
> >
> > Gurjeet
> >
> > On Tue, Aug 21, 2012 at 11:18 AM, lars hofhansl <lh...@yahoo.com>
> > wrote:
> > > Hmm... So I tried in HBase (current trunk).
> > > I created 100 rows with 200.000 columns each (using your oldMakeTable).
> > The creation took a bit, but scanning finished in 1.8s. (HBase in pseudo
> > distributed mode - with your oldScanTable).
> > >
> > > -- Lars
> > >
> > >
> > >
> > > ----- Original Message -----
> > > From: lars hofhansl <lh...@yahoo.com>
> > > To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > > Cc:
> > > Sent: Monday, August 20, 2012 7:50 PM
> > > Subject: Re: Slow full-table scans
> > >
> > > Thanks Gurjeet,
> > >
> > > I'll (hopefully) have a look tomorrow.
> > >
> > > -- Lars
> > >
> > >
> > >
> > > ----- Original Message -----
> > > From: Gurjeet Singh <gu...@gmail.com>
> > > To: user@hbase.apache.org; lars hofhansl <lh...@yahoo.com>
> > > Cc:
> > > Sent: Monday, August 20, 2012 7:42 PM
> > > Subject: Re: Slow full-table scans
> > >
> > > Hi Lars,
> > >
> > > Here is a testcase:
> > >
> > > https://gist.github.com/3410948
> > >
> > > Benchmarking code:
> > >
> > > https://gist.github.com/3410952
> > >
> > > Try running it with numRows = 100, numCols = 200000, segmentSize = 1000
> > >
> > > Gurjeet
> > >
> > >
> > > On Thu, Aug 16, 2012 at 11:40 AM, Gurjeet Singh <gu...@gmail.com>
> > wrote:
> > >> Sure - I can create a minimal testcase and send it along.
> > >>
> > >> Gurjeet
> > >>
> > >> On Thu, Aug 16, 2012 at 11:36 AM, lars hofhansl <lh...@yahoo.com>
> > wrote:
> > >>> That's interesting.
> > >>> Could you share your old and new schema. I would like to track down
> > the performance problems you saw.
> > >>> (If you had a demo program that populates your rows with 200.000
> > columns in a way where you saw the performance issues, that'd be even
> > better, but not necessary).
> > >>>
> > >>>
> > >>> -- Lars
> > >>>
> > >>>
> > >>>
> > >>> ________________________________
> > >>>  From: Gurjeet Singh <gu...@gmail.com>
> > >>> To: user@hbase.apache.org; lars hofhansl <lh...@yahoo.com>
> > >>> Sent: Thursday, August 16, 2012 11:26 AM
> > >>> Subject: Re: Slow full-table scans
> > >>>
> > >>> Sorry for the delay guys.
> > >>>
> > >>> Here are a few results:
> > >>>
> > >>> 1. Regions in the table = 11
> > >>> 2. The region servers don't appear to be very busy with the query ~5%
> > >>> CPU (but with parallelization, they are all busy)
> > >>>
> > >>> Finally, I changed the format of my data, such that each cell in
> HBase
> > >>> contains a chunk of a row instead of the single value it had. So,
> > >>> stuffing each Hbase cell with 500 columns of a row, gave me a
> > >>> performance boost of 1000x. It seems that the underlying issue was IO
> > >>> overhead per byte of actual data stored.
> > >>>
> > >>>
> > >>> On Wed, Aug 15, 2012 at 5:16 PM, lars hofhansl <lh...@yahoo.com>
> > wrote:
> > >>>> Yeah... It looks OK.
> > >>>> Maybe 2G of heap is a bit low when dealing with 200.000 column rows.
> > >>>>
> > >>>>
> > >>>> If you can I'd like to know how busy your regionservers are during
> > these operations. That would be an indication on whether the
> > parallelization is good or not.
> > >>>>
> > >>>> -- Lars
> > >>>>
> > >>>>
> > >>>> ----- Original Message -----
> > >>>> From: Stack <st...@duboce.net>
> > >>>> To: user@hbase.apache.org
> > >>>> Cc:
> > >>>> Sent: Wednesday, August 15, 2012 3:13 PM
> > >>>> Subject: Re: Slow full-table scans
> > >>>>
> > >>>> On Mon, Aug 13, 2012 at 6:10 PM, Gurjeet Singh <gu...@gmail.com>
> > wrote:
> > >>>>> I am beginning to think that this is a configuration issue on my
> > >>>>> cluster. Do the following configuration files seem sane ?
> > >>>>>
> > >>>>> hbase-env.sh    https://gist.github.com/3345338
> > >>>>>
> > >>>>
> > >>>> Nothing wrong w/ this (Remove the -ea, you don't want asserts in
> > >>>> production, and the -XX:+CMSIncrementalMode flag if >= 2 cores).
> > >>>>
> > >>>>
> > >>>>> hbase-site.xml    https://gist.github.com/3345356
> > >>>>>
> > >>>>
> > >>>> This is all defaults effectively.   I don't see any of the configs
> > >>>> recommended by the performance section of the reference guide and/or
> > >>>> those suggested by the GBIF blog.
> > >>>>
> > >>>> You don't answer LarsH's query about where you see the 4%
> difference.
> > >>>>
> > >>>> How many regions in your table?  What does the HBase Master UI look like
> > >>>> when this scan is running?
> > >>>> St.Ack
> > >>>>
> >
>

Re: Slow full-table scans

Posted by Mohit Anchlia <mo...@gmail.com>.
It's possible that there is a bad or slower disk on Gurjeet's machine. I
think details of iostat and cpu would clear things up.

On Tue, Aug 21, 2012 at 4:33 PM, lars hofhansl <lh...@yahoo.com> wrote:

> I get roughly the same (~1.8s) - 100 rows, 200.000 columns, segment size
> 100
>
>
>
> ________________________________
>  From: Gurjeet Singh <gu...@gmail.com>
> To: user@hbase.apache.org; lars hofhansl <lh...@yahoo.com>
> Sent: Tuesday, August 21, 2012 11:31 AM
>  Subject: Re: Slow full-table scans
>
> How does that compare with the newScanTable on your build ?
>
> Gurjeet
>
> On Tue, Aug 21, 2012 at 11:18 AM, lars hofhansl <lh...@yahoo.com>
> wrote:
> > Hmm... So I tried in HBase (current trunk).
> > I created 100 rows with 200.000 columns each (using your oldMakeTable).
> The creation took a bit, but scanning finished in 1.8s. (HBase in pseudo
> distributed mode - with your oldScanTable).
> >
> > -- Lars
> >
> >
> >
> > ----- Original Message -----
> > From: lars hofhansl <lh...@yahoo.com>
> > To: "user@hbase.apache.org" <us...@hbase.apache.org>
> > Cc:
> > Sent: Monday, August 20, 2012 7:50 PM
> > Subject: Re: Slow full-table scans
> >
> > Thanks Gurjeet,
> >
> > I'll (hopefully) have a look tomorrow.
> >
> > -- Lars
> >
> >
> >
> > ----- Original Message -----
> > From: Gurjeet Singh <gu...@gmail.com>
> > To: user@hbase.apache.org; lars hofhansl <lh...@yahoo.com>
> > Cc:
> > Sent: Monday, August 20, 2012 7:42 PM
> > Subject: Re: Slow full-table scans
> >
> > Hi Lars,
> >
> > Here is a testcase:
> >
> > https://gist.github.com/3410948
> >
> > Benchmarking code:
> >
> > https://gist.github.com/3410952
> >
> > Try running it with numRows = 100, numCols = 200000, segmentSize = 1000
> >
> > Gurjeet
> >
> >
> > On Thu, Aug 16, 2012 at 11:40 AM, Gurjeet Singh <gu...@gmail.com>
> wrote:
> >> Sure - I can create a minimal testcase and send it along.
> >>
> >> Gurjeet
> >>
> >> On Thu, Aug 16, 2012 at 11:36 AM, lars hofhansl <lh...@yahoo.com>
> wrote:
> >>> That's interesting.
> >>> Could you share your old and new schema. I would like to track down
> the performance problems you saw.
> >>> (If you had a demo program that populates your rows with 200.000
> columns in a way where you saw the performance issues, that'd be even
> better, but not necessary).
> >>>
> >>>
> >>> -- Lars
> >>>
> >>>
> >>>
> >>> ________________________________
> >>>  From: Gurjeet Singh <gu...@gmail.com>
> >>> To: user@hbase.apache.org; lars hofhansl <lh...@yahoo.com>
> >>> Sent: Thursday, August 16, 2012 11:26 AM
> >>> Subject: Re: Slow full-table scans
> >>>
> >>> Sorry for the delay guys.
> >>>
> >>> Here are a few results:
> >>>
> >>> 1. Regions in the table = 11
> >>> 2. The region servers don't appear to be very busy with the query ~5%
> >>> CPU (but with parallelization, they are all busy)
> >>>
> >>> Finally, I changed the format of my data, such that each cell in HBase
> >>> contains a chunk of a row instead of the single value it had. So,
> >>> stuffing each Hbase cell with 500 columns of a row, gave me a
> >>> performance boost of 1000x. It seems that the underlying issue was IO
> >>> overhead per byte of actual data stored.
> >>>
> >>>
> >>> On Wed, Aug 15, 2012 at 5:16 PM, lars hofhansl <lh...@yahoo.com>
> wrote:
> >>>> Yeah... It looks OK.
> >>>> Maybe 2G of heap is a bit low when dealing with 200.000 column rows.
> >>>>
> >>>>
> >>>> If you can I'd like to know how busy your regionservers are during
> these operations. That would be an indication on whether the
> parallelization is good or not.
> >>>>
> >>>> -- Lars
> >>>>
> >>>>
> >>>> ----- Original Message -----
> >>>> From: Stack <st...@duboce.net>
> >>>> To: user@hbase.apache.org
> >>>> Cc:
> >>>> Sent: Wednesday, August 15, 2012 3:13 PM
> >>>> Subject: Re: Slow full-table scans
> >>>>
> >>>> On Mon, Aug 13, 2012 at 6:10 PM, Gurjeet Singh <gu...@gmail.com>
> wrote:
> >>>>> I am beginning to think that this is a configuration issue on my
> >>>>> cluster. Do the following configuration files seem sane ?
> >>>>>
> >>>>> hbase-env.sh    https://gist.github.com/3345338
> >>>>>
> >>>>
> >>>> Nothing wrong w/ this (Remove the -ea, you don't want asserts in
> >>>> production, and the -XX:+CMSIncrementalMode flag if >= 2 cores).
> >>>>
> >>>>
> >>>>> hbase-site.xml    https://gist.github.com/3345356
> >>>>>
> >>>>
> >>>> This is all defaults effectively.   I don't see any of the configs
> >>>> recommended by the performance section of the reference guide and/or
> >>>> those suggested by the GBIF blog.
> >>>>
> >>>> You don't answer LarsH's query about where you see the 4% difference.
> >>>>
> >>>> How many regions in your table?  What does the HBase Master UI look like
> >>>> when this scan is running?
> >>>> St.Ack
> >>>>
>

Re: Slow full-table scans

Posted by lars hofhansl <lh...@yahoo.com>.
I get roughly the same (~1.8s) - 100 rows, 200.000 columns, segment size 100



________________________________
 From: Gurjeet Singh <gu...@gmail.com>
To: user@hbase.apache.org; lars hofhansl <lh...@yahoo.com> 
Sent: Tuesday, August 21, 2012 11:31 AM
Subject: Re: Slow full-table scans
 
How does that compare with the newScanTable on your build ?

Gurjeet

On Tue, Aug 21, 2012 at 11:18 AM, lars hofhansl <lh...@yahoo.com> wrote:
> Hmm... So I tried in HBase (current trunk).
> I created 100 rows with 200.000 columns each (using your oldMakeTable). The creation took a bit, but scanning finished in 1.8s. (HBase in pseudo distributed mode - with your oldScanTable).
>
> -- Lars
>
>
>
> ----- Original Message -----
> From: lars hofhansl <lh...@yahoo.com>
> To: "user@hbase.apache.org" <us...@hbase.apache.org>
> Cc:
> Sent: Monday, August 20, 2012 7:50 PM
> Subject: Re: Slow full-table scans
>
> Thanks Gurjeet,
>
> I'll (hopefully) have a look tomorrow.
>
> -- Lars
>
>
>
> ----- Original Message -----
> From: Gurjeet Singh <gu...@gmail.com>
> To: user@hbase.apache.org; lars hofhansl <lh...@yahoo.com>
> Cc:
> Sent: Monday, August 20, 2012 7:42 PM
> Subject: Re: Slow full-table scans
>
> Hi Lars,
>
> Here is a testcase:
>
> https://gist.github.com/3410948
>
> Benchmarking code:
>
> https://gist.github.com/3410952
>
> Try running it with numRows = 100, numCols = 200000, segmentSize = 1000
>
> Gurjeet
>
>
> On Thu, Aug 16, 2012 at 11:40 AM, Gurjeet Singh <gu...@gmail.com> wrote:
>> Sure - I can create a minimal testcase and send it along.
>>
>> Gurjeet
>>
>> On Thu, Aug 16, 2012 at 11:36 AM, lars hofhansl <lh...@yahoo.com> wrote:
>>> That's interesting.
>>> Could you share your old and new schema. I would like to track down the performance problems you saw.
>>> (If you had a demo program that populates your rows with 200.000 columns in a way where you saw the performance issues, that'd be even better, but not necessary).
>>>
>>>
>>> -- Lars
>>>
>>>
>>>
>>> ________________________________
>>>  From: Gurjeet Singh <gu...@gmail.com>
>>> To: user@hbase.apache.org; lars hofhansl <lh...@yahoo.com>
>>> Sent: Thursday, August 16, 2012 11:26 AM
>>> Subject: Re: Slow full-table scans
>>>
>>> Sorry for the delay guys.
>>>
>>> Here are a few results:
>>>
>>> 1. Regions in the table = 11
>>> 2. The region servers don't appear to be very busy with the query ~5%
>>> CPU (but with parallelization, they are all busy)
>>>
>>> Finally, I changed the format of my data, such that each cell in HBase
>>> contains a chunk of a row instead of the single value it had. So,
>>> stuffing each Hbase cell with 500 columns of a row, gave me a
>>> performance boost of 1000x. It seems that the underlying issue was IO
>>> overhead per byte of actual data stored.
>>>
>>>
>>> On Wed, Aug 15, 2012 at 5:16 PM, lars hofhansl <lh...@yahoo.com> wrote:
>>>> Yeah... It looks OK.
>>>> Maybe 2G of heap is a bit low when dealing with 200.000 column rows.
>>>>
>>>>
>>>> If you can I'd like to know how busy your regionservers are during these operations. That would be an indication on whether the parallelization is good or not.
>>>>
>>>> -- Lars
>>>>
>>>>
>>>> ----- Original Message -----
>>>> From: Stack <st...@duboce.net>
>>>> To: user@hbase.apache.org
>>>> Cc:
>>>> Sent: Wednesday, August 15, 2012 3:13 PM
>>>> Subject: Re: Slow full-table scans
>>>>
>>>> On Mon, Aug 13, 2012 at 6:10 PM, Gurjeet Singh <gu...@gmail.com> wrote:
>>>>> I am beginning to think that this is a configuration issue on my
>>>>> cluster. Do the following configuration files seem sane ?
>>>>>
>>>>> hbase-env.sh    https://gist.github.com/3345338
>>>>>
>>>>
>>>> Nothing wrong w/ this (Remove the -ea, you don't want asserts in
>>>> production, and the -XX:+CMSIncrementalMode flag if >= 2 cores).
>>>>
>>>>
>>>>> hbase-site.xml    https://gist.github.com/3345356
>>>>>
>>>>
>>>> This is all defaults effectively.   I don't see any of the configs
>>>> recommended by the performance section of the reference guide and/or
>>>> those suggested by the GBIF blog.
>>>>
>>>> You don't answer LarsH's query about where you see the 4% difference.
>>>>
>>>> How many regions in your table?  What does the HBase Master UI look like
>>>> when this scan is running?
>>>> St.Ack
>>>>

Re: Slow full-table scans

Posted by Gurjeet Singh <gu...@gmail.com>.
How does that compare with the newScanTable on your build ?

Gurjeet

On Tue, Aug 21, 2012 at 11:18 AM, lars hofhansl <lh...@yahoo.com> wrote:
> Hmm... So I tried in HBase (current trunk).
> I created 100 rows with 200.000 columns each (using your oldMakeTable). The creation took a bit, but scanning finished in 1.8s. (HBase in pseudo distributed mode - with your oldScanTable).
>
> -- Lars
>
>
>
> ----- Original Message -----
> From: lars hofhansl <lh...@yahoo.com>
> To: "user@hbase.apache.org" <us...@hbase.apache.org>
> Cc:
> Sent: Monday, August 20, 2012 7:50 PM
> Subject: Re: Slow full-table scans
>
> Thanks Gurjeet,
>
> I'll (hopefully) have a look tomorrow.
>
> -- Lars
>
>
>
> ----- Original Message -----
> From: Gurjeet Singh <gu...@gmail.com>
> To: user@hbase.apache.org; lars hofhansl <lh...@yahoo.com>
> Cc:
> Sent: Monday, August 20, 2012 7:42 PM
> Subject: Re: Slow full-table scans
>
> Hi Lars,
>
> Here is a testcase:
>
> https://gist.github.com/3410948
>
> Benchmarking code:
>
> https://gist.github.com/3410952
>
> Try running it with numRows = 100, numCols = 200000, segmentSize = 1000
>
> Gurjeet
>
>
> On Thu, Aug 16, 2012 at 11:40 AM, Gurjeet Singh <gu...@gmail.com> wrote:
>> Sure - I can create a minimal testcase and send it along.
>>
>> Gurjeet
>>
>> On Thu, Aug 16, 2012 at 11:36 AM, lars hofhansl <lh...@yahoo.com> wrote:
>>> That's interesting.
>>> Could you share your old and new schema. I would like to track down the performance problems you saw.
>>> (If you had a demo program that populates your rows with 200.000 columns in a way where you saw the performance issues, that'd be even better, but not necessary).
>>>
>>>
>>> -- Lars
>>>
>>>
>>>
>>> ________________________________
>>>  From: Gurjeet Singh <gu...@gmail.com>
>>> To: user@hbase.apache.org; lars hofhansl <lh...@yahoo.com>
>>> Sent: Thursday, August 16, 2012 11:26 AM
>>> Subject: Re: Slow full-table scans
>>>
>>> Sorry for the delay guys.
>>>
>>> Here are a few results:
>>>
>>> 1. Regions in the table = 11
>>> 2. The region servers don't appear to be very busy with the query ~5%
>>> CPU (but with parallelization, they are all busy)
>>>
>>> Finally, I changed the format of my data, such that each cell in HBase
>>> contains a chunk of a row instead of the single value it had. So,
>>> stuffing each Hbase cell with 500 columns of a row, gave me a
>>> performance boost of 1000x. It seems that the underlying issue was IO
>>> overhead per byte of actual data stored.
>>>
>>>
>>> On Wed, Aug 15, 2012 at 5:16 PM, lars hofhansl <lh...@yahoo.com> wrote:
>>>> Yeah... It looks OK.
>>>> Maybe 2G of heap is a bit low when dealing with 200.000 column rows.
>>>>
>>>>
>>>> If you can I'd like to know how busy your regionservers are during these operations. That would be an indication on whether the parallelization is good or not.
>>>>
>>>> -- Lars
>>>>
>>>>
>>>> ----- Original Message -----
>>>> From: Stack <st...@duboce.net>
>>>> To: user@hbase.apache.org
>>>> Cc:
>>>> Sent: Wednesday, August 15, 2012 3:13 PM
>>>> Subject: Re: Slow full-table scans
>>>>
>>>> On Mon, Aug 13, 2012 at 6:10 PM, Gurjeet Singh <gu...@gmail.com> wrote:
>>>>> I am beginning to think that this is a configuration issue on my
>>>>> cluster. Do the following configuration files seem sane ?
>>>>>
>>>>> hbase-env.sh    https://gist.github.com/3345338
>>>>>
>>>>
>>>> Nothing wrong w/ this (Remove the -ea, you don't want asserts in
>>>> production, and the -XX:+CMSIncrementalMode flag if >= 2 cores).
>>>>
>>>>
>>>>> hbase-site.xml    https://gist.github.com/3345356
>>>>>
>>>>
>>>> This is all defaults effectively.   I don't see any of the configs
>>>> recommended by the performance section of the reference guide and/or
>>>> those suggested by the GBIF blog.
>>>>
>>>> You don't answer LarsH's query about where you see the 4% difference.
>>>>
>>>> How many regions in your table?  What does the HBase Master UI look like
>>>> when this scan is running?
>>>> St.Ack
>>>>

Re: Slow full-table scans

Posted by lars hofhansl <lh...@yahoo.com>.
Hmm... So I tried in HBase (current trunk).
I created 100 rows with 200.000 columns each (using your oldMakeTable). The creation took a bit, but scanning finished in 1.8s. (HBase in pseudo distributed mode - with your oldScanTable).

-- Lars



----- Original Message -----
From: lars hofhansl <lh...@yahoo.com>
To: "user@hbase.apache.org" <us...@hbase.apache.org>
Cc: 
Sent: Monday, August 20, 2012 7:50 PM
Subject: Re: Slow full-table scans

Thanks Gurjeet,

I'll (hopefully) have a look tomorrow.

-- Lars



----- Original Message -----
From: Gurjeet Singh <gu...@gmail.com>
To: user@hbase.apache.org; lars hofhansl <lh...@yahoo.com>
Cc: 
Sent: Monday, August 20, 2012 7:42 PM
Subject: Re: Slow full-table scans

Hi Lars,

Here is a testcase:

https://gist.github.com/3410948

Benchmarking code:

https://gist.github.com/3410952

Try running it with numRows = 100, numCols = 200000, segmentSize = 1000

Gurjeet


On Thu, Aug 16, 2012 at 11:40 AM, Gurjeet Singh <gu...@gmail.com> wrote:
> Sure - I can create a minimal testcase and send it along.
>
> Gurjeet
>
> On Thu, Aug 16, 2012 at 11:36 AM, lars hofhansl <lh...@yahoo.com> wrote:
>> That's interesting.
>> Could you share your old and new schema. I would like to track down the performance problems you saw.
>> (If you had a demo program that populates your rows with 200.000 columns in a way where you saw the performance issues, that'd be even better, but not necessary).
>>
>>
>> -- Lars
>>
>>
>>
>> ________________________________
>>  From: Gurjeet Singh <gu...@gmail.com>
>> To: user@hbase.apache.org; lars hofhansl <lh...@yahoo.com>
>> Sent: Thursday, August 16, 2012 11:26 AM
>> Subject: Re: Slow full-table scans
>>
>> Sorry for the delay guys.
>>
>> Here are a few results:
>>
>> 1. Regions in the table = 11
>> 2. The region servers don't appear to be very busy with the query ~5%
>> CPU (but with parallelization, they are all busy)
>>
>> Finally, I changed the format of my data, such that each cell in HBase
>> contains a chunk of a row instead of the single value it had. So,
>> stuffing each Hbase cell with 500 columns of a row, gave me a
>> performance boost of 1000x. It seems that the underlying issue was IO
>> overhead per byte of actual data stored.
>>
>>
>> On Wed, Aug 15, 2012 at 5:16 PM, lars hofhansl <lh...@yahoo.com> wrote:
>>> Yeah... It looks OK.
>>> Maybe 2G of heap is a bit low when dealing with 200.000 column rows.
>>>
>>>
>>> If you can I'd like to know how busy your regionservers are during these operations. That would be an indication on whether the parallelization is good or not.
>>>
>>> -- Lars
>>>
>>>
>>> ----- Original Message -----
>>> From: Stack <st...@duboce.net>
>>> To: user@hbase.apache.org
>>> Cc:
>>> Sent: Wednesday, August 15, 2012 3:13 PM
>>> Subject: Re: Slow full-table scans
>>>
>>> On Mon, Aug 13, 2012 at 6:10 PM, Gurjeet Singh <gu...@gmail.com> wrote:
>>>> I am beginning to think that this is a configuration issue on my
>>>> cluster. Do the following configuration files seem sane ?
>>>>
>>>> hbase-env.sh    https://gist.github.com/3345338
>>>>
>>>
>>> Nothing wrong w/ this (Remove the -ea, you don't want asserts in
>>> production, and the -XX:+CMSIncrementalMode flag if >= 2 cores).
>>>
>>>
>>>> hbase-site.xml    https://gist.github.com/3345356
>>>>
>>>
>>> This is all defaults effectively.   I don't see any of the configs
>>> recommended by the performance section of the reference guide and/or
>>> those suggested by the GBIF blog.
>>>
>>> You don't answer LarsH's query about where you see the 4% difference.
>>>
>>> How many regions in your table?  What does the HBase Master UI look like
>>> when this scan is running?
>>> St.Ack
>>>

Re: Slow full-table scans

Posted by lars hofhansl <lh...@yahoo.com>.
Thanks Gurjeet,

I'll (hopefully) have a look tomorrow.

-- Lars



----- Original Message -----
From: Gurjeet Singh <gu...@gmail.com>
To: user@hbase.apache.org; lars hofhansl <lh...@yahoo.com>
Cc: 
Sent: Monday, August 20, 2012 7:42 PM
Subject: Re: Slow full-table scans

Hi Lars,

Here is a testcase:

https://gist.github.com/3410948

Benchmarking code:

https://gist.github.com/3410952

Try running it with numRows = 100, numCols = 200000, segmentSize = 1000

Gurjeet


On Thu, Aug 16, 2012 at 11:40 AM, Gurjeet Singh <gu...@gmail.com> wrote:
> Sure - I can create a minimal testcase and send it along.
>
> Gurjeet
>
> On Thu, Aug 16, 2012 at 11:36 AM, lars hofhansl <lh...@yahoo.com> wrote:
>> That's interesting.
>> Could you share your old and new schema. I would like to track down the performance problems you saw.
>> (If you had a demo program that populates your rows with 200.000 columns in a way where you saw the performance issues, that'd be even better, but not necessary).
>>
>>
>> -- Lars
>>
>>
>>
>> ________________________________
>>  From: Gurjeet Singh <gu...@gmail.com>
>> To: user@hbase.apache.org; lars hofhansl <lh...@yahoo.com>
>> Sent: Thursday, August 16, 2012 11:26 AM
>> Subject: Re: Slow full-table scans
>>
>> Sorry for the delay guys.
>>
>> Here are a few results:
>>
>> 1. Regions in the table = 11
>> 2. The region servers don't appear to be very busy with the query ~5%
>> CPU (but with parallelization, they are all busy)
>>
>> Finally, I changed the format of my data, such that each cell in HBase
>> contains a chunk of a row instead of the single value it had. So,
>> stuffing each Hbase cell with 500 columns of a row, gave me a
>> performance boost of 1000x. It seems that the underlying issue was IO
>> overhead per byte of actual data stored.
>>
>>
>> On Wed, Aug 15, 2012 at 5:16 PM, lars hofhansl <lh...@yahoo.com> wrote:
>>> Yeah... It looks OK.
>>> Maybe 2G of heap is a bit low when dealing with 200.000 column rows.
>>>
>>>
>>> If you can I'd like to know how busy your regionservers are during these operations. That would be an indication on whether the parallelization is good or not.
>>>
>>> -- Lars
>>>
>>>
>>> ----- Original Message -----
>>> From: Stack <st...@duboce.net>
>>> To: user@hbase.apache.org
>>> Cc:
>>> Sent: Wednesday, August 15, 2012 3:13 PM
>>> Subject: Re: Slow full-table scans
>>>
>>> On Mon, Aug 13, 2012 at 6:10 PM, Gurjeet Singh <gu...@gmail.com> wrote:
>>>> I am beginning to think that this is a configuration issue on my
>>>> cluster. Do the following configuration files seem sane ?
>>>>
>>>> hbase-env.sh    https://gist.github.com/3345338
>>>>
>>>
>>> Nothing wrong w/ this (Remove the -ea, you don't want asserts in
>>> production, and the -XX:+CMSIncrementalMode flag if >= 2 cores).
>>>
>>>
>>>> hbase-site.xml    https://gist.github.com/3345356
>>>>
>>>
>>> This is all defaults effectively.   I don't see any of the configs.
>>> recommended by the performance section of the reference guide and/or
>>> those suggested by the GBIF blog.
>>>
>>> You don't answer LarsH's query about where you see the 4% difference.
>>>
>>> How many regions in your table?  Whats the HBase Master UI look like
>>> when this scan is running?
>>> St.Ack
>>>


Re: Slow full-table scans

Posted by Gurjeet Singh <gu...@gmail.com>.
Hi Lars,

Here is a testcase:

https://gist.github.com/3410948

Benchmarking code:

https://gist.github.com/3410952

Try running it with numRows = 100, numCols = 200000, segmentSize = 1000

Gurjeet


On Thu, Aug 16, 2012 at 11:40 AM, Gurjeet Singh <gu...@gmail.com> wrote:
> Sure - I can create a minimal testcase and send it along.
>
> Gurjeet
>
> On Thu, Aug 16, 2012 at 11:36 AM, lars hofhansl <lh...@yahoo.com> wrote:
>> That's interesting.
>> Could you share your old and new schema. I would like to track down the performance problems you saw.
>> (If you had a demo program that populates your rows with 200.000 columns in a way where you saw the performance issues, that'd be even better, but not necessary).
>>
>>
>> -- Lars
>>
>>
>>
>> ________________________________
>>  From: Gurjeet Singh <gu...@gmail.com>
>> To: user@hbase.apache.org; lars hofhansl <lh...@yahoo.com>
>> Sent: Thursday, August 16, 2012 11:26 AM
>> Subject: Re: Slow full-table scans
>>
>> Sorry for the delay guys.
>>
>> Here are a few results:
>>
>> 1. Regions in the table = 11
>> 2. The region servers don't appear to be very busy with the query ~5%
>> CPU (but with parallelization, they are all busy)
>>
>> Finally, I changed the format of my data, such that each cell in HBase
>> contains a chunk of a row instead of the single value it had. So,
>> stuffing each Hbase cell with 500 columns of a row, gave me a
>> performance boost of 1000x. It seems that the underlying issue was IO
>> overhead per byte of actual data stored.
>>
>>
>> On Wed, Aug 15, 2012 at 5:16 PM, lars hofhansl <lh...@yahoo.com> wrote:
>>> Yeah... It looks OK.
>>> Maybe 2G of heap is a bit low when dealing with 200.000 column rows.
>>>
>>>
>>> If you can I'd like to know how busy your regionservers are during these operations. That would be an indication on whether the parallelization is good or not.
>>>
>>> -- Lars
>>>
>>>
>>> ----- Original Message -----
>>> From: Stack <st...@duboce.net>
>>> To: user@hbase.apache.org
>>> Cc:
>>> Sent: Wednesday, August 15, 2012 3:13 PM
>>> Subject: Re: Slow full-table scans
>>>
>>> On Mon, Aug 13, 2012 at 6:10 PM, Gurjeet Singh <gu...@gmail.com> wrote:
>>>> I am beginning to think that this is a configuration issue on my
>>>> cluster. Do the following configuration files seem sane ?
>>>>
>>>> hbase-env.sh    https://gist.github.com/3345338
>>>>
>>>
>>> Nothing wrong w/ this (Remove the -ea, you don't want asserts in
>>> production, and the -XX:+CMSIncrementalMode flag if >= 2 cores).
>>>
>>>
>>>> hbase-site.xml    https://gist.github.com/3345356
>>>>
>>>
>>> This is all defaults effectively.   I don't see any of the configs.
>>> recommended by the performance section of the reference guide and/or
>>> those suggested by the GBIF blog.
>>>
>>> You don't answer LarsH's query about where you see the 4% difference.
>>>
>>> How many regions in your table?  Whats the HBase Master UI look like
>>> when this scan is running?
>>> St.Ack
>>>

Re: Slow full-table scans

Posted by Gurjeet Singh <gu...@gmail.com>.
Sure - I can create a minimal testcase and send it along.

Gurjeet

On Thu, Aug 16, 2012 at 11:36 AM, lars hofhansl <lh...@yahoo.com> wrote:
> That's interesting.
> Could you share your old and new schema. I would like to track down the performance problems you saw.
> (If you had a demo program that populates your rows with 200.000 columns in a way where you saw the performance issues, that'd be even better, but not necessary).
>
>
> -- Lars
>
>
>
> ________________________________
>  From: Gurjeet Singh <gu...@gmail.com>
> To: user@hbase.apache.org; lars hofhansl <lh...@yahoo.com>
> Sent: Thursday, August 16, 2012 11:26 AM
> Subject: Re: Slow full-table scans
>
> Sorry for the delay guys.
>
> Here are a few results:
>
> 1. Regions in the table = 11
> 2. The region servers don't appear to be very busy with the query ~5%
> CPU (but with parallelization, they are all busy)
>
> Finally, I changed the format of my data, such that each cell in HBase
> contains a chunk of a row instead of the single value it had. So,
> stuffing each Hbase cell with 500 columns of a row, gave me a
> performance boost of 1000x. It seems that the underlying issue was IO
> overhead per byte of actual data stored.
>
>
> On Wed, Aug 15, 2012 at 5:16 PM, lars hofhansl <lh...@yahoo.com> wrote:
>> Yeah... It looks OK.
>> Maybe 2G of heap is a bit low when dealing with 200.000 column rows.
>>
>>
>> If you can I'd like to know how busy your regionservers are during these operations. That would be an indication on whether the parallelization is good or not.
>>
>> -- Lars
>>
>>
>> ----- Original Message -----
>> From: Stack <st...@duboce.net>
>> To: user@hbase.apache.org
>> Cc:
>> Sent: Wednesday, August 15, 2012 3:13 PM
>> Subject: Re: Slow full-table scans
>>
>> On Mon, Aug 13, 2012 at 6:10 PM, Gurjeet Singh <gu...@gmail.com> wrote:
>>> I am beginning to think that this is a configuration issue on my
>>> cluster. Do the following configuration files seem sane ?
>>>
>>> hbase-env.sh    https://gist.github.com/3345338
>>>
>>
>> Nothing wrong w/ this (Remove the -ea, you don't want asserts in
>> production, and the -XX:+CMSIncrementalMode flag if >= 2 cores).
>>
>>
>>> hbase-site.xml    https://gist.github.com/3345356
>>>
>>
>> This is all defaults effectively.   I don't see any of the configs.
>> recommended by the performance section of the reference guide and/or
>> those suggested by the GBIF blog.
>>
>> You don't answer LarsH's query about where you see the 4% difference.
>>
>> How many regions in your table?  Whats the HBase Master UI look like
>> when this scan is running?
>> St.Ack
>>

Re: Slow full-table scans

Posted by lars hofhansl <lh...@yahoo.com>.
That's interesting.
Could you share your old and new schemas? I would like to track down the performance problems you saw.
(If you had a demo program that populates your rows with 200,000 columns in a way where you saw the performance issues, that'd be even better, but it's not necessary.)


-- Lars



________________________________
 From: Gurjeet Singh <gu...@gmail.com>
To: user@hbase.apache.org; lars hofhansl <lh...@yahoo.com> 
Sent: Thursday, August 16, 2012 11:26 AM
Subject: Re: Slow full-table scans
 
Sorry for the delay guys.

Here are a few results:

1. Regions in the table = 11
2. The region servers don't appear to be very busy with the query ~5%
CPU (but with parallelization, they are all busy)

Finally, I changed the format of my data, such that each cell in HBase
contains a chunk of a row instead of the single value it had. So,
stuffing each Hbase cell with 500 columns of a row, gave me a
performance boost of 1000x. It seems that the underlying issue was IO
overhead per byte of actual data stored.


On Wed, Aug 15, 2012 at 5:16 PM, lars hofhansl <lh...@yahoo.com> wrote:
> Yeah... It looks OK.
> Maybe 2G of heap is a bit low when dealing with 200.000 column rows.
>
>
> If you can I'd like to know how busy your regionservers are during these operations. That would be an indication on whether the parallelization is good or not.
>
> -- Lars
>
>
> ----- Original Message -----
> From: Stack <st...@duboce.net>
> To: user@hbase.apache.org
> Cc:
> Sent: Wednesday, August 15, 2012 3:13 PM
> Subject: Re: Slow full-table scans
>
> On Mon, Aug 13, 2012 at 6:10 PM, Gurjeet Singh <gu...@gmail.com> wrote:
>> I am beginning to think that this is a configuration issue on my
>> cluster. Do the following configuration files seem sane ?
>>
>> hbase-env.sh    https://gist.github.com/3345338
>>
>
> Nothing wrong w/ this (Remove the -ea, you don't want asserts in
> production, and the -XX:+CMSIncrementalMode flag if >= 2 cores).
>
>
>> hbase-site.xml    https://gist.github.com/3345356
>>
>
> This is all defaults effectively.   I don't see any of the configs.
> recommended by the performance section of the reference guide and/or
> those suggested by the GBIF blog.
>
> You don't answer LarsH's query about where you see the 4% difference.
>
> How many regions in your table?  Whats the HBase Master UI look like
> when this scan is running?
> St.Ack
>

Re: Slow full-table scans

Posted by Gurjeet Singh <gu...@gmail.com>.
Sorry for the delay guys.

Here are a few results:

1. Regions in the table = 11
2. The region servers don't appear to be very busy with the query
(~5% CPU), but with parallelization they are all busy

Finally, I changed the format of my data so that each cell in HBase
contains a chunk of a row instead of the single value it had.
Stuffing each HBase cell with 500 columns of a row gave me a
performance boost of 1000x. It seems that the underlying issue was IO
overhead per byte of actual data stored.
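A minimal sketch of this chunking, assuming fixed-width ~10-byte values and a segment size of 500 (the helper names are illustrative, not code from the thread):

```java
import java.nio.ByteBuffer;

public class CellChunks {
  static final int VALUE_WIDTH = 10;  // fixed-width ~10-byte values, as in this table

  // Pack one segment of column values into a single cell value.
  static byte[] pack(byte[][] values) {
    ByteBuffer buf = ByteBuffer.allocate(values.length * VALUE_WIDTH);
    for (byte[] v : values) {
      buf.put(v, 0, VALUE_WIDTH);
    }
    return buf.array();
  }

  // Unpack a cell value back into its fixed-width column values.
  static byte[][] unpack(byte[] cell) {
    int n = cell.length / VALUE_WIDTH;
    byte[][] values = new byte[n][VALUE_WIDTH];
    for (int i = 0; i < n; i++) {
      System.arraycopy(cell, i * VALUE_WIDTH, values[i], 0, VALUE_WIDTH);
    }
    return values;
  }
}
```

With 500 values per cell, a 200,000-column row becomes 400 cells of 5,000 bytes each, so the per-KeyValue overhead (row key + qualifier + timestamp) is paid 400 times per row instead of 200,000 times.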


On Wed, Aug 15, 2012 at 5:16 PM, lars hofhansl <lh...@yahoo.com> wrote:
> Yeah... It looks OK.
> Maybe 2G of heap is a bit low when dealing with 200.000 column rows.
>
>
> If you can I'd like to know how busy your regionservers are during these operations. That would be an indication on whether the parallelization is good or not.
>
> -- Lars
>
>
> ----- Original Message -----
> From: Stack <st...@duboce.net>
> To: user@hbase.apache.org
> Cc:
> Sent: Wednesday, August 15, 2012 3:13 PM
> Subject: Re: Slow full-table scans
>
> On Mon, Aug 13, 2012 at 6:10 PM, Gurjeet Singh <gu...@gmail.com> wrote:
>> I am beginning to think that this is a configuration issue on my
>> cluster. Do the following configuration files seem sane ?
>>
>> hbase-env.sh    https://gist.github.com/3345338
>>
>
> Nothing wrong w/ this (Remove the -ea, you don't want asserts in
> production, and the -XX:+CMSIncrementalMode flag if >= 2 cores).
>
>
>> hbase-site.xml    https://gist.github.com/3345356
>>
>
> This is all defaults effectively.   I don't see any of the configs.
> recommended by the performance section of the reference guide and/or
> those suggested by the GBIF blog.
>
> You don't answer LarsH's query about where you see the 4% difference.
>
> How many regions in your table?  Whats the HBase Master UI look like
> when this scan is running?
> St.Ack
>

Re: Slow full-table scans

Posted by lars hofhansl <lh...@yahoo.com>.
Yeah... It looks OK.
Maybe 2G of heap is a bit low when dealing with 200,000-column rows.


If you can, I'd like to know how busy your regionservers are during these operations. That would be an indication of whether the parallelization is good or not.

-- Lars


----- Original Message -----
From: Stack <st...@duboce.net>
To: user@hbase.apache.org
Cc: 
Sent: Wednesday, August 15, 2012 3:13 PM
Subject: Re: Slow full-table scans

On Mon, Aug 13, 2012 at 6:10 PM, Gurjeet Singh <gu...@gmail.com> wrote:
> I am beginning to think that this is a configuration issue on my
> cluster. Do the following configuration files seem sane ?
>
> hbase-env.sh    https://gist.github.com/3345338
>

Nothing wrong w/ this (Remove the -ea, you don't want asserts in
production, and the -XX:+CMSIncrementalMode flag if >= 2 cores).


> hbase-site.xml    https://gist.github.com/3345356
>

This is all defaults effectively.   I don't see any of the configs.
recommended by the performance section of the reference guide and/or
those suggested by the GBIF blog.

You don't answer LarsH's query about where you see the 4% difference.

How many regions in your table?  Whats the HBase Master UI look like
when this scan is running?
St.Ack


Re: Slow full-table scans

Posted by Stack <st...@duboce.net>.
On Mon, Aug 13, 2012 at 6:10 PM, Gurjeet Singh <gu...@gmail.com> wrote:
> I am beginning to think that this is a configuration issue on my
> cluster. Do the following configuration files seem sane ?
>
> hbase-env.sh     https://gist.github.com/3345338
>

Nothing wrong w/ this (Remove the -ea, you don't want asserts in
production, and the -XX:+CMSIncrementalMode flag if >= 2 cores).


> hbase-site.xml    https://gist.github.com/3345356
>

This is all defaults, effectively.  I don't see any of the configs
recommended by the performance section of the reference guide and/or
those suggested by the GBIF blog.

You didn't answer LarsH's query about where you see the 4% difference.

How many regions are in your table?  What does the HBase Master UI look
like when this scan is running?
St.Ack

Re: Slow full-table scans

Posted by Gurjeet Singh <gu...@gmail.com>.
I am beginning to think that this is a configuration issue on my
cluster. Do the following configuration files seem sane ?

hbase-env.sh     https://gist.github.com/3345338

hbase-site.xml    https://gist.github.com/3345356

Gurjeet

On Mon, Aug 13, 2012 at 5:30 PM, lars hofhansl <lh...@yahoo.com> wrote:
> Only 4% in the 12 node cluster case? I'd guess you're using not more cores then before (i.e. the parallelizing on the client is bad), or you're IO bound (which is unlikely).
> Are all your regionserver busy in terms of CPU?
>
>
> -- Lars
>
>
>
> ----- Original Message -----
> From: Gurjeet Singh <gu...@gmail.com>
> To: user@hbase.apache.org
> Cc:
> Sent: Monday, August 13, 2012 3:12 PM
> Subject: Re: Slow full-table scans
>
> Okay, I just ran this experiment. It did speed things up, but only by
> 4%. This all still seems awfully slow to me - does someone have
> another suggestion ?
>
> Thanks in advance!
> Gurjeet
>
> On Mon, Aug 13, 2012 at 12:51 AM, Gurjeet Singh <gu...@gmail.com> wrote:
>> Thanks a lot!
>>
>> On Mon, Aug 13, 2012 at 12:27 AM, Stack <st...@duboce.net> wrote:
>>> On Mon, Aug 13, 2012 at 6:10 AM, Gurjeet Singh <gu...@gmail.com> wrote:
>>>> Thanks Lars!
>>>>
>>>> One final question :  is it advisable to issue multiple threads
>>>> against a single HTable instance, like so:
>>>>
>>>> HTable table = ...
>>>> for (i = 0; i < 10; i++) {
>>>>   new ScanThread(table, startRow, endRow, rowProcessor).start();
>>>> }
>>>>
>>>
>>> Make an HTable per thread.  See the class comment:
>>> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html
>>>
>>> St.Ack
>

Re: Slow full-table scans

Posted by lars hofhansl <lh...@yahoo.com>.
Only 4% in the 12-node cluster case? I'd guess you're not using more cores than before (i.e. the parallelization on the client is bad), or you're IO bound (which is unlikely).
Are all your regionservers busy in terms of CPU?


-- Lars



----- Original Message -----
From: Gurjeet Singh <gu...@gmail.com>
To: user@hbase.apache.org
Cc: 
Sent: Monday, August 13, 2012 3:12 PM
Subject: Re: Slow full-table scans

Okay, I just ran this experiment. It did speed things up, but only by
4%. This all still seems awfully slow to me - does someone have
another suggestion ?

Thanks in advance!
Gurjeet

On Mon, Aug 13, 2012 at 12:51 AM, Gurjeet Singh <gu...@gmail.com> wrote:
> Thanks a lot!
>
> On Mon, Aug 13, 2012 at 12:27 AM, Stack <st...@duboce.net> wrote:
>> On Mon, Aug 13, 2012 at 6:10 AM, Gurjeet Singh <gu...@gmail.com> wrote:
>>> Thanks Lars!
>>>
>>> One final question :  is it advisable to issue multiple threads
>>> against a single HTable instance, like so:
>>>
>>> HTable table = ...
>>> for (i = 0; i < 10; i++) {
>>>   new ScanThread(table, startRow, endRow, rowProcessor).start();
>>> }
>>>
>>
>> Make an HTable per thread.  See the class comment:
>> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html
>>
>> St.Ack


Re: Slow full-table scans

Posted by Gurjeet Singh <gu...@gmail.com>.
Okay, I just ran this experiment. It did speed things up, but only by
4%. This all still seems awfully slow to me - does someone have
another suggestion ?

Thanks in advance!
Gurjeet

On Mon, Aug 13, 2012 at 12:51 AM, Gurjeet Singh <gu...@gmail.com> wrote:
> Thanks a lot!
>
> On Mon, Aug 13, 2012 at 12:27 AM, Stack <st...@duboce.net> wrote:
>> On Mon, Aug 13, 2012 at 6:10 AM, Gurjeet Singh <gu...@gmail.com> wrote:
>>> Thanks Lars!
>>>
>>> One final question :  is it advisable to issue multiple threads
>>> against a single HTable instance, like so:
>>>
>>> HTable table = ...
>>> for (i = 0; i < 10; i++) {
>>>   new ScanThread(table, startRow, endRow, rowProcessor).start();
>>> }
>>>
>>
>> Make an HTable per thread.  See the class comment:
>> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html
>>
>> St.Ack

Re: Slow full-table scans

Posted by Gurjeet Singh <gu...@gmail.com>.
Thanks a lot!

On Mon, Aug 13, 2012 at 12:27 AM, Stack <st...@duboce.net> wrote:
> On Mon, Aug 13, 2012 at 6:10 AM, Gurjeet Singh <gu...@gmail.com> wrote:
>> Thanks Lars!
>>
>> One final question :  is it advisable to issue multiple threads
>> against a single HTable instance, like so:
>>
>> HTable table = ...
>> for (i = 0; i < 10; i++) {
>>   new ScanThread(table, startRow, endRow, rowProcessor).start();
>> }
>>
>
> Make an HTable per thread.  See the class comment:
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html
>
> St.Ack

Re: Slow full-table scans

Posted by Stack <st...@duboce.net>.
On Mon, Aug 13, 2012 at 6:10 AM, Gurjeet Singh <gu...@gmail.com> wrote:
> Thanks Lars!
>
> One final question :  is it advisable to issue multiple threads
> against a single HTable instance, like so:
>
> HTable table = ...
> for (i = 0; i < 10; i++) {
>   new ScanThread(table, startRow, endRow, rowProcessor).start();
> }
>

Make an HTable per thread.  See the class comment:
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html
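That per-thread pattern can be sketched generically in plain Java; here `Supplier<H>` stands in for a factory along the lines of `() -> new HTable(conf, tableName)` (an assumed call shape, since the construction details aren't shown in the thread):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.BiFunction;
import java.util.function.Supplier;

public class PerThreadHandles {
  // Run one scan task per key partition. Each task builds its OWN handle
  // via newHandle (in real code: () -> new HTable(conf, tableName)),
  // because a single HTable instance is not safe for concurrent use.
  static <H> List<String> runPartitions(Supplier<H> newHandle,
                                        List<String> partitions,
                                        BiFunction<H, String, String> scanFn) {
    ExecutorService pool = Executors.newFixedThreadPool(partitions.size());
    try {
      List<Future<String>> futures = new ArrayList<>();
      for (String p : partitions) {
        // fresh handle per task, never shared across threads
        futures.add(pool.submit(() -> scanFn.apply(newHandle.get(), p)));
      }
      List<String> results = new ArrayList<>();
      for (Future<String> f : futures) {
        results.add(f.get());
      }
      return results;
    } catch (InterruptedException | ExecutionException e) {
      throw new RuntimeException(e);
    } finally {
      pool.shutdown();
    }
  }
}
```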

St.Ack

Re: Slow full-table scans

Posted by Gurjeet Singh <gu...@gmail.com>.
Thanks Lars!

One final question: is it advisable to issue multiple threads
against a single HTable instance, like so:

HTable table = ...
for (i = 0; i < 10; i++) {
  new ScanThread(table, startRow, endRow, rowProcessor).start();
}


....

class ScanThread extends Thread {  // extends Thread so start() works
  public void run() {
    Scan scan = new Scan();
    scan.setStartRow(startRow);
    scan.setStopRow(endRow);  // Scan's API is setStopRow, not setEndRow
    try {
      ResultScanner scanner = table.getScanner(scan);
      try {
        for (Result result : scanner) {
          rowProcessor.process(result);
        }
      } finally {
        scanner.close();  // release the server-side scanner
      }
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }
}

On Sun, Aug 12, 2012 at 4:00 PM, lars hofhansl <lh...@yahoo.com> wrote:
> You can use HTable.{getStartEndKeys|getEndKeys|getStartKeys} to get the current region demarcations for your table.
> If you wanted to group threads by RegionServer (which you should) you get that information via HTable.getRegionLocation{s}
>
>
> -- Lars
>
>
> ----- Original Message -----
> From: Gurjeet Singh <gu...@gmail.com>
> To: user@hbase.apache.org; lars hofhansl <lh...@yahoo.com>
> Cc:
> Sent: Sunday, August 12, 2012 3:51 PM
> Subject: Re: Slow full-table scans
>
> Hi Lars,
>
> Yes, I need to retrieve all the values for a row at a time. That said,
> I did experiment with different batch sizes and that made no
> difference whatsoever. (caching on the other hand did make some
> difference ~2-3% faster for larger cache)
>
> I see your point about scanners returning sorted KVs. In my
> application, I simply don't care whether the results are sorted or not
> and I know the key range in advance. This is a great suggestion. Let
> me try replacing a single scan with a list of GETs or a bunch of SCANs
> with different start/stop rows.
>
> Thanks!
> Gurjeet
>
> On Sun, Aug 12, 2012 at 3:24 PM, lars hofhansl <lh...@yahoo.com> wrote:
>> Do you really have to retrieve all 200.000 each time?
>> Scan.setBatch(...) makes no difference?! (note that batching is different and separate from caching).
>>
>> Also note that the scanner contract is to return sorted KVs, so a single scan cannot be parallelized across RegionServers (well not entirely true, it could be farmed off in parallel and then be presented to the client in the right order - but HBase is not doing that). That is why one vs 12 RSs makes no difference in this scenario.
>>
>> In the 12 node case you'll see low CPU on all but one RS, and each RS will get its turn.
>>
>> In your case this is scanning 20.000.000 KVs serially in 400s, that's 50000 KVs/s, which - depending on hardware - is not too bad for HBase (but not great either).
>>
>> If you only ever expect to run a single query like this on top your cluster (i.e. your concern is latency not throughput) you can do multiple RPCs in parallel for a sub portion of your key range. Together with batching can start using value before all is streamed back from the server.
>>
>>
>> -- Lars
>>
>>
>>
>> ----- Original Message -----
>> From: Gurjeet Singh <gu...@gmail.com>
>> To: user@hbase.apache.org
>> Cc:
>> Sent: Saturday, August 11, 2012 11:04 PM
>> Subject: Slow full-table scans
>>
>> Hi,
>>
>> I am trying to read all the data out of an HBase table using a scan
>> and it is extremely slow.
>>
>> Here are some characteristics of the data:
>>
>> 1. The total table size is tiny (~200MB)
>> 2. The table has ~100 rows and ~200,000 columns in a SINGLE family.
>> Thus the size of each cell is ~10bytes and the size of each row is
>> ~2MB
>> 3. Currently scanning the whole table takes ~400s (both in a
>> distributed setting with 12 nodes or so and on a single node), thus
>> 5sec/row
>> 4. The row keys are unique 8 byte crypto hashes of sequential numbers
>> 5. The scanner is set to fetch a FULL row at a time (scan.setBatch)
>> and is set to fetch 100MB of data at a time (scan.setCaching)
>> 6. Changing the caching size seems to have no effect on the total scan
>> time at all
>> 7. The column family is setup to keep a single version of the cells,
>> no compression, and no block cache.
>>
>> Am I missing something ? Is there a way to optimize this ?
>>
>> I guess a general question I have is whether HBase is good datastore
>> for storing many medium sized (~50GB), dense datasets with lots of
>> columns when a lot of the queries require full table scans ?
>>
>> Thanks!
>> Gurjeet
>>
>

Re: Slow full-table scans

Posted by lars hofhansl <lh...@yahoo.com>.
You can use HTable.{getStartEndKeys|getEndKeys|getStartKeys} to get the current region demarcations for your table.
If you want to group threads by RegionServer (which you should), you can get that information via HTable.getRegionLocation{s}.
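A minimal sketch of the client-side splitting, assuming fixed-width 8-byte keys as in this table (in practice the boundaries returned by getStartEndKeys would be used directly instead of computing even splits):

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

public class RangeSplit {
  // Split the key range [start, end) into n contiguous sub-ranges,
  // treating the 8-byte keys as unsigned big-endian integers.
  // Each sub-range would then be scanned by its own thread with its
  // own HTable and its own Scan(startRow, stopRow).
  static List<byte[][]> split(byte[] start, byte[] end, int n) {
    BigInteger lo = new BigInteger(1, start);
    BigInteger hi = new BigInteger(1, end);
    BigInteger step = hi.subtract(lo).divide(BigInteger.valueOf(n));
    List<byte[][]> ranges = new ArrayList<>();
    for (int i = 0; i < n; i++) {
      BigInteger a = lo.add(step.multiply(BigInteger.valueOf(i)));
      BigInteger b = (i == n - 1) ? hi : a.add(step);  // last range ends at hi
      ranges.add(new byte[][] { toKey(a), toKey(b) });
    }
    return ranges;
  }

  // Left-pad to 8 bytes (BigInteger.toByteArray may be shorter, or
  // carry an extra leading sign byte).
  static byte[] toKey(BigInteger v) {
    byte[] raw = v.toByteArray();
    byte[] key = new byte[8];
    int copy = Math.min(raw.length, 8);
    System.arraycopy(raw, raw.length - copy, key, 8 - copy, copy);
    return key;
  }
}
```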


-- Lars


----- Original Message -----
From: Gurjeet Singh <gu...@gmail.com>
To: user@hbase.apache.org; lars hofhansl <lh...@yahoo.com>
Cc: 
Sent: Sunday, August 12, 2012 3:51 PM
Subject: Re: Slow full-table scans

Hi Lars,

Yes, I need to retrieve all the values for a row at a time. That said,
I did experiment with different batch sizes and that made no
difference whatsoever. (caching on the other hand did make some
difference ~2-3% faster for larger cache)

I see your point about scanners returning sorted KVs. In my
application, I simply don't care whether the results are sorted or not
and I know the key range in advance. This is a great suggestion. Let
me try replacing a single scan with a list of GETs or a bunch of SCANs
with different start/stop rows.

Thanks!
Gurjeet

On Sun, Aug 12, 2012 at 3:24 PM, lars hofhansl <lh...@yahoo.com> wrote:
> Do you really have to retrieve all 200.000 each time?
> Scan.setBatch(...) makes no difference?! (note that batching is different and separate from caching).
>
> Also note that the scanner contract is to return sorted KVs, so a single scan cannot be parallelized across RegionServers (well not entirely true, it could be farmed off in parallel and then be presented to the client in the right order - but HBase is not doing that). That is why one vs 12 RSs makes no difference in this scenario.
>
> In the 12 node case you'll see low CPU on all but one RS, and each RS will get its turn.
>
> In your case this is scanning 20.000.000 KVs serially in 400s, that's 50000 KVs/s, which - depending on hardware - is not too bad for HBase (but not great either).
>
> If you only ever expect to run a single query like this on top your cluster (i.e. your concern is latency not throughput) you can do multiple RPCs in parallel for a sub portion of your key range. Together with batching can start using value before all is streamed back from the server.
>
>
> -- Lars
>
>
>
> ----- Original Message -----
> From: Gurjeet Singh <gu...@gmail.com>
> To: user@hbase.apache.org
> Cc:
> Sent: Saturday, August 11, 2012 11:04 PM
> Subject: Slow full-table scans
>
> Hi,
>
> I am trying to read all the data out of an HBase table using a scan
> and it is extremely slow.
>
> Here are some characteristics of the data:
>
> 1. The total table size is tiny (~200MB)
> 2. The table has ~100 rows and ~200,000 columns in a SINGLE family.
> Thus the size of each cell is ~10bytes and the size of each row is
> ~2MB
> 3. Currently scanning the whole table takes ~400s (both in a
> distributed setting with 12 nodes or so and on a single node), thus
> 5sec/row
> 4. The row keys are unique 8 byte crypto hashes of sequential numbers
> 5. The scanner is set to fetch a FULL row at a time (scan.setBatch)
> and is set to fetch 100MB of data at a time (scan.setCaching)
> 6. Changing the caching size seems to have no effect on the total scan
> time at all
> 7. The column family is setup to keep a single version of the cells,
> no compression, and no block cache.
>
> Am I missing something ? Is there a way to optimize this ?
>
> I guess a general question I have is whether HBase is good datastore
> for storing many medium sized (~50GB), dense datasets with lots of
> columns when a lot of the queries require full table scans ?
>
> Thanks!
> Gurjeet
>


Re: Slow full-table scans

Posted by Gurjeet Singh <gu...@gmail.com>.
Hi Lars,

Yes, I need to retrieve all the values for a row at a time. That said,
I did experiment with different batch sizes and that made no
difference whatsoever. (Caching, on the other hand, did make some
difference: ~2-3% faster with a larger cache.)

I see your point about scanners returning sorted KVs. In my
application, I simply don't care whether the results are sorted,
and I know the key range in advance. This is a great suggestion. Let
me try replacing the single scan with a list of GETs or a bunch of
SCANs with different start/stop rows.

Thanks!
Gurjeet

On Sun, Aug 12, 2012 at 3:24 PM, lars hofhansl <lh...@yahoo.com> wrote:
> Do you really have to retrieve all 200.000 each time?
> Scan.setBatch(...) makes no difference?! (note that batching is different and separate from caching).
>
> Also note that the scanner contract is to return sorted KVs, so a single scan cannot be parallelized across RegionServers (well not entirely true, it could be farmed off in parallel and then be presented to the client in the right order - but HBase is not doing that). That is why one vs 12 RSs makes no difference in this scenario.
>
> In the 12 node case you'll see low CPU on all but one RS, and each RS will get its turn.
>
> In your case this is scanning 20.000.000 KVs serially in 400s, that's 50000 KVs/s, which - depending on hardware - is not too bad for HBase (but not great either).
>
> If you only ever expect to run a single query like this on top your cluster (i.e. your concern is latency not throughput) you can do multiple RPCs in parallel for a sub portion of your key range. Together with batching can start using value before all is streamed back from the server.
>
>
> -- Lars
>
>
>
> ----- Original Message -----
> From: Gurjeet Singh <gu...@gmail.com>
> To: user@hbase.apache.org
> Cc:
> Sent: Saturday, August 11, 2012 11:04 PM
> Subject: Slow full-table scans
>
> Hi,
>
> I am trying to read all the data out of an HBase table using a scan
> and it is extremely slow.
>
> Here are some characteristics of the data:
>
> 1. The total table size is tiny (~200MB)
> 2. The table has ~100 rows and ~200,000 columns in a SINGLE family.
> Thus the size of each cell is ~10bytes and the size of each row is
> ~2MB
> 3. Currently scanning the whole table takes ~400s (both in a
> distributed setting with 12 nodes or so and on a single node), thus
> 5sec/row
> 4. The row keys are unique 8 byte crypto hashes of sequential numbers
> 5. The scanner is set to fetch a FULL row at a time (scan.setBatch)
> and is set to fetch 100MB of data at a time (scan.setCaching)
> 6. Changing the caching size seems to have no effect on the total scan
> time at all
> 7. The column family is setup to keep a single version of the cells,
> no compression, and no block cache.
>
> Am I missing something ? Is there a way to optimize this ?
>
> I guess a general question I have is whether HBase is good datastore
> for storing many medium sized (~50GB), dense datasets with lots of
> columns when a lot of the queries require full table scans ?
>
> Thanks!
> Gurjeet
>

Re: Slow full-table scans

Posted by lars hofhansl <lh...@yahoo.com>.
Do you really have to retrieve all 200,000 columns each time?
Scan.setBatch(...) makes no difference?! (Note that batching is different and separate from caching.)

Also note that the scanner contract is to return sorted KVs, so a single scan cannot be parallelized across RegionServers. (Well, not entirely true: it could be farmed out in parallel and then presented to the client in the right order, but HBase is not doing that.) That is why one vs. 12 RSs makes no difference in this scenario.

In the 12-node case you'll see low CPU on all but one RS, and each RS will get its turn.

In your case this is scanning 20,000,000 KVs serially in 400s; that's 50,000 KVs/s, which, depending on hardware, is not too bad for HBase (but not great either).

If you only ever expect to run a single query like this on top of your cluster (i.e. your concern is latency, not throughput) you can do multiple RPCs in parallel, each covering a sub-portion of your key range. Together with batching, you can start using values before everything is streamed back from the server.
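A rough back-of-envelope for how batch and caching interact, assuming batch caps the number of KVs per partial Result and caching counts Results fetched per RPC (as described above):

```java
public class ScanRpcEstimate {
  // batch = max KVs per partial Result; caching = Results per RPC.
  static long estimateRpcs(long rows, long colsPerRow, int batch, int caching) {
    long resultsPerRow = (colsPerRow + batch - 1) / batch;   // ceil division
    long totalResults = rows * resultsPerRow;
    return (totalResults + caching - 1) / caching;           // ceil division
  }
}
```

For the table in question (100 rows x 200,000 columns) with batch=1000 and caching=100, that works out to 100 * 200 = 20,000 partial Results, i.e. about 200 RPCs.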


-- Lars



----- Original Message -----
From: Gurjeet Singh <gu...@gmail.com>
To: user@hbase.apache.org
Cc: 
Sent: Saturday, August 11, 2012 11:04 PM
Subject: Slow full-table scans

Hi,

I am trying to read all the data out of an HBase table using a scan
and it is extremely slow.

Here are some characteristics of the data:

1. The total table size is tiny (~200MB)
2. The table has ~100 rows and ~200,000 columns in a SINGLE family.
Thus the size of each cell is ~10 bytes and the size of each row is
~2MB
3. Currently scanning the whole table takes ~400s (both in a
distributed setting with 12 nodes or so and on a single node), thus
~5 sec/row
4. The row keys are unique 8-byte crypto hashes of sequential numbers
5. The scanner is set to fetch a FULL row at a time (scan.setBatch)
and to fetch 100MB of data at a time (scan.setCaching)
6. Changing the caching size seems to have no effect on the total scan
time at all
7. The column family is set up to keep a single version of the cells,
with no compression and no block cache.
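Keys like those in point 4 might be generated along these lines; the thread doesn't say which hash function was used, so SHA-1 truncated to 8 bytes is an assumption:

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class RowKeys {
  // 8-byte row key: the first 8 bytes of SHA-1(sequence number).
  static byte[] rowKey(long seq) {
    try {
      MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
      byte[] digest = sha1.digest(ByteBuffer.allocate(8).putLong(seq).array());
      byte[] key = new byte[8];
      System.arraycopy(digest, 0, key, 0, 8);
      return key;
    } catch (NoSuchAlgorithmException e) {
      throw new RuntimeException(e);  // SHA-1 is available on every JRE
    }
  }
}
```

Note that hashing destroys the ordering of the sequential numbers: it spreads writes evenly across regions, but a scan in key order has no locality with respect to the original sequence.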

Am I missing something ? Is there a way to optimize this ?

I guess a general question I have is whether HBase is a good datastore
for storing many medium-sized (~50GB), dense datasets with lots of
columns when a lot of the queries require full-table scans?

Thanks!
Gurjeet