Posted to user@hbase.apache.org by Wojciech Langiewicz <wl...@gmail.com> on 2011/05/01 15:44:05 UTC

Row count without iterating over ResultScanner?

Hi,
I would like to know if there's a way to quickly count the number of rows
in a scan result.
Right now I'm iterating over the ResultScanner like this:
int count = 0;
for (Result rr = scanner.next(); rr != null; rr = scanner.next()) {
	++count;
}
But with the number of rows reaching millions, this takes a while.
I tried to find something in the documentation, but I didn't find anything.
I would like to use the HBase API, not an MR job (because this cluster only has
HDFS and HBase installed).

Thanks for all help.

--
Wojciech Langiewicz

Re: Row count without iterating over ResultScanner?

Posted by Wojciech Langiewicz <wl...@gmail.com>.

Thanks, that's great. But first I have to update HBase and read some
documentation, so I'll let you know in a while how that works for me.

On 01.05.2011 20:42, Himanshu Vashishtha wrote:
> Yes, you can define your scan object on the client side and pass it to
> AggregationClient.rowCount. You can refer to the AggregationClient javadoc and
> the associated TestAggregateProtocol test methods to get an idea.
>
> Thanks,
> Himanshu

Re: Row count without iterating over ResultScanner?

Posted by Himanshu Vashishtha <hv...@cs.ualberta.ca>.

Yes, you can define your scan object on the client side and pass it to
AggregationClient.rowCount. You can refer to the AggregationClient javadoc and
the associated TestAggregateProtocol test methods to get an idea.
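
As a rough illustration of the idea (each region counts its own matching rows, and only the per-region counts travel back to the client to be summed), here is a plain-Java sketch. It does not use the HBase coprocessor API: the in-memory lists standing in for regions, the Predicate standing in for a scan filter, and all class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

// Plain-Java stand-in for "count on the server, sum on the client".
// Regions are simulated as in-memory lists of row keys; nothing here is HBase API.
class RegionCountSketch {

    // Counts matching rows inside each simulated region in parallel,
    // then adds up the partial counts on the "client" side.
    static long rowCount(List<List<String>> regions, Predicate<String> filter) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, regions.size()));
        try {
            List<Future<Long>> partials = new ArrayList<>();
            for (List<String> region : regions) {
                // Only this single number leaves the "region", never the rows themselves.
                Callable<Long> task = () -> region.stream().filter(filter).count();
                partials.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> partial : partials) {
                try {
                    total += partial.get();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException(e.getCause());
                }
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<List<String>> regions = Arrays.asList(
                Arrays.asList("a1", "a2", "b1"),
                Arrays.asList("b2", "c1"),
                Arrays.asList("c2", "c3", "c4"));
        System.out.println("all rows           : " + rowCount(regions, key -> true));
        System.out.println("rows starting with c: " + rowCount(regions, key -> key.startsWith("c")));
    }
}
```

For the real call and its exact signature, the javadoc and the TestAggregateProtocol tests mentioned above are the place to look; the sketch only shows why the client no longer has to receive every row.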

Thanks,
Himanshu

On Sun, May 1, 2011 at 12:29 PM, Wojciech Langiewicz
<wl...@gmail.com> wrote:

> Hi,
>
> On 01.05.2011 20:03, Himanshu Vashishtha wrote:
>
>> If you are interested row count only (and not want to fetch the table rows
>> to your client side), you can also try out
>> https://issues.apache.org/jira/browse/HBASE-1512.
>
> Yes, I only want to count rows and apply filters or select columns.
> Are filters also supported to work with those aggregate functions?
>
>> PS: Which version you are on? The above patch is in main trunk as of now,
>> so to use it you would have to checkout the code and build it.
>
> I'm using version from CDH3, so it is: 0.90.1-cdh3u0, but I'm not bound to
> this version.
>
> Coprocessors with aggregate functions seem to be the thing I need. Thanks!
> --
> Wojciech Langiewicz

Re: Row count without iterating over ResultScanner?

Posted by Wojciech Langiewicz <wl...@gmail.com>.

Hi,
On 01.05.2011 20:03, Himanshu Vashishtha wrote:
> If you are interested row count only (and not want to fetch the table rows
> to your client side), you can also try out
> https://issues.apache.org/jira/browse/HBASE-1512.

Yes, I only want to count rows, while applying filters or selecting columns.
Are filters also supported with those aggregate functions?

> PS: Which version you are on? The above patch is in main trunk as of now, so
> to use it you would have to checkout the code and build it.

I'm using the version from CDH3, which is 0.90.1-cdh3u0, but I'm not bound
to this version.

Coprocessors with aggregate functions seem to be the thing I need. Thanks!
--
Wojciech Langiewicz


Re: Row count without iterating over ResultScanner?

Posted by Himanshu Vashishtha <hv...@cs.ualberta.ca>.

If you are interested in the row count only (and do not want to fetch the table
rows to the client side), you can also try out
https://issues.apache.org/jira/browse/HBASE-1512.

PS: Which version are you on? The above patch is only in trunk as of now, so
to use it you would have to check out the code and build it.

Thanks,
Himanshu


On Sun, May 1, 2011 at 11:55 AM, Doug Meil <do...@explorysmedical.com> wrote:

> What caching value are you using on the scan?  If you aren't setting this,
> it's probably using the default - which is 1.  Which is slow.
> http://hbase.apache.org/book.html#d379e3504
>
> Re:  "I would like to use HBase API, not MR job (because this cluster only
> has HDFS and HBase installed)."
>
> For Very Large tables you want to start using an MR job for this.

Re: Row count without iterating over ResultScanner?

Posted by Wojciech Langiewicz <wl...@gmail.com>.

Thanks. Also, following the documentation at the link you posted (section
13.6.5), I have applied those filters.

On 01.05.2011 20:44, Doug Meil wrote:
> Another thing is be careful about CF/attributes you have in the Scan.  If you add a column family (scan.addFamily) , it will pull *all* the attributes of that column family.  If you only care about a row-count, pick only one very small attribute from the row.

RE: Row count without iterating over ResultScanner?

Posted by Doug Meil <do...@explorysmedical.com>.

Another thing: be careful about which CF/attributes you have in the Scan.  If you add a column family (scan.addFamily), it will pull *all* the attributes of that column family.  If you only care about a row count, pick only one very small attribute from the row.
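
To put rough numbers on this, the bytes shipped to the client scale with what each returned row carries. The sketch below is plain Java arithmetic, not HBase code; the class name and every size in it (row count, attributes per row, bytes per cell) are assumed example values.

```java
// Illustrative arithmetic only (no HBase calls); every size below is an assumed example value.
class ScanPayloadSketch {

    // Total bytes shipped to the client if every row returns `cellsPerRow`
    // cells of roughly `bytesPerCell` bytes each.
    static long bytesTransferred(long rows, int cellsPerRow, int bytesPerCell) {
        return rows * cellsPerRow * bytesPerCell;
    }

    public static void main(String[] args) {
        long rows = 5_000_000L;
        // Whole column family: assume 20 attributes of about 200 bytes per row.
        long wholeFamily = bytesTransferred(rows, 20, 200);
        // One small attribute: assume a single cell of about 20 bytes per row.
        long oneAttribute = bytesTransferred(rows, 1, 20);
        System.out.println("whole family : " + wholeFamily + " bytes");
        System.out.println("one attribute: " + oneAttribute + " bytes");
        System.out.println("ratio        : " + (wholeFamily / oneAttribute) + "x");
    }
}
```

In the client API this corresponds to calling scan.addColumn(family, qualifier) instead of scan.addFamily(family). One caveat worth checking against your schema: a row that does not have the chosen attribute will not come back from such a scan, so it will not be counted.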


-----Original Message-----
From: Wojciech Langiewicz [mailto:wlangiewicz@gmail.com] 
Sent: Sunday, May 01, 2011 2:12 PM
To: user@hbase.apache.org
Subject: Re: Row count without iterating over ResultScanner?

Yes, I was using default caching, setting this value to few thousands made significant difference in performance, I'll experiment more with this option.

Right now I want to stay away from MR, mainly because of cluster warm-up time, and I want to get results almost real-time (few seconds max).

Thanks for the tip on caching!

Re: Row count without iterating over ResultScanner?

Posted by Wojciech Langiewicz <wl...@gmail.com>.

Yes, I was using the default caching. Setting this value to a few thousand
made a significant difference in performance; I'll experiment more with
this option.

Right now I want to stay away from MR, mainly because of cluster warm-up
time, and I want to get results in near real time (a few seconds max).

Thanks for the tip on caching!

On 01.05.2011 19:55, Doug Meil wrote:
> What caching value are you using on the scan?  If you aren't setting this, it's probably using the default - which is 1.  Which is slow.   http://hbase.apache.org/book.html#d379e3504
>
> Re:  "I would like to use HBase API, not MR job (because this cluster only has HDFS and HBase installed)."
>
> For Very Large tables you want to start using an MR job for this.

RE: Row count without iterating over ResultScanner?

Posted by Doug Meil <do...@explorysmedical.com>.

What caching value are you using on the scan?  If you aren't setting this, it's probably using the default, which is 1, and that is slow.  http://hbase.apache.org/book.html#d379e3504
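
A back-of-the-envelope way to see the effect: if each round trip to the region server returns up to `caching` rows, then counting N rows takes about ceil(N / caching) round trips. The sketch below is plain Java arithmetic, not HBase code; the class name, the row count, and the 1 ms per round trip are assumptions picked only for illustration.

```java
// Back-of-the-envelope sketch: not HBase code, just the arithmetic behind scanner caching.
// All names and the latency figure are assumptions for illustration.
class ScanCachingSketch {

    // Number of client/server round trips needed to pull `rows` rows
    // when each round trip returns up to `caching` rows.
    static long roundTrips(long rows, int caching) {
        if (rows < 0 || caching <= 0) {
            throw new IllegalArgumentException("rows must be >= 0 and caching > 0");
        }
        return (rows + caching - 1) / caching; // ceiling division
    }

    // Estimated time spent on round-trip latency alone, in milliseconds.
    static double latencyMillis(long rows, int caching, double millisPerRoundTrip) {
        return roundTrips(rows, caching) * millisPerRoundTrip;
    }

    public static void main(String[] args) {
        long rows = 5_000_000L;          // assumed table size
        double millisPerRoundTrip = 1.0; // assumed network latency
        for (int caching : new int[] {1, 100, 1000, 5000}) {
            System.out.printf("caching=%d -> %d round trips, ~%.0f ms of latency%n",
                    caching, roundTrips(rows, caching),
                    latencyMillis(rows, caching, millisPerRoundTrip));
        }
    }
}
```

In the client the knob itself is scan.setCaching(n). Larger values cost more memory per scanner on both sides, so the right number is something to measure rather than to read off this estimate.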

Re:  "I would like to use HBase API, not MR job (because this cluster only has HDFS and HBase installed)."

For very large tables you will want to start using an MR job for this.


-----Original Message-----
From: Wojciech Langiewicz [mailto:wlangiewicz@gmail.com] 
Sent: Sunday, May 01, 2011 9:44 AM
To: user@hbase.apache.org
Subject: Row count without iterating over ResultScanner?

Hi,
I would like to know if there's a way to quickly count number of rows from scan result?
Right now I'm iterating over ResultScanner like this:
int count = 0;
for (Result rr = scanner.next(); rr != null; rr = scanner.next()) {
	++count;
}
But with number of rows reaching millions this takes a while.
I tried to find something in documentation, but I didn't found anything.
I would like to use HBase API, not MR job (because this cluster only has HDFS and HBase installed).

Thanks for all help.

--
Wojciech Langiewicz

Re: Row count without iterating over ResultScanner?

Posted by Michel Segel <mi...@hotmail.com>.

Hi,
There's a row counter app in the HBase release, but it runs as a MapReduce job.

You could also maintain a dynamic counter.
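
One possible reading of the dynamic counter idea is to keep a running count that is updated at write time, so that reading it is a single lookup instead of a scan. The sketch below models that in plain Java with an in-memory set standing in for the table; it is not HBase API, and the class and method names are hypothetical.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical in-memory model of a write-time row counter; not HBase API.
class DynamicRowCounterSketch {
    private final Set<String> rowKeys = ConcurrentHashMap.newKeySet(); // stands in for the table
    private final AtomicLong rowCount = new AtomicLong();              // stands in for a counter cell

    // Writes a row; the counter moves only when the row key is new.
    void put(String rowKey) {
        if (rowKeys.add(rowKey)) {
            rowCount.incrementAndGet();
        }
    }

    // Deletes a row; the counter moves only when the row actually existed.
    void delete(String rowKey) {
        if (rowKeys.remove(rowKey)) {
            rowCount.decrementAndGet();
        }
    }

    // Reading the count is a single lookup, independent of table size.
    long count() {
        return rowCount.get();
    }

    public static void main(String[] args) {
        DynamicRowCounterSketch table = new DynamicRowCounterSketch();
        table.put("row-1");
        table.put("row-2");
        table.put("row-1"); // overwrite of an existing row, count unchanged
        table.delete("row-2");
        System.out.println("rows: " + table.count());
    }
}
```

The trade-off is that a counter like this only answers "how many rows in total"; a count under an arbitrary filter still needs a scan or the server-side aggregation discussed elsewhere in this thread. In a real table, telling a new row from an overwrite also needs some care, which the in-memory set hides.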


Sent from a remote device. Please excuse any typos...

Mike Segel

On May 1, 2011, at 8:44 AM, Wojciech Langiewicz <wl...@gmail.com> wrote:

> Hi,
> I would like to know if there's a way to quickly count number of rows from scan result?
> Right now I'm iterating over ResultScanner like this:
> int count = 0;
> for (Result rr = scanner.next(); rr != null; rr = scanner.next()) {
>    ++count;
> }
> But with number of rows reaching millions this takes a while.
> I tried to find something in documentation, but I didn't found anything.
> I would like to use HBase API, not MR job (because this cluster only has HDFS and HBase installed).
> 
> Thanks for all help.
> 
> --
> Wojciech Langiewicz
>