Posted to dev@hbase.apache.org by Ted Yu <yu...@gmail.com> on 2010/07/29 23:22:48 UTC

count of rows in table

Hi,
The count command in the HBase shell is quite slow.
Is there a faster way to obtain a row count?

Thanks

Re: count of rows in table

Posted by Ted Yu <yu...@gmail.com>.
I'd like to poll for ideas on how to aggregate row counts from several
tables.
I can run rowcounter for each table, but how can I easily produce the sum
of all the counts?
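
One possible approach, as a minimal sketch: capture the console output of
each rowcounter run and add up the "Map input records=" counter that the job
client prints (the same counter line that appears in the job output quoted in
this thread). This assumes one map input record per counted row, which is
worth checking against the HBase version in use. The function name and the
sample log strings are made up for illustration.

```python
import re

# Matches the counter line printed by the job client, e.g.
# "10/07/29 22:38:43 INFO mapred.JobClient:     Map input records=0"
_COUNTER = re.compile(r"Map input records=(\d+)")


def sum_row_counts(job_outputs):
    """Sum the 'Map input records' counter across several captured job logs.

    job_outputs: iterable of strings, one per rowcounter run (hypothetical
    input; how the output is captured is up to the caller).
    Assumes one map input record per counted row.
    """
    total = 0
    for text in job_outputs:
        match = _COUNTER.search(text)
        if match is None:
            # Fail loudly rather than silently under-counting.
            raise ValueError("no 'Map input records' counter found")
        total += int(match.group(1))
    return total


if __name__ == "__main__":
    # Made-up sample logs for two tables.
    logs = [
        "10/07/29 22:38:43 INFO mapred.JobClient:     Map input records=0",
        "10/07/29 22:40:01 INFO mapred.JobClient:     Map input records=1250",
    ]
    print(sum_row_counts(logs))
```

Raising on a missing counter is a deliberate choice here: a run that failed
or produced no counter should not be treated as a count of zero.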

Thanks

On Thu, Jul 29, 2010 at 9:07 PM, Ted Yu <yu...@gmail.com> wrote:

> I think OR is more reasonable.
>
>
> On Thu, Jul 29, 2010 at 8:54 PM, Angus He <an...@gmail.com> wrote:
>
>> By the way
>>
>> If users input multiple columns, it seems that the current
>> implementation of RowCounter employs the OR logical operation.
>>
>> Is the AND more reasonable?
>>
>>
>>
>> On Fri, Jul 30, 2010 at 11:13 AM, Ryan Rawson <ry...@gmail.com> wrote:
>> > RowCounter job counts rows. Its answer will be how many distinct row
>> keys
>> > were in the table approximately at a given time range.
>> >
>> > Even if the implementation uses first kv filter nothing about what I
>> just
>> > said is false.
>> >
>> > A KeyValue counter would tell you how many cells and versions there were
>> > total don't you think?
>> >
>> > On Jul 29, 2010 7:58 PM, "Angus He" <an...@gmail.com> wrote:
>> >> Column names are just optional for RowCounter job.
>> >>
>> >> To be more accurate, RowCounter is a KeyValueCounter.
>> >> If no columns are specified, only the first KeyValues of each row are
>> >> included, then get the RowCounter.
>> >>
>> >>
>> >> On Fri, Jul 30, 2010 at 9:28 AM, Ted Yu <yu...@gmail.com> wrote:
>> >>> If someone can share the commandline for running RowCounter, that
>> would
>> > be
>> >>> great.
>> >>>
>> >>> Also, hbase shell count doesn't require column name. Why does
>> RowCounter
>> >>> require it ?
>> >>>
>> >>> Thanks
>> >>>
>> >>> On Thu, Jul 29, 2010 at 4:55 PM, Ryan Rawson <ry...@gmail.com>
>> wrote:
>> >>>
>> >>>> Hi,
>> >>>>
>> >>>> That table appears to be empty.  Eg:
>> >>>>
>> >>>> 10/07/29 22:38:43 INFO mapred.JobClient:     Map input records=0
>> >>>>
>> >>>>
>> >>>> So back to the count issue... Counting in databases is a classic
>> >>>> problem. Unless your DB system is keeping stats on how many
>> >>>> inserts/deletes and thus how big it thinks the table is, you have to
>> >>>> count all the rows by reading them.  HBase is no different, and a
>> >>>> little harder, because we have a variable length data format, so we
>> >>>> can't just estimate row sizes from file sizes.  Keeping distributed
>> >>>> stats is not impossible, but certainly not on any priority list to be
>> >>>> implemented - of course JIRAs/patches welcome etc.
>> >>>>
>> >>>> -ryan
>> >>>>
>> >>>>
>> >>>> On Thu, Jul 29, 2010 at 3:48 PM, Ted Yu <yu...@gmail.com> wrote:
>> >>>> > We use HBase 0.20.5
>> >>>> >
>> >>>> > Here is the snippet from RowCounter output:
>> >>>> >
>> >>>> > 10/07/29 22:38:42 DEBUG client.HTable$ClientScanner: Finished with
>> >>>> scanning
>> >>>> > at REGION => {NAME =>
>> >>>> >
>> >>>>
>> >
>> '2__HB_NOINC_ORCL_SQLLDR_0728-THREEGPPSPEECHCALLS-1280408509541-0,DFF46493EB352D0E31CBFA4652E6EC06,1280412540858',
>> >>>> > STARTKEY => 'DFF46493EB352D0E31CBFA4652E6EC06', ENDKEY => '',
>> ENCODED
>> > =>
>> >>>> > 1375318608, TABLE => {{NAME =>
>> >>>> > '2__HB_NOINC_ORCL_SQLLDR_0728-THREEGPPSPEECHCALLS-1280408509541-0',
>> >>>> FAMILIES
>> >>>> > => [{NAME => 'd', COMPRESSION => 'GZ', VERSIONS => '2', TTL =>
>> >>>> '31536000',
>> >>>> > BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'},
>> > {NAME
>> >>>> =>
>> >>>> > 'i', COMPRESSION => 'GZ', VERSIONS => '2', TTL => '31536000',
>> > BLOCKSIZE
>> >>>> =>
>> >>>> > '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'}, {NAME =>
>> 'v',
>> >>>> > COMPRESSION => 'GZ', VERSIONS => '2', TTL => '31536000', BLOCKSIZE
>> =>
>> >>>> > '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'}]}}
>> >>>> > 10/07/29 22:38:42 INFO mapred.TaskRunner:
>> >>>> Task:attempt_local_0001_m_000000_0
>> >>>> > is done. And is in the process of commiting
>> >>>> > 10/07/29 22:38:42 INFO mapred.LocalJobRunner:
>> >>>> > 10/07/29 22:38:42 INFO mapred.TaskRunner: Task
>> >>>> attempt_local_0001_m_000000_0
>> >>>> > is allowed to commit now
>> >>>> > 10/07/29 22:38:42 INFO mapred.FileOutputCommitter: Saved output of
>> > task
>> >>>> > 'attempt_local_0001_m_000000_0' to
>> >>>> > file:/usr/local/hadoop/trunk.80-275066/hbase-0.20.5/rc
>> >>>> > 10/07/29 22:38:42 INFO mapred.LocalJobRunner:
>> >>>> > 10/07/29 22:38:42 INFO mapred.TaskRunner: Task
>> >>>> > 'attempt_local_0001_m_000000_0' done.
>> >>>> > 10/07/29 22:38:43 INFO mapred.JobClient:  map 100% reduce 0%
>> >>>> > 10/07/29 22:38:43 INFO mapred.JobClient: Job complete:
>> job_local_0001
>> >>>> > 10/07/29 22:38:43 INFO mapred.JobClient: Counters: 6
>> >>>> > 10/07/29 22:38:43 INFO mapred.JobClient:   FileSystemCounters
>> >>>> > 10/07/29 22:38:43 INFO mapred.JobClient:
>> FILE_BYTES_READ=1592883
>> >>>> > 10/07/29 22:38:43 INFO mapred.JobClient:
>> > FILE_BYTES_WRITTEN=1624956
>> >>>> > 10/07/29 22:38:43 INFO mapred.JobClient:   Map-Reduce Framework
>> >>>> > 10/07/29 22:38:43 INFO mapred.JobClient:     Map input records=0
>> >>>> > 10/07/29 22:38:43 INFO mapred.JobClient:     Spilled Records=0
>> >>>> > 10/07/29 22:38:43 INFO mapred.JobClient:     Map input bytes=0
>> >>>> > 10/07/29 22:38:43 INFO mapred.JobClient:     Map output records=0
>> >>>> >
>> >>>> > [sjc1-hadoop8.sjc1:hadoop 3705]ls -l
>> >>>> > /usr/local/hadoop/trunk.80-275066/hbase-0.20.5/rc/part-00000
>> >>>> > -rwxrwxrwx 1 hadoop users 0 Jul 29 22:38
>> >>>> > /usr/local/hadoop/trunk.80-275066/hbase-0.20.5/rc/part-00000
>> >>>> >
>> >>>> > But there are many records in the table I was querying.
>> >>>> >
>> >>>> > Can someone comment ?
>> >>>> >
>> >>>> > On Thu, Jul 29, 2010 at 2:26 PM, Jean-Daniel Cryans <
>> > jdcryans@apache.org
>> >>>> >wrote:
>> >>>> >
>> >>>> >> In 0.89 you can specify CACHE for the count command. Set it higher
>> > (it
>> >>>> >> defaults to 10 rows per call).
>> >>>> >>
>> >>>> >> Also you can use the RowCounter MR job.
>> >>>> >>
>> >>>> >> J-D
>> >>>> >>
>> >>>> >> On Thu, Jul 29, 2010 at 2:22 PM, Ted Yu <yu...@gmail.com>
>> wrote:
>> >>>> >> > Hi,
>> >>>> >> > The count method in HBase shell is quite slow.
>> >>>> >> > Is there a way to obtain count faster ?
>> >>>> >> >
>> >>>> >> > Thanks
>> >>>> >> >
>> >>>> >>
>> >>>> >
>> >>>>
>> >>>
>> >>
>> >>
>> >>
>> >> --
>> >> Regards
>> >> Angus
>> >
>>
>>
>>
>> --
>> Regards
>> Angus
>>
>
>

Re: count of rows in table

Posted by Ted Yu <yu...@gmail.com>.
I think OR is more reasonable.

On Thu, Jul 29, 2010 at 8:54 PM, Angus He <an...@gmail.com> wrote:

> By the way
>
> If users input multiple columns, it seems that the current
> implementation of RowCounter employs the OR logical operation.
>
> Is the AND more reasonable?
>
>
>
> On Fri, Jul 30, 2010 at 11:13 AM, Ryan Rawson <ry...@gmail.com> wrote:
> > RowCounter job counts rows. Its answer will be how many distinct row keys
> > were in the table approximately at a given time range.
> >
> > Even if the implementation uses first kv filter nothing about what I just
> > said is false.
> >
> > A KeyValue counter would tell you how many cells and versions there were
> > total don't you think?
> >
> > On Jul 29, 2010 7:58 PM, "Angus He" <an...@gmail.com> wrote:
> >> Column names are just optional for RowCounter job.
> >>
> >> To be more accurate, RowCounter is a KeyValueCounter.
> >> If no columns are specified, only the first KeyValues of each row are
> >> included, then get the RowCounter.
> >>
> >>
> >> On Fri, Jul 30, 2010 at 9:28 AM, Ted Yu <yu...@gmail.com> wrote:
> >>> If someone can share the commandline for running RowCounter, that would
> > be
> >>> great.
> >>>
> >>> Also, hbase shell count doesn't require column name. Why does
> RowCounter
> >>> require it ?
> >>>
> >>> Thanks
> >>>
> >>> On Thu, Jul 29, 2010 at 4:55 PM, Ryan Rawson <ry...@gmail.com>
> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> That table appears to be empty.  Eg:
> >>>>
> >>>> 10/07/29 22:38:43 INFO mapred.JobClient:     Map input records=0
> >>>>
> >>>>
> >>>> So back to the count issue... Counting in databases is a classic
> >>>> problem. Unless your DB system is keeping stats on how many
> >>>> inserts/deletes and thus how big it thinks the table is, you have to
> >>>> count all the rows by reading them.  HBase is no different, and a
> >>>> little harder, because we have a variable length data format, so we
> >>>> can't just estimate row sizes from file sizes.  Keeping distributed
> >>>> stats is not impossible, but certainly not on any priority list to be
> >>>> implemented - of course JIRAs/patches welcome etc.
> >>>>
> >>>> -ryan
> >>>>
> >>>>
> >>>> On Thu, Jul 29, 2010 at 3:48 PM, Ted Yu <yu...@gmail.com> wrote:
> >>>> > We use HBase 0.20.5
> >>>> >
> >>>> > Here is the snippet from RowCounter output:
> >>>> >
> >>>> > 10/07/29 22:38:42 DEBUG client.HTable$ClientScanner: Finished with
> >>>> scanning
> >>>> > at REGION => {NAME =>
> >>>> >
> >>>>
> >
> '2__HB_NOINC_ORCL_SQLLDR_0728-THREEGPPSPEECHCALLS-1280408509541-0,DFF46493EB352D0E31CBFA4652E6EC06,1280412540858',
> >>>> > STARTKEY => 'DFF46493EB352D0E31CBFA4652E6EC06', ENDKEY => '',
> ENCODED
> > =>
> >>>> > 1375318608, TABLE => {{NAME =>
> >>>> > '2__HB_NOINC_ORCL_SQLLDR_0728-THREEGPPSPEECHCALLS-1280408509541-0',
> >>>> FAMILIES
> >>>> > => [{NAME => 'd', COMPRESSION => 'GZ', VERSIONS => '2', TTL =>
> >>>> '31536000',
> >>>> > BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'},
> > {NAME
> >>>> =>
> >>>> > 'i', COMPRESSION => 'GZ', VERSIONS => '2', TTL => '31536000',
> > BLOCKSIZE
> >>>> =>
> >>>> > '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'}, {NAME => 'v',
> >>>> > COMPRESSION => 'GZ', VERSIONS => '2', TTL => '31536000', BLOCKSIZE
> =>
> >>>> > '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'}]}}
> >>>> > 10/07/29 22:38:42 INFO mapred.TaskRunner:
> >>>> Task:attempt_local_0001_m_000000_0
> >>>> > is done. And is in the process of commiting
> >>>> > 10/07/29 22:38:42 INFO mapred.LocalJobRunner:
> >>>> > 10/07/29 22:38:42 INFO mapred.TaskRunner: Task
> >>>> attempt_local_0001_m_000000_0
> >>>> > is allowed to commit now
> >>>> > 10/07/29 22:38:42 INFO mapred.FileOutputCommitter: Saved output of
> > task
> >>>> > 'attempt_local_0001_m_000000_0' to
> >>>> > file:/usr/local/hadoop/trunk.80-275066/hbase-0.20.5/rc
> >>>> > 10/07/29 22:38:42 INFO mapred.LocalJobRunner:
> >>>> > 10/07/29 22:38:42 INFO mapred.TaskRunner: Task
> >>>> > 'attempt_local_0001_m_000000_0' done.
> >>>> > 10/07/29 22:38:43 INFO mapred.JobClient:  map 100% reduce 0%
> >>>> > 10/07/29 22:38:43 INFO mapred.JobClient: Job complete:
> job_local_0001
> >>>> > 10/07/29 22:38:43 INFO mapred.JobClient: Counters: 6
> >>>> > 10/07/29 22:38:43 INFO mapred.JobClient:   FileSystemCounters
> >>>> > 10/07/29 22:38:43 INFO mapred.JobClient:     FILE_BYTES_READ=1592883
> >>>> > 10/07/29 22:38:43 INFO mapred.JobClient:
> > FILE_BYTES_WRITTEN=1624956
> >>>> > 10/07/29 22:38:43 INFO mapred.JobClient:   Map-Reduce Framework
> >>>> > 10/07/29 22:38:43 INFO mapred.JobClient:     Map input records=0
> >>>> > 10/07/29 22:38:43 INFO mapred.JobClient:     Spilled Records=0
> >>>> > 10/07/29 22:38:43 INFO mapred.JobClient:     Map input bytes=0
> >>>> > 10/07/29 22:38:43 INFO mapred.JobClient:     Map output records=0
> >>>> >
> >>>> > [sjc1-hadoop8.sjc1:hadoop 3705]ls -l
> >>>> > /usr/local/hadoop/trunk.80-275066/hbase-0.20.5/rc/part-00000
> >>>> > -rwxrwxrwx 1 hadoop users 0 Jul 29 22:38
> >>>> > /usr/local/hadoop/trunk.80-275066/hbase-0.20.5/rc/part-00000
> >>>> >
> >>>> > But there are many records in the table I was querying.
> >>>> >
> >>>> > Can someone comment ?
> >>>> >
> >>>> > On Thu, Jul 29, 2010 at 2:26 PM, Jean-Daniel Cryans <
> > jdcryans@apache.org
> >>>> >wrote:
> >>>> >
> >>>> >> In 0.89 you can specify CACHE for the count command. Set it higher
> > (it
> >>>> >> defaults to 10 rows per call).
> >>>> >>
> >>>> >> Also you can use the RowCounter MR job.
> >>>> >>
> >>>> >> J-D
> >>>> >>
> >>>> >> On Thu, Jul 29, 2010 at 2:22 PM, Ted Yu <yu...@gmail.com>
> wrote:
> >>>> >> > Hi,
> >>>> >> > The count method in HBase shell is quite slow.
> >>>> >> > Is there a way to obtain count faster ?
> >>>> >> >
> >>>> >> > Thanks
> >>>> >> >
> >>>> >>
> >>>> >
> >>>>
> >>>
> >>
> >>
> >>
> >> --
> >> Regards
> >> Angus
> >
>
>
>
> --
> Regards
> Angus
>

Re: count of rows in table

Posted by Angus He <an...@gmail.com>.
By the way, if users specify multiple columns, the current
implementation of RowCounter appears to combine them with a logical OR.

Would AND be more reasonable?
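
To make the two readings concrete, here is a toy sketch (not RowCounter's
code): a table is modelled as a dict from row key to the set of column names
present in that row, and a row is counted either when it holds any of the
requested columns (OR) or only when it holds all of them (AND). The function
name, the dict model, and the sample data are assumptions for illustration.

```python
def rows_matching(table, columns, mode="or"):
    """Count rows holding any ("or") or all ("and") of the given columns.

    table: dict mapping row key -> set of column names present in that row
    (a toy model, not an HBase API).
    """
    wanted = set(columns)
    if mode == "or":
        # A row counts if it shares at least one column with the request.
        return sum(1 for cols in table.values() if cols & wanted)
    if mode == "and":
        # A row counts only if every requested column is present.
        return sum(1 for cols in table.values() if wanted <= cols)
    raise ValueError("mode must be 'or' or 'and'")


if __name__ == "__main__":
    table = {
        "r1": {"d:a", "i:b"},
        "r2": {"d:a"},
        "r3": {"v:c"},
    }
    print(rows_matching(table, ["d:a", "i:b"], mode="or"))
    print(rows_matching(table, ["d:a", "i:b"], mode="and"))
```

With sparse rows the two answers can differ a lot, so which one is "more
reasonable" depends on whether the caller wants rows touching the columns or
rows that are complete with respect to them.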



On Fri, Jul 30, 2010 at 11:13 AM, Ryan Rawson <ry...@gmail.com> wrote:
> RowCounter job counts rows. Its answer will be how many distinct row keys
> were in the table approximately at a given time range.
>
> Even if the implementation uses first kv filter nothing about what I just
> said is false.
>
> A KeyValue counter would tell you how many cells and versions there were
> total don't you think?
>
> On Jul 29, 2010 7:58 PM, "Angus He" <an...@gmail.com> wrote:
>> Column names are just optional for RowCounter job.
>>
>> To be more accurate, RowCounter is a KeyValueCounter.
>> If no columns are specified, only the first KeyValues of each row are
>> included, then get the RowCounter.
>>
>>
>> On Fri, Jul 30, 2010 at 9:28 AM, Ted Yu <yu...@gmail.com> wrote:
>>> If someone can share the commandline for running RowCounter, that would
> be
>>> great.
>>>
>>> Also, hbase shell count doesn't require column name. Why does RowCounter
>>> require it ?
>>>
>>> Thanks
>>>
>>> On Thu, Jul 29, 2010 at 4:55 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> That table appears to be empty.  Eg:
>>>>
>>>> 10/07/29 22:38:43 INFO mapred.JobClient:     Map input records=0
>>>>
>>>>
>>>> So back to the count issue... Counting in databases is a classic
>>>> problem. Unless your DB system is keeping stats on how many
>>>> inserts/deletes and thus how big it thinks the table is, you have to
>>>> count all the rows by reading them.  HBase is no different, and a
>>>> little harder, because we have a variable length data format, so we
>>>> can't just estimate row sizes from file sizes.  Keeping distributed
>>>> stats is not impossible, but certainly not on any priority list to be
>>>> implemented - of course JIRAs/patches welcome etc.
>>>>
>>>> -ryan
>>>>
>>>>
>>>> On Thu, Jul 29, 2010 at 3:48 PM, Ted Yu <yu...@gmail.com> wrote:
>>>> > We use HBase 0.20.5
>>>> >
>>>> > Here is the snippet from RowCounter output:
>>>> >
>>>> > 10/07/29 22:38:42 DEBUG client.HTable$ClientScanner: Finished with
>>>> scanning
>>>> > at REGION => {NAME =>
>>>> >
>>>>
> '2__HB_NOINC_ORCL_SQLLDR_0728-THREEGPPSPEECHCALLS-1280408509541-0,DFF46493EB352D0E31CBFA4652E6EC06,1280412540858',
>>>> > STARTKEY => 'DFF46493EB352D0E31CBFA4652E6EC06', ENDKEY => '', ENCODED
> =>
>>>> > 1375318608, TABLE => {{NAME =>
>>>> > '2__HB_NOINC_ORCL_SQLLDR_0728-THREEGPPSPEECHCALLS-1280408509541-0',
>>>> FAMILIES
>>>> > => [{NAME => 'd', COMPRESSION => 'GZ', VERSIONS => '2', TTL =>
>>>> '31536000',
>>>> > BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'},
> {NAME
>>>> =>
>>>> > 'i', COMPRESSION => 'GZ', VERSIONS => '2', TTL => '31536000',
> BLOCKSIZE
>>>> =>
>>>> > '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'}, {NAME => 'v',
>>>> > COMPRESSION => 'GZ', VERSIONS => '2', TTL => '31536000', BLOCKSIZE =>
>>>> > '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'}]}}
>>>> > 10/07/29 22:38:42 INFO mapred.TaskRunner:
>>>> Task:attempt_local_0001_m_000000_0
>>>> > is done. And is in the process of commiting
>>>> > 10/07/29 22:38:42 INFO mapred.LocalJobRunner:
>>>> > 10/07/29 22:38:42 INFO mapred.TaskRunner: Task
>>>> attempt_local_0001_m_000000_0
>>>> > is allowed to commit now
>>>> > 10/07/29 22:38:42 INFO mapred.FileOutputCommitter: Saved output of
> task
>>>> > 'attempt_local_0001_m_000000_0' to
>>>> > file:/usr/local/hadoop/trunk.80-275066/hbase-0.20.5/rc
>>>> > 10/07/29 22:38:42 INFO mapred.LocalJobRunner:
>>>> > 10/07/29 22:38:42 INFO mapred.TaskRunner: Task
>>>> > 'attempt_local_0001_m_000000_0' done.
>>>> > 10/07/29 22:38:43 INFO mapred.JobClient:  map 100% reduce 0%
>>>> > 10/07/29 22:38:43 INFO mapred.JobClient: Job complete: job_local_0001
>>>> > 10/07/29 22:38:43 INFO mapred.JobClient: Counters: 6
>>>> > 10/07/29 22:38:43 INFO mapred.JobClient:   FileSystemCounters
>>>> > 10/07/29 22:38:43 INFO mapred.JobClient:     FILE_BYTES_READ=1592883
>>>> > 10/07/29 22:38:43 INFO mapred.JobClient:
> FILE_BYTES_WRITTEN=1624956
>>>> > 10/07/29 22:38:43 INFO mapred.JobClient:   Map-Reduce Framework
>>>> > 10/07/29 22:38:43 INFO mapred.JobClient:     Map input records=0
>>>> > 10/07/29 22:38:43 INFO mapred.JobClient:     Spilled Records=0
>>>> > 10/07/29 22:38:43 INFO mapred.JobClient:     Map input bytes=0
>>>> > 10/07/29 22:38:43 INFO mapred.JobClient:     Map output records=0
>>>> >
>>>> > [sjc1-hadoop8.sjc1:hadoop 3705]ls -l
>>>> > /usr/local/hadoop/trunk.80-275066/hbase-0.20.5/rc/part-00000
>>>> > -rwxrwxrwx 1 hadoop users 0 Jul 29 22:38
>>>> > /usr/local/hadoop/trunk.80-275066/hbase-0.20.5/rc/part-00000
>>>> >
>>>> > But there are many records in the table I was querying.
>>>> >
>>>> > Can someone comment ?
>>>> >
>>>> > On Thu, Jul 29, 2010 at 2:26 PM, Jean-Daniel Cryans <
> jdcryans@apache.org
>>>> >wrote:
>>>> >
>>>> >> In 0.89 you can specify CACHE for the count command. Set it higher
> (it
>>>> >> defaults to 10 rows per call).
>>>> >>
>>>> >> Also you can use the RowCounter MR job.
>>>> >>
>>>> >> J-D
>>>> >>
>>>> >> On Thu, Jul 29, 2010 at 2:22 PM, Ted Yu <yu...@gmail.com> wrote:
>>>> >> > Hi,
>>>> >> > The count method in HBase shell is quite slow.
>>>> >> > Is there a way to obtain count faster ?
>>>> >> >
>>>> >> > Thanks
>>>> >> >
>>>> >>
>>>> >
>>>>
>>>
>>
>>
>>
>> --
>> Regards
>> Angus
>



-- 
Regards
Angus

Re: count of rows in table

Posted by Angus He <an...@gmail.com>.
Thanks, Ryan.

Yes, it only counts rows.  :)




On Fri, Jul 30, 2010 at 11:13 AM, Ryan Rawson <ry...@gmail.com> wrote:
> RowCounter job counts rows. Its answer will be how many distinct row keys
> were in the table approximately at a given time range.
>
> Even if the implementation uses first kv filter nothing about what I just
> said is false.
>
> A KeyValue counter would tell you how many cells and versions there were
> total don't you think?
>
> On Jul 29, 2010 7:58 PM, "Angus He" <an...@gmail.com> wrote:
>> Column names are just optional for RowCounter job.
>>
>> To be more accurate, RowCounter is a KeyValueCounter.
>> If no columns are specified, only the first KeyValues of each row are
>> included, then get the RowCounter.
>>
>>
>> On Fri, Jul 30, 2010 at 9:28 AM, Ted Yu <yu...@gmail.com> wrote:
>>> If someone can share the commandline for running RowCounter, that would
> be
>>> great.
>>>
>>> Also, hbase shell count doesn't require column name. Why does RowCounter
>>> require it ?
>>>
>>> Thanks
>>>
>>> On Thu, Jul 29, 2010 at 4:55 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> That table appears to be empty.  Eg:
>>>>
>>>> 10/07/29 22:38:43 INFO mapred.JobClient:     Map input records=0
>>>>
>>>>
>>>> So back to the count issue... Counting in databases is a classic
>>>> problem. Unless your DB system is keeping stats on how many
>>>> inserts/deletes and thus how big it thinks the table is, you have to
>>>> count all the rows by reading them.  HBase is no different, and a
>>>> little harder, because we have a variable length data format, so we
>>>> can't just estimate row sizes from file sizes.  Keeping distributed
>>>> stats is not impossible, but certainly not on any priority list to be
>>>> implemented - of course JIRAs/patches welcome etc.
>>>>
>>>> -ryan
>>>>
>>>>
>>>> On Thu, Jul 29, 2010 at 3:48 PM, Ted Yu <yu...@gmail.com> wrote:
>>>> > We use HBase 0.20.5
>>>> >
>>>> > Here is the snippet from RowCounter output:
>>>> >
>>>> > 10/07/29 22:38:42 DEBUG client.HTable$ClientScanner: Finished with
>>>> scanning
>>>> > at REGION => {NAME =>
>>>> >
>>>>
> '2__HB_NOINC_ORCL_SQLLDR_0728-THREEGPPSPEECHCALLS-1280408509541-0,DFF46493EB352D0E31CBFA4652E6EC06,1280412540858',
>>>> > STARTKEY => 'DFF46493EB352D0E31CBFA4652E6EC06', ENDKEY => '', ENCODED
> =>
>>>> > 1375318608, TABLE => {{NAME =>
>>>> > '2__HB_NOINC_ORCL_SQLLDR_0728-THREEGPPSPEECHCALLS-1280408509541-0',
>>>> FAMILIES
>>>> > => [{NAME => 'd', COMPRESSION => 'GZ', VERSIONS => '2', TTL =>
>>>> '31536000',
>>>> > BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'},
> {NAME
>>>> =>
>>>> > 'i', COMPRESSION => 'GZ', VERSIONS => '2', TTL => '31536000',
> BLOCKSIZE
>>>> =>
>>>> > '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'}, {NAME => 'v',
>>>> > COMPRESSION => 'GZ', VERSIONS => '2', TTL => '31536000', BLOCKSIZE =>
>>>> > '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'}]}}
>>>> > 10/07/29 22:38:42 INFO mapred.TaskRunner:
>>>> Task:attempt_local_0001_m_000000_0
>>>> > is done. And is in the process of commiting
>>>> > 10/07/29 22:38:42 INFO mapred.LocalJobRunner:
>>>> > 10/07/29 22:38:42 INFO mapred.TaskRunner: Task
>>>> attempt_local_0001_m_000000_0
>>>> > is allowed to commit now
>>>> > 10/07/29 22:38:42 INFO mapred.FileOutputCommitter: Saved output of
> task
>>>> > 'attempt_local_0001_m_000000_0' to
>>>> > file:/usr/local/hadoop/trunk.80-275066/hbase-0.20.5/rc
>>>> > 10/07/29 22:38:42 INFO mapred.LocalJobRunner:
>>>> > 10/07/29 22:38:42 INFO mapred.TaskRunner: Task
>>>> > 'attempt_local_0001_m_000000_0' done.
>>>> > 10/07/29 22:38:43 INFO mapred.JobClient:  map 100% reduce 0%
>>>> > 10/07/29 22:38:43 INFO mapred.JobClient: Job complete: job_local_0001
>>>> > 10/07/29 22:38:43 INFO mapred.JobClient: Counters: 6
>>>> > 10/07/29 22:38:43 INFO mapred.JobClient:   FileSystemCounters
>>>> > 10/07/29 22:38:43 INFO mapred.JobClient:     FILE_BYTES_READ=1592883
>>>> > 10/07/29 22:38:43 INFO mapred.JobClient:
> FILE_BYTES_WRITTEN=1624956
>>>> > 10/07/29 22:38:43 INFO mapred.JobClient:   Map-Reduce Framework
>>>> > 10/07/29 22:38:43 INFO mapred.JobClient:     Map input records=0
>>>> > 10/07/29 22:38:43 INFO mapred.JobClient:     Spilled Records=0
>>>> > 10/07/29 22:38:43 INFO mapred.JobClient:     Map input bytes=0
>>>> > 10/07/29 22:38:43 INFO mapred.JobClient:     Map output records=0
>>>> >
>>>> > [sjc1-hadoop8.sjc1:hadoop 3705]ls -l
>>>> > /usr/local/hadoop/trunk.80-275066/hbase-0.20.5/rc/part-00000
>>>> > -rwxrwxrwx 1 hadoop users 0 Jul 29 22:38
>>>> > /usr/local/hadoop/trunk.80-275066/hbase-0.20.5/rc/part-00000
>>>> >
>>>> > But there are many records in the table I was querying.
>>>> >
>>>> > Can someone comment ?
>>>> >
>>>> > On Thu, Jul 29, 2010 at 2:26 PM, Jean-Daniel Cryans <
> jdcryans@apache.org
>>>> >wrote:
>>>> >
>>>> >> In 0.89 you can specify CACHE for the count command. Set it higher
> (it
>>>> >> defaults to 10 rows per call).
>>>> >>
>>>> >> Also you can use the RowCounter MR job.
>>>> >>
>>>> >> J-D
>>>> >>
>>>> >> On Thu, Jul 29, 2010 at 2:22 PM, Ted Yu <yu...@gmail.com> wrote:
>>>> >> > Hi,
>>>> >> > The count method in HBase shell is quite slow.
>>>> >> > Is there a way to obtain count faster ?
>>>> >> >
>>>> >> > Thanks
>>>> >> >
>>>> >>
>>>> >
>>>>
>>>
>>
>>
>>
>> --
>> Regards
>> Angus
>



-- 
Regards
Angus

Re: count of rows in table

Posted by Ryan Rawson <ry...@gmail.com>.
The RowCounter job counts rows. Its answer is approximately how many
distinct row keys were in the table over a given time range.

Even if the implementation uses a first-KV filter, nothing I just said
is false.

A KeyValue counter would tell you how many cells and versions there were
in total, don't you think?
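
The distinction can be shown with a small toy model (not HBase code): each
stored cell version is a tuple of row key, column, and timestamp. A row count
is the number of distinct row keys, while a KeyValue count is the number of
stored cell versions. The Cell type and function names are made up for
illustration.

```python
from collections import namedtuple

# Toy stand-in for an HBase KeyValue: one version of one cell.
Cell = namedtuple("Cell", ["row", "column", "timestamp"])


def count_rows(cells):
    """Number of distinct row keys among the cells."""
    return len({c.row for c in cells})


def count_key_values(cells):
    """Number of stored cell versions (every entry counts)."""
    return len(cells)


if __name__ == "__main__":
    cells = [
        Cell("r1", "d:a", 2),
        Cell("r1", "d:a", 1),  # older version of the same cell
        Cell("r1", "i:b", 1),
        Cell("r2", "d:a", 1),
    ]
    print(count_rows(cells), count_key_values(cells))
```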

On Jul 29, 2010 7:58 PM, "Angus He" <an...@gmail.com> wrote:
> Column names are just optional for RowCounter job.
>
> To be more accurate, RowCounter is a KeyValueCounter.
> If no columns are specified, only the first KeyValues of each row are
> included, then get the RowCounter.
>
>
> On Fri, Jul 30, 2010 at 9:28 AM, Ted Yu <yu...@gmail.com> wrote:
>> If someone can share the commandline for running RowCounter, that would
be
>> great.
>>
>> Also, hbase shell count doesn't require column name. Why does RowCounter
>> require it ?
>>
>> Thanks
>>
>> On Thu, Jul 29, 2010 at 4:55 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> That table appears to be empty.  Eg:
>>>
>>> 10/07/29 22:38:43 INFO mapred.JobClient:     Map input records=0
>>>
>>>
>>> So back to the count issue... Counting in databases is a classic
>>> problem. Unless your DB system is keeping stats on how many
>>> inserts/deletes and thus how big it thinks the table is, you have to
>>> count all the rows by reading them.  HBase is no different, and a
>>> little harder, because we have a variable length data format, so we
>>> can't just estimate row sizes from file sizes.  Keeping distributed
>>> stats is not impossible, but certainly not on any priority list to be
>>> implemented - of course JIRAs/patches welcome etc.
>>>
>>> -ryan
>>>
>>>
>>> On Thu, Jul 29, 2010 at 3:48 PM, Ted Yu <yu...@gmail.com> wrote:
>>> > We use HBase 0.20.5
>>> >
>>> > Here is the snippet from RowCounter output:
>>> >
>>> > 10/07/29 22:38:42 DEBUG client.HTable$ClientScanner: Finished with
>>> scanning
>>> > at REGION => {NAME =>
>>> >
>>>
'2__HB_NOINC_ORCL_SQLLDR_0728-THREEGPPSPEECHCALLS-1280408509541-0,DFF46493EB352D0E31CBFA4652E6EC06,1280412540858',
>>> > STARTKEY => 'DFF46493EB352D0E31CBFA4652E6EC06', ENDKEY => '', ENCODED
=>
>>> > 1375318608, TABLE => {{NAME =>
>>> > '2__HB_NOINC_ORCL_SQLLDR_0728-THREEGPPSPEECHCALLS-1280408509541-0',
>>> FAMILIES
>>> > => [{NAME => 'd', COMPRESSION => 'GZ', VERSIONS => '2', TTL =>
>>> '31536000',
>>> > BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'},
{NAME
>>> =>
>>> > 'i', COMPRESSION => 'GZ', VERSIONS => '2', TTL => '31536000',
BLOCKSIZE
>>> =>
>>> > '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'}, {NAME => 'v',
>>> > COMPRESSION => 'GZ', VERSIONS => '2', TTL => '31536000', BLOCKSIZE =>
>>> > '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'}]}}
>>> > 10/07/29 22:38:42 INFO mapred.TaskRunner:
>>> Task:attempt_local_0001_m_000000_0
>>> > is done. And is in the process of commiting
>>> > 10/07/29 22:38:42 INFO mapred.LocalJobRunner:
>>> > 10/07/29 22:38:42 INFO mapred.TaskRunner: Task
>>> attempt_local_0001_m_000000_0
>>> > is allowed to commit now
>>> > 10/07/29 22:38:42 INFO mapred.FileOutputCommitter: Saved output of
task
>>> > 'attempt_local_0001_m_000000_0' to
>>> > file:/usr/local/hadoop/trunk.80-275066/hbase-0.20.5/rc
>>> > 10/07/29 22:38:42 INFO mapred.LocalJobRunner:
>>> > 10/07/29 22:38:42 INFO mapred.TaskRunner: Task
>>> > 'attempt_local_0001_m_000000_0' done.
>>> > 10/07/29 22:38:43 INFO mapred.JobClient:  map 100% reduce 0%
>>> > 10/07/29 22:38:43 INFO mapred.JobClient: Job complete: job_local_0001
>>> > 10/07/29 22:38:43 INFO mapred.JobClient: Counters: 6
>>> > 10/07/29 22:38:43 INFO mapred.JobClient:   FileSystemCounters
>>> > 10/07/29 22:38:43 INFO mapred.JobClient:     FILE_BYTES_READ=1592883
>>> > 10/07/29 22:38:43 INFO mapred.JobClient:
FILE_BYTES_WRITTEN=1624956
>>> > 10/07/29 22:38:43 INFO mapred.JobClient:   Map-Reduce Framework
>>> > 10/07/29 22:38:43 INFO mapred.JobClient:     Map input records=0
>>> > 10/07/29 22:38:43 INFO mapred.JobClient:     Spilled Records=0
>>> > 10/07/29 22:38:43 INFO mapred.JobClient:     Map input bytes=0
>>> > 10/07/29 22:38:43 INFO mapred.JobClient:     Map output records=0
>>> >
>>> > [sjc1-hadoop8.sjc1:hadoop 3705]ls -l
>>> > /usr/local/hadoop/trunk.80-275066/hbase-0.20.5/rc/part-00000
>>> > -rwxrwxrwx 1 hadoop users 0 Jul 29 22:38
>>> > /usr/local/hadoop/trunk.80-275066/hbase-0.20.5/rc/part-00000
>>> >
>>> > But there are many records in the table I was querying.
>>> >
>>> > Can someone comment ?
>>> >
>>> > On Thu, Jul 29, 2010 at 2:26 PM, Jean-Daniel Cryans <
jdcryans@apache.org
>>> >wrote:
>>> >
>>> >> In 0.89 you can specify CACHE for the count command. Set it higher
(it
>>> >> defaults to 10 rows per call).
>>> >>
>>> >> Also you can use the RowCounter MR job.
>>> >>
>>> >> J-D
>>> >>
>>> >> On Thu, Jul 29, 2010 at 2:22 PM, Ted Yu <yu...@gmail.com> wrote:
>>> >> > Hi,
>>> >> > The count method in HBase shell is quite slow.
>>> >> > Is there a way to obtain count faster ?
>>> >> >
>>> >> > Thanks
>>> >> >
>>> >>
>>> >
>>>
>>
>
>
>
> --
> Regards
> Angus

Re: count of rows in table

Posted by Ted Yu <yu...@gmail.com>.
Thanks for the reply.
I used the rowcounter tool on 5 tables - we have been using striped tables
since before HBASE-2473 was implemented.

I logged https://issues.apache.org/jira/browse/HBASE-2891

On Thu, Jul 29, 2010 at 7:57 PM, Angus He <an...@gmail.com> wrote:

> Column names are just optional for RowCounter job.
>
> To be more accurate, RowCounter is a KeyValueCounter.
> If no columns are specified, only the first KeyValues of each row are
> included, then get the RowCounter.
>
>
> On Fri, Jul 30, 2010 at 9:28 AM, Ted Yu <yu...@gmail.com> wrote:
> > If someone can share the commandline for running RowCounter, that would
> be
> > great.
> >
> > Also, hbase shell count doesn't require column name. Why does RowCounter
> > require it ?
> >
> > Thanks
> >
> > On Thu, Jul 29, 2010 at 4:55 PM, Ryan Rawson <ry...@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> That table appears to be empty.  Eg:
> >>
> >> 10/07/29 22:38:43 INFO mapred.JobClient:     Map input records=0
> >>
> >>
> >> So back to the count issue... Counting in databases is a classic
> >> problem. Unless your DB system is keeping stats on how many
> >> inserts/deletes and thus how big it thinks the table is, you have to
> >> count all the rows by reading them.  HBase is no different, and a
> >> little harder, because we have a variable length data format, so we
> >> can't just estimate row sizes from file sizes.  Keeping distributed
> >> stats is not impossible, but certainly not on any priority list to be
> >> implemented - of course JIRAs/patches welcome etc.
> >>
> >> -ryan
> >>
> >>
> >> On Thu, Jul 29, 2010 at 3:48 PM, Ted Yu <yu...@gmail.com> wrote:
> >> > We use HBase 0.20.5
> >> >
> >> > Here is the snippet from RowCounter output:
> >> >
> >> > 10/07/29 22:38:42 DEBUG client.HTable$ClientScanner: Finished with
> >> scanning
> >> > at REGION => {NAME =>
> >> >
> >>
> '2__HB_NOINC_ORCL_SQLLDR_0728-THREEGPPSPEECHCALLS-1280408509541-0,DFF46493EB352D0E31CBFA4652E6EC06,1280412540858',
> >> > STARTKEY => 'DFF46493EB352D0E31CBFA4652E6EC06', ENDKEY => '', ENCODED
> =>
> >> > 1375318608, TABLE => {{NAME =>
> >> > '2__HB_NOINC_ORCL_SQLLDR_0728-THREEGPPSPEECHCALLS-1280408509541-0',
> >> FAMILIES
> >> > => [{NAME => 'd', COMPRESSION => 'GZ', VERSIONS => '2', TTL =>
> >> '31536000',
> >> > BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'},
> {NAME
> >> =>
> >> > 'i', COMPRESSION => 'GZ', VERSIONS => '2', TTL => '31536000',
> BLOCKSIZE
> >> =>
> >> > '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'}, {NAME => 'v',
> >> > COMPRESSION => 'GZ', VERSIONS => '2', TTL => '31536000', BLOCKSIZE =>
> >> > '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'}]}}
> >> > 10/07/29 22:38:42 INFO mapred.TaskRunner:
> >> Task:attempt_local_0001_m_000000_0
> >> > is done. And is in the process of commiting
> >> > 10/07/29 22:38:42 INFO mapred.LocalJobRunner:
> >> > 10/07/29 22:38:42 INFO mapred.TaskRunner: Task
> >> attempt_local_0001_m_000000_0
> >> > is allowed to commit now
> >> > 10/07/29 22:38:42 INFO mapred.FileOutputCommitter: Saved output of
> task
> >> > 'attempt_local_0001_m_000000_0' to
> >> > file:/usr/local/hadoop/trunk.80-275066/hbase-0.20.5/rc
> >> > 10/07/29 22:38:42 INFO mapred.LocalJobRunner:
> >> > 10/07/29 22:38:42 INFO mapred.TaskRunner: Task
> >> > 'attempt_local_0001_m_000000_0' done.
> >> > 10/07/29 22:38:43 INFO mapred.JobClient:  map 100% reduce 0%
> >> > 10/07/29 22:38:43 INFO mapred.JobClient: Job complete: job_local_0001
> >> > 10/07/29 22:38:43 INFO mapred.JobClient: Counters: 6
> >> > 10/07/29 22:38:43 INFO mapred.JobClient:   FileSystemCounters
> >> > 10/07/29 22:38:43 INFO mapred.JobClient:     FILE_BYTES_READ=1592883
> >> > 10/07/29 22:38:43 INFO mapred.JobClient:
> FILE_BYTES_WRITTEN=1624956
> >> > 10/07/29 22:38:43 INFO mapred.JobClient:   Map-Reduce Framework
> >> > 10/07/29 22:38:43 INFO mapred.JobClient:     Map input records=0
> >> > 10/07/29 22:38:43 INFO mapred.JobClient:     Spilled Records=0
> >> > 10/07/29 22:38:43 INFO mapred.JobClient:     Map input bytes=0
> >> > 10/07/29 22:38:43 INFO mapred.JobClient:     Map output records=0
> >> >
> >> > [sjc1-hadoop8.sjc1:hadoop 3705]ls -l
> >> > /usr/local/hadoop/trunk.80-275066/hbase-0.20.5/rc/part-00000
> >> > -rwxrwxrwx 1 hadoop users 0 Jul 29 22:38
> >> > /usr/local/hadoop/trunk.80-275066/hbase-0.20.5/rc/part-00000
> >> >
> >> > But there are many records in the table I was querying.
> >> >
> >> > Can someone comment ?
> >> >
> >> > On Thu, Jul 29, 2010 at 2:26 PM, Jean-Daniel Cryans <
> jdcryans@apache.org
> >> >wrote:
> >> >
> >> >> In 0.89 you can specify CACHE for the count command. Set it higher
> (it
> >> >> defaults to 10 rows per call).
> >> >>
> >> >> Also you can use the RowCounter MR job.
> >> >>
> >> >> J-D
> >> >>
> >> >> On Thu, Jul 29, 2010 at 2:22 PM, Ted Yu <yu...@gmail.com> wrote:
> >> >> > Hi,
> >> >> > The count method in HBase shell is quite slow.
> >> >> > Is there a way to obtain count faster ?
> >> >> >
> >> >> > Thanks
> >> >> >
> >> >>
> >> >
> >>
> >
>
>
>
> --
> Regards
> Angus
>

Re: count of rows in table

Posted by Angus He <an...@gmail.com>.
Column names are optional for the RowCounter job.

To be more accurate, RowCounter is a KeyValueCounter.
If no columns are specified, only the first KeyValue of each row is
included, which gives the row count.
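
A minimal sketch of that distinction, in plain Python rather than the HBase
API. The cell tuples and function names are hypothetical stand-ins, and it
assumes cells arrive sorted by row key, as a scan would return them.
Collapsing each run of cells that share a row key counts rows; counting
every cell counts KeyValues.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical cells as (row_key, column, value), sorted by row key
# the way a table scan would return them.
cells = [
    ("row1", "d:a", "1"),
    ("row1", "d:b", "2"),
    ("row2", "d:a", "3"),
    ("row3", "d:a", "4"),
    ("row3", "i:x", "5"),
    ("row3", "v:y", "6"),
]


def count_rows(sorted_cells):
    """Count rows by collapsing each run of cells that share a row key."""
    return sum(1 for _row, _group in groupby(sorted_cells, key=itemgetter(0)))


def count_key_values(sorted_cells):
    """Count every cell, regardless of which row it belongs to."""
    return sum(1 for _ in sorted_cells)


print("rows:", count_rows(cells))
print("key values:", count_key_values(cells))
```

The two numbers only agree when every row holds exactly one cell, which is
why restricting the scan to one KeyValue per row matters for a row count.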


On Fri, Jul 30, 2010 at 9:28 AM, Ted Yu <yu...@gmail.com> wrote:
> If someone can share the commandline for running RowCounter, that would be
> great.
>
> Also, hbase shell count doesn't require column name. Why does RowCounter
> require it ?
>
> Thanks
>
> On Thu, Jul 29, 2010 at 4:55 PM, Ryan Rawson <ry...@gmail.com> wrote:
>
>> Hi,
>>
>> That table appears to be empty.  Eg:
>>
>> 10/07/29 22:38:43 INFO mapred.JobClient:     Map input records=0
>>
>>
>> So back to the count issue... Counting in databases is a classic
>> problem. Unless your DB system is keeping stats on how many
>> inserts/deletes and thus how big it thinks the table is, you have to
>> count all the rows by reading them.  HBase is no different, and a
>> little harder, because we have a variable length data format, so we
>> can't just estimate row sizes from file sizes.  Keeping distributed
>> stats is not impossible, but certainly not on any priority list to be
>> implemented - of course JIRAs/patches welcome etc.
>>
>> -ryan
>>
>>
>> On Thu, Jul 29, 2010 at 3:48 PM, Ted Yu <yu...@gmail.com> wrote:
>> > We use HBase 0.20.5
>> >
>> > Here is the snippet from RowCounter output:
>> >
>> > 10/07/29 22:38:42 DEBUG client.HTable$ClientScanner: Finished with
>> scanning
>> > at REGION => {NAME =>
>> >
>> '2__HB_NOINC_ORCL_SQLLDR_0728-THREEGPPSPEECHCALLS-1280408509541-0,DFF46493EB352D0E31CBFA4652E6EC06,1280412540858',
>> > STARTKEY => 'DFF46493EB352D0E31CBFA4652E6EC06', ENDKEY => '', ENCODED =>
>> > 1375318608, TABLE => {{NAME =>
>> > '2__HB_NOINC_ORCL_SQLLDR_0728-THREEGPPSPEECHCALLS-1280408509541-0',
>> FAMILIES
>> > => [{NAME => 'd', COMPRESSION => 'GZ', VERSIONS => '2', TTL =>
>> '31536000',
>> > BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'}, {NAME
>> =>
>> > 'i', COMPRESSION => 'GZ', VERSIONS => '2', TTL => '31536000', BLOCKSIZE
>> =>
>> > '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'}, {NAME => 'v',
>> > COMPRESSION => 'GZ', VERSIONS => '2', TTL => '31536000', BLOCKSIZE =>
>> > '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'}]}}
>> > 10/07/29 22:38:42 INFO mapred.TaskRunner:
>> Task:attempt_local_0001_m_000000_0
>> > is done. And is in the process of commiting
>> > 10/07/29 22:38:42 INFO mapred.LocalJobRunner:
>> > 10/07/29 22:38:42 INFO mapred.TaskRunner: Task
>> attempt_local_0001_m_000000_0
>> > is allowed to commit now
>> > 10/07/29 22:38:42 INFO mapred.FileOutputCommitter: Saved output of task
>> > 'attempt_local_0001_m_000000_0' to
>> > file:/usr/local/hadoop/trunk.80-275066/hbase-0.20.5/rc
>> > 10/07/29 22:38:42 INFO mapred.LocalJobRunner:
>> > 10/07/29 22:38:42 INFO mapred.TaskRunner: Task
>> > 'attempt_local_0001_m_000000_0' done.
>> > 10/07/29 22:38:43 INFO mapred.JobClient:  map 100% reduce 0%
>> > 10/07/29 22:38:43 INFO mapred.JobClient: Job complete: job_local_0001
>> > 10/07/29 22:38:43 INFO mapred.JobClient: Counters: 6
>> > 10/07/29 22:38:43 INFO mapred.JobClient:   FileSystemCounters
>> > 10/07/29 22:38:43 INFO mapred.JobClient:     FILE_BYTES_READ=1592883
>> > 10/07/29 22:38:43 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=1624956
>> > 10/07/29 22:38:43 INFO mapred.JobClient:   Map-Reduce Framework
>> > 10/07/29 22:38:43 INFO mapred.JobClient:     Map input records=0
>> > 10/07/29 22:38:43 INFO mapred.JobClient:     Spilled Records=0
>> > 10/07/29 22:38:43 INFO mapred.JobClient:     Map input bytes=0
>> > 10/07/29 22:38:43 INFO mapred.JobClient:     Map output records=0
>> >
>> > [sjc1-hadoop8.sjc1:hadoop 3705]ls -l
>> > /usr/local/hadoop/trunk.80-275066/hbase-0.20.5/rc/part-00000
>> > -rwxrwxrwx 1 hadoop users 0 Jul 29 22:38
>> > /usr/local/hadoop/trunk.80-275066/hbase-0.20.5/rc/part-00000
>> >
>> > But there are many records in the table I was querying.
>> >
>> > Can someone comment ?
>> >
>> > On Thu, Jul 29, 2010 at 2:26 PM, Jean-Daniel Cryans <jdcryans@apache.org
>> >wrote:
>> >
>> >> In 0.89 you can specify CACHE for the count command. Set it higher (it
>> >> defaults to 10 rows per call).
>> >>
>> >> Also you can use the RowCounter MR job.
>> >>
>> >> J-D
>> >>
>> >> On Thu, Jul 29, 2010 at 2:22 PM, Ted Yu <yu...@gmail.com> wrote:
>> >> > Hi,
>> >> > The count method in HBase shell is quite slow.
>> >> > Is there a way to obtain count faster ?
>> >> >
>> >> > Thanks
>> >> >
>> >>
>> >
>>
>



-- 
Regards
Angus

Re: count of rows in table

Posted by Ted Yu <yu...@gmail.com>.
If someone can share the command line for running RowCounter, that would be
great.

Also, the hbase shell count doesn't require a column name. Why does RowCounter
require it ?

Thanks

On Thu, Jul 29, 2010 at 4:55 PM, Ryan Rawson <ry...@gmail.com> wrote:

> Hi,
>
> That table appears to be empty.  Eg:
>
> 10/07/29 22:38:43 INFO mapred.JobClient:     Map input records=0
>
>
> So back to the count issue... Counting in databases is a classic
> problem. Unless your DB system is keeping stats on how many
> inserts/deletes and thus how big it thinks the table is, you have to
> count all the rows by reading them.  HBase is no different, and a
> little harder, because we have a variable length data format, so we
> can't just estimate row sizes from file sizes.  Keeping distributed
> stats is not impossible, but certainly not on any priority list to be
> implemented - of course JIRAs/patches welcome etc.
>
> -ryan
>
>
> On Thu, Jul 29, 2010 at 3:48 PM, Ted Yu <yu...@gmail.com> wrote:
> > We use HBase 0.20.5
> >
> > Here is the snippet from RowCounter output:
> >
> > 10/07/29 22:38:42 DEBUG client.HTable$ClientScanner: Finished with
> scanning
> > at REGION => {NAME =>
> >
> '2__HB_NOINC_ORCL_SQLLDR_0728-THREEGPPSPEECHCALLS-1280408509541-0,DFF46493EB352D0E31CBFA4652E6EC06,1280412540858',
> > STARTKEY => 'DFF46493EB352D0E31CBFA4652E6EC06', ENDKEY => '', ENCODED =>
> > 1375318608, TABLE => {{NAME =>
> > '2__HB_NOINC_ORCL_SQLLDR_0728-THREEGPPSPEECHCALLS-1280408509541-0',
> FAMILIES
> > => [{NAME => 'd', COMPRESSION => 'GZ', VERSIONS => '2', TTL =>
> '31536000',
> > BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'}, {NAME
> =>
> > 'i', COMPRESSION => 'GZ', VERSIONS => '2', TTL => '31536000', BLOCKSIZE
> =>
> > '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'}, {NAME => 'v',
> > COMPRESSION => 'GZ', VERSIONS => '2', TTL => '31536000', BLOCKSIZE =>
> > '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'}]}}
> > 10/07/29 22:38:42 INFO mapred.TaskRunner:
> Task:attempt_local_0001_m_000000_0
> > is done. And is in the process of commiting
> > 10/07/29 22:38:42 INFO mapred.LocalJobRunner:
> > 10/07/29 22:38:42 INFO mapred.TaskRunner: Task
> attempt_local_0001_m_000000_0
> > is allowed to commit now
> > 10/07/29 22:38:42 INFO mapred.FileOutputCommitter: Saved output of task
> > 'attempt_local_0001_m_000000_0' to
> > file:/usr/local/hadoop/trunk.80-275066/hbase-0.20.5/rc
> > 10/07/29 22:38:42 INFO mapred.LocalJobRunner:
> > 10/07/29 22:38:42 INFO mapred.TaskRunner: Task
> > 'attempt_local_0001_m_000000_0' done.
> > 10/07/29 22:38:43 INFO mapred.JobClient:  map 100% reduce 0%
> > 10/07/29 22:38:43 INFO mapred.JobClient: Job complete: job_local_0001
> > 10/07/29 22:38:43 INFO mapred.JobClient: Counters: 6
> > 10/07/29 22:38:43 INFO mapred.JobClient:   FileSystemCounters
> > 10/07/29 22:38:43 INFO mapred.JobClient:     FILE_BYTES_READ=1592883
> > 10/07/29 22:38:43 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=1624956
> > 10/07/29 22:38:43 INFO mapred.JobClient:   Map-Reduce Framework
> > 10/07/29 22:38:43 INFO mapred.JobClient:     Map input records=0
> > 10/07/29 22:38:43 INFO mapred.JobClient:     Spilled Records=0
> > 10/07/29 22:38:43 INFO mapred.JobClient:     Map input bytes=0
> > 10/07/29 22:38:43 INFO mapred.JobClient:     Map output records=0
> >
> > [sjc1-hadoop8.sjc1:hadoop 3705]ls -l
> > /usr/local/hadoop/trunk.80-275066/hbase-0.20.5/rc/part-00000
> > -rwxrwxrwx 1 hadoop users 0 Jul 29 22:38
> > /usr/local/hadoop/trunk.80-275066/hbase-0.20.5/rc/part-00000
> >
> > But there are many records in the table I was querying.
> >
> > Can someone comment ?
> >
> > On Thu, Jul 29, 2010 at 2:26 PM, Jean-Daniel Cryans <jdcryans@apache.org
> >wrote:
> >
> >> In 0.89 you can specify CACHE for the count command. Set it higher (it
> >> defaults to 10 rows per call).
> >>
> >> Also you can use the RowCounter MR job.
> >>
> >> J-D
> >>
> >> On Thu, Jul 29, 2010 at 2:22 PM, Ted Yu <yu...@gmail.com> wrote:
> >> > Hi,
> >> > The count method in HBase shell is quite slow.
> >> > Is there a way to obtain count faster ?
> >> >
> >> > Thanks
> >> >
> >>
> >
>

Re: count of rows in table

Posted by Ryan Rawson <ry...@gmail.com>.
Hi,

That table appears to be empty, e.g.:

10/07/29 22:38:43 INFO mapred.JobClient:     Map input records=0


So back to the count issue... Counting in databases is a classic
problem. Unless your DB system keeps stats on inserts and deletes, and
thus on how big it thinks the table is, you have to count all the rows
by reading them.  HBase is no different, and a little harder: we have a
variable-length data format, so we can't just estimate row counts from
file sizes.  Keeping distributed stats is not impossible, but it is
certainly not on any priority list to be implemented. Of course,
JIRAs/patches are welcome.
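
A toy illustration of the file-size point, in plain Python and not tied to
any HBase internals. The two "files" and the helper names are hypothetical;
the sketch only assumes that rows can have different lengths. Both files
have the same size on disk, so a size-based estimate gives the same answer
for both, while reading them shows very different row counts.

```python
# Two hypothetical files with the same total size but different row lengths.
file_a = [b"x" * 100] * 10   # 10 rows of 100 bytes each
file_b = [b"x" * 10] * 100   # 100 rows of 10 bytes each


def total_bytes(rows):
    """Size of the file if rows were stored back to back."""
    return sum(len(row) for row in rows)


def estimate_rows(file_size, assumed_row_size):
    """Estimate a row count from file size, assuming fixed-size rows."""
    return file_size // assumed_row_size


def count_by_reading(rows):
    """The reliable way: visit every row and count it."""
    return sum(1 for _ in rows)


for name, rows in (("file_a", file_a), ("file_b", file_b)):
    size = total_bytes(rows)
    print(name, "size:", size,
          "estimate:", estimate_rows(size, 100),
          "actual:", count_by_reading(rows))
```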

-ryan


On Thu, Jul 29, 2010 at 3:48 PM, Ted Yu <yu...@gmail.com> wrote:
> We use HBase 0.20.5
>
> Here is the snippet from RowCounter output:
>
> 10/07/29 22:38:42 DEBUG client.HTable$ClientScanner: Finished with scanning
> at REGION => {NAME =>
> '2__HB_NOINC_ORCL_SQLLDR_0728-THREEGPPSPEECHCALLS-1280408509541-0,DFF46493EB352D0E31CBFA4652E6EC06,1280412540858',
> STARTKEY => 'DFF46493EB352D0E31CBFA4652E6EC06', ENDKEY => '', ENCODED =>
> 1375318608, TABLE => {{NAME =>
> '2__HB_NOINC_ORCL_SQLLDR_0728-THREEGPPSPEECHCALLS-1280408509541-0', FAMILIES
> => [{NAME => 'd', COMPRESSION => 'GZ', VERSIONS => '2', TTL => '31536000',
> BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'}, {NAME =>
> 'i', COMPRESSION => 'GZ', VERSIONS => '2', TTL => '31536000', BLOCKSIZE =>
> '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'}, {NAME => 'v',
> COMPRESSION => 'GZ', VERSIONS => '2', TTL => '31536000', BLOCKSIZE =>
> '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'}]}}
> 10/07/29 22:38:42 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000000_0
> is done. And is in the process of commiting
> 10/07/29 22:38:42 INFO mapred.LocalJobRunner:
> 10/07/29 22:38:42 INFO mapred.TaskRunner: Task attempt_local_0001_m_000000_0
> is allowed to commit now
> 10/07/29 22:38:42 INFO mapred.FileOutputCommitter: Saved output of task
> 'attempt_local_0001_m_000000_0' to
> file:/usr/local/hadoop/trunk.80-275066/hbase-0.20.5/rc
> 10/07/29 22:38:42 INFO mapred.LocalJobRunner:
> 10/07/29 22:38:42 INFO mapred.TaskRunner: Task
> 'attempt_local_0001_m_000000_0' done.
> 10/07/29 22:38:43 INFO mapred.JobClient:  map 100% reduce 0%
> 10/07/29 22:38:43 INFO mapred.JobClient: Job complete: job_local_0001
> 10/07/29 22:38:43 INFO mapred.JobClient: Counters: 6
> 10/07/29 22:38:43 INFO mapred.JobClient:   FileSystemCounters
> 10/07/29 22:38:43 INFO mapred.JobClient:     FILE_BYTES_READ=1592883
> 10/07/29 22:38:43 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=1624956
> 10/07/29 22:38:43 INFO mapred.JobClient:   Map-Reduce Framework
> 10/07/29 22:38:43 INFO mapred.JobClient:     Map input records=0
> 10/07/29 22:38:43 INFO mapred.JobClient:     Spilled Records=0
> 10/07/29 22:38:43 INFO mapred.JobClient:     Map input bytes=0
> 10/07/29 22:38:43 INFO mapred.JobClient:     Map output records=0
>
> [sjc1-hadoop8.sjc1:hadoop 3705]ls -l
> /usr/local/hadoop/trunk.80-275066/hbase-0.20.5/rc/part-00000
> -rwxrwxrwx 1 hadoop users 0 Jul 29 22:38
> /usr/local/hadoop/trunk.80-275066/hbase-0.20.5/rc/part-00000
>
> But there are many records in the table I was querying.
>
> Can someone comment ?
>
> On Thu, Jul 29, 2010 at 2:26 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:
>
>> In 0.89 you can specify CACHE for the count command. Set it higher (it
>> defaults to 10 rows per call).
>>
>> Also you can use the RowCounter MR job.
>>
>> J-D
>>
>> On Thu, Jul 29, 2010 at 2:22 PM, Ted Yu <yu...@gmail.com> wrote:
>> > Hi,
>> > The count method in HBase shell is quite slow.
>> > Is there a way to obtain count faster ?
>> >
>> > Thanks
>> >
>>
>

Re: count of rows in table

Posted by Ted Yu <yu...@gmail.com>.
We use HBase 0.20.5

Here is the snippet from RowCounter output:

10/07/29 22:38:42 DEBUG client.HTable$ClientScanner: Finished with scanning
at REGION => {NAME =>
'2__HB_NOINC_ORCL_SQLLDR_0728-THREEGPPSPEECHCALLS-1280408509541-0,DFF46493EB352D0E31CBFA4652E6EC06,1280412540858',
STARTKEY => 'DFF46493EB352D0E31CBFA4652E6EC06', ENDKEY => '', ENCODED =>
1375318608, TABLE => {{NAME =>
'2__HB_NOINC_ORCL_SQLLDR_0728-THREEGPPSPEECHCALLS-1280408509541-0', FAMILIES
=> [{NAME => 'd', COMPRESSION => 'GZ', VERSIONS => '2', TTL => '31536000',
BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'}, {NAME =>
'i', COMPRESSION => 'GZ', VERSIONS => '2', TTL => '31536000', BLOCKSIZE =>
'65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'}, {NAME => 'v',
COMPRESSION => 'GZ', VERSIONS => '2', TTL => '31536000', BLOCKSIZE =>
'65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'}]}}
10/07/29 22:38:42 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000000_0
is done. And is in the process of commiting
10/07/29 22:38:42 INFO mapred.LocalJobRunner:
10/07/29 22:38:42 INFO mapred.TaskRunner: Task attempt_local_0001_m_000000_0
is allowed to commit now
10/07/29 22:38:42 INFO mapred.FileOutputCommitter: Saved output of task
'attempt_local_0001_m_000000_0' to
file:/usr/local/hadoop/trunk.80-275066/hbase-0.20.5/rc
10/07/29 22:38:42 INFO mapred.LocalJobRunner:
10/07/29 22:38:42 INFO mapred.TaskRunner: Task
'attempt_local_0001_m_000000_0' done.
10/07/29 22:38:43 INFO mapred.JobClient:  map 100% reduce 0%
10/07/29 22:38:43 INFO mapred.JobClient: Job complete: job_local_0001
10/07/29 22:38:43 INFO mapred.JobClient: Counters: 6
10/07/29 22:38:43 INFO mapred.JobClient:   FileSystemCounters
10/07/29 22:38:43 INFO mapred.JobClient:     FILE_BYTES_READ=1592883
10/07/29 22:38:43 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=1624956
10/07/29 22:38:43 INFO mapred.JobClient:   Map-Reduce Framework
10/07/29 22:38:43 INFO mapred.JobClient:     Map input records=0
10/07/29 22:38:43 INFO mapred.JobClient:     Spilled Records=0
10/07/29 22:38:43 INFO mapred.JobClient:     Map input bytes=0
10/07/29 22:38:43 INFO mapred.JobClient:     Map output records=0

[sjc1-hadoop8.sjc1:hadoop 3705]ls -l
/usr/local/hadoop/trunk.80-275066/hbase-0.20.5/rc/part-00000
-rwxrwxrwx 1 hadoop users 0 Jul 29 22:38
/usr/local/hadoop/trunk.80-275066/hbase-0.20.5/rc/part-00000

But there are many records in the table I was querying.

Can someone comment ?

On Thu, Jul 29, 2010 at 2:26 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:

> In 0.89 you can specify CACHE for the count command. Set it higher (it
> defaults to 10 rows per call).
>
> Also you can use the RowCounter MR job.
>
> J-D
>
> On Thu, Jul 29, 2010 at 2:22 PM, Ted Yu <yu...@gmail.com> wrote:
> > Hi,
> > The count method in HBase shell is quite slow.
> > Is there a way to obtain count faster ?
> >
> > Thanks
> >
>

Re: count of rows in table

Posted by Jean-Daniel Cryans <jd...@apache.org>.
The question was "Is there a way to obtain count faster ?" and there
are two, which I gave.

The existence of a row in HBase is nothing like in your typical RDBMS.

J-D

On Thu, Jul 29, 2010 at 3:00 PM, Vladimir Rodionov
<vr...@carrieriq.com> wrote:
> I think topic starter prefers something like:
> "select count(*) from table" and not to launch M/R job
> for this purpose.
>
> Best regards,
> Vladimir Rodionov
> Principal Platform Engineer
> Carrier IQ, www.carrieriq.com
>
> ________________________________________
> From: jdcryans@gmail.com [jdcryans@gmail.com] On Behalf Of Jean-Daniel Cryans [jdcryans@apache.org]
> Sent: Thursday, July 29, 2010 2:26 PM
> To: dev@hbase.apache.org
> Subject: Re: count of rows in table
>
> In 0.89 you can specify CACHE for the count command. Set it higher (it
> defaults to 10 rows per call).
>
> Also you can use the RowCounter MR job.
>
> J-D
>
> On Thu, Jul 29, 2010 at 2:22 PM, Ted Yu <yu...@gmail.com> wrote:
>> Hi,
>> The count method in HBase shell is quite slow.
>> Is there a way to obtain count faster ?
>>
>> Thanks
>>
>

RE: count of rows in table

Posted by Vladimir Rodionov <vr...@carrieriq.com>.
I think the topic starter would prefer something like
"select count(*) from table" rather than launching an M/R job
for this purpose.

Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com

________________________________________
From: jdcryans@gmail.com [jdcryans@gmail.com] On Behalf Of Jean-Daniel Cryans [jdcryans@apache.org]
Sent: Thursday, July 29, 2010 2:26 PM
To: dev@hbase.apache.org
Subject: Re: count of rows in table

In 0.89 you can specify CACHE for the count command. Set it higher (it
defaults to 10 rows per call).

Also you can use the RowCounter MR job.

J-D

On Thu, Jul 29, 2010 at 2:22 PM, Ted Yu <yu...@gmail.com> wrote:
> Hi,
> The count method in HBase shell is quite slow.
> Is there a way to obtain count faster ?
>
> Thanks
>

Re: count of rows in table

Posted by Jean-Daniel Cryans <jd...@apache.org>.
In 0.89 you can specify CACHE for the count command. Set it higher (it
defaults to 10 rows per call).
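
A rough back-of-the-envelope sketch of why a higher CACHE helps, written in
plain Python. It assumes a simplified model in which each scanner call
returns at most CACHE rows and nothing else costs anything; the table size
and the function name are hypothetical.

```python
def round_trips(total_rows, rows_per_call):
    """Scanner calls needed to fetch total_rows, rows_per_call at a time."""
    if rows_per_call <= 0:
        raise ValueError("rows_per_call must be positive")
    # Ceiling division without floating point.
    return -(-total_rows // rows_per_call)


TOTAL_ROWS = 1_000_000  # hypothetical table size

for cache in (10, 100, 1000):
    print(f"CACHE={cache:>5}: {round_trips(TOTAL_ROWS, cache):>7} calls")
```

Under this model, raising CACHE from 10 to 1000 cuts the number of calls by
a factor of 100, at the cost of holding more rows in memory per call.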

Also you can use the RowCounter MR job.

J-D

On Thu, Jul 29, 2010 at 2:22 PM, Ted Yu <yu...@gmail.com> wrote:
> Hi,
> The count method in HBase shell is quite slow.
> Is there a way to obtain count faster ?
>
> Thanks
>