Posted to dev@hbase.apache.org by Ted Yu <yu...@gmail.com> on 2010/08/04 22:34:29 UTC

Re: count of rows in table

I'd like to poll for ideas on how to aggregate row counts across several
tables.
I can run RowCounter for each table, but how can I easily produce the sum
of all the counts?
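
One possible approach, offered only as a rough sketch: run RowCounter once
per table, save each job's console log, then pull the count out of each log
and add them up. The sketch assumes the job reports its count as a counter
line ending in "ROWS=<n>"; that counter name, the helper names and the
sample table names are assumptions for illustration, so check them against
the output of your own version. A log with no such line (the empty-table run
quoted below printed none) is listed separately rather than silently
treated as zero.

```python
import re

# Assumption: each job log reports the row count as a counter line
# ending in "ROWS=<n>". Adjust the pattern if your version differs.
ROWS_PATTERN = re.compile(r"\bROWS=(\d+)")


def rows_from_log(log_text):
    """Return the last ROWS counter value in one job's log, or None."""
    matches = ROWS_PATTERN.findall(log_text)
    return int(matches[-1]) if matches else None


def total_rows(logs_by_table):
    """Sum per-table counts.

    Returns (total, missing), where missing lists the tables whose log
    contained no ROWS counter line.
    """
    total = 0
    missing = []
    for table, log_text in logs_by_table.items():
        count = rows_from_log(log_text)
        if count is None:
            missing.append(table)
        else:
            total += count
    return total, missing


if __name__ == "__main__":
    # Hypothetical table names and log excerpts, for illustration only.
    logs = {
        "table_a": "INFO mapred.JobClient:     ROWS=120\n",
        "table_b": "INFO mapred.JobClient:     ROWS=30\n",
        "table_c": "INFO mapred.JobClient:     Map input records=0\n",
    }
    total, missing = total_rows(logs)
    print("total rows:", total)
    print("tables without a ROWS counter:", missing)
```

Keeping the tables with no counter in a separate list makes it clear when a
sum is incomplete or includes a table that scanned zero rows.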

Thanks

On Thu, Jul 29, 2010 at 9:07 PM, Ted Yu <yu...@gmail.com> wrote:

> I think OR is more reasonable.
>
>
> On Thu, Jul 29, 2010 at 8:54 PM, Angus He <an...@gmail.com> wrote:
>
>> By the way, if users specify multiple columns, the current
>> implementation of RowCounter appears to combine them with a logical OR.
>>
>> Would AND be more reasonable?
>>
>>
>>
>> On Fri, Jul 30, 2010 at 11:13 AM, Ryan Rawson <ry...@gmail.com> wrote:
>> > The RowCounter job counts rows. Its answer is approximately how many
>> > distinct row keys were in the table for a given time range.
>> >
>> > Even if the implementation uses a first-KV filter, nothing I just said
>> > is false.
>> >
>> > A KeyValue counter would tell you how many cells and versions there were
>> > in total, don't you think?
>> >
>> > On Jul 29, 2010 7:58 PM, "Angus He" <an...@gmail.com> wrote:
>> >> Column names are optional for the RowCounter job.
>> >>
>> >> To be more accurate, RowCounter is a KeyValueCounter.
>> >> If no columns are specified, only the first KeyValue of each row is
>> >> included, and that is what produces the row count.
>> >>
>> >>
>> >> On Fri, Jul 30, 2010 at 9:28 AM, Ted Yu <yu...@gmail.com> wrote:
>> >>> If someone can share the command line for running RowCounter, that
>> >>> would be great.
>> >>>
>> >>> Also, the hbase shell count command doesn't require a column name. Why
>> >>> does RowCounter require one?
>> >>>
>> >>> Thanks
>> >>>
>> >>> On Thu, Jul 29, 2010 at 4:55 PM, Ryan Rawson <ry...@gmail.com> wrote:
>> >>>
>> >>>> Hi,
>> >>>>
>> >>>> That table appears to be empty, e.g.:
>> >>>>
>> >>>> 10/07/29 22:38:43 INFO mapred.JobClient:     Map input records=0
>> >>>>
>> >>>>
>> >>>> So back to the count issue... Counting in databases is a classic
>> >>>> problem. Unless your DB system is keeping stats on how many
>> >>>> inserts/deletes and thus how big it thinks the table is, you have to
>> >>>> count all the rows by reading them.  HBase is no different, and a
>> >>>> little harder, because we have a variable length data format, so we
>> >>>> can't just estimate row sizes from file sizes.  Keeping distributed
>> >>>> stats is not impossible, but certainly not on any priority list to be
>> >>>> implemented - of course JIRAs/patches welcome etc.
>> >>>>
>> >>>> -ryan
>> >>>>
>> >>>>
>> >>>> On Thu, Jul 29, 2010 at 3:48 PM, Ted Yu <yu...@gmail.com> wrote:
>> >>>> > We use HBase 0.20.5
>> >>>> >
>> >>>> > Here is the snippet from RowCounter output:
>> >>>> >
>> >>>> > 10/07/29 22:38:42 DEBUG client.HTable$ClientScanner: Finished with scanning at REGION => {NAME => '2__HB_NOINC_ORCL_SQLLDR_0728-THREEGPPSPEECHCALLS-1280408509541-0,DFF46493EB352D0E31CBFA4652E6EC06,1280412540858',
>> >>>> > STARTKEY => 'DFF46493EB352D0E31CBFA4652E6EC06', ENDKEY => '', ENCODED => 1375318608,
>> >>>> > TABLE => {{NAME => '2__HB_NOINC_ORCL_SQLLDR_0728-THREEGPPSPEECHCALLS-1280408509541-0',
>> >>>> > FAMILIES => [{NAME => 'd', COMPRESSION => 'GZ', VERSIONS => '2', TTL => '31536000', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'},
>> >>>> > {NAME => 'i', COMPRESSION => 'GZ', VERSIONS => '2', TTL => '31536000', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'},
>> >>>> > {NAME => 'v', COMPRESSION => 'GZ', VERSIONS => '2', TTL => '31536000', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'false'}]}}
>> >>>> > 10/07/29 22:38:42 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
>> >>>> > 10/07/29 22:38:42 INFO mapred.LocalJobRunner:
>> >>>> > 10/07/29 22:38:42 INFO mapred.TaskRunner: Task attempt_local_0001_m_000000_0 is allowed to commit now
>> >>>> > 10/07/29 22:38:42 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local_0001_m_000000_0' to file:/usr/local/hadoop/trunk.80-275066/hbase-0.20.5/rc
>> >>>> > 10/07/29 22:38:42 INFO mapred.LocalJobRunner:
>> >>>> > 10/07/29 22:38:42 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done.
>> >>>> > 10/07/29 22:38:43 INFO mapred.JobClient:  map 100% reduce 0%
>> >>>> > 10/07/29 22:38:43 INFO mapred.JobClient: Job complete: job_local_0001
>> >>>> > 10/07/29 22:38:43 INFO mapred.JobClient: Counters: 6
>> >>>> > 10/07/29 22:38:43 INFO mapred.JobClient:   FileSystemCounters
>> >>>> > 10/07/29 22:38:43 INFO mapred.JobClient:     FILE_BYTES_READ=1592883
>> >>>> > 10/07/29 22:38:43 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=1624956
>> >>>> > 10/07/29 22:38:43 INFO mapred.JobClient:   Map-Reduce Framework
>> >>>> > 10/07/29 22:38:43 INFO mapred.JobClient:     Map input records=0
>> >>>> > 10/07/29 22:38:43 INFO mapred.JobClient:     Spilled Records=0
>> >>>> > 10/07/29 22:38:43 INFO mapred.JobClient:     Map input bytes=0
>> >>>> > 10/07/29 22:38:43 INFO mapred.JobClient:     Map output records=0
>> >>>> >
>> >>>> > [sjc1-hadoop8.sjc1:hadoop 3705]ls -l /usr/local/hadoop/trunk.80-275066/hbase-0.20.5/rc/part-00000
>> >>>> > -rwxrwxrwx 1 hadoop users 0 Jul 29 22:38 /usr/local/hadoop/trunk.80-275066/hbase-0.20.5/rc/part-00000
>> >>>> >
>> >>>> > But there are many records in the table I was querying.
>> >>>> >
>> >>>> > Can someone comment?
>> >>>> >
>> >>>> > On Thu, Jul 29, 2010 at 2:26 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>> >>>> >
>> >>>> >> In 0.89 you can specify CACHE for the count command. Set it higher
>> >>>> >> (it defaults to 10 rows per call).
>> >>>> >>
>> >>>> >> Also you can use the RowCounter MR job.
>> >>>> >>
>> >>>> >> J-D
>> >>>> >>
>> >>>> >> On Thu, Jul 29, 2010 at 2:22 PM, Ted Yu <yu...@gmail.com> wrote:
>> >>>> >> > Hi,
>> >>>> >> > The count command in the HBase shell is quite slow.
>> >>>> >> > Is there a way to obtain the count faster?
>> >>>> >> >
>> >>>> >> > Thanks
>> >>>> >> >
>> >>>> >>
>> >>>> >
>> >>>>
>> >>>
>> >>
>> >>
>> >>
>> >> --
>> >> Regards
>> >> Angus
>> >
>>
>>
>>
>> --
>> Regards
>> Angus
>>
>
>