You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@phoenix.apache.org by Jins George <ji...@aeris.net> on 2017/12/19 22:57:37 UTC

Efficient way to get the row count of a table

Hi,

Is there a way to get the total row count of a phoenix table without 
running select count(*) from table ?
my use case is to monitor the record count in a table every x minutes, 
so didn't want to put load on the system by running a select count(*) 
query.

Thanks,
Jins George

Re: Efficient way to get the row count of a table

Posted by Jins George <Ji...@aeris.net>.
Thank You James. That's great news that 4.13 for CDH is coming. I will try the same on the new version once its released.


Thanks,

Jins George

________________________________
From: James Taylor <ja...@apache.org>
Sent: Tuesday, December 19, 2017 5:38:38 PM
To: user
Subject: Re: Efficient way to get the row count of a table

The count would change when a major compaction is done. Back in 4.7, it may have changed when a split occurs too (but this is no longer the case). I'd recommend moving to a newer version: 4.7 was release almost two years ago and is six releases back from the current 4.13 release. FWIW, we're getting ready to release a 4.13 for CDH.

On Tue, Dec 19, 2017 at 4:27 PM, Jins George <ji...@aeris.net>> wrote:
Thank you James and Mujtaba  for the responses.

I am OK with an estimate count . So using SYSTEM.STATS table looks good in my case. But when I tried this, it gave me inconsistent results.  for example.

Time 1:
select count(*) from myschema.mytable  => 3474085
select sum(guide_posts_row_count) from system.stats where physical_name = 'myschema.device_data'  => 3348090

Time 2 : ( Time1 + ~10 mins)
select count(*) from myschema.mytable  => 3474106
select sum(guide_posts_row_count) from system.stats where physical_name = 'myschema.device_data'  => 3348080

So I was expecting the stats count to go up but surprisingly, the count went down.  Is there a specific configuration or something else that I am missing?

I am using phoenix 4.7( on CDH), So cannot try Table sampling feature.

Thanks,
Jins George



On 12/19/2017 03:43 PM, Mujtaba Chohan wrote:
Another alternate outside Phoenix is to use <http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html M/R.

On Tue, Dec 19, 2017 at 3:18 PM, James Taylor <ja...@apache.org>> wrote:
If it needs to be 100% accurate, then count(*) is the only way. If your data is write-once data, you might be able to track the row count at the application level through some kind of atomic counter in a different table (but this will likely be brittle). If you can live with an estimate, you could enable statistics [1], optionally configuring Phoenix not to use stats for parallelization [2], and query the SYSTEM.STATS table to get an estimate [3].

Another interesting alternative if you want the approximate row count when you have a where clause would be to use the new table sampling feature [4]. You'd also want stats enabled for this to be more accurate too.

Thanks,
James


[1] https://phoenix.apache.org/update_statistics.html
[2] phoenix.use.stats.parallelization=false
[3] select sum(GUIDE_POSTS_ROW_COUNT) from SYSTEM.STATS where physical_name='my_schema.my_table'
     and COLUMN_FAMILY='my_first_column_family' -- necessary only if you have multiple column families
[4] https://phoenix.apache.org/tablesample.html

On Tue, Dec 19, 2017 at 2:57 PM, Jins George <<m...@aeris.net>> wrote:
Hi,

Is there a way to get the total row count of a phoenix table without running select count(*) from table ?
my use case is to monitor the record count in a table every x minutes, so didn't want to put load on the system by running a select count(*) query.

Thanks,
Jins George





Re: Efficient way to get the row count of a table

Posted by James Taylor <ja...@apache.org>.
The count would change when a major compaction is done. Back in 4.7, it may
have changed when a split occurs too (but this is no longer the case). I'd
recommend moving to a newer version: 4.7 was release almost two years ago
and is six releases back from the current 4.13 release. FWIW, we're getting
ready to release a 4.13 for CDH.

On Tue, Dec 19, 2017 at 4:27 PM, Jins George <ji...@aeris.net> wrote:

> Thank you James and Mujtaba  for the responses.
>
> I am OK with an estimate count . So using SYSTEM.STATS table looks good in
> my case. But when I tried this, it gave me inconsistent results.  for
> example.
>
> Time 1:
> select count(*) from myschema.mytable  => 3474085
> select sum(guide_posts_row_count) from system.stats where physical_name =
> 'myschema.device_data'  => 3348090
>
> Time 2 : ( Time1 + ~10 mins)
> select count(*) from myschema.mytable  => 3474106
> select sum(guide_posts_row_count) from system.stats where physical_name =
> 'myschema.device_data'  => 3348080
>
> So I was expecting the stats count to go up but surprisingly, the count
> went down.  Is there a specific configuration or something else that I am
> missing?
>
> I am using phoenix 4.7( on CDH), So cannot try Table sampling feature.
>
> Thanks,
> Jins George
>
>
>
> On 12/19/2017 03:43 PM, Mujtaba Chohan wrote:
>
> Another alternate outside Phoenix is to use
> <http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html>
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/
> mapreduce/RowCounter.html M/R.
>
> On Tue, Dec 19, 2017 at 3:18 PM, James Taylor <ja...@apache.org>
> wrote:
>
>> If it needs to be 100% accurate, then count(*) is the only way. If your
>> data is write-once data, you might be able to track the row count at the
>> application level through some kind of atomic counter in a different table
>> (but this will likely be brittle). If you can live with an estimate, you
>> could enable statistics [1], optionally configuring Phoenix not to use
>> stats for parallelization [2], and query the SYSTEM.STATS table to get an
>> estimate [3].
>>
>> Another interesting alternative if you want the approximate row count
>> when you have a where clause would be to use the new table sampling feature
>> [4]. You'd also want stats enabled for this to be more accurate too.
>>
>> Thanks,
>> James
>>
>>
>> [1] https://phoenix.apache.org/update_statistics.html
>> [2] phoenix.use.stats.parallelization=false
>> [3] select sum(GUIDE_POSTS_ROW_COUNT) from SYSTEM.STATS where
>> physical_name='my_schema.my_table'
>>      and COLUMN_FAMILY='my_first_column_family' -- necessary only if you
>> have multiple column families
>> [4] https://phoenix.apache.org/tablesample.html
>>
>> On Tue, Dec 19, 2017 at 2:57 PM, Jins George < <ji...@aeris.net>
>> jins.george@aeris.net> wrote:
>>
>>> Hi,
>>>
>>> Is there a way to get the total row count of a phoenix table without
>>> running select count(*) from table ?
>>> my use case is to monitor the record count in a table every x minutes,
>>> so didn't want to put load on the system by running a select count(*) query.
>>>
>>> Thanks,
>>> Jins George
>>>
>>
>>
>
>

Re: Efficient way to get the row count of a table

Posted by Jins George <ji...@aeris.net>.
Thank you James and Mujtaba  for the responses.

I am OK with an estimate count . So using SYSTEM.STATS table looks good 
in my case. But when I tried this, it gave me inconsistent results.  for 
example.

Time 1:
select count(*) from myschema.mytable  => 3474085
select sum(guide_posts_row_count) from system.stats where physical_name 
= 'myschema.device_data'  => 3348090

Time 2 : ( Time1 + ~10 mins)
select count(*) from myschema.mytable  => 3474106
select sum(guide_posts_row_count) from system.stats where physical_name 
= 'myschema.device_data'  => 3348080

So I was expecting the stats count to go up but surprisingly, the count 
went down.  Is there a specific configuration or something else that I 
am missing?

I am using phoenix 4.7( on CDH), So cannot try Table sampling feature.

Thanks,
Jins George


On 12/19/2017 03:43 PM, Mujtaba Chohan wrote:
> Another alternate outside Phoenix is to use 
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html 
> M/R.
>
> On Tue, Dec 19, 2017 at 3:18 PM, James Taylor <jamestaylor@apache.org 
> <ma...@apache.org>> wrote:
>
>     If it needs to be 100% accurate, then count(*) is the only way. If
>     your data is write-once data, you might be able to track the row
>     count at the application level through some kind of atomic counter
>     in a different table (but this will likely be brittle). If you can
>     live with an estimate, you could enable statistics [1], optionally
>     configuring Phoenix not to use stats for parallelization [2], and
>     query the SYSTEM.STATS table to get an estimate [3].
>
>     Another interesting alternative if you want the approximate row
>     count when you have a where clause would be to use the new table
>     sampling feature [4]. You'd also want stats enabled for this to be
>     more accurate too.
>
>     Thanks,
>     James
>
>
>     [1] https://phoenix.apache.org/update_statistics.html
>     <https://phoenix.apache.org/update_statistics.html>
>     [2] phoenix.use.stats.parallelization=false
>     [3] select sum(GUIDE_POSTS_ROW_COUNT) from SYSTEM.STATS where
>     physical_name='my_schema.my_table'
>          and COLUMN_FAMILY='my_first_column_family' -- necessary only
>     if you have multiple column families
>     [4] https://phoenix.apache.org/tablesample.html
>     <https://phoenix.apache.org/tablesample.html>
>
>     On Tue, Dec 19, 2017 at 2:57 PM, Jins George
>     <jins.george@aeris.net <ma...@aeris.net>> wrote:
>
>         Hi,
>
>         Is there a way to get the total row count of a phoenix table
>         without running select count(*) from table ?
>         my use case is to monitor the record count in a table every x
>         minutes, so didn't want to put load on the system by running a
>         select count(*) query.
>
>         Thanks,
>         Jins George
>
>
>


Re: Efficient way to get the row count of a table

Posted by Mujtaba Chohan <mu...@apache.org>.
Another alternate outside Phoenix is to use
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html
M/R.

On Tue, Dec 19, 2017 at 3:18 PM, James Taylor <ja...@apache.org>
wrote:

> If it needs to be 100% accurate, then count(*) is the only way. If your
> data is write-once data, you might be able to track the row count at the
> application level through some kind of atomic counter in a different table
> (but this will likely be brittle). If you can live with an estimate, you
> could enable statistics [1], optionally configuring Phoenix not to use
> stats for parallelization [2], and query the SYSTEM.STATS table to get an
> estimate [3].
>
> Another interesting alternative if you want the approximate row count when
> you have a where clause would be to use the new table sampling feature [4].
> You'd also want stats enabled for this to be more accurate too.
>
> Thanks,
> James
>
>
> [1] https://phoenix.apache.org/update_statistics.html
> [2] phoenix.use.stats.parallelization=false
> [3] select sum(GUIDE_POSTS_ROW_COUNT) from SYSTEM.STATS where
> physical_name='my_schema.my_table'
>      and COLUMN_FAMILY='my_first_column_family' -- necessary only if you
> have multiple column families
> [4] https://phoenix.apache.org/tablesample.html
>
> On Tue, Dec 19, 2017 at 2:57 PM, Jins George <ji...@aeris.net>
> wrote:
>
>> Hi,
>>
>> Is there a way to get the total row count of a phoenix table without
>> running select count(*) from table ?
>> my use case is to monitor the record count in a table every x minutes, so
>> didn't want to put load on the system by running a select count(*) query.
>>
>> Thanks,
>> Jins George
>>
>
>

Re: Efficient way to get the row count of a table

Posted by James Taylor <ja...@apache.org>.
If it needs to be 100% accurate, then count(*) is the only way. If your
data is write-once data, you might be able to track the row count at the
application level through some kind of atomic counter in a different table
(but this will likely be brittle). If you can live with an estimate, you
could enable statistics [1], optionally configuring Phoenix not to use
stats for parallelization [2], and query the SYSTEM.STATS table to get an
estimate [3].

Another interesting alternative if you want the approximate row count when
you have a where clause would be to use the new table sampling feature [4].
You'd also want stats enabled for this to be more accurate too.

Thanks,
James


[1] https://phoenix.apache.org/update_statistics.html
[2] phoenix.use.stats.parallelization=false
[3] select sum(GUIDE_POSTS_ROW_COUNT) from SYSTEM.STATS where
physical_name='my_schema.my_table'
     and COLUMN_FAMILY='my_first_column_family' -- necessary only if you
have multiple column families
[4] https://phoenix.apache.org/tablesample.html

On Tue, Dec 19, 2017 at 2:57 PM, Jins George <ji...@aeris.net> wrote:

> Hi,
>
> Is there a way to get the total row count of a phoenix table without
> running select count(*) from table ?
> my use case is to monitor the record count in a table every x minutes, so
> didn't want to put load on the system by running a select count(*) query.
>
> Thanks,
> Jins George
>