You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@phoenix.apache.org by Jins George <ji...@aeris.net> on 2017/12/19 22:57:37 UTC
Efficient way to get the row count of a table
Hi,
Is there a way to get the total row count of a phoenix table without
running select count(*) from table ?
my use case is to monitor the record count in a table every x minutes,
so didn't want to put load on the system by running a select count(*)
query.
Thanks,
Jins George
Re: Efficient way to get the row count of a table
Posted by Jins George <Ji...@aeris.net>.
Thank You James. That's great news that 4.13 for CDH is coming. I will try the same on the new version once its released.
Thanks,
Jins George
________________________________
From: James Taylor <ja...@apache.org>
Sent: Tuesday, December 19, 2017 5:38:38 PM
To: user
Subject: Re: Efficient way to get the row count of a table
The count would change when a major compaction is done. Back in 4.7, it may have changed when a split occurs too (but this is no longer the case). I'd recommend moving to a newer version: 4.7 was release almost two years ago and is six releases back from the current 4.13 release. FWIW, we're getting ready to release a 4.13 for CDH.
On Tue, Dec 19, 2017 at 4:27 PM, Jins George <ji...@aeris.net>> wrote:
Thank you James and Mujtaba for the responses.
I am OK with an estimate count . So using SYSTEM.STATS table looks good in my case. But when I tried this, it gave me inconsistent results. for example.
Time 1:
select count(*) from myschema.mytable => 3474085
select sum(guide_posts_row_count) from system.stats where physical_name = 'myschema.device_data' => 3348090
Time 2 : ( Time1 + ~10 mins)
select count(*) from myschema.mytable => 3474106
select sum(guide_posts_row_count) from system.stats where physical_name = 'myschema.device_data' => 3348080
So I was expecting the stats count to go up but surprisingly, the count went down. Is there a specific configuration or something else that I am missing?
I am using phoenix 4.7( on CDH), So cannot try Table sampling feature.
Thanks,
Jins George
On 12/19/2017 03:43 PM, Mujtaba Chohan wrote:
Another alternate outside Phoenix is to use <http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html M/R.
On Tue, Dec 19, 2017 at 3:18 PM, James Taylor <ja...@apache.org>> wrote:
If it needs to be 100% accurate, then count(*) is the only way. If your data is write-once data, you might be able to track the row count at the application level through some kind of atomic counter in a different table (but this will likely be brittle). If you can live with an estimate, you could enable statistics [1], optionally configuring Phoenix not to use stats for parallelization [2], and query the SYSTEM.STATS table to get an estimate [3].
Another interesting alternative if you want the approximate row count when you have a where clause would be to use the new table sampling feature [4]. You'd also want stats enabled for this to be more accurate too.
Thanks,
James
[1] https://phoenix.apache.org/update_statistics.html
[2] phoenix.use.stats.parallelization=false
[3] select sum(GUIDE_POSTS_ROW_COUNT) from SYSTEM.STATS where physical_name='my_schema.my_table'
and COLUMN_FAMILY='my_first_column_family' -- necessary only if you have multiple column families
[4] https://phoenix.apache.org/tablesample.html
On Tue, Dec 19, 2017 at 2:57 PM, Jins George <<m...@aeris.net>> wrote:
Hi,
Is there a way to get the total row count of a phoenix table without running select count(*) from table ?
my use case is to monitor the record count in a table every x minutes, so didn't want to put load on the system by running a select count(*) query.
Thanks,
Jins George
Re: Efficient way to get the row count of a table
Posted by James Taylor <ja...@apache.org>.
The count would change when a major compaction is done. Back in 4.7, it may
have changed when a split occurs too (but this is no longer the case). I'd
recommend moving to a newer version: 4.7 was release almost two years ago
and is six releases back from the current 4.13 release. FWIW, we're getting
ready to release a 4.13 for CDH.
On Tue, Dec 19, 2017 at 4:27 PM, Jins George <ji...@aeris.net> wrote:
> Thank you James and Mujtaba for the responses.
>
> I am OK with an estimate count . So using SYSTEM.STATS table looks good in
> my case. But when I tried this, it gave me inconsistent results. for
> example.
>
> Time 1:
> select count(*) from myschema.mytable => 3474085
> select sum(guide_posts_row_count) from system.stats where physical_name =
> 'myschema.device_data' => 3348090
>
> Time 2 : ( Time1 + ~10 mins)
> select count(*) from myschema.mytable => 3474106
> select sum(guide_posts_row_count) from system.stats where physical_name =
> 'myschema.device_data' => 3348080
>
> So I was expecting the stats count to go up but surprisingly, the count
> went down. Is there a specific configuration or something else that I am
> missing?
>
> I am using phoenix 4.7( on CDH), So cannot try Table sampling feature.
>
> Thanks,
> Jins George
>
>
>
> On 12/19/2017 03:43 PM, Mujtaba Chohan wrote:
>
> Another alternate outside Phoenix is to use
> <http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html>
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/
> mapreduce/RowCounter.html M/R.
>
> On Tue, Dec 19, 2017 at 3:18 PM, James Taylor <ja...@apache.org>
> wrote:
>
>> If it needs to be 100% accurate, then count(*) is the only way. If your
>> data is write-once data, you might be able to track the row count at the
>> application level through some kind of atomic counter in a different table
>> (but this will likely be brittle). If you can live with an estimate, you
>> could enable statistics [1], optionally configuring Phoenix not to use
>> stats for parallelization [2], and query the SYSTEM.STATS table to get an
>> estimate [3].
>>
>> Another interesting alternative if you want the approximate row count
>> when you have a where clause would be to use the new table sampling feature
>> [4]. You'd also want stats enabled for this to be more accurate too.
>>
>> Thanks,
>> James
>>
>>
>> [1] https://phoenix.apache.org/update_statistics.html
>> [2] phoenix.use.stats.parallelization=false
>> [3] select sum(GUIDE_POSTS_ROW_COUNT) from SYSTEM.STATS where
>> physical_name='my_schema.my_table'
>> and COLUMN_FAMILY='my_first_column_family' -- necessary only if you
>> have multiple column families
>> [4] https://phoenix.apache.org/tablesample.html
>>
>> On Tue, Dec 19, 2017 at 2:57 PM, Jins George < <ji...@aeris.net>
>> jins.george@aeris.net> wrote:
>>
>>> Hi,
>>>
>>> Is there a way to get the total row count of a phoenix table without
>>> running select count(*) from table ?
>>> my use case is to monitor the record count in a table every x minutes,
>>> so didn't want to put load on the system by running a select count(*) query.
>>>
>>> Thanks,
>>> Jins George
>>>
>>
>>
>
>
Re: Efficient way to get the row count of a table
Posted by Jins George <ji...@aeris.net>.
Thank you James and Mujtaba for the responses.
I am OK with an estimate count . So using SYSTEM.STATS table looks good
in my case. But when I tried this, it gave me inconsistent results. for
example.
Time 1:
select count(*) from myschema.mytable => 3474085
select sum(guide_posts_row_count) from system.stats where physical_name
= 'myschema.device_data' => 3348090
Time 2 : ( Time1 + ~10 mins)
select count(*) from myschema.mytable => 3474106
select sum(guide_posts_row_count) from system.stats where physical_name
= 'myschema.device_data' => 3348080
So I was expecting the stats count to go up but surprisingly, the count
went down. Is there a specific configuration or something else that I
am missing?
I am using phoenix 4.7( on CDH), So cannot try Table sampling feature.
Thanks,
Jins George
On 12/19/2017 03:43 PM, Mujtaba Chohan wrote:
> Another alternate outside Phoenix is to use
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html
> M/R.
>
> On Tue, Dec 19, 2017 at 3:18 PM, James Taylor <jamestaylor@apache.org
> <ma...@apache.org>> wrote:
>
> If it needs to be 100% accurate, then count(*) is the only way. If
> your data is write-once data, you might be able to track the row
> count at the application level through some kind of atomic counter
> in a different table (but this will likely be brittle). If you can
> live with an estimate, you could enable statistics [1], optionally
> configuring Phoenix not to use stats for parallelization [2], and
> query the SYSTEM.STATS table to get an estimate [3].
>
> Another interesting alternative if you want the approximate row
> count when you have a where clause would be to use the new table
> sampling feature [4]. You'd also want stats enabled for this to be
> more accurate too.
>
> Thanks,
> James
>
>
> [1] https://phoenix.apache.org/update_statistics.html
> <https://phoenix.apache.org/update_statistics.html>
> [2] phoenix.use.stats.parallelization=false
> [3] select sum(GUIDE_POSTS_ROW_COUNT) from SYSTEM.STATS where
> physical_name='my_schema.my_table'
> and COLUMN_FAMILY='my_first_column_family' -- necessary only
> if you have multiple column families
> [4] https://phoenix.apache.org/tablesample.html
> <https://phoenix.apache.org/tablesample.html>
>
> On Tue, Dec 19, 2017 at 2:57 PM, Jins George
> <jins.george@aeris.net <ma...@aeris.net>> wrote:
>
> Hi,
>
> Is there a way to get the total row count of a phoenix table
> without running select count(*) from table ?
> my use case is to monitor the record count in a table every x
> minutes, so didn't want to put load on the system by running a
> select count(*) query.
>
> Thanks,
> Jins George
>
>
>
Re: Efficient way to get the row count of a table
Posted by Mujtaba Chohan <mu...@apache.org>.
Another alternate outside Phoenix is to use
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html
M/R.
On Tue, Dec 19, 2017 at 3:18 PM, James Taylor <ja...@apache.org>
wrote:
> If it needs to be 100% accurate, then count(*) is the only way. If your
> data is write-once data, you might be able to track the row count at the
> application level through some kind of atomic counter in a different table
> (but this will likely be brittle). If you can live with an estimate, you
> could enable statistics [1], optionally configuring Phoenix not to use
> stats for parallelization [2], and query the SYSTEM.STATS table to get an
> estimate [3].
>
> Another interesting alternative if you want the approximate row count when
> you have a where clause would be to use the new table sampling feature [4].
> You'd also want stats enabled for this to be more accurate too.
>
> Thanks,
> James
>
>
> [1] https://phoenix.apache.org/update_statistics.html
> [2] phoenix.use.stats.parallelization=false
> [3] select sum(GUIDE_POSTS_ROW_COUNT) from SYSTEM.STATS where
> physical_name='my_schema.my_table'
> and COLUMN_FAMILY='my_first_column_family' -- necessary only if you
> have multiple column families
> [4] https://phoenix.apache.org/tablesample.html
>
> On Tue, Dec 19, 2017 at 2:57 PM, Jins George <ji...@aeris.net>
> wrote:
>
>> Hi,
>>
>> Is there a way to get the total row count of a phoenix table without
>> running select count(*) from table ?
>> my use case is to monitor the record count in a table every x minutes, so
>> didn't want to put load on the system by running a select count(*) query.
>>
>> Thanks,
>> Jins George
>>
>
>
Re: Efficient way to get the row count of a table
Posted by James Taylor <ja...@apache.org>.
If it needs to be 100% accurate, then count(*) is the only way. If your
data is write-once data, you might be able to track the row count at the
application level through some kind of atomic counter in a different table
(but this will likely be brittle). If you can live with an estimate, you
could enable statistics [1], optionally configuring Phoenix not to use
stats for parallelization [2], and query the SYSTEM.STATS table to get an
estimate [3].
Another interesting alternative if you want the approximate row count when
you have a where clause would be to use the new table sampling feature [4].
You'd also want stats enabled for this to be more accurate too.
Thanks,
James
[1] https://phoenix.apache.org/update_statistics.html
[2] phoenix.use.stats.parallelization=false
[3] select sum(GUIDE_POSTS_ROW_COUNT) from SYSTEM.STATS where
physical_name='my_schema.my_table'
and COLUMN_FAMILY='my_first_column_family' -- necessary only if you
have multiple column families
[4] https://phoenix.apache.org/tablesample.html
On Tue, Dec 19, 2017 at 2:57 PM, Jins George <ji...@aeris.net> wrote:
> Hi,
>
> Is there a way to get the total row count of a phoenix table without
> running select count(*) from table ?
> my use case is to monitor the record count in a table every x minutes, so
> didn't want to put load on the system by running a select count(*) query.
>
> Thanks,
> Jins George
>