You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@phoenix.apache.org by Ken Hampson <ha...@gmail.com> on 2016/01/12 02:05:04 UTC

COUNT behavior

Hi, all:

I have noticed that when using an unqualified COUNT(*) call on a table
(whether Phoenix or view of an existing HBase table) without a qualifying
WHERE clause, it comes back very quickly and, usually, inaccurately. I
assume it is being cached somewhere as metadata, but I am wondering if
anything can be done to force it to become up to date.

Here is a scenario I've recently encountered:

I've imported ~50,000,000 rows into a Phoenix table, let's call it A, from
a CSV file via MapReduce. When I run this query:
SELECT COUNT(*) FROM A;

It yields (almost immediately) 0 rows.

However, I know there is data in there, which I can verify by running this
query:
SELECT * FROM A LIMIT 1;

which returns a row with data in it.

I am wondering if there is a mechanism by which I can obtain a truly
accurate count, and how COUNT(*) actually works, since it doesn't appear to
work as I would expect.

I tried running this:
UPDATE STATSTICS A;

but that did not seem to have an effect.

I have seen that if I change it to, say, COUNT(DISTINCT foo), or add a
WHERE clause, then it appears to do a full scan in some cases, but I have
also seen this result in a 0 count as well (I have not been able to discern
a pattern as to which happens when).

I would also be interested in being able to get a quicker, estimated count,
similar to what can be done in Postgres by looking at the live tuples that
are listed as being present in the table statistics (much quicker than
doing a COUNT and serves as a reasonable ballpark for row count).

I searched the mailing list archive and Googled around, and didn't come
across anything in terms of unexpected COUNT behavior. Has anyone else seen
this behavior? Any thoughts on how to get an accurate count?

Thanks,
 - Ken