Posted to dev@phoenix.apache.org by "Ethan Wang (JIRA)" <ji...@apache.org> on 2017/09/06 21:52:00 UTC

[jira] [Comment Edited] (PHOENIX-4164) APPROX_COUNT_DISTINCT becomes imprecise at 20m unique values.

    [ https://issues.apache.org/jira/browse/PHOENIX-4164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16156083#comment-16156083 ] 

Ethan Wang edited comment on PHOENIX-4164 at 9/6/17 9:51 PM:
-------------------------------------------------------------

P.S. For folks who want to play around with the precision configuration:

In this HLL implementation there are two configurable parameters:
NormalSetPrecision (p)
SparseSetPrecision (sp)

In short: the leading-zero counting space is (64 - p) bits, so the smaller p is, the more room there is for counting.
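
For example, here is a minimal sketch of setting both precisions directly through stream-lib's HyperLogLogPlus(p, sp) constructor (the 16/25 values below are only illustrative, not Phoenix's configuration):
{code}
import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus;

public class PrecisionDemo {
    public static void main(String[] args) {
        // p = normal-set precision, sp = sparse-set precision.
        // 16/25 are illustrative values, not necessarily what Phoenix uses.
        HyperLogLogPlus hll = new HyperLogLogPlus(16, 25);
        for (long i = 0; i < 1_000_000L; i++) {
            hll.offer(i);                        // hash and record each value
        }
        System.out.println(hll.cardinality());   // approximate distinct count
    }
}
{code}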

In detail:
Because HyperLogLog performs poorly when cardinality is low, [Stefan et al.|http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/40671.pdf] came up with the idea of using a normal hash set while cardinality is low, so that we get the best of both worlds (except that the implementation ends up using a sparse set rather than a normal hash set). Then, as cardinality goes up, the counting switches to HyperLogLog, because "sparse has the advantage on accuracy per unit of memory at low cardinality but quickly falls behind."
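
As a rough sketch of that switch (same stream-lib class as above; the exact cardinality at which it converts from sparse to normal depends on p and sp):
{code}
import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus;

public class SparseVsNormal {
    public static void main(String[] args) {
        // Low cardinality: the sparse representation keeps the estimate
        // essentially exact for the first few thousand distinct values.
        HyperLogLogPlus small = new HyperLogLogPlus(16, 25);
        for (int i = 0; i < 1_000; i++) small.offer(i);
        System.out.println("small: " + small.cardinality());  // ~1000

        // High cardinality: the counter has converted to the normal HLL
        // registers, so the estimate carries the usual ~1.04/sqrt(2^p) error.
        HyperLogLogPlus large = new HyperLogLogPlus(16, 25);
        for (int i = 0; i < 5_000_000; i++) large.offer(i);
        System.out.println("large: " + large.cardinality());  // within ~1% of 5M
    }
}
{code}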

Based on this design, this is how stream-lib lays out the 64-bit hash. For the code, see [this|https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/cardinality/HyperLogLogPlus.java]
{code}
     * *******************************************************   <- hashed length of bits
     * | p bits = idx ||     look for leading zeros here     |
     * |      sp bits = idx'     |
{code}
So, in normal HLL mode, the top p bits select the bucket (giving 2^p buckets) and the remaining 64 - p bits are scanned for leading zeros.
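
As an illustration of that split, a toy sketch of the idea (not the actual stream-lib code):
{code}
public class HashLayout {
    public static void main(String[] args) {
        int p = 16;                              // normal-set precision
        long hash = 0x1234_5678_9ABC_DEF0L;      // stand-in for the 64-bit hash of a value

        // Top p bits select one of 2^p registers.
        int idx = (int) (hash >>> (64 - p));

        // The remaining 64 - p bits are what HLL scans for leading zeros;
        // the stored rank is (number of leading zeros) + 1.
        long rest = hash << p;
        int rank = Long.numberOfLeadingZeros(rest) + 1;

        System.out.println("idx=" + idx + ", rank=" + rank);
    }
}
{code}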

Note, though, that with the default p=16 there are up to 48 bits left for HLL, i.e. a 2^48 hashing space, so a 30-million-value test set is a relatively trivial load.
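
As a back-of-the-envelope check of that claim (a birthday-bound estimate, nothing from the Phoenix or stream-lib code):
{code}
public class HashSpaceCheck {
    public static void main(String[] args) {
        double space = Math.pow(2, 48);   // hash space left after the p = 16 index bits
        double n = 30_000_000d;           // roughly the size of the test set

        // Birthday-problem estimate of expected 48-bit hash collisions: n^2 / (2 * space)
        double expectedCollisions = n * n / (2 * space);
        System.out.printf("expected collisions ~ %.1f%n", expectedCollisions);  // ~1.6
    }
}
{code}
On the order of one or two collisions among 30 million values is negligible for the estimate.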


> APPROX_COUNT_DISTINCT becomes imprecise at 20m unique values.
> -------------------------------------------------------------
>
>                 Key: PHOENIX-4164
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4164
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: Lars Hofhansl
>            Assignee: Ethan Wang
>
> {code}
> 0: jdbc:phoenix:localhost> select count(*) from test;
> +-----------+
> | COUNT(1)  |
> +-----------+
> | 26931816  |
> +-----------+
> 1 row selected (14.604 seconds)
> 0: jdbc:phoenix:localhost> select approx_count_distinct(v1) from test;
> +----------------------------+
> | APPROX_COUNT_DISTINCT(V1)  |
> +----------------------------+
> | 17221394                   |
> +----------------------------+
> 1 row selected (21.619 seconds)
> {code}
> The table is generated from random numbers, and the cardinality of v1 is close to the number of rows.
> (I cannot run a COUNT(DISTINCT(v1)), as it uses up all memory on my machine and eventually kills the regionserver - that's another story and another jira)
> [~aertoria]


