Posted to user@kylin.apache.org by Prashant Prakash <pr...@gmail.com> on 2016/01/11 20:49:26 UTC

HyperLogLogPlusCounter Usage

Hi,

I am experiencing a strange issue with a count(distinct) query in Kylin. We
are using hllc12 to estimate uniques for a measure in a table partitioned
by date.
The unique estimates for the individual dates 2016-01-07, 2016-01-08, and
2016-01-09 are 93,728,324, 90,982,364, and 45,485,278 respectively.
But the uniques across days, which are calculated through the
HyperLogLogPlusCounter.merge operation, come out as 67,980,576.

1. Is a distinct-count query across days a valid usage of Kylin?

Sample query:
SELECT COUNT(DISTINCT f.userid) AS m1 FROM
kylin.fact_publishers_uniques f WHERE
dt in ('2016-01-09', '2016-01-08', '2016-01-07')

Theoretically, the uniques across days should be at least the maximum of
the uniques for each day, so the final number does not seem correct.
To debug the issue we also calculated the uniques across 2016-01-07 and
2016-01-08, which come to 164,637,916. It is only when we merge the data
for 2016-01-09 that we get the spurious value.
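
As a quick sanity check on the figures above: the exact distinct count of a
union can never be smaller than the largest per-day count, nor larger than
the sum of the per-day counts. The minimal sketch below applies those two
bounds to the numbers quoted in this message. The helper name union_bounds
is only illustrative, and since HLL estimates carry some error a small
violation would not be alarming, but the three-day result falls far below
the lower bound.

```python
def union_bounds(per_day_counts):
    """Return (lower, upper) bounds on the distinct count of the union."""
    counts = list(per_day_counts)
    return max(counts), sum(counts)

# Per-day estimates and merged results quoted in the message above.
per_day = {
    "2016-01-07": 93_728_324,
    "2016-01-08": 90_982_364,
    "2016-01-09": 45_485_278,
}
merged_two_days = 164_637_916    # 2016-01-07 and 2016-01-08
merged_three_days = 67_980_576   # all three dates

two_lower, two_upper = union_bounds(
    [per_day["2016-01-07"], per_day["2016-01-08"]]
)
lower, upper = union_bounds(per_day.values())

two_day_ok = two_lower <= merged_two_days <= two_upper
three_day_ok = lower <= merged_three_days <= upper

print("two-day result within bounds:  ", two_day_ok)
print("three-day result within bounds:", three_day_ok)
```

The two-day figure sits inside its bounds, while the three-day figure is
below even the single-day estimate for 2016-01-07.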

2. Is there any limit on the relative order of the elements being merged?

Regards,
Prashant

Re: HyperLogLogPlusCounter Usage

Posted by ShaoFeng Shi <sh...@apache.org>.
Hi Prashant, the query is valid, and there is no limit on the order of the
elements. Could you please open a JIRA
<https://issues.apache.org/jira/secure/Dashboard.jspa> for Kylin with the
version number? Thanks for reporting!
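
To illustrate why merge order should not matter for a HyperLogLog-style
counter, here is a minimal sketch of a textbook HLL in which merging is a
register-wise maximum. This is not Kylin's HyperLogLogPlusCounter: the
function names, the SHA-1 hash, the precision of 12 (assumed to mirror
"hllc12"), and the sample data are all assumptions made for illustration,
and the estimate is the raw formula without range corrections.

```python
import hashlib
from functools import reduce
from itertools import permutations

P = 12          # precision: 2**12 registers (assumption, mirroring "hllc12")
M = 1 << P

def new_sketch():
    return [0] * M

def add(sketch, value):
    # 64-bit hash taken from the first 8 bytes of SHA-1 (illustrative choice).
    h = int.from_bytes(hashlib.sha1(str(value).encode()).digest()[:8], "big")
    idx = h >> (64 - P)                       # top P bits select the register
    rest = h & ((1 << (64 - P)) - 1)          # remaining 64 - P bits
    rank = (64 - P) - rest.bit_length() + 1   # position of the leftmost 1-bit
    sketch[idx] = max(sketch[idx], rank)

def merge(a, b):
    # Register-wise max: commutative, associative and idempotent.
    return [max(x, y) for x, y in zip(a, b)]

def merge_all(sketches):
    return reduce(merge, sketches, new_sketch())

def estimate(sketch):
    # Raw HLL estimate, without small-range or large-range corrections.
    alpha = 0.7213 / (1 + 1.079 / M)
    return alpha * M * M / sum(2.0 ** -r for r in sketch)

def build(values):
    sketch = new_sketch()
    for v in values:
        add(sketch, v)
    return sketch

# Three overlapping "days" of made-up user ids.
days = [build(range(0, 3000)), build(range(2000, 5000)), build(range(4500, 5500))]

# Every merge order yields the same registers.
orders_agree = len({tuple(merge_all(p)) for p in permutations(days)}) == 1

# The merged estimate is never below any single day's estimate.
merged = merge_all(days)
never_below_inputs = all(estimate(merged) >= estimate(d) for d in days)

print(orders_agree, never_below_inputs)
```

Under these assumptions a merged estimate that drops below one of its
inputs, as in the report above, would point to something other than merge
order, which is why the version number is useful for the JIRA.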


-- 
Best regards,

Shaofeng Shi