Posted to dev@kylin.apache.org by "Yerui Sun (JIRA)" <ji...@apache.org> on 2015/11/30 17:55:10 UTC

[jira] [Created] (KYLIN-1186) Support precise Count Distinct using bitmap

Yerui Sun created KYLIN-1186:
--------------------------------

             Summary: Support precise Count Distinct using bitmap
                 Key: KYLIN-1186
                 URL: https://issues.apache.org/jira/browse/KYLIN-1186
             Project: Kylin
          Issue Type: Improvement
          Components: Job Engine
    Affects Versions: v1.1
            Reporter: Yerui Sun
            Assignee: ZhouQianhao
             Fix For: v2.0, 1.2


Currently, Kylin only supports approximate count distinct, implemented with HyperLogLog.
In our production scenarios, there are strong requirements for precise count distinct, mainly on columns of type int or bigint, such as user-id, product-id, etc.
Implementing precise count distinct for all types is difficult and inefficient. However, supporting only int or bigint makes this much easier: the values can be projected into a bitmap, which is easy to compress and store, and easy to count.
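To illustrate the idea, here is a minimal sketch of counting distinct int values with a bitmap. It is not the POC code: java.util.BitSet (uncompressed, non-negative indexes only) stands in for RoaringBitmap, and the class name PreciseDistinctCounter and its methods are hypothetical.

```java
import java.util.BitSet;

// Sketch only: java.util.BitSet stands in for a compressed bitmap such as RoaringBitmap.
// The class and method names are hypothetical, not part of Kylin.
final class PreciseDistinctCounter {
    private final BitSet bits = new BitSet();

    // Record one value (e.g. a user-id); a duplicate simply sets the same bit again.
    void add(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("this sketch only handles non-negative ints");
        }
        bits.set(value);
    }

    // Merge another counter into this one by OR-ing the bitmaps,
    // as would be needed when combining partial aggregation results.
    void merge(PreciseDistinctCounter other) {
        bits.or(other.bits);
    }

    // The exact distinct count is the number of set bits.
    int count() {
        return bits.cardinality();
    }
}
```

Adding the values 1, 2, 2 to one counter and 2, 3 to another, then merging them, leaves three bits set, so count() returns 3. Unlike HyperLogLog, the result is exact, at the cost of a bitmap whose size depends on the value range and density.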
I've created a POC based on RoaringBitmap, which proves that the approach works. There is some more work to be done:
* RoaringBitmap only supports int, so a solution is needed to support bigint;
* Add a new measure and codec, like HyperLogLogPlusCounter, to make it easy to use;
* Add the new measure to the web UI, and check whether the column type is int or bigint.
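One possible direction for the bigint item, offered only as a sketch and not as the chosen design: split each 64-bit value into a high part used as a bucket key and a low part stored in a per-bucket int bitmap. The class LongDistinctCounter is hypothetical, and java.util.BitSet again stands in for RoaringBitmap. Because BitSet indexes must be non-negative, this sketch uses a 33/31-bit split; with a bitmap that treats ints as unsigned, a 32/32 split would be the natural choice.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Sketch only: one int-indexed bitmap per high-bits bucket.
// The class and method names are hypothetical, not part of Kylin.
final class LongDistinctCounter {
    private static final int LOW_BITS = 31;
    private static final long LOW_MASK = (1L << LOW_BITS) - 1;

    private final Map<Long, BitSet> buckets = new HashMap<>();

    void add(long value) {
        long high = value >>> LOW_BITS;      // upper 33 bits select the bucket
        int low = (int) (value & LOW_MASK);  // lower 31 bits, always non-negative
        buckets.computeIfAbsent(high, k -> new BitSet()).set(low);
    }

    // Merge bucket by bucket, OR-ing bitmaps that share the same key.
    void merge(LongDistinctCounter other) {
        for (Map.Entry<Long, BitSet> e : other.buckets.entrySet()) {
            buckets.computeIfAbsent(e.getKey(), k -> new BitSet()).or(e.getValue());
        }
    }

    // The distinct count is the sum of the set bits across all buckets.
    long count() {
        long total = 0;
        for (BitSet b : buckets.values()) {
            total += b.cardinality();
        }
        return total;
    }
}
```

Since BitSet is uncompressed, a single value with a large low part can allocate a bitmap of up to 256 MB, so a real implementation would need a compressed bitmap per bucket.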



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)