Posted to user@kylin.apache.org by Xiaoxiang Yu <xi...@kyligence.io> on 2019/02/13 02:12:32 UTC

Re: How does Kylin store topN and count distinct?

Hi,

For COUNT_DISTINCT measures, the values are stored in HBase as a large byte array that can be decoded into a bitmap or HllCounter, not as a simple Java primitive type.
So the assumption that "they are just a single numeric value" is not correct.
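
As a rough illustration of why a single number is not enough, here is a minimal Python sketch. It is not Kylin's code or byte layout: the helper names and the user ids are made up, and the precise case is modelled as a plain bitmap held in a Python int. The point is that two pre-aggregated cells can only be combined correctly if each one keeps the set of values, serialized as bytes.

```python
# Toy model of a precise count-distinct measure: a bitmap, not a number.
# Illustrative sketch only; not Kylin's actual classes or serialization.

def bitmap_from_ids(ids):
    """Build a bitmap (as a Python int) with one bit set per distinct id."""
    bits = 0
    for i in ids:
        bits |= 1 << i
    return bits

def to_bytes(bits):
    """Serialize the bitmap into the byte array a storage cell would hold."""
    length = max(1, (bits.bit_length() + 7) // 8)
    return bits.to_bytes(length, "little")

def from_bytes(raw):
    """Decode a stored byte array back into a bitmap."""
    return int.from_bytes(raw, "little")

def merge(raw_a, raw_b):
    """Combine two cells: the union of the two bitmaps."""
    return to_bytes(from_bytes(raw_a) | from_bytes(raw_b))

def cardinality(raw):
    """Number of distinct ids recorded in a cell."""
    return bin(from_bytes(raw)).count("1")

# Two pre-aggregated rows, e.g. user ids seen on two days (hypothetical data).
day1 = to_bytes(bitmap_from_ids([1, 2, 3, 500]))
day2 = to_bytes(bitmap_from_ids([3, 500, 900]))

naive_sum = cardinality(day1) + cardinality(day2)  # double counts ids 3 and 500
merged = cardinality(merge(day1, day2))            # counts each id once

print(naive_sum, merged, len(day1), len(day2))
```

If only the counts 4 and 3 were stored, the two rows could not be merged into the correct answer of 5. Note also that each cell here is dozens of bytes, compared with 8 bytes for a long.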

Since they are always larger than simple measures (sum/min/max), putting them in a separate column family is a good choice: it makes queries that touch only simple measures more efficient.
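
The effect can be sketched with a toy model in Python. This is only an assumption-laden illustration: the family names, qualifier names and cell sizes are invented, and a row is modelled as a dict rather than real HBase storage. It relies on the fact that HBase keeps each column family in its own store files, so a scan restricted to one family does not have to read the others.

```python
# Toy model of one HBase row: column family -> {qualifier: cell bytes}.
# Family names, qualifiers and sizes are hypothetical, for illustration only.

def bytes_read(row, families):
    """Total cell bytes touched when a scan is restricted to `families`."""
    return sum(len(cell) for fam in families for cell in row[fam].values())

# Layout A: simple measures in F1, the large count-distinct cell in F2.
row_split = {
    "F1": {
        "sum_price": (1234).to_bytes(8, "big"),
        "max_price": (99).to_bytes(8, "big"),
    },
    "F2": {
        "user_count_distinct": bytes(4096),  # stand-in for a serialized bitmap
    },
}

# Layout B: everything in a single column family.
row_single = {"F1": {**row_split["F1"], **row_split["F2"]}}

# A query that needs only sum/max scans family F1 in both layouts.
split_cost = bytes_read(row_split, ["F1"])
single_cost = bytes_read(row_single, ["F1"])

print(split_cost, single_cost)
```

In this model the sum/max query touches 16 bytes per row with the split layout, against 4112 bytes when the large measure shares the family.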

Please check https://github.com/apache/kylin/tree/master/core-metadata/src/main/java/org/apache/kylin/measure for a more accurate answer.

If you find any mistake, please let me know.

----------------
Best wishes,
Xiaoxiang Yu


From: TUESDAY <gz...@foxmail.com>
Reply-To: "user@kylin.apache.org" <us...@kylin.apache.org>
Date: Tuesday, February 12, 2019, 19:43
To: user <us...@kylin.apache.org>
Subject: How does Kylin store topN and count distinct?

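I have had a question for a while. In Kylin's HBase storage, the rowkey is made up of a combination of dimensions, and the column family holds the measure values for that combination. Why, then, are measures such as topN and count distinct stored in a separate column family? Aren't they just a single numeric value (when calculated precisely; and if the calculation is not precise, what are they then)?
Thanks! I was not sure whether my first message went through, so I am sending it again.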

Re: How does Kylin store topN and count distinct?

Posted by 邹易 <gz...@gmail.com>.
Hi,
I found articles on top-N and count distinct on the Apache Kylin blog.

They are very informative about the technical side of both measures, and
describe the purpose and principle of top-N aggregation and of the count
distinct measure.


Top-N measure:
http://kylin.apache.org/blog/2016/03/19/approximate-topn-measure/



Count distinct measure:
http://kylin.apache.org/blog/2016/08/01/count-distinct-in-kylin/


If you find any mistake, please let me know.

Thanks!

