Posted to user@kylin.apache.org by 热爱大发挥 <38...@qq.com> on 2016/03/11 04:11:39 UTC

Re: Build Measures with count distinct high cardinality column

hive table records: 1000000
hive table size: 70MB


cube info:
normal dimensions: 8 (each with cardinality less than 6)
measures:  count(distinct uid), where "uid" has a cardinality of about 600000
           count(distinct keyword), where "keyword" has a cardinality of about 100000



build time: 12 min
cube size: 765MB
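
For reference, a rough expansion rate from the figures above (a back-of-envelope sketch, assuming expansion rate = cube size / source table size):

    765 MB / 70 MB ≈ 10.9x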






------------------ Original Message ------------------
From: "ShaoFeng Shi"<sh...@apache.org>;
Sent: Friday, March 11, 2016, 10:48 AM
To: "user"<us...@kylin.apache.org>;
Subject: Re: Build Measures with count distinct high cardinality column



Which precision (error rate) did you select for this measure? "error rate < 1.22%" takes much more storage than "error rate < 9.75%"; users need to select a proper precision depending on their needs.
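
To illustrate the tradeoff, here is a minimal sketch (not Kylin's actual implementation) using the textbook HyperLogLog error formula, 1.04/sqrt(m), with m = 2^p registers and roughly one byte of storage per register; Kylin's quoted bounds appear to be about three times this standard error:

    public class HllPrecisionDemo {
        public static void main(String[] args) {
            // m = 2^p registers; storage is roughly one byte per register.
            for (int p : new int[] {10, 12, 14, 16}) {
                long m = 1L << p;
                double stdError = 1.04 / Math.sqrt(m); // one-sigma relative error
                System.out.printf("p=%d: %d KB per counter, std error ~%.2f%%%n",
                        p, m / 1024, stdError * 100);
            }
        }
    }

With p=10 the counter is about 1 KB with a ~3.25% standard error (~9.75% at three sigma), while p=16 is about 64 KB with ~0.41% (~1.22%), i.e. roughly 64x the storage for the tighter bound.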

Also, when you state "the cuboid size was very large and the build took much time", please provide detailed information such as source data size, number of dimensions, dimension cardinality, measure definitions, your Hadoop cluster capacity, cube expansion rate, build time, etc. Otherwise we cannot make a judgement or give comments.


2016-03-10 23:20 GMT+08:00 热爱大发挥 <38...@qq.com>:
In the measures step, I tried a count distinct measure on a high cardinality column (like user_id), and found that the cuboid size was very large and the build took much time. Is count distinct deprecated for high cardinality columns?
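
For context, such a measure is declared in the cube descriptor JSON along these lines (a sketch modeled on Kylin's sample cube; the measure name and column here are made up, and "hllc(10)" selects the approximate HyperLogLog counter with the ~9.75% error bound):

    {
      "name": "UID_DISTINCT",
      "function": {
        "expression": "COUNT_DISTINCT",
        "parameter": { "type": "column", "value": "UID" },
        "returntype": "hllc(10)"
      }
    }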





-- 
Best regards,

Shaofeng Shi

Re: Build Measures with count distinct high cardinality column

Posted by ShaoFeng Shi <sh...@apache.org>.
What's the precision of the two distinct counters? Is the 12 minutes the
total time for building this cube?



-- 
Best regards,

Shaofeng Shi