Posted to user@kylin.apache.org by 吕卓然 <lv...@fosun.com> on 2017/04/28 02:01:34 UTC

Re: A problem in cube building time

Hi Roberto,

Glad to hear from you. Actually, I do not have any high-cardinality dimensions in my case; the largest cardinality is around 400. I was wondering how much the accuracy of count distinct matters. I have already set all dimensions in the lookup tables to derived dimensions. What I am curious about is the relation between the number of dimensions and the build speed, and also the relation between the count distinct accuracy and the build speed.

Thanks,
Zhuoran

From: Roberto Tardío Olmos [mailto:roberto.tardio@stratebi.com]
Sent: April 27, 2017 19:08
To: user@kylin.apache.org
Subject: Re: A problem in cube building time

Hi Zhuoran,

I faced a similar problem with cube building time. I think it depends on the cardinality of the 2 dimensions you added. If either of them has a high cardinality (e.g. about 500,000 rows for the Customer dimension in my use case), the number of combinations Kylin needs to build for the cube increases a lot.

Some things you could try to reduce cube building time and size:

* Define all dimension table attributes as Derived Dimensions. In this case you cannot use the Hierarchy optimization in the Agg Group. Queries that use derived attributes will have higher latency than with Agg Group Hierarchies (with Normal Dimensions), but in some cases the difference is acceptable (in my case between 2 and 6 seconds more, depending on the query). Cube size and building time will be reduced.
* Use "Shard By" in the Rowkey for high-cardinality dimensions. I have not been able to test it yet, but as indicated at https://kylin.apache.org/docs16/howto/howto_optimize_build.html it should work fine. This helps to reduce cube building time.
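
As a rough illustration of why the number of combinations matters, the sketch below counts cuboids under a simplified model: every free normal dimension doubles the count, while a hierarchy of n levels contributes n + 1 combinations instead of 2^n. The function name `cuboid_count` and the model itself are assumptions made for illustration, not Kylin's actual planner, and derived dimensions and any other pruning are ignored. The figures 13, 15 and the 4-dimension hierarchy are taken from the original question in this thread.

```python
def cuboid_count(normal_dims, hierarchy_sizes=()):
    """Estimate the cuboid count for one aggregation group (simplified model).

    normal_dims     -- total number of normal dimensions
    hierarchy_sizes -- sizes of the hierarchies formed by some of those dimensions
    """
    in_hierarchy = sum(hierarchy_sizes)
    if in_hierarchy > normal_dims:
        raise ValueError("hierarchies use more dimensions than are available")

    # Dimensions outside any hierarchy are free: each is either present or absent.
    count = 2 ** (normal_dims - in_hierarchy)

    # A hierarchy of n levels only allows its n prefixes plus the empty choice.
    for size in hierarchy_sizes:
        count *= size + 1
    return count


before = cuboid_count(13, hierarchy_sizes=(4,))
after = cuboid_count(15, hierarchy_sizes=(4,))
print(before, after, after / before)
```

Under this model, adding two free dimensions quadruples the cuboid count (2560 to 10240), although build time need not grow in the same proportion.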

I hope this helps; I'm also learning to use Kylin.

Kind Regards,
On 27/04/2017 at 12:46, 吕卓然 wrote:
Hi all,

Currently I am using Kylin 1.6.1 and I have a problem with cube building time. I have one fact table and two lookup tables. When I set 13 normal dimensions, 15 derived dimensions and two measures (count and count distinct), step 3 of the build takes around 20 minutes and the entire build takes around 1 hour. This is good.
However, when I increase to 15 normal dimensions with 15 derived dimensions and two measures (count and count distinct), step 3 of the build takes around 240 minutes and the entire build takes forever.
BTW, I have a hierarchy that consists of 4 normal dimensions.
I am really confused about this. Do 13 normal dimensions become a bottleneck in building the cube?

Thanks a lot!
Zhuoran

--
Roberto Tardío Olmos
Senior Big Data & Business Intelligence Consultant

Avenida de Brasil, 17, Planta 16.
28020 Madrid
Fijo: 91.788.34.10

Re: Re: A problem in cube building time

Posted by ShaoFeng Shi <sh...@apache.org>.

Derived dimensions can help to reduce the number of cuboids, but they may hurt
query performance, so "mark all columns on the dimension table as derived" is
not always recommended. For example, say you have a "customer" lookup table
whose primary key is "customer_id" and which has another column, "gender".
"gender" can be marked as "derived", but going from "customer_id" to "gender"
requires a lot of runtime aggregation, which slows down queries. For a column
like "customer_email", on the other hand, the cardinality is close to that of
the primary key, so using "derived" is the best choice.
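
The difference between the two columns can be sketched with a toy model. This is only an illustration of the runtime aggregation idea, not Kylin's query engine: the dictionaries `cuboid` and `lookup`, their values, and the helper `group_by_derived` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical pre-aggregated cuboid: customer_id -> order count.
cuboid = {1: 3, 2: 5, 3: 2, 4: 7}

# Hypothetical lookup-table snapshot keyed by the primary key.
lookup = {
    1: {"gender": "F", "customer_email": "a@example.com"},
    2: {"gender": "M", "customer_email": "b@example.com"},
    3: {"gender": "F", "customer_email": "c@example.com"},
    4: {"gender": "M", "customer_email": "d@example.com"},
}


def group_by_derived(cuboid, lookup, column):
    """Answer GROUP BY <derived column> by re-aggregating rows at query time."""
    totals = defaultdict(int)
    for customer_id, count in cuboid.items():
        # Translate the stored key into the derived value, then aggregate again.
        totals[lookup[customer_id][column]] += count
    return dict(totals)


by_gender = group_by_derived(cuboid, lookup, "gender")
by_email = group_by_derived(cuboid, lookup, "customer_email")

# Rows scanned versus rows returned for each derived column.
print(len(cuboid), "->", len(by_gender))
print(len(cuboid), "->", len(by_email))
```

For "gender", every stored row has to be scanned and collapsed into a handful of groups at query time, and that cost grows with the number of customers. For "customer_email", the result has about as many rows as the cuboid itself, so almost no extra aggregation is needed.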

Zhuoran, don't guess; search for "kylin design optimization" and you will
find many articles. This presentation is also worth a read:
http://www.slideshare.net/YangLi43/design-cube-in-apache-kylin



-- 
Best regards,

Shaofeng Shi 史少锋