Posted to issues@kylin.apache.org by "Zhong Yanghong (JIRA)" <ji...@apache.org> on 2018/12/25 01:41:00 UTC

[jira] [Resolved] (KYLIN-3540) Improve Mandatory Cuboid Recommendation Algorithm

     [ https://issues.apache.org/jira/browse/KYLIN-3540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhong Yanghong resolved KYLIN-3540.
-----------------------------------
    Resolution: Resolved

> Improve Mandatory Cuboid Recommendation Algorithm
> -------------------------------------------------
>
>                 Key: KYLIN-3540
>                 URL: https://issues.apache.org/jira/browse/KYLIN-3540
>             Project: Kylin
>          Issue Type: Improvement
>            Reporter: Zhong Yanghong
>            Assignee: Zhong Yanghong
>            Priority: Major
>             Fix For: v2.6.0
>
>
> Previously, to cover cuboids which are not prebuilt, the cube planner turned to mandatory cuboids, which were selected whenever their rollup row count was above some threshold. There are two shortcomings:
> * The way the rollup row count is estimated is inaccurate
> * It's hard to determine the threshold of rollup row count for recommending mandatory cuboids
> bq. {color:#f79232}The improved way to estimate the rollup row count is as follows:{color}
> The current criterion for recommending mandatory cuboids is based on the average rollup count collected from query metrics. This has a disadvantage, illustrated by the following example:
> Cuboid (A,B) has 1000 rows and is prebuilt; Cuboid (B) has 10 rows and is not prebuilt. The ground truth for the rollup count from Cuboid (A,B) to Cuboid (B) is
> {code}
> Cuboid (A,B) - Cuboid (B) = 1000 - 10 = 990
> {code}
> Suppose the values of B are evenly distributed over the rows of Cuboid (A,B). Then for each value of B, the row count in Cuboid (A,B) is 1000 / 10 = 100.
> Now consider the following sql query:
> {code}
> select B, count(*)
> from T
> where B = 'e1'
> group by B
> {code}
> Then the rollup count estimated by the current algorithm will be
> {code}
> Cuboid (A,{'e1'}) - return count = 100 - 1 = 99
> {code}
> which is much smaller than 990, because the filter removes most of the parent cuboid's rows before the rollup is measured.
> It's better to calculate the rollup rate first and then multiply it by the parent cuboid row count to estimate the rollup count. The refined formula is as follows:
> {code}
> Cuboid (A,B) - Cuboid (A,B) * (return count) / Cuboid (A,{'e1'}) = 1000-1000*1/100 = 990
> {code}
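> The two estimates can be written down as a minimal sketch (hypothetical helper methods for illustration, not the actual Kylin code), using the numbers of the first query:
> {code:java}
> public class RollupEstimate {
>     // Current approach: rollup = rows scanned in the parent cuboid minus rows returned.
>     static long currentRollup(long scannedRows, long returnedRows) {
>         return scannedRows - returnedRows;
>     }
>
>     // Refined approach: derive the rollup rate from the scanned slice,
>     // then scale it by the full parent cuboid row count.
>     static long refinedRollup(long parentRows, long scannedRows, long returnedRows) {
>         double rollupRate = 1.0 - (double) returnedRows / scannedRows;
>         return (long) (parentRows * rollupRate);
>     }
>
>     public static void main(String[] args) {
>         // First query: 100 rows scanned in Cuboid (A,B) with B = 'e1', 1 row returned.
>         System.out.println(currentRollup(100, 1));        // 99
>         System.out.println(refinedRollup(1000, 100, 1));  // 990
>     }
> }
> {code}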
> Consider another sql query:
> {code}
> select count(*)
> from T
> where B in ('e1','e2')
> {code}
> The rollup count estimated by the current algorithm will be
> {code}
> Cuboid (A,{'e1','e2'}) - return count = 100*2 - 1 = 199
> {code}
> The rollup count estimated by the refined algorithm will be
> {code}
> Cuboid (A,B) - Cuboid (A,B) * (return count) / Cuboid (A,{'e1','e2'}) = 1000-1000*1/(100*2) = 995
> {code}
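> Plugging the numbers of the second query into both formulas gives the same picture (again only an illustrative sketch, not Kylin code):
> {code:java}
> public class SecondQueryEstimate {
>     public static void main(String[] args) {
>         // Second query: the scanned slice of Cuboid (A,B) with B in ('e1','e2')
>         // has 100 * 2 = 200 rows, and the query returns 1 row.
>         long parentRows = 1000, scannedRows = 200, returnedRows = 1;
>
>         long current = scannedRows - returnedRows;                                             // 199
>         long refined = (long) (parentRows - parentRows * (double) returnedRows / scannedRows); // 995
>
>         System.out.println(current + " vs " + refined);
>     }
> }
> {code}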
> In summary, the refined algorithm is much less affected by the filters in a sql query.
> bq. {color:#f79232}Don't recommend mandatory cuboids & don't need the threshold{color}
> Previously, the reason to recommend mandatory cuboids was that they were not prebuilt and their row count statistics were unknown, which made it impossible to apply the cube planner algorithm to them. With the improved way of estimating the rollup row count, we can now estimate the row count statistics for cuboids which are not prebuilt. The cost-based cube planner algorithm can then decide which cuboids to build, and the threshold is no longer needed.
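> As a rough illustration of that idea (hypothetical class and method names, not the actual cube planner API), the estimated rollup count can be turned into a row count estimate for a cuboid that was never built, which a cost-based planner can then score like any other candidate, without any rollup threshold:
> {code:java}
> import java.util.HashMap;
> import java.util.Map;
>
> public class NonPrebuiltCuboidStats {
>     // Estimate the row count of a non-prebuilt cuboid from query statistics:
>     // child rows = parent rows - estimated rollup count.
>     static long estimateRowCount(long parentRows, long scannedRows, long returnedRows) {
>         long rollup = (long) (parentRows - parentRows * (double) returnedRows / scannedRows);
>         return parentRows - rollup;
>     }
>
>     public static void main(String[] args) {
>         Map<String, Long> estimatedRows = new HashMap<>();
>         // Cuboid (B) was queried but never prebuilt; its row count is estimated as ~10.
>         estimatedRows.put("B", estimateRowCount(1000, 100, 1));
>         // A cost-based planner (not shown) can now weigh Cuboid (B) against all other
>         // candidates instead of force-adding it once a rollup threshold is exceeded.
>         System.out.println(estimatedRows);
>     }
> }
> {code}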



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)