Posted to issues@kylin.apache.org by "Zhong Yanghong (JIRA)" <ji...@apache.org> on 2018/09/06 03:27:00 UTC

[jira] [Updated] (KYLIN-3540) Improve Mandatory Cuboid Recommendation Algorithm

     [ https://issues.apache.org/jira/browse/KYLIN-3540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhong Yanghong updated KYLIN-3540:
----------------------------------
    Description: 
The current criterion for recommending mandatory cuboids is based on the average rollup count collected from query metrics. This has a disadvantage, illustrated by the following example:
Cuboid (A,B) has 1000 rows and is prebuilt; Cuboid (B) has 10 rows and is not prebuilt. The ground truth for the rollup count from Cuboid (A,B) to Cuboid (B) is
{code}
Cuboid (A,B) - Cuboid (B) = 1000 - 10 = 990
{code}
Suppose the values of B are evenly distributed across Cuboid (A,B). Then for each value of B, the row count in Cuboid (A,B) is 1000 / 10 = 100.

Now consider the SQL query
{code}
select B, count(*)
from T
where B = 'e1'
group by B
{code}

Then the rollup count estimated by the current algorithm is
{code}
Cuboid (A,{'e1'}) - return count = 100 - 1 = 99
{code}
which is much smaller than 990, because most of the rows are removed by the filter rather than by the rollup.

It is better to calculate the rollup rate first and then multiply it by the parent cuboid row count to estimate the rollup count. The refined formula is as follows:
{code}
Cuboid (A,B) - Cuboid (A,B) * (return count) / Cuboid (A,{'e1'}) = 1000-1000*1/100 = 990
{code}
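The arithmetic above can be sketched as follows; the variable names are illustrative only and are not part of Kylin's actual code.

```python
# Example numbers from the description: Cuboid (A,B) is prebuilt with 1000
# rows; the filter B = 'e1' matches 100 of them; the query returns 1 row.
parent_rows = 1000    # total rows in the prebuilt Cuboid (A,B)
scanned_rows = 100    # rows of Cuboid (A,B) surviving the filter B = 'e1'
returned_rows = 1     # rows returned to the client after aggregation

# Current algorithm: rollup count = scanned rows minus returned rows,
# which is distorted because the filter already removed 900 rows.
current_estimate = scanned_rows - returned_rows

# Refined algorithm: compute the rollup rate on the scanned slice, then
# scale it back up to the full parent cuboid.
refined_estimate = parent_rows - parent_rows * returned_rows / scanned_rows

print(current_estimate)   # 99
print(refined_estimate)   # 990.0
```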

Consider another SQL query:
{code}
select count(*)
from T
where B in ('e1','e2')
{code}

The rollup count estimated by the current algorithm is
{code}
Cuboid (A,{'e1','e2'}) - return count = 100*2 - 1 = 199
{code}

The rollup count estimated by the refined algorithm is
{code}
Cuboid (A,B) - Cuboid (A,B) * (return count) / Cuboid (A,{'e1','e2'}) = 1000-1000*1/(100*2) = 995
{code}
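The same sketch applied to the IN-filter query (again with illustrative variable names):

```python
parent_rows = 1000    # rows in the prebuilt Cuboid (A,B)
scanned_rows = 200    # the filter B in ('e1','e2') matches 100 * 2 rows
returned_rows = 1     # select count(*) returns a single row

# Current algorithm: distorted by the filter, as before.
current_estimate = scanned_rows - returned_rows

# Refined algorithm: rollup rate scaled to the full parent cuboid.
refined_estimate = parent_rows - parent_rows * returned_rows / scanned_rows

print(current_estimate)   # 199
print(refined_estimate)   # 995.0, close to the ground truth of 990
```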

In summary, the refined algorithm is much less influenced by filters in the SQL query.

> Improve Mandatory Cuboid Recommendation Algorithm
> -------------------------------------------------
>
>                 Key: KYLIN-3540
>                 URL: https://issues.apache.org/jira/browse/KYLIN-3540
>             Project: Kylin
>          Issue Type: Improvement
>            Reporter: Zhong Yanghong
>            Assignee: Zhong Yanghong
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)