Posted to issues@kylin.apache.org by "zhao jintao (JIRA)" <ji...@apache.org> on 2019/04/17 10:32:00 UTC
[jira] [Commented] (KYLIN-3961) Optimize TopN measure merge function to reduce mistakes

    [ https://issues.apache.org/jira/browse/KYLIN-3961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16819943#comment-16819943 ] 

zhao jintao commented on KYLIN-3961:
------------------------------------

I read two papers:

Ahmed Metwally, et al. "Efficient computation of frequent and top-k elements in data streams". Proceedings of the 10th International Conference on Database Theory (ICDT'05), 2005.

Massimo Cafaro, et al. "A parallel space saving algorithm for frequent items and the Hurwitz zeta distribution". arXiv:1401.0702v12 [cs.DS], 19 Sep 2015.

I find that the "merge" function of "TopNCounter" needs to be optimized.
After optimizing this function, I ran the same SQL against the same cube:

The top 5 "SUM PRICE" values from the second cube (with "TopN") are:

"167.7270...", "99.9909...", "99.9890...", "99.9869...", "99.9779...".
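
For readers unfamiliar with the two papers, the merge rule for two Space Saving summaries can be sketched roughly as follows. This is illustrative Python based on a reading of the Cafaro et al. paper, not the actual Kylin "TopNCounter" code or the patch for this issue; the function name merge_summaries and the dict-based summary layout are assumptions made for the example.

```python
from typing import Dict


def merge_summaries(s1: Dict[str, float], s2: Dict[str, float], k: int) -> Dict[str, float]:
    """Merge two Space Saving summaries that each hold at most k counters.

    Hypothetical helper for illustration only; a summary is modelled as a
    plain dict mapping item -> estimated sum.
    """
    # A summary with free slots never evicted anything, so an item that is
    # absent from it contributes 0. A full summary may have evicted the item,
    # and its smallest counter is an upper bound on what was lost.
    m1 = min(s1.values()) if s1 and len(s1) >= k else 0.0
    m2 = min(s2.values()) if s2 and len(s2) >= k else 0.0

    merged: Dict[str, float] = {}
    # Items in s1: add the matching s2 counter, or m2 if s2 does not track them.
    for item, count in s1.items():
        merged[item] = count + s2.get(item, m2)
    # Items that appear only in s2: add m1 for the part s1 may have evicted.
    for item, count in s2.items():
        if item not in s1:
            merged[item] = count + m1

    # Prune: keep only the k largest merged counters.
    top = sorted(merged.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return dict(top)
```

The point of the rule is that an item missing from a full summary is credited with that summary's minimum counter rather than zero, so each merged counter remains an upper bound on the true value, and only the k largest merged counters are kept.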


> Optimize TopN measure merge function to reduce mistakes
> -------------------------------------------------------
>
>                 Key: KYLIN-3961
>                 URL: https://issues.apache.org/jira/browse/KYLIN-3961
>             Project: Kylin
>          Issue Type: Improvement
>          Components: Measure - TopN
>    Affects Versions: v2.5.2
>         Environment: Huawei FusionInsight
>            Reporter: zhao jintao
>            Assignee: zhao jintao
>            Priority: Major
>              Labels: easyfix
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Hi Team:
> I use the "Top-N" measure for queries such as "select sum(AAA) from BBB group by CCC,DDD". It performs much better than a cube without "Top-N".
> In my system, Kylin takes just 0.2s to answer the query from a cube with the "Top-N" measure; without the "Top-N" measure it can take 10s.
> But I find that the Top-N measure can be optimized to reduce mistakes.
> I used the Kylin demo to test "TopN".
> I built two cubes from "KYLIN_SALES". The first cube has three dimensions: "SELLER_ID", "BUYER_ID" and "PART_DT", and one measure: "SUM(PRICE)". The second cube has one dimension: "PART_DT", and two measures: "SUM(PRICE)" and "TOPN(10)". For "TOPN(10)", the "ORDER|SUM by Column" is "PRICE", the "Group by Column" is "SELLER_ID" and "BUYER_ID", and the "Return Type" is "Top 10". Then I built the cubes for the range "2012-01-01" to "2014-01-01".
> I ran the same SQL against both cubes and found a large discrepancy between the results.
> The top 5 "SUM PRICE" values from the first cube (without "TopN") are "167.7269", "99.9908", "99.9888", "99.9865", "99.978".
> The top 5 "SUM PRICE" values from the second cube (with "TopN") are "179.27699...", "167.6320...", "167.3050...", "167.2069...", "166.7429...".
> Has anyone else met the same problem?
>  
> Best regards.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)