You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@kylin.apache.org by "hongbin ma (JIRA)" <ji...@apache.org> on 2015/09/01 06:07:48 UTC
[jira] [Resolved] (KYLIN-740) Slowness with many IN() values

     [ https://issues.apache.org/jira/browse/KYLIN-740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

hongbin ma resolved KYLIN-740.
------------------------------
       Resolution: Fixed
    Fix Version/s: v1.1
                   v0.8.1

the issue is fixed by removing all fuzzy keys when the number of fuzzy keys exceed threshold

> Slowness with many IN() values
> ------------------------------
>
>                 Key: KYLIN-740
>                 URL: https://issues.apache.org/jira/browse/KYLIN-740
>             Project: Kylin
>          Issue Type: Bug
>            Reporter: liyang
>            Assignee: hongbin ma
>             Fix For: v0.8.1, v1.1
>
>
> [from dev mail list]
> Huang Hua <hu...@mininglamp.com>
> We were testing the query efficiency with some sql containing 'IN' keyword
> on a relatively big table which holds 22+ Million records.
> It seems to us that sql with a number of items in 'IN' show very poor
> performance in terms of running time.
> The fact table used has three columns: themonth, id, trans_at, and we
> created the cube with 2 dimensions and 1 measure:
> Dimensions: themonth, id
> Measure: sum(trans_at)
> Basically our queries are all similar to the one:
> select themonth, id, sum(trans_at) from the_table where id IN (id1, id2,
> id3, .) group by themonth, id limit 10
> When the number of ids reaches about 30-50, the response time becomes
> considerable long in terms of minutes. By the way, we are running kylin on
> 12 nodes with 2 name nodes and 10 data nodes.
> Besides, we tried to break the above sql with 'IN' keyword into several
> sub-sql and merge the results of each sub-sql to get the final result like
> the following:
> select themonth, id, sum(trans_at) from the_table where id = id1 group by
> themonth, id limit 10
> select themonth, id, sum(trans_at) from the_table where id = id2 group by
> themonth, id limit 10
> select themonth, id, sum(trans_at) from the_table where id = id3 group by
> themonth, id limit 10
> .
> Surprisedly, the total running time of those sub-sql is much less than the
> running time of the orginal sql with 'IN' keyword.
> Initially I thought the backend query engine should handle 'IN' keyword in
> the similar way as the individual sql with '=' keyword, but it seems not.
> Can anybody provide any thoughts regarding this? Any ideas on how to tune
> the queries containing 'IN' keyword?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)