Posted to dev@kylin.apache.org by "hujiahua (Jira)" <ji...@apache.org> on 2022/01/10 12:21:00 UTC

[jira] [Created] (KYLIN-5153) Big in-list query cause slow performance

hujiahua created KYLIN-5153:
-------------------------------

             Summary: Big in-list query cause slow performance 
                 Key: KYLIN-5153
                 URL: https://issues.apache.org/jira/browse/KYLIN-5153
             Project: Kylin
          Issue Type: Improvement
          Components: Query Engine
    Affects Versions: v4.0.1
            Reporter: hujiahua


{code:sql}
select SELLER_ID, sum(PRICE) from KYLIN_SALES where SELLER_ID in (10000001,10000002,10000003,10000004,10000005,10000006) GROUP BY SELLER_ID
{code}
Currently, the above SQL is converted to a Spark physical plan like this:
{code:java}
Project [2#122L AS F__KYLIN_SALES_SELLER_ID__1_4392b2b0__0#128L, 5#124 AS F__SUM_PRICE__1_4392b2b0__2#130]
+- Filter ((((((2#122L = 10000001) || (2#122L = 10000002)) || (2#122L = 10000003)) || (2#122L = 10000004)) || (2#122L = 10000005)) || (2#122L = 10000006))
   +- FileScan parquet [2#122L,5#124] Batched: false, Format: Parquet, Location: FilePruner[file:/Users/hujiahua/work/project/yz-kylin/examples/test_case_data/sample_local/defaul..., PartitionFilters: [], PushedFilters: [Or(Or(Or(Or(Or(EqualTo(2,10000001),EqualTo(2,10000002)),EqualTo(2,10000003)),EqualTo(2,10000004)..., ReadSchema: struct<2:bigint,5:decimal(29,4)> {code}
An IN-list expression is always converted to a chain of OR expressions. When the list is relatively small, this works fine. But when the list gets bigger (1000+ values in our production case), it causes performance issues (the response time was more than 10 seconds). The large number of OR expressions makes the query spend too much time in the plan optimization phase and the Spark code generation phase.
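
To make the growth concrete, here is a minimal sketch of the left-deep OR chain shown in the Filter line of the plan above. It is only a model for illustration, not Kylin or Spark code: the class names EqualTo and Or are borrowed from the PushedFilters output, and the helpers in_list_to_or, count_nodes, depth and render are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Iterable, Union


@dataclass(frozen=True)
class EqualTo:
    """Leaf predicate: column = value (name borrowed from the PushedFilters output)."""
    column: str
    value: int


@dataclass(frozen=True)
class Or:
    """Binary OR node with two child predicates."""
    left: "Pred"
    right: "Pred"


Pred = Union[EqualTo, Or]


def in_list_to_or(column: str, values: Iterable[int]) -> Pred:
    """Fold an IN-list into a left-deep OR chain, mirroring the plan shape above."""
    it = iter(values)
    try:
        pred: Pred = EqualTo(column, next(it))
    except StopIteration:
        raise ValueError("IN-list must contain at least one value")
    for v in it:
        pred = Or(pred, EqualTo(column, v))
    return pred


def count_nodes(pred: Pred) -> int:
    """Count every node in the tree (iterative, so deep chains do not hit the recursion limit)."""
    stack, total = [pred], 0
    while stack:
        node = stack.pop()
        total += 1
        if isinstance(node, Or):
            stack.extend((node.left, node.right))
    return total


def depth(pred: Pred) -> int:
    """Length of the longest root-to-leaf path, counted in nodes (iterative)."""
    stack, best = [(pred, 1)], 0
    while stack:
        node, d = stack.pop()
        best = max(best, d)
        if isinstance(node, Or):
            stack.extend(((node.left, d + 1), (node.right, d + 1)))
    return best


def render(pred: Pred) -> str:
    """Print the tree in the same style as the Filter line (recursive; for small lists only)."""
    if isinstance(pred, EqualTo):
        return f"({pred.column} = {pred.value})"
    return f"({render(pred.left)} || {render(pred.right)})"


if __name__ == "__main__":
    small = in_list_to_or("2#122L", range(10000001, 10000007))
    print(render(small))
    print(count_nodes(small), depth(small))

    big = in_list_to_or("2#122L", range(1000))
    print(count_nodes(big), depth(big))
```

In this model, the six-value list from the example produces 11 nodes with a depth of 6, while a 1000-value list produces 1999 nodes with a depth of 1000. Any optimizer rule or code generator that walks the expression tree has to visit all of them, which is consistent with the slowdown described above.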