You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@phoenix.apache.org by "chenglei (JIRA)" <ji...@apache.org> on 2017/02/14 15:17:41 UTC
[jira] [Updated] (PHOENIX-3670) KeyRange.intersect(List , List) is inefficient,especially for join dynamic filter

     [ https://issues.apache.org/jira/browse/PHOENIX-3670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chenglei updated PHOENIX-3670:
------------------------------
    Summary: KeyRange.intersect(List<KeyRange> , List<KeyRange>) is inefficient,especially for join dynamic filter  (was: KeyRange.intersect(List<KeyRange> , List<KeyRange> ) is inefficient,especially for join dynamic filter)

> KeyRange.intersect(List<KeyRange> , List<KeyRange>) is inefficient,especially for join dynamic filter
> -----------------------------------------------------------------------------------------------------
>
>                 Key: PHOENIX-3670
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-3670
>             Project: Phoenix
>          Issue Type: Improvement
>    Affects Versions: 4.9.0
>            Reporter: chenglei
>
> In my business system, there is a following join SQL(which is simplified), fact_table is a fact table,  joining dimension table dim_table1 and dim_table2 : 
> {code:borderStyle=solid} 
> select /*+ SKIP_SCAN */ sum(t.click)  from fact_table t join dim_table1 d1 on  t.cust_id=d1.id  join dim_table2 d2 on t.cust_id =d2.id  where t.date between '2016-01-01' and '2017-01-01' and d1.code = 2008 and d2.region = 'us';
> {code} 
> I use /*+ SKIP_SCAN */ hint to enable join dynamic filter. For some small dataset, the sql executes quickly, but when the dataset is bigger, the sql become very slowly. When the  row count of fact_table is 30 million,dim_table1 is 300 thousand and dim_table2 is 100 thousand, the above query  costs 17s.
> When I debug the SQL executing, I find RHS1 return 5523 rows:
> {code:borderStyle=solid} 
>    select d1.id from dim_table1 d1 where d1.code = 2008
> {code} 
> and RHS2 return 23881 rows: 
> {code:borderStyle=solid}
>    select d2.id from dim_table2 d2 where d2.region='us'
> {code}  
> then HashJoinPlan uses  KeyRange.intersect(List<KeyRange> , List<KeyRange> ) method to compute RHS1 intersecting RHS2 for dynamic filter, narrowing down fact_table.cust_id should be. 
> Surprisingly,the KeyRange.intersect method costs 11s ! although the whole sql execution only costs 17s.After I read the code of  KeyRange.intersect method,I find following two problem:
> 1. The double loop is inefficient in line 521 and line 522,when keyRanges  size is M, keyRanges2 size is N, the time complexity is O(M*N), for my example,is 5523*23881: 
> {code:borderStyle=solid} 
> 519 public static List<KeyRange> intersect(List<KeyRange> keyRanges,  List<KeyRange> keyRanges2) {
> 520        List<KeyRange> tmp = new ArrayList<KeyRange>();
> 521        for (KeyRange r1 : keyRanges) {
> 522            for (KeyRange r2 : keyRanges2) {
> 523                KeyRange r = r1.intersect(r2);
> 524                if (EMPTY_RANGE != r) {
> 525                    tmp.add(r);
> 526                }
> 527            }
> 528        }
> {code}  
> 2. line 540 shoule be r = r.union(tmp.get(i)), not intersect,just as KeyRange.coalesce method does:
> {code:borderStyle=solid} 
> 532        Collections.sort(tmp, KeyRange.COMPARATOR);
> 533        List<KeyRange> tmp2 = new ArrayList<KeyRange>();
> 534        KeyRange r = tmp.get(0);
> 535        for (int i=1; i<tmp.size(); i++) {
> 536            if (EMPTY_RANGE == r.intersect(tmp.get(i))) {
> 537                tmp2.add(r);
> 538                r = tmp.get(i);
> 539            } else {
> 540                r = r.intersect(tmp.get(i));
> 541            }
> 542        }
> {code}
> and it seems that no unit tests for  KeyRange.intersect(List<KeyRange> , List<KeyRange>) method. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)