You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@drill.apache.org by "Victoria Markman (JIRA)" <ji...@apache.org> on 2015/04/01 20:30:53 UTC
[jira] [Updated] (DRILL-2655) Hash join is chosen when join columns
are sorted in the same order
[ https://issues.apache.org/jira/browse/DRILL-2655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Victoria Markman updated DRILL-2655:
------------------------------------
Attachment: t2_parquet
> Hash join is chosen when join columns are sorted in the same order
> ------------------------------------------------------------------
>
> Key: DRILL-2655
> URL: https://issues.apache.org/jira/browse/DRILL-2655
> Project: Apache Drill
> Issue Type: Bug
> Components: Query Planning & Optimization
> Affects Versions: 0.8.0
> Reporter: Victoria Markman
> Assignee: Jinfeng Ni
> Attachments: t1_parquet, t2_parquet
>
>
> We should either drop the sort or plan merge join for this case (if execution can support this)
> Both subqueries are sorted on a join column in ASC (implicitly) order NULLS LAST.
> {code}
> 0: jdbc:drill:schema=dfs> explain plan for select *
> . . . . . . . . . . . . > from
> . . . . . . . . . . . . > (select a1,b1,c1 from t1 order by a1 nulls last) as sq1
> . . . . . . . . . . . . > inner join
> . . . . . . . . . . . . > (select a2,b2,c2 from t2 order by a2 nulls last) as sq2
> . . . . . . . . . . . . > on (sq1.a1 = sq2.a2)
> . . . . . . . . . . . . > ;
> +------------+------------+
> | text | json |
> +------------+------------+
> | 00-00 Screen
> 00-01 Project(a1=[$0], b1=[$1], c1=[$2], a2=[$3], b2=[$4], c2=[$5])
> 00-02 HashJoin(condition=[=($0, $3)], joinType=[inner])
> 00-04 SelectionVectorRemover
> 00-06 Sort(sort0=[$0], dir0=[ASC-nulls-last])
> 00-08 Project(a1=[$2], b1=[$1], c1=[$0])
> 00-10 Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/test/t1]], selectionRoot=/test/t1, numFiles=1, columns=[`a1`, `b1`, `c1`]]])
> 00-03 SelectionVectorRemover
> 00-05 Sort(sort0=[$0], dir0=[ASC-nulls-last])
> 00-07 SelectionVectorRemover
> 00-09 Sort(sort0=[$0], dir0=[ASC-nulls-last])
> 00-11 Project(a2=[$1], b2=[$0], c2=[$2])
> 00-12 Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/test/t2]], selectionRoot=/test/t2, numFiles=1, columns=[`a2`, `b2`, `c2`]]])
> {code}
> Both subqueries are sorted on a join column in ASC (explicitly) order NULLS LAST.
> {code}
> 0: jdbc:drill:schema=dfs> explain plan for select *
> . . . . . . . . . . . . > from
> . . . . . . . . . . . . > (select a1,b1,c1 from t1 order by a1 asc nulls last) as sq1
> . . . . . . . . . . . . > inner join
> . . . . . . . . . . . . > (select a2,b2,c2 from t2 order by a2 asc nulls last) as sq2
> . . . . . . . . . . . . > on (sq1.a1 = sq2.a2)
> . . . . . . . . . . . . > ;
> +------------+------------+
> | text | json |
> +------------+------------+
> | 00-00 Screen
> 00-01 Project(a1=[$0], b1=[$1], c1=[$2], a2=[$3], b2=[$4], c2=[$5])
> 00-02 HashJoin(condition=[=($0, $3)], joinType=[inner])
> 00-04 SelectionVectorRemover
> 00-06 Sort(sort0=[$0], dir0=[ASC-nulls-last])
> 00-08 Project(a1=[$2], b1=[$1], c1=[$0])
> 00-10 Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/test/t1]], selectionRoot=/test/t1, numFiles=1, columns=[`a1`, `b1`, `c1`]]])
> 00-03 SelectionVectorRemover
> 00-05 Sort(sort0=[$0], dir0=[ASC-nulls-last])
> 00-07 SelectionVectorRemover
> 00-09 Sort(sort0=[$0], dir0=[ASC-nulls-last])
> 00-11 Project(a2=[$1], b2=[$0], c2=[$2])
> 00-12 Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/test/t2]], selectionRoot=/test/t2, numFiles=1, columns=[`a2`, `b2`, `c2`]]])
> {code}
> When both subqueries are sorted in ASC order implicitly or explicitly without mentioning NULLS FIRST/NULLS LAST merge join is planned.
> Implicit ascending order:
> {code}
> 0: jdbc:drill:schema=dfs> explain plan for select *
> . . . . . . . . . . . . > from
> . . . . . . . . . . . . > (select a1,b1,c1 from t1 order by a1 ) as sq1
> . . . . . . . . . . . . > inner join
> . . . . . . . . . . . . > (select a2,b2,c2 from t2 order by a2 ) as sq2
> . . . . . . . . . . . . > on (sq1.a1 = sq2.a2)
> . . . . . . . . . . . . > ;
> +------------+------------+
> | text | json |
> +------------+------------+
> | 00-00 Screen
> 00-01 Project(a1=[$0], b1=[$1], c1=[$2], a2=[$3], b2=[$4], c2=[$5])
> 00-02 MergeJoin(condition=[=($0, $3)], joinType=[inner])
> 00-04 SelectionVectorRemover
> 00-06 Sort(sort0=[$0], dir0=[ASC])
> 00-08 Project(a1=[$2], b1=[$1], c1=[$0])
> 00-10 Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/test/t1]], selectionRoot=/test/t1, numFiles=1, columns=[`a1`, `b1`, `c1`]]])
> 00-03 SelectionVectorRemover
> 00-05 Sort(sort0=[$0], dir0=[ASC])
> 00-07 SelectionVectorRemover
> 00-09 Sort(sort0=[$0], dir0=[ASC])
> 00-11 Project(a2=[$1], b2=[$0], c2=[$2])
> 00-12 Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/test/t2]], selectionRoot=/test/t2, numFiles=1, columns=[`a2`, `b2`, `c2`]]])
> {code}
> Explicit asc order:
> {code}
> 0: jdbc:drill:schema=dfs> explain plan for select *
> . . . . . . . . . . . . > from
> . . . . . . . . . . . . > (select a1,b1,c1 from t1 order by a1 asc) as sq1
> . . . . . . . . . . . . > inner join
> . . . . . . . . . . . . > (select a2,b2,c2 from t2 order by a2 asc) as sq2
> . . . . . . . . . . . . > on (sq1.a1 = sq2.a2)
> . . . . . . . . . . . . > ;
> +------------+------------+
> | text | json |
> +------------+------------+
> | 00-00 Screen
> 00-01 Project(a1=[$0], b1=[$1], c1=[$2], a2=[$3], b2=[$4], c2=[$5])
> 00-02 MergeJoin(condition=[=($0, $3)], joinType=[inner])
> 00-04 SelectionVectorRemover
> 00-06 Sort(sort0=[$0], dir0=[ASC])
> 00-08 Project(a1=[$2], b1=[$1], c1=[$0])
> 00-10 Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/test/t1]], selectionRoot=/test/t1, numFiles=1, columns=[`a1`, `b1`, `c1`]]])
> 00-03 SelectionVectorRemover
> 00-05 Sort(sort0=[$0], dir0=[ASC])
> 00-07 SelectionVectorRemover
> 00-09 Sort(sort0=[$0], dir0=[ASC])
> 00-11 Project(a2=[$1], b2=[$0], c2=[$2])
> 00-12 Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/test/t2]], selectionRoot=/test/t2, numFiles=1, columns=[`a2`, `b2`, `c2`]]])
> {code}
> And the last one: when left side is sorted in ascending order and right side in desc, we plan merge join (result seems to be correct on a small data set)
> I will add more tests for this particular case to verify results with duplicates and nulls.
> {code}
> 0: jdbc:drill:schema=dfs> explain plan for select *
> . . . . . . . . . . . . > from
> . . . . . . . . . . . . > (select a1,b1,c1 from t1 order by a1 asc) as sq1
> . . . . . . . . . . . . > inner join
> . . . . . . . . . . . . > (select a2,b2,c2 from t2 order by a2 desc) as sq2
> . . . . . . . . . . . . > on (sq1.a1 = sq2.a2)
> . . . . . . . . . . . . > ;
> +------------+------------+
> | text | json |
> +------------+------------+
> | 00-00 Screen
> 00-01 Project(a1=[$0], b1=[$1], c1=[$2], a2=[$3], b2=[$4], c2=[$5])
> 00-02 MergeJoin(condition=[=($0, $3)], joinType=[inner])
> 00-04 SelectionVectorRemover
> 00-06 Sort(sort0=[$0], dir0=[ASC])
> 00-08 Project(a1=[$2], b1=[$1], c1=[$0])
> 00-10 Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/test/t1]], selectionRoot=/test/t1, numFiles=1, columns=[`a1`, `b1`, `c1`]]])
> 00-03 SelectionVectorRemover
> 00-05 Sort(sort0=[$0], dir0=[ASC])
> 00-07 SelectionVectorRemover
> 00-09 Sort(sort0=[$0], dir0=[DESC])
> 00-11 Project(a2=[$1], b2=[$0], c2=[$2])
> 00-12 Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/test/t2]], selectionRoot=/test/t2, numFiles=1, columns=[`a2`, `b2`, `c2`]]])
> {code}
> I also don't understand why I'm seeing two sorts on the right side.
> It would be ok if second sort did not do anything, because there was no predicate ...
> {code}
> 00-05 Sort(sort0=[$0], dir0=[ASC])
> 00-07 SelectionVectorRemover
> 00-09 Sort(sort0=[$0], dir0=[ASC])
> {code}
> cut/paste query:
> {code}
> select *
> from
> (select a1,b1,c1 from t1 order by a1 nulls last) as sq1
> inner join
> (select a2,b2,c2 from t2 order by a2 nulls last) as sq2
> on (sq1.a1 = sq2.a2)
> ;
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)