You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@drill.apache.org by "Victoria Markman (JIRA)" <ji...@apache.org> on 2015/04/01 20:30:52 UTC
[jira] [Created] (DRILL-2655) Hash join is chosen when join columns
are sorted in the same order
Victoria Markman created DRILL-2655:
---------------------------------------
Summary: Hash join is chosen when join columns are sorted in the same order
Key: DRILL-2655
URL: https://issues.apache.org/jira/browse/DRILL-2655
Project: Apache Drill
Issue Type: Bug
Components: Query Planning & Optimization
Affects Versions: 0.8.0
Reporter: Victoria Markman
Assignee: Jinfeng Ni
Attachments: t1_parquet, t2_parquet
We should either drop the sort or plan merge join for this case (if execution can support this)
Both subqueries are sorted on a join column in ASC (implicitly) order NULLS LAST.
{code}
0: jdbc:drill:schema=dfs> explain plan for select *
. . . . . . . . . . . . > from
. . . . . . . . . . . . > (select a1,b1,c1 from t1 order by a1 nulls last) as sq1
. . . . . . . . . . . . > inner join
. . . . . . . . . . . . > (select a2,b2,c2 from t2 order by a2 nulls last) as sq2
. . . . . . . . . . . . > on (sq1.a1 = sq2.a2)
. . . . . . . . . . . . > ;
+------------+------------+
| text | json |
+------------+------------+
| 00-00 Screen
00-01 Project(a1=[$0], b1=[$1], c1=[$2], a2=[$3], b2=[$4], c2=[$5])
00-02 HashJoin(condition=[=($0, $3)], joinType=[inner])
00-04 SelectionVectorRemover
00-06 Sort(sort0=[$0], dir0=[ASC-nulls-last])
00-08 Project(a1=[$2], b1=[$1], c1=[$0])
00-10 Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/test/t1]], selectionRoot=/test/t1, numFiles=1, columns=[`a1`, `b1`, `c1`]]])
00-03 SelectionVectorRemover
00-05 Sort(sort0=[$0], dir0=[ASC-nulls-last])
00-07 SelectionVectorRemover
00-09 Sort(sort0=[$0], dir0=[ASC-nulls-last])
00-11 Project(a2=[$1], b2=[$0], c2=[$2])
00-12 Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/test/t2]], selectionRoot=/test/t2, numFiles=1, columns=[`a2`, `b2`, `c2`]]])
{code}
Both subqueries are sorted on a join column in ASC (explicitly) order NULLS LAST.
{code}
0: jdbc:drill:schema=dfs> explain plan for select *
. . . . . . . . . . . . > from
. . . . . . . . . . . . > (select a1,b1,c1 from t1 order by a1 asc nulls last) as sq1
. . . . . . . . . . . . > inner join
. . . . . . . . . . . . > (select a2,b2,c2 from t2 order by a2 asc nulls last) as sq2
. . . . . . . . . . . . > on (sq1.a1 = sq2.a2)
. . . . . . . . . . . . > ;
+------------+------------+
| text | json |
+------------+------------+
| 00-00 Screen
00-01 Project(a1=[$0], b1=[$1], c1=[$2], a2=[$3], b2=[$4], c2=[$5])
00-02 HashJoin(condition=[=($0, $3)], joinType=[inner])
00-04 SelectionVectorRemover
00-06 Sort(sort0=[$0], dir0=[ASC-nulls-last])
00-08 Project(a1=[$2], b1=[$1], c1=[$0])
00-10 Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/test/t1]], selectionRoot=/test/t1, numFiles=1, columns=[`a1`, `b1`, `c1`]]])
00-03 SelectionVectorRemover
00-05 Sort(sort0=[$0], dir0=[ASC-nulls-last])
00-07 SelectionVectorRemover
00-09 Sort(sort0=[$0], dir0=[ASC-nulls-last])
00-11 Project(a2=[$1], b2=[$0], c2=[$2])
00-12 Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/test/t2]], selectionRoot=/test/t2, numFiles=1, columns=[`a2`, `b2`, `c2`]]])
{code}
When both subqueries are sorted in ASC order implicitly or explicitly without mentioning NULLS FIRST/NULLS LAST merge join is planned.
Implicit ascending order:
{code}
0: jdbc:drill:schema=dfs> explain plan for select *
. . . . . . . . . . . . > from
. . . . . . . . . . . . > (select a1,b1,c1 from t1 order by a1 ) as sq1
. . . . . . . . . . . . > inner join
. . . . . . . . . . . . > (select a2,b2,c2 from t2 order by a2 ) as sq2
. . . . . . . . . . . . > on (sq1.a1 = sq2.a2)
. . . . . . . . . . . . > ;
+------------+------------+
| text | json |
+------------+------------+
| 00-00 Screen
00-01 Project(a1=[$0], b1=[$1], c1=[$2], a2=[$3], b2=[$4], c2=[$5])
00-02 MergeJoin(condition=[=($0, $3)], joinType=[inner])
00-04 SelectionVectorRemover
00-06 Sort(sort0=[$0], dir0=[ASC])
00-08 Project(a1=[$2], b1=[$1], c1=[$0])
00-10 Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/test/t1]], selectionRoot=/test/t1, numFiles=1, columns=[`a1`, `b1`, `c1`]]])
00-03 SelectionVectorRemover
00-05 Sort(sort0=[$0], dir0=[ASC])
00-07 SelectionVectorRemover
00-09 Sort(sort0=[$0], dir0=[ASC])
00-11 Project(a2=[$1], b2=[$0], c2=[$2])
00-12 Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/test/t2]], selectionRoot=/test/t2, numFiles=1, columns=[`a2`, `b2`, `c2`]]])
{code}
Explicit asc order:
{code}
0: jdbc:drill:schema=dfs> explain plan for select *
. . . . . . . . . . . . > from
. . . . . . . . . . . . > (select a1,b1,c1 from t1 order by a1 asc) as sq1
. . . . . . . . . . . . > inner join
. . . . . . . . . . . . > (select a2,b2,c2 from t2 order by a2 asc) as sq2
. . . . . . . . . . . . > on (sq1.a1 = sq2.a2)
. . . . . . . . . . . . > ;
+------------+------------+
| text | json |
+------------+------------+
| 00-00 Screen
00-01 Project(a1=[$0], b1=[$1], c1=[$2], a2=[$3], b2=[$4], c2=[$5])
00-02 MergeJoin(condition=[=($0, $3)], joinType=[inner])
00-04 SelectionVectorRemover
00-06 Sort(sort0=[$0], dir0=[ASC])
00-08 Project(a1=[$2], b1=[$1], c1=[$0])
00-10 Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/test/t1]], selectionRoot=/test/t1, numFiles=1, columns=[`a1`, `b1`, `c1`]]])
00-03 SelectionVectorRemover
00-05 Sort(sort0=[$0], dir0=[ASC])
00-07 SelectionVectorRemover
00-09 Sort(sort0=[$0], dir0=[ASC])
00-11 Project(a2=[$1], b2=[$0], c2=[$2])
00-12 Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/test/t2]], selectionRoot=/test/t2, numFiles=1, columns=[`a2`, `b2`, `c2`]]])
{code}
And the last one: when left side is sorted in ascending order and right side in desc, we plan merge join (result seems to be correct on a small data set)
I will add more tests for this particular case to verify results with duplicates and nulls.
{code}
0: jdbc:drill:schema=dfs> explain plan for select *
. . . . . . . . . . . . > from
. . . . . . . . . . . . > (select a1,b1,c1 from t1 order by a1 asc) as sq1
. . . . . . . . . . . . > inner join
. . . . . . . . . . . . > (select a2,b2,c2 from t2 order by a2 desc) as sq2
. . . . . . . . . . . . > on (sq1.a1 = sq2.a2)
. . . . . . . . . . . . > ;
+------------+------------+
| text | json |
+------------+------------+
| 00-00 Screen
00-01 Project(a1=[$0], b1=[$1], c1=[$2], a2=[$3], b2=[$4], c2=[$5])
00-02 MergeJoin(condition=[=($0, $3)], joinType=[inner])
00-04 SelectionVectorRemover
00-06 Sort(sort0=[$0], dir0=[ASC])
00-08 Project(a1=[$2], b1=[$1], c1=[$0])
00-10 Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/test/t1]], selectionRoot=/test/t1, numFiles=1, columns=[`a1`, `b1`, `c1`]]])
00-03 SelectionVectorRemover
00-05 Sort(sort0=[$0], dir0=[ASC])
00-07 SelectionVectorRemover
00-09 Sort(sort0=[$0], dir0=[DESC])
00-11 Project(a2=[$1], b2=[$0], c2=[$2])
00-12 Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/test/t2]], selectionRoot=/test/t2, numFiles=1, columns=[`a2`, `b2`, `c2`]]])
{code}
I also don't understand why I'm seeing two sorts on the right side.
It would be ok if second sort did not do anything, because there was no predicate ...
{code}
00-05 Sort(sort0=[$0], dir0=[ASC])
00-07 SelectionVectorRemover
00-09 Sort(sort0=[$0], dir0=[ASC])
{code}
cut/paste query:
{code}
select *
from
(select a1,b1,c1 from t1 order by a1 nulls last) as sq1
inner join
(select a2,b2,c2 from t2 order by a2 nulls last) as sq2
on (sq1.a1 = sq2.a2)
;
{code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)