You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by "Victoria Markman (JIRA)" <ji...@apache.org> on 2015/04/01 20:30:52 UTC
[jira] [Created] (DRILL-2655) Hash join is chosen when join columns are sorted in the same order

Victoria Markman created DRILL-2655:
---------------------------------------

             Summary: Hash join is chosen when join columns are sorted in the same order
                 Key: DRILL-2655
                 URL: https://issues.apache.org/jira/browse/DRILL-2655
             Project: Apache Drill
          Issue Type: Bug
          Components: Query Planning & Optimization
    Affects Versions: 0.8.0
            Reporter: Victoria Markman
            Assignee: Jinfeng Ni
         Attachments: t1_parquet, t2_parquet

We should either drop the sort or plan merge join for this case (if execution can support this)

Both subqueries are sorted on a join column in ASC (implicitly) order NULLS LAST.
{code}
0: jdbc:drill:schema=dfs> explain plan for select  *
. . . . . . . . . . . . >                  from
. . . . . . . . . . . . >                  (select a1,b1,c1 from t1 order by a1 nulls last) as sq1
. . . . . . . . . . . . >                  inner join
. . . . . . . . . . . . >                  (select a2,b2,c2 from t2 order by a2 nulls last) as sq2
. . . . . . . . . . . . > on (sq1.a1 = sq2.a2)
. . . . . . . . . . . . > ;
+------------+------------+
|    text    |    json    |
+------------+------------+
| 00-00    Screen
00-01      Project(a1=[$0], b1=[$1], c1=[$2], a2=[$3], b2=[$4], c2=[$5])
00-02        HashJoin(condition=[=($0, $3)], joinType=[inner])
00-04          SelectionVectorRemover
00-06            Sort(sort0=[$0], dir0=[ASC-nulls-last])
00-08              Project(a1=[$2], b1=[$1], c1=[$0])
00-10                Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/test/t1]], selectionRoot=/test/t1, numFiles=1, columns=[`a1`, `b1`, `c1`]]])
00-03          SelectionVectorRemover
00-05            Sort(sort0=[$0], dir0=[ASC-nulls-last])
00-07              SelectionVectorRemover
00-09                Sort(sort0=[$0], dir0=[ASC-nulls-last])
00-11                  Project(a2=[$1], b2=[$0], c2=[$2])
00-12                    Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/test/t2]], selectionRoot=/test/t2, numFiles=1, columns=[`a2`, `b2`, `c2`]]])
{code}

Both subqueries are sorted on a join column in ASC (explicitly) order NULLS LAST.
{code}
0: jdbc:drill:schema=dfs> explain plan for select  *
. . . . . . . . . . . . > from
. . . . . . . . . . . . >         (select a1,b1,c1 from t1 order by a1 asc nulls last) as sq1
. . . . . . . . . . . . >         inner join
. . . . . . . . . . . . >         (select a2,b2,c2 from t2 order by a2 asc nulls last) as sq2
. . . . . . . . . . . . >         on (sq1.a1 = sq2.a2)
. . . . . . . . . . . . > ;
+------------+------------+
|    text    |    json    |
+------------+------------+
| 00-00    Screen
00-01      Project(a1=[$0], b1=[$1], c1=[$2], a2=[$3], b2=[$4], c2=[$5])
00-02        HashJoin(condition=[=($0, $3)], joinType=[inner])
00-04          SelectionVectorRemover
00-06            Sort(sort0=[$0], dir0=[ASC-nulls-last])
00-08              Project(a1=[$2], b1=[$1], c1=[$0])
00-10                Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/test/t1]], selectionRoot=/test/t1, numFiles=1, columns=[`a1`, `b1`, `c1`]]])
00-03          SelectionVectorRemover
00-05            Sort(sort0=[$0], dir0=[ASC-nulls-last])
00-07              SelectionVectorRemover
00-09                Sort(sort0=[$0], dir0=[ASC-nulls-last])
00-11                  Project(a2=[$1], b2=[$0], c2=[$2])
00-12                    Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/test/t2]], selectionRoot=/test/t2, numFiles=1, columns=[`a2`, `b2`, `c2`]]])
{code}

When both subqueries are sorted in ASC order implicitly or explicitly without mentioning NULLS FIRST/NULLS LAST merge join is planned.

Implicit ascending order:
{code}
0: jdbc:drill:schema=dfs> explain plan for select  *
. . . . . . . . . . . . > from
. . . . . . . . . . . . >         (select a1,b1,c1 from t1 order by a1 ) as sq1
. . . . . . . . . . . . >         inner join
. . . . . . . . . . . . >         (select a2,b2,c2 from t2 order by a2 ) as sq2
. . . . . . . . . . . . >         on (sq1.a1 = sq2.a2)
. . . . . . . . . . . . > ;
+------------+------------+
|    text    |    json    |
+------------+------------+
| 00-00    Screen
00-01      Project(a1=[$0], b1=[$1], c1=[$2], a2=[$3], b2=[$4], c2=[$5])
00-02        MergeJoin(condition=[=($0, $3)], joinType=[inner])
00-04          SelectionVectorRemover
00-06            Sort(sort0=[$0], dir0=[ASC])
00-08              Project(a1=[$2], b1=[$1], c1=[$0])
00-10                Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/test/t1]], selectionRoot=/test/t1, numFiles=1, columns=[`a1`, `b1`, `c1`]]])
00-03          SelectionVectorRemover
00-05            Sort(sort0=[$0], dir0=[ASC])
00-07              SelectionVectorRemover
00-09                Sort(sort0=[$0], dir0=[ASC])
00-11                  Project(a2=[$1], b2=[$0], c2=[$2])
00-12                    Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/test/t2]], selectionRoot=/test/t2, numFiles=1, columns=[`a2`, `b2`, `c2`]]])
{code}

Explicit asc order:
{code}
0: jdbc:drill:schema=dfs> explain plan for select  *
. . . . . . . . . . . . > from
. . . . . . . . . . . . >         (select a1,b1,c1 from t1 order by a1 asc) as sq1
. . . . . . . . . . . . >         inner join
. . . . . . . . . . . . >         (select a2,b2,c2 from t2 order by a2 asc) as sq2
. . . . . . . . . . . . >         on (sq1.a1 = sq2.a2)
. . . . . . . . . . . . > ;
+------------+------------+
|    text    |    json    |
+------------+------------+
| 00-00    Screen
00-01      Project(a1=[$0], b1=[$1], c1=[$2], a2=[$3], b2=[$4], c2=[$5])
00-02        MergeJoin(condition=[=($0, $3)], joinType=[inner])
00-04          SelectionVectorRemover
00-06            Sort(sort0=[$0], dir0=[ASC])
00-08              Project(a1=[$2], b1=[$1], c1=[$0])
00-10                Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/test/t1]], selectionRoot=/test/t1, numFiles=1, columns=[`a1`, `b1`, `c1`]]])
00-03          SelectionVectorRemover
00-05            Sort(sort0=[$0], dir0=[ASC])
00-07              SelectionVectorRemover
00-09                Sort(sort0=[$0], dir0=[ASC])
00-11                  Project(a2=[$1], b2=[$0], c2=[$2])
00-12                    Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/test/t2]], selectionRoot=/test/t2, numFiles=1, columns=[`a2`, `b2`, `c2`]]])
{code}

And the last one: when left side is sorted in ascending order and right side in desc, we plan merge join (result seems to be correct on a small data set)
I will add more tests for this particular case to verify results with duplicates and nulls.
{code}
0: jdbc:drill:schema=dfs> explain plan for select  *
. . . . . . . . . . . . > from
. . . . . . . . . . . . >         (select a1,b1,c1 from t1 order by a1 asc) as sq1
. . . . . . . . . . . . >         inner join
. . . . . . . . . . . . >         (select a2,b2,c2 from t2 order by a2 desc) as sq2
. . . . . . . . . . . . >         on (sq1.a1 = sq2.a2)
. . . . . . . . . . . . > ;
+------------+------------+
|    text    |    json    |
+------------+------------+
| 00-00    Screen
00-01      Project(a1=[$0], b1=[$1], c1=[$2], a2=[$3], b2=[$4], c2=[$5])
00-02        MergeJoin(condition=[=($0, $3)], joinType=[inner])
00-04          SelectionVectorRemover
00-06            Sort(sort0=[$0], dir0=[ASC])
00-08              Project(a1=[$2], b1=[$1], c1=[$0])
00-10                Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/test/t1]], selectionRoot=/test/t1, numFiles=1, columns=[`a1`, `b1`, `c1`]]])
00-03          SelectionVectorRemover
00-05            Sort(sort0=[$0], dir0=[ASC])
00-07              SelectionVectorRemover
00-09                Sort(sort0=[$0], dir0=[DESC])
00-11                  Project(a2=[$1], b2=[$0], c2=[$2])
00-12                    Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/test/t2]], selectionRoot=/test/t2, numFiles=1, columns=[`a2`, `b2`, `c2`]]])
{code}

I also don't understand why I'm seeing two sorts on the right side.
It would be ok if second sort did not do anything, because there was no predicate ...

{code}
00-05            Sort(sort0=[$0], dir0=[ASC])
00-07              SelectionVectorRemover
00-09                Sort(sort0=[$0], dir0=[ASC])
{code}

cut/paste query:
{code}
select  *
from
        (select a1,b1,c1 from t1 order by a1 nulls last) as sq1
        inner join
        (select a2,b2,c2 from t2 order by a2 nulls last) as sq2
        on (sq1.a1 = sq2.a2)
;
{code}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)