You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@drill.apache.org by "Victoria Markman (JIRA)" <ji...@apache.org> on 2015/04/01 20:30:53 UTC

[jira] [Updated] (DRILL-2655) Hash join is chosen when join columns are sorted in the same order

     [ https://issues.apache.org/jira/browse/DRILL-2655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Victoria Markman updated DRILL-2655:
------------------------------------
    Attachment: t2_parquet

> Hash join is chosen when join columns are sorted in the same order
> ------------------------------------------------------------------
>
>                 Key: DRILL-2655
>                 URL: https://issues.apache.org/jira/browse/DRILL-2655
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning & Optimization
>    Affects Versions: 0.8.0
>            Reporter: Victoria Markman
>            Assignee: Jinfeng Ni
>         Attachments: t1_parquet, t2_parquet
>
>
> We should either drop the sort or plan merge join for this case (if execution can support this)
> Both subqueries are sorted on a join column in ASC (implicitly) order NULLS LAST.
> {code}
> 0: jdbc:drill:schema=dfs> explain plan for select  *
> . . . . . . . . . . . . >                  from
> . . . . . . . . . . . . >                  (select a1,b1,c1 from t1 order by a1 nulls last) as sq1
> . . . . . . . . . . . . >                  inner join
> . . . . . . . . . . . . >                  (select a2,b2,c2 from t2 order by a2 nulls last) as sq2
> . . . . . . . . . . . . > on (sq1.a1 = sq2.a2)
> . . . . . . . . . . . . > ;
> +------------+------------+
> |    text    |    json    |
> +------------+------------+
> | 00-00    Screen
> 00-01      Project(a1=[$0], b1=[$1], c1=[$2], a2=[$3], b2=[$4], c2=[$5])
> 00-02        HashJoin(condition=[=($0, $3)], joinType=[inner])
> 00-04          SelectionVectorRemover
> 00-06            Sort(sort0=[$0], dir0=[ASC-nulls-last])
> 00-08              Project(a1=[$2], b1=[$1], c1=[$0])
> 00-10                Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/test/t1]], selectionRoot=/test/t1, numFiles=1, columns=[`a1`, `b1`, `c1`]]])
> 00-03          SelectionVectorRemover
> 00-05            Sort(sort0=[$0], dir0=[ASC-nulls-last])
> 00-07              SelectionVectorRemover
> 00-09                Sort(sort0=[$0], dir0=[ASC-nulls-last])
> 00-11                  Project(a2=[$1], b2=[$0], c2=[$2])
> 00-12                    Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/test/t2]], selectionRoot=/test/t2, numFiles=1, columns=[`a2`, `b2`, `c2`]]])
> {code}
> Both subqueries are sorted on a join column in ASC (explicitly) order NULLS LAST.
> {code}
> 0: jdbc:drill:schema=dfs> explain plan for select  *
> . . . . . . . . . . . . > from
> . . . . . . . . . . . . >         (select a1,b1,c1 from t1 order by a1 asc nulls last) as sq1
> . . . . . . . . . . . . >         inner join
> . . . . . . . . . . . . >         (select a2,b2,c2 from t2 order by a2 asc nulls last) as sq2
> . . . . . . . . . . . . >         on (sq1.a1 = sq2.a2)
> . . . . . . . . . . . . > ;
> +------------+------------+
> |    text    |    json    |
> +------------+------------+
> | 00-00    Screen
> 00-01      Project(a1=[$0], b1=[$1], c1=[$2], a2=[$3], b2=[$4], c2=[$5])
> 00-02        HashJoin(condition=[=($0, $3)], joinType=[inner])
> 00-04          SelectionVectorRemover
> 00-06            Sort(sort0=[$0], dir0=[ASC-nulls-last])
> 00-08              Project(a1=[$2], b1=[$1], c1=[$0])
> 00-10                Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/test/t1]], selectionRoot=/test/t1, numFiles=1, columns=[`a1`, `b1`, `c1`]]])
> 00-03          SelectionVectorRemover
> 00-05            Sort(sort0=[$0], dir0=[ASC-nulls-last])
> 00-07              SelectionVectorRemover
> 00-09                Sort(sort0=[$0], dir0=[ASC-nulls-last])
> 00-11                  Project(a2=[$1], b2=[$0], c2=[$2])
> 00-12                    Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/test/t2]], selectionRoot=/test/t2, numFiles=1, columns=[`a2`, `b2`, `c2`]]])
> {code}
> When both subqueries are sorted in ASC order implicitly or explicitly without mentioning NULLS FIRST/NULLS LAST merge join is planned.
> Implicit ascending order:
> {code}
> 0: jdbc:drill:schema=dfs> explain plan for select  *
> . . . . . . . . . . . . > from
> . . . . . . . . . . . . >         (select a1,b1,c1 from t1 order by a1 ) as sq1
> . . . . . . . . . . . . >         inner join
> . . . . . . . . . . . . >         (select a2,b2,c2 from t2 order by a2 ) as sq2
> . . . . . . . . . . . . >         on (sq1.a1 = sq2.a2)
> . . . . . . . . . . . . > ;
> +------------+------------+
> |    text    |    json    |
> +------------+------------+
> | 00-00    Screen
> 00-01      Project(a1=[$0], b1=[$1], c1=[$2], a2=[$3], b2=[$4], c2=[$5])
> 00-02        MergeJoin(condition=[=($0, $3)], joinType=[inner])
> 00-04          SelectionVectorRemover
> 00-06            Sort(sort0=[$0], dir0=[ASC])
> 00-08              Project(a1=[$2], b1=[$1], c1=[$0])
> 00-10                Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/test/t1]], selectionRoot=/test/t1, numFiles=1, columns=[`a1`, `b1`, `c1`]]])
> 00-03          SelectionVectorRemover
> 00-05            Sort(sort0=[$0], dir0=[ASC])
> 00-07              SelectionVectorRemover
> 00-09                Sort(sort0=[$0], dir0=[ASC])
> 00-11                  Project(a2=[$1], b2=[$0], c2=[$2])
> 00-12                    Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/test/t2]], selectionRoot=/test/t2, numFiles=1, columns=[`a2`, `b2`, `c2`]]])
> {code}
> Explicit asc order:
> {code}
> 0: jdbc:drill:schema=dfs> explain plan for select  *
> . . . . . . . . . . . . > from
> . . . . . . . . . . . . >         (select a1,b1,c1 from t1 order by a1 asc) as sq1
> . . . . . . . . . . . . >         inner join
> . . . . . . . . . . . . >         (select a2,b2,c2 from t2 order by a2 asc) as sq2
> . . . . . . . . . . . . >         on (sq1.a1 = sq2.a2)
> . . . . . . . . . . . . > ;
> +------------+------------+
> |    text    |    json    |
> +------------+------------+
> | 00-00    Screen
> 00-01      Project(a1=[$0], b1=[$1], c1=[$2], a2=[$3], b2=[$4], c2=[$5])
> 00-02        MergeJoin(condition=[=($0, $3)], joinType=[inner])
> 00-04          SelectionVectorRemover
> 00-06            Sort(sort0=[$0], dir0=[ASC])
> 00-08              Project(a1=[$2], b1=[$1], c1=[$0])
> 00-10                Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/test/t1]], selectionRoot=/test/t1, numFiles=1, columns=[`a1`, `b1`, `c1`]]])
> 00-03          SelectionVectorRemover
> 00-05            Sort(sort0=[$0], dir0=[ASC])
> 00-07              SelectionVectorRemover
> 00-09                Sort(sort0=[$0], dir0=[ASC])
> 00-11                  Project(a2=[$1], b2=[$0], c2=[$2])
> 00-12                    Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/test/t2]], selectionRoot=/test/t2, numFiles=1, columns=[`a2`, `b2`, `c2`]]])
> {code}
> And the last one: when left side is sorted in ascending order and right side in desc, we plan merge join (result seems to be correct on a small data set)
> I will add more tests for this particular case to verify results with duplicates and nulls.
> {code}
> 0: jdbc:drill:schema=dfs> explain plan for select  *
> . . . . . . . . . . . . > from
> . . . . . . . . . . . . >         (select a1,b1,c1 from t1 order by a1 asc) as sq1
> . . . . . . . . . . . . >         inner join
> . . . . . . . . . . . . >         (select a2,b2,c2 from t2 order by a2 desc) as sq2
> . . . . . . . . . . . . >         on (sq1.a1 = sq2.a2)
> . . . . . . . . . . . . > ;
> +------------+------------+
> |    text    |    json    |
> +------------+------------+
> | 00-00    Screen
> 00-01      Project(a1=[$0], b1=[$1], c1=[$2], a2=[$3], b2=[$4], c2=[$5])
> 00-02        MergeJoin(condition=[=($0, $3)], joinType=[inner])
> 00-04          SelectionVectorRemover
> 00-06            Sort(sort0=[$0], dir0=[ASC])
> 00-08              Project(a1=[$2], b1=[$1], c1=[$0])
> 00-10                Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/test/t1]], selectionRoot=/test/t1, numFiles=1, columns=[`a1`, `b1`, `c1`]]])
> 00-03          SelectionVectorRemover
> 00-05            Sort(sort0=[$0], dir0=[ASC])
> 00-07              SelectionVectorRemover
> 00-09                Sort(sort0=[$0], dir0=[DESC])
> 00-11                  Project(a2=[$1], b2=[$0], c2=[$2])
> 00-12                    Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:/test/t2]], selectionRoot=/test/t2, numFiles=1, columns=[`a2`, `b2`, `c2`]]])
> {code}
> I also don't understand why I'm seeing two sorts on the right side.
> It would be ok if second sort did not do anything, because there was no predicate ...
> {code}
> 00-05            Sort(sort0=[$0], dir0=[ASC])
> 00-07              SelectionVectorRemover
> 00-09                Sort(sort0=[$0], dir0=[ASC])
> {code}
> cut/paste query:
> {code}
> select  *
> from
>         (select a1,b1,c1 from t1 order by a1 nulls last) as sq1
>         inner join
>         (select a2,b2,c2 from t2 order by a2 nulls last) as sq2
>         on (sq1.a1 = sq2.a2)
> ;
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)