You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@drill.apache.org by "Chun Chang (JIRA)" <ji...@apache.org> on 2015/03/31 22:27:52 UTC
[jira] [Closed] (DRILL-1987) join with tons of duplicates hangs
with hash join
[ https://issues.apache.org/jira/browse/DRILL-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chun Chang closed DRILL-1987.
-----------------------------
Assignee: Chun Chang (was: Aman Sinha)
> join with tons of duplicates hangs with hash join
> -------------------------------------------------
>
> Key: DRILL-1987
> URL: https://issues.apache.org/jira/browse/DRILL-1987
> Project: Apache Drill
> Issue Type: Bug
> Components: Execution - Flow
> Affects Versions: 0.8.0
> Reporter: Chun Chang
> Assignee: Chun Chang
>
> #Fri Jan 09 20:39:31 EST 2015
> git.commit.id.abbrev=487d98e
> With hash join enabled (default), the following join query hangs (running for about 30 min now). The join condition has mostly duplicates. Each table has 1 million rows. Data can be downloaded here:
> https://s3.amazonaws.com/apache-drill/files/complex.json.gz
> {code}
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> alter session set `planner.enable_mergejoin` = false;
> +------------+------------+
> | ok | summary |
> +------------+------------+
> | true | planner.enable_mergejoin updated. |
> +------------+------------+
> 1 row selected (0.025 seconds)
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> alter session set `planner.enable_hashjoin` = true;
> +------------+------------+
> | ok | summary |
> +------------+------------+
> | true | planner.enable_hashjoin updated. |
> +------------+------------+
> 1 row selected (0.045 seconds)
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> select a.id, b.gbyi, a.str from `complex.json` a inner join `complex.json` b on a.gbyi=b.gbyi order by a.id limit 20;
> +------------+------------+------------+
> | id | gbyi | str |
> +------------+------------+------------+
> {code}
> physical plan:
> {code}
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> explain plan for select a.id, b.gbyi, a.str from `complex.json` a inner join `complex.json` b on a.gbyi=b.gbyi order by a.id limit 20;
> +------------+------------+
> | text | json |
> +------------+------------+
> | 00-00 Screen
> 00-01 Project(id=[$0], gbyi=[$1], str=[$2])
> 00-02 SelectionVectorRemover
> 00-03 Limit(fetch=[20])
> 00-04 SingleMergeExchange(sort0=[0 ASC])
> 01-01 SelectionVectorRemover
> 01-02 TopN(limit=[20])
> 01-03 HashToRandomExchange(dist0=[[$0]])
> 02-01 Project(id=[$1], gbyi=[$3], str=[$2])
> 02-02 HashJoin(condition=[=($0, $3)], joinType=[inner])
> 02-04 HashToRandomExchange(dist0=[[$0]])
> 03-01 Project(gbyi=[$0], id=[$2], str=[$1])
> 03-02 Scan(groupscan=[EasyGroupScan [selectionRoot=/drill/testdata/complex_type/json/complex.json, numFiles=1, columns=[`gbyi`, `id`, `str`], files=[maprfs:/drill/testdata/complex_type/json/complex.json]]])
> 02-03 Project(gbyi0=[$0])
> 02-05 HashToRandomExchange(dist0=[[$0]])
> 04-01 Scan(groupscan=[EasyGroupScan [selectionRoot=/drill/testdata/complex_type/json/complex.json, numFiles=1, columns=[`gbyi`], files=[maprfs:/drill/testdata/complex_type/json/complex.json]]])
> {code}
> If I turn merge join on, the query finishes rather quickly, like within a minute.
> {code}
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> alter session set `planner.enable_hashjoin` = false;
> +------------+------------+
> | ok | summary |
> +------------+------------+
> | true | planner.enable_hashjoin updated. |
> +------------+------------+
> 1 row selected (0.026 seconds)
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> alter session set `planner.enable_mergejoin` = true;
> +------------+------------+
> | ok | summary |
> +------------+------------+
> | true | planner.enable_mergejoin updated. |
> +------------+------------+
> 1 row selected (0.024 seconds)
> 0: jdbc:drill:schema=dfs.drillTestDirComplexJ> explain plan for select a.id, b.gbyi, a.str from `complex.json` a inner join `complex.json` b on a.gbyi=b.gbyi order by a.id limit 20;
> +------------+------------+
> | text | json |
> +------------+------------+
> | 00-00 Screen
> 00-01 Project(id=[$0], gbyi=[$1], str=[$2])
> 00-02 SelectionVectorRemover
> 00-03 Limit(fetch=[20])
> 00-04 SingleMergeExchange(sort0=[0 ASC])
> 01-01 SelectionVectorRemover
> 01-02 TopN(limit=[20])
> 01-03 HashToRandomExchange(dist0=[[$0]])
> 02-01 Project(id=[$1], gbyi=[$3], str=[$2])
> 02-02 MergeJoin(condition=[=($0, $3)], joinType=[inner])
> 02-04 SelectionVectorRemover
> 02-06 Sort(sort0=[$0], dir0=[ASC])
> 02-08 HashToRandomExchange(dist0=[[$0]])
> 03-01 Project(gbyi=[$0], id=[$2], str=[$1])
> 03-02 Scan(groupscan=[EasyGroupScan [selectionRoot=/drill/testdata/complex_type/json/complex.json, numFiles=1, columns=[`gbyi`, `id`, `str`], files=[maprfs:/drill/testdata/complex_type/json/complex.json]]])
> 02-03 Project(gbyi0=[$0])
> 02-05 SelectionVectorRemover
> 02-07 Sort(sort0=[$0], dir0=[ASC])
> 02-09 HashToRandomExchange(dist0=[[$0]])
> 04-01 Scan(groupscan=[EasyGroupScan [selectionRoot=/drill/testdata/complex_type/json/complex.json, numFiles=1, columns=[`gbyi`], files=[maprfs:/drill/testdata/complex_type/json/complex.json]]])
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)