Posted to issues@drill.apache.org by "Aman Sinha (JIRA)" <ji...@apache.org> on 2014/12/08 19:51:16 UTC
[jira] [Commented] (DRILL-1788) Conflicting column names in join
[ https://issues.apache.org/jira/browse/DRILL-1788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14238243#comment-14238243 ]
Aman Sinha commented on DRILL-1788:
-----------------------------------
In the 2nd query above, the column is renamed on the left side but not on the right side. I believe that is OK: the only requirement is uniqueness, and as long as that is accomplished by renaming on one side, it should be fine.
I tried changing the field name comparison in JoinPrel.getJoinInput() to be case-insensitive but that did not help. Note that if I change the join condition in the subquery from n1.N_name = n2.n_name to one where the uppercase is on the right side: n1.n_name = n2.N_name, then the query runs successfully:
{code:sql}
0: jdbc:drill:zk=local> select n3.n_name from (select n2.n_name from cp.`tpch/nation.parquet` n1, cp.`tpch/nation.parquet` n2 where n1.n_name = n2.N_name) n3 join cp.`tpch/nation.parquet` n4 on n3.n_name = n4.n_name limit 1;
+------------+
| n_name |
+------------+
| ALGERIA |
+------------+
{code}
I will look into this some more. BTW, the IndexOutOfBoundsException occurs with both MergeJoin and HashJoin plans.
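To make the uniqueness requirement concrete, here is a minimal sketch of a rename step that treats field names as equal regardless of letter case. It is not Drill's code: the class JoinFieldRenamer and the method uniquifyRight are hypothetical names, and renaming the right side (rather than the left) is an arbitrary choice for the illustration. It only shows the intended invariant; as noted above, making the comparison in JoinPrel.getJoinInput() case-insensitive was not sufficient on its own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; this is not Drill's JoinPrel code.
final class JoinFieldRenamer {

    // Returns the names to use for the right input. Any right-side name that
    // collides, ignoring case, with a left-side name or with an earlier
    // right-side name gets a numeric suffix appended.
    static List<String> uniquifyRight(List<String> left, List<String> right) {
        Set<String> seen = new HashSet<>();
        for (String name : left) {
            seen.add(name.toLowerCase(Locale.ROOT));
        }
        List<String> result = new ArrayList<>();
        for (String name : right) {
            String candidate = name;
            int suffix = 0;
            // Try name, name0, name1, ... until one is unused in any letter case.
            while (!seen.add(candidate.toLowerCase(Locale.ROOT))) {
                candidate = name + suffix++;
            }
            result.add(candidate);
        }
        return result;
    }

    public static void main(String[] args) {
        // The left input exposes N_name and the right input exposes n_name.
        System.out.println(uniquifyRight(Arrays.asList("N_name"), Arrays.asList("n_name")));
    }
}
```

With the inputs in main, the right-side column comes back as n_name0, so the two sides no longer clash even though they differ only in case.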
> Conflicting column names in join
> --------------------------------
>
> Key: DRILL-1788
> URL: https://issues.apache.org/jira/browse/DRILL-1788
> Project: Apache Drill
> Issue Type: Bug
> Reporter: Steven Phillips
> Assignee: Aman Sinha
> Fix For: 0.8.0
>
>
> Drill doesn't support multiple columns within a batch having the same name. When doing a join where there are matching column names, the planner will insert a project to rename one of the columns to avoid this conflict.
> However, it appears that there is some case-sensitive matching somewhere in the code path, because there are some cases where this rewrite does not happen:
> For example, this query does do the column name change (see 01-03):
> 0: jdbc:drill:> explain plan for select n3.n_name from (select n2.n_name from cp.`tpch/nation.parquet` n1, cp.`tpch/nation.parquet` n2 where n1.n_name = n2.n_name) n3 join cp.`tpch/nation.parquet` n4 on n3.n_name = n4.n_name;
> +------------+------------+
> | text | json |
> +------------+------------+
> | 00-00    Screen
> 00-01      UnionExchange
> 01-01        Project(n_name=[$0])
> 01-02          HashJoin(condition=[=($0, $1)], joinType=[inner])
> 01-04            HashToRandomExchange(dist0=[[$0]])
> 02-01              Project(n_name=[$1])
> 02-02                HashJoin(condition=[=($0, $1)], joinType=[inner])
> 02-04                  HashToRandomExchange(dist0=[[$0]])
> 04-01                    Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tpch/nation.parquet]], selectionRoot=/tpch/nation.parquet, numFiles=1, columns=[`n_name`]]])
> 02-03                  Project(n_name0=[$0])
> 02-05                    HashToRandomExchange(dist0=[[$0]])
> 05-01                      Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tpch/nation.parquet]], selectionRoot=/tpch/nation.parquet, numFiles=1, columns=[`n_name`]]])
> 01-03            Project(n_name0=[$0])
> 01-05              HashToRandomExchange(dist0=[[$0]])
> 03-01                Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tpch/nation.parquet]], selectionRoot=/tpch/nation.parquet, numFiles=1, columns=[`n_name`]]])
> But if I change one of the letters in one of the identifiers to uppercase, the rename goes away:
> 0: jdbc:drill:> explain plan for select n3.n_name from (select n2.n_name from cp.`tpch/nation.parquet` n1, cp.`tpch/nation.parquet` n2 where n1.N_name = n2.n_name) n3 join cp.`tpch/nation.parquet` n4 on n3.n_name = n4.n_name;
> +------------+------------+
> | text | json |
> +------------+------------+
> | 00-00    Screen
> 00-01      UnionExchange
> 01-01        Project(n_name=[$0])
> 01-02          HashJoin(condition=[=($0, $1)], joinType=[inner])
> 01-04            HashToRandomExchange(dist0=[[$0]])
> 02-01              Project(n_name=[$1])
> 02-02                HashJoin(condition=[=($0, $1)], joinType=[inner])
> 02-04                  HashToRandomExchange(dist0=[[$0]])
> 04-01                    Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tpch/nation.parquet]], selectionRoot=/tpch/nation.parquet, numFiles=1, columns=[`N_name`]]])
> 02-03                  Project(N_name0=[$0])
> 02-05                    HashToRandomExchange(dist0=[[$0]])
> 05-01                      Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tpch/nation.parquet]], selectionRoot=/tpch/nation.parquet, numFiles=1, columns=[`N_name`]]])
> 01-03            HashToRandomExchange(dist0=[[$0]])
> 03-01              Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tpch/nation.parquet]], selectionRoot=/tpch/nation.parquet, numFiles=1, columns=[`N_name`]]])
> Running this query without the rewrite results in failure:
> java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
> at java.util.ArrayList.rangeCheck(ArrayList.java:604) ~[na:1.7.0_21]
> at java.util.ArrayList.get(ArrayList.java:382) ~[na:1.7.0_21]
> at org.apache.drill.exec.record.VectorContainer.getValueAccessorById(VectorContainer.java:252) ~[drill-java-exec-0.7.0-incubating-SNAPSHOT-rebuffed.jar:0.7.0-incubating-SNAPSHOT]
> at org.apache.drill.exec.record.AbstractRecordBatch.getValueAccessorById(AbstractRecordBatch.java:153) ~[drill-java-exec-0.7.0-incubating-SNAPSHOT-rebuffed.jar:0.7.0-incubating-SNAPSHOT]
> at org.apache.drill.exec.test.generated.HashJoinProbeGen249.doSetup(HashJoinProbeTemplate.java:46) ~[na:na]
> at org.apache.drill.exec.test.generated.HashJoinProbeGen249.setupHashJoinProbe(HashJoinProbeTemplate.java:97) ~[na:na]
> at org.apache.drill.exec.physical.impl.join.HashJoinBatch.innerNext(HashJoinBatch.java:226) ~[drill-java-exec-0.7.0-incubating-SNAPSHOT-rebuffed.jar:0.7.0-incubating-SNAPSHOT]