You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by "nseekhao (via GitHub)" <gi...@apache.org> on 2023/06/03 15:36:43 UTC

[GitHub] [arrow-datafusion] nseekhao commented on a diff in pull request #6135: Substrait: Fix incorrect join key fields (indices) when same table is being used more than once

nseekhao commented on code in PR #6135:
URL: https://github.com/apache/arrow-datafusion/pull/6135#discussion_r1215611954


##########
datafusion/substrait/tests/roundtrip_logical_plan.rs:
##########
@@ -278,6 +278,18 @@ mod tests {
         roundtrip("SELECT a,b,c,d,e FROM datafusion.public.data;").await
     }
 
+    #[tokio::test]
+    async fn roundtrip_inner_join_table_reuse() -> Result<()> {
+        assert_expected_plan(
+            "SELECT d1.b, d2.c FROM data d1 JOIN data d2 ON d1.a = d2.a",
+            "Projection: data.b, data.c\
+            \n  Inner Join: data.a = data.a\
+            \n    TableScan: data projection=[a, b]\
+            \n    TableScan: data projection=[a, c]",

Review Comment:
   If I recall correctly, this would have failed in the earlier version of DF where `DuplicateQualifiedField` gets thrown in `DFSchema::new_with_metadata()`. However, in the later version, this error does not get thrown anymore and only `DuplicateUnqualifiedField` gets thrown [here](https://github.com/apache/arrow-datafusion/blob/main/datafusion/common/src/dfschema.rs#L72-L76).
   
   However, even if this error does not get thrown, and the test passes, it also does not mean that the produced Substrait plan is correct. If we join the same table to itself, the `JoinRel` in the Substrait plan would expect two input relations. AFAIK, there is no notion of pointer, so we'll need two `ReadRel`s created from the same table. And as far as the `JoinRel` is concerned, `left` and `right` are two separate relations. Since Substrait uses indices as opposed to name qualifier, `JoinRel` would expect the available input indices to be `0` to `size(left output)-1` and `size(left output)` to `size(left output) + size(right output)-1`. For example, if the input table has 5 columns, it'll expect to get indices from `0` to `9`. And let's say we're trying to join first column from `left` to second column from `right`, that would be join condition of `col_0 = col_6`. Note that without this PR, the join condition would be `col_0 = col_1` since from the DF named qualifiers, the produce
 r would find the columns from the left and right to come from the same table, but that will send the wrong message to Substrait.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org