You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/17 11:05:42 UTC

[GitHub] [arrow-datafusion] ygf11 opened a new pull request, #4258: Add ambiguous check for join

ygf11 opened a new pull request, #4258:
URL: https://github.com/apache/arrow-datafusion/pull/4258

   # Which issue does this PR close?
   
   Closes #4197.
   
   # Rationale for this change
   
   When join condition is ambiguous, the `sql` will work.
   It should throw an error, like PostgreSQL or Spark do.
   
   For example, test0 and test1 both has c0 column:
   ```sql
   ❯ explain select t0.c0, t1.c0 from test0 as t0 inner join test0 as t1 on c0 = 1;
   +---------------+-------------------------------------------------------------------------------------------------------------+
   | plan_type     | plan                                                                                                        |
   +---------------+-------------------------------------------------------------------------------------------------------------+
   | logical_plan  | Projection: t0.c0, t1.c0                                                                                    |
   |               |   CrossJoin:                                                                                                |
   |               |     Filter: t0.c0 = Int16(1)                                                                                |
   |               |       SubqueryAlias: t0                                                                                     |
   |               |         TableScan: test0 projection=[c0]                                                                    |
   |               |     SubqueryAlias: t1                                                                                       |
   |               |       TableScan: test0 projection=[c0]                                                                      |
   ```
   
   
   # What changes are included in this PR?
   
   Add ambiguous check for join condition.
   
   # Are these changes tested?
   
   Yes, test case is added.
   
   # Are there any user-facing changes?
   
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] mingmwang commented on pull request #4258: Add ambiguous check for join

Posted by GitBox <gi...@apache.org>.
mingmwang commented on PR #4258:
URL: https://github.com/apache/arrow-datafusion/pull/4258#issuecomment-1321510399

   @ygf11 
   Could you please add one more UT with the JOIN USING clause ? You can add this in the following PR.
   
   https://www.neilwithdata.com/join-using
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] ursabot commented on pull request #4258: Add ambiguous check for join

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #4258:
URL: https://github.com/apache/arrow-datafusion/pull/4258#issuecomment-1320477120

   Benchmark runs are scheduled for baseline = f3a65c74442fa42770418684a71a09ad9bcc348c and contender = 8c02485cfc3d2c48bc48a557db58e3e5a0f75777. 8c02485cfc3d2c48bc48a557db58e3e5a0f75777 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/adb03fa12479433caac07b1cbaa734e2...09f3ec9bb6b5432eb2cb6c71f70fade6/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/ea186a6a7c3e4e11a6775856f0edce61...eb2d2a992392400084779ee61398554f/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/88b4915d1bfa4fcd8a766b5efbc1e2d4...7f45ea95a8254a26a160856df10f6650/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/47f465170a2e4768a4e05727c80d1044...716ef8c5fd8742f88505f010add60bb1/)
   Buildkite builds:
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #4258: Add ambiguous check for join

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #4258:
URL: https://github.com/apache/arrow-datafusion/pull/4258#discussion_r1026837425


##########
datafusion/sql/src/planner.rs:
##########
@@ -3165,6 +3171,64 @@ fn wrap_projection_for_join_if_necessary(
     Ok((plan, need_project))
 }
 
+/// Ensure any column reference of the expression is determined.

Review Comment:
   Another term for this concept is `unambiguous`



##########
datafusion/sql/src/planner.rs:
##########
@@ -3165,6 +3171,64 @@ fn wrap_projection_for_join_if_necessary(
     Ok((plan, need_project))
 }
 
+/// Ensure any column reference of the expression is determined.
+/// Assume we have two schema:
+/// schema1: a, b ,c
+/// schema2: a, d, e
+///
+/// `schema1.a + schema2.a` is determined.
+/// `a + d` is not determined, because `a` may come from schema1 or schema2.
+fn ensure_any_column_reference_is_determined(
+    expr: &Expr,
+    schemas: &[DFSchemaRef],
+) -> Result<()> {
+    if schemas.len() == 1 {
+        return Ok(());
+    }
+
+    // extract columns both in one more schemas.
+    let mut column_count_map: HashMap<String, usize> = HashMap::new();

Review Comment:
   You might be able to avoid a clone into a string here. Maybe a follow on PR
   
   ```suggestion
       let mut column_count_map: HashMap<&str, usize> = HashMap::new();
   ```
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] waynexia commented on a diff in pull request #4258: Add ambiguous check for join

Posted by GitBox <gi...@apache.org>.
waynexia commented on code in PR #4258:
URL: https://github.com/apache/arrow-datafusion/pull/4258#discussion_r1025284507


##########
datafusion/sql/src/planner.rs:
##########
@@ -3005,6 +3011,66 @@ fn extract_possible_join_keys(
     }
 }
 
+/// Ensure any column reference of the expression is determined.
+/// Assume we have two schema:
+/// schema1: a, b ,c
+/// schema2: a, d, e
+///
+/// `schema1.a + schema2.a` is determined.
+/// `a + d` is not determined, because `a` may come from schema1 or schema2.
+fn ensure_any_column_reference_is_determined(
+    expr: &Expr,
+    schemas: &[DFSchemaRef],
+) -> Result<()> {
+    if schemas.len() == 1 {
+        return Ok(());
+    }
+
+    // extract columns both in one more schemas.
+    let mut column_count_map: HashMap<String, usize> = HashMap::new();
+    schemas
+        .iter()
+        .flat_map(|schema| schema.fields())
+        .for_each(|field| {
+            column_count_map
+                .entry(field.name().into())
+                .and_modify(|v| *v += 1)
+                .or_insert(1usize);
+        });
+
+    let duplicated_column_set = column_count_map
+        .iter()
+        .filter(|(_, count)| **count > 1usize)
+        .map(|(column, _)| column)
+        .cloned()
+        .collect::<HashSet<String>>();
+
+    // check if there is ambiguous column.
+    let using_columns = expr.to_columns()?;
+    let ambiguous_column = using_columns.iter().find(|column| {
+        column.relation.is_none() && duplicated_column_set.contains(&column.name)
+    });
+
+    if let Some(column) = ambiguous_column {
+        let maybe_field = schemas
+            .iter()
+            .flat_map(|schema| {
+                schema
+                    .field_with_unqualified_name(&column.name)
+                    .map(|f| f.qualified_name())
+                    .ok()
+            })
+            .collect::<Vec<_>>();
+        Err(DataFusionError::Plan(format!(
+            "reference \'{}\' is ambiguous, could be {};",
+            column.name,
+            maybe_field.join(","),

Review Comment:
   Very concrete error hint 👍 



##########
datafusion/sql/src/planner.rs:
##########
@@ -3005,6 +3011,66 @@ fn extract_possible_join_keys(
     }
 }
 
+/// Ensure any column reference of the expression is determined.
+/// Assume we have two schema:
+/// schema1: a, b ,c
+/// schema2: a, d, e
+///
+/// `schema1.a + schema2.a` is determined.
+/// `a + d` is not determined, because `a` may come from schema1 or schema2.
+fn ensure_any_column_reference_is_determined(
+    expr: &Expr,
+    schemas: &[DFSchemaRef],
+) -> Result<()> {
+    if schemas.len() == 1 {
+        return Ok(());
+    }
+
+    // extract columns both in one more schemas.
+    let mut column_count_map: HashMap<String, usize> = HashMap::new();
+    schemas
+        .iter()
+        .flat_map(|schema| schema.fields())
+        .for_each(|field| {
+            column_count_map
+                .entry(field.name().into())
+                .and_modify(|v| *v += 1)
+                .or_insert(1usize);
+        });
+
+    let duplicated_column_set = column_count_map
+        .iter()
+        .filter(|(_, count)| **count > 1usize)
+        .map(|(column, _)| column)
+        .cloned()
+        .collect::<HashSet<String>>();

Review Comment:
   nit:
   ```suggestion
       let duplicated_column_set = column_count_map
           .into_iter()
           .filter_map(|(column, count)| if count > 1 { Some(column) } else { None })
           .collect::<HashSet<String>>();
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb merged pull request #4258: Add ambiguous check for join

Posted by GitBox <gi...@apache.org>.
alamb merged PR #4258:
URL: https://github.com/apache/arrow-datafusion/pull/4258


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org