Posted to github@arrow.apache.org by "Jefffrey (via GitHub)" <gi...@apache.org> on 2023/02/07 11:08:02 UTC

[GitHub] [arrow-datafusion] Jefffrey opened a new pull request, #5210: Dataframe join_on method

Jefffrey opened a new pull request, #5210:
URL: https://github.com/apache/arrow-datafusion/pull/5210

# Which issue does this PR close?

Closes #1254

# Rationale for this change

# What changes are included in this PR?

Adds a new DataFrame method, `join_on`, which lets the user pass in arbitrary `Expr`s that are AND'ed together to form the `ON` condition.

Also fixes DataFrame `join` to enforce the column ambiguity check, as the SQL planner already does.
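
To make the "AND'ed together" step concrete, here is a minimal sketch of folding a list of predicates into a single condition. `ToyExpr`, `col`, `not_eq` and `conjunction` are hypothetical stand-ins written for this illustration; they are not DataFusion's `Expr` API and not necessarily how `join_on` is implemented.

```rust
// Toy expression type (an assumption for illustration, NOT DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
enum ToyExpr {
    Column(String),
    NotEq(Box<ToyExpr>, Box<ToyExpr>),
    And(Box<ToyExpr>, Box<ToyExpr>),
}

// Hypothetical helper: build a column reference.
fn col(name: &str) -> ToyExpr {
    ToyExpr::Column(name.to_string())
}

// Hypothetical helper: build `left != right`.
fn not_eq(left: ToyExpr, right: ToyExpr) -> ToyExpr {
    ToyExpr::NotEq(Box::new(left), Box::new(right))
}

/// Combine predicates left to right with AND.
/// Returns `None` when no predicates are given, i.e. there is no condition.
fn conjunction(exprs: impl IntoIterator<Item = ToyExpr>) -> Option<ToyExpr> {
    exprs
        .into_iter()
        .reduce(|acc, e| ToyExpr::And(Box::new(acc), Box::new(e)))
}
```

With two inputs such as `a.c1 != b.c1` and `a.c2 != b.c2`, the result is a single `And` node wrapping both; with an empty list there is no condition at all.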

# Are these changes tested?

New unit test

# Are there any user-facing changes?

New method in DataFrame, doc updated

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] Jefffrey commented on a diff in pull request #5210: Dataframe join_on method

Posted by "Jefffrey (via GitHub)" <gi...@apache.org>.

Jefffrey commented on code in PR #5210:
URL: https://github.com/apache/arrow-datafusion/pull/5210#discussion_r1098517157


##########
datafusion/expr/src/logical_plan/builder.rs:
##########
@@ -502,6 +505,25 @@ impl LogicalPlanBuilder {
             ));
         }
 
+        let filter = if let Some(expr) = filter {
+            // ambiguous check
+            ensure_any_column_reference_is_unambiguous(
+                &expr,
+                &[self.schema(), right.schema()],
+            )?;
+
+            // normalize all columns in expression
+            let using_columns = expr.to_columns()?;
+            let filter = normalize_col_with_schemas(
+                expr,
+                &[self.schema(), right.schema()],
+                &[using_columns],
+            )?;
+            Some(filter)
+        } else {
+            None
+        };
+

Review Comment:
   Related to https://github.com/apache/arrow-datafusion/issues/4196
   
   Fixes a bug where a DataFrame join could use an ambiguous column in the filter expr.
   
   Instead of doing the check in both the DataFrame join API and the SQL planner join module, this unifies it by doing the check inside the logical plan builder.
   
   This is technically a fix unrelated to the actual issue, so I can extract it into a separate issue if needed.
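
For readers unfamiliar with what an ambiguity check guards against, the following is a rough sketch of the idea only: an unqualified column name that exists on both sides of the join cannot be resolved to a single input. `ToySchema` and `resolve_unqualified` are hypothetical names made up for this example; the real check in the diff above is `ensure_any_column_reference_is_unambiguous`, whose internals are not shown in this thread.

```rust
/// A toy schema: a table qualifier plus its column names.
/// Hypothetical type for illustration; not DataFusion's schema type.
struct ToySchema {
    qualifier: &'static str,
    columns: Vec<&'static str>,
}

/// Resolve an unqualified column name against several schemas.
/// Returns the single matching qualifier, or an error when the name
/// matches no schema or more than one schema.
fn resolve_unqualified(name: &str, schemas: &[&ToySchema]) -> Result<&'static str, String> {
    let matches: Vec<&'static str> = schemas
        .iter()
        .filter(|s| s.columns.iter().any(|c| *c == name))
        .map(|s| s.qualifier)
        .collect();
    match matches.as_slice() {
        [only] => Ok(*only),
        [] => Err(format!("column '{}' not found", name)),
        _ => Err(format!("column reference '{}' is ambiguous", name)),
    }
}
```

In this sketch, a name present only in the left schema resolves to the left qualifier, while a name present in both schemas is rejected instead of silently picking one side.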



[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #5210: Dataframe join_on method

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on code in PR #5210:
URL: https://github.com/apache/arrow-datafusion/pull/5210#discussion_r1100529037


##########
docs/source/user-guide/dataframe.md:
##########
@@ -68,6 +68,7 @@ execution. The plan is evaluated (executed) when an action method is invoked, su
 | filter              | Filter a DataFrame to only include rows that match the specified filter expression.                                                        |
 | intersect           | Calculate the intersection of two DataFrames. The two DataFrames must have exactly the same schema                                         |
 | join                | Join this DataFrame with another DataFrame using the specified columns as join keys.                                                       |
+| join_on             | Join this DataFrame with another DataFrame using arbitrary expressions.                                                                    |

Review Comment:
   ❤️ 



[GitHub] [arrow-datafusion] ursabot commented on pull request #5210: Dataframe join_on method

Posted by "ursabot (via GitHub)" <gi...@apache.org>.

ursabot commented on PR #5210:
URL: https://github.com/apache/arrow-datafusion/pull/5210#issuecomment-1424004762

   Benchmark runs are scheduled for baseline = dee9fd7d2b9a3dbe57fb88fb9cbe9572f6117ab2 and contender = 1b03a7a35aad77456cb3fca58e37612903c96aec. 1b03a7a35aad77456cb3fca58e37612903c96aec is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/38d42e69e4f549f2be79d3e6f20d8521...f254e6029b734dc4a7959e267f3742eb/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/d14d27258bcb4eaeb842d0142f8cd9fa...99380f3dffe34e239786c217360a948a/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/b6c769a95a0b45eab8b3c56ad9938086...6586f91696f441a7888c2d585aa4ee2f/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/e3167220c38347489051978a1bfda7c8...972cc1e055a843b1aa4ce9b92f8e3f7a/)
   Buildkite builds:
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


[GitHub] [arrow-datafusion] liukun4515 merged pull request #5210: Dataframe join_on method

Posted by "liukun4515 (via GitHub)" <gi...@apache.org>.

liukun4515 merged PR #5210:
URL: https://github.com/apache/arrow-datafusion/pull/5210


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #5210: Dataframe join_on method

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on code in PR #5210:
URL: https://github.com/apache/arrow-datafusion/pull/5210#discussion_r1099265795


##########
datafusion/core/src/dataframe.rs:
##########
@@ -1039,6 +1088,33 @@ mod tests {
         Ok(())
     }
 
+    #[tokio::test]
+    async fn join_on() -> Result<()> {
+        let left = test_table_with_name("a")
+            .await?
+            .select_columns(&["c1", "c2"])?;
+        let right = test_table_with_name("b")
+            .await?
+            .select_columns(&["c1", "c2"])?;
+        let join = left.join_on(
+            right,
+            JoinType::Inner,
+            [
+                col("a.c1").not_eq(col("b.c1")),
+                col("a.c2").not_eq(col("b.c2")),

Review Comment:
   Would it be possible here to also add an equality predicate to demonstrate they are automatically recognized as equi preds?
   
   Perhaps something like
   
   ```suggestion
                   col("a.c2").eq(col("b.c2")),
   ```



##########
datafusion/expr/src/logical_plan/builder.rs:
##########
@@ -502,6 +505,25 @@ impl LogicalPlanBuilder {
             ));
         }
 
+        let filter = if let Some(expr) = filter {
+            // ambiguous check
+            ensure_any_column_reference_is_unambiguous(
+                &expr,
+                &[self.schema(), right.schema()],
+            )?;
+
+            // normalize all columns in expression
+            let using_columns = expr.to_columns()?;
+            let filter = normalize_col_with_schemas(
+                expr,
+                &[self.schema(), right.schema()],
+                &[using_columns],
+            )?;
+            Some(filter)
+        } else {
+            None
+        };
+

Review Comment:
   I think it is fine to include in this PR as long as it also has a test (for ambiguity check using the DataFrame API)



##########
datafusion/core/src/dataframe.rs:
##########
@@ -363,6 +363,55 @@ impl DataFrame {
         Ok(DataFrame::new(self.session_state, plan))
     }
 
+    /// Join this DataFrame with another DataFrame using the specified expressions.
+    ///
+    /// Simply a thin wrapper over [`join`](Self::join) where the join keys are not provided,
+    /// and the provided expressions are AND'ed together to form the filter expression.
+    ///
+    /// ```
+    /// # use datafusion::prelude::*;
+    /// # use datafusion::error::Result;
+    /// # #[tokio::main]
+    /// # async fn main() -> Result<()> {
+    /// let ctx = SessionContext::new();
+    /// let left = ctx
+    ///     .read_csv("tests/data/example.csv", CsvReadOptions::new())
+    ///     .await?;
+    /// let right = ctx
+    ///     .read_csv("tests/data/example.csv", CsvReadOptions::new())
+    ///     .await?
+    ///     .select(vec![
+    ///         col("a").alias("a2"),
+    ///         col("b").alias("b2"),
+    ///         col("c").alias("c2"),
+    ///     ])?;
+    /// let join_on = left.join_on(
+    ///     right,
+    ///     JoinType::Inner,
+    ///     [col("a").not_eq(col("a2")), col("b").not_eq(col("b2"))],
+    /// )?;
+    /// let batches = join_on.collect().await?;
+    /// # Ok(())
+    /// # }
+    /// ```
+    pub fn join_on(

Review Comment:
   👍  LGTM



[GitHub] [arrow-datafusion] Jefffrey commented on a diff in pull request #5210: Dataframe join_on method

Posted by "Jefffrey (via GitHub)" <gi...@apache.org>.

Jefffrey commented on code in PR #5210:
URL: https://github.com/apache/arrow-datafusion/pull/5210#discussion_r1099842534


##########
datafusion/expr/src/logical_plan/builder.rs:
##########
@@ -502,6 +505,25 @@ impl LogicalPlanBuilder {
             ));
         }
 
+        let filter = if let Some(expr) = filter {
+            // ambiguous check
+            ensure_any_column_reference_is_unambiguous(
+                &expr,
+                &[self.schema(), right.schema()],
+            )?;
+
+            // normalize all columns in expression
+            let using_columns = expr.to_columns()?;
+            let filter = normalize_col_with_schemas(
+                expr,
+                &[self.schema(), right.schema()],
+                &[using_columns],
+            )?;
+            Some(filter)
+        } else {
+            None
+        };
+

Review Comment:
   test added



[GitHub] [arrow-datafusion] liukun4515 commented on pull request #5210: Dataframe join_on method

Posted by "liukun4515 (via GitHub)" <gi...@apache.org>.

liukun4515 commented on PR #5210:
URL: https://github.com/apache/arrow-datafusion/pull/5210#issuecomment-1422577555

   I want to take a look at this PR tomorrow. @alamb


[GitHub] [arrow-datafusion] Jefffrey commented on a diff in pull request #5210: Dataframe join_on method

Posted by "Jefffrey (via GitHub)" <gi...@apache.org>.

Jefffrey commented on code in PR #5210:
URL: https://github.com/apache/arrow-datafusion/pull/5210#discussion_r1099845204


##########
datafusion/core/src/dataframe.rs:
##########
@@ -1039,6 +1088,33 @@ mod tests {
         Ok(())
     }
 
+    #[tokio::test]
+    async fn join_on() -> Result<()> {
+        let left = test_table_with_name("a")
+            .await?
+            .select_columns(&["c1", "c2"])?;
+        let right = test_table_with_name("b")
+            .await?
+            .select_columns(&["c1", "c2"])?;
+        let join = left.join_on(
+            right,
+            JoinType::Inner,
+            [
+                col("a.c1").not_eq(col("b.c1")),
+                col("a.c2").not_eq(col("b.c2")),

Review Comment:
   Done as you suggested. The equality predicates still seem to be treated as part of the filter, though this matches the explicit SQL version too:
   
   https://github.com/apache/arrow-datafusion/blob/f0c67193a3d18ff1d94f9dd55bfb1715e5473bf1/datafusion/sql/tests/integration_test.rs#L1661-L1672
   
   Edit: never mind, the `extract_equijoin_predicate` logical optimization does extract it into an equijoin predicate.
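
As a sketch of what such an extraction does conceptually (not the actual optimizer rule): split the AND'ed conjuncts, pull out equalities that compare a left-side column with a right-side column as join keys, and leave everything else in the filter. `Col`, `Pred` and `split_equijoin` are hypothetical names invented for this example.

```rust
/// A qualified column written as ("table", "column"). Hypothetical type.
type Col = (&'static str, &'static str);

/// Toy predicate over qualified columns; not DataFusion's `Expr`.
#[derive(Debug, Clone, PartialEq)]
enum Pred {
    Eq(Col, Col),
    NotEq(Col, Col),
}

/// Split AND'ed conjuncts into equijoin key pairs (left column, right column)
/// and the predicates that have to stay in the join filter.
fn split_equijoin(
    conjuncts: Vec<Pred>,
    left: &str,
    right: &str,
) -> (Vec<(Col, Col)>, Vec<Pred>) {
    let mut keys = Vec::new();
    let mut filter = Vec::new();
    for p in conjuncts {
        match p {
            // `left.x = right.y` becomes a join key as written.
            Pred::Eq(l, r) if l.0 == left && r.0 == right => keys.push((l, r)),
            // `right.y = left.x` is the same key with the sides swapped.
            Pred::Eq(l, r) if l.0 == right && r.0 == left => keys.push((r, l)),
            // Anything else (e.g. `!=`) remains a filter predicate.
            other => filter.push(other),
        }
    }
    (keys, filter)
}
```

Applied to `a.c1 != b.c1 AND a.c2 = b.c2` with `a` on the left and `b` on the right, this sketch yields one key pair (`a.c2`, `b.c2`) and leaves the inequality in the filter.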


