Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/06/06 02:59:28 UTC

[GitHub] [arrow-datafusion] AssHero opened a new pull request, #2702: Make sure that the data types are supported in hashjoin before genera…

AssHero opened a new pull request, #2702:
URL: https://github.com/apache/arrow-datafusion/pull/2702

   # Which issue does this PR close?
   
   Closes #2145 
   
    # Rationale for this change
   Before generating the hash join logical plan, make sure the data types in the equality conditions are supported by hash join. Otherwise, fall back to a cross join.
   
   # What changes are included in this PR?
   1. Add a `can_hash` function in datafusion/expr/src/utils.rs.
   2. Use `can_hash` in datafusion/expr/src/logical_plan/builder.rs and datafusion/sql/src/planner.rs to check whether a data type is supported by hash join (see the sketch after this list).
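   
   A minimal sketch of the idea (illustrative only; the concrete set of types in the real `can_hash` follows whatever `equal_rows` in the hash join supports, so the arms below are an assumption rather than the PR's code):
   
   ```rust
   use arrow::datatypes::DataType;
   
   /// Can the hash join hash keys of this type? Illustrative subset only;
   /// the real list has to stay in sync with `equal_rows`.
   fn can_hash(data_type: &DataType) -> bool {
       matches!(
           data_type,
           DataType::Boolean
               | DataType::Int8
               | DataType::Int16
               | DataType::Int32
               | DataType::Int64
               | DataType::UInt8
               | DataType::UInt16
               | DataType::UInt32
               | DataType::UInt64
               | DataType::Utf8
               | DataType::LargeUtf8
       )
   }
   
   fn main() {
       assert!(can_hash(&DataType::Int32));
       // Date32 was not hashable by the join at the time of this PR, so an
       // equality condition on it falls back to a cross join plus a filter.
       assert!(!can_hash(&DataType::Date32));
   }
   ```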


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] AssHero commented on a diff in pull request #2702: Make sure that the data types are supported in hashjoin before genera…

Posted by GitBox <gi...@apache.org>.
AssHero commented on code in PR #2702:
URL: https://github.com/apache/arrow-datafusion/pull/2702#discussion_r893003362


##########
datafusion/core/tests/sql/joins.rs:
##########
@@ -1204,3 +1204,141 @@ async fn join_partitioned() -> Result<()> {
 
     Ok(())
 }
+
+#[tokio::test]
+async fn join_with_hash_unsupported_data_type() -> Result<()> {
+    let ctx = SessionContext::new();
+
+    let schema = Schema::new(vec![
+        Field::new("c1", DataType::Int32, true),
+        Field::new("c2", DataType::Utf8, true),
+        Field::new("c3", DataType::Int64, true),
+        Field::new("c4", DataType::Date32, true),
+    ]);
+    let data = RecordBatch::try_new(
+        Arc::new(schema),
+        vec![
+            Arc::new(Int32Array::from_slice(&[1, 2, 3])),
+            Arc::new(StringArray::from_slice(&["aaa", "bbb", "ccc"])),
+            Arc::new(Int64Array::from_slice(&[100, 200, 300])),
+            Arc::new(Date32Array::from(vec![Some(1), Some(2), Some(3)])),
+        ],
+    )?;
+    let table = MemTable::try_new(data.schema(), vec![vec![data]])?;
+    ctx.register_table("foo", Arc::new(table))?;
+
+    // join on a hash-unsupported data type (Date32): use cross join instead of hash join
+    let sql = "select * from foo t1 join foo t2 on t1.c4 = t2.c4";
+    let msg = format!("Creating logical plan for '{}'", sql);

Review Comment:
   > So I think CrossJoin is almost never what the user would want: as once the tables get beyond any trivial size the query will effectively never finish or will run out of memory. An error is clearer.
   > 
   > From the issue description [#2145 (comment)](https://github.com/apache/arrow-datafusion/issues/2145#issue-1191065060) I think @pjmore's idea to cast unsupported types to a supported type is a good one -- the arrow `cast` kernels are quite efficient for things like `Date32` -> `Int32` (no copies) as the representations are the same
   > 
   > @pjmore what do you think?
   
   I agree. Supporting more data types in hash join is a better way to solve this issue, and I'm already working on it.





[GitHub] [arrow-datafusion] alamb commented on pull request #2702: Make sure that the data types are supported in hashjoin before genera…

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #2702:
URL: https://github.com/apache/arrow-datafusion/pull/2702#issuecomment-1150961878

   Thank you again @AssHero 




[GitHub] [arrow-datafusion] alamb merged pull request #2702: Make sure that the data types are supported in hashjoin before genera…

Posted by GitBox <gi...@apache.org>.
alamb merged PR #2702:
URL: https://github.com/apache/arrow-datafusion/pull/2702




[GitHub] [arrow-datafusion] AssHero commented on a diff in pull request #2702: Make sure that the data types are supported in hashjoin before genera…

Posted by GitBox <gi...@apache.org>.
AssHero commented on code in PR #2702:
URL: https://github.com/apache/arrow-datafusion/pull/2702#discussion_r890701528


##########
datafusion/expr/src/logical_plan/builder.rs:
##########
@@ -19,6 +19,7 @@
 
 use crate::expr_rewriter::{normalize_col, normalize_cols, rewrite_sort_cols_by_aggs};
 use crate::utils::{columnize_expr, exprlist_to_fields, from_plan};
+use crate::Operator;

Review Comment:
   Thank you! I'll fix this.





[GitHub] [arrow-datafusion] codecov-commenter commented on pull request #2702: Make sure that the data types are supported in hashjoin before genera…

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on PR #2702:
URL: https://github.com/apache/arrow-datafusion/pull/2702#issuecomment-1147067806

   # [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/2702?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#2702](https://codecov.io/gh/apache/arrow-datafusion/pull/2702?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (9d74954) into [master](https://codecov.io/gh/apache/arrow-datafusion/commit/930653415c6303b34f856826ba73717b25b9574c?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (9306534) will **decrease** coverage by `0.02%`.
   > The diff coverage is `77.86%`.
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #2702      +/-   ##
   ==========================================
   - Coverage   84.66%   84.63%   -0.03%     
   ==========================================
     Files         270      270              
     Lines       46919    47039     +120     
   ==========================================
   + Hits        39724    39813      +89     
   - Misses       7195     7226      +31     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-datafusion/pull/2702?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [datafusion/expr/src/logical\_plan/builder.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2702/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9leHByL3NyYy9sb2dpY2FsX3BsYW4vYnVpbGRlci5ycw==) | `89.02% <51.72%> (-2.92%)` | :arrow_down: |
   | [datafusion/expr/src/utils.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2702/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9leHByL3NyYy91dGlscy5ycw==) | `90.39% <68.18%> (-1.48%)` | :arrow_down: |
   | [datafusion/sql/src/planner.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2702/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcWwvc3JjL3BsYW5uZXIucnM=) | `81.33% <69.23%> (-0.19%)` | :arrow_down: |
   | [datafusion/core/tests/sql/joins.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2702/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9jb3JlL3Rlc3RzL3NxbC9qb2lucy5ycw==) | `99.07% <100.00%> (+0.13%)` | :arrow_up: |
   | [datafusion/core/src/physical\_plan/metrics/value.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2702/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9jb3JlL3NyYy9waHlzaWNhbF9wbGFuL21ldHJpY3MvdmFsdWUucnM=) | `86.93% <0.00%> (-0.51%)` | :arrow_down: |
   | [datafusion/expr/src/logical\_plan/plan.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2702/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9leHByL3NyYy9sb2dpY2FsX3BsYW4vcGxhbi5ycw==) | `68.47% <0.00%> (+0.51%)` | :arrow_up: |
   | [datafusion/expr/src/window\_frame.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2702/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9leHByL3NyYy93aW5kb3dfZnJhbWUucnM=) | `93.27% <0.00%> (+0.84%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/2702?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/2702?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [9306534...9d74954](https://codecov.io/gh/apache/arrow-datafusion/pull/2702?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   




[GitHub] [arrow-datafusion] liukun4515 commented on a diff in pull request #2702: Make sure that the data types are supported in hashjoin before genera…

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on code in PR #2702:
URL: https://github.com/apache/arrow-datafusion/pull/2702#discussion_r890710049


##########
datafusion/expr/src/utils.rs:
##########
@@ -643,6 +644,35 @@ pub fn expr_as_column_expr(expr: &Expr, plan: &LogicalPlan) -> Result<Expr> {
     }
 }
 
+/// Can this data type be used in hash join equality conditions?
+/// If more data types become supported in hash join, add them here
+/// so a hash join logical plan can be generated for them.
+pub fn can_hash(data_type: &DataType) -> bool {
+    match data_type {

Review Comment:
   decimal should be supported here





[GitHub] [arrow-datafusion] andygrove commented on a diff in pull request #2702: Make sure that the data types are supported in hashjoin before genera…

Posted by GitBox <gi...@apache.org>.
andygrove commented on code in PR #2702:
URL: https://github.com/apache/arrow-datafusion/pull/2702#discussion_r890326974


##########
datafusion/expr/src/logical_plan/builder.rs:
##########
@@ -605,17 +606,52 @@ impl LogicalPlanBuilder {
         let on: Vec<(_, _)> = left_keys.into_iter().zip(right_keys.into_iter()).collect();
         let join_schema =
             build_join_schema(self.plan.schema(), right.schema(), &join_type)?;
-
-        Ok(Self::from(LogicalPlan::Join(Join {
-            left: Arc::new(self.plan.clone()),
-            right: Arc::new(right.clone()),
-            on,
-            filter: None,
-            join_type,
-            join_constraint: JoinConstraint::Using,
-            schema: DFSchemaRef::new(join_schema),
-            null_equals_null: false,
-        })))
+        let mut join_on: Vec<(Column, Column)> = vec![];
+        let mut filters: Option<Expr> = None;
+        for (l, r) in &on {
+            if self.plan.schema().field_from_column(l).is_ok()
+                && right.schema().field_from_column(r).is_ok()
+                && can_hash(self.plan.schema().field_from_column(l).unwrap().data_type())
+            {
+                join_on.push((l.clone(), r.clone()));
+            } else if self.plan.schema().field_from_column(r).is_ok()
+                && right.schema().field_from_column(l).is_ok()
+                && can_hash(self.plan.schema().field_from_column(r).unwrap().data_type())
+            {
+                join_on.push((r.clone(), l.clone()));
+            } else {
+                let expr = Expr::BinaryExpr {
+                    left: Box::new(Expr::Column(l.clone())),
+                    op: Operator::Eq,
+                    right: Box::new(Expr::Column(r.clone())),
+                };

Review Comment:
   ```suggestion
                   let expr = binary_expr(
                       Expr::Column(l.clone()),
                       Operator::Eq,
                       Expr::Column(r.clone()),
                   );
   
   ```





[GitHub] [arrow-datafusion] AssHero commented on a diff in pull request #2702: Make sure that the data types are supported in hashjoin before genera…

Posted by GitBox <gi...@apache.org>.
AssHero commented on code in PR #2702:
URL: https://github.com/apache/arrow-datafusion/pull/2702#discussion_r893003697


##########
datafusion/expr/src/utils.rs:
##########
@@ -643,6 +644,35 @@ pub fn expr_as_column_expr(expr: &Expr, plan: &LogicalPlan) -> Result<Expr> {
     }
 }
 
+/// Can this data type be used in hash join equality conditions?
+/// If more data types become supported in hash join, add them here
+/// so a hash join logical plan can be generated for them.
+pub fn can_hash(data_type: &DataType) -> bool {
+    match data_type {

Review Comment:
   > I think it may help to add a comment here (or in `equal_rows`) mentioning they need to remain in sync
   
   I'll add this comment.





[GitHub] [arrow-datafusion] andygrove commented on a diff in pull request #2702: Make sure that the data types are supported in hashjoin before genera…

Posted by GitBox <gi...@apache.org>.
andygrove commented on code in PR #2702:
URL: https://github.com/apache/arrow-datafusion/pull/2702#discussion_r890325614


##########
datafusion/expr/src/logical_plan/builder.rs:
##########
@@ -605,17 +606,52 @@ impl LogicalPlanBuilder {
         let on: Vec<(_, _)> = left_keys.into_iter().zip(right_keys.into_iter()).collect();
         let join_schema =
             build_join_schema(self.plan.schema(), right.schema(), &join_type)?;
-
-        Ok(Self::from(LogicalPlan::Join(Join {
-            left: Arc::new(self.plan.clone()),
-            right: Arc::new(right.clone()),
-            on,
-            filter: None,
-            join_type,
-            join_constraint: JoinConstraint::Using,
-            schema: DFSchemaRef::new(join_schema),
-            null_equals_null: false,
-        })))
+        let mut join_on: Vec<(Column, Column)> = vec![];
+        let mut filters: Option<Expr> = None;
+        for (l, r) in &on {
+            if self.plan.schema().field_from_column(l).is_ok()
+                && right.schema().field_from_column(r).is_ok()
+                && can_hash(self.plan.schema().field_from_column(l).unwrap().data_type())
+            {
+                join_on.push((l.clone(), r.clone()));
+            } else if self.plan.schema().field_from_column(r).is_ok()
+                && right.schema().field_from_column(l).is_ok()
+                && can_hash(self.plan.schema().field_from_column(r).unwrap().data_type())
+            {
+                join_on.push((r.clone(), l.clone()));
+            } else {
+                let expr = Expr::BinaryExpr {
+                    left: Box::new(Expr::Column(l.clone())),
+                    op: Operator::Eq,
+                    right: Box::new(Expr::Column(r.clone())),
+                };

Review Comment:
   nit: we can use `binary_expr` helper here for more concise code
   
   ```
                   let expr = binary_expr(
                       Expr::Column(l.clone()),
                       Operator::Eq,
                       Expr::Column(r.clone()),
                   );
   ```





[GitHub] [arrow-datafusion] AssHero commented on a diff in pull request #2702: Make sure that the data types are supported in hashjoin before genera…

Posted by GitBox <gi...@apache.org>.
AssHero commented on code in PR #2702:
URL: https://github.com/apache/arrow-datafusion/pull/2702#discussion_r890711868


##########
datafusion/expr/src/utils.rs:
##########
@@ -643,6 +644,35 @@ pub fn expr_as_column_expr(expr: &Expr, plan: &LogicalPlan) -> Result<Expr> {
     }
 }
 
+/// Can this data type be used in hash join equality conditions?
+/// If more data types become supported in hash join, add them here
+/// so a hash join logical plan can be generated for them.
+pub fn can_hash(data_type: &DataType) -> bool {
+    match data_type {

Review Comment:
   The data types here come from the `equal_rows` function; decimal is not supported in `equal_rows`.





[GitHub] [arrow-datafusion] AssHero commented on a diff in pull request #2702: Make sure that the data types are supported in hashjoin before genera…

Posted by GitBox <gi...@apache.org>.
AssHero commented on code in PR #2702:
URL: https://github.com/apache/arrow-datafusion/pull/2702#discussion_r890701328


##########
datafusion/expr/src/logical_plan/builder.rs:
##########
@@ -605,17 +606,52 @@ impl LogicalPlanBuilder {
         let on: Vec<(_, _)> = left_keys.into_iter().zip(right_keys.into_iter()).collect();
         let join_schema =
             build_join_schema(self.plan.schema(), right.schema(), &join_type)?;
-
-        Ok(Self::from(LogicalPlan::Join(Join {
-            left: Arc::new(self.plan.clone()),
-            right: Arc::new(right.clone()),
-            on,
-            filter: None,
-            join_type,
-            join_constraint: JoinConstraint::Using,
-            schema: DFSchemaRef::new(join_schema),
-            null_equals_null: false,
-        })))
+        let mut join_on: Vec<(Column, Column)> = vec![];
+        let mut filters: Option<Expr> = None;
+        for (l, r) in &on {
+            if self.plan.schema().field_from_column(l).is_ok()
+                && right.schema().field_from_column(r).is_ok()
+                && can_hash(self.plan.schema().field_from_column(l).unwrap().data_type())
+            {
+                join_on.push((l.clone(), r.clone()));
+            } else if self.plan.schema().field_from_column(r).is_ok()
+                && right.schema().field_from_column(l).is_ok()
+                && can_hash(self.plan.schema().field_from_column(r).unwrap().data_type())
+            {
+                join_on.push((r.clone(), l.clone()));
+            } else {
+                let expr = Expr::BinaryExpr {
+                    left: Box::new(Expr::Column(l.clone())),
+                    op: Operator::Eq,
+                    right: Box::new(Expr::Column(r.clone())),
+                };

Review Comment:
   Thank you! I'll fix this.





[GitHub] [arrow-datafusion] AssHero commented on a diff in pull request #2702: Make sure that the data types are supported in hashjoin before genera…

Posted by GitBox <gi...@apache.org>.
AssHero commented on code in PR #2702:
URL: https://github.com/apache/arrow-datafusion/pull/2702#discussion_r890711868


##########
datafusion/expr/src/utils.rs:
##########
@@ -643,6 +644,35 @@ pub fn expr_as_column_expr(expr: &Expr, plan: &LogicalPlan) -> Result<Expr> {
     }
 }
 
+/// Can this data type be used in hash join equality conditions?
+/// If more data types become supported in hash join, add them here
+/// so a hash join logical plan can be generated for them.
+pub fn can_hash(data_type: &DataType) -> bool {
+    match data_type {

Review Comment:
   The data types here come from the `equal_rows` function. Decimal is not supported in `equal_rows`, so hash join currently cannot join on decimal columns; that's why decimal is not listed here.
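   
   An illustrative sketch of that coupling (hypothetical function, not DataFusion's actual `equal_rows`): the hash join compares rows by matching on concrete array types, so any type that `can_hash` accepts needs a corresponding arm here, and vice versa.
   
   ```rust
   use arrow::array::{Array, Int32Array, StringArray};
   use arrow::datatypes::DataType;
   
   /// Compare one row from each side; returns None for types the join
   /// cannot handle. The supported arms must stay in sync with `can_hash`.
   fn equal_rows_sketch(
       left: &dyn Array,
       right: &dyn Array,
       left_row: usize,
       right_row: usize,
   ) -> Option<bool> {
       match (left.data_type(), right.data_type()) {
           (DataType::Int32, DataType::Int32) => {
               let l = left.as_any().downcast_ref::<Int32Array>()?;
               let r = right.as_any().downcast_ref::<Int32Array>()?;
               Some(l.value(left_row) == r.value(right_row))
           }
           (DataType::Utf8, DataType::Utf8) => {
               let l = left.as_any().downcast_ref::<StringArray>()?;
               let r = right.as_any().downcast_ref::<StringArray>()?;
               Some(l.value(left_row) == r.value(right_row))
           }
           // No arm for Decimal (or Date32) here, which is why `can_hash`
           // returns false for those types in this PR.
           _ => None,
       }
   }
   ```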





[GitHub] [arrow-datafusion] AssHero commented on a diff in pull request #2702: Make sure that the data types are supported in hashjoin before genera…

Posted by GitBox <gi...@apache.org>.
AssHero commented on code in PR #2702:
URL: https://github.com/apache/arrow-datafusion/pull/2702#discussion_r893003362


##########
datafusion/core/tests/sql/joins.rs:
##########
@@ -1204,3 +1204,141 @@ async fn join_partitioned() -> Result<()> {
 
     Ok(())
 }
+
+#[tokio::test]
+async fn join_with_hash_unsupported_data_type() -> Result<()> {
+    let ctx = SessionContext::new();
+
+    let schema = Schema::new(vec![
+        Field::new("c1", DataType::Int32, true),
+        Field::new("c2", DataType::Utf8, true),
+        Field::new("c3", DataType::Int64, true),
+        Field::new("c4", DataType::Date32, true),
+    ]);
+    let data = RecordBatch::try_new(
+        Arc::new(schema),
+        vec![
+            Arc::new(Int32Array::from_slice(&[1, 2, 3])),
+            Arc::new(StringArray::from_slice(&["aaa", "bbb", "ccc"])),
+            Arc::new(Int64Array::from_slice(&[100, 200, 300])),
+            Arc::new(Date32Array::from(vec![Some(1), Some(2), Some(3)])),
+        ],
+    )?;
+    let table = MemTable::try_new(data.schema(), vec![vec![data]])?;
+    ctx.register_table("foo", Arc::new(table))?;
+
+    // join on a hash-unsupported data type (Date32): use cross join instead of hash join
+    let sql = "select * from foo t1 join foo t2 on t1.c4 = t2.c4";
+    let msg = format!("Creating logical plan for '{}'", sql);

Review Comment:
   > So I think CrossJoin is almost never what the user would want: as once the tables get beyond any trivial size the query will effectively never finish or will run out of memory. An error is clearer.
   > 
   > From the issue description [#2145 (comment)](https://github.com/apache/arrow-datafusion/issues/2145#issue-1191065060) I think @pjmore's idea to cast unsupported types to a supported type is a good one -- the arrow `cast` kernels are quite efficient for things like `Date32` -> `Int32` (no copies) as the representations are the same
   > 
   > @pjmore what do you think?
   
   This PR only makes joins on hash-unsupported types run as cross joins instead of returning an error or panicking (which I think is unfriendly in a database system).
   Supporting more data types in hash join is a better way to solve this issue, and I'm already working on it.





[GitHub] [arrow-datafusion] andygrove commented on a diff in pull request #2702: Make sure that the data types are supported in hashjoin before genera…

Posted by GitBox <gi...@apache.org>.
andygrove commented on code in PR #2702:
URL: https://github.com/apache/arrow-datafusion/pull/2702#discussion_r890326016


##########
datafusion/expr/src/logical_plan/builder.rs:
##########
@@ -605,17 +606,52 @@ impl LogicalPlanBuilder {
         let on: Vec<(_, _)> = left_keys.into_iter().zip(right_keys.into_iter()).collect();
         let join_schema =
             build_join_schema(self.plan.schema(), right.schema(), &join_type)?;
-
-        Ok(Self::from(LogicalPlan::Join(Join {
-            left: Arc::new(self.plan.clone()),
-            right: Arc::new(right.clone()),
-            on,
-            filter: None,
-            join_type,
-            join_constraint: JoinConstraint::Using,
-            schema: DFSchemaRef::new(join_schema),
-            null_equals_null: false,
-        })))
+        let mut join_on: Vec<(Column, Column)> = vec![];
+        let mut filters: Option<Expr> = None;
+        for (l, r) in &on {
+            if self.plan.schema().field_from_column(l).is_ok()
+                && right.schema().field_from_column(r).is_ok()
+                && can_hash(self.plan.schema().field_from_column(l).unwrap().data_type())
+            {
+                join_on.push((l.clone(), r.clone()));
+            } else if self.plan.schema().field_from_column(r).is_ok()
+                && right.schema().field_from_column(l).is_ok()
+                && can_hash(self.plan.schema().field_from_column(r).unwrap().data_type())
+            {
+                join_on.push((r.clone(), l.clone()));
+            } else {
+                let expr = Expr::BinaryExpr {
+                    left: Box::new(Expr::Column(l.clone())),
+                    op: Operator::Eq,
+                    right: Box::new(Expr::Column(r.clone())),
+                };
+                match filters {
+                    None => filters = Some(expr),
+                    Some(filter_expr) => {
+                        filters = Some(Expr::BinaryExpr {
+                            left: Box::new(expr),
+                            op: Operator::And,
+                            right: Box::new(filter_expr),
+                        });
+                    }

Review Comment:
   ```suggestion
                       Some(filter_expr) => filters = Some(and(expr, filter_expr)),
   ```
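   
   For context, a self-contained sketch of the fold this suggestion produces (the helper name and the standalone setting are hypothetical; `and` and `binary_expr` are the datafusion_expr helpers named in these suggestions): each key pair that cannot be hashed becomes an equality expression and is ANDed onto the accumulated join filter.
   
   ```rust
   use datafusion_common::Column;
   use datafusion_expr::{and, binary_expr, Expr, Operator};
   
   /// Hypothetical helper: combine the key pairs that cannot be used as hash
   /// join keys into a single filter expression for the cross-join fallback.
   fn fold_join_filters(unsupported_keys: &[(Column, Column)]) -> Option<Expr> {
       let mut filters: Option<Expr> = None;
       for (l, r) in unsupported_keys {
           let expr = binary_expr(
               Expr::Column(l.clone()),
               Operator::Eq,
               Expr::Column(r.clone()),
           );
           filters = match filters {
               None => Some(expr),
               Some(filter_expr) => Some(and(expr, filter_expr)),
           };
       }
       filters
   }
   ```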





[GitHub] [arrow-datafusion] andygrove commented on a diff in pull request #2702: Make sure that the data types are supported in hashjoin before genera…

Posted by GitBox <gi...@apache.org>.
andygrove commented on code in PR #2702:
URL: https://github.com/apache/arrow-datafusion/pull/2702#discussion_r890327498


##########
datafusion/expr/src/logical_plan/builder.rs:
##########
@@ -19,6 +19,7 @@
 
 use crate::expr_rewriter::{normalize_col, normalize_cols, rewrite_sort_cols_by_aggs};
 use crate::utils::{columnize_expr, exprlist_to_fields, from_plan};
+use crate::Operator;

Review Comment:
   I left some suggestions below that require these additional imports
   
   ```suggestion
   use crate::{and, binary_expr, Operator};
   ```





[GitHub] [arrow-datafusion] AssHero commented on a diff in pull request #2702: Make sure that the data types are supported in hashjoin before genera…

Posted by GitBox <gi...@apache.org>.
AssHero commented on code in PR #2702:
URL: https://github.com/apache/arrow-datafusion/pull/2702#discussion_r893003362


##########
datafusion/core/tests/sql/joins.rs:
##########
@@ -1204,3 +1204,141 @@ async fn join_partitioned() -> Result<()> {
 
     Ok(())
 }
+
+#[tokio::test]
+async fn join_with_hash_unsupported_data_type() -> Result<()> {
+    let ctx = SessionContext::new();
+
+    let schema = Schema::new(vec![
+        Field::new("c1", DataType::Int32, true),
+        Field::new("c2", DataType::Utf8, true),
+        Field::new("c3", DataType::Int64, true),
+        Field::new("c4", DataType::Date32, true),
+    ]);
+    let data = RecordBatch::try_new(
+        Arc::new(schema),
+        vec![
+            Arc::new(Int32Array::from_slice(&[1, 2, 3])),
+            Arc::new(StringArray::from_slice(&["aaa", "bbb", "ccc"])),
+            Arc::new(Int64Array::from_slice(&[100, 200, 300])),
+            Arc::new(Date32Array::from(vec![Some(1), Some(2), Some(3)])),
+        ],
+    )?;
+    let table = MemTable::try_new(data.schema(), vec![vec![data]])?;
+    ctx.register_table("foo", Arc::new(table))?;
+
+    // join on a hash-unsupported data type (Date32): use cross join instead of hash join
+    let sql = "select * from foo t1 join foo t2 on t1.c4 = t2.c4";
+    let msg = format!("Creating logical plan for '{}'", sql);

Review Comment:
   > So I think CrossJoin is almost never what the user would want: as once the tables get beyond any trivial size the query will effectively never finish or will run out of memory. An error is clearer.
   > 
   > From the issue description [#2145 (comment)](https://github.com/apache/arrow-datafusion/issues/2145#issue-1191065060) I think @pjmore's idea to cast unsupported types to a supported type is a good one -- the arrow `cast` kernels are quite efficient for things like `Date32` -> `Int32` (no copies) as the representations are the same
   > 
   > @pjmore what do you think?
   
   I agree. Supporting more data types in hash join is a better way to solve this issue, and I'm already working on it.
   
   This PR only makes joins on hash-unsupported types run as cross joins instead of returning an error or panicking (which I think is unfriendly in a database system).





[GitHub] [arrow-datafusion] AssHero commented on a diff in pull request #2702: Make sure that the data types are supported in hashjoin before genera…

Posted by GitBox <gi...@apache.org>.
AssHero commented on code in PR #2702:
URL: https://github.com/apache/arrow-datafusion/pull/2702#discussion_r890697708


##########
datafusion/expr/src/logical_plan/builder.rs:
##########
@@ -605,17 +606,52 @@ impl LogicalPlanBuilder {
         let on: Vec<(_, _)> = left_keys.into_iter().zip(right_keys.into_iter()).collect();
         let join_schema =
             build_join_schema(self.plan.schema(), right.schema(), &join_type)?;
-
-        Ok(Self::from(LogicalPlan::Join(Join {
-            left: Arc::new(self.plan.clone()),
-            right: Arc::new(right.clone()),
-            on,
-            filter: None,
-            join_type,
-            join_constraint: JoinConstraint::Using,
-            schema: DFSchemaRef::new(join_schema),
-            null_equals_null: false,
-        })))
+        let mut join_on: Vec<(Column, Column)> = vec![];
+        let mut filters: Option<Expr> = None;
+        for (l, r) in &on {
+            if self.plan.schema().field_from_column(l).is_ok()
+                && right.schema().field_from_column(r).is_ok()
+                && can_hash(self.plan.schema().field_from_column(l).unwrap().data_type())
+            {
+                join_on.push((l.clone(), r.clone()));
+            } else if self.plan.schema().field_from_column(r).is_ok()
+                && right.schema().field_from_column(l).is_ok()
+                && can_hash(self.plan.schema().field_from_column(r).unwrap().data_type())
+            {
+                join_on.push((r.clone(), l.clone()));
+            } else {
+                let expr = Expr::BinaryExpr {
+                    left: Box::new(Expr::Column(l.clone())),
+                    op: Operator::Eq,
+                    right: Box::new(Expr::Column(r.clone())),
+                };
+                match filters {
+                    None => filters = Some(expr),
+                    Some(filter_expr) => {
+                        filters = Some(Expr::BinaryExpr {
+                            left: Box::new(expr),
+                            op: Operator::And,
+                            right: Box::new(filter_expr),
+                        });
+                    }

Review Comment:
   Thank you! I'll fix this.





[GitHub] [arrow-datafusion] AssHero commented on a diff in pull request #2702: Make sure that the data types are supported in hashjoin before genera…

Posted by GitBox <gi...@apache.org>.
AssHero commented on code in PR #2702:
URL: https://github.com/apache/arrow-datafusion/pull/2702#discussion_r893003362


##########
datafusion/core/tests/sql/joins.rs:
##########
@@ -1204,3 +1204,141 @@ async fn join_partitioned() -> Result<()> {
 
     Ok(())
 }
+
+#[tokio::test]
+async fn join_with_hash_unsupported_data_type() -> Result<()> {
+    let ctx = SessionContext::new();
+
+    let schema = Schema::new(vec![
+        Field::new("c1", DataType::Int32, true),
+        Field::new("c2", DataType::Utf8, true),
+        Field::new("c3", DataType::Int64, true),
+        Field::new("c4", DataType::Date32, true),
+    ]);
+    let data = RecordBatch::try_new(
+        Arc::new(schema),
+        vec![
+            Arc::new(Int32Array::from_slice(&[1, 2, 3])),
+            Arc::new(StringArray::from_slice(&["aaa", "bbb", "ccc"])),
+            Arc::new(Int64Array::from_slice(&[100, 200, 300])),
+            Arc::new(Date32Array::from(vec![Some(1), Some(2), Some(3)])),
+        ],
+    )?;
+    let table = MemTable::try_new(data.schema(), vec![vec![data]])?;
+    ctx.register_table("foo", Arc::new(table))?;
+
+    // join on a hash-unsupported data type (Date32): use cross join instead of hash join
+    let sql = "select * from foo t1 join foo t2 on t1.c4 = t2.c4";
+    let msg = format!("Creating logical plan for '{}'", sql);

Review Comment:
   > So I think CrossJoin is almost never what the user would want: as once the tables get beyond any trivial size the query will effectively never finish or will run out of memory. An error is clearer.
   > 
   > From the issue description [#2145 (comment)](https://github.com/apache/arrow-datafusion/issues/2145#issue-1191065060) I think @pjmore's idea to cast unsupported types to a supported type is a good one -- the arrow `cast` kernels are quite efficient for things like `Date32` -> `Int32` (no copies) as the representations are the same
   > 
   > @pjmore what do you think?
   
   I agree. Supporting more data types in hash join is a better way to solve this issue, and I'm already working on it.
   This PR only makes joins on hash-unsupported types run as cross joins instead of returning an error or panicking; we can keep adding hash join support for more data types over time.





[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #2702: Make sure that the data types are supported in hashjoin before genera…

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #2702:
URL: https://github.com/apache/arrow-datafusion/pull/2702#discussion_r892808334


##########
datafusion/expr/src/utils.rs:
##########
@@ -643,6 +644,35 @@ pub fn expr_as_column_expr(expr: &Expr, plan: &LogicalPlan) -> Result<Expr> {
     }
 }
 
+/// Can this data type be used in hash join equality conditions?
+/// If more data types become supported in hash join, add them here
+/// so a hash join logical plan can be generated for them.
+pub fn can_hash(data_type: &DataType) -> bool {
+    match data_type {

Review Comment:
   I think it may help to add a comment here (or in `equal_rows`) mentioning they need to remain in sync



##########
datafusion/core/tests/sql/joins.rs:
##########
@@ -1204,3 +1204,141 @@ async fn join_partitioned() -> Result<()> {
 
     Ok(())
 }
+
+#[tokio::test]
+async fn join_with_hash_unsupported_data_type() -> Result<()> {
+    let ctx = SessionContext::new();
+
+    let schema = Schema::new(vec![
+        Field::new("c1", DataType::Int32, true),
+        Field::new("c2", DataType::Utf8, true),
+        Field::new("c3", DataType::Int64, true),
+        Field::new("c4", DataType::Date32, true),
+    ]);
+    let data = RecordBatch::try_new(
+        Arc::new(schema),
+        vec![
+            Arc::new(Int32Array::from_slice(&[1, 2, 3])),
+            Arc::new(StringArray::from_slice(&["aaa", "bbb", "ccc"])),
+            Arc::new(Int64Array::from_slice(&[100, 200, 300])),
+            Arc::new(Date32Array::from(vec![Some(1), Some(2), Some(3)])),
+        ],
+    )?;
+    let table = MemTable::try_new(data.schema(), vec![vec![data]])?;
+    ctx.register_table("foo", Arc::new(table))?;
+
+    // join on a hash-unsupported data type (Date32): use cross join instead of hash join
+    let sql = "select * from foo t1 join foo t2 on t1.c4 = t2.c4";
+    let msg = format!("Creating logical plan for '{}'", sql);

Review Comment:
   So I think CrossJoin is almost never what the user would want: as once the tables get beyond any trivial size the query will effectively never finish or will run out of memory. An error is clearer.
   
   
   From the issue description https://github.com/apache/arrow-datafusion/issues/2145#issue-1191065060 I think @pjmore's idea to cast unsupported types to a supported type is a good one -- the arrow `cast` kernels are quite efficient for things like `Date32` -> `Int32` (no copies) as the representations are the same
   
   @pjmore  what do you think? 
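   
   For reference, a minimal sketch of that cast (illustrative values, assuming the arrow-rs `cast` kernel; not code from this PR):
   
   ```rust
   use std::sync::Arc;
   
   use arrow::array::{ArrayRef, Date32Array, Int32Array};
   use arrow::compute::cast;
   use arrow::datatypes::DataType;
   use arrow::error::Result;
   
   fn main() -> Result<()> {
       // Date32 stores days since the UNIX epoch as i32, so casting to Int32
       // reinterprets the values rather than rewriting them.
       let dates: ArrayRef = Arc::new(Date32Array::from(vec![Some(1), Some(2), None]));
       let as_int = cast(&dates, &DataType::Int32)?;
       let as_int = as_int.as_any().downcast_ref::<Int32Array>().unwrap();
   
       assert_eq!(as_int.value(0), 1);
       assert_eq!(as_int.value(1), 2);
       assert!(as_int.is_null(2));
       Ok(())
   }
   ```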
   
   


