You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/06/06 02:59:28 UTC
[GitHub] [arrow-datafusion] AssHero opened a new pull request, #2702: Make sure that the data types are supported in hashjoin before genera…
AssHero opened a new pull request, #2702:
URL: https://github.com/apache/arrow-datafusion/pull/2702
# Which issue does this PR close?
Closes #2145
# Rationale for this change
Before generating the hash join logical plan, make sure the data types in equal conditions are supported. Otherwise, try
cross join instead.
# What changes are included in this PR?
1. add can_hash func in datafusion/expr/src/utils.rs
2. use can_hash in datafusion/expr/src/logical_plan/builder.rs and datafusion/sql/src/planner.rs to check if the data type is supported in hashjoin or not
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] AssHero commented on a diff in pull request #2702: Make sure that the data types are supported in hashjoin before genera…
Posted by GitBox <gi...@apache.org>.
AssHero commented on code in PR #2702:
URL: https://github.com/apache/arrow-datafusion/pull/2702#discussion_r893003362
##########
datafusion/core/tests/sql/joins.rs:
##########
@@ -1204,3 +1204,141 @@ async fn join_partitioned() -> Result<()> {
Ok(())
}
+
+#[tokio::test]
+async fn join_with_hash_unsupported_data_type() -> Result<()> {
+ let ctx = SessionContext::new();
+
+ let schema = Schema::new(vec![
+ Field::new("c1", DataType::Int32, true),
+ Field::new("c2", DataType::Utf8, true),
+ Field::new("c3", DataType::Int64, true),
+ Field::new("c4", DataType::Date32, true),
+ ]);
+ let data = RecordBatch::try_new(
+ Arc::new(schema),
+ vec![
+ Arc::new(Int32Array::from_slice(&[1, 2, 3])),
+ Arc::new(StringArray::from_slice(&["aaa", "bbb", "ccc"])),
+ Arc::new(Int64Array::from_slice(&[100, 200, 300])),
+ Arc::new(Date32Array::from(vec![Some(1), Some(2), Some(3)])),
+ ],
+ )?;
+ let table = MemTable::try_new(data.schema(), vec![vec![data]])?;
+ ctx.register_table("foo", Arc::new(table))?;
+
+ // join on hash unsupported data type (Date32), use cross join instead hash join
+ let sql = "select * from foo t1 join foo t2 on t1.c4 = t2.c4";
+ let msg = format!("Creating logical plan for '{}'", sql);
Review Comment:
> So I think CrossJoin is almost never what the user would want: as once the tables get beyond any trivial size the query will effectively never finish or will run out of memory. An error is clearer.
>
> From the issue description [#2145 (comment)](https://github.com/apache/arrow-datafusion/issues/2145#issue-1191065060) I think @pjmore's idea to cast unsupported types to a supported type is a good one -- the arrow `cast` kernels are quite efficient for things like `Date32` -> `Int32` (no copies) as the representations are the same
>
> @pjmore what do you think?
I agree. supporting more data types in hash join is the better way to solve this issue, and i'm already working on it.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] alamb commented on pull request #2702: Make sure that the data types are supported in hashjoin before genera…
Posted by GitBox <gi...@apache.org>.
alamb commented on PR #2702:
URL: https://github.com/apache/arrow-datafusion/pull/2702#issuecomment-1150961878
Thank you again @AssHero
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] alamb merged pull request #2702: Make sure that the data types are supported in hashjoin before genera…
Posted by GitBox <gi...@apache.org>.
alamb merged PR #2702:
URL: https://github.com/apache/arrow-datafusion/pull/2702
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] AssHero commented on a diff in pull request #2702: Make sure that the data types are supported in hashjoin before genera…
Posted by GitBox <gi...@apache.org>.
AssHero commented on code in PR #2702:
URL: https://github.com/apache/arrow-datafusion/pull/2702#discussion_r890701528
##########
datafusion/expr/src/logical_plan/builder.rs:
##########
@@ -19,6 +19,7 @@
use crate::expr_rewriter::{normalize_col, normalize_cols, rewrite_sort_cols_by_aggs};
use crate::utils::{columnize_expr, exprlist_to_fields, from_plan};
+use crate::Operator;
Review Comment:
Thank you! I'll fix this.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] codecov-commenter commented on pull request #2702: Make sure that the data types are supported in hashjoin before genera…
Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on PR #2702:
URL: https://github.com/apache/arrow-datafusion/pull/2702#issuecomment-1147067806
# [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/2702?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
> Merging [#2702](https://codecov.io/gh/apache/arrow-datafusion/pull/2702?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (9d74954) into [master](https://codecov.io/gh/apache/arrow-datafusion/commit/930653415c6303b34f856826ba73717b25b9574c?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (9306534) will **decrease** coverage by `0.02%`.
> The diff coverage is `77.86%`.
```diff
@@ Coverage Diff @@
## master #2702 +/- ##
==========================================
- Coverage 84.66% 84.63% -0.03%
==========================================
Files 270 270
Lines 46919 47039 +120
==========================================
+ Hits 39724 39813 +89
- Misses 7195 7226 +31
```
| [Impacted Files](https://codecov.io/gh/apache/arrow-datafusion/pull/2702?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
|---|---|---|
| [datafusion/expr/src/logical\_plan/builder.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2702/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9leHByL3NyYy9sb2dpY2FsX3BsYW4vYnVpbGRlci5ycw==) | `89.02% <51.72%> (-2.92%)` | :arrow_down: |
| [datafusion/expr/src/utils.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2702/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9leHByL3NyYy91dGlscy5ycw==) | `90.39% <68.18%> (-1.48%)` | :arrow_down: |
| [datafusion/sql/src/planner.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2702/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcWwvc3JjL3BsYW5uZXIucnM=) | `81.33% <69.23%> (-0.19%)` | :arrow_down: |
| [datafusion/core/tests/sql/joins.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2702/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9jb3JlL3Rlc3RzL3NxbC9qb2lucy5ycw==) | `99.07% <100.00%> (+0.13%)` | :arrow_up: |
| [datafusion/core/src/physical\_plan/metrics/value.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2702/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9jb3JlL3NyYy9waHlzaWNhbF9wbGFuL21ldHJpY3MvdmFsdWUucnM=) | `86.93% <0.00%> (-0.51%)` | :arrow_down: |
| [datafusion/expr/src/logical\_plan/plan.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2702/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9leHByL3NyYy9sb2dpY2FsX3BsYW4vcGxhbi5ycw==) | `68.47% <0.00%> (+0.51%)` | :arrow_up: |
| [datafusion/expr/src/window\_frame.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2702/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9leHByL3NyYy93aW5kb3dfZnJhbWUucnM=) | `93.27% <0.00%> (+0.84%)` | :arrow_up: |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/2702?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/2702?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [9306534...9d74954](https://codecov.io/gh/apache/arrow-datafusion/pull/2702?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] liukun4515 commented on a diff in pull request #2702: Make sure that the data types are supported in hashjoin before genera…
Posted by GitBox <gi...@apache.org>.
liukun4515 commented on code in PR #2702:
URL: https://github.com/apache/arrow-datafusion/pull/2702#discussion_r890710049
##########
datafusion/expr/src/utils.rs:
##########
@@ -643,6 +644,35 @@ pub fn expr_as_column_expr(expr: &Expr, plan: &LogicalPlan) -> Result<Expr> {
}
}
+/// can this data type be used in hash join equal conditions??
+/// If more data types are supported in hash join, add those data types here
+/// to generate join logical plan.
+pub fn can_hash(data_type: &DataType) -> bool {
+ match data_type {
Review Comment:
decimal should be supported here
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] andygrove commented on a diff in pull request #2702: Make sure that the data types are supported in hashjoin before genera…
Posted by GitBox <gi...@apache.org>.
andygrove commented on code in PR #2702:
URL: https://github.com/apache/arrow-datafusion/pull/2702#discussion_r890326974
##########
datafusion/expr/src/logical_plan/builder.rs:
##########
@@ -605,17 +606,52 @@ impl LogicalPlanBuilder {
let on: Vec<(_, _)> = left_keys.into_iter().zip(right_keys.into_iter()).collect();
let join_schema =
build_join_schema(self.plan.schema(), right.schema(), &join_type)?;
-
- Ok(Self::from(LogicalPlan::Join(Join {
- left: Arc::new(self.plan.clone()),
- right: Arc::new(right.clone()),
- on,
- filter: None,
- join_type,
- join_constraint: JoinConstraint::Using,
- schema: DFSchemaRef::new(join_schema),
- null_equals_null: false,
- })))
+ let mut join_on: Vec<(Column, Column)> = vec![];
+ let mut filters: Option<Expr> = None;
+ for (l, r) in &on {
+ if self.plan.schema().field_from_column(l).is_ok()
+ && right.schema().field_from_column(r).is_ok()
+ && can_hash(self.plan.schema().field_from_column(l).unwrap().data_type())
+ {
+ join_on.push((l.clone(), r.clone()));
+ } else if self.plan.schema().field_from_column(r).is_ok()
+ && right.schema().field_from_column(l).is_ok()
+ && can_hash(self.plan.schema().field_from_column(r).unwrap().data_type())
+ {
+ join_on.push((r.clone(), l.clone()));
+ } else {
+ let expr = Expr::BinaryExpr {
+ left: Box::new(Expr::Column(l.clone())),
+ op: Operator::Eq,
+ right: Box::new(Expr::Column(r.clone())),
+ };
Review Comment:
```suggestion
let expr = binary_expr(
Expr::Column(l.clone()),
Operator::Eq,
Expr::Column(r.clone()),
);
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] AssHero commented on a diff in pull request #2702: Make sure that the data types are supported in hashjoin before genera…
Posted by GitBox <gi...@apache.org>.
AssHero commented on code in PR #2702:
URL: https://github.com/apache/arrow-datafusion/pull/2702#discussion_r893003697
##########
datafusion/expr/src/utils.rs:
##########
@@ -643,6 +644,35 @@ pub fn expr_as_column_expr(expr: &Expr, plan: &LogicalPlan) -> Result<Expr> {
}
}
+/// can this data type be used in hash join equal conditions??
+/// If more data types are supported in hash join, add those data types here
+/// to generate join logical plan.
+pub fn can_hash(data_type: &DataType) -> bool {
+ match data_type {
Review Comment:
> I think it may help to add a comment here (or in `equal_rows`) mentioning they need to remain in sync
I'll add this comment.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] andygrove commented on a diff in pull request #2702: Make sure that the data types are supported in hashjoin before genera…
Posted by GitBox <gi...@apache.org>.
andygrove commented on code in PR #2702:
URL: https://github.com/apache/arrow-datafusion/pull/2702#discussion_r890325614
##########
datafusion/expr/src/logical_plan/builder.rs:
##########
@@ -605,17 +606,52 @@ impl LogicalPlanBuilder {
let on: Vec<(_, _)> = left_keys.into_iter().zip(right_keys.into_iter()).collect();
let join_schema =
build_join_schema(self.plan.schema(), right.schema(), &join_type)?;
-
- Ok(Self::from(LogicalPlan::Join(Join {
- left: Arc::new(self.plan.clone()),
- right: Arc::new(right.clone()),
- on,
- filter: None,
- join_type,
- join_constraint: JoinConstraint::Using,
- schema: DFSchemaRef::new(join_schema),
- null_equals_null: false,
- })))
+ let mut join_on: Vec<(Column, Column)> = vec![];
+ let mut filters: Option<Expr> = None;
+ for (l, r) in &on {
+ if self.plan.schema().field_from_column(l).is_ok()
+ && right.schema().field_from_column(r).is_ok()
+ && can_hash(self.plan.schema().field_from_column(l).unwrap().data_type())
+ {
+ join_on.push((l.clone(), r.clone()));
+ } else if self.plan.schema().field_from_column(r).is_ok()
+ && right.schema().field_from_column(l).is_ok()
+ && can_hash(self.plan.schema().field_from_column(r).unwrap().data_type())
+ {
+ join_on.push((r.clone(), l.clone()));
+ } else {
+ let expr = Expr::BinaryExpr {
+ left: Box::new(Expr::Column(l.clone())),
+ op: Operator::Eq,
+ right: Box::new(Expr::Column(r.clone())),
+ };
Review Comment:
nit: we can use `binary_expr` helper here for more concise code
```
let expr = binary_expr(
Expr::Column(l.clone()),
Operator::Eq,
Expr::Column(r.clone()),
);
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] AssHero commented on a diff in pull request #2702: Make sure that the data types are supported in hashjoin before genera…
Posted by GitBox <gi...@apache.org>.
AssHero commented on code in PR #2702:
URL: https://github.com/apache/arrow-datafusion/pull/2702#discussion_r890711868
##########
datafusion/expr/src/utils.rs:
##########
@@ -643,6 +644,35 @@ pub fn expr_as_column_expr(expr: &Expr, plan: &LogicalPlan) -> Result<Expr> {
}
}
+/// can this data type be used in hash join equal conditions??
+/// If more data types are supported in hash join, add those data types here
+/// to generate join logical plan.
+pub fn can_hash(data_type: &DataType) -> bool {
+ match data_type {
Review Comment:
Data types here come from function equal_rows, decimal is not supported in equal_rows.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] AssHero commented on a diff in pull request #2702: Make sure that the data types are supported in hashjoin before genera…
Posted by GitBox <gi...@apache.org>.
AssHero commented on code in PR #2702:
URL: https://github.com/apache/arrow-datafusion/pull/2702#discussion_r890701328
##########
datafusion/expr/src/logical_plan/builder.rs:
##########
@@ -605,17 +606,52 @@ impl LogicalPlanBuilder {
let on: Vec<(_, _)> = left_keys.into_iter().zip(right_keys.into_iter()).collect();
let join_schema =
build_join_schema(self.plan.schema(), right.schema(), &join_type)?;
-
- Ok(Self::from(LogicalPlan::Join(Join {
- left: Arc::new(self.plan.clone()),
- right: Arc::new(right.clone()),
- on,
- filter: None,
- join_type,
- join_constraint: JoinConstraint::Using,
- schema: DFSchemaRef::new(join_schema),
- null_equals_null: false,
- })))
+ let mut join_on: Vec<(Column, Column)> = vec![];
+ let mut filters: Option<Expr> = None;
+ for (l, r) in &on {
+ if self.plan.schema().field_from_column(l).is_ok()
+ && right.schema().field_from_column(r).is_ok()
+ && can_hash(self.plan.schema().field_from_column(l).unwrap().data_type())
+ {
+ join_on.push((l.clone(), r.clone()));
+ } else if self.plan.schema().field_from_column(r).is_ok()
+ && right.schema().field_from_column(l).is_ok()
+ && can_hash(self.plan.schema().field_from_column(r).unwrap().data_type())
+ {
+ join_on.push((r.clone(), l.clone()));
+ } else {
+ let expr = Expr::BinaryExpr {
+ left: Box::new(Expr::Column(l.clone())),
+ op: Operator::Eq,
+ right: Box::new(Expr::Column(r.clone())),
+ };
Review Comment:
Thank you! I'll fix this.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] AssHero commented on a diff in pull request #2702: Make sure that the data types are supported in hashjoin before genera…
Posted by GitBox <gi...@apache.org>.
AssHero commented on code in PR #2702:
URL: https://github.com/apache/arrow-datafusion/pull/2702#discussion_r890711868
##########
datafusion/expr/src/utils.rs:
##########
@@ -643,6 +644,35 @@ pub fn expr_as_column_expr(expr: &Expr, plan: &LogicalPlan) -> Result<Expr> {
}
}
+/// can this data type be used in hash join equal conditions??
+/// If more data types are supported in hash join, add those data types here
+/// to generate join logical plan.
+pub fn can_hash(data_type: &DataType) -> bool {
+ match data_type {
Review Comment:
Data types here come from function equal_rows, decimal is not supported in equal_rows so that hash join currently does not support joining on columns of decimal data type. That's why decimal is not here.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] AssHero commented on a diff in pull request #2702: Make sure that the data types are supported in hashjoin before genera…
Posted by GitBox <gi...@apache.org>.
AssHero commented on code in PR #2702:
URL: https://github.com/apache/arrow-datafusion/pull/2702#discussion_r893003362
##########
datafusion/core/tests/sql/joins.rs:
##########
@@ -1204,3 +1204,141 @@ async fn join_partitioned() -> Result<()> {
Ok(())
}
+
+#[tokio::test]
+async fn join_with_hash_unsupported_data_type() -> Result<()> {
+ let ctx = SessionContext::new();
+
+ let schema = Schema::new(vec![
+ Field::new("c1", DataType::Int32, true),
+ Field::new("c2", DataType::Utf8, true),
+ Field::new("c3", DataType::Int64, true),
+ Field::new("c4", DataType::Date32, true),
+ ]);
+ let data = RecordBatch::try_new(
+ Arc::new(schema),
+ vec![
+ Arc::new(Int32Array::from_slice(&[1, 2, 3])),
+ Arc::new(StringArray::from_slice(&["aaa", "bbb", "ccc"])),
+ Arc::new(Int64Array::from_slice(&[100, 200, 300])),
+ Arc::new(Date32Array::from(vec![Some(1), Some(2), Some(3)])),
+ ],
+ )?;
+ let table = MemTable::try_new(data.schema(), vec![vec![data]])?;
+ ctx.register_table("foo", Arc::new(table))?;
+
+ // join on hash unsupported data type (Date32), use cross join instead hash join
+ let sql = "select * from foo t1 join foo t2 on t1.c4 = t2.c4";
+ let msg = format!("Creating logical plan for '{}'", sql);
Review Comment:
> So I think CrossJoin is almost never what the user would want: as once the tables get beyond any trivial size the query will effectively never finish or will run out of memory. An error is clearer.
>
> From the issue description [#2145 (comment)](https://github.com/apache/arrow-datafusion/issues/2145#issue-1191065060) I think @pjmore's idea to cast unsupported types to a supported type is a good one -- the arrow `cast` kernels are quite efficient for things like `Date32` -> `Int32` (no copies) as the representations are the same
>
> @pjmore what do you think?
this pr is only to make the hash unsupported join running in cross join instead of error/panic(I think it is not friendly in database system).
supporting more data types in hash join is the better way to solve this issue, and i'm already working on it.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] andygrove commented on a diff in pull request #2702: Make sure that the data types are supported in hashjoin before genera…
Posted by GitBox <gi...@apache.org>.
andygrove commented on code in PR #2702:
URL: https://github.com/apache/arrow-datafusion/pull/2702#discussion_r890326016
##########
datafusion/expr/src/logical_plan/builder.rs:
##########
@@ -605,17 +606,52 @@ impl LogicalPlanBuilder {
let on: Vec<(_, _)> = left_keys.into_iter().zip(right_keys.into_iter()).collect();
let join_schema =
build_join_schema(self.plan.schema(), right.schema(), &join_type)?;
-
- Ok(Self::from(LogicalPlan::Join(Join {
- left: Arc::new(self.plan.clone()),
- right: Arc::new(right.clone()),
- on,
- filter: None,
- join_type,
- join_constraint: JoinConstraint::Using,
- schema: DFSchemaRef::new(join_schema),
- null_equals_null: false,
- })))
+ let mut join_on: Vec<(Column, Column)> = vec![];
+ let mut filters: Option<Expr> = None;
+ for (l, r) in &on {
+ if self.plan.schema().field_from_column(l).is_ok()
+ && right.schema().field_from_column(r).is_ok()
+ && can_hash(self.plan.schema().field_from_column(l).unwrap().data_type())
+ {
+ join_on.push((l.clone(), r.clone()));
+ } else if self.plan.schema().field_from_column(r).is_ok()
+ && right.schema().field_from_column(l).is_ok()
+ && can_hash(self.plan.schema().field_from_column(r).unwrap().data_type())
+ {
+ join_on.push((r.clone(), l.clone()));
+ } else {
+ let expr = Expr::BinaryExpr {
+ left: Box::new(Expr::Column(l.clone())),
+ op: Operator::Eq,
+ right: Box::new(Expr::Column(r.clone())),
+ };
+ match filters {
+ None => filters = Some(expr),
+ Some(filter_expr) => {
+ filters = Some(Expr::BinaryExpr {
+ left: Box::new(expr),
+ op: Operator::And,
+ right: Box::new(filter_expr),
+ });
+ }
Review Comment:
```suggestion
Some(filter_expr) => filters = Some(and(expr, filter_expr)),
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] andygrove commented on a diff in pull request #2702: Make sure that the data types are supported in hashjoin before genera…
Posted by GitBox <gi...@apache.org>.
andygrove commented on code in PR #2702:
URL: https://github.com/apache/arrow-datafusion/pull/2702#discussion_r890327498
##########
datafusion/expr/src/logical_plan/builder.rs:
##########
@@ -19,6 +19,7 @@
use crate::expr_rewriter::{normalize_col, normalize_cols, rewrite_sort_cols_by_aggs};
use crate::utils::{columnize_expr, exprlist_to_fields, from_plan};
+use crate::Operator;
Review Comment:
I let some suggestions below that require these additional imports
```suggestion
use crate::{and, binary_expr, Operator};
```
##########
datafusion/expr/src/logical_plan/builder.rs:
##########
@@ -19,6 +19,7 @@
use crate::expr_rewriter::{normalize_col, normalize_cols, rewrite_sort_cols_by_aggs};
use crate::utils::{columnize_expr, exprlist_to_fields, from_plan};
+use crate::Operator;
Review Comment:
I left some suggestions below that require these additional imports
```suggestion
use crate::{and, binary_expr, Operator};
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] AssHero commented on a diff in pull request #2702: Make sure that the data types are supported in hashjoin before genera…
Posted by GitBox <gi...@apache.org>.
AssHero commented on code in PR #2702:
URL: https://github.com/apache/arrow-datafusion/pull/2702#discussion_r893003362
##########
datafusion/core/tests/sql/joins.rs:
##########
@@ -1204,3 +1204,141 @@ async fn join_partitioned() -> Result<()> {
Ok(())
}
+
+#[tokio::test]
+async fn join_with_hash_unsupported_data_type() -> Result<()> {
+ let ctx = SessionContext::new();
+
+ let schema = Schema::new(vec![
+ Field::new("c1", DataType::Int32, true),
+ Field::new("c2", DataType::Utf8, true),
+ Field::new("c3", DataType::Int64, true),
+ Field::new("c4", DataType::Date32, true),
+ ]);
+ let data = RecordBatch::try_new(
+ Arc::new(schema),
+ vec![
+ Arc::new(Int32Array::from_slice(&[1, 2, 3])),
+ Arc::new(StringArray::from_slice(&["aaa", "bbb", "ccc"])),
+ Arc::new(Int64Array::from_slice(&[100, 200, 300])),
+ Arc::new(Date32Array::from(vec![Some(1), Some(2), Some(3)])),
+ ],
+ )?;
+ let table = MemTable::try_new(data.schema(), vec![vec![data]])?;
+ ctx.register_table("foo", Arc::new(table))?;
+
+ // join on hash unsupported data type (Date32), use cross join instead hash join
+ let sql = "select * from foo t1 join foo t2 on t1.c4 = t2.c4";
+ let msg = format!("Creating logical plan for '{}'", sql);
Review Comment:
> So I think CrossJoin is almost never what the user would want: as once the tables get beyond any trivial size the query will effectively never finish or will run out of memory. An error is clearer.
>
> From the issue description [#2145 (comment)](https://github.com/apache/arrow-datafusion/issues/2145#issue-1191065060) I think @pjmore's idea to cast unsupported types to a supported type is a good one -- the arrow `cast` kernels are quite efficient for things like `Date32` -> `Int32` (no copies) as the representations are the same
>
> @pjmore what do you think?
I agree. supporting more data types in hash join is the better way to solve this issue, and i'm already working on it.
this pr is only to make the hash unsupported join running in cross join instead of error/panic(I think it is not friendly in database system).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] AssHero commented on a diff in pull request #2702: Make sure that the data types are supported in hashjoin before genera…
Posted by GitBox <gi...@apache.org>.
AssHero commented on code in PR #2702:
URL: https://github.com/apache/arrow-datafusion/pull/2702#discussion_r890697708
##########
datafusion/expr/src/logical_plan/builder.rs:
##########
@@ -605,17 +606,52 @@ impl LogicalPlanBuilder {
let on: Vec<(_, _)> = left_keys.into_iter().zip(right_keys.into_iter()).collect();
let join_schema =
build_join_schema(self.plan.schema(), right.schema(), &join_type)?;
-
- Ok(Self::from(LogicalPlan::Join(Join {
- left: Arc::new(self.plan.clone()),
- right: Arc::new(right.clone()),
- on,
- filter: None,
- join_type,
- join_constraint: JoinConstraint::Using,
- schema: DFSchemaRef::new(join_schema),
- null_equals_null: false,
- })))
+ let mut join_on: Vec<(Column, Column)> = vec![];
+ let mut filters: Option<Expr> = None;
+ for (l, r) in &on {
+ if self.plan.schema().field_from_column(l).is_ok()
+ && right.schema().field_from_column(r).is_ok()
+ && can_hash(self.plan.schema().field_from_column(l).unwrap().data_type())
+ {
+ join_on.push((l.clone(), r.clone()));
+ } else if self.plan.schema().field_from_column(r).is_ok()
+ && right.schema().field_from_column(l).is_ok()
+ && can_hash(self.plan.schema().field_from_column(r).unwrap().data_type())
+ {
+ join_on.push((r.clone(), l.clone()));
+ } else {
+ let expr = Expr::BinaryExpr {
+ left: Box::new(Expr::Column(l.clone())),
+ op: Operator::Eq,
+ right: Box::new(Expr::Column(r.clone())),
+ };
+ match filters {
+ None => filters = Some(expr),
+ Some(filter_expr) => {
+ filters = Some(Expr::BinaryExpr {
+ left: Box::new(expr),
+ op: Operator::And,
+ right: Box::new(filter_expr),
+ });
+ }
Review Comment:
Thank you! I'll fix this.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] AssHero commented on a diff in pull request #2702: Make sure that the data types are supported in hashjoin before genera…
Posted by GitBox <gi...@apache.org>.
AssHero commented on code in PR #2702:
URL: https://github.com/apache/arrow-datafusion/pull/2702#discussion_r893003362
##########
datafusion/core/tests/sql/joins.rs:
##########
@@ -1204,3 +1204,141 @@ async fn join_partitioned() -> Result<()> {
Ok(())
}
+
+#[tokio::test]
+async fn join_with_hash_unsupported_data_type() -> Result<()> {
+ let ctx = SessionContext::new();
+
+ let schema = Schema::new(vec![
+ Field::new("c1", DataType::Int32, true),
+ Field::new("c2", DataType::Utf8, true),
+ Field::new("c3", DataType::Int64, true),
+ Field::new("c4", DataType::Date32, true),
+ ]);
+ let data = RecordBatch::try_new(
+ Arc::new(schema),
+ vec![
+ Arc::new(Int32Array::from_slice(&[1, 2, 3])),
+ Arc::new(StringArray::from_slice(&["aaa", "bbb", "ccc"])),
+ Arc::new(Int64Array::from_slice(&[100, 200, 300])),
+ Arc::new(Date32Array::from(vec![Some(1), Some(2), Some(3)])),
+ ],
+ )?;
+ let table = MemTable::try_new(data.schema(), vec![vec![data]])?;
+ ctx.register_table("foo", Arc::new(table))?;
+
+ // join on hash unsupported data type (Date32), use cross join instead hash join
+ let sql = "select * from foo t1 join foo t2 on t1.c4 = t2.c4";
+ let msg = format!("Creating logical plan for '{}'", sql);
Review Comment:
> So I think CrossJoin is almost never what the user would want: as once the tables get beyond any trivial size the query will effectively never finish or will run out of memory. An error is clearer.
>
> From the issue description [#2145 (comment)](https://github.com/apache/arrow-datafusion/issues/2145#issue-1191065060) I think @pjmore's idea to cast unsupported types to a supported type is a good one -- the arrow `cast` kernels are quite efficient for things like `Date32` -> `Int32` (no copies) as the representations are the same
>
> @pjmore what do you think?
I agree. supporting more data types in hash join is the better way to solve this issue, and i'm already working on it.
And this pr only wants to make hash unsupported join running in cross join instead of error/panic, we can support more data types in hash join continuously.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #2702: Make sure that the data types are supported in hashjoin before genera…
Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #2702:
URL: https://github.com/apache/arrow-datafusion/pull/2702#discussion_r892808334
##########
datafusion/expr/src/utils.rs:
##########
@@ -643,6 +644,35 @@ pub fn expr_as_column_expr(expr: &Expr, plan: &LogicalPlan) -> Result<Expr> {
}
}
+/// can this data type be used in hash join equal conditions??
+/// If more data types are supported in hash join, add those data types here
+/// to generate join logical plan.
+pub fn can_hash(data_type: &DataType) -> bool {
+ match data_type {
Review Comment:
I think it may help to add a comment here (or in `equal_rows`) mentioning they need to remain in sync
##########
datafusion/core/tests/sql/joins.rs:
##########
@@ -1204,3 +1204,141 @@ async fn join_partitioned() -> Result<()> {
Ok(())
}
+
+#[tokio::test]
+async fn join_with_hash_unsupported_data_type() -> Result<()> {
+ let ctx = SessionContext::new();
+
+ let schema = Schema::new(vec![
+ Field::new("c1", DataType::Int32, true),
+ Field::new("c2", DataType::Utf8, true),
+ Field::new("c3", DataType::Int64, true),
+ Field::new("c4", DataType::Date32, true),
+ ]);
+ let data = RecordBatch::try_new(
+ Arc::new(schema),
+ vec![
+ Arc::new(Int32Array::from_slice(&[1, 2, 3])),
+ Arc::new(StringArray::from_slice(&["aaa", "bbb", "ccc"])),
+ Arc::new(Int64Array::from_slice(&[100, 200, 300])),
+ Arc::new(Date32Array::from(vec![Some(1), Some(2), Some(3)])),
+ ],
+ )?;
+ let table = MemTable::try_new(data.schema(), vec![vec![data]])?;
+ ctx.register_table("foo", Arc::new(table))?;
+
+ // join on hash unsupported data type (Date32), use cross join instead hash join
+ let sql = "select * from foo t1 join foo t2 on t1.c4 = t2.c4";
+ let msg = format!("Creating logical plan for '{}'", sql);
Review Comment:
So I think CrossJoin is almost never what the user would want: as once the tables get beyond any trivial size the query will effectively never finish or will run out of memory. An error is clearer.
From the issue description https://github.com/apache/arrow-datafusion/issues/2145#issue-1191065060 I think @pjmore's idea to cast unsupported types to a supported type is a good one -- the arrow `cast` kernels are quite efficient for things like `Date32` -> `Int32` (no copies) as the representations are the same
@pjmore what do you think?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org