You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/06/08 19:59:23 UTC

[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #2702: Make sure that the data types are supported in hashjoin before genera…

alamb commented on code in PR #2702:
URL: https://github.com/apache/arrow-datafusion/pull/2702#discussion_r892808334


##########
datafusion/expr/src/utils.rs:
##########
@@ -643,6 +644,35 @@ pub fn expr_as_column_expr(expr: &Expr, plan: &LogicalPlan) -> Result<Expr> {
     }
 }
 
+/// can this data type be used in hash join equal conditions??
+/// If more data types are supported in hash join, add those data types here
+/// to generate join logical plan.
+pub fn can_hash(data_type: &DataType) -> bool {
+    match data_type {

Review Comment:
   I think it may help to add a comment here (or in `equal_rows`) mentioning they need to remain in sync



##########
datafusion/core/tests/sql/joins.rs:
##########
@@ -1204,3 +1204,141 @@ async fn join_partitioned() -> Result<()> {
 
     Ok(())
 }
+
+#[tokio::test]
+async fn join_with_hash_unsupported_data_type() -> Result<()> {
+    let ctx = SessionContext::new();
+
+    let schema = Schema::new(vec![
+        Field::new("c1", DataType::Int32, true),
+        Field::new("c2", DataType::Utf8, true),
+        Field::new("c3", DataType::Int64, true),
+        Field::new("c4", DataType::Date32, true),
+    ]);
+    let data = RecordBatch::try_new(
+        Arc::new(schema),
+        vec![
+            Arc::new(Int32Array::from_slice(&[1, 2, 3])),
+            Arc::new(StringArray::from_slice(&["aaa", "bbb", "ccc"])),
+            Arc::new(Int64Array::from_slice(&[100, 200, 300])),
+            Arc::new(Date32Array::from(vec![Some(1), Some(2), Some(3)])),
+        ],
+    )?;
+    let table = MemTable::try_new(data.schema(), vec![vec![data]])?;
+    ctx.register_table("foo", Arc::new(table))?;
+
+    // join on hash unsupported data type (Date32), use cross join instead hash join
+    let sql = "select * from foo t1 join foo t2 on t1.c4 = t2.c4";
+    let msg = format!("Creating logical plan for '{}'", sql);

Review Comment:
   So I think CrossJoin is almost never what the user would want: as once the tables get beyond any trivial size the query will effectively never finish or will run out of memory. An error is clearer.
   
   
   From the issue description https://github.com/apache/arrow-datafusion/issues/2145#issue-1191065060 I think @pjmore's idea to cast unsupported types to a supported type is a good one -- the arrow `cast` kernels are quite efficient for things like `Date32` -> `Int32` (no copies) as the representations are the same
   
   @pjmore  what do you think? 
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org