You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/10/06 12:26:39 UTC

[GitHub] [arrow] alamb opened a new pull request #8359: ARROW-10163: [Rust] [DataFusion] Add DictionaryArray coercion support to physical plans

alamb opened a new pull request #8359:
URL: https://github.com/apache/arrow/pull/8359


   NOTE: this builds on #8333 and #8340 and #8346 so leaving as a draft until those are merged
   
   This PR adds basic physical expression / casting support to DataFusion. Right now, it will cause all DictionaryArray data to be unpacked into a `StringArray` for operations.  
   
   Ideally, DataFusion would support direct comparison / operation on DictionaryArrays, however, there isn't the necessary support in the arrow kernels yet to do so (e.g. there is no Dictionary equality comparison kernel I could find https://github.com/apache/arrow/blob/master/rust/arrow/src/compute/kernels/comparison.rs
   
   However, this PR gets the basic queries running and I hope to contribute further optimizations as time allows and our project needs dictate. 
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb closed pull request #8359: ARROW-10163: [Rust] [DataFusion] Add DictionaryArray coercion support to physical plans

Posted by GitBox <gi...@apache.org>.
alamb closed pull request #8359:
URL: https://github.com/apache/arrow/pull/8359


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb closed pull request #8359: ARROW-10163: [Rust] [DataFusion] Add DictionaryArray coercion support to physical plans

Posted by GitBox <gi...@apache.org>.
alamb closed pull request #8359:
URL: https://github.com/apache/arrow/pull/8359


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8359: ARROW-10163: [Rust] [DataFusion] Add DictionaryArray coercion support to physical plans

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8359:
URL: https://github.com/apache/arrow/pull/8359#issuecomment-704237332


   https://issues.apache.org/jira/browse/ARROW-10163


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #8359: ARROW-10163: [Rust] [DataFusion] Add DictionaryArray coercion support to physical plans

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #8359:
URL: https://github.com/apache/arrow/pull/8359#discussion_r500236277



##########
File path: rust/datafusion/src/physical_plan/expressions.rs
##########
@@ -1166,16 +1201,13 @@ fn order_coercion(lhs_type: &DataType, rhs_type: &DataType) -> Option<DataType>
         return Some(lhs_type.clone());
     }
 
-    match numerical_coercion(lhs_type, rhs_type) {
-        None => {
-            // strings are naturally ordered, and thus ordering can be applied to them.
-            string_coercion(lhs_type, rhs_type)
-        }
-        t => t,
-    }
+    numerical_coercion(lhs_type, rhs_type)
+        .or_else(|| string_coercion(lhs_type, rhs_type))

Review comment:
       FWIW I don't understand why `order_coercion` and `eq_coercion` are different (eq_coercion does not include string_coercion).
   
   

##########
File path: rust/datafusion/src/physical_plan/expressions.rs
##########
@@ -1542,20 +1574,7 @@ pub fn cast(
     if expr_type == cast_type {
         return Ok(expr.clone());
     }
-    if is_numeric(&expr_type) && (is_numeric(&cast_type) || cast_type == DataType::Utf8) {
-        Ok(Arc::new(CastExpr { expr, cast_type }))
-    } else if expr_type == DataType::Binary && cast_type == DataType::Utf8 {
-        Ok(Arc::new(CastExpr { expr, cast_type }))
-    } else if is_numeric(&expr_type)
-        && cast_type == DataType::Timestamp(TimeUnit::Nanosecond, None)
-    {
-        Ok(Arc::new(CastExpr { expr, cast_type }))
-    } else {
-        Err(ExecutionError::General(format!(
-            "Invalid CAST from {:?} to {:?}",
-            expr_type, cast_type
-        )))
-    }
+    Ok(Arc::new(CastExpr { expr, cast_type }))

Review comment:
       As in https://github.com/apache/arrow/pull/8340, I don't think partially redundant and incomplete checks for casting support in DataFusion adds much over the actual arrow casting, so I have removed the plan time checks

##########
File path: rust/datafusion/tests/sql.rs
##########
@@ -1242,33 +1242,23 @@ async fn query_on_string_dictionary() -> Result<()> {
     let expected = vec![vec!["one"], vec!["three"]];
     assert_eq!(expected, actual);
 
-    // The following queries are not yet supported
-
-    // // filtering with constant
-    // let sql = "SELECT * FROM test WHERE d1 = 'three'";
-    // let actual = execute(&mut ctx, sql).await;
-    // let expected = vec![
-    //     vec!["three"],
-    // ];
-    // assert_eq!(expected, actual);
-
-    // // Expression evaluation
-    // let sql = "SELECT concat(d1, '-foo') FROM test";
-    // let actual = execute(&mut ctx, sql).await;
-    // let expected = vec![
-    //     vec!["one-foo"],
-    //     vec!["NULL"],
-    //     vec!["three-foo"],
-    // ];
-    // assert_eq!(expected, actual);
-
-    // // aggregation
-    // let sql = "SELECT COUNT(d1) FROM test";
-    // let actual = execute(&mut ctx, sql).await;
-    // let expected = vec![
-    //     vec!["2"]
-    // ];
-    // assert_eq!(expected, actual);
+    // filtering with constant

Review comment:
       here is the (simple) end to end test for DictionaryArray support

##########
File path: rust/datafusion/src/physical_plan/expressions.rs
##########
@@ -1782,11 +1804,13 @@ mod tests {
 
     // runs an end-to-end test of physical type coercion:
     // 1. construct a record batch with two columns of type A and B
+    //  (*_ARRAY is the Rust Arrow array type, and *_TYPE is the DataType of the elements)

Review comment:
       Unrelated to this PR, I just renamed some of these arguments to match the comment description




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org