You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/10/14 14:25:27 UTC

[GitHub] [arrow] alamb opened a new pull request #8460: ARROW-10236: [Rust] Add can_cast_types to arrow cast kernel, use in DataFusion

alamb opened a new pull request #8460:
URL: https://github.com/apache/arrow/pull/8460


   This is a PR incorporating the feedback from @nevi-me  and @jorgecarleitao  from https://github.com/apache/arrow/pull/8400
   
   It adds
   1. a `can_cast_types` function to the Arrow cast kernel (as suggested by @jorgecarleitao  / @nevi-me  in https://github.com/apache/arrow/pull/8400#discussion_r501850814) that encodes the valid type casting
   2. A test that ensures `can_cast_types` and `cast` remain in sync
   3. Bug fixes that the test above uncovered (I'll comment inline)
   4. Change DataFuson to use `can_cast_types` so that it plans casting consistently with what arrow allows
   
   Previously the notions of coercion and casting were somewhat conflated in DataFusion. I have tried to clarify them in https://github.com/apache/arrow/pull/8399 and this PR. See also https://github.com/apache/arrow/pull/8340#discussion_r501257096 for more discussion.
   
   I am adding this functionality so DataFusion gains rudimentary support `DictionaryArray`.
   
   Codewise, I am concerned about the duplication in logic between the match statements in `cast` and `can_cast_types. I have some thoughts on how to unify them (see https://github.com/apache/arrow/pull/8400#discussion_r504278902), but I don't have time to implement that as it is a bigger change. I think this approach with some duplication is ok, and the test will ensure they remain in sync. 
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #8460: ARROW-10236: [Rust] Add can_cast_types to arrow cast kernel, use in DataFusion

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #8460:
URL: https://github.com/apache/arrow/pull/8460#discussion_r504735006



##########
File path: rust/datafusion/src/physical_plan/expressions.rs
##########
@@ -1985,9 +1982,10 @@ mod tests {
 
     #[test]
     fn invalid_cast() -> Result<()> {
-        let schema = Schema::new(vec![Field::new("a", DataType::Utf8, false)]);
-        let result = cast(col("a"), &schema, DataType::Int32);
-        result.expect_err("Invalid CAST from Utf8 to Int32");
+        // Ensure a useful error happens at plan time if invalid casts are used

Review comment:
       It turns out that arrow can, in fact, cast from utf8 -> Int32




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #8460: ARROW-10236: [Rust] Add can_cast_types to arrow cast kernel, use in DataFusion

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #8460:
URL: https://github.com/apache/arrow/pull/8460#discussion_r504721964



##########
File path: rust/arrow/src/compute/kernels/cast.rs
##########
@@ -356,11 +520,24 @@ pub fn cast(array: &ArrayRef, to_type: &DataType) -> Result<ArrayRef> {
 
         // temporal casts
         (Int32, Date32(_)) => cast_array_data::<Date32Type>(array, to_type.clone()),
-        (Int32, Time32(_)) => cast_array_data::<Date32Type>(array, to_type.clone()),
+        (Int32, Time32(TimeUnit::Second)) => {

Review comment:
       It is not possible to cast Int32 to a Time32(Microsecond) or Time32(Nanosecond)

##########
File path: rust/arrow/src/compute/kernels/cast.rs
##########
@@ -549,7 +726,18 @@ pub fn cast(array: &ArrayRef, to_type: &DataType) -> Result<ArrayRef> {
         (Timestamp(from_unit, _), Date64(_)) => {
             let from_size = time_unit_multiple(&from_unit);
             let to_size = MILLISECONDS;
-            if from_size != to_size {
+
+            // Scale time_array by (to_size / from_size) using a

Review comment:
       Without this code, casting from a timestamp 32 -> Date64 would result in a divide by zero error (as `from_size / to_size` was `1 / 1000 == 0`

##########
File path: rust/arrow/src/datatypes.rs
##########
@@ -1129,6 +1129,16 @@ impl DataType {
             DataType::Dictionary(_, _) => json!({ "name": "dictionary"}),
         }
     }
+
+    /// Returns true if this type is numeric: (UInt*, Unit*, or Float*)
+    pub fn is_numeric(t: &DataType) -> bool {

Review comment:
       As suggested by @nevi-me on https://github.com/apache/arrow/pull/8400#discussion_r501851945




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8460: ARROW-10236: [Rust] Add can_cast_types to arrow cast kernel, use in DataFusion

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8460:
URL: https://github.com/apache/arrow/pull/8460#issuecomment-708444187


   https://issues.apache.org/jira/browse/ARROW-10236


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me closed pull request #8460: ARROW-10236: [Rust] Add can_cast_types to arrow cast kernel, use in DataFusion

Posted by GitBox <gi...@apache.org>.
nevi-me closed pull request #8460:
URL: https://github.com/apache/arrow/pull/8460


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on a change in pull request #8460: ARROW-10236: [Rust] Add can_cast_types to arrow cast kernel, use in DataFusion

Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #8460:
URL: https://github.com/apache/arrow/pull/8460#discussion_r504935919



##########
File path: rust/datafusion/src/physical_plan/expressions.rs
##########
@@ -1985,9 +1982,10 @@ mod tests {
 
     #[test]
     fn invalid_cast() -> Result<()> {
-        let schema = Schema::new(vec![Field::new("a", DataType::Utf8, false)]);
-        let result = cast(col("a"), &schema, DataType::Int32);
-        result.expect_err("Invalid CAST from Utf8 to Int32");
+        // Ensure a useful error happens at plan time if invalid casts are used

Review comment:
       yup, unifying the cast logic was good for this. I've wanted to add cast options, such as disallowing lossy casts.
   If/when we get to that point, we'll have to think about what behaviour we want DataFusion to use.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #8460: ARROW-10236: [Rust] Add can_cast_types to arrow cast kernel, use in DataFusion

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #8460:
URL: https://github.com/apache/arrow/pull/8460#discussion_r504969480



##########
File path: rust/datafusion/src/physical_plan/expressions.rs
##########
@@ -1985,9 +1982,10 @@ mod tests {
 
     #[test]
     fn invalid_cast() -> Result<()> {
-        let schema = Schema::new(vec![Field::new("a", DataType::Utf8, false)]);
-        let result = cast(col("a"), &schema, DataType::Int32);
-        result.expect_err("Invalid CAST from Utf8 to Int32");
+        // Ensure a useful error happens at plan time if invalid casts are used

Review comment:
       > If/when we get to that point, we'll have to think about what behaviour we want DataFusion to use.
   
   I think DataFusion now makes the distinction between "casting" (aka if the user specifically requests to cast from one type to another) which can be lossy and "coercion" (aka when casts need to be added explicitly so that expressions can be evaluated (e.g. plus). 
   
   Coercion is designed not be lossy, but casting can be. 
   
   The downside is that Datafusion has to have another set of rules of what type coercions are allowed (e.g. I need to add https://github.com/apache/arrow/pull/8463 to properly support DictionaryArray in DataFusion). 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org