You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/10/05 10:51:07 UTC

[GitHub] [arrow] alamb opened a new pull request #8340: ARROW-10165: [Rust] [DataFusion]: Remove special case DataFusion casting checks in favor of Arrow cast kernel

alamb opened a new pull request #8340:
URL: https://github.com/apache/arrow/pull/8340


   NOTE: this builds on https://github.com/apache/arrow/pull/8333, so I am waiting on that to merge before turning this into a real PR)
   
   This PR removes the explicit plan-time cast support check from DataFusion and instead uses the runtime check of the [arrow compute cast kernel](https://github.com/apache/arrow/blob/master/rust/arrow/src/compute/kernels/cast.rs).
   
   The upside of this change is less code overall and increased casting support (DataFusion now supports all casting supported by Arrow rather than just the subset that was hard coded in `can_coerce_from`). 
   
   A potential downside is that an unsupported cast will produce an  error later in the process (after the plan has started running rather than during plan time).
   
   This PR is part of a set of changes to support `DictionaryArray`s in DataFusion. Once I add the appropriate support to the `cast` kernel, DataFusion will be able to cast `DictionaryArray` types correctly with no additional code.
   
   Prior to this change, the test in sql failed with this error message:
   
   ```
   ---- query_on_string_dictionary stdout ----
   thread 'query_on_string_dictionary' panicked at 'called `Result::unwrap()` on an `Err` value: General("\'Dictionary(Int32, Utf8) = Utf8\' can\'t be evaluated because there isn\'t a common type to coerce the types to")', datafusion/tests/sql.rs:925:16
   note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
   ```
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #8340: ARROW-10165: [Rust] [DataFusion]: Remove special case DataFusion casting checks in favor of Arrow cast kernel

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #8340:
URL: https://github.com/apache/arrow/pull/8340#discussion_r501257096



##########
File path: rust/datafusion/src/logical_plan/mod.rs
##########
@@ -323,21 +322,19 @@ impl Expr {
     ///
     /// # Errors
     ///
-    /// This function errors when it is impossible to cast the expression to the target [arrow::datatypes::DataType].
+    /// Currently no errors happen at plan time. If it is impossible

Review comment:
       I am sorry, I was confused: `can_coerce_from` checks whether we can perform a lossless cast. This is different from the ability to perform *any* cast.
   
   Therefore, I think that the logic we may want is:
   
   * the `Expr::Cast` operation should check that we can perform the cast at all (e.g. a `ListArray` to a `UInt32Array` should not be allowed). This should match the casts that the kernel supports.
   * implicit casts, such as the ones that we perform when we pass arguments to functions to try to match the signature, should use `can_coerce_from`, as we do not want to perform lossless implicit casts.
   
   AFAIK casts to and from dictionaries are lossless, and thus they can happen at any point in the physical planning. Since all arrays support dictionaries, I guess we only have to error if the implementation is still not available.
   
   What do you think, @alamb ?
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #8340: ARROW-10165: [Rust] [DataFusion]: Remove special case DataFusion casting checks in favor of Arrow cast kernel

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #8340:
URL: https://github.com/apache/arrow/pull/8340#issuecomment-704834619


   @jorgecarleitao 
   
   > Doesn't this mean that the plan can fail arbitrarily when a user performs an impossible cast? This can happen like 10hs after the execution starts, when the query finally reaches that point.
   
   Yes I think that is accurate. 
   
   > Isn't it possible to expand can_coerce_from to cover the cases that DataFusion is missing but that arrow cast kernel supports?
   
   Yes I can do that -- my rationale for this proposal was to avoiding having to keep two lists of valid typecasts in sync. However, the value of faster errors is a valid point. I will make a new PR to do as you suggest and expand the coverage in `can_coerce_from` 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #8340: ARROW-10165: [Rust] [DataFusion]: Remove special case DataFusion casting checks in favor of Arrow cast kernel

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #8340:
URL: https://github.com/apache/arrow/pull/8340#discussion_r501797053



##########
File path: rust/datafusion/src/logical_plan/mod.rs
##########
@@ -323,21 +322,19 @@ impl Expr {
     ///
     /// # Errors
     ///
-    /// This function errors when it is impossible to cast the expression to the target [arrow::datatypes::DataType].
+    /// Currently no errors happen at plan time. If it is impossible

Review comment:
       @jorgecarleitao  -- here is a proposed alternate approach: https://github.com/apache/arrow/pull/8400
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #8340: ARROW-10165: [Rust] [DataFusion]: Remove special case DataFusion casting checks in favor of Arrow cast kernel

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #8340:
URL: https://github.com/apache/arrow/pull/8340#discussion_r501262082



##########
File path: rust/datafusion/src/logical_plan/mod.rs
##########
@@ -323,21 +322,19 @@ impl Expr {
     ///
     /// # Errors
     ///
-    /// This function errors when it is impossible to cast the expression to the target [arrow::datatypes::DataType].
+    /// Currently no errors happen at plan time. If it is impossible

Review comment:
       @jorgecarleitao -- I think this makes sense. I was confused about the intent of `can_coerce_from` (aka that it is for lossless conversions only). With this explanation let me go back and give adding dictionary coercion / cast a second pass (though likely not until tomorrow, Thursday, US Eastern time)
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #8340: ARROW-10165: [Rust] [DataFusion]: Remove special case DataFusion casting checks in favor of Arrow cast kernel

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #8340:
URL: https://github.com/apache/arrow/pull/8340#discussion_r499515276



##########
File path: rust/datafusion/tests/sql.rs
##########
@@ -918,14 +918,20 @@ fn register_alltypes_parquet(ctx: &mut ExecutionContext) {
 /// Execute query and return result set as 2-d table of Vecs
 /// `result[row][column]`
 async fn execute(ctx: &mut ExecutionContext, sql: &str) -> Vec<Vec<String>> {
-    let plan = ctx.create_logical_plan(&sql).unwrap();
+    let msg = format!("Creating logical plan for '{}'", sql);

Review comment:
       These are some changes to improve the test error reporting (rather than a straight up panic, some diagnostic information is printed as well)

##########
File path: rust/datafusion/tests/sql.rs
##########
@@ -1200,3 +1206,69 @@ async fn query_is_not_null() -> Result<()> {
     assert_eq!(expected, actual);
     Ok(())
 }
+
+#[tokio::test]
+async fn query_on_string_dictionary() -> Result<()> {

Review comment:
       Here is the end to end testcast I am working on for `DictionaryArray` support -- with this PR we can do basic filtering in DataFusion. There are a few more PRs needed to complete expressions and aggregation, but they are coming.

##########
File path: rust/datafusion/tests/sql.rs
##########
@@ -1200,3 +1206,69 @@ async fn query_is_not_null() -> Result<()> {
     assert_eq!(expected, actual);
     Ok(())
 }
+
+#[tokio::test]
+async fn query_on_string_dictionary() -> Result<()> {
+    // Test to ensure DataFusion can operate on dictionary types
+    // Use StringDictionary (32 bit indexes = keys)
+    let field_type =
+        DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8));
+    let schema = Arc::new(Schema::new(vec![Field::new("d1", field_type, true)]));
+
+    let keys_builder = PrimitiveBuilder::<Int32Type>::new(10);
+    let values_builder = StringBuilder::new(10);
+    let mut builder = StringDictionaryBuilder::new(keys_builder, values_builder);
+
+    builder.append("one")?;
+    builder.append_null()?;
+    builder.append("three")?;
+    let array = Arc::new(builder.finish());
+
+    let data = RecordBatch::try_new(schema.clone(), vec![array])?;
+
+    let table = MemTable::new(schema, vec![vec![data]])?;
+    let mut ctx = ExecutionContext::new();
+    ctx.register_table("test", Box::new(table));
+
+    // Basic SELECT
+    let sql = "SELECT * FROM test";
+    let actual = execute(&mut ctx, sql).await;
+    let expected = vec![vec!["one"], vec!["NULL"], vec!["three"]];
+    assert_eq!(expected, actual);
+
+    // basic filtering
+    let sql = "SELECT * FROM test WHERE d1 IS NOT NULL";
+    let actual = execute(&mut ctx, sql).await;
+    let expected = vec![vec!["one"], vec!["three"]];
+    assert_eq!(expected, actual);
+
+    // The following queries are not yet supported
+
+    // // filtering with constant

Review comment:
       I have PRs in the works to support these cases and I will uncomment them as I do so.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8340: ARROW-10165: [Rust] [DataFusion]: Remove special case DataFusion casting checks in favor of Arrow cast kernel

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8340:
URL: https://github.com/apache/arrow/pull/8340#issuecomment-703561475


   https://issues.apache.org/jira/browse/ARROW-10165


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #8340: ARROW-10165: [Rust] [DataFusion]: Remove special case DataFusion casting checks in favor of Arrow cast kernel

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #8340:
URL: https://github.com/apache/arrow/pull/8340#issuecomment-705116207


   Alternate approach is in https://github.com/apache/arrow/pull/8359


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #8340: ARROW-10165: [Rust] [DataFusion]: Remove special case DataFusion casting checks in favor of Arrow cast kernel

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #8340:
URL: https://github.com/apache/arrow/pull/8340#discussion_r500710152



##########
File path: rust/datafusion/src/logical_plan/mod.rs
##########
@@ -323,21 +322,19 @@ impl Expr {
     ///
     /// # Errors
     ///
-    /// This function errors when it is impossible to cast the expression to the target [arrow::datatypes::DataType].
+    /// Currently no errors happen at plan time. If it is impossible

Review comment:
       I am unsure on whether this is advisable:
   
   Doesn't this mean that the plan can fail arbitrarily when a user performs an impossible cast? This can happen like 10hs after the execution starts, when the query finally reaches that point.
   
   I though that the point of having the types available during planning, both logical and physical, was to perform exactly these types of checks. Removing this check seems a regression to me, on which an impossible cast will only be caught during execution.
   
   Isn't it possible to expand `can_coerce_from` to cover the cases that `DataFusion` is missing but that arrow cast kernel supports?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #8340: ARROW-10165: [Rust] [DataFusion]: Remove special case DataFusion casting checks in favor of Arrow cast kernel

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #8340:
URL: https://github.com/apache/arrow/pull/8340#discussion_r500443182



##########
File path: rust/datafusion/src/logical_plan/mod.rs
##########
@@ -323,21 +322,19 @@ impl Expr {
     ///
     /// # Errors
     ///
-    /// This function errors when it is impossible to cast the expression to the target [arrow::datatypes::DataType].
+    /// Currently no errors happen at plan time. If it is impossible
+    /// to cast the expression to the target
+    /// [arrow::datatypes::DataType] then an error will occur at
+    /// runtime.
     pub fn cast_to(&self, cast_to_type: &DataType, schema: &Schema) -> Result<Expr> {
         let this_type = self.get_type(schema)?;
         if this_type == *cast_to_type {
             Ok(self.clone())
-        } else if can_coerce_from(cast_to_type, &this_type) {

Review comment:
       The PR itself is pretty simple -- it just deletes code :)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #8340: ARROW-10165: [Rust] [DataFusion]: Remove special case DataFusion casting checks in favor of Arrow cast kernel

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #8340:
URL: https://github.com/apache/arrow/pull/8340#discussion_r501246915



##########
File path: rust/datafusion/src/logical_plan/mod.rs
##########
@@ -323,21 +322,19 @@ impl Expr {
     ///
     /// # Errors
     ///
-    /// This function errors when it is impossible to cast the expression to the target [arrow::datatypes::DataType].
+    /// Currently no errors happen at plan time. If it is impossible

Review comment:
       It is interesting that there appear to be no uses (or tests) for this function -- https://github.com/apache/arrow/search?q=cast_to I found this quite confusing
   
   @jorgecarleitao  what do you think about simply removing the whole thing entirely?
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #8340: ARROW-10165: [Rust] [DataFusion]: Remove special case DataFusion casting checks in favor of Arrow cast kernel

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #8340:
URL: https://github.com/apache/arrow/pull/8340#discussion_r501797053



##########
File path: rust/datafusion/src/logical_plan/mod.rs
##########
@@ -323,21 +322,19 @@ impl Expr {
     ///
     /// # Errors
     ///
-    /// This function errors when it is impossible to cast the expression to the target [arrow::datatypes::DataType].
+    /// Currently no errors happen at plan time. If it is impossible

Review comment:
       @jorgecarleitao  -- here is a proposed alternate approach: https://github.com/apache/arrow/pull/8400
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb closed pull request #8340: ARROW-10165: [Rust] [DataFusion]: Remove special case DataFusion casting checks in favor of Arrow cast kernel

Posted by GitBox <gi...@apache.org>.
alamb closed pull request #8340:
URL: https://github.com/apache/arrow/pull/8340


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #8340: ARROW-10165: [Rust] [DataFusion]: Remove special case DataFusion casting checks in favor of Arrow cast kernel

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #8340:
URL: https://github.com/apache/arrow/pull/8340#discussion_r500710152



##########
File path: rust/datafusion/src/logical_plan/mod.rs
##########
@@ -323,21 +322,19 @@ impl Expr {
     ///
     /// # Errors
     ///
-    /// This function errors when it is impossible to cast the expression to the target [arrow::datatypes::DataType].
+    /// Currently no errors happen at plan time. If it is impossible

Review comment:
       I am unsure on whether we this is advisable:
   
   Doesn't this mean that the plan can fail arbitrarily when a user performs an impossible cast? This can happen like 10hs after the execution starts, when the query finally reaches that point.
   
   I though that the point of having the types available during planning, both logical and physical, was to perform exactly these types of checks. Removing this check seems a regression to me, on which an impossible cast will only be caught during execution.
   
   Isn't it possible to expand `can_coerce_from` to cover the cases that `DataFusion` is missing but that arrow cast kernel supports?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org