You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/08/22 15:51:38 UTC

[GitHub] [arrow] jorgecarleitao opened a new pull request #8024: ARROW-9809: [Rust][DataFusion] Fixed type coercion, supertypes and type checking.

jorgecarleitao opened a new pull request #8024:
URL: https://github.com/apache/arrow/pull/8024


   This commit makes all type coercion happen on the physical plane instead of logical plane and fixes the supertype function. This makes field names to not change due to coercion rules, better control of how the coercion supports physical calculations, and others.
   
   This commit also makes it more clear how we enforce type checking during planning. the Logical plan now knows how to derive its schema directly from binary expressions, even before the coercion is applied.
   
   The rational for this change is that coercions are simplifications to a physical computation (it is easier to sum two numbers of the same type at the hardware level).
   
   This automatically solves ARROW-9809, an issue on which the physical schema could be modified by coercion rules, causing the RecordBatch's schema to be different from the logical batch.
   
   This also addresses some inconsistencies in how we coerced certain types for binary operators, causing such inconsistencies to error during planning instead of execution.
   
   This also introduces a significant number of tests into the overall consistency of binary operators: it is now explicit what types they expect and how coercion happens to each operator. It also adds tests to different parts of the physical execution, to ensure schema consistency for binary operators, including negative tests (when it should error).
   
   This also makes `like` and `nlike` generally available, and added some tests to it.
   
   This closes at least ARROW-9809 and ARROW-4957.
   
   @andygrove  and @alamb, I am really sorry for this long commit, but I was unable to split this in smaller parts with passing tests. There was a strong coupling between the `get_supertype` and the physical expressions that made it hard to work this through.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove commented on a change in pull request #8024: ARROW-9809: [Rust][DataFusion] Fixed type coercion, supertypes and type checking.

Posted by GitBox <gi...@apache.org>.
andygrove commented on a change in pull request #8024:
URL: https://github.com/apache/arrow/pull/8024#discussion_r475106389



##########
File path: rust/datafusion/src/execution/physical_plan/expressions.rs
##########
@@ -991,30 +991,236 @@ impl fmt::Display for BinaryExpr {
     }
 }
 
+// Returns a formatted error about being impossible to coerce types for the binary operator.
+fn coercion_error<T>(lhs_type: &DataType, op: &String, rhs_type: &DataType) -> Result<T> {
+    Err(ExecutionError::General(
+        format!(
+            "The binary operator '{}' can't evaluate with lhs = '{:?}' and rhs = '{:?}'",
+            op, lhs_type, rhs_type
+        )
+        .to_string(),
+    ))
+}
+
+// the type that both lhs and rhs can be casted to for the purpose of a string computation
+fn string_coercion(
+    lhs_type: &DataType,
+    op: &Operator,
+    rhs_type: &DataType,
+) -> Result<DataType> {
+    use arrow::datatypes::DataType::*;
+    match (lhs_type, rhs_type) {
+        (Utf8, Utf8) => Ok(Utf8),
+        (LargeUtf8, Utf8) => Ok(LargeUtf8),
+        (Utf8, LargeUtf8) => Ok(LargeUtf8),
+        (LargeUtf8, LargeUtf8) => Ok(LargeUtf8),
+        _ => coercion_error(lhs_type, &format!("{}", op), rhs_type),
+    }
+}
+
+/// coercion rule for numerical values
+pub fn numerical_coercion(
+    lhs_type: &DataType,
+    op: &String,

Review comment:
       I'm curious why `string_coercion` takes `op` as `&Operator` but `numerical_coercion` takes it as `&String`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #8024: ARROW-9809: [Rust][DataFusion] Fixed type coercion, supertypes and type checking.

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8024:
URL: https://github.com/apache/arrow/pull/8024#issuecomment-678770336


   @alamb , thanks a lot for that insight.
   
   I may have been using the wrong notation here.
   
   I think that we have each columns' type during logical planning: the `LogicalPlanBuilder` always starts with a scan with a well defined (or infered via scan) schema. When a projection is constructed, which requires us to derive a `schema`, we build that schema by deriving the column types from its expressions, via `exprlist_to_fields` (that uses `Expr::to_field` that uses `Expr::get_type(input_schema)`).
   
   As I see it, the type coercer optimizer is casting types being passed to binary operators for the sole purpose of matching numerical types to perform computations, as we do not have kernels for different numerical types (e.g. u16 + u32).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #8024: ARROW-9809: [Rust][DataFusion] Fixed type coercion, supertypes and type checking.

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #8024:
URL: https://github.com/apache/arrow/pull/8024#discussion_r475106741



##########
File path: rust/datafusion/src/execution/physical_plan/expressions.rs
##########
@@ -991,30 +991,236 @@ impl fmt::Display for BinaryExpr {
     }
 }
 
+// Returns a formatted error about being impossible to coerce types for the binary operator.
+fn coercion_error<T>(lhs_type: &DataType, op: &String, rhs_type: &DataType) -> Result<T> {
+    Err(ExecutionError::General(
+        format!(
+            "The binary operator '{}' can't evaluate with lhs = '{:?}' and rhs = '{:?}'",
+            op, lhs_type, rhs_type
+        )
+        .to_string(),
+    ))
+}
+
+// the type that both lhs and rhs can be casted to for the purpose of a string computation
+fn string_coercion(
+    lhs_type: &DataType,
+    op: &Operator,
+    rhs_type: &DataType,
+) -> Result<DataType> {
+    use arrow::datatypes::DataType::*;
+    match (lhs_type, rhs_type) {
+        (Utf8, Utf8) => Ok(Utf8),
+        (LargeUtf8, Utf8) => Ok(LargeUtf8),
+        (Utf8, LargeUtf8) => Ok(LargeUtf8),
+        (LargeUtf8, LargeUtf8) => Ok(LargeUtf8),
+        _ => coercion_error(lhs_type, &format!("{}", op), rhs_type),
+    }
+}
+
+/// coercion rule for numerical values
+pub fn numerical_coercion(
+    lhs_type: &DataType,
+    op: &String,

Review comment:
       It is only used for error messages atm. But yes, for consistency, we should probably receive an Operator. The operator will be needed in the future for the Operator::Modulus, that only handles integers AFAIK.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on a change in pull request #8024: ARROW-9809: [Rust][DataFusion] Fixed type coercion, supertypes and type checking.

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on a change in pull request #8024:
URL: https://github.com/apache/arrow/pull/8024#discussion_r475109199



##########
File path: rust/datafusion/src/execution/physical_plan/expressions.rs
##########
@@ -991,30 +991,236 @@ impl fmt::Display for BinaryExpr {
     }
 }
 
+// Returns a formatted error about being impossible to coerce types for the binary operator.
+fn coercion_error<T>(lhs_type: &DataType, op: &String, rhs_type: &DataType) -> Result<T> {
+    Err(ExecutionError::General(
+        format!(
+            "The binary operator '{}' can't evaluate with lhs = '{:?}' and rhs = '{:?}'",
+            op, lhs_type, rhs_type
+        )
+        .to_string(),
+    ))
+}
+
+// the type that both lhs and rhs can be casted to for the purpose of a string computation
+fn string_coercion(
+    lhs_type: &DataType,
+    op: &Operator,
+    rhs_type: &DataType,
+) -> Result<DataType> {
+    use arrow::datatypes::DataType::*;
+    match (lhs_type, rhs_type) {
+        (Utf8, Utf8) => Ok(Utf8),
+        (LargeUtf8, Utf8) => Ok(LargeUtf8),
+        (Utf8, LargeUtf8) => Ok(LargeUtf8),
+        (LargeUtf8, LargeUtf8) => Ok(LargeUtf8),
+        _ => coercion_error(lhs_type, &format!("{}", op), rhs_type),
+    }
+}
+
+/// coercion rule for numerical values
+pub fn numerical_coercion(
+    lhs_type: &DataType,
+    op: &String,

Review comment:
       Fixed. I had to pass a random operator (Plus) on the type coercer optimizer for UDFs. Since that is on the best effort anyways, I think that it is fine.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove closed pull request #8024: ARROW-9809: [Rust][DataFusion] Fixed type coercion, supertypes and type checking.

Posted by GitBox <gi...@apache.org>.
andygrove closed pull request #8024:
URL: https://github.com/apache/arrow/pull/8024


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8024: ARROW-9809: [Rust][DataFusion] Fixed type coercion, supertypes and type checking.

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8024:
URL: https://github.com/apache/arrow/pull/8024#issuecomment-678658740


   https://issues.apache.org/jira/browse/ARROW-9809


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org