You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "jackwener (via GitHub)" <gi...@apache.org> on 2023/05/04 06:50:18 UTC

[GitHub] [arrow-datafusion] jackwener opened a new pull request, #6221: feat: separate get_result_type and coerce_type

jackwener opened a new pull request, #6221:
URL: https://github.com/apache/arrow-datafusion/pull/6221

   # Which issue does this PR close?
   
   Closes #.
   
   # Rationale for this change
   
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   # What changes are included in this PR?
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are these changes tested?
   
   <!--
   We typically require tests for all PRs in order to:
   1. Prevent the code from being accidentally broken by subsequent changes
   2. Serve as another way to document the expected behavior of the code
   
   If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
   -->
   
   # Are there any user-facing changes?
   
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #6221: refactor: separate get_result_type from `coerce_type`

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #6221:
URL: https://github.com/apache/arrow-datafusion/pull/6221#discussion_r1185101115


##########
datafusion/expr/src/type_coercion/binary.rs:
##########
@@ -106,6 +106,55 @@ pub fn binary_operator_data_type(
     }
 }
 
+pub fn get_result_type(

Review Comment:
   I think we need some docstrings on this function. Something like:
   
   ```suggestion
   /// returns the resulting type of a binary expression evaluating the `op` with the left and right hand types 
   pub fn get_result_type(
   ```
   
   Also calling this method `result_type` rather than `get_result_type` might be more consistent with the conventions in the codebase (this is a minor point)



##########
datafusion/expr/src/type_coercion/binary.rs:
##########
@@ -106,6 +106,55 @@ pub fn binary_operator_data_type(
     }
 }
 
+pub fn get_result_type(
+    lhs_type: &DataType,
+    op: &Operator,
+    rhs_type: &DataType,
+) -> Result<DataType> {
+    let result = match op {
+        Operator::And
+        | Operator::Or
+        | Operator::Eq
+        | Operator::NotEq
+        | Operator::Lt
+        | Operator::Gt
+        | Operator::GtEq
+        | Operator::LtEq
+        | Operator::IsDistinctFrom
+        | Operator::IsNotDistinctFrom => Some(DataType::Boolean),
+        Operator::Plus | Operator::Minus
+            if is_datetime(lhs_type)
+                || is_datetime(rhs_type)
+                || is_interval(lhs_type)
+                || is_interval(rhs_type) =>
+        {
+            temporal_add_sub_coercion(lhs_type, rhs_type, op)
+        }
+        Operator::BitwiseAnd
+        | Operator::BitwiseOr
+        | Operator::BitwiseXor
+        | Operator::BitwiseShiftRight
+        | Operator::BitwiseShiftLeft
+        | Operator::Plus
+        | Operator::Minus
+        | Operator::Modulo
+        | Operator::Divide
+        | Operator::Multiply
+        | Operator::RegexMatch
+        | Operator::RegexIMatch
+        | Operator::RegexNotMatch
+        | Operator::RegexNotIMatch
+        | Operator::StringConcat => coerce_types(lhs_type, op, rhs_type).ok(),
+    };
+
+    match result {
+        None => Err(DataFusionError::Plan(format!(
+            "there isn't result type for {lhs_type:?} {op} {rhs_type:?}"

Review Comment:
   ```suggestion
               "Unsupported argument types. Can not evaluate {lhs_type:?} {op} {rhs_type:?}"
   ```
   
   I think would make a nicer error message



##########
datafusion/expr/src/type_coercion/binary.rs:
##########
@@ -106,6 +106,55 @@ pub fn binary_operator_data_type(
     }
 }
 
+pub fn get_result_type(
+    lhs_type: &DataType,
+    op: &Operator,
+    rhs_type: &DataType,
+) -> Result<DataType> {
+    let result = match op {
+        Operator::And
+        | Operator::Or
+        | Operator::Eq
+        | Operator::NotEq
+        | Operator::Lt
+        | Operator::Gt
+        | Operator::GtEq
+        | Operator::LtEq
+        | Operator::IsDistinctFrom
+        | Operator::IsNotDistinctFrom => Some(DataType::Boolean),
+        Operator::Plus | Operator::Minus
+            if is_datetime(lhs_type)
+                || is_datetime(rhs_type)
+                || is_interval(lhs_type)
+                || is_interval(rhs_type) =>
+        {
+            temporal_add_sub_coercion(lhs_type, rhs_type, op)

Review Comment:
   Eventually I think it would make the code easier to understand and maintain if we can disentangle `coerce_types` from `get_result_types`. To start in this PR, calling into coerce_types is fine.
   
   Eventually I think it would keep the logic much simpler  if  `get_result_type` is only called on expressions that have their inputs properly coerced. 



##########
datafusion/expr/src/type_coercion/binary.rs:
##########
@@ -42,7 +42,7 @@ pub fn binary_operator_data_type(
     let result_type = if !any_decimal(lhs_type, rhs_type) {

Review Comment:
   I als find `binary_oprator_data_type` to be confusing and redundant with `get_result_type` but we can fix that in a follow on PR perhaps



##########
datafusion/physical-expr/src/expressions/datetime.rs:
##########
@@ -77,7 +77,7 @@ impl PhysicalExpr for DateTimeIntervalExpr {
     }
 
     fn data_type(&self, input_schema: &Schema) -> Result<DataType> {

Review Comment:
   I would also love to remove `DateTimeIntervalExpr` out of its own thing and into binary.rs with the other operators, but for now 👍 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jackwener commented on a diff in pull request #6221: refactor: separate get_result_type from `coerce_type`

Posted by "jackwener (via GitHub)" <gi...@apache.org>.
jackwener commented on code in PR #6221:
URL: https://github.com/apache/arrow-datafusion/pull/6221#discussion_r1184744117


##########
datafusion/expr/src/type_coercion/binary.rs:
##########
@@ -106,6 +106,55 @@ pub fn binary_operator_data_type(
     }
 }
 
+pub fn get_result_type(
+    lhs_type: &DataType,
+    op: &Operator,
+    rhs_type: &DataType,
+) -> Result<DataType> {
+    let result = match op {
+        Operator::And
+        | Operator::Or
+        | Operator::Eq
+        | Operator::NotEq
+        | Operator::Lt
+        | Operator::Gt
+        | Operator::GtEq
+        | Operator::LtEq
+        | Operator::IsDistinctFrom

Review Comment:
   Original, when we call `get_data_type`, it will return common type instead of  result type (`Boolean`) in `coerce_type`. So original it's wrong.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jackwener commented on a diff in pull request #6221: refactor: separate get_result_type from `coerce_type`

Posted by "jackwener (via GitHub)" <gi...@apache.org>.
jackwener commented on code in PR #6221:
URL: https://github.com/apache/arrow-datafusion/pull/6221#discussion_r1185142776


##########
datafusion/expr/src/type_coercion/binary.rs:
##########
@@ -42,7 +42,7 @@ pub fn binary_operator_data_type(
     let result_type = if !any_decimal(lhs_type, rhs_type) {

Review Comment:
   Great suggestion, I will do in following PR



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on pull request #6221: refactor: separate get_result_type from `coerce_type`

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on PR #6221:
URL: https://github.com/apache/arrow-datafusion/pull/6221#issuecomment-1609113819

   One slightly unfortunate quirk of this separation is it isn't clear if the inputs to get_result_type are before or after type coercion. This has non-trivial implications for decimal arithmetic, in particular for division, as type coercion is not idempotent. I'm still thinking about how to handle this better, but thought I would mention it in case you've already given this some thought


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jackwener commented on pull request #6221: refactor: separate get_result_type from `coerce_type`

Posted by "jackwener (via GitHub)" <gi...@apache.org>.
jackwener commented on PR #6221:
URL: https://github.com/apache/arrow-datafusion/pull/6221#issuecomment-1535643094

   Thanks @alamb review carefully.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jackwener merged pull request #6221: refactor: separate get_result_type from `coerce_type`

Posted by "jackwener (via GitHub)" <gi...@apache.org>.
jackwener merged PR #6221:
URL: https://github.com/apache/arrow-datafusion/pull/6221


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jackwener commented on pull request #6221: refactor: separate get_result_type from `coerce_type`

Posted by "jackwener (via GitHub)" <gi...@apache.org>.
jackwener commented on PR #6221:
URL: https://github.com/apache/arrow-datafusion/pull/6221#issuecomment-1609227947

   > One slightly unfortunate quirk of this separation is it isn't clear if the inputs to get_result_type are before or after type coercion.
   
   Agree with it.
   
   It may be related with https://github.com/apache/arrow-datafusion/issues/6261


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jackwener commented on a diff in pull request #6221: feat: separate get_result_type and coerce_type

Posted by "jackwener (via GitHub)" <gi...@apache.org>.
jackwener commented on code in PR #6221:
URL: https://github.com/apache/arrow-datafusion/pull/6221#discussion_r1184608466


##########
datafusion/core/src/physical_plan/planner.rs:
##########
@@ -2053,10 +2053,7 @@ mod tests {
         ];
         for case in cases {
             let logical_plan = test_csv_scan().await?.project(vec![case.clone()]);
-            let message = format!(
-                "Expression {case:?} expected to error due to impossible coercion"
-            );
-            assert!(logical_plan.is_err(), "{}", message);

Review Comment:
   Type checking is done in the `Type coercion analyzer rule` rather than directly in the build
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jackwener commented on pull request #6221: feat: separate get_result_type from `coerce_type`

Posted by "jackwener (via GitHub)" <gi...@apache.org>.
jackwener commented on PR #6221:
URL: https://github.com/apache/arrow-datafusion/pull/6221#issuecomment-1534196845

   Next, we should separate `getCommonType` from `coerce_type`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #6221: refactor: separate get_result_type from `coerce_type`

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6221:
URL: https://github.com/apache/arrow-datafusion/pull/6221#issuecomment-1610030064

   > One slightly unfortunate quirk of this separation is it isn't clear if the inputs to get_result_type are before or after type coercion.
   
   In my opinion, `get_result_type` should always be called *after* type coercion and if the function doesn't support the current input types, it should error. 
   
   Basically I think once the "Analyzer" pass has run on the expr tree the rest of the code shouldn't have to guess / coerce types anymore -- the types would always line up or the query would be rejected. 
   
   I think this is the standard approach in other databases (and, eg. the type checking / semantic pass in compilers)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jackwener commented on a diff in pull request #6221: refactor: separate get_result_type from `coerce_type`

Posted by "jackwener (via GitHub)" <gi...@apache.org>.
jackwener commented on code in PR #6221:
URL: https://github.com/apache/arrow-datafusion/pull/6221#discussion_r1184744117


##########
datafusion/expr/src/type_coercion/binary.rs:
##########
@@ -106,6 +106,55 @@ pub fn binary_operator_data_type(
     }
 }
 
+pub fn get_result_type(
+    lhs_type: &DataType,
+    op: &Operator,
+    rhs_type: &DataType,
+) -> Result<DataType> {
+    let result = match op {
+        Operator::And
+        | Operator::Or
+        | Operator::Eq
+        | Operator::NotEq
+        | Operator::Lt
+        | Operator::Gt
+        | Operator::GtEq
+        | Operator::LtEq
+        | Operator::IsDistinctFrom

Review Comment:
   Original, even if when we call `get_data_type`, it will return common type instead of `Boolean` in `coerce_type`. So original it's wrong.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jackwener commented on pull request #6221: refactor: separate get_result_type from `coerce_type`

Posted by "jackwener (via GitHub)" <gi...@apache.org>.
jackwener commented on PR #6221:
URL: https://github.com/apache/arrow-datafusion/pull/6221#issuecomment-1535645179

   I prepare to merge this PR because there are some following PR.
   
   Anyone who wants to review this PR can continue, and they they can also review the content in the following PR


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org