You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "berkaysynnada (via GitHub)" <gi...@apache.org> on 2023/02/20 06:37:54 UTC

[GitHub] [arrow-datafusion] berkaysynnada opened a new pull request, #5342: Bug/union wrong casting

berkaysynnada opened a new pull request, #5342:
URL: https://github.com/apache/arrow-datafusion/pull/5342

   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #5212.
   
   # Rationale for this change
   
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   When we try to use "UNION" on columns having differently signed or sized integer types, we get a cast error for some cases (explained in the issue), which can be handled accurately. 
   
   # What changes are included in this PR?
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   Previous casting matches are modified such that the resulting type of the column will be the smallest unit that will not cause to any error.
   
     | Int64 | UInt64 | Int32 | UInt32 | Int16 | UInt16 | Int8 | UInt8
   -- | -- | -- | -- | -- | -- | -- | -- | --
   Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64 | Int64
   UInt64 | Int64 | UInt64 | UInt64 | UInt64 | UInt64 | UInt64 | UInt64 | UInt64
   Int32 | Int64 | UInt64 | Int32 | Int64 | Int32 | Int32 | Int32 | Int32
   UInt32 | Int64 | UInt64 | Int64 | UInt32 | Int64 | UInt32 | Int64 | UInt32
   Int16 | Int64 | UInt64 | Int32 | Int64 | Int16 | Int32 | Int16 | Int16
   UInt16 | Int64 | UInt64 | Int32 | UInt32 | Int32 | UInt16 | Int32 | UInt16
   Int8 | Int64 | UInt64 | Int32 | Int64 | Int16 | Int32 | Int8 | Int16
   UInt8 | Int64 | UInt64 | Int32 | UInt32 | Int16 | UInt16 | Int16 | UInt8
   
   
   # Are these changes tested?
   
   <!--
   We typically require tests for all PRs in order to:
   1. Prevent the code from being accidentally broken by subsequent changes
   2. Serve as another way to document the expected behavior of the code
   
   If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
   -->
   
   Yes, `test_union_upcast_types` test function is added showing the error is solved in the issue.
   
   # Are there any user-facing changes?
   
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] ozankabak commented on a diff in pull request #5342: Bug/union wrong casting

Posted by "ozankabak (via GitHub)" <gi...@apache.org>.
ozankabak commented on code in PR #5342:
URL: https://github.com/apache/arrow-datafusion/pull/5342#discussion_r1119405340


##########
datafusion/expr/src/type_coercion/binary.rs:
##########
@@ -238,13 +238,32 @@ fn comparison_binary_numeric_coercion(
         (_, Decimal128(_, _)) => get_comparison_common_decimal_type(rhs_type, lhs_type),
         (Float64, _) | (_, Float64) => Some(Float64),
         (_, Float32) | (Float32, _) => Some(Float32),
-        (Int64, _) | (_, Int64) => Some(Int64),
-        (Int32, _) | (_, Int32) => Some(Int32),
-        (Int16, _) | (_, Int16) => Some(Int16),
-        (Int8, _) | (_, Int8) => Some(Int8),
+        // start checking from Int64, that is the most inclusive integer type

Review Comment:
   I am confused by the sum. Aren't we just "merging" the schemas? In that case, the narrowest type that can represent both sides should be chosen, no? Am I missing something here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] ozankabak commented on a diff in pull request #5342: Bug/union wrong casting

Posted by "ozankabak (via GitHub)" <gi...@apache.org>.
ozankabak commented on code in PR #5342:
URL: https://github.com/apache/arrow-datafusion/pull/5342#discussion_r1120865909


##########
datafusion/expr/src/type_coercion/binary.rs:
##########
@@ -238,13 +238,32 @@ fn comparison_binary_numeric_coercion(
         (_, Decimal128(_, _)) => get_comparison_common_decimal_type(rhs_type, lhs_type),
         (Float64, _) | (_, Float64) => Some(Float64),
         (_, Float32) | (Float32, _) => Some(Float32),
-        (Int64, _) | (_, Int64) => Some(Int64),
-        (Int32, _) | (_, Int32) => Some(Int32),
-        (Int16, _) | (_, Int16) => Some(Int16),
-        (Int8, _) | (_, Int8) => Some(Int8),
+        // start checking from Int64, that is the most inclusive integer type

Review Comment:
   Just added comments to explain the reasoning of the `match` patterns. Also took note to add some tests in a follow-on PR. Thanks for the review!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] ursabot commented on pull request #5342: Bug/union wrong casting

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #5342:
URL: https://github.com/apache/arrow-datafusion/pull/5342#issuecomment-1449050189

   Benchmark runs are scheduled for baseline = 0000d4fce387945d7b1f2e28f6cc5538f03b476d and contender = d076ab3b28ea0dffe65577dfa67dc7130c6006ce. d076ab3b28ea0dffe65577dfa67dc7130c6006ce is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/74a7dd32863b4910a56729edb90b16a3...c4959ef9773d406f979d2d9d658c8969/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/150613a632994bebadca9c2b7f057c98...7437884015704d48a81c525618259a31/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/9d86e6def49f47ebaa7444a2dc128017...2e6d5edfd5794454a98f140f70f6a834/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/a0d2e521ad43484e9bf9ea0d7609266c...48f056036b27481b8182e9b27726b684/)
   Buildkite builds:
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #5342: Bug/union wrong casting

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #5342:
URL: https://github.com/apache/arrow-datafusion/pull/5342#issuecomment-1449045900

   Thank you!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] ozankabak commented on pull request #5342: Bug/union wrong casting

Posted by "ozankabak (via GitHub)" <gi...@apache.org>.
ozankabak commented on PR #5342:
URL: https://github.com/apache/arrow-datafusion/pull/5342#issuecomment-1437511617

   Once you hit the widest fixed-size type, information loss becomes inevitable unless you have `bigint` or arbitrary-precision types at your disposal (for integral and floating point types, respectively). AFAIK Datafusion does not do such types at this time. If this is indeed true, we need to choose between losing very high numbers (by casting to `Int64`), or lose negative numbers (by casting to `UInt64`).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jackwener commented on pull request #5342: Bug/union wrong casting

Posted by "jackwener (via GitHub)" <gi...@apache.org>.
jackwener commented on PR #5342:
URL: https://github.com/apache/arrow-datafusion/pull/5342#issuecomment-1437848726

   cc @liukun4515 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb merged pull request #5342: Bug/union wrong casting

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb merged PR #5342:
URL: https://github.com/apache/arrow-datafusion/pull/5342


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] ozankabak commented on a diff in pull request #5342: Bug/union wrong casting

Posted by "ozankabak (via GitHub)" <gi...@apache.org>.
ozankabak commented on code in PR #5342:
URL: https://github.com/apache/arrow-datafusion/pull/5342#discussion_r1118160922


##########
datafusion/expr/src/type_coercion/binary.rs:
##########
@@ -238,13 +238,32 @@ fn comparison_binary_numeric_coercion(
         (_, Decimal128(_, _)) => get_comparison_common_decimal_type(rhs_type, lhs_type),
         (Float64, _) | (_, Float64) => Some(Float64),
         (_, Float32) | (Float32, _) => Some(Float32),
-        (Int64, _) | (_, Int64) => Some(Int64),
-        (Int32, _) | (_, Int32) => Some(Int32),
-        (Int16, _) | (_, Int16) => Some(Int16),
-        (Int8, _) | (_, Int8) => Some(Int8),
+        // start checking from Int64, that is the most inclusive integer type

Review Comment:
   Wouldn't that map (`Int32`, `Int16`) to `Int64`? I think @berkaysynnada's intent is to map to the narrowest correct type, which in this case would be `Int32`.
   
   I am not sure if the `match` pattern is the most succinct representation of the table in the PR text though.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #5342: Bug/union wrong casting

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #5342:
URL: https://github.com/apache/arrow-datafusion/pull/5342#discussion_r1118079876


##########
datafusion/expr/src/type_coercion/binary.rs:
##########
@@ -238,13 +238,32 @@ fn comparison_binary_numeric_coercion(
         (_, Decimal128(_, _)) => get_comparison_common_decimal_type(rhs_type, lhs_type),
         (Float64, _) | (_, Float64) => Some(Float64),
         (_, Float32) | (Float32, _) => Some(Float32),
-        (Int64, _) | (_, Int64) => Some(Int64),
-        (Int32, _) | (_, Int32) => Some(Int32),
-        (Int16, _) | (_, Int16) => Some(Int16),
-        (Int8, _) | (_, Int8) => Some(Int8),
+        // start checking from Int64, that is the most inclusive integer type

Review Comment:
   Likewise I think the same pattern could be used for `UInt*` variants



##########
datafusion/expr/src/type_coercion/binary.rs:
##########
@@ -238,13 +238,32 @@ fn comparison_binary_numeric_coercion(
         (_, Decimal128(_, _)) => get_comparison_common_decimal_type(rhs_type, lhs_type),
         (Float64, _) | (_, Float64) => Some(Float64),
         (_, Float32) | (Float32, _) => Some(Float32),
-        (Int64, _) | (_, Int64) => Some(Int64),
-        (Int32, _) | (_, Int32) => Some(Int32),
-        (Int16, _) | (_, Int16) => Some(Int16),
-        (Int8, _) | (_, Int8) => Some(Int8),
+        // start checking from Int64, that is the most inclusive integer type
+        (Int64, _)
+        | (_, Int64)
+        | (UInt64, Int8)
+        | (Int8, UInt64)
+        | (UInt64, Int16)
+        | (Int16, UInt64)
+        | (UInt64, Int32)
+        | (Int32, UInt64)
+        | (UInt32, Int8)
+        | (Int8, UInt32)
+        | (UInt32, Int16)
+        | (Int16, UInt32)
+        | (UInt32, Int32)
+        | (Int32, UInt32) => Some(Int64),
         (UInt64, _) | (_, UInt64) => Some(UInt64),
+        (Int32, _)
+        | (_, Int32)

Review Comment:
   If one of the arguments is Int32, doesn't that mean the output should be Int64? 
   (as in `(_, Int32)` should not be in the same match arm.



##########
datafusion/expr/src/type_coercion/binary.rs:
##########
@@ -238,13 +238,32 @@ fn comparison_binary_numeric_coercion(
         (_, Decimal128(_, _)) => get_comparison_common_decimal_type(rhs_type, lhs_type),
         (Float64, _) | (_, Float64) => Some(Float64),
         (_, Float32) | (Float32, _) => Some(Float32),
-        (Int64, _) | (_, Int64) => Some(Int64),
-        (Int32, _) | (_, Int32) => Some(Int32),
-        (Int16, _) | (_, Int16) => Some(Int16),
-        (Int8, _) | (_, Int8) => Some(Int8),
+        // start checking from Int64, that is the most inclusive integer type

Review Comment:
   Could we express the same logic more concisely with something like:
   
   ```
         (Int64, _) | (_, Int64) => Some(Int64),
           (Int32, _) | (_, Int32) => Some(Int64),
           (Int16, _) | (_, Int16) => Some(Int32),
           (Int8, _) | (_, Int8) => Some(Int16),
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] iajoiner commented on pull request #5342: Bug/union wrong casting

Posted by "iajoiner (via GitHub)" <gi...@apache.org>.
iajoiner commented on PR #5342:
URL: https://github.com/apache/arrow-datafusion/pull/5342#issuecomment-1436945176

   @berkaysynnada Really thanks for your work! I wonder what the type of the union of an Int64 and an UInt64 should be though.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #5342: Bug/union wrong casting

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #5342:
URL: https://github.com/apache/arrow-datafusion/pull/5342#discussion_r1119286791


##########
datafusion/expr/src/type_coercion/binary.rs:
##########
@@ -238,13 +238,32 @@ fn comparison_binary_numeric_coercion(
         (_, Decimal128(_, _)) => get_comparison_common_decimal_type(rhs_type, lhs_type),
         (Float64, _) | (_, Float64) => Some(Float64),
         (_, Float32) | (Float32, _) => Some(Float32),
-        (Int64, _) | (_, Int64) => Some(Int64),
-        (Int32, _) | (_, Int32) => Some(Int32),
-        (Int16, _) | (_, Int16) => Some(Int16),
-        (Int8, _) | (_, Int8) => Some(Int8),
+        // start checking from Int64, that is the most inclusive integer type

Review Comment:
   > Wouldn't that map (Int32, Int16) to Int64? 
   
   Yes.
   
   I guess I was thinking of the more general case for arithmetic where  `i32::MAX + i16::MAX` requires a `i64` to store
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #5342: Bug/union wrong casting

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #5342:
URL: https://github.com/apache/arrow-datafusion/pull/5342#discussion_r1120839022


##########
datafusion/expr/src/type_coercion/binary.rs:
##########
@@ -238,13 +238,32 @@ fn comparison_binary_numeric_coercion(
         (_, Decimal128(_, _)) => get_comparison_common_decimal_type(rhs_type, lhs_type),
         (Float64, _) | (_, Float64) => Some(Float64),
         (_, Float32) | (Float32, _) => Some(Float32),
-        (Int64, _) | (_, Int64) => Some(Int64),
-        (Int32, _) | (_, Int32) => Some(Int32),
-        (Int16, _) | (_, Int16) => Some(Int16),
-        (Int8, _) | (_, Int8) => Some(Int8),
+        // start checking from Int64, that is the most inclusive integer type

Review Comment:
   I think I am just confused 
   
   For example this PR adds this rule:
   
   ```
   (Int8, UInt16) => Some(Int32),
   ```
   
   Which isn't just merging the schema (e.g. UInt16) in my mind, but is extending it in some way. In this case, so that the entire range of `Int8` and `UInt16` can be represented -- which now makes sense
   
   I think what would help me would be:
   1. Add some comments explaining the rationale for these rules
   2. (maybe) add test coverage (ideally looking like your beautiful table from the PR description) that helps to illustrate the point
   
   However, I don't think this is needed prior to merging this PR



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org