You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "jayzhan211 (via GitHub)" <gi...@apache.org> on 2023/11/22 12:50:36 UTC

[I] Categorize lossless type coercion and lossy type coercion [arrow-datafusion]

jayzhan211 opened a new issue, #8302:
URL: https://github.com/apache/arrow-datafusion/issues/8302

   ### Is your feature request related to a problem or challenge?
   
   I think there is room for improvement in `type coerceion` or `casting`
   
   # Background
   `comparison_coercion` is widely used in datafusion, a lossless conversion
   https://github.com/apache/arrow-datafusion/blob/main/datafusion/expr/src/type_coercion/binary.rs
   
   `can_coerce_from` is used mainly for signature, a lossless conversion
   https://github.com/apache/arrow-datafusion/blob/main/datafusion/expr/src/type_coercion/functions.rs
   
   `can_cast_types` is from `arrow-cast`, which is a lossy conversion. It is also used in some `comparison_coercion` building block.
   
   Not sure if there is other coercion I missed
   
   # Proposal
   `comparison_coercion` and `can_coerce_from` seem like doing the similar thing, maybe we can just have one `lossless` conversion. If lossless conversion is useful for arrow-rs, we can introduce a lossless version of `can_cast_types`, then rely on it for datafusion.
   
   # Lossy conversion vs Lossless
   I think the definition for lossy is that the value is not recoverable after casting back, otherwise it is lossless.
   
   ## Lossy
   * Int32 to Int16 / Int8
   ## Lossless
   * Int32 to Int64
   
   ### Describe the solution you'd like
   
   
   1. Replace `can_coerce_from` with `comparison_coercion`'s building block `numeric coercion`, `list coercion`, `string coercion`, `null coercion`, etc
   2. Split `list_coercion` from `string_coercion` to make each building block of coercion clear on the task it focus on. list_coercion do `list, fixed size list, large list` coercion, `string_coercion` do `utf`/`large utf` coercion.
   3. Introduce these lossless coercion to arrow-rs?
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Unify type coercion and casting [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #8302:
URL: https://github.com/apache/arrow-datafusion/issues/8302#issuecomment-1832935357

   > . Is there any concern about this change?
   
   I would recommend you write some tests / try out the change before making a full featured PR. I am not sure what the implications are with the change you propose, but I suspect it could be non trivial and subtle


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Unify type coercion and casting [arrow-datafusion]

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on issue #8302:
URL: https://github.com/apache/arrow-datafusion/issues/8302#issuecomment-1833753259

   > // The following match arms encode the following logic: Given the two
   >         // integral types, we choose the narrowest possible integral type that
   >         // accommodates all values of both types. Note that some information
   >         // loss is inevitable when we have a signed type and a `UInt64`, in
   >         // which case we use `Int64`;i.e. the widest signed integral type.
   
   I rethink about this. This is the best we can do, like the comment mentioned, it is inevitable :(


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Categorize lossless type coercion and lossy type coercion [arrow-datafusion]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #8302:
URL: https://github.com/apache/arrow-datafusion/issues/8302#issuecomment-1822831438

   That comment is incorrect and doesn't match the implementation


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Categorize lossless type coercion and lossy type coercion [arrow-datafusion]

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on issue #8302:
URL: https://github.com/apache/arrow-datafusion/issues/8302#issuecomment-1822857428

   > > That comment is incorrect and doesn't match the implementation. I believe it means that it supports conversions that are potentially fallible, e.g. Int32 to Int8, but it never silently looses data
   > 
   > Any scenario that we should hint it as potentially fallible but actually do nothing for them? It seems to me return false is more straightforward if we do nothing.
   
   Maybe Int64(1) cast to Int32(1) is one that is able to safely cast without data loss.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Unify type coercion and casting [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #8302:
URL: https://github.com/apache/arrow-datafusion/issues/8302#issuecomment-1832759808

   > https://github.com/apache/arrow-datafusion/blob/2a692446f46ef96f48eb9ba19231e9576be9ff5a/datafusion/expr/src/type_coercion/binary.rs#L388-L406
   > 
   > In `comparison_binary_numeric_coercion`, signed type and u64 is coerced to i64, anyone knows why is this acceptable? what kind of cases are?
   > 
   > for type u64 and f32, it will return f32 which is smaller than u64, not sure why is this too
   
   The comment in the code you linked seems to say "some information loss is inevitable":
   
   
   ```
           // The following match arms encode the following logic: Given the two
           // integral types, we choose the narrowest possible integral type that
           // accommodates all values of both types. Note that some information
           // loss is inevitable when we have a signed type and a `UInt64`, in
           // which case we use `Int64`;i.e. the widest signed integral type.
   ```
   
   It is likely for "usability" so that you can compare i64 and u64 without an explicit type cast 🤷 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Categorize lossless type coercion and lossy type coercion [arrow-datafusion]

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on issue #8302:
URL: https://github.com/apache/arrow-datafusion/issues/8302#issuecomment-1822883571

   > can_coerce_from
   
   Yes! Should we have it aside `can_cast_types ` in arrow-rs? Or just let it stay in datafusion for now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Unify type coercion and casting [arrow-datafusion]

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on issue #8302:
URL: https://github.com/apache/arrow-datafusion/issues/8302#issuecomment-1837052849

   After #8385, I found that it is not ideal to have one coercion for all the place (compare op, math op, signature coercion, etc... ).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Categorize lossless type coercion and lossy type coercion [arrow-datafusion]

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on issue #8302:
URL: https://github.com/apache/arrow-datafusion/issues/8302#issuecomment-1822838337

   > That comment is incorrect and doesn't match the implementation. I believe it means that it supports conversions that are potentially fallible, e.g. Int32 to Int8, but it never silently looses data
   
   Any scenario that we should hint it as potentially fallible but actually do nothing for them? It seems to me return false is more straightforward if we does not do the casting.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Unify type coercion and casting [arrow-datafusion]

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on issue #8302:
URL: https://github.com/apache/arrow-datafusion/issues/8302#issuecomment-1832926920

   > > https://github.com/apache/arrow-datafusion/blob/2a692446f46ef96f48eb9ba19231e9576be9ff5a/datafusion/expr/src/type_coercion/binary.rs#L388-L406
   > > 
   > > In `comparison_binary_numeric_coercion`, signed type and u64 is coerced to i64, anyone knows why is this acceptable? what kind of cases are?
   > > for type u64 and f32, it will return f32 which is smaller than u64, not sure why is this too
   > 
   > The comment in the code you linked seems to say "some information loss is inevitable":
   > 
   > ```
   >         // The following match arms encode the following logic: Given the two
   >         // integral types, we choose the narrowest possible integral type that
   >         // accommodates all values of both types. Note that some information
   >         // loss is inevitable when we have a signed type and a `UInt64`, in
   >         // which case we use `Int64`;i.e. the widest signed integral type.
   > ```
   > 
   > It is likely for "usability" so that you can compare i64 and u64 without an explicit type cast 🤷
   
   I don't think that usability is the good reason :(  It might cause data loss. 
   
   @alamb I would like to change them to the straightforward version, choose the larger type between two. I think it can also be used all the place in datafusion easily (one place is `can_coerce_from`). Is there any concern about this change?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Categorize lossless type coercion and lossy type coercion [arrow-datafusion]

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on issue #8302:
URL: https://github.com/apache/arrow-datafusion/issues/8302#issuecomment-1823722164

   Short summary for the next step of this issue
   1. Cleanup `can_coerce_from` and `comparison_coercion`, reuse the code if possible. Rely on arrow-rs first (can_cast_types) if possible.
   2. Add nested support. Start from List


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Categorize lossless type coercion and lossy type coercion [arrow-datafusion]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #8302:
URL: https://github.com/apache/arrow-datafusion/issues/8302#issuecomment-1822871165

   Is that not what can_coerce_from is?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Categorize lossless type coercion and lossy type coercion [arrow-datafusion]

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on issue #8302:
URL: https://github.com/apache/arrow-datafusion/issues/8302#issuecomment-1822751556

   @tustvold Do you think introduce a lossless version of `can_cast_types` helpful?
   cc @alamb 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Categorize lossless type coercion and lossy type coercion [arrow-datafusion]

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on issue #8302:
URL: https://github.com/apache/arrow-datafusion/issues/8302#issuecomment-1822844771

   However, I think we can still cleanup `comparison_coercion ` and `can_coerce_from ` and forget about lossy conversion.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Categorize lossless type coercion and lossy type coercion [arrow-datafusion]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #8302:
URL: https://github.com/apache/arrow-datafusion/issues/8302#issuecomment-1822888898

   Given the type coercion logic itself lives in DataFusion I personally think it makes sense for it to remain alongside it


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Categorize lossless type coercion and lossy type coercion [arrow-datafusion]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #8302:
URL: https://github.com/apache/arrow-datafusion/issues/8302#issuecomment-1822758176

   > Decimal128 can cast to Float64 in can_coerce_from, why?
   
   Floating points inherently permit precision loss
   
   > No list coercion for can_coerce_from
   
   I think this would make sense as an addition, the current support for nested types in arrow-cast is very limited.
   
   > If lossless conversion is useful for arrow-rs
   
   In general I think it would be very confusing if coercion can result in silent data loss, perhaps you might articulate why this would be useful? My current thinking is casting should never silently lose data.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Categorize lossless type coercion and lossy type coercion [arrow-datafusion]

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on issue #8302:
URL: https://github.com/apache/arrow-datafusion/issues/8302#issuecomment-1822827323

   > My current thinking is casting should never silently lose data.
   https://github.com/apache/arrow-rs/blob/df69ef57d055453c399fa925ad315d19211d7ab2/arrow-cast/src/cast.rs#L76C2-L80
   
   Comment in `can_cast_types` mention that cast is lossy.
   
   https://github.com/apache/arrow-rs/blob/df69ef57d055453c399fa925ad315d19211d7ab2/arrow-cast/src/cast.rs#L212-L216
   
   Casting Int64 to Int32 return true. I'm not sure if the `actual` casting block this lossy conversion, but if not, this can silently lose data?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Categorize lossless type coercion and lossy type coercion [arrow-datafusion]

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on issue #8302:
URL: https://github.com/apache/arrow-datafusion/issues/8302#issuecomment-1822868514

   Maybe we need another verions of `can_cast_types` that no possible falliable hint. For i64, only u64 or larger type, no i32 or i16 since we dont know the actually data.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Unify type coercion and casting [arrow-datafusion]

Posted by "jayzhan211 (via GitHub)" <gi...@apache.org>.
jayzhan211 commented on issue #8302:
URL: https://github.com/apache/arrow-datafusion/issues/8302#issuecomment-1831035870

   https://github.com/apache/arrow-datafusion/blob/2a692446f46ef96f48eb9ba19231e9576be9ff5a/datafusion/expr/src/type_coercion/binary.rs#L388-L406
   
   In `comparison_binary_numeric_coercion`, signed type and u64 is coerced to i64, anyone knows why is this acceptable?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org