Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/28 19:28:03 UTC

[GitHub] [arrow-rs] tustvold opened a new issue, #2969: Combine _utf8 and _binary kernels

tustvold opened a new issue, #2969:
URL: https://github.com/apache/arrow-rs/issues/2969

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   
   With #2947 we can now write kernels that are generic over both byte arrays and string arrays. We have a large number of kernels with duplicate implementations for both, e.g. `gt_eq_dyn_binary` and `gt_eq_dyn_utf8`.
   
   **Describe the solution you'd like**
   
   We should create a new unified kernel, e.g. `gt_eq_dyn_bytes`, and make the specialized kernels simply call through to it.
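
The proposal above can be sketched without the arrow crate at all. The following is a minimal, standard-library-only illustration, not the arrow-rs implementation: columns are modelled as slices of `Option` values (with `None` standing in for a null slot), and the names `gt_eq_bytes`, `gt_eq_utf8` and `gt_eq_binary` are hypothetical stand-ins for the real kernels.

```rust
/// Hypothetical unified kernel: element-wise `left >= right` on raw bytes.
/// A `None` (null) on either side produces a `None` in the output.
fn gt_eq_bytes<T: AsRef<[u8]>>(
    left: &[Option<T>],
    right: &[Option<T>],
) -> Result<Vec<Option<bool>>, String> {
    if left.len() != right.len() {
        return Err(format!(
            "length mismatch: {} vs {}",
            left.len(),
            right.len()
        ));
    }
    Ok(left
        .iter()
        .zip(right.iter())
        .map(|(l, r)| match (l, r) {
            (Some(l), Some(r)) => {
                // View both values as bytes, whatever their original type.
                let l: &[u8] = l.as_ref();
                let r: &[u8] = r.as_ref();
                Some(l >= r)
            }
            _ => None,
        })
        .collect())
}

/// Hypothetical string-specific kernel: a thin wrapper over the unified one.
fn gt_eq_utf8(
    left: &[Option<&str>],
    right: &[Option<&str>],
) -> Result<Vec<Option<bool>>, String> {
    gt_eq_bytes(left, right)
}

/// Hypothetical binary-specific kernel: the same wrapper for byte slices.
fn gt_eq_binary(
    left: &[Option<&[u8]>],
    right: &[Option<&[u8]>],
) -> Result<Vec<Option<bool>>, String> {
    gt_eq_bytes(left, right)
}

fn main() {
    let left = [Some("b"), None, Some("a")];
    let right = [Some("a"), Some("a"), Some("b")];
    println!("{:?}", gt_eq_utf8(&left, &right));

    let left = [Some(&b"\x01\x02"[..])];
    let right = [Some(&b"\x01\x03"[..])];
    println!("{:?}", gt_eq_binary(&left, &right));
}
```

The point of the sketch is only the shape: the comparison logic lives in one generic function, and the type-specific entry points carry no logic of their own.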
   
   **Describe alternatives you've considered**
   
   **Additional context**
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on issue #2969: Combine _utf8 and _binary kernels

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #2969:
URL: https://github.com/apache/arrow-rs/issues/2969#issuecomment-1318849308

   Normalization is in the same position as locale-aware sorting: there isn't mature ecosystem support yet, nor a motivated contributor, so we don't currently support it.
   
   Is there a particular use-case that motivates your asking about this? I was under the perhaps naive impression that most DBs were moving away from locale-aware string handling. Postgres supports it but specifically advises against using it, as it dramatically hurts performance, not to mention all the normal reproducibility pain inherent to locales...




[GitHub] [arrow-rs] tustvold commented on issue #2969: Combine _utf8 and _binary kernels

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #2969:
URL: https://github.com/apache/arrow-rs/issues/2969#issuecomment-1317672021

   We don't provide locale-aware string comparison, in part because there isn't Rust ecosystem support for it. We solely provide byte-based ordering, the same as the standard `Ord` implementation.
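
To make the byte-based ordering concrete, here is a small standard-library-only sketch (the helper `byte_order` is a hypothetical name, not an arrow-rs function). It checks that `str`'s own `Ord` agrees with comparing the underlying UTF-8 bytes, and uses pairs where a locale-aware collation would typically be expected to order things differently.

```rust
use std::cmp::Ordering;

/// Hypothetical helper: order two strings purely by their UTF-8 bytes.
fn byte_order(a: &str, b: &str) -> Ordering {
    a.as_bytes().cmp(b.as_bytes())
}

fn main() {
    let pairs = [
        ("apple", "banana"),
        // Uppercase ASCII letters have smaller byte values than lowercase ones.
        ("Zebra", "apple"),
        // 'é' encodes as 0xC3 0xA9, which is larger than any ASCII byte.
        ("zebra", "éclair"),
    ];
    for (a, b) in pairs {
        // The standard `Ord` for `str` matches the byte-wise comparison.
        assert_eq!(a.cmp(b), byte_order(a, b));
        println!("{:?} vs {:?}: {:?}", a, b, byte_order(a, b));
    }
}
```

Under byte ordering, "Zebra" sorts before "apple" and "zebra" sorts before "éclair", which is deterministic and cheap but not what a reader of a particular language would necessarily expect.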




[GitHub] [arrow-rs] alippai commented on issue #2969: Combine _utf8 and _binary kernels

Posted by GitBox <gi...@apache.org>.
alippai commented on issue #2969:
URL: https://github.com/apache/arrow-rs/issues/2969#issuecomment-1319456422

   On normalization: I have run into issues before, and I remembered that Utf8 is not simply a byte array.
   
   On localization: similarly, I speak several languages, and I was surprised by the "assumption" that byte order is always the same.
   
   I wasn't sure whether these were considered and skipped, or hadn't come up at all. I agree it's a big chunk of work and the performance is always worse. It's not essential; I just wanted to raise it because, if you are making design decisions, these questions can help you make an informed decision instead of a lucky one :)
   
   I'm all good with proceeding, nothing actionable from my side.




[GitHub] [arrow-rs] alippai commented on issue #2969: Combine _utf8 and _binary kernels

Posted by GitBox <gi...@apache.org>.
alippai commented on issue #2969:
URL: https://github.com/apache/arrow-rs/issues/2969#issuecomment-1318019393

   I agree that locale-based sorting can be out of scope for now. What do you think regarding normalization?
   ![image](https://user-images.githubusercontent.com/240729/202349132-2fa5656e-efce-45d4-a3f5-24db2f6a2a52.png)
   




[GitHub] [arrow-rs] alippai commented on issue #2969: Combine _utf8 and _binary kernels

Posted by GitBox <gi...@apache.org>.
alippai commented on issue #2969:
URL: https://github.com/apache/arrow-rs/issues/2969#issuecomment-1318024573

   Btw I didn't find native Rust locale tools a year ago, but this now looks ok?! https://github.com/unicode-org/icu4x




[GitHub] [arrow-rs] alippai commented on issue #2969: Combine _utf8 and _binary kernels

Posted by GitBox <gi...@apache.org>.
alippai commented on issue #2969:
URL: https://github.com/apache/arrow-rs/issues/2969#issuecomment-1317604027

   I haven't been able to check the code yet, but UTF-8 comparison is different from byte comparison because of normalization (or the lack of it), right? Also, is gt / lt locale-specific?




[GitHub] [arrow-rs] tustvold commented on issue #2969: Combine _utf8 and _binary kernels

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #2969:
URL: https://github.com/apache/arrow-rs/issues/2969#issuecomment-1319644870

   Thanks, yeah, I thought you were referring to Unicode normalisation, which is its own wondrous thing, as there are redundant encodings for the same text. As you say, not all byte arrays are valid UTF-8, so we must, and do, validate this at construction time.
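
Both points in this comment can be illustrated with the standard library alone. The sketch below is not arrow-rs code, and `bytewise_equal` and `is_valid_utf8` are hypothetical helper names. It shows two encodings of "é" that render alike but differ byte for byte, and a byte sequence that UTF-8 validation rejects.

```rust
/// Hypothetical helper: equality as a byte-based kernel would see it.
fn bytewise_equal(a: &str, b: &str) -> bool {
    a.as_bytes() == b.as_bytes()
}

/// Hypothetical helper: would these bytes be accepted as UTF-8 text?
fn is_valid_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}

fn main() {
    // Precomposed form: the single code point U+00E9.
    let precomposed = "\u{e9}";
    // Decomposed form: 'e' followed by U+0301 (combining acute accent).
    let decomposed = "e\u{301}";

    println!("precomposed bytes: {:x?}", precomposed.as_bytes());
    println!("decomposed bytes:  {:x?}", decomposed.as_bytes());

    // Without normalisation, the two forms are simply different values.
    println!("bytewise equal: {}", bytewise_equal(precomposed, decomposed));

    // Both forms are valid UTF-8; an arbitrary byte array need not be.
    println!("decomposed valid: {}", is_valid_utf8(decomposed.as_bytes()));
    println!("[0xFF, 0xFE] valid: {}", is_valid_utf8(&[0xFF, 0xFE]));
}
```

Validation answers only the second question (is this well-formed UTF-8?); treating the two forms of "é" as equal would need a separate normalisation step, which the thread notes is not currently supported.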




[GitHub] [arrow-rs] tustvold closed issue #2969: Combine _utf8 and _binary kernels

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold closed issue #2969: Combine _utf8 and _binary kernels
URL: https://github.com/apache/arrow-rs/issues/2969

