Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/29 14:39:10 UTC

[GitHub] [arrow-rs] msalib opened a new issue, #3228: better document when we need `LargeUtf8` instead of `Utf8`

msalib opened a new issue, #3228:
URL: https://github.com/apache/arrow-rs/issues/3228

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   
   When I filed https://github.com/apache/arrow-rs/issues/3123, I was surprised to discover that concatenating lots of `Utf8` elements is supposed to panic once the total size goes over 2 GB, even though the individual elements are much smaller. That constraint was really unexpected! It makes sense once you understand the storage model (a `Utf8` array addresses a single value buffer with `i32` offsets, so one array can never hold more than `i32::MAX` bytes of text), but I didn't, so it caught me completely off guard.
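
   To make the storage model concrete, here's a quick sketch of my own (not from the docs). Both types keep all the text in one contiguous value buffer and mark element boundaries with offsets; `Utf8` uses `i32` offsets and `LargeUtf8` uses `i64`, so a `Utf8` array tops out at `i32::MAX` value bytes (~2 GB) no matter how small each element is:

   ```rust
   use arrow::array::{Array, LargeStringArray, StringArray};

   fn main() {
       // Utf8: one value buffer, element boundaries marked by i32 offsets.
       let small = StringArray::from(vec!["hello", "world"]);
       let offsets: &[i32] = small.value_offsets();
       // The last offset is the total number of value bytes (10 here); as
       // an i32 it can never exceed i32::MAX -- that is the ~2 GB ceiling.
       assert_eq!(offsets[small.len()], 10);

       // LargeUtf8: identical layout, but i64 offsets, so no 2 GB ceiling.
       let large = LargeStringArray::from(vec!["hello", "world"]);
       let large_offsets: &[i64] = large.value_offsets();
       assert_eq!(large_offsets[large.len()], 10);
   }
   ```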
   
   **Describe the solution you'd like**
   
   I'm not sure how to surface this knowledge better. When I first skimmed the data type docs, I walked away thinking that `LargeUtf8` is for cases where an individual element is large (I wasn't even clear that large meant > 2 GB) and that I should use `Utf8` for everything else. But I should've understood the constraint as "use `LargeUtf8` everywhere except places where you can guarantee that you'll never have an array with more than 2 GB of text total".
   
   Maybe we just need a prominent statement in the [Physical Memory Layout](https://arrow.apache.org/docs/format/Columnar.html#physical-memory-layout) guide and the [`DataType`](https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html#variant.Utf8) doc string explaining that with `Utf8` you can never build an array whose total text size exceeds 2 GB.
   
   **Describe alternatives you've considered**
   
   This feels like a landmine, and I wish Arrow could transparently convert between these types as needed. Ideally there would just be a single `Utf8` type that internally chooses the integer type it uses for offsets.
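
   For what it's worth, arrow-rs already exposes some of the machinery for this: `StringArray` and `LargeStringArray` are aliases of `GenericStringArray<i32>` and `GenericStringArray<i64>`, so library code can at least be written once over both offset widths. A minimal sketch (the helper name `total_text_bytes` is mine, not part of the crate):

   ```rust
   use arrow::array::{Array, GenericStringArray, OffsetSizeTrait};

   // Hypothetical helper: total value bytes in a string array, written
   // once for both offset widths (i32 for Utf8, i64 for LargeUtf8).
   fn total_text_bytes<O: OffsetSizeTrait>(arr: &GenericStringArray<O>) -> usize {
       let offsets = arr.value_offsets();
       // Subtracting the first offset keeps this correct for sliced arrays.
       offsets[arr.len()].to_usize().unwrap() - offsets[0].to_usize().unwrap()
   }
   ```

   With something like that in the public API, user code wouldn't have to commit to an offset width until it actually mattered.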
   
   Alternatively, I wish the `concat` kernel could fail more explicitly by checking for this sort of overflow up front, with a message like "I've been asked to concat 2 `Utf8` arrays into an array that would be over 2 GB and I cannot do that: these arrays need to be `LargeUtf8` instead". The kernel already knows the input lengths, so it can check the combined size before doing any work; a rough sketch of the check I have in mind is below.
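
   The check is cheap because each input's value-buffer size is just its last offset minus its first. Something like this (`check_concat_fits` is hypothetical, not the kernel's actual code):

   ```rust
   use arrow::array::{Array, StringArray};
   use arrow::error::ArrowError;

   // Hypothetical pre-flight check: sum the inputs' value-buffer sizes and
   // refuse with a clear message before attempting the concatenation.
   fn check_concat_fits(arrays: &[&StringArray]) -> Result<(), ArrowError> {
       let total: i64 = arrays
           .iter()
           .map(|a| {
               let offsets = a.value_offsets();
               // Last offset minus first = value bytes, even for sliced arrays.
               (offsets[a.len()] - offsets[0]) as i64
           })
           .sum();
       if total > i32::MAX as i64 {
           return Err(ArrowError::ComputeError(format!(
               "cannot concat {} Utf8 arrays: result needs {total} value bytes, \
                over the i32::MAX limit; use LargeUtf8 instead",
               arrays.len()
           )));
       }
       Ok(())
   }
   ```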


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] viirya closed issue #3228: better document when we need `LargeUtf8` instead of `Utf8`

Posted by GitBox <gi...@apache.org>.
viirya closed issue #3228: better document when we need `LargeUtf8` instead of `Utf8`
URL: https://github.com/apache/arrow-rs/issues/3228

