Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/06/28 20:39:22 UTC

[GitHub] [arrow-rs] jorgecarleitao commented on issue #506: "Optimize" Dictionary contents in DictionaryArray / `concat_batches`

jorgecarleitao commented on issue #506:
URL: https://github.com/apache/arrow-rs/issues/506#issuecomment-870024847


   Great issue description @alamb 🎩 
   
   I would do it in a separate kernel, so as not to break the principle that concatenating arrays is an `O(N)` operation, where `N` is the number of elements in all arrays (is this one `O(N log N)`?)
   
   `ensure_sort: bool` or something like that would be a nice argument for such a function.
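
To make the idea concrete, here is a minimal, std-only sketch of what such a separate kernel could look like. Everything in it is hypothetical: `ToyDictionary` is a stand-in for a dictionary array (it is not the arrow-rs `DictionaryArray`), and `concat_and_optimize` is an invented name, not an existing arrow-rs function. Under these assumptions, deduplication is a single hash-based pass over the keys, and the optional sort only touches the distinct values.

```rust
use std::collections::HashMap;

/// Toy dictionary-encoded array: `keys[i]` indexes into `values`.
/// Hypothetical stand-in, not the arrow-rs `DictionaryArray`.
#[derive(Debug, Clone, PartialEq)]
pub struct ToyDictionary {
    pub keys: Vec<usize>,
    pub values: Vec<String>,
}

/// Hypothetical kernel: concatenates the arrays, keeps only the distinct
/// values that are actually referenced, and optionally sorts the dictionary.
pub fn concat_and_optimize(arrays: &[ToyDictionary], ensure_sort: bool) -> ToyDictionary {
    // Pass 1: assign each distinct referenced value a new key, in first-seen order.
    let mut index: HashMap<&str, usize> = HashMap::new();
    let mut values: Vec<&str> = Vec::new();
    let mut keys: Vec<usize> = Vec::new();
    for array in arrays {
        for &k in &array.keys {
            let v = array.values[k].as_str();
            let new_key = *index.entry(v).or_insert_with(|| {
                values.push(v);
                values.len() - 1
            });
            keys.push(new_key);
        }
    }

    // Pass 2 (optional): sort the distinct values and remap the keys.
    if ensure_sort {
        // order[new_pos] = old_pos
        let mut order: Vec<usize> = (0..values.len()).collect();
        order.sort_by(|&a, &b| values[a].cmp(values[b]));
        // remap[old_pos] = new_pos
        let mut remap = vec![0usize; values.len()];
        for (new_pos, &old_pos) in order.iter().enumerate() {
            remap[old_pos] = new_pos;
        }
        for k in keys.iter_mut() {
            *k = remap[*k];
        }
        let sorted: Vec<&str> = order.iter().map(|&i| values[i]).collect();
        values = sorted;
    }

    ToyDictionary {
        keys,
        values: values.into_iter().map(String::from).collect(),
    }
}
```

In this sketch, values that no key references are dropped as a side effect, and the logical contents of the concatenated array are unchanged whether or not `ensure_sort` is set.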
   
   In general, though, we have a small challenge in how we track dictionary metadata: our `DataType::Dictionary` does not hold dictionary metadata, which means that we must store it somewhere else. This makes it more cumbersome, as the function cannot leverage this information to e.g. avoid re-sorting a sorted dictionary array without that other "dictionary metadata".
   
   My feeling is that we should (backward-incompatibly) extend `DataType::Dictionary(keys, values, metadata)` where `metadata` is a struct containing the different dictionary metadata available in `Field`, but I am not 100% convinced about this.
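
A rough sketch of that shape, to show how a kernel could use it. The names here are all assumptions for illustration: `ToyDataType` is a cut-down stand-in for `DataType`, and `DictionaryMetadata` with its `id` and `is_sorted` fields is hypothetical, not a description of what `Field` stores today.

```rust
/// Hypothetical metadata carried by the dictionary type itself.
/// The field names are illustrative assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct DictionaryMetadata {
    pub id: i64,
    pub is_sorted: bool,
}

/// Cut-down stand-in for `DataType`, with the proposed three-field variant.
#[derive(Debug, Clone, PartialEq)]
pub enum ToyDataType {
    Int32,
    Utf8,
    Dictionary(Box<ToyDataType>, Box<ToyDataType>, DictionaryMetadata),
}

/// A kernel can now decide from the type alone whether sorting is needed.
pub fn needs_sort(data_type: &ToyDataType) -> bool {
    match data_type {
        ToyDataType::Dictionary(_, _, metadata) => !metadata.is_sorted,
        // Non-dictionary types have no dictionary to sort.
        _ => false,
    }
}
```

With the metadata inside the type, a function that only receives the array (and therefore its data type) would not need a separate `Field` to skip redundant work.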
   
   I also thought about a more radical approach of removing `DataType::Dictionary`, since a Dictionary is not formally a DataType, but an array encoding. With that said, it does have a different physical representation, so in this sense it is convenient to write it as a separate `DataType` that can be `matched`. The disadvantage is that we can't change an array's encoding without changing the logical type associated with it. This contrasts with parquet, where encodings and logical types are independent of each other.
   
   
   
   

