Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/06/28 20:26:29 UTC

[GitHub] [arrow-rs] jhorstmann commented on issue #506: "Optimize" Dictionary contents in DictionaryArray / `concat_batches`

jhorstmann commented on issue #506:
URL: https://github.com/apache/arrow-rs/issues/506#issuecomment-870014675


   >     1. b) Every value in the dictionary has at least one use in the array's values
   
   A nice benefit of this is that a subsequent GROUP BY on that dictionary column would be very cheap: it would not need another hashmap and could instead use the keys to index directly into an array of accumulators. I'm not sure whether that is the use case you are after or whether it is more of a nice side effect.
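
   To illustrate the direct-indexing idea, here is a minimal sketch using plain Rust slices in place of Arrow arrays. The function name `sum_by_dictionary_key` is hypothetical, the aggregate is assumed to be a sum, and null keys are ignored for simplicity.

```rust
/// Sums `values` per dictionary key, using each key as a direct index into
/// a vector of accumulators instead of looking it up in a hash map.
///
/// Assumption: every key is a valid index into the dictionary, i.e.
/// `key < dict_len`. Null handling is left out of this sketch.
fn sum_by_dictionary_key(keys: &[usize], values: &[i64], dict_len: usize) -> Vec<i64> {
    assert_eq!(keys.len(), values.len());

    // One accumulator slot per dictionary entry.
    let mut accumulators = vec![0i64; dict_len];
    for (&key, &value) in keys.iter().zip(values) {
        accumulators[key] += value;
    }
    accumulators
}
```

   For a dictionary `["a", "b", "c"]` with keys `[0, 2, 1, 0, 2]` and values `[10, 20, 30, 40, 50]`, slot `i` of the result holds the sum for dictionary entry `i`. If every dictionary value is used at least once, no slot is wasted on a group that never occurs.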
   
   Ensuring sorted dictionaries is something I'm definitely interested in. `Field` already has the `dict_is_ordered` flag, based on which a much faster implementation of a sort comparator or comparison kernel could be selected. I was thinking of a different implementation than using a BTreeSet, though. I have only a rough sketch, but the idea is to use `sort_to_indices` on the dictionary values and then somehow build a lookup table as a vector. With the sorted indices it should also be possible to build a lookup table for remapping duplicates.
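
   One way the lookup table could be built, sketched with plain Rust vectors rather than Arrow arrays. The sort step stands in for `sort_to_indices`; the function names `sort_and_dedup_dictionary` and `remap_keys` are hypothetical, and string dictionary values without nulls are assumed.

```rust
/// Builds a sorted, duplicate-free dictionary from `values` and returns it
/// together with a lookup table where `remap[old_key]` is the new key.
fn sort_and_dedup_dictionary(values: &[&str]) -> (Vec<String>, Vec<usize>) {
    // Stand-in for `sort_to_indices`: the indices that would sort `values`.
    let mut sorted_indices: Vec<usize> = (0..values.len()).collect();
    sorted_indices.sort_by_key(|&i| values[i]);

    let mut new_values: Vec<String> = Vec::new();
    let mut remap = vec![0usize; values.len()];
    for &old_key in &sorted_indices {
        // Equal values are adjacent in sorted order, so a duplicate always
        // matches the most recently emitted value.
        if new_values.last().map(String::as_str) != Some(values[old_key]) {
            new_values.push(values[old_key].to_string());
        }
        remap[old_key] = new_values.len() - 1;
    }
    (new_values, remap)
}

/// Rewrites the keys of the array through the lookup table.
fn remap_keys(keys: &[usize], remap: &[usize]) -> Vec<usize> {
    keys.iter().map(|&key| remap[key]).collect()
}
```

   The lookup table is a plain vector indexed by the old key, so remapping the keys is a single pass with no hashing. Because the new dictionary is sorted, comparing two new keys gives the same ordering as comparing the values they refer to.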
   

