Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/04 14:47:41 UTC

[GitHub] [arrow-datafusion] alamb opened a new issue #258: Improve performance of COUNT (distinct x) for dictionary columns

alamb opened a new issue #258:
URL: https://github.com/apache/arrow-datafusion/issues/258


   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   
   I have large amounts of low cardinality string data (for example, 200 M rows, but only 20 distinct values). DictionaryArrays are very good for such data as they are space efficient.  
   
   https://github.com/apache/arrow-datafusion/pull/256 adds basic query support for distinct dictionary columns, but it is not a very computationally efficient implementation. It effectively unpacks the (likely mostly deduplicated) dictionary's values row by row into a hash set to deduplicate them again. That is a lot of extra hashing work.
   
   
   **Describe the solution you'd like**
   It would likely be much more efficient (especially for arrays that have a small number of distinct values in their dictionary) to look at the values from the dictionary directly, after first checking which dictionary entries are actually referenced by at least one row.
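
A minimal sketch of the idea, not DataFusion's or arrow's actual API: the dictionary column is modelled here with plain slices, where the hypothetical `keys` parameter holds one optional dictionary index per row and `values` holds the dictionary entries. The function name `count_distinct_dict` is also made up for illustration. It assumes every key is a valid index into `values`, and that the dictionary may contain duplicate, null, or unused entries.

```rust
use std::collections::HashSet;

/// Count the distinct non-null values of a dictionary-encoded column.
///
/// `keys[i]` is the dictionary index for row `i` (`None` means a null row).
/// `values` is the dictionary itself. Panics if a key is out of range.
fn count_distinct_dict(keys: &[Option<usize>], values: &[Option<&str>]) -> usize {
    // Pass 1: one cheap flag write per row, no hashing.
    // Marks which dictionary entries are referenced by at least one row.
    let mut used = vec![false; values.len()];
    for key in keys.iter().flatten() {
        used[*key] = true;
    }

    // Pass 2: hash only the dictionary entries that were actually used.
    // This is one hash per dictionary entry rather than one per row.
    // Null dictionary entries are skipped, as COUNT(DISTINCT x) ignores nulls.
    let distinct: HashSet<&str> = values
        .iter()
        .zip(&used)
        .filter_map(|(value, &is_used)| if is_used { *value } else { None })
        .collect();

    distinct.len()
}
```

With 200 M rows and 20 dictionary entries, this does 200 M flag writes and at most 20 hash insertions, instead of 200 M hash insertions. The hash set in the second pass is kept because nothing in this sketch assumes the dictionary's values are unique.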
   

