Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/12/08 21:38:07 UTC

[GitHub] [arrow-rs] lquerel commented on issue #506: "Optimize" Dictionary contents in DictionaryArray / `concat_batches`

lquerel commented on issue #506:
URL: https://github.com/apache/arrow-rs/issues/506#issuecomment-989216649


   Another issue with the existing implementation is that `DictionaryKeyOverflowError` is returned in situations where it would not reasonably be expected. For example:
   * Suppose a dictionary column has the type DataType::Dictionary(Box::new(**DataType::UInt8**), Box::new(DataType::Utf8)).
   * The dictionary represents an enumeration with 10 distinct values.
   * Because dictionary columns are currently concatenated without deduplication, it is very easy to overflow the key type. In this example, concatenating 26 batches (each containing 10 rows, with every row holding a different value of the enum) returns a `DictionaryKeyOverflowError`: the concatenated dictionary holds 26 × 10 = 260 non-deduplicated entries, more than the 256 that a UInt8 key can address.
   
   This issue makes UInt8 dictionary keys unusable in any context where batches may be concatenated.
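
   The arithmetic behind this scenario can be sketched with a small std-only model. This is not the arrow-rs implementation: `DictBatch`, `KeyOverflow`, `concat_naive`, `concat_dedup`, and `enum_batch` are hypothetical names introduced purely for illustration, and the model assumes that naive concatenation appends each batch's dictionary and shifts its keys by the running dictionary length.

```rust
use std::collections::HashMap;
use std::convert::TryFrom;

/// Hypothetical stand-in for a dictionary-encoded Utf8 column with UInt8 keys:
/// `values` is the dictionary, and each entry of `keys` indexes into it.
#[derive(Debug, Clone)]
pub struct DictBatch {
    pub keys: Vec<u8>,
    pub values: Vec<String>,
}

/// Hypothetical error standing in for a dictionary key overflow.
#[derive(Debug, PartialEq)]
pub struct KeyOverflow;

/// Concatenate without deduplication: every batch's dictionary is appended
/// and its keys are shifted by the number of values already stored.
pub fn concat_naive(batches: &[DictBatch]) -> Result<DictBatch, KeyOverflow> {
    let mut keys = Vec::new();
    let mut values: Vec<String> = Vec::new();
    for batch in batches {
        let offset = values.len();
        for &k in &batch.keys {
            // Fails as soon as a shifted key no longer fits in a u8.
            let shifted = u8::try_from(offset + k as usize).map_err(|_| KeyOverflow)?;
            keys.push(shifted);
        }
        values.extend(batch.values.iter().cloned());
    }
    Ok(DictBatch { keys, values })
}

/// Concatenate with deduplication: a value gets a new key only the first
/// time it is seen, so the dictionary grows with distinct values only.
pub fn concat_dedup(batches: &[DictBatch]) -> Result<DictBatch, KeyOverflow> {
    let mut keys = Vec::new();
    let mut values: Vec<String> = Vec::new();
    let mut index: HashMap<String, u8> = HashMap::new();
    for batch in batches {
        for &k in &batch.keys {
            let value = &batch.values[k as usize];
            let key = match index.get(value) {
                Some(&existing) => existing,
                None => {
                    let fresh = u8::try_from(values.len()).map_err(|_| KeyOverflow)?;
                    values.push(value.clone());
                    index.insert(value.clone(), fresh);
                    fresh
                }
            };
            keys.push(key);
        }
    }
    Ok(DictBatch { keys, values })
}

/// One batch of 10 rows, each row holding a different value of a 10-variant enum.
pub fn enum_batch() -> DictBatch {
    DictBatch {
        keys: (0..10u8).collect(),
        values: (0..10).map(|i| format!("variant_{}", i)).collect(),
    }
}
```

   In this model, passing 26 copies of `enum_batch()` to `concat_naive` fails because the last batch needs keys 250 through 259, while `concat_dedup` succeeds on the same input with a 10-entry dictionary. With 25 batches the naive version still fits, since its largest key is 249.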

