Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/21 11:08:23 UTC

[GitHub] [arrow-rs] tustvold opened a new issue #1218: Cast Dictionary Options

tustvold opened a new issue #1218:
URL: https://github.com/apache/arrow-rs/issues/1218


   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   
   Currently, when casting an array to a DictionaryArray, the cast kernel computes a new dictionary for the target type. This dictionary has unique values but is not sorted.
   
   However, in some cases uniqueness and/or sortedness may not be a priority, e.g. because a subsequent operation is going to filter out a large number of potential matches, and computing this dictionary is therefore wasted effort.
   
   **Describe the solution you'd like**
   
   Add two new CastOptions:
   
   * `sort_dictionary` - if the result is a dictionary array, the dictionary will be sorted
   * `pack_dictionary` - if the result is a dictionary array, the dictionary will be unique
   
   This will give the cast kernel the leeway to construct a DictionaryArray by taking the provided array as the dictionary child data (values) and encoding `0..array.len()` in the keys array. This will, of course, need to fall back to computing a packed dictionary if the key size is too small to accommodate this.
   
   This will also provide an obvious way to implement (#506) as an array could be cast to itself with options to sort and/or pack the dictionary. This could be further combined with #1217 to avoid doing this computation if not necessary.
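
A minimal sketch of the behaviour described above, written against a toy model rather than the real arrow-rs types. Everything here is an assumption made for illustration: `DictCastOptions`, `ToyDictionary` and `cast_to_dictionary` are hypothetical names, keys are fixed to `u8`, and nulls are ignored. For simplicity the sketch routes any request for sorting through the packed path as well.

```rust
use std::collections::HashMap;

/// Hypothetical options struct mirroring the two proposed flags.
#[derive(Debug, Clone, Copy, Default)]
pub struct DictCastOptions {
    pub sort_dictionary: bool,
    pub pack_dictionary: bool,
}

/// Toy stand-in for a dictionary array: `keys[i]` indexes into `values`.
#[derive(Debug, PartialEq)]
pub struct ToyDictionary {
    pub keys: Vec<u8>,
    pub values: Vec<String>,
}

/// Returns `None` if even the packed dictionary does not fit in `u8` keys.
pub fn cast_to_dictionary(input: &[&str], opts: DictCastOptions) -> Option<ToyDictionary> {
    let max_entries = u8::MAX as usize + 1; // keys 0..=255

    // Fast path: nothing requested and every row index fits in the key type,
    // so reuse the input as the dictionary values and emit keys 0..len.
    if !opts.sort_dictionary && !opts.pack_dictionary && input.len() <= max_entries {
        return Some(ToyDictionary {
            keys: (0..input.len()).map(|i| i as u8).collect(),
            values: input.iter().map(|s| s.to_string()).collect(),
        });
    }

    // Slow path (also the fallback when the key type is too small):
    // compute a packed dictionary with unique values.
    let mut values: Vec<&str> = Vec::new();
    let mut seen: HashMap<&str, usize> = HashMap::new();
    let mut idx: Vec<usize> = Vec::with_capacity(input.len());
    for &s in input {
        let next = values.len();
        let i = *seen.entry(s).or_insert(next);
        if i == next {
            values.push(s); // first occurrence of this value
        }
        idx.push(i);
    }
    if values.len() > max_entries {
        return None;
    }

    if opts.sort_dictionary {
        // order[new_position] = old_position
        let mut order: Vec<usize> = (0..values.len()).collect();
        order.sort_by_key(|&i| values[i]);
        let mut remap = vec![0usize; values.len()];
        for (new, &old) in order.iter().enumerate() {
            remap[old] = new;
        }
        let sorted: Vec<&str> = order.iter().map(|&i| values[i]).collect();
        values = sorted;
        for i in idx.iter_mut() {
            *i = remap[*i];
        }
    }

    Some(ToyDictionary {
        keys: idx.into_iter().map(|i| i as u8).collect(),
        values: values.into_iter().map(|s| s.to_string()).collect(),
    })
}
```

With default options, `cast_to_dictionary(&["b", "a", "b"], DictCastOptions::default())` keeps the duplicate `"b"` in the values and does no hashing at all. Setting `pack_dictionary` on the same input yields two values instead of three.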
   
   **Additional Context**
   
   The concat kernel currently takes a similar approach, avoiding recomputing dictionaries.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] jhorstmann commented on issue #1218: Cast Dictionary Options

Posted by GitBox <gi...@apache.org>.
jhorstmann commented on issue #1218:
URL: https://github.com/apache/arrow-rs/issues/1218#issuecomment-1019448610


   Slightly related to the `make_ordered` function from this draft PR: https://github.com/apache/arrow-rs/pull/1048/files
   
   Doing this in the `cast` kernels seems a bit more general. Is the idea to call this cast directly after concatenating batches? In that case de-duplicating while concatenating might be slightly more efficient.
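
To illustrate what de-duplicating while concatenating could look like, here is a rough sketch on a toy model, not the arrow-rs concat kernel. `ToyDictionary` and `concat_dedup` are hypothetical names, keys are assumed to be valid `u32` indices, and nulls are ignored. The point it shows is that each input dictionary entry is hashed once and the rows are then remapped through a small lookup table, so no second pass over the concatenated result is needed.

```rust
use std::collections::HashMap;

/// Toy stand-in for a dictionary array: `keys[i]` indexes into `values`.
#[derive(Debug, PartialEq)]
pub struct ToyDictionary {
    pub keys: Vec<u32>,
    pub values: Vec<String>,
}

/// Concatenates `parts`, merging their dictionaries so the output values are unique.
pub fn concat_dedup(parts: &[ToyDictionary]) -> ToyDictionary {
    let mut values: Vec<String> = Vec::new();
    let mut seen: HashMap<&str, u32> = HashMap::new();
    let mut keys: Vec<u32> = Vec::new();

    for part in parts {
        // Hash each dictionary entry of this part once, building
        // remap[old_key] = key in the merged dictionary.
        let remap: Vec<u32> = part
            .values
            .iter()
            .map(|v| {
                *seen.entry(v.as_str()).or_insert_with(|| {
                    values.push(v.clone());
                    (values.len() - 1) as u32
                })
            })
            .collect();

        // Rewrite the rows with a plain table lookup, no hashing per row.
        keys.extend(part.keys.iter().map(|&k| remap[k as usize]));
    }

    ToyDictionary { keys, values }
}
```

The hashing cost in this sketch scales with the total number of dictionary entries across the inputs rather than with the number of rows.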





[GitHub] [arrow-rs] tustvold commented on issue #1218: Cast Dictionary Options

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #1218:
URL: https://github.com/apache/arrow-rs/issues/1218#issuecomment-1019516731


   I think both would be cool to have, and can probably share a lot of logic :+1:

