Posted to issues@arrow.apache.org by "Neville Dipale (Jira)" <ji...@apache.org> on 2020/05/26 13:16:00 UTC

[jira] [Resolved] (ARROW-8791) [Rust] Creating StringDictionaryBuilder with existing dictionary values

     [ https://issues.apache.org/jira/browse/ARROW-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neville Dipale resolved ARROW-8791.
-----------------------------------
    Fix Version/s: 1.0.0
       Resolution: Fixed

Issue resolved by pull request 7226
[https://github.com/apache/arrow/pull/7226]

> [Rust] Creating StringDictionaryBuilder with existing dictionary values
> -----------------------------------------------------------------------
>
>                 Key: ARROW-8791
>                 URL: https://issues.apache.org/jira/browse/ARROW-8791
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Rust
>            Reporter: Jörn Horstmann
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.0.0
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> It might be useful to create a DictionaryArray that uses the same dictionary values as another array, so that equal keys refer to equal strings in both. One use case would be more efficient comparison between arrays when it is known that they use the same dictionary. Another could be more efficient grouping operations across multiple chunks (i.e. a `Vec<DictionaryArray>`).
>  
> A possible implementation could look like this:
>  
> {code:java}
> impl<K> StringDictionaryBuilder<K>
> where
>     K: ArrowDictionaryKeyType,
> {
>     pub fn new_with_dictionary(
>         keys_builder: PrimitiveBuilder<K>,
>         dictionary_values: &StringArray,
>     ) -> Result<Self> {
>         let mut values_builder = StringBuilder::with_capacity(
>             dictionary_values.len(),
>             dictionary_values.value_data().len(),
>         );
>         let mut map: HashMap<Box<[u8]>, K::Native> = HashMap::new();
>         for i in 0..dictionary_values.len() {
>             if dictionary_values.is_valid(i) {
>                 let value = dictionary_values.value(i);
>                 map.insert(
>                     value.as_bytes().into(),
>                     K::Native::from_usize(i)
>                         .ok_or(ArrowError::DictionaryKeyOverflowError)?,
>                 );
>                 values_builder.append_value(value)?;
>             } else {
>                 values_builder.append_null()?;
>             }
>         }
>         Ok(Self {
>             keys_builder,
>             values_builder,
>             map,
>         })
>     }
> }
> {code}
> What I don't like here is that the map has to be reconstructed. There may be a more efficient way, such as passing in the HashMap directly, but exposing the `Box<[u8]>` encoding of its keys is probably not a good idea.
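
For readers without the Arrow crate at hand, the idea in the description can be illustrated with a minimal, std-only sketch. Everything named here (`SketchDictBuilder`, `KeyOverflow`, `key_from_index`, the fixed `u8` key type) is hypothetical and is not the Arrow API or the code merged in the pull request. It only shows why seeding a builder with an existing dictionary makes keys comparable across arrays: each seeded value keeps its original slot index as its key, and null slots keep their position so later indices still line up.

```rust
use std::collections::HashMap;

/// Hypothetical error type standing in for a dictionary key overflow.
#[derive(Debug, PartialEq)]
struct KeyOverflow;

/// Hypothetical std-only stand-in for a string dictionary builder with u8 keys.
#[derive(Debug)]
struct SketchDictBuilder {
    /// One key per appended row.
    keys: Vec<Option<u8>>,
    /// Dictionary values; a `None` slot mirrors a null in the seed dictionary.
    values: Vec<Option<String>>,
    /// Lookup from value bytes to the key (slot index) of that value.
    map: HashMap<Box<[u8]>, u8>,
}

/// Convert a slot index to a key, failing if it does not fit in a u8.
fn key_from_index(i: usize) -> Result<u8, KeyOverflow> {
    if i > u8::MAX as usize {
        Err(KeyOverflow)
    } else {
        Ok(i as u8)
    }
}

impl SketchDictBuilder {
    /// Seed the builder with an existing dictionary, preserving slot order.
    fn new_with_dictionary(
        dictionary_values: &[Option<&str>],
    ) -> Result<Self, KeyOverflow> {
        let mut values = Vec::with_capacity(dictionary_values.len());
        let mut map: HashMap<Box<[u8]>, u8> =
            HashMap::with_capacity(dictionary_values.len());
        for (i, slot) in dictionary_values.iter().enumerate() {
            match slot {
                Some(value) => {
                    // The key of a seeded value is its index in the seed dictionary.
                    map.insert(value.as_bytes().into(), key_from_index(i)?);
                    values.push(Some(value.to_string()));
                }
                // Keep the null slot so that later indices are unchanged.
                None => values.push(None),
            }
        }
        Ok(Self { keys: Vec::new(), values, map })
    }

    /// Append one row, reusing an existing key when the value is already known.
    fn append(&mut self, value: &str) -> Result<u8, KeyOverflow> {
        if let Some(&key) = self.map.get(value.as_bytes()) {
            self.keys.push(Some(key));
            return Ok(key);
        }
        // Unknown value: it gets the next free slot after the seeded ones.
        let key = key_from_index(self.values.len())?;
        self.values.push(Some(value.to_string()));
        self.map.insert(value.as_bytes().into(), key);
        self.keys.push(Some(key));
        Ok(key)
    }
}
```

Two builders seeded from the same slice, for example `[Some("a"), None, Some("b")]`, both return key 2 when `"b"` is appended, so rows can be compared by key alone. That only holds while neither builder has appended a value missing from the seed, because new values are assigned keys independently in each builder. The sketch rebuilds the map on construction, which is the same cost the description is concerned about.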



--
This message was sent by Atlassian Jira
(v8.3.4#803005)