Posted to dev@arrow.apache.org by "Jörn Horstmann (Jira)" <ji...@apache.org> on 2020/05/13 18:26:00 UTC

[jira] [Created] (ARROW-8791) [RUST] Creating StringDictionaryBuilder with existing dictionary values

Jörn Horstmann created ARROW-8791:
-------------------------------------

             Summary: [RUST] Creating StringDictionaryBuilder with existing dictionary values
                 Key: ARROW-8791
                 URL: https://issues.apache.org/jira/browse/ARROW-8791
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Rust
            Reporter: Jörn Horstmann


It might be useful to create a DictionaryArray that uses the same dictionary values as another array, so that equal strings map to equal keys in both. One use case would be more efficient comparison between arrays when it is known that they use the same dictionary. Another would be more efficient grouping operations across multiple chunks (i.e. a `Vec<DictionaryArray>`).
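
To illustrate the comparison use case, here is a minimal sketch that uses only the standard library. `DictArray` and `eq_elementwise` are hypothetical stand-ins, not types or functions from the arrow crate, and the sketch assumes that the values within one dictionary are distinct. When both arrays share the same dictionary, comparing the integer keys is enough; otherwise the strings have to be looked up and compared.

```rust
use std::sync::Arc;

/// Simplified stand-in for a dictionary-encoded string array
/// (hypothetical, not the arrow crate's DictionaryArray).
/// Assumption: the strings in `values` are distinct.
struct DictArray {
    keys: Vec<Option<u32>>,
    values: Arc<Vec<String>>,
}

impl DictArray {
    /// Decode slot `i`, returning None for a null slot.
    fn get(&self, i: usize) -> Option<&str> {
        self.keys[i].map(|k| self.values[k as usize].as_str())
    }
}

/// Element-wise equality; two null slots count as equal here.
fn eq_elementwise(a: &DictArray, b: &DictArray) -> Vec<bool> {
    assert_eq!(a.keys.len(), b.keys.len());
    if Arc::ptr_eq(&a.values, &b.values) {
        // Fast path: same dictionary, so equal keys mean equal strings.
        a.keys.iter().zip(&b.keys).map(|(x, y)| x == y).collect()
    } else {
        // Slow path: different dictionaries, compare the decoded strings.
        (0..a.keys.len()).map(|i| a.get(i) == b.get(i)).collect()
    }
}
```

The fast path never touches string data, which is where the saving would come from.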

 

A possible implementation could look like this:

 
{code:java}
impl<K> StringDictionaryBuilder<K>
where
    K: ArrowDictionaryKeyType,
{
    /// Creates a builder whose dictionary is pre-populated with
    /// `dictionary_values`, so that the value at index `i` keeps key `i`.
    pub fn new_with_dictionary(
        keys_builder: PrimitiveBuilder<K>,
        dictionary_values: &StringArray,
    ) -> Result<Self> {
        let mut values_builder = StringBuilder::with_capacity(
            dictionary_values.len(),
            dictionary_values.value_data().len(),
        );
        // Rebuild the lookup map from value bytes to dictionary key.
        let mut map: HashMap<Box<[u8]>, K::Native> = HashMap::new();
        for i in 0..dictionary_values.len() {
            if dictionary_values.is_valid(i) {
                let value = dictionary_values.value(i);
                map.insert(
                    value.as_bytes().into(),
                    // Fails if the index does not fit into the key type.
                    K::Native::from_usize(i)
                        .ok_or(ArrowError::DictionaryKeyOverflowError)?,
                );
                values_builder.append_value(value);
            } else {
                // Keep null entries so that later indices stay aligned.
                values_builder.append_null();
            }
        }
        Ok(Self {
            keys_builder,
            values_builder,
            map,
        })
    }
}{code}
What I don't really like here is that the map has to be reconstructed. There may be a more efficient way, such as passing in the HashMap directly, but it is probably not a good idea to expose the `Box<[u8]>` encoding of its keys.
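
As a rough illustration of the intended behaviour, here is a simplified model built only on the standard library. `SimpleDictBuilder` and its `append` method are hypothetical and much simpler than the real `StringDictionaryBuilder` (no null handling, fixed `u32` keys). The point is that values from the existing dictionary keep their original keys, and only unseen values are assigned new ones.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified model of a string dictionary builder.
struct SimpleDictBuilder {
    keys: Vec<u32>,
    values: Vec<String>,
    map: HashMap<String, u32>,
}

impl SimpleDictBuilder {
    /// Seed the builder so that `existing[i]` keeps key `i`.
    fn new_with_dictionary(existing: &[&str]) -> Self {
        let mut values = Vec::with_capacity(existing.len());
        let mut map = HashMap::with_capacity(existing.len());
        for (i, v) in existing.iter().enumerate() {
            map.insert(v.to_string(), i as u32);
            values.push(v.to_string());
        }
        Self { keys: Vec::new(), values, map }
    }

    /// Append a value and return the key that was used for it.
    fn append(&mut self, value: &str) -> u32 {
        let key = match self.map.get(value).copied() {
            // Known value: reuse its existing key.
            Some(k) => k,
            // New value: extend the dictionary with the next free key.
            None => {
                let k = self.values.len() as u32;
                self.values.push(value.to_string());
                self.map.insert(value.to_string(), k);
                k
            }
        };
        self.keys.push(key);
        key
    }
}
```

Two chunks built this way from the same seed values agree on the keys of all seeded values, which is what the comparison and grouping use cases rely on.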


