Posted to dev@arrow.apache.org by Andy Thomason <an...@atomicincrement.com> on 2019/11/28 20:41:39 UTC

[Discuss][Rust] Support for Dictionary Array types

I am noodling with the Dictionary implementation and would like to get the data design approved and invite edits. Forgive my unfamiliarity with this mailing list.

Given your current design, it would seem best to add a `Dictionary` variant to `DataType`, with two sub-types: one for the keys and one for the values.
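
As a rough sketch of what I mean (the enum below is a trimmed-down, hypothetical stand-in for the real `DataType`, and `dictionary_of` and `is_valid_dictionary_key` are names I made up for illustration):

```rust
/// Hypothetical, trimmed-down stand-in for Arrow's `DataType` enum.
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Int8,
    Int32,
    Utf8,
    /// Dictionary(key type, value type)
    Dictionary(Box<DataType>, Box<DataType>),
}

impl DataType {
    /// Assumption for this sketch: only integer types may be dictionary keys.
    pub fn is_valid_dictionary_key(&self) -> bool {
        matches!(self, DataType::Int8 | DataType::Int32)
    }
}

/// Build a dictionary type, returning `None` for a non-integer key type.
pub fn dictionary_of(key: DataType, value: DataType) -> Option<DataType> {
    if key.is_valid_dictionary_key() {
        Some(DataType::Dictionary(Box::new(key), Box::new(value)))
    } else {
        None
    }
}
```

Boxing the two sub-types keeps the enum a fixed size even though it is recursive; whether the real type should validate the key type at construction is open for discussion.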

An array type like this may be sufficient for a reference implementation.

```
/// A dictionary where integer keys index an array in the `DictionaryBatch`
pub struct DictionaryArray {
    keys: ArrayRef,
    values: Vec<ArrayDataRef>,
}
```

Note that in the `RecordBatch`, the keys are owned by the `RecordBatch` and the values they index are owned by one or more `DictionaryBatch`. The multiple entries for values allow for delta DictionaryBatches.
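
To illustrate how a key could be resolved when the values are split over an initial batch plus deltas, here is a minimal sketch. It uses plain `Vec<&str>` chunks rather than `ArrayDataRef`, assumes that delta values are logically appended to the earlier ones so that keys index the concatenation, and `lookup_in_chunks` is a hypothetical helper name:

```rust
/// Resolve `key` against dictionary values stored as several chunks
/// (the first dictionary batch followed by any delta batches).
/// Keys index the logical concatenation of all chunks.
pub fn lookup_in_chunks<'a>(chunks: &'a [Vec<&'a str>], key: usize) -> Option<&'a str> {
    let mut remaining = key;
    for chunk in chunks {
        if remaining < chunk.len() {
            return Some(chunk[remaining]);
        }
        // Skip this chunk and keep searching in the next one.
        remaining -= chunk.len();
    }
    // The key is past the end of every chunk.
    None
}
```

The lookup cost here grows with the number of deltas, so a real implementation might prefer to concatenate the chunks once a delta arrives.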

The most conceptually similar existing array is the List type, except that the index can be something other than i32 and each lookup yields a single row.

In practice, there will usually be only one dictionary batch shared amongst all the record batches, so we would just get a pair of slices and use one to index the other.
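
For the single-dictionary case, the "pair of slices" idea could look roughly like this. It is only a sketch: `decode` is a hypothetical name, the keys are fixed to `i32`, and the values are plain string slices instead of Arrow buffers:

```rust
/// Decode dictionary keys by using them to index the values slice.
/// Returns `None` if any key is negative or out of range.
pub fn decode<'a>(keys: &[i32], values: &[&'a str]) -> Option<Vec<&'a str>> {
    keys.iter()
        .map(|&k| {
            if k < 0 {
                None
            } else {
                values.get(k as usize).copied()
            }
        })
        // Collecting into `Option<Vec<_>>` stops at the first invalid key.
        .collect()
}
```

For example, `decode(&[0, 1, 0], &["x", "y"])` gives `Some(vec!["x", "y", "x"])`, while a key of 2 against the same values gives `None`.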

Another common use case is reducing the size of string arrays that contain only a limited alphabet of distinct strings; some acceleration for this case would be welcome.
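
A minimal sketch of that string case, assuming `i32` keys and in-memory `&str` values (the function name `dictionary_encode` is hypothetical and not part of the existing crate):

```rust
use std::collections::HashMap;

/// Dictionary-encode a slice of strings.
/// Returns (keys, values): `values` holds each distinct string once, in order
/// of first appearance, and `keys[i]` is the index in `values` of `input[i]`.
pub fn dictionary_encode<'a>(input: &[&'a str]) -> (Vec<i32>, Vec<&'a str>) {
    let mut index: HashMap<&'a str, i32> = HashMap::new();
    let mut values: Vec<&'a str> = Vec::new();
    let mut keys = Vec::with_capacity(input.len());
    for &s in input {
        // Key a new string would receive if it has not been seen before.
        let next = values.len() as i32;
        let key = *index.entry(s).or_insert_with(|| {
            values.push(s);
            next
        });
        keys.push(key);
    }
    (keys, values)
}
```

With the input `["red", "green", "red", "blue", "green"]` this produces keys `[0, 1, 0, 2, 1]` and values `["red", "green", "blue"]`, so each repeated string costs one small integer instead of a full copy.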