Posted to issues@arrow.apache.org by "Dimitri Vorona (JIRA)" <ji...@apache.org> on 2018/02/19 16:22:00 UTC

[jira] [Updated] (ARROW-2176) [C++] Extend DictionaryBuilder to support delta dictionaries

     [ https://issues.apache.org/jira/browse/ARROW-2176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dimitri Vorona updated ARROW-2176:
----------------------------------
    External issue URL: https://github.com/apache/arrow/pull/1629

> [C++] Extend DictionaryBuilder to support delta dictionaries
> ------------------------------------------------------------
>
>                 Key: ARROW-2176
>                 URL: https://issues.apache.org/jira/browse/ARROW-2176
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Dimitri Vorona
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.9.0
>
>
> [The IPC format|https://arrow.apache.org/docs/ipc.html] allows sending additional dictionary batches with a previously seen id and an isDelta flag to extend existing dictionaries with new entries. Right now, the DictionaryBuilder (as well as the IPC writer and reader) does not support generating delta dictionaries.
> This pull request contains a basic implementation of the DictionaryBuilder with delta dictionary support. The API usage can be seen in the dictionary tests (e.g. [here|https://github.com/alendit/arrow/blob/delta_dictionary_builder/cpp/src/arrow/array-test.cc#L1773]). The basic idea is that the user simply reuses the builder object after calling Finish(Array*) for the first time. Subsequent calls to Append create new entries only for unseen elements and reuse the ids from previous dictionaries for elements already seen.
> Some considerations:
>  # The API is fairly implicit. An additional flag for Finish that explicitly requests delta dictionary generation might help avoid errors.
>  # Right now the implementation uses an additional "overflow dictionary" to store the seen items. This adds a copy on each Finish call and an additional lookup on each GetItem or Append call. I assume we might get away with returning Array slices from Finish, which would remove the need for the overflow dictionary. If the gist of the PR is approved, I can look into further optimizations.
> The Writer and Reader extensions would be pretty simple, since the DictionaryBuilder API remains basically the same. 
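
The reuse-the-builder idea described above can be sketched without Arrow as follows. This is only a minimal illustration under stated assumptions, not the code from the pull request: ToyBatch, ToyDeltaDictionaryBuilder and ToyExample are hypothetical names, strings stand in for arbitrary value types, and the sketch keeps one hash map of all seen values rather than the "overflow dictionary" mentioned in the second consideration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical toy types for illustration only; none of this is Arrow API.
struct ToyBatch {
  std::vector<std::string> dictionary;  // entries that are new since the last Finish()
  std::vector<int32_t> indices;         // ids into the accumulated (base + deltas) dictionary
  bool is_delta;                        // false for the first batch, true for later ones
};

class ToyDeltaDictionaryBuilder {
 public:
  // Appends one value: seen values reuse their id, unseen values get the next id.
  void Append(const std::string& value) {
    auto it = index_of_.find(value);
    if (it == index_of_.end()) {
      const int32_t id = static_cast<int32_t>(index_of_.size());
      it = index_of_.emplace(value, id).first;
      pending_entries_.push_back(value);
    }
    pending_indices_.push_back(it->second);
  }

  // Returns the indices appended since the last Finish() together with only
  // the dictionary entries first seen in that interval. The map of seen
  // values is kept, so the builder can simply be reused afterwards.
  ToyBatch Finish() {
    ToyBatch out{std::move(pending_entries_), std::move(pending_indices_),
                 finished_once_};
    pending_entries_.clear();
    pending_indices_.clear();
    finished_once_ = true;
    return out;
  }

 private:
  std::unordered_map<std::string, int32_t> index_of_;  // all values seen so far
  std::vector<std::string> pending_entries_;
  std::vector<int32_t> pending_indices_;
  bool finished_once_ = false;
};

// Usage: the second Finish() yields a delta that contains only "c".
inline void ToyExample() {
  ToyDeltaDictionaryBuilder builder;
  builder.Append("a");
  builder.Append("b");
  builder.Append("a");
  ToyBatch first = builder.Finish();
  assert(first.dictionary.size() == 2 && !first.is_delta);

  builder.Append("b");  // already seen: reuses id 1
  builder.Append("c");  // unseen: gets id 2 and goes into the delta
  ToyBatch second = builder.Finish();
  assert(second.dictionary.size() == 1 && second.is_delta);
  assert(second.indices[0] == 1 && second.indices[1] == 2);
}
```

In this sketch a reader would append each delta's entries to the dictionary it already holds for that id, so the indices of later batches stay valid against the accumulated dictionary.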



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)