Posted to issues@arrow.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/02/19 16:22:00 UTC

[jira] [Commented] (ARROW-2176) [C++] Extend DictionaryBuilder to support delta dictionaries

    [ https://issues.apache.org/jira/browse/ARROW-2176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16369282#comment-16369282 ] 

ASF GitHub Bot commented on ARROW-2176:
---------------------------------------

alendit opened a new pull request #1629: ARROW-2176: [C++] Extend DictionaryBuilder to support delta dictionaries
URL: https://github.com/apache/arrow/pull/1629
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> [C++] Extend DictionaryBuilder to support delta dictionaries
> ------------------------------------------------------------
>
>                 Key: ARROW-2176
>                 URL: https://issues.apache.org/jira/browse/ARROW-2176
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Dimitri Vorona
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.9.0
>
>
> [The IPC format|https://arrow.apache.org/docs/ipc.html] allows sending additional dictionary batches with a previously seen id and an isDelta flag to extend existing dictionaries with new entries. Right now, the DictionaryBuilder (as well as the IPC writer and reader) does not support generating delta dictionaries.
> This pull request contains a basic implementation of the DictionaryBuilder with delta dictionary support. The API usage can be seen in the dictionary tests (e.g. [here|https://github.com/alendit/arrow/blob/delta_dictionary_builder/cpp/src/arrow/array-test.cc#L1773]). The basic idea is that the user simply reuses the builder object after calling Finish(Array*) for the first time. Subsequent calls to Append create new entries only for unseen elements and reuse the ids from previous dictionaries for elements already seen.
> Some considerations:
>  # The API is pretty implicit. An additional flag for Finish that explicitly indicates the intent to use the builder for delta dictionary generation might be expedient for avoiding errors.
>  # Right now the implementation uses an additional "overflow dictionary" to store the seen items. This adds a copy on each Finish call and an additional lookup on each GetItem or Append call. I assume we might get away with returning Array slices from Finish, which would remove the need for the overflow dictionary. If the gist of the PR is approved, I can look into further optimizations.
> The Writer and Reader extensions would be pretty simple, since the DictionaryBuilder API remains basically the same. 
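
To illustrate the builder-reuse idea described above, here is a minimal, self-contained sketch. It is not the code from the pull request and does not use any Arrow types: ToyDeltaDictionaryBuilder and DeltaBatch are hypothetical names made up for this example, and plain std::string values stand in for Arrow arrays. It only shows the intended semantics, namely that each Finish() returns just the entries first seen since the previous Finish(), while indices keep referring to the accumulated dictionary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical result type for this sketch (not an Arrow type).
struct DeltaBatch {
  // Entries first seen since the previous Finish() call.
  std::vector<std::string> delta_dictionary;
  // One index per appended value, pointing into the accumulated dictionary
  // (all deltas concatenated in the order they were emitted).
  std::vector<std::int32_t> indices;
};

// Hypothetical toy builder; it only mimics the reuse-after-Finish behavior.
class ToyDeltaDictionaryBuilder {
 public:
  void Append(const std::string& value) {
    auto it = ids_.find(value);
    if (it == ids_.end()) {
      // Unseen value: assign the next free id and remember it for the
      // next delta dictionary.
      assert(ids_.size() <
             static_cast<std::size_t>(std::numeric_limits<std::int32_t>::max()));
      const auto id = static_cast<std::int32_t>(ids_.size());
      it = ids_.emplace(value, id).first;
      pending_entries_.push_back(value);
    }
    // Seen values reuse the id assigned in an earlier batch.
    pending_indices_.push_back(it->second);
  }

  // Returns the pending delta and indices, then resets the pending state.
  // The value-to-id map is kept, so the builder can be reused.
  DeltaBatch Finish() {
    DeltaBatch out{std::move(pending_entries_), std::move(pending_indices_)};
    pending_entries_.clear();
    pending_indices_.clear();
    return out;
  }

 private:
  std::unordered_map<std::string, std::int32_t> ids_;
  std::vector<std::string> pending_entries_;
  std::vector<std::int32_t> pending_indices_;
};
```

In this sketch, appending "a", "b", "a" and calling Finish() yields the dictionary ["a", "b"] with indices [0, 1, 0]. Appending "b", "c" to the same builder and calling Finish() again yields only the delta ["c"] with indices [1, 2]. A reader would append each delta to the dictionary it already holds for that id.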



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)