Posted to dev@arrow.apache.org by "Wes McKinney (JIRA)" <ji...@apache.org> on 2016/08/12 03:21:20 UTC

[jira] [Commented] (ARROW-255) Finalize Dictionary representation

    [ https://issues.apache.org/jira/browse/ARROW-255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15418301#comment-15418301 ] 

Wes McKinney commented on ARROW-255:
------------------------------------

This makes sense, since any level of a nested type subtree could, hypothetically, be dictionary encoded.

Are there many benefits to using unsigned integers for the dictionary indices (the values that reference elements in the dictionary)? If unsigned types make things more difficult for JVM users, then regular int32 seems acceptable (this is similar to what we are already doing for variable-length collection offsets).
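
As a rough illustration of the indices/dictionary split under discussion, here is a minimal sketch in plain Python. It is not Arrow's API: the names dictionary_encode, dictionary_decode, and INT32_MAX are hypothetical, nulls are not handled, and the only point is to show signed int32-range indices referencing positions in a dictionary of distinct values.

```python
from typing import Hashable, List, Sequence, Tuple

# Largest index representable as a signed int32 (assumption: indices are int32).
INT32_MAX = 2**31 - 1


def dictionary_encode(values: Sequence[Hashable]) -> Tuple[List[int], List[Hashable]]:
    """Return (indices, dictionary) with dictionary[indices[i]] == values[i]."""
    dictionary: List[Hashable] = []   # distinct values, in first-seen order
    positions = {}                    # value -> its position in `dictionary`
    indices: List[int] = []
    for value in values:
        if value not in positions:
            # The new entry would get index len(dictionary); it must fit in int32.
            if len(dictionary) > INT32_MAX:
                raise OverflowError("dictionary too large for int32 indices")
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return indices, dictionary


def dictionary_decode(indices: Sequence[int], dictionary: Sequence[Hashable]) -> List[Hashable]:
    """Rebuild the original values by looking each index up in the dictionary."""
    return [dictionary[i] for i in indices]


indices, dictionary = dictionary_encode(["a", "b", "a", "c", "b"])
print(indices)     # [0, 1, 0, 2, 1]
print(dictionary)  # ['a', 'b', 'c']
```

With signed int32 indices the dictionary can still hold up to 2^31 distinct entries; an unsigned type would only double that limit.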

> Finalize Dictionary representation
> ----------------------------------
>
>                 Key: ARROW-255
>                 URL: https://issues.apache.org/jira/browse/ARROW-255
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Format
>            Reporter: Julien Le Dem
>
> format/Message.fbs mentions DictionaryBatches with an id but does not specify where they are referenced.
> We should add a {{dictionary: long}} in Field that references the dictionary id:
> Field: https://github.com/apache/arrow/blob/34e7f48cb71428c4d78cf00d8fdf0045532d6607/format/Message.fbs#L86
> Dictionary id: https://github.com/apache/arrow/blob/34e7f48cb71428c4d78cf00d8fdf0045532d6607/format/Message.fbs#L165
> We need a spec in format/Layout.md that describes the dictionary layout.
> When dictionary encoded, the value vector is an array of unsigned int32.
> The dictionary vector is a vector of the value type, with its entries indexed by their id in the dictionary.


