Posted to dev@arrow.apache.org by "Wes McKinney (JIRA)" <ji...@apache.org> on 2016/08/09 23:32:20 UTC

[jira] [Commented] (ARROW-81) C++: Add a Category nested type

    [ https://issues.apache.org/jira/browse/ARROW-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15414445#comment-15414445 ] 

Wes McKinney commented on ARROW-81:
-----------------------------------

A couple more notes on this:

While creating the Feather format, which utilizes Arrow for much of its memory layout, we (Hadley Wickham and I) ran into this limitation in the current draft of Arrow metadata (https://github.com/apache/arrow/blob/master/format/Message.fbs).

It would be great to address this need so that we can make progress toward canonical metadata.

This data type also has the benefit of reducing memory usage for arrays (e.g. with string logical type) containing many duplicate values. 
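
To make the memory point concrete, here is a minimal sketch of dictionary-encoding a string column and comparing rough payload sizes. The names (EncodedStrings, DictionaryEncode, PlainBytes, EncodedBytes) are hypothetical and are not part of Arrow or Feather. The byte counts cover only character data and 32-bit codes; they ignore offsets, validity bitmaps, and allocator overhead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical container for illustration; not an Arrow type.
struct EncodedStrings {
  std::vector<int32_t> codes;           // one small integer per value
  std::vector<std::string> dictionary;  // each distinct string stored once
};

// Replace each string by the index of its first occurrence in the dictionary.
EncodedStrings DictionaryEncode(const std::vector<std::string>& values) {
  EncodedStrings out;
  std::unordered_map<std::string, int32_t> index;
  out.codes.reserve(values.size());
  for (const std::string& v : values) {
    auto it = index.find(v);
    if (it == index.end()) {
      // Assumption of this sketch: codes must fit in a signed 32-bit integer.
      assert(out.dictionary.size() < static_cast<std::size_t>(INT32_MAX));
      it = index.emplace(v, static_cast<int32_t>(out.dictionary.size())).first;
      out.dictionary.push_back(v);
    }
    out.codes.push_back(it->second);
  }
  return out;
}

// Character bytes only; a deliberately rough estimate.
std::size_t PlainBytes(const std::vector<std::string>& values) {
  std::size_t n = 0;
  for (const std::string& v : values) n += v.size();
  return n;
}

// Bytes for the codes plus the character bytes of the dictionary.
std::size_t EncodedBytes(const EncodedStrings& e) {
  return e.codes.size() * sizeof(int32_t) + PlainBytes(e.dictionary);
}
```

Under these assumptions, 1000 copies of the 10-character string "California" take 10000 character bytes in plain form, versus 4000 bytes of codes plus 10 bytes of dictionary once encoded. The saving depends on the values being longer than the code width and heavily repeated; a column of short, mostly unique strings can come out larger.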

> C++: Add a Category nested type
> -------------------------------
>
>                 Key: ARROW-81
>                 URL: https://issues.apache.org/jira/browse/ARROW-81
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Wes McKinney
>            Assignee: Wes McKinney
>
> A Category (or "factor") is a dictionary-encoded array whose dictionary has semantic meaning. The data consists of
> - An array of integer "codes"
> - A child array of some other type, known as the "categories" or "levels" of the array. Typically there is an "ordered" boolean flag indicating whether the order of the categories is meaningful.
> Category/factor types are used in a number of common statistical analyses. See, for example, http://www.voteview.com/R_Ordered_Logistic_or_Probit_Regression.htm. It is a basic requirement for Python and R, at least, as Arrow C++ consumers, to have this type. Separately, we should consider what is necessary to be able to transmit category data in IPCs -- possibly an expansion of the Arrow format.
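
As a rough illustration of the structure described in the issue (integer codes, a child array of levels, and an "ordered" flag), the following sketch uses a hypothetical StringCategory struct with ValueAt and LessThan helpers. None of these names come from the Arrow C++ codebase, and a real implementation would use Arrow arrays rather than std::vector and would need to handle nulls.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical illustration only; not the Arrow C++ API.
struct StringCategory {
  std::vector<int32_t> codes;       // one entry per value, indexing into levels
  std::vector<std::string> levels;  // the "categories" / "levels" child array
  bool ordered = false;             // true if the order of levels is meaningful
};

// Look up the level that row i refers to.
const std::string& ValueAt(const StringCategory& c, std::size_t i) {
  assert(i < c.codes.size());
  const int32_t code = c.codes[i];
  assert(code >= 0 && static_cast<std::size_t>(code) < c.levels.size());
  return c.levels[static_cast<std::size_t>(code)];
}

// For an ordered category, rows compare by level position, not by string value.
bool LessThan(const StringCategory& c, std::size_t i, std::size_t j) {
  assert(c.ordered);
  assert(i < c.codes.size() && j < c.codes.size());
  return c.codes[i] < c.codes[j];
}
```

With levels {"low", "medium", "high"} and ordered set to true, a row coded 0 ("low") compares less than a row coded 2 ("high"), even though "high" sorts before "low" as a plain string. This is the sense in which the dictionary carries semantic meaning.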


