Posted to jira@arrow.apache.org by "Neal Richardson (Jira)" <ji...@apache.org> on 2021/09/03 16:08:00 UTC

[jira] [Updated] (ARROW-13573) [C++] Support dictionaries directly in case_when kernel

     [ https://issues.apache.org/jira/browse/ARROW-13573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neal Richardson updated ARROW-13573:
------------------------------------
    Labels: kernel pull-request-available types  (was: kernel pull-request-available)

> [C++] Support dictionaries directly in case_when kernel
> -------------------------------------------------------
>
>                 Key: ARROW-13573
>                 URL: https://issues.apache.org/jira/browse/ARROW-13573
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: David Li
>            Assignee: David Li
>            Priority: Major
>              Labels: kernel, pull-request-available, types
>             Fix For: 6.0.0
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> case_when (and other similar kernels) currently dictionary-decode inputs, then operate on the decoded values. This is both inefficient and unexpected. We should instead operate directly on dictionary indices.
> Of course, this introduces more edge cases. If the dictionaries of inputs do not match, we have the following choices:
>  # Raise an error.
>  # Unify the dictionaries.
>  # Use one of the dictionaries, and raise an error if an index of another dictionary cannot be mapped to an index of the chosen dictionary.
>  # Use one of the dictionaries, and emit null if an index of another dictionary cannot be mapped to an index of the chosen dictionary. (This is what base dplyr if_else does with factors.)
> All of these options are reasonable, so we should introduce an options struct to let the caller choose. We can implement #3 and #4 first (to cover R). #2 isn't strictly necessary, as the user can unify the dictionaries manually beforehand, but unifying inside the kernel may be more efficient. Similarly, #1 isn't strictly necessary.
> #3 and #4 are justifiable (beyond just "it's what R does") since users may filter down disjoint dictionaries into a set of common values and then expect to combine the remaining values with a kernel like case_when.
> This was originally discussed on [GitHub|https://github.com/apache/arrow/pull/10724#discussion_r682671015].
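
To make options #3 and #4 concrete, here is a minimal sketch of remapping one input's dictionary indices onto a chosen dictionary. It uses only the C++ standard library and is not Arrow code: `DictColumn`, `UnmappedIndex`, and `RemapIndices` are hypothetical names invented for illustration, and a real kernel would work on Arrow arrays and buffers instead. The sketch assumes string dictionary values and builds a transposition table with one entry per dictionary value, so no row is ever decoded.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical policy for indices that have no counterpart in the chosen
// dictionary: kError corresponds to option #3, kEmitNull to option #4.
enum class UnmappedIndex { kError, kEmitNull };

// A dictionary-encoded column modeled with standard containers (an
// assumption of this sketch). std::nullopt stands for a null slot.
struct DictColumn {
  std::vector<std::string> dictionary;
  std::vector<std::optional<int32_t>> indices;
};

// Re-express the indices of `source` in terms of `target_dictionary`.
std::vector<std::optional<int32_t>> RemapIndices(
    const DictColumn& source,
    const std::vector<std::string>& target_dictionary,
    UnmappedIndex policy) {
  // Look up each target value's position once.
  std::unordered_map<std::string, int32_t> target_pos;
  for (std::size_t i = 0; i < target_dictionary.size(); ++i) {
    target_pos.emplace(target_dictionary[i], static_cast<int32_t>(i));
  }

  // Transposition table: source index -> target index, or nullopt if the
  // source value is absent from the target dictionary.
  std::vector<std::optional<int32_t>> transpose(source.dictionary.size());
  for (std::size_t i = 0; i < source.dictionary.size(); ++i) {
    auto it = target_pos.find(source.dictionary[i]);
    if (it != target_pos.end()) transpose[i] = it->second;
  }

  std::vector<std::optional<int32_t>> out;
  out.reserve(source.indices.size());
  for (const auto& idx : source.indices) {
    if (!idx) {  // Nulls pass through unchanged.
      out.push_back(std::nullopt);
      continue;
    }
    const auto pos = static_cast<std::size_t>(*idx);
    assert(*idx >= 0 && pos < source.dictionary.size());
    const auto& mapped = transpose[pos];
    if (!mapped && policy == UnmappedIndex::kError) {
      throw std::invalid_argument("value '" + source.dictionary[pos] +
                                  "' is not in the chosen dictionary");
    }
    out.push_back(mapped);  // Stays nullopt under kEmitNull.
  }
  return out;
}
```

In this sketch the error is raised only when an unmappable index is actually referenced by a row. Dictionary entries that no row uses are ignored, which fits the use case of filtering disjoint dictionaries down to a common set of values before combining them.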


