Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2021/02/22 19:30:00 UTC

[jira] [Created] (ARROW-11732) [C++] DictionaryEncode should convert dictionaries from one type of encoding to the other

Weston Pace created ARROW-11732:
-----------------------------------

             Summary: [C++] DictionaryEncode should convert dictionaries from one type of encoding to the other
                 Key: ARROW-11732
                 URL: https://issues.apache.org/jira/browse/ARROW-11732
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Weston Pace


There are two styles of encoding nulls in dictionaries: masked (nulls are marked in the validity bitmap of the indices) or encoded (null is an entry in the dictionary itself).  In compute::DictionaryEncode this is controlled by an option.  Today, if you pass a dictionary into DictionaryEncode it is a no-op.

Instead, it should check whether the dictionary is already encoded according to the requested null encoding scheme (this is easily checked in constant time) and, if not, convert it.
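
To make the check-then-convert idea concrete, here is a minimal sketch using only the C++ standard library.  It does not use Arrow's types or API: ToyDictionaryArray, ToyNullEncoding, MatchesEncoding, ToEncodedNulls, ToMaskedNulls and ConvertNulls are all hypothetical names invented for illustration, and the sketch assumes null counts are cached so that the check is constant time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for a dictionary array. NOT Arrow's DictionaryArray.
struct ToyDictionaryArray {
  std::vector<std::optional<std::string>> dictionary;  // nullopt = null dictionary entry
  std::vector<std::optional<std::int32_t>> indices;    // nullopt = masked (null) index
  std::size_t dictionary_null_count = 0;  // assumed to be cached
  std::size_t indices_null_count = 0;     // assumed to be cached
};

// Hypothetical mirror of the two null encoding styles.
enum class ToyNullEncoding { kMask, kEncode };

// Constant-time check: masked style needs a null-free dictionary,
// encoded style needs null-free indices.
bool MatchesEncoding(const ToyDictionaryArray& arr, ToyNullEncoding wanted) {
  if (wanted == ToyNullEncoding::kMask) return arr.dictionary_null_count == 0;
  return arr.indices_null_count == 0;
}

// Masked -> encoded: point every null index at a null dictionary entry.
ToyDictionaryArray ToEncodedNulls(ToyDictionaryArray arr) {
  if (arr.indices_null_count == 0) return arr;  // nothing to rewrite
  std::int32_t null_slot = -1;
  for (std::size_t i = 0; i < arr.dictionary.size(); ++i) {
    if (!arr.dictionary[i].has_value()) {
      null_slot = static_cast<std::int32_t>(i);  // reuse an existing null entry
      break;
    }
  }
  if (null_slot < 0) {
    null_slot = static_cast<std::int32_t>(arr.dictionary.size());
    arr.dictionary.push_back(std::nullopt);
    ++arr.dictionary_null_count;
  }
  for (auto& index : arr.indices) {
    if (!index.has_value()) index = null_slot;
  }
  arr.indices_null_count = 0;
  return arr;
}

// Encoded -> masked: drop null dictionary entries and mask indices that used them.
ToyDictionaryArray ToMaskedNulls(ToyDictionaryArray arr) {
  std::vector<std::int32_t> remap(arr.dictionary.size(), -1);
  std::vector<std::optional<std::string>> compacted;
  for (std::size_t i = 0; i < arr.dictionary.size(); ++i) {
    if (arr.dictionary[i].has_value()) {
      remap[i] = static_cast<std::int32_t>(compacted.size());
      compacted.push_back(std::move(arr.dictionary[i]));
    }
  }
  std::size_t nulls = 0;
  for (auto& index : arr.indices) {
    if (index.has_value()) {
      assert(*index >= 0 && static_cast<std::size_t>(*index) < remap.size());
      const std::int32_t target = remap[static_cast<std::size_t>(*index)];
      if (target < 0) {
        index.reset();  // pointed at a null entry, so mask it
      } else {
        index = target;
      }
    }
    if (!index.has_value()) ++nulls;
  }
  arr.dictionary = std::move(compacted);
  arr.dictionary_null_count = 0;
  arr.indices_null_count = nulls;
  return arr;
}

// Check first, convert only when the requested style is not already satisfied.
ToyDictionaryArray ConvertNulls(const ToyDictionaryArray& in, ToyNullEncoding wanted) {
  if (MatchesEncoding(in, wanted)) return in;
  return wanted == ToyNullEncoding::kEncode ? ToEncodedNulls(in) : ToMaskedNulls(in);
}
```

In this toy model both conversions only rewrite the indices and patch the dictionary, so the values are never decoded and re-hashed.  A real implementation would work on Arrow buffers and validity bitmaps rather than std::optional, so treat this purely as an illustration of the logic.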

The default NullEncodingBehavior should also change to EXISTING_OR_ENCODE, or a second option should be added, so that this doesn't change existing behavior.

Once this is done, partition.cc could be improved.  It currently requires that dictionaries use "encoded nulls" and, if a dictionary is passed in that uses "masked nulls", it decodes and re-encodes the dictionary, which is a potentially costly operation.  This could be fixed to use the conversion instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)