Posted to jira@arrow.apache.org by "Rasmus Johansen (Jira)" <ji...@apache.org> on 2022/09/14 21:34:00 UTC

[jira] [Commented] (ARROW-17733) [C++] Concatenating dictionary arrays with nulls fills wrong parts of index buffer with 0.

    [ https://issues.apache.org/jira/browse/ARROW-17733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17604986#comment-17604986 ] 

Rasmus Johansen commented on ARROW-17733:
-----------------------------------------

I've created this pull request which I believe solves this bug, and added a test to prevent regression: [https://github.com/apache/arrow/pull/14129/]

> [C++] Concatenating dictionary arrays with nulls fills wrong parts of index buffer with 0.
> ------------------------------------------------------------------------------------------
>
>                 Key: ARROW-17733
>                 URL: https://issues.apache.org/jira/browse/ARROW-17733
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Rasmus Johansen
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When concatenating dictionary arrays that contain nulls and whose index type is not 8 bits wide, the wrong bytes of the index buffer get zeroed out.
> Example using pyarrow:
> {code:python}
> import pyarrow as pa
>
> dictionary_type = pa.dictionary(pa.int16(), pa.string())
> empty_array = pa.array([], dictionary_type)
> array1 = pa.array(["a", "b", None], dictionary_type)
> array2 = pa.concat_arrays([empty_array, array1])
> print(array1.to_pylist())
> print(array2.to_pylist())
> {code}
> We would expect array1 and array2 to be the same, but this prints:
> {noformat}
> ['a', 'b', None]
> ['a', 'a', None]
> {noformat}
>
> This bug happens because the index type is 2 bytes wide, so the null at position 2 should zero out bytes 4-5 (0-indexed) of the index buffer. However, the code instead zeroes out bytes 2-3, because it does not take the width of the index type into account when adding the position here:
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/concatenate.cc#L314-L315
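
The offset arithmetic described above can be modelled in a few lines of plain Python. This is only a sketch of the reasoning, not the Arrow C++ code: the helper names `zero_null_slots` and `decode_int16`, the `scale_by_width` flag, and the placeholder value 7 stored in the null slot are all assumptions made for illustration.

```python
import struct


def zero_null_slots(index_buffer, null_positions, index_width, scale_by_width):
    """Zero the index slot of every null position, in place.

    Hypothetical helper: scale_by_width=False models the reported bug,
    where the logical position is used directly as a byte offset.
    """
    for position in null_positions:
        # The correct byte offset is position * index_width.
        offset = position * index_width if scale_by_width else position
        index_buffer[offset:offset + index_width] = bytes(index_width)


def decode_int16(index_buffer):
    """Decode a little-endian int16 index buffer into a list of ints."""
    count = len(index_buffer) // 2
    return list(struct.unpack(f"<{count}h", bytes(index_buffer)))


# int16 indices for ["a", "b", None]; the null slot holds a placeholder (7).
raw = struct.pack("<3h", 0, 1, 7)

buggy = bytearray(raw)
zero_null_slots(buggy, [2], index_width=2, scale_by_width=False)

fixed = bytearray(raw)
zero_null_slots(fixed, [2], index_width=2, scale_by_width=True)

print(decode_int16(buggy))  # [0, 0, 7]
print(decode_int16(fixed))  # [0, 1, 0]
```

In the unscaled variant, bytes 2-3 are cleared, which overwrites the index of "b" with 0 and matches the ['a', 'a', None] output in the report. In the scaled variant, bytes 4-5 are cleared and the valid indices are left untouched.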
