Posted to jira@arrow.apache.org by "Jared Weston (Jira)" <ji...@apache.org> on 2022/09/29 23:32:00 UTC

[jira] [Updated] (ARROW-17900) [Python] combine_chunks on DictionaryArray appears to be broken

     [ https://issues.apache.org/jira/browse/ARROW-17900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jared Weston updated ARROW-17900:
---------------------------------
    Attachment: two.png

> [Python] combine_chunks on DictionaryArray appears to be broken
> ---------------------------------------------------------------
>
>                 Key: ARROW-17900
>                 URL: https://issues.apache.org/jira/browse/ARROW-17900
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Jared Weston
>            Priority: Minor
>         Attachments: category_counts.py, test.parquet, two.png
>
>
> I recently upgraded from pyarrow 4.0.1 to 9.0.0, and there appears to be a bug when combining the chunks of a dictionary-encoded column that spans multiple row groups. The dictionary values are a string array of categories.
> It is worth noting that not every category is present in every chunk. The issue appears to be that, when the chunks are combined, the category indices of a chunk are incorrect if a category is missing from that chunk. I assume this because the counts for the categories with lower indices (0, 1) are higher in the bugged version than in the working version, while the counts for the higher indices (2, 3, 4) are lower.
>  
> The difference is easy to see when running a value count. For example:
> !two.png!
> A workaround for now is to read the column directly as a string array and then encode it as a dictionary. This is not ideal, however, because of speed and memory concerns.
> !one.png!
>  
> Attached are my table (I did not create it, so please excuse the data and the UUID-style column names) and a script that shows the difference. Please run the script with both pyarrow 4.0.1 and pyarrow 9.0.0 to compare the output.
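
The hypothesis in the description can be illustrated without pyarrow. The following is a minimal pure-Python sketch, not pyarrow's implementation and not a confirmed root cause. The chunk data and the helpers unify, concat_without_remap and value_counts are all hypothetical. It shows that if per-chunk indices were concatenated without being remapped to a shared dictionary, the counts would shift toward the low indices, which is the pattern the reporter describes.

```python
from collections import Counter

# Hypothetical data: each chunk carries its own dictionary, and a category
# that does not occur in a chunk is absent from that chunk's dictionary.
chunks = [
    {"dictionary": ["a", "b", "c"], "indices": [0, 1, 2, 2]},
    {"dictionary": ["b", "c"], "indices": [0, 1, 1]},  # "a" is missing
    {"dictionary": ["c", "d"], "indices": [0, 1]},     # "a" and "b" are missing
]


def unify(chunks):
    """Build one shared dictionary and remap every index into it."""
    unified = []    # shared dictionary, in order of first appearance
    position = {}   # value -> index in the shared dictionary
    combined = []   # remapped indices for all chunks
    for chunk in chunks:
        for idx in chunk["indices"]:
            value = chunk["dictionary"][idx]
            if value not in position:
                position[value] = len(unified)
                unified.append(value)
            combined.append(position[value])
    return unified, combined


def concat_without_remap(chunks):
    """Deliberately wrong: keep the first chunk's dictionary and append
    the later chunks' indices unchanged."""
    dictionary = list(chunks[0]["dictionary"])
    combined = [idx for chunk in chunks for idx in chunk["indices"]]
    return dictionary, combined


def value_counts(dictionary, indices):
    """Count decoded values, similar in spirit to a value count."""
    return Counter(dictionary[idx] for idx in indices)


good = value_counts(*unify(chunks))
bad = value_counts(*concat_without_remap(chunks))

print("remapped:    ", dict(good))
print("not remapped:", dict(bad))
```

In this toy data the total number of rows is the same in both results, but the unremapped version credits rows from later chunks to the categories at indices 0 and 1 and loses the category "d" entirely.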



--
This message was sent by Atlassian Jira
(v8.20.10#820010)