Posted to jira@arrow.apache.org by "Alenka Frim (Jira)" <ji...@apache.org> on 2022/09/30 09:37:00 UTC

[jira] [Commented] (ARROW-17900) [Python] combine_chunks on DictionaryArray appears to be broken

    [ https://issues.apache.org/jira/browse/ARROW-17900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17611468#comment-17611468 ] 

Alenka Frim commented on ARROW-17900:
-------------------------------------

Thank you for reporting this! I made a small reproducible example:
{code:python}
import pyarrow as pa
indices = pa.array([0, 1, 2, 0, 2, 0, None, 2])
dictionary = pa.array(["MV", "OB", "LMS2"])
dict_array = pa.DictionaryArray.from_arrays(indices, dictionary)

indices1 = pa.array([0, 0, 0, 0, 0, 0, 0, 0])
dictionary1 = pa.array(["MV","OB"])
dict_array1 = pa.DictionaryArray.from_arrays(indices1, dictionary1)

# ChunkedArray made from two separate
# DictionaryArray objects
ca = pa.chunked_array((
    dict_array,
    dict_array1
))
# Creating one DictionaryArray from a ChunkedArray
# where each chunk is a DictionaryArray
da = ca.combine_chunks()
{code}
Comparing the value counts of the chunked array ({{ca}}) and the combined array ({{da}}) in pyarrow 4.0.1:
{code:python}
>>> pa.__version__
'4.0.1'
>>> ca.value_counts()
<pyarrow.lib.StructArray object at 0x7fcc4083d280>
-- is_valid: all not null
-- child 0 type: dictionary<values=string, indices=int64, ordered=0>

  -- dictionary:
    [
      "MV",
      "OB",
      "LMS2"
    ]
  -- indices:
    [
      0,
      1,
      2,
      null
    ]
-- child 1 type: int64
  [
    11,
    1,
    3,
    1
  ]
>>> da.value_counts()
<pyarrow.lib.StructArray object at 0x7fcc4083d220>
-- is_valid: all not null
-- child 0 type: dictionary<values=string, indices=int64, ordered=0>

  -- dictionary:
    [
      "MV",
      "OB",
      "LMS2"
    ]
  -- indices:
    [
      0,
      1,
      2,
      null
    ]
-- child 1 type: int64
  [
    11,
    1,
    3,
    1
  ]
{code}
and in pyarrow 9.0.0:
{code:python}
>>> pa.__version__
'9.0.0'
>>> ca.value_counts()
<pyarrow.lib.StructArray object at 0x7fa4989877c0>
-- is_valid: all not null
-- child 0 type: dictionary<values=string, indices=int64, ordered=0>

  -- dictionary:
    [
      "MV",
      "OB",
      "LMS2"
    ]
  -- indices:
    [
      0,
      1,
      2,
      null
    ]
-- child 1 type: int64
  [
    11,
    1,
    3,
    1
  ]
>>> da.value_counts()
<pyarrow.lib.StructArray object at 0x7fa498987be0>
-- is_valid: all not null
-- child 0 type: dictionary<values=string, indices=int64, ordered=0>

  -- dictionary:
    [
      "MV",
      "OB",
      "LMS2"
    ]
  -- indices:
    [
      0,
      2,
      null
    ]
-- child 1 type: int64
  [
    12,
    3,
    1
  ]
{code}
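
As a cross-check on which of the two outputs is right, the expected counts can be worked out without pyarrow at all. The sketch below uses only the standard library and decodes each chunk through its own dictionary, using the same indices and dictionaries as the example above (the {{decode}} helper is just for illustration):

```python
from collections import Counter

# (indices, dictionary) for each chunk, copied from the example above
chunks = [
    ([0, 1, 2, 0, 2, 0, None, 2], ["MV", "OB", "LMS2"]),
    ([0, 0, 0, 0, 0, 0, 0, 0], ["MV", "OB"]),
]


def decode(indices, dictionary):
    """Map each index to its dictionary value, keeping nulls as None."""
    return [None if i is None else dictionary[i] for i in indices]


# Count the decoded values across both chunks
expected = Counter()
for indices, dictionary in chunks:
    expected.update(decode(indices, dictionary))

print(dict(expected))
# {'MV': 11, 'OB': 1, 'LMS2': 3, None: 1}
```

This matches both outputs in 4.0.1 and {{ca.value_counts()}} in 9.0.0. Only {{da.value_counts()}} in 9.0.0 differs: the single "OB" value is counted as "MV".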

> [Python] combine_chunks on DictionaryArray appears to be broken
> ---------------------------------------------------------------
>
>                 Key: ARROW-17900
>                 URL: https://issues.apache.org/jira/browse/ARROW-17900
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Jared Weston
>            Priority: Minor
>         Attachments: category_counts.py, one.png, test.parquet, two.png
>
>
> I recently upgraded from pyarrow 4.0.1 to 9.0.0, and there appears to be a bug when combining the chunks of a dictionary column that spans multiple row groups. The dictionary is a string array of categories.
> It is worth noting that not every category is present in every chunk. To me, the issue appears to be that, when the chunks are combined, the category indices are wrong for any chunk that is missing a category. I assume this because the categories with lower indices (0, 1) have higher counts in the bugged version than in the working version, and the categories with higher indices (2, 3, 4) have lower counts.
>  
> The difference is easy to see when running a value count. For example:
> !two.png!
> A workaround for now is to read the column directly as a string array and then encode it as a dictionary. This isn't ideal, however, because of speed and memory concerns.
> !one.png!
>  
> Attached are my parquet file (test.parquet) and a simple Python script to see the difference (category_counts.py). I did not create this parquet file; I am consuming it from a service, so excuse the data / uuid style column names. Please run this with pyarrow 4.0.1 and pyarrow 9.0.0 to see the difference in output. The images say pyarrow 6.0.0, but the issue is still present in 9.0.0 too.
>  


