You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@arrow.apache.org by "Thomas Newton (Jira)" <ji...@apache.org> on 2022/06/27 07:25:00 UTC

[jira] [Updated] (ARROW-16905) Table.to_pandas() fails for dictionary encoded columns with an is_null partition_expression

     [ https://issues.apache.org/jira/browse/ARROW-16905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Newton updated ARROW-16905:
----------------------------------
    Component/s: Python

> Table.to_pandas() fails for dictionary encoded columns with an is_null partition_expression
> -------------------------------------------------------------------------------------------
>
>                 Key: ARROW-16905
>                 URL: https://issues.apache.org/jira/browse/ARROW-16905
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 8.0.0
>         Environment: Ubuntu 18.04, PyArrow 8.0.0, Pandas 1.4.3
>            Reporter: Thomas Newton
>            Priority: Major
>         Attachments: reproduce_null_dictionary_issue.zip
>
>
> Minimal steps to reproduce:
> I attached a `.zip` file containing a python script and a test parquet file. Running this python script reproduces the issue.
> The steps taken to reproduce:
>  # Create a test parquet file with one column containing only null.
>  # Create a parquet fragment from this file adding a `partition_expression` with an `is_null` guarantee on this fragment.
>  # Create a `FileSystemDataset` from this fragment setting the schema to be a dictionary column.
>  # Call `.to_table().to_pandas()` on the resulting pyarrow dataset. You will get the following error.
> {code:java}
>   File "/.../pip-core_pandas/pandas/core/dtypes/dtypes.py", line 492, in validate_categories
>     raise ValueError("Categorical categories cannot be null")
> ValueError: Categorical categories cannot be null {code}
>  
> My understanding of why this doesn't work:
>  # There are 2 ways of dictionary encoding nulls: `mask` and `encode` described in the [pyarrow docs|https://arrow.apache.org/docs/python/generated/pyarrow.compute.DictionaryEncodeOptions.html#pyarrow.compute.DictionaryEncodeOptions]. Pyarrow supports both but pandas categoricals only supports mask. Arguably the real issue here is pandas should support `encode` style categoricals.
>  # When you provide an `.is_null` guarantee on a fragment arrow will not actually read the data. It knows the type from the schema, we've guaranteed the values are all null and it can get the length from the parquet metadata so it has everything it needs.
>  # Instead of reading the data it uses the [Null ArrayFactory|https://github.com/apache/arrow/blob/master/cpp/src/arrow/array/util.cc]. For dictionary type columns I believe that calls [this DictionaryArray constructor |https://github.com/apache/arrow/blob/53752adc6b81166cd4ee7db5a819494042f29197/cpp/src/arrow/array/array_dict.cc#L80-L93]which appears to be creating the dictionary in the `encode` style.
> Would it be possible to make this configurable? It seems like the `mask` style of dictionary encoding is the default for the rest of PyArrow and it would solve the Pandas compatibility issue. I appreciate this is probably an extremely niche issue but my options for a workaround are looking pretty horrible. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)