Posted to jira@arrow.apache.org by "Damian Barabonkov (Jira)" <ji...@apache.org> on 2022/03/28 13:14:00 UTC

[jira] [Created] (ARROW-16045) Version=7.0.0 introduces bug when filtering by empty set during load

Damian Barabonkov created ARROW-16045:
-----------------------------------------

             Summary: Version=7.0.0 introduces bug when filtering by empty set during load
                 Key: ARROW-16045
                 URL: https://issues.apache.org/jira/browse/ARROW-16045
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 7.0.0
         Environment: pandas                    1.3.5
pyarrow                   7.0.0
python                    3.10.4

            Reporter: Damian Barabonkov
             Fix For: 6.0.1


Pyarrow raises an error when reading a parquet file with an empty-set "in" filter on a string column. The issue is present in pyarrow v7.0.0 but not in v6.0.1. Interestingly, it does not occur when filtering on a numeric column (the float column "A" in the example below), even in v7.0.0.

 

The following Python code presents a minimal example which reproduces the issue:
{code:python}
import pandas as pd
import numpy as np
path = './example_df.parquet'
df = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)
df.to_parquet(path)

# Works!
df_read = pd.read_parquet(
    path,
    filters=[
        [
            ("A", "in", set())
        ]
    ]
)

# Pyarrow v6.0.1 and v7.0.0
#
# Empty DataFrame
# Columns: [A, B, C, D, E, F]
# Index: []
print(df_read)

# Fails!
df_read = pd.read_parquet(
    path,
    filters=[
        [
            ("F", "in", set())
        ]
    ]
)
# Pyarrow v6.0.1
#
# Empty DataFrame
# Columns: [A, B, C, D, E, F]
# Index: []

# Pyarrow v7.0.0
#
# pyarrow.lib.ArrowInvalid: Array type didn't match type of values set: string vs null
print(df_read)
{code}
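
Until a fixed version is available, one possible caller-side guard is to detect filters that cannot match any row before passing them to the reader. The sketch below is only an illustration: filters_match_nothing is a hypothetical helper, not part of pandas or pyarrow, and it assumes the filters use the same list-of-lists (OR of AND-groups) or flat list-of-tuples layout as the example above.

```python
def filters_match_nothing(filters):
    """Return True if the filters cannot match any row.

    Hypothetical helper (not a pandas/pyarrow API). It only recognises
    one case: every AND-group contains an "in" predicate whose value
    collection is empty.
    """
    if not filters:
        # No filters at all: every row matches.
        return False

    # A flat list of tuples is treated as a single AND-group.
    groups = [filters] if isinstance(filters[0], tuple) else filters

    def group_is_unsatisfiable(group):
        # len() is only evaluated for "in" predicates, so scalar values
        # used with other operators are never measured.
        return any(op == "in" and len(values) == 0 for _col, op, values in group)

    # The OR of the groups is empty only if every group is empty.
    return all(group_is_unsatisfiable(group) for group in groups)
```

With the filters from the failing call, filters_match_nothing([[("F", "in", set())]]) returns True, so a caller on an affected version could skip the filtered read and produce the empty result some other way. How best to build that empty result is left open here.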



--
This message was sent by Atlassian Jira
(v8.20.1#820001)