Posted to issues@arrow.apache.org by "Nick Riasanovsky (Jira)" <ji...@apache.org> on 2022/05/06 17:22:00 UTC

[jira] [Created] (ARROW-16495) [Python] Scanner.count_rows() doesn't properly handle null expressions

Nick Riasanovsky created ARROW-16495:
----------------------------------------

             Summary: [Python] Scanner.count_rows() doesn't properly handle null expressions
                 Key: ARROW-16495
                 URL: https://issues.apache.org/jira/browse/ARROW-16495
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 7.0.0
            Reporter: Nick Riasanovsky


Passing an expression filter built with `is_null()` does not give the correct row count from `Scanner.count_rows()`: the null values are not handled properly. I have reproduced this with both string and integer columns. Here is a reproducer:

 

```python
import pandas as pd
import pyarrow.dataset as ds

# Column "C" holds two nulls and one value.
df = pd.DataFrame({"C": pd.array([None, None, 1], dtype=pd.Int64Dtype())})
print(df)
df.to_parquet("test.pq")

# Create a dataset
dataset = ds.dataset("test.pq")
fragments = list(dataset.get_fragments())
# There should be just 1 fragment.
fragment = fragments[0]

# Get the null row count
expr = ds.field("C").is_null()
scanner = fragment.scanner(filter=expr)
print(scanner.count_rows())
```

 

I expect this to print 2, as there are 2 NULL values.


