Posted to issues@arrow.apache.org by "Nick Riasanovsky (Jira)" <ji...@apache.org> on 2022/05/06 17:22:00 UTC
[jira] [Created] (ARROW-16495) [Python] Scanner.count_rows() doesn't properly handle null expressions
Nick Riasanovsky created ARROW-16495:
----------------------------------------
Summary: [Python] Scanner.count_rows() doesn't properly handle null expressions
Key: ARROW-16495
URL: https://issues.apache.org/jira/browse/ARROW-16495
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 7.0.0
Reporter: Nick Riasanovsky
Passing an expression filter with `is_null()` doesn't properly handle null values when computing row counts. I have reproduced this with both strings and integers. Here is a reproducer.
```python
import pandas as pd
import pyarrow.dataset as ds

# Nullable integer column with two nulls and one value
df = pd.DataFrame({"C": pd.array([None, None, 1], dtype=pd.Int64Dtype())})
print(df)
df.to_parquet("test.pq")
# Create a dataset
dataset = ds.dataset("test.pq")
fragments = list(dataset.get_fragments())
# There should be just 1 fragment.
fragment = fragments[0]
# Get the null row count
expr = ds.field("C").is_null()
scanner = fragment.scanner(filter=expr)
print(scanner.count_rows())
```
I expect this to print 2, as there are 2 NULL values.
--
This message was sent by Atlassian Jira
(v8.20.7#820007)