Posted to jira@arrow.apache.org by "Troy Zimmerman (Jira)" <ji...@apache.org> on 2020/09/16 21:00:08 UTC

[jira] [Created] (ARROW-10027) [Python] Incorrect null column returned by dataset filter expression.

Troy Zimmerman created ARROW-10027:
--------------------------------------

             Summary: [Python] Incorrect null column returned by dataset filter expression.
                 Key: ARROW-10027
                 URL: https://issues.apache.org/jira/browse/ARROW-10027
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 1.0.1
            Reporter: Troy Zimmerman


When using dataset filter expressions (which I <3) with Parquet files, columns of type {{null}} are returned in full, rather than only the rows selected by the filter on the other columns.

Here's an example.

{code:python}
In [7]: import pyarrow as pa
In [8]: import pyarrow.dataset as ds
In [9]: import pyarrow.parquet as pq

In [10]: table = pa.Table.from_arrays(
 ...:     arrays=[
 ...:         pa.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
 ...:         pa.array(["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]),
 ...:         pa.array([None, None, None, None, None, None, None, None, None, None]),
 ...:     ],
 ...:     names=["id", "name", "other"],
 ...: )

In [11]: table
Out[11]:
pyarrow.Table
id: int64
name: string
other: null

In [12]: table.to_pandas()
Out[12]:
   id   name other
0   0   zero  None
1   1    one  None
2   2    two  None
3   3  three  None
4   4   four  None
5   5   five  None
6   6    six  None
7   7  seven  None
8   8  eight  None
9   9   nine  None

In [13]: pq.write_table(table, "/tmp/test.parquet", data_page_version="2.0")
In [14]: data = ds.dataset("/tmp/test.parquet")
In [15]: table = data.to_table(filter=ds.field("id").isin([1, 4, 7]))
In [16]: table
Out[16]:
pyarrow.Table
id: int64
name: string
other: null

In [17]: table.to_pydict()
Out[17]:
{'id': [1, 4, 7],
 'name': ['one', 'four', 'seven'],
 'other': [None, None, None, None, None, None, None, None, None, None]}
{code}
The {{to_pydict}} method highlights the strange behavior: the {{id}} and {{name}} columns have 3 elements, but the {{other}} column has all 10. When I call {{to_pandas}} on the filtered table, the program crashes.
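The mismatch can also be checked programmatically instead of by eye. The following is only a minimal sketch in plain Python: the helper names {{column_lengths}} and {{mismatched_columns}} are hypothetical and not part of pyarrow, and the dict literal is copied from {{Out[17]}} above, so nothing here needs pyarrow to run.

```python
def column_lengths(columns):
    """Map each column name to the number of values it holds."""
    return {name: len(values) for name, values in columns.items()}


def mismatched_columns(columns, expected_rows):
    """Return the names of columns whose length differs from expected_rows."""
    return [
        name
        for name, length in column_lengths(columns).items()
        if length != expected_rows
    ]


# The dict shown in Out[17]; the filter matched three ids (1, 4, 7).
filtered = {
    "id": [1, 4, 7],
    "name": ["one", "four", "seven"],
    "other": [None] * 10,
}

print(column_lengths(filtered))
print(mismatched_columns(filtered, expected_rows=3))
```

With a real filtered table, the same helpers could presumably be applied to the result of {{table.to_pydict()}}, passing the number of rows the filter is expected to match.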

This could be a C++ issue, but, since my examples are in Python, I categorized it as a Python issue. Let me know if that's wrong and I'll note that for the future.


