Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/09/17 08:54:00 UTC

[jira] [Comment Edited] (ARROW-10027) [Python] Incorrect null column returned when using a dataset filter expression.

    [ https://issues.apache.org/jira/browse/ARROW-10027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197516#comment-17197516 ] 

Joris Van den Bossche edited comment on ARROW-10027 at 9/17/20, 8:53 AM:
-------------------------------------------------------------------------

Also, selecting the null column from the filtered table shows that it still has 10 elements:

{code}
In [9]: table['other']
Out[9]: 
<pyarrow.lib.ChunkedArray object at 0x7fdfb2a0e7d8>
[
10 nulls
]
{code}

So it seems the null column doesn't get properly filtered (for a NullArray, filtering simply means changing the length).


> [Python] Incorrect null column returned when using a dataset filter expression.
> -------------------------------------------------------------------------------
>
>                 Key: ARROW-10027
>                 URL: https://issues.apache.org/jira/browse/ARROW-10027
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 1.0.1
>            Reporter: Troy Zimmerman
>            Priority: Major
>
> When using dataset filter expressions (which I <3) with Parquet files, {{null}} columns are returned in full, rather than only the rows matched by the filter.
> Here's an example.
> {code:python}
> In [7]: import pyarrow as pa
> In [8]: import pyarrow.dataset as ds
> In [9]: import pyarrow.parquet as pq
> In [10]: table = pa.Table.from_arrays(
>  ...:     arrays=[
>  ...:         pa.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
>  ...:         pa.array(["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]),
>  ...:         pa.array([None, None, None, None, None, None, None, None, None, None]),
>  ...:     ],
>  ...:     names=["id", "name", "other"],
>  ...: )
> In [11]: table
> Out[11]:
> pyarrow.Table
> id: int64
> name: string
> other: null
> In [12]: table.to_pandas()
> Out[12]:
>    id   name other
> 0   0   zero  None
> 1   1    one  None
> 2   2    two  None
> 3   3  three  None
> 4   4   four  None
> 5   5   five  None
> 6   6    six  None
> 7   7  seven  None
> 8   8  eight  None
> 9   9   nine  None
> In [13]: pq.write_table(table, "/tmp/test.parquet", data_page_version="2.0")
> In [14]: data = ds.dataset("/tmp/test.parquet")
> In [15]: table = data.to_table(filter=ds.field("id").isin([1, 4, 7]))
> In [16]: table
> Out[16]:
> pyarrow.Table
> id: int64
> name: string
> other: null
> In [17]: table.to_pydict()
> Out[17]:
> {'id': [1, 4, 7],
>  'name': ['one', 'four', 'seven'],
>  'other': [None, None, None, None, None, None, None, None, None, None]}
> {code}
> The {{to_pydict}} method highlights the strange behavior: the {{id}} and {{name}} columns have 3 elements, but the {{other}} column has all 10. When I call {{to_pandas}} on the filtered table, the program crashes.
> This could be a C++ issue, but, since my examples are in Python, I categorized it as a Python issue. Let me know if that's wrong and I'll note that for the future.
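
The mismatch described in the report can be detected mechanically by comparing column lengths in the `to_pydict` output. A small sketch in plain Python follows; `mismatched_columns` is a hypothetical helper written for illustration, not part of pyarrow, and the input dictionary is copied from the `Out[17]` output above.

```python
def mismatched_columns(columns):
    """Return {name: length} for columns whose length differs from the first column's."""
    lengths = {name: len(values) for name, values in columns.items()}
    if not lengths:
        return {}
    # Use the first column as the reference length.
    expected = next(iter(lengths.values()))
    return {name: n for name, n in lengths.items() if n != expected}


# Output of to_pydict() as shown in the report.
reported = {
    "id": [1, 4, 7],
    "name": ["one", "four", "seven"],
    "other": [None] * 10,
}

bad = mismatched_columns(reported)
print(bad)
```

An empty result would mean all columns agree in length; here the `other` column is the one flagged.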


