Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2022/09/23 14:12:00 UTC

[jira] [Commented] (ARROW-16199) [Python] Filters and pq.ParquetDataset/pq.read_table with legacy API

    [ https://issues.apache.org/jira/browse/ARROW-16199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17608773#comment-17608773 ] 

Joris Van den Bossche commented on ARROW-16199:
-----------------------------------------------

Apparently, we already had an issue for this -> ARROW-9780
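
For comparison, the same filter works fine on the non-legacy code path. A minimal sketch (assuming a pyarrow version where the new dataset API is available, i.e. built with the {{dataset}} component):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# Write the same single-file example as in the report.
table = pa.table({
    'i64': list(range(5)),
    'str': [str(i) for i in range(5)],
})
pq.write_table(table, 'example.parquet')

# With the new (non-legacy) dataset API, filters are pushed down
# and applied correctly instead of raising an AttributeError.
filtered = pq.read_table('example.parquet', filters=[('str', '=', '1')])
# filtered contains only the row(s) where str == '1'
{code}

So the improvement asked for here is only about the error message on the {{use_legacy_dataset=True}} path, not about the filtering capability itself.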

> [Python] Filters and pq.ParquetDataset/pq.read_table with legacy API
> --------------------------------------------------------------------
>
>                 Key: ARROW-16199
>                 URL: https://issues.apache.org/jira/browse/ARROW-16199
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Alenka Frim
>            Priority: Major
>              Labels: dataset-parquet-legacy
>
> Supplying filters to pq.ParquetDataset or pq.read_table when using the legacy API should give a better error message than the opaque AttributeError shown below:
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> data = [
>     list(range(5)),
>     list(map(str, range(5))),
> ]
> schema = pa.schema([
>     ('i64', pa.int64()),
>     ('str', pa.string()),
> ])
> batch = pa.record_batch(data, schema=schema)
> table = pa.Table.from_batches([batch])
> pq.write_table(table, 'example.parquet')
> {code}
> {code:python}
> >>> pq.ParquetDataset('example.parquet', use_legacy_dataset=True, filters=[('str', '=', "1")])
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/alenkafrim/repos/arrow/python/pyarrow/parquet/__init__.py", line 1755, in __init__
>     self._filter(filters)
>   File "/Users/alenkafrim/repos/arrow/python/pyarrow/parquet/__init__.py", line 1933, in _filter
>     accepts_filter = self._partitions.filter_accepts_partition
> AttributeError: 'NoneType' object has no attribute 'filter_accepts_partition'
> {code}
> {code:python}
> >>> pq.read_table('example.parquet', use_legacy_dataset=True, filters=[('str', '=', "1")])
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/alenkafrim/repos/arrow/python/pyarrow/parquet/__init__.py", line 2760, in read_table
>     pf = ParquetDataset(
>   File "/Users/alenkafrim/repos/arrow/python/pyarrow/parquet/__init__.py", line 1755, in __init__
>     self._filter(filters)
>   File "/Users/alenkafrim/repos/arrow/python/pyarrow/parquet/__init__.py", line 1933, in _filter
>     accepts_filter = self._partitions.filter_accepts_partition
> AttributeError: 'NoneType' object has no attribute 'filter_accepts_partition'
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)