Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2022/09/23 14:12:00 UTC
[jira] [Commented] (ARROW-16199) [Python] Filters and pq.ParquetDataset/pq.read_table with legacy API
[ https://issues.apache.org/jira/browse/ARROW-16199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17608773#comment-17608773 ]
Joris Van den Bossche commented on ARROW-16199:
-----------------------------------------------
Apparently we already had an issue for this: ARROW-9780.
> [Python] Filters and pq.ParquetDataset/pq.read_table with legacy API
> --------------------------------------------------------------------
>
> Key: ARROW-16199
> URL: https://issues.apache.org/jira/browse/ARROW-16199
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Alenka Frim
> Priority: Major
> Labels: dataset-parquet-legacy
>
> Passing filters to pq.ParquetDataset or pq.read_table with the legacy API (use_legacy_dataset=True) should produce a clearer error message than the AttributeError shown in the tracebacks below:
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> data = [
>     list(range(5)),
>     list(map(str, range(5))),
> ]
> schema = pa.schema([
>     ('i64', pa.int64()),
>     ('str', pa.string()),
> ])
> batch = pa.record_batch(data, schema=schema)
> table = pa.Table.from_batches([batch])
> pq.write_table(table, 'example.parquet')
> {code}
> {code:python}
> >>> pq.ParquetDataset('example.parquet', use_legacy_dataset=True, filters=[('str', '=', "1")])
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/alenkafrim/repos/arrow/python/pyarrow/parquet/__init__.py", line 1755, in __init__
>     self._filter(filters)
>   File "/Users/alenkafrim/repos/arrow/python/pyarrow/parquet/__init__.py", line 1933, in _filter
>     accepts_filter = self._partitions.filter_accepts_partition
> AttributeError: 'NoneType' object has no attribute 'filter_accepts_partition'
> {code}
> {code:python}
> >>> pq.read_table('example.parquet', use_legacy_dataset=True, filters=[('str', '=', "1")])
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/alenkafrim/repos/arrow/python/pyarrow/parquet/__init__.py", line 2760, in read_table
>     pf = ParquetDataset(
>   File "/Users/alenkafrim/repos/arrow/python/pyarrow/parquet/__init__.py", line 1755, in __init__
>     self._filter(filters)
>   File "/Users/alenkafrim/repos/arrow/python/pyarrow/parquet/__init__.py", line 1933, in _filter
>     accepts_filter = self._partitions.filter_accepts_partition
> AttributeError: 'NoneType' object has no attribute 'filter_accepts_partition'
> {code}
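
For illustration, the kind of early check the issue asks for might look roughly like the sketch below. This is not pyarrow's code: LegacyDatasetSketch is a hypothetical stand-in class, and the sketch assumes (as the tracebacks suggest) that the legacy code path can only apply filters when partitions were discovered. The point is only to show a guard that raises a descriptive ValueError before self._partitions is dereferenced.

```python
class LegacyDatasetSketch:
    """Hypothetical stand-in for a legacy-style dataset (not pyarrow's class)."""

    def __init__(self, path, partitions=None, filters=None):
        self.path = path
        # None models a single, non-partitioned file such as 'example.parquet'.
        self._partitions = partitions
        if filters is not None:
            self._filter(filters)

    def _filter(self, filters):
        # Fail early with an actionable message instead of letting an
        # attribute lookup on None raise an opaque AttributeError.
        if self._partitions is None:
            raise ValueError(
                f"filters={filters!r} cannot be applied to {self.path!r}: "
                "no partitions were discovered. This sketch assumes the "
                "legacy code path filters on partitions only."
            )
        # Partition-level filtering would happen here; omitted in this sketch.


if __name__ == "__main__":
    try:
        LegacyDatasetSketch("example.parquet", filters=[("str", "=", "1")])
    except ValueError as exc:
        print(exc)
```

With a guard like this, the reproducer above would fail with a message naming the filters and the path rather than with "'NoneType' object has no attribute 'filter_accepts_partition'".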