Posted to jira@arrow.apache.org by "Ben Kietzman (Jira)" <ji...@apache.org> on 2020/06/15 17:52:00 UTC
[jira] [Assigned] (ARROW-9105) [C++] ParquetFileFragment scanning
doesn't handle filter on partition field
[ https://issues.apache.org/jira/browse/ARROW-9105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ben Kietzman reassigned ARROW-9105:
-----------------------------------
Assignee: Ben Kietzman
> [C++] ParquetFileFragment scanning doesn't handle filter on partition field
> ---------------------------------------------------------------------------
>
> Key: ARROW-9105
> URL: https://issues.apache.org/jira/browse/ARROW-9105
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Reporter: Joris Van den Bossche
> Assignee: Ben Kietzman
> Priority: Major
> Labels: dataset, dataset-dask-integration
> Fix For: 1.0.0
>
>
> When splitting a fragment into row group fragments, filtering on the partition field raises an error.
> Python reproducer:
> {code:python}
> import pandas as pd
> import pyarrow.dataset as ds
>
> df = pd.DataFrame({"dummy": [1, 1, 1, 1], "part": ["A", "A", "B", "B"]})
> # Writes a hive-style directory tree: part=A/ and part=B/
> df.to_parquet("test_partitioned_filter", partition_cols=["part"], engine="pyarrow")
> dataset = ds.dataset("test_partitioned_filter", format="parquet", partitioning="hive")
> # Take the first fragment, i.e. a single file from a single partition
> fragment = list(dataset.get_fragments())[0]
> {code}
> {code}
> In [31]: dataset.to_table(filter=ds.field("part") == "A").to_pandas()
> Out[31]:
>    dummy part
> 0      1    A
> 1      1    A
> In [32]: fragment.split_by_row_group(ds.field("part") == "A")
> ---------------------------------------------------------------------------
> ArrowInvalid Traceback (most recent call last)
> <ipython-input-32-371cba80fd6f> in <module>
> ----> 1 fragment.split_by_row_group(ds.field("part") == "A")
> ~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.ParquetFileFragment.split_by_row_group()
> ~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset._insert_implicit_casts()
> ~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
> ~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
> ArrowInvalid: Field named 'part' not found or not unique in the schema.
> {code}
> This is admittedly an unusual thing to do: a fragment from a partitioned dataset already comes from a single partition, so an equality filter on the partition field is either true or false for the whole fragment. Still, it would be nice if users could pass the full filter to {{split_by_row_group}} instead of first stripping out the partition-field part themselves.
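
To illustrate the behaviour requested above, here is a minimal pure-Python sketch of simplifying a filter against a fragment's partition values before it reaches the file's own columns. It is only a model of the idea, not Arrow's expression machinery: the function name {{simplify_for_fragment}}, the (field, value) tuple encoding of equality conjuncts, and the {{partition}} dict are all hypothetical and do not exist in pyarrow.

```python
# Illustrative only: a pure-Python model, not pyarrow's expression API.
# A filter is modelled as a list of (field, value) equality conjuncts
# that are implicitly joined by AND.

def simplify_for_fragment(conjuncts, partition_values):
    """Drop equality conjuncts already decided by the fragment's partition.

    Returns the conjuncts that still need to be checked against the
    columns physically stored in the file, or None when the partition
    contradicts the filter and the whole fragment can be skipped.
    """
    remaining = []
    for field, value in conjuncts:
        if field in partition_values:
            if partition_values[field] != value:
                return None  # e.g. part == "B" on the part=A fragment
            continue  # true for every row of this fragment, so drop it
        remaining.append((field, value))
    return remaining


# Hypothetical partition values for the fragment under part=A/
partition = {"part": "A"}

print(simplify_for_fragment([("part", "A")], partition))
print(simplify_for_fragment([("part", "A"), ("dummy", 1)], partition))
print(simplify_for_fragment([("part", "B")], partition))
```

With this kind of simplification the 'part' field never has to be looked up in the file's physical schema, which is where the reproducer above fails.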
--
This message was sent by Atlassian Jira
(v8.3.4#803005)