Posted to jira@arrow.apache.org by "Alenka Frim (Jira)" <ji...@apache.org> on 2022/01/12 11:36:00 UTC
[jira] [Created] (ARROW-15311) [C++][Python] Opening a partitioned dataset with schema and filter
Alenka Frim created ARROW-15311:
-----------------------------------
Summary: [C++][Python] Opening a partitioned dataset with schema and filter
Key: ARROW-15311
URL: https://issues.apache.org/jira/browse/ARROW-15311
Project: Apache Arrow
Issue Type: Improvement
Components: Documentation
Reporter: Alenka Frim
Add a note to the docs: if both a schema and a partitioning are specified when opening a dataset, and the partitioning column names are not present in the data files themselves, then the schema needs to include the partitioning columns (directory or hive partitioning) in case filtering on them will be done.
Example:
{code:python}
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds
# Define the data
table = pa.table({'one': [-1, np.nan, 2.5],
                  'two': ['foo', 'bar', 'baz'],
                  'three': [True, False, True]})

# Write to a partitioned dataset; the partition column "one" is encoded
# in the directory names, so the files themselves will only include
# columns "two" and "three"
pq.write_to_dataset(table, root_path='dataset_name',
                    partition_cols=['one'])
# Reading the partitioned dataset with a schema that does not include
# the partitioning column "one" will error when filtering on it
schema = pa.schema([("three", "double")])
data = ds.dataset("dataset_name", partitioning="hive", schema=schema)
subset = ds.field("one") == 2.5
data.to_table(filter=subset)  # raises an error
# It will not error if the schema also includes the partitioning column
schema = pa.schema([("three", "double"), ("one", "double")])
data = ds.dataset("dataset_name", partitioning="hive", schema=schema)
subset = ds.field("one") == 2.5
data.to_table(filter=subset)
{code}
--
This message was sent by Atlassian Jira
(v8.20.1#820001)