Posted to jira@arrow.apache.org by "Alenka Frim (Jira)" <ji...@apache.org> on 2022/01/12 11:36:00 UTC

[jira] [Created] (ARROW-15311) [C++][Python] Opening a partitioned dataset with schema and filter

Alenka Frim created ARROW-15311:
-----------------------------------

             Summary: [C++][Python] Opening a partitioned dataset with schema and filter
                 Key: ARROW-15311
                 URL: https://issues.apache.org/jira/browse/ARROW-15311
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Documentation
            Reporter: Alenka Frim


Add a note to the docs: when both a partitioning and a schema are specified while opening a dataset, and the partition column names are not stored in the data files themselves, the schema must also include the partition column names (directory or hive partitioning) if the dataset will later be filtered on those columns.

Example:

{code:python}
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

# Define the data
table = pa.table({'one': [-1, np.nan, 2.5],
                  'two': ['foo', 'bar', 'baz'],
                  'three': [True, False, True]})

# Write to partitioned dataset
# The files will include columns "two" and "three"
pq.write_to_dataset(table, root_path='dataset_name',
                    partition_cols=['one'])

# Reading the dataset with a schema that omits the partition column "one"
# errors as soon as a filter references that column:

schema = pa.schema([("three", "double")])
data = ds.dataset("dataset_name", partitioning="hive", schema=schema)
subset = ds.field("one") == 2.5
data.to_table(filter=subset)

# But it succeeds when the partition column is included in the schema:
schema = pa.schema([("three", "double"), ("one", "double")])
data = ds.dataset("dataset_name", partitioning="hive", schema=schema)
subset = ds.field("one") == 2.5
data.to_table(filter=subset)

{code}
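As an aside, a related pattern that could be mentioned in the same doc note: instead of folding the partition column into the file schema, the partitioning itself can be described explicitly with {{ds.partitioning(...)}}, which declares the partition column and its type up front. This is a minimal sketch, not taken from the report; the sample values and the temporary root path are illustrative (NaN partition values are avoided to keep directory names simple).

{code:python}
import tempfile

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Illustrative data; no NaN in the partition column
table = pa.table({'one': [-1.0, 0.5, 2.5],
                  'two': ['foo', 'bar', 'baz'],
                  'three': [True, False, True]})

root = tempfile.mkdtemp()
pq.write_to_dataset(table, root_path=root, partition_cols=['one'])

# Declare the partition column and its type explicitly, rather than
# passing a full schema that must repeat it.
part = ds.partitioning(pa.schema([("one", pa.float64())]), flavor="hive")
data = ds.dataset(root, partitioning=part)

# Filtering on the partition column now works; "one" is part of the
# discovered dataset schema.
result = data.to_table(filter=ds.field("one") == 2.5)
{code}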



--
This message was sent by Atlassian Jira
(v8.20.1#820001)