Posted to issues@arrow.apache.org by "&res (Jira)" <ji...@apache.org> on 2022/07/27 17:01:00 UTC
[jira] [Created] (ARROW-17228) dataset.write_data should use Scanner.projected_schema when passed a scanner with projected columns
&res created ARROW-17228:
----------------------------
Summary: dataset.write_data should use Scanner.projected_schema when passed a scanner with projected columns
Key: ARROW-17228
URL: https://issues.apache.org/jira/browse/ARROW-17228
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 8.0.0
Environment: Python 3.9.13
pyarrow 8.0.0
Reporter: &res
Running the code below, which partitions on a column that is renamed in the scanner projection:
{code:java}
import pyarrow as pa
import pyarrow.dataset as ds
table = pa.Table.from_arrays(
    [
        pa.array(['a', 'b', 'c'], pa.string()),
        pa.array(['a', 'b', 'c'], pa.string()),
    ],
    names=['region', "Other"]
)
table_dataset = ds.dataset(table)
columns = {
    "Region": ds.field('region'),
    "Other": ds.field('Other'),
}
scanner = table_dataset.scanner(columns=columns)
ds.write_dataset(
    scanner,
    'newpath',
    partitioning=['Region'],
    partitioning_flavor='hive',
    format='parquet',
)
{code}
I get this exception:
{code:java}
KeyError: 'Column Region does not exist in schema'
{code}
I suspect this is because write_dataset resolves the partitioning columns against the wrong schema: it should use scanner.projected_schema rather than scanner.dataset_schema.
I think it's just a matter of updating this line: https://github.com/apache/arrow/blob/bc6c4988691cf60ecac67542b2daa2ac19fde5d9/python/pyarrow/dataset.py#L967
The issue was raised here: https://stackoverflow.com/questions/73139467/how-to-incorporate-projected-columns-in-scanner-into-new-dataset-partitioning