Posted to issues@arrow.apache.org by "&res (Jira)" <ji...@apache.org> on 2022/07/27 17:01:00 UTC

[jira] [Created] (ARROW-17228) dataset.write_data should use Scanner.projected_schema when passed a scanner with projected columns

&res created ARROW-17228:
----------------------------

             Summary: dataset.write_data should use Scanner.projected_schema when passed a scanner with projected columns
                 Key: ARROW-17228
                 URL: https://issues.apache.org/jira/browse/ARROW-17228
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 8.0.0
         Environment: Python 3.9.13
pyarrow 8.0.0
            Reporter: &res


In the code below:
{code:python}
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.Table.from_arrays(
    [
        pa.array(['a', 'b', 'c'], pa.string()),
        pa.array(['a', 'b', 'c'], pa.string()),
    ],
    names=['region', "Other"]
)
table_dataset = ds.dataset(table)
columns = {
    "Region": ds.field('region'),
    "Other": ds.field('Other'),
}
scanner = table_dataset.scanner(columns=columns)

ds.write_dataset(
    scanner,
    'newpath',
    partitioning=['Region'], partitioning_flavor='hive',
    format='parquet')
 {code}
I get this exception:
{code:java}
KeyError: 'Column Region does not exist in schema'
 {code}
I suspect this is because write_dataset resolves the partitioning columns against the wrong schema. It should use scanner.projected_schema (rather than scanner.dataset_schema), since 'Region' only exists after the projection is applied.

I think it's just a matter of updating this line: https://github.com/apache/arrow/blob/bc6c4988691cf60ecac67542b2daa2ac19fde5d9/python/pyarrow/dataset.py#L967

 

The issue was raised here: https://stackoverflow.com/questions/73139467/how-to-incorporate-projected-columns-in-scanner-into-new-dataset-partitioning

 


