Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/12/21 17:01:00 UTC
[jira] [Created] (ARROW-11001) [C++][Dataset] Enable column renaming (in physical schema -> dataset schema) in Dataset scanning
Joris Van den Bossche created ARROW-11001:
---------------------------------------------
Summary: [C++][Dataset] Enable column renaming (in physical schema -> dataset schema) in Dataset scanning
Key: ARROW-11001
URL: https://issues.apache.org/jira/browse/ARROW-11001
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Joris Van den Bossche
Currently, we allow dropping/adding columns when scanning the actual sources of a Dataset (e.g. if newer files in the dataset have additional columns), but we should also provide a way to specify fields that are renamed in certain files of the dataset.
While it _might_ be possible to also provide some convenience for this in the discovery factories, it's probably best to start by looking at how this could be added to the actual {{Dataset}} class and the lower-level constructors (such as the main {{FileSystemDataset}} constructor from fragments, or {{from_paths}}).
What I am thinking right now is that we would need an (optional) mapping of "field ref/name in physical schema -> name in projected/dataset schema" for each fragment of a dataset.
However, that might not fully fit the current design, as a fragment doesn't know about the dataset schema; it only sees that schema when it is projected.
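To make the mapping idea concrete, here is a minimal sketch in plain C++ with no Arrow types. {{FragmentSketch}}, {{ToDatasetNames}} and {{FindPhysicalIndex}} are hypothetical names used only for illustration; none of them exist in Arrow, and a real design would likely use field refs and schemas rather than bare strings. It assumes each fragment optionally carries a "physical name -> dataset name" map and that unmapped columns keep their physical name.

```cpp
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for a fragment: the column names physically present
// in one file, plus an optional "physical name -> dataset name" mapping.
struct FragmentSketch {
  std::vector<std::string> physical_names;
  std::unordered_map<std::string, std::string> renames;  // empty: no renames
};

// Translate the fragment's physical column names into dataset-schema names.
// Columns without a mapping entry keep their physical name.
std::vector<std::string> ToDatasetNames(const FragmentSketch& fragment) {
  std::vector<std::string> result;
  std::unordered_set<std::string> seen;
  for (const auto& name : fragment.physical_names) {
    auto it = fragment.renames.find(name);
    const std::string& mapped =
        (it == fragment.renames.end()) ? name : it->second;
    // Two physical columns must not end up with the same dataset name.
    if (!seen.insert(mapped).second) {
      throw std::invalid_argument("duplicate column after renaming: " + mapped);
    }
    result.push_back(mapped);
  }
  return result;
}

// Resolve a dataset-schema column name to its index in the fragment's
// physical columns, or std::nullopt if the fragment has no such column.
std::optional<std::size_t> FindPhysicalIndex(const FragmentSketch& fragment,
                                             const std::string& dataset_name) {
  const std::vector<std::string> names = ToDatasetNames(fragment);
  for (std::size_t i = 0; i < names.size(); ++i) {
    if (names[i] == dataset_name) return i;
  }
  return std::nullopt;
}
```

With this shape, an older file whose column is physically named "val" could carry the mapping {"val" -> "value"}, and a scan asking for the dataset column "value" would resolve to that physical column, while newer files that already use "value" need no mapping at all. The sketch keeps the mapping on the fragment, so the fragment still does not need to know the full dataset schema.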
cc [~bkietz]