Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/12/21 17:01:00 UTC

[jira] [Created] (ARROW-11001) [C++][Dataset] Enable column renaming (in physical schema -> dataset schema) in Dataset scanning

Joris Van den Bossche created ARROW-11001:
---------------------------------------------

             Summary: [C++][Dataset] Enable column renaming (in physical schema -> dataset schema) in Dataset scanning
                 Key: ARROW-11001
                 URL: https://issues.apache.org/jira/browse/ARROW-11001
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Joris Van den Bossche


Currently, we allow dropping/adding columns when scanning the actual sources of a Dataset (e.g. if newer files in the dataset have additional columns), but we should also provide a way to specify fields that are renamed in certain files of the dataset.

While it _might_ be possible to also provide some convenience for this in the discovery factories, it's probably best to start by seeing how this could be added to the actual {{Dataset}} class and the lower-level constructors (such as the main {{FileSystemDataset}} constructor from fragments, or {{from_paths}}).

What I am thinking right now is that we would need an (optional) mapping of "field ref/name in physical schema -> name in projected/dataset schema" for each fragment of a dataset.
However, that might not fully fit the current design, as the fragment doesn't know about the dataset schema and only sees it at projection time.
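
To make the idea concrete, here is a rough sketch of what such a per-fragment mapping could do at projection time. It is plain Python and deliberately does not use any Arrow API: {{project_fragment}}, its {{renames}} argument and the dict-of-lists stand-in for a record batch are all hypothetical names made up for illustration, and filling missing columns with nulls is an assumption about how the existing adding/dropping behaviour would combine with renaming.

```python
from typing import Dict, List, Optional

# Stand-in for a record batch: column name -> list of values (not an Arrow type).
Columns = Dict[str, list]


def project_fragment(
    physical: Columns,
    dataset_schema: List[str],
    renames: Optional[Dict[str, str]] = None,
) -> Columns:
    """Project one fragment's columns onto the dataset schema.

    `renames` is the hypothetical per-fragment mapping
    "name in physical schema -> name in dataset schema".
    """
    renames = renames or {}
    num_rows = len(next(iter(physical.values()), []))

    # Step 1: apply the per-fragment renames to the physical columns.
    renamed: Columns = {}
    for name, values in physical.items():
        target = renames.get(name, name)
        if target in renamed:
            raise ValueError(f"duplicate column after renaming: {target!r}")
        renamed[target] = values

    # Step 2 (assumed): columns the fragment lacks are filled with nulls,
    # and columns not in the dataset schema are dropped.
    return {
        name: renamed.get(name, [None] * num_rows)
        for name in dataset_schema
    }


dataset_schema = ["id", "value", "label"]
old_file = {"id": [1, 2], "val": [0.5, 1.5]}            # older file, column named "val"
new_file = {"id": [3], "value": [2.5], "label": ["x"]}  # newer file, renamed + extra column

batches = [
    project_fragment(old_file, dataset_schema, renames={"val": "value"}),
    project_fragment(new_file, dataset_schema),
]
```

Note that in this sketch the mapping is passed in alongside the dataset schema at projection time, so the fragment itself never needs to know the dataset schema; whether the mapping should instead be stored on the fragment is exactly the open design question.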

cc [~bkietz]


