Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/08/22 20:29:08 UTC

[GitHub] [arrow] westonpace commented on pull request #13938: ARROW-17388: [C++][Python] Error on WriteTable if duplicate field names

westonpace commented on PR #13938:
URL: https://github.com/apache/arrow/pull/13938#issuecomment-1222954260

I've actually been poking around this area recently (#13782). I would say this is somewhat related to the problem of "schema evolution". The current behavior is undocumented, but it attempts to handle some variation in schema between files. As a result, field references need to be names, and we look up each name in the fragment schema to figure out which fragment column maps to that field in the dataset schema.

For example, if the fragments have schemas:

Fragment 1
a,b,c

Fragment 2
c,a,b

Dataset schema
b,c,a

If the user asks for "b", then we look for column 1 in fragment 1 and column 2 in fragment 2 (zero-based indices). This approach breaks down pretty quickly when a fragment has multiple columns with the same name, because the name lookup becomes ambiguous.
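
To make the lookup concrete, here is a minimal sketch in plain Python. It is not Arrow's actual implementation: `resolve_column` is a hypothetical helper, and schemas are modelled as simple lists of field names. It only illustrates why a per-fragment name lookup works for the example above and has no obvious answer once a name is duplicated.

```python
def resolve_column(fragment_fields, name):
    """Return the zero-based index of `name` in a fragment's field list.

    Hypothetical helper for illustration only; raises KeyError if the
    name is absent and ValueError if it appears more than once.
    """
    matches = [i for i, field in enumerate(fragment_fields) if field == name]
    if not matches:
        raise KeyError(name)
    if len(matches) > 1:
        # With duplicates there is no single column to map the name to.
        raise ValueError(f"ambiguous field name {name!r}: indices {matches}")
    return matches[0]


fragment_1 = ["a", "b", "c"]
fragment_2 = ["c", "a", "b"]

print(resolve_column(fragment_1, "b"))  # 1
print(resolve_column(fragment_2, "b"))  # 2
```

Calling `resolve_column(["a", "b", "a"], "a")` raises `ValueError` in this sketch; whether the real code should error, pick the first match, or return every match is exactly the open question.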

Once #13782 merges, perhaps we could add a "no evolution" option, which would be the default if there is only a single fragment. This option would allow for duplicate columns.

What should be returned if the user were to run...

```
import pyarrow.parquet as pq
pq.read_table('file.parquet', use_legacy_dataset=False, columns=["a"])
```
