Posted to issues@arrow.apache.org by "meng qingyou (Jira)" <ji...@apache.org> on 2021/01/08 03:42:00 UTC
[jira] [Created] (ARROW-11178) [Rust] StructArray: handling duplicate field names
meng qingyou created ARROW-11178:
------------------------------------
Summary: [Rust] StructArray: handling duplicate field names
Key: ARROW-11178
URL: https://issues.apache.org/jira/browse/ARROW-11178
Project: Apache Arrow
Issue Type: Improvement
Reporter: meng qingyou
The Arrow spec leaves the handling of duplicate field names to implementors.
The C++ implementation can either ignore duplicates or raise an error; the Java implementation can ignore, append, replace, or raise an error. Both use ignore as the default. Here are the references:
* [https://github.com/apache/arrow/blob/57376d28cf433bed95f19fa44c1e90a780ba54e8/cpp/src/arrow/type.cc]
* [https://github.com/apache/arrow/blob/25c736d48dc289f457e74d15d05db65f6d539447/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractStructVector.java]
I'm not an expert in databases or data science, but as far as I know, in the traditional RDBMS domain it is unusual to allow duplicate field names. Furthermore, in the data analysis domain, perhaps it is usual to normalize/clean various kinds of bad/dirty data *interactively* with tools like `pandas`?
Back to the problem, here is an example: given the duplicate field names A A A B B, a user who knows the actual data MAY choose to replace the first A with the second A, append the third A, and ignore the second B. Or was the duplication just a mistake?
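To make the A A A B B example concrete, here is a minimal sketch in plain Rust of what per-duplicate control could look like. Nothing in it is part of the arrow crate: `OnDuplicate`, `resolve_duplicates`, and the callback signature are hypothetical names chosen for illustration, and the rule that `Replace` overwrites the most recently kept field of the same name is an assumption. The function returns the indices of the input fields that survive, in output order.

```rust
use std::collections::HashMap;

/// Hypothetical actions for a field whose name was already seen.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OnDuplicate {
    /// Drop the new field and keep the existing one.
    Ignore,
    /// Overwrite the most recently kept field of the same name, in place.
    Replace,
    /// Keep the new field in addition to the existing one(s).
    Append,
    /// Refuse the input.
    Error,
}

/// Returns the indices (into `names`) of the fields that are kept, in output order.
/// `decide` is only called for duplicates; it receives the field name and the
/// 1-based occurrence number of that name (2 for the second A, 3 for the third).
pub fn resolve_duplicates<F>(names: &[&str], mut decide: F) -> Result<Vec<usize>, String>
where
    F: FnMut(&str, usize) -> OnDuplicate,
{
    let mut kept: Vec<usize> = Vec::new();
    // name -> position in `kept` of the most recently kept field with that name
    let mut last_slot: HashMap<&str, usize> = HashMap::new();
    // name -> number of occurrences seen so far
    let mut seen: HashMap<&str, usize> = HashMap::new();

    for (i, &name) in names.iter().enumerate() {
        let count = seen.entry(name).or_insert(0);
        *count += 1;
        let occurrence = *count;

        match last_slot.get(name).copied() {
            // First time this name appears: always keep it.
            None => {
                last_slot.insert(name, kept.len());
                kept.push(i);
            }
            // Duplicate: ask the caller what to do.
            Some(slot) => match decide(name, occurrence) {
                OnDuplicate::Ignore => {}
                OnDuplicate::Replace => kept[slot] = i,
                OnDuplicate::Append => {
                    last_slot.insert(name, kept.len());
                    kept.push(i);
                }
                OnDuplicate::Error => {
                    return Err(format!("duplicate field name `{}` at index {}", name, i));
                }
            },
        }
    }
    Ok(kept)
}
```

With the choices from the example (replace on the second A, append on the third A, ignore the second B), the kept indices are 1, 2 and 3: the second A takes the first A's slot, the third A is appended, and only the first B survives. The point of the callback is that no single global policy can express this; the decision has to come from someone who knows the data.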
Quote from [~nevi_me]: "I also prefer raising an error by default, as that'll make users aware very quickly". It is not acceptable to silently append/ignore/replace duplicate fields, producing unexpected results that the user is not aware of at all.
If we choose to support `replace`, `ignore` or `append`, we must at least let the user control the exact behavior. For IPC data, perhaps custom metadata (for file, message and field) is the only choice. I suggest we just record this problem here and keep raising an error until it is really necessary to support other solutions.
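For reference, the "keep raising an error" behavior amounts to a uniqueness check over the field names before a struct is built. The sketch below is standalone Rust and does not reflect how the arrow crate actually validates `StructArray`; `DuplicateFieldName` and `ensure_unique_field_names` are hypothetical names used only to illustrate the idea.

```rust
use std::collections::HashSet;
use std::fmt;

/// Hypothetical error reported for the first repeated field name.
#[derive(Debug, PartialEq, Eq)]
pub struct DuplicateFieldName {
    pub name: String,
    /// Index of the first field that repeats an earlier name.
    pub index: usize,
}

impl fmt::Display for DuplicateFieldName {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "duplicate field name `{}` at position {}", self.name, self.index)
    }
}

impl std::error::Error for DuplicateFieldName {}

/// Fails on the first field whose name has already been seen.
pub fn ensure_unique_field_names(names: &[&str]) -> Result<(), DuplicateFieldName> {
    let mut seen = HashSet::new();
    for (index, &name) in names.iter().enumerate() {
        // `insert` returns false when the name was already present.
        if !seen.insert(name) {
            return Err(DuplicateFieldName { name: name.to_string(), index });
        }
    }
    Ok(())
}
```

Failing early like this reports the offending name and position to the user right away, which matches the preference quoted above; any `replace`/`ignore`/`append` support could later be added as an explicit opt-in on top of it.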
--
This message was sent by Atlassian Jira
(v8.3.4#803005)