Posted to issues@arrow.apache.org by "meng qingyou (Jira)" <ji...@apache.org> on 2021/01/08 03:42:00 UTC

[jira] [Created] (ARROW-11178) [Rust] StructArray: handling duplicate field names

meng qingyou created ARROW-11178:
------------------------------------

             Summary: [Rust] StructArray: handling duplicate field names
                 Key: ARROW-11178
                 URL: https://issues.apache.org/jira/browse/ARROW-11178
             Project: Apache Arrow
          Issue Type: Improvement
            Reporter: meng qingyou


The Arrow spec leaves the handling of duplicate field names to implementors.

The C++ implementation can either ignore duplicates or raise an error; the Java implementation can ignore, append, replace, or raise an error. Both use ignore as the default. Here are the references:
 * [https://github.com/apache/arrow/blob/57376d28cf433bed95f19fa44c1e90a780ba54e8/cpp/src/arrow/type.cc]
 * [https://github.com/apache/arrow/blob/25c736d48dc289f457e74d15d05db65f6d539447/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractStructVector.java]

I'm not an expert in databases or data science, but as far as I know, in the traditional RDBMS domain it's unusual to allow duplicate field names. Furthermore, in the data analysis domain, perhaps it's usual to normalize/clean various kinds of bad/dirty data *interactively* with tools like `pandas`?

Back to the problem, here is an example: given duplicate field names A A A B B, a user who knows the actual data MAY choose to replace the first A with the second A, append the third A, and ignore the second B. Or was the duplication just a mistake?

Quote from [~nevi_me]: "I also prefer raising an error by default, as that'll make users aware very quickly". It is not acceptable to silently append/ignore/replace duplicate fields, producing unexpected results that the user is not aware of at all.

If we choose to support `replace`, `ignore` or `append`, we must at least let the user control the exact behavior. For IPC data, perhaps custom metadata (for file, message and field) is the only choice. I suggest we just record this problem here and keep raising an error until it's really necessary to support other solutions.
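
To make the four behaviors concrete, here is a minimal sketch of what a user-controlled policy could look like. Everything in it is hypothetical: `DuplicatePolicy` and `resolve_fields` are illustrative names, not part of the arrow crate, and the generic `T` merely stands in for a child array. It applies one policy to all fields, so it does not cover the per-field choices from the A A A B B example above.

```rust
use std::collections::HashMap;

/// Hypothetical policy for duplicate field names (not an arrow crate type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DuplicatePolicy {
    /// Reject the input; the default suggested in this issue.
    Error,
    /// Keep the first occurrence and drop later ones.
    Ignore,
    /// A later occurrence overwrites the first one in place.
    Replace,
    /// Keep every occurrence, duplicates included.
    Append,
}

/// Resolve `(name, value)` pairs under the given policy.
/// `T` is a placeholder for whatever a field carries (e.g. a child array).
pub fn resolve_fields<T>(
    fields: Vec<(String, T)>,
    policy: DuplicatePolicy,
) -> Result<Vec<(String, T)>, String> {
    let mut out: Vec<(String, T)> = Vec::with_capacity(fields.len());
    // Maps a field name to the index of its first occurrence in `out`.
    let mut first_index: HashMap<String, usize> = HashMap::new();

    for (name, value) in fields {
        match first_index.get(&name).copied() {
            // First time this name is seen: always keep it.
            None => {
                first_index.insert(name.clone(), out.len());
                out.push((name, value));
            }
            // Name already present: apply the chosen policy.
            Some(idx) => match policy {
                DuplicatePolicy::Error => {
                    return Err(format!("duplicate field name: {}", name));
                }
                DuplicatePolicy::Ignore => {} // drop the later occurrence
                DuplicatePolicy::Replace => out[idx].1 = value,
                DuplicatePolicy::Append => out.push((name, value)),
            },
        }
    }
    Ok(out)
}
```

With `DuplicatePolicy::Error`, a caller passing fields named A, A, B gets an `Err` naming the duplicate instead of a silently altered struct, which matches the "make users aware very quickly" preference quoted above.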



--
This message was sent by Atlassian Jira
(v8.3.4#803005)