You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2021/11/06 02:21:00 UTC

[jira] [Created] (ARROW-14621) [Python] read_feather's "columns" argument claims to support any iterable but does not accept pandas series

Weston Pace created ARROW-14621:
-----------------------------------

             Summary: [Python] read_feather's "columns" argument claims to support any iterable but does not accept pandas series
                 Key: ARROW-14621
                 URL: https://issues.apache.org/jira/browse/ARROW-14621
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
            Reporter: Weston Pace


From https://github.com/apache/arrow/issues/11500:

{quote}
According to the read_feather documentation it accepts any sequence type ( https://arrow.apache.org/docs/python/generated/pyarrow.feather.read_feather.html ) so not strictly lists, but also tuples etc... As far as they contain int or str.

While a pandas.Index adheres to the sequence protocol it provides custom implementation of many methods you would expect in a container. I personally think that supporting all possible implementations of classes that adheres to sequence protocol even when they are heavily custom is probably out of scope, so I would personally change the docstring to mention list|tuple and avoid confusion.

Implementing support for pandas.Index is fairly quick but that would probably end up casting the index to a tuple or list if we want to keep the code generic. And that would mean introducing an extra cost that the user doesn't know it's paying when invoking that method. I guess it's more reasonable to make that cost explicit and have user cast to list or tuple
{quote}
[~amol-]

{quote}
Technically the sequence protocol does not define equality. The problem seems to originate from the line sorted(set(columns)) == columns. We are relying on list == sequence => bool which is not valid when the sequence is a numpy array (list == np.ndarray => np.ndarray).

The correct method for comparing sequences seems to be converting both sides to list or using imap (although given we are already doing sorted(set(columns)) I think imap would be overkill).

I'm in favor of @amol- 's general point though. Changing the docs to mention list|tuple is probably a good solution.
{quote}
[~westonpace]

{quote}
On the other hand, supporting numpy arrays for those kind of arguments that expect a list is also relatively easy. We have quite some other methods that document a certain argument as list, while still accepting a numpy array (because in most cases we just rely on 1) length check and 2) iteration through the values).

I think the most important would be to have consistency. But if we have some APIs that are currently strict on requiring a list, and others loose on requiring any iterable, then there are two options: make everything strict, or fix those cases that are now strict to have them work with any iterable.
But to me it would also feel strange to start adding strict isinstance(.., list) checks in several places that currently are perfectly fine with accepting a numpy array or any iterable ...
{quote}
[~jorisvandenbossche]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)