You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2021/11/09 15:14:00 UTC

[jira] [Commented] (ARROW-14621) [Python] read_feather's "columns" argument claims to support any iterable but does not accept pandas series

    [ https://issues.apache.org/jira/browse/ARROW-14621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17441215#comment-17441215 ] 

Joris Van den Bossche commented on ARROW-14621:
-----------------------------------------------

Proposal: coerce arguments that accept a list-like to list (`list(...)`) where needed (so where not just iterating through the object). 

That seems more generally useful and cleaner (compared to eg explicitly starting to check for list as argument type)

> [Python] read_feather's "columns" argument claims to support any iterable but does not accept pandas series
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-14621
>                 URL: https://issues.apache.org/jira/browse/ARROW-14621
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Weston Pace
>            Priority: Major
>
> From https://github.com/apache/arrow/issues/11500:
> {quote}
> According to the read_feather documentation it accepts any sequence type ( https://arrow.apache.org/docs/python/generated/pyarrow.feather.read_feather.html ) so not strictly lists, but also tuples etc... As far as they contain int or str.
> While a pandas.Index adheres to the sequence protocol it provides custom implementation of many methods you would expect in a container. I personally think that supporting all possible implementations of classes that adheres to sequence protocol even when they are heavily custom is probably out of scope, so I would personally change the docstring to mention list|tuple and avoid confusion.
> Implementing support for pandas.Index is fairly quick but that would probably end up casting the index to a tuple or list if we want to keep the code generic. And that would mean introducing an extra cost that the user doesn't know it's paying when invoking that method. I guess it's more reasonable to make that cost explicit and have user cast to list or tuple
> {quote}
> [~amol-]
> {quote}
> Technically the sequence protocol does not define equality. The problem seems to originate from the line sorted(set(columns)) == columns. We are relying on list == sequence => bool which is not valid when the sequence is a numpy array (list == np.ndarray => np.ndarray).
> The correct method for comparing sequences seems to be converting both sides to list or using imap (although given we are already doing sorted(set(columns)) I think imap would be overkill).
> I'm in favor of @amol- 's general point though. Changing the docs to mention list|tuple is probably a good solution.
> {quote}
> [~westonpace]
> {quote}
> On the other hand, supporting numpy arrays for those kind of arguments that expect a list is also relatively easy. We have quite some other methods that document a certain argument as list, while still accepting a numpy array (because in most cases we just rely on 1) length check and 2) iteration through the values).
> I think the most important would be to have consistency. But if we have some APIs that are currently strict on requiring a list, and others loose on requiring any iterable, then there are two options: make everything strict, or fix those cases that are now strict to have them work with any iterable.
> But to me it would also feel strange to start adding strict isinstance(.., list) checks in several places that currently are perfectly fine with accepting a numpy array or any iterable ...
> {quote}
> [~jorisvandenbossche]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)