Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2022/10/04 10:14:00 UTC
[jira] [Commented] (ARROW-17813) [Python] Nested ExtensionArray conversion to/from pandas/numpy

    [ https://issues.apache.org/jira/browse/ARROW-17813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612550#comment-17612550 ] 

Joris Van den Bossche commented on ARROW-17813:
-----------------------------------------------

{quote}*ExtensionArray => pandas*

Just for discussion, I was curious whether you had any thoughts around using the extension scalar as a fallback mechanism
{quote}

I was just wondering the same in ARROW-17535, forgetting that you had brought it up here as well. I opened a dedicated JIRA for this part: ARROW-17925

> [Python] Nested ExtensionArray conversion to/from pandas/numpy
> --------------------------------------------------------------
>
>                 Key: ARROW-17813
>                 URL: https://issues.apache.org/jira/browse/ARROW-17813
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 9.0.0
>            Reporter: Chang She
>            Assignee: Miles Granger
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 2h
>  Remaining Estimate: 0h
>
> user@ thread: [https://lists.apache.org/thread/dhnxq0g4kgdysjowftfv3z5ngj780xpb]
> repro gist: [https://gist.github.com/changhiskhan/4163f8cec675a2418a69ec9168d5fdd9]
> *Arrow => numpy/pandas*
> For a non-nested array, pa.ExtensionArray.to_numpy automatically "lowers" to the storage type (as expected). However, this is not done for nested arrays:
> {code:python}
> import pyarrow as pa
> class LabelType(pa.ExtensionType):
>     def __init__(self):
>         super(LabelType, self).__init__(pa.string(), "label")
>     def __arrow_ext_serialize__(self):
>         return b""
>     @classmethod
>     def __arrow_ext_deserialize__(cls, storage_type, serialized):
>         return LabelType()
>     
> storage = pa.array(["dog", "cat", "horse"])
> ext_arr = pa.ExtensionArray.from_storage(LabelType(), storage)
> offsets = pa.array([0, 1])
> list_arr = pa.ListArray.from_arrays(offsets, ext_arr)
> list_arr.to_numpy()
> {code}
> {code:java}
> ---------------------------------------------------------------------------
> ArrowNotImplementedError                  Traceback (most recent call last)
> Cell In [15], line 1
> ----> 1 list_arr.to_numpy()
> File /mnt/lance/.venv/lance/lib/python3.10/site-packages/pyarrow/array.pxi:1445, in pyarrow.lib.Array.to_numpy()
> File /mnt/lance/.venv/lance/lib/python3.10/site-packages/pyarrow/error.pxi:121, in pyarrow.lib.check_status()
> ArrowNotImplementedError: Not implemented type for Arrow list to pandas: extension<label<LabelType>>
> {code}
> As mentioned on the user thread linked from the top, a fairly generic solution would just have the conversion default to the storage array's to_numpy.
>  
> *pandas/numpy => Arrow*
> Similarly, conversion to Arrow is also difficult for nested extension types. Say I have a pandas DataFrame with a list-of-string column that I want to convert to a list-of-label Array. Currently I have to:
> 1. Convert the list-of-string (storage) numpy array to pa.list_(pa.string())
> 2. Convert the string values array to an ExtensionArray, then reconstitute a list<extension> array using the ExtensionArray combined with the offsets from the result of step 1
> {code:python}
> import pyarrow as pa
> import pandas as pd
> df = pd.DataFrame({'labels': [["dog", "horse", "cat"], ["person", "person", "car", "car"]]})
> list_of_storage = pa.array(df.labels)
> ext_values = pa.ExtensionArray.from_storage(LabelType(), list_of_storage.values)
> list_of_ext = pa.ListArray.from_arrays(offsets=list_of_storage.offsets, values=ext_values)
> {code}
> For non-nested columns, one can achieve easier conversion by defining a pandas extension dtype, but I don't think that works for a nested column. You would instead have to fall back to something like `pa.ExtensionArray.from_storage` (or `from_pandas`?) to do the trick. Even that doesn't necessarily work for something like a dictionary column, because you'd have to pass in the dictionary somehow. Off the cuff, could one provide a custom lambda to `pa.Table.from_pandas` that is applied to specified column names or data types?
> Thanks in advance for the consideration!


