Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2022/09/23 16:41:00 UTC

[jira] [Commented] (ARROW-17813) [Python] Nested ExtensionArray conversion to/from pandas/numpy

    [ https://issues.apache.org/jira/browse/ARROW-17813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17608842#comment-17608842 ] 

Joris Van den Bossche commented on ARROW-17813:
-----------------------------------------------

*Arrow => numpy/pandas*

For numpy, we can indeed fall back to converting the storage array. That's also what {{ExtensionArray.to_numpy()}} does at the moment. However, that is currently implemented in Python (https://github.com/apache/arrow/blob/356e7f836c145966ebbeb65c3b65d82348e4234e/python/pyarrow/array.pxi#L2795), while the ListArray conversion is done in C++. So we would need to move that logic of falling back to the storage type into the pyarrow C++ code (which should be doable, I think).
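
For illustration, here is a minimal sketch of doing that fallback manually today (assuming the {{LabelType}} and {{list_arr}} from the reproduction below); it is essentially what the C++ conversion would have to do internally:

{code:python}
import pyarrow as pa

# Assumes LabelType and list_arr as defined in the reproduction below.
# Rebuild the list array on top of the extension child's storage array;
# the plain list<string> -> numpy conversion path then applies.
storage_child = list_arr.values.storage   # ExtensionArray -> its storage array
storage_list = pa.ListArray.from_arrays(list_arr.offsets, storage_child)
storage_list.to_numpy(zero_copy_only=False)
{code}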

For conversion to pandas, for plain ExtensionArrays this is controlled by whether there is an equivalent pandas extension dtype to convert to. So the question is whether this should be done for ExtensionArrays within a nested type as well. That would get a bit more complicated, as we would then need to call back into Python from C++ (this is basically covered by ARROW-17535).
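
For reference, the hook that controls this for flat arrays is {{ExtensionType.to_pandas_dtype()}}; the rough sketch below uses {{pd.StringDtype}} purely as an illustration of "an equivalent pandas extension dtype":

{code:python}
import pandas as pd
import pyarrow as pa

class LabelType(pa.ExtensionType):
    def __init__(self):
        super().__init__(pa.string(), "label")

    def __arrow_ext_serialize__(self):
        return b""

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls()

    # If this returns a pandas ExtensionDtype that implements __from_arrow__,
    # to_pandas() uses it for a top-level column of this type. The open
    # question above is whether the nested list<label> path should consult
    # this hook as well (which requires calling back into Python from C++).
    def to_pandas_dtype(self):
        return pd.StringDtype()
{code}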

*pandas/numpy => Arrow*

One way to make this a bit easier would be to cast to the final type, something like {{list_of_storage.cast(pa.list_(LabelType()))}}.
This is not yet possible, but there is some work happening in that area: ARROW-14500 is about casting the storage type to the extension type, and ARROW-15545 is a different issue related to casting of extension types, but it might actually also solve the former; there is an open PR for it: https://github.com/apache/arrow/pull/14106. We should verify whether that PR also enables this cast.
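
Concretely, the cast-based end state would look roughly like the sketch below; note that this does not work today, and whether the linked PR enables it is exactly what needs to be verified:

{code:python}
import pyarrow as pa

# Assumes the LabelType class from the reproduction below.
# Hypothetical one-step conversion: cast a plain list<string> array
# directly to list<extension<label>>. Currently this raises
# ArrowNotImplementedError; ARROW-14500 / PR 14106 are about enabling
# the storage-type -> extension-type cast.
list_of_storage = pa.array([["dog", "horse", "cat"], ["person", "person", "car", "car"]])
list_of_ext = list_of_storage.cast(pa.list_(LabelType()))
{code}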

> For non-nested columns, one can achieve easier conversion by defining a pandas extension dtype, but I don't think that works for a nested column.

Indeed, that won't work without defining a separate pandas extension dtype for this nested type (until pandas supports nested types properly).

> Off the cuff, one could provide a custom lambda to `pa.Table.from_pandas` that is applied to either specified column names or data types?

That could be one option. But maybe we should start with enabling basic conversion (through the storage type) for extension types in the array conversion, which currently fails:

{code:python}
# this could be the equivalent of `pa.ExtensionArray.from_storage(LabelType(), pa.array(["dog", "cat", "horse"]))` ?
>>> pa.array(["dog", "cat", "horse"], type=LabelType())
ArrowNotImplementedError: extension
{code}

If the above works, I think it should also work to specify a schema with the extension type in the Table.from_pandas conversion. 
(we could still make it easier by allowing the type to be specified for one specific column, instead of having to specify the full schema)
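
For comparison, the explicit construction from storage does work today; a minimal sketch (again assuming the {{LabelType}} from the report below), together with what the schema-based {{Table.from_pandas}} call could then look like:

{code:python}
import pandas as pd
import pyarrow as pa

# Works today: build the extension array explicitly from its storage.
# This is what pa.array(..., type=LabelType()) could do implicitly.
storage = pa.array(["dog", "cat", "horse"])
ext_arr = pa.ExtensionArray.from_storage(LabelType(), storage)

# The goal for the nested case (hypothetical until the above conversion
# and the nested handling are in place):
df = pd.DataFrame({"labels": [["dog", "horse", "cat"], ["person", "car"]]})
schema = pa.schema([("labels", pa.list_(LabelType()))])
# table = pa.Table.from_pandas(df, schema=schema)
{code}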

> [Python] Nested ExtensionArray conversion to/from pandas/numpy
> --------------------------------------------------------------
>
>                 Key: ARROW-17813
>                 URL: https://issues.apache.org/jira/browse/ARROW-17813
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 9.0.0
>            Reporter: Chang She
>            Priority: Major
>
> user@ thread: [https://lists.apache.org/thread/dhnxq0g4kgdysjowftfv3z5ngj780xpb]
> repro gist: [https://gist.github.com/changhiskhan/4163f8cec675a2418a69ec9168d5fdd9]
> *Arrow => numpy/pandas*
> For a non-nested array, pa.ExtensionArray.to_numpy automatically "lowers" to the storage type (as expected). However, this is not done for nested arrays:
> {code:python}
> import pyarrow as pa
> class LabelType(pa.ExtensionType):
>     def __init__(self):
>         super(LabelType, self).__init__(pa.string(), "label")
>     def __arrow_ext_serialize__(self):
>         return b""
>     @classmethod
>     def __arrow_ext_deserialize__(cls, storage_type, serialized):
>         return LabelType()
>     
> storage = pa.array(["dog", "cat", "horse"])
> ext_arr = pa.ExtensionArray.from_storage(LabelType(), storage)
> offsets = pa.array([0, 1])
> list_arr = pa.ListArray.from_arrays(offsets, ext_arr)
> list_arr.to_numpy()
> {code}
> {code}
> ---------------------------------------------------------------------------
> ArrowNotImplementedError                  Traceback (most recent call last)
> Cell In [15], line 1
> ----> 1 list_arr.to_numpy()
> File /mnt/lance/.venv/lance/lib/python3.10/site-packages/pyarrow/array.pxi:1445, in pyarrow.lib.Array.to_numpy()
> File /mnt/lance/.venv/lance/lib/python3.10/site-packages/pyarrow/error.pxi:121, in pyarrow.lib.check_status()
> ArrowNotImplementedError: Not implemented type for Arrow list to pandas: extension<label<LabelType>>
> {code}
> As mentioned on the user thread linked from the top, a fairly generic solution would just have the conversion default to the storage array's to_numpy.
>  
> *pandas/numpy => Arrow*
> Equivalently, conversion to Arrow is also difficult for nested extension types:
> say I have a pandas DataFrame with a column of list-of-string and I want to convert that to a list-of-label Array. Currently I have to:
> 1. Convert the list-of-string (storage) numpy array to a pa.list_(pa.string()) array
> 2. Convert the string values array to an ExtensionArray, then reconstitute a list<extension> array using that ExtensionArray combined with the offsets from the result of step 1
> {code:python}
> import pyarrow as pa
> import pandas as pd
> df = pd.DataFrame({'labels': [["dog", "horse", "cat"], ["person", "person", "car", "car"]]})
> list_of_storage = pa.array(df.labels)
> ext_values = pa.ExtensionArray.from_storage(LabelType(), list_of_storage.values)
> list_of_ext = pa.ListArray.from_arrays(offsets=list_of_storage.offsets, values=ext_values)
> {code}
> For non-nested columns, one can achieve easier conversion by defining a pandas extension dtype, but I don't think that works for a nested column. You would instead have to fall back to something like `pa.ExtensionArray.from_storage` (or `from_pandas`?) to do the trick. Even that doesn't necessarily work for something like a dictionary column, because you'd have to pass in the dictionary somehow. Off the cuff, one could provide a custom lambda to `pa.Table.from_pandas` that is applied to either specified column names or data types?
> Thanks in advance for the consideration!


