Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2022/10/04 10:22:00 UTC
[jira] [Comment Edited] (ARROW-17925) [Python] Use ExtensionScalar.as_py() as fallback in ExtensionArray to_pandas?

    [ https://issues.apache.org/jira/browse/ARROW-17925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612555#comment-17612555 ] 

Joris Van den Bossche edited comment on ARROW-17925 at 10/4/22 10:21 AM:
-------------------------------------------------------------------------

To give a concrete copy-pastable example (using the one from the docs: https://arrow.apache.org/docs/dev/python/extending_types.html#custom-scalar-conversion):

{code:python}
from collections import namedtuple
import pyarrow as pa

Point3D = namedtuple("Point3D", ["x", "y", "z"])

class Point3DScalar(pa.ExtensionScalar):
    def as_py(self) -> Point3D:
        return Point3D(*self.value.as_py())

class Point3DType(pa.PyExtensionType):
    def __init__(self):
        pa.PyExtensionType.__init__(self, pa.list_(pa.float32(), 3))

    def __reduce__(self):
        return Point3DType, ()

    def __arrow_ext_scalar_class__(self):
        return Point3DScalar
{code}

{code}
storage = pa.array([[1, 2, 3], [4, 5, 6]], pa.list_(pa.float32(), 3))
arr = pa.ExtensionArray.from_storage(Point3DType(), storage)

>>> arr.to_pandas().values
array([array([1., 2., 3.], dtype=float32),
       array([4., 5., 6.], dtype=float32)], dtype=object)

>>> arr.to_pylist()
[Point3D(x=1.0, y=2.0, z=3.0), Point3D(x=4.0, y=5.0, z=6.0)]
{code}

So here, {{to_pylist()}} returns the custom {{Point3D}} scalars, while {{to_pandas()}} returns the raw numpy arrays that result from converting the storage list array.

We _could_ use the scalar's {{as_py()}} conversion automatically in {{to_pandas()}} as well, if we detect that the ExtensionType raises NotImplementedError for {{to_pandas_dtype()}} and returns a custom ExtensionScalar subclass from {{\_\_arrow_ext_scalar_class\_\_}}.

On the other hand, you can also do this yourself by overriding {{to_pandas()}}? 

And what about {{to_numpy()}}?


> [Python] Use ExtensionScalar.as_py() as fallback in ExtensionArray to_pandas?
> -----------------------------------------------------------------------------
>
>                 Key: ARROW-17925
>                 URL: https://issues.apache.org/jira/browse/ARROW-17925
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Joris Van den Bossche
>            Priority: Major
>
> This was raised in ARROW-17813 by [~changhiskhan]:
> {quote}*ExtensionArray => pandas*
> Just for discussion, I was curious whether you had any thoughts around using the extension scalar as a fallback mechanism. It's a lot simpler to define an ExtensionScalar with `as_py` than a pandas extension dtype. So if an ExtensionArray doesn't have an equivalent pandas dtype, would it make sense to convert it to just an object series whose elements are the result of `as_py`? {quote}
> and I also mentioned this in ARROW-17535:
> {quote}That actually brings up a question: if an ExtensionType defines an ExtensionScalar (but not an associated pandas dtype, or custom to_numpy conversion), should we use this scalar's {{as_py()}} for the to_numpy/to_pandas conversion as well for plain extension arrays? (not the nested case)
> Because currently, if you have an ExtensionArray like that (for example using the example from the docs: https://arrow.apache.org/docs/dev/python/extending_types.html#custom-scalar-conversion), we still use the storage type conversion for to_numpy/to_pandas, and only use the scalar's conversion in {{to_pylist}}.{quote}


