Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2021/05/17 10:30:00 UTC

[jira] [Commented] (ARROW-12609) [Python] TypeError when accessing length of an invalid ListScalar

    [ https://issues.apache.org/jira/browse/ARROW-12609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346055#comment-17346055 ] 

Joris Van den Bossche commented on ARROW-12609:
-----------------------------------------------

bq. why not return {{NullScalar}} in such case? It seems to me that {{pa.list_(pa.int32())}} means a schema that supports null values in the list, then the array should just return a null value when it hits one.

[~amol-] The returned ListScalar _is_ a null value, though. Because each type supports null values, each scalar type also supports its own null scalars. A {{NullScalar}} is what you would get when accessing a single element of a {{NullArray}}:

{code}
>>> import pyarrow as pa
>>> arr = pa.array([None, None])
>>> arr
<pyarrow.lib.NullArray object at 0x7fee45555940>
2 nulls
>>> arr[0]
<pyarrow.NullScalar: None>
{code}

bq. Expected behavior: length is expected to be 0.

[~mosalx] I think you could also argue that a missing list scalar has "no defined length". Why would it be zero? It is an empty list that has zero length, not a missing one.
The problem, though, is that Python doesn't support this kind of missing or undefined value for integers ({{\_\_len\_\_}} needs to return an integer, or raise an error).
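
To illustrate that constraint, here is a minimal pure-Python sketch. {{MissingList}} and {{try_len}} are hypothetical names made up for this example and are not part of pyarrow; the class only mimics a container that tries to report "no defined length" by returning {{None}} from {{\_\_len\_\_}}:

```python
class MissingList:
    """Hypothetical stand-in for a null list value; not a pyarrow class."""

    def __len__(self):
        # Attempt to signal "no defined length" with None.
        return None


def try_len(obj):
    """Call len() and report a TypeError as text instead of raising it."""
    try:
        return len(obj)
    except TypeError as exc:
        return f"TypeError: {exc}"


# len() rejects the None returned by __len__ and raises TypeError.
result = try_len(MissingList())
print(result)
```

So a Python-level length can only be a non-negative integer or an exception; there is no way to return a "missing" length through {{len}}.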

For example, if you use the pyarrow compute kernel instead of Python's builtin {{len}} to get the length of a list element, we actually "propagate" the null, and the null list has a null length:

{code}
>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> pc.list_value_length(pa.scalar([1, 2], type=pa.list_(pa.int32())))
<pyarrow.Int32Scalar: 2>
>>> pc.list_value_length(pa.scalar(None, type=pa.list_(pa.int32())))
<pyarrow.Int32Scalar: None>
{code}
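
Code that needs a Python-level length could mirror this null propagation by checking validity before calling {{len}}. The following is only a sketch under assumptions: it relies on the scalar exposing an {{is_valid}} attribute, as pyarrow scalars do, while {{length_or_none}} and {{StubListScalar}} are hypothetical names used so the example runs without pyarrow:

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class StubListScalar:
    """Hypothetical stand-in for a list scalar; not part of pyarrow."""

    values: Optional[Sequence[int]]

    @property
    def is_valid(self) -> bool:
        # A null scalar carries no values.
        return self.values is not None

    def __len__(self) -> int:
        # Mirrors the reported behaviour: len() fails when the value is null.
        return len(self.values)


def length_or_none(scalar) -> Optional[int]:
    """Return len(scalar), or None when the scalar is null."""
    return len(scalar) if scalar.is_valid else None


print(length_or_none(StubListScalar([1, 2])))  # 2
print(length_or_none(StubListScalar(None)))    # None
```

Returning {{None}} keeps "empty" (length 0) and "missing" (no length) distinguishable, which is the same distinction the compute kernel makes.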

> [Python] TypeError when accessing length of an invalid ListScalar
> -----------------------------------------------------------------
>
>                 Key: ARROW-12609
>                 URL: https://issues.apache.org/jira/browse/ARROW-12609
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 3.0.0, 4.0.0
>         Environment: Windows 10
> python=3.9.2
> pyarrow=4.0.0 (3.0.0 has the same behavior)
>            Reporter: Sergey Mozharov
>            Priority: Major
>
> For List-like data types, the scalar corresponding to a missing value has a '__len__' attribute, but a TypeError is raised when it is called:
> {code:java}
> import pyarrow as pa
> data_type = pa.list_(pa.struct([
>     ('a', pa.int64()),
>     ('b', pa.bool_())
> ]))
> data = [[{'a': 1, 'b': False}, {'a': 2, 'b': True}], None]
> arr = pa.array(data, type=data_type)
> missing_scalar = arr[1]  # <pyarrow.ListScalar: None>
> assert hasattr(missing_scalar, '__len__')
> assert len(missing_scalar) == 0  # --> TypeError: object of type 'NoneType' has no len()
> {code}
> Expected behavior: length is expected to be 0.
> This issue causes several pandas unit tests to fail when an ExtensionArray backed by arrow array with this data type is built.
> This behavior is also inconsistent with a similar example where the data type is a struct:
> {code:java}
> import pyarrow as pa
> data_type = pa.struct([
>     ('a', pa.int64()),
>     ('b', pa.bool_())
> ])
> data = [{'a': 1, 'b': False}, None]
> arr = pa.array(data, type=data_type)
> missing_scalar = arr[1]  # <pyarrow.StructScalar: None>
> assert hasattr(missing_scalar, '__len__')
> assert len(missing_scalar) == 0  # Ok
> {code}
> In this second example the TypeError is not raised.


