Posted to jira@arrow.apache.org by "Alessandro Molina (Jira)" <ji...@apache.org> on 2021/05/03 14:02:00 UTC
[jira] [Commented] (ARROW-12609) TypeError when accessing length of an invalid ListScalar

    [ https://issues.apache.org/jira/browse/ARROW-12609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17338383#comment-17338383 ] 

Alessandro Molina commented on ARROW-12609:
-------------------------------------------

This looks like an interesting question in terms of possible behaviours.



I mean, once we declare an {{Array}} as containing lists (so {{Array[x]}} is a {{ListScalar}}), what is the proper behaviour when {{Array[x]}} is not actually a list?

Normally, when a value doesn't respect the schema, we seem to throw an error:
{code:java}
>>> pa.array([[1], 1], type=pa.list_(pa.int32()))
...
pyarrow.lib.ArrowTypeError: Could not convert 1 with type int: was not a sequence or recognized null for conversion to list type
{code}

For {{None}} that is not the case, though, since it is needed to represent nulls:

{code}
>>> pa.array([[1], None], type=pa.list_(pa.int32()))
<pyarrow.lib.ListArray object at 0x120e69e20>
[
  [
    1
  ],
  null
]
{code}

The question at that point is what should {{Array[x]}} return? Does it make sense to return a {{ListScalar}} when in reality it's not a list? 

{code}
>>> pa.array([[1], None], type=pa.list_(pa.int32()))[1]
<pyarrow.ListScalar: None>
{code}

Why not return a {{NullScalar}} in that case? It seems to me that {{pa.list_(pa.int32())}} describes a schema that supports null values, so the array should just return a null value when it hits one.

> TypeError when accessing length of an invalid ListScalar
> --------------------------------------------------------
>
>                 Key: ARROW-12609
>                 URL: https://issues.apache.org/jira/browse/ARROW-12609
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 3.0.0, 4.0.0
>         Environment: Windows 10
> python=3.9.2
> pyarrow=4.0.0 (3.0.0 has the same behavior)
>            Reporter: Sergey Mozharov
>            Priority: Major
>
> For list-like data types, the scalar corresponding to a missing value has a '__len__' attribute, but a TypeError is raised when it is accessed:
> {code:java}
> import pyarrow as pa
> data_type = pa.list_(pa.struct([
>     ('a', pa.int64()),
>     ('b', pa.bool_())
> ]))
> data = [[{'a': 1, 'b': False}, {'a': 2, 'b': True}], None]
> arr = pa.array(data, type=data_type)
> missing_scalar = arr[1]  # <pyarrow.ListScalar: None>
> assert hasattr(missing_scalar, '__len__')
> assert len(missing_scalar) == 0  # --> TypeError: object of type 'NoneType' has no len()
> {code}
> Expected behavior: length is expected to be 0.
> This issue causes several pandas unit tests to fail when an ExtensionArray backed by arrow array with this data type is built.
> This behavior is also inconsistent with a similar example where the data type is a struct:
> {code:java}
> import pyarrow as pa
> data_type = pa.struct([
>     ('a', pa.int64()),
>     ('b', pa.bool_())
> ])
> data = [{'a': 1, 'b': False}, None]
> arr = pa.array(data, type=data_type)
> missing_scalar = arr[1]  # <pyarrow.StructScalar: None>
> assert hasattr(missing_scalar, '__len__')
> assert len(missing_scalar) == 0  # Ok
> {code}
>  In this second example the TypeError is not raised.


