Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2022/01/27 14:47:00 UTC

[jira] [Commented] (ARROW-15033) [Python] No way to create ListArray from sub-arrays

    [ https://issues.apache.org/jira/browse/ARROW-15033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483189#comment-17483189 ] 

Joris Van den Bossche commented on ARROW-15033:
-----------------------------------------------

[~vladfi] sorry for the slow follow-up here. 

A struct array and a list array are fundamentally different in their memory layout, so what you show for StructArray doesn't translate well to ListArray.

A struct array has different fields where each field is stored as a separate, contiguous array. So creating this StructArray from those individual arrays as you show in your example makes sense. 

However, a ListArray doesn't have separate fields. The values of the ListArray are all stored contiguously in memory in a single array. So for example, your reference array in

{code}
reference = pa.array([[1, 2, 3], [4, 5, 6]])
{code}

has the values stored as [1, 2, 3, 4, 5, 6]:

{code}
In [19]: reference.values
Out[19]: 
<pyarrow.lib.Int64Array object at 0x7f75bec218e0>
[
  1,
  2,
  3,
  4,
  5,
  6
]
{code}

So for that reason, it is less logical to create it from the arrays in the way you want in the {{from_sub_arrays}} example. Moreover, this would only be possible for a FixedSizeListArray, and not for a variable-size ListArray (e.g. you can have {{pa.array([[1, 2], [3, 4, 5]])}}, which could not be created with {{from_sub_arrays}}).

In the end, if you have such sub-arrays, and mentally consider them as such, I think using a StructArray is more natural (although the fact you need to "make up" names can certainly be a downside here, if there are no natural names in your application).

Alternatively, if you have such sub-arrays and want to create a ListArray from them, you can manipulate the sub-arrays to get them into the layout needed by Arrow. For example:

{code}
In [26]: flat_array = [val for row in zip(*sub_arrays) for val in row]

In [27]: flat_array
Out[27]: 
[<pyarrow.Int64Scalar: 1>,
 <pyarrow.Int64Scalar: 2>,
 <pyarrow.Int64Scalar: 3>,
 <pyarrow.Int64Scalar: 4>,
 <pyarrow.Int64Scalar: 5>,
 <pyarrow.Int64Scalar: 6>]
{code}

and then use {{FixedSizeListArray.from_arrays}}. However, that's a bit complicated by the fact that this results in pyarrow Scalars, which are not handled very well in the {{pa.array}} constructor. So for this option, you would first have to convert your pyarrow sub-arrays to numpy arrays or python lists.

---

That said, while IMO a dedicated creation function is not warranted here, it could be useful to have functionality to convert a StructArray (with all fields of the same type) into a FixedSizeListArray and vice versa.
 


> [Python] No way to create ListArray from sub-arrays
> ---------------------------------------------------
>
>                 Key: ARROW-15033
>                 URL: https://issues.apache.org/jira/browse/ARROW-15033
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Python
>            Reporter: Vlad Firoiu
>            Priority: Major
>
> I'd like to create a `ListArray` from a list of sub-arrays, similar to how `StructArray.from_arrays` can create a `StructArray` from a sequence of names and arrays. A similarly-named function, `ListArray.from_arrays` does exist, but it does something completely different.


