Posted to jira@arrow.apache.org by "&res (Jira)" <ji...@apache.org> on 2021/05/08 17:59:00 UTC

[jira] [Commented] (ARROW-12677) [Python] Add a mask argument to pyarrow.StructArray.from_arrays

    [ https://issues.apache.org/jira/browse/ARROW-12677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17341380#comment-17341380 ] 

&res commented on ARROW-12677:
------------------------------

[~westonpace] thanks for looking into this.

I'm not sure this is the right place to mention it, but I now have the same issue with ListArray, and I'm wondering whether it would be worth making the same change there.

 

Here's an example where I have a list of structs, but some of the lists are null:

 
 * Using pyarrow.array (works, but requires turning columns into rows)

{code:python}
import pyarrow

list_of_struct = pyarrow.list_(
    pyarrow.struct([pyarrow.field("foo", pyarrow.string())])
)
array = pyarrow.array(
    [[("hello",), ("World",)], [], None, [None, ("foo",), ("bar",)]],
    type=list_of_struct,
)
print(array)
{code}
{code:java}
[
  -- is_valid: all not null
  -- child 0 type: string
    [
      "hello",
      "World"
    ],
  -- is_valid: all not null
  -- child 0 type: string
    [],
  null,
  -- is_valid:
      [
      false,
      true,
      true
    ]
  -- child 0 type: string
    [
      "",
      "foo",
      "bar"
    ]
]
{code}
 

 * Using ListArray.from_arrays (it's not possible to mark a list as null; it falls back to an empty list)
{code:python}
struct_type = pyarrow.struct([pyarrow.field("foo", pyarrow.string())])
foo = pyarrow.array(["hello", "World", None, "foo", "bar"])
validity_mask = pyarrow.array([True, True, False, True, True])
validity_bitmask = validity_mask.buffers()[1]
struct_array = pyarrow.StructArray.from_buffers(
    struct_type, len(foo), [validity_bitmask], children=[foo]
)
list_array = pyarrow.ListArray.from_arrays(
    offsets=[0, 2, 2, 2, 5], values=struct_array
)
print(list_array)
{code}
{code:java}
[
  -- is_valid: all not null
  -- child 0 type: string
    [
      "hello",
      "World"
    ],
  -- is_valid: all not null
  -- child 0 type: string
    [],
  -- is_valid: all not null
  -- child 0 type: string
    [],
  -- is_valid:
      [
      false,
      true,
      true
    ]
  -- child 0 type: string
    [
      null,
      "foo",
      "bar"
    ]
]
{code}
 

 * Using the "from_buffers" workaround (it works, but it's not a great API):

{code:python}
struct_type = pyarrow.struct([pyarrow.field("foo", pyarrow.string())])
foo_values = pyarrow.array(["hello", "World", None, "foo", "bar"])
struct_validity_mask = pyarrow.array([True, True, False, True, True])
struct_validity_bitmask = struct_validity_mask.buffers()[1]
struct_array = pyarrow.StructArray.from_buffers(
    struct_type,
    len(foo_values),
    [struct_validity_bitmask],
    children=[foo_values],
)

list_validity_mask = pyarrow.array([True, True, False, True])
list_validity_buffer = list_validity_mask.buffers()[1]
list_offsets_buffer = pyarrow.array([0, 2, 2, 2, 5], pyarrow.int32()).buffers()[1]

list_array = pyarrow.ListArray.from_buffers(
    type=pyarrow.list_(struct_type),
    length=4,
    buffers=[list_validity_buffer, list_offsets_buffer],
    children=[struct_array],
)
print(list_array)
{code}
{code:java}
[
  -- is_valid: all not null
  -- child 0 type: string
    [
      "hello",
      "World"
    ],
  -- is_valid: all not null
  -- child 0 type: string
    [],
  null,
  -- is_valid:
      [
      false,
      true,
      true
    ]
  -- child 0 type: string
    [
      null,
      "foo",
      "bar"
    ]
]
{code}

> [Python] Add a mask argument to pyarrow.StructArray.from_arrays
> ---------------------------------------------------------------
>
>                 Key: ARROW-12677
>                 URL: https://issues.apache.org/jira/browse/ARROW-12677
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: &res
>            Assignee: Weston Pace
>            Priority: Trivial
>              Labels: pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> The Python API for creating a StructArray from a list of arrays doesn't allow passing a missing-value mask.
> At the moment, the only way to create a StructArray with missing values is to use `pyarrow.array` and pass a list of tuples.
> {code:python}
> >>> pyarrow.array(
>     [
>         None,
>         (1, "foo"),
>     ],
>     type=pyarrow.struct(
>         [pyarrow.field('col1', pyarrow.int64()), pyarrow.field("col2", pyarrow.string())]
>     )
> )
> -- is_valid:
>   [
>     false,
>     true
>   ]
> -- child 0 type: int64
>   [
>     0,
>     1
>   ]
> -- child 1 type: string
>   [
>     "",
>     "foo"
>   ]
> >>> pyarrow.StructArray.from_arrays(
>     [
>         [None, 1],
>         [None, "foo"]
>     ],
>     fields=[pyarrow.field('col1', pyarrow.int64()), pyarrow.field("col2", pyarrow.string())]
> )
> -- is_valid: all not null
> -- child 0 type: int64
>   [
>     null,
>     1
>   ]
> -- child 1 type: string
>   [
>     null,
>     "foo"
>   ]
> {code}
> The C++ API allows it, so it should be easy to add.
> See [this Stack Overflow question|https://stackoverflow.com/questions/67417110/].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)