Posted to jira@arrow.apache.org by "&res (Jira)" <ji...@apache.org> on 2021/05/08 17:59:00 UTC
[jira] [Commented] (ARROW-12677) [Python] Add a mask argument to pyarrow.StructArray.from_arrays
[ https://issues.apache.org/jira/browse/ARROW-12677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17341380#comment-17341380 ]
&res commented on ARROW-12677:
------------------------------
[~westonpace] thanks for looking into this.
I'm not sure if this is the right place to mention it, but I now have the same issue with ListArray, and I'm wondering whether it would be worth making the same change there.
Here's an example where I have a list of structs, but some of the lists are null:
* Using pyarrow.array (works, but requires turning columns into rows)
{code:python}
import pyarrow

list_of_struct = pyarrow.list_(
    pyarrow.struct([pyarrow.field("foo", pyarrow.string())])
)
# Rows are given as Python objects; None marks a null list or a null struct.
array = pyarrow.array(
    [[("hello",), ("World",)], [], None, [None, ("foo",), ("bar",)]],
    type=list_of_struct,
)
print(array)
{code}
{code:java}
[
-- is_valid: all not null
-- child 0 type: string
[
"hello",
"World"
],
-- is_valid: all not null
-- child 0 type: string
[],
null,
-- is_valid:
[
false,
true,
true
]
-- child 0 type: string
[
"",
"foo",
"bar"
]
] {code}
* Using ListArray.from_arrays (it's not possible to mark a list as null; it falls back to an empty list):
{code:python}
struct_type = pyarrow.struct([pyarrow.field("foo", pyarrow.string())])
foo = pyarrow.array(["hello", "World", None, "foo", "bar"])
# Buffer 1 of a boolean array holds its packed bits, reused here as a validity bitmap.
validity_mask = pyarrow.array([True, True, False, True, True])
validity_bitmask = validity_mask.buffers()[1]
struct_array = pyarrow.StructArray.from_buffers(
    struct_type, len(foo), [validity_bitmask], children=[foo]
)
list_array = pyarrow.ListArray.from_arrays(
    offsets=[0, 2, 2, 2, 5], values=struct_array
)
print(list_array)
{code}
{code:java}
[
-- is_valid: all not null
-- child 0 type: string
[
"hello",
"World"
],
-- is_valid: all not null
-- child 0 type: string
[],
-- is_valid: all not null
-- child 0 type: string
[],
-- is_valid:
[
false,
true,
true
]
-- child 0 type: string
[
null,
"foo",
"bar"
]
]
{code}
* Using the "from_buffers" workaround (it works, but it's not a great API):
{code:python}
struct_type = pyarrow.struct([pyarrow.field("foo", pyarrow.string())])
foo_values = pyarrow.array(["hello", "World", None, "foo", "bar"])
struct_validity_mask = pyarrow.array([True, True, False, True, True])
struct_validity_bitmask = struct_validity_mask.buffers()[1]
struct_array = pyarrow.StructArray.from_buffers(
    struct_type,
    len(foo_values),
    [struct_validity_bitmask],
    children=[foo_values],
)
list_validity_mask = pyarrow.array([True, True, False, True])
list_validity_buffer = list_validity_mask.buffers()[1]
list_offsets_buffer = pyarrow.array([0, 2, 2, 2, 5], pyarrow.int32()).buffers()[1]
list_array = pyarrow.ListArray.from_buffers(
    type=pyarrow.list_(struct_type),
    length=4,
    buffers=[list_validity_buffer, list_offsets_buffer],
    children=[struct_array],
)
print(list_array)
{code}
{code:java}
[
-- is_valid: all not null
-- child 0 type: string
[
"hello",
"World"
],
-- is_valid: all not null
-- child 0 type: string
[],
null,
-- is_valid:
[
false,
true,
true
]
-- child 0 type: string
[
null,
"foo",
"bar"
]
]
{code}
> [Python] Add a mask argument to pyarrow.StructArray.from_arrays
> ---------------------------------------------------------------
>
> Key: ARROW-12677
> URL: https://issues.apache.org/jira/browse/ARROW-12677
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: &res
> Assignee: Weston Pace
> Priority: Trivial
> Labels: pull-request-available
> Time Spent: 20m
> Remaining Estimate: 0h
>
> The Python API for creating a StructArray from a list of arrays doesn't allow passing a missing-value mask.
> At the moment the only way to create a StructArray with missing values is to use `pyarrow.array` and pass a list of tuples.
> {code:python}
> >>> pyarrow.array(
> [
> None,
> (1, "foo"),
> ],
> type=pyarrow.struct(
> [pyarrow.field('col1', pyarrow.int64()), pyarrow.field("col2", pyarrow.string())]
> )
> )
> -- is_valid:
> [
> false,
> true
> ]
> -- child 0 type: int64
> [
> 0,
> 1
> ]
> -- child 1 type: string
> [
> "",
> "foo"
> ]
> >>> pyarrow.StructArray.from_arrays(
> [
> [None, 1],
> [None, "foo"]
> ],
> fields=[pyarrow.field('col1', pyarrow.int64()), pyarrow.field("col2", pyarrow.string())]
> )
> -- is_valid: all not null
> -- child 0 type: int64
> [
> null,
> 1
> ]
> -- child 1 type: string
> [
> null,
> "foo"
> ]
> {code}
> The C++ API allows it, so it should be easy to add.
> See [this Stack Overflow question|https://stackoverflow.com/questions/67417110/]
--
This message was sent by Atlassian Jira
(v8.3.4#803005)