Posted to issues@arrow.apache.org by "Joris Van den Bossche (JIRA)" <ji...@apache.org> on 2019/08/07 11:01:00 UTC
[jira] [Commented] (ARROW-6158) [Python] possible to create StructArray with type that conflicts with child array's types
[ https://issues.apache.org/jira/browse/ARROW-6158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901978#comment-16901978 ]
Joris Van den Bossche commented on ARROW-6158:
----------------------------------------------
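Found an example where it starts to give wrong results: after taking a subset with {{Take}}, the children are reported with the declared types (int32, double) and child 1 contains garbage values, while {{validate()}} still passes without complaint.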
{code}
In [6]: subset = a.take(pa.array([0, 2]))
In [7]: subset
Out[7]:
<pyarrow.lib.StructArray object at 0x7f450a6f8468>
-- is_valid: all not null
-- child 0 type: int32
[
1,
2
]
-- child 1 type: double
[
2.122e-314,
0
]
In [8]: subset.validate()
In [9]: subset.to_pandas()
Out[9]: array([{'a': 1, 'b': 2.121995791e-314}, {'a': 2, 'b': 0.0}], dtype=object)
{code}
> [Python] possible to create StructArray with type that conflicts with child array's types
> -----------------------------------------------------------------------------------------
>
> Key: ARROW-6158
> URL: https://issues.apache.org/jira/browse/ARROW-6158
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Reporter: Joris Van den Bossche
> Priority: Major
>
> Using the Python interface as example. This creates a {{StructArray}} where the field types don't match the child array types:
> {code}
> import pyarrow as pa
> a = pa.array([1, 2, 3], type=pa.int64())
> b = pa.array(['a', 'b', 'c'], type=pa.string())
> inconsistent_fields = [pa.field('a', pa.int32()), pa.field('b', pa.float64())]
> a = pa.StructArray.from_arrays([a, b], fields=inconsistent_fields)
> {code}
> The above works fine. I didn't find anything that errors (e.g. conversion to pandas, slicing), and validation also passes, but the declared type is inconsistent with the actual child types:
> {code}
> In [2]: a
> Out[2]:
> <pyarrow.lib.StructArray object at 0x7f450af52eb8>
> -- is_valid: all not null
> -- child 0 type: int64
> [
> 1,
> 2,
> 3
> ]
> -- child 1 type: string
> [
> "a",
> "b",
> "c"
> ]
> In [3]: a.type
> Out[3]: StructType(struct<a: int32, b: double>)
> In [4]: a.to_pandas()
> Out[4]:
> array([{'a': 1, 'b': 'a'}, {'a': 2, 'b': 'b'}, {'a': 3, 'b': 'c'}],
> dtype=object)
> In [5]: a.validate()
> {code}
> Shouldn't this be disallowed somehow? (It could be checked in the Python {{from_arrays}} method, but maybe also in {{StructArray::Make}}, which already checks the number of fields vs. arrays and that the array lengths are consistent.)
> Similarly to the discussion in ARROW-6132, I would also expect {{ValidateArray}} to catch this.
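To make the proposed check concrete, here is a minimal sketch of what a Python-side guard could look like. The helper names {{find_type_mismatches}} and {{check_struct_children}} are hypothetical and not part of pyarrow; the sketch only assumes that each child array exposes a {{.type}} attribute and each field exposes {{.name}} and {{.type}}, and that types compare with {{!=}}. It is not a description of how {{from_arrays}} or {{StructArray::Make}} are, or would be, implemented.

```python
def find_type_mismatches(arrays, fields):
    """Return one (index, field name, declared type, actual type) tuple for
    each child whose array type differs from the declared field type.

    Hypothetical helper, not part of pyarrow. It relies only on
    ``array.type``, ``field.name`` and ``field.type``.
    """
    if len(arrays) != len(fields):
        raise ValueError(
            f"got {len(arrays)} arrays but {len(fields)} fields"
        )
    mismatches = []
    for i, (arr, field) in enumerate(zip(arrays, fields)):
        # Compare the declared field type against the child's actual type.
        if arr.type != field.type:
            mismatches.append((i, field.name, field.type, arr.type))
    return mismatches


def check_struct_children(arrays, fields):
    """Raise TypeError if any child array conflicts with its declared field.

    Hypothetical guard that could run before building the struct array.
    """
    mismatches = find_type_mismatches(arrays, fields)
    if mismatches:
        details = "; ".join(
            f"child {i} ({name!r}): field type {declared}, array type {actual}"
            for i, name, declared, actual in mismatches
        )
        raise TypeError(f"field types conflict with child arrays: {details}")
```

With the two child arrays and {{inconsistent_fields}} from the report, calling {{check_struct_children}} before {{pa.StructArray.from_arrays}} would be expected to raise, since both children (int64 vs. int32, string vs. double) differ from their declared fields.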