Posted to jira@arrow.apache.org by "Jorge Leitão (Jira)" <ji...@apache.org> on 2021/10/19 16:14:00 UTC
[jira] [Updated] (ARROW-14383) [C++] [Python] Does a sliced StructArray roundtrip on c data interface?
[ https://issues.apache.org/jira/browse/ARROW-14383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jorge Leitão updated ARROW-14383:
---------------------------------
Description:
I am struggling to roundtrip a sliced StructArray over the C data interface.
Consider the array:
{code:python}
import pyarrow

fields = [
    ("f1", pyarrow.int32()),
    ("f2", pyarrow.string()),
]
# Build a 5-element struct array, then take the 2-element slice starting at index 1.
a = pyarrow.array(
    [
        {"f1": 1, "f2": "a"},
        None,
        {"f1": 3, "f2": None},
        {"f1": None, "f2": "d"},
        {"f1": None, "f2": None},
    ],
    pyarrow.struct(fields),
).slice(1, 2)
{code}
When reading this array from the C data interface, I get:
{code:java}
array: Ffi_ArrowArray {
    length: 2,
    null_count: 1,
    offset: 1,
    n_buffers: 1,
    n_children: 2,
    buffers: 0x00007f61796091c0,
    children: 0x00007f6179609280,
    dictionary: 0x0000000000000000,
    release: Some(
        0x00007f617aef2ba0,
    ),
    private_data: 0x00007f617960b3c0,
}
child #0: Ffi_ArrowArray {
    length: 5,
    null_count: 2,
    offset: 0,
    n_buffers: 2,
    n_children: 0,
    buffers: 0x00007f0f49609200,
    children: 0x0000000000000000,
    dictionary: 0x0000000000000000,
    release: Some(
        0x00007f0f4aec9ba0,
    ),
    private_data: 0x00007f0f4960b480,
}
child #1: Ffi_ArrowArray {
    length: 5,
    null_count: 2,
    offset: 0,
    n_buffers: 3,
    n_children: 0,
    buffers: 0x00007f0f49609240,
    children: 0x0000000000000000,
    dictionary: 0x0000000000000000,
    release: Some(
        0x00007f0f4aec9ba0,
    ),
    private_data: 0x00007f0f4960b540,
}
{code}
This does not seem consistent with what the Python API offers:
{code:python}
print(a.field(0).offset, len(a.field(0))) # 1 2 <- shouldn't it be 0 5?
{code}
Second, and more importantly, the condition that each child's length must equal the array's own length is violated (the children have length 5, while the array has length 2 in the example above).
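To make these numbers concrete, here is a minimal pure-Python model of the exported metadata. It is only a sketch: `ArrayMeta`, `field_view` and `children_match_parent` are hypothetical names, and the slicing rule (add the parent's offset to the child's offset and clamp the length to what remains in the child) is an assumption chosen because it reproduces the `1 2` printed by `a.field(0)`. It is not taken from the pyarrow source.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ArrayMeta:
    """Only the length/offset/children part of an exported ArrowArray."""
    length: int
    offset: int
    children: List["ArrayMeta"] = field(default_factory=list)


def field_view(parent: ArrayMeta, index: int) -> ArrayMeta:
    """Assumed rule: view a child through the parent's offset and length."""
    child = parent.children[index]
    remaining = max(child.length - parent.offset, 0)
    return ArrayMeta(
        length=min(parent.length, remaining),
        offset=child.offset + parent.offset,
    )


def children_match_parent(parent: ArrayMeta) -> bool:
    """Strict reading of the condition: every child is as long as the parent."""
    return all(c.length == parent.length for c in parent.children)


# Metadata as exported for `a` in the dump above (struct plus two children).
exported = ArrayMeta(length=2, offset=1,
                     children=[ArrayMeta(5, 0), ArrayMeta(5, 0)])

view = field_view(exported, 0)
print(view.offset, view.length)          # 1 2
print(children_match_parent(exported))   # False
```

Under this model the Python-level view of a field and the raw exported child disagree by exactly the parent's offset and length, which is the inconsistency described above.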
We could argue that a consumer MUST slice each child to achieve the desired behavior, but that won't roundtrip because, when writing the StructArray (after consuming it), we would now write:
{code}
write child: Ffi_ArrowArray {
    length: 2,
    null_count: 0,
    offset: 1,
    n_buffers: 2,
    n_children: 0,
    buffers: 0x00000000021c8b20,
    children: 0x0000000000000008,
    dictionary: 0x0000000000000000,
    release: Some(
        0x00007fb1f8d536c0,
    ),
    private_data: 0x00000000024f0db0,
}
write child: Ffi_ArrowArray {
    length: 2,
    null_count: 1,
    offset: 1,
    n_buffers: 3,
    n_children: 0,
    buffers: 0x00000000024998f0,
    children: 0x0000000000000008,
    dictionary: 0x0000000000000000,
    release: Some(
        0x00007fb1f8d536c0,
    ),
    private_data: 0x0000000002499910,
}
Ffi_ArrowArray {
    length: 2,
    null_count: 1,
    offset: 1,
    n_buffers: 1,
    n_children: 2,
    buffers: 0x00000000024f12d0,
    children: 0x00000000021c8ae0,
    dictionary: 0x0000000000000000,
    release: Some(
        0x00007fb1f8d536c0,
    ),
    private_data: 0x00000000024999c0,
}
{code}
which is then consumed as
{code}
print(b.field(0).offset, len(b.field(0))) # 2 1 <------------ why?
print(b.offset, len(b)) # 1 2 <-- OK
{code}
which causes the check in [this line|https://github.com/apache/arrow/blob/b73af9a1607caa4a04e1a11896aed6669847a4d4/cpp/src/arrow/array/validate.cc#L115] to fail.
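The `2 1` result can be reproduced with a small pure-Python model, under the assumption that a field view is obtained by adding the parent's offset to the child's offset and clamping the length to what remains in the child. `Meta` and `field_view` are hypothetical helper names used for illustration only; they are not pyarrow or Rust APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Meta:
    """Length and offset of one exported array (struct or child)."""
    length: int
    offset: int


def field_view(parent: Meta, child: Meta) -> Meta:
    """Assumed consumer rule: apply the parent's offset and length to a child."""
    remaining = max(child.length - parent.offset, 0)
    return Meta(min(parent.length, remaining), child.offset + parent.offset)


parent = Meta(length=2, offset=1)

# First export: the children are unsliced (length 5, offset 0).
original_child = Meta(length=5, offset=0)
first = field_view(parent, original_child)

# The consumer keeps the sliced child and re-exports it unchanged,
# together with a parent that still carries offset 1.
second = field_view(parent, first)

print(first.offset, first.length)    # 1 2
print(second.offset, second.length)  # 2 1
```

In this model the parent's offset is applied twice: once when the consumer slices the children, and again when the re-exported array is read. That would explain why the child ends up shorter than the parent.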
I was unable to find a test for a roundtrip of a sliced struct [in pyarrow tests|https://github.com/apache/arrow/blob/5ead37593472c42f61c76396dde7dcb8954bde70/python/pyarrow/tests/test_cffi.py] to use as a reference for my own test. It seems to me that slicing a StructArray should also slice its children accordingly, so that its C data interface yields a consistent result. Is that the intended behavior?
was:
I am struggling to roundtrip a sliced StructArray over the c data interface.
Consider the array:
{code:python}
fields = [
("f1", pyarrow.int32()),
("f2", pyarrow.string()),
]
a = pyarrow.array(
[
{"f1": 1, "f2": "a"},
None,
{"f1": 3, "f2": None},
{"f1": None, "f2": "d"},
{"f1": None, "f2": None},
],
pyarrow.struct(fields),
).slice(1, 2)
{code}
When reading this array from the c data interface, I get:
{code:java}
array: Ffi_ArrowArray {
length: 2,
null_count: 1,
offset: 1,
n_buffers: 1,
n_children: 2,
buffers: 0x00007f61796091c0,
children: 0x00007f6179609280,
dictionary: 0x0000000000000000,
release: Some(
0x00007f617aef2ba0,
),
private_data: 0x00007f617960b3c0,
}
child #0: Ffi_ArrowArray {
length: 5,
null_count: 2,
offset: 0,
n_buffers: 2,
n_children: 0,
buffers: 0x00007f0f49609200,
children: 0x0000000000000000,
dictionary: 0x0000000000000000,
release: Some(
0x00007f0f4aec9ba0,
),
private_data: 0x00007f0f4960b480,
}
child #1: Ffi_ArrowArray {
length: 5,
null_count: 2,
offset: 0,
n_buffers: 3,
n_children: 0,
buffers: 0x00007f0f49609240,
children: 0x0000000000000000,
dictionary: 0x0000000000000000,
release: Some(
0x00007f0f4aec9ba0,
),
private_data: 0x00007f0f4960b540,
}
{code}
This does not seem consistent with what the Python API offers:
{code:python}
print(a.field(0).offset, len(a.field(0))) # 1 2 <- shouldn't it be 5 0?
{code}
Secondly and most importantly, the condition that each child's length must equal the array's own length is violated (children length is 5, array's length is 2 in the example above).
We could argue that a consumer MUST slice each child to achieve the desired behavior, but that won't roundtrip because, when writing the StructArray (after consuming it), we would now write
{code}
write child: Ffi_ArrowArray {
length: 2,
null_count: 0,
offset: 1,
n_buffers: 2,
n_children: 0,
buffers: 0x00000000021c8b20,
children: 0x0000000000000008,
dictionary: 0x0000000000000000,
release: Some(
0x00007fb1f8d536c0,
),
private_data: 0x00000000024f0db0,
}
write child: Ffi_ArrowArray {
length: 2,
null_count: 1,
offset: 1,
n_buffers: 3,
n_children: 0,
buffers: 0x00000000024998f0,
children: 0x0000000000000008,
dictionary: 0x0000000000000000,
release: Some(
0x00007fb1f8d536c0,
),
private_data: 0x0000000002499910,
}
Ffi_ArrowArray {
length: 2,
null_count: 1,
offset: 1,
n_buffers: 1,
n_children: 2,
buffers: 0x00000000024f12d0,
children: 0x00000000021c8ae0,
dictionary: 0x0000000000000000,
release: Some(
0x00007fb1f8d536c0,
),
private_data: 0x00000000024999c0,
}
{code}
is consumed as
{code}
print(b.field(0).offset, len(b.field(0))) # 2 1 <------------ why?
print(b.offset, len(b)) # 1 2 <-- OK
{code}
which causes the check in [this line|https://github.com/apache/arrow/blob/b73af9a1607caa4a04e1a11896aed6669847a4d4/cpp/src/arrow/array/validate.cc#L115] to fail.
I was unable to find a test for a roundtrip of a sliced struct [in pyarrow tests|https://github.com/apache/arrow/blob/5ead37593472c42f61c76396dde7dcb8954bde70/python/pyarrow/tests/test_cffi.py] to compare my test with a reference test, but it seems to me that when we slice a StructArray, we should slice its children accordingly so that its C data interface yields a consistent result?
> [C++] [Python] Does a sliced StructArray roundtrip on c data interface?
> -----------------------------------------------------------------------
>
> Key: ARROW-14383
> URL: https://issues.apache.org/jira/browse/ARROW-14383
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 5.0.0
> Reporter: Jorge Leitão
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)