Posted to jira@arrow.apache.org by "Jorge Leitão (Jira)" <ji...@apache.org> on 2021/10/19 16:14:00 UTC

[jira] [Updated] (ARROW-14383) [C++] [Python] Does a sliced StructArray roundtrip on c data interface?

     [ https://issues.apache.org/jira/browse/ARROW-14383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jorge Leitão updated ARROW-14383:
---------------------------------
    Description: 
I am struggling to roundtrip a sliced StructArray over the C data interface.

Consider the array:

{code:python}
import pyarrow

fields = [
    ("f1", pyarrow.int32()),
    ("f2", pyarrow.string()),
]
# Build a 5-element struct array, then take the 2-element slice starting at index 1.
a = pyarrow.array(
    [
        {"f1": 1, "f2": "a"},
        None,
        {"f1": 3, "f2": None},
        {"f1": None, "f2": "d"},
        {"f1": None, "f2": None},
    ],
    pyarrow.struct(fields),
).slice(1, 2)
{code}

When reading this array from the C data interface, I get:

{code:java}
array: Ffi_ArrowArray {
    length: 2,
    null_count: 1,
    offset: 1,
    n_buffers: 1,
    n_children: 2,
    buffers: 0x00007f61796091c0,
    children: 0x00007f6179609280,
    dictionary: 0x0000000000000000,
    release: Some(
        0x00007f617aef2ba0,
    ),
    private_data: 0x00007f617960b3c0,
}

child #0: Ffi_ArrowArray {
    length: 5,
    null_count: 2,
    offset: 0,
    n_buffers: 2,
    n_children: 0,
    buffers: 0x00007f0f49609200,
    children: 0x0000000000000000,
    dictionary: 0x0000000000000000,
    release: Some(
        0x00007f0f4aec9ba0,
    ),
    private_data: 0x00007f0f4960b480,
}

child #1: Ffi_ArrowArray {
    length: 5,
    null_count: 2,
    offset: 0,
    n_buffers: 3,
    n_children: 0,
    buffers: 0x00007f0f49609240,
    children: 0x0000000000000000,
    dictionary: 0x0000000000000000,
    release: Some(
        0x00007f0f4aec9ba0,
    ),
    private_data: 0x00007f0f4960b540,
}
{code}

This does not seem consistent with what the Python API offers:
{code:python}
print(a.field(0).offset, len(a.field(0))) # 1 2 <- shouldn't it be 0 5?
{code}

Second, and more importantly, the condition that each child's length must equal the array's own length is violated (each child has length 5, while the array has length 2 in the example above).

One could argue that a consumer MUST slice each child to get the desired behavior, but that does not roundtrip: when writing the StructArray back (after consuming it), we would now write

{code}
write child: Ffi_ArrowArray {
    length: 2,
    null_count: 0,
    offset: 1,
    n_buffers: 2,
    n_children: 0,
    buffers: 0x00000000021c8b20,
    children: 0x0000000000000008,
    dictionary: 0x0000000000000000,
    release: Some(
        0x00007fb1f8d536c0,
    ),
    private_data: 0x00000000024f0db0,
}
write child: Ffi_ArrowArray {
    length: 2,
    null_count: 1,
    offset: 1,
    n_buffers: 3,
    n_children: 0,
    buffers: 0x00000000024998f0,
    children: 0x0000000000000008,
    dictionary: 0x0000000000000000,
    release: Some(
        0x00007fb1f8d536c0,
    ),
    private_data: 0x0000000002499910,
}
Ffi_ArrowArray {
    length: 2,
    null_count: 1,
    offset: 1,
    n_buffers: 1,
    n_children: 2,
    buffers: 0x00000000024f12d0,
    children: 0x00000000021c8ae0,
    dictionary: 0x0000000000000000,
    release: Some(
        0x00007fb1f8d536c0,
    ),
    private_data: 0x00000000024999c0,
}
{code}

which is consumed as

{code}
# b is the StructArray imported back from the structs written above
print(b.field(0).offset, len(b.field(0))) # 2 1 <------------ why?
print(b.offset, len(b))  # 1 2 <-- OK
{code}

This causes the check in [this line|https://github.com/apache/arrow/blob/b73af9a1607caa4a04e1a11896aed6669847a4d4/cpp/src/arrow/array/validate.cc#L115] to fail.
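
One possible reading of the {{2 1}} result is that the consumer applies the parent's offset and length to each child, so a child that was already sliced before export gets shifted a second time. The sketch below models only that arithmetic. {{ArrayView}}, {{slice_view}} and {{logical_child}} are hypothetical names for illustration (not pyarrow APIs), and the clamping rule is an assumption about how slicing behaves, not something confirmed in this report.

```python
from typing import NamedTuple


class ArrayView(NamedTuple):
    """Hypothetical stand-in for the (offset, length) pair of an array."""
    offset: int
    length: int


def slice_view(view: ArrayView, offset: int, length: int) -> ArrayView:
    """Slice a view, clamping the length to what remains (assumed rule)."""
    remaining = max(view.length - offset, 0)
    return ArrayView(view.offset + offset, min(length, remaining))


def logical_child(parent: ArrayView, child: ArrayView) -> ArrayView:
    """Assumed consumer behavior: apply the parent's slice to the child."""
    return slice_view(child, parent.offset, parent.length)


# The sliced struct array from the report: offset 1, length 2.
parent = ArrayView(offset=1, length=2)

# Child as exported by pyarrow in the report: unsliced, offset 0, length 5.
exported = logical_child(parent, ArrayView(offset=0, length=5))
print(exported)  # ArrayView(offset=1, length=2)

# Child as written back after slicing it: offset 1, length 2.
# The parent's offset is applied a second time.
reimported = logical_child(parent, ArrayView(offset=1, length=2))
print(reimported)  # ArrayView(offset=2, length=1)
```

Under this reading, exporting already-sliced children together with a non-zero parent offset shifts each child twice, which would match both the {{1 2}} seen on the first import and the {{2 1}} seen after the roundtrip.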

I was unable to find a test for the roundtrip of a sliced struct [in pyarrow tests|https://github.com/apache/arrow/blob/5ead37593472c42f61c76396dde7dcb8954bde70/python/pyarrow/tests/test_cffi.py] to use as a reference for my own test. It seems to me that when we slice a StructArray, we should slice its children accordingly, so that its C data interface yields a consistent result?

  was:
I am struggling to roundtrip a sliced StructArray over the c data interface.

Consider the array:

{code:python}
fields = [
            ("f1", pyarrow.int32()),
            ("f2", pyarrow.string()),
        ]
        a = pyarrow.array(
            [
                {"f1": 1, "f2": "a"},
                None,
                {"f1": 3, "f2": None},
                {"f1": None, "f2": "d"},
                {"f1": None, "f2": None},
            ],
            pyarrow.struct(fields),
        ).slice(1, 2)
{code}

When reading this array from the c data interface, I get:

{code:java}
array: Ffi_ArrowArray {
    length: 2,
    null_count: 1,
    offset: 1,
    n_buffers: 1,
    n_children: 2,
    buffers: 0x00007f61796091c0,
    children: 0x00007f6179609280,
    dictionary: 0x0000000000000000,
    release: Some(
        0x00007f617aef2ba0,
    ),
    private_data: 0x00007f617960b3c0,
}

child #0: Ffi_ArrowArray {
    length: 5,
    null_count: 2,
    offset: 0,
    n_buffers: 2,
    n_children: 0,
    buffers: 0x00007f0f49609200,
    children: 0x0000000000000000,
    dictionary: 0x0000000000000000,
    release: Some(
        0x00007f0f4aec9ba0,
    ),
    private_data: 0x00007f0f4960b480,
}

child #1: Ffi_ArrowArray {
    length: 5,
    null_count: 2,
    offset: 0,
    n_buffers: 3,
    n_children: 0,
    buffers: 0x00007f0f49609240,
    children: 0x0000000000000000,
    dictionary: 0x0000000000000000,
    release: Some(
        0x00007f0f4aec9ba0,
    ),
    private_data: 0x00007f0f4960b540,
}
{code}

This does not seem consistent with what the Python API offers:
{code:python}
print(a.field(0).offset, len(a.field(0))) # 1 2 <- shouldn't it be 5 0?
{code}

Secondly and most importantly, the condition that each child's length must equal the array's own length is violated (children length is 5, array's length is 2 in the example above).

We could argue that a consumer MUST slice each child to achieve the desired behavior, but that won't roundtrip because, when writing the StructArray (after consuming it), we would now write

{code}
write child: Ffi_ArrowArray {
    length: 2,
    null_count: 0,
    offset: 1,
    n_buffers: 2,
    n_children: 0,
    buffers: 0x00000000021c8b20,
    children: 0x0000000000000008,
    dictionary: 0x0000000000000000,
    release: Some(
        0x00007fb1f8d536c0,
    ),
    private_data: 0x00000000024f0db0,
}
write child: Ffi_ArrowArray {
    length: 2,
    null_count: 1,
    offset: 1,
    n_buffers: 3,
    n_children: 0,
    buffers: 0x00000000024998f0,
    children: 0x0000000000000008,
    dictionary: 0x0000000000000000,
    release: Some(
        0x00007fb1f8d536c0,
    ),
    private_data: 0x0000000002499910,
}
Ffi_ArrowArray {
    length: 2,
    null_count: 1,
    offset: 1,
    n_buffers: 1,
    n_children: 2,
    buffers: 0x00000000024f12d0,
    children: 0x00000000021c8ae0,
    dictionary: 0x0000000000000000,
    release: Some(
        0x00007fb1f8d536c0,
    ),
    private_data: 0x00000000024999c0,
}
{code}

is consumed as 

{code}
print(b.field(0).offset, len(b.field(0))) # 2 1 <------------ why?
print(b.offset, len(b))  # 1 2 <-- OK
{code}

which causes the check in [this line|https://github.com/apache/arrow/blob/b73af9a1607caa4a04e1a11896aed6669847a4d4/cpp/src/arrow/array/validate.cc#L115] to fail.

I was unable to find a test for a roundtrip of a sliced struct [in pyarrow tests|https://github.com/apache/arrow/blob/5ead37593472c42f61c76396dde7dcb8954bde70/python/pyarrow/tests/test_cffi.py] to compare my test with a reference test, but it seems to me that when we slice a StructArray, we should slice its children accordingly so that its C data interface yields a consistent result?


> [C++] [Python] Does a sliced StructArray roundtrip on c data interface?
> -----------------------------------------------------------------------
>
>                 Key: ARROW-14383
>                 URL: https://issues.apache.org/jira/browse/ARROW-14383
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 5.0.0
>            Reporter: Jorge Leitão
>            Priority: Major


