Posted to jira@arrow.apache.org by "Micah Kornfield (Jira)" <ji...@apache.org> on 2020/11/11 04:46:00 UTC

[jira] [Resolved] (ARROW-10493) [C++][Parquet] Writing nullable nested strings results in wrong data in file

     [ https://issues.apache.org/jira/browse/ARROW-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Micah Kornfield resolved ARROW-10493.
-------------------------------------
    Fix Version/s: 3.0.0
       Resolution: Fixed

Issue resolved by pull request 8589
[https://github.com/apache/arrow/pull/8589]

> [C++][Parquet] Writing nullable nested strings results in wrong data in file
> ----------------------------------------------------------------------------
>
>                 Key: ARROW-10493
>                 URL: https://issues.apache.org/jira/browse/ARROW-10493
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 2.0.0
>         Environment: Python 3.6
>            Reporter: Christian Lundgren
>            Assignee: Christian Lundgren
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.0.1, 3.0.0
>
>          Time Spent: 3h
>  Remaining Estimate: 0h
>
> When I write a column of type `struct(string)` that has more elements than `write_batch_size`, the output contains only the first batch, repeated. Data from batches after the first is not written to the output.
> I see this behaviour only with Arrow 2.0.0; in 1.0.1 the output contains all the data as expected.
>  
> The following Python test case reproduces the problem. The last value in the output is "key-0" instead of the expected "key-1024":
>  
> {code:python}
> import io
>
> import pyarrow as pa
> import pyarrow.parquet as pq
>
>
> def test_struct_array():
>     # Use one more row than the batch size so that a second batch is required.
>     default_writer_batch_size = 1024
>     n_samples = default_writer_batch_size + 1
>     keys = [f"key-{i}" for i in range(n_samples)]
>     expected = list(keys)
>     # Wrap the string column in a struct so that it is written as a nested field.
>     struct_array = pa.StructArray.from_arrays(
>         [pa.array(keys, type=pa.string())],
>         names=["string"],
>     )
>     table = pa.table({"struct": struct_array})
>     # Round-trip the table through an in-memory Parquet file.
>     buf = io.BytesIO()
>     pq.write_table(table, buf)
>     actual = pq.read_table(buf).flatten()[0].to_pylist()
>     # The first batch is written correctly; the value after it is not.
>     assert actual[:1024] == expected[:1024]
>     assert actual[-1] == expected[-1], (actual[-1], expected[-1])
> {code}
>  
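The test above checks only the first 1024 values and the last one. To see exactly where a round trip starts to diverge, a small comparison helper can be useful. The sketch below is a hypothetical helper (`first_mismatch` is not part of pyarrow), and the `simulated_actual` list only imitates the symptom described in the report; it is not output read from a real Parquet file.

```python
from typing import Optional, Sequence, Tuple


def first_mismatch(
    actual: Sequence, expected: Sequence
) -> Optional[Tuple[int, object, object]]:
    """Return (index, actual_value, expected_value) for the first difference, or None."""
    for i, (a, e) in enumerate(zip(actual, expected)):
        if a != e:
            return i, a, e
    if len(actual) != len(expected):
        # One sequence is a prefix of the other; report the first missing position.
        i = min(len(actual), len(expected))
        a = actual[i] if i < len(actual) else None
        e = expected[i] if i < len(expected) else None
        return i, a, e
    return None


# Assumption: imitate the reported symptom, where the value after the
# first 1024 rows repeats the start of the first batch.
expected = [f"key-{i}" for i in range(1025)]
simulated_actual = expected[:1024] + ["key-0"]

print(first_mismatch(simulated_actual, expected))  # (1024, 'key-0', 'key-1024')
```

In the real test, passing `actual` and `expected` to such a helper would report the first affected row instead of only the last one.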



--
This message was sent by Atlassian Jira
(v8.3.4#803005)