Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/11/09 18:58:00 UTC
[jira] [Assigned] (ARROW-10493) [C++][Parquet] Writing nullable nested strings results in wrong data in file
[ https://issues.apache.org/jira/browse/ARROW-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joris Van den Bossche reassigned ARROW-10493:
---------------------------------------------
Assignee: Joris Van den Bossche
> [C++][Parquet] Writing nullable nested strings results in wrong data in file
> ----------------------------------------------------------------------------
>
> Key: ARROW-10493
> URL: https://issues.apache.org/jira/browse/ARROW-10493
> Project: Apache Arrow
> Issue Type: Bug
> Affects Versions: 2.0.0
> Environment: Python 3.6
> Reporter: Christian Lundgren
> Assignee: Joris Van den Bossche
> Priority: Major
> Labels: pull-request-available
> Fix For: 2.0.1
>
> Time Spent: 2h 10m
> Remaining Estimate: 0h
>
> When I write a column of type `struct(string)` that has more elements than the write_batch_size, the output contains only the first batch, repeated. The data in the batches after the first one is not written to the output.
> I only see this behaviour with Arrow 2.0.0; with 1.0.1 the output contains all the data, as expected.
>
> This Python test case reproduces the problem. The last value in the output is "key-0" instead of the expected "key-1024":
>
> {code:python}
> import io
>
> import pyarrow as pa
> import pyarrow.parquet as pq
>
>
> def test_struct_array():
>     # One more row than the batch size, so the writer needs a second batch.
>     default_writer_batch_size = 1024
>     n_samples = default_writer_batch_size + 1
>     keys = [f"key-{i}" for i in range(n_samples)]
>     expected = list(keys)
>     # Wrap the string column in a struct to get a nested string field.
>     struct_array = pa.StructArray.from_arrays(
>         [pa.array(keys, type=pa.string())],
>         names=["string"],
>     )
>     table = pa.table({"struct": struct_array})
>     buf = io.BytesIO()
>     pq.write_table(table, buf)
>     # Read the file back and compare the flattened string column.
>     actual = pq.read_table(buf).flatten()[0].to_pylist()
>     assert actual[:1024] == expected[:1024]
>     assert actual[-1] == expected[-1], (actual[-1], expected[-1])
> {code}
>
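When checking other files for the same symptom, it may help to locate where the read-back values first diverge from the expected ones, rather than comparing only the last element. The following is a minimal sketch using only the standard library. The helper `first_mismatch` is hypothetical (it is not part of pyarrow), and the `simulated` list is made-up data that models the behaviour described in the report (first batch repeated), not output from a real Parquet file.

```python
def first_mismatch(actual, expected):
    """Return the index of the first differing element, or None if equal."""
    for i, (a, e) in enumerate(zip(actual, expected)):
        if a != e:
            return i
    # All shared positions match; a length difference is still a mismatch.
    if len(actual) != len(expected):
        return min(len(actual), len(expected))
    return None


batch_size = 1024  # same value as in the test case in the report
expected = [f"key-{i}" for i in range(batch_size + 1)]

# Simulated (not real) read-back data: the first batch is correct, and the
# remaining rows repeat the start of the first batch, as the report describes.
simulated = expected[:batch_size] + expected[: len(expected) - batch_size]

index = first_mismatch(simulated, expected)
print(index, simulated[index], expected[index])  # prints: 1024 key-0 key-1024
```

With real data, `actual` from the test case in the report would take the place of `simulated`. A first mismatch exactly at the batch boundary would be consistent with the description in the report.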