Posted to jira@arrow.apache.org by "Christian Lundgren (Jira)" <ji...@apache.org> on 2020/11/04 09:48:00 UTC
[jira] [Created] (ARROW-10493) [C++][Parquet] Writing nullable nested strings results in wrong data in file
Christian Lundgren created ARROW-10493:
------------------------------------------
Summary: [C++][Parquet] Writing nullable nested strings results in wrong data in file
Key: ARROW-10493
URL: https://issues.apache.org/jira/browse/ARROW-10493
Project: Apache Arrow
Issue Type: Bug
Affects Versions: 2.0.0
Environment: Python 3.6
Reporter: Christian Lundgren
When I write a column of type `struct(string)` that has more elements than the writer's batch size (write_batch_size), the output contains only the first batch, repeated. Data from the batches after the first is never written to the output.
I only see this behaviour with Arrow 2.0.0; with 1.0.1 the output contains all the data, as expected.
This Python test case reproduces the problem:
{code:python}
import io

import pyarrow as pa
import pyarrow.parquet as pq


def test_struct_array():
    # One more row than the default write batch size, so a second batch is needed.
    default_writer_batch_size = 1024
    n_samples = default_writer_batch_size + 1
    keys = [f"key-{i}" for i in range(n_samples)]
    expected = list(keys)

    struct_array = pa.StructArray.from_arrays(
        [pa.array(keys, type=pa.string())],
        names=["string"],
    )
    table = pa.table({"struct": struct_array})

    # Round-trip the table through an in-memory Parquet file.
    buf = io.BytesIO()
    pq.write_table(table, buf)
    actual = pq.read_table(buf).flatten()[0].to_pylist()

    # The first batch is read back correctly; the row after it is not.
    assert actual[:1024] == expected[:1024]
    assert actual[-1] == expected[-1], (actual[-1], expected[-1])
{code}
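To make the failure easier to characterise than with the last-element assertion alone, the read-back list could be compared against the expected one with a small helper like the sketch below. It is plain Python and not part of pyarrow; the names `first_mismatch` and `repeats_first_batch` are hypothetical, and "first batch, repeated" is assumed here to mean that row i of the output equals row i modulo the batch size of the input.

```python
def first_mismatch(actual, expected):
    """Return the first index where the two lists differ, or None if equal."""
    for i, (a, e) in enumerate(zip(actual, expected)):
        if a != e:
            return i
    if len(actual) != len(expected):
        # One list is a strict prefix of the other.
        return min(len(actual), len(expected))
    return None


def repeats_first_batch(actual, expected, batch_size=1024):
    """Check the assumed symptom: every row equals the first-batch row
    at the same offset (index modulo batch_size) of the expected data."""
    if len(actual) != len(expected) or len(actual) <= batch_size:
        return False
    return all(
        actual[i] == expected[i % batch_size] for i in range(len(actual))
    )
```

In the test above, `first_mismatch(actual, expected)` would be expected to return 1024 on an affected version and None on a working one.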