Posted to jira@arrow.apache.org by "Christian Lundgren (Jira)" <ji...@apache.org> on 2020/11/04 09:48:00 UTC

[jira] [Created] (ARROW-10493) [C++][Parquet] Writing nullable nested strings results in wrong data in file

Christian Lundgren created ARROW-10493:
------------------------------------------

             Summary: [C++][Parquet] Writing nullable nested strings results in wrong data in file
                 Key: ARROW-10493
                 URL: https://issues.apache.org/jira/browse/ARROW-10493
             Project: Apache Arrow
          Issue Type: Bug
    Affects Versions: 2.0.0
         Environment: Python 3.6
            Reporter: Christian Lundgren


When I write a column of type `struct(string)` that has more elements than the writer's write_batch_size, the output contains only the first batch, repeated. Data from the batches after the first is not written to the output.

I only see this behaviour with Arrow 2.0.0; with 1.0.1 the output contains all the data as expected.
 
This Python test case reproduces the problem:
 
{code:python}
import io
import pyarrow as pa
import pyarrow.parquet as pq

def test_struct_array():
    default_writer_batch_size = 1024
    n_samples = default_writer_batch_size + 1
    keys = [f"key-{i}" for i in range(n_samples)]
    expected = list(keys)

    struct_array = pa.StructArray.from_arrays(
        [pa.array(keys, type=pa.string())],
        names=["string"],
    )
    table = pa.table({"struct": struct_array})

    buf = io.BytesIO()
    pq.write_table(table, buf)

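    # Read the file back and flatten the struct to get the string column.
    actual = pq.read_table(buf).flatten()[0].to_pylist()

    # The first batch round-trips correctly.
    assert actual[:default_writer_batch_size] == expected[:default_writer_batch_size]
    # The element just past the batch boundary is wrong on 2.0.0.
    assert actual[-1] == expected[-1], (actual[-1], expected[-1])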
{code}
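
To narrow down where the output diverges, a small pure-Python check along the following lines may help. It is only a sketch: `first_mismatch` and `repeats_first_batch` are hypothetical helper names, not part of pyarrow, and the "corrupted" list below is simulated from the symptom described above rather than read from a real file.

```python
def first_mismatch(actual, expected):
    """Return the index of the first differing element, or None if equal."""
    for i, (a, e) in enumerate(zip(actual, expected)):
        if a != e:
            return i
    if len(actual) != len(expected):
        # One list is a strict prefix of the other.
        return min(len(actual), len(expected))
    return None


def repeats_first_batch(values, batch_size):
    """True if every element past the first batch repeats the first batch."""
    if len(values) <= batch_size:
        return False
    return all(
        values[i] == values[i % batch_size]
        for i in range(batch_size, len(values))
    )


# Simulated data (assumption): a batch size of 1024 and one extra row that
# repeats the start of the first batch, mimicking the reported symptom.
batch_size = 1024
expected = [f"key-{i}" for i in range(batch_size + 1)]
corrupted = expected[:batch_size] + [expected[0]]

mismatch_index = first_mismatch(corrupted, expected)
looks_repeated = repeats_first_batch(corrupted, batch_size)
```

With the `actual` and `expected` lists from the test above, a mismatch index equal to the batch size together with `repeats_first_batch(actual, 1024)` returning `True` would be consistent with the first batch being written again in place of later data.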
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)