Posted to issues@arrow.apache.org by "Marco Neumann (JIRA)" <ji...@apache.org> on 2019/04/05 12:21:00 UTC

[jira] [Commented] (ARROW-5028) [Python][C++] Arrow to Parquet conversion drops and corrupts values

    [ https://issues.apache.org/jira/browse/ARROW-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16810752#comment-16810752 ] 

Marco Neumann commented on ARROW-5028:
--------------------------------------

More debugging results:
 * {{def_levels}} and {{rep_levels}} have different lengths (the first one is one element too short), leading to an out-of-bounds / uninitialized read, which explains the {{0}} seen in the last report
 * the place where a {{rep_levels}} entry is created without a matching {{def_levels}} entry is {{HandleNonNullList}} in {{writer.cc}}
 * the reason for that is that {{inner_length}} is negative: it seems to jump from a large number ({{16268812}}) down to a small one ({{2}}) and then continues from there (6, 13, 17, ...) (see the sketch below)
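
For reference, here is a minimal Python sketch of the bookkeeping that appears to be violated. This is only an illustration, not the actual C++ in {{writer.cc}}; the function name {{levels_for_list_column}} and the concrete level values are made up. The point is that each list slot's inner length is derived from consecutive offsets, and the writer emits exactly one {{rep_levels}} entry per {{def_levels}} entry, so the two buffers must always end up the same length; offsets that jump backwards make the derived length negative and break that pairing.
{code:python}
# Illustrative sketch only -- NOT the implementation in writer.cc.
# The exact level values depend on the nullability of the list and its items;
# what matters here is the one-to-one pairing of def and rep levels.
def levels_for_list_column(offsets):
    def_levels = []
    rep_levels = []
    for i in range(len(offsets) - 1):
        inner_length = offsets[i + 1] - offsets[i]
        # Offsets that jump backwards (like 16268812 -> 2) produce a
        # negative inner_length and desynchronize the two level buffers.
        assert inner_length >= 0, "corrupt offsets at slot %d: %d" % (i, inner_length)
        if inner_length == 0:
            # empty list slot: one level pair, no value
            def_levels.append(0)
            rep_levels.append(0)
        else:
            for j in range(inner_length):
                def_levels.append(1)                 # item is present
                rep_levels.append(0 if j == 0 else 1)
    # invariant: one def level per rep level
    assert len(def_levels) == len(rep_levels)
    return def_levels, rep_levels

# well-formed offsets -> equal-length level buffers
levels_for_list_column([0, 2, 2, 5])
# levels_for_list_column([0, 16268812, 2, 6])  # would trip the assertion
{code}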

> [Python][C++] Arrow to Parquet conversion drops and corrupts values
> -------------------------------------------------------------------
>
>                 Key: ARROW-5028
>                 URL: https://issues.apache.org/jira/browse/ARROW-5028
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.11.1, 0.13.0
>         Environment: python 3.6
>            Reporter: Marco Neumann
>            Priority: Major
>             Fix For: 0.14.0
>
>         Attachments: dct.pickle.gz
>
>
> I am sorry if this bug report feels rather long and the reproduction data is large, but I was not able to reduce the data any further while still triggering the problem. I was able to trigger this behavior on master and on {{0.11.1}}.
> {code:python}
> import io
> import os.path
> import pickle
> import numpy as np
> import pyarrow as pa
> import pyarrow.parquet as pq
> def dct_to_table(index_dct):
>     labeled_array = pa.array(np.array(list(index_dct.keys())))
>     partition_array = pa.array(np.array(list(index_dct.values())))
>     return pa.Table.from_arrays(
>         [labeled_array, partition_array], names=['a', 'b']
>     )
> 
> def check_pq_nulls(data):
>     fp = io.BytesIO(data)
>     pfile = pq.ParquetFile(fp)
>     assert pfile.num_row_groups == 1
>     md = pfile.metadata.row_group(0)
>     col = md.column(1)
>     assert col.path_in_schema == 'b.list.item'
>     assert col.statistics.null_count == 0  # fails
> 
> def roundtrip(table):
>     buf = pa.BufferOutputStream()
>     pq.write_table(table, buf)
>     data = buf.getvalue().to_pybytes()
>     # this fails:
>     #   check_pq_nulls(data)
>     reader = pa.BufferReader(data)
>     return pq.read_table(reader)
> 
> with open(os.path.join(os.path.dirname(__file__), 'dct.pickle'), 'rb') as fp:
>     dct = pickle.load(fp)
> 
> # this does NOT help:
> #   pa.set_cpu_count(1)
> #   import gc; gc.disable()
> table = dct_to_table(dct)
> 
> # this fixes the issue:
> #   table = pa.Table.from_pandas(table.to_pandas())
> table2 = roundtrip(table)
> assert table.column('b').null_count == 0
> assert table2.column('b').null_count == 0  # fails
> # if table2 is converted to pandas, you can also observe that some values at the end of column b are `['']` which clearly is not present in the original data
> {code}
> I would also be thankful for any pointers on where the bug comes from or on how to reduce the test case further.


