Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2021/10/08 19:09:00 UTC
[jira] [Commented] (ARROW-11057) [Python] Data inconsistency with read and write
[ https://issues.apache.org/jira/browse/ARROW-11057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17426339#comment-17426339 ]
Joris Van den Bossche commented on ARROW-11057:
-----------------------------------------------
This logic was changed in PARQUET-1798 / https://github.com/apache/arrow/pull/10289, and now those PARQUET:field_id fields are only preserved if already present, and not automatically generated.
If you re-run the example from the issue description with a recent pyarrow release, you now get identical files, and the schemas no longer contain the field_ids.
> [Python] Data inconsistency with read and write
> -----------------------------------------------
>
> Key: ARROW-11057
> URL: https://issues.apache.org/jira/browse/ARROW-11057
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 2.0.0
> Reporter: David Quijano
> Priority: Major
>
> I have been reading and writing some tables to parquet and I found some inconsistencies.
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> # create a table with some data
> a = pa.Table.from_pydict({'x': [1]*100, 'y': [2]*100, 'z': [3]*100})
> # write it to file
> pq.write_table(a, 'test.parquet')
> # read the same file back
> b = pq.read_table('test.parquet')
> # a == b is True, that's good
> # write table b to a second file
> pq.write_table(b, 'test2.parquet')
> # test.parquet is different from test2.parquet{code}
> Basically it is:
> * Create table in memory
> * Write it to file
> * Read it again
> * Write it to a different file
> The files are not the same. The second one contains extra information.
> The differences are consistent across different compressions (I tried snappy and zstd).
> Also, reading the second file and writing it again produces the same file.
> Is this a bug or an expected behavior?
>
>
>