Posted to jira@arrow.apache.org by "Micah Kornfield (Jira)" <ji...@apache.org> on 2021/03/26 20:50:00 UTC

[jira] [Commented] (ARROW-11057) [Python] Data inconsistency with read and write

    [ https://issues.apache.org/jira/browse/ARROW-11057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17309721#comment-17309721 ] 

Micah Kornfield commented on ARROW-11057:
-----------------------------------------

The extra metadata is for traceability purposes (it seems less useful without additional metadata on the source file), but maybe we should consider stripping it out when writing back to Parquet. It isn't clear to me what happens if you try writing and reading the table one more time.

> [Python] Data inconsistency with read and write
> -----------------------------------------------
>
>                 Key: ARROW-11057
>                 URL: https://issues.apache.org/jira/browse/ARROW-11057
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 2.0.0
>            Reporter: David Quijano
>            Priority: Major
>
> I have been reading and writing some tables to parquet and I found some inconsistencies.
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> # create a table with some data
> a = pa.Table.from_pydict({'x': [1] * 100, 'y': [2] * 100, 'z': [3] * 100})
> # write it to file
> pq.write_table(a, 'test.parquet')
> # read the same file
> b = pq.read_table('test.parquet')
> # a == b is True, that's good
> # write table b to file
> pq.write_table(b, 'test2.parquet')
> # test.parquet is different from test2.parquet{code}
> Basically it is:
>  * Create table in memory
>  * Write it to file
>  * Read it again
>  * Write it to a different file
> The files are not the same. The second one contains extra information.
> The differences are consistent across different compressions (I tried snappy and zstd).
> Also, reading the second file and writing it again produces the same file.
> Is this a bug or expected behavior?
