Posted to jira@arrow.apache.org by "David Quijano (Jira)" <ji...@apache.org> on 2020/12/29 00:41:00 UTC

[jira] [Created] (ARROW-11057) [Python] Data inconsistency with read and write

David Quijano created ARROW-11057:
-------------------------------------

             Summary: [Python] Data inconsistency with read and write
                 Key: ARROW-11057
                 URL: https://issues.apache.org/jira/browse/ARROW-11057
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 2.0.0
            Reporter: David Quijano


 

I have been reading and writing some tables to Parquet files and found an inconsistency.
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# create a table with some data
a = pa.Table.from_pydict({'x': [1] * 100, 'y': [2] * 100, 'z': [3] * 100})
# write it to file
pq.write_table(a, 'test.parquet')
# read the same file back
b = pq.read_table('test.parquet')
# a == b is True, that's good
# write table b to a second file
pq.write_table(b, 'test2.parquet')
# test.parquet is different from test2.parquet{code}
Basically it is:
 * Create table in memory
 * Write it to file
 * Read it again
 * Write it to a different file

The files are not the same. The second one contains extra information.
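
One way to confirm that two files differ byte for byte is to compare their digests. The following is only a minimal sketch using the Python standard library; the helper names file_sha256 and files_identical are hypothetical and not part of pyarrow.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the file at `path` (hypothetical helper)."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_identical(path_a, path_b):
    """True if both files have exactly the same bytes (compared via SHA-256)."""
    return file_sha256(path_a) == file_sha256(path_b)
```

Calling files_identical('test.parquet', 'test2.parquet') after running the steps above would be expected to return False if the two files differ as described. Note that this only detects a difference; it does not show which part of the file changed.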

The differences are consistent across different compression codecs (I tried snappy and zstd).

Also, reading the second file and writing it again produces the same file.

Is this a bug or expected behavior?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)