Posted to dev@arrow.apache.org by "Bogdan Klichuk (Jira)" <ji...@apache.org> on 2019/11/12 17:26:00 UTC

[jira] [Created] (ARROW-7150) [Python] Explain parquet file size growth

Bogdan Klichuk created ARROW-7150:
-------------------------------------

             Summary: [Python] Explain parquet file size growth
                 Key: ARROW-7150
                 URL: https://issues.apache.org/jira/browse/ARROW-7150
             Project: Apache Arrow
          Issue Type: Task
          Components: Python
    Affects Versions: 0.14.1
         Environment: Mac OS X. Pyarrow==0.15.1
            Reporter: Bogdan Klichuk


Given that Parquet is a columnar storage format and gzip compression is enabled, I can't make sense of how the Parquet file size grows in my specific example.

For now I'm reporting this without sharing a dataset (I would need to create a mock one to share).
{code:python}
> import pandas
> df = pandas.read_csv('...')
> len(df)
820
> df.to_parquet('820.parquet', compression='gzip')
> # size of 820.parquet is 6.1M
> df_big = pandas.concat([df] * 10).reset_index(drop=True)
> len(df_big)
8200
> df_big.to_parquet('8200.parquet', compression='gzip')
> # size of 8200.parquet is 320M
{code}
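
To see where the bytes actually go, the Parquet footer metadata can be dumped per row group and per column. Below is a minimal diagnostic sketch, assuming the two files above exist locally; it only uses pyarrow's ParquetFile/metadata accessors.

{code:python}
import pyarrow.parquet as pq

def dump_size_breakdown(path):
    """Print per-column compressed/uncompressed sizes for every row group."""
    meta = pq.ParquetFile(path).metadata
    print('%s: %d rows in %d row group(s)'
          % (path, meta.num_rows, meta.num_row_groups))
    for rg in range(meta.num_row_groups):
        rg_meta = meta.row_group(rg)
        for col in range(rg_meta.num_columns):
            c = rg_meta.column(col)
            print('  rg=%d %-30s compressed=%d uncompressed=%d encodings=%s'
                  % (rg, c.path_in_schema, c.total_compressed_size,
                     c.total_uncompressed_size, c.encodings))

dump_size_breakdown('820.parquet')
dump_size_breakdown('8200.parquet')
{code}

If one or two columns account for nearly all of the growth, the question becomes how those columns are encoded (for example, dictionary encoding falling back to plain encoding once a column chunk gets large) rather than how well gzip compresses.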

Compression usually works better on bigger files. How come a 10x increase with repeated data resulted in a ~50x growth in file size? Insane imo.
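
To separate the effect of the encoder settings from the effect of gzip itself, one experiment (a sketch; use_dictionary, compression and row_group_size are standard pyarrow.parquet.write_table parameters) is to write the same frame several ways and compare the resulting file sizes:

{code:python}
import os

import pyarrow as pa
import pyarrow.parquet as pq

# df_big is the 8200-row frame from the snippet above.
table = pa.Table.from_pandas(df_big)

# Write the same table with different settings and compare on-disk sizes.
for name, kwargs in [
    ('gzip-dict.parquet',   dict(compression='gzip', use_dictionary=True)),
    ('gzip-nodict.parquet', dict(compression='gzip', use_dictionary=False)),
    ('snappy-dict.parquet', dict(compression='snappy', use_dictionary=True)),
]:
    pq.write_table(table, name, **kwargs)
    print(name, os.path.getsize(name))
{code}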

I'm working on a periodic job that concatenates smaller files into bigger ones, and now I'm doubting whether I need it at all.
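
For context, that job is roughly shaped like the sketch below (file names are placeholders; row_group_size is the standard pyarrow.parquet.write_table option), so the answer here decides whether combining files this way is worth doing at all.

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# Placeholder input paths - the real job lists them from storage.
small_files = ['part-0001.parquet', 'part-0002.parquet', 'part-0003.parquet']

# Read each small file and concatenate into a single table.
tables = [pq.read_table(path) for path in small_files]
combined = pa.concat_tables(tables)

# Write one bigger file; row_group_size bounds the rows per row group.
pq.write_table(combined, 'combined.parquet',
               compression='gzip', row_group_size=64 * 1024)
{code}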



--
This message was sent by Atlassian Jira
(v8.3.4#803005)