Posted to issues@arrow.apache.org by "Neal Richardson (Jira)" <ji...@apache.org> on 2019/11/12 20:00:00 UTC
[jira] [Commented] (ARROW-7150) [Python] Explain parquet file size growth
[ https://issues.apache.org/jira/browse/ARROW-7150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972730#comment-16972730 ]
Neal Richardson commented on ARROW-7150:
----------------------------------------
Hard to say without a working example to reproduce the issue. Can you confirm that this is still an issue in 0.15.1? There was a lot of Parquet work done between 0.14 and 0.15.
> [Python] Explain parquet file size growth
> -----------------------------------------
>
> Key: ARROW-7150
> URL: https://issues.apache.org/jira/browse/ARROW-7150
> Project: Apache Arrow
> Issue Type: Task
> Components: Python
> Affects Versions: 0.14.1
> Environment: Mac OS X. Pyarrow==0.14.1
> Reporter: Bogdan Klichuk
> Priority: Major
>
> With a columnar storage format and gzip compression enabled, I can't make sense of how the Parquet file size grows in my specific example.
> So far I haven't shared a dataset (I would need to create a mock one to share).
> {code:java}
> > df = pandas.read_csv('...')
> > len(df)
> 820
> > # 1. write 820 rows to a parquet file
> > df.to_parquet('820.parquet', compression='gzip')
> > # size of 820.parquet is 6.1M
> > # 2. write 8200 rows to a parquet file
> > df_big = pandas.concat([df] * 10).reset_index(drop=True)
> > len(df_big)
> 8200
> > df_big.to_parquet('8200.parquet', compression='gzip')
> > # size of 8200.parquet is 320M.
> {code}
>
>
> Compression usually works better on bigger files. How did a 10x increase with repeated data result in 50x growth in file size? Insane, IMO.
>
> I'm working on a periodic job that concatenates smaller files into bigger ones, and I'm now doubting whether I need it at all.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)