Posted to user@arrow.apache.org by Wes McKinney <we...@gmail.com> on 2020/03/18 18:43:18 UTC

Re: Learning pyarrow and optimize row groups size

We don't really have any API options at the moment for targeting a
particular row group size, only selecting the number of rows per row
group (and since rows can vary a lot in size when you have strings or
nested data, this does not do the same thing). My understanding is
that the way that writers work is that they encode small chunks of
data in memory (without writing them out to storage yet) and compute
the approximate encoded on-disk size. When the threshold is reached,
the row group is then written out.

I would suggest opening some JIRA issues and proposing what APIs can
be added to e.g. "pyarrow.parquet.write_table" and what their
semantics would be.

On Wed, Mar 18, 2020 at 5:30 AM jonathan mercier
<jo...@cnrgh.fr> wrote:
>
> Dear all,
>
> I am learning the pyarrow API and Arrow technology, so I would first
> like to thank you for your work.
>
>
> From my understanding, pyarrow.arrays and pyarrow.RecordBatch are
> write-once structures: we cannot append data to them.
> 1/ Is that correct?
>
>
> I wrote a little script to write data into a Parquet file. The data is
> a 2D list (a list of rows, each of which is a list of column values,
> e.g. [['a','b','c'], ['d','e','f']]).
> The script is here:
>
> https://gist.github.com/bioinfornatics/c82398fa22339d34f41b3580c988c308
>
> To achieve this, I stored all intermediate pyarrow structures in
> memory in order to create a table (a schema and a list of pyarrow
> arrays).
>
> 2/ Is it possible to achieve the same goal with a stream, in order to
> avoid wasting memory and to handle terabytes of data?
>
>
>
> I read these interesting articles:
> https://www.dremio.com/tuning-parquet/,
> https://parquet.apache.org/documentation/latest/
>
>  which recommend large row groups (512 MB - 1 GB).
> 3/ How can I manage row groups so that each one is approximately 1 GB
> in size?
>
> 4/ When using pyarrow, should the data ultimately be stored on disk as
> a Parquet file, or does pyarrow provide its own generic file format as
> a common data layer?
>
>
> Thanks a lot for your help and your work on Arrow.
>
> Best regards
>
> Jonathan
>