Posted to user@arrow.apache.org by Joris Van den Bossche <jo...@gmail.com> on 2020/07/09 07:49:45 UTC

Re: How to specify number of partitions?

Hi Yash,

Currently, there is the `parquet.write_to_dataset` function for
something like that. But that requires specifying a column by which to
partition the single pyarrow Table.
To just split one table into regular chunks written to multiple files
in a single directory, I don't think we have an automatic function for
that (you could slice the table in a loop and write each subset with
`write_table`).
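
For example, a rough sketch of both approaches (the table contents,
the `dataset_root` path, and the chunk size are just placeholders):

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"year": [2019, 2019, 2020], "value": [1, 2, 3]})

    # Partition by the values of a column: one subdirectory per value.
    pq.write_to_dataset(table, root_path="dataset_root",
                        partition_cols=["year"])

    # Or slice into fixed-size chunks, writing one file per chunk.
    chunk_size = 2
    for i, start in enumerate(range(0, table.num_rows, chunk_size)):
        pq.write_table(table.slice(start, chunk_size),
                       f"dataset_root/part-{i:05d}.parquet")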

You can also control the row group size (the partitioning within a
single Parquet file) using the `row_group_size` argument of `write_table`.
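
For instance (assuming an existing `table`; 100,000 rows per row group
is an arbitrary choice):

    import pyarrow.parquet as pq

    # One file, internally chunked into row groups of ~100k rows each.
    pq.write_table(table, "single_file.parquet", row_group_size=100_000)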

Best,
Joris


On Wed, 8 Jul 2020 at 20:44, Yash Ganthe <ya...@gmail.com> wrote:
>
> Hi,
>
> parquet_writer.write_table(table)
>
> This line writes a single file.
> The documentation says:
> This creates a single Parquet file. In practice, a Parquet dataset may
> consist of many files in many directories. We can read a single file back
> with read_table:
>
> Is there a way for PyArrow to create a parquet file in the form of a
> directory with multiple part files in it such as :
>
> ls -lrt permit-inspections-recent.parquet
> ...  14:53 part-00001-bd5d902d-fac9-4e03-b63e-6a8dfc4060b6.snappy.parquet
> ...  14:53 part-00000-bd5d902d-fac9-4e03-b63e-6a8dfc4060b6.snappy.parquet
>
> Regards,
> Yash