Posted to dev@arrow.apache.org by Palak Harwani <pa...@gmail.com> on 2020/05/30 14:09:29 UTC

Writing Parquet datasets using pyarrow.parquet.ParquetWriter

Hi,
I had a few questions regarding pyarrow.parquet. I want to write a Parquet
dataset which is partitioned according to one column. I have a large csv
file and I'm using chunks of csv using the following code :

  # csv_to_parquet.py

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

csv_file = '/path/to/my.tsv'
parquet_file = '/path/to/my.parquet'
chunksize = 100_000

csv_stream = pd.read_csv(csv_file, sep='\t', chunksize=chunksize,
                         low_memory=False)
for i, chunk in enumerate(csv_stream):
    print("Chunk", i)
    if i == 0:
        # Guess the schema of the CSV file from the first chunk
        parquet_schema = pa.Table.from_pandas(df=chunk).schema
        # Open a Parquet file for writing
        parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema,
                                          compression='snappy')
    # Write CSV chunk to the parquet file
    table = pa.Table.from_pandas(chunk, schema=parquet_schema)
    parquet_writer.write_table(table)

parquet_writer.close()



But this code writes a single parquet file, and I don't see any method in
ParquetWriter to write to a dataset; it only has the write_table method.
Is there a way to do this?

Also, how do I write the metadata file for the example above? And how do I
write the common metadata file as well as the per-file metadata files in
the case of a partitioned dataset?

Thanks in advance.

-- 
*Regards,*
*Palak Harwani*

Re: Writing Parquet datasets using pyarrow.parquet.ParquetWriter

Posted by Joris Van den Bossche <jo...@gmail.com>.
Hi Palak,

The ParquetWriter class is meant to write a single parquet file, so it is
expected that the code you show produces only a single parquet file.

If you want to write multiple files, you can either manually create
multiple ParquetWriter instances (each writing to a different parquet
file), or use the `pq.write_to_dataset()` function, which can automatically
partition your data into multiple files based on a column. That function,
however, requires the full dataset in memory as a pandas DataFrame or
pyarrow Table, so it is not compatible with the chunked csv reading. A
rough sketch of the manual approach is included below.
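
To be concrete, here is a minimal sketch of that manual approach combined
with your chunked csv reading. It is only illustrative: the paths, the
partition column name 'col', the hive-style 'col=<value>' directory layout,
and the per-partition file names are assumptions for illustration, not part
of your original code.

import os

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

csv_file = '/path/to/my.tsv'
dataset_root = '/path/to/my_dataset'
chunksize = 100_000

writers = {}   # partition value -> ParquetWriter for that partition
schema = None

csv_stream = pd.read_csv(csv_file, sep='\t', chunksize=chunksize,
                         low_memory=False)
for chunk in csv_stream:
    if schema is None:
        # Schema of the data files: the partition column is dropped because
        # its value is encoded in the directory name instead.
        schema = pa.Table.from_pandas(chunk.drop(columns=['col']),
                                      preserve_index=False).schema
    for value, part in chunk.groupby('col'):
        if value not in writers:
            part_dir = os.path.join(dataset_root, 'col={}'.format(value))
            os.makedirs(part_dir, exist_ok=True)
            writers[value] = pq.ParquetWriter(
                os.path.join(part_dir, 'part-0.parquet'),
                schema, compression='snappy')
        table = pa.Table.from_pandas(part.drop(columns=['col']),
                                     schema=schema, preserve_index=False)
        # Each chunk appends a new row group to the partition's file
        writers[value].write_table(table)

for writer in writers.values():
    writer.close()

Each partition value gets its own ParquetWriter, and every csv chunk
appends a row group to the matching file; this assumes the number of
distinct partition values is small enough to keep one open file per value.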

If you want to do it in chunks, it might be easier to use a higher-level
package such as dask. Dask can read a csv file in chunks and write to
parquet using pyarrow automatically (see e.g.
https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.read_csv).
It would look like:

import dask.dataframe as dd
df = dd.read_csv(..)
df.to_parquet(.., partition_on=['col'], engine="pyarrow")

Dask can also write a (common) metadata file for you.
If you want to do this manually using pyarrow, you can take a look at the
`parquet.write_metadata` function (
https://github.com/apache/arrow/blob/494e7a9c5714f3ed9e5590aeef8362114d5a3a46/python/pyarrow/parquet.py#L1748-L1783).
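
As a very rough sketch of that (assuming a pyarrow version in which
`write_to_dataset` accepts a `metadata_collector` keyword; the paths and
the partition column 'col' are again made up for illustration):

import os

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'col': ['a', 'a', 'b'], 'value': [1, 2, 3]})
root_path = '/path/to/my_dataset'

# Write the partitioned dataset and collect the metadata of every file
metadata_collector = []
pq.write_to_dataset(table, root_path, partition_cols=['col'],
                    metadata_collector=metadata_collector)

# The data files do not contain the partition column (it lives in the
# directory names), so drop it from the schema used for the sidecar files.
file_schema = table.schema.remove(table.schema.get_field_index('col'))

# _common_metadata: the schema only, no row group information
pq.write_metadata(file_schema, os.path.join(root_path, '_common_metadata'))

# _metadata: the schema plus the row group metadata of all written files
pq.write_metadata(file_schema, os.path.join(root_path, '_metadata'),
                  metadata_collector=metadata_collector)

The `_common_metadata` file contains only the schema, while `_metadata`
additionally aggregates the row group metadata collected from each written
file.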

This needs to be better documented (covered by
https://issues.apache.org/jira/browse/ARROW-3154).

Best,
Joris
