Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2022/01/27 23:57:00 UTC

[jira] [Commented] (ARROW-15484) [Python] kwargs fails for pyarrow.parquet.write_to_dataset()

    [ https://issues.apache.org/jira/browse/ARROW-15484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483475#comment-17483475 ] 

Weston Pace commented on ARROW-15484:
-------------------------------------

{{basename_template}} and {{existing_data_behavior}} are valid arguments for {{pyarrow.dataset.write_dataset}}, but they are not valid arguments for {{pyarrow.parquet.write_to_dataset}}. The latter (a Parquet-specific interface kept mostly for backwards compatibility) tries to interpret any extra arguments as arguments to {{pyarrow.parquet.write_table}}, which doesn't accept those arguments either.

I think the error message is consistent with the method's docstring, which says:

{noformat}
        Additional kwargs for write_table function. See docstring for
        `write_table` or `ParquetWriter` for more information.
        Using `metadata_collector` in kwargs allows one to collect the
        file metadata instances of dataset pieces. The file paths in the
        ColumnChunkMetaData will be set relative to `root_path`.
{noformat}

The error message, in turn, is {{unexpected parquet write option: basename_template}}.

I agree the error stack trace is a little misleading. Arguments to {{pyarrow.parquet.write_table}} are wrapped into a {{ParquetFileFormat}} object, which tells the dataset writer how, specifically, it should write Parquet files (e.g. which compression to use, whether or not to use dictionary encoding, etc.).
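
To illustrate why the error surfaces from the write-options code rather than from {{write_to_dataset}} itself, here is a simplified, pure-Python model of the forwarding. The set of known options and both function bodies are hypothetical stand-ins, not pyarrow's actual implementation.

```python
# Hypothetical subset of options the Parquet writer understands.
KNOWN_WRITE_OPTIONS = {"compression", "use_dictionary", "version"}


def make_write_options(**kwargs):
    """Stand-in for the format-level option check: reject unknown names."""
    for name in kwargs:
        if name not in KNOWN_WRITE_OPTIONS:
            raise TypeError(f"unexpected parquet write option: {name}")
    return dict(kwargs)


def write_to_dataset(table, root_path, **kwargs):
    """Stand-in wrapper: every extra keyword is treated as a format option."""
    write_options = make_write_options(**kwargs)
    return root_path, write_options


try:
    write_to_dataset(None, "foo", basename_template="prefix-{i}.parquet")
except TypeError as exc:
    # The failure is raised one level down, in the option check.
    message = str(exc)
    print(message)
```

In this model the wrapper never inspects the keyword names itself, so any dataset-level argument is only rejected once it reaches the option check.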

> [Python] kwargs fails for pyarrow.parquet.write_to_dataset()
> ------------------------------------------------------------
>
>                 Key: ARROW-15484
>                 URL: https://issues.apache.org/jira/browse/ARROW-15484
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Parquet, Python
>    Affects Versions: 6.0.1
>            Reporter: Martin Thøgersen
>            Priority: Major
>
> When supplying keyword arguments such as `basename_template` or `existing_data_behavior` to `pyarrow.parquet.write_to_dataset()`, the call fails as shown below.
>  
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pyarrow.dataset as ds
> df = pd.DataFrame({
>     'int': [1, 2],
>     'str': ['a', 'b']
> })
> table = pa.Table.from_pandas(df)
> """
> **kwargs : dict,
>     Additional kwargs for write_table function. See docstring for write_table or ParquetWriter for more information.
> """
> pq.write_to_dataset(table, root_path='foo',
>                     use_legacy_dataset=False,
>                     # kwargs:
>                     basename_template="prefix-{i}.parquet",
>                     existing_data_behaviour="error"
>                     )
> {code}
> {noformat}
> TypeError                                 Traceback (most recent call last)
> ...test.py in <module>
>      16     Additional kwargs for write_table function. See docstring for write_table or ParquetWriter for more information.
>      17 """
> ---> 18 pq.write_to_dataset(table, root_path='foo',
>      19                     use_legacy_dataset=False,
>      20                     # kwargs:
> ...lib/python3.8/site-packages/pyarrow/parquet.py in write_to_dataset(table, root_path, partition_cols, partition_filename_cb, filesystem, use_legacy_dataset, **kwargs)
>    2144         # map format arguments
>    2145         parquet_format = ds.ParquetFileFormat()
> -> 2146         write_options = parquet_format.make_write_options(**kwargs)
>    2147 
>    2148         # map old filesystems to new one
> ...lib/python3.8/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.ParquetFileFormat.make_write_options()
> ...lib/python3.8/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.ParquetFileWriteOptions.update()
> TypeError: unexpected parquet write option: basename_template
> {noformat}


