Posted to jira@arrow.apache.org by "ARF (Jira)" <ji...@apache.org> on 2021/02/16 12:43:00 UTC

[jira] [Commented] (ARROW-1858) [Python] Add documentation about parquet.write_to_dataset and related methods

    [ https://issues.apache.org/jira/browse/ARROW-1858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17285173#comment-17285173 ] 

ARF commented on ARROW-1858:
----------------------------

Even with the new documentation, I am unclear on whether I can append to an existing partitioned dataset.

That is: is it possible to write a partitioned dataset when the entire dataset is too large to hold in memory prior to writing?
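For context, here is a minimal sketch of what I have in mind (the chunk generator and the {{dataset_root}} path are purely illustrative); my understanding is that each {{write_to_dataset}} call adds new, uniquely named files to the partition directories, so repeated calls would effectively append, but I am not sure whether this is the intended or supported way to do it:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical generator yielding manageable chunks of a dataset that is
# too large to hold in memory in its entirety.
def iter_chunks():
    for year in (2019, 2020, 2021):
        yield pa.table({
            "year": [year] * 3,
            "value": [1.0, 2.0, 3.0],
        })

# Each call writes a new, uniquely named Parquet file into the matching
# partition directories (e.g. dataset_root/year=2020/<uuid>.parquet),
# so calling it repeatedly should append files to the partitioned dataset.
for chunk in iter_chunks():
    pq.write_to_dataset(chunk, root_path="dataset_root",
                        partition_cols=["year"])
{code}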

> [Python] Add documentation about parquet.write_to_dataset and related methods
> -----------------------------------------------------------------------------
>
>                 Key: ARROW-1858
>                 URL: https://issues.apache.org/jira/browse/ARROW-1858
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Wes McKinney
>            Assignee: Donal Simmie
>            Priority: Major
>              Labels: beginner, pull-request-available
>             Fix For: 0.10.0
>
>
> {{pyarrow}} not only allows writing a single Parquet file; it can also write just the schema metadata for a full multi-file dataset. Such a dataset can additionally be partitioned automatically by one or more columns. At the moment, this functionality is not very visible in the documentation: you mainly find the API documentation for it. We should have a small tutorial-like section that explains the differences and use cases for each of these functions.
> See also https://stackoverflow.com/questions/47482434/can-pyarrow-write-multiple-parquet-files-to-a-folder-like-fastparquets-file-sch
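> A minimal sketch of the two entry points such a tutorial would contrast (file and directory names here are illustrative):
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> table = pa.table({"year": [2020, 2020, 2021], "value": [1.0, 2.0, 3.0]})
>
> # Single-file write: everything ends up in one Parquet file.
> pq.write_table(table, "data.parquet")
>
> # Multi-file, partitioned write: one directory per distinct value of the
> # partition columns, each holding one or more Parquet files.
> pq.write_to_dataset(table, root_path="dataset_root", partition_cols=["year"])
> {code}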



--
This message was sent by Atlassian Jira
(v8.3.4#803005)