Posted to issues@arrow.apache.org by "Joris Van den Bossche (JIRA)" <ji...@apache.org> on 2019/04/29 12:19:00 UTC

[jira] [Commented] (ARROW-2628) [Python] parquet.write_to_dataset is memory-hungry on large DataFrames

    [ https://issues.apache.org/jira/browse/ARROW-2628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16829198#comment-16829198 ] 

Joris Van den Bossche commented on ARROW-2628:
----------------------------------------------

A similar report was made in ARROW-2709, which had a linked PR: https://github.com/apache/arrow/pull/3344. That PR was closed without being merged, but it contains some relevant discussion.

The current implementation of {{pyarrow.parquet.write_to_dataset}} converts the pyarrow Table to a pandas DataFrame, and then uses pandas' groupby to split it into multiple DataFrames (after dropping the partition columns from the DataFrame, which makes yet another data copy). Each subset DataFrame is then converted back to a pyarrow Table. 
In addition, when this functionality is used from pandas' {{to_parquet}}, there is an extra initial conversion of the pandas DataFrame to an Arrow Table. 
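
For illustration, a minimal sketch of that flow is below. It is not the actual pyarrow source (the helper name {{write_partitioned_via_pandas}} and the file layout are made up), but it shows where the extra copies happen:

{code:python}
# Minimal sketch of the pandas-based groupby-split described above.
# Not the actual pyarrow implementation; names and file layout are illustrative.
import os

import pyarrow as pa
import pyarrow.parquet as pq


def write_partitioned_via_pandas(table, root_path, partition_cols):
    df = table.to_pandas()  # copy 1: Table -> pandas DataFrame
    data_cols = [c for c in df.columns if c not in partition_cols]
    for keys, subgroup in df.groupby(partition_cols):
        if not isinstance(keys, tuple):
            keys = (keys,)
        subdir = "/".join(f"{col}={val}" for col, val in zip(partition_cols, keys))
        os.makedirs(os.path.join(root_path, subdir), exist_ok=True)
        # copy 2: selecting the non-partition columns,
        # copy 3: DataFrame subset -> pyarrow Table
        subtable = pa.Table.from_pandas(subgroup[data_cols], preserve_index=False)
        pq.write_table(subtable, os.path.join(root_path, subdir, "part-0.parquet"))
{code}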

This is clearly less than optimal. Some of the copies might be avoidable (e.g. it is not clear to me whether {{Table.from_pandas}} always copies data). The closed PR tried to circumvent this by using Arrow's dictionary encoding instead of pandas' groupby, and then reconstructing the subset Tables from the resulting indices. Ideally, though, a more Arrow-native solution would be used instead of these workarounds.
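
As a rough illustration of doing the split natively on the Arrow data, the hypothetical helper below expresses the idea with compute kernels ({{pyarrow.compute.unique}}/{{equal}} and {{Table.filter}}) that were only added in later pyarrow versions; it uses a filter per group rather than the PR's dictionary-encode-and-take indices, but the structure is the same:

{code:python}
# Hypothetical sketch: split a Table by one partition column without going
# through pandas. Uses compute kernels from later pyarrow versions; the
# closed PR instead built subset indices via dictionary encoding by hand.
import pyarrow as pa
import pyarrow.compute as pc


def split_by_partition_column(table: pa.Table, partition_col: str):
    """Yield (partition_value, subtable) pairs with the partition column dropped."""
    keys = table[partition_col]
    projected = table.select([c for c in table.column_names if c != partition_col])
    for value in pc.unique(keys):
        mask = pc.equal(keys, value)  # boolean mask selecting this group's rows
        yield value, projected.filter(mask)
{code}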

To quote [~wesmckinn] from the PR ([github comment|https://github.com/apache/arrow/pull/3344#issuecomment-462093173]):

{quote}
I want the proper C++ work completed instead. Here are the steps:

* Struct hash: ARROW-3978
* Integer argsort part of ARROW-1566
* Take function ARROW-772

These are the three things you need to do the groupby-split operation natively against an Arrow table. There is a slight complication which is implementing "take" against chunked arrays. This could be mitigated by doing the groupby-split at the contiguous record batch level
{quote}
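
To illustrate the record-batch-level mitigation mentioned at the end of the quote, here is a hypothetical sketch (again relying on kernels from later pyarrow versions, and assuming the partition column has no nulls) that applies "take" only to contiguous record batches:

{code:python}
# Hypothetical sketch: groupby-split done per contiguous RecordBatch, so
# "take" never has to operate on chunked arrays. Relies on
# Array.dictionary_encode and RecordBatch.take from later pyarrow versions.
import numpy as np
import pyarrow as pa


def split_batch(batch: pa.RecordBatch, partition_col: str):
    """Yield (partition_value, sub-batch) pairs for one contiguous batch."""
    encoded = batch.column(partition_col).dictionary_encode()
    codes = np.asarray(encoded.indices)  # group id per row (assumes no nulls)
    for code, value in enumerate(encoded.dictionary):
        row_ids = np.nonzero(codes == code)[0]  # row positions of this group
        yield value, batch.take(pa.array(row_ids))


def split_table_by_batches(table: pa.Table, partition_col: str):
    for batch in table.to_batches():
        yield from split_batch(batch, partition_col)
{code}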


> [Python] parquet.write_to_dataset is memory-hungry on large DataFrames
> ----------------------------------------------------------------------
>
>                 Key: ARROW-2628
>                 URL: https://issues.apache.org/jira/browse/ARROW-2628
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>            Reporter: Wes McKinney
>            Priority: Major
>              Labels: parquet
>             Fix For: 0.14.0
>
>
> See discussion in https://github.com/apache/arrow/issues/1749. We should consider strategies for writing very large tables to a partitioned directory scheme. 


