Posted to issues@arrow.apache.org by "Lee June Woo (JIRA)" <ji...@apache.org> on 2018/12/26 09:21:00 UTC

[jira] [Commented] (ARROW-2709) [Python] write_to_dataset poor performance when splitting

    [ https://issues.apache.org/jira/browse/ARROW-2709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16728948#comment-16728948 ] 

Lee June Woo commented on ARROW-2709:
-------------------------------------

Hello,

May I ask a simple question about this improvement? I think it would be more efficient to split the pandas DataFrame on the "dt" column before converting it to an Arrow table.

Do you have any plans to implement a group-by operation on Arrow tables, or to improve the write_to_dataset function? I would like to work on this issue and contribute to the project as much as I can, if there is no time constraint.

> [Python] write_to_dataset poor performance when splitting
> ---------------------------------------------------------
>
>                 Key: ARROW-2709
>                 URL: https://issues.apache.org/jira/browse/ARROW-2709
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Olaf
>            Priority: Critical
>              Labels: parquet
>
> Hello,
> Posting this from github (master [~wesmckinn] asked for it :) )
> [https://github.com/apache/arrow/issues/2138]
>  
> {code:python}
> import pandas as pd
> import numpy as np
> import pyarrow.parquet as pq
> import pyarrow as pa
> idx = pd.date_range('2017-01-01 12:00:00.000', '2017-03-01 12:00:00.000', freq = 'T')
> df = pd.DataFrame({'numeric_col' : np.random.rand(len(idx)),
>                    'string_col' : pd.util.testing.rands_array(8,len(idx))},
>                   index = idx){code}
>  
> {code:python}
> df["dt"] = df.index 
> df["dt"] = df["dt"].dt.date 
> table = pa.Table.from_pandas(df) 
> pq.write_to_dataset(table, root_path='dataset_name', partition_cols=['dt'], flavor='spark'){code}
>  
> {{This works but is inefficient memory-wise. The Arrow table is a copy of the large pandas DataFrame and quickly saturates the RAM.}}
>  
> {{Thanks!}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)