Posted to issues@arrow.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/04/10 18:54:00 UTC

[jira] [Commented] (ARROW-1938) [Python] Error writing to partitioned Parquet dataset

    [ https://issues.apache.org/jira/browse/ARROW-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16432768#comment-16432768 ] 

ASF GitHub Bot commented on ARROW-1938:
---------------------------------------

joshuastorck opened a new pull request #453: Bug fix for ARROW-1938
URL: https://github.com/apache/parquet-cpp/pull/453
 
 
   The error was reported here: https://issues.apache.org/jira/browse/ARROW-1938.
   
   Because writing dictionary types is not supported yet, the code first converts the dictionary column to its actual values before writing. However, the existing code accidentally used zero as the offset and the full length of the column as the size. As a result, every chunk wrote the entire column's values instead of just its own slice, which presumably explains the symptom in the report below: one column carried the full 187374 values while the previous column held a 10000-row chunk.
   
   The fix is to pass the chunk's offset and size through the recursive call to WriteColumnChunk with the "flattened" data.
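   
   In pyarrow terms, the intended slicing can be sketched as follows (a minimal illustration, not the parquet-cpp implementation; the sample data, the row-group loop, and the use of a modern pyarrow cast are all invented for the example):
   
       import pyarrow as pa
       
       # A dictionary-encoded column, e.g. produced from a pandas Categorical.
       dict_arr = pa.array(["a", "b", "c", "a"] * 2500).dictionary_encode()
       
       # Dictionary writing is not supported yet, so the column is first
       # decoded ("flattened") to plain values.
       plain = dict_arr.cast(pa.string())
       
       row_group_size = 1000
       for offset in range(0, len(plain), row_group_size):
           # Correct: each row group gets the slice [offset, offset + size).
           chunk = plain.slice(offset, row_group_size)
           # The buggy path behaved like plain.slice(0, len(plain)) here,
           # writing the whole column once per row group.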



> [Python] Error writing to partitioned Parquet dataset
> -----------------------------------------------------
>
>                 Key: ARROW-1938
>                 URL: https://issues.apache.org/jira/browse/ARROW-1938
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.8.0
>         Environment: Linux (Ubuntu 16.04)
>            Reporter: Robert Dailey
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.10.0
>
>         Attachments: ARROW-1938-test-data.csv.gz, ARROW-1938.py, pyarrow_dataset_error.png
>
>
> I receive the following error after upgrading to pyarrow 0.8.0 when writing to a dataset:
> * ArrowIOError: Column 3 had 187374 while previous column had 10000
> The command was:
> write_table_values = {'row_group_size': 10000}
> pq.write_to_dataset(pa.Table.from_pandas(df, preserve_index=True), '/logs/parsed/test', partition_cols=['Product', 'year', 'month', 'day', 'hour'], **write_table_values)
> I've also tried write_table_values = {'chunk_size': 10000} and received the same error.
> This same command works in version 0.7.1.  I am trying to troubleshoot the problem but wanted to submit a ticket.
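
A self-contained script in the spirit of the report might look like this (a hedged sketch: the frame contents, the categorical column, and the output path are invented; only the call shape, the partition columns, and row_group_size come from the report):

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq
    
    n = 50000  # large enough to span several 10000-row groups
    df = pd.DataFrame({
        'Product': ['widget'] * n,
        'year': [2018] * n,
        'month': [4] * n,
        'day': [10] * n,
        'hour': [18] * n,
        # A pandas Categorical becomes an Arrow dictionary column,
        # which is the code path the fix above touches.
        'category': pd.Categorical(['x', 'y'] * (n // 2)),
        'value': range(n),
    })
    
    table = pa.Table.from_pandas(df, preserve_index=True)
    # Raised ArrowIOError under pyarrow 0.8.0; fixed for the 0.10.0 release.
    pq.write_to_dataset(table, '/tmp/arrow_1938_dataset',
                        partition_cols=['Product', 'year', 'month', 'day', 'hour'],
                        row_group_size=10000)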


