Posted to issues@arrow.apache.org by "Wes McKinney (JIRA)" <ji...@apache.org> on 2019/01/10 04:58:00 UTC

[jira] [Commented] (ARROW-2659) [Python] More graceful reading of empty String columns in ParquetDataset

    [ https://issues.apache.org/jira/browse/ARROW-2659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16739008#comment-16739008 ] 

Wes McKinney commented on ARROW-2659:
-------------------------------------

I'm moving this to 0.13, as unfortunately I don't think we have the time to do this properly for 0.12.

I suggest we implement a couple of different things to help us:

* "Schema-normalized concatenate tables" -- perform safe casts and determine the merged schema for a collection of smaller tables, or attempt to safely cast tables to a fixed schema. Since null casts safely to any type, this would be one way to solve the problem

* Additionally implement partitioned writes natively against Arrow tables without going through pandas, to avoid the issues in ARROW-2860
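
As a rough illustration of the first suggestion, here is a minimal pure-Python sketch of schema-normalized concatenation. It does not use pyarrow: a schema is modelled as a dict of column name to type name, a table as a dict of column name to a list of values, and the names merge_schemas and concat_normalized are hypothetical, not existing Arrow APIs. The only promotion it models is "null" giving way to a concrete type.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-ins for illustration only (not pyarrow types):
# a schema maps column name -> type name, a table maps column name -> values.
Schema = Dict[str, str]
Table = Dict[str, list]


def merge_schemas(schemas: List[Schema]) -> Schema:
    """Merge schemas, letting "null" give way to any concrete type."""
    merged: Schema = {}
    for schema in schemas:
        for name, typ in schema.items():
            current = merged.get(name, "null")
            if current == "null":
                # Nothing concrete seen yet: adopt whatever this schema says.
                merged[name] = typ
            elif typ != "null" and typ != current:
                # Two different concrete types: no safe cast is modelled here.
                raise TypeError(
                    f"column {name!r}: cannot merge {current} with {typ}"
                )
    return merged


def concat_normalized(
    tables: List[Table], schemas: List[Schema]
) -> Tuple[Schema, Table]:
    """Concatenate tables column by column under the merged schema."""
    merged = merge_schemas(schemas)
    out: Table = {name: [] for name in merged}
    for table in tables:
        n_rows = len(next(iter(table.values()), []))
        for name in merged:
            # A column missing from one table is filled with nulls.
            out[name].extend(table.get(name, [None] * n_rows))
    return merged, out
```

For example, concatenating a piece whose "name" column is typed "string" with a piece whose "name" column is typed "null" yields a merged schema in which "name" is "string", with the all-null rows kept as None.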

> [Python] More graceful reading of empty String columns in ParquetDataset
> ------------------------------------------------------------------------
>
>                 Key: ARROW-2659
>                 URL: https://issues.apache.org/jira/browse/ARROW-2659
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.9.0
>            Reporter: Uwe L. Korn
>            Priority: Major
>              Labels: parquet
>             Fix For: 0.13.0
>
>         Attachments: read_parquet_dataset.error.read_table.novalidation.txt, read_parquet_dataset.error.read_table.txt
>
>
> Currently, when saving a {{ParquetDataset}} from Pandas, we don't get consistent schemas, even if the source was a single DataFrame. This is because object columns such as strings can become empty in some partitions, and the resulting Arrow schema then differs. In the central metadata, we store this column as {{pa.string}}, whereas in the partition file with the empty column, it is stored as {{pa.null}}.
> The two schemas are still a valid match in terms of schema evolution, and we should respect that in https://github.com/apache/arrow/blob/79a22074e0b059a24c5cd45713f8d085e24f826a/python/pyarrow/parquet.py#L754. Instead of doing a {{pa.Schema.equals}} in https://github.com/apache/arrow/blob/79a22074e0b059a24c5cd45713f8d085e24f826a/python/pyarrow/parquet.py#L778, we should introduce a new method {{pa.Schema.can_evolve_to}} that is more graceful and returns {{True}} if a dataset piece has a null column where the main metadata states a nullable column of any type.
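
The check proposed in the description can be sketched as follows. This is only an illustration under stated assumptions: it is pure Python rather than pyarrow, the Field class is a hypothetical stand-in for an Arrow field, and can_evolve_to is written as a free function because no such method exists on pa.Schema at the time of this issue.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for an Arrow field (not a pyarrow class)."""
    name: str
    type: str
    nullable: bool = True


def can_evolve_to(piece: List[Field], target: List[Field]) -> bool:
    """Return True if the piece schema matches the target schema, treating
    a null-typed column in the piece as compatible with any nullable
    column in the target."""
    if [f.name for f in piece] != [f.name for f in target]:
        return False
    for p, t in zip(piece, target):
        if p.type == t.type:
            continue
        # In this sketch the nullable flag is only consulted for
        # null-typed piece columns.
        if p.type == "null" and t.nullable:
            continue
        return False
    return True
```

With a main schema of (id: int64 not null, name: string) and a piece schema of (id: int64 not null, name: null), the piece can evolve to the main schema, but not the other way around.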


