Posted to jira@arrow.apache.org by "Steven Anton (Jira)" <ji...@apache.org> on 2021/11/24 22:32:00 UTC

[jira] [Commented] (ARROW-2659) [Python] More graceful reading of empty String columns in ParquetDataset

    [ https://issues.apache.org/jira/browse/ARROW-2659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17448831#comment-17448831 ] 

Steven Anton commented on ARROW-2659:
-------------------------------------

This is still an issue for me. The {{ParquetDataset}} implementation won't help in the case where you're creating a partitioned dataset one partition at a time. Suppose one partition has all nulls in a column; that partition will be written with the null datatype. In another partition, the data could contain strings.

There are a number of options that could help:
* At the very least, I feel that {{Schema.from_pandas}} should generate a warning when encountering {{object}} columns that are all null. Something along the lines of "Unable to infer data type for <col>". Yes, there's a note in the docstring, but that's not visible when using something like {{pd.DataFrame.to_parquet}}.
* Another idea is to add a keyword argument to {{Schema.from_pandas}} that explicitly specifies the data type for all-null columns where the type is ambiguous. It could accept either a single type or a dictionary mapping column names to types.
* Lastly, it seems reasonable to allow some simple schema merging. For example, if you have a partitioned dataset and some partitions have the data type as null for one column in one partition but string in another, it seems we should be able to merge the two to string. (Of course, null, int, and string could not be merged and should still raise an exception.)

> [Python] More graceful reading of empty String columns in ParquetDataset
> ------------------------------------------------------------------------
>
>                 Key: ARROW-2659
>                 URL: https://issues.apache.org/jira/browse/ARROW-2659
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.9.0
>            Reporter: Uwe Korn
>            Assignee: Joris Van den Bossche
>            Priority: Major
>              Labels: dataset, dataset-parquet-read, parquet
>             Fix For: 7.0.0
>
>         Attachments: read_parquet_dataset.error.read_table.novalidation.txt, read_parquet_dataset.error.read_table.txt
>
>
> When currently saving a {{ParquetDataset}} from Pandas, we don't get consistent schemas, even if the source was a single DataFrame. This is because object columns such as strings can be entirely empty in some partitions, so the resulting Arrow schema differs. In the central metadata, we will store this column as {{pa.string}}, whereas in the partition file with the empty column, this column will be stored as {{pa.null}}.
> The two schemas are still a valid match in terms of schema evolution, and we should respect that in https://github.com/apache/arrow/blob/79a22074e0b059a24c5cd45713f8d085e24f826a/python/pyarrow/parquet.py#L754. Instead of doing a {{pa.Schema.equals}} check in https://github.com/apache/arrow/blob/79a22074e0b059a24c5cd45713f8d085e24f826a/python/pyarrow/parquet.py#L778, we should introduce a new method {{pa.Schema.can_evolve_to}} that is more graceful and returns {{True}} if a dataset piece has a null column where the main metadata states a nullable column of any type.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)