Posted to jira@arrow.apache.org by "Daniel Figus (Jira)" <ji...@apache.org> on 2020/12/22 14:04:00 UTC
[jira] [Commented] (ARROW-2079) [Python][C++] Possibly use `_common_metadata` for schema if `_metadata` isn't available

    [ https://issues.apache.org/jira/browse/ARROW-2079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253519#comment-17253519 ] 

Daniel Figus commented on ARROW-2079:
-------------------------------------

[~jorisvandenbossche]: In my opinion, the biggest advantage of using {{_common_metadata}} is with schema evolution. The first Parquet file might not contain all fields, and newer files might have additional ones. To get the full schema with all fields, one would need to infer the effective schema from all files, which might be very expensive for larger datasets.

The current workaround is to manually read the schema from the {{_common_metadata}} file and pass it to the dataset API.
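
A minimal sketch of that workaround, assuming a dataset directory that has a {{_common_metadata}} file at its root. The helper names {{find_common_metadata}} and {{open_dataset_with_common_schema}} are hypothetical and only for illustration; {{pyarrow.parquet.read_schema}} and {{pyarrow.dataset.dataset}} are the pyarrow calls used.

```python
from pathlib import Path


def find_common_metadata(root):
    """Return the path of `_common_metadata` under `root`, or None if absent.

    Hypothetical helper, not part of pyarrow.
    """
    candidate = Path(root) / "_common_metadata"
    return candidate if candidate.is_file() else None


def open_dataset_with_common_schema(root):
    """Open a Parquet dataset, preferring the schema in `_common_metadata`.

    Hypothetical helper: when the file is missing, schema=None lets
    pyarrow infer the schema itself.
    """
    # Imported lazily so find_common_metadata() works without pyarrow.
    import pyarrow.dataset as ds
    import pyarrow.parquet as pq

    meta_path = find_common_metadata(root)
    schema = pq.read_schema(str(meta_path)) if meta_path is not None else None
    return ds.dataset(str(root), format="parquet", schema=schema)
```

Passing the schema explicitly means every file is read against the full set of fields, so there is no need to open each file just to discover columns that were added later.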

> [Python][C++] Possibly use `_common_metadata` for schema if `_metadata` isn't available
> ---------------------------------------------------------------------------------------
>
>                 Key: ARROW-2079
>                 URL: https://issues.apache.org/jira/browse/ARROW-2079
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Jim Crist
>            Priority: Minor
>              Labels: dataset, dataset-parquet-read, parquet
>
> Currently pyarrow's parquet writer only writes `_common_metadata` and not `_metadata`. From what I understand these are intended to contain the dataset schema but not any row group information.
>  
> A few (possibly naive) questions:
>  
> 1. In the `__init__` for `ParquetDataset`, the following lines exist:
> {code:python}
> if self.metadata_path is not None:
>     with self.fs.open(self.metadata_path) as f:
>         self.common_metadata = ParquetFile(f).metadata
> else:
>     self.common_metadata = None
> {code}
> I believe this should use `common_metadata_path` instead of `metadata_path`: the latter points to the `_metadata` file, which `pyarrow` never writes, rather than to `_common_metadata` (as seemingly intended?).
>  
> 2. In `validate_schemas` I believe an option should exist for using the schema from `_common_metadata` instead of `_metadata`, as pyarrow currently only writes the former, and as far as I can tell `_common_metadata` does include all the schema information needed.
>  
> Perhaps the logic in `validate_schemas` could be ported over to:
>  
> {code:python}
> if self.schema is not None:
>     pass  # schema explicitly provided
> elif self.metadata is not None:
>     self.schema = self.metadata.schema
> elif self.common_metadata is not None:
>     self.schema = self.common_metadata.schema
> else:
>     self.schema = self.pieces[0].get_metadata(open_file).schema
> {code}
> If these changes are valid, I'd be happy to submit a PR. The difference between `_common_metadata` and `_metadata` is not 100% clear to me, but I believe the schema in both should be the same. Figured I'd open this for discussion.
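
The fallback order proposed in the quoted description can be sketched as a standalone function. This is only an illustration under stated assumptions: {{resolve_schema}} and its parameters are hypothetical names, the metadata objects are stand-ins exposing a {{schema}} attribute, and none of this is the actual {{ParquetDataset}} implementation.

```python
from types import SimpleNamespace


def resolve_schema(explicit_schema=None, metadata=None, common_metadata=None,
                   load_first_piece_schema=None):
    """Pick a dataset schema using the fallback order from the proposal.

    Hypothetical helper; `load_first_piece_schema` is a zero-argument
    callable so the first file is only opened as a last resort.
    """
    if explicit_schema is not None:
        return explicit_schema          # schema explicitly provided
    if metadata is not None:
        return metadata.schema          # from `_metadata`
    if common_metadata is not None:
        return common_metadata.schema   # from `_common_metadata`
    if load_first_piece_schema is not None:
        return load_first_piece_schema()
    raise ValueError("no schema source available")


# Stand-in object for illustration; real code would pass file metadata.
common = SimpleNamespace(schema="schema-from-common-metadata")
chosen = resolve_schema(common_metadata=common)
```

Making the last step a callable keeps the expensive read of the first data file out of the common path when either metadata file is present.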
