Posted to issues@arrow.apache.org by "Kyle Barron (Jira)" <ji...@apache.org> on 2022/04/22 17:17:00 UTC
[jira] [Created] (ARROW-16287) PyArrow: RuntimeError: AppendRowGroups requires equal schemas when writing _metadata file
Kyle Barron created ARROW-16287:
-----------------------------------
Summary: PyArrow: RuntimeError: AppendRowGroups requires equal schemas when writing _metadata file
Key: ARROW-16287
URL: https://issues.apache.org/jira/browse/ARROW-16287
Project: Apache Arrow
Issue Type: Bug
Components: Parquet
Affects Versions: 7.0.0
Environment: MacOS. Python 3.8.10.
pyarrow: '7.0.0'
pandas: '1.4.2'
numpy: '1.22.3'
Reporter: Kyle Barron
I'm trying to follow the example here: [https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-medata-files] to write an example partitioned dataset, but I consistently get an error about non-equal schemas. Here's an MCVE:
```
from pathlib import Path

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

size = 100_000_000
partition_col = np.random.randint(0, 10, size)
values = np.random.rand(size)
table = pa.Table.from_pandas(
    pd.DataFrame({"partition_col": partition_col, "values": values})
)

metadata_collector = []
root_path = Path("random.parquet")
pq.write_to_dataset(
    table,
    root_path,
    partition_cols=["partition_col"],
    metadata_collector=metadata_collector,
)

# Write the ``_common_metadata`` parquet file without row groups statistics
pq.write_metadata(table.schema, root_path / "_common_metadata")

# Write the ``_metadata`` parquet file with row groups statistics of all files
pq.write_metadata(
    table.schema, root_path / "_metadata", metadata_collector=metadata_collector
)
```
This raises the following error:
```
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Input In [92], in <cell line: 1>()
----> 1 pq.write_metadata(
2 table.schema, root_path / "_metadata", metadata_collector=metadata_collector
3 )
File ~/tmp/env/lib/python3.8/site-packages/pyarrow/parquet.py:2324, in write_metadata(schema, where, metadata_collector, **kwargs)
2322 metadata = read_metadata(where)
2323 for m in metadata_collector:
-> 2324 metadata.append_row_groups(m)
2325 metadata.write_metadata_file(where)
File ~/tmp/env/lib/python3.8/site-packages/pyarrow/_parquet.pyx:628, in pyarrow._parquet.FileMetaData.append_row_groups()
RuntimeError: AppendRowGroups requires equal schemas.
```
But all schemas in the `metadata_collector` list seem to be the same:
```
all(metadata_collector[0].schema == meta.schema for meta in metadata_collector)
# True
```