Posted to issues@arrow.apache.org by "Kyle Barron (Jira)" <ji...@apache.org> on 2022/04/22 17:17:00 UTC

[jira] [Created] (ARROW-16287) PyArrow: RuntimeError: AppendRowGroups requires equal schemas when writing _metadata file

Kyle Barron created ARROW-16287:
-----------------------------------

             Summary: PyArrow: RuntimeError: AppendRowGroups requires equal schemas when writing _metadata file
                 Key: ARROW-16287
                 URL: https://issues.apache.org/jira/browse/ARROW-16287
             Project: Apache Arrow
          Issue Type: Bug
          Components: Parquet
    Affects Versions: 7.0.0
         Environment: MacOS. Python 3.8.10.
pyarrow: '7.0.0'
pandas: '1.4.2'
numpy: '1.22.3'
            Reporter: Kyle Barron


I'm trying to follow the example here: [https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-medata-files] to write an example partitioned dataset, but I'm consistently getting an error about non-equal schemas. Here's an MCVE:

```

from pathlib import Path

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

size = 100_000_000
partition_col = np.random.randint(0, 10, size)
values = np.random.rand(size)
table = pa.Table.from_pandas(
    pd.DataFrame({"partition_col": partition_col, "values": values})
)

metadata_collector = []
root_path = Path("random.parquet")
pq.write_to_dataset(
    table,
    root_path,
    partition_cols=["partition_col"],
    metadata_collector=metadata_collector,
)

# Write the ``_common_metadata`` parquet file without row groups statistics
pq.write_metadata(table.schema, root_path / "_common_metadata")

# Write the ``_metadata`` parquet file with row groups statistics of all files
pq.write_metadata(
    table.schema, root_path / "_metadata", metadata_collector=metadata_collector
)

```

This raises the following error:

```

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Input In [92], in <cell line: 1>()
----> 1 pq.write_metadata(
      2     table.schema, root_path / "_metadata", metadata_collector=metadata_collector
      3 )

File ~/tmp/env/lib/python3.8/site-packages/pyarrow/parquet.py:2324, in write_metadata(schema, where, metadata_collector, **kwargs)
   2322 metadata = read_metadata(where)
   2323 for m in metadata_collector:
-> 2324     metadata.append_row_groups(m)
   2325 metadata.write_metadata_file(where)

File ~/tmp/env/lib/python3.8/site-packages/pyarrow/_parquet.pyx:628, in pyarrow._parquet.FileMetaData.append_row_groups()

RuntimeError: AppendRowGroups requires equal schemas.

```

But all schemas in the `metadata_collector` list seem to be the same:

```

all(metadata_collector[0].schema == meta.schema for meta in metadata_collector)

# True

```


