Posted to jira@arrow.apache.org by "Kyle Barron (Jira)" <ji...@apache.org> on 2022/04/22 17:18:00 UTC
[jira] [Updated] (ARROW-16287) PyArrow: RuntimeError: AppendRowGroups requires equal schemas when writing _metadata file
[ https://issues.apache.org/jira/browse/ARROW-16287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kyle Barron updated ARROW-16287:
--------------------------------
Description:
I'm trying to follow the example here: [https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-medata-files] to write an example partitioned dataset, but I consistently get an error about non-equal schemas. Here's an MCVE (minimal, complete, verifiable example):
{code:python}
from pathlib import Path

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

size = 100_000_000
partition_col = np.random.randint(0, 10, size)
values = np.random.rand(size)
table = pa.Table.from_pandas(
    pd.DataFrame({"partition_col": partition_col, "values": values})
)

metadata_collector = []
root_path = Path("random.parquet")
pq.write_to_dataset(
    table,
    root_path,
    partition_cols=["partition_col"],
    metadata_collector=metadata_collector,
)

# Write the ``_common_metadata`` parquet file without row group statistics
pq.write_metadata(table.schema, root_path / "_common_metadata")

# Write the ``_metadata`` parquet file with row group statistics of all files
pq.write_metadata(
    table.schema, root_path / "_metadata", metadata_collector=metadata_collector
) {code}
This raises the following error:
{code}
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Input In [92], in <cell line: 1>()
----> 1 pq.write_metadata(
      2     table.schema, root_path / "_metadata", metadata_collector=metadata_collector
      3 )

File ~/tmp/env/lib/python3.8/site-packages/pyarrow/parquet.py:2324, in write_metadata(schema, where, metadata_collector, **kwargs)
   2322 metadata = read_metadata(where)
   2323 for m in metadata_collector:
-> 2324     metadata.append_row_groups(m)
   2325 metadata.write_metadata_file(where)

File ~/tmp/env/lib/python3.8/site-packages/pyarrow/_parquet.pyx:628, in pyarrow._parquet.FileMetaData.append_row_groups()

RuntimeError: AppendRowGroups requires equal schemas. {code}
But all schemas in the {{metadata_collector}} list seem to be the same:
{code:python}
all(metadata_collector[0].schema == meta.schema for meta in metadata_collector)
# True {code}
was:
I'm trying to follow the example here: [https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-medata-files] to write an example partitioned dataset. But I'm consistently getting an error about non-equal schemas. Here's a mcve:
```
from pathlib import Path
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
size = 100_000_000
partition_col = np.random.randint(0, 10, size)
values = np.random.rand(size)
table = pa.Table.from_pandas(
pd.DataFrame({"partition_col": partition_col, "values": values})
)
metadata_collector = []
root_path = Path("random.parquet")
pq.write_to_dataset(
table,
root_path,
partition_cols=["partition_col"],
metadata_collector=metadata_collector,
)
# Write the ``_common_metadata`` parquet file without row groups statistics
pq.write_metadata(table.schema, root_path / "_common_metadata")
# Write the ``_metadata`` parquet file with row groups statistics of all files
pq.write_metadata(
table.schema, root_path / "_metadata", metadata_collector=metadata_collector
)
```
This raises the error
```
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Input In [92], in <cell line: 1>()
----> 1 pq.write_metadata(
2 table.schema, root_path / "_metadata", metadata_collector=metadata_collector
3 )
File ~/tmp/env/lib/python3.8/site-packages/pyarrow/parquet.py:2324, in write_metadata(schema, where, metadata_collector, **kwargs)
2322 metadata = read_metadata(where)
2323 for m in metadata_collector:
-> 2324 metadata.append_row_groups(m)
2325 metadata.write_metadata_file(where)
File ~/tmp/env/lib/python3.8/site-packages/pyarrow/_parquet.pyx:628, in pyarrow._parquet.FileMetaData.append_row_groups()
RuntimeError: AppendRowGroups requires equal schemas.
```
But all schemas in the `metadata_collector` list seem to be the same:
```
all(metadata_collector[0].schema == meta.schema for meta in metadata_collector)
# True
```
> PyArrow: RuntimeError: AppendRowGroups requires equal schemas when writing _metadata file
> -----------------------------------------------------------------------------------------
>
> Key: ARROW-16287
> URL: https://issues.apache.org/jira/browse/ARROW-16287
> Project: Apache Arrow
> Issue Type: Bug
> Components: Parquet
> Affects Versions: 7.0.0
> Environment: MacOS. Python 3.8.10.
> pyarrow: '7.0.0'
> pandas: '1.4.2'
> numpy: '1.22.3'
> Reporter: Kyle Barron
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.7#820007)