Posted to user@arrow.apache.org by Joris Peeters <jo...@gmail.com> on 2021/07/20 16:18:19 UTC

[Parquet] writing metadata for dataset with partitions

The docs on https://arrow.apache.org/docs/python/parquet.html suggest a
mechanism for collecting and writing metadata when using `pq.write_to_dataset`
to build the dataset on disk.

Using that mechanism I ran into an issue when employing partitions. It is
perhaps most easily demonstrated with a small reproduction script, using a
table with two columns, one of which (`letter`) is the partitioning column.

import tempfile
import random
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import os
import pandas as pd

N = 100
df = pd.DataFrame({
    'letter': [random.choice(['A', 'B']) for _ in range(0, N)],
    'number': np.random.rand(N)})

table = pa.Table.from_pandas(df, schema=pa.schema(fields=[
    pa.field('letter', pa.string(), nullable=False),
    pa.field('number', pa.float64(), nullable=False)]))

with tempfile.TemporaryDirectory() as root:
    metadata_collector = []
    pq.write_to_dataset(
        table,
        root_path=root,
        partition_cols=['letter'],
        metadata_collector=metadata_collector)

    pq.write_metadata(table.schema, os.path.join(root, '_common_metadata'))
    pq.write_metadata(table.schema, os.path.join(root, '_metadata'),
                      metadata_collector=metadata_collector)

which gives (on the last line),

RuntimeError: AppendRowGroups requires equal schemas.

on https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L2184


It looks like this is because the common metadata's schema (which has
all the columns) differs from the schemas of the collected FileMetaData
objects, which omit the partitioning column, i.e. are of shape:


<pyarrow._parquet.ParquetSchema object at 0x0000024FB1084100>
required group field_id=0 schema {
  required double field_id=1 number;
}
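
For reference, the mismatch is easy to confirm by comparing the first
collected FileMetaData's schema against the table schema, e.g. (inside the
same `with` block, after `pq.write_to_dataset`):

# The collected FileMetaData only describes the data actually written to
# the files, which excludes the partition column.
print(metadata_collector[0].schema.to_arrow_schema())  # only 'number'
print(table.schema)                                    # 'letter' and 'number'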


Happy to take a more manual approach to writing the metadata - as
suggested in the docs for when write_to_dataset isn't used - but was
wondering if this is:

- a known issue

- for which there is a correct solution (i.e. which of the two
schemas should it be?)

- that I could contribute a fix for.


-J

Re: [Parquet] writing metadata for dataset with partitions

Posted by Weston Pace <we...@gmail.com>.
I just ran into this myself not too long ago :).  I've been adding
support for this process to the new write_dataset API which will
eventually (hopefully) obsolete pq.write_to_dataset.

> Happy to take a more manual approach to writing the metadata - as suggested in the docs for when write_to_dataset isn't used - but was wondering if this is:
> - a known issue
Yes.  You can find more details in ARROW-13269[1].

> - for which there is a correct solution (i.e. which of the two schemas should it be?)
The _common_metadata should have all of the columns (including
partitioning).  The _metadata should not.  Right now this also only
works if all the files in _metadata have the same schema, so let us
know if that is an issue for your use case.
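
Concretely, for the reproduction above, that means dropping the partition
column from the schema used for _metadata while keeping the full schema for
_common_metadata. A rough (untested) sketch, reusing `table` and `root` from
the script:

# _common_metadata keeps the full schema, including the partition column.
pq.write_metadata(table.schema, os.path.join(root, '_common_metadata'))

# _metadata has to match the schemas of the collected FileMetaData, which
# omit the partition column, so drop 'letter' before writing.
data_schema = table.schema.remove(table.schema.get_field_index('letter'))
pq.write_metadata(data_schema, os.path.join(root, '_metadata'),
                  metadata_collector=metadata_collector)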

> - that I could contribute a fix for.
The documentation around this could definitely be improved.  There is
precious little non-Arrow documentation around these files, so it is
rather tricky to google for.  If you have suggestions on ways this
process could be made easier, that is always welcome too.

[1] https://issues.apache.org/jira/browse/ARROW-13269
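
For completeness, the "more manual approach" mentioned above (for when
write_to_dataset isn't used) could look roughly like the sketch below: write
each file with pq.write_table, collect its FileMetaData, record the path
relative to the dataset root, and merge the row groups into _metadata at the
end, which sidesteps the schema question because no Arrow schema is passed to
write_metadata at all. The hive-style directory names and 'part-0.parquet'
filenames are made up for illustration, and `table`/`root` are reused from
the reproduction script.

import pyarrow.compute as pc

metadata_collector = []
for letter in ['A', 'B']:
    # Select this partition's rows and drop the partition column from the
    # file contents, mirroring what write_to_dataset does.
    part = table.filter(pc.equal(table['letter'], letter)).drop(['letter'])
    os.makedirs(os.path.join(root, f'letter={letter}'), exist_ok=True)
    rel_path = f'letter={letter}/part-0.parquet'
    pq.write_table(part, os.path.join(root, rel_path),
                   metadata_collector=metadata_collector)
    # Paths stored in _metadata should be relative to the dataset root.
    metadata_collector[-1].set_file_path(rel_path)

# Merge all collected FileMetaData into a single _metadata file.
merged = metadata_collector[0]
for md in metadata_collector[1:]:
    merged.append_row_groups(md)
merged.write_metadata_file(os.path.join(root, '_metadata'))

# _common_metadata only carries the (full) schema, no row group statistics.
pq.write_metadata(table.schema, os.path.join(root, '_common_metadata'))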
