Posted to user@arrow.apache.org by Vasilis Themelis <vd...@gmail.com> on 2021/10/12 09:40:17 UTC

[python] Duplication of data in 'ARROW:schema' metadata?

Hi,

It looks like pyarrow adds some metadata under 'ARROW:schema' that
duplicates the rest of the key-value metadata in the resulting parquet file:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import base64 as b64

df = pd.DataFrame({'one': [-1, 2], 'two': ['foo', 'bar']})
table = pa.Table.from_pandas(df)
pq.write_table(table, "example.parquet")
metadata = pq.read_metadata("example.parquet").metadata
print("==== All metadata ====")
print(metadata)
print("")
print("==== ARROW:schema ====")
print(metadata[b'ARROW:schema'])
print("")
print("==== b64 decoded ====")
print(b64.b64decode(metadata[b'ARROW:schema']))

The above should show the duplication between "All metadata" and "b64
decoded" ARROW:schema.
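
Out of curiosity, I also tried parsing those decoded bytes back into a
schema. I am assuming here that the value is an IPC-serialized Arrow schema
and that pa.ipc.read_schema accepts it as a buffer; if so, it shows the same
pandas metadata again:

schema_buf = pa.py_buffer(b64.b64decode(metadata[b'ARROW:schema']))
# Assumption: ARROW:schema holds a base64-encoded IPC schema message,
# so read_schema should be able to reconstruct the pyarrow Schema.
arrow_schema = pa.ipc.read_schema(schema_buf)
print(arrow_schema.metadata)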

What is the reason for this? Is there a good use for ARROW:schema?

I have used other libraries to write parquet files and none of them adds
the 'ARROW:schema' metadata, yet I had no issues reading their output
files with pyarrow or similar tools. As an example, here is the result of
writing the same dataframe into parquet using fastparquet:

from fastparquet import write
write("example-fq.parquet", df)
print(pq.read_metadata("example-fq.parquet").metadata)

Also, given that this duplication can significantly increase the size of
the file when a large amount of metadata is stored, would it be possible
to optionally disable writing 'ARROW:schema', provided the output files
remain functional without it?

Vasilis Themelis

Re: [python] Duplication of data in 'ARROW:schema' metadata?

Posted by Wes McKinney <we...@gmail.com>.
hi Vasilis,

The Arrow schema is used to restore metadata (like timestamp time
zones) and reconstruct other Arrow types which might otherwise be lost
in the roundtrip (like returning data as dictionary-encoded if it was
written originally that way). This can be disabled by disabling the
store_schema option in ArrowWriterProperties.
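
A rough sketch of what this could look like from Python, assuming a
pyarrow build that exposes the C++ option as a keyword argument to
write_table (older releases may only expose it through the C++
ArrowWriterProperties API, and the exact spelling may differ):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'one': [-1, 2], 'two': ['foo', 'bar']})
# Sketch: store_schema=False is assumed to map onto the C++
# ArrowWriterProperties store_schema option and skip writing ARROW:schema.
pq.write_table(table, "example-no-schema.parquet", store_schema=False)
meta = pq.read_metadata("example-no-schema.parquet").metadata
print(b'ARROW:schema' in (meta or {}))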

You are right that the schema metadata is being duplicated, appearing
both in ARROW:schema and in the Parquet schema-level key-value metadata.
I believe this is a bug, and we should fix it either by not storing the
Arrow metadata in the Parquet metadata (storing it only in ARROW:schema)
or by dropping the metadata from ARROW:schema and using that entry only
for restoring data types and type metadata.

https://issues.apache.org/jira/browse/ARROW-14303

Thanks,
Wes
