Posted to user@arrow.apache.org by Ted Gooch <tg...@netflix.com> on 2021/04/29 11:56:33 UTC

[C++][python] Arrow Parquet metadata issues with round trip read/write table

Hi,

I'm running into an issue where I read in some Parquet data and write it
back out, but the field_ids in the file I write don't match the schema I
provided to pyarrow.parquet.write_table. I browsed through the PR that
added support for field_id metadata, and it looks like this is known
behavior, tracked by this currently open issue:
https://issues.apache.org/jira/browse/PARQUET-1798

Is there any way in the current API to get write_table to use the
field_ids from the provided schema? Or is the depth-first (DFS) assignment
of field_ids the only available behavior until the issue referenced above
is resolved?

*Basic Example here:*

import pyarrow.parquet as pq

# arrow_tbl is an existing pyarrow.Table that was previously read from
# Parquet; its schema carries the original field_id metadata.
print("------------ORIGINAL------------")
print(arrow_tbl.schema)

# Write it back out, then re-open the file and inspect the schema.
pq.write_table(arrow_tbl, 'example.parquet')
read_back = pq.ParquetFile('example.parquet')
print("------------READ BACK------------")
print(read_back.schema_arrow)

*Output*
------------ORIGINAL------------
tester_flags: list<element: string>
  child 0, element: string
    -- field metadata --
    PARQUET:field_id: '36'
  -- field metadata --
  PARQUET:field_id: '16'
signup_country_iso_code: string
  -- field metadata --
  PARQUET:field_id: '17'
-- schema metadata --
iceberg.schema: '{"type":"struct","fields":[{"id":1,"name":"account_id","'
+ 5286
------------READ BACK------------
tester_flags: list<element: string>
  child 0, element: string
    -- field metadata --
    PARQUET:field_id: '3'
  -- field metadata --
  PARQUET:field_id: '1'
signup_country_iso_code: string
  -- field metadata --
  PARQUET:field_id: '4'
-- schema metadata --
iceberg.schema: '{"type":"struct","fields":[{"id":1,"name":"account_id","'
+ 5286
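
The mismatch can also be checked programmatically. A small sketch built on
the example above (comparing only the top-level fields):

# Compare the PARQUET:field_id metadata before and after the round trip.
for before, after in zip(arrow_tbl.schema, read_back.schema_arrow):
    print(before.name,
          (before.metadata or {}).get(b'PARQUET:field_id'),
          '->',
          (after.metadata or {}).get(b'PARQUET:field_id'))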

Re: [C++][python] Arrow Parquet metadata issues with round trip read/write table

Posted by Ted Gooch <tg...@netflix.com>.
Thanks Weston. Unfortunately this data will be consumed downstream by Spark
and Trino. I do actually have the Iceberg schema saved in the table
metadata, but I'm pretty sure (I haven't browsed the code of either engine
yet, but still...) that neither one will leverage that info.
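
For reference, that saved schema is reachable from the schema-level
metadata; e.g., a quick sketch using the read_back file from the earlier
example:

import json

# The Iceberg schema (shown truncated in the output above) is stored in
# the Arrow schema metadata under the 'iceberg.schema' key.
iceberg_schema = json.loads(read_back.schema_arrow.metadata[b'iceberg.schema'])
print(iceberg_schema['fields'][0])  # {'id': 1, 'name': 'account_id', ...}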

Re: [C++][python] Arrow Parquet metadata issues with round trip read/write table

Posted by Weston Pace <we...@ursacomputing.com>.
You could copy the Parquet field ids when you originally read in the data
and write them out under a custom metadata key.  That will get saved
(unmodified) into the Parquet file.  Then, after reading the Parquet file
back, you could copy your custom metadata into the field_id field
(replacing the made-up field ids).

This won't help if your workflow is (external tool -> arrow -> parquet file
-> external tool), but it may help if your workflow is (external tool ->
arrow -> parquet file -> arrow -> external tool).
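
Untested sketch of what that might look like.  The custom key name is made
up, and this only handles top-level fields; nested children (like the list
element in your schema) would need a recursive version:

import pyarrow as pa
import pyarrow.parquet as pq

PARQUET_KEY = b'PARQUET:field_id'
STASH_KEY = b'my_app:field_id'   # arbitrary custom key, pick anything

def _copy_key(field, src, dst):
    # Duplicate one metadata entry on a single top-level field.
    meta = dict(field.metadata or {})
    if src in meta:
        meta[dst] = meta[src]
    return field.with_metadata(meta)

def _rewrite(table, src, dst):
    # Rebuild the table with the copied metadata on every field.
    fields = [_copy_key(f, src, dst) for f in table.schema]
    schema = pa.schema(fields, metadata=table.schema.metadata)
    return pa.table(table.columns, schema=schema)

# Before writing: stash the original ids under the custom key, which
# write_table passes through untouched.
stashed = _rewrite(arrow_tbl, PARQUET_KEY, STASH_KEY)
pq.write_table(stashed, 'example.parquet')

# After reading back: overwrite the writer-assigned ids with the stash.
restored = _rewrite(pq.read_table('example.parquet'), STASH_KEY, PARQUET_KEY)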
