Posted to dev@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/06/02 09:27:00 UTC

[jira] [Created] (ARROW-9009) [C++][Dataset] ARROW:schema should be removed from schema's metadata when reading Parquet files

Joris Van den Bossche created ARROW-9009:
--------------------------------------------

             Summary: [C++][Dataset] ARROW:schema should be removed from schema's metadata when reading Parquet files
                 Key: ARROW-9009
                 URL: https://issues.apache.org/jira/browse/ARROW-9009
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
            Reporter: Joris Van den Bossche


Reading a Parquet file (written by Arrow) with the datasets API preserves the "ARROW:schema" field in the schema metadata:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

table = pa.table({'a': [1, 2, 3]})
pq.write_table(table, "test.parquet")

dataset = ds.dataset("test.parquet", format="parquet")
{code}

{code}
In [7]: dataset.schema
Out[7]:
a: int64
  -- field metadata --
  PARQUET:field_id: '1'
-- schema metadata --
ARROW:schema: '/////3gAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABAwAMAAAACAAIAAAABA' + 114

In [8]: dataset.to_table().schema
Out[8]:
a: int64
  -- field metadata --
  PARQUET:field_id: '1'
-- schema metadata --
ARROW:schema: '/////3gAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABAwAMAAAACAAIAAAABA' + 114
{code}

whereas reading the same file with the `parquet` module reader does not preserve this metadata:

{code}
In [9]: pq.read_table("test.parquet").schema
Out[9]:
a: int64
  -- field metadata --
  PARQUET:field_id: '1'
{code}

The "ARROW:schema" information is only used to properly reconstruct the Arrow schema from the ParquetSchema. Once the Arrow schema has been reconstructed, that information is no longer needed, so it probably should not be kept as metadata in the Arrow schema.
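
As a rough illustration of the proposed behaviour (not the actual C++ change), the filtering could be sketched in plain Python on a metadata mapping shaped like the one pyarrow exposes as `Schema.metadata` (bytes keys and values, or None). The helper name `strip_arrow_schema_key` and the example values are hypothetical; the base64 payload is a shortened placeholder, not the real serialized schema.

```python
# Key under which Arrow stores its serialized schema in Parquet metadata.
ARROW_SCHEMA_KEY = b"ARROW:schema"


def strip_arrow_schema_key(metadata):
    """Return a copy of `metadata` without the "ARROW:schema" entry.

    `metadata` is expected to look like pyarrow's Schema.metadata:
    a mapping of bytes to bytes, or None when no metadata is set.
    """
    if not metadata:
        return {}
    # Build a new dict so the caller's mapping is left untouched.
    return {k: v for k, v in metadata.items() if k != ARROW_SCHEMA_KEY}


# Hypothetical metadata: one Arrow-internal key plus one unrelated key.
example = {ARROW_SCHEMA_KEY: b"<placeholder>", b"other": b"kept"}
cleaned = strip_arrow_schema_key(example)
```

On the Python side, the resulting mapping could then be applied with `Schema.with_metadata`, or `Table.replace_schema_metadata` for a table; other metadata keys are left as they were.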



--
This message was sent by Atlassian Jira
(v8.3.4#803005)