You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@arrow.apache.org by "Jim Pivarski (Jira)" <ji...@apache.org> on 2022/04/26 19:53:00 UTC
[jira] [Created] (ARROW-16348) ParquetWriter use_compliant_nested_type=True does not preserve ExtensionArray when reading back

Jim Pivarski created ARROW-16348:
------------------------------------

             Summary: ParquetWriter use_compliant_nested_type=True does not preserve ExtensionArray when reading back
                 Key: ARROW-16348
                 URL: https://issues.apache.org/jira/browse/ARROW-16348
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 7.0.0
         Environment: pyarrow 7.0.0 installed via pip.
            Reporter: Jim Pivarski


I've been happily making ExtensionArrays, but recently noticed that they aren't preserved by round-trips through Parquet files when {{{}use_compliant_nested_type=True{}}}.

Consider this writer.py:

 
{code:java}
import json
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
class AnnotatedType(pa.ExtensionType):
    def __init__(self, storage_type, annotation):
        self.annotation = annotation
        super().__init__(storage_type, "my:app")
    def __arrow_ext_serialize__(self):
        return json.dumps(self.annotation).encode()
    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        annotation = json.loads(serialized.decode())
        return cls(storage_type, annotation)
    @property
    def num_buffers(self):
        return self.storage_type.num_buffers
    @property
    def num_fields(self):
        return self.storage_type.num_fields
pa.register_extension_type(AnnotatedType(pa.null(), None))
array = pa.Array.from_buffers(
    AnnotatedType(pa.list_(pa.float64()), {"cool": "beans"}),
    3,
    [None, pa.py_buffer(np.array([0, 3, 3, 5], np.int32))],
    children=[pa.array([1.1, 2.2, 3.3, 4.4, 5.5])],
)
table = pa.table({"": array})
print(table)
pq.write_table(table, "tmp.parquet", use_compliant_nested_type=True)
{code}
And this reader.py:

 
{code:java}
import json
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
class AnnotatedType(pa.ExtensionType):
    def __init__(self, storage_type, annotation):
        self.annotation = annotation
        super().__init__(storage_type, "my:app")
    def __arrow_ext_serialize__(self):
        return json.dumps(self.annotation).encode()
    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        annotation = json.loads(serialized.decode())
        return cls(storage_type, annotation)
    @property
    def num_buffers(self):
        return self.storage_type.num_buffers
    @property
    def num_fields(self):
        return self.storage_type.num_fields
pa.register_extension_type(AnnotatedType(pa.null(), None))
table = pq.read_table("tmp.parquet")
print(table)
{code}
(The AnnotatedType is the same; I wrote it twice for explicitness.)

When the writer.py has {{{}use_compliant_nested_type=False{}}}, the output is
{code:java}
% python writer.py 
pyarrow.Table
: extension<my:app<AnnotatedType>>
----
: [[[1.1,2.2,3.3],[],[4.4,5.5]]]
% python reader.py 
pyarrow.Table
: extension<my:app<AnnotatedType>>
----
: [[[1.1,2.2,3.3],[],[4.4,5.5]]]{code}
In other words, the AnnotatedType is preserved. When {{{}use_compliant_nested_type=True{}}}, however,
{code:java}
% rm tmp.parquet
rm: remove regular file 'tmp.parquet'? y
% python writer.py 
pyarrow.Table
: extension<my:app<AnnotatedType>>
----
: [[[1.1,2.2,3.3],[],[4.4,5.5]]]
% python reader.py 
pyarrow.Table
: list<element: double>
  child 0, element: double
----
: [[[1.1,2.2,3.3],[],[4.4,5.5]]]{code}
The issue doesn't seem to be in the writing, but in the reading: regardless of whether {{use_compliant_nested_type}} is {{True}} or {{{}False{}}}, I can see the extension metadata in the Parquet → Arrow converted schema.
{code:java}
>>> import pyarrow.parquet as pq
>>> pq.ParquetFile("tmp.parquet").schema.to_arrow_schema()
: list<item: double>
  child 0, item: double
  -- field metadata --
  ARROW:extension:metadata: '{"cool": "beans"}'
  ARROW:extension:name: 'my:app'{code}
versus
{code:java}
>>> import pyarrow.parquet as pq
>>> pq.ParquetFile("tmp.parquet").schema.to_arrow_schema()
: list<element: double>
  child 0, element: double
  -- field metadata --
  ARROW:extension:metadata: '{"cool": "beans"}'
  ARROW:extension:name: 'my:app'{code}
Note that the first has "{{{}item: double{}}}" and the second has "{{{}element: double{}}}".

(I'm also rather surprised that {{use_compliant_nested_type=False}} is an option. Wouldn't you want the Parquet files to always be written with compliant lists? I noticed this when I was having trouble getting the data into BigQuery.)



--
This message was sent by Atlassian Jira
(v8.20.7#820007)