Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2021/11/09 13:00:00 UTC
[jira] [Assigned] (ARROW-14522) [C++] Validation of ExtensionType
with null storage type failing (Can't read empty-but-for-nulls data from
Parquet if it has an ExtensionType)
[ https://issues.apache.org/jira/browse/ARROW-14522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joris Van den Bossche reassigned ARROW-14522:
---------------------------------------------
Assignee: Joris Van den Bossche
> [C++] Validation of ExtensionType with null storage type failing (Can't read empty-but-for-nulls data from Parquet if it has an ExtensionType)
> ----------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: ARROW-14522
> URL: https://issues.apache.org/jira/browse/ARROW-14522
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 6.0.0
> Reporter: Jim Pivarski
> Assignee: Joris Van den Bossche
> Priority: Major
> Labels: pull-request-available
> Time Spent: 20m
> Remaining Estimate: 0h
>
> Here's a corner case: suppose that I have data of type null, which can have missing values, so the whole array consists of nothing but nulls. In real life, this might only happen inside a nested data structure, at some level where an untyped data source (e.g. nested Python lists) had no entries, so a type could not be determined. We expect to be able to write this data to Parquet and read it back, and we can, as long as it doesn't have an ExtensionType.
> Here's an example that works, _without_ ExtensionType:
> {code:python}
> >>> import json
> >>> import numpy as np
> >>> import pyarrow as pa
> >>> import pyarrow.parquet
> >>>
> >>> validbits = np.packbits(np.ones(14, dtype=np.uint8), bitorder="little")
> >>> empty_but_for_nulls = pa.Array.from_buffers(
> ...     pa.null(), 14, [pa.py_buffer(validbits)], null_count=14
> ... )
> >>> empty_but_for_nulls
> <pyarrow.lib.NullArray object at 0x7fb1560bbd00>
> 14 nulls
> >>>
> >>> pa.parquet.write_table(pa.table({"": empty_but_for_nulls}), "tmp.parquet")
> >>> pa.parquet.read_table("tmp.parquet")
> pyarrow.Table
> : null
> ----
> : [14 nulls]
> {code}
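For reference, the `validbits` buffer in the example above is just 14 one-bits packed least-significant-bit first into two bytes. A minimal pure-Python sketch of that packing (the helper name `pack_bits_little` is hypothetical, not a NumPy or pyarrow function) shows which bytes end up in the buffer:

```python
def pack_bits_little(bits):
    """Pack an iterable of 0/1 values into bytes, least-significant bit first.

    Illustrative stand-in for np.packbits(..., bitorder="little");
    the function name is hypothetical.
    """
    bits = list(bits)
    out = bytearray((len(bits) + 7) // 8)  # zero-padded to whole bytes
    for i, bit in enumerate(bits):
        if bit:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


validbits = pack_bits_little([1] * 14)
print(validbits.hex())  # ff3f
```

The two unused high bits of the second byte stay unset, which is why the buffer is `ff 3f` rather than `ff ff`.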
> And here's a continuation of that example, which doesn't work because the type {{pa.null()}} is replaced by {{AnnotatedType(pa.null(), \{"cool": "beans"})}}:
> {code:python}
> >>> class AnnotatedType(pa.ExtensionType):
> ...     def __init__(self, storage_type, annotation):
> ...         self.annotation = annotation
> ...         super().__init__(storage_type, "my:app")
> ...     def __arrow_ext_serialize__(self):
> ...         return json.dumps(self.annotation).encode()
> ...     @classmethod
> ...     def __arrow_ext_deserialize__(cls, storage_type, serialized):
> ...         annotation = json.loads(serialized.decode())
> ...         return cls(storage_type, annotation)
> ...
> >>> pa.register_extension_type(AnnotatedType(pa.null(), None))
> >>>
> >>> empty_but_for_nulls = pa.Array.from_buffers(
> ...     AnnotatedType(pa.null(), {"cool": "beans"}),
> ...     14,
> ...     [pa.py_buffer(validbits)],
> ...     null_count=14,
> ... )
> >>> empty_but_for_nulls
> <pyarrow.lib.ExtensionArray object at 0x7fb14b5e1ca0>
> 14 nulls
> >>>
> >>> pa.parquet.write_table(pa.table({"": empty_but_for_nulls}), "tmp2.parquet")
> >>> pa.parquet.read_table("tmp2.parquet")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/jpivarski/miniconda3/lib/python3.9/site-packages/pyarrow/parquet.py", line 1941, in read_table
>     return dataset.read(columns=columns, use_threads=use_threads,
>   File "/home/jpivarski/miniconda3/lib/python3.9/site-packages/pyarrow/parquet.py", line 1776, in read
>     table = self._dataset.to_table(
>   File "pyarrow/_dataset.pyx", line 491, in pyarrow._dataset.Dataset.to_table
>   File "pyarrow/_dataset.pyx", line 3235, in pyarrow._dataset.Scanner.to_table
>   File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Array of type extension<my:app<AnnotatedType>> has 14 nulls but no null bitmap
> {code}
> If "nullable type null" were outside the set of types that should be writable to Parquet, then either the non-ExtensionType case would not work, or the failure would occur on writing rather than on reading. Neither is the case, so I'm quite sure this is a bug.
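The metadata round trip that `AnnotatedType` relies on can be checked without pyarrow. This is only a sketch of the two JSON steps inside `__arrow_ext_serialize__` and `__arrow_ext_deserialize__`; the free functions `serialize_annotation` and `deserialize_annotation` are hypothetical names used for illustration:

```python
import json


def serialize_annotation(annotation):
    # Same expression as in AnnotatedType.__arrow_ext_serialize__ above.
    return json.dumps(annotation).encode()


def deserialize_annotation(serialized):
    # Same expression as in AnnotatedType.__arrow_ext_deserialize__ above.
    return json.loads(serialized.decode())


# Check both annotations used in the report: the one attached to the array
# and the None passed to pa.register_extension_type.
for annotation in ({"cool": "beans"}, None):
    payload = serialize_annotation(annotation)
    assert deserialize_annotation(payload) == annotation
    print(payload)
```

Both annotations survive the round trip, which is consistent with the error message pointing at array validation (the missing null bitmap) rather than at the extension metadata.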