Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2021/11/09 13:00:00 UTC

[jira] [Assigned] (ARROW-14522) [C++] Validation of ExtensionType with null storage type failing (Can't read empty-but-for-nulls data from Parquet if it has an ExtensionType)

     [ https://issues.apache.org/jira/browse/ARROW-14522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Van den Bossche reassigned ARROW-14522:
---------------------------------------------

    Assignee: Joris Van den Bossche

> [C++] Validation of ExtensionType with null storage type failing (Can't read empty-but-for-nulls data from Parquet if it has an ExtensionType)
> ----------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-14522
>                 URL: https://issues.apache.org/jira/browse/ARROW-14522
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 6.0.0
>            Reporter: Jim Pivarski
>            Assignee: Joris Van den Bossche
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Here's a corner case: suppose that I have data with type null, but it can have missing values so the whole array consists of nothing but nulls. In real life, this might only happen inside a nested data structure, at some level where an untyped data source (e.g. nested Python lists) had no entries so a type could not be determined. We expect to be able to write and read this data to and from Parquet, and we can—as long as it doesn't have an ExtensionType.
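> For concreteness, here is a minimal sketch of how type inference alone can produce such an array (illustrative, separate from the reproducer below):
> {code:python}
> >>> import pyarrow as pa
> >>> # No non-null leaf values anywhere, so the inner type can't be inferred
> >>> # and pyarrow falls back to the null type:
> >>> pa.array([[None, None], []]).type
> ListType(list<item: null>)
> {code}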
> Here's an example that works, _without_ ExtensionType:
> {code:python}
> >>> import json
> >>> import numpy as np
> >>> import pyarrow as pa
> >>> import pyarrow.parquet
> >>> 
> >>> validbits = np.packbits(np.ones(14, dtype=np.uint8), bitorder="little")
> >>> empty_but_for_nulls = pa.Array.from_buffers(
> ...     pa.null(), 14, [pa.py_buffer(validbits)], null_count=14
> ... )
> >>> empty_but_for_nulls
> <pyarrow.lib.NullArray object at 0x7fb1560bbd00>
> 14 nulls
> >>> 
> >>> pa.parquet.write_table(pa.table({"": empty_but_for_nulls}), "tmp.parquet")
> >>> pa.parquet.read_table("tmp.parquet")
> pyarrow.Table
> : null
> ----
> : [14 nulls]
> {code}
> And here's a continuation of that example, which doesn't work because the type {{pa.null()}} is replaced by {{AnnotatedType(pa.null(), {"cool": "beans"})}}:
> {code:python}
> >>> class AnnotatedType(pa.ExtensionType):
> ...     def __init__(self, storage_type, annotation):
> ...         self.annotation = annotation
> ...         super().__init__(storage_type, "my:app")
> ...     def __arrow_ext_serialize__(self):
> ...         return json.dumps(self.annotation).encode()
> ...     @classmethod
> ...     def __arrow_ext_deserialize__(cls, storage_type, serialized):
> ...         annotation = json.loads(serialized.decode())
> ...         return cls(storage_type, annotation)
> ... 
> >>> pa.register_extension_type(AnnotatedType(pa.null(), None))
> >>> 
> >>> empty_but_for_nulls = pa.Array.from_buffers(
> ...     AnnotatedType(pa.null(), {"cool": "beans"}),
> ...     14,
> ...     [pa.py_buffer(validbits)],
> ...     null_count=14,
> ... )
> >>> empty_but_for_nulls
> <pyarrow.lib.ExtensionArray object at 0x7fb14b5e1ca0>
> 14 nulls
> >>> 
> >>> pa.parquet.write_table(pa.table({"": empty_but_for_nulls}), "tmp2.parquet")
> >>> pa.parquet.read_table("tmp2.parquet")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/jpivarski/miniconda3/lib/python3.9/site-packages/pyarrow/parquet.py", line 1941, in read_table
>     return dataset.read(columns=columns, use_threads=use_threads,
>   File "/home/jpivarski/miniconda3/lib/python3.9/site-packages/pyarrow/parquet.py", line 1776, in read
>     table = self._dataset.to_table(
>   File "pyarrow/_dataset.pyx", line 491, in pyarrow._dataset.Dataset.to_table
>   File "pyarrow/_dataset.pyx", line 3235, in pyarrow._dataset.Scanner.to_table
>   File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Array of type extension<my:app<AnnotatedType>> has 14 nulls but no null bitmap
> {code}
> If "nullable type null" were outside the set of types that should be writable to Parquet, then it would not work for the non-ExtensionType or it would fail on writing, not reading, so I'm quite sure this is a bug.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)