Posted to jira@arrow.apache.org by "Jim Pivarski (Jira)" <ji...@apache.org> on 2021/10/30 00:28:00 UTC

[jira] [Created] (ARROW-14525) Writing DictionaryArrays with ExtensionType to Parquet

Jim Pivarski created ARROW-14525:
------------------------------------

             Summary: Writing DictionaryArrays with ExtensionType to Parquet
                 Key: ARROW-14525
                 URL: https://issues.apache.org/jira/browse/ARROW-14525
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 6.0.0
            Reporter: Jim Pivarski


Thanks to some help from [~jorisvandenbossche], I can create DictionaryArrays with ExtensionType (on just the dictionary, on the DictionaryArray itself, or on both). However, these extension-typed DictionaryArrays can't be written to Parquet files.

To start, let's set up my minimal reproducer ExtensionType, this time with an explicit ExtensionArray:
{code:python}
>>> import json
>>> import numpy as np
>>> import pyarrow as pa
>>> import pyarrow.parquet
>>> 
>>> class AnnotatedArray(pa.ExtensionArray):
...     pass
... 
>>> class AnnotatedType(pa.ExtensionType):
...     def __init__(self, storage_type, annotation):
...         self.annotation = annotation
...         super().__init__(storage_type, "my:app")
...     def __arrow_ext_serialize__(self):
...         return json.dumps(self.annotation).encode()
...     @classmethod
...     def __arrow_ext_deserialize__(cls, storage_type, serialized):
...         annotation = json.loads(serialized.decode())
...         return cls(storage_type, annotation)
...     def __arrow_ext_class__(self):
...         return AnnotatedArray
... 
>>> pa.register_extension_type(AnnotatedType(pa.null(), None))
{code}
A non-extended DictionaryArray could be built like this:
{code:python}
>>> dictarray = pa.DictionaryArray.from_arrays(
...     np.array([3, 2, 2, 2, 0, 1, 3], np.int32),
...     pa.Array.from_buffers(
...         pa.float64(),
...         4,
...         [
...             None,
...             pa.py_buffer(np.array([0.0, 1.1, 2.2, 3.3])),
...         ],
...     ),
... )
>>> dictarray
<pyarrow.lib.DictionaryArray object at 0x7f8c71f593c0>

-- dictionary:
  [
    0,
    1.1,
    2.2,
    3.3
  ]
-- indices:
  [
    3,
    2,
    2,
    2,
    0,
    1,
    3
  ]
{code}
I can write it to a file and read it back, though the fact that it comes back as a non-DictionaryArray might be part of the problem. Is some decision being made about the array of indices being too short to warrant dictionary encoding?
{code:python}
>>> pa.parquet.write_table(pa.table({"": dictarray}), "tmp.parquet")
>>> pa.parquet.read_table("tmp.parquet")
pyarrow.Table
: double
----
: [[3.3,2.2,2.2,2.2,0,1.1,3.3]]
{code}
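As an aside on that round trip: I believe the Parquet reader only re-dictionary-encodes columns that are explicitly named in {{read_table}}'s {{read_dictionary}} argument, regardless of how short the column is, so the non-DictionaryArray result above may just be default reader behavior. A sketch of that (only checked for this plain, non-extended case):
{code:python}
import pyarrow as pa
import pyarrow.parquet

# Same plain (non-extended) DictionaryArray as above.
dictarray = pa.DictionaryArray.from_arrays(
    pa.array([3, 2, 2, 2, 0, 1, 3], pa.int32()),
    pa.array([0.0, 1.1, 2.2, 3.3], pa.float64()),
)
pa.parquet.write_table(pa.table({"": dictarray}), "tmp.parquet")

# By default the column comes back as plain float64...
plain = pa.parquet.read_table("tmp.parquet")

# ...but naming the column in read_dictionary restores dictionary encoding.
dicted = pa.parquet.read_table("tmp.parquet", read_dictionary=[""])
{code}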
Anyway, the next step is to make a DictionaryArray with ExtensionTypes. In this example, I'm giving extension types to both the dictionary and the outer DictionaryArray itself:
{code:python}
>>> dictionary_type = AnnotatedType(pa.float64(), "inner annotation")
>>> dictarray_type = AnnotatedType(
...     pa.dictionary(pa.int32(), dictionary_type), "outer annotation"
... )
>>> dictarray_ext = AnnotatedArray.from_storage(
...     dictarray_type,
...     pa.DictionaryArray.from_arrays(
...         np.array([3, 2, 2, 2, 0, 1, 3], np.int32),
...         pa.Array.from_buffers(
...             dictionary_type,
...             4,
...             [
...                 None,
...                 pa.py_buffer(np.array([0.0, 1.1, 2.2, 3.3])),
...             ],
...         ),
...     )
... )
>>> dictarray_ext
<__main__.AnnotatedArray object at 0x7f8c71ec7ee0>

-- dictionary:
  [
    0,
    1.1,
    2.2,
    3.3
  ]
-- indices:
  [
    3,
    2,
    2,
    2,
    0,
    1,
    3
  ]
{code}
This can't be written to a Parquet file:
{code:python}
>>> pa.parquet.write_table(pa.table({"": dictarray_ext}), "tmp2.parquet")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jpivarski/miniconda3/lib/python3.9/site-packages/pyarrow/parquet.py", line 2034, in write_table
    writer.write_table(table, row_group_size=row_group_size)
  File "/home/jpivarski/miniconda3/lib/python3.9/site-packages/pyarrow/parquet.py", line 701, in write_table
    self.writer.write_table(table, row_group_size=row_group_size)
  File "pyarrow/_parquet.pyx", line 1451, in pyarrow._parquet.ParquetWriter.write_table
  File "pyarrow/error.pxi", line 120, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Unsupported cast from dictionary<values=extension<my:app<AnnotatedType>>, indices=int32, ordered=0> to extension<my:app<AnnotatedType>> (no available cast function for target type)
{code}
My first thought was that maybe the data in the dictionary must be a simple type (it's usually strings). So how about making the outer DictionaryArray extended but leaving the inner dictionary non-extended? The type definitions are inlined below.
{code:python}
>>> dictarray_partial = AnnotatedArray.from_storage(
...     AnnotatedType(          # extended, but the content is not
...         pa.dictionary(pa.int32(), pa.float64()), "only annotation"
...     ),
...     pa.DictionaryArray.from_arrays(
...         np.array([3, 2, 2, 2, 0, 1, 3], np.int32),
...         pa.Array.from_buffers(
...             pa.float64(),   # not extended
...             4,
...             [
...                 None,
...                 pa.py_buffer(np.array([0.0, 1.1, 2.2, 3.3])),
...             ],
...         ),
...     )
... )
>>> dictarray_partial
<__main__.AnnotatedArray object at 0x7f8c71ee5100>

-- dictionary:
  [
    0,
    1.1,
    2.2,
    3.3
  ]
-- indices:
  [
    3,
    2,
    2,
    2,
    0,
    1,
    3
  ]
{code}
I can write this one, but it comes back as a non-extended type, perhaps because it is read back as a non-DictionaryArray whose type is the original's (non-extended) dictionary value type.
{code:python}
>>> pa.parquet.write_table(pa.table({"": dictarray_partial}), "tmp3.parquet")
>>> pa.parquet.read_table("tmp3.parquet")
pyarrow.Table
: double
----
: [[3.3,2.2,2.2,2.2,0,1.1,3.3]]
{code}
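If re-dictionary-encoding on the read side is acceptable, the annotation can at least be re-attached by hand for this partial case, since {{ExtensionArray.from_storage}} only checks the storage type. A sketch (class definitions repeated so it stands alone; the re-attachment is manual, not something Parquet does for you):
{code:python}
import json
import pyarrow as pa
import pyarrow.parquet

class AnnotatedArray(pa.ExtensionArray):
    pass

class AnnotatedType(pa.ExtensionType):
    def __init__(self, storage_type, annotation):
        self.annotation = annotation
        super().__init__(storage_type, "my:app")
    def __arrow_ext_serialize__(self):
        return json.dumps(self.annotation).encode()
    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls(storage_type, json.loads(serialized.decode()))
    def __arrow_ext_class__(self):
        return AnnotatedArray

# Write the partially extended DictionaryArray, as in the report.
dictarray_partial = AnnotatedArray.from_storage(
    AnnotatedType(pa.dictionary(pa.int32(), pa.float64()), "only annotation"),
    pa.DictionaryArray.from_arrays(
        pa.array([3, 2, 2, 2, 0, 1, 3], pa.int32()),
        pa.array([0.0, 1.1, 2.2, 3.3], pa.float64()),
    ),
)
pa.parquet.write_table(pa.table({"": dictarray_partial}), "tmp3.parquet")

# It comes back as a plain float64 column; re-encode and re-wrap it.
plain = pa.parquet.read_table("tmp3.parquet").column(0).combine_chunks()
restored = AnnotatedArray.from_storage(
    AnnotatedType(pa.dictionary(pa.int32(), pa.float64()), "only annotation"),
    plain.dictionary_encode(),
)
{code}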
Okay, since there are four possibilities here, what about making the dictionary an ExtensionType but leaving the outer DictionaryArray non-extended?
{code:python}
>>> dictarray_other = pa.DictionaryArray.from_arrays(
...     np.array([3, 2, 2, 2, 0, 1, 3], np.int32),
...     pa.Array.from_buffers(
...         AnnotatedType(pa.float64(), "only annotation"),
...         4,
...         [
...             None,
...             pa.py_buffer(np.array([0.0, 1.1, 2.2, 3.3])),
...         ],
...     )
... )
>>> dictarray_other
<pyarrow.lib.DictionaryArray object at 0x7f8c71ee62e0>

-- dictionary:
  [
    0,
    1.1,
    2.2,
    3.3
  ]
-- indices:
  [
    3,
    2,
    2,
    2,
    0,
    1,
    3
  ]
{code}
Nope, can't write this, either:
{code:python}
>>> pa.parquet.write_table(pa.table({"": dictarray_other}), "tmp4.parquet")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jpivarski/miniconda3/lib/python3.9/site-packages/pyarrow/parquet.py", line 2034, in write_table
    writer.write_table(table, row_group_size=row_group_size)
  File "/home/jpivarski/miniconda3/lib/python3.9/site-packages/pyarrow/parquet.py", line 701, in write_table
    self.writer.write_table(table, row_group_size=row_group_size)
  File "pyarrow/_parquet.pyx", line 1451, in pyarrow._parquet.ParquetWriter.write_table
  File "pyarrow/error.pxi", line 120, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Unsupported cast from dictionary<values=extension<my:app<AnnotatedType>>, indices=int32, ordered=0> to extension<my:app<AnnotatedType>> (no available cast function for target type)
{code}
I'm pretty sure I aligned all the types correctly. Perhaps only one of these cases is how it ought to work, but there should be some way to get the annotations into a Parquet file and read them back (other than un-dict-encoding the array).
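One possible interim workaround, if it works the way I expect: extension types are part of the Arrow IPC format, so writing the same table with {{pyarrow.ipc}} (or Feather) instead of Parquet might preserve the annotation. A sketch for the partially extended case, with the type registered so the reader can reconstruct it:
{code:python}
import json
import pyarrow as pa
import pyarrow.ipc

class AnnotatedArray(pa.ExtensionArray):
    pass

class AnnotatedType(pa.ExtensionType):
    def __init__(self, storage_type, annotation):
        self.annotation = annotation
        super().__init__(storage_type, "my:app")
    def __arrow_ext_serialize__(self):
        return json.dumps(self.annotation).encode()
    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls(storage_type, json.loads(serialized.decode()))
    def __arrow_ext_class__(self):
        return AnnotatedArray

pa.register_extension_type(AnnotatedType(pa.null(), None))

# The outer-extended, inner-plain case from above.
ext = AnnotatedArray.from_storage(
    AnnotatedType(pa.dictionary(pa.int32(), pa.float64()), "only annotation"),
    pa.DictionaryArray.from_arrays(
        pa.array([3, 2, 2, 2, 0, 1, 3], pa.int32()),
        pa.array([0.0, 1.1, 2.2, 3.3], pa.float64()),
    ),
)
table = pa.table({"": ext})

# Write and read with the Arrow IPC file format instead of Parquet.
with pa.OSFile("tmp.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)
with pa.OSFile("tmp.arrow", "rb") as source:
    roundtrip = pa.ipc.open_file(source).read_all()
{code}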


