Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2021/11/05 09:24:00 UTC

[jira] [Commented] (ARROW-14525) Writing DictionaryArrays with ExtensionType to Parquet

    [ https://issues.apache.org/jira/browse/ARROW-14525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17439117#comment-17439117 ] 

Joris Van den Bossche commented on ARROW-14525:
-----------------------------------------------

> This can't be written to a Parquet file:

As [~lidavidm] mentioned in ARROW-14569, it seems we need to implement a cast from dictionary<extension> to extension.

The reason this cast is needed for Parquet is that, currently, only dictionaries with string/binary values can be stored as-is in Parquet. All other dictionary types are stored as materialized values (still with some Parquet encoding, of course, but not necessarily dictionary encoding; at least, there is no direct write/read path between Arrow's dictionary type and Parquet's dictionary encoding without an additional conversion). See ARROW-6140.
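
Until that cast exists, a possible workaround is to materialize the dictionary manually before writing and re-wrap the result in the extension type. A rough sketch (untested, using the {{AnnotatedType}}/{{AnnotatedArray}}/{{dictarray_ext}} names from the reproducer below; note it keeps only the inner annotation, so preserving the outer one would still need the cast):

{code:python}
import pyarrow as pa
import pyarrow.parquet

storage = dictarray_ext.storage             # the underlying DictionaryArray
inner = storage.dictionary.storage          # plain float64 values behind the extension dictionary
materialized = inner.take(storage.indices)  # look up the values -> plain float64 array

# re-attach the inner extension type to the materialized values
rewrapped = AnnotatedArray.from_storage(
    AnnotatedType(pa.float64(), "inner annotation"), materialized
)
pa.parquet.write_table(pa.table({"": rewrapped}), "tmp_workaround.parquet")
{code}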

> Writing DictionaryArrays with ExtensionType to Parquet
> ------------------------------------------------------
>
>                 Key: ARROW-14525
>                 URL: https://issues.apache.org/jira/browse/ARROW-14525
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 6.0.0
>            Reporter: Jim Pivarski
>            Priority: Major
>
> Thanks to some help I got from [~jorisvandenbossche], I can create DictionaryArrays with ExtensionType (on just the dictionary, on the DictionaryArray itself, or on both). However, these extended DictionaryArrays can't be written to Parquet files.
> To start, let's set up my minimal reproducer ExtensionType, this time with an explicit ExtensionArray:
> {code:python}
> >>> import json
> >>> import numpy as np
> >>> import pyarrow as pa
> >>> import pyarrow.parquet
> >>> 
> >>> class AnnotatedArray(pa.ExtensionArray):
> ...     pass
> ... 
> >>> class AnnotatedType(pa.ExtensionType):
> ...     def __init__(self, storage_type, annotation):
> ...         self.annotation = annotation
> ...         super().__init__(storage_type, "my:app")
> ...     def __arrow_ext_serialize__(self):
> ...         return json.dumps(self.annotation).encode()
> ...     @classmethod
> ...     def __arrow_ext_deserialize__(cls, storage_type, serialized):
> ...         annotation = json.loads(serialized.decode())
> ...         return cls(storage_type, annotation)
> ...     def __arrow_ext_class__(self):
> ...         return AnnotatedArray
> ... 
> >>> pa.register_extension_type(AnnotatedType(pa.null(), None))
> {code}
> A non-extended DictionaryArray could be built like this:
> {code:python}
> >>> dictarray = pa.DictionaryArray.from_arrays(
> ...     np.array([3, 2, 2, 2, 0, 1, 3], np.int32),
> ...     pa.Array.from_buffers(
> ...         pa.float64(),
> ...         4,
> ...         [
> ...             None,
> ...             pa.py_buffer(np.array([0.0, 1.1, 2.2, 3.3])),
> ...         ],
> ...     ),
> ... )
> >>> dictarray
> <pyarrow.lib.DictionaryArray object at 0x7f8c71f593c0>
> -- dictionary:
>   [
>     0,
>     1.1,
>     2.2,
>     3.3
>   ]
> -- indices:
>   [
>     3,
>     2,
>     2,
>     2,
>     0,
>     1,
>     3
>   ]
> {code}
> I can write it to a file and read it back, though the fact that it comes back as a non-DictionaryArray might be part of the problem. Is some decision being made about the array of indices being too short to warrant dictionary encoding?
> {code:python}
> >>> pa.parquet.write_table(pa.table({"": dictarray}), "tmp.parquet")
> >>> pa.parquet.read_table("tmp.parquet")
> pyarrow.Table
> : double
> ----
> : [[3.3,2.2,2.2,2.2,0,1.1,3.3]]
> {code}
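> For what it's worth, the materialized column can presumably be re-encoded after reading, though that rebuilds the dictionary from scratch rather than restoring the original one (untested sketch):
> {code:python}
> >>> pa.parquet.read_table("tmp.parquet").column(0).dictionary_encode()
> {code}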
> Anyway, the next step is to make a DictionaryArray with ExtensionTypes. In this example, I'm making both the dictionary and the outer DictionaryArray itself be extended:
> {code:python}
> >>> dictionary_type = AnnotatedType(pa.float64(), "inner annotation")
> >>> dictarray_type = AnnotatedType(
> ...     pa.dictionary(pa.int32(), dictionary_type), "outer annotation"
> ... )
> >>> dictarray_ext = AnnotatedArray.from_storage(
> ...     dictarray_type,
> ...     pa.DictionaryArray.from_arrays(
> ...         np.array([3, 2, 2, 2, 0, 1, 3], np.int32),
> ...         pa.Array.from_buffers(
> ...             dictionary_type,
> ...             4,
> ...             [
> ...                 None,
> ...                 pa.py_buffer(np.array([0.0, 1.1, 2.2, 3.3])),
> ...             ],
> ...         ),
> ...     )
> ... )
> >>> dictarray_ext
> <__main__.AnnotatedArray object at 0x7f8c71ec7ee0>
> -- dictionary:
>   [
>     0,
>     1.1,
>     2.2,
>     3.3
>   ]
> -- indices:
>   [
>     3,
>     2,
>     2,
>     2,
>     0,
>     1,
>     3
>   ]
> {code}
> This can't be written to a Parquet file:
> {code:python}
> >>> pa.parquet.write_table(pa.table({"": dictarray_ext}), "tmp2.parquet")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/jpivarski/miniconda3/lib/python3.9/site-packages/pyarrow/parquet.py", line 2034, in write_table
>     writer.write_table(table, row_group_size=row_group_size)
>   File "/home/jpivarski/miniconda3/lib/python3.9/site-packages/pyarrow/parquet.py", line 701, in write_table
>     self.writer.write_table(table, row_group_size=row_group_size)
>   File "pyarrow/_parquet.pyx", line 1451, in pyarrow._parquet.ParquetWriter.write_table
>   File "pyarrow/error.pxi", line 120, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: Unsupported cast from dictionary<values=extension<my:app<AnnotatedType>>, indices=int32, ordered=0> to extension<my:app<AnnotatedType>> (no available cast function for target type)
> {code}
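> Presumably the Parquet writer is attempting exactly the cast named in the error, so invoking it directly should fail the same way (hypothetical check, not verified):
> {code:python}
> >>> import pyarrow.compute as pc
> >>> pc.cast(dictarray_ext.storage, dictarray_type)  # expect the same ArrowNotImplementedError
> {code}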
> My first thought was that maybe the data in the dictionary has to be simple (it's usually strings). So how about making the outer DictionaryArray extended, but leaving the inner dictionary non-extended? The type definitions are now inline.
> {code:python}
> >>> dictarray_partial = AnnotatedArray.from_storage(
> ...     AnnotatedType(          # extended, but the content is not
> ...         pa.dictionary(pa.int32(), pa.float64()), "only annotation"
> ...     ),
> ...     pa.DictionaryArray.from_arrays(
> ...         np.array([3, 2, 2, 2, 0, 1, 3], np.int32),
> ...         pa.Array.from_buffers(
> ...             pa.float64(),   # not extended
> ...             4,
> ...             [
> ...                 None,
> ...                 pa.py_buffer(np.array([0.0, 1.1, 2.2, 3.3])),
> ...             ],
> ...         ),
> ...     )
> ... )
> >>> dictarray_partial
> <__main__.AnnotatedArray object at 0x7f8c71ee5100>
> -- dictionary:
>   [
>     0,
>     1.1,
>     2.2,
>     3.3
>   ]
> -- indices:
>   [
>     3,
>     2,
>     2,
>     2,
>     0,
>     1,
>     3
>   ]
> {code}
> I can write this one, but it comes back as a non-extended type, maybe because it is materialized as a non-DictionaryArray whose type is the original dictionary's (non-extended) value type.
> {code:python}
> >>> pa.parquet.write_table(pa.table({"": dictarray_partial}), "tmp3.parquet")
> >>> pa.parquet.read_table("tmp3.parquet")
> pyarrow.Table
> : double
> ----
> : [[3.3,2.2,2.2,2.2,0,1.1,3.3]]
> {code}
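> The annotation can at least be re-attached by hand after reading, though only with a new type whose storage is the materialized {{double}} rather than the original dictionary type (untested sketch):
> {code:python}
> >>> table3 = pa.parquet.read_table("tmp3.parquet")
> >>> AnnotatedArray.from_storage(
> ...     AnnotatedType(pa.float64(), "only annotation"),  # storage type is now double
> ...     table3.column(0).chunk(0),  # single chunk in this small file
> ... )
> {code}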
> Okay, since there are four possibilities here, what about making the dictionary an ExtensionType but leaving the outer DictionaryArray non-extended?
> {code:python}
> >>> dictarray_other = pa.DictionaryArray.from_arrays(
> ...     np.array([3, 2, 2, 2, 0, 1, 3], np.int32),
> ...     pa.Array.from_buffers(
> ...         AnnotatedType(pa.float64(), "only annotation"),
> ...         4,
> ...         [
> ...             None,
> ...             pa.py_buffer(np.array([0.0, 1.1, 2.2, 3.3])),
> ...         ],
> ...     )
> ... )
> >>> dictarray_other
> <pyarrow.lib.DictionaryArray object at 0x7f8c71ee62e0>
> -- dictionary:
>   [
>     0,
>     1.1,
>     2.2,
>     3.3
>   ]
> -- indices:
>   [
>     3,
>     2,
>     2,
>     2,
>     0,
>     1,
>     3
>   ]
> {code}
> Nope, can't write this, either:
> {code:python}
> >>> pa.parquet.write_table(pa.table({"": dictarray_other}), "tmp4.parquet")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/home/jpivarski/miniconda3/lib/python3.9/site-packages/pyarrow/parquet.py", line 2034, in write_table
>     writer.write_table(table, row_group_size=row_group_size)
>   File "/home/jpivarski/miniconda3/lib/python3.9/site-packages/pyarrow/parquet.py", line 701, in write_table
>     self.writer.write_table(table, row_group_size=row_group_size)
>   File "pyarrow/_parquet.pyx", line 1451, in pyarrow._parquet.ParquetWriter.write_table
>   File "pyarrow/error.pxi", line 120, in pyarrow.lib.check_status
> pyarrow.lib.ArrowNotImplementedError: Unsupported cast from dictionary<values=extension<my:app<AnnotatedType>>, indices=int32, ordered=0> to extension<my:app<AnnotatedType>> (no available cast function for target type)
> {code}
> I'm pretty sure I aligned all the types correctly. Perhaps only one of these cases is the intended way for this to work, but there ought to be some way to get the annotations into a Parquet file and read them back (other than un-dict-encoding the array).


