Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2022/04/19 10:30:00 UTC

[jira] [Commented] (ARROW-16231) [C++][Python] IPC failure for dictionary with extension type with struct storage type

    [ https://issues.apache.org/jira/browse/ARROW-16231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17524230#comment-17524230 ] 

Joris Van den Bossche commented on ARROW-16231:
-----------------------------------------------

When I try to reproduce this with a pure-pyarrow example, I get a different error:

{code}
import pyarrow as pa
from pyarrow import feather
from pyarrow.tests.test_extension_type import MyStructType

# struct storage wrapped in an extension type
struct_array = pa.StructArray.from_arrays(
    [pa.array([0, 1], type="int64"), pa.array([1, 2], type="int64")],
    names=["left", "right"])
mystruct_array = pa.ExtensionArray.from_storage(MyStructType(), struct_array)

# use the extension array as the dictionary (values) of a dictionary array
dict_array = pa.DictionaryArray.from_arrays(pa.array([0, 1, 0]), mystruct_array)

# roundtrip through Feather
feather.write_feather(pa.table({'a': dict_array}), "test_dict_ext_nested.feather")
feather.read_table("test_dict_ext_nested.feather")
{code}

Reading the file back gives:

{code}
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-9-df8b416670f4> in <module>
----> 1 feather.read_table("test_dict_ext_nested.feather")

~/scipy/repos/arrow/python/pyarrow/feather.py in read_table(source, columns, memory_map, use_threads)
    242     table : pyarrow.Table
    243     """
--> 244     reader = _feather.FeatherReader(
    245         source, use_memory_map=memory_map, use_threads=use_threads)
    246 

~/scipy/repos/arrow/python/pyarrow/_feather.pyx in pyarrow._feather.FeatherReader.__cinit__()
~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
~/scipy/repos/arrow/python/pyarrow/types.pxi in pyarrow.lib.PyExtensionType.__arrow_ext_deserialize__()
TypeError: Expected storage type struct<left: int64, right: int64> but got dictionary<values=struct<left: int64, right: int64>, indices=int64, ordered=0>
{code}


> [C++][Python] IPC failure for dictionary with extension type with struct storage type
> -------------------------------------------------------------------------------------
>
>                 Key: ARROW-16231
>                 URL: https://issues.apache.org/jira/browse/ARROW-16231
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>            Reporter: Joris Van den Bossche
>            Priority: Major
>
> Report from [https://github.com/apache/arrow/issues/12899]
> Roundtripping through IPC/Feather fails for a dictionary type whose dictionary (value type) is an extension type with a nested storage type. Writing appears to succeed (although it is unclear whether the written file is correct, since reading its schema also gives an error), but reading it back fails with _"ArrowInvalid: Ran out of field metadata, likely malformed"_.
> The original use case came from pandas: the pandas interval dtype is mapped to an Arrow extension type with a struct storage type, and here that interval type was additionally wrapped in a categorical (dictionary) type. A pandas-based test that reproduces this (it can be added as-is to {{test_feather.py}}):
> {code:python}
> @pytest.mark.pandas
> def test_dictionary_interval():
>     df = pd.DataFrame({'a': pd.cut(range(1, 10, 3), [-1, 5, 10])})
>     _check_pandas_roundtrip(df, version=2)
> {code}
> this gives:
> {code:java}
> $ pytest python/pyarrow/tests/test_feather.py::test_dictionary_interval
> ....
> ========================= FAILURES =================
> ____________ test_dictionary_interval _______________
> pyarrow/_feather.pyx:88: in pyarrow._feather.FeatherReader.read
> E   pyarrow.lib.ArrowInvalid: Ran out of field metadata, likely malformed
> E   ../src/arrow/ipc/reader.cc:266  GetFieldMetadata(field_index_++, out_)
> E   ../src/arrow/ipc/reader.cc:283  LoadCommon(type_id)
> E   ../src/arrow/ipc/reader.cc:324  Load(child_fields[i].get(), parent->child_data[i].get())
> E   ../src/arrow/ipc/reader.cc:529  loader.Load(&field, column.get())
> E   ../src/arrow/ipc/reader.cc:1188  ReadRecordBatchInternal( *message->metadata(), schema_, field_inclusion_mask_, context, reader.get())
> E   ../src/arrow/ipc/feather.cc:730  reader->ReadRecordBatch(i)
> pyarrow/error.pxi:100: ArrowInvalid
> {code}


