Posted to issues@arrow.apache.org by "Nick Radcliffe (Jira)" <ji...@apache.org> on 2020/08/28 09:56:00 UTC
[jira] [Created] (ARROW-9880) Lose access to indices & dictionary
roundtripping DictionaryArray to parquet file
Nick Radcliffe created ARROW-9880:
-------------------------------------
Summary: Lose access to indices & dictionary roundtripping DictionaryArray to parquet file
Key: ARROW-9880
URL: https://issues.apache.org/jira/browse/ARROW-9880
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 1.0.1
Environment: Mac running macOS Catalina (10.15.2), Python 3.7.6.
Reporter: Nick Radcliffe
Attachments: pyarraw_dictionaryarray_bug.py
I am in the process of adding support for reading/writing Parquet to a data analysis tool (Miró: [https://stochasticsolutions.com/miro/]). The tool has a string column type that is extremely close to PyArrow's DictionaryArray, so it was natural to add support for that, but round-tripping doesn't seem to work, as this example shows.
The code creates a table with a single column, a dictionary array, and writes it as a Parquet file using `write_table`. On reading it back in, the column's `.type` indicates that it's a dictionary type, but Python reports its type as `ChunkedArray`. Either way, it doesn't seem to have `indices` or `dictionary` properties. `to_pylist` works, so I can get the data in, but almost all the benefit of writing as a dictionary array is lost if I need to convert it to a Python list to access its values.
I presume it isn't supposed to be like this.
{code:python}
$ python3
Python 3.7.6 (v3.7.6:43364a7ae0, Dec 18 2019, 14:18:50)
[Clang 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> print('PyArrow version:', pa.__version__)
PyArrow version: 1.0.1
>>>
>>>
>>> dictionary = ['zero', 'one', 'two']
>>> indices = [None, 0, 1, 2, 0, 1, 0]
>>>
>>> col = pa.DictionaryArray.from_arrays(indices, dictionary)
>>> print('col:', col)
col:
-- dictionary:
[
"zero",
"one",
"two"
]
-- indices:
[
null,
0,
1,
2,
0,
1,
0
]
>>> print('col.to_pylist():', col.to_pylist())
col.to_pylist(): [None, 'zero', 'one', 'two', 'zero', 'one', 'zero']
>>> print('col.type:', col.type)
col.type: dictionary<values=string, indices=int64, ordered=0>
>>> print('type(col):', type(col))
type(col): <class 'pyarrow.lib.DictionaryArray'>
>>> print('col.indices:', col.indices)
col.indices: [
null,
0,
1,
2,
0,
1,
0
]
>>> print('col.dictionary:', col.dictionary)
col.dictionary: [
"zero",
"one",
"two"
]
>>>
>>> path = '/tmp/zot.parquet'
>>> pq.write_table(pa.lib.Table.from_pydict({'zot': col}), path)
>>> table = pq.read_table(path)
>>>
>>> zot = table['zot']
>>> print('zot:', zot)
zot: [
-- dictionary:
[
"zero",
"one",
"two"
]
-- indices:
[
null,
0,
1,
2,
0,
1,
0
]
]
>>> print('zot.to_pylist():', zot.to_pylist())
zot.to_pylist(): [None, 'zero', 'one', 'two', 'zero', 'one', 'zero']
>>> print('zot.type:', zot.type)
zot.type: dictionary<values=string, indices=int32, ordered=0>
>>> print('type(zot):', type(zot))
type(zot): <class 'pyarrow.lib.ChunkedArray'>
>>> print('zot.indices:', zot.indices)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'pyarrow.lib.ChunkedArray' object has no attribute 'indices'
>>> print('zot.dictionary:', zot.dictionary)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'pyarrow.lib.ChunkedArray' object has no attribute 'dictionary'
>>> ^D
{code}