Posted to issues@arrow.apache.org by "Nick Radcliffe (Jira)" <ji...@apache.org> on 2020/08/28 09:56:00 UTC
[jira] [Created] (ARROW-9880) Lose access to indices & dictionary roundtripping DictionaryArray to parquet file

Nick Radcliffe created ARROW-9880:
-------------------------------------

             Summary: Lose access to indices & dictionary roundtripping DictionaryArray to parquet file
                 Key: ARROW-9880
                 URL: https://issues.apache.org/jira/browse/ARROW-9880
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 1.0.1
         Environment: Mac running macOS Catalina (10.15.2), Python 3.7.6.
            Reporter: Nick Radcliffe
         Attachments: pyarraw_dictionaryarray_bug.py

I am in the process of adding support for reading/writing Parquet to a data analysis tool (Miró: https://stochasticsolutions.com/miro/). The tool has a string column type that is extremely close to PyArrow's DictionaryArray, so it was natural to add support for that, but round-tripping doesn't seem to work, as the example below shows.

The code creates a table with a single column, a dictionary array, and writes it as a Parquet file using `write_table`. On reading it back in, the column's `.type` indicates that it's a dictionary type, but Python reports its type as a `ChunkedArray`. Either way, it doesn't seem to have `indices` or `dictionary` properties. `to_pylist` works, so I can get the data in, but almost all the benefit of writing as a dictionary array is lost if I need to convert it to a Python list to access its values.

I presume it isn't supposed to be like this.

 
{code:python}
$ python3
Python 3.7.6 (v3.7.6:43364a7ae0, Dec 18 2019, 14:18:50) 
[Clang 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> print('PyArrow version:', pa.__version__)
PyArrow version: 1.0.1
>>> 
>>> 
>>> dictionary = ['zero', 'one', 'two']
>>> indices = [None, 0, 1, 2, 0, 1, 0]
>>> 
>>> col = pa.DictionaryArray.from_arrays(indices, dictionary)
>>> print('col:', col)
col: 
-- dictionary:
  [
    "zero",
    "one",
    "two"
  ]
-- indices:
  [
    null,
    0,
    1,
    2,
    0,
    1,
    0
  ]
>>> print('col.to_pylist():', col.to_pylist())
col.to_pylist(): [None, 'zero', 'one', 'two', 'zero', 'one', 'zero']
>>> print('col.type:', col.type)
col.type: dictionary<values=string, indices=int64, ordered=0>
>>> print('type(col):', type(col))
type(col): <class 'pyarrow.lib.DictionaryArray'>
>>> print('col.indices:', col.indices)
col.indices: [
  null,
  0,
  1,
  2,
  0,
  1,
  0
]
>>> print('col.dictionary:', col.dictionary)
col.dictionary: [
  "zero",
  "one",
  "two"
]
>>> 
>>> path = '/tmp/zot.parquet'
>>> pq.write_table(pa.lib.Table.from_pydict({'zot': col}), path)
>>> table = pq.read_table(path)
>>> 
>>> zot = table['zot']
>>> print('zot:', zot)
zot: [


  -- dictionary:
    [
      "zero",
      "one",
      "two"
    ]
  -- indices:
    [
      null,
      0,
      1,
      2,
      0,
      1,
      0
    ]
]
>>> print('zot.to_pylist():', zot.to_pylist())
zot.to_pylist(): [None, 'zero', 'one', 'two', 'zero', 'one', 'zero']
>>> print('zot.type:', zot.type)
zot.type: dictionary<values=string, indices=int32, ordered=0>
>>> print('type(zot):', type(zot))
type(zot): <class 'pyarrow.lib.ChunkedArray'>
>>> print('zot.indices:', zot.indices)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'pyarrow.lib.ChunkedArray' object has no attribute 'indices'
>>> print('zot.dictionary:', zot.dictionary)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'pyarrow.lib.ChunkedArray' object has no attribute 'dictionary'
>>> ^D

{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)