Posted to issues@arrow.apache.org by "Brent Kerby (JIRA)" <ji...@apache.org> on 2018/04/26 15:54:00 UTC

[jira] [Updated] (ARROW-2515) Errors with DictionaryArray inside of ListArray or other DictionaryArray

     [ https://issues.apache.org/jira/browse/ARROW-2515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brent Kerby updated ARROW-2515:
-------------------------------
    Description: 
An exception ("KeyError: 26") is raised when .as_py() is called on elements of a ListArray over a DictionaryArray, or of a DictionaryArray whose values are themselves a DictionaryArray. Here are a couple of tests that currently fail:

 
{code:python}
import pyarrow as pa

def test_dictionary_array_1():
    dict_arr = pa.DictionaryArray.from_arrays([0, 1, 0], ['a', 'b'])
    list_arr = pa.ListArray.from_arrays([0, 2, 3], dict_arr)
    assert list_arr.to_pylist() == [['a', 'b'], ['a']]

def test_dictionary_array_2():
    dict_arr = pa.DictionaryArray.from_arrays([0, 1, 0], ['a', 'b'])
    dict_arr2 = pa.DictionaryArray.from_arrays([0, 1, 2, 1, 0], dict_arr)
    assert dict_arr2.to_pylist() == ['a', 'b', 'a', 'b', 'a']
{code}
The problem appears to be that the function box_scalar in scalar.pxi does not handle dictionary arrays, because we currently have no DictionaryValue type.

 

DictionaryArray.__getitem__ currently works around the lack of a DictionaryValue type by dereferencing the index and constructing a scalar based on the value in the underlying dictionary. In other words, if we have a dictionary with int8 indices and string values, then the result of __getitem__ will be a StringValue (rather than a DictionaryValue). This works in simple cases but not in the more complex scenarios illustrated above.
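
To make the workaround concrete, here is a toy model in plain Python. It is not pyarrow code: the class name DereferencingDictArray is hypothetical, and Python lists and str objects stand in for Arrow arrays and StringValue scalars. It only illustrates that indexing returns the looked-up value itself, so no dictionary-typed scalar ever exists.

```python
# Toy model only; plain Python lists stand in for Arrow arrays.
# DereferencingDictArray is a hypothetical name, not a pyarrow class.
class DereferencingDictArray:
    def __init__(self, indices, dictionary):
        self.indices = indices        # positions into the dictionary
        self.dictionary = dictionary  # the distinct values

    def __getitem__(self, i):
        # Dereference the index and hand back the dictionary's own value,
        # the way the current __getitem__ returns a StringValue.
        return self.dictionary[self.indices[i]]


arr = DereferencingDictArray([0, 1, 0], ['a', 'b'])
first = arr[0]  # a plain str here; the index it came from is no longer visible
```

In this model the caller cannot recover which dictionary index produced the value, and code that boxes elements generically by array type never goes through this special-cased method.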

I have a patch ready that adds a DictionaryValue type similar to the other scalar types, resolving these bugs and removing the need for a special-cased implementation of DictionaryArray.__getitem__. This DictionaryValue would expose two accessor properties, "indices_value" and "dictionary_value", giving access to both the index into the dictionary and the looked-up value. DictionaryValue.as_py() would then simply call .as_py() on the underlying dictionary_value.
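
The proposed design can be sketched in plain Python. This is an illustration of the idea only, not the patch: ToyDictionaryValue and ToyDictionaryArray are hypothetical names, plain lists stand in for Arrow arrays, and indices_value is a bare int here rather than an Arrow scalar. Only the property names indices_value and dictionary_value and the as_py() delegation are taken from the description above.

```python
# Toy sketch of the proposed scalar; all class names are hypothetical.
class ToyDictionaryValue:
    """One element of a dictionary-encoded array."""

    def __init__(self, index, dictionary):
        self.indices_value = index                 # position in the dictionary
        self.dictionary_value = dictionary[index]  # looked-up value

    def as_py(self):
        # Delegate to the looked-up value; it may itself be a dictionary
        # scalar (nested case) or already a plain Python object.
        inner = self.dictionary_value
        return inner.as_py() if hasattr(inner, "as_py") else inner


class ToyDictionaryArray:
    def __init__(self, indices, dictionary):
        self.indices = indices
        self.dictionary = dictionary

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, i):
        # No special-casing: always box into the dictionary scalar type.
        return ToyDictionaryValue(self.indices[i], self.dictionary)

    def to_pylist(self):
        return [self[i].as_py() for i in range(len(self))]


# Mirrors test_dictionary_array_2: a dictionary array whose values are
# another dictionary array.
inner = ToyDictionaryArray([0, 1, 0], ['a', 'b'])
outer = ToyDictionaryArray([0, 1, 2, 1, 0], inner)
result = outer.to_pylist()
```

Because as_py() delegates to whatever dictionary_value holds, the nested case resolves through the same code path as the simple one, while the index stays available through indices_value.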

  was:
An exception ("KeyError: 26") is raised when .as_py() is called on elements of a ListArray over a DictionaryArray, or of a DictionaryArray with values in a DictionaryArray. Here are a couple tests that currently fail:

 
{code:java}
import pyarrow as pa

def test_dictionary_array_1():
    dict_arr = pa.DictionaryArray.from_arrays([0, 1, 0], ['a', 'b'])
    list_arr = pa.ListArray.from_arrays([0, 2, 3], dict_arr)
    assert list_arr.to_pylist() == [['a', 'b'], ['a']]

def test_dictionary_array_2():
    dict_arr = pa.DictionaryArray.from_arrays([0, 1, 0], ['a', 'b'])
    dict_arr2 = pa.DictionaryArray.from_arrays([0, 1, 2, 1, 0], dict_arr)
    assert dict_arr2.to_pylist() == ['a', 'b', 'a', 'b', 'a']
{code}
It appears that the problem is caused by the fact that the function box_scalar in scalar.pxi does not handle the case of dictionary array, as we currently have no DictionaryValue type. 

 

DictionaryArray.__getitem__ currently works around the lack of DictionaryValue type by dereferencing the index and constructs a scalar based on the value in the underlying dictionary. In other words, if we have a dictionary with int8 indices and string values, then the result of __getitem__ will be a StringValue (rather than a DictionaryValue). This works in simple cases but not in the more complex scenarios illustrated above.

I have a patch ready, which would add a DictionaryValue type similar to other scalar types, resolving these bugs and removing the need for a special-cased implementation of DictionaryArray.__getitem__. This DictionaryValue would contain a couple accessor properties, "indices_value" and "dictionary_value" to allow access to both the index in the dictionary as well as the looked-up value. Then DictionaryValue.as_py() would simply call .as_py() on the underlying dictionary_value. 


> Errors with DictionaryArray inside of ListArray or other DictionaryArray
> ------------------------------------------------------------------------
>
>                 Key: ARROW-2515
>                 URL: https://issues.apache.org/jira/browse/ARROW-2515
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.9.0
>            Reporter: Brent Kerby
>            Priority: Major
>   Original Estimate: 1h
>  Remaining Estimate: 1h


