Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/11/02 16:34:00 UTC

[jira] [Commented] (ARROW-10443) Nested dictionaries not able to be converted to table

    [ https://issues.apache.org/jira/browse/ARROW-10443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224790#comment-17224790 ] 

Joris Van den Bossche commented on ARROW-10443:
-----------------------------------------------

(General questions are probably better asked on the user or dev mailing list, see https://arrow.apache.org/community/, or on StackOverflow, as you already did https://stackoverflow.com/questions/64618007/nested-dictionaries-in-pyarrow, to keep JIRA issues for actual bug reports / enhancement requests)

The problem you are running into is that the {{from_pydict}} method expects a dictionary of {column name: array-like}, whereas the value stored under the "a" key in your dictionary is not array-like but another dictionary.
So what would work is this:

{code}
In [24]: a = {'a': [{'b': 1, 'c': 3, 'd': 1}, {'b': 2, 'c': 2, 'd': 2}]}

In [25]: a_pa = pa.Table.from_pydict(a, schema)

In [26]: a_pa
Out[26]: 
pyarrow.Table
a: struct<b: int32, c: int32, d: int32>
  child 0, b: int32
  child 1, c: int32
  child 2, d: int32

In [27]: a_pa.to_pandas()
Out[27]: 
                          a
0  {'b': 1, 'c': 3, 'd': 1}
1  {'b': 2, 'c': 2, 'd': 2}
{code}

Now, the format you have (a full array for each of the struct's children) is of course also useful to be able to convert. One option that works right now is to first create a StructArray directly, and only afterwards put it in a table:

{code}
In [42]: a = {'a': {'b': [1, 2, 3], 'c': [3, 2, 1], 'd': [1, 2, 3]}}

In [43]: struct_arr = pa.StructArray.from_arrays(a['a'].values(), a['a'].keys())

In [44]: pa.table({'a': struct_arr})
Out[44]: 
pyarrow.Table
a: struct<b: int64, c: int64, d: int64>
  child 0, b: int64
  child 1, c: int64
  child 2, d: int64

In [45]: pa.table({'a': struct_arr}).to_pandas()
Out[45]: 
                          a
0  {'b': 1, 'c': 3, 'd': 1}
1  {'b': 2, 'c': 2, 'd': 2}
2  {'b': 3, 'c': 1, 'd': 3}
{code}

But note that this only works if your arrays in the nested dictionary all have the same length, which is not the case for your original example.
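
To make the equal-length requirement explicit, and to get from the "one array per child" layout to the "list of dicts" layout that {{from_pydict}} accepted in the first snippet, something like the following could be used. This is only a minimal sketch: {{columns_to_rows}} is a hypothetical helper written for illustration, not part of pyarrow, and the pyarrow part is skipped if the library is not installed.

```python
def columns_to_rows(columns):
    """Turn {'b': [1, 2], 'c': [3, 4]} into [{'b': 1, 'c': 3}, {'b': 2, 'c': 4}].

    Hypothetical helper (not a pyarrow function). Raises ValueError when the
    child arrays do not all have the same length, since a struct column needs
    one value per child for every row.
    """
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"child arrays differ in length: {lengths}")
    names = list(columns)
    # zip(*...) walks the child arrays in lockstep, yielding one tuple per row
    return [dict(zip(names, row)) for row in zip(*columns.values())]


nested = {'a': {'b': [1, 2, 3], 'c': [3, 2, 1], 'd': [1, 2, 3]}}
rows = {'a': columns_to_rows(nested['a'])}
print(rows)

try:
    import pyarrow as pa
except ImportError:  # keep the sketch runnable without pyarrow
    pa = None

if pa is not None:
    # Same schema as in the original report
    struct = pa.struct([pa.field('b', pa.int32()),
                        pa.field('c', pa.int32()),
                        pa.field('d', pa.int32())])
    schema = pa.schema([pa.field('a', struct)])
    table = pa.Table.from_pydict(rows, schema=schema)
    print(table.schema)
```

With the dictionary from the original report (child arrays of length 6, 3 and 2), the helper raises a {{ValueError}} listing the lengths instead of reaching pyarrow at all, which is arguably a clearer failure than the conversion error quoted below.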

> Nested dictionaries not able to be converted to table
> -----------------------------------------------------
>
>                 Key: ARROW-10443
>                 URL: https://issues.apache.org/jira/browse/ARROW-10443
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 2.0.0
>         Environment: Windows 10, v. 2004. Python 3.8.5
>            Reporter: Xavante Erickson
>            Priority: Trivial
>              Labels: newbie
>
> Hi, it seems as you wanted questions and issues here and not in github.
> I was trying to convert a nested dictionary creating my own schema without success, it seems like it should and I get an unintuitive error message. When I execute the code below I get the error message "pyarrow.lib.ArrowTypeError: Could not convert b with type str: was expecting tuple of (key, value) pair"
> {code:python}
> import pyarrow as pa
> a = {'a': {'b': [1, 2, 3, 4, 5, 6], 'c': [3, 2, 1], 'd': [1, 2]}}
> struct = pa.struct([pa.field('b', pa.int32()), pa.field('c', pa.int32()), pa.field('d', pa.int32())])
> schema = pa.schema([pa.field('a', struct)])
> a_pa = pa.Table.from_pydict(a, schema)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)