Posted to jira@arrow.apache.org by "Will Jones (Jira)" <ji...@apache.org> on 2022/01/04 19:22:00 UTC

[jira] [Created] (ARROW-15247) [Python] Convert array of Pandas dataframe to struct column

Will Jones created ARROW-15247:
----------------------------------

             Summary: [Python] Convert array of Pandas dataframe to struct column
                 Key: ARROW-15247
                 URL: https://issues.apache.org/jira/browse/ARROW-15247
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
    Affects Versions: 6.0.1
            Reporter: Will Jones


Currently, converting a Pandas dataframe with a column of dataframes to Arrow fails with "Could not convert <data> with type DataFrame: did not recognize Python value type when inferring an Arrow data type". We should be able to convert this to a List<Struct> array, similar to how [the R bindings do it|https://arrow.apache.org/docs/r/articles/arrow.html#r-to-arrow]. This could even be bidirectional, where structs could be parsed back into a column of dataframes in {{to_pandas()}}.

Here is an example that currently fails:

{code:python}
import pandas as pd
import pyarrow as pa

df1 = pd.DataFrame({
    'x': [1, 2, 3],
    'y': ['a', 'b', 'c']
})

df = pd.DataFrame({
    'df': [df1]*10
})

pa.Table.from_pandas(df)
{code}

Here's what the other direction might look like for the same data:

{code:python}
sub_tab = [{'x': 1, 'y': 'a'},
           {'x': 2, 'y': 'b'},
           {'x': 3, 'y': 'c'}]

tab = pa.table({
    'df': pa.array([sub_tab]*10)
})

print(tab.schema)
# df: list<item: struct<x: int64, y: string>>
#    child 0, item: struct<x: int64, y: string>
#       child 0, x: int64
#       child 1, y: string

tab.to_pandas()
{code}


