Posted to jira@arrow.apache.org by "Will Jones (Jira)" <ji...@apache.org> on 2022/01/04 19:22:00 UTC
[jira] [Created] (ARROW-15247) [Python] Convert array of Pandas dataframe to struct column
Will Jones created ARROW-15247:
----------------------------------
Summary: [Python] Convert array of Pandas dataframe to struct column
Key: ARROW-15247
URL: https://issues.apache.org/jira/browse/ARROW-15247
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Affects Versions: 6.0.1
Reporter: Will Jones
Currently, converting a Pandas dataframe with a column of dataframes to Arrow fails with "Could not convert <data> with type DataFrame: did not recognize Python value type when inferring an Arrow data type". We should be able to convert this to a List<Struct> array, similar to how [the R bindings do it|https://arrow.apache.org/docs/r/articles/arrow.html#r-to-arrow]. This could even be bi-directional, where List<Struct> columns could be parsed back into a column of dataframes in {{to_pandas()}}.
Here is an example that currently fails:
{code:python}
import pandas as pd
import pyarrow as pa
df1 = pd.DataFrame({
    'x': [1, 2, 3],
    'y': ['a', 'b', 'c']
})
df = pd.DataFrame({
    'df': [df1]*10
})
pa.Table.from_pandas(df)
{code}
Here's what the other direction might look like for the same data:
{code:python}
sub_tab = [{'x': 1, 'y': 'a'},
           {'x': 2, 'y': 'b'},
           {'x': 3, 'y': 'c'}]
tab = pa.table({
    'df': pa.array([sub_tab]*10)
})
print(tab.schema)
# df: list<item: struct<x: int64, y: string>>
#   child 0, item: struct<x: int64, y: string>
#       child 0, x: int64
#       child 1, y: string
tab.to_pandas()
{code}