Posted to issues@arrow.apache.org by "sergun (via GitHub)" <gi...@apache.org> on 2023/11/09 08:47:44 UTC
[I] Flatten column of list of struct type and convert to pandas [arrow]
sergun opened a new issue, #38643:
URL: https://github.com/apache/arrow/issues/38643
### Describe the usage question you have. Please include as many useful details as possible.
I have a pa.Table with a nested column events:
```
id int64
events list<item: struct<tm: timestamp[s], sum: int64>>
```
It is easy to convert it to pandas with the pa.Table.to_pandas() method, but that creates a pd.DataFrame whose events column has object dtype:
```
id int64
events object
```
Flattening the data further in pandas is inefficient.
How can I efficiently convert the table in PyArrow to a flattened pd.DataFrame with columns id, tm, sum?
This is possible, e.g., in Spark powered by Arrow:
```
df.select("id", explode("events")).select("id", "col.*")
```
I hope this is also possible with PyArrow alone.
### Component(s)
Python
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
Re: [I] [Python] Flatten column of list of struct type and convert to pandas [arrow]
Posted by "sergun (via GitHub)" <gi...@apache.org>.
sergun commented on issue #38643:
URL: https://github.com/apache/arrow/issues/38643#issuecomment-1808896791
> Exploding is currently something that isn't provided out of the box, see #27923 for an issue on this topic and some example workarounds (using existing pyarrow compute functions to achieve the same effect).
>
> Once you have exploded the list over multiple rows, you can use the `flatten()` method to turn the table with the struct column into a table with one top-level column per struct field:
>
> ```
> >>> import pandas as pd
> >>> import pyarrow as pa
> >>> table = pa.table({"id": [1, 1, 2], "events": [{"tm": pd.Timestamp("2012-01-01"), "sum": 10}] * 3})
> >>> table.to_pandas()
>    id                                  events
> 0   1  {'sum': 10, 'tm': 2012-01-01 00:00:00}
> 1   1  {'sum': 10, 'tm': 2012-01-01 00:00:00}
> 2   2  {'sum': 10, 'tm': 2012-01-01 00:00:00}
>
> >>> table.flatten().to_pandas()
>    id  events.sum  events.tm
> 0   1          10 2012-01-01
> 1   1          10 2012-01-01
> 2   2          10 2012-01-01
> ```
Thx a lot!
#27923 + pa.Table.flatten() solves the issue.
Re: [I] [Python] Flatten column of list of struct type and convert to pandas [arrow]
Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on issue #38643:
URL: https://github.com/apache/arrow/issues/38643#issuecomment-1810018002
OK, good to hear! Will close the issue then, given we already have https://github.com/apache/arrow/issues/27923 covering the explode feature request.
Re: [I] [Python] Flatten column of list of struct type and convert to pandas [arrow]
Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on issue #38643:
URL: https://github.com/apache/arrow/issues/38643#issuecomment-1803767114
Exploding is currently something that isn't provided out of the box, see https://github.com/apache/arrow/issues/27923 for an issue on this topic and some example workarounds (using existing pyarrow compute functions to achieve the same effect).
Once you have exploded the list over multiple rows, you can use the `flatten()` method to turn the table with the struct column into a table with one top-level column per struct field:
```
>>> import pandas as pd
>>> import pyarrow as pa
>>> table = pa.table({"id": [1, 1, 2], "events": [{"tm": pd.Timestamp("2012-01-01"), "sum": 10}] * 3})
>>> table.to_pandas()
   id                                  events
0   1  {'sum': 10, 'tm': 2012-01-01 00:00:00}
1   1  {'sum': 10, 'tm': 2012-01-01 00:00:00}
2   2  {'sum': 10, 'tm': 2012-01-01 00:00:00}
>>> table.flatten().to_pandas()
   id  events.sum  events.tm
0   1          10 2012-01-01
1   1          10 2012-01-01
2   2          10 2012-01-01
```
Re: [I] [Python] Flatten column of list of struct type and convert to pandas [arrow]
Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche closed issue #38643: [Python] Flatten column of list of struct type and convert to pandas
URL: https://github.com/apache/arrow/issues/38643