
Posted to issues@arrow.apache.org by "sergun (via GitHub)" <gi...@apache.org> on 2023/11/09 08:47:44 UTC

[I] Flatten column of list of struct type and convert to pandas [arrow]

sergun opened a new issue, #38643:
URL: https://github.com/apache/arrow/issues/38643

   ### Describe the usage question you have. Please include as many useful details as possible.
   
   
   I have a pa.Table with a nested column events:
   ```
   id           int64
   events       list<item: struct<tm: timestamp[s], sum: int64>>
   ```
   It is easy to convert it to pandas with the pa.Table.to_pandas() method, but that creates a pd.DataFrame whose events column has object dtype:
   ```
   id           int64
   events       object
   ```
   Flattening the data further in pandas is then inefficient.
   
   How can I efficiently convert the table in PyArrow to a flattened pd.DataFrame with columns id, tm, sum?
   
   This is possible, for example, in Spark powered by Arrow:
   ```
   df.select("id", explode("events")).select("id", "col.*")
   ```
   I hope it is also possible using PyArrow alone.
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] [Python] Flatten column of list of struct type and convert to pandas [arrow]

Posted by "sergun (via GitHub)" <gi...@apache.org>.

sergun commented on issue #38643:
URL: https://github.com/apache/arrow/issues/38643#issuecomment-1808896791

   > Exploding isn't currently provided out of the box; see #27923 for an issue on this topic and some example workarounds (using existing pyarrow compute functions to achieve the same effect).
   > 
   > Once you have exploded the list over multiple rows, you can use the `flatten()` method to turn the table with the struct column into a table with one top-level column per struct field:
   > 
   > ```
   > >>> import pandas as pd
   > >>> import pyarrow as pa
   > >>> table = pa.table({"id": [1, 1, 2], "events": [{"tm": pd.Timestamp("2012-01-01"), "sum": 10}] * 3})
   > >>> table.to_pandas()
   >    id                                  events
   > 0   1  {'sum': 10, 'tm': 2012-01-01 00:00:00}
   > 1   1  {'sum': 10, 'tm': 2012-01-01 00:00:00}
   > 2   2  {'sum': 10, 'tm': 2012-01-01 00:00:00}
   > 
   > >>> table.flatten().to_pandas() 
   >    id  events.sum  events.tm
   > 0   1          10 2012-01-01
   > 1   1          10 2012-01-01
   > 2   2          10 2012-01-01
   > ```
   
   Thanks a lot!
   #27923 + pa.Table.flatten() solves the issue.



Re: [I] [Python] Flatten column of list of struct type and convert to pandas [arrow]

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.

jorisvandenbossche commented on issue #38643:
URL: https://github.com/apache/arrow/issues/38643#issuecomment-1810018002

   OK, good to hear! Will close the issue then, given we already have https://github.com/apache/arrow/issues/27923 covering the explode feature request.



Re: [I] [Python] Flatten column of list of struct type and convert to pandas [arrow]

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.

jorisvandenbossche commented on issue #38643:
URL: https://github.com/apache/arrow/issues/38643#issuecomment-1803767114

   Exploding isn't currently provided out of the box; see https://github.com/apache/arrow/issues/27923 for an issue on this topic and some example workarounds (using existing pyarrow compute functions to achieve the same effect).
   
   Once you have exploded the list over multiple rows, you can use the `flatten()` method to turn the table with the struct column into a table with one top-level column per struct field:
   
   ```
   >>> import pandas as pd
   >>> import pyarrow as pa
   >>> table = pa.table({"id": [1, 1, 2], "events": [{"tm": pd.Timestamp("2012-01-01"), "sum": 10}] * 3})
   >>> table.to_pandas()
      id                                  events
   0   1  {'sum': 10, 'tm': 2012-01-01 00:00:00}
   1   1  {'sum': 10, 'tm': 2012-01-01 00:00:00}
   2   2  {'sum': 10, 'tm': 2012-01-01 00:00:00}
   
   >>> table.flatten().to_pandas() 
      id  events.sum  events.tm
   0   1          10 2012-01-01
   1   1          10 2012-01-01
   2   2          10 2012-01-01
   ```



Re: [I] [Python] Flatten column of list of struct type and convert to pandas [arrow]

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.

jorisvandenbossche closed issue #38643: [Python] Flatten column of list of struct type and convert to pandas
URL: https://github.com/apache/arrow/issues/38643

