Posted to github@arrow.apache.org by "westonpace (via GitHub)" <gi...@apache.org> on 2023/06/12 22:32:08 UTC

[GitHub] [arrow] westonpace commented on issue #32439: [Python] Converting data frame to Table with large nested column fails `Invalid Struct child array has length smaller than expected`

westonpace commented on issue #32439:
URL: https://github.com/apache/arrow/issues/32439#issuecomment-1588196555

   This issue will occur any time a single string column ends up with more than 2^31 characters.  In the OP's reproduction, the column `square` has 161 characters per string and 800,000 * 24 strings, which is `3,091,200,000` characters; 2^31 is `2,147,483,648`.  At that point we have to split the resulting array into chunks (or use the large_string data type, but that has issues of its own).
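
As a quick sanity check of the arithmetic above, here is a minimal sketch. The variable names are illustrative, and it assumes one byte per character (the string type's 32-bit offsets index bytes, not characters).

```python
# Figures taken from the comment above; variable names are illustrative.
chars_per_string = 161
num_strings = 800_000 * 24

total_chars = chars_per_string * num_strings
offset_limit = 2**31  # 32-bit signed offsets cannot address beyond this

print(f"{total_chars:,}")          # 3,091,200,000
print(f"{offset_limit:,}")         # 2,147,483,648
print(total_chars > offset_limit)  # True
```

So the column holds roughly 1.4 times as much string data as a single chunk can address.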
   
   This "breaking unexpectedly large columns into chunks" behavior is rather tricky and it appears we are doing something wrong when working with lists of struct arrays.  Here's a compact reproducer (that only has 3 rows):
   
   ```
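   import pyarrow as pa
   import pandas as pd
   
   # One 1 GB string; three of them exceed what a single string array can hold.
   # Note: running this needs several GB of memory.
   x = "0" * 1000000000

   # Case 1: plain string column
   df = pd.DataFrame({"strings": [x, x, x]})
   tab = pa.Table.from_pandas(df)
   print(tab.column(0).num_chunks)
   
   # Case 2: struct column with a string field
   struct = {"struct_field": x}
   df = pd.DataFrame({"structs": [struct, struct, struct]})
   tab = pa.Table.from_pandas(df)
   print(tab.column(0).num_chunks)
   
   # Case 3: list-of-strings column
   lists = [x]
   df = pd.DataFrame({"lists": [lists, lists, lists]})
   tab = pa.Table.from_pandas(df)
   print(tab.column(0).num_chunks)
   
   # Case 4: list-of-structs column (the case that appears to be mishandled)
   los = [struct]
   df = pd.DataFrame({"los": [los, los, los]})
   tab = pa.Table.from_pandas(df)
   print(tab.column(0).num_chunks)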
   ```
   
   It seems the struct array has length 3.  Meanwhile, its child, the string array, has length 2 because it had to be broken into 2 chunks: the first chunk has the first 2 values and the second chunk has the third.
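
To illustrate why the child ends up as a 2-value chunk plus a 1-value chunk, here is a rough pure-Python sketch. `split_into_chunks` is a hypothetical helper written for this illustration; it is not Arrow's actual conversion code, only a greedy grouping under the assumption that a chunk's total string data must fit in a 32-bit offset.

```python
INT32_MAX = 2**31 - 1  # largest offset a 32-bit string array can hold


def split_into_chunks(lengths, capacity=INT32_MAX):
    """Greedily group value lengths so each chunk's total stays within capacity."""
    chunks, current, used = [], [], 0
    for n in lengths:
        # Start a new chunk when adding this value would overflow the current one.
        if current and used + n > capacity:
            chunks.append(current)
            current, used = [], 0
        current.append(n)
        used += n
    if current:
        chunks.append(current)
    return chunks


# Three strings of 1,000,000,000 characters each, as in the reproducer.
lengths = [1_000_000_000] * 3
child_chunks = split_into_chunks(lengths)
print([len(c) for c in child_chunks])  # [2, 1]

# The parent still describes 3 rows, but the first child chunk has only 2.
parent_length = len(lengths)
first_child_length = len(child_chunks[0])
print(parent_length, first_child_length)  # 3 2
```

If the parent array is built against only the first child chunk, its length (3) no longer matches the child's (2), which is consistent with the "child array has length smaller than expected" error in the issue title.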
   
   So if someone wants to investigate this, I would recommend starting with the conversion-from-pandas code and looking at how the struct and list arrays handle the case where their children are converted into multiple chunks.

