Posted to github@arrow.apache.org by "leprechaunt33 (via GitHub)" <gi...@apache.org> on 2023/02/24 13:46:32 UTC

[GitHub] [arrow] leprechaunt33 commented on issue #33049: [C++][Python] Large strings cause ArrowInvalid: offset overflow while concatenating arrays

leprechaunt33 commented on issue #33049:
URL: https://github.com/apache/arrow/issues/33049#issuecomment-1443703100

   Current workaround I've developed for vaex in general for this pyarrow-related error, on dataframes where the technique mentioned above does not work (e.g. materialisation of a pandas array from a joined multi-file dataframe where I was unable to set the Arrow data type on the column):
   - Catch the ArrowInvalid exception, create a blank pandas dataframe with the required columns, and iterate over the columns of the vaex dataframe, materialising them one by one into the pandas df.
   - If ArrowInvalid is caught again for a column, evaluate that rogue column with evaluate_iterator(), using prefetch and a chunk_size (chosen from the maximum expected record size) small enough that the string offsets stay within bounds, and collate the resulting pyarrow StringArray/ChunkedArray data.
   - Continue iterating the columns; typically only one or two columns will need the chunked treatment.
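
   The per-column fallback above could be sketched roughly as follows. This is a minimal sketch, not the author's actual code: the vaex specifics (df.column_names, df.to_pandas_df, df.evaluate_iterator with prefetch/chunk_size) are abstracted behind two caller-supplied callables so the collation logic stands alone, and the key point is that the chunks are gathered into a pa.chunked_array rather than concatenated, which is the step that overflows 32-bit string offsets.

   ```python
   import pandas as pd
   import pyarrow as pa


   def collate_chunks(chunks):
       """Collect per-chunk values into one ChunkedArray without
       concatenating buffers (concatenation is what overflows the
       32-bit offsets of a pyarrow string array)."""
       arrays = [c if isinstance(c, pa.Array) else pa.array(c) for c in chunks]
       return pa.chunked_array(arrays)


   def materialise_columns(columns, evaluate_column, evaluate_chunks):
       """Materialise columns one by one into a fresh pandas DataFrame.

       columns          -- column names to materialise
       evaluate_column  -- name -> pandas Series; may raise pa.ArrowInvalid
                           (with vaex this would wrap e.g. df.to_pandas_df)
       evaluate_chunks  -- name -> iterable of chunk values (with vaex this
                           would wrap df.evaluate_iterator with a suitable
                           chunk_size and prefetch)
       """
       out = pd.DataFrame()
       for name in columns:
           try:
               out[name] = evaluate_column(name)
           except pa.ArrowInvalid:
               # Rogue column: fall back to chunked evaluation and collate.
               out[name] = collate_chunks(evaluate_chunks(name)).to_pandas()
       return out
   ```

   Only the columns that actually raise ArrowInvalid pay the cost of the chunked path; the rest materialise directly.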


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org