Posted to github@arrow.apache.org by "jorisvandenbossche (via GitHub)" <gi...@apache.org> on 2023/04/14 08:14:13 UTC

[GitHub] [arrow] jorisvandenbossche commented on issue #35126: Slow pyarrow table slice/take when the table has many chunks

jorisvandenbossche commented on issue #35126:
URL: https://github.com/apache/arrow/issues/35126#issuecomment-1508115139

   > Is this slice slowness expected when a table has many chunks?
   
   I certainly wouldn't expect such a huge slowdown, but given that slicing a table with many chunks inherently carries some extra overhead (each individual chunk is sliced each time), _some_ slowdown can be expected. We should maybe see if there is some overhead that can be reduced (from a quick profile, I don't immediately see anything obvious, though; a large part of the time is spent in the actual `Array::Slice` / `ArrayData::Slice`).
   
   Maybe there could be some optimization when taking a slice that covers several chunks entirely: we wouldn't actually need to call Slice on the chunks that are needed in full for the result; a sketch of that idea follows below.
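   
   To illustrate the idea, here is a minimal Python sketch (not Arrow's actual C++ implementation; `slice_chunked` is a hypothetical helper): chunks that fall entirely inside the requested range are reused as-is, and only the boundary chunks get an actual `slice` call.
   
   ```python
   import pyarrow as pa
   
   def slice_chunked(arr: pa.ChunkedArray, offset: int, length: int) -> pa.ChunkedArray:
       """Slice a ChunkedArray, reusing fully covered chunks instead of slicing them."""
       out = []
       remaining = length
       for chunk in arr.chunks:
           if remaining <= 0:
               break
           if offset >= len(chunk):
               # The slice starts after this chunk; skip it entirely.
               offset -= len(chunk)
               continue
           if offset == 0 and remaining >= len(chunk):
               # Chunk is fully covered by the slice: reuse it without slicing.
               out.append(chunk)
               remaining -= len(chunk)
           else:
               # Boundary chunk: an actual slice is unavoidable here.
               taken = min(len(chunk) - offset, remaining)
               out.append(chunk.slice(offset, taken))
               remaining -= taken
               offset = 0
       return pa.chunked_array(out, type=arr.type)
   ```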
   
   In general, chunks incur some overhead, and a batch size of 1024 is quite small for pyarrow (pyarrow / Arrow C++ is not optimized to work on such small batch sizes), so it's probably best to use a larger batch size.
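   
   A rough illustration of the chunking cost (a hedged sketch; the table construction is made up for the example, absolute timings depend on your machine and pyarrow version, and only the relative difference matters):
   
   ```python
   import timeit
   import pyarrow as pa
   
   data = pa.array(range(1_000_000))
   one_chunk = pa.table({"x": data})
   # The same data split into ~1000 chunks of 1024 rows each.
   many_chunks = pa.table(
       {"x": pa.chunked_array([data.slice(i, 1024) for i in range(0, len(data), 1024)])}
   )
   
   print(timeit.timeit(lambda: one_chunk.slice(500_000, 10), number=1000))
   print(timeit.timeit(lambda: many_chunks.slice(500_000, 10), number=1000))
   
   # combine_chunks copies the data into a single chunk per column,
   # paying the cost once instead of on every subsequent slice/take.
   combined = many_chunks.combine_chunks()
   print(timeit.timeit(lambda: combined.slice(500_000, 10), number=1000))
   ```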
   
   > Is there a way to tell pyarrow.concat_tables to return a table with a single chunk so I can avoid an extra copy by calling combine_chunks()?
   
   That's currently not possible, but note that `concat_tables` does not actually copy the data (the original chunking is preserved); only `combine_chunks` makes a copy, as shown below.
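   
   A quick way to see this distinction (a minimal sketch):
   
   ```python
   import pyarrow as pa
   
   t1 = pa.table({"x": pa.array(range(1024))})
   t2 = pa.table({"x": pa.array(range(1024))})
   
   # concat_tables is zero-copy: the result just references the original chunks.
   combined = pa.concat_tables([t1, t2])
   print(combined.column("x").num_chunks)  # 2 -> original chunking preserved
   
   # combine_chunks rewrites each column into a single contiguous chunk (a copy).
   single = combined.combine_chunks()
   print(single.column("x").num_chunks)  # 1
   ```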

