Posted to github@arrow.apache.org by "westonpace (via GitHub)" <gi...@apache.org> on 2023/03/31 18:13:07 UTC

[GitHub] [arrow] westonpace commented on issue #34758: [Python] Batch much smaller than `batch_size` parameter

westonpace commented on issue #34758:
URL: https://github.com/apache/arrow/issues/34758#issuecomment-1492402437

   > I think the issue you are seeing is that Dataset.to_batches() doesn't combine row groups, so if your row groups are smaller than a batch size you will get smaller batches. You should think of batch_size here as an upper bound.
   
   That is part of it.  However, the `batch_size` parameter is also pretty broken at the moment.  There is a maximum batch size of 32Ki (32,768 rows) hard-coded into Acero.  Given that `max(lst)` is 32Ki, I suspect this is the cause for at least some of the data.  This hard-coded limit was added for good reason (it is, in theory, cheaper to run low-level compute on smaller batches), though I'm not convinced we actually need it.
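   
   For context, a minimal sketch of how to observe this from Python (the file path is a placeholder; any Parquet dataset with more than 32Ki rows should do):
   
   ```python
   import pyarrow.dataset as ds
   
   # Placeholder path -- any Parquet dataset with > 32768 rows works.
   dataset = ds.dataset("data.parquet", format="parquet")
   
   # Even when asking for very large batches, each yielded batch is capped by
   # the source row-group size and, on affected versions, by Acero's
   # hard-coded 32Ki limit.
   lst = [batch.num_rows for batch in dataset.to_batches(batch_size=1_000_000)]
   print(max(lst))  # tops out at 32768 (32Ki) on affected versions
   ```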
   
   However, it is also very reasonable to want large batches in Python.  We should probably rewire `batch_size` to some kind of sink-level accumulator.
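   
   Until then, a user-side workaround is to coalesce batches on top of `to_batches()`.  A rough sketch (the helper name and slicing strategy are mine, not part of pyarrow):
   
   ```python
   import pyarrow as pa
   
   def accumulate_batches(batches, target_rows):
       """Buffer incoming record batches and re-slice them so every emitted
       batch has exactly target_rows rows (except possibly the last)."""
       buffered = []
       buffered_rows = 0
       for batch in batches:
           buffered.append(batch)
           buffered_rows += batch.num_rows
           while buffered_rows >= target_rows:
               table = pa.Table.from_batches(buffered)
               # Emit one full-sized batch and keep the remainder buffered.
               yield table.slice(0, target_rows).combine_chunks().to_batches()[0]
               remainder = table.slice(target_rows)
               buffered = remainder.to_batches()
               buffered_rows = remainder.num_rows
       if buffered_rows > 0:
           yield pa.Table.from_batches(buffered).combine_chunks().to_batches()[0]
   ```
   
   This is essentially what a sink-level accumulator would do internally, just done after the scan instead of inside Acero.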


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org