Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/07/16 04:52:50 UTC

[GitHub] [arrow] wjones1 edited a comment on pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader

wjones1 edited a comment on pull request #6979:
URL: https://github.com/apache/arrow/pull/6979#issuecomment-659143356


   So it appears there were changes to the underlying implementation of RecordBatchReader. Prior to these changes, it would yield record batches of exactly the requested batch size (when possible). So for a `batch_size` of 900 on a file written with a `chunk_size` of 1,000, it would yield batches of 900, 900, 900, 900, ... rows. Now it yields slices aligned with the row groups, so the same parameters yield batches with row counts of 900, 100, 900, 100, and so on.
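   
   For illustration, here's roughly how that plays out. This is a minimal sketch, not a test from this PR: the file name and data are made up, `iter_batches()` is the method being proposed here, and `row_group_size` in `write_table` corresponds to the `chunk_size` mentioned above.
   
   ```python
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   # Illustrative data: 2,000 rows written as two 1,000-row row groups.
   table = pa.table({"x": list(range(2000))})
   pq.write_table(table, "example.parquet", row_group_size=1000)
   
   pf = pq.ParquetFile("example.parquet")
   for batch in pf.iter_batches(batch_size=900):
       # Row-group-aligned slicing prints 900, 100, 900, 100;
       # the old behavior would have printed 900, 900, 200.
       print(batch.num_rows)
   ```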
   
   I'm not 100% sure whether we care about the exact number of rows returned, but for now I'm leaning towards yes. Open to feedback on that. ~~In the meantime, I will push changes soon that will stitch together the batches to yield consistent row counts.~~
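   
   For what it's worth, here's one way such stitching could work. This is a rough, hypothetical sketch rather than the code in this PR; `stitch_batches` and its buffering approach are invented for illustration.
   
   ```python
   import pyarrow as pa
   
   def stitch_batches(batches, batch_size):
       """Re-slice row-group-aligned batches so each yielded batch has
       exactly batch_size rows (except possibly the last one)."""
       buffered = []       # RecordBatches waiting to be emitted
       buffered_rows = 0
       for batch in batches:
           buffered.append(batch)
           buffered_rows += batch.num_rows
           while buffered_rows >= batch_size:
               # Glue the buffer into one Table and split it at the
               # batch_size boundary.
               table = pa.Table.from_batches(buffered)
               head = table.slice(0, batch_size).combine_chunks()
               yield head.to_batches()[0]
               tail = table.slice(batch_size)
               buffered = tail.to_batches()
               buffered_rows = tail.num_rows
       if buffered_rows > 0:
           # Flush whatever is left as a final, shorter batch.
           leftover = pa.Table.from_batches(buffered).combine_chunks()
           yield leftover.to_batches()[0]
   ```
   
   Wrapping the reader like `stitch_batches(pf.iter_batches(batch_size=900), 900)` would then produce steady 900-row batches until the data runs out.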
   
   A cool side effect of these changes is that they get around the bug I mentioned earlier, which would have blocked support for categorical columns in this method.

