Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/06/27 02:19:58 UTC

[GitHub] [arrow] wjones1 edited a comment on pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader

wjones1 edited a comment on pull request #6979:
URL: https://github.com/apache/arrow/pull/6979#issuecomment-650474759


   RE: @jorisvandenbossche 
   > Same question as in the other PR: does setting the batch size also influence existing methods like `read` or `read_row_group`? Should we add that keyword there as well?
   
   My one hesitation about adding them is that it's not clear to me what effect `batch_size` would have on execution. For `iter_batches()` the effect is obvious, but for these other methods I'm not sure.
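   
   To illustrate the `iter_batches()` case, here is a minimal sketch of the intended usage (the file path and the `process` callback are placeholders, not part of the API):
   
   ```python
   import pyarrow.parquet as pq
   
   pf = pq.ParquetFile("example.parquet")  # placeholder path
   
   # Stream the file as record batches of at most 1,000 rows each,
   # bounding peak memory instead of materializing the whole table at once.
   for batch in pf.iter_batches(batch_size=1000):
       process(batch)  # placeholder per-batch callback
   ```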
   
   After a quick search of the Apache Arrow docs, the only explanation of the batch size parameter I found was this:
   
   >  [The maximum row count for scanned record batches. If scanned record batches are overflowing memory then this method can be called to reduce their size.](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Scanner.html?highlight=batch_size)
   
   If I can find a good explanation for it, or if you have one, I'd be happy to add the `batch_size` parameter to the `read()` and `read_row_group()` methods and include that explanation in their docstrings.
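   
   For what it's worth, if we did add `batch_size` to `read()`, I'd guess the semantics would be roughly "read via batches and concatenate", something like this sketch (the helper name is illustrative, not a proposed API):
   
   ```python
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   # Illustrative only: one plausible meaning of read(batch_size=...),
   # expressed in terms of iter_batches(). The resulting Table is the same
   # either way; batch_size would only cap the intermediate batch sizes.
   def read_with_batch_size(pf: pq.ParquetFile, batch_size: int) -> pa.Table:
       batches = pf.iter_batches(batch_size=batch_size)
       return pa.Table.from_batches(batches, schema=pf.schema.to_arrow_schema())
   ```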


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org