Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/12/14 18:06:50 UTC

[GitHub] [arrow-rs] tustvold commented on a change in pull request #1031: Extract method to drive PageIterator -> RecordReader

tustvold commented on a change in pull request #1031:
URL: https://github.com/apache/arrow-rs/pull/1031#discussion_r768920834



##########
File path: parquet/src/arrow/array_reader.rs
##########
@@ -100,6 +100,36 @@ pub trait ArrayReader {
     fn get_rep_levels(&self) -> Option<&[i16]>;
 }
 
+/// Uses `record_reader` to read up to `batch_size` records from `pages`
+///
+/// Returns the number of records read, which can be less than batch_size if
+/// pages is exhausted.
+fn read_records<T: DataType>(
+    record_reader: &mut RecordReader<T>,
+    pages: &mut dyn PageIterator,
+    batch_size: usize,
+) -> Result<usize> {
+    let mut records_read = 0usize;
+    while records_read < batch_size {
+        let records_to_read = batch_size - records_read;
+
+        let records_read_once = record_reader.read_records(records_to_read)?;
+        records_read += records_read_once;
+
+        // Record reader exhausted
+        if records_read_once < records_to_read {
+            if let Some(page_reader) = pages.next() {
+                // Read from new page reader (i.e. column chunk)
+                record_reader.set_page_reader(page_reader?)?;

Review comment:
       If we just called reset here, we would lose data. But we definitely could delimit record batches, i.e. not buffer data across column chunk boundaries. This would be a breaking change, though. A rough sketch of that variant is below.
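
A minimal sketch of the delimiting variant discussed above, assuming the same types and imports already in scope in `parquet/src/arrow/array_reader.rs` (`RecordReader`, `PageIterator`, `DataType`, `Result`). The function name `read_records_delimited` and the zero-records edge-case handling are illustrative assumptions, not part of the PR:

```rust
/// Hypothetical variant of `read_records` that never buffers a batch across
/// a column chunk boundary: when the current chunk is exhausted it installs
/// the next page reader and returns, so the next call starts a fresh batch.
fn read_records_delimited<T: DataType>(
    record_reader: &mut RecordReader<T>,
    pages: &mut dyn PageIterator,
    batch_size: usize,
) -> Result<usize> {
    let mut records_read = 0usize;
    while records_read < batch_size {
        let records_to_read = batch_size - records_read;

        let records_read_once = record_reader.read_records(records_to_read)?;
        records_read += records_read_once;

        // Current column chunk exhausted
        if records_read_once < records_to_read {
            match pages.next() {
                Some(page_reader) => {
                    // Install the next column chunk for the *next* call
                    record_reader.set_page_reader(page_reader?)?;
                    // Stop at the chunk boundary, unless nothing has been
                    // read yet (avoid returning an empty batch mid-stream)
                    if records_read > 0 {
                        break;
                    }
                }
                // No more column chunks
                None => break,
            }
        }
    }
    Ok(records_read)
}
```

As noted, callers would observe batches that end at column chunk boundaries even when fewer than `batch_size` records were returned, which is the behavioural break being weighed here.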



