You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/07/30 14:04:59 UTC
[GitHub] [arrow-rs] tustvold commented on a diff in pull request #2237: Separate ArrayReader::next_batch with read_records and consume_batch

tustvold commented on code in PR #2237:
URL: https://github.com/apache/arrow-rs/pull/2237#discussion_r933810032


##########
parquet/src/arrow/array_reader/empty_array.rs:
##########
@@ -54,14 +54,21 @@ impl ArrayReader for EmptyArrayReader {
     }
 
     fn next_batch(&mut self, batch_size: usize) -> Result<ArrayRef> {
+        let size = self.read_records(batch_size)?;
+        self.consume_batch(size)
+    }
+
+    fn read_records(&mut self, batch_size: usize) -> Result<usize> {
         let len = self.remaining_rows.min(batch_size);
         self.remaining_rows -= len;
+        Ok(len)
+    }
 
+    fn consume_batch(&mut self, batch_size: usize) -> Result<ArrayRef> {
         let data = ArrayDataBuilder::new(self.data_type.clone())
-            .len(len)
+            .len(batch_size)

Review Comment:
   I think this needs to keep track of how many values have actually been read in read_records. Otherwise I think this will keep yielding values beyond the actual length of the column :sweat_smile: 



##########
parquet/src/arrow/array_reader/mod.rs:
##########
@@ -115,7 +125,7 @@ impl RowGroupCollection for Arc<dyn FileReader> {
 ///
 /// Returns the number of records read, which can be less than `batch_size` if
 /// pages is exhausted.
-fn read_records<V, CV>(
+fn read_records_inner<V, CV>(

Review Comment:
   Why rename this? I think this is now even less clear, as the concept of `inner` for a free function is a bit... undefined :smile: 



##########
parquet/src/arrow/array_reader/byte_array.rs:
##########
@@ -109,7 +109,23 @@ impl<I: OffsetSizeTrait + ScalarValue> ArrayReader for ByteArrayReader<I> {
     }
 
     fn next_batch(&mut self, batch_size: usize) -> Result<ArrayRef> {
-        read_records(&mut self.record_reader, self.pages.as_mut(), batch_size)?;
+        let size = self.read_records(batch_size)?;
+        self.consume_batch(size)
+    }
+
+    fn read_records(&mut self, batch_size: usize) -> Result<usize> {
+        read_records_inner(&mut self.record_reader, self.pages.as_mut(), batch_size)
+    }
+
+    fn consume_batch(&mut self, batch_size: usize) -> Result<ArrayRef> {

Review Comment:
   I think it would be simpler to remove the `batch_size` parameter and just return what has been buffered



##########
parquet/src/arrow/array_reader/byte_array.rs:
##########
@@ -109,7 +109,23 @@ impl<I: OffsetSizeTrait + ScalarValue> ArrayReader for ByteArrayReader<I> {
     }
 
     fn next_batch(&mut self, batch_size: usize) -> Result<ArrayRef> {
-        read_records(&mut self.record_reader, self.pages.as_mut(), batch_size)?;
+        let size = self.read_records(batch_size)?;

Review Comment:
   Perhaps we could make this the default impl on the trait?



##########
parquet/src/arrow/array_reader/complex_object_array.rs:
##########
@@ -160,6 +196,15 @@ where
             array = arrow::compute::cast(&array, &self.data_type)?;
         }
 
+        self.data_buffer = self.data_buffer.split_off(batch_size);
+        if let Some(buf) = &mut self.def_levels_buffer {

Review Comment:
   I think if we remove the `batch_size` argument to `consume_batch` we won't need to make this change



##########
parquet/src/arrow/array_reader/complex_object_array.rs:
##########
@@ -349,30 +398,32 @@ mod tests {
 
         let mut accu_len: usize = 0;
 
-        let array = array_reader.next_batch(values_per_page / 2).unwrap();
-        assert_eq!(array.len(), values_per_page / 2);
+        let len = array_reader.read_records(values_per_page / 2).unwrap();

Review Comment:
   Again I think removing batch_size from consume_batch allows preserving the existing behaviour



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org