You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/12/13 19:30:22 UTC

[GitHub] [arrow-rs] sunchao commented on a change in pull request #1021: Simplify parquet arror `RecordReader`

sunchao commented on a change in pull request #1021:
URL: https://github.com/apache/arrow-rs/pull/1021#discussion_r768041180



##########
File path: parquet/src/arrow/record_reader.rs
##########
@@ -107,21 +103,25 @@ impl<T: DataType> RecordReader<T> {
         loop {
             // Try to find some records from buffers that has been read into memory
             // but not counted as seen records.
-            records_read += self.split_records(num_records - records_read)?;
-
-            // Since page reader contains complete records, so if we reached end of a
-            // page reader, we should reach the end of a record
-            if end_of_column
-                && self.values_seen >= self.values_written
-                && self.in_middle_of_record
-            {
-                self.num_records += 1;
-                self.num_values = self.values_seen;
-                self.in_middle_of_record = false;
-                records_read += 1;
+            let (record_count, value_count) =
+                self.count_records(num_records - records_read);
+
+            self.num_records += record_count;

Review comment:
       nit: maybe we can update this only once before returning from the method?

##########
File path: parquet/src/arrow/record_reader.rs
##########
@@ -381,32 +380,26 @@ impl<T: DataType> RecordReader<T> {
         match rep_levels {
             Some(buf) => {
                 let mut records_read = 0;
+                let mut end_of_last_record = self.num_values;
+
+                for current in self.num_values..self.values_written {
+                    if buf[current] == 0 && current != end_of_last_record {

Review comment:
       Hmm, what if you haven't finished the current repeated list, and it continues to the next batch? seems we'll return here and count as if the repeated list has been read completely (since we'll increment the `records_read` here)?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org