You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/12/10 14:02:19 UTC

[GitHub] [arrow-rs] tustvold opened a new pull request #1021: Simplify record reader

tustvold opened a new pull request #1021:
URL: https://github.com/apache/arrow-rs/pull/1021


   # Which issue does this PR close?
   
   Closes #1020.
   
   # Rationale for this change
    
   See ticket
   
   # What changes are included in this PR?
   
   This alters RecordReader to remove some shared mutable state, along with the concept of being in the middle of a record.
   
   # Are there any user-facing changes?
   
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] tustvold commented on a change in pull request #1021: Simplify parquet arror `RecordReader`

Posted by GitBox <gi...@apache.org>.
tustvold commented on a change in pull request #1021:
URL: https://github.com/apache/arrow-rs/pull/1021#discussion_r768130610



##########
File path: parquet/src/arrow/record_reader.rs
##########
@@ -107,21 +103,25 @@ impl<T: DataType> RecordReader<T> {
         loop {
             // Try to find some records from buffers that has been read into memory
             // but not counted as seen records.
-            records_read += self.split_records(num_records - records_read)?;
-
-            // Since page reader contains complete records, so if we reached end of a
-            // page reader, we should reach the end of a record
-            if end_of_column
-                && self.values_seen >= self.values_written
-                && self.in_middle_of_record
-            {
-                self.num_records += 1;
-                self.num_values = self.values_seen;
-                self.in_middle_of_record = false;
-                records_read += 1;
+            let (record_count, value_count) =
+                self.count_records(num_records - records_read);
+
+            self.num_records += record_count;
+            self.num_values += value_count;
+            records_read += record_count;
+
+            if records_read == num_records {
+                break;
             }
 
-            if (records_read >= num_records) || end_of_column {
+            if end_of_column {

Review comment:
       Ehehe, `PageReader` is actually a column chunk... So the end of a `PageReader` is the end of a row group, not the end of a page. Confusingly `PageIterator` is an iterator of `PageReader` which are themselves iterators of `Page` :laughing: 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] sunchao commented on a change in pull request #1021: Simplify parquet arror `RecordReader`

Posted by GitBox <gi...@apache.org>.
sunchao commented on a change in pull request #1021:
URL: https://github.com/apache/arrow-rs/pull/1021#discussion_r768128041



##########
File path: parquet/src/arrow/record_reader.rs
##########
@@ -107,21 +103,25 @@ impl<T: DataType> RecordReader<T> {
         loop {
             // Try to find some records from buffers that has been read into memory
             // but not counted as seen records.
-            records_read += self.split_records(num_records - records_read)?;
-
-            // Since page reader contains complete records, so if we reached end of a
-            // page reader, we should reach the end of a record
-            if end_of_column
-                && self.values_seen >= self.values_written
-                && self.in_middle_of_record
-            {
-                self.num_records += 1;
-                self.num_values = self.values_seen;
-                self.in_middle_of_record = false;
-                records_read += 1;
+            let (record_count, value_count) =
+                self.count_records(num_records - records_read);
+
+            self.num_records += record_count;
+            self.num_values += value_count;
+            records_read += record_count;
+
+            if records_read == num_records {
+                break;
             }
 
-            if (records_read >= num_records) || end_of_column {
+            if end_of_column {
+                // Since page reader contains complete records, if we reached end of a

Review comment:
       I'm not sure if this is true though. Take `parquet-mr` as example, this is true for the latest version but in versions before 1.11.0, it seems there is no such guarantee: https://github.com/apache/parquet-mr/blob/apache-parquet-1.10.1/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterV1.java#L106, and a repeated list could span multiple pages.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] codecov-commenter commented on pull request #1021: Simplify record reader

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on pull request #1021:
URL: https://github.com/apache/arrow-rs/pull/1021#issuecomment-991010923


   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1021?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#1021](https://codecov.io/gh/apache/arrow-rs/pull/1021?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (9930e50) into [master](https://codecov.io/gh/apache/arrow-rs/commit/239cba141cb27519b7c32d58a3ea6447fda31d11?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (239cba1) will **decrease** coverage by `0.00%`.
   > The diff coverage is `82.60%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-rs/pull/1021/graphs/tree.svg?width=650&height=150&src=pr&token=pq9V9qWZ1N&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-rs/pull/1021?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #1021      +/-   ##
   ==========================================
   - Coverage   82.31%   82.30%   -0.01%     
   ==========================================
     Files         168      168              
     Lines       49031    49027       -4     
   ==========================================
   - Hits        40360    40353       -7     
   - Misses       8671     8674       +3     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/1021?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [parquet/src/arrow/record\_reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/1021/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvcmVjb3JkX3JlYWRlci5ycw==) | `92.79% <82.60%> (-0.94%)` | :arrow_down: |
   | [arrow/src/datatypes/datatype.rs](https://codecov.io/gh/apache/arrow-rs/pull/1021/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2RhdGF0eXBlcy9kYXRhdHlwZS5ycw==) | `65.95% <0.00%> (-0.43%)` | :arrow_down: |
   | [arrow/src/array/transform/mod.rs](https://codecov.io/gh/apache/arrow-rs/pull/1021/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L3RyYW5zZm9ybS9tb2QucnM=) | `85.24% <0.00%> (+0.13%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1021?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1021?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [239cba1...9930e50](https://codecov.io/gh/apache/arrow-rs/pull/1021?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] tustvold commented on a change in pull request #1021: Simplify parquet arror `RecordReader`

Posted by GitBox <gi...@apache.org>.
tustvold commented on a change in pull request #1021:
URL: https://github.com/apache/arrow-rs/pull/1021#discussion_r768096700



##########
File path: parquet/src/arrow/record_reader.rs
##########
@@ -381,32 +380,26 @@ impl<T: DataType> RecordReader<T> {
         match rep_levels {
             Some(buf) => {
                 let mut records_read = 0;
+                let mut end_of_last_record = self.num_values;
+
+                for current in self.num_values..self.values_written {
+                    if buf[current] == 0 && current != end_of_last_record {

Review comment:
       > what if you haven't finished the current repeated list
   
   I'm not sure I follow, `buf[current] == 0` implies we've reached the end of the list. Perhaps it would be more clear if the second condition were `current != self.num_values`? :thinking: 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] tustvold commented on a change in pull request #1021: Simplify parquet arror `RecordReader`

Posted by GitBox <gi...@apache.org>.
tustvold commented on a change in pull request #1021:
URL: https://github.com/apache/arrow-rs/pull/1021#discussion_r768116795



##########
File path: parquet/src/arrow/record_reader.rs
##########
@@ -381,32 +380,26 @@ impl<T: DataType> RecordReader<T> {
         match rep_levels {
             Some(buf) => {
                 let mut records_read = 0;
+                let mut end_of_last_record = self.num_values;
+
+                for current in self.num_values..self.values_written {
+                    if buf[current] == 0 && current != end_of_last_record {

Review comment:
       Users of `RecordBatch` call `read_records` and then call `consume_rep_levels` and friends to split data out. The result being it should only buffer a little bit more than the batch_size passed to `read_records`. 
   
   I agree this API is not particularly intuitive, I created #1032 in part because I felt these APIs were clearly not designed for external consumption. I believe the funky arises because `ArrayReader` wants to be able to stitch together multiple column chunks into the same RecordBatch.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] sunchao commented on a change in pull request #1021: Simplify parquet arror `RecordReader`

Posted by GitBox <gi...@apache.org>.
sunchao commented on a change in pull request #1021:
URL: https://github.com/apache/arrow-rs/pull/1021#discussion_r768110745



##########
File path: parquet/src/arrow/record_reader.rs
##########
@@ -381,32 +380,26 @@ impl<T: DataType> RecordReader<T> {
         match rep_levels {
             Some(buf) => {
                 let mut records_read = 0;
+                let mut end_of_last_record = self.num_values;
+
+                for current in self.num_values..self.values_written {
+                    if buf[current] == 0 && current != end_of_last_record {

Review comment:
       It's also strange that we initialize the `repLevels` to be the min batch size but keep growing it as we read more batches, until it hit the total number of levels for the entire column chunk.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on a change in pull request #1021: Simplify record reader

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #1021:
URL: https://github.com/apache/arrow-rs/pull/1021#discussion_r766786054



##########
File path: parquet/src/arrow/record_reader.rs
##########
@@ -75,9 +73,7 @@ impl<T: DataType> RecordReader<T> {
             column_desc: column_schema,
             num_records: 0,
             num_values: 0,
-            values_seen: 0,

Review comment:
       These fields look like they have been here since the initial implementation by @liurenjie1024 in https://github.com/apache/arrow/pull/4292 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] tustvold commented on a change in pull request #1021: Simplify parquet arror `RecordReader`

Posted by GitBox <gi...@apache.org>.
tustvold commented on a change in pull request #1021:
URL: https://github.com/apache/arrow-rs/pull/1021#discussion_r768098518



##########
File path: parquet/src/arrow/record_reader.rs
##########
@@ -381,32 +380,26 @@ impl<T: DataType> RecordReader<T> {
         match rep_levels {
             Some(buf) => {
                 let mut records_read = 0;
+                let mut end_of_last_record = self.num_values;
+
+                for current in self.num_values..self.values_written {
+                    if buf[current] == 0 && current != end_of_last_record {

Review comment:
       Updated




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] tustvold commented on pull request #1021: Simplify parquet arror `RecordReader`

Posted by GitBox <gi...@apache.org>.
tustvold commented on pull request #1021:
URL: https://github.com/apache/arrow-rs/pull/1021#issuecomment-992899061


   Further context for this PR can be found in #1041 as it was what motivated me to juggle the logic a bit, so that I could traitify it


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] tustvold commented on a change in pull request #1021: Simplify parquet arror `RecordReader`

Posted by GitBox <gi...@apache.org>.
tustvold commented on a change in pull request #1021:
URL: https://github.com/apache/arrow-rs/pull/1021#discussion_r768131558



##########
File path: parquet/src/arrow/record_reader.rs
##########
@@ -107,21 +103,25 @@ impl<T: DataType> RecordReader<T> {
         loop {
             // Try to find some records from buffers that has been read into memory
             // but not counted as seen records.
-            records_read += self.split_records(num_records - records_read)?;
-
-            // Since page reader contains complete records, so if we reached end of a
-            // page reader, we should reach the end of a record
-            if end_of_column
-                && self.values_seen >= self.values_written
-                && self.in_middle_of_record
-            {
-                self.num_records += 1;
-                self.num_values = self.values_seen;
-                self.in_middle_of_record = false;
-                records_read += 1;
+            let (record_count, value_count) =
+                self.count_records(num_records - records_read);
+
+            self.num_records += record_count;
+            self.num_values += value_count;
+            records_read += record_count;
+
+            if records_read == num_records {
+                break;
             }
 
-            if (records_read >= num_records) || end_of_column {
+            if end_of_column {
+                // Since page reader contains complete records, if we reached end of a

Review comment:
       See comment below, page reader is a column chunk. So this is effectively saying that a record can't be split across row groups, which I think is guaranteed?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] codecov-commenter edited a comment on pull request #1021: Simplify parquet arror `RecordReader`

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #1021:
URL: https://github.com/apache/arrow-rs/pull/1021#issuecomment-991010923


   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1021?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#1021](https://codecov.io/gh/apache/arrow-rs/pull/1021?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (cd0f759) into [master](https://codecov.io/gh/apache/arrow-rs/commit/e0abda2c178be0c38d4257d22de2e4a3bfafde82?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (e0abda2) will **decrease** coverage by `0.00%`.
   > The diff coverage is `82.60%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-rs/pull/1021/graphs/tree.svg?width=650&height=150&src=pr&token=pq9V9qWZ1N&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-rs/pull/1021?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #1021      +/-   ##
   ==========================================
   - Coverage   82.31%   82.30%   -0.01%     
   ==========================================
     Files         168      168              
     Lines       49031    49026       -5     
   ==========================================
   - Hits        40359    40350       -9     
   - Misses       8672     8676       +4     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/1021?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [parquet/src/arrow/record\_reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/1021/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvcmVjb3JkX3JlYWRlci5ycw==) | `92.77% <82.60%> (-0.96%)` | :arrow_down: |
   | [parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow-rs/pull/1021/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZW5jb2RpbmdzL2VuY29kaW5nLnJz) | `93.52% <0.00%> (-0.20%)` | :arrow_down: |
   | [arrow/src/array/transform/mod.rs](https://codecov.io/gh/apache/arrow-rs/pull/1021/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L3RyYW5zZm9ybS9tb2QucnM=) | `85.10% <0.00%> (-0.14%)` | :arrow_down: |
   | [parquet\_derive/src/parquet\_field.rs](https://codecov.io/gh/apache/arrow-rs/pull/1021/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldF9kZXJpdmUvc3JjL3BhcnF1ZXRfZmllbGQucnM=) | `66.21% <0.00%> (ø)` | |
   | [arrow/src/datatypes/datatype.rs](https://codecov.io/gh/apache/arrow-rs/pull/1021/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2RhdGF0eXBlcy9kYXRhdHlwZS5ycw==) | `66.38% <0.00%> (+0.42%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1021?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1021?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [e0abda2...cd0f759](https://codecov.io/gh/apache/arrow-rs/pull/1021?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on pull request #1021: Simplify record reader

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1021:
URL: https://github.com/apache/arrow-rs/pull/1021#issuecomment-991015630


   Filed https://github.com/apache/arrow-rs/issues/1022 to track CI failure in "nightly" builds


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] sunchao commented on a change in pull request #1021: Simplify parquet arror `RecordReader`

Posted by GitBox <gi...@apache.org>.
sunchao commented on a change in pull request #1021:
URL: https://github.com/apache/arrow-rs/pull/1021#discussion_r768041180



##########
File path: parquet/src/arrow/record_reader.rs
##########
@@ -107,21 +103,25 @@ impl<T: DataType> RecordReader<T> {
         loop {
             // Try to find some records from buffers that has been read into memory
             // but not counted as seen records.
-            records_read += self.split_records(num_records - records_read)?;
-
-            // Since page reader contains complete records, so if we reached end of a
-            // page reader, we should reach the end of a record
-            if end_of_column
-                && self.values_seen >= self.values_written
-                && self.in_middle_of_record
-            {
-                self.num_records += 1;
-                self.num_values = self.values_seen;
-                self.in_middle_of_record = false;
-                records_read += 1;
+            let (record_count, value_count) =
+                self.count_records(num_records - records_read);
+
+            self.num_records += record_count;

Review comment:
       nit: maybe we can update this only once before returning from the method?

##########
File path: parquet/src/arrow/record_reader.rs
##########
@@ -381,32 +380,26 @@ impl<T: DataType> RecordReader<T> {
         match rep_levels {
             Some(buf) => {
                 let mut records_read = 0;
+                let mut end_of_last_record = self.num_values;
+
+                for current in self.num_values..self.values_written {
+                    if buf[current] == 0 && current != end_of_last_record {

Review comment:
       Hmm, what if you haven't finished the current repeated list, and it continues to the next batch? seems we'll return here and count as if the repeated list has been read completely (since we'll increment the `records_read` here)?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] sunchao merged pull request #1021: Simplify parquet arror `RecordReader`

Posted by GitBox <gi...@apache.org>.
sunchao merged pull request #1021:
URL: https://github.com/apache/arrow-rs/pull/1021


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on pull request #1021: Simplify record reader

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1021:
URL: https://github.com/apache/arrow-rs/pull/1021#issuecomment-991053958


   I fixed the nightly failures in https://github.com/apache/arrow-rs/pull/1023 -- will merge to this PR to get that to pass too


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] tustvold commented on a change in pull request #1021: Simplify parquet arror `RecordReader`

Posted by GitBox <gi...@apache.org>.
tustvold commented on a change in pull request #1021:
URL: https://github.com/apache/arrow-rs/pull/1021#discussion_r768116795



##########
File path: parquet/src/arrow/record_reader.rs
##########
@@ -381,32 +380,26 @@ impl<T: DataType> RecordReader<T> {
         match rep_levels {
             Some(buf) => {
                 let mut records_read = 0;
+                let mut end_of_last_record = self.num_values;
+
+                for current in self.num_values..self.values_written {
+                    if buf[current] == 0 && current != end_of_last_record {

Review comment:
       Users of `RecordReader` call `read_records` and then call `consume_rep_levels` and friends to split data out. The result being it should only buffer a little bit more than the batch_size passed to `read_records`. 
   
   I agree this API is not particularly intuitive, I created #1032 in part because I felt these APIs were clearly not designed for external consumption. I believe the funky arises because `ArrayReader` wants to be able to stitch together multiple column chunks from different row groups (i.e. `PageReader`) into the same RecordBatch.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] sunchao commented on a change in pull request #1021: Simplify parquet arror `RecordReader`

Posted by GitBox <gi...@apache.org>.
sunchao commented on a change in pull request #1021:
URL: https://github.com/apache/arrow-rs/pull/1021#discussion_r768126971



##########
File path: parquet/src/arrow/record_reader.rs
##########
@@ -381,32 +380,26 @@ impl<T: DataType> RecordReader<T> {
         match rep_levels {
             Some(buf) => {
                 let mut records_read = 0;
+                let mut end_of_last_record = self.num_values;
+
+                for current in self.num_values..self.values_written {
+                    if buf[current] == 0 && current != end_of_last_record {

Review comment:
       Thanks for the context. Yea I think `consume_rep_levels` and the friends are for assembling complex records like array, list and map. It'd be nice if we can simplify the APIs.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] sunchao commented on pull request #1021: Simplify parquet arror `RecordReader`

Posted by GitBox <gi...@apache.org>.
sunchao commented on pull request #1021:
URL: https://github.com/apache/arrow-rs/pull/1021#issuecomment-992938255


   Merged, thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] tustvold commented on a change in pull request #1021: Simplify parquet arror `RecordReader`

Posted by GitBox <gi...@apache.org>.
tustvold commented on a change in pull request #1021:
URL: https://github.com/apache/arrow-rs/pull/1021#discussion_r768096700



##########
File path: parquet/src/arrow/record_reader.rs
##########
@@ -381,32 +380,26 @@ impl<T: DataType> RecordReader<T> {
         match rep_levels {
             Some(buf) => {
                 let mut records_read = 0;
+                let mut end_of_last_record = self.num_values;
+
+                for current in self.num_values..self.values_written {
+                    if buf[current] == 0 && current != end_of_last_record {

Review comment:
       > what if you haven't finished the current repeated list
   
   I'm not sure I follow, `buf[current] == 0` implies we've reached the end of the list. Perhaps it would be more clear if the second condition were `current != self.num_values` it's only false on the first iteration? :thinking: 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on pull request #1021: Simplify record reader

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1021:
URL: https://github.com/apache/arrow-rs/pull/1021#issuecomment-991075986


   I think we should run the parquet performance benchmark for this change -- I will do so


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on pull request #1021: Simplify record reader

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1021:
URL: https://github.com/apache/arrow-rs/pull/1021#issuecomment-991226859


   
   
   My performance tests showed no significant performance difference
   
   1. `tustvold/simplify-record-reader` @ 290b24f7d1845b4cfe58eefeb6562e4ec2d54829
   2. `apache/master` @ e0abda2c178be0c38d4257d22de2e4a3bfafde82
   
   Test command
   
   ```shell
   cargo bench -p parquet --bench arrow_array_reader --features=test_common -- --save-baseline <name>
   ```
   
   
   Result:
   ```
   alamb@instance-1:/data/arrow-rs$ critcmp master1 simplify-record-reader1
   group                                                                                  master1                                simplify-record-reader1
   -----                                                                                  -------                                -----------------------
   arrow_array_reader/read Int32Array, dictionary encoded, mandatory, no NULLs - new      1.00    109.2±0.33µs        ? ?/sec    1.00    109.0±0.31µs        ? ?/sec
   arrow_array_reader/read Int32Array, dictionary encoded, mandatory, no NULLs - old      1.00     37.5±0.15µs        ? ?/sec    1.00     37.6±0.19µs        ? ?/sec
   arrow_array_reader/read Int32Array, dictionary encoded, optional, half NULLs - new     1.02    279.6±0.59µs        ? ?/sec    1.00    275.0±1.73µs        ? ?/sec
   arrow_array_reader/read Int32Array, dictionary encoded, optional, half NULLs - old     1.00    258.4±0.74µs        ? ?/sec    1.12    290.2±1.68µs        ? ?/sec
   arrow_array_reader/read Int32Array, dictionary encoded, optional, no NULLs - new       1.01    132.6±0.39µs        ? ?/sec    1.00    130.9±0.66µs        ? ?/sec
   arrow_array_reader/read Int32Array, dictionary encoded, optional, no NULLs - old       1.00    126.4±0.74µs        ? ?/sec    1.02    128.9±0.57µs        ? ?/sec
   arrow_array_reader/read Int32Array, plain encoded, mandatory, no NULLs - new           1.00      3.6±0.18µs        ? ?/sec    1.03      3.7±0.23µs        ? ?/sec
   arrow_array_reader/read Int32Array, plain encoded, mandatory, no NULLs - old           1.00      5.8±0.40µs        ? ?/sec    1.01      5.8±0.41µs        ? ?/sec
   arrow_array_reader/read Int32Array, plain encoded, optional, half NULLs - new          1.03    225.6±0.93µs        ? ?/sec    1.00    219.8±1.14µs        ? ?/sec
   arrow_array_reader/read Int32Array, plain encoded, optional, half NULLs - old          1.00    242.1±0.84µs        ? ?/sec    1.13    272.6±1.12µs        ? ?/sec
   arrow_array_reader/read Int32Array, plain encoded, optional, no NULLs - new            1.05     26.8±0.32µs        ? ?/sec    1.00     25.4±0.32µs        ? ?/sec
   arrow_array_reader/read Int32Array, plain encoded, optional, no NULLs - old            1.00     96.0±0.86µs        ? ?/sec    1.02     98.2±1.41µs        ? ?/sec
   arrow_array_reader/read StringArray, dictionary encoded, mandatory, no NULLs - new     1.00    155.6±1.06µs        ? ?/sec    1.01    157.1±1.18µs        ? ?/sec
   arrow_array_reader/read StringArray, dictionary encoded, mandatory, no NULLs - old     1.00   1201.5±3.71µs        ? ?/sec    1.00   1197.0±4.59µs        ? ?/sec
   arrow_array_reader/read StringArray, dictionary encoded, optional, half NULLs - new    1.00    358.7±1.41µs        ? ?/sec    1.00    358.4±2.82µs        ? ?/sec
   arrow_array_reader/read StringArray, dictionary encoded, optional, half NULLs - old    1.01   1086.9±3.57µs        ? ?/sec    1.00   1080.2±3.87µs        ? ?/sec
   arrow_array_reader/read StringArray, dictionary encoded, optional, no NULLs - new      1.01    181.7±1.17µs        ? ?/sec    1.00    179.4±1.03µs        ? ?/sec
   arrow_array_reader/read StringArray, dictionary encoded, optional, no NULLs - old      1.01   1273.7±7.95µs        ? ?/sec    1.00   1265.2±8.85µs        ? ?/sec
   arrow_array_reader/read StringArray, plain encoded, mandatory, no NULLs - new          1.00    176.6±0.96µs        ? ?/sec    1.01    177.6±1.47µs        ? ?/sec
   arrow_array_reader/read StringArray, plain encoded, mandatory, no NULLs - old          1.00   1377.9±7.25µs        ? ?/sec    1.02   1399.2±5.47µs        ? ?/sec
   arrow_array_reader/read StringArray, plain encoded, optional, half NULLs - new         1.00    380.6±1.63µs        ? ?/sec    1.00    380.0±2.58µs        ? ?/sec
   arrow_array_reader/read StringArray, plain encoded, optional, half NULLs - old         1.00   1179.4±4.56µs        ? ?/sec    1.00   1180.2±4.77µs        ? ?/sec
   arrow_array_reader/read StringArray, plain encoded, optional, no NULLs - new           1.00    206.2±1.45µs        ? ?/sec    1.00    205.2±1.94µs        ? ?/sec
   arrow_array_reader/read StringArray, plain encoded, optional, no NULLs - old           1.00  1452.5±17.75µs        ? ?/sec    1.00   1445.8±5.62µs        ? ?/sec
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] tustvold commented on a change in pull request #1021: Simplify parquet arror `RecordReader`

Posted by GitBox <gi...@apache.org>.
tustvold commented on a change in pull request #1021:
URL: https://github.com/apache/arrow-rs/pull/1021#discussion_r768112403



##########
File path: parquet/src/arrow/record_reader.rs
##########
@@ -107,21 +103,25 @@ impl<T: DataType> RecordReader<T> {
         loop {
             // Try to find some records from buffers that has been read into memory
             // but not counted as seen records.
-            records_read += self.split_records(num_records - records_read)?;
-
-            // Since page reader contains complete records, so if we reached end of a
-            // page reader, we should reach the end of a record
-            if end_of_column
-                && self.values_seen >= self.values_written
-                && self.in_middle_of_record
-            {
-                self.num_records += 1;
-                self.num_values = self.values_seen;
-                self.in_middle_of_record = false;
-                records_read += 1;
+            let (record_count, value_count) =
+                self.count_records(num_records - records_read);
+
+            self.num_records += record_count;

Review comment:
       I think this would leave RecordReader in a strange state if read_one_batch returned an error, as `self.num_values` would have been updated and not `self.num`? I can't pull `self.num_values` out to match as it is used by `count_records`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] sunchao commented on a change in pull request #1021: Simplify parquet arror `RecordReader`

Posted by GitBox <gi...@apache.org>.
sunchao commented on a change in pull request #1021:
URL: https://github.com/apache/arrow-rs/pull/1021#discussion_r768109989



##########
File path: parquet/src/arrow/record_reader.rs
##########
@@ -381,32 +380,26 @@ impl<T: DataType> RecordReader<T> {
         match rep_levels {
             Some(buf) => {
                 let mut records_read = 0;
+                let mut end_of_last_record = self.num_values;
+
+                for current in self.num_values..self.values_written {
+                    if buf[current] == 0 && current != end_of_last_record {

Review comment:
       Ah sorry, my bad. yea this looks OK. I think the downside is we could potentially read a batch of `repLevels` multiple times if, say, the `repLevels` are all non-zero values.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] sunchao commented on a change in pull request #1021: Simplify parquet arror `RecordReader`

Posted by GitBox <gi...@apache.org>.
sunchao commented on a change in pull request #1021:
URL: https://github.com/apache/arrow-rs/pull/1021#discussion_r768128892



##########
File path: parquet/src/arrow/record_reader.rs
##########
@@ -107,21 +103,25 @@ impl<T: DataType> RecordReader<T> {
         loop {
             // Try to find some records from buffers that has been read into memory
             // but not counted as seen records.
-            records_read += self.split_records(num_records - records_read)?;
-
-            // Since page reader contains complete records, so if we reached end of a
-            // page reader, we should reach the end of a record
-            if end_of_column
-                && self.values_seen >= self.values_written
-                && self.in_middle_of_record
-            {
-                self.num_records += 1;
-                self.num_values = self.values_seen;
-                self.in_middle_of_record = false;
-                records_read += 1;
+            let (record_count, value_count) =
+                self.count_records(num_records - records_read);
+
+            self.num_records += record_count;
+            self.num_values += value_count;
+            records_read += record_count;
+
+            if records_read == num_records {
+                break;
             }
 
-            if (records_read >= num_records) || end_of_column {
+            if end_of_column {

Review comment:
       I'm wondering if this should be called `end_of_page` since `read_records` consumes at most a page? a new page is set in `ArrayReader.next_batch`.

##########
File path: parquet/src/arrow/record_reader.rs
##########
@@ -107,21 +103,25 @@ impl<T: DataType> RecordReader<T> {
         loop {
             // Try to find some records from buffers that has been read into memory
             // but not counted as seen records.
-            records_read += self.split_records(num_records - records_read)?;
-
-            // Since page reader contains complete records, so if we reached end of a
-            // page reader, we should reach the end of a record
-            if end_of_column
-                && self.values_seen >= self.values_written
-                && self.in_middle_of_record
-            {
-                self.num_records += 1;
-                self.num_values = self.values_seen;
-                self.in_middle_of_record = false;
-                records_read += 1;
+            let (record_count, value_count) =
+                self.count_records(num_records - records_read);
+
+            self.num_records += record_count;
+            self.num_values += value_count;
+            records_read += record_count;
+
+            if records_read == num_records {
+                break;
             }
 
-            if (records_read >= num_records) || end_of_column {
+            if end_of_column {
+                // Since page reader contains complete records, if we reached end of a

Review comment:
       I'm not sure if this is true though. Take `parquet-mr` as example, this is true for the latest version but in versions before 1.11.0, it seems there is no such guarantee: https://github.com/apache/parquet-mr/blob/apache-parquet-1.10.1/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterV1.java#L106.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] codecov-commenter edited a comment on pull request #1021: Simplify parquet arror `RecordReader`

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #1021:
URL: https://github.com/apache/arrow-rs/pull/1021#issuecomment-991010923


   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1021?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#1021](https://codecov.io/gh/apache/arrow-rs/pull/1021?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (cd0f759) into [master](https://codecov.io/gh/apache/arrow-rs/commit/e0abda2c178be0c38d4257d22de2e4a3bfafde82?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (e0abda2) will **decrease** coverage by `0.00%`.
   > The diff coverage is `82.60%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-rs/pull/1021/graphs/tree.svg?width=650&height=150&src=pr&token=pq9V9qWZ1N&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-rs/pull/1021?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #1021      +/-   ##
   ==========================================
   - Coverage   82.31%   82.30%   -0.01%     
   ==========================================
     Files         168      168              
     Lines       49031    49026       -5     
   ==========================================
   - Hits        40359    40350       -9     
   - Misses       8672     8676       +4     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/1021?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [parquet/src/arrow/record\_reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/1021/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvcmVjb3JkX3JlYWRlci5ycw==) | `92.77% <82.60%> (-0.96%)` | :arrow_down: |
   | [parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow-rs/pull/1021/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZW5jb2RpbmdzL2VuY29kaW5nLnJz) | `93.52% <0.00%> (-0.20%)` | :arrow_down: |
   | [arrow/src/array/transform/mod.rs](https://codecov.io/gh/apache/arrow-rs/pull/1021/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L3RyYW5zZm9ybS9tb2QucnM=) | `85.10% <0.00%> (-0.14%)` | :arrow_down: |
   | [parquet\_derive/src/parquet\_field.rs](https://codecov.io/gh/apache/arrow-rs/pull/1021/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldF9kZXJpdmUvc3JjL3BhcnF1ZXRfZmllbGQucnM=) | `66.21% <0.00%> (ø)` | |
   | [arrow/src/datatypes/datatype.rs](https://codecov.io/gh/apache/arrow-rs/pull/1021/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2RhdGF0eXBlcy9kYXRhdHlwZS5ycw==) | `66.38% <0.00%> (+0.42%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1021?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1021?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [e0abda2...cd0f759](https://codecov.io/gh/apache/arrow-rs/pull/1021?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on pull request #1021: Simplify record reader

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1021:
URL: https://github.com/apache/arrow-rs/pull/1021#issuecomment-991226859






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on pull request #1021: Simplify record reader

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1021:
URL: https://github.com/apache/arrow-rs/pull/1021#issuecomment-991227188


   It looks like this PR needs some clippy appeasement: https://github.com/apache/arrow-rs/runs/4485244206?check_suite_focus=true
   
   But otherwise looks good from my perspective


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] tustvold commented on a change in pull request #1021: Simplify parquet arror `RecordReader`

Posted by GitBox <gi...@apache.org>.
tustvold commented on a change in pull request #1021:
URL: https://github.com/apache/arrow-rs/pull/1021#discussion_r768116795



##########
File path: parquet/src/arrow/record_reader.rs
##########
@@ -381,32 +380,26 @@ impl<T: DataType> RecordReader<T> {
         match rep_levels {
             Some(buf) => {
                 let mut records_read = 0;
+                let mut end_of_last_record = self.num_values;
+
+                for current in self.num_values..self.values_written {
+                    if buf[current] == 0 && current != end_of_last_record {

Review comment:
       Users of `RecordBatch` call `read_records` and then call `consume_rep_levels` and friends to split data out. The result being it should only buffer a little bit more than the batch_size passed to `read_records`. 
   
   I agree this API is not particularly intuitive, I created #1032 in part because I felt these APIs were clearly not designed for external consumption. I believe the funky arises because `ArrayReader` wants to be able to stitch together multiple column chunks from different row groups (i.e. `PageReader`) into the same RecordBatch.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] sunchao commented on a change in pull request #1021: Simplify parquet arror `RecordReader`

Posted by GitBox <gi...@apache.org>.
sunchao commented on a change in pull request #1021:
URL: https://github.com/apache/arrow-rs/pull/1021#discussion_r768133537



##########
File path: parquet/src/arrow/record_reader.rs
##########
@@ -107,21 +103,25 @@ impl<T: DataType> RecordReader<T> {
         loop {
             // Try to find some records from buffers that has been read into memory
             // but not counted as seen records.
-            records_read += self.split_records(num_records - records_read)?;
-
-            // Since page reader contains complete records, so if we reached end of a
-            // page reader, we should reach the end of a record
-            if end_of_column
-                && self.values_seen >= self.values_written
-                && self.in_middle_of_record
-            {
-                self.num_records += 1;
-                self.num_values = self.values_seen;
-                self.in_middle_of_record = false;
-                records_read += 1;
+            let (record_count, value_count) =
+                self.count_records(num_records - records_read);
+
+            self.num_records += record_count;
+            self.num_values += value_count;
+            records_read += record_count;
+
+            if records_read == num_records {
+                break;
             }
 
-            if (records_read >= num_records) || end_of_column {
+            if end_of_column {

Review comment:
       Ah got it, thanks 🤦 . It all makes sense now!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org