You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/12/11 19:49:02 UTC

[GitHub] [arrow-rs] tustvold opened a new pull request #1031: Extract method to drive PageIterator -> RecordReader

tustvold opened a new pull request #1031:
URL: https://github.com/apache/arrow-rs/pull/1031


   # Which issue does this PR close?
   
   Relates to #171
   
   # Rationale for this change
    
   The logic for driving a `RecordReader` from a `PageIterator` is currently duplicated in `PrimitiveArrayReader` and `NullArrayReader`. This duplication will only increase with new `ArrayReader` implementations added as part of #171
   
   # What changes are included in this PR?
   
   Extracts a read_records function to handle this logic 
   
   # Are there any user-facing changes?
   
   This no longer eagerly populates the `RecordReader` with the first `PageReader` from the `PageIterator`. If for example the first page of the first row group contained a compression codec that was not supported, this would currently error in the constructor, whereas with this change it will error in `ArrayReader::next_batch`. I personally think this makes more sense. FWIW I think this is the only possible error case and so it is unlikely user-code would be impacted...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] codecov-commenter commented on pull request #1031: Extract method to drive PageIterator -> RecordReader

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on pull request #1031:
URL: https://github.com/apache/arrow-rs/pull/1031#issuecomment-991767524


   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1031?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#1031](https://codecov.io/gh/apache/arrow-rs/pull/1031?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (75920d0) into [master](https://codecov.io/gh/apache/arrow-rs/commit/239cba141cb27519b7c32d58a3ea6447fda31d11?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (239cba1) will **decrease** coverage by `0.00%`.
   > The diff coverage is `93.75%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-rs/pull/1031/graphs/tree.svg?width=650&height=150&src=pr&token=pq9V9qWZ1N&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-rs/pull/1031?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #1031      +/-   ##
   ==========================================
   - Coverage   82.31%   82.30%   -0.01%     
   ==========================================
     Files         168      168              
     Lines       49031    49022       -9     
   ==========================================
   - Hits        40360    40350      -10     
   - Misses       8671     8672       +1     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/1031?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [parquet/src/arrow/array\_reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/1031/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyYXlfcmVhZGVyLnJz) | `76.66% <93.75%> (-0.10%)` | :arrow_down: |
   | [arrow/src/datatypes/datatype.rs](https://codecov.io/gh/apache/arrow-rs/pull/1031/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2RhdGF0eXBlcy9kYXRhdHlwZS5ycw==) | `65.95% <0.00%> (-0.43%)` | :arrow_down: |
   | [arrow/src/datatypes/field.rs](https://codecov.io/gh/apache/arrow-rs/pull/1031/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2RhdGF0eXBlcy9maWVsZC5ycw==) | `53.37% <0.00%> (-0.31%)` | :arrow_down: |
   | [parquet\_derive/src/parquet\_field.rs](https://codecov.io/gh/apache/arrow-rs/pull/1031/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldF9kZXJpdmUvc3JjL3BhcnF1ZXRfZmllbGQucnM=) | `65.98% <0.00%> (-0.23%)` | :arrow_down: |
   | [arrow/src/array/transform/mod.rs](https://codecov.io/gh/apache/arrow-rs/pull/1031/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L3RyYW5zZm9ybS9tb2QucnM=) | `85.24% <0.00%> (+0.13%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1031?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1031?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [239cba1...75920d0](https://codecov.io/gh/apache/arrow-rs/pull/1031?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] tustvold commented on a change in pull request #1031: Extract method to drive PageIterator -> RecordReader

Posted by GitBox <gi...@apache.org>.
tustvold commented on a change in pull request #1031:
URL: https://github.com/apache/arrow-rs/pull/1031#discussion_r768920834



##########
File path: parquet/src/arrow/array_reader.rs
##########
@@ -100,6 +100,36 @@ pub trait ArrayReader {
     fn get_rep_levels(&self) -> Option<&[i16]>;
 }
 
+/// Uses `record_reader` to read up to `batch_size` records from `pages`
+///
+/// Returns the number of records read, which can be less than batch_size if
+/// pages is exhausted.
+fn read_records<T: DataType>(
+    record_reader: &mut RecordReader<T>,
+    pages: &mut dyn PageIterator,
+    batch_size: usize,
+) -> Result<usize> {
+    let mut records_read = 0usize;
+    while records_read < batch_size {
+        let records_to_read = batch_size - records_read;
+
+        let records_read_once = record_reader.read_records(records_to_read)?;
+        records_read += records_read_once;
+
+        // Record reader exhausted
+        if records_read_once < records_to_read {
+            if let Some(page_reader) = pages.next() {
+                // Read from new page reader (i.e. column chunk)
+                record_reader.set_page_reader(page_reader?)?;

Review comment:
       If we just called reset here, we would lose data. But we definitely could delimit record batches, i.e. don't buffer data across column chunk boundaries. This would be breaking change though




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] sunchao commented on a change in pull request #1031: Extract method to drive PageIterator -> RecordReader

Posted by GitBox <gi...@apache.org>.
sunchao commented on a change in pull request #1031:
URL: https://github.com/apache/arrow-rs/pull/1031#discussion_r768918989



##########
File path: parquet/src/arrow/array_reader.rs
##########
@@ -100,6 +100,36 @@ pub trait ArrayReader {
     fn get_rep_levels(&self) -> Option<&[i16]>;
 }
 
+/// Uses `record_reader` to read up to `batch_size` records from `pages`
+///
+/// Returns the number of records read, which can be less than batch_size if
+/// pages is exhausted.
+fn read_records<T: DataType>(
+    record_reader: &mut RecordReader<T>,
+    pages: &mut dyn PageIterator,
+    batch_size: usize,
+) -> Result<usize> {
+    let mut records_read = 0usize;
+    while records_read < batch_size {
+        let records_to_read = batch_size - records_read;
+
+        let records_read_once = record_reader.read_records(records_to_read)?;
+        records_read += records_read_once;
+
+        // Record reader exhausted
+        if records_read_once < records_to_read {
+            if let Some(page_reader) = pages.next() {
+                // Read from new page reader (i.e. column chunk)
+                record_reader.set_page_reader(page_reader?)?;

Review comment:
       Unrelated, but I'm thinking whether we can reset the `record_reader` upon new column chunk, so that we don't have to keep accumulating buffers for def & rep levels, and values.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] sunchao merged pull request #1031: Extract method to drive PageIterator -> RecordReader

Posted by GitBox <gi...@apache.org>.
sunchao merged pull request #1031:
URL: https://github.com/apache/arrow-rs/pull/1031


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] sunchao commented on a change in pull request #1031: Extract method to drive PageIterator -> RecordReader

Posted by GitBox <gi...@apache.org>.
sunchao commented on a change in pull request #1031:
URL: https://github.com/apache/arrow-rs/pull/1031#discussion_r768925684



##########
File path: parquet/src/arrow/array_reader.rs
##########
@@ -100,6 +100,36 @@ pub trait ArrayReader {
     fn get_rep_levels(&self) -> Option<&[i16]>;
 }
 
+/// Uses `record_reader` to read up to `batch_size` records from `pages`
+///
+/// Returns the number of records read, which can be less than batch_size if
+/// pages is exhausted.
+fn read_records<T: DataType>(
+    record_reader: &mut RecordReader<T>,
+    pages: &mut dyn PageIterator,
+    batch_size: usize,
+) -> Result<usize> {
+    let mut records_read = 0usize;
+    while records_read < batch_size {
+        let records_to_read = batch_size - records_read;
+
+        let records_read_once = record_reader.read_records(records_to_read)?;
+        records_read += records_read_once;
+
+        // Record reader exhausted
+        if records_read_once < records_to_read {
+            if let Some(page_reader) = pages.next() {
+                // Read from new page reader (i.e. column chunk)
+                record_reader.set_page_reader(page_reader?)?;

Review comment:
       Got it, thanks!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] codecov-commenter commented on pull request #1031: Extract method to drive PageIterator -> RecordReader

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on pull request #1031:
URL: https://github.com/apache/arrow-rs/pull/1031#issuecomment-991767524


   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1031?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#1031](https://codecov.io/gh/apache/arrow-rs/pull/1031?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (75920d0) into [master](https://codecov.io/gh/apache/arrow-rs/commit/239cba141cb27519b7c32d58a3ea6447fda31d11?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (239cba1) will **decrease** coverage by `0.00%`.
   > The diff coverage is `93.75%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-rs/pull/1031/graphs/tree.svg?width=650&height=150&src=pr&token=pq9V9qWZ1N&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-rs/pull/1031?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #1031      +/-   ##
   ==========================================
   - Coverage   82.31%   82.30%   -0.01%     
   ==========================================
     Files         168      168              
     Lines       49031    49022       -9     
   ==========================================
   - Hits        40360    40350      -10     
   - Misses       8671     8672       +1     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/1031?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [parquet/src/arrow/array\_reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/1031/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyYXlfcmVhZGVyLnJz) | `76.66% <93.75%> (-0.10%)` | :arrow_down: |
   | [arrow/src/datatypes/datatype.rs](https://codecov.io/gh/apache/arrow-rs/pull/1031/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2RhdGF0eXBlcy9kYXRhdHlwZS5ycw==) | `65.95% <0.00%> (-0.43%)` | :arrow_down: |
   | [arrow/src/datatypes/field.rs](https://codecov.io/gh/apache/arrow-rs/pull/1031/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2RhdGF0eXBlcy9maWVsZC5ycw==) | `53.37% <0.00%> (-0.31%)` | :arrow_down: |
   | [parquet\_derive/src/parquet\_field.rs](https://codecov.io/gh/apache/arrow-rs/pull/1031/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldF9kZXJpdmUvc3JjL3BhcnF1ZXRfZmllbGQucnM=) | `65.98% <0.00%> (-0.23%)` | :arrow_down: |
   | [arrow/src/array/transform/mod.rs](https://codecov.io/gh/apache/arrow-rs/pull/1031/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L3RyYW5zZm9ybS9tb2QucnM=) | `85.24% <0.00%> (+0.13%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1031?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1031?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [239cba1...75920d0](https://codecov.io/gh/apache/arrow-rs/pull/1031?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on a change in pull request #1031: Extract method to drive PageIterator -> RecordReader

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #1031:
URL: https://github.com/apache/arrow-rs/pull/1031#discussion_r768125155



##########
File path: parquet/src/arrow/array_reader.rs
##########
@@ -100,6 +100,35 @@ pub trait ArrayReader {
     fn get_rep_levels(&self) -> Option<&[i16]>;
 }
 
+/// Uses `record_reader` to read up to `batch_size` records from `pages`
+///
+/// Returns the number of records read

Review comment:
       ```suggestion
   /// Returns the number of records read, which can be less than batch_size if 
   /// pages is exhausted. 
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] sunchao commented on pull request #1031: Extract method to drive PageIterator -> RecordReader

Posted by GitBox <gi...@apache.org>.
sunchao commented on pull request #1031:
URL: https://github.com/apache/arrow-rs/pull/1031#issuecomment-993919226


   Merged, thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org