You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/12/11 19:49:02 UTC
[GitHub] [arrow-rs] tustvold opened a new pull request #1031: Extract method to drive PageIterator -> RecordReader
tustvold opened a new pull request #1031:
URL: https://github.com/apache/arrow-rs/pull/1031
# Which issue does this PR close?
Relates to #171
# Rationale for this change
The logic for driving a `RecordReader` from a `PageIterator` is currently duplicated in `PrimitiveArrayReader` and `NullArrayReader`. This duplication will only increase with new `ArrayReader` implementations added as part of #171
# What changes are included in this PR?
Extracts a read_records function to handle this logic
# Are there any user-facing changes?
This no longer eagerly populates the `RecordReader` with the first `PageReader` from the `PageIterator`. If for example the first page of the first row group contained a compression codec that was not supported, this would currently error in the constructor, whereas with this change it will error in `ArrayReader::next_batch`. I personally think this makes more sense. FWIW I think this is the only possible error case and so it is unlikely user-code would be impacted...
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] codecov-commenter commented on pull request #1031: Extract method to drive PageIterator -> RecordReader
Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on pull request #1031:
URL: https://github.com/apache/arrow-rs/pull/1031#issuecomment-991767524
# [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1031?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
> Merging [#1031](https://codecov.io/gh/apache/arrow-rs/pull/1031?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (75920d0) into [master](https://codecov.io/gh/apache/arrow-rs/commit/239cba141cb27519b7c32d58a3ea6447fda31d11?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (239cba1) will **decrease** coverage by `0.00%`.
> The diff coverage is `93.75%`.
[![Impacted file tree graph](https://codecov.io/gh/apache/arrow-rs/pull/1031/graphs/tree.svg?width=650&height=150&src=pr&token=pq9V9qWZ1N&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-rs/pull/1031?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
```diff
@@ Coverage Diff @@
## master #1031 +/- ##
==========================================
- Coverage 82.31% 82.30% -0.01%
==========================================
Files 168 168
Lines 49031 49022 -9
==========================================
- Hits 40360 40350 -10
- Misses 8671 8672 +1
```
| [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/1031?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
|---|---|---|
| [parquet/src/arrow/array\_reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/1031/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyYXlfcmVhZGVyLnJz) | `76.66% <93.75%> (-0.10%)` | :arrow_down: |
| [arrow/src/datatypes/datatype.rs](https://codecov.io/gh/apache/arrow-rs/pull/1031/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2RhdGF0eXBlcy9kYXRhdHlwZS5ycw==) | `65.95% <0.00%> (-0.43%)` | :arrow_down: |
| [arrow/src/datatypes/field.rs](https://codecov.io/gh/apache/arrow-rs/pull/1031/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2RhdGF0eXBlcy9maWVsZC5ycw==) | `53.37% <0.00%> (-0.31%)` | :arrow_down: |
| [parquet\_derive/src/parquet\_field.rs](https://codecov.io/gh/apache/arrow-rs/pull/1031/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldF9kZXJpdmUvc3JjL3BhcnF1ZXRfZmllbGQucnM=) | `65.98% <0.00%> (-0.23%)` | :arrow_down: |
| [arrow/src/array/transform/mod.rs](https://codecov.io/gh/apache/arrow-rs/pull/1031/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L3RyYW5zZm9ybS9tb2QucnM=) | `85.24% <0.00%> (+0.13%)` | :arrow_up: |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1031?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1031?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [239cba1...75920d0](https://codecov.io/gh/apache/arrow-rs/pull/1031?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] tustvold commented on a change in pull request #1031: Extract method to drive PageIterator -> RecordReader
Posted by GitBox <gi...@apache.org>.
tustvold commented on a change in pull request #1031:
URL: https://github.com/apache/arrow-rs/pull/1031#discussion_r768920834
##########
File path: parquet/src/arrow/array_reader.rs
##########
@@ -100,6 +100,36 @@ pub trait ArrayReader {
fn get_rep_levels(&self) -> Option<&[i16]>;
}
+/// Uses `record_reader` to read up to `batch_size` records from `pages`
+///
+/// Returns the number of records read, which can be less than batch_size if
+/// pages is exhausted.
+fn read_records<T: DataType>(
+ record_reader: &mut RecordReader<T>,
+ pages: &mut dyn PageIterator,
+ batch_size: usize,
+) -> Result<usize> {
+ let mut records_read = 0usize;
+ while records_read < batch_size {
+ let records_to_read = batch_size - records_read;
+
+ let records_read_once = record_reader.read_records(records_to_read)?;
+ records_read += records_read_once;
+
+ // Record reader exhausted
+ if records_read_once < records_to_read {
+ if let Some(page_reader) = pages.next() {
+ // Read from new page reader (i.e. column chunk)
+ record_reader.set_page_reader(page_reader?)?;
Review comment:
If we just called reset here, we would lose data. But we definitely could delimit record batches, i.e. don't buffer data across column chunk boundaries. This would be breaking change though
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] sunchao commented on a change in pull request #1031: Extract method to drive PageIterator -> RecordReader
Posted by GitBox <gi...@apache.org>.
sunchao commented on a change in pull request #1031:
URL: https://github.com/apache/arrow-rs/pull/1031#discussion_r768918989
##########
File path: parquet/src/arrow/array_reader.rs
##########
@@ -100,6 +100,36 @@ pub trait ArrayReader {
fn get_rep_levels(&self) -> Option<&[i16]>;
}
+/// Uses `record_reader` to read up to `batch_size` records from `pages`
+///
+/// Returns the number of records read, which can be less than batch_size if
+/// pages is exhausted.
+fn read_records<T: DataType>(
+ record_reader: &mut RecordReader<T>,
+ pages: &mut dyn PageIterator,
+ batch_size: usize,
+) -> Result<usize> {
+ let mut records_read = 0usize;
+ while records_read < batch_size {
+ let records_to_read = batch_size - records_read;
+
+ let records_read_once = record_reader.read_records(records_to_read)?;
+ records_read += records_read_once;
+
+ // Record reader exhausted
+ if records_read_once < records_to_read {
+ if let Some(page_reader) = pages.next() {
+ // Read from new page reader (i.e. column chunk)
+ record_reader.set_page_reader(page_reader?)?;
Review comment:
Unrelated, but I'm thinking whether we can reset the `record_reader` upon new column chunk, so that we don't have to keep accumulating buffers for def & rep levels, and values.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] sunchao merged pull request #1031: Extract method to drive PageIterator -> RecordReader
Posted by GitBox <gi...@apache.org>.
sunchao merged pull request #1031:
URL: https://github.com/apache/arrow-rs/pull/1031
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] sunchao commented on a change in pull request #1031: Extract method to drive PageIterator -> RecordReader
Posted by GitBox <gi...@apache.org>.
sunchao commented on a change in pull request #1031:
URL: https://github.com/apache/arrow-rs/pull/1031#discussion_r768925684
##########
File path: parquet/src/arrow/array_reader.rs
##########
@@ -100,6 +100,36 @@ pub trait ArrayReader {
fn get_rep_levels(&self) -> Option<&[i16]>;
}
+/// Uses `record_reader` to read up to `batch_size` records from `pages`
+///
+/// Returns the number of records read, which can be less than batch_size if
+/// pages is exhausted.
+fn read_records<T: DataType>(
+ record_reader: &mut RecordReader<T>,
+ pages: &mut dyn PageIterator,
+ batch_size: usize,
+) -> Result<usize> {
+ let mut records_read = 0usize;
+ while records_read < batch_size {
+ let records_to_read = batch_size - records_read;
+
+ let records_read_once = record_reader.read_records(records_to_read)?;
+ records_read += records_read_once;
+
+ // Record reader exhausted
+ if records_read_once < records_to_read {
+ if let Some(page_reader) = pages.next() {
+ // Read from new page reader (i.e. column chunk)
+ record_reader.set_page_reader(page_reader?)?;
Review comment:
Got it, thanks!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] codecov-commenter commented on pull request #1031: Extract method to drive PageIterator -> RecordReader
Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on pull request #1031:
URL: https://github.com/apache/arrow-rs/pull/1031#issuecomment-991767524
# [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1031?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
> Merging [#1031](https://codecov.io/gh/apache/arrow-rs/pull/1031?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (75920d0) into [master](https://codecov.io/gh/apache/arrow-rs/commit/239cba141cb27519b7c32d58a3ea6447fda31d11?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (239cba1) will **decrease** coverage by `0.00%`.
> The diff coverage is `93.75%`.
[![Impacted file tree graph](https://codecov.io/gh/apache/arrow-rs/pull/1031/graphs/tree.svg?width=650&height=150&src=pr&token=pq9V9qWZ1N&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-rs/pull/1031?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
```diff
@@ Coverage Diff @@
## master #1031 +/- ##
==========================================
- Coverage 82.31% 82.30% -0.01%
==========================================
Files 168 168
Lines 49031 49022 -9
==========================================
- Hits 40360 40350 -10
- Misses 8671 8672 +1
```
| [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/1031?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
|---|---|---|
| [parquet/src/arrow/array\_reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/1031/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyYXlfcmVhZGVyLnJz) | `76.66% <93.75%> (-0.10%)` | :arrow_down: |
| [arrow/src/datatypes/datatype.rs](https://codecov.io/gh/apache/arrow-rs/pull/1031/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2RhdGF0eXBlcy9kYXRhdHlwZS5ycw==) | `65.95% <0.00%> (-0.43%)` | :arrow_down: |
| [arrow/src/datatypes/field.rs](https://codecov.io/gh/apache/arrow-rs/pull/1031/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2RhdGF0eXBlcy9maWVsZC5ycw==) | `53.37% <0.00%> (-0.31%)` | :arrow_down: |
| [parquet\_derive/src/parquet\_field.rs](https://codecov.io/gh/apache/arrow-rs/pull/1031/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldF9kZXJpdmUvc3JjL3BhcnF1ZXRfZmllbGQucnM=) | `65.98% <0.00%> (-0.23%)` | :arrow_down: |
| [arrow/src/array/transform/mod.rs](https://codecov.io/gh/apache/arrow-rs/pull/1031/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L3RyYW5zZm9ybS9tb2QucnM=) | `85.24% <0.00%> (+0.13%)` | :arrow_up: |
------
[Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1031?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
> **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
> `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
> Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1031?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [239cba1...75920d0](https://codecov.io/gh/apache/arrow-rs/pull/1031?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] alamb commented on a change in pull request #1031: Extract method to drive PageIterator -> RecordReader
Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #1031:
URL: https://github.com/apache/arrow-rs/pull/1031#discussion_r768125155
##########
File path: parquet/src/arrow/array_reader.rs
##########
@@ -100,6 +100,35 @@ pub trait ArrayReader {
fn get_rep_levels(&self) -> Option<&[i16]>;
}
+/// Uses `record_reader` to read up to `batch_size` records from `pages`
+///
+/// Returns the number of records read
Review comment:
```suggestion
/// Returns the number of records read, which can be less than batch_size if
/// pages is exhausted.
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] sunchao commented on pull request #1031: Extract method to drive PageIterator -> RecordReader
Posted by GitBox <gi...@apache.org>.
sunchao commented on pull request #1031:
URL: https://github.com/apache/arrow-rs/pull/1031#issuecomment-993919226
Merged, thanks!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org