You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/07/25 06:58:15 UTC

[GitHub] [arrow-rs] Ted-Jiang opened a new pull request, #2158: Add integration test for scan rows with selection

Ted-Jiang opened a new pull request, #2158:
URL: https://github.com/apache/arrow-rs/pull/2158

   # Which issue does this PR close?
   
   Closes #2106 .
   
   # Rationale for this change
    
    <!---
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.
   -->
   
   # What changes are included in this PR?
   
   <!---
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are there any user-facing changes?
   
   
   <!---
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!---
   If there are any breaking changes to public APIs, please add the `breaking change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #2158: Add integration test for scan rows with selection

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2158:
URL: https://github.com/apache/arrow-rs/pull/2158#discussion_r930040710


##########
parquet/src/arrow/array_reader/byte_array.rs:
##########
@@ -120,6 +120,15 @@ impl<I: OffsetSizeTrait + ScalarValue> ArrayReader for ByteArrayReader<I> {
     }
 
     fn skip_records(&mut self, num_records: usize) -> Result<usize> {
+        if self.record_reader.column_reader().is_none() {

Review Comment:
   This now behaves differently from next_batch which will potentially read from multiple column chunks for the same "batch". Can we extract this logic into a free function, similar to read_records, that performs the same loop?
   
   This would also avoid duplicating this code in every ArrayReader



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #2158: Add integration test for scan rows with selection

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #2158:
URL: https://github.com/apache/arrow-rs/pull/2158#discussion_r928559983


##########
parquet/src/arrow/record_reader/mod.rs:
##########
@@ -184,11 +185,24 @@ where
     /// # Returns
     ///
     /// Number of records skipped
-    pub fn skip_records(&mut self, num_records: usize) -> Result<usize> {
+    pub fn skip_records(
+        &mut self,
+        num_records: usize,
+        pages: &mut dyn PageIterator,
+    ) -> Result<usize> {
         // First need to clear the buffer
         let end_of_column = match self.column_reader.as_mut() {
             Some(reader) => !reader.has_next()?,
-            None => return Ok(0),
+            None => {
+                // If we skip records before all read operation
+                // we need set `column_reader` by `set_page_reader`
+                if let Some(page_reader) = pages.next() {

Review Comment:
   Fix skip before all read operator, need set column_reader



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #2158: Add integration test for scan rows with selection

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #2158:
URL: https://github.com/apache/arrow-rs/pull/2158#discussion_r930574164


##########
parquet/src/arrow/arrow_reader.rs:
##########
@@ -298,11 +307,15 @@ impl Iterator for ParquetRecordBatchReader {
                     continue;
                 }
 
+                // try to read record
                 let to_read = match front.row_count.checked_sub(self.batch_size) {
-                    Some(remaining) => {
-                        selection.push_front(RowSelection::skip(remaining));
+                    Some(remaining) if remaining != 0 => {
+                        // if page row count less than batch_size we must set batch size to page row count.
+                        // add check avoid dead loop
+                        selection.push_front(RowSelection::select(remaining));
                         self.batch_size
                     }
+                    Some(_) => self.batch_size,

Review Comment:
   yes more elegance 👍



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #2158: Add integration test for scan rows with selection

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #2158:
URL: https://github.com/apache/arrow-rs/pull/2158#discussion_r930576465


##########
parquet/src/column/reader.rs:
##########
@@ -312,13 +312,20 @@ where
 
                 // If page has less rows than the remaining records to
                 // be skipped, skip entire page
-                if metadata.num_rows < remaining {
+                while metadata.num_rows < remaining {

Review Comment:
   Oh... it's an useless loop



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] ursabot commented on pull request #2158: Add integration test for scan rows with selection

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #2158:
URL: https://github.com/apache/arrow-rs/pull/2158#issuecomment-1196491741

   Benchmark runs are scheduled for baseline = e96ae8a70a6a006779798153934ca371c2884742 and contender = d10d962ac85a9675c511ff8f783d1b455588d220. d10d962ac85a9675c511ff8f783d1b455588d220 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/3076540d13e64b9f800334af0bc2d86f...82117b4003a1424090804acb45d5d109/)
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/d16e45054c1641a3b32cb3c54fbd2d39...ce051df236da44d699bda5e61e295efe/)
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/58ee6c352f684af8961c04d7bfd35721...3d5ec937b2324944883de3ac32d7e1c0/)
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/0f39d3c784754ee3a8562ca0be367781...ab2b55850da14c3dab07956e864bbb2e/)
   Buildkite builds:
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #2158: Add integration test for scan rows with selection

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2158:
URL: https://github.com/apache/arrow-rs/pull/2158#discussion_r930059108


##########
parquet/src/column/reader.rs:
##########
@@ -312,13 +312,20 @@ where
 
                 // If page has less rows than the remaining records to
                 // be skipped, skip entire page
-                if metadata.num_rows < remaining {
+                while metadata.num_rows < remaining {
                     self.page_reader.skip_next_page()?;
                     remaining -= metadata.num_rows;
-                    continue;
+                    metadata = match self.page_reader.peek_next_page()? {
+                        None => return Ok(num_records - remaining),
+                        Some(metadata) => metadata,
+                    };
                 }
+                // because self.num_buffered_values == self.num_decoded_values means
+                // we need reads a new page and set up the decoders for levels
+                self.read_new_page()?;

Review Comment:
   Perhaps we could check the return type of this, and short-circuit if it returns false?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on pull request #2158: Add integration test for scan rows with selection

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on PR #2158:
URL: https://github.com/apache/arrow-rs/pull/2158#issuecomment-1196584820

   @tustvold thanks for your patient review 👍


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #2158: Add integration test for scan rows with selection

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2158:
URL: https://github.com/apache/arrow-rs/pull/2158#discussion_r930055371


##########
parquet/src/column/reader.rs:
##########
@@ -312,13 +312,20 @@ where
 
                 // If page has less rows than the remaining records to
                 // be skipped, skip entire page
-                if metadata.num_rows < remaining {
+                while metadata.num_rows < remaining {
                     self.page_reader.skip_next_page()?;
                     remaining -= metadata.num_rows;
-                    continue;
+                    metadata = match self.page_reader.peek_next_page()? {
+                        None => return Ok(num_records - remaining),
+                        Some(metadata) => metadata,
+                    };
                 }
+                // because self.num_buffered_values == self.num_decoded_values means
+                // we need reads a new page and set up the decoders for levels
+                self.read_new_page()?;

Review Comment:
   I'm confused why this is necessary. The first thing `read_batch` does is call `has_next` which will call `read_new_page` if `self.num_buffered_values == self.num_decoded_values`? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] codecov-commenter commented on pull request #2158: Add integration test for scan rows with selection

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on PR #2158:
URL: https://github.com/apache/arrow-rs/pull/2158#issuecomment-1193681120

   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/2158?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#2158](https://codecov.io/gh/apache/arrow-rs/pull/2158?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (4b27280) into [master](https://codecov.io/gh/apache/arrow-rs/commit/1621c713d724b0cd4aabccfa3243714789283df5?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (1621c71) will **increase** coverage by `0.16%`.
   > The diff coverage is `93.83%`.
   
   > :exclamation: Current head 4b27280 differs from pull request most recent head 7153bee. Consider uploading reports for the commit 7153bee to get more accurate results
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #2158      +/-   ##
   ==========================================
   + Coverage   82.85%   83.02%   +0.16%     
   ==========================================
     Files         237      237              
     Lines       61381    61558     +177     
   ==========================================
   + Hits        50856    51107     +251     
   + Misses      10525    10451      -74     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/2158?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [...et/src/arrow/array\_reader/byte\_array\_dictionary.rs](https://codecov.io/gh/apache/arrow-rs/pull/2158/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyYXlfcmVhZGVyL2J5dGVfYXJyYXlfZGljdGlvbmFyeS5ycw==) | `86.97% <0.00%> (ø)` | |
   | [parquet/src/arrow/array\_reader/null\_array.rs](https://codecov.io/gh/apache/arrow-rs/pull/2158/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyYXlfcmVhZGVyL251bGxfYXJyYXkucnM=) | `81.48% <0.00%> (ø)` | |
   | [...uet/src/arrow/array\_reader/complex\_object\_array.rs](https://codecov.io/gh/apache/arrow-rs/pull/2158/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyYXlfcmVhZGVyL2NvbXBsZXhfb2JqZWN0X2FycmF5LnJz) | `94.02% <66.66%> (+0.82%)` | :arrow_up: |
   | [parquet/src/arrow/record\_reader/mod.rs](https://codecov.io/gh/apache/arrow-rs/pull/2158/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvcmVjb3JkX3JlYWRlci9tb2QucnM=) | `93.95% <66.66%> (+4.35%)` | :arrow_up: |
   | [parquet/src/column/reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/2158/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvY29sdW1uL3JlYWRlci5ycw==) | `68.79% <83.33%> (+5.92%)` | :arrow_up: |
   | [parquet/src/arrow/arrow\_reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/2158/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyb3dfcmVhZGVyLnJz) | `95.94% <97.63%> (+3.17%)` | :arrow_up: |
   | [parquet/src/arrow/array\_reader/byte\_array.rs](https://codecov.io/gh/apache/arrow-rs/pull/2158/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyYXlfcmVhZGVyL2J5dGVfYXJyYXkucnM=) | `86.32% <100.00%> (+0.53%)` | :arrow_up: |
   | [parquet/src/arrow/array\_reader/primitive\_array.rs](https://codecov.io/gh/apache/arrow-rs/pull/2158/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyYXlfcmVhZGVyL3ByaW1pdGl2ZV9hcnJheS5ycw==) | `90.17% <100.00%> (+1.15%)` | :arrow_up: |
   | [arrow/src/array/iterator.rs](https://codecov.io/gh/apache/arrow-rs/pull/2158/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2l0ZXJhdG9yLnJz) | `86.45% <0.00%> (-13.55%)` | :arrow_down: |
   | [arrow/src/util/decimal.rs](https://codecov.io/gh/apache/arrow-rs/pull/2158/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL3V0aWwvZGVjaW1hbC5ycw==) | `86.92% <0.00%> (-4.59%)` | :arrow_down: |
   | ... and [16 more](https://codecov.io/gh/apache/arrow-rs/pull/2158/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   Help us with your feedback. Take ten seconds to tell us [how you rate us](https://about.codecov.io/nps?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on pull request #2158: Add integration test for scan rows with selection

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on PR #2158:
URL: https://github.com/apache/arrow-rs/pull/2158#issuecomment-1193767370

   @tustvold @thinkharderdev  PTAL😊


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #2158: Add integration test for scan rows with selection

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #2158:
URL: https://github.com/apache/arrow-rs/pull/2158#discussion_r929489261


##########
parquet/src/column/reader.rs:
##########
@@ -312,13 +312,20 @@ where
 
                 // If page has less rows than the remaining records to
                 // be skipped, skip entire page
-                if metadata.num_rows < remaining {
+                while metadata.num_rows < remaining {

Review Comment:
   because first add below
   ```
   // because self.num_buffered_values == self.num_decoded_values means
   // we need reads a new page and set up the decoders for levels
     self.read_new_page()?;
   ```
   if we still use `if`, we may read needless page header



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on pull request #2158: Add integration test for scan rows with selection

Posted by GitBox <gi...@apache.org>.
tustvold commented on PR #2158:
URL: https://github.com/apache/arrow-rs/pull/2158#issuecomment-1196813925

   FYI I'm working on a follow up PR to address some stuff, e.g. get this integrated into the fuzz tests


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #2158: Add integration test for scan rows with selection

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2158:
URL: https://github.com/apache/arrow-rs/pull/2158#discussion_r928969472


##########
parquet/src/arrow/record_reader/mod.rs:
##########
@@ -184,11 +185,24 @@ where
     /// # Returns
     ///
     /// Number of records skipped
-    pub fn skip_records(&mut self, num_records: usize) -> Result<usize> {
+    pub fn skip_records(
+        &mut self,
+        num_records: usize,
+        pages: &mut dyn PageIterator,
+    ) -> Result<usize> {
         // First need to clear the buffer
         let end_of_column = match self.column_reader.as_mut() {
             Some(reader) => !reader.has_next()?,
-            None => return Ok(0),
+            None => {
+                // If we skip records before all read operation
+                // we need set `column_reader` by `set_page_reader`
+                if let Some(page_reader) = pages.next() {
+                    self.set_page_reader(page_reader?)?;
+                    false

Review Comment:
   This is wrong, as it will now only mark end_of_column when it reaches the end of the file, instead of the end of a column chunk within a row group. This will break record delimiting for repeated fields.



##########
parquet/src/column/reader.rs:
##########
@@ -312,13 +312,20 @@ where
 
                 // If page has less rows than the remaining records to
                 // be skipped, skip entire page
-                if metadata.num_rows < remaining {
+                while metadata.num_rows < remaining {

Review Comment:
   Why is this necessary, there is already an outer while loop?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #2158: Add integration test for scan rows with selection

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2158:
URL: https://github.com/apache/arrow-rs/pull/2158#discussion_r928965513


##########
parquet/src/arrow/record_reader/mod.rs:
##########
@@ -184,11 +185,24 @@ where
     /// # Returns
     ///
     /// Number of records skipped
-    pub fn skip_records(&mut self, num_records: usize) -> Result<usize> {
+    pub fn skip_records(
+        &mut self,
+        num_records: usize,
+        pages: &mut dyn PageIterator,

Review Comment:
   I would prefer we handle this outside RecordReader in the same way we have a free function for read_records. Otherwise this ends up with a somewhat confused API



##########
parquet/src/arrow/arrow_reader.rs:
##########
@@ -298,9 +307,14 @@ impl Iterator for ParquetRecordBatchReader {
                     continue;
                 }
 
+                // try to read record
                 let to_read = match front.row_count.checked_sub(self.batch_size) {
                     Some(remaining) => {
-                        selection.push_front(RowSelection::skip(remaining));
+                        // if page row count less than batch_size we must set batch size to page row count.
+                        // add check avoid dead loop
+                        if remaining != 0 {

Review Comment:
   You could inline this into the match expression 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #2158: Add integration test for scan rows with selection

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #2158:
URL: https://github.com/apache/arrow-rs/pull/2158#discussion_r929886358


##########
parquet/src/arrow/record_reader/mod.rs:
##########
@@ -184,11 +185,24 @@ where
     /// # Returns
     ///
     /// Number of records skipped
-    pub fn skip_records(&mut self, num_records: usize) -> Result<usize> {
+    pub fn skip_records(
+        &mut self,
+        num_records: usize,
+        pages: &mut dyn PageIterator,
+    ) -> Result<usize> {
         // First need to clear the buffer
         let end_of_column = match self.column_reader.as_mut() {
             Some(reader) => !reader.has_next()?,
-            None => return Ok(0),
+            None => {
+                // If we skip records before all read operation
+                // we need set `column_reader` by `set_page_reader`
+                if let Some(page_reader) = pages.next() {
+                    self.set_page_reader(page_reader?)?;
+                    false

Review Comment:
   @tustvold i move it out to
   ```
    fn skip_records(&mut self, num_records: usize) -> Result<usize> {
           if self.record_reader.column_reader().is_none() {
               // If we skip records before all read operation
               // we need set `column_reader` by `set_page_reader`
               if let Some(page_reader) = self.pages.next() {
                   self.record_reader.set_page_reader(page_reader?)?;
               } else {
                   return Ok(0);
               }
           }
           self.record_reader.skip_records(num_records)
       }
   ```
   
   I think in this situation , only skip the first page without read any record the `column_reader` is none. related #2171 if 
   we create it in colchunk, then we will remove this check.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #2158: Add integration test for scan rows with selection

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #2158:
URL: https://github.com/apache/arrow-rs/pull/2158#discussion_r930574164


##########
parquet/src/arrow/arrow_reader.rs:
##########
@@ -298,11 +307,15 @@ impl Iterator for ParquetRecordBatchReader {
                     continue;
                 }
 
+                // try to read record
                 let to_read = match front.row_count.checked_sub(self.batch_size) {
-                    Some(remaining) => {
-                        selection.push_front(RowSelection::skip(remaining));
+                    Some(remaining) if remaining != 0 => {
+                        // if page row count less than batch_size we must set batch size to page row count.
+                        // add check avoid dead loop
+                        selection.push_front(RowSelection::select(remaining));
                         self.batch_size
                     }
+                    Some(_) => self.batch_size,

Review Comment:
   yes more elegance



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #2158: Add integration test for scan rows with selection

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #2158:
URL: https://github.com/apache/arrow-rs/pull/2158#discussion_r930572458


##########
parquet/src/column/reader.rs:
##########
@@ -312,13 +312,20 @@ where
 
                 // If page has less rows than the remaining records to
                 // be skipped, skip entire page
-                if metadata.num_rows < remaining {
+                while metadata.num_rows < remaining {

Review Comment:
   oh.... sorry it's an useless loop



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold merged pull request #2158: Add integration test for scan rows with selection

Posted by GitBox <gi...@apache.org>.
tustvold merged PR #2158:
URL: https://github.com/apache/arrow-rs/pull/2158


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #2158: Add integration test for scan rows with selection

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #2158:
URL: https://github.com/apache/arrow-rs/pull/2158#discussion_r928559403


##########
parquet/src/arrow/arrow_reader.rs:
##########
@@ -298,9 +307,14 @@ impl Iterator for ParquetRecordBatchReader {
                     continue;
                 }
 
+                // try to read record
                 let to_read = match front.row_count.checked_sub(self.batch_size) {
                     Some(remaining) => {
-                        selection.push_front(RowSelection::skip(remaining));
+                        // if page row count less than batch_size we must set batch size to page row count.
+                        // add check avoid dead loop

Review Comment:
   Fix wrong logic, `remaining` record need read



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #2158: Add integration test for scan rows with selection

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2158:
URL: https://github.com/apache/arrow-rs/pull/2158#discussion_r930042383


##########
parquet/src/arrow/arrow_reader.rs:
##########
@@ -298,11 +307,15 @@ impl Iterator for ParquetRecordBatchReader {
                     continue;
                 }
 
+                // try to read record
                 let to_read = match front.row_count.checked_sub(self.batch_size) {
-                    Some(remaining) => {
-                        selection.push_front(RowSelection::skip(remaining));
+                    Some(remaining) if remaining != 0 => {
+                        // if page row count less than batch_size we must set batch size to page row count.
+                        // add check avoid dead loop
+                        selection.push_front(RowSelection::select(remaining));
                         self.batch_size
                     }
+                    Some(_) => self.batch_size,

Review Comment:
   ```suggestion
                       _ => self.batch_size,
   ```
   
   And remove the `None` case below. If `remaining == 0` then `front.row_count == self.batch_size`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #2158: Add integration test for scan rows with selection

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2158:
URL: https://github.com/apache/arrow-rs/pull/2158#discussion_r930040710


##########
parquet/src/arrow/array_reader/byte_array.rs:
##########
@@ -120,6 +120,15 @@ impl<I: OffsetSizeTrait + ScalarValue> ArrayReader for ByteArrayReader<I> {
     }
 
     fn skip_records(&mut self, num_records: usize) -> Result<usize> {
+        if self.record_reader.column_reader().is_none() {

Review Comment:
   This now behaves differently from next_batch which will potentially read from multiple column chunks for the same "batch". Can we extract this logic into a free function, similar to read_records, that performs the same loop?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #2158: Add integration test for scan rows with selection

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2158:
URL: https://github.com/apache/arrow-rs/pull/2158#discussion_r930047286


##########
parquet/src/column/reader.rs:
##########
@@ -312,13 +312,20 @@ where
 
                 // If page has less rows than the remaining records to
                 // be skipped, skip entire page
-                if metadata.num_rows < remaining {
+                while metadata.num_rows < remaining {

Review Comment:
   This while loop should result in the same behaviour as the previous `continue`??



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #2158: Add integration test for scan rows with selection

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #2158:
URL: https://github.com/apache/arrow-rs/pull/2158#discussion_r930572458


##########
parquet/src/column/reader.rs:
##########
@@ -312,13 +312,20 @@ where
 
                 // If page has less rows than the remaining records to
                 // be skipped, skip entire page
-                if metadata.num_rows < remaining {
+                while metadata.num_rows < remaining {

Review Comment:
   oh.... sorry it's an useless loop



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on pull request #2158: Add integration test for scan rows with selection

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #2158:
URL: https://github.com/apache/arrow-rs/pull/2158#issuecomment-1194358764

   Thank you @Ted-Jiang  -- the project to add page index and skipping is really coming along very nicely. It is a very nice piece of work.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org