You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/07/03 02:24:54 UTC

[GitHub] [arrow-rs] tustvold opened a new pull request, #1998: Stub out Skip Records API (#1792)

tustvold opened a new pull request, #1998:
URL: https://github.com/apache/arrow-rs/pull/1998

   _Draft as builds on #1995_
   
   # Which issue does this PR close?
   
   Part of #1792
   
   # Rationale for this change
    
   Stubs out an API for providing skip records functionality within parquet. I think this will work to support #1792, #1191 and potentially other functionality down the line.
   
   Let me know what you think @Ted-Jiang @sunchao 
   
   # What changes are included in this PR?
   
   Stubs out APIs for adding row skipping logic to the parquet implementation
   
   # Are there any user-facing changes?
   
   No :tada:
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #1998: Stub out Skip Records API (#1792)

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #1998:
URL: https://github.com/apache/arrow-rs/pull/1998#discussion_r912608716


##########
parquet/src/arrow/arrow_reader.rs:
##########
@@ -222,13 +271,47 @@ pub struct ParquetRecordBatchReader {
     batch_size: usize,
     array_reader: Box<dyn ArrayReader>,
     schema: SchemaRef,
+    selection: Option<VecDeque<RowSelection>>,
 }
 
 impl Iterator for ParquetRecordBatchReader {
     type Item = ArrowResult<RecordBatch>;
 
     fn next(&mut self) -> Option<Self::Item> {
-        match self.array_reader.next_batch(self.batch_size) {
+        let to_read = match self.selection.as_mut() {

Review Comment:
   ๐Ÿ‘ pass mask here not each col is more reasonable ๐Ÿ˜‚



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1998: Stub out Skip Records API (#1792)

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1998:
URL: https://github.com/apache/arrow-rs/pull/1998#discussion_r915218036


##########
parquet/src/arrow/arrow_reader.rs:
##########
@@ -70,9 +71,39 @@ pub trait ArrowReader {
     ) -> Result<Self::RecordReader>;
 }
 
+/// [`RowSelection`] allows selecting or skipping a provided number of rows
+/// when scanning the parquet file
+#[derive(Debug, Clone, Copy)]
+pub(crate) struct RowSelection {

Review Comment:
   See with_row_selection which takes a Vec to allow for this use-case



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] codecov-commenter commented on pull request #1998: Stub out Skip Records API (#1792)

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on PR #1998:
URL: https://github.com/apache/arrow-rs/pull/1998#issuecomment-1173000731

   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1998?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#1998](https://codecov.io/gh/apache/arrow-rs/pull/1998?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (0071931) into [master](https://codecov.io/gh/apache/arrow-rs/commit/7ae97c9cebabf757275e41587fe4e3dbdf38d4ab?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (7ae97c9) will **decrease** coverage by `0.18%`.
   > The diff coverage is `36.69%`.
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #1998      +/-   ##
   ==========================================
   - Coverage   83.58%   83.39%   -0.19%     
   ==========================================
     Files         222      222              
     Lines       57495    57602     +107     
   ==========================================
   - Hits        48056    48037      -19     
   - Misses       9439     9565     +126     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/1998?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage ฮ” | |
   |---|---|---|
   | [...uet/src/arrow/array\_reader/complex\_object\_array.rs](https://codecov.io/gh/apache/arrow-rs/pull/1998/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyYXlfcmVhZGVyL2NvbXBsZXhfb2JqZWN0X2FycmF5LnJz) | `93.20% <0.00%> (-1.07%)` | :arrow_down: |
   | [parquet/src/arrow/array\_reader/empty\_array.rs](https://codecov.io/gh/apache/arrow-rs/pull/1998/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyYXlfcmVhZGVyL2VtcHR5X2FycmF5LnJz) | `45.45% <0.00%> (-10.11%)` | :arrow_down: |
   | [parquet/src/arrow/array\_reader/list\_array.rs](https://codecov.io/gh/apache/arrow-rs/pull/1998/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyYXlfcmVhZGVyL2xpc3RfYXJyYXkucnM=) | `92.69% <0.00%> (-0.72%)` | :arrow_down: |
   | [parquet/src/arrow/array\_reader/map\_array.rs](https://codecov.io/gh/apache/arrow-rs/pull/1998/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyYXlfcmVhZGVyL21hcF9hcnJheS5ycw==) | `58.82% <0.00%> (-8.98%)` | :arrow_down: |
   | [parquet/src/arrow/array\_reader/mod.rs](https://codecov.io/gh/apache/arrow-rs/pull/1998/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyYXlfcmVhZGVyL21vZC5ycw==) | `88.23% <รธ> (รธ)` | |
   | [parquet/src/arrow/array\_reader/struct\_array.rs](https://codecov.io/gh/apache/arrow-rs/pull/1998/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyYXlfcmVhZGVyL3N0cnVjdF9hcnJheS5ycw==) | `78.99% <0.00%> (-9.69%)` | :arrow_down: |
   | [parquet/src/arrow/array\_reader/test\_util.rs](https://codecov.io/gh/apache/arrow-rs/pull/1998/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyYXlfcmVhZGVyL3Rlc3RfdXRpbC5ycw==) | `77.63% <0.00%> (-3.20%)` | :arrow_down: |
   | [parquet/src/arrow/async\_reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/1998/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXN5bmNfcmVhZGVyLnJz) | `0.00% <0.00%> (รธ)` | |
   | [...rquet/src/arrow/record\_reader/definition\_levels.rs](https://codecov.io/gh/apache/arrow-rs/pull/1998/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvcmVjb3JkX3JlYWRlci9kZWZpbml0aW9uX2xldmVscy5ycw==) | `85.71% <0.00%> (-0.54%)` | :arrow_down: |
   | [parquet/src/column/page.rs](https://codecov.io/gh/apache/arrow-rs/pull/1998/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvY29sdW1uL3BhZ2UucnM=) | `98.68% <รธ> (รธ)` | |
   | ... and [16 more](https://codecov.io/gh/apache/arrow-rs/pull/1998/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1998?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `ฮ” = absolute <relative> (impact)`, `รธ = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1998?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [7ae97c9...0071931](https://codecov.io/gh/apache/arrow-rs/pull/1998?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1998: Stub out Skip Records API (#1792)

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1998:
URL: https://github.com/apache/arrow-rs/pull/1998#discussion_r913085131


##########
parquet/src/arrow/record_reader/mod.rs:
##########
@@ -202,6 +203,37 @@ where
         Ok(records_read)
     }
 
+    /// Try to skip the next `num_records` rows
+    ///
+    /// # Returns
+    ///
+    /// Number of records skipped
+    pub fn skip_records(&mut self, num_records: usize) -> Result<usize> {
+        // First need to clear the buffer
+        let (buffered_records, buffered_values) = self.count_records(num_records);
+        self.num_records += buffered_records;
+        self.num_values += buffered_values;
+
+        self.consume_def_levels();
+        self.consume_rep_levels();
+        self.consume_record_data();
+        self.consume_bitmap();
+        self.reset();
+
+        let remaining = buffered_records - num_records;

Review Comment:
   Yes, see other comment. In the case of no repetition levels buffered levels will always be 0, otherwise it will be an arbitrary number based on how much extra data was read in the last call to read_records



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1998: Stub out Skip Records API (#1792)

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1998:
URL: https://github.com/apache/arrow-rs/pull/1998#discussion_r913079262


##########
parquet/src/column/page.rs:
##########
@@ -187,12 +187,29 @@ impl PageWriteSpec {
     }
 }
 
+/// Contains metadata for a page
+pub struct PageMetadata {
+    /// The number of rows in this page
+    pub num_rows: usize,
+
+    /// Returns true if the page is a dictionary page
+    pub is_dict: bool,
+}
+
 /// API for reading pages from a column chunk.
 /// This offers a iterator like API to get the next page.
 pub trait PageReader: Iterator<Item = Result<Page>> + Send {
     /// Gets the next page in the column chunk associated with this reader.
     /// Returns `None` if there are no pages left.
     fn get_next_page(&mut self) -> Result<Option<Page>>;
+
+    /// Gets metadata about the next page, returns an error if no
+    /// column index information

Review Comment:
   I think so



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1998: Stub out Skip Records API (#1792)

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1998:
URL: https://github.com/apache/arrow-rs/pull/1998#discussion_r915204619


##########
parquet/src/arrow/record_reader/definition_levels.rs:
##########
@@ -146,15 +146,15 @@ impl LevelsBufferSlice for DefinitionLevelBuffer {
     }
 }
 
-pub struct DefinitionLevelDecoder {
+pub struct DefinitionLevelBufferDecoder {

Review Comment:
   No it is crate local



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #1998: Stub out Skip Records API (#1792)

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #1998:
URL: https://github.com/apache/arrow-rs/pull/1998#discussion_r912606829


##########
parquet/src/column/page.rs:
##########
@@ -187,12 +187,29 @@ impl PageWriteSpec {
     }
 }
 
+/// Contains metadata for a page
+pub struct PageMetadata {
+    /// The number of rows in this page
+    pub num_rows: usize,
+
+    /// Returns true if the page is a dictionary page
+    pub is_dict: bool,
+}
+
 /// API for reading pages from a column chunk.
 /// This offers a iterator like API to get the next page.
 pub trait PageReader: Iterator<Item = Result<Page>> + Send {
     /// Gets the next page in the column chunk associated with this reader.
     /// Returns `None` if there are no pages left.
     fn get_next_page(&mut self) -> Result<Option<Page>>;
+
+    /// Gets metadata about the next page, returns an error if no
+    /// column index information
+    fn peek_next_page(&self) -> Result<Option<PageMetadata>>;

Review Comment:
   ๐Ÿ‘ really need this abstraction๏ผ



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on pull request #1998: Stub out Skip Records API (#1792)

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on PR #1998:
URL: https://github.com/apache/arrow-rs/pull/1998#issuecomment-1173255383

   cool!  ๐Ÿ‘ @tustvold Are you the Flash ๐Ÿ˜„! i will try to go through this and give your my opinion today. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #1998: Stub out Skip Records API (#1792)

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #1998:
URL: https://github.com/apache/arrow-rs/pull/1998#discussion_r912614816


##########
parquet/src/column/page.rs:
##########
@@ -187,12 +187,29 @@ impl PageWriteSpec {
     }
 }
 
+/// Contains metadata for a page
+pub struct PageMetadata {
+    /// The number of rows in this page
+    pub num_rows: usize,
+
+    /// Returns true if the page is a dictionary page
+    pub is_dict: bool,
+}
+
 /// API for reading pages from a column chunk.
 /// This offers a iterator like API to get the next page.
 pub trait PageReader: Iterator<Item = Result<Page>> + Send {
     /// Gets the next page in the column chunk associated with this reader.
     /// Returns `None` if there are no pages left.
     fn get_next_page(&mut self) -> Result<Option<Page>>;
+
+    /// Gets metadata about the next page, returns an error if no
+    /// column index information

Review Comment:
   Is there we only need `offset index`, without the `min max index`?๐Ÿค”



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #1998: Stub out Skip Records API (#1792)

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #1998:
URL: https://github.com/apache/arrow-rs/pull/1998#discussion_r913328966


##########
parquet/src/arrow/record_reader/mod.rs:
##########
@@ -202,6 +203,37 @@ where
         Ok(records_read)
     }
 
+    /// Try to skip the next `num_records` rows
+    ///
+    /// # Returns
+    ///
+    /// Number of records skipped
+    pub fn skip_records(&mut self, num_records: usize) -> Result<usize> {
+        // First need to clear the buffer
+        let (buffered_records, buffered_values) = self.count_records(num_records);
+        self.num_records += buffered_records;
+        self.num_values += buffered_values;
+
+        self.consume_def_levels();
+        self.consume_rep_levels();
+        self.consume_record_data();
+        self.consume_bitmap();
+        self.reset();
+
+        let remaining = buffered_records - num_records;

Review Comment:
   both `buffered_records ` and `buffered_records` are usize it will overflow,
   and if `remaining ` is negative how to use in ` column_reader.skip_records(remaining)?,` 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1998: Stub out Skip Records API (#1792)

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1998:
URL: https://github.com/apache/arrow-rs/pull/1998#discussion_r913752851


##########
parquet/src/arrow/record_reader/mod.rs:
##########
@@ -202,6 +203,37 @@ where
         Ok(records_read)
     }
 
+    /// Try to skip the next `num_records` rows
+    ///
+    /// # Returns
+    ///
+    /// Number of records skipped
+    pub fn skip_records(&mut self, num_records: usize) -> Result<usize> {
+        // First need to clear the buffer
+        let (buffered_records, buffered_values) = self.count_records(num_records);
+        self.num_records += buffered_records;
+        self.num_values += buffered_values;
+
+        self.consume_def_levels();
+        self.consume_rep_levels();
+        self.consume_record_data();

Review Comment:
   > This is a part of skip, we need to read the rp ,dp to skip some records in the page(maybe have been readed or never readed ).
   
   Yes, this is just to consume the data that has been read to the internal buffers of RecordReader if any
   
   > This also part of skip, remaining > 0, I think this we skip start at a new page
   
   Not necessarily, the only thing RecordReader needs to handle is skipping any data that has already been read from ColumnReader into its own buffers. It can then delegate to ColumnReader to skip the remaining rows, with no requirement that this is done at a page boundary - ColumnReader must be able to handle any case.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #1998: Stub out Skip Records API (#1792)

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #1998:
URL: https://github.com/apache/arrow-rs/pull/1998#discussion_r913790320


##########
parquet/src/arrow/record_reader/mod.rs:
##########
@@ -202,6 +203,37 @@ where
         Ok(records_read)
     }
 
+    /// Try to skip the next `num_records` rows
+    ///
+    /// # Returns
+    ///
+    /// Number of records skipped
+    pub fn skip_records(&mut self, num_records: usize) -> Result<usize> {
+        // First need to clear the buffer
+        let (buffered_records, buffered_values) = self.count_records(num_records);
+        self.num_records += buffered_records;
+        self.num_values += buffered_values;
+
+        self.consume_def_levels();
+        self.consume_rep_levels();
+        self.consume_record_data();
+        self.consume_bitmap();
+        self.reset();
+
+        let remaining = buffered_records - num_records;

Review Comment:
   I thinks it should be ๐Ÿค”:
   ```suggestion
           let remaining = num_records - buffered_records;
   ```
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #1998: Stub out Skip Records API (#1792)

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #1998:
URL: https://github.com/apache/arrow-rs/pull/1998#discussion_r912756340


##########
parquet/src/arrow/record_reader/mod.rs:
##########
@@ -202,6 +203,37 @@ where
         Ok(records_read)
     }
 
+    /// Try to skip the next `num_records` rows
+    ///
+    /// # Returns
+    ///
+    /// Number of records skipped
+    pub fn skip_records(&mut self, num_records: usize) -> Result<usize> {
+        // First need to clear the buffer
+        let (buffered_records, buffered_values) = self.count_records(num_records);
+        self.num_records += buffered_records;
+        self.num_values += buffered_values;
+
+        self.consume_def_levels();
+        self.consume_rep_levels();
+        self.consume_record_data();

Review Comment:
   Is this for the situation  a page which has been `read_records` but left some unreaded buffer?
   



##########
parquet/src/arrow/record_reader/mod.rs:
##########
@@ -202,6 +203,37 @@ where
         Ok(records_read)
     }
 
+    /// Try to skip the next `num_records` rows
+    ///
+    /// # Returns
+    ///
+    /// Number of records skipped
+    pub fn skip_records(&mut self, num_records: usize) -> Result<usize> {
+        // First need to clear the buffer
+        let (buffered_records, buffered_values) = self.count_records(num_records);
+        self.num_records += buffered_records;
+        self.num_values += buffered_values;
+
+        self.consume_def_levels();
+        self.consume_rep_levels();
+        self.consume_record_data();
+        self.consume_bitmap();
+        self.reset();
+
+        let remaining = buffered_records - num_records;

Review Comment:
   Is there a situation, where `buffered_record`s less than `num_records`๐Ÿค”
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #1998: Stub out Skip Records API (#1792)

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #1998:
URL: https://github.com/apache/arrow-rs/pull/1998#discussion_r913329326


##########
parquet/src/arrow/arrow_reader.rs:
##########
@@ -86,11 +117,24 @@ impl ArrowReaderOptions {
     ///
     /// For example:[ARROW-16184](https://issues.apache.org/jira/browse/ARROW-16184)
     ///
-
     /// Set `skip_arrow_metadata` to true, to skip decoding this
     pub fn with_skip_arrow_metadata(self, skip_arrow_metadata: bool) -> Self {
         Self {
             skip_arrow_metadata,
+            ..self
+        }
+    }
+
+    /// Scan rows from the parquet file according to the provided `selection`
+    ///
+    /// TODO: Make public once row selection fully implemented
+    pub(crate) fn with_row_selection(
+        self,
+        selection: impl Into<Vec<RowSelection>>,
+    ) -> Self {

Review Comment:
   yes, got it, it should check in user side. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1998: Stub out Skip Records API (#1792)

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1998:
URL: https://github.com/apache/arrow-rs/pull/1998#discussion_r913793928


##########
parquet/src/arrow/record_reader/mod.rs:
##########
@@ -202,6 +203,37 @@ where
         Ok(records_read)
     }
 
+    /// Try to skip the next `num_records` rows
+    ///
+    /// # Returns
+    ///
+    /// Number of records skipped
+    pub fn skip_records(&mut self, num_records: usize) -> Result<usize> {
+        // First need to clear the buffer
+        let (buffered_records, buffered_values) = self.count_records(num_records);
+        self.num_records += buffered_records;
+        self.num_values += buffered_values;
+
+        self.consume_def_levels();
+        self.consume_rep_levels();
+        self.consume_record_data();
+        self.consume_bitmap();
+        self.reset();
+
+        let remaining = buffered_records - num_records;

Review Comment:
   Right you are :+1:



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #1998: Stub out Skip Records API (#1792)

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #1998:
URL: https://github.com/apache/arrow-rs/pull/1998#discussion_r912614816


##########
parquet/src/column/page.rs:
##########
@@ -187,12 +187,29 @@ impl PageWriteSpec {
     }
 }
 
+/// Contains metadata for a page
+pub struct PageMetadata {
+    /// The number of rows in this page
+    pub num_rows: usize,
+
+    /// Returns true if the page is a dictionary page
+    pub is_dict: bool,
+}
+
 /// API for reading pages from a column chunk.
 /// This offers a iterator like API to get the next page.
 pub trait PageReader: Iterator<Item = Result<Page>> + Send {
     /// Gets the next page in the column chunk associated with this reader.
     /// Returns `None` if there are no pages left.
     fn get_next_page(&mut self) -> Result<Option<Page>>;
+
+    /// Gets metadata about the next page, returns an error if no
+    /// column index information

Review Comment:
   Is there we only need `offset index` not the `min max index`?๐Ÿค”



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #1998: Stub out Skip Records API (#1792)

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #1998:
URL: https://github.com/apache/arrow-rs/pull/1998#discussion_r913362921


##########
parquet/src/arrow/record_reader/mod.rs:
##########
@@ -202,6 +203,37 @@ where
         Ok(records_read)
     }
 
+    /// Try to skip the next `num_records` rows
+    ///
+    /// # Returns
+    ///
+    /// Number of records skipped
+    pub fn skip_records(&mut self, num_records: usize) -> Result<usize> {
+        // First need to clear the buffer
+        let (buffered_records, buffered_values) = self.count_records(num_records);
+        self.num_records += buffered_records;
+        self.num_values += buffered_values;
+
+        self.consume_def_levels();
+        self.consume_rep_levels();
+        self.consume_record_data();

Review Comment:
   So, i got it. More specific details to ask:
   This is a part of skip, we need to read the `rp` ,`dp` to skip some records in the page(maybe  have been readed or never readed ).
   ```
   let (buffered_records, buffered_values) = self.count_records(num_records);
           self.num_records += buffered_records;
           self.num_values += buffered_values;
   
           self.consume_def_levels();
           self.consume_rep_levels();
           self.consume_record_data();
           self.consume_bitmap();
           self.reset();
   
           let remaining = buffered_records - num_records;
   ```
   This also part of skip, `remaining > 0`, I think this we skip start at a new page
   ```
           if remaining == 0 {
               return Ok(buffered_records);
           }
   
           let skipped = match self.column_reader.as_mut() {
               Some(column_reader) => column_reader.skip_records(remaining)?,
               None => 0,
           };
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #1998: Stub out Skip Records API (#1792)

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #1998:
URL: https://github.com/apache/arrow-rs/pull/1998#discussion_r912763675


##########
parquet/src/arrow/record_reader/mod.rs:
##########
@@ -202,6 +203,37 @@ where
         Ok(records_read)
     }
 
+    /// Try to skip the next `num_records` rows
+    ///
+    /// # Returns
+    ///
+    /// Number of records skipped
+    pub fn skip_records(&mut self, num_records: usize) -> Result<usize> {
+        // First need to clear the buffer
+        let (buffered_records, buffered_values) = self.count_records(num_records);
+        self.num_records += buffered_records;
+        self.num_values += buffered_values;
+
+        self.consume_def_levels();
+        self.consume_rep_levels();
+        self.consume_record_data();

Review Comment:
   Sorry, i don't get this point, why not directly call `column_reader.skip_records(num_records)`
    could you give me some hint?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #1998: Stub out Skip Records API (#1792)

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #1998:
URL: https://github.com/apache/arrow-rs/pull/1998#discussion_r912605881


##########
parquet/src/arrow/arrow_reader.rs:
##########
@@ -86,11 +117,24 @@ impl ArrowReaderOptions {
     ///
     /// For example:[ARROW-16184](https://issues.apache.org/jira/browse/ARROW-16184)
     ///
-
     /// Set `skip_arrow_metadata` to true, to skip decoding this
     pub fn with_skip_arrow_metadata(self, skip_arrow_metadata: bool) -> Self {
         Self {
             skip_arrow_metadata,
+            ..self
+        }
+    }
+
+    /// Scan rows from the parquet file according to the provided `selection`
+    ///
+    /// TODO: Make public once row selection fully implemented
+    pub(crate) fn with_row_selection(
+        self,
+        selection: impl Into<Vec<RowSelection>>,
+    ) -> Self {

Review Comment:
   Could we add `total_row_count` to check this `selection` is valid(maybe like continuous)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1998: Stub out Skip Records API (#1792)

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1998:
URL: https://github.com/apache/arrow-rs/pull/1998#discussion_r913082760


##########
parquet/src/arrow/record_reader/mod.rs:
##########
@@ -202,6 +203,37 @@ where
         Ok(records_read)
     }
 
+    /// Try to skip the next `num_records` rows
+    ///
+    /// # Returns
+    ///
+    /// Number of records skipped
+    pub fn skip_records(&mut self, num_records: usize) -> Result<usize> {
+        // First need to clear the buffer
+        let (buffered_records, buffered_values) = self.count_records(num_records);
+        self.num_records += buffered_records;
+        self.num_values += buffered_values;
+
+        self.consume_def_levels();
+        self.consume_rep_levels();
+        self.consume_record_data();

Review Comment:
   RecordReader is a bit of an odd cookie, let me try to explain what it is doing.
   
   In the absence of repetition levels, it can simply read batch size levels, and the corresponding number of values.
   
   However, if repetition levels are present, it will likely need to read more than batch_size levels in order to read batch_size actual records (rows). 
   
   To achieve this it reads to its internal buffer and then splits off the data corresponding to batch_size rows, leaving the excess behind.
   
   It is this excess of data that has been read to its buffers but not yielded to the caller yet, which we must consume here



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1998: Stub out Skip Records API (#1792)

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1998:
URL: https://github.com/apache/arrow-rs/pull/1998#discussion_r913085131


##########
parquet/src/arrow/record_reader/mod.rs:
##########
@@ -202,6 +203,37 @@ where
         Ok(records_read)
     }
 
+    /// Try to skip the next `num_records` rows
+    ///
+    /// # Returns
+    ///
+    /// Number of records skipped
+    pub fn skip_records(&mut self, num_records: usize) -> Result<usize> {
+        // First need to clear the buffer
+        let (buffered_records, buffered_values) = self.count_records(num_records);
+        self.num_records += buffered_records;
+        self.num_values += buffered_values;
+
+        self.consume_def_levels();
+        self.consume_rep_levels();
+        self.consume_record_data();
+        self.consume_bitmap();
+        self.reset();
+
+        let remaining = buffered_records - num_records;

Review Comment:
   Yes, see other comment. In the case of no repetition levels buffered levels will always be 0, otherwise it will be an arbitrary number based on how much extra data was read in the last call to next_batch



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #1998: Stub out Skip Records API (#1792)

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #1998:
URL: https://github.com/apache/arrow-rs/pull/1998#discussion_r913362921


##########
parquet/src/arrow/record_reader/mod.rs:
##########
@@ -202,6 +203,37 @@ where
         Ok(records_read)
     }
 
+    /// Try to skip the next `num_records` rows
+    ///
+    /// # Returns
+    ///
+    /// Number of records skipped
+    pub fn skip_records(&mut self, num_records: usize) -> Result<usize> {
+        // First need to clear the buffer
+        let (buffered_records, buffered_values) = self.count_records(num_records);
+        self.num_records += buffered_records;
+        self.num_values += buffered_values;
+
+        self.consume_def_levels();
+        self.consume_rep_levels();
+        self.consume_record_data();

Review Comment:
   ๐Ÿ‘ nice write up ! Save me some time ๐Ÿ˜„!  
   So, i got it. More specific details to ask:
   This is a part of skip, we need to read the `rp` ,`dp` to skip some records in the page(maybe  have been readed or never readed ).
   ```
   let (buffered_records, buffered_values) = self.count_records(num_records);
           self.num_records += buffered_records;
           self.num_values += buffered_values;
   
           self.consume_def_levels();
           self.consume_rep_levels();
           self.consume_record_data();
           self.consume_bitmap();
           self.reset();
   
           let remaining = buffered_records - num_records;
   ```
   This also part of skip, `remaining > 0`, I think this we skip start at a new page
   ```
           if remaining == 0 {
               return Ok(buffered_records);
           }
   
           let skipped = match self.column_reader.as_mut() {
               Some(column_reader) => column_reader.skip_records(remaining)?,
               None => 0,
           };
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1998: Stub out Skip Records API (#1792)

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1998:
URL: https://github.com/apache/arrow-rs/pull/1998#discussion_r913755398


##########
parquet/src/arrow/record_reader/mod.rs:
##########
@@ -202,6 +203,37 @@ where
         Ok(records_read)
     }
 
+    /// Try to skip the next `num_records` rows
+    ///
+    /// # Returns
+    ///
+    /// Number of records skipped
+    pub fn skip_records(&mut self, num_records: usize) -> Result<usize> {
+        // First need to clear the buffer
+        let (buffered_records, buffered_values) = self.count_records(num_records);
+        self.num_records += buffered_records;
+        self.num_values += buffered_values;
+
+        self.consume_def_levels();
+        self.consume_rep_levels();
+        self.consume_record_data();
+        self.consume_bitmap();
+        self.reset();
+
+        let remaining = buffered_records - num_records;

Review Comment:
   Oh sorry, I wrote that response on my phone and got my wires crossed. No `buffered_records <= num_records` as `Self::count_records` cannot return more than the `records_to_read` argument, which in this case is `num_records`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1998: Stub out Skip Records API (#1792)

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1998:
URL: https://github.com/apache/arrow-rs/pull/1998#discussion_r913078777


##########
parquet/src/arrow/arrow_reader.rs:
##########
@@ -86,11 +117,24 @@ impl ArrowReaderOptions {
     ///
     /// For example:[ARROW-16184](https://issues.apache.org/jira/browse/ARROW-16184)
     ///
-
     /// Set `skip_arrow_metadata` to true, to skip decoding this
     pub fn with_skip_arrow_metadata(self, skip_arrow_metadata: bool) -> Self {
         Self {
             skip_arrow_metadata,
+            ..self
+        }
+    }
+
+    /// Scan rows from the parquet file according to the provided `selection`
+    ///
+    /// TODO: Make public once row selection fully implemented
+    pub(crate) fn with_row_selection(
+        self,
+        selection: impl Into<Vec<RowSelection>>,
+    ) -> Self {

Review Comment:
   Is it actually an issue if it isn't, e.g. if I only want the first 100 rows?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #1998: Stub out Skip Records API (#1792)

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #1998:
URL: https://github.com/apache/arrow-rs/pull/1998#discussion_r913328966


##########
parquet/src/arrow/record_reader/mod.rs:
##########
@@ -202,6 +203,37 @@ where
         Ok(records_read)
     }
 
+    /// Try to skip the next `num_records` rows
+    ///
+    /// # Returns
+    ///
+    /// Number of records skipped
+    pub fn skip_records(&mut self, num_records: usize) -> Result<usize> {
+        // First need to clear the buffer
+        let (buffered_records, buffered_values) = self.count_records(num_records);
+        self.num_records += buffered_records;
+        self.num_values += buffered_values;
+
+        self.consume_def_levels();
+        self.consume_rep_levels();
+        self.consume_record_data();
+        self.consume_bitmap();
+        self.reset();
+
+        let remaining = buffered_records - num_records;

Review Comment:
   both `buffered_records ` and `buffered_records` are usize it will overflow,
   and if `remaining ` is nagtive how to use in ` column_reader.skip_records(remaining)?,` 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1998: Stub out Skip Records API (#1792)

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1998:
URL: https://github.com/apache/arrow-rs/pull/1998#discussion_r915218036


##########
parquet/src/arrow/arrow_reader.rs:
##########
@@ -70,9 +71,39 @@ pub trait ArrowReader {
     ) -> Result<Self::RecordReader>;
 }
 
+/// [`RowSelection`] allows selecting or skipping a provided number of rows
+/// when scanning the parquet file
+#[derive(Debug, Clone, Copy)]
+pub(crate) struct RowSelection {

Review Comment:
   See with_row_selection which takes a Vec



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold merged pull request #1998: Stub out Skip Records API (#1792)

Posted by GitBox <gi...@apache.org>.
tustvold merged PR #1998:
URL: https://github.com/apache/arrow-rs/pull/1998


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #1998: Stub out Skip Records API (#1792)

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #1998:
URL: https://github.com/apache/arrow-rs/pull/1998#discussion_r912605881


##########
parquet/src/arrow/arrow_reader.rs:
##########
@@ -86,11 +117,24 @@ impl ArrowReaderOptions {
     ///
     /// For example:[ARROW-16184](https://issues.apache.org/jira/browse/ARROW-16184)
     ///
-
     /// Set `skip_arrow_metadata` to true, to skip decoding this
     pub fn with_skip_arrow_metadata(self, skip_arrow_metadata: bool) -> Self {
         Self {
             skip_arrow_metadata,
+            ..self
+        }
+    }
+
+    /// Scan rows from the parquet file according to the provided `selection`
+    ///
+    /// TODO: Make public once row selection fully implemented
+    pub(crate) fn with_row_selection(
+        self,
+        selection: impl Into<Vec<RowSelection>>,
+    ) -> Self {

Review Comment:
   Could we add `total_row_count` to check this `selection` is valid(maybe continuous)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #1998: Stub out Skip Records API (#1792)

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #1998:
URL: https://github.com/apache/arrow-rs/pull/1998#discussion_r912605881


##########
parquet/src/arrow/arrow_reader.rs:
##########
@@ -86,11 +117,24 @@ impl ArrowReaderOptions {
     ///
     /// For example:[ARROW-16184](https://issues.apache.org/jira/browse/ARROW-16184)
     ///
-
     /// Set `skip_arrow_metadata` to true, to skip decoding this
     pub fn with_skip_arrow_metadata(self, skip_arrow_metadata: bool) -> Self {
         Self {
             skip_arrow_metadata,
+            ..self
+        }
+    }
+
+    /// Scan rows from the parquet file according to the provided `selection`
+    ///
+    /// TODO: Make public once row selection fully implemented
+    pub(crate) fn with_row_selection(
+        self,
+        selection: impl Into<Vec<RowSelection>>,
+    ) -> Self {

Review Comment:
   I think we should add total_row_count, check this selection is valid.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #1998: Stub out Skip Records API (#1792)

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #1998:
URL: https://github.com/apache/arrow-rs/pull/1998#discussion_r912608716


##########
parquet/src/arrow/arrow_reader.rs:
##########
@@ -222,13 +271,47 @@ pub struct ParquetRecordBatchReader {
     batch_size: usize,
     array_reader: Box<dyn ArrayReader>,
     schema: SchemaRef,
+    selection: Option<VecDeque<RowSelection>>,
 }
 
 impl Iterator for ParquetRecordBatchReader {
     type Item = ArrowResult<RecordBatch>;
 
     fn next(&mut self) -> Option<Self::Item> {
-        match self.array_reader.next_batch(self.batch_size) {
+        let to_read = match self.selection.as_mut() {

Review Comment:
   ๐Ÿ‘ 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1998: Stub out Skip Records API (#1792)

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #1998:
URL: https://github.com/apache/arrow-rs/pull/1998#discussion_r913755398


##########
parquet/src/arrow/record_reader/mod.rs:
##########
@@ -202,6 +203,37 @@ where
         Ok(records_read)
     }
 
+    /// Try to skip the next `num_records` rows
+    ///
+    /// # Returns
+    ///
+    /// Number of records skipped
+    pub fn skip_records(&mut self, num_records: usize) -> Result<usize> {
+        // First need to clear the buffer
+        let (buffered_records, buffered_values) = self.count_records(num_records);
+        self.num_records += buffered_records;
+        self.num_values += buffered_values;
+
+        self.consume_def_levels();
+        self.consume_rep_levels();
+        self.consume_record_data();
+        self.consume_bitmap();
+        self.reset();
+
+        let remaining = buffered_records - num_records;

Review Comment:
   Oh sorry, I wrote that response on my phone and got my wires crossed. No buffered_records here <= num_records - count_records cannot return more than the `records_to_read` argument which in this case is `num_records`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on a diff in pull request #1998: Stub out Skip Records API (#1792)

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #1998:
URL: https://github.com/apache/arrow-rs/pull/1998#discussion_r915163698


##########
parquet/src/arrow/arrow_reader.rs:
##########
@@ -90,6 +121,20 @@ impl ArrowReaderOptions {
     pub fn with_skip_arrow_metadata(self, skip_arrow_metadata: bool) -> Self {
         Self {
             skip_arrow_metadata,
+            ..self
+        }
+    }
+
+    /// Scan rows from the parquet file according to the provided `selection`
+    ///
+    /// TODO: Make public once row selection fully implemented

Review Comment:
   perhaps worth a ticket?



##########
parquet/src/arrow/array_reader/byte_array.rs:
##########
@@ -210,6 +214,10 @@ impl<I: OffsetSizeTrait + ScalarValue> ColumnValueDecoder
 
         decoder.read(out, range.end - range.start, self.dict.as_ref())
     }
+
+    fn skip_values(&mut self, _num_values: usize) -> Result<usize> {
+        todo!()

Review Comment:
   I think adding a ticket reference here like
   `unimplemented!("See https://github.com/apache/arrow-rs/.....")` would help future readers
   
   Bonus points for returning `ArrowError::Unimplemented`
   
   This comment applies to everything below as well



##########
parquet/src/file/serialized_reader.rs:
##########
@@ -555,6 +555,14 @@ impl<T: Read + Send> PageReader for SerializedPageReader<T> {
         // We are at the end of this column chunk and no more page left. Return None.
         Ok(None)
     }
+
+    fn peek_next_page(&self) -> Result<Option<PageMetadata>> {
+        todo!()

Review Comment:
   ditto returning "not yet implemented" would probably be nicer



##########
parquet/src/arrow/record_reader/definition_levels.rs:
##########
@@ -146,15 +146,15 @@ impl LevelsBufferSlice for DefinitionLevelBuffer {
     }
 }
 
-pub struct DefinitionLevelDecoder {
+pub struct DefinitionLevelBufferDecoder {

Review Comment:
   I this rename a public API change as well? It does not appear in the docs
   
   https://docs.rs/parquet/17.0.0/parquet/?search=DefinitionLevelDecoder
   
   



##########
parquet/src/arrow/arrow_reader.rs:
##########
@@ -70,9 +71,39 @@ pub trait ArrowReader {
     ) -> Result<Self::RecordReader>;
 }
 
+/// [`RowSelection`] allows selecting or skipping a provided number of rows
+/// when scanning the parquet file
+#[derive(Debug, Clone, Copy)]
+pub(crate) struct RowSelection {

Review Comment:
   You probably already have thought about this, but I would expect that in certain scenarios, non contiguous rows / skips would be desired
   
   Like "fetch the first 100 rows, skip the next 200, and then fetch the remaining"
   
   Would this interface handle that case?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org