You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/07/11 15:00:54 UTC

[GitHub] [arrow-rs] Ted-Jiang opened a new pull request, #2044: Support peek_next_page() and skip_next_page in serialized_reader.

Ted-Jiang opened a new pull request, #2044:
URL: https://github.com/apache/arrow-rs/pull/2044

   # Which issue does this PR close?
   
   
   Closes #2043.
   
   # Rationale for this change
   
   # What changes are included in this PR?
   
   <!---
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are there any user-facing changes?
   
   
   
   <!---
   If there are any breaking changes to public APIs, please add the `breaking change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #2044: Support peek_next_page() and skip_next_page in serialized_reader.

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #2044:
URL: https://github.com/apache/arrow-rs/pull/2044#discussion_r919617920


##########
parquet/src/file/serialized_reader.rs:
##########
@@ -526,6 +543,34 @@ impl<T: Read> SerializedPageReader<T> {
             page_offset_index: None,
             seen_num_data_pages: 0,
             has_dictionary_page_to_read: false,
+            page_bufs: Default::default(),
+        };
+        Ok(result)
+    }
+
+    /// Creates a new serialized page reader from file source.
+    pub fn new_with_page_offsets(
+        buf: T,

Review Comment:
   Yes, but i try to create an Empty T not succeeded. 
   So i  try to make buf `option` it will make a lot code change and break Api
   Any suggestion?😂



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #2044: Support peek_next_page() and skip_next_page in serialized_reader.

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #2044:
URL: https://github.com/apache/arrow-rs/pull/2044#discussion_r919618461


##########
parquet/src/file/serialized_reader.rs:
##########
@@ -526,6 +543,34 @@ impl<T: Read> SerializedPageReader<T> {
             page_offset_index: None,
             seen_num_data_pages: 0,
             has_dictionary_page_to_read: false,
+            page_bufs: Default::default(),
+        };
+        Ok(result)
+    }
+
+    /// Creates a new serialized page reader from file source.
+    pub fn new_with_page_offsets(
+        buf: T,
+        total_num_values: i64,

Review Comment:
   Yes, first version use this,  i will delete it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #2044: Support peek_next_page() and skip_next_page in serialized_reader.

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2044:
URL: https://github.com/apache/arrow-rs/pull/2044#discussion_r919102261


##########
parquet/src/file/serialized_reader.rs:
##########
@@ -550,22 +595,29 @@ impl<T: Read + Send> Iterator for SerializedPageReader<T> {
 
 impl<T: Read + Send> PageReader for SerializedPageReader<T> {
     fn get_next_page(&mut self) -> Result<Option<Page>> {
+        let mut cursor = &mut self.buf;
+        let mut dictionary_cursor;
         while self.seen_num_values < self.total_num_values {
-            // For now we can not update `seen_num_values` in `skip_next_page`,
-            // so we need add this check.
             if let Some(indexes) = &self.page_offset_index {
-                if indexes.len() == self.seen_num_data_pages {
+                // For now we can not update `seen_num_values` in `skip_next_page`,
+                // so we need add this check.
+                if indexes.len() <= self.seen_num_data_pages {
                     return Ok(None);
+                } else if self.seen_num_data_pages == 0
+                    && self.has_dictionary_page_to_read
+                {
+                    dictionary_cursor = self.page_bufs.pop_front().unwrap();
+                    cursor = dictionary_cursor.borrow_mut();

Review Comment:
   ```suggestion
                       cursor = &mut dictionary_cursor;
   ```



##########
parquet/src/util/page_util.rs:
##########
@@ -15,13 +15,40 @@
 // specific language governing permissions and limitations
 // under the License.
 
+use std::collections::VecDeque;
+use std::io::Read;
+use std::sync::Arc;
 use crate::errors::Result;
 use parquet_format::PageLocation;
+use crate::file::reader::ChunkReader;
 
+/// Use column chunk's offset index to get the `page_num` page row count.
 pub(crate) fn calculate_row_count(indexes: &[PageLocation], page_num: usize, total_row_count: i64) -> Result<usize> {
     if page_num == indexes.len() - 1 {
         Ok((total_row_count - indexes[page_num].first_row_index + 1) as usize)
     } else {
         Ok((indexes[page_num + 1].first_row_index - indexes[page_num].first_row_index) as usize)
     }
 }
+
+/// Use column chunk's offset index to get each page serially readable slice
+/// and a flag indicates whether having one dictionary page in this column chunk.
+pub(crate) fn get_pages_readable_slices<T: Read + Send, R: ChunkReader + ChunkReader<T=T>>(col_chunk_offset_index: &[PageLocation], col_start: u64, chunk_reader: Arc<R>) -> Result<(VecDeque<T>, bool)> {

Review Comment:
   ```suggestion
   pub(crate) fn get_pages_readable_slices<T: Read + Send, R: ChunkReader<T=T>>(col_chunk_offset_index: &[PageLocation], col_start: u64, chunk_reader: Arc<R>) -> Result<(VecDeque<T>, bool)> {
   ```



##########
parquet/src/file/serialized_reader.rs:
##########
@@ -526,6 +543,34 @@ impl<T: Read> SerializedPageReader<T> {
             page_offset_index: None,
             seen_num_data_pages: 0,
             has_dictionary_page_to_read: false,
+            page_bufs: Default::default(),
+        };
+        Ok(result)
+    }
+
+    /// Creates a new serialized page reader from file source.
+    pub fn new_with_page_offsets(
+        buf: T,

Review Comment:
   It would be nice to not need to pass this, as it isn't actually used



##########
parquet/src/file/serialized_reader.rs:
##########
@@ -498,9 +566,23 @@ impl<T: Read> SerializedPageReader<T> {
             seen_num_values: 0,
             decompressor,
             physical_type,
+            num_rows,
+            page_offset_index: Some(offset_index),
+            seen_num_data_pages: 0,
+            has_dictionary_page_to_read,
+            page_bufs,
         };
         Ok(result)
     }
+
+    pub fn with_page_offset_index_and_has_dictionary_to_read(

Review Comment:
   I would remove this method now in favor of new_with_page_offsets



##########
parquet/src/file/serialized_reader.rs:
##########
@@ -481,6 +505,22 @@ pub struct SerializedPageReader<T: Read> {
 
     // Column chunk type.
     physical_type: Type,
+
+    // total row count.
+    num_rows: i64,
+
+    //Page offset index.

Review Comment:
   Perhaps we could have something like
   
   ```
   enum SerializedPages<T: Read>
       /// Read entire chunk
       Chunk {
           buf: T,
           total_num_values: i64,
       },
       Pages {
           index: Vec<PageLocation>,
           seen_num_data_pages: usize,
           has_dictionary_page_to_read: bool,
           page_bufs: VecDeque<T>
       }
   ```
   To make it clear what is present in what mode



##########
parquet/src/file/serialized_reader.rs:
##########
@@ -481,6 +505,22 @@ pub struct SerializedPageReader<T: Read> {
 
     // Column chunk type.
     physical_type: Type,
+
+    // total row count.
+    num_rows: i64,

Review Comment:
   This doesn't appear to be being used, and would avoid this being a breaking change



##########
parquet/src/file/serialized_reader.rs:
##########
@@ -526,6 +543,34 @@ impl<T: Read> SerializedPageReader<T> {
             page_offset_index: None,
             seen_num_data_pages: 0,
             has_dictionary_page_to_read: false,
+            page_bufs: Default::default(),
+        };
+        Ok(result)
+    }
+
+    /// Creates a new serialized page reader from file source.
+    pub fn new_with_page_offsets(
+        buf: T,
+        total_num_values: i64,

Review Comment:
   I don't think we can actually respect this when skipping pages, so not sure we need to provide it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold merged pull request #2044: Support peek_next_page() and skip_next_page in serialized_reader.

Posted by GitBox <gi...@apache.org>.
tustvold merged PR #2044:
URL: https://github.com/apache/arrow-rs/pull/2044


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] liukun4515 commented on a diff in pull request #2044: Support peek_next_page() and skip_next_page in serialized_reader.

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on code in PR #2044:
URL: https://github.com/apache/arrow-rs/pull/2044#discussion_r920875372


##########
parquet/src/file/serialized_reader.rs:
##########
@@ -37,6 +38,7 @@ use crate::util::{io::TryClone, memory::ByteBufferPtr};
 
 // export `SliceableCursor` and `FileSource` publically so clients can
 // re-use the logic in their own ParquetFileWriter wrappers
+use crate::util::page_util::{calculate_row_count, get_pages_readable_slices};

Review Comment:
   ```suggestion
   use crate::util::page_util::{calculate_row_count, get_pages_readable_slices};
   
   // export `SliceableCursor` and `FileSource` publically so clients can
   // re-use the logic in their own ParquetFileWriter wrappers
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #2044: Support peek_next_page() and skip_next_page in serialized_reader.

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #2044:
URL: https://github.com/apache/arrow-rs/pull/2044#discussion_r918045867


##########
parquet/src/file/serialized_reader.rs:
##########
@@ -560,12 +606,53 @@ impl<T: Read + Send> PageReader for SerializedPageReader<T> {
         Ok(None)
     }
 
-    fn peek_next_page(&self) -> Result<Option<PageMetadata>> {
-        Err(nyi_err!("https://github.com/apache/arrow-rs/issues/1792"))
+    fn peek_next_page(&mut self) -> Result<Option<PageMetadata>> {
+        if let Some(page_offset_index) = &self.page_offset_index {
+            if self.seen_num_data_pages == page_offset_index.len() {
+                Ok(None)
+            } else if self.seen_num_data_pages == 0 && self.has_dictionary_page_to_read {
+                self.has_dictionary_page_to_read = false;
+                Ok(Some(PageMetadata {
+                    num_rows: usize::MIN,
+                    is_dict: true,
+                }))
+            } else {
+                let row_count = calculate_row_count(
+                    page_offset_index,
+                    self.seen_num_data_pages,
+                    self.total_num_values,
+                )?;
+                Ok(Some(PageMetadata {
+                    num_rows: row_count,
+                    is_dict: false,
+                }))
+            }
+        } else {
+            Err(general_err!("Must set page_offset_index when using peek_next_page in SerializedPageReader."))
+        }
     }
 
     fn skip_next_page(&mut self) -> Result<()> {
-        Err(nyi_err!("https://github.com/apache/arrow-rs/issues/1792"))
+        if let Some(page_offset_index) = &self.page_offset_index {
+            let location = &page_offset_index[self.seen_num_data_pages];
+            let compressed_page_size = location.compressed_page_size;
+            //skip page bytes
+            let skip_size = io::copy(

Review Comment:
   @tustvold Sorry to bother you,
   i found `buf:T` only impl `read` but without `seek`,
   Is there any good way to skip bytes in buffer? 🤔 



##########
parquet/src/file/serialized_reader.rs:
##########
@@ -560,12 +606,53 @@ impl<T: Read + Send> PageReader for SerializedPageReader<T> {
         Ok(None)
     }
 
-    fn peek_next_page(&self) -> Result<Option<PageMetadata>> {
-        Err(nyi_err!("https://github.com/apache/arrow-rs/issues/1792"))
+    fn peek_next_page(&mut self) -> Result<Option<PageMetadata>> {
+        if let Some(page_offset_index) = &self.page_offset_index {
+            if self.seen_num_data_pages == page_offset_index.len() {
+                Ok(None)
+            } else if self.seen_num_data_pages == 0 && self.has_dictionary_page_to_read {
+                self.has_dictionary_page_to_read = false;
+                Ok(Some(PageMetadata {
+                    num_rows: usize::MIN,
+                    is_dict: true,
+                }))
+            } else {
+                let row_count = calculate_row_count(
+                    page_offset_index,
+                    self.seen_num_data_pages,
+                    self.total_num_values,
+                )?;
+                Ok(Some(PageMetadata {
+                    num_rows: row_count,
+                    is_dict: false,
+                }))
+            }
+        } else {
+            Err(general_err!("Must set page_offset_index when using peek_next_page in SerializedPageReader."))
+        }
     }
 
     fn skip_next_page(&mut self) -> Result<()> {
-        Err(nyi_err!("https://github.com/apache/arrow-rs/issues/1792"))
+        if let Some(page_offset_index) = &self.page_offset_index {
+            let location = &page_offset_index[self.seen_num_data_pages];
+            let compressed_page_size = location.compressed_page_size;
+            //skip page bytes
+            let skip_size = io::copy(
+                self.buf.by_ref().take(compressed_page_size as u64).by_ref(),

Review Comment:
   @tustvold Sorry to bother you,
   i found `buf:T` only impl `read` but without `seek`,
   Is there any good way to skip bytes in buffer? 🤔 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #2044: Support peek_next_page() and skip_next_page in serialized_reader.

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #2044:
URL: https://github.com/apache/arrow-rs/pull/2044#discussion_r919616909


##########
parquet/src/file/serialized_reader.rs:
##########
@@ -498,9 +566,23 @@ impl<T: Read> SerializedPageReader<T> {
             seen_num_values: 0,
             decompressor,
             physical_type,
+            num_rows,
+            page_offset_index: Some(offset_index),
+            seen_num_data_pages: 0,
+            has_dictionary_page_to_read,
+            page_bufs,
         };
         Ok(result)
     }
+
+    pub fn with_page_offset_index_and_has_dictionary_to_read(

Review Comment:
   forget delete it😂. Why ci not check this func never used🤔



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #2044: Support peek_next_page() and skip_next_page in serialized_reader.

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #2044:
URL: https://github.com/apache/arrow-rs/pull/2044#discussion_r919617920


##########
parquet/src/file/serialized_reader.rs:
##########
@@ -526,6 +543,34 @@ impl<T: Read> SerializedPageReader<T> {
             page_offset_index: None,
             seen_num_data_pages: 0,
             has_dictionary_page_to_read: false,
+            page_bufs: Default::default(),
+        };
+        Ok(result)
+    }
+
+    /// Creates a new serialized page reader from file source.
+    pub fn new_with_page_offsets(
+        buf: T,

Review Comment:
   Yes, but i try to create an Empty T not succeeded. 
   So i  try to make buf `option` it will make a lot code change and break Api



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #2044: Support peek_next_page() and skip_next_page in serialized_reader.

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #2044:
URL: https://github.com/apache/arrow-rs/pull/2044#discussion_r918987470


##########
parquet/src/file/serialized_reader.rs:
##########
@@ -560,12 +606,53 @@ impl<T: Read + Send> PageReader for SerializedPageReader<T> {
         Ok(None)
     }
 
-    fn peek_next_page(&self) -> Result<Option<PageMetadata>> {
-        Err(nyi_err!("https://github.com/apache/arrow-rs/issues/1792"))
+    fn peek_next_page(&mut self) -> Result<Option<PageMetadata>> {
+        if let Some(page_offset_index) = &self.page_offset_index {
+            if self.seen_num_data_pages == page_offset_index.len() {
+                Ok(None)
+            } else if self.seen_num_data_pages == 0 && self.has_dictionary_page_to_read {
+                self.has_dictionary_page_to_read = false;
+                Ok(Some(PageMetadata {
+                    num_rows: usize::MIN,
+                    is_dict: true,
+                }))
+            } else {
+                let row_count = calculate_row_count(
+                    page_offset_index,
+                    self.seen_num_data_pages,
+                    self.total_num_values,
+                )?;
+                Ok(Some(PageMetadata {
+                    num_rows: row_count,
+                    is_dict: false,
+                }))
+            }
+        } else {
+            Err(general_err!("Must set page_offset_index when using peek_next_page in SerializedPageReader."))
+        }
     }
 
     fn skip_next_page(&mut self) -> Result<()> {
-        Err(nyi_err!("https://github.com/apache/arrow-rs/issues/1792"))
+        if let Some(page_offset_index) = &self.page_offset_index {
+            let location = &page_offset_index[self.seen_num_data_pages];
+            let compressed_page_size = location.compressed_page_size;
+            //skip page bytes
+            let skip_size = io::copy(
+                self.buf.by_ref().take(compressed_page_size as u64).by_ref(),

Review Comment:
   @tustvold [62f9bbc](https://github.com/apache/arrow-rs/pull/2044/commits/62f9bbc42071c32533cd49d5e847a151b3aa4d43) PTAL 😊



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on pull request #2044: Support peek_next_page() and skip_next_page in serialized_reader.

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on PR #2044:
URL: https://github.com/apache/arrow-rs/pull/2044#issuecomment-1185096565

   > Nice 👍
   
   Thanks ❤️ 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #2044: Support peek_next_page() and skip_next_page in serialized_reader.

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #2044:
URL: https://github.com/apache/arrow-rs/pull/2044#discussion_r919644656


##########
parquet/src/file/serialized_reader.rs:
##########
@@ -481,6 +505,22 @@ pub struct SerializedPageReader<T: Read> {
 
     // Column chunk type.
     physical_type: Type,
+
+    // total row count.
+    num_rows: i64,
+
+    //Page offset index.

Review Comment:
   fix in [bd5335f](https://github.com/apache/arrow-rs/pull/2044/commits/bd5335f0a02b808b542eadd17b08b37a52a1cd37) @tustvold  PTAL, learn a lot from you 😊



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #2044: Support peek_next_page() and skip_next_page in serialized_reader.

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #2044:
URL: https://github.com/apache/arrow-rs/pull/2044#discussion_r919599314


##########
parquet/src/file/serialized_reader.rs:
##########
@@ -481,6 +505,22 @@ pub struct SerializedPageReader<T: Read> {
 
     // Column chunk type.
     physical_type: Type,
+
+    // total row count.
+    num_rows: i64,

Review Comment:
   Agree. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] liukun4515 commented on a diff in pull request #2044: Support peek_next_page() and skip_next_page in serialized_reader.

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on code in PR #2044:
URL: https://github.com/apache/arrow-rs/pull/2044#discussion_r920875720


##########
parquet/src/file/serialized_reader.rs:
##########
@@ -37,6 +38,7 @@ use crate::util::{io::TryClone, memory::ByteBufferPtr};
 
 // export `SliceableCursor` and `FileSource` publically so clients can
 // re-use the logic in their own ParquetFileWriter wrappers
+use crate::util::page_util::{calculate_row_count, get_pages_readable_slices};

Review Comment:
   change the position of the import



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on a diff in pull request #2044: Support peek_next_page() and skip_next_page in serialized_reader.

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #2044:
URL: https://github.com/apache/arrow-rs/pull/2044#discussion_r918234131


##########
parquet/src/file/serialized_reader.rs:
##########
@@ -560,12 +606,53 @@ impl<T: Read + Send> PageReader for SerializedPageReader<T> {
         Ok(None)
     }
 
-    fn peek_next_page(&self) -> Result<Option<PageMetadata>> {
-        Err(nyi_err!("https://github.com/apache/arrow-rs/issues/1792"))
+    fn peek_next_page(&mut self) -> Result<Option<PageMetadata>> {
+        if let Some(page_offset_index) = &self.page_offset_index {
+            if self.seen_num_data_pages == page_offset_index.len() {
+                Ok(None)
+            } else if self.seen_num_data_pages == 0 && self.has_dictionary_page_to_read {
+                self.has_dictionary_page_to_read = false;
+                Ok(Some(PageMetadata {
+                    num_rows: usize::MIN,
+                    is_dict: true,
+                }))
+            } else {
+                let row_count = calculate_row_count(
+                    page_offset_index,
+                    self.seen_num_data_pages,
+                    self.total_num_values,
+                )?;
+                Ok(Some(PageMetadata {
+                    num_rows: row_count,
+                    is_dict: false,
+                }))
+            }
+        } else {
+            Err(general_err!("Must set page_offset_index when using peek_next_page in SerializedPageReader."))
+        }
     }
 
     fn skip_next_page(&mut self) -> Result<()> {
-        Err(nyi_err!("https://github.com/apache/arrow-rs/issues/1792"))
+        if let Some(page_offset_index) = &self.page_offset_index {
+            let location = &page_offset_index[self.seen_num_data_pages];
+            let compressed_page_size = location.compressed_page_size;
+            //skip page bytes
+            let skip_size = io::copy(
+                self.buf.by_ref().take(compressed_page_size as u64).by_ref(),

Review Comment:
   I think the best option here would be to store a list of Read within SerializedPageReader, and have it work through them sequentially. This will allow adding an additional constructor that would take the column index and the ChunkReader, and create readers for each page up front.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #2044: Support peek_next_page() and skip_next_page in serialized_reader.

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #2044:
URL: https://github.com/apache/arrow-rs/pull/2044#discussion_r919620826


##########
parquet/src/file/serialized_reader.rs:
##########
@@ -526,6 +543,34 @@ impl<T: Read> SerializedPageReader<T> {
             page_offset_index: None,
             seen_num_data_pages: 0,
             has_dictionary_page_to_read: false,
+            page_bufs: Default::default(),
+        };
+        Ok(result)
+    }
+
+    /// Creates a new serialized page reader from file source.
+    pub fn new_with_page_offsets(
+        buf: T,

Review Comment:
   find your advice below, will try.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #2044: Support peek_next_page() and skip_next_page in serialized_reader.

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #2044:
URL: https://github.com/apache/arrow-rs/pull/2044#discussion_r919620637


##########
parquet/src/file/serialized_reader.rs:
##########
@@ -481,6 +505,22 @@ pub struct SerializedPageReader<T: Read> {
 
     // Column chunk type.
     physical_type: Type,
+
+    // total row count.
+    num_rows: i64,
+
+    //Page offset index.

Review Comment:
   Good advice!  i will try in this mode.👍



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] liukun4515 commented on a diff in pull request #2044: Support peek_next_page() and skip_next_page in serialized_reader.

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on code in PR #2044:
URL: https://github.com/apache/arrow-rs/pull/2044#discussion_r920874239


##########
parquet/src/file/serialized_reader.rs:
##########
@@ -481,6 +513,18 @@ pub struct SerializedPageReader<T: Read> {
 
     // Column chunk type.
     physical_type: Type,
+    // //Page offset index.

Review Comment:
   Do you want to remove these lines?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] liukun4515 commented on a diff in pull request #2044: Support peek_next_page() and skip_next_page in serialized_reader.

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on code in PR #2044:
URL: https://github.com/apache/arrow-rs/pull/2044#discussion_r920880654


##########
parquet/src/file/serialized_reader.rs:
##########
@@ -526,6 +543,34 @@ impl<T: Read> SerializedPageReader<T> {
             page_offset_index: None,
             seen_num_data_pages: 0,
             has_dictionary_page_to_read: false,
+            page_bufs: Default::default(),
+        };
+        Ok(result)
+    }
+
+    /// Creates a new serialized page reader from file source.
+    pub fn new_with_page_offsets(
+        buf: T,
+        total_num_values: i64,

Review Comment:
   comments are better resolved by the person who provided them  



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] codecov-commenter commented on pull request #2044: Support peek_next_page() and skip_next_page in serialized_reader.

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on PR #2044:
URL: https://github.com/apache/arrow-rs/pull/2044#issuecomment-1182711168

   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/2044?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#2044](https://codecov.io/gh/apache/arrow-rs/pull/2044?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (d9b6a66) into [master](https://codecov.io/gh/apache/arrow-rs/commit/ca5fe7df5ca0bdcfd6f732cc3f10da511b753c5f?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (ca5fe7d) will **increase** coverage by `0.02%`.
   > The diff coverage is `84.89%`.
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #2044      +/-   ##
   ==========================================
   + Coverage   83.54%   83.57%   +0.02%     
   ==========================================
     Files         222      223       +1     
     Lines       58186    58345     +159     
   ==========================================
   + Hits        48612    48762     +150     
   - Misses       9574     9583       +9     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/2044?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [parquet/src/arrow/async\_reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/2044/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXN5bmNfcmVhZGVyLnJz) | `0.00% <0.00%> (ø)` | |
   | [parquet/src/column/page.rs](https://codecov.io/gh/apache/arrow-rs/pull/2044/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvY29sdW1uL3BhZ2UucnM=) | `98.68% <ø> (ø)` | |
   | [parquet/src/file/writer.rs](https://codecov.io/gh/apache/arrow-rs/pull/2044/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZmlsZS93cml0ZXIucnM=) | `92.60% <ø> (ø)` | |
   | [parquet/src/util/test\_common/page\_util.rs](https://codecov.io/gh/apache/arrow-rs/pull/2044/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvdXRpbC90ZXN0X2NvbW1vbi9wYWdlX3V0aWwucnM=) | `86.73% <0.00%> (ø)` | |
   | [parquet/src/file/metadata.rs](https://codecov.io/gh/apache/arrow-rs/pull/2044/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZmlsZS9tZXRhZGF0YS5ycw==) | `94.70% <66.66%> (-0.69%)` | :arrow_down: |
   | [parquet/src/file/serialized\_reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/2044/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZmlsZS9zZXJpYWxpemVkX3JlYWRlci5ycw==) | `94.38% <85.32%> (-0.96%)` | :arrow_down: |
   | [parquet/src/column/writer.rs](https://codecov.io/gh/apache/arrow-rs/pull/2044/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvY29sdW1uL3dyaXRlci5ycw==) | `91.53% <100.00%> (+0.05%)` | :arrow_up: |
   | [parquet/src/util/page\_util.rs](https://codecov.io/gh/apache/arrow-rs/pull/2044/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvdXRpbC9wYWdlX3V0aWwucnM=) | `100.00% <100.00%> (ø)` | |
   | [parquet\_derive/src/parquet\_field.rs](https://codecov.io/gh/apache/arrow-rs/pull/2044/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldF9kZXJpdmUvc3JjL3BhcnF1ZXRfZmllbGQucnM=) | `65.75% <0.00%> (-0.46%)` | :arrow_down: |
   | ... and [5 more](https://codecov.io/gh/apache/arrow-rs/pull/2044/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/2044?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/2044?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [ca5fe7d...d9b6a66](https://codecov.io/gh/apache/arrow-rs/pull/2044?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #2044: Support peek_next_page() and skip_next_page in serialized_reader.

Posted by GitBox <gi...@apache.org>.
Ted-Jiang commented on code in PR #2044:
URL: https://github.com/apache/arrow-rs/pull/2044#discussion_r920890689


##########
parquet/src/file/serialized_reader.rs:
##########
@@ -481,6 +513,18 @@ pub struct SerializedPageReader<T: Read> {
 
     // Column chunk type.
     physical_type: Type,
+    // //Page offset index.

Review Comment:
   Thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] ursabot commented on pull request #2044: Support peek_next_page() and skip_next_page in serialized_reader.

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #2044:
URL: https://github.com/apache/arrow-rs/pull/2044#issuecomment-1184744464

   Benchmark runs are scheduled for baseline = 9116297900d8b528c32ab2595efb6d0ffc3b7f39 and contender = 5e520bb69714ee8e6a3171d0bac805b31b3032eb. 5e520bb69714ee8e6a3171d0bac805b31b3032eb is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/74511a51033b46d39e81155969dfe65c...30c61d7b11ed442c9a65270486bf67f4/)
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/6441e0eb3dcd4d8faacc90aa36b526f7...503c8c9db9d546c8a8f83a158aac8bae/)
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/f9d56a78daa3480ca3630f3c56588b0d...dd4decce60314293a06960415eb8b3e6/)
   [Skipped :warning: Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/c8117ca804fc45eaa9b76819eeb2165d...6655abd1dcf74af1a5029e32f7ab280e/)
   Buildkite builds:
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org