You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/06/24 05:56:14 UTC

[GitHub] [arrow-rs] liukun4515 opened a new pull request, #1935: add column index writer for parquet

liukun4515 opened a new pull request, #1935:
URL: https://github.com/apache/arrow-rs/pull/1935

# Which issue does this PR close?

part of #1777

# Rationale for this change

# What changes are included in this PR?

# Are there any user-facing changes?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] liukun4515 commented on pull request #1935: add column index writer for parquet

Posted by GitBox <gi...@apache.org>.

liukun4515 commented on PR #1935:
URL: https://github.com/apache/arrow-rs/pull/1935#issuecomment-1168389669

   Do we need to add an option to control this feature in the `WriterProperties` struct?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] tustvold merged pull request #1935: add column index writer for parquet

Posted by GitBox <gi...@apache.org>.

tustvold merged PR #1935:
URL: https://github.com/apache/arrow-rs/pull/1935


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] liukun4515 commented on a diff in pull request #1935: add column index writer for parquet

Posted by GitBox <gi...@apache.org>.

liukun4515 commented on code in PR #1935:
URL: https://github.com/apache/arrow-rs/pull/1935#discussion_r909531990


##########
parquet/src/file/writer.rs:
##########
@@ -177,10 +182,66 @@ impl<W: Write> SerializedFileWriter<W> {
         Ok(())
     }
 
+    /// Serialize all the offset index to the file
+    fn write_offset_indexes(&mut self) -> Result<()> {
+        // iter row group
+        // iter each column
+        // write offset index to the file
+        for row_group in &mut self.row_groups {
+            for column_metdata in row_group.columns_mut() {
+                match column_metdata.offset_index() {
+                    Some(offset_index) => {
+                        let start_offset = self.buf.bytes_written();
+                        let mut protocol = TCompactOutputProtocol::new(&mut self.buf);
+                        offset_index.write_to_out_protocol(&mut protocol)?;
+                        protocol.flush()?;
+                        let end_offset = self.buf.bytes_written();
+                        // set offset and index for offset index
+                        column_metdata.set_offset_index_offset(start_offset as i64);
+                        column_metdata
+                            .set_offset_index_length((end_offset - start_offset) as i32);

Review Comment:
   I have thought this.
   If we do and follow above method, we need to a new struct to maintain the `offsetindex` and `columnindex` for each column and each rowgroup.
   
   Maybe we can optimize this in the follow-up pull request.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] liukun4515 commented on a diff in pull request #1935: add column index writer for parquet

Posted by GitBox <gi...@apache.org>.

liukun4515 commented on code in PR #1935:
URL: https://github.com/apache/arrow-rs/pull/1935#discussion_r910504035


##########
parquet/src/column/writer.rs:
##########
@@ -162,6 +164,14 @@ pub fn get_typed_column_writer_mut<'a, 'b: 'a, T: DataType>(
     })
 }
 
+type ColumnCloseResult = (

Review Comment:
   Do it in the follow-up PR



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] liukun4515 commented on a diff in pull request #1935: [WIP] add column index writer for parquet

Posted by GitBox <gi...@apache.org>.

liukun4515 commented on code in PR #1935:
URL: https://github.com/apache/arrow-rs/pull/1935#discussion_r908066758


##########
parquet/src/file/writer.rs:
##########
@@ -85,7 +86,7 @@ pub type OnCloseColumnChunk<'a> =
 /// Callback invoked on closing a row group, arguments are:
 ///
 /// - the row group metadata
-pub type OnCloseRowGroup<'a> = Box<dyn FnOnce(RowGroupMetaDataPtr) -> Result<()> + 'a>;
+pub type OnCloseRowGroup<'a> = Box<dyn FnOnce(RowGroupMetaData) -> Result<()> + 'a>;

Review Comment:
   change from the ptr to the owner of the data.
   We need to change/edit the columnchunk/column_metadata field after write the column/offset index to the file.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1935: add column index writer for parquet

Posted by GitBox <gi...@apache.org>.

tustvold commented on code in PR #1935:
URL: https://github.com/apache/arrow-rs/pull/1935#discussion_r909551523


##########
parquet/src/file/writer.rs:
##########
@@ -177,10 +182,66 @@ impl<W: Write> SerializedFileWriter<W> {
         Ok(())
     }
 
+    /// Serialize all the offset index to the file
+    fn write_offset_indexes(&mut self) -> Result<()> {
+        // iter row group
+        // iter each column
+        // write offset index to the file
+        for row_group in &mut self.row_groups {
+            for column_metdata in row_group.columns_mut() {
+                match column_metdata.offset_index() {
+                    Some(offset_index) => {
+                        let start_offset = self.buf.bytes_written();
+                        let mut protocol = TCompactOutputProtocol::new(&mut self.buf);
+                        offset_index.write_to_out_protocol(&mut protocol)?;
+                        protocol.flush()?;
+                        let end_offset = self.buf.bytes_written();
+                        // set offset and index for offset index
+                        column_metdata.set_offset_index_offset(start_offset as i64);
+                        column_metdata
+                            .set_offset_index_length((end_offset - start_offset) as i32);

Review Comment:
   I think I would prefer if we can avoid this being a breaking API change, which this PR currently is, if possible. I can't see an obvious reason why it needs to be



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] liukun4515 commented on a diff in pull request #1935: add column index writer for parquet

Posted by GitBox <gi...@apache.org>.

liukun4515 commented on code in PR #1935:
URL: https://github.com/apache/arrow-rs/pull/1935#discussion_r909684959


##########
parquet/src/file/writer.rs:
##########
@@ -177,10 +182,66 @@ impl<W: Write> SerializedFileWriter<W> {
         Ok(())
     }
 
+    /// Serialize all the offset index to the file
+    fn write_offset_indexes(&mut self) -> Result<()> {
+        // iter row group
+        // iter each column
+        // write offset index to the file
+        for row_group in &mut self.row_groups {
+            for column_metdata in row_group.columns_mut() {
+                match column_metdata.offset_index() {
+                    Some(offset_index) => {
+                        let start_offset = self.buf.bytes_written();
+                        let mut protocol = TCompactOutputProtocol::new(&mut self.buf);
+                        offset_index.write_to_out_protocol(&mut protocol)?;
+                        protocol.flush()?;
+                        let end_offset = self.buf.bytes_written();
+                        // set offset and index for offset index
+                        column_metdata.set_offset_index_offset(start_offset as i64);
+                        column_metdata
+                            .set_offset_index_length((end_offset - start_offset) as i32);

Review Comment:
   But I would prefer to modify the column metadata before `to_thrift`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1935: add column index writer for parquet

Posted by GitBox <gi...@apache.org>.

tustvold commented on code in PR #1935:
URL: https://github.com/apache/arrow-rs/pull/1935#discussion_r909551523


##########
parquet/src/file/writer.rs:
##########
@@ -177,10 +182,66 @@ impl<W: Write> SerializedFileWriter<W> {
         Ok(())
     }
 
+    /// Serialize all the offset index to the file
+    fn write_offset_indexes(&mut self) -> Result<()> {
+        // iter row group
+        // iter each column
+        // write offset index to the file
+        for row_group in &mut self.row_groups {
+            for column_metdata in row_group.columns_mut() {
+                match column_metdata.offset_index() {
+                    Some(offset_index) => {
+                        let start_offset = self.buf.bytes_written();
+                        let mut protocol = TCompactOutputProtocol::new(&mut self.buf);
+                        offset_index.write_to_out_protocol(&mut protocol)?;
+                        protocol.flush()?;
+                        let end_offset = self.buf.bytes_written();
+                        // set offset and index for offset index
+                        column_metdata.set_offset_index_offset(start_offset as i64);
+                        column_metdata
+                            .set_offset_index_length((end_offset - start_offset) as i32);

Review Comment:
   I think I would prefer if we can avoid this being a breaking API change, which this currently is, if possible



##########
parquet/src/file/writer.rs:
##########
@@ -177,10 +182,66 @@ impl<W: Write> SerializedFileWriter<W> {
         Ok(())
     }
 
+    /// Serialize all the offset index to the file
+    fn write_offset_indexes(&mut self) -> Result<()> {
+        // iter row group
+        // iter each column
+        // write offset index to the file
+        for row_group in &mut self.row_groups {
+            for column_metdata in row_group.columns_mut() {
+                match column_metdata.offset_index() {
+                    Some(offset_index) => {
+                        let start_offset = self.buf.bytes_written();
+                        let mut protocol = TCompactOutputProtocol::new(&mut self.buf);
+                        offset_index.write_to_out_protocol(&mut protocol)?;
+                        protocol.flush()?;
+                        let end_offset = self.buf.bytes_written();
+                        // set offset and index for offset index
+                        column_metdata.set_offset_index_offset(start_offset as i64);
+                        column_metdata
+                            .set_offset_index_length((end_offset - start_offset) as i32);

Review Comment:
   I think I would prefer if we can avoid this being a breaking API change, which this PR currently is, if possible



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1935: add column index writer for parquet

Posted by GitBox <gi...@apache.org>.

tustvold commented on code in PR #1935:
URL: https://github.com/apache/arrow-rs/pull/1935#discussion_r908324061


##########
parquet/src/column/writer.rs:
##########
@@ -2068,6 +2129,80 @@ mod tests {
         ),);
     }
 
+    #[test]
+    fn test_column_offset_index_metadata() {
+        // write data
+        // and check the offset index and column index
+        let page_writer = get_test_page_writer();
+        let props = Arc::new(WriterProperties::builder().build());
+        let mut writer = get_test_column_writer::<Int32Type>(page_writer, 0, 0, props);
+        writer.write_batch(&[1, 2, 3, 4], None, None).unwrap();
+        // first page
+        writer.flush_data_pages().unwrap();
+        // second page
+        writer.write_batch(&[4, 8, 2, -5], None, None).unwrap();
+
+        let (_, rows_written, metadata) = writer.close().unwrap();
+        let column_index = match metadata.column_index() {
+            None => {
+                panic!("Can't fine the column index");
+            }
+            Some(column_index) => column_index,
+        };
+        let offset_index = match metadata.offset_index() {
+            None => {
+                panic!("Can't find the offset index");
+            }
+            Some(offset_index) => offset_index,
+        };
+
+        assert_eq!(8, rows_written);
+
+        // column index
+        assert_eq!(2, column_index.null_pages.len());
+        assert_eq!(2, offset_index.page_locations.len());
+        assert_eq!(BoundaryOrder::Unordered, column_index.boundary_order);
+        for idx in 0..2 {
+            assert_eq!(&false, column_index.null_pages.get(idx).unwrap());

Review Comment:
   ```suggestion
               assert!(!column_index.null_pages[idx]);
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] liukun4515 commented on a diff in pull request #1935: [WIP] add column index writer for parquet

Posted by GitBox <gi...@apache.org>.

liukun4515 commented on code in PR #1935:
URL: https://github.com/apache/arrow-rs/pull/1935#discussion_r908068776


##########
parquet/src/file/writer.rs:
##########
@@ -339,11 +400,11 @@ impl<'a, W: Write> SerializedRowGroupWriter<'a, W> {
                 .set_num_rows(self.total_rows_written.unwrap_or(0) as i64)
                 .build()?;
 
-            let metadata = Arc::new(row_group_metadata);
-            self.row_group_metadata = Some(metadata.clone());
+            let clone_row_group_metadata = row_group_metadata.clone();
+            self.row_group_metadata = Some(Arc::new(row_group_metadata));

Review Comment:
   It is not a good practice, we can optimize it by using the Mutex with Arc



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] liukun4515 commented on pull request #1935: add column index writer for parquet

Posted by GitBox <gi...@apache.org>.

liukun4515 commented on PR #1935:
URL: https://github.com/apache/arrow-rs/pull/1935#issuecomment-1168367038

   @Ted-Jiang PTAL


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] tustvold commented on pull request #1935: add column index writer for parquet

Posted by GitBox <gi...@apache.org>.

tustvold commented on PR #1935:
URL: https://github.com/apache/arrow-rs/pull/1935#issuecomment-1171008180

   Thank you :1st_place_medal: 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] liukun4515 commented on a diff in pull request #1935: [WIP] add column index writer for parquet

Posted by GitBox <gi...@apache.org>.

liukun4515 commented on code in PR #1935:
URL: https://github.com/apache/arrow-rs/pull/1935#discussion_r908068776


##########
parquet/src/file/writer.rs:
##########
@@ -339,11 +400,11 @@ impl<'a, W: Write> SerializedRowGroupWriter<'a, W> {
                 .set_num_rows(self.total_rows_written.unwrap_or(0) as i64)
                 .build()?;
 
-            let metadata = Arc::new(row_group_metadata);
-            self.row_group_metadata = Some(metadata.clone());
+            let clone_row_group_metadata = row_group_metadata.clone();
+            self.row_group_metadata = Some(Arc::new(row_group_metadata));

Review Comment:
   It's is not a good practice, we can optimize it by using the Mutex with Arc



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] liukun4515 commented on a diff in pull request #1935: add column index writer for parquet

Posted by GitBox <gi...@apache.org>.

liukun4515 commented on code in PR #1935:
URL: https://github.com/apache/arrow-rs/pull/1935#discussion_r908500212


##########
parquet/src/file/writer.rs:
##########
@@ -339,11 +400,11 @@ impl<'a, W: Write> SerializedRowGroupWriter<'a, W> {
                 .set_num_rows(self.total_rows_written.unwrap_or(0) as i64)
                 .build()?;
 
-            let metadata = Arc::new(row_group_metadata);
-            self.row_group_metadata = Some(metadata.clone());
+            let clone_row_group_metadata = row_group_metadata.clone();
+            self.row_group_metadata = Some(Arc::new(row_group_metadata));

Review Comment:
   > I think I would prefer we clone at the location that the mutation happens, i.e. ParquetWriter, as opposed to modifying this interface
   
   Yes, the current implementation is not change the interface, but the clone may have some cost.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1935: add column index writer for parquet

Posted by GitBox <gi...@apache.org>.

tustvold commented on code in PR #1935:
URL: https://github.com/apache/arrow-rs/pull/1935#discussion_r908320788


##########
parquet/src/file/metadata.rs:
##########
@@ -386,6 +394,9 @@ pub struct ColumnChunkMetaData {
     offset_index_length: Option<i32>,
     column_index_offset: Option<i64>,
     column_index_length: Option<i32>,
+    // column index and offset index
+    column_index: Option<ColumnIndex>,
+    offset_index: Option<OffsetIndex>,

Review Comment:
   I'm not sure about this, as it conflates the file-level metadata with the index metadata which is stored separately. It also leaks a type from parquet-format out into the public interface, which typically this crate avoids doing.
   
   I'm not entirely sure why this needs to be here?



##########
parquet/src/file/writer.rs:
##########
@@ -177,10 +182,66 @@ impl<W: Write> SerializedFileWriter<W> {
         Ok(())
     }
 
+    /// Serialize all the offset index to the file
+    fn write_offset_indexes(&mut self) -> Result<()> {
+        // iter row group
+        // iter each column
+        // write offset index to the file
+        for row_group in &mut self.row_groups {
+            for column_metdata in row_group.columns_mut() {
+                match column_metdata.offset_index() {
+                    Some(offset_index) => {
+                        let start_offset = self.buf.bytes_written();
+                        let mut protocol = TCompactOutputProtocol::new(&mut self.buf);
+                        offset_index.write_to_out_protocol(&mut protocol)?;
+                        protocol.flush()?;
+                        let end_offset = self.buf.bytes_written();
+                        // set offset and index for offset index
+                        column_metdata.set_offset_index_offset(start_offset as i64);
+                        column_metdata
+                            .set_offset_index_length((end_offset - start_offset) as i32);

Review Comment:
   As part of `SerializedFileWriter::write_metadata` we call `RowGroupMetaData::to_thrift`, perhaps we could store these offsets and apply them directly to the returned thrift message, instead of needing to clone RowGroupMetaData??



##########
parquet/src/column/writer.rs:
##########
@@ -2068,6 +2129,80 @@ mod tests {
         ),);
     }
 
+    #[test]
+    fn test_column_offset_index_metadata() {
+        // write data
+        // and check the offset index and column index
+        let page_writer = get_test_page_writer();
+        let props = Arc::new(WriterProperties::builder().build());
+        let mut writer = get_test_column_writer::<Int32Type>(page_writer, 0, 0, props);
+        writer.write_batch(&[1, 2, 3, 4], None, None).unwrap();
+        // first page
+        writer.flush_data_pages().unwrap();
+        // second page
+        writer.write_batch(&[4, 8, 2, -5], None, None).unwrap();
+
+        let (_, rows_written, metadata) = writer.close().unwrap();
+        let column_index = match metadata.column_index() {
+            None => {
+                panic!("Can't fine the column index");
+            }
+            Some(column_index) => column_index,
+        };
+        let offset_index = match metadata.offset_index() {
+            None => {
+                panic!("Can't find the offset index");
+            }
+            Some(offset_index) => offset_index,
+        };
+
+        assert_eq!(8, rows_written);
+
+        // column index
+        assert_eq!(2, column_index.null_pages.len());
+        assert_eq!(2, offset_index.page_locations.len());
+        assert_eq!(BoundaryOrder::Unordered, column_index.boundary_order);
+        for idx in 0..2 {
+            assert_eq!(&false, column_index.null_pages.get(idx).unwrap());
+            assert_eq!(
+                &0,
+                column_index.null_counts.as_ref().unwrap().get(idx).unwrap()
+            );
+        }
+
+        if let Some(stats) = metadata.statistics() {
+            assert!(stats.has_min_max_set());
+            assert_eq!(stats.null_count(), 0);
+            assert_eq!(stats.distinct_count(), None);
+            if let Statistics::Int32(stats) = stats {
+                // first page is [1,2,3,4]
+                // second page is [-5,2,4,8]
+                assert_eq!(
+                    stats.min_bytes(),
+                    column_index.min_values.get(1).unwrap().as_slice()

Review Comment:
   ```suggestion
                       column_index.min_values[1].as_slice()
   ```



##########
parquet/src/column/writer.rs:
##########
@@ -2068,6 +2129,80 @@ mod tests {
         ),);
     }
 
+    #[test]
+    fn test_column_offset_index_metadata() {
+        // write data
+        // and check the offset index and column index
+        let page_writer = get_test_page_writer();
+        let props = Arc::new(WriterProperties::builder().build());
+        let mut writer = get_test_column_writer::<Int32Type>(page_writer, 0, 0, props);
+        writer.write_batch(&[1, 2, 3, 4], None, None).unwrap();
+        // first page
+        writer.flush_data_pages().unwrap();
+        // second page
+        writer.write_batch(&[4, 8, 2, -5], None, None).unwrap();
+
+        let (_, rows_written, metadata) = writer.close().unwrap();
+        let column_index = match metadata.column_index() {
+            None => {
+                panic!("Can't fine the column index");
+            }
+            Some(column_index) => column_index,
+        };
+        let offset_index = match metadata.offset_index() {
+            None => {
+                panic!("Can't find the offset index");
+            }
+            Some(offset_index) => offset_index,
+        };
+
+        assert_eq!(8, rows_written);
+
+        // column index
+        assert_eq!(2, column_index.null_pages.len());
+        assert_eq!(2, offset_index.page_locations.len());
+        assert_eq!(BoundaryOrder::Unordered, column_index.boundary_order);
+        for idx in 0..2 {
+            assert_eq!(&false, column_index.null_pages.get(idx).unwrap());

Review Comment:
   ```suggestion
               assert!!(!column_index.null_pages[idx]);
   ```



##########
parquet/src/file/writer.rs:
##########
@@ -339,11 +400,11 @@ impl<'a, W: Write> SerializedRowGroupWriter<'a, W> {
                 .set_num_rows(self.total_rows_written.unwrap_or(0) as i64)
                 .build()?;
 
-            let metadata = Arc::new(row_group_metadata);
-            self.row_group_metadata = Some(metadata.clone());
+            let clone_row_group_metadata = row_group_metadata.clone();
+            self.row_group_metadata = Some(Arc::new(row_group_metadata));

Review Comment:
   I think I would prefer we clone at the location that the mutation happens, i.e. ParquetWriter, as opposed to modifying this interface



##########
parquet/src/column/writer.rs:
##########
@@ -2068,6 +2129,80 @@ mod tests {
         ),);
     }
 
+    #[test]
+    fn test_column_offset_index_metadata() {
+        // write data
+        // and check the offset index and column index
+        let page_writer = get_test_page_writer();
+        let props = Arc::new(WriterProperties::builder().build());
+        let mut writer = get_test_column_writer::<Int32Type>(page_writer, 0, 0, props);
+        writer.write_batch(&[1, 2, 3, 4], None, None).unwrap();
+        // first page
+        writer.flush_data_pages().unwrap();
+        // second page
+        writer.write_batch(&[4, 8, 2, -5], None, None).unwrap();
+
+        let (_, rows_written, metadata) = writer.close().unwrap();
+        let column_index = match metadata.column_index() {
+            None => {
+                panic!("Can't fine the column index");
+            }
+            Some(column_index) => column_index,
+        };
+        let offset_index = match metadata.offset_index() {
+            None => {
+                panic!("Can't find the offset index");
+            }
+            Some(offset_index) => offset_index,
+        };
+
+        assert_eq!(8, rows_written);
+
+        // column index
+        assert_eq!(2, column_index.null_pages.len());
+        assert_eq!(2, offset_index.page_locations.len());
+        assert_eq!(BoundaryOrder::Unordered, column_index.boundary_order);
+        for idx in 0..2 {
+            assert_eq!(&false, column_index.null_pages.get(idx).unwrap());
+            assert_eq!(
+                &0,
+                column_index.null_counts.as_ref().unwrap().get(idx).unwrap()
+            );

Review Comment:
   ```suggestion
               assert_eq!(
                   0,
                   *column_index.null_counts.as_ref().unwrap()[idx]
               );
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] liukun4515 commented on a diff in pull request #1935: [WIP] add column index writer for parquet

Posted by GitBox <gi...@apache.org>.

liukun4515 commented on code in PR #1935:
URL: https://github.com/apache/arrow-rs/pull/1935#discussion_r908068424


##########
parquet/src/file/writer.rs:
##########
@@ -339,11 +400,11 @@ impl<'a, W: Write> SerializedRowGroupWriter<'a, W> {
                 .set_num_rows(self.total_rows_written.unwrap_or(0) as i64)
                 .build()?;
 
-            let metadata = Arc::new(row_group_metadata);
-            self.row_group_metadata = Some(metadata.clone());
+            let clone_row_group_metadata = row_group_metadata.clone();
+            self.row_group_metadata = Some(Arc::new(row_group_metadata));

Review Comment:
   Clone the row group metadata and save it which will be used to construct the file metadata.
   Why don't use the ptr of metadata?
   In order to change the field of the column metadata in the row group metadata, we need a mutable data, the Arc::<T>() immutable.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] Ted-Jiang commented on a diff in pull request #1935: add column index writer for parquet

Posted by GitBox <gi...@apache.org>.

Ted-Jiang commented on code in PR #1935:
URL: https://github.com/apache/arrow-rs/pull/1935#discussion_r910581055


##########
parquet/src/file/metadata.rs:
##########
@@ -789,6 +792,107 @@ impl ColumnChunkMetaDataBuilder {
     }
 }
 
+/// Builder for column index
+pub struct ColumnIndexBuilder {
+    null_pages: Vec<bool>,
+    min_values: Vec<Vec<u8>>,
+    max_values: Vec<Vec<u8>>,
+    // TODO: calc the order for all pages in this column

Review Comment:
   👍 this is useful for checking whether use page index



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] tustvold commented on a diff in pull request #1935: add column index writer for parquet

Posted by GitBox <gi...@apache.org>.

tustvold commented on code in PR #1935:
URL: https://github.com/apache/arrow-rs/pull/1935#discussion_r910258249


##########
parquet/src/file/metadata.rs:
##########
@@ -519,6 +527,26 @@ impl ColumnChunkMetaData {
         self.offset_index_length
     }
 
+    /// Set File offset of ColumnChunk's OffsetIndex
+    pub fn set_offset_index_offset(&mut self, offset_index_offset: i64) {
+        self.offset_index_offset = Some(offset_index_offset);
+    }
+
+    /// Set Size of ColumnChunk's OffsetIndex, in bytes
+    pub fn set_offset_index_length(&mut self, offset_index_length: i32) {
+        self.offset_index_length = Some(offset_index_length);
+    }
+
+    /// Set File offset of ColumnChunk's ColumnIndex
+    pub fn set_column_index_offset(&mut self, column_index_offset: i64) {
+        self.column_index_offset = Some(column_index_offset);
+    }
+
+    /// Set Size of ColumnChunk's ColumnIndex, in bytes
+    pub fn set_column_index_length(&mut self, column_index_length: i32) {
+        self.column_index_length = Some(column_index_length);
+    }
+

Review Comment:
   These aren't used anymore



##########
parquet/src/file/metadata.rs:
##########
@@ -247,6 +250,11 @@ impl RowGroupMetaData {
         &self.columns
     }
 
+    /// Returns mut slice of column chunk metadata
+    pub fn columns_mut(&mut self) -> &mut [ColumnChunkMetaData] {

Review Comment:
   This isn't used anymore



##########
parquet/src/column/writer.rs:
##########
@@ -162,6 +164,14 @@ pub fn get_typed_column_writer_mut<'a, 'b: 'a, T: DataType>(
     })
 }
 
+type ColumnCloseResult = (

Review Comment:
   I wonder if we should make this a struct instead, definitely something we could do in a subsequent PR



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] liukun4515 commented on a diff in pull request #1935: add column index writer for parquet

Posted by GitBox <gi...@apache.org>.

liukun4515 commented on code in PR #1935:
URL: https://github.com/apache/arrow-rs/pull/1935#discussion_r909850281


##########
parquet/src/file/metadata.rs:
##########
@@ -386,6 +394,9 @@ pub struct ColumnChunkMetaData {
     offset_index_length: Option<i32>,
     column_index_offset: Option<i64>,
     column_index_length: Option<i32>,
+    // column index and offset index
+    column_index: Option<ColumnIndex>,
+    offset_index: Option<OffsetIndex>,

Review Comment:
   removed



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] liukun4515 commented on pull request #1935: add column index writer for parquet

Posted by GitBox <gi...@apache.org>.

liukun4515 commented on PR #1935:
URL: https://github.com/apache/arrow-rs/pull/1935#issuecomment-1170071119

   @tustvold reset, PTAL


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] liukun4515 commented on a diff in pull request #1935: add column index writer for parquet

Posted by GitBox <gi...@apache.org>.

liukun4515 commented on code in PR #1935:
URL: https://github.com/apache/arrow-rs/pull/1935#discussion_r909848206


##########
parquet/src/file/writer.rs:
##########
@@ -177,10 +182,66 @@ impl<W: Write> SerializedFileWriter<W> {
         Ok(())
     }
 
+    /// Serialize all the offset index to the file
+    fn write_offset_indexes(&mut self) -> Result<()> {
+        // iter row group
+        // iter each column
+        // write offset index to the file
+        for row_group in &mut self.row_groups {
+            for column_metdata in row_group.columns_mut() {
+                match column_metdata.offset_index() {
+                    Some(offset_index) => {
+                        let start_offset = self.buf.bytes_written();
+                        let mut protocol = TCompactOutputProtocol::new(&mut self.buf);
+                        offset_index.write_to_out_protocol(&mut protocol)?;
+                        protocol.flush()?;
+                        let end_offset = self.buf.bytes_written();
+                        // set offset and index for offset index
+                        column_metdata.set_offset_index_offset(start_offset as i64);
+                        column_metdata
+                            .set_offset_index_length((end_offset - start_offset) as i32);

Review Comment:
   I recover some break-api



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] liukun4515 commented on a diff in pull request #1935: add column index writer for parquet

Posted by GitBox <gi...@apache.org>.

liukun4515 commented on code in PR #1935:
URL: https://github.com/apache/arrow-rs/pull/1935#discussion_r909683213


##########
parquet/src/file/writer.rs:
##########
@@ -177,10 +182,66 @@ impl<W: Write> SerializedFileWriter<W> {
         Ok(())
     }
 
+    /// Serialize all the offset index to the file
+    fn write_offset_indexes(&mut self) -> Result<()> {
+        // iter row group
+        // iter each column
+        // write offset index to the file
+        for row_group in &mut self.row_groups {
+            for column_metdata in row_group.columns_mut() {
+                match column_metdata.offset_index() {
+                    Some(offset_index) => {
+                        let start_offset = self.buf.bytes_written();
+                        let mut protocol = TCompactOutputProtocol::new(&mut self.buf);
+                        offset_index.write_to_out_protocol(&mut protocol)?;
+                        protocol.flush()?;
+                        let end_offset = self.buf.bytes_written();
+                        // set offset and index for offset index
+                        column_metdata.set_offset_index_offset(start_offset as i64);
+                        column_metdata
+                            .set_offset_index_length((end_offset - start_offset) as i32);

Review Comment:
   @tustvold PTAL,
   I have change the implementation, remove the column_index and offset_index from the columnmetadata.
   Just return the offset index and column index from the `close` of ColumnWriterImpl, and using the callback to collect all the column index an offset index to the `SerializedFileWriter` before closing the parquet file.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] liukun4515 commented on a diff in pull request #1935: add column index writer for parquet

Posted by GitBox <gi...@apache.org>.

liukun4515 commented on code in PR #1935:
URL: https://github.com/apache/arrow-rs/pull/1935#discussion_r910670247


##########
parquet/src/file/metadata.rs:
##########
@@ -789,6 +792,107 @@ impl ColumnChunkMetaDataBuilder {
     }
 }
 
+/// Builder for column index
+pub struct ColumnIndexBuilder {
+    null_pages: Vec<bool>,
+    min_values: Vec<Vec<u8>>,
+    max_values: Vec<Vec<u8>>,
+    // TODO: calc the order for all pages in this column

Review Comment:
   If the boundaryorder is UNORDERED, we need filter the page one by one.



##########
parquet/src/file/metadata.rs:
##########
@@ -789,6 +792,107 @@ impl ColumnChunkMetaDataBuilder {
     }
 }
 
+/// Builder for column index
+pub struct ColumnIndexBuilder {
+    null_pages: Vec<bool>,
+    min_values: Vec<Vec<u8>>,
+    max_values: Vec<Vec<u8>>,
+    // TODO: calc the order for all pages in this column

Review Comment:
   If the boundaryorder is UNORDERED, we need to filter the page one by one.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] liukun4515 commented on a diff in pull request #1935: add column index writer for parquet

Posted by GitBox <gi...@apache.org>.

liukun4515 commented on code in PR #1935:
URL: https://github.com/apache/arrow-rs/pull/1935#discussion_r910669550


##########
parquet/src/file/metadata.rs:
##########
@@ -789,6 +792,107 @@ impl ColumnChunkMetaDataBuilder {
     }
 }
 
+/// Builder for column index
+pub struct ColumnIndexBuilder {
+    null_pages: Vec<bool>,
+    min_values: Vec<Vec<u8>>,
+    max_values: Vec<Vec<u8>>,
+    // TODO: calc the order for all pages in this column

Review Comment:
   > 👍 this is useful for checking whether use page index
   
   If the data in the pages are ordered by ascend or descend, we can use the `binary search` to accelerate the page filter.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] codecov-commenter commented on pull request #1935: [WIP] add column index writer for parquet

Posted by GitBox <gi...@apache.org>.

codecov-commenter commented on PR #1935:
URL: https://github.com/apache/arrow-rs/pull/1935#issuecomment-1165237542

   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1935?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#1935](https://codecov.io/gh/apache/arrow-rs/pull/1935?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (04f164a) into [master](https://codecov.io/gh/apache/arrow-rs/commit/9059cbf0abe07f64ae69183bebe5e1ec78bde738?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (9059cbf) will **increase** coverage by `0.00%`.
   > The diff coverage is `27.27%`.
   
   ```diff
   @@           Coverage Diff           @@
   ##           master    #1935   +/-   ##
   =======================================
     Coverage   83.41%   83.41%           
   =======================================
     Files         214      222    +8     
     Lines       57004    57059   +55     
   =======================================
   + Hits        47551    47597   +46     
   - Misses       9453     9462    +9     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/1935?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [parquet/src/column/writer.rs](https://codecov.io/gh/apache/arrow-rs/pull/1935/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvY29sdW1uL3dyaXRlci5ycw==) | `91.39% <ø> (ø)` | |
   | [parquet/src/file/metadata.rs](https://codecov.io/gh/apache/arrow-rs/pull/1935/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZmlsZS9tZXRhZGF0YS5ycw==) | `89.14% <0.00%> (-5.98%)` | :arrow_down: |
   | [parquet/src/file/page\_index/index\_writer.rs](https://codecov.io/gh/apache/arrow-rs/pull/1935/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZmlsZS9wYWdlX2luZGV4L2luZGV4X3dyaXRlci5ycw==) | `0.00% <0.00%> (ø)` | |
   | [parquet/src/file/serialized\_reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/1935/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZmlsZS9zZXJpYWxpemVkX3JlYWRlci5ycw==) | `94.95% <ø> (ø)` | |
   | [parquet/src/file/writer.rs](https://codecov.io/gh/apache/arrow-rs/pull/1935/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZmlsZS93cml0ZXIucnM=) | `92.92% <ø> (ø)` | |
   | [parquet/src/basic.rs](https://codecov.io/gh/apache/arrow-rs/pull/1935/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYmFzaWMucnM=) | `91.58% <100.00%> (+0.08%)` | :arrow_up: |
   | [arrow/src/array/builder/buffer\_builder.rs](https://codecov.io/gh/apache/arrow-rs/pull/1935/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2J1aWxkZXIvYnVmZmVyX2J1aWxkZXIucnM=) | `85.49% <0.00%> (-7.31%)` | :arrow_down: |
   | [parquet/src/arrow/array\_reader/mod.rs](https://codecov.io/gh/apache/arrow-rs/pull/1935/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvYXJyYXlfcmVhZGVyL21vZC5ycw==) | `88.23% <0.00%> (-2.54%)` | :arrow_down: |
   | [parquet\_derive/src/parquet\_field.rs](https://codecov.io/gh/apache/arrow-rs/pull/1935/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldF9kZXJpdmUvc3JjL3BhcnF1ZXRfZmllbGQucnM=) | `65.75% <0.00%> (-0.23%)` | :arrow_down: |
   | [arrow/src/array/data.rs](https://codecov.io/gh/apache/arrow-rs/pull/1935/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2RhdGEucnM=) | `84.16% <0.00%> (ø)` | |
   | ... and [13 more](https://codecov.io/gh/apache/arrow-rs/pull/1935/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1935?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/1935?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [9059cbf...04f164a](https://codecov.io/gh/apache/arrow-rs/pull/1935?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org