Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/21 18:39:51 UTC

[GitHub] [arrow-rs] jmfiaschi opened a new pull request #1468: feat(557): append row groups to already exist parquet file

jmfiaschi opened a new pull request #1468:
URL: https://github.com/apache/arrow-rs/pull/1468


   # Which issue does this PR close?
   
   Closes #557
   
   
   # What changes are included in this PR?
   
   Adds a `from_chunk` method to `SerializedFileWriter` and `ArrowWriter` to create a writer from the contents of an existing chunk.
   
   # Are there any user-facing changes?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on pull request #1468: feat(557): append row groups to already exist parquet file

alamb commented on pull request #1468:
URL: https://github.com/apache/arrow-rs/pull/1468#issuecomment-1076724796


   Sorry @jmfiaschi  -- we have been backed up on review recently. I'll try and take a peek shortly





[GitHub] [arrow-rs] alamb closed pull request #1468: feat(557): append row groups to already exist parquet file

alamb closed pull request #1468:
URL: https://github.com/apache/arrow-rs/pull/1468


   





[GitHub] [arrow-rs] jmfiaschi commented on pull request #1468: feat(557): append row groups to already exist parquet file

jmfiaschi commented on pull request #1468:
URL: https://github.com/apache/arrow-rs/pull/1468#issuecomment-1079406293


   I don't know. I saw an append mode, but it seems to create a new file. OK, I think we can close this issue and PR. I'm still a junior with parquet and continuing to learn ^^. Thank you for your responsiveness :)
   





[GitHub] [arrow-rs] jmfiaschi commented on a change in pull request #1468: feat(557): append row groups to already exist parquet file

jmfiaschi commented on a change in pull request #1468:
URL: https://github.com/apache/arrow-rs/pull/1468#discussion_r835578005



##########
File path: parquet/src/arrow/arrow_writer.rs
##########
@@ -198,6 +201,43 @@ impl<W: 'static + ParquetWriter> ArrowWriter<W> {
     }
 }
 
+impl<W: 'static + ParquetWriter> ArrowWriter<W> {
+    /// Try to create a new Arrow writer to append data to an existing parquet file without reading the entire file.

Review comment:
       Okay, the name `from_chunk` is confusing. With this library it's not possible to work on a chunk of the file, because the cursor `W` needs the entire file data in memory in order to work with the metadata (the metadata contains the position of the last row group). I found documentation saying that with parquet, the term "chunk" is used more when you have multiple files in a directory and each file is considered a chunk. It's better to use `from_file` in order to avoid any misunderstanding ^^.
   
   Yes, for example `SerializedFileWriter::from_chunk` reads from a `SliceableCursor` and writes into an `InMemoryWriteableCursor`, while `SerializedFileWriter::new` writes directly into an `InMemoryWriteableCursor`.
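   The reader/writer cursor split mentioned here can be sketched with plain `std::io` types as stand-ins for the crate's `SliceableCursor` and `InMemoryWriteableCursor`. This is an illustration, not the real `parquet` crate API, and it assumes a simplified trailer layout of `| data | footer | footer_len (le u32) | MAGIC |`; the point is that the metadata lives at the end of the file, which is why the reader needs random access (`Seek`):
   
   ```rust
   use std::io::{Cursor, Read, Seek, SeekFrom, Write};
   
   /// Copy everything except the trailing footer from `reader` into `writer`,
   /// returning the footer length. The footer sits at the END of the file,
   /// so the reader must support seeking before any data can be copied.
   fn copy_without_footer<R, W>(reader: &mut R, writer: &mut W) -> std::io::Result<usize>
   where
       R: Read + Seek,
       W: Write,
   {
       // Total file size, then jump back 8 bytes (4-byte length + 4-byte magic).
       let end = reader.seek(SeekFrom::End(0))?;
       reader.seek(SeekFrom::End(-8))?;
       let mut len_bytes = [0u8; 4];
       reader.read_exact(&mut len_bytes)?;
       let footer_len = u32::from_le_bytes(len_bytes) as usize;
   
       // Rewind and copy only | header | data |, leaving room to append.
       reader.seek(SeekFrom::Start(0))?;
       let data_len = end as usize - 8 - footer_len;
       let mut data = vec![0u8; data_len];
       reader.read_exact(&mut data)?;
       writer.write_all(&data)?;
       Ok(footer_len)
   }
   ```
   
   A new writer built this way starts from `| header | data |` and can append further row groups before writing a fresh footer on close.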







[GitHub] [arrow-rs] viirya commented on a change in pull request #1468: feat(557): append row groups to already exist parquet file

viirya commented on a change in pull request #1468:
URL: https://github.com/apache/arrow-rs/pull/1468#discussion_r833670627



##########
File path: parquet/src/arrow/arrow_writer.rs
##########
@@ -772,6 +813,74 @@ mod tests {
         }
     }
 
+    #[test]
+    fn arrow_writer_append_data_to_existing_file() {
+        let schema = Arc::new(Schema::new(vec![
+            Field::new("a", DataType::Int32, false),
+            Field::new("b", DataType::Int64, true),
+        ]));
+
+        let a = Int32Array::from(vec![1]);
+        let b = Int64Array::from(vec![Some(1)]);
+
+        let batch =
+            RecordBatch::try_new(schema.clone(), vec![Arc::new(a), Arc::new(b)]).unwrap();
+        let output_cursor = InMemoryWriteableCursor::default();
+
+        {
+            let mut writer =
+                ArrowWriter::try_new(output_cursor.clone(), schema.clone(), None)
+                    .unwrap();
+            writer.write(&batch).unwrap();
+            writer.close().unwrap();
+        }
+
+        // Append new data to the chunk cursor
+        let chunk_cursor = SliceableCursor::new(output_cursor.into_inner().unwrap());
+
+        let a = Int32Array::from(vec![2]);
+        let b = Int64Array::from(vec![None]);
+
+        let batch =
+            RecordBatch::try_new(schema.clone(), vec![Arc::new(a), Arc::new(b)]).unwrap();
+
+        let output_cursor = InMemoryWriteableCursor::default();

Review comment:
       I think you are writing to another cursor, not appending to the existing one.

##########
File path: parquet/src/file/writer.rs
##########
@@ -159,6 +164,48 @@ impl<W: ParquetWriter> SerializedFileWriter<W> {
         })
     }
 
+    /// Create new file writer from chunk file
+    pub fn from_chunk<R: ChunkReader>(
+        chunk: R,
+        mut buf: W,
+        schema: TypePtr,
+        properties: WriterPropertiesPtr,
+    ) -> Result<Self> {

Review comment:
       This looks like it reads data from the chunk and writes it out through a `ParquetWriter`. They are different files, right? The original issue was created for appending to an existing parquet file, so I'm not sure this can address the original issue?

##########
File path: parquet/src/arrow/arrow_writer.rs
##########
@@ -198,6 +201,43 @@ impl<W: 'static + ParquetWriter> ArrowWriter<W> {
     }
 }
 
+impl<W: 'static + ParquetWriter> ArrowWriter<W> {
+    /// Try to create a new Arrow writer to append data to an existing parquet file without reading the entire file.

Review comment:
       And it seems `SerializedFileWriter::from_chunk` will still read the data from the existing file and write it to another file?

##########
File path: parquet/src/arrow/arrow_writer.rs
##########
@@ -198,6 +201,43 @@ impl<W: 'static + ParquetWriter> ArrowWriter<W> {
     }
 }
 
+impl<W: 'static + ParquetWriter> ArrowWriter<W> {
+    /// Try to create a new Arrow writer to append data to an existing parquet file without reading the entire file.

Review comment:
       Hmm, if the chunk is not for the entire file, how could you get the metadata?
   







[GitHub] [arrow-rs] jmfiaschi commented on pull request #1468: feat(557): append row groups to already exist parquet file

jmfiaschi commented on pull request #1468:
URL: https://github.com/apache/arrow-rs/pull/1468#issuecomment-1076558235


   Hello, could someone check my PR? @viirya @sunchao @tustvold @alamb





[GitHub] [arrow-rs] jmfiaschi commented on a change in pull request #1468: feat(557): append row groups to already exist parquet file

jmfiaschi commented on a change in pull request #1468:
URL: https://github.com/apache/arrow-rs/pull/1468#discussion_r835581814



##########
File path: parquet/src/arrow/arrow_writer.rs
##########
@@ -772,6 +813,74 @@ mod tests {
         }
     }
 
+    #[test]
+    fn arrow_writer_append_data_to_existing_file() {
+        let schema = Arc::new(Schema::new(vec![
+            Field::new("a", DataType::Int32, false),
+            Field::new("b", DataType::Int64, true),
+        ]));
+
+        let a = Int32Array::from(vec![1]);
+        let b = Int64Array::from(vec![Some(1)]);
+
+        let batch =
+            RecordBatch::try_new(schema.clone(), vec![Arc::new(a), Arc::new(b)]).unwrap();
+        let output_cursor = InMemoryWriteableCursor::default();
+
+        {
+            let mut writer =
+                ArrowWriter::try_new(output_cursor.clone(), schema.clone(), None)
+                    .unwrap();
+            writer.write(&batch).unwrap();
+            writer.close().unwrap();
+        }
+
+        // Append new data to the chunk cursor
+        let chunk_cursor = SliceableCursor::new(output_cursor.into_inner().unwrap());
+
+        let a = Int32Array::from(vec![2]);
+        let b = Int64Array::from(vec![None]);
+
+        let batch =
+            RecordBatch::try_new(schema.clone(), vec![Arc::new(a), Arc::new(b)]).unwrap();
+
+        let output_cursor = InMemoryWriteableCursor::default();

Review comment:
       Same explanation as before: one cursor, the slice cursor, contains the reader, and the other is used to write the data:
   reader (file data with footer) => | header | data | footer |
   writer => | header | data |
   
   The next write() will append the data to the writer:
   writer => | header | data | next_row_group |
   ... etc.
   close() => | header | data | row_groups | footer |
   
   We don't touch the previous data.
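   The layout manipulation sketched above can be modeled at the byte level. The following is a toy illustration, not the real `parquet` crate API, and it assumes a simplified trailer of `| footer | footer_len (le u32) | MAGIC |` (the real format stores thrift-encoded metadata there):
   
   ```rust
   // Toy model: | MAGIC | row groups... | footer | footer_len (le u32) | MAGIC |
   const MAGIC: &[u8; 4] = b"PAR1";
   
   /// Build a complete "file" from row-group payloads and a footer blob.
   fn write_file(row_groups: &[&[u8]], footer: &[u8]) -> Vec<u8> {
       let mut out = Vec::new();
       out.extend_from_slice(MAGIC);
       for rg in row_groups {
           out.extend_from_slice(rg);
       }
       out.extend_from_slice(footer);
       out.extend_from_slice(&(footer.len() as u32).to_le_bytes());
       out.extend_from_slice(MAGIC);
       out
   }
   
   /// Split an existing file into (header + row groups, footer), i.e. drop
   /// the trailing footer so new row groups can be appended after the data.
   fn strip_footer(file: &[u8]) -> (Vec<u8>, Vec<u8>) {
       let n = file.len();
       let mut len_bytes = [0u8; 4];
       len_bytes.copy_from_slice(&file[n - 8..n - 4]);
       let footer_len = u32::from_le_bytes(len_bytes) as usize;
       let footer_start = n - 8 - footer_len;
       (file[..footer_start].to_vec(), file[footer_start..n - 8].to_vec())
   }
   
   /// Append one more row group and seal the file with a new footer,
   /// leaving the previously written data untouched.
   fn append_row_group(file: &[u8], row_group: &[u8], new_footer: &[u8]) -> Vec<u8> {
       let (mut data, _old_footer) = strip_footer(file);
       data.extend_from_slice(row_group);
       data.extend_from_slice(new_footer);
       data.extend_from_slice(&(new_footer.len() as u32).to_le_bytes());
       data.extend_from_slice(MAGIC);
       data
   }
   ```
   
   This mirrors the comment's point: appending only drops and rewrites the trailer; the existing header and row-group bytes are never modified.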







[GitHub] [arrow-rs] viirya commented on pull request #1468: feat(557): append row groups to already exist parquet file

viirya commented on pull request #1468:
URL: https://github.com/apache/arrow-rs/pull/1468#issuecomment-1081137857


   Thank you @jmfiaschi! I'm also learning here too!





[GitHub] [arrow-rs] alamb commented on pull request #1468: feat(557): append row groups to already exist parquet file

alamb commented on pull request #1468:
URL: https://github.com/apache/arrow-rs/pull/1468#issuecomment-1081133541


   > I don't know. I saw an append mode, but it seems to create a new file. OK, I think we can close this issue and PR. I'm still a junior with parquet and continuing to learn ^^. Thank you for your responsiveness :)
   
   Thank you @jmfiaschi  -- we are all learning here!
   
   





[GitHub] [arrow-rs] jmfiaschi commented on a change in pull request #1468: feat(557): append row groups to already exist parquet file

jmfiaschi commented on a change in pull request #1468:
URL: https://github.com/apache/arrow-rs/pull/1468#discussion_r835572112



##########
File path: parquet/src/file/writer.rs
##########
@@ -159,6 +164,48 @@ impl<W: ParquetWriter> SerializedFileWriter<W> {
         })
     }
 
+    /// Create new file writer from chunk file
+    pub fn from_chunk<R: ChunkReader>(
+        chunk: R,
+        mut buf: W,
+        schema: TypePtr,
+        properties: WriterPropertiesPtr,
+    ) -> Result<Self> {

Review comment:
       Hello, thank you for your feedback. The original issue takes a `ParquetWriter` + `ChunkReader` as a writer, but we don't have a writer and a reader object at the same time. For me, we start with the file reader, build a new writer without the footer, and then start appending with the write() method.



