Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/23 20:01:20 UTC

[GitHub] [arrow-rs] viirya commented on a change in pull request #1468: feat(557): append row groups to already exist parquet file

viirya commented on a change in pull request #1468:
URL: https://github.com/apache/arrow-rs/pull/1468#discussion_r833670627



##########
File path: parquet/src/arrow/arrow_writer.rs
##########
@@ -772,6 +813,74 @@ mod tests {
         }
     }
 
+    #[test]
+    fn arrow_writer_append_data_to_existing_file() {
+        let schema = Arc::new(Schema::new(vec![
+            Field::new("a", DataType::Int32, false),
+            Field::new("b", DataType::Int64, true),
+        ]));
+
+        let a = Int32Array::from(vec![1]);
+        let b = Int64Array::from(vec![Some(1)]);
+
+        let batch =
+            RecordBatch::try_new(schema.clone(), vec![Arc::new(a), Arc::new(b)]).unwrap();
+        let output_cursor = InMemoryWriteableCursor::default();
+
+        {
+            let mut writer =
+                ArrowWriter::try_new(output_cursor.clone(), schema.clone(), None)
+                    .unwrap();
+            writer.write(&batch).unwrap();
+            writer.close().unwrap();
+        }
+
+        // Append new data to the chunk cursor
+        let chunk_cursor = SliceableCursor::new(output_cursor.into_inner().unwrap());
+
+        let a = Int32Array::from(vec![2]);
+        let b = Int64Array::from(vec![None]);
+
+        let batch =
+            RecordBatch::try_new(schema.clone(), vec![Arc::new(a), Arc::new(b)]).unwrap();
+
+        let output_cursor = InMemoryWriteableCursor::default();

Review comment:
       I think you are writing to another cursor, not appending to the existing one.
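
For reference, a minimal sketch of what true append semantics would look like in this test, reusing only the types already present above (`InMemoryWriteableCursor`, `SliceableCursor`); the second write would need to extend the bytes of the first file rather than start from an empty cursor:

```rust
// Sketch only: the finished file's bytes become the chunk source.
let existing_bytes = output_cursor.into_inner().unwrap();
let chunk_cursor = SliceableCursor::new(existing_bytes);

// What the test does instead: a brand-new, empty cursor. Anything written
// here is a second file seeded from the first, not the original file with
// extra row groups appended to it.
let output_cursor = InMemoryWriteableCursor::default();
```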

##########
File path: parquet/src/file/writer.rs
##########
@@ -159,6 +164,48 @@ impl<W: ParquetWriter> SerializedFileWriter<W> {
         })
     }
 
+    /// Create a new file writer from an existing file chunk
+    pub fn from_chunk<R: ChunkReader>(
+        chunk: R,
+        mut buf: W,
+        schema: TypePtr,
+        properties: WriterPropertiesPtr,
+    ) -> Result<Self> {

Review comment:
       This looks like it reads data from the chunk and writes it out through a `ParquetWriter`. Those are two different files, right? The original issue was created for appending to an existing parquet file, so I'm not sure this addresses it.
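
For background on why in-place append needs footer handling: a Parquet file ends with the serialized file metadata, followed by a 4-byte little-endian metadata length and the magic bytes `PAR1`. Appending row groups therefore means writing the new row groups where the old footer starts and then writing a merged footer. A minimal std-only sketch of locating that offset (`footer_start` is a hypothetical helper, not part of this PR):

```rust
use std::convert::TryInto;
use std::io::{Read, Seek, SeekFrom};

/// Hypothetical helper (not part of this PR): find the byte offset where the
/// existing footer begins, i.e. where appended row groups would be written.
fn footer_start<R: Read + Seek>(file: &mut R) -> std::io::Result<u64> {
    let file_len = file.seek(SeekFrom::End(0))?;
    let mut trailer = [0u8; 8]; // 4-byte metadata length + b"PAR1" magic
    file.seek(SeekFrom::End(-8))?;
    file.read_exact(&mut trailer)?;
    assert_eq!(trailer[4..], *b"PAR1", "not a parquet file");
    let metadata_len = u32::from_le_bytes(trailer[..4].try_into().unwrap()) as u64;
    // The old footer and trailer occupy the last `metadata_len + 8` bytes.
    Ok(file_len - 8 - metadata_len)
}
```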

##########
File path: parquet/src/arrow/arrow_writer.rs
##########
@@ -198,6 +201,43 @@ impl<W: 'static + ParquetWriter> ArrowWriter<W> {
     }
 }
 
+impl<W: 'static + ParquetWriter> ArrowWriter<W> {
+    /// Try to create a new Arrow writer to append data to an existing parquet file without reading the entire file.

Review comment:
       And it seems `SerializedFileWriter::from_chunk` will still read the data from the existing file and write it to another file?
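
Going only by the signature quoted above, the call shape would look roughly like this (a hedged sketch; `first_file_bytes`, `schema`, and `props` are placeholders, not values from the PR):

```rust
// Based only on the signature shown above, not a confirmed example from the PR.
let chunk = SliceableCursor::new(first_file_bytes); // source: the finished file
let sink = InMemoryWriteableCursor::default();      // destination: a second, empty file
let mut writer = SerializedFileWriter::from_chunk(chunk, sink, schema, props)?;
// Row groups written through `writer` land in `sink`: the original bytes are
// copied forward into a new file rather than extended in place.
```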

##########
File path: parquet/src/arrow/arrow_writer.rs
##########
@@ -198,6 +201,43 @@ impl<W: 'static + ParquetWriter> ArrowWriter<W> {
     }
 }
 
+impl<W: 'static + ParquetWriter> ArrowWriter<W> {
+    /// Try to create a new Arrow writer to append data to an existing parquet file without reading the entire file.

Review comment:
       Hmm, if the chunk does not cover the entire file, how could you get the metadata?
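
The footer is the only place the file-level metadata lives, so any chunk used to seed a writer has to include the end of the file. A small sketch, assuming the parquet crate's footer API of this era (`parquet::file::footer::parse_metadata`):

```rust
use parquet::errors::Result;
use parquet::file::footer::parse_metadata;
use parquet::file::reader::ChunkReader;

// Sketch: metadata can only be recovered if `chunk` includes the footer at
// the end of the file; a partial chunk without it will fail to parse.
fn row_group_count<R: ChunkReader>(chunk: &R) -> Result<usize> {
    let metadata = parse_metadata(chunk)?;
    Ok(metadata.num_row_groups())
}
```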
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org