You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/31 10:33:43 UTC

[GitHub] [arrow-rs] alamb commented on a change in pull request #381: Respect max rowgroup size in Arrow writer

alamb commented on a change in pull request #381:
URL: https://github.com/apache/arrow-rs/pull/381#discussion_r642388141



##########
File path: parquet/src/arrow/arrow_writer.rs
##########
@@ -1176,31 +1236,51 @@ mod tests {
         let raw_values: Vec<_> = (0..SMALL_SIZE as i64).collect();
         let values = Arc::new(TimestampSecondArray::from_vec(raw_values, None));
 
-        one_column_roundtrip("timestamp_second_single_column", values, false);
+        one_column_roundtrip(

Review comment:
       it might be worth at least one test that divides into "more than 2" batches as well. 

##########
File path: parquet/src/arrow/arrow_writer.rs
##########
@@ -87,17 +92,31 @@ impl<W: 'static + ParquetWriter> ArrowWriter<W> {
                 "Record batch schema does not match writer schema".to_string(),
             ));
         }
-        // compute the definition and repetition levels of the batch
-        let batch_level = LevelInfo::new_from_batch(batch);
-        let mut row_group_writer = self.writer.next_row_group()?;
-        for (array, field) in batch.columns().iter().zip(batch.schema().fields()) {
-            let mut levels = batch_level.calculate_array_levels(array, field);
-            // Reverse levels as we pop() them when writing arrays
-            levels.reverse();
-            write_leaves(&mut row_group_writer, array, &mut levels)?;
+        // Track the number of rows being written in the batch.
+        // We currently do not have a way of slicing nested arrays, thus we
+        // track this manually.
+        let num_rows = batch.num_rows();
+        let batches = (num_rows + self.max_row_group_size - 1) / self.max_row_group_size;

Review comment:
       Am I correct in thinking this code could result in non-uniform row group sizes?
   
   like if we had `max_row_group_size=10` and wrote a `RecordBatch` with 25 rows,  would we get row groups like
   
   ```
   (row_group 1: 10 rows)
   (row_group 2: 10 rows)
   (row_group 3: 5 rows)
   (row_group 4: 10 rows)
   (row_group 5: 10 rows)
   (row_group 6: 5 rows)
   ```
   
   ? 
   
   If so I think this is fine (in that it is technically respecting `max_row_group` but it might be unexpected from a user,  who might expect something more like
   
   ```
   (row_group 1: 10 rows)
   (row_group 2: 10 rows)
   (row_group 3: 10 rows)
   (row_group 4: 10 rows)
   (row_group 5: 10 rows)
   ```
   
   Perhaps it is worth a doc comment?
   

##########
File path: parquet/src/arrow/levels.rs
##########
@@ -748,6 +784,8 @@ mod tests {
             array_mask: vec![true, true], // both lists defined
             max_definition: 0,
             level_type: LevelType::Root,
+            offset: 0,

Review comment:
       It might be cool to add a test here where the offset was something other than `0` -- all the examples I see have `offset: 0`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org