You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "wjones127 (via GitHub)" <gi...@apache.org> on 2023/06/09 05:25:58 UTC

[GitHub] [arrow-rs] wjones127 commented on a diff in pull request #4389: Truncate Min/Max values in the Colum Index

wjones127 commented on code in PR #4389:
URL: https://github.com/apache/arrow-rs/pull/4389#discussion_r1223844119


##########
parquet/src/column/writer/mod.rs:
##########
@@ -2220,6 +2228,66 @@ mod tests {
         );
     }
 
+    /// Verify min/max value truncation iin the column index works as expected

Review Comment:
   ```suggestion
       /// Verify min/max value truncation in the column index works as expected
   ```



##########
parquet/src/column/writer/mod.rs:
##########
@@ -683,6 +683,14 @@ impl<'a, E: ColumnValueEncoder> GenericColumnWriter<'a, E> {
             .append_row_count(self.page_metrics.num_buffered_rows as i64);
     }
 
+    fn truncate_minmax_value(&self, data: &[u8]) -> Vec<u8> {
+        if let Some(max_len) = self.props.minmax_value_truncate_len() {
+            data[0..usize::min(data.len(), max_len)].to_vec()

Review Comment:
   I'm not sure this is quite what we want. First, I don't think this would give the behavior described in the example:
   
   > For example, instead of storing ""Blart Versenwald III", a writer may set min_values[i]="B", max_values[i]="C". 
   
   https://github.com/apache/parquet-format/blob/master/PageIndex.md#technical-approach
   
   With this implementation, if we truncated to the first byte we would get min and max of `"B"`, right?
   
   Second, slicing a string based on bytes means we could be slicing in the middle of a multi-byte character. That's not strictly an issue for sorting, but could have some annoying consequences when debugging and inspecting statistics.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org