
Posted to github@arrow.apache.org by "tustvold (via GitHub)" <gi...@apache.org> on 2023/06/12 09:43:11 UTC

[GitHub] [arrow-rs] tustvold commented on a diff in pull request #4399: Faster UTF-8 truncation

tustvold commented on code in PR #4399:
URL: https://github.com/apache/arrow-rs/pull/4399#discussion_r1226382289


##########
parquet/src/column/writer/mod.rs:
##########
@@ -1188,30 +1188,11 @@ fn compare_greater_byte_array_decimals(a: &[u8], b: &[u8]) -> bool {
     (a[1..]) > (b[1..])
 }
 
-/// Truncate a UTF8 slice to the longest prefix that is still a valid UTF8 string, while being less than `length` bytes.
+/// Truncate a UTF8 slice to the longest prefix that is still a valid UTF8 string,
+/// while being less than `length` bytes and non-empty
 fn truncate_utf8(data: &str, length: usize) -> Option<Vec<u8>> {
-    // We return values like that at an earlier stage in the process.
-    assert!(data.len() >= length);
-    let mut char_indices = data.char_indices();
-
-    // We know `data` is a valid UTF8 encoded string, which means it has at least one valid UTF8 byte, which will make this loop exist.
-    while let Some((idx, c)) = char_indices.next_back() {
-        let split_point = idx + c.len_utf8();
-        if split_point <= length {
-            return data.as_bytes()[0..split_point].to_vec().into();
-        }
-    }
-
-    None
-}
-
-/// Truncate a binary slice to make sure its length is less than `length`
-fn truncate_binary(data: &[u8], length: usize) -> Option<Vec<u8>> {

Review Comment:
   This method seemed a touch redundant, so I opted to remove it.
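
For readers following the diff, here is a minimal sketch of how the truncation described in the new doc comment could be written without walking `char_indices`. It is not taken from the PR: the name `truncate_utf8_sketch` is hypothetical, and the approach (searching backwards for a char boundary with the standard library's `str::is_char_boundary`) is an assumption about one reasonable way to do it.

```rust
/// Hypothetical sketch, not the PR's implementation.
/// Returns the longest non-empty prefix of `data` that is valid UTF-8
/// and at most `length` bytes long, or `None` if no such prefix exists.
fn truncate_utf8_sketch(data: &str, length: usize) -> Option<Vec<u8>> {
    // Clamp to the string length so the slice below can never go out of
    // bounds (assumption: unlike the removed code, this sketch does not
    // assert `data.len() >= length`).
    let upper = length.min(data.len());
    // Search backwards from `upper` for the nearest char boundary.
    // Starting the range at 1 rules out an empty prefix.
    let split = (1..=upper).rfind(|&i| data.is_char_boundary(i))?;
    Some(data.as_bytes()[..split].to_vec())
}
```

With this sketch, truncating `"héllo"` to 2 bytes yields `b"h"`, because byte offset 2 falls inside the two-byte `é`, and truncating `"é"` to 1 byte yields `None`, because the only boundary at or below 1 is offset 0.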


