You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/11/16 12:49:52 UTC

[GitHub] [arrow] vertexclique commented on a change in pull request #8673: ARROW-10609: [Rust] Optimize min/max of non null strings

vertexclique commented on a change in pull request #8673:
URL: https://github.com/apache/arrow/pull/8673#discussion_r524238466



##########
File path: rust/arrow/benches/aggregate_kernels.rs
##########
@@ -43,6 +43,25 @@ fn create_array(size: usize, with_nulls: bool) -> ArrayRef {
     Arc::new(builder.finish())
 }
 
+fn create_string_array(size: usize, with_nulls: bool) -> ArrayRef {
+    // use random numbers to avoid spurious compiler optimizations wrt to branching
+    let mut rng = seedable_rng();
+    let mut builder = StringBuilder::new(size);
+
+    for _ in 0..size {
+        if with_nulls && rng.gen::<f32>() > 0.5 {
+            builder.append_null().unwrap();
+        } else {
+            let string = seedable_rng()
+                .sample_iter(&Alphanumeric)
+                .take(10)
+                .collect::<String>();
+            builder.append_value(&string).unwrap();
+        }
+    }
+    Arc::new(builder.finish())
+}

Review comment:
       Can you make the downcasting inside the array create method? That will shave off some excess work.

##########
File path: rust/arrow/src/compute/kernels/aggregate.rs
##########
@@ -32,19 +32,20 @@ fn min_max_string<T: StringOffsetSizeTrait, F: Fn(&str, &str) -> bool>(
     if null_count == array.len() {
         return None;
     }
-    let mut n = "";
-    let mut has_value = false;
     let data = array.data();
-
+    let mut n;
     if null_count == 0 {
-        for i in 0..data.len() {
+        n = array.value(0);
+        for i in 1..data.len() {
             let item = array.value(i);
-            if !has_value || cmp(&n, item) {
-                has_value = true;
+            if cmp(&n, item) {
                 n = item;
             }
         }

Review comment:
       Feels like this code can use fold with your changes now.

##########
File path: rust/arrow/benches/aggregate_kernels.rs
##########
@@ -43,6 +43,25 @@ fn create_array(size: usize, with_nulls: bool) -> ArrayRef {
     Arc::new(builder.finish())
 }
 
+fn create_string_array(size: usize, with_nulls: bool) -> ArrayRef {
+    // use random numbers to avoid spurious compiler optimizations wrt to branching
+    let mut rng = seedable_rng();
+    let mut builder = StringBuilder::new(size);
+
+    for _ in 0..size {
+        if with_nulls && rng.gen::<f32>() > 0.5 {
+            builder.append_null().unwrap();
+        } else {
+            let string = seedable_rng()
+                .sample_iter(&Alphanumeric)
+                .take(10)
+                .collect::<String>();
+            builder.append_value(&string).unwrap();
+        }
+    }
+    Arc::new(builder.finish())
+}

Review comment:
       Would be also nice to change all of these benchmarks to do the same.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org