You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/05/27 16:01:47 UTC

[GitHub] [spark] kiszk commented on a change in pull request #24709: [SPARK-27841][SQL] Improve UTF8String to/fromString()/numBytesForFirstByte() performance

kiszk commented on a change in pull request #24709: [SPARK-27841][SQL] Improve UTF8String to/fromString()/numBytesForFirstByte() performance
URL: https://github.com/apache/spark/pull/24709#discussion_r287837386
 
 

 ##########
 File path: common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
 ##########
 @@ -139,9 +139,55 @@ public static UTF8String fromAddress(Object base, long offset, int numBytes) {
    * Creates an UTF8String from String.
    */
   public static UTF8String fromString(String str) {
+    if (str == null) {
+      return null;
+    }
+    byte[] bytes;
+    // Optimization for ASCII strings: construct an exactly-right-sized
+    // array ourselves. CharsetEncoder ends up over-allocating for ASCII
+    // strings, leading to extra memory copies and garbage creation.
+    if (isAscii(str)) {
+      bytes = new byte[str.length()];
+      for (int i = 0; i < str.length(); i++) {
+        bytes[i] = (byte) str.charAt(i);
+      }
+    } else {
+      bytes = str.getBytes(StandardCharsets.UTF_8);
+    }
+    return fromBytes(bytes);
+  }
+
+  /**
+   * Pre-SPARK-27841 version of fromString(), retained for benchmarking purposes.
+   */
+  public static UTF8String fromStringSlow(String str) {
     return str == null ? null : fromBytes(str.getBytes(StandardCharsets.UTF_8));
   }
 
+  /**
+   * Returns true if the given string consists entirely of ASCII characters, false otherwise.
+   */
+  private static boolean isAscii(String str) {
+    for (int i = 0; i < str.length(); i++) {
+      if (str.charAt(i) > 127) { // unsigned comparison
+        return false;
+      }
+    }
+    return true;
+  }
+
+  /**
+   * Returns true if the given UTF8 bytes consist entirely of ASCII characters, false otherwise.
+   */
+  private static boolean isAscii(byte[] bytes) {
 
 Review comment:
   I think that the HW branch predictor can reduce misprediction since the direction is mostly the same. If we assume that ASCII case is dominant, the compiler optimizer may apply simdization.
   
   If this code is faster than the original one, how about applying this to `isAscii(String str)`?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org