You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/05/26 19:37:34 UTC

[GitHub] [spark] JoshRosen commented on a change in pull request #24709: [SPARK-27841][SQL] Improve UTF8String to/fromString()/numBytesForFirstByte() performance

JoshRosen commented on a change in pull request #24709: [SPARK-27841][SQL] Improve UTF8String to/fromString()/numBytesForFirstByte() performance
URL: https://github.com/apache/spark/pull/24709#discussion_r287611422
 
 

 ##########
 File path: common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
 ##########
 @@ -139,9 +139,55 @@ public static UTF8String fromAddress(Object base, long offset, int numBytes) {
    * Creates an UTF8String from String.
    */
   public static UTF8String fromString(String str) {
+    if (str == null) {
+      return null;
+    }
+    byte[] bytes;
+    // Optimization for ASCII strings: construct an exactly-right-sized
+    // array ourselves. CharsetEncoder ends up over-allocating for ASCII
+    // strings, leading to extra memory copies and garbage creation.
+    if (isAscii(str)) {
+      bytes = new byte[str.length()];
+      for (int i = 0; i < str.length(); i++) {
+        bytes[i] = (byte) str.charAt(i);
+      }
+    } else {
+      bytes = str.getBytes(StandardCharsets.UTF_8);
+    }
+    return fromBytes(bytes);
+  }
+
+  /**
+   * Pre-SPARK-27841 version of fromString(), retained for benchmarking purposes.
+   */
+  public static UTF8String fromStringSlow(String str) {
     return str == null ? null : fromBytes(str.getBytes(StandardCharsets.UTF_8));
   }
 
+  /**
+   * Returns true if the given string consists entirely of ASCII characters, false otherwise.
+   */
+  private static boolean isAscii(String str) {
+    for (int i = 0; i < str.length(); i++) {
+      if (str.charAt(i) > 127) { // unsigned comparison
+        return false;
+      }
+    }
+    return true;
+  }
+
+  /**
+   * Returns true if the given UTF8 bytes consist entirely of ASCII characters, false otherwise.
+   */
+  private static boolean isAscii(byte[] bytes) {
 
 Review comment:
   Good idea! I think we can go one step further and move the `& 0x80` out of the loop body and into the final `return` statement, shaving a couple of instructions out of the loop body.
   
   Just pushed a change implementing this.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org