You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/05/26 07:37:47 UTC

[GitHub] [spark] MaxGekk commented on a change in pull request #24709: [SPARK-27841][SQL] Improve UTF8String to/fromString()/numBytesForFirstByte() performance

MaxGekk commented on a change in pull request #24709: [SPARK-27841][SQL] Improve UTF8String to/fromString()/numBytesForFirstByte() performance
URL: https://github.com/apache/spark/pull/24709#discussion_r287585018
 
 

 ##########
 File path: common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
 ##########
 @@ -139,9 +139,55 @@ public static UTF8String fromAddress(Object base, long offset, int numBytes) {
    * Creates an UTF8String from String.
    */
   public static UTF8String fromString(String str) {
+    if (str == null) {
+      return null;
+    }
+    byte[] bytes;
+    // Optimization for ASCII strings: construct an exactly-right-sized
+    // array ourselves. CharsetEncoder ends up over-allocating for ASCII
+    // strings, leading to extra memory copies and garbage creation.
+    if (isAscii(str)) {
+      bytes = new byte[str.length()];
+      for (int i = 0; i < str.length(); i++) {
+        bytes[i] = (byte) str.charAt(i);
+      }
+    } else {
+      bytes = str.getBytes(StandardCharsets.UTF_8);
+    }
+    return fromBytes(bytes);
+  }
+
+  /**
+   * Pre-SPARK-27841 version of fromString(), retained for benchmarking purposes.
+   */
+  public static UTF8String fromStringSlow(String str) {
     return str == null ? null : fromBytes(str.getBytes(StandardCharsets.UTF_8));
   }
 
+  /**
+   * Returns true if the given string consists entirely of ASCII characters, false otherwise.
+   */
+  private static boolean isAscii(String str) {
+    for (int i = 0; i < str.length(); i++) {
+      if (str.charAt(i) > 127) { // unsigned comparison
+        return false;
+      }
+    }
+    return true;
+  }
+
+  /**
+   * Returns true if the given UTF8 bytes consist entirely of ASCII characters, false otherwise.
+   */
+  private static boolean isAscii(byte[] bytes) {
 
 Review comment:
   It would be interesting to benchmark your implementation against to:
   ```scala
   private static boolean isAscii(byte[] bytes) {
       int length = bytes.length;
       int s = 0;
       for (int i = 0; i < length; i++) {
         s |= bytes[i] & 0x80;
       }
       return s == 0;
     }
   ```
   It doesn't have the problem of branch mispredictions. Maybe it will be faster in general case. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org