Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/05/26 01:36:42 UTC

[GitHub] [spark] JoshRosen opened a new pull request #24709: [SPARK-27841][SQL] Improve UTF8String to/fromString()/charwise iteration performance

URL: https://github.com/apache/spark/pull/24709
 
 
   ## What changes were proposed in this pull request?
   
   UTF8String's `fromString()`, `toString()`, and `numBytesForFirstByte()` methods are performance hotspots.
   
   This patch improves the performance of those methods by implementing special-case optimizations for strings which consist entirely of ASCII characters. Early microbenchmarks show a ~2x speedup for `numChars()` (which uses `numBytesForFirstByte()`), a ~20% speedup of `toString()` for short ASCII strings, and a 1.3x-1.9x speedup of `fromString()` for short ASCII strings (with negligible impact for non-ASCII strings).
   
   This assumes that checking whether a string is ASCII is cheap; the optimization may harm performance when non-ASCII strings cannot be quickly identified (e.g. when the first non-ASCII character appears at the very end of a huge string). To allow this optimization path to be disabled, we may be able to add a system property feature flag (a static property rather than a runtime conf, because the overhead of checking the configuration must be minimized: the configuration must be effectively constant for the lifetime of the JVM).
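   As a sketch of the flag idea (the property name below is hypothetical, used only for illustration, and is not an actual Spark configuration):
   
   ```java
   // Hypothetical sketch: "spark.unsafe.asciiFastPath.enabled" is a made-up
   // property name, not a real Spark conf.
   public final class Utf8FastPath {
     // Read once at class-load time, so the value is effectively constant for
     // the lifetime of the JVM and the JIT can eliminate the disabled branch.
     private static final boolean ASCII_FAST_PATH_ENABLED =
         Boolean.parseBoolean(
             System.getProperty("spark.unsafe.asciiFastPath.enabled", "true"));
   
     public static boolean enabled() {
       return ASCII_FAST_PATH_ENABLED;
     }
   }
   ```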
   
   I'll now describe each optimization in turn:
   
   ### charwise iteration / numBytesForFirstByte()
   
   UTF-8 uses between 1 and 4 bytes to encode a character. If we want to quickly determine the character length of a string (`numChars()`) or iterate character-by-character (e.g. to support substring / slice operations) then we need a fast method for determining how many bytes are used to store a codepoint. Spark's `numBytesForFirstByte()` method takes the first byte of a codepoint and returns the number of bytes in the codepoint. For ASCII characters, this is 1 byte.
   
   The current implementation converts the codepoint's first (signed) byte into an unsigned integer (i.e. in the range 0 to 255) and uses it to index into a 256-element lookup table where each position records the number of bytes in the codepoint.
   
   For ASCII characters, we end up indexing into an essentially random location in one half of this array, and this has a performance penalty. Instead, we can simply check whether the signed byte is non-negative (i.e. its highest-order bit is 0): if so, we can short-circuit and return `1` because we know that the character is ASCII.
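   A minimal sketch of this short-circuit (class name and table construction are illustrative; Spark's actual lookup table differs in its details):
   
   ```java
   public final class Utf8Len {
     // 256-entry table indexed by the unsigned value of a codepoint's first byte.
     private static final byte[] BYTES_FOR_FIRST_BYTE = new byte[256];
     static {
       for (int i = 0; i < 256; i++) {
         if (i < 0xC0) BYTES_FOR_FIRST_BYTE[i] = 1;      // ASCII (and, defensively, continuation bytes)
         else if (i < 0xE0) BYTES_FOR_FIRST_BYTE[i] = 2; // 110xxxxx lead byte
         else if (i < 0xF0) BYTES_FOR_FIRST_BYTE[i] = 3; // 1110xxxx lead byte
         else BYTES_FOR_FIRST_BYTE[i] = 4;               // 11110xxx lead byte
       }
     }
   
     public static int numBytesForFirstByte(byte b) {
       // Fast path: a non-negative signed byte has its high bit clear,
       // so it is a single-byte ASCII codepoint.
       if (b >= 0) return 1;
       return BYTES_FOR_FIRST_BYTE[b & 0xFF];
     }
   }
   ```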
   
   With this optimization I saw a ~2x speedup in `numChars()` and I expect it will also benefit other methods, including `substring()`, `indexOf()`, `upper()`, `lower()`, and `trim()`.
   
   ### fromString()
   
   UTF8String.fromString(String) was originally implemented as 
   
   ```java
   str.getBytes(StandardCharsets.UTF_8)
   ```
   
   but this is inefficient for ASCII characters because it ends up over-allocating a byte array and then needs to copy it to trim it. To see this:
   
   ```scala
   // Helper function for measuring total allocations on current thread:
   def measureAllocations(f: => Unit): Unit = {
     import java.lang.management.ManagementFactory
     import com.sun.management.ThreadMXBean
     val threadMxBean = ManagementFactory.getThreadMXBean.asInstanceOf[ThreadMXBean]
     def getAlloc() = threadMxBean.getThreadAllocatedBytes(Thread.currentThread.getId)
     System.gc()
     val allocStart = getAlloc()
     f
     System.gc()
     val allocEnd = getAlloc()
     println(s"${(allocEnd - allocStart) / (1024 * 1024)} MB")
   }
   
   measureAllocations {
     val s = "12345678"
     var i = 0
     while (i < 1000 * 1000) {
       // Use string charset name instead of StandardCharsets.UTF_8
       // so we re-use a cached thread local StringEncoder.
       // See https://psy-lob-saw.blogspot.com/2012/12/encode-utf-8-string-to-bytebuffer-faster.html 
       s.getBytes("UTF-8")
       i += 1
     }
   }
   ```
   
   This 8-character ASCII string can be stored in only 8 bytes for the array slots (plus 16 bytes of object + array header + alignment overhead), so each array should take 24 bytes of space and we'd expect converting 1 million of them to perform roughly 24 MB of allocations.
   
   However, the code above allocates ~3x more memory than expected (~61 MB)! The reason for this is that it allocates an overly-large byte array based on an estimated bytes per character (which is > 1) and then needs to trim it.
   
   However, if we know that a string is ASCII then we can allocate a right-sized byte array ourselves and then can copy each character by casting it to a byte.
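   The encoding fast path can be sketched as follows (class and method names are illustrative, not Spark's actual code):
   
   ```java
   import java.nio.charset.StandardCharsets;
   
   public final class AsciiEncode {
     // True iff every char fits in 7 bits (pure ASCII).
     static boolean isAscii(String s) {
       for (int i = 0; i < s.length(); i++) {
         if (s.charAt(i) > 0x7F) return false;
       }
       return true;
     }
   
     static byte[] toUtf8(String s) {
       if (isAscii(s)) {
         // Right-sized allocation: exactly one byte per char, no trimming copy.
         byte[] out = new byte[s.length()];
         for (int i = 0; i < s.length(); i++) {
           out[i] = (byte) s.charAt(i);
         }
         return out;
       }
       // Fall back to the general-purpose encoder for non-ASCII input.
       return s.getBytes(StandardCharsets.UTF_8);
     }
   }
   ```
   
   For ASCII input this produces byte-for-byte the same result as `getBytes(UTF_8)` while allocating only the final array.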
   
   ### toString()
   
   If we know that our bytes are ASCII then we can call the deprecated `String(byte[], int)` constructor to invoke a path which simply casts each byte into a char, completely skipping the `StringDecoder` (and any associated allocation / GC, or the thread-local lookups incurred when passing a charset name instead of a `StandardCharsets` constant).
   
   We could perform a `byte[]` to `char[]` conversion ourselves but that forces us to pay for an additional defensive copy of the `char[]`, whereas the deprecated constructor lets us skip this. I checked and this deprecated constructor is still present in newer versions of Java (and, in fact, has been optimized to support the JDK9 CompactString features).
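   A sketch of this decoding fast path (illustrative only; with `hibyte = 0` the deprecated constructor widens each byte straight into a char):
   
   ```java
   import java.nio.charset.StandardCharsets;
   
   public final class AsciiDecode {
     // True iff no byte has its high bit set.
     static boolean isAscii(byte[] bytes) {
       for (byte b : bytes) {
         if (b < 0) return false;
       }
       return true;
     }
   
     @SuppressWarnings("deprecation")
     static String decode(byte[] bytes) {
       if (isAscii(bytes)) {
         // Deprecated String(byte[], int) constructor with hibyte = 0:
         // each byte becomes one char, bypassing StringDecoder entirely.
         return new String(bytes, 0);
       }
       return new String(bytes, StandardCharsets.UTF_8);
     }
   }
   ```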
   
   ## How was this patch tested?
   
   Correctness should be covered by existing tests.
   
   To measure performance I added a set of microbenchmarks for `UTF8String`.
   
    **Benchmarking help, especially on real workloads, would be very appreciated** 😃 .
