You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@accumulo.apache.org by GitBox <gi...@apache.org> on 2021/02/10 03:22:41 UTC

[GitHub] [accumulo] ctubbsii opened a new pull request #1920: Use String.intern() instead of WeakHashMap

ctubbsii opened a new pull request #1920:
URL: https://github.com/apache/accumulo/pull/1920


   Replace occurrence of WeakHashMap for Strings, with String.intern(),
   because it is:
   
   1. less code,
   2. similar performance in JDK8 and later,
   3. less memory overhead,
   4. bins are tunable by the user with JVM flags, and
   5. as of Java 8, no longer is constrained to a fixed size PermGen space
   
   This does not preclude the user from also configuring the G1GC string
   deduplication feature, if they choose to do so for the whole JVM.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ctubbsii closed pull request #1920: Use String.intern() instead of WeakHashMap

Posted by GitBox <gi...@apache.org>.
ctubbsii closed pull request #1920:
URL: https://github.com/apache/accumulo/pull/1920


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] brianloss commented on pull request #1920: Use String.intern() instead of WeakHashMap

Posted by GitBox <gi...@apache.org>.
brianloss commented on pull request #1920:
URL: https://github.com/apache/accumulo/pull/1920#issuecomment-776691790


   In the [discussion thread](https://lists.apache.org/thread.html/r716826039c63640d7b1d983acc60b8c9116ade3e7f68fac39913144c%40%3Cdev.accumulo.apache.org%3E) about this, the Guava code was cited since they recommend against String.intern. I looked at the git history, and the documentation of that recommendation is only 4 months old, so I started looking around for any performance comparisons, and I couldn't find anything recent. 
   
   I took a gist from a [stackoverflow post](https://stackoverflow.com/questions/10624232/performance-penalty-of-string-intern) discussing String.intern performance, and ran it on JDK 11 on a MacBook Pro laptop. While it's probably not the best benchmark, I found that even in JDK 11, String.intern can be significantly slower than a ConcurrentHashMap. It was mostly bad when I tried with a large number of interned strings. For a small number of interned strings, String.intern performance was comparable to ConcurrentHashMap, though lookup (calling intern on an already interned string) was always slower for String.intern. For large numbers of Strings, I was able to get improvements by setting the `-XX:StringTableSize` argument to the VM to increase the table from its JDK11 default of 65536 (JDK 7-10 defaulted to 60013).
   
   I think if you aren't exceeding the string table size, then String.intern performance is probably comparable enough. But if you exceed, then it starts to degrade quickly. I'm not suggesting to skip this change, but it's just something to think about.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ctubbsii commented on pull request #1920: Use String.intern() instead of WeakHashMap

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on pull request #1920:
URL: https://github.com/apache/accumulo/pull/1920#issuecomment-776976681


   Okay, so I'm going to scrap this PR, and just create a simple Interning utility that puts this deduplication code in one place. This is the only time we use it for String types, but we do use it for a few other types elsewhere. Guava's built-in Interner is beta, so I won't use that, but will just move our existing code into a separate util class and add generics, so it can be re-used if we need it.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ctubbsii commented on pull request #1920: Use String.intern() instead of WeakHashMap

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on pull request #1920:
URL: https://github.com/apache/accumulo/pull/1920#issuecomment-777104293


   This is superseded by #1925 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ctubbsii commented on pull request #1920: Use String.intern() instead of WeakHashMap

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on pull request #1920:
URL: https://github.com/apache/accumulo/pull/1920#issuecomment-776875548


   While the ConcurrentHashMap might give the best performance (except for initial load in the realistic case), the WeakHashMap is a better comparison, since that's what we use today. For the realistic case (100,000 strings), the interning is better than our WeakHashMap for loads (probably because of early rehashing as it resizes), and not substantially worse for lookups. Even under the extreme case (1,000,000 strings) interning is only moderately worse than the WeakHashMap after tuning.
   
   Overall, I'm still leaning towards preferring `String.intern()`.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ctubbsii commented on pull request #1920: Use String.intern() instead of WeakHashMap

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on pull request #1920:
URL: https://github.com/apache/accumulo/pull/1920#issuecomment-776795695


   > In the [discussion thread](https://lists.apache.org/thread.html/r716826039c63640d7b1d983acc60b8c9116ade3e7f68fac39913144c%40%3Cdev.accumulo.apache.org%3E) about this, the Guava code was cited since they recommend against String.intern. I looked at the git history, and the documentation of that recommendation is only 4 months old, so I started looking around for any performance comparisons, and I couldn't find anything recent.
   
   The comment was added in 2020, yes, but based on the git log, it was added as a clarification that it could also be used for String in addition to other immutable types. The note about performance was added based on a comment from 2010, when this functionality was added to Guava back in v3.0. It does not appear to be based on newer performance information, just a note about the original intent.
   
   The more recent performance information (since 2010) I was looking at were at http://java-performance.info/string-intern-in-java-6-7-8/ (and has the same observations as yours below).
   
   > 
   > I took a gist from a [stackoverflow post](https://stackoverflow.com/questions/10624232/performance-penalty-of-string-intern) discussing String.intern performance, and ran it on JDK 11 on a MacBook Pro laptop. While it's probably not the best benchmark, I found that even in JDK 11, String.intern can be significantly slower than a ConcurrentHashMap. It was mostly bad when I tried with a large number of interned strings. For a small number of interned strings, String.intern performance was comparable to ConcurrentHashMap, though lookup (calling intern on an already interned string) was always slower for String.intern. For large numbers of Strings, I was able to get improvements by setting the `-XX:StringTableSize` argument to the VM to increase the table from its JDK11 default of 65536 (JDK 7-10 defaulted to 60013).
   
   I'm curious how small/large your tests were, and which were more reasonable for Accumulo.
   
   > 
   > I think if you aren't exceeding the string table size, then String.intern performance is probably comparable enough. But if you exceed, then it starts to degrade quickly. I'm not suggesting to skip this change, but it's just something to think about.
   
   Right. That's the idea. The default value of 65536 for JDK11 seems reasonable to me for our usage. If it needs to be higher, based on the user's operating environment, then they can tune it accordingly.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] dlmarion commented on pull request #1920: Use String.intern() instead of WeakHashMap

Posted by GitBox <gi...@apache.org>.
dlmarion commented on pull request #1920:
URL: https://github.com/apache/accumulo/pull/1920#issuecomment-776682944


   I looked to see if it made a difference if the intern() call was made upstream of the constructor. In this case I don't think that it does.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ctubbsii commented on pull request #1920: Use String.intern() instead of WeakHashMap

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on pull request #1920:
URL: https://github.com/apache/accumulo/pull/1920#issuecomment-776868416


   I did a similar comparison. Here's the results for `-XX:StringTableSize=1000003` and 1,000,000 strings:
   
   | Implementation |  load (ms) | lookup same (ms) | lookup equal (ms) |
   | ----- | ----- | ----- | ----- |
   | String.intern() | 553 | 414 | 400 |
   | Our WeakHashMap | 633 | 216 | 254 |
   | ConcurrentHashMap | 356 | 109 | 101 |
   
   Based on this, it looks like the ConcurrentHashMap is definitely much faster. However, it doesn't garbage collect. Comparing with our WeakHashMap implementation, the differences with String.intern are much smaller.
   
   Here's a more realistic comparison for our use case with the default StringTableSize for Java 11 of 65536, and 100,000 strings:
   
   | Implementation |  load (ms) | lookup same (ms) | lookup equal (ms) |
   | ----- | ----- | ----- | ----- |
   | String.intern() | 88 | 66 | 46 |
   | Our WeakHashMap | 115 | 58 | 41 |
   | ConcurrentHashMap | 104 | 35 | 17 |
   
   Also, after running many times, I also observed `String.intern()` to have much more consistent performance (about +/- 5%). The two HashMap implementations seemed to have a lot more variance (about +/- 20%), but I didn't compute it precisely. This is only an approximation.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] brianloss commented on pull request #1920: Use String.intern() instead of WeakHashMap

Posted by GitBox <gi...@apache.org>.
brianloss commented on pull request #1920:
URL: https://github.com/apache/accumulo/pull/1920#issuecomment-776815394


   > I'm curious how small/large your tests were, and which were more reasonable for Accumulo.
   
   Based on that gist I took from the StackOverflow post, I did the same sizes: 10000, 20000, 40000, 80000, 100000, 400000, 800000, 1000000 and I added 10,000,000. I wasn't very rigorous (e.g., didn't run each size multiple times and take an average or anything like that), but I ended up with the following results. It looks like maybe up to ~100k strings interned, the differences are in the noise.
   
   ```java
   /*
    * intern() many different strings:
    * ================================
    * count       initial intern   lookup same string  lookup equal string
    *10'000'000            71803                32455                14831
    * 1'000'000             1840                 1490                 1502
    *   800'000             1244                 1139                 1063
    *   400'000              427                  340                  327
    *   200'000              161                  118                  106
    *   100'000               71                   46                   47
    *    80'000               49                   34                   35
    *    40'000               20                   14                   13
    *    20'000                8                    5                    4
    *    10'000                3                    2                    2
    *
    * intern() many different strings (and large string table):
    * =========================================================
    * count       initial intern   lookup same string  lookup equal string
    *10'000'000             4642                 3190                 3136
    * 1'000'000              389                  259                  262
    *   800'000              378                  255                  214
    *   400'000              209                  145                   99
    *   200'000              110                   73                   68
    *   100'000               56                   36                   42
    *    80'000               43                   33                   34
    *    40'000               19                   15                   15
    *    20'000                9                    6                    6
    *    10'000                4                    3                    2
    *
    * Manual "interning" with ConcurrentHashMap:
    * ==========================================
    *10'000'000             2283                  883                 1139
    * 1'000'000              252                   62                   79
    *   800'000              308                   51                   73
    *   400'000              272                   34                   41
    *   200'000              125                   39                   29
    *   100'000               83                   17                   15
    *    80'000               42                   18                   12
    *    40'000               20                    6                    5
    *    20'000                7                    3                    2
    *    10'000                3                    1                    1
    * /
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ctubbsii commented on pull request #1920: Use String.intern() instead of WeakHashMap

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on pull request #1920:
URL: https://github.com/apache/accumulo/pull/1920#issuecomment-776913731


   > I assume that String.intern() handles strings that are no longer referenced well, but was not quite sure. Curious if anyone who has done research knows?
   
   The exact behavior is implementation-specific, as it is a native method. However, everything I've found says that yes, manually-interned strings are eligible for garbage collection and removal from the string pool once they are no longer referenced in the program. For string literals, they are eligible for garbage collection once their containing class is unloaded. However, it's hard to find exact documentation on this.
   
   > Also curious if anyone knows anything about the behavior of String.intern() under concurrent load (like 100 threads all calling String.intern())? I suspect String.intern() could only be better or the same as the current weak hashmap w.r.t to concurrency.
   
   I have not done any experimentation with concurrent access performance, and I've only found one broken link to a blog discussing it. I know it's thread-safe, but I don't know about performance. I might play around with that a bit to investigate it further.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] keith-turner commented on pull request #1920: Use String.intern() instead of WeakHashMap

Posted by GitBox <gi...@apache.org>.
keith-turner commented on pull request #1920:
URL: https://github.com/apache/accumulo/pull/1920#issuecomment-776900236


   I see this is used for deduping locations and sessions.  The locations and sessions should be relatively low cardinality at a given point in time.  However over longer periods the cardinality could be higher (tservers restarts create a new session and possibly could use different port) for all seen.  But the active set would always be much smaller.  I assume that String.intern() handles strings that are no longer referenced well, but was not quite sure.  Curious if anyone who has done research knows?  Also curious if anyone knows anything about the behavior of String.intern() under concurrent load (like 100 threads all calling String.intern())?    I suspect String.intern() could only be better or the same as the current weak hashmap w.r.t to concurrency.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] ctubbsii commented on pull request #1920: Use String.intern() instead of WeakHashMap

Posted by GitBox <gi...@apache.org>.
ctubbsii commented on pull request #1920:
URL: https://github.com/apache/accumulo/pull/1920#issuecomment-776974969


   Here's some representative numbers for the 1,000,000 strings case with 1000 concurrent threads (and `-XX:StringTableSize=1000003`):
   
   |   Implementation  | load (ms) | lookup same (ms) | lookup equal (ms) |
   |-------------------|-----------|------------------|-------------------|
   |   String.intern() |      1902 |             1274 |              1166 |
   |       WeakHashMap |      1491 |              907 |               743 |
   | ConcurrentHashMap |      1196 |              648 |               387 |
   
   And here's the more realistic case (100,000 strings, 100 threads, default string table size):
   
   |   Implementation  | load (ms) | lookup same (ms) | lookup equal (ms) |
   |-------------------|-----------|------------------|-------------------|
   |   String.intern() |       271 |              150 |                94 |
   |       WeakHashMap |       256 |              131 |                79 |
   | ConcurrentHashMap |       291 |               92 |                92 |
   
   It looks like intern has slightly more overhead that adds up over time, but for the realistic case, they are all very similar.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [accumulo] keith-turner commented on pull request #1920: Use String.intern() instead of WeakHashMap

Posted by GitBox <gi...@apache.org>.
keith-turner commented on pull request #1920:
URL: https://github.com/apache/accumulo/pull/1920#issuecomment-776944886


   Realized this change goes from a deduplication mechanism that is localized to a class to one that is JVM wide.  So if a user loads two libraries (Accumulo client and some other lib) that both use String.intern(),  if the non-accumulo library uses String.intern() for a much higher cardinality set of Strings than Accumlo does, then it could possibly impact Accumulo.  


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org