You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@accumulo.apache.org by GitBox <gi...@apache.org> on 2021/02/10 03:22:41 UTC
[GitHub] [accumulo] ctubbsii opened a new pull request #1920: Use String.intern() instead of WeakHashMap
ctubbsii opened a new pull request #1920:
URL: https://github.com/apache/accumulo/pull/1920
Replace occurrence of WeakHashMap for Strings, with String.intern(),
because it is:
1. less code,
2. similar performance in JDK8 and later,
3. less memory overhead,
4. bins are tunable by the user with JVM flags, and
5. as of Java 8, no longer is constrained to a fixed size PermGen space
This does not preclude the user from also configuring the G1GC string
deduplication feature, if they choose to do so for the whole JVM.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [accumulo] ctubbsii closed pull request #1920: Use String.intern() instead of WeakHashMap
Posted by GitBox <gi...@apache.org>.
ctubbsii closed pull request #1920:
URL: https://github.com/apache/accumulo/pull/1920
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [accumulo] brianloss commented on pull request #1920: Use String.intern() instead of WeakHashMap
Posted by GitBox <gi...@apache.org>.
brianloss commented on pull request #1920:
URL: https://github.com/apache/accumulo/pull/1920#issuecomment-776691790
In the [discussion thread](https://lists.apache.org/thread.html/r716826039c63640d7b1d983acc60b8c9116ade3e7f68fac39913144c%40%3Cdev.accumulo.apache.org%3E) about this, the Guava code was cited since they recommend against String.intern. I looked at the git history, and the documentation of that recommendation is only 4 months old, so I started looking around for any performance comparisons, and I couldn't find anything recent.
I took a gist from a [stackoverflow post](https://stackoverflow.com/questions/10624232/performance-penalty-of-string-intern) discussing String.intern performance, and ran it on JDK 11 on a MacBook Pro laptop. While it's probably not the best benchmark, I found that even in JDK 11, String.intern can be significantly slower than a ConcurrentHashMap. It was mostly bad when I tried with a large number of interned strings. For a small number of interned strings, String.intern performance was comparable to ConcurrentHashMap, though lookup (calling intern on an already interned string) was always slower for String.intern. For large numbers of Strings, I was able to get improvements by setting the `-XX:StringTableSize` argument to the VM to increase the table from its JDK11 default of 65536 (JDK 7-10 defaulted to 60013).
I think if you aren't exceeding the string table size, then String.intern performance is probably comparable enough. But if you exceed, then it starts to degrade quickly. I'm not suggesting to skip this change, but it's just something to think about.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [accumulo] ctubbsii commented on pull request #1920: Use String.intern() instead of WeakHashMap
Posted by GitBox <gi...@apache.org>.
ctubbsii commented on pull request #1920:
URL: https://github.com/apache/accumulo/pull/1920#issuecomment-776976681
Okay, so I'm going to scrap this PR, and just create a simple Interning utility that puts this deduplication code in one place. This is the only time we use it for String types, but we do use it for a few other types elsewhere. Guava's built-in Interner is beta, so I won't use that, but will just move our existing code into a separate util class and add generics, so it can be re-used if we need it.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [accumulo] ctubbsii commented on pull request #1920: Use String.intern() instead of WeakHashMap
Posted by GitBox <gi...@apache.org>.
ctubbsii commented on pull request #1920:
URL: https://github.com/apache/accumulo/pull/1920#issuecomment-777104293
This is superseded by #1925
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [accumulo] ctubbsii commented on pull request #1920: Use String.intern() instead of WeakHashMap
Posted by GitBox <gi...@apache.org>.
ctubbsii commented on pull request #1920:
URL: https://github.com/apache/accumulo/pull/1920#issuecomment-776875548
While the ConcurrentHashMap might give the best performance (except for initial load in the realistic case), the WeakHashMap is a better comparison, since that's what we use today. For the realistic case (100,000 strings), the interning is better than our WeakHashMap for loads (probably because of early rehashing as it resizes), and not substantially worse for lookups. Even under the extreme case (1,000,000 strings) interning is only moderately worse than the WeakHashMap after tuning.
Overall, I'm still leaning towards preferring `String.intern()`.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [accumulo] ctubbsii commented on pull request #1920: Use String.intern() instead of WeakHashMap
Posted by GitBox <gi...@apache.org>.
ctubbsii commented on pull request #1920:
URL: https://github.com/apache/accumulo/pull/1920#issuecomment-776795695
> In the [discussion thread](https://lists.apache.org/thread.html/r716826039c63640d7b1d983acc60b8c9116ade3e7f68fac39913144c%40%3Cdev.accumulo.apache.org%3E) about this, the Guava code was cited since they recommend against String.intern. I looked at the git history, and the documentation of that recommendation is only 4 months old, so I started looking around for any performance comparisons, and I couldn't find anything recent.
The comment was added in 2020, yes, but based on the git log, it was added as a clarification that it could also be used for String in addition to other immutable types. The note about performance was added based on a comment from 2010, when this functionality was added to Guava back in v3.0. It does not appear to be based on newer performance information, just a note about the original intent.
The more recent performance information (since 2010) I was looking at were at http://java-performance.info/string-intern-in-java-6-7-8/ (and has the same observations as yours below).
>
> I took a gist from a [stackoverflow post](https://stackoverflow.com/questions/10624232/performance-penalty-of-string-intern) discussing String.intern performance, and ran it on JDK 11 on a MacBook Pro laptop. While it's probably not the best benchmark, I found that even in JDK 11, String.intern can be significantly slower than a ConcurrentHashMap. It was mostly bad when I tried with a large number of interned strings. For a small number of interned strings, String.intern performance was comparable to ConcurrentHashMap, though lookup (calling intern on an already interned string) was always slower for String.intern. For large numbers of Strings, I was able to get improvements by setting the `-XX:StringTableSize` argument to the VM to increase the table from its JDK11 default of 65536 (JDK 7-10 defaulted to 60013).
I'm curious how small/large your tests were, and which were more reasonable for Accumulo.
>
> I think if you aren't exceeding the string table size, then String.intern performance is probably comparable enough. But if you exceed, then it starts to degrade quickly. I'm not suggesting to skip this change, but it's just something to think about.
Right. That's the idea. The default value of 65536 for JDK11 seems reasonable to me for our usage. If it needs to be higher, based on the user's operating environment, then they can tune it accordingly.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [accumulo] dlmarion commented on pull request #1920: Use String.intern() instead of WeakHashMap
Posted by GitBox <gi...@apache.org>.
dlmarion commented on pull request #1920:
URL: https://github.com/apache/accumulo/pull/1920#issuecomment-776682944
I looked to see if it made a difference if the intern() call was made upstream of the constructor. In this case I don't think that it does.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [accumulo] ctubbsii commented on pull request #1920: Use String.intern() instead of WeakHashMap
Posted by GitBox <gi...@apache.org>.
ctubbsii commented on pull request #1920:
URL: https://github.com/apache/accumulo/pull/1920#issuecomment-776868416
I did a similar comparison. Here's the results for `-XX:StringTableSize=1000003` and 1,000,000 strings:
| Implementation | load (ms) | lookup same (ms) | lookup equal (ms) |
| ----- | ----- | ----- | ----- |
| String.intern() | 553 | 414 | 400 |
| Our WeakHashMap | 633 | 216 | 254 |
| ConcurrentHashMap | 356 | 109 | 101 |
Based on this, it looks like the ConcurrentHashMap is definitely much faster. However, it doesn't garbage collect. Comparing with our WeakHashMap implementation, the differences with String.intern are much smaller.
Here's a more realistic comparison for our use case with the default StringTableSize for Java 11 of 65536, and 100,000 strings:
| Implementation | load (ms) | lookup same (ms) | lookup equal (ms) |
| ----- | ----- | ----- | ----- |
| String.intern() | 88 | 66 | 46 |
| Our WeakHashMap | 115 | 58 | 41 |
| ConcurrentHashMap | 104 | 35 | 17 |
Also, after running many times, I also observed `String.intern()` to have much more consistent performance (about +/- 5%). The two HashMap implementations seemed to have a lot more variance (about +/- 20%), but I didn't compute it precisely. This is only an approximation.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [accumulo] brianloss commented on pull request #1920: Use String.intern() instead of WeakHashMap
Posted by GitBox <gi...@apache.org>.
brianloss commented on pull request #1920:
URL: https://github.com/apache/accumulo/pull/1920#issuecomment-776815394
> I'm curious how small/large your tests were, and which were more reasonable for Accumulo.
Based on that gist I took from the StackOverflow post, I did the same sizes: 10000, 20000, 40000, 80000, 100000, 400000, 800000, 1000000 and I added 10,000,000. I wasn't very rigorous (e.g., didn't run each size multiple times and take an average or anything like that), but I ended up with the following results. It looks like maybe up to ~100k strings interned, the differences are in the noise.
```java
/*
* intern() many different strings:
* ================================
* count initial intern lookup same string lookup equal string
*10'000'000 71803 32455 14831
* 1'000'000 1840 1490 1502
* 800'000 1244 1139 1063
* 400'000 427 340 327
* 200'000 161 118 106
* 100'000 71 46 47
* 80'000 49 34 35
* 40'000 20 14 13
* 20'000 8 5 4
* 10'000 3 2 2
*
* intern() many different strings (and large string table):
* =========================================================
* count initial intern lookup same string lookup equal string
*10'000'000 4642 3190 3136
* 1'000'000 389 259 262
* 800'000 378 255 214
* 400'000 209 145 99
* 200'000 110 73 68
* 100'000 56 36 42
* 80'000 43 33 34
* 40'000 19 15 15
* 20'000 9 6 6
* 10'000 4 3 2
*
* Manual "interning" with ConcurrentHashMap:
* ==========================================
*10'000'000 2283 883 1139
* 1'000'000 252 62 79
* 800'000 308 51 73
* 400'000 272 34 41
* 200'000 125 39 29
* 100'000 83 17 15
* 80'000 42 18 12
* 40'000 20 6 5
* 20'000 7 3 2
* 10'000 3 1 1
* /
```
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [accumulo] ctubbsii commented on pull request #1920: Use String.intern() instead of WeakHashMap
Posted by GitBox <gi...@apache.org>.
ctubbsii commented on pull request #1920:
URL: https://github.com/apache/accumulo/pull/1920#issuecomment-776913731
> I assume that String.intern() handles strings that are no longer referenced well, but was not quite sure. Curious if anyone who has done research knows?
The exact behavior is implementation-specific, as it is a native method. However, everything I've found says that yes, manually-interned strings are eligible for garbage collection and removal from the string pool once they are no longer referenced in the program. For string literals, they are eligible for garbage collection once their containing class is unloaded. However, it's hard to find exact documentation on this.
> Also curious if anyone knows anything about the behavior of String.intern() under concurrent load (like 100 threads all calling String.intern())? I suspect String.intern() could only be better or the same as the current weak hashmap w.r.t to concurrency.
I have not done any experimentation with concurrent access performance, and I've only found one broken link to a blog discussing it. I know it's thread-safe, but I don't know about performance. I might play around with that a bit to investigate it further.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [accumulo] keith-turner commented on pull request #1920: Use String.intern() instead of WeakHashMap
Posted by GitBox <gi...@apache.org>.
keith-turner commented on pull request #1920:
URL: https://github.com/apache/accumulo/pull/1920#issuecomment-776900236
I see this is used for deduping locations and sessions. The locations and sessions should be relatively low cardinality at a given point in time. However over longer periods the cardinality could be higher (tservers restarts create a new session and possibly could use different port) for all seen. But the active set would always be much smaller. I assume that String.intern() handles strings that are no longer referenced well, but was not quite sure. Curious if anyone who has done research knows? Also curious if anyone knows anything about the behavior of String.intern() under concurrent load (like 100 threads all calling String.intern())? I suspect String.intern() could only be better or the same as the current weak hashmap w.r.t to concurrency.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [accumulo] ctubbsii commented on pull request #1920: Use String.intern() instead of WeakHashMap
Posted by GitBox <gi...@apache.org>.
ctubbsii commented on pull request #1920:
URL: https://github.com/apache/accumulo/pull/1920#issuecomment-776974969
Here's some representative numbers for the 1,000,000 strings case with 1000 concurrent threads (and `-XX:StringTableSize=1000003`):
| Implementation | load (ms) | lookup same (ms) | lookup equal (ms) |
|-------------------|-----------|------------------|-------------------|
| String.intern() | 1902 | 1274 | 1166 |
| WeakHashMap | 1491 | 907 | 743 |
| ConcurrentHashMap | 1196 | 648 | 387 |
And here's the more realistic case (100,000 strings, 100 threads, default string table size):
| Implementation | load (ms) | lookup same (ms) | lookup equal (ms) |
|-------------------|-----------|------------------|-------------------|
| String.intern() | 271 | 150 | 94 |
| WeakHashMap | 256 | 131 | 79 |
| ConcurrentHashMap | 291 | 92 | 92 |
It looks like intern has slightly more overhead that adds up over time, but for the realistic case, they are all very similar.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [accumulo] keith-turner commented on pull request #1920: Use String.intern() instead of WeakHashMap
Posted by GitBox <gi...@apache.org>.
keith-turner commented on pull request #1920:
URL: https://github.com/apache/accumulo/pull/1920#issuecomment-776944886
Realized this change goes from a deduplication mechanism that is localized to a class to one that is JVM wide. So if a user loads two libraries (Accumulo client and some other lib) that both use String.intern(), if the non-accumulo library uses String.intern() for a much higher cardinality set of Strings than Accumlo does, then it could possibly impact Accumulo.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org