You are viewing a plain text version of this content. The canonical link for it is here.

Posted to commits@pinot.apache.org by "vvivekiyer (via GitHub)" <gi...@apache.org> on 2023/12/01 00:32:34 UTC

[I] Reduce Heap Usage of OnHeapStringDictionary [pinot]

vvivekiyer opened a new issue, #12078:
URL: https://github.com/apache/pinot/issues/12078

With our OnHeapStringDictionary, we've observed that if a column has a lot of duplicates, it can waste a lot of our heap usage.

Below is JXray analysis of the heapdump for one usecase in Linkedin where the `OnHeapStringDictionary` uses about 13GB of heap
![image](https://github.com/apache/pinot/assets/21298365/e428e06b-64bf-4d02-ad32-19545ec7b323)

String Interning described in https://www.baeldung.com/string/intern solves this problem. However, there could be certain high-cardinality columns (even with enough duplicates) where interning can be counter productive. So we can solve this with a fixed size interner as described in the following article https://dzone.com/articles/duplicate-strings-how-to-get-rid-of-them-and-save.

I attempted to PoC this change on one of our usecases and observed that the we saw huge savings in heap usage. Below is the heapdump analysis with my PoC change. Note that I used a size of 32M for the fixed size interner.
![image](https://github.com/apache/pinot/assets/21298365/b740620f-6104-4808-90e4-02514d1e60bc)

I'm planning to expose a new tableIndexConfig called `onHeapDictionaryConfig` that will allow us to enable interning and control the size of the Fixed Size interner.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org

Re: [I] Reduce Heap Usage of OnHeapStringDictionary [pinot]

Posted by "vvivekiyer (via GitHub)" <gi...@apache.org>.

vvivekiyer commented on issue #12078:
URL: https://github.com/apache/pinot/issues/12078#issuecomment-1837245948

   @richardstartin  thanks for the pointers.  I do plan to have configs to enable/disable interning along with knobs to control the size. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org

Re: [I] Reduce Heap Usage of OnHeapStringDictionary [pinot]

Posted by "gortiz (via GitHub)" <gi...@apache.org>.

gortiz commented on issue #12078:
URL: https://github.com/apache/pinot/issues/12078#issuecomment-1882702995

   > Why do we need to store strings? We should probably use byte array right and avoid creating string in the first place?
   
   I think that is something we need to explore in the longer term. We would reduce the GC usage by a lot if we do that. 
   
   > Just to add some context, the reason why this was added in the first place was the fact that for certain workloads, byte -> String de-serialization was becoming the bottleneck.
   
   Sure, that is something we need to take into account and be careful with the implementation. This is specially problematic when strings are not [normalized](https://en.wikipedia.org/wiki/Unicode_equivalence). What I did in the past was to use a Str class that has two attributes: a ByteBuffer and String. When the Str is build from IO buffers, the bytebuffer is set to the slice and the String is set to null. When a materialization is needed (for example, the io buffer will be released or we need to compare the strings), a `materialize()` method is called. That method initializes the String and after that moment the String is always used. By doing so we can skip the String creation (and therefore heap allocation) in almost all cases where the Str is not used as aggregation key.
   
   > We did try the GC optimizations with -XX:+UseStringDeduplication (and others) but noticed elevated CPU usage affecting our query latencies.
   
   I may be wrong, but dictionaries are bound to the query lifetime, right? I mean, we create the dictionary when the segment is being queried and do not re-use it in following queries. If that is the case String Deduplication won't be useful at all because it is only used on Strings in the old generation.
   
   > I would recommend against using String.intern, see an authoritative source [here](https://shipilev.net/jvm/anatomy-quarks/10-string-intern/), which recommends manual interning over use of String.intern.
   
   I'm with Richard here. My experience with String.intern is bad. It is just better to use our own structure to intern Strings. Something as simple as a Guava Cache is usually better than String.intern.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org

Re: [I] Reduce Heap Usage of OnHeapStringDictionary [pinot]

Posted by "jasperjiaguo (via GitHub)" <gi...@apache.org>.

jasperjiaguo commented on issue #12078:
URL: https://github.com/apache/pinot/issues/12078#issuecomment-1835209125

   >Our OnHeapStringDictionary implementation can result in a lot of wasted heap usage if there are enough duplicates in a column.
   
   Meaning the dictionaries of the same column across different segments would potentially have a lot of duplicate values and waste space?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org

Re: [I] Reduce Heap Usage of OnHeapStringDictionary [pinot]

Posted by "kishoreg (via GitHub)" <gi...@apache.org>.

kishoreg commented on issue #12078:
URL: https://github.com/apache/pinot/issues/12078#issuecomment-1835157328

   Why do we need to store strings? We should probably use byte array right and avoid creating string in the first place?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org

Re: [I] Reduce Heap Usage of OnHeapStringDictionary [pinot]

Posted by "richardstartin (via GitHub)" <gi...@apache.org>.

richardstartin commented on issue #12078:
URL: https://github.com/apache/pinot/issues/12078#issuecomment-1836789089

   I would recommend against using String.intern, see an authoritative source [here](https://shipilev.net/jvm/anatomy-quarks/10-string-intern/), which recommends manual interning over use of String.intern.
   
   Depending on which GC (G1, Shenandoah, Z) you’re running with you may be able to get the GC to deduplicate the backing data with -XX:+UseStringDeduplication. Assuming that data in dictionaries should get quite old, you can tune it with -XX:StringDeduplicationAgeThreshold=n where n is by default 3 collections. You can check it’s working properly with -XX:+PrintStringDeduplicationStatistics. This solution has the benefit of not making code changes with data-dependent efficacy and ramifications.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org

Re: [I] Reduce Heap Usage of OnHeapStringDictionary [pinot]

Posted by "vvivekiyer (via GitHub)" <gi...@apache.org>.

vvivekiyer commented on issue #12078:
URL: https://github.com/apache/pinot/issues/12078#issuecomment-1836933120

   @richardstartin 
   We did try the GC optimizations with -XX:+UseStringDeduplication but noticed elevated CPU usage affecting our query latencies. 
   
   I want to clarify that we are not using Java's native `string.intern()` here but rather using manual interning. As I shared above, the implementation is based on article [here]( https://dzone.com/articles/duplicate-strings-how-to-get-rid-of-them-and-save). The Poc code is available here - https://github.com/vvivekiyer/pinot/commit/dc4538bd88f050c20f8dd17530d0e38052b92c65 .
   
   Do you see any potential issues mentioned in the article based on the code above? I'll also take a closer look at the article you shared. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org

Re: [I] Reduce Heap Usage of OnHeapStringDictionary [pinot]

Posted by "mayankshriv (via GitHub)" <gi...@apache.org>.

mayankshriv commented on issue #12078:
URL: https://github.com/apache/pinot/issues/12078#issuecomment-1836640208

   Just to add some context, the reason why this was added in the first place was the fact that for certain workloads, byte -> String de-serialization was becoming the bottleneck. And it couldn't be avoided because sorted ordering of byte[] != sorted ordering of Strings.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org

Re: [I] Reduce Heap Usage of OnHeapStringDictionary [pinot]

Posted by "richardstartin (via GitHub)" <gi...@apache.org>.

richardstartin commented on issue #12078:
URL: https://github.com/apache/pinot/issues/12078#issuecomment-1837106413

   Sorry for what may have seemed like a drive by comment. I was trying to support the idea of doing custom interning (if any interning is done at all) rather than suggest the prototype used native interning. I would also expect there to be some treatment of GC string deduplication as a potential alternative solution, if only to share lessons learned, what was tried (e.g. which GC, which JDK version, was the age threshold modified?) before proceeding to implementation. 
   
   I don’t see any issues with what’s being proposed, except users will need a way to switch it on/off, change the size, observe it and so on.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org

Re: [I] Reduce Heap Usage of OnHeapStringDictionary [pinot]

Posted by "gortiz (via GitHub)" <gi...@apache.org>.

gortiz commented on issue #12078:
URL: https://github.com/apache/pinot/issues/12078#issuecomment-1884447441

   BTW, this issue is focused on the memory impact of the dictionary. But there is another theoretical improvement here. The solution proposed in #[12223](https://github.com/apache/pinot/pull/12223) has the side effect that two equal string literals that belong to the same column in different segments will _probably_ be resolved to the same Java String object.
   
   When working with ClickBench, I've seen that we waste a lot of time evaluating equals between actually equal (but not same) String objects when these Strings are used as aggregation keys. With this PR it is possible to find that these two equal String values that were read from different segments are actually the same String Java object, which means that the equals may be evaluated in constant time instead of linear (comparing all bytes).
   
   We should verify the impact in reality of this theoretical reasoning, but in case it actually shows an increase in performance, we could apply the same technique in the brokers when data is being read (interning strings sent by different servers). Although, as said in my previous message, I think the largest improvement would be to use a Str class that actually doesn't allocate in heap if it is not needed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org

Re: [I] Reduce Heap Usage of OnHeapStringDictionary [pinot]

Posted by "vvivekiyer (via GitHub)" <gi...@apache.org>.

vvivekiyer commented on issue #12078:
URL: https://github.com/apache/pinot/issues/12078#issuecomment-1835128005

   cc: @Jackie-Jiang @siddharthteotia 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org

Re: [I] Reduce Heap Usage of OnHeapStringDictionary [pinot]

Posted by "vvivekiyer (via GitHub)" <gi...@apache.org>.

vvivekiyer commented on issue #12078:
URL: https://github.com/apache/pinot/issues/12078#issuecomment-1835273275

   > Why do we need to store strings? We should probably use byte array right and avoid creating string in the first place?
   
   
   @kishoreg  The above problem exists with duplicate bytes as well. My PoC code that generated above does both String internining and Byte interning. 
   
   Before
   ![image](https://github.com/apache/pinot/assets/21298365/876bdde5-0fc4-4a95-9e16-fcfb48d663c1)
   
   After
   ![image](https://github.com/apache/pinot/assets/21298365/c5c841d0-a1ba-4246-8981-77b697860608)
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org