Posted to issues@lucene.apache.org by "Armin Braun (Jira)" <ji...@apache.org> on 2022/08/08 12:06:00 UTC

[jira] [Comment Edited] (LUCENE-10676) FieldInfo#name contributes significantly to heap usage at scale

    [ https://issues.apache.org/jira/browse/LUCENE-10676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576713#comment-17576713 ] 

Armin Braun edited comment on LUCENE-10676 at 8/8/22 12:05 PM:
---------------------------------------------------------------

The field names in the particular case that led to this issue were indeed a little longer than usual (about 100-150 chars each). But the problem of duplicate strings in `FieldInfo` is not limited to the field names. Analyzing the heap dump that motivated this issue, I found the biggest contributors to duplicate strings to be as follows:

 

!image-2022-08-08-13-23-37-050.png!

To vastly improve the situation, we wouldn't even need to look into interning field names (though that would still be nice and a GB-scale win in this case as well). Just interning or deduplicating the obvious strings like "PerFieldPostingsFormat.format" or "Lucene80", which we already have in the string constant pool anyway, would offer a trivial win for cases like this one.

PS: Created https://issues.apache.org/jira/browse/LUCENE-10677 as a separate issue for the non-field-name strings.
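To illustrate the idea of interning the attribute strings, here is a minimal sketch. It assumes the attributes arrive as a plain Map<String, String> of freshly decoded strings; the class AttributeInterning and the method internAttributes are hypothetical names for illustration only and are not part of Lucene. The only API relied on is the JDK's String.intern().

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not a Lucene class.
final class AttributeInterning {

  // Returns a copy of the map whose keys and values are the canonical
  // (interned) instances, so equal strings from different FieldInfo
  // objects end up sharing one String object on the heap.
  static Map<String, String> internAttributes(Map<String, String> attributes) {
    Map<String, String> result = new HashMap<>(attributes.size());
    for (Map.Entry<String, String> e : attributes.entrySet()) {
      result.put(e.getKey().intern(), e.getValue().intern());
    }
    return result;
  }

  public static void main(String[] args) {
    // new String(...) simulates strings freshly decoded from two segments.
    Map<String, String> first = internAttributes(
        Map.of(new String("PerFieldPostingsFormat.format"), new String("Lucene80")));
    Map<String, String> second = internAttributes(
        Map.of(new String("PerFieldPostingsFormat.format"), new String("Lucene80")));

    String valueA = first.get("PerFieldPostingsFormat.format");
    String valueB = second.get("PerFieldPostingsFormat.format");
    // Reference comparison on purpose: both maps now share one instance.
    System.out.println(valueA == valueB); // true
  }
}
```

Since literals such as "Lucene80" are already in the string constant pool, intern() on an equal string returns that existing instance, so for this small set of values the canonical copies cost no extra memory.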



> FieldInfo#name contributes significantly to heap usage at scale
> ---------------------------------------------------------------
>
>                 Key: LUCENE-10676
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10676
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/codecs
>    Affects Versions: 9.3
>         Environment: Seen in Lucene 9.3.0 running on Linux using JDK18 but seems independent of environment.
>            Reporter: David Turner
>            Priority: Minor
>              Labels: heap, scalability
>         Attachments: image-2022-08-08-13-23-37-050.png
>
>
> We encountered an Elasticsearch user with high heap usage, a significant proportion of which was down to the contents of `FieldInfo#name`.
> This user was certainly pushing some scalability boundaries: this single process had thousands of active Lucene indices, many with 10k+ fields, and many indices had hundreds of segments due to an excess of flushes, so in total they had an enormous number of `FieldInfo` instances. Still, the bulk of the heap usage was just field names, and the total number of distinct field names was fairly small. That's pretty common, especially for time-based data like logs. Some kind of interning or deduplication of these strings would have reduced their heap usage by many GBs.
> Is there a way we could deduplicate these strings? Deduplicating them across segments within each index would already have helped, but ideally we'd like to deduplicate them across indices too.
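As a rough sketch of what deduplicating field names across segments and indices could look like, the following uses a process-wide pool keyed by the string itself. FieldNamePool, dedup, and the example field name are hypothetical and only illustrate the technique; this is not how Lucene reads FieldInfo today. Note the assumption that the set of distinct names stays small: this simple pool never evicts entries, so a real implementation would likely need weak references or some other bound.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical process-wide pool for illustration only; not a Lucene class.
final class FieldNamePool {

  private final ConcurrentMap<String, String> pool = new ConcurrentHashMap<>();

  // Returns the canonical instance for the given name, registering the
  // argument as canonical if no equal string has been seen before.
  String dedup(String name) {
    String existing = pool.putIfAbsent(name, name);
    return existing != null ? existing : name;
  }

  int size() {
    return pool.size();
  }

  public static void main(String[] args) {
    FieldNamePool shared = new FieldNamePool();
    // new String(...) simulates the same field name read from two segments.
    String fromSegment1 = shared.dedup(new String("kubernetes.labels.app"));
    String fromSegment2 = shared.dedup(new String("kubernetes.labels.app"));

    System.out.println(fromSegment1 == fromSegment2); // true: one shared instance
    System.out.println(shared.size());                // 1
  }
}
```

Using one pool per index would cover the per-index case from the description, while a single shared instance would also cover deduplication across indices.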


