Posted to issues@lucene.apache.org by "David Turner (Jira)" <ji...@apache.org> on 2022/08/10 09:43:00 UTC

[jira] [Comment Edited] (LUCENE-10677) Duplicate strings in FieldInfo#attributes contribute significantly to heap usage at scale

    [ https://issues.apache.org/jira/browse/LUCENE-10677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17577853#comment-17577853 ] 

David Turner edited comment on LUCENE-10677 at 8/10/22 9:42 AM:
----------------------------------------------------------------

> I'm opposed to the use of string.intern by the lucene library here. It is inappropriate for a library (versus an app)

I think that's reasonable; `String#intern` is a pretty blunt tool to be using here. And yet it does seem awfully wasteful to burn so much heap on these things. "Buy more RAM" is not a great answer (implicitly this means "... or go and find a cheaper alternative elsewhere", and folks are indeed willing to do that). The next scaling limit in this dimension appears to be quite far off, which is why we think this is worth addressing. (Edit to add: these strings appear to roughly double the heap needed for each `SegmentReader` object.)

Are there any other approaches you'd suggest? It looks like we might be able to intercept the relevant calls to `DataInput#readString` ourselves, although adding support for compound segments introduces an enormous amount of extra complexity to that approach. Would it work to introduce some simpler way for an application to hook in some kind of string deduplication mechanism even if it goes unused in pure Lucene by default?
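
To make the idea of an application-supplied deduplication mechanism concrete, here is a minimal sketch in plain Java. It is only an illustration under stated assumptions: `StringDeduplicator`, `dedup` and `dedupMap` are hypothetical names, none of this is an existing Lucene API, and how such a hook would actually be wired into the reading of `FieldInfo` attributes is exactly the open question above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical application-side helper; not part of Lucene. */
final class StringDeduplicator {
    // Maps each distinct string to the first instance of it that was seen.
    private final ConcurrentHashMap<String, String> canonical = new ConcurrentHashMap<>();

    /** Returns a shared instance equal to s, registering s if no equal string is known yet. */
    String dedup(String s) {
        if (s == null) {
            return null;
        }
        String existing = canonical.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    /** Copies an attributes-style map, replacing every key and value with its shared instance. */
    Map<String, String> dedupMap(Map<String, String> attributes) {
        Map<String, String> out = new HashMap<>(attributes.size() * 2);
        for (Map.Entry<String, String> e : attributes.entrySet()) {
            out.put(dedup(e.getKey()), dedup(e.getValue()));
        }
        return out;
    }

    /** Number of distinct strings currently retained. */
    int size() {
        return canonical.size();
    }
}
```

Unlike `String#intern`, a table like this is owned by the application, so the application decides its scope and lifetime. Note that this sketch never evicts entries, so a real implementation would probably need some bound or weak references.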


> Duplicate strings in FieldInfo#attributes contribute significantly to heap usage at scale
> -----------------------------------------------------------------------------------------
>
>                 Key: LUCENE-10677
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10677
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/codecs
>    Affects Versions: 9.3
>            Reporter: Armin Braun
>            Priority: Minor
>              Labels: heap, scalability
>         Attachments: lucene_duplicate_fields.png
>
>
> This has the same origin as issue LUCENE-10676. Running a single process with thousands of fields across many indexes will lead to a lot of duplicate strings retained as keys and values in the `attributes` map. This can amount to GBs of heap for thousands of fields across a few thousand segments. The strings in the heap dump analysis below account for more than half (roughly 2/3, and the field names are somewhat unusually long in this example) of the duplicate strings from `FieldInfo` instances.
> If we could deduplicate these obvious, known strings when reading `FieldInfo`, we could save GBs of heap for use cases like this.
>  


