
Posted to issues@lucene.apache.org by "Armin Braun (Jira)" <ji...@apache.org> on 2022/08/08 12:05:00 UTC

[jira] [Created] (LUCENE-10677) Duplicate strings in FieldInfo#attributes contribute significantly to heap usage at scale

Armin Braun created LUCENE-10677:
------------------------------------

             Summary: Duplicate strings in FieldInfo#attributes contribute significantly to heap usage at scale
                 Key: LUCENE-10677
                 URL: https://issues.apache.org/jira/browse/LUCENE-10677
             Project: Lucene - Core
          Issue Type: Bug
          Components: core/codecs
    Affects Versions: 9.3
            Reporter: Armin Braun
         Attachments: lucene_duplicate_fields.png

This has the same origin as LUCENE-10676. Running a single process with thousands of fields across many indexes leaves a lot of duplicate strings retained as keys and values in the `attributes` map. For thousands of fields across a few thousand segments, this can amount to GBs of heap. In the attached heap dump analysis, these strings account for more than half of the duplicate strings retained by `FieldInfo` instances (roughly 2/3, and the field names in this example are somewhat unusually long).

If we could deduplicate these well-known strings when reading `FieldInfo`, we could save GBs of heap for use cases like this.
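
As a rough illustration of the idea (not a proposed patch), deduplication could look like the sketch below. The `AttributeDeduplicator` class and its methods are hypothetical and not part of Lucene. It assumes the set of distinct attribute keys and values is small and bounded, so an unbounded process-wide pool is acceptable.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper, not part of Lucene: canonicalizes attribute keys and values. */
final class AttributeDeduplicator {
  // Process-wide pool of canonical strings. Assumption: the number of distinct
  // attribute strings is small, so the pool never needs eviction.
  private static final ConcurrentHashMap<String, String> POOL = new ConcurrentHashMap<>();

  private AttributeDeduplicator() {}

  /** Returns the pooled instance equal to s, adding s to the pool if absent. */
  static String canonical(String s) {
    if (s == null) {
      return null; // ConcurrentHashMap does not accept null keys
    }
    String previous = POOL.putIfAbsent(s, s);
    return previous != null ? previous : s;
  }

  /** Copies the map, replacing every key and value with its pooled instance. */
  static Map<String, String> dedupe(Map<String, String> attributes) {
    Map<String, String> result = new HashMap<>(attributes.size() * 2);
    for (Map.Entry<String, String> entry : attributes.entrySet()) {
      result.put(canonical(entry.getKey()), canonical(entry.getValue()));
    }
    return result;
  }
}
```

With something like this applied at read time, equal attribute strings from different segments would share one `String` instance instead of each `FieldInfo` retaining its own copy. `String.intern()` would be an alternative to the explicit pool.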

 


