Posted to issues@lucene.apache.org by "Michael McCandless (Jira)" <ji...@apache.org> on 2021/03/01 20:09:00 UTC

[jira] [Commented] (LUCENE-9816) lazy-init LZ4-HC hashtable in blocktreewriter

    [ https://issues.apache.org/jira/browse/LUCENE-9816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17293141#comment-17293141 ] 

Michael McCandless commented on LUCENE-9816:
--------------------------------------------

{quote}[~mikemccand] This is due to how the algorithm looks for duplicates: it stores a large hash table that maps 4-byte sequences to offsets in the input.
{quote}
+1, thanks for the explanation and musings about how we might further optimize it [~jpountz]!

> lazy-init LZ4-HC hashtable in blocktreewriter
> ---------------------------------------------
>
>                 Key: LUCENE-9816
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9816
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Major
>             Fix For: master (9.0)
>
>         Attachments: LUCENE-9816.patch
>
>
> Based upon the data for a field, blocktree may compress with LZ4-HC (or with simple lowercase compression or none at all).
> But we currently eagerly initialize the HC hashtable (132k) for each field, regardless of whether it will even be "tried". This shows up as a top CPU and heap hotspot when profiling tests, and it creates unnecessary overhead for small flushes.
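The fix described above amounts to deferring that 132k allocation until LZ4-HC compression is actually attempted for a field. A minimal sketch of the lazy-init pattern, with hypothetical class and member names (this is not the actual Lucene patch):

```java
// Hypothetical sketch (not Lucene's actual code) of the lazy-init fix:
// the LZ4-HC hash table is allocated only when HC compression is first tried.
class BlockCompressor {
  // LZ4-HC keeps a hash table mapping 4-byte sequences to offsets in the
  // input; allocating it eagerly per field is wasteful when HC compression
  // is never attempted for that field.
  private int[] hcHashTable; // stays null until first use

  boolean tableAllocated() {
    return hcHashTable != null;
  }

  void compressHighly(byte[] input) {
    if (hcHashTable == null) {
      // Allocated on demand; on the order of the 132k the issue reports.
      hcHashTable = new int[32 * 1024];
    }
    // ... run LZ4-HC against hcHashTable (omitted) ...
  }
}
```

With this shape, a flush over fields that only use simple lowercase compression (or none) never pays for the table at all.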



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org