Posted to issues@lucene.apache.org by "Jaison.Bi (Jira)" <ji...@apache.org> on 2021/01/12 12:25:00 UTC

[jira] [Created] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues

Jaison.Bi created LUCENE-9663:
---------------------------------

             Summary: Adding compression to terms dict from SortedSet/Sorted DocValues
                 Key: LUCENE-9663
                 URL: https://issues.apache.org/jira/browse/LUCENE-9663
             Project: Lucene - Core
          Issue Type: Improvement
          Components: core/codecs
            Reporter: Jaison.Bi


The Elasticsearch keyword field uses SortedSet DocValues. In our applications, "keyword" is the most frequently used field type.
LUCENE-7081 added prefix-compression for the docvalues terms dict. We can do better by replacing prefix-compression with LZ4. In one of our applications, the dvd files were ~41% smaller with this change (from 1.95 GB to 1.15 GB).
I've run simple tests on real application data, comparing the write/merge time cost and the on-disk *.dvd file size (after merging into 1 segment).
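
For readers unfamiliar with the baseline, here is a minimal sketch of what prefix-compression of a sorted terms list means: each term stores only the number of leading bytes it shares with the previous term, plus the remaining suffix. The class and method names are hypothetical and this is not Lucene's actual on-disk format, only an illustration of the idea. Redundancy that is not at the start of adjacent terms is left untouched, which is the kind of repetition a general-purpose compressor such as LZ4 can also exploit.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not Lucene code and not its file format.
public class PrefixCompressionSketch {

    /** One encoded term: bytes shared with the previous term, then the rest. */
    static final class Entry {
        final int prefixLength;
        final byte[] suffix;

        Entry(int prefixLength, byte[] suffix) {
            this.prefixLength = prefixLength;
            this.suffix = suffix;
        }
    }

    /** Encodes terms that are already sorted, as a terms dict would be. */
    static List<Entry> encode(List<String> sortedTerms) {
        List<Entry> out = new ArrayList<>();
        byte[] previous = new byte[0];
        for (String term : sortedTerms) {
            byte[] current = term.getBytes(StandardCharsets.UTF_8);
            int limit = Math.min(previous.length, current.length);
            int shared = 0;
            while (shared < limit && previous[shared] == current[shared]) {
                shared++;
            }
            out.add(new Entry(shared, Arrays.copyOfRange(current, shared, current.length)));
            previous = current;
        }
        return out;
    }

    /** Rebuilds the original terms by reusing the prefix of the previous term. */
    static List<String> decode(List<Entry> entries) {
        List<String> out = new ArrayList<>();
        byte[] previous = new byte[0];
        for (Entry e : entries) {
            byte[] current = new byte[e.prefixLength + e.suffix.length];
            System.arraycopy(previous, 0, current, 0, e.prefixLength);
            System.arraycopy(e.suffix, 0, current, e.prefixLength, e.suffix.length);
            out.add(new String(current, StandardCharsets.UTF_8));
            previous = current;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("user-1000", "user-1001", "user-1002");
        for (Entry e : encode(terms)) {
            System.out.println(e.prefixLength + " + \""
                + new String(e.suffix, StandardCharsets.UTF_8) + "\"");
        }
    }
}
```

In this example the second and third terms each store a shared-prefix length of 8 and a one-byte suffix instead of the full nine bytes.
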
|| ||Before||After||
|Write time cost (ms)|591972|618200|
|Merge time cost (ms)|270661|294663|
|*.dvd file size (GB)|1.95|1.15|
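
To make the trade-off in the table easier to read, the relative changes can be computed directly from the reported numbers. The snippet below is only arithmetic on the figures above (the class and method names are hypothetical): write time rises by roughly 4.4%, merge time by roughly 8.9%, and the *.dvd size drops by roughly 41%.

```java
import java.util.Locale;

// Hypothetical helper; it only restates the table's numbers as percentages.
public class BenchmarkDelta {

    /** Relative change from before to after, in percent (negative means smaller). */
    static double percentChange(double before, double after) {
        return (after - before) / before * 100.0;
    }

    public static void main(String[] args) {
        // Values copied from the Before/After table.
        System.out.printf(Locale.ROOT, "Write time: %+.1f%%%n", percentChange(591972, 618200));
        System.out.printf(Locale.ROOT, "Merge time: %+.1f%%%n", percentChange(270661, 294663));
        System.out.printf(Locale.ROOT, "dvd size:   %+.1f%%%n", percentChange(1.95, 1.15));
    }
}
```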

The compression is applied only to high-cardinality fields.
I am running benchmarks with luceneutil and will attach the report and the patch once they are done.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org