Posted to issues@lucene.apache.org by "Jaison.Bi (Jira)" <ji...@apache.org> on 2021/01/12 12:25:00 UTC
[jira] [Created] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues
Jaison.Bi created LUCENE-9663:
---------------------------------
Summary: Adding compression to terms dict from SortedSet/Sorted DocValues
Key: LUCENE-9663
URL: https://issues.apache.org/jira/browse/LUCENE-9663
Project: Lucene - Core
Issue Type: Improvement
Components: core/codecs
Reporter: Jaison.Bi
The Elasticsearch keyword field uses SortedSet DocValues. In our applications, “keyword” is the most frequently used field type.
LUCENE-7081 added prefix compression for the docvalues terms dict. We can do better by replacing prefix compression with LZ4. In one of our applications, the dvd files were ~41% smaller with this change (from 1.95 GB to 1.15 GB).
I ran simple tests on real application data, comparing the write and merge time cost and the on-disk *.dvd file size (after merging into 1 segment).
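To make the comparison concrete, here is a minimal sketch of the two ideas, not the actual Lucene code or the proposed patch. It assumes a simplified prefix coding (one byte each for the shared-prefix length and the suffix length) and uses zlib as a stand-in for LZ4, since LZ4 is not in the Python standard library. The function names, the block size of 64 terms, and the sample terms are all hypothetical.

```python
import zlib


def shared_prefix_len(a: bytes, b: bytes) -> int:
    """Number of leading bytes that a and b have in common."""
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i


def prefix_compressed_size(terms):
    """Size in bytes when each term stores only the suffix that differs
    from the previous term, plus two assumed 1-byte length fields."""
    total = 0
    prev = b""
    for term in terms:
        shared = shared_prefix_len(prev, term)
        total += 2 + (len(term) - shared)
        prev = term
    return total


def block_compressed_size(terms, block_size=64):
    """Size in bytes when terms are grouped into blocks and each block is
    compressed as a whole (zlib here, standing in for LZ4)."""
    total = 0
    for start in range(0, len(terms), block_size):
        block = b"".join(
            len(t).to_bytes(1, "big") + t  # assumes terms shorter than 256 bytes
            for t in terms[start:start + block_size]
        )
        total += len(zlib.compress(block))
    return total


if __name__ == "__main__":
    # Hypothetical high-cardinality keyword values with a shared suffix.
    terms = sorted(f"host-{i:05d}.example.com".encode() for i in range(1000))
    print("raw bytes:          ", sum(len(t) for t in terms))
    print("prefix-coded bytes: ", prefix_compressed_size(terms))
    print("block-coded bytes:  ", block_compressed_size(terms))
```

Prefix coding can only drop bytes shared with the previous term's prefix, while a general-purpose block compressor can also exploit repeated substrings elsewhere in the terms (such as the common suffix in the sample data). The actual savings depend on the data.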
|| ||Before||After||
|Write time cost(ms)|591972|618200|
|Merge time cost(ms)|270661|294663|
|*.dvd file size(GB)|1.95|1.15|
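As a quick check of the trade-off in the table above, the relative changes can be computed directly from the reported numbers. The helper name `pct_change` is only for illustration.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


# Values copied from the table above.
results = {
    "write time (ms)": (591972, 618200),
    "merge time (ms)": (270661, 294663),
    "*.dvd size (GB)": (1.95, 1.15),
}

for name, (before, after) in results.items():
    print(f"{name}: {pct_change(before, after):+.1f}%")
```

In other words, write time rises by about 4.4% and merge time by about 8.9%, in exchange for a *.dvd file that is about 41% smaller.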
This feature is intended only for high-cardinality fields.
I'm running benchmarks based on luceneutil and will attach the report and patch once they finish.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)