Posted to issues@lucene.apache.org by "Jaison.Bi (Jira)" <ji...@apache.org> on 2021/01/12 12:27:00 UTC

[jira] [Updated] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues

     [ https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jaison.Bi updated LUCENE-9663:
------------------------------
    Description: 
The Elasticsearch keyword field uses SortedSet DocValues. In our applications, “keyword” is the most frequently used field type.
LUCENE-7081 added prefix compression for the docvalues terms dict. We can do better by replacing prefix compression with LZ4. In one of our applications, the dvd files were ~41% smaller with this change (from 1.95 GB to 1.15 GB).
I've run simple tests on real application data, comparing the write/merge time cost and the on-disk *.dvd file size (after merging into 1 segment).
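
For readers unfamiliar with the baseline, prefix compression stores each term of a sorted list as the number of leading bytes it shares with the previous term, plus the remaining suffix. The sketch below illustrates only that general idea. The class and method names are hypothetical, and it is not Lucene's on-disk format or the LUCENE-7081 implementation. LZ4, by contrast, can also match repeated byte sequences that occur anywhere within a block, not only at the start of adjacent terms.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of prefix compression; not Lucene's actual format. */
class PrefixCodecSketch {

    /** One encoded term: bytes shared with the previous term, plus the rest. */
    static final class Entry {
        final int prefixLen;
        final byte[] suffix;

        Entry(int prefixLen, byte[] suffix) {
            this.prefixLen = prefixLen;
            this.suffix = suffix;
        }
    }

    /** Encodes terms that are already sorted; each entry depends on the previous term. */
    static List<Entry> encode(List<String> sortedTerms) {
        List<Entry> out = new ArrayList<>();
        byte[] prev = new byte[0];
        for (String term : sortedTerms) {
            byte[] cur = term.getBytes(StandardCharsets.UTF_8);
            int shared = 0;
            int max = Math.min(prev.length, cur.length);
            while (shared < max && prev[shared] == cur[shared]) {
                shared++;
            }
            out.add(new Entry(shared, Arrays.copyOfRange(cur, shared, cur.length)));
            prev = cur;
        }
        return out;
    }

    /** Rebuilds the original terms by re-applying each shared prefix. */
    static List<String> decode(List<Entry> entries) {
        List<String> out = new ArrayList<>();
        byte[] prev = new byte[0];
        for (Entry e : entries) {
            byte[] cur = new byte[e.prefixLen + e.suffix.length];
            System.arraycopy(prev, 0, cur, 0, e.prefixLen);
            System.arraycopy(e.suffix, 0, cur, e.prefixLen, e.suffix.length);
            out.add(new String(cur, StandardCharsets.UTF_8));
            prev = cur;
        }
        return out;
    }

    /** Total suffix bytes stored, ignoring the cost of the prefix lengths. */
    static int suffixBytes(List<Entry> entries) {
        int total = 0;
        for (Entry e : entries) {
            total += e.suffix.length;
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> terms = List.of("apple", "applet", "apply", "banana");
        List<Entry> encoded = encode(terms);
        System.out.println("suffix bytes stored: " + suffixBytes(encoded));
        System.out.println("round trip ok: " + decode(encoded).equals(terms));
    }
}
```

Because each entry only refers to the term immediately before it, the savings depend on how similar neighbouring terms are. Repetition elsewhere in the block is left untouched.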
|| ||Before||After||
|Write time cost (ms)|591972|618200|
|Merge time cost (ms)|270661|294663|
|*.dvd file size (GB)|1.95|1.15|
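
The trade-off in the table is easier to read as relative changes. The snippet below is a small sketch that recomputes them from the figures reported above. The class and method names are hypothetical and the only inputs are the six numbers from the table.

```java
import java.util.Locale;

/** Hypothetical helper that recomputes relative changes from the table above. */
class BenchmarkDelta {

    /** Relative change from before to after, in percent (negative means a reduction). */
    static double percentChange(double before, double after) {
        return (after - before) / before * 100.0;
    }

    public static void main(String[] args) {
        // Figures copied from the table: write time (ms), merge time (ms), *.dvd size (GB).
        System.out.printf(Locale.ROOT, "write time: %+.1f%%%n", percentChange(591972, 618200));
        System.out.printf(Locale.ROOT, "merge time: %+.1f%%%n", percentChange(270661, 294663));
        System.out.printf(Locale.ROOT, "dvd size:   %+.1f%%%n", percentChange(1.95, 1.15));
    }
}
```

On these figures, write time rises by roughly 4.4% and merge time by roughly 8.9%, in exchange for the ~41% reduction in *.dvd size.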

This feature applies only to high-cardinality fields.
I'm running benchmarks with luceneutil and will attach the report and patch once they are done.


> Adding compression to terms dict from SortedSet/Sorted DocValues
> ----------------------------------------------------------------
>
>                 Key: LUCENE-9663
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9663
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>            Reporter: Jaison.Bi
>            Priority: Trivial


