Posted to issues@lucene.apache.org by "Ignacio Vera (Jira)" <ji...@apache.org> on 2022/03/31 15:30:00 UTC

[jira] [Commented] (LUCENE-10315) Speed up BKD leaf block ids codec by a 512 ints ForUtil

    [ https://issues.apache.org/jira/browse/LUCENE-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17515397#comment-17515397 ] 

Ignacio Vera commented on LUCENE-10315:
---------------------------------------

I have been looking into this issue, and I think I have a clear picture of what is going on as well as a proposal to move it forward.

The first thing I have done is to look into the slowdown on the luceneutil benchmark. CPU profiling shows that the test is dominated by reads using the readInt24 method (calling the visitor directly). In order to understand the speed of the legacy methods in comparison with the new ForUtil methods, I have created the following microbenchmarks for the readInt24 methods:

When reading into an array, as expected, the new method is faster:
 
Benchmark                                      Mode  Cnt   Score   Error   Units
ReadInts24Benchmark.readInts24ForUtil         thrpt   25  11,739 ± 0,032  ops/us
ReadInts24Benchmark.readInts24Legacy          thrpt   25   8,722 ± 0,097  ops/us

But when emitting the integers directly to the visitor (the bulk method), it is not:

Benchmark                                      Mode  Cnt   Score   Error   Units
ReadInts24Benchmark.readInts24ForUtilVisitor  thrpt   25   1,131 ± 0,003  ops/us
ReadInts24Benchmark.readInts24LegacyVisitor   thrpt   25   1,526 ± 0,004  ops/us

That can easily explain the slowdown we see in the test, as adding integers via the bulk method should be much more common.
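To make the two paths being compared concrete, here is a minimal sketch of "reading into an array" versus "emitting directly to the visitor"; the names and the tiny visitor interface are illustrative only, not the actual Lucene DocIdsWriter/IntersectVisitor code:

{code:java}
// Minimal sketch, assuming a hypothetical visitor interface; not the Lucene API.
interface DocVisitor {
  void visit(int docID);
}

final class BlockReadSketch {

  // Shape 1: decode the whole block into a scratch array, then iterate over it.
  // (The copy stands in for the actual legacy/ForUtil decode.)
  static void readIntoArray(int[] encodedBlock, int[] scratch) {
    System.arraycopy(encodedBlock, 0, scratch, 0, encodedBlock.length);
  }

  // Shape 2: emit every decoded value straight to the visitor, one call per docID.
  static void readToVisitor(int[] encodedBlock, DocVisitor visitor) {
    for (int docID : encodedBlock) {
      visitor.visit(docID);
    }
  }
}
{code}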

I have run the same experiment for readInt32 methods and in both cases the new method is faster:

When reading into an array:

Benchmark                                      Mode  Cnt   Score   Error   Units
ReadIntsBenchmark.readIntsForUtil             thrpt   25  24,962 ± 0,141  ops/us
ReadIntsBenchmark.readIntsLegacy              thrpt   25   5,475 ± 0,025  ops/us

Using the bulk method:

Benchmark                                      Mode  Cnt   Score   Error   Units
ReadIntsBenchmark.readIntsForUtilVisitor      thrpt   25   1,387 ± 0,006  ops/us
ReadIntsBenchmark.readIntsLegacyVisitor       thrpt   25   0,851 ± 0,005  ops/us
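
For reference, a JMH throughput harness along the lines of these measurements could look like the sketch below; the method bodies are only placeholders for the real legacy/ForUtil decode paths, and only the harness shape (throughput mode, ops/us) corresponds to the output above:

{code:java}
import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

// Sketch of a JMH harness shaped like the results above; the real benchmark
// would call the legacy and ForUtil decode routines instead of the placeholders.
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
public class ReadIntsBenchmark {

  private static final int BLOCK_SIZE = 512;
  private int[] encoded;
  private int[] scratch;

  @Setup
  public void setup() {
    Random random = new Random(42);
    encoded = new int[BLOCK_SIZE];
    scratch = new int[BLOCK_SIZE];
    for (int i = 0; i < BLOCK_SIZE; i++) {
      encoded[i] = random.nextInt(1 << 24); // docIDs that fit in 24 bits
    }
  }

  @Benchmark
  public int[] readIntsForUtil() {
    // placeholder for the ForUtil-style block decode into an array
    System.arraycopy(encoded, 0, scratch, 0, BLOCK_SIZE);
    return scratch;
  }

  @Benchmark
  public long readIntsForUtilVisitor() {
    // placeholder for decoding and handing each value to a visitor
    long sum = 0;
    for (int i = 0; i < BLOCK_SIZE; i++) {
      sum += encoded[i];
    }
    return sum;
  }
}
{code}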

We found some very significant improvements with this patch on Elasticsearch benchmarks, which I wanted to understand as they cannot be explained with the data above. In order to analyse it, I built some flame graphs that are attached to the issue. The flame graphs show that before the patch, performance was dominated by the readVInt methods, which this patch modifies so that they are only used when values are close together.
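
For context, the readVInt path corresponds to delta-coded docIDs decoded one variable-length byte at a time, roughly like the sketch below (illustrative only, not the actual DocIdsWriter code):

{code:java}
// Rough sketch of delta + vInt docID decoding; illustrative only.
final class DeltaVIntSketch {

  // Reads `count` docIDs that were written as vInt-encoded deltas into `docIDs`.
  static void readDeltaVInts(byte[] buf, int count, int[] docIDs) {
    int pos = 0;
    int doc = 0;
    for (int i = 0; i < count; i++) {
      int delta = 0;
      int shift = 0;
      byte b;
      do {                      // one byte at a time, so the cost is paid per value
        b = buf[pos++];
        delta |= (b & 0x7F) << shift;
        shift += 7;
      } while ((b & 0x80) != 0);
      doc += delta;             // deltas assume the docIDs in the block are sorted
      docIDs[i] = doc;
    }
  }
}
{code}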

I would recommend adding this patch again but removing the int24 ForUtil implementation. [~gf2121], would you like to have a go?





> Speed up BKD leaf block ids codec by a 512 ints ForUtil
> -------------------------------------------------------
>
>                 Key: LUCENE-10315
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10315
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Feng Guo
>            Assignee: Feng Guo
>            Priority: Major
>         Attachments: addall.svg, cpu_profile_baseline.html, cpu_profile_path.html
>
>          Time Spent: 6h 20m
>  Remaining Estimate: 0h
>
> Elasticsearch (which is based on Lucene) can automatically infer types for users with its dynamic mapping feature. When users index some low-cardinality fields, such as gender / age / status... they often use numbers to represent the values, so ES will infer these fields as {{long}}, and ES uses BKD as the index for {{long}} fields. When the data volume grows, building the result set of low-cardinality fields makes the CPU usage and load very high.
> This is a flame graph we obtained from the production environment:
> [^addall.svg]
> It can be seen that almost all CPU is used in addAll. When we reindexed {{long}} to {{keyword}}, the cluster load and search latency were greatly reduced (we spent weeks reindexing all indices...). I know that the ES documentation recommends {{keyword}} for term/terms queries and {{long}} for range queries, but there are always some users who don't realize this and keep their habits from SQL databases, or dynamic mapping automatically selects the type for them. All in all, users won't realize that there is such a big performance difference between {{long}} and {{keyword}} for low-cardinality fields. So from my point of view it makes sense to make BKD work better for low/medium-cardinality fields.
> As far as I can see, for low-cardinality fields there are two advantages of {{keyword}} over {{long}}:
> 1. The {{ForUtil}} used in {{keyword}} postings is much more efficient than BKD's delta VInt, because of its batch reading (readLongs) and SIMD decoding.
> 2. When the query term count is less than 16, {{TermsInSetQuery}} can lazily materialize its result set, so when another clause with a small result set intersects with this low-cardinality condition, the low-cardinality field can avoid reading all docIds into memory.
> This issue targets the first point. The basic idea is to use a 512-int {{ForUtil}} for the BKD ids codec. I benchmarked this optimization by mocking some random {{LongPoint}} fields and querying them with {{PointInSetQuery}}.
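> To illustrate the idea, a naive scalar sketch of decoding such a fixed 512-value block is shown below; the real {{ForUtil}} uses code that is unrolled per bit width and SIMD-friendly, so this is only a reference for what the codec computes:
> {code:java}
> // Naive reference decode of a 512-value block packed at bitsPerValue bits each.
> // The real ForUtil is unrolled per bit width and vectorizes well; this only
> // illustrates the fixed-block bit-packing idea.
> final class BlockDecodeSketch {
>   static final int BLOCK_SIZE = 512;
>   static void decode(long[] packed, int bitsPerValue, int[] out) {
>     long mask = (1L << bitsPerValue) - 1;
>     int bitPos = 0;
>     for (int i = 0; i < BLOCK_SIZE; i++) {
>       int word = bitPos >>> 6;
>       int shift = bitPos & 63;
>       long value = packed[word] >>> shift;
>       if (shift + bitsPerValue > 64) {
>         value |= packed[word + 1] << (64 - shift); // value crosses a long boundary
>       }
>       out[i] = (int) (value & mask);
>       bitPos += bitsPerValue;
>     }
>   }
> }
> {code}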
> *Benchmark Result*
> |doc count|field cardinality|query point|baseline QPS|candidate QPS|diff percentage|
> |100000000|32|1|51.44|148.26|188.22%|
> |100000000|32|2|26.8|101.88|280.15%|
> |100000000|32|4|14.04|53.52|281.20%|
> |100000000|32|8|7.04|28.54|305.40%|
> |100000000|32|16|3.54|14.61|312.71%|
> |100000000|128|1|110.56|350.26|216.81%|
> |100000000|128|8|16.6|89.81|441.02%|
> |100000000|128|16|8.45|48.07|468.88%|
> |100000000|128|32|4.2|25.35|503.57%|
> |100000000|128|64|2.13|13.02|511.27%|
> |100000000|1024|1|536.19|843.88|57.38%|
> |100000000|1024|8|109.71|251.89|129.60%|
> |100000000|1024|32|33.24|104.11|213.21%|
> |100000000|1024|128|8.87|30.47|243.52%|
> |100000000|1024|512|2.24|8.3|270.54%|
> |100000000|8192|1|3333.33|5000|50.00%|
> |100000000|8192|32|139.47|214.59|53.86%|
> |100000000|8192|128|54.59|109.23|100.09%|
> |100000000|8192|512|15.61|36.15|131.58%|
> |100000000|8192|2048|4.11|11.14|171.05%|
> |100000000|1048576|1|2597.4|3030.3|16.67%|
> |100000000|1048576|32|314.96|371.75|18.03%|
> |100000000|1048576|128|99.7|116.28|16.63%|
> |100000000|1048576|512|30.5|37.15|21.80%|
> |100000000|1048576|2048|10.38|12.3|18.50%|
> |100000000|8388608|1|2564.1|3174.6|23.81%|
> |100000000|8388608|32|196.27|238.95|21.75%|
> |100000000|8388608|128|55.36|68.03|22.89%|
> |100000000|8388608|512|15.58|19.24|23.49%|
> |100000000|8388608|2048|4.56|5.71|25.22%|
> The index size is reduced for low-cardinality fields and flat for high-cardinality fields.
> {code:java}
> 113M    index_100000000_doc_32_cardinality_baseline
> 114M    index_100000000_doc_32_cardinality_candidate
> 140M    index_100000000_doc_128_cardinality_baseline
> 133M    index_100000000_doc_128_cardinality_candidate
> 193M    index_100000000_doc_1024_cardinality_baseline
> 174M    index_100000000_doc_1024_cardinality_candidate
> 241M    index_100000000_doc_8192_cardinality_baseline
> 233M    index_100000000_doc_8192_cardinality_candidate
> 314M    index_100000000_doc_1048576_cardinality_baseline
> 315M    index_100000000_doc_1048576_cardinality_candidate
> 392M    index_100000000_doc_8388608_cardinality_baseline
> 391M    index_100000000_doc_8388608_cardinality_candidate
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org