You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by "Uwe Schindler (Jira)" <ji...@apache.org> on 2022/05/13 19:51:00 UTC

[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?

    [ https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17536855#comment-17536855 ] 

Uwe Schindler commented on LUCENE-10572:
----------------------------------------

With the swapping of bytes I remember the other issue where this was discussed. In my personal opinion we should just use ByteOrder.getDefault() here to get the varhandle. The byte order does not mapper, we just need to get 2 bytes.

The reason for this decision was that Robert was arguing about reproducibility. And he did not like platform dependencies (he was arguing that we can't test the algorithm with different platforms, so if we use default byte order of the platform we run out code (e.g. you), somebody could see bugs on arm.

I don't think that's an issue, just the typical way how Robert argues. In that case let's get a varhandle with platform order in the static ctor and not use the one from BitUtil.

We may also use a var handle in same way to save the length.

> Can we optimize BytesRefHash?
> -----------------------------
>
>                 Key: LUCENE-10572
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10572
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Priority: Major
>
> I was poking around in our nightly benchmarks ([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR profiling that the hottest method is this:
> {noformat}
> PERCENT       CPU SAMPLES   STACK
> 9.28%         53848         org.apache.lucene.util.BytesRefHash#equals()
>                               at org.apache.lucene.util.BytesRefHash#findHash()
>                               at org.apache.lucene.util.BytesRefHash#add()
>                               at org.apache.lucene.index.TermsHashPerField#add()
>                               at org.apache.lucene.index.IndexingChain$PerField#invert()
>                               at org.apache.lucene.index.IndexingChain#processField()
>                               at org.apache.lucene.index.IndexingChain#processDocument()
>                               at org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat}
> This is kinda crazy – comparing if the term to be inserted into the inverted index hash equals the term already added to {{BytesRefHash}} is the hottest method during nightly benchmarks.
> Discussing offline with [~rcmuir] and [~jpountz] they noticed a few questionable things about our current implementation:
>  * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the inserted term into the hash?  Let's just use two bytes always, since IW limits term length to 32 K (< 64K that an unsigned short can cover)
>  * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} (BitUtil.VH_BE_SHORT.get)
>  * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not aggressive enough?  Or the initial sizing of the hash is too small?
>  * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too many {{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible "upgrades"?
>  * If we stick with {{{}MurmurHash{}}}, why are we using the 32 bit version ({{{}murmurhash3_x86_32{}}})?
>  * Are we using the JVM's intrinsics to compare multiple bytes in a single SIMD instruction ([~rcmuir] is quite sure we are indeed)?
>  * [~jpountz] suggested maybe the hash insert is simply memory bound
>  * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total CPU cost)
> I pulled these observations from a recent (5/6/22) profiler output: [https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html]
> Maybe we can improve our performance on this crazy hotspot?
> Or maybe this is a "healthy" hotspot and we should leave it be!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org