Posted to issues@lucene.apache.org by "Robert Muir (Jira)" <ji...@apache.org> on 2021/02/21 15:50:00 UTC
[jira] [Commented] (LUCENE-9795) investigate large checkindex/grouping regression in nightly benchmarks

    [ https://issues.apache.org/jira/browse/LUCENE-9795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17287988#comment-17287988 ] 

Robert Muir commented on LUCENE-9795:
-------------------------------------

OK, I think I can explain the checkindex part.

When profiling unit tests, I see this stack as the top CPU consumer:

{noformat}
java.nio.ByteBuffer#get()
                              at java.nio.DirectByteBuffer#get()
                              at org.apache.lucene.store.ByteBufferGuard#getBytes()
                              at org.apache.lucene.store.ByteBufferIndexInput#readBytes()
                              at org.apache.lucene.store.MockIndexInputWrapper#readBytes()
                              at org.apache.lucene.util.compress.LZ4#decompress()
                              at org.apache.lucene.codecs.lucene80.Lucene80DocValuesProducer$TermsDict#decompressBlock()
                              at org.apache.lucene.codecs.lucene80.Lucene80DocValuesProducer$TermsDict#next()
                              at org.apache.lucene.codecs.lucene80.Lucene80DocValuesProducer$TermsDict#seekExact()
                              at org.apache.lucene.codecs.lucene80.Lucene80DocValuesProducer$BaseSortedDocValues#lookupOrd()
                              at org.apache.lucene.index.SortedDocValues#binaryValue()
                              at org.apache.lucene.index.CheckIndex#checkBinaryDocValues()
{noformat}

I don't think checkindex should retrieve the bytes for every document of a SORTED field as if it were BINARY. It looks like a leftover to me. I will upload a simple patch.
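
To make the cost concrete, here is a minimal sketch in Java of why per-document term lookups hurt when the terms dictionary is block-compressed. It is a toy model, not Lucene code: the class TermsDictSketch, its methods, and the assumption that every random lookup decodes one whole block are all hypothetical, chosen only to illustrate the shape of the problem.

```java
/** Toy model (NOT Lucene code): a sorted terms dictionary stored in fixed-size blocks. */
final class TermsDictSketch {
    private final String[] sortedTerms; // index == ordinal
    private final int blockSize;        // terms per "compressed" block
    private long blockDecodes = 0;      // how many times a block was decoded

    TermsDictSketch(String[] sortedTerms, int blockSize) {
        this.sortedTerms = sortedTerms;
        this.blockSize = blockSize;
    }

    /** Random access by ordinal. Assumption: each call decodes one whole block. */
    String lookupOrd(int ord) {
        blockDecodes++;
        return sortedTerms[ord];
    }

    /** Sequential pass over all terms: each block is decoded once, at its first term. */
    int visitAllTermsInOrder() {
        int visited = 0;
        for (int ord = 0; ord < sortedTerms.length; ord++) {
            if (ord % blockSize == 0) {
                blockDecodes++;
            }
            visited++;
        }
        return visited;
    }

    /** Block decodes when the bytes are fetched once per document. */
    static long decodesForPerDocCheck(TermsDictSketch dict, int[] docOrds) {
        dict.blockDecodes = 0;
        for (int ord : docOrds) {
            dict.lookupOrd(ord);
        }
        return dict.blockDecodes;
    }

    /** Block decodes when the dictionary is walked once, in ordinal order. */
    static long decodesForPerTermCheck(TermsDictSketch dict) {
        dict.blockDecodes = 0;
        dict.visitAllTermsInOrder();
        return dict.blockDecodes;
    }

    public static void main(String[] args) {
        String[] terms = new String[100];
        for (int i = 0; i < terms.length; i++) {
            terms[i] = String.format("term%03d", i);
        }
        TermsDictSketch dict = new TermsDictSketch(terms, 16);

        // 1000 documents, each pointing at one of the 100 ordinals.
        int[] docOrds = new int[1000];
        for (int d = 0; d < docOrds.length; d++) {
            docOrds[d] = d % terms.length;
        }

        System.out.println("per-document check: " + decodesForPerDocCheck(dict, docOrds) + " block decodes");
        System.out.println("per-term check:     " + decodesForPerTermCheck(dict) + " block decodes");
    }
}
```

With 1000 documents, 100 unique terms and 16 terms per block, this toy model counts 1000 block decodes for the per-document check and 7 for the in-order pass over the dictionary.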

The grouping regression should probably be a separate issue. I suspect the grouping logic is inefficiently doing something similar (reading tons of term bytes instead of using ordinals, or something like that).

> investigate large checkindex/grouping regression in nightly benchmarks
> ----------------------------------------------------------------------
>
>                 Key: LUCENE-9795
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9795
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Major
>         Attachments: Screen_Shot_2021-02-21_at_09.17.53.png, Screen_Shot_2021-02-21_at_09.30.30.png
>
>
> In the nightly benchmark, checkindex times increased more than 4x on the 2/16 datapoint.
> Looking at the commits on 2/15, the most obvious thing to look into is docvalues terms dict compression: LUCENE-9663
> Will try to pinpoint it more. My concern is some perf bug, such as every single term causing decompression of the whole block repeatedly (missing seek-within-block optimization?)


