Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2015/02/10 22:08:11 UTC

[jira] [Updated] (LUCENE-6233) CheckIndex is dog slow when checking term vectors

     [ https://issues.apache.org/jira/browse/LUCENE-6233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-6233:
---------------------------------------
    Attachment: LUCENE-6233.patch

Patch.

I disabled Terms.getMin/Max checking for TVs, changed the "test with
the one doc deleted" check to run only on the first doc, and now test
only 1 "advance" doc.

I also added the time taken for each part we test, e.g.:

{noformat}
  1 of 24: name=_1b docCount=10309
    version=6.0.0
    id=cd308kthf553d7dl049vw982u
    codec=Asserting(Lucene50)
    compound=true
    numFiles=3
    size (MB)=30.358
    diagnostics = {os=Linux, java.vendor=Oracle Corporation, java.version=1.8.0_25, lucene.version=6.0.0, mergeMaxNumSegments=-1, os.arch=amd64, source=merge, mergeFactor=3, os.version=3.13.0-37-generic, timestamp=1423588030806}
    no deletions
    test: open reader.........OK
    test: check integrity.....OK
    test: check live docs.....OK [took 0.000 sec]
    test: field infos.........OK [8 fields] [took 0.000 sec]
    test: field norms.........OK [2 fields] [took 0.005 sec]
    test: terms, freq, prox...OK [381010 terms; 1154763 terms/docs pairs; 1883324 tokens] [took 0.550 sec]
    test: stored fields.......OK [41236 total field count; avg 4.0 fields per doc] [took 0.323 sec]
    test: term vectors........OK [20617 total term vector count; avg 2.0 term/freq vector fields per doc] [took 1.257 sec]
    test: docvalues...........OK [2 docvalues fields; 0 BINARY; 1 NUMERIC; 1 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET] [took 0.020 sec]
{noformat}
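
For readers who want to see how a per-stage "[took N sec]" suffix like the one above can be produced, here is a minimal sketch. It is not the code from the patch: the StageTimer class, the Stage interface and both method names are hypothetical, and the only assumption taken from the output is the format (seconds with three decimals).

```java
import java.util.Locale;

public class StageTimer {
    /** Hypothetical stand-in for one CheckIndex test stage; returns a short summary string. */
    interface Stage {
        String run() throws Exception;
    }

    /** Formats elapsed nanoseconds as seconds with three decimals, e.g. "[took 0.005 sec]". */
    static String formatTook(long elapsedNanos) {
        return String.format(Locale.ROOT, "[took %.3f sec]", elapsedNanos / 1_000_000_000.0);
    }

    /** Runs one stage, measures it with the monotonic clock, and builds the report line. */
    static String timeStage(String label, Stage stage) throws Exception {
        long start = System.nanoTime();
        String summary = stage.run();
        long elapsed = System.nanoTime() - start;
        return "    test: " + label + "OK [" + summary + "] " + formatTook(elapsed);
    }

    public static void main(String[] args) throws Exception {
        // The lambda stands in for real checking work.
        System.out.println(timeStage("field infos.........", () -> "8 fields"));
    }
}
```

System.nanoTime() is used rather than System.currentTimeMillis() because it is monotonic, so a wall-clock adjustment during a long check cannot produce a negative or inflated duration.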

Term vectors checking is still slowish, but at least it's faster: on
my smallish test index the total CheckIndex time improves from 33.6
seconds to 12.5 seconds.

I also plotted the time to CheckIndex in the nightly benchmark: https://people.apache.org/~mikemccand/lucenebench/checkIndexTime.html

However, that index doesn't have term vectors, so this issue shouldn't
affect it ...


> CheckIndex is dog slow when checking term vectors
> -------------------------------------------------
>
>                 Key: LUCENE-6233
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6233
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>         Attachments: LUCENE-6233.patch
>
>
> I'm working on a test that creates a biggish index and I noticed that CheckIndex takes a surprisingly long time to check term vectors.
> I profiled it and uncovered that we spend a lot of time (not sure this explains all of it) in Terms.getMin/getMax.  Since CompressingTermVectorsReader doesn't impl these methods efficiently (which is fine), we fall back to super's impl, which does a digit-by-digit binary search using seekCeil.
> But for TVs this sometimes results in a linear scan.
> I think CheckIndex should not check Terms.getMin/Max for TVs?
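
To illustrate why the fallback described above gets expensive, here is a simplified, self-contained model. It is not Lucene's code: LinearTerms, termsOf and termsVisited are hypothetical names, and the model assumes that every seekCeil scans the sorted terms from the start, which is roughly the worst case the description mentions for term vectors. getMax finds the largest term one byte ("digit") at a time, binary-searching each byte value with seekCeil.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MaxTermSketch {
    /** Hypothetical stand-in for a term vector's terms: sorted, and scanned linearly on every seek. */
    static final class LinearTerms {
        private final List<byte[]> sorted;
        long termsVisited = 0; // total terms touched across all seeks

        LinearTerms(List<byte[]> sorted) {
            this.sorted = sorted;
        }

        /** Returns the smallest term >= target in unsigned byte order, or null if there is none. */
        byte[] seekCeil(byte[] target) {
            for (byte[] term : sorted) {
                termsVisited++;
                if (Arrays.compareUnsigned(term, target) >= 0) {
                    return term;
                }
            }
            return null;
        }
    }

    /** Builds a sorted term list from ASCII strings (helper for this sketch only). */
    static List<byte[]> termsOf(String... words) {
        List<byte[]> terms = new ArrayList<>();
        for (String w : words) {
            terms.add(w.getBytes(StandardCharsets.US_ASCII));
        }
        terms.sort(Arrays::compareUnsigned);
        return terms;
    }

    /** Finds the largest term by binary-searching each byte position with seekCeil. */
    static byte[] getMax(LinearTerms terms) {
        byte[] prefix = new byte[0];
        while (true) {
            byte[] candidate = Arrays.copyOf(prefix, prefix.length + 1);
            int low = 0, high = 256;
            while (true) {
                int mid = (low + high) >>> 1;
                candidate[candidate.length - 1] = (byte) mid;
                if (terms.seekCeil(candidate) == null) {
                    // Nothing sorts at or after the candidate: this byte value is too high.
                    if (mid == 0) {
                        return prefix; // no term extends the prefix, so the prefix is the max
                    }
                    high = mid;
                } else {
                    // Some term sorts at or after the candidate: this byte value is reachable.
                    if (low == mid) {
                        break; // byte value fixed, move on to the next position
                    }
                    low = mid;
                }
            }
            prefix = candidate;
        }
    }

    public static void main(String[] args) {
        LinearTerms terms = new LinearTerms(termsOf("apple", "cat", "zebra"));
        byte[] max = getMax(terms);
        System.out.println("max term: " + new String(max, StandardCharsets.US_ASCII));
        System.out.println("terms visited: " + terms.termsVisited);
    }
}
```

In this model each byte of the answer costs several seeks, and each seek may walk much of the term list, so the total work grows with both the length of the largest term and the number of terms. A single forward pass over the terms would touch each term once.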


