Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2015/02/10 22:08:11 UTC
[jira] [Updated] (LUCENE-6233) CheckIndex is dog slow when checking term vectors
[ https://issues.apache.org/jira/browse/LUCENE-6233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-6233:
---------------------------------------
Attachment: LUCENE-6233.patch
Patch.
I disabled Terms.getMin/Max checking for TVs, changed the "test with the
one doc deleted" check to run only on the first doc, and now test only
one "advance" doc.
I also added the time taken to each part we test, e.g.:
{noformat}
1 of 24: name=_1b docCount=10309
version=6.0.0
id=cd308kthf553d7dl049vw982u
codec=Asserting(Lucene50)
compound=true
numFiles=3
size (MB)=30.358
diagnostics = {os=Linux, java.vendor=Oracle Corporation, java.version=1.8.0_25, lucene.version=6.0.0, mergeMaxNumSegments=-1, os.arch=amd64, source=merge, mergeFactor=3, os.version=3.13.0-37-generic, timestamp=1423588030806}
no deletions
test: open reader.........OK
test: check integrity.....OK
test: check live docs.....OK [took 0.000 sec]
test: field infos.........OK [8 fields] [took 0.000 sec]
test: field norms.........OK [2 fields] [took 0.005 sec]
test: terms, freq, prox...OK [381010 terms; 1154763 terms/docs pairs; 1883324 tokens] [took 0.550 sec]
test: stored fields.......OK [41236 total field count; avg 4.0 fields per doc] [took 0.323 sec]
test: term vectors........OK [20617 total term vector count; avg 2.0 term/freq vector fields per doc] [took 1.257 sec]
test: docvalues...........OK [2 docvalues fields; 0 BINARY; 1 NUMERIC; 1 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET] [took 0.020 sec]
{noformat}
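For readers who want to see how a "[took N.NNN sec]" suffix like the ones above could be produced, here is a minimal sketch. It is not the code from the patch: the class name TimedCheck and the methods tookSuffix and runTimed are hypothetical, and the only assumption taken from the output above is the three-decimal seconds format.

```java
import java.util.Locale;
import java.util.function.Supplier;

// Hypothetical helper, not part of Lucene: times one check and
// formats a status line in the style of the output shown above.
final class TimedCheck {

    /** Converts elapsed nanoseconds to a "[took N.NNN sec]" suffix. */
    static String tookSuffix(long elapsedNanos) {
        // Locale.ROOT keeps the decimal separator a '.' on every machine.
        return String.format(Locale.ROOT, "[took %.3f sec]",
                elapsedNanos / 1_000_000_000.0);
    }

    /** Runs one check, then appends its detail text and the time it took. */
    static String runTimed(String label, Supplier<String> check) {
        long start = System.nanoTime();
        String detail = check.get();              // the actual work
        long elapsed = System.nanoTime() - start;
        return "test: " + label + "OK [" + detail + "] " + tookSuffix(elapsed);
    }

    public static void main(String[] args) {
        // The lambda stands in for a real check such as field infos.
        System.out.println(runTimed("field infos.........", () -> "8 fields"));
    }
}
```

System.nanoTime is used rather than System.currentTimeMillis because it is monotonic, which matters when the individual parts take only a few milliseconds.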
Term vectors checking is still slowish, but at least it's faster: on
my smallish test index the total CheckIndex time improves from 33.6
seconds to 12.5 seconds.
I also plotted the time to CheckIndex in the nightly benchmark: https://people.apache.org/~mikemccand/lucenebench/checkIndexTime.html
However, that index doesn't have term vectors, so this issue shouldn't
affect it ...
> CheckIndex is dog slow when checking term vectors
> -------------------------------------------------
>
> Key: LUCENE-6233
> URL: https://issues.apache.org/jira/browse/LUCENE-6233
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Attachments: LUCENE-6233.patch
>
>
> I'm working on a test that creates a biggish index and I noticed that CheckIndex takes a surprisingly long time to check term vectors.
> I profiled it and uncovered that we spend a lot of time (not sure this explains all of it) in Terms.getMin/getMax. Since CompressingTermVectorsReader doesn't impl these methods efficiently (which is fine), we fall back to super's impl, which does a digit-by-digit binary search using seekCeil.
> But for TVs this sometimes results in a linear scan.
> I think CheckIndex should not check Terms.getMin/Max for TVs?
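To illustrate the cost described in the quoted issue text, here is a rough, self-contained sketch of a digit-by-digit (one byte at a time) binary search for the largest term, driven only by a seekCeil-style lookup. It is a simplified model and not Lucene's code: the class MaxTermSketch, its methods, and the comparison counter are hypothetical, and the linear-scan seekCeil is an assumption standing in for a reader that cannot seek efficiently.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model, not Lucene code: finds the max term using only seekCeil.
final class MaxTermSketch {

    /** Counts term comparisons made inside seekCeil. */
    static long comparisons = 0;

    /** Unsigned lexicographic comparison of two byte arrays. */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    /** Linear scan: index of the first term >= target, or -1 if there is none. */
    static int seekCeil(List<byte[]> sortedTerms, byte[] target) {
        for (int i = 0; i < sortedTerms.size(); i++) {
            comparisons++;
            if (compare(sortedTerms.get(i), target) >= 0) return i;
        }
        return -1; // plays the role of an END status
    }

    /** Binary-searches each byte position for the highest value that still finds a term. */
    static byte[] maxTerm(List<byte[]> sortedTerms) {
        if (sortedTerms.isEmpty()) return null;
        byte[] scratch = new byte[1];
        while (true) {
            int low = 0, high = 256;
            while (low != high) {
                int mid = (low + high) >>> 1;
                scratch[scratch.length - 1] = (byte) mid;
                if (seekCeil(sortedTerms, scratch) == -1) {
                    // scratch is past every term
                    if (mid == 0) return Arrays.copyOf(scratch, scratch.length - 1);
                    high = mid;
                } else {
                    // at least one term is still at or after scratch
                    if (low == mid) break;
                    low = mid;
                }
            }
            scratch = Arrays.copyOf(scratch, scratch.length + 1); // next byte position
        }
    }

    /** Builds a sorted-by-construction term list from already ordered strings. */
    static List<byte[]> terms(String... ordered) {
        List<byte[]> out = new ArrayList<>();
        for (String s : ordered) out.add(s.getBytes(StandardCharsets.UTF_8));
        return out;
    }

    public static void main(String[] args) {
        byte[] max = maxTerm(terms("a", "b", "cat"));
        System.out.println(new String(max, StandardCharsets.UTF_8)
                + " found after " + comparisons + " term comparisons");
    }
}
```

In this model every byte position needs several seekCeil calls, and each call that runs off the end walks the whole term list, so the work grows with both the length of the max term and the number of terms. That is the pattern the issue proposes to avoid by not checking Terms.getMin/Max for term vectors.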