Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2008/01/07 15:57:33 UTC
[jira] Updated: (LUCENE-1120) Use bulk-byte-copy when merging term vectors
[ https://issues.apache.org/jira/browse/LUCENE-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-1120:
---------------------------------------
Attachment: LUCENE-1120.patch
Attached patch. All tests pass.
(Note that the TestBackwardsCompatibility test will fail if you apply only the patch, because the new *.zip files I added aren't included in it.)
I think we should commit this for 2.3? It's a sizable gain in merging
performance.
> Use bulk-byte-copy when merging term vectors
> --------------------------------------------
>
> Key: LUCENE-1120
> URL: https://issues.apache.org/jira/browse/LUCENE-1120
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
> Attachments: LUCENE-1120.patch
>
>
> Indexing all of Wikipedia, with term vectors on, under the YourKit
> profiler, shows that 26% of the time (!!) was spent merging the
> vectors. This was without offsets & positions, which would make
> matters even worse.
> Depressingly, merging, even with ConcurrentMergeScheduler, cannot in
> fact keep up with the flushing of new segments in this test, and this
> is on a strong IO system (Mac Pro with 4 drive RAID 0 array, 4 CPU
> cores).
> So, just like Robert's idea to merge stored fields with bulk copying
> whenever the field name->number mapping is "congruent" (LUCENE-1043),
> we can do the same with term vectors.
> It's a little trickier because the term vectors format doesn't
> directly encode each document's offset into the tvf file, which
> makes bulk-copying hard.
> I worked out a patch that changes the tvx format slightly, by storing
> the absolute position in the tvf file for the start of each document
> into the tvx file, just like it does for tvd now. This adds an extra
> 8 bytes (long) in the tvx file, per document.
> Then, I removed a vLong (the first "position" stored inside the tvd
> file), which makes tvd contents fully position independent (so you can
> just copy the bytes).
> The net cost is up to 7 bytes per document that has term vectors
> enabled (8 bytes for the new long, minus at least 1 byte for the
> removed vLong; less for larger indices), but I think this small
> increase in index size is acceptable for the gains in indexing
> performance?
> With this change, the time spent merging term vectors dropped from 26%
> to 3%. Of course, this only applies if your documents are "regular".
> I think in the future we could have Lucene try hard to assign the same
> field number for a given field name, if it had been seen before in the
> index...
> Merging terms now dominates the merge cost (~20% of the overall time
> building the Wikipedia index).
> I also beefed up the TestBackwardsCompatibility unit test: it now
> tests a non-CFS and a CFS index for each of the 1.9, 2.0, 2.1 and 2.2
> index formats, and I added some term vector fields to these indices.
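
The idea described in the quoted issue can be illustrated with a rough sketch. This is not code from the patch, and it does not use Lucene's real tvx/tvd/tvf layout or API: the class BulkCopyMergeSketch, its Segment type, and all method names are hypothetical. It assumes a toy "index" array holding one absolute 8-byte offset per document and a "data" array that contains no pointers of its own, so a merge can copy the data bytes wholesale and only shift the index entries. It also shows the size arithmetic behind the "up to 7 bytes" figure, assuming the removed pointer used the 7-bits-per-byte vLong encoding.

```java
import java.util.Arrays;

// Toy model only: NOT Lucene's real tvx/tvd/tvf layout or API.
final class BulkCopyMergeSketch {

    // Bytes needed to store a non-negative long in a 7-bits-per-byte
    // variable-length encoding (the vLong style).
    static int vLongSize(long value) {
        int size = 1;
        while ((value & ~0x7FL) != 0) {
            value >>>= 7;
            size++;
        }
        return size;
    }

    // Net bytes added per document when a fixed 8-byte pointer
    // replaces a vLong that used to hold the same pointer.
    static int netBytesPerDoc(long pointer) {
        return Long.BYTES - vLongSize(pointer);
    }

    // Hypothetical segment: starts[i] is the absolute offset of
    // document i inside data; data itself holds no pointers.
    static final class Segment {
        final long[] starts;
        final byte[] data;

        Segment(long[] starts, byte[] data) {
            this.starts = starts;
            this.data = data;
        }

        // Raw bytes of one document, located through the index alone.
        byte[] docBytes(int doc) {
            int from = (int) starts[doc];
            int to = doc + 1 < starts.length ? (int) starts[doc + 1] : data.length;
            return Arrays.copyOfRange(data, from, to);
        }
    }

    // Merge by copying b's bytes wholesale after a's and shifting b's
    // index entries by a's data length; no document is decoded.
    static Segment merge(Segment a, Segment b) {
        byte[] data = Arrays.copyOf(a.data, a.data.length + b.data.length);
        System.arraycopy(b.data, 0, data, a.data.length, b.data.length);

        long[] starts = Arrays.copyOf(a.starts, a.starts.length + b.starts.length);
        for (int i = 0; i < b.starts.length; i++) {
            starts[a.starts.length + i] = b.starts[i] + a.data.length;
        }
        return new Segment(starts, data);
    }
}
```

In this toy model the per-document work during a merge is a single addition on the index entry, while the document bytes move in one block copy. A small pointer fits in a 1-byte vLong, so swapping it for a fixed long costs 7 extra bytes; a larger pointer already needed more vLong bytes, so the net increase shrinks.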