Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2008/01/07 15:57:33 UTC

[jira] Updated: (LUCENE-1120) Use bulk-byte-copy when merging term vectors

     [ https://issues.apache.org/jira/browse/LUCENE-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1120:
---------------------------------------

    Attachment: LUCENE-1120.patch

Attached patch.  All tests pass.

(Note that the TestBackwardsCompatibility test will fail if you apply only the patch, because the new *.zip files I added aren't included in it.)

I think we should commit this for 2.3?  It's a sizable gain in merging
performance.


> Use bulk-byte-copy when merging term vectors
> --------------------------------------------
>
>                 Key: LUCENE-1120
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1120
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-1120.patch
>
>
> Indexing all of Wikipedia, with term vectors on, under the YourKit
> profiler, shows that 26% of the time (!!) was spent merging the
> vectors.  This was without offsets & positions, which would make
> matters even worse.
>
> Depressingly, merging, even with ConcurrentMergeScheduler, cannot in
> fact keep up with the flushing of new segments in this test, and this
> is on a strong IO system (Mac Pro with a 4-drive RAID 0 array and 4 CPU
> cores).
>
> So, just like Robert's idea to merge stored fields with bulk copying
> whenever the field name->number mapping is "congruent" (LUCENE-1043),
> we can do the same with term vectors.
>
> It's a little trickier because the term vectors format doesn't make
> bulk copying easy: the tvx file doesn't directly encode the offset
> into the tvf file.
>
> I worked out a patch that changes the tvx format slightly, by storing
> the absolute position in the tvf file for the start of each document
> into the tvx file, just like it does for tvd now.  This adds an extra
> 8 bytes (long) in the tvx file, per document.
>
> Then, I removed a vLong (the first "position" stored inside the tvd
> file), which makes tvd contents fully position independent (so you can
> just copy the bytes).
>
> Net, this adds up to 7 bytes per document that has term vectors
> enabled: 8 bytes for the new long, minus at least 1 byte for the
> removed vLong (so less for larger indices, where that vLong takes more
> bytes).  I think this small increase in index size is acceptable for
> the gains in indexing performance?
>
> With this change, the time spent merging term vectors dropped from 26%
> to 3%.  Of course, this only applies if your documents are "regular",
> i.e. the segments being merged have a congruent field name->number
> mapping.  I think in the future we could have Lucene try hard to
> assign the same field number for a given field name, if it had been
> seen before in the index...
>
> Merging terms now dominates the merge cost (~20% of the overall time
> building the Wikipedia index).
>
> I also beefed up the TestBackwardsCompatibility unit test: it now
> tests a non-CFS and a CFS index for each of the 1.9, 2.0, 2.1 and 2.2
> index formats, and I added some term vector fields to these indices.
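
The format change described above can be illustrated with a small toy
model.  This is only a sketch, not the code in LUCENE-1120.patch: the
class names (TermVectorBulkCopySketch, Segment, Merged) are hypothetical,
the "files" are in-memory byte arrays with no headers, and "congruent" is
simplified to mean that both field name arrays are identical.  The point
it shows is that once tvx holds absolute tvd and tvf positions per
document and the tvd/tvf bytes are position independent, merging a
congruent segment reduces to appending raw bytes and shifting each
pointer by the current length of the merged file.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of bulk-copying term vectors; not Lucene's real classes or file layout. */
final class TermVectorBulkCopySketch {

    /** Hypothetical in-memory stand-in for one segment's tvx/tvd/tvf data. */
    static final class Segment {
        final String[] fieldNames; // array index = field number
        final long[] tvdPointers;  // tvx entry per document: absolute start in tvd
        final long[] tvfPointers;  // tvx entry per document: absolute start in tvf
        final byte[] tvd;          // assumed position-independent bytes
        final byte[] tvf;          // assumed position-independent bytes

        Segment(String[] fieldNames, long[] tvdPointers, long[] tvfPointers,
                byte[] tvd, byte[] tvf) {
            this.fieldNames = fieldNames;
            this.tvdPointers = tvdPointers;
            this.tvfPointers = tvfPointers;
            this.tvd = tvd;
            this.tvf = tvf;
        }
    }

    /** Accumulates the merged segment. */
    static final class Merged {
        final String[] fieldNames;
        final List<Long> tvdPointers = new ArrayList<>();
        final List<Long> tvfPointers = new ArrayList<>();
        final ByteArrayOutputStream tvd = new ByteArrayOutputStream();
        final ByteArrayOutputStream tvf = new ByteArrayOutputStream();

        Merged(String[] fieldNames) {
            this.fieldNames = fieldNames;
        }

        /** Simplifying assumption: congruent means identical name->number mapping. */
        boolean isCongruent(Segment src) {
            return Arrays.equals(fieldNames, src.fieldNames);
        }

        /** Appends the raw bytes of src and rebases its per-document pointers. */
        void bulkCopy(Segment src) {
            if (!isCongruent(src)) {
                throw new IllegalArgumentException(
                        "field numbering differs; a per-document merge is needed");
            }
            long tvdBase = tvd.size(); // where src's tvd bytes will start
            long tvfBase = tvf.size(); // where src's tvf bytes will start
            for (int doc = 0; doc < src.tvdPointers.length; doc++) {
                tvdPointers.add(tvdBase + src.tvdPointers[doc]);
                tvfPointers.add(tvfBase + src.tvfPointers[doc]);
            }
            // No decoding or re-encoding: the bytes are copied as-is.
            tvd.write(src.tvd, 0, src.tvd.length);
            tvf.write(src.tvf, 0, src.tvf.length);
        }
    }

    public static void main(String[] args) {
        String[] fields = {"title", "body"};
        Segment a = new Segment(fields, new long[] {0, 2}, new long[] {0, 1},
                new byte[] {1, 2, 3}, new byte[] {10, 11, 12, 13});
        Segment b = new Segment(fields, new long[] {0, 1}, new long[] {0, 2},
                new byte[] {4, 5}, new byte[] {20, 21, 22});
        Merged merged = new Merged(fields);
        merged.bulkCopy(a);
        merged.bulkCopy(b);
        System.out.println("tvd pointers: " + merged.tvdPointers); // [0, 2, 3, 4]
        System.out.println("tvf pointers: " + merged.tvfPointers); // [0, 1, 4, 6]
    }
}
```

In this model a segment whose field numbering differs is rejected rather
than merged; a real merger would presumably fall back to the slower
per-document path for such segments.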

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

