Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2007/07/02 16:16:05 UTC
[jira] Commented: (LUCENE-856) Optimize segment merging
[ https://issues.apache.org/jira/browse/LUCENE-856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12509576 ]
Michael McCandless commented on LUCENE-856:
-------------------------------------------
I ran a new performance comparison here to test the merging cost of
autoCommit false vs true, this time using Wikipedia content and
contrib/benchmark.
I indexed all of Wikipedia using the patch from LUCENE-843 and the
patch from LUCENE-947, once with autoCommit=true and once with
autoCommit=false. I used this alg (and just changed autocommit=true
to false for the second test):
max.field.length=2147483647
compound=false
analyzer=org.apache.lucene.analysis.SimpleAnalyzer
directory=FSDirectory
ram.flush.mb=32
max.buffered=20000
autocommit=true
doc.stored=true
doc.tokenized=true
doc.term.vector=true
doc.term.vector.offsets=true
doc.term.vector.positions=true
doc.add.log.step=500
docs.dir=enwiki
doc.maker=org.apache.lucene.benchmark.byTask.feeds.DirDocMaker
doc.maker.forever=false
ResetSystemErase
CreateIndex
[{AddDoc}: *] : 4
CloseIndex
RepSumByPref AddDoc
Which means: use 4 threads to index the full text of each of the 3.2
million Wikipedia docs, with stored fields & term vectors turned on,
using SimpleAnalyzer, flushing whenever RAM usage hits 32 MB.
The resulting index size is 20 GB.
Report from autoCommit=true:
------------> Report Sum By Prefix (AddDoc) (1 about 3204066 out of 3204073)
Operation round runCnt recsPerRun rec/s elapsedSec avgUsedMem avgTotalMem
AddDoc 0 3204066 1 226.3 14,159.22 282,843,296 373,480,960
Net elapsed time = 87 minutes 18 seconds
Report from autoCommit=false:
------------> Report Sum By Prefix (AddDoc) (1 about 3204066 out of 3204073)
Operation round runCnt recsPerRun rec/s elapsedSec avgUsedMem avgTotalMem
AddDoc 0 3204066 1 407.6 7,860.63 252,046,000 329,962,048
Net elapsed time = 60 minutes 5 seconds
Some comments:
* According to net elapsed time, autoCommit=false is 31% faster than
autoCommit=true.
* According to "rec/s" it's actually 44% faster; this is because
rec/s measures only the actual addDocument time and not, e.g., the IO
cost of retrieving the document contents.
* The speedup is due entirely to the fact that the "doc stores"
(vectors & stored fields) do not need to be merged when
autoCommit=false. This is a major win because these files are
enormous if you turn on stored fields & term vectors with offsets
& positions.
* The basic conclusion is the same as before: if you want to build
up a large index, and don't need to search it while it's being
built, the fastest way to do so is with the LUCENE-843 patch and
autoCommit=false.
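Both speedup figures follow directly from the report numbers above; a small self-contained check (constants copied from the two reports):

```java
// Verify the quoted speedups from the two benchmark runs above.
public class SpeedupCheck {
    public static void main(String[] args) {
        // Net elapsed wall-clock times of the two runs
        double netTrueSec  = 87 * 60 + 18;   // 87 min 18 s, autoCommit=true
        double netFalseSec = 60 * 60 + 5;    // 60 min  5 s, autoCommit=false

        // rec/s (pure addDocument rate) from the two reports
        double recsTrue = 226.3, recsFalse = 407.6;

        // "31% faster": reduction in net elapsed time
        double netSpeedup = (1.0 - netFalseSec / netTrueSec) * 100.0;

        // "44% faster": reduction in addDocument time (time is 1/rate)
        double addDocSpeedup = (1.0 - recsTrue / recsFalse) * 100.0;

        System.out.printf("net elapsed speedup: %.0f%%%n", netSpeedup);    // prints 31%
        System.out.printf("addDocument speedup: %.0f%%%n", addDocSpeedup); // prints 44%
    }
}
```

The gap between the two numbers is the fixed overhead (document retrieval IO etc.) that both runs pay equally, which dilutes the speedup when measured by wall-clock time.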
> Optimize segment merging
> ------------------------
>
> Key: LUCENE-856
> URL: https://issues.apache.org/jira/browse/LUCENE-856
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 2.1
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
>
> With LUCENE-843, the time spent indexing documents has been
> substantially reduced and now the time spent merging is a sizable
> portion of indexing time.
> I ran a test using the patch for LUCENE-843, building an index of 10
> million docs, each with ~5,500 byte plain text, with term vectors
> (positions + offsets) on and with 2 small stored fields per document.
> RAM buffer size was 32 MB. I didn't optimize the index in the end,
> though optimize speed would also improve if we optimize segment
> merging. Index size is 86 GB.
> Total time to build the index was 8 hrs 38 minutes, 5 hrs 40 minutes
> of which was spent merging. That's 65.6% of the time!
> Most of this time is presumably IO which probably can't be reduced
> much unless we improve overall merge policy and experiment with values
> for mergeFactor / buffer size.
> These tests were run on a Mac Pro with 2 dual-core Intel CPUs. The IO
> system is RAID 0 of 4 drives, so, these times are probably better than
> the more common case of a single hard drive which would likely be
> slower IO.
> I think there are some simple things we could do to speed up merging:
> * Experiment with buffer sizes -- maybe larger buffers for the
> IndexInputs used during merging could help? Because at a default
> mergeFactor of 10, the disk heads must do a lot of seeking back and
> forth between these 10 files (and then to the 11th file where we
> are writing).
> * Use byte copying when possible, e.g. if there are no deletions on a
> segment we can almost (I think?) just copy things like prox
> postings, stored fields, term vectors, instead of fully parsing to
> Java objects and then re-serializing them.
> * Experiment with mergeFactor / different merge policies. For
> example I think LUCENE-854 would reduce time spent merging for a
> given index size.
> This is currently just a place to list ideas for optimizing segment
> merges. I don't plan on working on this until after LUCENE-843.
> Note that for "autoCommit=false", this optimization is somewhat less
> important, depending on how often you actually close/open a new
> IndexWriter. In the extreme case, if you open a writer, add 100 million
> docs, close the writer, then no segment merges happen at all.
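The byte-copying idea from the quoted description can be illustrated with a self-contained sketch. The length-prefixed record layout below is a hypothetical stand-in, not Lucene's actual file format; the point is only that when no deletions force re-encoding, a segment's bytes can be appended to the merged output verbatim, skipping the decode/re-serialize round trip:

```java
import java.io.*;

// Hypothetical sketch of byte-copy merging: when a segment has no
// deletions, copy its on-disk records as raw bytes instead of parsing
// them into Java objects and re-serializing. The record format here
// (int length prefix + payload) is a stand-in, not Lucene's real format.
public class ByteCopyMergeSketch {

    // Slow path: decode each record, then write it back out.
    static void mergeByReserializing(DataInputStream in, DataOutputStream out,
                                     int recordCount) throws IOException {
        for (int i = 0; i < recordCount; i++) {
            int len = in.readInt();
            byte[] payload = new byte[len];
            in.readFully(payload);          // "parse" into a Java object
            out.writeInt(payload.length);   // re-serialize it
            out.write(payload);
        }
    }

    // Fast path: no deletions, so the whole region is copied verbatim.
    static void mergeByByteCopy(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[64 * 1024];   // a larger buffer also means fewer seeks
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a fake "segment" of 3 length-prefixed records.
        ByteArrayOutputStream segment = new ByteArrayOutputStream();
        DataOutputStream w = new DataOutputStream(segment);
        for (String rec : new String[] {"doc1", "doc22", "doc333"}) {
            byte[] b = rec.getBytes("UTF-8");
            w.writeInt(b.length);
            w.write(b);
        }
        byte[] src = segment.toByteArray();

        ByteArrayOutputStream slow = new ByteArrayOutputStream();
        mergeByReserializing(new DataInputStream(new ByteArrayInputStream(src)),
                             new DataOutputStream(slow), 3);

        ByteArrayOutputStream fast = new ByteArrayOutputStream();
        mergeByByteCopy(new ByteArrayInputStream(src), fast);

        // Both paths produce identical bytes; the copy path just avoids parsing.
        System.out.println(java.util.Arrays.equals(slow.toByteArray(), src)); // prints true
        System.out.println(java.util.Arrays.equals(fast.toByteArray(), src)); // prints true
    }
}
```

The same copy loop also shows why larger IndexInput buffers should help: each `read` pulls a bigger contiguous chunk, so the heads alternate between the merge inputs less often.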