Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2008/07/11 20:00:31 UTC
[jira] Updated: (LUCENE-1301) Refactor DocumentsWriter
[ https://issues.apache.org/jira/browse/LUCENE-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-1301:
---------------------------------------
Attachment: LUCENE-1301.patch
New rev of the patch attached. I've fixed all nocommits. All tests
pass. I believe this version is ready to commit!
I'll wait a few more days before committing...
I ran some indexing throughput tests, indexing Wikipedia docs from a
line file using StandardAnalyzer. Each result is best of 4. Here's
the alg:
{code}
analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
docs.file=/Volumes/External/lucene/wiki.txt
doc.stored = true
doc.term.vector = true
doc.add.log.step=2000
directory=FSDirectory
autocommit=false
compound=false
work.dir=/lucene/work
ram.flush.mb=64
{ "Rounds"
ResetSystemErase
{ "BuildIndex"
- CreateIndex
{ "AddDocs" AddDoc > : 200000
- CloseIndex
}
NewRound
} : 4
RepSumByPrefRound BuildIndex
{code}
Gives these results with term vectors & stored fields:
{code}
            round  runCnt  recsPerRun    rec/s  elapsedSec   avgUsedMem    avgTotalMem
patch
BuildIndex      1       1      200000    900.4      222.12  410,938,688  1,029,046,272
trunk
BuildIndex      1       1      200000    969.0      206.39  400,372,256  1,029,046,272
2.3
BuildIndex      2       1      200002    905.4      220.89  391,630,016  1,029,046,272
{code}
And without term vectors & stored fields:
{code}
            round  runCnt  recsPerRun    rec/s  elapsedSec   avgUsedMem    avgTotalMem
patch
BuildIndex      3       1      200000  1,297.5      154.15  399,966,592  1,029,046,272
trunk
BuildIndex      1       1      200000  1,372.5      145.72  390,581,376  1,029,046,272
2.3
BuildIndex      1       1      200002  1,308.5      152.85  389,224,640  1,029,046,272
{code}
So, the bad news is the refactoring has made things a bit (~5-7%)
slower than the current trunk. But the good news is trunk was already
6-7% faster than 2.3, so the two nearly cancel out.
If I repeat these tests using tiny docs (~100 bytes per body) instead,
indexing the first 10 million docs, the slowdown is worse (~13-15% vs
trunk, ~11-13% vs 2.3)... I think it's because the additional method calls
with the refactoring become a bigger part of the time.
With term vectors & stored fields:
{code}
            round  runCnt  recsPerRun     rec/s  elapsedSec   avgUsedMem    avgTotalMem
patch
BuildIndex      3       1    10000000  38,320.1      260.96  313,980,832  1,029,046,272
trunk
BuildIndex      2       1    10000000  45,194.1      221.27  414,987,072  1,029,046,272
2.3
BuildIndex      1       1    10000002  42,861.4      233.31  182,957,440  1,029,046,272
{code}
Without term vectors & stored fields:
{code}
            round  runCnt  recsPerRun     rec/s  elapsedSec   avgUsedMem    avgTotalMem
patch
BuildIndex      1       1    10000000  60,778.4      164.53  341,611,456  1,029,046,272
trunk
BuildIndex      2       1    10000000  68,387.8      146.23  405,388,960  1,029,046,272
2.3
BuildIndex      0       1    10000002  68,052.7      146.95  330,334,912  1,029,046,272
{code}
I think these small slowdowns are worth the improvement in code
clarity.
> Refactor DocumentsWriter
> ------------------------
>
> Key: LUCENE-1301
> URL: https://issues.apache.org/jira/browse/LUCENE-1301
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 2.3, 2.3.1, 2.3.2, 2.4
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
> Fix For: 2.4
>
> Attachments: LUCENE-1301.patch, LUCENE-1301.patch, LUCENE-1301.take2.patch, LUCENE-1301.take3.patch
>
>
> I've been working on refactoring DocumentsWriter to make it more
> modular, so that adding new indexing functionality (like column-stride
> stored fields, LUCENE-1231) is just a matter of adding a plugin into
> the indexing chain.
> This is an initial step towards flexible indexing (but there is still
> a lot more to do!).
> And it's very much still a work in progress -- there are intermittent
> thread-safety issues, I need to add test cases and test/iterate on
> performance, many "nocommits", etc. This is a snapshot of my current
> state...
> The approach introduces "consumers" (abstract classes defining the
> interface) at different levels during indexing. EG DocConsumer
> consumes the whole document. DocFieldConsumer consumes separate
> fields, one at a time. InvertedDocConsumer consumes tokens produced
> by running each field through the analyzer. TermsHashConsumer writes
> its own bytes into in-memory posting lists stored in byte slices,
> indexed by term, etc.
> DocumentsWriter*.java is then much simpler: it only interacts with a
> DocConsumer and has no idea what that consumer is doing. Under that
> DocConsumer there is a whole "indexing chain" that does the real work:
> * NormsWriter holds norms in memory and then flushes them to _X.nrm.
> * FreqProxTermsWriter holds postings data in memory and then flushes
> to _X.frq/prx.
> * StoredFieldsWriter flushes immediately to _X.fdx/fdt
> * TermVectorsTermsWriter flushes immediately to _X.tvx/tvf/tvd
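> As a rough sketch of the delegation this layering implies (the class
> names below follow the patch, but the signatures are invented for
> illustration and do not match the real package-private APIs):
> {code}
> import java.util.ArrayList;
> import java.util.List;
>
> // Hypothetical, heavily simplified sketch of the consumer layering.
> abstract class DocConsumer {
>     abstract void processDocument(List<String> fields);
>     abstract void flush(List<String> segmentOutput);
> }
>
> // Stand-in for the whole chain under DocConsumer (DocFieldConsumer,
> // InvertedDocConsumer, TermsHashConsumer, ...): here it just records
> // each field it consumes, the way DocFieldConsumer sees them one at a time.
> class RecordingChain extends DocConsumer {
>     private final List<String> consumed = new ArrayList<>();
>     @Override void processDocument(List<String> fields) {
>         for (String field : fields) consumed.add("field:" + field);
>     }
>     @Override void flush(List<String> segmentOutput) {
>         segmentOutput.addAll(consumed);
>     }
> }
>
> // DocumentsWriter only ever talks to the abstract DocConsumer; it has
> // no idea what the chain underneath is doing.
> class DocumentsWriterSketch {
>     private final DocConsumer chain;
>     DocumentsWriterSketch(DocConsumer chain) { this.chain = chain; }
>     void addDocument(List<String> fields) { chain.processDocument(fields); }
>     List<String> flushSegment() {
>         List<String> out = new ArrayList<>();
>         chain.flush(out);
>         return out;
>     }
> }
> {code}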
> DocumentsWriter still manages things like flushing a segment, closing
> doc stores, buffering & applying deletes, freeing memory, aborting
> when necessary, etc.
> In this first step, everything is package-private, and, the indexing
> chain is hardwired (instantiated in DocumentsWriter) to the chain
> currently matching Lucene trunk. Over time we can open this up.
> There are no changes to the index file format.
> For the most part this is just a [large] refactoring, except for these
> two small actual changes:
> * Improved concurrency with mixed large/small docs: previously the
> thread state would be tied up when docs finished indexing
> out-of-order. Now, it's not: instead I use a separate class to
> hold any pending state to flush to the doc stores, and immediately
> free up the thread state to index other docs.
> * Buffered norms in memory now remain sparse, until flushed to the
> _X.nrm file. Previously we would "fill holes" in norms in memory,
> as we go, which could easily use way too much memory. Really this
> isn't a solution to the problem of sparse norms (LUCENE-830); it
> just delays that issue from causing memory blowup during indexing;
> memory use will still blow up during searching.
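> The first change above -- parking a finished document's pending
> doc-store state so the thread state frees up immediately -- can be
> modeled roughly like this (DocStoreSketch and its method names are
> invented for illustration, not the patch's actual classes):
> {code}
> import java.util.ArrayList;
> import java.util.HashMap;
> import java.util.List;
> import java.util.Map;
>
> class DocStoreSketch {
>     private final Map<Integer, String> pending = new HashMap<>(); // docID -> parked doc-store buffer
>     private int nextWriteDocID = 0;      // doc stores must be written in docID order
>     final List<String> written = new ArrayList<>();
>
>     // Called when an indexing thread finishes a document, possibly out
>     // of order.  The small buffer is parked here and the caller's
>     // thread state is free to index another doc immediately.
>     synchronized void finishDocument(int docID, String docStoreBuffer) {
>         pending.put(docID, docStoreBuffer);
>         // Drain whatever is now contiguous from the write head.
>         String buf;
>         while ((buf = pending.remove(nextWriteDocID)) != null) {
>             written.add(buf);
>             nextWriteDocID++;
>         }
>     }
> }
> {code}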
> I expect performance (indexing throughput) will be worse with this
> change. I'll profile & iterate to minimize this, but I think we can
> accept some loss. I also plan to measure benefit of manually
> re-cycling RawPostingList instances from our own pool, vs letting GC
> recycle them.
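> The recycling experiment mentioned at the end could be as simple as a
> free list; a minimal sketch (RawPostingListSketch and the pool API are
> stand-ins, not the patch's code):
> {code}
> import java.util.ArrayDeque;
>
> // Stand-in for RawPostingList; the real class holds byte-slice state.
> class RawPostingListSketch { int textStart; }
>
> // Minimal free-list pool: reuse released instances instead of letting
> // GC reclaim them and paying allocation cost again.
> class PostingPool {
>     private final ArrayDeque<RawPostingListSketch> free = new ArrayDeque<>();
>     int newAllocations = 0;              // counts actual `new` calls
>
>     RawPostingListSketch get() {
>         RawPostingListSketch p = free.poll();
>         if (p == null) {
>             p = new RawPostingListSketch();
>             newAllocations++;
>         }
>         return p;
>     }
>
>     void recycle(RawPostingListSketch p) { free.push(p); }
> }
> {code}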