Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2007/08/02 00:51:53 UTC
[jira] Updated: (LUCENE-969) Optimize the core tokenizers/analyzers & deprecate Token.termText
[ https://issues.apache.org/jira/browse/LUCENE-969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-969:
--------------------------------------
Attachment: LUCENE-969.take2.patch
Updated patch based on recent commits; fixed up the javadocs and a few
other small things. I think this is ready to commit but I'll wait a
few days for more comments...
> Optimize the core tokenizers/analyzers & deprecate Token.termText
> -----------------------------------------------------------------
>
> Key: LUCENE-969
> URL: https://issues.apache.org/jira/browse/LUCENE-969
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Analysis
> Affects Versions: 2.3
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
> Fix For: 2.3
>
> Attachments: LUCENE-969.patch, LUCENE-969.take2.patch
>
>
> There is some "low hanging fruit" for optimizing the core tokenizers
> and analyzers:
>
>   - Re-use a single Token instance during indexing instead of creating
>     a new one for every term. To do this, I added a new method "Token
>     next(Token result)" (Doron's suggestion), which means a TokenStream
>     may use the "Token result" as the returned Token, but is not
>     required to (i.e., it can still return an entirely different Token
>     if that is more convenient). I added default implementations for
>     both next() methods in TokenStream.java so that a TokenStream can
>     choose to implement only one of the next() methods.
>
>   - Use "char[] termBuffer" in Token instead of the "String
>     termText".
>
>     Token now maintains a char[] termBuffer for holding the term's
>     text. Tokenizers & filters should retrieve this buffer and
>     alter it directly to set or change the term text.
>
>     I only deprecated the termText() method. I still allow the
>     constructors that pass in String termText, as well as
>     setTermText(String), but added a NOTE about the performance cost
>     of using these methods. I think it's OK to keep these as
>     convenience methods?
>
>     After the next release, when we can remove the deprecated API, we
>     should clean up Token.java to no longer maintain "either String or
>     char[]" (and the initTermBuffer() private method) and always use
>     the char[] termBuffer instead.
>
>   - Re-use TokenStream instances across Fields & Documents instead of
>     creating a new one for each doc. To do this I added an optional
>     "reusableTokenStream(...)" to Analyzer, which just defaults to
>     calling tokenStream(...), and then I implemented this for the core
>     analyzers.
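
The three kinds of re-use listed above can be sketched in a few lines of
plain Java. This is only an illustration under stated assumptions, not
the code in the attached patch: SketchToken, SketchLowerCaseTokenizer,
append(), text(), reset() and tokenizeAll() are hypothetical names
invented for this example and are not Lucene classes or methods. The
sketch shows a consumer passing one token into next(...), the tokenizer
overwriting that token's char[] buffer in place, and one tokenizer
instance being reset and re-used for each doc.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are NOT Lucene's classes.
class TokenReuseSketch {

    /** Token that keeps its text in a growable char[] instead of a String. */
    static final class SketchToken {
        char[] termBuffer = new char[16];
        int termLength;

        /** Append one char, growing the buffer when it is full. */
        void append(char c) {
            if (termLength == termBuffer.length) {
                char[] bigger = new char[termBuffer.length * 2];
                System.arraycopy(termBuffer, 0, bigger, 0, termLength);
                termBuffer = bigger;
            }
            termBuffer[termLength++] = c;
        }

        /** Copies the current term out of the shared buffer. */
        String text() {
            return new String(termBuffer, 0, termLength);
        }
    }

    /** Emits lower-cased runs of letters, writing into the caller's token. */
    static final class SketchLowerCaseTokenizer {
        private String input = "";
        private int pos;

        /** Point this instance at new text so it can be re-used across docs. */
        void reset(String newInput) {
            input = newInput;
            pos = 0;
        }

        /** Fills and returns 'result', or returns null at end of input. */
        SketchToken next(SketchToken result) {
            while (pos < input.length() && !Character.isLetter(input.charAt(pos))) {
                pos++; // skip separators
            }
            if (pos == input.length()) {
                return null;
            }
            result.termLength = 0; // overwrite the previous term in place
            while (pos < input.length() && Character.isLetter(input.charAt(pos))) {
                result.append(Character.toLowerCase(input.charAt(pos++)));
            }
            return result;
        }
    }

    /** Tokenizes every doc with one tokenizer and one token instance. */
    static List<String> tokenizeAll(String... docs) {
        SketchLowerCaseTokenizer stream = new SketchLowerCaseTokenizer();
        SketchToken reusable = new SketchToken();
        List<String> terms = new ArrayList<>();
        for (String doc : docs) {
            stream.reset(doc);
            for (SketchToken t = stream.next(reusable); t != null; t = stream.next(reusable)) {
                terms.add(t.text()); // copy: the buffer is overwritten by the next call
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(tokenizeAll("Re-use a Single Token", "Across DOCS"));
    }
}
```

Note the one obligation this pattern puts on the consumer: the returned
token is only valid until the next call, so anything that needs to keep
a term must copy it out of the buffer first.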
>
> I'm using the patch from LUCENE-967 for benchmarking just
> tokenization.
>
> The changes above give a 21% speedup (742 seconds -> 585 seconds) for
> the LowerCaseTokenizer -> StopFilter -> PorterStemFilter chain,
> tokenizing all of Wikipedia, on JDK 1.6 -server -Xmx1024M, Debian
> Linux, RAID 5 IO system (best of 2 runs).
>
> If I pre-break Wikipedia docs into 100-token docs then it's 37% faster
> (1236 sec -> 774 sec), I think because of re-using TokenStreams across
> docs.
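
For anyone checking the figures: both percentages are consistent with
reading "speedup" as the relative reduction in elapsed time, i.e.
(before - after) / before. That reading is an assumption, since the
message does not state the formula. A small sketch (SpeedupCheck and
percentFaster are hypothetical names used only here):

```java
// Illustration only: recomputes the quoted percentages from the quoted timings.
class SpeedupCheck {

    /** Percent reduction in elapsed time going from beforeSec to afterSec. */
    static double percentFaster(double beforeSec, double afterSec) {
        return 100.0 * (beforeSec - afterSec) / beforeSec;
    }

    public static void main(String[] args) {
        // Timings taken from the benchmark description above.
        System.out.printf("full Wikipedia docs: %.1f%%%n", percentFaster(742, 585));
        System.out.printf("100-token docs:      %.1f%%%n", percentFaster(1236, 774));
    }
}
```

Rounded to whole percent these come out at 21 and 37, matching the
numbers quoted above.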
>
> I'm just running with this alg and recording the elapsed time:
>
>   analyzer=org.apache.lucene.analysis.LowercaseStopPorterAnalyzer
>   doc.tokenize.log.step=50000
>   docs.file=/lucene/wikifull.txt
>   doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
>   doc.tokenized=true
>   doc.maker.forever=false
>   {ReadTokens > : *
>
> See this thread for the discussion leading up to this:
>
>   http://www.gossamer-threads.com/lists/lucene/java-dev/51283
>
> I also fixed Token.toString() to work correctly when termBuffer is
> used (and added a unit test).