Posted to dev@lucene.apache.org by bu...@apache.org on 2004/03/01 08:10:10 UTC
DO NOT REPLY [Bug 27326] New: - [PATCH] minor performance enhancements for DocumentWriter.invertDocument()
DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://nagoya.apache.org/bugzilla/show_bug.cgi?id=27326>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND
INSERTED IN THE BUG DATABASE.
Summary: [PATCH] minor performance enhancements for DocumentWriter.invertDocument()
Product: Lucene
Version: unspecified
Platform: All
OS/Version: All
Status: NEW
Severity: Enhancement
Priority: Other
Component: Index
AssignedTo: lucene-dev@jakarta.apache.org
ReportedBy: brian-apache@slesinsky.org
This patch includes two small performance improvements:
1. Switch from Hashtable to HashMap and preset the HashMap's capacity to avoid resizing (a barely
measurable improvement, but easy).
2. Add a new Analyzer.tokenStream() method that takes a String instead of a Reader, and call it from
within DocumentWriter.invertDocument(). This lets subclasses of Analyzer provide a more
efficient tokenizer for Strings. (The default implementation just uses a StringReader.)
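The two changes above can be sketched roughly as follows. This is only an illustration, not the patch itself: SketchAnalyzer, WhitespaceSketchAnalyzer and PostingTableSketch are hypothetical stand-ins for the Lucene classes, tokens are returned as a plain List instead of a TokenStream, and sizing the map from the token count is an assumption (the report does not say how the capacity is chosen).

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for Analyzer; not the real Lucene class.
abstract class SketchAnalyzer {
    // Existing-style entry point: tokenize from a Reader.
    abstract List<String> tokenStream(String fieldName, Reader reader);

    // Proposed-style overload: the default just wraps the String in a
    // StringReader, so subclasses may override it with something cheaper.
    List<String> tokenStream(String fieldName, String text) {
        return tokenStream(fieldName, new StringReader(text));
    }
}

// A Reader-based analyzer that splits on whitespace and simply inherits
// the default String overload.
class WhitespaceSketchAnalyzer extends SketchAnalyzer {
    @Override
    List<String> tokenStream(String fieldName, Reader reader) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        try {
            int c;
            while ((c = reader.read()) != -1) {
                if (Character.isWhitespace((char) c)) {
                    if (current.length() > 0) {
                        tokens.add(current.toString());
                        current.setLength(0);
                    }
                } else {
                    current.append((char) c);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}

class PostingTableSketch {
    // Presize the HashMap so expectedTerms entries fit without a resize,
    // given the default load factor of 0.75 (sizing rule is an assumption).
    static Map<String, Integer> newTable(int expectedTerms) {
        int capacity = (int) (expectedTerms / 0.75f) + 1;
        return new HashMap<>(capacity);
    }

    // Counts term frequencies for one String field via the String overload.
    static Map<String, Integer> countTerms(SketchAnalyzer analyzer,
                                           String fieldName, String value) {
        List<String> tokens = analyzer.tokenStream(fieldName, value);
        Map<String, Integer> table = newTable(tokens.size());
        for (String token : tokens) {
            table.merge(token, 1, Integer::sum);
        }
        return table;
    }
}
```

Because the String overload has a default body, existing Analyzer subclasses would keep working unchanged; only those that want a faster String path need to override it.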
I was able to write a variant of LowercaseAnalyzer (not included) that is about 10% faster on my dataset.
It works by converting the entire field value with String.toLowerCase() and then using String.substring()
to extract the string for each token. This avoids allocating a separate char[] array for each
token, because String.substring() shares its char[] array with the original String.
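A minimal sketch of that lowercase-then-substring approach might look like the code below. The actual analyzer was not included in the report, so the class name LowercaseSubstringTokenizer, the rule that a token is a maximal run of letters, and the use of Locale.ROOT are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper; not part of Lucene or of the patch.
class LowercaseSubstringTokenizer {
    // Lowercases the whole field value once, then cuts each token out of
    // the lowercased String with substring().
    static List<String> tokenize(String fieldValue) {
        String lower = fieldValue.toLowerCase(Locale.ROOT);
        List<String> tokens = new ArrayList<>();
        int start = -1; // index where the current token began, or -1 if none
        for (int i = 0; i < lower.length(); i++) {
            if (Character.isLetter(lower.charAt(i))) {
                if (start < 0) {
                    start = i;
                }
            } else if (start >= 0) {
                tokens.add(lower.substring(start, i));
                start = -1;
            }
        }
        if (start >= 0) {
            tokens.add(lower.substring(start)); // token runs to end of input
        }
        return tokens;
    }
}
```

For example, tokenize("Hello, World 42 foo") yields the tokens hello, world and foo. Note that the char[] sharing described above holds for the JVMs of the time; since JDK 7u6, String.substring() copies the characters, so on current JVMs the saving would come only from lowercasing once.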