Posted to dev@lucene.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2010/08/27 09:25:55 UTC
[jira] Commented: (LUCENE-2098) make BaseCharFilter more efficient in performance
[ https://issues.apache.org/jira/browse/LUCENE-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903276#action_12903276 ]
Robert Muir commented on LUCENE-2098:
-------------------------------------
Here are the files I tested: htmlStripCharFilterTest.html (from the test, a 12kb file) and http://en.wikipedia.org/wiki/Benjamin_Franklin (a 360kb file).
I ran each 3 times:
||file||before||after||
|htmlStripCharFilterTest.html|9709ms,9560ms,9587ms|8755ms,8697ms,8708ms|
|benFranklin.html|26877ms,26963ms,26495ms|17593ms,17674ms,17694ms|
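Here was the code (crude, but I think it shows the point: the larger the file, the worse the offset correction was; going by the averages in the table, the run time drops by roughly 9% on the 12kb file and roughly 34% on the 360kb file):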
{code}
Charset charset = Charset.forName("UTF-8");
// Reuse one tokenizer across iterations; only the reader changes.
WhitespaceTokenizer tokenizer = new WhitespaceTokenizer(Version.LUCENE_CURRENT, new StringReader(""));
long startMS = System.currentTimeMillis();
for (int i = 0; i < 10000; i++) {
  InputStream stream = HTMLStripCharFilterTest.class.getResourceAsStream("htmlStripReaderTest.html");
  HTMLStripCharFilter reader = new HTMLStripCharFilter(CharReader.get(new InputStreamReader(stream, charset)));
  tokenizer.reset(reader);
  tokenizer.reset();
  // Consume every token so the whole file passes through the char filter.
  while (tokenizer.incrementToken()) {
    // nothing to do per token
  }
}
System.out.println("time=" + (System.currentTimeMillis() - startMS));
{code}
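To illustrate why the lookup cost matters as the number of offset corrections grows, here is a minimal sketch, not the code from the attached patch. It assumes corrections are recorded in ascending offset order and kept in two parallel int arrays, so that a lookup can use binary search and no object is allocated per correction. The class name OffsetCorrectionSketch and the array layout are assumptions for illustration only; the method names just mirror addOffCorrectMap(int, int) from the javadoc note quoted below.

```java
import java.util.Arrays;

/** Hypothetical sketch: offset corrections held in two parallel int arrays. */
class OffsetCorrectionSketch {
    private int[] offsets = new int[16]; // output offsets where the cumulative diff changes, ascending
    private int[] diffs = new int[16];   // cumulative diff in effect from offsets[i] onward
    private int size = 0;

    /** Records a correction; assumes calls arrive with non-decreasing offsets. */
    void addOffCorrectMap(int off, int cumulativeDiff) {
        if (size == offsets.length) {
            // Grow both arrays together; no per-call object allocation otherwise.
            offsets = Arrays.copyOf(offsets, size * 2);
            diffs = Arrays.copyOf(diffs, size * 2);
        }
        offsets[size] = off;
        diffs[size] = cumulativeDiff;
        size++;
    }

    /** Maps an offset in the filtered text back to an offset in the original text. */
    int correct(int currentOff) {
        // Binary search for the last recorded offset <= currentOff.
        int lo = 0, hi = size - 1, found = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (offsets[mid] <= currentOff) {
                found = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return found < 0 ? currentOff : currentOff + diffs[found];
    }

    public static void main(String[] args) {
        // Made-up example: "<b>hello</b> world" filtered down to "hello world".
        OffsetCorrectionSketch map = new OffsetCorrectionSketch();
        map.addOffCorrectMap(0, 3); // 3 chars ("<b>") removed before output offset 0
        map.addOffCorrectMap(5, 7); // 4 more chars ("</b>") removed before output offset 5
        System.out.println(map.correct(2)); // prints 5
        System.out.println(map.correct(5)); // prints 12
    }
}
```

With this layout each lookup costs O(log n) in the number of recorded corrections, whereas a scan over a list of correction objects costs O(n) per lookup, which would fit the larger file suffering more.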
> make BaseCharFilter more efficient in performance
> -------------------------------------------------
>
> Key: LUCENE-2098
> URL: https://issues.apache.org/jira/browse/LUCENE-2098
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/analyzers
> Affects Versions: 3.1
> Reporter: Koji Sekiguchi
> Assignee: Robert Muir
> Priority: Minor
> Fix For: 3.1, 4.0
>
> Attachments: LUCENE-2098.patch, LUCENE-2098.patch
>
>
> Performance degradation in Solr 1.4 was reported. See:
> http://www.lucidimagination.com/search/document/43c4bdaf5c9ec98d/html_stripping_slower_in_solr_1_4
> The inefficiency has been pointed out in BaseCharFilter javadoc by Mike:
> {panel}
> NOTE: This class is not particularly efficient. For example, a new class instance is created for every call to addOffCorrectMap(int, int), which is then appended to a private list.
> {panel}