Posted to dev@lucene.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2010/08/27 09:25:55 UTC

[jira] Commented: (LUCENE-2098) make BaseCharFilter more efficient in performance

    [ https://issues.apache.org/jira/browse/LUCENE-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903276#action_12903276 ] 

Robert Muir commented on LUCENE-2098:
-------------------------------------

Here are the files I tested: htmlStripCharFilterTest.html (from the test, a 12 KB file) and http://en.wikipedia.org/wiki/Benjamin_Franklin (benFranklin.html in the table below, a 360 KB file).

I ran each 3 times:

||file||before||after||
|htmlStripCharFilterTest.html|9709ms,9560ms,9587ms|8755ms,8697ms,8708ms|
|benFranklin.html|26877ms,26963ms,26495ms|17593ms,17674ms,17694ms|

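Averaged over the three runs, that is about 9% less time on the small file and about 34% less on the large one. Here is the code (crude, but I think it makes the point: the larger the file, the more the old offset correction cost):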
{code}
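    // Crude timing loop: run the same HTML resource through HTMLStripCharFilter 10,000 times.
    Charset charset = Charset.forName("UTF-8");
    // One tokenizer is reused across iterations; the empty reader is only a placeholder.
    WhitespaceTokenizer tokenizer = new WhitespaceTokenizer(Version.LUCENE_CURRENT, new StringReader(""));
    long startMS = System.currentTimeMillis();
    for (int i = 0; i < 10000; i++) {
      InputStream stream = HTMLStripCharFilterTest.class.getResourceAsStream("htmlStripReaderTest.html");
      HTMLStripCharFilter reader = new HTMLStripCharFilter(CharReader.get(new InputStreamReader(stream, charset)));
      tokenizer.reset(reader); // point the tokenizer at the new input
      tokenizer.reset();       // clear per-stream state
      while (tokenizer.incrementToken()) {
        // consume every token; only the elapsed time matters
      }
    }
    System.out.println("time=" + (System.currentTimeMillis() - startMS));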
{code}
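
For anyone who wants to see why offset correction can get more expensive as files grow, here is a minimal sketch of one way to avoid the per-call object allocation that the javadoc note quoted below describes. It is not the attached patch. The class name OffsetCorrectionSketch, the parallel-array layout and the binary-search lookup are all assumptions made for illustration, as is the premise that corrections are added in ascending offset order.

```java
import java.util.Arrays;

/**
 * Hypothetical sketch only; this is NOT the LUCENE-2098 patch.
 * Stores offset corrections in two parallel int arrays instead of
 * allocating one object per correction, and looks them up by binary search.
 */
final class OffsetCorrectionSketch {
  private int[] offsets = new int[16]; // offsets at which a correction starts, ascending
  private int[] diffs = new int[16];   // cumulative diff that applies from that offset onward
  private int size = 0;

  /** Records that offsets >= off are shifted by cumulativeDiff. Assumes ascending off. */
  void addOffCorrectMap(int off, int cumulativeDiff) {
    if (size == offsets.length) {
      // Grow both arrays together so they stay parallel.
      offsets = Arrays.copyOf(offsets, size * 2);
      diffs = Arrays.copyOf(diffs, size * 2);
    }
    offsets[size] = off;
    diffs[size] = cumulativeDiff;
    size++;
  }

  /** Maps an offset in the filtered text back to an offset in the original text. */
  int correct(int currentOff) {
    int lo = 0;
    int hi = size - 1;
    int found = -1; // index of the last entry whose offset is <= currentOff
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      if (offsets[mid] <= currentOff) {
        found = mid;
        lo = mid + 1;
      } else {
        hi = mid - 1;
      }
    }
    return found < 0 ? currentOff : currentOff + diffs[found];
  }

  public static void main(String[] args) {
    OffsetCorrectionSketch map = new OffsetCorrectionSketch();
    map.addOffCorrectMap(5, 3);   // e.g. 3 chars of markup stripped before offset 5
    map.addOffCorrectMap(20, 10); // 10 chars stripped in total before offset 20
    System.out.println(map.correct(4));  // 4: before any correction
    System.out.println(map.correct(5));  // 8: first correction applies
    System.out.println(map.correct(25)); // 35: second correction applies
  }
}
```

With n corrections recorded, each lookup in this sketch costs O(log n) comparisons and no allocation, so a file with many stripped tags should pay far less per token than it would with a scan over a list of per-correction objects.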

> make BaseCharFilter more efficient in performance
> -------------------------------------------------
>
>                 Key: LUCENE-2098
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2098
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>    Affects Versions: 3.1
>            Reporter: Koji Sekiguchi
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.1, 4.0
>
>         Attachments: LUCENE-2098.patch, LUCENE-2098.patch
>
>
> Performance degradation in Solr 1.4 was reported. See:
> http://www.lucidimagination.com/search/document/43c4bdaf5c9ec98d/html_stripping_slower_in_solr_1_4
> The inefficiency has been pointed out in BaseCharFilter javadoc by Mike:
> {panel}
> NOTE: This class is not particularly efficient. For example, a new class instance is created for every call to addOffCorrectMap(int, int), which is then appended to a private list. 
> {panel}
