Posted to dev@tika.apache.org by "Jukka Zitting (JIRA)" <ji...@apache.org> on 2012/07/08 14:07:34 UTC

[jira] [Resolved] (TIKA-471) Avoid Charset name bottleneck when multiple threads are using HtmlParser

     [ https://issues.apache.org/jira/browse/TIKA-471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-471.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 1.2
         Assignee: Jukka Zitting  (was: Ken Krugler)

As a follow-up to TIKA-322, I did some fairly significant refactoring of the charset handling code. As a result, we now make far fewer Charset.forName() calls.
                
> Avoid Charset name bottleneck when multiple threads are using HtmlParser
> ------------------------------------------------------------------------
>
>                 Key: TIKA-471
>                 URL: https://issues.apache.org/jira/browse/TIKA-471
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Ken Krugler
>            Assignee: Jukka Zitting
>            Priority: Minor
>             Fix For: 1.2
>
>
> As reported by a user on the Nutch list, if there are lots of threads all parsing HTML documents, there's a lock contention issue caused by a JVM-wide lock used when resolving charset names:
> {quote}
> Apparently this is a known issue with Java, and a couple articles are
> written about it:
> http://paul.vox.com/library/post/the-mysteries-of-java-character-set-performance.html
> http://halfbottle.blogspot.com/2009/07/charset-continued-i-wrote-about.html 
> There is also a note in java bug database about scaling issues with the
> class...
> Please also note that the current implementation of
> sun.nio.cs.FastCharsetProvider.charsetForName() uses a JVM-wide lock and is
> called very often (e.g. by new String(byte[] data,String encoding)). This
> JVM-wide lock means that Java applications do not scale beyond 4 CPU cores.
> I noted this in the case of my own stack at this particular point in time. The
> BLOCKED calls to charsetForName were generated by:
> at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:84) 378
> at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:99) 61
> at org.apache.tika.parser.html.HtmlParser.getEncoding(HtmlParser.java:133) 19 
> at org.apache.nutch.parse.html.HtmlParser.sniffCharacterEncoding(HtmlParser.java:86)  238
> ...
> {quote}
> We now have a CharsetUtils class in Tika, and we could add a cache for validated names in the isSupported() method.
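
The caching idea suggested above could look roughly like the sketch below. It is only an illustration, not the code that was committed for this issue: the class name CharsetNameCache and its lookup() method are hypothetical, and it assumes Java 8 or later (for Optional and ConcurrentMap.computeIfAbsent). Each distinct name is resolved through Charset.forName() the first time it is seen, and later lookups are answered from a concurrent map, so they no longer reach the provider's lock.

```java
import java.nio.charset.Charset;
import java.util.Locale;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical helper, not Tika's actual CharsetUtils.
final class CharsetNameCache {

    // Normalized name -> resolved charset, or empty if the name is
    // illegal or unsupported. Negative results are cached as well.
    private static final ConcurrentMap<String, Optional<Charset>> CACHE =
            new ConcurrentHashMap<>();

    private CharsetNameCache() {
    }

    // Returns true if the name resolves to a charset on this JVM.
    static boolean isSupported(String name) {
        return lookup(name).isPresent();
    }

    // Resolves a charset name, calling Charset.forName() only on a cache miss.
    static Optional<Charset> lookup(String name) {
        if (name == null) {
            return Optional.empty();
        }
        // Charset names are case-insensitive, so normalize the cache key.
        String key = name.trim().toLowerCase(Locale.ROOT);
        return CACHE.computeIfAbsent(key, CharsetNameCache::resolve);
    }

    private static Optional<Charset> resolve(String key) {
        try {
            return Optional.of(Charset.forName(key));
        } catch (IllegalArgumentException e) {
            // IllegalCharsetNameException and UnsupportedCharsetException
            // are both subclasses of IllegalArgumentException.
            return Optional.empty();
        }
    }
}
```

A caller such as an encoding sniffer would then use CharsetNameCache.lookup(name) in place of Charset.forName(name). One caveat with this sketch: the map is unbounded, and charset names found in crawled HTML are untrusted input, so a real implementation would probably want to cap the cache size or cache only successful lookups.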

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira