Posted to dev@shindig.apache.org by "Kevin Brown (JIRA)" <ji...@apache.org> on 2008/07/30 04:09:31 UTC

[jira] Commented: (SHINDIG-479) Character set detection is EXPENSIVE using ICU4J.

    [ https://issues.apache.org/jira/browse/SHINDIG-479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12618039#action_12618039 ] 

Kevin Brown commented on SHINDIG-479:
-------------------------------------

For reference, Louis and I ran some benchmarks using the YourKit profiler and found that for a gadget with 4 proxied requests, 22% of the time was spent in character set detection alone. The ICU algorithm for detecting the likely character set is really awful (see the source for details).

There are a couple of potential options:

1. Do a quick check for UTF-8 (extremely fast and easy). If not UTF-8, try ICU.
2. Do the UTF-8 check. If not UTF-8, assume ISO-8859-1 (standard character set for most web servers).

In both cases we continue to believe the server if it does send an appropriate character set in the response headers.

The first won't break anything, but it will still be doing a lot of unnecessary work. 

The second will break some things (though probably not many), but the CPU usage will be down significantly. Anything that might be "broken" can be addressed by simply specifying the character set in the HTTP responses.
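
For illustration, option 2 could look roughly like the sketch below. This is not the Shindig code: the class name CharsetGuess and the methods isValidUtf8 and guessCharset are hypothetical, and the header charset is assumed to have already been parsed out of the Content-Type header by the caller. The UTF-8 check uses a strict JDK decoder that reports malformed input instead of replacing it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Shindig.
class CharsetGuess {

    /** Returns true if the whole body decodes as strict UTF-8. */
    static boolean isValidUtf8(byte[] body) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(body));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /**
     * Option 2: believe the server if it named a usable charset,
     * otherwise pick UTF-8 when the bytes are valid UTF-8 and
     * fall back to ISO-8859-1 when they are not.
     * headerCharset may be null when the response gave no charset.
     */
    static Charset guessCharset(String headerCharset, byte[] body) {
        if (headerCharset != null) {
            try {
                return Charset.forName(headerCharset);
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: ignore it and sniff the body.
            }
        }
        return isValidUtf8(body) ? StandardCharsets.UTF_8 : StandardCharsets.ISO_8859_1;
    }
}
```

Note that pure ASCII content is also valid UTF-8, so it would be reported as UTF-8 here. Option 1 would differ only in the last line, calling the ICU detector instead of returning ISO-8859-1.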

> Character set detection is EXPENSIVE using ICU4J.
> -------------------------------------------------
>
>                 Key: SHINDIG-479
>                 URL: https://issues.apache.org/jira/browse/SHINDIG-479
>             Project: Shindig
>          Issue Type: Bug
>          Components: Gadget Rendering Server (Java)
>            Reporter: Louis Ryan
>            Assignee: Louis Ryan
>            Priority: Critical
>
> We use the ICU4J library to detect the character set of HTTP content fetched from 3rd parties when the content-type header does not contain the charset. The code is quite expensive, and the cost was also being incurred on cached and rewritten content, i.e. basically everything that runs through Shindig.
> I've submitted a partial fix: once the charset is derived, it is stored back into the HTTP headers so that caching and rewriting can benefit from it. For containers with good caching this should eliminate ~75% of the CPU overhead, but a lot of makeRequest traffic is not cacheable and still suffers from this.
