You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@nutch.apache.org by "Doug Cook (JIRA)" <ji...@apache.org> on 2007/08/01 22:30:53 UTC

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

    [ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517066 ] 

Doug Cook commented on NUTCH-25:
--------------------------------

Cool -- will take a look at the new patch (and will try to make stripGarbage more robust as I get some bandwidth to work on it; it definitely helped in my tests).

At a very quick look, one potential drawback of the private EncodingClue + addClue/clearClues interface is that because EncodingDetector now keeps internal state, it is no longer safe to call the same EncodingDetector from different threads (though I'm not sure if ICU4J's CharsetDetector is thread-safe anyway, so this may already have been a potential problem). Not sure if this is an issue with the parsers or not, but will take a look.


> needs 'character encoding' detector
> -----------------------------------
>
>                 Key: NUTCH-25
>                 URL: https://issues.apache.org/jira/browse/NUTCH-25
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Stefan Groschupf
>            Assignee: Doğacan Güney
>             Fix For: 1.0.0
>
>         Attachments: EncodingDetector.java, NUTCH-25.patch, NUTCH-25_draft.patch, NUTCH-25_v2.patch, patch
>
>
> transferred from:
> http://sourceforge.net/tracker/index.php?func=detail&aid=995730&group_id=59548&atid=491356
> submitted by:
> Jungshik Shin
> this is a follow-up to bug 993380 (figure out 'charset'
> from the meta tag).
> Although we can cover a lot of ground using the 'C-T'
> header field in in the HTTP header and the
> corresponding meta tag in html documents (and in case
> of XML, we have to use a similar but a different
> 'parsing'), in the wild, there are a lot of documents
> without any information about the character encoding
> used. Browsers like Mozilla and search engines like
> Google use character encoding detectors to deal with
> these 'unlabelled' documents. 
> Mozilla's character encoding detector is GPL/MPL'd and
> we might be able to port it to Java. Unfortunately,
> it's not fool-proof. However, along with some other
> heuristic used by Mozilla and elsewhere, it'll be
> possible to achieve a high rate of the detection. 
> The following page has links to some other related pages.
> http://trainedmonkey.com/week/2004/26
> In addition to the character encoding detection, we
> also need to detect the language of a document, which
> is even harder and should be a separate bug (although
> it's related).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.