Posted to dev@nutch.apache.org by "Doğacan Güney (JIRA)" <ji...@apache.org> on 2007/09/26 16:06:51 UTC

[jira] Resolved: (NUTCH-25) needs 'character encoding' detector

     [ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney resolved NUTCH-25.
--------------------------------

    Resolution: Fixed

I am committing the latest patch with some changes:

  * Added a unit test case
  * Removed the thread-local handling. Instead, an EncodingDetector is instantiated for every Parser.getParse call.
  * Removed per-charset confidence values. We don't use them right now. Doug, I assume you may not like this one. I removed them to simplify the patch a bit. If you feel that they are useful, we can add them (and other features) later on.
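
To illustrate the second change above, here is a minimal sketch. SketchDetector and SketchParser are hypothetical stand-ins, not the committed EncodingDetector or Parser code, and the "windows-1252" fallback is just an example value. The sketch only shows why a detector created inside each parse call needs no thread-local storage: its mutable state lives and dies with that one call.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a stateful detector; not Nutch's actual class. */
final class SketchDetector {
    // Mutable per-document state: charset clues gathered for one page.
    private final List<String> clues = new ArrayList<>();

    void addClue(String charset) {
        clues.add(charset);
    }

    /** Returns the first clue seen, or the fallback when there is none. */
    String guess(String fallback) {
        return clues.isEmpty() ? fallback : clues.get(0);
    }
}

/** Hypothetical parser that builds a fresh detector on every call. */
final class SketchParser {
    static String parse(String headerCharset) {
        // Fresh instance per call: nothing is shared between calls or threads.
        SketchDetector detector = new SketchDetector();
        if (headerCharset != null) {
            detector.addClue(headerCharset);
        }
        return detector.guess("windows-1252"); // example fallback, an assumption
    }

    public static void main(String[] args) {
        System.out.println(parse("utf-8"));
        System.out.println(parse(null));
    }
}
```

The trade-off is one small allocation per parsed document in exchange for not having to reset or clean up shared state.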

As I mentioned before, this may not be a perfect encoding detection system, but it is definitely better than what we have now.

Note that encoding auto-detection is disabled by default. See the property encodingdetector.charset.min.confidence.
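
A rough sketch of how a minimum-confidence threshold can gate auto-detection follows. ConfidenceGate and pick are hypothetical names, and two things are assumptions rather than facts taken from the patch: that confidence is an integer on a 0-100 scale, and that a negative threshold is what "disabled" means.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration of a confidence threshold; not the committed code. */
final class ConfidenceGate {
    /**
     * Returns the highest-confidence candidate that meets minConfidence,
     * or null when detection is off or no candidate qualifies.
     */
    static String pick(Map<String, Integer> candidates, int minConfidence) {
        if (minConfidence < 0) {
            return null; // assumption: a negative threshold turns auto-detection off
        }
        String best = null;
        int bestConfidence = -1;
        for (Map.Entry<String, Integer> e : candidates.entrySet()) {
            int confidence = e.getValue();
            if (confidence >= minConfidence && confidence > bestConfidence) {
                best = e.getKey();
                bestConfidence = confidence;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Integer> candidates = new LinkedHashMap<>();
        candidates.put("utf-8", 80);
        candidates.put("iso-8859-9", 35);
        System.out.println(pick(candidates, 50));
        System.out.println(pick(candidates, -1));
    }
}
```

When pick returns null, the caller would fall back to whatever charset the headers, the meta tag, or the configured default supply.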

Committed in rev. 579656.

> needs 'character encoding' detector
> -----------------------------------
>
>                 Key: NUTCH-25
>                 URL: https://issues.apache.org/jira/browse/NUTCH-25
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Stefan Groschupf
>            Assignee: Doğacan Güney
>             Fix For: 1.0.0
>
>         Attachments: EncodingDetector.java, EncodingDetector_additive.java, NUTCH-25.patch, NUTCH-25_draft.patch, NUTCH-25_v2.patch, NUTCH-25_v3.patch, NUTCH-25_v4.patch, patch
>
>
> transferred from:
> http://sourceforge.net/tracker/index.php?func=detail&aid=995730&group_id=59548&atid=491356
> submitted by:
> Jungshik Shin
> this is a follow-up to bug 993380 (figure out 'charset'
> from the meta tag).
> Although we can cover a lot of ground using the 'C-T'
> header field in the HTTP header and the
> corresponding meta tag in html documents (and in case
> of XML, we have to use a similar but different
> 'parsing'), in the wild, there are a lot of documents
> without any information about the character encoding
> used. Browsers like Mozilla and search engines like
> Google use character encoding detectors to deal with
> these 'unlabelled' documents. 
> Mozilla's character encoding detector is GPL/MPL'd and
> we might be able to port it to Java. Unfortunately,
> it's not fool-proof. However, along with some other
> heuristic used by Mozilla and elsewhere, it'll be
> possible to achieve a high rate of the detection. 
> The following page has links to some other related pages.
> http://trainedmonkey.com/week/2004/26
> In addition to the character encoding detection, we
> also need to detect the language of a document, which
> is even harder and should be a separate bug (although
> it's related).
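
The lookup order described in the quoted issue (HTTP Content-Type header first, then the HTML meta tag, then a detector for unlabelled documents) might be sketched as below. CharsetLookup and its methods are hypothetical, the regular expressions are deliberately simplistic, and the detector's result is passed in as a plain string because the detection itself is out of scope here.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of the header -> meta tag -> detector fallback chain. */
final class CharsetLookup {
    // Matches "charset=xyz" (optionally quoted) in a Content-Type value.
    private static final Pattern CHARSET = Pattern.compile(
        "charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);
    // Matches a whole <meta ...> tag so it can be searched for a charset.
    private static final Pattern META = Pattern.compile(
        "<meta\\s[^>]*>", Pattern.CASE_INSENSITIVE);

    /** Extracts a lower-cased charset name from a Content-Type value, or null. */
    static String fromContentType(String value) {
        if (value == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(value);
        return m.find() ? m.group(1).toLowerCase(Locale.ROOT) : null;
    }

    /** Scans the start of an HTML document for a meta tag carrying a charset. */
    static String fromMeta(String htmlPrefix) {
        Matcher tag = META.matcher(htmlPrefix);
        while (tag.find()) {
            String charset = fromContentType(tag.group());
            if (charset != null) {
                return charset;
            }
        }
        return null;
    }

    /** Tries the header, then the meta tag, then the detector's guess. */
    static String resolve(String header, String htmlPrefix, String detected) {
        String charset = fromContentType(header);
        if (charset == null) {
            charset = fromMeta(htmlPrefix);
        }
        return charset != null ? charset : detected;
    }

    public static void main(String[] args) {
        String html = "<html><head><meta http-equiv=\"Content-Type\" "
            + "content=\"text/html; charset=EUC-KR\"></head></html>";
        System.out.println(resolve("text/html", html, "iso-8859-1"));
    }
}
```

Only when both labelled sources come up empty does the detector's guess get used, which is the 'unlabelled' case the issue is about.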
