Posted to dev@nutch.apache.org by "Doug Cook (JIRA)" <ji...@apache.org> on 2007/05/21 18:34:16 UTC

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

    [ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12497507 ] 

Doug Cook commented on NUTCH-25:
--------------------------------

We might want to think about raising the priority of this. I've seen encoding problems affect quite a few documents. Sometimes this is obvious, because it shows up in the abstract, but often it is subtle and simply hurts recall.

Here's an example.

I have indexed the document: 
http://www.winereviewonline.com/wine_reviews.cfm?nCountryID=2&archives=1

This document is in UTF-8, but the header says it is in iso-8859-1 (this seems fairly common!). Because of this, a few characters get garbled. If I search for "Les Vignes du Soir", I won't find it: the page writes the phrase with curly quotes (“Les Vignes du Soir”), and those are multi-byte characters in UTF-8, so they are among the ones that get mangled when the bytes are decoded as iso-8859-1.
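
To make the failure concrete, here is a minimal Java sketch of what happens to that phrase when UTF-8 bytes are decoded as iso-8859-1. The class and method names are made up for illustration, and this is not Nutch's actual parsing code; it only reproduces the decoding mismatch.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Nutch.
public class MojibakeDemo {
    // Encode as UTF-8 (what the page really is), then decode with the
    // charset the header wrongly declares (ISO-8859-1).
    static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // ISO-8859-1 maps every byte to exactly one char, so the original
    // bytes can be recovered and decoded again with the right charset.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String phrase = "\u201CLes Vignes du Soir\u201D"; // curly quotes
        String garbled = misdecode(phrase);
        // Each curly quote is three bytes in UTF-8, so it becomes three chars.
        System.out.println(phrase.length() + " chars -> " + garbled.length() + " chars");
        System.out.println("repair restores original: " + repair(garbled).equals(phrase));
    }
}
```

The plain ASCII letters survive the wrong decoding unchanged; only the multi-byte characters are damaged, which is why the problem is easy to miss.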

I've seen enough instances of problems like this to make me worry that it is causing significant recall problems.

If anyone has a ready solution for this, please let me know. If not, I'll try to get to it (and contribute back the changes once I get the chance...). Is jchardet still the best Java option out there?
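
For the specific "really UTF-8 but labelled iso-8859-1" case, one cheap check could look like the sketch below. This is only an assumption about what might work, not jchardet and not a general detector: it trusts UTF-8 over the declared charset when the content contains non-ASCII bytes and decodes as strictly valid UTF-8. The names Utf8Sniffer, isValidUtf8, hasNonAscii and chooseCharset are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Nutch or jchardet.
public class Utf8Sniffer {
    // True if the bytes decode as UTF-8 with no malformed sequences.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Pure ASCII is valid in both encodings, so it tells us nothing.
    static boolean hasNonAscii(byte[] bytes) {
        for (byte b : bytes) {
            if (b < 0) return true; // high bit set
        }
        return false;
    }

    // Override the declared charset only when the evidence points to UTF-8.
    static String chooseCharset(byte[] content, String declared) {
        if (hasNonAscii(content) && isValidUtf8(content)) {
            return "UTF-8";
        }
        return declared;
    }
}
```

The reasoning is that real iso-8859-1 text with accented characters almost never happens to form valid UTF-8 multi-byte sequences. One caveat: content truncated in the middle of a multi-byte character would fail the strict check and fall back to the declared charset.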

> needs 'character encoding' detector
> -----------------------------------
>
>                 Key: NUTCH-25
>                 URL: https://issues.apache.org/jira/browse/NUTCH-25
>             Project: Nutch
>          Issue Type: Wish
>            Reporter: Stefan Groschupf
>            Priority: Trivial
>
> transferred from:
> http://sourceforge.net/tracker/index.php?func=detail&aid=995730&group_id=59548&atid=491356
> submitted by:
> Jungshik Shin
> this is a follow-up to bug 993380 (figure out 'charset'
> from the meta tag).
> Although we can cover a lot of ground using the
> 'Content-Type' header field in the HTTP header and the
> corresponding meta tag in html documents (and in case
> of XML, we have to use a similar but different
> 'parsing'), in the wild, there are a lot of documents
> without any information about the character encoding
> used. Browsers like Mozilla and search engines like
> Google use character encoding detectors to deal with
> these 'unlabelled' documents. 
> Mozilla's character encoding detector is GPL/MPL'd and
> we might be able to port it to Java. Unfortunately,
> it's not fool-proof. However, along with some other
> heuristics used by Mozilla and elsewhere, it'll be
> possible to achieve a high detection rate. 
> The following page has links to some other related pages.
> http://trainedmonkey.com/week/2004/26
> In addition to the character encoding detection, we
> also need to detect the language of a document, which
> is even harder and should be a separate bug (although
> it's related).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.