Posted to dev@nutch.apache.org by "Sami Siren (JIRA)" <ji...@apache.org> on 2007/05/15 20:32:16 UTC

[jira] Resolved: (NUTCH-161) Change Plain text parser to use parser.character.encoding.default property for fall back encoding

     [ https://issues.apache.org/jira/browse/NUTCH-161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sami Siren resolved NUTCH-161.
------------------------------

    Resolution: Fixed

I just committed a fix for this, thanks KuroSaka!

> Change Plain text parser to use parser.character.encoding.default property for fall back encoding
> -------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-161
>                 URL: https://issues.apache.org/jira/browse/NUTCH-161
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>         Environment: any
>            Reporter: KuroSaka TeruHiko
>         Assigned To: Sami Siren
>            Priority: Minor
>             Fix For: 1.0.0
>
>
> The value of the property parser.character.encoding.default is used as a fallback character encoding (charset) when the HTML parser cannot find charset information in the HTTP Content-Type header or in a META HTTP-EQUIV tag.  The plain text parser behaves differently: it uses the system default encoding (the Java VM's file.encoding property, which in turn derives from the OS and the locale of the environment from which the JVM was spawned), so the same document can be decoded differently on different machines.  To guarantee consistent behavior, the plain text parser should use the value of the same property.
> Though not tested, these changes in ./src/plugin/parse-text/src/java/org/apache/nutch/parse/text/TextParser.java should do it:
> Insert this statement in the class definition:
>   private static String defaultCharEncoding =
>     NutchConf.get().get("parser.character.encoding.default", "windows-1252");
> Replace this:
>       text = new String(content.getContent());    // use default encoding
> with this:
>       text = new String(content.getContent(), defaultCharEncoding);    // use the configured fallback encoding; this constructor can throw UnsupportedEncodingException
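
For illustration only, the fallback idea described in the issue can be sketched in standalone Java. This is not the committed Nutch change: the class FallbackDecoder and its methods are hypothetical names, and the configured value is passed in as a plain string instead of being read from the Nutch configuration. The sketch assumes windows-1252 is available in the running JVM, as the issue's suggested default implies.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical helper for illustration; not part of Nutch.
final class FallbackDecoder {
    // Last-resort charset, mirroring the default suggested in the issue.
    static final String LAST_RESORT = "windows-1252";

    // Resolve the configured charset name; use the last resort when the
    // name is missing, malformed, or not supported by this JVM.
    static Charset resolve(String configuredName) {
        if (configuredName != null) {
            try {
                return Charset.forName(configuredName.trim());
            } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
                // Fall through to the last-resort charset below.
            }
        }
        return Charset.forName(LAST_RESORT);
    }

    // Decode raw bytes with an explicit charset instead of the
    // platform default that new String(byte[]) would use.
    static String decode(byte[] raw, String configuredName) {
        return new String(raw, resolve(configuredName));
    }
}
```

Resolving the name to a Charset first avoids the checked UnsupportedEncodingException of the String(byte[], String) constructor, and makes the result independent of the locale the JVM was started in.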

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.