Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2011/04/01 17:27:06 UTC
[jira] [Closed] (NUTCH-519) parsed incorrectly
[ https://issues.apache.org/jira/browse/NUTCH-519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma closed NUTCH-519.
-------------------------------
Resolution: Won't Fix
Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira
> parsed incorrectly
> -------------------------
>
> Key: NUTCH-519
> URL: https://issues.apache.org/jira/browse/NUTCH-519
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 0.9.0
> Environment: Linux 2.6.21
> Java 1.5
> Nutch 0.9
> Reporter: Chris Hane
>
> I have deployed Nutch in a standard configuration without any modifications.
> On every page it crawls on my website, the parse phase converts an HTML entity into Â.
> The page declares its charset as:
> <meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
> When I issue the command
> bin/nutch readseg -get demo.crawl/segments/20070718174552/ http://demo.itsolut.com/mr.com/bookstore/maintenancemanagement/wiremanlibrary.htm
> The HTML portion contains:
> <tr>
> <td align="center">
> <b><font face="Arial" size="0">Address: 120 South
> 7th Street - Terre Haute, IN 47807</font></b>
> </td>
> </tr>
> and the parsed content is:
> Address: 120 South 7th Street - Terre Haute, IN 47807
> Also, the output contains the following:
> Parse Metadata: OriginalCharEncoding=windows-1252 CharEncodingForConversion=windows-1252
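
A stray Â in text handled as windows-1252 is often a sign that a non-breaking space (U+00A0) was written out as UTF-8 and then read back as windows-1252. The report does not confirm that this is what happened here, so the sketch below is only an illustration under that assumption. The helper name misdecode and the sample string are hypothetical and are not part of Nutch.

```python
# Illustrative only: assumes the affected character is a non-breaking space
# (U+00A0). The report itself does not say which entity was involved.
NBSP = "\u00a0"


def misdecode(text: str, written_as: str = "utf-8",
              read_as: str = "windows-1252") -> str:
    """Encode text with one charset, then decode the bytes with another."""
    return text.encode(written_as).decode(read_as)


original = "120 South" + NBSP + "7th Street"

# U+00A0 is the two bytes C2 A0 in UTF-8. Read as windows-1252, byte C2
# becomes "Â" and byte A0 becomes a non-breaking space again.
garbled = misdecode(original)
print(repr(garbled))

# Reversing the two steps recovers the original text, which suggests the
# damage in this scenario comes from a charset mismatch, not from data loss.
repaired = garbled.encode("windows-1252").decode("utf-8")
print(repaired == original)
```

If the parsed text in a segment shows the same pattern, checking which charset each stage reads and writes with would be a reasonable next step.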