Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2011/04/01 17:27:06 UTC

[jira] [Closed] (NUTCH-519) &nbsp; parsed incorrectly

     [ https://issues.apache.org/jira/browse/NUTCH-519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma closed NUTCH-519.
-------------------------------

    Resolution: Won't Fix

Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

> &nbsp; parsed incorrectly
> -------------------------
>
>                 Key: NUTCH-519
>                 URL: https://issues.apache.org/jira/browse/NUTCH-519
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.9.0
>         Environment: Linux 2.6.21
> Java 1.5
> Nutch 0.9
>            Reporter: Chris Hane
>
> I have deployed nutch in a standard configuration without any modifications.
> On every page it crawls on my website, the parse phase converts the &nbsp; HTML entity into Â.
> The page declares its charset as:
> <meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
> When I issue the command
> bin/nutch readseg -get demo.crawl/segments/20070718174552/ http://demo.itsolut.com/mr.com/bookstore/maintenancemanagement/wiremanlibrary.htm
> The HTML portion contains:
>     <tr>
>       <td align="center">
>       <b><font face="Arial" size="0">Address: 120 South
> 7th Street&nbsp; -&nbsp; Terre Haute, IN 47807</font></b>
>       </td>
>     </tr>
> and the parsed content is:
>  Address: 120 South 7th Street  -  Terre Haute, IN 47807
> Also, the output contains the following:
> Parse Metadata: OriginalCharEncoding=windows-1252 CharEncodingForConversion=windows-1252
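
A stray Â in front of a non-breaking space is the usual signature of UTF-8 bytes being read back as windows-1252. The issue does not confirm where in Nutch this happens, so the following is only a minimal sketch of that general mechanism, not Nutch code. The helper name `mis_decode` and the sample string are hypothetical, chosen for illustration.

```python
# Hypothetical illustration of the suspected mechanism; this is not Nutch code.

NBSP = "\u00a0"  # the character that the &nbsp; entity stands for


def mis_decode(text: str, written_as: str = "utf-8",
               read_as: str = "windows-1252") -> str:
    """Encode text in one charset, then decode the bytes in another."""
    return text.encode(written_as).decode(read_as)


# In UTF-8, U+00A0 is the two bytes C2 A0. In windows-1252, byte C2 is
# 'Â' and byte A0 is the non-breaking space, so one character becomes two.
address = "7th Street" + NBSP + " -" + NBSP + " Terre Haute"
garbled = mis_decode(address)

print(repr(address))
print(repr(garbled))
```

If this is what happens, the declared charset (windows-1252) would be correct and the fault would lie in a step that writes or re-reads the text with a different encoding.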

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira