Posted to dev@nutch.apache.org by "Lewis John McGibbney (JIRA)" <ji...@apache.org> on 2013/02/19 01:47:12 UTC

[jira] [Resolved] (NUTCH-1420) Get rid of the dreaded �

     [ https://issues.apache.org/jira/browse/NUTCH-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney resolved NUTCH-1420.
-----------------------------------------

       Resolution: Fixed
    Fix Version/s: 2.2

Committed @r 1447562 in 2.x HEAD
Committed @r 14747563 in trunk
Thank you, Markus. I moved the new method into StringUtil and made it static, so that we can reuse it elsewhere if required.
                
> Get rid of the dreaded �
> ------------------------
>
>                 Key: NUTCH-1420
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1420
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Trivial
>             Fix For: 1.7, 2.2
>
>         Attachments: NUTCH-1420-1.6-1.patch
>
>
> Some pages, especially PDFs, produce sequences containing the dreaded � character. This patch removes them from the title and content fields.
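
For readers without the patch at hand, the clean-up described above could look roughly like the sketch below. It assumes the "dreaded" character is U+FFFD (the Unicode replacement character) and that removing it outright is acceptable. The class and method names are hypothetical and are not taken from the committed patch or from Nutch's StringUtil.

```java
/**
 * Hypothetical helper: the class and method names are illustrative only
 * and are not taken from the committed patch.
 */
final class ReplacementCharCleaner {

  /** U+FFFD, the Unicode replacement character (assumed to be the one meant). */
  private static final String REPLACEMENT_CHAR = "\uFFFD";

  private ReplacementCharCleaner() {
    // Static utility class, not meant to be instantiated.
  }

  /** Returns the value with every U+FFFD removed; null is passed through unchanged. */
  static String stripReplacementChar(String value) {
    if (value == null) {
      return null;
    }
    // String.replace does a literal (non-regex) replacement of all occurrences.
    return value.replace(REPLACEMENT_CHAR, "");
  }

  public static void main(String[] args) {
    // Example field value as it might come out of a badly decoded PDF.
    String title = "Annual\uFFFD\uFFFD Report";
    System.out.println(stripReplacementChar(title));
  }
}
```

Keeping the method static and free of indexer state is what would let other components call it for fields beyond title and content.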

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira