Posted to dev@nutch.apache.org by "Ferdy Galema (JIRA)" <ji...@apache.org> on 2012/05/10 14:47:48 UTC

[jira] [Closed] (NUTCH-1026) Strip UTF-8 non-character codepoints

     [ https://issues.apache.org/jira/browse/NUTCH-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy Galema closed NUTCH-1026.
-------------------------------

       Resolution: Fixed
    Fix Version/s:     (was: 2.1)
                   nutchgora

When indexing a huge dataset I ran into this issue too. The patch in NUTCH-1016 works fine. (Thanks Markus!) I verified and tested this. Committed to nutchgora.

Minor note: The patch checks for invalid chars ONLY on the "content" field of the NutchDocument. Since the problem is most likely to occur only in this field, that is okay for now.
                
> Strip UTF-8 non-character codepoints
> ------------------------------------
>
>                 Key: NUTCH-1026
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1026
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: nutchgora
>            Reporter: Markus Jelsma
>             Fix For: nutchgora
>
>
> During a very large crawl I found a few documents producing non-character codepoints. When indexing to Solr this will yield the following exception:
> {code}
> SEVERE: java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xffff at char #1142033, byte #1155068)
>         at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
>         at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
> {code}
> Quite annoying! Here's a quick fix for SolrWriter that'll pass the value of the content field to a method to strip away non-characters. I'm not too sure about this implementation, but the tests I've done locally with a huge dataset now pass. Here's a list of codepoints to strip away: http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Noncharacter_Code_Point=True:]
> Please comment!
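
For readers who want to see what stripping non-character codepoints might look like, here is a minimal sketch. It is not the code from the attached patch: the class name NonCharacterStripper and both method names are hypothetical, and it assumes the set to remove is exactly the 66 Unicode noncharacters (U+FDD0..U+FDEF plus every code point whose low 16 bits are FFFE or FFFF, such as the 0xffff in the exception above). Lone surrogates and other problematic input are deliberately left untouched.

```java
// Hypothetical helper, not the actual NUTCH patch: removes Unicode
// noncharacter code points from a string before it is sent for indexing.
final class NonCharacterStripper {

    private NonCharacterStripper() {
    }

    // True for U+FDD0..U+FDEF and for the last two code points of every
    // plane (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF).
    static boolean isNonCharacter(int codePoint) {
        return (codePoint >= 0xFDD0 && codePoint <= 0xFDEF)
                || (codePoint & 0xFFFE) == 0xFFFE;
    }

    // Returns a copy of the input with all noncharacters dropped.
    // Iterates by code point so supplementary-plane characters are
    // handled as a unit rather than as two separate chars.
    static String stripNonCharacters(String input) {
        if (input == null) {
            return null;
        }
        StringBuilder cleaned = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int codePoint = input.codePointAt(i);
            i += Character.charCount(codePoint);
            if (!isNonCharacter(codePoint)) {
                cleaned.appendCodePoint(codePoint);
            }
        }
        return cleaned.toString();
    }
}
```

A caller would pass a field value through stripNonCharacters before adding it to the document being indexed. Applying it to every string field, rather than only "content", would be one way to address the minor note above, at the cost of an extra pass over each value.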

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira