Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2011/06/30 18:19:28 UTC

[jira] [Created] (NUTCH-1026) Strip UTF-8 non-character codepoints

Strip UTF-8 non-character codepoints
------------------------------------

                 Key: NUTCH-1026
                 URL: https://issues.apache.org/jira/browse/NUTCH-1026
             Project: Nutch
          Issue Type: Bug
          Components: indexer
    Affects Versions: 2.0
            Reporter: Markus Jelsma
            Assignee: Markus Jelsma
             Fix For: 2.0


During a very large crawl I found a few documents producing non-character codepoints. When indexing to Solr, this yields the following exception:

{code}
SEVERE: java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xffff at char #1142033, byte #1155068)
        at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
        at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
{code}

Quite annoying! Here's a quick fix for SolrWriter that passes the value of the content field to a method that strips away non-characters. I'm not entirely sure about this implementation, but the tests I've run locally against a huge dataset now pass correctly. Here's a list of codepoints to strip away: http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Noncharacter_Code_Point=True:]
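For reference, a minimal sketch of what such a stripping method could look like (class and method names here are hypothetical, not the actual patch). Unicode non-characters are U+FDD0..U+FDEF plus the last two codepoints of every plane (U+FFFE/U+FFFF, U+1FFFE/U+1FFFF, ... U+10FFFE/U+10FFFF):

```java
public class NonCharStripper {
    // Returns a copy of the input with Unicode non-character codepoints removed.
    public static String stripNonCharCodepoints(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            // (cp & 0xFFFE) == 0xFFFE matches U+xxFFFE and U+xxFFFF in every plane;
            // the range check covers the U+FDD0..U+FDEF block.
            boolean nonChar = (cp >= 0xFDD0 && cp <= 0xFDEF)
                           || (cp & 0xFFFE) == 0xFFFE;
            if (!nonChar) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return out.toString();
    }
}
```

For example, `stripNonCharCodepoints("foo\uFFFFbar")` returns `"foobar"`. Iterating by codepoint rather than by char matters here, since non-characters outside the BMP are encoded as surrogate pairs in a Java String.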

Please comment!

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (NUTCH-1026) Strip UTF-8 non-character codepoints

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13272328#comment-13272328 ] 

Markus Jelsma commented on NUTCH-1026:
--------------------------------------

Great!
                


        

[jira] [Closed] (NUTCH-1026) Strip UTF-8 non-character codepoints

Posted by "Ferdy Galema (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy Galema closed NUTCH-1026.
-------------------------------

       Resolution: Fixed
    Fix Version/s:     (was: 2.1)
                   nutchgora

When indexing a huge dataset I ran into this issue too. The patch in NUTCH-1026 works fine (thanks Markus!). I verified and tested it, and committed it to nutchgora.

Minor note: the patch checks for invalid characters only on the "content" field of the NutchDocument. But since the problem is most likely to occur only on that field, it is okay for now.
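If the check ever needs to cover every field rather than just "content", the same stripping could be applied per field value. A sketch, assuming a simple Map-based field model (the real NutchDocument API differs, and all names here are hypothetical):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DocSanitizer {
    // Strip Unicode non-character codepoints: U+FDD0..U+FDEF and
    // U+xxFFFE/U+xxFFFF in every plane.
    static String strip(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!((cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    // Apply the stripping to every field value, not only "content".
    public static Map<String, String> sanitizeAll(Map<String, String> doc) {
        Map<String, String> clean = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : doc.entrySet()) {
            clean.put(e.getKey(), strip(e.getValue()));
        }
        return clean;
    }
}
```

The per-field loop costs one extra pass over each value, which is cheap next to the cost of a failed Solr update on a single bad field.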
                


        

[jira] [Commented] (NUTCH-1026) Strip UTF-8 non-character codepoints

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13273027#comment-13273027 ] 

Hudson commented on NUTCH-1026:
-------------------------------

Integrated in Nutch-nutchgora #249 (See [https://builds.apache.org/job/Nutch-nutchgora/249/])
    NUTCH-1026 Strip UTF-8 non-character codepoints (Revision 1336643)

     Result = SUCCESS
ferdy : 
Files : 
* /nutch/branches/nutchgora/CHANGES.txt
* /nutch/branches/nutchgora/conf/log4j.properties
* /nutch/branches/nutchgora/src/java/org/apache/nutch/indexer/solr/SolrWriter.java

                


        

[jira] [Updated] (NUTCH-1026) Strip UTF-8 non-character codepoints

Posted by "Lewis John McGibbney (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1026:
----------------------------------------

    Fix Version/s:     (was: nutchgora)
                   2.1

Set and Classify
                


        

[jira] [Assigned] (NUTCH-1026) Strip UTF-8 non-character codepoints

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma reassigned NUTCH-1026:
------------------------------------

    Assignee:     (was: Markus Jelsma)

