Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2011/06/30 18:19:28 UTC
[jira] [Created] (NUTCH-1026) Strip UTF-8 non-character codepoints
Strip UTF-8 non-character codepoints
------------------------------------
Key: NUTCH-1026
URL: https://issues.apache.org/jira/browse/NUTCH-1026
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: 2.0
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 2.0
During a very large crawl I found a few documents producing non-character codepoints. When indexing to Solr this yields the following exception:
{code}
SEVERE: java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xffff at char #1142033, byte #1155068)
at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
{code}
Quite annoying! Here's a quick fix for SolrWriter that passes the value of the content field to a method that strips away non-characters. I'm not too sure about this implementation, but the tests I've run locally with a huge dataset now pass correctly. Here's a list of codepoints to strip away: http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Noncharacter_Code_Point=True:]
Please comment!
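The stripping described above can be sketched roughly as follows. This is a hypothetical illustration, not the committed patch; the class and method names are my own. It filters the two groups of Unicode non-characters: U+FDD0..U+FDEF, and the last two code points of every plane (U+FFFE/U+FFFF, U+1FFFE/U+1FFFF, ... U+10FFFE/U+10FFFF).

```java
public class NonCharStripper {

    // Remove Unicode non-character code points from a string before
    // handing it to Solr. Iterates by code point so that supplementary
    // characters (surrogate pairs) are handled correctly.
    public static String stripNonCharCodepoints(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            // Non-characters are U+FDD0..U+FDEF, plus any code point
            // whose low 16 bits are 0xFFFE or 0xFFFF (every plane).
            boolean nonChar = (cp >= 0xFDD0 && cp <= 0xFDEF)
                    || ((cp & 0xFFFE) == 0xFFFE);
            if (!nonChar) {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }
}
```

For example, `stripNonCharCodepoints("ab\uFFFFcd")` returns "abcd", which is exactly the 0xffff case from the stack trace above.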
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1026) Strip UTF-8 non-character codepoints
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13272328#comment-13272328 ]
Markus Jelsma commented on NUTCH-1026:
--------------------------------------
Great!
[jira] [Closed] (NUTCH-1026) Strip UTF-8 non-character codepoints
Posted by "Ferdy Galema (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ferdy Galema closed NUTCH-1026.
-------------------------------
Resolution: Fixed
Fix Version/s: (was: 2.1) nutchgora
When indexing a huge dataset I ran into this issue too. The patch in NUTCH-1016 works fine (thanks Markus!). I verified and tested it, and committed it to nutchgora.
Minor note: the patch checks for invalid chars ONLY on the "content" field of the NutchDocument. But since the problem is most likely to occur only on this field, it is okay for now.
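If the content-only scope ever becomes a problem, the same filter could be applied across every string field of the document. A hypothetical sketch follows; the Map stands in for the real NutchDocument API, and the class and method names are my own, not from the committed patch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AllFieldsStripper {

    // Apply the non-character filter to every field value, not just
    // "content". The Map<String, String> here is a stand-in for the
    // actual NutchDocument field structure.
    public static Map<String, String> stripAll(Map<String, String> doc) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : doc.entrySet()) {
            out.put(e.getKey(), stripNonCharCodepoints(e.getValue()));
        }
        return out;
    }

    // Same per-field logic as the content-field fix: drop
    // U+FDD0..U+FDEF and U+nFFFE/U+nFFFF in every plane.
    static String stripNonCharCodepoints(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            if (!((cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE)) {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }
}
```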
[jira] [Commented] (NUTCH-1026) Strip UTF-8 non-character codepoints
Posted by "Hudson (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13273027#comment-13273027 ]
Hudson commented on NUTCH-1026:
-------------------------------
Integrated in Nutch-nutchgora #249 (See [https://builds.apache.org/job/Nutch-nutchgora/249/])
NUTCH-1026 Strip UTF-8 non-character codepoints (Revision 1336643)
Result = SUCCESS
ferdy:
Files :
* /nutch/branches/nutchgora/CHANGES.txt
* /nutch/branches/nutchgora/conf/log4j.properties
* /nutch/branches/nutchgora/src/java/org/apache/nutch/indexer/solr/SolrWriter.java
[jira] [Updated] (NUTCH-1026) Strip UTF-8 non-character codepoints
Posted by "Lewis John McGibbney (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lewis John McGibbney updated NUTCH-1026:
----------------------------------------
Fix Version/s: (was: nutchgora) 2.1
Set and Classify
[jira] [Assigned] (NUTCH-1026) Strip UTF-8 non-character codepoints
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma reassigned NUTCH-1026:
------------------------------------
Assignee: (was: Markus Jelsma)