Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2011/06/27 17:47:47 UTC
[jira] [Created] (NUTCH-1016) Strip UTF-8 non-character codepoints
Strip UTF-8 non-character codepoints
------------------------------------
Key: NUTCH-1016
URL: https://issues.apache.org/jira/browse/NUTCH-1016
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: 1.3
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 1.4, 2.0
Attachments: NUTCH-1016-1.4.patch
During a very large crawl I found a few documents producing non-character codepoints. When indexing to Solr, this yields the following exception:
{code}
SEVERE: java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xffff at char #1142033, byte #1155068)
at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
{code}
Quite annoying! Here's a quick fix for SolrWriter that passes the value of the content field to a method that strips away non-characters. I'm not too sure about this implementation, but the tests I've done locally with a huge dataset now pass. Here's a list of codepoints to strip away: http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Noncharacter_Code_Point=True:]
Please comment!
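For illustration, a minimal Java sketch of such a stripping method follows. It is not the attached patch: the class and method names are hypothetical, and it assumes the set to remove is exactly the 66 Unicode noncharacters (U+FDD0..U+FDEF plus the last two code points of every plane), which is what the linked list describes.

```java
// Hypothetical helper, not the code from NUTCH-1016-1.4.patch.
final class NonCharacterStripper {

  private NonCharacterStripper() {}

  // Assumption: "non-character" means the 66 Unicode noncharacters,
  // i.e. U+FDD0..U+FDEF and every code point ending in FFFE or FFFF.
  static boolean isNonCharacter(int codePoint) {
    return (codePoint >= 0xFDD0 && codePoint <= 0xFDEF)
        || (codePoint & 0xFFFE) == 0xFFFE;
  }

  // Returns a copy of value with all noncharacters removed.
  // Iterates by code point so supplementary noncharacters such as
  // U+1FFFE (a surrogate pair in a Java String) are also caught.
  static String strip(String value) {
    StringBuilder out = new StringBuilder(value.length());
    for (int i = 0; i < value.length(); ) {
      int codePoint = value.codePointAt(i);
      i += Character.charCount(codePoint);
      if (!isNonCharacter(codePoint)) {
        out.appendCodePoint(codePoint);
      }
    }
    return out.toString();
  }
}
```

A caller such as SolrWriter could pass the content field value through `NonCharacterStripper.strip(...)` before adding it to the document. How the actual patch wires this in is not shown in this thread.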
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (NUTCH-1016) Strip UTF-8 non-character codepoints
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma resolved NUTCH-1016.
----------------------------------
Resolution: Fixed
Committed for 1.4 in rev. 1141500.
[jira] [Updated] (NUTCH-1016) Strip UTF-8 non-character codepoints
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated NUTCH-1016:
---------------------------------
Fix Version/s: 1.4
[jira] [Updated] (NUTCH-1016) Strip UTF-8 non-character codepoints
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated NUTCH-1016:
---------------------------------
Attachment: (was: NUTCH-1016-1.4-2.patch)
[jira] [Commented] (NUTCH-1016) Strip UTF-8 non-character codepoints
Posted by "Christian Johnsson (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13271344#comment-13271344 ]
Christian Johnsson commented on NUTCH-1016:
-------------------------------------------
Does this apply to 1.5 RC1? (Stumbled upon the error a couple of times)
[jira] [Updated] (NUTCH-1016) Strip UTF-8 non-character codepoints
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated NUTCH-1016:
---------------------------------
Fix Version/s: (was: 1.4)
[jira] [Updated] (NUTCH-1016) Strip UTF-8 non-character codepoints
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated NUTCH-1016:
---------------------------------
Attachment: NUTCH-1016-1.4.patch
Patch for 1.4.
[jira] [Reopened] (NUTCH-1016) Strip UTF-8 non-character codepoints
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma reopened NUTCH-1016:
----------------------------------
Accidentally resolved. The issue stays open for 2.0 because the change is untested there.
[jira] [Updated] (NUTCH-1016) Strip UTF-8 non-character codepoints
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated NUTCH-1016:
---------------------------------
Attachment: NUTCH-1016-1.4-2.patch
Silly me again, the patch was wrong. Changed the ORs to ANDs!
This patch also includes more verbose output from the SolrWriter class. Handy for batches of many thousands of documents. This patch doesn't include a change to log4j.properties, though.
Should I get rid of the logging? Keep it?
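The patch itself is not reproduced in this thread, so the following is only a guess at the kind of condition involved. It is a hypothetical Java sketch (all names invented for illustration) showing why a keep-test built from inequalities must join them with AND rather than OR.

```java
// Hypothetical illustration only; not taken from the attached patches.
final class KeepCondition {

  private KeepCondition() {}

  // Buggy form: no char can equal both 0xFFFE and 0xFFFF at once,
  // so at least one inequality is always true and nothing is filtered.
  static boolean keepWithOr(char c) {
    return c != 0xFFFE || c != 0xFFFF;
  }

  // Corrected form: the char is kept only if it differs from
  // every value on the strip list.
  static boolean keepWithAnd(char c) {
    return c != 0xFFFE && c != 0xFFFF;
  }
}
```

With the OR form, U+FFFF would still reach Solr and trigger the exception from the issue description. With the AND form it is rejected, while ordinary characters are still kept.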
[jira] [Updated] (NUTCH-1016) Strip UTF-8 non-character codepoints
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated NUTCH-1016:
---------------------------------
Attachment: NUTCH-1016-1.4-3.patch
New patch also includes checking for non-printable control characters.
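Which control characters the patch checks for is not stated here. As a rough sketch under that uncertainty, the hypothetical Java helper below drops the C0 control characters below U+0020 but keeps tab, line feed and carriage return, the only three that XML 1.0 allows in character data.

```java
// Hypothetical helper; the actual set of characters handled by
// NUTCH-1016-1.4-3.patch is an assumption, not known from this thread.
final class ControlCharStripper {

  private ControlCharStripper() {}

  // Assumption: strip C0 controls (below U+0020) except the three
  // whitespace controls that XML 1.0 permits: tab, LF and CR.
  static boolean isStrippedControl(char c) {
    return c < 0x20 && c != '\t' && c != '\n' && c != '\r';
  }

  // Returns a copy of value without the control characters above.
  // Working on chars is sufficient because all affected code points
  // are in the Basic Multilingual Plane.
  static String strip(String value) {
    StringBuilder out = new StringBuilder(value.length());
    for (int i = 0; i < value.length(); i++) {
      char c = value.charAt(i);
      if (!isStrippedControl(c)) {
        out.append(c);
      }
    }
    return out.toString();
  }
}
```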
[jira] [Updated] (NUTCH-1016) Strip UTF-8 non-character codepoints
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated NUTCH-1016:
---------------------------------
Attachment: (was: NUTCH-1016-1.4.patch)
[jira] [Updated] (NUTCH-1016) Strip UTF-8 non-character codepoints
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated NUTCH-1016:
---------------------------------
Attachment: NUTCH-1016-2.0.patch
Patch for 2.0.
[jira] [Updated] (NUTCH-1016) Strip UTF-8 non-character codepoints
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated NUTCH-1016:
---------------------------------
Attachment: NUTCH-1016-1.4-4.patch
The previous patch included a debug line printing to stdout. It has been removed now.
[jira] [Closed] (NUTCH-1016) Strip UTF-8 non-character codepoints
Posted by "Markus Jelsma (Closed) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma closed NUTCH-1016.
--------------------------------
Bulk close of resolved issues of 1.4. bulkclose-1.4-20111220
[jira] [Updated] (NUTCH-1016) Strip UTF-8 non-character codepoints
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated NUTCH-1016:
---------------------------------
Attachment: (was: NUTCH-1016-1.4-3.patch)
[jira] [Commented] (NUTCH-1016) Strip UTF-8 non-character codepoints
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13271351#comment-13271351 ]
Markus Jelsma commented on NUTCH-1016:
--------------------------------------
It is resolved for Nutch 1.4.
[jira] [Commented] (NUTCH-1016) Strip UTF-8 non-character codepoints
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13057420#comment-13057420 ]
Markus Jelsma commented on NUTCH-1016:
--------------------------------------
If there are no objections, I'd like to commit this issue tomorrow.
[jira] [Resolved] (NUTCH-1016) Strip UTF-8 non-character codepoints
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma resolved NUTCH-1016.
----------------------------------
Resolution: Fixed
Fix Version/s: (was: 2.0)
Resolved for 1.4.
[jira] [Commented] (NUTCH-1016) Strip UTF-8 non-character codepoints
Posted by "Christian Johnsson (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13271432#comment-13271432 ]
Christian Johnsson commented on NUTCH-1016:
-------------------------------------------
Seems like it's my end as usual :-)
Rebooted the entire cluster and replaced the crawldb with yesterday's. Then tried the same segments that didn't work before, and now they work automagically.
Thank you for the response and the great work!
[jira] [Commented] (NUTCH-1016) Strip UTF-8 non-character codepoints
Posted by "Christian Johnsson (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13271354#comment-13271354 ]
Christian Johnsson commented on NUTCH-1016:
-------------------------------------------
OK, I never got this error before with 1.5rc1. It started this morning, after running for a week without errors.
{code}
May 9, 2012 1:46:31 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xffff at char #1427640, byte #1564649)
at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:301)
at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:157)
at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:79)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:58)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1372)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859)
at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:602)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
at java.lang.Thread.run(Thread.java:636)
Caused by: java.io.CharConversionException: Invalid UTF-8 character 0xffff at char #1427640, byte #1564649)
at com.ctc.wstx.io.UTF8Reader.reportInvalid(UTF8Reader.java:335)
at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:249)
at com.ctc.wstx.io.MergedReader.read(MergedReader.java:101)
at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:84)
at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:57)
at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:992)
at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4628)
at com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4126)
at com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3701)
at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3649)
... 21 more
{code}
and
{code}
May 9, 2012 1:46:36 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.RuntimeException: [was class java.io.IOException] Invalid CRLF
at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:301)
at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:157)
at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:79)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:58)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1372)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859)
at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:602)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
at java.lang.Thread.run(Thread.java:636)
Caused by: java.io.IOException: Invalid CRLF
at org.apache.coyote.http11.filters.ChunkedInputFilter.parseCRLF(ChunkedInputFilter.java:352)
at org.apache.coyote.http11.filters.ChunkedInputFilter.doRead(ChunkedInputFilter.java:151)
at org.apache.coyote.http11.InternalInputBuffer.doRead(InternalInputBuffer.java:710)
at org.apache.coyote.Request.doRead(Request.java:427)
at org.apache.catalina.connector.InputBuffer.realReadBytes(InputBuffer.java:304)
at org.apache.tomcat.util.buf.ByteChunk.substract(ByteChunk.java:419)
at org.apache.catalina.connector.InputBuffer.read(InputBuffer.java:327)
at org.apache.catalina.connector.CoyoteInputStream.read(CoyoteInputStream.java:162)
at com.ctc.wstx.io.UTF8Reader.loadMore(UTF8Reader.java:365)
at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:110)
at com.ctc.wstx.io.MergedReader.read(MergedReader.java:101)
at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:84)
at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:57)
at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:992)
at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4628)
at com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4126)
at com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3701)
at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3649)
... 21 more
{code}
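The first trace above reports the offending codepoint with both a character offset and a byte offset; the two differ because multi-byte UTF-8 characters precede it in the stream. When hunting for the document that triggers such an error, it may help to scan field values before sending them. The following is only a sketch under that assumption: `NonCharacterLocator` and its methods are hypothetical names, and the offsets it returns are relative to the string passed in, not to the positions in the Solr log, which count from the start of the XML request body.

```java
final class NonCharacterLocator {

    // Hypothetical helper: true for U+FDD0..U+FDEF and U+nFFFE / U+nFFFF in every plane.
    static boolean isNonCharacter(int codePoint) {
        return (codePoint >= 0xFDD0 && codePoint <= 0xFDEF)
            || (codePoint & 0xFFFE) == 0xFFFE;
    }

    // Number of bytes a code point occupies in UTF-8.
    // Assumes well-formed text (no unpaired surrogates).
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;
        if (codePoint < 0x800) return 2;
        if (codePoint < 0x10000) return 3;
        return 4;
    }

    // Returns {charOffset, utf8ByteOffset} of the first noncharacter,
    // or null if the text contains none.
    static long[] findFirst(String text) {
        long byteOffset = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isNonCharacter(cp)) {
                return new long[] { i, byteOffset };
            }
            byteOffset += utf8Length(cp);
            i += Character.charCount(cp);
        }
        return null;
    }

    public static void main(String[] args) {
        // "é" is one char but two UTF-8 bytes, so the offsets diverge.
        long[] hit = findFirst("\u00E9\uFFFF");
        System.out.println("char #" + hit[0] + ", byte #" + hit[1]); // prints "char #1, byte #2"
    }
}
```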