Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2011/06/27 17:47:47 UTC

[jira] [Created] (NUTCH-1016) Strip UTF-8 non-character codepoints

Strip UTF-8 non-character codepoints
------------------------------------

                 Key: NUTCH-1016
                 URL: https://issues.apache.org/jira/browse/NUTCH-1016
             Project: Nutch
          Issue Type: Bug
          Components: indexer
    Affects Versions: 1.3
            Reporter: Markus Jelsma
            Assignee: Markus Jelsma
             Fix For: 1.4, 2.0
         Attachments: NUTCH-1016-1.4.patch

During a very large crawl I found a few documents producing non-character codepoints. When indexing to Solr this yields the following exception:

{code}
SEVERE: java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xffff at char #1142033, byte #1155068)
        at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
        at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
{code}

Quite annoying! Here's a quick fix for SolrWriter that passes the value of the content field to a method that strips away non-characters. I'm not too sure about this implementation, but the tests I've done locally with a huge dataset now pass. Here's a list of codepoints to strip away: http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Noncharacter_Code_Point=True:]

Please comment!
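
For readers without the patch at hand, here is a minimal sketch of what such a stripping method could look like. It is not the code from the attached patch; the class name NonCharStripper and its method names are hypothetical. It assumes the set to remove is exactly the Unicode non-characters: U+FDD0..U+FDEF plus the last two code points of every plane (U+FFFE, U+FFFF, U+1FFFE, and so on).

```java
final class NonCharStripper {

    // True for the 66 code points that Unicode defines as non-characters:
    // U+FDD0..U+FDEF and the last two code points of each of the 17 planes.
    static boolean isNonCharacter(int codePoint) {
        return (codePoint >= 0xFDD0 && codePoint <= 0xFDEF)
            || (codePoint & 0xFFFE) == 0xFFFE;
    }

    // Returns a copy of the input with all non-character code points removed.
    static String strip(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int codePoint = input.codePointAt(i);
            if (!isNonCharacter(codePoint)) {
                out.appendCodePoint(codePoint);
            }
            // Advance by one or two chars depending on the code point.
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String dirty = "Nutch\uFFFF 1.4";
        System.out.println(strip(dirty)); // prints "Nutch 1.4"
    }
}
```

Iterating by code point rather than by char means the non-characters outside the BMP are matched as well; whether that matters for a given crawl is an open question.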

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Resolved] (NUTCH-1016) Strip UTF-8 non-character codepoints

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma resolved NUTCH-1016.
----------------------------------

    Resolution: Fixed

Committed for 1.4 in rev. 1141500.

> Strip UTF-8 non-character codepoints
> ------------------------------------
>
>                 Key: NUTCH-1016
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1016
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.3
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.4, 2.0
>
>         Attachments: NUTCH-1016-1.4-4.patch

        

[jira] [Updated] (NUTCH-1016) Strip UTF-8 non-character codepoints

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1016:
---------------------------------

    Fix Version/s: 1.4

> Strip UTF-8 non-character codepoints
> ------------------------------------
>
>                 Key: NUTCH-1016
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1016
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.3
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.4, 2.0
>
>         Attachments: NUTCH-1016-1.4-4.patch, NUTCH-1016-2.0.patch

        

[jira] [Updated] (NUTCH-1016) Strip UTF-8 non-character codepoints

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1016:
---------------------------------

    Attachment:     (was: NUTCH-1016-1.4-2.patch)

> Strip UTF-8 non-character codepoints
> ------------------------------------
>
>                 Key: NUTCH-1016
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1016
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.3
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.4, 2.0

        

[jira] [Commented] (NUTCH-1016) Strip UTF-8 non-character codepoints

Posted by "Christian Johnsson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13271344#comment-13271344 ] 

Christian Johnsson commented on NUTCH-1016:
-------------------------------------------

Does this apply to 1.5 RC1? (Stumbled upon the error a couple of times)
                
> Strip UTF-8 non-character codepoints
> ------------------------------------
>
>                 Key: NUTCH-1016
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1016
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.3
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.4
>
>         Attachments: NUTCH-1016-1.4-4.patch, NUTCH-1016-2.0.patch

        

[jira] [Updated] (NUTCH-1016) Strip UTF-8 non-character codepoints

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1016:
---------------------------------

    Fix Version/s:     (was: 1.4)

> Strip UTF-8 non-character codepoints
> ------------------------------------
>
>                 Key: NUTCH-1016
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1016
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.3
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.4, 2.0
>
>         Attachments: NUTCH-1016-1.4-4.patch, NUTCH-1016-2.0.patch

        

[jira] [Updated] (NUTCH-1016) Strip UTF-8 non-character codepoints

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1016:
---------------------------------

    Attachment: NUTCH-1016-1.4.patch

Patch for 1.4.

> Strip UTF-8 non-character codepoints
> ------------------------------------
>
>                 Key: NUTCH-1016
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1016
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.3
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.4, 2.0
>
>         Attachments: NUTCH-1016-1.4.patch

        

[jira] [Reopened] (NUTCH-1016) Strip UTF-8 non-character codepoints

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma reopened NUTCH-1016:
----------------------------------


Accidentally resolved. The issue stays open for 2.0 because the change is untested there.

> Strip UTF-8 non-character codepoints
> ------------------------------------
>
>                 Key: NUTCH-1016
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1016
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.3
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.4, 2.0
>
>         Attachments: NUTCH-1016-1.4-4.patch, NUTCH-1016-2.0.patch

        

[jira] [Updated] (NUTCH-1016) Strip UTF-8 non-character codepoints

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1016:
---------------------------------

    Attachment: NUTCH-1016-1.4-2.patch

Silly me again, the patch was wrong. Changed the ORs to ANDs!

This patch also includes more verbose output from the SolrWriter class. Handy for batches of many thousands of documents. This patch doesn't include a change to log4j.properties, though.

Should I get rid of the logging? Keep it?
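
The OR/AND mix-up is easy to make when a filter is written as a chain of "is not X" tests. The exact conditions in the patch are not shown in this thread, so the sketch below uses hypothetical method names and only the two code points U+FFFE and U+FFFF to illustrate the general pitfall: joined with OR, the keep-condition is true for every character, so nothing gets stripped.

```java
final class KeepCondition {

    // Buggy: no char equals both values at once, so one inequality always holds.
    static boolean keepWithOr(char ch) {
        return ch != 0xFFFF || ch != 0xFFFE;
    }

    // Intended: keep the char only if it differs from every excluded value.
    static boolean keepWithAnd(char ch) {
        return ch != 0xFFFF && ch != 0xFFFE;
    }

    public static void main(String[] args) {
        char bad = '\uFFFF';
        System.out.println(keepWithOr(bad));   // true, so the non-character slips through
        System.out.println(keepWithAnd(bad));  // false, so it is stripped
    }
}
```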

> Strip UTF-8 non-character codepoints
> ------------------------------------
>
>                 Key: NUTCH-1016
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1016
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.3
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.4, 2.0
>
>         Attachments: NUTCH-1016-1.4-2.patch

        

[jira] [Updated] (NUTCH-1016) Strip UTF-8 non-character codepoints

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1016:
---------------------------------

    Attachment: NUTCH-1016-1.4-3.patch

New patch also includes checking for non-printable control characters.
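
A rough idea of what a control-character check can look like is sketched below. Which characters the patch actually keeps is not stated in this thread; the sketch assumes that tab, line feed and carriage return are kept, since those are the only characters below U+0020 that XML 1.0 allows. The class and method names are hypothetical.

```java
final class ControlCharStripper {

    // Keep everything from U+0020 upward, plus tab, line feed and carriage return.
    static boolean isAllowed(char ch) {
        return ch >= 0x20 || ch == '\t' || ch == '\n' || ch == '\r';
    }

    // Returns a copy of the input without the disallowed control characters.
    static String strip(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char ch = input.charAt(i);
            if (isAllowed(ch)) {
                out.append(ch);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The NUL and backspace characters are dropped, the tab is kept.
        System.out.println(strip("a\u0000b\u0008\tc"));
    }
}
```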

> Strip UTF-8 non-character codepoints
> ------------------------------------
>
>                 Key: NUTCH-1016
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1016
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.3
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.4, 2.0

        

[jira] [Updated] (NUTCH-1016) Strip UTF-8 non-character codepoints

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1016:
---------------------------------

    Attachment:     (was: NUTCH-1016-1.4.patch)

> Strip UTF-8 non-character codepoints
> ------------------------------------
>
>                 Key: NUTCH-1016
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1016
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.3
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.4, 2.0

        

[jira] [Updated] (NUTCH-1016) Strip UTF-8 non-character codepoints

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1016:
---------------------------------

    Attachment: NUTCH-1016-2.0.patch

Patch for 2.0.

> Strip UTF-8 non-character codepoints
> ------------------------------------
>
>                 Key: NUTCH-1016
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1016
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.3
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.4, 2.0
>
>         Attachments: NUTCH-1016-1.4-4.patch, NUTCH-1016-2.0.patch

        

[jira] [Updated] (NUTCH-1016) Strip UTF-8 non-character codepoints

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1016:
---------------------------------

    Attachment: NUTCH-1016-1.4-4.patch

The previous patch included a debug line that printed to stdout. It has been removed now.

> Strip UTF-8 non-character codepoints
> ------------------------------------
>
>                 Key: NUTCH-1016
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1016
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.3
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.4, 2.0
>
>         Attachments: NUTCH-1016-1.4-4.patch

        

[jira] [Closed] (NUTCH-1016) Strip UTF-8 non-character codepoints

Posted by "Markus Jelsma (Closed) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma closed NUTCH-1016.
--------------------------------


Bulk close of resolved issues of 1.4. bulkclose-1.4-20111220
                
> Strip UTF-8 non-character codepoints
> ------------------------------------
>
>                 Key: NUTCH-1016
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1016
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.3
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.4
>
>         Attachments: NUTCH-1016-1.4-4.patch, NUTCH-1016-2.0.patch

        

[jira] [Updated] (NUTCH-1016) Strip UTF-8 non-character codepoints

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1016:
---------------------------------

    Attachment:     (was: NUTCH-1016-1.4-3.patch)

> Strip UTF-8 non-character codepoints
> ------------------------------------
>
>                 Key: NUTCH-1016
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1016
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.3
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.4, 2.0

        

[jira] [Commented] (NUTCH-1016) Strip UTF-8 non-character codepoints

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13271351#comment-13271351 ] 

Markus Jelsma commented on NUTCH-1016:
--------------------------------------

It is resolved for Nutch 1.4.
                
> Strip UTF-8 non-character codepoints
> ------------------------------------
>
>                 Key: NUTCH-1016
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1016
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.3
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.4
>
>         Attachments: NUTCH-1016-1.4-4.patch, NUTCH-1016-2.0.patch

        

[jira] [Commented] (NUTCH-1016) Strip UTF-8 non-character codepoints

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13057420#comment-13057420 ] 

Markus Jelsma commented on NUTCH-1016:
--------------------------------------

If there are no objections, I'd like to commit this issue tomorrow.

> Strip UTF-8 non-character codepoints
> ------------------------------------
>
>                 Key: NUTCH-1016
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1016
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.3
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.4, 2.0
>
>         Attachments: NUTCH-1016-1.4-4.patch
>
>
> During a very large crawl i found a few documents producing non-character codepoints. When indexing to Solr this will yield the following exception:
> {code}
> SEVERE: java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xffff at char #1142033, byte #1155068)
>         at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
>         at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
> {code}
> Quite annoying! Here's quick fix for SolrWriter that'll pass the value of the content field to a method to strip away non-characters. I'm not too sure about this implementation but the tests i've done locally with a huge dataset now passes correctly. Here's a list of codepoints to strip away: http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Noncharacter_Code_Point=True:]
> Please comment!
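
For illustration, a minimal sketch of the kind of method the description mentions: one that takes a field value and drops Unicode noncharacters before the document is sent to Solr. It assumes the set to strip is the 66 noncharacters from the linked list (U+FDD0..U+FDEF, plus the last two code points of every plane). The class and method names are hypothetical, and this is not the code from the attached patches.

```java
// Hypothetical helper for illustration only; not taken from the NUTCH-1016 patches.
final class NoncharacterStripper {

    private NoncharacterStripper() {}

    /**
     * Returns true for the 66 Unicode noncharacters:
     * U+FDD0..U+FDEF, and U+nFFFE / U+nFFFF in each of the 17 planes.
     */
    static boolean isNoncharacter(int codePoint) {
        return (codePoint >= 0xFDD0 && codePoint <= 0xFDEF)
            || (codePoint & 0xFFFE) == 0xFFFE;
    }

    /** Returns a copy of the input with every noncharacter removed. */
    static String strip(String input) {
        if (input == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(input.length());
        // Walk by code point so supplementary-plane values are seen whole.
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            if (!isNoncharacter(cp)) {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }
}
```

A caller would pass the content field through `NoncharacterStripper.strip(...)` before adding it to the Solr document. Unpaired surrogates are left untouched in this sketch; whether they also need handling is a separate question.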


        

[jira] [Resolved] (NUTCH-1016) Strip UTF-8 non-character codepoints

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma resolved NUTCH-1016.
----------------------------------

       Resolution: Fixed
    Fix Version/s:     (was: 2.0)

Resolved for 1.4.

> Strip UTF-8 non-character codepoints
> ------------------------------------
>
>                 Key: NUTCH-1016
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1016
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.3
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.4
>
>         Attachments: NUTCH-1016-1.4-4.patch, NUTCH-1016-2.0.patch
>
>
> During a very large crawl i found a few documents producing non-character codepoints. When indexing to Solr this will yield the following exception:
> {code}
> SEVERE: java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xffff at char #1142033, byte #1155068)
>         at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
>         at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
> {code}
> Quite annoying! Here's quick fix for SolrWriter that'll pass the value of the content field to a method to strip away non-characters. I'm not too sure about this implementation but the tests i've done locally with a huge dataset now passes correctly. Here's a list of codepoints to strip away: http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Noncharacter_Code_Point=True:]
> Please comment!


        

[jira] [Commented] (NUTCH-1016) Strip UTF-8 non-character codepoints

Posted by "Christian Johnsson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13271432#comment-13271432 ] 

Christian Johnsson commented on NUTCH-1016:
-------------------------------------------

Seems like the problem was on my end, as usual :-)
I rebooted the entire cluster and replaced the crawldb with yesterday's. Then I retried the same segments that had failed, and now they work automagically.
Thank you for the response and the great work!
                


        

[jira] [Commented] (NUTCH-1016) Strip UTF-8 non-character codepoints

Posted by "Christian Johnsson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13271354#comment-13271354 ] 

Christian Johnsson commented on NUTCH-1016:
-------------------------------------------

OK, I never got this error before with 1.5rc1. It started this morning, after running for one week without errors.
May 9, 2012 1:46:31 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xffff at char #1427640, byte #1564649)
	at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
	at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
	at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
	at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
	at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:301)
	at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:157)
	at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:79)
	at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:58)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:1372)
	at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
	at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
	at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
	at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
	at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
	at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
	at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859)
	at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:602)
	at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
	at java.lang.Thread.run(Thread.java:636)
Caused by: java.io.CharConversionException: Invalid UTF-8 character 0xffff at char #1427640, byte #1564649)
	at com.ctc.wstx.io.UTF8Reader.reportInvalid(UTF8Reader.java:335)
	at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:249)
	at com.ctc.wstx.io.MergedReader.read(MergedReader.java:101)
	at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:84)
	at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:57)
	at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:992)
	at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4628)
	at com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4126)
	at com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3701)
	at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3649)
	... 21 more

and

May 9, 2012 1:46:36 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.RuntimeException: [was class java.io.IOException] Invalid CRLF
	at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
	at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
	at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
	at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
	at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:301)
	at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:157)
	at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:79)
	at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:58)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:1372)
	at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
	at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
	at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
	at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
	at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
	at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
	at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859)
	at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:602)
	at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
	at java.lang.Thread.run(Thread.java:636)
Caused by: java.io.IOException: Invalid CRLF
	at org.apache.coyote.http11.filters.ChunkedInputFilter.parseCRLF(ChunkedInputFilter.java:352)
	at org.apache.coyote.http11.filters.ChunkedInputFilter.doRead(ChunkedInputFilter.java:151)
	at org.apache.coyote.http11.InternalInputBuffer.doRead(InternalInputBuffer.java:710)
	at org.apache.coyote.Request.doRead(Request.java:427)
	at org.apache.catalina.connector.InputBuffer.realReadBytes(InputBuffer.java:304)
	at org.apache.tomcat.util.buf.ByteChunk.substract(ByteChunk.java:419)
	at org.apache.catalina.connector.InputBuffer.read(InputBuffer.java:327)
	at org.apache.catalina.connector.CoyoteInputStream.read(CoyoteInputStream.java:162)
	at com.ctc.wstx.io.UTF8Reader.loadMore(UTF8Reader.java:365)
	at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:110)
	at com.ctc.wstx.io.MergedReader.read(MergedReader.java:101)
	at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:84)
	at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:57)
	at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:992)
	at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4628)
	at com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4126)
	at com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3701)
	at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3649)
	... 21 more
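
To narrow down which document triggers an error like the first trace above, one option is to scan each field value for its first noncharacter before sending it. The following is a minimal sketch under that assumption; the class and method names are hypothetical. The index it returns is a UTF-16 offset into the Java string, which will generally not match the char and byte positions Woodstox reports for the whole XML request.

```java
// Hypothetical diagnostic helper for illustration only.
final class NoncharacterLocator {

    private NoncharacterLocator() {}

    /**
     * Returns the UTF-16 index of the first Unicode noncharacter
     * (U+FDD0..U+FDEF, or U+nFFFE / U+nFFFF in any plane), or -1 if none.
     */
    static int indexOfFirstNoncharacter(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            boolean nonchar = (cp >= 0xFDD0 && cp <= 0xFDEF)
                || (cp & 0xFFFE) == 0xFFFE;
            if (nonchar) {
                return i;
            }
            // Advance by one or two chars depending on the code point.
            i += Character.charCount(cp);
        }
        return -1;
    }
}
```

Logging the document URL whenever this returns a non-negative value would show which crawled pages carry such code points.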


                
