Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2011/06/24 15:47:49 UTC

[jira] [Updated] (NUTCH-1012) Cannot handle illegal charset $charset

     [ https://issues.apache.org/jira/browse/NUTCH-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1012:
---------------------------------

    Description: 
Pages returning:

{code}
Content-Type: text/html; charset=$charset
{code}

cause:

{code}
Error parsing: http://host/: failed(2,200): java.nio.charset.IllegalCharsetNameException: $charset
Found a TextHeaderAtom not followed by a TextBytesAtom or TextCharsAtom: Followed by 3999
ParseSegment: finished at 2011-06-24 01:14:54, elapsed: 00:01:12
{code}

Stack trace:

{code}
2011-06-24 01:14:23,442 WARN  parse.html - java.nio.charset.IllegalCharsetNameException: $charset
2011-06-24 01:14:23,442 WARN  parse.html - at java.nio.charset.Charset.checkName(Charset.java:284)
2011-06-24 01:14:23,442 WARN  parse.html - at java.nio.charset.Charset.lookup2(Charset.java:458)
2011-06-24 01:14:23,442 WARN  parse.html - at java.nio.charset.Charset.lookup(Charset.java:437)
2011-06-24 01:14:23,442 WARN  parse.html - at java.nio.charset.Charset.isSupported(Charset.java:479)
2011-06-24 01:14:23,442 WARN  parse.html - at org.apache.nutch.util.EncodingDetector.resolveEncodingAlias(EncodingDetector.java:310)
2011-06-24 01:14:23,442 WARN  parse.html - at org.apache.nutch.util.EncodingDetector.addClue(EncodingDetector.java:201)
2011-06-24 01:14:23,442 WARN  parse.html - at org.apache.nutch.util.EncodingDetector.addClue(EncodingDetector.java:208)
2011-06-24 01:14:23,442 WARN  parse.html - at org.apache.nutch.util.EncodingDetector.autoDetectClues(EncodingDetector.java:193)
2011-06-24 01:14:23,442 WARN  parse.html - at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:138)
2011-06-24 01:14:23,442 WARN  parse.html - at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
2011-06-24 01:14:23,443 WARN  parse.html - at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
2011-06-24 01:14:23,443 WARN  parse.html - at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
2011-06-24 01:14:23,443 WARN  parse.html - at java.util.concurrent.FutureTask.run(FutureTask.java:138)
2011-06-24 01:14:23,443 WARN  parse.html - at java.lang.Thread.run(Thread.java:662)
2011-06-24 01:14:23,443 WARN  parse.ParseSegment - Error parsing: http://host/: failed(2,200): java.nio.charset.IllegalCharsetNameException: $charset
{code}



  was:
Pages returning:

{code}
Content-Type: text/html; charset=$charset
{code}

cause:

{code}
Error parsing: http://host/: failed(2,200): java.nio.charset.IllegalCharsetNameException: $charset
Found a TextHeaderAtom not followed by a TextBytesAtom or TextCharsAtom: Followed by 3999
ParseSegment: finished at 2011-06-24 01:14:54, elapsed: 00:01:12
{code}

Stack trace:

{code}
2011-06-24 01:14:23,442 WARN  parse.html - java.nio.charset.IllegalCharsetNameException: $charset
2011-06-24 01:14:23,442 WARN  parse.html - at java.nio.charset.Charset.checkName(Charset.java:284)
2011-06-24 01:14:23,442 WARN  parse.html - at java.nio.charset.Charset.lookup2(Charset.java:458)
2011-06-24 01:14:23,442 WARN  parse.html - at java.nio.charset.Charset.lookup(Charset.java:437)
2011-06-24 01:14:23,442 WARN  parse.html - at java.nio.charset.Charset.isSupported(Charset.java:479)
2011-06-24 01:14:23,442 WARN  parse.html - at org.apache.nutch.util.EncodingDetector.resolveEncodingAlias(EncodingDetector.java:310)
2011-06-24 01:14:23,442 WARN  parse.html - at org.apache.nutch.util.EncodingDetector.addClue(EncodingDetector.java:201)
2011-06-24 01:14:23,442 WARN  parse.html - at org.apache.nutch.util.EncodingDetector.addClue(EncodingDetector.java:208)
2011-06-24 01:14:23,442 WARN  parse.html - at org.apache.nutch.util.EncodingDetector.autoDetectClues(EncodingDetector.java:193)
2011-06-24 01:14:23,442 WARN  parse.html - at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:138)
2011-06-24 01:14:23,442 WARN  parse.html - at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
2011-06-24 01:14:23,443 WARN  parse.html - at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
2011-06-24 01:14:23,443 WARN  parse.html - at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
2011-06-24 01:14:23,443 WARN  parse.html - at java.util.concurrent.FutureTask.run(FutureTask.java:138)
2011-06-24 01:14:23,443 WARN  parse.html - at java.lang.Thread.run(Thread.java:662)
2011-06-24 01:14:23,443 WARN  parse.ParseSegment - Error parsing: http://www.sanitair-online.nl/: failed(2,200): java.nio.charset.IllegalCharsetNameException: $charset
{code}




> Cannot handle illegal charset $charset
> --------------------------------------
>
>                 Key: NUTCH-1012
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1012
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.3
>            Reporter: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.4
>
>
> Pages returning:
> {code}
> Content-Type: text/html; charset=$charset
> {code}
> cause:
> {code}
> Error parsing: http://host/: failed(2,200): java.nio.charset.IllegalCharsetNameException: $charset
> Found a TextHeaderAtom not followed by a TextBytesAtom or TextCharsAtom: Followed by 3999
> ParseSegment: finished at 2011-06-24 01:14:54, elapsed: 00:01:12
> {code}
> Stack trace:
> {code}
> 2011-06-24 01:14:23,442 WARN  parse.html - java.nio.charset.IllegalCharsetNameException: $charset
> 2011-06-24 01:14:23,442 WARN  parse.html - at java.nio.charset.Charset.checkName(Charset.java:284)
> 2011-06-24 01:14:23,442 WARN  parse.html - at java.nio.charset.Charset.lookup2(Charset.java:458)
> 2011-06-24 01:14:23,442 WARN  parse.html - at java.nio.charset.Charset.lookup(Charset.java:437)
> 2011-06-24 01:14:23,442 WARN  parse.html - at java.nio.charset.Charset.isSupported(Charset.java:479)
> 2011-06-24 01:14:23,442 WARN  parse.html - at org.apache.nutch.util.EncodingDetector.resolveEncodingAlias(EncodingDetector.java:310)
> 2011-06-24 01:14:23,442 WARN  parse.html - at org.apache.nutch.util.EncodingDetector.addClue(EncodingDetector.java:201)
> 2011-06-24 01:14:23,442 WARN  parse.html - at org.apache.nutch.util.EncodingDetector.addClue(EncodingDetector.java:208)
> 2011-06-24 01:14:23,442 WARN  parse.html - at org.apache.nutch.util.EncodingDetector.autoDetectClues(EncodingDetector.java:193)
> 2011-06-24 01:14:23,442 WARN  parse.html - at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:138)
> 2011-06-24 01:14:23,442 WARN  parse.html - at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
> 2011-06-24 01:14:23,443 WARN  parse.html - at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
> 2011-06-24 01:14:23,443 WARN  parse.html - at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> 2011-06-24 01:14:23,443 WARN  parse.html - at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> 2011-06-24 01:14:23,443 WARN  parse.html - at java.lang.Thread.run(Thread.java:662)
> 2011-06-24 01:14:23,443 WARN  parse.ParseSegment - Error parsing: http://host/: failed(2,200): java.nio.charset.IllegalCharsetNameException: $charset
> {code}
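
For illustration only: the trace shows Charset.isSupported throwing from EncodingDetector.resolveEncodingAlias, which is what the JDK does for a syntactically illegal name such as "$charset" (as opposed to returning false for a legal but unknown name). A minimal sketch of a guard follows, assuming the desired behaviour is to ignore such a clue rather than fail the parse. The class CharsetGuard and the method safeCharsetName are hypothetical names, not Nutch code, and this is not necessarily how the issue was or will be fixed.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical helper, not part of Nutch: shows one way to tolerate
// illegal charset names taken from a Content-Type header.
class CharsetGuard {

  /**
   * Returns the canonical name of the charset, or null if the name is
   * null, syntactically illegal (e.g. "$charset"), or not supported.
   */
  static String safeCharsetName(String name) {
    if (name == null) {
      // Charset.isSupported(null) would throw IllegalArgumentException.
      return null;
    }
    try {
      // isSupported returns false for legal but unknown names and
      // throws IllegalCharsetNameException for illegal ones.
      return Charset.isSupported(name) ? Charset.forName(name).name() : null;
    } catch (IllegalCharsetNameException e) {
      // Treat an illegal name as "no usable clue" instead of failing.
      return null;
    }
  }

  public static void main(String[] args) {
    System.out.println(safeCharsetName("$charset"));        // prints "null"
    System.out.println(safeCharsetName("utf-8"));           // prints "UTF-8"
    System.out.println(safeCharsetName("no-such-charset")); // prints "null"
  }
}
```

Returning null here is a design assumption: the caller would then fall back to its other encoding clues or a default, rather than aborting the parse of an otherwise readable page.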
