Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2011/06/24 01:30:47 UTC
[jira] [Created] (NUTCH-1012) Cannot handle illegal charset $charset
Cannot handle illegal charset $charset
--------------------------------------
Key: NUTCH-1012
URL: https://issues.apache.org/jira/browse/NUTCH-1012
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 1.3
Reporter: Markus Jelsma
Priority: Minor
Fix For: 1.4
Pages returning:
{code}
Content-Type: text/html; charset=$charset
{code}
cause:
{code}
Error parsing: http://host/: failed(2,200): java.nio.charset.IllegalCharsetNameException: $charset
Found a TextHeaderAtom not followed by a TextBytesAtom or TextCharsAtom: Followed by 3999
ParseSegment: finished at 2011-06-24 01:14:54, elapsed: 00:01:12
{code}
Stack trace:
{code}
2011-06-24 01:14:23,442 WARN parse.html - java.nio.charset.IllegalCharsetNameException: $charset
2011-06-24 01:14:23,442 WARN parse.html - at java.nio.charset.Charset.checkName(Charset.java:284)
2011-06-24 01:14:23,442 WARN parse.html - at java.nio.charset.Charset.lookup2(Charset.java:458)
2011-06-24 01:14:23,442 WARN parse.html - at java.nio.charset.Charset.lookup(Charset.java:437)
2011-06-24 01:14:23,442 WARN parse.html - at java.nio.charset.Charset.isSupported(Charset.java:479)
2011-06-24 01:14:23,442 WARN parse.html - at org.apache.nutch.util.EncodingDetector.resolveEncodingAlias(EncodingDetector.java:310)
2011-06-24 01:14:23,442 WARN parse.html - at org.apache.nutch.util.EncodingDetector.addClue(EncodingDetector.java:201)
2011-06-24 01:14:23,442 WARN parse.html - at org.apache.nutch.util.EncodingDetector.addClue(EncodingDetector.java:208)
2011-06-24 01:14:23,442 WARN parse.html - at org.apache.nutch.util.EncodingDetector.autoDetectClues(EncodingDetector.java:193)
2011-06-24 01:14:23,442 WARN parse.html - at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:138)
2011-06-24 01:14:23,442 WARN parse.html - at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
2011-06-24 01:14:23,443 WARN parse.html - at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
2011-06-24 01:14:23,443 WARN parse.html - at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
2011-06-24 01:14:23,443 WARN parse.html - at java.util.concurrent.FutureTask.run(FutureTask.java:138)
2011-06-24 01:14:23,443 WARN parse.html - at java.lang.Thread.run(Thread.java:662)
2011-06-24 01:14:23,443 WARN parse.ParseSegment - Error parsing: http://www.sanitair-online.nl/: failed(2,200): java.nio.charset.IllegalCharsetNameException: $charset
{code}
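The top frames of the trace can be reproduced outside Nutch. The sketch below is only an illustration, not Nutch code: the class and method names are made up for this example. It relies on the documented behaviour of java.nio.charset.Charset.isSupported, which throws IllegalCharsetNameException for a syntactically illegal name such as "$charset" instead of returning false.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical helper class, used only to demonstrate the JDK behaviour.
class IllegalCharsetRepro {

    /** Returns true if the JDK rejects the name as syntactically illegal. */
    static boolean isIllegalName(String name) {
        try {
            // For a legal name this returns true or false without throwing.
            Charset.isSupported(name);
            return false;
        } catch (IllegalCharsetNameException e) {
            // Thrown when the name contains characters such as '$'.
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("$charset illegal? " + isIllegalName("$charset"));
        System.out.println("UTF-8 illegal?    " + isIllegalName("UTF-8"));
    }
}
```

So a caller that treats isSupported as a plain boolean check, as the trace suggests resolveEncodingAlias does, will see the exception propagate up to the parser.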
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (NUTCH-1012) Cannot handle illegal charset $charset
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma resolved NUTCH-1012.
----------------------------------
Resolution: Fixed
Assignee: Markus Jelsma
Committed for 1.4 in rev. 1140695 and for trunk in rev. 1140696.
[jira] [Updated] (NUTCH-1012) Cannot handle illegal charset $charset
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated NUTCH-1012:
---------------------------------
Patch Info: [Patch Available]
[jira] [Commented] (NUTCH-1012) Cannot handle illegal charset $charset
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055479#comment-13055479 ]
Markus Jelsma commented on NUTCH-1012:
--------------------------------------
Objections? I'd like to send this one in.
[jira] [Commented] (NUTCH-1012) Cannot handle illegal charset $charset
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054487#comment-13054487 ]
Markus Jelsma commented on NUTCH-1012:
--------------------------------------
Hi Ken,
I've tested it more thoroughly and it seems this issue is limited to the parse-html plugin. Trying the ParserChecker with parse-tika enabled for this doctype results in a clean parse. I also found that the error only occurs when the invalid charset is found directly in the HTTP header; a <meta http-equiv="Content-Type" content="text/html; charset=$charset" /> in the document causes no error.
[jira] [Updated] (NUTCH-1012) Cannot handle illegal charset $charset
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated NUTCH-1012:
---------------------------------
Attachment: NUTCH-1012-1.4.patch
Patch for 1.4 that wraps the code in EncodingDetector in a try/catch block and returns null on failure, thus falling back to the default encoding. It won't use a possible meta Content-Type tag.
Not ideal, but the page will at least end up in the index.
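For readers without the attachment, the idea can be sketched roughly as follows. This is not the contents of NUTCH-1012-1.4.patch: the class name, the method names, and the windows-1252 default are assumptions made for illustration. It only shows the try/catch-and-return-null approach described above.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical stand-in for the alias resolution step; not the real EncodingDetector.
class SafeEncodingAlias {

    // Assumed fallback; the real default comes from the Nutch configuration.
    static final String DEFAULT_ENCODING = "windows-1252";

    /** Returns the canonical charset name, or null if the name is unknown or illegal. */
    static String resolveOrNull(String encoding) {
        try {
            if (encoding == null || !Charset.isSupported(encoding)) {
                return null;
            }
            return Charset.forName(encoding).name();
        } catch (IllegalCharsetNameException e) {
            // An illegal name such as "$charset" is treated like an unknown one.
            return null;
        }
    }

    /** Applies the fallback a caller would use when no clue could be resolved. */
    static String encodingOrDefault(String encoding) {
        String resolved = resolveOrNull(encoding);
        return resolved != null ? resolved : DEFAULT_ENCODING;
    }

    public static void main(String[] args) {
        System.out.println(encodingOrDefault("$charset"));
        System.out.println(encodingOrDefault("utf-8"));
    }
}
```

The trade-off is the one noted above: a bad header value is dropped silently and the default wins, even if the document itself declares a usable charset.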
[jira] [Issue Comment Edited] (NUTCH-1012) Cannot handle illegal charset $charset
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054518#comment-13054518 ]
Markus Jelsma edited comment on NUTCH-1012 at 6/24/11 4:12 PM:
---------------------------------------------------------------
Patch for 1.4 that wraps the code in EncodingDetector in a try/catch block and returns null on failure, thus falling back to the default encoding. It won't use a possible meta Content-Type tag. Not ideal, but the page will at least end up in the index.
Patch compiles for trunk as well.
was (Author: markus17):
Patch for 1.4 wrapping the code in EncodingDetector in a try/catch block, returning null on failure thus falling back to default. It won't use a possible meta content type tag.
Not ideal but the page will at least end up in the index.
[jira] [Updated] (NUTCH-1012) Cannot handle illegal charset $charset
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated NUTCH-1012:
---------------------------------
Fix Version/s: 2.0
[jira] [Commented] (NUTCH-1012) Cannot handle illegal charset $charset
Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054475#comment-13054475 ]
Ken Krugler commented on NUTCH-1012:
------------------------------------
Tika has code to try to resolve charset names (and handle common error cases) in a graceful manner. Nutch might want to use this code, or we could add a general wrapper to crawler-commons. See CharsetUtils in Tika.
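As a rough idea of what such graceful handling could look like, here is a JDK-only sketch. It is not Tika's CharsetUtils and makes no claim about how that class works; the class name, the method name, and the cleanup steps (trimming and stripping one pair of quotes) are assumptions chosen for illustration. The name is checked against the legal-name syntax documented for java.nio.charset.Charset before any lookup, so no exception can escape.

```java
import java.nio.charset.Charset;
import java.util.regex.Pattern;

// Hypothetical wrapper; a real one might live in Nutch or crawler-commons.
class LenientCharsetNames {

    // Legal names per the Charset Javadoc: letters, digits, '-', '+', '.', ':', '_',
    // starting with a letter or digit.
    private static final Pattern LEGAL =
            Pattern.compile("[A-Za-z0-9][A-Za-z0-9\\-+.:_]*");

    /** Returns the canonical charset name, or null if the input cannot be used. */
    static String clean(String raw) {
        if (raw == null) {
            return null;
        }
        String name = raw.trim();
        // Strip one pair of surrounding quotes, e.g. charset="utf-8".
        boolean doubleQuoted = name.startsWith("\"") && name.endsWith("\"");
        boolean singleQuoted = name.startsWith("'") && name.endsWith("'");
        if (name.length() >= 2 && (doubleQuoted || singleQuoted)) {
            name = name.substring(1, name.length() - 1).trim();
        }
        // Reject illegal names up front so Charset never throws.
        if (!LEGAL.matcher(name).matches()) {
            return null;
        }
        return Charset.isSupported(name) ? Charset.forName(name).name() : null;
    }

    public static void main(String[] args) {
        System.out.println(clean("\"utf-8\""));
        System.out.println(clean("$charset"));
    }
}
```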
[jira] [Updated] (NUTCH-1012) Cannot handle illegal charset $charset
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated NUTCH-1012:
---------------------------------
Description:
Pages returning:
{code}
Content-Type: text/html; charset=$charset
{code}
cause:
{code}
Error parsing: http://host/: failed(2,200): java.nio.charset.IllegalCharsetNameException: $charset
Found a TextHeaderAtom not followed by a TextBytesAtom or TextCharsAtom: Followed by 3999
ParseSegment: finished at 2011-06-24 01:14:54, elapsed: 00:01:12
{code}
Stack trace:
{code}
2011-06-24 01:14:23,442 WARN parse.html - java.nio.charset.IllegalCharsetNameException: $charset
2011-06-24 01:14:23,442 WARN parse.html - at java.nio.charset.Charset.checkName(Charset.java:284)
2011-06-24 01:14:23,442 WARN parse.html - at java.nio.charset.Charset.lookup2(Charset.java:458)
2011-06-24 01:14:23,442 WARN parse.html - at java.nio.charset.Charset.lookup(Charset.java:437)
2011-06-24 01:14:23,442 WARN parse.html - at java.nio.charset.Charset.isSupported(Charset.java:479)
2011-06-24 01:14:23,442 WARN parse.html - at org.apache.nutch.util.EncodingDetector.resolveEncodingAlias(EncodingDetector.java:310)
2011-06-24 01:14:23,442 WARN parse.html - at org.apache.nutch.util.EncodingDetector.addClue(EncodingDetector.java:201)
2011-06-24 01:14:23,442 WARN parse.html - at org.apache.nutch.util.EncodingDetector.addClue(EncodingDetector.java:208)
2011-06-24 01:14:23,442 WARN parse.html - at org.apache.nutch.util.EncodingDetector.autoDetectClues(EncodingDetector.java:193)
2011-06-24 01:14:23,442 WARN parse.html - at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:138)
2011-06-24 01:14:23,442 WARN parse.html - at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
2011-06-24 01:14:23,443 WARN parse.html - at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
2011-06-24 01:14:23,443 WARN parse.html - at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
2011-06-24 01:14:23,443 WARN parse.html - at java.util.concurrent.FutureTask.run(FutureTask.java:138)
2011-06-24 01:14:23,443 WARN parse.html - at java.lang.Thread.run(Thread.java:662)
2011-06-24 01:14:23,443 WARN parse.ParseSegment - Error parsing: http://host/: failed(2,200): java.nio.charset.IllegalCharsetNameException: $charset
{code}
was:
Pages returning:
{code}
Content-Type: text/html; charset=$charset
{code}
cause:
{code}
Error parsing: http://host/: failed(2,200): java.nio.charset.IllegalCharsetNameException: $charset
Found a TextHeaderAtom not followed by a TextBytesAtom or TextCharsAtom: Followed by 3999
ParseSegment: finished at 2011-06-24 01:14:54, elapsed: 00:01:12
{code}
Stack trace:
{code}
2011-06-24 01:14:23,442 WARN parse.html - java.nio.charset.IllegalCharsetNameException: $charset
2011-06-24 01:14:23,442 WARN parse.html - at java.nio.charset.Charset.checkName(Charset.java:284)
2011-06-24 01:14:23,442 WARN parse.html - at java.nio.charset.Charset.lookup2(Charset.java:458)
2011-06-24 01:14:23,442 WARN parse.html - at java.nio.charset.Charset.lookup(Charset.java:437)
2011-06-24 01:14:23,442 WARN parse.html - at java.nio.charset.Charset.isSupported(Charset.java:479)
2011-06-24 01:14:23,442 WARN parse.html - at org.apache.nutch.util.EncodingDetector.resolveEncodingAlias(EncodingDetector.java:310)
2011-06-24 01:14:23,442 WARN parse.html - at org.apache.nutch.util.EncodingDetector.addClue(EncodingDetector.java:201)
2011-06-24 01:14:23,442 WARN parse.html - at org.apache.nutch.util.EncodingDetector.addClue(EncodingDetector.java:208)
2011-06-24 01:14:23,442 WARN parse.html - at org.apache.nutch.util.EncodingDetector.autoDetectClues(EncodingDetector.java:193)
2011-06-24 01:14:23,442 WARN parse.html - at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:138)
2011-06-24 01:14:23,442 WARN parse.html - at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
2011-06-24 01:14:23,443 WARN parse.html - at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
2011-06-24 01:14:23,443 WARN parse.html - at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
2011-06-24 01:14:23,443 WARN parse.html - at java.util.concurrent.FutureTask.run(FutureTask.java:138)
2011-06-24 01:14:23,443 WARN parse.html - at java.lang.Thread.run(Thread.java:662)
2011-06-24 01:14:23,443 WARN parse.ParseSegment - Error parsing: http://www.sanitair-online.nl/: failed(2,200): java.nio.charset.IllegalCharsetNameException: $charset
{code}
> Cannot handle illegal charset $charset
> --------------------------------------
>
> Key: NUTCH-1012
> URL: https://issues.apache.org/jira/browse/NUTCH-1012
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.3
> Reporter: Markus Jelsma
> Priority: Minor
> Fix For: 1.4
>
>
> Pages returning:
> {code}
> Content-Type: text/html; charset=$charset
> {code}
> cause:
> {code}
> Error parsing: http://host/: failed(2,200): java.nio.charset.IllegalCharsetNameException: $charset
> Found a TextHeaderAtom not followed by a TextBytesAtom or TextCharsAtom: Followed by 3999
> ParseSegment: finished at 2011-06-24 01:14:54, elapsed: 00:01:12
> {code}
> Stack trace:
> {code}
> 2011-06-24 01:14:23,442 WARN parse.html - java.nio.charset.IllegalCharsetNameException: $charset
> 2011-06-24 01:14:23,442 WARN parse.html - at java.nio.charset.Charset.checkName(Charset.java:284)
> 2011-06-24 01:14:23,442 WARN parse.html - at java.nio.charset.Charset.lookup2(Charset.java:458)
> 2011-06-24 01:14:23,442 WARN parse.html - at java.nio.charset.Charset.lookup(Charset.java:437)
> 2011-06-24 01:14:23,442 WARN parse.html - at java.nio.charset.Charset.isSupported(Charset.java:479)
> 2011-06-24 01:14:23,442 WARN parse.html - at org.apache.nutch.util.EncodingDetector.resolveEncodingAlias(EncodingDetector.java:310)
> 2011-06-24 01:14:23,442 WARN parse.html - at org.apache.nutch.util.EncodingDetector.addClue(EncodingDetector.java:201)
> 2011-06-24 01:14:23,442 WARN parse.html - at org.apache.nutch.util.EncodingDetector.addClue(EncodingDetector.java:208)
> 2011-06-24 01:14:23,442 WARN parse.html - at org.apache.nutch.util.EncodingDetector.autoDetectClues(EncodingDetector.java:193)
> 2011-06-24 01:14:23,442 WARN parse.html - at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:138)
> 2011-06-24 01:14:23,442 WARN parse.html - at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
> 2011-06-24 01:14:23,443 WARN parse.html - at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
> 2011-06-24 01:14:23,443 WARN parse.html - at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> 2011-06-24 01:14:23,443 WARN parse.html - at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> 2011-06-24 01:14:23,443 WARN parse.html - at java.lang.Thread.run(Thread.java:662)
> 2011-06-24 01:14:23,443 WARN parse.ParseSegment - Error parsing: http://host/: failed(2,200): java.nio.charset.IllegalCharsetNameException: $charset
> {code}
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1012) Cannot handle illegal charset $charset
Posted by "Hudson (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056986#comment-13056986 ]
Hudson commented on NUTCH-1012:
-------------------------------
Integrated in Nutch-trunk #1530 (See [https://builds.apache.org/job/Nutch-trunk/1530/])
NUTCH-1012 Cannot handle illegal charset
markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1140696
Files :
* /nutch/trunk/src/java/org/apache/nutch/util/EncodingDetector.java
* /nutch/trunk/CHANGES.txt
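The patch itself is not reproduced in this thread, so the following is only a sketch of one way such a guard could look, not the committed change. The stack trace shows `Charset.isSupported` throwing `IllegalCharsetNameException` for the name `$charset` instead of returning false, so a caller that only wants a yes/no answer can catch that exception. The class name `CharsetGuard` and the method `isUsableCharset` are hypothetical and do not come from Nutch.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

public class CharsetGuard {

    /**
     * Returns true only if the name is syntactically legal and the charset
     * is available in this JVM. Illegal names such as "$charset" yield false
     * instead of propagating IllegalCharsetNameException.
     */
    public static boolean isUsableCharset(String name) {
        if (name == null) {
            // Charset.isSupported(null) would throw IllegalArgumentException.
            return false;
        }
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            // The name contains characters that are not allowed in charset names.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isUsableCharset("$charset"));
        System.out.println(isUsableCharset("UTF-8"));
    }
}
```

With a check like this, a bogus charset clue from a Content-Type header could be skipped so that detection falls back to the remaining clues rather than failing the whole parse.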
> Cannot handle illegal charset $charset
> --------------------------------------
>
> Key: NUTCH-1012
> URL: https://issues.apache.org/jira/browse/NUTCH-1012
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.3
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.4, 2.0
>
> Attachments: NUTCH-1012-1.4.patch
>
>
> Pages returning:
> {code}
> Content-Type: text/html; charset=$charset
> {code}
> cause:
> {code}
> Error parsing: http://host/: failed(2,200): java.nio.charset.IllegalCharsetNameException: $charset
> Found a TextHeaderAtom not followed by a TextBytesAtom or TextCharsAtom: Followed by 3999
> ParseSegment: finished at 2011-06-24 01:14:54, elapsed: 00:01:12
> {code}
> Stack trace:
> {code}
> 2011-06-24 01:14:23,442 WARN parse.html - java.nio.charset.IllegalCharsetNameException: $charset
> 2011-06-24 01:14:23,442 WARN parse.html - at java.nio.charset.Charset.checkName(Charset.java:284)
> 2011-06-24 01:14:23,442 WARN parse.html - at java.nio.charset.Charset.lookup2(Charset.java:458)
> 2011-06-24 01:14:23,442 WARN parse.html - at java.nio.charset.Charset.lookup(Charset.java:437)
> 2011-06-24 01:14:23,442 WARN parse.html - at java.nio.charset.Charset.isSupported(Charset.java:479)
> 2011-06-24 01:14:23,442 WARN parse.html - at org.apache.nutch.util.EncodingDetector.resolveEncodingAlias(EncodingDetector.java:310)
> 2011-06-24 01:14:23,442 WARN parse.html - at org.apache.nutch.util.EncodingDetector.addClue(EncodingDetector.java:201)
> 2011-06-24 01:14:23,442 WARN parse.html - at org.apache.nutch.util.EncodingDetector.addClue(EncodingDetector.java:208)
> 2011-06-24 01:14:23,442 WARN parse.html - at org.apache.nutch.util.EncodingDetector.autoDetectClues(EncodingDetector.java:193)
> 2011-06-24 01:14:23,442 WARN parse.html - at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:138)
> 2011-06-24 01:14:23,442 WARN parse.html - at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
> 2011-06-24 01:14:23,443 WARN parse.html - at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
> 2011-06-24 01:14:23,443 WARN parse.html - at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> 2011-06-24 01:14:23,443 WARN parse.html - at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> 2011-06-24 01:14:23,443 WARN parse.html - at java.lang.Thread.run(Thread.java:662)
> 2011-06-24 01:14:23,443 WARN parse.ParseSegment - Error parsing: http://host/: failed(2,200): java.nio.charset.IllegalCharsetNameException: $charset
> {code}
[jira] [Closed] (NUTCH-1012) Cannot handle illegal charset $charset
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma closed NUTCH-1012.
--------------------------------