Posted to dev@nutch.apache.org by "Michael Nebel (JIRA)" <ji...@apache.org> on 2005/09/10 14:36:30 UTC
[jira] Created: (NUTCH-91) empty encoding causes exception
empty encoding causes exception
-------------------------------
Key: NUTCH-91
URL: http://issues.apache.org/jira/browse/NUTCH-91
Project: Nutch
Type: Bug
Versions: 0.8-dev
Reporter: Michael Nebel
I found some sites where the header says "Content-Type: text/html; charset=", i.e. with an empty charset value. This causes an exception in the HtmlParser. My suggestion:
Index: src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java
===================================================================
--- src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java (revision 279397)
+++ src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java (working copy)
@@ -120,7 +120,7 @@
       byte[] contentInOctets = content.getContent();
       InputSource input = new InputSource(new ByteArrayInputStream(contentInOctets));
       String encoding = StringUtil.parseCharacterEncoding(contentType);
-      if (encoding!=null) {
+      if (encoding!=null && !"".equals(encoding)) {
         metadata.put("OriginalCharEncoding", encoding);
         if ((encoding = StringUtil.resolveEncodingAlias(encoding)) != null) {
           metadata.put("CharEncodingForConversion", encoding);
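
For illustration only, the same guard can be sketched as a standalone helper that treats an empty charset parameter exactly like a missing one. The class name CharsetGuard and the method usableCharset are hypothetical and not part of Nutch; the parsing here is a simplified stand-in and does not reproduce StringUtil.parseCharacterEncoding.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: treats "charset=" with no value
// the same way as a Content-Type header that has no charset parameter.
class CharsetGuard {

    // Matches "charset=<value>", case-insensitively, with optional quotes.
    // The captured value may be the empty string.
    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*\"?([^\";\\s]*)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in a Content-Type value, or null if absent or empty. */
    static String usableCharset(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return null;              // no charset parameter at all
        }
        String value = m.group(1);
        return value.isEmpty() ? null : value;  // empty value counts as "not specified"
    }

    public static void main(String[] args) {
        System.out.println(usableCharset("text/html; charset="));
        System.out.println(usableCharset("text/html; charset=ISO-8859-1"));
        System.out.println(usableCharset("text/html"));
    }
}
```

With a helper like this the caller only needs a single null check before storing the encoding in the metadata.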
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira
[jira] Closed: (NUTCH-91) empty encoding causes exception
Posted by "Piotr Kosiorowski (JIRA)" <ji...@apache.org>.
[ http://issues.apache.org/jira/browse/NUTCH-91?page=all ]
Piotr Kosiorowski closed NUTCH-91:
----------------------------------
Fix Version: 0.7.2-dev, 0.8-dev
Resolution: Fixed
Committed with a small extension. Thanks.
> empty encoding causes exception
> -------------------------------
>
> Key: NUTCH-91
> URL: http://issues.apache.org/jira/browse/NUTCH-91
> Project: Nutch
> Type: Bug
> Versions: 0.8-dev
> Reporter: Michael Nebel
> Fix For: 0.7.2-dev, 0.8-dev
>
> I found some sites where the header says "Content-Type: text/html; charset=", i.e. with an empty charset value. This causes an exception in the HtmlParser. My suggestion:
> Index: src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java
> ===================================================================
> --- src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java (revision 279397)
> +++ src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java (working copy)
> @@ -120,7 +120,7 @@
> byte[] contentInOctets = content.getContent();
> InputSource input = new InputSource(new ByteArrayInputStream(contentInOctets));
> String encoding = StringUtil.parseCharacterEncoding(contentType);
> - if (encoding!=null) {
> + if (encoding!=null && !"".equals(encoding)) {
> metadata.put("OriginalCharEncoding", encoding);
> if ((encoding = StringUtil.resolveEncodingAlias(encoding)) != null) {
> metadata.put("CharEncodingForConversion", encoding);