Posted to dev@nutch.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2011/07/11 18:54:59 UTC
[jira] [Created] (NUTCH-1041) Not reading mime-type correctly
Not reading mime-type correctly
-------------------------------
Key: NUTCH-1041
URL: https://issues.apache.org/jira/browse/NUTCH-1041
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 1.4
Reporter: Markus Jelsma
Fix For: 1.4, 2.0
Another issue with mime-types and test URLs. Below are two log lines from MimeUtil. The mime-type is still correct at the start of the autoResolveContentType method:
{code}
Jul 11, 2011 6:46:15 PM org.apache.nutch.util.MimeUtil autoResolveContentType
INFO: Type: text/html; charset=ISO-8859-1 from: http://www.taxipoll.nl/taxipol.htm
Jul 11, 2011 6:46:15 PM org.apache.nutch.util.MimeUtil autoResolveContentType
INFO: Type: text/html from: http://archief.hoofdklassehockey.nl/hschema2009.html
{code}
Mime-type correctness has been confirmed with Curl. The documents, however, do not end up in the index with the correct mime-type. Here is the output from IndexingFiltersChecker; ParserChecker does output the correct Content-Type.
{code}
http://www.taxipoll.nl/taxipol.htm --> taxipoll/htm
http://archief.hoofdklassehockey.nl/hschema2009.html --> tet/html
{code}
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1041) Not reading mime-type correctly
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063438#comment-13063438 ]
Markus Jelsma commented on NUTCH-1041:
--------------------------------------
More strange behaviour: Nutch trunk ParserChecker outputs the wrong Content-Type (tet/html; charset=iso-8859-1), while Nutch 1.4 ParserChecker works fine.
[jira] [Updated] (NUTCH-1041) Not reading mime-type correctly
Posted by "Markus Jelsma (Updated) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated NUTCH-1041:
---------------------------------
Affects Version/s: 1.3 (was: 1.4)
Fix Version/s: 1.5 (was: 1.4, nutchgora)
[jira] [Commented] (NUTCH-1041) Not reading mime-type correctly
Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063440#comment-13063440 ]
Markus Jelsma commented on NUTCH-1041:
--------------------------------------
Ah, the sites output a Content-Type meta tag with an incorrect value, while the HTTP header received by Curl is fine:
{code}
<meta http-equiv="Content-Type" content="taxipoll/htm; charset=iso-8859-1">
{code}
We should only index the correct content-type.
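
To illustrate the idea, here is a minimal Java sketch of one way to give the HTTP header precedence over the meta tag. It is not the change made in Nutch; the class ContentTypeSketch and its methods are hypothetical names made up for this example, and it uses only the JDK. Note that a purely syntactic check would not help here, since "taxipoll/htm" is a well-formed type/subtype pair; it is the precedence of the header that keeps the bogus value out.

```java
import java.util.Locale;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Nutch: picks the content type to index. */
final class ContentTypeSketch {

    // Simplified "type/subtype" check; an assumption, not a full RFC grammar.
    private static final Pattern MEDIA_TYPE =
        Pattern.compile("[a-z0-9][a-z0-9!#$&^_.+-]*/[a-z0-9][a-z0-9!#$&^_.+-]*");

    /** Strips parameters such as "; charset=..." and lower-cases the value. */
    static String baseType(String contentType) {
        if (contentType == null) {
            return null;
        }
        int semicolon = contentType.indexOf(';');
        String base = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        base = base.trim().toLowerCase(Locale.ROOT);
        return base.isEmpty() ? null : base;
    }

    /** True if the value looks like a syntactically valid type/subtype pair. */
    static boolean isWellFormed(String baseType) {
        return baseType != null && MEDIA_TYPE.matcher(baseType).matches();
    }

    /**
     * Prefers the HTTP header value; falls back to the meta tag value only
     * when the header is missing or malformed. Returns null if neither is usable.
     */
    static String chooseContentType(String httpHeaderValue, String metaValue) {
        String fromHeader = baseType(httpHeaderValue);
        if (isWellFormed(fromHeader)) {
            return fromHeader;
        }
        String fromMeta = baseType(metaValue);
        return isWellFormed(fromMeta) ? fromMeta : null;
    }

    public static void main(String[] args) {
        // Values taken from the report above: correct header, bogus meta tag.
        System.out.println(chooseContentType(
            "text/html; charset=ISO-8859-1",
            "taxipoll/htm; charset=iso-8859-1"));
        // No header at all: the meta tag is the only source left.
        System.out.println(chooseContentType(null, "text/html"));
    }
}
```

With the values from this issue, the header's text/html wins over the meta tag's taxipoll/htm. Falling back to the meta tag when no header is present is a design choice of this sketch; a stricter variant could ignore the meta tag entirely.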
[jira] [Resolved] (NUTCH-1041) Not reading mime-type correctly
Posted by "Markus Jelsma (Resolved) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-1041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma resolved NUTCH-1041.
----------------------------------
Resolution: Fixed
Fixed in NUTCH-1230