Posted to dev@tika.apache.org by "Piotr B. (JIRA)" <ji...@apache.org> on 2009/09/07 08:50:57 UTC
[jira] Created: (TIKA-273) Content encoding in HtmlParser
Content encoding in HtmlParser
------------------------------
Key: TIKA-273
URL: https://issues.apache.org/jira/browse/TIKA-273
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 0.4, 0.5
Reporter: Piotr B.
Sometimes the content encoding is stored outside the HTML document, for instance in a MIME mail with an HTML attachment.
This is a problem for text/html documents that have no http-equiv meta element. Currently there is no way to pass this information to the parser.
My fix for the parse method in HtmlParser.java:
- parser.parse(new InputSource(stream));
+ InputSource source = new InputSource(stream);
+ String encoding = metadata.get(Metadata.CONTENT_ENCODING);
+ if (encoding != null) {
+     source.setEncoding(encoding);
+ }
+ parser.parse(source);
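
For readers without the Tika sources at hand, the idea behind the patch can be sketched with the JDK alone. This is only an illustration, not the code committed to Tika: the class EncodingHint, its method toInputSource, and the use of a plain Map in place of Tika's Metadata object are all hypothetical, and the key string "Content-Encoding" is assumed to match Metadata.CONTENT_ENCODING. Only org.xml.sax.InputSource and its setEncoding/getEncoding methods are taken as given.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Map;

import org.xml.sax.InputSource;

/** Hypothetical helper for illustration; not part of Tika. */
final class EncodingHint {

    /** Assumed stand-in for Tika's Metadata.CONTENT_ENCODING key. */
    static final String CONTENT_ENCODING = "Content-Encoding";

    /**
     * Wraps the stream in an InputSource and, if the caller supplied an
     * encoding in the metadata, passes it on as a hint to the SAX parser.
     * Without a hint the encoding is left unset (null), as before the patch.
     */
    static InputSource toInputSource(InputStream stream, Map<String, String> metadata) {
        InputSource source = new InputSource(stream);
        String encoding = metadata.get(CONTENT_ENCODING);
        if (encoding != null) {
            source.setEncoding(encoding);
        }
        return source;
    }

    public static void main(String[] args) {
        byte[] html = "<html><body>test</body></html>".getBytes(StandardCharsets.ISO_8859_1);

        // Encoding known from outside the document, e.g. from a MIME part header.
        Map<String, String> metadata =
                Collections.singletonMap(CONTENT_ENCODING, "ISO-8859-2");

        InputSource source = toInputSource(new ByteArrayInputStream(html), metadata);
        System.out.println("Encoding hint passed to parser: " + source.getEncoding());
    }
}
```

A caller that already knows the charset (for example from the Content-Type header of the enclosing mail part) would put it into the metadata before calling parse; when the key is absent, nothing is set and the parser falls back to its own detection.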
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Resolved: (TIKA-273) Content encoding in HtmlParser
Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting resolved TIKA-273.
--------------------------------
Resolution: Fixed
Fix Version/s: 0.5
Assignee: Jukka Zitting
Thanks! Fixed as suggested in revision 813626.