Posted to dev@tika.apache.org by "Erik Hetzner (JIRA)" <ji...@apache.org> on 2010/05/21 19:59:16 UTC
[jira] Created: (TIKA-431) Tika currently misuses the HTTP
Content-Encoding header, and does not seem to use the charset part of the
Content-Type header properly.
Tika currently misuses the HTTP Content-Encoding header, and does not seem to use the charset part of the Content-Type header properly.
---------------------------------------------------------------------------------------------------------------------------------------
Key: TIKA-431
URL: https://issues.apache.org/jira/browse/TIKA-431
Project: Tika
Issue Type: Bug
Components: general
Reporter: Erik Hetzner
Tika currently misuses the HTTP Content-Encoding header, and does not seem to use the charset part of the Content-Type header properly.
Content-Encoding is not for the charset. It is for values like gzip, deflate, compress, or identity.
Charset is passed in with the Content-Type. For instance: text/html; charset=iso-8859-1
Tika should, in my opinion, do the following:
1. Stop using Content-Encoding, unless it wants me to be able to pass in gzipped content in an input stream.
2. Parse and understand charset=... declarations in the Content-Type value if one is passed in the Metadata object.
3. Return charset=... declarations in the Content-Type value of the Metadata object if a charset is detected.
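To make suggestions 2 and 3 concrete, here is a minimal sketch of reading and writing the charset parameter of a Content-Type value. It uses only the JDK and is not Tika's actual code; the class name ContentTypeCharset and the methods parse and withCharset are hypothetical names chosen for illustration. It splits naively on ';', so quoted parameter values that themselves contain a semicolon are not handled.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper, not part of Tika: illustrates charset handling via Content-Type.
final class ContentTypeCharset {

    // Suggestion 2: read the charset parameter from a value such as
    // "text/html; charset=iso-8859-1". Returns empty if the parameter is
    // absent, malformed, or names a charset this JVM does not support.
    static Optional<Charset> parse(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type itself; parameters start at index 1.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq < 0) {
                continue;
            }
            String name = param.substring(0, eq).trim().toLowerCase(Locale.ROOT);
            if (!name.equals("charset")) {
                continue;
            }
            String value = param.substring(eq + 1).trim();
            // Strip optional surrounding double quotes.
            if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                value = value.substring(1, value.length() - 1);
            }
            try {
                return Optional.of(Charset.forName(value));
            } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
                return Optional.empty();
            }
        }
        return Optional.empty();
    }

    // Suggestion 3: report a detected charset as part of the Content-Type
    // value. Assumes mediaType has no charset parameter yet.
    static String withCharset(String mediaType, Charset charset) {
        return mediaType + "; charset=" + charset.name();
    }

    public static void main(String[] args) {
        System.out.println(parse("text/html; charset=iso-8859-1"));
        System.out.println(withCharset("text/html", StandardCharsets.UTF_8));
    }
}
```

Under this scheme a Content-Encoding value such as gzip would never be consulted when choosing how to decode characters; it would only matter if the input stream itself were compressed.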
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (TIKA-431) Tika currently misuses the HTTP
Content-Encoding header, and does not seem to use the charset part of the
Content-Type header properly.
Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871567#action_12871567 ]
Jukka Zitting commented on TIKA-431:
------------------------------------
Agreed, we should be using the charset parameter of the media type instead of the Content-Encoding header.
AFAICT we need to adjust the HtmlParser, MboxParser and TXTParser classes to do this. Any volunteers?
[jira] Assigned: (TIKA-431) Tika currently misuses the HTTP
Content-Encoding header, and does not seem to use the charset part of the
Content-Type header properly.
Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ken Krugler reassigned TIKA-431:
--------------------------------
Assignee: Ken Krugler
[jira] Commented: (TIKA-431) Tika currently misuses the HTTP
Content-Encoding header, and does not seem to use the charset part of the
Content-Type header properly.
Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871824#action_12871824 ]
Ken Krugler commented on TIKA-431:
----------------------------------
I should have some time soon to do a once-over on a bunch of encoding-related issues.
[jira] Commented: (TIKA-431) Tika currently misuses the HTTP
Content-Encoding header, and does not seem to use the charset part of the
Content-Type header properly.
Posted by "Erik Hetzner (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870079#action_12870079 ]
Erik Hetzner commented on TIKA-431:
-----------------------------------
See TIKA-341; apparently my suggestion (2) above is already implemented.
Thank you for anticipating this issue! :)