Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2016/08/05 12:32:20 UTC
[jira] [Commented] (TIKA-2047) TXTParser overwrites mime type/masks
types that are subtype of text
[ https://issues.apache.org/jira/browse/TIKA-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15409390#comment-15409390 ]
Tim Allison commented on TIKA-2047:
-----------------------------------
This fix breaks the unit tests {{testUsingCharsetInContentTypeHeader()}} and {{testCharsetDetectionWithShortSnipet()}} for TIKA-341, TIKA-771 and TIKA-868. The tests send in a mime type that is not "text/plain" and expect it to be overwritten with "text/plain". Given the issues those tests are linked to, I don't think that was the original intent. I _think_ the original intent was only to carry the encoding information through.
[~kkrugler] and all, do you have any objections if I modify the unit tests like so:
{noformat}
public void testUsingCharsetInContentTypeHeader() throws Exception {
...
- assertEquals("text/plain; charset=ISO-8859-15", metadata.get(Metadata.CONTENT_TYPE));
+ assertEquals("text/html; charset=ISO-8859-15", metadata.get(Metadata.CONTENT_TYPE));
...
{noformat}
{noformat}
@Test
public void testCharsetDetectionWithShortSnipet() throws Exception {
...
- assertEquals("text/plain; charset=UTF-8", metadata.get(Metadata.CONTENT_TYPE));
+ assertEquals("application/binary; charset=UTF-8", metadata.get(Metadata.CONTENT_TYPE));
...
{noformat}
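To make the intended behaviour concrete (keep whatever type was sent in and only carry the charset through), here is a minimal sketch in plain Java. It deliberately does not use Tika: the class name {{ContentTypeSketch}} and the helper {{withCharset}} are hypothetical, and dropping all existing parameters on the incoming type is a simplifying assumption, not a description of how TXTParser does or should handle them.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ContentTypeSketch {

    /**
     * Hypothetical helper (not a Tika API): keeps the base type that was
     * sent in and sets only the charset parameter on it.
     */
    public static String withCharset(String incomingType, Charset charset) {
        // Assumption: fall back to text/plain only when no type was supplied.
        if (incomingType == null || incomingType.trim().isEmpty()) {
            incomingType = "text/plain";
        }
        // Simplification: discard any existing parameters such as "; charset=...".
        int semi = incomingType.indexOf(';');
        String base = (semi >= 0 ? incomingType.substring(0, semi) : incomingType).trim();
        // Append the detected charset using its canonical name.
        return base + "; charset=" + charset.name();
    }

    public static void main(String[] args) {
        // The incoming subtype is preserved instead of becoming text/plain.
        System.out.println(withCharset("text/calendar", StandardCharsets.ISO_8859_1));
        // An existing charset parameter is replaced by the detected one.
        System.out.println(withCharset("text/html; charset=US-ASCII", StandardCharsets.UTF_8));
        // With no incoming type, the result falls back to text/plain.
        System.out.println(withCharset(null, StandardCharsets.UTF_8));
    }
}
```

Under this sketch, only the case with no incoming type ends up as "text/plain", which is why the two assertions above would change.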
> TXTParser overwrites mime type/masks types that are subtype of text
> -------------------------------------------------------------------
>
> Key: TIKA-2047
> URL: https://issues.apache.org/jira/browse/TIKA-2047
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.13
> Reporter: Tim Allison
> Assignee: Tim Allison
> Priority: Minor
>
> For vcal and other mime types that are subclasses of {{text/plain}}, the TXTParser overwrites their mime type as "text/plain". We should check to see what mime has been sent in via the Metadata and add the charset to that, e.g. "text/calendar; charset=ISO-8859-1"...right?
> {noformat}
> Charset charset = reader.getCharset();
> MediaType type = new MediaType(MediaType.TEXT_PLAIN, charset);
> metadata.set(Metadata.CONTENT_TYPE, type.toString());
> {noformat}