Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2016/08/05 12:32:20 UTC

[jira] [Commented] (TIKA-2047) TXTParser overwrites mime type/masks types that are subtype of text

    [ https://issues.apache.org/jira/browse/TIKA-2047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15409390#comment-15409390 ] 

Tim Allison commented on TIKA-2047:
-----------------------------------

This fix breaks the unit tests {{testUsingCharsetInContentTypeHeader()}} and {{testCharsetDetectionWithShortSnipet()}} for TIKA-341, TIKA-771, TIKA-868.  The issue is that the unit tests send in a mime type that is not "text/plain", and they expect it to be overwritten with "text/plain".  Given the issues that those tests are linked to, I don't think that was the original intent.  I _think_ the original intent was only to carry the encoding information through.

[~kkrugler] and all, do you have any objections if I modify the unit tests like so:
{noformat}
    public void testUsingCharsetInContentTypeHeader() throws Exception {
...
-        assertEquals("text/plain; charset=ISO-8859-15", metadata.get(Metadata.CONTENT_TYPE));
+        assertEquals("text/html; charset=ISO-8859-15", metadata.get(Metadata.CONTENT_TYPE));
...
{noformat}

{noformat}
    @Test
    public void testCharsetDetectionWithShortSnipet() throws Exception {
...
-        assertEquals("text/plain; charset=UTF-8", metadata.get(Metadata.CONTENT_TYPE));
+        assertEquals("application/binary; charset=UTF-8", metadata.get(Metadata.CONTENT_TYPE));
...
{noformat}
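
For reference, here is a rough, Tika-free sketch of the behavior the modified assertions would expect: keep whatever mime type came in via the metadata and only attach the detected charset, falling back to "text/plain" when nothing was sent in. {{ContentTypeSketch}} and {{withCharset}} are hypothetical names for illustration only, and plain string handling stands in for Tika's {{MediaType}}; this is not the actual patch.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of Tika and not the actual patch.
class ContentTypeSketch {

    // Keeps the incoming base type and appends the detected charset.
    // Any parameters already on the incoming type (e.g. an old charset)
    // are dropped; "text/plain" is used when no type was supplied.
    static String withCharset(String incomingType, Charset charset) {
        String base = "text/plain";
        if (incomingType != null) {
            int semi = incomingType.indexOf(';');
            String stripped = (semi >= 0
                    ? incomingType.substring(0, semi)
                    : incomingType).trim();
            if (!stripped.isEmpty()) {
                base = stripped;
            }
        }
        return base + "; charset=" + charset.name();
    }

    public static void main(String[] args) {
        // Subtype of text is preserved instead of being masked as text/plain.
        System.out.println(withCharset("text/calendar", StandardCharsets.ISO_8859_1));
        // A non-text type sent in by the caller is carried through as well.
        System.out.println(withCharset("application/binary", StandardCharsets.UTF_8));
        // No incoming type: fall back to text/plain.
        System.out.println(withCharset(null, StandardCharsets.UTF_8));
    }
}
```

With something along these lines, a caller that sends in "text/html" would get "text/html" back with the charset attached, which is what the changed assertions above check for.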


> TXTParser overwrites mime type/masks types that are subtype of text
> -------------------------------------------------------------------
>
>                 Key: TIKA-2047
>                 URL: https://issues.apache.org/jira/browse/TIKA-2047
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.13
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Minor
>
> For vcal and other mime types that are subclasses of {{text/plain}}, the TXTParser overwrites their mime type as "text/plain".  We should check to see what mime has been sent in via the Metadata and add the charset to that, e.g. "text/calendar; charset=ISO-8859-1"...right?
> {noformat}
>             Charset charset = reader.getCharset();
>             MediaType type = new MediaType(MediaType.TEXT_PLAIN, charset);
>             metadata.set(Metadata.CONTENT_TYPE, type.toString());
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)