Posted to dev@tika.apache.org by "Nick Burch (JIRA)" <ji...@apache.org> on 2017/01/12 12:38:52 UTC

[jira] [Comment Edited] (TIKA-2194) matlab files detected as 'text/plain'

    [ https://issues.apache.org/jira/browse/TIKA-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15820902#comment-15820902 ] 

Nick Burch edited comment on TIKA-2194 at 1/12/17 12:38 PM:
------------------------------------------------------------

Ah, I've found the problem with your filename case. In the Tika mimetype definition for matlab, the {{*.m}} glob is commented out:

{noformat}
    <!-- <glob pattern="*.m"/> - conflicts with text/x-objcsrc -->
{noformat}

This leaves us with a problem - matlab program files don't have any universal unique magic to spot, and they don't have a unique file extension either :(

That said, with your test file and the Tika App, we do manage to detect it correctly as matlab just from the function definition on the first line. If you change your line 73 to {{def sherlock = new DefaultDetector();}} then the detection will work.
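
To illustrate why the name alone can't settle it while the first line can, here is a rough, self-contained Java sketch. It is not Tika's code and does not use the Tika API: the class {{MatlabDetectSketch}}, its tiny extension table and the "first line starts with function" check are all made up for illustration, and the type name {{text/x-matlab}} is assumed. Tika's real magic rules are richer than this.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Toy illustration only; hypothetical names, not Tika's implementation. */
final class MatlabDetectSketch {

    // Hypothetical extension table. ".m" is claimed by Objective-C here,
    // mirroring the commented-out matlab glob quoted above.
    private static final Map<String, String> BY_EXTENSION = new HashMap<>();
    static {
        BY_EXTENSION.put(".m", "text/x-objcsrc");
        BY_EXTENSION.put(".txt", "text/plain");
    }

    /** Name-only lookup: returns a type for a known extension, else null. */
    static String byName(String filename) {
        int dot = filename.lastIndexOf('.');
        if (dot < 0) {
            return null;
        }
        return BY_EXTENSION.get(filename.substring(dot).toLowerCase(Locale.ROOT));
    }

    /** Content hint (an assumption of this sketch): first line begins with "function". */
    static String byContent(String content) {
        String firstLine = content.split("\r?\n", 2)[0].trim();
        return firstLine.matches("function\\s.*") ? "text/x-matlab" : null;
    }

    /** Prefer the content hint, then the name, then fall back to text/plain. */
    static String detect(String filename, String content) {
        String fromContent = byContent(content);
        if (fromContent != null) {
            return fromContent;
        }
        String fromName = byName(filename);
        return fromName != null ? fromName : "text/plain";
    }

    public static void main(String[] args) {
        System.out.println(detect("fit.m", "function y = fit(x)\ny = 2 * x;\nend\n"));
        System.out.println(detect("main.m", "#import <Foundation/Foundation.h>\n"));
    }
}
```

With only the name to go on, {{fit.m}} would come back as Objective-C in this sketch; it is the content check that tips it to matlab, which is roughly the situation described above.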


> matlab files detected as 'text/plain'
> -------------------------------------
>
>                 Key: TIKA-2194
>                 URL: https://issues.apache.org/jira/browse/TIKA-2194
>             Project: Tika
>          Issue Type: Bug
>          Components: detector, mime
>    Affects Versions: 1.9, 1.14
>            Reporter: Mihai Glont
>
> matlab files from https://issues.apache.org/jira/browse/TIKA-1634 are reported to have mime type 'text/plain' with either DefaultDetector or MimeTypes. I am able to reproduce the problem by running the following Groovy script https://gist.github.com/mglont/16630c8a66fdddaaa7aa44820d6f021f


