Posted to dev@tika.apache.org by "Nick Burch (JIRA)" <ji...@apache.org> on 2015/05/23 16:07:17 UTC

[jira] [Commented] (TIKA-1634) Detecting problem with Matlab source code

    [ https://issues.apache.org/jira/browse/TIKA-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14557350#comment-14557350 ] 

Nick Burch commented on TIKA-1634:
----------------------------------

In r1681351, I've added two more matches, at lower priority as they have a higher false-positive chance. One covers functions with a single output or no output; the other tries to spot the comments at the top of the file. Our 3 test Matlab files (your two and my own "hello world" one) now detect correctly.

Could you try your wider set of Matlab files with these magics in place, and close the issue if they all detect fine now?
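
As a rough illustration of why the extra matches sit at a lower priority, here is a minimal Python sketch of offset-0 string matching. It is a hypothetical model, not Tika's implementation: only the "function [" match at priority 50 comes from tika-mimetypes.xml as quoted below; the other two prefixes follow the reporter's proposal, and their priority of 20 is an assumed value, not the one committed in r1681351.

```python
from typing import Optional

# Hypothetical model of offset-0 string magic; not Tika's actual code.
# (priority, prefix) pairs. "function [" at priority 50 is taken from the
# tika-mimetypes.xml entry quoted in the issue. The other two prefixes and
# the priority value 20 are assumptions made for illustration only.
MATLAB_MAGIC = [
    (50, b"function ["),
    (20, b"function "),
    (20, b"%"),
]


def match_matlab(head: bytes) -> Optional[int]:
    """Return the highest priority among the prefixes that match the
    start of the file, or None if no prefix matches."""
    hits = [prio for prio, prefix in MATLAB_MAGIC if head.startswith(prefix)]
    return max(hits) if hits else None
```

A leading "%" is a weak signal on its own, since PostScript files begin with "%!PS" and LaTeX sources often open with "%" comments, which is one reason such a match would need to rank below more specific ones.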

> Detecting problem with Matlab source code
> -----------------------------------------
>
>                 Key: TIKA-1634
>                 URL: https://issues.apache.org/jira/browse/TIKA-1634
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime
>    Affects Versions: 1.8
>            Reporter: Ji-Hyun Oh
>            Priority: Trivial
>         Attachments: BARCAST_MainCode.m, Matlab_mime-type_test.xlsx, wtsgaus.m
>
>
> Both Matlab source code and Objective-C source code use the same suffix, .m. Therefore, Matlab has an additional match value in tika-mimetypes.xml.
> In tika-mimetypes.xml Matlab is defined as:
>   <mime-type type="text/x-matlab">
>     <_comment>Matlab source code</_comment>
>     <magic priority="50">
>       <match value="function [" type="string" offset="0"/>
>     </magic>
>     <!-- <glob pattern="*.m"/> - conflicts with text/x-objcsrc -->
>     <sub-class-of type="text/plain"/>
>   </mime-type>
> However, Matlab code does not always start with "function [". Therefore, some Matlab files are detected as text/x-objcsrc. Based on the source code collected from NOAA Paleoclimatology Software Resources, many Matlab files would be matched by values like these (problematic files are attached as examples):
>   <mime-type type="text/x-matlab">
>     <_comment>Matlab source code</_comment>
>     <magic priority="50">
>       <match value="function" type="string" offset="0"/>
>       <match value="%" type="string" offset="0"/>
>     </magic>
>     <!-- <glob pattern="*.m"/> - conflicts with text/x-objcsrc -->
>     <sub-class-of type="text/plain"/>
>   </mime-type>
> I conducted several detection tests using different Matlab packages obtained from NOAA Paleoclimatology Software Resources, with and without custom-mimetypes.xml. Results are attached. In total, 121 Matlab files are detected correctly with custom-mimetypes.xml, while 55 Matlab files are detected as Matlab without it (that is, with only the current match value). However, these match values for Matlab source code may be common only in the Paleoclimatology community.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)