Posted to dev@tika.apache.org by "Nick Burch (JIRA)" <ji...@apache.org> on 2013/09/02 14:32:51 UTC

[jira] [Commented] (TIKA-1170) Insufficiently specific magic for binary image/cgm files

    [ https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13756040#comment-13756040 ] 

Nick Burch commented on TIKA-1170:
----------------------------------

Any chance we could get a CGM file or two to use in unit tests? We don't seem to have any...
                
> Insufficiently specific magic for binary image/cgm files
> --------------------------------------------------------
>
>                 Key: TIKA-1170
>                 URL: https://issues.apache.org/jira/browse/TIKA-1170
>             Project: Tika
>          Issue Type: Bug
>          Components: mime
>    Affects Versions: 1.4
>            Reporter: Andrew Jackson
>            Priority: Minor
>
> I've been running Tika against a large corpus of web archive files, and I'm seeing a number of false positives for image/cgm. The Tika magic is
> {code}
>       <match value="BEGMF" type="string" offset="0"/>
>       <match value="0x0020" mask="0xffe0" type="string" offset="0"/>
> {code}
> The issue seems to be that the second magic matcher is not very specific: the mask means it matches any file whose first two bytes fall in the range 0x0020 to 0x003f, e.g. files that start 0x002a. To be fair, this is only about 700 false matches out of more than 300 million resources, but it would be nice if this could be tightened up.
> Looking at the PRONOM signatures
> * http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1048&strPageToDisplay=signatures
> * http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1049&strPageToDisplay=signatures
> * http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1050&strPageToDisplay=signatures
> * http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1051&strPageToDisplay=signatures
> it seems we have a variable position marker that changes slightly for each version. Therefore, a more robust signature should be:
> {code}
>       <match value="BEGMF" type="string" offset="0"/>
>       <match value="0x0020" mask="0xffe0" type="string" offset="0">
>         <match value="0x10220001" type="string" offset="2:64"/>
>         <match value="0x10220002" type="string" offset="2:64"/>
>         <match value="0x10220003" type="string" offset="2:64"/>
>         <match value="0x10220004" type="string" offset="2:64"/>
>       </match>
> {code}
> Where I have assumed the filename part of the CGM file will be less than 64 characters long.
> Could this magic be considered for inclusion?
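
For anyone following along, here is a rough Java sketch of what the two variants in the quoted description check. It is not Tika's actual detector code: the class and method names are made up for illustration, the mask is assumed to be applied as (data & mask) == value, and the 2:64 offset range is assumed to mean "pattern may start at any offset from 2 to 64 inclusive". The sample bytes in main are synthetic, not taken from a real CGM file.

```java
public class CgmMagicSketch {
    // The four markers from the proposed magic: 0x1022 followed by 0x0001..0x0004.
    private static final byte[][] VERSION_MARKERS = {
        {0x10, 0x22, 0x00, 0x01},
        {0x10, 0x22, 0x00, 0x02},
        {0x10, 0x22, 0x00, 0x03},
        {0x10, 0x22, 0x00, 0x04},
    };

    // Current binary magic: first two bytes, masked with 0xffe0, must equal 0x0020.
    public static boolean looseBinaryMatch(byte[] data) {
        if (data.length < 2) {
            return false;
        }
        int word = ((data[0] & 0xff) << 8) | (data[1] & 0xff);
        return (word & 0xffe0) == 0x0020;
    }

    // Proposed binary magic: the loose match plus one of the markers
    // starting somewhere in offsets 2..64 (assumed inclusive).
    public static boolean tightBinaryMatch(byte[] data) {
        if (!looseBinaryMatch(data)) {
            return false;
        }
        for (int start = 2; start <= 64; start++) {
            for (byte[] marker : VERSION_MARKERS) {
                if (matchesAt(data, start, marker)) {
                    return true;
                }
            }
        }
        return false;
    }

    // True if pattern occurs in data beginning exactly at index start.
    private static boolean matchesAt(byte[] data, int start, byte[] pattern) {
        if (start + pattern.length > data.length) {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            if (data[start + i] != pattern[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Synthetic non-CGM input that happens to start 0x002a.
        byte[] falsePositive = {0x00, 0x2a, 'x', 'y', 'z', 'w'};
        // Synthetic input with a version marker at offset 3.
        byte[] cgmLike = {0x00, 0x21, 'a', 0x10, 0x22, 0x00, 0x01};

        System.out.println("0x002a loose: " + looseBinaryMatch(falsePositive));
        System.out.println("0x002a tight: " + tightBinaryMatch(falsePositive));
        System.out.println("cgm-like tight: " + tightBinaryMatch(cgmLike));
    }
}
```

The point of the nested match is that the marker check only runs when the loose two-byte test has already passed, so it can only remove matches, never add new ones.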
