Posted to dev@tika.apache.org by "Andrew Jackson (JIRA)" <ji...@apache.org> on 2013/09/02 14:17:51 UTC

[jira] [Created] (TIKA-1170) Possibly erroneous magic for image/cgm files

Andrew Jackson created TIKA-1170:
------------------------------------

             Summary: Possibly erroneous magic for image/cgm files
                 Key: TIKA-1170
                 URL: https://issues.apache.org/jira/browse/TIKA-1170
             Project: Tika
          Issue Type: Bug
          Components: mime
    Affects Versions: 1.4
            Reporter: Andrew Jackson
            Priority: Minor


I've been running Tika against a large corpus of web archive files, and I'm seeing a number of false positives for image/cgm. The current Tika magic is:
{code}
      <match value="BEGMF" type="string" offset="0"/>
      <match value="0x0020" mask="0xffe0" type="string" offset="0"/>
{code}
The issue seems to be that the second magic matcher is not very specific: for example, it matches any file that starts with 0x002a. To be fair, this is only c.700 false matches out of >300 million resources, but it would be nice if this could be tightened up.
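
To illustrate how loose the masked matcher is, here is a minimal Python sketch. It is not Tika code: it assumes the matcher reads the first two bytes as a big-endian 16-bit word and accepts the file when (word & mask) == value. The helper name masked_match is hypothetical.

```python
# Illustrative sketch only, not Tika's implementation.
# Assumption: the matcher compares (first two bytes & MASK) against VALUE.
MASK = 0xFFE0
VALUE = 0x0020


def masked_match(data: bytes) -> bool:
    """Return True if the first two bytes pass the masked comparison."""
    if len(data) < 2:
        return False
    word = int.from_bytes(data[:2], "big")
    return (word & MASK) == VALUE


# Every 16-bit prefix that the mask lets through.
matching = [w for w in range(0x10000) if (w & MASK) == VALUE]

if __name__ == "__main__":
    print(len(matching), hex(matching[0]), hex(matching[-1]))
    print(masked_match(b"\x00\x2a anything at all"))
```

Under that assumption, every file whose first byte is 0x00 and whose second byte lies between 0x20 and 0x3f passes, which covers 0x002a.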

Looking at the PRONOM signatures
* http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1048&strPageToDisplay=signatures
* http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1049&strPageToDisplay=signatures
* http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1050&strPageToDisplay=signatures
* http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1051&strPageToDisplay=signatures
it seems there is a marker at a variable position whose value changes slightly for each version. A more robust signature would therefore be:

{code}
      <match value="BEGMF" type="string" offset="0"/>
      <match value="0x0020" mask="0xffe0" type="string" offset="0">
        <match value="0x10220001" type="string" offset="2:64"/>
        <match value="0x10220002" type="string" offset="2:64"/>
        <match value="0x10220003" type="string" offset="2:64"/>
        <match value="0x10220004" type="string" offset="2:64"/>
      </match>
{code}

Here I have assumed that the filename part of the CGM file will be less than 64 characters long.
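
As a rough way to check the proposed signature against sample files outside Tika, the nested rule can be sketched in Python. This is only an approximation under stated assumptions: offset="2:64" is read as "the marker may start at any byte offset from 2 to 64 inclusive", and the masked test is (first two bytes & 0xffe0) == 0x0020. The function name looks_like_cgm is hypothetical.

```python
# Illustrative sketch of the proposed magic, not Tika's implementation.
# Assumption: offset "2:64" means the marker may START at offsets 2..64.
VERSION_MARKERS = [bytes.fromhex("1022000%d" % v) for v in range(1, 5)]


def looks_like_cgm(data: bytes) -> bool:
    """Apply the clear-text rule, then the tightened binary rule."""
    # First rule: files starting with the literal string BEGMF.
    if data.startswith(b"BEGMF"):
        return True
    # Second rule: masked two-byte prefix ...
    if len(data) < 2:
        return False
    if (int.from_bytes(data[:2], "big") & 0xFFE0) != 0x0020:
        return False
    # ... followed by one of the version markers starting at offset 2..64.
    for marker in VERSION_MARKERS:
        # bytes.find's end bound is exclusive and the match must fit
        # entirely inside the window, so extend it by the marker length.
        if data.find(marker, 2, 64 + len(marker)) != -1:
            return True
    return False


if __name__ == "__main__":
    print(looks_like_cgm(b"\x00\x2a" + b"name" + b"\x10\x22\x00\x01"))
    print(looks_like_cgm(b"\x00\x2a" + b"no version marker here"))
```

With this reading, a file that merely starts with 0x002a is rejected unless one of the four markers follows within the window.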

Could this magic be considered for inclusion?
