Posted to dev@tika.apache.org by "Antoni Mylka (Closed) (JIRA)" <ji...@apache.org> on 2011/12/19 12:19:30 UTC

[jira] [Closed] (TIKA-812) Improve the detection of Works Spreadsheet 7.0 files

     [ https://issues.apache.org/jira/browse/TIKA-812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoni Mylka closed TIKA-812.
-----------------------------

       Resolution: Fixed
    Fix Version/s: 1.1

Committed tika-812-ver2.patch in r1220687.
                
> Improve the detection of Works Spreadsheet 7.0 files
> ----------------------------------------------------
>
>                 Key: TIKA-812
>                 URL: https://issues.apache.org/jira/browse/TIKA-812
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime
>    Affects Versions: 1.1
>            Reporter: Antoni Mylka
>             Fix For: 1.1
>
>         Attachments: testWORKSSpreadsheet7.0.xlr, tika-812-ver2.patch, tika-812.patch
>
>
> This was originally part of ver3 of my patch submitted to TIKA-806.
> Works Spreadsheet files are weird. Versions up to 3.0 used a Quattro Pro magic and version 4.0 used its own magic, while version 7.0 (and probably later ones) uses an OLE2 structure and an MS Office magic. The 7.0 files also contain an entry labelled "Workbook". In Tika this makes both MimeTypes (due to the quirk recently discussed in TIKA-806) and the POIFSContainerDetector label them as Excel.
> "Conceptually" they should be vnd.ms-works, but "technically" they are vnd.ms-excel. A special media type seems like a good compromise, in a similar vein to the one we reached in TIKA-798.
> I would like to mark them with a new media type: "application/x-tika-msworks-spreadsheet". It would be a subclass of vnd.ms-excel so that:
> # With pure MimeTypes and no name, ms-excel could be returned.
> # With MimeTypes with name and data, the correct type could be returned.
> # With POIFSContainerDetector, the correct type could be returned.
> # They can also be added to the list of types supported by ExcelParser, as it seems to be able to get some content from them.
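The subclass idea in the list above can be sketched in plain Java. This is only an illustration under stated assumptions, not the code from tika-812-ver2.patch and not Tika's API: the class name, the maps, and the method names are hypothetical, the ".xlr" extension is taken from the attached test file name, and the magic-based result is passed in as a string rather than computed from bytes. It shows why a subtype relationship lets data alone fall back to vnd.ms-excel while name plus data yields the more specific type.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: a name hint wins only if it is a subtype of the magic result.
final class WorksSpreadsheetSketch {
    static final String EXCEL = "application/vnd.ms-excel";
    static final String WORKS_SPREADSHEET = "application/x-tika-msworks-spreadsheet";

    // Assumed type hierarchy: child type -> supertype.
    private static final Map<String, String> SUPERTYPE =
            Map.of(WORKS_SPREADSHEET, EXCEL);

    // Assumed name patterns; ".xlr" comes from the attachment name in this issue.
    private static final Map<String, String> BY_EXTENSION =
            Map.of(".xlr", WORKS_SPREADSHEET, ".xls", EXCEL);

    // True if 'type' has 'ancestor' somewhere above it in the assumed hierarchy.
    static boolean isSpecializationOf(String type, String ancestor) {
        for (String t = SUPERTYPE.get(type); t != null; t = SUPERTYPE.get(t)) {
            if (t.equals(ancestor)) {
                return true;
            }
        }
        return false;
    }

    // Looks up a type from the file extension, or returns null if there is no hint.
    static String typeForName(String name) {
        if (name == null) {
            return null;
        }
        int dot = name.lastIndexOf('.');
        if (dot < 0) {
            return null;
        }
        return BY_EXTENSION.get(name.substring(dot).toLowerCase(Locale.ROOT));
    }

    // Combines the magic-based type with an optional name hint.
    static String detect(String magicType, String name) {
        String hint = typeForName(name);
        if (hint != null && isSpecializationOf(hint, magicType)) {
            return hint; // name refines the magic result
        }
        return magicType; // no usable hint: keep what the data says
    }

    public static void main(String[] args) {
        System.out.println(detect(EXCEL, null));
        System.out.println(detect(EXCEL, "testWORKSSpreadsheet7.0.xlr"));
    }
}
```

In this sketch a name hint that is not a subtype of the magic result is ignored, so an unrelated extension cannot override what the data says. The container-based case (item 3) is left out because it needs the OLE2 directory entries, which this sketch does not model.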

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira