Posted to dev@tika.apache.org by "Nick Burch (Commented) (JIRA)" <ji...@apache.org> on 2011/12/14 01:05:29 UTC

[jira] [Commented] (TIKA-812) Improve the detection of Works Spreadsheet 7.0 files

    [ https://issues.apache.org/jira/browse/TIKA-812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168902#comment-13168902 ] 

Nick Burch commented on TIKA-812:
---------------------------------

If we put in a slightly higher priority match for WksSSWorkBook than for Workbook, that could solve the mimetypes issue.

In the absence of multiple hierarchies in the mimetypes file (we'd need both "this is what a user thinks this extends from" and "this is what it really extends from"), I think this is a sensible approach.
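
To make the priority idea concrete, here is a rough Java sketch of it. It is a toy model, not Tika's actual matching code: the class PriorityMatchSketch, the Rule type, and the priority values 50 and 60 are all assumptions made up for illustration. The only things taken from this thread are the two entry names and the two media types. It relies on OLE2 directory entry names being stored as UTF-16LE, so it searches for the encoded form of each name.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Optional;

/** Toy model of priority-ordered matching; NOT Tika's real implementation. */
class PriorityMatchSketch {

    /** Hypothetical rule: if the entry name occurs in the data, report the type. */
    static final class Rule {
        final String entryName;
        final String type;
        final int priority;

        Rule(String entryName, String type, int priority) {
            this.entryName = entryName;
            this.type = type;
            this.priority = priority;
        }
    }

    // Priorities 50 and 60 are illustrative assumptions, not values from the mimetypes file.
    static final List<Rule> RULES = List.of(
        new Rule("Workbook", "application/vnd.ms-excel", 50),
        new Rule("WksSSWorkBook", "application/x-tika-msworks-spreadsheet", 60));

    /** Byte-level search for the UTF-16LE form of an entry name. */
    static boolean occursIn(byte[] data, String entryName) {
        // ISO-8859-1 maps every byte to exactly one char, so String.contains
        // behaves like a byte-subsequence search here.
        String haystack = new String(data, StandardCharsets.ISO_8859_1);
        String needle = new String(
            entryName.getBytes(StandardCharsets.UTF_16LE), StandardCharsets.ISO_8859_1);
        return haystack.contains(needle);
    }

    /** Returns the type of the highest-priority rule that matches, if any. */
    static Optional<String> detect(byte[] data) {
        Rule best = null;
        for (Rule r : RULES) {
            if (occursIn(data, r.entryName) && (best == null || r.priority > best.priority)) {
                best = r;
            }
        }
        return best == null ? Optional.empty() : Optional.of(best.type);
    }

    public static void main(String[] args) {
        byte[] worksLike = "Workbook|WksSSWorkBook".getBytes(StandardCharsets.UTF_16LE);
        byte[] excelLike = "Workbook".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(detect(worksLike).orElse("no match"));
        System.out.println(detect(excelLike).orElse("no match"));
    }
}
```

Because a Works 7.0 file carries both entries, both rules fire on it and the tie is broken purely by priority; a plain Excel file only ever triggers the Workbook rule.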
                
> Improve the detection of Works Spreadsheet 7.0 files
> ----------------------------------------------------
>
>                 Key: TIKA-812
>                 URL: https://issues.apache.org/jira/browse/TIKA-812
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime
>    Affects Versions: 1.1
>            Reporter: Antoni Mylka
>         Attachments: testWORKSSpreadsheet7.0.xlr, tika-812.patch
>
>
> This was originally part of ver3 of my patch submitted to TIKA-806.
> Works Spreadsheet files are weird. Versions up to 3.0 used a Quattro Pro magic, version 4.0 used its own magic, while version 7.0 (and probably later ones as well) uses an OLE2 structure and an MS Office magic. The 7.0 files also contain an entry labelled "Workbook". In Tika this makes both MimeTypes (due to the quirk recently discussed in TIKA-806) and the POIFSContainerDetector label them as Excel.
> "Conceptually" they should be vnd.ms-works, but "technically" they are vnd.ms-excel. A special media type seems like a good compromise, in a similar vein to the compromise we reached with TIKA-798.
> I would like to mark them with a new media type: "application/x-tika-msworks-spreadsheet". It would be a subclass of vnd.ms-excel so that:
> # With pure MimeTypes and no name, ms-excel could be returned.
> # With MimeTypes with name and data, the correct type could be returned.
> # With POIFSContainerDetector, the correct type could be returned.
> # They can also be added to the list of types supported by ExcelParser, as it seems to be able to get some content from them.
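
The first three cases in the list above can be sketched in plain Java as follows. This is only an illustration under stated assumptions: the class WorksSubtypeSketch and its methods are hypothetical, the name-refinement rule (a name-based type wins only when it is a specialization of the magic-based type) is a simplified stand-in for what MimeTypes does, the .xlr extension is taken from the attached test file, and application/octet-stream is just a placeholder fallback.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Toy model of the proposed subtype relationship; names and logic are illustrative only. */
class WorksSubtypeSketch {
    static final String EXCEL = "application/vnd.ms-excel";
    static final String WORKS_SS = "application/x-tika-msworks-spreadsheet";

    // child -> parent: the single inheritance edge proposed in the issue.
    static final Map<String, String> SUPERTYPE = Map.of(WORKS_SS, EXCEL);

    /** True if type equals ancestor or inherits from it. */
    static boolean isSpecializationOf(String type, String ancestor) {
        for (String t = type; t != null; t = SUPERTYPE.get(t)) {
            if (t.equals(ancestor)) {
                return true;
            }
        }
        return false;
    }

    /** Cases 1 and 2: magic alone yields the parent; a .xlr name may refine it. */
    static String detectByMagicAndName(String magicType, String fileName) {
        String nameType = null;
        if (fileName != null && fileName.toLowerCase(Locale.ROOT).endsWith(".xlr")) {
            nameType = WORKS_SS; // assumed extension mapping
        }
        if (nameType != null && isSpecializationOf(nameType, magicType)) {
            return nameType;
        }
        return magicType;
    }

    /** Case 3: a container-level check that can see the OLE2 entry names. */
    static String detectByEntries(Set<String> entryNames) {
        if (entryNames.contains("WksSSWorkBook")) {
            return WORKS_SS;
        }
        if (entryNames.contains("Workbook")) {
            return EXCEL;
        }
        return "application/octet-stream"; // placeholder fallback for this sketch
    }

    public static void main(String[] args) {
        System.out.println(detectByMagicAndName(EXCEL, null));
        System.out.println(detectByMagicAndName(EXCEL, "testWORKSSpreadsheet7.0.xlr"));
        System.out.println(detectByEntries(Set.of("Workbook", "WksSSWorkBook")));
    }
}
```

Making the new type a child of vnd.ms-excel is what lets the data-only path fall back to a type that is still technically right, while the name-aware and container-aware paths can return the more specific one.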
