Posted to dev@tika.apache.org by "Antoni Mylka (JIRA)" <ji...@apache.org> on 2010/11/30 21:50:11 UTC

[jira] Issue Comment Edited: (TIKA-560) Improve detection of .mht, Foxmail, and OOXML files

    [ https://issues.apache.org/jira/browse/TIKA-560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965428#action_12965428 ] 

Antoni Mylka edited comment on TIKA-560 at 11/30/10 3:49 PM:
-------------------------------------------------------------

It seems that when the changes to tika-mimetypes.xml were applied in revision 1039496, the modification for application/vnd.ms-excel.sheet.binary.macroenabled.12 was overlooked. I just created a binary workbook with Excel 2007 and it definitely is an OpenXML file. Please change the parent type of application/vnd.ms-excel.sheet.binary.macroenabled.12 to application/x-tika-ooxml.

      was (Author: antoni.mylka):
    It seems that when applying changes to tika-mimetypes.xml in revision 10399496 the modification for application/vnd.ms-excel.sheet.binary.macroenabled.12 got overlooked. I just created a binary workbook with Excel2007 and it definitely is an OpenXML file. Please change the parent type of application/vnd.ms-excel.sheet.binary.macroenabled.12 to application/x-tika-ooxml
  
> Improve detection of .mht, Foxmail, and OOXML files
> ---------------------------------------------------
>
>                 Key: TIKA-560
>                 URL: https://issues.apache.org/jira/browse/TIKA-560
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Antoni Mylka
>         Attachments: test-documents.zip, tika-560.patch
>
>
> I would like to address the following issues
> 1. Reduce the priority of the text/html magics. With the default priority I have lots of .eml, .emlx, mbox and .mht files which contain HTML content but should not be classified as HTML. The reason is that the HTML magic looks for <html> anywhere between offsets 0 and 8192. In Aperture we solved this with an allowsWhiteSpace switch, so that the <html> can be preceded by whitespace but not by other content. Since there is no such switch in Tika, I suggest reducing the priority of the magic in tika-mimetypes. I attach an .mht file from the Aperture test document suite which exhibits the problem.
> 2. Add support for detecting Foxmail files. They come from Foxmail, a mail client popular in China. They are roughly the same as mbox but use a different separator.
> 3. In the case of OOXML files, the container-aware detector computes the MIME type by taking a part of [Content_Types].xml, namely:
> <Default Extension="bin" ContentType="application/vnd.ms-excel.sheet.binary.macroEnabled.main"/>
> then it takes the default content type and returns it with the part after the last dot removed. There are two issues with this approach:
>  a. some documents use macroEnabled, while others use macroenabled, so the case is not standardized
>  b. the "official" MIME types contain a '12' suffix at the end, as shown at: http://technet.microsoft.com/en-us/library/ee309278%28office.12%29.aspx. I suggest standardizing on lowercase and adding the '12' to the appropriate files.
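
To make points 1 and 3 of the description concrete, here is a minimal Java sketch. It is not Tika's or Aperture's actual code: the class and method names are hypothetical, and the rule "append .12 only when the stripped type ends in .macroenabled" is an assumption made for illustration. The first method mimics the described behaviour of dropping everything after the last dot, then applies the proposed lowercase-plus-'12' normalization. The second shows the kind of check an allowsWhiteSpace-style switch might perform, accepting <html> only when nothing but whitespace precedes it.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, for illustration only (not part of Tika or Aperture).
final class DetectionSketch {

    // Point 3: drop the part after the last dot, lowercase the result, and
    // (assumed rule) append ".12" when the type ends in ".macroenabled".
    static String normalizeOoxmlType(String partContentType) {
        int lastDot = partContentType.lastIndexOf('.');
        String base = lastDot < 0
                ? partContentType
                : partContentType.substring(0, lastDot);
        String lower = base.toLowerCase(Locale.ROOT);
        return lower.endsWith(".macroenabled") ? lower + ".12" : lower;
    }

    // Point 1: true only if "<html" (any case) is the first non-whitespace
    // content of the prefix, so mail headers before the tag prevent a match.
    static boolean startsWithHtmlTag(byte[] prefix) {
        String text = new String(prefix, StandardCharsets.ISO_8859_1);
        int i = 0;
        while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
            i++;
        }
        return text.regionMatches(true, i, "<html", 0, 5);
    }

    public static void main(String[] args) {
        System.out.println(normalizeOoxmlType(
                "application/vnd.ms-excel.sheet.binary.macroEnabled.main"));
        System.out.println(startsWithHtmlTag(
                "From: a@example.org\n\n<html>".getBytes(StandardCharsets.US_ASCII)));
    }
}
```

With the content type quoted in point 3, the first method yields application/vnd.ms-excel.sheet.binary.macroenabled.12, which is the type name used in the comment above. A real detector would more likely look the type up in a table than rely on a suffix heuristic.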

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.