Posted to dev@tika.apache.org by "Antoni Mylka (JIRA)" <ji...@apache.org> on 2010/11/25 20:41:15 UTC

[jira] Updated: (TIKA-561) Support EMLX file detection

     [ https://issues.apache.org/jira/browse/TIKA-561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoni Mylka updated TIKA-561:
------------------------------

    Attachment: tika-561.patch

A patch containing the modifications and the test file. It overlaps with my patch to TIKA-560, but I wanted to make both of them self-contained.

The test email contains a newsletter from CNET. It's public, but I don't know whether ASF policy allows committing it. If not, please find someone with Apple Mail and have them create a normal HTML email.

Note that this works only because the priority of the text/html magics has been reduced, as explained in TIKA-560.

> Support EMLX file detection
> ---------------------------
>
>                 Key: TIKA-561
>                 URL: https://issues.apache.org/jira/browse/TIKA-561
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Antoni Mylka
>         Attachments: tika-561.patch
>
>
> Apple Mail generates email files in .emlx format. They roughly resemble standard rfc822 .eml files but are different.
> On the first line they have the content length in bytes,
> then on the second line, normal rfc822 content starts
> and afterwards there is some XML metadata.
> I would suggest adding support for .emlx files to tika-mimetypes.xml. Just copy the message/rfc822 definitions and state that they should appear at offsets 3:10; this should be enough to accommodate the content length on the first line. Any reasonable email should be longer than 9 bytes. In this case the first line would have two bytes, then the line break, and normal rfc822 headers can start at offset 4. This will work for emails up to 99 MB (99 999 999 bytes).
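
The layout described above can be sketched in a few lines of Python. This is only an illustration of the format as the issue describes it (a length line, then the rfc822 message, then XML metadata), not the detection logic in the patch. The function name split_emlx and the in-memory sample are hypothetical, and the sketch assumes the first line holds nothing but decimal digits and optional whitespace.

```python
def split_emlx(data):
    """Split raw .emlx bytes into (rfc822_message, trailing_metadata).

    Assumes the layout described in the issue: a first line holding the
    message length in bytes as decimal digits, then the message itself,
    then XML metadata filling the rest of the file.
    """
    first_line, sep, rest = data.partition(b"\n")
    if not sep:
        raise ValueError("no line break after the length line")
    # decode() and int() both raise ValueError subclasses on bad input.
    length = int(first_line.decode("ascii").strip())
    if length < 0 or length > len(rest):
        raise ValueError("declared length does not fit the file")
    return rest[:length], rest[length:]


# Hypothetical sample built in memory; real files come from Apple Mail.
message = b"Subject: hi\r\n\r\nbody\r\n"
sample = (
    str(len(message)).encode("ascii")
    + b"\n"
    + message
    + b'<?xml version="1.0"?><plist/>'
)

rfc822_part, metadata = split_emlx(sample)
print(rfc822_part == message)
print(metadata[:5])
```

Because the length line has a variable number of digits, the rfc822 headers do not start at a fixed position, which is why the suggestion above uses an offset range rather than a single offset.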

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.