Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2015/07/01 12:52:04 UTC

[jira] [Commented] (TIKA-1602) Detecting standards-non-compliant emails as message/rfc822

    [ https://issues.apache.org/jira/browse/TIKA-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609915#comment-14609915 ] 

Tim Allison commented on TIKA-1602:
-----------------------------------

+1.

This feels hacky, but we can undo it.  Govdocs1 is limited, and our mileage will vary.  Hopefully, someone will have the time to work on TIKA-879 soon.

[~jeremybmerrill], I'm sorry for taking so long to get around to running this simple test.  Out of curiosity, what other headers were you getting in that batch of emails?  I'm wondering whether there are more specific rfc822-ish headers that we could rely on, or were you only getting "Status:"?

> Detecting standards-non-compliant emails as message/rfc822
> ----------------------------------------------------------
>
>                 Key: TIKA-1602
>                 URL: https://issues.apache.org/jira/browse/TIKA-1602
>             Project: Tika
>          Issue Type: New Feature
>            Reporter: Jeremy B. Merrill
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>         Attachments: 036491.txt.zip
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Tika does not properly detect certain emails as `message/rfc822` if they're slightly standards-non-compliant and begin with `Status: ` as the first header. I've added `Status: ` as a magic detection line in tika-mimetypes.xml. 
> This solves my problem and does not appear to cause unit test failures. I have not yet run the tika-batch tests.
> As further information, the emails that are processed incorrectly come from dumps taken directly from various US public officials' mailservers. I believe that, because the dumps are not intended to be transmitted over the wire, they are sometimes slightly non-compliant. 
> It's important to note that Tika (and the underlying library, James Mime4J) does properly *parse* these emails, despite the non-compliant header. The problem is getting Tika to *detect* the file as an email so that Mime4J gets chosen to parse it.
> Pull request on GitHub at https://github.com/apache/tika/pull/40
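
To illustrate the detect-versus-parse distinction described in the issue, here is a minimal Java sketch of prefix-based ("magic") detection. It is not Tika's implementation and does not use the Tika API. The class name `MagicPrefixSketch`, the method `detect`, every header in the list other than `Status:`, and the `text/plain` fallback are assumptions made purely for illustration.

```java
import java.nio.charset.StandardCharsets;

public class MagicPrefixSketch {
    // Hypothetical list of header names treated as magic prefixes.
    // Only "Status:" comes from the issue; the others are illustrative.
    static final String[] MAGIC_PREFIXES = {
        "Return-Path:", "Received:", "From:", "Status:"
    };

    // Returns "message/rfc822" if the data begins with a known header prefix,
    // otherwise a placeholder fallback type (an assumption of this sketch).
    static String detect(byte[] head) {
        for (String prefix : MAGIC_PREFIXES) {
            byte[] magic = prefix.getBytes(StandardCharsets.US_ASCII);
            if (startsWith(head, magic)) {
                return "message/rfc822";
            }
        }
        return "text/plain";
    }

    // True if data starts with the given prefix bytes at offset 0.
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] dump = "Status: RO\r\nFrom: someone@example.com\r\n\r\nBody"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(dump));
    }
}
```

In this sketch, dropping `"Status:"` from the list is a one-line change, which is the sense in which such a detection tweak is easy to undo. It also shows the risk: any plain-text file that happens to begin with `Status:` would be classified as an email.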



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)