Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2015/06/30 20:27:05 UTC

[jira] [Comment Edited] (TIKA-1602) Detecting standards-non-compliant emails as message/rfc822

    [ https://issues.apache.org/jira/browse/TIKA-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14608804#comment-14608804 ] 

Tim Allison edited comment on TIKA-1602 at 6/30/15 6:26 PM:
------------------------------------------------------------

One file out of 116,960 text/plain files in govdocs1 was misidentified as rfc822.  No other diffs found.

Ymmv.

What's odd (to me) is that the rfc822 parser reported a long chain of empty embedded documents for that file (the embedded resource path below runs to embedded-54), and none of them had any text:

{noformat}
  {
    "Content-Type": "application/zip",
    "X-Parsed-By": [
      "org.apache.tika.parser.DefaultParser",
      "org.apache.tika.parser.pkg.PackageParser"
    ],
    "X-TIKA:content": "\n036491.txt\n\n",
    "X-TIKA:digest:MD5": "e7cf541cbd061b63c03035ec692b86c9",
    "X-TIKA:digest:SHA256": "96b29ca0c2206feafd6115d993c1fb20ead631381f048442c87870934fb2cd8e",
    "X-TIKA:parse_time_millis": "140"
  },
  {
    "Content-Encoding": "US-ASCII",
    "Content-Type": "text/plain; charset\u003dUS-ASCII",
    "X-Parsed-By": [
      "org.apache.tika.parser.DefaultParser",
      "org.apache.tika.parser.txt.TXTParser"
    ],
    "X-TIKA:digest:MD5": "4ef8164712f6491c2848e861336987b5",
    "X-TIKA:digest:SHA256": "9c8d0d8dc8633ab1a8324bcd19679616729360171fde33812b12c335938f45dc",
    "X-TIKA:embedded_resource_path": "embedded-1/036491.txt/embedded-2/embedded-3/embedded-4/embedded-5/embedded-6/embedded-7/embedded-8/embedded-9/embedded-10/embedded-11/embedded-12/embedded-13/embedded-14/embedded-15/embedded-16/embedded-17/embedded-18/embedded-19/embedded-20/embedded-21/embedded-22/embedded-23/embedded-24/embedded-25/embedded-26/embedded-27/embedded-28/embedded-29/embedded-30/embedded-31/embedded-32/embedded-33/embedded-34/embedded-35/embedded-36/embedded-37/embedded-38/embedded-39/embedded-40/embedded-41/embedded-42/embedded-43/embedded-44/embedded-45/embedded-46/embedded-47/embedded-48/embedded-49/embedded-50/embedded-51/embedded-52/embedded-53/embedded-54"
  },
{noformat}


was (Author: tallison@mitre.org):
One file out of 116,960 text/plain files was misidentified as rfc822 in govdocs1.  Ymmv.

What's odd (to me) is that the rfc parser parsed lots and lots of empty embedded documents, and none of them had any text:

{noformat}
  {
    "Content-Type": "application/zip",
    "X-Parsed-By": [
      "org.apache.tika.parser.DefaultParser",
      "org.apache.tika.parser.pkg.PackageParser"
    ],
    "X-TIKA:content": "\n036491.txt\n\n",
    "X-TIKA:digest:MD5": "e7cf541cbd061b63c03035ec692b86c9",
    "X-TIKA:digest:SHA256": "96b29ca0c2206feafd6115d993c1fb20ead631381f048442c87870934fb2cd8e",
    "X-TIKA:parse_time_millis": "140"
  },
  {
    "Content-Encoding": "US-ASCII",
    "Content-Type": "text/plain; charset\u003dUS-ASCII",
    "X-Parsed-By": [
      "org.apache.tika.parser.DefaultParser",
      "org.apache.tika.parser.txt.TXTParser"
    ],
    "X-TIKA:digest:MD5": "4ef8164712f6491c2848e861336987b5",
    "X-TIKA:digest:SHA256": "9c8d0d8dc8633ab1a8324bcd19679616729360171fde33812b12c335938f45dc",
    "X-TIKA:embedded_resource_path": "embedded-1/036491.txt/embedded-2/embedded-3/embedded-4/embedded-5/embedded-6/embedded-7/embedded-8/embedded-9/embedded-10/embedded-11/embedded-12/embedded-13/embedded-14/embedded-15/embedded-16/embedded-17/embedded-18/embedded-19/embedded-20/embedded-21/embedded-22/embedded-23/embedded-24/embedded-25/embedded-26/embedded-27/embedded-28/embedded-29/embedded-30/embedded-31/embedded-32/embedded-33/embedded-34/embedded-35/embedded-36/embedded-37/embedded-38/embedded-39/embedded-40/embedded-41/embedded-42/embedded-43/embedded-44/embedded-45/embedded-46/embedded-47/embedded-48/embedded-49/embedded-50/embedded-51/embedded-52/embedded-53/embedded-54"
  },
{noformat}

> Detecting standards-non-compliant emails as message/rfc822
> ----------------------------------------------------------
>
>                 Key: TIKA-1602
>                 URL: https://issues.apache.org/jira/browse/TIKA-1602
>             Project: Tika
>          Issue Type: New Feature
>            Reporter: Jeremy B. Merrill
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>         Attachments: 036491.txt.zip
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Tika does not properly detect certain emails as `message/rfc822` if they're slightly standards-non-compliant and begin with `Status: ` as the first header. I've added `Status: ` as a magic detection line in tika-mimetypes.xml. 
> This solves my problem and does not appear to cause unit test failures. I have not yet run the tika-batch tests.
> For context, the emails that are detected incorrectly come from dumps taken directly from various US public officials' mail servers. I believe that, because the dumps are not intended to be transmitted over the wire, they are sometimes slightly non-compliant. 
> It's important to note that Tika and the underlying library, James Mime4J, do properly *parse* these emails, despite the non-compliant header. The problem is getting Tika to *detect* the file as an email so that Mime4J gets chosen to parse it.
> Pull request on GitHub at https://github.com/apache/tika/pull/40
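
For readers unfamiliar with magic-based detection, the idea in the issue description can be sketched roughly as follows. This is a minimal, self-contained illustration and not Tika's actual detector: the class name `MagicPrefixSketch`, the `detect` method, and the prefix list are all hypothetical, and the real patterns live in tika-mimetypes.xml. The sketch only assumes that a file is treated as `message/rfc822` when its first bytes match a known header prefix.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of offset-0 magic matching; NOT Tika's real implementation. */
final class MagicPrefixSketch {

    // Illustrative prefixes only (an assumption). "Status: " stands in for the
    // new magic line proposed in the issue.
    private static final String[] RFC822_PREFIXES = {
        "Return-Path:", "Received:", "Message-ID:", "Status: "
    };

    /** Returns "message/rfc822" if the data starts with a listed prefix, else "text/plain". */
    static String detect(byte[] data) {
        for (String prefix : RFC822_PREFIXES) {
            if (startsWith(data, prefix.getBytes(StandardCharsets.US_ASCII))) {
                return "message/rfc822";
            }
        }
        return "text/plain";
    }

    // Byte-wise comparison of the leading bytes against one magic pattern.
    private static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // A mail dump whose first header is "Status: ".
        byte[] dump = "Status: RO\nFrom: someone@example.org\n\nbody"
                .getBytes(StandardCharsets.US_ASCII);
        // An ordinary text file that happens to begin with the same bytes.
        byte[] memo = "Status: project on hold until further notice."
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(dump));
        System.out.println(detect(memo));
    }
}
```

Both inputs come back as `message/rfc822` in this sketch. The second is a made-up plain-text memo, and it shows the general way a short prefix match can produce a false positive like the single misidentified text/plain file reported in the comment above. It says nothing about why that particular file matched.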



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)