Posted to dev@tika.apache.org by "Yury Kats (JIRA)" <ji...@apache.org> on 2018/07/06 21:08:00 UTC

[jira] [Comment Edited] (TIKA-2680) Email attachments to an email are not extracted

    [ https://issues.apache.org/jira/browse/TIKA-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16535351#comment-16535351 ] 

Yury Kats edited comment on TIKA-2680 at 7/6/18 9:07 PM:
---------------------------------------------------------

Indeed, the first embedded rfc822 is not an attachment. I believe this is because it's an Exchange journaled email; note the X-MS-Journal-Report header at the very top.
In this case, the original message is wrapped in another message that can provide additional headers, such as Bcc and expanded distribution lists.
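
To illustrate the wrapping described above, here is a minimal sketch using Python's standard email package (not Tika). It assumes a journal envelope can be recognized by the presence of the X-MS-Journal-Report header and that the original message is the first message/rfc822 part inside it. The function name unwrap_journal_report is hypothetical.

```python
from email.message import Message
from typing import Optional


def unwrap_journal_report(msg: Message) -> Optional[Message]:
    """Return the message wrapped by a journal envelope, or None.

    Assumption: an envelope is identified only by the presence of the
    X-MS-Journal-Report header (its value may be empty, as in the report).
    """
    if "X-MS-Journal-Report" not in msg:
        return None  # an ordinary email, nothing to unwrap
    for part in msg.walk():
        if part is msg:
            continue  # walk() yields the envelope itself first
        if part.get_content_type() == "message/rfc822":
            payload = part.get_payload()
            # A parsed message/rfc822 part holds a one-element list
            # containing the embedded message.
            if isinstance(payload, list) and payload:
                return payload[0]
    return None
```

A caller could treat the returned message as the real top-level email and keep the envelope only as a source of extra metadata.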


was (Author: yurykats):
Indeed, the first embedded rfc822 is not an attachment. I believe this is because it's an Exchange journaled email, see the presence of X-MS-Journal-Report header at the very top. 

> Email attachments to an email are not extracted
> -----------------------------------------------
>
>                 Key: TIKA-2680
>                 URL: https://issues.apache.org/jira/browse/TIKA-2680
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.18
>            Reporter: Yury Kats
>            Assignee: Tim Allison
>            Priority: Major
>         Attachments: nested.eml
>
>
> I have a number of email messages that contain other email messages as attachments (with multiple levels of nesting).
> The email attachments are parts with "Content-Type: message/rfc822" but are not being recognized as such.
> Attached is an example email with multiple levels of attachments:
>  * Subject: Test email within email
>  ** Subject: Email within email test
>  *** Subject: Stand-up today
>  
> I would like to see 3 separate emails parsed out (top level, 1st level attached email, 2nd level attached email), but I only get 1 email and 1 unnamed text attachment:
> {noformat}
> $ java -jar tika-app-1.18.jar -m -J nested.eml | python -m json.tool
> [
> {
> "Author": "Smith Van der, H (Henry) <He...@bank.com>",
> "Content-Length": "16649",
> "Content-Type": "message/rfc822",
> "Creation-Date": "2018-04-25T12:46:41Z",
> "Message-From": "Smith Van der, H (Henry) <He...@bank.com>",
> "Message-To": [
> "fm.SAN Management Team <fm...@bank.com>",
> "Smith Van der, H (Henry) <He...@bank.com>"
> ],
> "Message:From-Email": "Henry.Van.der.Smith@bank.com",
> "Message:From-Name": "Smith Van der, H (Henry)",
> "Message:Raw-Header:Auto-Submitted": "auto-generated",
> "Message:Raw-Header:Content-Transfer-Encoding": "binary",
> "Message:Raw-Header:Keywords": "",
> "Message:Raw-Header:MIME-Version": "1.0",
> "Message:Raw-Header:Message-ID": "<ab...@journal.report.generator>",
> "Message:Raw-Header:Return-Path": "<>",
> "Message:Raw-Header:Sender": "<Mi...@bank.com>",
> "Message:Raw-Header:X-MS-Exchange-Generated-Message-Source": "Journal Agent",
> "Message:Raw-Header:X-MS-Exchange-Parent-Message-Id": "<0f...@BSTS124002.eu.banknet.com>",
> "Message:Raw-Header:X-MS-Journal-Report": "",
> "Multipart-Boundary": "_728aa617-16cf-4d95-8bc2-9f1868397202_",
> "Multipart-Subtype": "mixed",
> "X-Parsed-By": [
> "org.apache.tika.parser.DefaultParser",
> "org.apache.tika.parser.mail.RFC822Parser"
> ],
> "X-TIKA:parse_time_millis": "325",
> "creator": "Smith Van der, H (Henry) <He...@bank.com>",
> "dc:creator": "Smith Van der, H (Henry) <He...@bank.com>",
> "dc:title": "Test email within email",
> "dcterms:created": "2018-04-25T12:46:41Z",
> "meta:author": "Smith Van der, H (Henry) <He...@bank.com>",
> "meta:creation-date": "2018-04-25T12:46:41Z",
> "resourceName": "nested.eml",
> "subject": "Test email within email"
> },
> {
> "Content-Encoding": "US-ASCII",
> "Content-Type": "text/plain; charset=US-ASCII",
> "Multipart-Boundary": "_004_8075737674787666767166806676697476787366657271727266777_",
> "Multipart-Subtype": "mixed",
> "X-Parsed-By": [
> "org.apache.tika.parser.DefaultParser",
> "org.apache.tika.parser.txt.TXTParser"
> ],
> "X-TIKA:embedded_resource_path": "/embedded-1",
> "X-TIKA:parse_time_millis": "5",
> "embeddedResourceType": "ATTACHMENT"
> }
> ]
> {noformat}
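
For comparison, the nesting the report expects (one entry per message/rfc822 level) can be sketched with Python's standard email package. This is not how Tika's RFC822Parser works; it only shows the traversal. The helper names and the sample built by build_sample are hypothetical, and the sample merely reuses the three subjects listed in the report, not the contents of nested.eml.

```python
from email.message import Message
from email.mime.message import MIMEMessage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from typing import Iterator, Tuple


def iter_nested_messages(msg: Message, depth: int = 0) -> Iterator[Tuple[int, str]]:
    """Yield (depth, subject) for msg and every message/rfc822 part below it."""
    yield depth, str(msg.get("Subject", ""))
    yield from _embedded_messages(msg, depth)


def _embedded_messages(part: Message, depth: int) -> Iterator[Tuple[int, str]]:
    payload = part.get_payload()
    if not isinstance(payload, list):
        return  # leaf part (text, binary, ...): nothing nested inside
    for child in payload:
        if part.get_content_type() == "message/rfc822":
            # The single child is the embedded email: one level deeper.
            yield from iter_nested_messages(child, depth + 1)
        else:
            # Multipart container: keep searching at the same depth.
            yield from _embedded_messages(child, depth)


def build_sample() -> Message:
    """Hypothetical three-level email reusing the subjects from the report."""
    level2 = MIMEText("See you at 10.")
    level2["Subject"] = "Stand-up today"

    level1 = MIMEMultipart("mixed")
    level1["Subject"] = "Email within email test"
    level1.attach(MIMEText("Forwarding the invite."))
    level1.attach(MIMEMessage(level2))

    top = MIMEMultipart("mixed")
    top["Subject"] = "Test email within email"
    top.attach(MIMEText("Please see attached."))
    top.attach(MIMEMessage(level1))
    return top
```

Calling list(iter_nested_messages(build_sample())) gives one (depth, subject) pair per email, which corresponds to the three separate emails the report asks for.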


