Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2017/10/30 16:03:00 UTC
[jira] [Updated] (TIKA-2478) RFC822 includes redundant copies of the text
[ https://issues.apache.org/jira/browse/TIKA-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison updated TIKA-2478:
------------------------------
Summary: RFC822 includes redundant copies of the text (was: MBOX import includes redundant copies of the text)
> RFC822 includes redundant copies of the text
> --------------------------------------------
>
> Key: TIKA-2478
> URL: https://issues.apache.org/jira/browse/TIKA-2478
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.16
> Reporter: Robert Letzler
> Assignee: Tim Allison
> Priority: Minor
> Attachments: TIKA-2478.patch, UET6KCXR5FYIEJYKUCK2AKF3FLXTRNAT.eml, mixed-simple, mixed-with-pdf-inline
>
>
> MBOX messages often get parsed into four documents:
> a. The mbox file, the outer container: "/"
> b. The actual email: "/embedded-1"
> c. The UTF-8 text content of the email: "/embedded-1/embedded-2"
> d. The UTF-8 HTML content of the email: "/embedded-1/embedded-3"
> Entries c and d are redundant and distracting. The MSG parser parses the first non-null email body and then skips the rest. Please modify the MBOX parser so that it does not emit separate "attached" documents for the HTML body and the text body.
> The attachment to https://issues.apache.org/jira/browse/TIKA-2471 is an example of input sufficient to generate this behavior.
> Thanks!
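
For illustration only, the "first non-null body" behavior that the report attributes to the MSG parser could be sketched roughly as follows. This is a minimal, self-contained sketch and not Tika code: the class BodySelectionSketch, the type BodyPart, and the method firstNonNullBody are hypothetical names invented for this example, and the real parsers may choose a body differently.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; none of these names come from Tika.
final class BodySelectionSketch {

    /** Stand-in for one alternative rendering of the same message body. */
    static final class BodyPart {
        final String mimeType;
        final String content; // null when this rendering is absent

        BodyPart(String mimeType, String content) {
            this.mimeType = mimeType;
            this.content = content;
        }
    }

    /**
     * Returns the first alternative whose content is non-null and ignores
     * the rest, so only one copy of the body text is reported.
     * Returns null when no alternative has content.
     */
    static BodyPart firstNonNullBody(List<BodyPart> alternatives) {
        for (BodyPart part : alternatives) {
            if (part.content != null) {
                return part;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<BodyPart> alternatives = Arrays.asList(
                new BodyPart("text/plain", "Hello"),
                new BodyPart("text/html", "<p>Hello</p>"));

        BodyPart chosen = firstNonNullBody(alternatives);
        // Only the chosen rendering is used; the HTML copy is skipped.
        System.out.println(chosen.mimeType); // prints "text/plain"
    }
}
```

With a selection step like this, the plain-text and HTML renderings would no longer surface as two separate embedded documents ("/embedded-1/embedded-2" and "/embedded-1/embedded-3"); only one body would be attached to the email itself.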
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)