Posted to dev@tika.apache.org by "Hudson (JIRA)" <ji...@apache.org> on 2018/10/03 20:34:00 UTC

[jira] [Commented] (TIKA-2478) RFC822 includes redundant copies of the text

    [ https://issues.apache.org/jira/browse/TIKA-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16637482#comment-16637482 ] 

Hudson commented on TIKA-2478:
------------------------------

FAILURE: Integrated in Jenkins build tika-2.x-windows #326 (See [https://builds.apache.org/job/tika-2.x-windows/326/])
TIKA-2478 -- maxFiles should take an argument...duh (tallison: rev c068479d7ed1734f75e24fad24572b99c9c3a4c6)
* (edit) tika-server/src/test/java/org/apache/tika/server/TikaServerIntegrationTest.java
* (edit) tika-server/src/main/java/org/apache/tika/server/TikaServerCli.java
TIKA-2478 -- add preliminary pseudo test for -maxFiles (tallison: rev 7e798ef8603bf40ea4a17a125aaff36677478353)
* (edit) tika-server/src/test/java/org/apache/tika/server/TikaServerIntegrationTest.java


> RFC822 includes redundant copies of the text
> --------------------------------------------
>
>                 Key: TIKA-2478
>                 URL: https://issues.apache.org/jira/browse/TIKA-2478
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.16
>            Reporter: Robert Letzler
>            Assignee: Tim Allison
>            Priority: Minor
>             Fix For: 1.17
>
>         Attachments: TIKA-2478.patch, UET6KCXR5FYIEJYKUCK2AKF3FLXTRNAT.eml, mixed-simple, mixed-with-pdf-inline
>
>
> MBOX messages often get parsed into four documents:
> a. The mbox file, the outer container: "/"
> b. The actual email: "/embedded-1"
> c. The UTF-8 text content of the email: "/embedded-1/embedded-2"
> d. The UTF-8 HTML content of the email: "/embedded-1/embedded-3"
> Entries c and d are redundant and distracting. The MSG parser parses the first non-null email body and then skips the rest. Please modify the MBOX parser so that it does not emit separate "attached" documents for the HTML body and the text body.
> The attachment to https://issues.apache.org/jira/browse/TIKA-2471 is an example of input sufficient to generate this behavior.
> Thanks!
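
The redundancy described in the issue comes from the text and HTML bodies being two alternatives of the same message. The following is a minimal sketch of the "use the first non-empty body, skip the rest" behavior that the reporter attributes to the MSG parser. It is not Tika code: it uses Python's standard `email` package, and the function names `build_sample_message`, `body_alternatives`, and `first_non_empty_body` as well as the sample addresses are hypothetical, chosen only for illustration.

```python
from email.message import EmailMessage


def build_sample_message() -> EmailMessage:
    """Build a message with a text/plain body and a text/html alternative."""
    msg = EmailMessage()
    msg["Subject"] = "sample"
    msg["From"] = "a@example.com"   # hypothetical address
    msg["To"] = "b@example.com"     # hypothetical address
    msg.set_content("Hello, world.")
    # Converts the message to multipart/alternative with two body parts.
    msg.add_alternative("<p>Hello, world.</p>", subtype="html")
    return msg


def body_alternatives(msg: EmailMessage) -> list:
    """List the content types of all text parts found in the message."""
    return [
        part.get_content_type()
        for part in msg.walk()
        if part.get_content_maintype() == "text"
    ]


def first_non_empty_body(msg: EmailMessage):
    """Return (content_type, text) of the first non-empty text part, or None.

    Assumption: later alternatives carry the same content and can be
    skipped instead of being reported as separate embedded documents.
    """
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skip the multipart container itself
        if part.get_content_disposition() == "attachment":
            continue  # real attachments are not body alternatives
        text = part.get_content().strip()
        if text:
            return part.get_content_type(), text
    return None


if __name__ == "__main__":
    message = build_sample_message()
    print(body_alternatives(message))
    print(first_non_empty_body(message))
```

In this sketch both alternatives are still visible to `body_alternatives`, but only one of them is selected as the body, which corresponds to dropping entries c and d as separate documents.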



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)