Posted to dev@tika.apache.org by "Matthew Caruana Galizia (JIRA)" <ji...@apache.org> on 2017/09/29 11:07:00 UTC

[jira] [Updated] (TIKA-2471) Tab-prefixed message body lines in Mbox interpreted as headers

     [ https://issues.apache.org/jira/browse/TIKA-2471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matthew Caruana Galizia updated TIKA-2471:
------------------------------------------
    Attachment: mbox

Reduced test case attached. Parsing this file produces metadata keys with names like {{MboxParser-class=3dmsonormal>sincerely,<o\:p></o}}, which come from the HTML message body rather than from the headers.

> Tab-prefixed message body lines in Mbox interpreted as headers
> --------------------------------------------------------------
>
>                 Key: TIKA-2471
>                 URL: https://issues.apache.org/jira/browse/TIKA-2471
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.16
>            Reporter: Matthew Caruana Galizia
>              Labels: message, rfc822
>         Attachments: mbox
>
>
> The mbox parser code is overly permissive. It scans the entire message, body included, and treats anything that matches the header pattern as a header, wherever the match occurs in a line!
> It looks to me like the parsing logic is in desperate need of a refactor. But more to the point, what is the idea behind setting the headers in the MboxParser if they're going to be set by the RFC822Parser in any case?
> Also, out of curiosity, why does the parser force Windows-1252 as the charset?
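
For illustration, here is a minimal sketch of the stricter behaviour the report asks for. It is not Tika's MboxParser code. The class name HeaderBlockSketch, the method parseHeaders and the header pattern are all hypothetical, and the field-name syntax (printable ASCII other than a colon) is assumed from RFC 5322. The sketch reads headers only up to the first empty line, requires the whole line to match the pattern, and treats lines starting with a space or tab as continuations of the previous header instead of new headers.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not Tika's MboxParser.
final class HeaderBlockSketch {

    // Assumed field-name syntax: printable ASCII except ':' (as in RFC 5322).
    // Used with matches(), so the name must start at the beginning of the line.
    private static final Pattern HEADER =
            Pattern.compile("([!-9;-~]+):[ \\t]*(.*)");

    /** Collects headers from the header block only; the body is never scanned. */
    static Map<String, String> parseHeaders(String message) {
        Map<String, String> headers = new LinkedHashMap<>();
        String current = null; // name of the header a continuation line belongs to
        for (String line : message.split("\r?\n", -1)) {
            if (line.isEmpty()) {
                break; // the first empty line ends the header block
            }
            char first = line.charAt(0);
            if (first == ' ' || first == '\t') {
                // Folded continuation line: append to the previous header, if any.
                if (current != null) {
                    headers.merge(current, line.trim(), (a, b) -> a + " " + b);
                }
                continue;
            }
            Matcher m = HEADER.matcher(line);
            if (m.matches()) {
                current = m.group(1).toLowerCase(Locale.ROOT);
                headers.put(current, m.group(2)); // a repeated header overwrites the earlier one
            } else {
                current = null; // e.g. the mbox "From " separator line
            }
        }
        return headers;
    }

    public static void main(String[] args) {
        String message = "From sender@example.com Fri Sep 29 11:07:00 2017\n"
                + "Subject: Hello\n"
                + "\tworld\n"
                + "To: dev@example.com\n"
                + "\n"
                + "\t<p class=3DMsoNormal>Sincerely,<o:p></o:p></p>\n";
        System.out.println(parseHeaders(message));
    }
}
```

With the sample message in main, the returned map holds only the subject and to entries. The tab-prefixed body line after the empty line is never examined, so it cannot produce a metadata key.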



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)