Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2017/10/16 19:08:00 UTC

[jira] [Commented] (TIKA-2471) Tab-prefixed message body lines in Mbox interpreted as headers

    [ https://issues.apache.org/jira/browse/TIKA-2471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16206418#comment-16206418 ] 

Tim Allison commented on TIKA-2471:
-----------------------------------

That looks totally hosed.  Thank you for opening this and supplying an example triggering file. 

bq. But more to the point, what is the idea behind setting the headers in the MboxParser if they're going to be set by the RFC822Parser in any case?

TIKA-1244 brought that behavior in.  Before that, emails weren't treated as embedded files, if I understand correctly.

bq. why does the parser force Windows-1252 as the charset?

Again, no idea, but I suspect that was because of the rfc822 method of encoding.  Are you able to share an example where this corrupts the content?
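
For reference, here is a minimal sketch of the kind of corruption a forced charset could cause. It is a general illustration only, not taken from the attached file or from Tika's code: it assumes a message body that is actually UTF-8 and shows what happens if those bytes are decoded as Windows-1252. The class and method names (CharsetMismatchSketch, decodeAs) are hypothetical, and it assumes the JVM ships the windows-1252 charset, which is not one of the charsets Java guarantees.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetMismatchSketch {

    // Hypothetical helper: decode raw bytes with the given charset.
    static String decodeAs(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "café" encoded as UTF-8; the é becomes the two bytes 0xC3 0xA9.
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        // Assumption: windows-1252 is available in this JVM.
        Charset cp1252 = Charset.forName("windows-1252");

        String right = decodeAs(utf8, StandardCharsets.UTF_8);
        String wrong = decodeAs(utf8, cp1252);

        // Decoding with the wrong charset maps each byte to its own
        // character, so the single é turns into U+00C3 followed by U+00A9.
        System.out.println(right.equals("caf\u00e9"));
        System.out.println(wrong.equals("caf\u00c3\u00a9"));
    }
}
```

Pure ASCII content decodes identically under both charsets, so a triggering example would need non-ASCII characters in a body that is not Windows-1252 encoded.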

> Tab-prefixed message body lines in Mbox interpreted as headers
> --------------------------------------------------------------
>
>                 Key: TIKA-2471
>                 URL: https://issues.apache.org/jira/browse/TIKA-2471
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.16
>            Reporter: Matthew Caruana Galizia
>              Labels: message, rfc822
>         Attachments: mbox
>
>
> The mbox parser code is overly optimistic. It parses the entire message looking for anything that matches a header pattern, wherever it occurs in a line!
> It looks to me like the parsing logic is in desperate need of a refactor. But more to the point, what is the idea behind setting the headers in the MboxParser if they're going to be set by the RFC822Parser in any case?
> Also, out of curiosity, why does the parser force Windows-1252 as the charset?
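
A minimal sketch of the stricter behavior the description asks for, under stated assumptions: it is not Tika's MboxParser code, and the names HeaderBlockSketch and parseHeaders are hypothetical. It treats a line as a header only if it sits before the first blank line and starts with a field name followed by a colon, and it treats space- or tab-prefixed lines as folded continuations only inside that header block. The field-name character range follows the RFC 5322 definition (printable US-ASCII except the colon).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HeaderBlockSketch {

    // Field name: printable US-ASCII except ':' (no spaces), anchored at
    // the start of the line because matches() tests the whole line.
    private static final Pattern HEADER =
            Pattern.compile("([\\x21-\\x39\\x3B-\\x7E]+):[ \\t]*(.*)");

    /** Collects headers from one message's lines, stopping at the first blank line. */
    static Map<String, String> parseHeaders(List<String> lines) {
        Map<String, String> headers = new LinkedHashMap<>();
        String current = null; // name of the header a folded line would extend
        for (String line : lines) {
            if (line.isEmpty()) {
                break; // end of header block; the rest is message body
            }
            char first = line.charAt(0);
            if (first == ' ' || first == '\t') {
                // Folded continuation: append to the previous header, if any.
                if (current != null) {
                    headers.merge(current, line.trim(), (a, b) -> a + " " + b);
                }
                continue;
            }
            Matcher m = HEADER.matcher(line);
            if (m.matches()) {
                current = m.group(1);
                // Simplification: a repeated header name overwrites the earlier value.
                headers.put(current, m.group(2).trim());
            } else {
                current = null; // e.g. the mbox "From " separator line
            }
        }
        return headers;
    }

    public static void main(String[] args) {
        List<String> message = List.of(
                "From sender@example.com Mon Oct 16 19:08:00 2017",
                "Subject: hello",
                "\tworld",
                "To: a@example.com",
                "",
                "\tFake-Header: tab-prefixed body line",
                "Body-Like: also not a header");
        System.out.println(parseHeaders(message));
    }
}
```

With this approach the tab-prefixed body line is never inspected, because the loop stops at the blank line that ends the header block.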



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)