Posted to dev@tika.apache.org by "Thierry Guérin (Jira)" <ji...@apache.org> on 2022/03/03 18:23:00 UTC

[jira] [Comment Edited] (TIKA-3687) Email file detected as text/html

    [ https://issues.apache.org/jira/browse/TIKA-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17500962#comment-17500962 ] 

Thierry Guérin edited comment on TIKA-3687 at 3/3/22, 6:22 PM:
---------------------------------------------------------------

Created a pull request: https://github.com/apache/tika/pull/520

I went with changing the look-ahead for the X|DKIM|ARC headers from 0 to 1024. The other solution was to increase the existing 1024 limit (in lines 6407-6420) to at least 8000, since I have another email in which the first 'From:' is at around character 6400. I'm sure someone here has a good idea of which option is the more efficient.

So far, I have only found examples with a single 'Received:' header before the 'ARC*' headers, which is why I think 1024 may be overkill.
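
To illustrate the two options, here is a minimal Python sketch of a header check limited to an offset window. It is not Tika's actual matcher; the function name `header_in_window`, the sample message, and its filler header are all made up for illustration. It only shows why a window of 0 misses an 'ARC-' header that follows a 'Received:' header, and why a 1024 window misses a late 'From:' header.

```python
def header_in_window(data: bytes, header: bytes, start: int, end: int) -> bool:
    """Return True if `header` begins a line at an offset within [start, end].

    Hypothetical helper, only meant to mimic an offset-limited look-ahead.
    """
    # A header at the very beginning of the data has no preceding newline.
    if start == 0 and data.startswith(header):
        return True
    # Otherwise the header must directly follow a line break.
    needle = b"\n" + header
    pos = data.find(needle, max(start - 1, 0))
    # pos points at the newline, so the header itself starts at pos + 1.
    return pos != -1 and pos + 1 <= end


# Made-up message: one 'Received:' header, then an ARC header, then enough
# filler to push the first 'From:' header past offset 1024.
received = b"Received: from mail.example.org by mx.example.org;\r\n"
arc = b"ARC-Seal: i=1; a=rsa-sha256; cv=none;\r\n"
filler = b"X-Filler: " + b"a" * 1500 + b"\r\n"
message = received + arc + filler + b"From: someone@example.org\r\n"

arc_at_start = header_in_window(message, b"ARC-", 0, 0)         # window of 0
arc_in_1024 = header_in_window(message, b"ARC-", 0, 1024)       # widened window
from_in_1024 = header_in_window(message, b"From:", 0, 1024)     # current limit
from_in_8000 = header_in_window(message, b"From:", 0, 8000)     # raised limit
```

With this sample, only the widened 'ARC-' window and the raised 'From:' limit find a match, which corresponds to the two alternatives described above.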



> Email file detected as text/html
> --------------------------------
>
>                 Key: TIKA-3687
>                 URL: https://issues.apache.org/jira/browse/TIKA-3687
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 2.3.0
>            Reporter: Thierry Guérin
>            Priority: Minor
>         Attachments: testRFC822-ARC.eml
>
>
> The attached email (which I redacted from a real email received from Office365) is detected as HTML.
> This is because it contains ARC-* headers, but they are not the first headers in the file, so the matcher that looks for ARC headers fails. The matcher for the regular 'From' header also fails, because the 'From' header occurs after the first 1024 characters.


