Posted to dev@tika.apache.org by "Tim Allison (Jira)" <ji...@apache.org> on 2022/03/23 19:24:00 UTC

[jira] [Created] (TIKA-3706) Handful of docs incorrectly identified as rfc822

Tim Allison created TIKA-3706:
---------------------------------

             Summary: Handful of docs incorrectly identified as rfc822
                 Key: TIKA-3706
                 URL: https://issues.apache.org/jira/browse/TIKA-3706
             Project: Tika
          Issue Type: Task
            Reporter: Tim Allison


In the recent regression tests, we found a small handful of docs that are now incorrectly identified as rfc822.

One example comes from PDFBox's Jira:

[^sc-356376.pdf]

As Tilman notes on the issue, the file actually includes HTTP response headers before the start of the PDF:
{noformat}
HTTP/1.1 200 OK
Cache-Control: private
Pragma: Public
Content-Type: application/pdf; charset=UTF-8
Server: Microsoft-IIS/7.5
Set-Cookie: ASP.NET_SessionId=ibc3nfydvyfh1z55zqis2q3y; path=/; HttpOnly
Content-Disposition: inline; filename=_MTR_AGHS_EN.pdf
X-AspNet-Version: 2.0.50727
X-Powered-By: ASP.NET
Date: Fri, 18 Sep 2015 17:30:08 GMT
Content-Length: 56779

%PDF-1.4 
{noformat}

I'm not sure how, or whether, we want to fix these.  I'm going to look at the others.
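
For illustration only, here is a rough Python sketch of one way such files might be recognized: check whether the data starts with an HTTP status line, skip to the first blank line, and see whether the PDF magic follows. This is not Tika code and not a proposed fix; the function names and the regular expression are assumptions made up for this example.

```python
import re

# Matches an HTTP/1.x status line such as "HTTP/1.1 200 OK" at the very
# start of the data. (Hypothetical pattern chosen for this sketch.)
HTTP_STATUS_LINE = re.compile(rb"\AHTTP/\d\.\d \d{3}[^\r\n]*\r?\n")

# Every PDF header begins with these bytes.
PDF_MAGIC = b"%PDF-"


def find_payload_offset(data: bytes) -> int:
    """Return the offset just past a leading HTTP header block.

    Returns 0 if the data does not start with an HTTP status line or
    if no blank line terminating the header block is found.
    """
    if not HTTP_STATUS_LINE.match(data):
        return 0
    # The header block ends at the first empty line (CRLF or bare LF).
    end_of_headers = re.search(rb"\r?\n\r?\n", data)
    if end_of_headers is None:
        return 0
    return end_of_headers.end()


def looks_like_pdf_behind_http_headers(data: bytes) -> bool:
    """True if a PDF header directly follows a leading HTTP header block."""
    offset = find_payload_offset(data)
    return offset > 0 and data.startswith(PDF_MAGIC, offset)


if __name__ == "__main__":
    sample = (
        b"HTTP/1.1 200 OK\r\n"
        b"Content-Type: application/pdf; charset=UTF-8\r\n"
        b"Content-Length: 56779\r\n"
        b"\r\n"
        b"%PDF-1.4 \n"
    )
    print(looks_like_pdf_behind_http_headers(sample))
```

The check is deliberately narrow: it only fires when the data begins with an HTTP status line, so ordinary rfc822 messages (which start with header fields such as "From:" rather than a status line) are left alone.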


