Posted to dev@tika.apache.org by "Tim Allison (Jira)" <ji...@apache.org> on 2022/03/23 19:24:00 UTC
[jira] [Created] (TIKA-3706) Handful of docs incorrectly identified as rfc822
Tim Allison created TIKA-3706:
---------------------------------
Summary: Handful of docs incorrectly identified as rfc822
Key: TIKA-3706
URL: https://issues.apache.org/jira/browse/TIKA-3706
Project: Tika
Issue Type: Task
Reporter: Tim Allison
In the recent regression tests, we found a handful of docs that are now incorrectly identified as rfc822.
One example comes from PDFBox's jira:
[^sc-356376.pdf]
As Tilman notes on the issue, the file actually includes HTTP response headers before the PDF content:
{noformat}
HTTP/1.1 200 OK
Cache-Control: private
Pragma: Public
Content-Type: application/pdf; charset=UTF-8
Server: Microsoft-IIS/7.5
Set-Cookie: ASP.NET_SessionId=ibc3nfydvyfh1z55zqis2q3y; path=/; HttpOnly
Content-Disposition: inline; filename=_MTR_AGHS_EN.pdf
X-AspNet-Version: 2.0.50727
X-Powered-By: ASP.NET
Date: Fri, 18 Sep 2015 17:30:08 GMT
Content-Length: 56779
%PDF-1.4
{noformat}
I'm not sure how or if we want to fix these. I'm going to look at the others.
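For illustration only, here is a rough sketch of how such files could be spotted: check whether the bytes start with an HTTP status line and, if so, look for the PDF magic further in. The class and method names below are hypothetical and not part of Tika; this is not a proposed patch, just a way to make the pattern concrete.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of Tika: spots an HTTP response preamble ahead of a PDF. */
public final class HttpPreambleSniffer {

    private static final byte[] HTTP_PREFIX = "HTTP/".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    private HttpPreambleSniffer() {}

    /** True if the buffer begins with "HTTP/", i.e. looks like an HTTP status line. */
    public static boolean startsWithHttpStatusLine(byte[] buf) {
        return matchesAt(buf, 0, HTTP_PREFIX);
    }

    /** Offset of the first "%PDF-" in the buffer, or -1 if it is absent. */
    public static int pdfOffset(byte[] buf) {
        for (int i = 0; i + PDF_MAGIC.length <= buf.length; i++) {
            if (matchesAt(buf, i, PDF_MAGIC)) {
                return i;
            }
        }
        return -1;
    }

    /** Byte-wise comparison of needle against buf starting at pos. */
    private static boolean matchesAt(byte[] buf, int pos, byte[] needle) {
        if (pos + needle.length > buf.length) {
            return false;
        }
        for (int j = 0; j < needle.length; j++) {
            if (buf[pos + j] != needle[j]) {
                return false;
            }
        }
        return true;
    }
}
```

A caller could treat a file as a PDF with a stray HTTP preamble when startsWithHttpStatusLine returns true and pdfOffset returns a non-negative value inside whatever look-ahead window it is willing to scan. The size of that window, and whether this belongs in detection at all, are open questions.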
--
This message was sent by Atlassian Jira
(v8.20.1#820001)