Posted to dev@tika.apache.org by "Tim Allison (Jira)" <ji...@apache.org> on 2022/03/23 19:31:00 UTC

[jira] [Commented] (TIKA-3706) Handful of docs incorrectly identified as rfc822

    [ https://issues.apache.org/jira/browse/TIKA-3706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17511455#comment-17511455 ] 

Tim Allison commented on TIKA-3706:
-----------------------------------

Do we add a new MIME type for HTTP responses?!

> Handful of docs incorrectly identified as rfc822
> ------------------------------------------------
>
>                 Key: TIKA-3706
>                 URL: https://issues.apache.org/jira/browse/TIKA-3706
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Trivial
>
> In the recent regression tests, we found a small handful of docs now identified as rfc822.
>  
> One example comes from PDFBox's jira:
> [^sc-356376.pdf]
>  
> As Tilman notes on the issue, the file actually includes HTTP headers before the PDF content:
> {noformat}
> HTTP/1.1 200 OK
> Cache-Control: private
> Pragma: Public
> Content-Type: application/pdf; charset=UTF-8
> Server: Microsoft-IIS/7.5
> Set-Cookie: ASP.NET_SessionId=ibc3nfydvyfh1z55zqis2q3y; path=/; HttpOnly
> Content-Disposition: inline; filename=_MTR_AGHS_EN.pdf
> X-AspNet-Version: 2.0.50727
> X-Powered-By: ASP.NET
> Date: Fri, 18 Sep 2015 17:30:08 GMT
> Content-Length: 56779
> %PDF-1.4 
> {noformat}
> I'm not sure how, or whether, we want to fix these.  I'm going to look at the others.  Yes, others also come with HTTP headers: https://corpora.tika.apache.org/base/docs/commoncrawl3/QL/QLPQA77R36REFEF3ICLL2NPTXWJXKV54
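
To make the failure mode concrete, here is a minimal sketch of how such files could be recognized: check whether the bytes start with an HTTP/1.x status line, then look for the "%PDF-" marker within a bounded prefix. This is not Tika's detector and does not use any Tika API; the class name, the method names and the 8192-byte scan limit are all hypothetical choices made for illustration. It searches for the PDF marker rather than for a blank line after the headers because the sample above shows no blank line before "%PDF-1.4".

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika: spots an HTTP response prefix
// in front of a PDF and reports where the PDF content starts.
final class HttpPrefixSniffer {

    // Assumed upper bound on how far to scan for the PDF marker.
    static final int MAX_SCAN = 8192;

    private static final byte[] HTTP_PREFIX =
            "HTTP/1.".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] PDF_MARKER =
            "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** True if data begins with "HTTP/1.", as in "HTTP/1.1 200 OK". */
    static boolean startsWithHttpStatusLine(byte[] data) {
        if (data.length < HTTP_PREFIX.length) {
            return false;
        }
        for (int i = 0; i < HTTP_PREFIX.length; i++) {
            if (data[i] != HTTP_PREFIX[i]) {
                return false;
            }
        }
        return true;
    }

    /** Offset of the first match of marker that starts before limit, or -1. */
    static int indexOf(byte[] data, byte[] marker, int limit) {
        int last = Math.min(data.length, limit + marker.length - 1) - marker.length;
        for (int i = 0; i <= last; i++) {
            boolean match = true;
            for (int j = 0; j < marker.length; j++) {
                if (data[i + j] != marker[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return i;
            }
        }
        return -1;
    }

    /**
     * Offset of "%PDF-" if data looks like an HTTP response wrapping a PDF;
     * -1 if there is no HTTP status line or no marker within MAX_SCAN bytes.
     */
    static int findWrappedPdfOffset(byte[] data) {
        if (!startsWithHttpStatusLine(data)) {
            return -1;
        }
        return indexOf(data, PDF_MARKER, MAX_SCAN);
    }
}
```

A caller could use a non-negative result from findWrappedPdfOffset to re-run detection on the bytes starting at that offset. Whether that is preferable to reporting a dedicated HTTP response type is the open question in this issue.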



--
This message was sent by Atlassian Jira
(v8.20.1#820001)