Posted to dev@tika.apache.org by "Tim Allison (Jira)" <ji...@apache.org> on 2024/02/23 17:24:00 UTC

[jira] [Updated] (TIKA-4202) Add page count of OCR'd pages in metadata for PDF files

     [ https://issues.apache.org/jira/browse/TIKA-4202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-4202:
------------------------------
    Summary: Add page count of OCR'd pages in metadata for PDF files  (was: Add page count of OCR'd pages in PDF's metadata)

> Add page count of OCR'd pages in metadata for PDF files
> -------------------------------------------------------
>
>                 Key: TIKA-4202
>                 URL: https://issues.apache.org/jira/browse/TIKA-4202
>             Project: Tika
>          Issue Type: New Feature
>            Reporter: Tim Allison
>            Priority: Minor
>
> It would be useful to store the number of pages that triggered OCR in PDFs.
> PDFs are treated differently from other files: by default, each PDF page is rendered and OCR is run "inline", whereas for other file formats OCR is run on embedded images, which are treated as embedded files. For regular files, we can count the embedded images for which Tesseract is recorded as the parser, but we can't do that with PDFs ... yet.
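
As a rough illustration of the counting that already works for non-PDF files, here is a minimal sketch. It assumes the per-document metadata is available as a list of plain maps (container first, embedded files after, as with recursive parsing), that the parsers used are recorded under the "X-TIKA:Parsed-By" key, and that the OCR parser appears there by its class name. The class and method names (OcrEmbeddedCounter, countOcrEmbedded) are hypothetical and not part of Tika.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Tika: counts embedded files handled by Tesseract.
final class OcrEmbeddedCounter {
    // Assumed metadata key listing the parsers that handled a document.
    static final String PARSED_BY = "X-TIKA:Parsed-By";
    // Assumed value recorded when the Tesseract OCR parser ran.
    static final String TESSERACT = "org.apache.tika.parser.ocr.TesseractOCRParser";

    // metadataList: one map per document; index 0 is assumed to be the container file.
    static long countOcrEmbedded(List<Map<String, List<String>>> metadataList) {
        return metadataList.stream()
                .skip(1) // skip the container; only embedded files are counted
                .filter(m -> m.getOrDefault(PARSED_BY, List.of()).contains(TESSERACT))
                .count();
    }
}
```

With inline OCR on rendered PDF pages there is no separate embedded-file entry per page to count this way, which is the gap this issue describes.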


