Posted to dev@tika.apache.org by "Luis Filipe Nassif (JIRA)" <ji...@apache.org> on 2016/06/03 01:09:59 UTC
[jira] [Comment Edited] (TIKA-1994) Integrate OCR with PDFParser

    [ https://issues.apache.org/jira/browse/TIKA-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15313422#comment-15313422 ] 

Luis Filipe Nassif edited comment on TIKA-1994 at 6/3/16 1:09 AM:
------------------------------------------------------------------

Hi Tim,

Until the deeper PDFBox integration is available (good to know they are working on that!), I think this strategy is very good. We currently use it in my organization instead of OCRing individual images inside a PDF. As you know, a PDF may contain one image per paragraph, line, word, or even character, and OCRing those images individually can give poor results.

As a suggestion, we count the number of text characters extracted per page and only run OCR if the count is below a configurable threshold (we use 100 by default), because a low count suggests the page is probably a single large (scanned) image. That eliminates a lot of duplicate text that OCR would otherwise return and speeds up extraction considerably.
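
A minimal sketch of that per-page check, in case it helps the discussion. It is not the code we run and not part of Tika: the class and method names are hypothetical, the per-page text is assumed to have been extracted already (for example one page at a time with PDFBox's PDFTextStripper), and ignoring whitespace when counting is my own assumption. Rendering the selected pages and calling the OCR engine are left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not a Tika class): selects pages that look scanned. */
final class OcrPageSelector {

    /** Default threshold mentioned above; meant to be configurable. */
    static final int DEFAULT_MIN_CHARS = 100;

    /** Counts non-whitespace characters. Skipping whitespace is an assumption. */
    static int countTextChars(String pageText) {
        if (pageText == null) {
            return 0;
        }
        return (int) pageText.codePoints()
                .filter(cp -> !Character.isWhitespace(cp))
                .count();
    }

    /** A page is sent to OCR only if its extracted text is below the threshold. */
    static boolean needsOcr(String pageText, int minChars) {
        return countTextChars(pageText) < minChars;
    }

    /** Returns the zero-based indices of the pages that should be OCRed. */
    static List<Integer> pagesToOcr(List<String> pageTexts, int minChars) {
        List<Integer> pages = new ArrayList<>();
        for (int i = 0; i < pageTexts.size(); i++) {
            if (needsOcr(pageTexts.get(i), minChars)) {
                pages.add(i);
            }
        }
        return pages;
    }
}
```

Pages that already carry enough text are skipped entirely, which is where both the speedup and the reduction in duplicate text come from.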


> Integrate OCR with PDFParser
> ----------------------------
>
>                 Key: TIKA-1994
>                 URL: https://issues.apache.org/jira/browse/TIKA-1994
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>
> Users can now run OCR on individual images embedded inline in PDFs if they get the configuration right.  
> There are some drawbacks: 1) the text appears as an attachment if using the RecursiveParserWrapper, 2) text may be extracted more cleanly from the fully rendered page than from the individual images (this is still TBD).
> It might be useful to run OCR against each rendered page (instead of the component images). 
> Integrating OCR is on the roadmap for PDFBox 2.1 (PDFBOX-1912).  This will allow us to experiment with strategies until the cleaner integration is available with PDFBox 2.1.


