Posted to users@pdfbox.apache.org by "Allison, Timothy B." <ta...@mitre.org> on 2016/05/17 12:25:32 UTC

OCRing extracted inline images vs. fully rendered pages?

All,
  On Tika, users can choose to run OCR on inline images (and attached images, of course).  Would it be better for us to render each full page and then run OCR on that?

         Best,

                  Tim

RE: OCRing extracted inline images vs. fully rendered pages?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
>We have an experimental integration with Tesseract which was created a while ago by a GSoC student. Because it requires building C++ we’ve not integrated it into trunk, but do have it on the todo list for 2.1.

Ah, very cool.  Yes, I'd trust you all to do a better job of integrating OCR for PDFs than we'd do. :)

>The advantage of this approach is that we can keep any embedded text in the PDF and embellish it with the output.

It would be neat to have an OCR-only option for documents where text extraction yields complete garbage (a garbage detector is on our to-do list: TIKA-1443).

I'll hold off then on doing anything on our end.  Thank you!

Best,

         Tim
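
To make the "garbage detector" idea above concrete, here is a minimal sketch in plain Java of one possible heuristic. It is an illustration only, not what TIKA-1443 proposes or implements. The class name GarbageTextHeuristic, its method names, and the 0.30 threshold are all hypothetical. The sketch flags extracted text as unusable when a large share of its non-whitespace characters are replacement, control, private-use, unassigned, or unpaired surrogate code points.

```java
public final class GarbageTextHeuristic {

    // Hypothetical cutoff: share of suspicious characters above which the
    // extracted text is treated as unusable. This is not a tuned value.
    private static final double SUSPICIOUS_THRESHOLD = 0.30;

    private GarbageTextHeuristic() {
    }

    /**
     * Returns the fraction of non-whitespace code points that look like
     * extraction debris. Empty or whitespace-only input returns 1.0,
     * because nothing usable was extracted.
     */
    public static double suspiciousRatio(String text) {
        int total = 0;
        int suspicious = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue; // whitespace says nothing about text quality
            }
            total++;
            int type = Character.getType(cp);
            if (cp == 0xFFFD                      // Unicode replacement character
                    || type == Character.CONTROL
                    || type == Character.PRIVATE_USE
                    || type == Character.UNASSIGNED
                    || type == Character.SURROGATE) { // unpaired surrogate
                suspicious++;
            }
        }
        return total == 0 ? 1.0 : (double) suspicious / total;
    }

    /** True when the text is null or mostly suspicious characters. */
    public static boolean looksLikeGarbage(String text) {
        return text == null || suspiciousRatio(text) > SUSPICIOUS_THRESHOLD;
    }
}
```

A caller could run the normal text extraction first and fall back to OCR only when looksLikeGarbage returns true. Note that this check will not catch text that is made of ordinary letters but is nonsensical, which can happen with broken font encodings; catching that would need something like a language-model or dictionary check.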


Re: OCRing extracted inline images vs. fully rendered pages?

Posted by John Hewson <jo...@jahewson.com>.
> On 17 May 2016, at 05:25, Allison, Timothy B. <ta...@mitre.org> wrote:
> 
> All,
>  On Tika, users can choose to run OCR on inline images (and attached images, of course).  Would it be better for us to render each full page and then run OCR on that?

We have an experimental integration with Tesseract which was created a while ago by a GSoC student. Because it requires building C++ we’ve not integrated it into trunk, but do have it on the todo list for 2.1. The advantage of this approach is that we can keep any embedded text in the PDF and embellish it with the output.

https://github.com/DImuthuUpe/OCR-Plugin

— John

>         Best,
> 
>                  Tim
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: users-help@pdfbox.apache.org
> 

