Posted to users@pdfbox.apache.org by Thomas Fischer <fi...@aon.at> on 2012/05/01 18:54:23 UTC

Re: PDF Box to parse the data content on Images

Hello,

PDFBox won't help you with this; it only extracts text from PDF files.
For images you need an OCR (optical character recognition) application. These are available in both commercial (e.g. ABBYY FineReader) and free (e.g. Tesseract) versions. The EuDML project is working on a package that does what you want; see PdfToTextViaOCR.
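
As a rough illustration of the OCR route with the free tools, here is a minimal Python sketch. It assumes two command-line programs are installed and on the PATH: pdftoppm (from Poppler, used here to turn each page into a PNG) and tesseract. Neither is part of PDFBox, and the function names, the 300 dpi default, and the "page" file prefix are choices made for this example only.

```python
import shutil
import subprocess
from pathlib import Path


def rasterize_command(pdf: Path, prefix: Path, dpi: int = 300) -> list[str]:
    # pdftoppm writes one PNG per page, named <prefix>-<page number>.png.
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)]


def ocr_command(image: Path, lang: str = "eng") -> list[str]:
    # Using "stdout" as the output base makes tesseract print the recognized text.
    return ["tesseract", str(image), "stdout", "-l", lang]


def page_number(image: Path) -> int:
    # "page-07.png" -> 7, so pages are ordered numerically, not alphabetically.
    return int(image.stem.rsplit("-", 1)[1])


def ocr_pdf(pdf: Path, workdir: Path, lang: str = "eng") -> str:
    # Fail early with a clear message if either external tool is missing.
    for tool in ("pdftoppm", "tesseract"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")

    workdir.mkdir(parents=True, exist_ok=True)
    prefix = workdir / "page"
    subprocess.run(rasterize_command(pdf, prefix), check=True)

    texts = []
    for image in sorted(workdir.glob("page-*.png"), key=page_number):
        result = subprocess.run(
            ocr_command(image, lang), check=True, capture_output=True, text=True
        )
        texts.append(result.stdout)
    return "\n".join(texts)
```

Calling ocr_pdf(Path("scan.pdf"), Path("ocr-tmp")) would return the recognized text of all pages as one string. Recognition quality depends heavily on the scan resolution and on the language data installed for Tesseract.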

Best
Thomas


On 30.04.2012 at 15:31, chaya jajur wrote:

> Hi Team,
> 
> We are planning to use PDFBox to parse PDF content.  I am able to parse and
> read the normal text data in PDF,
> but I am having challenges in reading the data/ content on images.
> 
> Our requirement is that we also need to read and parse the data/content
> present on top of images.
> E.g., if I have a scanned copy of a document, I should be able to parse its
> content as well.
> 
> Please suggest how to proceed with this.
> 
> Thanks In Advance
> Chaya