Posted to dev@pdfbox.apache.org by DImuthu Upeksha <di...@gmail.com> on 2014/06/02 19:03:25 UTC

OCR for PDFBox : Progress

Hi John,
Here is the progress of the OCR plugin for PDFBox.
The project consists of two subprojects:

1. Tesseract API for Java
2. OCR plugin for PDFBox using the Tesseract API

*Tesseract API [1]*

1. All necessary functions are now implemented, and test cases have been
written to check that they work properly.

2. Mac and Linux are supported. In future I'll try to add support for
Windows as well.

3. All static libs for Tesseract and Leptonica are prebuilt and added to
the resources folder.

4. At build time the project dynamically identifies the correct libs for
the particular operating system.

5. If someone needs to build the above static libs manually, instructions
are given in the readme.

6. In future, I'll work on building those static libs as part of the
project build. Currently they must be built manually.
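
The selection described in item 4 could be sketched roughly as below. This
is not the Tesseract-API project's actual code: the class name
NativeLibLocator, the /native/<platform>/ resource layout and the library
file names are assumptions made for illustration, and the real project does
this at build time rather than in Java. The sketch only shows how an
operating system name can be mapped to one set of prebuilt static libs.

```java
import java.util.Locale;

/** Hypothetical helper; names and paths are illustrative, not taken from the project. */
final class NativeLibLocator {

    private NativeLibLocator() {
    }

    /** Maps an os.name value to a short platform key, or throws if unsupported. */
    static String platformKey(String osName) {
        String os = osName.toLowerCase(Locale.ROOT);
        if (os.contains("mac") || os.contains("darwin")) {
            return "mac";
        }
        if (os.contains("linux")) {
            return "linux";
        }
        // Windows is not supported yet, per the status list above.
        throw new UnsupportedOperationException("No prebuilt libraries for: " + osName);
    }

    /** Builds the assumed resource path of a prebuilt static library for the given OS. */
    static String resourcePath(String osName, String libName) {
        return "/native/" + platformKey(osName) + "/lib" + libName + ".a";
    }

    /** Convenience overload that reads os.name from the running JVM. */
    static String resourcePathForCurrentOs(String libName) {
        return resourcePath(System.getProperty("os.name"), libName);
    }
}
```

With this layout, resourcePath("Linux", "tesseract") would resolve to
/native/linux/libtesseract.a, and an unsupported OS fails early with a
clear message instead of a link error later on.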

*OCR plugin [2]*

1. Implementation is almost finished.

2. It works fine with the sample PDF files I have given it. Is there any
set of PDF files that can be used to test accuracy and performance?

In addition, there is some code formatting and commenting still to be
done.
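
Once a test set with known transcripts is available, accuracy could be
scored along the following lines. This is only a sketch of one common
measure (character accuracy derived from Levenshtein edit distance), not
something the plugin currently contains; the class and method names are
hypothetical.

```java
/** Hypothetical scoring helper for comparing OCR output against a known transcript. */
final class OcrAccuracy {

    private OcrAccuracy() {
    }

    /** Levenshtein edit distance, computed with two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            // Swap rows so that prev always holds the last completed row.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Returns 1 - (edits / expected length), clamped to the range [0, 1]. */
    static double characterAccuracy(String expected, String actual) {
        if (expected.isEmpty()) {
            return actual.isEmpty() ? 1.0 : 0.0;
        }
        double errorRate = (double) editDistance(expected, actual) / expected.length();
        return Math.max(0.0, 1.0 - errorRate);
    }
}
```

Performance could be measured separately by wrapping the OCR call for each
file in System.nanoTime() readings and reporting time per page.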

[1] https://github.com/DImuthuUpe/Tesseract-API
[2] https://github.com/DImuthuUpe/OCR-Plugin
-- 
Regards

W.Dimuthu Upeksha
Undergraduate
Department of Computer Science And Engineering

University of Moratuwa, Sri Lanka

Re: OCR for PDFBox : Progress

Posted by DImuthu Upeksha <di...@gmail.com>.

Hi John,

Thank you for your valuable feedback.

As you suggested, I copied ExtractText.java and created OCRText.java with
the changes you mentioned:
https://github.com/DImuthuUpe/OCR-Plugin/blob/master/src/main/java/org/apache/pdfbox/tools/OCRText.java

Now it's working properly.

I removed arguments like -html, which do not make sense for OCR.
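
The argument handling could look roughly like the sketch below. It is not
the actual OCRText.java: the class OcrTextOptions is hypothetical, and the
option names -startPage, -endPage and -encoding are assumed to carry over
from ExtractText. The point is only that options with no meaning for OCR,
such as -html, are rejected rather than silently ignored.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical option holder; not the actual OCRText implementation. */
final class OcrTextOptions {
    int startPage = 1;
    int endPage = Integer.MAX_VALUE;
    String encoding = "UTF-8";
    String inputFile;
    String outputFile; // null means "derive from the input name"

    /** Parses ExtractText-style arguments, rejecting options that do not apply to OCR. */
    static OcrTextOptions parse(String[] args) {
        OcrTextOptions opts = new OcrTextOptions();
        List<String> positional = new ArrayList<String>();
        for (int i = 0; i < args.length; i++) {
            String arg = args[i];
            if (arg.equals("-startPage")) {
                opts.startPage = Integer.parseInt(requireValue(args, ++i, arg));
            } else if (arg.equals("-endPage")) {
                opts.endPage = Integer.parseInt(requireValue(args, ++i, arg));
            } else if (arg.equals("-encoding")) {
                opts.encoding = requireValue(args, ++i, arg);
            } else if (arg.startsWith("-")) {
                // Covers -html and any other option this sketch does not support.
                throw new IllegalArgumentException("Unsupported option: " + arg);
            } else {
                positional.add(arg);
            }
        }
        if (positional.isEmpty() || positional.size() > 2) {
            throw new IllegalArgumentException("Usage: OCRText [options] <input.pdf> [output.txt]");
        }
        opts.inputFile = positional.get(0);
        opts.outputFile = positional.size() == 2 ? positional.get(1) : null;
        return opts;
    }

    /** Returns the value following an option, or fails if the option is last. */
    private static String requireValue(String[] args, int index, String option) {
        if (index >= args.length) {
            throw new IllegalArgumentException("Missing value for " + option);
        }
        return args[index];
    }
}
```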

A small comment about PDFTextStripper: why is the variable currentPageNo
private? Is there a special reason? In some cases I needed to access
currentPageNo in PDFOCRTextStripper.java. Because it is private, I had to
keep my own local page number variable, which is manually incremented in
the processStream method. I think this is not good practice, but I had no
other way to do it. Can we make currentPageNo protected, so that it is
accessible to subclasses of PDFTextStripper in future?
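
The difference between the workaround and the requested change can be
illustrated with stand-in classes. Nothing below is PDFBox code:
BaseStripper, ShadowCounterStripper and AccessorStripper are hypothetical,
and a protected getter is shown as one possible form of the change (making
the field itself protected would serve the same purpose).

```java
/** Stand-in for a base class with a private page counter; NOT the real PDFTextStripper. */
class BaseStripper {
    private int currentPageNo = 0;

    /** Processes each page in order, advancing the private counter first. */
    final void processPages(int pageCount) {
        for (int i = 0; i < pageCount; i++) {
            currentPageNo++;
            processPage();
        }
    }

    /** Hook for subclasses; called once per page. */
    void processPage() {
    }

    /** One form of the requested change: read-only access for subclasses. */
    protected int getCurrentPageNo() {
        return currentPageNo;
    }
}

/** The workaround: the subclass keeps its own counter in step by hand. */
class ShadowCounterStripper extends BaseStripper {
    private int localPageNo = 0;

    @Override
    void processPage() {
        localPageNo++; // duplicates state that the base class already tracks
    }

    int getLocalPageNo() {
        return localPageNo;
    }
}

/** The alternative: rely on the accessor, with no duplicated state. */
class AccessorStripper extends BaseStripper {
    final StringBuilder log = new StringBuilder();

    @Override
    void processPage() {
        log.append("page ").append(getCurrentPageNo()).append('\n');
    }
}
```

The shadow counter works only as long as the subclass hook is called
exactly once per page; if the base class ever skips or reprocesses a page,
the two counters drift apart, which is why exposing the existing value is
the safer design.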

Thanks
Dimuthu

On Wed, Jun 11, 2014 at 7:16 AM, John Hewson <jo...@jahewson.com> wrote:

> Hi Dimuthu,
>
> I cloned your code and did some experiments with it  - it’s working
> nicely. I’m glad that subclassing
> PDFTextStripper has been a success, it’s a nice clean implementation.
>
> *Tesseract API [1]*
>
> 1. Currently all necessary functions were implemented and test cases were
> written in order to check proper functionality
>
> 2. Support for Mac and linux operating systems. In future I'll try to add
> support for Windows also
>
>
> That’s fine for now.
>
> 3. All static libs for Tesseract and Leptonica were pre built and added to
> resources folder.
>
>
> Perfect.
>
> 4. At build phase it dynamically identify correct libs that support to
> particular Operating system
>
> 5. If some one needs to build above static libs manually, instructions
> were given in read me.
>
>
> 6. In future, I'll work on adding those static libs creation when project
>  is built. Currently they must be manually built.
>
>
> That would be handy.
>
>
> *OCR plugin [2]*
>
> 1. Almost finished implementing.
>
> 2. Working fine with sample PDF files I have given. Is there any set of
> PDF files that can be used to test accuracy and performance?
>
>
> Currently, no, but I’ll take a look in my collection of test files…
>
> In addition to that, there are some code formatting and commenting stuff
> to be done.
>
>
> It might be nice to add a command line utility to your OCR-Plugin, you
> could copy ExtractText.java from org.apache.pdfbox.tools and rename it to
> OCRText and have it use your PDFOCRTextStripper class instead of
> PDFTextStripper. That way your plugin is immediately usable by end-users.
>
> -- John
>
>


Re: OCR for PDFBox : Progress

Posted by John Hewson <jo...@jahewson.com>.

Hi Dimuthu,

I cloned your code and did some experiments with it, and it’s working nicely. I’m glad that subclassing
PDFTextStripper has been a success; it’s a nice clean implementation.

> Tesseract API [1]
> 
> 1. Currently all necessary functions were implemented and test cases were written in order to check proper functionality
> 
> 2. Support for Mac and linux operating systems. In future I'll try to add support for Windows also

That’s fine for now.

> 3. All static libs for Tesseract and Leptonica were pre built and added to resources folder. 

Perfect.

> 4. At build phase it dynamically identify correct libs that support to particular Operating system
> 
> 5. If some one needs to build above static libs manually, instructions were given in read me.

> 6. In future, I'll work on adding those static libs creation when project  is built. Currently they must be manually built.

That would be handy.

> OCR plugin [2]
> 
> 1. Almost finished implementing. 
> 
> 2. Working fine with sample PDF files I have given. Is there any set of PDF files that can be used to test accuracy and performance?

Currently, no, but I’ll take a look in my collection of test files…

> In addition to that, there are some code formatting and commenting stuff to be done.

It might be nice to add a command line utility to your OCR-Plugin. You could copy ExtractText.java from org.apache.pdfbox.tools, rename it to OCRText, and have it use your PDFOCRTextStripper class instead of PDFTextStripper. That way your plugin is immediately usable by end-users.

-- John