
Posted to dev@pdfbox.apache.org by John Hewson <jo...@jahewson.com> on 2014/06/27 08:58:19 UTC

Re: Improving OCR plugin for PDFBox

Hi Dimuthu

That’s great. We should wait until closer to the end of the GSoC period to integrate your work with PDFBox, as ideally we only want to have to do it once. We’ve not included C++ dependencies before so no, there won’t be a standard way; we’ll have to think something up. We could make it an optional sub-project, and the Tesseract JNI bindings might be better off having their own branch so that they are more like an external dependency. I’ll ask the dev mailing list.

To prepare your code for contribution you’ll need to add the Apache header to each .java file (see any PDFBox .java file for an example) and submit a signed ICLA (http://www.apache.org/licenses/icla.pdf) to Apache.
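
For reference, here is a minimal sketch of a .java file that starts with the standard ASF source header. The class under the header, LicenseHeaderCheck, is a hypothetical helper and not part of PDFBox: it only illustrates a rough way to spot files that still lack the header, by looking for two marker phrases inside the leading comment.

```java
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not a PDFBox class): rough check for the ASF header. */
final class LicenseHeaderCheck
{
    // Phrases that each sit on a single line of the header above.
    private static final List<String> MARKERS = Arrays.asList(
            "Licensed to the Apache Software Foundation (ASF)",
            "http://www.apache.org/licenses/LICENSE-2.0");

    /** Returns true if the source starts with a comment containing every marker. */
    static boolean hasApacheHeader(String source)
    {
        int end = source.indexOf("*/");
        if (!source.trim().startsWith("/*") || end < 0)
        {
            return false;
        }
        String header = source.substring(0, end);
        for (String marker : MARKERS)
        {
            if (!header.contains(marker))
            {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args)
    {
        System.out.println(hasApacheHeader("class NoHeader {}"));
    }
}
```

This is only a convenience check under those assumptions; it does not replace reading the header in an existing PDFBox source file.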

Regarding additional functionality, the most useful would be a new command line tool which could write the OCR’d text back into the original PDF file as “invisible text”, which would allow copy and paste and text search to work for that PDF file. A starting point would be to write the OCR’d text into the original PDF as “visible” text - we can make it invisible later!

-- John

On 19 Jun 2014, at 13:57, DImuthu Upeksha <di...@gmail.com> wrote:

> Hi John,
> Apart from providing compatibility for platforms like Windows, I think most of the functionality of the OCR plugin is finished (please correct me if I'm wrong). But I would like to contribute to the project further. Do you have anything to add as new functionality? And if you plan to add this to the PDFBox code, how should I prepare my code? Is there a standard way?
> 
> Thanks
> Dimuthu
> -- 
> Regards
> W.Dimuthu Upeksha
> Undergraduate
> Department of Computer Science And Engineering
> University of Moratuwa, Sri Lanka

Re: FW: Improving OCR plugin for PDFBox

Posted by Tyler Palsulich <tp...@gmail.com>.

Definitely of interest! We have done some work on TIKA-93 (OCR), but OCR for
PDFs should probably be in PDFBox rather than Tika.

Tyler


On Fri, Jun 27, 2014 at 5:12 AM, Allison, Timothy B. <ta...@mitre.org>
wrote:

> Thought this might be of interest.

FW: Improving OCR plugin for PDFBox

Posted by "Allison, Timothy B." <ta...@mitre.org>.

Thought this might be of interest.

-----Original Message-----
From: John Hewson [mailto:john@jahewson.com] 
Sent: Friday, June 27, 2014 2:58 AM
To: DImuthu Upeksha
Cc: dev@pdfbox.apache.org
Subject: Re: Improving OCR plugin for PDFBox


Re: Improving OCR plugin for PDFBox

Posted by John Hewson <jo...@jahewson.com>.

Santosh,

Please don’t e-mail the entire mailing list asking to be unsubscribed; simply send an e-mail to:

dev-unsubscribe@pdfbox.apache.org

-- John

On 7 Jul 2014, at 10:39, Santosh Arakeri <sa...@gmail.com> wrote:

> Please don't send me mail.

Re: Improving OCR plugin for PDFBox

Posted by Santosh Arakeri <sa...@gmail.com>.

Please don't send me mail.



Re: Improving OCR plugin for PDFBox

Posted by DImuthu Upeksha <di...@gmail.com>.

Hi John,

I made the font size dynamically adjustable, and the text is now written to
the PDF file as invisible text [1]. The sample PDF file I used for testing
and the resulting PDF file with the invisible text added are at [2]. I'll be
testing more files in the future.
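
As a rough illustration of what "invisible text" means at the PDF level, the sketch below builds the content-stream operators for one piece of text. It is not the code in OCRToPDF.java; the class and method names are hypothetical, and it assumes a font resource named F1 already exists in the page resources and that the text is plain ASCII. The operators themselves are standard PDF: q/Q save and restore the graphics state, Tf sets font and size, Tr sets the text rendering mode (3 = neither fill nor stroke), Td positions the text and Tj shows it.

```java
import java.util.Locale;

/** Hypothetical sketch: emits PDF content-stream operators for one text run. */
final class InvisibleTextSketch
{
    /** Text rendering mode 0: fill (normal visible text). */
    static final int RENDER_FILL = 0;
    /** Text rendering mode 3: neither fill nor stroke (invisible text). */
    static final int RENDER_INVISIBLE = 3;

    /** Escapes backslashes and parentheses for a PDF literal string. */
    static String escape(String text)
    {
        return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)");
    }

    /**
     * Builds the operators that draw the text at (x, y). The whole run is
     * wrapped in q ... Q so the rendering mode does not leak into later content.
     */
    static String textOperators(String fontName, double fontSize,
                                double x, double y, String text, int renderingMode)
    {
        return String.format(Locale.ROOT,
                "q\nBT\n/%s %.2f Tf\n%d Tr\n%.2f %.2f Td\n(%s) Tj\nET\nQ\n",
                fontName, fontSize, renderingMode, x, y, escape(text));
    }

    public static void main(String[] args)
    {
        System.out.print(textOperators("F1", 12, 72, 700, "Hello (OCR)", RENDER_INVISIBLE));
    }
}
```

Passing RENDER_FILL instead of RENDER_INVISIBLE gives the visible variant, which is handy for checking the text position before hiding it.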

I added a new argument to the tool called 'Separation Mode' (-s). It
controls whether data is extracted from the PDF file character by character
(mode = 0) or word by word (mode = 1). When the quality of the images in the
PDF file is low or the text alignment is not perfect, use mode 0. It takes
more time than mode 1, though, because it processes the data character by
character.
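
To make the option concrete, here is a small sketch of how a -s argument with the two modes could be parsed. It is not taken from OCRToPDF.java: the class name, the constants and the idea of passing in a default mode are all assumptions for illustration, and the real tool may handle its arguments differently.

```java
/** Hypothetical sketch: parses a "-s <mode>" separation mode argument. */
final class SeparationModeArgs
{
    /** Mode 0: extract character by character (slower, more tolerant). */
    static final int MODE_CHARACTER = 0;
    /** Mode 1: extract word by word (faster). */
    static final int MODE_WORD = 1;

    /**
     * Returns the mode given after "-s", or defaultMode if "-s" is absent.
     * Anything other than 0 or 1 is rejected.
     */
    static int parseSeparationMode(String[] args, int defaultMode)
    {
        for (int i = 0; i < args.length; i++)
        {
            if (!"-s".equals(args[i]))
            {
                continue;
            }
            if (i + 1 >= args.length)
            {
                throw new IllegalArgumentException("-s expects a value (0 or 1)");
            }
            String value = args[i + 1];
            if ("0".equals(value))
            {
                return MODE_CHARACTER;
            }
            if ("1".equals(value))
            {
                return MODE_WORD;
            }
            throw new IllegalArgumentException("-s expects 0 or 1, got: " + value);
        }
        return defaultMode;
    }

    public static void main(String[] args)
    {
        System.out.println(parseSeparationMode(args, MODE_WORD));
    }
}
```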

I made some improvements to Tesseract-API [3] recently. If you are going to
test this code, you may also need to pull and build the latest version of
Tesseract-API.

[1]
https://github.com/DImuthuUpe/OCR-Plugin/blob/master/src/main/java/org/apache/pdfbox/tools/OCRToPDF.java
[2]
https://github.com/DImuthuUpe/PDFBox-OCR-Plugin-Samples/tree/master/OCRToPDF
[3] https://github.com/DImuthuUpe/Tesseract-API

Thank You
Dimuthu



On Wed, Jul 9, 2014 at 7:13 AM, John Hewson <jo...@jahewson.com> wrote:

> Hi Dimuthu
>
>> In the ICLA there are two fields for preferred Apache id and notify
>> projects. What should I put in those fields?
>
> You can leave the preferred id blank because you’re not applying to be a
> contributor, just a patch submitter.
> For notify projects put “PDFBox”.
>
>> For the new functionality you suggested, I implemented a command line
>> tool[1] that writes the OCR'd text to the original PDF as visible text.
>> However, it currently writes the text in a constant font size (12). It
>> should be adjusted dynamically.
>
> Yes, you should be able to set the font size in the graphics state.
>
>> In addition, I need to know how to make that text invisible inside the
>> PDF. How can I do that?
>
> This can be done by setting the text rendering mode to 3 (neither fill nor
> stroke) in the text state; you can call:
>
> PDGraphicsState#getTextState().setRenderingMode(RENDERING_MODE_NEITHER_FILL_NOR_STROKE_TEXT)
>
> You might need to save/restore the state before/after your text rendering
> too.
>
> -- John
>


-- 
Regards
W.Dimuthu Upeksha
Undergraduate
Department of Computer Science And Engineering
University of Moratuwa, Sri Lanka

Re: Improving OCR plugin for PDFBox

Posted by DImuthu Upeksha <di...@gmail.com>.

Hi John,

I added the Apache header to all Java files and POM files in Tesseract API
and the OCR plugin. The ICLA has two fields, for preferred Apache id and
notify projects. What should I put in those fields?

For the new functionality you suggested, I implemented a command line
tool [1] that writes the OCR'd text to the original PDF as visible text.
However, it currently writes the text in a constant font size (12); it
should be adjusted dynamically. In addition, I need to know how to make that
text invisible inside the PDF. How can I do that?
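
One possible way to adjust the size dynamically, sketched here under simplifying assumptions, is to scale the font so that each OCR'd word is as wide as the bounding box the OCR engine reported for it. The sketch assumes a Courier-like font in which every glyph is 600/1000 of the font size wide; with a proportional font the per-glyph widths would have to come from the font's metrics instead. The class and method names are hypothetical.

```java
/** Hypothetical sketch: picks a font size so a word fills its OCR bounding box. */
final class FontSizeFit
{
    /** Glyph width in 1/1000 em units; 600 for every glyph in Courier. */
    static final double COURIER_GLYPH_WIDTH = 600.0;

    /**
     * Returns the font size at which the word is exactly boxWidth wide,
     * assuming a fixed-width font with COURIER_GLYPH_WIDTH per glyph.
     */
    static double fitFontSize(String word, double boxWidth)
    {
        if (word.isEmpty() || boxWidth <= 0)
        {
            throw new IllegalArgumentException("need a non-empty word and a positive width");
        }
        // Width of the word when the font size is 1.
        double widthAtSizeOne = word.length() * COURIER_GLYPH_WIDTH / 1000.0;
        return boxWidth / widthAtSizeOne;
    }

    public static void main(String[] args)
    {
        System.out.println(fitFontSize("Hello", 60.0));
    }
}
```

For example, a five-letter word in a 60-unit-wide box is 3 units wide at size 1, so the fitted size is 20.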

[1]
https://github.com/DImuthuUpe/OCR-Plugin/blob/master/src/main/java/org/apache/pdfbox/tools/OCRToPDF.java

Thank You
Dimuthu




-- 
Regards
W.Dimuthu Upeksha
Undergraduate
Department of Computer Science And Engineering
University of Moratuwa, Sri Lanka