Posted to dev@pdfbox.apache.org by DImuthu Upeksha <di...@gmail.com> on 2014/03/01 19:09:41 UTC

Re: [GSoC 2014]Optical Character Recognition project - Introduction

I made the necessary changes to the document [1].

For the last two days I have had a close look at this [2] JNI wrapper for the
Tesseract API.
Unfortunately it was designed for the Android environment, so I think we
need to write our own makefiles to build it into a DLL (Windows) or
dylib (Mac). Currently it has Android.mk files [3]. I'm searching for a
way to convert them into a makefile that we can run from the console. Please
suggest a better approach if you have one.

[1]
https://www.dropbox.com/s/9qclvq26divwr2q/Optical%20Character%20Recognition%20for%20PDFBox%20-%20updated.pdf
[2]
https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/
[3]
https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/Android.mk


On Sat, Mar 1, 2014 at 12:27 AM, John Hewson <jo...@jahewson.com> wrote:

> This is a good start. However, there is no need for the Adder component;
> "Extracted Text (OCR)" can just feed back into the PDFBox "Text Extractor".
>
> Maybe show a "PDF" file feeding into "Text Extractor", to make it clear
> where the process starts.
>
> -- John
>
> On 26 Feb 2014, at 16:53, DImuthu Upeksha <di...@gmail.com>
> wrote:
>
> > Sorry for the mistake. I added it to my Dropbox [1].
> >
> > [1] https://www.dropbox.com/s/y3m15rfjmw4eqij/Optical%20Character%20Recognition%20for%20PDFBox.pdf
> >
> > Thanks
> > Dimuthu
> >
> >
> > On Thu, Feb 27, 2014 at 4:44 AM, John Hewson <jo...@jahewson.com> wrote:
> >
> >> I should add that the OCR engine should be pluggable so PDFToText might
> >> use an interface, e.g. OCREngine and there will be a TesseractOCREngine
> >> class somewhere which provides the required functionality and lives in a
> >> separate jar file.
> >>
> >> -- John
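
The pluggable design described above could look roughly like the following. This is only a minimal sketch: the interface name OCREngine comes from the message, but the recognize method, the StubOCREngine stand-in (used here in place of a real TesseractOCREngine) and the OcrAwareExtractor caller are all hypothetical names chosen for illustration.

```java
import java.awt.image.BufferedImage;

/** Hypothetical plug-in point: the only OCR type the text extractor would see. */
interface OCREngine {
    /** Returns the text recognised in the given image; may be empty. */
    String recognize(BufferedImage image);
}

/** Hypothetical stand-in for an engine that would live in a separate jar. */
class StubOCREngine implements OCREngine {
    private final String cannedText;

    StubOCREngine(String cannedText) {
        this.cannedText = cannedText;
    }

    @Override
    public String recognize(BufferedImage image) {
        return cannedText;
    }
}

/** Sketch of a caller that depends only on the interface, not on Tesseract. */
class OcrAwareExtractor {
    private final OCREngine engine;

    OcrAwareExtractor(OCREngine engine) {
        this.engine = engine;
    }

    String textFor(String extractedText, BufferedImage fallbackImage) {
        // Prefer text that is already extractable; fall back to OCR otherwise.
        if (extractedText != null && !extractedText.trim().isEmpty()) {
            return extractedText;
        }
        return engine.recognize(fallbackImage);
    }
}
```

Because the caller only holds an OCREngine reference, a Tesseract-backed implementation could be swapped in without touching the extraction code.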
> >>
> >>> On 25 Feb 2014, at 20:18, Dimuthu <di...@gmail.com> wrote:
> >>>
> >>> So do you need to embed the new functionality into the existing
> >>> PDFToText algorithms, or package it as a new subsystem (something like
> >>> an API)?
> >>>
> >>> -----Original Message-----
> >>> From: "John Hewson" <jo...@jahewson.com>
> >>> Sent: 26/02/2014 07:38
> >>> To: "dev@pdfbox.apache.org" <de...@pdfbox.apache.org>
> >>> Subject: Re: [GSoC 2014]Optical Character Recognition project - Introduction
> >>>
> >>> Yes, exactly. By location data I just mean (x,y) coordinates and page
> >>> rotation.
> >>>
> >>> There is another use case for OCR: some fonts embedded in PDFs have
> >>> corrupt encodings, which means the ASCII codes map to the wrong glyphs.
> >>> We could OCR the glyphs to repair the encoding.
> >>>
> >>> -- John
> >>>
> >>>> On 25 Feb 2014, at 17:13, DImuthu Upeksha <dimuthu.upeksha2@gmail.com> wrote:
> >>>>
> >>>> Hi John,
> >>>> Thanks for the explanation.
> >>>> Let's say there is a PDF with both text in an extractable format and some
> >>>> images containing text (scanned images). In that case we first extract the
> >>>> extractable content using the PDFBox algorithms, and the rest is extracted
> >>>> using OCR. Finally we pack both results together and give them as the
> >>>> output of PDFToText. Am I correct? What do you mean by "location data"?
> >>>>
> >>>> Thanks
> >>>> Dimuthu
> >>>>
> >>>>
> >>>>> On Tue, Feb 25, 2014 at 11:22 PM, John Hewson <jo...@jahewson.com> wrote:
> >>>>>
> >>>>>> 1. What is called "glyphs"?
> >>>>>
> >>>>> http://en.wikipedia.org/wiki/Glyph
> >>>>>
> >>>>>> 2. What is the main requirement of this project?
> >>>>>> As far as I understood, first we need to generate an image of
> >>>>>> malformed PDFs from PDFBox and then we need to process it using OCR
> >>>>>> for more accurate results. But the question is, why shouldn't we run
> >>>>>> OCR directly on those PDFs without getting output from PDFBox?
> >>>>>> Correct me if I'm wrong.
> >>>>>
> >>>>> PDFBox can generate images (PDFToImage) and can extract text
> >>>>> (PDFToText). The goal of this project is to enhance PDFToText so that
> >>>>> it can use OCR to extract text from areas of the document where the
> >>>>> text is embedded as an image. Such PDF files are typically generated
> >>>>> by scanners or fax machines. There is also another case where OCR is
> >>>>> useful: some fonts embedded in PDF files contain the wrong encoding,
> >>>>> so when text is extracted with PDFToText the result is nonsense, but
> >>>>> when drawn with PDFToImage we see the correct letters.
> >>>>>
> >>>>> Instead of:
> >>>>> PDF => Image => OCR => Text
> >>>>>
> >>>>> We want to do:
> >>>>> PDF => (Many images for words + location data => OCR) => Text
> >>>>>
> >>>>> -- John
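
One way to picture the "many images for words + location data" pipeline is to attach coordinates to every text fragment, whether it came from normal extraction or from OCR, and merge on position. The sketch below is an illustration only: PositionedText and ReadingOrder are hypothetical names, the y axis is assumed to grow downwards as in image coordinates, and the ordering is far simpler than what PDFTextStripper really does.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical record of one text fragment and where it sits on the page. */
class PositionedText {
    final float x;
    final float y;
    final int pageRotation; // carried along, but ignored by this simple sketch
    final String text;

    PositionedText(float x, float y, int pageRotation, String text) {
        this.x = x;
        this.y = y;
        this.pageRotation = pageRotation;
        this.text = text;
    }
}

class ReadingOrder {
    /** Merge extracted and OCR'd fragments, top-to-bottom then left-to-right. */
    static String merge(List<PositionedText> extracted, List<PositionedText> ocr) {
        List<PositionedText> all = new ArrayList<>(extracted);
        all.addAll(ocr);
        all.sort(Comparator.comparingDouble((PositionedText p) -> p.y)
                .thenComparingDouble(p -> p.x));

        StringBuilder sb = new StringBuilder();
        for (PositionedText p : all) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(p.text);
        }
        return sb.toString();
    }
}
```

The point of keeping the location data is that OCR output can be slotted back between the fragments PDFBox already extracted, instead of being appended as a separate blob.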
> >>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha <dimuthu.upeksha2@gmail.com> wrote:
> >>>>>>
> >>>>>>> Ok fixed. This is what I did:
> >>>>>>> Right click on the new project -> Debug As -> Debug Configurations
> >>>>>>> -> Source -> Add -> Project
> >>>>>>> Then I selected the PDFBox project.
> >>>>>>>
> >>>>>>> Thanks
> >>>>>>> Dimuthu
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha <
> >>>>>>> dimuthu.upeksha2@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> I'm using Eclipse. This is what I want. I created a new Java application
> >>>>>>>> project (say TestPDFBox) with a main class containing the following code.
> >>>>>>>>
> >>>>>>>> PDDocument document = new PDDocument();
> >>>>>>>> PDPage blankPage = new PDPage();
> >>>>>>>> document.addPage(blankPage);
> >>>>>>>> document.save("BlankPage.pdf");
> >>>>>>>> document.close();
> >>>>>>>>
> >>>>>>>> Then I need to add the jar files generated in the target folder of PDFBox
> >>>>>>>> to the build path of my new project (I did build the PDFBox project from
> >>>>>>>> source). That is what I did. But let's say I need to check the
> >>>>>>>> functionality of the document.save("") method. I don't have a reference to
> >>>>>>>> its sources because I used the generated jars directly. As Tilman said, I
> >>>>>>>> built PDFBox from sources, but I don't know a proper way to use it in other
> >>>>>>>> projects other than adding those jar files to the build path.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <jo...@jahewson.com> wrote:
> >>>>>>>>
> >>>>>>>>> Which IDE are you using? You should be able to run the PDFToText class
> >>>>>>>>> (in pdfbox-tools) using your IDE and pass a PDF file path as the
> >>>>>>>>> command line argument.
> >>>>>>>>>
> >>>>>>>>> -- John
> >>>>>>>>>
> >>>>>>>>>> On 24 Feb 2014, at 22:38, DImuthu Upeksha <dimuthu.upeksha2@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hi John,
> >>>>>>>>>> Thanks for the reply. Yes, I checked out the PDFBox code and managed to
> >>>>>>>>>> build it successfully. I looked at the classes you mentioned and I got a
> >>>>>>>>>> rough idea of how they work. To check them I added the jars in the
> >>>>>>>>>> target folder to my separate Java project. I tried the samples in
> >>>>>>>>>> http://pdfbox.apache.org/cookbook/. I need to look further into the code,
> >>>>>>>>>> especially how those processXXX() methods work in the PDFTextStripper
> >>>>>>>>>> class. What I usually do is add some breakpoints and check them in the
> >>>>>>>>>> debug windows. But using jars that's not possible. What approach do you
> >>>>>>>>>> follow for such a task?
> >>>>>>>>>>
> >>>>>>>>>> I also installed Tesseract on my machine and managed to do some OCR
> >>>>>>>>>> with it. It's a cool tool that works fine.
> >>>>>>>>>> I'm still learning the code. If I run into any issue I'll drop you a mail.
> >>>>>>>>>>
> >>>>>>>>>> Thanks
> >>>>>>>>>> Dimuthu
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <john@jahewson.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Hi Dimuthu
> >>>>>>>>>>>
> >>>>>>>>>>> The PDFBox website can be found at http://pdfbox.apache.org/. It contains
> >>>>>>>>>>> a basic overview of the project and details on how to obtain the source
> >>>>>>>>>>> code and build PDFBox for yourself.
> >>>>>>>>>>>
> >>>>>>>>>>> Currently we do not perform any OCR, and PDFBOX-1912 details the only
> >>>>>>>>>>> thoughts so far regarding it. Note that the OCR libraries mentioned in
> >>>>>>>>>>> the JIRA issue are all under the Apache license, which is a requirement.
> >>>>>>>>>>>
> >>>>>>>>>>> Once you have the source code, take a look at the PageDrawer class to
> >>>>>>>>>>> see how text and images are rendered. We want someone to interface at a
> >>>>>>>>>>> low level (e.g. one glyph, word, or sentence at a time) with an OCR
> >>>>>>>>>>> engine. Also look at PDFTextStripper, which is how text is currently
> >>>>>>>>>>> extracted; take a look at how we have to go to great lengths to sort
> >>>>>>>>>>> text back into reading order and infer the placement of diacritics. PDF
> >>>>>>>>>>> is fundamentally a visual format, not a structured format like HTML,
> >>>>>>>>>>> which is why extracting text can be so difficult sometimes.
> >>>>>>>>>>>
> >>>>>>>>>>> The full PDF Reference document can be found at:
> >>>>>>>>>>> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
> >>>>>>>>>>>
> >>>>>>>>>>> Feel free to discuss specifics of your proposal or ask any
> >>>>> questions.
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>>
> >>>>>>>>>>> -- John
> >>>>>>>>>>>
> >>>>>>>>>>> On 23 Feb 2014, at 21:13, DImuthu Upeksha <dimuthu.upeksha2@gmail.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi,
> >>>>>>>>>>>> I am Dimuthu Upeksha, a Computer Engineering undergraduate at the
> >>>>>>>>>>>> University of Moratuwa, Sri Lanka. I successfully completed GSoC 2013
> >>>>>>>>>>>> with the Apache ISIS [1] project. I'm very much interested in OCR and
> >>>>>>>>>>>> image processing, so I would like to select this project idea as my
> >>>>>>>>>>>> GSoC 2014 project because I feel it is the best suited project for me.
> >>>>>>>>>>>> At university we have also done some research in the OCR area, and our
> >>>>>>>>>>>> group wrote a literature review about increasing the efficiency of OCR
> >>>>>>>>>>>> systems (attached). Can you please suggest where to start learning
> >>>>>>>>>>>> about PDFBox?
> >>>>>>>>>>>>
> >>>>>>>>>>>> [1]
> >>>>>>>>>>>> http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thank you
> >>>>>>>>>>>> Dimuthu
> >>>>>>>>>>>>
> >>>>>>>>>>>> --
> >>>>>>>>>>>> Regards
> >>>>>>>>>>>> W.Dimuthu Upeksha
> >>>>>>>>>>>> Undergraduate
> >>>>>>>>>>>> Department of Computer Science And Engineering
> >>>>>>>>>>>> University of Moratuwa, Sri Lanka


-- 
Regards

W.Dimuthu Upeksha
Undergraduate
Department of Computer Science And Engineering

University of Moratuwa, Sri Lanka

Re: [GSoC 2014]Optical Character Recognition project - Introduction

Posted by John Hewson <jo...@jahewson.com>.
>  You just need to use the following Apache header on your Java source files:

Actually, no, forget that. I don’t think you can use that header yet as you haven’t signed a CLA.
Leave the files as they are without headers for now. We’ll deal with the licensing later
because your code isn’t in the official Apache repository yet.

-- John

On 10 Mar 2014, at 01:30, John Hewson <jo...@jahewson.com> wrote:

> Dimuthu,
> 
> That’s looking really good. You just need to use the following Apache header on your Java source files:
> 
> /*
>  * Licensed to the Apache Software Foundation (ASF) under one or more
>  * contributor license agreements.  See the NOTICE file distributed with
>  * this work for additional information regarding copyright ownership.
>  * The ASF licenses this file to You under the Apache License, Version 2.0
>  * (the "License"); you may not use this file except in compliance with
>  * the License.  You may obtain a copy of the License at
>  *
>  *      http://www.apache.org/licenses/LICENSE-2.0
>  *
>  * Unless required by applicable law or agreed to in writing, software
>  * distributed under the License is distributed on an "AS IS" BASIS,
>  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>  * See the License for the specific language governing permissions and
>  * limitations under the License.
>  */
> 
> -- John
> 
> On 7 Mar 2014, at 07:56, DImuthu Upeksha <di...@gmail.com> wrote:
> 
>> Hi John
>> I refactored the Tesseract JNI code [1] to support a Maven build. To create
>> the JNI library I added pre-built static libraries of Tesseract and Leptonica
>> to the resources folder [2]. For now it includes only the libraries for Mac,
>> but we can easily add both Windows and Linux libraries. After "mvn clean
>> install", the jar is created under the target folder. Now all the setting up
>> is done. What remains is implementing the native methods in tessbaseapi.cpp
>> [3]. I hope to finish it asap. Please let me know if you have any concerns
>> about the project structure.
>> 
>> [1] https://github.com/DImuthuUpe/Tesseract-API.git
>> [2]
>> https://github.com/DImuthuUpe/Tesseract-API/tree/master/src/main/resources
>> [3]
>> https://github.com/DImuthuUpe/Tesseract-API/blob/master/src/main/native/src/tessbaseapi.cpp
>> 
>> Thanks
>> Dimuthu
>> 
>> 
>> On Thu, Mar 6, 2014 at 1:15 AM, John Hewson <jo...@jahewson.com> wrote:
>> 
>>> Dimuthu
>>> 
>>>> There are a lot of code
>>>> fragments in the current Android JNI wrapper which use "(jint)somePointer"
>>>> casting, which will create terrible memory leaks in 64 bit environments
>>>> because pointers are 64 bit. So I believe writing it from the beginning is
>>>> much better.
>>> 
>>> That's a classic 64-bit pitfall, well spotted. We definitely need to
>>> support 64-bit JVMs.
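
The pitfall can be reproduced on the Java side alone, since a Java int has the same 32-bit width as a JNI jint and a Java long matches jlong. The sketch below uses a hypothetical NativeHandle class and a made-up address value; it shows that squeezing a 64-bit address through an int drops the upper bits, so the value that comes back no longer refers to the original native object.

```java
/** Sketch of a Java-side holder for a native pointer; the name is hypothetical. */
class NativeHandle {
    // A Java long corresponds to jlong, which is wide enough for a 64-bit pointer.
    private final long address;

    NativeHandle(long address) {
        this.address = address;
    }

    long address() {
        return address;
    }

    /** What a (jint) cast effectively does: keep only the low 32 bits. */
    static long roundTripThroughInt(long pointer) {
        int truncated = (int) pointer;
        return truncated; // widened (sign-extended) back to 64 bits
    }
}
```

Declaring the native methods with long parameters (jlong on the C++ side) avoids the truncation on both 32-bit and 64-bit JVMs.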
>>> 
>>>> we can use
>>>> the static library of Leptonica (I did and it worked nicely). I think it
>>>> is not an issue to use its static library because both Tesseract and
>>>> Leptonica are under the Apache licence.
>>> 
>>> Sounds good, I found the following in the README:
>>> 
>>> Leptonica is required. (www.leptonica.com). Tesseract no longer compiles
>>> without Leptonica.
>>> 
>>> Which makes sense.
>>> 
>>> -- John
>>> 
>>> On 5 Mar 2014, at 09:45, DImuthu Upeksha <di...@gmail.com> wrote:
>>> 
>>>> Hi John,
>>>> +1 for your suggestion about converting image <=> byte array on the Java
>>>> side. It removes a lot of complexity. I don't know whether you have
>>>> noticed, but the jint data type in JNI is a 32-bit integer type. I noticed
>>>> it on my Mac but don't know about other operating systems.
>>>>
>>>> Leptonica is the image processing library for Tesseract [1]. Tesseract
>>>> uses the image processing algorithms in Leptonica to implement its OCR
>>>> algorithms. This [2] is the .cpp file responsible for creating the
>>>> Tesseract API. You can see it includes the allheaders.h header file, which
>>>> is the main header file of Leptonica. So I think we must build Leptonica
>>>> first and link it when we build Tesseract. This is not a big problem if we
>>>> can use the static library of Leptonica (I did and it worked nicely). I
>>>> think it is not an issue to use its static library because both Tesseract
>>>> and Leptonica are under the Apache licence.
>>>> 
>>>> I'm working on the maven implementation you have mentioned and will get
>>>> back to you soon.
>>>> 
>>>> Thanks
>>>> Dimuthu
>>>> 
>>>> 
>>>> [1] https://code.google.com/p/tesseract-ocr/wiki/Compiling
>>>> [2] https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/src/api/tesseractmain.cpp
>>>> 
>>>> 
>>>> On Wed, Mar 5, 2014 at 1:15 AM, John Hewson <jo...@jahewson.com> wrote:
>>>> 
>>>>> Hi Dimuthu,
>>>>> 
>>>>> 1,2,3:
>>>>> 
>>>>> Feel free to write your own Tesseract binding or port the existing code
>>>>> as you see fit. The JNI binding should be minimal; only the methods you
>>>>> require need to be wrapped. Also, don't forget that some of the interop
>>>>> can be done in Java. For example, if it is easier to convert a
>>>>> BufferedImage to a byte array in Java, then do it there and pass the
>>>>> result to JNI rather than writing lots of JNI C++ to achieve the same
>>>>> result.
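
As an illustration of doing the conversion on the Java side, the sketch below turns a BufferedImage into a tightly packed 8-bit grayscale byte array that could then be handed to a native method. The class name ImageBytes and the choice of grayscale with integer luma weights are assumptions made for this example; the thread does not say which pixel format the wrapper uses.

```java
import java.awt.image.BufferedImage;

class ImageBytes {
    /**
     * Convert an image to 8-bit grayscale, one byte per pixel, row by row.
     * The result has exactly width * height bytes with no row padding.
     */
    static byte[] toGrayBytes(BufferedImage src) {
        int width = src.getWidth();
        int height = src.getHeight();
        byte[] out = new byte[width * height];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = src.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Integer approximation of the usual luma weights.
                out[y * width + x] = (byte) ((r * 299 + g * 587 + b * 114) / 1000);
            }
        }
        return out;
    }
}
```

With this in place the native side only needs to accept a byte array plus width and height, which keeps the JNI C++ small.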
>>>>> 
>>>>> Your GitHub repo looks like a good start, I can make comments there as
>>>>> things progress.
>>>>> 
>>>>> Is it possible to build Tesseract without Leptonica? I was under the
>>>>> impression that it was used for image I/O only, but I may be misinformed.
>>>>> 
>>>>> 4: The native platform library should be built as part of the Maven
>>>>> build for the Tesseract wrapper, which can be a separate project. The
>>>>> output can be a jar file which contains the native binaries. It should
>>>>> be possible for the jar to contain prebuilt binaries for all platforms,
>>>>> but this is something we can worry about later. Right now the goal
>>>>> should be to build a jar containing just the current platform's native
>>>>> binary and any Java wrapper code.
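
A jar that carries its own native binary usually has to copy the library out to a real file before the JVM can load it. The following is a minimal sketch of that idea under stated assumptions: the NativeLoader class, the /native/<platform>/ resource layout and the folder names are all hypothetical, while System.mapLibraryName and System.load are standard JDK calls.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Locale;

class NativeLoader {
    /** Map an os.name value to a resource folder; the folder names are assumptions. */
    static String platformDir(String osName) {
        String os = osName.toLowerCase(Locale.ROOT);
        // Check for Mac first because "darwin" also contains "win".
        if (os.contains("mac") || os.contains("darwin")) {
            return "mac";
        }
        if (os.contains("win")) {
            return "windows";
        }
        return "linux";
    }

    /** Build the classpath location of a library inside the jar (assumed layout). */
    static String resourcePath(String osName, String libFileName) {
        return "/native/" + platformDir(osName) + "/" + libFileName;
    }

    /** Copy the bundled library to a temporary file and load it. */
    static void load(String baseName) throws IOException {
        // mapLibraryName adds the platform prefix and suffix, e.g. lib*.so or *.dll.
        String fileName = System.mapLibraryName(baseName);
        String resource = resourcePath(System.getProperty("os.name"), fileName);
        try (InputStream in = NativeLoader.class.getResourceAsStream(resource)) {
            if (in == null) {
                throw new IOException("No native library found at " + resource);
            }
            Path tmp = Files.createTempFile("native-", "-" + fileName);
            tmp.toFile().deleteOnExit();
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            System.load(tmp.toAbsolutePath().toString());
        }
    }
}
```

Starting with only the current platform's folder populated matches the short-term goal above; adding more platforms later would just mean adding more folders to the jar.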
>>>>> 
>>>>> -- John
>>>>> 
>>>>> On 3 Mar 2014, at 16:41, DImuthu Upeksha <di...@gmail.com> wrote:
>>>>> 
>>>>>> Hi John,
>>>>>> 
>>>>>> I tried to reuse that Android JNI wrapper for Tesseract. Here are my
>>>>>> observations:
>>>>>>
>>>>>> 1. This wrapper depends heavily on the Android image libraries
>>>>>> (android/bitmap.h). Most of the wrapper methods [1] use this library.
>>>>>>
>>>>>> 2. But I can understand the underlying logic in each function. Basically
>>>>>> it maps the Tesseract API functions [2] to Java methods. In between it
>>>>>> does some image <=> byte array conversions using the Android bitmap
>>>>>> libraries.
>>>>>>
>>>>>> 3. There are two options. 1: We can port its code to make it compatible
>>>>>> with our environments (Linux, Windows and Mac), which is really painful.
>>>>>> It will also cause memory leaks. 2: We can use only its function
>>>>>> signatures and implement them using our own code.
>>>>>>
>>>>>> I think the 2nd solution is better because we need only a few operations
>>>>>> to be done using the Tesseract library. I have created a GitHub repo [3]
>>>>>> for this. It's still not finished. I need to add some makefiles and build
>>>>>> files to make it run properly. I also need to implement those wrapper
>>>>>> functions [4]. This may take some time.
>>>>>>
>>>>>> 4. Because we are calling native libraries we need different builds of
>>>>>> the Tesseract and Leptonica libraries for each platform (dll for Windows,
>>>>>> so for Linux, dylib for Mac). So we may need to build those libraries at
>>>>>> the time we build the PDFBox project. Or we can prebuild those libraries
>>>>>> and add them to the project in .dll, .so or .dylib format. Which is the
>>>>>> preferred way?
>>>>>> 
>>>>>> [1] https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/tessbaseapi.cpp
>>>>>> [2] https://code.google.com/p/tesseract-ocr/wiki/APIExample
>>>>>> [3] https://github.com/DImuthuUpe/Tesseract-API
>>>>>> [4] https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/tessbaseapi.cpp
>>>>>> 
>>>>>> Thanks
>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>>> because I feel like it is the best suited project for me.
>>> In
>>>>>>>>>>>>>>>>> university
>>>>>>>>>>>>>>>>>>> also we have done some research in OCR area and our group
>>>>>>>> wrote a
>>>>>>>>>>>>>>>>>>> literature review about increasing efficiency of OCR
>>>>>>>>>>>>>>>>> systems(attached). Can
>>>>>>>>>>>>>>>>>>> you please suggest me where to start learning about
>>> PDFBox?
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>> 
>>> http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Thank you
>>>>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>>>>>>>>>>> Undergraduate
>>>>>>>>>>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka


Re: [GSoC 2014]Optical Character Recognition project - Introduction

Posted by John Hewson <jo...@jahewson.com>.
Dimuthu,

That’s looking really good. You just need to use the following Apache header on your Java source files:

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

-- John

On 7 Mar 2014, at 07:56, DImuthu Upeksha <di...@gmail.com> wrote:

> Hi John
> I refactored Tesseract JNI code to support maven build. To create the JNI
> library I added pre-built static libraries of Tesseract and Leptonica to
> resources folder[2]. For now it includes libraries supported for mac. But
> we can easily add both windows and linux libraries. After "mvn clean
> install", the jar is created under target folder. Now all setting up is
> done. What remains is implementing those native methods in tessbaseapi.cpp
> [3]. Hope to finish it asap. Please let me know if there is any concern
> about project structure.
> 
> [1] https://github.com/DImuthuUpe/Tesseract-API.git
> [2]
> https://github.com/DImuthuUpe/Tesseract-API/tree/master/src/main/resources
> [3]
> https://github.com/DImuthuUpe/Tesseract-API/blob/master/src/main/native/src/tessbaseapi.cpp
> 
> Thanks
> Dimuthu
> 
> 
> On Thu, Mar 6, 2014 at 1:15 AM, John Hewson <jo...@jahewson.com> wrote:
> 
>> Dimuthu
>> 
>>> There is a lot of code
>>> fractions in current android jni wrapper which use "(jint)somePointer"
>>> casting which will create terrible memory leaks in 64 bit environments
>>> because pointers are 64 bit. So I believe writing it from the beginning is
>>> much better.
>> 
>> That's a classic 64-bit pitfall, well spotted. We definitely need to
>> support
>> 64-bit JVMs.
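
To make the truncation concrete, here is a small sketch in plain Java. It only mimics what a (jint) cast does to a 64-bit address; the class and method names are made up for illustration and nothing here is taken from the actual wrapper code. The usual remedy is to carry native handles as long on the Java side (jlong in JNI) instead of int.

```java
/** Hypothetical demo: what happens to a 64-bit native address squeezed through 32 bits. */
final class NativeHandleDemo {

    /** Mimics a (jint) cast of a native pointer: only the low 32 bits survive. */
    static long roundTripThroughInt(long pointer) {
        int truncated = (int) pointer;   // upper 32 bits are discarded here
        return truncated;                // widened back to 64 bits, but the data is gone
    }

    /** Carrying the handle as a long (jlong in JNI) keeps every bit. */
    static long roundTripThroughLong(long pointer) {
        return pointer;
    }

    public static void main(String[] args) {
        long address = 0x00007F8A12345678L;  // an address above the 32-bit range
        System.out.println(Long.toHexString(roundTripThroughInt(address)));
        System.out.println(Long.toHexString(roundTripThroughLong(address)));
    }
}
```

On a 32-bit platform the two round trips agree, which is why the problem tends to stay hidden until the code runs on a 64-bit JVM.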
>> 
>>> we can use
>>> the static library of Leptonica (I did and it worked nicely). I think it
>> is
>>> not an issue to use its static library because both Tesseract and
>> Leptonica
>>> is under apache licence.
>> 
>> Sounds good, I found the following in the README:
>> 
>> Leptonica is required. (www.leptonica.com). Tesseract no longer compiles
>> without Leptonica.
>> 
>> Which makes sense.
>> 
>> -- John
>> 
>> On 5 Mar 2014, at 09:45, DImuthu Upeksha <di...@gmail.com>
>> wrote:
>> 
>>> Hi John,
>>> +1 for your suggestion about converting image <=> byte array on the Java side.
>>> It reduces lot of complexities. I don't know whether you have noticed or
>>> not, jint data type in jni is a 32bit integer type. I noticed it in my
>> Mac
>>> but don't know about other operating systems.
>>> 
>>> Leptonica is the image processing library for Tesseract [1]. What
>> tesseract
>>> do is using image processing algorithms in Leptonica to implement its OCR
>>> algorithms. This [2] is the responsible .cpp file to create Tesseract
>> API.
>>> You can see it includes allheaders.h header file which is the main header
>>> file of Leptonica. So I think it is a must to build Leptonica first and
>>> link it when we build Tesseract. This is not a big problem if we can use
>>> the static library of Leptonica (I did and it worked nicely). I think it
>> is
>>> not an issue to use its static library because both Tesseract and
>> Leptonica
>>> is under apache licence.
>>> 
>>> I'm working on the maven implementation you have mentioned and will get
>>> back to you soon.
>>> 
>>> Thanks
>>> Dimuthu
>>> 
>>> 
>>> [1] https://code.google.com/p/tesseract-ocr/wiki/Compiling
>>> [2]
>>> 
>> https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/src/api/tesseractmain.cpp
>>> 
>>> 
>>> On Wed, Mar 5, 2014 at 1:15 AM, John Hewson <jo...@jahewson.com> wrote:
>>> 
>>>> Hi Dimuthu,
>>>> 
>>>> 1,2,3:
>>>> 
>>>> Feel free to write your own Tesseract binding or port the existing code
>> as
>>>> you see fit.
>>>> The JNI binding should be minimal, only the methods you require need to
>> be
>>>> wrapped.
>>>> Also, don't forget that some of the interop can be done in Java, for
>>>> example if it is easier
>>>> to convert a BufferedImage to a byte array in Java then do it there and
>>>> pass the result
>>>> to JNI rather than writing lots of JNI C++ to achieve the same result.
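
As a rough illustration of doing that conversion on the Java side, the sketch below turns a BufferedImage into a flat array of 8-bit grayscale values, one byte per pixel in row-major order. The class name is hypothetical, and the grayscale layout is only an assumption about what the native side would want; a different pixel format may turn out to be more convenient.

```java
import java.awt.image.BufferedImage;

/** Hypothetical helper: flattens an image to 8-bit grayscale bytes for passing over JNI. */
final class ImageBytes {

    /** Returns width * height bytes, row by row, one luminance value per pixel. */
    static byte[] toGrayBytes(BufferedImage image) {
        int width = image.getWidth();
        int height = image.getHeight();
        byte[] out = new byte[width * height];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = image.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Integer approximation of the Rec. 601 luma weights (0.299, 0.587, 0.114).
                int gray = (r * 299 + g * 587 + b * 114) / 1000;
                out[y * width + x] = (byte) gray;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        BufferedImage image = new BufferedImage(4, 3, BufferedImage.TYPE_INT_RGB);
        System.out.println(toGrayBytes(image).length);
    }
}
```

The JNI method then only needs to accept a byte[] plus the width and height, which keeps the C++ side small.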
>>>> 
>>>> Your GitHub repo looks like a good start, I can make comments there as
>>>> things progress.
>>>> 
>>>> Is it possible to build Tesseract without leptonica? I was under the
>>>> impression that it was
>>>> used for image i/o only, but I may be misinformed.
>>>> 
>>>> 4:  The native platform library should be built as part of the Maven
>> build
>>>> for the Tesseract
>>>> wrapper which can be a separate project. The output can be a jar file
>>>> which contains the
>>>> native binaries. It should be possible for the jar to contain prebuilt
>>>> binaries for all platforms
>>>> but this is something we can worry about later. Right now the goal
>> should
>>>> be to build a jar
>>>> containing just the current platform's native binary and any Java
>> wrapper
>>>> code.
>>>> 
>>>> -- John
>>>> 
>>>> On 3 Mar 2014, at 16:41, DImuthu Upeksha <di...@gmail.com>
>>>> wrote:
>>>> 
>>>>> Hi John,
>>>>> 
>>>>> I tried to reuse that android jni wrapper for tesseract. Here is my
>>>>> observation
>>>>> 
>>>>> 1. This wrapper heavily depends on android image libraries.
>>>>> (android/bitmap.h). Most of the wrapper methods [1] use this library.
>>>>> 
>>>>> 2. But I can understand underlying logic in each function. Basically
>> what
>>>>> it does is mapping between tesseract api functions [2] with java
>> methods.
>>>>> In between it does to some image <=> byte array like conversions by
>> using
>>>>> that bitmap libraries in Android
>>>>> 
>>>>> 3. There are two ways. 1: We can port it's code to make compatible with
>>>> our
>>>>> environments(linux,windows and mac) which is really painful. Also it
>> will
>>>>> cause memory leaks. 2: We can use only it's function signatures and
>>>>> implement using our codes
>>>>> 
>>>>> I think 2nd solution is better because we need only few operations to
>> be
>>>>> done using tesseract library. I have created a github repo [3] for
>> this.
>>>>> It's still not finished. I need to add some make files and build files
>> to
>>>>> make it run properly. And also I need to implement those wrapper
>>>> functions
>>>>> [4]. This may take some time.
>>>>> 
>>>>> 4. Because we are calling native libraries we need different builds of
>>>>> tesseract and leptonica libraries for each platform (dll for windows,
>> so
>>>>> for linux, dylib for mac). So we may need to build those libraries at
>> the
>>>>> time we build pdfbox project. Or we can pre build those libraries and
>> add
>>>>> them to the project as .dll, .so or .dylib format. What is the
>> preferred
>>>>> way?
>>>>> 
>>>>> [1]
>>>>> 
>>>> 
>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/tessbaseapi.cpp
>>>>> [2] https://code.google.com/p/tesseract-ocr/wiki/APIExample
>>>>> [3] https://github.com/DImuthuUpe/Tesseract-API
>>>>> [4]
>>>>> 
>>>> 
>> https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/tessbaseapi.cpp
>>>>> 
>>>>> Thanks
>>>>> Dimuthu
>>>>> 
>>>>> 
>>>>> On Sat, Mar 1, 2014 at 11:39 PM, DImuthu Upeksha <
>>>> dimuthu.upeksha2@gmail.com
>>>>>> wrote:
>>>>> 
>>>>>> I updated necessary changes to the document [1]
>>>>>> 
>>>>>> For last two days I had a deep look at this [2] jni wrapper for
>>>> tesseract
>>>>>> api.
>>>>>> Unfortunately this has been designed for Android environment so I
>> think
>>>> we
>>>>>> need to write our own make files to build this in to a dll(windows) or
>>>>>> dylib(in mac). Currently it has Android.mk files [3]. I'm searching
>> for
>>>> a
>>>>>> way to convert it to a make file that we can run on console. Please
>>>> suggest
>>>>>> if you have a better approach
>>>>>> 
>>>>>> [1]
>>>>>> 
>>>> 
>> https://www.dropbox.com/s/9qclvq26divwr2q/Optical%20Character%20Recognition%20for%20PDFBox%20-%20updated.pdf
>>>>>> [2]
>>>>>> 
>>>> 
>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/
>>>>>> [3]
>>>>>> 
>>>> 
>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/Android.mk
>>>>>> 
>>>>>> 
>>>>>> On Sat, Mar 1, 2014 at 12:27 AM, John Hewson <jo...@jahewson.com>
>> wrote:
>>>>>> 
>>>>>>> This is a good start. However, there is no need for the Adder
>>>> component,
>>>>>>> "Extracted Text (OCR) can just feed back into the PDFBox "Text
>>>> Extractor".
>>>>>>> 
>>>>>>> Maybe show a "PDF" file feeding in to "Text Extractor, to make it
>> clear
>>>>>>> where the process starts.
>>>>>>> 
>>>>>>> -- John
>>>>>>> 
>>>>>>> On 26 Feb 2014, at 16:53, DImuthu Upeksha <
>> dimuthu.upeksha2@gmail.com>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Sorry for the mistake. I added it to my Dropbox [1].
>>>>>>>> 
>>>>>>>> [1]
>>>>>>>> 
>>>>>>> 
>>>> 
>> https://www.dropbox.com/s/y3m15rfjmw4eqij/Optical%20Character%20Recognition%20for%20PDFBox.pdf
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> Dimuthu
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Thu, Feb 27, 2014 at 4:44 AM, John Hewson <jo...@jahewson.com>
>>>> wrote:
>>>>>>>> 
>>>>>>>>> I should add that the OCR engine should be pluggable so PDFToText
>>>> might
>>>>>>>>> use an interface, e.g. OCREngine and there will be a
>>>> TesseractOCREngine
>>>>>>>>> class somewhere which provides the required functionality and lives
>>>> in
>>>>>>> a
>>>>>>>>> separate jar file.
>>>>>>>>> 
>>>>>>>>> -- John
>>>>>>>>> 
>>>>>>>>>> On 25 Feb 2014, at 20:18, Dimuthu <di...@gmail.com>
>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> So do you need to embed those new functionalities into existing
>>>>>>>>> PDFtoText algorithms or package them as a new sub system(something
>>>>>>> like an
>>>>>>>>> API)?
>>>>>>>>>> 
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: "John Hewson" <jo...@jahewson.com>
>>>>>>>>>> Sent: 26/02/2014 07:38
>>>>>>>>>> To: "dev@pdfbox.apache.org" <de...@pdfbox.apache.org>
>>>>>>>>>> Subject: Re: [GSoC 2014]Optical Character Recognition project -
>>>>>>>>> Introduction
>>>>>>>>>> 
>>>>>>>>>> Yes, exactly. By location data I just mean (x,y) coordinates and
>>>> page
>>>>>>>>> rotation.
>>>>>>>>>> 
>>>>>>>>>> There is another use case for OCR: some fonts embedded in PDFs
>> have
>>>>>>>>> corrupt encodings, which means the ASCII codes map to the wrong
>>>>>>> glyphs. We
>>>>>>>>> could OCR the glyphs to repair the encoding.
>>>>>>>>>> 
>>>>>>>>>> -- John
>>>>>>>>>> 
>>>>>>>>>>> On 25 Feb 2014, at 17:13, DImuthu Upeksha <
>>>>>>> dimuthu.upeksha2@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hi John,
>>>>>>>>>>> Thanks for the explanation.
>>>>>>>>>>> Let's say there is a pdf with both text in extractable format and
>>>>>>> some
>>>>>>>>>>> images with text(Scanned images). In that case first we extract
>>>> those
>>>>>>>>>>> extractable content using PDFBox algorithms and rest is extracted
>>>>>>> using
>>>>>>>>>>> OCR. Finally we pack both results together and give output as
>>>>>>>>> PDFToText. Am
>>>>>>>>>>> I correct? What do you mean by "location data"?
>>>>>>>>>>> 
>>>>>>>>>>> Thanks
>>>>>>>>>>> Dimuthu
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> On Tue, Feb 25, 2014 at 11:22 PM, John Hewson <
>> john@jahewson.com>
>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> 1. What is called "glyphs" ?
>>>>>>>>>>>> 
>>>>>>>>>>>> http://en.wikipedia.org/wiki/Glyph
>>>>>>>>>>>> 
>>>>>>>>>>>>> 2. What is the main requirement of this project?
>>>>>>>>>>>>> As far as I understood, first we need to generate an image of
>>>>>>>>>>>>> malformed pdfs from
>>>>>>>>>>>>> PDFBox and then we need to do processing using OCR for further
>>>>>>>>> accurate
>>>>>>>>>>>>> results.  But the problem is, why shouldn't we directly do OCR
>> on
>>>>>>>>> those
>>>>>>>>>>>>> PDFs without getting output from PDFBox? Correct me if I'm
>> wrong.
>>>>>>>>>>>> 
>>>>>>>>>>>> PDFBox can generate images (PDFToImage) and can extract text
>>>>>>>>> (PDFToText).
>>>>>>>>>>>> The goal of
>>>>>>>>>>>> this project is to enhance PDFToText so that it can use OCR to
>>>>>>> extract
>>>>>>>>>>>> text from areas of the
>>>>>>>>>>>> document where the text is embedded as an image. Such PDF files
>>>> are
>>>>>>>>>>>> typically generated by
>>>>>>>>>>>> scanners or fax machines. There is also another case where OCR
>> is
>>>>>>>>> useful:
>>>>>>>>>>>> some fonts embedded
>>>>>>>>>>>> in PDF files contain the wrong encoding, so when text is
>> extracted
>>>>>>> with
>>>>>>>>>>>> PDFToText the result is
>>>>>>>>>>>> nonsense but when drawn with PDFToImage we see the correct
>>>> letters.
>>>>>>>>>>>> 
>>>>>>>>>>>> Instead of:
>>>>>>>>>>>> PDF => Image => OCR => Text
>>>>>>>>>>>> 
>>>>>>>>>>>> We want to do:
>>>>>>>>>>>> PDF => (Many images for words + location data => OCR) => Text
>>>>>>>>>>>> 
>>>>>>>>>>>> -- John
>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha <
>>>>>>>>>>>> dimuthu.upeksha2@gmail.com
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Ok fixed. This is what I did
>>>>>>>>>>>>>> Right click on the new project ->Debug As-> Debug
>> Configurations
>>>>>>>>>>>> ->Source
>>>>>>>>>>>>>> ->Add -> Project
>>>>>>>>>>>>>> Then I selected PDFBox project.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha <
>>>>>>>>>>>>>> dimuthu.upeksha2@gmail.com> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I'm using eclipse. This is what I want. I created a new Java
>>>>>>>>>>>> application
>>>>>>>>>>>>>>> project (say TestPDFBox) with a main class with following
>> code.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> PDDocument document = new PDDocument();
>>>>>>>>>>>>>>> PDPage blankPage = new PDPage();
>>>>>>>>>>>>>>> document.addPage(blankPage);
>>>>>>>>>>>>>>> document.save("BlankPage.pdf");
>>>>>>>>>>>>>>> document.close();
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Then I need to add those jar files generated in target folder
>>>> of
>>>>>>>>> PDFBox
>>>>>>>>>>>>>>> to build path of my new project (I did build the PDFBox
>> project
>>>>>>> from
>>>>>>>>>>>>>>> source). That is what I did. But let's say I need to check
>> the
>>>>>>>>>>>>>>> functionality of document.save("") method. But I don't have a
>>>>>>>>>>>> reference to
>>>>>>>>>>>>>>> it's sources because I directly used generated jars. As
>> Tilman
>>>>>>> said
>>>>>>>>> I
>>>>>>>>>>>> built
>>>>>>>>>>>>>>> PDFBox from sources but I don't know a proper way to use it
>>>> other
>>>>>>>>>>>> projects
>>>>>>>>>>>>>>> other than adding those jar files to build path.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <
>>>> john@jahewson.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Which IDE are you using? You should be able to run the
>>>> PDFToText
>>>>>>>>> class
>>>>>>>>>>>>>>>> (in pdfbox-tools) using your IDE and pass a PDF file path as
>>>> the
>>>>>>>>>>>> command
>>>>>>>>>>>>>>>> line argument.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On 24 Feb 2014, at 22:38, DImuthu Upeksha <
>>>>>>>>>>>> dimuthu.upeksha2@gmail.com>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>>> Thanks for the reply. Yes I checked out PDFBox code and
>>>>>>> managed to
>>>>>>>>>>>>>>>> build
>>>>>>>>>>>>>>>>> code successfully. I looked at the classes you mentioned
>> and
>>>> I
>>>>>>>>> got a
>>>>>>>>>>>>>>>> rough
>>>>>>>>>>>>>>>>> idea about how they are working. To check them I used the
>>>> jars
>>>>>>> in
>>>>>>>>>>>>>>>> target
>>>>>>>>>>>>>>>>> folder to my separate java project. I tried samples in
>>>>>>>>>>>>>>>>> http://pdfbox.apache.org/cookbook/. I need to further look
>>>>>>> into
>>>>>>>>> code
>>>>>>>>>>>>>>>>> specially how those processXXX() methods work in
>>>>>>> PDFTextStripper
>>>>>>>>>>>> class.
>>>>>>>>>>>>>>>>> What I usually do is adding some breakpoints and checking
>>>> them
>>>>>>> in
>>>>>>>>>>>> debug
>>>>>>>>>>>>>>>>> windows. But using jars it's not possible. What is the way
>>>> you
>>>>>>>>> follow
>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>> order to do such task?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> As well I installed tesseract in to my machine and managed
>> to
>>>>>>> do
>>>>>>>>> some
>>>>>>>>>>>>>>>> OCR
>>>>>>>>>>>>>>>>> stuff also. That's a cool tool which works fine.
>>>>>>>>>>>>>>>>> I'm still learning the code. If I get any issue I'll drop
>>>> you a
>>>>>>>>> mail.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <
>>>>>>> john@jahewson.com
>>>>>>>>>> 
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Hi Dimuthu
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> The PDFBox website can be found at
>>>> http://pdfbox.apache.org/ it
>>>>>>>>>>>>>>>> contains
>>>>>>>>>>>>>>>>>> a basic overview of the project
>>>>>>>>>>>>>>>>>> and details on how to obtain the source code and build
>>>> PDFBox
>>>>>>> for
>>>>>>>>>>>>>>>> yourself.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Currently we do not perform any OCR and PDFBOX-1912
>> details
>>>>>>> the
>>>>>>>>> only
>>>>>>>>>>>>>>>>>> thoughts so far regarding it.
>>>>>>>>>>>>>>>>>> Note that the OCR libraries mentioned in the JIRA issue
>> are
>>>>>>> all
>>>>>>>>>>>> under
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> Apache license, which is a
>>>>>>>>>>>>>>>>>> requirement.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Once you have the source code, take a look at the
>> PageDrawer
>>>>>>>>> class
>>>>>>>>>>>> to
>>>>>>>>>>>>>>>> see
>>>>>>>>>>>>>>>>>> how text and images are
>>>>>>>>>>>>>>>>>> rendered. We want someone to interface at a low-level
>> (e.g.
>>>>>>> one
>>>>>>>>>>>> glyph,
>>>>>>>>>>>>>>>>>> word, or sentence at a time) with
>>>>>>>>>>>>>>>>>> an OCR engine. Also look at PDFTextStripper which is how
>>>> text
>>>>>>> is
>>>>>>>>>>>>>>>> currently
>>>>>>>>>>>>>>>>>> extracted, take a look at how
>>>>>>>>>>>>>>>>>> we have to go to great length to sort text back into
>> reading
>>>>>>>>> order
>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>> infer the placement of diacritics - PDF
>>>>>>>>>>>>>>>>>> is fundamentally a visual format, not a structured format
>>>> like
>>>>>>>>> HTML
>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>> which is why extracting text can be so
>>>>>>>>>>>>>>>>>> difficult sometimes.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> The full PDF Reference document can be found at:
>>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>> 
>> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Feel free to discuss specifics of your proposal or ask any
>>>>>>>>>>>> questions.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On 23 Feb 2014, at 21:13, DImuthu Upeksha <
>>>>>>>>>>>> dimuthu.upeksha2@gmail.com
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>> I am Dimuthu Upeksha, a Computer Engineering
>> Undergraduate
>>>> at
>>>>>>>>>>>>>>>> University
>>>>>>>>>>>>>>>>>> of Moratuwa Sri Lanka. I successfully completed my GSoC
>> 2013
>>>>>>> with
>>>>>>>>>>>>>>>> Apache
>>>>>>>>>>>>>>>>>> ISIS [1] project. I'm very much interested in OCR and
>> image
>>>>>>>>>>>> processing
>>>>>>>>>>>>>>>>>> stuff. So I would like to select this project idea as my
>>>> GSoC
>>>>>>>>> 2014
>>>>>>>>>>>>>>>> project
>>>>>>>>>>>>>>>>>> because I feel like it is the best suited project for me.
>> In
>>>>>>>>>>>>>>>> university
>>>>>>>>>>>>>>>>>> also we have done some research in OCR area and our group
>>>>>>> wrote a
>>>>>>>>>>>>>>>>>> literature review about increasing efficiency of OCR
>>>>>>>>>>>>>>>> systems(attached). Can
>>>>>>>>>>>>>>>>>> you please suggest me where to start learning about
>> PDFBox?
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>> 
>> http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Thank you
>>>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>>>>>>>>>> Undergraduate
>>>>>>>>>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka
> -- 
> Regards
> 
> W.Dimuthu Upeksha
> Undergraduate
> Department of Computer Science And Engineering
> 
> University of Moratuwa, Sri Lanka


Re: [GSoC 2014]Optical Character Recognition project - Introduction

Posted by John Hewson <jo...@jahewson.com>.
Hi Dimuthu

I'm travelling for the next week so I'm going to be a little slow at replying and somewhat brief.

The scale can simply be 1.0 at all times. The font size should be the height of the current line of text in points (1/72 inch). To calculate this from the height of the text in pixels you need to take into account the DPI (dots per inch) at which PDFRenderer rendered the image.
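
As a rough sketch of that conversion (the class and method names are made up for illustration and are not PDFBox API): there are 72 points to an inch, so a height in pixels is multiplied by 72 and divided by the rendering DPI.

```java
/** Hypothetical helper: converts a rendered text height in pixels to points (1/72 inch). */
final class FontSizeEstimator {

    private static final float POINTS_PER_INCH = 72f;

    /**
     * @param heightInPixels height of the text line in the rendered image
     * @param dpi            resolution at which the page image was rendered
     */
    static float pixelsToPoints(float heightInPixels, float dpi) {
        if (dpi <= 0) {
            throw new IllegalArgumentException("dpi must be positive");
        }
        return heightInPixels * POINTS_PER_INCH / dpi;
    }

    public static void main(String[] args) {
        // A line 50 px tall on a page rendered at 300 DPI is 12 pt high.
        System.out.println(pixelsToPoints(50f, 300f));
    }
}
```

The resulting value is what would go into fontSizeText, with the text matrix scale left at 1.0.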

I'm not sure what totalVerticalDisplacementDisp does, I'm not at my computer currently so I'll have to get back to you on that.

-- John

> On 14 Apr 2014, at 15:46, DImuthu Upeksha <di...@gmail.com> wrote:
> 
> Hi john,
> 
> I managed to override processStream method and pass some hardcoded
> text position values to processStream method.
> I still have doubts about totalVerticalDisplacementDisp and
> fontSizeText variables. Is there a standard way to calculate the
> fontSizeText variable? What is the use of
> totalVerticalDisplacementDisp variable and how can we fetch it?
> 
> For text matrix how can we calculate the scale x and scale y? For this
> scenario I put 1 for each.
> 
> @Override
> 
>    public void processStream(PDResources resources, COSStream cosStream,
>    PDRectangle drawingSize, int rotation) throws IOException {
>    float pageWidth = drawingSize.getWidth();
>    float pageHeight = drawingSize.getHeight();
>    Matrix textMatrixStart = new Matrix();
> 
>    textMatrixStart.setValue(0, 0, 1); //scale x
>    textMatrixStart.setValue(0, 1, 0);
>    textMatrixStart.setValue(0, 2, 0);
> 
>    textMatrixStart.setValue(1, 0, 0);
>    textMatrixStart.setValue(1, 1, 1); //scale y
>    textMatrixStart.setValue(1, 2, 0);
> 
>    textMatrixStart.setValue(2, 0, 10);
>    textMatrixStart.setValue(2, 1, 100);
>    textMatrixStart.setValue(2, 2, 1);
> 
>    float endXPosition = 29.34f;
>    float endYPosition =0.0f;
>    float totalVerticalDisplacementDisp =8.0f;
>    float widthText = 29.34f;
>    float spaceWidthDisp = 12.0f;
>    String c = "Hello";
>    int []codePoints = {72,101,108,108,111};
>    PDFont font = new PDType1Font();
>    float fontSizeText = 12.0f;
> 
>    TextPosition textPosition = new TextPosition(rotation, pageWidth,
> pageHeight, textMatrixStart, endXPosition,
>        endYPosition, totalVerticalDisplacementDisp, widthText,
> spaceWidthDisp, c, codePoints, font,
>        fontSizeText,12);
> 
>    processTextPosition(textPosition);
>    }
> 
>> On Sat, Apr 12, 2014 at 6:36 AM, John Hewson <jo...@jahewson.com> wrote:
>> These are the values of the "text matrix" at the start and end of the given text. Take a look at the PDF spec for a complete description of how the text matrix is calculated. It's an affine transform which can rotate, scale, and skew text and it represents "text space", the coordinate system for rendering text. Usually it just contains a translation component, though often a scale too (default is 1.0).
>> 
>> Another way of describing this is to say that transforming (0,0) by the text matrix gives you the (x,y) coordinate of the text.
>> 
>> In order to generate a fake text matrix for OCR all you need is to start with the identity matrix and then set the translation components to the current (x,y) position, where St (start) is the left-hand side of the glyph and End is its right-hand side.
>> 
>> -- John
>> 
>>> On 11 Apr 2014, at 16:05, DImuthu Upeksha <di...@gmail.com> wrote:
>>> 
>>> I looked at processTextPosition method in PDFTextStripper class. It
>>> takes a TextPosition object as the parameter. TextPosition object
>>> takes two Matrices as parameters in its constructor
>>> 
>>> Matrix textPositionSt
>>> Matrix textPositionEnd
>>> 
>>> 1. What is the task of these matrices?
>>> 2. What should be the format of it's data?
>>> 
>>> I debugged one textPositionSt matrix and for that sample its value was
>>> 
>>> [12.0, 0.0, 0.0, 0.0, 12.0, 0.0, 15.336001, 0.0, 1.0]
>>> 
>>> What is the meaning of these values?
>>> 
>>> Thanks
>>> Dimuthu
>>> 
>>>> On Tue, Mar 25, 2014 at 11:54 PM, John Hewson <jo...@jahewson.com> wrote:
>>>> Hi Dimuthu
>>>> 
>>>> Each line of text is handled by the processEncodedText method in PDFStreamEngine
>>>> which calls processTextPosition once for each character. The processTextPosition
>>>> method in PDFStreamEngine collects the text positions into lines, paragraphs and
>>>> columns (also called "articles"). Text on a PDF page does not have to be drawn in
>>>> order, so text at any position can occur at any time and processTextPosition will sort
>>>> the text and insert it into the relevant line/paragraph/column.
>>>> 
>>>> To make your words with bounding boxes compatible with processTextPosition you
>>>> should convert each character in the word into a TextPosition and then you can call
>>>> processTextPosition.
>>>> 
>>>> Thanks
>>>> 
>>>> -- John
>>>> 
>>>>> On 24 Mar 2014, at 01:32, DImuthu Upeksha <di...@gmail.com> wrote:
>>>>> 
>>>>> Hi John,
>>>>> 
>>>>> I looked at processTextPosition method in PDFTextStripper. But I
>>>>> couldn't understand actual process happening inside the method. What
>>>>> should be the input for that method? In my case I have words with
>>>>> bounding box's coordinates. How can I make those data to compatible
>>>>> with the input of processTextPosition method. As well, what is the
>>>>> output of the method?
>>>>> 
>>>>> Thanks
>>>>> Dimuthu
>>>>> 
>>>>>> On Wed, Mar 19, 2014 at 11:19 PM, John Hewson <jo...@jahewson.com> wrote:
>>>>>> Hi Dimuthu
>>>>>> 
>>>>>>> 1 Print those data into PDDocument again and pass through TextStripper
>>>>>>> of PDFBox. This could reduce the performance of overall process.
>>>>>> 
>>>>>> This was what I had in mind, but rather than printing the text into the PDDocument
>>>>>> you can inject it directly into PDFTextStripper as TextPosition instances. I mentioned
>>>>>> something like this a while ago:
>>>>>> 
>>>>>>> You could subclass PDFTextStripper and override the startDocument method and use it to create a PDFRenderer and store it in a field. Then override the processPage method and use the previously created PDFRenderer to render the current page to a buffered image and perform OCR on the image. Once you have the OCR text + positions, instead of calling processStream you can call processTextPosition once for each character + position.
>>>>>> 
>>>>>> Let's see how well it works and then re-evaluate.
>>>>>> 
>>>>>> -- John
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Regards
>>>>> 
>>>>> W.Dimuthu Upeksha
>>>>> Undergraduate
>>>>> 
>>>>> Department of Computer Science And Engineering
>>>>> 
>>>>> University of Moratuwa, Sri Lanka
>>> 
>>> 
>>> 
>>> --
>>> Regards
>>> 
>>> W.Dimuthu Upeksha
>>> Undergraduate
>>> 
>>> Department of Computer Science And Engineering
>>> 
>>> University of Moratuwa, Sri Lanka
> 
> 
> 
> -- 
> Regards
> 
> W.Dimuthu Upeksha
> Undergraduate
> 
> Department of Computer Science And Engineering
> 
> University of Moratuwa, Sri Lanka

Re: [GSoC 2014]Optical Character Recognition project - Introduction

Posted by DImuthu Upeksha <di...@gmail.com>.
Hi John,

I managed to override the processStream method and pass some hardcoded
text position values to the processTextPosition method.
I still have doubts about the totalVerticalDisplacementDisp and
fontSizeText variables. Is there a standard way to calculate the
fontSizeText variable? What is the use of the
totalVerticalDisplacementDisp variable and how can we fetch it?

For the text matrix, how can we calculate scale x and scale y? For this
scenario I put 1 for each.

    @Override
    public void processStream(PDResources resources, COSStream cosStream,
            PDRectangle drawingSize, int rotation) throws IOException {
        float pageWidth = drawingSize.getWidth();
        float pageHeight = drawingSize.getHeight();

        // Text matrix at the start of the text: unit scale, translated to (10, 100).
        Matrix textMatrixStart = new Matrix();

        textMatrixStart.setValue(0, 0, 1); // scale x
        textMatrixStart.setValue(0, 1, 0);
        textMatrixStart.setValue(0, 2, 0);

        textMatrixStart.setValue(1, 0, 0);
        textMatrixStart.setValue(1, 1, 1); // scale y
        textMatrixStart.setValue(1, 2, 0);

        textMatrixStart.setValue(2, 0, 10);  // translate x
        textMatrixStart.setValue(2, 1, 100); // translate y
        textMatrixStart.setValue(2, 2, 1);

        // Hardcoded values for a single test string.
        float endXPosition = 29.34f;
        float endYPosition = 0.0f;
        float totalVerticalDisplacementDisp = 8.0f;
        float widthText = 29.34f;
        float spaceWidthDisp = 12.0f;
        String c = "Hello";
        int[] codePoints = {72, 101, 108, 108, 111};
        PDFont font = new PDType1Font();
        float fontSizeText = 12.0f;

        TextPosition textPosition = new TextPosition(rotation, pageWidth,
                pageHeight, textMatrixStart, endXPosition, endYPosition,
                totalVerticalDisplacementDisp, widthText, spaceWidthDisp,
                c, codePoints, font, fontSizeText, 12);

        processTextPosition(textPosition);
    }

On Sat, Apr 12, 2014 at 6:36 AM, John Hewson <jo...@jahewson.com> wrote:
> These are the values of the "text matrix" at the start and end of the given text. Take a look at the PDF spec for a complete description of how the text matrix is calculated. It's an affine transform which can rotate, scale, and skew text and it represents "text space", the coordinate system for rendering text. Usually it just contains a translation component, though often a scale too (default is 1.0).
>
> Another way of describing this is to say that transforming (0,0) by the text matrix gives you the (x,y) coordinate of the text.
>
> In order to generate a fake text matrix for OCR all you need is to start with the identity matrix and then set the translation components to the current (x,y) position, where St (start) is the left-hand side of the glyph and End is its right-hand side.
>
> -- John
>
>> On 11 Apr 2014, at 16:05, DImuthu Upeksha <di...@gmail.com> wrote:
>>
>> I looked at processTextPosition method in PDFTextStripper class. It
>> takes a TextPosition object as the parameter. TextPosition object
>> takes two Matrices as parameters in its constructor
>>
>> Matrix textPositionSt
>> Matrix textPositionEnd
>>
>> 1. What is the task of these matrices?
>> 2. What should be the format of it's data?
>>
>> I debugged one textPositionSt matrix and for that sample its value was
>>
>> [12.0, 0.0, 0.0, 0.0, 12.0, 0.0, 15.336001, 0.0, 1.0]
>>
>> What is the meaning of these values?
>>
>> Thanks
>> Dimuthu
>>
>>> On Tue, Mar 25, 2014 at 11:54 PM, John Hewson <jo...@jahewson.com> wrote:
>>> Hi Dimuthu
>>>
>>> Each line of text is handled by the processEncodedText method in PDFStreamEngine
>>> which calls processTextPosition once for each character. The processTextPosition
>>> method in PDFStreamEngine collects the text positions into lines, paragraphs and
>>> columns (also called "articles"). Text on a PDF page does not have to be drawn in
>>> order, so text at any position can occur at any time and processTextPosition will sort
>>> the text and insert it into the relevant line/paragraph/column.
>>>
>>> To make your words with bounding boxes compatible with processTextPosition you
>>> should convert each character in the word into a TextPosition and then you can call
>>> processTextPosition.
>>>
>>> Thanks
>>>
>>> -- John
>>>
>>>> On 24 Mar 2014, at 01:32, DImuthu Upeksha <di...@gmail.com> wrote:
>>>>
>>>> Hi John,
>>>>
>>>> I looked at processTextPosition method in PDFTextStripper. But I
>>>> couldn't understand actual process happening inside the method. What
>>>> should be the input for that method? In my case I have words with
>>>> bounding box's coordinates. How can I make those data to compatible
>>>> with the input of processTextPosition method. As well, what is the
>>>> output of the method?
>>>>
>>>> Thanks
>>>> Dimuthu
>>>>
>>>>> On Wed, Mar 19, 2014 at 11:19 PM, John Hewson <jo...@jahewson.com> wrote:
>>>>> Hi Dimuthu
>>>>>
>>>>>> 1 Print those data into PDDocument again and pass through TextStripper
>>>>>> of PDFBox. This could reduce the performance of overall process.
>>>>>
>>>>> This was what I had in mind, but rather than printing the text into the PDDocument
>>>>> you can inject it directly into PDFTextStripper as TextPosition instances. I mentioned
>>>>> something like this a while ago:
>>>>>
>>>>>> You could subclass PDFTextStripper and override the startDocument method and use it to create a PDFRenderer and store it in a field. Then override the processPage method and use the previously created PDFRenderer to render the current page to a buffered image and perform OCR on the image. Once you have the OCR text + positions, instead of calling processStream you can call processTextPosition once for each character + position.
>>>>>
>>>>> Let's see how well it works and then re-evaluate.
>>>>>
>>>>> -- John
>>>>
>>>>
>>>>
>>>> --
>>>> Regards
>>>>
>>>> W.Dimuthu Upeksha
>>>> Undergraduate
>>>>
>>>> Department of Computer Science And Engineering
>>>>
>>>> University of Moratuwa, Sri Lanka
>>
>>
>>
>> --
>> Regards
>>
>> W.Dimuthu Upeksha
>> Undergraduate
>>
>> Department of Computer Science And Engineering
>>
>> University of Moratuwa, Sri Lanka



-- 
Regards

W.Dimuthu Upeksha
Undergraduate

Department of Computer Science And Engineering

University of Moratuwa, Sri Lanka

Re: [GSoC 2014]Optical Character Recognition project - Introduction

Posted by John Hewson <jo...@jahewson.com>.
These are the values of the "text matrix" at the start and end of the given text. Take a look at the PDF spec for a complete description of how the text matrix is calculated. It's an affine transform which can rotate, scale, and skew text and it represents "text space", the coordinate system for rendering text. Usually it just contains a translation component, though often a scale too (default is 1.0).

Another way of describing this is to say that transforming (0,0) by the text matrix gives you the (x,y) coordinate of the text.

In order to generate a fake text matrix for OCR all you need is to start with the identity matrix and then set the translation components to the current (x,y) position, where St (start) is the left-hand side of the glyph and End is its right-hand side.
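
A minimal sketch of the idea above, assuming plain 3x3 float arrays in place of PDFBox's Matrix class. The class and method names (TextMatrixSketch, translation, transform) and the coordinates are illustrative only. It builds an identity matrix with the translation components set, then transforms (0,0) to recover the glyph position:

```java
// Sketch only: plain 3x3 float arrays stand in for PDFBox's Matrix class.
class TextMatrixSketch {
    /** Identity matrix with the translation components set to (x, y). */
    static float[][] translation(float x, float y) {
        return new float[][] {
            {1f, 0f, 0f},   // scale x, shear
            {0f, 1f, 0f},   // shear, scale y
            {x,  y,  1f}    // translate x, translate y
        };
    }

    /** Transforms the point (px, py), taken as the row vector [px py 1], by m. */
    static float[] transform(float[][] m, float px, float py) {
        float x = px * m[0][0] + py * m[1][0] + m[2][0];
        float y = px * m[0][1] + py * m[1][1] + m[2][1];
        return new float[] {x, y};
    }

    public static void main(String[] args) {
        // Fake text matrices for one OCR glyph: left edge at x = 10,
        // right edge at x = 16, baseline at y = 100 (illustrative numbers).
        float[][] start = translation(10f, 100f);
        float[][] end = translation(16f, 100f);
        float[] p = transform(start, 0f, 0f);
        float[] q = transform(end, 0f, 0f);
        System.out.println("start = (" + p[0] + ", " + p[1] + ")");
        System.out.println("end   = (" + q[0] + ", " + q[1] + ")");
    }
}
```

If the debugged matrix [12.0, 0.0, 0.0, 0.0, 12.0, 0.0, 15.336001, 0.0, 1.0] is read in the same row-major layout, it would be a scale of 12 in x and y with a translation of (15.336001, 0).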

-- John

> On 11 Apr 2014, at 16:05, DImuthu Upeksha <di...@gmail.com> wrote:
> 
> I looked at processTextPosition method in PDFTextStripper class. It
> takes a TextPosition object as the parameter. TextPosition object
> takes two Matrices as parameters in its constructor
> 
> Matrix textPositionSt
> Matrix textPositionEnd
> 
> 1. What is the task of these matrices?
> 2. What should be the format of it's data?
> 
> I debugged one textPositionSt matrix and for that sample its value was
> 
> [12.0, 0.0, 0.0, 0.0, 12.0, 0.0, 15.336001, 0.0, 1.0]
> 
> What is the meaning of these values?
> 
> Thanks
> Dimuthu
> 
>> On Tue, Mar 25, 2014 at 11:54 PM, John Hewson <jo...@jahewson.com> wrote:
>> Hi Dimuthu
>> 
>> Each line of text is handled by the processEncodedText method in PDFStreamEngine
>> which calls processTextPosition once for each character. The processTextPosition
>> method in PDFStreamEngine collects the text positions into lines, paragraphs and
>> columns (also called "articles"). Text on a PDF page does not have to be drawn in
>> order, so text at any position can occur at any time and processTextPosition will sort
>> the text and insert it into the relevant line/paragraph/column.
>> 
>> To make your words with bounding boxes compatible with processTextPosition you
>> should convert each character in the word into a TextPosition and then you can call
>> processTextPosition.
>> 
>> Thanks
>> 
>> -- John
>> 
>>> On 24 Mar 2014, at 01:32, DImuthu Upeksha <di...@gmail.com> wrote:
>>> 
>>> Hi John,
>>> 
>>> I looked at processTextPosition method in PDFTextStripper. But I
>>> couldn't understand actual process happening inside the method. What
>>> should be the input for that method? In my case I have words with
>>> bounding box's coordinates. How can I make those data to compatible
>>> with the input of processTextPosition method. As well, what is the
>>> output of the method?
>>> 
>>> Thanks
>>> Dimuthu
>>> 
>>>> On Wed, Mar 19, 2014 at 11:19 PM, John Hewson <jo...@jahewson.com> wrote:
>>>> Hi Dimuthu
>>>> 
>>>>> 1 Print those data into PDDocument again and pass through TextStripper
>>>>> of PDFBox. This could reduce the performance of overall process.
>>>> 
>>>> This was what I had in mind, but rather than printing the text into the PDDocument
>>>> you can inject it directly into PDFTextStripper as TextPosition instances. I mentioned
>>>> something like this a while ago:
>>>> 
>>>>> You could subclass PDFTextStripper and override the startDocument method and use it to create a PDFRenderer and store it in a field. Then override the processPage method and use the previously created PDFRenderer to render the current page to a buffered image and perform OCR on the image. Once you have the OCR text + positions, instead of calling processStream you can call processTextPosition once for each character + position.
>>>> 
>>>> Let's see how well it works and then re-evaluate.
>>>> 
>>>> -- John
>>> 
>>> 
>>> 
>>> --
>>> Regards
>>> 
>>> W.Dimuthu Upeksha
>>> Undergraduate
>>> 
>>> Department of Computer Science And Engineering
>>> 
>>> University of Moratuwa, Sri Lanka
> 
> 
> 
> -- 
> Regards
> 
> W.Dimuthu Upeksha
> Undergraduate
> 
> Department of Computer Science And Engineering
> 
> University of Moratuwa, Sri Lanka

Re: [GSoC 2014]Optical Character Recognition project - Introduction

Posted by DImuthu Upeksha <di...@gmail.com>.
I looked at the processTextPosition method in the PDFTextStripper class. It
takes a TextPosition object as its parameter. The TextPosition object
takes two matrices as parameters in its constructor

 Matrix textPositionSt
 Matrix textPositionEnd

1. What is the task of these matrices?
2. What should be the format of their data?

I debugged one textPositionSt matrix and for that sample its value was

[12.0, 0.0, 0.0, 0.0, 12.0, 0.0, 15.336001, 0.0, 1.0]

What is the meaning of these values?

Thanks
Dimuthu

On Tue, Mar 25, 2014 at 11:54 PM, John Hewson <jo...@jahewson.com> wrote:
> Hi Dimuthu
>
> Each line of text is handled by the processEncodedText method in PDFStreamEngine
> which calls processTextPosition once for each character. The processTextPosition
> method in PDFStreamEngine collects the text positions into lines, paragraphs and
> columns (also called "articles"). Text on a PDF page does not have to be drawn in
> order, so text at any position can occur at any time and processTextPosition will sort
> the text and insert it into the relevant line/paragraph/column.
>
> To make your words with bounding boxes compatible with processTextPosition you
> should convert each character in the word into a TextPosition and then you can call
> processTextPosition.
>
> Thanks
>
> -- John
>
> On 24 Mar 2014, at 01:32, DImuthu Upeksha <di...@gmail.com> wrote:
>
>> Hi John,
>>
>> I looked at processTextPosition method in PDFTextStripper. But I
>> couldn't understand actual process happening inside the method. What
>> should be the input for that method? In my case I have words with
>> bounding box's coordinates. How can I make those data to compatible
>> with the input of processTextPosition method. As well, what is the
>> output of the method?
>>
>> Thanks
>> Dimuthu
>>
>> On Wed, Mar 19, 2014 at 11:19 PM, John Hewson <jo...@jahewson.com> wrote:
>>> Hi Dimuthu
>>>
>>>> 1 Print those data into PDDocument again and pass through TextStripper
>>>> of PDFBox. This could reduce the performance of overall process.
>>>
>>> This was what I had in mind, but rather than printing the text into the PDDocument
>>> you can inject it directly into PDFTextStripper as TextPosition instances. I mentioned
>>> something like this a while ago:
>>>
>>>> You could subclass PDFTextStripper and override the startDocument method and use it to create a PDFRenderer and store it in a field. Then override the processPage method and use the previously created PDFRenderer to render the current page to a buffered image and perform OCR on the image. Once you have the OCR text + positions, instead of calling processStream you can call processTextPosition once for each character + position.
>>>
>>> Let's see how well it works and then re-evaluate.
>>>
>>> -- John
>>>
>>
>>
>>
>> --
>> Regards
>>
>> W.Dimuthu Upeksha
>> Undergraduate
>>
>> Department of Computer Science And Engineering
>>
>> University of Moratuwa, Sri Lanka
>



-- 
Regards

W.Dimuthu Upeksha
Undergraduate

Department of Computer Science And Engineering

University of Moratuwa, Sri Lanka

Re: [GSoC 2014]Optical Character Recognition project - Introduction

Posted by John Hewson <jo...@jahewson.com>.
Hi Dimuthu

Each line of text is handled by the processEncodedText method in PDFStreamEngine,
which calls processTextPosition once for each character. The processTextPosition
override in PDFTextStripper collects the text positions into lines, paragraphs and
columns (also called “articles”). Text on a PDF page does not have to be drawn in
order, so text at any position can occur at any time and processTextPosition will sort
the text and insert it into the relevant line/paragraph/column.

To make your words with bounding boxes compatible with processTextPosition you
should convert each character in the word into a TextPosition and then you can call
processTextPosition.
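
One way to sketch the word-to-character conversion, assuming the OCR result gives only a word and its bounding box, is to cut the box into equal-width slices. CharBox and splitWord are hypothetical names, not PDFBox API, and equal widths are only a rough approximation for proportional fonts. Each slice would then supply the start and end x for one TextPosition:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: CharBox and splitWord are hypothetical names, not PDFBox API.
class WordSplitter {
    /** One character plus its estimated horizontal extent and baseline. */
    static final class CharBox {
        final char ch;
        final float startX;
        final float endX;
        final float y;

        CharBox(char ch, float startX, float endX, float y) {
            this.ch = ch;
            this.startX = startX;
            this.endX = endX;
            this.y = y;
        }
    }

    /** Splits a word's bounding box into equal-width slices, one per character. */
    static List<CharBox> splitWord(String word, float left, float right, float baseline) {
        List<CharBox> boxes = new ArrayList<>();
        if (word.isEmpty()) {
            return boxes;
        }
        float step = (right - left) / word.length();
        for (int i = 0; i < word.length(); i++) {
            boxes.add(new CharBox(word.charAt(i),
                    left + i * step, left + (i + 1) * step, baseline));
        }
        return boxes;
    }

    public static void main(String[] args) {
        // Illustrative word box spanning x = 10..40 on baseline y = 100.
        for (CharBox b : splitWord("Hello", 10f, 40f, 100f)) {
            System.out.println(b.ch + ": " + b.startX + " .. " + b.endX);
        }
    }
}
```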

Thanks

-- John

On 24 Mar 2014, at 01:32, DImuthu Upeksha <di...@gmail.com> wrote:

> Hi John,
> 
> I looked at processTextPosition method in PDFTextStripper. But I
> couldn't understand actual process happening inside the method. What
> should be the input for that method? In my case I have words with
> bounding box's coordinates. How can I make those data to compatible
> with the input of processTextPosition method. As well, what is the
> output of the method?
> 
> Thanks
> Dimuthu
> 
> On Wed, Mar 19, 2014 at 11:19 PM, John Hewson <jo...@jahewson.com> wrote:
>> Hi Dimuthu
>> 
>>> 1 Print those data into PDDocument again and pass through TextStripper
>>> of PDFBox. This could reduce the performance of overall process.
>> 
>> This was what I had in mind, but rather than printing the text into the PDDocument
>> you can inject it directly into PDFTextStripper as TextPosition instances. I mentioned
>> something like this a while ago:
>> 
>>> You could subclass PDFTextStripper and override the startDocument method and use it to create a PDFRenderer and store it in a field. Then override the processPage method and use the previously created PDFRenderer to render the current page to a buffered image and perform OCR on the image. Once you have the OCR text + positions, instead of calling processStream you can call processTextPosition once for each character + position.
>> 
>> Let's see how well it works and then re-evaluate.
>> 
>> -- John
>> 
> 
> 
> 
> -- 
> Regards
> 
> W.Dimuthu Upeksha
> Undergraduate
> 
> Department of Computer Science And Engineering
> 
> University of Moratuwa, Sri Lanka


Re: [GSoC 2014]Optical Character Recognition project - Introduction

Posted by DImuthu Upeksha <di...@gmail.com>.
Hi John,

I looked at the processTextPosition method in PDFTextStripper, but I
couldn't understand what actually happens inside the method. What
should the input for that method be? In my case I have words with
bounding box coordinates. How can I make that data compatible
with the input of the processTextPosition method? Also, what is the
output of the method?

Thanks
Dimuthu

On Wed, Mar 19, 2014 at 11:19 PM, John Hewson <jo...@jahewson.com> wrote:
> Hi Dimuthu
>
>> 1 Print those data into PDDocument again and pass through TextStripper
>> of PDFBox. This could reduce the performance of overall process.
>
> This was what I had in mind, but rather than printing the text into the PDDocument
> you can inject it directly into PDFTextStripper as TextPosition instances. I mentioned
> something like this a while ago:
>
>> You could subclass PDFTextStripper and override the startDocument method and use it to create a PDFRenderer and store it in a field. Then override the processPage method and use the previously created PDFRenderer to render the current page to a buffered image and perform OCR on the image. Once you have the OCR text + positions, instead of calling processStream you can call processTextPosition once for each character + position.
>
> Let's see how well it works and then re-evaluate.
>
> -- John
>



-- 
Regards

W.Dimuthu Upeksha
Undergraduate

Department of Computer Science And Engineering

University of Moratuwa, Sri Lanka

Re: [GSoC 2014]Optical Character Recognition project - Introduction

Posted by John Hewson <jo...@jahewson.com>.
Hi Dimuthu

> 1 Print those data into PDDocument again and pass through TextStripper
> of PDFBox. This could reduce the performance of overall process.

This was what I had in mind, but rather than printing the text into the PDDocument
you can inject it directly into PDFTextStripper as TextPosition instances. I mentioned
something like this a while ago:

> You could subclass PDFTextStripper and override the startDocument method and use it to create a PDFRenderer and store it in a field. Then override the processPage method and use the previously created PDFRenderer to render the current page to a buffered image and perform OCR on the image. Once you have the OCR text + positions, instead of calling processStream you can call processTextPosition once for each character + position.

Let’s see how well it works and then re-evaluate.
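
The flow described above can be sketched without PDFBox, assuming hypothetical stand-ins for every collaborator: PageRenderer plays the role of PDFRenderer, OcrEngine the role of the Tesseract wrapper, and CharSink the role of processTextPosition. A real version would subclass PDFTextStripper instead of using these interfaces:

```java
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.List;

// Sketch only: every type here is a hypothetical stand-in, not PDFBox API.
class OcrStripperSketch {
    /** One recognised character with its position on the page. */
    static final class OcrChar {
        final char ch;
        final float x;
        final float y;

        OcrChar(char ch, float x, float y) {
            this.ch = ch;
            this.x = x;
            this.y = y;
        }
    }

    interface PageRenderer { BufferedImage render(int pageIndex); }
    interface OcrEngine { List<OcrChar> recognise(BufferedImage image); }
    interface CharSink { void accept(OcrChar c); }

    private final PageRenderer renderer; // created once, as in startDocument
    private final OcrEngine engine;
    private final CharSink sink;

    OcrStripperSketch(PageRenderer renderer, OcrEngine engine, CharSink sink) {
        this.renderer = renderer;
        this.engine = engine;
        this.sink = sink;
    }

    /** Mirrors the processPage override: render, OCR, then one call per character. */
    void processPage(int pageIndex) {
        BufferedImage image = renderer.render(pageIndex);
        for (OcrChar c : engine.recognise(image)) {
            sink.accept(c);
        }
    }

    public static void main(String[] args) {
        // Fake renderer and engine so the sketch runs without any native library.
        PageRenderer blank = page -> new BufferedImage(100, 50, BufferedImage.TYPE_BYTE_GRAY);
        OcrEngine fake = image -> {
            List<OcrChar> out = new ArrayList<>();
            out.add(new OcrChar('H', 10f, 100f));
            out.add(new OcrChar('i', 16f, 100f));
            return out;
        };
        StringBuilder text = new StringBuilder();
        new OcrStripperSketch(blank, fake, c -> text.append(c.ch)).processPage(0);
        System.out.println(text);
    }
}
```

Keeping the engine behind an interface also fits the earlier suggestion in this thread that the OCR engine should be pluggable.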

-- John


Re: [GSoC 2014]Optical Character Recognition project - Introduction

Posted by DImuthu Upeksha <di...@gmail.com>.
Hi John,

I'm thinking about an approach to combine the word + location data
that comes from the Tesseract API into actual sentences. What I get is
1 The identified word
2 The coordinates of that word's bounding box

So in the end I have a set of words with bounding boxes. To combine
them I'm thinking about two approaches

1 Print those data into PDDocument again and pass through TextStripper
of PDFBox. This could reduce the performance of overall process.

2 Write the algorithms from scratch. This may need some extra research
work. However, I feel that I could reuse the same algorithms used in PDFBox
for this task.

Which is the most feasible and efficient solution? I prefer the second
approach, but it may require more time and testing than the first one.
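
To give a feel for the second approach, here is a naive sketch that groups word boxes into lines and orders each line left to right. It is not the algorithm PDFBox uses. The Word class and toLines method are hypothetical, and it assumes image coordinates with y growing downward and no skewed or multi-column text:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch only: a naive grouping, not the algorithm PDFTextStripper uses.
class LineGrouper {
    /** A recognised word with its bounding box in image coordinates. */
    static final class Word {
        final String text;
        final float left, top, right, bottom;

        Word(String text, float left, float top, float right, float bottom) {
            this.text = text;
            this.left = left;
            this.top = top;
            this.right = right;
            this.bottom = bottom;
        }

        float centerY() {
            return (top + bottom) / 2f;
        }
    }

    /** Groups words into lines (top to bottom), each read left to right. */
    static List<String> toLines(List<Word> words) {
        List<Word> sorted = new ArrayList<>(words);
        sorted.sort(Comparator.comparingDouble((Word w) -> w.centerY()));

        List<List<Word>> lines = new ArrayList<>();
        for (Word w : sorted) {
            List<Word> last = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            // Same line if the vertical centre falls inside the line's first box.
            if (last != null
                    && w.centerY() >= last.get(0).top
                    && w.centerY() <= last.get(0).bottom) {
                last.add(w);
            } else {
                List<Word> line = new ArrayList<>();
                line.add(w);
                lines.add(line);
            }
        }

        List<String> out = new ArrayList<>();
        for (List<Word> line : lines) {
            line.sort(Comparator.comparingDouble((Word w) -> w.left));
            StringBuilder sb = new StringBuilder();
            for (Word w : line) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(w.text);
            }
            out.add(sb.toString());
        }
        return out;
    }
}
```

A sketch like this ignores paragraphs, columns and rotated text, which is part of why reusing the existing PDFBox logic may be less work than it first appears.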

Thanks
Dimuthu

On Sun, Mar 16, 2014 at 6:45 PM, DImuthu Upeksha
<di...@gmail.com> wrote:
> Hi John,
>
> For now I'm using those methods to debug the wrapper. I'll remove
> those methods after I finished testing it.
>
> I started implementing OCR-plugin [1] for PDFBox. Currently it
> satisfies basic requirements such as getting word+location data [2].
> Please have a look at that and let me know if any changes are
> required.
>
> [1] https://github.com/DImuthuUpe/OCR-Plugin
> [2] https://github.com/DImuthuUpe/OCR-Plugin/blob/master/src/main/java/org/apache/pdfbox/ocr/OCRConnector.java
>
> Thanks
> Dimuthu
>
> On Fri, Mar 14, 2014 at 12:09 AM, John Hewson <jo...@jahewson.com> wrote:
>> Thanks, I saw your new refactoring too, it's good. Now the following methods are no longer needed:
>>
>> public void setImagePath(String path)
>> public void setImage(byte[] imagedata, int width, int height, int bpp,int bpl)
>>
>> Cheers
>>
>> -- John
>>
>> On 11 Mar 2014, at 22:58, DImuthu Upeksha <di...@gmail.com> wrote:
>>
>>> Hi John,
>>> Yes. I implemented a new method to accept byte streams of the image as
>>> an input. We directly can't send BufferedImage objects to native side.
>>> So what I did is converting buffered image into a byte array and
>>> passed it in to native side. At the native side it again converts in
>>> to compatible format. With that request we need to pass some metadata
>>> of byte stream like image width, height, bytes per pixel and bytes per
>>> row. I checked it with this [2] test case and it works fine.
>>>
>>> [1] https://github.com/DImuthuUpe/Tesseract-API/blob/master/src/main/java/com/apache/pdfbox/ocr/tesseract/TessBaseAPI.java#L74
>>> [2] https://github.com/DImuthuUpe/Tesseract-API/blob/master/src/test/java/com/apache/pdfbox/ocr/tesseract/TessByteSteamTest.java
>>>
>>> Thanks
>>> Dimuthu
>>>
>>> On Wed, Mar 12, 2014 at 12:40 AM, John Hewson <jo...@jahewson.com> wrote:
>>>> Hi Dimuthu
>>>>
>>>> The Tesseract wrapper needs to take its input from a BufferedImage rather than reading a file from disk, so instead of:
>>>>
>>>> api.setImagePath("test.tif");
>>>>
>>>> What we need is:
>>>>
>>>> BufferedImage image = ImageIO.read(new File("test.tif"));
>>>> api.setImagePath(image);
>>>>
>>>> Because this will let us used the BufferedImage generated by PDFRenderer without round-tripping to the disk.
>>>>
>>>> -- John
>>>>
>>>> On 11 Mar 2014, at 11:13, DImuthu Upeksha <di...@gmail.com> wrote:
>>>>
>>>>> Hi John,
>>>>> Thanks for the guidance.
>>>>> I did a small analysis of the accuracy and performance of new
>>>>> Tesseract wrapper. I used this [1] image as the input image and got
>>>>> following data [2] after OCR. First line is the recognised word
>>>>> followed by location details (bounding box) of the word. I think these
>>>>> details are pretty much enough for our task. Now what remaining is
>>>>> converting pdf file into a image as you have mentioned. These days I'm
>>>>> working on it.
>>>>>
>>>>> [1] https://www.dropbox.com/s/11wahtonoz08zmn/image4.TIF
>>>>> [2] https://gist.github.com/DImuthuUpe/9491660
>>>>>
>>>>> Thanks
>>>>> Dimuthu
>>>>>
>>>>> On Mon, Mar 10, 2014 at 2:30 PM, John Hewson <jo...@jahewson.com> wrote:
>>>>>> Dimuthu,
>>>>>>
>>>>>>> I finished basic implementation of JNI wrapper for Tesseract. Now it can be
>>>>>>> build using maven. Some useful methods that are needed to do basic OCR were
>>>>>>> implemented.
>>>>>>
>>>>>> Great, it's looking good, nice and clean.
>>>>>>
>>>>>>> 1. What is the task of processStream method in PDFTextStripper class line
>>>>>>> 456 : processStream( page.findResources(), content, page.findCropBox(),
>>>>>>> page.findRotation() );
>>>>>>
>>>>>> A PDF file is made up of pages, each of which contains a "content stream". This content stream contains a list of drawing commands such as "move to 10,15" or "write the word `foo`", these are called operators. The processStream function reads the stream for the current page and executes each of the operators. The operators themselves are implemented each in their own class which is a subclass of PDFOperator. The constructor of PDFStreamEngine creates the operator classes using reflection, which is rather odd and I'm not sure why this design was chosen. The operators used by PDFTextStripper can be found in org/apache/pdfbox/resources/PDFTextStripper.properties
>>>>>>
>>>>>>> 2. Say I need to extract images and it's metadata from a pdf. What is the better approach to do it?
>>>>>>
>>>>>> You could subclass PDFTextStripper and override the startDocument method and use it to create a PDFRenderer and store it in a field. Then override the processPage method and use the previously created PDFRenderer to render the current page to a buffered image and perform OCR on the image. Once you have the OCR text + positions, instead of calling processStream you can call processTextPosition once for each character + position.
>>>>>>
>>>>>> The PDFRenderer class was just added to the trunk, so make sure you do an "svn update". Let me know if you need me to change PDFTextStripper to make it easier to subclass.
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> -- John
>>>>>>
>>>>>> On 9 Mar 2014, at 09:08, DImuthu Upeksha <di...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi John,
>>>>>>> I finished basic implementation of JNI wrapper for Tesseract. Now it can be
>>>>>>> build using maven. Some useful methods that are needed to do basic OCR were
>>>>>>> implemented.
>>>>>>>
>>>>>>> I went through PDFBox code several times and got couple of issues that are
>>>>>>> needed to be clarified
>>>>>>>
>>>>>>> 1. What is the task of processStream method in PDFTextStripper class line
>>>>>>> 456 : processStream( page.findResources(), content, page.findCropBox(),
>>>>>>> page.findRotation() );
>>>>>>>
>>>>>>> 2. Say I need to extract images and it's metadata from a pdf. What is the
>>>>>>> better approach to do it?
>>>>>>>
>>>>>>> Thanks
>>>>>>> Dimuthu
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Mar 7, 2014 at 9:26 PM, DImuthu Upeksha
>>>>>>> <di...@gmail.com>wrote:
>>>>>>>
>>>>>>>> Hi John
>>>>>>>> I refactored Tesseract JNI code to support maven build. To create the JNI
>>>>>>>> library I added pre-built static libraries of Tesseract and Leptonica to
>>>>>>>> resources folder[2]. For now it includes librararies supported for mac. But
>>>>>>>> we can easily add both windows and linux libraries. After "mvn clean
>>>>>>>> install", the jar is created under target folder. Now all setting up is
>>>>>>>> done. What remains is implementing those native methods in tessbaseapi.cpp
>>>>>>>> [3]. Hope to finish it asap. Please let me know if there is any concern
>>>>>>>> about project structure.
>>>>>>>>
>>>>>>>> [1] https://github.com/DImuthuUpe/Tesseract-API.git
>>>>>>>> [2]
>>>>>>>> https://github.com/DImuthuUpe/Tesseract-API/tree/master/src/main/resources
>>>>>>>> [3]
>>>>>>>> https://github.com/DImuthuUpe/Tesseract-API/blob/master/src/main/native/src/tessbaseapi.cpp
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Dimuthu
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Mar 6, 2014 at 1:15 AM, John Hewson <jo...@jahewson.com> wrote:
>>>>>>>>
>>>>>>>>> Dimuthu
>>>>>>>>>
>>>>>>>>>> There is a lot of code
>>>>>>>>>> fractions in current android jni wrapper which use "(jint)somePointer"
>>>>>>>>>> casting which will create terrible memory leaks in 64 bit environments
>>>>>>>>>> because ponters are 64 bit. So I believe writing it from the beginning
>>>>>>>>> is
>>>>>>>>>> much better.
>>>>>>>>>
>>>>>>>>> That's a classic 64-bit pitfall, well spotted. We definitely need to
>>>>>>>>> support
>>>>>>>>> 64-bit JVMs.
>>>>>>>>>
>>>>>>>>>> we can use
>>>>>>>>>> the static library of Leptonica (I did and it worked nicely). I think
>>>>>>>>> it is
>>>>>>>>>> not a issue to use it's static library because both Tesseract and
>>>>>>>>> Leptonica
>>>>>>>>>> is under apache licence.
>>>>>>>>>
>>>>>>>>> Sounds good, I found the following in the README:
>>>>>>>>>
>>>>>>>>> Leptonica is required. (www.leptonica.com). Tesseract no longer compiles
>>>>>>>>> without Leptonica.
>>>>>>>>>
>>>>>>>>> Which makes sense.
>>>>>>>>>
>>>>>>>>> -- John
>>>>>>>>>
>>>>>>>>> On 5 Mar 2014, at 09:45, DImuthu Upeksha <di...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi John,
>>>>>>>>>> +1 for you suggestion about converting image <=> byte array at java
>>>>>>>>> side.
>>>>>>>>>> It reduces lot of complexities. I don't know whether you have noticed or
>>>>>>>>>> not, jint data type in jni is a 32bit integer type. I noticed it in my
>>>>>>>>> Mac
>>>>>>>>>> but don't know about other operating systems.
>>>>>>>>>>
>>>>>>>>>> Leptonica is the image processing library for Tesseract [1]. Tesseract
>>>>>>>>>> uses the image processing algorithms in Leptonica to implement its OCR
>>>>>>>>>> algorithms. This [2] is the .cpp file responsible for creating the
>>>>>>>>>> Tesseract API. You can see it includes the allheaders.h header file,
>>>>>>>>>> which is the main header file of Leptonica. So I think we must build
>>>>>>>>>> Leptonica first and link it when we build Tesseract. This is not a big
>>>>>>>>>> problem if we can use the static library of Leptonica (I did and it
>>>>>>>>>> worked nicely). I think it is not an issue to use its static library
>>>>>>>>>> because both Tesseract and Leptonica are under the Apache licence.
>>>>>>>>>>
>>>>>>>>>> I'm working on the maven implementation you have mentioned and will get
>>>>>>>>>> back to you soon.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Dimuthu
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> [1] https://code.google.com/p/tesseract-ocr/wiki/Compiling
>>>>>>>>>> [2] https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/src/api/tesseractmain.cpp
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Mar 5, 2014 at 1:15 AM, John Hewson <jo...@jahewson.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Dimuthu,
>>>>>>>>>>>
>>>>>>>>>>> 1,2,3:
>>>>>>>>>>>
>>>>>>>>>>> Feel free to write your own Tesseract binding or port the existing
>>>>>>>>>>> code as you see fit. The JNI binding should be minimal: only the
>>>>>>>>>>> methods you require need to be wrapped. Also, don't forget that some
>>>>>>>>>>> of the interop can be done in Java. For example, if it is easier to
>>>>>>>>>>> convert a BufferedImage to a byte array in Java, then do it there and
>>>>>>>>>>> pass the result to JNI rather than writing lots of JNI C++ to achieve
>>>>>>>>>>> the same result.
>>>>>>>>>>>
>>>>>>>>>>> Your GitHub repo looks like a good start, I can make comments there as
>>>>>>>>>>> things progress.
>>>>>>>>>>>
>>>>>>>>>>> Is it possible to build Tesseract without leptonica? I was under the
>>>>>>>>>>> impression that it was
>>>>>>>>>>> used for image i/o only, but I may be misinformed.
>>>>>>>>>>>
>>>>>>>>>>> 4: The native platform library should be built as part of the Maven
>>>>>>>>>>> build for the Tesseract wrapper, which can be a separate project. The
>>>>>>>>>>> output can be a jar file which contains the native binaries. It should
>>>>>>>>>>> be possible for the jar to contain prebuilt binaries for all platforms,
>>>>>>>>>>> but this is something we can worry about later. Right now the goal
>>>>>>>>>>> should be to build a jar containing just the current platform's native
>>>>>>>>>>> binary and any Java wrapper code.
>>>>>>>>>>>
>>>>>>>>>>> -- John
>>>>>>>>>>>
>>>>>>>>>>> On 3 Mar 2014, at 16:41, DImuthu Upeksha <di...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>
>>>>>>>>>>>> I tried to reuse that Android JNI wrapper for Tesseract. Here are my
>>>>>>>>>>>> observations:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. This wrapper depends heavily on Android image libraries
>>>>>>>>>>>> (android/bitmap.h). Most of the wrapper methods [1] use this library.
>>>>>>>>>>>>
>>>>>>>>>>>> 2. But I can understand the underlying logic in each function. Basically
>>>>>>>>>>>> it maps Tesseract API functions [2] to Java methods. In between, it does
>>>>>>>>>>>> some image <=> byte array conversions using the Android bitmap
>>>>>>>>>>>> libraries.
>>>>>>>>>>>>
>>>>>>>>>>>> 3. There are two ways forward. 1: We can port its code to make it
>>>>>>>>>>>> compatible with our environments (Linux, Windows and Mac), which is
>>>>>>>>>>>> really painful. It will also cause memory leaks. 2: We can use only its
>>>>>>>>>>>> function signatures and implement them with our own code.
>>>>>>>>>>>>
>>>>>>>>>>>> I think the 2nd solution is better because we need only a few operations
>>>>>>>>>>>> from the Tesseract library. I have created a GitHub repo [3] for this.
>>>>>>>>>>>> It's still not finished. I need to add some make files and build files
>>>>>>>>>>>> to make it run properly. I also need to implement those wrapper
>>>>>>>>>>>> functions [4]. This may take some time.
>>>>>>>>>>>>
>>>>>>>>>>>> 4. Because we are calling native libraries, we need different builds of
>>>>>>>>>>>> the Tesseract and Leptonica libraries for each platform (dll for
>>>>>>>>>>>> Windows, so for Linux, dylib for Mac). So we may need to build those
>>>>>>>>>>>> libraries at the time we build the PDFBox project. Or we can prebuild
>>>>>>>>>>>> those libraries and add them to the project in .dll, .so or .dylib
>>>>>>>>>>>> format. Which way is preferred?
>>>>>>>>>>>>
>>>>>>>>>>>> [1]
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/tessbaseapi.cpp
>>>>>>>>>>>> [2] https://code.google.com/p/tesseract-ocr/wiki/APIExample
>>>>>>>>>>>> [3] https://github.com/DImuthuUpe/Tesseract-API
>>>>>>>>>>>> [4]
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>> https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/tessbaseapi.cpp
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, Mar 1, 2014 at 11:39 PM, DImuthu Upeksha <
>>>>>>>>>>> dimuthu.upeksha2@gmail.com
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I updated necessary changes to the document [1]
>>>>>>>>>>>>>
>>>>>>>>>>>>> For the last two days I had a deep look at this [2] JNI wrapper for the
>>>>>>>>>>>>> Tesseract API. Unfortunately it has been designed for the Android
>>>>>>>>>>>>> environment, so I think we need to write our own make files to build it
>>>>>>>>>>>>> into a dll (Windows) or dylib (Mac). Currently it has Android.mk files
>>>>>>>>>>>>> [3]. I'm searching for a way to convert them to a make file that we can
>>>>>>>>>>>>> run on the console. Please suggest if you have a better approach.
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>> https://www.dropbox.com/s/9qclvq26divwr2q/Optical%20Character%20Recognition%20for%20PDFBox%20-%20updated.pdf
>>>>>>>>>>>>> [2]
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/
>>>>>>>>>>>>> [3]
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/Android.mk
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, Mar 1, 2014 at 12:27 AM, John Hewson <jo...@jahewson.com>
>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> This is a good start. However, there is no need for the Adder
>>>>>>>>>>> component,
>>>>>>>>>>>>>> "Extracted Text (OCR) can just feed back into the PDFBox "Text
>>>>>>>>>>> Extractor".
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Maybe show a "PDF" file feeding in to "Text Extractor, to make it
>>>>>>>>> clear
>>>>>>>>>>>>>> where the process starts.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 26 Feb 2014, at 16:53, DImuthu Upeksha <
>>>>>>>>> dimuthu.upeksha2@gmail.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Sorry for the mistake. I added it to my Dropbox [1].
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>> https://www.dropbox.com/s/y3m15rfjmw4eqij/Optical%20Character%20Recognition%20for%20PDFBox.pdf
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Feb 27, 2014 at 4:44 AM, John Hewson <jo...@jahewson.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I should add that the OCR engine should be pluggable so PDFToText
>>>>>>>>>>> might
>>>>>>>>>>>>>>>> use an interface, e.g. OCREngine and there will be a
>>>>>>>>>>> TesseractOCREngine
>>>>>>>>>>>>>>>> class somewhere which provides the required functionality and
>>>>>>>>> lives
>>>>>>>>>>> in
>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>> separate jar file.
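
The pluggable design described above can be sketched in Java. This is only a minimal sketch under assumptions: the names OCREngine and TesseractOCREngine come from the message, but the recognize method, its signature, and the stub class below are hypothetical placeholders, not an existing PDFBox API.

```java
import java.awt.image.BufferedImage;
import java.util.Arrays;
import java.util.List;

class PluggableOcrDemo {
    // The caller depends only on the interface, never on a concrete engine.
    static String extractText(OCREngine engine, BufferedImage image) {
        return String.join(" ", engine.recognize(image));
    }

    public static void main(String[] args) {
        BufferedImage page = new BufferedImage(1, 1, BufferedImage.TYPE_INT_RGB);
        System.out.println(extractText(new StubOCREngine(), page));
    }
}

// Hypothetical plug-in point: any OCR back end implements this interface.
interface OCREngine {
    List<String> recognize(BufferedImage image);
}

// Stand-in for a real engine such as a TesseractOCREngine living in its own jar.
class StubOCREngine implements OCREngine {
    @Override
    public List<String> recognize(BufferedImage image) {
        return Arrays.asList("hello", "world");
    }
}
```

Because the text extraction code would only see the interface, a different engine could be swapped in without touching it.
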
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 25 Feb 2014, at 20:18, Dimuthu <di...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> So do you need to embed those new functionalities into existing
>>>>>>>>>>>>>>>> PDFtoText algorithms or package them as a new sub system(something
>>>>>>>>>>>>>> like an
>>>>>>>>>>>>>>>> API)?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>> From: "John Hewson" <jo...@jahewson.com>
>>>>>>>>>>>>>>>>> Sent: 26/02/2014 07:38
>>>>>>>>>>>>>>>>> To: "dev@pdfbox.apache.org" <de...@pdfbox.apache.org>
>>>>>>>>>>>>>>>>> Subject: Re: [GSoC 2014]Optical Character Recognition project -
>>>>>>>>>>>>>>>> Introduction
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Yes, exactly. By location data I just mean (x,y) coordinates and
>>>>>>>>>>> page
>>>>>>>>>>>>>>>> rotation.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> There is another use case for OCR: some fonts embedded in PDFs
>>>>>>>>>>>>>>>>> have corrupt encodings, which means the ASCII codes map to the
>>>>>>>>>>>>>>>>> wrong glyphs. We could OCR the glyphs to repair the encoding.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 25 Feb 2014, at 17:13, DImuthu Upeksha <
>>>>>>>>>>>>>> dimuthu.upeksha2@gmail.com>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>>>> Thanks for the explanation.
>>>>>>>>>>>>>>>>>> Let's say there is a pdf with both text in extractable format
>>>>>>>>> and
>>>>>>>>>>>>>> some
>>>>>>>>>>>>>>>>>> images with text(Scanned images). In that case first we extract
>>>>>>>>>>> those
>>>>>>>>>>>>>>>>>> extractable content using PDFBox algorithms and rest is
>>>>>>>>> extracted
>>>>>>>>>>>>>> using
>>>>>>>>>>>>>>>>>> OCR. Finally we pack both results together and give output as
>>>>>>>>>>>>>>>> PDFToText. Am
>>>>>>>>>>>>>>>>>> I correct? What do you mean by "location data"?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 11:22 PM, John Hewson <
>>>>>>>>> john@jahewson.com>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 1. What is called "glyphs" ?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> http://en.wikipedia.org/wiki/Glyph
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> 2. What is the main requirement of this project?
>>>>>>>>>>>>>>>>>>>> As far as I understood, first we need to generate an image of
>>>>>>>>>>>>>>>>>>>> malformed pdfs from
>>>>>>>>>>>>>>>>>>>> PDFBox and then we need to do processing using OCR for further
>>>>>>>>>>>>>>>> accurate
>>>>>>>>>>>>>>>>>>>> results.  But the problem is, why shouldn't we directly do
>>>>>>>>> OCR on
>>>>>>>>>>>>>>>> those
>>>>>>>>>>>>>>>>>>>> PDFs without getting output from PDFBox? Correct me if I'm
>>>>>>>>> wrong.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> PDFBox can generate images (PDFToImage) and can extract text
>>>>>>>>>>>>>>>>>>> (PDFToText). The goal of this project is to enhance PDFToText
>>>>>>>>>>>>>>>>>>> so that it can use OCR to extract text from areas of the
>>>>>>>>>>>>>>>>>>> document where the text is embedded as an image. Such PDF files
>>>>>>>>>>>>>>>>>>> are typically generated by scanners or fax machines. There is
>>>>>>>>>>>>>>>>>>> also another case where OCR is useful: some fonts embedded in
>>>>>>>>>>>>>>>>>>> PDF files contain the wrong encoding, so when text is extracted
>>>>>>>>>>>>>>>>>>> with PDFToText the result is nonsense but when drawn with
>>>>>>>>>>>>>>>>>>> PDFToImage we see the correct letters.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Instead of:
>>>>>>>>>>>>>>>>>>> PDF => Image => OCR => Text
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> We want to do:
>>>>>>>>>>>>>>>>>>> PDF => (Many images for words + location data => OCR) => Text
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha <
>>>>>>>>>>>>>>>>>>> dimuthu.upeksha2@gmail.com
>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Ok fixed. This is what I did
>>>>>>>>>>>>>>>>>>>>> Right click on the new project ->Debug As-> Debug
>>>>>>>>> Configurations
>>>>>>>>>>>>>>>>>>> ->Source
>>>>>>>>>>>>>>>>>>>>> ->Add -> Project
>>>>>>>>>>>>>>>>>>>>> Then I selected PDFBox project.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha <
>>>>>>>>>>>>>>>>>>>>> dimuthu.upeksha2@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I'm using eclipse. This is what I want. I created a new Java
>>>>>>>>>>>>>>>>>>> application
>>>>>>>>>>>>>>>>>>>>>> project (say TestPDFBox) with a main class with following
>>>>>>>>> code.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> PDDocument document = new PDDocument();
>>>>>>>>>>>>>>>>>>>>>> PDPage blankPage = new PDPage();
>>>>>>>>>>>>>>>>>>>>>> document.addPage( blankPage );
>>>>>>>>>>>>>>>>>>>>>> document.save("BlankPage.pdf");
>>>>>>>>>>>>>>>>>>>>>> document.close();
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Then I need to add those jar files generated in target
>>>>>>>>> folder
>>>>>>>>>>> of
>>>>>>>>>>>>>>>> PDFBox
>>>>>>>>>>>>>>>>>>>>>> to build path of my new project (I did build the PDFBox
>>>>>>>>> project
>>>>>>>>>>>>>> from
>>>>>>>>>>>>>>>>>>>>>> source). That is what I did. But let's say I need to check
>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>> functionality of document.save("") method. But I don't have
>>>>>>>>> a
>>>>>>>>>>>>>>>>>>> reference to
>>>>>>>>>>>>>>>>>>>>>> it's sources because I directly used generated jars. As
>>>>>>>>> Tilman
>>>>>>>>>>>>>> said
>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>> built
>>>>>>>>>>>>>>>>>>>>>> PDFBox from sources but I don't know a proper way to use it
>>>>>>>>>>> other
>>>>>>>>>>>>>>>>>>> projects
>>>>>>>>>>>>>>>>>>>>>> other than adding those jar files to build path.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <
>>>>>>>>>>> john@jahewson.com>
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Which IDE are you using? You should be able to run the
>>>>>>>>>>> PDFToText
>>>>>>>>>>>>>>>> class
>>>>>>>>>>>>>>>>>>>>>>> (in pdfbox-tools) using your IDE and pass a PDF file path
>>>>>>>>> as
>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> command
>>>>>>>>>>>>>>>>>>>>>>> line argument.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On 24 Feb 2014, at 22:38, DImuthu Upeksha <
>>>>>>>>>>>>>>>>>>> dimuthu.upeksha2@gmail.com>
>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the reply. Yes, I checked out the PDFBox code and
>>>>>>>>>>>>>>>>>>>>>>>> managed to build it successfully. I looked at the classes you
>>>>>>>>>>>>>>>>>>>>>>>> mentioned and got a rough idea of how they work. To check them
>>>>>>>>>>>>>>>>>>>>>>>> I added the jars in the target folder to a separate Java
>>>>>>>>>>>>>>>>>>>>>>>> project. I tried the samples in
>>>>>>>>>>>>>>>>>>>>>>>> http://pdfbox.apache.org/cookbook/. I need to look further
>>>>>>>>>>>>>>>>>>>>>>>> into the code, especially how the processXXX() methods work in
>>>>>>>>>>>>>>>>>>>>>>>> the PDFTextStripper class. What I usually do is add some
>>>>>>>>>>>>>>>>>>>>>>>> breakpoints and inspect them in the debug windows, but that is
>>>>>>>>>>>>>>>>>>>>>>>> not possible using jars. What approach do you follow for such
>>>>>>>>>>>>>>>>>>>>>>>> a task?
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> I also installed Tesseract on my machine and managed to do
>>>>>>>>>>>>>>>>>>>>>>>> some OCR. It's a cool tool which works fine.
>>>>>>>>>>>>>>>>>>>>>>>> I'm still learning the code. If I run into any issues I'll
>>>>>>>>>>>>>>>>>>>>>>>> drop you a mail.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <
>>>>>>>>>>>>>> john@jahewson.com
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Hi Dimuthu
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> The PDFBox website can be found at http://pdfbox.apache.org/
>>>>>>>>>>>>>>>>>>>>>>>>> and contains a basic overview of the project and details on
>>>>>>>>>>>>>>>>>>>>>>>>> how to obtain the source code and build PDFBox for yourself.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Currently we do not perform any OCR and PDFBOX-1912 details
>>>>>>>>>>>>>>>>>>>>>>>>> the only thoughts so far regarding it. Note that the OCR
>>>>>>>>>>>>>>>>>>>>>>>>> libraries mentioned in the JIRA issue are all under the
>>>>>>>>>>>>>>>>>>>>>>>>> Apache license, which is a requirement.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Once you have the source code, take a look at the PageDrawer
>>>>>>>>>>>>>>>>>>>>>>>>> class to see how text and images are rendered. We want
>>>>>>>>>>>>>>>>>>>>>>>>> someone to interface at a low level (e.g. one glyph, word,
>>>>>>>>>>>>>>>>>>>>>>>>> or sentence at a time) with an OCR engine. Also look at
>>>>>>>>>>>>>>>>>>>>>>>>> PDFTextStripper, which is how text is currently extracted.
>>>>>>>>>>>>>>>>>>>>>>>>> Take a look at how we have to go to great lengths to sort
>>>>>>>>>>>>>>>>>>>>>>>>> text back into reading order and infer the placement of
>>>>>>>>>>>>>>>>>>>>>>>>> diacritics - PDF is fundamentally a visual format, not a
>>>>>>>>>>>>>>>>>>>>>>>>> structured format like HTML - which is why extracting text
>>>>>>>>>>>>>>>>>>>>>>>>> can be so difficult sometimes.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> The full PDF Reference document can be found at:
>>>>>>>>>>>>>>>>>>>>>>>>> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Feel free to discuss specifics of your proposal or ask
>>>>>>>>> any
>>>>>>>>>>>>>>>>>>> questions.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On 23 Feb 2014, at 21:13, DImuthu Upeksha <
>>>>>>>>>>>>>>>>>>> dimuthu.upeksha2@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>>> I am Dimuthu Upeksha, a Computer Engineering
>>>>>>>>> Undergraduate
>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>> University
>>>>>>>>>>>>>>>>>>>>>>>>> of Moratuwa Sri Lanka. I successfully completed my GSoC
>>>>>>>>> 2013
>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>>>>>>> Apache
>>>>>>>>>>>>>>>>>>>>>>>>> ISIS [1] project. I'm very much interested in OCR and
>>>>>>>>> image
>>>>>>>>>>>>>>>>>>> processing
>>>>>>>>>>>>>>>>>>>>>>>>> stuff. So I would like to select this project idea as my
>>>>>>>>>>> GSoC
>>>>>>>>>>>>>>>> 2014
>>>>>>>>>>>>>>>>>>>>>>> project
>>>>>>>>>>>>>>>>>>>>>>>>> because I feel like it is the best suited project for
>>>>>>>>> me. In
>>>>>>>>>>>>>>>>>>>>>>> university
>>>>>>>>>>>>>>>>>>>>>>>>> also we have done some research in OCR area and our group
>>>>>>>>>>>>>> wrote a
>>>>>>>>>>>>>>>>>>>>>>>>> literature review about increasing efficiency of OCR
>>>>>>>>>>>>>>>>>>>>>>> systems(attached). Can
>>>>>>>>>>>>>>>>>>>>>>>>> you please suggest me where to start learning about
>>>>>>>>> PDFBox?
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>> http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Thank you
>>>>>>>>>>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>>>>>>>>>>>>>>>>> Undergraduate
>>>>>>>>>>>>>>>>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka



-- 
Regards

W.Dimuthu Upeksha
Undergraduate

Department of Computer Science And Engineering

University of Moratuwa, Sri Lanka

Re: [GSoC 2014]Optical Character Recognition project - Introduction

Posted by John Hewson <jo...@jahewson.com>.
Hi Dimuthu

This is a good start. One point to address is that a String in Java is encoded as UTF-16,
so your getUTF8Text() method must be doing something wrong. It should perform
a UTF-16 conversion internally and be renamed to getText(). You can probably do the
conversion in Java rather than in C++ (or maybe Tesseract can return UTF-16?).
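
A minimal sketch of doing the conversion on the Java side, assuming the native method hands back Tesseract's raw UTF-8 bytes as a byte[]. The class name, the decodeUtf8 helper and the sample bytes are hypothetical and not part of the wrapper:

```java
import java.nio.charset.StandardCharsets;

class Utf8ToJavaString {
    // Decode UTF-8 bytes (as a native OCR call might return them) into a
    // Java String, which stores its characters as UTF-16 code units.
    static String decodeUtf8(byte[] utf8Bytes) {
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "é" (U+00E9) is two bytes in UTF-8 but a single UTF-16 code unit.
        byte[] utf8 = {(byte) 0xC3, (byte) 0xA9};
        String text = decodeUtf8(utf8);
        System.out.println(text.length()); // prints 1
    }
}
```

With this approach the public method can simply be called getText() and return a normal Java String.
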

Cheers

-- John

On 16 Mar 2014, at 06:15, DImuthu Upeksha <di...@gmail.com> wrote:

> Hi John,
> 
> For now I'm using those methods to debug the wrapper. I'll remove
> them after I finish testing it.
> 
> I have started implementing an OCR plugin [1] for PDFBox. Currently it
> satisfies the basic requirements, such as getting word + location data [2].
> Please have a look at it and let me know if any changes are
> required.
> 
> [1] https://github.com/DImuthuUpe/OCR-Plugin
> [2] https://github.com/DImuthuUpe/OCR-Plugin/blob/master/src/main/java/org/apache/pdfbox/ocr/OCRConnector.java
> 
> Thanks
> Dimuthu
> 
> On Fri, Mar 14, 2014 at 12:09 AM, John Hewson <jo...@jahewson.com> wrote:
>> Thanks, I saw your new refactoring too, it's good. Now the following methods are no longer needed:
>> 
>> public void setImagePath(String path)
>> public void setImage(byte[] imagedata, int width, int height, int bpp,int bpl)
>> 
>> Cheers
>> 
>> -- John
>> 
>> On 11 Mar 2014, at 22:58, DImuthu Upeksha <di...@gmail.com> wrote:
>> 
>>> Hi John,
>>> Yes. I implemented a new method [1] that accepts the image's byte stream
>>> as input. We can't send BufferedImage objects directly to the native side,
>>> so I convert the buffered image into a byte array and pass it to the
>>> native side, where it is converted back into a compatible format. With
>>> that request we need to pass some metadata about the byte stream: image
>>> width, height, bytes per pixel and bytes per row. I checked it with this
>>> [2] test case and it works fine.
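
The conversion described above might look roughly like this on the Java side. It is a sketch only: the class and method names are hypothetical, and the packed 3-bytes-per-pixel RGB layout is an assumption about what the native side expects, not something confirmed in this thread.

```java
import java.awt.image.BufferedImage;

class ImageBytes {
    static final int BYTES_PER_PIXEL = 3; // assumed layout: R, G, B per pixel

    // Pack the image row by row into a byte array with no padding.
    static byte[] toRgbBytes(BufferedImage image) {
        int width = image.getWidth();
        int height = image.getHeight();
        byte[] data = new byte[width * height * BYTES_PER_PIXEL];
        int i = 0;
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int argb = image.getRGB(x, y);
                data[i++] = (byte) (argb >> 16); // red
                data[i++] = (byte) (argb >> 8);  // green
                data[i++] = (byte) argb;         // blue
            }
        }
        return data;
    }

    // Bytes per row for the packed layout above.
    static int bytesPerLine(BufferedImage image) {
        return image.getWidth() * BYTES_PER_PIXEL;
    }

    public static void main(String[] args) {
        BufferedImage image = new BufferedImage(2, 1, BufferedImage.TYPE_INT_RGB);
        image.setRGB(0, 0, 0xFF0000); // red pixel
        image.setRGB(1, 0, 0x0000FF); // blue pixel
        byte[] data = toRgbBytes(image);
        System.out.println(data.length + " bytes, " + bytesPerLine(image) + " per row");
    }
}
```

The array, width, height, BYTES_PER_PIXEL and bytesPerLine values would then be the five pieces of information handed to the native call.
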
>>> 
>>> [1] https://github.com/DImuthuUpe/Tesseract-API/blob/master/src/main/java/com/apache/pdfbox/ocr/tesseract/TessBaseAPI.java#L74
>>> [2] https://github.com/DImuthuUpe/Tesseract-API/blob/master/src/test/java/com/apache/pdfbox/ocr/tesseract/TessByteSteamTest.java
>>> 
>>> Thanks
>>> Dimuthu
>>> 
>>> On Wed, Mar 12, 2014 at 12:40 AM, John Hewson <jo...@jahewson.com> wrote:
>>>> Hi Dimuthu
>>>> 
>>>> The Tesseract wrapper needs to take its input from a BufferedImage rather than reading a file from disk, so instead of:
>>>> 
>>>> api.setImagePath("test.tif");
>>>> 
>>>> What we need is:
>>>> 
>>>> BufferedImage image = ImageIO.read(new File("test.tif"));
>>>> api.setImagePath(image);
>>>> 
>>>> This will let us use the BufferedImage generated by PDFRenderer without round-tripping to the disk.
>>>> 
>>>> -- John
>>>> 
>>>> On 11 Mar 2014, at 11:13, DImuthu Upeksha <di...@gmail.com> wrote:
>>>> 
>>>>> Hi John,
>>>>> Thanks for the guidance.
>>>>> I did a small analysis of the accuracy and performance of the new
>>>>> Tesseract wrapper. I used this [1] image as the input and got the
>>>>> following data [2] after OCR. The first line is the recognised word,
>>>>> followed by the location details (bounding box) of the word. I think
>>>>> these details are enough for our task. What remains is converting a
>>>>> PDF file into an image, as you mentioned. I'm working on that these
>>>>> days.
>>>>> 
>>>>> [1] https://www.dropbox.com/s/11wahtonoz08zmn/image4.TIF
>>>>> [2] https://gist.github.com/DImuthuUpe/9491660
>>>>> 
>>>>> Thanks
>>>>> Dimuthu
>>>>> 
>>>>> On Mon, Mar 10, 2014 at 2:30 PM, John Hewson <jo...@jahewson.com> wrote:
>>>>>> Dimuthu,
>>>>>> 
>>>>>>> I finished a basic implementation of the JNI wrapper for Tesseract. It can
>>>>>>> now be built using Maven. Some useful methods needed for basic OCR were
>>>>>>> implemented.
>>>>>> 
>>>>>> Great, it's looking good, nice and clean.
>>>>>> 
>>>>>>> 1. What is the task of processStream method in PDFTextStripper class line
>>>>>>> 456 : processStream( page.findResources(), content, page.findCropBox(),
>>>>>>> page.findRotation() );
>>>>>> 
>>>>>> A PDF file is made up of pages, each of which contains a "content stream". This content stream contains a list of drawing commands such as "move to 10,15" or "write the word `foo`", these are called operators. The processStream function reads the stream for the current page and executes each of the operators. The operators themselves are implemented each in their own class which is a subclass of PDFOperator. The constructor of PDFStreamEngine creates the operator classes using reflection, which is rather odd and I'm not sure why this design was chosen. The operators used by PDFTextStripper can be found in org/apache/pdfbox/resources/PDFTextStripper.properties
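
The idea of reading a content stream and dispatching each operator to its own handler can be sketched as follows. This is a toy model with hypothetical names (ToyStreamEngine, Operator) and a simplified whitespace-separated token format; it is not PDFBox's actual PDFStreamEngine. "Td" (move text position) and "Tj" (show text) are real PDF operators, and operands precede the operator as they do in a real content stream.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ToyStreamEngine {
    // One handler per operator, looked up by the operator's name.
    interface Operator {
        void process(List<String> operands, StringBuilder out);
    }

    private final Map<String, Operator> operators = new HashMap<>();

    ToyStreamEngine() {
        // "x y Td": move the text position.
        operators.put("Td", (ops, out) ->
                out.append("[").append(ops.get(0)).append(",").append(ops.get(1)).append("] "));
        // "(text) Tj": show a string; strip the surrounding parentheses.
        operators.put("Tj", (ops, out) -> {
            String s = ops.get(0);
            out.append(s, 1, s.length() - 1);
        });
    }

    // Collect operands until an operator name is seen, then run its handler.
    // Simplification: any token that is not a known operator counts as an operand.
    String processStream(String content) {
        StringBuilder out = new StringBuilder();
        List<String> operands = new ArrayList<>();
        for (String token : content.trim().split("\\s+")) {
            Operator op = operators.get(token);
            if (op == null) {
                operands.add(token);
            } else {
                op.process(operands, out);
                operands.clear();
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(new ToyStreamEngine().processStream("10 15 Td (foo) Tj"));
    }
}
```

A text stripper in this model would register only the text-related handlers and ignore everything else.
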
>>>>>> 
>>>>>>> 2. Say I need to extract images and their metadata from a PDF. What is the best approach?
>>>>>> 
>>>>>> You could subclass PDFTextStripper and override the startDocument method and use it to create a PDFRenderer and store it in a field. Then override the processPage method and use the previously created PDFRenderer to render the current page to a buffered image and perform OCR on the image. Once you have the OCR text + positions, instead of calling processStream you can call processTextPosition once for each character + position.
>>>>>> 
>>>>>> The PDFRenderer class was just added to the trunk, so make sure you do an "svn update". Let me know if you need me to change PDFTextStripper to make it easier to subclass.
>>>>>> 
>>>>>> Cheers
>>>>>> 
>>>>>> -- John
>>>>>> 
>>>>>> On 9 Mar 2014, at 09:08, DImuthu Upeksha <di...@gmail.com> wrote:
>>>>>> 
>>>>>>> Hi John,
>>>>>>> I finished a basic implementation of the JNI wrapper for Tesseract. It can
>>>>>>> now be built using Maven. Some useful methods needed for basic OCR were
>>>>>>> implemented.
>>>>>>> 
>>>>>>> I went through the PDFBox code several times and have a couple of questions
>>>>>>> that need clarification:
>>>>>>> 
>>>>>>> 1. What is the task of the processStream method in the PDFTextStripper class,
>>>>>>> line 456: processStream( page.findResources(), content, page.findCropBox(),
>>>>>>> page.findRotation() );
>>>>>>> 
>>>>>>> 2. Say I need to extract images and their metadata from a PDF. What is the
>>>>>>> best approach?
>>>>>>> 
>>>>>>> Thanks
>>>>>>> Dimuthu
>>>>>>> 
>>>>>>> 
>>>>>>> On Fri, Mar 7, 2014 at 9:26 PM, DImuthu Upeksha
>>>>>>> <di...@gmail.com> wrote:
>>>>>>> 
>>>>>>>> Hi John
>>>>>>>> I refactored the Tesseract JNI code [1] to support a Maven build. To create the
>>>>>>>> JNI library I added pre-built static libraries of Tesseract and Leptonica to
>>>>>>>> the resources folder [2]. For now it includes only the libraries for Mac, but
>>>>>>>> we can easily add both Windows and Linux libraries. After "mvn clean
>>>>>>>> install", the jar is created under the target folder. All the setup is now
>>>>>>>> done. What remains is implementing the native methods in tessbaseapi.cpp
>>>>>>>> [3]. I hope to finish it as soon as possible. Please let me know if you have
>>>>>>>> any concerns about the project structure.
>>>>>>>> 
>>>>>>>> [1] https://github.com/DImuthuUpe/Tesseract-API.git
>>>>>>>> [2]
>>>>>>>> https://github.com/DImuthuUpe/Tesseract-API/tree/master/src/main/resources
>>>>>>>> [3]
>>>>>>>> https://github.com/DImuthuUpe/Tesseract-API/blob/master/src/main/native/src/tessbaseapi.cpp
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> Dimuthu
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Thu, Mar 6, 2014 at 1:15 AM, John Hewson <jo...@jahewson.com> wrote:
>>>>>>>> 
>>>>>>>>> Dimuthu
>>>>>>>>> 
>>>>>>>>>> There are a lot of code
>>>>>>>>>> fragments in the current Android JNI wrapper which use "(jint)somePointer"
>>>>>>>>>> casting, which will create terrible memory leaks in 64-bit environments
>>>>>>>>>> because pointers are 64 bits wide. So I believe writing it from the
>>>>>>>>>> beginning is much better.
>>>>>>>>> 
>>>>>>>>> That's a classic 64-bit pitfall, well spotted. We definitely need to
>>>>>>>>> support
>>>>>>>>> 64-bit JVMs.
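
The 64-bit pitfall discussed above can be illustrated without any native code. The following is only a sketch in plain Java of the arithmetic behind a "(jint)somePointer" cast; the class name, method names and the example address are made up for illustration and do not come from the wrapper. A truncated value no longer identifies the original native object, so it can be neither used nor freed correctly.

```java
final class PointerWidth {
    /** Mimics a C cast such as (jint) somePointer: only the low 32 bits survive. */
    static int toJint(long nativePointer) {
        return (int) nativePointer;
    }

    /** True only when the full address can be recovered from the 32-bit value. */
    static boolean survivesJint(long nativePointer) {
        return (long) toJint(nativePointer) == nativePointer;
    }

    public static void main(String[] args) {
        long address = 0x00007F8A12345678L; // hypothetical 64-bit heap address
        System.out.println(Long.toHexString(address));
        System.out.println(Integer.toHexString(toJint(address)));
        System.out.println(survivesJint(address));
    }
}
```

Keeping native handles in a Java long (jlong on the native side) avoids the truncation on both 32-bit and 64-bit JVMs.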
>>>>>>>>> 
>>>>>>>>>> we can use
>>>>>>>>>> the static library of Leptonica (I did and it worked nicely). I think
>>>>>>>>> it is
>>>>>>>>>> not a issue to use it's static library because both Tesseract and
>>>>>>>>> Leptonica
>>>>>>>>>> is under apache licence.
>>>>>>>>> 
>>>>>>>>> Sounds good, I found the following in the README:
>>>>>>>>> 
>>>>>>>>> Leptonica is required. (www.leptonica.com). Tesseract no longer compiles
>>>>>>>>> without Leptonica.
>>>>>>>>> 
>>>>>>>>> Which makes sense.
>>>>>>>>> 
>>>>>>>>> -- John
>>>>>>>>> 
>>>>>>>>> On 5 Mar 2014, at 09:45, DImuthu Upeksha <di...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi John,
>>>>>>>>>> +1 for your suggestion about converting image <=> byte array on the Java
>>>>>>>>>> side. It removes a lot of complexity. I don't know whether you have noticed,
>>>>>>>>>> but the jint data type in JNI is a 32-bit integer type. I noticed it on my
>>>>>>>>>> Mac but don't know about other operating systems.
>>>>>>>>>> 
>>>>>>>>>> Leptonica is the image processing library for Tesseract [1]. Tesseract
>>>>>>>>>> uses the image processing algorithms in Leptonica to implement its OCR
>>>>>>>>>> algorithms. This [2] is the .cpp file responsible for creating the Tesseract
>>>>>>>>>> API. You can see it includes the allheaders.h header file, which is the main
>>>>>>>>>> header file of Leptonica. So I think we must build Leptonica first and
>>>>>>>>>> link it when we build Tesseract. This is not a big problem if we can use
>>>>>>>>>> the static library of Leptonica (I did and it worked nicely). I think it is
>>>>>>>>>> not an issue to use its static library because both Tesseract and
>>>>>>>>>> Leptonica are under the Apache licence.
>>>>>>>>>> 
>>>>>>>>>> I'm working on the maven implementation you have mentioned and will get
>>>>>>>>>> back to you soon.
>>>>>>>>>> 
>>>>>>>>>> Thanks
>>>>>>>>>> Dimuthu
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> [1] https://code.google.com/p/tesseract-ocr/wiki/Compiling
>>>>>>>>>> [2]
>>>>>>>>>> 
>>>>>>>>> https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/src/api/tesseractmain.cpp
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Wed, Mar 5, 2014 at 1:15 AM, John Hewson <jo...@jahewson.com> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hi Dimuthu,
>>>>>>>>>>> 
>>>>>>>>>>> 1,2,3:
>>>>>>>>>>> 
>>>>>>>>>>> Feel free to write your own Tesseract binding or port the existing
>>>>>>>>> code as
>>>>>>>>>>> you see fit.
>>>>>>>>>>> The JNI binding should be minimal, only the methods you require need
>>>>>>>>> to be
>>>>>>>>>>> wrapped.
>>>>>>>>>>> Also, don't forget that some of the interop can be done in Java, for
>>>>>>>>>>> example if it is easier
>>>>>>>>>>> to convert a BufferedImage to a byte array in Java then do it there and
>>>>>>>>>>> pass the result
>>>>>>>>>>> to JNI rather than writing lots of JNI C++ to achieve the same result.
>>>>>>>>>>> 
>>>>>>>>>>> Your GitHub repo looks like a good start, I can make comments there as
>>>>>>>>>>> things progress.
>>>>>>>>>>> 
>>>>>>>>>>> Is it possible to build Tesseract without leptonica? I was under the
>>>>>>>>>>> impression that it was
>>>>>>>>>>> used for image i/o only, but I may be misinformed.
>>>>>>>>>>> 
>>>>>>>>>>> 4:  The native platform library should be built as part of the Maven
>>>>>>>>> build
>>>>>>>>>>> for the Tesseract
>>>>>>>>>>> wrapper which can be a separate project. The output can be a jar file
>>>>>>>>>>> which contains the
>>>>>>>>>>> native binaries. It should be possible for the jar to contain prebuilt
>>>>>>>>>>> binaries for all platforms
>>>>>>>>>>> but this is something we can worry about later. Right now the goal
>>>>>>>>> should
>>>>>>>>>>> be to build a jar
>>>>>>>>>>> containing just the current platform's native binary and any Java
>>>>>>>>> wrapper
>>>>>>>>>>> code.
>>>>>>>>>>> 
>>>>>>>>>>> -- John
>>>>>>>>>>> 
>>>>>>>>>>> On 3 Mar 2014, at 16:41, DImuthu Upeksha <di...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Hi John,
>>>>>>>>>>>> 
>>>>>>>>>>>> I tried to reuse that android jni wrapper for tesseract. Here is my
>>>>>>>>>>>> observation
>>>>>>>>>>>> 
>>>>>>>>>>>> 1. This wrapper heavily depends on android image libraries.
>>>>>>>>>>>> (android/bitmap.h). Most of the wrapper methods [1] use this library.
>>>>>>>>>>>> 
>>>>>>>>>>>> 2. But I can understand underlying logic in each function. Basically
>>>>>>>>> what
>>>>>>>>>>>> it does is mapping between tesseract api functions [2] with java
>>>>>>>>> methods.
>>>>>>>>>>>> In between it does to some image <=> byte array like conversions by
>>>>>>>>> using
>>>>>>>>>>>> that bitmap libraries in Android
>>>>>>>>>>>> 
>>>>>>>>>>>> 3. There are two ways. 1: We can port it's code to make compatible
>>>>>>>>> with
>>>>>>>>>>> our
>>>>>>>>>>>> environments(linux,windows and mac) which is really painful. Also it
>>>>>>>>> will
>>>>>>>>>>>> cause memory leaks. 2: We can use only it's function signatures and
>>>>>>>>>>>> implement using our codes
>>>>>>>>>>>> 
>>>>>>>>>>>> I think the 2nd solution is better because we need only a few operations
>>>>>>>>>>>> from the Tesseract library. I have created a GitHub repo [3] for this.
>>>>>>>>>>>> It's still not finished. I need to add some make files and build files to
>>>>>>>>>>>> make it run properly. I also need to implement those wrapper functions
>>>>>>>>>>>> [4]. This may take some time.
>>>>>>>>>>>> 
>>>>>>>>>>>> 4. Because we are calling native libraries we need different builds of
>>>>>>>>>>>> tesseract and leptonica libraries for each platform (dll for windows,
>>>>>>>>> so
>>>>>>>>>>>> for linux, dylib for mac). So we may need to build those libraries at
>>>>>>>>> the
>>>>>>>>>>>> time we build pdfbox project. Or we can pre build those libraries and
>>>>>>>>> add
>>>>>>>>>>>> them to the project as .dll, .so or .dylib format. What is the
>>>>>>>>> preferred
>>>>>>>>>>>> way?
>>>>>>>>>>>> 
>>>>>>>>>>>> [1]
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/tessbaseapi.cpp
>>>>>>>>>>>> [2] https://code.google.com/p/tesseract-ocr/wiki/APIExample
>>>>>>>>>>>> [3] https://github.com/DImuthuUpe/Tesseract-API
>>>>>>>>>>>> [4]
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/tessbaseapi.cpp
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Sat, Mar 1, 2014 at 11:39 PM, DImuthu Upeksha <
>>>>>>>>>>> dimuthu.upeksha2@gmail.com
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> I updated necessary changes to the document [1]
>>>>>>>>>>>>> 
>>>>>>>>>>>>> For the last two days I had a deep look at this [2] JNI wrapper for the
>>>>>>>>>>>>> Tesseract API.
>>>>>>>>>>>>> Unfortunately it has been designed for the Android environment, so I think
>>>>>>>>>>>>> we need to write our own make files to build it into a dll (Windows) or
>>>>>>>>>>>>> dylib (Mac). Currently it has Android.mk files [3]. I'm searching for a
>>>>>>>>>>>>> way to convert them to a make file that we can run on the console. Please
>>>>>>>>>>>>> suggest a better approach if you have one.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> [1]
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> https://www.dropbox.com/s/9qclvq26divwr2q/Optical%20Character%20Recognition%20for%20PDFBox%20-%20updated.pdf
>>>>>>>>>>>>> [2]
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/
>>>>>>>>>>>>> [3]
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/Android.mk
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Sat, Mar 1, 2014 at 12:27 AM, John Hewson <jo...@jahewson.com>
>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> This is a good start. However, there is no need for the Adder component;
>>>>>>>>>>>>>> "Extracted Text (OCR)" can just feed back into the PDFBox "Text Extractor".
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Maybe show a "PDF" file feeding into "Text Extractor", to make it clear
>>>>>>>>>>>>>> where the process starts.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On 26 Feb 2014, at 16:53, DImuthu Upeksha <
>>>>>>>>> dimuthu.upeksha2@gmail.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Sorry for the mistake. I added it to my Dropbox [1].
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> https://www.dropbox.com/s/y3m15rfjmw4eqij/Optical%20Character%20Recognition%20for%20PDFBox.pdf
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Thu, Feb 27, 2014 at 4:44 AM, John Hewson <jo...@jahewson.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I should add that the OCR engine should be pluggable so PDFToText
>>>>>>>>>>> might
>>>>>>>>>>>>>>>> use an interface, e.g. OCREngine and there will be a
>>>>>>>>>>> TesseractOCREngine
>>>>>>>>>>>>>>>> class somewhere which provides the required functionality and
>>>>>>>>> lives
>>>>>>>>>>> in
>>>>>>>>>>>>>> a
>>>>>>>>>>>>>>>> separate jar file.
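
The pluggable design described above could look roughly like the sketch below. Only the names OCREngine and TesseractOCREngine come from the message; the method signature, the RecognizedWord type, the FixedTextEngine stand-in and the PdfToTextSketch class are hypothetical, chosen so the example runs without any native library. A real TesseractOCREngine would implement the same interface and live in its own jar.

```java
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.List;

/** One recognised word and its bounding box in image pixels (hypothetical type). */
final class RecognizedWord {
    final String text;
    final int x, y, width, height;

    RecognizedWord(String text, int x, int y, int width, int height) {
        this.text = text;
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
    }
}

/** The pluggable interface; the method signature is an assumption. */
interface OCREngine {
    List<RecognizedWord> recognize(BufferedImage image);
}

/** Stand-in engine that "recognises" a fixed word covering the whole image. */
final class FixedTextEngine implements OCREngine {
    private final String text;

    FixedTextEngine(String text) {
        this.text = text;
    }

    @Override
    public List<RecognizedWord> recognize(BufferedImage image) {
        List<RecognizedWord> words = new ArrayList<>();
        words.add(new RecognizedWord(text, 0, 0, image.getWidth(), image.getHeight()));
        return words;
    }
}

/** The text extractor depends only on the interface, never on a concrete engine. */
final class PdfToTextSketch {
    private final OCREngine engine;

    PdfToTextSketch(OCREngine engine) {
        this.engine = engine;
    }

    String extract(BufferedImage renderedPage) {
        StringBuilder out = new StringBuilder();
        for (RecognizedWord word : engine.recognize(renderedPage)) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(word.text);
        }
        return out.toString();
    }
}
```

Swapping in another engine then only means passing a different OCREngine implementation to the constructor.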
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On 25 Feb 2014, at 20:18, Dimuthu <di...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> So do you need to embed those new functionalities into existing
>>>>>>>>>>>>>>>> PDFtoText algorithms or package them as a new sub system(something
>>>>>>>>>>>>>> like an
>>>>>>>>>>>>>>>> API)?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>> From: "John Hewson" <jo...@jahewson.com>
>>>>>>>>>>>>>>>>> Sent: 26/02/2014 07:38
>>>>>>>>>>>>>>>>> To: "dev@pdfbox.apache.org" <de...@pdfbox.apache.org>
>>>>>>>>>>>>>>>>> Subject: Re: [GSoC 2014]Optical Character Recognition project -
>>>>>>>>>>>>>>>> Introduction
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Yes, exactly. By location data I just mean (x,y) coordinates and
>>>>>>>>>>> page
>>>>>>>>>>>>>>>> rotation.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> There is another use case for OCR: some fonts embedded in PDFs have
>>>>>>>>>>>>>>>>> corrupt encodings, which means the ASCII codes map to the wrong glyphs. We
>>>>>>>>>>>>>>>>> could OCR the glyphs to repair the encoding.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On 25 Feb 2014, at 17:13, DImuthu Upeksha <
>>>>>>>>>>>>>> dimuthu.upeksha2@gmail.com>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>>>> Thanks for the explanation.
>>>>>>>>>>>>>>>>>> Let's say there is a pdf with both text in extractable format
>>>>>>>>> and
>>>>>>>>>>>>>> some
>>>>>>>>>>>>>>>>>> images with text(Scanned images). In that case first we extract
>>>>>>>>>>> those
>>>>>>>>>>>>>>>>>> extractable content using PDFBox algorithms and rest is
>>>>>>>>> extracted
>>>>>>>>>>>>>> using
>>>>>>>>>>>>>>>>>> OCR. Finally we pack both results together and give output as
>>>>>>>>>>>>>>>> PDFToText. Am
>>>>>>>>>>>>>>>>>> I correct? What do you mean by "location data"?
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 11:22 PM, John Hewson <
>>>>>>>>> john@jahewson.com>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 1. What is called "glyphs" ?
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> http://en.wikipedia.org/wiki/Glyph
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 2. What is the main requirement of this project?
>>>>>>>>>>>>>>>>>>>> As far as I understood, first we need to generate an image of
>>>>>>>>>>>>>>>>>>>> malformed pdfs from
>>>>>>>>>>>>>>>>>>>> PDFBox and then we need to do processing using OCR for further
>>>>>>>>>>>>>>>> accurate
>>>>>>>>>>>>>>>>>>>> results.  But the problem is, why shouldn't we directly do
>>>>>>>>> OCR on
>>>>>>>>>>>>>>>> those
>>>>>>>>>>>>>>>>>>>> PDFs without getting output from PDFBox? Correct me if I'm
>>>>>>>>> wrong.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> PDFBox can generate images (PDFToImage) and can extract text
>>>>>>>>>>>>>>>> (PDFToText).
>>>>>>>>>>>>>>>>>>> The goal of
>>>>>>>>>>>>>>>>>>> this project is to enhance PDFToText so that it can use OCR to
>>>>>>>>>>>>>> extract
>>>>>>>>>>>>>>>>>>> text from areas of the
>>>>>>>>>>>>>>>>>>> document where the text is embedded as an image. Such PDF files
>>>>>>>>>>> are
>>>>>>>>>>>>>>>>>>> typically generated by
>>>>>>>>>>>>>>>>>>> scanners or fax machines. There is also another case where OCR
>>>>>>>>> is
>>>>>>>>>>>>>>>> useful:
>>>>>>>>>>>>>>>>>>> some fonts embedded
>>>>>>>>>>>>>>>>>>> in PDF files contain the wrong encoding, so when text is
>>>>>>>>> extracted
>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>>> PDFToText the result is
>>>>>>>>>>>>>>>>>>> nonsense but when drawn with PDFToImage we see the correct
>>>>>>>>>>> letters.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Instead of:
>>>>>>>>>>>>>>>>>>> PDF => Image => OCR => Text
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> We want to do:
>>>>>>>>>>>>>>>>>>> PDF => (Many images for words + location data => OCR) => Text
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha <
>>>>>>>>>>>>>>>>>>> dimuthu.upeksha2@gmail.com
>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Ok fixed. This is what I did
>>>>>>>>>>>>>>>>>>>>> Right click on the new project ->Debug As-> Debug
>>>>>>>>> Configurations
>>>>>>>>>>>>>>>>>>> ->Source
>>>>>>>>>>>>>>>>>>>>> ->Add -> Project
>>>>>>>>>>>>>>>>>>>>> Then I selected PDFBox project.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha <
>>>>>>>>>>>>>>>>>>>>> dimuthu.upeksha2@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> I'm using eclipse. This is what I want. I created a new Java
>>>>>>>>>>>>>>>>>>> application
>>>>>>>>>>>>>>>>>>>>>> project (say TestPDFBox) with a main class with following
>>>>>>>>> code.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> PDDocument document = new PDDocument();PDPage blankPage =
>>>>>>>>> new
>>>>>>>>>>>>>>>>>>> PDPage();document.addPage( blankPage
>>>>>>>>>>>>>>>>>>> );document.save("BlankPage.pdf");document.close();
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Then I need to add those jar files generated in target
>>>>>>>>> folder
>>>>>>>>>>> of
>>>>>>>>>>>>>>>> PDFBox
>>>>>>>>>>>>>>>>>>>>>> to build path of my new project (I did build the PDFBox
>>>>>>>>> project
>>>>>>>>>>>>>> from
>>>>>>>>>>>>>>>>>>>>>> source). That is what I did. But let's say I need to check
>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>> functionality of document.save("") method. But I don't have
>>>>>>>>> a
>>>>>>>>>>>>>>>>>>> reference to
>>>>>>>>>>>>>>>>>>>>>> it's sources because I directly used generated jars. As
>>>>>>>>> Tilman
>>>>>>>>>>>>>> said
>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>> built
>>>>>>>>>>>>>>>>>>>>>> PDFBox from sources but I don't know a proper way to use it
>>>>>>>>>>> other
>>>>>>>>>>>>>>>>>>> projects
>>>>>>>>>>>>>>>>>>>>>> other than adding those jar files to build path.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <
>>>>>>>>>>> john@jahewson.com>
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Which IDE are you using? You should be able to run the
>>>>>>>>>>> PDFToText
>>>>>>>>>>>>>>>> class
>>>>>>>>>>>>>>>>>>>>>>> (in pdfbox-tools) using your IDE and pass a PDF file path
>>>>>>>>> as
>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> command
>>>>>>>>>>>>>>>>>>>>>>> line argument.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> On 24 Feb 2014, at 22:38, DImuthu Upeksha <
>>>>>>>>>>>>>>>>>>> dimuthu.upeksha2@gmail.com>
>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>>>>>>>>>> Thanks for the reply. Yes I checked out PDFBox code and
>>>>>>>>>>>>>> managed to
>>>>>>>>>>>>>>>>>>>>>>> build
>>>>>>>>>>>>>>>>>>>>>>>> code successfully. I looked at the classes you mentioned
>>>>>>>>> and
>>>>>>>>>>> I
>>>>>>>>>>>>>>>> got a
>>>>>>>>>>>>>>>>>>>>>>> rough
>>>>>>>>>>>>>>>>>>>>>>>> idea about how they are working. To check them I used the
>>>>>>>>>>> jars
>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>> target
>>>>>>>>>>>>>>>>>>>>>>>> folder to my separate java project. I tried samples in
>>>>>>>>>>>>>>>>>>>>>>>> http://pdfbox.apache.org/cookbook/. I need to further
>>>>>>>>> look
>>>>>>>>>>>>>> into
>>>>>>>>>>>>>>>> code
>>>>>>>>>>>>>>>>>>>>>>>> specially how those processXXX() methods work in
>>>>>>>>>>>>>> PDFTextStripper
>>>>>>>>>>>>>>>>>>> class.
>>>>>>>>>>>>>>>>>>>>>>>> What I usually do is adding some breakpoints and checking
>>>>>>>>>>> them
>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>> debug
>>>>>>>>>>>>>>>>>>>>>>>> windows. But using jars it's not possible. What is the way
>>>>>>>>>>> you
>>>>>>>>>>>>>>>> follow
>>>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>>> order to do such task?
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> As well I installed tesseract in to my machine and
>>>>>>>>> managed to
>>>>>>>>>>>>>> do
>>>>>>>>>>>>>>>> some
>>>>>>>>>>>>>>>>>>>>>>> OCR
>>>>>>>>>>>>>>>>>>>>>>>> stuff also. That's a cool tool which works fine.
>>>>>>>>>>>>>>>>>>>>>>>> I'm still learning the code. If I get any issue I'll drop
>>>>>>>>>>> you a
>>>>>>>>>>>>>>>> mail.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <
>>>>>>>>>>>>>> john@jahewson.com
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Hi Dimuthu
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> The PDFBox website can be found at http://pdfbox.apache.org/. It contains
>>>>>>>>>>>>>>>>>>>>>>>>> a basic overview of the project
>>>>>>>>>>>>>>>>>>>>>>>>> and details on how to obtain the source code and build
>>>>>>>>>>> PDFBox
>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>>>> yourself.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Currently we do not perform any OCR and PDFBOX-1912
>>>>>>>>> details
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> only
>>>>>>>>>>>>>>>>>>>>>>>>> thoughts so far regarding it.
>>>>>>>>>>>>>>>>>>>>>>>>> Note that the OCR libraries mentioned in the JIRA issue
>>>>>>>>> are
>>>>>>>>>>>>>> all
>>>>>>>>>>>>>>>>>>> under
>>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>>> Apache license, which is a
>>>>>>>>>>>>>>>>>>>>>>>>> requirement.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Once you have the source code, take a look at the
>>>>>>>>> PageDrawer
>>>>>>>>>>>>>>>> class
>>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>>> see
>>>>>>>>>>>>>>>>>>>>>>>>> how text and images are
>>>>>>>>>>>>>>>>>>>>>>>>> rendered. We want someone to interface at a low-level
>>>>>>>>> (e.g.
>>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>>>> glyph,
>>>>>>>>>>>>>>>>>>>>>>>>> word, or sentence at a time) with
>>>>>>>>>>>>>>>>>>>>>>>>> an OCR engine. Also look at PDFTextStripper which is how
>>>>>>>>>>> text
>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>>> currently
>>>>>>>>>>>>>>>>>>>>>>>>> extracted, take a look at how
>>>>>>>>>>>>>>>>>>>>>>>>> we have to go to great length to sort text back into
>>>>>>>>> reading
>>>>>>>>>>>>>>>> order
>>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>>> infer the placement of diacritics - PDF
>>>>>>>>>>>>>>>>>>>>>>>>> is fundamentally a visual format, not a structured format
>>>>>>>>>>> like
>>>>>>>>>>>>>>>> HTML
>>>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>>>>>>>>> which is why extracting text can be so
>>>>>>>>>>>>>>>>>>>>>>>>> difficult sometimes.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> The full PDF Reference document can be found at:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Feel free to discuss specifics of your proposal or ask
>>>>>>>>> any
>>>>>>>>>>>>>>>>>>> questions.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> On 23 Feb 2014, at 21:13, DImuthu Upeksha <
>>>>>>>>>>>>>>>>>>> dimuthu.upeksha2@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>>> I am Dimuthu Upeksha, a Computer Engineering
>>>>>>>>> Undergraduate
>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>> University
>>>>>>>>>>>>>>>>>>>>>>>>> of Moratuwa Sri Lanka. I successfully completed my GSoC
>>>>>>>>> 2013
>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>>>>>>> Apache
>>>>>>>>>>>>>>>>>>>>>>>>> ISIS [1] project. I'm very much interested in OCR and
>>>>>>>>> image
>>>>>>>>>>>>>>>>>>> processing
>>>>>>>>>>>>>>>>>>>>>>>>> stuff. So I would like to select this project idea as my
>>>>>>>>>>> GSoC
>>>>>>>>>>>>>>>> 2014
>>>>>>>>>>>>>>>>>>>>>>> project
>>>>>>>>>>>>>>>>>>>>>>>>> because I feel like it is the best suited project for
>>>>>>>>> me. In
>>>>>>>>>>>>>>>>>>>>>>> university
>>>>>>>>>>>>>>>>>>>>>>>>> also we have done some research in OCR area and our group
>>>>>>>>>>>>>> wrote a
>>>>>>>>>>>>>>>>>>>>>>>>> literature review about increasing efficiency of OCR
>>>>>>>>>>>>>>>>>>>>>>> systems(attached). Can
>>>>>>>>>>>>>>>>>>>>>>>>> you please suggest me where to start learning about
>>>>>>>>> PDFBox?
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> Thank you
>>>>>>>>>>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>>>>>>>>>>>>>>>>> Undergraduate
>>>>>>>>>>>>>>>>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka


Re: [GSoC 2014]Optical Character Recognition project - Introduction

Posted by DImuthu Upeksha <di...@gmail.com>.
Hi John,

For now I'm using those methods to debug the wrapper. I'll remove
them once I have finished testing it.

I have started implementing an OCR plugin [1] for PDFBox. It currently
satisfies the basic requirements, such as getting word + location data [2].
Please have a look and let me know if any changes are
required.

[1] https://github.com/DImuthuUpe/OCR-Plugin
[2] https://github.com/DImuthuUpe/OCR-Plugin/blob/master/src/main/java/org/apache/pdfbox/ocr/OCRConnector.java

Thanks
Dimuthu

On Fri, Mar 14, 2014 at 12:09 AM, John Hewson <jo...@jahewson.com> wrote:
> Thanks, I saw your new refactoring too, it's good. Now the following methods are no longer needed:
>
> public void setImagePath(String path)
> public void setImage(byte[] imagedata, int width, int height, int bpp,int bpl)
>
> Cheers
>
> -- John
>
> On 11 Mar 2014, at 22:58, DImuthu Upeksha <di...@gmail.com> wrote:
>
>> Hi John,
>> Yes. I implemented a new method [1] that accepts the image as a byte
>> stream. We can't send BufferedImage objects directly to the native side,
>> so I convert the BufferedImage into a byte array and
>> pass it to the native side, where it is converted back into
>> a compatible format. With that request we need to pass some metadata
>> about the byte stream: image width, height, bytes per pixel and bytes per
>> row. I checked it with this [2] test case and it works fine.
>>
>> [1] https://github.com/DImuthuUpe/Tesseract-API/blob/master/src/main/java/com/apache/pdfbox/ocr/tesseract/TessBaseAPI.java#L74
>> [2] https://github.com/DImuthuUpe/Tesseract-API/blob/master/src/test/java/com/apache/pdfbox/ocr/tesseract/TessByteSteamTest.java
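
The conversion described above (BufferedImage to a byte array plus width, height, bytes per pixel and bytes per row) might be done on the Java side along these lines. This is a minimal sketch using only the JDK, not the code from the repository: the RawImage and ImageBytes names are hypothetical, and the choice of a packed 3-byte BGR layout is an assumption, so the native side would have to expect that channel order.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.awt.image.DataBufferByte;

/** Raw pixel bytes plus the layout metadata the native side needs (hypothetical type). */
final class RawImage {
    final byte[] data;
    final int width, height, bytesPerPixel, bytesPerLine;

    RawImage(byte[] data, int width, int height, int bytesPerPixel, int bytesPerLine) {
        this.data = data;
        this.width = width;
        this.height = height;
        this.bytesPerPixel = bytesPerPixel;
        this.bytesPerLine = bytesPerLine;
    }
}

final class ImageBytes {
    /** Copies any BufferedImage into a packed 3-byte-per-pixel BGR buffer. */
    static RawImage toRaw(BufferedImage source) {
        int width = source.getWidth();
        int height = source.getHeight();

        // Redraw into a known pixel layout so the byte order does not depend
        // on how the source image happened to be created.
        BufferedImage packed = new BufferedImage(width, height, BufferedImage.TYPE_3BYTE_BGR);
        Graphics2D g = packed.createGraphics();
        try {
            g.drawImage(source, 0, 0, null);
        } finally {
            g.dispose();
        }

        // TYPE_3BYTE_BGR is backed by a single byte array with no row padding.
        byte[] data = ((DataBufferByte) packed.getRaster().getDataBuffer()).getData();
        int bytesPerPixel = 3;
        return new RawImage(data, width, height, bytesPerPixel, width * bytesPerPixel);
    }
}
```

The five fields of RawImage correspond to the five arguments of the setImage(byte[], int, int, int, int) method quoted elsewhere in this thread.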
>>
>> Thanks
>> Dimuthu
>>
>> On Wed, Mar 12, 2014 at 12:40 AM, John Hewson <jo...@jahewson.com> wrote:
>>> Hi Dimuthu
>>>
>>> The Tesseract wrapper needs to take its input from a BufferedImage rather than reading a file from disk, so instead of:
>>>
>>> api.setImagePath("test.tif");
>>>
>>> What we need is:
>>>
>>> BufferedImage image = ImageIO.read(new File("test.tif"));
>>> api.setImagePath(image);
>>>
>>> Because this will let us use the BufferedImage generated by PDFRenderer without round-tripping to the disk.
>>>
>>> -- John
>>>
>>> On 11 Mar 2014, at 11:13, DImuthu Upeksha <di...@gmail.com> wrote:
>>>
>>>> Hi John,
>>>> Thanks for the guidance.
>>>> I did a small analysis of the accuracy and performance of the new
>>>> Tesseract wrapper. I used this [1] image as the input and got
>>>> the following data [2] after OCR. The first line is the recognised word,
>>>> followed by the location details (bounding box) of the word. I think these
>>>> details are enough for our task. What remains is
>>>> converting the PDF file into an image, as you mentioned. I'm
>>>> working on that these days.
>>>>
>>>> [1] https://www.dropbox.com/s/11wahtonoz08zmn/image4.TIF
>>>> [2] https://gist.github.com/DImuthuUpe/9491660
>>>>
>>>> Thanks
>>>> Dimuthu
>>>>
>>>> On Mon, Mar 10, 2014 at 2:30 PM, John Hewson <jo...@jahewson.com> wrote:
>>>>> Dimuthu,
>>>>>
>>>>>> I finished a basic implementation of the JNI wrapper for Tesseract. It can
>>>>>> now be built using Maven. The methods needed to do basic OCR are
>>>>>> implemented.
>>>>>
>>>>> Great, it's looking good, nice and clean.
>>>>>
>>>>>> 1. What is the task of processStream method in PDFTextStripper class line
>>>>>> 456 : processStream( page.findResources(), content, page.findCropBox(),
>>>>>> page.findRotation() );
>>>>>
>>>>> A PDF file is made up of pages, each of which contains a "content stream". This content stream contains a list of drawing commands such as "move to 10,15" or "write the word `foo`"; these are called operators. The processStream function reads the stream for the current page and executes each of the operators. Each operator is implemented in its own class, which is a subclass of PDFOperator. The constructor of PDFStreamEngine creates the operator classes using reflection, which is rather odd and I'm not sure why this design was chosen. The operators used by PDFTextStripper can be found in org/apache/pdfbox/resources/PDFTextStripper.properties
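
The mechanism described above (an operator table, one class per operator, instances created by reflection, and a loop that executes each operator in the stream) can be shown with a toy interpreter. This is not PDFBox code: every class and method name below is hypothetical, the tokeniser just splits on whitespace, and only the two operators from the example are handled ("m" for move-to and "Tj" for show-text, with operands written before the operator as in a real content stream).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class OperatorSketch {
    /** One handler per operator, each in its own class. */
    interface Operator {
        void process(List<String> operands, List<String> log);
    }

    public static final class MoveTo implements Operator {
        @Override
        public void process(List<String> operands, List<String> log) {
            log.add("move to " + operands.get(0) + "," + operands.get(1));
        }
    }

    public static final class ShowText implements Operator {
        @Override
        public void process(List<String> operands, List<String> log) {
            String literal = operands.get(0); // e.g. "(foo)"
            log.add("show text " + literal.substring(1, literal.length() - 1));
        }
    }

    private final Map<String, Operator> operators = new HashMap<>();

    /** Builds the handlers by reflection from an "operator name -> class name" table. */
    OperatorSketch(Map<String, String> table) {
        for (Map.Entry<String, String> entry : table.entrySet()) {
            try {
                Object handler = Class.forName(entry.getValue())
                        .getDeclaredConstructor().newInstance();
                operators.put(entry.getKey(), (Operator) handler);
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Cannot create " + entry.getValue(), e);
            }
        }
    }

    /** Collects operands until an operator token appears, then runs its handler. */
    List<String> processStream(String content) {
        List<String> log = new ArrayList<>();
        List<String> operands = new ArrayList<>();
        for (String token : content.trim().split("\\s+")) {
            Operator operator = operators.get(token);
            if (operator == null) {
                operands.add(token);
            } else {
                operator.process(new ArrayList<>(operands), log);
                operands.clear();
            }
        }
        return log;
    }
}
```

A table mapping "m" to MoveTo.class.getName() and "Tj" to ShowText.class.getName() plays the role of the properties file; a subclass that wants different behaviour only needs a different table.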
>>>>>
>>>>>> 2. Say I need to extract images and their metadata from a PDF. What is the best approach to do it?
>>>>>
>>>>> You could subclass PDFTextStripper and override the startDocument method and use it to create a PDFRenderer and store it in a field. Then override the processPage method and use the previously created PDFRenderer to render the current page to a buffered image and perform OCR on the image. Once you have the OCR text + positions, instead of calling processStream you can call processTextPosition once for each character + position.
>>>>>
>>>>> The PDFRenderer class was just added to the trunk, so make sure you do an "svn update". Let me know if you need me to change PDFTextStripper to make it easier to subclass.
>>>>>
>>>>> Cheers
>>>>>
>>>>> -- John
>>>>>
>>>>> On 9 Mar 2014, at 09:08, DImuthu Upeksha <di...@gmail.com> wrote:
>>>>>
>>>>>> Hi John,
>>>>>> I finished a basic implementation of the JNI wrapper for Tesseract. It can
>>>>>> now be built using Maven. Some useful methods that are needed to do basic
>>>>>> OCR were implemented.
>>>>>>
>>>>>> I went through the PDFBox code several times and have a couple of questions
>>>>>> that need to be clarified:
>>>>>>
>>>>>> 1. What is the task of processStream method in PDFTextStripper class line
>>>>>> 456 : processStream( page.findResources(), content, page.findCropBox(),
>>>>>> page.findRotation() );
>>>>>>
>>>>>> 2. Say I need to extract images and their metadata from a PDF. What is the
>>>>>> best approach to do it?
>>>>>>
>>>>>> Thanks
>>>>>> Dimuthu
>>>>>>
>>>>>>
>>>>>> On Fri, Mar 7, 2014 at 9:26 PM, DImuthu Upeksha
>>>>>> <di...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi John
>>>>>>> I refactored Tesseract JNI code to support maven build. To create the JNI
>>>>>>> library I added pre-built static libraries of Tesseract and Leptonica to
>>>>>>> resources folder [2]. For now it includes libraries built for Mac. But
>>>>>>> we can easily add both windows and linux libraries. After "mvn clean
>>>>>>> install", the jar is created under target folder. Now all setting up is
>>>>>>> done. What remains is implementing those native methods in tessbaseapi.cpp
>>>>>>> [3]. Hope to finish it asap. Please let me know if there is any concern
>>>>>>> about project structure.
>>>>>>>
>>>>>>> [1] https://github.com/DImuthuUpe/Tesseract-API.git
>>>>>>> [2]
>>>>>>> https://github.com/DImuthuUpe/Tesseract-API/tree/master/src/main/resources
>>>>>>> [3]
>>>>>>> https://github.com/DImuthuUpe/Tesseract-API/blob/master/src/main/native/src/tessbaseapi.cpp
>>>>>>>
>>>>>>> Thanks
>>>>>>> Dimuthu
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Mar 6, 2014 at 1:15 AM, John Hewson <jo...@jahewson.com> wrote:
>>>>>>>
>>>>>>>> Dimuthu
>>>>>>>>
>>>>>>>>> There are a lot of code fragments in the current Android JNI wrapper
>>>>>>>>> that use "(jint)somePointer" casts, which will create terrible memory
>>>>>>>>> leaks in 64-bit environments because pointers are 64 bit. So I believe
>>>>>>>>> writing it from the beginning is much better.
>>>>>>>>
>>>>>>>> That's a classic 64-bit pitfall, well spotted. We definitely need to
>>>>>>>> support
>>>>>>>> 64-bit JVMs.
>>>>>>>>
>>>>>>>>> we can use
>>>>>>>>> the static library of Leptonica (I did and it worked nicely). I think
>>>>>>>>> it is not an issue to use its static library because both Tesseract
>>>>>>>>> and Leptonica are under the Apache licence.
>>>>>>>>
>>>>>>>> Sounds good, I found the following in the README:
>>>>>>>>
>>>>>>>> Leptonica is required. (www.leptonica.com). Tesseract no longer compiles
>>>>>>>> without Leptonica.
>>>>>>>>
>>>>>>>> Which makes sense.
>>>>>>>>
>>>>>>>> -- John
>>>>>>>>
>>>>>>>> On 5 Mar 2014, at 09:45, DImuthu Upeksha <di...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi John,
>>>>>>>>> +1 for your suggestion about converting image <=> byte array on the Java
>>>>>>>>> side. It reduces a lot of complexity. I don't know whether you have
>>>>>>>>> noticed or not, but the jint data type in JNI is a 32-bit integer type.
>>>>>>>>> I noticed it on my Mac but don't know about other operating systems.
>>>>>>>>>
>>>>>>>>> Leptonica is the image processing library for Tesseract [1]. Tesseract
>>>>>>>>> uses the image processing algorithms in Leptonica to implement its OCR
>>>>>>>>> algorithms. This [2] is the .cpp file responsible for creating the
>>>>>>>>> Tesseract API. You can see it includes the allheaders.h header file,
>>>>>>>>> which is the main header file of Leptonica. So I think it is a must to
>>>>>>>>> build Leptonica first and link it when we build Tesseract. This is not a
>>>>>>>>> big problem if we can use the static library of Leptonica (I did and it
>>>>>>>>> worked nicely). I think it is not an issue to use its static library
>>>>>>>>> because both Tesseract and Leptonica are under the Apache licence.
>>>>>>>>>
>>>>>>>>> I'm working on the maven implementation you have mentioned and will get
>>>>>>>>> back to you soon.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Dimuthu
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> [1] https://code.google.com/p/tesseract-ocr/wiki/Compiling
>>>>>>>>> [2]
>>>>>>>>>
>>>>>>>> https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/src/api/tesseractmain.cpp
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Mar 5, 2014 at 1:15 AM, John Hewson <jo...@jahewson.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Dimuthu,
>>>>>>>>>>
>>>>>>>>>> 1,2,3:
>>>>>>>>>>
>>>>>>>>>> Feel free to write your own Tesseract binding or port the existing
>>>>>>>> code as
>>>>>>>>>> you see fit.
>>>>>>>>>> The JNI binding should be minimal, only the methods you require need
>>>>>>>> to be
>>>>>>>>>> wrapped.
>>>>>>>>>> Also, don't forget that some of the interop can be done in Java, for
>>>>>>>>>> example if it is easier
>>>>>>>>>> to convert a BufferedImage to a byte array in Java then do it there and
>>>>>>>>>> pass the result
>>>>>>>>>> to JNI rather than writing lots of JNI C++ to achieve the same result.
>>>>>>>>>>
>>>>>>>>>> Your GitHub repo looks like a good start, I can make comments there as
>>>>>>>>>> things progress.
>>>>>>>>>>
>>>>>>>>>> Is it possible to build Tesseract without leptonica? I was under the
>>>>>>>>>> impression that it was
>>>>>>>>>> used for image i/o only, but I may be misinformed.
>>>>>>>>>>
>>>>>>>>>> 4:  The native platform library should be built as part of the Maven
>>>>>>>> build
>>>>>>>>>> for the Tesseract
>>>>>>>>>> wrapper which can be a separate project. The output can be a jar file
>>>>>>>>>> which contains the
>>>>>>>>>> native binaries. It should be possible for the jar to contain prebuilt
>>>>>>>>>> binaries for all platforms
>>>>>>>>>> but this is something we can worry about later. Right now the goal
>>>>>>>> should
>>>>>>>>>> be to build a jar
>>>>>>>>>> containing just the current platform's native binary and any Java
>>>>>>>> wrapper
>>>>>>>>>> code.
>>>>>>>>>>
>>>>>>>>>> -- John
>>>>>>>>>>
>>>>>>>>>> On 3 Mar 2014, at 16:41, DImuthu Upeksha <di...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi John,
>>>>>>>>>>>
>>>>>>>>>>> I tried to reuse that android jni wrapper for tesseract. Here is my
>>>>>>>>>>> observation
>>>>>>>>>>>
>>>>>>>>>>> 1. This wrapper heavily depends on android image libraries.
>>>>>>>>>>> (android/bitmap.h). Most of the wrapper methods [1] use this library.
>>>>>>>>>>>
>>>>>>>>>>> 2. But I can understand underlying logic in each function. Basically
>>>>>>>> what
>>>>>>>>>>> it does is mapping between tesseract api functions [2] with java
>>>>>>>> methods.
>>>>>>>>>>> In between it does to some image <=> byte array like conversions by
>>>>>>>> using
>>>>>>>>>>> that bitmap libraries in Android
>>>>>>>>>>>
>>>>>>>>>>> 3. There are two ways. 1: We can port it's code to make compatible
>>>>>>>> with
>>>>>>>>>> our
>>>>>>>>>>> environments(linux,windows and mac) which is really painful. Also it
>>>>>>>> will
>>>>>>>>>>> cause memory leaks. 2: We can use only it's function signatures and
>>>>>>>>>>> implement using our codes
>>>>>>>>>>>
>>>>>>>>>>> I think 2nd solution is better because we need only few operations to
>>>>>>>> be
>>>>>>>>>>> done using tesseract library. I have created a github repo [3] for
>>>>>>>> this.
>>>>>>>>>>> It's still not finished. I need to add some make files and build
>>>>>>>> files to
>>>>>>>>>>> make it run properly. And also I need to implement those wrapper
>>>>>>>>>> functions
>>>>>>>>>>> [3]. This may take some time.
>>>>>>>>>>>
>>>>>>>>>>> 4. Because we are calling native libraries we need different builds of
>>>>>>>>>>> tesseract and leptonica libraries for each platform (dll for windows,
>>>>>>>> so
>>>>>>>>>>> for linux, dylib for mac). So we may need to build those libraries at
>>>>>>>> the
>>>>>>>>>>> time we build pdfbox project. Or we can pre build those libraries and
>>>>>>>> add
>>>>>>>>>>> them to the project as .dll, .so or .dylib format. What is the
>>>>>>>> preferred
>>>>>>>>>>> way?
>>>>>>>>>>>
>>>>>>>>>>> [1]
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/tessbaseapi.cpp
>>>>>>>>>>> [2] https://code.google.com/p/tesseract-ocr/wiki/APIExample
>>>>>>>>>>> [3] https://github.com/DImuthuUpe/Tesseract-API
>>>>>>>>>>> [4]
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>> https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/tessbaseapi.cpp
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> Dimuthu
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Mar 1, 2014 at 11:39 PM, DImuthu Upeksha <
>>>>>>>>>> dimuthu.upeksha2@gmail.com
>>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I updated necessary changes to the document [1]
>>>>>>>>>>>>
>>>>>>>>>>>> For the last two days I had a deep look at this [2] JNI wrapper for the
>>>>>>>>>>>> Tesseract API.
>>>>>>>>>>>> Unfortunately this has been designed for Android environment so I
>>>>>>>> think
>>>>>>>>>> we
>>>>>>>>>>>> need to write our own make files to build this in to a dll(windows)
>>>>>>>> or
>>>>>>>>>>>> dylib(in mac). Currently it has Android.mk files [3]. I'm searching
>>>>>>>> for
>>>>>>>>>> a
>>>>>>>>>>>> way to convert it to a make file that we can run on console. Please
>>>>>>>>>> suggest
>>>>>>>>>>>> if you have a better approach
>>>>>>>>>>>>
>>>>>>>>>>>> [1]
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>> https://www.dropbox.com/s/9qclvq26divwr2q/Optical%20Character%20Recognition%20for%20PDFBox%20-%20updated.pdf
>>>>>>>>>>>> [2]
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/
>>>>>>>>>>>> [3]
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/Android.mk
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, Mar 1, 2014 at 12:27 AM, John Hewson <jo...@jahewson.com>
>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> This is a good start. However, there is no need for the Adder
>>>>>>>>>> component,
>>>>>>>>>>>>> "Extracted Text (OCR) can just feed back into the PDFBox "Text
>>>>>>>>>> Extractor".
>>>>>>>>>>>>>
>>>>>>>>>>>>> Maybe show a "PDF" file feeding in to "Text Extractor, to make it
>>>>>>>> clear
>>>>>>>>>>>>> where the process starts.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 26 Feb 2014, at 16:53, DImuthu Upeksha <
>>>>>>>> dimuthu.upeksha2@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Sorry for the mistake. I added it to my Dropbox [1].
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>> https://www.dropbox.com/s/y3m15rfjmw4eqij/Optical%20Character%20Recognition%20for%20PDFBox.pdf
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Feb 27, 2014 at 4:44 AM, John Hewson <jo...@jahewson.com>
>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I should add that the OCR engine should be pluggable so PDFToText
>>>>>>>>>> might
>>>>>>>>>>>>>>> use an interface, e.g. OCREngine and there will be a
>>>>>>>>>> TesseractOCREngine
>>>>>>>>>>>>>>> class somewhere which provides the required functionality and
>>>>>>>> lives
>>>>>>>>>> in
>>>>>>>>>>>>> a
>>>>>>>>>>>>>>> separate jar file.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 25 Feb 2014, at 20:18, Dimuthu <di...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> So do you need to embed those new functionalities into existing
>>>>>>>>>>>>>>> PDFtoText algorithms or package them as a new sub system(something
>>>>>>>>>>>>> like an
>>>>>>>>>>>>>>> API)?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>> From: "John Hewson" <jo...@jahewson.com>
>>>>>>>>>>>>>>>> Sent: 26/02/2014 07:38
>>>>>>>>>>>>>>>> To: "dev@pdfbox.apache.org" <de...@pdfbox.apache.org>
>>>>>>>>>>>>>>>> Subject: Re: [GSoC 2014]Optical Character Recognition project -
>>>>>>>>>>>>>>> Introduction
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Yes, exactly. By location data I just mean (x,y) coordinates and
>>>>>>>>>> page
>>>>>>>>>>>>>>> rotation.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> There is another use case for OCR: some fonts embedded in PDFs
>>>>>>>> have
>>>>>>>>>>>>>>> corrupt encodings, which means the ASCII codes map to the wrong
>>>>>>>>>>>>> glyphs. We
>>>>>>>>>>>>>>> could OCR the glyphs to repair the encoding.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 25 Feb 2014, at 17:13, DImuthu Upeksha <
>>>>>>>>>>>>> dimuthu.upeksha2@gmail.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>>> Thanks for the explanation.
>>>>>>>>>>>>>>>>> Let's say there is a pdf with both text in extractable format
>>>>>>>> and
>>>>>>>>>>>>> some
>>>>>>>>>>>>>>>>> images with text(Scanned images). In that case first we extract
>>>>>>>>>> those
>>>>>>>>>>>>>>>>> extractable content using PDFBox algorithms and rest is
>>>>>>>> extracted
>>>>>>>>>>>>> using
>>>>>>>>>>>>>>>>> OCR. Finally we pack both results together and give output as
>>>>>>>>>>>>>>> PDFToText. Am
>>>>>>>>>>>>>>>>> I correct? What do you mean by "location data"?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 11:22 PM, John Hewson <
>>>>>>>> john@jahewson.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 1. What is called "glyphs" ?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> http://en.wikipedia.org/wiki/Glyph
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 2. What is the main requirement of this project?
>>>>>>>>>>>>>>>>>>> As far as I understood, first we need to generate an image of
>>>>>>>>>>>>>>>>>>> malformed pdfs from
>>>>>>>>>>>>>>>>>>> PDFBox and then we need to do processing using OCR for further
>>>>>>>>>>>>>>> accurate
>>>>>>>>>>>>>>>>>>> results.  But the problem is, why shouldn't we directly do
>>>>>>>> OCR on
>>>>>>>>>>>>>>> those
>>>>>>>>>>>>>>>>>>> PDFs without getting output from PDFBox? Correct me if I'm
>>>>>>>> wrong.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> PDFBox can generate images (PDFToImage) and can extract text
>>>>>>>>>>>>>>> (PDFToText).
>>>>>>>>>>>>>>>>>> The goal of
>>>>>>>>>>>>>>>>>> this project is to enhance PDFToText so that it can use OCR to
>>>>>>>>>>>>> extract
>>>>>>>>>>>>>>>>>> text from areas of the
>>>>>>>>>>>>>>>>>> document where the text is embedded as an image. Such PDF files
>>>>>>>>>> are
>>>>>>>>>>>>>>>>>> typically generated by
>>>>>>>>>>>>>>>>>> scanners or fax machines. There is also another case where OCR
>>>>>>>> is
>>>>>>>>>>>>>>> useful:
>>>>>>>>>>>>>>>>>> some fonts embedded
>>>>>>>>>>>>>>>>>> in PDF files contain the wrong encoding, so when text is
>>>>>>>> extracted
>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>> PDFToText the result is
>>>>>>>>>>>>>>>>>> nonsense but when drawn with PDFToImage we see the correct
>>>>>>>>>> letters.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Instead of:
>>>>>>>>>>>>>>>>>> PDF => Image => OCR => Text
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> We want to do:
>>>>>>>>>>>>>>>>>> PDF => (Many images for words + location data => OCR) => Text
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha <
>>>>>>>>>>>>>>>>>> dimuthu.upeksha2@gmail.com
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Ok fixed. This is what I did
>>>>>>>>>>>>>>>>>>>> Right click on the new project ->Debug As-> Debug
>>>>>>>> Configurations
>>>>>>>>>>>>>>>>>> ->Source
>>>>>>>>>>>>>>>>>>>> ->Add -> Project
>>>>>>>>>>>>>>>>>>>> Then I selected PDFBox project.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha <
>>>>>>>>>>>>>>>>>>>> dimuthu.upeksha2@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I'm using eclipse. This is what I want. I created a new Java
>>>>>>>>>>>>>>>>>> application
>>>>>>>>>>>>>>>>>>>>> project (say TestPDFBox) with a main class with following
>>>>>>>> code.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> PDDocument document = new PDDocument();PDPage blankPage =
>>>>>>>> new
>>>>>>>>>>>>>>>>>> PDPage();document.addPage( blankPage
>>>>>>>>>>>>>>>>>> );document.save("BlankPage.pdf");document.close();
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Then I need to add those jar files generated in target
>>>>>>>> folder
>>>>>>>>>> of
>>>>>>>>>>>>>>> PDFBox
>>>>>>>>>>>>>>>>>>>>> to build path of my new project (I did build the PDFBox
>>>>>>>> project
>>>>>>>>>>>>> from
>>>>>>>>>>>>>>>>>>>>> source). That is what I did. But let's say I need to check
>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>> functionality of document.save("") method. But I don't have
>>>>>>>> a
>>>>>>>>>>>>>>>>>> reference to
>>>>>>>>>>>>>>>>>>>>> it's sources because I directly used generated jars. As
>>>>>>>> Tilman
>>>>>>>>>>>>> said
>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>> built
>>>>>>>>>>>>>>>>>>>>> PDFBox from sources but I don't know a proper way to use it
>>>>>>>>>> other
>>>>>>>>>>>>>>>>>> projects
>>>>>>>>>>>>>>>>>>>>> other than adding those jar files to build path.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <
>>>>>>>>>> john@jahewson.com>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Which IDE are you using? You should be able to run the
>>>>>>>>>> PDFToText
>>>>>>>>>>>>>>> class
>>>>>>>>>>>>>>>>>>>>>> (in pdfbox-tools) using your IDE and pass a PDF file path
>>>>>>>> as
>>>>>>>>>> the
>>>>>>>>>>>>>>>>>> command
>>>>>>>>>>>>>>>>>>>>>> line argument.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On 24 Feb 2014, at 22:38, DImuthu Upeksha <
>>>>>>>>>>>>>>>>>> dimuthu.upeksha2@gmail.com>
>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>>>>>>>>> Thanks for the reply. Yes I checked out PDFBox code and
>>>>>>>>>>>>> managed to
>>>>>>>>>>>>>>>>>>>>>> build
>>>>>>>>>>>>>>>>>>>>>>> code successfully. I looked at the classes you mentioned
>>>>>>>> and
>>>>>>>>>> I
>>>>>>>>>>>>>>> got a
>>>>>>>>>>>>>>>>>>>>>> rough
>>>>>>>>>>>>>>>>>>>>>>> idea about how they are working. To check them I used the
>>>>>>>>>> jars
>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>> target
>>>>>>>>>>>>>>>>>>>>>>> folder to my separate java project. I tried samples in
>>>>>>>>>>>>>>>>>>>>>>> http://pdfbox.apache.org/cookbook/. I need to further
>>>>>>>> look
>>>>>>>>>>>>> into
>>>>>>>>>>>>>>> code
>>>>>>>>>>>>>>>>>>>>>>> specially how those processXXX() methods work in
>>>>>>>>>>>>> PDFTextStripper
>>>>>>>>>>>>>>>>>> class.
>>>>>>>>>>>>>>>>>>>>>>> What I usually do is adding some breakpoints and checking
>>>>>>>>>> them
>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>> debug
>>>>>>>>>>>>>>>>>>>>>>> windows. But using jars it's not possible. What is the way
>>>>>>>>>> you
>>>>>>>>>>>>>>> follow
>>>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>>>> order to do such task?
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> As well I installed tesseract in to my machine and
>>>>>>>> managed to
>>>>>>>>>>>>> do
>>>>>>>>>>>>>>> some
>>>>>>>>>>>>>>>>>>>>>> OCR
>>>>>>>>>>>>>>>>>>>>>>> stuff also. That's a cool tool which works fine.
>>>>>>>>>>>>>>>>>>>>>>> I'm still learning the code. If I get any issue I'll drop
>>>>>>>>>> you a
>>>>>>>>>>>>>>> mail.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <
>>>>>>>>>>>>> john@jahewson.com
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Hi Dimuthu
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> The PDFBox website can be found at
>>>>>>>>>> http://pdfbox.apache.org/ it
>>>>>>>>>>>>>>>>>>>>>> contains
>>>>>>>>>>>>>>>>>>>>>>>> a basic overview of the project
>>>>>>>>>>>>>>>>>>>>>>>> and details on how to obtain the source code and build
>>>>>>>>>> PDFBox
>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>>> yourself.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Currently we do not perform any OCR and PDFBOX-1912
>>>>>>>> details
>>>>>>>>>>>>> the
>>>>>>>>>>>>>>> only
>>>>>>>>>>>>>>>>>>>>>>>> thoughts so far regarding it.
>>>>>>>>>>>>>>>>>>>>>>>> Note that the OCR libraries mentioned in the JIRA issue
>>>>>>>> are
>>>>>>>>>>>>> all
>>>>>>>>>>>>>>>>>> under
>>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>>> Apache license, which is a
>>>>>>>>>>>>>>>>>>>>>>>> requirement.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Once you have the source code, take a look at the
>>>>>>>> PageDrawer
>>>>>>>>>>>>>>> class
>>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>>> see
>>>>>>>>>>>>>>>>>>>>>>>> how text and images are
>>>>>>>>>>>>>>>>>>>>>>>> rendered. We want someone to interface at a low-level
>>>>>>>> (e.g.
>>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>>> glyph,
>>>>>>>>>>>>>>>>>>>>>>>> word, or sentence at a time) with
>>>>>>>>>>>>>>>>>>>>>>>> an OCR engine. Also look at PDFTextStripper which is how
>>>>>>>>>> text
>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>>> currently
>>>>>>>>>>>>>>>>>>>>>>>> extracted, take a look at how
>>>>>>>>>>>>>>>>>>>>>>>> we have to go to great length to sort text back into
>>>>>>>> reading
>>>>>>>>>>>>>>> order
>>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>>> infer the placement of diacritics - PDF
>>>>>>>>>>>>>>>>>>>>>>>> is fundamentally a visual format, not a structured format
>>>>>>>>>> like
>>>>>>>>>>>>>>> HTML
>>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>>>>>>>> which is why extracting text can be so
>>>>>>>>>>>>>>>>>>>>>>>> difficult sometimes.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> The full PDF Reference document can be found at:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Feel free to discuss specifics of your proposal or ask
>>>>>>>> any
>>>>>>>>>>>>>>>>>> questions.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On 23 Feb 2014, at 21:13, DImuthu Upeksha <
>>>>>>>>>>>>>>>>>> dimuthu.upeksha2@gmail.com
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>> I am Dimuthu Upeksha, a Computer Engineering
>>>>>>>> Undergraduate
>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>> University
>>>>>>>>>>>>>>>>>>>>>>>> of Moratuwa Sri Lanka. I successfully completed my GSoC
>>>>>>>> 2013
>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>>>>>> Apache
>>>>>>>>>>>>>>>>>>>>>>>> ISIS [1] project. I'm very much interested in OCR and
>>>>>>>> image
>>>>>>>>>>>>>>>>>> processing
>>>>>>>>>>>>>>>>>>>>>>>> stuff. So I would like to select this project idea as my
>>>>>>>>>> GSoC
>>>>>>>>>>>>>>> 2014
>>>>>>>>>>>>>>>>>>>>>> project
>>>>>>>>>>>>>>>>>>>>>>>> because I feel like it is the best suited project for
>>>>>>>> me. In
>>>>>>>>>>>>>>>>>>>>>> university
>>>>>>>>>>>>>>>>>>>>>>>> also we have done some research in OCR area and our group
>>>>>>>>>>>>> wrote a
>>>>>>>>>>>>>>>>>>>>>>>> literature review about increasing efficiency of OCR
>>>>>>>>>>>>>>>>>>>>>> systems(attached). Can
>>>>>>>>>>>>>>>>>>>>>>>> you please suggest me where to start learning about
>>>>>>>> PDFBox?
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>> http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Thank you
>>>>>>>>>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>>>>>>>>>>>>>>>> Undergraduate
>>>>>>>>>>>>>>>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>>>>>>>>>>>>>> Undergraduate
>>>>>>>>>>>>>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>>>>>>>>>>>> Undergraduate
>>>>>>>>>>>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>>>>>>>>>>> Undergraduate
>>>>>>>>>>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>>>>>>>>>> Undergraduate
>>>>>>>>>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>>>>>>>> Undergraduate
>>>>>>>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>>>>> Undergraduate
>>>>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Regards
>>>>>>>>>>>>
>>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>>> Undergraduate
>>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>>
>>>>>>>>>>>> University of Moratuwa, Sri Lanka
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Regards
>>>>>>>>>>>
>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>> Undergraduate
>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>
>>>>>>>>>>> University of Moratuwa, Sri Lanka
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Regards
>>>>>>>>>
>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>> Undergraduate
>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>
>>>>>>>>> University of Moratuwa, Sri Lanka
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Regards
>>>>>>>
>>>>>>> W.Dimuthu Upeksha
>>>>>>> Undergraduate
>>>>>>> Department of Computer Science And Engineering
>>>>>>>
>>>>>>> University of Moratuwa, Sri Lanka
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Regards
>>>>>>
>>>>>> W.Dimuthu Upeksha
>>>>>> Undergraduate
>>>>>> Department of Computer Science And Engineering
>>>>>>
>>>>>> University of Moratuwa, Sri Lanka
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Regards
>>>>
>>>> W.Dimuthu Upeksha
>>>> Undergraduate
>>>>
>>>> Department of Computer Science And Engineering
>>>>
>>>> University of Moratuwa, Sri Lanka
>>>
>>
>>
>>
>> --
>> Regards
>>
>> W.Dimuthu Upeksha
>> Undergraduate
>>
>> Department of Computer Science And Engineering
>>
>> University of Moratuwa, Sri Lanka
>



-- 
Regards

W.Dimuthu Upeksha
Undergraduate

Department of Computer Science And Engineering

University of Moratuwa, Sri Lanka

Re: [GSoC 2014]Optical Character Recognition project - Introduction

Posted by John Hewson <jo...@jahewson.com>.
Thanks, I saw your new refactoring too, it’s good. Now the following methods are no longer needed:

public void setImagePath(String path)
public void setImage(byte[] imagedata, int width, int height, int bpp, int bpl)
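
For anyone following along, here is a minimal sketch of what a BufferedImage-based entry point might do before handing pixels to the native side. GrayBytes is a hypothetical helper, not part of the wrapper, and it assumes the native code accepts 8-bit grayscale with one byte per pixel. The bpp and bpl fields mirror the parameters of the byte-array setImage above.

```java
import java.awt.image.BufferedImage;

/** Hypothetical helper: flattens a BufferedImage into raw grayscale bytes plus layout metadata. */
final class GrayBytes {
    final byte[] data;
    final int width;
    final int height;
    final int bpp; // bytes per pixel
    final int bpl; // bytes per line (row stride)

    private GrayBytes(byte[] data, int width, int height) {
        this.data = data;
        this.width = width;
        this.height = height;
        this.bpp = 1;      // assumption: one gray byte per pixel
        this.bpl = width;  // rows are tightly packed, so the stride equals the width
    }

    static GrayBytes from(BufferedImage image) {
        int w = image.getWidth();
        int h = image.getHeight();
        byte[] out = new byte[w * h];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                // getRGB always returns ARGB, whatever the image's internal type is.
                int rgb = image.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Integer luminance approximation; the result stays within 0..255.
                out[y * w + x] = (byte) ((r * 299 + g * 587 + b * 114) / 1000);
            }
        }
        return new GrayBytes(out, w, h);
    }

    public static void main(String[] args) {
        BufferedImage page = new BufferedImage(4, 2, BufferedImage.TYPE_INT_RGB);
        page.setRGB(0, 0, 0xFFFFFF);
        GrayBytes raw = GrayBytes.from(page);
        System.out.println(raw.width + "x" + raw.height + ", bpp=" + raw.bpp
                + ", bpl=" + raw.bpl + ", bytes=" + raw.data.length);
    }
}
```

Doing the conversion in Java keeps the JNI layer small: the native method only ever sees a byte array and four ints.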

Cheers

-- John

On 11 Mar 2014, at 22:58, DImuthu Upeksha <di...@gmail.com> wrote:

> Hi John,
> Yes. I implemented a new method [1] that accepts the image as a byte
> array. We can't send BufferedImage objects directly to the native side,
> so I convert the BufferedImage into a byte array and pass that to the
> native side, where it is converted back into a compatible format. Along
> with the bytes we need to pass some metadata: image width, height, bytes
> per pixel and bytes per row. I checked it with this [2] test case and it
> works fine.
> 
> [1] https://github.com/DImuthuUpe/Tesseract-API/blob/master/src/main/java/com/apache/pdfbox/ocr/tesseract/TessBaseAPI.java#L74
> [2] https://github.com/DImuthuUpe/Tesseract-API/blob/master/src/test/java/com/apache/pdfbox/ocr/tesseract/TessByteSteamTest.java
> 
> Thanks
> Dimuthu
> 
> On Wed, Mar 12, 2014 at 12:40 AM, John Hewson <jo...@jahewson.com> wrote:
>> Hi Dimuthu
>> 
>> The Tesseract wrapper needs to take its input from a BufferedImage rather than reading a file from disk, so instead of:
>> 
>> api.setImagePath("test.tif");
>> 
>> What we need is:
>> 
>> BufferedImage image = ImageIO.read(new File("test.tif"));
>> api.setImagePath(image);
>> 
>> Because this will let us use the BufferedImage generated by PDFRenderer without round-tripping to the disk.
>> 
>> -- John
>> 
>> On 11 Mar 2014, at 11:13, DImuthu Upeksha <di...@gmail.com> wrote:
>> 
>>> Hi John,
>>> Thanks for the guidance.
>>> I did a small analysis of the accuracy and performance of the new
>>> Tesseract wrapper. I used this [1] image as the input image and got
>>> the following data [2] after OCR. The first line is the recognised word,
>>> followed by the location details (bounding box) of the word. I think these
>>> details are pretty much enough for our task. What remains now is
>>> converting the PDF file into an image, as you have mentioned. I'm
>>> working on that these days.
>>> 
>>> [1] https://www.dropbox.com/s/11wahtonoz08zmn/image4.TIF
>>> [2] https://gist.github.com/DImuthuUpe/9491660
>>> 
>>> Thanks
>>> Dimuthu
>>> 
>>> On Mon, Mar 10, 2014 at 2:30 PM, John Hewson <jo...@jahewson.com> wrote:
>>>> Dimuthu,
>>>> 
>>>>> I finished a basic implementation of the JNI wrapper for Tesseract. It can
>>>>> now be built using Maven. Some useful methods that are needed to do basic
>>>>> OCR were implemented.
>>>> 
>>>> Great, it's looking good, nice and clean.
>>>> 
>>>>> 1. What is the task of processStream method in PDFTextStripper class line
>>>>> 456 : processStream( page.findResources(), content, page.findCropBox(),
>>>>> page.findRotation() );
>>>> 
>>>> A PDF file is made up of pages, each of which contains a "content stream". This content stream contains a list of drawing commands such as "move to 10,15" or "write the word `foo`"; these are called operators. The processStream function reads the stream for the current page and executes each of the operators. Each operator is implemented in its own class, which is a subclass of PDFOperator. The constructor of PDFStreamEngine creates the operator classes using reflection, which is rather odd and I'm not sure why this design was chosen. The operators used by PDFTextStripper can be found in org/apache/pdfbox/resources/PDFTextStripper.properties
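
To make the dispatch idea concrete, here is a toy sketch in plain Java. It is not PDFBox code: TinyStreamEngine and its Operator interface are hypothetical, operators are registered by hand rather than by reflection from a properties file, and the tokenizer assumes whitespace-separated tokens with no spaces inside strings.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TinyStreamEngine {
    /** Hypothetical stand-in for an operator class: consumes the operands collected so far. */
    interface Operator {
        void process(List<String> operands, StringBuilder out);
    }

    private final Map<String, Operator> operators = new HashMap<>();

    TinyStreamEngine() {
        // Td and Tj are real PDF operators: "move text position" and "show text".
        operators.put("Td", (ops, out) ->
                out.append("move to ").append(ops.get(0)).append(',').append(ops.get(1)).append('\n'));
        operators.put("Tj", (ops, out) ->
                out.append("show ").append(ops.get(0)).append('\n'));
    }

    /** Reads tokens left to right; operands pile up until an operator consumes them. */
    String processStream(String content) {
        StringBuilder out = new StringBuilder();
        List<String> operands = new ArrayList<>();
        for (String token : content.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            Operator op = operators.get(token);
            if (op != null) {
                op.process(operands, out);
                operands.clear();
            } else if (Character.isLetter(token.charAt(0))) {
                operands.clear(); // unregistered operator: drop its operands and move on
            } else {
                operands.add(token);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(new TinyStreamEngine().processStream("10 15 Td (foo) Tj"));
    }
}
```

Operators that are not registered are skipped along with their operands, so a text-only engine can ignore graphics operators by simply not registering them.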
>>>> 
>>>>> 2. Say I need to extract images and their metadata from a PDF. What is the best approach?
>>>> 
>>>> You could subclass PDFTextStripper and override the startDocument method and use it to create a PDFRenderer and store it in a field. Then override the processPage method and use the previously created PDFRenderer to render the current page to a buffered image and perform OCR on the image. Once you have the OCR text + positions, instead of calling processStream you can call processTextPosition once for each character + position.
>>>> 
>>>> The PDFRenderer class was just added to the trunk, so make sure you do an "svn update". Let me know if you need me to change PDFTextStripper to make it easier to subclass.
>>>> 
>>>> Cheers
>>>> 
>>>> -- John
>>>> 
>>>> On 9 Mar 2014, at 09:08, DImuthu Upeksha <di...@gmail.com> wrote:
>>>> 
>>>>> Hi John,
>>>>> I finished a basic implementation of the JNI wrapper for Tesseract. It can now be
>>>>> built using Maven. The methods needed for basic OCR have been
>>>>> implemented.
>>>>> 
>>>>> I went through the PDFBox code several times and have a couple of questions
>>>>> that need clarifying:
>>>>> 
>>>>> 1. What is the task of processStream method in PDFTextStripper class line
>>>>> 456 : processStream( page.findResources(), content, page.findCropBox(),
>>>>> page.findRotation() );
>>>>> 
>>>>> 2. Say I need to extract images and their metadata from a PDF. What is the
>>>>> best approach?
>>>>> 
>>>>> Thanks
>>>>> Dimuthu
>>>>> 
>>>>> 
>>>>> On Fri, Mar 7, 2014 at 9:26 PM, DImuthu Upeksha
>>>>> <di...@gmail.com>wrote:
>>>>> 
>>>>>> Hi John
>>>>>> I refactored the Tesseract JNI code [1] to support a Maven build. To create the JNI
>>>>>> library I added pre-built static libraries of Tesseract and Leptonica to the
>>>>>> resources folder [2]. For now it only includes libraries for Mac, but
>>>>>> we can easily add Windows and Linux libraries. After "mvn clean
>>>>>> install", the jar is created under the target folder. All the setup is now
>>>>>> done. What remains is implementing the native methods in tessbaseapi.cpp
>>>>>> [3]. I hope to finish it soon. Please let me know if you have any concerns
>>>>>> about the project structure.
>>>>>> 
>>>>>> [1] https://github.com/DImuthuUpe/Tesseract-API.git
>>>>>> [2]
>>>>>> https://github.com/DImuthuUpe/Tesseract-API/tree/master/src/main/resources
>>>>>> [3]
>>>>>> https://github.com/DImuthuUpe/Tesseract-API/blob/master/src/main/native/src/tessbaseapi.cpp
>>>>>> 
>>>>>> Thanks
>>>>>> Dimuthu
>>>>>> 
>>>>>> 
>>>>>> On Thu, Mar 6, 2014 at 1:15 AM, John Hewson <jo...@jahewson.com> wrote:
>>>>>> 
>>>>>>> Dimuthu
>>>>>>> 
>>>>>>>> There are a lot of code
>>>>>>>> fragments in the current Android JNI wrapper which use "(jint)somePointer"
>>>>>>>> casting, which will create terrible memory leaks in 64-bit environments
>>>>>>>> because pointers are 64 bits wide. So I believe writing it from scratch
>>>>>>>> is much better.
>>>>>>> 
>>>>>>> That's a classic 64-bit pitfall, well spotted. We definitely need to
>>>>>>> support 64-bit JVMs.
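The pitfall can be reproduced on the Java side alone, because jint corresponds to Java's int and jlong to Java's long. The sketch below is an illustration only: NativeHandleDemo is a hypothetical class and the address used with it is an arbitrary made-up value, not a real pointer. It shows that squeezing a 64-bit handle through an int keeps only the low 32 bits, whereas carrying it as a long preserves it.

```java
/** Hypothetical demo of storing a native pointer in a Java int versus a long. */
final class NativeHandleDemo {

    /** Mimics a "(jint) pointer" cast: only the low 32 bits survive. */
    static int asJint(long pointer) {
        return (int) pointer;
    }

    /** What the native side would get back after the handle went through an int. */
    static long roundTripThroughJint(long pointer) {
        return (long) asJint(pointer); // widening sign-extends the 32-bit value
    }

    /** Carrying the handle as a long (jlong in JNI) keeps every bit. */
    static long roundTripThroughJlong(long pointer) {
        return pointer;
    }
}
```

A wrapper that declares its handle fields and native method parameters as long avoids the truncation on 64-bit JVMs and still works on 32-bit ones.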
>>>>>>> 
>>>>>>>> we can use
>>>>>>>> the static library of Leptonica (I did and it worked nicely). I think
>>>>>>>> it is not an issue to use its static library because both Tesseract and
>>>>>>>> Leptonica are under the Apache licence.
>>>>>>> 
>>>>>>> Sounds good, I found the following in the README:
>>>>>>> 
>>>>>>> Leptonica is required. (www.leptonica.com). Tesseract no longer compiles
>>>>>>> without Leptonica.
>>>>>>> 
>>>>>>> Which makes sense.
>>>>>>> 
>>>>>>> -- John
>>>>>>> 
>>>>>>> On 5 Mar 2014, at 09:45, DImuthu Upeksha <di...@gmail.com>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Hi John,
>>>>>>>> +1 for your suggestion about converting image <=> byte array on the Java
>>>>>>>> side. It removes a lot of complexity. I don't know whether you have noticed,
>>>>>>>> but the jint data type in JNI is a 32-bit integer type. I noticed it on my
>>>>>>>> Mac but don't know about other operating systems.
>>>>>>>> 
>>>>>>>> Leptonica is the image processing library for Tesseract [1]. Tesseract
>>>>>>>> uses the image processing algorithms in Leptonica to implement its
>>>>>>>> OCR algorithms. This [2] is the .cpp file responsible for creating the
>>>>>>>> Tesseract API. You can see it includes the allheaders.h header file, which
>>>>>>>> is the main header file of Leptonica. So I think we must build Leptonica
>>>>>>>> first and link it when we build Tesseract. This is not a big problem if we
>>>>>>>> can use the static library of Leptonica (I did and it worked nicely). I think
>>>>>>>> it is not an issue to use its static library because both Tesseract and
>>>>>>>> Leptonica are under the Apache licence.
>>>>>>>> 
>>>>>>>> I'm working on the maven implementation you have mentioned and will get
>>>>>>>> back to you soon.
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> Dimuthu
>>>>>>>> 
>>>>>>>> 
>>>>>>>> [1] https://code.google.com/p/tesseract-ocr/wiki/Compiling
>>>>>>>> [2]
>>>>>>>> 
>>>>>>> https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/src/api/tesseractmain.cpp
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Wed, Mar 5, 2014 at 1:15 AM, John Hewson <jo...@jahewson.com> wrote:
>>>>>>>> 
>>>>>>>>> Hi Dimuthu,
>>>>>>>>> 
>>>>>>>>> 1,2,3:
>>>>>>>>> 
>>>>>>>>> Feel free to write your own Tesseract binding or port the existing
>>>>>>>>> code as you see fit.
>>>>>>>>> The JNI binding should be minimal, only the methods you require need
>>>>>>>>> to be wrapped.
>>>>>>>>> Also, don't forget that some of the interop can be done in Java, for
>>>>>>>>> example if it is easier to convert a BufferedImage to a byte array in
>>>>>>>>> Java then do it there and pass the result to JNI rather than writing
>>>>>>>>> lots of JNI C++ to achieve the same result.
>>>>>>>>> 
>>>>>>>>> Your GitHub repo looks like a good start, I can make comments there as
>>>>>>>>> things progress.
>>>>>>>>> 
>>>>>>>>> Is it possible to build Tesseract without leptonica? I was under the
>>>>>>>>> impression that it was used for image i/o only, but I may be
>>>>>>>>> misinformed.
>>>>>>>>> 
>>>>>>>>> 4:  The native platform library should be built as part of the Maven
>>>>>>>>> build for the Tesseract wrapper which can be a separate project. The
>>>>>>>>> output can be a jar file which contains the native binaries. It should
>>>>>>>>> be possible for the jar to contain prebuilt binaries for all platforms
>>>>>>>>> but this is something we can worry about later. Right now the goal
>>>>>>>>> should be to build a jar containing just the current platform's native
>>>>>>>>> binary and any Java wrapper code.
>>>>>>>>> 
>>>>>>>>> -- John
>>>>>>>>> 
>>>>>>>>> On 3 Mar 2014, at 16:41, DImuthu Upeksha <di...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi John,
>>>>>>>>>> 
>>>>>>>>>> I tried to reuse the Android JNI wrapper for Tesseract. Here are my
>>>>>>>>>> observations:
>>>>>>>>>> 
>>>>>>>>>> 1. The wrapper depends heavily on Android image libraries
>>>>>>>>>> (android/bitmap.h). Most of the wrapper methods [1] use this library.
>>>>>>>>>> 
>>>>>>>>>> 2. I can understand the underlying logic in each function, though.
>>>>>>>>>> Basically it maps Tesseract API functions [2] to Java methods.
>>>>>>>>>> In between it does some image <=> byte array conversions using
>>>>>>>>>> the Android bitmap libraries.
>>>>>>>>>> 
>>>>>>>>>> 3. There are two options. 1: We can port its code to make it compatible
>>>>>>>>>> with our environments (Linux, Windows and Mac), which is really painful.
>>>>>>>>>> It will also cause memory leaks. 2: We can use only its function
>>>>>>>>>> signatures and implement them with our own code.
>>>>>>>>>> 
>>>>>>>>>> I think the 2nd option is better because we need only a few operations
>>>>>>>>>> from the Tesseract library. I have created a GitHub repo [3] for
>>>>>>>>>> this. It's not finished yet. I need to add some make files and build
>>>>>>>>>> files to make it run properly. I also need to implement the wrapper
>>>>>>>>>> functions [4]. This may take some time.
>>>>>>>>>> 
>>>>>>>>>> 4. Because we are calling native libraries we need different builds of
>>>>>>>>>> the Tesseract and Leptonica libraries for each platform (dll for Windows,
>>>>>>>>>> so for Linux, dylib for Mac). So we may need to build those libraries
>>>>>>>>>> when we build the PDFBox project. Or we can pre-build those libraries and
>>>>>>>>>> add them to the project in .dll, .so or .dylib format. Which is the
>>>>>>>>>> preferred way?
>>>>>>>>>> 
>>>>>>>>>> [1]
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/tessbaseapi.cpp
>>>>>>>>>> [2] https://code.google.com/p/tesseract-ocr/wiki/APIExample
>>>>>>>>>> [3] https://github.com/DImuthuUpe/Tesseract-API
>>>>>>>>>> [4]
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>> https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/tessbaseapi.cpp
>>>>>>>>>> 
>>>>>>>>>> Thanks
>>>>>>>>>> Dimuthu
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Sat, Mar 1, 2014 at 11:39 PM, DImuthu Upeksha <
>>>>>>>>> dimuthu.upeksha2@gmail.com
>>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> I made the necessary changes to the document [1].
>>>>>>>>>>> 
>>>>>>>>>>> For the last two days I had a deep look at this [2] JNI wrapper for
>>>>>>>>>>> the Tesseract API.
>>>>>>>>>>> Unfortunately it has been designed for the Android environment, so I
>>>>>>>>>>> think we need to write our own make files to build it into a dll
>>>>>>>>>>> (Windows) or dylib (Mac). Currently it has Android.mk files [3]. I'm
>>>>>>>>>>> searching for a way to convert them to a make file that we can run on
>>>>>>>>>>> the console. Please suggest a better approach if you have one.
>>>>>>>>>>> 
>>>>>>>>>>> [1]
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>> https://www.dropbox.com/s/9qclvq26divwr2q/Optical%20Character%20Recognition%20for%20PDFBox%20-%20updated.pdf
>>>>>>>>>>> [2]
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/
>>>>>>>>>>> [3]
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/Android.mk
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Sat, Mar 1, 2014 at 12:27 AM, John Hewson <jo...@jahewson.com>
>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> This is a good start. However, there is no need for the Adder
>>>>>>>>>>>> component, "Extracted Text (OCR)" can just feed back into the PDFBox
>>>>>>>>>>>> "Text Extractor".
>>>>>>>>>>>> 
>>>>>>>>>>>> Maybe show a "PDF" file feeding in to "Text Extractor", to make it
>>>>>>>>>>>> clear where the process starts.
>>>>>>>>>>>> 
>>>>>>>>>>>> -- John
>>>>>>>>>>>> 
>>>>>>>>>>>> On 26 Feb 2014, at 16:53, DImuthu Upeksha <
>>>>>>> dimuthu.upeksha2@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Sorry for the mistake. I added it to my Dropbox [1].
>>>>>>>>>>>>> 
>>>>>>>>>>>>> [1]
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>> https://www.dropbox.com/s/y3m15rfjmw4eqij/Optical%20Character%20Recognition%20for%20PDFBox.pdf
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Thu, Feb 27, 2014 at 4:44 AM, John Hewson <jo...@jahewson.com>
>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I should add that the OCR engine should be pluggable so PDFToText
>>>>>>>>>>>>>> might use an interface, e.g. OCREngine and there will be a
>>>>>>>>>>>>>> TesseractOCREngine class somewhere which provides the required
>>>>>>>>>>>>>> functionality and lives in a separate jar file.
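One way to picture the pluggable design is the minimal sketch below. Only the names OCREngine and TesseractOCREngine come from the thread; the recognize method, the OcrWord holder, the FixedWordEngine stub and the OcrTextJoiner helper are assumptions made up for illustration. A real TesseractOCREngine would implement the same interface from its own jar, so the text extractor never depends on Tesseract directly.

```java
import java.awt.image.BufferedImage;
import java.util.List;

/** Hypothetical result type: one recognised word and its bounding box in pixels. */
final class OcrWord {
    final String text;
    final int x, y, width, height;

    OcrWord(String text, int x, int y, int width, int height) {
        this.text = text;
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
    }
}

/** The pluggable interface; the method signature is an assumption. */
interface OCREngine {
    List<OcrWord> recognize(BufferedImage page);
}

/** Stub engine that returns canned words, standing in for a real engine. */
final class FixedWordEngine implements OCREngine {
    private final List<OcrWord> words;

    FixedWordEngine(List<OcrWord> words) {
        this.words = words;
    }

    @Override
    public List<OcrWord> recognize(BufferedImage page) {
        return words;
    }
}

/** Caller that only knows about the interface, as the text extractor would. */
final class OcrTextJoiner {
    static String extract(OCREngine engine, BufferedImage page) {
        StringBuilder text = new StringBuilder();
        for (OcrWord word : engine.recognize(page)) {
            if (text.length() > 0) {
                text.append(' ');
            }
            text.append(word.text);
        }
        return text.toString();
    }
}
```

Because the caller depends only on the interface, swapping engines means passing a different OCREngine instance.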
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On 25 Feb 2014, at 20:18, Dimuthu <di...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> So do you need to embed those new functionalities into existing
>>>>>>>>>>>>>> PDFtoText algorithms or package them as a new sub system(something
>>>>>>>>>>>> like an
>>>>>>>>>>>>>> API)?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>> From: "John Hewson" <jo...@jahewson.com>
>>>>>>>>>>>>>>> Sent: 26/02/2014 07:38
>>>>>>>>>>>>>>> To: "dev@pdfbox.apache.org" <de...@pdfbox.apache.org>
>>>>>>>>>>>>>>> Subject: Re: [GSoC 2014]Optical Character Recognition project -
>>>>>>>>>>>>>> Introduction
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Yes, exactly. By location data I just mean (x,y) coordinates and
>>>>>>>>> page
>>>>>>>>>>>>>> rotation.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> There is another use case for OCR: some fonts embedded in PDFs
>>>>>>>>>>>>>>> have corrupt encodings, which means the ASCII codes map to the
>>>>>>>>>>>>>>> wrong glyphs. We could OCR the glyphs to repair the encoding.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On 25 Feb 2014, at 17:13, DImuthu Upeksha <
>>>>>>>>>>>> dimuthu.upeksha2@gmail.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>> Thanks for the explanation.
>>>>>>>>>>>>>>>> Let's say there is a pdf with both text in extractable format
>>>>>>> and
>>>>>>>>>>>> some
>>>>>>>>>>>>>>>> images with text(Scanned images). In that case first we extract
>>>>>>>>> those
>>>>>>>>>>>>>>>> extractable content using PDFBox algorithms and rest is
>>>>>>> extracted
>>>>>>>>>>>> using
>>>>>>>>>>>>>>>> OCR. Finally we pack both results together and give output as
>>>>>>>>>>>>>> PDFToText. Am
>>>>>>>>>>>>>>>> I correct? What do you mean by "location data"?
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 11:22 PM, John Hewson <
>>>>>>> john@jahewson.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 1. What is called "glyphs" ?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> http://en.wikipedia.org/wiki/Glyph
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 2. What is the main requirement of this project?
>>>>>>>>>>>>>>>>>> As far as I understood, first we need to generate an image of
>>>>>>>>>>>>>>>>>> malformed pdfs from
>>>>>>>>>>>>>>>>>> PDFBox and then we need to do processing using OCR for further
>>>>>>>>>>>>>> accurate
>>>>>>>>>>>>>>>>>> results.  But the problem is, why shouldn't we directly do
>>>>>>> OCR on
>>>>>>>>>>>>>> those
>>>>>>>>>>>>>>>>>> PDFs without getting output from PDFBox? Correct me if I'm
>>>>>>> wrong.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> PDFBox can generate images (PDFToImage) and can extract text
>>>>>>>>>>>>>> (PDFToText).
>>>>>>>>>>>>>>>>> The goal of
>>>>>>>>>>>>>>>>> this project is to enhance PDFToText so that it can use OCR to
>>>>>>>>>>>> extract
>>>>>>>>>>>>>>>>> text from areas of the
>>>>>>>>>>>>>>>>> document where the text is embedded as an image. Such PDF files
>>>>>>>>> are
>>>>>>>>>>>>>>>>> typically generated by
>>>>>>>>>>>>>>>>> scanners or fax machines. There is also another case where OCR
>>>>>>> is
>>>>>>>>>>>>>> useful:
>>>>>>>>>>>>>>>>> some fonts embedded
>>>>>>>>>>>>>>>>> in PDF files contain the wrong encoding, so when text is
>>>>>>> extracted
>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>> PDFToText the result is
>>>>>>>>>>>>>>>>> nonsense but when drawn with PDFToImage we see the correct
>>>>>>>>> letters.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Instead of:
>>>>>>>>>>>>>>>>> PDF => Image => OCR => Text
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> We want to do:
>>>>>>>>>>>>>>>>> PDF => (Many images for words + location data => OCR) => Text
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha <
>>>>>>>>>>>>>>>>> dimuthu.upeksha2@gmail.com
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Ok fixed. This is what I did
>>>>>>>>>>>>>>>>>>> Right click on the new project ->Debug As-> Debug
>>>>>>> Configurations
>>>>>>>>>>>>>>>>> ->Source
>>>>>>>>>>>>>>>>>>> ->Add -> Project
>>>>>>>>>>>>>>>>>>> Then I selected PDFBox project.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha <
>>>>>>>>>>>>>>>>>>> dimuthu.upeksha2@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> I'm using eclipse. This is what I want. I created a new Java
>>>>>>>>>>>>>>>>> application
>>>>>>>>>>>>>>>>>>>> project (say TestPDFBox) with a main class with following
>>>>>>> code.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> PDDocument document = new PDDocument();
>>>>>>>>>>>>>>>>>>>> PDPage blankPage = new PDPage();
>>>>>>>>>>>>>>>>>>>> document.addPage( blankPage );
>>>>>>>>>>>>>>>>>>>> document.save("BlankPage.pdf");
>>>>>>>>>>>>>>>>>>>> document.close();
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Then I need to add those jar files generated in target
>>>>>>> folder
>>>>>>>>> of
>>>>>>>>>>>>>> PDFBox
>>>>>>>>>>>>>>>>>>>> to build path of my new project (I did build the PDFBox
>>>>>>> project
>>>>>>>>>>>> from
>>>>>>>>>>>>>>>>>>>> source). That is what I did. But let's say I need to check
>>>>>>> the
>>>>>>>>>>>>>>>>>>>> functionality of document.save("") method. But I don't have
>>>>>>> a
>>>>>>>>>>>>>>>>> reference to
>>>>>>>>>>>>>>>>>>>> it's sources because I directly used generated jars. As
>>>>>>> Tilman
>>>>>>>>>>>> said
>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>> built
>>>>>>>>>>>>>>>>>>>> PDFBox from sources but I don't know a proper way to use it
>>>>>>>>> other
>>>>>>>>>>>>>>>>> projects
>>>>>>>>>>>>>>>>>>>> other than adding those jar files to build path.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <
>>>>>>>>> john@jahewson.com>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Which IDE are you using? You should be able to run the
>>>>>>>>> PDFToText
>>>>>>>>>>>>>> class
>>>>>>>>>>>>>>>>>>>>> (in pdfbox-tools) using your IDE and pass a PDF file path
>>>>>>> as
>>>>>>>>> the
>>>>>>>>>>>>>>>>> command
>>>>>>>>>>>>>>>>>>>>> line argument.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> On 24 Feb 2014, at 22:38, DImuthu Upeksha <
>>>>>>>>>>>>>>>>> dimuthu.upeksha2@gmail.com>
>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>>>>>>>> Thanks for the reply. Yes I checked out PDFBox code and
>>>>>>>>>>>> managed to
>>>>>>>>>>>>>>>>>>>>> build
>>>>>>>>>>>>>>>>>>>>>> code successfully. I looked at the classes you mentioned
>>>>>>> and
>>>>>>>>> I
>>>>>>>>>>>>>> got a
>>>>>>>>>>>>>>>>>>>>> rough
>>>>>>>>>>>>>>>>>>>>>> idea about how they are working. To check them I used the
>>>>>>>>> jars
>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>> target
>>>>>>>>>>>>>>>>>>>>>> folder to my separate java project. I tried samples in
>>>>>>>>>>>>>>>>>>>>>> http://pdfbox.apache.org/cookbook/. I need to further
>>>>>>> look
>>>>>>>>>>>> into
>>>>>>>>>>>>>> code
>>>>>>>>>>>>>>>>>>>>>> especially how the processXXX() methods work in the
>>>>>>>>>>>>>>>>>>>>>> PDFTextStripper class.
>>>>>>>>>>>>>>>>>>>>>> What I usually do is add some breakpoints and inspect
>>>>>>>>>>>>>>>>>>>>>> them in the debug windows. But that is not possible with
>>>>>>>>>>>>>>>>>>>>>> jars. What approach do you follow for such a task?
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> As well I installed tesseract in to my machine and
>>>>>>> managed to
>>>>>>>>>>>> do
>>>>>>>>>>>>>> some
>>>>>>>>>>>>>>>>>>>>> OCR
>>>>>>>>>>>>>>>>>>>>>> stuff also. That's a cool tool which works fine.
>>>>>>>>>>>>>>>>>>>>>> I'm still learning the code. If I get any issue I'll drop
>>>>>>>>> you a
>>>>>>>>>>>>>> mail.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <
>>>>>>>>>>>> john@jahewson.com
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Hi Dimuthu
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> The PDFBox website can be found at
>>>>>>>>>>>>>>>>>>>>>>> http://pdfbox.apache.org/. It contains
>>>>>>>>>>>>>>>>>>>>>>> a basic overview of the project
>>>>>>>>>>>>>>>>>>>>>>> and details on how to obtain the source code and build
>>>>>>>>>>>>>>>>>>>>>>> PDFBox for yourself.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Currently we do not perform any OCR and PDFBOX-1912
>>>>>>> details
>>>>>>>>>>>> the
>>>>>>>>>>>>>> only
>>>>>>>>>>>>>>>>>>>>>>> thoughts so far regarding it.
>>>>>>>>>>>>>>>>>>>>>>> Note that the OCR libraries mentioned in the JIRA issue
>>>>>>> are
>>>>>>>>>>>> all
>>>>>>>>>>>>>>>>> under
>>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>>> Apache license, which is a
>>>>>>>>>>>>>>>>>>>>>>> requirement.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Once you have the source code, take a look at the
>>>>>>> PageDrawer
>>>>>>>>>>>>>> class
>>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>>> see
>>>>>>>>>>>>>>>>>>>>>>> how text and images are
>>>>>>>>>>>>>>>>>>>>>>> rendered. We want someone to interface at a low-level
>>>>>>> (e.g.
>>>>>>>>>>>> one
>>>>>>>>>>>>>>>>> glyph,
>>>>>>>>>>>>>>>>>>>>>>> word, or sentence at a time) with
>>>>>>>>>>>>>>>>>>>>>>> an OCR engine. Also look at PDFTextStripper which is how
>>>>>>>>> text
>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>> currently
>>>>>>>>>>>>>>>>>>>>>>> extracted, take a look at how
>>>>>>>>>>>>>>>>>>>>>>> we have to go to great length to sort text back into
>>>>>>> reading
>>>>>>>>>>>>>> order
>>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>>> infer the placement of diacritics - PDF
>>>>>>>>>>>>>>>>>>>>>>> is fundamentally a visual format, not a structured format
>>>>>>>>> like
>>>>>>>>>>>>>> HTML
>>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>>>>>>> which is why extracting text can be so
>>>>>>>>>>>>>>>>>>>>>>> difficult sometimes.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> The full PDF Reference document can be found at:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Feel free to discuss specifics of your proposal or ask
>>>>>>> any
>>>>>>>>>>>>>>>>> questions.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> On 23 Feb 2014, at 21:13, DImuthu Upeksha <
>>>>>>>>>>>>>>>>> dimuthu.upeksha2@gmail.com
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>> I am Dimuthu Upeksha, a Computer Engineering
>>>>>>> Undergraduate
>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>> University
>>>>>>>>>>>>>>>>>>>>>>> of Moratuwa Sri Lanka. I successfully completed my GSoC
>>>>>>> 2013
>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>>>>> Apache
>>>>>>>>>>>>>>>>>>>>>>> ISIS [1] project. I'm very much interested in OCR and
>>>>>>> image
>>>>>>>>>>>>>>>>> processing
>>>>>>>>>>>>>>>>>>>>>>> stuff. So I would like to select this project idea as my
>>>>>>>>> GSoC
>>>>>>>>>>>>>> 2014
>>>>>>>>>>>>>>>>>>>>> project
>>>>>>>>>>>>>>>>>>>>>>> because I feel like it is the best suited project for
>>>>>>> me. In
>>>>>>>>>>>>>>>>>>>>> university
>>>>>>>>>>>>>>>>>>>>>>> also we have done some research in OCR area and our group
>>>>>>>>>>>> wrote a
>>>>>>>>>>>>>>>>>>>>>>> literature review about increasing efficiency of OCR
>>>>>>>>>>>>>>>>>>>>> systems(attached). Can
>>>>>>>>>>>>>>>>>>>>>>> you please suggest me where to start learning about
>>>>>>> PDFBox?
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>> http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Thank you
>>>>>>>>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>>>>>>>>>>>>>>> Undergraduate
>>>>>>>>>>>>>>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka


Re: [GSoC 2014]Optical Character Recognition project - Introduction

Posted by DImuthu Upeksha <di...@gmail.com>.
Hi John,
Yes. I implemented a new method [1] that accepts the image as a byte
stream. We can't send BufferedImage objects directly to the native side,
so I convert the buffered image into a byte array and
pass that to the native side, which converts it back into
a compatible format. Along with the bytes we need to pass some metadata
about the byte stream: image width, height, bytes per pixel and bytes per
row. I checked it with this test case [2] and it works fine.
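The Java-side conversion could look roughly like the sketch below. It is not the code in the repository: RawImage is a hypothetical holder class, and normalising every input to TYPE_3BYTE_BGR is an assumption made here so that bytes per pixel and bytes per row are known in advance. The byte order the native side expects may differ.

```java
import java.awt.image.BufferedImage;
import java.awt.image.DataBufferByte;

/** Hypothetical holder: raw pixel bytes plus the layout metadata sent with them. */
final class RawImage {
    final byte[] data;
    final int width;
    final int height;
    final int bytesPerPixel;
    final int bytesPerRow;

    private RawImage(byte[] data, int width, int height, int bytesPerPixel, int bytesPerRow) {
        this.data = data;
        this.width = width;
        this.height = height;
        this.bytesPerPixel = bytesPerPixel;
        this.bytesPerRow = bytesPerRow;
    }

    static RawImage from(BufferedImage source) {
        int width = source.getWidth();
        int height = source.getHeight();

        // Copy into a 3-byte BGR image so the layout does not depend on the source type.
        BufferedImage bgr = new BufferedImage(width, height, BufferedImage.TYPE_3BYTE_BGR);
        int[] argb = source.getRGB(0, 0, width, height, null, 0, width);
        bgr.setRGB(0, 0, width, height, argb, 0, width);

        // TYPE_3BYTE_BGR is backed by a single byte array: B, G, R per pixel, rows unpadded.
        byte[] pixels = ((DataBufferByte) bgr.getRaster().getDataBuffer()).getData();
        int bytesPerPixel = 3;
        return new RawImage(pixels.clone(), width, height, bytesPerPixel, bytesPerPixel * width);
    }
}
```

The five fields map onto the byte array and the metadata described above, so a native method can take them as a byte[] and four int arguments.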

[1] https://github.com/DImuthuUpe/Tesseract-API/blob/master/src/main/java/com/apache/pdfbox/ocr/tesseract/TessBaseAPI.java#L74
[2] https://github.com/DImuthuUpe/Tesseract-API/blob/master/src/test/java/com/apache/pdfbox/ocr/tesseract/TessByteSteamTest.java

Thanks
Dimuthu

On Wed, Mar 12, 2014 at 12:40 AM, John Hewson <jo...@jahewson.com> wrote:
> Hi Dimuthu
>
> The Tesseract wrapper needs to take its input from a BufferedImage rather than reading a file from disk, so instead of:
>
> api.setImagePath("test.tif");
>
> What we need is:
>
> BufferedImage image = ImageIO.read(new File("test.tif"));
> api.setImagePath(image);
>
> Because this will let us use the BufferedImage generated by PDFRenderer without round-tripping to the disk.
>
> -- John
>
> On 11 Mar 2014, at 11:13, DImuthu Upeksha <di...@gmail.com> wrote:
>
>> Hi John,
>> Thanks for the guidance.
>> I did a small analysis of the accuracy and performance of the new
>> Tesseract wrapper. I used this image [1] as the input and got the
>> following data [2] after OCR. The first line is the recognised word,
>> followed by the location details (bounding box) of the word. I think these
>> details are enough for our task. What remains is
>> converting the PDF file into an image, as you mentioned. I'm currently
>> working on it.
>>
>> [1] https://www.dropbox.com/s/11wahtonoz08zmn/image4.TIF
>> [2] https://gist.github.com/DImuthuUpe/9491660
>>
>> Thanks
>> Dimuthu
>>
>> On Mon, Mar 10, 2014 at 2:30 PM, John Hewson <jo...@jahewson.com> wrote:
>>> Dimuthu,
>>>
>>>> I finished a basic implementation of the JNI wrapper for Tesseract. It can now be
>>>> built using Maven. The methods needed for basic OCR have been
>>>> implemented.
>>>
>>> Great, it's looking good, nice and clean.
>>>
>>>> 1. What is the task of processStream method in PDFTextStripper class line
>>>> 456 : processStream( page.findResources(), content, page.findCropBox(),
>>>> page.findRotation() );
>>>
>>> A PDF file is made up of pages, each of which contains a "content stream". This content stream contains a list of drawing commands such as "move to 10,15" or "write the word `foo`"; these are called operators. The processStream function reads the stream for the current page and executes each of the operators. Each operator is implemented in its own class, which is a subclass of PDFOperator. The constructor of PDFStreamEngine creates the operator classes using reflection, which is rather odd and I'm not sure why this design was chosen. The operators used by PDFTextStripper can be found in org/apache/pdfbox/resources/PDFTextStripper.properties.
>>>
>>>> 2. Say I need to extract images and their metadata from a PDF. What is the best approach?
>>>
>>> You could subclass PDFTextStripper and override the startDocument method and use it to create a PDFRenderer and store it in a field. Then override the processPage method and use the previously created PDFRenderer to render the current page to a buffered image and perform OCR on the image. Once you have the OCR text + positions, instead of calling processStream you can call processTextPosition once for each character + position.
>>>
>>> The PDFRenderer class was just added to the trunk, so make sure you do an "svn update". Let me know if you need me to change PDFTextStripper to make it easier to subclass.
>>>
>>> Cheers
>>>
>>> -- John
>>>
>>> On 9 Mar 2014, at 09:08, DImuthu Upeksha <di...@gmail.com> wrote:
>>>
>>>> Hi John,
>>>> I finished a basic implementation of the JNI wrapper for Tesseract. It can now be
>>>> built using Maven. Some useful methods that are needed to do basic OCR were
>>>> implemented.
>>>>
>>>> I went through PDFBox code several times and got couple of issues that are
>>>> needed to be clarified
>>>>
>>>> 1. What is the task of processStream method in PDFTextStripper class line
>>>> 456 : processStream( page.findResources(), content, page.findCropBox(),
>>>> page.findRotation() );
>>>>
>>>> 2. Say I need to extract images and their metadata from a PDF. What is the
>>>> best approach?
>>>>
>>>> Thanks
>>>> Dimuthu
>>>>
>>>>
>>>> On Fri, Mar 7, 2014 at 9:26 PM, DImuthu Upeksha
>>>> <di...@gmail.com>wrote:
>>>>
>>>>> Hi John
>>>>> I refactored Tesseract JNI code to support maven build. To create the JNI
>>>>> library I added pre-built static libraries of Tesseract and Leptonica to
>>>>> resources folder[2]. For now it includes libraries supported for mac. But
>>>>> we can easily add both windows and linux libraries. After "mvn clean
>>>>> install", the jar is created under target folder. Now all setting up is
>>>>> done. What remains is implementing those native methods in tessbaseapi.cpp
>>>>> [3]. Hope to finish it asap. Please let me know if there is any concern
>>>>> about project structure.
>>>>>
>>>>> [1] https://github.com/DImuthuUpe/Tesseract-API.git
>>>>> [2]
>>>>> https://github.com/DImuthuUpe/Tesseract-API/tree/master/src/main/resources
>>>>> [3]
>>>>> https://github.com/DImuthuUpe/Tesseract-API/blob/master/src/main/native/src/tessbaseapi.cpp
>>>>>
>>>>> Thanks
>>>>> Dimuthu
>>>>>
>>>>>
>>>>> On Thu, Mar 6, 2014 at 1:15 AM, John Hewson <jo...@jahewson.com> wrote:
>>>>>
>>>>>> Dimuthu
>>>>>>
>>>>>>> There is a lot of code
>>>>>>> fractions in current android jni wrapper which use "(jint)somePointer"
>>>>>>> casting which will create terrible memory leaks in 64 bit environments
>>>>>>> because pointers are 64 bit. So I believe writing it from the beginning
>>>>>> is
>>>>>>> much better.
>>>>>>
>>>>>> That's a classic 64-bit pitfall, well spotted. We definitely need to
>>>>>> support
>>>>>> 64-bit JVMs.
>>>>>>
>>>>>>> we can use
>>>>>>> the static library of Leptonica (I did and it worked nicely). I think
>>>>>> it is
>>>>>>> not a issue to use it's static library because both Tesseract and
>>>>>> Leptonica
>>>>>>> is under apache licence.
>>>>>>
>>>>>> Sounds good, I found the following in the README:
>>>>>>
>>>>>> Leptonica is required. (www.leptonica.com). Tesseract no longer compiles
>>>>>> without Leptonica.
>>>>>>
>>>>>> Which makes sense.
>>>>>>
>>>>>> -- John
>>>>>>
>>>>>> On 5 Mar 2014, at 09:45, DImuthu Upeksha <di...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi John,
>>>>>>> +1 for you suggestion about converting image <=> byte array at java
>>>>>> side.
>>>>>>> It reduces lot of complexities. I don't know whether you have noticed or
>>>>>>> not, jint data type in jni is a 32bit integer type. I noticed it in my
>>>>>> Mac
>>>>>>> but don't know about other operating systems.
>>>>>>>
>>>>>>> Leptonica is the image processing library for Tesseract [1]. What
>>>>>> tesseract
>>>>>>> do is using image processing algorithms in Leptonica to implement its
>>>>>> OCR
>>>>>>> algorithms. This [2] is the responsible .cpp file to create Tesseract
>>>>>> API.
>>>>>>> You can see it includes allheaders.h header file which is the main
>>>>>> header
>>>>>>> file of Leptonica. So I think it is a must to build Leptonica first and
>>>>>>> link it when we build Tesseract. This is not a big problem if we can use
>>>>>>> the static library of Leptonica (I did and it worked nicely). I think
>>>>>> it is
>>>>>>> not a issue to use it's static library because both Tesseract and
>>>>>> Leptonica
>>>>>>> is under apache licence.
>>>>>>>
>>>>>>> I'm working on the maven implementation you have mentioned and will get
>>>>>>> back to you soon.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Dimuthu
>>>>>>>
>>>>>>>
>>>>>>> [1] https://code.google.com/p/tesseract-ocr/wiki/Compiling
>>>>>>> [2]
>>>>>>>
>>>>>> https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/src/api/tesseractmain.cpp
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Mar 5, 2014 at 1:15 AM, John Hewson <jo...@jahewson.com> wrote:
>>>>>>>
>>>>>>>> Hi Dimuthu,
>>>>>>>>
>>>>>>>> 1,2,3:
>>>>>>>>
>>>>>>>> Feel free to write your own Tesseract binding or port the existing
>>>>>> code as
>>>>>>>> you see fit.
>>>>>>>> The JNI binding should be minimal, only the methods you require need
>>>>>> to be
>>>>>>>> wrapped.
>>>>>>>> Also, don't forget that some of the interop can be done in Java, for
>>>>>>>> example if it is easier
>>>>>>>> to convert a BufferedImage to a byte array in Java then do it there and
>>>>>>>> pass the result
>>>>>>>> to JNI rather than writing lots of JNI C++ to achieve the same result.
>>>>>>>>
>>>>>>>> Your GitHub repo looks like a good start, I can make comments there as
>>>>>>>> things progress.
>>>>>>>>
>>>>>>>> Is it possible to build Tesseract without leptonica? I was under the
>>>>>>>> impression that it was
>>>>>>>> used for image i/o only, but I may be misinformed.
>>>>>>>>
>>>>>>>> 4:  The native platform library should be built as part of the Maven
>>>>>> build
>>>>>>>> for the Tesseract
>>>>>>>> wrapper which can be a separate project. The output can be a jar file
>>>>>>>> which contains the
>>>>>>>> native binaries. It should be possible for the jar to contain prebuilt
>>>>>>>> binaries for all platforms
>>>>>>>> but this is something we can worry about later. Right now the goal
>>>>>> should
>>>>>>>> be to build a jar
>>>>>>>> containing just the current platform's native binary and any Java
>>>>>> wrapper
>>>>>>>> code.
>>>>>>>>
>>>>>>>> -- John
>>>>>>>>
>>>>>>>> On 3 Mar 2014, at 16:41, DImuthu Upeksha <di...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi John,
>>>>>>>>>
>>>>>>>>> I tried to reuse that android jni wrapper for tesseract. Here is my
>>>>>>>>> observation
>>>>>>>>>
>>>>>>>>> 1. This wrapper heavily depends on android image libraries.
>>>>>>>>> (android/bitmap.h). Most of the wrapper methods [1] use this library.
>>>>>>>>>
>>>>>>>>> 2. But I can understand underlying logic in each function. Basically
>>>>>> what
>>>>>>>>> it does is mapping between tesseract api functions [2] with java
>>>>>> methods.
>>>>>>>>> In between it does to some image <=> byte array like conversions by
>>>>>> using
>>>>>>>>> that bitmap libraries in Android
>>>>>>>>>
>>>>>>>>> 3. There are two ways. 1: We can port it's code to make compatible
>>>>>> with
>>>>>>>> our
>>>>>>>>> environments(linux,windows and mac) which is really painful. Also it
>>>>>> will
>>>>>>>>> cause memory leaks. 2: We can use only it's function signatures and
>>>>>>>>> implement using our codes
>>>>>>>>>
>>>>>>>>> I think 2nd solution is better because we need only few operations to
>>>>>> be
>>>>>>>>> done using tesseract library. I have created a github repo [3] for
>>>>>> this.
>>>>>>>>> It's still not finished. I need to add some make files and build
>>>>>> files to
>>>>>>>>> make it run properly. And also I need to implement those wrapper
>>>>>>>> functions
>>>>>>>>> [3]. This may take some time.
>>>>>>>>>
>>>>>>>>> 4. Because we are calling native libraries we need different builds of
>>>>>>>>> tesseract and leptonica libraries for each platform (dll for windows,
>>>>>> so
>>>>>>>>> for linux, dylib for mac). So we may need to build those libraries at
>>>>>> the
>>>>>>>>> time we build pdfbox project. Or we can pre build those libraries and
>>>>>> add
>>>>>>>>> them to the project as .dll, .so or .dylib format. What is the
>>>>>> preferred
>>>>>>>>> way?
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>>
>>>>>>>>
>>>>>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/tessbaseapi.cpp
>>>>>>>>> [2] https://code.google.com/p/tesseract-ocr/wiki/APIExample
>>>>>>>>> [3] https://github.com/DImuthuUpe/Tesseract-API
>>>>>>>>> [4]
>>>>>>>>>
>>>>>>>>
>>>>>> https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/tessbaseapi.cpp
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Dimuthu
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, Mar 1, 2014 at 11:39 PM, DImuthu Upeksha <
>>>>>>>> dimuthu.upeksha2@gmail.com
>>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I updated necessary changes to the document [1]
>>>>>>>>>>
>>>>>>>>>> For last two days I had a deep look at this [2] jni wrapper for
>>>>>>>> tessaract
>>>>>>>>>> api.
>>>>>>>>>> Unfortunately this has been designed for Android environment so I
>>>>>> think
>>>>>>>> we
>>>>>>>>>> need to write our own make files to build this in to a dll(windows)
>>>>>> or
>>>>>>>>>> dylib(in mac). Currently it has Android.mk files [3]. I'm searching
>>>>>> for
>>>>>>>> a
>>>>>>>>>> way to convert it to a make file that we can run on console. Please
>>>>>>>> suggest
>>>>>>>>>> if you have a better approach
>>>>>>>>>>
>>>>>>>>>> [1]
>>>>>>>>>>
>>>>>>>>
>>>>>> https://www.dropbox.com/s/9qclvq26divwr2q/Optical%20Character%20Recognition%20for%20PDFBox%20-%20updated.pdf
>>>>>>>>>> [2]
>>>>>>>>>>
>>>>>>>>
>>>>>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/
>>>>>>>>>> [3]
>>>>>>>>>>
>>>>>>>>
>>>>>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/Android.mk
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, Mar 1, 2014 at 12:27 AM, John Hewson <jo...@jahewson.com>
>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> This is a good start. However, there is no need for the Adder
>>>>>>>> component,
>>>>>>>>>>> "Extracted Text (OCR) can just feed back into the PDFBox "Text
>>>>>>>> Extractor".
>>>>>>>>>>>
>>>>>>>>>>> Maybe show a "PDF" file feeding in to "Text Extractor, to make it
>>>>>> clear
>>>>>>>>>>> where the process starts.
>>>>>>>>>>>
>>>>>>>>>>> -- John
>>>>>>>>>>>
>>>>>>>>>>> On 26 Feb 2014, at 16:53, DImuthu Upeksha <
>>>>>> dimuthu.upeksha2@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Sorry for the mistake. I added it to my Dropbox [1].
>>>>>>>>>>>>
>>>>>>>>>>>> [1]
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>
>>>>>> https://www.dropbox.com/s/y3m15rfjmw4eqij/Optical%20Character%20Recognition%20for%20PDFBox.pdf
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Feb 27, 2014 at 4:44 AM, John Hewson <jo...@jahewson.com>
>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I should add that the OCR engine should be pluggable so PDFToText
>>>>>>>> might
>>>>>>>>>>>>> use an interface, e.g. OCREngine and there will be a
>>>>>>>> TesseractOCREngine
>>>>>>>>>>>>> class somewhere which provides the required functionality and
>>>>>> lives
>>>>>>>> in
>>>>>>>>>>> a
>>>>>>>>>>>>> separate jar file.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 25 Feb 2014, at 20:18, Dimuthu <di...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So do you need to embed those new functionalities into existing
>>>>>>>>>>>>> PDFtoText algorithms or package them as a new sub system(something
>>>>>>>>>>> like an
>>>>>>>>>>>>> API)?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>> From: "John Hewson" <jo...@jahewson.com>
>>>>>>>>>>>>>> Sent: 26/02/2014 07:38
>>>>>>>>>>>>>> To: "dev@pdfbox.apache.org" <de...@pdfbox.apache.org>
>>>>>>>>>>>>>> Subject: Re: [GSoC 2014]Optical Character Recognition project -
>>>>>>>>>>>>> Introduction
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes, exactly. By location data I just mean (x,y) coordinates and
>>>>>>>> page
>>>>>>>>>>>>> rotation.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> There is another use case for OCR: some fonts embedded in PDFs
>>>>>> have
>>>>>>>>>>>>> corrupt encodings, which means the ASCII codes map to the wrong
>>>>>>>>>>> glyphs. We
>>>>>>>>>>>>> could OCR the glyphs to repair the encoding.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 25 Feb 2014, at 17:13, DImuthu Upeksha <
>>>>>>>>>>> dimuthu.upeksha2@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>> Thanks for the explanation.
>>>>>>>>>>>>>>> Let's say there is a pdf with both text in extractable format
>>>>>> and
>>>>>>>>>>> some
>>>>>>>>>>>>>>> images with text(Scanned images). In that case first we extract
>>>>>>>> those
>>>>>>>>>>>>>>> extractable content using PDFBox algorithms and rest is
>>>>>> extracted
>>>>>>>>>>> using
>>>>>>>>>>>>>>> OCR. Finally we pack both results together and give output as
>>>>>>>>>>>>> PDFToText. Am
>>>>>>>>>>>>>>> I correct? What do you mean by "location data"?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 11:22 PM, John Hewson <
>>>>>> john@jahewson.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1. What is called "glyphs" ?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> http://en.wikipedia.org/wiki/Glyph
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 2. What is the main requirement of this project?
>>>>>>>>>>>>>>>>> As far as I understood, first we need to generate an image of
>>>>>>>>>>>>>>>>> malformed pdfs from
>>>>>>>>>>>>>>>>> PDFBox and then we need to do processing using OCR for further
>>>>>>>>>>>>> accurate
>>>>>>>>>>>>>>>>> results.  But the problem is, why shouldn't we directly do
>>>>>> OCR on
>>>>>>>>>>>>> those
>>>>>>>>>>>>>>>>> PDFs without getting output from PDFBox? Correct me if I'm
>>>>>> wrong.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> PDFBox can generate images (PDFToImage) and can extract text
>>>>>>>>>>>>> (PDFToText).
>>>>>>>>>>>>>>>> The goal of
>>>>>>>>>>>>>>>> this project is to enhance PDFToText so that it can use OCR to
>>>>>>>>>>> extract
>>>>>>>>>>>>>>>> text from areas of the
>>>>>>>>>>>>>>>> document where the text is embedded as an image. Such PDF files
>>>>>>>> are
>>>>>>>>>>>>>>>> typically generated by
>>>>>>>>>>>>>>>> scanners or fax machines. There is also another case where OCR
>>>>>> is
>>>>>>>>>>>>> useful:
>>>>>>>>>>>>>>>> some fonts embedded
>>>>>>>>>>>>>>>> in PDF files contain the wrong encoding, so when text is
>>>>>> extracted
>>>>>>>>>>> with
>>>>>>>>>>>>>>>> PDFToText the result is
>>>>>>>>>>>>>>>> nonsense but when drawn with PDFToImage we see the correct
>>>>>>>> letters.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Instead of:
>>>>>>>>>>>>>>>> PDF => Image => OCR => Text
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We want to do:
>>>>>>>>>>>>>>>> PDF => (Many images for words + location data => OCR) => Text
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha <
>>>>>>>>>>>>>>>> dimuthu.upeksha2@gmail.com
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Ok fixed. This is what I did
>>>>>>>>>>>>>>>>>> Right click on the new project ->Debug As-> Debug
>>>>>> Configurations
>>>>>>>>>>>>>>>> ->Source
>>>>>>>>>>>>>>>>>> ->Add -> Project
>>>>>>>>>>>>>>>>>> Then I selected PDFBox project.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha <
>>>>>>>>>>>>>>>>>> dimuthu.upeksha2@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I'm using eclipse. This is what I want. I created a new Java
>>>>>>>>>>>>>>>> application
>>>>>>>>>>>>>>>>>>> project (say TestPDFBox) with a main class with following
>>>>>> code.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> PDDocument document = new PDDocument();PDPage blankPage =
>>>>>> new
>>>>>>>>>>>>>>>> PDPage();document.addPage( blankPage
>>>>>>>>>>>>>>>> );document.save("BlankPage.pdf");document.close();
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Then I need to add those jar files generated in target
>>>>>> folder
>>>>>>>> of
>>>>>>>>>>>>> PDFBox
>>>>>>>>>>>>>>>>>>> to build path of my new project (I did build the PDFBox
>>>>>> project
>>>>>>>>>>> from
>>>>>>>>>>>>>>>>>>> source). That is what I did. But let's say I need to check
>>>>>> the
>>>>>>>>>>>>>>>>>>> functionality of document.save("") method. But I don't have
>>>>>> a
>>>>>>>>>>>>>>>> reference to
>>>>>>>>>>>>>>>>>>> it's sources because I directly used generated jars. As
>>>>>> Tilman
>>>>>>>>>>> said
>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>> built
>>>>>>>>>>>>>>>>>>> PDFBox from sources but I don't know a proper way to use it
>>>>>>>> other
>>>>>>>>>>>>>>>> projects
>>>>>>>>>>>>>>>>>>> other than adding those jar files to build path.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <
>>>>>>>> john@jahewson.com>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Which IDE are you using? You should be able to run the
>>>>>>>> PDFToText
>>>>>>>>>>>>> class
>>>>>>>>>>>>>>>>>>>> (in pdfbox-tools) using your IDE and pass a PDF file path
>>>>>> as
>>>>>>>> the
>>>>>>>>>>>>>>>> command
>>>>>>>>>>>>>>>>>>>> line argument.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On 24 Feb 2014, at 22:38, DImuthu Upeksha <
>>>>>>>>>>>>>>>> dimuthu.upeksha2@gmail.com>
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>>>>>>> Thanks for the reply. Yes I checked out PDFBox code and
>>>>>>>>>>> managed to
>>>>>>>>>>>>>>>>>>>> build
>>>>>>>>>>>>>>>>>>>>> code successfully. I looked at the classes you mentioned
>>>>>> and
>>>>>>>> I
>>>>>>>>>>>>> got a
>>>>>>>>>>>>>>>>>>>> rough
>>>>>>>>>>>>>>>>>>>>> idea about how they are working. To check them I used the
>>>>>>>> jars
>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>> target
>>>>>>>>>>>>>>>>>>>>> folder to my separate java project. I tried samples in
>>>>>>>>>>>>>>>>>>>>> http://pdfbox.apache.org/cookbook/. I need to further
>>>>>> look
>>>>>>>>>>> into
>>>>>>>>>>>>> code
>>>>>>>>>>>>>>>>>>>>> specially how those processXXX() methods work in
>>>>>>>>>>> PDFTextStripper
>>>>>>>>>>>>>>>> class.
>>>>>>>>>>>>>>>>>>>>> What I usually do is adding some breakpoints and checking
>>>>>>>> them
>>>>>>>>>>> in
>>>>>>>>>>>>>>>> debug
>>>>>>>>>>>>>>>>>>>>> windows. But using jars it's not possible. What is the way
>>>>>>>> you
>>>>>>>>>>>>> follow
>>>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>>>> order to do such task?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> As well I installed tesseract in to my machine and
>>>>>> managed to
>>>>>>>>>>> do
>>>>>>>>>>>>> some
>>>>>>>>>>>>>>>>>>>> OCR
>>>>>>>>>>>>>>>>>>>>> stuff also. That's a cool tool which works fine.
>>>>>>>>>>>>>>>>>>>>> I'm still learning the code. If I get any issue I'll drop
>>>>>>>> you a
>>>>>>>>>>>>> mail.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <
>>>>>>>>>>> john@jahewson.com
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi Dimuthu
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> The PDFBox website can be found at
>>>>>>>> http://pdfbox.apache.org/ it
>>>>>>>>>>>>>>>>>>>> contains
>>>>>>>>>>>>>>>>>>>>>> a basic overview of the project
>>>>>>>>>>>>>>>>>>>>>> and details on how to obtain the source code and build
>>>>>>>> PDFBox
>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>> yourself.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Currently we do not perform any OCR and PDFBOX-1912
>>>>>> details
>>>>>>>>>>> the
>>>>>>>>>>>>> only
>>>>>>>>>>>>>>>>>>>>>> thoughts so far regarding it.
>>>>>>>>>>>>>>>>>>>>>> Note that the OCR libraries mentioned in the JIRA issue
>>>>>> are
>>>>>>>>>>> all
>>>>>>>>>>>>>>>> under
>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>> Apache license, which is a
>>>>>>>>>>>>>>>>>>>>>> requirement.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Once you have the source code, take a look at the
>>>>>> PageDrawer
>>>>>>>>>>>>> class
>>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>>> see
>>>>>>>>>>>>>>>>>>>>>> how text and images are
>>>>>>>>>>>>>>>>>>>>>> rendered. We want someone to interface at a low-level
>>>>>> (e.g.
>>>>>>>>>>> one
>>>>>>>>>>>>>>>> glyph,
>>>>>>>>>>>>>>>>>>>>>> word, or sentence at a time) with
>>>>>>>>>>>>>>>>>>>>>> an OCR engine. Also look at PDFTextStripper which is how
>>>>>>>> text
>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>> currently
>>>>>>>>>>>>>>>>>>>>>> extracted, take a look at how
>>>>>>>>>>>>>>>>>>>>>> we have to go to great length to sort text back into
>>>>>> reading
>>>>>>>>>>>>> order
>>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>>> infer the placement of diacritics - PDF
>>>>>>>>>>>>>>>>>>>>>> is fundamentally a visual format, not a structured format
>>>>>>>> like
>>>>>>>>>>>>> HTML
>>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>>>>>> which is why extracting text can be so
>>>>>>>>>>>>>>>>>>>>>> difficult sometimes.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> The full PDF Reference document can be found at:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>
>>>>>> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Feel free to discuss specifics of your proposal or ask
>>>>>> any
>>>>>>>>>>>>>>>> questions.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On 23 Feb 2014, at 21:13, DImuthu Upeksha <
>>>>>>>>>>>>>>>> dimuthu.upeksha2@gmail.com
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>> I am Dimuthu Upeksha, a Computer Engineering
>>>>>> Undergraduate
>>>>>>>> at
>>>>>>>>>>>>>>>>>>>> University
>>>>>>>>>>>>>>>>>>>>>> of Moratuwa Sri Lanka. I successfully completed my GSoC
>>>>>> 2013
>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>>>> Apache
>>>>>>>>>>>>>>>>>>>>>> ISIS [1] project. I'm very much interested in OCR and
>>>>>> image
>>>>>>>>>>>>>>>> processing
>>>>>>>>>>>>>>>>>>>>>> stuff. So I would like to select this project idea as my
>>>>>>>> GSoC
>>>>>>>>>>>>> 2014
>>>>>>>>>>>>>>>>>>>> project
>>>>>>>>>>>>>>>>>>>>>> because I feel like it is the best suited project for
>>>>>> me. In
>>>>>>>>>>>>>>>>>>>> university
>>>>>>>>>>>>>>>>>>>>>> also we have done some research in OCR area and our group
>>>>>>>>>>> wrote a
>>>>>>>>>>>>>>>>>>>>>> literature review about increasing efficiency of OCR
>>>>>>>>>>>>>>>>>>>> systems(attached). Can
>>>>>>>>>>>>>>>>>>>>>> you please suggest me where to start learning about
>>>>>> PDFBox?
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>
>>>>>> http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thank you
>>>>>>>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>>>>>>>>>>>>>> Undergraduate
>>>>>>>>>>>>>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>>>>>>>>>>>> Undergraduate
>>>>>>>>>>>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>>>>>>>>>> Undergraduate
>>>>>>>>>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>>>>>>>>> Undergraduate
>>>>>>>>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>>>>>>>> Undergraduate
>>>>>>>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>>>>>> Undergraduate
>>>>>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Regards
>>>>>>>>>>>>
>>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>>> Undergraduate
>>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>>
>>>>>>>>>>>> University of Moratuwa, Sri Lanka
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Regards
>>>>>>>>>>
>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>> Undergraduate
>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>
>>>>>>>>>> University of Moratuwa, Sri Lanka
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Regards
>>>>>>>>>
>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>> Undergraduate
>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>
>>>>>>>>> University of Moratuwa, Sri Lanka
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Regards
>>>>>>>
>>>>>>> W.Dimuthu Upeksha
>>>>>>> Undergraduate
>>>>>>> Department of Computer Science And Engineering
>>>>>>>
>>>>>>> University of Moratuwa, Sri Lanka
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Regards
>>>>>
>>>>> W.Dimuthu Upeksha
>>>>> Undergraduate
>>>>> Department of Computer Science And Engineering
>>>>>
>>>>> University of Moratuwa, Sri Lanka
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Regards
>>>>
>>>> W.Dimuthu Upeksha
>>>> Undergraduate
>>>> Department of Computer Science And Engineering
>>>>
>>>> University of Moratuwa, Sri Lanka
>>>
>>
>>
>>
>> --
>> Regards
>>
>> W.Dimuthu Upeksha
>> Undergraduate
>>
>> Department of Computer Science And Engineering
>>
>> University of Moratuwa, Sri Lanka
>



-- 
Regards

W.Dimuthu Upeksha
Undergraduate

Department of Computer Science And Engineering

University of Moratuwa, Sri Lanka

Re: [GSoC 2014]Optical Character Recognition project - Introduction

Posted by John Hewson <jo...@jahewson.com>.
Hi Dimuthu

The Tesseract wrapper needs to take its input from a BufferedImage rather than reading a file from disk, so instead of:

api.setImagePath("test.tif");

What we need is:

BufferedImage image = ImageIO.read(new File("test.tif"));
api.setImagePath(image);

This will let us use the BufferedImage generated by PDFRenderer without round-tripping to disk.
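
As a rough sketch of the Java side of that hand-off (not the wrapper's actual API), the helper below flattens a BufferedImage into an 8-bit grayscale byte array that could then be passed across JNI together with the width and height. The names ImageBytes and toGrayBytes are assumptions made up for illustration; only BufferedImage, Graphics2D and DataBufferByte are standard JDK classes.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.awt.image.DataBufferByte;

// Hypothetical helper, not part of PDFBox or the Tesseract wrapper.
final class ImageBytes {

    private ImageBytes() {
    }

    /**
     * Redraws the source image as 8-bit grayscale and returns its pixels
     * in row-major order, one byte per pixel (width * height bytes).
     */
    static byte[] toGrayBytes(BufferedImage src) {
        BufferedImage gray = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = gray.createGraphics();
        try {
            // Let Java2D do the colour conversion for us.
            g.drawImage(src, 0, 0, null);
        } finally {
            g.dispose();
        }
        byte[] data = ((DataBufferByte) gray.getRaster().getDataBuffer()).getData();
        // Copy so the caller does not hold a reference to the image's backing array.
        return data.clone();
    }
}
```

With one byte per pixel, the bytes-per-line value the native side needs is simply the image width, so the JNI method would only have to receive the array plus width and height.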

-- John

On 11 Mar 2014, at 11:13, DImuthu Upeksha <di...@gmail.com> wrote:

> Hi John,
> Thanks for the guidance.
> I did a small analysis of the accuracy and performance of the new
> Tesseract wrapper. I used this [1] image as the input image and got the
> following data [2] after OCR. The first line is the recognised word,
> followed by the location details (bounding box) of the word. I think these
> details are enough for our task. What remains now is
> converting the PDF file into an image, as you mentioned. I'm currently
> working on it.
> 
> [1] https://www.dropbox.com/s/11wahtonoz08zmn/image4.TIF
> [2] https://gist.github.com/DImuthuUpe/9491660
> 
> Thanks
> Dimuthu
> 
> On Mon, Mar 10, 2014 at 2:30 PM, John Hewson <jo...@jahewson.com> wrote:
>> Dimuthu,
>> 
>>> I finished a basic implementation of the JNI wrapper for Tesseract. It can now be
>>> built using Maven. Some useful methods that are needed to do basic OCR were
>>> implemented.
>> 
>> Great, it's looking good, nice and clean.
>> 
>>> 1. What is the task of processStream method in PDFTextStripper class line
>>> 456 : processStream( page.findResources(), content, page.findCropBox(),
>>> page.findRotation() );
>> 
>> A PDF file is made up of pages, each of which contains a "content stream". This content stream contains a list of drawing commands such as "move to 10,15" or "write the word `foo`"; these are called operators. The processStream function reads the stream for the current page and executes each of the operators. Each operator is implemented in its own class, which is a subclass of PDFOperator. The constructor of PDFStreamEngine creates the operator classes using reflection, which is rather odd, and I'm not sure why this design was chosen. The operators used by PDFTextStripper are listed in org/apache/pdfbox/resources/PDFTextStripper.properties.
>> 
>>> 2. Say I need to extract images and their metadata from a PDF. What is the best approach?
>> 
>> You could subclass PDFTextStripper and override the startDocument method and use it to create a PDFRenderer and store it in a field. Then override the processPage method and use the previously created PDFRenderer to render the current page to a buffered image and perform OCR on the image. Once you have the OCR text + positions, instead of calling processStream you can call processTextPosition once for each character + position.
>> 
>> The PDFRenderer class was just added to the trunk, so make sure you do an "svn update". Let me know if you need me to change PDFTextStripper to make it easier to subclass.
>> 
>> Cheers
>> 
>> -- John
>> 
>> On 9 Mar 2014, at 09:08, DImuthu Upeksha <di...@gmail.com> wrote:
>> 
>>> Hi John,
>>> I finished a basic implementation of the JNI wrapper for Tesseract. It can now be
>>> built using Maven. Some useful methods that are needed to do basic OCR were
>>> implemented.
>>> 
>>> I went through PDFBox code several times and got couple of issues that are
>>> needed to be clarified
>>> 
>>> 1. What is the task of processStream method in PDFTextStripper class line
>>> 456 : processStream( page.findResources(), content, page.findCropBox(),
>>> page.findRotation() );
>>> 
>>> 2. Say I need to extract images and their metadata from a PDF. What is the
>>> best approach?
>>> 
>>> Thanks
>>> Dimuthu
>>> 
>>> 
>>> On Fri, Mar 7, 2014 at 9:26 PM, DImuthu Upeksha
>>> <di...@gmail.com>wrote:
>>> 
>>>> Hi John
>>>> I refactored Tesseract JNI code to support maven build. To create the JNI
>>>> library I added pre-built static libraries of Tesseract and Leptonica to
>>>> resources folder[2]. For now it includes libraries supported for mac. But
>>>> we can easily add both windows and linux libraries. After "mvn clean
>>>> install", the jar is created under target folder. Now all setting up is
>>>> done. What remains is implementing those native methods in tessbaseapi.cpp
>>>> [3]. Hope to finish it asap. Please let me know if there is any concern
>>>> about project structure.
>>>> 
>>>> [1] https://github.com/DImuthuUpe/Tesseract-API.git
>>>> [2]
>>>> https://github.com/DImuthuUpe/Tesseract-API/tree/master/src/main/resources
>>>> [3]
>>>> https://github.com/DImuthuUpe/Tesseract-API/blob/master/src/main/native/src/tessbaseapi.cpp
>>>> 
>>>> Thanks
>>>> Dimuthu
>>>> 
>>>> 
>>>> On Thu, Mar 6, 2014 at 1:15 AM, John Hewson <jo...@jahewson.com> wrote:
>>>> 
>>>>> Dimuthu
>>>>> 
>>>>>> There is a lot of code
>>>>>> fractions in current android jni wrapper which use "(jint)somePointer"
>>>>>> casting which will create terrible memory leaks in 64 bit environments
>>>>>> because pointers are 64 bit. So I believe writing it from the beginning
>>>>> is
>>>>>> much better.
>>>>> 
>>>>> That's a classic 64-bit pitfall, well spotted. We definitely need to
>>>>> support
>>>>> 64-bit JVMs.
>>>>> 
>>>>>> we can use
>>>>>> the static library of Leptonica (I did and it worked nicely). I think it is
>>>>>> not an issue to use its static library because both Tesseract and Leptonica
>>>>>> are under the Apache licence.
>>>>> 
>>>>> Sounds good, I found the following in the README:
>>>>> 
>>>>> Leptonica is required. (www.leptonica.com). Tesseract no longer compiles
>>>>> without Leptonica.
>>>>> 
>>>>> Which makes sense.
>>>>> 
>>>>> -- John
>>>>> 
>>>>> On 5 Mar 2014, at 09:45, DImuthu Upeksha <di...@gmail.com>
>>>>> wrote:
>>>>> 
>>>>>> Hi John,
>>>>>> +1 for your suggestion about converting image <=> byte array on the Java side.
>>>>>> It removes a lot of complexity. I don't know whether you have noticed or
>>>>>> not, but the jint data type in JNI is a 32-bit integer type. I noticed it on my Mac
>>>>>> but don't know about other operating systems.
>>>>>>
>>>>>> Leptonica is the image processing library for Tesseract [1]. Tesseract
>>>>>> uses the image processing algorithms in Leptonica to implement its OCR
>>>>>> algorithms. This [2] is the .cpp file responsible for creating the Tesseract API.
>>>>>> You can see it includes the allheaders.h header file, which is the main header
>>>>>> file of Leptonica. So I think we must build Leptonica first and
>>>>>> link it when we build Tesseract. This is not a big problem if we can use
>>>>>> the static library of Leptonica (I did and it worked nicely). I think it is
>>>>>> not an issue to use its static library because both Tesseract and Leptonica
>>>>>> are under the Apache licence.
>>>>>> 
>>>>>> I'm working on the maven implementation you have mentioned and will get
>>>>>> back to you soon.
>>>>>> 
>>>>>> Thanks
>>>>>> Dimuthu
>>>>>> 
>>>>>> 
>>>>>> [1] https://code.google.com/p/tesseract-ocr/wiki/Compiling
>>>>>> [2]
>>>>>> 
>>>>> https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/src/api/tesseractmain.cpp
>>>>>> 
>>>>>> 
>>>>>> On Wed, Mar 5, 2014 at 1:15 AM, John Hewson <jo...@jahewson.com> wrote:
>>>>>> 
>>>>>>> Hi Dimuthu,
>>>>>>> 
>>>>>>> 1,2,3:
>>>>>>> 
>>>>>>> Feel free to write your own Tesseract binding or port the existing
>>>>> code as
>>>>>>> you see fit.
>>>>>>> The JNI binding should be minimal, only the methods you require need
>>>>> to be
>>>>>>> wrapped.
>>>>>>> Also, don't forget that some of the interop can be done in Java, for
>>>>>>> example if it is easier
>>>>>>> to convert a BufferedImage to a byte array in Java then do it there and
>>>>>>> pass the result
>>>>>>> to JNI rather than writing lots of JNI C++ to achieve the same result.
>>>>>>> 
>>>>>>> Your GitHub repo looks like a good start, I can make comments there as
>>>>>>> things progress.
>>>>>>> 
>>>>>>> Is it possible to build Tesseract without leptonica? I was under the
>>>>>>> impression that it was
>>>>>>> used for image i/o only, but I may be misinformed.
>>>>>>> 
>>>>>>> 4:  The native platform library should be built as part of the Maven
>>>>> build
>>>>>>> for the Tesseract
>>>>>>> wrapper which can be a separate project. The output can be a jar file
>>>>>>> which contains the
>>>>>>> native binaries. It should be possible for the jar to contain prebuilt
>>>>>>> binaries for all platforms
>>>>>>> but this is something we can worry about later. Right now the goal
>>>>> should
>>>>>>> be to build a jar
>>>>>>> containing just the current platform's native binary and any Java
>>>>> wrapper
>>>>>>> code.
>>>>>>> 
>>>>>>> -- John
>>>>>>> 
>>>>>>> On 3 Mar 2014, at 16:41, DImuthu Upeksha <di...@gmail.com>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Hi John,
>>>>>>>> 
>>>>>>>> I tried to reuse that Android JNI wrapper for Tesseract. Here are my
>>>>>>>> observations:
>>>>>>>>
>>>>>>>> 1. The wrapper depends heavily on the Android image libraries
>>>>>>>> (android/bitmap.h). Most of the wrapper methods [1] use this library.
>>>>>>>>
>>>>>>>> 2. But I can understand the underlying logic in each function. Basically,
>>>>>>>> it maps the Tesseract API functions [2] to Java methods.
>>>>>>>> In between it does some image <=> byte array conversions using
>>>>>>>> the Android bitmap libraries.
>>>>>>>>
>>>>>>>> 3. There are two options. 1: We can port its code to make it compatible with our
>>>>>>>> environments (Linux, Windows and Mac), which is really painful. It will also
>>>>>>>> cause memory leaks. 2: We can reuse only its function signatures and
>>>>>>>> implement them with our own code.
>>>>>>>>
>>>>>>>> I think the second option is better because we need only a few operations from
>>>>>>>> the Tesseract library. I have created a GitHub repo [3] for this.
>>>>>>>> It's still not finished. I need to add some makefiles and build files to
>>>>>>>> make it run properly. I also need to implement those wrapper functions
>>>>>>>> [4]. This may take some time.
>>>>>>>>
>>>>>>>> 4. Because we are calling native libraries, we need different builds of the
>>>>>>>> Tesseract and Leptonica libraries for each platform (.dll for Windows, .so
>>>>>>>> for Linux, .dylib for Mac). So we may need to build those libraries at the
>>>>>>>> time we build the PDFBox project. Or we can prebuild those libraries and add
>>>>>>>> them to the project in .dll, .so or .dylib format. Which way is
>>>>>>>> preferred?
>>>>>>>> 
>>>>>>>> [1]
>>>>>>>> 
>>>>>>> 
>>>>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/tessbaseapi.cpp
>>>>>>>> [2] https://code.google.com/p/tesseract-ocr/wiki/APIExample
>>>>>>>> [3] https://github.com/DImuthuUpe/Tesseract-API
>>>>>>>> [4]
>>>>>>>> 
>>>>>>> 
>>>>> https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/tessbaseapi.cpp
>>>>>>>> 
>>>>>>>> Thanks
>>>>>>>> Dimuthu
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Sat, Mar 1, 2014 at 11:39 PM, DImuthu Upeksha <
>>>>>>> dimuthu.upeksha2@gmail.com
>>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> I updated necessary changes to the document [1]
>>>>>>>>> 
>>>>>>>>> For the last two days I had a deep look at this [2] JNI wrapper for the
>>>>>>>>> Tesseract API.
>>>>>>>>> Unfortunately it has been designed for the Android environment, so I think we
>>>>>>>>> need to write our own makefiles to build it into a DLL (Windows) or
>>>>>>>>> dylib (Mac). Currently it has Android.mk files [3]. I'm searching for a
>>>>>>>>> way to convert them to a makefile that we can run from the console. Please suggest
>>>>>>>>> a better approach if you have one.
>>>>>>>>> 
>>>>>>>>> [1]
>>>>>>>>> 
>>>>>>> 
>>>>> https://www.dropbox.com/s/9qclvq26divwr2q/Optical%20Character%20Recognition%20for%20PDFBox%20-%20updated.pdf
>>>>>>>>> [2]
>>>>>>>>> 
>>>>>>> 
>>>>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/
>>>>>>>>> [3]
>>>>>>>>> 
>>>>>>> 
>>>>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/Android.mk
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Sat, Mar 1, 2014 at 12:27 AM, John Hewson <jo...@jahewson.com>
>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> This is a good start. However, there is no need for the Adder
>>>>>>> component,
>>>>>>>>>> "Extracted Text (OCR) can just feed back into the PDFBox "Text
>>>>>>> Extractor".
>>>>>>>>>> 
>>>>>>>>>> Maybe show a "PDF" file feeding in to "Text Extractor, to make it
>>>>> clear
>>>>>>>>>> where the process starts.
>>>>>>>>>> 
>>>>>>>>>> -- John
>>>>>>>>>> 
>>>>>>>>>> On 26 Feb 2014, at 16:53, DImuthu Upeksha <
>>>>> dimuthu.upeksha2@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Sorry for the mistake. I added it to my Dropbox [1].
>>>>>>>>>>> 
>>>>>>>>>>> [1]
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>> 
>>>>> https://www.dropbox.com/s/y3m15rfjmw4eqij/Optical%20Character%20Recognition%20for%20PDFBox.pdf
>>>>>>>>>>> 
>>>>>>>>>>> Thanks
>>>>>>>>>>> Dimuthu
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Thu, Feb 27, 2014 at 4:44 AM, John Hewson <jo...@jahewson.com>
>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> I should add that the OCR engine should be pluggable so PDFToText
>>>>>>> might
>>>>>>>>>>>> use an interface, e.g. OCREngine and there will be a
>>>>>>> TesseractOCREngine
>>>>>>>>>>>> class somewhere which provides the required functionality and
>>>>> lives
>>>>>>> in
>>>>>>>>>> a
>>>>>>>>>>>> separate jar file.
>>>>>>>>>>>> 
>>>>>>>>>>>> -- John
>>>>>>>>>>>> 
>>>>>>>>>>>>> On 25 Feb 2014, at 20:18, Dimuthu <di...@gmail.com>
>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> So do you need to embed those new functionalities into existing
>>>>>>>>>>>> PDFtoText algorithms or package them as a new sub system(something
>>>>>>>>>> like an
>>>>>>>>>>>> API)?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>> From: "John Hewson" <jo...@jahewson.com>
>>>>>>>>>>>>> Sent: 26/02/2014 07:38
>>>>>>>>>>>>> To: "dev@pdfbox.apache.org" <de...@pdfbox.apache.org>
>>>>>>>>>>>>> Subject: Re: [GSoC 2014]Optical Character Recognition project -
>>>>>>>>>>>> Introduction
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Yes, exactly. By location data I just mean (x,y) coordinates and
>>>>>>> page
>>>>>>>>>>>> rotation.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> There is another use case for OCR: some fonts embedded in PDFs have
>>>>>>>>>>>>> corrupt encodings, which means the ASCII codes map to the wrong glyphs. We
>>>>>>>>>>>>> could OCR the glyphs to repair the encoding.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> -- John
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On 25 Feb 2014, at 17:13, DImuthu Upeksha <
>>>>>>>>>> dimuthu.upeksha2@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>> Thanks for the explanation.
>>>>>>>>>>>>>> Let's say there is a pdf with both text in extractable format
>>>>> and
>>>>>>>>>> some
>>>>>>>>>>>>>> images with text(Scanned images). In that case first we extract
>>>>>>> those
>>>>>>>>>>>>>> extractable content using PDFBox algorithms and rest is
>>>>> extracted
>>>>>>>>>> using
>>>>>>>>>>>>>> OCR. Finally we pack both results together and give output as
>>>>>>>>>>>> PDFToText. Am
>>>>>>>>>>>>>> I correct? What do you mean by "location data"?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 11:22 PM, John Hewson <
>>>>> john@jahewson.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 1. What is called "glyphs" ?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> http://en.wikipedia.org/wiki/Glyph
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 2. What is the main requirement of this project?
>>>>>>>>>>>>>>>> As far as I understood, first we need to generate an image of
>>>>>>>>>>>>>>>> malformed pdfs from
>>>>>>>>>>>>>>>> PDFBox and then we need to do processing using OCR for further
>>>>>>>>>>>> accurate
>>>>>>>>>>>>>>>> results.  But the problem is, why shouldn't we directly do
>>>>> OCR on
>>>>>>>>>>>> those
>>>>>>>>>>>>>>>> PDFs without getting output from PDFBox? Correct me if I'm
>>>>> wrong.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> PDFBox can generate images (PDFToImage) and can extract text
>>>>>>>>>>>> (PDFToText).
>>>>>>>>>>>>>>> The goal of
>>>>>>>>>>>>>>> this project is to enhance PDFToText so that it can use OCR to
>>>>>>>>>> extract
>>>>>>>>>>>>>>> text from areas of the
>>>>>>>>>>>>>>> document where the text is embedded as an image. Such PDF files
>>>>>>> are
>>>>>>>>>>>>>>> typically generated by
>>>>>>>>>>>>>>> scanners or fax machines. There is also another case where OCR
>>>>> is
>>>>>>>>>>>> useful:
>>>>>>>>>>>>>>> some fonts embedded
>>>>>>>>>>>>>>> in PDF files contain the wrong encoding, so when text is
>>>>> extracted
>>>>>>>>>> with
>>>>>>>>>>>>>>> PDFToText the result is
>>>>>>>>>>>>>>> nonsense but when drawn with PDFToImage we see the correct
>>>>>>> letters.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Instead of:
>>>>>>>>>>>>>>> PDF => Image => OCR => Text
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> We want to do:
>>>>>>>>>>>>>>> PDF => (Many images for words + location data => OCR) => Text
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha <
>>>>>>>>>>>>>>> dimuthu.upeksha2@gmail.com
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Ok fixed. This is what I did
>>>>>>>>>>>>>>>>> Right click on the new project ->Debug As-> Debug
>>>>> Configurations
>>>>>>>>>>>>>>> ->Source
>>>>>>>>>>>>>>>>> ->Add -> Project
>>>>>>>>>>>>>>>>> Then I selected PDFBox project.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha <
>>>>>>>>>>>>>>>>> dimuthu.upeksha2@gmail.com> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> I'm using eclipse. This is what I want. I created a new Java
>>>>>>>>>>>>>>> application
>>>>>>>>>>>>>>>>>> project (say TestPDFBox) with a main class with following
>>>>> code.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> PDDocument document = new PDDocument();
>>>>>>>>>>>>>>>>>> PDPage blankPage = new PDPage();
>>>>>>>>>>>>>>>>>> document.addPage( blankPage );
>>>>>>>>>>>>>>>>>> document.save("BlankPage.pdf");
>>>>>>>>>>>>>>>>>> document.close();
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Then I need to add those jar files generated in target
>>>>> folder
>>>>>>> of
>>>>>>>>>>>> PDFBox
>>>>>>>>>>>>>>>>>> to build path of my new project (I did build the PDFBox
>>>>> project
>>>>>>>>>> from
>>>>>>>>>>>>>>>>>> source). That is what I did. But let's say I need to check
>>>>> the
>>>>>>>>>>>>>>>>>> functionality of document.save("") method. But I don't have
>>>>> a
>>>>>>>>>>>>>>> reference to
>>>>>>>>>>>>>>>>>> it's sources because I directly used generated jars. As
>>>>> Tilman
>>>>>>>>>> said
>>>>>>>>>>>> I
>>>>>>>>>>>>>>> built
>>>>>>>>>>>>>>>>>> PDFBox from sources but I don't know a proper way to use it
>>>>>>> other
>>>>>>>>>>>>>>> projects
>>>>>>>>>>>>>>>>>> other than adding those jar files to build path.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <
>>>>>>> john@jahewson.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Which IDE are you using? You should be able to run the
>>>>>>> PDFToText
>>>>>>>>>>>> class
>>>>>>>>>>>>>>>>>>> (in pdfbox-tools) using your IDE and pass a PDF file path
>>>>> as
>>>>>>> the
>>>>>>>>>>>>>>> command
>>>>>>>>>>>>>>>>>>> line argument.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> On 24 Feb 2014, at 22:38, DImuthu Upeksha <
>>>>>>>>>>>>>>> dimuthu.upeksha2@gmail.com>
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>>>>>> Thanks for the reply. Yes I checked out PDFBox code and
>>>>>>>>>> managed to
>>>>>>>>>>>>>>>>>>> build
>>>>>>>>>>>>>>>>>>>> code successfully. I looked at the classes you mentioned
>>>>> and
>>>>>>> I
>>>>>>>>>>>> got a
>>>>>>>>>>>>>>>>>>> rough
>>>>>>>>>>>>>>>>>>>> idea about how they are working. To check them I used the
>>>>>>> jars
>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>> target
>>>>>>>>>>>>>>>>>>>> folder to my separate java project. I tried samples in
>>>>>>>>>>>>>>>>>>>> http://pdfbox.apache.org/cookbook/. I need to further
>>>>> look
>>>>>>>>>> into
>>>>>>>>>>>> code
>>>>>>>>>>>>>>>>>>>> specially how those processXXX() methods work in
>>>>>>>>>> PDFTextStripper
>>>>>>>>>>>>>>> class.
>>>>>>>>>>>>>>>>>>>> What I usually do is add some breakpoints and check them in the debug
>>>>>>>>>>>>>>>>>>>> windows. But with jars that is not possible. What approach do you follow
>>>>>>>>>>>>>>>>>>>> for such a task?
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> As well I installed tesseract in to my machine and
>>>>> managed to
>>>>>>>>>> do
>>>>>>>>>>>> some
>>>>>>>>>>>>>>>>>>> OCR
>>>>>>>>>>>>>>>>>>>> stuff also. That's a cool tool which works fine.
>>>>>>>>>>>>>>>>>>>> I'm still learning the code. If I get any issue I'll drop
>>>>>>> you a
>>>>>>>>>>>> mail.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <
>>>>>>>>>> john@jahewson.com
>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Hi Dimuthu
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> The PDFBox website can be found at http://pdfbox.apache.org/ and it contains
>>>>>>>>>>>>>>>>>>>>> a basic overview of the project
>>>>>>>>>>>>>>>>>>>>> and details on how to obtain the source code and build
>>>>>>> PDFBox
>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>> yourself.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Currently we do not perform any OCR and PDFBOX-1912
>>>>> details
>>>>>>>>>> the
>>>>>>>>>>>> only
>>>>>>>>>>>>>>>>>>>>> thoughts so far regarding it.
>>>>>>>>>>>>>>>>>>>>> Note that the OCR libraries mentioned in the JIRA issue
>>>>> are
>>>>>>>>>> all
>>>>>>>>>>>>>>> under
>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>> Apache license, which is a
>>>>>>>>>>>>>>>>>>>>> requirement.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Once you have the source code, take a look at the
>>>>> PageDrawer
>>>>>>>>>>>> class
>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>>> see
>>>>>>>>>>>>>>>>>>>>> how text and images are
>>>>>>>>>>>>>>>>>>>>> rendered. We want someone to interface at a low-level
>>>>> (e.g.
>>>>>>>>>> one
>>>>>>>>>>>>>>> glyph,
>>>>>>>>>>>>>>>>>>>>> word, or sentence at a time) with
>>>>>>>>>>>>>>>>>>>>> an OCR engine. Also look at PDFTextStripper which is how
>>>>>>> text
>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>> currently
>>>>>>>>>>>>>>>>>>>>> extracted, take a look at how
>>>>>>>>>>>>>>>>>>>>> we have to go to great length to sort text back into
>>>>> reading
>>>>>>>>>>>> order
>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>>> infer the placement of diacritics - PDF
>>>>>>>>>>>>>>>>>>>>> is fundamentally a visual format, not a structured format
>>>>>>> like
>>>>>>>>>>>> HTML
>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>>>>> which is why extracting text can be so
>>>>>>>>>>>>>>>>>>>>> difficult sometimes.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> The full PDF Reference document can be found at:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>> 
>>>>> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Feel free to discuss specifics of your proposal or ask
>>>>> any
>>>>>>>>>>>>>>> questions.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On 23 Feb 2014, at 21:13, DImuthu Upeksha <
>>>>>>>>>>>>>>> dimuthu.upeksha2@gmail.com
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>> I am Dimuthu Upeksha, a Computer Engineering
>>>>> Undergraduate
>>>>>>> at
>>>>>>>>>>>>>>>>>>> University
>>>>>>>>>>>>>>>>>>>>> of Moratuwa Sri Lanka. I successfully completed my GSoC
>>>>> 2013
>>>>>>>>>> with
>>>>>>>>>>>>>>>>>>> Apache
>>>>>>>>>>>>>>>>>>>>> ISIS [1] project. I'm very much interested in OCR and
>>>>> image
>>>>>>>>>>>>>>> processing
>>>>>>>>>>>>>>>>>>>>> stuff. So I would like to select this project idea as my
>>>>>>> GSoC
>>>>>>>>>>>> 2014
>>>>>>>>>>>>>>>>>>> project
>>>>>>>>>>>>>>>>>>>>> because I feel like it is the best suited project for
>>>>> me. In
>>>>>>>>>>>>>>>>>>> university
>>>>>>>>>>>>>>>>>>>>> also we have done some research in OCR area and our group
>>>>>>>>>> wrote a
>>>>>>>>>>>>>>>>>>>>> literature review about increasing efficiency of OCR
>>>>>>>>>>>>>>>>>>> systems(attached). Can
>>>>>>>>>>>>>>>>>>>>> you please suggest me where to start learning about
>>>>> PDFBox?
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>> 
>>>>> http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Thank you
>>>>>>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>>>>>>>>>>>>> Undergraduate
>>>>>>>>>>>>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka


Re: [GSoC 2014]Optical Character Recognition project - Introduction

Posted by DImuthu Upeksha <di...@gmail.com>.
Hi John,
Thanks for the guidance.
I did a small analysis of the accuracy and performance of the new
Tesseract wrapper. I used this [1] image as the input and got the
following data [2] after OCR. The first line is the recognised word,
followed by the location details (bounding box) of the word. I think these
details are sufficient for our task. What remains is
converting the PDF file into an image, as you mentioned. I'm
working on that at the moment.

[1] https://www.dropbox.com/s/11wahtonoz08zmn/image4.TIF
[2] https://gist.github.com/DImuthuUpe/9491660

Thanks
Dimuthu

On Mon, Mar 10, 2014 at 2:30 PM, John Hewson <jo...@jahewson.com> wrote:
> Dimuthu,
>
>> I finished a basic implementation of the JNI wrapper for Tesseract. It can now be
>> built using Maven. Some useful methods that are needed to do basic OCR have been
>> implemented.
>
> Great, it's looking good, nice and clean.
>
>> 1. What is the task of processStream method in PDFTextStripper class line
>> 456 : processStream( page.findResources(), content, page.findCropBox(),
>> page.findRotation() );
>
> A PDF file is made up of pages, each of which contains a "content stream". This content stream contains a list of drawing commands such as "move to 10,15" or "write the word `foo`"; these are called operators. The processStream function reads the stream for the current page and executes each of the operators. Each operator is implemented in its own class, which is a subclass of PDFOperator. The constructor of PDFStreamEngine creates the operator classes using reflection, which is rather odd, and I'm not sure why this design was chosen. The operators used by PDFTextStripper can be found in org/apache/pdfbox/resources/PDFTextStripper.properties.
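
As a rough illustration of the operator dispatch described above, the following self-contained Java sketch models a content stream as a list of commands and looks up a handler for each operator name. It does not use PDFBox: MiniStreamEngine, Command and OperatorHandler are hypothetical names invented for this example, and the handlers are registered by hand rather than through reflection. Td and Tj are the PDF operators for moving the text position and showing a string.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical miniature of an operator-driven stream engine; not PDFBox code.
class MiniStreamEngine {

    /** One drawing command: an operator name plus its operands. */
    static final class Command {
        final String operator;
        final List<String> operands;

        Command(String operator, List<String> operands) {
            this.operator = operator;
            this.operands = operands;
        }
    }

    /** Handles one kind of operator, the role an operator class would play. */
    interface OperatorHandler {
        void process(List<String> operands, List<String> output);
    }

    private final Map<String, OperatorHandler> handlers = new HashMap<>();

    MiniStreamEngine() {
        // Registered by hand here; a reflective setup would load these by class name.
        handlers.put("Td", (ops, out) -> out.add("move to " + ops.get(0) + "," + ops.get(1)));
        handlers.put("Tj", (ops, out) -> out.add("show text " + ops.get(0)));
    }

    /** Executes each command in order and returns a log of what was done. */
    List<String> processStream(List<Command> stream) {
        List<String> output = new ArrayList<>();
        for (Command command : stream) {
            OperatorHandler handler = handlers.get(command.operator);
            if (handler != null) {
                handler.process(command.operands, output);
            }
            // Operators without a registered handler are ignored in this sketch.
        }
        return output;
    }

    public static void main(String[] args) {
        List<Command> stream = Arrays.asList(
                new Command("Td", Arrays.asList("10", "15")),
                new Command("Tj", Arrays.asList("foo")));
        for (String line : new MiniStreamEngine().processStream(stream)) {
            System.out.println(line);
        }
    }
}
```

A text-extraction engine built this way only needs handlers for the text-related operators, which matches the idea of listing the operators used by a given stripper in a properties file.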
>
>> 2. Say I need to extract images and their metadata from a PDF. What is the best approach?
>
> You could subclass PDFTextStripper and override the startDocument method and use it to create a PDFRenderer and store it in a field. Then override the processPage method and use the previously created PDFRenderer to render the current page to a buffered image and perform OCR on the image. Once you have the OCR text + positions, instead of calling processStream you can call processTextPosition once for each character + position.
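
One practical gap in the approach above is that word-level OCR output (a recognised word plus its bounding box) gives one box per word, while the suggestion is to report one position per character. The following is a minimal sketch of one way to bridge that, assuming every character in a word gets an equal share of the word's width. Real glyph widths differ, so per-symbol boxes from the OCR engine would be preferable if the wrapper can expose them. WordBoxSplitter and CharBox are hypothetical names, not PDFBox or Tesseract classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a word-level bounding box into per-character boxes.
class WordBoxSplitter {

    /** A single character with its estimated bounding box. */
    static final class CharBox {
        final char ch;
        final float x;
        final float y;
        final float width;
        final float height;

        CharBox(char ch, float x, float y, float width, float height) {
            this.ch = ch;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /**
     * Divides the word box into equal-width slices, one per character.
     * This is an approximation: it assumes a monospaced layout.
     */
    static List<CharBox> split(String word, float x, float y, float width, float height) {
        List<CharBox> boxes = new ArrayList<>();
        if (word.isEmpty()) {
            return boxes;
        }
        float step = width / word.length();
        for (int i = 0; i < word.length(); i++) {
            boxes.add(new CharBox(word.charAt(i), x + i * step, y, step, height));
        }
        return boxes;
    }

    public static void main(String[] args) {
        for (CharBox box : split("foo", 10f, 20f, 30f, 8f)) {
            System.out.println(box.ch + " at x=" + box.x + " width=" + box.width);
        }
    }
}
```

Each resulting box could then be reported as one character position to whatever text-position callback the stripper subclass uses.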
>
> The PDFRenderer class was just added to the trunk, so make sure you do an "svn update". Let me know if you need me to change PDFTextStripper to make it easier to subclass.
>
> Cheers
>
> -- John
>
> On 9 Mar 2014, at 09:08, DImuthu Upeksha <di...@gmail.com> wrote:
>
>> Hi John,
>> I finished a basic implementation of the JNI wrapper for Tesseract. It can now be
>> built using Maven. Some useful methods that are needed to do basic OCR have been
>> implemented.
>>
>> I went through the PDFBox code several times and have a couple of questions that
>> need to be clarified:
>>
>> 1. What is the task of processStream method in PDFTextStripper class line
>> 456 : processStream( page.findResources(), content, page.findCropBox(),
>> page.findRotation() );
>>
>> 2. Say I need to extract images and their metadata from a PDF. What is the
>> best approach?
>>
>> Thanks
>> Dimuthu
>>
>>
>> On Fri, Mar 7, 2014 at 9:26 PM, DImuthu Upeksha
>> <di...@gmail.com>wrote:
>>
>>> Hi John
>>> I refactored the Tesseract JNI code [1] to support a Maven build. To create the JNI
>>> library I added pre-built static libraries of Tesseract and Leptonica to the
>>> resources folder [2]. For now it includes only the libraries for Mac, but
>>> we can easily add both Windows and Linux libraries. After "mvn clean
>>> install", the jar is created under the target folder. All the setup is now
>>> done. What remains is implementing the native methods in tessbaseapi.cpp
>>> [3]. I hope to finish it as soon as possible. Please let me know if you have any
>>> concerns about the project structure.
>>>
>>> [1] https://github.com/DImuthuUpe/Tesseract-API.git
>>> [2]
>>> https://github.com/DImuthuUpe/Tesseract-API/tree/master/src/main/resources
>>> [3]
>>> https://github.com/DImuthuUpe/Tesseract-API/blob/master/src/main/native/src/tessbaseapi.cpp
>>>
>>> Thanks
>>> Dimuthu
>>>
>>>
>>> On Thu, Mar 6, 2014 at 1:15 AM, John Hewson <jo...@jahewson.com> wrote:
>>>
>>>> Dimuthu
>>>>
>>>>> There are a lot of code
>>>>> fragments in the current Android JNI wrapper which use "(jint)somePointer"
>>>>> casting, which will create terrible memory leaks in 64-bit environments
>>>>> because pointers are 64 bits wide. So I believe writing it from scratch is
>>>>> much better.
>>>>
>>>> That's a classic 64-bit pitfall, well spotted. We definitely need to
>>>> support
>>>> 64-bit JVMs.
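
The pitfall discussed here can be reproduced in plain Java, without any native code. The sketch below narrows a 64-bit value standing in for a native pointer to a 32-bit int (the width of jint) and back, which silently drops the upper bits. The address value is made up for the demonstration, and PointerHandleDemo is a hypothetical name. Carrying the handle as a Java long (jlong on the native side) keeps all 64 bits.

```java
// Demonstrates why a 64-bit native address must not be stored in a 32-bit int.
class PointerHandleDemo {

    /** Mimics a "(jint) somePointer" cast: keeps only the low 32 bits. */
    static int narrowToInt(long nativeAddress) {
        return (int) nativeAddress;
    }

    /** Widens the 32-bit handle back to 64 bits; the lost upper bits do not return. */
    static long widenToLong(int handle) {
        return handle;
    }

    public static void main(String[] args) {
        long address = 0x00007F8A12345678L; // made-up 64-bit address
        long roundTripped = widenToLong(narrowToInt(address));
        System.out.println("original:      " + Long.toHexString(address));
        System.out.println("after int trip: " + Long.toHexString(roundTripped));
    }
}
```

On a 32-bit platform every address fits in the int, so the round trip is lossless there, which is why code written this way can appear to work until it runs on a 64-bit JVM.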
>>>>
>>>>> we can use
>>>>> the static library of Leptonica (I did and it worked nicely). I think it is
>>>>> not an issue to use its static library because both Tesseract and Leptonica
>>>>> are under the Apache licence.
>>>>
>>>> Sounds good, I found the following in the README:
>>>>
>>>> Leptonica is required. (www.leptonica.com). Tesseract no longer compiles
>>>> without Leptonica.
>>>>
>>>> Which makes sense.
>>>>
>>>> -- John
>>>>
>>>> On 5 Mar 2014, at 09:45, DImuthu Upeksha <di...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi John,
>>>>> +1 for your suggestion about converting image <=> byte array on the Java side.
>>>>> It removes a lot of complexity. I don't know whether you have noticed or
>>>>> not, but the jint data type in JNI is a 32-bit integer type. I noticed it on my Mac
>>>>> but don't know about other operating systems.
>>>>>
>>>>> Leptonica is the image processing library for Tesseract [1]. Tesseract
>>>>> uses the image processing algorithms in Leptonica to implement its OCR
>>>>> algorithms. This [2] is the .cpp file responsible for creating the Tesseract API.
>>>>> You can see it includes the allheaders.h header file, which is the main header
>>>>> file of Leptonica. So I think we must build Leptonica first and
>>>>> link it when we build Tesseract. This is not a big problem if we can use
>>>>> the static library of Leptonica (I did and it worked nicely). I think it is
>>>>> not an issue to use its static library because both Tesseract and Leptonica
>>>>> are under the Apache licence.
>>>>>
>>>>> I'm working on the maven implementation you have mentioned and will get
>>>>> back to you soon.
>>>>>
>>>>> Thanks
>>>>> Dimuthu
>>>>>
>>>>>
>>>>> [1] https://code.google.com/p/tesseract-ocr/wiki/Compiling
>>>>> [2]
>>>>>
>>>> https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/src/api/tesseractmain.cpp
>>>>>
>>>>>
>>>>> On Wed, Mar 5, 2014 at 1:15 AM, John Hewson <jo...@jahewson.com> wrote:
>>>>>
>>>>>> Hi Dimuthu,
>>>>>>
>>>>>> 1,2,3:
>>>>>>
>>>>>> Feel free to write your own Tesseract binding or port the existing
>>>> code as
>>>>>> you see fit.
>>>>>> The JNI binding should be minimal, only the methods you require need
>>>> to be
>>>>>> wrapped.
>>>>>> Also, don't forget that some of the interop can be done in Java, for
>>>>>> example if it is easier
>>>>>> to convert a BufferedImage to a byte array in Java then do it there and
>>>>>> pass the result
>>>>>> to JNI rather than writing lots of JNI C++ to achieve the same result.
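
A minimal sketch of doing that conversion on the Java side: it flattens a BufferedImage into an 8-bit grayscale byte array, one byte per pixel in row-major order, using only java.awt.image. ImageBytes and toGrayBytes are hypothetical names, and the grayscale layout is an assumption made for this example. A native OCR call would typically also need the width, height and bytes per pixel alongside the array.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper: converts an image to raw grayscale bytes on the Java side.
class ImageBytes {

    /**
     * Returns one byte per pixel, row by row, using the common
     * 0.299 R + 0.587 G + 0.114 B luma weighting in integer arithmetic.
     */
    static byte[] toGrayBytes(BufferedImage image) {
        int width = image.getWidth();
        int height = image.getHeight();
        byte[] out = new byte[width * height];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = image.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                out[y * width + x] = (byte) ((r * 299 + g * 587 + b * 114) / 1000);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        BufferedImage image = new BufferedImage(2, 1, BufferedImage.TYPE_INT_RGB);
        image.setRGB(0, 0, 0xFFFFFF); // white pixel
        image.setRGB(1, 0, 0x000000); // black pixel
        byte[] bytes = toGrayBytes(image);
        System.out.println((bytes[0] & 0xFF) + " " + (bytes[1] & 0xFF));
    }
}
```

With the array prepared in Java, the JNI method can take a byte[] plus a few ints, which keeps the C++ side small.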
>>>>>>
>>>>>> Your GitHub repo looks like a good start, I can make comments there as
>>>>>> things progress.
>>>>>>
>>>>>> Is it possible to build Tesseract without leptonica? I was under the
>>>>>> impression that it was
>>>>>> used for image i/o only, but I may be misinformed.
>>>>>>
>>>>>> 4:  The native platform library should be built as part of the Maven
>>>> build
>>>>>> for the Tesseract
>>>>>> wrapper which can be a separate project. The output can be a jar file
>>>>>> which contains the
>>>>>> native binaries. It should be possible for the jar to contain prebuilt
>>>>>> binaries for all platforms
>>>>>> but this is something we can worry about later. Right now the goal
>>>> should
>>>>>> be to build a jar
>>>>>> containing just the current platform's native binary and any Java
>>>> wrapper
>>>>>> code.
>>>>>>
>>>>>> -- John
>>>>>>
>>>>>> On 3 Mar 2014, at 16:41, DImuthu Upeksha <di...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi John,
>>>>>>>
>>>>>>> I tried to reuse that Android JNI wrapper for Tesseract. Here are my
>>>>>>> observations:
>>>>>>>
>>>>>>> 1. The wrapper depends heavily on the Android image libraries
>>>>>>> (android/bitmap.h). Most of the wrapper methods [1] use this library.
>>>>>>>
>>>>>>> 2. But I can understand the underlying logic in each function. Basically,
>>>>>>> it maps the Tesseract API functions [2] to Java methods.
>>>>>>> In between it does some image <=> byte array conversions using
>>>>>>> the Android bitmap libraries.
>>>>>>>
>>>>>>> 3. There are two options. 1: We can port its code to make it compatible with our
>>>>>>> environments (Linux, Windows and Mac), which is really painful. It will also
>>>>>>> cause memory leaks. 2: We can reuse only its function signatures and
>>>>>>> implement them with our own code.
>>>>>>>
>>>>>>> I think the second option is better because we need only a few operations from
>>>>>>> the Tesseract library. I have created a GitHub repo [3] for this.
>>>>>>> It's still not finished. I need to add some makefiles and build files to
>>>>>>> make it run properly. I also need to implement those wrapper functions
>>>>>>> [4]. This may take some time.
>>>>>>>
>>>>>>> 4. Because we are calling native libraries, we need different builds of the
>>>>>>> Tesseract and Leptonica libraries for each platform (.dll for Windows, .so
>>>>>>> for Linux, .dylib for Mac). So we may need to build those libraries at the
>>>>>>> time we build the PDFBox project. Or we can prebuild those libraries and add
>>>>>>> them to the project in .dll, .so or .dylib format. Which way is
>>>>>>> preferred?
>>>>>>>
>>>>>>> [1]
>>>>>>>
>>>>>>
>>>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/tessbaseapi.cpp
>>>>>>> [2] https://code.google.com/p/tesseract-ocr/wiki/APIExample
>>>>>>> [3] https://github.com/DImuthuUpe/Tesseract-API
>>>>>>> [4]
>>>>>>>
>>>>>>
>>>> https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/tessbaseapi.cpp
>>>>>>>
>>>>>>> Thanks
>>>>>>> Dimuthu
>>>>>>>
>>>>>>>
>>>>>>> On Sat, Mar 1, 2014 at 11:39 PM, DImuthu Upeksha <
>>>>>> dimuthu.upeksha2@gmail.com
>>>>>>>> wrote:
>>>>>>>
>>>>>>>> I updated necessary changes to the document [1]
>>>>>>>>
>>>>>>>> For last two days I had a deep look at this [2] jni wrapper for
>>>>>> tessaract
>>>>>>>> api.
>>>>>>>> Unfortunately this has been designed for Android environment so I
>>>> think
>>>>>> we
>>>>>>>> need to write our own make files to build this in to a dll(windows)
>>>> or
>>>>>>>> dylib(in mac). Currently it has Android.mk files [3]. I'm searching
>>>> for
>>>>>> a
>>>>>>>> way to convert it to a make file that we can run on console. Please
>>>>>> suggest
>>>>>>>> if you have a better approach
>>>>>>>>
>>>>>>>> [1]
>>>>>>>>
>>>>>>
>>>> https://www.dropbox.com/s/9qclvq26divwr2q/Optical%20Character%20Recognition%20for%20PDFBox%20-%20updated.pdf
>>>>>>>> [2]
>>>>>>>>
>>>>>>
>>>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/
>>>>>>>> [3]
>>>>>>>>
>>>>>>
>>>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/Android.mk
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, Mar 1, 2014 at 12:27 AM, John Hewson <jo...@jahewson.com>
>>>> wrote:
>>>>>>>>
>>>>>>>>> This is a good start. However, there is no need for the Adder
>>>>>> component,
>>>>>>>>> "Extracted Text (OCR) can just feed back into the PDFBox "Text
>>>>>> Extractor".
>>>>>>>>>
>>>>>>>>> Maybe show a "PDF" file feeding in to "Text Extractor, to make it
>>>> clear
>>>>>>>>> where the process starts.
>>>>>>>>>
>>>>>>>>> -- John
>>>>>>>>>
>>>>>>>>> On 26 Feb 2014, at 16:53, DImuthu Upeksha <
>>>> dimuthu.upeksha2@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Sorry for the mistake. I added it to my Dropbox [1].
>>>>>>>>>>
>>>>>>>>>> [1]
>>>>>>>>>>
>>>>>>>>>
>>>>>>
>>>> https://www.dropbox.com/s/y3m15rfjmw4eqij/Optical%20Character%20Recognition%20for%20PDFBox.pdf
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Dimuthu
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Feb 27, 2014 at 4:44 AM, John Hewson <jo...@jahewson.com>
>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> I should add that the OCR engine should be pluggable so PDFToText
>>>>>> might
>>>>>>>>>>> use an interface, e.g. OCREngine and there will be a
>>>>>> TesseractOCREngine
>>>>>>>>>>> class somewhere which provides the required functionality and
>>>> lives
>>>>>> in
>>>>>>>>> a
>>>>>>>>>>> separate jar file.
>>>>>>>>>>>
>>>>>>>>>>> -- John
>>>>>>>>>>>
>>>>>>>>>>>> On 25 Feb 2014, at 20:18, Dimuthu <di...@gmail.com>
>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> So do you need to embed those new functionalities into existing
>>>>>>>>>>> PDFtoText algorithms or package them as a new sub system(something
>>>>>>>>> like an
>>>>>>>>>>> API)?
>>>>>>>>>>>>
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: "John Hewson" <jo...@jahewson.com>
>>>>>>>>>>>> Sent: 26/02/2014 07:38
>>>>>>>>>>>> To: "dev@pdfbox.apache.org" <de...@pdfbox.apache.org>
>>>>>>>>>>>> Subject: Re: [GSoC 2014]Optical Character Recognition project -
>>>>>>>>>>> Introduction
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, exactly. By location data I just mean (x,y) coordinates and
>>>>>> page
>>>>>>>>>>> rotation.
>>>>>>>>>>>>
>>>>>>>>>>>> There is another use case for OCR: some fonts embedded in PDFs
>>>> have
>>>>>>>>>>> corrupt encodings, which means the ACSII codes map to the wrong
>>>>>>>>> glyphs. We
>>>>>>>>>>> could OCR the glyphs to repair the encoding.
>>>>>>>>>>>>
>>>>>>>>>>>> -- John
>>>>>>>>>>>>
>>>>>>>>>>>>> On 25 Feb 2014, at 17:13, DImuthu Upeksha <
>>>>>>>>> dimuthu.upeksha2@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>> Thanks for the explanation.
>>>>>>>>>>>>> Let's say there is a pdf with both text in extractable format
>>>> and
>>>>>>>>> some
>>>>>>>>>>>>> images with text(Scanned images). In that case first we extract
>>>>>> those
>>>>>>>>>>>>> extractable content using PDFBox algorithms and rest is
>>>> extracted
>>>>>>>>> using
>>>>>>>>>>>>> OCR. Finally we pack both results together and give output as
>>>>>>>>>>> PDFToText. Am
>>>>>>>>>>>>> I correct? What do you mean by "location data"?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 11:22 PM, John Hewson <
>>>> john@jahewson.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. What is called "glyphs" ?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> http://en.wikipedia.org/wiki/Glyph
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2. What is the main requirement of this project?
>>>>>>>>>>>>>>> As far as I understood, first we need to generate an image of
>>>>>>>>>>>>>>> malformed pdfs from
>>>>>>>>>>>>>>> PDFBox and then we need to do processing using OCR for further
>>>>>>>>>>> accurate
>>>>>>>>>>>>>>> results.  But the problem is, why shouldn't we directly do
>>>> OCR on
>>>>>>>>>>> those
>>>>>>>>>>>>>>> PDFs without getting output from PDFBox? Correct me if I'm
>>>> wrong.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> PDFBox can generate images (PDFToImage) and can extract text
>>>>>>>>>>> (PDFToText).
>>>>>>>>>>>>>> The goal of
>>>>>>>>>>>>>> this project is to enhance PDFToText so that it can use OCR to
>>>>>>>>> extract
>>>>>>>>>>>>>> text from areas of the
>>>>>>>>>>>>>> document where the text is embedded as an image. Such PDF files
>>>>>> are
>>>>>>>>>>>>>> typically generated by
>>>>>>>>>>>>>> scanners or fax machines. There is also another case where OCR
>>>> is
>>>>>>>>>>> useful:
>>>>>>>>>>>>>> some fonts embedded
>>>>>>>>>>>>>> in PDF files contain the wrong encoding, so when text is
>>>> extracted
>>>>>>>>> with
>>>>>>>>>>>>>> PDFToText the result is
>>>>>>>>>>>>>> nonsense but when drawn with PDFToImage we see the correct
>>>>>> letters.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Instead of:
>>>>>>>>>>>>>> PDF => Image => OCR => Text
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We want to do:
>>>>>>>>>>>>>> PDF => (Many images for words + location data => OCR) => Text
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha <
>>>>>>>>>>>>>> dimuthu.upeksha2@gmail.com
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Ok fixed. This is what I did
>>>>>>>>>>>>>>>> Right click on the new project ->Debug As-> Debug
>>>> Configurations
>>>>>>>>>>>>>> ->Source
>>>>>>>>>>>>>>>> ->Add -> Project
>>>>>>>>>>>>>>>> Then I selected PDFBox project.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha <
>>>>>>>>>>>>>>>> dimuthu.upeksha2@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'm using eclipse. This is what I want. I created a new Java
>>>>>>>>>>>>>> application
>>>>>>>>>>>>>>>>> project (say TestPDFBox) with a main class with following
>>>> code.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> PDDocument document = new PDDocument();PDPage blankPage =
>>>> new
>>>>>>>>>>>>>> PDPage();document.addPage( blankPage
>>>>>>>>>>>>>> );document.save("BlankPage.pdf");document.close();
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Then I need to add those jar files generated in target
>>>> folder
>>>>>> of
>>>>>>>>>>> PDFBox
>>>>>>>>>>>>>>>>> to build path of my new project (I did build the PDFBox
>>>> project
>>>>>>>>> from
>>>>>>>>>>>>>>>>> source). That is what I did. But let's say I need to check
>>>> the
>>>>>>>>>>>>>>>>> functionality of document.save("") method. But I don't have
>>>> a
>>>>>>>>>>>>>> reference to
>>>>>>>>>>>>>>>>> it's sources because I directly used generated jars. As
>>>> Tilman
>>>>>>>>> said
>>>>>>>>>>> I
>>>>>>>>>>>>>> built
>>>>>>>>>>>>>>>>> PDFBox from sources but I don't know a proper way to use it
>>>>>> other
>>>>>>>>>>>>>> projects
>>>>>>>>>>>>>>>>> other than adding those jar files to build path.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <
>>>>>> john@jahewson.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Which IDE are you using? You should be able to run the
>>>>>> PDFToText
>>>>>>>>>>> class
>>>>>>>>>>>>>>>>>> (in pdfbox-tools) using your IDE and pass a PDF file path
>>>> as
>>>>>> the
>>>>>>>>>>>>>> command
>>>>>>>>>>>>>>>>>> line argument.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On 24 Feb 2014, at 22:38, DImuthu Upeksha <
>>>>>>>>>>>>>> dimuthu.upeksha2@gmail.com>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>>>>> Thanks for the reply. Yes I checked out PDFBox code and
>>>>>>>>> managed to
>>>>>>>>>>>>>>>>>> build
>>>>>>>>>>>>>>>>>>> code successfully. I looked at the classes you mentioned
>>>> and
>>>>>> I
>>>>>>>>>>> got a
>>>>>>>>>>>>>>>>>> rough
>>>>>>>>>>>>>>>>>>> idea about how they are working. To check them I used the
>>>>>> jars
>>>>>>>>> in
>>>>>>>>>>>>>>>>>> target
>>>>>>>>>>>>>>>>>>> folder to my separate java project. I tried samples in
>>>>>>>>>>>>>>>>>>> http://pdfbox.apache.org/cookbook/. I need to further
>>>> look
>>>>>>>>> into
>>>>>>>>>>> code
>>>>>>>>>>>>>>>>>>> specially how those processXXX() methods work in
>>>>>>>>> PDFTextStripper
>>>>>>>>>>>>>> class.
>>>>>>>>>>>>>>>>>>> What I usually do is adding some berakpoints and checking
>>>>>> them
>>>>>>>>> in
>>>>>>>>>>>>>> debug
>>>>>>>>>>>>>>>>>>> windows. But using jars it's not possible. What is the way
>>>>>> you
>>>>>>>>>>> follow
>>>>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>>> order to do such task?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> As well I installed tesseract in to my machine and
>>>> managed to
>>>>>>>>> do
>>>>>>>>>>> some
>>>>>>>>>>>>>>>>>> OCR
>>>>>>>>>>>>>>>>>>> stuff also. That's a cool tool which works fine.
>>>>>>>>>>>>>>>>>>> I'm still learning the code. If I get any issue I'll drop
>>>>>> you a
>>>>>>>>>>> mail.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <
>>>>>>>>> john@jahewson.com
>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Dimuthu
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The PDFBox website can be found at
>>>>>> http://pdfbox.apache.org/it
>>>>>>>>>>>>>>>>>> contains
>>>>>>>>>>>>>>>>>>>> a basic overview of the project
>>>>>>>>>>>>>>>>>>>> and details on how to obtain the source code and build
>>>>>> PDFBox
>>>>>>>>> for
>>>>>>>>>>>>>>>>>> yourself.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Currently we do not perform any OCR and PDFBOX-1912
>>>> details
>>>>>>>>> the
>>>>>>>>>>> only
>>>>>>>>>>>>>>>>>>>> thoughts so far regarding it.
>>>>>>>>>>>>>>>>>>>> Note that the OCR libraries mentioned in the JIRA issue
>>>> are
>>>>>>>>> all
>>>>>>>>>>>>>> under
>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> Apache license, which is a
>>>>>>>>>>>>>>>>>>>> requirement.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Once you have the source code, take a look at the
>>>> PageDrawer
>>>>>>>>>>> class
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>>> see
>>>>>>>>>>>>>>>>>>>> how text and images are
>>>>>>>>>>>>>>>>>>>> rendered. We want someone to interface at a low-level
>>>> (e.g.
>>>>>>>>> one
>>>>>>>>>>>>>> glyph,
>>>>>>>>>>>>>>>>>>>> word, or sentence at a time) with
>>>>>>>>>>>>>>>>>>>> an OCR engine. Also look at PDFTextStripper which is how
>>>>>> text
>>>>>>>>> is
>>>>>>>>>>>>>>>>>> currently
>>>>>>>>>>>>>>>>>>>> extracted, take a look at how
>>>>>>>>>>>>>>>>>>>> we have to go to great length to sort text back into
>>>> reading
>>>>>>>>>>> order
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>>> infer the placement of diacritics - PDF
>>>>>>>>>>>>>>>>>>>> is fundamentally a visual format, not a structured format
>>>>>> like
>>>>>>>>>>> HTML
>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>>>> which is why extracting text can be so
>>>>>>>>>>>>>>>>>>>> difficult sometimes.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The full PDF Reference document can be found at:
>>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>
>>>> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Feel free to discuss specifics of your proposal or ask
>>>> any
>>>>>>>>>>>>>> questions.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On 23 Feb 2014, at 21:13, DImuthu Upeksha <
>>>>>>>>>>>>>> dimuthu.upeksha2@gmail.com
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>> I am Dimuthu Upeksha, a Computer Engineering
>>>> Undergraduate
>>>>>> at
>>>>>>>>>>>>>>>>>> University
>>>>>>>>>>>>>>>>>>>> of Moratuwa Sri Lanka. I successfully completed my GSoC
>>>> 2013
>>>>>>>>> with
>>>>>>>>>>>>>>>>>> Apache
>>>>>>>>>>>>>>>>>>>> ISIS [1] project. I'm very much interested in OCR and
>>>> image
>>>>>>>>>>>>>> processing
>>>>>>>>>>>>>>>>>>>> stuff. So I would like to select this project idea as my
>>>>>> GSoC
>>>>>>>>>>> 2014
>>>>>>>>>>>>>>>>>> project
>>>>>>>>>>>>>>>>>>>> because I feel like it is the best suited project for
>>>> me. In
>>>>>>>>>>>>>>>>>> university
>>>>>>>>>>>>>>>>>>>> also we have done some research in OCR area and our group
>>>>>>>>> wrote a
>>>>>>>>>>>>>>>>>>>> literature review about increasing efficiency of OCR
>>>>>>>>>>>>>>>>>> systems(attached). Can
>>>>>>>>>>>>>>>>>>>> you please suggest me where to start learning about
>>>> PDFBox?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>
>>>> http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thank you
>>>>>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>>>>>>>>>>>> Undergraduate
>>>>>>>>>>>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka



-- 
Regards

W.Dimuthu Upeksha
Undergraduate

Department of Computer Science And Engineering

University of Moratuwa, Sri Lanka

Re: [GSoC 2014]Optical Character Recognition project - Introduction

Posted by John Hewson <jo...@jahewson.com>.
Dimuthu,

> I finished a basic implementation of the JNI wrapper for Tesseract. It can now be
> built using Maven. Some useful methods that are needed to do basic OCR were
> implemented.

Great, it’s looking good, nice and clean.

> 1. What is the task of the processStream method in the PDFTextStripper class, line
> 456: processStream( page.findResources(), content, page.findCropBox(),
> page.findRotation() );

A PDF file is made up of pages, each of which contains a “content stream”. This content stream contains a list of drawing commands, called operators, such as “move to 10,15” or “write the word `foo`”. The processStream function reads the stream for the current page and executes each of the operators. Each operator is implemented in its own class, which is a subclass of PDFOperator. The constructor of PDFStreamEngine creates the operator classes using reflection, which is rather odd, and I’m not sure why this design was chosen. The operators used by PDFTextStripper are listed in org/apache/pdfbox/resources/PDFTextStripper.properties.
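To make the dispatch idea concrete, here is a toy model in Java. It is only a sketch under stated assumptions: ToyOperator, MoveTo, ShowText and ToyStreamEngine are made-up names, not PDFBox classes, and the operators are registered by hand instead of being loaded by reflection from a properties file. The token names Td and Tj are the PDF operators for moving the text position and showing a string.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: none of these types are PDFBox classes.
interface ToyOperator {
    void process(List<String> operands, StringBuilder out);
}

// "x y Td": move the text position.
class MoveTo implements ToyOperator {
    public void process(List<String> operands, StringBuilder out) {
        out.append("move to ").append(operands.get(0))
           .append(',').append(operands.get(1)).append('\n');
    }
}

// "(string) Tj": show a string; the parentheses delimit the literal.
class ShowText implements ToyOperator {
    public void process(List<String> operands, StringBuilder out) {
        String literal = operands.get(0);
        out.append("show ").append(literal, 1, literal.length() - 1).append('\n');
    }
}

class ToyStreamEngine {
    private final Map<String, ToyOperator> operators = new HashMap<>();

    void register(String name, ToyOperator operator) {
        operators.put(name, operator);
    }

    // In a content stream the operands come first and the operator last,
    // so tokens are collected until a registered operator name appears.
    String processStream(String[] tokens) {
        StringBuilder out = new StringBuilder();
        List<String> operands = new ArrayList<>();
        for (String token : tokens) {
            ToyOperator operator = operators.get(token);
            if (operator == null) {
                operands.add(token);
            } else {
                operator.process(operands, out);
                operands = new ArrayList<>();
            }
        }
        return out.toString();
    }
}

class ToyStreamDemo {
    public static void main(String[] args) {
        ToyStreamEngine engine = new ToyStreamEngine();
        engine.register("Td", new MoveTo());
        engine.register("Tj", new ShowText());
        System.out.print(engine.processStream(
                new String[] {"10", "15", "Td", "(foo)", "Tj"}));
    }
}
```

A subclass that wants different behaviour for one operator only has to register a different handler for that name, which is roughly what the per-tool properties file achieves.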

> 2. Say I need to extract images and their metadata from a PDF. What is the best approach?

You could subclass PDFTextStripper, override the startDocument method, and use it to create a PDFRenderer and store it in a field. Then override the processPage method and use that PDFRenderer to render the current page to a buffered image and perform OCR on the image. Once you have the OCR text + positions, instead of calling processStream you can call processTextPosition once for each character + position.
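The flow described above could be sketched roughly as follows. This is a self-contained stand-in, not PDFBox code: OcrChar, OcrTextStripperSketch and the recognize method are hypothetical, the OCREngine name follows the pluggable-engine idea from earlier in this thread, and the page image is passed in directly where a real subclass would obtain it from the PDFRenderer created in startDocument.

```java
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.List;

// One recognised character and its position on the page image (hypothetical type).
class OcrChar {
    final char value;
    final int x;
    final int y;

    OcrChar(char value, int x, int y) {
        this.value = value;
        this.x = x;
        this.y = y;
    }
}

// Pluggable OCR engine; the method signature is an assumption for this sketch.
interface OCREngine {
    List<OcrChar> recognize(BufferedImage pageImage);
}

// Stand-in for a PDFTextStripper subclass; it does NOT use the PDFBox API.
class OcrTextStripperSketch {
    private final OCREngine engine;
    private final StringBuilder text = new StringBuilder();

    OcrTextStripperSketch(OCREngine engine) {
        this.engine = engine;
    }

    // Plays the role of the overridden processPage: OCR the rendered page and
    // hand over one character + position at a time instead of parsing the stream.
    void processPage(BufferedImage renderedPage) {
        for (OcrChar c : engine.recognize(renderedPage)) {
            processTextPosition(c);
        }
    }

    // The real method would receive position data and sort it into reading order;
    // this sketch simply appends the character.
    void processTextPosition(OcrChar c) {
        text.append(c.value);
    }

    String getText() {
        return text.toString();
    }
}

class OcrSketchDemo {
    public static void main(String[] args) {
        // Fake engine that "recognises" two characters regardless of the image.
        OCREngine fake = image -> {
            List<OcrChar> chars = new ArrayList<>();
            chars.add(new OcrChar('h', 0, 0));
            chars.add(new OcrChar('i', 8, 0));
            return chars;
        };
        OcrTextStripperSketch stripper = new OcrTextStripperSketch(fake);
        stripper.processPage(new BufferedImage(16, 16, BufferedImage.TYPE_BYTE_GRAY));
        System.out.println(stripper.getText());
    }
}
```

Keeping the engine behind an interface means a Tesseract-backed implementation can live in its own jar, and the stripper can be exercised with a fake engine like the one in the demo.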

The PDFRenderer class was just added to the trunk, so make sure you do an “svn update”. Let me know if you need me to change PDFTextStripper to make it easier to subclass.

Cheers

-- John

On 9 Mar 2014, at 09:08, DImuthu Upeksha <di...@gmail.com> wrote:

> Hi John,
> I finished a basic implementation of the JNI wrapper for Tesseract. It can now be
> built using Maven. Some useful methods that are needed to do basic OCR were
> implemented.
> 
> I went through the PDFBox code several times and have a couple of questions that
> need to be clarified:
> 
> 1. What is the task of the processStream method in the PDFTextStripper class, line
> 456: processStream( page.findResources(), content, page.findCropBox(),
> page.findRotation() );
> 
> 2. Say I need to extract images and their metadata from a PDF. What is the
> best approach?
> 
> Thanks
> Dimuthu
> 
> 
> On Fri, Mar 7, 2014 at 9:26 PM, DImuthu Upeksha
> <di...@gmail.com> wrote:
> 
>> Hi John
>> I refactored the Tesseract JNI code to support a Maven build. To create the JNI
>> library I added pre-built static libraries of Tesseract and Leptonica to the
>> resources folder [2]. For now it includes only the libraries for Mac, but
>> we can easily add both Windows and Linux libraries. After "mvn clean
>> install", the jar is created under the target folder. All the setup is now
>> done. What remains is implementing the native methods in tessbaseapi.cpp
>> [3]. I hope to finish it as soon as possible. Please let me know if you have
>> any concerns about the project structure.
>> 
>> [1] https://github.com/DImuthuUpe/Tesseract-API.git
>> [2]
>> https://github.com/DImuthuUpe/Tesseract-API/tree/master/src/main/resources
>> [3]
>> https://github.com/DImuthuUpe/Tesseract-API/blob/master/src/main/native/src/tessbaseapi.cpp
>> 
>> Thanks
>> Dimuthu
>> 
>> 
>> On Thu, Mar 6, 2014 at 1:15 AM, John Hewson <jo...@jahewson.com> wrote:
>> 
>>> Dimuthu
>>> 
>>>> There are a lot of code
>>>> fragments in the current Android JNI wrapper which use "(jint)somePointer"
>>>> casting, which will create terrible memory leaks in 64-bit environments
>>>> because pointers are 64 bits wide. So I believe writing it from scratch is
>>>> much better.
>>> 
>>> That's a classic 64-bit pitfall, well spotted. We definitely need to
>>> support
>>> 64-bit JVMs.
>>> 
>>>> we can use
>>>> the static library of Leptonica (I did and it worked nicely). I think it is
>>>> not an issue to use its static library because both Tesseract and Leptonica
>>>> are under the Apache licence.
>>> 
>>> Sounds good, I found the following in the README:
>>> 
>>> Leptonica is required. (www.leptonica.com). Tesseract no longer compiles
>>> without Leptonica.
>>> 
>>> Which makes sense.
>>> 
>>> -- John
>>> 
>>> On 5 Mar 2014, at 09:45, DImuthu Upeksha <di...@gmail.com>
>>> wrote:
>>> 
>>>> Hi John,
>>>> +1 for your suggestion about converting image <=> byte array on the Java side.
>>>> It removes a lot of complexity. I don't know whether you have noticed,
>>>> but the jint data type in JNI is a 32-bit integer type. I noticed it on my Mac
>>>> but don't know about other operating systems.
>>>> 
>>>> Leptonica is the image processing library for Tesseract [1]. Tesseract
>>>> uses the image processing algorithms in Leptonica to implement its OCR
>>>> algorithms. This [2] is the .cpp file responsible for creating the Tesseract API.
>>>> You can see it includes the allheaders.h header file, which is the main header
>>>> file of Leptonica. So I think we must build Leptonica first and
>>>> link it when we build Tesseract. This is not a big problem if we can use
>>>> the static library of Leptonica (I did and it worked nicely). I think it is
>>>> not an issue to use its static library because both Tesseract and Leptonica
>>>> are under the Apache licence.
>>>> 
>>>> I'm working on the maven implementation you have mentioned and will get
>>>> back to you soon.
>>>> 
>>>> Thanks
>>>> Dimuthu
>>>> 
>>>> 
>>>> [1] https://code.google.com/p/tesseract-ocr/wiki/Compiling
>>>> [2]
>>>> 
>>> https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/src/api/tesseractmain.cpp
>>>> 
>>>> 
>>>> On Wed, Mar 5, 2014 at 1:15 AM, John Hewson <jo...@jahewson.com> wrote:
>>>> 
>>>>> Hi Dimuthu,
>>>>> 
>>>>> 1,2,3:
>>>>> 
>>>>> Feel free to write your own Tesseract binding or port the existing
>>> code as
>>>>> you see fit.
>>>>> The JNI binding should be minimal, only the methods you require need
>>> to be
>>>>> wrapped.
>>>>> Also, don't forget that some of the interop can be done in Java, for
>>>>> example if it is easier
>>>>> to convert a BufferedImage to a byte array in Java then do it there and
>>>>> pass the result
>>>>> to JNI rather than writing lots of JNI C++ to achieve the same result.
>>>>> 
>>>>> Your GitHub repo looks like a good start, I can make comments there as
>>>>> things progress.
>>>>> 
>>>>> Is it possible to build Tesseract without leptonica? I was under the
>>>>> impression that it was
>>>>> used for image i/o only, but I may be misinformed.
>>>>> 
>>>>> 4:  The native platform library should be built as part of the Maven
>>> build
>>>>> for the Tesseract
>>>>> wrapper which can be a separate project. The output can be a jar file
>>>>> which contains the
>>>>> native binaries. It should be possible for the jar to contain prebuilt
>>>>> binaries for all platforms
>>>>> but this is something we can worry about later. Right now the goal
>>> should
>>>>> be to build a jar
>>>>> containing just the current platform's native binary and any Java
>>> wrapper
>>>>> code.
>>>>> 
>>>>> -- John
>>>>> 
>>>>> On 3 Mar 2014, at 16:41, DImuthu Upeksha <di...@gmail.com>
>>>>> wrote:
>>>>> 
>>>>>> Hi John,
>>>>>> 
>>>>>> I tried to reuse that Android JNI wrapper for Tesseract. Here are my
>>>>>> observations:
>>>>>> 
>>>>>> 1. This wrapper depends heavily on Android image libraries
>>>>>> (android/bitmap.h). Most of the wrapper methods [1] use this library.
>>>>>> 
>>>>>> 2. But I can understand the underlying logic in each function. Basically,
>>>>>> it maps Tesseract API functions [2] to Java methods.
>>>>>> In between, it does some image <=> byte array conversions using
>>>>>> the Android bitmap libraries.
>>>>>> 
>>>>>> 3. There are two ways. 1: We can port its code to make it compatible with our
>>>>>> environments (Linux, Windows and Mac), which is really painful. Also it will
>>>>>> cause memory leaks. 2: We can use only its function signatures and
>>>>>> implement them with our own code.
>>>>>> 
>>>>>> I think the 2nd solution is better because we need only a few operations to be
>>>>>> done using the Tesseract library. I have created a GitHub repo [3] for this.
>>>>>> It's still not finished. I need to add some make files and build files to
>>>>>> make it run properly. I also need to implement those wrapper functions
>>>>>> [4]. This may take some time.
>>>>>> 
>>>>>> 4. Because we are calling native libraries, we need different builds of the
>>>>>> Tesseract and Leptonica libraries for each platform (dll for Windows, so
>>>>>> for Linux, dylib for Mac). So we may need to build those libraries at the
>>>>>> time we build the PDFBox project. Or we can pre-build those libraries and add
>>>>>> them to the project in .dll, .so or .dylib format. Which is the preferred
>>>>>> way?
>>>>>> 
>>>>>> [1]
>>>>>> 
>>>>> 
>>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/tessbaseapi.cpp
>>>>>> [2] https://code.google.com/p/tesseract-ocr/wiki/APIExample
>>>>>> [3] https://github.com/DImuthuUpe/Tesseract-API
>>>>>> [4]
>>>>>> 
>>>>> 
>>> https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/tessbaseapi.cpp
>>>>>> 
>>>>>> Thanks
>>>>>> Dimuthu
>>>>>> 
>>>>>> 
>>>>>> On Sat, Mar 1, 2014 at 11:39 PM, DImuthu Upeksha <
>>>>> dimuthu.upeksha2@gmail.com
>>>>>>> wrote:
>>>>>> 
>>>>>>> I updated necessary changes to the document [1]
>>>>>>> 
>>>>>>> For last two days I had a deep look at this [2] jni wrapper for
>>>>> tessaract
>>>>>>> api.
>>>>>>> Unfortunately this has been designed for Android environment so I
>>> think
>>>>> we
>>>>>>> need to write our own make files to build this in to a dll(windows)
>>> or
>>>>>>> dylib(in mac). Currently it has Android.mk files [3]. I'm searching
>>> for
>>>>> a
>>>>>>> way to convert it to a make file that we can run on console. Please
>>>>> suggest
>>>>>>> if you have a better approach
>>>>>>> 
>>>>>>> [1]
>>>>>>> 
>>>>> 
>>> https://www.dropbox.com/s/9qclvq26divwr2q/Optical%20Character%20Recognition%20for%20PDFBox%20-%20updated.pdf
>>>>>>> [2]
>>>>>>> 
>>>>> 
>>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/
>>>>>>> [3]
>>>>>>> 
>>>>> 
>>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/Android.mk
>>>>>>> 
>>>>>>> 
>>>>>>> On Sat, Mar 1, 2014 at 12:27 AM, John Hewson <jo...@jahewson.com>
>>> wrote:
>>>>>>> 
>>>>>>>> This is a good start. However, there is no need for the Adder
>>>>> component,
>>>>>>>> "Extracted Text (OCR) can just feed back into the PDFBox "Text
>>>>> Extractor".
>>>>>>>> 
>>>>>>>> Maybe show a "PDF" file feeding in to "Text Extractor, to make it
>>> clear
>>>>>>>> where the process starts.
>>>>>>>> 
>>>>>>>> -- John
>>>>>>>> 
>>>>>>>> On 26 Feb 2014, at 16:53, DImuthu Upeksha <
>>> dimuthu.upeksha2@gmail.com>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Sorry for the mistake. I added it to my Dropbox [1].
>>>>>>>>> 
>>>>>>>>> [1]
>>>>>>>>> 
>>>>>>>> 
>>>>> 
>>> https://www.dropbox.com/s/y3m15rfjmw4eqij/Optical%20Character%20Recognition%20for%20PDFBox.pdf
>>>>>>>>> 
>>>>>>>>> Thanks
>>>>>>>>> Dimuthu
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Thu, Feb 27, 2014 at 4:44 AM, John Hewson <jo...@jahewson.com>
>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> I should add that the OCR engine should be pluggable so PDFToText
>>>>> might
>>>>>>>>>> use an interface, e.g. OCREngine and there will be a
>>>>> TesseractOCREngine
>>>>>>>>>> class somewhere which provides the required functionality and
>>> lives
>>>>> in
>>>>>>>> a
>>>>>>>>>> separate jar file.
>>>>>>>>>> 
>>>>>>>>>> -- John
>>>>>>>>>> 
>>>>>>>>>>> On 25 Feb 2014, at 20:18, Dimuthu <di...@gmail.com>
>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> So do you need to embed those new functionalities into existing
>>>>>>>>>> PDFtoText algorithms or package them as a new sub system(something
>>>>>>>> like an
>>>>>>>>>> API)?
>>>>>>>>>>> 
>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>> From: "John Hewson" <jo...@jahewson.com>
>>>>>>>>>>> Sent: 26/02/2014 07:38
>>>>>>>>>>> To: "dev@pdfbox.apache.org" <de...@pdfbox.apache.org>
>>>>>>>>>>> Subject: Re: [GSoC 2014]Optical Character Recognition project -
>>>>>>>>>> Introduction
>>>>>>>>>>> 
>>>>>>>>>>> Yes, exactly. By location data I just mean (x,y) coordinates and
>>>>> page
>>>>>>>>>> rotation.
>>>>>>>>>>> 
>>>>>>>>>>> There is another use case for OCR: some fonts embedded in PDFs
>>> have
>>>>>>>>>> corrupt encodings, which means the ACSII codes map to the wrong
>>>>>>>> glyphs. We
>>>>>>>>>> could OCR the glyphs to repair the encoding.
>>>>>>>>>>> 
>>>>>>>>>>> -- John
>>>>>>>>>>> 
>>>>>>>>>>>> On 25 Feb 2014, at 17:13, DImuthu Upeksha <
>>>>>>>> dimuthu.upeksha2@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Hi John,
>>>>>>>>>>>> Thanks for the explanation.
>>>>>>>>>>>> Let's say there is a pdf with both text in extractable format
>>> and
>>>>>>>> some
>>>>>>>>>>>> images with text(Scanned images). In that case first we extract
>>>>> those
>>>>>>>>>>>> extractable content using PDFBox algorithms and rest is
>>> extracted
>>>>>>>> using
>>>>>>>>>>>> OCR. Finally we pack both results together and give output as
>>>>>>>>>> PDFToText. Am
>>>>>>>>>>>> I correct? What do you mean by "location data"?
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 11:22 PM, John Hewson <
>>> john@jahewson.com>
>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 1. What is called "glyphs" ?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> http://en.wikipedia.org/wiki/Glyph
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 2. What is the main requirement of this project?
>>>>>>>>>>>>>> As far as I understood, first we need to generate an image of
>>>>>>>>>>>>>> malformed pdfs from
>>>>>>>>>>>>>> PDFBox and then we need to do processing using OCR for further
>>>>>>>>>> accurate
>>>>>>>>>>>>>> results.  But the problem is, why shouldn't we directly do
>>> OCR on
>>>>>>>>>> those
>>>>>>>>>>>>>> PDFs without getting output from PDFBox? Correct me if I'm
>>> wrong.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> PDFBox can generate images (PDFToImage) and can extract text
>>>>>>>>>> (PDFToText).
>>>>>>>>>>>>> The goal of
>>>>>>>>>>>>> this project is to enhance PDFToText so that it can use OCR to
>>>>>>>> extract
>>>>>>>>>>>>> text from areas of the
>>>>>>>>>>>>> document where the text is embedded as an image. Such PDF files
>>>>> are
>>>>>>>>>>>>> typically generated by
>>>>>>>>>>>>> scanners or fax machines. There is also another case where OCR
>>> is
>>>>>>>>>> useful:
>>>>>>>>>>>>> some fonts embedded
>>>>>>>>>>>>> in PDF files contain the wrong encoding, so when text is
>>> extracted
>>>>>>>> with
>>>>>>>>>>>>> PDFToText the result is
>>>>>>>>>>>>> nonsense but when drawn with PDFToImage we see the correct
>>>>> letters.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Instead of:
>>>>>>>>>>>>> PDF => Image => OCR => Text
>>>>>>>>>>>>> 
>>>>>>>>>>>>> We want to do:
>>>>>>>>>>>>> PDF => (Many images for words + location data => OCR) => Text
>>>>>>>>>>>>> 
>>>>>>>>>>>>> -- John
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha <
>>>>>>>>>>>>> dimuthu.upeksha2@gmail.com
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Ok fixed. This is what I did
>>>>>>>>>>>>>>> Right click on the new project ->Debug As-> Debug
>>> Configurations
>>>>>>>>>>>>> ->Source
>>>>>>>>>>>>>>> ->Add -> Project
>>>>>>>>>>>>>>> Then I selected PDFBox project.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha <
>>>>>>>>>>>>>>> dimuthu.upeksha2@gmail.com> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I'm using Eclipse. This is what I want. I created a new Java
>>>>>>>>>>>>>>>> application project (say TestPDFBox) with a main class containing
>>>>>>>>>>>>>>>> the following code.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> PDDocument document = new PDDocument();
>>>>>>>>>>>>>>>> PDPage blankPage = new PDPage();
>>>>>>>>>>>>>>>> document.addPage( blankPage );
>>>>>>>>>>>>>>>> document.save("BlankPage.pdf");
>>>>>>>>>>>>>>>> document.close();
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Then I need to add the jar files generated in the target folder of
>>>>>>>>>>>>>>>> PDFBox to the build path of my new project (I did build the PDFBox
>>>>>>>>>>>>>>>> project from source). That is what I did. But let's say I need to
>>>>>>>>>>>>>>>> check the functionality of the document.save("") method. I don't
>>>>>>>>>>>>>>>> have a reference to its sources because I used the generated jars
>>>>>>>>>>>>>>>> directly. As Tilman said, I built PDFBox from sources, but I don't
>>>>>>>>>>>>>>>> know a proper way to use it in other projects other than adding
>>>>>>>>>>>>>>>> those jar files to the build path.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <
>>>>> john@jahewson.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Which IDE are you using? You should be able to run the
>>>>> PDFToText
>>>>>>>>>> class
>>>>>>>>>>>>>>>>> (in pdfbox-tools) using your IDE and pass a PDF file path
>>> as
>>>>> the
>>>>>>>>>>>>> command
>>>>>>>>>>>>>>>>> line argument.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On 24 Feb 2014, at 22:38, DImuthu Upeksha <
>>>>>>>>>>>>> dimuthu.upeksha2@gmail.com>
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>>>>> Thanks for the reply. Yes, I checked out the PDFBox code and
>>>>>>>>>>>>>>>>>> managed to build it successfully. I looked at the classes you
>>>>>>>>>>>>>>>>>> mentioned and I got a rough idea about how they work. To check
>>>>>>>>>>>>>>>>>> them I added the jars in the target folder to my separate Java
>>>>>>>>>>>>>>>>>> project. I tried the samples in
>>>>>>>>>>>>>>>>>> http://pdfbox.apache.org/cookbook/. I need to look further
>>>>>>>>>>>>>>>>>> into the code, especially how the processXXX() methods work in
>>>>>>>>>>>>>>>>>> the PDFTextStripper class. What I usually do is add some
>>>>>>>>>>>>>>>>>> breakpoints and check them in the debug windows. But with jars
>>>>>>>>>>>>>>>>>> that is not possible. What approach do you follow for such a
>>>>>>>>>>>>>>>>>> task?
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> I also installed Tesseract on my machine and managed to do some
>>>>>>>>>>>>>>>>>> OCR with it. That's a cool tool which works fine.
>>>>>>>>>>>>>>>>>> I'm still learning the code. If I run into any issue I'll drop
>>>>>>>>>>>>>>>>>> you a mail.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <
>>>>>>>> john@jahewson.com
>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Hi Dimuthu
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> The PDFBox website can be found at http://pdfbox.apache.org/ and it
>>>>>>>>>>>>>>>>>>> contains a basic overview of the project and details on how to
>>>>>>>>>>>>>>>>>>> obtain the source code and build PDFBox for yourself.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Currently we do not perform any OCR and PDFBOX-1912
>>> details
>>>>>>>> the
>>>>>>>>>> only
>>>>>>>>>>>>>>>>>>> thoughts so far regarding it.
>>>>>>>>>>>>>>>>>>> Note that the OCR libraries mentioned in the JIRA issue
>>> are
>>>>>>>> all
>>>>>>>>>>>>> under
>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> Apache license, which is a
>>>>>>>>>>>>>>>>>>> requirement.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Once you have the source code, take a look at the
>>> PageDrawer
>>>>>>>>>> class
>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>>> see
>>>>>>>>>>>>>>>>>>> how text and images are
>>>>>>>>>>>>>>>>>>> rendered. We want someone to interface at a low-level
>>> (e.g.
>>>>>>>> one
>>>>>>>>>>>>> glyph,
>>>>>>>>>>>>>>>>>>> word, or sentence at a time) with
>>>>>>>>>>>>>>>>>>> an OCR engine. Also look at PDFTextStripper which is how
>>>>> text
>>>>>>>> is
>>>>>>>>>>>>>>>>> currently
>>>>>>>>>>>>>>>>>>> extracted, take a look at how
>>>>>>>>>>>>>>>>>>> we have to go to great length to sort text back into
>>> reading
>>>>>>>>>> order
>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>>>>> infer the placement of diacritics - PDF
>>>>>>>>>>>>>>>>>>> is fundamentally a visual format, not a structured format
>>>>> like
>>>>>>>>>> HTML
>>>>>>>>>>>>> -
>>>>>>>>>>>>>>>>>>> which is why extracting text can be so
>>>>>>>>>>>>>>>>>>> difficult sometimes.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> The full PDF Reference document can be found at:
>>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>> 
>>> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Feel free to discuss specifics of your proposal or ask
>>> any
>>>>>>>>>>>>> questions.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On 23 Feb 2014, at 21:13, DImuthu Upeksha <
>>>>>>>>>>>>> dimuthu.upeksha2@gmail.com
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>> I am Dimuthu Upeksha, a Computer Engineering
>>> Undergraduate
>>>>> at
>>>>>>>>>>>>>>>>> University
>>>>>>>>>>>>>>>>>>> of Moratuwa Sri Lanka. I successfully completed my GSoC
>>> 2013
>>>>>>>> with
>>>>>>>>>>>>>>>>> Apache
>>>>>>>>>>>>>>>>>>> ISIS [1] project. I'm very much interested in OCR and
>>> image
>>>>>>>>>>>>> processing
>>>>>>>>>>>>>>>>>>> stuff. So I would like to select this project idea as my
>>>>> GSoC
>>>>>>>>>> 2014
>>>>>>>>>>>>>>>>> project
>>>>>>>>>>>>>>>>>>> because I feel like it is the best suited project for
>>> me. In
>>>>>>>>>>>>>>>>> university
>>>>>>>>>>>>>>>>>>> also we have done some research in OCR area and our group
>>>>>>>> wrote a
>>>>>>>>>>>>>>>>>>> literature review about increasing efficiency of OCR
>>>>>>>>>>>>>>>>> systems(attached). Can
>>>>>>>>>>>>>>>>>>> you please suggest me where to start learning about
>>> PDFBox?
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>> 
>>> http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Thank you
>>>>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>>>>>>>>>>> Undergraduate
>>>>>>>>>>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka


Re: [GSoC 2014]Optical Character Recognition project - Introduction

Posted by DImuthu Upeksha <di...@gmail.com>.
Hi John,
I finished a basic implementation of the JNI wrapper for Tesseract. It can now
be built using Maven. The methods needed for basic OCR are implemented.

I went through the PDFBox code several times and have a couple of questions
that need to be clarified:

1. What is the task of the processStream method in the PDFTextStripper class
(line 456)? The call is: processStream( page.findResources(), content,
page.findCropBox(), page.findRotation() );

2. Say I need to extract images and their metadata from a PDF. What is the
best approach?

Thanks
Dimuthu


On Fri, Mar 7, 2014 at 9:26 PM, DImuthu Upeksha
<di...@gmail.com> wrote:

> Hi John
> I refactored the Tesseract JNI code [1] to support a Maven build. To create
> the JNI library I added pre-built static libraries of Tesseract and Leptonica
> to the resources folder [2]. For now it includes only the libraries for Mac,
> but we can easily add Windows and Linux libraries. After "mvn clean
> install", the jar is created under the target folder. The setup is now
> complete. What remains is implementing the native methods in tessbaseapi.cpp
> [3]. I hope to finish it soon. Please let me know if you have any concerns
> about the project structure.
>
> [1] https://github.com/DImuthuUpe/Tesseract-API.git
> [2]
> https://github.com/DImuthuUpe/Tesseract-API/tree/master/src/main/resources
> [3]
> https://github.com/DImuthuUpe/Tesseract-API/blob/master/src/main/native/src/tessbaseapi.cpp
>
> Thanks
> Dimuthu
>
>
> On Thu, Mar 6, 2014 at 1:15 AM, John Hewson <jo...@jahewson.com> wrote:
>
>> Dimuthu
>>
>> > There are a lot of code
>> > fragments in the current Android JNI wrapper that use "(jint)somePointer"
>> > casting, which will create terrible memory leaks in 64-bit environments
>> > because pointers are 64-bit. So I believe writing it from the beginning
>> > is much better.
>>
>> That's a classic 64-bit pitfall, well spotted. We definitely need to
>> support
>> 64-bit JVMs.
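
To illustrate the pitfall above: a native pointer on a 64-bit platform does not fit into a 32-bit jint (a Java int), so the upper bits are lost when it is cast. The following is only a Java-side simulation, and the address value and the helper names are made up for illustration. A common remedy is to hold native handles in a Java long (jlong on the JNI side).

```java
// Illustration only: simulates what happens when a 64-bit native pointer
// is squeezed through a 32-bit int (jint) and widened back again.
final class NativeHandleDemo {

    // Hypothetical helper: mimics a "(jint) somePointer" cast followed by
    // converting the int back to a pointer-sized value.
    static long roundTripThroughInt(long pointer) {
        int truncated = (int) pointer; // the upper 32 bits are discarded here
        return truncated;              // widening cannot bring them back
    }

    // Holding the handle in a long (jlong in JNI) keeps all 64 bits.
    static long roundTripThroughLong(long pointer) {
        return pointer;
    }

    public static void main(String[] args) {
        long pointer = 0x00007F8A12345678L; // made-up 64-bit address
        System.out.println(Long.toHexString(roundTripThroughInt(pointer)));
        System.out.println(Long.toHexString(roundTripThroughLong(pointer)));
    }
}
```
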
>>
>> > we can use
>> > the static library of Leptonica (I did and it worked nicely). I think
>> > it is not an issue to use its static library because both Tesseract and
>> > Leptonica are under the Apache licence.
>>
>> Sounds good, I found the following in the README:
>>
>> Leptonica is required. (www.leptonica.com). Tesseract no longer compiles
>> without Leptonica.
>>
>> Which makes sense.
>>
>> -- John
>>
>> On 5 Mar 2014, at 09:45, DImuthu Upeksha <di...@gmail.com>
>> wrote:
>>
>> > Hi John,
>> > +1 for your suggestion about converting image <=> byte array on the Java
>> > side. It removes a lot of complexity. I don't know whether you have noticed
>> > or not, but the jint data type in JNI is a 32-bit integer type. I noticed it
>> > on my Mac but don't know about other operating systems.
>> >
>> > Leptonica is the image processing library for Tesseract [1]. Tesseract
>> > uses the image processing algorithms in Leptonica to implement its OCR
>> > algorithms. This [2] is the .cpp file responsible for creating the Tesseract
>> > API. You can see it includes the allheaders.h header file, which is the main
>> > header file of Leptonica. So I think it is a must to build Leptonica first
>> > and link it when we build Tesseract. This is not a big problem if we can use
>> > the static library of Leptonica (I did and it worked nicely). I think
>> > it is not an issue to use its static library because both Tesseract and
>> > Leptonica are under the Apache licence.
>> >
>> > I'm working on the maven implementation you have mentioned and will get
>> > back to you soon.
>> >
>> > Thanks
>> > Dimuthu
>> >
>> >
>> > [1] https://code.google.com/p/tesseract-ocr/wiki/Compiling
>> > [2]
>> >
>> https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/src/api/tesseractmain.cpp
>> >
>> >
>> > On Wed, Mar 5, 2014 at 1:15 AM, John Hewson <jo...@jahewson.com> wrote:
>> >
>> >> Hi Dimuthu,
>> >>
>> >> 1,2,3:
>> >>
>> >> Feel free to write your own Tesseract binding or port the existing
>> code as
>> >> you see fit.
>> >> The JNI binding should be minimal, only the methods you require need
>> to be
>> >> wrapped.
>> >> Also, don't forget that some of the interop can be done in Java, for
>> >> example if it is easier
>> >> to convert a BufferedImage to a byte array in Java then do it there and
>> >> pass the result
>> >> to JNI rather than writing lots of JNI C++ to achieve the same result.
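
As a rough sketch of the suggestion above, the image can be flattened to a plain byte array on the Java side, so the native code only ever sees raw bytes plus a width and a height. The class name ImageBytes, the method toGrayBytes, the 8-bit grayscale layout and the luminance weights are assumptions for illustration, not something taken from the wrapper code.

```java
import java.awt.image.BufferedImage;

// Sketch: turn a BufferedImage into one grayscale byte per pixel, row by row,
// so that a JNI method could receive (byte[] pixels, int width, int height).
final class ImageBytes {

    static byte[] toGrayBytes(BufferedImage image) {
        int width = image.getWidth();
        int height = image.getHeight();
        byte[] pixels = new byte[width * height];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = image.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Integer approximation of a common luminance weighting.
                int gray = (r * 299 + g * 587 + b * 114) / 1000;
                pixels[y * width + x] = (byte) gray;
            }
        }
        return pixels;
    }

    public static void main(String[] args) {
        BufferedImage image = new BufferedImage(2, 1, BufferedImage.TYPE_INT_RGB);
        image.setRGB(0, 0, 0xFFFFFF); // one white pixel, one black pixel
        byte[] pixels = toGrayBytes(image);
        System.out.println(pixels.length + " bytes, first = " + (pixels[0] & 0xFF));
    }
}
```

With this split, the JNI layer needs no image library of its own; it only forwards the array and the dimensions to the OCR engine.
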
>> >>
>> >> Your GitHub repo looks like a good start, I can make comments there as
>> >> things progress.
>> >>
>> >> Is it possible to build Tesseract without leptonica? I was under the
>> >> impression that it was
>> >> used for image i/o only, but I may be misinformed.
>> >>
>> >> 4:  The native platform library should be built as part of the Maven
>> build
>> >> for the Tesseract
>> >> wrapper which can be a separate project. The output can be a jar file
>> >> which contains the
>> >> native binaries. It should be possible for the jar to contain prebuilt
>> >> binaries for all platforms
>> >> but this is something we can worry about later. Right now the goal
>> should
>> >> be to build a jar
>> >> containing just the current platform's native binary and any Java
>> wrapper
>> >> code.
>> >>
>> >> -- John
>> >>
>> >> On 3 Mar 2014, at 16:41, DImuthu Upeksha <di...@gmail.com>
>> >> wrote:
>> >>
>> >>> Hi John,
>> >>>
>> >>> I tried to reuse that Android JNI wrapper for Tesseract. Here are my
>> >>> observations:
>> >>>
>> >>> 1. The wrapper depends heavily on Android image libraries
>> >>> (android/bitmap.h). Most of the wrapper methods [1] use this library.
>> >>>
>> >>> 2. But I can understand the underlying logic in each function. Basically,
>> >>> it maps the Tesseract API functions [2] to Java methods. In between it
>> >>> does some image <=> byte array conversions using the bitmap libraries
>> >>> in Android.
>> >>>
>> >>> 3. There are two ways. 1: We can port its code to make it compatible with
>> >>> our environments (Linux, Windows and Mac), which is really painful. It
>> >>> will also cause memory leaks. 2: We can use only its function signatures
>> >>> and implement them using our own code.
>> >>>
>> >>> I think the 2nd solution is better because we need only a few operations
>> >>> from the Tesseract library. I have created a GitHub repo [3] for this.
>> >>> It's still not finished. I need to add some make files and build files to
>> >>> make it run properly. I also need to implement the wrapper functions
>> >>> [4]. This may take some time.
>> >>>
>> >>> 4. Because we are calling native libraries, we need different builds of
>> >>> the Tesseract and Leptonica libraries for each platform (dll for Windows,
>> >>> so for Linux, dylib for Mac). So we may need to build those libraries at
>> >>> the time we build the PDFBox project. Or we can pre-build those libraries
>> >>> and add them to the project in .dll, .so or .dylib format. Which is the
>> >>> preferred way?
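
Whichever way the binaries are produced, the Java side has to pick the right file for the running platform. A minimal sketch of that choice follows. The class NativeLibraryNames, the helper libraryFileName and the base name "tessjni" are hypothetical, and the naming rules shown are only the usual conventions for the three platforms mentioned in point 4.

```java
import java.util.Locale;

// Sketch: map an operating system name to the conventional file name of a
// native library, as it might be stored inside a jar.
final class NativeLibraryNames {

    static String libraryFileName(String osName, String baseName) {
        String os = osName.toLowerCase(Locale.ROOT);
        // Check for Mac first: the string "darwin" also contains "win".
        if (os.contains("mac") || os.contains("darwin")) {
            return "lib" + baseName + ".dylib";
        }
        if (os.contains("win")) {
            return baseName + ".dll";
        }
        return "lib" + baseName + ".so"; // assume a Linux-style platform otherwise
    }

    public static void main(String[] args) {
        String osName = System.getProperty("os.name");
        System.out.println(libraryFileName(osName, "tessjni"));
    }
}
```
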
>> >>>
>> >>> [1]
>> >>>
>> >>
>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/tessbaseapi.cpp
>> >>> [2] https://code.google.com/p/tesseract-ocr/wiki/APIExample
>> >>> [3] https://github.com/DImuthuUpe/Tesseract-API
>> >>> [4]
>> >>>
>> >>
>> https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/tessbaseapi.cpp
>> >>>
>> >>> Thanks
>> >>> Dimuthu
>> >>>
>> >>>
>> >>> On Sat, Mar 1, 2014 at 11:39 PM, DImuthu Upeksha <
>> >> dimuthu.upeksha2@gmail.com
>> >>>> wrote:
>> >>>
>> >>>> I made the necessary changes to the document [1].
>> >>>>
>> >>>> For the last two days I had a deep look at this [2] JNI wrapper for the
>> >>>> Tesseract API.
>> >>>> Unfortunately it has been designed for the Android environment, so I
>> >>>> think we need to write our own make files to build it into a dll (Windows)
>> >>>> or dylib (Mac). Currently it has Android.mk files [3]. I'm searching
>> >>>> for a way to convert them to a make file that we can run from the console.
>> >>>> Please suggest a better approach if you have one.
>> >>>>
>> >>>> [1]
>> >>>>
>> >>
>> https://www.dropbox.com/s/9qclvq26divwr2q/Optical%20Character%20Recognition%20for%20PDFBox%20-%20updated.pdf
>> >>>> [2]
>> >>>>
>> >>
>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/
>> >>>> [3]
>> >>>>
>> >>
>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/Android.mk
>> >>>>
>> >>>>
>> >>>> On Sat, Mar 1, 2014 at 12:27 AM, John Hewson <jo...@jahewson.com>
>> wrote:
>> >>>>
>> >>>>> This is a good start. However, there is no need for the Adder
>> >> component,
>> >>>>> "Extracted Text (OCR) can just feed back into the PDFBox "Text
>> >> Extractor".
>> >>>>>
>> >>>>> Maybe show a "PDF" file feeding in to "Text Extractor, to make it
>> clear
>> >>>>> where the process starts.
>> >>>>>
>> >>>>> -- John
>> >>>>>
>> >>>>> On 26 Feb 2014, at 16:53, DImuthu Upeksha <
>> dimuthu.upeksha2@gmail.com>
>> >>>>> wrote:
>> >>>>>
>> >>>>>> Sorry for the mistake. I added it to my Dropbox [1].
>> >>>>>>
>> >>>>>> [1]
>> >>>>>>
>> >>>>>
>> >>
>> https://www.dropbox.com/s/y3m15rfjmw4eqij/Optical%20Character%20Recognition%20for%20PDFBox.pdf
>> >>>>>>
>> >>>>>> Thanks
>> >>>>>> Dimuthu
>> >>>>>>
>> >>>>>>
>> >>>>>> On Thu, Feb 27, 2014 at 4:44 AM, John Hewson <jo...@jahewson.com>
>> >> wrote:
>> >>>>>>
>> >>>>>>> I should add that the OCR engine should be pluggable so PDFToText
>> >> might
>> >>>>>>> use an interface, e.g. OCREngine and there will be a
>> >> TesseractOCREngine
>> >>>>>>> class somewhere which provides the required functionality and
>> lives
>> >> in
>> >>>>> a
>> >>>>>>> separate jar file.
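
One possible shape for such a pluggable engine is sketched below. Only the names OCREngine and TesseractOCREngine come from the thread. The method recognize, the stand-in FixedTextOCREngine and the caller OcrTextExtractor are assumptions for illustration, and no real Tesseract call is made.

```java
import java.awt.image.BufferedImage;

// Hypothetical interface: the text extractor depends only on this type, so a
// Tesseract-backed implementation can live in a separate jar.
interface OCREngine {
    String recognize(BufferedImage image);
}

// Stand-in engine that returns a fixed string; a real TesseractOCREngine
// would call into the native wrapper here instead.
final class FixedTextOCREngine implements OCREngine {
    private final String text;

    FixedTextOCREngine(String text) {
        this.text = text;
    }

    @Override
    public String recognize(BufferedImage image) {
        return text;
    }
}

// Caller that knows nothing about which engine it was given.
final class OcrTextExtractor {
    private final OCREngine engine;

    OcrTextExtractor(OCREngine engine) {
        this.engine = engine;
    }

    String extract(BufferedImage region) {
        return engine.recognize(region);
    }

    public static void main(String[] args) {
        OcrTextExtractor extractor = new OcrTextExtractor(new FixedTextOCREngine("hello"));
        BufferedImage region = new BufferedImage(1, 1, BufferedImage.TYPE_BYTE_GRAY);
        System.out.println(extractor.extract(region));
    }
}
```
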
>> >>>>>>>
>> >>>>>>> -- John
>> >>>>>>>
>> >>>>>>>> On 25 Feb 2014, at 20:18, Dimuthu <di...@gmail.com>
>> >> wrote:
>> >>>>>>>>
>> >>>>>>>> So do you need to embed those new functionalities into existing
>> >>>>>>> PDFtoText algorithms or package them as a new sub system(something
>> >>>>> like an
>> >>>>>>> API)?
>> >>>>>>>>
>> >>>>>>>> -----Original Message-----
>> >>>>>>>> From: "John Hewson" <jo...@jahewson.com>
>> >>>>>>>> Sent: 26/02/2014 07:38
>> >>>>>>>> To: "dev@pdfbox.apache.org" <de...@pdfbox.apache.org>
>> >>>>>>>> Subject: Re: [GSoC 2014]Optical Character Recognition project -
>> >>>>>>> Introduction
>> >>>>>>>>
>> >>>>>>>> Yes, exactly. By location data I just mean (x,y) coordinates and
>> >> page
>> >>>>>>> rotation.
>> >>>>>>>>
>> >>>>>>>> There is another use case for OCR: some fonts embedded in PDFs have
>> >>>>>>>> corrupt encodings, which means the ASCII codes map to the wrong
>> >>>>>>>> glyphs. We could OCR the glyphs to repair the encoding.
>> >>>>>>>>
>> >>>>>>>> -- John
>> >>>>>>>>
>> >>>>>>>>> On 25 Feb 2014, at 17:13, DImuthu Upeksha <
>> >>>>> dimuthu.upeksha2@gmail.com>
>> >>>>>>> wrote:
>> >>>>>>>>>
>> >>>>>>>>> Hi John,
>> >>>>>>>>> Thanks for the explanation.
>> >>>>>>>>> Let's say there is a pdf with both text in extractable format
>> and
>> >>>>> some
>> >>>>>>>>> images with text(Scanned images). In that case first we extract
>> >> those
>> >>>>>>>>> extractable content using PDFBox algorithms and rest is
>> extracted
>> >>>>> using
>> >>>>>>>>> OCR. Finally we pack both results together and give output as
>> >>>>>>> PDFToText. Am
>> >>>>>>>>> I correct? What do you mean by "location data"?
>> >>>>>>>>>
>> >>>>>>>>> Thanks
>> >>>>>>>>> Dimuthu
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>> On Tue, Feb 25, 2014 at 11:22 PM, John Hewson <
>> john@jahewson.com>
>> >>>>>>> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>> 1. What is called "glyphs" ?
>> >>>>>>>>>>
>> >>>>>>>>>> http://en.wikipedia.org/wiki/Glyph
>> >>>>>>>>>>
>> >>>>>>>>>>> 2. What is the main requirement of this project?
>> >>>>>>>>>>> As far as I understood, first we need to generate an image of
>> >>>>>>>>>>> malformed pdfs from
>> >>>>>>>>>>> PDFBox and then we need to do processing using OCR for further
>> >>>>>>> accurate
>> >>>>>>>>>>> results.  But the problem is, why shouldn't we directly do
>> OCR on
>> >>>>>>> those
>> >>>>>>>>>>> PDFs without getting output from PDFBox? Correct me if I'm
>> wrong.
>> >>>>>>>>>>
>> >>>>>>>>>> PDFBox can generate images (PDFToImage) and can extract text
>> >>>>>>> (PDFToText).
>> >>>>>>>>>> The goal of
>> >>>>>>>>>> this project is to enhance PDFToText so that it can use OCR to
>> >>>>> extract
>> >>>>>>>>>> text from areas of the
>> >>>>>>>>>> document where the text is embedded as an image. Such PDF files
>> >> are
>> >>>>>>>>>> typically generated by
>> >>>>>>>>>> scanners or fax machines. There is also another case where OCR
>> is
>> >>>>>>> useful:
>> >>>>>>>>>> some fonts embedded
>> >>>>>>>>>> in PDF files contain the wrong encoding, so when text is
>> extracted
>> >>>>> with
>> >>>>>>>>>> PDFToText the result is
>> >>>>>>>>>> nonsense but when drawn with PDFToImage we see the correct
>> >> letters.
>> >>>>>>>>>>
>> >>>>>>>>>> Instead of:
>> >>>>>>>>>> PDF => Image => OCR => Text
>> >>>>>>>>>>
>> >>>>>>>>>> We want to do:
>> >>>>>>>>>> PDF => (Many images for words + location data => OCR) => Text
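
The second pipeline can be pictured as collecting positioned pieces of text from both sources and ordering them by location. The sketch below is a simplification under stated assumptions: the classes PositionedText and ReadingOrderMerge are hypothetical, page rotation is ignored, and the origin is assumed to be at the bottom-left of the page, so a larger y means higher up. It is not how PDFTextStripper actually sorts text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// A piece of text together with its location data (x, y) on the page.
final class PositionedText {
    final float x;
    final float y;
    final String text;

    PositionedText(float x, float y, String text) {
        this.x = x;
        this.y = y;
        this.text = text;
    }
}

final class ReadingOrderMerge {

    // Combine normally extracted text with OCR results, then order the pieces
    // top to bottom and, within a row, left to right.
    static String merge(List<PositionedText> extracted, List<PositionedText> recognized) {
        List<PositionedText> all = new ArrayList<PositionedText>(extracted);
        all.addAll(recognized);
        all.sort((a, b) -> {
            int byRow = Float.compare(b.y, a.y); // larger y first (higher on the page)
            return byRow != 0 ? byRow : Float.compare(a.x, b.x);
        });
        StringBuilder out = new StringBuilder();
        for (PositionedText piece : all) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(piece.text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<PositionedText> extracted = Arrays.asList(
                new PositionedText(60f, 700f, "world"),
                new PositionedText(10f, 700f, "Hello"));
        List<PositionedText> recognized = Arrays.asList(
                new PositionedText(10f, 650f, "scanned"));
        System.out.println(merge(extracted, recognized));
    }
}
```
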
>> >>>>>>>>>>
>> >>>>>>>>>> -- John
>> >>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha <
>> >>>>>>>>>> dimuthu.upeksha2@gmail.com
>> >>>>>>>>>>>> wrote:
>> >>>>>>>>>>>
>> >>>>>>>>>>>> Ok fixed. This is what I did
>> >>>>>>>>>>>> Right click on the new project ->Debug As-> Debug
>> Configurations
>> >>>>>>>>>> ->Source
>> >>>>>>>>>>>> ->Add -> Project
>> >>>>>>>>>>>> Then I selected PDFBox project.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Thanks
>> >>>>>>>>>>>> Dimuthu
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha <
>> >>>>>>>>>>>> dimuthu.upeksha2@gmail.com> wrote:
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>> I'm using Eclipse. This is what I want. I created a new Java
>> >>>>>>>>>>>>> application project (say TestPDFBox) with a main class containing
>> >>>>>>>>>>>>> the following code.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> PDDocument document = new PDDocument();
>> >>>>>>>>>>>>> PDPage blankPage = new PDPage();
>> >>>>>>>>>>>>> document.addPage( blankPage );
>> >>>>>>>>>>>>> document.save("BlankPage.pdf");
>> >>>>>>>>>>>>> document.close();
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> Then I need to add the jar files generated in the target folder of
>> >>>>>>>>>>>>> PDFBox to the build path of my new project (I did build the PDFBox
>> >>>>>>>>>>>>> project from source). That is what I did. But let's say I need to
>> >>>>>>>>>>>>> check the functionality of the document.save("") method. I don't
>> >>>>>>>>>>>>> have a reference to its sources because I used the generated jars
>> >>>>>>>>>>>>> directly. As Tilman said, I built PDFBox from sources, but I don't
>> >>>>>>>>>>>>> know a proper way to use it in other projects other than adding
>> >>>>>>>>>>>>> those jar files to the build path.
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <
>> >> john@jahewson.com>
>> >>>>>>>>>> wrote:
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Which IDE are you using? You should be able to run the
>> >> PDFToText
>> >>>>>>> class
>> >>>>>>>>>>>>>> (in pdfbox-tools) using your IDE and pass a PDF file path
>> as
>> >> the
>> >>>>>>>>>> command
>> >>>>>>>>>>>>>> line argument.
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> -- John
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> On 24 Feb 2014, at 22:38, DImuthu Upeksha <
>> >>>>>>>>>> dimuthu.upeksha2@gmail.com>
>> >>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Hi John,
>> >>>>>>>>>>>>>>> Thanks for the reply. Yes, I checked out the PDFBox code and
>> >>>>>>>>>>>>>>> managed to build it successfully. I looked at the classes you
>> >>>>>>>>>>>>>>> mentioned and I got a rough idea about how they work. To check
>> >>>>>>>>>>>>>>> them I added the jars in the target folder to my separate Java
>> >>>>>>>>>>>>>>> project. I tried the samples in
>> >>>>>>>>>>>>>>> http://pdfbox.apache.org/cookbook/. I need to look further
>> >>>>>>>>>>>>>>> into the code, especially how the processXXX() methods work in
>> >>>>>>>>>>>>>>> the PDFTextStripper class. What I usually do is add some
>> >>>>>>>>>>>>>>> breakpoints and check them in the debug windows. But with jars
>> >>>>>>>>>>>>>>> that is not possible. What approach do you follow for such a
>> >>>>>>>>>>>>>>> task?
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> I also installed Tesseract on my machine and managed to do some
>> >>>>>>>>>>>>>>> OCR with it. That's a cool tool which works fine.
>> >>>>>>>>>>>>>>> I'm still learning the code. If I run into any issue I'll drop
>> >>>>>>>>>>>>>>> you a mail.
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>> Thanks
>> >>>>>>>>>>>>>>> Dimuthu
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <
>> >>>>> john@jahewson.com
>> >>>>>>>>
>> >>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Hi Dimuthu
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> The PDFBox website can be found at http://pdfbox.apache.org/ and it
>> >>>>>>>>>>>>>>>> contains a basic overview of the project and details on how to
>> >>>>>>>>>>>>>>>> obtain the source code and build PDFBox for yourself.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Currently we do not perform any OCR and PDFBOX-1912
>> details
>> >>>>> the
>> >>>>>>> only
>> >>>>>>>>>>>>>>>> thoughts so far regarding it.
>> >>>>>>>>>>>>>>>> Note that the OCR libraries mentioned in the JIRA issue
>> are
>> >>>>> all
>> >>>>>>>>>> under
>> >>>>>>>>>>>>>> the
>> >>>>>>>>>>>>>>>> Apache license, which is a
>> >>>>>>>>>>>>>>>> requirement.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Once you have the source code, take a look at the
>> PageDrawer
>> >>>>>>> class
>> >>>>>>>>>> to
>> >>>>>>>>>>>>>> see
>> >>>>>>>>>>>>>>>> how text and images are
>> >>>>>>>>>>>>>>>> rendered. We want someone to interface at a low-level
>> (e.g.
>> >>>>> one
>> >>>>>>>>>> glyph,
>> >>>>>>>>>>>>>>>> word, or sentence at a time) with
>> >>>>>>>>>>>>>>>> an OCR engine. Also look at PDFTextStripper which is how
>> >> text
>> >>>>> is
>> >>>>>>>>>>>>>> currently
>> >>>>>>>>>>>>>>>> extracted, take a look at how
>> >>>>>>>>>>>>>>>> we have to go to great length to sort text back into
>> reading
>> >>>>>>> order
>> >>>>>>>>>> and
>> >>>>>>>>>>>>>>>> infer the placement of diacritics - PDF
>> >>>>>>>>>>>>>>>> is fundamentally a visual format, not a structured format
>> >> like
>> >>>>>>> HTML
>> >>>>>>>>>> -
>> >>>>>>>>>>>>>>>> which is why extracting text can be so
>> >>>>>>>>>>>>>>>> difficult sometimes.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> The full PDF Reference document can be found at:
>> >>>>>>>>>>
>> >>>>>>>
>> >>>>>
>> >>
>> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Feel free to discuss specifics of your proposal or ask
>> any
>> >>>>>>>>>> questions.
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> Thanks,
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> -- John
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> On 23 Feb 2014, at 21:13, DImuthu Upeksha <
>> >>>>>>>>>> dimuthu.upeksha2@gmail.com
>> >>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>> wrote:
>> >>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> Hi,
>> >>>>>>>>>>>>>>>>> I am Dimuthu Upeksha, a Computer Engineering
>> Undergraduate
>> >> at
>> >>>>>>>>>>>>>> University
>> >>>>>>>>>>>>>>>> of Moratuwa Sri Lanka. I successfully completed my GSoC
>> 2013
>> >>>>> with
>> >>>>>>>>>>>>>> Apache
>> >>>>>>>>>>>>>>>> ISIS [1] project. I'm very much interested in OCR and
>> image
>> >>>>>>>>>> processing
>> >>>>>>>>>>>>>>>> stuff. So I would like to select this project idea as my
>> >> GSoC
>> >>>>>>> 2014
>> >>>>>>>>>>>>>> project
>> >>>>>>>>>>>>>>>> because I feel like it is the best suited project for
>> me. In
>> >>>>>>>>>>>>>> university
>> >>>>>>>>>>>>>>>> also we have done some research in OCR area and our group
>> >>>>> wrote a
>> >>>>>>>>>>>>>>>> literature review about increasing efficiency of OCR
>> >>>>>>>>>>>>>> systems(attached). Can
>> >>>>>>>>>>>>>>>> you please suggest me where to start learning about
>> PDFBox?
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> [1]
>> >>>>>>>>>>
>> >>>>>>>
>> >>>>>
>> >>
>> http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> Thank you
>> >>>>>>>>>>>>>>>>> Dimuthu
>> >>>>>>>>>>>>>>>>>
>> >>>>>>>>>>>>>>>>> --
>> >>>>>>>>>>>>>>>>> Regards
>> >>>>>>>>>>>>>>>>> W.Dimuthu Upeksha
>> >>>>>>>>>>>>>>>>> Undergraduate
>> >>>>>>>>>>>>>>>>> Department of Computer Science And Engineering
>> >>>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka



-- 
Regards

W.Dimuthu Upeksha
Undergraduate
Department of Computer Science And Engineering

University of Moratuwa, Sri Lanka

Re: [GSoC 2014]Optical Character Recognition project - Introduction

Posted by DImuthu Upeksha <di...@gmail.com>.
Hi John
I refactored the Tesseract JNI code [1] to support a Maven build. To create the
JNI library I added pre-built static libraries of Tesseract and Leptonica to the
resources folder [2]. For now it includes only the libraries for Mac, but
we can easily add Windows and Linux libraries as well. After "mvn clean
install", the jar is created under the target folder. All of the setup is now
done. What remains is implementing the native methods in tessbaseapi.cpp
[3]. I hope to finish that as soon as possible. Please let me know if you have
any concerns about the project structure.
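
For illustration, one common way to use native binaries packed inside a jar is
to copy the file for the current platform to a temporary location and load it
with System.load. The sketch below assumes that approach. The class name
NativeLibraryLoader, its method names, the base name "tessjni" and the resource
location at the jar root are all hypothetical and are not taken from the
Tesseract-API repo.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Locale;

final class NativeLibraryLoader {
    private NativeLibraryLoader() {}

    /** Maps an os.name value to the shared-library file name for that platform. */
    static String libraryFileName(String osName, String baseName) {
        String os = osName.toLowerCase(Locale.ROOT);
        if (os.contains("mac")) {
            return "lib" + baseName + ".dylib";
        }
        if (os.contains("win")) {
            return baseName + ".dll";
        }
        // Treat everything else as Linux or another Unix-like system.
        return "lib" + baseName + ".so";
    }

    /** Copies the bundled library to a temporary file and loads it. */
    static void loadFromJar(String baseName) throws IOException {
        String fileName = libraryFileName(System.getProperty("os.name"), baseName);
        // Assumption: the library sits at the root of the jar's resources.
        try (InputStream in = NativeLibraryLoader.class.getResourceAsStream("/" + fileName)) {
            if (in == null) {
                throw new IOException("No bundled native library: " + fileName);
            }
            Path tmp = Files.createTempFile("native-", "-" + fileName);
            tmp.toFile().deleteOnExit();
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            System.load(tmp.toAbsolutePath().toString());
        }
    }
}
```

A wrapper class could call NativeLibraryLoader.loadFromJar("tessjni") once from
a static initializer before any native method is used.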

[1] https://github.com/DImuthuUpe/Tesseract-API.git
[2]
https://github.com/DImuthuUpe/Tesseract-API/tree/master/src/main/resources
[3]
https://github.com/DImuthuUpe/Tesseract-API/blob/master/src/main/native/src/tessbaseapi.cpp

Thanks
Dimuthu


On Thu, Mar 6, 2014 at 1:15 AM, John Hewson <jo...@jahewson.com> wrote:

> Dimuthu
>
> > There are a lot of code
> > fragments in the current Android JNI wrapper which use "(jint)somePointer"
> > casting, which will truncate pointers and cause memory errors in 64-bit
> > environments because pointers are 64 bits wide. So I believe writing it from
> > scratch is much better.
>
> That's a classic 64-bit pitfall, well spotted. We definitely need to
> support
> 64-bit JVMs.
>
> > we can use
> > the static library of Leptonica (I did and it worked nicely). I think it is
> > not an issue to use its static library because Tesseract is under the Apache
> > License and Leptonica is under a permissive BSD-style license.
>
> Sounds good, I found the following in the README:
>
> Leptonica is required. (www.leptonica.com). Tesseract no longer compiles
> without Leptonica.
>
> Which makes sense.
>
> -- John
>
> On 5 Mar 2014, at 09:45, DImuthu Upeksha <di...@gmail.com>
> wrote:
>
> > Hi John,
> > +1 for your suggestion about converting image <=> byte array on the Java side.
> > It reduces a lot of complexity. I don't know whether you have noticed, but
> > the jint data type in JNI is a 32-bit integer type. I noticed it on my Mac
> > but don't know about other operating systems.
> >
> > Leptonica is the image processing library for Tesseract [1]. Tesseract
> > uses the image processing algorithms in Leptonica to implement its OCR
> > algorithms. This [2] is the .cpp file responsible for creating the Tesseract API.
> > You can see it includes the allheaders.h header file, which is the main header
> > file of Leptonica. So I think we must build Leptonica first and
> > link it when we build Tesseract. This is not a big problem if we can use
> > the static library of Leptonica (I did and it worked nicely). I think it is
> > not an issue to use its static library because Tesseract is under the Apache
> > License and Leptonica is under a permissive BSD-style license.
> >
> > I'm working on the maven implementation you have mentioned and will get
> > back to you soon.
> >
> > Thanks
> > Dimuthu
> >
> >
> > [1] https://code.google.com/p/tesseract-ocr/wiki/Compiling
> > [2]
> >
> https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/src/api/tesseractmain.cpp
> >
> >
> > On Wed, Mar 5, 2014 at 1:15 AM, John Hewson <jo...@jahewson.com> wrote:
> >
> >> Hi Dimuthu,
> >>
> >> 1,2,3:
> >>
> >> Feel free to write your own Tesseract binding or port the existing code
> as
> >> you see fit.
> >> The JNI binding should be minimal, only the methods you require need to
> be
> >> wrapped.
> >> Also, don't forget that some of the interop can be done in Java, for
> >> example if it is easier
> >> to convert a BufferedImage to a byte array in Java then do it there and
> >> pass the result
> >> to JNI rather than writing lots of JNI C++ to achieve the same result.
> >>
> >> Your GitHub repo looks like a good start, I can make comments there as
> >> things progress.
> >>
> >> Is it possible to build Tesseract without leptonica? I was under the
> >> impression that it was
> >> used for image i/o only, but I may be misinformed.
> >>
> >> 4:  The native platform library should be built as part of the Maven
> build
> >> for the Tesseract
> >> wrapper which can be a separate project. The output can be a jar file
> >> which contains the
> >> native binaries. It should be possible for the jar to contain prebuilt
> >> binaries for all platforms
> >> but this is something we can worry about later. Right now the goal
> should
> >> be to build a jar
> >> containing just the current platform's native binary and any Java
> wrapper
> >> code.
> >>
> >> -- John
> >>
> >> On 3 Mar 2014, at 16:41, DImuthu Upeksha <di...@gmail.com>
> >> wrote:
> >>
> >>> Hi John,
> >>>
> >>> I tried to reuse that android jni wrapper for tesseract. Here is my
> >>> observation
> >>>
> >>> 1. This wrapper heavily depends on android image libraries.
> >>> (android/bitmap.h). Most of the wrapper methods [1] use this library.
> >>>
> >>> 2. But I can understand underlying logic in each function. Basically
> what
> >>> it does is mapping between tesseract api functions [2] with java
> methods.
> >>> In between it does to some image <=> byte array like conversions by
> using
> >>> that bitmap libraries in Android
> >>>
> >>> 3. There are two ways. 1: We can port it's code to make compatible with
> >> our
> >>> environments(linux,windows and mac) which is really painful. Also it
> will
> >>> cause memory leaks. 2: We can use only it's function signatures and
> >>> implement using our codes
> >>>
> >>> I think 2nd solution is better because we need only few operations to
> be
> >>> done using tesseract library. I have created a github repo [3] for
> this.
> >>> It's still not finished. I need to add some make files and build files
> to
> >>> make it run properly. I also need to implement those wrapper functions
> >>> [4]. This may take some time.
> >>>
> >>> 4. Because we are calling native libraries we need different builds of
> >>> tesseract and leptonica libraries for each platform (dll for windows,
> so
> >>> for linux, dylib for mac). So we may need to build those libraries at
> the
> >>> time we build pdfbox project. Or we can pre build those libraries and
> add
> >>> them to the project as .dll, .so or .dylib format. What is the
> preferred
> >>> way?
> >>>
> >>> [1]
> >>>
> >>
> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/tessbaseapi.cpp
> >>> [2] https://code.google.com/p/tesseract-ocr/wiki/APIExample
> >>> [3] https://github.com/DImuthuUpe/Tesseract-API
> >>> [4]
> >>>
> >>
> https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/tessbaseapi.cpp
> >>>
> >>> Thanks
> >>> Dimuthu
> >>>
> >>>
> >>> On Sat, Mar 1, 2014 at 11:39 PM, DImuthu Upeksha <
> >> dimuthu.upeksha2@gmail.com
> >>>> wrote:
> >>>
> >>>> I updated necessary changes to the document [1]
> >>>>
> >>>> For last two days I had a deep look at this [2] jni wrapper for
> >> tessaract
> >>>> api.
> >>>> Unfortunately this has been designed for Android environment so I
> think
> >> we
> >>>> need to write our own make files to build this in to a dll(windows) or
> >>>> dylib(in mac). Currently it has Android.mk files [3]. I'm searching
> for
> >> a
> >>>> way to convert it to a make file that we can run on console. Please
> >> suggest
> >>>> if you have a better approach
> >>>>
> >>>> [1]
> >>>>
> >>
> https://www.dropbox.com/s/9qclvq26divwr2q/Optical%20Character%20Recognition%20for%20PDFBox%20-%20updated.pdf
> >>>> [2]
> >>>>
> >>
> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/
> >>>> [3]
> >>>>
> >>
> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/Android.mk
> >>>>
> >>>>
> >>>> On Sat, Mar 1, 2014 at 12:27 AM, John Hewson <jo...@jahewson.com>
> wrote:
> >>>>
> >>>>> This is a good start. However, there is no need for the Adder
> >> component,
> >>>>> "Extracted Text (OCR) can just feed back into the PDFBox "Text
> >> Extractor".
> >>>>>
> >>>>> Maybe show a "PDF" file feeding in to "Text Extractor, to make it
> clear
> >>>>> where the process starts.
> >>>>>
> >>>>> -- John
> >>>>>
> >>>>> On 26 Feb 2014, at 16:53, DImuthu Upeksha <
> dimuthu.upeksha2@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> Sorry for the mistake. I added it to my Dropbox [1].
> >>>>>>
> >>>>>> [1]
> >>>>>>
> >>>>>
> >>
> https://www.dropbox.com/s/y3m15rfjmw4eqij/Optical%20Character%20Recognition%20for%20PDFBox.pdf
> >>>>>>
> >>>>>> Thanks
> >>>>>> Dimuthu
> >>>>>>
> >>>>>>
> >>>>>> On Thu, Feb 27, 2014 at 4:44 AM, John Hewson <jo...@jahewson.com>
> >> wrote:
> >>>>>>
> >>>>>>> I should add that the OCR engine should be pluggable so PDFToText
> >> might
> >>>>>>> use an interface, e.g. OCREngine and there will be a
> >> TesseractOCREngine
> >>>>>>> class somewhere which provides the required functionality and lives
> >> in
> >>>>> a
> >>>>>>> separate jar file.
> >>>>>>>
> >>>>>>> -- John
> >>>>>>>
> >>>>>>>> On 25 Feb 2014, at 20:18, Dimuthu <di...@gmail.com>
> >> wrote:
> >>>>>>>>
> >>>>>>>> So do you need to embed those new functionalities into existing
> >>>>>>> PDFtoText algorithms or package them as a new sub system(something
> >>>>> like an
> >>>>>>> API)?
> >>>>>>>>
> >>>>>>>> -----Original Message-----
> >>>>>>>> From: "John Hewson" <jo...@jahewson.com>
> >>>>>>>> Sent: 26/02/2014 07:38
> >>>>>>>> To: "dev@pdfbox.apache.org" <de...@pdfbox.apache.org>
> >>>>>>>> Subject: Re: [GSoC 2014]Optical Character Recognition project -
> >>>>>>> Introduction
> >>>>>>>>
> >>>>>>>> Yes, exactly. By location data I just mean (x,y) coordinates and
> >> page
> >>>>>>> rotation.
> >>>>>>>>
> >>>>>>>> There is another use case for OCR: some fonts embedded in PDFs have
> >>>>>>>> corrupt encodings, which means the ASCII codes map to the wrong glyphs. We
> >>>>>>>> could OCR the glyphs to repair the encoding.
> >>>>>>>>
> >>>>>>>> -- John
> >>>>>>>>
> >>>>>>>>> On 25 Feb 2014, at 17:13, DImuthu Upeksha <
> >>>>> dimuthu.upeksha2@gmail.com>
> >>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi John,
> >>>>>>>>> Thanks for the explanation.
> >>>>>>>>> Let's say there is a pdf with both text in extractable format and
> >>>>> some
> >>>>>>>>> images with text(Scanned images). In that case first we extract
> >> those
> >>>>>>>>> extractable content using PDFBox algorithms and rest is extracted
> >>>>> using
> >>>>>>>>> OCR. Finally we pack both results together and give output as
> >>>>>>> PDFToText. Am
> >>>>>>>>> I correct? What do you mean by "location data"?
> >>>>>>>>>
> >>>>>>>>> Thanks
> >>>>>>>>> Dimuthu
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> On Tue, Feb 25, 2014 at 11:22 PM, John Hewson <
> john@jahewson.com>
> >>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>> 1. What is called "glyphs" ?
> >>>>>>>>>>
> >>>>>>>>>> http://en.wikipedia.org/wiki/Glyph
> >>>>>>>>>>
> >>>>>>>>>>> 2. What is the main requirement of this project?
> >>>>>>>>>>> As far as I understood, first we need to generate an image of
> >>>>>>>>>>> malformed pdfs from
> >>>>>>>>>>> PDFBox and then we need to do processing using OCR for further
> >>>>>>> accurate
> >>>>>>>>>>> results.  But the problem is, why shouldn't we directly do OCR
> on
> >>>>>>> those
> >>>>>>>>>>> PDFs without getting output from PDFBox? Correct me if I'm
> wrong.
> >>>>>>>>>>
> >>>>>>>>>> PDFBox can generate images (PDFToImage) and can extract text
> >>>>>>> (PDFToText).
> >>>>>>>>>> The goal of
> >>>>>>>>>> this project is to enhance PDFToText so that it can use OCR to
> >>>>> extract
> >>>>>>>>>> text from areas of the
> >>>>>>>>>> document where the text is embedded as an image. Such PDF files
> >> are
> >>>>>>>>>> typically generated by
> >>>>>>>>>> scanners or fax machines. There is also another case where OCR
> is
> >>>>>>> useful:
> >>>>>>>>>> some fonts embedded
> >>>>>>>>>> in PDF files contain the wrong encoding, so when text is
> extracted
> >>>>> with
> >>>>>>>>>> PDFToText the result is
> >>>>>>>>>> nonsense but when drawn with PDFToImage we see the correct
> >> letters.
> >>>>>>>>>>
> >>>>>>>>>> Instead of:
> >>>>>>>>>> PDF => Image => OCR => Text
> >>>>>>>>>>
> >>>>>>>>>> We want to do:
> >>>>>>>>>> PDF => (Many images for words + location data => OCR) => Text
> >>>>>>>>>>
> >>>>>>>>>> -- John
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha <
> >>>>>>>>>> dimuthu.upeksha2@gmail.com
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Ok fixed. This is what I did
> >>>>>>>>>>>> Right click on the new project ->Debug As-> Debug
> Configurations
> >>>>>>>>>> ->Source
> >>>>>>>>>>>> ->Add -> Project
> >>>>>>>>>>>> Then I selected PDFBox project.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks
> >>>>>>>>>>>> Dimuthu
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha <
> >>>>>>>>>>>> dimuthu.upeksha2@gmail.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> I'm using eclipse. This is what I want. I created a new Java
> >>>>>>>>>> application
> >>>>>>>>>>>>> project (say TestPDFBox) with a main class with following
> code.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> PDDocument document = new PDDocument();
> >>>>>>>>>>>>> PDPage blankPage = new PDPage();
> >>>>>>>>>>>>> document.addPage(blankPage);
> >>>>>>>>>>>>> document.save("BlankPage.pdf");
> >>>>>>>>>>>>> document.close();
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Then I need to add those jar files generated in target folder
> >> of
> >>>>>>> PDFBox
> >>>>>>>>>>>>> to build path of my new project (I did build the PDFBox
> project
> >>>>> from
> >>>>>>>>>>>>> source). That is what I did. But let's say I need to check
>  the
> >>>>>>>>>>>>> functionality of document.save("") method. But I don't have a
> >>>>>>>>>> reference to
> >>>>>>>>>>>>> it's sources because I directly used generated jars. As
> Tilman
> >>>>> said
> >>>>>>> I
> >>>>>>>>>> built
> >>>>>>>>>>>>> PDFBox from sources but I don't know a proper way to use it
> >> other
> >>>>>>>>>> projects
> >>>>>>>>>>>>> other than adding those jar files to build path.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <
> >> john@jahewson.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Which IDE are you using? You should be able to run the
> >> PDFToText
> >>>>>>> class
> >>>>>>>>>>>>>> (in pdfbox-tools) using your IDE and pass a PDF file path as
> >> the
> >>>>>>>>>> command
> >>>>>>>>>>>>>> line argument.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> -- John
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On 24 Feb 2014, at 22:38, DImuthu Upeksha <
> >>>>>>>>>> dimuthu.upeksha2@gmail.com>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi John,
> >>>>>>>>>>>>>>> Thanks for the reply. Yes I checked out PDFBox code and
> >>>>> managed to
> >>>>>>>>>>>>>> build
> >>>>>>>>>>>>>>> code successfully. I looked at the classes you mentioned
> and
> >> I
> >>>>>>> got a
> >>>>>>>>>>>>>> rough
> >>>>>>>>>>>>>>> idea about how they are working. To check them I used the
> >> jars
> >>>>> in
> >>>>>>>>>>>>>> target
> >>>>>>>>>>>>>>> folder to my separate java project. I tried samples in
> >>>>>>>>>>>>>>> http://pdfbox.apache.org/cookbook/. I need to further look
> >>>>> into
> >>>>>>> code
> >>>>>>>>>>>>>>> specially how those processXXX() methods work in
> >>>>> PDFTextStripper
> >>>>>>>>>> class.
> >>>>>>>>>>>>>>> What I usually do is adding some berakpoints and checking
> >> them
> >>>>> in
> >>>>>>>>>> debug
> >>>>>>>>>>>>>>> windows. But using jars it's not possible. What is the way
> >> you
> >>>>>>> follow
> >>>>>>>>>>>>>> in
> >>>>>>>>>>>>>>> order to do such task?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> As well I installed tesseract in to my machine and managed
> to
> >>>>> do
> >>>>>>> some
> >>>>>>>>>>>>>> OCR
> >>>>>>>>>>>>>>> stuff also. That's a cool tool which works fine.
> >>>>>>>>>>>>>>> I'm still learning the code. If I get any issue I'll drop
> >> you a
> >>>>>>> mail.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks
> >>>>>>>>>>>>>>> Dimuthu
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <
> >>>>> john@jahewson.com
> >>>>>>>>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Hi Dimuthu
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> The PDFBox website can be found at http://pdfbox.apache.org/. It
> >>>>>>>>>>>>>>>> contains a basic overview of the project
> >>>>>>>>>>>>>>>> and details on how to obtain the source code and build
> >> PDFBox
> >>>>> for
> >>>>>>>>>>>>>> yourself.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Currently we do not perform any OCR and PDFBOX-1912
> details
> >>>>> the
> >>>>>>> only
> >>>>>>>>>>>>>>>> thoughts so far regarding it.
> >>>>>>>>>>>>>>>> Note that the OCR libraries mentioned in the JIRA issue
> are
> >>>>> all
> >>>>>>>>>> under
> >>>>>>>>>>>>>> the
> >>>>>>>>>>>>>>>> Apache license, which is a
> >>>>>>>>>>>>>>>> requirement.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Once you have the source code, take a look at the
> PageDrawer
> >>>>>>> class
> >>>>>>>>>> to
> >>>>>>>>>>>>>> see
> >>>>>>>>>>>>>>>> how text and images are
> >>>>>>>>>>>>>>>> rendered. We want someone to interface at a low-level
> (e.g.
> >>>>> one
> >>>>>>>>>> glyph,
> >>>>>>>>>>>>>>>> word, or sentence at a time) with
> >>>>>>>>>>>>>>>> an OCR engine. Also look at PDFTextStripper which is how
> >> text
> >>>>> is
> >>>>>>>>>>>>>> currently
> >>>>>>>>>>>>>>>> extracted, take a look at how
> >>>>>>>>>>>>>>>> we have to go to great length to sort text back into
> reading
> >>>>>>> order
> >>>>>>>>>> and
> >>>>>>>>>>>>>>>> infer the placement of diacritics - PDF
> >>>>>>>>>>>>>>>> is fundamentally a visual format, not a structured format
> >> like
> >>>>>>> HTML
> >>>>>>>>>> -
> >>>>>>>>>>>>>>>> which is why extracting text can be so
> >>>>>>>>>>>>>>>> difficult sometimes.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> The full PDF Reference document can be found at:
> >>>>>>>>>>
> >>>>>>>
> >>>>>
> >>
> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Feel free to discuss specifics of your proposal or ask any
> >>>>>>>>>> questions.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> -- John
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On 23 Feb 2014, at 21:13, DImuthu Upeksha <
> >>>>>>>>>> dimuthu.upeksha2@gmail.com
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>>>> I am Dimuthu Upeksha, a Computer Engineering
> Undergraduate
> >> at
> >>>>>>>>>>>>>> University
> >>>>>>>>>>>>>>>> of Moratuwa Sri Lanka. I successfully completed my GSoC
> 2013
> >>>>> with
> >>>>>>>>>>>>>> Apache
> >>>>>>>>>>>>>>>> ISIS [1] project. I'm very much interested in OCR and
> image
> >>>>>>>>>> processing
> >>>>>>>>>>>>>>>> stuff. So I would like to select this project idea as my
> >> GSoC
> >>>>>>> 2014
> >>>>>>>>>>>>>> project
> >>>>>>>>>>>>>>>> because I feel like it is the best suited project for me.
> In
> >>>>>>>>>>>>>> university
> >>>>>>>>>>>>>>>> also we have done some research in OCR area and our group
> >>>>> wrote a
> >>>>>>>>>>>>>>>> literature review about increasing efficiency of OCR
> >>>>>>>>>>>>>> systems(attached). Can
> >>>>>>>>>>>>>>>> you please suggest me where to start learning about
> PDFBox?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> [1]
> >>>>>>>>>>
> >>>>>>>
> >>>>>
> >>
> http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thank you
> >>>>>>>>>>>>>>>>> Dimuthu
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> --
> >>>>>>>>>>>>>>>>> Regards
> >>>>>>>>>>>>>>>>> W.Dimuthu Upeksha
> >>>>>>>>>>>>>>>>> Undergraduate
> >>>>>>>>>>>>>>>>> Department of Computer Science And Engineering
> >>>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka


-- 
Regards

W.Dimuthu Upeksha
Undergraduate
Department of Computer Science And Engineering

University of Moratuwa, Sri Lanka

Re: [GSoC 2014]Optical Character Recognition project - Introduction

Posted by John Hewson <jo...@jahewson.com>.
Dimuthu

> There are a lot of code
> fragments in the current Android JNI wrapper which use "(jint)somePointer"
> casting, which will truncate pointers and cause memory errors in 64-bit
> environments because pointers are 64 bits wide. So I believe writing it from
> scratch is much better.

That’s a classic 64-bit pitfall, well spotted. We definitely need to support
64-bit JVMs.
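
To make the pitfall concrete: the JNI specification defines jint as a signed
32-bit integer and jlong as a signed 64-bit integer on every platform, so a
native handle is usually kept in a Java long rather than an int. The sketch
below only mimics the cast on the Java side. The class name, method names and
the sample address are made up for illustration and come from neither the
Android wrapper nor the Tesseract-API repo.

```java
final class NativeHandleDemo {
    private NativeHandleDemo() {}

    /** Mimics a (jint) cast: keeps only the low 32 bits of a 64-bit address. */
    static int truncateToJint(long address) {
        return (int) address;
    }

    /** Returns true if the address comes back unchanged after a jint round trip. */
    static boolean survivesJintRoundTrip(long address) {
        return (long) truncateToJint(address) == address;
    }

    public static void main(String[] args) {
        long address = 0x00007F8A12345678L; // made-up 64-bit address
        System.out.println(Long.toHexString(address));
        System.out.println(Integer.toHexString(truncateToJint(address)));
        System.out.println(survivesJintRoundTrip(address));
    }
}
```

An address that fits in 32 bits survives the round trip, which is why such code
can appear to work on 32-bit systems and then fail on a 64-bit JVM.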

> we can use
> the static library of Leptonica (I did and it worked nicely). I think it is
> not an issue to use its static library because Tesseract is under the Apache
> License and Leptonica is under a permissive BSD-style license.

Sounds good, I found the following in the README:

Leptonica is required. (www.leptonica.com). Tesseract no longer compiles
without Leptonica.

Which makes sense.

-- John

On 5 Mar 2014, at 09:45, DImuthu Upeksha <di...@gmail.com> wrote:

> Hi John,
> +1 for your suggestion about converting image <=> byte array on the Java side.
> It reduces a lot of complexity. I don't know whether you have noticed, but
> the jint data type in JNI is a 32-bit integer type. I noticed it on my Mac
> but don't know about other operating systems.
>
> Leptonica is the image processing library for Tesseract [1]. Tesseract
> uses the image processing algorithms in Leptonica to implement its OCR
> algorithms. This [2] is the .cpp file responsible for creating the Tesseract API.
> You can see it includes the allheaders.h header file, which is the main header
> file of Leptonica. So I think we must build Leptonica first and
> link it when we build Tesseract. This is not a big problem if we can use
> the static library of Leptonica (I did and it worked nicely). I think it is
> not an issue to use its static library because Tesseract is under the Apache
> License and Leptonica is under a permissive BSD-style license.
> 
> I'm working on the maven implementation you have mentioned and will get
> back to you soon.
> 
> Thanks
> Dimuthu
> 
> 
> [1] https://code.google.com/p/tesseract-ocr/wiki/Compiling
> [2]
> https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/src/api/tesseractmain.cpp
> 
> 
> On Wed, Mar 5, 2014 at 1:15 AM, John Hewson <jo...@jahewson.com> wrote:
> 
>> Hi Dimuthu,
>> 
>> 1,2,3:
>> 
>> Feel free to write your own Tesseract binding or port the existing code as
>> you see fit.
>> The JNI binding should be minimal, only the methods you require need to be
>> wrapped.
>> Also, don't forget that some of the interop can be done in Java, for
>> example if it is easier
>> to convert a BufferedImage to a byte array in Java then do it there and
>> pass the result
>> to JNI rather than writing lots of JNI C++ to achieve the same result.
>> 
>> Your GitHub repo looks like a good start, I can make comments there as
>> things progress.
>> 
>> Is it possible to build Tesseract without leptonica? I was under the
>> impression that it was
>> used for image i/o only, but I may be misinformed.
>> 
>> 4:  The native platform library should be built as part of the Maven build
>> for the Tesseract
>> wrapper which can be a separate project. The output can be a jar file
>> which contains the
>> native binaries. It should be possible for the jar to contain prebuilt
>> binaries for all platforms
>> but this is something we can worry about later. Right now the goal should
>> be to build a jar
>> containing just the current platform's native binary and any Java wrapper
>> code.
>> 
>> -- John
>> 
>> On 3 Mar 2014, at 16:41, DImuthu Upeksha <di...@gmail.com>
>> wrote:
>> 
>>> Hi John,
>>> 
>>> I tried to reuse that Android JNI wrapper for Tesseract. Here are my
>>> observations:
>>>
>>> 1. This wrapper depends heavily on Android image libraries
>>> (android/bitmap.h). Most of the wrapper methods [1] use this library.
>>>
>>> 2. But I can understand the underlying logic in each function. Basically
>>> it maps Tesseract API functions [2] to Java methods.
>>> In between it does some image <=> byte array conversions using
>>> the Android bitmap libraries.
>>>
>>> 3. There are two ways. 1: We can port its code to make it compatible with
>>> our environments (Linux, Windows and Mac), which is really painful. Also it
>>> will cause memory leaks. 2: We can use only its function signatures and
>>> implement them with our own code.
>>>
>>> I think the 2nd solution is better because we need only a few operations to be
>>> done using the Tesseract library. I have created a GitHub repo [3] for this.
>>> It's still not finished. I need to add some make files and build files to
>>> make it run properly. I also need to implement those wrapper functions
>>> [4]. This may take some time.
>>>
>>> 4. Because we are calling native libraries we need different builds of
>>> the Tesseract and Leptonica libraries for each platform (dll for Windows, so
>>> for Linux, dylib for Mac). So we may need to build those libraries at the
>>> time we build the PDFBox project. Or we can prebuild those libraries and add
>>> them to the project in .dll, .so or .dylib format. Which is the preferred
>>> way?
>>> 
>>> [1]
>>> 
>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/tessbaseapi.cpp
>>> [2] https://code.google.com/p/tesseract-ocr/wiki/APIExample
>>> [3] https://github.com/DImuthuUpe/Tesseract-API
>>> [4]
>>> 
>> https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/tessbaseapi.cpp
>>> 
>>> Thanks
>>> Dimuthu
>>> 
>>> 
>>> On Sat, Mar 1, 2014 at 11:39 PM, DImuthu Upeksha <
>> dimuthu.upeksha2@gmail.com
>>>> wrote:
>>> 
>>>> I updated necessary changes to the document [1]
>>>> 
>>>> For last two days I had a deep look at this [2] jni wrapper for
>> tessaract
>>>> api.
>>>> Unfortunately this has been designed for Android environment so I think
>> we
>>>> need to write our own make files to build this in to a dll(windows) or
>>>> dylib(in mac). Currently it has Android.mk files [3]. I'm searching for
>> a
>>>> way to convert it to a make file that we can run on console. Please
>> suggest
>>>> if you have a better approach
>>>> 
>>>> [1]
>>>> 
>> https://www.dropbox.com/s/9qclvq26divwr2q/Optical%20Character%20Recognition%20for%20PDFBox%20-%20updated.pdf
>>>> [2]
>>>> 
>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/
>>>> [3]
>>>> 
>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/Android.mk
>>>> 
>>>> 
>>>> On Sat, Mar 1, 2014 at 12:27 AM, John Hewson <jo...@jahewson.com> wrote:
>>>> 
>>>>> This is a good start. However, there is no need for the Adder
>> component,
>>>>> "Extracted Text (OCR) can just feed back into the PDFBox "Text
>> Extractor".
>>>>> 
>>>>> Maybe show a "PDF" file feeding in to "Text Extractor, to make it clear
>>>>> where the process starts.
>>>>> 
>>>>> -- John
>>>>> 
>>>>> On 26 Feb 2014, at 16:53, DImuthu Upeksha <di...@gmail.com>
>>>>> wrote:
>>>>> 
>>>>>> Sorry for the mistake. I added it to my Dropbox [1].
>>>>>> 
>>>>>> [1]
>>>>>> 
>>>>> 
>> https://www.dropbox.com/s/y3m15rfjmw4eqij/Optical%20Character%20Recognition%20for%20PDFBox.pdf
>>>>>> 
>>>>>> Thanks
>>>>>> Dimuthu
>>>>>> 
>>>>>> 
>>>>>> On Thu, Feb 27, 2014 at 4:44 AM, John Hewson <jo...@jahewson.com>
>> wrote:
>>>>>> 
>>>>>>> I should add that the OCR engine should be pluggable so PDFToText
>> might
>>>>>>> use an interface, e.g. OCREngine and there will be a
>> TesseractOCREngine
>>>>>>> class somewhere which provides the required functionality and lives
>> in
>>>>> a
>>>>>>> separate jar file.
>>>>>>> 
>>>>>>> -- John
>>>>>>> 
>>>>>>>> On 25 Feb 2014, at 20:18, Dimuthu <di...@gmail.com>
>> wrote:
>>>>>>>> 
>>>>>>>> So do you need to embed those new functionalities into existing
>>>>>>> PDFtoText algorithms or package them as a new sub system(something
>>>>> like an
>>>>>>> API)?
>>>>>>>> 
>>>>>>>> -----Original Message-----
>>>>>>>> From: "John Hewson" <jo...@jahewson.com>
>>>>>>>> Sent: 26/02/2014 07:38
>>>>>>>> To: "dev@pdfbox.apache.org" <de...@pdfbox.apache.org>
>>>>>>>> Subject: Re: [GSoC 2014]Optical Character Recognition project -
>>>>>>> Introduction
>>>>>>>> 
>>>>>>>> Yes, exactly. By location data I just mean (x,y) coordinates and
>> page
>>>>>>> rotation.
>>>>>>>> 
>>>>>>>> There is another use case for OCR: some fonts embedded in PDFs have
>>>>>>> corrupt encodings, which means the ACSII codes map to the wrong
>>>>> glyphs. We
>>>>>>> could OCR the glyphs to repair the encoding.
>>>>>>>> 
>>>>>>>> -- John
>>>>>>>> 
>>>>>>>>> On 25 Feb 2014, at 17:13, DImuthu Upeksha <
>>>>> dimuthu.upeksha2@gmail.com>
>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>> Hi John,
>>>>>>>>> Thanks for the explanation.
>>>>>>>>> Let's say there is a pdf with both text in extractable format and
>>>>> some
>>>>>>>>> images with text(Scanned images). In that case first we extract
>> those
>>>>>>>>> extractable content using PDFBox algorithms and rest is extracted
>>>>> using
>>>>>>>>> OCR. Finally we pack both results together and give output as
>>>>>>> PDFToText. Am
>>>>>>>>> I correct? What do you mean by "location data"?
>>>>>>>>> 
>>>>>>>>> Thanks
>>>>>>>>> Dimuthu
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On Tue, Feb 25, 2014 at 11:22 PM, John Hewson <jo...@jahewson.com>
>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> 1. What is called "glyphs" ?
>>>>>>>>>> 
>>>>>>>>>> http://en.wikipedia.org/wiki/Glyph
>>>>>>>>>> 
>>>>>>>>>>> 2. What is the main requirement of this project?
>>>>>>>>>>> As far as I understood, first we need to generate an image of
>>>>>>>>>>> malformed pdfs from
>>>>>>>>>>> PDFBox and then we need to do processing using OCR for further
>>>>>>> accurate
>>>>>>>>>>> results.  But the problem is, why shouldn't we directly do OCR on
>>>>>>> those
>>>>>>>>>>> PDFs without getting output from PDFBox? Correct me if I'm wrong.
>>>>>>>>>> 
>>>>>>>>>> PDFBox can generate images (PDFToImage) and can extract text
>>>>>>> (PDFToText).
>>>>>>>>>> The goal of
>>>>>>>>>> this project is to enhance PDFToText so that it can use OCR to
>>>>> extract
>>>>>>>>>> text from areas of the
>>>>>>>>>> document where the text is embedded as an image. Such PDF files
>> are
>>>>>>>>>> typically generated by
>>>>>>>>>> scanners or fax machines. There is also another case where OCR is
>>>>>>> useful:
>>>>>>>>>> some fonts embedded
>>>>>>>>>> in PDF files contain the wrong encoding, so when text is extracted
>>>>> with
>>>>>>>>>> PDFToText the result is
>>>>>>>>>> nonsense but when drawn with PDFToImage we see the correct
>> letters.
>>>>>>>>>> 
>>>>>>>>>> Instead of:
>>>>>>>>>> PDF => Image => OCR => Text
>>>>>>>>>> 
>>>>>>>>>> We want to do:
>>>>>>>>>> PDF => (Many images for words + location data => OCR) => Text
>>>>>>>>>> 
>>>>>>>>>> -- John
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha <
>>>>>>>>>> dimuthu.upeksha2@gmail.com
>>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Ok fixed. This is what I did
>>>>>>>>>>>> Right click on the new project ->Debug As-> Debug Configurations
>>>>>>>>>> ->Source
>>>>>>>>>>>> ->Add -> Project
>>>>>>>>>>>> Then I selected PDFBox project.
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha <
>>>>>>>>>>>> dimuthu.upeksha2@gmail.com> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> I'm using eclipse. This is what I want. I created a new Java
>>>>>>>>>> application
>>>>>>>>>>>>> project (say TestPDFBox) with a main class with following code.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> PDDocument document = new PDDocument();PDPage blankPage = new
>>>>>>>>>> PDPage();document.addPage( blankPage
>>>>>>>>>> );document.save("BlankPage.pdf");document.close();
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Then I need to add those jar files generated in target folder
>> of
>>>>>>> PDFBox
>>>>>>>>>>>>> to build path of my new project (I did build the PDFBox project
>>>>> from
>>>>>>>>>>>>> source). That is what I did. But let's say I need to check  the
>>>>>>>>>>>>> functionality of document.save("") method. But I don't have a
>>>>>>>>>> reference to
>>>>>>>>>>>>> it's sources because I directly used generated jars. As Tilman
>>>>> said
>>>>>>> I
>>>>>>>>>> built
>>>>>>>>>>>>> PDFBox from sources but I don't know a proper way to use it
>> other
>>>>>>>>>> projects
>>>>>>>>>>>>> other than adding those jar files to build path.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <
>> john@jahewson.com>
>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Which IDE are you using? You should be able to run the
>> PDFToText
>>>>>>> class
>>>>>>>>>>>>>> (in pdfbox-tools) using your IDE and pass a PDF file path as
>> the
>>>>>>>>>> command
>>>>>>>>>>>>>> line argument.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On 24 Feb 2014, at 22:38, DImuthu Upeksha <
>>>>>>>>>> dimuthu.upeksha2@gmail.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>>>> Thanks for the reply. Yes I checked out PDFBox code and
>>>>> managed to
>>>>>>>>>>>>>> build
>>>>>>>>>>>>>>> code successfully. I looked at the classes you mentioned and
>> I
>>>>>>> got a
>>>>>>>>>>>>>> rough
>>>>>>>>>>>>>>> idea about how they are working. To check them I used the
>> jars
>>>>> in
>>>>>>>>>>>>>> target
>>>>>>>>>>>>>>> folder to my separate java project. I tried samples in
>>>>>>>>>>>>>>> http://pdfbox.apache.org/cookbook/. I need to further look
>>>>> into
>>>>>>> code
>>>>>>>>>>>>>>> specially how those processXXX() methods work in
>>>>> PDFTextStripper
>>>>>>>>>> class.
>>>>>>>>>>>>>>> What I usually do is adding some berakpoints and checking
>> them
>>>>> in
>>>>>>>>>> debug
>>>>>>>>>>>>>>> windows. But using jars it's not possible. What is the way
>> you
>>>>>>> follow
>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>> order to do such task?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> As well I installed tesseract in to my machine and managed to
>>>>> do
>>>>>>> some
>>>>>>>>>>>>>> OCR
>>>>>>>>>>>>>>> stuff also. That's a cool tool which works fine.
>>>>>>>>>>>>>>> I'm still learning the code. If I get any issue I'll drop
>> you a
>>>>>>> mail.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <
>>>>> john@jahewson.com
>>>>>>>> 
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hi Dimuthu
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> The PDFBox website can be found at
>> http://pdfbox.apache.org/it
>>>>>>>>>>>>>> contains
>>>>>>>>>>>>>>>> a basic overview of the project
>>>>>>>>>>>>>>>> and details on how to obtain the source code and build
>> PDFBox
>>>>> for
>>>>>>>>>>>>>> yourself.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Currently we do not perform any OCR and PDFBOX-1912 details
>>>>> the
>>>>>>> only
>>>>>>>>>>>>>>>> thoughts so far regarding it.
>>>>>>>>>>>>>>>> Note that the OCR libraries mentioned in the JIRA issue are
>>>>> all
>>>>>>>>>> under
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>> Apache license, which is a
>>>>>>>>>>>>>>>> requirement.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Once you have the source code, take a look at the PageDrawer
>>>>>>> class
>>>>>>>>>> to
>>>>>>>>>>>>>> see
>>>>>>>>>>>>>>>> how text and images are
>>>>>>>>>>>>>>>> rendered. We want someone to interface at a low-level (e.g.
>>>>> one
>>>>>>>>>> glyph,
>>>>>>>>>>>>>>>> word, or sentence at a time) with
>>>>>>>>>>>>>>>> an OCR engine. Also look at PDFTextStripper which is how
>> text
>>>>> is
>>>>>>>>>>>>>> currently
>>>>>>>>>>>>>>>> extracted, take a look at how
>>>>>>>>>>>>>>>> we have to go to great length to sort text back into reading
>>>>>>> order
>>>>>>>>>> and
>>>>>>>>>>>>>>>> infer the placement of diacritics - PDF
>>>>>>>>>>>>>>>> is fundamentally a visual format, not a structured format
>> like
>>>>>>> HTML
>>>>>>>>>> -
>>>>>>>>>>>>>>>> which is why extracting text can be so
>>>>>>>>>>>>>>>> difficult sometimes.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> The full PDF Reference document can be found at:
>>>>>>>>>> 
>>>>>>> 
>>>>> 
>> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Feel free to discuss specifics of your proposal or ask any
>>>>>>>>>> questions.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On 23 Feb 2014, at 21:13, DImuthu Upeksha <
>>>>>>>>>> dimuthu.upeksha2@gmail.com
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>> I am Dimuthu Upeksha, a Computer Engineering Undergraduate
>> at
>>>>>>>>>>>>>> University
>>>>>>>>>>>>>>>> of Moratuwa Sri Lanka. I successfully completed my GSoC 2013
>>>>> with
>>>>>>>>>>>>>> Apache
>>>>>>>>>>>>>>>> ISIS [1] project. I'm very much interested in OCR and image
>>>>>>>>>> processing
>>>>>>>>>>>>>>>> stuff. So I would like to select this project idea as my
>> GSoC
>>>>>>> 2014
>>>>>>>>>>>>>> project
>>>>>>>>>>>>>>>> because I feel like it is the best suited project for me. In
>>>>>>>>>>>>>> university
>>>>>>>>>>>>>>>> also we have done some research in OCR area and our group
>>>>> wrote a
>>>>>>>>>>>>>>>> literature review about increasing efficiency of OCR
>>>>>>>>>>>>>> systems(attached). Can
>>>>>>>>>>>>>>>> you please suggest me where to start learning about PDFBox?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>> 
>>>>>>> 
>>>>> 
>> http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Thank you
>>>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>>>>>>>> Undergraduate
>>>>>>>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>>>>>> Undergraduate
>>>>>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Regards
>>>>>>>>>>>>> 
>>>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>>>> Undergraduate
>>>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>>> 
>>>>>>>>>>>>> University of Moratuwa, Sri Lanka
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> --
>>>>>>>>>>>> Regards
>>>>>>>>>>>> 
>>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>>> Undergraduate
>>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>> 
>>>>>>>>>>>> University of Moratuwa, Sri Lanka
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> --
>>>>>>>>>>> Regards
>>>>>>>>>>> 
>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>> Undergraduate
>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>> 
>>>>>>>>>>> University of Moratuwa, Sri Lanka
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> Regards
>>>>>>>>> 
>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>> Undergraduate
>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>> 
>>>>>>>>> University of Moratuwa, Sri Lanka
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Regards
>>>>>> 
>>>>>> W.Dimuthu Upeksha
>>>>>> Undergraduate
>>>>>> Department of Computer Science And Engineering
>>>>>> 
>>>>>> University of Moratuwa, Sri Lanka
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> Regards
>>>> 
>>>> W.Dimuthu Upeksha
>>>> Undergraduate
>>>> Department of Computer Science And Engineering
>>>> 
>>>> University of Moratuwa, Sri Lanka
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Regards
>>> 
>>> W.Dimuthu Upeksha
>>> Undergraduate
>>> Department of Computer Science And Engineering
>>> 
>>> University of Moratuwa, Sri Lanka
>> 
>> 
> 
> 
> -- 
> Regards
> 
> W.Dimuthu Upeksha
> Undergraduate
> Department of Computer Science And Engineering
> 
> University of Moratuwa, Sri Lanka


Re: [GSoC 2014]Optical Character Recognition project - Introduction

Posted by DImuthu Upeksha <di...@gmail.com>.
Hi John,
+1 for your suggestion about converting image <=> byte array on the Java side.
It removes a lot of complexity. I don't know whether you have noticed, but
the jint data type in JNI is a 32-bit integer type. I noticed this on my Mac
but don't know about other operating systems. There are a lot of code
fragments in the current Android JNI wrapper which use "(jint)somePointer"
casts. These truncate pointers in 64-bit environments, where pointers are
64 bits wide, and will cause terrible memory errors. So I believe writing
it from scratch is much better.
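
To illustrate both points, here is a minimal Java sketch. The class and
method names (OcrInterop, toGrayBytes, roundTripThroughInt) are made up for
this example and are not part of PDFBox, Tesseract or the Android wrapper.
It shows the image being flattened to a byte array on the Java side, and
what happens to a 64-bit address when it is squeezed through a 32-bit
integer the way a "(jint)somePointer" cast does. The JNI specification
defines jint as 32 bits and jlong as 64 bits, so a native handle would
need to travel as a long.

```java
import java.awt.image.BufferedImage;

class OcrInterop {

    /** Packs an image into one 8-bit grey value per pixel, row by row. */
    static byte[] toGrayBytes(BufferedImage image) {
        int w = image.getWidth();
        int h = image.getHeight();
        byte[] out = new byte[w * h];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int rgb = image.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Integer approximation of the usual luma weights.
                out[y * w + x] = (byte) ((r * 299 + g * 587 + b * 114) / 1000);
            }
        }
        return out;
    }

    /**
     * Simulates a "(jint) pointer" cast: only the low 32 bits survive,
     * and they are sign-extended when widened back to 64 bits.
     */
    static long roundTripThroughInt(long address) {
        return (int) address;
    }

    public static void main(String[] args) {
        BufferedImage img = new BufferedImage(2, 1, BufferedImage.TYPE_INT_RGB);
        img.setRGB(0, 0, 0xFFFFFF); // white pixel
        img.setRGB(1, 0, 0x000000); // black pixel
        byte[] pixels = toGrayBytes(img);
        System.out.println("bytes to hand to JNI: " + pixels.length);

        long address = 0x00007F8A12345678L; // made-up 64-bit address
        System.out.println(Long.toHexString(address) + " -> "
                + Long.toHexString(roundTripThroughInt(address)));
    }
}
```

With this split, the native side would only need to accept a byte array
plus width and height, and any handle it returns would be declared as long
in Java (jlong in C++).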

Leptonica is the image processing library used by Tesseract [1]. Tesseract
uses the image processing algorithms in Leptonica to implement its OCR
algorithms. This [2] is the .cpp file responsible for creating the Tesseract
API. You can see it includes the allheaders.h header file, which is the main
header file of Leptonica. So I think we must build Leptonica first and
link it when we build Tesseract. This is not a big problem if we can use
the static library of Leptonica (I did and it worked nicely). I think it is
not an issue to use its static library because Tesseract is under the Apache
licence and Leptonica is under a permissive BSD-style licence.

I'm working on the Maven build you mentioned and will get back to you soon.

Thanks
Dimuthu


[1] https://code.google.com/p/tesseract-ocr/wiki/Compiling
[2]
https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/src/api/tesseractmain.cpp


On Wed, Mar 5, 2014 at 1:15 AM, John Hewson <jo...@jahewson.com> wrote:

> Hi Dimuthu,
>
> 1,2,3:
>
> Feel free to write your own Tesseract binding or port the existing code as
> you see fit.
> The JNI binding should be minimal, only the methods you require need to be
> wrapped.
> Also, don't forget that some of the interop can be done in Java, for
> example if it is easier
> to convert a BufferedImage to a byte array in Java then do it there and
> pass the result
> to JNI rather than writing lots of JNI C++ to achieve the same result.
>
> Your GitHub repo looks like a good start, I can make comments there as
> things progress.
>
> Is it possible to build Tesseract without leptonica? I was under the
> impression that it was
> used for image i/o only, but I may be misinformed.
>
> 4:  The native platform library should be built as part of the Maven build
> for the Tesseract
> wrapper which can be a separate project. The output can be a jar file
> which contains the
> native binaries. It should be possible for the jar to contain prebuilt
> binaries for all platforms
> but this is something we can worry about later. Right now the goal should
> be to build a jar
> containing just the current platform's native binary and any Java wrapper
> code.
>
> -- John
>
> On 3 Mar 2014, at 16:41, DImuthu Upeksha <di...@gmail.com>
> wrote:
>
> > Hi John,
> >
> > I tried to reuse that android jni wrapper for tesseract. Here is my
> > observation
> >
> > 1. This wrapper heavily depends on android image libraries.
> > (android/bitmap.h). Most of the wrapper methods [1] use this library.
> >
> > 2. But I can understand underlying logic in each function. Basically what
> > it does is mapping between tesseract api functions [2] with java methods.
> > In between it does to some image <=> byte array like conversions by using
> > that bitmap libraries in Android
> >
> > 3. There are two ways. 1: We can port it's code to make compatible with
> our
> > environments(linux,windows and mac) which is really painful. Also it will
> > cause memory leaks. 2: We can use only it's function signatures and
> > implement using our codes
> >
> > I think 2nd solution is better because we need only few operations to be
> > done using tesseract library. I have created a github repo [3] for this.
> > It's still not finished. I need to add some make files and build files to
> > make it run properly. And also I need to implement those wrapper
> functions
> > [3]. This may take some time.
> >
> > 4. Because we are calling native libraries we need different builds of
> > tesseract and leptonica libraries for each platform (dll for windows, so
> > for linux, dylib for mac). So we may need to build those libraries at the
> > time we build pdfbox project. Or we can pre build those libraries and add
> > them to the project as .dll, .so or .dylib format. What is the preferred
> > way?
> >
> > [1]
> >
> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/tessbaseapi.cpp
> > [2] https://code.google.com/p/tesseract-ocr/wiki/APIExample
> > [3] https://github.com/DImuthuUpe/Tesseract-API
> > [4]
> >
> https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/tessbaseapi.cpp
> >
> > Thanks
> > Dimuthu
> >
> >
> > On Sat, Mar 1, 2014 at 11:39 PM, DImuthu Upeksha <
> dimuthu.upeksha2@gmail.com
> >> wrote:
> >
> >> I updated necessary changes to the document [1]
> >>
> >> For last two days I had a deep look at this [2] jni wrapper for
> tessaract
> >> api.
> >> Unfortunately this has been designed for Android environment so I think
> we
> >> need to write our own make files to build this in to a dll(windows) or
> >> dylib(in mac). Currently it has Android.mk files [3]. I'm searching for
> a
> >> way to convert it to a make file that we can run on console. Please
> suggest
> >> if you have a better approach
> >>
> >> [1]
> >>
> https://www.dropbox.com/s/9qclvq26divwr2q/Optical%20Character%20Recognition%20for%20PDFBox%20-%20updated.pdf
> >> [2]
> >>
> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/
> >> [3]
> >>
> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/Android.mk
> >>
> >>
> >> On Sat, Mar 1, 2014 at 12:27 AM, John Hewson <jo...@jahewson.com> wrote:
> >>
> >>> This is a good start. However, there is no need for the Adder
> component,
> >>> "Extracted Text (OCR) can just feed back into the PDFBox "Text
> Extractor".
> >>>
> >>> Maybe show a "PDF" file feeding in to "Text Extractor, to make it clear
> >>> where the process starts.
> >>>
> >>> -- John
> >>>
> >>> On 26 Feb 2014, at 16:53, DImuthu Upeksha <di...@gmail.com>
> >>> wrote:
> >>>
> >>>> Sorry for the mistake. I added it to my Dropbox [1].
> >>>>
> >>>> [1]
> >>>>
> >>>
> https://www.dropbox.com/s/y3m15rfjmw4eqij/Optical%20Character%20Recognition%20for%20PDFBox.pdf
> >>>>
> >>>> Thanks
> >>>> Dimuthu
> >>>>
> >>>>
> >>>> On Thu, Feb 27, 2014 at 4:44 AM, John Hewson <jo...@jahewson.com>
> wrote:
> >>>>
> >>>>> I should add that the OCR engine should be pluggable so PDFToText
> might
> >>>>> use an interface, e.g. OCREngine and there will be a
> TesseractOCREngine
> >>>>> class somewhere which provides the required functionality and lives
> in
> >>> a
> >>>>> separate jar file.
> >>>>>
> >>>>> -- John
> >>>>>
> >>>>>> On 25 Feb 2014, at 20:18, Dimuthu <di...@gmail.com>
> wrote:
> >>>>>>
> >>>>>> So do you need to embed those new functionalities into existing
> >>>>> PDFtoText algorithms or package them as a new sub system(something
> >>> like an
> >>>>> API)?
> >>>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: "John Hewson" <jo...@jahewson.com>
> >>>>>> Sent: 26/02/2014 07:38
> >>>>>> To: "dev@pdfbox.apache.org" <de...@pdfbox.apache.org>
> >>>>>> Subject: Re: [GSoC 2014]Optical Character Recognition project -
> >>>>> Introduction
> >>>>>>
> >>>>>> Yes, exactly. By location data I just mean (x,y) coordinates and
> page
> >>>>> rotation.
> >>>>>>
> >>>>>> There is another use case for OCR: some fonts embedded in PDFs have
> >>>>> corrupt encodings, which means the ACSII codes map to the wrong
> >>> glyphs. We
> >>>>> could OCR the glyphs to repair the encoding.
> >>>>>>
> >>>>>> -- John
> >>>>>>
> >>>>>>> On 25 Feb 2014, at 17:13, DImuthu Upeksha <
> >>> dimuthu.upeksha2@gmail.com>
> >>>>> wrote:
> >>>>>>>
> >>>>>>> Hi John,
> >>>>>>> Thanks for the explanation.
> >>>>>>> Let's say there is a pdf with both text in extractable format and
> >>> some
> >>>>>>> images with text(Scanned images). In that case first we extract
> those
> >>>>>>> extractable content using PDFBox algorithms and rest is extracted
> >>> using
> >>>>>>> OCR. Finally we pack both results together and give output as
> >>>>> PDFToText. Am
> >>>>>>> I correct? What do you mean by "location data"?
> >>>>>>>
> >>>>>>> Thanks
> >>>>>>> Dimuthu
> >>>>>>>
> >>>>>>>
> >>>>>>>> On Tue, Feb 25, 2014 at 11:22 PM, John Hewson <jo...@jahewson.com>
> >>>>> wrote:
> >>>>>>>>
> >>>>>>>> 1. What is called "glyphs" ?
> >>>>>>>>
> >>>>>>>> http://en.wikipedia.org/wiki/Glyph
> >>>>>>>>
> >>>>>>>>> 2. What is the main requirement of this project?
> >>>>>>>>> As far as I understood, first we need to generate an image of
> >>>>>>>>> malformed pdfs from
> >>>>>>>>> PDFBox and then we need to do processing using OCR for further
> >>>>> accurate
> >>>>>>>>> results.  But the problem is, why shouldn't we directly do OCR on
> >>>>> those
> >>>>>>>>> PDFs without getting output from PDFBox? Correct me if I'm wrong.
> >>>>>>>>
> >>>>>>>> PDFBox can generate images (PDFToImage) and can extract text
> >>>>> (PDFToText).
> >>>>>>>> The goal of
> >>>>>>>> this project is to enhance PDFToText so that it can use OCR to
> >>> extract
> >>>>>>>> text from areas of the
> >>>>>>>> document where the text is embedded as an image. Such PDF files
> are
> >>>>>>>> typically generated by
> >>>>>>>> scanners or fax machines. There is also another case where OCR is
> >>>>> useful:
> >>>>>>>> some fonts embedded
> >>>>>>>> in PDF files contain the wrong encoding, so when text is extracted
> >>> with
> >>>>>>>> PDFToText the result is
> >>>>>>>> nonsense but when drawn with PDFToImage we see the correct
> letters.
> >>>>>>>>
> >>>>>>>> Instead of:
> >>>>>>>> PDF => Image => OCR => Text
> >>>>>>>>
> >>>>>>>> We want to do:
> >>>>>>>> PDF => (Many images for words + location data => OCR) => Text
> >>>>>>>>
> >>>>>>>> -- John
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha <
> >>>>>>>> dimuthu.upeksha2@gmail.com
> >>>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Ok fixed. This is what I did
> >>>>>>>>>> Right click on the new project ->Debug As-> Debug Configurations
> >>>>>>>> ->Source
> >>>>>>>>>> ->Add -> Project
> >>>>>>>>>> Then I selected PDFBox project.
> >>>>>>>>>>
> >>>>>>>>>> Thanks
> >>>>>>>>>> Dimuthu
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha <
> >>>>>>>>>> dimuthu.upeksha2@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> I'm using eclipse. This is what I want. I created a new Java
> >>>>>>>> application
> >>>>>>>>>>> project (say TestPDFBox) with a main class with following code.
> >>>>>>>>>>>
> >>>>>>>>>>> PDDocument document = new PDDocument();PDPage blankPage = new
> >>>>>>>> PDPage();document.addPage( blankPage
> >>>>>>>> );document.save("BlankPage.pdf");document.close();
> >>>>>>>>>>>
> >>>>>>>>>>> Then I need to add those jar files generated in target folder
> of
> >>>>> PDFBox
> >>>>>>>>>>> to build path of my new project (I did build the PDFBox project
> >>> from
> >>>>>>>>>>> source). That is what I did. But let's say I need to check  the
> >>>>>>>>>>> functionality of document.save("") method. But I don't have a
> >>>>>>>> reference to
> >>>>>>>>>>> it's sources because I directly used generated jars. As Tilman
> >>> said
> >>>>> I
> >>>>>>>> built
> >>>>>>>>>>> PDFBox from sources but I don't know a proper way to use it
> other
> >>>>>>>> projects
> >>>>>>>>>>> other than adding those jar files to build path.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <
> john@jahewson.com>
> >>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Which IDE are you using? You should be able to run the
> PDFToText
> >>>>> class
> >>>>>>>>>>>> (in pdfbox-tools) using your IDE and pass a PDF file path as
> the
> >>>>>>>> command
> >>>>>>>>>>>> line argument.
> >>>>>>>>>>>>
> >>>>>>>>>>>> -- John
> >>>>>>>>>>>>
> >>>>>>>>>>>>> On 24 Feb 2014, at 22:38, DImuthu Upeksha <
> >>>>>>>> dimuthu.upeksha2@gmail.com>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Hi John,
> >>>>>>>>>>>>> Thanks for the reply. Yes I checked out PDFBox code and
> >>> managed to
> >>>>>>>>>>>> build
> >>>>>>>>>>>>> code successfully. I looked at the classes you mentioned and
> I
> >>>>> got a
> >>>>>>>>>>>> rough
> >>>>>>>>>>>>> idea about how they are working. To check them I used the
> jars
> >>> in
> >>>>>>>>>>>> target
> >>>>>>>>>>>>> folder to my separate java project. I tried samples in
> >>>>>>>>>>>>> http://pdfbox.apache.org/cookbook/. I need to further look
> >>> into
> >>>>> code
> >>>>>>>>>>>>> specially how those processXXX() methods work in
> >>> PDFTextStripper
> >>>>>>>> class.
> >>>>>>>>>>>>> What I usually do is adding some berakpoints and checking
> them
> >>> in
> >>>>>>>> debug
> >>>>>>>>>>>>> windows. But using jars it's not possible. What is the way
> you
> >>>>> follow
> >>>>>>>>>>>> in
> >>>>>>>>>>>>> order to do such task?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> As well I installed tesseract in to my machine and managed to
> >>> do
> >>>>> some
> >>>>>>>>>>>> OCR
> >>>>>>>>>>>>> stuff also. That's a cool tool which works fine.
> >>>>>>>>>>>>> I'm still learning the code. If I get any issue I'll drop
> you a
> >>>>> mail.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks
> >>>>>>>>>>>>> Dimuthu
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <
> >>> john@jahewson.com
> >>>>>>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi Dimuthu
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> The PDFBox website can be found at
> http://pdfbox.apache.org/it
> >>>>>>>>>>>> contains
> >>>>>>>>>>>>>> a basic overview of the project
> >>>>>>>>>>>>>> and details on how to obtain the source code and build
> PDFBox
> >>> for
> >>>>>>>>>>>> yourself.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Currently we do not perform any OCR and PDFBOX-1912 details
> >>> the
> >>>>> only
> >>>>>>>>>>>>>> thoughts so far regarding it.
> >>>>>>>>>>>>>> Note that the OCR libraries mentioned in the JIRA issue are
> >>> all
> >>>>>>>> under
> >>>>>>>>>>>> the
> >>>>>>>>>>>>>> Apache license, which is a
> >>>>>>>>>>>>>> requirement.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Once you have the source code, take a look at the PageDrawer
> >>>>> class
> >>>>>>>> to
> >>>>>>>>>>>> see
> >>>>>>>>>>>>>> how text and images are
> >>>>>>>>>>>>>> rendered. We want someone to interface at a low-level (e.g.
> >>> one
> >>>>>>>> glyph,
> >>>>>>>>>>>>>> word, or sentence at a time) with
> >>>>>>>>>>>>>> an OCR engine. Also look at PDFTextStripper which is how
> text
> >>> is
> >>>>>>>>>>>> currently
> >>>>>>>>>>>>>> extracted, take a look at how
> >>>>>>>>>>>>>> we have to go to great length to sort text back into reading
> >>>>> order
> >>>>>>>> and
> >>>>>>>>>>>>>> infer the placement of diacritics - PDF
> >>>>>>>>>>>>>> is fundamentally a visual format, not a structured format
> like
> >>>>> HTML
> >>>>>>>> -
> >>>>>>>>>>>>>> which is why extracting text can be so
> >>>>>>>>>>>>>> difficult sometimes.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> The full PDF Reference document can be found at:
> >>>>>>>>
> >>>>>
> >>>
> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Feel free to discuss specifics of your proposal or ask any
> >>>>>>>> questions.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> -- John
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On 23 Feb 2014, at 21:13, DImuthu Upeksha <
> >>>>>>>> dimuthu.upeksha2@gmail.com
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>> I am Dimuthu Upeksha, a Computer Engineering Undergraduate
> at
> >>>>>>>>>>>> University
> >>>>>>>>>>>>>> of Moratuwa Sri Lanka. I successfully completed my GSoC 2013
> >>> with
> >>>>>>>>>>>> Apache
> >>>>>>>>>>>>>> ISIS [1] project. I'm very much interested in OCR and image
> >>>>>>>> processing
> >>>>>>>>>>>>>> stuff. So I would like to select this project idea as my
> GSoC
> >>>>> 2014
> >>>>>>>>>>>> project
> >>>>>>>>>>>>>> because I feel like it is the best suited project for me. In
> >>>>>>>>>>>> university
> >>>>>>>>>>>>>> also we have done some research in OCR area and our group
> >>> wrote a
> >>>>>>>>>>>>>> literature review about increasing efficiency of OCR
> >>>>>>>>>>>> systems(attached). Can
> >>>>>>>>>>>>>> you please suggest me where to start learning about PDFBox?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> [1]
> >>>>>>>>
> >>>>>
> >>>
> http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thank you
> >>>>>>>>>>>>>>> Dimuthu
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> --
> >>>>>>>>>>>>>>> Regards
> >>>>>>>>>>>>>>> W.Dimuthu Upeksha
> >>>>>>>>>>>>>>> Undergraduate
> >>>>>>>>>>>>>>> Department of Computer Science And Engineering
> >>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka


-- 
Regards

W.Dimuthu Upeksha
Undergraduate
Department of Computer Science And Engineering

University of Moratuwa, Sri Lanka

Re: [GSoC 2014]Optical Character Recognition project - Introduction

Posted by John Hewson <jo...@jahewson.com>.
Hi Dimuthu,

1,2,3:

Feel free to write your own Tesseract binding or port the existing code as you see fit.
The JNI binding should be minimal: only the methods you require need to be wrapped.
Also, don’t forget that some of the interop can be done in Java. For example, if it is easier
to convert a BufferedImage to a byte array in Java, then do it there and pass the result
to JNI rather than writing lots of JNI C++ to achieve the same result.
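
As a rough illustration of doing the image conversion on the Java side, the sketch below packs a BufferedImage into one grayscale byte per pixel, so the native side only ever receives a flat array. The class name ImageBytes, the luma weights and the commented-out native method recognize are assumptions made for this example; they are not part of PDFBox or of any existing Tesseract binding.

```java
import java.awt.image.BufferedImage;

final class ImageBytes {

    /**
     * Packs an image into 8-bit grayscale, row-major, one byte per pixel.
     * The integer luma weights (299/587/114) are an illustrative choice.
     */
    static byte[] toGrayBytes(BufferedImage src) {
        int width = src.getWidth();
        int height = src.getHeight();
        byte[] out = new byte[width * height];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int argb = src.getRGB(x, y);      // packed 0xAARRGGBB
                int r = (argb >> 16) & 0xFF;
                int g = (argb >> 8) & 0xFF;
                int b = argb & 0xFF;
                int gray = (r * 299 + g * 587 + b * 114) / 1000;
                out[y * width + x] = (byte) gray;
            }
        }
        return out;
    }

    // Hypothetical JNI entry point: the only thing the C++ side would need
    // to wrap is a call that accepts the flat buffer and its dimensions.
    // static native String recognize(byte[] pixels, int width, int height);
}
```

A buffer in this layout could be handed to Tesseract's TessBaseAPI::SetImage with bytes_per_pixel = 1 and bytes_per_line = width; the JNI glue itself is left open here.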

Your GitHub repo looks like a good start; I can make comments there as things progress.

Is it possible to build Tesseract without Leptonica? I was under the impression that it was
used for image I/O only, but I may be misinformed.

4: The native platform library should be built as part of the Maven build for the Tesseract
wrapper, which can be a separate project. The output can be a jar file which contains the
native binaries. It should be possible for the jar to contain prebuilt binaries for all platforms,
but this is something we can worry about later. Right now the goal should be to build a jar
containing just the current platform’s native binary and any Java wrapper code.
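
One possible way to consume such a jar at runtime is sketched below: pick the bundled binary by platform, copy it to a temporary file, and load it. The /native/<platform>/ resource layout, the NativeLoader class and the library base name passed in are all hypothetical; a real build could arrange the jar differently.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Locale;

final class NativeLoader {

    /** Maps an os.name value to the (assumed) directory name inside the jar. */
    static String platformDir(String osName) {
        String os = osName.toLowerCase(Locale.ROOT);
        if (os.contains("mac") || os.contains("darwin")) return "mac";
        if (os.contains("win")) return "windows";
        return "linux";
    }

    /** Builds the resource path: dll for Windows, so for Linux, dylib for Mac. */
    static String resourcePath(String osName, String baseName) {
        String dir = platformDir(osName);
        String file;
        if (dir.equals("windows")) {
            file = baseName + ".dll";
        } else if (dir.equals("mac")) {
            file = "lib" + baseName + ".dylib";
        } else {
            file = "lib" + baseName + ".so";
        }
        return "/native/" + dir + "/" + file;
    }

    /** Extracts the bundled binary to a temp file and loads it into the JVM. */
    static void load(String baseName) throws IOException {
        String res = resourcePath(System.getProperty("os.name"), baseName);
        String suffix = res.substring(res.lastIndexOf('.'));
        try (InputStream in = NativeLoader.class.getResourceAsStream(res)) {
            if (in == null) {
                throw new IOException("No native binary bundled at " + res);
            }
            Path tmp = Files.createTempFile(baseName, suffix);
            tmp.toFile().deleteOnExit();
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            System.load(tmp.toAbsolutePath().toString());
        }
    }
}
```

With only the current platform's binary in the jar, load fails with an IOException on any other platform, which matches the "worry about it later" scope above.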

-- John

On 3 Mar 2014, at 16:41, DImuthu Upeksha <di...@gmail.com> wrote:

> Hi John,
> 
> I tried to reuse that Android JNI wrapper for Tesseract. Here are my
> observations:
>
> 1. The wrapper depends heavily on Android image libraries
> (android/bitmap.h). Most of the wrapper methods [1] use this library.
>
> 2. I can understand the underlying logic in each function, though. Basically
> it maps Tesseract API functions [2] to Java methods. In between, it does
> some image <=> byte array conversions using those Android bitmap
> libraries.
>
> 3. There are two options. 1: We can port its code to make it compatible with
> our environments (Linux, Windows and Mac), which is really painful. It could
> also cause memory leaks. 2: We can reuse only its function signatures and
> implement them with our own code.
>
> I think the 2nd option is better because we need only a few operations
> from the Tesseract library. I have created a GitHub repo [3] for this.
> It's still not finished. I need to add some make files and build files to
> make it run properly. I also need to implement those wrapper functions
> [4]. This may take some time.
>
> 4. Because we are calling native libraries, we need different builds of
> the Tesseract and Leptonica libraries for each platform (dll for Windows, so
> for Linux, dylib for Mac). So we may need to build those libraries when
> we build the PDFBox project. Or we can prebuild those libraries and add
> them to the project in .dll, .so or .dylib format. Which is the preferred
> way?
> 
> [1]
> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/tessbaseapi.cpp
> [2] https://code.google.com/p/tesseract-ocr/wiki/APIExample
> [3] https://github.com/DImuthuUpe/Tesseract-API
> [4]
> https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/tessbaseapi.cpp
> 
> Thanks
> Dimuthu
> 
> 
> On Sat, Mar 1, 2014 at 11:39 PM, DImuthu Upeksha <dimuthu.upeksha2@gmail.com
>> wrote:
> 
>> I updated necessary changes to the document [1]
>> 
>> For last two days I had a deep look at this [2] jni wrapper for tessaract
>> api.
>> Unfortunately this has been designed for Android environment so I think we
>> need to write our own make files to build this in to a dll(windows) or
>> dylib(in mac). Currently it has Android.mk files [3]. I'm searching for a
>> way to convert it to a make file that we can run on console. Please suggest
>> if you have a better approach
>> 
>> [1]
>> https://www.dropbox.com/s/9qclvq26divwr2q/Optical%20Character%20Recognition%20for%20PDFBox%20-%20updated.pdf
>> [2]
>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/
>> [3]
>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/Android.mk
>> 
>> 
>> On Sat, Mar 1, 2014 at 12:27 AM, John Hewson <jo...@jahewson.com> wrote:
>> 
>>> This is a good start. However, there is no need for the Adder component,
>>> "Extracted Text (OCR) can just feed back into the PDFBox "Text Extractor".
>>> 
>>> Maybe show a "PDF" file feeding in to "Text Extractor, to make it clear
>>> where the process starts.
>>> 
>>> -- John
>>> 
>>> On 26 Feb 2014, at 16:53, DImuthu Upeksha <di...@gmail.com>
>>> wrote:
>>> 
>>>> Sorry for the mistake. I added it to my Dropbox [1].
>>>> 
>>>> [1]
>>>> 
>>> https://www.dropbox.com/s/y3m15rfjmw4eqij/Optical%20Character%20Recognition%20for%20PDFBox.pdf
>>>> 
>>>> Thanks
>>>> Dimuthu
>>>> 
>>>> 
>>>> On Thu, Feb 27, 2014 at 4:44 AM, John Hewson <jo...@jahewson.com> wrote:
>>>> 
>>>>> I should add that the OCR engine should be pluggable so PDFToText might
>>>>> use an interface, e.g. OCREngine and there will be a TesseractOCREngine
>>>>> class somewhere which provides the required functionality and lives in
>>> a
>>>>> separate jar file.
>>>>> 
>>>>> -- John
>>>>> 
>>>>>> On 25 Feb 2014, at 20:18, Dimuthu <di...@gmail.com> wrote:
>>>>>> 
>>>>>> So do you need to embed those new functionalities into existing
>>>>> PDFtoText algorithms or package them as a new sub system(something
>>> like an
>>>>> API)?
>>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: "John Hewson" <jo...@jahewson.com>
>>>>>> Sent: 26/02/2014 07:38
>>>>>> To: "dev@pdfbox.apache.org" <de...@pdfbox.apache.org>
>>>>>> Subject: Re: [GSoC 2014]Optical Character Recognition project -
>>>>> Introduction
>>>>>> 
>>>>>> Yes, exactly. By location data I just mean (x,y) coordinates and page
>>>>> rotation.
>>>>>> 
>>>>>> There is another use case for OCR: some fonts embedded in PDFs have
>>>>> corrupt encodings, which means the ASCII codes map to the wrong
>>> glyphs. We
>>>>> could OCR the glyphs to repair the encoding.
>>>>>> 
>>>>>> -- John
>>>>>> 
>>>>>>> On 25 Feb 2014, at 17:13, DImuthu Upeksha <
>>> dimuthu.upeksha2@gmail.com>
>>>>> wrote:
>>>>>>> 
>>>>>>> Hi John,
>>>>>>> Thanks for the explanation.
>>>>>>> Let's say there is a pdf with both text in extractable format and
>>> some
>>>>>>> images with text(Scanned images). In that case first we extract those
>>>>>>> extractable content using PDFBox algorithms and rest is extracted
>>> using
>>>>>>> OCR. Finally we pack both results together and give output as
>>>>> PDFToText. Am
>>>>>>> I correct? What do you mean by "location data"?
>>>>>>> 
>>>>>>> Thanks
>>>>>>> Dimuthu
>>>>>>> 
>>>>>>> 
>>>>>>>> On Tue, Feb 25, 2014 at 11:22 PM, John Hewson <jo...@jahewson.com>
>>>>> wrote:
>>>>>>>> 
>>>>>>>> 1. What is called "glyphs" ?
>>>>>>>> 
>>>>>>>> http://en.wikipedia.org/wiki/Glyph
>>>>>>>> 
>>>>>>>>> 2. What is the main requirement of this project?
>>>>>>>>> As far as I understood, first we need to generate an image of
>>>>>>>>> malformed pdfs from
>>>>>>>>> PDFBox and then we need to do processing using OCR for further
>>>>> accurate
>>>>>>>>> results.  But the problem is, why shouldn't we directly do OCR on
>>>>> those
>>>>>>>>> PDFs without getting output from PDFBox? Correct me if I'm wrong.
>>>>>>>> 
>>>>>>>> PDFBox can generate images (PDFToImage) and can extract text
>>>>> (PDFToText).
>>>>>>>> The goal of
>>>>>>>> this project is to enhance PDFToText so that it can use OCR to
>>> extract
>>>>>>>> text from areas of the
>>>>>>>> document where the text is embedded as an image. Such PDF files are
>>>>>>>> typically generated by
>>>>>>>> scanners or fax machines. There is also another case where OCR is
>>>>> useful:
>>>>>>>> some fonts embedded
>>>>>>>> in PDF files contain the wrong encoding, so when text is extracted
>>> with
>>>>>>>> PDFToText the result is
>>>>>>>> nonsense but when drawn with PDFToImage we see the correct letters.
>>>>>>>> 
>>>>>>>> Instead of:
>>>>>>>> PDF => Image => OCR => Text
>>>>>>>> 
>>>>>>>> We want to do:
>>>>>>>> PDF => (Many images for words + location data => OCR) => Text
>>>>>>>> 
>>>>>>>> -- John
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha <
>>>>>>>> dimuthu.upeksha2@gmail.com
>>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Ok fixed. This is what I did
>>>>>>>>>> Right click on the new project ->Debug As-> Debug Configurations
>>>>>>>> ->Source
>>>>>>>>>> ->Add -> Project
>>>>>>>>>> Then I selected PDFBox project.
>>>>>>>>>> 
>>>>>>>>>> Thanks
>>>>>>>>>> Dimuthu
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha <
>>>>>>>>>> dimuthu.upeksha2@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>>> I'm using eclipse. This is what I want. I created a new Java
>>>>>>>> application
>>>>>>>>>>> project (say TestPDFBox) with a main class with following code.
>>>>>>>>>>> 
>>>>>>>>>>> PDDocument document = new PDDocument();
>>>>>>>>>>> PDPage blankPage = new PDPage();
>>>>>>>>>>> document.addPage(blankPage);
>>>>>>>>>>> document.save("BlankPage.pdf");
>>>>>>>>>>> document.close();
>>>>>>>>>>> 
>>>>>>>>>>> Then I need to add those jar files generated in target folder of
>>>>> PDFBox
>>>>>>>>>>> to build path of my new project (I did build the PDFBox project
>>> from
>>>>>>>>>>> source). That is what I did. But let's say I need to check  the
>>>>>>>>>>> functionality of document.save("") method. But I don't have a
>>>>>>>> reference to
>>>>>>>>>>> it's sources because I directly used generated jars. As Tilman
>>> said
>>>>> I
>>>>>>>> built
>>>>>>>>>>> PDFBox from sources but I don't know a proper way to use it other
>>>>>>>> projects
>>>>>>>>>>> other than adding those jar files to build path.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <jo...@jahewson.com>
>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Which IDE are you using? You should be able to run the PDFToText
>>>>> class
>>>>>>>>>>>> (in pdfbox-tools) using your IDE and pass a PDF file path as the
>>>>>>>> command
>>>>>>>>>>>> line argument.
>>>>>>>>>>>> 
>>>>>>>>>>>> -- John
>>>>>>>>>>>> 
>>>>>>>>>>>>> On 24 Feb 2014, at 22:38, DImuthu Upeksha <
>>>>>>>> dimuthu.upeksha2@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi John,
>>>>>>>>>>>>> Thanks for the reply. Yes I checked out PDFBox code and
>>> managed to
>>>>>>>>>>>> build
>>>>>>>>>>>>> code successfully. I looked at the classes you mentioned and I
>>>>> got a
>>>>>>>>>>>> rough
>>>>>>>>>>>>> idea about how they are working. To check them I used the jars
>>> in
>>>>>>>>>>>> target
>>>>>>>>>>>>> folder to my separate java project. I tried samples in
>>>>>>>>>>>>> http://pdfbox.apache.org/cookbook/. I need to further look
>>> into
>>>>> code
>>>>>>>>>>>>> specially how those processXXX() methods work in
>>> PDFTextStripper
>>>>>>>> class.
>>>>>>>>>>>>> What I usually do is add some breakpoints and check them
>>> in
>>>>>>>> debug
>>>>>>>>>>>>> windows. But using jars it's not possible. What is the way you
>>>>> follow
>>>>>>>>>>>> in
>>>>>>>>>>>>> order to do such task?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> As well I installed tesseract in to my machine and managed to
>>> do
>>>>> some
>>>>>>>>>>>> OCR
>>>>>>>>>>>>> stuff also. That's a cool tool which works fine.
>>>>>>>>>>>>> I'm still learning the code. If I get any issue I'll drop you a
>>>>> mail.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <
>>> john@jahewson.com
>>>>>> 
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi Dimuthu
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> The PDFBox website can be found at http://pdfbox.apache.org/. It
>>>>>>>>>>>> contains
>>>>>>>>>>>>>> a basic overview of the project
>>>>>>>>>>>>>> and details on how to obtain the source code and build PDFBox
>>> for
>>>>>>>>>>>> yourself.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Currently we do not perform any OCR and PDFBOX-1912 details
>>> the
>>>>> only
>>>>>>>>>>>>>> thoughts so far regarding it.
>>>>>>>>>>>>>> Note that the OCR libraries mentioned in the JIRA issue are
>>> all
>>>>>>>> under
>>>>>>>>>>>> the
>>>>>>>>>>>>>> Apache license, which is a
>>>>>>>>>>>>>> requirement.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Once you have the source code, take a look at the PageDrawer
>>>>> class
>>>>>>>> to
>>>>>>>>>>>> see
>>>>>>>>>>>>>> how text and images are
>>>>>>>>>>>>>> rendered. We want someone to interface at a low-level (e.g.
>>> one
>>>>>>>> glyph,
>>>>>>>>>>>>>> word, or sentence at a time) with
>>>>>>>>>>>>>> an OCR engine. Also look at PDFTextStripper which is how text
>>> is
>>>>>>>>>>>> currently
>>>>>>>>>>>>>> extracted, take a look at how
>>>>>>>>>>>>>> we have to go to great length to sort text back into reading
>>>>> order
>>>>>>>> and
>>>>>>>>>>>>>> infer the placement of diacritics - PDF
>>>>>>>>>>>>>> is fundamentally a visual format, not a structured format like
>>>>> HTML
>>>>>>>> -
>>>>>>>>>>>>>> which is why extracting text can be so
>>>>>>>>>>>>>> difficult sometimes.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> The full PDF Reference document can be found at:
>>>>>>>> 
>>>>> 
>>> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Feel free to discuss specifics of your proposal or ask any
>>>>>>>> questions.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> -- John
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On 23 Feb 2014, at 21:13, DImuthu Upeksha <
>>>>>>>> dimuthu.upeksha2@gmail.com
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>> I am Dimuthu Upeksha, a Computer Engineering Undergraduate at
>>>>>>>>>>>> University
>>>>>>>>>>>>>> of Moratuwa Sri Lanka. I successfully completed my GSoC 2013
>>> with
>>>>>>>>>>>> Apache
>>>>>>>>>>>>>> ISIS [1] project. I'm very much interested in OCR and image
>>>>>>>> processing
>>>>>>>>>>>>>> stuff. So I would like to select this project idea as my GSoC
>>>>> 2014
>>>>>>>>>>>> project
>>>>>>>>>>>>>> because I feel like it is the best suited project for me. In
>>>>>>>>>>>> university
>>>>>>>>>>>>>> also we have done some research in OCR area and our group
>>> wrote a
>>>>>>>>>>>>>> literature review about increasing efficiency of OCR
>>>>>>>>>>>> systems(attached). Can
>>>>>>>>>>>>>> you please suggest me where to start learning about PDFBox?
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> [1]
>>>>>>>> 
>>>>> 
>>> http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thank you
>>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>>>>>> Undergraduate
>>>>>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka


Re: [GSoC 2014]Optical Character Recognition project - Introduction

Posted by DImuthu Upeksha <di...@gmail.com>.
Hi John,
I noticed your last reply only after sending my previous mail. Sorry
about that. I'm also using a Mac, and I use VMs to test other
platforms. I have done a lot of work with Maven. I'll go through the
plugin and try to apply it to that GitHub project.

Thanks
Dimuthu


On Tue, Mar 4, 2014 at 6:11 AM, DImuthu Upeksha
<di...@gmail.com> wrote:

> Hi John,
>
> I tried to reuse that Android JNI wrapper for Tesseract. Here are my
> observations:
>
> 1. The wrapper depends heavily on Android image libraries
> (android/bitmap.h). Most of the wrapper methods [1] use this library.
>
> 2. I can understand the underlying logic in each function, though. Basically
> it maps Tesseract API functions [2] to Java methods. In between, it does
> some image <=> byte array conversions using those Android bitmap
> libraries.
>
> 3. There are two options. 1: We can port its code to make it compatible with
> our environments (Linux, Windows and Mac), which is really painful. It could
> also cause memory leaks. 2: We can reuse only its function signatures and
> implement them with our own code.
>
> I think the 2nd option is better because we need only a few operations
> from the Tesseract library. I have created a GitHub repo [3] for this.
> It's still not finished. I need to add some make files and build files to
> make it run properly. I also need to implement those wrapper functions
> [4]. This may take some time.
>
> 4. Because we are calling native libraries, we need different builds of
> the Tesseract and Leptonica libraries for each platform (dll for Windows, so
> for Linux, dylib for Mac). So we may need to build those libraries when
> we build the PDFBox project. Or we can prebuild those libraries and add
> them to the project in .dll, .so or .dylib format. Which is the preferred
> way?
>
> [1]
> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/tessbaseapi.cpp
> [2] https://code.google.com/p/tesseract-ocr/wiki/APIExample
> [3] https://github.com/DImuthuUpe/Tesseract-API
> [4]
> https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/tessbaseapi.cpp
>
> Thanks
> Dimuthu
>
>
> On Sat, Mar 1, 2014 at 11:39 PM, DImuthu Upeksha <
> dimuthu.upeksha2@gmail.com> wrote:
>
>> I updated necessary changes to the document [1]
>>
>> For last two days I had a deep look at this [2] jni wrapper for tessaract
>> api.
>> Unfortunately this has been designed for Android environment so I think
>> we need to write our own make files to build this in to a dll(windows) or
>> dylib(in mac). Currently it has Android.mk files [3]. I'm searching for a
>> way to convert it to a make file that we can run on console. Please suggest
>> if you have a better approach
>>
>> [1]
>> https://www.dropbox.com/s/9qclvq26divwr2q/Optical%20Character%20Recognition%20for%20PDFBox%20-%20updated.pdf
>> [2]
>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/
>> [3]
>> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/Android.mk
>>
>>
>> On Sat, Mar 1, 2014 at 12:27 AM, John Hewson <jo...@jahewson.com> wrote:
>>
>>> This is a good start. However, there is no need for the Adder component,
>>> "Extracted Text (OCR) can just feed back into the PDFBox "Text Extractor".
>>>
>>> Maybe show a "PDF" file feeding in to "Text Extractor, to make it clear
>>> where the process starts.
>>>
>>> -- John
>>>
>>> On 26 Feb 2014, at 16:53, DImuthu Upeksha <di...@gmail.com>
>>> wrote:
>>>
>>> > Sorry for the mistake. I added it to my Dropbox [1].
>>> >
>>> > [1]
>>> >
>>> https://www.dropbox.com/s/y3m15rfjmw4eqij/Optical%20Character%20Recognition%20for%20PDFBox.pdf
>>> >
>>> > Thanks
>>> > Dimuthu
>>> >
>>> >
>>> > On Thu, Feb 27, 2014 at 4:44 AM, John Hewson <jo...@jahewson.com>
>>> wrote:
>>> >
>>> >> I should add that the OCR engine should be pluggable so PDFToText
>>> might
>>> >> use an interface, e.g. OCREngine and there will be a
>>> TesseractOCREngine
>>> >> class somewhere which provides the required functionality and lives
>>> in a
>>> >> separate jar file.
>>> >>
>>> >> -- John
>>> >>
>>> >>> On 25 Feb 2014, at 20:18, Dimuthu <di...@gmail.com>
>>> wrote:
>>> >>>
>>> >>> So do you need to embed those new functionalities into existing
>>> >> PDFtoText algorithms or package them as a new sub system(something
>>> like an
>>> >> API)?
>>> >>>
>>> >>> -----Original Message-----
>>> >>> From: "John Hewson" <jo...@jahewson.com>
>>> >>> Sent: 26/02/2014 07:38
>>> >>> To: "dev@pdfbox.apache.org" <de...@pdfbox.apache.org>
>>> >>> Subject: Re: [GSoC 2014]Optical Character Recognition project -
>>> >> Introduction
>>> >>>
>>> >>> Yes, exactly. By location data I just mean (x,y) coordinates and page
>>> >> rotation.
>>> >>>
>>> >>> There is another use case for OCR: some fonts embedded in PDFs have
>>> >> corrupt encodings, which means the ASCII codes map to the wrong
>>> glyphs. We
>>> >> could OCR the glyphs to repair the encoding.
>>> >>>
>>> >>> -- John
>>> >>>
>>> >>>> On 25 Feb 2014, at 17:13, DImuthu Upeksha <
>>> dimuthu.upeksha2@gmail.com>
>>> >> wrote:
>>> >>>>
>>> >>>> Hi John,
>>> >>>> Thanks for the explanation.
>>> >>>> Let's say there is a pdf with both text in extractable format and
>>> some
>>> >>>> images with text(Scanned images). In that case first we extract
>>> those
>>> >>>> extractable content using PDFBox algorithms and rest is extracted
>>> using
>>> >>>> OCR. Finally we pack both results together and give output as
>>> >> PDFToText. Am
>>> >>>> I correct? What do you mean by "location data"?
>>> >>>>
>>> >>>> Thanks
>>> >>>> Dimuthu
>>> >>>>
>>> >>>>
>>> >>>>> On Tue, Feb 25, 2014 at 11:22 PM, John Hewson <jo...@jahewson.com>
>>> >> wrote:
>>> >>>>>
>>> >>>>> 1. What is called "glyphs" ?
>>> >>>>>
>>> >>>>> http://en.wikipedia.org/wiki/Glyph
>>> >>>>>
>>> >>>>>> 2. What is the main requirement of this project?
>>> >>>>>> As far as I understood, first we need to generate an image of
>>> >>>>>> malformed pdfs from
>>> >>>>>> PDFBox and then we need to do processing using OCR for further
>>> >> accurate
>>> >>>>>> results.  But the problem is, why shouldn't we directly do OCR on
>>> >> those
>>> >>>>>> PDFs without getting output from PDFBox? Correct me if I'm wrong.
>>> >>>>>
>>> >>>>> PDFBox can generate images (PDFToImage) and can extract text
>>> >> (PDFToText).
>>> >>>>> The goal of
>>> >>>>> this project is to enhance PDFToText so that it can use OCR to
>>> extract
>>> >>>>> text from areas of the
>>> >>>>> document where the text is embedded as an image. Such PDF files are
>>> >>>>> typically generated by
>>> >>>>> scanners or fax machines. There is also another case where OCR is
>>> >> useful:
>>> >>>>> some fonts embedded
>>> >>>>> in PDF files contain the wrong encoding, so when text is extracted
>>> with
>>> >>>>> PDFToText the result is
>>> >>>>> nonsense but when drawn with PDFToImage we see the correct letters.
>>> >>>>>
>>> >>>>> Instead of:
>>> >>>>> PDF => Image => OCR => Text
>>> >>>>>
>>> >>>>> We want to do:
>>> >>>>> PDF => (Many images for words + location data => OCR) => Text
>>> >>>>>
>>> >>>>> -- John
>>> >>>>>
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha <
>>> >>>>> dimuthu.upeksha2@gmail.com
>>> >>>>>>> wrote:
>>> >>>>>>
>>> >>>>>>> Ok fixed. This is what I did
>>> >>>>>>> Right click on the new project ->Debug As-> Debug Configurations
>>> >>>>> ->Source
>>> >>>>>>> ->Add -> Project
>>> >>>>>>> Then I selected PDFBox project.
>>> >>>>>>>
>>> >>>>>>> Thanks
>>> >>>>>>> Dimuthu
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha <
>>> >>>>>>> dimuthu.upeksha2@gmail.com> wrote:
>>> >>>>>>>
>>> >>>>>>>> I'm using eclipse. This is what I want. I created a new Java
>>> >>>>> application
>>> >>>>>>>> project (say TestPDFBox) with a main class with following code.
>>> >>>>>>>>
>>> >>>>>>>> PDDocument document = new PDDocument();
>>> >>>>>>>> PDPage blankPage = new PDPage();
>>> >>>>>>>> document.addPage(blankPage);
>>> >>>>>>>> document.save("BlankPage.pdf");
>>> >>>>>>>> document.close();
>>> >>>>>>>>
>>> >>>>>>>> Then I need to add those jar files generated in target folder of
>>> >> PDFBox
>>> >>>>>>>> to build path of my new project (I did build the PDFBox project
>>> from
>>> >>>>>>>> source). That is what I did. But let's say I need to check  the
>>> >>>>>>>> functionality of document.save("") method. But I don't have a
>>> >>>>> reference to
>>> >>>>>>>> it's sources because I directly used generated jars. As Tilman
>>> said
>>> >> I
>>> >>>>> built
>>> >>>>>>>> PDFBox from sources but I don't know a proper way to use it
>>> other
>>> >>>>> projects
>>> >>>>>>>> other than adding those jar files to build path.
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <john@jahewson.com
>>> >
>>> >>>>> wrote:
>>> >>>>>>>>
>>> >>>>>>>>> Which IDE are you using? You should be able to run the
>>> PDFToText
>>> >> class
>>> >>>>>>>>> (in pdfbox-tools) using your IDE and pass a PDF file path as
>>> the
>>> >>>>> command
>>> >>>>>>>>> line argument.
>>> >>>>>>>>>
>>> >>>>>>>>> -- John
>>> >>>>>>>>>
>>> >>>>>>>>>> On 24 Feb 2014, at 22:38, DImuthu Upeksha <
>>> >>>>> dimuthu.upeksha2@gmail.com>
>>> >>>>>>>>> wrote:
>>> >>>>>>>>>>
>>> >>>>>>>>>> Hi John,
>>> >>>>>>>>>> Thanks for the reply. Yes I checked out PDFBox code and
>>> managed to
>>> >>>>>>>>> build
>>> >>>>>>>>>> code successfully. I looked at the classes you mentioned and I
>>> >> got a
>>> >>>>>>>>> rough
>>> >>>>>>>>>> idea about how they are working. To check them I used the
>>> jars in
>>> >>>>>>>>> target
>>> >>>>>>>>>> folder to my separate java project. I tried samples in
>>> >>>>>>>>>> http://pdfbox.apache.org/cookbook/. I need to further look
>>> into
>>> >> code
>>> >>>>>>>>>> specially how those processXXX() methods work in
>>> PDFTextStripper
>>> >>>>> class.
>>> >>>>>>>>>> What I usually do is add some breakpoints and check
>>> them in
>>> >>>>> debug
>>> >>>>>>>>>> windows. But using jars it's not possible. What is the way you
>>> >> follow
>>> >>>>>>>>> in
>>> >>>>>>>>>> order to do such task?
>>> >>>>>>>>>>
>>> >>>>>>>>>> As well I installed tesseract in to my machine and managed to
>>> do
>>> >> some
>>> >>>>>>>>> OCR
>>> >>>>>>>>>> stuff also. That's a cool tool which works fine.
>>> >>>>>>>>>> I'm still learning the code. If I get any issue I'll drop you
>>> a
>>> >> mail.
>>> >>>>>>>>>>
>>> >>>>>>>>>> Thanks
>>> >>>>>>>>>> Dimuthu
>>> >>>>>>>>>>
>>> >>>>>>>>>>
>>> >>>>>>>>>>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <
>>> john@jahewson.com
>>> >>>
>>> >>>>>>>>> wrote:
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Hi Dimuthu
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> The PDFBox website can be found at http://pdfbox.apache.org/. It
>>> >>>>>>>>> contains
>>> >>>>>>>>>>> a basic overview of the project
>>> >>>>>>>>>>> and details on how to obtain the source code and build
>>> PDFBox for
>>> >>>>>>>>> yourself.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Currently we do not perform any OCR and PDFBOX-1912 details
>>> the
>>> >> only
>>> >>>>>>>>>>> thoughts so far regarding it.
>>> >>>>>>>>>>> Note that the OCR libraries mentioned in the JIRA issue are
>>> all
>>> >>>>> under
>>> >>>>>>>>> the
>>> >>>>>>>>>>> Apache license, which is a
>>> >>>>>>>>>>> requirement.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Once you have the source code, take a look at the PageDrawer
>>> >> class
>>> >>>>> to
>>> >>>>>>>>> see
>>> >>>>>>>>>>> how text and images are
>>> >>>>>>>>>>> rendered. We want someone to interface at a low-level (e.g.
>>> one
>>> >>>>> glyph,
>>> >>>>>>>>>>> word, or sentence at a time) with
>>> >>>>>>>>>>> an OCR engine. Also look at PDFTextStripper which is how
>>> text is
>>> >>>>>>>>> currently
>>> >>>>>>>>>>> extracted, take a look at how
>>> >>>>>>>>>>> we have to go to great length to sort text back into reading
>>> >> order
>>> >>>>> and
>>> >>>>>>>>>>> infer the placement of diacritics - PDF
>>> >>>>>>>>>>> is fundamentally a visual format, not a structured format
>>> like
>>> >> HTML
>>> >>>>> -
>>> >>>>>>>>>>> which is why extracting text can be so
>>> >>>>>>>>>>> difficult sometimes.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> The full PDF Reference document can be found at:
>>> >>>>>
>>> >>
>>> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Feel free to discuss specifics of your proposal or ask any
>>> >>>>> questions.
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> Thanks,
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> -- John
>>> >>>>>>>>>>>
>>> >>>>>>>>>>> On 23 Feb 2014, at 21:13, DImuthu Upeksha <
>>> >>>>> dimuthu.upeksha2@gmail.com
>>> >>>>>>>>>>
>>> >>>>>>>>>>> wrote:
>>> >>>>>>>>>>>
>>> >>>>>>>>>>>> Hi,
>>> >>>>>>>>>>>> I am Dimuthu Upeksha, a Computer Engineering Undergraduate
>>> at
>>> >>>>>>>>> University
>>> >>>>>>>>>>> of Moratuwa Sri Lanka. I successfully completed my GSoC 2013
>>> with
>>> >>>>>>>>> Apache
>>> >>>>>>>>>>> ISIS [1] project. I'm very much interested in OCR and image
>>> >>>>> processing
>>> >>>>>>>>>>> stuff. So I would like to select this project idea as my GSoC
>>> >> 2014
>>> >>>>>>>>> project
>>> >>>>>>>>>>> because I feel like it is the best suited project for me. In
>>> >>>>>>>>> university
>>> >>>>>>>>>>> also we have done some research in OCR area and our group
>>> wrote a
>>> >>>>>>>>>>> literature review about increasing efficiency of OCR
>>> >>>>>>>>> systems(attached). Can
>>> >>>>>>>>>>> you please suggest me where to start learning about PDFBox?
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> [1]
>>> >>>>>
>>> >>
>>> http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> Thank you
>>> >>>>>>>>>>>> Dimuthu
>>> >>>>>>>>>>>>
>>> >>>>>>>>>>>> --
>>> >>>>>>>>>>>> Regards
>>> >>>>>>>>>>>> W.Dimuthu Upeksha
>>> >>>>>>>>>>>> Undergraduate
>>> >>>>>>>>>>>> Department of Computer Science And Engineering
>>> >>>>>>>>>>>> University of Moratuwa, Sri Lanka



-- 
Regards

W.Dimuthu Upeksha
Undergraduate
Department of Computer Science And Engineering

University of Moratuwa, Sri Lanka

Re: [GSoC 2014]Optical Character Recognition project - Introduction

Posted by DImuthu Upeksha <di...@gmail.com>.
Hi John,

I tried to reuse that Android JNI wrapper for Tesseract. Here are my
observations:

1. The wrapper depends heavily on the Android image library
(android/bitmap.h). Most of the wrapper methods [1] use it.

2. I can follow the underlying logic of each function, though. Basically it
maps the Tesseract API functions [2] to Java methods. In between it does
some image <=> byte array conversions using the Android bitmap library.

3. There are two options. 1: We can port its code to make it compatible with
our environments (Linux, Windows and Mac), which is really painful and may
also cause memory leaks. 2: We can reuse only its function signatures and
implement them with our own code.

I think the 2nd option is better because we need only a few operations from
the Tesseract library. I have created a GitHub repo [3] for this. It's not
finished yet: I need to add some makefiles and build files to make it build
properly, and I also need to implement the wrapper functions [4]. This may
take some time.

4. Because we are calling native libraries, we need a different build of the
Tesseract and Leptonica libraries for each platform (.dll for Windows, .so
for Linux, .dylib for Mac). So we may need to build those libraries when we
build the PDFBox project, or we can prebuild them and add them to the
project as .dll, .so or .dylib files. Which way is preferred?
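
To make point 4 concrete, here is a minimal sketch of how prebuilt
libraries could be selected per platform at run time. The class name
NativeLibraryNames and the base names "tesseract" and "lept" are
assumptions made for illustration; they are not taken from PDFBox,
Tesseract, or the repo in [3], and real library file names may differ
(for example by carrying a version suffix).

```java
import java.util.Locale;

/**
 * Hypothetical helper, not part of PDFBox or Tesseract: picks the file name
 * of a prebuilt native library for a given platform.
 */
final class NativeLibraryNames {

    private NativeLibraryNames() {
    }

    /** Maps an os.name value and a base name to a shared-library file name. */
    static String fileNameFor(String osName, String baseName) {
        String os = osName.toLowerCase(Locale.ROOT);
        // Check Mac first: "darwin" contains "win" and would match Windows.
        if (os.contains("mac") || os.contains("darwin")) {
            return "lib" + baseName + ".dylib";
        }
        if (os.contains("win")) {
            return baseName + ".dll";
        }
        // Treat everything else as Linux or another Unix.
        return "lib" + baseName + ".so";
    }

    /** File name for the platform the JVM is currently running on. */
    static String fileNameForCurrentPlatform(String baseName) {
        return fileNameFor(System.getProperty("os.name"), baseName);
    }
}
```

A caller could pass the result to System.load() together with the
directory that holds the prebuilt files. The JDK's own
System.mapLibraryName() does a similar mapping for System.loadLibrary(),
so the hand-written version is only needed if the files are shipped
inside the project and located manually.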

[1]
https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/tessbaseapi.cpp
[2] https://code.google.com/p/tesseract-ocr/wiki/APIExample
[3] https://github.com/DImuthuUpe/Tesseract-API
[4]
https://github.com/DImuthuUpe/Tesseract-API/blob/master/jni/tesseract/tessbaseapi.cpp

Thanks
Dimuthu


On Sat, Mar 1, 2014 at 11:39 PM, DImuthu Upeksha <dimuthu.upeksha2@gmail.com
> wrote:

> I made the necessary changes to the document [1].
>
> For the last two days I had a deep look at this [2] JNI wrapper for the
> Tesseract API.
> Unfortunately it has been designed for the Android environment, so I think we
> need to write our own makefiles to build it into a dll (Windows) or
> dylib (Mac). Currently it has Android.mk files [3]. I'm searching for a
> way to convert them to a makefile that we can run from the console. Please
> suggest a better approach if you have one.
>
> [1]
> https://www.dropbox.com/s/9qclvq26divwr2q/Optical%20Character%20Recognition%20for%20PDFBox%20-%20updated.pdf
> [2]
> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/
> [3]
> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/Android.mk
>
>
> On Sat, Mar 1, 2014 at 12:27 AM, John Hewson <jo...@jahewson.com> wrote:
>
>> This is a good start. However, there is no need for the Adder component:
>> "Extracted Text (OCR)" can just feed back into the PDFBox "Text Extractor".
>>
>> Maybe show a "PDF" file feeding in to "Text Extractor", to make it clear
>> where the process starts.
>>
>> -- John
>>
>> On 26 Feb 2014, at 16:53, DImuthu Upeksha <di...@gmail.com>
>> wrote:
>>
>> > Sorry for the mistake. I added it to my Dropbox [1].
>> >
>> > [1]
>> >
>> https://www.dropbox.com/s/y3m15rfjmw4eqij/Optical%20Character%20Recognition%20for%20PDFBox.pdf
>> >
>> > Thanks
>> > Dimuthu
>> >
>> >
>> > On Thu, Feb 27, 2014 at 4:44 AM, John Hewson <jo...@jahewson.com> wrote:
>> >
>> >> I should add that the OCR engine should be pluggable so PDFToText might
>> >> use an interface, e.g. OCREngine and there will be a TesseractOCREngine
>> >> class somewhere which provides the required functionality and lives in
>> a
>> >> separate jar file.
>> >>
>> >> -- John
>> >>
>> >>> On 25 Feb 2014, at 20:18, Dimuthu <di...@gmail.com> wrote:
>> >>>
>> >>> So do you need to embed those new functionalities into existing
>> >> PDFtoText algorithms or package them as a new sub system(something
>> like an
>> >> API)?
>> >>>
>> >>> -----Original Message-----
>> >>> From: "John Hewson" <jo...@jahewson.com>
>> >>> Sent: 26/02/2014 07:38
>> >>> To: "dev@pdfbox.apache.org" <de...@pdfbox.apache.org>
>> >>> Subject: Re: [GSoC 2014]Optical Character Recognition project -
>> >> Introduction
>> >>>
>> >>> Yes, exactly. By location data I just mean (x,y) coordinates and page
>> >> rotation.
>> >>>
>> >>> There is another use case for OCR: some fonts embedded in PDFs have
>> >> corrupt encodings, which means the ASCII codes map to the wrong
>> glyphs. We
>> >> could OCR the glyphs to repair the encoding.
>> >>>
>> >>> -- John
>> >>>
>> >>>> On 25 Feb 2014, at 17:13, DImuthu Upeksha <
>> dimuthu.upeksha2@gmail.com>
>> >> wrote:
>> >>>>
>> >>>> Hi John,
>> >>>> Thanks for the explanation.
>> >>>> Let's say there is a pdf with both text in extractable format and
>> some
>> >>>> images with text(Scanned images). In that case first we extract those
>> >>>> extractable content using PDFBox algorithms and rest is extracted
>> using
>> >>>> OCR. Finally we pack both results together and give output as
>> >> PDFToText. Am
>> >>>> I correct? What do you mean by "location data"?
>> >>>>
>> >>>> Thanks
>> >>>> Dimuthu
>> >>>>
>> >>>>
>> >>>>> On Tue, Feb 25, 2014 at 11:22 PM, John Hewson <jo...@jahewson.com>
>> >> wrote:
>> >>>>>
>> >>>>> 1. What is called "glyphs" ?
>> >>>>>
>> >>>>> http://en.wikipedia.org/wiki/Glyph
>> >>>>>
>> >>>>>> 2. What is the main requirement of this project?
>> >>>>>> As far as I understood, first we need to generate an image of
>> >>>>>> malformed pdfs from
>> >>>>>> PDFBox and then we need to do processing using OCR for further
>> >> accurate
>> >>>>>> results.  But the problem is, why shouldn't we directly do OCR on
>> >> those
>> >>>>>> PDFs without getting output from PDFBox? Correct me if I'm wrong.
>> >>>>>
>> >>>>> PDFBox can generate images (PDFToImage) and can extract text
>> >> (PDFToText).
>> >>>>> The goal of
>> >>>>> this project is to enhance PDFToText so that it can use OCR to
>> extract
>> >>>>> text from areas of the
>> >>>>> document where the text is embedded as an image. Such PDF files are
>> >>>>> typically generated by
>> >>>>> scanners or fax machines. There is also another case where OCR is
>> >> useful:
>> >>>>> some fonts embedded
>> >>>>> in PDF files contain the wrong encoding, so when text is extracted
>> with
>> >>>>> PDFToText the result is
>> >>>>> nonsense but when drawn with PDFToImage we see the correct letters.
>> >>>>>
>> >>>>> Instead of:
>> >>>>> PDF => Image => OCR => Text
>> >>>>>
>> >>>>> We want to do:
>> >>>>> PDF => (Many images for words + location data => OCR) => Text
>> >>>>>
>> >>>>> -- John
>> >>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha <
>> >>>>> dimuthu.upeksha2@gmail.com
>> >>>>>>> wrote:
>> >>>>>>
>> >>>>>>> Ok fixed. This is what I did
>> >>>>>>> Right click on the new project ->Debug As-> Debug Configurations
>> >>>>> ->Source
>> >>>>>>> ->Add -> Project
>> >>>>>>> Then I selected PDFBox project.
>> >>>>>>>
>> >>>>>>> Thanks
>> >>>>>>> Dimuthu
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha <
>> >>>>>>> dimuthu.upeksha2@gmail.com> wrote:
>> >>>>>>>
>> >>>>>>>> I'm using eclipse. This is what I want. I created a new Java
>> >>>>> application
>> >>>>>>>> project (say TestPDFBox) with a main class with following code.
>> >>>>>>>>
>> >>>>>>>> PDDocument document = new PDDocument();
>> >>>>>>>> PDPage blankPage = new PDPage();
>> >>>>>>>> document.addPage(blankPage);
>> >>>>>>>> document.save("BlankPage.pdf");
>> >>>>>>>> document.close();
>> >>>>>>>>
>> >>>>>>>> Then I need to add those jar files generated in target folder of
>> >> PDFBox
>> >>>>>>>> to build path of my new project (I did build the PDFBox project
>> from
>> >>>>>>>> source). That is what I did. But let's say I need to check  the
>> >>>>>>>> functionality of document.save("") method. But I don't have a
>> >>>>> reference to
>> >>>>>>>> it's sources because I directly used generated jars. As Tilman
>> said
>> >> I
>> >>>>> built
>> >>>>>>>> PDFBox from sources but I don't know a proper way to use it other
>> >>>>> projects
>> >>>>>>>> other than adding those jar files to build path.
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <jo...@jahewson.com>
>> >>>>> wrote:
>> >>>>>>>>
>> >>>>>>>>> Which IDE are you using? You should be able to run the PDFToText
>> >> class
>> >>>>>>>>> (in pdfbox-tools) using your IDE and pass a PDF file path as the
>> >>>>> command
>> >>>>>>>>> line argument.
>> >>>>>>>>>
>> >>>>>>>>> -- John
>> >>>>>>>>>
>> >>>>>>>>>> On 24 Feb 2014, at 22:38, DImuthu Upeksha <
>> >>>>> dimuthu.upeksha2@gmail.com>
>> >>>>>>>>> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>> Hi John,
>> >>>>>>>>>> Thanks for the reply. Yes I checked out PDFBox code and
>> managed to
>> >>>>>>>>> build
>> >>>>>>>>>> code successfully. I looked at the classes you mentioned and I
>> >> got a
>> >>>>>>>>> rough
>> >>>>>>>>>> idea about how they are working. To check them I used the jars
>> in
>> >>>>>>>>> target
>> >>>>>>>>>> folder to my separate java project. I tried samples in
>> >>>>>>>>>> http://pdfbox.apache.org/cookbook/. I need to further look
>> into
>> >> code
>> >>>>>>>>>> specially how those processXXX() methods work in
>> PDFTextStripper
>> >>>>> class.
>> >>>>>>>>>> What I usually do is adding some breakpoints and checking them
>> in
>> >>>>> debug
>> >>>>>>>>>> windows. But using jars it's not possible. What is the way you
>> >> follow
>> >>>>>>>>> in
>> >>>>>>>>>> order to do such task?
>> >>>>>>>>>>
>> >>>>>>>>>> As well I installed tesseract in to my machine and managed to
>> do
>> >> some
>> >>>>>>>>> OCR
>> >>>>>>>>>> stuff also. That's a cool tool which works fine.
>> >>>>>>>>>> I'm still learning the code. If I get any issue I'll drop you a
>> >> mail.
>> >>>>>>>>>>
>> >>>>>>>>>> Thanks
>> >>>>>>>>>> Dimuthu
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <
>> john@jahewson.com
>> >>>
>> >>>>>>>>> wrote:
>> >>>>>>>>>>>
>> >>>>>>>>>>> Hi Dimuthu
>> >>>>>>>>>>>
>> >>>>>>>>>>> The PDFBox website can be found at http://pdfbox.apache.org/ it
>> >>>>>>>>> contains
>> >>>>>>>>>>> a basic overview of the project
>> >>>>>>>>>>> and details on how to obtain the source code and build PDFBox
>> for
>> >>>>>>>>> yourself.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Currently we do not perform any OCR and PDFBOX-1912 details
>> the
>> >> only
>> >>>>>>>>>>> thoughts so far regarding it.
>> >>>>>>>>>>> Note that the OCR libraries mentioned in the JIRA issue are
>> all
>> >>>>> under
>> >>>>>>>>> the
>> >>>>>>>>>>> Apache license, which is a
>> >>>>>>>>>>> requirement.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Once you have the source code, take a look at the PageDrawer
>> >> class
>> >>>>> to
>> >>>>>>>>> see
>> >>>>>>>>>>> how text and images are
>> >>>>>>>>>>> rendered. We want someone to interface at a low-level (e.g.
>> one
>> >>>>> glyph,
>> >>>>>>>>>>> word, or sentence at a time) with
>> >>>>>>>>>>> an OCR engine. Also look at PDFTextStripper which is how text
>> is
>> >>>>>>>>> currently
>> >>>>>>>>>>> extracted, take a look at how
>> >>>>>>>>>>> we have to go to great length to sort text back into reading
>> >> order
>> >>>>> and
>> >>>>>>>>>>> infer the placement of diacritics - PDF
>> >>>>>>>>>>> is fundamentally a visual format, not a structured format like
>> >> HTML
>> >>>>> -
>> >>>>>>>>>>> which is why extracting text can be so
>> >>>>>>>>>>> difficult sometimes.
>> >>>>>>>>>>>
>> >>>>>>>>>>> The full PDF Reference document can be found at:
>> >>>>>
>> >>
>> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>> >>>>>>>>>>>
>> >>>>>>>>>>> Feel free to discuss specifics of your proposal or ask any
>> >>>>> questions.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Thanks,
>> >>>>>>>>>>>
>> >>>>>>>>>>> -- John
>> >>>>>>>>>>>
>> >>>>>>>>>>> On 23 Feb 2014, at 21:13, DImuthu Upeksha <
>> >>>>> dimuthu.upeksha2@gmail.com
>> >>>>>>>>>>
>> >>>>>>>>>>> wrote:
>> >>>>>>>>>>>
>> >>>>>>>>>>>> Hi,
>> >>>>>>>>>>>> I am Dimuthu Upeksha, a Computer Engineering Undergraduate at
>> >>>>>>>>> University
>> >>>>>>>>>>> of Moratuwa Sri Lanka. I successfully completed my GSoC 2013
>> with
>> >>>>>>>>> Apache
>> >>>>>>>>>>> ISIS [1] project. I'm very much interested in OCR and image
>> >>>>> processing
>> >>>>>>>>>>> stuff. So I would like to select this project idea as my GSoC
>> >> 2014
>> >>>>>>>>> project
>> >>>>>>>>>>> because I feel like it is the best suited project for me. In
>> >>>>>>>>> university
>> >>>>>>>>>>> also we have done some research in OCR area and our group
>> wrote a
>> >>>>>>>>>>> literature review about increasing efficiency of OCR
>> >>>>>>>>> systems(attached). Can
>> >>>>>>>>>>> you please suggest me where to start learning about PDFBox?
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> [1]
>> >>>>>
>> >>
>> http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Thank you
>> >>>>>>>>>>>> Dimuthu
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> --
>> >>>>>>>>>>>> Regards
>> >>>>>>>>>>>> W.Dimuthu Upeksha
>> >>>>>>>>>>>> Undergraduate
>> >>>>>>>>>>>> Department of Computer Science And Engineering
>> >>>>>>>>>>>> University of Moratuwa, Sri Lanka



-- 
Regards

W.Dimuthu Upeksha
Undergraduate
Department of Computer Science And Engineering

University of Moratuwa, Sri Lanka

Re: [GSoC 2014]Optical Character Recognition project - Introduction

Posted by John Hewson <jo...@jahewson.com>.
Dimuthu

Your new diagram looks good. The JNI wrapper for Tesseract is indeed for Android, so it will need porting
to a standard desktop C++ environment. We use Maven to build PDFBox and there is a native-maven-plugin
which can build JNI projects, see:
http://docs.codehaus.org/display/MAVENUSER/Projects+With+JNI
The plugin itself is here:
http://mojo.codehaus.org/maven-native/native-maven-plugin/
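
On the Java side of such a port, the Android Bitmap type would presumably
give way to java.awt.image.BufferedImage. The following is only a sketch
of that idea: the class OcrImageBytes, its method toGrayBytes and the
native declaration setImage are hypothetical names chosen for
illustration. The parameter list of setImage loosely mirrors Tesseract's
C++ TessBaseAPI::SetImage (image data, width, height, bytes per pixel,
bytes per line), but no such Java binding exists yet and the native
method is never called here.

```java
import java.awt.image.BufferedImage;

/**
 * Hypothetical sketch: converts a BufferedImage into the raw 8-bit
 * grayscale buffer that a desktop JNI wrapper could hand to Tesseract.
 */
final class OcrImageBytes {

    private OcrImageBytes() {
    }

    /**
     * Assumed native entry point, declared for illustration only. It would
     * have to be implemented in the ported C++ wrapper before use.
     */
    static native void setImage(byte[] data, int width, int height,
                                int bytesPerPixel, int bytesPerLine);

    /** Returns one byte per pixel, row by row, using a standard luma weighting. */
    static byte[] toGrayBytes(BufferedImage image) {
        int width = image.getWidth();
        int height = image.getHeight();
        byte[] out = new byte[width * height];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = image.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Integer approximation of 0.299 R + 0.587 G + 0.114 B.
                int gray = (r * 299 + g * 587 + b * 114) / 1000;
                out[y * width + x] = (byte) gray;
            }
        }
        return out;
    }
}
```

With a buffer like this, bytesPerPixel would be 1 and bytesPerLine equal
to the image width, which keeps the native side free of any
platform-specific image library.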

If you’ve not used Maven before, it’s a Java build system with its own package repository (like rubygems or npm)
so you just write an XML file and it downloads the appropriate plugins at build-time as they are required.

What operating system do you develop on? I’m on OS X, but I have VMs for most platforms.

Thanks

-- John

On 1 Mar 2014, at 10:09, DImuthu Upeksha <di...@gmail.com> wrote:

> I made the necessary changes to the document [1].
> 
> For the last two days I had a deep look at this [2] JNI wrapper for the
> Tesseract API.
> Unfortunately it has been designed for the Android environment, so I think we
> need to write our own makefiles to build it into a dll (Windows) or
> dylib (Mac). Currently it has Android.mk files [3]. I'm searching for a
> way to convert them to a makefile that we can run from the console. Please
> suggest a better approach if you have one.
> 
> [1]
> https://www.dropbox.com/s/9qclvq26divwr2q/Optical%20Character%20Recognition%20for%20PDFBox%20-%20updated.pdf
> [2]
> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/
> [3]
> https://code.google.com/p/tesseract-android-tools/source/browse/tesseract-android-tools/jni/com_googlecode_tesseract_android/Android.mk
> 
> 
> On Sat, Mar 1, 2014 at 12:27 AM, John Hewson <jo...@jahewson.com> wrote:
> 
>> This is a good start. However, there is no need for the Adder component:
>> "Extracted Text (OCR)" can just feed back into the PDFBox "Text Extractor".
>> 
>> Maybe show a "PDF" file feeding in to "Text Extractor", to make it clear
>> where the process starts.
>> 
>> -- John
>> 
>> On 26 Feb 2014, at 16:53, DImuthu Upeksha <di...@gmail.com>
>> wrote:
>> 
>>> Sorry for the mistake. I added it to my Dropbox [1].
>>> 
>>> [1]
>>> 
>> https://www.dropbox.com/s/y3m15rfjmw4eqij/Optical%20Character%20Recognition%20for%20PDFBox.pdf
>>> 
>>> Thanks
>>> Dimuthu
>>> 
>>> 
>>> On Thu, Feb 27, 2014 at 4:44 AM, John Hewson <jo...@jahewson.com> wrote:
>>> 
>>>> I should add that the OCR engine should be pluggable so PDFToText might
>>>> use an interface, e.g. OCREngine and there will be a TesseractOCREngine
>>>> class somewhere which provides the required functionality and lives in a
>>>> separate jar file.
>>>> 
>>>> -- John
>>>> 
>>>>> On 25 Feb 2014, at 20:18, Dimuthu <di...@gmail.com> wrote:
>>>>> 
>>>>> So do you need to embed those new functionalities into existing
>>>> PDFtoText algorithms or package them as a new sub system(something like
>> an
>>>> API)?
>>>>> 
>>>>> -----Original Message-----
>>>>> From: "John Hewson" <jo...@jahewson.com>
>>>>> Sent: 26/02/2014 07:38
>>>>> To: "dev@pdfbox.apache.org" <de...@pdfbox.apache.org>
>>>>> Subject: Re: [GSoC 2014]Optical Character Recognition project -
>>>> Introduction
>>>>> 
>>>>> Yes, exactly. By location data I just mean (x,y) coordinates and page
>>>> rotation.
>>>>> 
>>>>> There is another use case for OCR: some fonts embedded in PDFs have
>>>> corrupt encodings, which means the ASCII codes map to the wrong glyphs.
>> We
>>>> could OCR the glyphs to repair the encoding.
>>>>> 
>>>>> -- John
>>>>> 
>>>>>> On 25 Feb 2014, at 17:13, DImuthu Upeksha <dimuthu.upeksha2@gmail.com
>>> 
>>>> wrote:
>>>>>> 
>>>>>> Hi John,
>>>>>> Thanks for the explanation.
>>>>>> Let's say there is a pdf with both text in extractable format and some
>>>>>> images with text(Scanned images). In that case first we extract those
>>>>>> extractable content using PDFBox algorithms and rest is extracted
>> using
>>>>>> OCR. Finally we pack both results together and give output as
>>>> PDFToText. Am
>>>>>> I correct? What do you mean by "location data"?
>>>>>> 
>>>>>> Thanks
>>>>>> Dimuthu
>>>>>> 
>>>>>> 
>>>>>>> On Tue, Feb 25, 2014 at 11:22 PM, John Hewson <jo...@jahewson.com>
>>>> wrote:
>>>>>>> 
>>>>>>> 1. What is called "glyphs" ?
>>>>>>> 
>>>>>>> http://en.wikipedia.org/wiki/Glyph
>>>>>>> 
>>>>>>>> 2. What is the main requirement of this project?
>>>>>>>> As far as I understood, first we need to generate an image of
>>>>>>>> malformed pdfs from
>>>>>>>> PDFBox and then we need to do processing using OCR for further
>>>> accurate
>>>>>>>> results.  But the problem is, why shouldn't we directly do OCR on
>>>> those
>>>>>>>> PDFs without getting output from PDFBox? Correct me if I'm wrong.
>>>>>>> 
>>>>>>> PDFBox can generate images (PDFToImage) and can extract text
>>>> (PDFToText).
>>>>>>> The goal of
>>>>>>> this project is to enhance PDFToText so that it can use OCR to
>> extract
>>>>>>> text from areas of the
>>>>>>> document where the text is embedded as an image. Such PDF files are
>>>>>>> typically generated by
>>>>>>> scanners or fax machines. There is also another case where OCR is
>>>> useful:
>>>>>>> some fonts embedded
>>>>>>> in PDF files contain the wrong encoding, so when text is extracted
>> with
>>>>>>> PDFToText the result is
>>>>>>> nonsense but when drawn with PDFToImage we see the correct letters.
>>>>>>> 
>>>>>>> Instead of:
>>>>>>> PDF => Image => OCR => Text
>>>>>>> 
>>>>>>> We want to do:
>>>>>>> PDF => (Many images for words + location data => OCR) => Text
>>>>>>> 
>>>>>>> -- John
>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Tue, Feb 25, 2014 at 1:35 PM, DImuthu Upeksha <
>>>>>>> dimuthu.upeksha2@gmail.com
>>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Ok fixed. This is what I did
>>>>>>>>> Right click on the new project ->Debug As-> Debug Configurations
>>>>>>> ->Source
>>>>>>>>> ->Add -> Project
>>>>>>>>> Then I selected PDFBox project.
>>>>>>>>> 
>>>>>>>>> Thanks
>>>>>>>>> Dimuthu
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Tue, Feb 25, 2014 at 1:17 PM, DImuthu Upeksha <
>>>>>>>>> dimuthu.upeksha2@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>>> I'm using eclipse. This is what I want. I created a new Java
>>>>>>> application
>>>>>>>>>> project (say TestPDFBox) with a main class with following code.
>>>>>>>>>> 
>>>>>>>>>> PDDocument document = new PDDocument();
>>>>>>>>>> PDPage blankPage = new PDPage();
>>>>>>>>>> document.addPage(blankPage);
>>>>>>>>>> document.save("BlankPage.pdf");
>>>>>>>>>> document.close();
>>>>>>>>>> 
>>>>>>>>>> Then I need to add those jar files generated in target folder of
>>>> PDFBox
>>>>>>>>>> to build path of my new project (I did build the PDFBox project
>> from
>>>>>>>>>> source). That is what I did. But let's say I need to check  the
>>>>>>>>>> functionality of document.save("") method. But I don't have a
>>>>>>> reference to
>>>>>>>>>> it's sources because I directly used generated jars. As Tilman
>> said
>>>> I
>>>>>>> built
>>>>>>>>>> PDFBox from sources but I don't know a proper way to use it other
>>>>>>> projects
>>>>>>>>>> other than adding those jar files to build path.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Tue, Feb 25, 2014 at 1:03 PM, John Hewson <jo...@jahewson.com>
>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Which IDE are you using? You should be able to run the PDFToText
>>>> class
>>>>>>>>>>> (in pdfbox-tools) using your IDE and pass a PDF file path as the
>>>>>>> command
>>>>>>>>>>> line argument.
>>>>>>>>>>> 
>>>>>>>>>>> -- John
>>>>>>>>>>> 
>>>>>>>>>>>> On 24 Feb 2014, at 22:38, DImuthu Upeksha <
>>>>>>> dimuthu.upeksha2@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Hi John,
>>>>>>>>>>>> Thanks for the reply. Yes I checked out PDFBox code and managed
>> to
>>>>>>>>>>> build
>>>>>>>>>>>> code successfully. I looked at the classes you mentioned and I
>>>> got a
>>>>>>>>>>> rough
>>>>>>>>>>>> idea about how they are working. To check them I used the jars
>> in
>>>>>>>>>>> target
>>>>>>>>>>>> folder to my separate java project. I tried samples in
>>>>>>>>>>>> http://pdfbox.apache.org/cookbook/. I need to further look into
>>>> code
>>>>>>>>>>>> specially how those processXXX() methods work in PDFTextStripper
>>>>>>> class.
>>>>>>>>>>>> What I usually do is adding some breakpoints and checking them
>> in
>>>>>>> debug
>>>>>>>>>>>> windows. But using jars it's not possible. What is the way you
>>>> follow
>>>>>>>>>>> in
>>>>>>>>>>>> order to do such task?
>>>>>>>>>>>> 
>>>>>>>>>>>> As well I installed tesseract in to my machine and managed to do
>>>> some
>>>>>>>>>>> OCR
>>>>>>>>>>>> stuff also. That's a cool tool which works fine.
>>>>>>>>>>>> I'm still learning the code. If I get any issue I'll drop you a
>>>> mail.
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 12:33 AM, John Hewson <
>> john@jahewson.com
>>>>> 
>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi Dimuthu
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The PDFBox website can be found at http://pdfbox.apache.org/ it
>>>>>>>>>>> contains
>>>>>>>>>>>>> a basic overview of the project
>>>>>>>>>>>>> and details on how to obtain the source code and build PDFBox
>> for
>>>>>>>>>>> yourself.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Currently we do not perform any OCR and PDFBOX-1912 details the
>>>> only
>>>>>>>>>>>>> thoughts so far regarding it.
>>>>>>>>>>>>> Note that the OCR libraries mentioned in the JIRA issue are all
>>>>>>> under
>>>>>>>>>>> the
>>>>>>>>>>>>> Apache license, which is a
>>>>>>>>>>>>> requirement.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Once you have the source code, take a look at the PageDrawer
>>>> class
>>>>>>> to
>>>>>>>>>>> see
>>>>>>>>>>>>> how text and images are
>>>>>>>>>>>>> rendered. We want someone to interface at a low-level (e.g. one
>>>>>>> glyph,
>>>>>>>>>>>>> word, or sentence at a time) with
>>>>>>>>>>>>> an OCR engine. Also look at PDFTextStripper which is how text
>> is
>>>>>>>>>>> currently
>>>>>>>>>>>>> extracted, take a look at how
>>>>>>>>>>>>> we have to go to great length to sort text back into reading
>>>> order
>>>>>>> and
>>>>>>>>>>>>> infer the placement of diacritics - PDF
>>>>>>>>>>>>> is fundamentally a visual format, not a structured format like
>>>> HTML
>>>>>>> -
>>>>>>>>>>>>> which is why extracting text can be so
>>>>>>>>>>>>> difficult sometimes.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The full PDF Reference document can be found at:
>>>>>>> 
>>>> 
>> http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/PDF32000_2008.pdf
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Feel free to discuss specifics of your proposal or ask any
>>>>>>> questions.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> -- John
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On 23 Feb 2014, at 21:13, DImuthu Upeksha <
>>>>>>> dimuthu.upeksha2@gmail.com
>>>>>>>>>>>> 
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>> I am Dimuthu Upeksha, a Computer Engineering Undergraduate at
>>>>>>>>>>> University
>>>>>>>>>>>>> of Moratuwa Sri Lanka. I successfully completed my GSoC 2013
>> with
>>>>>>>>>>> Apache
>>>>>>>>>>>>> ISIS [1] project. I'm very much interested in OCR and image
>>>>>>> processing
>>>>>>>>>>>>> stuff. So I would like to select this project idea as my GSoC
>>>> 2014
>>>>>>>>>>> project
>>>>>>>>>>>>> because I feel like it is the best suited project for me. In
>>>>>>>>>>> university
>>>>>>>>>>>>> also we have done some research in OCR area and our group
>> wrote a
>>>>>>>>>>>>> literature review about increasing efficiency of OCR
>>>>>>>>>>> systems(attached). Can
>>>>>>>>>>>>> you please suggest me where to start learning about PDFBox?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> [1]
>>>>>>> 
>>>> 
>> http://google-opensource.blogspot.com/2013/10/google-summer-of-code-veteran-orgs.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+GoogleOpenSourceBlog+%28Google+Open+Source+Blog%29
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thank you
>>>>>>>>>>>>>> Dimuthu
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>> W.Dimuthu Upeksha
>>>>>>>>>>>>>> Undergraduate
>>>>>>>>>>>>>> Department of Computer Science And Engineering
>>>>>>>>>>>>>> University of Moratuwa, Sri Lanka