Posted to dev@tika.apache.org by Thejan Wijesinghe <th...@gmail.com> on 2017/03/04 17:04:57 UTC

Tess4j API for TIKA OCR parser

Hi Thamme,

Yes, I am using Ubuntu :) and I had both ImageMagick and Tesseract
installed on my system via apt-get. Since I wasn't sure whether this was a
problem with the APT packages, I also built both ImageMagick and
Tesseract from source.

I also double-checked that Tesseract and ImageMagick are available by
running the CLI commands you suggested, as well as the commands below:

convert test.jpg -resize 64x64 resized_test.jpg

tesseract test.jpg out

and they worked.
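
The same availability check can also be done from Java before a parser is
tried. This is only a minimal sketch, assuming a Unix-like system such as the
Ubuntu setup above: the class name ToolCheck and the method isOnPath are
made up for illustration and are not part of Tika. It only looks for an
executable file named like the tool in each PATH directory; it does not run
the tool.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Tika.
final class ToolCheck {

    /** Returns true if an executable file with this name exists in a PATH directory. */
    static boolean isOnPath(String tool) {
        String path = System.getenv("PATH");
        if (path == null || path.isEmpty()) {
            return false;
        }
        for (String dir : path.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            Path candidate = Paths.get(dir, tool);
            if (Files.isRegularFile(candidate) && Files.isExecutable(candidate)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The two command names used in this thread.
        for (String tool : new String[] {"tesseract", "convert"}) {
            System.out.println(tool + " on PATH: " + isOnPath(tool));
        }
    }
}
```

On Windows the executables carry an .exe suffix, so the name would need to be
adjusted there.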

I can't find the exact reason why I am not getting metadata, but when I use
the AutoDetectParser class instead of the TesseractOCRParser class, I can
extract both content and metadata.

p.s. I will put updating the wiki OCR page in my TODO list :)

Re: Tess4j API for TIKA OCR parser

Posted by Thejan Wijesinghe <th...@gmail.com>.

Hi everyone!

Luis, it is my pleasure to meet the original creator of a major component of
Tika. I should say that your out-of-process approach is a very creative and
reliable workaround. :) There are still many areas of the Tika parsers that
are unclear to me. Perhaps you can help me clarify some of them.

Even now, there must be some unreliability in Tess4j, because I also ran into
some JVM crashes when I first tried to test it, although I got past them. I
can't say whether this is going to be a perfect implementation without
testing it properly. However, I'll try my best to make this work. I have
created a Jira issue for this [1]. I invite you all to help me make this a
success.

[1] https://issues.apache.org/jira/browse/TIKA-2293

On Tue, Mar 7, 2017 at 9:42 PM, Thamme Gowda <th...@apache.org> wrote:

> yes, we can try tika-eval to see the difference. Perfect!
>
> Best,
> TG

RE: Tess4j API for TIKA OCR parser

Posted by Thamme Gowda <th...@apache.org>.

yes, we can try tika-eval to see the difference. Perfect!

Best,
TG

On Mar 7, 2017 7:44 AM, "Allison, Timothy B." <ta...@mitre.org> wrote:

Yes, and why not give the new tika-eval module a trial to evaluate the
differences in output? :)

RE: Tess4j API for TIKA OCR parser

Posted by "Allison, Timothy B." <ta...@mitre.org>.

Yes, and why not give the new tika-eval module a trial to evaluate the differences in output? :)

-----Original Message-----
From: Thamme Gowda [mailto:thammegowda@apache.org] 
Sent: Tuesday, March 7, 2017 10:38 AM
To: Thejan Wijesinghe <th...@gmail.com>
Cc: dev@tika.apache.org
Subject: Re: Tess4j API for TIKA OCR parser

Re: Tess4j API for TIKA OCR parser

Posted by Thamme Gowda <th...@apache.org>.

Thanks Nick for the reply.

Thejan,

I am glad to hear about your progress. Rewriting the TesseractOCRParser would
be the ultimate goal if using Tess4j proves to be better than the way it is
done currently.

But, for now, please consider these:
+ Rename your class to *Tess4jOCRParser*. It is a new parser providing the
same functionality as *TesseractOCRParser*.
+ Keep the *TesseractOCRParser* intact. You can use it as a reference to
understand which OCR parser features you need to support.
+ Benchmark *TesseractOCRParser* and *Tess4jOCRParser* with respect to
performance and stability. You can take a set of 100 images and compare how
much time each parser took. Please share those results here.


Based on the benchmark, we can decide whether to replace the old parser with
the new one. Because TesseractOCRParser is used along with many other
parsers, such as the JPEG and PDF parsers, any improvements you make with
Tess4jOCRParser will have a huge effect!
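
A benchmark along these lines could be sketched as follows. This is an
illustration under stated assumptions, not the actual Tika test setup: the
class OcrBenchmark, its Result type and the two stub "parsers" are
hypothetical stand-ins, and a real run would call the two OCR parsers on the
same set of image files. It records total wall-clock time (performance) and
the number of inputs that threw an exception (a rough proxy for stability).

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical harness, not part of Tika.
final class OcrBenchmark {

    /** Total elapsed time and failure count for one engine over all inputs. */
    static final class Result {
        final long totalNanos;
        final int failures;

        Result(long totalNanos, int failures) {
            this.totalNanos = totalNanos;
            this.failures = failures;
        }
    }

    /** Runs the engine once per input, timing every call and counting exceptions. */
    static <T> Result run(Function<T, String> engine, List<T> inputs) {
        long total = 0;
        int failures = 0;
        for (T input : inputs) {
            long start = System.nanoTime();
            try {
                engine.apply(input);
            } catch (RuntimeException e) {
                failures++;
            }
            total += System.nanoTime() - start;
        }
        return new Result(total, failures);
    }

    public static void main(String[] args) {
        List<String> images = Arrays.asList("a.png", "b.png", "corrupt.png");
        // Stubs standing in for the two parsers being compared.
        Function<String, String> oldParser = name -> "text from " + name;
        Function<String, String> newParser = name -> {
            if (name.startsWith("corrupt")) {
                throw new IllegalStateException("cannot read " + name);
            }
            return "text from " + name;
        };
        Result oldResult = run(oldParser, images);
        Result newResult = run(newParser, images);
        System.out.printf("old: %d ms, %d failures%n",
                oldResult.totalNanos / 1_000_000, oldResult.failures);
        System.out.printf("new: %d ms, %d failures%n",
                newResult.totalNanos / 1_000_000, newResult.failures);
    }
}
```

Note that a crash of the JVM itself cannot be counted this way, because the
harness dies with it; measuring that kind of instability needs a separate
process per run.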

P.S.
+ Please don't edit any test cases. You may add new ones, though!
+ Could you please create a Jira issue to track this? Sorry, I should have
said this earlier.

Best,
TG


On Tue, Mar 7, 2017 at 4:58 AM, Thejan Wijesinghe <
thejan.k.wijesinghe@gmail.com> wrote:

> Hi Nick,
>
> I thought the same thing. I will try to keep the public method signatures
> unchanged and will send updates on my progress.

RE: Tess4j API for TIKA OCR parser

Posted by "Allison, Timothy B." <ta...@mitre.org>.

+1

Same experience, of same vintage. :)

-----Original Message-----
From: Luís Filipe Nassif [mailto:lfcnassif@gmail.com] 
Sent: Tuesday, March 7, 2017 10:34 AM
To: dev@tika.apache.org
Subject: Re: Tess4j API for TIKA OCR parser

Re: Tess4j API for TIKA OCR parser

Posted by Luís Filipe Nassif <lf...@gmail.com>.

Hi Thejan,

Before the first version of TesseractOCRParser was committed, I tried to use
Tess4j; that was 4 years ago. Unfortunately, at that time I ran into some
problems, such as permanent hangs with Tesseract/Tess4j and, even worse, JVM
crashes caused by bugs in native code (pointers to crazy addresses) when
processing corrupted images. So I changed strategy and took the
Runtime.exec route, executing Tesseract out of process to get rid of those
JVM crashes.
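
The out-of-process idea can be illustrated with a small sketch. It is not
the code in TesseractOCRParser; the class name OutOfProcess, the method run
and the 120-second timeout are assumptions made for this example, and it
uses ProcessBuilder (Java 9 or later for Redirect.DISCARD) rather than
Runtime.exec directly. The point is that a hang is bounded by a timeout and a
native crash only kills the child process, not the JVM.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not the Tika implementation.
final class OutOfProcess {

    /** Returns the exit code, or -1 if the command could not start, timed out or was interrupted. */
    static int run(List<String> command, long timeoutSeconds) {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectErrorStream(true);
        // Discard console output so the child can never block on a full pipe.
        builder.redirectOutput(ProcessBuilder.Redirect.DISCARD);
        Process process;
        try {
            process = builder.start();
        } catch (IOException e) {
            return -1; // binary missing or not executable
        }
        try {
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                process.destroyForcibly(); // a hang is contained by killing the child
                return -1;
            }
            return process.exitValue();
        } catch (InterruptedException e) {
            process.destroyForcibly();
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return -1;
        }
    }

    public static void main(String[] args) {
        // Same command line as used earlier in this thread; the result goes to out.txt.
        int exit = run(Arrays.asList("tesseract", "test.jpg", "out"), 120);
        System.out.println("tesseract exit code: " + exit);
    }
}
```

The price of this isolation is the cost of starting a process per image and
passing data through files.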

That was a long time ago; maybe those problems have gone away with current
Tesseract and Tess4j. But for now I recommend committing your changes in a
new parser instead of changing the default TesseractOCRParser, until the
new code has been tested against millions of images from the wild with
tika-batch, so that it can be proven stable enough to be the default OCR
parser of Tika.

Best,
Luis

On Mar 7, 2017 9:58 AM, "Thejan Wijesinghe" <
thejan.k.wijesinghe@gmail.com> wrote:

> Hi Nick,
>
> I thought the same thing. I will try to keep the public method signatures
> unchanged and will send updates on my progress.

Re: Tess4j API for TIKA OCR parser

Posted by Thejan Wijesinghe <th...@gmail.com>.

Hi Nick,

I thought the same thing. I will try to keep the public method signatures
unchanged and will send updates on my progress.

On Tue, Mar 7, 2017 at 5:48 PM, Nick Burch <ap...@gagravarr.org> wrote:

> If you can, try to keep the public methods unchanged. That way, other
> callers to the class will be unaffected by your re-write of the internal
> logic
>
> Nick
>

Re: Tess4j API for TIKA OCR parser

Posted by Nick Burch <ap...@gagravarr.org>.

On Tue, 7 Mar 2017, Thejan Wijesinghe wrote:
> I have already used the Tess4j API to rewrite the TesseractOCRParser class.
> Although it successfully extracts content from most of the file types, it
> fails some particular unit tests in the TesseractOCRParserTest class. I can
> solve that. However, I want to know whether I can rewrite the entire
> TesseractOCRParser class from the ground up. If I do that, there will be
> many broken links in the internals of Tika because, as I witnessed, most of
> the classes use the TesseractOCRParser class indirectly.

If you can, try to keep the public methods unchanged. That way, other
callers to the class will be unaffected by your rewrite of the internal
logic.
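
As an illustration of that advice, here is a minimal sketch in which the
public method stays fixed while the private OCR step is swappable. The class
StableOcrParser, its methods and the two stub engines are hypothetical and
much simpler than the real TesseractOCRParser; they only show the shape of
the idea.

```java
import java.util.function.Function;

// Hypothetical class, not the real TesseractOCRParser.
final class StableOcrParser {

    // Internal engine: image path in, recognized text out.
    private final Function<String, String> engine;

    StableOcrParser(Function<String, String> engine) {
        this.engine = engine;
    }

    /** Public entry point; callers depend only on this signature. */
    public String extractText(String imagePath) {
        if (imagePath == null || imagePath.isEmpty()) {
            throw new IllegalArgumentException("imagePath is required");
        }
        return doOcr(imagePath).trim();
    }

    /** Internal step that a rewrite is free to change. */
    private String doOcr(String imagePath) {
        return engine.apply(imagePath);
    }

    public static void main(String[] args) {
        // Stubs standing in for a command-line backend and a library backend.
        Function<String, String> commandLineStub = path -> "hello from CLI\n";
        Function<String, String> libraryStub = path -> "  hello from library ";
        System.out.println(new StableOcrParser(commandLineStub).extractText("test.jpg"));
        System.out.println(new StableOcrParser(libraryStub).extractText("test.jpg"));
    }
}
```

Callers of extractText compile and behave the same whichever engine is
plugged in, which is what keeps the rest of the code base unaffected.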

Nick

Re: Tess4j API for TIKA OCR parser

Posted by Thejan Wijesinghe <th...@gmail.com>.

Hi Thamme,

I made minimal changes to the TesseractOCRParser class; basically, I changed
the private doOCR() method. However, the existing unit tests fail even
though the content and metadata are extracted. Could you give me some
guidance on resolving the errors I get when running the test cases? I also
added some dependencies to the pom.xml in tika-parsers. Please check the
links below.

changed pom.xml:
https://github.com/ThejanW/tika/blob/master/tika-parsers/pom.xml

changed TesseractOCRParser class:
https://github.com/ThejanW/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java

On Tue, Mar 7, 2017 at 1:17 PM, Thejan Wijesinghe <
thejan.k.wijesinghe@gmail.com> wrote:

Re: Tess4j API for TIKA OCR parser

Posted by Thejan Wijesinghe <th...@gmail.com>.

Thamme,
I have already used the Tess4j API to rewrite the TesseractOCRParser class.
Although it successfully extracts content from most of the file types, it
fails some particular unit tests in the TesseractOCRParserTest class. I can
solve that. However, I want to know whether I can rewrite the entire
TesseractOCRParser class from the ground up. If I do that, there will be
many broken links in the internals of Tika because, as I witnessed, most of
the classes use the TesseractOCRParser class indirectly.

On Mon, Mar 6, 2017 at 12:37 AM, Thamme Gowda <th...@apache.org>
wrote:

-- 

Thejan Wijesinghe
Department of Computer Science and Engineering
University of Moratuwa
+94778097907
<https://www.linkedin.com/in/thejanw/>
<https://github.com/ThejanW>
<https://www.facebook.com/ThejanW>
<https://twitter.com/Thejan_W>
<https://plus.google.com/u/2/116268117882077683208>

Re: Tess4j API for TIKA OCR parser

Posted by Thamme Gowda <th...@apache.org>.

Thejan,

Welcome to the world of mysteries. I can't explain why you are facing this
issue, since I am unable to reproduce it.

Try out a few other images; maybe the image you have chosen is corrupt, and
maybe an exception is being thrown and silently swallowed in the code.
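
One way to test both guesses at once is to run the same parsing step over
several images and keep every exception instead of dropping it. The sketch
below is hypothetical (the class FailureReport and the stub parser are
invented for illustration, and it catches only unchecked exceptions); a real
check would call the parser under investigation.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper, not part of Tika.
final class FailureReport {

    /** Applies the parser to each input and records the exception, if any, per input. */
    static <T> Map<T, RuntimeException> collectFailures(Function<T, ?> parser, List<T> inputs) {
        Map<T, RuntimeException> failures = new LinkedHashMap<>();
        for (T input : inputs) {
            try {
                parser.apply(input);
            } catch (RuntimeException e) {
                failures.put(input, e); // keep the cause instead of swallowing it
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        // Stub parser that fails on one "corrupt" image.
        Function<String, String> stubParser = name -> {
            if (name.equals("corrupt.jpg")) {
                throw new IllegalStateException("unreadable image: " + name);
            }
            return "text";
        };
        Map<String, RuntimeException> failures =
                collectFailures(stubParser, Arrays.asList("test.jpg", "corrupt.jpg", "scan.png"));
        failures.forEach((image, error) -> System.out.println(image + " -> " + error));
    }
}
```

If only one image shows up in the report, the image is the likely culprit; if
all of them do, the problem is more likely in the setup or the code.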

I suggest you do this:
   Please use an IDE like IntelliJ/Eclipse and use a debugger to understand
the call stack inside TesseractOCRParser. It is indeed a nice way to get to
know the internals of Tika :-)

Best,
TG

*--*
*Thamme Gowda*
TG | @thammegowda <https://twitter.com/thammegowda>
~Sent via somebody's Webmail server!
