Posted to user@tika.apache.org by Mark Kerzner <ma...@shmsoft.com> on 2018/02/20 04:02:01 UTC

Long time with OCR

Hi, all,

I am doing OCR on a PDF with more than 500 pages. Since it takes a
long time, I broke the PDF into individual pages, so that I can better
track progress. It works, but I get 10 seconds per page. These pages are
hard because they have different fonts and maybe other complications.

Is that a good approach? Is the 10 seconds time normal? I am using the
latest most powerful Mac and I get similar results on an i7 processor in
Ubuntu.
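
For scale, the figures above can be turned into a rough total-time estimate. This is a back-of-the-envelope sketch, assuming exactly 500 pages, a flat 10 seconds per page, and perfect scaling when pages are spread over several workers (all simplifications; the function name is illustrative only):

```python
import math


def estimate_minutes(pages, seconds_per_page, workers=1):
    """Wall-clock minutes if pages are spread evenly over `workers`."""
    # The busiest worker handles ceil(pages / workers) pages.
    rounds = math.ceil(pages / workers)
    return rounds * seconds_per_page / 60


print(round(estimate_minutes(500, 10), 1))             # 83.3
print(round(estimate_minutes(500, 10, workers=4), 1))  # 20.8
```

So a single sequential pass is on the order of an hour and a half, and the per-page split is what would make spreading the work over several workers possible.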

Thanks a bunch!

Mark

Mark Kerzner, SHMsoft <http://shmsoft.com/>,
Book a call with me here <http://www.meetme.so/markkerzner>

Mobile: 713-724-2534
Skype: mark.kerzner1
<http://shmsoft.com/>

Re: Long time with OCR

Posted by Mark Kerzner <ma...@shmsoft.com>.

Thanks again

On Tue, Feb 20, 2018 at 1:24 PM, Allison, Timothy B. <ta...@mitre.org>
wrote:

> > These pages are hard because they have different fonts and maybe other
> > complications.
>
> +1 … As a side note, a colleague and I did an image degradation study,
> and we noticed that Tesseract took far longer on the degraded images than
> on the originals. Your intuition is correct. This won’t help improve
> your speed, but I thought I’d share.

RE: Long time with OCR

Posted by "Allison, Timothy B." <ta...@mitre.org>.

> These pages are hard because they have different fonts and maybe other complications.

+1 … As a side note, a colleague and I did an image degradation study, and we noticed that Tesseract took far longer on the degraded images than on the originals. Your intuition is correct. This won’t help improve your speed, but I thought I’d share.

From: Chris Mattmann [mailto:mattmann@apache.org]
Sent: Tuesday, February 20, 2018 12:31 PM
To: user@tika.apache.org
Subject: Re: Long time with OCR

Updated the wiki page with this info, thanks Nick!

Re: Long time with OCR

Posted by Chris Mattmann <ma...@apache.org>.

Updated the wiki page with this info, thanks Nick!

From: Mark Kerzner <ma...@shmsoft.com>
Reply-To: "user@tika.apache.org" <us...@tika.apache.org>
Date: Tuesday, February 20, 2018 at 6:36 AM
To: Tika User <us...@tika.apache.org>
Subject: Re: Long time with OCR

Hi, Nick,

Thank you very much.

Mark

Re: Long time with OCR

Posted by Mark Kerzner <ma...@shmsoft.com>.

Hi, Nick,

Thank you very much.

Mark

On Tue, Feb 20, 2018 at 6:59 AM, Nick Burch <ap...@gagravarr.org> wrote:

> On Mon, 19 Feb 2018, Mark Kerzner wrote:
>
>> Is that a good approach? Is the 10 seconds time normal? I am using the
>> latest most powerful Mac and I get similar results on an i7 processor in
>> Ubuntu.
>>
>
> Tika uses the open source Tesseract OCR engine. Tesseract is optimised for
> ease of contributions and ease of implementing new approaches, rather than
> for performance, because as an (ex?-) academic project that's more what
> they think is important.
>
> There's some advice in the Tesseract GitHub issues and wiki on ways to speed
> it up, e.g. https://github.com/tesseract-ocr/tesseract/issues/263 and
> https://github.com/tesseract-ocr/tesseract/issues/1171 and
> https://github.com/tesseract-ocr/tesseract/wiki/4.0-Accuracy-and-Performance
>
> Otherwise you'd need to switch to a proprietary OCR tool. I understand
> that the Google Cloud OCR is pretty good, if you don't mind pushing all
> your files up to Google and paying per file.
>
> Nick
>

Re: Long time with OCR

Posted by Nick Burch <ap...@gagravarr.org>.

On Mon, 19 Feb 2018, Mark Kerzner wrote:
> Is that a good approach? Is the 10 seconds time normal? I am using the 
> latest most powerful Mac and I get similar results on an i7 processor in 
> Ubuntu.

Tika uses the open source Tesseract OCR engine. Tesseract is optimised for
ease of contributions and ease of implementing new approaches, rather than
for performance, because as an (ex?-) academic project that's more what
they think is important.

There's some advice in the Tesseract GitHub issues and wiki on ways to speed
it up, e.g. https://github.com/tesseract-ocr/tesseract/issues/263 and
https://github.com/tesseract-ocr/tesseract/issues/1171 and
https://github.com/tesseract-ocr/tesseract/wiki/4.0-Accuracy-and-Performance

Otherwise you'd need to switch to a proprietary OCR tool. I understand
that the Google Cloud OCR is pretty good, if you don't mind pushing all
your files up to Google and paying per file.

Nick
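
Independent of the engine's per-page speed, one way to cut wall-clock time is to OCR several pages at once, since splitting the PDF already makes the pages independent of each other. The following is a minimal sketch, not how Tika itself schedules OCR. It assumes each page has already been rendered to an image file, that the `tesseract` command-line tool is on the PATH, and that four workers is a sensible starting point; `run_tesseract` and `ocr_pages` are illustrative names.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_tesseract(image_path):
    """OCR one page image with the tesseract CLI and return its text.

    Assumes `tesseract` is on PATH; passing `stdout` as the output base
    makes it print the recognised text instead of writing a file.
    """
    result = subprocess.run(
        ["tesseract", image_path, "stdout"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def ocr_pages(page_paths, ocr_fn=run_tesseract, workers=4):
    """Apply ocr_fn to every page concurrently; return texts in page order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order even when pages finish out of order.
        return list(pool.map(ocr_fn, page_paths))
```

A call such as `ocr_pages(["page-001.png", "page-002.png"])` (hypothetical file names) would return one string per page. Threads are enough here because the heavy work happens in the child processes. If the Tesseract build is itself multi-threaded, each process may already use several cores, so the best worker count is something to measure rather than assume.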