Posted to java-user@lucene.apache.org by "Sudarsan, Sithu D." <Si...@fda.hhs.gov> on 2009/02/26 17:29:18 UTC

Use of scanned documents for text extraction and indexing

Hi All:

Is there any study / research done on using scanned paper documents as
images (may be PDF), and then use some OCR or other technique for
extracting text, and the resultant index quality?


Thanks in advance,
Sithu D Sudarsan

sithu.sudarsan@fda.hhs.gov
sdsudarsan@ualr.edu

Re: Use of scanned documents for text extraction and indexing

Posted by Bastian Buch <mr...@gmx.de>.

You can use Tesseract, an open-source OCR engine sponsored by Google. It is
native C/C++ code, so to use it from Java you should use JNI or launch it
as a separate process. It has no PDF support, but you can use ImageMagick
to convert those documents on the fly. The engine scans documents line by
line without trying to resolve "text boxes", which is a problem with
multi-column texts. With some image preprocessing you can solve this, too.
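
A minimal sketch of the "direct process creation" route, assuming the
`convert` (ImageMagick 6; ImageMagick 7 calls it `magick`) and `tesseract`
binaries are on the PATH and that ImageMagick can read PDFs (it normally
delegates to Ghostscript). The class and method names are made up for
illustration, and the 300 dpi density is only a common starting point, not
a recommendation from this thread.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper: builds and runs the two external commands.
final class OcrPipeline {

    // Rasterize the first page of a PDF ("[0]" selects page 0) at 300 dpi.
    static List<String> convertCommand(Path pdf, Path image) {
        return List.of("convert", "-density", "300",
                pdf.toString() + "[0]", image.toString());
    }

    // "tesseract <image> <outputbase>" writes the text to <outputbase>.txt.
    static List<String> tesseractCommand(Path image, Path outputBase) {
        return List.of("tesseract", image.toString(), outputBase.toString());
    }

    // Run one command, forwarding its output, and fail on a non-zero exit.
    static void run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException(command.get(0) + " exited with code " + exitCode);
        }
    }
}
```

Calling run(convertCommand(pdf, png)) followed by
run(tesseractCommand(png, base)) should leave the recognized text in
base + ".txt", which can then be read and indexed like any other text file.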


Cheers Bastian.

http://bastian-buch.de


Renaud Waldura schrieb:

> There is quite a bit of literature available on this topic. This paper
> presents a summary. Nothing immediately applicable I'm afraid.
>
> Retrieving OCR Text: A survey of current approaches
> Steven M. Beitzel, Eric C. Jensen, David A Grossman
> Illinois Institute of Technology
>
> It lists a number of other papers that are easy to find online. Let me know
> what you find, I'm interested in this too.
>
> --Renaud
>
>  
>
> -----Original Message-----
> From: Sudarsan, Sithu D. [mailto:Sithu.Sudarsan@fda.hhs.gov] 
> Sent: Thursday, February 26, 2009 8:29 AM
> To: solr-user@lucene.apache.org; java-user@lucene.apache.org
> Subject: Use of scanned documents for text extraction and indexing
>
>
> Hi All:
>
> Is there any study / research done on using scanned paper documents as
> images (may be PDF), and then use some OCR or other technique for extracting
> text, and the resultant index quality?
>
>
> Thanks in advance,
> Sithu D Sudarsan
>
> sithu.sudarsan@fda.hhs.gov
> sdsudarsan@ualr.edu
>
>
>
>
>   


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

RE: Use of scanned documents for text extraction and indexing

Posted by Renaud Waldura <re...@library.ucsf.edu>.

There is quite a bit of literature available on this topic. This paper
presents a summary. Nothing immediately applicable I'm afraid.

Retrieving OCR Text: A survey of current approaches
Steven M. Beitzel, Eric C. Jensen, David A Grossman
Illinois Institute of Technology

It lists a number of other papers that are easy to find online. Let me know
what you find, I'm interested in this too.

--Renaud

 

-----Original Message-----
From: Sudarsan, Sithu D. [mailto:Sithu.Sudarsan@fda.hhs.gov] 
Sent: Thursday, February 26, 2009 8:29 AM
To: solr-user@lucene.apache.org; java-user@lucene.apache.org
Subject: Use of scanned documents for text extraction and indexing


Hi All:

Is there any study / research done on using scanned paper documents as
images (may be PDF), and then use some OCR or other technique for extracting
text, and the resultant index quality?


Thanks in advance,
Sithu D Sudarsan

sithu.sudarsan@fda.hhs.gov
sdsudarsan@ualr.edu






RE: Use of scanned documents for text extraction and indexing

Posted by "Sudarsan, Sithu D." <Si...@fda.hhs.gov>.

 

Thanks to all who have responded (Hannes, Shashi, Vikram, Bastian,
Renaud and the rest).

Using OCRopus might provide the flexibility to use multi-column
documents and formatted ones.

Regarding literature on OCR, a few follow-ups to the paper Renaud pointed
to do exist, but I could not locate anything significant.

I'll update if I can find something useful to report.



Sincerely,
Sithu 
sithu.sudarsan@fda.hhs.gov
sdsudarsan@ualr.edu

-----Original Message-----
From: Vikram Kumar [mailto:vikrambkumar@gmail.com] 
Sent: Friday, February 27, 2009 5:44 AM
To: solr-user@lucene.apache.org; Shashi Kant
Subject: Re: Use of scanned documents for text extraction and indexing

Check this:
http://code.google.com/p/ocropus/wiki/FrequentlyAskedQuestions

> How well does it work?
>
> The character recognition accuracy of OCRopus right now (04/2007) is about
> like Tesseract. That's because the only character recognition plug-in in
> OCRopus is, in fact, Tesseract. In the future, there will be additional
> character recognition plug-ins, both for Latin and for other character sets.
>
> The big area of improvement relative to other open source OCR systems right
> now is in the area of layout analysis; in our benchmarks, OCRopus greatly
> reduces layout errors compared to other open source systems.

OCR is only a part of the solution with scanned documents, i.e., it only
recognizes text.

For structural/semantic understanding of documents, you need engines like
OCRopus that can do layout analysis and provide meaningful data for document
analysis and understanding.

From their own Wiki:

> Should I use OCRopus or Tesseract?
>
> You might consider using OCRopus right now if you require layout analysis,
> if you want to contribute to it, if you find its output format more
> convenient (HTML with embedded OCR information), and/or if you anticipate
> requiring some of its other capabilities in the future (pluggability,
> multiple scripts, statistical language models, etc.).
>
> In terms of character error rates, OCRopus performs similar to Tesseract. In
> terms of layout analysis, OCRopus is significantly better than Tesseract.
>
> The main reasons not to use OCRopus yet is that it hasn't been packaged yet,
> that it has limited multi-platform support, and that it runs somewhat
> slower. We hope to address all those issues by the beta release.


On Thu, Feb 26, 2009 at 11:35 PM, Shashi Kant <sh...@yahoo.com>
wrote:

> Can anyone back that up?
>
> IMHO Tesseract is the state-of-the-art in OCR, but not sure that "Ocropus
> builds on Tesseract".
> Can you confirm that Vikram has a point?
>
> Shashi
>
>
>
>
> ----- Original Message ----
> From: Vikram Kumar <vi...@gmail.com>
> To: solr-user@lucene.apache.org; Shashi Kant <sk...@sloan.mit.edu>
> Sent: Thursday, February 26, 2009 9:21:07 PM
> Subject: Re: Use of scanned documents for text extraction and indexing
>
> Tesseract is pure OCR. Ocropus builds on Tesseract.
> Vikram
>
> On Thu, Feb 26, 2009 at 12:11 PM, Shashi Kant <sh...@yahoo.com>
> wrote:
>
> > Another project worth investigating is Tesseract.
> >
> > http://code.google.com/p/tesseract-ocr/
> >
> >
> >
> >
> > ----- Original Message ----
> > From: Hannes Carl Meyer <ma...@hcmeyer.com>
> > To: solr-user@lucene.apache.org
> > Sent: Thursday, February 26, 2009 11:35:14 AM
> > Subject: Re: Use of scanned documents for text extraction and indexing
> >
> > Hi Sithu,
> >
> > there is a project called ocropus done by the DFKI, check the online demo
> > here: http://demo.iupr.org/cgi-bin/main.cgi
> >
> > And also http://sites.google.com/site/ocropus/
> >
> > Regards
> >
> > Hannes
> >
> > mail@hcmeyer.com
> > http://mimblog.de
> >
> > On Thu, Feb 26, 2009 at 5:29 PM, Sudarsan, Sithu D. <
> > Sithu.Sudarsan@fda.hhs.gov> wrote:
> >
> > >
> > > Hi All:
> > >
> > > Is there any study / research done on using scanned paper documents as
> > > images (may be PDF), and then use some OCR or other technique for
> > > extracting text, and the resultant index quality?
> > >
> > >
> > > Thanks in advance,
> > > Sithu D Sudarsan
> > >
> > > sithu.sudarsan@fda.hhs.gov
> > > sdsudarsan@ualr.edu
> > >
> > >
> > >
> >
> >
>
>

Re: Use of scanned documents for text extraction and indexing

Posted by Vikram Kumar <vi...@gmail.com>.

Check this: http://code.google.com/p/ocropus/wiki/FrequentlyAskedQuestions

> How well does it work?
>
> The character recognition accuracy of OCRopus right now (04/2007) is about
> like Tesseract. That's because the only character recognition plug-in in
> OCRopus is, in fact, Tesseract. In the future, there will be additional
> character recognition plug-ins, both for Latin and for other character sets.
>
> The big area of improvement relative to other open source OCR systems right
> now is in the area of layout analysis; in our benchmarks, OCRopus greatly
> reduces layout errors compared to other open source systems.

OCR is only a part of the solution with scanned documents, i.e., it only
recognizes text.

For structural/semantic understanding of documents, you need engines like
OCRopus that can do layout analysis and provide meaningful data for document
analysis and understanding.

From their own Wiki:

> Should I use OCRopus or Tesseract?
>
> You might consider using OCRopus right now if you require layout analysis,
> if you want to contribute to it, if you find its output format more
> convenient (HTML with embedded OCR information), and/or if you anticipate
> requiring some of its other capabilities in the future (pluggability,
> multiple scripts, statistical language models, etc.).
>
> In terms of character error rates, OCRopus performs similar to Tesseract. In
> terms of layout analysis, OCRopus is significantly better than Tesseract.
>
> The main reasons not to use OCRopus yet is that it hasn't been packaged yet,
> that it has limited multi-platform support, and that it runs somewhat
> slower. We hope to address all those issues by the beta release.
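
The FAQ does not say how it measures "character error rates". One common
definition, assumed here, is the edit (Levenshtein) distance between the
OCR output and a hand-checked reference, divided by the length of the
reference. A rough sketch, with a hypothetical class name, that could be
used to compare engines on your own scans before indexing:

```java
// Hypothetical helper: character error rate via Levenshtein distance.
final class CharacterErrorRate {

    // Minimum number of single-character insertions, deletions and
    // substitutions needed to turn a into b (two-row dynamic programming).
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Errors per reference character; 0.0 means a perfect transcription.
    static double cer(String reference, String ocrOutput) {
        if (reference.isEmpty()) {
            throw new IllegalArgumentException("reference must not be empty");
        }
        return (double) editDistance(reference, ocrOutput) / reference.length();
    }
}
```

For example, an engine that reads "lucene" as "1ucene" makes one
substitution in six characters, a rate of about 0.17.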


On Thu, Feb 26, 2009 at 11:35 PM, Shashi Kant <sh...@yahoo.com> wrote:

> Can anyone back that up?
>
> IMHO Tesseract is the state-of-the-art in OCR, but not sure that "Ocropus
> builds on Tesseract".
> Can you confirm that Vikram has a point?
>
> Shashi
>
>
>
>
> ----- Original Message ----
> From: Vikram Kumar <vi...@gmail.com>
> To: solr-user@lucene.apache.org; Shashi Kant <sk...@sloan.mit.edu>
> Sent: Thursday, February 26, 2009 9:21:07 PM
> Subject: Re: Use of scanned documents for text extraction and indexing
>
> Tesseract is pure OCR. Ocropus builds on Tesseract.
> Vikram
>
> On Thu, Feb 26, 2009 at 12:11 PM, Shashi Kant <sh...@yahoo.com>
> wrote:
>
> > Another project worth investigating is Tesseract.
> >
> > http://code.google.com/p/tesseract-ocr/
> >
> >
> >
> >
> > ----- Original Message ----
> > From: Hannes Carl Meyer <ma...@hcmeyer.com>
> > To: solr-user@lucene.apache.org
> > Sent: Thursday, February 26, 2009 11:35:14 AM
> > Subject: Re: Use of scanned documents for text extraction and indexing
> >
> > Hi Sithu,
> >
> > there is a project called ocropus done by the DFKI, check the online demo
> > here: http://demo.iupr.org/cgi-bin/main.cgi
> >
> > And also http://sites.google.com/site/ocropus/
> >
> > Regards
> >
> > Hannes
> >
> > mail@hcmeyer.com
> > http://mimblog.de
> >
> > On Thu, Feb 26, 2009 at 5:29 PM, Sudarsan, Sithu D. <
> > Sithu.Sudarsan@fda.hhs.gov> wrote:
> >
> > >
> > > Hi All:
> > >
> > > Is there any study / research done on using scanned paper documents as
> > > images (may be PDF), and then use some OCR or other technique for
> > > extracting text, and the resultant index quality?
> > >
> > >
> > > Thanks in advance,
> > > Sithu D Sudarsan
> > >
> > > sithu.sudarsan@fda.hhs.gov
> > > sdsudarsan@ualr.edu
> > >
> > >
> > >
> >
> >
>
>

Re: Use of scanned documents for text extraction and indexing

Posted by Shashi Kant <sh...@yahoo.com>.

Can anyone back that up?

IMHO Tesseract is the state-of-the-art in OCR, but not sure that "Ocropus builds on Tesseract".
Can you confirm that Vikram has a point?

Shashi




----- Original Message ----
From: Vikram Kumar <vi...@gmail.com>
To: solr-user@lucene.apache.org; Shashi Kant <sk...@sloan.mit.edu>
Sent: Thursday, February 26, 2009 9:21:07 PM
Subject: Re: Use of scanned documents for text extraction and indexing

Tesseract is pure OCR. Ocropus builds on Tesseract.
Vikram

On Thu, Feb 26, 2009 at 12:11 PM, Shashi Kant <sh...@yahoo.com> wrote:

> Another project worth investigating is Tesseract.
>
> http://code.google.com/p/tesseract-ocr/
>
>
>
>
> ----- Original Message ----
> From: Hannes Carl Meyer <ma...@hcmeyer.com>
> To: solr-user@lucene.apache.org
> Sent: Thursday, February 26, 2009 11:35:14 AM
> Subject: Re: Use of scanned documents for text extraction and indexing
>
> Hi Sithu,
>
> there is a project called ocropus done by the DFKI, check the online demo
> here: http://demo.iupr.org/cgi-bin/main.cgi
>
> And also http://sites.google.com/site/ocropus/
>
> Regards
>
> Hannes
>
> mail@hcmeyer.com
> http://mimblog.de
>
> On Thu, Feb 26, 2009 at 5:29 PM, Sudarsan, Sithu D. <
> Sithu.Sudarsan@fda.hhs.gov> wrote:
>
> >
> > Hi All:
> >
> > Is there any study / research done on using scanned paper documents as
> > images (may be PDF), and then use some OCR or other technique for
> > extracting text, and the resultant index quality?
> >
> >
> > Thanks in advance,
> > Sithu D Sudarsan
> >
> > sithu.sudarsan@fda.hhs.gov
> > sdsudarsan@ualr.edu
> >
> >
> >
>
>

Re: Use of scanned documents for text extraction and indexing

Posted by Vikram Kumar <vi...@gmail.com>.

Tesseract is pure OCR. Ocropus builds on Tesseract.
Vikram

On Thu, Feb 26, 2009 at 12:11 PM, Shashi Kant <sh...@yahoo.com> wrote:

> Another project worth investigating is Tesseract.
>
> http://code.google.com/p/tesseract-ocr/
>
>
>
>
> ----- Original Message ----
> From: Hannes Carl Meyer <ma...@hcmeyer.com>
> To: solr-user@lucene.apache.org
> Sent: Thursday, February 26, 2009 11:35:14 AM
> Subject: Re: Use of scanned documents for text extraction and indexing
>
> Hi Sithu,
>
> there is a project called ocropus done by the DFKI, check the online demo
> here: http://demo.iupr.org/cgi-bin/main.cgi
>
> And also http://sites.google.com/site/ocropus/
>
> Regards
>
> Hannes
>
> mail@hcmeyer.com
> http://mimblog.de
>
> On Thu, Feb 26, 2009 at 5:29 PM, Sudarsan, Sithu D. <
> Sithu.Sudarsan@fda.hhs.gov> wrote:
>
> >
> > Hi All:
> >
> > Is there any study / research done on using scanned paper documents as
> > images (may be PDF), and then use some OCR or other technique for
> > extracting text, and the resultant index quality?
> >
> >
> > Thanks in advance,
> > Sithu D Sudarsan
> >
> > sithu.sudarsan@fda.hhs.gov
> > sdsudarsan@ualr.edu
> >
> >
> >
>
>

Re: Use of scanned documents for text extraction and indexing

Posted by Shashi Kant <sh...@yahoo.com>.

Another project worth investigating is Tesseract.

http://code.google.com/p/tesseract-ocr/




----- Original Message ----
From: Hannes Carl Meyer <ma...@hcmeyer.com>
To: solr-user@lucene.apache.org
Sent: Thursday, February 26, 2009 11:35:14 AM
Subject: Re: Use of scanned documents for text extraction and indexing

Hi Sithu,

there is a project called ocropus done by the DFKI, check the online demo
here: http://demo.iupr.org/cgi-bin/main.cgi

And also http://sites.google.com/site/ocropus/

Regards

Hannes

mail@hcmeyer.com
http://mimblog.de

On Thu, Feb 26, 2009 at 5:29 PM, Sudarsan, Sithu D. <
Sithu.Sudarsan@fda.hhs.gov> wrote:

>
> Hi All:
>
> Is there any study / research done on using scanned paper documents as
> images (may be PDF), and then use some OCR or other technique for
> extracting text, and the resultant index quality?
>
>
> Thanks in advance,
> Sithu D Sudarsan
>
> sithu.sudarsan@fda.hhs.gov
> sdsudarsan@ualr.edu
>
>
>

RE: Use of scanned documents for text extraction and indexing

Posted by "Sudarsan, Sithu D." <Si...@fda.hhs.gov>.

Thanks Hannes,

The tool looks good. 

Sincerely,
Sithu D Sudarsan

sithu.sudarsan@fda.hhs.gov
sdsudarsan@ualr.edu

-----Original Message-----
From: hannescarl@googlemail.com [mailto:hannescarl@googlemail.com] On
Behalf Of Hannes Carl Meyer
Sent: Thursday, February 26, 2009 11:35 AM
To: solr-user@lucene.apache.org
Subject: Re: Use of scanned documents for text extraction and indexing

Hi Sithu,

there is a project called ocropus done by the DFKI, check the online demo
here: http://demo.iupr.org/cgi-bin/main.cgi

And also http://sites.google.com/site/ocropus/

Regards

Hannes

mail@hcmeyer.com
http://mimblog.de

On Thu, Feb 26, 2009 at 5:29 PM, Sudarsan, Sithu D. <
Sithu.Sudarsan@fda.hhs.gov> wrote:

>
> Hi All:
>
> Is there any study / research done on using scanned paper documents as
> images (may be PDF), and then use some OCR or other technique for
> extracting text, and the resultant index quality?
>
>
> Thanks in advance,
> Sithu D Sudarsan
>
> sithu.sudarsan@fda.hhs.gov
> sdsudarsan@ualr.edu
>
>
>

Re: Use of scanned documents for text extraction and indexing

Posted by Hannes Carl Meyer <ma...@hcmeyer.com>.

Hi Sithu,

there is a project called ocropus done by the DFKI, check the online demo
here: http://demo.iupr.org/cgi-bin/main.cgi

And also http://sites.google.com/site/ocropus/

Regards

Hannes

mail@hcmeyer.com
http://mimblog.de

On Thu, Feb 26, 2009 at 5:29 PM, Sudarsan, Sithu D. <
Sithu.Sudarsan@fda.hhs.gov> wrote:

>
> Hi All:
>
> Is there any study / research done on using scanned paper documents as
> images (may be PDF), and then use some OCR or other technique for
> extracting text, and the resultant index quality?
>
>
> Thanks in advance,
> Sithu D Sudarsan
>
> sithu.sudarsan@fda.hhs.gov
> sdsudarsan@ualr.edu
>
>
>