Posted to solr-user@lucene.apache.org by "Burton-West, Tom" <tb...@umich.edu> on 2010/03/09 20:31:27 UTC

Cleaning up dirty OCR

Hello all,

We have been indexing a large collection of OCR'd text: about 5 million books in over 200 languages. With 1.5 billion OCR'd pages, even a small OCR error rate creates a relatively large number of meaningless unique terms. (See http://www.hathitrust.org/blogs/large-scale-search/too-many-words )

We would like to remove some *fraction* of these nonsense words caused by OCR errors prior to indexing. (We don't want to remove "real" words, so we need some method with very few false positives.)

A dictionary-based approach does not seem feasible given the number of languages and the inclusion of proper names, place names, and technical terms. We are considering using some heuristics, such as looking for strings over a certain length or strings containing more than some number of punctuation characters.
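To make that concrete, here is a rough sketch of the sort of check we have in mind (the thresholds below are made up for illustration, not tuned against our data):

public class OcrTokenHeuristics {

    private static final int MAX_LENGTH = 25;      // hypothetical cutoff
    private static final int MAX_PUNCTUATION = 2;  // hypothetical cutoff

    /** Returns true if the token looks like OCR garbage under these two simple rules. */
    public static boolean looksLikeGarbage(String token) {
        if (token.length() > MAX_LENGTH) {
            return true;
        }
        int punct = 0;
        for (int i = 0; i < token.length(); i++) {
            if (!Character.isLetterOrDigit(token.charAt(i))) {
                punct++;
            }
        }
        return punct > MAX_PUNCTUATION;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeGarbage("university"));           // false
        System.out.println(looksLikeGarbage("ffllllll.lj'slall'lm"));  // true: 3 punctuation characters
    }
}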

This paper has a few such heuristics:
Kazem Taghva, Tom Nartker, Allen Condit, and Julie Borsack. "Automatic Removal of Garbage Strings in OCR Text: An Implementation." In The 5th World Multi-Conference on Systemics, Cybernetics and Informatics, Orlando, Florida, July 2001. http://www.isri.unlv.edu/publications/isripub/Taghva01b.pdf

Can anyone suggest any practical solutions to removing some fraction of the tokens containing OCR errors from our input stream?

Tom Burton-West
University of Michigan Library
www.hathitrust.org


Re: Cleaning up dirty OCR

Posted by Tom Burton-West <tb...@gmail.com>.
We've been thinking about running some kind of classifier against each book
to select books with a high percentage of dirty OCR for some kind of special
processing. We haven't quite figured out a multilingual feature set yet other
than the punctuation/alphanumeric and character block ideas mentioned above.

I'm not sure I understand your suggestion. Since real-word hapax legomenons
are generally pretty common (maybe 40-60% of unique words), wouldn't using
them as the "no" set provide mixed signals to the classifier?

Tom


Walter Underwood-2 wrote:
> 
> Hmm, how about a classifier? Common words are the "yes" training set,
> hapax legomenons are the "no" set, and n-grams are the features.
> 
> But why isn't the OCR program already doing this?
> 
> wunder



Re: Cleaning up dirty OCR

Posted by Walter Underwood <wu...@wunderwood.org>.
On Mar 11, 2010, at 1:34 PM, Chris Hostetter wrote:

> I wonder if one way to try and generalize 
> the idea of "unlikely" letter combinations into a math problem (instead of 
> a grammar/spelling problem) would be to score all the hapax legomenon 
> words in your index


Hmm, how about a classifier? Common words are the "yes" training set, hapax legomenons are the "no" set, and n-grams are the features.
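A minimal sketch of what I'm picturing, assuming you have already pulled a common-word list and a hapax list out of the index (comparing two add-one-smoothed character-trigram models is just one crude way to set it up, and this is untested):

import java.util.*;

// Common words train the "yes" (clean) trigram model, hapax legomenons train
// the "no" (suspect) model; a word is flagged when the "no" model fits better.
public class GarbageClassifier {

    private final Map<String, Double> yesLogP = new HashMap<>();
    private final Map<String, Double> noLogP  = new HashMap<>();
    private final double yesUnseen;  // smoothed log-probability for unseen trigrams
    private final double noUnseen;

    private static List<String> trigrams(String word) {
        String w = "^" + word.toLowerCase() + "$";   // mark word boundaries
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 3 <= w.length(); i++) {
            grams.add(w.substring(i, i + 3));
        }
        return grams;
    }

    // Builds add-one-smoothed log-probabilities; returns the log-probability of an unseen trigram.
    private static double train(Collection<String> words, Map<String, Double> logP) {
        Map<String, Integer> counts = new HashMap<>();
        int total = 0;
        for (String word : words) {
            for (String g : trigrams(word)) {
                counts.merge(g, 1, Integer::sum);
                total++;
            }
        }
        double denom = total + counts.size() + 1;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            logP.put(e.getKey(), Math.log((e.getValue() + 1) / denom));
        }
        return Math.log(1.0 / denom);
    }

    public GarbageClassifier(Collection<String> commonWords, Collection<String> hapaxWords) {
        yesUnseen = train(commonWords, yesLogP);
        noUnseen  = train(hapaxWords, noLogP);
    }

    /** True if the hapax-trained model explains the word better than the common-word model. */
    public boolean looksLikeGarbage(String word) {
        double yes = 0, no = 0;
        for (String g : trigrams(word)) {
            yes += yesLogP.getOrDefault(g, yesUnseen);
            no  += noLogP.getOrDefault(g, noUnseen);
        }
        return no > yes;
    }
}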

But why isn't the OCR program already doing this?

wunder





Re: Cleaning up dirty OCR

Posted by Robert Muir <rc...@gmail.com>.
>
> I don't deal with a lot of multi-lingual stuff, but my understanding is
> that this sort of thing gets a lot easier if you can partition your docs
> by language -- and even if you can't, you could do some language detection on the
> (dirty) OCRed text to get a language guess, then partition by language
> and attempt to find the suspicious words in each partition.
>

and if you are really OCR'ing Urdu text and trying to search it automatically,
then this is your last priority.

-- 
Robert Muir
rcmuir@gmail.com

Re: Cleaning up dirty OCR

Posted by Chris Hostetter <ho...@fucit.org>.
: Interesting.  I wonder though if we have 4 million English documents and 250
: in Urdu, if the Urdu words would score badly when compared to ngram
: statistics for the entire corpus.  

Well, it doesn't have to be a strict ratio cutoff ... you could look at the 
average frequency of all character N-grams in your index, and then 
consider any N-gram whose frequency falls more than X standard deviations below 
the average to be suspicious, and eliminate any word that contains Y or more 
suspicious N-grams.

Or you could just start really simple and eliminate any word that contains 
an N-gram that doesn't appear in *any* other word in your corpus.
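In code, the stddev variant might look something like the sketch below 
(completely untested; X and Y are arbitrary knobs you'd have to tune, and 
I'm counting each N-gram once per vocabulary word for simplicity):

import java.util.*;

// Sketch: flag words containing "suspicious" character N-grams, where
// suspicious means the N-gram's frequency is more than X standard deviations
// below the average N-gram frequency in the corpus vocabulary.
public class SuspiciousNgramFilter {

    private final Set<String> suspicious = new HashSet<>();
    private final int maxSuspiciousPerWord;   // the "Y" cutoff
    private final int n;

    public SuspiciousNgramFilter(Collection<String> corpusWords, int n,
                                 double stddevCutoff, int maxSuspiciousPerWord) {
        this.n = n;
        this.maxSuspiciousPerWord = maxSuspiciousPerWord;

        // Count every character N-gram across the vocabulary.
        Map<String, Integer> counts = new HashMap<>();
        for (String w : corpusWords) {
            for (String g : ngrams(w, n)) {
                counts.merge(g, 1, Integer::sum);
            }
        }

        // Mean and standard deviation of the N-gram frequencies.
        double mean = 0;
        for (int c : counts.values()) mean += c;
        mean /= counts.size();
        double var = 0;
        for (int c : counts.values()) var += (c - mean) * (c - mean);
        double stddev = Math.sqrt(var / counts.size());

        double threshold = mean - stddevCutoff * stddev;   // the "X" cutoff
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() < threshold) {
                suspicious.add(e.getKey());
            }
        }
    }

    private static List<String> ngrams(String word, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    /** True if the word contains Y or more suspicious N-grams. */
    public boolean shouldEliminate(String word) {
        int hits = 0;
        for (String g : ngrams(word, n)) {
            if (suspicious.contains(g)) hits++;
        }
        return hits >= maxSuspiciousPerWord;
    }
}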

I don't deal with a lot of multi-lingual stuff, but my understanding is 
that this sort of thing gets a lot easier if you can partition your docs 
by language -- and even if you can't, you could do some language detection on the 
(dirty) OCRed text to get a language guess, then partition by language 
and attempt to find the suspicious words in each partition.


-Hoss


Re: Cleaning up dirty OCR

Posted by Tom Burton-West <tb...@gmail.com>.
Interesting. I wonder, though, if we have 4 million English documents and 250
in Urdu, whether the Urdu words would score badly when compared to n-gram
statistics for the entire corpus.


hossman wrote:
> 
> Since you are dealing with multiple languages, and multiple variant usages 
> of languages (i.e., Olde English), I wonder if one way to try and generalize 
> the idea of "unlikely" letter combinations into a math problem (instead of 
> a grammar/spelling problem) would be to score all the hapax legomenon 
> words in your index based on the frequency of (character) N-grams in 
> each of those words, relative to the entire corpus, and then eliminate any of 
> the hapax legomenon words whose score is below some cut-off threshold 
> (that you'd have to pick arbitrarily, probably by eyeballing the sorted 
> list of words and their contexts to decide if they are legitimate)?
> 
> -Hoss



Re: Cleaning up dirty OCR

Posted by Chris Hostetter <ho...@fucit.org>.
: We can probably implement your suggestion about runs of punctuation and
: unlikely mixes of alpha/numeric/punctuation.  I'm also thinking about
: looking for unlikely mixes of unicode character blocks.  For example some of
: the CJK material ends up with Cyrillic characters. (except we would have to
: watch out for any Russian-Chinese dictionaries:)

Since you are dealing with multiple languages, and multiple variant usages 
of languages (i.e., Olde English), I wonder if one way to try and generalize 
the idea of "unlikely" letter combinations into a math problem (instead of 
a grammar/spelling problem) would be to score all the hapax legomenon 
words in your index based on the frequency of (character) N-grams in 
each of those words, relative to the entire corpus, and then eliminate any of 
the hapax legomenon words whose score is below some cut-off threshold 
(that you'd have to pick arbitrarily, probably by eyeballing the sorted 
list of words and their contexts to decide if they are legitimate)?


-Hoss


Re: Cleaning up dirty OCR

Posted by Robert Muir <rc...@gmail.com>.
On Thu, Mar 11, 2010 at 4:14 PM, Tom Burton-West <tb...@gmail.com> wrote:
>
> Thanks Simon,
>
> We can probably implement your suggestion about runs of punctuation and
> unlikely mixes of alpha/numeric/punctuation.  I'm also thinking about
> looking for unlikely mixes of unicode character blocks.  For example some of
> the CJK material ends up with Cyrillic characters. (except we would have to
> watch out for any Russian-Chinese dictionaries:)
>

OK, this is a new one for me. I'm just curious: have you figured out
why this is happening?

Separately, I would love to see some character frequency data
for your non-English text. Are you OCR'ing that data too? Are you
using Unicode normalization or anything to prevent an explosion of terms
that are really the same?

-- 
Robert Muir
rcmuir@gmail.com

Re: Cleaning up dirty OCR

Posted by Tom Burton-West <tb...@gmail.com>.
Thanks Simon,

We can probably implement your suggestion about runs of punctuation and
unlikely mixes of alpha/numeric/punctuation.  I'm also thinking about
looking for unlikely mixes of unicode character blocks.  For example some of
the CJK material ends up with Cyrillic characters. (except we would have to
watch out for any Russian-Chinese dictionaries:)
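The character-block check itself looks easy enough; something along these lines (just a sketch, and deciding which block combinations count as "unlikely" is the part we still have to work out):

import java.lang.Character.UnicodeBlock;
import java.util.*;

// Sketch: flag tokens that mix "unlikely" Unicode block combinations,
// e.g. CJK ideographs together with Cyrillic in the same token.
public class UnicodeBlockChecker {

    public static Set<UnicodeBlock> blocksIn(String token) {
        Set<UnicodeBlock> blocks = new HashSet<>();
        for (int i = 0; i < token.length(); ) {
            int cp = token.codePointAt(i);
            UnicodeBlock block = UnicodeBlock.of(cp);
            if (block != null) {
                blocks.add(block);
            }
            i += Character.charCount(cp);
        }
        return blocks;
    }

    /** True if the token mixes CJK ideographs and Cyrillic -- one "unlikely" combination. */
    public static boolean mixesCjkAndCyrillic(String token) {
        Set<UnicodeBlock> blocks = blocksIn(token);
        return blocks.contains(UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS)
            && blocks.contains(UnicodeBlock.CYRILLIC);
    }
}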

Tom



simon wrote:
> 
> There wasn't any completely satisfactory solution; there were a large number
> of two- and three-letter n-grams, so we were able to use a dictionary approach
> to eliminate those (names tend to be longer). We also looked for runs of
> punctuation, unlikely mixes of alpha/numeric/punctuation, and also
> eliminated longer words which consisted of runs of not-occurring-in-English
> bigrams.
> 
> Hope this helps
> 
> -Simon


Re: Cleaning up dirty OCR

Posted by simon <si...@lucidimagination.com>.
On Tue, Mar 9, 2010 at 2:35 PM, Robert Muir <rc...@gmail.com> wrote:

> > Can anyone suggest any practical solutions to removing some fraction of
> the tokens containing OCR errors from our input stream?
>
> one approach would be to try
> http://issues.apache.org/jira/browse/LUCENE-1812
>
> and filter terms that only appear once in the document.
>

In another life (and with another search engine) I also had to find a
solution to the dirty OCR problem. Fortunately it was only in English;
unfortunately the corpus contained many non-American/non-English names, so we
also had to be very conservative and reduce the number of false positives.

There wasn't any completely satisfactory solution; there were a large number
of two- and three-letter n-grams, so we were able to use a dictionary approach
to eliminate those (names tend to be longer). We also looked for runs of
punctuation, unlikely mixes of alpha/numeric/punctuation, and also
eliminated longer words which consisted of runs of not-occurring-in-English
bigrams.
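From memory, the bigram check was roughly shaped like the sketch below (not the actual code, and it assumes you build the bigram set from a wordlist such as /usr/share/dict/words):

import java.util.*;

// Sketch: build the set of letter bigrams that occur in an English wordlist,
// then flag long words containing several bigrams outside that set.
public class EnglishBigramCheck {

    private final Set<String> englishBigrams = new HashSet<>();

    public EnglishBigramCheck(Collection<String> dictionaryWords) {
        for (String w : dictionaryWords) {
            String lower = w.toLowerCase();
            for (int i = 0; i + 2 <= lower.length(); i++) {
                englishBigrams.add(lower.substring(i, i + 2));
            }
        }
    }

    /** True if a long word contains at least minBadBigrams bigrams never seen in the dictionary. */
    public boolean looksLikeGarbage(String word, int minLength, int minBadBigrams) {
        if (word.length() < minLength) {
            return false;
        }
        String lower = word.toLowerCase();
        int bad = 0;
        for (int i = 0; i + 2 <= lower.length(); i++) {
            if (!englishBigrams.contains(lower.substring(i, i + 2))) {
                bad++;
            }
        }
        return bad >= minBadBigrams;
    }
}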

Hope this helps

-Simon


Re: Cleaning up dirty OCR

Posted by Robert Muir <rc...@gmail.com>.
On Thu, Mar 11, 2010 at 3:37 PM, Burton-West, Tom <tb...@umich.edu> wrote:
> Thanks Robert,
>
> I've been thinking about this since you suggested it on another thread. One problem is that it would also remove real words. Apparently 40-60% of the words in large corpora occur only once (http://en.wikipedia.org/wiki/Hapax_legomenon).
>

You are correct. I really hate recommending that you 'remove data', but at
the same time, as perhaps an intermediate step, this could be a
brutally simple approach to move you along.


> I guess the question I need to determine is whether the benefit of removing words that occur only once outweighs the costs in terms of the two use cases outlined above.   When we get our new test server set up, sometime in the next month, I think I will go ahead and prune a test index of 500K docs and do some performance testing just to get an idea of the potential performance gains of pruning the index.

Well, one thing I did with Andrzej's patch was to immediately
relevance-test this approach against some corpora I had. The results
are on the JIRA issue, and the test collection itself is in
openrelevance.

In my opinion the P@N is probably overstated, and the MAP values are
probably understated (due to it being a pooled relevance collection),
but I think it's fair to say that for that specific large text collection,
pruning terms that appear only a single time in a document does not
hurt relevance.

At the same time, I will not dispute that it could actually help P@N; I
am just saying I'm not sold :)

Either way it's extremely interesting: cut your index size in half and
get the same relevance!

>
> I have some other questions about index pruning, but I want to do a bit more reading and then I'll post a question to either the Solr or Lucene list.  Can you suggest which list I should post an index pruning question to?
>

I would recommend posting it to the JIRA issue:
http://issues.apache.org/jira/browse/LUCENE-1812

This way someone who knows more (Andrzej) could see it, too.


-- 
Robert Muir
rcmuir@gmail.com

RE: Cleaning up dirty OCR

Posted by "Burton-West, Tom" <tb...@umich.edu>.
Thanks Robert,

I've been thinking about this since you suggested it on another thread. One problem is that it would also remove real words. Apparently 40-60% of the words in large corpora occur only once (http://en.wikipedia.org/wiki/Hapax_legomenon).

There are a couple of use cases where removing words that occur only once might be a problem.  

One is for genealogical searches where a user might want to retrieve a document if their relative is only mentioned once in the document.  We have quite a few government documents and other resources such as the "Lineage Book" of the Daughters of the American Revolution.  

Another use case is humanities researchers doing phrase searching for quotes. In this case, if we remove one of the words in the quote because it occurs only once in a document, then the phrase search would fail. For example, if someone were searching Macbeth and entered the phrase query "Eye of newt and toe of frog", it would fail if we had removed "newt" from the index, because "newt" occurs only once in Macbeth.

I ran a quick check against a couple of our copies of Macbeth and found that out of about 5,000 unique words, about 3,000 occurred only once. Of these, about 1,800 were in the Unix dictionary, so at least 1,800 words that would be removed would be "real" words as opposed to OCR errors (a spot check of the words not in the Unix /usr/share/dict/words file revealed that most of them were also real words rather than OCR errors).

I also ran a quick check against a document with bad OCR and found that out of about 30,000 unique words, 20,000 occurred only once. Of those 20,000, only about 300 were in the Unix dictionary, so your intuition that a lot of OCR errors will occur only once seems spot on. A quick look at the words not in the dictionary revealed a mix of technical terms, common names, and obvious OCR nonsense such as "ffllllll.lj'slall'lm".
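For anyone who wants to repeat that sort of quick check, it amounts to something like the following (the whitespace-ish tokenization here is obviously much cruder than what our analyzer does):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

// Quick-and-dirty hapax check: count unique words in a text file, see how many
// occur only once, and how many of those are in /usr/share/dict/words.
// Usage: java HapaxCheck some-ocr-text.txt
public class HapaxCheck {

    public static void main(String[] args) throws IOException {
        String text = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);

        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().split("[^\\p{L}']+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }

        Set<String> dictionary = new HashSet<>();
        for (String line : Files.readAllLines(Paths.get("/usr/share/dict/words"), StandardCharsets.UTF_8)) {
            dictionary.add(line.toLowerCase());
        }

        int hapaxes = 0, hapaxesInDict = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() == 1) {
                hapaxes++;
                if (dictionary.contains(e.getKey())) {
                    hapaxesInDict++;
                }
            }
        }
        System.out.println("unique words:      " + counts.size());
        System.out.println("occur only once:   " + hapaxes);
        System.out.println("of those, in dict: " + hapaxesInDict);
    }
}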

I guess the question I need to determine is whether the benefit of removing words that occur only once outweighs the costs in terms of the two use cases outlined above.   When we get our new test server set up, sometime in the next month, I think I will go ahead and prune a test index of 500K docs and do some performance testing just to get an idea of the potential performance gains of pruning the index.

I have some other questions about index pruning, but I want to do a bit more reading and then I'll post a question to either the Solr or Lucene list.  Can you suggest which list I should post an index pruning question to?

Tom








-----Original Message-----
From: Robert Muir [mailto:rcmuir@gmail.com] 
Sent: Tuesday, March 09, 2010 2:36 PM
To: solr-user@lucene.apache.org
Subject: Re: Cleaning up dirty OCR

> Can anyone suggest any practical solutions to removing some fraction of the tokens containing OCR errors from our input stream?

one approach would be to try http://issues.apache.org/jira/browse/LUCENE-1812

and filter terms that only appear once in the document.


-- 
Robert Muir
rcmuir@gmail.com

Re: Cleaning up dirty OCR

Posted by Robert Muir <rc...@gmail.com>.
> Can anyone suggest any practical solutions to removing some fraction of the tokens containing OCR errors from our input stream?

one approach would be to try http://issues.apache.org/jira/browse/LUCENE-1812

and filter terms that only appear once in the document.
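If you would rather drop those terms before indexing instead of pruning an existing index, the core of the idea is just this (a crude sketch, not the LUCENE-1812 code):

import java.util.*;

// Pre-indexing version of the same idea: drop every token whose term frequency
// within the document is exactly 1, keep everything else in order.
public class HapaxDropper {

    public static List<String> dropSingletons(List<String> docTokens) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : docTokens) {
            tf.merge(t, 1, Integer::sum);
        }
        List<String> kept = new ArrayList<>();
        for (String t : docTokens) {
            if (tf.get(t) > 1) {
                kept.add(t);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("eye", "of", "newt", "and", "toe", "of", "frog");
        // Only "of" appears more than once, so everything else is dropped.
        System.out.println(dropSingletons(tokens));   // [of, of]
    }
}

Keep in mind that dropping the singletons also drops their positions, so phrase queries that include those words will no longer match.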


-- 
Robert Muir
rcmuir@gmail.com