
Posted to solr-user@lucene.apache.org by Manuel Le Normand <ma...@gmail.com> on 2014/07/02 16:19:25 UTC

OCR - Saving multi-term position

Hello,
Many of our indexed documents are scanned and OCR'ed documents.
Unfortunately, for various reasons we have not been able to improve the OCR
quality much (word accuracy is below 80%), which badly hurts retrieval
quality.

As we use an open-source OCR engine, we are thinking of replacing every
scanned term in its output with its most likely variations, to get a higher
level of confidence.

Is there any analyser that supports this kind of need, or should I make up a
syntax and analyser of my own, e.g. something like the payload syntax?

The quick brown fox --> The|1 Tlne|1 quick|2 quiok|2 browm|3 brown|3 fox|4
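
A minimal sketch of how the proposed `term|position` syntax could be read back into per-position variant groups. The function name `parse_variants` is hypothetical, and the sketch assumes every token carries a `|position` suffix; it is plain Python, not a Lucene analyser.

```python
def parse_variants(text, delimiter="|"):
    """Group 'term|position' tokens by position, keeping input order."""
    groups = {}
    for token in text.split():
        # Split on the last delimiter so terms containing '|' still parse.
        term, _, position = token.rpartition(delimiter)
        groups.setdefault(int(position), []).append(term)
    return groups


line = "The|1 Tlne|1 quick|2 quiok|2 browm|3 brown|3 fox|4"
variants = parse_variants(line)
for position, terms in variants.items():
    print(position, terms)
```

Each group holds the alternative readings of one scanned word, which is the unit an analyser would have to index at a single position.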

Thanks,
Manuel

Re: OCR - Saving multi-term position

Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.

Hi Manuel,

I think OCR error correction is a well-known NLP task.
I have thought in the past that it could be implemented using Lucene.

This is a brief idea:

1. You have a Lucene index. This existing index is built from
correct (i.e. error-free) documents from the same domain as the OCR documents.

2. Tokenize the OCR text into word shingles (e.g. with ShingleFilter). From the shingles you'll get:

the quiok
tlne quick
the quick
:

3. Search for those phrases in the existing index. I think exact search
(PhraseQuery) or FuzzyQuery would work. You should get the highest hit
count when searching "the quick" among those phrases.
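
The three steps above can be sketched in plain Python. This is only an illustration under stated assumptions: an in-memory phrase counter stands in for the clean Lucene index, counting occurrences stands in for a PhraseQuery hit count, and the names `shingles`, `reference_docs` and `best_phrase` are hypothetical.

```python
from collections import Counter
from itertools import product


def shingles(tokens, size=2):
    """Return all runs of `size` adjacent tokens, joined by a space."""
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


# Stand-in for the index built from clean documents of the same domain.
reference_docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick red fox",
]
phrase_counts = Counter(
    shingle for doc in reference_docs for shingle in shingles(doc.split())
)


def best_phrase(variant_groups):
    """Pick the variant combination seen most often in the reference text."""
    candidates = [" ".join(combo) for combo in product(*variant_groups)]
    return max(candidates, key=lambda phrase: phrase_counts[phrase])


print(best_phrase([["the", "tlne"], ["quick", "quiok"]]))
```

With this reference text, "the quick" is the only candidate that occurs at all, so it wins. A real setup would query the index instead of a Counter.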

Koji
-- 
http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html


Re: OCR - Saving multi-term position

Posted by Jack Krupansky <ja...@basetechnology.com>.

Take a look at the synonym filter as well. I mean, basically that's exactly 
what you are doing - adding synonyms at each position.
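
To illustrate the idea: in Lucene, a token whose position increment is 0 sits at the same position as the token before it, which is how synonyms are stacked. The sketch below models that in plain Python rather than calling Lucene; `stack_variants` is a hypothetical name.

```python
def stack_variants(variant_groups):
    """Yield (term, position_increment) pairs.

    The first variant of each group advances the position by 1; the
    remaining variants use increment 0, so they share that position.
    """
    for group in variant_groups:
        for index, term in enumerate(group):
            yield term, (1 if index == 0 else 0)


groups = [["the", "tlne"], ["quick", "quiok"], ["brown", "browm"], ["fox"]]
stream = list(stack_variants(groups))
for term, increment in stream:
    print(term, increment)
```

Because the variants share a position, a phrase query for "quick brown" can match whichever spelling the OCR produced for each word.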

-- Jack Krupansky


Re: OCR - Saving multi-term position

Posted by Manuel Le Normand <ma...@gmail.com>.

Thanks for your answers, Erick and Michael.

The term confidence level is an OCR output metric which gives, for every
word, the odds that it is the actual scanned term. I would like the OCR program
to output all the "suspected words" whose confidences sum to above ~90%,
instead of outputting a single word as it does by default.
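
A minimal sketch of that selection rule, assuming the OCR engine can return (term, confidence) pairs for each scanned word. The function name `select_candidates` and the sample confidences are made up for illustration.

```python
def select_candidates(candidates, threshold=0.9):
    """Keep the most confident terms until their total reaches the threshold."""
    selected = []
    total = 0.0
    for term, confidence in sorted(candidates, key=lambda c: c[1], reverse=True):
        selected.append(term)
        total += confidence
        if total >= threshold:
            break
    return selected


# Hypothetical OCR output for one scanned word.
print(select_candidates([("quiok", 0.25), ("quick", 0.70), ("qu1ck", 0.05)]))
```

A word the OCR is already sure about yields a single term, so the number of extra index terms grows only where the scan is ambiguous.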

I'm happy to hear this approach has been used before. I will implement an
analyser that indexes these terms at the same position to enable positional
queries.
I hope it works out well. If it does, I will open a Jira ticket for it.

If anyone else has experience with this use case, I'd love to hear about it.

Manuel



Re: OCR - Saving multi-term position

Posted by Erick Erickson <er...@gmail.com>.

Problem here is that you wind up with a zillion unique terms in your
index, which may lead to performance issues, but you probably already
know that :).

I've seen situations where running it through a dictionary helps. That
is, does each term in the OCR match some dictionary? Problem here is
that it then de-values terms that don't happen to be in the
dictionary, names for instance.
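
A rough sketch of the dictionary check. The fallback of keeping every variant when none is in the dictionary is an assumption added here to soften the problem with names; `filter_by_dictionary` is a hypothetical name and the dictionary is a toy set.

```python
def filter_by_dictionary(variants, dictionary):
    """Keep variants found in the dictionary; keep all if none are found."""
    known = [term for term in variants if term.lower() in dictionary]
    # Assumption: an empty match probably means a name or rare word,
    # so nothing is dropped in that case.
    return known or list(variants)


dictionary = {"the", "quick", "brown", "fox"}
print(filter_by_dictionary(["quick", "quiok"], dictionary))
print(filter_by_dictionary(["Krupansky"], dictionary))
```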

But to answer your question: No, there really isn't a pre-built
analysis chain that I know of that does this. Root issue is how to
assign "confidence"? No clue for your specific domain.

So payloads seem quite reasonable here. Happens there's a recent
end-to-end example, see:
http://searchhub.org/2014/06/13/end-to-end-payload-example-in-solr/
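
As a sketch of what payload-style input could look like, the snippet below writes each variant as `term|confidence` and reads it back. The `|` delimiter, the two-decimal formatting and both function names are assumptions for illustration, and nothing here calls Solr. Note that marking up tokens this way does not by itself place the variants at the same position; that would still need a custom filter.

```python
def to_payload_text(scored_terms, delimiter="|"):
    """Render (term, confidence) pairs as 'term|0.70 term|0.25 ...'."""
    return " ".join(
        f"{term}{delimiter}{confidence:.2f}" for term, confidence in scored_terms
    )


def from_payload_text(text, delimiter="|"):
    """Parse 'term|confidence' tokens back into (term, float) pairs."""
    pairs = []
    for token in text.split():
        term, _, raw = token.rpartition(delimiter)
        pairs.append((term, float(raw)))
    return pairs


field_text = to_payload_text([("quick", 0.7), ("quiok", 0.25)])
print(field_text)
print(from_payload_text(field_text))
```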

Best,
Erick


Re: OCR - Saving multi-term position

Posted by Michael Della Bitta <mi...@appinions.com>.

I don't have first-hand knowledge of how you implement that, but I bet a
look at the WordDelimiterFilter would help you understand how to emit
multiple terms with the same positions pretty easily.

I've heard of this "bag of word variants" approach to indexing poor-quality
OCR output before for findability reasons and I heard it works out OK.

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions
<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


Re: OCR - Saving multi-term position

Posted by Charlie Hull <ch...@flax.co.uk>.

Hi Manuel,

We've done something like this for several of our media monitoring 
clients. The OCR system they use (ABBYY FineReader I think, it's pretty 
much an industry standard) has well-known error statistics - we know the 
top N things it gets wrong, e.g. scanning 'm' as two 'n's - so we can 
implement a kind of fuzzy search without introducing too many extra terms.
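
A minimal sketch of expanding a term with a small table of known confusions. The table below contains only the 'nn' for 'm' case mentioned above; a real table would come from the OCR engine's error statistics. The name `expand` is hypothetical, and only one substitution is applied per variant.

```python
def expand(term, confusions):
    """Return the term plus variants with one known confusion undone."""
    variants = {term}
    for wrong, corrections in confusions.items():
        start = term.find(wrong)
        while start != -1:
            for right in corrections:
                variants.add(term[:start] + right + term[start + len(wrong):])
            # Continue from the next character to catch overlapping matches.
            start = term.find(wrong, start + 1)
    return sorted(variants)


confusions = {"nn": ["m"]}  # OCR read 'm' as two 'n's
print(expand("canne", confusions))
print(expand("fox", confusions))
```

Because only the listed confusions are tried, the number of extra terms stays small compared with a general edit-distance fuzzy search.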

It isn't quite that simple as we're doing a lot of reverse searching 
('which queries match this document') but the approach is certainly 
sound. The following talk from Lucene Revolution is about this kind of 
thing: http://www.youtube.com/watch?v=rmRCsrJp2A8

Cheers

Charlie

-- 
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk