Posted to user@lucenenet.apache.org by Joep Greuter <jo...@conpend.com> on 2016/04/16 08:42:43 UTC

FW: Lucene: an advanced query question - SpanNear with Boolean Fuzzy terms

Good morning gentlemen,

 

We are working on a search challenge on large raw text documents, for which we are using Lucene(.NET), and have run into some issues which we have not yet been able to solve; so I thought let’s not waste too much time and ask some experts ;) I hope you can and will help me on this path.

 

The context of the search challenge is the following:

 

For an application in the financial sector we check documents against pretty large datasets that contain lists of “keywords”, such as blacklisted individuals. We obtain the raw text by scanning PDFs, TIFFs and JPEGs, amongst others, with OCR (Optical Character Recognition) software. Against the raw text we need to run (thousands of) search queries to see if any of the keywords occur within these documents. We need some fuzziness on a term basis because names may be misspelled in the original document or “misspelling” may have occurred during OCR. Also, keywords consist of one or more individual words (for example: “Osama Bin Laden” just to mention an interesting one ;). Not all of the individual words within a keyword have to match or be present, because for longer ones we allow a certain “percentage” of the words not to be found. We would rather be safe than sorry (an end-user will have to manually check suspicious text in the original documents). A last requirement is that words may appear in a different order in the original document (raw text) than in the sanctioned entity. Hope you followed me so far ;)

 

Let’s put up an example, raw text:

 

Lorem ipsum dolor sit Osama Bin Laden amet, consectetur adipiscing elit. Sed Bin Laden, Osama purus sapien, rhoncus sit amet erat in, tempus vehicula tellus. Proin quis pulvinar quam. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus Osiamo Bon Laden mus. Donec non erat facilisis, bibendum est eu, gravida nibh. Suspendisse aliquet rhoncus dolor nec consequat. Vivamus ultricies elit vestibulum, dictum odio non, rutrum neque. Nulla laoreet libero dolor, laoreet blandit Osama Laden Bin sapien auctor non. Donec sodales purus odio, vitae suscipit Bin Ladon quam aliquam non. Mauris neque sapien, varius et tristique nec, ullamcorper ut ligula. Morbi sit amet ultricies erat.

 

We would like to find “all” of the above Osama appearances with the keyword “Osama Bin Laden” (we use a highlighter to find the actual hit positions). But first of all the document itself needs to be a hit. Of course we can adjust our “confidence” level by tweaking the percentage of words from a keyword that must be found (which becomes discrete, of course, based upon the number of words within an entity) and the fuzziness allowed per word. The span within which the words may appear, in any order, may be pretty “large” and may also depend on the number of words in the keyword.
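To make the “discrete” remark concrete, here is a minimal sketch of how a configured percentage could translate into a minimum number of matching words. It is plain Python purely for illustration; the function name and the rounding rule (round up, never below one word) are assumptions, not something Lucene prescribes.

```python
import math

def min_words_required(word_count: int, fraction: float) -> int:
    """Smallest number of keyword words that must match.

    Hypothetical helper: rounds up and never returns less than 1.
    The tiny epsilon guards against floating-point noise such as
    3.0000000000000004 being rounded up to 4.
    """
    return max(1, math.ceil(word_count * fraction - 1e-9))

# For a three-word keyword only the levels 1/3, 2/3 and 3/3 exist,
# so any configured fraction snaps to one of those.
required = min_words_required(len("Osama Bin Laden".split()), 0.66)
```

With a fraction of 0.66, a three-word keyword needs two matching words, while a one-word keyword always needs its single word.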

 

So, with my current knowledge of search and Lucene I thought of setting up a query that uses:

 

1.       SpanNear with a

2.       Boolean query and 

3.       Fuzzy query terms

 

*         SpanNear
With no specified order.

*         Boolean query 
I know SpanNear doesn’t accept a Boolean query (with a BooleanClause for each individual word; Occur.Should; and a MinimumNumberShouldMatch equal to “our percentage”, which we can configure based on benchmark tests). But I want to be able to do so, to allow for situations in which not all of the individual words have to be matched but only a certain percentage of them, somewhere near each other.

*         Fuzzy query terms
To allow for “misspelled” individual words in the original raw text

I am continuing my own research and will also post to Stack Overflow to ask the community to have a look at this. But if any of you could point me in the right direction it would be a great pleasure. Also, performance isn’t much of an issue. We process the queries in parallel, and initial tests have shown that it’s extremely fast already, although we don’t have the right setup yet. If it’s not possible in a one-pass approach, a two-pass or even three-pass solution is also worth trying. The end result is the most important: that we are able to find those Osama’s throughout the text ;)
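To illustrate what a second pass could look like, here is a rough sketch of the matching rule itself (fuzzy words, any order, a minimum number of them within a window), written as plain Python over the raw text rather than as a Lucene query. Everything in it is an assumption for illustration: the names edit_distance and find_hits, the simple \w+ tokenizer, and the default values for max_edits, min_match and window are hypothetical and not part of Lucene or Lucene.NET.

```python
import re

def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def find_hits(text, keyword, max_edits=1, min_match=2, window=4):
    """Return (first, last) token index pairs where at least `min_match`
    distinct keyword words occur, fuzzily and in any order, within
    `window` consecutive tokens."""
    tokens = [t.lower() for t in re.findall(r"\w+", text)]
    words = [w.lower() for w in keyword.split()]

    # Record, per token position, the first keyword word it fuzzily matches.
    matches = []
    for pos, tok in enumerate(tokens):
        for wi, word in enumerate(words):
            if edit_distance(tok, word) <= max_edits:
                matches.append((pos, wi))
                break

    # Greedily group matched positions that fall inside one window.
    hits = []
    i = 0
    while i < len(matches):
        start = matches[i][0]
        group = [m for m in matches[i:] if m[0] < start + window]
        if len({wi for _, wi in group}) >= min_match:
            hits.append((start, group[-1][0]))
            i += len(group)
        else:
            i += 1
    return hits
```

Calling find_hits(raw_text, "Osama Bin Laden") on text like the example above would report the exact and reordered occurrences as well as partial ones such as “Bon Laden” and “Bin Ladon”, since two of the three words still match within one edit. Note that with one allowed edit “Osiamo” itself does not match “Osama” (the distance is two), and allowing two edits makes short words like “Bin” match far too much, so the edit budget probably needs to depend on word length.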

 

Any help is greatly appreciated.

 

Also, I found this Stack Overflow question that looks to be somewhat in the right direction: http://stackoverflow.com/questions/18100233/lucene-fuzzy-search-on-a-phrase-fuzzyquery-spanquery But for some reason the SpanMultiTermQueryWrapper is not supported in Lucene.NET (anymore, although it is available in the GitHub repository).

 

Have a nice and smart day today!

 

Regards,

 

Joep Greuter

 

 

 



 

Joep B.J. Greuter Msc Econometrics / Operations Research

Application architect

 

+31 (0)6 5025 4462

joep.greuter@conpend.com

www.conpend.com