You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Jim Hargrave <Ha...@ldschurch.org> on 2003/06/05 23:12:36 UTC

String similarity search vs. typcial IR application...

Our application is a string similarity searcher where the query is an input string and we want to find all "fuzzy" variants of the input string in the DB.  The Score is basically dice's coefficient: 2C/Q+D, where C is the number of terms (n-grams) in common, Q is the number of unique query terms and D is the number of unique document terms. Our documents will be sentences.
 
I know Lucene has a fuzzy search capability - but I assume this would be very slow since it must search through the entire term list to find candidates.
 
In order to do the calculation I will need to have 'C' - the number of terms in common between query and document. Is there an API that I can call to get this info? Any hints on what it will take to modify Lucene to handle these kinds of queries? 
 
BTW: 
Ever consider using Lucene for DNA searching? - this technique could also be used to search large DNA databases.
 
Thanks!
 
Jim Hargrave


------------------------------------------------------------------------------
This message may contain confidential information, and is intended only for the use of the individual(s) to whom it is addressed.


==============================================================================

Re: String similarity search vs. typcial IR application...

Posted by Ype Kingma <yk...@xs4all.nl>.

On Thursday 05 June 2003 14:12, Jim Hargrave wrote:
> Our application is a string similarity searcher where the query is an input
> string and we want to find all "fuzzy" variants of the input string in the
> DB.  The Score is basically dice's coefficient: 2C/Q+D, where C is the
> number of terms (n-grams) in common, Q is the number of unique query terms
> and D is the number of unique document terms. Our documents will be
> sentences.
>
> I know Lucene has a fuzzy search capability - but I assume this would be
> very slow since it must search through the entire term list to find
> candidates.

Fuzzy search is not as fast as searching with direct terms or truncation,
but it does not search _all_ the terms.

> In order to do the calculation I will need to have 'C' - the number of
> terms in common between query and document. Is there an API that I can call
> to get this info? Any hints on what it will take to modify Lucene to handle
> these kinds of queries?

Have a look at the coord() call in the Similarity interface of Lucene 1.3.
It gets called per document with overlap and nr. of query terms when you
search with your own similarity implementation. It's based on terms
and not on n-grams, so it might be no good in your case.
You might try indexing a 1-gram as a Lucene Term.
In case a 1-gram is pnly C, A, T or G (DNA proteins) this might be too much
overhead for Lucene to handle...

Good luck,
Ype

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Re: String similarity search vs. typcial IR application...

Posted by Leo Galambos <Le...@seznam.cz>.

AFAIK Lucene is not able to look DNA strings up effectively. You would 
use DASG+Lev (see my previous post - 05/30/2003 1916CEST).

-g-

Jim Hargrave wrote:

>Our application is a string similarity searcher where the query is an input string and we want to find all "fuzzy" variants of the input string in the DB.  The Score is basically dice's coefficient: 2C/Q+D, where C is the number of terms (n-grams) in common, Q is the number of unique query terms and D is the number of unique document terms. Our documents will be sentences.
> 
>I know Lucene has a fuzzy search capability - but I assume this would be very slow since it must search through the entire term list to find candidates.
> 
>In order to do the calculation I will need to have 'C' - the number of terms in common between query and document. Is there an API that I can call to get this info? Any hints on what it will take to modify Lucene to handle these kinds of queries? 
>  
>



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org