Posted to java-user@lucene.apache.org by Grant Ingersoll <gs...@apache.org> on 2008/04/16 18:11:08 UTC

Re: Using Lucene to find duplicate/similar names

I believe there were some posts on this about a year ago.  Try
searching the archives for "duplicate names", as well as "record
linkage" or any other synonyms you can think of.  The short answer
is that Lucene is a reasonable tool to attempt this with, but you may
need some help.  The long answer is to dig into those archives and see
the other recommendations.

-Grant

On Apr 16, 2008, at 12:37 PM, Andy DePue wrote:

> I'm new to Lucene, and would like to use it to find duplicate (or  
> similar) names in a contact list.  Is Lucene a good fit?
> We have a form where a user enters a company or person's name, and  
> we want the system to warn them if there is already a company or  
> person entered with the same or similar name.
> Based on the little I know of Lucene, I'm thinking an NGram  
> algorithm (based on characters, not words) would work best... but,  
> I'm not sure if Lucene takes proximity or edit distances into  
> account?  For example, say you have these two names:
> Andrew John
> John Andrew
>
> If a user enters Andy John, without proximity or edit distance,  
> these two names will match about the same, while, obviously, the  
> first name should be ranked higher.
> Thanks in advance for any help or advice.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

--------------------------
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ


Re: Using Lucene to find duplicate/similar names

Posted by Andy DePue <an...@marathon-man.com>.
Thanks for the pointer.  I found the thread, and there is certainly some
interesting information there.  I'd like to stick to what Lucene has
available today, mainly because I lack the time to implement anything
more than that.  I originally considered Levenshtein distance, but then
realized that Lucene would probably have to scan the whole index for
that?  I don't need anything too fancy, so I'm still wondering whether
NGrams with some sort of proximity ranking would do the trick.  By
proximity, I mean how closely the position and order of the NGrams in
the document field match those of the same NGrams in the search string.
I'm hoping NGrams would avoid the need for a whole index scan.  Does
Lucene already factor this into its hit score, or would I need to do
some custom work?
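
To make the idea concrete, here is a minimal sketch of position-aware
NGram scoring in plain Java.  It is not Lucene's scoring and uses no
Lucene classes; the class name NGramProximityDemo, the methods ngrams,
sharedGrams and score, and the 1 / (1 + offset) weighting are all
assumptions made up for illustration.  Each query NGram that also occurs
in the candidate earns more credit the closer its position is to its
position in the query.

```java
import java.util.ArrayList;
import java.util.List;

public class NGramProximityDemo {

    /** Lower-cases the text and returns its character n-grams in positional order. */
    public static List<String> ngrams(String text, int n) {
        String s = text.toLowerCase();
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= s.length(); i++) {
            grams.add(s.substring(i, i + n));
        }
        return grams;
    }

    /** Position-blind baseline: how many query n-grams occur anywhere in the candidate. */
    public static int sharedGrams(String query, String candidate, int n) {
        List<String> c = ngrams(candidate, n);
        int shared = 0;
        for (String gram : ngrams(query, n)) {
            if (c.contains(gram)) {
                shared++;
            }
        }
        return shared;
    }

    /**
     * Position-aware score in [0, 1]. Each query n-gram found in the candidate
     * contributes 1 / (1 + d), where d is the distance between its position in
     * the query and its nearest occurrence in the candidate. The weighting is an
     * illustrative assumption, not a standard formula.
     */
    public static double score(String query, String candidate, int n) {
        List<String> q = ngrams(query, n);
        List<String> c = ngrams(candidate, n);
        if (q.isEmpty()) {
            return 0.0;
        }
        double total = 0.0;
        for (int i = 0; i < q.size(); i++) {
            int best = -1; // smallest positional distance seen so far, -1 if no match
            for (int j = 0; j < c.size(); j++) {
                if (q.get(i).equals(c.get(j))) {
                    int d = Math.abs(i - j);
                    if (best < 0 || d < best) {
                        best = d;
                    }
                }
            }
            if (best >= 0) {
                total += 1.0 / (1 + best);
            }
        }
        return total / q.size();
    }

    public static void main(String[] args) {
        String query = "Andy John";
        for (String name : new String[] {"Andrew John", "John Andrew"}) {
            System.out.printf("%s: shared=%d score=%.3f%n",
                    name, sharedGrams(query, name, 2), score(query, name, 2));
        }
    }
}
```

With bigrams, the position-blind count is nearly the same for both
names (6 versus 5 shared bigrams out of 8), while the position-aware
score ranks "Andrew John" clearly above "John Andrew".  Note that this
sketch compares the query against each candidate directly; it says
nothing about how to avoid scanning every name in a real index.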

  - Andy



LinkedIn Lucene Interest Group

Posted by Wilfred Beijer <wi...@sonepar.nl>.
Hello all,

The email address for the group owner of the LinkedIn Lucene Interest Group
doesn't seem to work. Is this group still alive?

Kind regards,

Wilfred Beijer




