Posted to solr-user@lucene.apache.org by Amit Jha <sh...@gmail.com> on 2015/01/03 08:54:17 UTC

De Duplication using Solr

I am trying to find duplicate records using distance and phonetic
algorithms. Can I use Solr for that? I have the following fields and
conditions for identifying exact or possible duplicates.

1. Fields
prefix
suffix
firstname
lastname
email(primary_email1, email2, email3)
phone(primary_phone1, phone2, phone3)
2. Conditions:
Two records are said to be exact duplicates if

1. IsExactMatchFunction(record1_prefix, record2_prefix) AND
IsExactMatchFunction(record1_suffix, record2_suffix) AND
IsExactMatchFunction(record1_firstname,record2_firstname) AND
IsExactMatchFunction(record1_lastname,record2_lastname) AND
IsExactMatchFunction(record1_primary_email,record2_primary_email) OR
IsExactMatchFunction(record1_primary_phone,record2_primary_phone)
Two records are said to be possible duplicates if

1. IsExactMatchFunction(record1_prefix, record2_prefix) OR
IsExactMatchFunction(record1_suffix, record2_suffix) OR
IsExactMatchFunction(record1_firstname,record2_firstname) AND
IsExactMatchFunction(record1_lastname,record2_lastname) AND
IsExactMatchFunction(record1_primary_email,record2_primary_email) OR
IsExactMatchFunction(record1_primary_phone,record2_primary_phone)
 ELSE
 2. IsFuzzyMatchFunction(record1_firstname,record2_firstname) AND
IsExactMatchFunction(record1_lastname,record2_lastname) AND
IsExactMatchFunction(record1_primary_email,record2_primary_email) OR
IsExactMatchFunction(record1_primary_phone,record2_primary_phone)
 ELSE
 3. IsFuzzyMatchFunction(record1_firstname,record2_firstname) AND
IsExactMatchFunction(record1_lastname,record2_lastname) AND
IsExactMatchFunction(record1_any_email,record2_any_email) OR
IsExactMatchFunction(record1_any_phone,record2_any_phone)
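
As a rough illustration of the exact-duplicate rule, here is a minimal
Python sketch. The Record class, the helper names, and the trim/lower-case
normalization are all hypothetical. The rule as written does not say how
AND and OR group, so the sketch assumes "(all four name parts) AND
(primary email OR primary phone)"; a different grouping would need a
different expression.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Record:
    # Hypothetical record layout; the primary email/phone is stored first.
    prefix: str = ""
    suffix: str = ""
    firstname: str = ""
    lastname: str = ""
    emails: List[str] = field(default_factory=list)
    phones: List[str] = field(default_factory=list)


def is_exact_match(a: str, b: str) -> bool:
    """Equality after trimming and lower-casing (an assumed normalization)."""
    return a.strip().lower() == b.strip().lower()


def same_primary(values1: List[str], values2: List[str]) -> bool:
    """True when both lists have a non-empty primary value and they match."""
    if not values1 or not values2 or not values1[0].strip():
        return False
    return is_exact_match(values1[0], values2[0])


def is_exact_duplicate(r1: Record, r2: Record) -> bool:
    # Assumed grouping: (all four name parts) AND (primary email OR primary phone).
    names_match = all(
        is_exact_match(getattr(r1, f), getattr(r2, f))
        for f in ("prefix", "suffix", "firstname", "lastname")
    )
    contact_match = (same_primary(r1.emails, r2.emails)
                     or same_primary(r1.phones, r2.phones))
    return names_match and contact_match
```

Missing contact values are treated as non-matching here, so two records
with no email and no phone are never reported as exact duplicates.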

IsFuzzyMatchFunction() will run the distance and phonetic algorithms
and compare each result with a predefined threshold.

For example:

If the threshold defined for firstname is 85, IsFuzzyMatchFunction()
returns "true" if and only if at least one of the algorithms (distance or
phonetic) returns a similarity score >= 85.
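
A minimal sketch of such a function in plain Python, under stated
assumptions: difflib's SequenceMatcher stands in for the distance
algorithm, classic Soundex stands in for the phonetic algorithm, and a
Soundex match is scored as 100 and a mismatch as 0. None of these choices
come from the requirements above; they are only placeholders.

```python
from difflib import SequenceMatcher

# Soundex digit for each consonant; vowels, h, w and y have no digit.
_SOUNDEX_CODES = {
    ch: digit
    for digit, letters in {
        "1": "bfpv", "2": "cgjkqsxz", "3": "dt",
        "4": "l", "5": "mn", "6": "r",
    }.items()
    for ch in letters
}


def soundex(name: str) -> str:
    """Classic 4-character American Soundex code ("" for empty input)."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = _SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = _SOUNDEX_CODES.get(c, "")
        if digit and digit != prev:
            code += digit
        if c not in "hw":  # h and w do not separate repeated codes
            prev = digit
    return (code + "000")[:4]


def distance_score(a: str, b: str) -> float:
    """Similarity on a 0-100 scale (stand-in for an edit-distance metric)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def phonetic_score(a: str, b: str) -> float:
    """100 when both Soundex codes are non-empty and equal, else 0."""
    code_a, code_b = soundex(a), soundex(b)
    return 100.0 if code_a and code_a == code_b else 0.0


def is_fuzzy_match(a: str, b: str, threshold: float = 85.0) -> bool:
    # True if at least one algorithm reaches the threshold.
    return max(distance_score(a, b), phonetic_score(a, b)) >= threshold
```

With this sketch, "Robert" and "Rupert" match through the phonetic branch
because both have the Soundex code R163, even though their
character-level similarity is lower.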

Can I use Solr to do this job, or can you suggest how I should
approach this problem? I have seen Duke (a de-duplication API), but I
cannot use Duke out of the box.

Re: De Duplication using Solr

Posted by Amit Jha <sh...@gmail.com>.
Thanks for the reply. I have already seen the wiki. What I need is closer
to record matching.

On Sat, Jan 3, 2015 at 7:39 PM, Jack Krupansky <ja...@gmail.com>
wrote:

> First, see if you can get your requirements to align to the de-dupe feature
> that Solr already has:
> https://cwiki.apache.org/confluence/display/solr/De-Duplication
>
>
> -- Jack Krupansky

Re: De Duplication using Solr

Posted by Jack Krupansky <ja...@gmail.com>.
First, see if you can get your requirements to align to the de-dupe feature
that Solr already has:
https://cwiki.apache.org/confluence/display/solr/De-Duplication


-- Jack Krupansky


RE: De Duplication using Solr

Posted by steve <sc...@hotmail.com>.
One possible "match" is using Python's FuzzyWuzzy
https://github.com/seatgeek/fuzzywuzzy
http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
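
A minimal sketch of how FuzzyWuzzy could supply the distance part of the
fuzzy check, assuming the package is installed (pip install fuzzywuzzy).
fuzz.ratio returns an integer similarity from 0 to 100; the helper names
and the difflib fallback are illustrative assumptions, and 85 is the
threshold from the original question.

```python
from difflib import SequenceMatcher

try:
    # Third-party package; not part of the standard library.
    from fuzzywuzzy import fuzz

    def name_similarity(a: str, b: str) -> int:
        """Similarity from 0 to 100 using FuzzyWuzzy's simple ratio."""
        return fuzz.ratio(a.lower(), b.lower())
except ImportError:
    def name_similarity(a: str, b: str) -> int:
        """Fallback when fuzzywuzzy is missing; scores can differ slightly."""
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        return int(round(100 * ratio))

THRESHOLD = 85  # threshold for firstname taken from the original question


def is_fuzzy_name_match(a: str, b: str, threshold: int = THRESHOLD) -> bool:
    return name_similarity(a, b) >= threshold
```

This covers only the distance side; a phonetic comparison would have to
come from somewhere else.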
