Posted to java-user@lucene.apache.org by nitingupta183 <ni...@gmail.com> on 2009/10/12 13:09:19 UTC

Usage of Lucene/Hibernate Search for Contacts Merging operation

Hi all,

I am supposed to add a feature in which my app detects a user's duplicate
contacts on the basis of their name, email, mobile number, etc. (i.e. a
"Contacts Duplicate Killer" kind of feature). The simplest algorithm I can
think of is to fetch all the contacts with their name, email and mobile
number and then loop over them to determine which contacts have similar
entries. But I think this algorithm will perform poorly.

I am currently using Hibernate, and I have come across Hibernate Search/Lucene.
Can I use these for this task? I am asking because Lucene already implements
algorithms such as Levenshtein distance. Maybe I can harness the power of
Lucene to make this task efficient.

If anyone has done this or something similar, with Lucene or with some other
tool, please give me some pointers.

regards
nitin
-- 
View this message in context: http://www.nabble.com/Usage-of-Lucene-Hibernate-Search-for-Contacts-Merging-operation-tp25853966p25853966.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Usage of Lucene/Hibernate Search for Contacts Merging operation

Posted by Rene Wiermer <r....@lw-systems.de>.

nitingupta183 wrote:
> Hi all,
> 
> I am supposed to add a feature in which my app detects a user's duplicate
> contacts on the basis of their name, email, mobile number, etc. (i.e. a
> "Contacts Duplicate Killer" kind of feature). The simplest algorithm I can
> think of is to fetch all the contacts with their name, email and mobile
> number and then loop over them to determine which contacts have similar
> entries. But I think this algorithm will perform poorly.

Try to prune your search space. It is reasonable to assume that there
are not too many duplicates overall.
You can use IndexReader.terms() to enumerate the terms and then
docFreq() to check the number of documents containing each term.

E.g. search for all email terms and process those whose docFreq is > 1.
Add the corresponding documents for each email term to a "possible
identical contacts" container.
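
As a rough, Lucene-free illustration of the "docFreq > 1" idea: the sketch
below uses a plain in-memory map as a stand-in for the index. The class and
method names (CandidateFinder, sharedValues) are hypothetical and not part of
Lucene; with a real index the same information would come from the term
enumeration and docFreq() described above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not a Lucene class: an in-memory stand-in for
// "enumerate the terms of a field and keep those with docFreq > 1".
final class CandidateFinder {

    /**
     * valuesById maps a contact id to one field value (e.g. its email).
     * Returns value -> ids of all contacts sharing that value, restricted
     * to values that occur in at least two contacts.
     */
    static Map<String, List<Integer>> sharedValues(Map<Integer, String> valuesById) {
        Map<String, List<Integer>> postings = new LinkedHashMap<>();
        for (Map.Entry<Integer, String> e : valuesById.entrySet()) {
            postings.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        // Values seen in only one contact cannot indicate a duplicate.
        postings.values().removeIf(ids -> ids.size() < 2);
        return postings;
    }
}
```

Each list in the returned map corresponds to one "possible identical
contacts" container.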

Repeat the same with birth dates, phone numbers and names, preferably
with some normalization.
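
What "some normalization" means depends on the data. A minimal sketch,
assuming that case and surrounding whitespace are irrelevant for emails and
names and that only the digits of a phone number matter (the ContactNormalizer
class and its rules are illustrative assumptions, not something Lucene
provides in this form):

```java
import java.util.Locale;

// Hypothetical normalization rules; adjust them to the actual data.
final class ContactNormalizer {

    /** Trims and lower-cases an email address so trivial variants collide. */
    static String email(String raw) {
        return raw.trim().toLowerCase(Locale.ROOT);
    }

    /** Keeps only the digits, dropping '+', spaces, brackets and dashes. */
    static String phone(String raw) {
        return raw.replaceAll("[^0-9]", "");
    }

    /** Trims, collapses runs of whitespace and lower-cases a name. */
    static String name(String raw) {
        return raw.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }
}
```

The normalized value would be what gets indexed (or grouped on), so that
"+49 (171) 123-4567" and "491711234567" end up as the same term. Country and
trunk prefixes are not handled here.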

Then merge those "possible identical contacts" containers that share a
common document.

Example:
Container 1     Container 2             Merged container
A, B            B, C            --->    A, B, C

(Implementation note: try to keep track of the list of containers each
document is in using a look-up table:  A -> 1; B -> 1,2,3,6; C -> 2
etc.)
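
One possible way to do the merging step is a union-find (disjoint-set)
structure, which is swapped in here for the look-up table from the note
above; both end up grouping documents that are connected through shared
containers. This is only a sketch: ContainerMerger is a hypothetical name,
and documents are assumed to be identified by integer document numbers.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper: merges containers that share at least one document.
final class ContainerMerger {
    private final Map<Integer, Integer> parent = new HashMap<>();

    /** Returns the representative of x's group, compressing the path. */
    private int find(int x) {
        parent.putIfAbsent(x, x);
        int root = x;
        while (parent.get(root) != root) {
            root = parent.get(root);
        }
        int cur = x;
        while (cur != root) {
            int next = parent.get(cur);
            parent.put(cur, root);
            cur = next;
        }
        return root;
    }

    private void union(int a, int b) {
        int ra = find(a);
        int rb = find(b);
        if (ra != rb) {
            parent.put(ra, rb);
        }
    }

    /** Each inner list is one container of document numbers. */
    static List<Set<Integer>> merge(List<List<Integer>> containers) {
        ContainerMerger uf = new ContainerMerger();
        for (List<Integer> container : containers) {
            for (Integer doc : container) {
                // Link every document to the first one in its container.
                uf.union(container.get(0), doc);
            }
        }
        Map<Integer, Set<Integer>> groups = new TreeMap<>();
        for (Integer doc : new ArrayList<>(uf.parent.keySet())) {
            groups.computeIfAbsent(uf.find(doc), k -> new TreeSet<>()).add(doc);
        }
        return new ArrayList<>(groups.values());
    }
}
```

With containers {1, 2}, {2, 3} and {7, 8} as input, documents 1, 2 and 3 end
up in one merged container and 7 and 8 in another.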

Then compare the documents inside each container with each other and
decide which contacts you want to merge and which not.
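
Because the containers are small, a pairwise comparison inside each one is
affordable. The sketch below assumes, purely for illustration, that two
contacts are "close" when their normalized names differ by at most a given
number of edits; the PairwiseCheck class, the closePairs method and the
threshold are assumptions, and a real merge decision would probably look at
several fields.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for the final pairwise check inside one container.
final class PairwiseCheck {

    /** Classic dynamic-programming Levenshtein edit distance. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[b.length()];
    }

    /** Returns index pairs (i, j), i < j, whose names are within maxEdits. */
    static List<int[]> closePairs(List<String> names, int maxEdits) {
        List<int[]> pairs = new ArrayList<>();
        for (int i = 0; i < names.size(); i++) {
            for (int j = i + 1; j < names.size(); j++) {
                if (levenshtein(names.get(i), names.get(j)) <= maxEdits) {
                    pairs.add(new int[] {i, j});
                }
            }
        }
        return pairs;
    }
}
```

The quadratic loop is only acceptable because it runs per container, not
over the whole contact list.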

> I am currently using Hibernate, and I have come across Hibernate Search/Lucene.
> Can I use these for this task? I am asking because Lucene already implements
> algorithms such as Levenshtein distance. Maybe I can harness the power of
> Lucene to make this task efficient.

Try using a Soundex or Metaphone analyzer for similarity; they map
similar-sounding strings to a single value and are much easier to handle
in the Lucene framework than numeric measures like Levenshtein.

There are examples in Lucene contrib.
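
To show what "map similar-sounding strings to a single value" means, here is
a stand-alone sketch of American Soundex. It is not the Lucene contrib code;
SimpleSoundex is a hypothetical class written for this illustration, and it
only handles the ASCII letters A-Z.

```java
// Hypothetical stand-alone Soundex sketch (ASCII letters only).
final class SimpleSoundex {
    // One code per letter A..Z: digits 1-6 for consonant groups,
    // '0' for vowels and Y, '-' for H and W.
    private static final String CODES = "0123012-02245501262301-202";

    static String encode(String name) {
        StringBuilder out = new StringBuilder(4);
        char last = 0; // code of the most recent letter that was not H or W
        for (int i = 0; i < name.length() && out.length() < 4; i++) {
            char c = Character.toUpperCase(name.charAt(i));
            if (c < 'A' || c > 'Z') {
                continue; // skip digits, punctuation and non-ASCII letters
            }
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c); // the first letter is kept as-is
                last = code;
            } else if (code == '0') {
                last = code; // a vowel separates equal consonant codes
            } else if (code != '-' && code != last) {
                out.append(code);
                last = code;
            }
            // H and W are skipped without resetting 'last'.
        }
        if (out.length() == 0) {
            return "";
        }
        while (out.length() < 4) {
            out.append('0'); // pad short codes with zeros
        }
        return out.toString();
    }
}
```

For example, "Robert" and "Rupert" both encode to R163, so indexing the
encoded value in its own field turns "sounds alike" into an exact term match,
which fits the docFreq-based pruning described above.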



-- 
Rene Wiermer
(Software Developer/Systems Engineer)

