
Posted to java-user@lucene.apache.org by Mobius ReX <ao...@gmail.com> on 2014/03/17 19:02:19 UTC

Fwd: any project for record linkage, fuzzy grouping, and deduplication based on Solr/Lucene?

---------- Forwarded message ----------
Subject: any project for record linkage, fuzzy grouping, and deduplication
based on Solr/Lucene?


For example, suppose a large new department is formed by merging three
departments. A few employees worked for two or three of those departments
before the merger, so the attributes of one person may be listed in
several departments' databases. A further problem is that the same person
may appear under different first names or nicknames.

The attributes of a person include
first name, last name, email, home phone, cell phone, SSN, address, etc.

Because any of the above values may be empty, there is no unique primary
key.
Hence, we need an intelligent solution for classification that assigns
weights to the different matching rules.
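
To make "weights for different matching rules" concrete, here is a minimal
sketch in Python of the kind of scoring I have in mind. It is only an
illustration, not a proposed implementation: the field names, the weight
values, and the choice of difflib for fuzzy comparison are all assumptions
of mine, and nothing here uses Solr/Lucene. Fields that are empty in either
record are skipped, so no single attribute has to act as a primary key.

```python
from difflib import SequenceMatcher

# Hypothetical weights per attribute. The numbers are illustrative
# assumptions, not tuned values.
WEIGHTS = {
    "ssn": 5.0,
    "email": 4.0,
    "cell_phone": 3.0,
    "home_phone": 2.0,
    "last_name": 2.0,
    "first_name": 1.0,
    "address": 1.0,
}

# Fields compared approximately (assumption); all others must match exactly.
FUZZY_FIELDS = {"first_name", "last_name", "address"}


def field_similarity(field, a, b):
    """Return a similarity in [0, 1] for one attribute of two records."""
    a, b = a.strip().lower(), b.strip().lower()
    if field in FUZZY_FIELDS:
        return SequenceMatcher(None, a, b).ratio()
    return 1.0 if a == b else 0.0


def match_score(rec_a, rec_b):
    """Weighted average similarity over fields present in both records."""
    total = 0.0
    available = 0.0
    for field, weight in WEIGHTS.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if not a or not b:
            continue  # empty or missing in either record: rule not applied
        available += weight
        total += weight * field_similarity(field, a, b)
    return total / available if available else 0.0


if __name__ == "__main__":
    rec_a = {"first_name": "Bob", "last_name": "Smith",
             "email": "bob@example.com", "ssn": ""}
    rec_b = {"first_name": "Robert", "last_name": "Smith",
             "email": "bob@example.com", "cell_phone": "555-0100"}
    print(round(match_score(rec_a, rec_b), 3))
```

A pair would then be treated as the same person when the score exceeds
some threshold. Comparing every pair this way is clearly not feasible for
the full data set, which is why I am asking about faster approaches.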

Any tips for handling such deduplication tasks quickly at runtime on big
data (about 100 million records)?
Is there any open-source project working on this?