Posted to user@mahout.apache.org by Patrick Collins <pa...@ready2sign.com> on 2011/05/01 08:21:38 UTC

Re: Fuzzy matching

Should I be worried that somebody with a scientology.net email address is
writing in about address harvesting and data deduping?

Patrick.

On Fri, Apr 29, 2011 at 12:50 PM, James Pettyjohn <ja...@scientology.net> wrote:

>
>
> Hey,
>
> First time writing in.
>
> I have around 6 million active records in a contacts database, plus
> additional millions of historical address records for these records. I
> have received 60+ thousand new records that are not yet correlated to
> these, and I need to fuzzy match them against both the active and the
> historical records.
>
> I will need to do the same thing with the database against itself for
> de-duplication later. The data is primarily in Oracle (with the
> supplement in CSV files).
>
> I saw the Booz Allen Hamilton presentation on fuzzy matching, but I
> don't see any distributions for that implementation. At the same time, I
> don't need real-time querying, just batch processing at the moment.
>
> I thought Mahout might fit the bill. Any comments on approach
> would be appreciated.
>
> Best, James
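
For reference, the batch job described in the quoted message (matching a small set of new records against a much larger existing set) is often approached by "blocking" first, so that each new record is compared only against a small bucket of candidates instead of all 6 million rows. The following is a minimal sketch of that idea using only the Python standard library. It is not a Mahout feature and not the implementation from the presentation mentioned above. The field names `id`, `name`, and `zip`, the blocking key, and the 0.85 threshold are all assumptions made for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def normalize(text):
    # Lowercase and collapse whitespace so trivial differences are ignored.
    return " ".join(text.lower().split())


def block_key(record):
    # Hypothetical blocking key: postal code plus first letter of the name.
    # Records with different keys are never compared against each other.
    name = normalize(record["name"])
    return (record["zip"].strip(), name[:1])


def build_index(records):
    # Group the large existing data set by blocking key (one pass).
    index = defaultdict(list)
    for rec in records:
        index[block_key(rec)].append(rec)
    return index


def similarity(a, b):
    # Ratio in [0, 1]; 1.0 means the normalized strings are identical.
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def fuzzy_match(new_records, index, threshold=0.85):
    # Compare each new record only against candidates in its own block.
    matches = []
    for rec in new_records:
        for cand in index.get(block_key(rec), []):
            score = similarity(rec["name"], cand["name"])
            if score >= threshold:
                matches.append((rec["id"], cand["id"], round(score, 3)))
    return matches


if __name__ == "__main__":
    existing = [
        {"id": 1, "name": "John Smith", "zip": "90210"},
        {"id": 2, "name": "Jane Doe", "zip": "10001"},
    ]
    incoming = [
        {"id": "a", "name": "Jon  Smith", "zip": "90210"},
        {"id": "b", "name": "Zed Unknown", "zip": "90210"},
    ]
    for match in fuzzy_match(incoming, build_index(existing)):
        print(match)
```

The choice of blocking key is the main trade-off: a stricter key means fewer comparisons but more missed matches (for example, a typo in the postal code puts a record in the wrong bucket), so several passes with different keys are sometimes run. The same index can be reused for the later de-duplication of the database against itself.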

Re: Fuzzy matching

Posted by Ted Dunning <te...@gmail.com>.
Interesting point.  I hadn't noticed.

On the other hand, if they get their deduping in order, maybe we won't get
as much duplicated junk mail from them.
