Posted to user@mahout.apache.org by Jason Lee <wu...@gmail.com> on 2013/12/03 09:41:21 UTC

Any Entity Resolution & Deduplication solution?

I have 10M+ company names (in Chinese) that were extracted from the work
experience sections of user profiles. Because those company names were
entered manually by users of our site, there are many duplicates. Our goal is
to extract and cleanse that data to build a company dictionary. For
example, the following terms should be considered one company:

Huawei Technologies Co. Ltd
Huawei
huawei.com
华为                        ->  (华为 is Huawei in Chinese)
华为有限公司                         -> (有限公司 is Co. Ltd in Chinese)

It looks like a clustering problem, but I don't have any idea how to
implement it.

Regards.
- Jason

Re: Any Entity Resolution & Deduplication solution?

Posted by Suneel Marthi <su...@yahoo.com>.
You don't need clustering for this.

Lucene should be able to help you here to create a dictionary. Look at

a) Lucene's CJK and Standard Analyzers
b) Mahout's DictionaryVectorizer (with appropriate Lucene Analyzer)

That, along with an appropriate choice of n-grams and stopwords, should do
it for you.
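
To make the n-gram and stopword idea concrete, here is a minimal sketch in
plain Python (not Lucene or Mahout code). The stopword lists and the function
names are assumptions made up for illustration; a real list would be derived
from the data. It strips stopwords, splits what remains into character
bigrams, and compares two names by Jaccard overlap.

```python
import re
import unicodedata

# Hypothetical stopword lists; a real one would be built from the data.
LATIN_STOPWORDS = {"technologies", "co", "ltd", "inc", "com"}
CJK_STOPWORDS = ["有限公司"]


def normalize(name):
    """Lowercase, fold full-width forms, and drop stopwords."""
    text = unicodedata.normalize("NFKC", name).lower()
    # Split into runs of Latin letters/digits and runs of CJK ideographs.
    tokens = re.findall(r"[a-z0-9]+|[\u4e00-\u9fff]+", text)
    kept = []
    for token in tokens:
        if token[0].isascii():
            # Latin tokens are dropped whole when they are stopwords.
            if token in LATIN_STOPWORDS:
                continue
        else:
            # CJK runs have no spaces, so stopwords are removed as substrings.
            for stopword in CJK_STOPWORDS:
                token = token.replace(stopword, "")
        if token:
            kept.append(token)
    return "".join(kept)


def char_ngrams(text, n=2):
    """Return the set of character n-grams (the whole text if shorter than n)."""
    if len(text) < n:
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def similarity(a, b):
    """Jaccard overlap of the character bigrams of two normalized names."""
    grams_a = char_ngrams(normalize(a))
    grams_b = char_ngrams(normalize(b))
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


print(similarity("Huawei Technologies Co. Ltd", "huawei.com"))  # 1.0
print(similarity("华为有限公司", "华为"))                         # 1.0
print(similarity("Huawei", "华为"))                              # 0.0
```

Note the last line: n-grams and stopwords alone will not connect "Huawei" to
"华为". Linking names across scripts would need something extra, such as an
alias or transliteration table.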


Re: Any Entity Resolution & Deduplication solution?

Posted by Jason Lee <wu...@gmail.com>.
Hi Suneel, Manuel,

Thank you so much for your advice, and sorry for my late reply.

I will try out the methods and libraries you mentioned and let you know how
it goes, along with any questions that come up along the way.


On Wed, Dec 4, 2013 at 3:28 AM, Manuel Blechschmidt <
Manuel.Blechschmidt@gmx.de> wrote:

> Hi Jason,
> Mahout does not have any built-in duplicate detection capabilities.
>
> My former university provides a duplicate detection library (DuDe):
>
> http://www.hpi.uni-potsdam.de/naumann/projekte/dude_duplicate_detection.html
>
> If you want to tag entities you might want to look into GATE.
> http://gate.ac.uk/sale/talks/stupidpoint/diana-fb.ppt
>
>
> Hope that helps
>     Manuel
>

Re: Any Entity Resolution & Deduplication solution?

Posted by Manuel Blechschmidt <Ma...@gmx.de>.
Hi Jason,
Mahout does not have any built-in duplicate detection capabilities.

My former university provides a duplicate detection library (DuDe):
http://www.hpi.uni-potsdam.de/naumann/projekte/dude_duplicate_detection.html
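
For a rough picture of what duplicate detection involves, here is a minimal
sketch in plain Python. It is not DuDe's API. The alias table, the blocking
key, and the prefix-based match rule are all hypothetical placeholders. The
idea: put names into blocks so that only names in the same block are compared
(comparing all pairs of 10M names is not feasible), test pairs inside each
block, and merge matches transitively with union-find.

```python
from collections import defaultdict

# Hypothetical hand-made alias table linking Chinese names to Latin ones.
ALIASES = {"华为": "huawei"}


def canonical(name):
    """Lowercase the name and apply the alias table."""
    text = name.lower()
    for alias, target in ALIASES.items():
        text = text.replace(alias, target)
    return text


def block_key(name):
    """Placeholder blocking key: first three characters of the canonical form."""
    return canonical(name)[:3]


def is_match(a, b):
    """Placeholder match rule: one canonical form is a prefix of the other."""
    ca, cb = canonical(a), canonical(b)
    return ca.startswith(cb) or cb.startswith(ca)


def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def group_duplicates(names, key, match):
    """Group names that are linked, directly or transitively, by match()."""
    parent = {name: name for name in names}
    blocks = defaultdict(list)
    for name in names:
        blocks[key(name)].append(name)
    # Only compare pairs that share a block.
    for block in blocks.values():
        for i, a in enumerate(block):
            for b in block[i + 1:]:
                if match(a, b):
                    parent[find(parent, a)] = find(parent, b)
    groups = defaultdict(list)
    for name in names:
        groups[find(parent, name)].append(name)
    return list(groups.values())


names = ["Huawei Technologies Co. Ltd", "Huawei", "huawei.com",
         "华为", "华为有限公司", "Lenovo"]
for group in group_duplicates(names, block_key, is_match):
    print(group)
```

With these placeholder rules the five Huawei variants end up in one group and
"Lenovo" in another. A real setup would need a much better similarity measure
and blocking key than these.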

If you want to tag entities you might want to look into GATE.
http://gate.ac.uk/sale/talks/stupidpoint/diana-fb.ppt


Hope that helps
    Manuel


-- 
Manuel Blechschmidt
M.Sc. IT Systems Engineering
Dortustr. 57
14467 Potsdam
Mobil: 0173/6322621
Twitter: http://twitter.com/Manuel_B