
Posted to user@mahout.apache.org by Neil Chaudhuri <nc...@potomacfusion.com> on 2011/12/02 04:08:13 UTC

Word and Phrase Clustering

I have a need to cluster a collection of words and phrases by syntactic similarity over a distributed environment, and I came upon Mahout as a possible solution. After studying the documentation though, I am finding all of it tailored to working with entire documents rather than words and phrases. I simply want to know if you believe that Mahout is the right tool for this job. I suppose I could try to view each word and phrase as individual tiny documents, but that feels like I am forcing it.

Any insight is appreciated.

Thanks.

Re: Word and Phrase Clustering

Posted by Rob Podolski <ro...@yahoo.co.uk>.

Have you had a look at 'Taming Text' (by Grant S. Ingersoll, Thomas S. Morton, and Andrew L. Farris)? Some of its sections might be relevant to your issue.

R


Re: Word and Phrase Clustering

Posted by Ted Dunning <te...@gmail.com>.

Here is an ancient article on the subject.

http://www.aclweb.org/anthology-new/J/J92/J92-3004.pdf

You don't need fancy computer capabilities to cluster words based on
spelling.

On Fri, Dec 2, 2011 at 3:36 AM, Pascal Coupet <pc...@gmail.com> wrote:

> Hi Neil,
>
> I suggest you start by clustering on lexical affinities (based on
> how words look). From your examples, that seems to be what you are
> looking for. To cluster terms this way you don't really need the full
> data. You can remove all duplicates and hopefully get a much smaller set.
>
> A good way to describe terms for this purpose is to use n-grams. You can
> also use phonetic transcriptions of terms. An interesting trick that works
> well is to add a special character at the beginning of each word (in the
> n-gram method). This boosts similarity at the beginning of words, which is
> usually good.
>
> I suggest you have a look at Google
> Refine <http://code.google.com/p/google-refine/>.
> Watch the first video. It demonstrates nice term-clustering capabilities
> using different methods (n-grams, ...). If it's what you are looking for,
> you can try it on the most frequent terms in your dataset, quickly get
> interesting results, and then implement whichever method looks best to you.
>
> Best,
>
> Pascal

Re: Word and Phrase Clustering

Posted by Neil Chaudhuri <nc...@potomacfusion.com>.

Thanks for the pointer to Google Refine. That looks like exactly what I
want. However, it isn't clear to me how to actually program with it, since
I don't see a published API. Even more troubling is its distributability.
It seems obvious that memory footprint is an issue, and I don't know how
easily Refine's capabilities can be distributed. I would also have to be
careful to choose a distributable algorithm rather than one like
Levenshtein, whose holistic approach would run counter to a distributed
model. Any comments on these matters are appreciated.

If I were to pursue the Mahout approach, is it possible to create Vectors
from the words and phrases?

Thanks.




On 12/2/11 6:36 AM, "Pascal Coupet" <pc...@gmail.com> wrote:

>Hi Neil,
>
>I suggest you start by clustering on lexical affinities (based on
>how words look). From your examples, that seems to be what you are
>looking for. To cluster terms this way you don't really need the full
>data. You can remove all duplicates and hopefully get a much smaller set.
>
>A good way to describe terms for this purpose is to use n-grams. You can
>also use phonetic transcriptions of terms. An interesting trick that works
>well is to add a special character at the beginning of each word (in the
>n-gram method). This boosts similarity at the beginning of words, which is
>usually good.
>
>I suggest you have a look at Google
>Refine <http://code.google.com/p/google-refine/>.
>Watch the first video. It demonstrates nice term-clustering capabilities
>using different methods (n-grams, ...). If it's what you are looking for,
>you can try it on the most frequent terms in your dataset, quickly get
>interesting results, and then implement whichever method looks best to
>you.
>
>Best,
>
>Pascal

Re: Word and Phrase Clustering

Posted by Pascal Coupet <pc...@gmail.com>.

Hi Neil,

I suggest you start by clustering on lexical affinities (based on
how words look). From your examples, that seems to be what you are
looking for. To cluster terms this way you don't really need the full
data. You can remove all duplicates and hopefully get a much smaller set.

A good way to describe terms for this purpose is to use n-grams. You can
also use phonetic transcriptions of terms. An interesting trick that works
well is to add a special character at the beginning of each word (in the
n-gram method). This boosts similarity at the beginning of words, which is
usually good.
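
The n-gram trick described above can be sketched in a few lines of Python.
This is only an illustration under assumptions that are not in the thread:
the marker character "^", trigrams (n=3), and Jaccard overlap as the
similarity measure are all hypothetical choices.

```python
def char_ngrams(term, n=3, start_marker="^"):
    """Return the set of character n-grams for a term.

    Every word is prefixed with start_marker (an assumed choice), so an
    n-gram at the start of a word differs from the same letters mid-word.
    """
    grams = set()
    for word in term.lower().split():
        marked = start_marker + word
        if len(marked) < n:
            # Word too short for a full n-gram: keep it whole.
            grams.add(marked)
            continue
        for i in range(len(marked) - n + 1):
            grams.add(marked[i:i + n])
    return grams


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Compare a few of the example terms from the thread against "Alabama".
reference = char_ngrams("Alabama")
for other in ["Obama", "Bama", "University of Alabama", "Texas"]:
    print(other, jaccard(reference, char_ngrams(other)))
```

With these settings, "Alabama" and "Obama" share only the trigrams "bam"
and "ama", giving a similarity of 0.25. Without the marker the same pair
scores 1/3, so the marker does push words with different beginnings apart.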

I suggest you have a look at Google
Refine <http://code.google.com/p/google-refine/>.
Watch the first video. It demonstrates nice term-clustering capabilities
using different methods (n-grams, ...). If it's what you are looking for,
you can try it on the most frequent terms in your dataset, quickly get
interesting results, and then implement whichever method looks best to you.
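
One simple family of methods for this kind of term clustering is key
collision: normalize every term to a key and group the terms whose keys
are equal. The sketch below is a rough Python illustration of that general
idea, not Refine's actual code. The function names and the normalization
steps (lowercase, strip punctuation, sort unique tokens) are assumptions.

```python
import string
from collections import defaultdict


def fingerprint(term):
    """Normalize a term to a key: lowercase, drop punctuation, sort unique tokens."""
    no_punct = str.maketrans("", "", string.punctuation)
    cleaned = term.strip().lower().translate(no_punct)
    return " ".join(sorted(set(cleaned.split())))


def cluster_by_key(terms, key_fn=fingerprint):
    """Group terms whose keys collide; groups of size one are dropped."""
    groups = defaultdict(list)
    for term in terms:
        groups[key_fn(term)].append(term)
    return [group for group in groups.values() if len(group) > 1]


# Hypothetical variants of one of the example terms from the thread.
print(cluster_by_key(["Potomac River", "river, Potomac", "potomac river.", "Texas"]))
```

Because each term is keyed on its own and then grouped by key, this style
of clustering needs no pairwise comparisons, which may make it easier to
spread over many machines than a distance-based method.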

Best,

Pascal


Re: Word and Phrase Clustering

Posted by "Lin, Zhiwei" <zh...@sap.com>.

I suppose you could do so if you use sequence similarity.
I know that it can be integrated into hierarchical clustering, but it seems that hierarchical clustering has not yet become part of Mahout.
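
As a rough illustration of sequence similarity inside a hierarchical
scheme, the Python sketch below pairs Levenshtein edit distance with a
naive single-linkage merge that stops at a distance threshold. The choice
of edit distance, the threshold, and the function names are assumptions
made for this example, and the code has nothing to do with Mahout.

```python
def edit_distance(a, b):
    """Levenshtein distance by dynamic programming, one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def single_linkage(terms, max_distance):
    """Merge clusters while any cross-cluster pair is within max_distance."""
    clusters = [[term] for term in terms]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(edit_distance(x, y) <= max_distance
                       for x in clusters[i] for y in clusters[j]):
                    clusters[i].extend(clusters.pop(j))
                    merged = True
                    break
            if merged:
                break
    return clusters


# Hypothetical misspelling added to the thread's example terms.
print(single_linkage(["potomac", "potomak", "texas"], max_distance=1))
```

Every merge step here compares pairs of terms across clusters, so the cost
grows quickly with the size of the list. On a very large collection some
way of limiting the candidate pairs would probably be needed first.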

----- Original Message -----
From: Neil Chaudhuri [mailto:nchaudhuri@potomacfusion.com]
Sent: Friday, December 02, 2011 05:48 AM
To: user@mahout.apache.org <us...@mahout.apache.org>
Subject: Re: Word and Phrase Clustering

Glad to fill in more detail. Imagine I have a list of words and phrases in a data store like this:

Alabama
Obama
University of Alabama
Bama
Potomac
Texas
Potomac River

I would like to cluster the ones that look similar enough to be the same. Like "Alabama" and "University of Alabama" and "Bama" (but not Obama ideally) or "Potomac" and "Potomac River." 

Now this list of words could be in the terabytes range, which is why I need distributed computing capability.

How would I assemble a Vector from an individual entry in this list? With a bit more understanding of my situation, do you think Mahout can work for me?

Please let me know if I can provide more information.

Thanks.


Re: Word and Phrase Clustering

Posted by Neil Chaudhuri <nc...@potomacfusion.com>.

Glad to fill in more detail. Imagine I have a list of words and phrases in a data store like this:

Alabama
Obama
University of Alabama
Bama
Potomac
Texas
Potomac River

I would like to cluster the ones that look similar enough to be the same. Like "Alabama" and "University of Alabama" and "Bama" (but not Obama ideally) or "Potomac" and "Potomac River." 

Now this list of words could be in the terabytes range, which is why I need distributed computing capability.

How would I assemble a Vector from an individual entry in this list? With a bit more understanding of my situation, do you think Mahout can work for me?

Please let me know if I can provide more information.

Thanks.

On Dec 1, 2011, at 11:29 PM, Jeff Eastman wrote:

> Could you elaborate a bit on what you mean by "cluster a collection of
> words and phrases by syntactic similarity over a distributed
> environment"? If you can describe your collection in terms of a set of
> (sparse or dense) term vectors, then you should be able to use Mahout
> clustering directly. The vectors do not need to be huge (as "document"
> might imply); indeed, lower-dimensional clusterings work better than
> high-dimensional ones. The first question is how you plan to encode these
> vectors. The second is how large a collection you have.
> 
> On 12/1/11 8:08 PM, Neil Chaudhuri wrote:
>> I have a need to cluster a collection of words and phrases by syntactic similarity over a distributed environment, and I came upon Mahout as a possible solution. After studying the documentation though, I am finding all of it tailored to working with entire documents rather than words and phrases. I simply want to know if you believe that Mahout is the right tool for this job. I suppose I could try to view each word and phrase as individual tiny documents, but that feels like I am forcing it.
>> 
>> Any insight is appreciated.
>> 
>> Thanks.
>> 
>

Re: Word and Phrase Clustering

Posted by Jeff Eastman <jd...@windwardsolutions.com>.

Could you elaborate a bit on what you mean by "cluster a collection of
words and phrases by syntactic similarity over a distributed
environment"? If you can describe your collection in terms of a set of
(sparse or dense) term vectors, then you should be able to use Mahout
clustering directly. The vectors do not need to be huge (as "document"
might imply); indeed, lower-dimensional clusterings work better than
high-dimensional ones. The first question is how you plan to encode these
vectors. The second is how large a collection you have.
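
One possible answer to the encoding question, sketched in Python under
stated assumptions: hash each character trigram of a term into a fixed
number of buckets and keep the counts as a sparse vector. The trigram
size, the 1024 dimensions, the CRC32 hash, and the plain dict that stands
in for a sparse vector are all illustrative choices, not Mahout API.

```python
import math
import zlib
from collections import Counter


def term_to_vector(term, n=3, dimensions=1024):
    """Encode a term as a sparse vector {index: count} of hashed character n-grams."""
    text = term.lower()
    if not text:
        return {}
    # Terms shorter than n contribute a single gram holding the whole term.
    grams = [text[i:i + n] for i in range(max(len(text) - n + 1, 1))]
    vector = Counter()
    for gram in grams:
        # CRC32 is stable across runs, unlike Python's built-in str hash.
        index = zlib.crc32(gram.encode("utf-8")) % dimensions
        vector[index] += 1
    return dict(vector)


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(index, 0) for index, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


potomac = term_to_vector("Potomac")
print(cosine(potomac, term_to_vector("Potomac River")))
print(cosine(potomac, term_to_vector("Texas")))
```

The dimensionality stays fixed no matter how many distinct terms there
are, which fits the point above that the vectors do not need to be huge.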

On 12/1/11 8:08 PM, Neil Chaudhuri wrote:
> I have a need to cluster a collection of words and phrases by syntactic similarity over a distributed environment, and I came upon Mahout as a possible solution. After studying the documentation though, I am finding all of it tailored to working with entire documents rather than words and phrases. I simply want to know if you believe that Mahout is the right tool for this job. I suppose I could try to view each word and phrase as individual tiny documents, but that feels like I am forcing it.
>
> Any insight is appreciated.
>
> Thanks.
>

Re: Word and Phrase Clustering

Posted by Ted Dunning <te...@gmail.com>.

It depends on what you mean by syntactic similarity.  If you mean how the
words are used, then you can build a document for each word by collecting a
reasonably sized sample of all the words that appear near it.  These
neighbor-word documents can be clustered like any other documents and
should give you reasonable usage clusters.
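
A minimal Python sketch of building such per-word documents follows. The
window size, whitespace tokenization, and the two sample sentences are
assumptions made for illustration only.

```python
from collections import defaultdict


def context_documents(sentences, window=2):
    """Map each word to the list of words seen within `window` positions of it."""
    contexts = defaultdict(list)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start = max(i - window, 0)
            # Neighbors on the left, then neighbors on the right.
            neighbors = tokens[start:i] + tokens[i + 1:i + 1 + window]
            contexts[word].extend(neighbors)
    return dict(contexts)


# Hypothetical corpus: "river" and "creek" occur in the same contexts.
docs = context_documents(["the river flows south", "the creek flows south"], window=1)
for word, neighbors in docs.items():
    print(word, neighbors)
```

Each value in the result can then be treated as the text of a small
document and passed through whatever document vectorization and
clustering steps are already in use.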

If you mean internal structure, then you need to do something a bit
different.

On Thu, Dec 1, 2011 at 7:08 PM, Neil Chaudhuri <nchaudhuri@potomacfusion.com> wrote:

> I have a need to cluster a collection of words and phrases by syntactic
> similarity over a distributed environment, and I came upon Mahout as a
> possible solution. After studying the documentation though, I am finding
> all of it tailored to working with entire documents rather than words and
> phrases. I simply want to know if you believe that Mahout is the right tool
> for this job. I suppose I could try to view each word and phrase as
> individual tiny documents, but that feels like I am forcing it.
>
> Any insight is appreciated.
>
> Thanks.
>