Posted to user@mahout.apache.org by Philippe Lamarche <ph...@gmail.com> on 2008/12/03 17:48:16 UTC

Text clustering

Hi,

I have a question concerning text clustering and the current
K-Means/vectors implementation.

For a school project, I did some text clustering with a subset of the Enron
corpus. I implemented a small M/R package that transforms text into TF-IDF
vector space, and then I used a little modified version of the
syntheticcontrol K-Means example. So far, all is fine.

However, the output of the k-means algorithm is vectors, just as the input
is. As I understand it, when text is transformed into vector space, the
cardinality of each vector is the number of words in your global dictionary:
all the words across all the texts being clustered. This can grow pretty
quickly. For example, with only 27000 Enron emails, even after removing
words that appear in 2 emails or fewer, the dictionary size is about 45000
words.
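
As a sketch of the transformation described above (illustrative Python only,
not the M/R package from the post; the `tf_idf` helper and the log-based IDF
weighting are assumptions for the example):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Map tokenized documents to TF-IDF vectors over a global dictionary."""
    df = Counter()                      # document frequency per word
    for doc in docs:
        df.update(set(doc))
    vocab = sorted(df)                  # one dimension per dictionary word
    n = len(docs)
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # TF * IDF for every dictionary word; most entries stay 0.0.
        vectors.append([tf[w] * math.log(n / df[w]) for w in vocab])
    return vocab, vectors

docs = [["money", "power", "energy"],
        ["energy", "trading", "money"],
        ["meeting", "lunch"]]
vocab, vecs = tf_idf(docs)
```

Every vector has the cardinality of the global dictionary, which is why it
grows with the corpus exactly as described.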

My number one problem is this: how can we find out which document a vector
represents when it comes out of the k-means algorithm? My favorite solution
would be to have a unique ID attached to each vector. Is there such an ID
in the vector implementation? Is there a better solution? Is my approach
to text clustering wrong?

Thanks for the help,

Philippe.

Re: Text clustering

Posted by Isabel Drost <is...@apache.org>.
On Friday 05 December 2008, Grant Ingersoll wrote:
> I seem to recall some discussion a while back about being able to add
> labels to the vectors/matrices, but I don't know the status of the
> patch.

https://issues.apache.org/jira/browse/MAHOUT-65 <- is the corresponding issue. 
It would be really helpful if you added your user story to the issue, so it 
can be taken into account for the patch.

Isabel


-- 
I like being single.  I'm always there when I need me.		-- Art Leo
  |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
  /,`.-'`'    -.  ;-;;,_
 |,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  <xm...@spaceboyz.net>

Re: Text clustering

Posted by Grant Ingersoll <gs...@apache.org>.
Definitely still interested!

On Feb 8, 2009, at 7:40 PM, Richard Tomsett wrote:

> Kind of lost track of this thread... when I got home for Christmas my
> laptop had ceased to work! Anyway, managed to coax it back into life
> so will go over my text clustering stuff tomorrow so that I can help
> out with a patch/example (I take it this hasn't been addressed yet as
> I haven't heard anything on this mailing list?).
>
> Richard

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: Text clustering

Posted by Richard Tomsett <in...@gmail.com>.
Kind of lost track of this thread... when I got home for Christmas my
laptop had ceased to work! Anyway, managed to coax it back into life
so will go over my text clustering stuff tomorrow so that I can help
out with a patch/example (I take it this hasn't been addressed yet as
I haven't heard anything on this mailing list?).

Richard



Re: Text clustering

Posted by dipesh <di...@gmail.com>.
Yes, Richard is right. I used the arc cosine of the value and it solved the
mismatch: Math.acos(value), which ranges from 0 to π / 2.
"...π / 2 meaning independent, 0 meaning exactly the same, with in-between
values indicating intermediate similarities or dissimilarities...."
--wiki<http://en.wikipedia.org/w/index.php?title=Jaccard_index&section=2#Tanimoto_coefficient_.28extended_Jaccard_coefficient.29>

I think the Tanimoto distance is better suited to binary values, but with
TF-IDF we have values other than 0s and 1s.

The Pearson correlation, as Sean suggested, works as a cosine distance if
the data are 'centered' (have a mean of 0). But as Richard said, TF-IDF
vectors don't contain any negative values, so they can't have a mean of 0.
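
That fix can be sketched in Python (Java's Math.acos behaves the same way;
the helper names here are made up for the example):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def angular_distance(a, b):
    # acos maps similarity 1 -> 0 (identical) and 0 -> pi/2 (independent),
    # which is the orientation a distance-based k-means expects.
    return math.acos(cosine_similarity(a, b))
```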

Regards,
Dipesh





-- 
----------------------------------------
"Help Ever Hurt Never"- Baba

Re: Text clustering

Posted by Richard Tomsett <in...@gmail.com>.
Ah, I didn't realise that there was an implementation of the Pearson
correlation, I just wrote a cosine distance measure myself. The cosine
distance does go from -1 to 1, but with TF-IDF vectors you aren't going to
get any negative values, so it effectively goes from 0 to 1. You have to be
careful though because the k-means implementation assumes a larger distance
value means "further away" (for clustering purposes), whereas with the
cosine itself a larger value obviously means "closer together".
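
The simplest way to get that orientation right is to report 1 minus the
similarity as the distance. A sketch in Python (an illustration of the idea,
not Mahout's DistanceMeasure API):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    # Flip the orientation: similarity 1 (same direction) -> distance 0,
    # similarity 0 (orthogonal, no shared terms) -> distance 1.
    return 1.0 - cosine_similarity(a, b)
```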



Re: Text clustering

Posted by Sean Owen <sr...@gmail.com>.
To answer a few recent points:

Not sure if this is helpful, but, the collaborative filtering part of
Mahout contains an implementation of cosine distance measure -- sort
of. Really it has an implementation of the Pearson correlation, which
is equivalent, if the data are 'centered' (have a mean of 0). This is,
in my opinion, a good idea. So if you agree, you could copy and adapt
this implementation of Pearson to your purpose. It is pretty easy to
re-create the actual cosine distance measure from this code too -- I used
to have it separately in the code.

The Tanimoto distance is a ratio of intersection to union of two sets,
so is between 0 and 1. Cosine distance is, essentially, the cosine of
an angle in feature-space, so is between -1 and 1.
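
The equivalence claimed here -- Pearson is the cosine of mean-centered
data -- can be checked in a few lines of Python (an illustrative sketch,
not the Mahout code):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def pearson(a, b):
    # Center each vector (subtract its mean), then take the cosine:
    # that is exactly the Pearson correlation of the two sequences.
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine([x - mean_a for x in a], [y - mean_b for y in b])
```

Because of the centering step, Pearson is unchanged by shifting a vector by
a constant, while the plain cosine is not.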


Re: Text clustering

Posted by Philippe Lamarche <ph...@gmail.com>.
Hi,

I used the Tanimoto distance. As I understand it, it's almost like the
cosine distance, with a range between 0 and infinity as opposed to 0 and
3.14. Seems to work well.
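
For reference, the extended Tanimoto (Jaccard) coefficient for real-valued
vectors can be sketched as follows (how Mahout's TanimotoDistanceMeasure
handles the details may differ; the helper names are made up here):

```python
def tanimoto_similarity(a, b):
    # Extended Jaccard: dot(a, b) / (|a|^2 + |b|^2 - dot(a, b)).
    # For 0/1 vectors this is exactly |intersection| / |union|.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sum(x * x for x in a) + sum(y * y for y in b) - dot)

def tanimoto_distance(a, b):
    # Report 1 - similarity so that identical vectors are at distance 0.
    return 1.0 - tanimoto_similarity(a, b)
```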





Re: Text clustering

Posted by dipesh <di...@gmail.com>.
Hi Philippe,

I'm also doing some work on text clustering with feature extraction. For
text clustering, the cosine distance is considered a better similarity
metric than the Euclidean distance measure. I couldn't find
CosineDistanceMeasure in Mahout; did you use a cosine distance measure in
your clustering project?

regards,
Dipesh


Re: Text clustering

Posted by Philippe Lamarche <ph...@gmail.com>.
I will try to do the same.

On Fri, Dec 5, 2008 at 8:40 AM, Grant Ingersoll <gs...@apache.org> wrote:

>
> On Dec 5, 2008, at 6:05 AM, Richard Tomsett wrote:
>
>  Sure :-) I haven't got my project on me at the moment but should be able
>> to
>> get at it some time before Xmas so will look through it again and send you
>> anything that may be useful.
>>
>
> Cool, just add a patch to JIRA, if you can.  I think we could work together
> to create a Text Clustering "example".
>

Re: Text clustering

Posted by Grant Ingersoll <gs...@apache.org>.
On Dec 5, 2008, at 6:05 AM, Richard Tomsett wrote:

> Sure :-) I haven't got my project on me at the moment but should be  
> able to
> get at it some time before Xmas so will look through it again and  
> send you
> anything that may be useful.

Cool, just add a patch to JIRA, if you can.  I think we could work  
together to create a Text Clustering "example".



--------------------------
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ

Re: Text clustering

Posted by Richard Tomsett <in...@gmail.com>.
Sure :-) I haven't got my project on me at the moment but should be able to
get at it some time before Xmas so will look through it again and send you
anything that may be useful.



Re: Text clustering

Posted by Grant Ingersoll <gs...@apache.org>.
I seem to recall some discussion a while back about being able to add  
labels to the vectors/matrices, but I don't know the status of the  
patch.

At any rate, very cool that you are using it for text clustering.  I  
still have on my list to write up how to do this and to write some  
supporting code as well.  So, if either of you cares to contribute,  
that would be most useful.

-Grant

On Dec 3, 2008, at 6:46 PM, Richard Tomsett wrote:

> Hi Phillippe,
>
> I used the K-Means on TF-IDF vectors and wondered the same thing - about
> labelling the documents. I haven't got my code on me at the moment and it
> was a few months ago that I last looked at it (so I was also probably
> using an older version of Mahout)... but I seem to remember that I did
> just as you are suggesting and simply attached a unique ID to each
> document which got passed through the map-reduce stages. This requires a
> bit of tinkering with the K-Means implementation but shouldn't be too
> much work.
>
> As for having massive vectors, you could try representing them as sparse
> vectors rather than the dense vectors the standard Mahout K-Means
> algorithm accepts, which gets rid of all the zero values in the document
> vectors. See the Javadoc for details, it'll be more reliable than my
> memory :-)
>
> Richard
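
The sparse-vector suggestion quoted above amounts to storing only the
non-zero entries. A minimal Python sketch of the representation (Mahout's
SparseVector is the real implementation; `to_sparse` and `sparse_dot` are
made-up names for illustration):

```python
def to_sparse(dense):
    # Keep only non-zero entries as {index: value}; with a 45000-word
    # dictionary, a typical email touches only a few hundred dimensions.
    return {i: v for i, v in enumerate(dense) if v != 0.0}

def sparse_dot(a, b):
    # Iterate over the smaller map; absent entries are implicit zeros.
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b[i] for i, v in a.items() if i in b)
```
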

--------------------------
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ











Re: Text clustering

Posted by Isabel Drost <is...@apache.org>.
On Friday 26 December 2008, Palleti, Pallavi wrote:
>   I also needed an ID for each vector. What I did was use
> KeyValueTextInputFormat as the input format (the default is
> TextInputFormat), give the input as "ID \t Vector" (ID and vector
> tab-separated), and change the final display part (runClustering) to
> output the ID along with the vector.

Could you please share the patch you made with us?

Isabel

-- 
We are anthill men upon an anthill world.		-- Ray Bradbury
  |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
  /,`.-'`'    -.  ;-;;,_
 |,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  <xm...@spaceboyz.net>

RE: Text clustering

Posted by "Palleti, Pallavi" <pa...@corp.aol.com>.
Hi Philippe,

  I also needed an ID for each vector. What I did was use
KeyValueTextInputFormat as the input format (the default is
TextInputFormat), give the input as "ID \t Vector" (ID and vector
tab-separated), and change the final display part (runClustering) to
output the ID along with the vector.
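To make the record layout concrete, here is a minimal, hypothetical sketch of the "ID \t Vector" encoding described above: one helper to split a tab-separated record into its document ID and weight array, and one to re-emit it so the ID survives the next map-reduce stage. Class and method names are illustrative, and this is plain Java string handling, not the actual Mahout or Hadoop API.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Hypothetical sketch of the "ID<TAB>comma-separated-weights" record
// format suggested above. Not Mahout code; names are made up.
public class IdVectorLine {

    // Split one "ID \t vector" record into its document ID and weights.
    public static Map.Entry<String, double[]> parse(String line) {
        String[] parts = line.split("\t", 2);
        String[] tokens = parts[1].split(",");
        double[] weights = new double[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            weights[i] = Double.parseDouble(tokens[i].trim());
        }
        return new SimpleEntry<>(parts[0], weights);
    }

    // Re-emit the record so the ID rides along to the next M/R stage.
    public static String format(String id, double[] weights) {
        StringBuilder sb = new StringBuilder(id).append('\t');
        for (int i = 0; i < weights.length; i++) {
            if (i > 0) sb.append(',');
            sb.append(weights[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map.Entry<String, double[]> rec = parse("enron-00042\t0.0, 1.5, 3.2");
        System.out.println(rec.getKey());    // the document ID
        System.out.println(format(rec.getKey(), rec.getValue()));
    }
}
```

With KeyValueTextInputFormat, Hadoop does the tab split for you (key = ID, value = vector text), so the mapper only needs the weight-parsing half of this.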


Thanks
Pallavi
-----Original Message-----
From: Richard Tomsett [mailto:indigentmartian@gmail.com] 
Sent: Thursday, December 04, 2008 5:16 AM
To: mahout-user@lucene.apache.org
Subject: Re: Text clustering

Hi Philippe,

I used K-Means on TF-IDF vectors and wondered the same thing - about
labelling the documents. I haven't got my code on me at the moment, and
it was a few months ago that I last looked at it (so I was also probably
using an older version of Mahout)... but I seem to remember that I did
just as you are suggesting and simply attached a unique ID to each
document, which got passed through the map-reduce stages. This requires
a bit of tinkering with the K-Means implementation but shouldn't be too
much work.

As for having massive vectors, you could try representing them as sparse
vectors rather than the dense vectors the standard Mahout K-Means example
accepts, which gets rid of all the zero values in the document vectors.
See the Javadoc for details; it'll be more reliable than my memory :-)

Richard


2008/12/3 Philippe Lamarche <ph...@gmail.com>

> Hi,
>
> I have a question concerning text clustering and the current
> K-Means/vectors implementation.
>
> For a school project, I did some text clustering with a subset of the
> Enron corpus. I implemented a small M/R package that transforms text
> into TF-IDF vector space, and then I used a slightly modified version
> of the syntheticcontrol K-Means example. So far, all is fine.
>
> However, the output of the k-means algorithm is a vector, as is the
> input. As I understand it, when text is transformed into vector space,
> the cardinality of the vector is the number of words in your global
> dictionary, i.e. all words in all texts being clustered. This can grow
> pretty quickly. For example, with only 27000 Enron emails, even after
> removing words that appear in 2 emails or fewer, the dictionary size
> is about 45000 words.
>
> My number one problem is this: how can we find out which document a
> vector represents when it comes out of the k-means algorithm? My
> favorite solution would be to have a unique ID attached to each
> vector. Is there such an ID in the vector implementation? Is there a
> better solution? Is my approach to text clustering wrong?
>
> Thanks for the help,
>
> Philippe.
>

Re: Text clustering

Posted by Richard Tomsett <in...@gmail.com>.
Hi Philippe,

I used K-Means on TF-IDF vectors and wondered the same thing - about
labelling the documents. I haven't got my code on me at the moment, and
it was a few months ago that I last looked at it (so I was also probably
using an older version of Mahout)... but I seem to remember that I did
just as you are suggesting and simply attached a unique ID to each
document, which got passed through the map-reduce stages. This requires
a bit of tinkering with the K-Means implementation but shouldn't be too
much work.

As for having massive vectors, you could try representing them as sparse
vectors rather than the dense vectors the standard Mahout K-Means example
accepts, which gets rid of all the zero values in the document vectors.
See the Javadoc for details; it'll be more reliable than my memory :-)
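To illustrate why sparse storage helps here: with a 45000-word dictionary a dense vector stores 45000 doubles per email, while a sparse one stores only the non-zero TF-IDF weights. Below is a toy sketch of the idea, assuming a simple map from dictionary index to weight; Mahout's own SparseVector class is the real implementation, so treat this as a model of the concept rather than its API.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy sparse vector: stores only non-zero TF-IDF weights, keyed by
// dictionary index. Illustrative only - not Mahout's SparseVector API.
public class ToySparseVector {
    private final Map<Integer, Double> weights = new TreeMap<>();
    private final int cardinality; // logical size, e.g. 45000

    public ToySparseVector(int cardinality) {
        this.cardinality = cardinality;
    }

    public void set(int index, double value) {
        if (value != 0.0) {
            weights.put(index, value);
        } else {
            weights.remove(index); // zeros are simply not stored
        }
    }

    public double get(int index) {
        return weights.getOrDefault(index, 0.0);
    }

    public int nonZeroCount() {
        return weights.size();
    }

    // Dot product (the core of a distance measure) iterates only over
    // this vector's non-zeros, so cost scales with the number of terms
    // actually present in the document, not with the 45000-word dictionary.
    public double dot(ToySparseVector other) {
        double sum = 0.0;
        for (Map.Entry<Integer, Double> e : weights.entrySet()) {
            sum += e.getValue() * other.get(e.getKey());
        }
        return sum;
    }
}
```

Since an email typically contains a few hundred distinct terms at most, the per-document storage and per-distance-computation cost drop by roughly two orders of magnitude compared to dense vectors of cardinality 45000.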

Richard


2008/12/3 Philippe Lamarche <ph...@gmail.com>

> Hi,
>
> I have a question concerning text clustering and the current
> K-Means/vectors implementation.
>
> For a school project, I did some text clustering with a subset of the
> Enron corpus. I implemented a small M/R package that transforms text
> into TF-IDF vector space, and then I used a slightly modified version
> of the syntheticcontrol K-Means example. So far, all is fine.
>
> However, the output of the k-means algorithm is a vector, as is the
> input. As I understand it, when text is transformed into vector space,
> the cardinality of the vector is the number of words in your global
> dictionary, i.e. all words in all texts being clustered. This can grow
> pretty quickly. For example, with only 27000 Enron emails, even after
> removing words that appear in 2 emails or fewer, the dictionary size
> is about 45000 words.
>
> My number one problem is this: how can we find out which document a
> vector represents when it comes out of the k-means algorithm? My
> favorite solution would be to have a unique ID attached to each
> vector. Is there such an ID in the vector implementation? Is there a
> better solution? Is my approach to text clustering wrong?
>
> Thanks for the help,
>
> Philippe.
>