Posted to user@mahout.apache.org by Vckay <da...@gmail.com> on 2011/07/14 11:38:30 UTC

Kernels for Text Clustering

I am clustering some real-world text data using K-Means. I recently came
across Kernel K-Means and wanted to know if someone who has had experience
with kernels could comment on their appropriateness for text data, i.e.,
would using a kernel boost k-means quality? (I know this is rather general,
but it is hard to figure out whether my high-dimensional real-world data
is linearly separable.) If so, are there any kernels with "practically
accepted" parameters?

Thanks
VC
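
For readers unfamiliar with the technique being asked about: kernel k-means assigns each point to the cluster whose centroid is nearest in the kernel's feature space, and that distance can be computed from kernel evaluations alone. The following is a minimal plain-Python sketch, not Mahout code. The function names, the RBF kernel, the gamma value, the toy points, and the starting labels are all assumptions made for illustration.

```python
import math

def rbf_kernel(x, y, gamma=0.1):
    """Gaussian (RBF) kernel. gamma is an illustrative value, not a recommendation."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def kernel_kmeans(points, k, kernel, init_labels, max_iter=50):
    """Kernel k-means using only the Gram matrix K[i][j] = kernel(x_i, x_j)."""
    n = len(points)
    K = [[kernel(points[i], points[j]) for j in range(n)] for i in range(n)]
    labels = list(init_labels)
    for _ in range(max_iter):
        clusters = [[i for i in range(n) if labels[i] == c] for c in range(k)]
        # Mean pairwise kernel value inside each cluster (the centroid's squared norm).
        within = [
            sum(K[i][j] for i in m for j in m) / (len(m) ** 2) if m else 0.0
            for m in clusters
        ]
        new_labels = []
        for i in range(n):
            best_c, best_d = labels[i], float("inf")
            for c, members in enumerate(clusters):
                if not members:
                    continue  # skip empty clusters
                cross = sum(K[i][j] for j in members) / len(members)
                # Squared feature-space distance from point i to centroid c.
                d = K[i][i] - 2.0 * cross + within[c]
                if d < best_d:
                    best_c, best_d = c, d
            new_labels.append(best_c)
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return labels

# Hypothetical toy data: two well-separated groups, deliberately poor starting labels.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = kernel_kmeans(points, k=2, kernel=rbf_kernel, init_labels=[0, 1, 0, 1, 0, 1])
print(labels)
```

Swapping rbf_kernel for a plain dot product recovers ordinary k-means, which is one way to compare the two on the same data.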

Re: Kernels for Text Clustering

Posted by Eshwaran Vijaya Kumar <ev...@mozilla.com>.

Yeah, so that is the other thing: since text is already so high dimensional, wouldn't projecting it into an infinite-dimensional vector space be of limited utility?

On Jul 14, 2011, at 11:19 AM, Hector Yee wrote:

> Same reason you would use kernels instead of linear for SVMs... you can get
> more separation in a different space.
> But text is already so high dimensional...
> 
> On Thu, Jul 14, 2011 at 11:14 AM, Eshwaran Vijaya Kumar <
> evijayakumar@mozilla.com> wrote:
> 
>> Assuming the OP was doing cosine similarity (as is commonly done with text)
>> while clustering, wouldn't that implicitly imply the use of a Kernel ? Would
>> using a separate kernel help?
>> 
>> On Jul 14, 2011, at 6:56 AM, Hector Yee wrote:
>> 
>>> The histogram intersection kernel would work well and it has no
>> parameters
>>> 
>>> Sent from my iPad
>>> 
>>> On Jul 14, 2011, at 2:38 AM, Vckay <da...@gmail.com> wrote:
>>> 
>>>> I am clustering some real world text data using K-Means. I recently came
>>>> across Kernel K-Means and wanted to know if someone who has had
>> experience
>>>> with Kernels could comment on their appropriateness for text data, i.e,
>>>> Would using a Kernel boost k-means quality? ( I know this is rather
>> general
>>>> but it is sort of hard to figure out if my high dimensional real world
>> data
>>>> is linearly separable.) If so, are there any Kernel's with "practically
>>>> accepted" parameters?
>>>> 
>>>> Thanks
>>>> VC
>> 
>> 
> 
> 
> -- 
> Yee Yang Li Hector
> http://hectorgon.blogspot.com/ (tech + travel)
> http://hectorgon.com (book reviews)

Re: Kernels for Text Clustering

Posted by Hector Yee <he...@gmail.com>.

For the same reason you would use a kernel instead of a linear model for
SVMs: you can get more separation in a different space.
But text is already so high dimensional...

On Thu, Jul 14, 2011 at 11:14 AM, Eshwaran Vijaya Kumar <
evijayakumar@mozilla.com> wrote:

> Assuming the OP was doing cosine similarity (as is commonly done with text)
> while clustering, wouldn't that implicitly imply the use of a Kernel ? Would
> using a separate kernel help?
>
> On Jul 14, 2011, at 6:56 AM, Hector Yee wrote:
>
> > The histogram intersection kernel would work well and it has no
> parameters
> >
> > Sent from my iPad
> >
> > On Jul 14, 2011, at 2:38 AM, Vckay <da...@gmail.com> wrote:
> >
> >> I am clustering some real world text data using K-Means. I recently came
> >> across Kernel K-Means and wanted to know if someone who has had
> experience
> >> with Kernels could comment on their appropriateness for text data, i.e,
> >> Would using a Kernel boost k-means quality? ( I know this is rather
> general
> >> but it is sort of hard to figure out if my high dimensional real world
> data
> >> is linearly separable.) If so, are there any Kernel's with "practically
> >> accepted" parameters?
> >>
> >> Thanks
> >> VC
>
>


-- 
Yee Yang Li Hector
http://hectorgon.blogspot.com/ (tech + travel)
http://hectorgon.com (book reviews)

Re: Kernels for Text Clustering

Posted by Eshwaran Vijaya Kumar <ev...@mozilla.com>.

Assuming the OP was using cosine similarity (as is commonly done with text) while clustering, wouldn't that already imply the use of a kernel? Would using a separate kernel help?
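
To make the cosine point concrete: cosine similarity equals the linear kernel (a plain dot product) applied to L2-normalized vectors, and for unit vectors the squared Euclidean distance is 2 - 2 * cosine. A small plain-Python check, where the two vectors are made-up TF-IDF weights and the helper names are hypothetical:

```python
import math

def dot(x, y):
    """Linear kernel: the ordinary dot product."""
    return sum(a * b for a, b in zip(x, y))

def l2_normalize(x):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(x, x))
    return [a / norm for a in x]

def cosine_similarity(x, y):
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

# Made-up TF-IDF weights for two documents (illustration only).
doc_a = [0.0, 1.2, 0.0, 3.4]
doc_b = [0.5, 0.9, 0.0, 2.0]

u, v = l2_normalize(doc_a), l2_normalize(doc_b)
cos_ab = cosine_similarity(doc_a, doc_b)
linear_kernel_uv = dot(u, v)
sq_dist_uv = sum((a - b) ** 2 for a, b in zip(u, v))

print(cos_ab, linear_kernel_uv)          # the two values agree
print(sq_dist_uv, 2.0 - 2.0 * cos_ab)    # and so do these
```

So k-means on unit-length vectors with Euclidean distance already ranks points the same way cosine similarity does; a nonlinear kernel would be a genuinely different choice.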

On Jul 14, 2011, at 6:56 AM, Hector Yee wrote:

> The histogram intersection kernel would work well and it has no parameters
> 
> Sent from my iPad
> 
> On Jul 14, 2011, at 2:38 AM, Vckay <da...@gmail.com> wrote:
> 
>> I am clustering some real world text data using K-Means. I recently came
>> across Kernel K-Means and wanted to know if someone who has had experience
>> with Kernels could comment on their appropriateness for text data, i.e,
>> Would using a Kernel boost k-means quality? ( I know this is rather general
>> but it is sort of hard to figure out if my high dimensional real world data
>> is linearly separable.) If so, are there any Kernel's with "practically
>> accepted" parameters?
>> 
>> Thanks
>> VC

Re: Kernels for Text Clustering

Posted by Hector Yee <he...@gmail.com>.

The histogram intersection kernel would work well, and it has no parameters.
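
For reference, the histogram intersection kernel sums the element-wise minimum of two non-negative vectors. A minimal plain-Python sketch follows; the L1 normalization step and the term counts are assumptions for illustration, not something specified in this thread.

```python
def l1_normalize(counts):
    """Turn non-negative counts into a histogram that sums to 1."""
    total = sum(counts)
    return [c / total for c in counts]

def histogram_intersection(x, y):
    """Histogram intersection kernel: sum of element-wise minima.

    Assumes both inputs are non-negative and of equal length.
    """
    return sum(min(a, b) for a, b in zip(x, y))

# Made-up term counts for two documents over a four-word vocabulary.
hist_a = l1_normalize([2, 0, 1, 1])
hist_b = l1_normalize([1, 1, 0, 2])

k_ab = histogram_intersection(hist_a, hist_b)
k_aa = histogram_intersection(hist_a, hist_a)
print(k_ab, k_aa)
```

With L1-normalized inputs the value lies between 0 and 1, and a histogram compared with itself gives 1.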

Sent from my iPad

On Jul 14, 2011, at 2:38 AM, Vckay <da...@gmail.com> wrote:

> I am clustering some real world text data using K-Means. I recently came
> across Kernel K-Means and wanted to know if someone who has had experience
> with Kernels could comment on their appropriateness for text data, i.e,
> Would using a Kernel boost k-means quality? ( I know this is rather general
> but it is sort of hard to figure out if my high dimensional real world data
> is linearly separable.) If so, are there any Kernel's with "practically
> accepted" parameters?
> 
> Thanks
> VC

Re: Kernels for Text Clustering

Posted by Fernando Fernández <fe...@gmail.com>.

I'm sorry,
I meant the TF-IDF vectors you have computed. Another usual step is to
reduce the dimension of those vectors using SVD, so that each document is
represented by a dense vector of only a few hundred dimensions that retains
almost all the useful information your original data contains, and then to
use these dense vectors for clustering / classifying the documents.
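
As a rough illustration of that reduction step, the sketch below projects a tiny dense TF-IDF-like matrix onto its top right singular vectors, found by power iteration with deflation. This is plain Python and not Mahout's SVD code. The matrix values, function names, and iteration count are assumptions, and a real corpus would need a sparse, scalable implementation.

```python
import math
import random

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def top_right_singular_vectors(X, k, iters=200, seed=0):
    """Approximate the top-k right singular vectors of X by power iteration on X^T X."""
    rng = random.Random(seed)
    Xt = [list(col) for col in zip(*X)]
    basis = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(len(Xt))]
        for _ in range(iters):
            w = matvec(Xt, matvec(X, v))  # apply X^T X
            # Deflation: remove components along vectors already found.
            for b in basis:
                proj = sum(wi * bi for wi, bi in zip(w, b))
                w = [wi - proj * bi for wi, bi in zip(w, b)]
            norm = math.sqrt(sum(wi * wi for wi in w))
            v = [wi / norm for wi in w]
        basis.append(v)
    return basis

def project(X, basis):
    """Coordinates of each row of X along the basis vectors."""
    return [[sum(x * b for x, b in zip(row, vec)) for vec in basis] for row in X]

# Made-up 4-document x 5-term matrix: docs 0-1 share terms, docs 2-3 share other terms.
tfidf = [
    [1.0, 0.8, 0.0, 0.0, 0.1],
    [0.9, 1.0, 0.1, 0.0, 0.0],
    [0.0, 0.1, 1.0, 0.9, 0.0],
    [0.0, 0.0, 0.8, 1.0, 0.2],
]
basis = top_right_singular_vectors(tfidf, k=2)
reduced = project(tfidf, basis)  # 4 dense 2-dimensional document vectors
for row in reduced:
    print(row)
```

The reduced rows can then be passed to k-means in place of the original TF-IDF vectors.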

2011/7/14 Vckay <da...@gmail.com>

> Not too sure what you mean by "raw text data", I am doing the usual: remove
> stop words, stem etc and then computing TF-IDF vectors before trying to
> cluster them.
>
>
> 2011/7/14 Fernando Fernández <fe...@gmail.com>
>
> > Hi vcaky,
> >
> > Are you using raw text data with k-means? It's usual to obtain some lower
> > dimension and dense representation of the documents using Singular Value
> > Decomposition and such techniques, and working with that representation
> > instead. You may want to take a look at SVD algorithms in mahout.
> >
> > Best,
> > Fernando.
> >
> > 2011/7/14 Vckay <da...@gmail.com>
> >
> > > I am clustering some real world text data using K-Means. I recently
> came
> > > across Kernel K-Means and wanted to know if someone who has had
> > experience
> > > with Kernels could comment on their appropriateness for text data, i.e,
> > > Would using a Kernel boost k-means quality? ( I know this is rather
> > general
> > > but it is sort of hard to figure out if my high dimensional real world
> > data
> > > is linearly separable.) If so, are there any Kernel's with "practically
> > > accepted" parameters?
> > >
> > > Thanks
> > > VC
> > >
> >
>

Re: Kernels for Text Clustering

Posted by Vckay <da...@gmail.com>.

Not too sure what you mean by "raw text data". I am doing the usual: removing
stop words, stemming, etc., and then computing TF-IDF vectors before trying to
cluster them.


2011/7/14 Fernando Fernández <fe...@gmail.com>

> Hi vcaky,
>
> Are you using raw text data with k-means? It's usual to obtain some lower
> dimension and dense representation of the documents using Singular Value
> Decomposition and such techniques, and working with that representation
> instead. You may want to take a look at SVD algorithms in mahout.
>
> Best,
> Fernando.
>
> 2011/7/14 Vckay <da...@gmail.com>
>
> > I am clustering some real world text data using K-Means. I recently came
> > across Kernel K-Means and wanted to know if someone who has had
> experience
> > with Kernels could comment on their appropriateness for text data, i.e,
> > Would using a Kernel boost k-means quality? ( I know this is rather
> general
> > but it is sort of hard to figure out if my high dimensional real world
> data
> > is linearly separable.) If so, are there any Kernel's with "practically
> > accepted" parameters?
> >
> > Thanks
> > VC
> >
>

Re: Kernels for Text Clustering

Posted by Fernando Fernández <fe...@gmail.com>.

Hi Vckay,

Are you using raw text data with k-means? It's usual to obtain a
lower-dimensional, dense representation of the documents using Singular Value
Decomposition or similar techniques, and to work with that representation
instead. You may want to take a look at the SVD algorithms in Mahout.

Best,
Fernando.

2011/7/14 Vckay <da...@gmail.com>

> I am clustering some real world text data using K-Means. I recently came
> across Kernel K-Means and wanted to know if someone who has had experience
> with Kernels could comment on their appropriateness for text data, i.e,
> Would using a Kernel boost k-means quality? ( I know this is rather general
> but it is sort of hard to figure out if my high dimensional real world data
> is linearly separable.) If so, are there any Kernel's with "practically
> accepted" parameters?
>
> Thanks
> VC
>