Posted to user@mahout.apache.org by Sebastian Briesemeister <se...@unister-gmbh.de> on 2013/03/26 17:21:04 UTC

How to improve clustering?

Dear Mahout-users,

I am facing two problems when clustering instances with fuzzy c-means
clustering (cosine distance, random initial clusters):

1.) I always end up with one large set of rubbish instances. All of them
have a uniform cluster probability distribution and are therefore in the
exact middle of the cluster space.
The cosine distance between instances within this cluster ranges from 0
to 1.

2.) Some of my clusters have the same or a very similar center.

Apart from the problems described above, the clustering seems to work fine.

Does anybody have an idea how my clustering can be improved?

Regards
Sebastian
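
For reference, the standard fuzzy c-means membership update shows why
instances that are roughly equidistant from every center end up with a
uniform distribution. Below is a toy Java sketch (the distances and the
fuzziness value m are made up for illustration; this is not Mahout's
implementation):

public class FuzzyMembershipSketch {

  // u_i = 1 / sum_k (d_i / d_k)^(2/(m-1)) for one point and fuzziness m > 1
  static double[] memberships(double[] distances, double m) {
    double[] u = new double[distances.length];
    for (int i = 0; i < distances.length; i++) {
      double sum = 0.0;
      for (int k = 0; k < distances.length; k++) {
        sum += Math.pow(distances[i] / distances[k], 2.0 / (m - 1.0));
      }
      u[i] = 1.0 / sum;
    }
    return u;
  }

  public static void main(String[] args) {
    // Nearly equal distances to all four centers give memberships near 1/4
    // each -- the symptom of the "rubbish" cluster described above.
    double[] equidistant = {0.70, 0.71, 0.69, 0.70};
    // A point clearly closer to the first center gets a peaked distribution.
    double[] separated = {0.10, 0.80, 0.85, 0.90};
    System.out.println(java.util.Arrays.toString(memberships(equidistant, 2.0)));
    System.out.println(java.util.Arrays.toString(memberships(separated, 2.0)));
  }
}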

Re: How to improve clustering?

Posted by Ted Dunning <te...@gmail.com>.
Uh...

Shouldn't you be doing the IDF weighting *before* you normalize the vector
length?

On Tue, Mar 26, 2013 at 5:44 PM, Sebastian Briesemeister <
sebastian.briesemeister@unister-gmbh.de> wrote:

> ...
> For each document, I set a field in the corresponding vector to 1 if it
> contains a word. Then I normalize each vector using the L2-norm.
> Finally I multiply each element (representing a word) in the vector by
> log(#documents/#documents_with_word).
>
> For clustering, I am using cosine similarity.
>
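
A minimal sketch of the reordered pipeline, assuming Mahout's math vectors
(vocabularySize, numDocs, docFreq and termIdsInDoc are placeholders for
whatever your corpus scan produces): apply the IDF weight first, then
L2-normalize, so the cosine geometry reflects the weighted terms.

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class IdfThenNormalizeSketch {

  static Vector encode(int vocabularySize, int numDocs, int[] docFreq,
                       int[] termIdsInDoc) {
    Vector doc = new RandomAccessSparseVector(vocabularySize);
    for (int termId : termIdsInDoc) {
      // binary term presence, weighted by IDF *before* any normalization
      double idf = Math.log((double) numDocs / docFreq[termId]);
      doc.setQuick(termId, idf);
    }
    // normalize last, so the vector has unit L2 length under the weighted terms
    return doc.normalize(2);
  }
}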

Re: How to improve clustering?

Posted by Ted Dunning <te...@gmail.com>.
It makes hard cluster assignments, but that would be helpful in two ways:

a) it will help you diagnose data issues

b) it can produce good starting points for fuzzy k-means.
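
As a rough in-memory illustration of (b), with made-up data and plain
Euclidean distance for brevity (a toy sketch, not Mahout's
KMeansDriver/FuzzyKMeansDriver pipeline): run a few hard Lloyd iterations
and feed the resulting centroids to the fuzzy pass as initial centers
instead of random ones.

import java.util.Arrays;
import java.util.Random;

public class HardSeedSketch {

  // One Lloyd iteration: assign each point to its nearest centroid,
  // then recompute each centroid as the mean of its assigned points.
  static double[][] lloydStep(double[][] points, double[][] centroids) {
    double[][] sums = new double[centroids.length][points[0].length];
    int[] counts = new int[centroids.length];
    for (double[] p : points) {
      int best = 0;
      double bestDist = Double.MAX_VALUE;
      for (int c = 0; c < centroids.length; c++) {
        double d = 0.0;
        for (int i = 0; i < p.length; i++) {
          double diff = p[i] - centroids[c][i];
          d += diff * diff;
        }
        if (d < bestDist) { bestDist = d; best = c; }
      }
      counts[best]++;
      for (int i = 0; i < p.length; i++) sums[best][i] += p[i];
    }
    for (int c = 0; c < centroids.length; c++) {
      if (counts[c] > 0) {
        for (int i = 0; i < sums[c].length; i++) sums[c][i] /= counts[c];
      } else {
        sums[c] = centroids[c];  // keep an empty cluster's old center
      }
    }
    return sums;
  }

  public static void main(String[] args) {
    Random rnd = new Random(42);
    double[][] points = new double[100][5];
    for (double[] p : points) for (int i = 0; i < p.length; i++) p[i] = rnd.nextDouble();
    // start from a few of the points, run a handful of hard iterations
    double[][] centroids = {points[0].clone(), points[1].clone(), points[2].clone()};
    for (int iter = 0; iter < 10; iter++) centroids = lloydStep(points, centroids);
    // 'centroids' would now seed the fuzzy run instead of random initial clusters
    System.out.println(Arrays.deepToString(centroids));
  }
}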

On Thu, Mar 28, 2013 at 7:19 AM, Dan Filimon <da...@gmail.com> wrote:

> Sebastian, if you're interested I'd be glad to walk you through the main
> ideas, point you to the code and tell you  how to run it.
> Testing it on more data would be very helpful the project.
>
> But, it makes hard cluster assignments.
>
> On Mar 28, 2013, at 2:23, Ted Dunning <te...@gmail.com> wrote:
>
> > The streaming k-means stuff is what Dan has been working on.
> >
> > On Wed, Mar 27, 2013 at 6:14 PM, Sebastian Briesemeister <
> > sebastian.briesemeister@unister.de> wrote:
> >
> >> I did change it as you suggested. Now I normalize after the frequency
> >> weighting.
> >>
> >> The results from non fuzzy clustering are similar, but I require
> >> probabilities though.
> >>
> >> Streaming k-means stuff? I don't get you here.
> >>
> >>
> >>
> >> Ted Dunning <te...@gmail.com> schrieb:
> >>
> >>> I think you should change your vector preparation method.
> >>>
> >>> What kind of results do you get from non-fuzzy clustering?
> >>>
> >>> What about from the streaming k-means stuff?
> >>>
> >>> On Wed, Mar 27, 2013 at 5:02 PM, Sebastian Briesemeister <
> >>> sebastian.briesemeister@unister-gmbh.de> wrote:
> >>>
> >>>> Thanks for your input.
> >>>>
> >>>> The problem wasn't the high dimensional space itself but the cluster
> >>>> initialization. I validated the document cosine distance and they
> >>> look
> >>>> fairly well distributed.
> >>>>
> >>>> I now use canopy in a pre-clustering step. Interestingly, canopy
> >>>> suggests to use a large number of clusters, which might makes sense
> >>>> since the a lot of documents are unrelated due to their sparse word
> >>>> vector. If I reduce the number of clusters, a lot documents remain
> >>>> unclustered in the center of the cluster space.
> >>>> Further I would like to note that the random cluster initializations
> >>>> tends to choose initial centers that are close to each other. For
> >>> some
> >>>> reasons this leads to overlapping or even identical clusters.
> >>>>
> >>>> The problem of parameter tuning (T1 and T2) for canopy remains.
> >>> However,
> >>>> I assume their is no general strategy on this problem.
> >>>>
> >>>> Cheers
> >>>> Sebastian
> >>>>
> >>>> Am 27.03.2013 06:43, schrieb Dan Filimon:
> >>>>> Ah, so Ted, it looks like there's a bug with the mapreduce after
> >>> all
> >>>> then.
> >>>>>
> >>>>> Pity, I liked the higher dimensionality argument but thinking it
> >>>> through, it doesn't make that much sense.
> >>>>>
> >>>>> On Mar 27, 2013, at 6:52, Ted Dunning <te...@gmail.com>
> >>> wrote:
> >>>>>
> >>>>>> Reducing to a lower dimensional space is a convenience, no more.
> >>>>>>
> >>>>>> Clustering in the original space is fine.  I still have trouble
> >>> with
> >>>> your
> >>>>>> normalizing before weighting, but I don't know what effect that
> >>> will
> >>>> have
> >>>>>> on anything.  It certainly will interfere with the interpretation
> >>> of the
> >>>>>> cosine metrics.
> >>>>>>
> >>>>>> On Tue, Mar 26, 2013 at 6:18 PM, Sebastian Briesemeister <
> >>>>>> sebastian.briesemeister@unister.de> wrote:
> >>>>>>
> >>>>>>> I am not quite sure whether this will solve the problem, though I
> >>> will
> >>>> try
> >>>>>>> it of course.
> >>>>>>>
> >>>>>>> I always thought that clustering documents based on their words
> >>> is a
> >>>>>>> common problem and is usually tackled in the word space and not
> >>> in a
> >>>>>>> reduced one.
> >>>>>>> Besides the distances look reasonable. Still I end up with very
> >>> similar
> >>>>>>> and very distant documents unclustered in the middle of all
> >>> clusters.
> >>>>>>>
> >>>>>>> So I think the problem lies in the clustering method not in the
> >>>> distances.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Dan Filimon <da...@gmail.com> schrieb:
> >>>>>>>
> >>>>>>>> So you're clustering 90K dimensional data?
> >>>>>>>>
> >>>>>>>> I'm faced with a very similar problem as you (working on
> >>>>>>>> StreamingKMeans for mahout) and from what I read [1], the
> >>> problem
> >>>>>>>> might be that in very high dimensional spaces the distances
> >>> become
> >>>>>>>> meaningless.
> >>>>>>>>
> >>>>>>>> I'm pretty sure this is the case and I was considering
> >>> implementing
> >>>>>>>> the test mentioned in the paper (also I feel like it's a very
> >>> useful
> >>>>>>>> algorithm to have).
> >>>>>>>>
> >>>>>>>> In any case, since the vectors are so sparse, why not reduce
> >>> their
> >>>>>>>> dimension?
> >>>>>>>>
> >>>>>>>> You can try principal component analysis (just getting the first
> >>> k
> >>>>>>>> eigenvectors in the singular value decomposition of the matrix
> >>> that
> >>>>>>>> has your vectors as rows). The class that does this is
> >>> SSVDSolver
> >>>>>>>> (there's also SingularValueDecomposition but that tries making
> >>> dense
> >>>>>>>> matrices and those might not fit into memory. I've never
> >>> personally
> >>>>>>>> used it though.
> >>>>>>>> Once you have the first k eigenvectors of size n, make them rows
> >>> in a
> >>>>>>>> matrix (U) and multiply each vector x you have with it (U x)
> >>> getting a
> >>>>>>>> reduced vector.
> >>>>>>>>
> >>>>>>>> Or, use random projections to reduce the size of the data set.
> >>> You
> >>>>>>>> want to create a matrix whose entries are sampled from a uniform
> >>>>>>>> distribution (0, 1) (Functions.random in
> >>>>>>>> o.a.m.math.function.Functions), normalize its rows and multiply
> >>> each
> >>>>>>>> vector x with it.
> >>>>>>>>
> >>>>>>>> So, reduce the size of your vectors thereby making the
> >>> dimensionality
> >>>>>>>> less of a problem and you'll get a decent approximation (you can
> >>>>>>>> actually quantify how good it is with SVD). From what I've seen,
> >>> the
> >>>>>>>> clusters separate at smaller dimensions but there's the question
> >>> of
> >>>>>>>> how good an approximation of the uncompressed data you have.
> >>>>>>>>
> >>>>>>>> See if this helps, I need to do the same thing :)
> >>>>>>>>
> >>>>>>>> What do you think?
> >>>>>>>>
> >>>>>>>> [1] http://www.cs.bham.ac.uk/~axk/dcfin2.pdf
> >>>>>>>>
> >>>>>>>> On Tue, Mar 26, 2013 at 6:44 PM, Sebastian Briesemeister
> >>>>>>>> <se...@unister-gmbh.de> wrote:
> >>>>>>>>> The dataset consists of about 4000 documents and is encoded by
> >>> 90.000
> >>>>>>>>> words. However, each document contains usually only about 10 to
> >>> 20
> >>>>>>>>> words. Only some contain more than 1000 words.
> >>>>>>>>>
> >>>>>>>>> For each document, I set a field in the corresponding vector to
> >>> 1 if
> >>>>>>>> it
> >>>>>>>>> contains a word. Then I normalize each vector using the
> >>> L2-norm.
> >>>>>>>>> Finally I multiply each element (representing a word) in the
> >>> vector
> >>>>>>>> by
> >>>>>>>>> log(#documents/#documents_with_word).
> >>>>>>>>>
> >>>>>>>>> For clustering, I am using cosine similarity.
> >>>>>>>>>
> >>>>>>>>> Regards
> >>>>>>>>> Sebastian
> >>>>>>>>>
> >>>>>>>>> Am 26.03.2013 17:33, schrieb Dan Filimon:
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>> Could you tell us more about the kind of data you're
> >>> clustering?
> >>>>>>>> What
> >>>>>>>>>> distance measure you're using and what the dimensionality of
> >>> the
> >>>>>>>> data
> >>>>>>>>>> is?
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Mar 26, 2013 at 6:21 PM, Sebastian Briesemeister
> >>>>>>>>>> <se...@unister-gmbh.de> wrote:
> >>>>>>>>>>> Dear Mahout-users,
> >>>>>>>>>>>
> >>>>>>>>>>> I am facing two problems when I am clustering instances with
> >>> Fuzzy
> >>>>>>>> c
> >>>>>>>>>>> Means clustering (cosine distance, random initial
> >>> clustering):
> >>>>>>>>>>>
> >>>>>>>>>>> 1.) I always end up with one large set of rubbish instances.
> >>> All of
> >>>>>>>> them
> >>>>>>>>>>> have uniform cluster probability distribution and are, hence,
> >>> in
> >>>>>>>> the
> >>>>>>>>>>> exact middle of the cluster space.
> >>>>>>>>>>> The cosine distance between instances within this cluster
> >>> reaches
> >>>>>>>> from 0
> >>>>>>>>>>> to 1.
> >>>>>>>>>>>
> >>>>>>>>>>> 2.) Some of my clusters have the same or a very very similar
> >>>>>>>> center.
> >>>>>>>>>>> Besides the above described problems, the clustering seems to
> >>> work
> >>>>>>>> fine.
> >>>>>>>>>>> Has somebody an idea how my clustering can be improved?
> >>>>>>>>>>>
> >>>>>>>>>>> Regards
> >>>>>>>>>>> Sebastian
> >>>>>>> --
> >>>>>>> Diese Nachricht wurde von meinem Android-Mobiltelefon mit K-9
> >>> Mail
> >>>>>>> gesendet.
> >>
> >> --
> >> Diese Nachricht wurde von meinem Android-Mobiltelefon mit K-9 Mail
> >> gesendet.
> >>
>

Re: How to improve clustering?

Posted by Dan Filimon <da...@gmail.com>.
Sebastian, if you're interested I'd be glad to walk you through the main ideas, point you to the code and tell you how to run it.
Testing it on more data would be very helpful to the project.

But it makes hard cluster assignments.

On Mar 28, 2013, at 2:23, Ted Dunning <te...@gmail.com> wrote:

> The streaming k-means stuff is what Dan has been working on.
> 
> On Wed, Mar 27, 2013 at 6:14 PM, Sebastian Briesemeister <
> sebastian.briesemeister@unister.de> wrote:
> 
>> I did change it as you suggested. Now I normalize after the frequency
>> weighting.
>> 
>> The results from non fuzzy clustering are similar, but I require
>> probabilities though.
>> 
>> Streaming k-means stuff? I don't get you here.
>> 
>> 
>> 
>> Ted Dunning <te...@gmail.com> schrieb:
>> 
>>> I think you should change your vector preparation method.
>>> 
>>> What kind of results do you get from non-fuzzy clustering?
>>> 
>>> What about from the streaming k-means stuff?
>>> 
>>> On Wed, Mar 27, 2013 at 5:02 PM, Sebastian Briesemeister <
>>> sebastian.briesemeister@unister-gmbh.de> wrote:
>>> 
>>>> Thanks for your input.
>>>> 
>>>> The problem wasn't the high dimensional space itself but the cluster
>>>> initialization. I validated the document cosine distance and they
>>> look
>>>> fairly well distributed.
>>>> 
>>>> I now use canopy in a pre-clustering step. Interestingly, canopy
>>>> suggests to use a large number of clusters, which might makes sense
>>>> since the a lot of documents are unrelated due to their sparse word
>>>> vector. If I reduce the number of clusters, a lot documents remain
>>>> unclustered in the center of the cluster space.
>>>> Further I would like to note that the random cluster initializations
>>>> tends to choose initial centers that are close to each other. For
>>> some
>>>> reasons this leads to overlapping or even identical clusters.
>>>> 
>>>> The problem of parameter tuning (T1 and T2) for canopy remains.
>>> However,
>>>> I assume their is no general strategy on this problem.
>>>> 
>>>> Cheers
>>>> Sebastian
>>>> 
>>>> Am 27.03.2013 06:43, schrieb Dan Filimon:
>>>>> Ah, so Ted, it looks like there's a bug with the mapreduce after
>>> all
>>>> then.
>>>>> 
>>>>> Pity, I liked the higher dimensionality argument but thinking it
>>>> through, it doesn't make that much sense.
>>>>> 
>>>>> On Mar 27, 2013, at 6:52, Ted Dunning <te...@gmail.com>
>>> wrote:
>>>>> 
>>>>>> Reducing to a lower dimensional space is a convenience, no more.
>>>>>> 
>>>>>> Clustering in the original space is fine.  I still have trouble
>>> with
>>>> your
>>>>>> normalizing before weighting, but I don't know what effect that
>>> will
>>>> have
>>>>>> on anything.  It certainly will interfere with the interpretation
>>> of the
>>>>>> cosine metrics.
>>>>>> 
>>>>>> On Tue, Mar 26, 2013 at 6:18 PM, Sebastian Briesemeister <
>>>>>> sebastian.briesemeister@unister.de> wrote:
>>>>>> 
>>>>>>> I am not quite sure whether this will solve the problem, though I
>>> will
>>>> try
>>>>>>> it of course.
>>>>>>> 
>>>>>>> I always thought that clustering documents based on their words
>>> is a
>>>>>>> common problem and is usually tackled in the word space and not
>>> in a
>>>>>>> reduced one.
>>>>>>> Besides the distances look reasonable. Still I end up with very
>>> similar
>>>>>>> and very distant documents unclustered in the middle of all
>>> clusters.
>>>>>>> 
>>>>>>> So I think the problem lies in the clustering method not in the
>>>> distances.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Dan Filimon <da...@gmail.com> schrieb:
>>>>>>> 
>>>>>>>> So you're clustering 90K dimensional data?
>>>>>>>> 
>>>>>>>> I'm faced with a very similar problem as you (working on
>>>>>>>> StreamingKMeans for mahout) and from what I read [1], the
>>> problem
>>>>>>>> might be that in very high dimensional spaces the distances
>>> become
>>>>>>>> meaningless.
>>>>>>>> 
>>>>>>>> I'm pretty sure this is the case and I was considering
>>> implementing
>>>>>>>> the test mentioned in the paper (also I feel like it's a very
>>> useful
>>>>>>>> algorithm to have).
>>>>>>>> 
>>>>>>>> In any case, since the vectors are so sparse, why not reduce
>>> their
>>>>>>>> dimension?
>>>>>>>> 
>>>>>>>> You can try principal component analysis (just getting the first
>>> k
>>>>>>>> eigenvectors in the singular value decomposition of the matrix
>>> that
>>>>>>>> has your vectors as rows). The class that does this is
>>> SSVDSolver
>>>>>>>> (there's also SingularValueDecomposition but that tries making
>>> dense
>>>>>>>> matrices and those might not fit into memory. I've never
>>> personally
>>>>>>>> used it though.
>>>>>>>> Once you have the first k eigenvectors of size n, make them rows
>>> in a
>>>>>>>> matrix (U) and multiply each vector x you have with it (U x)
>>> getting a
>>>>>>>> reduced vector.
>>>>>>>> 
>>>>>>>> Or, use random projections to reduce the size of the data set.
>>> You
>>>>>>>> want to create a matrix whose entries are sampled from a uniform
>>>>>>>> distribution (0, 1) (Functions.random in
>>>>>>>> o.a.m.math.function.Functions), normalize its rows and multiply
>>> each
>>>>>>>> vector x with it.
>>>>>>>> 
>>>>>>>> So, reduce the size of your vectors thereby making the
>>> dimensionality
>>>>>>>> less of a problem and you'll get a decent approximation (you can
>>>>>>>> actually quantify how good it is with SVD). From what I've seen,
>>> the
>>>>>>>> clusters separate at smaller dimensions but there's the question
>>> of
>>>>>>>> how good an approximation of the uncompressed data you have.
>>>>>>>> 
>>>>>>>> See if this helps, I need to do the same thing :)
>>>>>>>> 
>>>>>>>> What do you think?
>>>>>>>> 
>>>>>>>> [1] http://www.cs.bham.ac.uk/~axk/dcfin2.pdf
>>>>>>>> 
>>>>>>>> On Tue, Mar 26, 2013 at 6:44 PM, Sebastian Briesemeister
>>>>>>>> <se...@unister-gmbh.de> wrote:
>>>>>>>>> The dataset consists of about 4000 documents and is encoded by
>>> 90.000
>>>>>>>>> words. However, each document contains usually only about 10 to
>>> 20
>>>>>>>>> words. Only some contain more than 1000 words.
>>>>>>>>> 
>>>>>>>>> For each document, I set a field in the corresponding vector to
>>> 1 if
>>>>>>>> it
>>>>>>>>> contains a word. Then I normalize each vector using the
>>> L2-norm.
>>>>>>>>> Finally I multiply each element (representing a word) in the
>>> vector
>>>>>>>> by
>>>>>>>>> log(#documents/#documents_with_word).
>>>>>>>>> 
>>>>>>>>> For clustering, I am using cosine similarity.
>>>>>>>>> 
>>>>>>>>> Regards
>>>>>>>>> Sebastian
>>>>>>>>> 
>>>>>>>>> Am 26.03.2013 17:33, schrieb Dan Filimon:
>>>>>>>>>> Hi,
>>>>>>>>>> 
>>>>>>>>>> Could you tell us more about the kind of data you're
>>> clustering?
>>>>>>>> What
>>>>>>>>>> distance measure you're using and what the dimensionality of
>>> the
>>>>>>>> data
>>>>>>>>>> is?
>>>>>>>>>> 
>>>>>>>>>> On Tue, Mar 26, 2013 at 6:21 PM, Sebastian Briesemeister
>>>>>>>>>> <se...@unister-gmbh.de> wrote:
>>>>>>>>>>> Dear Mahout-users,
>>>>>>>>>>> 
>>>>>>>>>>> I am facing two problems when I am clustering instances with
>>> Fuzzy
>>>>>>>> c
>>>>>>>>>>> Means clustering (cosine distance, random initial
>>> clustering):
>>>>>>>>>>> 
>>>>>>>>>>> 1.) I always end up with one large set of rubbish instances.
>>> All of
>>>>>>>> them
>>>>>>>>>>> have uniform cluster probability distribution and are, hence,
>>> in
>>>>>>>> the
>>>>>>>>>>> exact middle of the cluster space.
>>>>>>>>>>> The cosine distance between instances within this cluster
>>> reaches
>>>>>>>> from 0
>>>>>>>>>>> to 1.
>>>>>>>>>>> 
>>>>>>>>>>> 2.) Some of my clusters have the same or a very very similar
>>>>>>>> center.
>>>>>>>>>>> Besides the above described problems, the clustering seems to
>>> work
>>>>>>>> fine.
>>>>>>>>>>> Has somebody an idea how my clustering can be improved?
>>>>>>>>>>> 
>>>>>>>>>>> Regards
>>>>>>>>>>> Sebastian
>>>>>>> --
>>>>>>> Diese Nachricht wurde von meinem Android-Mobiltelefon mit K-9
>>> Mail
>>>>>>> gesendet.
>> 
>> --
>> Diese Nachricht wurde von meinem Android-Mobiltelefon mit K-9 Mail
>> gesendet.
>> 

Re: How to improve clustering?

Posted by Ted Dunning <te...@gmail.com>.
The streaming k-means stuff is what Dan has been working on.

On Wed, Mar 27, 2013 at 6:14 PM, Sebastian Briesemeister <
sebastian.briesemeister@unister.de> wrote:

> I did change it as you suggested. Now I normalize after the frequency
> weighting.
>
> The results from non fuzzy clustering are similar, but I require
> probabilities though.
>
> Streaming k-means stuff? I don't get you here.
>
>
>
> Ted Dunning <te...@gmail.com> schrieb:
>
> >I think you should change your vector preparation method.
> >
> >What kind of results do you get from non-fuzzy clustering?
> >
> >What about from the streaming k-means stuff?
> >
> >On Wed, Mar 27, 2013 at 5:02 PM, Sebastian Briesemeister <
> >sebastian.briesemeister@unister-gmbh.de> wrote:
> >
> >> Thanks for your input.
> >>
> >> The problem wasn't the high dimensional space itself but the cluster
> >> initialization. I validated the document cosine distance and they
> >look
> >> fairly well distributed.
> >>
> >> I now use canopy in a pre-clustering step. Interestingly, canopy
> >> suggests to use a large number of clusters, which might makes sense
> >> since the a lot of documents are unrelated due to their sparse word
> >> vector. If I reduce the number of clusters, a lot documents remain
> >> unclustered in the center of the cluster space.
> >> Further I would like to note that the random cluster initializations
> >> tends to choose initial centers that are close to each other. For
> >some
> >> reasons this leads to overlapping or even identical clusters.
> >>
> >> The problem of parameter tuning (T1 and T2) for canopy remains.
> >However,
> >> I assume their is no general strategy on this problem.
> >>
> >> Cheers
> >> Sebastian
> >>
> >> Am 27.03.2013 06:43, schrieb Dan Filimon:
> >> > Ah, so Ted, it looks like there's a bug with the mapreduce after
> >all
> >> then.
> >> >
> >> > Pity, I liked the higher dimensionality argument but thinking it
> >> through, it doesn't make that much sense.
> >> >
> >> > On Mar 27, 2013, at 6:52, Ted Dunning <te...@gmail.com>
> >wrote:
> >> >
> >> >> Reducing to a lower dimensional space is a convenience, no more.
> >> >>
> >> >> Clustering in the original space is fine.  I still have trouble
> >with
> >> your
> >> >> normalizing before weighting, but I don't know what effect that
> >will
> >> have
> >> >> on anything.  It certainly will interfere with the interpretation
> >of the
> >> >> cosine metrics.
> >> >>
> >> >> On Tue, Mar 26, 2013 at 6:18 PM, Sebastian Briesemeister <
> >> >> sebastian.briesemeister@unister.de> wrote:
> >> >>
> >> >>> I am not quite sure whether this will solve the problem, though I
> >will
> >> try
> >> >>> it of course.
> >> >>>
> >> >>> I always thought that clustering documents based on their words
> >is a
> >> >>> common problem and is usually tackled in the word space and not
> >in a
> >> >>> reduced one.
> >> >>> Besides the distances look reasonable. Still I end up with very
> >similar
> >> >>> and very distant documents unclustered in the middle of all
> >clusters.
> >> >>>
> >> >>> So I think the problem lies in the clustering method not in the
> >> distances.
> >> >>>
> >> >>>
> >> >>>
> >> >>> Dan Filimon <da...@gmail.com> schrieb:
> >> >>>
> >> >>>> So you're clustering 90K dimensional data?
> >> >>>>
> >> >>>> I'm faced with a very similar problem as you (working on
> >> >>>> StreamingKMeans for mahout) and from what I read [1], the
> >problem
> >> >>>> might be that in very high dimensional spaces the distances
> >become
> >> >>>> meaningless.
> >> >>>>
> >> >>>> I'm pretty sure this is the case and I was considering
> >implementing
> >> >>>> the test mentioned in the paper (also I feel like it's a very
> >useful
> >> >>>> algorithm to have).
> >> >>>>
> >> >>>> In any case, since the vectors are so sparse, why not reduce
> >their
> >> >>>> dimension?
> >> >>>>
> >> >>>> You can try principal component analysis (just getting the first
> >k
> >> >>>> eigenvectors in the singular value decomposition of the matrix
> >that
> >> >>>> has your vectors as rows). The class that does this is
> >SSVDSolver
> >> >>>> (there's also SingularValueDecomposition but that tries making
> >dense
> >> >>>> matrices and those might not fit into memory. I've never
> >personally
> >> >>>> used it though.
> >> >>>> Once you have the first k eigenvectors of size n, make them rows
> >in a
> >> >>>> matrix (U) and multiply each vector x you have with it (U x)
> >getting a
> >> >>>> reduced vector.
> >> >>>>
> >> >>>> Or, use random projections to reduce the size of the data set.
> >You
> >> >>>> want to create a matrix whose entries are sampled from a uniform
> >> >>>> distribution (0, 1) (Functions.random in
> >> >>>> o.a.m.math.function.Functions), normalize its rows and multiply
> >each
> >> >>>> vector x with it.
> >> >>>>
> >> >>>> So, reduce the size of your vectors thereby making the
> >dimensionality
> >> >>>> less of a problem and you'll get a decent approximation (you can
> >> >>>> actually quantify how good it is with SVD). From what I've seen,
> >the
> >> >>>> clusters separate at smaller dimensions but there's the question
> >of
> >> >>>> how good an approximation of the uncompressed data you have.
> >> >>>>
> >> >>>> See if this helps, I need to do the same thing :)
> >> >>>>
> >> >>>> What do you think?
> >> >>>>
> >> >>>> [1] http://www.cs.bham.ac.uk/~axk/dcfin2.pdf
> >> >>>>
> >> >>>> On Tue, Mar 26, 2013 at 6:44 PM, Sebastian Briesemeister
> >> >>>> <se...@unister-gmbh.de> wrote:
> >> >>>>> The dataset consists of about 4000 documents and is encoded by
> >90.000
> >> >>>>> words. However, each document contains usually only about 10 to
> >20
> >> >>>>> words. Only some contain more than 1000 words.
> >> >>>>>
> >> >>>>> For each document, I set a field in the corresponding vector to
> >1 if
> >> >>>> it
> >> >>>>> contains a word. Then I normalize each vector using the
> >L2-norm.
> >> >>>>> Finally I multiply each element (representing a word) in the
> >vector
> >> >>>> by
> >> >>>>> log(#documents/#documents_with_word).
> >> >>>>>
> >> >>>>> For clustering, I am using cosine similarity.
> >> >>>>>
> >> >>>>> Regards
> >> >>>>> Sebastian
> >> >>>>>
> >> >>>>> Am 26.03.2013 17:33, schrieb Dan Filimon:
> >> >>>>>> Hi,
> >> >>>>>>
> >> >>>>>> Could you tell us more about the kind of data you're
> >clustering?
> >> >>>> What
> >> >>>>>> distance measure you're using and what the dimensionality of
> >the
> >> >>>> data
> >> >>>>>> is?
> >> >>>>>>
> >> >>>>>> On Tue, Mar 26, 2013 at 6:21 PM, Sebastian Briesemeister
> >> >>>>>> <se...@unister-gmbh.de> wrote:
> >> >>>>>>> Dear Mahout-users,
> >> >>>>>>>
> >> >>>>>>> I am facing two problems when I am clustering instances with
> >Fuzzy
> >> >>>> c
> >> >>>>>>> Means clustering (cosine distance, random initial
> >clustering):
> >> >>>>>>>
> >> >>>>>>> 1.) I always end up with one large set of rubbish instances.
> >All of
> >> >>>> them
> >> >>>>>>> have uniform cluster probability distribution and are, hence,
> >in
> >> >>>> the
> >> >>>>>>> exact middle of the cluster space.
> >> >>>>>>> The cosine distance between instances within this cluster
> >reaches
> >> >>>> from 0
> >> >>>>>>> to 1.
> >> >>>>>>>
> >> >>>>>>> 2.) Some of my clusters have the same or a very very similar
> >> >>>> center.
> >> >>>>>>> Besides the above described problems, the clustering seems to
> >work
> >> >>>> fine.
> >> >>>>>>> Has somebody an idea how my clustering can be improved?
> >> >>>>>>>
> >> >>>>>>> Regards
> >> >>>>>>> Sebastian
> >> >>> --
> >> >>> Diese Nachricht wurde von meinem Android-Mobiltelefon mit K-9
> >Mail
> >> >>> gesendet.
> >>
> >>
>
> --
> Diese Nachricht wurde von meinem Android-Mobiltelefon mit K-9 Mail
> gesendet.
>

Re: How to improve clustering?

Posted by Sebastian Briesemeister <se...@unister.de>.
I did change it as you suggested. Now I normalize after the frequency weighting. 

The results from non-fuzzy clustering are similar, but I still require probabilities.

Streaming k-means stuff? I don't get you here. 



Ted Dunning <te...@gmail.com> wrote:

>I think you should change your vector preparation method.
>
>What kind of results do you get from non-fuzzy clustering?
>
>What about from the streaming k-means stuff?
>
>On Wed, Mar 27, 2013 at 5:02 PM, Sebastian Briesemeister <
>sebastian.briesemeister@unister-gmbh.de> wrote:
>
>> Thanks for your input.
>>
>> The problem wasn't the high dimensional space itself but the cluster
>> initialization. I validated the document cosine distance and they
>look
>> fairly well distributed.
>>
>> I now use canopy in a pre-clustering step. Interestingly, canopy
>> suggests to use a large number of clusters, which might makes sense
>> since the a lot of documents are unrelated due to their sparse word
>> vector. If I reduce the number of clusters, a lot documents remain
>> unclustered in the center of the cluster space.
>> Further I would like to note that the random cluster initializations
>> tends to choose initial centers that are close to each other. For
>some
>> reasons this leads to overlapping or even identical clusters.
>>
>> The problem of parameter tuning (T1 and T2) for canopy remains.
>However,
>> I assume their is no general strategy on this problem.
>>
>> Cheers
>> Sebastian
>>
>> Am 27.03.2013 06:43, schrieb Dan Filimon:
>> > Ah, so Ted, it looks like there's a bug with the mapreduce after
>all
>> then.
>> >
>> > Pity, I liked the higher dimensionality argument but thinking it
>> through, it doesn't make that much sense.
>> >
>> > On Mar 27, 2013, at 6:52, Ted Dunning <te...@gmail.com>
>wrote:
>> >
>> >> Reducing to a lower dimensional space is a convenience, no more.
>> >>
>> >> Clustering in the original space is fine.  I still have trouble
>with
>> your
>> >> normalizing before weighting, but I don't know what effect that
>will
>> have
>> >> on anything.  It certainly will interfere with the interpretation
>of the
>> >> cosine metrics.
>> >>
>> >> On Tue, Mar 26, 2013 at 6:18 PM, Sebastian Briesemeister <
>> >> sebastian.briesemeister@unister.de> wrote:
>> >>
>> >>> I am not quite sure whether this will solve the problem, though I
>will
>> try
>> >>> it of course.
>> >>>
>> >>> I always thought that clustering documents based on their words
>is a
>> >>> common problem and is usually tackled in the word space and not
>in a
>> >>> reduced one.
>> >>> Besides the distances look reasonable. Still I end up with very
>similar
>> >>> and very distant documents unclustered in the middle of all
>clusters.
>> >>>
>> >>> So I think the problem lies in the clustering method not in the
>> distances.
>> >>>
>> >>>
>> >>>
>> >>> Dan Filimon <da...@gmail.com> schrieb:
>> >>>
>> >>>> So you're clustering 90K dimensional data?
>> >>>>
>> >>>> I'm faced with a very similar problem as you (working on
>> >>>> StreamingKMeans for mahout) and from what I read [1], the
>problem
>> >>>> might be that in very high dimensional spaces the distances
>become
>> >>>> meaningless.
>> >>>>
>> >>>> I'm pretty sure this is the case and I was considering
>implementing
>> >>>> the test mentioned in the paper (also I feel like it's a very
>useful
>> >>>> algorithm to have).
>> >>>>
>> >>>> In any case, since the vectors are so sparse, why not reduce
>their
>> >>>> dimension?
>> >>>>
>> >>>> You can try principal component analysis (just getting the first
>k
>> >>>> eigenvectors in the singular value decomposition of the matrix
>that
>> >>>> has your vectors as rows). The class that does this is
>SSVDSolver
>> >>>> (there's also SingularValueDecomposition but that tries making
>dense
>> >>>> matrices and those might not fit into memory. I've never
>personally
>> >>>> used it though.
>> >>>> Once you have the first k eigenvectors of size n, make them rows
>in a
>> >>>> matrix (U) and multiply each vector x you have with it (U x)
>getting a
>> >>>> reduced vector.
>> >>>>
>> >>>> Or, use random projections to reduce the size of the data set.
>You
>> >>>> want to create a matrix whose entries are sampled from a uniform
>> >>>> distribution (0, 1) (Functions.random in
>> >>>> o.a.m.math.function.Functions), normalize its rows and multiply
>each
>> >>>> vector x with it.
>> >>>>
>> >>>> So, reduce the size of your vectors thereby making the
>dimensionality
>> >>>> less of a problem and you'll get a decent approximation (you can
>> >>>> actually quantify how good it is with SVD). From what I've seen,
>the
>> >>>> clusters separate at smaller dimensions but there's the question
>of
>> >>>> how good an approximation of the uncompressed data you have.
>> >>>>
>> >>>> See if this helps, I need to do the same thing :)
>> >>>>
>> >>>> What do you think?
>> >>>>
>> >>>> [1] http://www.cs.bham.ac.uk/~axk/dcfin2.pdf
>> >>>>
>> >>>> On Tue, Mar 26, 2013 at 6:44 PM, Sebastian Briesemeister
>> >>>> <se...@unister-gmbh.de> wrote:
>> >>>>> The dataset consists of about 4000 documents and is encoded by
>90.000
>> >>>>> words. However, each document contains usually only about 10 to
>20
>> >>>>> words. Only some contain more than 1000 words.
>> >>>>>
>> >>>>> For each document, I set a field in the corresponding vector to
>1 if
>> >>>> it
>> >>>>> contains a word. Then I normalize each vector using the
>L2-norm.
>> >>>>> Finally I multiply each element (representing a word) in the
>vector
>> >>>> by
>> >>>>> log(#documents/#documents_with_word).
>> >>>>>
>> >>>>> For clustering, I am using cosine similarity.
>> >>>>>
>> >>>>> Regards
>> >>>>> Sebastian
>> >>>>>
>> >>>>> Am 26.03.2013 17:33, schrieb Dan Filimon:
>> >>>>>> Hi,
>> >>>>>>
>> >>>>>> Could you tell us more about the kind of data you're
>clustering?
>> >>>> What
>> >>>>>> distance measure you're using and what the dimensionality of
>the
>> >>>> data
>> >>>>>> is?
>> >>>>>>
>> >>>>>> On Tue, Mar 26, 2013 at 6:21 PM, Sebastian Briesemeister
>> >>>>>> <se...@unister-gmbh.de> wrote:
>> >>>>>>> Dear Mahout-users,
>> >>>>>>>
>> >>>>>>> I am facing two problems when I am clustering instances with
>Fuzzy
>> >>>> c
>> >>>>>>> Means clustering (cosine distance, random initial
>clustering):
>> >>>>>>>
>> >>>>>>> 1.) I always end up with one large set of rubbish instances.
>All of
>> >>>> them
>> >>>>>>> have uniform cluster probability distribution and are, hence,
>in
>> >>>> the
>> >>>>>>> exact middle of the cluster space.
>> >>>>>>> The cosine distance between instances within this cluster
>reaches
>> >>>> from 0
>> >>>>>>> to 1.
>> >>>>>>>
>> >>>>>>> 2.) Some of my clusters have the same or a very very similar
>> >>>> center.
>> >>>>>>> Besides the above described problems, the clustering seems to
>work
>> >>>> fine.
>> >>>>>>> Has somebody an idea how my clustering can be improved?
>> >>>>>>>
>> >>>>>>> Regards
>> >>>>>>> Sebastian
>> >>> --
>> >>> Diese Nachricht wurde von meinem Android-Mobiltelefon mit K-9
>Mail
>> >>> gesendet.
>>
>>

-- 
This message was sent from my Android mobile phone with K-9 Mail.

Re: How to improve clustering?

Posted by Ted Dunning <te...@gmail.com>.
I think you should change your vector preparation method.

What kind of results do you get from non-fuzzy clustering?

What about from the streaming k-means stuff?

On Wed, Mar 27, 2013 at 5:02 PM, Sebastian Briesemeister <
sebastian.briesemeister@unister-gmbh.de> wrote:

> Thanks for your input.
>
> The problem wasn't the high dimensional space itself but the cluster
> initialization. I validated the document cosine distance and they look
> fairly well distributed.
>
> I now use canopy in a pre-clustering step. Interestingly, canopy
> suggests to use a large number of clusters, which might makes sense
> since the a lot of documents are unrelated due to their sparse word
> vector. If I reduce the number of clusters, a lot documents remain
> unclustered in the center of the cluster space.
> Further I would like to note that the random cluster initializations
> tends to choose initial centers that are close to each other. For some
> reasons this leads to overlapping or even identical clusters.
>
> The problem of parameter tuning (T1 and T2) for canopy remains. However,
> I assume their is no general strategy on this problem.
>
> Cheers
> Sebastian
>
> Am 27.03.2013 06:43, schrieb Dan Filimon:
> > Ah, so Ted, it looks like there's a bug with the mapreduce after all
> then.
> >
> > Pity, I liked the higher dimensionality argument but thinking it
> through, it doesn't make that much sense.
> >
> > On Mar 27, 2013, at 6:52, Ted Dunning <te...@gmail.com> wrote:
> >
> >> Reducing to a lower dimensional space is a convenience, no more.
> >>
> >> Clustering in the original space is fine.  I still have trouble with
> your
> >> normalizing before weighting, but I don't know what effect that will
> have
> >> on anything.  It certainly will interfere with the interpretation of the
> >> cosine metrics.
> >>
> >> On Tue, Mar 26, 2013 at 6:18 PM, Sebastian Briesemeister <
> >> sebastian.briesemeister@unister.de> wrote:
> >>
> >>> I am not quite sure whether this will solve the problem, though I will
> try
> >>> it of course.
> >>>
> >>> I always thought that clustering documents based on their words is a
> >>> common problem and is usually tackled in the word space and not in a
> >>> reduced one.
> >>> Besides the distances look reasonable. Still I end up with very similar
> >>> and very distant documents unclustered in the middle of all clusters.
> >>>
> >>> So I think the problem lies in the clustering method not in the
> distances.
> >>>
> >>>
> >>>
> >>> Dan Filimon <da...@gmail.com> schrieb:
> >>>
> >>>> So you're clustering 90K dimensional data?
> >>>>
> >>>> I'm faced with a very similar problem as you (working on
> >>>> StreamingKMeans for mahout) and from what I read [1], the problem
> >>>> might be that in very high dimensional spaces the distances become
> >>>> meaningless.
> >>>>
> >>>> I'm pretty sure this is the case and I was considering implementing
> >>>> the test mentioned in the paper (also I feel like it's a very useful
> >>>> algorithm to have).
> >>>>
> >>>> In any case, since the vectors are so sparse, why not reduce their
> >>>> dimension?
> >>>>
> >>>> You can try principal component analysis (just getting the first k
> >>>> eigenvectors in the singular value decomposition of the matrix that
> >>>> has your vectors as rows). The class that does this is SSVDSolver
> >>>> (there's also SingularValueDecomposition but that tries making dense
> >>>> matrices and those might not fit into memory. I've never personally
> >>>> used it though.
> >>>> Once you have the first k eigenvectors of size n, make them rows in a
> >>>> matrix (U) and multiply each vector x you have with it (U x) getting a
> >>>> reduced vector.
> >>>>
> >>>> Or, use random projections to reduce the size of the data set. You
> >>>> want to create a matrix whose entries are sampled from a uniform
> >>>> distribution (0, 1) (Functions.random in
> >>>> o.a.m.math.function.Functions), normalize its rows and multiply each
> >>>> vector x with it.
> >>>>
> >>>> So, reduce the size of your vectors thereby making the dimensionality
> >>>> less of a problem and you'll get a decent approximation (you can
> >>>> actually quantify how good it is with SVD). From what I've seen, the
> >>>> clusters separate at smaller dimensions but there's the question of
> >>>> how good an approximation of the uncompressed data you have.
> >>>>
> >>>> See if this helps, I need to do the same thing :)
> >>>>
> >>>> What do you think?
> >>>>
> >>>> [1] http://www.cs.bham.ac.uk/~axk/dcfin2.pdf
> >>>>
> >>>> On Tue, Mar 26, 2013 at 6:44 PM, Sebastian Briesemeister
> >>>> <se...@unister-gmbh.de> wrote:
> >>>>> The dataset consists of about 4000 documents and is encoded by 90.000
> >>>>> words. However, each document contains usually only about 10 to 20
> >>>>> words. Only some contain more than 1000 words.
> >>>>>
> >>>>> For each document, I set a field in the corresponding vector to 1 if
> >>>> it
> >>>>> contains a word. Then I normalize each vector using the L2-norm.
> >>>>> Finally I multiply each element (representing a word) in the vector
> >>>> by
> >>>>> log(#documents/#documents_with_word).
> >>>>>
> >>>>> For clustering, I am using cosine similarity.
> >>>>>
> >>>>> Regards
> >>>>> Sebastian
> >>>>>
> >>>>> Am 26.03.2013 17:33, schrieb Dan Filimon:
> >>>>>> Hi,
> >>>>>>
> >>>>>> Could you tell us more about the kind of data you're clustering?
> >>>> What
> >>>>>> distance measure you're using and what the dimensionality of the
> >>>> data
> >>>>>> is?
> >>>>>>
> >>>>>> On Tue, Mar 26, 2013 at 6:21 PM, Sebastian Briesemeister
> >>>>>> <se...@unister-gmbh.de> wrote:
> >>>>>>> Dear Mahout-users,
> >>>>>>>
> >>>>>>> I am facing two problems when I am clustering instances with Fuzzy
> >>>> c
> >>>>>>> Means clustering (cosine distance, random initial clustering):
> >>>>>>>
> >>>>>>> 1.) I always end up with one large set of rubbish instances. All of
> >>>> them
> >>>>>>> have uniform cluster probability distribution and are, hence, in
> >>>> the
> >>>>>>> exact middle of the cluster space.
> >>>>>>> The cosine distance between instances within this cluster reaches
> >>>> from 0
> >>>>>>> to 1.
> >>>>>>>
> >>>>>>> 2.) Some of my clusters have the same or a very very similar
> >>>> center.
> >>>>>>> Besides the above described problems, the clustering seems to work
> >>>> fine.
> >>>>>>> Has somebody an idea how my clustering can be improved?
> >>>>>>>
> >>>>>>> Regards
> >>>>>>> Sebastian
> >>> --
> >>> Diese Nachricht wurde von meinem Android-Mobiltelefon mit K-9 Mail
> >>> gesendet.
>
>

Re: How to improve clustering?

Posted by Sebastian Briesemeister <se...@unister-gmbh.de>.
Thanks for your input.

The problem wasn't the high-dimensional space itself but the cluster
initialization. I validated the document cosine distances and they look
fairly well distributed.

I now use canopy in a pre-clustering step. Interestingly, canopy
suggests using a large number of clusters, which might make sense
since a lot of the documents are unrelated due to their sparse word
vectors. If I reduce the number of clusters, a lot of documents remain
unclustered in the center of the cluster space.
Further, I would like to note that the random cluster initialization
tends to choose initial centers that are close to each other. For some
reason this leads to overlapping or even identical clusters.

The problem of parameter tuning (T1 and T2) for canopy remains. However,
I assume there is no general strategy for this problem.

Cheers
Sebastian
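
As an aside, the role of T1 and T2 is easiest to see in a tiny in-memory
version of the canopy pass (a toy sketch over 1-D points with |a - b| as
the distance, not Mahout's CanopyDriver): with T1 > T2, a point within T1
of a canopy center joins the canopy, and a point within T2 additionally
stops being a candidate center, so raising T2 yields fewer canopies and
thus fewer suggested clusters.

import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class CanopySketch {

  static List<List<Double>> canopies(List<Double> points, double t1, double t2) {
    LinkedList<Double> candidates = new LinkedList<>(points);
    List<List<Double>> result = new ArrayList<>();
    while (!candidates.isEmpty()) {
      double center = candidates.removeFirst();
      List<Double> canopy = new ArrayList<>();
      canopy.add(center);
      candidates.removeIf(p -> {
        double d = Math.abs(p - center);
        if (d < t1) canopy.add(p);   // loosely close: joins the canopy
        return d < t2;               // tightly close: no longer a candidate center
      });
      result.add(canopy);
    }
    return result;
  }

  public static void main(String[] args) {
    List<Double> points = List.of(0.1, 0.15, 0.2, 0.9, 0.95, 2.0);
    System.out.println(canopies(points, 0.5, 0.3).size());   // larger T2: fewer canopies
    System.out.println(canopies(points, 0.2, 0.05).size());  // smaller T2: more canopies
  }
}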

Am 27.03.2013 06:43, schrieb Dan Filimon:
> Ah, so Ted, it looks like there's a bug with the mapreduce after all then.
>
> Pity, I liked the higher dimensionality argument but thinking it through, it doesn't make that much sense.
>
> On Mar 27, 2013, at 6:52, Ted Dunning <te...@gmail.com> wrote:
>
>> Reducing to a lower dimensional space is a convenience, no more.
>>
>> Clustering in the original space is fine.  I still have trouble with your
>> normalizing before weighting, but I don't know what effect that will have
>> on anything.  It certainly will interfere with the interpretation of the
>> cosine metrics.
>>
>> On Tue, Mar 26, 2013 at 6:18 PM, Sebastian Briesemeister <
>> sebastian.briesemeister@unister.de> wrote:
>>
>>> I am not quite sure whether this will solve the problem, though I will try
>>> it of course.
>>>
>>> I always thought that clustering documents based on their words is a
>>> common problem and is usually tackled in the word space and not in a
>>> reduced one.
>>> Besides the distances look reasonable. Still I end up with very similar
>>> and very distant documents unclustered in the middle of all clusters.
>>>
>>> So I think the problem lies in the clustering method not in the distances.
>>>
>>>
>>>
>>> Dan Filimon <da...@gmail.com> schrieb:
>>>
>>>> So you're clustering 90K dimensional data?
>>>>
>>>> I'm faced with a very similar problem as you (working on
>>>> StreamingKMeans for mahout) and from what I read [1], the problem
>>>> might be that in very high dimensional spaces the distances become
>>>> meaningless.
>>>>
>>>> I'm pretty sure this is the case and I was considering implementing
>>>> the test mentioned in the paper (also I feel like it's a very useful
>>>> algorithm to have).
>>>>
>>>> In any case, since the vectors are so sparse, why not reduce their
>>>> dimension?
>>>>
>>>> You can try principal component analysis (just getting the first k
>>>> eigenvectors in the singular value decomposition of the matrix that
>>>> has your vectors as rows). The class that does this is SSVDSolver
>>>> (there's also SingularValueDecomposition but that tries making dense
>>>> matrices and those might not fit into memory. I've never personally
>>>> used it though.
>>>> Once you have the first k eigenvectors of size n, make them rows in a
>>>> matrix (U) and multiply each vector x you have with it (U x) getting a
>>>> reduced vector.
>>>>
>>>> Or, use random projections to reduce the size of the data set. You
>>>> want to create a matrix whose entries are sampled from a uniform
>>>> distribution (0, 1) (Functions.random in
>>>> o.a.m.math.function.Functions), normalize its rows and multiply each
>>>> vector x with it.
>>>>
>>>> So, reduce the size of your vectors thereby making the dimensionality
>>>> less of a problem and you'll get a decent approximation (you can
>>>> actually quantify how good it is with SVD). From what I've seen, the
>>>> clusters separate at smaller dimensions but there's the question of
>>>> how good an approximation of the uncompressed data you have.
>>>>
>>>> See if this helps, I need to do the same thing :)
>>>>
>>>> What do you think?
>>>>
>>>> [1] http://www.cs.bham.ac.uk/~axk/dcfin2.pdf
>>>>
>>>> On Tue, Mar 26, 2013 at 6:44 PM, Sebastian Briesemeister
>>>> <se...@unister-gmbh.de> wrote:
>>>>> The dataset consists of about 4000 documents and is encoded by 90.000
>>>>> words. However, each document contains usually only about 10 to 20
>>>>> words. Only some contain more than 1000 words.
>>>>>
>>>>> For each document, I set a field in the corresponding vector to 1 if
>>>> it
>>>>> contains a word. Then I normalize each vector using the L2-norm.
>>>>> Finally I multiply each element (representing a word) in the vector
>>>> by
>>>>> log(#documents/#documents_with_word).
>>>>>
>>>>> For clustering, I am using cosine similarity.
>>>>>
>>>>> Regards
>>>>> Sebastian
>>>>>
>>>>> Am 26.03.2013 17:33, schrieb Dan Filimon:
>>>>>> Hi,
>>>>>>
>>>>>> Could you tell us more about the kind of data you're clustering?
>>>> What
>>>>>> distance measure you're using and what the dimensionality of the
>>>> data
>>>>>> is?
>>>>>>
>>>>>> On Tue, Mar 26, 2013 at 6:21 PM, Sebastian Briesemeister
>>>>>> <se...@unister-gmbh.de> wrote:
>>>>>>> Dear Mahout-users,
>>>>>>>
>>>>>>> I am facing two problems when I am clustering instances with Fuzzy
>>>> c
>>>>>>> Means clustering (cosine distance, random initial clustering):
>>>>>>>
>>>>>>> 1.) I always end up with one large set of rubbish instances. All of
>>>> them
>>>>>>> have uniform cluster probability distribution and are, hence, in
>>>> the
>>>>>>> exact middle of the cluster space.
>>>>>>> The cosine distance between instances within this cluster reaches
>>>> from 0
>>>>>>> to 1.
>>>>>>>
>>>>>>> 2.) Some of my clusters have the same or a very very similar
>>>> center.
>>>>>>> Besides the above described problems, the clustering seems to work
>>>> fine.
>>>>>>> Has somebody an idea how my clustering can be improved?
>>>>>>>
>>>>>>> Regards
>>>>>>> Sebastian
>>> --
>>> Diese Nachricht wurde von meinem Android-Mobiltelefon mit K-9 Mail
>>> gesendet.


Re: How to improve clustering?

Posted by Ted Dunning <te...@gmail.com>.
?!

Can you say more?

On Wed, Mar 27, 2013 at 6:43 AM, Dan Filimon <da...@gmail.com> wrote:

> Ah, so Ted, it looks like there's a bug with the mapreduce after all then.
>
> Pity, I liked the higher dimensionality argument but thinking it through,
> it doesn't make that much sense.
>
> On Mar 27, 2013, at 6:52, Ted Dunning <te...@gmail.com> wrote:
>
> > Reducing to a lower dimensional space is a convenience, no more.
> >
> > Clustering in the original space is fine.  I still have trouble with your
> > normalizing before weighting, but I don't know what effect that will have
> > on anything.  It certainly will interfere with the interpretation of the
> > cosine metrics.
> >
> > On Tue, Mar 26, 2013 at 6:18 PM, Sebastian Briesemeister <
> > sebastian.briesemeister@unister.de> wrote:
> >
> >> I am not quite sure whether this will solve the problem, though I will
> try
> >> it of course.
> >>
> >> I always thought that clustering documents based on their words is a
> >> common problem and is usually tackled in the word space and not in a
> >> reduced one.
> >> Besides the distances look reasonable. Still I end up with very similar
> >> and very distant documents unclustered in the middle of all clusters.
> >>
> >> So I think the problem lies in the clustering method not in the
> distances.
> >>
> >>
> >>
> >> Dan Filimon <da...@gmail.com> schrieb:
> >>
> >>> So you're clustering 90K dimensional data?
> >>>
> >>> I'm faced with a very similar problem as you (working on
> >>> StreamingKMeans for mahout) and from what I read [1], the problem
> >>> might be that in very high dimensional spaces the distances become
> >>> meaningless.
> >>>
> >>> I'm pretty sure this is the case and I was considering implementing
> >>> the test mentioned in the paper (also I feel like it's a very useful
> >>> algorithm to have).
> >>>
> >>> In any case, since the vectors are so sparse, why not reduce their
> >>> dimension?
> >>>
> >>> You can try principal component analysis (just getting the first k
> >>> eigenvectors in the singular value decomposition of the matrix that
> >>> has your vectors as rows). The class that does this is SSVDSolver
> >>> (there's also SingularValueDecomposition but that tries making dense
> >>> matrices and those might not fit into memory. I've never personally
> >>> used it though.
> >>> Once you have the first k eigenvectors of size n, make them rows in a
> >>> matrix (U) and multiply each vector x you have with it (U x) getting a
> >>> reduced vector.
> >>>
> >>> Or, use random projections to reduce the size of the data set. You
> >>> want to create a matrix whose entries are sampled from a uniform
> >>> distribution (0, 1) (Functions.random in
> >>> o.a.m.math.function.Functions), normalize its rows and multiply each
> >>> vector x with it.
> >>>
> >>> So, reduce the size of your vectors thereby making the dimensionality
> >>> less of a problem and you'll get a decent approximation (you can
> >>> actually quantify how good it is with SVD). From what I've seen, the
> >>> clusters separate at smaller dimensions but there's the question of
> >>> how good an approximation of the uncompressed data you have.
> >>>
> >>> See if this helps, I need to do the same thing :)
> >>>
> >>> What do you think?
> >>>
> >>> [1] http://www.cs.bham.ac.uk/~axk/dcfin2.pdf
> >>>
> >>> On Tue, Mar 26, 2013 at 6:44 PM, Sebastian Briesemeister
> >>> <se...@unister-gmbh.de> wrote:
> >>>> The dataset consists of about 4000 documents and is encoded by 90.000
> >>>> words. However, each document contains usually only about 10 to 20
> >>>> words. Only some contain more than 1000 words.
> >>>>
> >>>> For each document, I set a field in the corresponding vector to 1 if
> >>> it
> >>>> contains a word. Then I normalize each vector using the L2-norm.
> >>>> Finally I multiply each element (representing a word) in the vector
> >>> by
> >>>> log(#documents/#documents_with_word).
> >>>>
> >>>> For clustering, I am using cosine similarity.
> >>>>
> >>>> Regards
> >>>> Sebastian
> >>>>
> >>>> Am 26.03.2013 17:33, schrieb Dan Filimon:
> >>>>> Hi,
> >>>>>
> >>>>> Could you tell us more about the kind of data you're clustering?
> >>> What
> >>>>> distance measure you're using and what the dimensionality of the
> >>> data
> >>>>> is?
> >>>>>
> >>>>> On Tue, Mar 26, 2013 at 6:21 PM, Sebastian Briesemeister
> >>>>> <se...@unister-gmbh.de> wrote:
> >>>>>> Dear Mahout-users,
> >>>>>>
> >>>>>> I am facing two problems when I am clustering instances with Fuzzy
> >>> c
> >>>>>> Means clustering (cosine distance, random initial clustering):
> >>>>>>
> >>>>>> 1.) I always end up with one large set of rubbish instances. All of
> >>> them
> >>>>>> have uniform cluster probability distribution and are, hence, in
> >>> the
> >>>>>> exact middle of the cluster space.
> >>>>>> The cosine distance between instances within this cluster reaches
> >>> from 0
> >>>>>> to 1.
> >>>>>>
> >>>>>> 2.) Some of my clusters have the same or a very very similar
> >>> center.
> >>>>>>
> >>>>>> Besides the above described problems, the clustering seems to work
> >>> fine.
> >>>>>>
> >>>>>> Has somebody an idea how my clustering can be improved?
> >>>>>>
> >>>>>> Regards
> >>>>>> Sebastian
> >>
> >> --
> >> Diese Nachricht wurde von meinem Android-Mobiltelefon mit K-9 Mail
> >> gesendet.
>

Re: How to improve clustering?

Posted by Dan Filimon <da...@gmail.com>.
Ah, so Ted, it looks like there's a bug with the MapReduce version after all, then.

Pity, I liked the higher dimensionality argument but thinking it through, it doesn't make that much sense.

On Mar 27, 2013, at 6:52, Ted Dunning <te...@gmail.com> wrote:

> Reducing to a lower dimensional space is a convenience, no more.
> 
> Clustering in the original space is fine.  I still have trouble with your
> normalizing before weighting, but I don't know what effect that will have
> on anything.  It certainly will interfere with the interpretation of the
> cosine metrics.
> 
> On Tue, Mar 26, 2013 at 6:18 PM, Sebastian Briesemeister <
> sebastian.briesemeister@unister.de> wrote:
> 
>> I am not quite sure whether this will solve the problem, though I will try
>> it of course.
>> 
>> I always thought that clustering documents based on their words is a
>> common problem and is usually tackled in the word space and not in a
>> reduced one.
>> Besides the distances look reasonable. Still I end up with very similar
>> and very distant documents unclustered in the middle of all clusters.
>> 
>> So I think the problem lies in the clustering method not in the distances.
>> 
>> 
>> 
>> Dan Filimon <da...@gmail.com> schrieb:
>> 
>>> So you're clustering 90K dimensional data?
>>> 
>>> I'm faced with a very similar problem as you (working on
>>> StreamingKMeans for mahout) and from what I read [1], the problem
>>> might be that in very high dimensional spaces the distances become
>>> meaningless.
>>> 
>>> I'm pretty sure this is the case and I was considering implementing
>>> the test mentioned in the paper (also I feel like it's a very useful
>>> algorithm to have).
>>> 
>>> In any case, since the vectors are so sparse, why not reduce their
>>> dimension?
>>> 
>>> You can try principal component analysis (just getting the first k
>>> eigenvectors in the singular value decomposition of the matrix that
>>> has your vectors as rows). The class that does this is SSVDSolver
>>> (there's also SingularValueDecomposition but that tries making dense
>>> matrices and those might not fit into memory. I've never personally
>>> used it though.
>>> Once you have the first k eigenvectors of size n, make them rows in a
>>> matrix (U) and multiply each vector x you have with it (U x) getting a
>>> reduced vector.
>>> 
>>> Or, use random projections to reduce the size of the data set. You
>>> want to create a matrix whose entries are sampled from a uniform
>>> distribution (0, 1) (Functions.random in
>>> o.a.m.math.function.Functions), normalize its rows and multiply each
>>> vector x with it.
>>> 
>>> So, reduce the size of your vectors thereby making the dimensionality
>>> less of a problem and you'll get a decent approximation (you can
>>> actually quantify how good it is with SVD). From what I've seen, the
>>> clusters separate at smaller dimensions but there's the question of
>>> how good an approximation of the uncompressed data you have.
>>> 
>>> See if this helps, I need to do the same thing :)
>>> 
>>> What do you think?
>>> 
>>> [1] http://www.cs.bham.ac.uk/~axk/dcfin2.pdf
>>> 
>>> On Tue, Mar 26, 2013 at 6:44 PM, Sebastian Briesemeister
>>> <se...@unister-gmbh.de> wrote:
>>>> The dataset consists of about 4000 documents and is encoded by 90.000
>>>> words. However, each document contains usually only about 10 to 20
>>>> words. Only some contain more than 1000 words.
>>>> 
>>>> For each document, I set a field in the corresponding vector to 1 if
>>> it
>>>> contains a word. Then I normalize each vector using the L2-norm.
>>>> Finally I multiply each element (representing a word) in the vector
>>> by
>>>> log(#documents/#documents_with_word).
>>>> 
>>>> For clustering, I am using cosine similarity.
>>>> 
>>>> Regards
>>>> Sebastian
>>>> 
>>>> Am 26.03.2013 17:33, schrieb Dan Filimon:
>>>>> Hi,
>>>>> 
>>>>> Could you tell us more about the kind of data you're clustering?
>>> What
>>>>> distance measure you're using and what the dimensionality of the
>>> data
>>>>> is?
>>>>> 
>>>>> On Tue, Mar 26, 2013 at 6:21 PM, Sebastian Briesemeister
>>>>> <se...@unister-gmbh.de> wrote:
>>>>>> Dear Mahout-users,
>>>>>> 
>>>>>> I am facing two problems when I am clustering instances with Fuzzy
>>> c
>>>>>> Means clustering (cosine distance, random initial clustering):
>>>>>> 
>>>>>> 1.) I always end up with one large set of rubbish instances. All of
>>> them
>>>>>> have uniform cluster probability distribution and are, hence, in
>>> the
>>>>>> exact middle of the cluster space.
>>>>>> The cosine distance between instances within this cluster reaches
>>> from 0
>>>>>> to 1.
>>>>>> 
>>>>>> 2.) Some of my clusters have the same or a very very similar
>>> center.
>>>>>> 
>>>>>> Besides the above described problems, the clustering seems to work
>>> fine.
>>>>>> 
>>>>>> Has somebody an idea how my clustering can be improved?
>>>>>> 
>>>>>> Regards
>>>>>> Sebastian
>> 
>> --
>> Diese Nachricht wurde von meinem Android-Mobiltelefon mit K-9 Mail
>> gesendet.

Re: How to improve clustering?

Posted by Ted Dunning <te...@gmail.com>.
Reducing to a lower dimensional space is a convenience, no more.

Clustering in the original space is fine.  I still have trouble with your
normalizing before weighting, but I don't know what effect that will have
on anything.  It certainly will interfere with the interpretation of the
cosine metrics.
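
If you do want to try the reduction anyway, here is a minimal sketch of
the random-projection variant Dan describes below, using Mahout's math
classes (the reduced dimension is an arbitrary placeholder, and the exact
API may differ between Mahout versions):

import org.apache.mahout.math.DenseMatrix;
import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.function.Functions;

public class RandomProjectionSketch {

  // Build a reducedDim x originalDim matrix with uniform random entries
  // and normalize its rows.
  static Matrix buildProjection(int reducedDim, int originalDim) {
    Matrix projection = new DenseMatrix(reducedDim, originalDim);
    projection.assign(Functions.random());   // uniform (0, 1) entries
    for (int row = 0; row < reducedDim; row++) {
      projection.assignRow(row, projection.viewRow(row).normalize());
    }
    return projection;
  }

  // Map a document vector x through the projection: U x is the reduced vector.
  static Vector reduce(Matrix projection, Vector doc) {
    return projection.times(doc);
  }
}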

On Tue, Mar 26, 2013 at 6:18 PM, Sebastian Briesemeister <
sebastian.briesemeister@unister.de> wrote:

> I am not quite sure whether this will solve the problem, though I will try
> it of course.
>
> I always thought that clustering documents based on their words is a
> common problem and is usually tackled in the word space and not in a
> reduced one.
> Besides the distances look reasonable. Still I end up with very similar
> and very distant documents unclustered in the middle of all clusters.
>
> So I think the problem lies in the clustering method not in the distances.
>
>
>
> Dan Filimon <da...@gmail.com> schrieb:
>
> >So you're clustering 90K dimensional data?
> >
> >I'm faced with a very similar problem as you (working on
> >StreamingKMeans for mahout) and from what I read [1], the problem
> >might be that in very high dimensional spaces the distances become
> >meaningless.
> >
> >I'm pretty sure this is the case and I was considering implementing
> >the test mentioned in the paper (also I feel like it's a very useful
> >algorithm to have).
> >
> >In any case, since the vectors are so sparse, why not reduce their
> >dimension?
> >
> >You can try principal component analysis (just getting the first k
> >eigenvectors in the singular value decomposition of the matrix that
> >has your vectors as rows). The class that does this is SSVDSolver
> >(there's also SingularValueDecomposition, but that tries to build
> >dense matrices, which might not fit into memory; I've never
> >personally used it, though).
> >Once you have the first k eigenvectors of size n, make them rows in a
> >matrix (U) and multiply each vector x you have with it (U x) getting a
> >reduced vector.
> >
> >Or, use random projections to reduce the size of the data set. You
> >want to create a matrix whose entries are sampled from a uniform
> >distribution (0, 1) (Functions.random in
> >o.a.m.math.function.Functions), normalize its rows and multiply each
> >vector x with it.
> >
> >So, reduce the size of your vectors thereby making the dimensionality
> >less of a problem and you'll get a decent approximation (you can
> >actually quantify how good it is with SVD). From what I've seen, the
> >clusters separate at smaller dimensions but there's the question of
> >how good an approximation of the uncompressed data you have.
> >
> >See if this helps, I need to do the same thing :)
> >
> >What do you think?
> >
> >[1] http://www.cs.bham.ac.uk/~axk/dcfin2.pdf
> >
> >On Tue, Mar 26, 2013 at 6:44 PM, Sebastian Briesemeister
> ><se...@unister-gmbh.de> wrote:
> >> The dataset consists of about 4000 documents and is encoded by 90,000
> >> words. However, each document contains usually only about 10 to 20
> >> words. Only some contain more than 1000 words.
> >>
> >> For each document, I set a field in the corresponding vector to 1 if
> >it
> >> contains a word. Then I normalize each vector using the L2-norm.
> >> Finally I multiply each element (representing a word) in the vector
> >by
> >> log(#documents/#documents_with_word).
> >>
> >> For clustering, I am using cosine similarity.
> >>
> >> Regards
> >> Sebastian
> >>
> >> Am 26.03.2013 17:33, schrieb Dan Filimon:
> >>> Hi,
> >>>
> >>> Could you tell us more about the kind of data you're clustering?
> >What
> >>> distance measure you're using and what the dimensionality of the
> >data
> >>> is?
> >>>
> >>> On Tue, Mar 26, 2013 at 6:21 PM, Sebastian Briesemeister
> >>> <se...@unister-gmbh.de> wrote:
> >>>> Dear Mahout-users,
> >>>>
> >>>> I am facing two problems when I am clustering instances with Fuzzy
> >c
> >>>> Means clustering (cosine distance, random initial clustering):
> >>>>
> >>>> 1.) I always end up with one large set of rubbish instances. All of
> >them
> >>>> have uniform cluster probability distribution and are, hence, in
> >the
> >>>> exact middle of the cluster space.
> >>>> The cosine distance between instances within this cluster reaches
> >from 0
> >>>> to 1.
> >>>>
> >>>> 2.) Some of my clusters have the same or a very very similar
> >center.
> >>>>
> >>>> Besides the above described problems, the clustering seems to work
> >fine.
> >>>>
> >>>> Has somebody an idea how my clustering can be improved?
> >>>>
> >>>> Regards
> >>>> Sebastian
> >>
>
> --
> This message was sent from my Android mobile phone with K-9 Mail.

Re: How to improve clustering?

Posted by Dan Filimon <da...@gmail.com>.
The main result of the paper I linked to is that in high-dimensional
spaces, distances tend not to mean very much: max_dist < (1 + eps) *
min_dist holds with pretty high probability *if* certain conditions on
the relative variance of the points are fulfilled.
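
If you want to sanity-check this on your own data, a rough,
self-contained sketch of that max/min comparison could look like the
following (synthetic sparse vectors and toy sizes, purely for
illustration):

import java.util.Random;

// Rough, synthetic illustration of the max/min pairwise-distance check.
public class DistanceSpread {

  static double cosineDistance(double[] a, double[] b) {
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      na += a[i] * a[i];
      nb += b[i] * b[i];
    }
    return 1.0 - dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-12);
  }

  public static void main(String[] args) {
    Random rnd = new Random(42);
    int n = 100, dim = 10000, termsPerDoc = 15;
    double[][] docs = new double[n][dim];
    for (double[] d : docs) {
      for (int j = 0; j < termsPerDoc; j++) {
        d[rnd.nextInt(dim)] = 1;  // up to termsPerDoc random terms per document
      }
    }

    double min = Double.MAX_VALUE, max = 0;
    for (int i = 0; i < n; i++) {
      for (int j = i + 1; j < n; j++) {
        double dist = cosineDistance(docs[i], docs[j]);
        if (dist < min) min = dist;
        if (dist > max) max = dist;
      }
    }
    // If max_dist < (1 + eps) * min_dist for a small eps, the distances
    // barely discriminate between "near" and "far" pairs.
    System.out.println("min=" + min + "  max=" + max + "  max/min=" + max / min);
  }
}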

I don't really know what else to say about it and haven't used
FuzzyKMeans and am not familiar with its failure cases.

Let us know what fixes it. Good luck!

On Tue, Mar 26, 2013 at 7:18 PM, Sebastian Briesemeister
<se...@unister.de> wrote:
> I am not quite sure whether this will solve the problem, though I will try it of course.
>
> I always thought that clustering documents based on their words is a common problem and is usually tackled in the word space and not in a reduced one.
> Besides the distances look reasonable. Still I end up with very similar and very distant documents unclustered in the middle of all clusters.
>
> So I think the problem lies in the clustering method not in the distances.
>
>
>
> Dan Filimon <da...@gmail.com> schrieb:
>
>>So you're clustering 90K dimensional data?
>>
>>I'm faced with a very similar problem as you (working on
>>StreamingKMeans for mahout) and from what I read [1], the problem
>>might be that in very high dimensional spaces the distances become
>>meaningless.
>>
>>I'm pretty sure this is the case and I was considering implementing
>>the test mentioned in the paper (also I feel like it's a very useful
>>algorithm to have).
>>
>>In any case, since the vectors are so sparse, why not reduce their
>>dimension?
>>
>>You can try principal component analysis (just getting the first k
>>eigenvectors in the singular value decomposition of the matrix that
>>has your vectors as rows). The class that does this is SSVDSolver
>>(there's also SingularValueDecomposition, but that tries to build
>>dense matrices, which might not fit into memory; I've never
>>personally used it, though).
>>Once you have the first k eigenvectors of size n, make them rows in a
>>matrix (U) and multiply each vector x you have with it (U x) getting a
>>reduced vector.
>>
>>Or, use random projections to reduce the size of the data set. You
>>want to create a matrix whose entries are sampled from a uniform
>>distribution (0, 1) (Functions.random in
>>o.a.m.math.function.Functions), normalize its rows and multiply each
>>vector x with it.
>>
>>So, reduce the size of your vectors thereby making the dimensionality
>>less of a problem and you'll get a decent approximation (you can
>>actually quantify how good it is with SVD). From what I've seen, the
>>clusters separate at smaller dimensions but there's the question of
>>how good an approximation of the uncompressed data you have.
>>
>>See if this helps, I need to do the same thing :)
>>
>>What do you think?
>>
>>[1] http://www.cs.bham.ac.uk/~axk/dcfin2.pdf
>>
>>On Tue, Mar 26, 2013 at 6:44 PM, Sebastian Briesemeister
>><se...@unister-gmbh.de> wrote:
>>> The dataset consists of about 4000 documents and is encoded by 90,000
>>> words. However, each document contains usually only about 10 to 20
>>> words. Only some contain more than 1000 words.
>>>
>>> For each document, I set a field in the corresponding vector to 1 if
>>it
>>> contains a word. Then I normalize each vector using the L2-norm.
>>> Finally I multiply each element (representing a word) in the vector
>>by
>>> log(#documents/#documents_with_word).
>>>
>>> For clustering, I am using cosine similarity.
>>>
>>> Regards
>>> Sebastian
>>>
>>> Am 26.03.2013 17:33, schrieb Dan Filimon:
>>>> Hi,
>>>>
>>>> Could you tell us more about the kind of data you're clustering?
>>What
>>>> distance measure you're using and what the dimensionality of the
>>data
>>>> is?
>>>>
>>>> On Tue, Mar 26, 2013 at 6:21 PM, Sebastian Briesemeister
>>>> <se...@unister-gmbh.de> wrote:
>>>>> Dear Mahout-users,
>>>>>
>>>>> I am facing two problems when I am clustering instances with Fuzzy
>>c
>>>>> Means clustering (cosine distance, random initial clustering):
>>>>>
>>>>> 1.) I always end up with one large set of rubbish instances. All of
>>them
>>>>> have uniform cluster probability distribution and are, hence, in
>>the
>>>>> exact middle of the cluster space.
>>>>> The cosine distance between instances within this cluster reaches
>>from 0
>>>>> to 1.
>>>>>
>>>>> 2.) Some of my clusters have the same or a very very similar
>>center.
>>>>>
>>>>> Besides the above described problems, the clustering seems to work
>>fine.
>>>>>
>>>>> Has somebody an idea how my clustering can be improved?
>>>>>
>>>>> Regards
>>>>> Sebastian
>>>
>
> --
> This message was sent from my Android mobile phone with K-9 Mail.

Re: How to improve clustering?

Posted by Sebastian Briesemeister <se...@unister.de>.
I am not quite sure whether this will solve the problem, though I will try it of course. 

I always thought that clustering documents based on their words is a common problem and is usually tackled in the word space and not in a reduced one. 
Besides, the distances look reasonable. Still, I end up with both very similar and very distant documents unclustered in the middle of all clusters.

So I think the problem lies in the clustering method, not in the distances.



Dan Filimon <da...@gmail.com> schrieb:

>So you're clustering 90K dimensional data?
>
>I'm faced with a very similar problem as you (working on
>StreamingKMeans for mahout) and from what I read [1], the problem
>might be that in very high dimensional spaces the distances become
>meaningless.
>
>I'm pretty sure this is the case and I was considering implementing
>the test mentioned in the paper (also I feel like it's a very useful
>algorithm to have).
>
>In any case, since the vectors are so sparse, why not reduce their
>dimension?
>
>You can try principal component analysis (just getting the first k
>eigenvectors in the singular value decomposition of the matrix that
>has your vectors as rows). The class that does this is SSVDSolver
>(there's also SingularValueDecomposition, but that tries to build
>dense matrices, which might not fit into memory; I've never
>personally used it, though).
>Once you have the first k eigenvectors of size n, make them rows in a
>matrix (U) and multiply each vector x you have with it (U x) getting a
>reduced vector.
>
>Or, use random projections to reduce the size of the data set. You
>want to create a matrix whose entries are sampled from a uniform
>distribution (0, 1) (Functions.random in
>o.a.m.math.function.Functions), normalize its rows and multiply each
>vector x with it.
>
>So, reduce the size of your vectors thereby making the dimensionality
>less of a problem and you'll get a decent approximation (you can
>actually quantify how good it is with SVD). From what I've seen, the
>clusters separate at smaller dimensions but there's the question of
>how good an approximation of the uncompressed data you have.
>
>See if this helps, I need to do the same thing :)
>
>What do you think?
>
>[1] http://www.cs.bham.ac.uk/~axk/dcfin2.pdf
>
>On Tue, Mar 26, 2013 at 6:44 PM, Sebastian Briesemeister
><se...@unister-gmbh.de> wrote:
>> The dataset consists of about 4000 documents and is encoded by 90,000
>> words. However, each document contains usually only about 10 to 20
>> words. Only some contain more than 1000 words.
>>
>> For each document, I set a field in the corresponding vector to 1 if
>it
>> contains a word. Then I normalize each vector using the L2-norm.
>> Finally I multiply each element (representing a word) in the vector
>by
>> log(#documents/#documents_with_word).
>>
>> For clustering, I am using cosine similarity.
>>
>> Regards
>> Sebastian
>>
>> Am 26.03.2013 17:33, schrieb Dan Filimon:
>>> Hi,
>>>
>>> Could you tell us more about the kind of data you're clustering?
>What
>>> distance measure you're using and what the dimensionality of the
>data
>>> is?
>>>
>>> On Tue, Mar 26, 2013 at 6:21 PM, Sebastian Briesemeister
>>> <se...@unister-gmbh.de> wrote:
>>>> Dear Mahout-users,
>>>>
>>>> I am facing two problems when I am clustering instances with Fuzzy
>c
>>>> Means clustering (cosine distance, random initial clustering):
>>>>
>>>> 1.) I always end up with one large set of rubbish instances. All of
>them
>>>> have uniform cluster probability distribution and are, hence, in
>the
>>>> exact middle of the cluster space.
>>>> The cosine distance between instances within this cluster reaches
>from 0
>>>> to 1.
>>>>
>>>> 2.) Some of my clusters have the same or a very very similar
>center.
>>>>
>>>> Besides the above described problems, the clustering seems to work
>fine.
>>>>
>>>> Has somebody an idea how my clustering can be improved?
>>>>
>>>> Regards
>>>> Sebastian
>>

-- 
This message was sent from my Android mobile phone with K-9 Mail.

Re: How to improve clustering?

Posted by Dan Filimon <da...@gmail.com>.
So you're clustering 90K dimensional data?

I'm faced with a very similar problem as you (working on
StreamingKMeans for mahout) and from what I read [1], the problem
might be that in very high dimensional spaces the distances become
meaningless.

I'm pretty sure this is the case and I was considering implementing
the test mentioned in the paper (also I feel like it's a very useful
algorithm to have).

In any case, since the vectors are so sparse, why not reduce their dimension?

You can try principal component analysis (just getting the first k
eigenvectors in the singular value decomposition of the matrix that
has your vectors as rows). The class that does this is SSVDSolver
(there's also SingularValueDecomposition, but that tries to build
dense matrices, which might not fit into memory; I've never
personally used it, though).
Once you have the first k eigenvectors of size n, make them the rows
of a matrix (U) and multiply each vector x you have by it (U x) to get
a reduced vector.
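
Just to make that last step concrete, a toy sketch of the U x
multiplication with plain arrays (how you actually obtain U, e.g. from
SSVDSolver's output, is not shown here):

// Illustrative only: project an n-dimensional vector x down to k dimensions
// given a k x n matrix U whose rows are the top-k singular/eigen vectors.
public class Project {

  static double[] project(double[][] u, double[] x) {
    double[] reduced = new double[u.length];
    for (int i = 0; i < u.length; i++) {
      double sum = 0;
      for (int j = 0; j < x.length; j++) {
        sum += u[i][j] * x[j];
      }
      reduced[i] = sum;  // i-th coordinate in the reduced space
    }
    return reduced;
  }

  public static void main(String[] args) {
    double[][] u = {{0.5, 0.5, 0.0}, {0.0, 0.0, 1.0}};  // k = 2, n = 3 (toy values)
    double[] x = {1.0, 2.0, 3.0};
    System.out.println(java.util.Arrays.toString(project(u, x)));  // [1.5, 3.0]
  }
}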

Or, use random projections to reduce the size of the data set. You
want to create a matrix whose entries are sampled from a uniform
distribution (0, 1) (Functions.random in
o.a.m.math.function.Functions), normalize its rows, and multiply each
vector x by it.
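
A rough sketch of that with plain arrays rather than the Mahout
Matrix/Functions.random API (toy sizes, for illustration only); once
the matrix is built, you apply it to each document with the same
matrix-vector multiply as above:

import java.util.Random;

// Illustrative random projection: k x n matrix with uniform [0, 1) entries,
// each row normalized to unit length.
public class RandomProjection {

  static double[][] randomProjectionMatrix(int k, int n, long seed) {
    Random rnd = new Random(seed);
    double[][] r = new double[k][n];
    for (int i = 0; i < k; i++) {
      double norm = 0;
      for (int j = 0; j < n; j++) {
        r[i][j] = rnd.nextDouble();  // uniform in [0, 1)
        norm += r[i][j] * r[i][j];
      }
      norm = Math.sqrt(norm);
      for (int j = 0; j < n; j++) {
        r[i][j] /= norm;  // L2-normalize the row
      }
    }
    return r;
  }

  public static void main(String[] args) {
    double[][] r = randomProjectionMatrix(50, 10000, 42);  // toy sizes
    System.out.println("rows=" + r.length + " cols=" + r[0].length);
  }
}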

So, reduce the size of your vectors, thereby making the dimensionality
less of a problem, and you'll get a decent approximation (you can
actually quantify how good it is with SVD). From what I've seen, the
clusters separate at smaller dimensions, but there's the question of
how good an approximation of the uncompressed data you have.

See if this helps, I need to do the same thing :)

What do you think?

[1] http://www.cs.bham.ac.uk/~axk/dcfin2.pdf

On Tue, Mar 26, 2013 at 6:44 PM, Sebastian Briesemeister
<se...@unister-gmbh.de> wrote:
> The dataset consists of about 4000 documents and is encoded by 90,000
> words. However, each document contains usually only about 10 to 20
> words. Only some contain more than 1000 words.
>
> For each document, I set a field in the corresponding vector to 1 if it
> contains a word. Then I normalize each vector using the L2-norm.
> Finally I multiply each element (representing a word) in the vector by
> log(#documents/#documents_with_word).
>
> For clustering, I am using cosine similarity.
>
> Regards
> Sebastian
>
> Am 26.03.2013 17:33, schrieb Dan Filimon:
>> Hi,
>>
>> Could you tell us more about the kind of data you're clustering? What
>> distance measure you're using and what the dimensionality of the data
>> is?
>>
>> On Tue, Mar 26, 2013 at 6:21 PM, Sebastian Briesemeister
>> <se...@unister-gmbh.de> wrote:
>>> Dear Mahout-users,
>>>
>>> I am facing two problems when I am clustering instances with Fuzzy c
>>> Means clustering (cosine distance, random initial clustering):
>>>
>>> 1.) I always end up with one large set of rubbish instances. All of them
>>> have uniform cluster probability distribution and are, hence, in the
>>> exact middle of the cluster space.
>>> The cosine distance between instances within this cluster reaches from 0
>>> to 1.
>>>
>>> 2.) Some of my clusters have the same or a very very similar center.
>>>
>>> Besides the above described problems, the clustering seems to work fine.
>>>
>>> Has somebody an idea how my clustering can be improved?
>>>
>>> Regards
>>> Sebastian
>

Re: How to improve clustering?

Posted by Sebastian Briesemeister <se...@unister-gmbh.de>.
The dataset consists of about 4000 documents and is encoded by 90,000
words. However, each document usually contains only about 10 to 20
words. Only some contain more than 1000 words.

For each document, I set the field for a word in the corresponding
vector to 1 if the document contains that word. Then I normalize each
vector using the L2-norm. Finally, I multiply each element
(representing a word) of the vector by
log(#documents/#documents_with_word).
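
Concretely, the document-frequency part of that weighting looks
roughly like the sketch below (simplified plain Java with made-up
words, not my actual pipeline code):

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified sketch of the IDF factor log(#documents / #documents_with_word).
public class IdfSketch {

  public static void main(String[] args) {
    List<Set<String>> documents = Arrays.asList(
        new HashSet<>(Arrays.asList("cluster", "mahout", "distance")),
        new HashSet<>(Arrays.asList("mahout", "vector")),
        new HashSet<>(Arrays.asList("cluster", "vector", "cosine")));

    // Document frequency: in how many documents does each word occur?
    Map<String, Integer> docFreq = new HashMap<>();
    for (Set<String> doc : documents) {
      for (String word : doc) {
        docFreq.merge(word, 1, Integer::sum);
      }
    }

    // IDF weight per word.
    int numDocs = documents.size();
    for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
      double idf = Math.log((double) numDocs / e.getValue());
      System.out.println(e.getKey() + " -> " + idf);
    }
  }
}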

For clustering, I am using cosine similarity.

Regards
Sebastian

Am 26.03.2013 17:33, schrieb Dan Filimon:
> Hi,
>
> Could you tell us more about the kind of data you're clustering? What
> distance measure you're using and what the dimensionality of the data
> is?
>
> On Tue, Mar 26, 2013 at 6:21 PM, Sebastian Briesemeister
> <se...@unister-gmbh.de> wrote:
>> Dear Mahout-users,
>>
>> I am facing two problems when I am clustering instances with Fuzzy c
>> Means clustering (cosine distance, random initial clustering):
>>
>> 1.) I always end up with one large set of rubbish instances. All of them
>> have uniform cluster probability distribution and are, hence, in the
>> exact middle of the cluster space.
>> The cosine distance between instances within this cluster reaches from 0
>> to 1.
>>
>> 2.) Some of my clusters have the same or a very very similar center.
>>
>> Besides the above described problems, the clustering seems to work fine.
>>
>> Has somebody an idea how my clustering can be improved?
>>
>> Regards
>> Sebastian


Re: How to improve clustering?

Posted by Dan Filimon <da...@gmail.com>.
Hi,

Could you tell us more about the kind of data you're clustering? What
distance measure are you using, and what is the dimensionality of the
data?

On Tue, Mar 26, 2013 at 6:21 PM, Sebastian Briesemeister
<se...@unister-gmbh.de> wrote:
> Dear Mahout-users,
>
> I am facing two problems when I am clustering instances with Fuzzy c
> Means clustering (cosine distance, random initial clustering):
>
> 1.) I always end up with one large set of rubbish instances. All of them
> have uniform cluster probability distribution and are, hence, in the
> exact middle of the cluster space.
> The cosine distance between instances within this cluster reaches from 0
> to 1.
>
> 2.) Some of my clusters have the same or a very very similar center.
>
> Besides the above described problems, the clustering seems to work fine.
>
> Has somebody an idea how my clustering can be improved?
>
> Regards
> Sebastian