Posted to user@mahout.apache.org by sharath jagannath <sh...@gmail.com> on 2011/02/04 07:39:13 UTC

Another set of basic questions

I have 3 questions:
1. Now that I am able to create clusters, I want to know how to find the
intra-cluster distances between data points, say the top m data points
closest to a given point within its cluster.
2. Say I have created the initial clusters and now want to update them without
starting from scratch. I will use canopy to approximate the closest
cluster, but how do I identify the new clusters formed from the data
points that do not belong to any of the old clusters?
3. After some time I will want to recluster everything. How should I do it?
Where should I get all the vectors from? Do I have to recreate
everything?

Thanks,
Sharath
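
[Editor's sketch for question 1] One way to read "top m data points close to
me within my cluster" is a nearest-neighbour ranking restricted to the members
of a single cluster. The snippet below is a minimal illustration in plain
Python, not Mahout code: the function names, the dictionary layout, and the
choice of Euclidean distance are all assumptions made for the example.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length dense vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_m_neighbours(query_id, cluster_points, m, distance=euclidean):
    """Return the m members of one cluster closest to the query point.

    cluster_points maps a point id to its vector and is assumed to hold
    only the members of a single cluster, the query point included.
    """
    query = cluster_points[query_id]
    scored = [
        (distance(query, vec), pid)
        for pid, vec in cluster_points.items()
        if pid != query_id  # skip the query point itself
    ]
    scored.sort()  # ascending by distance, ties broken by id
    return [(pid, dist) for dist, pid in scored[:m]]

# Hypothetical cluster of four 2-D points.
cluster = {
    "doc1": [0.0, 0.0],
    "doc2": [3.0, 4.0],
    "doc3": [1.0, 0.0],
    "doc4": [0.0, 2.0],
}
nearest = top_m_neighbours("doc1", cluster, m=2)
```

For sparse TF vectors a cosine-based distance may be a better fit; it can be
passed in through the distance parameter without changing the ranking logic.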

Re: Another set of basic questions

Posted by sharath jagannath <sh...@gmail.com>.
Yeah sure.
Apologies.

Thanks,
Sharath

Re: Another set of basic questions

Posted by Jake Mannix <ja...@gmail.com>.
It's a Friday afternoon / evening (in the US). Give it a bit; people might
respond if you show a little patience.

On Fri, Feb 4, 2011 at 6:01 PM, sharath jagannath <
sharathjagannath@gmail.com> wrote:

> ?

Re: Another set of basic questions

Posted by sharath jagannath <sh...@gmail.com>.
?



-- 
Thanks,
Sharath Jagannath

Re: Another set of basic questions

Posted by sharath jagannath <sh...@gmail.com>.
I would really appreciate it if somebody could respond.
I am trying to do online clustering of feed data.

I am now able to write my custom analyzer, create TF vectors, use canopy
as the seed generator, and cluster using KMeansDriver.
Question 1: I want to save the generated centroids. Is there a specific
interface with which I can create backups, or do I have to read them and save
them somewhere else, say a database, for further use?
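
[Editor's sketch for Question 1] Independently of whatever Mahout offers for
this, centroids that have already been read out as plain numeric vectors can
be backed up with any ordinary serialization format. The snippet below is a
minimal illustration in plain Python using JSON; the function names, the
cluster ids, and the file name are hypothetical, and no Mahout API is used.

```python
import json
import os
import tempfile

def save_centroids(centroids, path):
    """Write a {cluster_id: vector} mapping to a JSON file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(centroids, fh)

def load_centroids(path):
    """Read back a {cluster_id: vector} mapping written by save_centroids."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)

# Hypothetical centroids keyed by cluster id.
centroids = {"c0": [0.5, 1.5], "c1": [4.0, 4.0]}

# A temporary directory stands in for the real backup location.
path = os.path.join(tempfile.mkdtemp(), "centroids.json")
save_centroids(centroids, path)
restored = load_centroids(path)
```

The same mapping could equally be written to a database table with one row
per cluster id; JSON is used here only to keep the example self-contained.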

Say I now have 100 articles and have grouped them into 10 clusters,
against which I want to cluster the new feed. Let's say I have 10 more articles.

My first approach:
I can rerun the same cycle to recluster everything, but that takes time, so I do
not want to do it for my online clustering.

Second approach:
I want to use the saved centroids generated in the initial phase and cluster
using CanopyDriver. But CanopyDriver takes vectors as input and generates
centroids.
Question 2: Can we do this with CanopyDriver? I want to use the previous
centroids.

If this is possible, let's say that out of my 10 new articles, 8 are grouped into
existing clusters but 2 are new. To achieve this I need the previous
centroids.
I want to cluster the 2 new ones with the usual k-means and form a new cluster.
Question 3: How should I add the centroids of the newly formed clusters to the
initial centroid list?
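
[Editor's sketch for Questions 2 and 3] The incremental workflow described
here can be pictured as three steps: assign each new vector to its nearest
saved centroid if it lies within some distance threshold, hold back the
vectors that fit no existing cluster, then cluster the held-back vectors and
append their centroids to the saved list. The snippet below is a minimal
illustration in plain Python, not CanopyDriver or KMeansDriver usage. The
function names, the threshold value, and the id scheme are assumptions, and
a simple mean of the held-back vectors stands in for a real k-means run.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length dense vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign_or_hold(points, centroids, threshold):
    """Split points into those within `threshold` of a saved centroid
    and those that fit no existing cluster."""
    assigned, unassigned = {}, {}
    for pid, vec in points.items():
        # Nearest saved centroid and its distance to this point.
        cid, dist = min(
            ((cid, euclidean(vec, c)) for cid, c in centroids.items()),
            key=lambda pair: pair[1],
        )
        if dist <= threshold:
            assigned[pid] = cid
        else:
            unassigned[pid] = vec
    return assigned, unassigned

def mean_vector(vectors):
    """Component-wise mean; a stand-in for running k-means with k = 1."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def add_centroids(centroids, new_centroids):
    """Return a copy of the centroid map with the new centroids appended.
    Assumes ids follow the hypothetical pattern c0, c1, ... with no gaps."""
    merged = dict(centroids)
    for vec in new_centroids:
        merged[f"c{len(merged)}"] = vec
    return merged

# Hypothetical saved centroids and a batch of new vectors.
centroids = {"c0": [0.0, 0.0], "c1": [10.0, 10.0]}
new_points = {
    "a": [0.5, 0.0],
    "b": [9.0, 10.0],
    "x": [50.0, 50.0],
    "y": [52.0, 50.0],
}
assigned, unassigned = assign_or_hold(new_points, centroids, threshold=2.0)
new_centroid = mean_vector(list(unassigned.values()))
centroids = add_centroids(centroids, [new_centroid])
```

In this toy run "a" and "b" join existing clusters while "x" and "y" are held
back and produce one extra centroid. The threshold plays a role loosely
similar to a canopy distance bound, and choosing it well is data dependent.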

Again, I would appreciate a response. I know my questions are a bit stupid,
but for a novice I guess that is expected.

Thanks,
Sharath





On Fri, Feb 4, 2011 at 9:38 AM, sharath jagannath <
sharathjagannath@gmail.com> wrote:

> anybody please?
>
> Thanks,
> Sharath

Re: Another set of basic questions

Posted by sharath jagannath <sh...@gmail.com>.
anybody please?

Thanks,
Sharath



-- 
Thanks,
Sharath Jagannath