Posted to user@mahout.apache.org by David Saile <da...@uni-koblenz.de> on 2011/05/09 11:53:18 UTC

Incremental clustering

Hi list,

I am completely new to Mahout, so please forgive me if the answer to my question is too obvious.

For a case study, I am working on a simple incremental web crawler (much like Nutch) and I want to include a very simple indexing step that incorporates clustering of documents.

I was hoping to use some kind of incremental clustering algorithm, in order to match the incremental way the crawler is supposed to work (i.e. continuously adding and updating websites).

Is there some way to achieve the following:
	1) initial clustering of the first web-crawl
	2) assigning new sites to existing clusters
	3) possibly moving modified sites between clusters

I would really appreciate any help!

Thanks,
David

Re: AW: Incremental clustering

Posted by Benson Margulies <bi...@gmail.com>.
You can do agglomerative clustering incrementally, deciding for each
new point where to put it. Then you have to decide whether, on some
schedule or another, to consider 'rebalancing' by moving things
around.
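One simple incremental scheme in this spirit (assign each new document to the closest existing cluster if it is near enough, otherwise open a new cluster, and handle rebalancing separately on a schedule) might look like the following. This is illustrative Python only, not Mahout code; all names are made up:

```python
import math

def cosine_distance(a, b):
    """Cosine distance between two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)

def assign_incrementally(clusters, doc, threshold=0.6):
    """Put `doc` in the closest existing cluster, or open a new one.

    `clusters` is a list of lists of documents; the first member of each
    cluster is used as a crude representative (a real system would keep
    a centroid instead).
    """
    best, best_d = None, threshold
    for cluster in clusters:
        d = cosine_distance(cluster[0], doc)
        if d < best_d:
            best, best_d = cluster, d
    if best is None:
        clusters.append([doc])   # no cluster close enough: start a new one
    else:
        best.append(doc)
    return clusters
```

This also covers Ulrich's variant: a document that matches no existing cluster within the threshold starts a new cluster instead of being forced into an old one.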

On Thu, May 12, 2011 at 4:53 AM, David Saile <da...@uni-koblenz.de> wrote:
> I am still stuck at this problem.
>
> Can anyone give me a heads-up on how existing systems handle this?
> If a collection of documents is modified, is the clustering recomputed from scratch each time?
> Or is there in fact any incremental way to handle an evolving set of documents?
>
> I would really appreciate any hint!
>
> Thanks,
> David
>
>
> On 09.05.2011 at 12:45, Ulrich Poppendieck wrote:
>
>> Not an answer, but a follow-up question:
>> I would be interested in the very same thing, but with the possibility to assign new sites to existing clusters OR to new ones.
>>
>> Thanks in advance,
>> Ulrich

Re: AW: Incremental clustering

Posted by Ted Dunning <te...@gmail.com>.
Using whatever you used originally would be best.  A map-reduce program will
be slow for small batches, of course.  I don't know if seq2sparse has an
efficient sequential mode.
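For a small batch, one could vectorize sequentially against the frozen dictionary and IDF table produced by the original batch run. A minimal sketch of that idea (illustrative Python, not seq2sparse; the data layout is an assumption):

```python
import math
import re

def vectorize(text, dictionary, idf):
    """TF-IDF vector for one new document against a frozen dictionary.

    `dictionary` maps term -> dimension index, `idf` maps term -> inverse
    document frequency; both would come from the original batch
    vectorization.  Unknown terms are simply dropped, which is the usual
    trade-off when the dictionary is not rebuilt.
    """
    tf = {}
    for term in re.findall(r"[a-z0-9]+", text.lower()):
        if term in dictionary:
            tf[term] = tf.get(term, 0) + 1
    # sparse vector: dimension index -> tf * idf weight
    return {dictionary[t]: count * idf.get(t, 0.0) for t, count in tf.items()}
```

Dropping out-of-dictionary terms keeps the vector space stable between full reclustering runs, at the cost of ignoring genuinely new vocabulary until the dictionary is rebuilt.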

On Thu, May 12, 2011 at 11:18 AM, Frank Scholten <fr...@frankscholten.nl>wrote:

> What do you recommend for vectorizing the new docs? Run seq2sparse on
> a batch of them? Seems there's no code at the moment for quickly
> vectorizing a few new documents based on the existing dictionary.
>
> Frank
>
> On Thu, May 12, 2011 at 12:32 PM, Grant Ingersoll <gs...@apache.org>
> wrote:
> > From what I've seen, using Mahout's existing clustering methods, I think
> most people set up some schedule whereby they cluster the whole collection on
> a regular basis, and then all docs that come in the meantime are simply
> assigned to the closest cluster until the next whole-collection iteration is
> completed.  There are, of course, other variants one could do, such as kicking
> off the whole clustering when some threshold number of docs is reached.
> >
> > There are other clustering methods, as Benson alluded to, that may better
> support incremental approaches.
> > --------------------------
> > Grant Ingersoll
> > http://www.lucidimagination.com/
> >
> > Search the Lucene ecosystem docs using Solr/Lucene:
> > http://www.lucidimagination.com/search
> >
> >
>


Re: AW: Incremental clustering

Posted by Michael Kurze <mk...@mozilla.com>.
Hi David,

Not really an existing system yet, but I am looking into this problem as well. Basically
you can do what Grant said; alternatively (or additionally) you can try to work with an
inverted index (e.g. a Lucene search index), which would allow you to assign documents to
nearby clusters without first comparing them against a possibly large number of cluster
centroids.

I am experimenting with using this method for the initial clustering as well: first
combining well-matching documents using the index, and only then using k-means or the like
to merge related clusters afterwards. This greatly cuts down on dense vector distance
computations (most methods create dense centroids, entailing lots of dense computation
on them).

Best regards,
Michael
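Michael's inverted-index idea can be sketched as follows: index each centroid's heaviest terms, then score a new document only against the clusters it shares at least one term with. Illustrative Python only; all names are made up:

```python
from collections import defaultdict

def build_term_index(centroids, top_k=10):
    """Inverted index: term -> set of cluster ids that carry it among
    their `top_k` heaviest terms.  `centroids` maps cluster id -> {term: weight}."""
    index = defaultdict(set)
    for cid, centroid in centroids.items():
        top = sorted(centroid, key=centroid.get, reverse=True)[:top_k]
        for term in top:
            index[term].add(cid)
    return index

def candidate_clusters(doc_terms, index):
    """Clusters worth scoring for a document: those sharing at least one
    indexed term.  Only these need a full distance computation, instead
    of every (dense) centroid."""
    candidates = set()
    for term in doc_terms:
        candidates |= index.get(term, set())
    return candidates
```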



Re: AW: Incremental clustering

Posted by Ted Dunning <te...@gmail.com>.
If I could jump in here.

All of the clustering algorithms that we have roughly follow an EM-algorithm
structure and can be viewed as estimating mixture distributions. For the
uninitiated, a mixture distribution is one where you can pretend the data
were generated by

1) picking a number i from 1 to n

2) picking a data sample from probability distribution i

Most of the algorithms that we use work by doing something a lot like this:

   for iteration = 1 to n
      # E step
      for each data point
            hard- or soft-assign the data point to one of the distributions

      # M step
      for each distribution (aka cluster)
            estimate the distribution parameters from the points assigned in the E step


Initialization of the distributions can happen before the algorithm starts
(mean-shift and canopy) or during the E step (Dirichlet process clustering).

This set of distributions can easily be considered to be a classification
model. The E step is normal classification and the M step is model training.


To answer the specific question: if you assume even pretty weak stability
for your incoming data, then yes, you can use the clustering from past data
to define the prior state for clustering the current data. With k-means,
this means that we don't have to do the canopy step. In general, we should
keep some idea of how much data has been processed in the past, so that a
few new points can't completely rewrite the clustering; that is pretty
easily done with k-means and Dirichlet. With k-means, it means that we keep
the count in addition to the centroid. With Dirichlet, it means that we
need an online estimator for the probability distribution.
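Ted's point about keeping a count alongside each centroid can be sketched as an online (sequential) k-means update. This is a minimal illustration of the idea, not Mahout code; all names are made up:

```python
def update_cluster(centroid, count, point):
    """Fold one new point into a cluster that has already absorbed
    `count` points.  The centroid moves by 1/(count+1) of the gap, so
    heavily-trained clusters are barely perturbed by a few new points."""
    count += 1
    centroid = [c + (p - c) / count for c, p in zip(centroid, point)]
    return centroid, count

def assign_and_update(clusters, point):
    """One online k-means step: assign `point` to the nearest centroid,
    then update that centroid and its count in place.
    `clusters` is a list of [centroid, count] pairs."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    best = min(range(len(clusters)), key=lambda i: sqdist(clusters[i][0], point))
    clusters[best][0], clusters[best][1] = update_cluster(
        clusters[best][0], clusters[best][1], point)
    return best
```

Because the step size shrinks as 1/(count+1), a cluster that has already absorbed many points is barely moved by a handful of new ones, which is exactly the stability Ted describes.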


On Thu, May 12, 2011 at 9:13 AM, Benson Margulies <bi...@gmail.com>wrote:

> Jeff,
>
> Could you expand a bit on the subject of models in clustering? I
> mentally simplify this into 'clustering: unsupervised; classification:
> supervised.'
>
> Is the idea here that you are going to be presented with many
> different corpora that have some sort of overall resemblance, so that
> priors derived from the first N speed up clustering N+1?
>
> --benson
>
>
> On Thu, May 12, 2011 at 12:00 PM, Jeff Eastman <je...@narus.com> wrote:
> > Sure, by using your old clusters as the prior (clustersIn) for the new
> clustering, you can reduce the number of iterations required to converge.
> >
> > -----Original Message-----
> > From: David Saile [mailto:david@uni-koblenz.de]
> > Sent: Thursday, May 12, 2011 8:54 AM
> > To: user@mahout.apache.org
> > Subject: Re: AW: Incremental clustering
> >
> > Thank you very much everyone! This really helped a lot.
> >
> > Here is what I am planning to do:
> > I am going to compute an initial clustering after the first crawl.
> > Then, as sites are being added to the index I will simply classify them
> using the existing clusters.
> >
> > As I expect updates to be generally very small, I will only recompute the
> clustering after some threshold has been hit, like Grant suggested.
> > As Ted pointed out, this can be done with the old clusters as input.
> >
> > Thanks again,
> > David
> >
> >
> >
> > On 12.05.2011 at 17:35, Ted Dunning wrote:
> >
> >> Most of these algorithms can be done in an incremental fashion in which
> you
> >> can add batches to the previous training.
> >>
> >> On Thu, May 12, 2011 at 8:30 AM, Jeff Eastman <je...@narus.com>
> wrote:
> >>
> >>> Most of the clustering drivers have two methods: one to train the
> clusterer
> >>> with data to produce the cluster models; one to classify the data using
> a
> >>> given set of cluster models. Currently the CLI only allows train
> followed by
> >>> optional classify. We could pretty easily allow classify to be done
> >>> stand-alone, and this would be useful in support of Grant's approach
> below.
> >>>
> >>> Jeff

RE: AW: Incremental clustering

Posted by Jeff Eastman <je...@Narus.com>.
Check your convergence criteria. The iterations will end when: a) maxIterations has been reached, or b) all of the clusters have converged. If the clusters did not converge in either run, then chaining the runs together won't change the times.
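The two stopping conditions can be sketched as follows. This is illustrative Python, not the Mahout implementation; the parameter names are made up:

```python
def has_converged(old_centroids, new_centroids, delta=0.001):
    """A cluster set has converged when no centroid moved more than
    `delta` (Euclidean distance) in the last iteration."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return all(dist(o, n) <= delta for o, n in zip(old_centroids, new_centroids))

def run_until_done(centroids, iterate, max_iterations=20, delta=0.001):
    """Iterate until a) max_iterations is reached or b) all clusters
    converge -- the two stopping criteria.  `iterate` performs one full
    E/M pass and returns the new centroids."""
    for i in range(max_iterations):
        new_centroids = iterate(centroids)
        if has_converged(centroids, new_centroids, delta):
            return new_centroids, i + 1   # converged early
        centroids = new_centroids
    return centroids, max_iterations      # hit the iteration cap
```

If a run always exhausts max_iterations, seeding it with the previous run's clusters cannot make it finish earlier; only the convergence test can.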

-----Original Message-----
From: David Saile [mailto:david@uni-koblenz.de] 
Sent: Thursday, May 12, 2011 10:09 AM
To: user@mahout.apache.org
Subject: Re: AW: Incremental clustering

I had that same thought, so I actually tried running k-means twice on the Reuters dataset (as described in the Quickstart).
The second run received the resulting clusters of the first run as input.

However, the execution times of the two runs did not differ much (actually the 2nd run was a bit slower).
I also tried doubling the input or the number of iterations, but saw no improvement.

Could this be caused by running Hadoop on a single machine? 
Or is the number of iterations with 20 (or 40) simply not high enough?

David  


On 12.05.2011 at 18:46, Jeff Eastman wrote:

> Also, if cluster training begins with the posterior from a previous training session over the corpus but with new data added since that training began, the prior clusters should be very close to an optimal solution with the new data and the number of iterations required to converge on a new posterior should be reduced. Haven't tried this in practice but it seems logical. Convergence is calculated by how much each cluster has changed during an iteration.


RE: AW: Incremental clustering

Posted by Jeff Eastman <je...@Narus.com>.
Dirichlet maintains weights (total counts over all iterations) in the mixture, but k-means does not have anything equivalent.

-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com] 
Sent: Thursday, May 12, 2011 10:16 AM
To: user@mahout.apache.org
Subject: Re: AW: Incremental clustering

I think that this may also have to do with whether k-means retains a sense
of weight for the old clusters.  I don't think it currently does.





RE: AW: Incremental clustering

Posted by Jeff Eastman <je...@Narus.com>.
Sure,
Each iteration of the k-means, fuzzy-k & Dirichlet clustering algorithms begins with an initial (prior) set of clusters (a.k.a. models). Each iteration assigns each input vector to one cluster (k-means = most likely; Dirichlet = multinomial sampling) or to multiple clusters (fuzzy-k = a percentage of each). Then, at the end of the iteration, each cluster's parameters are recomputed based upon the observed data, and the posterior clusters from iteration n become the prior clusters for iteration n+1.

Based upon discussions with Ted, I've been trying to recast clustering as an unsupervised classification problem. This is most obvious if you look at the new ClusterClassifier & ClusterIterator, which implement all three algorithms in a single classification-ready engine. ClusterClassifier extends AbstractVectorClassifier and implements OnlineLearner. This means a ClusterClassifier produced by unsupervised training on some data can be used as a model in a semi-supervised classifier alongside models obtained via supervised training.

I've adjusted the 3 Display clustering examples to use the ClusterClassifier, so you can see that it works pretty well. I'm particularly pleased with how Dirichlet and k-means fit together using this approach.
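The clustering-as-classification view can be sketched generically: once training has produced centroids, the cluster set acts as a classifier, returning either a hard label (k-means style) or a soft membership vector (fuzzy-k flavour). This is illustrative Python, not the actual ClusterClassifier API; the class and method names are made up:

```python
import math

class ClusterModel:
    """A trained set of centroids used as a classifier: the E step of
    clustering is exactly classification against the current model."""

    def __init__(self, centroids):
        self.centroids = centroids  # list of equal-length float lists

    def _dists(self, point):
        return [math.dist(c, point) for c in self.centroids]

    def classify(self, point):
        """Hard assignment (k-means style): index of the closest centroid."""
        d = self._dists(point)
        return d.index(min(d))

    def soft_classify(self, point):
        """Soft assignment (fuzzy-k flavour): memberships proportional to
        inverse distance, normalized to sum to 1."""
        inv = [1.0 / (d + 1e-12) for d in self._dists(point)]
        s = sum(inv)
        return [w / s for w in inv]
```

In this framing, the classify-only mode discussed earlier in the thread is just running this model over new documents without retraining it.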


Re: AW: Incremental clustering

Posted by Benson Margulies <bi...@gmail.com>.
Jeff,

Could you expand a bit on the subject of models in clustering? I
mentally simplify this into 'clustering: unsupervised; classification:
supervised.'

Is the idea here that you are going to be presented with many
different corpora that have some sort of overall resemblance, so that
priors derived from the first N speed up clustering corpus N+1?

--benson


On Thu, May 12, 2011 at 12:00 PM, Jeff Eastman <je...@narus.com> wrote:
> Sure, by using your old clusters as the prior (clustersIn) for the new clustering, you can reduce the number of iterations required to converge.
>
> -----Original Message-----
> From: David Saile [mailto:david@uni-koblenz.de]
> Sent: Thursday, May 12, 2011 8:54 AM
> To: user@mahout.apache.org
> Subject: Re: AW: Incremental clustering
>
> Thank you very much everyone! This really helped a lot.
>
> Here is what I am planning to do:
> I am going to compute an initial clustering after the first crawl.
> Then, as sites are being added to the index I will simply classify them using the existing clusters.
>
> As I expect updates to be generally very small, I will only recompute the clustering after some threshold has been hit, like Grant suggested.
> As Ted pointed out, this can be done with the old clusters as input.
>
> Thanks again,
> David
>
>
>
> Am 12.05.2011 um 17:35 schrieb Ted Dunning:
>
>> Most of these algorithms can be done in an incremental fashion in which you
>> can add batches to the previous training.
>>
>> On Thu, May 12, 2011 at 8:30 AM, Jeff Eastman <je...@narus.com> wrote:
>>
>>> Most of the clustering drivers have two methods: one to train the clusterer
>>> with data to produce the cluster models; one to classify the data using a
>>> given set of cluster models. Currently the CLI only allows train followed by
>>> optional classify. We could pretty easily allow classify to be done
>>> stand-alone, and this would be useful in support of Grant's approach below.
>>>
>>> Jeff
>>>
>>> -----Original Message-----
>>> From: Grant Ingersoll [mailto:gsingers@apache.org]
>>> Sent: Thursday, May 12, 2011 3:32 AM
>>> To: user@mahout.apache.org
>>> Subject: Re: AW: Incremental clustering
>>>
>>> From what I've seen, using Mahout's existing clustering methods, I think
>>> most people set up some schedule whereby they cluster the whole collection on
>>> a regular basis and then all docs that come in the meantime are simply
>>> assigned to the closest cluster until the next whole collection iteration is
>>> completed.  There are, of course, other variants one could do, such as kick
>>> off the whole clustering when some threshold number of docs is reached.
>>>
>>> There are other clustering methods, as Benson alluded to, that may better
>>> support incremental approaches.
>>>
>>> On May 12, 2011, at 4:53 AM, David Saile wrote:
>>>
>>>> I am still stuck at this problem.
>>>>
>>>> Can anyone give me a heads-up on how existing systems handle this?
>>>> If a collection of documents is modified, is the clustering recomputed
>>> from scratch each time?
>>>> Or is there in fact any incremental way to handle an evolving set of
>>> documents?
>>>>
>>>> I would really appreciate any hint!
>>>>
>>>> Thanks,
>>>> David
>>>>
>>>>
>>>> Am 09.05.2011 um 12:45 schrieb Ulrich Poppendieck:
>>>>
>>>>> Not an answer, but a follow-up question:
>>>>> I would be interested in the very same thing, but with the possibility
>>> to assign new sites to existing clusters OR to new ones.
>>>>>
>>>>> Thanks in advance,
>>>>> Ulrich
>>>>>
>>>>> -----Ursprüngliche Nachricht-----
>>>>> Von: David Saile [mailto:david@uni-koblenz.de]
>>>>> Gesendet: Montag, 9. Mai 2011 11:53
>>>>> An: user@mahout.apache.org
>>>>> Betreff: Incremental clustering
>>>>>
>>>>> Hi list,
>>>>>
>>>>> I am completely new to Mahout, so please forgive me if the answer to my
>>> question is too obvious.
>>>>>
>>>>> For a case study, I am working on a simple incremental web crawler (much
>>> like Nutch) and I want to include a very simple indexing step that
>>> incorporates clustering of documents.
>>>>>
>>>>> I was hoping to use some kind of incremental clustering algorithm, in
>>> order to make use of the incremental way the crawler is supposed to work
>>> (i.e. continuously adding and updating websites).
>>>>>
>>>>> Is there some way to achieve the following:
>>>>>     1) initial clustering of the first web-crawl
>>>>>     2) assigning new sites to existing clusters
>>>>>     3) possibly moving modified sites between clusters
>>>>>
>>>>> I would really appreciate any help!
>>>>>
>>>>> Thanks,
>>>>> David
>>>>
>>>
>>> --------------------------
>>> Grant Ingersoll
>>> http://www.lucidimagination.com/
>>>
>>> Search the Lucene ecosystem docs using Solr/Lucene:
>>> http://www.lucidimagination.com/search
>>>
>>>
>
>

RE: AW: Incremental clustering

Posted by Jeff Eastman <je...@Narus.com>.
Sure, by using your old clusters as the prior (clustersIn) for the new clustering, you can reduce the number of iterations required to converge.
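For anyone curious what seeding with the old clusters buys in practice, here is a rough sketch in plain NumPy (not the Mahout API; all names here are my own): Lloyd's iterations started from yesterday's centroids typically settle in a couple of passes, whereas a cold random start has to rediscover the structure from scratch.

```python
import numpy as np

def lloyd(points, centroids, max_iter=100, tol=1e-6):
    """Plain Lloyd's k-means; returns (centroids, iterations used)."""
    centroids = np.asarray(centroids, dtype=float).copy()
    for it in range(1, max_iter + 1):
        # assign every point to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new = np.array([points[labels == k].mean(axis=0) if np.any(labels == k)
                        else centroids[k] for k in range(len(centroids))])
        shift = np.linalg.norm(new - centroids)
        centroids = new
        if shift < tol:
            break
    return centroids, it

rng = np.random.default_rng(0)
# yesterday's corpus: two well-separated blobs of document vectors
old = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(8, 0.5, (100, 2))])
old_centroids, _ = lloyd(old, old[[0, 100]])  # one seed from each blob

# today's corpus: the same documents plus a handful of new ones
today = np.vstack([old, rng.normal(0, 0.5, (10, 2)), rng.normal(8, 0.5, (10, 2))])
warm_centroids, warm_iters = lloyd(today, old_centroids)  # prior = old clusters
cold_centroids, cold_iters = lloyd(today, today[rng.choice(len(today), 2, replace=False)])
print(warm_iters, cold_iters)
```

The warm start stabilizes almost immediately because the assignments barely change between crawls; that is exactly the effect of passing the previous clusters in as the prior.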

-----Original Message-----
From: David Saile [mailto:david@uni-koblenz.de] 
Sent: Thursday, May 12, 2011 8:54 AM
To: user@mahout.apache.org
Subject: Re: AW: Incremental clustering

Thank you very much everyone! This really helped a lot.

Here is what I am planning to do:
I am going to compute an initial clustering after the first crawl. 
Then, as sites are being added to the index I will simply classify them using the existing clusters.

As I expect updates to be generally very small, I will only recompute the clustering after some threshold has been hit, like Grant suggested. 
As Ted pointed out, this can be done with the old clusters as input.

Thanks again,
David


 
Am 12.05.2011 um 17:35 schrieb Ted Dunning:

> Most of these algorithms can be done in an incremental fashion in which you
> can add batches to the previous training.
> 
> On Thu, May 12, 2011 at 8:30 AM, Jeff Eastman <je...@narus.com> wrote:
> 
>> Most of the clustering drivers have two methods: one to train the clusterer
>> with data to produce the cluster models; one to classify the data using a
>> given set of cluster models. Currently the CLI only allows train followed by
>> optional classify. We could pretty easily allow classify to be done
>> stand-alone, and this would be useful in support of Grant's approach below.
>> 
>> Jeff
>> 
>> -----Original Message-----
>> From: Grant Ingersoll [mailto:gsingers@apache.org]
>> Sent: Thursday, May 12, 2011 3:32 AM
>> To: user@mahout.apache.org
>> Subject: Re: AW: Incremental clustering
>> 
>> From what I've seen, using Mahout's existing clustering methods, I think
>> most people set up some schedule whereby they cluster the whole collection on
>> a regular basis and then all docs that come in the meantime are simply
>> assigned to the closest cluster until the next whole collection iteration is
>> completed.  There are, of course, other variants one could do, such as kick
>> off the whole clustering when some threshold number of docs is reached.
>> 
>> There are other clustering methods, as Benson alluded to, that may better
>> support incremental approaches.
>> 
>> On May 12, 2011, at 4:53 AM, David Saile wrote:
>> 
>>> I am still stuck at this problem.
>>> 
>>> Can anyone give me a heads-up on how existing systems handle this?
>>> If a collection of documents is modified, is the clustering recomputed
>> from scratch each time?
>>> Or is there in fact any incremental way to handle an evolving set of
>> documents?
>>> 
>>> I would really appreciate any hint!
>>> 
>>> Thanks,
>>> David
>>> 
>>> 
>>> Am 09.05.2011 um 12:45 schrieb Ulrich Poppendieck:
>>> 
>>>> Not an answer, but a follow-up question:
>>>> I would be interested in the very same thing, but with the possibility
>> to assign new sites to existing clusters OR to new ones.
>>>> 
>>>> Thanks in advance,
>>>> Ulrich
>>>> 
>>>> -----Ursprüngliche Nachricht-----
>>>> Von: David Saile [mailto:david@uni-koblenz.de]
>>>> Gesendet: Montag, 9. Mai 2011 11:53
>>>> An: user@mahout.apache.org
>>>> Betreff: Incremental clustering
>>>> 
>>>> Hi list,
>>>> 
>>>> I am completely new to Mahout, so please forgive me if the answer to my
>> question is too obvious.
>>>> 
>>>> For a case study, I am working on a simple incremental web crawler (much
>> like Nutch) and I want to include a very simple indexing step that
>> incorporates clustering of documents.
>>>> 
>>>> I was hoping to use some kind of incremental clustering algorithm, in
>> order to make use of the incremental way the crawler is supposed to work
>> (i.e. continuously adding and updating websites).
>>>> 
>>>> Is there some way to achieve the following:
>>>>     1) initial clustering of the first web-crawl
>>>>     2) assigning new sites to existing clusters
>>>>     3) possibly moving modified sites between clusters
>>>> 
>>>> I would really appreciate any help!
>>>> 
>>>> Thanks,
>>>> David
>>> 
>> 
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>> 
>> Search the Lucene ecosystem docs using Solr/Lucene:
>> http://www.lucidimagination.com/search
>> 
>> 


Re: AW: Incremental clustering

Posted by David Saile <da...@uni-koblenz.de>.
Thank you very much everyone! This really helped a lot.

Here is what I am planning to do:
I am going to compute an initial clustering after the first crawl. 
Then, as sites are being added to the index, I will simply classify them using the existing clusters.

As I expect updates to be generally very small, I will only recompute the clustering after some threshold has been hit, like Grant suggested. 
As Ted pointed out, this can be done with the old clusters as input.
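For what it's worth, the whole plan fits in a few lines of illustrative Python (class and method names are my own invention, not Mahout's): classify arrivals against the current centroids, buffer them, and trigger a re-clustering seeded with the old centroids once the buffer passes a threshold.

```python
import numpy as np

class IncrementalClusterIndex:
    """Toy sketch of the plan above; illustrative only, not Mahout code."""

    def __init__(self, centroids, recluster_threshold=1000):
        self.centroids = np.asarray(centroids, dtype=float)
        self.pending = []                     # docs seen since the last full run
        self.threshold = recluster_threshold

    def classify(self, vec):
        """Assign one document vector to the nearest existing cluster."""
        vec = np.asarray(vec, dtype=float)
        label = int(np.linalg.norm(self.centroids - vec, axis=1).argmin())
        self.pending.append(vec)
        if len(self.pending) >= self.threshold:
            self._recluster()
        return label

    def _recluster(self):
        # one assignment/update pass seeded with the old centroids;
        # a real system would iterate over the whole corpus to convergence
        data = np.vstack(self.pending)
        labels = np.array([np.linalg.norm(self.centroids - d, axis=1).argmin()
                           for d in data])
        for k in range(len(self.centroids)):
            if np.any(labels == k):
                self.centroids[k] = data[labels == k].mean(axis=0)
        self.pending.clear()

idx = IncrementalClusterIndex([[0.0, 0.0], [5.0, 5.0]], recluster_threshold=4)
print(idx.classify([0.2, -0.1]))  # → 0
print(idx.classify([4.8, 5.1]))   # → 1
```

The threshold is the knob Grant mentioned: set it high enough that full re-clustering runs stay rare relative to the crawl rate.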

Thanks again,
David


 
On 12.05.2011 at 17:35, Ted Dunning wrote:

> Most of these algorithms can be done in an incremental fashion in which you
> can add batches to the previous training.
> 
> On Thu, May 12, 2011 at 8:30 AM, Jeff Eastman <je...@narus.com> wrote:
> 
>> Most of the clustering drivers have two methods: one to train the clusterer
>> with data to produce the cluster models; one to classify the data using a
>> given set of cluster models. Currently the CLI only allows train followed by
>> optional classify. We could pretty easily allow classify to be done
>> stand-alone, and this would be useful in support of Grant's approach below.
>> 
>> Jeff
>> 
>> -----Original Message-----
>> From: Grant Ingersoll [mailto:gsingers@apache.org]
>> Sent: Thursday, May 12, 2011 3:32 AM
>> To: user@mahout.apache.org
>> Subject: Re: AW: Incremental clustering
>> 
>> From what I've seen, using Mahout's existing clustering methods, I think
>> most people set up some schedule whereby they cluster the whole collection on
>> a regular basis and then all docs that come in the meantime are simply
>> assigned to the closest cluster until the next whole collection iteration is
>> completed.  There are, of course, other variants one could do, such as kick
>> off the whole clustering when some threshold number of docs is reached.
>> 
>> There are other clustering methods, as Benson alluded to, that may better
>> support incremental approaches.
>> 
>> On May 12, 2011, at 4:53 AM, David Saile wrote:
>> 
>>> I am still stuck at this problem.
>>> 
>>> Can anyone give me a heads-up on how existing systems handle this?
>>> If a collection of documents is modified, is the clustering recomputed
>> from scratch each time?
>>> Or is there in fact any incremental way to handle an evolving set of
>> documents?
>>> 
>>> I would really appreciate any hint!
>>> 
>>> Thanks,
>>> David
>>> 
>>> 
>>> Am 09.05.2011 um 12:45 schrieb Ulrich Poppendieck:
>>> 
>>>> Not an answer, but a follow-up question:
>>>> I would be interested in the very same thing, but with the possibility
>> to assign new sites to existing clusters OR to new ones.
>>>> 
>>>> Thanks in advance,
>>>> Ulrich
>>>> 
>>>> -----Ursprüngliche Nachricht-----
>>>> Von: David Saile [mailto:david@uni-koblenz.de]
>>>> Gesendet: Montag, 9. Mai 2011 11:53
>>>> An: user@mahout.apache.org
>>>> Betreff: Incremental clustering
>>>> 
>>>> Hi list,
>>>> 
>>>> I am completely new to Mahout, so please forgive me if the answer to my
>> question is too obvious.
>>>> 
>>>> For a case study, I am working on a simple incremental web crawler (much
>> like Nutch) and I want to include a very simple indexing step that
>> incorporates clustering of documents.
>>>> 
>>>> I was hoping to use some kind of incremental clustering algorithm, in
>> order to make use of the incremental way the crawler is supposed to work
>> (i.e. continuously adding and updating websites).
>>>> 
>>>> Is there some way to achieve the following:
>>>>     1) initial clustering of the first web-crawl
>>>>     2) assigning new sites to existing clusters
>>>>     3) possibly moving modified sites between clusters
>>>> 
>>>> I would really appreciate any help!
>>>> 
>>>> Thanks,
>>>> David
>>> 
>> 
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>> 
>> Search the Lucene ecosystem docs using Solr/Lucene:
>> http://www.lucidimagination.com/search
>> 
>> 


Re: AW: Incremental clustering

Posted by Ted Dunning <te...@gmail.com>.
Most of these algorithms can be done in an incremental fashion in which you
can add batches to the previous training.
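As an illustration of what "adding a batch to the previous training" can look like for centroid-style models (my own toy code, not a specific Mahout API), each batch can be folded into running-mean centroids:

```python
import numpy as np

def update_with_batch(centroids, counts, batch):
    """Fold a new batch into existing centroids via running means."""
    centroids = centroids.astype(float).copy()
    counts = counts.astype(float).copy()
    for x in batch:
        k = int(np.linalg.norm(centroids - x, axis=1).argmin())
        counts[k] += 1
        # running-mean update: pull the centroid toward x by 1/count
        centroids[k] += (x - centroids[k]) / counts[k]
    return centroids, counts

centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
counts = np.array([1.0, 1.0])            # points each centroid has absorbed so far
batch = np.array([[0.0, 2.0], [10.0, 8.0]])
centroids, counts = update_with_batch(centroids, counts, batch)
print(centroids)  # each centroid pulled halfway toward its new point
```

Successive crawls then just call `update_with_batch` again with the state from the previous run.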

On Thu, May 12, 2011 at 8:30 AM, Jeff Eastman <je...@narus.com> wrote:

> Most of the clustering drivers have two methods: one to train the clusterer
> with data to produce the cluster models; one to classify the data using a
> given set of cluster models. Currently the CLI only allows train followed by
> optional classify. We could pretty easily allow classify to be done
> stand-alone, and this would be useful in support of Grant's approach below.
>
> Jeff
>
> -----Original Message-----
> From: Grant Ingersoll [mailto:gsingers@apache.org]
> Sent: Thursday, May 12, 2011 3:32 AM
> To: user@mahout.apache.org
> Subject: Re: AW: Incremental clustering
>
> From what I've seen, using Mahout's existing clustering methods, I think
> most people set up some schedule whereby they cluster the whole collection on
> a regular basis and then all docs that come in the meantime are simply
> assigned to the closest cluster until the next whole collection iteration is
> completed.  There are, of course, other variants one could do, such as kick
> off the whole clustering when some threshold number of docs is reached.
>
> There are other clustering methods, as Benson alluded to, that may better
> support incremental approaches.
>
> On May 12, 2011, at 4:53 AM, David Saile wrote:
>
> > I am still stuck at this problem.
> >
> > Can anyone give me a heads-up on how existing systems handle this?
> > If a collection of documents is modified, is the clustering recomputed
> from scratch each time?
> > Or is there in fact any incremental way to handle an evolving set of
> documents?
> >
> > I would really appreciate any hint!
> >
> > Thanks,
> > David
> >
> >
> > Am 09.05.2011 um 12:45 schrieb Ulrich Poppendieck:
> >
> >> Not an answer, but a follow-up question:
> >> I would be interested in the very same thing, but with the possibility
> to assign new sites to existing clusters OR to new ones.
> >>
> >> Thanks in advance,
> >> Ulrich
> >>
> >> -----Ursprüngliche Nachricht-----
> >> Von: David Saile [mailto:david@uni-koblenz.de]
> >> Gesendet: Montag, 9. Mai 2011 11:53
> >> An: user@mahout.apache.org
> >> Betreff: Incremental clustering
> >>
> >> Hi list,
> >>
> >> I am completely new to Mahout, so please forgive me if the answer to my
> question is too obvious.
> >>
> >> For a case study, I am working on a simple incremental web crawler (much
> like Nutch) and I want to include a very simple indexing step that
> incorporates clustering of documents.
> >>
> >> I was hoping to use some kind of incremental clustering algorithm, in
> order to make use of the incremental way the crawler is supposed to work
> (i.e. continuously adding and updating websites).
> >>
> >> Is there some way to achieve the following:
> >>      1) initial clustering of the first web-crawl
> >>      2) assigning new sites to existing clusters
> >>      3) possibly moving modified sites between clusters
> >>
> >> I would really appreciate any help!
> >>
> >> Thanks,
> >> David
> >
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem docs using Solr/Lucene:
> http://www.lucidimagination.com/search
>
>

RE: AW: Incremental clustering

Posted by Jeff Eastman <je...@Narus.com>.
Most of the clustering drivers have two methods: one to train the clusterer with data to produce the cluster models; one to classify the data using a given set of cluster models. Currently the CLI only allows train followed by optional classify. We could pretty easily allow classify to be done stand-alone, and this would be useful in support of Grant's approach below.
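In pseudocode terms (plain Python, not the actual driver API), the two methods look like this, and nothing in principle stops classify from running stand-alone against a saved set of models:

```python
import numpy as np

def classify(data, centroids):
    """Stand-alone classify step: label data against a given set of cluster models."""
    dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def train(data, k, iters=20, seed=0):
    """Training step: produce the cluster models (here, plain centroids)."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), k, replace=False)].astype(float)
    for _ in range(iters):
        labels = classify(data, centroids)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = data[labels == j].mean(axis=0)
    return centroids

docs = np.array([[0.0, 0.0], [0.1, 0.2], [9.9, 10.0], [10.1, 9.8]])
models = train(docs, k=2)
# later, fresh documents are classified without retraining
print(classify(np.array([[0.2, 0.1], [10.0, 10.0]]), models))
```

The point of the proposed CLI change is exactly this separation: persist `models` once, then run the second function alone on each incremental batch.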

Jeff

-----Original Message-----
From: Grant Ingersoll [mailto:gsingers@apache.org] 
Sent: Thursday, May 12, 2011 3:32 AM
To: user@mahout.apache.org
Subject: Re: AW: Incremental clustering

From what I've seen, using Mahout's existing clustering methods, I think most people set up some schedule whereby they cluster the whole collection on a regular basis and then all docs that come in the meantime are simply assigned to the closest cluster until the next whole collection iteration is completed.  There are, of course, other variants one could do, such as kicking off the whole clustering when some threshold number of docs is reached.

There are other clustering methods, as Benson alluded to, that may better support incremental approaches.

On May 12, 2011, at 4:53 AM, David Saile wrote:

> I am still stuck at this problem.
> 
> Can anyone give me a heads-up on how existing systems handle this? 
> If a collection of documents is modified, is the clustering recomputed from scratch each time? 
> Or is there in fact any incremental way to handle an evolving set of documents?
> 
> I would really appreciate any hint!
> 
> Thanks,
> David
> 
> 
> Am 09.05.2011 um 12:45 schrieb Ulrich Poppendieck:
> 
>> Not an answer, but a follow-up question: 
>> I would be interested in the very same thing, but with the possibility to assign new sites to existing clusters OR to new ones.
>> 
>> Thanks in advance,
>> Ulrich
>> 
>> -----Ursprüngliche Nachricht-----
>> Von: David Saile [mailto:david@uni-koblenz.de] 
>> Gesendet: Montag, 9. Mai 2011 11:53
>> An: user@mahout.apache.org
>> Betreff: Incremental clustering
>> 
>> Hi list,
>> 
>> I am completely new to Mahout, so please forgive me if the answer to my question is too obvious.
>> 
>> For a case study, I am working on a simple incremental web crawler (much like Nutch) and I want to include a very simple indexing step that incorporates clustering of documents.
>> 
>> I was hoping to use some kind of incremental clustering algorithm, in order to make use of the incremental way the crawler is supposed to work (i.e. continuously adding and updating websites).
>> 
>> Is there some way to achieve the following: 	
>> 	1) initial clustering of the first web-crawl
>> 	2) assigning new sites to existing clusters
>> 	3) possibly moving modified sites between clusters
>> 
>> I would really appreciate any help!
>> 
>> Thanks,
>> David
> 

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem docs using Solr/Lucene:
http://www.lucidimagination.com/search


Re: AW: Incremental clustering

Posted by Grant Ingersoll <gs...@apache.org>.
From what I've seen, using Mahout's existing clustering methods, I think most people set up some schedule whereby they cluster the whole collection on a regular basis and then all docs that come in the meantime are simply assigned to the closest cluster until the next whole collection iteration is completed.  There are, of course, other variants one could do, such as kicking off the whole clustering when some threshold number of docs is reached.

There are other clustering methods, as Benson alluded to, that may better support incremental approaches.
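The "assign to the closest cluster" step between full runs is trivial on its own; for example, with cosine similarity (a common choice for text vectors, though nothing in this thread prescribes the metric):

```python
import numpy as np

def nearest_cluster(doc_vec, centroids):
    """Assign one incoming doc to the closest existing cluster by cosine similarity."""
    v = doc_vec / np.linalg.norm(doc_vec)
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return int((c @ v).argmax())

centroids = np.array([[1.0, 0.0], [0.0, 1.0]])
print(nearest_cluster(np.array([0.9, 0.1]), centroids))  # → 0
print(nearest_cluster(np.array([0.1, 0.9]), centroids))  # → 1
```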

On May 12, 2011, at 4:53 AM, David Saile wrote:

> I am still stuck at this problem.
> 
> Can anyone give me a heads-up on how existing systems handle this? 
> If a collection of documents is modified, is the clustering recomputed from scratch each time? 
> Or is there in fact any incremental way to handle an evolving set of documents?
> 
> I would really appreciate any hint!
> 
> Thanks,
> David
> 
> 
> Am 09.05.2011 um 12:45 schrieb Ulrich Poppendieck:
> 
>> Not an answer, but a follow-up question: 
>> I would be interested in the very same thing, but with the possibility to assign new sites to existing clusters OR to new ones.
>> 
>> Thanks in advance,
>> Ulrich
>> 
>> -----Ursprüngliche Nachricht-----
>> Von: David Saile [mailto:david@uni-koblenz.de] 
>> Gesendet: Montag, 9. Mai 2011 11:53
>> An: user@mahout.apache.org
>> Betreff: Incremental clustering
>> 
>> Hi list,
>> 
>> I am completely new to Mahout, so please forgive me if the answer to my question is too obvious.
>> 
>> For a case study, I am working on a simple incremental web crawler (much like Nutch) and I want to include a very simple indexing step that incorporates clustering of documents.
>> 
>> I was hoping to use some kind of incremental clustering algorithm, in order to make use of the incremental way the crawler is supposed to work (i.e. continuously adding and updating websites).
>> 
>> Is there some way to achieve the following: 	
>> 	1) initial clustering of the first web-crawl
>> 	2) assigning new sites to existing clusters
>> 	3) possibly moving modified sites between clusters
>> 
>> I would really appreciate any help!
>> 
>> Thanks,
>> David
> 

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem docs using Solr/Lucene:
http://www.lucidimagination.com/search


Re: AW: Incremental clustering

Posted by David Saile <da...@uni-koblenz.de>.
I am still stuck at this problem.

Can anyone give me a heads-up on how existing systems handle this? 
If a collection of documents is modified, is the clustering recomputed from scratch each time? 
Or is there in fact any incremental way to handle an evolving set of documents?

I would really appreciate any hint!

Thanks,
David


On 09.05.2011 at 12:45, Ulrich Poppendieck wrote:

> Not an answer, but a follow-up question: 
> I would be interested in the very same thing, but with the possibility to assign new sites to existing clusters OR to new ones.
> 
> Thanks in advance,
> Ulrich
> 
> -----Ursprüngliche Nachricht-----
> Von: David Saile [mailto:david@uni-koblenz.de] 
> Gesendet: Montag, 9. Mai 2011 11:53
> An: user@mahout.apache.org
> Betreff: Incremental clustering
> 
> Hi list,
> 
> I am completely new to Mahout, so please forgive me if the answer to my question is too obvious.
> 
> For a case study, I am working on a simple incremental web crawler (much like Nutch) and I want to include a very simple indexing step that incorporates clustering of documents.
> 
> I was hoping to use some kind of incremental clustering algorithm, in order to make use of the incremental way the crawler is supposed to work (i.e. continuously adding and updating websites).
> 
> Is there some way to achieve the following: 	
> 	1) initial clustering of the first web-crawl
> 	2) assigning new sites to existing clusters
> 	3) possibly moving modified sites between clusters
> 
> I would really appreciate any help!
> 
> Thanks,
> David


AW: Incremental clustering

Posted by Ulrich Poppendieck <ul...@vico-research.com>.
Not an answer, but a follow-up question: 
I would be interested in the very same thing, but with the possibility to assign new sites to existing clusters OR to new ones.

Thanks in advance,
Ulrich

-----Original Message-----
From: David Saile [mailto:david@uni-koblenz.de] 
Sent: Monday, 9 May 2011 11:53
To: user@mahout.apache.org
Subject: Incremental clustering

Hi list,

I am completely new to Mahout, so please forgive me if the answer to my question is too obvious.

For a case study, I am working on a simple incremental web crawler (much like Nutch) and I want to include a very simple indexing step that incorporates clustering of documents.

I was hoping to use some kind of incremental clustering algorithm, in order to make use of the incremental way the crawler is supposed to work (i.e. continuously adding and updating websites).

Is there some way to achieve the following: 	
	1) initial clustering of the first web-crawl
 	2) assigning new sites to existing clusters
	3) possibly moving modified sites between clusters

I would really appreciate any help!

Thanks,
David