Posted to user@mahout.apache.org by Asif Rahman <as...@newscred.com> on 2010/07/16 17:27:45 UTC

Updating clusters

Can anyone provide some advice on how to update an existing clustering with
new data points?  Our data set is approximately 1 million newspaper headlines
over the course of a month.  I'm able to get a high-quality clustering using
the existing Mahout jobs (I'm just using canopy in this instance), but I'd
like to update the clusters on an hourly basis.  Given the hardware that is
available to me, I won't be able to run the clustering to completion over the
entire data set every hour.  Are there any methods for accomplishing such a
task?

Since I'm not a Mahout or linear-algebra expert at this point, ideally the
solution would involve a combination of the existing Mahout jobs.  That being
said, I'd appreciate any and all advice.

Thanks,

Asif


-- 
Asif Rahman
Lead Engineer - NewsCred
asif@newscred.com
http://platform.newscred.com

Re: Updating clusters

Posted by Asif Rahman <as...@newscred.com>.
It seems to be.  Intuitively, it might actually be better, since there is a
higher proportion of meaningful words in the headline than in the body.

On Fri, Jul 16, 2010 at 1:49 PM, Ted Dunning <te...@gmail.com> wrote:

> Is headline enough?
>
> On Fri, Jul 16, 2010 at 10:02 AM, Asif Rahman <as...@newscred.com> wrote:
>
> > In between, as articles come in, I could use the headline to query
> > against the cluster set that I have indexed in Solr and add the article
> > to the most similar cluster.
> >
>



-- 
Asif Rahman
Lead Engineer - NewsCred
asif@newscred.com
http://platform.newscred.com

Re: Updating clusters

Posted by Ted Dunning <te...@gmail.com>.
Is headline enough?

On Fri, Jul 16, 2010 at 10:02 AM, Asif Rahman <as...@newscred.com> wrote:

> In between, as articles come in, I could use the headline to query
> against the cluster set that I have indexed in Solr and add the article to
> the most similar cluster.
>

Re: Updating clusters

Posted by Asif Rahman <as...@newscred.com>.
As soon as I can get this puppy into production, I'll add our name to that
list.

That idea has a lot of potential.  We get anywhere between 1,000 and 2,000
articles per hour.  Since we're looking at news, articles published within,
say, 24 hours of each other have a lot of affinity to form a cluster, while
articles that are days apart have almost no affinity to each other.  Every 6
hours or so, I could run the clustering on just the new articles since the
last clustering.  In between, as articles come in, I could use the headline
to query against the cluster set that I have indexed in Solr and add the
article to the most similar cluster.

I think I just repeated, in different words, exactly what you said below.
Thanks in any case for helping me reason this out.
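
To make that in-between step concrete, here is a minimal sketch of the
headline lookup, assuming a SolrJ client and a hypothetical "clusters" index
with cluster_id and top_terms fields (the field names and the class name are
invented for illustration; CommonsHttpSolrServer is the SolrJ client class of
that era and was renamed in later Solr releases):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class HeadlineClusterAssigner {

  // One Solr document per cluster: its id in "cluster_id" and its top terms
  // indexed in "top_terms".  Both field names are made up for this sketch.
  private final SolrServer solr;

  public HeadlineClusterAssigner(String clusterIndexUrl) throws Exception {
    this.solr = new CommonsHttpSolrServer(clusterIndexUrl);
  }

  /** Returns the id of the most similar cluster, or null if nothing matches well enough. */
  public String assign(String headline, float minScore) throws Exception {
    // Crude sanitization so Lucene query-syntax characters in the headline
    // don't break the query; a real implementation would use the same
    // analyzer as the indexed top_terms field.
    String cleaned = headline.replaceAll("[^\\p{L}\\p{Nd} ]", " ");

    SolrQuery query = new SolrQuery();
    query.setQuery("top_terms:(" + cleaned + ")");
    query.setRows(1);
    query.setFields("cluster_id", "score");

    QueryResponse rsp = solr.query(query);
    SolrDocumentList hits = rsp.getResults();
    if (hits.isEmpty()) {
      return null;  // no candidate cluster; hold the article for the next full run
    }
    SolrDocument best = hits.get(0);
    float score = ((Number) best.getFieldValue("score")).floatValue();
    return score >= minScore ? (String) best.getFieldValue("cluster_id") : null;
  }
}

If nothing scores above minScore, the article can simply be held for the next
full canopy run rather than forced into a poor cluster.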

On Fri, Jul 16, 2010 at 12:06 PM, Grant Ingersoll <gs...@apache.org> wrote:

>
> On Jul 16, 2010, at 11:27 AM, Asif Rahman wrote:
>
> > Can anyone provide some advice on how to update an existing clustering
> > with new data points?  Our data set is approximately 1 million newspaper
> > headlines over the course of a month.  I'm able to get a high-quality
> > clustering using the existing Mahout jobs (I'm just using canopy in this
> > instance)
>
> [OT] Care to share more (since you've already said you are using it)?
> https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout
>
> > but I'd like
> > to update the clusters on an hourly basis.  Given the hardware that is
> > available to me, I won't be able to run the clustering to completion over
> > the entire data set every hour.  Are there any methods for accomplishing
> > such a task?
>
> How many new docs are you talking about in that hour?  I'm sure others can
> add more here, but as I understand it, people in this situation often
> calculate the clusters once; then, for new docs in some time period, they
> just see which cluster each new document is closest to and add it there,
> and offline or "later" they recluster the whole set.  So, for instance,
> perhaps nightly or every 6 hours or whatever you can afford, you do the
> whole job, but in between you just do the lighter-weight calculation.  I
> imagine there are probably ways of calculating when a new cluster is needed
> or when quality has dropped too much, so perhaps that could be used to
> trigger a new full run, too.
>
> >
> > Since I'm not a Mahout or linear-algebra expert at this point, ideally
> > the solution would involve a combination of the existing Mahout jobs.
> > That being said, I'd appreciate any and all advice.
> >
> > Thanks,
> >
> > Asif
> >
> >
> > --
> > Asif Rahman
> > Lead Engineer - NewsCred
> > asif@newscred.com
> > http://platform.newscred.com
>
>


-- 
Asif Rahman
Lead Engineer - NewsCred
asif@newscred.com
http://platform.newscred.com

Re: Updating clusters

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
I agree with Grant; since your cluster centers can be expected to change
slowly over time as new documents are added, you can just use the existing
clusters to cluster the new documents hourly, then recompute over the whole
corpus periodically to update the clusters.  You can also get pretty good
clusters by randomly sampling the documents in your corpus.  Currently the
runClustering methods in the various drivers are private; perhaps this is a
use case for making them public?
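
On the random-sampling point, a uniform sample of the corpus can be
maintained without any Mahout machinery at all.  Here is a minimal
reservoir-sampling sketch in plain Java (class and method names are just
illustrative) that could feed the periodic full re-clustering:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/**
 * Reservoir sampling (Algorithm R): keeps a uniform random sample of k items
 * from a stream of unknown length, so the periodic full re-clustering can run
 * over a manageable subset of the corpus.
 */
public class ReservoirSampler<T> {

  private final int k;
  private final List<T> reservoir;
  private final Random random = new Random();
  private long seen = 0;

  public ReservoirSampler(int k) {
    this.k = k;
    this.reservoir = new ArrayList<T>(k);
  }

  public void add(T item) {
    seen++;
    if (reservoir.size() < k) {
      reservoir.add(item);                           // fill the reservoir first
    } else {
      long j = (long) (random.nextDouble() * seen);  // uniform in [0, seen)
      if (j < k) {
        reservoir.set((int) j, item);                // keep with probability k/seen
      }
    }
  }

  public List<T> sample() {
    return reservoir;
  }
}

Feeding every incoming article through add() keeps a fixed-size, uniformly
random subset on hand, so the periodic full clustering could run over, say, a
few hundred thousand sampled headlines instead of the whole month.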


On 7/16/10 9:06 AM, Grant Ingersoll wrote:
> On Jul 16, 2010, at 11:27 AM, Asif Rahman wrote:
>
>    
>> Can anyone provide some advice on how to update an existing clustering with
>> new data points?  Our data set is approximately 1 million newspaper headlines
>> over the course of a month.  I'm able to get a high-quality clustering using
>> the existing Mahout jobs (I'm just using canopy in this instance)
>>
> [OT] Care to share more (since you've already said you are using it)?  https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout
>
>
>> but I'd like
>> to update the clusters on an hourly basis.  Given the hardware that is
>> available to me, I won't be able to run the clustering to completion over
>> the entire data set every hour.  Are there any methods for accomplishing such
>> a task?
>>      
> How many new docs are you talking about in that hour?  I'm sure others can add more here, but as I understand it, people in this situation often calculate the clusters once; then, for new docs in some time period, they just see which cluster each new document is closest to and add it there, and offline or "later" they recluster the whole set.  So, for instance, perhaps nightly or every 6 hours or whatever you can afford, you do the whole job, but in between you just do the lighter-weight calculation.  I imagine there are probably ways of calculating when a new cluster is needed or when quality has dropped too much, so perhaps that could be used to trigger a new full run, too.
>
>    
>> Since I'm not a Mahout or linear-algebra expert at this point, ideally the
>> solution would involve a combination of the existing Mahout jobs.  That
>> being said, I'd appreciate any and all advice.
>>
>> Thanks,
>>
>> Asif
>>
>>
>> -- 
>> Asif Rahman
>> Lead Engineer - NewsCred
>> asif@newscred.com
>> http://platform.newscred.com
>>      
>
>    


Re: Updating clusters

Posted by Grant Ingersoll <gs...@apache.org>.
On Jul 16, 2010, at 11:27 AM, Asif Rahman wrote:

> Can anyone provide some advice on how to update an existing clustering with
> new data points?  Our data set is approximately 1 million newspaper headlines
> over the course of a month.  I'm able to get a high-quality clustering using
> the existing Mahout jobs (I'm just using canopy in this instance)

[OT] Care to share more (since you've already said you are using it)?  https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout

> but I'd like
> to update the clusters on an hourly basis.  Given the hardware that is
> available to me, I won't be able to run the clustering to completion over
> the entire data set every hour.  Are there any methods for accomplishing such
> a task?

How many new docs are you talking about in that hour?  I'm sure others can add more here, but as I understand it, people in this situation often calculate the clusters once; then, for new docs in some time period, they just see which cluster each new document is closest to and add it there, and offline or "later" they recluster the whole set.  So, for instance, perhaps nightly or every 6 hours or whatever you can afford, you do the whole job, but in between you just do the lighter-weight calculation.  I imagine there are probably ways of calculating when a new cluster is needed or when quality has dropped too much, so perhaps that could be used to trigger a new full run, too.
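
As a rough sketch of that lighter-weight step plus a quality trigger for the
full re-run (plain double[] TF-IDF vectors and invented class names, not
Mahout's own API):

import java.util.List;

/**
 * In-between-runs assignment: put each new document into the closest existing
 * cluster (by cosine similarity to its centroid) and track the running average
 * similarity; when that average drops below a threshold, signal a full re-run.
 * Assumes at least one existing cluster centroid.
 */
public class IncrementalAssigner {

  private final List<double[]> centroids;   // one centroid per existing cluster
  private final double qualityThreshold;    // e.g. 0.2; below this, recluster
  private double similaritySum = 0;
  private long assigned = 0;

  public IncrementalAssigner(List<double[]> centroids, double qualityThreshold) {
    this.centroids = centroids;
    this.qualityThreshold = qualityThreshold;
  }

  /** Returns the index of the closest cluster for this document vector. */
  public int assign(double[] doc) {
    int best = -1;
    double bestSim = -1;
    for (int i = 0; i < centroids.size(); i++) {
      double sim = cosine(doc, centroids.get(i));
      if (sim > bestSim) {
        bestSim = sim;
        best = i;
      }
    }
    similaritySum += bestSim;
    assigned++;
    return best;
  }

  /** True when average assignment quality has dropped enough to warrant a full re-clustering. */
  public boolean needsFullRun() {
    return assigned > 0 && similaritySum / assigned < qualityThreshold;
  }

  private static double cosine(double[] a, double[] b) {
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      na += a[i] * a[i];
      nb += b[i] * b[i];
    }
    return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
  }
}

When needsFullRun() returns true, that could be the signal to kick off the
regular full clustering job and rebuild the centroids.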

> 
> Since I'm not a Mahout or linear-algebra expert at this point, ideally the
> solution would involve a combination of the existing Mahout jobs.  That
> being said, I'd appreciate any and all advice.
> 
> Thanks,
> 
> Asif
> 
> 
> -- 
> Asif Rahman
> Lead Engineer - NewsCred
> asif@newscred.com
> http://platform.newscred.com