Posted to user@mahout.apache.org by Ioan Eugen Stan <st...@gmail.com> on 2012/05/14 14:34:39 UTC

online clustering with mahout

Hi,

Does Mahout offer online clustering out of the box using sequential
clustering (no MapReduce)? I'm looking over the code (trunk) and I
found ClusterClassifier, but I can't figure out how it works. Are there
any examples or more docs on this topic?

Thanks,
-- 
Ioan Eugen Stan
http://ieugen.blogspot.com/  *** http://bucharest-jug.github.com/ ***

Re: online clustering with mahout

Posted by Jeff Eastman <jd...@windwardsolutions.com>.

+1 you've got it with the iterator and classifier.

Mahout really doesn't have good support yet for online clustering. The 
problem you note will occur if new documents introduce new terms that 
are not in the dictionary. You can fudge a bit by widening the cluster 
center vectors so the DistanceMeasures don't complain when new terms are 
encountered during classification. As the vectors are sparse, the extra 
terms will just be zero.
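
A minimal sketch of the widening idea in plain Java, without Mahout's own
Vector classes. SparseVec, squaredDistance and the widened cardinality of 8
are hypothetical names and values chosen for illustration only. Both the
cluster center and the new document are declared with more slots than the
current dictionary needs, so a term first seen after clustering can take the
next free index without causing a size mismatch:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch (not Mahout code): sparse vectors with spare capacity. */
class WidenedVectorSketch {

    /** Sparse vector: term index -> weight, plus a declared cardinality. */
    static final class SparseVec {
        final int cardinality;
        final Map<Integer, Double> weights = new HashMap<>();

        SparseVec(int cardinality) {
            this.cardinality = cardinality;
        }

        void set(int index, double value) {
            if (index < 0 || index >= cardinality) {
                throw new IndexOutOfBoundsException("index " + index + " outside cardinality " + cardinality);
            }
            weights.put(index, value);
        }

        double get(int index) {
            // Entries that were never set read as zero.
            return weights.getOrDefault(index, 0.0);
        }
    }

    /** Squared Euclidean distance; rejects vectors of different cardinality. */
    static double squaredDistance(SparseVec a, SparseVec b) {
        if (a.cardinality != b.cardinality) {
            throw new IllegalArgumentException("cardinality mismatch");
        }
        Set<Integer> indices = new HashSet<>(a.weights.keySet());
        indices.addAll(b.weights.keySet());
        double sum = 0.0;
        for (int i : indices) {
            double d = a.get(i) - b.get(i);
            sum += d * d;
        }
        return sum;
    }

    public static void main(String[] args) {
        int dictionarySize = 3; // terms known when the clusters were built
        int widened = 8;        // arbitrary spare capacity for later terms

        SparseVec center = new SparseVec(widened);
        center.set(0, 1.0);
        center.set(2, 2.0);

        SparseVec doc = new SparseVec(widened);
        doc.set(0, 1.0);
        doc.set(dictionarySize, 3.0); // a new term takes the next free index

        System.out.println(squaredDistance(center, doc));
    }
}
```

The center simply has no entry for the new term, so that position contributes
only the document's own weight to the distance.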

Of course, you will need to re-cluster periodically as your models
evolve. You really ought to look at Ted Dunning's large-scale k-means
clustering (https://github.com/tdunning/knn) too.

Please keep us posted on how your work evolves.

On 5/15/12 6:50 AM, Ioan Eugen Stan wrote:
> Hello Jeff,
>
> 2012/5/14 Jeff Eastman<jd...@windwardsolutions.com>:
>> Look at ClusterIterator.iterate(). This will do clustering in memory without
>> any Hadoop. ClusterIterator.iterateSeq will do clustering in a single
>> process from/to Hadoop sequence files but without map/reduce.
>> ClusterIterator.iterateMR uses full Hadoop to do clustering for the same
>> algorithms (k-means, fuzzy-k, Dirichlet), all configured using
>> ClusteringPolicy instances.
> Thanks for the response. It's exactly what I need.
>
> From what I can figure out (please correct me if I'm wrong), the
> scenario will look like this in my case:
>
> - vectorize my documents and run ClusterIterator.iterate*() to get
> back a ClusterClassifier.
> - call ClusterClassifier.classify(newDocumentVector) to get a list of
> probabilities as to which cluster my new document belongs to.
>
> However, there are some issues that I can't get my head around.
>
> How do I make the new vector use the dictionary from my model, so that
> the vectors have terms at the same positions and the classifier can
> correctly compute distances between the new vector and the model?
> Another way to put it: doing online clustering with text documents
> will result in vectors that contain elements/terms that do not exist
> in the model. Doesn't this mean I will get an IndexOutOfBoundsException
> or some other exception when I try to classify()?
>
> Does Mahout offer some support for updating the model?
>
> Thanks,
>

Re: online clustering with mahout

Posted by Paritosh Ranjan <pr...@xebia.com>.

The ClusterIterator is already used by the KMeans, FuzzyK and Dirichlet
clustering implementations in Mahout. They can run either sequentially
(in memory, without Hadoop) or on Hadoop, simply by setting
runSequential=true/false.
Look at the run methods of KMeansDriver, FuzzyKMeansDriver and
DirichletDriver if you want to use them.

If you want to use ClusterIterator directly, then maybe this snippet can
help:

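    // Assumes conf, input, output, clusters, convergenceDelta, maxIterations
    // and runSequential are already defined by the caller.
    Path priorClustersPath = new Path(output, Cluster.INITIAL_CLUSTERS_DIR);
    ClusteringPolicy policy = new KMeansClusteringPolicy(convergenceDelta); // or any other policy
    // Wrap the initial clusters and the policy in a classifier and write it out as the prior
    ClusterClassifier prior = new ClusterClassifier(clusters, policy);
    prior.writeToSeqFiles(priorClustersPath);

    if (runSequential) {
        // single process, from/to sequence files, no map/reduce
        new ClusterIterator().iterateSeq(conf, input, priorClustersPath, output, maxIterations);
    } else {
        // full Hadoop map/reduce
        new ClusterIterator().iterateMR(conf, input, priorClustersPath, output, maxIterations);
    }

    // the resulting clusters are written to output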



Re: online clustering with mahout

Posted by Ioan Eugen Stan <st...@gmail.com>.

Hello Jeff,

2012/5/14 Jeff Eastman <jd...@windwardsolutions.com>:
> Look at ClusterIterator.iterate(). This will do clustering in memory without
> any Hadoop. ClusterIterator.iterateSeq will do clustering in a single
> process from/to Hadoop sequence files but without map/reduce.
> ClusterIterator.iterateMR uses full Hadoop to do clustering for the same
> algorithms (k-means, fuzzy-k, Dirichlet), all configured using
> ClusteringPolicy instances.

Thanks for the response. It's exactly what I need.

From what I can figure out (please correct me if I'm wrong), the
scenario will look like this in my case:

- vectorize my documents and run ClusterIterator.iterate*() to get
back a ClusterClassifier.
- call ClusterClassifier.classify(newDocumentVector) to get a list of
probabilities as to which cluster my new document belongs to.
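
To make the second step concrete, here is a rough plain-Java sketch of how a
list of per-cluster probabilities could be derived from distances to the
cluster centers. It does not call Mahout; MembershipSketch, classify and
mostLikely are hypothetical names, and the 1 / (1 + distance) scoring is an
assumption made for illustration, not something confirmed in this thread:

```java
import java.util.Arrays;

/** Hypothetical sketch (not Mahout code): distances -> normalized scores. */
class MembershipSketch {

    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /**
     * Scores each cluster as 1 / (1 + distance to its center), then divides
     * by the total so that the scores sum to 1. The scoring function is an
     * assumption chosen for illustration.
     */
    static double[] classify(double[][] centers, double[] point) {
        double[] scores = new double[centers.length];
        double total = 0.0;
        for (int k = 0; k < centers.length; k++) {
            scores[k] = 1.0 / (1.0 + euclidean(centers[k], point));
            total += scores[k];
        }
        for (int k = 0; k < scores.length; k++) {
            scores[k] /= total;
        }
        return scores;
    }

    /** Index of the highest score, i.e. the most likely cluster. */
    static int mostLikely(double[] scores) {
        int best = 0;
        for (int k = 1; k < scores.length; k++) {
            if (scores[k] > scores[best]) {
                best = k;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] centers = {{0.0, 0.0}, {10.0, 10.0}};
        double[] scores = classify(centers, new double[] {1.0, 1.0});
        System.out.println(Arrays.toString(scores) + " -> cluster " + mostLikely(scores));
    }
}
```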

However, there are some issues that I can't get my head around.

How do I make the new vector use the dictionary from my model, so that
the vectors have terms at the same positions and the classifier can
correctly compute distances between the new vector and the model?
Another way to put it: doing online clustering with text documents
will result in vectors that contain elements/terms that do not exist
in the model. Doesn't this mean I will get an IndexOutOfBoundsException
or some other exception when I try to classify()?
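
One way to picture the dictionary problem is a plain-Java vectorizer that
reuses the term-to-index mapping fixed at clustering time and skips any term
it does not know. This is only a sketch: DictionaryVectorizerSketch and
vectorize are hypothetical names, raw term counts stand in for whatever
weighting the real vectors use, and dropping unknown terms is just one
possible policy:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch (not Mahout code): vectorize with a fixed dictionary. */
class DictionaryVectorizerSketch {

    /**
     * Builds a sparse term-count vector (index -> count) using the dictionary
     * that was fixed when the clusters were built. Terms missing from the
     * dictionary are skipped, so every index stays within the model's range.
     */
    static Map<Integer, Double> vectorize(Map<String, Integer> dictionary, List<String> tokens) {
        Map<Integer, Double> vector = new HashMap<>();
        for (String token : tokens) {
            Integer index = dictionary.get(token);
            if (index == null) {
                continue; // unknown term: ignored in this sketch
            }
            vector.merge(index, 1.0, Double::sum);
        }
        return vector;
    }

    public static void main(String[] args) {
        Map<String, Integer> dictionary = new HashMap<>();
        dictionary.put("mahout", 0);
        dictionary.put("cluster", 1);

        List<String> tokens = Arrays.asList("mahout", "online", "mahout", "cluster");
        System.out.println(vectorize(dictionary, tokens));
    }
}
```

Skipping unknown terms keeps classification safe but loses information, which
is one reason periodic re-clustering with a rebuilt dictionary is still needed.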

Does Mahout offer some support for updating the model?

Thanks,


-- 
Ioan Eugen Stan
http://ieugen.blogspot.com/  *** http://bucharest-jug.github.com/ ***

Re: online clustering with mahout

Posted by Jeff Eastman <jd...@windwardsolutions.com>.

Look at ClusterIterator.iterate(). This will do clustering in memory 
without any Hadoop. ClusterIterator.iterateSeq will do clustering in a 
single process from/to Hadoop sequence files but without map/reduce. 
ClusterIterator.iterateMR uses full Hadoop to do clustering for the same 
algorithms (k-means, fuzzy-k, Dirichlet), all configured using 
ClusteringPolicy instances.
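
To illustrate what the in-memory variant does conceptually, here is a minimal
plain-Java k-means loop. It does not use Mahout's classes or signatures;
InMemoryKMeansSketch, iterate and nearest are hypothetical names, and only
plain k-means is shown (no fuzzy-k or Dirichlet):

```java
/** Hypothetical sketch (not Mahout code): in-memory k-means iteration. */
class InMemoryKMeansSketch {

    /** Index of the center closest to p by squared Euclidean distance. */
    static int nearest(double[][] centers, double[] p) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centers.length; c++) {
            double dist = 0.0;
            for (int d = 0; d < p.length; d++) {
                double diff = centers[c][d] - p[d];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    /**
     * Runs at most maxIterations assign/update passes and stops early once
     * no center moves by more than convergenceDelta.
     */
    static double[][] iterate(double[][] points, double[][] initialCenters,
                              int maxIterations, double convergenceDelta) {
        int k = initialCenters.length;
        int dim = initialCenters[0].length;
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) {
            centers[c] = initialCenters[c].clone();
        }
        for (int iter = 0; iter < maxIterations; iter++) {
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            // Assignment step: add each point to its nearest center's sum.
            for (double[] p : points) {
                int c = nearest(centers, p);
                counts[c]++;
                for (int d = 0; d < dim; d++) {
                    sums[c][d] += p[d];
                }
            }
            // Update step: move each non-empty center to the mean of its points.
            double maxShift = 0.0;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous center
                }
                double shift = 0.0;
                for (int d = 0; d < dim; d++) {
                    double updated = sums[c][d] / counts[c];
                    double diff = updated - centers[c][d];
                    shift += diff * diff;
                    centers[c][d] = updated;
                }
                maxShift = Math.max(maxShift, Math.sqrt(shift));
            }
            if (maxShift <= convergenceDelta) {
                break;
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 2}, {10, 10}, {10, 12}};
        double[][] initial = {{0, 0}, {10, 10}};
        double[][] centers = iterate(points, initial, 10, 0.001);
        System.out.println(java.util.Arrays.deepToString(centers));
    }
}
```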
