Posted to user@mahout.apache.org by Allan Roberto Avendano Sudario <aa...@fiec.espol.edu.ec> on 2009/08/05 16:47:51 UTC

Mahout clustering parameters

Hello,

I'm trying to run the kmeans syntheticcontrol job on my own dataset, and
the job itself runs without errors. However, only one cluster is generated.
I suppose this is due to the default parameters of the clustering process.

How would you recommend setting the clustering parameters
*(the 2 thresholds and 1 convergenceDelta)*?

Which clustering algorithm would you suggest for an information retrieval
process?

Thanks for your help.

-- 
Allan Avendaño S.

Re: Mahout clustering parameters

Posted by Grant Ingersoll <gs...@apache.org>.
On Aug 5, 2009, at 7:52 PM, Allan Roberto Avendano Sudario wrote:

> 2009/8/5 Grant Ingersoll <gs...@apache.org>
>
>> What parameters did you use in the command line?
>
>
> I'm running the syntheticcontrol kmeans clustering. Three parameters are
> needed: two thresholds and one convergence criterion for the iterations.
>
> Which values would you recommend for each one?

Synthetic Control is just an example data set.  For generic  
clustering, use the KMeansDriver.  AFAICT, setting those values is  
done by trial and error, but others may have more insight.
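
A minimal sketch of why the two thresholds matter, under the assumption
(not stated in this thread) that they are the canopy T1 and T2 values the
example job uses to pick the initial k-means centers. The function and
variable names are illustrative and are not Mahout's API. If T2 is large
compared to the distances in the data, the first point swallows everything
and a single cluster comes out, so a small sweep over candidate values is
one way to do the trial and error:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two dense points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def canopy_centers(points, t1, t2, distance=euclidean):
    """Simplified canopy pass (hypothetical helper, not Mahout code).

    Points closer than t1 to a center join its canopy; points closer
    than t2 are also removed from the pool of future centers.
    """
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)
        members = [center]
        still_remaining = []
        for p in remaining:
            d = distance(center, p)
            if d < t1:
                members.append(p)
            if d >= t2:
                still_remaining.append(p)
        remaining = still_remaining
        canopies.append((center, members))
    return canopies

# Toy data: two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (10.0, 0.0)]

# Sweep from thresholds far larger than the data to far smaller.
for t1, t2 in [(80.0, 55.0), (3.0, 1.5), (0.3, 0.15)]:
    print(t1, t2, len(canopy_centers(points, t1, t2)))
```

On this toy data the sweep goes from one canopy, to one per natural group,
to one per point. Printing the range of pairwise distances in the real data
first gives a reasonable starting window for T1 and T2.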


>
>
>>
>> There are a couple of threads in the archives that are likely of  
>> interest
>> along these lines:
>> http://www.lucidimagination.com/search/p:mahout?q=clustering#/p:mahout/s:email/l:user
>>
>> Are you trying to cluster text?  Or something else?
>>
>
> Yes, I'm trying to cluster text. I've built a tf-idf matrix composed of
> sparse vectors. Does the syntheticcontrol kmeans clustering work well
> with sparse vectors?

KMeans works fine with sparse vectors, although you might want to wait for
MAHOUT-121 to be resolved, as it brings a pretty significant speedup.
It should be done in a few days.  Alternatively, try the patch that is
already attached to the issue.
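
To make the sparse case and the convergence parameter concrete, here is a
rough sketch of k-means over sparse vectors stored as {index: value}
dictionaries. It is not Mahout's implementation; the names and the toy
documents are made up for illustration, and it assumes the convergence
delta means "stop once no center moves farther than this in one iteration":

```python
import math

def sparse_distance(a, b):
    """Euclidean distance between sparse vectors stored as dicts."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))

def sparse_mean(vectors):
    """Component-wise mean; absent entries count as zero."""
    total = {}
    for v in vectors:
        for k, x in v.items():
            total[k] = total.get(k, 0.0) + x
    n = len(vectors)
    return {k: x / n for k, x in total.items()}

def kmeans(vectors, centers, convergence_delta, max_iterations=10):
    """Return (centers, iterations_run) for a plain k-means loop."""
    for iteration in range(1, max_iterations + 1):
        # Assign every vector to its nearest center.
        groups = [[] for _ in centers]
        for v in vectors:
            nearest = min(range(len(centers)),
                          key=lambda i: sparse_distance(v, centers[i]))
            groups[nearest].append(v)
        # Recompute centers; an empty group keeps its old center.
        new_centers = [sparse_mean(g) if g else c
                       for g, c in zip(groups, centers)]
        moved = max(sparse_distance(c, n)
                    for c, n in zip(centers, new_centers))
        centers = new_centers
        if moved <= convergence_delta:
            break
    return centers, iteration

# Four toy "documents" over disjoint groups of term indices.
docs = [{0: 1.0, 1: 0.2}, {0: 0.9}, {5: 1.0, 6: 0.3}, {5: 0.8, 6: 0.4}]
centers, iterations = kmeans(docs, [docs[0], docs[2]], convergence_delta=0.01)
print(iterations, centers)
```

A larger delta stops earlier with rougher centers, and a smaller one runs
more iterations, so it trades run time against precision rather than
changing how many clusters come out.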

As I understand it, you will need to match up your L-norm with your  
distance measure to some extent, but see the archive thread: http://www.lucidimagination.com/search/document/4a0e528982b2dac3/document_clustering
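
One reading of the norm and distance point, offered as an illustration
rather than as what the linked thread concludes: if the tf-idf vectors are
scaled to unit L2 length, squared Euclidean distance and cosine distance
rank pairs the same way, whereas on raw vectors Euclidean distance is
dominated by document length. The helper names below are hypothetical:

```python
import math

def dot(a, b):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(x * b.get(k, 0.0) for k, x in a.items())

def l2_normalize(v):
    """Scale a sparse vector to unit L2 length (zero vectors are copied)."""
    norm = math.sqrt(dot(v, v))
    return {k: x / norm for k, x in v.items()} if norm else dict(v)

def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    return 1.0 - dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))

def squared_euclidean(a, b):
    keys = set(a) | set(b)
    return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys)

short_doc = {0: 1.0, 1: 2.0}
long_doc = {0: 10.0, 1: 20.0}   # same term mix, ten times longer
other_doc = {2: 3.0}            # shares no terms with the others

# Raw vectors: same direction, yet far apart in Euclidean terms.
print(cosine_distance(short_doc, long_doc))
print(squared_euclidean(short_doc, long_doc))

# Unit-length vectors: squared Euclidean equals 2 * cosine distance.
u, w = l2_normalize(short_doc), l2_normalize(other_doc)
print(squared_euclidean(u, w), 2 * cosine_distance(u, w))
```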

http://cwiki.apache.org/MAHOUT/clusteringyourdata.html has some  
information, but needs to be filled in more.

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search


Re: Mahout clustering parameters

Posted by Allan Roberto Avendano Sudario <aa...@fiec.espol.edu.ec>.
2009/8/5 Grant Ingersoll <gs...@apache.org>

> What parameters did you use in the command line?


I'm running the syntheticcontrol kmeans clustering. Three parameters are
needed: two thresholds and one convergence criterion for the iterations.

Which values would you recommend for each one?


>
> There are a couple of threads in the archives that are likely of interest
> along these lines:
> http://www.lucidimagination.com/search/p:mahout?q=clustering#/p:mahout/s:email/l:user
>
> Are you trying to cluster text?  Or something else?
>

Yes, I'm trying to cluster text. I've built a tf-idf matrix composed of
sparse vectors. Does the syntheticcontrol kmeans clustering work well with
sparse vectors?

Thanks again.




-- 
Allan Avendaño S.

Re: Mahout clustering parameters

Posted by Grant Ingersoll <gs...@apache.org>.
What parameters did you use in the command line?

There are a couple of threads in the archives that are likely of
interest along these lines:
http://www.lucidimagination.com/search/p:mahout?q=clustering#/p:mahout/s:email/l:user

Are you trying to cluster text?  Or something else?
