Posted to user@mahout.apache.org by Allan Roberto Avendano Sudario <aa...@fiec.espol.edu.ec> on 2009/08/05 16:47:51 UTC
Mahout clustering parameters
Regards,
I'm trying to run the kmeans syntheticcontrol job on my own dataset, and
everything runs fine.
However, only one cluster is generated. I suspect this is due to the default
parameters of the clustering process.
How would you recommend changing the clustering parameters?
*(2 thresholds and 1 convergenceDelta)*
Which clustering algorithm should be used in an information retrieval process?
Thanks for your help.
--
Allan Avendaño S.
Re: Mahout clustering parameters
Posted by Grant Ingersoll <gs...@apache.org>.
On Aug 5, 2009, at 7:52 PM, Allan Roberto Avendano Sudario wrote:
> 2009/8/5 Grant Ingersoll <gs...@apache.org>
>
>> What parameters did you use in the command line?
>
>
> I'm running the syntheticcontrol kmeans clustering job. Three parameters are
> needed:
> 2 thresholds and 1 convergence criterion for the iterations.
>
> Which values do you recommend for each one?
Synthetic Control is just an example data set. For generic
clustering, use the KMeansDriver. AFAICT, setting those values is
done by trial and error, but others may have more insight.
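
To make the trial and error concrete, here is a minimal Python sketch, not Mahout code. It assumes the two thresholds are canopy-style distance thresholds (named t1 and t2 here, with t1 > t2), which this thread does not spell out. The helper name, the toy points, and the swept values are all made up for illustration. It shows how thresholds that are large relative to the distances in the data collapse everything into a single cluster.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopies(points, t1, t2, distance=euclidean):
    """Simplified canopy pass (hypothetical helper, not a Mahout API).

    Returns a list of (center, members) pairs. Points within t1 of a
    center join its canopy; points within t2 stop being candidates for
    new centers.
    """
    if t1 <= t2:
        raise ValueError("t1 must be larger than t2")
    candidates = list(points)
    result = []
    while candidates:
        center = candidates.pop(0)
        members = [p for p in points if distance(center, p) <= t1]
        result.append((center, members))
        # Drop every candidate that is tightly bound to this center.
        candidates = [p for p in candidates if distance(center, p) > t2]
    return result


# Toy data: two tight groups that are far apart.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]

# Sweep a few (t1, t2) pairs and watch the canopy count change.
counts = {}
for t1, t2 in [(40, 20), (6, 3), (1, 0.5)]:
    counts[(t1, t2)] = len(canopies(points, t1, t2))
    print(f"t1={t1} t2={t2} -> {counts[(t1, t2)]} canopies")
```

On this toy data the count goes from one canopy to two to six as the thresholds shrink, so a single resulting cluster may simply mean the thresholds are too large for the distance scale of your vectors.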
>
>
>>
>> There are a couple of threads in the archives that are likely of
>> interest
>> along these lines:
>> http://www.lucidimagination.com/search/p:mahout?q=clustering#/p:mahout/s:email/l:user
>>
>> Are you trying to cluster text? Or something else?
>>
>
> Yes, I'm trying to cluster text. I've built a tf-idf matrix composed of
> sparse vectors. Does syntheticcontrol kmeans clustering work well with
> sparse vectors?
KMeans works fine with sparse vectors, although you might want to wait for
MAHOUT-121 to be resolved, as it brings a pretty significant speedup.
It should be done in a few days. Either that, or try the patch that is
already there.
As I understand it, you will need to match up your L-norm with your
distance measure to some extent, but see the archive thread: http://www.lucidimagination.com/search/document/4a0e528982b2dac3/document_clustering
http://cwiki.apache.org/MAHOUT/clusteringyourdata.html has some
information, but needs to be filled in more.
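
One way to read the point about matching the L-norm to the distance measure is sketched below in plain Python, not Mahout code. Sparse vectors are modelled as dicts of term to weight, and the tf-idf weights and helper names are made up for illustration. The sketch checks a standard identity: once vectors are scaled to unit L2 length, squared Euclidean distance is exactly twice the cosine distance.

```python
import math


def l2_normalize(vec):
    """Scale a sparse vector (dict of term -> weight) to unit L2 length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    if norm == 0:
        return dict(vec)
    return {term: w / norm for term, w in vec.items()}


def dot(a, b):
    """Dot product that only visits terms present in both vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b[term] for term, w in a.items() if term in b)


def cosine_distance(a, b):
    """1 - cosine similarity; both vectors are assumed to be non-zero."""
    return 1.0 - dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))


def squared_euclidean(a, b):
    """Squared Euclidean distance over the union of terms."""
    terms = set(a) | set(b)
    return sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms)


# Made-up tf-idf weights for two short documents.
doc_a = {"mahout": 2.0, "kmeans": 1.5, "cluster": 0.5}
doc_b = {"kmeans": 3.0, "cluster": 1.0, "hadoop": 4.0}

unit_a, unit_b = l2_normalize(doc_a), l2_normalize(doc_b)

raw_gap = squared_euclidean(doc_a, doc_b)     # sensitive to vector length
unit_gap = squared_euclidean(unit_a, unit_b)  # length effect removed
cos_gap = cosine_distance(doc_a, doc_b)       # unaffected by scaling

print(raw_gap, unit_gap, 2 * cos_gap)
```

On raw vectors, Euclidean distance also reflects how long each document is. After L2 normalization it ranks pairs the same way cosine distance does, which is one sense in which the norm and the distance measure can be made to agree.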
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search
Re: Mahout clustering parameters
Posted by Allan Roberto Avendano Sudario <aa...@fiec.espol.edu.ec>.
2009/8/5 Grant Ingersoll <gs...@apache.org>
> What parameters did you use in the command line?
I'm running the syntheticcontrol kmeans clustering job. Three parameters are needed:
2 thresholds and 1 convergence criterion for the iterations.
Which values do you recommend for each one?
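
For readers unsure what a convergence criterion does in k-means, here is a rough Python sketch, not Mahout code. It assumes the criterion is "stop once no centroid moves farther than the delta between iterations"; the function name, the convergence_delta and max_iterations parameters, and the toy data are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(component) / n for component in zip(*points))


def kmeans(points, initial_centers, convergence_delta, max_iterations=10):
    """Plain k-means; returns (centers, iterations actually run)."""
    centers = list(initial_centers)
    iterations = 0
    while iterations < max_iterations:
        iterations += 1
        # Assign every point to its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: euclidean(p, centers[i]))
            groups[nearest].append(p)
        # Recompute centers; an empty group keeps its old center.
        new_centers = [mean(g) if g else c for g, c in zip(groups, centers)]
        shift = max(euclidean(c, n) for c, n in zip(centers, new_centers))
        centers = new_centers
        # Converged: no center moved farther than the delta.
        if shift <= convergence_delta:
            break
    return centers, iterations


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
start = [(0, 0), (10, 10)]

loose_centers, loose_runs = kmeans(points, start, convergence_delta=0.5)
tight_centers, tight_runs = kmeans(points, start, convergence_delta=0.01)
print(loose_runs, tight_runs)
```

A larger delta stops earlier with slightly rougher centers, while a smaller one runs extra iterations. On this toy data the loose setting stops after one iteration and the tight one after two.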
>
> There are a couple of threads in the archives that are likely of interest
> along these lines:
> http://www.lucidimagination.com/search/p:mahout?q=clustering#/p:mahout/s:email/l:user
>
> Are you trying to cluster text? Or something else?
>
Yes, I'm trying to cluster text. I've built a tf-idf matrix composed of
sparse vectors. Does syntheticcontrol kmeans clustering work well with sparse
vectors?
Thanks again.
> On Aug 5, 2009, at 10:47 AM, Allan Roberto Avendano Sudario wrote:
>
> Regards,
>>
>> I'm trying to run the kmeans syntheticcontrol job on my own dataset, and
>> everything runs fine.
>> However, only one cluster is generated. I suspect this is due to the default
>> parameters of the clustering process.
>>
>> How would you recommend changing the clustering parameters?
>> *(2 thresholds and 1 convergenceDelta)*
>>
>> Which clustering algorithm should be used in an information retrieval
>> process?
>>
>> Thanks for your help.
>>
>> --
>> Allan Avendaño S.
>>
>
>
>
--
Allan Avendaño S.
Home: 04 2 800 692
Cell: 09 700 42 48
Re: Mahout clustering parameters
Posted by Grant Ingersoll <gs...@apache.org>.
What parameters did you use in the command line?
There are a couple of threads in the archives that are likely of
interest along these lines:
http://www.lucidimagination.com/search/p:mahout?q=clustering#/p:mahout/s:email/l:user
Are you trying to cluster text? Or something else?
On Aug 5, 2009, at 10:47 AM, Allan Roberto Avendano Sudario wrote:
> Regards,
>
> I'm trying to run the kmeans syntheticcontrol job on my own dataset, and
> everything runs fine.
> However, only one cluster is generated. I suspect this is due to the
> default parameters of the clustering process.
>
> How would you recommend changing the clustering parameters?
> *(2 thresholds and 1 convergenceDelta)*
>
> Which clustering algorithm should be used in an information retrieval
> process?
>
> Thanks for your help.
>
> --
> Allan Avendaño S.