
Posted to user@mahout.apache.org by Bogdan Vatkov <bo...@gmail.com> on 2010/01/07 03:20:57 UTC

Mahout algorithms guide

Hi,

I am wondering whether the different algorithms available in Mahout produce
different results and behave differently (e.g. in performance: memory, speed,
etc.), and if so, whether we could have a short (2-3 sentences per algorithm)
description of each.
For example, how do they behave with respect to:
- number of documents
- average document size
- documents of very different sizes (e.g. half of the docs are very small
and the other half very large; would either group dominate for some
reason during clustering?)
- cluster size
- ratio of number of documents to cluster size
- memory needed
- time needed

For example, right now I am interested in clustering documents:
- of similar size (most of the documents are very close to the average
size)
- with a desired ratio of docs to clusters of 23 000 : 80 (or maybe even : 40
and : 20)
Which Mahout algorithm, and with which parameters, is recommended for my
case?

Of course I could run my data through all the possible algorithms and then
compare the results, but it would be good to know whether using one
algorithm or another leads to a particular flavor of result, especially if
this is already known from the specifics of the algorithms.

Best regards,
Bogdan

Re: Mahout algorithms guide

Posted by Ted Dunning <te...@gmail.com>.

Absolutely true.

On Fri, Jan 8, 2010 at 2:38 AM, Olivier Grisel <ol...@ensta.org> wrote:

> That would be indeed great to have a global ready-to-run benchmark /
> shootout driver
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Mahout algorithms guide

Posted by Olivier Grisel <ol...@ensta.org>.


It would indeed be great to have a global ready-to-run benchmark /
shootout driver that runs all the available algorithms for a given
task (e.g. document clustering, classification, ...) and that could be
run on Amazon Elastic MapReduce with a couple of clicks, using data
already available on a public S3 account.

The results would be a comparative shootout report (with performance
measures), published on the Mahout website regularly.
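
To make the shootout idea concrete, here is a minimal local sketch in
Python. It is not Mahout code and does not touch Elastic MapReduce or S3:
the two "algorithms" (by_length, round_robin) are hypothetical toy
stand-ins, and the report fields (seconds, peak_bytes, clusters_used) are
assumptions about what such a report might contain. It only shows the shape
of a driver that runs every registered algorithm on the same data and
collects comparable measurements.

```python
import time
import tracemalloc
from typing import Callable, Dict, List, Sequence

# Hypothetical stand-ins for real clustering jobs: each takes the documents
# and the desired number of clusters and returns one cluster id per document.
Clusterer = Callable[[Sequence[str], int], List[int]]


def by_length(docs: Sequence[str], k: int) -> List[int]:
    """Toy clusterer: bucket documents by their length."""
    return [len(d) % k for d in docs]


def round_robin(docs: Sequence[str], k: int) -> List[int]:
    """Toy clusterer: assign documents to clusters in turn."""
    return [i % k for i in range(len(docs))]


def shootout(algorithms: Dict[str, Clusterer],
             docs: Sequence[str], k: int) -> List[dict]:
    """Run every algorithm on the same input and collect one report row each."""
    report = []
    for name, run in algorithms.items():
        tracemalloc.start()                      # track Python-level allocations
        started = time.perf_counter()
        labels = run(docs, k)
        elapsed = time.perf_counter() - started
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        report.append({
            "algorithm": name,
            "seconds": elapsed,
            "peak_bytes": peak,
            "clusters_used": len(set(labels)),
        })
    return report


if __name__ == "__main__":
    # Synthetic corpus sized like the 23 000 : 80 case from the question.
    docs = ["doc %d " % i * (1 + i % 5) for i in range(23000)]
    algorithms = {"by_length": by_length, "round_robin": round_robin}
    for row in shootout(algorithms, docs, 80):
        print(row)
```

A real driver would replace the toy functions with calls that launch the
actual clustering jobs and would add a quality measure next to the time and
memory columns.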

-- 
Olivier
http://twitter.com/ogrisel - http://code.oliviergrisel.name