Posted to dev@mahout.apache.org by deneche abdelhakim <ad...@gmail.com> on 2011/03/11 05:33:57 UTC

Training DecisionForests on large scale datasets

Ok, I am working on a new implementation of DecisionForests that should be
able to take real advantage of Hadoop's ability to handle really big
datasets. And by big datasets I mean datasets that are so big they cannot
fit on a single machine's storage disk.

But I am wondering: what are the real-world applications of such an
implementation? I mean, I want to make sure this implementation will be
really useful.

Re: Training DecisionForests on large scale datasets

Posted by Ted Dunning <te...@gmail.com>.
Deneche,

While you are at it, can you make the resulting model be serializable in a
single file with a single method call?  ModelSerializer is a good example of
that.

And make the resulting model extend AbstractVectorClassifier?

This will make the RF implementation usable in certain production settings.
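To make the suggestion concrete, here is a minimal, self-contained sketch of the two properties Ted asks for: a model that extends a vector-classifier contract and that can be written to (and read from) a single file with a single method call, in the spirit of Mahout's ModelSerializer. The class and method names below are hypothetical illustrations, not Mahout's actual API; the scoring logic is a toy stand-in for averaging tree votes.

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

public class ForestSketch {

    // Stand-in for the AbstractVectorClassifier contract: score each
    // category for a given feature vector. (Hypothetical, simplified.)
    interface VectorClassifier {
        int numCategories();
        double[] classify(double[] instance);
    }

    static class DecisionForestModel implements VectorClassifier, Serializable {
        private static final long serialVersionUID = 1L;
        private final int categories;

        DecisionForestModel(int categories) { this.categories = categories; }

        @Override public int numCategories() { return categories; }

        // Toy scoring: a uniform distribution over categories, standing in
        // for averaging the votes of the individual trees.
        @Override public double[] classify(double[] instance) {
            double[] p = new double[categories];
            Arrays.fill(p, 1.0 / categories);
            return p;
        }

        // One call, one file: the whole model serialized together.
        void writeBinary(Path file) throws IOException {
            try (ObjectOutputStream out =
                     new ObjectOutputStream(Files.newOutputStream(file))) {
                out.writeObject(this);
            }
        }

        static DecisionForestModel readBinary(Path file)
                throws IOException, ClassNotFoundException {
            try (ObjectInputStream in =
                     new ObjectInputStream(Files.newInputStream(file))) {
                return (DecisionForestModel) in.readObject();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("forest", ".bin");
        new DecisionForestModel(3).writeBinary(file);
        DecisionForestModel model = DecisionForestModel.readBinary(file);
        System.out.println(model.numCategories());  // 3
    }
}
```

The point of the single-call round trip is that a production system can ship the trained forest as one artifact; the classifier interface lets the forest drop into any pipeline that already scores with other Mahout classifiers.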

On Thu, Mar 10, 2011 at 8:33 PM, deneche abdelhakim <ad...@gmail.com> wrote:

> Ok, I am working on a new implementation of DecisionForests that should be
> able to take real advantage of Hadoop's ability to handle really big
> datasets. And by big datasets I mean datasets that are so big they cannot
> fit on a single machine's storage disk.
>
> But I am wondering: what are the real-world applications of such an
> implementation? I mean, I want to make sure this implementation will be
> really useful.
>

Re: Training DecisionForests on large scale datasets

Posted by Ted Dunning <te...@gmail.com>.
Larger than memory is definitely useful.

Larger than any single machine's cumulative disk is probably a bit excessive
(but nice).

On Thu, Mar 10, 2011 at 8:33 PM, deneche abdelhakim <ad...@gmail.com> wrote:

> Ok, I am working on a new implementation of DecisionForests that should be
> able to take real advantage of Hadoop's ability to handle really big
> datasets. And by big datasets I mean datasets that are so big they cannot
> fit on a single machine's storage disk.
>
> But I am wondering: what are the real-world applications of such an
> implementation? I mean, I want to make sure this implementation will be
> really useful.
>