Posted to dev@mahout.apache.org by deneche abdelhakim <a_...@yahoo.fr> on 2009/06/17 12:45:30 UTC

[GSOC] Thoughts about Random forests map-reduce implementation

As we talked about in the following discussion (A), I'm considering two ways to implement a distributed map-reduce builder.

Given the reference implementation, the easiest implementation is the following:

* the data is distributed to the slave nodes using the DistributedCache
* each mapper loads the data into memory in JobConfigurable.configure()
* each tree is built by one mapper
* the Job doesn't really need any input data; it may be possible to implement our own InputFormat that generates InputSplits from the configuration parameters (number of trees)
* the mapper uses DecisionTree.toString() to output the tree as a String (a rough mapper sketch follows this list)
* the main program builds the forest using DecisionTree.parse(String) for each tree
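
To make this concrete, here is a rough sketch of what such a mapper could look like with the old mapred API. It is only a sketch under assumptions: DecisionTree.toString() is the ref. implementation's serializer, but the Data.load(...) / TreeBuilder helpers, the key/value types and the class name are hypothetical placeholders, not actual ref. code.

import java.io.IOException;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

/** One map() call builds one tree; the whole training set is loaded in configure(). */
public class TreeBuildingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, IntWritable, Text> {

  private Data data; // full training set, one copy per map task

  @Override
  public void configure(JobConf conf) {
    try {
      // the dataset was pushed to every slave node via the DistributedCache
      Path[] cached = DistributedCache.getLocalCacheFiles(conf);
      data = Data.load(FileSystem.getLocal(conf), cached[0]); // hypothetical loader
    } catch (IOException e) {
      throw new IllegalStateException("unable to load the cached dataset", e);
    }
  }

  public void map(LongWritable treeId, Text ignored,
                  OutputCollector<IntWritable, Text> out, Reporter reporter)
      throws IOException {
    // each InputSplit generated by the custom InputFormat carries one tree id
    DecisionTree tree = new TreeBuilder(data).build(treeId.get()); // hypothetical builder
    out.collect(new IntWritable((int) treeId.get()), new Text(tree.toString()));
  }
}

The main program would then read the (tree id, tree string) pairs back and rebuild the forest with DecisionTree.parse(String).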

Pros:
* the easiest implementation because the actual ref. code can be used as it is, and the distributed implementation can benefit from future optimizations of the ref. code

Cons:
* because it's based on the ref. implementation, it will be very slow when dealing with large datasets. For example, with half of the KDD 99 dataset (a file of about 350 MB), building one single tree took more than 12 hours (in fact I stopped the program after 12 hours) on a Core 2 2 GHz laptop with 3 GB of RAM!
* if the slave nodes have many cores, it would be interesting to launch parallel mappers on every slave node to exploit multi-threading. But as I haven't found a way to share an in-memory variable between the mappers, each mapper must load its own copy of the Data into memory! (see the sketch just below for a possible partial workaround)
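
A possible partial workaround, at least inside a single map task, is Hadoop's MultithreadedMapRunner: several threads call map() on the same mapper instance, so they all share the single copy of the Data loaded in configure(). It does not share anything between separate map tasks on the same node, and the tree-building code would have to be thread-safe. A minimal sketch of the job setup, reusing the hypothetical TreeBuildingMapper from above:

JobConf conf = new JobConf(TreeBuildingMapper.class);
conf.setMapperClass(TreeBuildingMapper.class);
// run several map() calls concurrently inside one task JVM,
// all of them sharing the single Data instance loaded in configure()
conf.setMapRunnerClass(org.apache.hadoop.mapred.lib.MultithreadedMapRunner.class);
conf.setInt("mapred.map.multithreadedrunner.threads", 4);
conf.setNumReduceTasks(0); // the serialized trees are written directly by the mappers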

Important:
* The ref. implementation's memory usage will probably change once the Information Gain computation is optimized; in that case an out-of-core approach could become viable

So I'm asking, what should I do next?
* go on with this implementation
* optimize the IG computation
* consider the "Distributed Implementation B", see (A), which deals with BIG datasets and should not need any IG optimization, but whose code has nearly nothing in common with the ref. implementation

(A) http://markmail.org/message/mgap2nuhnl4kokeu



Re: [GSOC] Thoughts about Random forests map-reduce implementation

Posted by Ted Dunning <te...@gmail.com>.
This is a classic problem of scaling a solution as the problem gets wide
(number of trees) and tall (amount of data).

Building a random forest of N trees on a large data set costs N times as
much as building a single tree on a single node (as you point out), and N is
typically about the number of cores available in a Hadoop cluster, or a small
multiple thereof.  This means that your simple solution would give essentially
perfect speed-up as long as the data set still fits in memory, so a simple
implementation is likely to be of some use.

On the other hand, it sounds like your Information Gain computation has some
real performance problems that probably should be addressed.

Ultimately, I would think that it is also interesting to modify the original
algorithm to build multiple trees for different portions of the data.  That
loses some of the solidity of the original method, but could actually do
better if the splits exposed non-stationary behavior.
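
For what it's worth, a rough sketch of that variant, with the same caveats as the earlier sketches (Data, TreeBuilder and the "mahout.rf.trees.per.mapper" property are hypothetical; imports as before plus java.util.ArrayList and java.util.List): each mapper buffers only the records of its own input split and grows a few trees on that portion once its split is exhausted.

/** Builds a handful of trees on this mapper's portion of the data only. */
public class PartialDataMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, IntWritable, Text> {

  private final List<String> records = new ArrayList<String>();
  private OutputCollector<IntWritable, Text> output;
  private int treesPerMapper;
  private int partition;

  @Override
  public void configure(JobConf conf) {
    treesPerMapper = conf.getInt("mahout.rf.trees.per.mapper", 10); // hypothetical property
    partition = conf.getInt("mapred.task.partition", 0); // which split this task got
  }

  public void map(LongWritable offset, Text line,
                  OutputCollector<IntWritable, Text> out, Reporter reporter) {
    output = out;
    records.add(line.toString()); // buffer this split's rows only
  }

  @Override
  public void close() throws IOException {
    if (output == null) {
      return; // empty split, nothing to build
    }
    Data data = Data.parse(records); // hypothetical parser
    for (int i = 0; i < treesPerMapper; i++) {
      DecisionTree tree = new TreeBuilder(data).build(i); // hypothetical builder
      output.collect(new IntWritable(partition * treesPerMapper + i),
                     new Text(tree.toString()));
    }
  }
}

Each tree only ever sees the rows of one split, which is where the loss of "solidity" comes from, but no node ever has to hold the whole dataset in memory.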

On Wed, Jun 17, 2009 at 3:45 AM, deneche abdelhakim <a_...@yahoo.fr> wrote:

>
> As we talked about in the following discussion (A), I'm considering two
> ways to implement a distributed map-reduce builder.
>
> Given the reference implementation, the easiest implementation is the
> following:
>
> * the data is distributed to the slave nodes using the DistributedCache
> * each mapper loads the data into memory in JobConfigurable.configure()
> * each tree is built by one mapper
> ...

> * the main program builds the forest using DecisionTree.parse(String) for
> each tree
> ...
> Cons:
> * because it's based on the ref. implementation, it will be very slow when
> dealing with large datasets....