Posted to dev@mahout.apache.org by Andy Twigg <an...@gmail.com> on 2013/01/25 10:58:19 UTC

Out-of-core random forest implementation

Hi,

I'm new to this list, so I apologise if this is covered elsewhere (but
I couldn't find it).

I'm looking at the Random Forests implementations, both mapreduce
("partial") and non-distributed. Both appear to require the data
loaded into memory. Random forests should be straightforward to
construct with multiple passes through the data without storing the
data in memory. Is there such an implementation in Mahout? If not, is
there a ticket/plan?

Thanks,
Andy


--
Dr Andy Twigg
Junior Research Fellow, St Johns College, Oxford
Room 351, Department of Computer Science
http://www.cs.ox.ac.uk/people/andy.twigg/
andy.twigg@cs.ox.ac.uk | +447799647538

Re: Out-of-core random forest implementation

Posted by Sean Owen <sr...@gmail.com>.
You can get a Hadoop job to finish in <1 minute -- with tuning and care,
not by default. I suppose I'd be surprised if broadly equivalent,
well-written iterative processes on the same data took very different
amounts of time on two different general infrastructures. (I'm sure
specialized implementations could do better -- by not being purely
iterative.) But I am sure Hadoop is not optimal by a wide margin.

I suppose I was wondering earlier out loud whether iterative/synchronous
processes are just not suitable in general for large-scale learning. (I
don't think so.)

But this is a different and interesting question here about when (not if)
there will be a better framework for iterative/synchronous processing.
Hadoop is not optimal, but probably still worth building on now, given that
people actually have Hadoop clusters. If you have a Hadoop-sized problem
that runs for hours on a cluster, the extra 20 minutes of overhead over 20
iterations isn't game-changing enough to start over. I am really interested
to see if YARN is that next-gen fabric or something else.

It's also an interesting point about most people not actually having large
data, after pruning and selection. Completely agree, and there's no
particular reason not to use tools that run comfortably on one big machine
if you're at that point. Simpler and cheaper.

The only interesting thing to do with "Big Learning" is to be able to take
away the need to prune, filter, and select features. If you can offer a
scalable way to magically squeeze more quality out of much larger amounts of
lower-quality data, that's something interesting. That probably requires
something distributed. But otherwise, if you've already cleaned and refined
the data, probably not much is added by distributing.

(I also still think the real-time update and query aspect is a different,
and hard, question not addressed by any of these parallel computation
frameworks -- building the model is just half the battle!)


On Fri, Mar 8, 2013 at 2:15 PM, Sebastian Schelter
<ss...@googlemail.com>wrote:

> Well, my general experience is that a lot of datasets used for iterative
> computations are not that large after feature extraction is done. Hadoop
> includes a lot of design decisions that only make sense when you scale
> out to really large setups. I'm not convinced that machine learning
> (except for maybe Google or Facebook) really falls into this category.
>
> If you think of graph mining or collaborative filtering, even datasets
> with a few billion datapoints will need only a few dozen gigabytes and
> easily fit into the aggregate main memory of a small cluster.
>
> For example, in some recent experiments, I was able to conduct an
> iteration of PageRank on a dataset with 1B edges in approx 20 seconds on
> a small 26 node cluster using the Stratosphere [1] system. The authors
> of GraphLab report similar numbers in recent papers [2].
>
> I'm not sure that you can get Hadoop anywhere near that performance on a
> similar setup.
>
> [1] https://www.stratosphere.eu/
> [2]
>
> http://www.select.cs.cmu.edu/publications/paperdir/osdi2012-gonzalez-low-gu-bickson-guestrin.pdf
>
>
>

Re: Out-of-core random forest implementation

Posted by Sebastian Schelter <ss...@googlemail.com>.
Well, my general experience is that a lot of datasets used for iterative
computations are not that large after feature extraction is done. Hadoop
includes a lot of design decisions that only make sense when you scale
out to really large setups. I'm not convinced that machine learning
(except for maybe Google or Facebook) really falls into this category.

If you think of graph mining or collaborative filtering, even datasets
with a few billion datapoints will need only a few dozen gigabytes and
easily fit into the aggregate main memory of a small cluster.
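
(As rough arithmetic, assuming edges stored as two 4-byte ids plus a small
payload, say 16 bytes per edge:)

    3 \times 10^{9}\ \text{edges} \times 16\ \text{B/edge} = 48\ \text{GB}
    \approx 2\ \text{GB per node on a 26-node cluster}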

For example, in some recent experiments, I was able to conduct an
iteration of PageRank on a dataset with 1B edges in approx 20 seconds on
a small 26 node cluster using the Stratosphere [1] system. The authors
of GraphLab report similar numbers in recent papers [2].

I'm not sure that you can get Hadoop anywhere near that performance on a
similar setup.

[1] https://www.stratosphere.eu/
[2]
http://www.select.cs.cmu.edu/publications/paperdir/osdi2012-gonzalez-low-gu-bickson-guestrin.pdf

On 08.03.2013 15:03, Sean Owen wrote:
> Weighing in late here -- it's an interesting question whether M/R is a good
> fit for iterative processes. Today you can already optimize away most of
> the startup overhead for a job, by setting Hadoop to reuse JVMs, increasing
> heart-beat rate, etc. I know Ted can add more about how MapR tunes that
> further. So I'm not sure it's the overhead of starting jobs per se.
> 
> Unless your iterations are < 1 minute or something. And they may be in some
> situations. I don't think that's true of, say, ALS or even the RF
> implementation I have in mind. Iterations may be really fast at small scale
> too, but that's not what Hadoop is for. Or unless your iterations have
> significant init overhead, like loading data. That's a good point too.
> 
> 
> I think the problem comes if the process is not naturally iterative -- if
> it's parallel but the workers need not stop to sync up, then forcing them
> into an iterative process just wastes time. Most of the time you're waiting
> for a straggler worker, needlessly. In this regard, I am not sure that a BSP
> paradigm is any better? But not sure anyone was pushing BSP.
> 
> 
> But I think RF could fit an iterative scheme well. a) I haven't thought it
> through completely or tried it, and b) I can imagine reasons it may work in
> theory but not in practice. But is this not roughly how you'd do it --
> maybe this is what's already being suggested? Roughly:
> 
> 
> Break up the data by feature. (Subdivide features if needed.) Map over all
> the data and distribute the single-feature data. Reducers compute an
> optimal split / decision rule and output their rules. Next iteration:
> mappers read the rules and choose a next random feature for each rule. They
> map the data, apply each rule to each record, apply the next choice of
> feature, and output the single feature again. Reducers will receive, in one
> input group, again all the data they need to compute another split. They
> output that split. Repeat.
> 
> There's a lot of devil in the details there. But this seems like a fine way
> to build a tree level by level without sending all the data all over the
> place. I suppose the problem is uneven data distribution, and that some
> decision trees will go deep. But quickly your number of input groups gets
> quite large as they get smaller, so it ought to even out over reducers
> (?).
> 
> 
> 
> To answer Marty, I don't think this project will ever change much from what
> it is now. It's not even properly on Hadoop 0.20.x, much less 2.x. An
> MRv2-based project is a different project, as it would properly be a total
> rewrite. Something to start thinking about, and to start drawing a line
> under what's here, IMHO.
> 
> 
> On Fri, Mar 8, 2013 at 1:36 PM, Marty Kube <
> martykube@beavercreekconsulting.com> wrote:
> 
>> What about using one map reduce job per iteration?  The models you load
>> into distributed cache are the model from the last round and the reducer
>> can emit the expanded model.  We are presumably working with large data
>> sets so I would not expect start-up latency to be an issue.
>>
>>
> 


Re: Out-of-core random forest implementation

Posted by Marty Kube <ma...@beavercreekconsulting.com>.
On 03/07/2013 04:56 PM, Ted Dunning wrote:
> On Thu, Mar 7, 2013 at 6:25 AM, Andy Twigg <an...@gmail.com> wrote:
>
>> ... Right now what we have is a
>> single-machine procedure for scanning through some data, building a
>> set of histograms, combining histograms and then expanding the tree.
>> The next step is to decide the best way to distribute this. I'm not an
>> expert here, so any advice or help here is welcome.
>>
> That sounds good so far.
>
>
>> I think the easiest approach would be to use the mappers to construct
>> the set of histograms, and then send all histograms for a given leaf
>> to a reducer, which decides how to expand that leaf. The code I have
>> can be almost be ported as-is to a mapper and reducer in this way.
>> Would using the distributed cache to send the updated tree be wise, or
>> is there a better way?
>>
> Distributed cache is a very limited thing.  You can only put things in at
> program launch and they must remain constant throughout the program's run.
>
> The problem here is that iterated map-reduce is pretty heinously
> inefficient.
>
> The best candidate approaches for avoiding that are to use a BSP sort of
> model (see the Pregel paper at
> http://kowshik.github.com/JPregel/pregel_paper.pdf ) or use an
> unsynchronized model update cycle the way that Vowpal Wabbit does with
> all-reduce or the way that Google's deep learning system does.
>
> Running these approaches on Hadoop without Yarn or Mesos requires a slight
> perversion of the map-reduce paradigm, but is quite doable.
It seems like we might be jumping through some hoops to do this in Map 
Reduce.  Is it possible to use Yarn or Mesos?  Is Mahout targeted at 
only MR V1?
>


Re: Out-of-core random forest implementation

Posted by Sean Owen <sr...@gmail.com>.
On Fri, Mar 8, 2013 at 11:54 PM, Ted Dunning <te...@gmail.com> wrote:

> Right on both.  Serializing isn't much of the issue.  It is the disk and
> the hard checkpointing.
>
> Well, with k-means, Hadoop map-reduce implementations are at least an order
> of magnitude slower than, say, a Shark or Graphlab implementation.  The
> Shark implementation is pretty much one-for-one and the graphlab version
> allows asynchronous updates to centroids.
>

The pattern here seems to be 'checkpointing' as the culprit -- that's well
said. In an iterative M/R process you are necessarily writing a lot of
state to HDFS just to read it back again. Whereas the alternatives
mentioned here are based on long-lived workers holding on to data over long
periods. Some checkpointing is necessary to be robust to failure over hours
of work, but I assume (?) these accomplish this in a lighter-weight way.

Hmm, my long-standing assumption/experience had been that the I/O wasn't a
big part of the run-time. But I'm working on a particular set of tasks that
jumps through hoops to avoid I/O. So if it runs for 10 minutes to write out
100MB of data, no big deal. At smaller scales, for different distributions,
and for different algorithms -- not necessarily true.

My disposition has been to take M/R as a given for now, because in practice
it's so widely available, and figure out what works well within those
constraints. I feel more compelled than ever to optimize away I/O, but it
seems to work just fine for certain approaches, done with care, even when
iterative.

But what do you think of my distinction above? Personally, that would be the
bright line I'm looking for to conclude that big-learning-style problems
ought to be moving, at last in 2013, to a different paradigm. I hadn't quite
had that before for BSP or graph-oriented or other paradigms.



> If your data fits in cluster memory and you aren't running a caching
> implementation, it definitely increases disk I/O.
>

I was going to say that fitting in cluster memory seems like a non-trivial
condition -- but for a classifier problem, and for even quite large data
sets, probably not. I'm interested in avoiding this condition though, if
the price is only "moderate".


>
> If your data doesn't fit in memory you get a kinda not scalable
> implementation.  You have to pass over your data a number of times roughly
> proportional to the depth of your tree.  Your tree will be deeper for
> bigger data.  Thus you get super-linear scaling which is my definition of
> not very scalable.  Hopefully the overage is log N or less so that you can
> get away with it.


Yes, a # of passes over the data equal to the depth of the trees is what I
had in mind. I thought approaches dismissed earlier in this thread were
contemplating something that sent around much more data than that. Good
point, that is super-linear. Only mildly, maybe; the depth of the tree is still
softly bounded by the minimum leaf size or hard-bounded by a max depth.

I am still not 100% clear how you would avoid evaluating the data this many
times... and how you do that without reading or transferring it around...
but I haven't thought about it for longer than the minutes writing this
message.

Re: Out-of-core random forest implementation

Posted by Ted Dunning <te...@gmail.com>.
On Fri, Mar 8, 2013 at 6:03 PM, Sean Owen <sr...@gmail.com> wrote:

> Oh, certainly. I was thinking in the realm of distributed systems only.
> Surely serialization across a network is a necessary step in anything like
> that. Serializing to local disk first, or a distributed file system, may
> not be.


Right on both.  Serializing isn't much of the issue.  It is the disk and
the hard checkpointing.


> The local writes may not matter. But wouldn't YARN-type setups
> still be writing to distributed storage?
>

Not necessarily.  Mesos and Yarn allow you to run programs across a
cluster.  Both have the notion of a distinguished node that requests the
others.  What these nodes actually do is up to the implementor.  You could
checkpoint or not.


> My broad hunch is that communicating the same amount of data faster
> probably doesn't get an order of magnitude faster, but a different paradigm
> that lets you transmit less data does.


Well, with k-means, Hadoop map-reduce implementations are at least an order
of magnitude slower than, say, a Shark or Graphlab implementation.  The
Shark implementation is pretty much one-for-one and the graphlab version
allows asynchronous updates to centroids.


> I was musing about whether M/R
> forced you into a hopelessly huge amount of I/O implementation for RF


If your data fits in cluster memory and you aren't running a caching
implementation, it definitely increases disk I/O.

If your data doesn't fit in memory you get a kinda not scalable
implementation.  You have to pass over your data a number of times roughly
proportional to the depth of your tree.  Your tree will be deeper for
bigger data.  Thus you get super-linear scaling which is my definition of
not very scalable.  Hopefully the overage is log N or less so that you can
get away with it.
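
(A rough way to see it, assuming one full pass per tree level and a minimum
leaf size of m:)

    \text{depth} \approx \log_2 \frac{N}{m}
    \quad\Longrightarrow\quad
    \text{total work} \approx N \cdot \log_2 \frac{N}{m}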

> These days I continue to want a better sense not just of whether entire
> paradigms are more/less suitable and how and when and why, but when two
> different concepts in the same paradigm are qualitatively different or just
> a different point on a tradeoff curve, optimizing for a different type of
> problem.
>

Yes.  Yes.

For instance, stochastic svd and streaming k-means both radically change
the game when it comes to map-reduce.

But the real issue has to do with whether scaling is truly linear or not.



>
>
> On Fri, Mar 8, 2013 at 10:35 PM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > The big cost in map-reduce iteration isn't just startup.  It is that the
> > input has to be read from disk and the output written to same.  Were it
> to
> > stay in memory, things would be vastly faster.
> >
> > Also, startup costs are still pretty significant.  Even on MapR, one of
> the
> > major problems in setting the recent minute-sort record was getting
> things
> > to start quickly.  Just setting the heartbeat faster doesn't work on
> > ordinary Hadoop because there is a global lock that begins to starve the
> > system.  We (our guys, not me) had to seriously hack the job tracker to
> > move things out of that critical section.  At that point, we were able to
> > shorten the heartbeat interval to 800 ms (which on 2000 nodes means >2400
> > heartbeats per second).  The startup and cleanup tasks are also single
> > threaded.
> >
> > It might be plausible to shorten to this degree and further on a small
> > cluster.  But iteration is still very painful.
> >
> >
>

Re: Out-of-core random forest implementation

Posted by Sean Owen <sr...@gmail.com>.
Oh, certainly. I was thinking in the realm of distributed systems only.
Surely serialization across a network is a necessary step in anything like
that. Serializing to local disk first, or a distributed file system, may
not be. The local writes may not matter. But wouldn't YARN-type setups
still be writing to distributed storage?

My broad hunch is that communicating the same amount of data faster
probably doesn't get an order of magnitude faster, but a different paradigm
that lets you transmit less data does. I was musing about whether M/R
forces you into an implementation with a hopelessly huge amount of I/O for
RF, and I'm not sure it does, not yet.

These days I continue to want a better sense not just of whether entire
paradigms are more/less suitable and how and when and why, but when two
different concepts in the same paradigm are qualitatively different or just
a different point on a tradeoff curve, optimizing for a different type of
problem.


On Fri, Mar 8, 2013 at 10:35 PM, Ted Dunning <te...@gmail.com> wrote:

> The big cost in map-reduce iteration isn't just startup.  It is that the
> input has to be read from disk and the output written to same.  Were it to
> stay in memory, things would be vastly faster.
>
> Also, startup costs are still pretty significant.  Even on MapR, one of the
> major problems in setting the recent minute-sort record was getting things
> to start quickly.  Just setting the heartbeat faster doesn't work on
> ordinary Hadoop because there is a global lock that begins to starve the
> system.  We (our guys, not me) had to seriously hack the job tracker to
> move things out of that critical section.  At that point, we were able to
> shorten the heartbeat interval to 800 ms (which on 2000 nodes means >2400
> heartbeats per second).  The startup and cleanup tasks are also single
> threaded.
>
> It might be plausible to shorten to this degree and further on a small
> cluster.  But iteration is still very painful.
>
>

Re: Out-of-core random forest implementation

Posted by Ted Dunning <te...@gmail.com>.
The big cost in map-reduce iteration isn't just startup.  It is that the
input has to be read from disk and the output written to same.  Were it to
stay in memory, things would be vastly faster.

Also, startup costs are still pretty significant.  Even on MapR, one of the
major problems in setting the recent minute-sort record was getting things
to start quickly.  Just setting the heartbeat faster doesn't work on
ordinary Hadoop because there is a global lock that begins to starve the
system.  We (our guys, not me) had to seriously hack the job tracker to
move things out of that critical section.  At that point, we were able to
shorten the heartbeat interval to 800 ms (which on 2000 nodes means >2400
heartbeats per second).  The startup and cleanup tasks are also single
threaded.

It might be plausible to shorten to this degree and further on a small
cluster.  But iteration is still very painful.

On Fri, Mar 8, 2013 at 9:03 AM, Sean Owen <sr...@gmail.com> wrote:

> Weighing in late here -- it's an interesting question whether M/R is a good
> fit for iterative processes. Today you can already optimize away most of
> the startup overhead for a job, by setting Hadoop to reuse JVMs, increasing
> heart-beat rate, etc. I know Ted can add more about how MapR tunes that
> further. So I'm not sure it's the overhead of starting jobs per se.
>
> Unless your iterations are < 1 minute or something. And they may be in some
> situations. I don't think that's true of, say, ALS or even the RF
> implementation I have in mind. Iterations may be really fast at small scale
> too, but that's not what Hadoop is for. Or unless your iterations have
> significant init overhead, like loading data. That's a good point too.
>
>
> I think the problem comes if the process is not naturally iterative -- if
> it's parallel but the workers need not stop to sync up, then forcing them
> into an iterative process just wastes time. Most of the time you're waiting
> for a straggler worker, needlessly. In this regard, I am not sure that a BSP
> paradigm is any better? But not sure anyone was pushing BSP.
>
>
> But I think RF could fit an iterative scheme well. a) I haven't thought it
> through completely or tried it, and b) I can imagine reasons it may work in
> theory but not in practice. But is this not roughly how you'd do it --
> maybe this is what's already being suggested? Roughly:
>
>
> Break up the data by feature. (Subdivide features if needed.) Map over all
> the data and distribute the single-feature data. Reducers compute an
> optimal split / decision rule and output their rules. Next iteration:
> mappers read the rules and choose a next random feature for each rule. They
> map the data, apply each rule to each record, apply the next choice of
> feature, and output the single feature again. Reducers will receive, in one
> input group, again all the data they need to compute another split. They
> output that split. Repeat.
>
> There's a lot of devil in the details there. But this seems like a fine way
> to build a tree level by level without sending all the data all over the
> place. I suppose the problem is uneven data distribution, and that some
> decision trees will go deep. But quickly your number of input groups gets
> quite large as they get smaller, so it ought to even out over reducers
> (?).
>
>
>
> To answer Marty, I don't think this project will ever change much from what
> it is now. It's not even properly on Hadoop 0.20.x, much less 2.x. An
> MRv2-based project is a different project, as it would properly be a total
> rewrite. Something to start thinking about, and to start drawing a line
> under what's here, IMHO.
>
>
> On Fri, Mar 8, 2013 at 1:36 PM, Marty Kube <
> martykube@beavercreekconsulting.com> wrote:
>
> > What about using one map reduce job per iteration?  The models you load
> > into distributed cache are the model from the last round and the reducer
> > can emit the expanded model.  We are presumably working with large data
> > sets so I would not expect start-up latency to be an issue.
> >
> >
>

Re: Out-of-core random forest implementation

Posted by Sean Owen <sr...@gmail.com>.
Weighing in late here -- it's an interesting question whether M/R is a good
fit for iterative processes. Today you can already optimize away most of
the startup overhead for a job, by setting Hadoop to reuse JVMs, increasing
heart-beat rate, etc. I know Ted can add more about how MapR tunes that
further. So I'm not sure it's the overhead of starting jobs per se.
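
(For reference, the JVM-reuse knob meant here looks roughly like the snippet
below on classic MRv1 -- the property name is the Hadoop 1.x one and worth
double-checking against your version; the heartbeat interval is a
cluster-side setting and not shown.)

    // Sketch: ask classic MRv1 to reuse task JVMs instead of forking a new
    // JVM per task; -1 means "reuse without limit".
    org.apache.hadoop.conf.Configuration conf =
        new org.apache.hadoop.conf.Configuration();
    conf.set("mapred.job.reuse.jvm.num.tasks", "-1");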

Unless your iterations are < 1 minute or something. And they may be in some
situations. I don't think that's true of, say, ALS or even the RF
implementation I have in mind. Iterations may be really fast at small scale
too, but that's not what Hadoop is for. Or unless your iterations have
significant init overhead, like loading data. That's a good point too.


I think the problem comes if the process is not naturally iterative -- if
it's parallel but the workers need not stop to sync up, then forcing them
into an iterative process just wastes time. Most of the time you're waiting
for a straggler worker, needlessly. In this regard, I am not sure that a BSP
paradigm is any better? But not sure anyone was pushing BSP.


But I think RF could fit an iterative scheme well. a) I haven't thought it
through completely or tried it, and b) I can imagine reasons it may work in
theory but not in practice. But is this not roughly how you'd do it --
maybe this is what's already being suggested? Roughly:


Break up the data by feature. (Subdivide features if needed.) Map over all
the data and distribute the single-feature data. Reducers compute an
optimal split / decision rule and output their rules. Next iteration:
mappers read the rules and choose a next random feature for each rule. They
map the data, apply each rule to each record, apply the next choice of
feature, and output the single feature again. Reducers will receive, in one
input group, again all the data they need to compute another split. They
output that split. Repeat.

There's a lot of devil in the details there. But this seems like a fine way
to build a tree level by level without sending all the data all over the
place. I suppose the problem is uneven data distribution, and that some
decision trees will go deep. But quickly your number of input groups gets
quite large as they get smaller, so it ought to even out over reducers
(?).
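
To make that concrete, a very rough sketch of one such per-level job is
below. ForestRules, Instance and SplitFinder are hypothetical helpers (not
Mahout classes), and most of the aforementioned devil is omitted.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // One iteration = one job that grows every tree by one level.
    class LevelMapper extends Mapper<LongWritable, Text, Text, Text> {
      private ForestRules rules;  // rules from the previous level's reducers (hypothetical)

      @Override
      protected void setup(Context ctx) throws IOException {
        rules = ForestRules.load(ctx.getConfiguration());
      }

      @Override
      protected void map(LongWritable offset, Text line, Context ctx)
          throws IOException, InterruptedException {
        Instance x = Instance.parse(line.toString());      // hypothetical record parser
        for (int t = 0; t < rules.numTrees(); t++) {
          int leaf = rules.routeToLeaf(t, x);              // apply the existing rules to the record
          int f = rules.nextRandomFeature(t, leaf);        // the feature chosen for this leaf
          // One input group per (tree, leaf); only the single feature value travels.
          ctx.write(new Text(t + ":" + leaf),
                    new Text(f + "," + x.value(f) + "," + x.label()));
        }
      }
    }

    class LevelReducer extends Reducer<Text, Text, Text, Text> {
      @Override
      protected void reduce(Text leafKey, Iterable<Text> values, Context ctx)
          throws IOException, InterruptedException {
        SplitFinder finder = new SplitFinder();            // accumulates (value, label) pairs
        for (Text v : values) {
          finder.add(v.toString());
        }
        // The emitted rule becomes part of the model read by the next iteration's mappers.
        ctx.write(leafKey, new Text(finder.bestSplit()));
      }
    }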



To answer Marty, I don't think this project will ever change much from what
it is now. It's not even properly on Hadoop 0.20.x, much less 2.x. An
MRv2-based project is a different project, as it would properly be a total
rewrite. Something to start thinking about, and to start drawing a line
under what's here, IMHO.


On Fri, Mar 8, 2013 at 1:36 PM, Marty Kube <
martykube@beavercreekconsulting.com> wrote:

> What about using one map reduce job per iteration?  The models you load
> into distributed cache are the model from the last round and the reducer
> can emit the expanded model.  We are presumably working with large data
> sets so I would not expect start-up latency to be an issue.
>
>

Re: Out-of-core random forest implementation

Posted by Marty Kube <ma...@beavercreekconsulting.com>.
I'm not sure, but in this case I don't think we have much
iteration-invariant data.

On 03/08/2013 08:39 AM, Sebastian Schelter wrote:
> Well, this is certainly possible and is an approach that is used in our
> ALS code. But the startup latency and the need to rescan
> iteration-invariant data typically induce an overhead of an
> order of magnitude compared to approaches specialized for distributed
> iterations.
>
> Best,
> Sebastian
>
> On 08.03.2013 14:36, Marty Kube wrote:
>> What about using one map reduce job per iteration?  The models you load
>> into distributed cache are the model from the last round and the reducer
>> can emit the expanded model.  We are presumably working with large data
>> sets so I would not expect start-up latency to be an issue.
>>
>> On 03/07/2013 04:56 PM, Ted Dunning wrote:
>>> On Thu, Mar 7, 2013 at 6:25 AM, Andy Twigg <an...@gmail.com> wrote:
>>>
>>>> ... Right now what we have is a
>>>> single-machine procedure for scanning through some data, building a
>>>> set of histograms, combining histograms and then expanding the tree.
>>>> The next step is to decide the best way to distribute this. I'm not an
>>>> expert here, so any advice or help here is welcome.
>>>>
>>> That sounds good so far.
>>>
>>>
>>>> I think the easiest approach would be to use the mappers to construct
>>>> the set of histograms, and then send all histograms for a given leaf
>>>> to a reducer, which decides how to expand that leaf. The code I have
>>>> can be almost be ported as-is to a mapper and reducer in this way.
>>>> Would using the distributed cache to send the updated tree be wise, or
>>>> is there a better way?
>>>>
>>> Distributed cache is a very limited thing.  You can only put things in at
>>> program launch and they must remain constant throughout the program's
>>> run.
>>>
>>> The problem here is that iterated map-reduce is pretty heinously
>>> inefficient.
>>>
>>> The best candidate approaches for avoiding that are to use a BSP sort of
>>> model (see the Pregel paper at
>>> http://kowshik.github.com/JPregel/pregel_paper.pdf ) or use an
>>> unsynchronized model update cycle the way that Vowpal Wabbit does with
>>> all-reduce or the way that Google's deep learning system does.
>>>
>>> Running these approaches on Hadoop without Yarn or Mesos requires a
>>> slight
>>> perversion of the map-reduce paradigm, but is quite doable.
>>>
>
>


Re: Out-of-core random forest implementation

Posted by Andy Twigg <an...@gmail.com>.
I'm concerned that the diff (see https://github.com/andytwigg/mahout)
is now becoming quite large against trunk, and I don't want the first
patch to be a scary one. I put some refactoring effort in to separate the
existing in-memory and the new streaming implementations, while trying
to retain a shared interface (e.g. for tree building, bagging, data
loading, classifying, etc.). This is mostly fine, but it does mean
there are some significant '---'s. I don't have a lot more time to
spend on this right now, but should I try to pull a patch out early,
and if so, should it be a refactoring patch or one that contributes the
implementation?

Re iteration: it does feel like Mahout is perhaps ready for a spring
clean. Does anyone use the existing DF classifier in production?

Andy



On 8 March 2013 13:39, Sebastian Schelter <ss...@googlemail.com> wrote:
> Well, this is certainly possible and is an approach that is used in our
> ALS code. But the startup latency and the need to rescan
> iteration-invariant data typically induce an overhead of an
> order of magnitude compared to approaches specialized for distributed
> iterations.
>
> Best,
> Sebastian
>
> On 08.03.2013 14:36, Marty Kube wrote:
>> What about using one map reduce job per iteration?  The models you load
>> into distributed cache are the model from the last round and the reducer
>> can emit the expanded model.  We are presumably working with large data
>> sets so I would not expect start-up latency to be an issue.
>>
>> On 03/07/2013 04:56 PM, Ted Dunning wrote:
>>> On Thu, Mar 7, 2013 at 6:25 AM, Andy Twigg <an...@gmail.com> wrote:
>>>
>>>> ... Right now what we have is a
>>>> single-machine procedure for scanning through some data, building a
>>>> set of histograms, combining histograms and then expanding the tree.
>>>> The next step is to decide the best way to distribute this. I'm not an
>>>> expert here, so any advice or help here is welcome.
>>>>
>>> That sounds good so far.
>>>
>>>
>>>> I think the easiest approach would be to use the mappers to construct
>>>> the set of histograms, and then send all histograms for a given leaf
>>>> to a reducer, which decides how to expand that leaf. The code I have
>>>> can be almost be ported as-is to a mapper and reducer in this way.
>>>> Would using the distributed cache to send the updated tree be wise, or
>>>> is there a better way?
>>>>
>>> Distributed cache is a very limited thing.  You can only put things in at
>>> program launch and they must remain constant throughout the program's
>>> run.
>>>
>>> The problem here is that iterated map-reduce is pretty heinously
>>> inefficient.
>>>
>>> The best candidate approaches for avoiding that are to use a BSP sort of
>>> model (see the Pregel paper at
>>> http://kowshik.github.com/JPregel/pregel_paper.pdf ) or use an
>>> unsynchronized model update cycle the way that Vowpal Wabbit does with
>>> all-reduce or the way that Google's deep learning system does.
>>>
>>> Running these approaches on Hadoop without Yarn or Mesos requires a
>>> slight
>>> perversion of the map-reduce paradigm, but is quite doable.
>>>
>>
>



--
Dr Andy Twigg
Junior Research Fellow, St Johns College, Oxford
Room 351, Department of Computer Science
http://www.cs.ox.ac.uk/people/andy.twigg/
andy.twigg@cs.ox.ac.uk | +447799647538

Re: Out-of-core random forest implementation

Posted by Sebastian Schelter <ss...@googlemail.com>.
Well, this is certainly possible and is an approach that is used in our
ALS code. But the startup latency and the need to rescan
iteration-invariant data typically induce an overhead of an
order of magnitude compared to approaches specialized for distributed
iterations.

Best,
Sebastian

On 08.03.2013 14:36, Marty Kube wrote:
> What about using one map reduce job per iteration?  The models you load
> into distributed cache are the model from the last round and the reducer
> can emit the expanded model.  We are presumably working with large data
> sets so I would not expect start-up latency to be an issue.
> 
> On 03/07/2013 04:56 PM, Ted Dunning wrote:
>> On Thu, Mar 7, 2013 at 6:25 AM, Andy Twigg <an...@gmail.com> wrote:
>>
>>> ... Right now what we have is a
>>> single-machine procedure for scanning through some data, building a
>>> set of histograms, combining histograms and then expanding the tree.
>>> The next step is to decide the best way to distribute this. I'm not an
>>> expert here, so any advice or help here is welcome.
>>>
>> That sounds good so far.
>>
>>
>>> I think the easiest approach would be to use the mappers to construct
>>> the set of histograms, and then send all histograms for a given leaf
>>> to a reducer, which decides how to expand that leaf. The code I have
>>> can be almost be ported as-is to a mapper and reducer in this way.
>>> Would using the distributed cache to send the updated tree be wise, or
>>> is there a better way?
>>>
>> Distributed cache is a very limited thing.  You can only put things in at
>> program launch and they must remain constant throughout the program's
>> run.
>>
>> The problem here is that iterated map-reduce is pretty heinously
>> inefficient.
>>
>> The best candidate approaches for avoiding that are to use a BSP sort of
>> model (see the Pregel paper at
>> http://kowshik.github.com/JPregel/pregel_paper.pdf ) or use an
>> unsynchronized model update cycle the way that Vowpal Wabbit does with
>> all-reduce or the way that Google's deep learning system does.
>>
>> Running these approaches on Hadoop without Yarn or Mesos requires a
>> slight
>> perversion of the map-reduce paradigm, but is quite doable.
>>
> 


Re: Out-of-core random forest implementation

Posted by Marty Kube <ma...@beavercreekconsulting.com>.
What about using one map reduce job per iteration?  The models you load 
into distributed cache are the model from the last round and the reducer 
can emit the expanded model.  We are presumably working with large data 
sets so I would not expect start-up latency to be an issue.
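
Roughly, the driver for that could look like the sketch below. The Hadoop
classes are real (0.20/1.x era API); LevelMapper/LevelReducer and the paths
are hypothetical placeholders.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class IterativeForestDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path data = new Path(args[0]);
        Path model = new Path(args[1]);           // initial (empty) forest
        int rounds = Integer.parseInt(args[2]);

        for (int round = 0; round < rounds; round++) {
          Job job = new Job(conf, "rf-round-" + round);
          job.setJarByClass(IterativeForestDriver.class);
          job.setMapperClass(LevelMapper.class);  // hypothetical
          job.setReducerClass(LevelReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(Text.class);
          FileInputFormat.addInputPath(job, data);
          Path nextModel = new Path(model.getParent(), "model-" + (round + 1));
          FileOutputFormat.setOutputPath(job, nextModel);
          // Ship last round's model, read-only, to every task for this round.
          DistributedCache.addCacheFile(new URI(model.toString()), job.getConfiguration());
          if (!job.waitForCompletion(true)) {
            throw new IllegalStateException("round " + round + " failed");
          }
          model = nextModel;                      // the reducers emitted the expanded model
        }
      }
    }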

On 03/07/2013 04:56 PM, Ted Dunning wrote:
> On Thu, Mar 7, 2013 at 6:25 AM, Andy Twigg <an...@gmail.com> wrote:
>
>> ... Right now what we have is a
>> single-machine procedure for scanning through some data, building a
>> set of histograms, combining histograms and then expanding the tree.
>> The next step is to decide the best way to distribute this. I'm not an
>> expert here, so any advice or help here is welcome.
>>
> That sounds good so far.
>
>
>> I think the easiest approach would be to use the mappers to construct
>> the set of histograms, and then send all histograms for a given leaf
>> to a reducer, which decides how to expand that leaf. The code I have
>> can be almost be ported as-is to a mapper and reducer in this way.
>> Would using the distributed cache to send the updated tree be wise, or
>> is there a better way?
>>
> Distributed cache is a very limited thing.  You can only put things in at
> program launch and they must remain constant throughout the program's run.
>
> The problem here is that iterated map-reduce is pretty heinously
> inefficient.
>
> The best candidate approaches for avoiding that are to use a BSP sort of
> model (see the Pregel paper at
> http://kowshik.github.com/JPregel/pregel_paper.pdf ) or use an
> unsynchronized model update cycle the way that Vowpal Wabbit does with
> all-reduce or the way that Google's deep learning system does.
>
> Running these approaches on Hadoop without Yarn or Mesos requires a slight
> perversion of the map-reduce paradigm, but is quite doable.
>


Re: Out-of-core random forest implementation

Posted by Ted Dunning <te...@gmail.com>.
See Giraph.

On Thu, Mar 7, 2013 at 6:01 PM, Andy Twigg <an...@gmail.com> wrote:

> That sounds like a horrid amount of work to do something simple. Is there a
> hadoop implementation of a master-workers problem you can point me to?
> On Mar 7, 2013 9:57 PM, "Ted Dunning" <te...@gmail.com> wrote:
>
> > On Thu, Mar 7, 2013 at 6:25 AM, Andy Twigg <an...@gmail.com> wrote:
> >
> > > ... Right now what we have is a
> > > single-machine procedure for scanning through some data, building a
> > > set of histograms, combining histograms and then expanding the tree.
> > > The next step is to decide the best way to distribute this. I'm not an
> > > expert here, so any advice or help here is welcome.
> > >
> >
> > That sounds good so far.
> >
> >
> > > I think the easiest approach would be to use the mappers to construct
> > > the set of histograms, and then send all histograms for a given leaf
> > > to a reducer, which decides how to expand that leaf. The code I have
> > > can be almost be ported as-is to a mapper and reducer in this way.
> > > Would using the distributed cache to send the updated tree be wise, or
> > > is there a better way?
> > >
> >
> > Distributed cache is a very limited thing.  You can only put things in at
> > program launch and they must remain constant throughout the program's
> run.
> >
> > The problem here is that iterated map-reduce is pretty heinously
> > inefficient.
> >
> > The best candidate approaches for avoiding that are to use a BSP sort of
> > model (see the Pregel paper at
> > http://kowshik.github.com/JPregel/pregel_paper.pdf ) or use an
> > unsynchronized model update cycle the way that Vowpal Wabbit does with
> > all-reduce or the way that Google's deep learning system does.
> >
> > Running these approaches on Hadoop without Yarn or Mesos requires a
> slight
> > perversion of the map-reduce paradigm, but is quite doable.
> >
>

Re: Out-of-core random forest implementation

Posted by Andy Twigg <an...@gmail.com>.
That sounds like a horrid amount of work to do something simple. Is there a
hadoop implementation of a master-workers problem you can point me to?
On Mar 7, 2013 9:57 PM, "Ted Dunning" <te...@gmail.com> wrote:

> On Thu, Mar 7, 2013 at 6:25 AM, Andy Twigg <an...@gmail.com> wrote:
>
> > ... Right now what we have is a
> > single-machine procedure for scanning through some data, building a
> > set of histograms, combining histograms and then expanding the tree.
> > The next step is to decide the best way to distribute this. I'm not an
> > expert here, so any advice or help here is welcome.
> >
>
> That sounds good so far.
>
>
> > I think the easiest approach would be to use the mappers to construct
> > the set of histograms, and then send all histograms for a given leaf
> > to a reducer, which decides how to expand that leaf. The code I have
> > can be almost be ported as-is to a mapper and reducer in this way.
> > Would using the distributed cache to send the updated tree be wise, or
> > is there a better way?
> >
>
> Distributed cache is a very limited thing.  You can only put things in at
> program launch and they must remain constant throughout the program's run.
>
> The problem here is that iterated map-reduce is pretty heinously
> inefficient.
>
> The best candidate approaches for avoiding that are to use a BSP sort of
> model (see the Pregel paper at
> http://kowshik.github.com/JPregel/pregel_paper.pdf ) or use an
> unsynchronized model update cycle the way that Vowpal Wabbit does with
> all-reduce or the way that Google's deep learning system does.
>
> Running these approaches on Hadoop without Yarn or Mesos requires a slight
> perversion of the map-reduce paradigm, but is quite doable.
>

Re: Out-of-core random forest implementation

Posted by Ted Dunning <te...@gmail.com>.
On Thu, Mar 7, 2013 at 6:25 AM, Andy Twigg <an...@gmail.com> wrote:

> ... Right now what we have is a
> single-machine procedure for scanning through some data, building a
> set of histograms, combining histograms and then expanding the tree.
> The next step is to decide the best way to distribute this. I'm not an
> expert here, so any advice or help here is welcome.
>

That sounds good so far.


> I think the easiest approach would be to use the mappers to construct
> the set of histograms, and then send all histograms for a given leaf
> to a reducer, which decides how to expand that leaf. The code I have
> can be almost be ported as-is to a mapper and reducer in this way.
> Would using the distributed cache to send the updated tree be wise, or
> is there a better way?
>

Distributed cache is a very limited thing.  You can only put things in at
program launch and they must remain constant throughout the program's run.

The problem here is that iterated map-reduce is pretty heinously
inefficient.

The best candidate approaches for avoiding that are to use a BSP sort of
model (see the Pregel paper at
http://kowshik.github.com/JPregel/pregel_paper.pdf ) or use an
unsynchronized model update cycle the way that Vowpal Wabbit does with
all-reduce or the way that Google's deep learning system does.

Running these approaches on Hadoop without Yarn or Mesos requires a slight
perversion of the map-reduce paradigm, but is quite doable.

Re: Out-of-core random forest implementation

Posted by Andy Twigg <an...@gmail.com>.
Thanks for bringing that to my attention. I don't have access to the
PDF right now, but I will take a look. Right now what we have is a
single-machine procedure for scanning through some data, building a
set of histograms, combining histograms and then expanding the tree.
The next step is to decide the best way to distribute this. I'm not an
expert here, so any advice or help here is welcome.

I think the easiest approach would be to use the mappers to construct
the set of histograms, and then send all histograms for a given leaf
to a reducer, which decides how to expand that leaf. The code I have
can almost be ported as-is to a mapper and reducer in this way.
Would using the distributed cache to send the updated tree be wise, or
is there a better way?
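
As a very rough sketch of that wiring (Instance, TreeModel and
HistogramWritable are hypothetical stand-ins, not existing Mahout classes;
the partial tree would come from the distributed cache or whatever replaces
it):

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    class HistogramMapper extends Mapper<LongWritable, Text, Text, HistogramWritable> {
      private TreeModel tree;   // current partial tree (hypothetical)
      private final Map<String, HistogramWritable> perLeaf =
          new HashMap<String, HistogramWritable>();

      @Override
      protected void setup(Context ctx) throws IOException {
        tree = TreeModel.load(ctx.getConfiguration());
      }

      @Override
      protected void map(LongWritable offset, Text line, Context ctx) {
        Instance x = Instance.parse(line.toString());
        String key = tree.leafFor(x) + ":" + x.label();   // one histogram set per (leaf, class)
        HistogramWritable h = perLeaf.get(key);
        if (h == null) {
          h = new HistogramWritable();
          perLeaf.put(key, h);
        }
        h.update(x);                                      // add this record's feature values
      }

      @Override
      protected void cleanup(Context ctx) throws IOException, InterruptedException {
        // Only the (small) histograms leave the mapper, never the raw data.
        for (Map.Entry<String, HistogramWritable> e : perLeaf.entrySet()) {
          ctx.write(new Text(e.getKey()), e.getValue());
        }
      }
    }

    class HistogramReducer extends Reducer<Text, HistogramWritable, Text, Text> {
      @Override
      protected void reduce(Text leaf, Iterable<HistogramWritable> parts, Context ctx)
          throws IOException, InterruptedException {
        HistogramWritable merged = new HistogramWritable();
        for (HistogramWritable h : parts) {
          merged.merge(h);                                // combine the mappers' partial histograms
        }
        ctx.write(leaf, new Text(merged.bestSplit()));    // decide how to expand this leaf
      }
    }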



The PhD thesis on streaming random forests [1] might also be of interest.
[1] http://www.collectionscanada.gc.ca/obj/s4/f2/dsk3/OKQ/TC-OKQ-1321.pdf
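
For reference, the core of the on-line histogram in that line of work
(Ben-Haim/Yom-Tov) is small: keep at most B (centroid, count) bins, insert
each new point as a bin of count 1, and whenever there are more than B bins
merge the two with the closest centroids; merging two histograms applies the
same rule to the union of their bins. A minimal sketch of the idea (not the
bigmlcom/histogram implementation mentioned elsewhere in this thread):

    import java.util.ArrayList;
    import java.util.List;

    /** Minimal fixed-size streaming histogram sketch in the spirit of [1]. */
    class StreamingHistogram {
      private final int maxBins;
      // Each bin is {centroid, count}, kept sorted by centroid.
      private final List<double[]> bins = new ArrayList<double[]>();

      StreamingHistogram(int maxBins) {
        this.maxBins = maxBins;
      }

      /** Add one observation. */
      void add(double x) {
        insertBin(new double[] {x, 1});
        shrinkToMaxBins();
      }

      /** Merge another histogram into this one (e.g. combining mapper outputs). */
      void merge(StreamingHistogram other) {
        for (double[] b : other.bins) {
          insertBin(new double[] {b[0], b[1]});
        }
        shrinkToMaxBins();
      }

      private void insertBin(double[] bin) {
        int i = 0;
        while (i < bins.size() && bins.get(i)[0] < bin[0]) {
          i++;
        }
        bins.add(i, bin);
      }

      /** Repeatedly merge the two adjacent bins whose centroids are closest. */
      private void shrinkToMaxBins() {
        while (bins.size() > maxBins) {
          int best = 0;
          double bestGap = Double.MAX_VALUE;
          for (int i = 0; i + 1 < bins.size(); i++) {
            double gap = bins.get(i + 1)[0] - bins.get(i)[0];
            if (gap < bestGap) {
              bestGap = gap;
              best = i;
            }
          }
          double[] a = bins.get(best);
          double[] b = bins.remove(best + 1);
          double count = a[1] + b[1];
          a[0] = (a[0] * a[1] + b[0] * b[1]) / count;  // weighted centroid
          a[1] = count;
        }
      }
    }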

On 7 March 2013 02:31, Meng Qinghan <qi...@gmail.com> wrote:
> You can read the following paper, which I think is useful.
>
> http://link.springer.com/chapter/10.1007%2F978-3-642-30217-6_12?LI=true
>
> 2013/3/6 Andy Twigg <an...@gmail.com>
>
>> Hi guys,
>>
>> I've created a JIRA issue for this now -
>> https://issues.apache.org/jira/browse/MAHOUT-1153
>>
>> Right now we're working on my fork but will make an early patch once
>> we can. Hopefully that will provide a basis for others who are
>> interested to add to it.
>>
>> Andy
>>
>>
>> On 22 February 2013 08:49, Andy Twigg <an...@gmail.com> wrote:
>> > Hi Marty,
>> >
>> > I would suggest doing each tree independently, one after the other.
>> > Have each node select a random sample w/replacement of its data, and
>> > let the master choose the random subset of features to split with. I'm
>> > sure there are more efficient ways, but we can think of them later.
>> >
>> > -Andy
>> >
>> >
>> > On 22 February 2013 01:47, Marty Kube <ma...@gmail.com>
>> wrote:
>> >> On the algorithm choice, Parallel Boosted Regression Trees looks
>> interesting
>> >> as well.
>> >>
>> >>
>> http://research.engineering.wustl.edu/~tyrees/Publications_files/fr819-tyreeA.pdf
>> >>
>> >> After reading that paper I had a question about the Streaming Parallel
>> >> Decision Trees.  The SPDT paper is for building one decision tree.  I
>> was
>> >> thinking about how to extend that to a decision forest.  I could see
>> how one
>> >> can build many trees in the master and use a random choice of the
>> splitting
>> >> variables (as opposed to checking all variables and using the best
>> >> variable). However, it seems like the idea of sampling the dataset with
>> >> replacement for each tree gets lost.  Is that okay?
>> >>
>> >>
>> >> On 02/21/2013 03:19 AM, Ted Dunning wrote:
>> >>>
>> >>> For quantile estimation, consider also streamlib at
>> >>> https://github.com/clearspring/stream-lib
>> >>>
>> >>> The bigmlcom implementation looks more directly applicable, actually.
>> >>>
>> >>> On Wed, Feb 20, 2013 at 5:01 PM, Andy Twigg <andy.twigg@gmail.com
>> >>> <ma...@gmail.com>> wrote:
>> >>>
>> >>>     Even better, there is already a good implementation of the
>> histograms:
>> >>>     https://github.com/bigmlcom/histogram
>> >>>
>> >>>     -Andy
>> >>>
>> >>>
>> >>>     On 20 February 2013 22:50, Marty Kube <marty.kube.apache@gmail.com
>> >>>     <ma...@gmail.com>> wrote:
>> >>>     > That's a winner...
>> >>>     > Out of all of the algorithms I've looked at the Ben-Haim/SPDT
>> >>>     looks most
>> >>>     > likely.  In batch mode it uses one pass over the data set, it
>> >>>     can be used in
>> >>>     > a streaming mode, and has constant space and time requirements.
>> >>>      That seems
>> >>>     > like the kind of scalable algorithm we're after.
>> >>>     > I'm in!
>> >>>     >
>> >>>     >
>> >>>     > On 02/20/2013 10:09 AM, Andy Twigg wrote:
>> >>>     >>
>> >>>     >> Alternatively, the algorithm described in [1] is more
>> >>>     straightforward,
>> >>>     >> efficient, hadoop-compatible (using only mappers communicating
>> to a
>> >>>     >> master) and satisfies all our requirements so far. I would like
>> to
>> >>>     >> take a pass at implementing that, if anyone else is interested?
>> >>>     >>
>> >>>     >> [1]
>> >>>
>> http://jmlr.csail.mit.edu/papers/volume11/ben-haim10a/ben-haim10a.pdf
>> >>>     >>
>> >>>     >>
>> >>>     >> On 20 February 2013 14:27, Andy Twigg <andy.twigg@gmail.com
>> >>>     <ma...@gmail.com>> wrote:
>> >>>     >>>
>> >>>     >>> Why don't we start from
>> >>>     >>>
>> >>>     >>> https://github.com/ashenfad/hadooptree ?
>> >>>     >>>
>> >>>     >>> On 20 February 2013 13:25, Marty Kube
>> >>>     <marty.kube.apache@gmail.com <ma...@gmail.com>>
>> >>>
>> >>>     >>> wrote:
>> >>>     >>>>
>> >>>     >>>> Hi Lorenz,
>> >>>     >>>>
>> >>>     >>>> Very interesting, that's what I was asking for when I
>> >>>     mentioned non-MR
>> >>>     >>>> implementations :-)
>> >>>     >>>>
>> >>>     >>>> I have not looked at spark before, interesting that it uses
>> >>>     Mesos for
>> >>>     >>>> clustering.   I'll check it out.
>> >>>     >>>>
>> >>>     >>>>
>> >>>     >>>> On 02/19/2013 09:32 PM, Lorenz Knies wrote:
>> >>>     >>>>>
>> >>>     >>>>> Hi Marty,
>> >>>     >>>>>
>> >>>     >>>>> i am currently working on a PLANET-like implementation on
>> >>>     top of spark:
>> >>>     >>>>> http://spark-project.org
>> >>>     >>>>>
>> >>>     >>>>> I think this framework is a nice fit for the problem.
>> >>>     >>>>> If the input data fits into the "total cluster memory" you
>> >>>     benefit from
>> >>>     >>>>> the caching of the RDD's.
>> >>>     >>>>>
>> >>>     >>>>> regards,
>> >>>     >>>>>
>> >>>     >>>>> lorenz
>> >>>     >>>>>
>> >>>     >>>>>
>> >>>     >>>>> On Feb 20, 2013, at 2:42 AM, Marty Kube
>> >>>     <marty.kube.apache@gmail.com <ma...@gmail.com>>
>> >>>
>> >>>     >>>>> wrote:
>> >>>     >>>>>
>> >>>     >>>>>> You had mentioned other "resource management" platforms
>> >>>     like Giraph or
>> >>>     >>>>>> Mesos.  I haven't looked at those yet.  I guess I was thinking
>> >>>     of other
>> >>>     >>>>>> parallelization frameworks.
>> >>>     >>>>>>
>> >>>     >>>>>> It's interesting that the planet folks thought it was really
>> >>>     >>>>>> worthwhile
>> >>>     >>>>>> working on top of map reduce for all of the resource
>> >>>     management that
>> >>>     >>>>>> is
>> >>>     >>>>>> built in.
>> >>>     >>>>>>
>> >>>     >>>>>>
>> >>>     >>>>>> On 02/19/2013 08:04 PM, Ted Dunning wrote:
>> >>>     >>>>>>>
>> >>>     >>>>>>> If non-MR means map-only job with communicating mappers
>> >>>     and a state
>> >>>     >>>>>>> store,
>> >>>     >>>>>>> I am down with that.
>> >>>     >>>>>>>
>> >>>     >>>>>>> What did you mean?
>> >>>     >>>>>>>
>> >>>     >>>>>>> On Tue, Feb 19, 2013 at 5:53 PM, Marty Kube <
>> >>>     >>>>>>> martykube@beavercreekconsulting.com
>> >>>     <ma...@beavercreekconsulting.com>> wrote:
>> >>>     >>>>>>>
>> >>>     >>>>>>>> Right now I'd lean towards the planet model, or maybe a
>> >>>     non-MR
>> >>>     >>>>>>>> implementation.  Anyone have a good idea for a non-MR
>> >>>     solution?
>> >>>     >>>>>>>>
>> >>>     >>>
>> >>>     >>>
>> >>>     >>> --
>> >>>     >>> Dr Andy Twigg
>> >>>     >>> Junior Research Fellow, St Johns College, Oxford
>> >>>     >>> Room 351, Department of Computer Science
>> >>>     >>> http://www.cs.ox.ac.uk/people/andy.twigg/
>> >>>     >>> andy.twigg@cs.ox.ac.uk <ma...@cs.ox.ac.uk> |
>> >>>     +447799647538 <tel:%2B447799647538>
>> >>>
>> >>>     >>
>> >>>     >>
>> >>>     >>
>> >>>     >> --
>> >>>     >> Dr Andy Twigg
>> >>>     >> Junior Research Fellow, St Johns College, Oxford
>> >>>     >> Room 351, Department of Computer Science
>> >>>     >> http://www.cs.ox.ac.uk/people/andy.twigg/
>> >>>     >> andy.twigg@cs.ox.ac.uk <ma...@cs.ox.ac.uk> |
>> >>>     +447799647538 <tel:%2B447799647538>
>> >>>
>> >>>     >
>> >>>     >
>> >>>
>> >>>
>> >>>
>> >>>     --
>> >>>     Dr Andy Twigg
>> >>>     Junior Research Fellow, St Johns College, Oxford
>> >>>     Room 351, Department of Computer Science
>> >>>     http://www.cs.ox.ac.uk/people/andy.twigg/
>> >>>     andy.twigg@cs.ox.ac.uk <ma...@cs.ox.ac.uk> |
>> >>>     +447799647538 <tel:%2B447799647538>
>> >>>
>> >>>
>> >>
>> >
>> >
>> >
>> > --
>> > Dr Andy Twigg
>> > Junior Research Fellow, St Johns College, Oxford
>> > Room 351, Department of Computer Science
>> > http://www.cs.ox.ac.uk/people/andy.twigg/
>> > andy.twigg@cs.ox.ac.uk | +447799647538
>>
>>
>>
>> --
>> Dr Andy Twigg
>> Junior Research Fellow, St Johns College, Oxford
>> Room 351, Department of Computer Science
>> http://www.cs.ox.ac.uk/people/andy.twigg/
>> andy.twigg@cs.ox.ac.uk | +447799647538
>>



--
Dr Andy Twigg
Junior Research Fellow, St Johns College, Oxford
Room 351, Department of Computer Science
http://www.cs.ox.ac.uk/people/andy.twigg/
andy.twigg@cs.ox.ac.uk | +447799647538

Re: Out-of-core random forest implementation

Posted by Meng Qinghan <qi...@gmail.com>.
You can read the following paper, which I think is useful.

http://link.springer.com/chapter/10.1007%2F978-3-642-30217-6_12?LI=true

2013/3/6 Andy Twigg <an...@gmail.com>

> Hi guys,
>
> I've created a JIRA issue for this now -
> https://issues.apache.org/jira/browse/MAHOUT-1153
>
> Right now we're working on my fork but will make an early patch once
> we can. Hopefully that will provide a basis for others who are
> interested to add to it.
>
> Andy
>
>
> On 22 February 2013 08:49, Andy Twigg <an...@gmail.com> wrote:
> > Hi Marty,
> >
> > I would suggest doing each tree independently, one after the other.
> > Have each node select a random sample w/replacement of its data, and
> > let the master choose the random subset of features to split with. I'm
> > sure there are more efficient ways, but we can think of them later.
> >
> > -Andy
> >
> >
> > On 22 February 2013 01:47, Marty Kube <ma...@gmail.com>
> wrote:
> >> On the algorithm choice, Parallel Boosted Regression Trees looks
> interesting
> >> as well.
> >>
> >>
> http://research.engineering.wustl.edu/~tyrees/Publications_files/fr819-tyreeA.pdf
> >>
> >> After reading that paper I had a question about the Streaming Parallel
> >> Decision Trees.  The SPDT paper is for building one decision tree.  I
> was
> >> thinking about how to extend that to a decision forest.  I could see
> how one
> >> can build many trees in the master and use a random choice of the
> splitting
> >> variables (as opposed to checking all variables and using the best
> >> variable). However, it seems like the idea of sampling the dataset with
> >> replacement for each tree gets lost.  Is that okay?
> >>
> >>
> >> On 02/21/2013 03:19 AM, Ted Dunning wrote:
> >>>
> >>> For quantile estimation, consider also streamlib at
> >>> https://github.com/clearspring/stream-lib
> >>>
> >>> The bigmlcom implementation looks more directly applicable, actually.
> >>>
> >>> On Wed, Feb 20, 2013 at 5:01 PM, Andy Twigg <andy.twigg@gmail.com
> >>> <ma...@gmail.com>> wrote:
> >>>
> >>>     Even better, there is already a good implementation of the
> histograms:
> >>>     https://github.com/bigmlcom/histogram
> >>>
> >>>     -Andy
> >>>
> >>>
> >>>     On 20 February 2013 22:50, Marty Kube <marty.kube.apache@gmail.com
> >>>     <ma...@gmail.com>> wrote:
> >>>     > That's a winner...
> >>>     > Out of all of the algorithms I've looked at the Ben-Haim/SPDT
> >>>     looks most
> >>>     > likely.  In batch mode it uses one pass over the data set, it
> >>>     can be used in
> >>>     > a streaming mode, and has constant space and time requirements.
> >>>      That seems
> >>>     > like the kind of scalable algorithm we're after.
> >>>     > I'm in!
> >>>     >
> >>>     >
> >>>     > On 02/20/2013 10:09 AM, Andy Twigg wrote:
> >>>     >>
> >>>     >> Alternatively, the algorithm described in [1] is more
> >>>     straightforward,
> >>>     >> efficient, hadoop-compatible (using only mappers communicating
> to a
> >>>     >> master) and satisfies all our requirements so far. I would like
> to
> >>>     >> take a pass at implementing that, if anyone else is interested?
> >>>     >>
> >>>     >> [1]
> >>>
> http://jmlr.csail.mit.edu/papers/volume11/ben-haim10a/ben-haim10a.pdf
> >>>     >>
> >>>     >>
> >>>     >> On 20 February 2013 14:27, Andy Twigg <andy.twigg@gmail.com
> >>>     <ma...@gmail.com>> wrote:
> >>>     >>>
> >>>     >>> Why don't we start from
> >>>     >>>
> >>>     >>> https://github.com/ashenfad/hadooptree ?
> >>>     >>>
> >>>     >>> On 20 February 2013 13:25, Marty Kube
> >>>     <marty.kube.apache@gmail.com <ma...@gmail.com>>
> >>>
> >>>     >>> wrote:
> >>>     >>>>
> >>>     >>>> Hi Lorenz,
> >>>     >>>>
> >>>     >>>> Very interesting, that's what I was asking for when I
> >>>     mentioned non-MR
> >>>     >>>> implementations :-)
> >>>     >>>>
> >>>     >>>> I have not looked at spark before, interesting that it uses
> >>>     Mesos for
> >>>     >>>> clustering.   I'll check it out.
> >>>     >>>>
> >>>     >>>>
> >>>     >>>> On 02/19/2013 09:32 PM, Lorenz Knies wrote:
> >>>     >>>>>
> >>>     >>>>> Hi Marty,
> >>>     >>>>>
> >>>     >>>>> i am currently working on a PLANET-like implementation on
> >>>     top of spark:
> >>>     >>>>> http://spark-project.org
> >>>     >>>>>
> >>>     >>>>> I think this framework is a nice fit for the problem.
> >>>     >>>>> If the input data fits into the "total cluster memory" you
> >>>     benefit from
> >>>     >>>>> the caching of the RDD's.
> >>>     >>>>>
> >>>     >>>>> regards,
> >>>     >>>>>
> >>>     >>>>> lorenz
> >>>     >>>>>
> >>>     >>>>>
> >>>     >>>>> On Feb 20, 2013, at 2:42 AM, Marty Kube
> >>>     <marty.kube.apache@gmail.com <ma...@gmail.com>>
> >>>
> >>>     >>>>> wrote:
> >>>     >>>>>
> >>>     >>>>>> You had mentioned other "resource management" platforms
> >>>     like Giraph or
> >>>     >>>>>> Mesos.  I haven't looked at those yet.  I guess I was thinking
> >>>     of other
> >>>     >>>>>> parallelization frameworks.
> >>>     >>>>>>
> >>>     >>>>>> It's interesting that the planet folks thought it was really
> >>>     >>>>>> worthwhile
> >>>     >>>>>> working on top of map reduce for all of the resource
> >>>     management that
> >>>     >>>>>> is
> >>>     >>>>>> built in.
> >>>     >>>>>>
> >>>     >>>>>>
> >>>     >>>>>> On 02/19/2013 08:04 PM, Ted Dunning wrote:
> >>>     >>>>>>>
> >>>     >>>>>>> If non-MR means map-only job with communicating mappers
> >>>     and a state
> >>>     >>>>>>> store,
> >>>     >>>>>>> I am down with that.
> >>>     >>>>>>>
> >>>     >>>>>>> What did you mean?
> >>>     >>>>>>>
> >>>     >>>>>>> On Tue, Feb 19, 2013 at 5:53 PM, Marty Kube <
> >>>     >>>>>>> martykube@beavercreekconsulting.com
> >>>     <ma...@beavercreekconsulting.com>> wrote:
> >>>     >>>>>>>
> >>>     >>>>>>>> Right now I'd lean towards the planet model, or maybe a
> >>>     non-MR
> >>>     >>>>>>>> implementation.  Anyone have a good idea for a non-MR
> >>>     solution?
> >>>     >>>>>>>>
> >>>     >>>
> >>>     >>>
> >>>     >>> --
> >>>     >>> Dr Andy Twigg
> >>>     >>> Junior Research Fellow, St Johns College, Oxford
> >>>     >>> Room 351, Department of Computer Science
> >>>     >>> http://www.cs.ox.ac.uk/people/andy.twigg/
> >>>     >>> andy.twigg@cs.ox.ac.uk <ma...@cs.ox.ac.uk> |
> >>>     +447799647538
> >>>
> >>>     >>
> >>>     >>
> >>>     >>
> >>>     >> --
> >>>     >> Dr Andy Twigg
> >>>     >> Junior Research Fellow, St Johns College, Oxford
> >>>     >> Room 351, Department of Computer Science
> >>>     >> http://www.cs.ox.ac.uk/people/andy.twigg/
> >>>     >> andy.twigg@cs.ox.ac.uk <ma...@cs.ox.ac.uk> |
> >>>     +447799647538
> >>>
> >>>     >
> >>>     >
> >>>
> >>>
> >>>
> >>>     --
> >>>     Dr Andy Twigg
> >>>     Junior Research Fellow, St Johns College, Oxford
> >>>     Room 351, Department of Computer Science
> >>>     http://www.cs.ox.ac.uk/people/andy.twigg/
> >>>     andy.twigg@cs.ox.ac.uk <ma...@cs.ox.ac.uk> |
> >>>     +447799647538
> >>>
> >>>
> >>
> >
> >
> >
> > --
> > Dr Andy Twigg
> > Junior Research Fellow, St Johns College, Oxford
> > Room 351, Department of Computer Science
> > http://www.cs.ox.ac.uk/people/andy.twigg/
> > andy.twigg@cs.ox.ac.uk | +447799647538
>
>
>
> --
> Dr Andy Twigg
> Junior Research Fellow, St Johns College, Oxford
> Room 351, Department of Computer Science
> http://www.cs.ox.ac.uk/people/andy.twigg/
> andy.twigg@cs.ox.ac.uk | +447799647538
>

Re: Out-of-core random forest implementation

Posted by Andy Twigg <an...@gmail.com>.
Hi guys,

I've created a JIRA issue for this now -
https://issues.apache.org/jira/browse/MAHOUT-1153

Right now we're working on my fork, but we will make an early patch as
soon as we can. Hopefully that will provide a basis that others who are
interested can add to.

Andy


On 22 February 2013 08:49, Andy Twigg <an...@gmail.com> wrote:
> Hi Marty,
>
> I would suggest doing each tree independently, one after the other.
> Have each node select a random sample w/replacement of its data, and
> let the master choose the random subset of features to split with. I'm
> sure there are more efficient ways, but we can think of them later.
>
> -Andy
>
>
> On 22 February 2013 01:47, Marty Kube <ma...@gmail.com> wrote:
>> On the algorithm choice, Parallel Boosted Regression Trees looks interesting
>> as well.
>>
>> http://research.engineering.wustl.edu/~tyrees/Publications_files/fr819-tyreeA.pdf
>>
>> After reading that paper I had a question about the Streaming Parallel
>> Decision Trees.  The SPDT paper is for building one decision tree.  I was
>> thinking about how to extend that to a decision forest.  I could see how one
>> can build many trees in the master and use a random choice of the splitting
>> variables (as opposed to checking all variables and using the best
>> variable). However, it seems like the idea of sampling the dataset with
>> replacement for each tree gets lost.  Is that okay?
>>
>>
>> On 02/21/2013 03:19 AM, Ted Dunning wrote:
>>>
>>> For quantile estimation, consider also streamlib at
>>> https://github.com/clearspring/stream-lib
>>>
>>> The bigmlcom implementation looks more directly applicable, actually.
>>>
>>> On Wed, Feb 20, 2013 at 5:01 PM, Andy Twigg <andy.twigg@gmail.com
>>> <ma...@gmail.com>> wrote:
>>>
>>>     Even better, there is already a good implementation of the histograms:
>>>     https://github.com/bigmlcom/histogram
>>>
>>>     -Andy
>>>
>>>
>>>     On 20 February 2013 22:50, Marty Kube <marty.kube.apache@gmail.com
>>>     <ma...@gmail.com>> wrote:
>>>     > That's a winner...
>>>     > Out of all of the algorithms I've looked at the Ben-Haim/SPDT
>>>     looks most
>>>     > likely.  In batch mode it uses one pass over the data set, it
>>>     can be used in
>>>     > a streaming mode, and has constant space and time requirements.
>>>      That seems
>>>     > like the kind of scalable algorithm we're after.
>>>     > I'm in!
>>>     >
>>>     >
>>>     > On 02/20/2013 10:09 AM, Andy Twigg wrote:
>>>     >>
>>>     >> Alternatively, the algorithm described in [1] is more
>>>     straightforward,
>>>     >> efficient, hadoop-compatible (using only mappers communicating to a
>>>     >> master) and satisfies all our requirements so far. I would like to
>>>     >> take a pass at implementing that, if anyone else is interested?
>>>     >>
>>>     >> [1]
>>>     http://jmlr.csail.mit.edu/papers/volume11/ben-haim10a/ben-haim10a.pdf
>>>     >>
>>>     >>
>>>     >> On 20 February 2013 14:27, Andy Twigg <andy.twigg@gmail.com
>>>     <ma...@gmail.com>> wrote:
>>>     >>>
>>>     >>> Why don't we start from
>>>     >>>
>>>     >>> https://github.com/ashenfad/hadooptree ?
>>>     >>>
>>>     >>> On 20 February 2013 13:25, Marty Kube
>>>     <marty.kube.apache@gmail.com <ma...@gmail.com>>
>>>
>>>     >>> wrote:
>>>     >>>>
>>>     >>>> Hi Lorenz,
>>>     >>>>
>>>     >>>> Very interesting, that's what I was asking for when I
>>>     mentioned non-MR
>>>     >>>> implementations :-)
>>>     >>>>
>>>     >>>> I have not looked at spark before, interesting that it uses
>>>     Mesos for
>>>     >>>> clustering.   I'll check it out.
>>>     >>>>
>>>     >>>>
>>>     >>>> On 02/19/2013 09:32 PM, Lorenz Knies wrote:
>>>     >>>>>
>>>     >>>>> Hi Marty,
>>>     >>>>>
>>>     >>>>> i am currently working on a PLANET-like implementation on
>>>     top of spark:
>>>     >>>>> http://spark-project.org
>>>     >>>>>
>>>     >>>>> I think this framework is a nice fit for the problem.
>>>     >>>>> If the input data fits into the "total cluster memory" you
>>>     benefit from
>>>     >>>>> the caching of the RDD's.
>>>     >>>>>
>>>     >>>>> regards,
>>>     >>>>>
>>>     >>>>> lorenz
>>>     >>>>>
>>>     >>>>>
>>>     >>>>> On Feb 20, 2013, at 2:42 AM, Marty Kube
>>>     <marty.kube.apache@gmail.com <ma...@gmail.com>>
>>>
>>>     >>>>> wrote:
>>>     >>>>>
>>>     >>>>>> You had mentioned other "resource management" platforms
>>>     like Giraph or
>>>     >>>>>> Mesos.  I haven't looked at those yet.  I guess I was think
>>>     of other
>>>     >>>>>> parallelization frameworks.
>>>     >>>>>>
>>>     >>>>>> It's interesting that the planet folks thought it was really
>>>     >>>>>> worthwhile
>>>     >>>>>> working on top of map reduce for all of the resource
>>>     management that
>>>     >>>>>> is
>>>     >>>>>> built in.
>>>     >>>>>>
>>>     >>>>>>
>>>     >>>>>> On 02/19/2013 08:04 PM, Ted Dunning wrote:
>>>     >>>>>>>
>>>     >>>>>>> If non-MR means map-only job with communicating mappers
>>>     and a state
>>>     >>>>>>> store,
>>>     >>>>>>> I am down with that.
>>>     >>>>>>>
>>>     >>>>>>> What did you mean?
>>>     >>>>>>>
>>>     >>>>>>> On Tue, Feb 19, 2013 at 5:53 PM, Marty Kube <
>>>     >>>>>>> martykube@beavercreekconsulting.com
>>>     <ma...@beavercreekconsulting.com>> wrote:
>>>     >>>>>>>
>>>     >>>>>>>> Right now I'd lean towards the planet model, or maybe a
>>>     non-MR
>>>     >>>>>>>> implementation.  Anyone have a good idea for a non-MR
>>>     solution?
>>>     >>>>>>>>
>>>     >>>
>>>     >>>
>>>     >>> --
>>>     >>> Dr Andy Twigg
>>>     >>> Junior Research Fellow, St Johns College, Oxford
>>>     >>> Room 351, Department of Computer Science
>>>     >>> http://www.cs.ox.ac.uk/people/andy.twigg/
>>>     >>> andy.twigg@cs.ox.ac.uk <ma...@cs.ox.ac.uk> |
>>>     +447799647538
>>>
>>>     >>
>>>     >>
>>>     >>
>>>     >> --
>>>     >> Dr Andy Twigg
>>>     >> Junior Research Fellow, St Johns College, Oxford
>>>     >> Room 351, Department of Computer Science
>>>     >> http://www.cs.ox.ac.uk/people/andy.twigg/
>>>     >> andy.twigg@cs.ox.ac.uk <ma...@cs.ox.ac.uk> |
>>>     +447799647538
>>>
>>>     >
>>>     >
>>>
>>>
>>>
>>>     --
>>>     Dr Andy Twigg
>>>     Junior Research Fellow, St Johns College, Oxford
>>>     Room 351, Department of Computer Science
>>>     http://www.cs.ox.ac.uk/people/andy.twigg/
>>>     andy.twigg@cs.ox.ac.uk <ma...@cs.ox.ac.uk> |
>>>     +447799647538
>>>
>>>
>>
>
>
>
> --
> Dr Andy Twigg
> Junior Research Fellow, St Johns College, Oxford
> Room 351, Department of Computer Science
> http://www.cs.ox.ac.uk/people/andy.twigg/
> andy.twigg@cs.ox.ac.uk | +447799647538



--
Dr Andy Twigg
Junior Research Fellow, St Johns College, Oxford
Room 351, Department of Computer Science
http://www.cs.ox.ac.uk/people/andy.twigg/
andy.twigg@cs.ox.ac.uk | +447799647538

Re: Out-of-core random forest implementation

Posted by Andy Twigg <an...@gmail.com>.
Hi Marty,

I would suggest doing each tree independently, one after the other.
Have each node select a random sample w/replacement of its data, and
let the master choose the random subset of features to split with. I'm
sure there are more efficient ways, but we can think of them later.
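
A minimal sketch of that per-tree, per-worker step in Java (all names here
are hypothetical placeholders, not existing Mahout code):

    import java.util.Random;

    // Hypothetical sketch, not existing Mahout code: grow one tree at a time on
    // a worker, using a bootstrap sample (with replacement) of the worker's
    // local rows and a feature subset handed down by the master.
    public class LocalTreeBuilder {
      private final Random rng = new Random();

      public Tree buildOneTree(double[][] localRows, int[] featuresFromMaster) {
        // Bootstrap: draw |localRows| rows with replacement from the local data.
        double[][] sample = new double[localRows.length][];
        for (int i = 0; i < localRows.length; i++) {
          sample[i] = localRows[rng.nextInt(localRows.length)];
        }
        // Only the master-chosen features are considered when splitting.
        return growTree(sample, featuresFromMaster);
      }

      private Tree growTree(double[][] sample, int[] features) {
        return new Tree();   // placeholder for the actual induction code
      }

      static class Tree {}
    }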

-Andy


On 22 February 2013 01:47, Marty Kube <ma...@gmail.com> wrote:
> On the algorithm choice, Parallel Boosted Regression Trees looks interesting
> as well.
>
> http://research.engineering.wustl.edu/~tyrees/Publications_files/fr819-tyreeA.pdf
>
> After reading that paper I had a question about the Streaming Parallel
> Decision Trees.  The SPDT paper is for building one decision tree.  I was
> thinking about how to extend that to a decision forest.  I could see how one
> can build many trees in the master and use a random choice of the splitting
> variables (as opposed to checking all variables and using the best
> variable). However, it seems like the idea of sampling the dataset with
> replacement for each tree gets lost.  Is that okay?
>
>
> On 02/21/2013 03:19 AM, Ted Dunning wrote:
>>
>> For quantile estimation, consider also streamlib at
>> https://github.com/clearspring/stream-lib
>>
>> The bigmlcom implementation looks more directly applicable, actually.
>>
>> On Wed, Feb 20, 2013 at 5:01 PM, Andy Twigg <andy.twigg@gmail.com
>> <ma...@gmail.com>> wrote:
>>
>>     Even better, there is already a good implementation of the histograms:
>>     https://github.com/bigmlcom/histogram
>>
>>     -Andy
>>
>>
>>     On 20 February 2013 22:50, Marty Kube <marty.kube.apache@gmail.com
>>     <ma...@gmail.com>> wrote:
>>     > That's a winner...
>>     > Out of all of the algorithms I've looked at the Ben-Haim/SPDT
>>     looks most
>>     > likely.  In batch mode it uses one pass over the data set, it
>>     can be used in
>>     > a streaming mode, and has constant space and time requirements.
>>      That seems
>>     > like the kind of scalable algorithm we're after.
>>     > I'm in!
>>     >
>>     >
>>     > On 02/20/2013 10:09 AM, Andy Twigg wrote:
>>     >>
>>     >> Alternatively, the algorithm described in [1] is more
>>     straightforward,
>>     >> efficient, hadoop-compatible (using only mappers communicating to a
>>     >> master) and satisfies all our requirements so far. I would like to
>>     >> take a pass at implementing that, if anyone else is interested?
>>     >>
>>     >> [1]
>>     http://jmlr.csail.mit.edu/papers/volume11/ben-haim10a/ben-haim10a.pdf
>>     >>
>>     >>
>>     >> On 20 February 2013 14:27, Andy Twigg <andy.twigg@gmail.com
>>     <ma...@gmail.com>> wrote:
>>     >>>
>>     >>> Why don't we start from
>>     >>>
>>     >>> https://github.com/ashenfad/hadooptree ?
>>     >>>
>>     >>> On 20 February 2013 13:25, Marty Kube
>>     <marty.kube.apache@gmail.com <ma...@gmail.com>>
>>
>>     >>> wrote:
>>     >>>>
>>     >>>> Hi Lorenz,
>>     >>>>
>>     >>>> Very interesting, that's what I was asking for when I
>>     mentioned non-MR
>>     >>>> implementations :-)
>>     >>>>
>>     >>>> I have not looked at spark before, interesting that it uses
>>     Mesos for
>>     >>>> clustering.   I'll check it out.
>>     >>>>
>>     >>>>
>>     >>>> On 02/19/2013 09:32 PM, Lorenz Knies wrote:
>>     >>>>>
>>     >>>>> Hi Marty,
>>     >>>>>
>>     >>>>> i am currently working on a PLANET-like implementation on
>>     top of spark:
>>     >>>>> http://spark-project.org
>>     >>>>>
>>     >>>>> I think this framework is a nice fit for the problem.
>>     >>>>> If the input data fits into the "total cluster memory" you
>>     benefit from
>>     >>>>> the caching of the RDD's.
>>     >>>>>
>>     >>>>> regards,
>>     >>>>>
>>     >>>>> lorenz
>>     >>>>>
>>     >>>>>
>>     >>>>> On Feb 20, 2013, at 2:42 AM, Marty Kube
>>     <marty.kube.apache@gmail.com <ma...@gmail.com>>
>>
>>     >>>>> wrote:
>>     >>>>>
>>     >>>>>> You had mentioned other "resource management" platforms
>>     like Giraph or
>>     >>>>>> Mesos.  I haven't looked at those yet.  I guess I was think
>>     of other
>>     >>>>>> parallelization frameworks.
>>     >>>>>>
>>     >>>>>> It's interesting that the planet folks thought it was really
>>     >>>>>> worthwhile
>>     >>>>>> working on top of map reduce for all of the resource
>>     management that
>>     >>>>>> is
>>     >>>>>> built in.
>>     >>>>>>
>>     >>>>>>
>>     >>>>>> On 02/19/2013 08:04 PM, Ted Dunning wrote:
>>     >>>>>>>
>>     >>>>>>> If non-MR means map-only job with communicating mappers
>>     and a state
>>     >>>>>>> store,
>>     >>>>>>> I am down with that.
>>     >>>>>>>
>>     >>>>>>> What did you mean?
>>     >>>>>>>
>>     >>>>>>> On Tue, Feb 19, 2013 at 5:53 PM, Marty Kube <
>>     >>>>>>> martykube@beavercreekconsulting.com
>>     <ma...@beavercreekconsulting.com>> wrote:
>>     >>>>>>>
>>     >>>>>>>> Right now I'd lean towards the planet model, or maybe a
>>     non-MR
>>     >>>>>>>> implementation.  Anyone have a good idea for a non-MR
>>     solution?
>>     >>>>>>>>
>>     >>>
>>     >>>
>>     >>> --
>>     >>> Dr Andy Twigg
>>     >>> Junior Research Fellow, St Johns College, Oxford
>>     >>> Room 351, Department of Computer Science
>>     >>> http://www.cs.ox.ac.uk/people/andy.twigg/
>>     >>> andy.twigg@cs.ox.ac.uk <ma...@cs.ox.ac.uk> |
>>     +447799647538
>>
>>     >>
>>     >>
>>     >>
>>     >> --
>>     >> Dr Andy Twigg
>>     >> Junior Research Fellow, St Johns College, Oxford
>>     >> Room 351, Department of Computer Science
>>     >> http://www.cs.ox.ac.uk/people/andy.twigg/
>>     >> andy.twigg@cs.ox.ac.uk <ma...@cs.ox.ac.uk> |
>>     +447799647538
>>
>>     >
>>     >
>>
>>
>>
>>     --
>>     Dr Andy Twigg
>>     Junior Research Fellow, St Johns College, Oxford
>>     Room 351, Department of Computer Science
>>     http://www.cs.ox.ac.uk/people/andy.twigg/
>>     andy.twigg@cs.ox.ac.uk <ma...@cs.ox.ac.uk> |
>>     +447799647538
>>
>>
>



--
Dr Andy Twigg
Junior Research Fellow, St Johns College, Oxford
Room 351, Department of Computer Science
http://www.cs.ox.ac.uk/people/andy.twigg/
andy.twigg@cs.ox.ac.uk | +447799647538

Re: Out-of-core random forest implementation

Posted by Marty Kube <ma...@gmail.com>.
On the algorithm choice, Parallel Boosted Regression Trees looks 
interesting as well.

http://research.engineering.wustl.edu/~tyrees/Publications_files/fr819-tyreeA.pdf

After reading that paper I had a question about the Streaming Parallel 
Decision Trees.  The SPDT paper is for building one decision tree.  I 
was thinking about how to extend that to a decision forest.  I could see 
how one can build many trees in the master and use a random choice of 
the splitting variables (as opposed to checking all variables and using 
the best variable). However, it seems like the idea of sampling the 
dataset with replacement for each tree gets lost.  Is that okay?
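
One way to keep at least feature-level randomness is to have the master
draw an independent random feature subset per tree up front. A small
illustrative sketch (a hypothetical helper, not something from the paper),
with the caveat that per-tree subsets are coarser than the per-split
feature choice in classical random forests:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    // Hypothetical master-side helper: draw an independent random subset of
    // feature indices for each tree, so the trees differ even though they all
    // see the same streamed data.
    public class FeatureSubsets {
      public static List<int[]> choose(int numTrees, int numFeatures, int perTree, long seed) {
        Random rng = new Random(seed);
        List<int[]> subsets = new ArrayList<>();
        for (int t = 0; t < numTrees; t++) {
          List<Integer> all = new ArrayList<>();
          for (int f = 0; f < numFeatures; f++) {
            all.add(f);
          }
          Collections.shuffle(all, rng);
          int[] subset = new int[perTree];
          for (int i = 0; i < perTree; i++) {
            subset[i] = all.get(i);
          }
          subsets.add(subset);
        }
        return subsets;
      }
    }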

On 02/21/2013 03:19 AM, Ted Dunning wrote:
> For quantile estimation, consider also streamlib at 
> https://github.com/clearspring/stream-lib
>
> The bigmlcom implementation looks more directly applicable, actually.
>
> On Wed, Feb 20, 2013 at 5:01 PM, Andy Twigg <andy.twigg@gmail.com 
> <ma...@gmail.com>> wrote:
>
>     Even better, there is already a good implementation of the histograms:
>     https://github.com/bigmlcom/histogram
>
>     -Andy
>
>
>     On 20 February 2013 22:50, Marty Kube <marty.kube.apache@gmail.com
>     <ma...@gmail.com>> wrote:
>     > That's a winner...
>     > Out of all of the algorithms I've looked at the Ben-Haim/SPDT
>     looks most
>     > likely.  In batch mode it uses one pass over the data set, it
>     can be used in
>     > a streaming mode, and has constant space and time requirements.
>      That seems
>     > like the kind of scalable algorithm we're after.
>     > I'm in!
>     >
>     >
>     > On 02/20/2013 10:09 AM, Andy Twigg wrote:
>     >>
>     >> Alternatively, the algorithm described in [1] is more
>     straightforward,
>     >> efficient, hadoop-compatible (using only mappers communicating to a
>     >> master) and satisfies all our requirements so far. I would like to
>     >> take a pass at implementing that, if anyone else is interested?
>     >>
>     >> [1]
>     http://jmlr.csail.mit.edu/papers/volume11/ben-haim10a/ben-haim10a.pdf
>     >>
>     >>
>     >> On 20 February 2013 14:27, Andy Twigg <andy.twigg@gmail.com
>     <ma...@gmail.com>> wrote:
>     >>>
>     >>> Why don't we start from
>     >>>
>     >>> https://github.com/ashenfad/hadooptree ?
>     >>>
>     >>> On 20 February 2013 13:25, Marty Kube
>     <marty.kube.apache@gmail.com <ma...@gmail.com>>
>     >>> wrote:
>     >>>>
>     >>>> Hi Lorenz,
>     >>>>
>     >>>> Very interesting, that's what I was asking for when I
>     mentioned non-MR
>     >>>> implementations :-)
>     >>>>
>     >>>> I have not looked at spark before, interesting that it uses
>     Mesos for
>     >>>> clustering.   I'll check it out.
>     >>>>
>     >>>>
>     >>>> On 02/19/2013 09:32 PM, Lorenz Knies wrote:
>     >>>>>
>     >>>>> Hi Marty,
>     >>>>>
>     >>>>> i am currently working on a PLANET-like implementation on
>     top of spark:
>     >>>>> http://spark-project.org
>     >>>>>
>     >>>>> I think this framework is a nice fit for the problem.
>     >>>>> If the input data fits into the "total cluster memory" you
>     benefit from
>     >>>>> the caching of the RDD's.
>     >>>>>
>     >>>>> regards,
>     >>>>>
>     >>>>> lorenz
>     >>>>>
>     >>>>>
>     >>>>> On Feb 20, 2013, at 2:42 AM, Marty Kube
>     <marty.kube.apache@gmail.com <ma...@gmail.com>>
>     >>>>> wrote:
>     >>>>>
>     >>>>>> You had mentioned other "resource management" platforms
>     like Giraph or
>     >>>>>> Mesos.  I haven't looked at those yet.  I guess I was think
>     of other
>     >>>>>> parallelization frameworks.
>     >>>>>>
>     >>>>>> It's interesting that the planet folks thought it was really
>     >>>>>> worthwhile
>     >>>>>> working on top of map reduce for all of the resource
>     management that
>     >>>>>> is
>     >>>>>> built in.
>     >>>>>>
>     >>>>>>
>     >>>>>> On 02/19/2013 08:04 PM, Ted Dunning wrote:
>     >>>>>>>
>     >>>>>>> If non-MR means map-only job with communicating mappers
>     and a state
>     >>>>>>> store,
>     >>>>>>> I am down with that.
>     >>>>>>>
>     >>>>>>> What did you mean?
>     >>>>>>>
>     >>>>>>> On Tue, Feb 19, 2013 at 5:53 PM, Marty Kube <
>     >>>>>>> martykube@beavercreekconsulting.com
>     <ma...@beavercreekconsulting.com>> wrote:
>     >>>>>>>
>     >>>>>>>> Right now I'd lean towards the planet model, or maybe a
>     non-MR
>     >>>>>>>> implementation.  Anyone have a good idea for a non-MR
>     solution?
>     >>>>>>>>
>     >>>
>     >>>
>     >>> --
>     >>> Dr Andy Twigg
>     >>> Junior Research Fellow, St Johns College, Oxford
>     >>> Room 351, Department of Computer Science
>     >>> http://www.cs.ox.ac.uk/people/andy.twigg/
>     >>> andy.twigg@cs.ox.ac.uk <ma...@cs.ox.ac.uk> |
>     +447799647538
>     >>
>     >>
>     >>
>     >> --
>     >> Dr Andy Twigg
>     >> Junior Research Fellow, St Johns College, Oxford
>     >> Room 351, Department of Computer Science
>     >> http://www.cs.ox.ac.uk/people/andy.twigg/
>     >> andy.twigg@cs.ox.ac.uk <ma...@cs.ox.ac.uk> |
>     +447799647538
>     >
>     >
>
>
>
>     --
>     Dr Andy Twigg
>     Junior Research Fellow, St Johns College, Oxford
>     Room 351, Department of Computer Science
>     http://www.cs.ox.ac.uk/people/andy.twigg/
>     andy.twigg@cs.ox.ac.uk <ma...@cs.ox.ac.uk> |
>     +447799647538
>
>


Re: Out-of-core random forest implementation

Posted by Ted Dunning <te...@gmail.com>.
For quantile estimation, consider also streamlib at
https://github.com/clearspring/stream-lib

The bigmlcom implementation looks more directly applicable, actually.

On Wed, Feb 20, 2013 at 5:01 PM, Andy Twigg <an...@gmail.com> wrote:

> Even better, there is already a good implementation of the histograms:
> https://github.com/bigmlcom/histogram
>
> -Andy
>
>
> On 20 February 2013 22:50, Marty Kube <ma...@gmail.com> wrote:
> > That's a winner...
> > Out of all of the algorithms I've looked at the Ben-Haim/SPDT looks most
> > likely.  In batch mode it uses one pass over the data set, it can be
> used in
> > a streaming mode, and has constant space and time requirements.  That
> seems
> > like the kind of scalable algorithm we're after.
> > I'm in!
> >
> >
> > On 02/20/2013 10:09 AM, Andy Twigg wrote:
> >>
> >> Alternatively, the algorithm described in [1] is more straightforward,
> >> efficient, hadoop-compatible (using only mappers communicating to a
> >> master) and satisfies all our requirements so far. I would like to
> >> take a pass at implementing that, if anyone else is interested?
> >>
> >> [1]
> http://jmlr.csail.mit.edu/papers/volume11/ben-haim10a/ben-haim10a.pdf
> >>
> >>
> >> On 20 February 2013 14:27, Andy Twigg <an...@gmail.com> wrote:
> >>>
> >>> Why don't we start from
> >>>
> >>> https://github.com/ashenfad/hadooptree ?
> >>>
> >>> On 20 February 2013 13:25, Marty Kube <ma...@gmail.com>
> >>> wrote:
> >>>>
> >>>> Hi Lorenz,
> >>>>
> >>>> Very interesting, that's what I was asking for when I mentioned non-MR
> >>>> implementations :-)
> >>>>
> >>>> I have not looked at spark before, interesting that it uses Mesos for
> >>>> clustering.   I'll check it out.
> >>>>
> >>>>
> >>>> On 02/19/2013 09:32 PM, Lorenz Knies wrote:
> >>>>>
> >>>>> Hi Marty,
> >>>>>
> >>>>> i am currently working on a PLANET-like implementation on top of
> spark:
> >>>>> http://spark-project.org
> >>>>>
> >>>>> I think this framework is a nice fit for the problem.
> >>>>> If the input data fits into the "total cluster memory" you benefit
> from
> >>>>> the caching of the RDD's.
> >>>>>
> >>>>> regards,
> >>>>>
> >>>>> lorenz
> >>>>>
> >>>>>
> >>>>> On Feb 20, 2013, at 2:42 AM, Marty Kube <marty.kube.apache@gmail.com
> >
> >>>>> wrote:
> >>>>>
> >>>>>> You had mentioned other "resource management" platforms like Giraph
> or
> >>>>>> Mesos.  I haven't looked at those yet.  I guess I was think of other
> >>>>>> parallelization frameworks.
> >>>>>>
> >>>>>> It's interesting that the planet folks thought it was really
> >>>>>> worthwhile
> >>>>>> working on top of map reduce for all of the resource management that
> >>>>>> is
> >>>>>> built in.
> >>>>>>
> >>>>>>
> >>>>>> On 02/19/2013 08:04 PM, Ted Dunning wrote:
> >>>>>>>
> >>>>>>> If non-MR means map-only job with communicating mappers and a state
> >>>>>>> store,
> >>>>>>> I am down with that.
> >>>>>>>
> >>>>>>> What did you mean?
> >>>>>>>
> >>>>>>> On Tue, Feb 19, 2013 at 5:53 PM, Marty Kube <
> >>>>>>> martykube@beavercreekconsulting.com> wrote:
> >>>>>>>
> >>>>>>>> Right now I'd lean towards the planet model, or maybe a non-MR
> >>>>>>>> implementation.  Anyone have a good idea for a non-MR solution?
> >>>>>>>>
> >>>
> >>>
> >>> --
> >>> Dr Andy Twigg
> >>> Junior Research Fellow, St Johns College, Oxford
> >>> Room 351, Department of Computer Science
> >>> http://www.cs.ox.ac.uk/people/andy.twigg/
> >>> andy.twigg@cs.ox.ac.uk | +447799647538
> >>
> >>
> >>
> >> --
> >> Dr Andy Twigg
> >> Junior Research Fellow, St Johns College, Oxford
> >> Room 351, Department of Computer Science
> >> http://www.cs.ox.ac.uk/people/andy.twigg/
> >> andy.twigg@cs.ox.ac.uk | +447799647538
> >
> >
>
>
>
> --
> Dr Andy Twigg
> Junior Research Fellow, St Johns College, Oxford
> Room 351, Department of Computer Science
> http://www.cs.ox.ac.uk/people/andy.twigg/
> andy.twigg@cs.ox.ac.uk | +447799647538
>

Re: Out-of-core random forest implementation

Posted by Andy Twigg <an...@gmail.com>.
Even better, there is already a good implementation of the histograms:
https://github.com/bigmlcom/histogram

-Andy


On 20 February 2013 22:50, Marty Kube <ma...@gmail.com> wrote:
> That's a winner...
> Out of all of the algorithms I've looked at the Ben-Haim/SPDT looks most
> likely.  In batch mode it uses one pass over the data set, it can be used in
> a streaming mode, and has constant space and time requirements.  That seems
> like the kind of scalable algorithm we're after.
> I'm in!
>
>
> On 02/20/2013 10:09 AM, Andy Twigg wrote:
>>
>> Alternatively, the algorithm described in [1] is more straightforward,
>> efficient, hadoop-compatible (using only mappers communicating to a
>> master) and satisfies all our requirements so far. I would like to
>> take a pass at implementing that, if anyone else is interested?
>>
>> [1] http://jmlr.csail.mit.edu/papers/volume11/ben-haim10a/ben-haim10a.pdf
>>
>>
>> On 20 February 2013 14:27, Andy Twigg <an...@gmail.com> wrote:
>>>
>>> Why don't we start from
>>>
>>> https://github.com/ashenfad/hadooptree ?
>>>
>>> On 20 February 2013 13:25, Marty Kube <ma...@gmail.com>
>>> wrote:
>>>>
>>>> Hi Lorenz,
>>>>
>>>> Very interesting, that's what I was asking for when I mentioned non-MR
>>>> implementations :-)
>>>>
>>>> I have not looked at spark before, interesting that it uses Mesos for
>>>> clustering.   I'll check it out.
>>>>
>>>>
>>>> On 02/19/2013 09:32 PM, Lorenz Knies wrote:
>>>>>
>>>>> Hi Marty,
>>>>>
>>>>> i am currently working on a PLANET-like implementation on top of spark:
>>>>> http://spark-project.org
>>>>>
>>>>> I think this framework is a nice fit for the problem.
>>>>> If the input data fits into the "total cluster memory" you benefit from
>>>>> the caching of the RDD's.
>>>>>
>>>>> regards,
>>>>>
>>>>> lorenz
>>>>>
>>>>>
>>>>> On Feb 20, 2013, at 2:42 AM, Marty Kube <ma...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> You had mentioned other "resource management" platforms like Giraph or
>>>>>> Mesos.  I haven't looked at those yet.  I guess I was think of other
>>>>>> parallelization frameworks.
>>>>>>
>>>>>> It's interesting that the planet folks thought it was really
>>>>>> worthwhile
>>>>>> working on top of map reduce for all of the resource management that
>>>>>> is
>>>>>> built in.
>>>>>>
>>>>>>
>>>>>> On 02/19/2013 08:04 PM, Ted Dunning wrote:
>>>>>>>
>>>>>>> If non-MR means map-only job with communicating mappers and a state
>>>>>>> store,
>>>>>>> I am down with that.
>>>>>>>
>>>>>>> What did you mean?
>>>>>>>
>>>>>>> On Tue, Feb 19, 2013 at 5:53 PM, Marty Kube <
>>>>>>> martykube@beavercreekconsulting.com> wrote:
>>>>>>>
>>>>>>>> Right now I'd lean towards the planet model, or maybe a non-MR
>>>>>>>> implementation.  Anyone have a good idea for a non-MR solution?
>>>>>>>>
>>>
>>>
>>> --
>>> Dr Andy Twigg
>>> Junior Research Fellow, St Johns College, Oxford
>>> Room 351, Department of Computer Science
>>> http://www.cs.ox.ac.uk/people/andy.twigg/
>>> andy.twigg@cs.ox.ac.uk | +447799647538
>>
>>
>>
>> --
>> Dr Andy Twigg
>> Junior Research Fellow, St Johns College, Oxford
>> Room 351, Department of Computer Science
>> http://www.cs.ox.ac.uk/people/andy.twigg/
>> andy.twigg@cs.ox.ac.uk | +447799647538
>
>



--
Dr Andy Twigg
Junior Research Fellow, St Johns College, Oxford
Room 351, Department of Computer Science
http://www.cs.ox.ac.uk/people/andy.twigg/
andy.twigg@cs.ox.ac.uk | +447799647538

Re: Out-of-core random forest implementation

Posted by Marty Kube <ma...@gmail.com>.
That's a winner...
Out of all of the algorithms I've looked at, the Ben-Haim/SPDT looks most
likely.  In batch mode it uses one pass over the data set, it can be
used in a streaming mode, and it has constant space and time requirements.
That seems like the kind of scalable algorithm we're after.
I'm in!
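
For reference, the core of the Ben-Haim/Tom-Tov algorithm is an on-line
histogram that keeps at most a fixed number of (centroid, count) bins and
merges the two closest bins whenever it overflows. A rough, simplified
sketch in Java (my own illustration, not the bigml or Mahout code):

    import java.util.TreeMap;

    // Rough sketch of the Ben-Haim/Tom-Tov on-line histogram: keep at most
    // maxBins (centroid, count) pairs; when a new point pushes us over the
    // limit, merge the two closest centroids into their weighted average.
    public class StreamingHistogram {
      private final int maxBins;
      private final TreeMap<Double, Long> bins = new TreeMap<>();

      public StreamingHistogram(int maxBins) {
        this.maxBins = maxBins;
      }

      public void update(double p) {
        bins.merge(p, 1L, Long::sum);
        if (bins.size() > maxBins) {
          mergeClosestPair();
        }
      }

      private void mergeClosestPair() {
        Double prev = null, left = null, right = null;
        double bestGap = Double.POSITIVE_INFINITY;
        for (double c : bins.keySet()) {
          if (prev != null && c - prev < bestGap) {
            bestGap = c - prev;
            left = prev;
            right = c;
          }
          prev = c;
        }
        long lc = bins.remove(left), rc = bins.remove(right);
        // replace the two bins by their weighted-average centroid
        bins.merge((left * lc + right * rc) / (lc + rc), lc + rc, Long::sum);
      }
    }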

On 02/20/2013 10:09 AM, Andy Twigg wrote:
> Alternatively, the algorithm described in [1] is more straightforward,
> efficient, hadoop-compatible (using only mappers communicating to a
> master) and satisfies all our requirements so far. I would like to
> take a pass at implementing that, if anyone else is interested?
>
> [1] http://jmlr.csail.mit.edu/papers/volume11/ben-haim10a/ben-haim10a.pdf
>
>
> On 20 February 2013 14:27, Andy Twigg <an...@gmail.com> wrote:
>> Why don't we start from
>>
>> https://github.com/ashenfad/hadooptree ?
>>
>> On 20 February 2013 13:25, Marty Kube <ma...@gmail.com> wrote:
>>> Hi Lorenz,
>>>
>>> Very interesting, that's what I was asking for when I mentioned non-MR
>>> implementations :-)
>>>
>>> I have not looked at spark before, interesting that it uses Mesos for
>>> clustering.   I'll check it out.
>>>
>>>
>>> On 02/19/2013 09:32 PM, Lorenz Knies wrote:
>>>> Hi Marty,
>>>>
>>>> i am currently working on a PLANET-like implementation on top of spark:
>>>> http://spark-project.org
>>>>
>>>> I think this framework is a nice fit for the problem.
>>>> If the input data fits into the "total cluster memory" you benefit from
>>>> the caching of the RDD's.
>>>>
>>>> regards,
>>>>
>>>> lorenz
>>>>
>>>>
>>>> On Feb 20, 2013, at 2:42 AM, Marty Kube <ma...@gmail.com>
>>>> wrote:
>>>>
>>>>> You had mentioned other "resource management" platforms like Giraph or
>>>>> Mesos.  I haven't looked at those yet.  I guess I was think of other
>>>>> parallelization frameworks.
>>>>>
>>>>> It's interesting that the planet folks thought it was really worthwhile
>>>>> working on top of map reduce for all of the resource management that is
>>>>> built in.
>>>>>
>>>>>
>>>>> On 02/19/2013 08:04 PM, Ted Dunning wrote:
>>>>>> If non-MR means map-only job with communicating mappers and a state
>>>>>> store,
>>>>>> I am down with that.
>>>>>>
>>>>>> What did you mean?
>>>>>>
>>>>>> On Tue, Feb 19, 2013 at 5:53 PM, Marty Kube <
>>>>>> martykube@beavercreekconsulting.com> wrote:
>>>>>>
>>>>>>> Right now I'd lean towards the planet model, or maybe a non-MR
>>>>>>> implementation.  Anyone have a good idea for a non-MR solution?
>>>>>>>
>>
>>
>> --
>> Dr Andy Twigg
>> Junior Research Fellow, St Johns College, Oxford
>> Room 351, Department of Computer Science
>> http://www.cs.ox.ac.uk/people/andy.twigg/
>> andy.twigg@cs.ox.ac.uk | +447799647538
>
>
> --
> Dr Andy Twigg
> Junior Research Fellow, St Johns College, Oxford
> Room 351, Department of Computer Science
> http://www.cs.ox.ac.uk/people/andy.twigg/
> andy.twigg@cs.ox.ac.uk | +447799647538


Re: Out-of-core random forest implementation

Posted by Marty Kube <ma...@beavercreekconsulting.com>.
Hi Andy,
I am interested. Let me take a look at the papers you mentioned. 


On Feb 20, 2013, at 10:09 AM, Andy Twigg <an...@gmail.com> wrote:

> Alternatively, the algorithm described in [1] is more straightforward,
> efficient, hadoop-compatible (using only mappers communicating to a
> master) and satisfies all our requirements so far. I would like to
> take a pass at implementing that, if anyone else is interested?
> 
> [1] http://jmlr.csail.mit.edu/papers/volume11/ben-haim10a/ben-haim10a.pdf
> 
> 
> On 20 February 2013 14:27, Andy Twigg <an...@gmail.com> wrote:
>> Why don't we start from
>> 
>> https://github.com/ashenfad/hadooptree ?
>> 
>> On 20 February 2013 13:25, Marty Kube <ma...@gmail.com> wrote:
>>> Hi Lorenz,
>>> 
>>> Very interesting, that's what I was asking for when I mentioned non-MR
>>> implementations :-)
>>> 
>>> I have not looked at spark before, interesting that it uses Mesos for
>>> clustering.   I'll check it out.
>>> 
>>> 
>>> On 02/19/2013 09:32 PM, Lorenz Knies wrote:
>>>> 
>>>> Hi Marty,
>>>> 
>>>> i am currently working on a PLANET-like implementation on top of spark:
>>>> http://spark-project.org
>>>> 
>>>> I think this framework is a nice fit for the problem.
>>>> If the input data fits into the "total cluster memory" you benefit from
>>>> the caching of the RDD's.
>>>> 
>>>> regards,
>>>> 
>>>> lorenz
>>>> 
>>>> 
>>>> On Feb 20, 2013, at 2:42 AM, Marty Kube <ma...@gmail.com>
>>>> wrote:
>>>> 
>>>>> You had mentioned other "resource management" platforms like Giraph or
>>>>> Mesos.  I haven't looked at those yet.  I guess I was think of other
>>>>> parallelization frameworks.
>>>>> 
>>>>> It's interesting that the planet folks thought it was really worthwhile
>>>>> working on top of map reduce for all of the resource management that is
>>>>> built in.
>>>>> 
>>>>> 
>>>>> On 02/19/2013 08:04 PM, Ted Dunning wrote:
>>>>>> 
>>>>>> If non-MR means map-only job with communicating mappers and a state
>>>>>> store,
>>>>>> I am down with that.
>>>>>> 
>>>>>> What did you mean?
>>>>>> 
>>>>>> On Tue, Feb 19, 2013 at 5:53 PM, Marty Kube <
>>>>>> martykube@beavercreekconsulting.com> wrote:
>>>>>> 
>>>>>>> Right now I'd lean towards the planet model, or maybe a non-MR
>>>>>>> implementation.  Anyone have a good idea for a non-MR solution?
>> 
>> 
>> 
>> --
>> Dr Andy Twigg
>> Junior Research Fellow, St Johns College, Oxford
>> Room 351, Department of Computer Science
>> http://www.cs.ox.ac.uk/people/andy.twigg/
>> andy.twigg@cs.ox.ac.uk | +447799647538
> 
> 
> 
> --
> Dr Andy Twigg
> Junior Research Fellow, St Johns College, Oxford
> Room 351, Department of Computer Science
> http://www.cs.ox.ac.uk/people/andy.twigg/
> andy.twigg@cs.ox.ac.uk | +447799647538

Re: Out-of-core random forest implementation

Posted by Andy Twigg <an...@gmail.com>.
Alternatively, the algorithm described in [1] is more straightforward,
efficient, Hadoop-compatible (using only mappers communicating to a
master), and satisfies all our requirements so far. I would like to
take a pass at implementing that; is anyone else interested?

[1] http://jmlr.csail.mit.edu/papers/volume11/ben-haim10a/ben-haim10a.pdf
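
To make the mapper/master division concrete, here is a rough sketch of the
mapper side as I read the paper (the names and the Histogram interface are
placeholders, not an existing Mahout or Hadoop API): each mapper streams
over its own block of rows, keeps one bounded histogram per (leaf, feature,
class) key, and ships only those histograms to the master, which merges
them and chooses each leaf's split.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.function.Supplier;

    // Rough skeleton of the mapper side of an SPDT-style protocol: one pass
    // over the mapper's block of rows per round, accumulating a compressed
    // histogram per (leaf, feature, class) key. The histograms are what gets
    // shipped to the master. Histogram and routeToLeaf() are placeholders.
    public class SpdtMapperSketch {

      public interface Histogram {
        void update(double value);
        Histogram merge(Histogram other);
      }

      public static Map<String, Histogram> mapperPass(double[][] rows, int[] labels,
                                                      Supplier<Histogram> factory) {
        Map<String, Histogram> hists = new HashMap<>();
        for (int i = 0; i < rows.length; i++) {
          int leaf = routeToLeaf(rows[i]);                    // current partial tree
          for (int f = 0; f < rows[i].length; f++) {
            String key = leaf + ":" + f + ":" + labels[i];
            hists.computeIfAbsent(key, k -> factory.get())    // constant-size summary
                 .update(rows[i][f]);
          }
        }
        return hists;                                         // sent to the master
      }

      private static int routeToLeaf(double[] row) {
        return 0;   // placeholder: route the row through the tree built so far
      }
    }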


On 20 February 2013 14:27, Andy Twigg <an...@gmail.com> wrote:
> Why don't we start from
>
> https://github.com/ashenfad/hadooptree ?
>
> On 20 February 2013 13:25, Marty Kube <ma...@gmail.com> wrote:
>> Hi Lorenz,
>>
>> Very interesting, that's what I was asking for when I mentioned non-MR
>> implementations :-)
>>
>> I have not looked at spark before, interesting that it uses Mesos for
>> clustering.   I'll check it out.
>>
>>
>> On 02/19/2013 09:32 PM, Lorenz Knies wrote:
>>>
>>> Hi Marty,
>>>
>>> i am currently working on a PLANET-like implementation on top of spark:
>>> http://spark-project.org
>>>
>>> I think this framework is a nice fit for the problem.
>>> If the input data fits into the "total cluster memory" you benefit from
>>> the caching of the RDD's.
>>>
>>> regards,
>>>
>>> lorenz
>>>
>>>
>>> On Feb 20, 2013, at 2:42 AM, Marty Kube <ma...@gmail.com>
>>> wrote:
>>>
>>>> You had mentioned other "resource management" platforms like Giraph or
>>>> Mesos.  I haven't looked at those yet.  I guess I was think of other
>>>> parallelization frameworks.
>>>>
>>>> It's interesting that the planet folks thought it was really worthwhile
>>>> working on top of map reduce for all of the resource management that is
>>>> built in.
>>>>
>>>>
>>>> On 02/19/2013 08:04 PM, Ted Dunning wrote:
>>>>>
>>>>> If non-MR means map-only job with communicating mappers and a state
>>>>> store,
>>>>> I am down with that.
>>>>>
>>>>> What did you mean?
>>>>>
>>>>> On Tue, Feb 19, 2013 at 5:53 PM, Marty Kube <
>>>>> martykube@beavercreekconsulting.com> wrote:
>>>>>
>>>>>> Right now I'd lean towards the planet model, or maybe a non-MR
>>>>>> implementation.  Anyone have a good idea for a non-MR solution?
>>>>>>
>>
>
>
>
> --
> Dr Andy Twigg
> Junior Research Fellow, St Johns College, Oxford
> Room 351, Department of Computer Science
> http://www.cs.ox.ac.uk/people/andy.twigg/
> andy.twigg@cs.ox.ac.uk | +447799647538



--
Dr Andy Twigg
Junior Research Fellow, St Johns College, Oxford
Room 351, Department of Computer Science
http://www.cs.ox.ac.uk/people/andy.twigg/
andy.twigg@cs.ox.ac.uk | +447799647538

Re: Out-of-core random forest implementation

Posted by Andy Twigg <an...@gmail.com>.
Why don't we start from

https://github.com/ashenfad/hadooptree ?

On 20 February 2013 13:25, Marty Kube <ma...@gmail.com> wrote:
> Hi Lorenz,
>
> Very interesting, that's what I was asking for when I mentioned non-MR
> implementations :-)
>
> I have not looked at spark before, interesting that it uses Mesos for
> clustering.   I'll check it out.
>
>
> On 02/19/2013 09:32 PM, Lorenz Knies wrote:
>>
>> Hi Marty,
>>
>> i am currently working on a PLANET-like implementation on top of spark:
>> http://spark-project.org
>>
>> I think this framework is a nice fit for the problem.
>> If the input data fits into the "total cluster memory" you benefit from
>> the caching of the RDD's.
>>
>> regards,
>>
>> lorenz
>>
>>
>> On Feb 20, 2013, at 2:42 AM, Marty Kube <ma...@gmail.com>
>> wrote:
>>
>>> You had mentioned other "resource management" platforms like Giraph or
>>> Mesos.  I haven't looked at those yet.  I guess I was think of other
>>> parallelization frameworks.
>>>
>>> It's interesting that the planet folks thought it was really worthwhile
>>> working on top of map reduce for all of the resource management that is
>>> built in.
>>>
>>>
>>> On 02/19/2013 08:04 PM, Ted Dunning wrote:
>>>>
>>>> If non-MR means map-only job with communicating mappers and a state
>>>> store,
>>>> I am down with that.
>>>>
>>>> What did you mean?
>>>>
>>>> On Tue, Feb 19, 2013 at 5:53 PM, Marty Kube <
>>>> martykube@beavercreekconsulting.com> wrote:
>>>>
>>>>> Right now I'd lean towards the planet model, or maybe a non-MR
>>>>> implementation.  Anyone have a good idea for a non-MR solution?
>>>>>
>



--
Dr Andy Twigg
Junior Research Fellow, St Johns College, Oxford
Room 351, Department of Computer Science
http://www.cs.ox.ac.uk/people/andy.twigg/
andy.twigg@cs.ox.ac.uk | +447799647538

Re: Out-of-core random forest implementation

Posted by Marty Kube <ma...@gmail.com>.
Hi Lorenz,

Very interesting, that's what I was asking for when I mentioned non-MR 
implementations :-)

I have not looked at Spark before; it's interesting that it uses Mesos for 
clustering.  I'll check it out.

On 02/19/2013 09:32 PM, Lorenz Knies wrote:
> Hi Marty,
>
> i am currently working on a PLANET-like implementation on top of spark: http://spark-project.org
>
> I think this framework is a nice fit for the problem.
> If the input data fits into the "total cluster memory" you benefit from the caching of the RDD's.
>
> regards,
>
> lorenz
>
>
> On Feb 20, 2013, at 2:42 AM, Marty Kube <ma...@gmail.com> wrote:
>
>> You had mentioned other "resource management" platforms like Giraph or Mesos.  I haven't looked at those yet.  I guess I was think of other parallelization frameworks.
>>
>> It's interesting that the planet folks thought it was really worthwhile working on top of map reduce for all of the resource management that is built in.
>>
>>
>> On 02/19/2013 08:04 PM, Ted Dunning wrote:
>>> If non-MR means map-only job with communicating mappers and a state store,
>>> I am down with that.
>>>
>>> What did you mean?
>>>
>>> On Tue, Feb 19, 2013 at 5:53 PM, Marty Kube <
>>> martykube@beavercreekconsulting.com> wrote:
>>>
>>>> Right now I'd lean towards the planet model, or maybe a non-MR
>>>> implementation.  Anyone have a good idea for a non-MR solution?
>>>>

Re: Out-of-core random forest implementation

Posted by Lorenz Knies <me...@l1024.org>.
Hi Marty,

I am currently working on a PLANET-like implementation on top of Spark: http://spark-project.org

I think this framework is a nice fit for the problem.
If the input data fits into the "total cluster memory", you benefit from the caching of the RDDs.
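
For what it's worth, the caching Lorenz mentions is a one-liner on the RDD.
A minimal sketch against Spark's Java API (the path and master URL are
placeholders, and the org.apache.spark package name assumes a recent
release; older spark-project.org builds used a different package):

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    // Minimal sketch: load the training data once, mark it as cached, and then
    // every subsequent pass (one per tree, or one per tree level) reads it from
    // the workers' memory instead of re-scanning HDFS.
    public class CachedInput {
      public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("spark://master:7077", "rf-input");
        JavaRDD<String> data = sc.textFile("hdfs:///data/training.csv").cache();
        System.out.println("rows: " + data.count());  // first action materializes the cache
        System.out.println("rows: " + data.count());  // later passes are served from memory
        sc.stop();
      }
    }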

regards,

lorenz


On Feb 20, 2013, at 2:42 AM, Marty Kube <ma...@gmail.com> wrote:

> You had mentioned other "resource management" platforms like Giraph or Mesos.  I haven't looked at those yet.  I guess I was think of other parallelization frameworks.
> 
> It's interesting that the planet folks thought it was really worthwhile working on top of map reduce for all of the resource management that is built in.
> 
> 
> On 02/19/2013 08:04 PM, Ted Dunning wrote:
>> If non-MR means map-only job with communicating mappers and a state store,
>> I am down with that.
>> 
>> What did you mean?
>> 
>> On Tue, Feb 19, 2013 at 5:53 PM, Marty Kube <
>> martykube@beavercreekconsulting.com> wrote:
>> 
>>> Right now I'd lean towards the planet model, or maybe a non-MR
>>> implementation.  Anyone have a good idea for a non-MR solution?
>>> 
> 


Re: Out-of-core random forest implementation

Posted by Marty Kube <ma...@gmail.com>.
You had mentioned other "resource management" platforms like Giraph or 
Mesos.  I haven't looked at those yet.  I guess I was thinking of other 
parallelization frameworks.

It's interesting that the PLANET folks thought it was really worthwhile 
working on top of MapReduce for all of the resource management that is 
built in.


On 02/19/2013 08:04 PM, Ted Dunning wrote:
> If non-MR means map-only job with communicating mappers and a state store,
> I am down with that.
>
> What did you mean?
>
> On Tue, Feb 19, 2013 at 5:53 PM, Marty Kube <
> martykube@beavercreekconsulting.com> wrote:
>
>> Right now I'd lean towards the planet model, or maybe a non-MR
>> implementation.  Anyone have a good idea for a non-MR solution?
>>


Re: Out-of-core random forest implementation

Posted by Ted Dunning <te...@gmail.com>.
If non-MR means map-only job with communicating mappers and a state store,
I am down with that.

What did you mean?

On Tue, Feb 19, 2013 at 5:53 PM, Marty Kube <
martykube@beavercreekconsulting.com> wrote:

>
> Right now I'd lean towards the planet model, or maybe a non-MR
> implementation.  Anyone have a good idea for a non-MR solution?
>

Re: Out-of-core random forest implementation

Posted by Marty Kube <ma...@beavercreekconsulting.com>.
So I took a look at the PLANET paper...

That looks like a workable plan to get an out-of-core decision forest 
implementation built on top of MapReduce.  It has all of the features 
Ted is talking about, such as weak classifiers, etc.

The bad news is that it looks like a complicated piece of software.  And you 
get the feeling MR is a bit of a mismatch for this problem.

The PLANET paper also made it clear to me that RainForest is more of a 
single-core algorithm.

Right now I'd lean towards the PLANET model, or maybe a non-MR 
implementation.  Anyone have a good idea for a non-MR solution?


On 02/15/2013 03:09 AM, deneche abdelhakim wrote:
> On Fri, Feb 15, 2013 at 1:06 AM, Marty Kube <
> martykube@beavercreekconsulting.com> wrote:
>
>> On 01/28/2013 02:33 PM, Ted Dunning wrote:
>>
>>> I think I was suggesting something weaker.
>>>
>>> I was suggesting that trees get built against a portion of the data and
>>> each node builds some number of trees against just the data it sees.  This
>>> is in the spirit of random forests, but not the letter.
>>>
>> I'm looking at the Mahout partial implementation.  It appears to me that
>> some number of trees get built against the portion of the data each node
>> sees based on the map reduce input split.
>> Am I wrong here?  If not, Mahout already has an out-of-core implementation.
>>
> Indeed, each mapper will grow a subset of the forest using only the data
> passed to it. Actually, this was based on a suggestion by Ted:
>
> "modify the original algorithm to build multiple trees for different
> portions of the data. That loses some of the solidity of the original
> method, but could actually do better if the splits exposed non-stationary
> behavior."
>
>
>> BTW - where did the name Partial Implementation come from?  Is this
>> partially completed work :-)
>
> Ha ha ha, the name came from the fact that each mapper/node has a "partial"
> view of the dataset. But I do agree, that it's partially completed :P I
> guess they were never enough user interest in Mahout's Decision Forest
> implementation to motivate me to keep working on it, but I do see more and
> more questions on the mailing lists about this, maybe I will get motivated
> enough to work on this again.
>
> I did some research, some time ago, about how to grow the trees on a
> cluster using all data when the dataset is too big to fit in memory. Two
> papers got my attention:
>
> *RainForest - A Framework for Fast Decision Tree Construction of Large
> Datasets [1]*
> This paper describes how to grow the trees without loading all data in
> memory. Using this, you need to copy the dataset on all computing nodes.
>
> *PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce [2]*
> Although it uses map-reduce to grow the trees, it uses lot's of hacks and
> tricks to make it work.
>
> Shouldn't be too difficult to use [1] and improve the current
> implementation, this way the forest can be grown on "full" large dataset on
> a computing cluster. One question though, do we really need Hadoop for this
> ? I mean, as long as each node has access to the dataset files this should
> be enough to build the forest on any number of nodes.
>
> [1] http://www.cs.cornell.edu/johannes/papers/1998/vldb1998-rainforest.pdf
> [2] http://dejanseo.com.au/research/google/36296.pdf
>
>
>>> Another option is that all the nodes start some trees based on their own
>>> data and spit the set of all such trees to a central (possibly distributed
>>> store).  They also grab some number of trees back down from the central
>>> store and try elaborating those trees using their own data.  These will
>>> also be sent back to the central store.  This is closer to random forests.
>>>
>>> A third option is like option two, but the trees that get pulled down and
>>> spit back up collect statistics until enough nodes report such stats at
>>> which point a new branch is added and the stats are reset.  This is very
>>> close to real random forests.  It should also be efficient because each
>>> node is passing many trees across its own data in each pass.  If that data
>>> fits in memory (even in file cache) then the pass will be very fast.  If
>>> it
>>> doesn't fit in memory, then it will run as fast as the disk will let it
>>> run.  Allowing a tree to commit with only 90% coverage should allow good
>>> robustness against laggards while doing almost exactly the same thing as
>>> the real random forest algorithm.
>>>
>>> On Mon, Jan 28, 2013 at 8:56 AM, Andy Twigg <an...@gmail.com> wrote:
>>>
>>>   Ted,
>>>> Sorry, I don't understand. Are you suggesting that a single decision
>>>> tree can be built efficiently in a distributed fashion?
>>>>
>>>> The following 2 ways seem like the naive ways of doing this:
>>>> 1) each machine constructs one node of the tree, by scanning all the
>>>> data, filtering those that don't get to its node (this is expensive),
>>>> computing the split and then writing the split out. This requires a
>>>> pass through the data for each node of the tree.
>>>> 2) as 1), except that each machine writes out the filtered data after
>>>> its node in the tree. This requires less scanning (as much as the
>>>> local case I described earlier), but lots of data movement.
>>>>
>>>> Do you have an alternative method?
>>>>
>>>> Andy
>>>>
>>>>
>>>> On 28 January 2013 16:42, Ted Dunning <te...@gmail.com> wrote:
>>>>
>>>>> IF we have a step which permutes data (once) then I doubt that
>>>>> redistribution is necessary.  At that point the randomness consists of
>>>>> building trees based on different variable subsets and data subsets.
>>>>>   The
>>>>> original random forests only split on variable subsets.  How much this
>>>>> matters is an open question.
>>>>>
>>>>> On Mon, Jan 28, 2013 at 2:36 AM, Sean Owen <sr...@gmail.com> wrote:
>>>>>
>>>>>   It sounds OK to me. Yes, I don't think you want to try to parallelize
>>>>>> the building of each tree just to build it off all the data. It will
>>>>>> be too slow.
>>>>>>
>>>>>> I imagine the game is to parallelize such that each worker has 1/Nth
>>>>>> of the data, where N is as low as you can manage. Each worker is
>>>>>> building one or more trees from its slice. Then iterate, and split up
>>>>>> the data differently. Each worker can now cross-validate previous
>>>>>> trees based on the new slice, or even build new ones, and iterate
>>>>>> again.
>>>>>>
>>>>>> I'm guessing this process will tend to need to build more trees than
>>>>>> usual. I think it's also possible that it will want to try to discard
>>>>>> trees with high out-of-bag error.
>>>>>>
>>>>>> Just riffing. I think this is some potential green-field territory to
>>>>>> really think about how you build trees when it doesn't nearly fit in
>>>>>> memory, and what shape of computation gives a really nice tradeoff
>>>>>> between speed and accuracy. I don't think the conventional approach is
>>>>>> that point but it's more like the above.
>>>>>>
>>>>>> On Mon, Jan 28, 2013 at 10:14 AM, Andy Twigg <an...@gmail.com>
>>>>>>
>>>>> wrote:
>>>>> Here's my thought for what might work well. Comments very welcome.
>>>>>>> We construct k decision trees (where k might be 20-100). Each tree has
>>>>>>> depth roughly 10-30. Rather than construct each tree in a distributed
>>>>>>> fashion, construct each tree locally by using multiple passes over the
>>>>>>> bagged sample for that tree (construct in BFS manner, assuming the
>>>>>>> tree can fit in memory). Do this in parallel k times. I know this is
>>>>>>> perhaps not the "mapreduce" way, but I'd be interested to hear
>>>>>>> alternatives. It seems a bad idea to shuffle the data around on every
>>>>>>> BFS iteration per tree.
>>>>>>>
>>>>>>> I'm not planning to work on this right now, but would like to figure
>>>>>>> out if Mahout is a good platform to work with if I want such an
>>>>>>> algorithm.
>>>>>>>
>>>>>>> Andy
>>>>>>>
>>>> --
>>>> Dr Andy Twigg
>>>> Junior Research Fellow, St Johns College, Oxford
>>>> Room 351, Department of Computer Science
>>>>> http://www.cs.ox.ac.uk/people/andy.twigg/
>>>> andy.twigg@cs.ox.ac.uk | +447799647538
>>>>
>>>>


Re: Out-of-core random forest implementation

Posted by Ted Dunning <te...@gmail.com>.
I think that it helps to view random forests in terms of the general
approach.

The basic idea is that you want to train up many weak classifiers.
 Classical random forest selects a subset of variables to work with.  You
could also work on a subset of the data as well.  This is what partial does.

You can also have each node collect the data for many trees from part of
the data and send their summary statistics to other nodes so that all of
the data for a particular tree winds up in the same place.  These other
nodes can elaborate all of the trees (possibly in parallel) by adding a new
decision level and send the trees back to all of the original nodes.  These
can pass over their data collating partial statistics for the next round in
the same way as the original round.  This approach admits many different
divisions of the problem.  Each of the original nodes can look at all or
part of the data.  If part, it can be a randomized part of the data or a
map-reduce style slice.  Each of the original nodes can learn for all of
the trees, or a subset.  If a subset, you can have multiple nodes for each
data slice or just one. The collating nodes can involve many nodes or a
few.  The collating nodes can be a separate set of nodes or they can just
be original nodes who moonlight as collators.

My guess is that having all original nodes handle all trees is a good move
since the number of trees is likely to be relatively small and that makes
passing over all of the data simple.

This style of coding is well suited to BSP frameworks such as Giraph, but it isn't
hard to hand-code either.  It is a natural fit for something like Mesos or
YARN as well.
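
As a toy illustration of the summary-statistics step in that scheme: each
slice can be reduced to plain counts keyed by (tree, leaf, feature bin,
class), and those summaries merge by simple addition no matter how the data
was sliced, which is all a collator needs before adding the next decision
level. A small self-contained Java example (all names made up for
illustration):

    import java.util.HashMap;
    import java.util.Map;

    // Toy example (made-up names): each worker reduces its slice to counts per
    // (tree, leaf, feature bin, class); the collator merges the maps by adding
    // counts, then has everything it needs to add one decision level per tree.
    public class StatsMergeExample {

      static Map<String, Long> collect(double[][] rows, int[] labels) {
        Map<String, Long> stats = new HashMap<>();
        for (int i = 0; i < rows.length; i++) {
          // A real pass would route each row through every tree; here we pretend
          // there is a single tree with a single leaf and one binary feature.
          int bin = rows[i][0] < 0.5 ? 0 : 1;
          stats.merge("t0:leaf0:f0:bin" + bin + ":class" + labels[i], 1L, Long::sum);
        }
        return stats;
      }

      static Map<String, Long> merge(Map<String, Long> a, Map<String, Long> b) {
        Map<String, Long> out = new HashMap<>(a);
        b.forEach((k, v) -> out.merge(k, v, Long::sum));
        return out;
      }

      public static void main(String[] args) {
        Map<String, Long> worker1 = collect(new double[][] {{0.2}, {0.7}}, new int[] {0, 1});
        Map<String, Long> worker2 = collect(new double[][] {{0.9}}, new int[] {1});
        // Counts add up the same way no matter how the rows were sliced.
        System.out.println(merge(worker1, worker2));
      }
    }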

On Fri, Feb 15, 2013 at 6:29 PM, Marty Kube <ma...@gmail.com> wrote:

> I just took a look at the Rain Forest paper.  It's from '98 and the feel
> is really different.  Large main memory is measured in MB and you use like
> only one computer to do stuff.  Weird :-)
>
> I think the jist of the algorithm is to build one tree per host and to
> have roughly one pass over the data set per node in the tree.  Is that
> scalable?  If the data is large I tend to think you would want few passes
> through the data and not repeated passes over the entire dataset per host.
>
> It makes me think about sampling the data set until in fits in memory or
> using the partial implementation...
>
> On 02/15/2013 03:09 AM, deneche abdelhakim wrote:
>
>> On Fri, Feb 15, 2013 at 1:06 AM, Marty Kube <
>> martykube@beavercreekconsulting.com>
>> wrote:
>>
>>  On 01/28/2013 02:33 PM, Ted Dunning wrote:
>>>
>>>  I think I was suggesting something weaker.
>>>>
>>>> I was suggesting that trees get built against a portion of the data and
>>>> each node builds some number of trees against just the data it sees.
>>>>  This
>>>> is in the spirit of random forests, but not the letter.
>>>>
>>>>  I'm looking at the Mahout partial implementation.  It appears to me
>>> that
>>> some number of trees get built against the portion of the data each node
>>> sees based on the map reduce input split.
>>> Am I wrong here?  If not, Mahout already has an out-of-core
>>> implementation.
>>>
>>>  Indeed, each mapper will grow a subset of the forest using only the data
>> passed to it. Actually, this was based on a suggestion by Ted:
>>
>> "modify the original algorithm to build multiple trees for different
>> portions of the data. That loses some of the solidity of the original
>> method, but could actually do better if the splits exposed non-stationary
>> behavior."
>>
>>
>>  BTW - where did the name Partial Implementation come from?  Is this
>>> partially completed work :-)
>>>
>>
>> Ha ha ha, the name came from the fact that each mapper/node has a
>> "partial"
>> view of the dataset. But I do agree, that it's partially completed :P I
>> guess they were never enough user interest in Mahout's Decision Forest
>> implementation to motivate me to keep working on it, but I do see more and
>> more questions on the mailing lists about this, maybe I will get motivated
>> enough to work on this again.
>>
>> I did some research, some time ago, about how to grow the trees on a
>> cluster using all data when the dataset is too big to fit in memory. Two
>> papers got my attention:
>>
>> *RainForest - A Framework for Fast Decision Tree Construction of Large
>> Datasets [1]*
>> This paper describes how to grow the trees without loading all data in
>> memory. Using this, you need to copy the dataset on all computing nodes.
>>
>> *PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce [2]*
>> Although it uses map-reduce to grow the trees, it uses lot's of hacks and
>> tricks to make it work.
>>
>> Shouldn't be too difficult to use [1] and improve the current
>> implementation, this way the forest can be grown on "full" large dataset
>> on
>> a computing cluster. One question though, do we really need Hadoop for
>> this
>> ? I mean, as long as each node has access to the dataset files this should
>> be enough to build the forest on any number of nodes.
>>
>> [1] http://www.cs.cornell.edu/johannes/papers/1998/vldb1998-rainforest.pdf
>> [2] http://dejanseo.com.au/research/google/36296.pdf
>>
>>
>>  Another option is that all the nodes start some trees based on their own
>>>> data and spit the set of all such trees to a central (possibly
>>>> distributed
>>>> store).  They also grab some number of trees back down from the central
>>>> store and try elaborating those trees using their own data.  These will
>>>> also be sent back to the central store.  This is closer to random
>>>> forests.
>>>>
>>>> A third option is like option two, but the trees that get pulled down
>>>> and
>>>> spit back up collect statistics until enough nodes report such stats at
>>>> which point a new branch is added and the stats are reset.  This is very
>>>> close to real random forests.  It should also be efficient because each
>>>> node is passing many trees across its own data in each pass.  If that
>>>> data
>>>> fits in memory (even in file cache) then the pass will be very fast.  If
>>>> it
>>>> doesn't fit in memory, then it will run as fast as the disk will let it
>>>> run.  Allowing a tree to commit with only 90% coverage should allow good
>>>> robustness against laggards while doing almost exactly the same thing as
>>>> the real random forest algorithm.
>>>>
>>>> On Mon, Jan 28, 2013 at 8:56 AM, Andy Twigg <an...@gmail.com>
>>>> wrote:
>>>>
>>>>   Ted,
>>>>
>>>>> Sorry, I don't understand. Are you suggesting that a single decision
>>>>> tree can be built efficiently in a distributed fashion?
>>>>>
>>>>> The following 2 ways seem like the naive ways of doing this:
>>>>> 1) each machine constructs one node of the tree, by scanning all the
>>>>> data, filtering those that don't get to its node (this is expensive),
>>>>> computing the split and then writing the split out. This requires a
>>>>> pass through the data for each node of the tree.
>>>>> 2) as 1), except that each machine writes out the filtered data after
>>>>> its node in the tree. This requires less scanning (as much as the
>>>>> local case I described earlier), but lots of data movement.
>>>>>
>>>>> Do you have an alternative method?
>>>>>
>>>>> Andy
>>>>>
>>>>>
>>>>> On 28 January 2013 16:42, Ted Dunning <te...@gmail.com> wrote:
>>>>>
>>>>>  IF we have a step which permutes data (once) then I doubt that
>>>>>> redistribution is necessary.  At that point the randomness consists of
>>>>>> building trees based on different variable subsets and data subsets.
>>>>>>   The
>>>>>> original random forests only split on variable subsets.  How much this
>>>>>> matters is an open question.
>>>>>>
>>>>>> On Mon, Jan 28, 2013 at 2:36 AM, Sean Owen <sr...@gmail.com> wrote:
>>>>>>
>>>>>>   It sounds OK to me. Yes, I don't think you want to try to
>>>>>> parallelize
>>>>>>
>>>>>>> the building of each tree just to build it off all the data. It will
>>>>>>> be too slow.
>>>>>>>
>>>>>>> I imagine the game is to parallelize such that each worker has 1/Nth
>>>>>>> of the data, where N is as low as you can manage. Each worker is
>>>>>>> building one or more trees from its slice. Then iterate, and split up
>>>>>>> the data differently. Each worker can now cross-validate previous
>>>>>>> trees based on the new slice, or even build new ones, and iterate
>>>>>>> again.
>>>>>>>
>>>>>>> I'm guessing this process will tend to need to build more trees than
>>>>>>> usual. I think it's also possible that it will want to try to discard
>>>>>>> trees with high out-of-bag error.
>>>>>>>
>>>>>>> Just riffing. I think this is some potential green-field territory to
>>>>>>> really think about how you build trees when it doesn't nearly fit in
>>>>>>> memory, and what shape of computation gives a really nice tradeoff
>>>>>>> between speed and accuracy. I don't think the conventional approach
>>>>>>> is
>>>>>>> that point but it's more like the above.
>>>>>>>
>>>>>>> On Mon, Jan 28, 2013 at 10:14 AM, Andy Twigg <an...@gmail.com>
>>>>>>>
>>>>>>>  wrote:
>>>>>> Here's my thought for what might work well. Comments very welcome.
>>>>>>
>>>>>>> We construct k decision trees (where k might be 20-100). Each tree
>>>>>>>> has
>>>>>>>> depth roughly 10-30. Rather than construct each tree in a
>>>>>>>> distributed
>>>>>>>> fashion, construct each tree locally by using multiple passes over
>>>>>>>> the
>>>>>>>> bagged sample for that tree (construct in BFS manner, assuming the
>>>>>>>> tree can fit in memory). Do this in parallel k times. I know this is
>>>>>>>> perhaps not the "mapreduce" way, but I'd be interested to hear
>>>>>>>> alternatives. It seems a bad idea to shuffle the data around on
>>>>>>>> every
>>>>>>>> BFS iteration per tree.
>>>>>>>>
>>>>>>>> I'm not planning to work on this right now, but would like to figure
>>>>>>>> out if Mahout is a good platform to work with if I want such an
>>>>>>>> algorithm.
>>>>>>>>
>>>>>>>> Andy
>>>>>>>>
>>>>>>>>  --
>>>>> Dr Andy Twigg
>>>>> Junior Research Fellow, St Johns College, Oxford
>>>>> Room 351, Department of Computer Science
>>>>> http://www.cs.ox.ac.uk/people/andy.twigg/
>>>>> andy.twigg@cs.ox.ac.uk | +447799647538
>>>>>
>>>>>
>>>>>
>

Re: Out-of-core random forest implementation

Posted by Marty Kube <ma...@gmail.com>.
I just took a look at the RainForest paper.  It's from '98 and the feel
is really different.  Large main memory is measured in MB and you use
only one computer to do stuff.  Weird :-)

I think the gist of the algorithm is to build one tree per host and to
make roughly one pass over the data set per node in the tree.  Is that
scalable?  If the data is large I'd think you would want few passes
through the data, not repeated passes over the entire dataset per host.

It makes me think about sampling the data set until it fits in memory or
using the partial implementation...
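
Roughly, the pass structure looks like this: one sequential scan can serve a
whole level of frontier nodes at once, and only statistics ever sit in
memory. This is just an illustrative sketch, all the types are made up and
it is nothing like Mahout's current code:

import java.util.ArrayList;
import java.util.List;

final class LevelWisePasses {

  // Stub tree node; a real implementation would also keep per-attribute
  // class histograms (RainForest-style AVC-sets) to pick the split from.
  static final class Node {
    Node left, right;
    int splitAttr = -1;        // -1 means "not split yet"
    double threshold;
    long instancesSeen;

    Node route(double[] x) {   // follow splits decided in earlier levels
      Node n = this;
      while (n.splitAttr >= 0) {
        n = x[n.splitAttr] <= n.threshold ? n.left : n.right;
      }
      return n;
    }
  }

  // One sequential scan of the on-disk data per tree LEVEL, not per node.
  static void grow(Iterable<double[]> diskScan, Node root, int maxDepth) {
    List<Node> frontier = new ArrayList<Node>();
    frontier.add(root);
    for (int level = 0; level < maxDepth && !frontier.isEmpty(); level++) {
      for (double[] x : diskScan) {        // the pass for this level
        root.route(x).instancesSeen++;     // only statistics stay in memory
      }
      List<Node> next = new ArrayList<Node>();
      for (Node n : frontier) {
        if (n.instancesSeen > 1) {
          // Placeholder: real code derives attr/threshold from the statistics.
          n.splitAttr = 0;
          n.threshold = 0.0;
          n.left = new Node();
          n.right = new Node();
          next.add(n.left);
          next.add(n.right);
        }
      }
      frontier = next;
    }
  }
}

With that shape, a depth-d tree costs d sequential scans plus a little
frontier bookkeeping, and nothing but counts ever has to fit in memory.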

On 02/15/2013 03:09 AM, deneche abdelhakim wrote:
> On Fri, Feb 15, 2013 at 1:06 AM, Marty Kube <
> martykube@beavercreekconsulting.com> wrote:
>
>> On 01/28/2013 02:33 PM, Ted Dunning wrote:
>>
>>> I think I was suggesting something weaker.
>>>
>>> I was suggesting that trees get built against a portion of the data and
>>> each node builds some number of trees against just the data it sees.  This
>>> is in the spirit of random forests, but not the letter.
>>>
>> I'm looking at the Mahout partial implementation.  It appears to me that
>> some number of trees get built against the portion of the data each node
>> sees based on the map reduce input split.
>> Am I wrong here?  If not, Mahout already has an out-of-core implementation.
>>
> Indeed, each mapper will grow a subset of the forest using only the data
> passed to it. Actually, this was based on a suggestion by Ted:
>
> "modify the original algorithm to build multiple trees for different
> portions of the data. That loses some of the solidity of the original
> method, but could actually do better if the splits exposed non-stationary
> behavior."
>
>
>> BTW - where did the name Partial Implementation come from?  Is this
>> partially completed work :-)
>
> Ha ha ha, the name came from the fact that each mapper/node has a "partial"
> view of the dataset. But I do agree, that it's partially completed :P I
> guess they were never enough user interest in Mahout's Decision Forest
> implementation to motivate me to keep working on it, but I do see more and
> more questions on the mailing lists about this, maybe I will get motivated
> enough to work on this again.
>
> I did some research, some time ago, about how to grow the trees on a
> cluster using all data when the dataset is too big to fit in memory. Two
> papers got my attention:
>
> *RainForest - A Framework for Fast Decision Tree Construction of Large
> Datasets [1]*
> This paper describes how to grow the trees without loading all data in
> memory. Using this, you need to copy the dataset on all computing nodes.
>
> *PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce [2]*
> Although it uses map-reduce to grow the trees, it uses lot's of hacks and
> tricks to make it work.
>
> Shouldn't be too difficult to use [1] and improve the current
> implementation, this way the forest can be grown on "full" large dataset on
> a computing cluster. One question though, do we really need Hadoop for this
> ? I mean, as long as each node has access to the dataset files this should
> be enough to build the forest on any number of nodes.
>
> [1] http://www.cs.cornell.edu/johannes/papers/1998/vldb1998-rainforest.pdf
> [2] http://dejanseo.com.au/research/google/36296.pdf
>
>
>>> Another option is that all the nodes start some trees based on their own
>>> data and spit the set of all such trees to a central (possibly distributed
>>> store).  They also grab some number of trees back down from the central
>>> store and try elaborating those trees using their own data.  These will
>>> also be sent back to the central store.  This is closer to random forests.
>>>
>>> A third option is like option two, but the trees that get pulled down and
>>> spit back up collect statistics until enough nodes report such stats at
>>> which point a new branch is added and the stats are reset.  This is very
>>> close to real random forests.  It should also be efficient because each
>>> node is passing many trees across its own data in each pass.  If that data
>>> fits in memory (even in file cache) then the pass will be very fast.  If
>>> it
>>> doesn't fit in memory, then it will run as fast as the disk will let it
>>> run.  Allowing a tree to commit with only 90% coverage should allow good
>>> robustness against laggards while doing almost exactly the same thing as
>>> the real random forest algorithm.
>>>
>>> On Mon, Jan 28, 2013 at 8:56 AM, Andy Twigg <an...@gmail.com> wrote:
>>>
>>>   Ted,
>>>> Sorry, I don't understand. Are you suggesting that a single decision
>>>> tree can be built efficiently in a distributed fashion?
>>>>
>>>> The following 2 ways seem like the naive ways of doing this:
>>>> 1) each machine constructs one node of the tree, by scanning all the
>>>> data, filtering those that don't get to its node (this is expensive),
>>>> computing the split and then writing the split out. This requires a
>>>> pass through the data for each node of the tree.
>>>> 2) as 1), except that each machine writes out the filtered data after
>>>> its node in the tree. This requires less scanning (as much as the
>>>> local case I described earlier), but lots of data movement.
>>>>
>>>> Do you have an alternative method?
>>>>
>>>> Andy
>>>>
>>>>
>>>> On 28 January 2013 16:42, Ted Dunning <te...@gmail.com> wrote:
>>>>
>>>>> IF we have a step which permutes data (once) then I doubt that
>>>>> redistribution is necessary.  At that point the randomness consists of
>>>>> building trees based on different variable subsets and data subsets.
>>>>>   The
>>>>> original random forests only split on variable subsets.  How much this
>>>>> matters is an open question.
>>>>>
>>>>> On Mon, Jan 28, 2013 at 2:36 AM, Sean Owen <sr...@gmail.com> wrote:
>>>>>
>>>>>   It sounds OK to me. Yes, I don't think you want to try to parallelize
>>>>>> the building of each tree just to build it off all the data. It will
>>>>>> be too slow.
>>>>>>
>>>>>> I imagine the game is to parallelize such that each worker has 1/Nth
>>>>>> of the data, where N is as low as you can manage. Each worker is
>>>>>> building one or more trees from its slice. Then iterate, and split up
>>>>>> the data differently. Each worker can now cross-validate previous
>>>>>> trees based on the new slice, or even build new ones, and iterate
>>>>>> again.
>>>>>>
>>>>>> I'm guessing this process will tend to need to build more trees than
>>>>>> usual. I think it's also possible that it will want to try to discard
>>>>>> trees with high out-of-bag error.
>>>>>>
>>>>>> Just riffing. I think this is some potential green-field territory to
>>>>>> really think about how you build trees when it doesn't nearly fit in
>>>>>> memory, and what shape of computation gives a really nice tradeoff
>>>>>> between speed and accuracy. I don't think the conventional approach is
>>>>>> that point but it's more like the above.
>>>>>>
>>>>>> On Mon, Jan 28, 2013 at 10:14 AM, Andy Twigg <an...@gmail.com>
>>>>>>
>>>>> wrote:
>>>>> Here's my thought for what might work well. Comments very welcome.
>>>>>>> We construct k decision trees (where k might be 20-100). Each tree has
>>>>>>> depth roughly 10-30. Rather than construct each tree in a distributed
>>>>>>> fashion, construct each tree locally by using multiple passes over the
>>>>>>> bagged sample for that tree (construct in BFS manner, assuming the
>>>>>>> tree can fit in memory). Do this in parallel k times. I know this is
>>>>>>> perhaps not the "mapreduce" way, but I'd be interested to hear
>>>>>>> alternatives. It seems a bad idea to shuffle the data around on every
>>>>>>> BFS iteration per tree.
>>>>>>>
>>>>>>> I'm not planning to work on this right now, but would like to figure
>>>>>>> out if Mahout is a good platform to work with if I want such an
>>>>>>> algorithm.
>>>>>>>
>>>>>>> Andy
>>>>>>>
>>>> --
>>>> Dr Andy Twigg
>>>> Junior Research Fellow, St Johns College, Oxford
>>>> Room 351, Department of Computer Science
>>>> http://www.cs.ox.ac.uk/people/andy.twigg/
>>>> andy.twigg@cs.ox.ac.uk | +447799647538
>>>>
>>>>


Re: Out-of-core random forest implementation

Posted by Ted Dunning <te...@gmail.com>.
Exactly.

On Fri, Feb 15, 2013 at 5:37 PM, Marty Kube <ma...@gmail.com>wrote:

> Even if you are not doing map-reduce exactly, Hadoop does give you a nice
> infrastructure for running jobs across a lot of hosts.
>
> On 02/15/2013 04:00 PM, Ted Dunning wrote:
>
>> Remember that Hadoop != map-reduce.
>>
>> If there is another style that we need to use, that isn't such a bad
>> thing.
>>
>>
>> On Fri, Feb 15, 2013 at 7:42 AM, Andy Twigg <an...@gmail.com> wrote:
>>
>>  I am having a hard time convincing myself that doing it on hadoop is
>>> the best idea (and like I said, it's not like there are other
>>> libraries that would make life much easier).
>>>
>>>
>

Re: Out-of-core random forest implementation

Posted by Marty Kube <ma...@gmail.com>.
Even if you are not doing map-reduce exactly, Hadoop does give you a
nice infrastructure for running jobs across a lot of hosts.

On 02/15/2013 04:00 PM, Ted Dunning wrote:
> Remember that Hadoop != map-reduce.
>
> If there is another style that we need to use, that isn't such a bad thing.
>
>
> On Fri, Feb 15, 2013 at 7:42 AM, Andy Twigg <an...@gmail.com> wrote:
>
>> I am having a hard time convincing myself that doing it on hadoop is
>> the best idea (and like I said, it's not like there are other
>> libraries that would make life much easier).
>>


Re: Out-of-core random forest implementation

Posted by Ted Dunning <te...@gmail.com>.
Remember that Hadoop != map-reduce.

If there is another style that we need to use, that isn't such a bad thing.


On Fri, Feb 15, 2013 at 7:42 AM, Andy Twigg <an...@gmail.com> wrote:

> I am having a hard time convincing myself that doing it on hadoop is
> the best idea (and like I said, it's not like there are other
> libraries that would make life much easier).
>

Re: Out-of-core random forest implementation

Posted by Andy Twigg <an...@gmail.com>.
Memory mapping would help, but a proper scalable random forest
implementation should be the goal in my opinion, especially given the
scalability claims of Mahout.

I am having a hard time convincing myself that doing it on Hadoop is
the best idea (and like I said, it's not as if there are other
libraries that would make life much easier).

Andy


On 15 February 2013 13:47, Marty Kube <ma...@gmail.com> wrote:
> The hurdle I faced on my current project is that we had many random forest
> and the RAM requirements per JVM during classification were too big so we
> could not use Mahout.
>
> We went to a memory mapped forest representation which worked out nicely.
> Is that a feature Mahout could use?
>
>
> On 02/15/2013 03:09 AM, deneche abdelhakim wrote:
>>
>> On Fri, Feb 15, 2013 at 1:06 AM, Marty Kube <
>> martykube@beavercreekconsulting.com> wrote:
>>
>>> On 01/28/2013 02:33 PM, Ted Dunning wrote:
>>>
>>>> I think I was suggesting something weaker.
>>>>
>>>> I was suggesting that trees get built against a portion of the data and
>>>> each node builds some number of trees against just the data it sees.
>>>> This
>>>> is in the spirit of random forests, but not the letter.
>>>>
>>> I'm looking at the Mahout partial implementation.  It appears to me that
>>> some number of trees get built against the portion of the data each node
>>> sees based on the map reduce input split.
>>> Am I wrong here?  If not, Mahout already has an out-of-core
>>> implementation.
>>>
>> Indeed, each mapper will grow a subset of the forest using only the data
>> passed to it. Actually, this was based on a suggestion by Ted:
>>
>> "modify the original algorithm to build multiple trees for different
>> portions of the data. That loses some of the solidity of the original
>> method, but could actually do better if the splits exposed non-stationary
>> behavior."
>>
>>
>>> BTW - where did the name Partial Implementation come from?  Is this
>>> partially completed work :-)
>>
>>
>> Ha ha ha, the name came from the fact that each mapper/node has a
>> "partial"
>> view of the dataset. But I do agree, that it's partially completed :P I
>> guess they were never enough user interest in Mahout's Decision Forest
>> implementation to motivate me to keep working on it, but I do see more and
>> more questions on the mailing lists about this, maybe I will get motivated
>> enough to work on this again.
>>
>> I did some research, some time ago, about how to grow the trees on a
>> cluster using all data when the dataset is too big to fit in memory. Two
>> papers got my attention:
>>
>> *RainForest - A Framework for Fast Decision Tree Construction of Large
>> Datasets [1]*
>> This paper describes how to grow the trees without loading all data in
>> memory. Using this, you need to copy the dataset on all computing nodes.
>>
>> *PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce [2]*
>> Although it uses map-reduce to grow the trees, it uses lot's of hacks and
>> tricks to make it work.
>>
>> Shouldn't be too difficult to use [1] and improve the current
>> implementation, this way the forest can be grown on "full" large dataset
>> on
>> a computing cluster. One question though, do we really need Hadoop for
>> this
>> ? I mean, as long as each node has access to the dataset files this should
>> be enough to build the forest on any number of nodes.
>>
>> [1] http://www.cs.cornell.edu/johannes/papers/1998/vldb1998-rainforest.pdf
>> [2] http://dejanseo.com.au/research/google/36296.pdf
>>
>>
>>>> Another option is that all the nodes start some trees based on their own
>>>> data and spit the set of all such trees to a central (possibly
>>>> distributed
>>>> store).  They also grab some number of trees back down from the central
>>>> store and try elaborating those trees using their own data.  These will
>>>> also be sent back to the central store.  This is closer to random
>>>> forests.
>>>>
>>>> A third option is like option two, but the trees that get pulled down
>>>> and
>>>> spit back up collect statistics until enough nodes report such stats at
>>>> which point a new branch is added and the stats are reset.  This is very
>>>> close to real random forests.  It should also be efficient because each
>>>> node is passing many trees across its own data in each pass.  If that
>>>> data
>>>> fits in memory (even in file cache) then the pass will be very fast.  If
>>>> it
>>>> doesn't fit in memory, then it will run as fast as the disk will let it
>>>> run.  Allowing a tree to commit with only 90% coverage should allow good
>>>> robustness against laggards while doing almost exactly the same thing as
>>>> the real random forest algorithm.
>>>>
>>>> On Mon, Jan 28, 2013 at 8:56 AM, Andy Twigg <an...@gmail.com>
>>>> wrote:
>>>>
>>>>   Ted,
>>>>>
>>>>> Sorry, I don't understand. Are you suggesting that a single decision
>>>>> tree can be built efficiently in a distributed fashion?
>>>>>
>>>>> The following 2 ways seem like the naive ways of doing this:
>>>>> 1) each machine constructs one node of the tree, by scanning all the
>>>>> data, filtering those that don't get to its node (this is expensive),
>>>>> computing the split and then writing the split out. This requires a
>>>>> pass through the data for each node of the tree.
>>>>> 2) as 1), except that each machine writes out the filtered data after
>>>>> its node in the tree. This requires less scanning (as much as the
>>>>> local case I described earlier), but lots of data movement.
>>>>>
>>>>> Do you have an alternative method?
>>>>>
>>>>> Andy
>>>>>
>>>>>
>>>>> On 28 January 2013 16:42, Ted Dunning <te...@gmail.com> wrote:
>>>>>
>>>>>> IF we have a step which permutes data (once) then I doubt that
>>>>>> redistribution is necessary.  At that point the randomness consists of
>>>>>> building trees based on different variable subsets and data subsets.
>>>>>>   The
>>>>>> original random forests only split on variable subsets.  How much this
>>>>>> matters is an open question.
>>>>>>
>>>>>> On Mon, Jan 28, 2013 at 2:36 AM, Sean Owen <sr...@gmail.com> wrote:
>>>>>>
>>>>>>   It sounds OK to me. Yes, I don't think you want to try to
>>>>>> parallelize
>>>>>>>
>>>>>>> the building of each tree just to build it off all the data. It will
>>>>>>> be too slow.
>>>>>>>
>>>>>>> I imagine the game is to parallelize such that each worker has 1/Nth
>>>>>>> of the data, where N is as low as you can manage. Each worker is
>>>>>>> building one or more trees from its slice. Then iterate, and split up
>>>>>>> the data differently. Each worker can now cross-validate previous
>>>>>>> trees based on the new slice, or even build new ones, and iterate
>>>>>>> again.
>>>>>>>
>>>>>>> I'm guessing this process will tend to need to build more trees than
>>>>>>> usual. I think it's also possible that it will want to try to discard
>>>>>>> trees with high out-of-bag error.
>>>>>>>
>>>>>>> Just riffing. I think this is some potential green-field territory to
>>>>>>> really think about how you build trees when it doesn't nearly fit in
>>>>>>> memory, and what shape of computation gives a really nice tradeoff
>>>>>>> between speed and accuracy. I don't think the conventional approach
>>>>>>> is
>>>>>>> that point but it's more like the above.
>>>>>>>
>>>>>>> On Mon, Jan 28, 2013 at 10:14 AM, Andy Twigg <an...@gmail.com>
>>>>>>>
>>>>>> wrote:
>>>>>> Here's my thought for what might work well. Comments very welcome.
>>>>>>>>
>>>>>>>> We construct k decision trees (where k might be 20-100). Each tree
>>>>>>>> has
>>>>>>>> depth roughly 10-30. Rather than construct each tree in a
>>>>>>>> distributed
>>>>>>>> fashion, construct each tree locally by using multiple passes over
>>>>>>>> the
>>>>>>>> bagged sample for that tree (construct in BFS manner, assuming the
>>>>>>>> tree can fit in memory). Do this in parallel k times. I know this is
>>>>>>>> perhaps not the "mapreduce" way, but I'd be interested to hear
>>>>>>>> alternatives. It seems a bad idea to shuffle the data around on
>>>>>>>> every
>>>>>>>> BFS iteration per tree.
>>>>>>>>
>>>>>>>> I'm not planning to work on this right now, but would like to figure
>>>>>>>> out if Mahout is a good platform to work with if I want such an
>>>>>>>> algorithm.
>>>>>>>>
>>>>>>>> Andy
>>>>>>>>
>>>>> --
>>>>> Dr Andy Twigg
>>>>> Junior Research Fellow, St Johns College, Oxford
>>>>> Room 351, Department of Computer Science
>>>>>
>>>>> http://www.cs.ox.ac.uk/people/andy.twigg/
>>>>> andy.twigg@cs.ox.ac.uk | +447799647538
>>>>>
>>>>>
>



-- 
Dr Andy Twigg
Junior Research Fellow, St Johns College, Oxford
Room 351, Department of Computer Science
http://www.cs.ox.ac.uk/people/andy.twigg/
andy.twigg@cs.ox.ac.uk | +447799647538


Re: Out-of-core random forest implementation

Posted by Marty Kube <ma...@gmail.com>.
The hurdle I faced on my current project is that we had many random
forests, and the RAM requirements per JVM during classification were too
big, so we could not use Mahout.

We went to a memory-mapped forest representation which worked out
nicely.  Is that a feature Mahout could use?
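
For what it's worth, the layout we used was roughly the following (a
simplified sketch, not Mahout's DecisionForest format, and the record layout
is just an example): a file of fixed-size node records that classification
walks through a MappedByteBuffer, so the trees live in the OS page cache
rather than on the JVM heap, and several JVMs on the same box share the pages.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

final class MappedForest {
  // Fixed-size node record: attr(int) | threshold(double) | left(int) | right(int) | label(int)
  private static final int RECORD = 4 + 8 + 4 + 4 + 4;   // 24 bytes per node
  private MappedByteBuffer buf;
  private final int[] treeRoots;   // record index of each tree's root

  MappedForest(String path, int[] treeRoots) throws IOException {
    RandomAccessFile file = new RandomAccessFile(path, "r");
    try {
      FileChannel channel = file.getChannel();
      // The mapping stays valid after the file is closed; pages load lazily.
      buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
    } finally {
      file.close();
    }
    this.treeRoots = treeRoots;
  }

  // Walks one tree; a leaf is marked by attr < 0 and stores the label.
  int classifyWithTree(int tree, double[] features) {
    int node = treeRoots[tree];
    while (true) {
      int base = node * RECORD;
      int attr = buf.getInt(base);
      if (attr < 0) {
        return buf.getInt(base + 20);
      }
      double threshold = buf.getDouble(base + 4);
      node = (features[attr] <= threshold) ? buf.getInt(base + 12) : buf.getInt(base + 16);
    }
  }
}

Classification allocates nothing per request; the only per-forest heap state
is the small array of root offsets.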

On 02/15/2013 03:09 AM, deneche abdelhakim wrote:
> On Fri, Feb 15, 2013 at 1:06 AM, Marty Kube <
> martykube@beavercreekconsulting.com> wrote:
>
>> On 01/28/2013 02:33 PM, Ted Dunning wrote:
>>
>>> I think I was suggesting something weaker.
>>>
>>> I was suggesting that trees get built against a portion of the data and
>>> each node builds some number of trees against just the data it sees.  This
>>> is in the spirit of random forests, but not the letter.
>>>
>> I'm looking at the Mahout partial implementation.  It appears to me that
>> some number of trees get built against the portion of the data each node
>> sees based on the map reduce input split.
>> Am I wrong here?  If not, Mahout already has an out-of-core implementation.
>>
> Indeed, each mapper will grow a subset of the forest using only the data
> passed to it. Actually, this was based on a suggestion by Ted:
>
> "modify the original algorithm to build multiple trees for different
> portions of the data. That loses some of the solidity of the original
> method, but could actually do better if the splits exposed non-stationary
> behavior."
>
>
>> BTW - where did the name Partial Implementation come from?  Is this
>> partially completed work :-)
>
> Ha ha ha, the name came from the fact that each mapper/node has a "partial"
> view of the dataset. But I do agree, that it's partially completed :P I
> guess they were never enough user interest in Mahout's Decision Forest
> implementation to motivate me to keep working on it, but I do see more and
> more questions on the mailing lists about this, maybe I will get motivated
> enough to work on this again.
>
> I did some research, some time ago, about how to grow the trees on a
> cluster using all data when the dataset is too big to fit in memory. Two
> papers got my attention:
>
> *RainForest - A Framework for Fast Decision Tree Construction of Large
> Datasets [1]*
> This paper describes how to grow the trees without loading all data in
> memory. Using this, you need to copy the dataset on all computing nodes.
>
> *PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce [2]*
> Although it uses map-reduce to grow the trees, it uses lot's of hacks and
> tricks to make it work.
>
> Shouldn't be too difficult to use [1] and improve the current
> implementation, this way the forest can be grown on "full" large dataset on
> a computing cluster. One question though, do we really need Hadoop for this
> ? I mean, as long as each node has access to the dataset files this should
> be enough to build the forest on any number of nodes.
>
> [1] http://www.cs.cornell.edu/johannes/papers/1998/vldb1998-rainforest.pdf
> [2] http://dejanseo.com.au/research/google/36296.pdf
>
>
>>> Another option is that all the nodes start some trees based on their own
>>> data and spit the set of all such trees to a central (possibly distributed
>>> store).  They also grab some number of trees back down from the central
>>> store and try elaborating those trees using their own data.  These will
>>> also be sent back to the central store.  This is closer to random forests.
>>>
>>> A third option is like option two, but the trees that get pulled down and
>>> spit back up collect statistics until enough nodes report such stats at
>>> which point a new branch is added and the stats are reset.  This is very
>>> close to real random forests.  It should also be efficient because each
>>> node is passing many trees across its own data in each pass.  If that data
>>> fits in memory (even in file cache) then the pass will be very fast.  If
>>> it
>>> doesn't fit in memory, then it will run as fast as the disk will let it
>>> run.  Allowing a tree to commit with only 90% coverage should allow good
>>> robustness against laggards while doing almost exactly the same thing as
>>> the real random forest algorithm.
>>>
>>> On Mon, Jan 28, 2013 at 8:56 AM, Andy Twigg <an...@gmail.com> wrote:
>>>
>>>   Ted,
>>>> Sorry, I don't understand. Are you suggesting that a single decision
>>>> tree can be built efficiently in a distributed fashion?
>>>>
>>>> The following 2 ways seem like the naive ways of doing this:
>>>> 1) each machine constructs one node of the tree, by scanning all the
>>>> data, filtering those that don't get to its node (this is expensive),
>>>> computing the split and then writing the split out. This requires a
>>>> pass through the data for each node of the tree.
>>>> 2) as 1), except that each machine writes out the filtered data after
>>>> its node in the tree. This requires less scanning (as much as the
>>>> local case I described earlier), but lots of data movement.
>>>>
>>>> Do you have an alternative method?
>>>>
>>>> Andy
>>>>
>>>>
>>>> On 28 January 2013 16:42, Ted Dunning <te...@gmail.com> wrote:
>>>>
>>>>> IF we have a step which permutes data (once) then I doubt that
>>>>> redistribution is necessary.  At that point the randomness consists of
>>>>> building trees based on different variable subsets and data subsets.
>>>>>   The
>>>>> original random forests only split on variable subsets.  How much this
>>>>> matters is an open question.
>>>>>
>>>>> On Mon, Jan 28, 2013 at 2:36 AM, Sean Owen <sr...@gmail.com> wrote:
>>>>>
>>>>>   It sounds OK to me. Yes, I don't think you want to try to parallelize
>>>>>> the building of each tree just to build it off all the data. It will
>>>>>> be too slow.
>>>>>>
>>>>>> I imagine the game is to parallelize such that each worker has 1/Nth
>>>>>> of the data, where N is as low as you can manage. Each worker is
>>>>>> building one or more trees from its slice. Then iterate, and split up
>>>>>> the data differently. Each worker can now cross-validate previous
>>>>>> trees based on the new slice, or even build new ones, and iterate
>>>>>> again.
>>>>>>
>>>>>> I'm guessing this process will tend to need to build more trees than
>>>>>> usual. I think it's also possible that it will want to try to discard
>>>>>> trees with high out-of-bag error.
>>>>>>
>>>>>> Just riffing. I think this is some potential green-field territory to
>>>>>> really think about how you build trees when it doesn't nearly fit in
>>>>>> memory, and what shape of computation gives a really nice tradeoff
>>>>>> between speed and accuracy. I don't think the conventional approach is
>>>>>> that point but it's more like the above.
>>>>>>
>>>>>> On Mon, Jan 28, 2013 at 10:14 AM, Andy Twigg <an...@gmail.com>
>>>>>>
>>>>> wrote:
>>>>> Here's my thought for what might work well. Comments very welcome.
>>>>>>> We construct k decision trees (where k might be 20-100). Each tree has
>>>>>>> depth roughly 10-30. Rather than construct each tree in a distributed
>>>>>>> fashion, construct each tree locally by using multiple passes over the
>>>>>>> bagged sample for that tree (construct in BFS manner, assuming the
>>>>>>> tree can fit in memory). Do this in parallel k times. I know this is
>>>>>>> perhaps not the "mapreduce" way, but I'd be interested to hear
>>>>>>> alternatives. It seems a bad idea to shuffle the data around on every
>>>>>>> BFS iteration per tree.
>>>>>>>
>>>>>>> I'm not planning to work on this right now, but would like to figure
>>>>>>> out if Mahout is a good platform to work with if I want such an
>>>>>>> algorithm.
>>>>>>>
>>>>>>> Andy
>>>>>>>
>>>> --
>>>> Dr Andy Twigg
>>>> Junior Research Fellow, St Johns College, Oxford
>>>> Room 351, Department of Computer Science
>>>> http://www.cs.ox.ac.uk/people/andy.twigg/
>>>> andy.twigg@cs.ox.ac.uk | +447799647538
>>>>
>>>>


Re: Out-of-core random forest implementation

Posted by Julien Nioche <li...@gmail.com>.
> As you point out, the problem (that I see) with mahout is that it offers no
> reason to stay within the framework, for example there is no consistent way
> of representing datasets unlike say numpy or pandas. Even something like
> that - load some data, have it automatically passed into a consistent
> tabular format for regression/classification would be very helpful.


+1. Would be great if there were some consistency in the way the classifiers
handle their inputs indeed. Could be SequenceFiles containing Vectors, with
a separation of concerns between the code preprocessing the inputs (from
Lucene indices, text files on the FS, computation of weights, etc.) and the
classifiers themselves.
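
Roughly the kind of thing I mean, as a sketch (the SequenceFile and
VectorWritable calls below are the Hadoop 1.x / Mahout APIs as I remember
them, worth double-checking against the current code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class WriteVectors {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("dataset/vectors.seq");

    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, VectorWritable.class);
    try {
      // One record per instance; whatever preprocessing produced the
      // feature values (Lucene index, CSV, tf-idf weighting, ...) is
      // entirely separate from the classifier that will read this file.
      Vector v = new RandomAccessSparseVector(1000);
      v.set(3, 1.0);
      v.set(42, 0.5);
      writer.append(new Text("doc-0001"), new VectorWritable(v));
    } finally {
      writer.close();
    }
  }
}

Every classifier (and the forest builder) could then take such a file as its
one input format, instead of each one parsing its own.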

Julien


 wrote:
>
> > On Fri, Feb 15, 2013 at 1:06 AM, Marty Kube <
> > martykube@beavercreekconsulting.com> wrote:
> >
> > >
> > > On 01/28/2013 02:33 PM, Ted Dunning wrote:
> > >
> > >> I think I was suggesting something weaker.
> > >>
> > >> I was suggesting that trees get built against a portion of the data
> and
> > >> each node builds some number of trees against just the data it sees.
> >  This
> > >> is in the spirit of random forests, but not the letter.
> > >>
> > > I'm looking at the Mahout partial implementation.  It appears to me
> that
> > > some number of trees get built against the portion of the data each
> node
> > > sees based on the map reduce input split.
> > > Am I wrong here?  If not, Mahout already has an out-of-core
> > implementation.
> > >
> >
> > Indeed, each mapper will grow a subset of the forest using only the data
> > passed to it. Actually, this was based on a suggestion by Ted:
> >
> > "modify the original algorithm to build multiple trees for different
> > portions of the data. That loses some of the solidity of the original
> > method, but could actually do better if the splits exposed non-stationary
> > behavior."
> >
> >
> > > BTW - where did the name Partial Implementation come from?  Is this
> > > partially completed work :-)
> >
> >
> > Ha ha ha, the name came from the fact that each mapper/node has a
> "partial"
> > view of the dataset. But I do agree, that it's partially completed :P I
> > guess they were never enough user interest in Mahout's Decision Forest
> > implementation to motivate me to keep working on it, but I do see more
> and
> > more questions on the mailing lists about this, maybe I will get
> motivated
> > enough to work on this again.
> >
> > I did some research, some time ago, about how to grow the trees on a
> > cluster using all data when the dataset is too big to fit in memory. Two
> > papers got my attention:
> >
> > *RainForest - A Framework for Fast Decision Tree Construction of Large
> > Datasets [1]*
> > This paper describes how to grow the trees without loading all data in
> > memory. Using this, you need to copy the dataset on all computing nodes.
> >
> > *PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce
> [2]*
> > Although it uses map-reduce to grow the trees, it uses lot's of hacks and
> > tricks to make it work.
> >
> > Shouldn't be too difficult to use [1] and improve the current
> > implementation, this way the forest can be grown on "full" large dataset
> on
> > a computing cluster. One question though, do we really need Hadoop for
> this
> > ? I mean, as long as each node has access to the dataset files this
> should
> > be enough to build the forest on any number of nodes.
> >
> > [1]
> http://www.cs.cornell.edu/johannes/papers/1998/vldb1998-rainforest.pdf
> > [2] http://dejanseo.com.au/research/google/36296.pdf
> >
> >
> > >
> > >> Another option is that all the nodes start some trees based on their
> own
> > >> data and spit the set of all such trees to a central (possibly
> > distributed
> > >> store).  They also grab some number of trees back down from the
> central
> > >> store and try elaborating those trees using their own data.  These
> will
> > >> also be sent back to the central store.  This is closer to random
> > forests.
> > >>
> > >> A third option is like option two, but the trees that get pulled down
> > and
> > >> spit back up collect statistics until enough nodes report such stats
> at
> > >> which point a new branch is added and the stats are reset.  This is
> very
> > >> close to real random forests.  It should also be efficient because
> each
> > >> node is passing many trees across its own data in each pass.  If that
> > data
> > >> fits in memory (even in file cache) then the pass will be very fast.
>  If
> > >> it
> > >> doesn't fit in memory, then it will run as fast as the disk will let
> it
> > >> run.  Allowing a tree to commit with only 90% coverage should allow
> good
> > >> robustness against laggards while doing almost exactly the same thing
> as
> > >> the real random forest algorithm.
> > >>
> > >> On Mon, Jan 28, 2013 at 8:56 AM, Andy Twigg <an...@gmail.com>
> > wrote:
> > >>
> > >>  Ted,
> > >>>
> > >>> Sorry, I don't understand. Are you suggesting that a single decision
> > >>> tree can be built efficiently in a distributed fashion?
> > >>>
> > >>> The following 2 ways seem like the naive ways of doing this:
> > >>> 1) each machine constructs one node of the tree, by scanning all the
> > >>> data, filtering those that don't get to its node (this is expensive),
> > >>> computing the split and then writing the split out. This requires a
> > >>> pass through the data for each node of the tree.
> > >>> 2) as 1), except that each machine writes out the filtered data after
> > >>> its node in the tree. This requires less scanning (as much as the
> > >>> local case I described earlier), but lots of data movement.
> > >>>
> > >>> Do you have an alternative method?
> > >>>
> > >>> Andy
> > >>>
> > >>>
> > >>> On 28 January 2013 16:42, Ted Dunning <te...@gmail.com> wrote:
> > >>>
> > >>>> IF we have a step which permutes data (once) then I doubt that
> > >>>> redistribution is necessary.  At that point the randomness consists
> of
> > >>>> building trees based on different variable subsets and data subsets.
> > >>>>  The
> > >>>> original random forests only split on variable subsets.  How much
> this
> > >>>> matters is an open question.
> > >>>>
> > >>>> On Mon, Jan 28, 2013 at 2:36 AM, Sean Owen <sr...@gmail.com>
> wrote:
> > >>>>
> > >>>>  It sounds OK to me. Yes, I don't think you want to try to
> parallelize
> > >>>>> the building of each tree just to build it off all the data. It
> will
> > >>>>> be too slow.
> > >>>>>
> > >>>>> I imagine the game is to parallelize such that each worker has
> 1/Nth
> > >>>>> of the data, where N is as low as you can manage. Each worker is
> > >>>>> building one or more trees from its slice. Then iterate, and split
> up
> > >>>>> the data differently. Each worker can now cross-validate previous
> > >>>>> trees based on the new slice, or even build new ones, and iterate
> > >>>>> again.
> > >>>>>
> > >>>>> I'm guessing this process will tend to need to build more trees
> than
> > >>>>> usual. I think it's also possible that it will want to try to
> discard
> > >>>>> trees with high out-of-bag error.
> > >>>>>
> > >>>>> Just riffing. I think this is some potential green-field territory
> to
> > >>>>> really think about how you build trees when it doesn't nearly fit
> in
> > >>>>> memory, and what shape of computation gives a really nice tradeoff
> > >>>>> between speed and accuracy. I don't think the conventional approach
> > is
> > >>>>> that point but it's more like the above.
> > >>>>>
> > >>>>> On Mon, Jan 28, 2013 at 10:14 AM, Andy Twigg <andy.twigg@gmail.com
> >
> > >>>>>
> > >>>> wrote:
> > >>>
> > >>>> Here's my thought for what might work well. Comments very welcome.
> > >>>>>>
> > >>>>>> We construct k decision trees (where k might be 20-100). Each tree
> > has
> > >>>>>> depth roughly 10-30. Rather than construct each tree in a
> > distributed
> > >>>>>> fashion, construct each tree locally by using multiple passes over
> > the
> > >>>>>> bagged sample for that tree (construct in BFS manner, assuming the
> > >>>>>> tree can fit in memory). Do this in parallel k times. I know this
> is
> > >>>>>> perhaps not the "mapreduce" way, but I'd be interested to hear
> > >>>>>> alternatives. It seems a bad idea to shuffle the data around on
> > every
> > >>>>>> BFS iteration per tree.
> > >>>>>>
> > >>>>>> I'm not planning to work on this right now, but would like to
> figure
> > >>>>>> out if Mahout is a good platform to work with if I want such an
> > >>>>>> algorithm.
> > >>>>>>
> > >>>>>> Andy
> > >>>>>>
> > >>>>>
> > >>>
> > >>> --
> > >>> Dr Andy Twigg
> > >>> Junior Research Fellow, St Johns College, Oxford
> > >>> Room 351, Department of Computer Science
> > >>> http://www.cs.ox.ac.uk/people/andy.twigg/
> > >>> andy.twigg@cs.ox.ac.uk | +447799647538
> > >>>
> > >>>
> > >
> >
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: Out-of-core random forest implementation

Posted by Andy Twigg <an...@gmail.com>.
An efficient implementation of something like [1] would be very useful.
Either I or a student would be able to help here.

As you point out, the problem (that I see) with Mahout is that it offers no
reason to stay within the framework; for example, there is no consistent way
of representing datasets, unlike, say, numpy or pandas. Even something like
that - load some data, have it automatically passed into a consistent
tabular format for regression/classification - would be very helpful. The
BigML folk appear to have their own out-of-core, distributed RF
implementation. Unfortunately it appears to be closed source.
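
To be concrete, the sort of thing I mean is a single tabular abstraction
that every learner consumes. None of these types exist in Mahout; this is
purely a sketch of the idea:

interface DataFrame extends Iterable<DataFrame.Row> {

  interface Row {
    double getNumeric(String column);
    String getCategorical(String column);
    double label();
  }

  String[] columnNames();
  long numRows();

  // Loaders for the common cases would all hand back the same view, so the
  // learners never care whether the data came from CSV, Lucene, or HDFS:
  //   DataFrame fromCsv(Path path);
  //   DataFrame fromSequenceFile(Path path);
}

Once regression/classification code is written against one such view,
staying inside the framework starts to pay for itself.
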
On Feb 15, 2013 8:10 AM, "deneche abdelhakim" <ad...@gmail.com> wrote:

> On Fri, Feb 15, 2013 at 1:06 AM, Marty Kube <
> martykube@beavercreekconsulting.com> wrote:
>
> >
> > On 01/28/2013 02:33 PM, Ted Dunning wrote:
> >
> >> I think I was suggesting something weaker.
> >>
> >> I was suggesting that trees get built against a portion of the data and
> >> each node builds some number of trees against just the data it sees.
>  This
> >> is in the spirit of random forests, but not the letter.
> >>
> > I'm looking at the Mahout partial implementation.  It appears to me that
> > some number of trees get built against the portion of the data each node
> > sees based on the map reduce input split.
> > Am I wrong here?  If not, Mahout already has an out-of-core
> implementation.
> >
>
> Indeed, each mapper will grow a subset of the forest using only the data
> passed to it. Actually, this was based on a suggestion by Ted:
>
> "modify the original algorithm to build multiple trees for different
> portions of the data. That loses some of the solidity of the original
> method, but could actually do better if the splits exposed non-stationary
> behavior."
>
>
> > BTW - where did the name Partial Implementation come from?  Is this
> > partially completed work :-)
>
>
> Ha ha ha, the name came from the fact that each mapper/node has a "partial"
> view of the dataset. But I do agree, that it's partially completed :P I
> guess they were never enough user interest in Mahout's Decision Forest
> implementation to motivate me to keep working on it, but I do see more and
> more questions on the mailing lists about this, maybe I will get motivated
> enough to work on this again.
>
> I did some research, some time ago, about how to grow the trees on a
> cluster using all data when the dataset is too big to fit in memory. Two
> papers got my attention:
>
> *RainForest - A Framework for Fast Decision Tree Construction of Large
> Datasets [1]*
> This paper describes how to grow the trees without loading all data in
> memory. Using this, you need to copy the dataset on all computing nodes.
>
> *PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce [2]*
> Although it uses map-reduce to grow the trees, it uses lot's of hacks and
> tricks to make it work.
>
> Shouldn't be too difficult to use [1] and improve the current
> implementation, this way the forest can be grown on "full" large dataset on
> a computing cluster. One question though, do we really need Hadoop for this
> ? I mean, as long as each node has access to the dataset files this should
> be enough to build the forest on any number of nodes.
>
> [1] http://www.cs.cornell.edu/johannes/papers/1998/vldb1998-rainforest.pdf
> [2] http://dejanseo.com.au/research/google/36296.pdf
>
>
> >
> >> Another option is that all the nodes start some trees based on their own
> >> data and spit the set of all such trees to a central (possibly
> distributed
> >> store).  They also grab some number of trees back down from the central
> >> store and try elaborating those trees using their own data.  These will
> >> also be sent back to the central store.  This is closer to random
> forests.
> >>
> >> A third option is like option two, but the trees that get pulled down
> and
> >> spit back up collect statistics until enough nodes report such stats at
> >> which point a new branch is added and the stats are reset.  This is very
> >> close to real random forests.  It should also be efficient because each
> >> node is passing many trees across its own data in each pass.  If that
> data
> >> fits in memory (even in file cache) then the pass will be very fast.  If
> >> it
> >> doesn't fit in memory, then it will run as fast as the disk will let it
> >> run.  Allowing a tree to commit with only 90% coverage should allow good
> >> robustness against laggards while doing almost exactly the same thing as
> >> the real random forest algorithm.
> >>
> >> On Mon, Jan 28, 2013 at 8:56 AM, Andy Twigg <an...@gmail.com>
> wrote:
> >>
> >>  Ted,
> >>>
> >>> Sorry, I don't understand. Are you suggesting that a single decision
> >>> tree can be built efficiently in a distributed fashion?
> >>>
> >>> The following 2 ways seem like the naive ways of doing this:
> >>> 1) each machine constructs one node of the tree, by scanning all the
> >>> data, filtering those that don't get to its node (this is expensive),
> >>> computing the split and then writing the split out. This requires a
> >>> pass through the data for each node of the tree.
> >>> 2) as 1), except that each machine writes out the filtered data after
> >>> its node in the tree. This requires less scanning (as much as the
> >>> local case I described earlier), but lots of data movement.
> >>>
> >>> Do you have an alternative method?
> >>>
> >>> Andy
> >>>
> >>>
> >>> On 28 January 2013 16:42, Ted Dunning <te...@gmail.com> wrote:
> >>>
> >>>> IF we have a step which permutes data (once) then I doubt that
> >>>> redistribution is necessary.  At that point the randomness consists of
> >>>> building trees based on different variable subsets and data subsets.
> >>>>  The
> >>>> original random forests only split on variable subsets.  How much this
> >>>> matters is an open question.
> >>>>
> >>>> On Mon, Jan 28, 2013 at 2:36 AM, Sean Owen <sr...@gmail.com> wrote:
> >>>>
> >>>>  It sounds OK to me. Yes, I don't think you want to try to parallelize
> >>>>> the building of each tree just to build it off all the data. It will
> >>>>> be too slow.
> >>>>>
> >>>>> I imagine the game is to parallelize such that each worker has 1/Nth
> >>>>> of the data, where N is as low as you can manage. Each worker is
> >>>>> building one or more trees from its slice. Then iterate, and split up
> >>>>> the data differently. Each worker can now cross-validate previous
> >>>>> trees based on the new slice, or even build new ones, and iterate
> >>>>> again.
> >>>>>
> >>>>> I'm guessing this process will tend to need to build more trees than
> >>>>> usual. I think it's also possible that it will want to try to discard
> >>>>> trees with high out-of-bag error.
> >>>>>
> >>>>> Just riffing. I think this is some potential green-field territory to
> >>>>> really think about how you build trees when it doesn't nearly fit in
> >>>>> memory, and what shape of computation gives a really nice tradeoff
> >>>>> between speed and accuracy. I don't think the conventional approach
> is
> >>>>> that point but it's more like the above.
> >>>>>
> >>>>> On Mon, Jan 28, 2013 at 10:14 AM, Andy Twigg <an...@gmail.com>
> >>>>>
> >>>> wrote:
> >>>
> >>>> Here's my thought for what might work well. Comments very welcome.
> >>>>>>
> >>>>>> We construct k decision trees (where k might be 20-100). Each tree
> has
> >>>>>> depth roughly 10-30. Rather than construct each tree in a
> distributed
> >>>>>> fashion, construct each tree locally by using multiple passes over
> the
> >>>>>> bagged sample for that tree (construct in BFS manner, assuming the
> >>>>>> tree can fit in memory). Do this in parallel k times. I know this is
> >>>>>> perhaps not the "mapreduce" way, but I'd be interested to hear
> >>>>>> alternatives. It seems a bad idea to shuffle the data around on
> every
> >>>>>> BFS iteration per tree.
> >>>>>>
> >>>>>> I'm not planning to work on this right now, but would like to figure
> >>>>>> out if Mahout is a good platform to work with if I want such an
> >>>>>> algorithm.
> >>>>>>
> >>>>>> Andy
> >>>>>>
> >>>>>
> >>>
> >>> --
> >>> Dr Andy Twigg
> >>> Junior Research Fellow, St Johns College, Oxford
> >>> Room 351, Department of Computer Science
> >>> http://www.cs.ox.ac.uk/people/andy.twigg/
> >>> andy.twigg@cs.ox.ac.uk | +447799647538
> >>>
> >>>
> >
>

Re: Out-of-core random forest implementation

Posted by deneche abdelhakim <ad...@gmail.com>.
On Fri, Feb 15, 2013 at 1:06 AM, Marty Kube <
martykube@beavercreekconsulting.com> wrote:

>
> On 01/28/2013 02:33 PM, Ted Dunning wrote:
>
>> I think I was suggesting something weaker.
>>
>> I was suggesting that trees get built against a portion of the data and
>> each node builds some number of trees against just the data it sees.  This
>> is in the spirit of random forests, but not the letter.
>>
> I'm looking at the Mahout partial implementation.  It appears to me that
> some number of trees get built against the portion of the data each node
> sees based on the map reduce input split.
> Am I wrong here?  If not, Mahout already has an out-of-core implementation.
>

Indeed, each mapper will grow a subset of the forest using only the data
passed to it. Actually, this was based on a suggestion by Ted:

"modify the original algorithm to build multiple trees for different
portions of the data. That loses some of the solidity of the original
method, but could actually do better if the splits exposed non-stationary
behavior."


> BTW - where did the name Partial Implementation come from?  Is this
> partially completed work :-)


Ha ha ha, the name came from the fact that each mapper/node has a "partial"
view of the dataset. But I do agree that it's partially completed :P I guess
there was never enough user interest in Mahout's Decision Forest
implementation to motivate me to keep working on it, but I do see more and
more questions about this on the mailing lists, so maybe I will get motivated
enough to work on it again.

I did some research a while ago into how to grow the trees on a cluster
using all the data when the dataset is too big to fit in memory. Two papers
caught my attention:

*RainForest - A Framework for Fast Decision Tree Construction of Large
Datasets [1]*
This paper describes how to grow the trees without loading all the data into
memory. Using this approach, you need to copy the dataset to all computing nodes.
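
The core trick in [1] is that split selection only needs the AVC-sets
(Attribute-Value, Class label counts) of the records reaching a node, and
those counts fit in memory even when the records themselves do not. A rough
sketch of building them in a single streaming pass (illustrative Java, not
Mahout code):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.List;
    import java.util.Map;

    // Sketch of RainForest's AVC-set idea (illustrative, not Mahout code): one
    // streaming pass over the records that reach a tree node produces, for each
    // attribute, a table of (attribute value, class) counts. Those counts are
    // all that split selection needs, and they stay small even for huge inputs.
    public class AvcSetSketch {

      // avc.get(a).get(value)[c] = number of records with attribute a == value and class c.
      static List<Map<String, int[]>> buildAvcSets(Iterator<String[]> rows,
                                                   Iterator<Integer> labels,
                                                   int numAttributes,
                                                   int numClasses) {
        List<Map<String, int[]>> avc = new ArrayList<>();
        for (int a = 0; a < numAttributes; a++) {
          avc.add(new HashMap<>());
        }
        while (rows.hasNext()) {          // single sequential pass; rows can stream from disk
          String[] row = rows.next();
          int label = labels.next();
          for (int a = 0; a < numAttributes; a++) {
            avc.get(a).computeIfAbsent(row[a], v -> new int[numClasses])[label]++;
          }
        }
        return avc;
      }

      public static void main(String[] args) {
        List<String[]> rows = List.of(new String[] {"sunny", "hot"}, new String[] {"rainy", "hot"});
        List<Integer> labels = List.of(0, 1);
        List<Map<String, int[]>> avc = buildAvcSets(rows.iterator(), labels.iterator(), 2, 2);
        System.out.println("counts for outlook=sunny: " + Arrays.toString(avc.get(0).get("sunny")));
      }
    }

Growing one more tree level then costs roughly one more pass over the
(on-disk) records reaching the frontier nodes, never a full load into memory.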

*PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce [2]*
Although it uses map-reduce to grow the trees, it needs lots of hacks and
tricks to make it work.

It shouldn't be too difficult to use [1] to improve the current
implementation; this way the forest can be grown on the full, large dataset on
a computing cluster. One question though: do we really need Hadoop for this?
As long as each node has access to the dataset files, that should be enough to
build the forest on any number of nodes.

[1] http://www.cs.cornell.edu/johannes/papers/1998/vldb1998-rainforest.pdf
[2] http://dejanseo.com.au/research/google/36296.pdf


>
>> Another option is that all the nodes start some trees based on their own
>> data and spit the set of all such trees to a central (possibly distributed
>> store).  They also grab some number of trees back down from the central
>> store and try elaborating those trees using their own data.  These will
>> also be sent back to the central store.  This is closer to random forests.
>>
>> A third option is like option two, but the trees that get pulled down and
>> spit back up collect statistics until enough nodes report such stats at
>> which point a new branch is added and the stats are reset.  This is very
>> close to real random forests.  It should also be efficient because each
>> node is passing many trees across its own data in each pass.  If that data
>> fits in memory (even in file cache) then the pass will be very fast.  If
>> it
>> doesn't fit in memory, then it will run as fast as the disk will let it
>> run.  Allowing a tree to commit with only 90% coverage should allow good
>> robustness against laggards while doing almost exactly the same thing as
>> the real random forest algorithm.
>>
>> On Mon, Jan 28, 2013 at 8:56 AM, Andy Twigg <an...@gmail.com> wrote:
>>
>>  Ted,
>>>
>>> Sorry, I don't understand. Are you suggesting that a single decision
>>> tree can be built efficiently in a distributed fashion?
>>>
>>> The following 2 ways seem like the naive ways of doing this:
>>> 1) each machine constructs one node of the tree, by scanning all the
>>> data, filtering those that don't get to its node (this is expensive),
>>> computing the split and then writing the split out. This requires a
>>> pass through the data for each node of the tree.
>>> 2) as 1), except that each machine writes out the filtered data after
>>> its node in the tree. This requires less scanning (as much as the
>>> local case I described earlier), but lots of data movement.
>>>
>>> Do you have an alternative method?
>>>
>>> Andy
>>>
>>>
>>> On 28 January 2013 16:42, Ted Dunning <te...@gmail.com> wrote:
>>>
>>>> IF we have a step which permutes data (once) then I doubt that
>>>> redistribution is necessary.  At that point the randomness consists of
>>>> building trees based on different variable subsets and data subsets.  The
>>>> original random forests only split on variable subsets.  How much this
>>>> matters is an open question.
>>>>
>>>> On Mon, Jan 28, 2013 at 2:36 AM, Sean Owen <sr...@gmail.com> wrote:
>>>>
>>>>  It sounds OK to me. Yes, I don't think you want to try to parallelize
>>>>> the building of each tree just to build it off all the data. It will
>>>>> be too slow.
>>>>>
>>>>> I imagine the game is to parallelize such that each worker has 1/Nth
>>>>> of the data, where N is as low as you can manage. Each worker is
>>>>> building one or more trees from its slice. Then iterate, and split up
>>>>> the data differently. Each worker can now cross-validate previous
>>>>> trees based on the new slice, or even build new ones, and iterate
>>>>> again.
>>>>>
>>>>> I'm guessing this process will tend to need to build more trees than
>>>>> usual. I think it's also possible that it will want to try to discard
>>>>> trees with high out-of-bag error.
>>>>>
>>>>> Just riffing. I think this is some potential green-field territory to
>>>>> really think about how you build trees when it doesn't nearly fit in
>>>>> memory, and what shape of computation gives a really nice tradeoff
>>>>> between speed and accuracy. I don't think the conventional approach is
>>>>> that point but it's more like the above.
>>>>>
>>>>> On Mon, Jan 28, 2013 at 10:14 AM, Andy Twigg <an...@gmail.com>
>>>>>
>>>> wrote:
>>>
>>>> Here's my thought for what might work well. Comments very welcome.
>>>>>>
>>>>>> We construct k decision trees (where k might be 20-100). Each tree has
>>>>>> depth roughly 10-30. Rather than construct each tree in a distributed
>>>>>> fashion, construct each tree locally by using multiple passes over the
>>>>>> bagged sample for that tree (construct in BFS manner, assuming the
>>>>>> tree can fit in memory). Do this in parallel k times. I know this is
>>>>>> perhaps not the "mapreduce" way, but I'd be interested to hear
>>>>>> alternatives. It seems a bad idea to shuffle the data around on every
>>>>>> BFS iteration per tree.
>>>>>>
>>>>>> I'm not planning to work on this right now, but would like to figure
>>>>>> out if Mahout is a good platform to work with if I want such an
>>>>>> algorithm.
>>>>>>
>>>>>> Andy
>>>>>>
>>>>>
>>>
>>> --
>>> Dr Andy Twigg
>>> Junior Research Fellow, St Johns College, Oxford
>>> Room 351, Department of Computer Science
>>> http://www.cs.ox.ac.uk/people/andy.twigg/
>>> andy.twigg@cs.ox.ac.uk | +447799647538
>>>
>>>
>

Re: Out-of-core random forest implementation

Posted by Marty Kube <ma...@beavercreekconsulting.com>.
On 01/28/2013 02:33 PM, Ted Dunning wrote:
> I think I was suggesting something weaker.
>
> I was suggesting that trees get built against a portion of the data and
> each node builds some number of trees against just the data it sees.  This
> is in the spirit of random forests, but not the letter.
I'm looking at the Mahout partial implementation.  It appears to me that 
some number of trees get built against the portion of the data each node 
sees based on the map reduce input split.
Am I wrong here?  If not, Mahout already has an out-of-core implementation.

BTW - where did the name Partial Implementation come from?  Is this 
partially completed work :-)
>
> Another option is that all the nodes start some trees based on their own
> data and spit the set of all such trees to a central (possibly distributed
> store).  They also grab some number of trees back down from the central
> store and try elaborating those trees using their own data.  These will
> also be sent back to the central store.  This is closer to random forests.
>
> A third option is like option two, but the trees that get pulled down and
> spit back up collect statistics until enough nodes report such stats at
> which point a new branch is added and the stats are reset.  This is very
> close to real random forests.  It should also be efficient because each
> node is passing many trees across its own data in each pass.  If that data
> fits in memory (even in file cache) then the pass will be very fast.  If it
> doesn't fit in memory, then it will run as fast as the disk will let it
> run.  Allowing a tree to commit with only 90% coverage should allow good
> robustness against laggards while doing almost exactly the same thing as
> the real random forest algorithm.
>
> On Mon, Jan 28, 2013 at 8:56 AM, Andy Twigg <an...@gmail.com> wrote:
>
>> Ted,
>>
>> Sorry, I don't understand. Are you suggesting that a single decision
>> tree can be built efficiently in a distributed fashion?
>>
>> The following 2 ways seem like the naive ways of doing this:
>> 1) each machine constructs one node of the tree, by scanning all the
>> data, filtering those that don't get to its node (this is expensive),
>> computing the split and then writing the split out. This requires a
>> pass through the data for each node of the tree.
>> 2) as 1), except that each machine writes out the filtered data after
>> its node in the tree. This requires less scanning (as much as the
>> local case I described earlier), but lots of data movement.
>>
>> Do you have an alternative method?
>>
>> Andy
>>
>>
>> On 28 January 2013 16:42, Ted Dunning <te...@gmail.com> wrote:
>>> IF we have a step which permutes data (once) then I doubt that
>>> redistribution is necessary.  At that point the randomness consists of
>>> building trees based on different variable subsets and data subsets.  The
>>> original random forests only split on variable subsets.  How much this
>>> matters is an open question.
>>>
>>> On Mon, Jan 28, 2013 at 2:36 AM, Sean Owen <sr...@gmail.com> wrote:
>>>
>>>> It sounds OK to me. Yes, I don't think you want to try to parallelize
>>>> the building of each tree just to build it off all the data. It will
>>>> be too slow.
>>>>
>>>> I imagine the game is to parallelize such that each worker has 1/Nth
>>>> of the data, where N is as low as you can manage. Each worker is
>>>> building one or more trees from its slice. Then iterate, and split up
>>>> the data differently. Each worker can now cross-validate previous
>>>> trees based on the new slice, or even build new ones, and iterate
>>>> again.
>>>>
>>>> I'm guessing this process will tend to need to build more trees than
>>>> usual. I think it's also possible that it will want to try to discard
>>>> trees with high out-of-bag error.
>>>>
>>>> Just riffing. I think this is some potential green-field territory to
>>>> really think about how you build trees when it doesn't nearly fit in
>>>> memory, and what shape of computation gives a really nice tradeoff
>>>> between speed and accuracy. I don't think the conventional approach is
>>>> that point but it's more like the above.
>>>>
>>>> On Mon, Jan 28, 2013 at 10:14 AM, Andy Twigg <an...@gmail.com>
>> wrote:
>>>>> Here's my thought for what might work well. Comments very welcome.
>>>>>
>>>>> We construct k decision trees (where k might be 20-100). Each tree has
>>>>> depth roughly 10-30. Rather than construct each tree in a distributed
>>>>> fashion, construct each tree locally by using multiple passes over the
>>>>> bagged sample for that tree (construct in BFS manner, assuming the
>>>>> tree can fit in memory). Do this in parallel k times. I know this is
>>>>> perhaps not the "mapreduce" way, but I'd be interested to hear
>>>>> alternatives. It seems a bad idea to shuffle the data around on every
>>>>> BFS iteration per tree.
>>>>>
>>>>> I'm not planning to work on this right now, but would like to figure
>>>>> out if Mahout is a good platform to work with if I want such an
>>>>> algorithm.
>>>>>
>>>>> Andy
>>
>>
>> --
>> Dr Andy Twigg
>> Junior Research Fellow, St Johns College, Oxford
>> Room 351, Department of Computer Science
>> http://www.cs.ox.ac.uk/people/andy.twigg/
>> andy.twigg@cs.ox.ac.uk | +447799647538
>>


Re: Out-of-core random forest implementation

Posted by Marty Kube <ma...@beavercreekconsulting.com>.
I think the best design can be selected by starting with the simplest 
approach and letting results/data make the decisions.
Is there a data set we could use to drive the design choices?

On 01/28/2013 02:33 PM, Ted Dunning wrote:
> I think I was suggesting something weaker.
>
> I was suggesting that trees get built against a portion of the data and
> each node builds some number of trees against just the data it sees.  This
> is in the spirit of random forests, but not the letter.
>
> Another option is that all the nodes start some trees based on their own
> data and spit the set of all such trees to a central (possibly distributed
> store).  They also grab some number of trees back down from the central
> store and try elaborating those trees using their own data.  These will
> also be sent back to the central store.  This is closer to random forests.
>
> A third option is like option two, but the trees that get pulled down and
> spit back up collect statistics until enough nodes report such stats at
> which point a new branch is added and the stats are reset.  This is very
> close to real random forests.  It should also be efficient because each
> node is passing many trees across its own data in each pass.  If that data
> fits in memory (even in file cache) then the pass will be very fast.  If it
> doesn't fit in memory, then it will run as fast as the disk will let it
> run.  Allowing a tree to commit with only 90% coverage should allow good
> robustness against laggards while doing almost exactly the same thing as
> the real random forest algorithm.
>
> On Mon, Jan 28, 2013 at 8:56 AM, Andy Twigg <an...@gmail.com> wrote:
>
>> Ted,
>>
>> Sorry, I don't understand. Are you suggesting that a single decision
>> tree can be built efficiently in a distributed fashion?
>>
>> The following 2 ways seem like the naive ways of doing this:
>> 1) each machine constructs one node of the tree, by scanning all the
>> data, filtering those that don't get to its node (this is expensive),
>> computing the split and then writing the split out. This requires a
>> pass through the data for each node of the tree.
>> 2) as 1), except that each machine writes out the filtered data after
>> its node in the tree. This requires less scanning (as much as the
>> local case I described earlier), but lots of data movement.
>>
>> Do you have an alternative method?
>>
>> Andy
>>
>>
>> On 28 January 2013 16:42, Ted Dunning <te...@gmail.com> wrote:
>>> IF we have a step which permutes data (once) then I doubt that
>>> redistribution is necessary.  At that point the randomness consists of
>>> building trees based on different variable subsets and data subsets.  The
>>> original random forests only split on variable subsets.  How much this
>>> matters is an open question.
>>>
>>> On Mon, Jan 28, 2013 at 2:36 AM, Sean Owen <sr...@gmail.com> wrote:
>>>
>>>> It sounds OK to me. Yes, I don't think you want to try to parallelize
>>>> the building of each tree just to build it off all the data. It will
>>>> be too slow.
>>>>
>>>> I imagine the game is to parallelize such that each worker has 1/Nth
>>>> of the data, where N is as low as you can manage. Each worker is
>>>> building one or more trees from its slice. Then iterate, and split up
>>>> the data differently. Each worker can now cross-validate previous
>>>> trees based on the new slice, or even build new ones, and iterate
>>>> again.
>>>>
>>>> I'm guessing this process will tend to need to build more trees than
>>>> usual. I think it's also possible that it will want to try to discard
>>>> trees with high out-of-bag error.
>>>>
>>>> Just riffing. I think this is some potential green-field territory to
>>>> really think about how you build trees when it doesn't nearly fit in
>>>> memory, and what shape of computation gives a really nice tradeoff
>>>> between speed and accuracy. I don't think the conventional approach is
>>>> that point but it's more like the above.
>>>>
>>>> On Mon, Jan 28, 2013 at 10:14 AM, Andy Twigg <an...@gmail.com>
>> wrote:
>>>>> Here's my thought for what might work well. Comments very welcome.
>>>>>
>>>>> We construct k decision trees (where k might be 20-100). Each tree has
>>>>> depth roughly 10-30. Rather than construct each tree in a distributed
>>>>> fashion, construct each tree locally by using multiple passes over the
>>>>> bagged sample for that tree (construct in BFS manner, assuming the
>>>>> tree can fit in memory). Do this in parallel k times. I know this is
>>>>> perhaps not the "mapreduce" way, but I'd be interested to hear
>>>>> alternatives. It seems a bad idea to shuffle the data around on every
>>>>> BFS iteration per tree.
>>>>>
>>>>> I'm not planning to work on this right now, but would like to figure
>>>>> out if Mahout is a good platform to work with if I want such an
>>>>> algorithm.
>>>>>
>>>>> Andy
>>
>>
>> --
>> Dr Andy Twigg
>> Junior Research Fellow, St Johns College, Oxford
>> Room 351, Department of Computer Science
>> http://www.cs.ox.ac.uk/people/andy.twigg/
>> andy.twigg@cs.ox.ac.uk | +447799647538
>>


Re: Out-of-core random forest implementation

Posted by Ted Dunning <te...@gmail.com>.
I think I was suggesting something weaker.

I was suggesting that trees get built against a portion of the data and
each node builds some number of trees against just the data it sees.  This
is in the spirit of random forests, but not the letter.

Another option is that all the nodes start some trees based on their own
data and spit the set of all such trees to a central (possibly distributed
store).  They also grab some number of trees back down from the central
store and try elaborating those trees using their own data.  These will
also be sent back to the central store.  This is closer to random forests.

A third option is like option two, but the trees that get pulled down and
spit back up collect statistics until enough nodes report such stats at
which point a new branch is added and the stats are reset.  This is very
close to real random forests.  It should also be efficient because each
node is passing many trees across its own data in each pass.  If that data
fits in memory (even in file cache) then the pass will be very fast.  If it
doesn't fit in memory, then it will run as fast as the disk will let it
run.  Allowing a tree to commit with only 90% coverage should allow good
robustness against laggards while doing almost exactly the same thing as
the real random forest algorithm.
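
As a toy illustration of the third option (the names below are hypothetical,
and a real version would track one statistics entry per tree node and
candidate split): workers push the split statistics from their local data to a
shared store, and the new branch is committed once roughly 90% of them have
reported.

    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicInteger;

    // Toy sketch of the third option (hypothetical names, not Mahout code): each
    // worker streams its local data past a shared frontier node, pushes split
    // statistics to a central store, and the new branch is committed once about
    // 90% of the workers have reported, so a few laggards cannot stall the forest.
    public class SharedSplitStats {

      private final int totalWorkers;
      private final ConcurrentHashMap<String, AtomicInteger> bucketCounts = new ConcurrentHashMap<>();
      private final Set<Integer> workersReported = ConcurrentHashMap.newKeySet();

      SharedSplitStats(int totalWorkers) {
        this.totalWorkers = totalWorkers;
      }

      // Worker workerId reports how many of its records fell into one candidate split bucket.
      void report(int workerId, String bucket, int count) {
        bucketCounts.computeIfAbsent(bucket, b -> new AtomicInteger()).addAndGet(count);
        workersReported.add(workerId);
      }

      // True once at least 90% of the workers have reported their statistics.
      boolean readyToCommit() {
        return workersReported.size() * 10 >= totalWorkers * 9;
      }

      public static void main(String[] args) {
        SharedSplitStats node = new SharedSplitStats(10);
        for (int w = 0; w < 9; w++) {
          node.report(w, "age<30 & class=yes", 100 + w);   // 9 of 10 workers report
        }
        System.out.println("commit branch now? " + node.readyToCommit());   // true
      }
    }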

On Mon, Jan 28, 2013 at 8:56 AM, Andy Twigg <an...@gmail.com> wrote:

> Ted,
>
> Sorry, I don't understand. Are you suggesting that a single decision
> tree can be built efficiently in a distributed fashion?
>
> The following 2 ways seem like the naive ways of doing this:
> 1) each machine constructs one node of the tree, by scanning all the
> data, filtering those that don't get to its node (this is expensive),
> computing the split and then writing the split out. This requires a
> pass through the data for each node of the tree.
> 2) as 1), except that each machine writes out the filtered data after
> its node in the tree. This requires less scanning (as much as the
> local case I described earlier), but lots of data movement.
>
> Do you have an alternative method?
>
> Andy
>
>
> On 28 January 2013 16:42, Ted Dunning <te...@gmail.com> wrote:
> > IF we have a step which permutes data (once) then I doubt that
> > redistribution is necessary.  At that point the randomness consists of
> > building trees based on different variable subsets and data subsets.  The
> > original random forests only split on variable subsets.  How much this
> > matters is an open question.
> >
> > On Mon, Jan 28, 2013 at 2:36 AM, Sean Owen <sr...@gmail.com> wrote:
> >
> >> It sounds OK to me. Yes, I don't think you want to try to parallelize
> >> the building of each tree just to build it off all the data. It will
> >> be too slow.
> >>
> >> I imagine the game is to parallelize such that each worker has 1/Nth
> >> of the data, where N is as low as you can manage. Each worker is
> >> building one or more trees from its slice. Then iterate, and split up
> >> the data differently. Each worker can now cross-validate previous
> >> trees based on the new slice, or even build new ones, and iterate
> >> again.
> >>
> >> I'm guessing this process will tend to need to build more trees than
> >> usual. I think it's also possible that it will want to try to discard
> >> trees with high out-of-bag error.
> >>
> >> Just riffing. I think this is some potential green-field territory to
> >> really think about how you build trees when it doesn't nearly fit in
> >> memory, and what shape of computation gives a really nice tradeoff
> >> between speed and accuracy. I don't think the conventional approach is
> >> that point but it's more like the above.
> >>
> >> On Mon, Jan 28, 2013 at 10:14 AM, Andy Twigg <an...@gmail.com>
> wrote:
> >> > Here's my thought for what might work well. Comments very welcome.
> >> >
> >> > We construct k decision trees (where k might be 20-100). Each tree has
> >> > depth roughly 10-30. Rather than construct each tree in a distributed
> >> > fashion, construct each tree locally by using multiple passes over the
> >> > bagged sample for that tree (construct in BFS manner, assuming the
> >> > tree can fit in memory). Do this in parallel k times. I know this is
> >> > perhaps not the "mapreduce" way, but I'd be interested to hear
> >> > alternatives. It seems a bad idea to shuffle the data around on every
> >> > BFS iteration per tree.
> >> >
> >> > I'm not planning to work on this right now, but would like to figure
> >> > out if Mahout is a good platform to work with if I want such an
> >> > algorithm.
> >> >
> >> > Andy
> >>
>
>
>
> --
> Dr Andy Twigg
> Junior Research Fellow, St Johns College, Oxford
> Room 351, Department of Computer Science
> http://www.cs.ox.ac.uk/people/andy.twigg/
> andy.twigg@cs.ox.ac.uk | +447799647538
>

Re: Out-of-core random forest implementation

Posted by Andy Twigg <an...@gmail.com>.
Ted,

Sorry, I don't understand. Are you suggesting that a single decision
tree can be built efficiently in a distributed fashion?

The following 2 ways seem like the naive ways of doing this:
1) each machine constructs one node of the tree, by scanning all the
data, filtering those that don't get to its node (this is expensive),
computing the split and then writing the split out. This requires a
pass through the data for each node of the tree.
2) as 1), except that each machine writes out the filtered data after
its node in the tree. This requires less scanning (as much as the
local case I described earlier), but lots of data movement.
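
(For concreteness, a tiny sketch of the filtering step in way 1, with made-up
types: every record is routed down the partial tree and only those landing at
the target node are kept, which is why each new node costs a full scan.)

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.ToIntFunction;

    // Illustration of way 1 above (made-up types): to grow one node we scan the
    // whole dataset, route every record down the partial tree, and keep only the
    // records that reach the target node -- one full pass per tree node.
    public class NodeFilterSketch {

      // routeToLeaf maps a record to the id of the leaf it currently falls into.
      static List<double[]> recordsReaching(Iterable<double[]> allData,
                                            ToIntFunction<double[]> routeToLeaf,
                                            int targetLeafId) {
        List<double[]> reaching = new ArrayList<>();
        for (double[] record : allData) {      // the expensive part: a scan of everything
          if (routeToLeaf.applyAsInt(record) == targetLeafId) {
            reaching.add(record);
          }
        }
        return reaching;
      }

      public static void main(String[] args) {
        List<double[]> data = List.of(new double[] {0.2, 1}, new double[] {0.9, 0});
        // Toy partial tree: a single split on feature 0 at 0.5, giving leaves 0 and 1.
        List<double[]> atLeafOne = recordsReaching(data, r -> r[0] < 0.5 ? 0 : 1, 1);
        System.out.println("records reaching leaf 1: " + atLeafOne.size());
      }
    }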

Do you have an alternative method?

Andy


On 28 January 2013 16:42, Ted Dunning <te...@gmail.com> wrote:
> IF we have a step which permutes data (once) then I doubt that
> redistribution is necessary.  At that point the randomness consists of
> building trees based on different variable subsets and data subsets.  The
> original random forests only split on variable subsets.  How much this
> matters is an open question.
>
> On Mon, Jan 28, 2013 at 2:36 AM, Sean Owen <sr...@gmail.com> wrote:
>
>> It sounds OK to me. Yes, I don't think you want to try to parallelize
>> the building of each tree just to build it off all the data. It will
>> be too slow.
>>
>> I imagine the game is to parallelize such that each worker has 1/Nth
>> of the data, where N is as low as you can manage. Each worker is
>> building one or more trees from its slice. Then iterate, and split up
>> the data differently. Each worker can now cross-validate previous
>> trees based on the new slice, or even build new ones, and iterate
>> again.
>>
>> I'm guessing this process will tend to need to build more trees than
>> usual. I think it's also possible that it will want to try to discard
>> trees with high out-of-bag error.
>>
>> Just riffing. I think this is some potential green-field territory to
>> really think about how you build trees when it doesn't nearly fit in
>> memory, and what shape of computation gives a really nice tradeoff
>> between speed and accuracy. I don't think the conventional approach is
>> that point but it's more like the above.
>>
>> On Mon, Jan 28, 2013 at 10:14 AM, Andy Twigg <an...@gmail.com> wrote:
>> > Here's my thought for what might work well. Comments very welcome.
>> >
>> > We construct k decision trees (where k might be 20-100). Each tree has
>> > depth roughly 10-30. Rather than construct each tree in a distributed
>> > fashion, construct each tree locally by using multiple passes over the
>> > bagged sample for that tree (construct in BFS manner, assuming the
>> > tree can fit in memory). Do this in parallel k times. I know this is
>> > perhaps not the "mapreduce" way, but I'd be interested to hear
>> > alternatives. It seems a bad idea to shuffle the data around on every
>> > BFS iteration per tree.
>> >
>> > I'm not planning to work on this right now, but would like to figure
>> > out if Mahout is a good platform to work with if I want such an
>> > algorithm.
>> >
>> > Andy
>>



-- 
Dr Andy Twigg
Junior Research Fellow, St Johns College, Oxford
Room 351, Department of Computer Science
http://www.cs.ox.ac.uk/people/andy.twigg/
andy.twigg@cs.ox.ac.uk | +447799647538

Re: Out-of-core random forest implementation

Posted by Ted Dunning <te...@gmail.com>.
IF we have a step which permutes data (once) then I doubt that
redistribution is necessary.  At that point the randomness consists of
building trees based on different variable subsets and data subsets.  The
original random forests only split on variable subsets.  How much this
matters is an open question.

On Mon, Jan 28, 2013 at 2:36 AM, Sean Owen <sr...@gmail.com> wrote:

> It sounds OK to me. Yes, I don't think you want to try to parallelize
> the building of each tree just to build it off all the data. It will
> be too slow.
>
> I imagine the game is to parallelize such that each worker has 1/Nth
> of the data, where N is as low as you can manage. Each worker is
> building one or more trees from its slice. Then iterate, and split up
> the data differently. Each worker can now cross-validate previous
> trees based on the new slice, or even build new ones, and iterate
> again.
>
> I'm guessing this process will tend to need to build more trees than
> usual. I think it's also possible that it will want to try to discard
> trees with high out-of-bag error.
>
> Just riffing. I think this is some potential green-field territory to
> really think about how you build trees when it doesn't nearly fit in
> memory, and what shape of computation gives a really nice tradeoff
> between speed and accuracy. I don't think the conventional approach is
> that point but it's more like the above.
>
> On Mon, Jan 28, 2013 at 10:14 AM, Andy Twigg <an...@gmail.com> wrote:
> > Here's my thought for what might work well. Comments very welcome.
> >
> > We construct k decision trees (where k might be 20-100). Each tree has
> > depth roughly 10-30. Rather than construct each tree in a distributed
> > fashion, construct each tree locally by using multiple passes over the
> > bagged sample for that tree (construct in BFS manner, assuming the
> > tree can fit in memory). Do this in parallel k times. I know this is
> > perhaps not the "mapreduce" way, but I'd be interested to hear
> > alternatives. It seems a bad idea to shuffle the data around on every
> > BFS iteration per tree.
> >
> > I'm not planning to work on this right now, but would like to figure
> > out if Mahout is a good platform to work with if I want such an
> > algorithm.
> >
> > Andy
>

Re: Out-of-core random forest implementation

Posted by Sean Owen <sr...@gmail.com>.
It sounds OK to me. Yes, I don't think you want to try to parallelize
the building of each tree just to build it off all the data. It will
be too slow.

I imagine the game is to parallelize such that each worker has 1/Nth
of the data, where N is as low as you can manage. Each worker is
building one or more trees from its slice. Then iterate, and split up
the data differently. Each worker can now cross-validate previous
trees based on the new slice, or even build new ones, and iterate
again.

I'm guessing this process will tend to need to build more trees than
usual. I think it's also possible that it will want to try to discard
trees with high out-of-bag error.
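
A small sketch of that discarding step, with hypothetical types (a "tree" here
is just a record-to-class function): each existing tree is scored on the
freshly assigned slice and dropped if its error is too high.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.ToIntFunction;

    // Sketch of the discarding step (hypothetical types; a "tree" is just a
    // record-to-class function here): score each existing tree on the newly
    // assigned slice and drop the trees whose error is above a threshold.
    public class DiscardWeakTrees {

      static List<ToIntFunction<double[]>> keepAccurateTrees(List<ToIntFunction<double[]>> trees,
                                                             List<double[]> newSlice,
                                                             List<Integer> labels,
                                                             double maxError) {
        List<ToIntFunction<double[]>> kept = new ArrayList<>();
        for (ToIntFunction<double[]> tree : trees) {
          int wrong = 0;
          for (int i = 0; i < newSlice.size(); i++) {
            if (tree.applyAsInt(newSlice.get(i)) != labels.get(i)) {
              wrong++;
            }
          }
          double error = (double) wrong / newSlice.size();   // error on data this worker just received
          if (error <= maxError) {
            kept.add(tree);
          }
        }
        return kept;
      }

      public static void main(String[] args) {
        List<ToIntFunction<double[]>> trees =
            List.of(r -> r[0] > 0.5 ? 1 : 0,   // a sensible tree
                    r -> 1);                   // a useless tree that always predicts 1
        List<double[]> slice = List.of(new double[] {0.9}, new double[] {0.1});
        List<Integer> labels = List.of(1, 0);
        System.out.println("trees kept: " + keepAccurateTrees(trees, slice, labels, 0.25).size());   // 1
      }
    }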

Just riffing. I think this is some potential green-field territory to
really think about how you build trees when it doesn't nearly fit in
memory, and what shape of computation gives a really nice tradeoff
between speed and accuracy. I don't think the conventional approach is
that point but it's more like the above.

On Mon, Jan 28, 2013 at 10:14 AM, Andy Twigg <an...@gmail.com> wrote:
> Here's my thought for what might work well. Comments very welcome.
>
> We construct k decision trees (where k might be 20-100). Each tree has
> depth roughly 10-30. Rather than construct each tree in a distributed
> fashion, construct each tree locally by using multiple passes over the
> bagged sample for that tree (construct in BFS manner, assuming the
> tree can fit in memory). Do this in parallel k times. I know this is
> perhaps not the "mapreduce" way, but I'd be interested to hear
> alternatives. It seems a bad idea to shuffle the data around on every
> BFS iteration per tree.
>
> I'm not planning to work on this right now, but would like to figure
> out if Mahout is a good platform to work with if I want such an
> algorithm.
>
> Andy

Re: Out-of-core random forest implementation

Posted by Andy Twigg <an...@gmail.com>.
Here's my thought for what might work well. Comments very welcome.

We construct k decision trees (where k might be 20-100). Each tree has
depth roughly 10-30. Rather than construct each tree in a distributed
fashion, construct each tree locally by using multiple passes over the
bagged sample for that tree (construct in BFS manner, assuming the
tree can fit in memory). Do this in parallel k times. I know this is
perhaps not the "mapreduce" way, but I'd be interested to hear
alternatives. It seems a bad idea to shuffle the data around on every
BFS iteration per tree.
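
A rough sketch of one BFS level of that scheme, with placeholder types (not
Mahout classes): the bagged sample is streamed once per level, each record is
routed to its current frontier leaf, and only the per-leaf statistics are kept
in memory.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.function.ToIntFunction;

    // Sketch of one BFS level (placeholder types, not Mahout classes). The bagged
    // sample is any re-scannable source -- for example an Iterable that re-reads
    // the on-disk file each time iterator() is called -- so only the tree and the
    // per-leaf statistics live in memory.
    public class BfsLevelSketch {

      // One pass over the bag: route each record to its current frontier leaf and
      // accumulate statistics for that leaf. The statistics are bare class counts
      // to keep the sketch short; a real builder would accumulate per-attribute
      // counts from which the next split is chosen.
      static Map<Integer, int[]> collectLevelStats(Iterable<double[]> bag,
                                                   ToIntFunction<double[]> routeToLeaf,
                                                   int numClasses) {
        Map<Integer, int[]> statsPerLeaf = new HashMap<>();
        for (double[] record : bag) {                       // sequential pass, no random access
          int leaf = routeToLeaf.applyAsInt(record);
          int label = (int) record[record.length - 1];      // label stored in the last column
          statsPerLeaf.computeIfAbsent(leaf, l -> new int[numClasses])[label]++;
        }
        return statsPerLeaf;   // after the pass: split each frontier leaf, then repeat for the next level
      }

      public static void main(String[] args) {
        List<double[]> bag = List.of(new double[] {0.3, 0}, new double[] {0.8, 1});
        // Current depth-1 tree: split on feature 0 at 0.5, giving frontier leaves 0 and 1.
        Map<Integer, int[]> stats = collectLevelStats(bag, r -> r[0] < 0.5 ? 0 : 1, 2);
        System.out.println("frontier leaves with statistics: " + stats.keySet());
      }
    }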

I'm not planning to work on this right now, but would like to figure
out if Mahout is a good platform to work with if I want such an
algorithm.

Andy

On 26 January 2013 12:39, Andy Twigg <an...@gmail.com> wrote:
> Hi,
>
> I'd like to grow some decision trees efficiently where the dataset is
> too large for memory. The decision tree will almost certainly fit in
> memory. I'm a bit surprised that Mahout doesn't have an algorithm to
> construct decision trees when the data doesn't fit in memory, given
> its focus on large scale!
>
> Can anyone recommend an algorithm that does grow random forests 1)
> when the data is out-of-core, 2) with multi-pass streaming access to
> data, 3) is distributed, [optionally: 4) is incremental ] ? I am happy
> to help contribute such an algorithm to mahout, but would prefer it if
> one already exists. Does anyone have familiarity with the algorithm
> used by bigML.com ?
>
> Andy
>
>
>
>
> On 26 January 2013 00:29, Marty Kube
> <ma...@beavercreekconsulting.com> wrote:
>> Hey Andy,
>>
>> What is the use case that is driving your question?
>> Are you looking at the training phase - I didn't realise that one needed to
>> keep the data in memory.
>> I have a use case where even keeping the trees, much less the data, in
>> memory during classification is an issue.
>>
>> Ted, do you have some references on the algorithms below?  I've done some
>> work on using mmap trees during classification.  I'd be happy to contribute
>> along those lines.
>>
>>
>> On 01/25/2013 04:52 PM, Ted Dunning wrote:
>>>
>>> Hey Andy,
>>>
>>> There are no plans for this.  You are correct that multiple passes aren't
>>> too difficult, but they do go against the standard map-reduce paradigm a
>>> bit if you want to avoid iterative map-reduce.
>>>
>>> It definitely would be nice to have a really competitive random forest
>>> implementation that uses the global  accumulator style plus long-lived
>>> mappers.  The basic idea would be to use the same sort of tricks that
>>> Vowpal Wabbit or Giraph use to get a bunch of long-lived mappers and then
>>> have them asynchronously talk to a tree repository.
>>>
>>> On Fri, Jan 25, 2013 at 6:58 PM, Andy Twigg <an...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm new to this list so I apologise if this is covered elsewhere (but
>>>> I couldn't find it..)
>>>>
>>>> I'm looking at the Random Forests implementations, both mapreduce
>>>> ("partial") and non-distributed. Both appear to require the data
>>>> loaded into memory. Random forests should be straightforward to
>>>> construct with multiple passes through the data without storing the
>>>> data in memory. Is there such an implementation in Mahout? If not, is
>>>> there a ticket/plan ?
>>>>
>>>> Thanks,
>>>> Andy
>>>>
>>>>
>>>> --
>>>> Dr Andy Twigg
>>>> Junior Research Fellow, St Johns College, Oxford
>>>> Room 351, Department of Computer Science
>>>> http://www.cs.ox.ac.uk/people/andy.twigg/
>>>> andy.twigg@cs.ox.ac.uk | +447799647538
>>>>
>>
>
>
>
> --
> Dr Andy Twigg
> Junior Research Fellow, St Johns College, Oxford
> Room 351, Department of Computer Science
> http://www.cs.ox.ac.uk/people/andy.twigg/
> andy.twigg@cs.ox.ac.uk | +447799647538



-- 
Dr Andy Twigg
Junior Research Fellow, St Johns College, Oxford
Room 351, Department of Computer Science
http://www.cs.ox.ac.uk/people/andy.twigg/
andy.twigg@cs.ox.ac.uk | +447799647538

Re: Out-of-core random forest implementation

Posted by Andy Twigg <an...@gmail.com>.
Hi,

I'd like to grow some decision trees efficiently where the dataset is
too large for memory. The decision tree will almost certainly fit in
memory. I'm a bit surprised that Mahout doesn't have an algorithm to
construct decision trees when the data doesn't fit in memory, given
its focus on large scale!

Can anyone recommend an algorithm that does grow random forests 1)
when the data is out-of-core, 2) with multi-pass streaming access to
data, 3) is distributed, [optionally: 4) is incremental ] ? I am happy
to help contribute such an algorithm to mahout, but would prefer it if
one already exists. Does anyone have familiarity with the algorithm
used by bigML.com ?

Andy




On 26 January 2013 00:29, Marty Kube
<ma...@beavercreekconsulting.com> wrote:
> Hey Andy,
>
> What is the use case that is driving your question?
> Are you looking at the training phase - I didn't realise that one needed to
> keep the data in memory.
> I have a use case where even keeping the trees, much less the data, in
> memory during classification is an issue.
>
> Ted, do you have some references on the algorithms below?  I've done some
> work on using mmap trees during classification.  I'd be happy to contribute
> along those lines.
>
>
> On 01/25/2013 04:52 PM, Ted Dunning wrote:
>>
>> Hey Andy,
>>
>> There are no plans for this.  You are correct that multiple passes aren't
>> too difficult, but they do go against the standard map-reduce paradigm a
>> bit if you want to avoid iterative map-reduce.
>>
>> It definitely would be nice to have a really competitive random forest
>> implementation that uses the global  accumulator style plus long-lived
>> mappers.  The basic idea would be to use the same sort of tricks that
>> Vowpal Wabbit or Giraph use to get a bunch of long-lived mappers and then
>> have them asynchronously talk to a tree repository.
>>
>> On Fri, Jan 25, 2013 at 6:58 PM, Andy Twigg <an...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I'm new to this list so I apologise if this is covered elsewhere (but
>>> I couldn't find it..)
>>>
>>> I'm looking at the Random Forests implementations, both mapreduce
>>> ("partial") and non-distributed. Both appear to require the data
>>> loaded into memory. Random forests should be straightforward to
>>> construct with multiple passes through the data without storing the
>>> data in memory. Is there such an implementation in Mahout? If not, is
>>> there a ticket/plan ?
>>>
>>> Thanks,
>>> Andy
>>>
>>>
>>> --
>>> Dr Andy Twigg
>>> Junior Research Fellow, St Johns College, Oxford
>>> Room 351, Department of Computer Science
>>> http://www.cs.ox.ac.uk/people/andy.twigg/
>>> andy.twigg@cs.ox.ac.uk | +447799647538
>>>
>



-- 
Dr Andy Twigg
Junior Research Fellow, St Johns College, Oxford
Room 351, Department of Computer Science
http://www.cs.ox.ac.uk/people/andy.twigg/
andy.twigg@cs.ox.ac.uk | +447799647538

Re: Out-of-core random forest implementation

Posted by Marty Kube <ma...@beavercreekconsulting.com>.
Hey Andy,

What is the use case that is driving your question?
Are you looking at the training phase - I didn't realise that one needed 
to keep the data in memory.
I have a use case where even keeping the trees, much less the data, in 
memory during classification is an issue.

Ted, do you have some references on the algorithms below?  I've done 
some work on using mmap trees during classification.  I'd be happy to 
contribute along those lines.

On 01/25/2013 04:52 PM, Ted Dunning wrote:
> Hey Andy,
>
> There are no plans for this.  You are correct that multiple passes aren't
> too difficult, but they do go against the standard map-reduce paradigm a
> bit if you want to avoid iterative map-reduce.
>
> It definitely would be nice to have a really competitive random forest
> implementation that uses the global  accumulator style plus long-lived
> mappers.  The basic idea would be to use the same sort of tricks that
> Vowpal Wabbit or Giraph use to get a bunch of long-lived mappers and then
> have them asynchronously talk to a tree repository.
>
> On Fri, Jan 25, 2013 at 6:58 PM, Andy Twigg <an...@gmail.com> wrote:
>
>> Hi,
>>
>> I'm new to this list so I apologise if this is covered elsewhere (but
>> I couldn't find it..)
>>
>> I'm looking at the Random Forests implementations, both mapreduce
>> ("partial") and non-distributed. Both appear to require the data
>> loaded into memory. Random forests should be straightforward to
>> construct with multiple passes through the data without storing the
>> data in memory. Is there such an implementation in Mahout? If not, is
>> there a ticket/plan ?
>>
>> Thanks,
>> Andy
>>
>>
>> --
>> Dr Andy Twigg
>> Junior Research Fellow, St Johns College, Oxford
>> Room 351, Department of Computer Science
>> http://www.cs.ox.ac.uk/people/andy.twigg/
>> andy.twigg@cs.ox.ac.uk | +447799647538
>>


Re: Out-of-core random forest implementation

Posted by Ted Dunning <te...@gmail.com>.
Hey Andy,

There are no plans for this.  You are correct that multiple passes aren't
too difficult, but they do go against the standard map-reduce paradigm a
bit if you want to avoid iterative map-reduce.

It definitely would be nice to have a really competitive random forest
implementation that uses the global  accumulator style plus long-lived
mappers.  The basic idea would be to use the same sort of tricks that
Vowpal Wabbit or Giraph use to get a bunch of long-lived mappers and then
have them asynchronously talk to a tree repository.
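
Sketched very roughly, the shape of that design might look like the following
(all names are hypothetical; none of this exists in Mahout): a tree repository
that long-lived workers publish to and check trees out of asynchronously.

    import java.util.List;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.CopyOnWriteArrayList;

    // Hypothetical sketch of the shape of that design (none of these names exist
    // in Mahout): long-lived workers keep running, grow or extend trees against
    // their local data, and talk to a shared tree repository without blocking on
    // each other.
    public class TreeRepositorySketch {

      interface TreeRepository<T> {
        void publish(T tree);             // a worker pushes a finished or partially grown tree
        List<T> checkout(int howMany);    // a worker grabs some trees to elaborate further
      }

      static class InMemoryRepository<T> implements TreeRepository<T> {
        private final CopyOnWriteArrayList<T> trees = new CopyOnWriteArrayList<>();
        public void publish(T tree) { trees.add(tree); }
        public List<T> checkout(int howMany) {
          return trees.subList(0, Math.min(howMany, trees.size()));
        }
      }

      public static void main(String[] args) {
        TreeRepository<String> repo = new InMemoryRepository<>();
        // A real worker would loop: build from local data, publish, checkout, extend, publish...
        CompletableFuture<Void> worker =
            CompletableFuture.runAsync(() -> repo.publish("tree grown from worker 0's split"));
        worker.join();
        System.out.println("trees in repository: " + repo.checkout(10));
      }
    }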

On Fri, Jan 25, 2013 at 6:58 PM, Andy Twigg <an...@gmail.com> wrote:

> Hi,
>
> I'm new to this list so I apologise if this is covered elsewhere (but
> I couldn't find it..)
>
> I'm looking at the Random Forests implementations, both mapreduce
> ("partial") and non-distributed. Both appear to require the data
> loaded into memory. Random forests should be straightforward to
> construct with multiple passes through the data without storing the
> data in memory. Is there such an implementation in Mahout? If not, is
> there a ticket/plan ?
>
> Thanks,
> Andy
>
>
> --
> Dr Andy Twigg
> Junior Research Fellow, St Johns College, Oxford
> Room 351, Department of Computer Science
> http://www.cs.ox.ac.uk/people/andy.twigg/
> andy.twigg@cs.ox.ac.uk | +447799647538
>