Posted to dev@mahout.apache.org by Jeff Eastman <je...@windwardsolutions.com> on 2008/04/04 02:04:52 UTC

RE: MapReduce, machine learning, and introductions

Hi Gary,

 

Thanks for your suggestion on Random Forests. I've cc'd this thread to the
Mahout dev list just in case you would like to continue it there. We have
received a lot of interest from students in conjunction with the Google
Summer of Code project and others looking to contribute to our mission. We
are not restricted at all to the 10 original NIPS algorithms; they were just
a natural starting point and a way to "prime the pump". Perhaps some more
information on your experiences using it on real manufacturing data would
motivate an implementation. 

 

Jeff 

 

  _____  

From: Gary Bradski [mailto:garybradski@gmail.com] 
Sent: Thursday, April 03, 2008 4:46 PM
To: Jeff Eastman
Cc: Andrew Y. Ng; Dubey, Pradeep; Jimmy Lin
Subject: Re: MapReduce, machine learning, and introductions

 

One of the things I'd like to see parallelized is Random Forests.  Though
there is no "best" algorithm for classification, when I ran it on Intel
manufacturing data sets it was almost always beating boosting, SVMs, and
MART. Zisserman claimed it worked best on keypoint recognition in vision,
and his version was the simplest one I've heard of.

This is one of those "brain dead" parallelizations -- just parcel out the
learning of trees on randomly selected subsets of the data.  In learning,
each tree randomly selects from a subset of the features at each node.
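
As a minimal sketch of that scheme (scikit-learn's DecisionTreeClassifier
and Python's multiprocessing stand in for the per-tree learner and the
cluster, so both are assumptions for illustration; on Hadoop each map task
would train one tree and the reduce step would just collect them):

import numpy as np
from multiprocessing import Pool
from sklearn.tree import DecisionTreeClassifier

def train_one_tree(args):
    # One "map task": learn a single tree on a bootstrap sample of the data,
    # with a random subset of features considered at each node (max_features).
    X, y, seed = args
    rng = np.random.RandomState(seed)
    idx = rng.randint(0, len(X), size=len(X))     # random subset of the data
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=seed)
    return tree.fit(X[idx], y[idx])

def train_forest(X, y, n_trees=100, n_workers=4):
    # The "reduce step" is trivial: collect the independently built trees.
    with Pool(n_workers) as pool:
        return pool.map(train_one_tree, [(X, y, s) for s in range(n_trees)])

def predict(forest, X):
    # Majority vote over the trees; labels are assumed to be small ints.
    votes = np.stack([t.predict(X) for t in forest]).astype(int)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)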

It has nice techniques for doing feature selection as well.

Gary

On Thu, Apr 3, 2008 at 4:27 PM, Jeff Eastman <je...@windwardsolutions.com>
wrote:

Well, it has been a couple of years. Thanks for the response and
retransmission. Good luck in your current endeavors.

 

Jeff 

 

  _____  

From: Gary Bradski [mailto:garybradski@gmail.com] 
Sent: Thursday, April 03, 2008 4:23 PM
To: Andrew Y. Ng; Dubey, Pradeep
Cc: Jeff Eastman; Jimmy Lin
Subject: Re: MapReduce, machine learning, and introductions

 

Re: Parallel machine learning project Mahout http://lucene.apache.org/mahout

When I was at Intel, I began carving out a parallel machine learning niche,
since it was something I found interesting that Intel would also be
interested in.

But that was two companies ago for me and I haven't touched it since.  I'm
now focused on sensor-guided manipulation and revamping the computer vision
library I started, OpenCV.

About all I can do is send the last known working version of the code that I
had.  I've CC'd Pradeep Dubey, an Intel Fellow with whom I worked on some of
the parallel machine learning issues; his team also studied that code.  I
don't know what has happened since, but parallel machine learning might still
be one of his active areas, and maybe there's some synergy there.

Gary

On Thu, Apr 3, 2008 at 3:38 PM, Andrew Y. Ng <an...@cs.stanford.edu> wrote:

Hi Jeff,

I'd been hearing increasing amounts of buzz on Mahout and am excited
about it, but unfortunately am no longer working in this space.
Gary Bradski, CC-ed above, would be a great person to talk to about
Map-Reduce and machine learning, though!

Andrew


On Thu, 3 Apr 2008, Jeff Eastman wrote:

> Hi Andrew,
>
> I'm a committer on the new Mahout project. As Jimmy indicated, we are
> setting out to implement versions of the NIPS paper algorithms on top of
> Hadoop. So far, we have committed versions of only k-means and canopy but
> have a number of other algorithms in various stages of implementation. I
> don't have any immediate questions but I live in Los Altos and so it would
> be convenient to visit if you or your colleagues do have questions about
> Mahout.
>
> In any case I thought it would be nice to introduce myself.
>
> Jeff
>
> http://lucene.apache.org/mahout
>
>
> Jeff Eastman, Ph.D.
> Windward Solutions Inc.
> +1.415.298.0023
> http://windwardsolutions.com
> http://jeffeastman.blogspot.com
>
>
> > -----Original Message-----
> > From: Jimmy Lin [mailto:jimmylin@umd.edu]
> > Sent: Saturday, March 29, 2008 8:37 PM
> > To: ang@cs.stanford.edu
> > Cc: Jeff Eastman
> > Subject: MapReduce, machine learning, and introductions
> >
> > Hi Andrew,
> >
> > How are things going?  Haven't seen you in a while... hope everything
> > is going well at Stanford.
> >
> > I was recently in the Bay Area attending the Yahoo Hadoop summit---
> > I've been using MapReduce in teaching and research recently (stat MT,
> > IR, etc.), so I was there talking about that.
> >
> > Are you aware of the Apache Mahout project?  They are putting together
> > an open-source MR toolkit for machine-learning-ish things; one of the
> > things they're working on is implementing the various algorithms in
> > your NIPS paper.  Jeff Eastman is involved in the project, cc'ed
> > here.  I thought I'd put the two of you in touch...
> >
> > Best,
> > Jimmy
>
>
>
>

 

 


Re: MapReduce, machine learning, and introductions

Posted by Gary Bradski <ga...@gmail.com>.
Random forests, though not developed that way, are an example of Kleinberg's
Stochastic Discrimination, which builds optimal classifiers based on the
Law of Large Numbers. Such classifiers are built out of large collections
of simple classifiers, but are distinct from boosting. For this, the
classifiers have to meet three conditions:

   1. Encouragement: The simple classifiers must weakly favor one
   class over another.
   2. Generalization: This is data dependent, and thus why you can in fact
   build an optimal classifier FOR THAT DATA.  Your decision functions which
   work on the training set must also work on the test set.
   3. Fairness: You cannot have any statistical bias.  Thus, square
   classifiers cannot be used to classify squares, for example, because the
   edges and corners of the squares would have different statistics.

Like most things in life, you cannot actually meet all the conditions for
such a classifier. Usually you get the first two and fudge the third, either
by post-processing, as Kleinberg does (which breaks the parallelization), or
by random functions that tend to diffuse the bias, as in Random Forests.
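
To make this concrete, here is a minimal sketch of the Stochastic
Discrimination recipe in Python, using random hyperplane stumps as the simple
classifiers (the stump construction, the edge threshold, and numpy are my
choices for illustration, not Kleinberg's actual construction): generate
random weak models, keep only those that weakly favor one class (the
encouragement condition), and let the Law of Large Numbers turn the averaged
votes into a strong discriminant.

import numpy as np

def random_stump(d, rng):
    # A random hyperplane x -> sign(w . x - b); purely random, never fitted.
    w, b = rng.normal(size=d), rng.normal()
    return lambda X: np.sign(X @ w - b)

def build_sd_ensemble(X, y, n_keep=500, edge=0.55, seed=0):
    # Keep only stumps that weakly favor one class over the other (the
    # encouragement condition); y is assumed to be in {-1, +1}. The loop
    # assumes stumps with the required edge exist for this data.
    rng = np.random.RandomState(seed)
    kept = []
    while len(kept) < n_keep:
        f = random_stump(X.shape[1], rng)
        acc = np.mean(f(X) == y)
        if acc >= edge:
            kept.append(f)
        elif acc <= 1.0 - edge:
            kept.append(lambda X, f=f: -f(X))   # flip an anti-correlated stump
    return kept

def sd_predict(ensemble, X):
    # By the Law of Large Numbers the mean vote concentrates around its
    # expectation, so averaging many weak votes yields a strong discriminant.
    votes = np.mean([f(X) for f in ensemble], axis=0)
    return np.where(votes >= 0, 1, -1)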


   - Kleinberg's site is worth a look: http://kappa.math.buffalo.edu/  But
   he can be rather obscure, and others have implemented his theory, e.g.
   http://tinyurl.com/6hvfvs
   - For Random Forests, see Leo Breiman's site (RIP):
   http://www.stat.berkeley.edu/users/breiman/  Breiman was the, or one of
   the, key inventors of decision trees.
   - But I'd also look at very simple implementations such as the one by
   Zisserman:

Bosch, A., Zisserman, A. and Munoz, X.
Image Classification using Random Forests and Ferns
Proceedings of the 11th International Conference on Computer Vision, Rio de
Janeiro, Brazil (2007)
BibTeX: http://www.robots.ox.ac.uk/%7Evgg/publications/html/bosch07a-bibtex.html
Abstract: http://www.robots.ox.ac.uk/%7Evgg/publications/html/bosch07a-abstract.html
Document (ps.gz): http://www.robots.ox.ac.uk/%7Evgg/publications/papers/bosch07a.ps.gz
Document (PDF): http://www.robots.ox.ac.uk/%7Evgg/publications/papers/bosch07a.pdf

Stochastic Discrimination classifiers have nice properties:

   - They never overtrain, unlike boosting. Because of the Law of Large
   Numbers, they just get better with more data.
   - They are innately parallel/independent.
   - It is easy to use them for variable selection, via techniques such as
   those discussed by Breiman (see his "Black Box" lecture on his site); a
   sketch of one such technique is below.

When built out of decision trees, they have principled ways of handling
missing data, mixed data types, and data at very different scales, such as
often occur with real data but seldom in computer vision.
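
As a sketch of the variable-selection idea mentioned above (permutation-style
importance in the spirit of Breiman's "Black Box" lecture; scikit-learn's
RandomForestClassifier and a held-out set stand in for his out-of-bag
machinery, so treat those details as assumptions):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def permutation_importance(forest, X_valid, y_valid, seed=0):
    # Breiman-style importance: shuffle one feature column at a time and
    # record the drop in held-out accuracy; large drops mark important
    # variables. (Breiman used out-of-bag samples; a held-out set is
    # simpler to show here.)
    rng = np.random.RandomState(seed)
    base = forest.score(X_valid, y_valid)
    drops = []
    for j in range(X_valid.shape[1]):
        Xp = X_valid.copy()
        rng.shuffle(Xp[:, j])                  # destroy feature j's signal
        drops.append(base - forest.score(Xp, y_valid))
    return np.array(drops)

# Usage sketch:
# forest = RandomForestClassifier(n_estimators=200).fit(X_train, y_train)
# print(permutation_importance(forest, X_valid, y_valid))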

By the way, OpenCV has a full implementation of Random Forests that is free,
open and under a BSD license.
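
For reference, a minimal usage sketch against today's OpenCV Python bindings
(cv2.ml.RTrees); in 2008 the interface was the CvRTrees C/C++ API, so the
exact calls below illustrate the current API rather than the original one:

import numpy as np
import cv2

# Toy data: 200 samples, 5 features, two classes. OpenCV's ml module wants
# float32 samples; integer (int32) responses are treated as class labels.
rng = np.random.RandomState(0)
samples = rng.normal(size=(200, 5)).astype(np.float32)
labels = (samples[:, 0] + samples[:, 1] > 0).astype(np.int32)

train_data = cv2.ml.TrainData_create(samples, cv2.ml.ROW_SAMPLE, labels)

rtrees = cv2.ml.RTrees_create()
rtrees.setMaxDepth(10)
rtrees.setTermCriteria((cv2.TERM_CRITERIA_MAX_ITER, 100, 0))  # grow 100 trees
rtrees.train(train_data)

_, predictions = rtrees.predict(samples)
print("training accuracy:", np.mean(predictions.ravel() == labels))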

Gary

On Thu, Apr 3, 2008 at 5:04 PM, Jeff Eastman <je...@windwardsolutions.com>
wrote:

> [...]

RE: MapReduce, machine learning, and introductions

Posted by "Dubey, Pradeep" <pr...@intel.com>.
Gary,
Yes, we have continued where you left off. All of those kernels have been
parallelized, simulated, and analyzed, and are now being optimized for
'many-core'. Hopefully, we will be able to publish soon :-).

Pradeep

-----Original Message-----
From: Ted Dunning [mailto:tdunning@veoh.com] 
Sent: Thursday, April 03, 2008 5:13 PM
To: mahout-dev@lucene.apache.org; 'Gary Bradski'
Cc: 'Andrew Y. Ng'; Dubey, Pradeep; 'Jimmy Lin'
Subject: Re: MapReduce, machine learning, and introductions


Random forests are very cool and very odd little beasts.

+n!


On 4/3/08 5:04 PM, "Jeff Eastman" <je...@windwardsolutions.com> wrote:

> [...]


Re: MapReduce, machine learning, and introductions

Posted by Ted Dunning <td...@veoh.com>.
Random forests are very cool and very odd little beasts.

+n!


On 4/3/08 5:04 PM, "Jeff Eastman" <je...@windwardsolutions.com> wrote:

> [...]