You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@mahout.apache.org by deneche abdelhakim <a_...@yahoo.fr> on 2008/05/27 10:52:53 UTC

GSOC Mahout.GA, next steps ?

In a GA there are many things that can be distributed, and one should always start with the most compute demanding task . This is very problem dependent, but in most cases the fitness evaluation function (FEF) "is" the part to distribute.

The FEF evaluates each single individual in the population, and it may need some datas (D) to do so. For example in the traveling Salesman Problem, the problem is defined by a set of cities and the distances between them, the FEF needs those distances to evaluate the individuals.

I see 2 ways to distribute the FEF:

A. if the datas D is not big and can fit in each single cluster node, then the easiest solution is to use each Mapper to evaluate one individual and to pass the Datas D to all the mappers (using some Job parameter or the DistributedCache). The input of the job is the population of individuals. For someone used to work with Watchmaker, the solution A is straightforward, he needs to change one line of code.

B. if the datas D are really big and span over multiple nodes, then the FEF should be writen in the form of Mappers-Reducers, the population of individuals is passed to all the mappers (again using the DistributedCache or a Job parameter) and the datas D are now the input of the Job.

[MAHOUT-56] contains a possible implementation for solution A. Now I should start thinking about solution B and all I need is a problem that uses very big datasets. I already proposed one in my GSoC proposal, it consists of using a Genetic Algorithm to find good binary classification rule for a given dataset. But I am open to any other suggestion.

__________________________________________________
Do You Yahoo!?
En finir avec le spam? Yahoo! Mail vous offre la meilleure protection possible contre les messages non sollicités
http://mail.yahoo.fr Yahoo! Mail

Re: GSOC Mahout.GA, next steps ?

Posted by Grant Ingersoll <gs...@apache.org>.

Cool, thanks!  Sorry I have been so quiet.  Do feel free to ask more  
questions if you have them.  I hope to finally have a bunch of  
personal things past very soon and will be able to focus some time on  
Mahout again.

-Grant

On Jun 9, 2008, at 6:14 AM, deneche abdelhakim wrote:

> I found a cool introduction to evolutionary algorithms, I added it  
> to the wiki if someone is interested...
>
>
> --- En date de : Mer 28.5.08, Grant Ingersoll <gs...@apache.org>  
> a écrit :
>
>> De: Grant Ingersoll <gs...@apache.org>
>> Objet: Re: GSOC Mahout.GA, next steps ?
>> À: mahout-dev@lucene.apache.org
>> Date: Mercredi 28 Mai 2008, 13h11
>> This sounds good.  I don't know a lot about GAs, so if
>> others have
>> insight, that would be great.  It would also be handy if
>> you could put
>> up a section on the Wiki about GAs and maybe post some
>> links to basic
>> papers there, so people that aren't familiar can go do
>> some background
>> reading.
>>
>> I will try to get to MAHOUT-56 this week, but others can
>> jump in and
>> review as well.
>>
>> -Grant
>>
>> On May 27, 2008, at 4:52 AM, deneche abdelhakim wrote:
>>
>>> In a GA there are many things that can be distributed,
>> and one
>>> should always start with the most compute demanding
>> task . This is
>>> very problem dependent, but in most cases the fitness
>> evaluation
>>> function (FEF) "is" the part to distribute.
>>>
>>> The FEF evaluates each single individual in the
>> population, and it
>>> may need some datas (D) to do so. For example in the
>> traveling
>>> Salesman Problem, the problem is defined by a set of
>> cities and the
>>> distances between them, the FEF needs those distances
>> to evaluate
>>> the individuals.
>>>
>>> I see 2 ways to distribute the FEF:
>>>
>>> A. if the datas D is not big and can fit in each
>> single cluster
>>> node, then the easiest solution is to use each Mapper
>> to evaluate
>>> one individual and to pass the Datas D to all the
>> mappers (using
>>> some Job parameter or the DistributedCache). The input
>> of the job is
>>> the population of individuals. For someone used to
>> work with
>>> Watchmaker, the solution A is straightforward, he
>> needs to change
>>> one line of code.
>>>
>>> B. if the datas D are really big and span over
>> multiple nodes, then
>>> the FEF should be writen in the form of
>> Mappers-Reducers, the
>>> population of individuals is passed to all the mappers
>> (again using
>>> the DistributedCache or a Job parameter) and the datas
>> D are now the
>>> input of the Job.
>>>
>>> [MAHOUT-56] contains a possible implementation for
>> solution A. Now I
>>> should start thinking about solution B and all I need
>> is a problem
>>> that uses very big datasets. I already proposed one in
>> my GSoC
>>> proposal, it consists of using a Genetic Algorithm to
>> find good
>>> binary classification rule for a given dataset. But I
>> am open to any
>>> other suggestion.
>>>
>>> __________________________________________________
>>> Do You Yahoo!?
>>> En finir avec le spam? Yahoo! Mail vous offre la
>> meilleure
>>> protection possible contre les messages non
>> sollicités
>>> http://mail.yahoo.fr Yahoo! Mail
>
>
>       
> _____________________________________________________________________________
> Envoyez avec Yahoo! Mail. Une boite mail plus intelligente http://mail.yahoo.fr

--------------------------
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ

Re: GSOC Mahout.GA, next steps ?

Posted by deneche abdelhakim <a_...@yahoo.fr>.

I found a cool introduction to evolutionary algorithms, I added it to the wiki if someone is interested...


--- En date de : Mer 28.5.08, Grant Ingersoll <gs...@apache.org> a écrit :

> De: Grant Ingersoll <gs...@apache.org>
> Objet: Re: GSOC Mahout.GA, next steps ?
> À: mahout-dev@lucene.apache.org
> Date: Mercredi 28 Mai 2008, 13h11
> This sounds good.  I don't know a lot about GAs, so if
> others have  
> insight, that would be great.  It would also be handy if
> you could put  
> up a section on the Wiki about GAs and maybe post some
> links to basic  
> papers there, so people that aren't familiar can go do
> some background  
> reading.
> 
> I will try to get to MAHOUT-56 this week, but others can
> jump in and  
> review as well.
> 
> -Grant
> 
> On May 27, 2008, at 4:52 AM, deneche abdelhakim wrote:
> 
> > In a GA there are many things that can be distributed,
> and one  
> > should always start with the most compute demanding
> task . This is  
> > very problem dependent, but in most cases the fitness
> evaluation  
> > function (FEF) "is" the part to distribute.
> >
> > The FEF evaluates each single individual in the
> population, and it  
> > may need some datas (D) to do so. For example in the
> traveling  
> > Salesman Problem, the problem is defined by a set of
> cities and the  
> > distances between them, the FEF needs those distances
> to evaluate  
> > the individuals.
> >
> > I see 2 ways to distribute the FEF:
> >
> > A. if the datas D is not big and can fit in each
> single cluster  
> > node, then the easiest solution is to use each Mapper
> to evaluate  
> > one individual and to pass the Datas D to all the
> mappers (using  
> > some Job parameter or the DistributedCache). The input
> of the job is  
> > the population of individuals. For someone used to
> work with  
> > Watchmaker, the solution A is straightforward, he
> needs to change  
> > one line of code.
> >
> > B. if the datas D are really big and span over
> multiple nodes, then  
> > the FEF should be writen in the form of
> Mappers-Reducers, the  
> > population of individuals is passed to all the mappers
> (again using  
> > the DistributedCache or a Job parameter) and the datas
> D are now the  
> > input of the Job.
> >
> > [MAHOUT-56] contains a possible implementation for
> solution A. Now I  
> > should start thinking about solution B and all I need
> is a problem  
> > that uses very big datasets. I already proposed one in
> my GSoC  
> > proposal, it consists of using a Genetic Algorithm to
> find good  
> > binary classification rule for a given dataset. But I
> am open to any  
> > other suggestion.
> >
> > __________________________________________________
> > Do You Yahoo!?
> > En finir avec le spam? Yahoo! Mail vous offre la
> meilleure  
> > protection possible contre les messages non
> sollicités
> > http://mail.yahoo.fr Yahoo! Mail


      _____________________________________________________________________________ 
Envoyez avec Yahoo! Mail. Une boite mail plus intelligente http://mail.yahoo.fr

Re: GSOC Mahout.GA, next steps ?

Posted by Grant Ingersoll <gs...@apache.org>.

This sounds good.  I don't know a lot about GAs, so if others have  
insight, that would be great.  It would also be handy if you could put  
up a section on the Wiki about GAs and maybe post some links to basic  
papers there, so people that aren't familiar can go do some background  
reading.

I will try to get to MAHOUT-56 this week, but others can jump in and  
review as well.

-Grant

On May 27, 2008, at 4:52 AM, deneche abdelhakim wrote:

> In a GA there are many things that can be distributed, and one  
> should always start with the most compute demanding task . This is  
> very problem dependent, but in most cases the fitness evaluation  
> function (FEF) "is" the part to distribute.
>
> The FEF evaluates each single individual in the population, and it  
> may need some datas (D) to do so. For example in the traveling  
> Salesman Problem, the problem is defined by a set of cities and the  
> distances between them, the FEF needs those distances to evaluate  
> the individuals.
>
> I see 2 ways to distribute the FEF:
>
> A. if the datas D is not big and can fit in each single cluster  
> node, then the easiest solution is to use each Mapper to evaluate  
> one individual and to pass the Datas D to all the mappers (using  
> some Job parameter or the DistributedCache). The input of the job is  
> the population of individuals. For someone used to work with  
> Watchmaker, the solution A is straightforward, he needs to change  
> one line of code.
>
> B. if the datas D are really big and span over multiple nodes, then  
> the FEF should be writen in the form of Mappers-Reducers, the  
> population of individuals is passed to all the mappers (again using  
> the DistributedCache or a Job parameter) and the datas D are now the  
> input of the Job.
>
> [MAHOUT-56] contains a possible implementation for solution A. Now I  
> should start thinking about solution B and all I need is a problem  
> that uses very big datasets. I already proposed one in my GSoC  
> proposal, it consists of using a Genetic Algorithm to find good  
> binary classification rule for a given dataset. But I am open to any  
> other suggestion.
>
> __________________________________________________
> Do You Yahoo!?
> En finir avec le spam? Yahoo! Mail vous offre la meilleure  
> protection possible contre les messages non sollicités
> http://mail.yahoo.fr Yahoo! Mail

Re: GSOC Mahout.GA, next steps ?

Posted by deneche abdelhakim <a_...@yahoo.fr>.

--- En date de : Mer 28.5.08, Ted Dunning <te...@gmail.com> a écrit :
> How about writing the population to the file and using it as
> input to map-reduce directly?  Evaluations that fit into a map can
> obviously handle that.  

This is exacly what I did in [Mahout-56]

> Evaluations that need to be parallelized, can have the map emit multiple > copies with sequential keys.  This will put a copy of each input into 
> multiple reduces that can do a piece of the evaluation.  The pieces can 
> be put back together in a second MR step.

the only reason, I see, to write the fitness evaluation in a MR form is that it needs some sort a very large dataset (in addition to the population of individuals). If the input of the job is the population, how can I pass the dataset to the mappers (or the reducers) ?

--
abdelhakim

__________________________________________________
Do You Yahoo!?
En finir avec le spam? Yahoo! Mail vous offre la meilleure protection possible contre les messages non sollicités 
http://mail.yahoo.fr Yahoo! Mail

Re: GSOC Mahout.GA, next steps ?

Posted by Ted Dunning <te...@gmail.com>.

How about writing the population to the file and using it as input to
map-reduce directly?  Evaluations that fit into a map can obviously handle
that.  Evaluations that need to be parallelized, can have the map emit
multiple copies with sequential keys.  This will put a copy of each input
into multiple reduces that can do a piece of the evaluation.  The pieces can
be put back together in a second MR step.

I even think that you should be able to demonstrate that any MR program that
evaluates a single population member can be applied to multiple population
members with slight modifications.

On Wed, May 28, 2008 at 11:44 AM, deneche abdelhakim <a_...@yahoo.fr>
wrote:

>
>
> I don't know if there are many ways to do it in Hadoop, but how about
> writing the population into a file and pass it with the DistributedCache ?
>
>

-- 
ted

Re: GSOC Mahout.GA, next steps ?

Posted by deneche abdelhakim <a_...@yahoo.fr>.

> Ted Dunning <te...@gmail.com> wrote:
>
> Conceptually, at least, it would be good to have the option for fitness
> functions to be expressed as map-reduce programs.  Unfortunately, having
> mappers spawn MR programs runs the real risk of dead-lock, especially on
> less than grandiose clusters.
> 
> To me, that indicates that if the fitness function is nasty enough to
> require map-reduce to compute, then either:
> 
> a) the executive that manages the population and generates mutations 
> should be written in sequential form
> 
> or
> 
> b) the evolutionary algorithm has to be written in such a way as to be 
> able to manipulate a map-reduce program so that evolution and evaluation > can be merged into a single (composite) map-reduce program.
> 
> I vote for (a) because if fitness computations are so complex as to need > MR, then the cost of sorting the population will be negligible.

(a) has another advantage too: one can start by writing its program in a sequential form, test it with a small dataset, then rewrite only the fitness function in a M-R form.

> This raises the question of how the population should be communicated to > the parallel evaluator.

I don't know if there are many ways to do it in Hadoop, but how about writing the population into a file and pass it with the DistributedCache ?

-- 
abdelhakim

__________________________________________________
Do You Yahoo!?
En finir avec le spam? Yahoo! Mail vous offre la meilleure protection possible contre les messages non sollicités 
http://mail.yahoo.fr Yahoo! Mail

Re: GSOC Mahout.GA, next steps ?

Posted by Ted Dunning <te...@gmail.com>.

Conceptually, at least, it would be good to have the option for fitness
functions to be expressed as map-reduce programs.  Unfortunately, having
mappers spawn MR programs runs the real risk of dead-lock, especially on
less than grandiose clusters.

To me, that indicates that if the fitness function is nasty enough to
require map-reduce to compute, then either:

a) the executive that manages the population and generates mutations should
be written in sequential form

or

b) the evolutionary algorithm has to be written in such a way as to be able
to manipulate a map-reduce program so that evolution and evaluation can be
merged into a single (composite) map-reduce program.

I vote for (a) because if fitness computations are so complex as to need MR,
then the cost of sorting the population will be negligible.

This raises the question of how the population should be communicated to the
parallel evaluator.


On Tue, May 27, 2008 at 1:52 AM, deneche abdelhakim <a_...@yahoo.fr>
wrote:

> In a GA there are many things that can be distributed, and one should
> always start with the most compute demanding task . This is very problem
> dependent, but in most cases the fitness evaluation function (FEF) "is" the
> part to distribute.
>
> The FEF evaluates each single individual in the population, and it may need
> some datas (D) to do so. For example in the traveling Salesman Problem, the
> problem is defined by a set of cities and the distances between them, the
> FEF needs those distances to evaluate the individuals.
>
> I see 2 ways to distribute the FEF:
>
> A. if the datas D is not big and can fit in each single cluster node, then
> the easiest solution is to use each Mapper to evaluate one individual and to
> pass the Datas D to all the mappers (using some Job parameter or the
> DistributedCache). The input of the job is the population of individuals.
> For someone used to work with Watchmaker, the solution A is straightforward,
> he needs to change one line of code.
>
> B. if the datas D are really big and span over multiple nodes, then the FEF
> should be writen in the form of Mappers-Reducers, the population of
> individuals is passed to all the mappers (again using the DistributedCache
> or a Job parameter) and the datas D are now the input of the Job.
>
> [MAHOUT-56] contains a possible implementation for solution A. Now I should
> start thinking about solution B and all I need is a problem that uses very
> big datasets. I already proposed one in my GSoC proposal, it consists of
> using a Genetic Algorithm to find good binary classification rule for a
> given dataset. But I am open to any other suggestion.
>
> __________________________________________________
> Do You Yahoo!?
> En finir avec le spam? Yahoo! Mail vous offre la meilleure protection
> possible contre les messages non sollicités
> http://mail.yahoo.fr Yahoo! Mail
>



-- 
ted