Posted to dev@mahout.apache.org by "Vitalisova, Natallia" <Vi...@belhard.com> on 2008/03/30 15:50:40 UTC

SoC Naive Bayes Implementation

Hi,

 

My name is Natallia Vitalisova and I've applied for Google SoC 2008 to implement the Naïve Bayes algorithm on Hadoop.

 

Whether or not I am accepted for SoC, I want to spend my time investigating this topic, which I consider very interesting. But I will certainly need a mentor, or at least someone who can help answer questions along the way. Is there anyone willing to help?


Re: About the Mahout.GA Comment

Posted by deneche abdelhakim <a_...@yahoo.fr>.
I don't know the exact term, but maybe I should have said "computing process", so each processor (or computing node) can run many "computing processes"...

Ted Dunning <td...@veoh.com> wrote:
How is "computing node" not a processor?


On 4/12/08 9:26 PM, "deneche abdelhakim"  wrote:

> The number of running algorithms doesn't depend on the number of processors;
> in fact, this kind of algorithm is used even when there is only a single
> processor, because of its good search properties. You can imagine it as a
> single big GA with a distributed population, where each individual can have
> its own set of operators.
> 
> Abdel Hakim
> 
>> Ted Dunning wrote:
>> 
>> I think it is a very bad idea to tie the algorithm to the number of
>> processors being used in this way.  A program should produce identical
>> results on any machine, subject only to PRNG seeding issues.
> On 4/11/08 8:52 PM, "deneche abdelhakim"  wrote:
> 
>> And there are other reasons to distribute a GA: for example, you may want to
>> run a different version of the algorithm (a different population and perhaps
>> a different set of operators) on each computing node, and from time to time
>> some individuals will migrate from one node to another... This kind of
>> distribution has proven to be more effective because it searches a larger
>> space.
> 
> 
> 



       

Re: About the Mahout.GA Comment

Posted by Ted Dunning <td...@veoh.com>.
How is "computing node" not a processor?


On 4/12/08 9:26 PM, "deneche abdelhakim" <a_...@yahoo.fr> wrote:

> The number of running algorithms doesn't depend on the number of processors;
> in fact, this kind of algorithm is used even when there is only a single
> processor, because of its good search properties. You can imagine it as a
> single big GA with a distributed population, where each individual can have
> its own set of operators.
> 
> Abdel Hakim
> 
>> Ted Dunning wrote:
>> 
>> I think it is a very bad idea to tie the algorithm to the number of
>> processors being used in this way.  A program should produce identical
>> results on any machine, subject only to PRNG seeding issues.
> On 4/11/08 8:52 PM, "deneche abdelhakim"  wrote:
> 
>> And there are other reasons to distribute a GA: for example, you may want to
>> run a different version of the algorithm (a different population and perhaps
>> a different set of operators) on each computing node, and from time to time
>> some individuals will migrate from one node to another... This kind of
>> distribution has proven to be more effective because it searches a larger
>> space.
> 
> 
> 


Re: About the Mahout.GA Comment

Posted by deneche abdelhakim <a_...@yahoo.fr>.
The number of running algorithms doesn't depend on the number of processors; in fact, this kind of algorithm is used even when there is only a single processor, because of its good search properties. You can imagine it as a single big GA with a distributed population, where each individual can have its own set of operators.

Abdel Hakim

> Ted Dunning wrote:
>
> I think it is a very bad idea to tie the algorithm to the number of
> processors being used in this way.  A program should produce identical
> results on any machine, subject only to PRNG seeding issues.
On 4/11/08 8:52 PM, "deneche abdelhakim"  wrote:

> And there are other reasons to distribute a GA: for example, you may want to
> run a different version of the algorithm (a different population and perhaps
> a different set of operators) on each computing node, and from time to time
> some individuals will migrate from one node to another... This kind of
> distribution has proven to be more effective because it searches a larger space.



       

Re: About the Mahout.GA Comment

Posted by Ted Dunning <td...@veoh.com>.

I think it is a very bad idea to tie the algorithm to the number of
processors being used in this way.  A program should produce identical
results on any machine, subject only to PRNG seeding issues.


On 4/11/08 8:52 PM, "deneche abdelhakim" <a_...@yahoo.fr> wrote:

> And there are other reasons to distribute a GA: for example, you may want to
> run a different version of the algorithm (a different population and perhaps a
> different set of operators) in each computing node, and from time to time some
> individuals will migrate from one node to another...this kind of distribution
> has proven to be more effective cause it searches a larger space.


Re: About the Mahout.GA Comment

Posted by deneche abdelhakim <a_...@yahoo.fr>.
Hi, Grant

> I don't have a sense for how long the fitness function takes versus
> the other operations that WatchMaker does.  In other words, do the
> other pieces of running a GA need to be distributed?  Just curious;
> I don't think it is a big deal as far as your proposal goes.

This is problem dependent. Generally the fitness function is the expensive part, but for some problems you may also have very complex mutation operators (or any other operator).

And there are other reasons to distribute a GA: for example, you may want to run a different version of the algorithm (a different population and perhaps a different set of operators) on each computing node, and from time to time some individuals will migrate from one node to another... This kind of distribution has proven to be more effective because it searches a larger space.

Abdel Hakim
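
To make the migration idea concrete, here is a minimal Java sketch of an island-model exchange step. The Island and Individual types are hypothetical stand-ins for illustration, not WatchMaker or Mahout classes:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    class Individual implements Comparable<Individual> {
        double fitness;
        public int compareTo(Individual o) {
            return Double.compare(o.fitness, fitness); // higher fitness sorts first
        }
    }

    class Island {
        final List<Individual> population = new ArrayList<Individual>();

        // Periodically send copies of the best k individuals to a neighbour.
        void migrate(Island neighbour, int k) {
            Collections.sort(population); // best first
            neighbour.receive(new ArrayList<Individual>(population.subList(0, k)));
        }

        // Replace the worst individuals with the incoming migrants.
        void receive(List<Individual> migrants) {
            Collections.sort(population); // best first, worst at the end
            int n = population.size();
            for (int i = 0; i < migrants.size() && i < n; i++) {
                population.set(n - 1 - i, migrants.get(i));
            }
        }
    }

Each island can use its own operators and population; only the occasional migrate() call couples them, which is why this layout can search a larger space without tight synchronization.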

       

Re: About the Mahout.GA Comment

Posted by Ted Dunning <td...@veoh.com>.
I can't comment on Watchmaker, but in the EP implementation that I posted,
the control aspects are easily parallelized using multiple reduces.
Moreover, the cost of the framework is just the cost of keeping a priority
queue of the top items.  This can be implemented using a sort on all
candidates (as I did, for simplicity) or as a size limited heap or priority
queue.  In any case, the overhead of the framework is really small.  This is
even more true in real applications of EPs, where the evaluation often
involves some truly non-trivial computation.
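
A minimal Java sketch of the size-limited heap variant mentioned above: keep only the N best candidates seen so far, at O(log N) per insertion. Plain scores stand in for evaluated candidates here; this is an illustration, not the posted EP code:

    import java.util.PriorityQueue;

    class TopN {
        private final int limit;
        // Min-heap: the head is the worst of the current top N,
        // so it is the one to evict when a better score arrives.
        private final PriorityQueue<Double> heap = new PriorityQueue<Double>();

        TopN(int limit) { this.limit = limit; }

        void offer(double score) {
            if (heap.size() < limit) {
                heap.add(score);
            } else if (score > heap.peek()) {
                heap.poll(); // drop the current worst
                heap.add(score);
            }
        }
    }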


On 4/11/08 4:38 AM, "Grant Ingersoll" <gs...@apache.org> wrote:

> I don't have a sense for how long the fitness function takes versus
> the other operations that WatchMaker does.  In other words, do the
> other pieces of running a GA need to be distributed?  Just curious;
> I don't think it is a big deal as far as your proposal goes.
> 
> -Grant
> 
> On Apr 11, 2008, at 10:48 AM, deneche abdelhakim wrote:
> 
>> Hi Grant,
>> 
>> You wrote the following comment on my GSoC proposal:
>> 
>>> Could someone w/ a little more GA knowledge comment on the use of
>>> WatchMaker?  What I wonder is if it is possible to distribute some of the
>>> WatchMaker functionality?
>> 
>> Do you want to know whether there are other ways to distribute a GA?
>> 
>>> May not be needed for this proposal, but I am curious as to how much work
>>> is done in WatchMaker vs. the actual fitness function.
>> 
>> I don't understand...
>> 
>> Abdel Hakim Deneche
>> 
>> 
> 


Re: About the Mahout.GA Comment

Posted by Grant Ingersoll <gs...@apache.org>.
I don't have a sense for how long the fitness function takes versus  
the other operations that WatchMaker does.   In other words, do the  
other pieces of running a GA algorithm needed to be distributed.  Just  
curious and I don't think it is a big deal as far as your proposal goes.

-Grant

On Apr 11, 2008, at 10:48 AM, deneche abdelhakim wrote:

> Hi Grant,
>
> You wrote the following comment on my GSoC proposal:
>
>> Could someone w/ a little more GA knowledge comment on the use of
>> WatchMaker?  What I wonder is if it is possible to distribute some of the
>> WatchMaker functionality?
>
> Do you want to know whether there are other ways to distribute a GA?
>
>> May not be needed for this proposal, but I am curious as to how much work
>> is done in WatchMaker vs. the actual fitness function.
>
> I don't understand...
>
> Abdel Hakim Deneche
>
>


About the Mahout.GA Comment

Posted by deneche abdelhakim <a_...@yahoo.fr>.
Hi Grant, 

You wrote the following comment on my GSoC proposal:

> Could someone w/ a little more GA knowledge comment on the use of
> WatchMaker?  What I wonder is if it is possible to distribute some of the
> WatchMaker functionality?

Do you want to know whether there are other ways to distribute a GA?

> May not be needed for this proposal, but I am curious as to how much work is
> done in WatchMaker vs. the actual fitness function.

I don't understand...

Abdel Hakim Deneche

       

Re: SoC Naive Bayes Implementation

Posted by Grant Ingersoll <gs...@apache.org>.
Hi Natallia,

Have a look at https://issues.apache.org/jira/browse/MAHOUT-9.  I am
hoping to have something to put up after ApacheCon Europe, at which
point testing help would be appreciated, so I am not sure whether it
will make sense as a GSoC project.  Perhaps you would be interested
in looking into other algorithms?  I think there are interesting
things besides the "10" in our NIPS paper; perhaps we can start
brainstorming on them.  Things of interest to me, although I don't
know how well they can be implemented in M/R just yet, are:

1. TextRank (see Mihalcea) although I doubt this is a full summer  
project
2. Brill POS tagger
3. Max entropy stuff

Of course, we don't have to just use M/R distributed approaches.  We  
can look into algs that require more distributed capabilities.

Cheers,
Grant

On Mar 30, 2008, at 9:50 AM, Vitalisova, Natallia wrote:

> Hi,
>
>
>
> My name is Natallia Vitalisova and I've applied for Google SoC 2008  
> to implement the Naïve Bayes algorithm on Hadoop.
>
>
>
> Whether or not I am accepted for SoC, I want to spend my time
> investigating this topic, which I consider very interesting. But I
> will certainly need a mentor, or at least someone who can help answer
> questions along the way. Is there anyone willing to help?
>



Re: Cross validation

Posted by Karl Wettin <ka...@gmail.com>.
With a "variable set" do you mean a subset of features (attributes,
columns) to be evaluated with cross validation?

I really know too little and have to read up a bit in the Weka book (Data
Mining, 2nd edition, pages 420-425) and see whether some algorithms make
more sense than others.


It would be nice to feature-select and rank a 100K-feature-wide ngram
Lucene index with various classes.


     karl


Ted Dunning wrote:
> Yes.  This is a form of model selection.  It would be plausible to run the
> cross-folds and learning in parallel.  Cross validation would only give
> small parallelism, but if you have several hundred variable sets, that
> becomes plausible. 
> 
> This raises the question of what the right map-reduce architecture would be
> for this sort of problem.  Should there be a special input format that
> reads input records with a test/train/fold# key or column?  The thought
> would be that normal sequential learning could be done in the reducer, or
> the folded data could be passed to separate learning algorithms.
> 
> 
> On 3/31/08 9:08 AM, "Karl Wettin" <ka...@gmail.com> wrote:
> 
>> Paul Elschot wrote:
>>> On Monday 31 March 2008 15:43:03, Karl Wettin wrote:
>>>> Paul Elschot wrote:
>>>>> Parallelizing cross validation may also be trivial, but it would be
>>>>> quite useful.
>>>> I know it can be used for feature selection. What else is there?
>>> Actually, I meant no more than K-fold cross validation:
>>>
>>> http://en.wikipedia.org/wiki/Cross-validation
>>>
>>> It nicely parallelizes to a factor of K.
>> Ah, OK.
>>
>> I mean that many feature selection algorithms are more or less a series
>> of cross-fold validations using some classifier on either a single
>> attribute or a subset of the available attributes.
>>
>>
>>
>>     karl
> 


Re: Cross validation

Posted by Karl Wettin <ka...@gmail.com>.
That is apparently the paper everybody is talking about:

Ron Kohavi and George H. John (1997). Wrappers for feature subset
selection. Artificial Intelligence 97(1-2):273-324.

I'll have to read that.


Ted Dunning wrote:
> Yes.  This is a form of model selection.  It would be plausible to run the
> cross-folds and learning in parallel.  Cross validation would only give
> small parallelism, but if you have several hundred variable sets, that
> becomes plausible. 
> 
> This raises the question of what the right map-reduce architecture would be
> for this sort of problem.  Should there be a special input format that
> reads input records with a test/train/fold# key or column?  The thought
> would be that normal sequential learning could be done in the reducer, or
> the folded data could be passed to separate learning algorithms.
> 
> 
> On 3/31/08 9:08 AM, "Karl Wettin" <ka...@gmail.com> wrote:
> 
>> Paul Elschot wrote:
>>> On Monday 31 March 2008 15:43:03, Karl Wettin wrote:
>>>> Paul Elschot wrote:
>>>>> Parallelizing cross validation may also be trivial, but it would be
>>>>> quite useful.
>>>> I know it can be used for feature selection. What else is there?
>>> Actually, I meant no more than K-fold cross validation:
>>>
>>> http://en.wikipedia.org/wiki/Cross-validation
>>>
>>> It nicely parallelizes to a factor of K.
>> Ah, OK.
>>
>> I mean that many feature selection algorithms are more or less a series
>> of cross-fold validations using some classifier on either a single
>> attribute or a subset of the available attributes.
>>
>>
>>
>>     karl
> 


Re: Cross validation

Posted by Ted Dunning <td...@veoh.com>.
Yes.  This is a form of model selection.  It would be plausible to run the
cross-folds and learning in parallel.  Cross validation would only give
small parallelism, but if you have several hundred variable sets, that
becomes plausible. 

This raises the question of what the right map-reduce architecture would be
for this sort of problem.  Should there be a special input format that
reads input records with a test/train/fold# key or column?  The thought
would be that normal sequential learning could be done in the reducer, or
the folded data could be passed to separate learning algorithms.
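
As a rough sketch of the fold-keyed input idea (written against the old org.apache.hadoop.mapred API of that era; the fold assignment and class names are illustrative, not an actual Mahout input format):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class FoldMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, IntWritable, Text> {

      private static final int K = 10; // number of cross-validation folds

      public void map(LongWritable offset, Text record,
                      OutputCollector<IntWritable, Text> output,
                      Reporter reporter) throws IOException {
        // Deterministically assign each record to a fold; each reducer
        // then sees one fold's records and can run normal sequential
        // learning, with test/train encoded downstream as suggested.
        int fold = (record.hashCode() & Integer.MAX_VALUE) % K;
        output.collect(new IntWritable(fold), record);
      }
    }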


On 3/31/08 9:08 AM, "Karl Wettin" <ka...@gmail.com> wrote:

> Paul Elschot wrote:
>> On Monday 31 March 2008 15:43:03, Karl Wettin wrote:
>>> Paul Elschot wrote:
>>>> Parallelizing cross validation may also be trivial, but it would be
>>>> quite useful.
>>> I know it can be used for feature selection. What else is there?
>> 
>> Actually, I meant no more than K-fold cross validation:
>> 
>> http://en.wikipedia.org/wiki/Cross-validation
>> 
>> It nicely parallelizes to a factor of K.
> 
> Ah, OK.
> 
> I mean that many feature selection algorithms are more or less a series
> of cross-fold validations using some classifier on either a single
> attribute or a subset of the available attributes.
> 
> 
> 
>     karl


Re: Cross validation

Posted by Karl Wettin <ka...@gmail.com>.
Paul Elschot wrote:
> On Monday 31 March 2008 15:43:03, Karl Wettin wrote:
>> Paul Elschot wrote:
>>> Parallelizing cross validation may also be trivial, but it would be
>>> quite useful.
>> I know it can be used for feature selection. What else is there?
> 
> Actually, I meant no more than K-fold cross validation:
> 
> http://en.wikipedia.org/wiki/Cross-validation
> 
> It nicely parallelizes to a factor of K.

Ah, OK.

I mean that many feature selection algorithms are more or less a series
of cross-fold validations using some classifier on either a single
attribute or a subset of the available attributes.



    karl
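
To spell that point out, a small Java sketch of wrapper-style feature selection as repeated cross validation; Dataset and Classifier are hypothetical interfaces for illustration, not Weka or Mahout APIs:

    import java.util.List;

    interface Dataset {
        Dataset fold(int i);                      // the i-th held-out fold
        Dataset allFoldsExcept(int i);            // the training complement
        Dataset project(List<Integer> features);  // keep only these columns
    }

    interface Classifier {
        void train(Dataset train);
        double accuracy(Dataset test);
    }

    class FeatureSubsetEvaluator {
        // Score one candidate subset by averaging accuracy over k folds.
        // Every (subset, fold) pair is independent, hence parallelizable.
        static double score(List<Integer> subset, Dataset data,
                            int k, Classifier classifier) {
            double total = 0;
            for (int fold = 0; fold < k; fold++) {
                classifier.train(data.allFoldsExcept(fold).project(subset));
                total += classifier.accuracy(data.fold(fold).project(subset));
            }
            return total / k;
        }
    }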

Re: Cross validation (was: SoC Naive Bayes Implementation)

Posted by Paul Elschot <pa...@xs4all.nl>.
On Monday 31 March 2008 15:43:03, Karl Wettin wrote:
> Paul Elschot wrote:
> > Parallelizing cross validation may also be trivial, but it would be
> > quite useful.
>
> I know it can be used for feature selection. What else is there?

Actually, I meant no more than K-fold cross validation:

http://en.wikipedia.org/wiki/Cross-validation

It nicely parallelizes to a factor of K.

Regards,
Paul Elschot
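
A small Java sketch of that factor-of-K parallelism: the K train/test runs are independent, so they can simply be submitted as separate tasks (evaluateFold is a hypothetical per-fold train-and-test call):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    class KFoldRunner {
        static double crossValidate(final int k) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(k);
            List<Future<Double>> results = new ArrayList<Future<Double>>();
            for (int i = 0; i < k; i++) {
                final int fold = i;
                results.add(pool.submit(new Callable<Double>() {
                    public Double call() { return evaluateFold(fold); }
                }));
            }
            double sum = 0;
            for (Future<Double> f : results) sum += f.get();
            pool.shutdown();
            return sum / k; // mean score over the K folds
        }

        static double evaluateFold(int fold) {
            // Placeholder: train on all folds except 'fold', test on 'fold'.
            return 0.0;
        }
    }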

Cross validation (was: SoC Naive Bayes Implementation)

Posted by Karl Wettin <ka...@gmail.com>.
Paul Elschot wrote:
> Parallelizing cross validation may also be trivial, but it would be
> quite useful.

I know it can be used for feature selection. What else is there?


     karl

Re: SoC Naive Bayes Implementation

Posted by Ted Dunning <td...@veoh.com>.
Both good points.


On 3/30/08 3:38 PM, "Paul Elschot" <pa...@xs4all.nl> wrote:

> On Sunday 30 March 2008 20:51:40, Ted Dunning wrote:
>> I am sure that the entire Mahout community will be happy to help.
>> 
>> You may find, however, that naïve Bayes is trivially parallel (and
>> not very difficult even without parallelism).  That means you may
>> want to have something additional to work on in the back of your
>> mind.
> 
> How about Complementary Naive Bayes and cross validation?
> 
> Parallelizing cross validation may also be trivial, but it would be
> quite useful.
> 
> Regards,
> Paul Elschot.
> 
> 
> 
>> 
>> On 3/30/08 6:50 AM, "Vitalisova, Natallia" <Vi...@belhard.com> wrote:
>>> My name is Natallia Vitalisova and I've applied for Google SoC 2008
>>> to implement the Naïve Bayes algorithm on Hadoop.
>>> 
>>> 
>>> 
>>> Whether or not I am accepted for SoC, I want to spend my time
>>> investigating this topic, which I consider very interesting. But I
>>> will certainly need a mentor, or at least someone who can help
>>> answer questions along the way. Is there anyone willing to help?
> 
> 


Re: SoC Naive Bayes Implementation

Posted by Paul Elschot <pa...@xs4all.nl>.
On Sunday 30 March 2008 20:51:40, Ted Dunning wrote:
> I am sure that the entire Mahout community will be happy to help.
>
> You may find, however, that naïve Bayes is trivially parallel (and
> not very difficult even without parallelism).  That means you may
> want to have something additional to work on in the back of your
> mind.

How about Complementary Naive Bayes and cross validation?

Parallelizing cross validation may also be trivial, but it would be
quite useful.

Regards,
Paul Elschot.
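
For reference, the Complementary variant (Rennie et al., ICML 2003, "Tackling the Poor Assumptions of Naive Bayes Text Classifiers") estimates each class's word weights from the documents *not* in that class; in the usual notation (my rendering, not from this thread):

    \hat{\theta}_{cw} = \frac{N_{\tilde{c}w} + \alpha}{N_{\tilde{c}} + \alpha |V|},
    \qquad
    \mathrm{label}(d) = \arg\min_{c} \sum_{w} f_{dw} \log \hat{\theta}_{cw}

where N_{\tilde{c}w} is the count of word w over all documents outside class c, f_{dw} is the frequency of w in document d, \alpha is a smoothing prior, and |V| is the vocabulary size. The estimates are again simple sums of counts, so it parallelizes as easily as plain naive Bayes.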



>
> On 3/30/08 6:50 AM, "Vitalisova, Natallia" <Vi...@belhard.com> wrote:
> > My name is Natallia Vitalisova and I've applied for Google SoC 2008
> > to implement the Naïve Bayes algorithm on Hadoop.
> >
> >
> >
> > Whether or not I am accepted for SoC, I want to spend my time
> > investigating this topic, which I consider very interesting. But I
> > will certainly need a mentor, or at least someone who can help
> > answer questions along the way. Is there anyone willing to help?



Re: SoC Naive Bayes Implementation

Posted by Ted Dunning <td...@veoh.com>.

I am sure that the entire Mahout community will be happy to help.

You may find, however, that naïve Bayes is trivially parallel (and not very
difficult even without parallelism).  That means you may want to have
something additional to work on in the back of your mind.
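
To see why it is trivially parallel: training reduces to counting (class, word) pairs, i.e. a word-count job. A rough sketch against the old org.apache.hadoop.mapred API (the tab-separated "label<TAB>text" input layout is an assumption for illustration):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class NaiveBayesCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {

      private static final LongWritable ONE = new LongWritable(1);

      public void map(LongWritable offset, Text line,
                      OutputCollector<Text, LongWritable> output,
                      Reporter reporter) throws IOException {
        String[] parts = line.toString().split("\t", 2);
        if (parts.length < 2) return; // skip malformed records
        StringTokenizer words = new StringTokenizer(parts[1]);
        while (words.hasMoreTokens()) {
          // One count per (class, word) pair; a summing reducer turns
          // these into exactly the statistics naive Bayes needs.
          output.collect(new Text(parts[0] + "/" + words.nextToken()), ONE);
        }
      }
    }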


On 3/30/08 6:50 AM, "Vitalisova, Natallia" <Vi...@belhard.com> wrote:

> My name is Natallia Vitalisova and I've applied for Google SoC 2008 to
> implement the Naïve Bayes algorithm on Hadoop.
> 
>  
> 
> Whether or not I am accepted for SoC, I want to spend my time
> investigating this topic, which I consider very interesting. But I will
> certainly need a mentor, or at least someone who can help answer
> questions along the way. Is there anyone willing to help?