Posted to dev@mahout.apache.org by Karl Wettin <ka...@gmail.com> on 2008/03/31 15:43:03 UTC

Cross validation (was: SoC Naive Bayes Implementation)

Paul Elschot wrote:
> Parallelizing cross validation may also be trivial, but it would be
> quite useful.

I know it can be used for feature selection. What else is there?


     karl

Re: Cross validation

Posted by Karl Wettin <ka...@gmail.com>.
With a "variable set" do you mean a sub set of features (attributes, 
columns) to be evaluated with a cross validation?

I really know too little and have to read up a bit in the Weka book (Data 
Mining, 2nd edition, pages 420-425) to see whether some algorithms make 
more sense than others.


It would be nice to feature-select and rank a 100K-feature-wide n-gram 
Lucene index with various classes.
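
For ranking single features at that scale, something as simple as per-term
information gain might do. A rough sketch, assuming the 2x2 document counts
per term and class can be pulled out of the index (all names here are made
up, this is not existing code):

    public class NgramRanker {

      // Binary entropy in bits; 0*log(0) terms are treated as 0.
      static double entropy(double p) {
        if (p <= 0.0 || p >= 1.0) return 0.0;
        return -(p * Math.log(p) + (1 - p) * Math.log(1 - p)) / Math.log(2);
      }

      // Information gain of a binary feature (term present/absent) about a
      // binary class, from a 2x2 contingency table of document counts:
      // n11 = term & class, n10 = term & other class,
      // n01 = no term & class, n00 = no term & other class.
      static double infoGain(long n11, long n10, long n01, long n00) {
        double n = n11 + n10 + n01 + n00;
        double pClass = (n11 + n01) / n;
        double pTerm = (n11 + n10) / n;
        double hTerm = (n11 + n10) == 0 ? 0
            : entropy(n11 / (double) (n11 + n10));
        double hNoTerm = (n01 + n00) == 0 ? 0
            : entropy(n01 / (double) (n01 + n00));
        return entropy(pClass) - (pTerm * hTerm + (1 - pTerm) * hNoTerm);
      }
    }

Ranking the 100K terms is then just one such computation per term, which is
trivially parallel.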


     karl


Ted Dunning wrote:
> Yes.  This is a form of model selection.  It would be plausible to run the
> cross-folds and learning in parallel.  Cross validation alone only gives
> small parallelism, but if you have several hundred variable sets, the
> parallelism becomes substantial: with 10 folds and 300 variable sets there
> are 3,000 independent train/test tasks.
> 
> This raises the question of what the right map-reduce architecture would be
> for this sort of problem.  Should there be a special input format that
> reads input records with a test/train/fold# key or column?  The thought
> would be that normal sequential learning could be done in the reducer, or
> the folded data could be passed to separate learning algorithms.


Re: Cross validation

Posted by Karl Wettin <ka...@gmail.com>.
That is apparently the paper everybody keeps referring to:

Ron Kohavi, George H. John (1997). Wrappers for feature subset 
selection. Artificial Intelligence, 97(1-2):273-324.

I'll have to read that.




Re: Cross validation

Posted by Ted Dunning <td...@veoh.com>.
Yes.  This is a form of model selection.  It would be plausible to run the
cross-folds and learning in parallel.  Cross validation alone only gives
small parallelism, but if you have several hundred variable sets, the
parallelism becomes substantial: with 10 folds and 300 variable sets there
are 3,000 independent train/test tasks.

This raises the question of what the right map-reduce architecture would be
for this sort of problem.  Should there be a special input format that
reads input records with a test/train/fold# key or column?  The thought
would be that normal sequential learning could be done in the reducer, or
the folded data could be passed to separate learning algorithms.
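
To make the input format idea concrete, here is a rough Hadoop sketch (not
existing Mahout code; the Learner class is a stand-in for any sequentially
trainable classifier).  The mapper tags every record with each fold number,
marked train or test, and each reduce call trains on one fold's training
split and scores the held-out split:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class CrossFold {

      static final int K = 10;  // number of folds

      /** Stand-in for any classifier that can be trained sequentially. */
      static class Learner {
        void update(String trainRecord) { /* fit incrementally */ }
        double evaluate(List<String> testRecords) { return 0.0; }
      }

      /** Emits each record once per fold, tagged "train" or "test". */
      public static class FoldMapper
          extends Mapper<LongWritable, Text, IntWritable, Text> {
        @Override
        protected void map(LongWritable offset, Text record, Context ctx)
            throws IOException, InterruptedException {
          int testFold = Math.floorMod(record.toString().hashCode(), K);
          for (int f = 0; f < K; f++) {
            String tag = (f == testFold) ? "test\t" : "train\t";
            ctx.write(new IntWritable(f), new Text(tag + record));
          }
        }
      }

      /** One reduce call per fold: sequential learning in the reducer. */
      public static class FoldReducer
          extends Reducer<IntWritable, Text, IntWritable, Text> {
        @Override
        protected void reduce(IntWritable fold, Iterable<Text> values,
                              Context ctx)
            throws IOException, InterruptedException {
          Learner learner = new Learner();
          List<String> test = new ArrayList<String>();
          for (Text v : values) {
            String s = v.toString();
            if (s.startsWith("train\t")) learner.update(s.substring(6));
            else test.add(s.substring(5));
          }
          ctx.write(fold, new Text("accuracy=" + learner.evaluate(test)));
        }
      }
    }

The obvious cost is that the shuffle carries every record K times; that is
the price of making each reducer's fold self-contained.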


On 3/31/08 9:08 AM, "Karl Wettin" <ka...@gmail.com> wrote:

> I mean that many feature selection algorithms are more or less a series
> of cross-fold validations, using some classifier on either a single
> attribute or a subset of the available attributes.


Re: Cross validation

Posted by Karl Wettin <ka...@gmail.com>.
Paul Elschot wrote:
> On Monday 31 March 2008 15:43:03, Karl Wettin wrote:
>> Paul Elschot wrote:
>>> Parallelizing cross validation may also be trivial, but it would be
>>> quite useful.
>> I know it can be used for feature selection. What else is there?
> 
> Actually, I meant no more than K-fold cross validation:
> 
> http://en.wikipedia.org/wiki/Cross-validation
> 
> It nicely parallelizes to a factor of K.

Ah, OK.

I mean that many feature selection algorithms are more or less a series 
of cross-fold validations, using some classifier on either a single 
attribute or a subset of the available attributes.
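
To illustrate: a sketch of greedy forward selection, where each candidate
subset is scored by one full cross validation (the scorer interface is a
hypothetical hook, not existing code):

    import java.util.LinkedHashSet;
    import java.util.Set;

    public class ForwardSelection {

      /** Hypothetical hook: cross-validated accuracy of some classifier
          restricted to the given feature subset. */
      interface SubsetScorer {
        double score(Set<Integer> features);
      }

      /** Adds one feature at a time, keeping whichever addition improves
          the cross-validated score the most; stops when nothing helps. */
      static Set<Integer> select(int numFeatures, int maxSize,
                                 SubsetScorer cv) {
        Set<Integer> selected = new LinkedHashSet<Integer>();
        double best = Double.NEGATIVE_INFINITY;
        while (selected.size() < maxSize) {
          int bestFeature = -1;
          double bestScore = best;
          for (int f = 0; f < numFeatures; f++) {
            if (selected.contains(f)) continue;
            Set<Integer> candidate = new LinkedHashSet<Integer>(selected);
            candidate.add(f);
            double s = cv.score(candidate); // one cross validation each
            if (s > bestScore) { bestScore = s; bestFeature = f; }
          }
          if (bestFeature < 0) break;       // no candidate improved
          selected.add(bestFeature);
          best = bestScore;
        }
        return selected;
      }
    }

All the cv.score() calls inside one round are independent, which is where
parallel cross validation would pay off.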



    karl

Re: Cross validation (was: SoC Naive Bayes Implementation)

Posted by Paul Elschot <pa...@xs4all.nl>.
On Monday 31 March 2008 15:43:03, Karl Wettin wrote:
> Paul Elschot wrote:
> > Parallelizing cross validation may also be trivial, but it would be
> > quite useful.
>
> I know it can be used for feature selection. What else is there?

Actually, I meant no more than K-fold cross validation:

http://en.wikipedia.org/wiki/Cross-validation

It nicely parallelizes to a factor of K.
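
A minimal sketch of that factor-of-K parallelism, with one thread per fold
and the classifier hidden behind a placeholder interface (nothing here is
existing Mahout code):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.function.Supplier;

    public class KFold {

      /** Placeholder for any trainable classifier. */
      interface Classifier<I> {
        void train(List<I> instances);
        double accuracy(List<I> instances);
      }

      /** Runs the K folds concurrently; returns the mean accuracy. */
      static <I> double crossValidate(List<I> data, int k,
                                      Supplier<Classifier<I>> factory)
          throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(k);
        List<Future<Double>> scores = new ArrayList<Future<Double>>();
        for (int fold = 0; fold < k; fold++) {
          final int f = fold;
          scores.add(pool.submit(() -> {
            List<I> train = new ArrayList<I>();
            List<I> test = new ArrayList<I>();
            // Instance i is held out exactly once, by fold i mod k.
            for (int i = 0; i < data.size(); i++) {
              (i % k == f ? test : train).add(data.get(i));
            }
            Classifier<I> c = factory.get();
            c.train(train);
            return c.accuracy(test);
          }));
        }
        double sum = 0;
        for (Future<Double> s : scores) sum += s.get();
        pool.shutdown();
        return sum / k;
      }
    }

Each fold builds its own train/test split, so the K tasks share nothing but
the read-only data and would run on K machines just as well as K threads.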

Regards,
Paul Elschot