Posted to user@mahout.apache.org by Stuart Smith <st...@yahoo.com> on 2012/01/23 23:54:10 UTC

SGD: mismatch in percentCorrect vs classify() on training data?

Hello,

  I just started experimenting with the SGD/Logistic Regression classifier.
Right now I believe I have too little training data for the number of dimensions (~1,800 vectors, roughly evenly split between two classes; ~500 dimensions).

However, I'm just trying to understand how to measure the efficacy of the classifier.

I trained a classifier like so:

- I have two categories, "good" and "bad".

- Ran AdaptiveLogisticRegression() over the training data 10 times (in the same order).

- Got percentCorrect and AUC of the best classifier.

- Took .getBest().getPayload().getLearner() and trained that over all the training data again
   (on the theory that ALR had only shown it a small slice of the data; this seemed to help).

- Got percentCorrect() of the classifier.

- Ran classify() on the good/bad vectors of the training set, counting FP/TP in each case.

What I'm having trouble with is understanding a discrepancy between the results of the last two steps.

.percentCorrect() returns ~90%.
However, with M = the number of training examples,
(TP_Good + TP_Bad) / M is ~50%.
Interestingly enough, (TP_Good + FP_Bad) / M is ~90%.


So I'm kind of confused about what .percentCorrect() means... how is it counted?
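To make the question concrete: the stats a learner keeps while train() sees examples are running (decayed) averages, not a recount over the whole set. Here is a toy sketch in plain Java of how such a running average can drift far from a full-pass recount — this is hypothetical illustration, not Mahout's actual implementation:

```java
// Sketch: a decayed running "percent correct" versus an exact recount.
// Hypothetical model of an online stat; NOT Mahout's CrossFoldLearner code.
public class RunningAccuracy {
    private double percentCorrect = 0.0;
    private final double decay;   // weight given to history on each update

    public RunningAccuracy(double decay) {
        this.decay = decay;
    }

    // Called once per training example, like stats updated inside train().
    public void update(boolean correct) {
        percentCorrect = decay * percentCorrect + (1 - decay) * (correct ? 1.0 : 0.0);
    }

    public double percentCorrect() {
        return percentCorrect;
    }

    public static void main(String[] args) {
        RunningAccuracy stat = new RunningAccuracy(0.99);
        int exact = 0;
        int n = 1000;
        // First half wrong, second half right: the exact accuracy is 50%,
        // but the decayed average is dominated by the recent (correct) half.
        for (int i = 0; i < n; i++) {
            boolean correct = i >= n / 2;
            stat.update(correct);
            if (correct) {
                exact++;
            }
        }
        System.out.printf("running=%.3f exact=%.3f%n",
                stat.percentCorrect(), exact / (double) n);
    }
}
```

The running figure lands near 99% while the recount is 50%, which is roughly the shape of the discrepancy described above.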

Below is a code snippet where I do the final training & counting, just in case I made some bonehead mistake:

            /** training best on all data... **/
            System.out.println( "Training best on all data..");
            ARFFVectorIterable retrainGood = new ARFFVectorIterable(goodArff, new MapBackedARFFModel());
            Iterator<Vector> retrainGoodIter = retrainGood.iterator();
            while (retrainGoodIter.hasNext()) {
                bestClassifier.train( goodLabel, retrainGoodIter.next() );
            }

            
            ARFFVectorIterable retrainBad = new ARFFVectorIterable(badArff, new MapBackedARFFModel());
            Iterator<Vector> retrainBadIter = retrainBad.iterator();
            while (retrainBadIter.hasNext()) {
                bestClassifier.train( badLabel, retrainBadIter.next() );
            }
            System.out.println("Best learner percent correct on all data: " + bestClassifier.percentCorrect());

            ARFFVectorIterable fpVectors = new ARFFVectorIterable(goodArff, new MapBackedARFFModel());
            Iterator<Vector> fpIterator = fpVectors.iterator();
            int goodFpCount = 0;
            int goodTpCount = 0;
            int testCount = 0;
            while (fpIterator.hasNext())
            {

                Vector goodVector = fpIterator.next();
                double probabilityGood = (1.0 - bestClassifier.classify(goodVector).get(badLabel));
                testCount++;
                if( probabilityGood > 0.0 ) {
                    if( probabilityGood <= 1.0 ) {
                        System.out.print( probabilityGood + "," );
                    }
                    goodTpCount++;
                }
                else {
                    goodFpCount++;
                }
            }
            System.out.println();
            System.out.println( "FP count: " + goodFpCount );
            System.out.println( "TP of good files: " + goodTpCount );
            
            ARFFVectorIterable tpVectors = new ARFFVectorIterable(badArff, new MapBackedARFFModel());
            Iterator<Vector> tpIterator = tpVectors.iterator();
            int badTpCount = 0;
            int badFpCount = 0;
            while (tpIterator.hasNext())
            {
                Vector badVector = tpIterator.next();
                double probabilityBad = bestClassifier.classify(badVector).get(badLabel);
                testCount++;
                if( probabilityBad > 0.0 ) {
                    if( probabilityBad <= 1.0 ) {
                        System.out.print( probabilityBad + "," );
                    }
                    badTpCount++;
                }
                else {
                    badFpCount++;
                }
            }
            System.out.println();
            System.out.println( "TP count: " + badTpCount );
            System.out.println( "FP on bad clusters: " + badFpCount);
            System.out.println( "Test count: " + testCount );
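One thing worth cross-checking in the loops above: a probability is almost always greater than 0.0, so the `> 0.0` test counts nearly every vector as a true positive. A recount that decides the class at 0.5 would look more like the following sketch, where the `Scorer` interface is a hypothetical stand-in for `bestClassifier.classify(v).get(badLabel)`:

```java
import java.util.Arrays;
import java.util.List;

public class ThresholdCount {
    // Hypothetical stand-in for bestClassifier.classify(v).get(badLabel).
    interface Scorer {
        double probabilityBad(double[] vector);
    }

    // Count vectors whose predicted label, decided at 0.5, matches isBad.
    static int countCorrect(List<double[]> vectors, boolean isBad, Scorer scorer) {
        int correct = 0;
        for (double[] v : vectors) {
            double pBad = scorer.probabilityBad(v);
            // Decide at 0.5; "pBad > 0.0" is true for almost any score and
            // would count nearly every vector as a hit.
            boolean predictedBad = pBad >= 0.5;
            if (predictedBad == isBad) {
                correct++;
            }
        }
        return correct;
    }

    public static void main(String[] args) {
        // Toy scorer: calls a vector "bad" when its first component is large.
        Scorer toy = v -> v[0] > 1.0 ? 0.9 : 0.2;
        List<double[]> bad = Arrays.asList(new double[]{2.0}, new double[]{3.0});
        List<double[]> good = Arrays.asList(new double[]{0.1}, new double[]{0.2});
        System.out.println("bad correct:  " + countCorrect(bad, true, toy));
        System.out.println("good correct: " + countCorrect(good, false, toy));
    }
}
```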


Any help is appreciated! 


Take care,
  -stu

Re: SGD: mismatch in percentCorrect vs classify() on training data?

Posted by Ted Dunning <te...@gmail.com>.
One nice way to sanitize tokens is to use a dictionary to rewrite tokens as
t1, t2, t3, ...

If you can see fit to expose the data in any form, it would help us help
you.
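The dictionary rewrite can be sketched in a few lines — a hypothetical helper, not an existing Mahout utility:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the suggestion above: map each distinct raw token to an opaque
// name t1, t2, t3, ... so feature names can be shared without leaking data.
public class TokenSanitizer {
    private final Map<String, String> dict = new HashMap<>();

    public String sanitize(String token) {
        String replacement = dict.get(token);
        if (replacement == null) {
            replacement = "t" + (dict.size() + 1);
            dict.put(token, replacement);
        }
        return replacement;   // the same raw token always maps to the same ti
    }
}
```

Applied over the ARFF header, this preserves the feature structure while hiding the original names.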


Re: SGD: mismatch in percentCorrect vs classify() on training data?

Posted by Stuart Smith <st...@yahoo.com>.
Actually, I looked over my feature names, and I started with about 2K of them, not 500... and the names would need to be sanitized before I released them.. so...


I did add a setPercentCorrect() to the CrossFold class, and reset the member to zero before I did the last training run... it then flipped the other way. It said percentCorrect() was about 9%, instead of the 50% I actually got.

I guess..
  - be sure to validate on your own test set, even your own training data might say something useful.

  - it might be nice to add in a resetStats() method or something.
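A resetStats() along these lines might look like the following sketch (hypothetical; CrossFoldLearner exposes no such method, so this only models the idea of zeroing the running counters before a final evaluation pass):

```java
// Hypothetical sketch of the resetStats() idea: zero the running counters
// so the reported stat reflects only the examples seen since the reset.
public class ResettableStat {
    private double sum = 0.0;
    private long count = 0;

    public void record(boolean correct) {
        sum += correct ? 1.0 : 0.0;
        count++;
    }

    public double percentCorrect() {
        return count == 0 ? 0.0 : sum / count;
    }

    // The method wished for above: discard accumulated history.
    public void resetStats() {
        sum = 0.0;
        count = 0;
    }

    public static void main(String[] args) {
        ResettableStat stat = new ResettableStat();
        stat.record(true);
        stat.record(false);
        System.out.println("before reset: " + stat.percentCorrect());
        stat.resetStats();
        stat.record(true);
        System.out.println("after reset:  " + stat.percentCorrect());
    }
}
```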

It's working pretty good now!


Thanks for the help!

Take care,
  -stu




Re: SGD: mismatch in percentCorrect vs classify() on training data?

Posted by Stuart Smith <st...@yahoo.com>.
Gotta run, but will do tmr.

I actually took my feature count down from ~500 to 10, and started getting much better results :)
Even with a 10% holdout set (held out from any training whatsoever).

So it's looking better, but that stat is still just odd... (even now)..

Thanks!


Take care,
  -stu




Re: SGD: mismatch in percentCorrect vs classify() on training data?

Posted by Ted Dunning <te...@gmail.com>.
Hmm... I am surprised as well.

As I remember percentCorrect *is* a weighted moving average so I would
expect some discrepancy, but not this much.

Can you post your training/test data somewhere?  It would be good to test
in synchrony.


Re: SGD: mismatch in percentCorrect vs classify() on training data?

Posted by Stuart Smith <st...@yahoo.com>.
Actually, to be clear, I looked through the CrossFoldLearner code, and understand how it gets calculated.. but I'm surprised that the discrepancy is so large..

Take care,
  -stu


