Posted to user@mahout.apache.org by Rajesh Nikam <ra...@gmail.com> on 2012/10/10 08:29:30 UTC

Problem using SGD and iris arff as test set

Hi Ted,

Here is a specific question, with data, about a problem I am having with SGD.

I am using the Iris Plants Database from Michael Marshall. Please find iris.arff attached.

I converted it to a CSV file just by updating the header: iris-3-classes.csv
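
The conversion step described above can be sketched in Python. This is only a minimal sketch, assuming an ARFF file with simple unquoted attribute names and dense comma-separated data rows; the function name arff_to_csv and the two-row sample are illustrative and are not part of Mahout or of the attached files.

```python
import csv
import io

def arff_to_csv(arff_text):
    """Turn simple ARFF text into CSV text with a header row of attribute names."""
    names = []
    rows = []
    in_data = False
    for raw in arff_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and ARFF comments
        lower = line.lower()
        if in_data:
            rows.append([field.strip() for field in line.split(",")])
        elif lower.startswith("@attribute"):
            # Second token is the attribute name (assumes no quoted names with spaces).
            names.append(line.split()[1])
        elif lower.startswith("@data"):
            in_data = True
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(names)
    writer.writerows(rows)
    return out.getvalue()

# Tiny excerpt in the shape of iris.arff, used only to exercise the function.
SAMPLE = """% toy excerpt
@RELATION iris
@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
"""

csv_text = arff_to_csv(SAMPLE)
print(csv_text)
```

The header row produced this way carries the same column names that are passed to --predictors and --target in the commands below.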

mahout org.apache.mahout.classifier.sgd.TrainLogistic --input
/usr/local/mahout/trunk/iris-3-classes.csv --features 4 --output
/usr/local/mahout/trunk/iris-3-classes.model --target class --categories
3 --predictors sepallength sepalwidth petallength petalwidth --types n n

>> It gave the following error:
Exception in thread "main" java.lang.IllegalArgumentException: Can only call classifyScalar with two categories

I then created a CSV with only 2 classes. Please find iris-2-classes.csv attached.

>> Trained on iris-2-classes.csv with SGD:

mahout org.apache.mahout.classifier.sgd.TrainLogistic --input
/usr/local/mahout/trunk/iris-2-classes.csv --features 4 --output
/usr/local/mahout/trunk/iris-2-classes.model --target class --categories
2 --predictors sepallength sepalwidth petallength petalwidth --types n n


mahout runlogistic --input /usr/local/mahout/trunk/iris-2-classes.csv
--model /usr/local/mahout/trunk/iris-2-classes.model --auc --confusion

AUC = 0.14
confusion: [[50.0, 50.0], [0.0, 0.0]]
entropy: [[-0.6, -0.3], [-0.8, -0.4]]

>> The AUC seems too poor. I then changed --predictors:

mahout org.apache.mahout.classifier.sgd.TrainLogistic --input
/usr/local/mahout/trunk/iris-2-classes.csv --features 4 --output
/usr/local/mahout/trunk/iris-2-classes.model --target class --categories
2 --predictors sepalwidth petallength --types n n

mahout runlogistic --input /usr/local/mahout/trunk/iris-2-classes.csv
--model /usr/local/mahout/trunk/iris-2-classes.model --auc --confusion
--scores

AUC = 0.80
confusion: [[50.0, 50.0], [0.0, 0.0]]
entropy: [[-0.7, -0.3], [-0.7, -0.4]]

The AUC has improved; however, from the confusion matrix it seems that
everything is classified as a single class.

Below is the output.

"target","model-output","log-likelihood"
0,0.492,-0.677017
0,0.493,-0.679192
0,0.493,-0.678355
0,0.493,-0.678724
0,0.492,-0.676583
0,0.491,-0.675182
0,0.492,-0.677452
0,0.492,-0.677419
0,0.493,-0.679628
0,0.493,-0.678724
0,0.491,-0.676116
0,0.492,-0.677386
0,0.493,-0.679192
0,0.493,-0.679291
0,0.491,-0.674912
0,0.490,-0.673081
0,0.491,-0.675313
0,0.492,-0.677017
0,0.491,-0.675616
0,0.491,-0.675682
0,0.492,-0.677353
0,0.491,-0.676116
0,0.492,-0.676714
0,0.492,-0.677788
0,0.492,-0.677287
0,0.493,-0.679126
0,0.492,-0.677386
0,0.492,-0.676984
0,0.492,-0.677452
0,0.492,-0.678256
0,0.493,-0.678691
0,0.492,-0.677419
0,0.491,-0.674381
0,0.490,-0.673980
0,0.493,-0.678724
0,0.493,-0.678387
0,0.492,-0.677050
0,0.493,-0.678724
0,0.493,-0.679225
0,0.492,-0.677419
0,0.492,-0.677050
0,0.495,-0.682279
0,0.493,-0.678355
0,0.492,-0.676951
0,0.491,-0.675550
0,0.493,-0.679192
0,0.491,-0.675649
0,0.493,-0.678322
0,0.491,-0.676116
0,0.492,-0.677887
1,0.492,-0.709316
1,0.492,-0.709248
1,0.492,-0.708935
1,0.494,-0.705048
1,0.493,-0.707488
1,0.493,-0.707454
1,0.492,-0.709765
1,0.494,-0.705258
1,0.493,-0.707936
1,0.493,-0.706803
1,0.495,-0.703539
1,0.493,-0.708249
1,0.494,-0.704601
1,0.493,-0.707970
1,0.493,-0.707597
1,0.492,-0.708765
1,0.492,-0.708351
1,0.493,-0.706871
1,0.494,-0.704770
1,0.494,-0.705908
1,0.492,-0.709350
1,0.493,-0.707285
1,0.493,-0.706247
1,0.493,-0.707522
1,0.493,-0.707835
1,0.492,-0.708317
1,0.493,-0.707556
1,0.492,-0.708520
1,0.493,-0.707902
1,0.494,-0.706220
1,0.494,-0.705427
1,0.494,-0.705393
1,0.493,-0.706803
1,0.493,-0.707210
1,0.492,-0.708351
1,0.492,-0.710146
1,0.492,-0.708867
1,0.494,-0.705183
1,0.493,-0.708215
1,0.494,-0.705942
1,0.493,-0.706525
1,0.492,-0.708385
1,0.493,-0.706389
1,0.494,-0.704811
1,0.493,-0.706905
1,0.493,-0.708249
1,0.493,-0.707801
1,0.493,-0.707835
1,0.494,-0.705604
1,0.493,-0.707319

AUC = 0.80
confusion: [[50.0, 50.0], [0.0, 0.0]]
entropy: [[-0.7, -0.3], [-0.7, -0.4]]
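
The listing above can be checked by hand. The following is a minimal sketch, assuming a 0.5 decision threshold and a confusion matrix laid out as [predicted][actual]; both are assumptions made for illustration and are not confirmed from the Mahout source. It recomputes the log-likelihood column and the confusion counts for a few rows copied from the listing.

```python
import math

# A few (target, model-output) pairs copied from the score listing.
SCORES = [(0, 0.492), (0, 0.493), (0, 0.495), (1, 0.492), (1, 0.494), (1, 0.495)]

def log_likelihood(target, p):
    """Log of the probability the model assigned to the true class."""
    return math.log(p) if target == 1 else math.log(1.0 - p)

def confusion(scores, threshold=0.5):
    """Count rows as matrix[predicted][actual] (assumed layout)."""
    matrix = [[0, 0], [0, 0]]
    for target, p in scores:
        predicted = 1 if p > threshold else 0
        matrix[predicted][target] += 1
    return matrix

for target, p in SCORES:
    print(target, p, round(log_likelihood(target, p), 3))
print(confusion(SCORES))
```

Because every model output is slightly below 0.5, all rows land in predicted class 0 under these assumptions, which is consistent with the reported [[50.0, 50.0], [0.0, 0.0]]. The recomputed log-likelihoods agree with the listed values up to the rounding of the model-output column.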

What kind of data is SGD suitable for?

Thanks,
Rajesh

Re: Problem using SGD and iris arff as test set

Posted by Rajesh Nikam <ra...@gmail.com>.

Hi Ted,

It seems something is wrong with the input file or with the parameters passed to the package.
Could you point out what is missing?

Thanks
Rajesh

On Thu, Oct 11, 2012 at 10:18 PM, Ted Dunning <te...@gmail.com> wrote:

> Not sure just offhand.  Need to look in more detail in a debugger.  Need
> to find time to do that.

Re: Problem using SGD and iris arff as test set

Posted by Rajesh Nikam <ra...@gmail.com>.

A similar question was raised by Kevin on 27 July, who tried with donut.csv.

Please see the link below:

http://stackoverflow.com/questions/11221436/using-sgd-classifier-in-mahout

Looking forward to your comments on SGD.

Thanks
Rajesh


On Thu, Oct 11, 2012 at 10:18 PM, Ted Dunning <te...@gmail.com> wrote:

> Not sure just offhand.  Need to look in more detail in a debugger.  Need
> to find time to do that.

Re: Problem using SGD and iris arff as test set

Posted by Ted Dunning <te...@gmail.com>.

Not sure just offhand.  Need to look in more detail in a debugger.  Need
to find time to do that.

On Thu, Oct 11, 2012 at 1:58 AM, Rajesh Nikam <ra...@gmail.com> wrote:

> What could be the problem with the data formatting?
> Could you please give an update on this?

Re: Problem using SGD and iris arff as test set

Posted by Rajesh Nikam <ra...@gmail.com>.

What could be the problem with the data formatting?
Could you please give an update on this?

On Thu, Oct 11, 2012 at 11:31 AM, Ted Dunning <te...@gmail.com> wrote:

> My first thought was that we needed several passes, but that is clearly
> wrong.
>
> I think that the problem is in the data formatting and conversion somehow.
>  Haven't had time to dope this out just yet.  The iris data should converge
> trivially.

Re: Problem using SGD and iris arff as test set

Posted by Ted Dunning <te...@gmail.com>.

My first thought was that we needed several passes, but that is clearly
wrong.

I think that the problem is in the data formatting and conversion somehow.
 Haven't had time to dope this out just yet.  The iris data should converge
trivially.
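
One quick way to look for a formatting or conversion problem is to check that every predictor column parses as a number and that the target column holds the expected labels. The sketch below is only an illustration, not part of Mahout: the function name summarize_csv, the sample rows, and the label strings are assumptions. The column names are the ones used in the commands in this thread.

```python
import csv
import io


def summarize_csv(text, target="class"):
    """Count rows, non-numeric predictor values per column, and target labels."""
    reader = csv.DictReader(io.StringIO(text))
    # Every column except the target is treated as a numeric predictor.
    non_numeric = {name: 0 for name in reader.fieldnames if name != target}
    labels = {}
    rows = 0
    for row in reader:
        rows += 1
        labels[row[target]] = labels.get(row[target], 0) + 1
        for name in non_numeric:
            try:
                float(row[name])
            except (TypeError, ValueError):
                # Missing fields arrive as None, bad text raises ValueError.
                non_numeric[name] += 1
    return {"rows": rows, "non_numeric": non_numeric, "labels": labels}


# Illustrative rows only (hypothetical sample, with one deliberately bad value).
sample = """sepallength,sepalwidth,petallength,petalwidth,class
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,?,4.5,1.5,Iris-versicolor
"""

report = summarize_csv(sample)
print(report)
```

To run it on a real file, read the file contents and pass them in, for example summarize_csv(open("iris-2-classes.csv").read()). A non-zero count under non_numeric, or an unexpected set of labels, would point at the conversion step.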


Re: Problem using SGD and iris arff as test set

Posted by Rajesh Nikam <ra...@gmail.com>.

Thanks for looking into it.

I actually tried it on a large data set first. Here is the model output for that run.

AUC = 0.50
confusion: [[1252978.0, 23003.0], [0.0, 0.0]]
entropy: [[-0.0, -0.0], [-46.1, -0.8]]

Looking forward to your comments.

Thanks
Rajesh
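
The output above puts every example in the first row of the confusion matrix. A minimal sketch of how that can happen is given below, assuming the matrix is built by thresholding the model output at 0.5 with rows as the predicted class and columns as the actual class. That layout is an assumption inferred from the numbers in this thread, not something confirmed from the Mahout source. The scores are made up for illustration. AUC depends only on how the scores rank the examples, so it can be well above 0.5 even when no score crosses the threshold, and it is exactly 0.5 when the scores carry no ranking information at all.

```python
def confusion_at_threshold(targets, scores, threshold=0.5):
    """Rows are the predicted class, columns the actual class (assumed layout)."""
    matrix = [[0, 0], [0, 0]]
    for target, score in zip(targets, scores):
        predicted = 1 if score >= threshold else 0
        matrix[predicted][target] += 1
    return matrix


def auc(targets, scores):
    """Chance that a random positive outscores a random negative; ties count half."""
    positives = [s for t, s in zip(targets, scores) if t == 1]
    negatives = [s for t, s in zip(targets, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up scores that all sit just below 0.5 but still rank most positives higher.
targets = [0, 0, 0, 1, 1, 1]
scores = [0.490, 0.491, 0.493, 0.492, 0.494, 0.495]

matrix = confusion_at_threshold(targets, scores)
ranked_auc = auc(targets, scores)
flat_auc = auc(targets, [0.49] * len(targets))

print(matrix)      # every example lands in the "predicted 0" row
print(ranked_auc)  # well above 0.5 despite the one-sided confusion matrix
print(flat_auc)    # constant scores give exactly 0.5
```

Read this way, a one-sided confusion matrix together with AUC = 0.50 suggests the model output is close to constant, which fits a model that has barely moved from its initial weights.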



Re: Problem using SGD and iris arff as test set

Posted by Ted Dunning <te...@gmail.com>.

SGD is better suited to large data sets. I will take a look later today.

Sent from my iPhone

