Posted to user@mahout.apache.org by Rajesh Nikam <ra...@gmail.com> on 2012/10/15 15:28:11 UTC

SGD: Logistic regression package in Mahout

Hello,

I have asked the question below, about an issue with using SGD, on the Mahout
forum.

A similar issue with SGD is reported at

http://stackoverflow.com/questions/11221436/using-sgd-classifier-in-mahout

The link below shows similar output as well:

AUC = 0.57
confusion: [[27.0, 13.0], [0.0, 0.0]]
entropy: [[-0.4, -0.3], [-1.2, -0.7]]


http://sujitpal.blogspot.in/2012/09/learning-mahout-classification.html

I am still confused about how this model works and how so many people use it.
I have not been able to find any pointers on how to use SGD so that it
generates an effective model.

Could someone point out what is missing in the input file or in the provided
parameters?

I appreciate your help.

Below is a description of the steps that I followed.

Please find attached the input files used for the experiment.

I am using the Iris Plants Database from Michael Marshall (iris.arff, attached).
I converted this to a CSV file just by updating the header: iris-3-classes.csv
mahout org.apache.mahout.classifier.sgd.TrainLogistic --input
/usr/local/mahout/trunk/iris-3-classes.csv --features 4 --output
/usr/local/mahout/trunk/iris-3-classes.model --target class --categories 3
--predictors sepallength sepalwidth petallength petalwidth --types n

This gave the following error:
Exception in thread "main" java.lang.IllegalArgumentException: Can only
call classifyScalar with two categories

I then created a CSV with only 2 classes (iris-2-classes.csv, attached).
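
The thread does not say which two classes were kept or how the file was cut
down. As one hedged way to do it, the Python sketch below filters a CSV to the
rows whose target column is in a chosen set. The function name keep_classes
and the sample rows are illustrative assumptions; only the Iris label names
come from the data set itself.

```python
import csv
import io


def keep_classes(csv_text, target, keep):
    """Return CSV text containing only rows whose target value is in keep."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if row[target] in keep:
            writer.writerow(row)
    return out.getvalue()


# Made-up three-row sample; the real file has 150 rows and four predictors.
sample = (
    "sepallength,class\n"
    "5.1,Iris-setosa\n"
    "7.0,Iris-versicolor\n"
    "6.3,Iris-virginica\n"
)
two_class = keep_classes(sample, "class", {"Iris-setosa", "Iris-versicolor"})
print(two_class)
```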

I trained on iris-2-classes.csv with SGD:

mahout org.apache.mahout.classifier.sgd.TrainLogistic --input
/usr/local/mahout/trunk/iris-2-classes.csv --features 4 --output
/usr/local/mahout/trunk/iris-2-classes.model --target class --categories 2
--predictors sepallength sepalwidth petallength petalwidth --types n

mahout runlogistic --input /usr/local/mahout/trunk/iris-2-classes.csv
--model /usr/local/mahout/trunk/iris-2-classes.model --auc --confusion

AUC = 0.14
confusion: [[50.0, 50.0], [0.0, 0.0]]
entropy: [[-0.6, -0.3], [-0.8, -0.4]]
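
An AUC of 0.14 is far below 0.5, the value a random ranking would give. One
possible reading, offered here as an assumption rather than a diagnosis, is
that the scores rank the two classes in reverse order, since flipping the
labels turns an AUC of a into 1 - a. The Python sketch below illustrates that
property with a rank-based AUC on made-up scores; it is not Mahout's
implementation.

```python
def auc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Labels are 0 or 1.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up scores that mostly rank the negatives above the positives.
labels = [0, 0, 0, 1, 1, 1]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]

original = auc(labels, scores)
flipped = auc([1 - y for y in labels], scores)
print(round(original, 3), round(flipped, 3))  # 0.111 0.889
```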

The AUC seems too poor, so I changed --predictors:

mahout org.apache.mahout.classifier.sgd.TrainLogistic --input
/usr/local/mahout/trunk/iris-2-classes.csv --features 4 --output
/usr/local/mahout/trunk/iris-2-classes.model --target class --categories 2
--predictors sepalwidth petallength --types n

mahout runlogistic --input /usr/local/mahout/trunk/iris-2-classes.csv
--model /usr/local/mahout/trunk/iris-2-classes.model --auc --confusion
--scores

AUC = 0.80
confusion: [[50.0, 50.0], [0.0, 0.0]]
entropy: [[-0.7, -0.3], [-0.7, -0.4]]

This model classifies everything as category 1, which is of no use.
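
An AUC of 0.80 alongside a confusion matrix with a single non-empty row is not
contradictory: AUC depends only on how the scores are ordered, while the
confusion matrix depends on where the scores fall relative to a cut-off. The
Python sketch below shows this on made-up scores. It assumes a cut-off of 0.5
and a matrix laid out as rows = predicted class, columns = actual class; both
are assumptions about the tool's output, not facts taken from this thread.

```python
def confusion(labels, scores, threshold=0.5):
    """Return 2x2 counts indexed as m[predicted][actual]."""
    m = [[0, 0], [0, 0]]
    for y, s in zip(labels, scores):
        predicted = 1 if s >= threshold else 0
        m[predicted][y] += 1
    return m


# Made-up scores: positives mostly score higher than negatives,
# but every score is below 0.5.
labels = [0, 0, 0, 1, 1, 1]
scores = [0.10, 0.15, 0.30, 0.20, 0.35, 0.40]

print(confusion(labels, scores))                  # [[3, 3], [0, 0]]
print(confusion(labels, scores, threshold=0.18))  # [[2, 0], [1, 3]]
```

With the default cut-off every instance lands in the first row, which has the
same shape as the output above; moving the cut-off separates the classes.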

Thanks
Rajesh

Re: SGD: Logistic regression package in Mahout

Posted by Ted Dunning <te...@gmail.com>.
The output will be in a report file under core/target, I think.  Look for a
file with OnlineLogisticRegressionTest in the name.  Saving to a separate
file is a fine approach as well.
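
As a small convenience, the Python sketch below searches a directory tree for
files whose name contains the test class name. The directory core/target is
the location suggested above; the helper name find_reports is made up, and the
exact report file names are not specified in this thread.

```python
from pathlib import Path


def find_reports(root, name_part="OnlineLogisticRegressionTest"):
    """Return sorted paths of files under root whose name contains name_part."""
    root = Path(root)
    if not root.is_dir():
        return []  # nothing to search, for example before the first test run
    return sorted(
        p for p in root.rglob("*") if p.is_file() and name_part in p.name
    )


# Run from the Mahout source root after "mvn test".
for path in find_reports("core/target"):
    print(path)
```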

On Thu, Nov 1, 2012 at 6:45 AM, Rajesh Nikam <ra...@gmail.com> wrote:

> Thanks, Ted, for providing the test case; it helped me look into the details
> of the problem that I am facing.
>
> I worked out how to run the test case using Maven:
>
> mvn test -Dtest="org.apache.mahout.classifier.sgd.OnlineLogisticRegressionTest"
>
> However, I could not see the printf output on the console, so I have saved
> the output to a file.
>
> Now I will look at the results and post an update if there is any issue.
>
> Thanks
> Rajesh
>
>
> On Thu, Nov 1, 2012 at 1:05 PM, Rajesh Nikam <ra...@gmail.com> wrote:
>
> > Hi Mat,
> >
> > Thanks for pointing out the JIRA link for this particular case.
> >
> > Could you help with one more thing?
> >
> > I have not used Maven for building and running Java classes. I am looking
> > at http://maven.apache.org/guides/getting-started/index.html
> >
> > Could you please point out how to build and run a specific class such as
> > OnlineLogisticRegressionTest.java from Mahout?
> >
> > Thanks
> > Rajesh
> >
> >
> > On Wed, Oct 31, 2012 at 8:15 PM, Mat Kelcey <matthew.kelcey@gmail.com> wrote:
> >
> >> Rajesh, Ted has added the test case code already
> >> https://issues.apache.org/jira/browse/MAHOUT-1107
> >>
> >> On 31 October 2012 05:14, Rajesh Nikam <ra...@gmail.com> wrote:
> >>
> >> > Hi Ted,
> >> >
> >> > Please post an update once the JIRA and the test case are uploaded.
> >> >
> >> > Looking forward to your reply.
> >> >
> >> > Thanks
> >> > Rajesh
> >> >
> >> > On Wed, Oct 31, 2012 at 11:00 AM, Rajesh Nikam <rajeshnikam@gmail.com> wrote:
> >> >
> >> > > Hi Ted,
> >> > >
> >> > > Thanks for the reply. I will wait for the JIRA and hope to get rid of any
> >> > > encoding issue.
> >> > >
> >> > > Thanks,
> >> > > Rajesh
> >> > > On Oct 31, 2012 5:24 AM, "Ted Dunning" <te...@gmail.com> wrote:
> >> > >
> >> > >> OK.  I am back up for air.
> >> > >>
> >> > >> Rajesh,
> >> > >>
> >> > >> As I am sure you know, most folks here contribute on their own time.  I
> >> > >> have been busy with my day job and unable to help with this until just
> >> > >> now.
> >> > >>
> >> > >> I just wrote a test case that looks at the Iris data set.  The results are
> >> > >> categorically different from yours.
> >> > >>
> >> > >> That substantiates my original feeling that your encoding of the data is
> >> > >> problematic.  I will file a JIRA and attach a test case that you can look
> >> > >> at.  Then we can see what the differences are.
> >> > >>
> >> > >>
> >> > >> On Tue, Oct 23, 2012 at 1:28 AM, Rajesh Nikam <rajeshnikam@gmail.com> wrote:
> >> > >>
> >> > >> > Hi,
> >> > >> >
> >> > >> > Is there any development happening on fixing the issue with SGD
> >> > >> > generating models that are no better than random prediction?
> >> > >> >
> >> > >> > I am not sure why such an issue has not been noticed and raised by others.
> >> > >> > Maybe this specific algorithm is not used in practical applications.
> >> > >> >
> >> > >> > Thanks,
> >> > >> > Rajesh
> >> > >> >
> >> > >> >
> >> > >> > >>
> >> > >> > >> On Tue, Oct 16, 2012 at 10:23 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> >> > >> > >>
> >> > >> > >>> Rajesh,
> >> > >> > >>>
> >> > >> > >>> In the testing that I did, I ran 100, 1000 and 10,000 passes through the
> >> > >> > >>> data.  All produced identical results.  Thus it isn't an issue of SGD
> >> > >> > >>> converging.
> >> > >> > >>>
> >> > >> > >>> I also did a parameter scan of lambda and saw no effect.
> >> > >> > >>>
> >> > >> > >>> I also did the standard thing in R with glm and got the expected (correct)
> >> > >> > >>> results.
> >> > >> > >>>
> >> > >> > >>> I haven't looked yet in detail, but I really suspect that the reading of
> >> > >> > >>> the data is horked.  This is exactly how that behaves.
> >> > >> > >>>
> >> > >> > >>> On Tue, Oct 16, 2012 at 4:49 AM, Rajesh Nikam <rajeshnikam@gmail.com> wrote:
> >> > >> > >>>
> >> > >> > >>> > Hi Ted,
> >> > >> > >>> >
> >> > >> > >>> > I was thinking this might be due to having only 100 instances for
> >> > >> > >>> > training.
> >> > >> > >>> >
> >> > >> > >>> > So I have created a test set with two classes and ~49K instances, and
> >> > >> > >>> > included all features as predictors.
> >> > >> > >>> > Please find attached sgd.grps.zip with the test file.
> >> > >> > >>> >
> >> > >> > >>> > mahout trainlogistic --input /usr/local/mahout/trainme/sgd-grps.csv
> >> > >> > >>> > --output /usr/local/mahout/trainme/sgd-grps.model --target class
> >> > >> > >>> > --categories 2 --features 128 --types n --predictors a1 a2 a3 a4 a5 a6 a7
> >> > >> > >>> > a8 a9 a10 a11 a12 a13 a14 a15 a16 a17 a18 a19 a20 a21 a22 a23 a24 a25 a26
> >> > >> > >>> > a27 a28 a29 a30 a31 a32 a33 a34 a35 a36 a37 a38 a39 a40 a41 a42 a43 a44 a45
> >> > >> > >>> > a46 a47 a48 a49 a50 a51 a52 a53 a54 a55 a56 a57 a58 a59 a60 a61 a62 a63 a64
> >> > >> > >>> > a65 a66 a67 a68 a69 a70 a71 a72 a73 a74 a75 a76 a77 a78 a79 a80 a81 a82 a83
> >> > >> > >>> > a84 a85 a86 a87 a88 a89 a90 a91 a92 a93 a94 a95 a96 a97 a98 a99 a100 a101
> >> > >> > >>> > a102 a103 a104 a105 a106 a107 a108 a109 a110 a111 a112 a113 a114 a115 a116
> >> > >> > >>> > a117 a118 a119 a120 a121 a122 a123 a124 a125 a126 a127
> >> > >> > >>> >
> >> > >> > >>> >
> >> > >> > >>> > mahout runlogistic --input /usr/local/mahout/trainme/sgd-grps.csv --model
> >> > >> > >>> > /usr/local/mahout/trainme/sgd-grps.model --auc --confusion
> >> > >> > >>> >
> >> > >> > >>> > The results are still similar: it classifies everything as class_1.
> >> > >> > >>> >
> >> > >> > >>> > AUC = 0.50
> >> > >> > >>> > confusion: [[26563.0, 23006.0], [0.0, 0.0]]
> >> > >> > >>> > entropy: [[-0.0, -0.0], [-46.1, -21.4]]
> >> > >> > >>> >
> >> > >> > >>> > I am not sure why this is failing all the time.
> >> > >> > >>> >
> >> > >> > >>> > Looking forward to your reply.
> >> > >> > >>> >
> >> > >> > >>> > Thanks
> >> > >> > >>> > Rajesh
> >> > >> > >>> >
> >> > >> > >>> >
> >> > >> > >>> >
> >> > >> > >>> > On Tue, Oct 16, 2012 at 3:57 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
> >> > >> > >>> >
> >> > >> > >>> > > I would love to help and will before long.  Just can't do it in the
> >> > >> > >>> > > first part of this week.
> >> > >> > >>> > >
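
Since the quoted replies point at the encoding of the input data as the likely
cause, a quick check of the CSV before training may help narrow things down.
The Python sketch below is one hedged way to do that: it verifies that the
target and predictor names appear in the header, that every predictor value
parses as a number, and it counts the class labels. The function name
check_csv and the sample rows are illustrative assumptions; it does not
reproduce how Mahout reads the file.

```python
import csv
import io
from collections import Counter


def check_csv(csv_text, target, predictors):
    """Return (problems, class_counts) for CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    problems = [
        f"missing column: {name}"
        for name in [target, *predictors]
        if name not in header
    ]
    counts = Counter()
    if problems:
        return problems, counts
    for row in reader:
        counts[row[target]] += 1
        for name in predictors:
            try:
                float(row[name])
            except (TypeError, ValueError):
                # TypeError covers short rows, where the value is None
                problems.append(
                    f"line {reader.line_num}: {name}={row[name]!r} is not numeric"
                )
    return problems, counts


# Made-up sample with one bad value.
sample = "sepallength,class\n5.1,Iris-setosa\n?,Iris-versicolor\n"
problems, counts = check_csv(sample, "class", ["sepallength"])
print(problems)
print(dict(counts))
```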

Re: SGD: Logistic regression package in Mahout

Posted by Rajesh Nikam <ra...@gmail.com>.
Thanks, Ted, for providing the test case; it helped me look into the details
of the problem that I am facing.

I worked out how to run the test case using Maven:

mvn test -Dtest="org.apache.mahout.classifier.sgd.OnlineLogisticRegressionTest"

However, I could not see the printf output on the console, so I have saved
the output to a file.

Now I will look at the results and post an update if there is any issue.

Thanks
Rajesh



Re: SGD: Logistic regression package in Mahout

Posted by Rajesh Nikam <ra...@gmail.com>.
Hi Mat,

Thanks for pointing out the JIRA link for this particular case.

Could you help with one more thing?

I have not used Maven for building and running Java classes. I am looking
at http://maven.apache.org/guides/getting-started/index.html

Could you please point out how to build and run a specific class such as
OnlineLogisticRegressionTest.java from Mahout?

Thanks
Rajesh

> > >> > >>> > > > Exception in thread "main"
> > java.lang.IllegalArgumentException:
> > >> > Can
> > >> > >>> only
> > >> > >>> > > > call classifyScalar with two categories
> > >> > >>> > > >
> > >> > >>> > > > Now created csv with only 2 classes. PFA
> iris-2-classes.csv
> > >> > >>> > > >
> > >> > >>> > > > >> trained iris-2-classes.csv with sgd
> > >> > >>> > > >
> > >> > >>> > > > mahout org.apache.mahout.classifier.sgd.TrainLogistic
> > --input
> > >> > >>> > > > /usr/local/mahout/trunk/*iris-2-classes.csv* --features 4
> > >> > --output
> > >> > >>> > > > /usr/local/mahout/trunk/*iris-2-classes.mode*l --target
> > class
> > >> > >>> > > *--categories
> > >> > >>> > > > 2* --predictors sepallength sepalwidth petallength
> > petalwidth
> > >> > >>> --types n
> > >> > >>> > > >
> > >> > >>> > > > mahout runlogistic --input
> > >> > >>> /usr/local/mahout/trunk/iris-2-classes.csv
> > >> > >>> > > > --model /usr/local/mahout/trunk/iris-2-classes.model --auc
> > >> > >>> --confusion
> > >> > >>> > > >
> > >> > >>> > > > AUC = 0.14
> > >> > >>> > > > confusion: [[50.0, 50.0], [0.0, 0.0]]
> > >> > >>> > > > entropy: [[-0.6, -0.3], [-0.8, -0.4]]
> > >> > >>> > > >
> > >> > >>> > > > >> AUC seems to poor. Now changed --predictors
> > >> > >>> > > >
> > >> > >>> > > > mahout org.apache.mahout.classifier.sgd.TrainLogistic
> > --input
> > >> > >>> > > > /usr/local/mahout/trunk/*iris-2-classes.csv* --features 4
> > >> > --output
> > >> > >>> > > > /usr/local/mahout/trunk/*iris-2-classes.mode*l --target
> > class
> > >> > >>> > > *--categories
> > >> > >>> > > > 2* --predictors sepalwidth petallength --types n
> > >> > >>> > > >
> > >> > >>> > > > mahout runlogistic --input
> > >> > >>> /usr/local/mahout/trunk/iris-2-classes.csv
> > >> > >>> > > > --model /usr/local/mahout/trunk/iris-2-classes.model --auc
> > >> > >>> --confusion
> > >> > >>> > > > --scores
> > >> > >>> > > >
> > >> > >>> > > > AUC = 0.80
> > >> > >>> > > > *confusion: [[50.0, 50.0], [0.0, 0.0]]*
> > >> > >>> > > > entropy: [[-0.7, -0.3], [-0.7, -0.4]]
> > >> > >>> > > >
> > >> > >>> > > > This model classifies everything as category 1 which of no
> > >> use.
> > >> > >>> > > >
> > >> > >>> > > > Thanks
> > >> > >>> > > > Rajesh
> > >> > >>> > > >
> > >> > >>> > > >
> > >> > >>> > > >
> > >> > >>> > > >
> > >> > >>> > >
> > >> > >>> >
> > >> > >>>
> > >> > >>
> > >> > >>
> > >> > >
> > >> >
> > >>
> > >
> >
>

Re: SGD: Logistic regression package in Mahout

Posted by Mat Kelcey <ma...@gmail.com>.
Rajesh, Ted has already added the test case code:
https://issues.apache.org/jira/browse/MAHOUT-1107

On 31 October 2012 05:14, Rajesh Nikam <ra...@gmail.com> wrote:

> Hi Ted,
>
> Please post an update once the JIRA is filed and the test case is uploaded.
>
> Looking forward to your reply.
>
> Thanks
> Rajesh
>

Re: SGD: Logistic regression package in Mahout

Posted by Rajesh Nikam <ra...@gmail.com>.
Hi Ted,

Please post an update once the JIRA is filed and the test case is uploaded.

Looking forward to your reply.

Thanks
Rajesh

On Wed, Oct 31, 2012 at 11:00 AM, Rajesh Nikam <ra...@gmail.com> wrote:

> Hi Ted,
>
> Thanks for the reply. I will wait for the JIRA and hope it resolves the
> encoding issue.
>
> Thanks,
> Rajesh

Re: SGD: Logistic regression package in Mahout

Posted by Rajesh Nikam <ra...@gmail.com>.
Hi Ted,

Thanks for the reply. I will wait for the JIRA and hope it resolves the
encoding issue.

Thanks,
Rajesh
On Oct 31, 2012 5:24 AM, "Ted Dunning" <te...@gmail.com> wrote:

> OK.  I am back up for air.
>
> Rajesh,
>
> As I am sure you know, most folks here contribute on their own time.  I
> have been busy with my day job and unable to help with this until just now.
>
> I just wrote a test case that looks at the Iris data set.  The results are
> categorically different from yours.
>
> That substantiates my original feeling that your encoding of the data is
> problematic.  I will file a JIRA and attach a test case that you can look
> at.  Then we can see what the differences are.
>
>

Re: SGD: Logistic regression package in Mahout

Posted by Ted Dunning <te...@gmail.com>.
OK.  I am back up for air.

Rajesh,

As I am sure you know, most folks here contribute on their own time.  I
have been busy with my day job and unable to help with this until just now.

I just wrote a test case that looks at the Iris data set.  The results are
categorically different from yours.

That substantiates my original feeling that your encoding of the data is
problematic.  I will file a JIRA and attach a test case that you can look
at.  Then we can see what the differences are.
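
For readers who want a quick sanity check on how they encode their data before comparing against the JIRA test case, here is a minimal sketch in plain Python. It is not the test case mentioned above and it does not use any Mahout API. It assumes a tiny synthetic two-class data set standing in for two Iris classes, and the helper names (encode, train_sgd, confusion) are hypothetical. It shows two encoding choices that are easy to get wrong: adding an explicit constant (intercept) term to every row, and shuffling rows that arrive sorted by class. Whether either of these explains the results reported in this thread is an assumption, not something the thread confirms.

```python
import math
import random


def encode(features):
    """Encode one row as [bias, f1, f2, ...]; the leading 1.0 is the intercept term."""
    return [1.0] + [float(f) for f in features]


def predict(weights, x):
    """Probability of class 1 under a logistic model."""
    z = sum(w * xi for w, xi in zip(weights, x))
    # Clamp the score so math.exp cannot overflow on extreme values.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_sgd(rows, labels, passes=200, rate=0.1, seed=42):
    """Plain SGD on the logistic loss, reshuffling the example order every pass."""
    rng = random.Random(seed)
    weights = [0.0] * len(rows[0])
    order = list(range(len(rows)))
    for _ in range(passes):
        # The input is sorted by class, so visit it in random order.
        rng.shuffle(order)
        for i in order:
            error = labels[i] - predict(weights, rows[i])
            weights = [w + rate * error * xi for w, xi in zip(weights, rows[i])]
    return weights


def confusion(weights, rows, labels):
    """2x2 counts indexed as matrix[actual][predicted] (this sketch's own layout)."""
    matrix = [[0, 0], [0, 0]]
    for x, y in zip(rows, labels):
        matrix[y][1 if predict(weights, x) >= 0.5 else 0] += 1
    return matrix


# Synthetic, well-separated data: one numeric feature, 20 rows per class,
# listed class by class the way the Iris file is.
raw = [([1.0 + 0.05 * i], 0) for i in range(20)]
raw += [([4.0 + 0.05 * i], 1) for i in range(20)]
rows = [encode(features) for features, _ in raw]
labels = [label for _, label in raw]

weights = train_sgd(rows, labels)
matrix = confusion(weights, rows, labels)
accuracy = (matrix[0][0] + matrix[1][1]) / len(rows)
print("confusion:", matrix)
print("accuracy:", accuracy)
```

If a check like this separates the classes while the real pipeline still puts every row in one category, the difference is more likely to be in how the input is read and encoded than in the number of passes or the regularization setting.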


On Tue, Oct 23, 2012 at 1:28 AM, Rajesh Nikam <ra...@gmail.com> wrote:

> Hi,
>
> Is there development happening on fixing issue with SGD that generates
> models which are as good as random prediction?
>
> I am not sure why such issue is not noticed and raised by others ?
> May be this specific algo is not used in practical applications.
>
> Thanks,
> Rajesh
>

Re: SGD: Logistic regression package in Mahout

Posted by Rajesh Nikam <ra...@gmail.com>.
Hi,

Is any work under way to fix the issue where SGD generates models that
are no better than random prediction?

I am not sure why this issue has not been noticed and raised by others.
Maybe this specific algorithm is not used in practical applications.

Thanks,
Rajesh


>>
>> On Tue, Oct 16, 2012 at 10:23 PM, Ted Dunning <te...@gmail.com>wrote:
>>
>>> Rajesh,
>>>
>>> In the testing that I did, I ran 100, 1000 and 10,000 passes through the
>>> data.  All produced identical results.  Thus it isn't an issue of SGD
>>> converging.
>>>
>>> I also did a parameter scan of lambda and saw no effect.
>>>
>>> I also did the standard thing in R with glm and got the expected
>>> (correct)
>>> results.
>>>
>>> I haven't looked yet in detail, but I really suspect that the reading of
>>> the data is horked.  This is exactly how that behaves.

Re: SGD: Logistic regression package in Mahout

Posted by Rajesh Nikam <ra...@gmail.com>.
Hi Ted,

Please post an update once the SGD parsing issue is fixed.

Thanks
Rajesh

On Wed, Oct 17, 2012 at 2:22 PM, Rajesh Nikam <ra...@gmail.com> wrote:

> Hello Ted,
>
> Thanks for investigating this.
> I look forward to further analysis and a fix in SGD.
>
> I appreciate your efforts in looking into it.
>
> Thanks,
> Rajesh

Re: SGD: Logistic regression package in Mahout

Posted by Rajesh Nikam <ra...@gmail.com>.
Hello Ted,

Thanks for investigating this.
I look forward to further analysis and a fix in SGD.

I appreciate your efforts in looking into it.

Thanks,
Rajesh


On Tue, Oct 16, 2012 at 10:23 PM, Ted Dunning <te...@gmail.com> wrote:

> Rajesh,
>
> In the testing that I did, I ran 100, 1000 and 10,000 passes through the
> data.  All produced identical results.  Thus it isn't an issue of SGD
> converging.
>
> I also did a parameter scan of lambda and saw no effect.
>
> I also did the standard thing in R with glm and got the expected (correct)
> results.
>
> I haven't looked yet in detail, but I really suspect that the reading of
> the data is horked.  This is exactly how that behaves.

Re: SGD: Logistic regression package in Mahout

Posted by Ted Dunning <te...@gmail.com>.
Rajesh,

In the testing that I did, I ran 100, 1000 and 10,000 passes through the
data.  All produced identical results.  Thus it isn't an issue of SGD
converging.

I also did a parameter scan of lambda and saw no effect.

I also did the standard thing in R with glm and got the expected (correct)
results.

I haven't looked yet in detail, but I really suspect that the reading of
the data is horked.  This is exactly how that behaves.
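
The convergence check described above can be reproduced in miniature. This is only a sketch in plain Python, not Mahout's implementation: the synthetic two-cluster data, the learning rate, the smaller pass counts, and the function names are all assumptions chosen for illustration. If the predictions stop changing as the number of passes grows, convergence is unlikely to be the problem and the input handling deserves a closer look.

```python
import math
import random


def sigmoid(z):
    # Clamp the score so math.exp cannot overflow on extreme values.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_sgd(rows, labels, passes, rate=0.1, seed=0):
    """Plain SGD for binary logistic regression; w[0] is the bias term."""
    rng = random.Random(seed)
    w = [0.0] * (len(rows[0]) + 1)
    order = list(range(len(rows)))
    for _ in range(passes):
        rng.shuffle(order)  # visit examples in a fresh random order each pass
        for i in order:
            x = [1.0] + list(rows[i])
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
            err = labels[i] - p  # gradient of the log-likelihood w.r.t. the score
            w = [wj + rate * err * xj for wj, xj in zip(w, x)]
    return w


def predict(w, row):
    x = [1.0] + list(row)
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x))) > 0.5 else 0


def make_data(n=100, seed=1):
    """Hypothetical, well-separated two-class data (not the Iris set)."""
    rng = random.Random(seed)
    rows, labels = [], []
    for i in range(n):
        label = i % 2
        center = 1.5 if label else -1.5
        rows.append([rng.gauss(center, 0.5), rng.gauss(center, 0.5)])
        labels.append(label)
    return rows, labels


rows, labels = make_data()
for passes in (1, 10, 100):
    w = train_sgd(rows, labels, passes)
    correct = sum(predict(w, r) == y for r, y in zip(rows, labels))
    print(passes, "passes:", correct, "of", len(rows), "correct")
```

On data this cleanly separated, a working learner should classify nearly every row correctly at every pass count, so a result stuck at chance level would point away from the optimizer.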

On Tue, Oct 16, 2012 at 4:49 AM, Rajesh Nikam <ra...@gmail.com> wrote:

> Hi Ted,
>
> I was thinking this might be due to having only 100 instances for
> training.
>
> So I have created a test set with two classes and ~49K instances, and
> included all features as predictors.
> Please find attached sgd.grps.zip with the test file.
>
> mahout trainlogistic --input /usr/local/mahout/trainme/sgd-grps.csv
> --output /usr/local/mahout/trainme/sgd-grps.model --target class
> --categories 2 --features 128 --types n --predictors a1 a2 a3 a4 a5 a6 a7
> a8 a9 a10 a11 a12 a13 a14 a15 a16 a17 a18 a19 a20 a21 a22 a23 a24 a25 a26
> a27 a28 a29 a30 a31 a32 a33 a34 a35 a36 a37 a38 a39 a40 a41 a42 a43 a44 a45
> a46 a47 a48 a49 a50 a51 a52 a53 a54 a55 a56 a57 a58 a59 a60 a61 a62 a63 a64
> a65 a66 a67 a68 a69 a70 a71 a72 a73 a74 a75 a76 a77 a78 a79 a80 a81 a82 a83
> a84 a85 a86 a87 a88 a89 a90 a91 a92 a93 a94 a95 a96 a97 a98 a99 a100 a101
> a102 a103 a104 a105 a106 a107 a108 a109 a110 a111 a112 a113 a114 a115 a116
> a117 a118 a119 a120 a121 a122 a123 a124 a125 a126 a127
>
>
> mahout runlogistic --input /usr/local/mahout/trainme/sgd-grps.csv --model
> /usr/local/mahout/trainme/sgd-grps.model --auc --confusion
>
> The results are still similar: it classifies everything as class_1.
>
> AUC = 0.50
> confusion: [[26563.0, 23006.0], [0.0, 0.0]]
> entropy: [[-0.0, -0.0], [-46.1, -21.4]]
>
> I am not sure why this keeps failing.
>
> Looking forward to your reply.
>
> Thanks
> Rajesh

Fwd: SGD: Logistic regression package in Mahout

Posted by Rajesh Nikam <ra...@gmail.com>.
Hi Ted,

I was thinking this might be due to having only 100 instances for training.

So I have created a test set with two classes and ~49K instances, and
included all features as predictors.
Please find attached sgd.grps.zip with the test file.

mahout trainlogistic --input /usr/local/mahout/trainme/sgd-grps.csv
--output /usr/local/mahout/trainme/sgd-grps.model --target class
--categories 2 --features 128 --types n --predictors a1 a2 a3 a4 a5 a6 a7
a8 a9 a10 a11 a12 a13 a14 a15 a16 a17 a18 a19 a20 a21 a22 a23 a24 a25 a26
a27 a28 a29 a30 a31 a32 a33 a34 a35 a36 a37 a38 a39 a40 a41 a42 a43 a44 a45
a46 a47 a48 a49 a50 a51 a52 a53 a54 a55 a56 a57 a58 a59 a60 a61 a62 a63 a64
a65 a66 a67 a68 a69 a70 a71 a72 a73 a74 a75 a76 a77 a78 a79 a80 a81 a82 a83
a84 a85 a86 a87 a88 a89 a90 a91 a92 a93 a94 a95 a96 a97 a98 a99 a100 a101
a102 a103 a104 a105 a106 a107 a108 a109 a110 a111 a112 a113 a114 a115 a116
a117 a118 a119 a120 a121 a122 a123 a124 a125 a126 a127


mahout runlogistic --input /usr/local/mahout/trainme/sgd-grps.csv --model
/usr/local/mahout/trainme/sgd-grps.model --auc --confusion

The results are still similar: it classifies everything as class_1.

AUC = 0.50
confusion: [[26563.0, 23006.0], [0.0, 0.0]]
entropy: [[-0.0, -0.0], [-46.1, -21.4]]
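
The numbers in this output can be checked by hand. The sketch below uses only the confusion matrix printed above and deliberately makes no assumption about which axis holds the predicted class and which holds the actual class; the function name describe_confusion is hypothetical.

```python
def describe_confusion(matrix):
    """Summarize a 2x2 confusion matrix without assuming its orientation."""
    total = sum(sum(row) for row in matrix)
    # Rows or columns that sum to zero mean one class never shows up on that axis.
    empty_rows = [i for i, row in enumerate(matrix) if sum(row) == 0]
    empty_cols = [
        j for j in range(len(matrix[0])) if sum(row[j] for row in matrix) == 0
    ]
    largest_share = max(max(row) for row in matrix) / total
    return {
        "total": total,
        "empty_rows": empty_rows,
        "empty_cols": empty_cols,
        "largest_cell_share": largest_share,
    }


summary = describe_confusion([[26563.0, 23006.0], [0.0, 0.0]])
print(summary["total"])       # 49569.0, matching the ~49K instances
print(summary["empty_rows"])  # [1]: the second row is entirely zero
print(round(summary["largest_cell_share"], 3))  # 0.536
```

An all-zero row means one class is entirely absent along that axis, either never predicted or never read as a label, depending on how the matrix is laid out. Either reading suggests checking how the target column is being parsed.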

I am not sure why this keeps failing.

Looking forward to your reply.

Thanks
Rajesh



On Tue, Oct 16, 2012 at 3:57 AM, Ted Dunning <te...@gmail.com> wrote:

> I would love to help and will before long.  Just can't do it in the first
> part of this week.

Re: SGD: Logistic regression package in Mahout

Posted by Ted Dunning <te...@gmail.com>.
I would love to help and will before long.  Just can't do it in the first
part of this week.
