Posted to user@mahout.apache.org by Rajesh Nikam <ra...@gmail.com> on 2012/11/01 08:35:03 UTC

Re: SGD: Logistic regression package in Mahout

Hi Mat,

Thanks for pointing out the JIRA link for this case.

Could you help with one more thing?

I have not used Maven to build and run Java classes before. I am looking at
http://maven.apache.org/guides/getting-started/index.html

Could you please point out how to build and run a specific class, such as
OnlineLogisticRegressionTest.java, from Mahout?

Thanks
Rajesh

On Wed, Oct 31, 2012 at 8:15 PM, Mat Kelcey <ma...@gmail.com> wrote:

> Rajesh, Ted has added the test case code already
> https://issues.apache.org/jira/browse/MAHOUT-1107
>
> On 31 October 2012 05:14, Rajesh Nikam <ra...@gmail.com> wrote:
>
> > Hi Ted,
> >
> > Please post an update once the JIRA and test case are uploaded.
> >
> > Looking forward to your reply.
> >
> > Thanks
> > Rajesh
> >
> > On Wed, Oct 31, 2012 at 11:00 AM, Rajesh Nikam <rajeshnikam@gmail.com> wrote:
> >
> > > Hi Ted,
> > >
> > > Thanks for the reply. I will wait for the JIRA and hope to get rid of any
> > > encoding issue.
> > >
> > > Thanks,
> > > Rajesh
> > > On Oct 31, 2012 5:24 AM, "Ted Dunning" <te...@gmail.com> wrote:
> > >
> > >> OK.  I am back up for air.
> > >>
> > >> Rajesh,
> > >>
> > >> As I am sure you know, most folks here contribute on their own time.  I
> > >> have been busy with my day job and unable to help with this until just
> > >> now.
> > >>
> > >> I just wrote a test case that looks at the Iris data set.  The results are
> > >> categorically different from yours.
> > >>
> > >> That substantiates my original feeling that your encoding of the data is
> > >> problematic.  I will file a JIRA and attach a test case that you can look
> > >> at.  Then we can see what the differences are.
> > >>
> > >>
> > >> On Tue, Oct 23, 2012 at 1:28 AM, Rajesh Nikam <ra...@gmail.com>
> > >> wrote:
> > >>
> > >> > Hi,
> > >> >
> > >> > Is there any development happening to fix the issue with SGD generating
> > >> > models that are no better than random prediction?
> > >> >
> > >> > I am not sure why such an issue has not been noticed and raised by others.
> > >> > Maybe this specific algorithm is not used in practical applications.
> > >> >
> > >> > Thanks,
> > >> > Rajesh
> > >> >
> > >> >
> > >> > >>
> > >> > >> On Tue, Oct 16, 2012 at 10:23 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> > >> > >>
> > >> > >>> Rajesh,
> > >> > >>>
> > >> > >>> In the testing that I did, I ran 100, 1000 and 10,000 passes through the
> > >> > >>> data.  All produced identical results.  Thus it isn't an issue of SGD
> > >> > >>> converging.
> > >> > >>>
> > >> > >>> I also did a parameter scan of lambda and saw no effect.
> > >> > >>>
> > >> > >>> I also did the standard thing in R with glm and got the expected (correct)
> > >> > >>> results.
> > >> > >>>
> > >> > >>> I haven't looked yet in detail, but I really suspect that the reading of
> > >> > >>> the data is horked.  This is exactly how that behaves.
> > >> > >>>
> > >> > >>> On Tue, Oct 16, 2012 at 4:49 AM, Rajesh Nikam <rajeshnikam@gmail.com>
> > >> > >>> wrote:
> > >> > >>>
> > >> > >>> > Hi Ted,
> > >> > >>> >
> > >> > >>> > I was thinking this might be due to having only 100 instances for
> > >> > >>> > training.
> > >> > >>> >
> > >> > >>> > So I have created a test set with two classes and ~49K instances, and
> > >> > >>> > included all features as predictors.
> > >> > >>> > Please find attached sgd.grps.zip with the test file.
> > >> > >>> >
> > >> > >>> > mahout trainlogistic --input /usr/local/mahout/trainme/sgd-grps.csv
> > >> > >>> > --output /usr/local/mahout/trainme/sgd-grps.model --target class
> > >> > >>> > --categories 2 --features 128 --types n --predictors a1 a2 a3 a4 a5 a6 a7
> > >> > >>> > a8 a9 a10 a11 a12 a13 a14 a15 a16 a17 a18 a19 a20 a21 a22 a23 a24 a25 a26
> > >> > >>> > a27 a28 a29 a30 a31 a32 a33 a34 a35 a36 a37 a38 a39 a40 a41 a42 a43 a44 a45
> > >> > >>> > a46 a47 a48 a49 a50 a51 a52 a53 a54 a55 a56 a57 a58 a59 a60 a61 a62 a63 a64
> > >> > >>> > a65 a66 a67 a68 a69 a70 a71 a72 a73 a74 a75 a76 a77 a78 a79 a80 a81 a82 a83
> > >> > >>> > a84 a85 a86 a87 a88 a89 a90 a91 a92 a93 a94 a95 a96 a97 a98 a99 a100 a101
> > >> > >>> > a102 a103 a104 a105 a106 a107 a108 a109 a110 a111 a112 a113 a114 a115 a116
> > >> > >>> > a117 a118 a119 a120 a121 a122 a123 a124 a125 a126 a127
> > >> > >>> >
> > >> > >>> >
> > >> > >>> > mahout runlogistic --input /usr/local/mahout/trainme/sgd-grps.csv --model
> > >> > >>> > /usr/local/mahout/trainme/sgd-grps.model --auc --confusion
> > >> > >>> >
> > >> > >>> > Still the results are similar; it classifies everything as class_1.
> > >> > >>> >
> > >> > >>> > AUC = 0.50
> > >> > >>> > confusion: [[26563.0, 23006.0], [0.0, 0.0]]
> > >> > >>> > entropy: [[-0.0, -0.0], [-46.1, -21.4]]
> > >> > >>> >
> > >> > >>> > I am not sure why this is failing all the time.
> > >> > >>> >
> > >> > >>> > Looking forward to your reply.
> > >> > >>> >
> > >> > >>> > Thanks
> > >> > >>> > Rajesh
> > >> > >>> >
> > >> > >>> >
> > >> > >>> >
> > >> > >>> > On Tue, Oct 16, 2012 at 3:57 AM, Ted Dunning <ted.dunning@gmail.com>
> > >> > >>> > wrote:
> > >> > >>> >
> > >> > >>> > > I would love to help and will before long.  Just can't do it in the first
> > >> > >>> > > part of this week.
> > >> > >>> > >
> > >> > >>> > > On Mon, Oct 15, 2012 at 6:28 AM, Rajesh Nikam <rajeshnikam@gmail.com>
> > >> > >>> > > wrote:
> > >> > >>> > >
> > >> > >>> > > > Hello,
> > >> > >>> > > >
> > >> > >>> > > > I have asked the question below, about an issue with using SGD, on the
> > >> > >>> > > > Mahout forum.
> > >> > >>> > > >
> > >> > >>> > > > A similar issue with SGD is reported at
> > >> > >>> > > >
> > >> > >>> > > > http://stackoverflow.com/questions/11221436/using-sgd-classifier-in-mahout
> > >> > >>> > > >
> > >> > >>> > > > The link below also shows similar output:
> > >> > >>> > > >
> > >> > >>> > > > AUC = 0.57
> > >> > >>> > > > confusion: [[27.0, 13.0], [0.0, 0.0]]
> > >> > >>> > > > entropy: [[-0.4, -0.3], [-1.2, -0.7]]
> > >> > >>> > > >
> > >> > >>> > > > http://sujitpal.blogspot.in/2012/09/learning-mahout-classification.html
> > >> > >>> > > >
> > >> > >>> > > > I am still wondering how this model works and how it is used by many.
> > >> > >>> > > > I have not been able to find any pointers on how to use SGD to generate
> > >> > >>> > > > an effective model.
> > >> > >>> > > >
> > >> > >>> > > > Could someone point out what is missing in the input file or the provided
> > >> > >>> > > > parameters?
> > >> > >>> > > >
> > >> > >>> > > > I appreciate your help.
> > >> > >>> > > >
> > >> > >>> > > > Below is a description of the steps that I followed.
> > >> > >>> > > >
> > >> > >>> > > > Please find attached the input files used for the experiment.
> > >> > >>> > > >
> > >> > >>> > > > I am using the Iris Plants Database from Michael Marshall. Please find
> > >> > >>> > > > attached iris.arff.
> > >> > >>> > > > I converted this to a CSV file just by updating the header:
> > >> > >>> > > > iris-3-classes.csv
> > >> > >>> > > >
> > >> > >>> > > > mahout org.apache.mahout.classifier.sgd.TrainLogistic --input
> > >> > >>> > > > /usr/local/mahout/trunk/iris-3-classes.csv --features 4 --output
> > >> > >>> > > > /usr/local/mahout/trunk/iris-3-classes.model --target class --categories 3
> > >> > >>> > > > --predictors sepallength sepalwidth petallength petalwidth --types n
> > >> > >>> > > > >> It gave the following error.
> > >> > >>> > > > Exception in thread "main" java.lang.IllegalArgumentException: Can only
> > >> > >>> > > > call classifyScalar with two categories
> > >> > >>> > > >
> > >> > >>> > > > I then created a CSV with only 2 classes. Please find attached
> > >> > >>> > > > iris-2-classes.csv
> > >> > >>> > > >
> > >> > >>> > > > >> Trained iris-2-classes.csv with SGD
> > >> > >>> > > >
> > >> > >>> > > > mahout org.apache.mahout.classifier.sgd.TrainLogistic --input
> > >> > >>> > > > /usr/local/mahout/trunk/iris-2-classes.csv --features 4 --output
> > >> > >>> > > > /usr/local/mahout/trunk/iris-2-classes.model --target class --categories 2
> > >> > >>> > > > --predictors sepallength sepalwidth petallength petalwidth --types n
> > >> > >>> > > >
> > >> > >>> > > > mahout runlogistic --input /usr/local/mahout/trunk/iris-2-classes.csv
> > >> > >>> > > > --model /usr/local/mahout/trunk/iris-2-classes.model --auc --confusion
> > >> > >>> > > >
> > >> > >>> > > > AUC = 0.14
> > >> > >>> > > > confusion: [[50.0, 50.0], [0.0, 0.0]]
> > >> > >>> > > > entropy: [[-0.6, -0.3], [-0.8, -0.4]]
> > >> > >>> > > >
> > >> > >>> > > > >> AUC seems too poor. Now changed --predictors
> > >> > >>> > > >
> > >> > >>> > > > mahout org.apache.mahout.classifier.sgd.TrainLogistic --input
> > >> > >>> > > > /usr/local/mahout/trunk/iris-2-classes.csv --features 4 --output
> > >> > >>> > > > /usr/local/mahout/trunk/iris-2-classes.model --target class --categories 2
> > >> > >>> > > > --predictors sepalwidth petallength --types n
> > >> > >>> > > >
> > >> > >>> > > > mahout runlogistic --input /usr/local/mahout/trunk/iris-2-classes.csv
> > >> > >>> > > > --model /usr/local/mahout/trunk/iris-2-classes.model --auc --confusion
> > >> > >>> > > > --scores
> > >> > >>> > > >
> > >> > >>> > > > AUC = 0.80
> > >> > >>> > > > confusion: [[50.0, 50.0], [0.0, 0.0]]
> > >> > >>> > > > entropy: [[-0.7, -0.3], [-0.7, -0.4]]
> > >> > >>> > > >
> > >> > >>> > > > This model classifies everything as category 1, which is of no use.
> > >> > >>> > > >
> > >> > >>> > > > Thanks
> > >> > >>> > > > Rajesh
> > >> > >>> > > >
> > >> > >>> > > >
> > >> > >>> > > >
> > >> > >>> > > >
> > >> > >>> > >
> > >> > >>> >
> > >> > >>>
> > >> > >>
> > >> > >>
> > >> > >
> > >> >
> > >>
> > >
> >
>
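
Every confusion matrix quoted in this thread has a second row of zeros. The following is a minimal Python sketch for spotting that pattern automatically. The helper name `empty_axes` is hypothetical, and the thread does not state which axis of the matrix holds the actual labels and which holds the predictions, so the sketch only reports which rows or columns are empty rather than interpreting them.

```python
def empty_axes(confusion):
    """Return (empty_rows, empty_cols): indices whose row or column sums to zero."""
    n_cols = len(confusion[0])
    empty_rows = [i for i, row in enumerate(confusion) if sum(row) == 0]
    empty_cols = [j for j in range(n_cols) if sum(row[j] for row in confusion) == 0]
    return empty_rows, empty_cols


# Matrices copied from the runlogistic output quoted in this thread.
reported = {
    "iris-2-classes": [[50.0, 50.0], [0.0, 0.0]],
    "sgd-grps": [[26563.0, 23006.0], [0.0, 0.0]],
}

for name, matrix in reported.items():
    rows, cols = empty_axes(matrix)
    print(name, "empty rows:", rows, "empty columns:", cols)
```

An empty row means one class index never occurs on the row axis. Depending on the convention, that is either a class the model never predicts or a label that never appears in the data as it was read, so it is worth checking how the target column is being parsed before tuning the learner.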

Re: SGD: Logistic regression package in Mahout

Posted by Ted Dunning <te...@gmail.com>.
The output will be in a report file under core/target, I think.  Look for a
file with OnlineLogisticRegressionTest in the name.  Saving to a separate
file is a fine approach as well.
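
A minimal Python sketch for locating that report, assuming it is run from the Mahout source root after `mvn test`. The function name `find_reports` is hypothetical, and because the exact subdirectory and file extension are not confirmed in this thread, the sketch searches everything under `core/target` for file names containing the test class name.

```python
from pathlib import Path


def find_reports(root, test_name="OnlineLogisticRegressionTest"):
    """Return sorted paths of files under root whose name contains test_name."""
    root = Path(root)
    if not root.is_dir():
        # Nothing to search, for example when the tests have not been run yet.
        return []
    return sorted(p for p in root.rglob("*") if p.is_file() and test_name in p.name)


if __name__ == "__main__":
    # "core/target" is the location suggested in the reply above; adjust as needed.
    for path in find_reports("core/target"):
        print(path)
```

If nothing is printed, the test may not have run, or the build output may live in a different module directory.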

On Thu, Nov 1, 2012 at 6:45 AM, Rajesh Nikam <ra...@gmail.com> wrote:

> Thanks, Ted, for providing the test case; it helped me look into the details
> of the problem that I am facing.
>
> I figured out how to run the test case using Maven:
>
> mvn test -Dtest="org.apache.mahout.classifier.sgd.OnlineLogisticRegressionTest"
>
> However, I could not see the printf output on the console, so I have saved
> the output to a file.
>
> Now I will look at the results and post an update in case of any issue.
>
> Thanks
> Rajesh
>
>

Re: SGD: Logistic regression package in Mahout

Posted by Rajesh Nikam <ra...@gmail.com>.
Thanks, Ted, for providing the test case; it helped me look into the details
of the problem that I am facing.

I figured out how to run the test case using Maven:

mvn test -Dtest="org.apache.mahout.classifier.sgd.OnlineLogisticRegressionTest"

However, I could not see the printf output on the console, so I have saved
the output to a file.

Now I will look at the results and post an update in case of any issue.

Thanks
Rajesh

