Posted to user@mahout.apache.org by Salman Mahmood <sa...@influestor.com> on 2012/08/30 23:49:55 UTC

SGD different confusion matrix for each run

I have noticed that every time I train and test a model using the same data (with the SGD algorithm), I get a different confusion matrix. That is, if I generate a model and look at the confusion matrix, it might say 90% correctly classified instances, but if I generate the model again (with the SAME data for training and testing as before) and test it, the confusion matrix changes and might say 75% correctly classified instances.

Is this a desired behavior? 

Re: SGD different confusion matrix for each run

Posted by Ted Dunning <te...@gmail.com>.
Frankly, you get a better approximation of the underlying distribution of
samples if you sample *with* replacement.  This means just pick a uniform
sample from the training data each time and limit by the number of samples,
not the number of passes through the data.

The idea of SGD is sample-centric and depends on you taking a random sample
from the underlying distribution of training data.  Convergence is expressed
in terms of the number of samples, and the closer you can come to sampling
from the real distribution, the better the process will approximate the
mathematical ideal.  When you have a fixed and finite sample of training
data instead of something that samples from the real distribution, you have
to approximate the underlying distribution using the bootstrap [1], and that
is best done by sampling with replacement rather than by repeated sampling
without replacement.

[1] http://en.wikipedia.org/wiki/Bootstrapping_(statistics)
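
To make that concrete, here is a minimal sketch of a sampling-with-replacement
training loop, assuming the examples are already encoded as Mahout Vectors with
integer class labels.  The Example holder and the OnlineLogisticRegression setup
are illustrative assumptions, not code from this thread; learning-rate and
regularization settings are left at their defaults.

    import java.util.List;
    import java.util.Random;

    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.Vector;

    // Hypothetical holder for one pre-encoded training example.
    class Example {
      final int target;      // class label as an integer
      final Vector features; // encoded feature vector
      Example(int target, Vector features) { this.target = target; this.features = features; }
    }

    public class BootstrapTrainer {
      // Draw uniform samples *with* replacement and stop after a fixed number
      // of samples, not after a fixed number of passes over the data.
      public static OnlineLogisticRegression train(List<Example> examples,
                                                   int numCategories,
                                                   int numFeatures,
                                                   int totalSamples) {
        OnlineLogisticRegression learner =
            new OnlineLogisticRegression(numCategories, numFeatures, new L1());
        Random rand = new Random();
        for (int i = 0; i < totalSamples; i++) {
          Example ex = examples.get(rand.nextInt(examples.size()));
          learner.train(ex.target, ex.features);
        }
        return learner;
      }
    }

Here totalSamples plays the role of the pass count: with 100 training examples,
drawing 10,000 samples corresponds to roughly 100 bootstrap passes.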

On Fri, Aug 31, 2012 at 11:24 PM, Ted Dunning <te...@gmail.com> wrote:

> That would be best, but practically speaking, randomizing once is usually
> OK.  With a tiny data set like this that is in memory anyway, I wouldn't
> take any chances.
>
>
> On Fri, Aug 31, 2012 at 9:08 PM, Lance Norskog <go...@gmail.com> wrote:
>
>> "Try passing through the data 100 times for a start. "
>>
>> And randomize the order each time?
>>
>> On Fri, Aug 31, 2012 at 9:04 AM, Salman Mahmood <sa...@influestor.com>
>> wrote:
>> > Cheers ted. Appreciate the input!
>> >
>> > Sent from my iPhone
>> >
>> > On 31 Aug 2012, at 17:53, Ted Dunning <te...@gmail.com> wrote:
>> >
>> >> OK.
>> >>
>> >> Try passing through the data 100 times for a start.  I think that this
>> is
>> >> likely to fix your problems.
>> >>
>> >> Be warned that AdaptiveLogisticRegression has been misbehaving lately
>> and
>> >> may converge faster than it should.
>> >>
>> >> On Fri, Aug 31, 2012 at 9:33 AM, Salman Mahmood <salman@influestor.com
>> >wrote:
>> >>
>> >>> Thanks a lot ted. Here are the answers:
>> >>> d) Data (news articles from different feeds)
>> >>>        News Article 1: Title : BP Profits Plunge On Massive Asset
>> >>> Write-down
>> >>>                                    Description :BP PLC (BP) Tuesday
>> >>> posted a dramatic fall of 96% in adjusted profit for the
>> >>> second quarter as it wrote down the value of its assets by $5 billion
>> >>> including some U.S. refineries a suspended Alaskan oil project and
>> U.S.
>> >>> shale gas resources
>> >>>
>> >>>        News Article 2: Title : Morgan Stanley Missed Big
>> >>>                                     Description: Why It's Still A
>> >>> Fantastic Short,"By Mike Williams: Though the market responded very
>> >>> positively to Citigroup (C) and Bank of America's (BAC) reserve
>> >>> release-driven earnings ""beats"" last week's Morgan Stanley (MS)
>> earnings
>> >>> report illustrated what happens when a bank doesn't have billions of
>> >>> reserves to release back into earnings. Estimates called for the
>> following:
>> >>> $.43 per share in earnings $.29 per share in earnings ex-DVA (debt
>> value
>> >>> adjustment) $7.7 billion in revenue GAAP results (including the DVA)
>> came
>> >>> in at $.28 per share while ex-DVA earnings were $.16. Revenue was a
>> >>> particular disappointment coming in at $6.95 billion.
>> >>>
>> >>> c) As you can see the data is textual. and I am using title and
>> >>> description as predictor variable and the target variable is the
>> company
>> >>> name a news belongs to.
>> >>>
>> >>> b) I am passing through the data once (at least this is what I
>> think). I
>> >>> folowed the 20newsgroup example code(in java) and dint find that the
>> data
>> >>> was passed more than once.
>> >>> Yes I randomize the order every time.
>> >>>
>> >>> a) I am using AdaptiveLearningRegression (just like 20newsgroup).
>> >>>
>> >>> Thanks!
>> >>>
>> >>>
>> >>>
>> >>> On Aug 31, 2012, at 2:27 PM, Ted Dunning wrote:
>> >>>
>> >>>> First, this is a tiny training set.  You are well outside the
>> intended
>> >>>> application range so you are likely to find less experience in the
>> >>>> community in that range.  That said, the algorithm should still
>> produce
>> >>>> reasonably stable results.
>> >>>>
>> >>>> Here are a few questions:
>> >>>>
>> >>>> a) which class are you using to train your model?  I would start with
>> >>>> OnlineLogisticRegression and experiment with training rate schedules
>> and
>> >>>> amount of regularization to find out how to build a good model.
>> >>>>
>> >>>> b) how many times are you passing through your data?  Do you
>> randomize
>> >>> the
>> >>>> order each time?  These are critical to proper training.  Instead of
>> >>>> randomizing order, you could just sample a data point at random and
>> not
>> >>>> worry about using a complete permutation of the data.  With such a
>> tiny
>> >>>> data set, you will need to pass through the data many times ...
>> possibly
>> >>>> hundreds of times or more.
>> >>>>
>> >>>> c) what kind of data do you have?  Sparse?  Dense?  How many
>> variables?
>> >>>> What kind?
>> >>>>
>> >>>> d) can you post your data?
>> >>>>
>> >>>>
>> >>>> On Fri, Aug 31, 2012 at 5:03 AM, Salman Mahmood <
>> salman@influestor.com
>> >>>> wrote:
>> >>>>
>> >>>>> Thanks a lot lance. Let me elaborate the problem if it was a bit
>> >>> confusing.
>> >>>>>
>> >>>>> Assuming I am making a binary classifier using SGD. I have got 50
>> >>> positive
>> >>>>> and 50 negative examples to train the classifier. After training and
>> >>>>> testing the model, the confusion matrix tells you the number of
>> >>> correctly
>> >>>>> and incorrectly classified instances. Let's assume I got 85%
>> correct and
>> >>>>> 15% incorrect instances.
>> >>>>>
>> >>>>> Now if I run my program again using the same 50 negative and 50
>> positive
>> >>>>> examples, then according to my knowledge the classifier should
>> yield the
>> >>>>> same results as before (cause not even a single training or testing
>> data
>> >>>>> was changed), but this is not the case. I get different results for
>> >>>>> different runs. The confusion matrix figures changes each time I
>> >>> generate a
>> >>>>> model keeping the data constant. What I do is, I generate a model
>> >>> several
>> >>>>> times and keep a look for the accuracy, and if it is above 90%,
>> then I
>> >>> stop
>> >>>>> running the code and hence an accurate model is created.
>> >>>>>
>> >>>>> So what you are saying is to shuffle my data before I use it for
>> >>> training
>> >>>>> and testing?
>> >>>>> Thanks!
>> >>>>> On Aug 31, 2012, at 10:33 AM, Lance Norskog wrote:
>> >>>>>
>> >>>>>> Now I remember: SGD wants its data input in random order. You need
>> to
>> >>>>>> permute the order of your data.
>> >>>>>>
>> >>>>>> If that does not help, another trick: for each data point, randomly
>> >>>>>> generate 5 or 10 or 20 points which are close. And again, randomly
>> >>>>>> permute the entire input set.
>> >>>>>>
>> >>>>>> On Thu, Aug 30, 2012 at 5:23 PM, Lance Norskog <go...@gmail.com>
>> >>>>> wrote:
>> >>>>>>> The more data you have, the closer each run will be. How much
>> data do
>> >>>>> you have?
>> >>>>>>>
>> >>>>>>> On Thu, Aug 30, 2012 at 2:49 PM, Salman Mahmood <
>> >>> salman@influestor.com>
>> >>>>> wrote:
>> >>>>>>>> I have noticed that every time I train and test a model using the
>> >>> same
>> >>>>> data (in SGD algo), I get different confusion matrix. Meaning, if I
>> >>>>> generate a model and look at the confusion matrix, it might say 90%
>> >>>>> correctly classified instances, but if I generate the model again
>> (with
>> >>> the
>> >>>>> SAME data for training and testing as before) and test it, the
>> confusion
>> >>>>> matrix changes and it might say 75% correctly classified instances.
>> >>>>>>>>
>> >>>>>>>> Is this a desired behavior?
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> --
>> >>>>>>> Lance Norskog
>> >>>>>>> goksron@gmail.com
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> --
>> >>>>>> Lance Norskog
>> >>>>>> goksron@gmail.com
>> >>>>>
>> >>>>>
>> >>>
>> >>>
>>
>>
>>
>> --
>> Lance Norskog
>> goksron@gmail.com
>>
>
>

Re: SGD different confusion matrix for each run

Posted by Ted Dunning <te...@gmail.com>.
That would be best, but practically speaking, randomizing once is usually
OK.  With a tiny data set like this that is in memory anyway, I wouldn't
take any chances.
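
As a sketch of how the two suggestions combine, assuming the same hypothetical
Example holder and OnlineLogisticRegression learner as in the sketch further up:
shuffle the in-memory list once, then make many sequential passes over it.

    import java.util.Collections;
    import java.util.List;

    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;

    public class MultiPassTrainer {
      // Shuffle the in-memory examples once, then make many sequential passes.
      // Example is the hypothetical (target, features) holder from the earlier sketch.
      public static void train(OnlineLogisticRegression learner,
                               List<Example> examples,
                               int passes) {
        Collections.shuffle(examples);   // randomize the order a single time
        for (int pass = 0; pass < passes; pass++) {
          for (Example ex : examples) {
            learner.train(ex.target, ex.features);
          }
        }
      }
    }

Reshuffling inside the pass loop would randomize the order on every pass
instead; for a tiny in-memory data set that is cheap to do.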

On Fri, Aug 31, 2012 at 9:08 PM, Lance Norskog <go...@gmail.com> wrote:

> "Try passing through the data 100 times for a start. "
>
> And randomize the order each time?
>
> On Fri, Aug 31, 2012 at 9:04 AM, Salman Mahmood <sa...@influestor.com>
> wrote:
> > Cheers ted. Appreciate the input!
> >
> > Sent from my iPhone
> >
> > On 31 Aug 2012, at 17:53, Ted Dunning <te...@gmail.com> wrote:
> >
> >> OK.
> >>
> >> Try passing through the data 100 times for a start.  I think that this
> is
> >> likely to fix your problems.
> >>
> >> Be warned that AdaptiveLogisticRegression has been misbehaving lately
> and
> >> may converge faster than it should.
> >>
> >> On Fri, Aug 31, 2012 at 9:33 AM, Salman Mahmood <salman@influestor.com
> >wrote:
> >>
> >>> Thanks a lot ted. Here are the answers:
> >>> d) Data (news articles from different feeds)
> >>>        News Article 1: Title : BP Profits Plunge On Massive Asset
> >>> Write-down
> >>>                                    Description :BP PLC (BP) Tuesday
> >>> posted a dramatic fall of 96% in adjusted profit for the
> >>> second quarter as it wrote down the value of its assets by $5 billion
> >>> including some U.S. refineries a suspended Alaskan oil project and U.S.
> >>> shale gas resources
> >>>
> >>>        News Article 2: Title : Morgan Stanley Missed Big
> >>>                                     Description: Why It's Still A
> >>> Fantastic Short,"By Mike Williams: Though the market responded very
> >>> positively to Citigroup (C) and Bank of America's (BAC) reserve
> >>> release-driven earnings ""beats"" last week's Morgan Stanley (MS)
> earnings
> >>> report illustrated what happens when a bank doesn't have billions of
> >>> reserves to release back into earnings. Estimates called for the
> following:
> >>> $.43 per share in earnings $.29 per share in earnings ex-DVA (debt
> value
> >>> adjustment) $7.7 billion in revenue GAAP results (including the DVA)
> came
> >>> in at $.28 per share while ex-DVA earnings were $.16. Revenue was a
> >>> particular disappointment coming in at $6.95 billion.
> >>>
> >>> c) As you can see the data is textual. and I am using title and
> >>> description as predictor variable and the target variable is the
> company
> >>> name a news belongs to.
> >>>
> >>> b) I am passing through the data once (at least this is what I think).
> I
> >>> folowed the 20newsgroup example code(in java) and dint find that the
> data
> >>> was passed more than once.
> >>> Yes I randomize the order every time.
> >>>
> >>> a) I am using AdaptiveLearningRegression (just like 20newsgroup).
> >>>
> >>> Thanks!
> >>>
> >>>
> >>>
> >>> On Aug 31, 2012, at 2:27 PM, Ted Dunning wrote:
> >>>
> >>>> First, this is a tiny training set.  You are well outside the intended
> >>>> application range so you are likely to find less experience in the
> >>>> community in that range.  That said, the algorithm should still
> produce
> >>>> reasonably stable results.
> >>>>
> >>>> Here are a few questions:
> >>>>
> >>>> a) which class are you using to train your model?  I would start with
> >>>> OnlineLogisticRegression and experiment with training rate schedules
> and
> >>>> amount of regularization to find out how to build a good model.
> >>>>
> >>>> b) how many times are you passing through your data?  Do you randomize
> >>> the
> >>>> order each time?  These are critical to proper training.  Instead of
> >>>> randomizing order, you could just sample a data point at random and
> not
> >>>> worry about using a complete permutation of the data.  With such a
> tiny
> >>>> data set, you will need to pass through the data many times ...
> possibly
> >>>> hundreds of times or more.
> >>>>
> >>>> c) what kind of data do you have?  Sparse?  Dense?  How many
> variables?
> >>>> What kind?
> >>>>
> >>>> d) can you post your data?
> >>>>
> >>>>
> >>>> On Fri, Aug 31, 2012 at 5:03 AM, Salman Mahmood <
> salman@influestor.com
> >>>> wrote:
> >>>>
> >>>>> Thanks a lot lance. Let me elaborate the problem if it was a bit
> >>> confusing.
> >>>>>
> >>>>> Assuming I am making a binary classifier using SGD. I have got 50
> >>> positive
> >>>>> and 50 negative examples to train the classifier. After training and
> >>>>> testing the model, the confusion matrix tells you the number of
> >>> correctly
> >>>>> and incorrectly classified instances. Let's assume I got 85% correct
> and
> >>>>> 15% incorrect instances.
> >>>>>
> >>>>> Now if I run my program again using the same 50 negative and 50
> positive
> >>>>> examples, then according to my knowledge the classifier should yield
> the
> >>>>> same results as before (cause not even a single training or testing
> data
> >>>>> was changed), but this is not the case. I get different results for
> >>>>> different runs. The confusion matrix figures changes each time I
> >>> generate a
> >>>>> model keeping the data constant. What I do is, I generate a model
> >>> several
> >>>>> times and keep a look for the accuracy, and if it is above 90%, then
> I
> >>> stop
> >>>>> running the code and hence an accurate model is created.
> >>>>>
> >>>>> So what you are saying is to shuffle my data before I use it for
> >>> training
> >>>>> and testing?
> >>>>> Thanks!
> >>>>> On Aug 31, 2012, at 10:33 AM, Lance Norskog wrote:
> >>>>>
> >>>>>> Now I remember: SGD wants its data input in random order. You need
> to
> >>>>>> permute the order of your data.
> >>>>>>
> >>>>>> If that does not help, another trick: for each data point, randomly
> >>>>>> generate 5 or 10 or 20 points which are close. And again, randomly
> >>>>>> permute the entire input set.
> >>>>>>
> >>>>>> On Thu, Aug 30, 2012 at 5:23 PM, Lance Norskog <go...@gmail.com>
> >>>>> wrote:
> >>>>>>> The more data you have, the closer each run will be. How much data
> do
> >>>>> you have?
> >>>>>>>
> >>>>>>> On Thu, Aug 30, 2012 at 2:49 PM, Salman Mahmood <
> >>> salman@influestor.com>
> >>>>> wrote:
> >>>>>>>> I have noticed that every time I train and test a model using the
> >>> same
> >>>>> data (in SGD algo), I get different confusion matrix. Meaning, if I
> >>>>> generate a model and look at the confusion matrix, it might say 90%
> >>>>> correctly classified instances, but if I generate the model again
> (with
> >>> the
> >>>>> SAME data for training and testing as before) and test it, the
> confusion
> >>>>> matrix changes and it might say 75% correctly classified instances.
> >>>>>>>>
> >>>>>>>> Is this a desired behavior?
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> Lance Norskog
> >>>>>>> goksron@gmail.com
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> Lance Norskog
> >>>>>> goksron@gmail.com
> >>>>>
> >>>>>
> >>>
> >>>
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>

Re: SGD different confusion matrix for each run

Posted by Lance Norskog <go...@gmail.com>.
"Try passing through the data 100 times for a start. "

And randomize the order each time?

On Fri, Aug 31, 2012 at 9:04 AM, Salman Mahmood <sa...@influestor.com> wrote:
> Cheers ted. Appreciate the input!
>
> Sent from my iPhone
>
> On 31 Aug 2012, at 17:53, Ted Dunning <te...@gmail.com> wrote:
>
>> OK.
>>
>> Try passing through the data 100 times for a start.  I think that this is
>> likely to fix your problems.
>>
>> Be warned that AdaptiveLogisticRegression has been misbehaving lately and
>> may converge faster than it should.
>>
>> On Fri, Aug 31, 2012 at 9:33 AM, Salman Mahmood <sa...@influestor.com>wrote:
>>
>>> Thanks a lot ted. Here are the answers:
>>> d) Data (news articles from different feeds)
>>>        News Article 1: Title : BP Profits Plunge On Massive Asset
>>> Write-down
>>>                                    Description :BP PLC (BP) Tuesday
>>> posted a dramatic fall of 96% in adjusted profit for the
>>> second quarter as it wrote down the value of its assets by $5 billion
>>> including some U.S. refineries a suspended Alaskan oil project and U.S.
>>> shale gas resources
>>>
>>>        News Article 2: Title : Morgan Stanley Missed Big
>>>                                     Description: Why It's Still A
>>> Fantastic Short,"By Mike Williams: Though the market responded very
>>> positively to Citigroup (C) and Bank of America's (BAC) reserve
>>> release-driven earnings ""beats"" last week's Morgan Stanley (MS) earnings
>>> report illustrated what happens when a bank doesn't have billions of
>>> reserves to release back into earnings. Estimates called for the following:
>>> $.43 per share in earnings $.29 per share in earnings ex-DVA (debt value
>>> adjustment) $7.7 billion in revenue GAAP results (including the DVA) came
>>> in at $.28 per share while ex-DVA earnings were $.16. Revenue was a
>>> particular disappointment coming in at $6.95 billion.
>>>
>>> c) As you can see the data is textual. and I am using title and
>>> description as predictor variable and the target variable is the company
>>> name a news belongs to.
>>>
>>> b) I am passing through the data once (at least this is what I think). I
>>> folowed the 20newsgroup example code(in java) and dint find that the data
>>> was passed more than once.
>>> Yes I randomize the order every time.
>>>
>>> a) I am using AdaptiveLearningRegression (just like 20newsgroup).
>>>
>>> Thanks!
>>>
>>>
>>>
>>> On Aug 31, 2012, at 2:27 PM, Ted Dunning wrote:
>>>
>>>> First, this is a tiny training set.  You are well outside the intended
>>>> application range so you are likely to find less experience in the
>>>> community in that range.  That said, the algorithm should still produce
>>>> reasonably stable results.
>>>>
>>>> Here are a few questions:
>>>>
>>>> a) which class are you using to train your model?  I would start with
>>>> OnlineLogisticRegression and experiment with training rate schedules and
>>>> amount of regularization to find out how to build a good model.
>>>>
>>>> b) how many times are you passing through your data?  Do you randomize
>>> the
>>>> order each time?  These are critical to proper training.  Instead of
>>>> randomizing order, you could just sample a data point at random and not
>>>> worry about using a complete permutation of the data.  With such a tiny
>>>> data set, you will need to pass through the data many times ... possibly
>>>> hundreds of times or more.
>>>>
>>>> c) what kind of data do you have?  Sparse?  Dense?  How many variables?
>>>> What kind?
>>>>
>>>> d) can you post your data?
>>>>
>>>>
>>>> On Fri, Aug 31, 2012 at 5:03 AM, Salman Mahmood <salman@influestor.com
>>>> wrote:
>>>>
>>>>> Thanks a lot lance. Let me elaborate the problem if it was a bit
>>> confusing.
>>>>>
>>>>> Assuming I am making a binary classifier using SGD. I have got 50
>>> positive
>>>>> and 50 negative examples to train the classifier. After training and
>>>>> testing the model, the confusion matrix tells you the number of
>>> correctly
>>>>> and incorrectly classified instances. Let's assume I got 85% correct and
>>>>> 15% incorrect instances.
>>>>>
>>>>> Now if I run my program again using the same 50 negative and 50 positive
>>>>> examples, then according to my knowledge the classifier should yield the
>>>>> same results as before (cause not even a single training or testing data
>>>>> was changed), but this is not the case. I get different results for
>>>>> different runs. The confusion matrix figures changes each time I
>>> generate a
>>>>> model keeping the data constant. What I do is, I generate a model
>>> several
>>>>> times and keep a look for the accuracy, and if it is above 90%, then I
>>> stop
>>>>> running the code and hence an accurate model is created.
>>>>>
>>>>> So what you are saying is to shuffle my data before I use it for
>>> training
>>>>> and testing?
>>>>> Thanks!
>>>>> On Aug 31, 2012, at 10:33 AM, Lance Norskog wrote:
>>>>>
>>>>>> Now I remember: SGD wants its data input in random order. You need to
>>>>>> permute the order of your data.
>>>>>>
>>>>>> If that does not help, another trick: for each data point, randomly
>>>>>> generate 5 or 10 or 20 points which are close. And again, randomly
>>>>>> permute the entire input set.
>>>>>>
>>>>>> On Thu, Aug 30, 2012 at 5:23 PM, Lance Norskog <go...@gmail.com>
>>>>> wrote:
>>>>>>> The more data you have, the closer each run will be. How much data do
>>>>> you have?
>>>>>>>
>>>>>>> On Thu, Aug 30, 2012 at 2:49 PM, Salman Mahmood <
>>> salman@influestor.com>
>>>>> wrote:
>>>>>>>> I have noticed that every time I train and test a model using the
>>> same
>>>>> data (in SGD algo), I get different confusion matrix. Meaning, if I
>>>>> generate a model and look at the confusion matrix, it might say 90%
>>>>> correctly classified instances, but if I generate the model again (with
>>> the
>>>>> SAME data for training and testing as before) and test it, the confusion
>>>>> matrix changes and it might say 75% correctly classified instances.
>>>>>>>>
>>>>>>>> Is this a desired behavior?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Lance Norskog
>>>>>>> goksron@gmail.com
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Lance Norskog
>>>>>> goksron@gmail.com
>>>>>
>>>>>
>>>
>>>



-- 
Lance Norskog
goksron@gmail.com

Re: SGD different confusion matrix for each run

Posted by Salman Mahmood <sa...@influestor.com>.
Cheers ted. Appreciate the input!

Sent from my iPhone

On 31 Aug 2012, at 17:53, Ted Dunning <te...@gmail.com> wrote:

> OK.
>
> Try passing through the data 100 times for a start.  I think that this is
> likely to fix your problems.
>
> Be warned that AdaptiveLogisticRegression has been misbehaving lately and
> may converge faster than it should.
>
> On Fri, Aug 31, 2012 at 9:33 AM, Salman Mahmood <sa...@influestor.com>wrote:
>
>> Thanks a lot ted. Here are the answers:
>> d) Data (news articles from different feeds)
>>        News Article 1: Title : BP Profits Plunge On Massive Asset
>> Write-down
>>                                    Description :BP PLC (BP) Tuesday
>> posted a dramatic fall of 96% in adjusted profit for the
>> second quarter as it wrote down the value of its assets by $5 billion
>> including some U.S. refineries a suspended Alaskan oil project and U.S.
>> shale gas resources
>>
>>        News Article 2: Title : Morgan Stanley Missed Big
>>                                     Description: Why It's Still A
>> Fantastic Short,"By Mike Williams: Though the market responded very
>> positively to Citigroup (C) and Bank of America's (BAC) reserve
>> release-driven earnings ""beats"" last week's Morgan Stanley (MS) earnings
>> report illustrated what happens when a bank doesn't have billions of
>> reserves to release back into earnings. Estimates called for the following:
>> $.43 per share in earnings $.29 per share in earnings ex-DVA (debt value
>> adjustment) $7.7 billion in revenue GAAP results (including the DVA) came
>> in at $.28 per share while ex-DVA earnings were $.16. Revenue was a
>> particular disappointment coming in at $6.95 billion.
>>
>> c) As you can see the data is textual. and I am using title and
>> description as predictor variable and the target variable is the company
>> name a news belongs to.
>>
>> b) I am passing through the data once (at least this is what I think). I
>> folowed the 20newsgroup example code(in java) and dint find that the data
>> was passed more than once.
>> Yes I randomize the order every time.
>>
>> a) I am using AdaptiveLearningRegression (just like 20newsgroup).
>>
>> Thanks!
>>
>>
>>
>> On Aug 31, 2012, at 2:27 PM, Ted Dunning wrote:
>>
>>> First, this is a tiny training set.  You are well outside the intended
>>> application range so you are likely to find less experience in the
>>> community in that range.  That said, the algorithm should still produce
>>> reasonably stable results.
>>>
>>> Here are a few questions:
>>>
>>> a) which class are you using to train your model?  I would start with
>>> OnlineLogisticRegression and experiment with training rate schedules and
>>> amount of regularization to find out how to build a good model.
>>>
>>> b) how many times are you passing through your data?  Do you randomize
>> the
>>> order each time?  These are critical to proper training.  Instead of
>>> randomizing order, you could just sample a data point at random and not
>>> worry about using a complete permutation of the data.  With such a tiny
>>> data set, you will need to pass through the data many times ... possibly
>>> hundreds of times or more.
>>>
>>> c) what kind of data do you have?  Sparse?  Dense?  How many variables?
>>> What kind?
>>>
>>> d) can you post your data?
>>>
>>>
>>> On Fri, Aug 31, 2012 at 5:03 AM, Salman Mahmood <salman@influestor.com
>>> wrote:
>>>
>>>> Thanks a lot lance. Let me elaborate the problem if it was a bit
>> confusing.
>>>>
>>>> Assuming I am making a binary classifier using SGD. I have got 50
>> positive
>>>> and 50 negative examples to train the classifier. After training and
>>>> testing the model, the confusion matrix tells you the number of
>> correctly
>>>> and incorrectly classified instances. Let's assume I got 85% correct and
>>>> 15% incorrect instances.
>>>>
>>>> Now if I run my program again using the same 50 negative and 50 positive
>>>> examples, then according to my knowledge the classifier should yield the
>>>> same results as before (cause not even a single training or testing data
>>>> was changed), but this is not the case. I get different results for
>>>> different runs. The confusion matrix figures changes each time I
>> generate a
>>>> model keeping the data constant. What I do is, I generate a model
>> several
>>>> times and keep a look for the accuracy, and if it is above 90%, then I
>> stop
>>>> running the code and hence an accurate model is created.
>>>>
>>>> So what you are saying is to shuffle my data before I use it for
>> training
>>>> and testing?
>>>> Thanks!
>>>> On Aug 31, 2012, at 10:33 AM, Lance Norskog wrote:
>>>>
>>>>> Now I remember: SGD wants its data input in random order. You need to
>>>>> permute the order of your data.
>>>>>
>>>>> If that does not help, another trick: for each data point, randomly
>>>>> generate 5 or 10 or 20 points which are close. And again, randomly
>>>>> permute the entire input set.
>>>>>
>>>>> On Thu, Aug 30, 2012 at 5:23 PM, Lance Norskog <go...@gmail.com>
>>>> wrote:
>>>>>> The more data you have, the closer each run will be. How much data do
>>>> you have?
>>>>>>
>>>>>> On Thu, Aug 30, 2012 at 2:49 PM, Salman Mahmood <
>> salman@influestor.com>
>>>> wrote:
>>>>>>> I have noticed that every time I train and test a model using the
>> same
>>>> data (in SGD algo), I get different confusion matrix. Meaning, if I
>>>> generate a model and look at the confusion matrix, it might say 90%
>>>> correctly classified instances, but if I generate the model again (with
>> the
>>>> SAME data for training and testing as before) and test it, the confusion
>>>> matrix changes and it might say 75% correctly classified instances.
>>>>>>>
>>>>>>> Is this a desired behavior?
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Lance Norskog
>>>>>> goksron@gmail.com
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Lance Norskog
>>>>> goksron@gmail.com
>>>>
>>>>
>>
>>

Re: SGD different confusion matrix for each run

Posted by Ted Dunning <te...@gmail.com>.
OK.

Try passing through the data 100 times for a start.  I think that this is
likely to fix your problems.

Be warned that AdaptiveLogisticRegression has been misbehaving lately and
may converge faster than it should.

On Fri, Aug 31, 2012 at 9:33 AM, Salman Mahmood <sa...@influestor.com>wrote:

> Thanks a lot ted. Here are the answers:
> d) Data (news articles from different feeds)
>         News Article 1: Title : BP Profits Plunge On Massive Asset
> Write-down
>                                     Description :BP PLC (BP) Tuesday
> posted a dramatic fall of 96% in adjusted profit for the
> second quarter as it wrote down the value of its assets by $5 billion
> including some U.S. refineries a suspended Alaskan oil project and U.S.
> shale gas resources
>
>         News Article 2: Title : Morgan Stanley Missed Big
>                                      Description: Why It's Still A
> Fantastic Short,"By Mike Williams: Though the market responded very
> positively to Citigroup (C) and Bank of America's (BAC) reserve
> release-driven earnings ""beats"" last week's Morgan Stanley (MS) earnings
> report illustrated what happens when a bank doesn't have billions of
> reserves to release back into earnings. Estimates called for the following:
> $.43 per share in earnings $.29 per share in earnings ex-DVA (debt value
> adjustment) $7.7 billion in revenue GAAP results (including the DVA) came
> in at $.28 per share while ex-DVA earnings were $.16. Revenue was a
> particular disappointment coming in at $6.95 billion.
>
> c) As you can see the data is textual. and I am using title and
> description as predictor variable and the target variable is the company
> name a news belongs to.
>
> b) I am passing through the data once (at least this is what I think). I
> folowed the 20newsgroup example code(in java) and dint find that the data
> was passed more than once.
> Yes I randomize the order every time.
>
> a) I am using AdaptiveLearningRegression (just like 20newsgroup).
>
> Thanks!
>
>
>
> On Aug 31, 2012, at 2:27 PM, Ted Dunning wrote:
>
> > First, this is a tiny training set.  You are well outside the intended
> > application range so you are likely to find less experience in the
> > community in that range.  That said, the algorithm should still produce
> > reasonably stable results.
> >
> > Here are a few questions:
> >
> > a) which class are you using to train your model?  I would start with
> > OnlineLogisticRegression and experiment with training rate schedules and
> > amount of regularization to find out how to build a good model.
> >
> > b) how many times are you passing through your data?  Do you randomize
> the
> > order each time?  These are critical to proper training.  Instead of
> > randomizing order, you could just sample a data point at random and not
> > worry about using a complete permutation of the data.  With such a tiny
> > data set, you will need to pass through the data many times ... possibly
> > hundreds of times or more.
> >
> > c) what kind of data do you have?  Sparse?  Dense?  How many variables?
> > What kind?
> >
> > d) can you post your data?
> >
> >
> > On Fri, Aug 31, 2012 at 5:03 AM, Salman Mahmood <salman@influestor.com
> >wrote:
> >
> >> Thanks a lot lance. Let me elaborate the problem if it was a bit
> confusing.
> >>
> >> Assuming I am making a binary classifier using SGD. I have got 50
> positive
> >> and 50 negative examples to train the classifier. After training and
> >> testing the model, the confusion matrix tells you the number of
> correctly
> >> and incorrectly classified instances. Let's assume I got 85% correct and
> >> 15% incorrect instances.
> >>
> >> Now if I run my program again using the same 50 negative and 50 positive
> >> examples, then according to my knowledge the classifier should yield the
> >> same results as before (cause not even a single training or testing data
> >> was changed), but this is not the case. I get different results for
> >> different runs. The confusion matrix figures changes each time I
> generate a
> >> model keeping the data constant. What I do is, I generate a model
> several
> >> times and keep a look for the accuracy, and if it is above 90%, then I
> stop
> >> running the code and hence an accurate model is created.
> >>
> >> So what you are saying is to shuffle my data before I use it for
> training
> >> and testing?
> >> Thanks!
> >> On Aug 31, 2012, at 10:33 AM, Lance Norskog wrote:
> >>
> >>> Now I remember: SGD wants its data input in random order. You need to
> >>> permute the order of your data.
> >>>
> >>> If that does not help, another trick: for each data point, randomly
> >>> generate 5 or 10 or 20 points which are close. And again, randomly
> >>> permute the entire input set.
> >>>
> >>> On Thu, Aug 30, 2012 at 5:23 PM, Lance Norskog <go...@gmail.com>
> >> wrote:
> >>>> The more data you have, the closer each run will be. How much data do
> >> you have?
> >>>>
> >>>> On Thu, Aug 30, 2012 at 2:49 PM, Salman Mahmood <
> salman@influestor.com>
> >> wrote:
> >>>>> I have noticed that every time I train and test a model using the
> same
> >> data (in SGD algo), I get different confusion matrix. Meaning, if I
> >> generate a model and look at the confusion matrix, it might say 90%
> >> correctly classified instances, but if I generate the model again (with
> the
> >> SAME data for training and testing as before) and test it, the confusion
> >> matrix changes and it might say 75% correctly classified instances.
> >>>>>
> >>>>> Is this a desired behavior?
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Lance Norskog
> >>>> goksron@gmail.com
> >>>
> >>>
> >>>
> >>> --
> >>> Lance Norskog
> >>> goksron@gmail.com
> >>
> >>
>
>

Re: SGD different confusion matrix for each run

Posted by Salman Mahmood <sa...@influestor.com>.
Thanks a lot, Ted. Here are the answers:
d) Data (news articles from different feeds)

	News Article 1:
	Title: BP Profits Plunge On Massive Asset Write-down
	Description: BP PLC (BP) Tuesday posted a dramatic fall of 96% in adjusted profit for the second quarter as it wrote down the value of its assets by $5 billion including some U.S. refineries a suspended Alaskan oil project and U.S. shale gas resources

	News Article 2:
	Title: Morgan Stanley Missed Big
	Description: Why It's Still A Fantastic Short,"By Mike Williams: Though the market responded very positively to Citigroup (C) and Bank of America's (BAC) reserve release-driven earnings ""beats"" last week's Morgan Stanley (MS) earnings report illustrated what happens when a bank doesn't have billions of reserves to release back into earnings. Estimates called for the following: $.43 per share in earnings $.29 per share in earnings ex-DVA (debt value adjustment) $7.7 billion in revenue GAAP results (including the DVA) came in at $.28 per share while ex-DVA earnings were $.16. Revenue was a particular disappointment coming in at $6.95 billion.

c) As you can see, the data is textual, and I am using the title and description as predictor variables; the target variable is the company the news article belongs to.

b) I am passing through the data once (at least this is what I think). I followed the 20newsgroups example code (in Java) and didn't find that the data was passed more than once.
Yes, I randomize the order every time.

a) I am using AdaptiveLogisticRegression (just like the 20newsgroups example).

Thanks!
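
For reference, here is one plausible way to encode a title and description like the above into a hashed feature vector, loosely following the 20 newsgroups example. The encoder classes are from Mahout's org.apache.mahout.vectorizer.encoders package; the tokenization, feature-space size, and label bookkeeping are simplified assumptions rather than the code actually used here.

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.ConstantValueEncoder;
    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

    public class ArticleEncoder {
      private static final int FEATURES = 10000;  // size of the hashed feature space (illustrative)

      private final StaticWordValueEncoder words = new StaticWordValueEncoder("text");
      private final ConstantValueEncoder bias = new ConstantValueEncoder("intercept");
      private final Map<String, Integer> labels = new HashMap<String, Integer>();  // company -> class id

      // Hash the title and description tokens into a sparse feature vector.
      public Vector encode(String title, String description) {
        Vector v = new RandomAccessSparseVector(FEATURES);
        bias.addToVector("", 1.0, v);  // constant intercept term
        for (String token : (title + " " + description).toLowerCase().split("\\W+")) {
          if (!token.isEmpty()) {
            words.addToVector(token, 1.0, v);
          }
        }
        return v;
      }

      // Map each target company name to a stable integer class id.
      public int label(String company) {
        Integer id = labels.get(company);
        if (id == null) {
          id = labels.size();
          labels.put(company, id);
        }
        return id;
      }
    }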


  
On Aug 31, 2012, at 2:27 PM, Ted Dunning wrote:

> First, this is a tiny training set.  You are well outside the intended
> application range so you are likely to find less experience in the
> community in that range.  That said, the algorithm should still produce
> reasonably stable results.
> 
> Here are a few questions:
> 
> a) which class are you using to train your model?  I would start with
> OnlineLogisticRegression and experiment with training rate schedules and
> amount of regularization to find out how to build a good model.
> 
> b) how many times are you passing through your data?  Do you randomize the
> order each time?  These are critical to proper training.  Instead of
> randomizing order, you could just sample a data point at random and not
> worry about using a complete permutation of the data.  With such a tiny
> data set, you will need to pass through the data many times ... possibly
> hundreds of times or more.
> 
> c) what kind of data do you have?  Sparse?  Dense?  How many variables?
> What kind?
> 
> d) can you post your data?
> 
> 
> On Fri, Aug 31, 2012 at 5:03 AM, Salman Mahmood <sa...@influestor.com>wrote:
> 
>> Thanks a lot lance. Let me elaborate the problem if it was a bit confusing.
>> 
>> Assuming I am making a binary classifier using SGD. I have got 50 positive
>> and 50 negative examples to train the classifier. After training and
>> testing the model, the confusion matrix tells you the number of correctly
>> and incorrectly classified instances. Let's assume I got 85% correct and
>> 15% incorrect instances.
>> 
>> Now if I run my program again using the same 50 negative and 50 positive
>> examples, then according to my knowledge the classifier should yield the
>> same results as before (cause not even a single training or testing data
>> was changed), but this is not the case. I get different results for
>> different runs. The confusion matrix figures changes each time I generate a
>> model keeping the data constant. What I do is, I generate a model several
>> times and keep a look for the accuracy, and if it is above 90%, then I stop
>> running the code and hence an accurate model is created.
>> 
>> So what you are saying is to shuffle my data before I use it for training
>> and testing?
>> Thanks!
>> On Aug 31, 2012, at 10:33 AM, Lance Norskog wrote:
>> 
>>> Now I remember: SGD wants its data input in random order. You need to
>>> permute the order of your data.
>>> 
>>> If that does not help, another trick: for each data point, randomly
>>> generate 5 or 10 or 20 points which are close. And again, randomly
>>> permute the entire input set.
>>> 
>>> On Thu, Aug 30, 2012 at 5:23 PM, Lance Norskog <go...@gmail.com>
>> wrote:
>>>> The more data you have, the closer each run will be. How much data do
>> you have?
>>>> 
>>>> On Thu, Aug 30, 2012 at 2:49 PM, Salman Mahmood <sa...@influestor.com>
>> wrote:
>>>>> I have noticed that every time I train and test a model using the same
>> data (in SGD algo), I get different confusion matrix. Meaning, if I
>> generate a model and look at the confusion matrix, it might say 90%
>> correctly classified instances, but if I generate the model again (with the
>> SAME data for training and testing as before) and test it, the confusion
>> matrix changes and it might say 75% correctly classified instances.
>>>>> 
>>>>> Is this a desired behavior?
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Lance Norskog
>>>> goksron@gmail.com
>>> 
>>> 
>>> 
>>> --
>>> Lance Norskog
>>> goksron@gmail.com
>> 
>> 


Re: SGD different confusion matrix for each run

Posted by Ted Dunning <te...@gmail.com>.
First, this is a tiny training set.  You are well outside the intended
application range so you are likely to find less experience in the
community in that range.  That said, the algorithm should still produce
reasonably stable results.

Here are a few questions:

a) which class are you using to train your model?  I would start with
OnlineLogisticRegression and experiment with training rate schedules and
amount of regularization to find out how to build a good model (see the
configuration sketch after these questions).

b) how many times are you passing through your data?  Do you randomize the
order each time?  These are critical to proper training.  Instead of
randomizing order, you could just sample a data point at random and not
worry about using a complete permutation of the data.  With such a tiny
data set, you will need to pass through the data many times ... possibly
hundreds of times or more.

c) what kind of data do you have?  Sparse?  Dense?  How many variables?
 What kind?

d) can you post your data?
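
As a starting point for (a), here is a hedged sketch of configuring
OnlineLogisticRegression directly.  The chained learning-rate and regularization
setters are in org.apache.mahout.classifier.sgd; every number below is only an
illustrative value to experiment from, not a recommendation.

    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;

    public class LearnerSetup {
      // 2 categories and 10,000 hashed features with an L1 prior; all the
      // hyperparameters below are illustrative starting points.
      public static OnlineLogisticRegression newLearner() {
        return new OnlineLogisticRegression(2, 10000, new L1())
            .learningRate(1.0)    // initial training rate
            .alpha(1.0)           // controls how quickly the rate decreases
            .lambda(1.0e-4)       // amount of regularization
            .stepOffset(1000)     // delays annealing of the training rate
            .decayExponent(0.9);  // shape of the annealing schedule
      }
    }

A learner like this is then driven by repeated calls to train(target, features)
over many shuffled passes or random draws, as in the sketches earlier in the
thread.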


On Fri, Aug 31, 2012 at 5:03 AM, Salman Mahmood <sa...@influestor.com>wrote:

> Thanks a lot lance. Let me elaborate the problem if it was a bit confusing.
>
> Assuming I am making a binary classifier using SGD. I have got 50 positive
> and 50 negative examples to train the classifier. After training and
> testing the model, the confusion matrix tells you the number of correctly
> and incorrectly classified instances. Let's assume I got 85% correct and
> 15% incorrect instances.
>
> Now if I run my program again using the same 50 negative and 50 positive
> examples, then according to my knowledge the classifier should yield the
> same results as before (cause not even a single training or testing data
> was changed), but this is not the case. I get different results for
> different runs. The confusion matrix figures changes each time I generate a
> model keeping the data constant. What I do is, I generate a model several
> times and keep a look for the accuracy, and if it is above 90%, then I stop
> running the code and hence an accurate model is created.
>
> So what you are saying is to shuffle my data before I use it for training
> and testing?
> Thanks!
> On Aug 31, 2012, at 10:33 AM, Lance Norskog wrote:
>
> > Now I remember: SGD wants its data input in random order. You need to
> > permute the order of your data.
> >
> > If that does not help, another trick: for each data point, randomly
> > generate 5 or 10 or 20 points which are close. And again, randomly
> > permute the entire input set.
> >
> > On Thu, Aug 30, 2012 at 5:23 PM, Lance Norskog <go...@gmail.com>
> wrote:
> >> The more data you have, the closer each run will be. How much data do
> you have?
> >>
> >> On Thu, Aug 30, 2012 at 2:49 PM, Salman Mahmood <sa...@influestor.com>
> wrote:
> >>> I have noticed that every time I train and test a model using the same
> data (in SGD algo), I get different confusion matrix. Meaning, if I
> generate a model and look at the confusion matrix, it might say 90%
> correctly classified instances, but if I generate the model again (with the
> SAME data for training and testing as before) and test it, the confusion
> matrix changes and it might say 75% correctly classified instances.
> >>>
> >>> Is this a desired behavior?
> >>
> >>
> >>
> >> --
> >> Lance Norskog
> >> goksron@gmail.com
> >
> >
> >
> > --
> > Lance Norskog
> > goksron@gmail.com
>
>

Re: SGD different confusion matrix for each run

Posted by Salman Mahmood <sa...@influestor.com>.
Thanks a lot, Lance. Let me elaborate on the problem in case it was a bit confusing.

Assume I am making a binary classifier using SGD, and I have 50 positive and 50 negative examples to train it. After training and testing the model, the confusion matrix tells you the number of correctly and incorrectly classified instances. Let's assume I got 85% correct and 15% incorrect instances.

Now if I run my program again using the same 50 negative and 50 positive examples, then to my knowledge the classifier should yield the same results as before (because not a single training or testing example was changed), but this is not the case. I get different results for different runs. The confusion matrix figures change each time I generate a model, even though the data is kept constant. What I do is generate a model several times while watching the accuracy, and if it is above 90%, I stop running the code and treat that model as accurate.

So what you are saying is to shuffle my data before I use it for training and testing?
Thanks! 
On Aug 31, 2012, at 10:33 AM, Lance Norskog wrote:

> Now I remember: SGD wants its data input in random order. You need to
> permute the order of your data.
> 
> If that does not help, another trick: for each data point, randomly
> generate 5 or 10 or 20 points which are close. And again, randomly
> permute the entire input set.
> 
> On Thu, Aug 30, 2012 at 5:23 PM, Lance Norskog <go...@gmail.com> wrote:
>> The more data you have, the closer each run will be. How much data do you have?
>> 
>> On Thu, Aug 30, 2012 at 2:49 PM, Salman Mahmood <sa...@influestor.com> wrote:
>>> I have noticed that every time I train and test a model using the same data (in SGD algo), I get different confusion matrix. Meaning, if I generate a model and look at the confusion matrix, it might say 90% correctly classified instances, but if I generate the model again (with the SAME data for training and testing as before) and test it, the confusion matrix changes and it might say 75% correctly classified instances.
>>> 
>>> Is this a desired behavior?
>> 
>> 
>> 
>> --
>> Lance Norskog
>> goksron@gmail.com
> 
> 
> 
> -- 
> Lance Norskog
> goksron@gmail.com


Re: SGD different confusion matrix for each run

Posted by Lance Norskog <go...@gmail.com>.
Now I remember: SGD wants its data input in random order. You need to
permute the order of your data.

If that does not help, another trick: for each data point, randomly
generate 5 or 10 or 20 points which are close. And again, randomly
permute the entire input set.
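
A rough sketch of that jitter-style augmentation, assuming plain double[]
feature vectors and Gaussian noise.  The LabeledPoint holder, the noise scale,
and the choice of Gaussian noise are illustrative assumptions, not anything
prescribed by Mahout.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    public class JitterAugmenter {

      // Minimal labeled point; a jittered copy keeps the label of its original.
      public static class LabeledPoint {
        final int label;
        final double[] features;
        LabeledPoint(int label, double[] features) { this.label = label; this.features = features; }
      }

      private static final Random RAND = new Random();

      // For each point, keep it and add 'copies' nearby points made by adding
      // small Gaussian noise to every feature, then permute the entire set.
      public static List<LabeledPoint> augmentAndShuffle(List<LabeledPoint> points,
                                                         int copies, double noiseScale) {
        List<LabeledPoint> augmented = new ArrayList<LabeledPoint>();
        for (LabeledPoint p : points) {
          augmented.add(p);
          for (int c = 0; c < copies; c++) {
            double[] q = p.features.clone();
            for (int i = 0; i < q.length; i++) {
              q[i] += RAND.nextGaussian() * noiseScale;  // nudge each feature slightly
            }
            augmented.add(new LabeledPoint(p.label, q));
          }
        }
        Collections.shuffle(augmented, RAND);  // randomly permute the whole input set
        return augmented;
      }
    }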

On Thu, Aug 30, 2012 at 5:23 PM, Lance Norskog <go...@gmail.com> wrote:
> The more data you have, the closer each run will be. How much data do you have?
>
> On Thu, Aug 30, 2012 at 2:49 PM, Salman Mahmood <sa...@influestor.com> wrote:
>> I have noticed that every time I train and test a model using the same data (in SGD algo), I get different confusion matrix. Meaning, if I generate a model and look at the confusion matrix, it might say 90% correctly classified instances, but if I generate the model again (with the SAME data for training and testing as before) and test it, the confusion matrix changes and it might say 75% correctly classified instances.
>>
>> Is this a desired behavior?
>
>
>
> --
> Lance Norskog
> goksron@gmail.com



-- 
Lance Norskog
goksron@gmail.com

Re: SGD different confusion matrix for each run

Posted by Lance Norskog <go...@gmail.com>.
The more data you have, the closer each run will be. How much data do you have?

On Thu, Aug 30, 2012 at 2:49 PM, Salman Mahmood <sa...@influestor.com> wrote:
> I have noticed that every time I train and test a model using the same data (in SGD algo), I get different confusion matrix. Meaning, if I generate a model and look at the confusion matrix, it might say 90% correctly classified instances, but if I generate the model again (with the SAME data for training and testing as before) and test it, the confusion matrix changes and it might say 75% correctly classified instances.
>
> Is this a desired behavior?



-- 
Lance Norskog
goksron@gmail.com