Posted to user@mahout.apache.org by Stanley Xu <we...@gmail.com> on 2011/04/25 15:12:05 UTC

Which exact algorithm is used in the Mahout SGD?

Dear All,

I am trying to work through the Mahout SGD code and have been reading
"Logistic Regression for Data Mining and High-Dimensional Classification" a
little. I am wondering which algorithm exactly is used in the SGD code? Quite
a few algorithms are mentioned in the paper, and it is a little hard for me
to work out which one matches the code.

Thanks in advance.

Best wishes,
Stanley Xu

Re: Which exact algorithm is used in the Mahout SGD?

Posted by Ted Dunning <te...@gmail.com>.
Well, our classifiers are regressors as well.  Just a different kind.

On Tue, May 31, 2011 at 10:00 PM, Dmitriy Lyubimov <dl...@gmail.com>wrote:

> Factorization is essentially a quantitative (continuous) target
> regression, not a classification, so our abstract classifier
> interfaces probably would not fit here
>

Re: Which exact algorithm is used in the Mahout SGD?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
oh. they do use regularizers but with manually tuned reg rate, it seems.

On Tue, May 31, 2011 at 10:00 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> i must say i don't understand most of the math.
>
> as for sharding, if i understood it correctly, i remember having
> exactly the same idea for 'strata' selection as they show there, a year
> ago. But i think the problem is that you have to run as many MR jobs
> as the number of strata selected. I.e. if you parallelize it 5 ways (5
> maps) then you have to run it at least 5 times. or maybe one can
> recombine subepochs in reducers and have another run with reducers (so
> it's 3 times, not 5). Which seems to put fundamental limitations on
> hadoopified scalability of this (they partly show increased time after
> some rather low # of mappers  which seems to confirm my old concern
> about this).
>
> it probably makes sense with a lot of data. It probably makes even
> more sense without MR sort phase.
>
> Another thing i did not quite get, how they cope with regularization?
> it looks like they don't want to use it. How's overfitting handled
> then?
>
> but it's compelling enough for my work so i could try it. Again, i
> probably did not get some aspects of the algorithm though.
>
> Factorization is essentially a quantitative (continuous) target
> regression, not a classification, so our abstract classifier
> interfaces probably would not fit here
>
> On Tue, May 31, 2011 at 9:45 PM, Ted Dunning <te...@gmail.com> wrote:
>> After a quick skimming of the paper, it looks vaguely like, if you reduced
>> this to learning logistic regression, you would have something roughly the
>> same as feature sharding.
>>
>> (which is still a good idea)
>>
>> With matrices, of course, you have two ways to shard, not just one.
>>
>> On Tue, May 31, 2011 at 7:19 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>
>>> Interesting.
>>> i'd probably be interested to try it out.
>>>
>>>
>>>
>>> On Thu, Apr 28, 2011 at 11:31 PM, Stanley Xu <we...@gmail.com> wrote:
>>> > Thanks Ted and Lance. And sorry for the jargon.
>>> >
>>> > For the delay Ted mentioned, we have already considered that, still
>>> thanks a
>>> > lot for all the detailed ideas, they were pretty helpful.
>>> > For the parallelized SGD, just found a new paper about using DSGD in
>>> matrix
>>> > factorization, it's different from logistic regression, but might be helpful
>>> as
>>> > well. Put the title here "Large-Scale Matrix Factorization with
>>> Distributed
>>> > Stochastic Gradient Descent" if anyone is interested.
>>> >
>>> > Best wishes,
>>> > Stanley Xu
>>> > On Fri, Apr 29, 2011 at 2:08 PM, Ted Dunning <te...@gmail.com>
>>> wrote:
>>> >
>>> >> Yes.
>>> >>
>>> >> Apologies for jargon and TLA<
>>> >> http://en.wikipedia.org/wiki/Three-letter_acronym>
>>> >> 's
>>> >>
>>> >> On Thu, Apr 28, 2011 at 7:04 PM, Lance Norskog <go...@gmail.com>
>>> wrote:
>>> >>
>>> >> > CTR == Clickthrough Rate
>>> >> >
>>> >> > On Thu, Apr 28, 2011 at 12:06 PM, Ted Dunning <te...@gmail.com>
>>> >> > wrote:
>>> >> > > On Tue, Apr 26, 2011 at 8:00 PM, Stanley Xu <we...@gmail.com>
>>> >> wrote:
>>> >> > >
>>> >> > >> ... I understood as the algorithm, the time in training only relies
>>> on
>>> >> > the
>>> >> > >> non-zero records, but per our test, there would be some overhead we
>>> >> > could
>>> >> > >> not ignore for those non-zero records, though the cost is
>>> sub-linear
>>> >> or
>>> >> > >> logit to the length of the hashed vector.
>>> >> > >>
>>> >> > >
>>> >> > > This is pretty close if we say "non-zero values".  A record usually
>>> >> > refers
>>> >> > > to an entire training
>>> >> > > example.
>>> >> > >
>>> >> > > The extra work refers mostly to deferred regularization that
>>> eventually
>>> >> > has
>>> >> > > to be
>>> >> > > applied.  My guess is that it is even less than log in the feature
>>> >> vector
>>> >> > > size.
>>> >> > >
>>> >> > >
>>> >> > >> And in CTR prediction, I am not pretty sure it will converge very
>>> >> > quickly.
>>> >> > >>
>>> >> > >
>>> >> > > I was saying this purely based on the number of features.
>>> >> > >
>>> >> > >
>>> >> > >> Because we will very possibly see some records that have almost the same
>>> >> > feature
>>> >> > >> but different result in display ads.
>>> >> > >
>>> >> > >
>>> >> > > The algorithm can still converge to an estimate of the probability
>>> >> here.
>>> >> > >
>>> >> > >
>>> >> > >> But we will see the result in the
>>> >> > >> future.
>>> >> > >
>>> >> > >
>>> >> > > You have to be *very* careful about this to avoid prejudicing the
>>> model
>>> >> > > against
>>> >> > > recent impressions.  If you have a fast feedback to the ad targeting
>>> >> > system,
>>> >> > > you
>>> >> > > can have severe instability.
>>> >> > >
>>> >> > > The key thing that you have to do to avoid these biases is to define
>>> a
>>> >> > > maximum
>>> >> > > delay before click for the purposes of modeling.  You need to ignore
>>> >> all
>>> >> > > impressions
>>> >> > > younger than this delay (because they may still get a click) and you
>>> >> need
>>> >> > to
>>> >> > > ignore
>>> >> > > all clicks after this delay (to avoid bias in favor of old
>>> >> impressions).
>>> >> > >  For on-line ads
>>> >> > > you can probably use a maximum delay of a few minutes because most
>>> >> clicks
>>> >> > > will
>>> >> > > happen by then.
>>> >> > >
>>> >> > > To find a good value for maximum delay, you should plot the CTR for
>>> a
>>> >> > bunch
>>> >> > > of
>>> >> > > ads versus delay.  This will increase rapidly shortly after zero
>>> delay,
>>> >> > but
>>> >> > > then will
>>> >> > > level off.  The ordering of ads by CTR is what you care about so you
>>> >> can
>>> >> > > follow the
>>> >> > > curves back and find the shortest delay where the ordering is
>>> clearly
>>> >> > > preserved.  Use
>>> >> > > that as your maximum delay.  Typically this is roughly where your
>>> CTR
>>> >> is
>>> >> > at
>>> >> > > about
>>> >> > > 80-90% of the final value.
>>> >> > >
>>> >> > >
>>> >> > >
>>> >> > >
>>> >> > >> (We were still working on creating a framework to dig all the
>>> >> > >> features we need from the log, I would like to share our experience
>>> by
>>> >> > >> using
>>> >> > >> Mahout SGD once we get our CTR prediction model released.)
>>> >> > >>
>>> >> > >> And for parallelized SGD, what do you mean by help with sparse
>>> inputs
>>> >> > that
>>> >> > >> exhibit long-tail frequency distribution? Would you like to share
>>> some
>>> >> > of
>>> >> > >> your ideas, Ted?
>>> >> > >>
>>> >> > >> Currently, what I could think of is splitting the data randomly
>>> >> > >> across multiple mappers, letting every mapper learn from its local
>>> >> > >> data, and averaging the whole model, or letting multiple models
>>> >> > >> vote on every feature's weight. A little like the idea of AdaBoost
>>> >> > >> or RandomForest.
>>> >> > But I
>>> >> > >> am not a scientist or mathematician, so no idea if it is correct or
>>> >> not.
>>> >> > >>
>>> >> > >>
>>> >> > >> Thanks so much.
>>> >> > >> Stanley Xu
>>> >> > >>
>>> >> > >>
>>> >> > >>
>>> >> > >> On Tue, Apr 26, 2011 at 11:16 PM, Ted Dunning <
>>> ted.dunning@gmail.com>
>>> >> > >> wrote:
>>> >> > >>
>>> >> > >> > On Mon, Apr 25, 2011 at 11:46 PM, Stanley Xu <
>>> wenhao.xu@gmail.com>
>>> >> > >> wrote:
>>> >> > >> >
>>> >> > >> > > 1 hour is acceptable, but I guess you misunderstand the data
>>> scale
>>> >> I
>>> >> > >> mean
>>> >> > >> > > here. The 900M records didn't mean 900M Bytes, but 900M lines
>>> of
>>> >> > >> training
>>> >> > >> set (900M training examples). If every training example has 1000
>>> >> > dimensions,
>>> >> > >> > it
>>> >> > >> > > means 900 million X 1000 X 16 B = 14TB. If we reduce the logs
>>> >> > collected
>>> >> > >> > to
>>> >> > >> > > 14 days, it would be still 2-3TB data.
>>> >> > >> > >
>>> >> > >> >
>>> >> > >> > Oops.  Forgot that last multiplier.
>>> >> > >> >
>>> >> > >> >
>>> >> > >> > > Per our simple test, for 1000 dimension, 10M lines of record,
>>> it
>>> >> > will
>>> >> > >> > take
>>> >> > >> > > about 1-2 hours to do the training, so 90M lines of data will
>>> cost
>>> >> > at
>>> >> > >> > least
>>> >> > >> > > 90 hours, is that correct?
>>> >> > >> > >
>>> >> > >> >
>>> >> > >> > 10M x 1000 x 8 = 80 GB.
>>> >> > >> >
>>> >> > >> > 1-2 hours = (approx) 5000 seconds.  So this is
>>> >> > >> >
>>> >> > >> > 80 GB / 5000 s = 80/5 MB /s = 16MB / s
>>> >> > >> >
>>> >> > >> > Yes.  This is reasonable speed.  I think you can get a small
>>> factor
>>> >> > >> faster
>>> >> > >> > than this with SGD.  I have seen 100 million records with more
>>> >> > non-zero
>>> >> > >> > values than you describe with a training time of 3 hours.  I
>>> would
>>> >> not
>>> >> > >> > expect even as much as a factor of 10 speedup here.
>>> >> > >> >
>>> >> > >> >
>>> >> > >> > >
>>> >> > >> > > And from the PPT you provided
>>> >> > >> > > http://www.slideshare.net/tdunning/sdforum-11042010
>>> >> > >> > > You said it would take less than an hour for 20M data records
>>> for
>>> >> > >> > > numeric/category mixed dimensions. I am wondering, how many
>>> >> > dimensions
>>> >> > >> > per
>>> >> > >> > > record?
>>> >> > >> > >
>>> >> > >> >
>>> >> > >> > These are sparse records with about a thousand non-zero
>>> >> > elements
>>> >> > >> > per
>>> >> > >> > record.
>>> >> > >> >
>>> >> > >> >
>>> >> > >> > But let's step back to your data for a moment.  Where do these
>>> >> > thousand
>>> >> > >> > dimensions come from?  Do you really have a thousand hand-built
>>> >> > features?
>>> >> > >> >  Do you not have any sparse, text-like features?
>>> >> > >> >
>>> >> > >> > If you really only have a thousand dimensional problem, then I
>>> think
>>> >> > your
>>> >> > >> > model might exhibit early convergence.
>>> >> > >> >
>>> >> > >> > If not, it is quite possible to parallelize SGD, but this is only
>>> >> > likely
>>> >> > >> to
>>> >> > >> > help with sparse inputs that exhibit long-tail frequency
>>> >> distribution.
>>> >> > >> >
>>> >> > >>
>>> >> > >
>>> >> >
>>> >> >
>>> >> >
>>> >> > --
>>> >> > Lance Norskog
>>> >> > goksron@gmail.com
>>> >> >
>>> >>
>>> >
>>>
>>
>
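Ted's maximum-delay bookkeeping in the quoted thread above can be sketched as
a small labeling filter. This is only an illustration of the rule he
describes (ignore impressions younger than the window; count late clicks as
non-clicks); the `Impression` record and function name are hypothetical, not
Mahout API:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Impression:
    shown_at: float               # timestamp the ad was shown
    clicked_at: Optional[float]   # timestamp of the click, or None

def label_for_training(imp: Impression, now: float, max_delay: float):
    """Return a 1/0 training label, or None if the impression must be skipped.

    Impressions younger than max_delay are ignored (a click may still
    arrive); clicks arriving after max_delay are counted as non-clicks,
    which avoids biasing the model in favor of old impressions.
    """
    if now - imp.shown_at < max_delay:
        return None   # too young: may still receive a click
    if imp.clicked_at is not None and imp.clicked_at - imp.shown_at <= max_delay:
        return 1      # clicked within the window
    return 0          # no click within the window
```

As suggested above, max_delay would be chosen from the CTR-versus-delay
curve, roughly where CTR reaches 80-90% of its final value (a few minutes for
on-line ads).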

Re: Which exact algorithm is used in the Mahout SGD?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
i must say i don't understand most of the math.

as for sharding, if i understood it correctly, i remember having
exactly the same idea for 'strata' selection as they show there, a year
ago. But i think the problem is that you have to run as many MR jobs
as the number of strata selected. I.e. if you parallelize it 5 ways (5
maps) then you have to run it at least 5 times. or maybe one can
recombine subepochs in reducers and have another run with reducers (so
it's 3 times, not 5). Which seems to put fundamental limitations on
hadoopified scalability of this (they partly show increased time after
some rather low # of mappers  which seems to confirm my old concern
about this).

it probably makes sense with a lot of data. It probably makes even
more sense without MR sort phase.

Another thing i did not quite get, how they cope with regularization?
it looks like they don't want to use it. How's overfitting handled
then?

but it's compelling enough for my work so i could try it. Again, i
probably did not get some aspects of the algorithm though.

Factorization is essentially a quantitative (continuous) target
regression, not a classification, so our abstract classifier
interfaces probably would not fit here
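The "parallelize 5 ways, run at least 5 times" point above can be seen in a
toy stratum scheduler. This is a sketch of the strata idea from the DSGD
paper (Gemulla et al.), not Mahout code: for a d x d blocking of the matrix,
each stratum contains d blocks with disjoint rows and columns (trainable in
parallel, one mapper each), but the d strata must run sequentially, hence d
MR jobs per epoch.

```python
def dsgd_strata(d):
    """Yield the d strata of one DSGD epoch for a d x d blocking.

    Stratum s is the set of blocks (i, (i + s) mod d) for i in 0..d-1.
    Blocks in a stratum share no rows or columns, so their SGD updates
    never touch the same factor rows and can run concurrently; strata
    themselves are processed one after another.
    """
    for s in range(d):
        yield [(i, (i + s) % d) for i in range(d)]
```

For d = 5 this yields 5 strata of 5 independent blocks each, and every block
of the matrix is visited exactly once per epoch.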


Re: Which exact algorithm is used in the Mahout SGD?

Posted by Ted Dunning <te...@gmail.com>.
After a quick skimming of the paper, it looks vaguely like, if you reduced
this to learning logistic regression, you would have something roughly the
same as feature sharding.

(which is still a good idea)

With matrices, of course, you have two ways to shard, not just one.
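The feature-sharding idea mentioned above can be sketched for the logistic
regression case: each learner owns one slice of the weight vector, computes a
partial dot product over the features it owns, and the partials are summed
before the sigmoid. The shard assignment and function names here are
hypothetical, not Mahout's API:

```python
import math

def shard_of(feature_index, num_shards):
    """Assign a feature to a shard; each learner owns one slice of the weights."""
    return feature_index % num_shards

def sharded_dot(weights_by_shard, x, num_shards):
    """Logistic-regression score computed from sharded weights.

    x is a sparse vector {feature_index: value}. Each shard contributes
    the partial dot product over the features it owns; summing the
    partials is the single reduction step that sharding requires.
    """
    partials = [0.0] * num_shards
    for idx, value in x.items():
        s = shard_of(idx, num_shards)
        partials[s] += weights_by_shard[s].get(idx, 0.0) * value
    return 1.0 / (1.0 + math.exp(-sum(partials)))
```

With a matrix factorization, as noted above, the same trick applies along
either the row blocks or the column blocks, so there are two ways to shard.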


Re: Which exact algorithm is used in the Mahout SGD?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Interesting.
i'd probably be interested to try it out.



>> > >> wrote:
>> > >>
>> > >> > On Mon, Apr 25, 2011 at 11:46 PM, Stanley Xu <we...@gmail.com>
>> > >> wrote:
>> > >> >
>> > >> > > 1 hour is acceptable, but I guess you misunderstand the data scale
>> I
>> > >> mean
>> > >> > > here. The 900M records didn't mean 900M Bytes, but 900M lines of
>> > >> training
>> > >> > > set(900M training example.). If every training data has 1000
>> > dimension,
>> > >> > it
>> > >> > > means 900 million X 1000 X 16 B = 14TB. If we reduce the logs
>> > collected
>> > >> > to
>> > >> > > 14 days, it would be still 2-3TB data.
>> > >> > >
>> > >> >
>> > >> > Oops.  Forgot that last multiplier.
>> > >> >
>> > >> >
>> > >> > > Per our simple test, for 1000 dimension, 10M lines of record, it
>> > will
>> > >> > take
>> > >> > > about 1-2 hours to do the training, so 90M lines of data will cost
>> > at
>> > >> > least
>> > >> > > 90 hours, is that correct?
>> > >> > >
>> > >> >
>> > >> > 10M x 1000 x 8 = 80 GB.
>> > >> >
>> > >> > 1-2 hours = (approx) 5000 seconds.  So this is
>> > >> >
>> > >> > 80 GB / 5000 s = 80/5 MB /s = 16MB / s
>> > >> >
>> > >> > Yes.  This is reasonable speed.  I think you can get a small factor
>> > >> faster
>> > >> > than this with SGD.  I have seen 100 million records with more
>> > non-zero
>> > >> > values than you describe with a training time of 3 hours.  I would
>> not
>> > >> > expect even as much as a factor of 10 speedup here.
>> > >> >
>> > >> >
>> > >> > >
>> > >> > > And from the PPT you provided
>> > >> > > http://www.slideshare.net/tdunning/sdforum-11042010
>> > >> > > You said it would take less than an hour for 20M data records for
>> > >> > > numeric/category mixed dimensions. I am wondering, how many
>> > dimensions
>> > >> > per
>> > >> > > record?
>> > >> > >
>> > >> >
>> > >> > These are sparse records records with about a thousand non-zero
>> > elements
>> > >> > per
>> > >> > record.
>> > >> >
>> > >> >
>> > >> > But let's step back to your data for a moment.  Where do these
>> > thousand
>> > >> > dimensions come from?  Do you really have a thousand hand-built
>> > features?
>> > >> >  Do you not have any sparse, text-like features?
>> > >> >
>> > >> > If you really only have a thousand dimensional problem, then I think
>> > your
>> > >> > model might exhibit early convergence.
>> > >> >
>> > >> > If not, it is quite possible to parallelize SGD, but this is only
>> > likely
>> > >> to
>> > >> > help with sparse inputs that exhibit long-tail frequency
>> distribution.
>> > >> >
>> > >>
>> > >
>> >
>> >
>> >
>> > --
>> > Lance Norskog
>> > goksron@gmail.com
>> >
>>
>

Re: Which exact algorithm is used in the Mahout SGD?

Posted by Stanley Xu <we...@gmail.com>.
Thanks Ted and Lance. And sorry for the jargon.

For the delay Ted mentioned, we have already considered that; still, thanks a
lot for all the detailed ideas, they were very helpful.
For parallelized SGD, I just found a new paper on using DSGD for matrix
factorization. It is different from logistic regression, but it might be
helpful as well. The title is "Large-Scale Matrix Factorization with Distributed
Stochastic Gradient Descent", if anyone is interested.
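
For context, the core trick of that DSGD paper is to pick "strata" of matrix
blocks that share no rows or columns, so workers can run SGD on them
independently within a sub-epoch. A toy sketch of one such diagonal scheme (my
reading of the paper, not its code):

```python
def diagonal_strata(num_blocks):
    """Split an num_blocks x num_blocks blocked rating matrix into strata.

    In sub-epoch s, worker b updates block (b, (b + s) % num_blocks), so
    within a sub-epoch no two workers touch the same row- or column-block
    and their SGD updates are interchangeable."""
    return [[(b, (b + s) % num_blocks) for b in range(num_blocks)]
            for s in range(num_blocks)]
```

Note that covering the whole matrix this way takes num_blocks sub-epochs, which
is the scalability concern raised elsewhere in this thread.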

Best wishes,
Stanley Xu
On Fri, Apr 29, 2011 at 2:08 PM, Ted Dunning <te...@gmail.com> wrote:

> Yes.
>
> Apologies for jargon and TLA<
> http://en.wikipedia.org/wiki/Three-letter_acronym>
> 's
>
> On Thu, Apr 28, 2011 at 7:04 PM, Lance Norskog <go...@gmail.com> wrote:
>
> > CTR == Clickthrough Rate
> >
> > On Thu, Apr 28, 2011 at 12:06 PM, Ted Dunning <te...@gmail.com>
> > wrote:
> > > On Tue, Apr 26, 2011 at 8:00 PM, Stanley Xu <we...@gmail.com>
> wrote:
> > >
> > >> ... I understood as the algorithm, the time in training only relies on
> > the
> > >> non-zero records, but per our test, there would be some overhead we
> > could
> > >> not ignore for thoso non-zero records, though the cost is sub-linear
> or
> > >> logit to the length of the hashed vector.
> > >>
> > >
> > > This is pretty close if we say "non-zero values".  A record usually
> > refers
> > > to an entire training
> > > example.
> > >
> > > The extra work refers mostly to deferred regularization that eventually
> > has
> > > to be
> > > applied.  My guess is that it is even less than log in the feature
> vector
> > > size.
> > >
> > >
> > >> And in CTR prediction, I am not pretty sure it will converge very
> > quickly.
> > >>
> > >
> > > I was saying this purely based on the number of features.
> > >
> > >
> > >> Because we will very possibly see some records has the almost same
> > feature
> > >> but different result in display ads.
> > >
> > >
> > > The algorithm can still converge to an estimate of the probability
> here.
> > >
> > >
> > >> But we will see the result in the
> > >> future.
> > >
> > >
> > > You have to be *very* careful about this to avoid prejudicing the model
> > > against
> > > recent impressions.  If you have a fast feedback to the ad targeting
> > system,
> > > you
> > > can have severely instability.
> > >
> > > The key thing that you have to do to avoid these biases is to define a
> > > maximum
> > > delay before click for the purposes of modeling.  You need to ignore
> all
> > > impressions
> > > younger than this delay (because they may still get a click) and you
> need
> > to
> > > ignore
> > > all clicks after this delay (to avoid bias in favor of old
> impressions).
> > >  For on-line ads
> > > you can probably use a maximum delay of a few minutes because most
> clicks
> > > will
> > > happen by then.
> > >
> > > To find a good value for maximum delay, you should plot the CTR for a
> > bunch
> > > of
> > > ads versus delay.  This will increase rapidly shortly after zero delay,
> > but
> > > then will
> > > level off.  The ordering of ads by CTR is what you care about so you
> can
> > > follow the
> > > curves back and find the shortest delay where the ordering is clearly
> > > preserved.  Use
> > > that as your maximum delay.  Typically this is roughly where your CTR
> is
> > at
> > > about
> > > 80-90% of the final value.
> > >
> > >
> > >
> > >
> > >> (We were still working on creating a framework to digg all the
> > >> features we need from the log, I would like to share our experience by
> > >> using
> > >> Mahout SGD once we got our CTR prediction model release.)
> > >>
> > >> And for parallelize SGD, what do you mean for help with sparse inputs
> > that
> > >> exhibit long-tail frequency distribution? Would you like to share some
> > of
> > >> your ideas, Ted?
> > >>
> > >> Currently, what I could think about is split the data to multiple
> mapper
> > >> randomly and let every mapper to learn from the local data and get an
> > >> average on the whole model, or let multiple model to vote for every
> > >> feature's weight. A little like the idea of AdaBoost or RandomForest.
> > But I
> > >> am not a scientist or mathematician, so no idea if it is correct or
> not.
> > >>
> > >>
> > >> Thanks so much.
> > >> Stanley Xu
> > >>
> > >>
> > >>
> > >> On Tue, Apr 26, 2011 at 11:16 PM, Ted Dunning <te...@gmail.com>
> > >> wrote:
> > >>
> > >> > On Mon, Apr 25, 2011 at 11:46 PM, Stanley Xu <we...@gmail.com>
> > >> wrote:
> > >> >
> > >> > > 1 hour is acceptable, but I guess you misunderstand the data scale
> I
> > >> mean
> > >> > > here. The 900M records didn't mean 900M Bytes, but 900M lines of
> > >> training
> > >> > > set(900M training example.). If every training data has 1000
> > dimension,
> > >> > it
> > >> > > means 900 million X 1000 X 16 B = 14TB. If we reduce the logs
> > collected
> > >> > to
> > >> > > 14 days, it would be still 2-3TB data.
> > >> > >
> > >> >
> > >> > Oops.  Forgot that last multiplier.
> > >> >
> > >> >
> > >> > > Per our simple test, for 1000 dimension, 10M lines of record, it
> > will
> > >> > take
> > >> > > about 1-2 hours to do the training, so 90M lines of data will cost
> > at
> > >> > least
> > >> > > 90 hours, is that correct?
> > >> > >
> > >> >
> > >> > 10M x 1000 x 8 = 80 GB.
> > >> >
> > >> > 1-2 hours = (approx) 5000 seconds.  So this is
> > >> >
> > >> > 80 GB / 5000 s = 80/5 MB /s = 16MB / s
> > >> >
> > >> > Yes.  This is reasonable speed.  I think you can get a small factor
> > >> faster
> > >> > than this with SGD.  I have seen 100 million records with more
> > non-zero
> > >> > values than you describe with a training time of 3 hours.  I would
> not
> > >> > expect even as much as a factor of 10 speedup here.
> > >> >
> > >> >
> > >> > >
> > >> > > And from the PPT you provided
> > >> > > http://www.slideshare.net/tdunning/sdforum-11042010
> > >> > > You said it would take less than an hour for 20M data records for
> > >> > > numeric/category mixed dimensions. I am wondering, how many
> > dimensions
> > >> > per
> > >> > > record?
> > >> > >
> > >> >
> > >> > These are sparse records records with about a thousand non-zero
> > elements
> > >> > per
> > >> > record.
> > >> >
> > >> >
> > >> > But let's step back to your data for a moment.  Where do these
> > thousand
> > >> > dimensions come from?  Do you really have a thousand hand-built
> > features?
> > >> >  Do you not have any sparse, text-like features?
> > >> >
> > >> > If you really only have a thousand dimensional problem, then I think
> > your
> > >> > model might exhibit early convergence.
> > >> >
> > >> > If not, it is quite possible to parallelize SGD, but this is only
> > likely
> > >> to
> > >> > help with sparse inputs that exhibit long-tail frequency
> distribution.
> > >> >
> > >>
> > >
> >
> >
> >
> > --
> > Lance Norskog
> > goksron@gmail.com
> >
>

Re: Which exact algorithm is used in the Mahout SGD?

Posted by Ted Dunning <te...@gmail.com>.
Yes.

Apologies for the jargon and TLAs <http://en.wikipedia.org/wiki/Three-letter_acronym>.

On Thu, Apr 28, 2011 at 7:04 PM, Lance Norskog <go...@gmail.com> wrote:

> CTR == Clickthrough Rate
>
> On Thu, Apr 28, 2011 at 12:06 PM, Ted Dunning <te...@gmail.com>
> wrote:
> > On Tue, Apr 26, 2011 at 8:00 PM, Stanley Xu <we...@gmail.com> wrote:
> >
> >> ... I understood as the algorithm, the time in training only relies on
> the
> >> non-zero records, but per our test, there would be some overhead we
> could
> >> not ignore for thoso non-zero records, though the cost is sub-linear or
> >> logit to the length of the hashed vector.
> >>
> >
> > This is pretty close if we say "non-zero values".  A record usually
> refers
> > to an entire training
> > example.
> >
> > The extra work refers mostly to deferred regularization that eventually
> has
> > to be
> > applied.  My guess is that it is even less than log in the feature vector
> > size.
> >
> >
> >> And in CTR prediction, I am not pretty sure it will converge very
> quickly.
> >>
> >
> > I was saying this purely based on the number of features.
> >
> >
> >> Because we will very possibly see some records has the almost same
> feature
> >> but different result in display ads.
> >
> >
> > The algorithm can still converge to an estimate of the probability here.
> >
> >
> >> But we will see the result in the
> >> future.
> >
> >
> > You have to be *very* careful about this to avoid prejudicing the model
> > against
> > recent impressions.  If you have a fast feedback to the ad targeting
> system,
> > you
> > can have severely instability.
> >
> > The key thing that you have to do to avoid these biases is to define a
> > maximum
> > delay before click for the purposes of modeling.  You need to ignore all
> > impressions
> > younger than this delay (because they may still get a click) and you need
> to
> > ignore
> > all clicks after this delay (to avoid bias in favor of old impressions).
> >  For on-line ads
> > you can probably use a maximum delay of a few minutes because most clicks
> > will
> > happen by then.
> >
> > To find a good value for maximum delay, you should plot the CTR for a
> bunch
> > of
> > ads versus delay.  This will increase rapidly shortly after zero delay,
> but
> > then will
> > level off.  The ordering of ads by CTR is what you care about so you can
> > follow the
> > curves back and find the shortest delay where the ordering is clearly
> > preserved.  Use
> > that as your maximum delay.  Typically this is roughly where your CTR is
> at
> > about
> > 80-90% of the final value.
> >
> >
> >
> >
> >> (We were still working on creating a framework to digg all the
> >> features we need from the log, I would like to share our experience by
> >> using
> >> Mahout SGD once we got our CTR prediction model release.)
> >>
> >> And for parallelize SGD, what do you mean for help with sparse inputs
> that
> >> exhibit long-tail frequency distribution? Would you like to share some
> of
> >> your ideas, Ted?
> >>
> >> Currently, what I could think about is split the data to multiple mapper
> >> randomly and let every mapper to learn from the local data and get an
> >> average on the whole model, or let multiple model to vote for every
> >> feature's weight. A little like the idea of AdaBoost or RandomForest.
> But I
> >> am not a scientist or mathematician, so no idea if it is correct or not.
> >>
> >>
> >> Thanks so much.
> >> Stanley Xu
> >>
> >>
> >>
> >> On Tue, Apr 26, 2011 at 11:16 PM, Ted Dunning <te...@gmail.com>
> >> wrote:
> >>
> >> > On Mon, Apr 25, 2011 at 11:46 PM, Stanley Xu <we...@gmail.com>
> >> wrote:
> >> >
> >> > > 1 hour is acceptable, but I guess you misunderstand the data scale I
> >> mean
> >> > > here. The 900M records didn't mean 900M Bytes, but 900M lines of
> >> training
> >> > > set(900M training example.). If every training data has 1000
> dimension,
> >> > it
> >> > > means 900 million X 1000 X 16 B = 14TB. If we reduce the logs
> collected
> >> > to
> >> > > 14 days, it would be still 2-3TB data.
> >> > >
> >> >
> >> > Oops.  Forgot that last multiplier.
> >> >
> >> >
> >> > > Per our simple test, for 1000 dimension, 10M lines of record, it
> will
> >> > take
> >> > > about 1-2 hours to do the training, so 90M lines of data will cost
> at
> >> > least
> >> > > 90 hours, is that correct?
> >> > >
> >> >
> >> > 10M x 1000 x 8 = 80 GB.
> >> >
> >> > 1-2 hours = (approx) 5000 seconds.  So this is
> >> >
> >> > 80 GB / 5000 s = 80/5 MB /s = 16MB / s
> >> >
> >> > Yes.  This is reasonable speed.  I think you can get a small factor
> >> faster
> >> > than this with SGD.  I have seen 100 million records with more
> non-zero
> >> > values than you describe with a training time of 3 hours.  I would not
> >> > expect even as much as a factor of 10 speedup here.
> >> >
> >> >
> >> > >
> >> > > And from the PPT you provided
> >> > > http://www.slideshare.net/tdunning/sdforum-11042010
> >> > > You said it would take less than an hour for 20M data records for
> >> > > numeric/category mixed dimensions. I am wondering, how many
> dimensions
> >> > per
> >> > > record?
> >> > >
> >> >
> >> > These are sparse records records with about a thousand non-zero
> elements
> >> > per
> >> > record.
> >> >
> >> >
> >> > But let's step back to your data for a moment.  Where do these
> thousand
> >> > dimensions come from?  Do you really have a thousand hand-built
> features?
> >> >  Do you not have any sparse, text-like features?
> >> >
> >> > If you really only have a thousand dimensional problem, then I think
> your
> >> > model might exhibit early convergence.
> >> >
> >> > If not, it is quite possible to parallelize SGD, but this is only
> likely
> >> to
> >> > help with sparse inputs that exhibit long-tail frequency distribution.
> >> >
> >>
> >
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>

Re: Which exact algorithm is used in the Mahout SGD?

Posted by Lance Norskog <go...@gmail.com>.
CTR == Clickthrough Rate

On Thu, Apr 28, 2011 at 12:06 PM, Ted Dunning <te...@gmail.com> wrote:
> On Tue, Apr 26, 2011 at 8:00 PM, Stanley Xu <we...@gmail.com> wrote:
>
>> ... I understood as the algorithm, the time in training only relies on the
>> non-zero records, but per our test, there would be some overhead we could
>> not ignore for thoso non-zero records, though the cost is sub-linear or
>> logit to the length of the hashed vector.
>>
>
> This is pretty close if we say "non-zero values".  A record usually refers
> to an entire training
> example.
>
> The extra work refers mostly to deferred regularization that eventually has
> to be
> applied.  My guess is that it is even less than log in the feature vector
> size.
>
>
>> And in CTR prediction, I am not pretty sure it will converge very quickly.
>>
>
> I was saying this purely based on the number of features.
>
>
>> Because we will very possibly see some records has the almost same feature
>> but different result in display ads.
>
>
> The algorithm can still converge to an estimate of the probability here.
>
>
>> But we will see the result in the
>> future.
>
>
> You have to be *very* careful about this to avoid prejudicing the model
> against
> recent impressions.  If you have a fast feedback to the ad targeting system,
> you
> can have severely instability.
>
> The key thing that you have to do to avoid these biases is to define a
> maximum
> delay before click for the purposes of modeling.  You need to ignore all
> impressions
> younger than this delay (because they may still get a click) and you need to
> ignore
> all clicks after this delay (to avoid bias in favor of old impressions).
>  For on-line ads
> you can probably use a maximum delay of a few minutes because most clicks
> will
> happen by then.
>
> To find a good value for maximum delay, you should plot the CTR for a bunch
> of
> ads versus delay.  This will increase rapidly shortly after zero delay, but
> then will
> level off.  The ordering of ads by CTR is what you care about so you can
> follow the
> curves back and find the shortest delay where the ordering is clearly
> preserved.  Use
> that as your maximum delay.  Typically this is roughly where your CTR is at
> about
> 80-90% of the final value.
>
>
>
>
>> (We were still working on creating a framework to digg all the
>> features we need from the log, I would like to share our experience by
>> using
>> Mahout SGD once we got our CTR prediction model release.)
>>
>> And for parallelize SGD, what do you mean for help with sparse inputs that
>> exhibit long-tail frequency distribution? Would you like to share some of
>> your ideas, Ted?
>>
>> Currently, what I could think about is split the data to multiple mapper
>> randomly and let every mapper to learn from the local data and get an
>> average on the whole model, or let multiple model to vote for every
>> feature's weight. A little like the idea of AdaBoost or RandomForest. But I
>> am not a scientist or mathematician, so no idea if it is correct or not.
>>
>>
>> Thanks so much.
>> Stanley Xu
>>
>>
>>
>> On Tue, Apr 26, 2011 at 11:16 PM, Ted Dunning <te...@gmail.com>
>> wrote:
>>
>> > On Mon, Apr 25, 2011 at 11:46 PM, Stanley Xu <we...@gmail.com>
>> wrote:
>> >
>> > > 1 hour is acceptable, but I guess you misunderstand the data scale I
>> mean
>> > > here. The 900M records didn't mean 900M Bytes, but 900M lines of
>> training
>> > > set(900M training example.). If every training data has 1000 dimension,
>> > it
>> > > means 900 million X 1000 X 16 B = 14TB. If we reduce the logs collected
>> > to
>> > > 14 days, it would be still 2-3TB data.
>> > >
>> >
>> > Oops.  Forgot that last multiplier.
>> >
>> >
>> > > Per our simple test, for 1000 dimension, 10M lines of record, it will
>> > take
>> > > about 1-2 hours to do the training, so 90M lines of data will cost at
>> > least
>> > > 90 hours, is that correct?
>> > >
>> >
>> > 10M x 1000 x 8 = 80 GB.
>> >
>> > 1-2 hours = (approx) 5000 seconds.  So this is
>> >
>> > 80 GB / 5000 s = 80/5 MB /s = 16MB / s
>> >
>> > Yes.  This is reasonable speed.  I think you can get a small factor
>> faster
>> > than this with SGD.  I have seen 100 million records with more non-zero
>> > values than you describe with a training time of 3 hours.  I would not
>> > expect even as much as a factor of 10 speedup here.
>> >
>> >
>> > >
>> > > And from the PPT you provided
>> > > http://www.slideshare.net/tdunning/sdforum-11042010
>> > > You said it would take less than an hour for 20M data records for
>> > > numeric/category mixed dimensions. I am wondering, how many dimensions
>> > per
>> > > record?
>> > >
>> >
>> > These are sparse records records with about a thousand non-zero elements
>> > per
>> > record.
>> >
>> >
>> > But let's step back to your data for a moment.  Where do these thousand
>> > dimensions come from?  Do you really have a thousand hand-built features?
>> >  Do you not have any sparse, text-like features?
>> >
>> > If you really only have a thousand dimensional problem, then I think your
>> > model might exhibit early convergence.
>> >
>> > If not, it is quite possible to parallelize SGD, but this is only likely
>> to
>> > help with sparse inputs that exhibit long-tail frequency distribution.
>> >
>>
>



-- 
Lance Norskog
goksron@gmail.com

Re: Which exact algorithm is used in the Mahout SGD?

Posted by Ted Dunning <te...@gmail.com>.
On Tue, Apr 26, 2011 at 8:00 PM, Stanley Xu <we...@gmail.com> wrote:

> ... I understood as the algorithm, the time in training only relies on the
> non-zero records, but per our test, there would be some overhead we could
> not ignore for thoso non-zero records, though the cost is sub-linear or
> logit to the length of the hashed vector.
>

This is pretty close if we say "non-zero values".  A record usually refers to
an entire training example.

The extra work refers mostly to deferred regularization that eventually has to
be applied.  My guess is that it is even less than log in the feature vector
size.
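
The deferred-regularization idea can be illustrated with a toy SGD logistic
regression that only regularizes a weight when its feature is next touched (a
sketch of the general lazy-update technique, not Mahout's actual
OnlineLogisticRegression code):

```python
import math

class LazySGDLogistic:
    """Toy SGD logistic regression with deferred (lazy) L2 shrinkage.

    Regularization is applied to a weight only when its feature next
    appears, so each training step costs O(non-zeros), not O(dimension)."""

    def __init__(self, dim, rate=0.1, decay=0.999):
        self.w = [0.0] * dim
        self.last_step = [0] * dim  # step at which each weight was last regularized
        self.step = 0
        self.rate = rate            # learning rate
        self.decay = decay          # per-step L2 shrinkage factor

    def _catch_up(self, j):
        # apply all the shrinkage steps this weight has skipped
        skipped = self.step - self.last_step[j]
        if skipped:
            self.w[j] *= self.decay ** skipped
            self.last_step[j] = self.step

    def train(self, nonzeros, y):
        """nonzeros: list of (index, value) pairs; y: 0 or 1."""
        self.step += 1
        for j, _ in nonzeros:
            self._catch_up(j)
        margin = sum(self.w[j] * v for j, v in nonzeros)
        p = 1.0 / (1.0 + math.exp(-margin))
        for j, v in nonzeros:
            self.w[j] += self.rate * (y - p) * v

    def classify(self, nonzeros):
        for j, _ in nonzeros:
            self._catch_up(j)
        margin = sum(self.w[j] * v for j, v in nonzeros)
        return 1.0 / (1.0 + math.exp(-margin))
```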


> And in CTR prediction, I am not pretty sure it will converge very quickly.
>

I was saying this purely based on the number of features.


> Because we will very possibly see some records has the almost same feature
> but different result in display ads.


The algorithm can still converge to an estimate of the probability here.


> But we will see the result in the
> future.


You have to be *very* careful about this to avoid prejudicing the model
against recent impressions.  If you have fast feedback to the ad targeting
system, you can have severe instability.

The key thing that you have to do to avoid these biases is to define a maximum
delay before click for the purposes of modeling.  You need to ignore all
impressions younger than this delay (because they may still get a click) and
you need to ignore all clicks after this delay (to avoid bias in favor of old
impressions).  For on-line ads you can probably use a maximum delay of a few
minutes because most clicks will happen by then.

To find a good value for maximum delay, you should plot the CTR for a bunch of
ads versus delay.  This will increase rapidly shortly after zero delay, but
then will level off.  The ordering of ads by CTR is what you care about, so
you can follow the curves back and find the shortest delay where the ordering
is clearly preserved.  Use that as your maximum delay.  Typically this is
roughly where your CTR is at about 80-90% of the final value.
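
The filtering rule above can be made concrete with a small hypothetical helper
(the names and data layout are my own, not from any real system):

```python
def label_impressions(events, max_delay, now):
    """Turn raw (impression_time, click_time_or_None) events into labeled
    training examples, following the maximum-delay rule:
      * skip impressions younger than max_delay (a click may still arrive)
      * count a click only if it happened within max_delay of the impression
    Times are in arbitrary consistent units (e.g. seconds)."""
    labeled = []
    for imp_t, click_t in events:
        if now - imp_t < max_delay:
            continue  # outcome not final yet; ignore entirely
        clicked = click_t is not None and click_t - imp_t <= max_delay
        labeled.append((imp_t, 1 if clicked else 0))
    return labeled
```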




> (We were still working on creating a framework to digg all the
> features we need from the log, I would like to share our experience by
> using
> Mahout SGD once we got our CTR prediction model release.)
>
> And for parallelize SGD, what do you mean for help with sparse inputs that
> exhibit long-tail frequency distribution? Would you like to share some of
> your ideas, Ted?
>
> Currently, what I could think about is split the data to multiple mapper
> randomly and let every mapper to learn from the local data and get an
> average on the whole model, or let multiple model to vote for every
> feature's weight. A little like the idea of AdaBoost or RandomForest. But I
> am not a scientist or mathematician, so no idea if it is correct or not.
>
>
> Thanks so much.
> Stanley Xu
>
>
>
> On Tue, Apr 26, 2011 at 11:16 PM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > On Mon, Apr 25, 2011 at 11:46 PM, Stanley Xu <we...@gmail.com>
> wrote:
> >
> > > 1 hour is acceptable, but I guess you misunderstand the data scale I
> mean
> > > here. The 900M records didn't mean 900M Bytes, but 900M lines of
> training
> > > set(900M training example.). If every training data has 1000 dimension,
> > it
> > > means 900 million X 1000 X 16 B = 14TB. If we reduce the logs collected
> > to
> > > 14 days, it would be still 2-3TB data.
> > >
> >
> > Oops.  Forgot that last multiplier.
> >
> >
> > > Per our simple test, for 1000 dimension, 10M lines of record, it will
> > take
> > > about 1-2 hours to do the training, so 90M lines of data will cost at
> > least
> > > 90 hours, is that correct?
> > >
> >
> > 10M x 1000 x 8 = 80 GB.
> >
> > 1-2 hours = (approx) 5000 seconds.  So this is
> >
> > 80 GB / 5000 s = 80/5 MB /s = 16MB / s
> >
> > Yes.  This is reasonable speed.  I think you can get a small factor
> faster
> > than this with SGD.  I have seen 100 million records with more non-zero
> > values than you describe with a training time of 3 hours.  I would not
> > expect even as much as a factor of 10 speedup here.
> >
> >
> > >
> > > And from the PPT you provided
> > > http://www.slideshare.net/tdunning/sdforum-11042010
> > > You said it would take less than an hour for 20M data records for
> > > numeric/category mixed dimensions. I am wondering, how many dimensions
> > per
> > > record?
> > >
> >
> > These are sparse records records with about a thousand non-zero elements
> > per
> > record.
> >
> >
> > But let's step back to your data for a moment.  Where do these thousand
> > dimensions come from?  Do you really have a thousand hand-built features?
> >  Do you not have any sparse, text-like features?
> >
> > If you really only have a thousand dimensional problem, then I think your
> > model might exhibit early convergence.
> >
> > If not, it is quite possible to parallelize SGD, but this is only likely
> to
> > help with sparse inputs that exhibit long-tail frequency distribution.
> >
>

Re: Which exact algorithm is used in the Mahout SGD?

Posted by Stanley Xu <we...@gmail.com>.
Hi Ted,

For the data: currently we dig through the logs for a specific cookie. For
example, we check how many times he has seen the banner from the advertiser in
the last 7 days. We don't have 1000 non-zero values now; I think we only have
100-200, but we expect to have 1000 at most. We won't have many text-like
features, since what we are predicting is for display ads rather than search
ads. We might introduce a tagging system for ad management in the future, to
see if we get better results by adding those features.

As I understood the algorithm, the training time should depend only on the
non-zero records, but per our test there is some overhead we could not ignore
beyond those non-zero records, though that cost is sub-linear or logarithmic
in the length of the hashed vector.
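
For readers unfamiliar with the hashed vector being discussed, here is a
minimal sketch of the hashing-trick encoding (not Mahout's actual encoders,
which use their own hash functions and multiple probes):

```python
def hash_features(pairs, dim=1000):
    """Encode (name, value) features into a fixed-length hashed vector.

    Each feature is added at a bucket chosen by hashing its name, so an
    unbounded categorical feature space fits in `dim` slots; collisions
    simply add together."""
    v = [0.0] * dim
    for name, value in pairs:
        v[hash(name) % dim] += value
    return v
```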

And for CTR prediction, I am not so sure it will converge very quickly,
because in display ads we will very possibly see records with almost the same
features but different outcomes. We will see the results in the future. (We
are still working on a framework to dig all the features we need out of the
logs; I would like to share our experience with Mahout SGD once we have
released our CTR prediction model.)

And for parallelizing SGD, what do you mean by it only helping with sparse
inputs that exhibit a long-tail frequency distribution? Would you like to
share some of your ideas, Ted?

Currently, what I can think of is to split the data randomly across multiple
mappers, let every mapper learn from its local data, and then average over the
whole model, or let multiple models vote on every feature's weight, a little
like the idea of AdaBoost or random forests. But I am not a scientist or
mathematician, so I have no idea whether that is correct.
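
The averaging scheme described above can be sketched as a hypothetical reduce
step (not Mahout code; simple parameter averaging is known to behave
reasonably for convex losses like logistic regression, though it is not
equivalent to sequential SGD):

```python
def average_models(weight_vectors):
    """Parameter averaging: each mapper trains SGD on its own shard, and
    the combined model is the element-wise mean of the per-shard weights."""
    n = len(weight_vectors)
    dim = len(weight_vectors[0])
    return [sum(w[j] for w in weight_vectors) / n for j in range(dim)]
```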


Thanks so much.
Stanley Xu



On Tue, Apr 26, 2011 at 11:16 PM, Ted Dunning <te...@gmail.com> wrote:

> On Mon, Apr 25, 2011 at 11:46 PM, Stanley Xu <we...@gmail.com> wrote:
>
> > 1 hour is acceptable, but I guess you misunderstand the data scale I mean
> > here. The 900M records didn't mean 900M Bytes, but 900M lines of training
> > set(900M training example.). If every training data has 1000 dimension,
> it
> > means 900 million X 1000 X 16 B = 14TB. If we reduce the logs collected
> to
> > 14 days, it would be still 2-3TB data.
> >
>
> Oops.  Forgot that last multiplier.
>
>
> > Per our simple test, for 1000 dimension, 10M lines of record, it will
> take
> > about 1-2 hours to do the training, so 90M lines of data will cost at
> least
> > 90 hours, is that correct?
> >
>
> 10M x 1000 x 8 = 80 GB.
>
> 1-2 hours = (approx) 5000 seconds.  So this is
>
> 80 GB / 5000 s = 80/5 MB /s = 16MB / s
>
> Yes.  This is reasonable speed.  I think you can get a small factor faster
> than this with SGD.  I have seen 100 million records with more non-zero
> values than you describe with a training time of 3 hours.  I would not
> expect even as much as a factor of 10 speedup here.
>
>
> >
> > And from the PPT you provided
> > http://www.slideshare.net/tdunning/sdforum-11042010
> > You said it would take less than an hour for 20M data records for
> > numeric/category mixed dimensions. I am wondering, how many dimensions
> per
> > record?
> >
>
> These are sparse records records with about a thousand non-zero elements
> per
> record.
>
>
> But let's step back to your data for a moment.  Where do these thousand
> dimensions come from?  Do you really have a thousand hand-built features?
>  Do you not have any sparse, text-like features?
>
> If you really only have a thousand dimensional problem, then I think your
> model might exhibit early convergence.
>
> If not, it is quite possible to parallelize SGD, but this is only likely to
> help with sparse inputs that exhibit long-tail frequency distribution.
>

Re: Which exact algorithm is used in the Mahout SGD?

Posted by Ted Dunning <te...@gmail.com>.
On Mon, Apr 25, 2011 at 11:46 PM, Stanley Xu <we...@gmail.com> wrote:

> 1 hour is acceptable, but I guess you misunderstand the data scale I mean
> here. The 900M records didn't mean 900M Bytes, but 900M lines of training
> set (900M training examples). If every training example has 1000 dimensions, it
> means 900 million X 1000 X 16 B = 14TB. If we reduce the logs collected to
> 14 days, it would be still 2-3TB data.
>

Oops.  Forgot that last multiplier.


> Per our simple test, for 1000 dimension, 10M lines of record, it will take
> about 1-2 hours to do the training, so 900M lines of data will cost at least
> 90 hours, is that correct?
>

10M x 1000 x 8 = 80 GB.

1-2 hours = (approx) 5000 seconds.  So this is

80 GB / 5000 s = 80/5 MB /s = 16MB / s

Yes.  This is reasonable speed.  I think you can get a small factor faster
than this with SGD.  I have seen 100 million records with more non-zero
values than you describe with a training time of 3 hours.  I would not
expect even as much as a factor of 10 speedup here.


>
> And from the PPT you provided
> http://www.slideshare.net/tdunning/sdforum-11042010
> You said it would take less than an hour for 20M data records for
> numeric/category mixed dimensions. I am wondering, how many dimensions per
> record?
>

These are sparse records with about a thousand non-zero elements per
record.


But let's step back to your data for a moment.  Where do these thousand
dimensions come from?  Do you really have a thousand hand-built features?
 Do you not have any sparse, text-like features?

If you really only have a thousand dimensional problem, then I think your
model might exhibit early convergence.

If not, it is quite possible to parallelize SGD, but this is only likely to
help with sparse inputs that exhibit a long-tail frequency distribution.
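
The back-of-envelope sizing arithmetic in this exchange can be checked with a
short script.  This is only a sanity check of the numbers quoted above; the
16 B and 8 B per-value figures are the assumptions stated in the thread, not
measured values:

```python
# Sanity-check the sizing arithmetic quoted in this thread.
records = 900_000_000        # 900M training examples
dims = 1_000                 # features per example
bytes_per_value = 16         # Stanley's storage assumption
total_tb = records * dims * bytes_per_value / 1e12
# about 14.4 TB, i.e. the "14TB" figure above

test_gb = 10_000_000 * dims * 8 / 1e9    # 10M examples at 8 B per value
seconds = 5_000                          # roughly 1.5 hours
throughput_mb_per_s = test_gb * 1_000 / seconds
# 80 GB in about 5000 s works out to 16 MB/s
```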

Re: Which exact algorithm is used in the Mahout SGD?

Posted by Stanley Xu <we...@gmail.com>.
Hi Ted,

1 hour is acceptable, but I guess you misunderstand the data scale I mean
here. The 900M records didn't mean 900M Bytes, but 900M lines of training
set (900M training examples). If every training example has 1000 dimensions, it
means 900 million X 1000 X 16 B = 14TB. If we reduce the logs collected to
14 days, it would be still 2-3TB data.

Per our simple test, for 1000 dimension, 10M lines of record, it will take
about 1-2 hours to do the training, so 900M lines of data will cost at least
90 hours, is that correct?

And from the PPT you provided
http://www.slideshare.net/tdunning/sdforum-11042010
You said it would take less than an hour for 20M data records for
numeric/category mixed dimensions. I am wondering, how many dimensions per
record?

Thanks.
Stanley Xu



On Tue, Apr 26, 2011 at 2:05 PM, Ted Dunning <te...@gmail.com> wrote:

> How much time do you have available for training?
>
> If you can do feature encoding in parallel, then you can probably do this
> pretty fast with SGD.
>
> My guess is that you can push 2-20 MB/s of data through SGD with your kind
> of  data with a good 8 core processor.  If you pre-process your data into 8
> B / dimension, this is 0.25 - 2.5 million data points per second.  This
> could mean that your training takes less than an hour.  If your training
> converges with less data, you may do even better.
>
> Is that not acceptable?
>
> On Mon, Apr 25, 2011 at 10:11 PM, Stanley Xu <we...@gmail.com> wrote:
>
> > Thanks Ted. Read the paper and the code and got the rough idea of how the
> > iteration goes. Thanks so much.
> >
> > With the current data scale we have, we were considering if we could
> train
> > more data with the Logistic Regression. For example, if we wanted to
> train
> > a
> > model for CTR prediction for the last 90 days of data. It would be 900M records
> > after down sampling, assuming there are 1000 feature dimensions.
> It
> > would still be so slow by a single machine with the current SGD
> algorithm.
> >
> > I am wondering if there is a parallel map-reduce algorithm I could use
> > for
> > Logistic Regression? The original Newton-Raphson will take N*N*M/P by
> > the "Map-Reduce
> > for Machine Learning on Multicore" paper, which is much slower than SGD
> on
> > a
> > single machine in a high-dimension space.
> >
> > Could an algorithm like IRLS be parallelized, or is there an approximate
> > algorithm that could be?
> >
> > Thanks,
> > Stanley Xu
> >
> >
> >
> > On Mon, Apr 25, 2011 at 11:58 PM, Ted Dunning <te...@gmail.com>
> > wrote:
> >
> > > Paul K described in-memory algorithms in his dissertation.  Mahout uses
> > > on-line algorithms which are not limited by memory size.
> > >
> > > The method used in Mahout is closer to what Bob Carpenter describes
> here:
> > > http://lingpipe.files.wordpress.com/2008/04/lazysgdregression.pdf
> > >
> > > The most important additions in Mahout are:
> > >
> > > a) confidence weighted learning rates per term
> > >
> > > b) evolutionary tuning of hyper-parameters
> > >
> > > c) mixed ranking and regression
> > >
> > > d) grouped AUC
> > >
> > > On Mon, Apr 25, 2011 at 6:12 AM, Stanley Xu <we...@gmail.com>
> wrote:
> > >
> > > > Dear All,
> > > >
> > > > I am trying to go through the Mahout SGD algorithm and trying to read
> > > > the "Logistic
> > > > Regression for Data Mining and High-Dimensional Classification" a
> > little
> > > > bit, and I am wondering which algorithm exactly is used in the SGD
> > > > code. There are quite a few algorithms mentioned in the paper, and it
> > > > is a little hard for me to find the one that matches the code.
> > > >
> > > > Thanks in advance.
> > > >
> > > > Best wishes,
> > > > Stanley Xu
> > > >
> > >
> >
>

Re: Which exact algorithm is used in the Mahout SGD?

Posted by Ted Dunning <te...@gmail.com>.
How much time do you have available for training?

If you can do feature encoding in parallel, then you can probably do this
pretty fast with SGD.

My guess is that you can push 2-20 MB/s of data through SGD with your kind
of data with a good 8 core processor.  If you pre-process your data into 8
B / dimension, this is 0.25 - 2.5 million data points per second.  This
could mean that your training takes less than an hour.  If your training
converges with less data, you may do even better.

Is that not acceptable?
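
As an aside, the reason feature encoding parallelizes so easily is that each
record can be encoded independently, for example with the hashing trick that
Mahout's vector encoders are built around.  The sketch below is a toy
illustration of that idea in Python, not Mahout's actual encoder API; the
function name and MD5 choice are my own simplifications:

```python
import hashlib

def encode(record, num_features=1000):
    """Hash a dict of raw feature name -> value into a sparse
    index -> value map of fixed dimensionality (the hashing trick)."""
    vec = {}
    for name, value in record.items():
        # Hash the feature name to a stable index in [0, num_features).
        h = int(hashlib.md5(name.encode("utf-8")).hexdigest(), 16)
        j = h % num_features
        vec[j] = vec.get(j, 0.0) + float(value)   # collisions simply add
    return vec
```

Because `encode` touches no shared state, log shards can be encoded in
parallel (e.g. one map task per shard) and only the SGD updates themselves
remain sequential.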

On Mon, Apr 25, 2011 at 10:11 PM, Stanley Xu <we...@gmail.com> wrote:

> Thanks Ted. Read the paper and the code and got the rough idea of how the
> iteration goes. Thanks so much.
>
> With the current data scale we have, we were considering if we could train
> more data with the Logistic Regression. For example, if we wanted to train
> a
> model for CTR prediction for the last 90 days of data. It would be 900M records
> after down sampling, assuming there are 1000 feature dimensions. It
> would still be so slow by a single machine with the current SGD algorithm.
>
> I am wondering if there is a parallel map-reduce algorithm I could use
> for
> Logistic Regression? The original Newton-Raphson will take N*N*M/P by
> the "Map-Reduce
> for Machine Learning on Multicore" paper, which is much slower than SGD on
> a
> single machine in a high-dimension space.
>
> Could an algorithm like IRLS be parallelized, or is there an approximate
> algorithm that could be?
>
> Thanks,
> Stanley Xu
>
>
>
> On Mon, Apr 25, 2011 at 11:58 PM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > Paul K described in-memory algorithms in his dissertation.  Mahout uses
> > on-line algorithms which are not limited by memory size.
> >
> > The method used in Mahout is closer to what Bob Carpenter describes here:
> > http://lingpipe.files.wordpress.com/2008/04/lazysgdregression.pdf
> >
> > The most important additions in Mahout are:
> >
> > a) confidence weighted learning rates per term
> >
> > b) evolutionary tuning of hyper-parameters
> >
> > c) mixed ranking and regression
> >
> > d) grouped AUC
> >
> > On Mon, Apr 25, 2011 at 6:12 AM, Stanley Xu <we...@gmail.com> wrote:
> >
> > > Dear All,
> > >
> > > I am trying to go through the Mahout SGD algorithm and trying to read
> > > the "Logistic
> > > Regression for Data Mining and High-Dimensional Classification" a
> little
> > > bit, and I am wondering which algorithm exactly is used in the SGD
> > > code. There are quite a few algorithms mentioned in the paper, and it is
> > > a little hard for me to find the one that matches the code.
> > >
> > > Thanks in advance.
> > >
> > > Best wishes,
> > > Stanley Xu
> > >
> >
>

Re: Which exact algorithm is used in the Mahout SGD?

Posted by Stanley Xu <we...@gmail.com>.
Thanks Ted. Read the paper and the code and got the rough idea of how the
iteration goes. Thanks so much.

With the current data scale we have, we were considering if we could train
more data with the Logistic Regression. For example, if we wanted to train a
model for CTR prediction for the last 90 days of data. It would be 900M records
after down sampling, assuming there are 1000 feature dimensions. It
would still be so slow by a single machine with the current SGD algorithm.

I am wondering if there is a parallel map-reduce algorithm I could use for
Logistic Regression. The original Newton-Raphson will take N*N*M/P per
the "Map-Reduce
for Machine Learning on Multicore" paper, which is much slower than SGD on a
single machine in a high-dimension space.

Could an algorithm like IRLS be parallelized, or is there an approximate
algorithm that could be?

Thanks,
Stanley Xu



On Mon, Apr 25, 2011 at 11:58 PM, Ted Dunning <te...@gmail.com> wrote:

> Paul K described in-memory algorithms in his dissertation.  Mahout uses
> on-line algorithms which are not limited by memory size.
>
> The method used in Mahout is closer to what Bob Carpenter describes here:
> http://lingpipe.files.wordpress.com/2008/04/lazysgdregression.pdf
>
> The most important additions in Mahout are:
>
> a) confidence weighted learning rates per term
>
> b) evolutionary tuning of hyper-parameters
>
> c) mixed ranking and regression
>
> d) grouped AUC
>
> On Mon, Apr 25, 2011 at 6:12 AM, Stanley Xu <we...@gmail.com> wrote:
>
> > Dear All,
> >
> > I am trying to go through the Mahout SGD algorithm and trying to read
> > the "Logistic
> > Regression for Data Mining and High-Dimensional Classification" a little
> > bit, and I am wondering which algorithm exactly is used in the SGD code.
> > There are quite a few algorithms mentioned in the paper, and it is a
> > little hard for me to find the one that matches the code.
> >
> > Thanks in advance.
> >
> > Best wishes,
> > Stanley Xu
> >
>

Re: Which exact algorithm is used in the Mahout SGD?

Posted by Ted Dunning <te...@gmail.com>.
Paul K described in-memory algorithms in his dissertation.  Mahout uses
on-line algorithms which are not limited by memory size.

The method used in Mahout is closer to what Bob Carpenter describes here:
http://lingpipe.files.wordpress.com/2008/04/lazysgdregression.pdf

The most important additions in Mahout are:

a) confidence weighted learning rates per term

b) evolutionary tuning of hyper-parameters

c) mixed ranking and regression

d) grouped AUC
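
To make the lazy-update idea in Carpenter's note concrete, here is a minimal
Python sketch of online SGD logistic regression with lazily applied L2
regularization.  It is an illustration only: the class name, the fixed
learning rate, and the default parameters are my own simplifications, not
Mahout's code (which adds the per-term annealed rates and the other features
listed above):

```python
import math

class LazySgdLogistic:
    """Toy online SGD logistic regression with lazy L2 regularization."""

    def __init__(self, num_features, rate=0.1, l2=1e-4):
        self.w = [0.0] * num_features
        self.rate = rate                  # fixed learning rate (a simplification)
        self.l2 = l2                      # L2 penalty strength
        self.step = 0                     # global count of training updates
        self.last = [0] * num_features    # step when each weight was last touched

    def _catch_up(self, j):
        # Apply the shrinkage this weight skipped while its feature was
        # absent from the training stream: the "lazy" part of lazy SGD.
        missed = self.step - self.last[j]
        if missed:
            self.w[j] *= (1.0 - self.rate * self.l2) ** missed
            self.last[j] = self.step

    def predict(self, x):
        # x is a sparse example: a dict mapping feature index -> value.
        for j in x:
            self._catch_up(j)
        s = sum(self.w[j] * v for j, v in x.items())
        return 1.0 / (1.0 + math.exp(-s))

    def train(self, target, x):
        err = target - self.predict(x)    # predict() catches the weights up
        for j, v in x.items():
            self.w[j] += self.rate * err * v
        self.step += 1
```

Training streams examples through `train(target, x)` one at a time, so memory
is bounded by the number of features rather than the number of examples,
which is what makes the on-line approach immune to training-set size.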

On Mon, Apr 25, 2011 at 6:12 AM, Stanley Xu <we...@gmail.com> wrote:

> Dear All,
>
> I am trying to go through the Mahout SGD algorithm and trying to read
> the "Logistic
> Regression for Data Mining and High-Dimensional Classification" a little
> bit, and I am wondering which algorithm exactly is used in the SGD code.
> There are quite a few algorithms mentioned in the paper, and it is a little
> hard for me to find the one that matches the code.
>
> Thanks in advance.
>
> Best wishes,
> Stanley Xu
>