Posted to user@mahout.apache.org by Dmitriy Lyubimov <dl...@gmail.com> on 2011/06/01 04:19:42 UTC

Re: Which exact algorithm is used in the Mahout SGD?

Interesting.
i'd probably be interested to try it out.



On Thu, Apr 28, 2011 at 11:31 PM, Stanley Xu <we...@gmail.com> wrote:
> Thanks Ted and Lance. And sorry for the jargon.
>
> For the delay Ted mentioned, we have already considered that; still, thanks a
> lot for all the detailed ideas, they were pretty helpful.
> For the parallelized SGD, I just found a new paper about using DSGD in matrix
> factorization. It's different from logistic regression, but might be helpful as
> well. The title is "Large-Scale Matrix Factorization with Distributed
> Stochastic Gradient Descent", if anyone is interested.
>
> Best wishes,
> Stanley Xu
> On Fri, Apr 29, 2011 at 2:08 PM, Ted Dunning <te...@gmail.com> wrote:
>
>> Yes.
>>
>> Apologies for jargon and TLA's
>> <http://en.wikipedia.org/wiki/Three-letter_acronym>
>>
>> On Thu, Apr 28, 2011 at 7:04 PM, Lance Norskog <go...@gmail.com> wrote:
>>
>> > CTR == Clickthrough Rate
>> >
>> > On Thu, Apr 28, 2011 at 12:06 PM, Ted Dunning <te...@gmail.com>
>> > wrote:
>> > > On Tue, Apr 26, 2011 at 8:00 PM, Stanley Xu <we...@gmail.com>
>> wrote:
>> > >
>> > >> ... As I understood the algorithm, the training time only depends on the
>> > >> non-zero records, but per our test, there would be some overhead we could
>> > >> not ignore for those non-zero records, though the cost is sub-linear or
>> > >> logarithmic in the length of the hashed vector.
>> > >>
>> > >
>> > > This is pretty close if we say "non-zero values".  A record usually
>> > refers
>> > > to an entire training
>> > > example.
>> > >
>> > > The extra work refers mostly to deferred regularization that eventually
>> > > has to be applied.  My guess is that it is even less than log in the
>> > > feature vector size.
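The bookkeeping behind deferred (lazy) regularization can be pictured with a
minimal sketch like the one below: a per-feature counter records the step at
which each weight was last regularized, and the decay the feature "missed"
while it stayed zero is applied in one shot the next time it appears. The class
and field names are hypothetical; this is not Mahout's actual implementation.

    // Minimal sketch of lazy (deferred) L2 regularization in logistic SGD.
    // Hypothetical names; not Mahout's actual OnlineLogisticRegression code.
    public class LazySgdSketch {
        private final double[] weights;
        private final int[] lastStep;  // step at which each weight was last regularized
        private final double lambda;   // L2 strength
        private final double eta;      // learning rate (held constant for simplicity)
        private int step = 0;

        public LazySgdSketch(int numFeatures, double lambda, double eta) {
            this.weights = new double[numFeatures];
            this.lastStep = new int[numFeatures];
            this.lambda = lambda;
            this.eta = eta;
        }

        /** One training example: sparse indices/values and a 0/1 label. */
        public void train(int[] indices, double[] values, int label) {
            step++;
            double margin = 0.0;
            for (int k = 0; k < indices.length; k++) {
                int j = indices[k];
                applyPendingRegularization(j);   // only touched features pay the cost
                margin += weights[j] * values[k];
            }
            double p = 1.0 / (1.0 + Math.exp(-margin));
            double gradient = label - p;
            for (int k = 0; k < indices.length; k++) {
                weights[indices[k]] += eta * gradient * values[k];
            }
        }

        private void applyPendingRegularization(int j) {
            int missed = step - lastStep[j];
            if (missed > 0) {
                weights[j] *= Math.pow(1.0 - eta * lambda, missed);
                lastStep[j] = step;
            }
        }
    }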
>> > >
>> > >
>> > >> And in CTR prediction, I am not pretty sure it will converge very
>> > quickly.
>> > >>
>> > >
>> > > I was saying this purely based on the number of features.
>> > >
>> > >
>> > >> Because we will very possibly see some records that have almost the same
>> > >> features but different results in display ads.
>> > >
>> > >
>> > > The algorithm can still converge to an estimate of the probability
>> here.
>> > >
>> > >
>> > >> But we will see the result in the
>> > >> future.
>> > >
>> > >
>> > > You have to be *very* careful about this to avoid prejudicing the model
>> > > against recent impressions.  If you have fast feedback to the ad targeting
>> > > system, you can have severe instability.
>> > >
>> > > The key thing that you have to do to avoid these biases is to define a
>> > > maximum
>> > > delay before click for the purposes of modeling.  You need to ignore
>> all
>> > > impressions
>> > > younger than this delay (because they may still get a click) and you
>> need
>> > to
>> > > ignore
>> > > all clicks after this delay (to avoid bias in favor of old
>> impressions).
>> > >  For on-line ads
>> > > you can probably use a maximum delay of a few minutes because most
>> clicks
>> > > will
>> > > happen by then.
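A minimal sketch of that labeling rule (hypothetical types, nothing from
Mahout): impressions younger than the maximum delay are dropped entirely, and a
click only counts if it arrived within the maximum delay of its impression.

    import java.util.ArrayList;
    import java.util.List;

    // Sketch of the labeling rule above. Times are epoch milliseconds.
    public class ClickLabeling {
        public static class Impression {
            public final long shownAt;
            public final Long clickedAt;   // null if no click was observed
            public Impression(long shownAt, Long clickedAt) {
                this.shownAt = shownAt;
                this.clickedAt = clickedAt;
            }
        }

        public static class LabeledImpression {
            public final Impression impression;
            public final boolean clicked;
            public LabeledImpression(Impression impression, boolean clicked) {
                this.impression = impression;
                this.clicked = clicked;
            }
        }

        /**
         * Keep only impressions older than maxDelayMs (they have had their full
         * chance to be clicked), and count a click only if it arrived within
         * maxDelayMs of the impression.
         */
        public static List<LabeledImpression> label(List<Impression> impressions,
                                                    long now, long maxDelayMs) {
            List<LabeledImpression> out = new ArrayList<>();
            for (Impression imp : impressions) {
                if (now - imp.shownAt < maxDelayMs) {
                    continue;   // too young: a click may still arrive
                }
                boolean clicked = imp.clickedAt != null
                    && imp.clickedAt - imp.shownAt <= maxDelayMs;
                out.add(new LabeledImpression(imp, clicked));
            }
            return out;
        }
    }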
>> > >
>> > > To find a good value for maximum delay, you should plot the CTR for a
>> > bunch
>> > > of
>> > > ads versus delay.  This will increase rapidly shortly after zero delay,
>> > but
>> > > then will
>> > > level off.  The ordering of ads by CTR is what you care about so you
>> can
>> > > follow the
>> > > curves back and find the shortest delay where the ordering is clearly
>> > > preserved.  Use
>> > > that as your maximum delay.  Typically this is roughly where your CTR
>> is
>> > at
>> > > about
>> > > 80-90% of the final value.
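A rough sketch of picking that delay, using the 80-90% rule of thumb rather
than checking the ordering of ads explicitly (the input layout and names are
hypothetical):

    // Pick the first delay bucket at which every ad's cumulative CTR has
    // reached a fixed fraction (e.g. 0.85) of its final value.
    // ctrByDelay[ad][bucket] holds cumulative CTR per delay bucket.
    public class MaxDelayChooser {
        public static int chooseBucket(double[][] ctrByDelay, double fraction) {
            int numBuckets = ctrByDelay[0].length;
            for (int b = 0; b < numBuckets; b++) {
                boolean allConverged = true;
                for (double[] curve : ctrByDelay) {
                    double finalCtr = curve[curve.length - 1];
                    if (curve[b] < fraction * finalCtr) {
                        allConverged = false;
                        break;
                    }
                }
                if (allConverged) {
                    return b;   // ordering is essentially settled from here on
                }
            }
            return numBuckets - 1;
        }
    }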
>> > >
>> > >
>> > >
>> > >
>> > >> (We are still working on creating a framework to dig out all the features
>> > >> we need from the logs. I would like to share our experience with Mahout SGD
>> > >> once we get our CTR prediction model released.)
>> > >>
>> > >> And for parallelized SGD, what do you mean by help with sparse inputs that
>> > >> exhibit a long-tail frequency distribution? Would you like to share some of
>> > >> your ideas, Ted?
>> > >>
>> > >> Currently, what I can think of is to split the data randomly across
>> > >> multiple mappers, let every mapper learn from its local data, and average
>> > >> over the whole model, or let multiple models vote for every feature's
>> > >> weight. A little like the idea of AdaBoost or RandomForest. But I am not a
>> > >> scientist or mathematician, so I have no idea whether it is correct or not.
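The averaging idea above could be sketched roughly as follows; this is only an
illustration of the idea, not a Mahout API, and whether plain averaging of
independently trained SGD models works well is exactly the open question.

    // Average equally-sized weight vectors produced by independent mappers.
    public class ModelAveraging {
        public static double[] average(java.util.List<double[]> models) {
            int dim = models.get(0).length;
            double[] avg = new double[dim];
            for (double[] w : models) {
                for (int j = 0; j < dim; j++) {
                    avg[j] += w[j];
                }
            }
            for (int j = 0; j < dim; j++) {
                avg[j] /= models.size();
            }
            return avg;
        }
    }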
>> > >>
>> > >>
>> > >> Thanks so much.
>> > >> Stanley Xu
>> > >>
>> > >>
>> > >>
>> > >> On Tue, Apr 26, 2011 at 11:16 PM, Ted Dunning <te...@gmail.com>
>> > >> wrote:
>> > >>
>> > >> > On Mon, Apr 25, 2011 at 11:46 PM, Stanley Xu <we...@gmail.com>
>> > >> wrote:
>> > >> >
>> > >> > > 1 hour is acceptable, but I guess you misunderstand the data scale I
>> > >> > > mean here. The 900M records didn't mean 900M bytes, but 900M lines of
>> > >> > > training set (900M training examples). If every training example has
>> > >> > > 1000 dimensions, it means 900 million x 1000 x 16 B = 14 TB. If we
>> > >> > > reduce the logs collected to 14 days, it would still be 2-3 TB of data.
>> > >> > >
>> > >> >
>> > >> > Oops.  Forgot that last multiplier.
>> > >> >
>> > >> >
>> > >> > > Per our simple test, for 1000 dimension, 10M lines of record, it
>> > will
>> > >> > take
>> > >> > > about 1-2 hours to do the training, so 90M lines of data will cost
>> > at
>> > >> > least
>> > >> > > 90 hours, is that correct?
>> > >> > >
>> > >> >
>> > >> > 10M x 1000 x 8 = 80 GB.
>> > >> >
>> > >> > 1-2 hours = (approx) 5000 seconds.  So this is
>> > >> >
>> > >> > 80 GB / 5000 s = 80/5 MB /s = 16MB / s
>> > >> >
>> > >> > Yes.  This is reasonable speed.  I think you can get a small factor
>> > >> faster
>> > >> > than this with SGD.  I have seen 100 million records with more
>> > non-zero
>> > >> > values than you describe with a training time of 3 hours.  I would
>> not
>> > >> > expect even as much as a factor of 10 speedup here.
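For reference, the arithmetic above as a tiny self-contained check (the
numbers are the ones quoted in this thread):

    public class ThroughputCheck {
        public static void main(String[] args) {
            long records = 10_000_000L;    // 10M training examples
            long valuesPerRecord = 1_000L; // non-zero values per example
            long bytesPerValue = 8L;       // one double
            long seconds = 5_000L;         // roughly 1-2 hours
            long totalBytes = records * valuesPerRecord * bytesPerValue;
            System.out.printf("%.0f GB processed, %.0f MB/s%n",
                totalBytes / 1e9, totalBytes / 1e6 / seconds);
        }
    }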
>> > >> >
>> > >> >
>> > >> > >
>> > >> > > And from the PPT you provided
>> > >> > > http://www.slideshare.net/tdunning/sdforum-11042010
>> > >> > > You said it would take less than an hour for 20M data records for
>> > >> > > numeric/category mixed dimensions. I am wondering, how many
>> > dimensions
>> > >> > per
>> > >> > > record?
>> > >> > >
>> > >> >
>> > >> > These are sparse records with about a thousand non-zero elements per
>> > >> > record.
>> > >> >
>> > >> >
>> > >> > But let's step back to your data for a moment.  Where do these
>> > thousand
>> > >> > dimensions come from?  Do you really have a thousand hand-built
>> > features?
>> > >> >  Do you not have any sparse, text-like features?
>> > >> >
>> > >> > If you really only have a thousand dimensional problem, then I think
>> > your
>> > >> > model might exhibit early convergence.
>> > >> >
>> > >> > If not, it is quite possible to parallelize SGD, but this is only
>> > likely
>> > >> to
>> > >> > help with sparse inputs that exhibit long-tail frequency
>> distribution.
>> > >> >
>> > >>
>> > >
>> >
>> >
>> >
>> > --
>> > Lance Norskog
>> > goksron@gmail.com
>> >
>>
>

Re: Which exact algorithm is used in the Mahout SGD?

Posted by Ted Dunning <te...@gmail.com>.
Well, our classifiers are regressors as well.  Just a different kind.

On Tue, May 31, 2011 at 10:00 PM, Dmitriy Lyubimov <dl...@gmail.com>wrote:

> Factorization is essentially a quantitative (continuous) target
> regression, not a classification, so our abstract classifier
> interfaces probably would not fit here
>

Re: Which exact algorithm is used in the Mahout SGD?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Oh, they do use regularizers, but with a manually tuned regularization rate, it seems.
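For illustration, one SGD step of such a factorization with a fixed, manually
tuned L2 rate lambda might look roughly like this (hypothetical code, not taken
from the DSGD paper or from Mahout):

    // One SGD step for matrix factorization with a fixed L2 rate lambda.
    // u: user factor vector, v: item factor vector, rating: observed cell.
    public class FactorizationStep {
        public static void update(double[] u, double[] v, double rating,
                                  double eta, double lambda) {
            double pred = 0.0;
            for (int k = 0; k < u.length; k++) {
                pred += u[k] * v[k];
            }
            double err = rating - pred;
            for (int k = 0; k < u.length; k++) {
                double uk = u[k];
                u[k] += eta * (err * v[k] - lambda * uk);
                v[k] += eta * (err * uk - lambda * v[k]);
            }
        }

        public static void main(String[] args) {
            double[] u = {0.1, 0.1};
            double[] v = {0.1, 0.1};
            update(u, v, 4.0, 0.01, 0.05);
            System.out.println(u[0] + " " + v[0]);
        }
    }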

On Tue, May 31, 2011 at 10:00 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> i must say i don't understand most of the math.
>
> as for sharding, if i understood it correctly, i remember having
> exactly the same idea for 'strata' selection as they show there, a year
> ago. But i think the problem is that you have to run as many MR jobs
> as the number of strata selected. I.e. if you parallelize it 5 ways (5
> maps) then you have to run it at least 5 times. Or maybe one can
> recombine subepochs in reducers and have another run with reducers (so
> it's 3 times, not 5). Which seems to put fundamental limitations on
> the hadoopified scalability of this (they partly show increased time after
> some rather low # of mappers, which seems to confirm my old concern
> about this).
>
> it probably makes sense with a lot of data. It probably makes even
> more sense without MR sort phase.
>
> Another thing i did not quite get: how do they cope with regularization?
> it looks like they don't want to use it. How's overfitting handled
> then?
>
> but it's compelling enough for my work so i could try it. Again, i
> probably did not get some aspects of the algorithm though.
>
> Factorization is essentially a quantitative (continuous) target
> regression, not a classification, so our abstract classifier
> interfaces probably would not fit here
>
> On Tue, May 31, 2011 at 9:45 PM, Ted Dunning <te...@gmail.com> wrote:
>> After a quick skimming of the paper, it looks vaguely like, if you reduced
>> this to learning logistic regression, you would have something roughly the
>> same as feature sharding.
>>
>> (which is still a good idea)
>>
>> With matrices, of course, you have two ways to shard, not just one.
>>
>> On Tue, May 31, 2011 at 7:19 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>
>>> Interesting.
>>> i'd probably be interested to try it out.
>>>
>>>
>>>
>>> On Thu, Apr 28, 2011 at 11:31 PM, Stanley Xu <we...@gmail.com> wrote:
>>> > Thanks Ted and Lance. And sorry for the jargon.
>>> >
>>> > For the delay Ted mentioned, we have already considered that, still
>>> thanks a
>>> > lot for all the detail ideas, they were pretty helpful.
>>> > For the parallelized SGD, just found a new paper about using DSGD in
>>> matrix
>>> > factorization, it's different from logistic regression, but might helpful
>>> as
>>> > well. Put the title here "Large-Scale Matrix Factorization with
>>> Distributed
>>> > Stochastic Gradient Descent" if anyone is interested.
>>> >
>>> > Best wishes,
>>> > Stanley Xu
>>> > On Fri, Apr 29, 2011 at 2:08 PM, Ted Dunning <te...@gmail.com>
>>> wrote:
>>> >
>>> >> Yes.
>>> >>
>>> >> Apologies for jargon and TLA<
>>> >> http://en.wikipedia.org/wiki/Three-letter_acronym>
>>> >> 's
>>> >>
>>> >> On Thu, Apr 28, 2011 at 7:04 PM, Lance Norskog <go...@gmail.com>
>>> wrote:
>>> >>
>>> >> > CTR == Clickthrough Rate
>>> >> >
>>> >> > On Thu, Apr 28, 2011 at 12:06 PM, Ted Dunning <te...@gmail.com>
>>> >> > wrote:
>>> >> > > On Tue, Apr 26, 2011 at 8:00 PM, Stanley Xu <we...@gmail.com>
>>> >> wrote:
>>> >> > >
>>> >> > >> ... I understood as the algorithm, the time in training only relies
>>> on
>>> >> > the
>>> >> > >> non-zero records, but per our test, there would be some overhead we
>>> >> > could
>>> >> > >> not ignore for thoso non-zero records, though the cost is
>>> sub-linear
>>> >> or
>>> >> > >> logit to the length of the hashed vector.
>>> >> > >>
>>> >> > >
>>> >> > > This is pretty close if we say "non-zero values".  A record usually
>>> >> > refers
>>> >> > > to an entire training
>>> >> > > example.
>>> >> > >
>>> >> > > The extra work refers mostly to deferred regularization that
>>> eventually
>>> >> > has
>>> >> > > to be
>>> >> > > applied.  My guess is that it is even less than log in the feature
>>> >> vector
>>> >> > > size.
>>> >> > >
>>> >> > >
>>> >> > >> And in CTR prediction, I am not pretty sure it will converge very
>>> >> > quickly.
>>> >> > >>
>>> >> > >
>>> >> > > I was saying this purely based on the number of features.
>>> >> > >
>>> >> > >
>>> >> > >> Because we will very possibly see some records has the almost same
>>> >> > feature
>>> >> > >> but different result in display ads.
>>> >> > >
>>> >> > >
>>> >> > > The algorithm can still converge to an estimate of the probability
>>> >> here.
>>> >> > >
>>> >> > >
>>> >> > >> But we will see the result in the
>>> >> > >> future.
>>> >> > >
>>> >> > >
>>> >> > > You have to be *very* careful about this to avoid prejudicing the
>>> model
>>> >> > > against
>>> >> > > recent impressions.  If you have a fast feedback to the ad targeting
>>> >> > system,
>>> >> > > you
>>> >> > > can have severely instability.
>>> >> > >
>>> >> > > The key thing that you have to do to avoid these biases is to define
>>> a
>>> >> > > maximum
>>> >> > > delay before click for the purposes of modeling.  You need to ignore
>>> >> all
>>> >> > > impressions
>>> >> > > younger than this delay (because they may still get a click) and you
>>> >> need
>>> >> > to
>>> >> > > ignore
>>> >> > > all clicks after this delay (to avoid bias in favor of old
>>> >> impressions).
>>> >> > >  For on-line ads
>>> >> > > you can probably use a maximum delay of a few minutes because most
>>> >> clicks
>>> >> > > will
>>> >> > > happen by then.
>>> >> > >
>>> >> > > To find a good value for maximum delay, you should plot the CTR for
>>> a
>>> >> > bunch
>>> >> > > of
>>> >> > > ads versus delay.  This will increase rapidly shortly after zero
>>> delay,
>>> >> > but
>>> >> > > then will
>>> >> > > level off.  The ordering of ads by CTR is what you care about so you
>>> >> can
>>> >> > > follow the
>>> >> > > curves back and find the shortest delay where the ordering is
>>> clearly
>>> >> > > preserved.  Use
>>> >> > > that as your maximum delay.  Typically this is roughly where your
>>> CTR
>>> >> is
>>> >> > at
>>> >> > > about
>>> >> > > 80-90% of the final value.
>>> >> > >
>>> >> > >
>>> >> > >
>>> >> > >
>>> >> > >> (We were still working on creating a framework to digg all the
>>> >> > >> features we need from the log, I would like to share our experience
>>> by
>>> >> > >> using
>>> >> > >> Mahout SGD once we got our CTR prediction model release.)
>>> >> > >>
>>> >> > >> And for parallelize SGD, what do you mean for help with sparse
>>> inputs
>>> >> > that
>>> >> > >> exhibit long-tail frequency distribution? Would you like to share
>>> some
>>> >> > of
>>> >> > >> your ideas, Ted?
>>> >> > >>
>>> >> > >> Currently, what I could think about is split the data to multiple
>>> >> mapper
>>> >> > >> randomly and let every mapper to learn from the local data and get
>>> an
>>> >> > >> average on the whole model, or let multiple model to vote for every
>>> >> > >> feature's weight. A little like the idea of AdaBoost or
>>> RandomForest.
>>> >> > But I
>>> >> > >> am not a scientist or mathematician, so no idea if it is correct or
>>> >> not.
>>> >> > >>
>>> >> > >>
>>> >> > >> Thanks so much.
>>> >> > >> Stanley Xu
>>> >> > >>
>>> >> > >>
>>> >> > >>
>>> >> > >> On Tue, Apr 26, 2011 at 11:16 PM, Ted Dunning <
>>> ted.dunning@gmail.com>
>>> >> > >> wrote:
>>> >> > >>
>>> >> > >> > On Mon, Apr 25, 2011 at 11:46 PM, Stanley Xu <
>>> wenhao.xu@gmail.com>
>>> >> > >> wrote:
>>> >> > >> >
>>> >> > >> > > 1 hour is acceptable, but I guess you misunderstand the data
>>> scale
>>> >> I
>>> >> > >> mean
>>> >> > >> > > here. The 900M records didn't mean 900M Bytes, but 900M lines
>>> of
>>> >> > >> training
>>> >> > >> > > set(900M training example.). If every training data has 1000
>>> >> > dimension,
>>> >> > >> > it
>>> >> > >> > > means 900 million X 1000 X 16 B = 14TB. If we reduce the logs
>>> >> > collected
>>> >> > >> > to
>>> >> > >> > > 14 days, it would be still 2-3TB data.
>>> >> > >> > >
>>> >> > >> >
>>> >> > >> > Oops.  Forgot that last multiplier.
>>> >> > >> >
>>> >> > >> >
>>> >> > >> > > Per our simple test, for 1000 dimension, 10M lines of record,
>>> it
>>> >> > will
>>> >> > >> > take
>>> >> > >> > > about 1-2 hours to do the training, so 90M lines of data will
>>> cost
>>> >> > at
>>> >> > >> > least
>>> >> > >> > > 90 hours, is that correct?
>>> >> > >> > >
>>> >> > >> >
>>> >> > >> > 10M x 1000 x 8 = 80 GB.
>>> >> > >> >
>>> >> > >> > 1-2 hours = (approx) 5000 seconds.  So this is
>>> >> > >> >
>>> >> > >> > 80 GB / 5000 s = 80/5 MB /s = 16MB / s
>>> >> > >> >
>>> >> > >> > Yes.  This is reasonable speed.  I think you can get a small
>>> factor
>>> >> > >> faster
>>> >> > >> > than this with SGD.  I have seen 100 million records with more
>>> >> > non-zero
>>> >> > >> > values than you describe with a training time of 3 hours.  I
>>> would
>>> >> not
>>> >> > >> > expect even as much as a factor of 10 speedup here.
>>> >> > >> >
>>> >> > >> >
>>> >> > >> > >
>>> >> > >> > > And from the PPT you provided
>>> >> > >> > > http://www.slideshare.net/tdunning/sdforum-11042010
>>> >> > >> > > You said it would take less than an hour for 20M data records
>>> for
>>> >> > >> > > numeric/category mixed dimensions. I am wondering, how many
>>> >> > dimensions
>>> >> > >> > per
>>> >> > >> > > record?
>>> >> > >> > >
>>> >> > >> >
>>> >> > >> > These are sparse records records with about a thousand non-zero
>>> >> > elements
>>> >> > >> > per
>>> >> > >> > record.
>>> >> > >> >
>>> >> > >> >
>>> >> > >> > But let's step back to your data for a moment.  Where do these
>>> >> > thousand
>>> >> > >> > dimensions come from?  Do you really have a thousand hand-built
>>> >> > features?
>>> >> > >> >  Do you not have any sparse, text-like features?
>>> >> > >> >
>>> >> > >> > If you really only have a thousand dimensional problem, then I
>>> think
>>> >> > your
>>> >> > >> > model might exhibit early convergence.
>>> >> > >> >
>>> >> > >> > If not, it is quite possible to parallelize SGD, but this is only
>>> >> > likely
>>> >> > >> to
>>> >> > >> > help with sparse inputs that exhibit long-tail frequency
>>> >> distribution.
>>> >> > >> >
>>> >> > >>
>>> >> > >
>>> >> >
>>> >> >
>>> >> >
>>> >> > --
>>> >> > Lance Norskog
>>> >> > goksron@gmail.com
>>> >> >
>>> >>
>>> >
>>>
>>
>

Re: Which exact algorithm is used in the Mahout SGD?

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
i must say i don't understand most of the math.

as for sharding, if i understood it correctly, i remember having
exactly the same idea for 'strata' selection as they show there, a year
ago. But i think the problem is that you have to run as many MR jobs
as the number of strata selected. I.e. if you parallelize it 5 ways (5
maps) then you have to run it at least 5 times. Or maybe one can
recombine subepochs in reducers and have another run with reducers (so
it's 3 times, not 5). Which seems to put fundamental limitations on
the hadoopified scalability of this (they partly show increased time after
some rather low # of mappers, which seems to confirm my old concern
about this).
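To make the concern concrete, a minimal sketch of one way to read the stratum
idea for a d x d blocking of the matrix: sub-epoch s touches the d blocks
(i, (i + s) mod d), which share no rows or columns and so can run in parallel,
and d sub-epochs are needed to cover all d x d blocks, i.e. d passes (or MR
jobs) if each sub-epoch is one job. This is a hypothetical illustration, not
the scheduler from the paper.

    // Blocks processed in sub-epoch s of a d x d blocking: they share no rows
    // or columns, so they can run in parallel (e.g. one block per mapper).
    public class StratumSchedule {
        public static int[][] blocksForSubEpoch(int d, int s) {
            int[][] blocks = new int[d][2];
            for (int i = 0; i < d; i++) {
                blocks[i][0] = i;             // row block
                blocks[i][1] = (i + s) % d;   // column block
            }
            return blocks;
        }

        public static void main(String[] args) {
            int d = 5;
            for (int s = 0; s < d; s++) {     // d sub-epochs cover all d*d blocks
                for (int[] b : blocksForSubEpoch(d, s)) {
                    System.out.print("(" + b[0] + "," + b[1] + ") ");
                }
                System.out.println();
            }
        }
    }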

it probably makes sense with a lot of data. It probably makes even
more sense without MR sort phase.

Another thing i did not quite get: how do they cope with regularization?
it looks like they don't want to use it. How's overfitting handled
then?

but it's compelling enough for my work so i could try it. Again, i
probably did not get some aspects of the algorithm though.

Factorization is essentially a quantitative (continuous) target
regression, not a classification, so our abstract classifier
interfaces probably would not fit here

On Tue, May 31, 2011 at 9:45 PM, Ted Dunning <te...@gmail.com> wrote:
> After a quick skimming of the paper, it looks vaguely like, if you reduced
> this to learning logistic regression, you would have something roughly the
> same as feature sharding.
>
> (which is still a good idea)
>
> With matrices, of course, you have two ways to shard, not just one.
>
> On Tue, May 31, 2011 at 7:19 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
>> Interesting.
>> i'd probably be interested to try it out.
>>
>>
>>
>> On Thu, Apr 28, 2011 at 11:31 PM, Stanley Xu <we...@gmail.com> wrote:
>> > Thanks Ted and Lance. And sorry for the jargon.
>> >
>> > For the delay Ted mentioned, we have already considered that; still, thanks a
>> > lot for all the detailed ideas, they were pretty helpful.
>> > For the parallelized SGD, I just found a new paper about using DSGD in matrix
>> > factorization. It's different from logistic regression, but might be helpful
>> > as well. The title is "Large-Scale Matrix Factorization with Distributed
>> > Stochastic Gradient Descent", if anyone is interested.
>> >
>> > Best wishes,
>> > Stanley Xu
>> > On Fri, Apr 29, 2011 at 2:08 PM, Ted Dunning <te...@gmail.com>
>> wrote:
>> >
>> >> Yes.
>> >>
>> >> Apologies for jargon and TLA's
>> >> <http://en.wikipedia.org/wiki/Three-letter_acronym>
>> >>
>> >> On Thu, Apr 28, 2011 at 7:04 PM, Lance Norskog <go...@gmail.com>
>> wrote:
>> >>
>> >> > CTR == Clickthrough Rate
>> >> >
>> >> > On Thu, Apr 28, 2011 at 12:06 PM, Ted Dunning <te...@gmail.com>
>> >> > wrote:
>> >> > > On Tue, Apr 26, 2011 at 8:00 PM, Stanley Xu <we...@gmail.com>
>> >> wrote:
>> >> > >
>> >> > >> ... As I understood the algorithm, the training time only depends on the
>> >> > >> non-zero records, but per our test, there would be some overhead we could
>> >> > >> not ignore for those non-zero records, though the cost is sub-linear or
>> >> > >> logarithmic in the length of the hashed vector.
>> >> > >>
>> >> > >
>> >> > > This is pretty close if we say "non-zero values".  A record usually
>> >> > refers
>> >> > > to an entire training
>> >> > > example.
>> >> > >
>> >> > > The extra work refers mostly to deferred regularization that
>> eventually
>> >> > has
>> >> > > to be
>> >> > > applied.  My guess is that it is even less than log in the feature
>> >> vector
>> >> > > size.
>> >> > >
>> >> > >
>> >> > >> And in CTR prediction, I am not pretty sure it will converge very
>> >> > quickly.
>> >> > >>
>> >> > >
>> >> > > I was saying this purely based on the number of features.
>> >> > >
>> >> > >
>> >> > >> Because we will very possibly see some records that have almost the same
>> >> > >> features but different results in display ads.
>> >> > >
>> >> > >
>> >> > > The algorithm can still converge to an estimate of the probability
>> >> here.
>> >> > >
>> >> > >
>> >> > >> But we will see the result in the
>> >> > >> future.
>> >> > >
>> >> > >
>> >> > > You have to be *very* careful about this to avoid prejudicing the model
>> >> > > against recent impressions.  If you have fast feedback to the ad targeting
>> >> > > system, you can have severe instability.
>> >> > >
>> >> > > The key thing that you have to do to avoid these biases is to define
>> a
>> >> > > maximum
>> >> > > delay before click for the purposes of modeling.  You need to ignore
>> >> all
>> >> > > impressions
>> >> > > younger than this delay (because they may still get a click) and you
>> >> need
>> >> > to
>> >> > > ignore
>> >> > > all clicks after this delay (to avoid bias in favor of old
>> >> impressions).
>> >> > >  For on-line ads
>> >> > > you can probably use a maximum delay of a few minutes because most
>> >> clicks
>> >> > > will
>> >> > > happen by then.
>> >> > >
>> >> > > To find a good value for maximum delay, you should plot the CTR for
>> a
>> >> > bunch
>> >> > > of
>> >> > > ads versus delay.  This will increase rapidly shortly after zero
>> delay,
>> >> > but
>> >> > > then will
>> >> > > level off.  The ordering of ads by CTR is what you care about so you
>> >> can
>> >> > > follow the
>> >> > > curves back and find the shortest delay where the ordering is
>> clearly
>> >> > > preserved.  Use
>> >> > > that as your maximum delay.  Typically this is roughly where your
>> CTR
>> >> is
>> >> > at
>> >> > > about
>> >> > > 80-90% of the final value.
>> >> > >
>> >> > >
>> >> > >
>> >> > >
>> >> > >> (We are still working on creating a framework to dig out all the features
>> >> > >> we need from the logs. I would like to share our experience with Mahout
>> >> > >> SGD once we get our CTR prediction model released.)
>> >> > >>
>> >> > >> And for parallelized SGD, what do you mean by help with sparse inputs
>> >> > >> that exhibit a long-tail frequency distribution? Would you like to share
>> >> > >> some of your ideas, Ted?
>> >> > >>
>> >> > >> Currently, what I can think of is to split the data randomly across
>> >> > >> multiple mappers, let every mapper learn from its local data, and average
>> >> > >> over the whole model, or let multiple models vote for every feature's
>> >> > >> weight. A little like the idea of AdaBoost or RandomForest. But I am not
>> >> > >> a scientist or mathematician, so I have no idea whether it is correct or
>> >> > >> not.
>> >> > >>
>> >> > >>
>> >> > >> Thanks so much.
>> >> > >> Stanley Xu
>> >> > >>
>> >> > >>
>> >> > >>
>> >> > >> On Tue, Apr 26, 2011 at 11:16 PM, Ted Dunning <
>> ted.dunning@gmail.com>
>> >> > >> wrote:
>> >> > >>
>> >> > >> > On Mon, Apr 25, 2011 at 11:46 PM, Stanley Xu <
>> wenhao.xu@gmail.com>
>> >> > >> wrote:
>> >> > >> >
>> >> > >> > > 1 hour is acceptable, but I guess you misunderstand the data scale I
>> >> > >> > > mean here. The 900M records didn't mean 900M bytes, but 900M lines of
>> >> > >> > > training set (900M training examples). If every training example has
>> >> > >> > > 1000 dimensions, it means 900 million x 1000 x 16 B = 14 TB. If we
>> >> > >> > > reduce the logs collected to 14 days, it would still be 2-3 TB of
>> >> > >> > > data.
>> >> > >> > >
>> >> > >> >
>> >> > >> > Oops.  Forgot that last multiplier.
>> >> > >> >
>> >> > >> >
>> >> > >> > > Per our simple test, for 1000 dimension, 10M lines of record,
>> it
>> >> > will
>> >> > >> > take
>> >> > >> > > about 1-2 hours to do the training, so 90M lines of data will
>> cost
>> >> > at
>> >> > >> > least
>> >> > >> > > 90 hours, is that correct?
>> >> > >> > >
>> >> > >> >
>> >> > >> > 10M x 1000 x 8 = 80 GB.
>> >> > >> >
>> >> > >> > 1-2 hours = (approx) 5000 seconds.  So this is
>> >> > >> >
>> >> > >> > 80 GB / 5000 s = 80/5 MB /s = 16MB / s
>> >> > >> >
>> >> > >> > Yes.  This is reasonable speed.  I think you can get a small
>> factor
>> >> > >> faster
>> >> > >> > than this with SGD.  I have seen 100 million records with more
>> >> > non-zero
>> >> > >> > values than you describe with a training time of 3 hours.  I
>> would
>> >> not
>> >> > >> > expect even as much as a factor of 10 speedup here.
>> >> > >> >
>> >> > >> >
>> >> > >> > >
>> >> > >> > > And from the PPT you provided
>> >> > >> > > http://www.slideshare.net/tdunning/sdforum-11042010
>> >> > >> > > You said it would take less than an hour for 20M data records
>> for
>> >> > >> > > numeric/category mixed dimensions. I am wondering, how many
>> >> > dimensions
>> >> > >> > per
>> >> > >> > > record?
>> >> > >> > >
>> >> > >> >
>> >> > >> > These are sparse records with about a thousand non-zero elements per
>> >> > >> > record.
>> >> > >> >
>> >> > >> >
>> >> > >> > But let's step back to your data for a moment.  Where do these
>> >> > thousand
>> >> > >> > dimensions come from?  Do you really have a thousand hand-built
>> >> > features?
>> >> > >> >  Do you not have any sparse, text-like features?
>> >> > >> >
>> >> > >> > If you really only have a thousand dimensional problem, then I
>> think
>> >> > your
>> >> > >> > model might exhibit early convergence.
>> >> > >> >
>> >> > >> > If not, it is quite possible to parallelize SGD, but this is only
>> >> > likely
>> >> > >> to
>> >> > >> > help with sparse inputs that exhibit long-tail frequency
>> >> distribution.
>> >> > >> >
>> >> > >>
>> >> > >
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Lance Norskog
>> >> > goksron@gmail.com
>> >> >
>> >>
>> >
>>
>

Re: Which exact algorithm is used in the Mahout SGD?

Posted by Ted Dunning <te...@gmail.com>.
After a quick skimming of the paper, it looks vaguely like, if you reduced
this to learning logistic regression, you would have something roughly the
same as feature sharding.

(which is still a good idea)

With matrices, of course, you have two ways to shard, not just one.
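As a rough illustration of what feature sharding means here (hypothetical
code, nothing from Mahout): the weight vector is partitioned across shards by
hashing the feature index, each shard computes a partial dot product over its
own features, the partial margins are summed, and each shard updates only its
own slice.

    // Sketch of feature sharding for logistic regression. Each shard owns the
    // features that hash to it; the per-example margin is the sum of the
    // per-shard partial dot products. Arrays are full-length per shard only
    // to keep the indexing simple.
    public class FeatureShardingSketch {
        private final double[][] shardWeights;   // one slice per shard
        private final int numShards;

        public FeatureShardingSketch(int numFeatures, int numShards) {
            this.numShards = numShards;
            this.shardWeights = new double[numShards][numFeatures];
        }

        private int shardOf(int featureIndex) {
            return Math.floorMod(featureIndex, numShards);
        }

        /** Partial margin contributed by one shard for a sparse example. */
        public double partialMargin(int shard, int[] indices, double[] values) {
            double sum = 0.0;
            for (int k = 0; k < indices.length; k++) {
                if (shardOf(indices[k]) == shard) {
                    sum += shardWeights[shard][indices[k]] * values[k];
                }
            }
            return sum;
        }

        /** One SGD step: sum partial margins, then each shard updates its own features. */
        public void train(int[] indices, double[] values, int label, double eta) {
            double margin = 0.0;
            for (int s = 0; s < numShards; s++) {
                margin += partialMargin(s, indices, values);
            }
            double gradient = label - 1.0 / (1.0 + Math.exp(-margin));
            for (int k = 0; k < indices.length; k++) {
                int s = shardOf(indices[k]);
                shardWeights[s][indices[k]] += eta * gradient * values[k];
            }
        }
    }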

On Tue, May 31, 2011 at 7:19 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Interesting.
> i'd probably be interested to try it out.
>
>
>
> On Thu, Apr 28, 2011 at 11:31 PM, Stanley Xu <we...@gmail.com> wrote:
> > Thanks Ted and Lance. And sorry for the jargon.
> >
> > For the delay Ted mentioned, we have already considered that; still, thanks a
> > lot for all the detailed ideas, they were pretty helpful.
> > For the parallelized SGD, I just found a new paper about using DSGD in matrix
> > factorization. It's different from logistic regression, but might be helpful
> > as well. The title is "Large-Scale Matrix Factorization with Distributed
> > Stochastic Gradient Descent", if anyone is interested.
> >
> > Best wishes,
> > Stanley Xu
> > On Fri, Apr 29, 2011 at 2:08 PM, Ted Dunning <te...@gmail.com>
> wrote:
> >
> >> Yes.
> >>
> >> Apologies for jargon and TLA's
> >> <http://en.wikipedia.org/wiki/Three-letter_acronym>
> >>
> >> On Thu, Apr 28, 2011 at 7:04 PM, Lance Norskog <go...@gmail.com>
> wrote:
> >>
> >> > CTR == Clickthrough Rate
> >> >
> >> > On Thu, Apr 28, 2011 at 12:06 PM, Ted Dunning <te...@gmail.com>
> >> > wrote:
> >> > > On Tue, Apr 26, 2011 at 8:00 PM, Stanley Xu <we...@gmail.com>
> >> wrote:
> >> > >
> >> > >> ... As I understood the algorithm, the training time only depends on the
> >> > >> non-zero records, but per our test, there would be some overhead we could
> >> > >> not ignore for those non-zero records, though the cost is sub-linear or
> >> > >> logarithmic in the length of the hashed vector.
> >> > >>
> >> > >
> >> > > This is pretty close if we say "non-zero values".  A record usually
> >> > refers
> >> > > to an entire training
> >> > > example.
> >> > >
> >> > > The extra work refers mostly to deferred regularization that
> eventually
> >> > has
> >> > > to be
> >> > > applied.  My guess is that it is even less than log in the feature
> >> vector
> >> > > size.
> >> > >
> >> > >
> >> > >> And in CTR prediction, I am not pretty sure it will converge very
> >> > quickly.
> >> > >>
> >> > >
> >> > > I was saying this purely based on the number of features.
> >> > >
> >> > >
> >> > >> Because we will very possibly see some records that have almost the same
> >> > >> features but different results in display ads.
> >> > >
> >> > >
> >> > > The algorithm can still converge to an estimate of the probability
> >> here.
> >> > >
> >> > >
> >> > >> But we will see the result in the
> >> > >> future.
> >> > >
> >> > >
> >> > > You have to be *very* careful about this to avoid prejudicing the model
> >> > > against recent impressions.  If you have fast feedback to the ad targeting
> >> > > system, you can have severe instability.
> >> > >
> >> > > The key thing that you have to do to avoid these biases is to define
> a
> >> > > maximum
> >> > > delay before click for the purposes of modeling.  You need to ignore
> >> all
> >> > > impressions
> >> > > younger than this delay (because they may still get a click) and you
> >> need
> >> > to
> >> > > ignore
> >> > > all clicks after this delay (to avoid bias in favor of old
> >> impressions).
> >> > >  For on-line ads
> >> > > you can probably use a maximum delay of a few minutes because most
> >> clicks
> >> > > will
> >> > > happen by then.
> >> > >
> >> > > To find a good value for maximum delay, you should plot the CTR for
> a
> >> > bunch
> >> > > of
> >> > > ads versus delay.  This will increase rapidly shortly after zero
> delay,
> >> > but
> >> > > then will
> >> > > level off.  The ordering of ads by CTR is what you care about so you
> >> can
> >> > > follow the
> >> > > curves back and find the shortest delay where the ordering is
> clearly
> >> > > preserved.  Use
> >> > > that as your maximum delay.  Typically this is roughly where your
> CTR
> >> is
> >> > at
> >> > > about
> >> > > 80-90% of the final value.
> >> > >
> >> > >
> >> > >
> >> > >
> >> > >> (We are still working on creating a framework to dig out all the features
> >> > >> we need from the logs. I would like to share our experience with Mahout
> >> > >> SGD once we get our CTR prediction model released.)
> >> > >>
> >> > >> And for parallelized SGD, what do you mean by help with sparse inputs
> >> > >> that exhibit a long-tail frequency distribution? Would you like to share
> >> > >> some of your ideas, Ted?
> >> > >>
> >> > >> Currently, what I can think of is to split the data randomly across
> >> > >> multiple mappers, let every mapper learn from its local data, and average
> >> > >> over the whole model, or let multiple models vote for every feature's
> >> > >> weight. A little like the idea of AdaBoost or RandomForest. But I am not
> >> > >> a scientist or mathematician, so I have no idea whether it is correct or
> >> > >> not.
> >> > >>
> >> > >>
> >> > >> Thanks so much.
> >> > >> Stanley Xu
> >> > >>
> >> > >>
> >> > >>
> >> > >> On Tue, Apr 26, 2011 at 11:16 PM, Ted Dunning <
> ted.dunning@gmail.com>
> >> > >> wrote:
> >> > >>
> >> > >> > On Mon, Apr 25, 2011 at 11:46 PM, Stanley Xu <
> wenhao.xu@gmail.com>
> >> > >> wrote:
> >> > >> >
> >> > >> > > 1 hour is acceptable, but I guess you misunderstand the data scale I
> >> > >> > > mean here. The 900M records didn't mean 900M bytes, but 900M lines of
> >> > >> > > training set (900M training examples). If every training example has
> >> > >> > > 1000 dimensions, it means 900 million x 1000 x 16 B = 14 TB. If we
> >> > >> > > reduce the logs collected to 14 days, it would still be 2-3 TB of
> >> > >> > > data.
> >> > >> > >
> >> > >> >
> >> > >> > Oops.  Forgot that last multiplier.
> >> > >> >
> >> > >> >
> >> > >> > > Per our simple test, for 1000 dimension, 10M lines of record,
> it
> >> > will
> >> > >> > take
> >> > >> > > about 1-2 hours to do the training, so 90M lines of data will
> cost
> >> > at
> >> > >> > least
> >> > >> > > 90 hours, is that correct?
> >> > >> > >
> >> > >> >
> >> > >> > 10M x 1000 x 8 = 80 GB.
> >> > >> >
> >> > >> > 1-2 hours = (approx) 5000 seconds.  So this is
> >> > >> >
> >> > >> > 80 GB / 5000 s = 80/5 MB /s = 16MB / s
> >> > >> >
> >> > >> > Yes.  This is reasonable speed.  I think you can get a small
> factor
> >> > >> faster
> >> > >> > than this with SGD.  I have seen 100 million records with more
> >> > non-zero
> >> > >> > values than you describe with a training time of 3 hours.  I
> would
> >> not
> >> > >> > expect even as much as a factor of 10 speedup here.
> >> > >> >
> >> > >> >
> >> > >> > >
> >> > >> > > And from the PPT you provided
> >> > >> > > http://www.slideshare.net/tdunning/sdforum-11042010
> >> > >> > > You said it would take less than an hour for 20M data records
> for
> >> > >> > > numeric/category mixed dimensions. I am wondering, how many
> >> > dimensions
> >> > >> > per
> >> > >> > > record?
> >> > >> > >
> >> > >> >
> >> > >> > These are sparse records with about a thousand non-zero elements per
> >> > >> > record.
> >> > >> >
> >> > >> >
> >> > >> > But let's step back to your data for a moment.  Where do these
> >> > thousand
> >> > >> > dimensions come from?  Do you really have a thousand hand-built
> >> > features?
> >> > >> >  Do you not have any sparse, text-like features?
> >> > >> >
> >> > >> > If you really only have a thousand dimensional problem, then I
> think
> >> > your
> >> > >> > model might exhibit early convergence.
> >> > >> >
> >> > >> > If not, it is quite possible to parallelize SGD, but this is only
> >> > likely
> >> > >> to
> >> > >> > help with sparse inputs that exhibit long-tail frequency
> >> distribution.
> >> > >> >
> >> > >>
> >> > >
> >> >
> >> >
> >> >
> >> > --
> >> > Lance Norskog
> >> > goksron@gmail.com
> >> >
> >>
> >
>