Posted to user@mahout.apache.org by Frank Wang <wa...@gmail.com> on 2010/10/20 22:57:47 UTC

Implementation for Linear Regression

Hi,

I'm interested in implementing Linear Regression in Mahout. Who would be the
point person for the algorithm? I'd love to discuss the implementation
details, or to help out if anyone is working on it already :)

Thanks

Re: Implementation for Linear Regression

Posted by Ted Dunning <te...@gmail.com>.
This is common for any online learning algorithm.

Normally a declining learning rate is used.  This is called annealing.  If
the learning rate starts too large but anneals reasonably quickly, you wind
up wasting some data and have to recover from some crazy coefficients, but
it generally works rather well.
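
For illustration, a minimal sketch of an annealed SGD update for least-squares
linear regression (plain Java; the method name, the 1 / (1 + t / tau) schedule,
and the arguments are illustrative, not the Mahout implementation):

    // Sketch only: the per-step learning rate mu = mu0 / (1 + t / tau) shrinks
    // toward zero, so a rate that starts too large still anneals before the
    // coefficients can run away.
    static double[] trainAnnealed(double[][] x, double[] y, double mu0, double tau) {
      double[] beta = new double[x[0].length];        // coefficients
      for (int t = 0; t < x.length; t++) {
        double predicted = 0;
        for (int i = 0; i < beta.length; i++) {
          predicted += beta[i] * x[t][i];
        }
        double mu = mu0 / (1 + t / tau);              // annealed step size
        double error = y[t] - predicted;
        for (int i = 0; i < beta.length; i++) {
          beta[i] += mu * error * x[t][i];            // squared-error gradient step
        }
      }
      return beta;
    }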

On Wed, Nov 10, 2010 at 1:26 AM, Frank Wang <wa...@gmail.com> wrote:

> With linear regression, it seems that the coefficients tend to grow
> unboundedly when using a larger learning rate, ie. --rate 50.
> Works fine when i keep the rate < 1.
>
> Is this a normal characteristics for Linear Regression?
>
> On Fri, Oct 22, 2010 at 1:08 AM, Frank Wang <wa...@gmail.com> wrote:
>
> > Thanks Ted.
> >
> > It's a very interesting solution. Currently, we need to account for age
> > related terms when calculating the relevance ranking, and this is done
> > before display time. We will play around with our data and see if we can
> > model our data to leverage on the trick.
> >
> > In terms of Linear Regression, I've attached the initial patch on
> > MAHOUT-529 <https://issues.apache.org/jira/browse/MAHOUT-529>. It's
> mainly
> > the AbstractOnlineLinearRegression and OnlineLinearRegression classes.
> Lemme
> > know if the code makes sense.
> >
> > I have 2 questions:
> >
> > 1.
> > The apply() function in DefaultGradient has:
> >     Vector r = v.like();
> >     if (actual != 0) {
> >       r.setQuick(actual - 1, 1);
> >     }
> >
> > The code seems to work only for logistic regression. When actual is 0,
> r[0]
> > remains 0, and when actual is 1, r[0] gets set to 1. I'm not sure if I'm
> > understanding it correctly. For now, I've included DefaultGradientLinear
> in
> > the patch as a work around. If you could give me some advice, that'd be
> > helpful.
> >
> >
> > 2.
> > As I'm working on the sample code TrainLinear, I was referring to
> > TrainLogistic code. I'm confused with this line:
> >          int targetValue = csv.processLine(line, input);
> >
> > The training file is:
> > "a","b","c","target"
> > 3,1,10,1
> > 2,1,10,1
> > 1,0,2,0
> > ...
> >
> > But the output for processLine() is:
> > Line 1: targetValue = 0, input = {2:4.0, 1:10.0, 0:1.0}
> > Line 2: targetValue = 0, input = {2:3.0, 1:10.0, 0:1.0}
> > Line 3: targetValue = 1, input = {2:1.0, 1:2.0, 0:1.0}
> > ...
> >
> > It seems the target values are inverted, and some input values are
> > incremented. It'd be great if you could explain the processLine() a
> little
> > bit.
> >
> > btw, is the mail list a good place for implementation discussion or
> should
> > it take place on the JIRA page?
> >
> > Thanks
> >
> >
> > On Wed, Oct 20, 2010 at 9:58 PM, Ted Dunning <ted.dunning@gmail.com
> >wrote:
> >
> >> You don't have to apply the age correction to old data until you display
> >> the
> >> data.  The trick is to store all of the fixed components
> >> of the rating in linear form and then add only the age related terms at
> >> display time.  This allows you to penalize items that are unlikely to be
> >> relevant due to age and doesn't require any recomputation.
> >>
> >> On Wed, Oct 20, 2010 at 9:32 PM, Frank Wang <wa...@gmail.com>
> wrote:
> >>
> >> > Hi Ted,
> >> >
> >> > I've created the JIRA issue at
> >> > https://issues.apache.org/jira/browse/MAHOUT-529, will attach what i
> >> have
> >> > soon.
> >> >
> >> > Do you mean using time as a feature in the logistic regression? I
> >> thought
> >> > about your suggestion the other day, but I'm not re-calculating the
> >> > probability on the old data. After training each night, we only apply
> >> the
> >> > coefficients on next day's new data. I'm not quite sure how would the
> >> decay
> >> > function work in this case. Do you have an example?
> >> >
> >> > Thanks
> >> >
> >> >
> >> > On Wed, Oct 20, 2010 at 8:48 PM, Ted Dunning <te...@gmail.com>
> >> > wrote:
> >> >
> >> > > Can you open a JIRA and attach a patch.
> >> > >
> >> > > Your approach seems reasonable so far for the regression.
> >> > >
> >> > > In terms of how it could be applied, it seems like you are trying to
> >> > > estimate a life-span for a posting to model relevance decay.
> >> > >
> >> > > My own preference there would be to try to estimate relevance (0 or
> 1)
> >> > > using
> >> > > logistic regression and then put in various decay functions in as
> >> > features.
> >> > >  The weighted sum of those decay functions is your time decay of
> >> > relevance
> >> > > (in log-odds).
> >> > >
> >> > > My initial shot at decay functions would include age, square of age
> >> and
> >> > log
> >> > > of age.  My guess is that direct age would suffice because of the
> >> > logistic
> >> > > link function which looks like a logarithmic function where your
> >> models
> >> > > will
> >> > > probably live.
> >> > >
> >> > > On Wed, Oct 20, 2010 at 8:15 PM, Frank Wang <wa...@gmail.com>
> >> > wrote:
> >> > >
> >> > > > Hi Ted,
> >> > > >
> >> > > > thanks for your reply.
> >> > > > I'm trying a new model where I want to estimate the output as a
> >> > timespan
> >> > > > quantified in number of seconds, which is not bounded. That's why
> I
> >> > think
> >> > > > I'd use linear regression instead of logistic regression. (lemme
> >> know
> >> > if
> >> > > > i'm
> >> > > > wrong)
> >> > > >
> >> > > > I started on the code yesterday. The new
> >> AbstractOnlineLinearRegression
> >> > > > class is implementing the OnlineLearner interface. I updated the
> >> > > classify()
> >> > > > function to use linear model. I tried to follow the format for
> >> > > > AbstractOnlineLogisticRegression.
> >> > > >
> >> > > > I think since linear regression can be implemented w/ sgd, the
> >> train()
> >> > > > and regularize() functions would look similar. I'm not sure if i'm
> >> on
> >> > the
> >> > > > right path. Any advice would be helpful.
> >> > > >
> >> > > > Thanks
> >> > > >
> >> > > > On Wed, Oct 20, 2010 at 3:34 PM, Ted Dunning <
> ted.dunning@gmail.com
> >> >
> >> > > > wrote:
> >> > > >
> >> > > > > Frank,
> >> > > > >
> >> > > > > Sorry I didn't answer your previous email regarding this.
> >> > > > >
> >> > > > > It sounded to me like your application would actually be happier
> >> with
> >> > a
> >> > > > > form
> >> > > > > of logistic regression.
> >> > > > >
> >> > > > > Perhaps we should talk some more about this on the list.
> >> > > > >
> >> > > > > If you want a normal linear regression, the current
> OnlineLearner
> >> > > > interface
> >> > > > > isn't terribly appropriate since it assumes a 1 of n vector
> target
> >> > > > > variable.
> >> > > > >
> >> > > > > If you were to extend that interface to accept a vector form of
> >> > target
> >> > > > > variable then linear regression would work (and some clever
> tricks
> >> > > would
> >> > > > > become possible for logistic regression).
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > > On Wed, Oct 20, 2010 at 1:57 PM, Frank Wang <
> wangfanjie@gmail.com
> >> >
> >> > > > wrote:
> >> > > > >
> >> > > > > > Hi,
> >> > > > > >
> >> > > > > > I'm interested in implementing Linear Regression in Mahout.
> Who
> >> > would
> >> > > > be
> >> > > > > > the
> >> > > > > > point person for the algorithm? I'd love to discuss the
> >> > > implementation
> >> > > > > > details, or to help out if anyone is working on it already :)
> >> > > > > >
> >> > > > > > Thanks
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>

Re: Implementation for Linear Regression

Posted by Frank Wang <wa...@gmail.com>.
With linear regression, it seems that the coefficients tend to grow
unboundedly when using a larger learning rate, e.g. --rate 50.
It works fine when I keep the rate < 1.

Is this a normal characteristic of linear regression?

On Fri, Oct 22, 2010 at 1:08 AM, Frank Wang <wa...@gmail.com> wrote:

> Thanks Ted.
>
> It's a very interesting solution. Currently, we need to account for age
> related terms when calculating the relevance ranking, and this is done
> before display time. We will play around with our data and see if we can
> model our data to leverage on the trick.
>
> In terms of Linear Regression, I've attached the initial patch on
> MAHOUT-529 <https://issues.apache.org/jira/browse/MAHOUT-529>. It's mainly
> the AbstractOnlineLinearRegression and OnlineLinearRegression classes. Lemme
> know if the code makes sense.
>
> I have 2 questions:
>
> 1.
> The apply() function in DefaultGradient has:
>     Vector r = v.like();
>     if (actual != 0) {
>       r.setQuick(actual - 1, 1);
>     }
>
> The code seems to work only for logistic regression. When actual is 0, r[0]
> remains 0, and when actual is 1, r[0] gets set to 1. I'm not sure if I'm
> understanding it correctly. For now, I've included DefaultGradientLinear in
> the patch as a work around. If you could give me some advice, that'd be
> helpful.
>
>
> 2.
> As I'm working on the sample code TrainLinear, I was referring to
> TrainLogistic code. I'm confused with this line:
>          int targetValue = csv.processLine(line, input);
>
> The training file is:
> "a","b","c","target"
> 3,1,10,1
> 2,1,10,1
> 1,0,2,0
> ...
>
> But the output for processLine() is:
> Line 1: targetValue = 0, input = {2:4.0, 1:10.0, 0:1.0}
> Line 2: targetValue = 0, input = {2:3.0, 1:10.0, 0:1.0}
> Line 3: targetValue = 1, input = {2:1.0, 1:2.0, 0:1.0}
> ...
>
> It seems the target values are inverted, and some input values are
> incremented. It'd be great if you could explain the processLine() a little
> bit.
>
> btw, is the mail list a good place for implementation discussion or should
> it take place on the JIRA page?
>
> Thanks
>
>
> On Wed, Oct 20, 2010 at 9:58 PM, Ted Dunning <te...@gmail.com>wrote:
>
>> You don't have to apply the age correction to old data until you display
>> the
>> data.  The trick is to store all of the fixed components
>> of the rating in linear form and then add only the age related terms at
>> display time.  This allows you to penalize items that are unlikely to be
>> relevant due to age and doesn't require any recomputation.
>>
>> On Wed, Oct 20, 2010 at 9:32 PM, Frank Wang <wa...@gmail.com> wrote:
>>
>> > Hi Ted,
>> >
>> > I've created the JIRA issue at
>> > https://issues.apache.org/jira/browse/MAHOUT-529, will attach what i
>> have
>> > soon.
>> >
>> > Do you mean using time as a feature in the logistic regression? I
>> thought
>> > about your suggestion the other day, but I'm not re-calculating the
>> > probability on the old data. After training each night, we only apply
>> the
>> > coefficients on next day's new data. I'm not quite sure how would the
>> decay
>> > function work in this case. Do you have an example?
>> >
>> > Thanks
>> >
>> >
>> > On Wed, Oct 20, 2010 at 8:48 PM, Ted Dunning <te...@gmail.com>
>> > wrote:
>> >
>> > > Can you open a JIRA and attach a patch.
>> > >
>> > > Your approach seems reasonable so far for the regression.
>> > >
>> > > In terms of how it could be applied, it seems like you are trying to
>> > > estimate a life-span for a posting to model relevance decay.
>> > >
>> > > My own preference there would be to try to estimate relevance (0 or 1)
>> > > using
>> > > logistic regression and then put in various decay functions in as
>> > features.
>> > >  The weighted sum of those decay functions is your time decay of
>> > relevance
>> > > (in log-odds).
>> > >
>> > > My initial shot at decay functions would include age, square of age
>> and
>> > log
>> > > of age.  My guess is that direct age would suffice because of the
>> > logistic
>> > > link function which looks like a logarithmic function where your
>> models
>> > > will
>> > > probably live.
>> > >
>> > > On Wed, Oct 20, 2010 at 8:15 PM, Frank Wang <wa...@gmail.com>
>> > wrote:
>> > >
>> > > > Hi Ted,
>> > > >
>> > > > thanks for your reply.
>> > > > I'm trying a new model where I want to estimate the output as a
>> > timespan
>> > > > quantified in number of seconds, which is not bounded. That's why I
>> > think
>> > > > I'd use linear regression instead of logistic regression. (lemme
>> know
>> > if
>> > > > i'm
>> > > > wrong)
>> > > >
>> > > > I started on the code yesterday. The new
>> AbstractOnlineLinearRegression
>> > > > class is implementing the OnlineLearner interface. I updated the
>> > > classify()
>> > > > function to use linear model. I tried to follow the format for
>> > > > AbstractOnlineLogisticRegression.
>> > > >
>> > > > I think since linear regression can be implemented w/ sgd, the
>> train()
>> > > > and regularize() functions would look similar. I'm not sure if i'm
>> on
>> > the
>> > > > right path. Any advice would be helpful.
>> > > >
>> > > > Thanks
>> > > >
>> > > > On Wed, Oct 20, 2010 at 3:34 PM, Ted Dunning <ted.dunning@gmail.com
>> >
>> > > > wrote:
>> > > >
>> > > > > Frank,
>> > > > >
>> > > > > Sorry I didn't answer your previous email regarding this.
>> > > > >
>> > > > > It sounded to me like your application would actually be happier
>> with
>> > a
>> > > > > form
>> > > > > of logistic regression.
>> > > > >
>> > > > > Perhaps we should talk some more about this on the list.
>> > > > >
>> > > > > If you want a normal linear regression, the current OnlineLearner
>> > > > interface
>> > > > > isn't terribly appropriate since it assumes a 1 of n vector target
>> > > > > variable.
>> > > > >
>> > > > > If you were to extend that interface to accept a vector form of
>> > target
>> > > > > variable then linear regression would work (and some clever tricks
>> > > would
>> > > > > become possible for logistic regression).
>> > > > >
>> > > > >
>> > > > >
>> > > > > On Wed, Oct 20, 2010 at 1:57 PM, Frank Wang <wangfanjie@gmail.com
>> >
>> > > > wrote:
>> > > > >
>> > > > > > Hi,
>> > > > > >
>> > > > > > I'm interested in implementing Linear Regression in Mahout. Who
>> > would
>> > > > be
>> > > > > > the
>> > > > > > point person for the algorithm? I'd love to discuss the
>> > > implementation
>> > > > > > details, or to help out if anyone is working on it already :)
>> > > > > >
>> > > > > > Thanks
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
>

Re: Implementation for Linear Regression

Posted by Frank Wang <wa...@gmail.com>.
Thanks Ted.

It's a very interesting solution. Currently, we need to account for
age-related terms when calculating the relevance ranking, and this is done
before display time. We will play around with our data and see if we can
model our data to leverage the trick.

In terms of linear regression, I've attached the initial patch on
MAHOUT-529 <https://issues.apache.org/jira/browse/MAHOUT-529>.
It's mainly the AbstractOnlineLinearRegression and OnlineLinearRegression
classes. Let me know if the code makes sense.

I have 2 questions:

1.
The apply() function in DefaultGradient has:
    Vector r = v.like();
    if (actual != 0) {
      r.setQuick(actual - 1, 1);
    }

The code seems to work only for logistic regression. When actual is 0, r[0]
remains 0, and when actual is 1, r[0] gets set to 1. I'm not sure if I'm
understanding it correctly. For now, I've included DefaultGradientLinear in
the patch as a workaround. If you could give me some advice, that'd be
helpful.
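
For comparison, a linear-regression gradient would not need the 1-of-n
encoding at all; the squared-error gradient is just the residual times the
input. A minimal sketch (plain Java, not tied to the actual Gradient
interface signature):

    // Sketch only: the gradient of 0.5 * (actual - predicted)^2 with respect to
    // each coefficient is (actual - predicted) times the corresponding feature.
    static double[] squaredErrorGradient(double actual, double predicted, double[] input) {
      double residual = actual - predicted;
      double[] gradient = new double[input.length];
      for (int i = 0; i < input.length; i++) {
        gradient[i] = residual * input[i];
      }
      return gradient;
    }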


2.
While working on the sample code TrainLinear, I've been referring to the
TrainLogistic code. I'm confused by this line:
         int targetValue = csv.processLine(line, input);

The training file is:
"a","b","c","target"
3,1,10,1
2,1,10,1
1,0,2,0
...

But the output for processLine() is:
Line 1: targetValue = 0, input = {2:4.0, 1:10.0, 0:1.0}
Line 2: targetValue = 0, input = {2:3.0, 1:10.0, 0:1.0}
Line 3: targetValue = 1, input = {2:1.0, 1:2.0, 0:1.0}
...

It seems the target values are inverted and some input values are
incremented. It'd be great if you could explain processLine() a little.

By the way, is the mailing list a good place for implementation discussion,
or should it take place on the JIRA page?

Thanks

On Wed, Oct 20, 2010 at 9:58 PM, Ted Dunning <te...@gmail.com> wrote:

> You don't have to apply the age correction to old data until you display
> the
> data.  The trick is to store all of the fixed components
> of the rating in linear form and then add only the age related terms at
> display time.  This allows you to penalize items that are unlikely to be
> relevant due to age and doesn't require any recomputation.
>
> On Wed, Oct 20, 2010 at 9:32 PM, Frank Wang <wa...@gmail.com> wrote:
>
> > Hi Ted,
> >
> > I've created the JIRA issue at
> > https://issues.apache.org/jira/browse/MAHOUT-529, will attach what i
> have
> > soon.
> >
> > Do you mean using time as a feature in the logistic regression? I thought
> > about your suggestion the other day, but I'm not re-calculating the
> > probability on the old data. After training each night, we only apply the
> > coefficients on next day's new data. I'm not quite sure how would the
> decay
> > function work in this case. Do you have an example?
> >
> > Thanks
> >
> >
> > On Wed, Oct 20, 2010 at 8:48 PM, Ted Dunning <te...@gmail.com>
> > wrote:
> >
> > > Can you open a JIRA and attach a patch.
> > >
> > > Your approach seems reasonable so far for the regression.
> > >
> > > In terms of how it could be applied, it seems like you are trying to
> > > estimate a life-span for a posting to model relevance decay.
> > >
> > > My own preference there would be to try to estimate relevance (0 or 1)
> > > using
> > > logistic regression and then put in various decay functions in as
> > features.
> > >  The weighted sum of those decay functions is your time decay of
> > relevance
> > > (in log-odds).
> > >
> > > My initial shot at decay functions would include age, square of age and
> > log
> > > of age.  My guess is that direct age would suffice because of the
> > logistic
> > > link function which looks like a logarithmic function where your models
> > > will
> > > probably live.
> > >
> > > On Wed, Oct 20, 2010 at 8:15 PM, Frank Wang <wa...@gmail.com>
> > wrote:
> > >
> > > > Hi Ted,
> > > >
> > > > thanks for your reply.
> > > > I'm trying a new model where I want to estimate the output as a
> > timespan
> > > > quantified in number of seconds, which is not bounded. That's why I
> > think
> > > > I'd use linear regression instead of logistic regression. (lemme know
> > if
> > > > i'm
> > > > wrong)
> > > >
> > > > I started on the code yesterday. The new
> AbstractOnlineLinearRegression
> > > > class is implementing the OnlineLearner interface. I updated the
> > > classify()
> > > > function to use linear model. I tried to follow the format for
> > > > AbstractOnlineLogisticRegression.
> > > >
> > > > I think since linear regression can be implemented w/ sgd, the
> train()
> > > > and regularize() functions would look similar. I'm not sure if i'm on
> > the
> > > > right path. Any advice would be helpful.
> > > >
> > > > Thanks
> > > >
> > > > On Wed, Oct 20, 2010 at 3:34 PM, Ted Dunning <te...@gmail.com>
> > > > wrote:
> > > >
> > > > > Frank,
> > > > >
> > > > > Sorry I didn't answer your previous email regarding this.
> > > > >
> > > > > It sounded to me like your application would actually be happier
> with
> > a
> > > > > form
> > > > > of logistic regression.
> > > > >
> > > > > Perhaps we should talk some more about this on the list.
> > > > >
> > > > > If you want a normal linear regression, the current OnlineLearner
> > > > interface
> > > > > isn't terribly appropriate since it assumes a 1 of n vector target
> > > > > variable.
> > > > >
> > > > > If you were to extend that interface to accept a vector form of
> > target
> > > > > variable then linear regression would work (and some clever tricks
> > > would
> > > > > become possible for logistic regression).
> > > > >
> > > > >
> > > > >
> > > > > On Wed, Oct 20, 2010 at 1:57 PM, Frank Wang <wa...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I'm interested in implementing Linear Regression in Mahout. Who
> > would
> > > > be
> > > > > > the
> > > > > > point person for the algorithm? I'd love to discuss the
> > > implementation
> > > > > > details, or to help out if anyone is working on it already :)
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Implementation for Linear Regression

Posted by Ted Dunning <te...@gmail.com>.
You don't have to apply the age correction to old data until you display the
data.  The trick is to store all of the fixed components of the rating in
linear form and then add only the age-related terms at display time.  This
allows you to penalize items that are unlikely to be relevant due to age and
doesn't require any recomputation.
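
A minimal sketch of that split (plain Java; the single age term and the
helper names are illustrative):

    // Sketch only: the fixed part of the score is computed once and stored with
    // the item; only the age-dependent term is added when the item is displayed.
    static double storedFixedScore(double[] betaFixed, double[] fixedFeatures) {
      double score = 0;
      for (int i = 0; i < betaFixed.length; i++) {
        score += betaFixed[i] * fixedFeatures[i];     // precomputed at index time
      }
      return score;
    }

    static double displayScore(double fixedScore, double betaAge, double ageInSeconds) {
      return fixedScore + betaAge * ageInSeconds;     // age penalty added at display time only
    }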

On Wed, Oct 20, 2010 at 9:32 PM, Frank Wang <wa...@gmail.com> wrote:

> Hi Ted,
>
> I've created the JIRA issue at
> https://issues.apache.org/jira/browse/MAHOUT-529, will attach what i have
> soon.
>
> Do you mean using time as a feature in the logistic regression? I thought
> about your suggestion the other day, but I'm not re-calculating the
> probability on the old data. After training each night, we only apply the
> coefficients on next day's new data. I'm not quite sure how would the decay
> function work in this case. Do you have an example?
>
> Thanks
>
>
> On Wed, Oct 20, 2010 at 8:48 PM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > Can you open a JIRA and attach a patch.
> >
> > Your approach seems reasonable so far for the regression.
> >
> > In terms of how it could be applied, it seems like you are trying to
> > estimate a life-span for a posting to model relevance decay.
> >
> > My own preference there would be to try to estimate relevance (0 or 1)
> > using
> > logistic regression and then put in various decay functions in as
> features.
> >  The weighted sum of those decay functions is your time decay of
> relevance
> > (in log-odds).
> >
> > My initial shot at decay functions would include age, square of age and
> log
> > of age.  My guess is that direct age would suffice because of the
> logistic
> > link function which looks like a logarithmic function where your models
> > will
> > probably live.
> >
> > On Wed, Oct 20, 2010 at 8:15 PM, Frank Wang <wa...@gmail.com>
> wrote:
> >
> > > Hi Ted,
> > >
> > > thanks for your reply.
> > > I'm trying a new model where I want to estimate the output as a
> timespan
> > > quantified in number of seconds, which is not bounded. That's why I
> think
> > > I'd use linear regression instead of logistic regression. (lemme know
> if
> > > i'm
> > > wrong)
> > >
> > > I started on the code yesterday. The new AbstractOnlineLinearRegression
> > > class is implementing the OnlineLearner interface. I updated the
> > classify()
> > > function to use linear model. I tried to follow the format for
> > > AbstractOnlineLogisticRegression.
> > >
> > > I think since linear regression can be implemented w/ sgd, the train()
> > > and regularize() functions would look similar. I'm not sure if i'm on
> the
> > > right path. Any advice would be helpful.
> > >
> > > Thanks
> > >
> > > On Wed, Oct 20, 2010 at 3:34 PM, Ted Dunning <te...@gmail.com>
> > > wrote:
> > >
> > > > Frank,
> > > >
> > > > Sorry I didn't answer your previous email regarding this.
> > > >
> > > > It sounded to me like your application would actually be happier with
> a
> > > > form
> > > > of logistic regression.
> > > >
> > > > Perhaps we should talk some more about this on the list.
> > > >
> > > > If you want a normal linear regression, the current OnlineLearner
> > > interface
> > > > isn't terribly appropriate since it assumes a 1 of n vector target
> > > > variable.
> > > >
> > > > If you were to extend that interface to accept a vector form of
> target
> > > > variable then linear regression would work (and some clever tricks
> > would
> > > > become possible for logistic regression).
> > > >
> > > >
> > > >
> > > > On Wed, Oct 20, 2010 at 1:57 PM, Frank Wang <wa...@gmail.com>
> > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I'm interested in implementing Linear Regression in Mahout. Who
> would
> > > be
> > > > > the
> > > > > point person for the algorithm? I'd love to discuss the
> > implementation
> > > > > details, or to help out if anyone is working on it already :)
> > > > >
> > > > > Thanks
> > > > >
> > > >
> > >
> >
>

Re: Implementation for Linear Regression

Posted by Frank Wang <wa...@gmail.com>.
Hi Ted,

I've created the JIRA issue at
https://issues.apache.org/jira/browse/MAHOUT-529, and will attach what I have
soon.

Do you mean using time as a feature in the logistic regression? I thought
about your suggestion the other day, but I'm not re-calculating the
probability on the old data. After training each night, we only apply the
coefficients to the next day's new data. I'm not quite sure how the decay
function would work in this case. Do you have an example?

Thanks


On Wed, Oct 20, 2010 at 8:48 PM, Ted Dunning <te...@gmail.com> wrote:

> Can you open a JIRA and attach a patch.
>
> Your approach seems reasonable so far for the regression.
>
> In terms of how it could be applied, it seems like you are trying to
> estimate a life-span for a posting to model relevance decay.
>
> My own preference there would be to try to estimate relevance (0 or 1)
> using
> logistic regression and then put in various decay functions in as features.
>  The weighted sum of those decay functions is your time decay of relevance
> (in log-odds).
>
> My initial shot at decay functions would include age, square of age and log
> of age.  My guess is that direct age would suffice because of the logistic
> link function which looks like a logarithmic function where your models
> will
> probably live.
>
> On Wed, Oct 20, 2010 at 8:15 PM, Frank Wang <wa...@gmail.com> wrote:
>
> > Hi Ted,
> >
> > thanks for your reply.
> > I'm trying a new model where I want to estimate the output as a timespan
> > quantified in number of seconds, which is not bounded. That's why I think
> > I'd use linear regression instead of logistic regression. (lemme know if
> > i'm
> > wrong)
> >
> > I started on the code yesterday. The new AbstractOnlineLinearRegression
> > class is implementing the OnlineLearner interface. I updated the
> classify()
> > function to use linear model. I tried to follow the format for
> > AbstractOnlineLogisticRegression.
> >
> > I think since linear regression can be implemented w/ sgd, the train()
> > and regularize() functions would look similar. I'm not sure if i'm on the
> > right path. Any advice would be helpful.
> >
> > Thanks
> >
> > On Wed, Oct 20, 2010 at 3:34 PM, Ted Dunning <te...@gmail.com>
> > wrote:
> >
> > > Frank,
> > >
> > > Sorry I didn't answer your previous email regarding this.
> > >
> > > It sounded to me like your application would actually be happier with a
> > > form
> > > of logistic regression.
> > >
> > > Perhaps we should talk some more about this on the list.
> > >
> > > If you want a normal linear regression, the current OnlineLearner
> > interface
> > > isn't terribly appropriate since it assumes a 1 of n vector target
> > > variable.
> > >
> > > If you were to extend that interface to accept a vector form of target
> > > variable then linear regression would work (and some clever tricks
> would
> > > become possible for logistic regression).
> > >
> > >
> > >
> > > On Wed, Oct 20, 2010 at 1:57 PM, Frank Wang <wa...@gmail.com>
> > wrote:
> > >
> > > > Hi,
> > > >
> > > > I'm interested in implementing Linear Regression in Mahout. Who would
> > be
> > > > the
> > > > point person for the algorithm? I'd love to discuss the
> implementation
> > > > details, or to help out if anyone is working on it already :)
> > > >
> > > > Thanks
> > > >
> > >
> >
>

Re: Implementation for Linear Regression

Posted by Ted Dunning <te...@gmail.com>.
Can you open a JIRA and attach a patch?

Your approach seems reasonable so far for the regression.

In terms of how it could be applied, it seems like you are trying to
estimate a life-span for a posting to model relevance decay.

My own preference there would be to try to estimate relevance (0 or 1) using
logistic regression and then put various decay functions in as features.
 The weighted sum of those decay functions is your time decay of relevance
(in log-odds).

My initial shot at decay functions would include age, square of age, and log
of age.  My guess is that direct age would suffice because of the logistic
link function, which looks like a logarithmic function in the region where
your models will probably live.
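
A minimal sketch of those decay features at encoding time (plain Java; the
feature ordering and the day-based scaling are illustrative):

    // Sketch only: age enters the logistic regression as several transformed
    // features; the learned weights on these terms define the decay curve
    // (in log-odds).
    static double[] decayFeatures(double ageInSeconds) {
      double ageInDays = ageInSeconds / 86400.0;
      return new double[] {
          ageInDays,                      // linear decay term
          ageInDays * ageInDays,          // quadratic decay term
          Math.log(1 + ageInDays)         // logarithmic decay term (1 + age avoids log(0))
      };
    }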

On Wed, Oct 20, 2010 at 8:15 PM, Frank Wang <wa...@gmail.com> wrote:

> Hi Ted,
>
> thanks for your reply.
> I'm trying a new model where I want to estimate the output as a timespan
> quantified in number of seconds, which is not bounded. That's why I think
> I'd use linear regression instead of logistic regression. (lemme know if
> i'm
> wrong)
>
> I started on the code yesterday. The new AbstractOnlineLinearRegression
> class is implementing the OnlineLearner interface. I updated the classify()
> function to use linear model. I tried to follow the format for
> AbstractOnlineLogisticRegression.
>
> I think since linear regression can be implemented w/ sgd, the train()
> and regularize() functions would look similar. I'm not sure if i'm on the
> right path. Any advice would be helpful.
>
> Thanks
>
> On Wed, Oct 20, 2010 at 3:34 PM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > Frank,
> >
> > Sorry I didn't answer your previous email regarding this.
> >
> > It sounded to me like your application would actually be happier with a
> > form
> > of logistic regression.
> >
> > Perhaps we should talk some more about this on the list.
> >
> > If you want a normal linear regression, the current OnlineLearner
> interface
> > isn't terribly appropriate since it assumes a 1 of n vector target
> > variable.
> >
> > If you were to extend that interface to accept a vector form of target
> > variable then linear regression would work (and some clever tricks would
> > become possible for logistic regression).
> >
> >
> >
> > On Wed, Oct 20, 2010 at 1:57 PM, Frank Wang <wa...@gmail.com>
> wrote:
> >
> > > Hi,
> > >
> > > I'm interested in implementing Linear Regression in Mahout. Who would
> be
> > > the
> > > point person for the algorithm? I'd love to discuss the implementation
> > > details, or to help out if anyone is working on it already :)
> > >
> > > Thanks
> > >
> >
>

Re: Implementation for Linear Regression

Posted by Frank Wang <wa...@gmail.com>.
Hi Ted,

Thanks for your reply.
I'm trying a new model where I want to estimate the output as a timespan
quantified as a number of seconds, which is not bounded. That's why I think
I'd use linear regression instead of logistic regression. (Let me know if I'm
wrong.)

I started on the code yesterday. The new AbstractOnlineLinearRegression
class implements the OnlineLearner interface. I updated the classify()
function to use a linear model. I tried to follow the format of
AbstractOnlineLogisticRegression.

I think that since linear regression can be implemented with SGD, the train()
and regularize() functions would look similar. I'm not sure if I'm on the
right path. Any advice would be helpful.
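
For concreteness, roughly the shape being described (a sketch only, not the
actual patch; the class and field names are illustrative):

    // Sketch only: an online linear regression keeps a coefficient vector,
    // predicts with a plain dot product instead of a logistic link, and updates
    // the coefficients with an SGD step on the squared error.
    class OnlineLinearRegressionSketch {
      private final double[] beta;
      private final double learningRate;

      OnlineLinearRegressionSketch(int numFeatures, double learningRate) {
        this.beta = new double[numFeatures];
        this.learningRate = learningRate;
      }

      double classify(double[] instance) {
        double sum = 0;
        for (int i = 0; i < beta.length; i++) {
          sum += beta[i] * instance[i];               // linear model: no link function
        }
        return sum;
      }

      void train(double actual, double[] instance) {
        double error = actual - classify(instance);
        for (int i = 0; i < beta.length; i++) {
          beta[i] += learningRate * error * instance[i];  // SGD step on squared error
        }
      }
    }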

Thanks

On Wed, Oct 20, 2010 at 3:34 PM, Ted Dunning <te...@gmail.com> wrote:

> Frank,
>
> Sorry I didn't answer your previous email regarding this.
>
> It sounded to me like your application would actually be happier with a
> form
> of logistic regression.
>
> Perhaps we should talk some more about this on the list.
>
> If you want a normal linear regression, the current OnlineLearner interface
> isn't terribly appropriate since it assumes a 1 of n vector target
> variable.
>
> If you were to extend that interface to accept a vector form of target
> variable then linear regression would work (and some clever tricks would
> become possible for logistic regression).
>
>
>
> On Wed, Oct 20, 2010 at 1:57 PM, Frank Wang <wa...@gmail.com> wrote:
>
> > Hi,
> >
> > I'm interested in implementing Linear Regression in Mahout. Who would be
> > the
> > point person for the algorithm? I'd love to discuss the implementation
> > details, or to help out if anyone is working on it already :)
> >
> > Thanks
> >
>

Re: Implementation for Linear Regression

Posted by Ted Dunning <te...@gmail.com>.
Frank,

Sorry I didn't answer your previous email regarding this.

It sounded to me like your application would actually be happier with a form
of logistic regression.

Perhaps we should talk some more about this on the list.

If you want a normal linear regression, the current OnlineLearner interface
isn't terribly appropriate since it assumes a 1-of-n vector target variable.

If you were to extend that interface to accept a vector form of target
variable then linear regression would work (and some clever tricks would
become possible for logistic regression).
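
A minimal sketch of what such an extension might look like (the interface
name and methods here are hypothetical, not part of Mahout):

    import org.apache.mahout.math.Vector;

    // Sketch only: a hypothetical companion to OnlineLearner where the target is
    // a vector of real values rather than a category index, which is what linear
    // regression (and some multi-output tricks) would need.
    public interface VectorTargetOnlineLearner {
      void train(Vector actual, Vector instance);     // real-valued, possibly multi-output target
      void close();                                   // finish any pending updates
    }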



On Wed, Oct 20, 2010 at 1:57 PM, Frank Wang <wa...@gmail.com> wrote:

> Hi,
>
> I'm interested in implementing Linear Regression in Mahout. Who would be
> the
> point person for the algorithm? I'd love to discuss the implementation
> details, or to help out if anyone is working on it already :)
>
> Thanks
>