Posted to dev@mahout.apache.org by "Frank Scholten (JIRA)" <ji...@apache.org> on 2014/02/23 19:51:19 UTC

[jira] [Created] (MAHOUT-1425) SGD classifier example with bank marketing dataset

Frank Scholten created MAHOUT-1425:
--------------------------------------

             Summary: SGD classifier example with bank marketing dataset
                 Key: MAHOUT-1425
                 URL: https://issues.apache.org/jira/browse/MAHOUT-1425
             Project: Mahout
          Issue Type: Improvement
          Components: Examples
    Affects Versions: 1.0
            Reporter: Frank Scholten
            Assignee: Frank Scholten
             Fix For: 0.9


As discussed on the mailing list a few weeks back, I started working on an SGD classifier example with the bank marketing dataset from UCI: http://archive.ics.uci.edu/ml/datasets/Bank+Marketing

See https://github.com/frankscholten/mahout-sgd-bank-marketing

Ted has also made further changes that were very useful, so I suggest adding this example to Mahout.

Ted: can you say a bit more about the log transforms? Some of them are just Math.log while others are more complex expressions.

What else is needed to contribute it to Mahout? Anything that could improve the example?


Re: [jira] [Created] (MAHOUT-1425) SGD classifier example with bank marketing dataset

Posted by Ted Dunning <te...@gmail.com>.
Argh...

Make that last expression w(a-b) log w(a-b)/\gamma




On Sun, Feb 23, 2014 at 3:42 PM, Ted Dunning <te...@gmail.com> wrote:

>
> On Sun, Feb 23, 2014 at 10:51 AM, Frank Scholten (JIRA) <ji...@apache.org> wrote:
>
>> Ted: can you say a bit more about the log transforms? Some of them are
>> just Math.log while others are more complex expressions.
>
>
> The increased complexity comes up when there are zero or small negative
> values.
>
> In general, monetary values are commonly transformed with a log during
> training of a logistic regression model.  Often you retain the original as
> well.
>
> The motivation for the log is that it is common for the structure of the
> problem to depend on relative differences rather than absolute
> differences.  Thus, $80 is different from $100 in about the same way that
> $800 is different from $1000.  This makes sense if you are talking about
> what makes a material difference.
>
> Of course, if you are talking about net profits, then you may want
> features that look like log(a-b) instead.  What happens when that goes
> negative is a bit of a can of worms in terms of feature design.  Sometimes,
> a small reference value is defined and a value like w(a-b) log w(a-b) is
> used where w(x) = x-\gamma if x > \gamma, x+\gamma if x < -\gamma and 0
> otherwise.
>
>
>

Re: [jira] [Created] (MAHOUT-1425) SGD classifier example with bank marketing dataset

Posted by Ted Dunning <te...@gmail.com>.
On Sun, Feb 23, 2014 at 10:51 AM, Frank Scholten (JIRA) <ji...@apache.org> wrote:

> Ted: can you say a bit more about the log transforms? Some of them are
> just Math.log while others are more complex expressions.


The increased complexity comes up when there are zero or small negative
values.

In general, monetary values are commonly transformed with a log during
training of a logistic regression model.  Often you retain the original as
well.

The motivation for the log is that it is common for the structure of the
problem to depend on relative differences rather than absolute
differences.  Thus, $80 is different from $100 in about the same way that
$800 is different from $1000.  This makes sense if you are talking about
what makes a material difference.
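
To make the $80/$100 versus $800/$1000 point concrete, here is a minimal
sketch in Java.  The class and method names (LogFeatureSketch, logAmount,
rawAndLog) are made up for illustration and are not taken from Mahout or
from the example repository.  It assumes strictly positive amounts; zero
and negative values are the harder case discussed next.

```java
public class LogFeatureSketch {

    // Natural log of a strictly positive monetary amount.
    // Zero or negative input is rejected here; handling those cases
    // is a separate feature-design question.
    static double logAmount(double amount) {
        if (amount <= 0) {
            throw new IllegalArgumentException("amount must be positive");
        }
        return Math.log(amount);
    }

    // Returns {raw, log(raw)} so the model sees both the original value
    // and its log transform, as suggested in the thread.
    static double[] rawAndLog(double amount) {
        return new double[] {amount, logAmount(amount)};
    }

    public static void main(String[] args) {
        // Both gaps equal log(1.25): on the log scale, 80 -> 100 looks
        // the same as 800 -> 1000, up to floating-point rounding.
        double smallGap = logAmount(100) - logAmount(80);
        double largeGap = logAmount(1000) - logAmount(800);
        System.out.println("gap 80 -> 100:   " + smallGap);
        System.out.println("gap 800 -> 1000: " + largeGap);
    }
}
```

On the raw scale the two gaps are 20 and 200; after the log they are both
log(1.25), which is the sense in which the transform encodes relative
rather than absolute differences.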

Of course, if you are talking about net profits, then you may want features
that look like log(a-b) instead.  What happens when that goes negative is a
bit of a can of worms in terms of feature design.  Sometimes, a small
reference value is defined and a value like w(a-b) log w(a-b) is used where
w(x) = x-\gamma if x > \gamma, x+\gamma if x < -\gamma and 0 otherwise.
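
A rough sketch of the w(x) construction in Java follows.  The w method
follows the definition above directly.  The feature method is more
speculative: the message does not say how the log should behave when
w(a-b) is negative or zero, so this sketch assumes the log is taken of
|w| and that the feature is 0 when w is 0.  The follow-up message in this
thread amends the expression with a /\gamma term; that is not applied
here because its exact grouping is not spelled out.  All names
(ShrinkLogSketch, w, feature) are hypothetical.

```java
public class ShrinkLogSketch {

    // w(x) as defined in the thread: shrink x toward zero by gamma and
    // map the dead zone [-gamma, gamma] to 0.
    static double w(double x, double gamma) {
        if (x > gamma) {
            return x - gamma;
        }
        if (x < -gamma) {
            return x + gamma;
        }
        return 0.0;
    }

    // w(a-b) log w(a-b), with two ASSUMPTIONS not stated in the thread:
    //  - the log is applied to |w| so negative differences stay defined;
    //    the leading factor w keeps the sign of the difference
    //  - the value at w == 0 is taken to be 0 (the limit of v*log(v))
    static double feature(double a, double b, double gamma) {
        double v = w(a - b, gamma);
        if (v == 0.0) {
            return 0.0;
        }
        return v * Math.log(Math.abs(v));
    }

    public static void main(String[] args) {
        double gamma = 1.0; // the "small reference value"; chosen arbitrarily
        System.out.println(feature(5.0, 0.0, gamma)); // positive difference
        System.out.println(feature(0.0, 5.0, gamma)); // negative difference
        System.out.println(feature(2.0, 1.5, gamma)); // inside the dead zone
    }
}
```

With this reading the feature is antisymmetric in a and b, so a loss of a
given size is the mirror image of a profit of the same size, and small
differences within \gamma of zero contribute nothing.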