Posted to dev@mahout.apache.org by "Frank Scholten (JIRA)" <ji...@apache.org> on 2014/02/23 19:51:19 UTC
[jira] [Created] (MAHOUT-1425) SGD classifier example with bank marketing dataset
Frank Scholten created MAHOUT-1425:
--------------------------------------
Summary: SGD classifier example with bank marketing dataset
Key: MAHOUT-1425
URL: https://issues.apache.org/jira/browse/MAHOUT-1425
Project: Mahout
Issue Type: Improvement
Components: Examples
Affects Versions: 1.0
Reporter: Frank Scholten
Assignee: Frank Scholten
Fix For: 0.9
As discussed on the mailing list a few weeks back, I started working on an SGD classifier example with the bank marketing dataset from UCI: http://archive.ics.uci.edu/ml/datasets/Bank+Marketing
See https://github.com/frankscholten/mahout-sgd-bank-marketing
Ted has also made further changes that were very useful, so I suggest adding this example to Mahout.
Ted: can you tell a bit more about the log transforms? Some of them are just Math.log while others are more complex expressions.
What else is needed to contribute it to Mahout? Anything that could improve the example?
Re: [jira] [Created] (MAHOUT-1425) SGD classifier example with bank marketing dataset
Posted by Ted Dunning <te...@gmail.com>.
Argh...
Make that last expression w(a-b) log w(a-b)/\gamma
On Sun, Feb 23, 2014 at 3:42 PM, Ted Dunning <te...@gmail.com> wrote:
>
> On Sun, Feb 23, 2014 at 10:51 AM, Frank Scholten (JIRA) <ji...@apache.org> wrote:
>
>> Ted: can you tell a bit more about the log transforms? Some of them are
>> just Math.log while others are more complex expressions.
>
>
> The increased complexity comes up when there are zero or small negative
> values.
>
> In general, monetary values are commonly transformed with a log during
> training of a logistic regression model. Often you retain the original as
> well.
>
> The motivation for the log is that it is common for the structure of the
> problem to depend as much on relative differences as on absolute
> differences. Thus, $80 is different from $100 in about the same way that
> $800 is different from $1000. This makes sense if you are talking about
> what makes a material difference.
>
> Of course, if you are talking about net profits, then you may want
> features that look like log(a-b) instead. What happens when that goes
> negative is a bit of a can of worms in terms of feature design. Sometimes,
> a small reference value is defined and a value like w(a-b) log w(a-b) is
> used where w(x) = x-\gamma if x > \gamma, x+\gamma if x < -\gamma and 0
> else.
>
>
>
Re: [jira] [Created] (MAHOUT-1425) SGD classifier example with bank marketing dataset
Posted by Ted Dunning <te...@gmail.com>.
On Sun, Feb 23, 2014 at 10:51 AM, Frank Scholten (JIRA) <ji...@apache.org> wrote:
> Ted: can you tell a bit more about the log transforms? Some of them are
> just Math.log while others are more complex expressions.
The increased complexity comes up when there are zero or small negative
values.
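
To see why zero and negative inputs need extra care, here is a minimal Java sketch. The class and method names (LogEdgeCases, isUsableFeature) are made up for illustration and are not taken from the linked repository; only Math.log itself is mentioned in this thread.

```java
// Hypothetical illustration: how Math.log behaves on the inputs that cause trouble.
final class LogEdgeCases {

    /** True if the value is an ordinary finite number that can be fed to a model. */
    static boolean isUsableFeature(double v) {
        return !Double.isNaN(v) && !Double.isInfinite(v);
    }

    public static void main(String[] args) {
        double positive = Math.log(100.0); // ordinary finite value
        double zero = Math.log(0.0);       // negative infinity
        double negative = Math.log(-5.0);  // NaN

        System.out.println(isUsableFeature(positive)); // prints "true"
        System.out.println(zero);                      // prints "-Infinity"
        System.out.println(negative);                  // prints "NaN"
    }
}
```

Neither -Infinity nor NaN is a usable feature value, which is presumably why the more complex expressions appear for fields that can be zero or negative.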
In general, monetary values are commonly transformed with a log during
training of a logistic regression model. Often you retain the original as
well.
The motivation for the log is that it is common for the structure of the
problem to depend as much on relative differences as on absolute
differences. Thus, $80 is different from $100 in about the same way that
$800 is different from $1000. This makes sense if you are talking about
what makes a material difference.
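
The $80/$100 versus $800/$1000 comparison can be checked numerically. The following is a small sketch with a made-up helper name (LogRatioDemo.logDiff), assuming strictly positive amounts. It relies only on the identity log(b) - log(a) = log(b/a).

```java
// Hypothetical helper, not part of Mahout or the linked example project.
final class LogRatioDemo {

    /** Difference of natural logs; equals log(b / a) for positive a and b. */
    static double logDiff(double a, double b) {
        if (a <= 0 || b <= 0) {
            throw new IllegalArgumentException("values must be positive");
        }
        return Math.log(b) - Math.log(a);
    }

    public static void main(String[] args) {
        // The absolute gaps are 20 and 200, but both ratios are 1.25,
        // so the two log differences agree up to floating-point rounding.
        System.out.println(logDiff(80, 100));
        System.out.println(logDiff(800, 1000));
    }
}
```

After the transform, a linear model sees the two pairs as equally far apart, which matches the "material difference" intuition.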
Of course, if you are talking about net profits, then you may want features
that look like log(a-b) instead. What happens when that goes negative is a
bit of a can of worms in terms of feature design. Sometimes, a small
reference value is defined and a value like w(a-b) log w(a-b) is used where
w(x) = x-\gamma if x > \gamma, x+\gamma if x < -\gamma and 0 else.
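
As a rough sketch, the piecewise function w(x) can be written in Java as follows. The class name DeadZone and the parameter name gamma are assumptions made for illustration. Only w itself is shown: the follow-up message in this thread amends the full feature to w(a-b) log w(a-b)/\gamma, and neither message spells out the grouping of that expression or how the log should treat negative values of w, so that part is left out rather than guessed.

```java
// Hypothetical helper illustrating w(x); gamma is the small reference value.
final class DeadZone {

    /** w(x) = x - gamma if x > gamma, x + gamma if x < -gamma, and 0 otherwise. */
    static double w(double x, double gamma) {
        if (gamma < 0) {
            throw new IllegalArgumentException("gamma must be non-negative");
        }
        if (x > gamma) {
            return x - gamma;
        }
        if (x < -gamma) {
            return x + gamma;
        }
        return 0.0; // everything in [-gamma, gamma] collapses to zero
    }

    public static void main(String[] args) {
        double gamma = 1.0;
        System.out.println(w(5.0, gamma));  // prints "4.0"
        System.out.println(w(-5.0, gamma)); // prints "-4.0"
        System.out.println(w(0.5, gamma));  // prints "0.0"
    }
}
```

The effect is that small net values on either side of zero are treated as exactly zero, and larger values are shifted toward zero by gamma.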