Posted to user@spark.apache.org by Tobi Bosede <an...@gmail.com> on 2016/09/26 05:22:15 UTC

MLlib Documentation Update Needed

The loss function here
<https://spark.apache.org/docs/1.6.0/mllib-linear-methods.html#mjx-eqn-eqregPrimal>
for logistic regression is confusing. It seems to imply that Spark uses
only -1 and 1 class labels. However, it actually uses 0 and 1, as the very
inconspicuous note quoted below (under Classification) says. We need to
make this point more visible to avoid confusion.

Better yet, we should replace the listed loss function with the one for
0/1 labels, no matter how mathematically inconvenient, since that is what
is actually implemented in Spark.

More problematic, the loss function (even in this "convenient" form)
appears to be incorrect: it is missing either a summation (sigma) inside
the log or a product (pi) outside the log, since the loss for logistic
regression is the log likelihood. So there are multiple problems with the
documentation. Please advise on the steps to fix the documentation for all
versions, or whether some are already in place.

"Note that, in the mathematical formulation in this guide, a binary label
y is denoted as either +1 (positive) or −1 (negative), which is convenient
for the formulation. *However*, the negative label is represented by 0 in
spark.mllib instead of −1, to be consistent with multiclass labeling."

Re: MLlib Documentation Update Needed

Posted by Tobi Bosede <an...@gmail.com>.
OK, I've opened a JIRA: https://issues.apache.org/jira/browse/SPARK-17718

And OK, I forgot that the loss is summed in the objective function
provided. My mistake.

On a tangentially related topic, why is there a half in front of the
squared loss? Similarly, the L2 regularizer has a half. It's just a
constant, so the objective's minimum is unaffected, but I'm still curious
why the half wasn't left out.
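
My guess is that it's there so the derivatives come out clean, since the
1/2 cancels the 2 produced by differentiation, e.g.:

```latex
% Squared loss: the 2 from the chain rule cancels the 1/2
\frac{\partial}{\partial w}\,\frac{1}{2}\bigl(w^{T}x - y\bigr)^{2}
  = \bigl(w^{T}x - y\bigr)\,x
% L2 regularizer: likewise, the gradient is simply \lambda w
\nabla_{w}\,\frac{\lambda}{2}\,\|w\|_{2}^{2} = \lambda\,w
```

But confirmation would be nice.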


Re: MLlib Documentation Update Needed

Posted by Sean Owen <so...@cloudera.com>.
Yes I think that footnote could be a lot more prominent, or pulled up
right under the table.

I also think it would be fine to present the {0,1} formulation. It's
actually more recognizable, I think, for log-loss in that form. It's
probably less recognizable for hinge loss, but consistency is more
important. There's just an extra (2y-1) term, at worst.
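
A quick numeric sanity check of that equivalence (plain Python, not Spark
code; the (2y-1) substitution maps a {0,1} label onto {-1,+1}):

```python
import math

def sigmoid(m):
    """Logistic function sigma(m) = 1 / (1 + exp(-m))."""
    return 1.0 / (1.0 + math.exp(-m))

def logloss_pm1(y01, margin):
    """Log-loss in the guide's {-1,+1} form, log(1 + exp(-y * margin)),
    applied to a {0,1} label via y = 2*y01 - 1."""
    return math.log1p(math.exp(-(2 * y01 - 1) * margin))

def logloss_01(y01, margin):
    """Log-loss in the {0,1} cross-entropy form."""
    p = sigmoid(margin)
    return -(y01 * math.log(p) + (1 - y01) * math.log(1.0 - p))

# The two forms agree for both labels across a range of margins.
for y in (0, 1):
    for m in (-2.0, -0.5, 0.0, 1.0, 3.5):
        assert abs(logloss_pm1(y, m) - logloss_01(y, m)) < 1e-12
```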

The loss here is per instance, and implicitly summed over all
instances. I think that is probably not confusing for the reader; if
they're reading this at all to double-check just what formulation is
being used, I think they'd know that. But, it's worth a note.
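
Concretely, the objective in the guide is the regularized average of the
per-instance losses, roughly:

```latex
% Regularizer plus average per-instance loss over n training examples:
f(w) = \lambda\,R(w) + \frac{1}{n}\sum_{i=1}^{n} L\bigl(w;\, x_{i}, y_{i}\bigr)
```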

The loss is summed in the case of log-loss, not multiplied (if that's
what you're saying).

Those are decent improvements, feel free to open a pull request / JIRA.


---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org