Posted to issues@spark.apache.org by "DB Tsai (JIRA)" <ji...@apache.org> on 2015/05/13 06:35:01 UTC

[jira] [Commented] (SPARK-7568) ml.LogisticRegression doesn't output the right prediction

    [ https://issues.apache.org/jira/browse/SPARK-7568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541301#comment-14541301 ] 

DB Tsai commented on SPARK-7568:
--------------------------------

This is because we previously regularized the intercept, which effectively regularized the weights less. Now we follow R's convention of not regularizing the intercept, so the regularization parameter needs to be decreased accordingly. BTW, we should implement efficient cross validation soon to help users pick lambda.
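To illustrate the effect (a minimal plain-Python sketch, not Spark's actual solver, on made-up toy data): L2-regularized logistic regression trained by gradient descent, run once with the intercept inside the penalty (the old behavior) and once with it excluded (the R/glmnet convention adopted here). Penalizing the intercept pulls it toward zero, which is what skewed the predictions.

```python
import math

def fit(xs, ys, lam, penalize_intercept, steps=5000, lr=0.1):
    """Tiny gradient-descent logistic regression with an L2 penalty."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            gw += (p - y) * x / n
            gb += (p - y) / n
        gw += lam * w            # L2 penalty always applies to the weight
        if penalize_intercept:
            gb += lam * b        # old behavior: intercept penalized too
        w -= lr * gw
        b -= lr * gb
    return w, b

# Toy data where the positive class dominates, so the intercept carries signal.
xs = [0.5, -0.2, 0.1, 0.3, -0.4, 0.2]
ys = [1, 1, 1, 1, 0, 1]

_, b_pen = fit(xs, ys, lam=0.5, penalize_intercept=True)
_, b_free = fit(xs, ys, lam=0.5, penalize_intercept=False)
print(b_pen, b_free)  # the penalized intercept is pulled toward zero
```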

With lambda = 0.001 in the current code, the result looks correct:
```
(4, spark i j k) --> prob=[0.1596407738787411,0.8403592261212589], prediction=1.0
(5, l m n) --> prob=[0.8378325685476612,0.16216743145233883], prediction=0.0
(6, mapreduce spark) --> prob=[0.6693126798261013,0.3306873201738986], prediction=0.0
(7, apache hadoop) --> prob=[0.9821575333444208,0.01784246665557917], prediction=0.0
```
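As a quick sanity check (plain Python, not Spark): the predicted label should be the index of the larger class probability, and the probabilities above with lambda = 0.001 do yield the expected labels.

```python
# Class probabilities from the lambda = 0.001 run above.
probs = {
    "spark i j k":     [0.1596407738787411, 0.8403592261212589],
    "l m n":           [0.8378325685476612, 0.16216743145233883],
    "mapreduce spark": [0.6693126798261013, 0.3306873201738986],
    "apache hadoop":   [0.9821575333444208, 0.01784246665557917],
}
# prediction = index of the larger probability
preds = {text: float(p.index(max(p))) for text, p in probs.items()}
print(preds)  # "spark i j k" -> 1.0, the rest -> 0.0
```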

> ml.LogisticRegression doesn't output the right prediction
> ---------------------------------------------------------
>
>                 Key: SPARK-7568
>                 URL: https://issues.apache.org/jira/browse/SPARK-7568
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 1.4.0
>            Reporter: Xiangrui Meng
>            Assignee: DB Tsai
>            Priority: Blocker
>
> `bin/spark-submit examples/src/main/python/ml/simple_text_classification_pipeline.py`
> {code}
> Row(id=4, text=u'spark i j k', words=[u'spark', u'i', u'j', u'k'], features=SparseVector(262144, {105: 1.0, 106: 1.0, 107: 1.0, 62173: 1.0}), rawPrediction=DenseVector([0.1629, -0.1629]), probability=DenseVector([0.5406, 0.4594]), prediction=0.0)
> Row(id=5, text=u'l m n', words=[u'l', u'm', u'n'], features=SparseVector(262144, {108: 1.0, 109: 1.0, 110: 1.0}), rawPrediction=DenseVector([2.6407, -2.6407]), probability=DenseVector([0.9334, 0.0666]), prediction=0.0)
> Row(id=6, text=u'mapreduce spark', words=[u'mapreduce', u'spark'], features=SparseVector(262144, {62173: 1.0, 140738: 1.0}), rawPrediction=DenseVector([1.2651, -1.2651]), probability=DenseVector([0.7799, 0.2201]), prediction=0.0)
> Row(id=7, text=u'apache hadoop', words=[u'apache', u'hadoop'], features=SparseVector(262144, {128334: 1.0, 134181: 1.0}), rawPrediction=DenseVector([3.7429, -3.7429]), probability=DenseVector([0.9769, 0.0231]), prediction=0.0)
> {code}
> In Scala
> {code}
> $ bin/run-example ml.SimpleTextClassificationPipeline
> (4, spark i j k) --> prob=[0.5406433544851436,0.45935664551485655], prediction=0.0
> (5, l m n) --> prob=[0.9334382627383263,0.06656173726167364], prediction=0.0
> (6, mapreduce spark) --> prob=[0.7799076868203896,0.22009231317961045], prediction=0.0
> (7, apache hadoop) --> prob=[0.9768636139518304,0.023136386048169616], prediction=0.0
> {code}
> All predictions are 0.0, while some should be 1.0 based on the probabilities. It seems to be an issue with regularization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org