Posted to user@spark.apache.org by DB Tsai <db...@dbtsai.com> on 2015/07/09 00:34:33 UTC

Re: FW: MLLIB (Spark) Question.

Hi Dhar,

Support for disabling the `standardization` feature was just merged into master:

https://github.com/apache/spark/commit/57221934e0376e5bb8421dc35d4bf91db4deeca1
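
For reference, a minimal sketch of the new knob (assuming the
`setStandardization` setter that commit adds, and a DataFrame `training`
with "label"/"features" columns):

  import org.apache.spark.ml.classification.LogisticRegression

  // Features are still scaled internally for convergence, but the
  // regularization is adjusted so the solution matches the
  // unstandardized problem.
  val lr = new LogisticRegression()
    .setRegParam(0.1)            // lambda
    .setElasticNetParam(0.0)     // 0.0 = L2 penalty, 1.0 = L1
    .setStandardization(false)   // the newly merged flag
  val model = lr.fit(training)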

Let us know your feedback. Thanks.

Sincerely,

DB Tsai
----------------------------------------------------------
Blog: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Tue, Jun 16, 2015 at 9:11 PM, Dhar Sauptik (CR/RTC1.3-NA)
<Sa...@us.bosch.com> wrote:
> Hi DB,
>
> That will work too. I was just suggesting it because standardization is a simple operation that users could have performed explicitly themselves.
>
> Thank you for the replies.
>
> -Sauptik.
>
> -----Original Message-----
> From: DB Tsai [mailto:dbtsai@dbtsai.com]
> Sent: Tuesday, June 16, 2015 9:04 PM
> To: Dhar Sauptik (CR/RTC1.3-NA)
> Cc: Ramakrishnan Naveen (CR/RTC1.3-NA); user@spark.apache.org
> Subject: Re: FW: MLLIB (Spark) Question.
>
> Hi Dhar,
>
> For "standardization", we can disable it effectively by using
> different regularization on each component. Thus, we're solving the
> same problem but having better rate of convergence. This is one of the
> features I will implement.
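>
> A sketch of the equivalence (my notation, not from the thread; \sigma_j
> is the standard deviation of feature j). Training on the standardized
> features x_{ij}/\sigma_j with the component-wise penalty \lambda/\sigma_j,
>
>   \min_{w,b}\ \sum_i \ell\Big(y_i,\ \sum_j w_j \frac{x_{ij}}{\sigma_j} + b\Big) \;+\; \lambda \sum_j \frac{|w_j|}{\sigma_j},
>
> has the same minimizer as the plain unstandardized problem with a
> uniform penalty \lambda, once each w_j is mapped back to w_j/\sigma_j.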
>
> Sincerely,
>
> DB Tsai
> ----------------------------------------------------------
> Blog: https://www.dbtsai.com
> PGP Key ID: 0xAF08DF8D
>
>
> On Tue, Jun 16, 2015 at 8:34 PM, Dhar Sauptik (CR/RTC1.3-NA)
> <Sa...@us.bosch.com> wrote:
>> Hi DB,
>>
>> Thank you for the reply. The answers make sense. I do have just one more point to add.
>>
>> Note that it may be better not to standardize the data implicitly. Agreed, a number of algorithms benefit from such standardization, but for many applications with contextual information it "may" not be desirable.
>> Users can always perform the standardization themselves.
>>
>> However, that's just a suggestion. Again, thank you for the clarification.
>>
>> Thanks,
>> Sauptik.
>>
>>
>> -----Original Message-----
>> From: DB Tsai [mailto:dbtsai@dbtsai.com]
>> Sent: Tuesday, June 16, 2015 2:49 PM
>> To: Dhar Sauptik (CR/RTC1.3-NA); Ramakrishnan Naveen (CR/RTC1.3-NA)
>> Cc: user@spark.apache.org
>> Subject: Re: FW: MLLIB (Spark) Question.
>>
>> +cc user@spark.apache.org
>>
>> Reply inline.
>>
>> On Tue, Jun 16, 2015 at 2:31 PM, Dhar Sauptik (CR/RTC1.3-NA)
>> <Sauptik.Dhar> wrote:
>>> Hi DB,
>>>
>>> Thank you for the reply. That explains a lot.
>>>
>>> I did, however, have a few points regarding this:
>>>
>>> 1. Just to help with the debate about regularizing the b parameter: the standard reference argues against it. See Pg. 64, para. 1: http://statweb.stanford.edu/~tibs/ElemStatLearn/
>>>
>>
>> Agreed. We were just worried it would change existing behavior, but we
>> actually have a PR to change the behavior to the standard one:
>> https://github.com/apache/spark/pull/6386
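>>
>> In symbols (a sketch of the standard objective): the intercept b enters
>> the loss but not the penalty,
>>
>>   \min_{w,b}\ \sum_{i=1}^{n} \ell\big(y_i,\ w^\top x_i + b\big) \;+\; \lambda\,\|w\|_2^2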
>>
>>> 2. Further, does the regularization of b also apply to the SGD implementation? Currently the SGD and LBFGS implementations give different results (and neither matches the IRLS algorithm). Are SGD and LBFGS implemented with different loss functions? Can you please share your thoughts on this?
>>>
>>
>> In the SGD implementation, we don't "standardize" the dataset before
>> training. As a result, columns with low standard deviation are
>> penalized more, and columns with high standard deviation are penalized
>> less. Standardization also improves the rate of convergence. For these
>> reasons, most packages "standardize" the data implicitly, obtain the
>> weights in the standardized space, and transform them back to the
>> original space, so the whole process is transparent to users.
>>
>> 1) LORWithSGD: no standardization, and the intercept is penalized.
>> 2) LORWithLBFGS: standardization, but the intercept is still penalized.
>> 3) New LOR implementation: standardization, and the intercept is not
>> penalized.
>>
>> As a result, only the new implementation in Spark ML handles
>> everything correctly. We have tests to verify that the results match
>> R.
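>>
>> For example, a rough sketch of (2) vs. (3) in Scala (illustrative only:
>> `data` is assumed to be an RDD[LabeledPoint] and `df` a DataFrame with
>> label/features columns):
>>
>>   import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
>>   import org.apache.spark.mllib.optimization.SquaredL2Updater
>>   import org.apache.spark.ml.classification.LogisticRegression
>>
>>   // (2) old mllib API: standardizes internally but penalizes the intercept
>>   val lbfgs = new LogisticRegressionWithLBFGS().setIntercept(true)
>>   lbfgs.optimizer.setRegParam(0.1).setUpdater(new SquaredL2Updater)
>>   val oldModel = lbfgs.run(data)
>>
>>   // (3) new spark.ml API: standardizes internally, intercept not penalized
>>   val newModel = new LogisticRegression()
>>     .setRegParam(0.1)
>>     .setElasticNetParam(0.0)   // 0.0 = L2 penalty
>>     .fit(df)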
>>
>>>
>>> @Naveen: Please feel free to add/comment on the above points as you see necessary.
>>>
>>> Thanks,
>>> Sauptik.
>>>
>>> -----Original Message-----
>>> From: DB Tsai
>>> Sent: Tuesday, June 16, 2015 2:08 PM
>>> To: Ramakrishnan Naveen (CR/RTC1.3-NA)
>>> Cc: Dhar Sauptik (CR/RTC1.3-NA)
>>> Subject: Re: FW: MLLIB (Spark) Question.
>>>
>>> Hey,
>>>
>>> In the LORWithLBFGS API you are using, the intercept is regularized,
>>> while other implementations don't regularize it. That's why you see
>>> the difference.
>>>
>>> The intercept should not be regularized, so we fixed this in the new
>>> Spark ML API in Spark 1.4. Since deciding not to regularize the
>>> intercept in the old API would change its behavior, we are still
>>> debating that change.
>>>
>>> See the following code for a full running example in Spark 1.4:
>>> https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/LogisticRegressionExample.scala
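>>>
>>> A minimal excerpt in that spirit (assuming a DataFrame `training` with
>>> "label" and "features" columns):
>>>
>>>   import org.apache.spark.ml.classification.LogisticRegression
>>>
>>>   // Spark 1.4 ml API: the intercept is fit but not regularized
>>>   val model = new LogisticRegression()
>>>     .setRegParam(0.1)          // lambda
>>>     .setElasticNetParam(0.0)   // 0.0 = L2 penalty
>>>     .setMaxIter(100)
>>>     .fit(training)
>>>   println(s"weights: ${model.weights} intercept: ${model.intercept}")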
>>>
>>> Also check out my talk at Spark Summit:
>>> http://www.slideshare.net/dbtsai/2015-06-largescale-lasso-and-elasticnet-regularized-generalized-linear-models-at-spark-summit
>>>
>>>
>>> Sincerely,
>>>
>>> DB Tsai
>>> ----------------------------------------------------------
>>> Blog: https://www.dbtsai.com
>>> PGP Key ID: 0xAF08DF8D
>>>
>>>
>>> On Mon, Jun 15, 2015 at 11:58 AM, Ramakrishnan Naveen (CR/RTC1.3-NA)
>>> <Naveen.Ramakrishnan> wrote:
>>>> Hi DB,
>>>>     Hope you are doing well! One of my colleagues, Sauptik, is working with
>>>> MLlib and the LBFGS-based logistic regression, and he is having trouble
>>>> reproducing the results obtained with Matlab. Please see below for
>>>> details. I took a look into this, but it seems there is also a discrepancy
>>>> between the SGD and LBFGS logistic regression implementations in MLlib.
>>>> We have attached all the code for your analysis – it's in PySpark, though.
>>>> Let us know if you have any questions or concerns. We would very much
>>>> appreciate your help whenever you get a chance.
>>>>
>>>> Best,
>>>> Naveen.
>>>>
>>>> _____________________________________________
>>>> From: Dhar Sauptik (CR/RTC1.3-NA)
>>>> Sent: Thursday, June 11, 2015 6:03 PM
>>>> To: Ramakrishnan Naveen (CR/RTC1.3-NA)
>>>> Subject: MLLIB (Spark) Question.
>>>>
>>>>
>>>> Hi Naveen,
>>>>
>>>> I am writing owing to some MLlib issues I found while using logistic
>>>> regression. Basically, I am trying to test the stability of L1/L2-
>>>> regularized logistic regression using SGD and LBFGS. Unfortunately, I
>>>> am unable to confirm the correctness of the algorithms. For comparison,
>>>> I implemented the L2-regularized logistic regression algorithm (using
>>>> the IRLS algorithm, Pg. 121) from the book
>>>> http://web.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf .
>>>> Unfortunately, the solutions don't match.
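>>>>
>>>> For reference, a sketch of the standard ridge-penalized IRLS update
>>>> (ESL, Pg. 121; in the standard treatment the bias column's diagonal
>>>> penalty entry is zero, so the intercept is not shrunk):
>>>>
>>>>   w^{(t+1)} = \big(X^\top W X + \lambda I\big)^{-1} X^\top W z,
>>>>   \quad W = \mathrm{diag}\big(p_i(1 - p_i)\big),
>>>>   \quad z = X w^{(t)} + W^{-1}(y - p)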
>>>>
>>>> For example:-
>>>>
>>>> Using the publicly available data (diabetes.csv) for L2-regularized logistic
>>>> regression (with lambda = 0.1), we get:
>>>>
>>>> Solutions
>>>>
>>>> MATLAB CODE (IRLS):-
>>>>
>>>> w = [0.294293470805555, 0.550681766045083, 0.0396336870148899,
>>>>      0.0641285712055971, 0.101238592147879, 0.261153541551578,
>>>>      0.178686710290069]
>>>>
>>>> b = -0.347396594061553
>>>>
>>>>
>>>> MLLIB (SGD):-
>>>> (weights=[0.352873922589,0.420391294105,0.0100571908041,0.150724951988,0.238536959009,0.220329295188,0.269139932714],
>>>> intercept=-0.00749988882664631)
>>>>
>>>>
>>>> MLLIB(LBFGS):-
>>>> (weights=[0.787850211605,1.964589985,-0.209348425939,0.0278848173986,0.12729017522,1.58954647312,0.692671824394],
>>>> intercept=-0.027401869113912316)
>>>>
>>>>
>>>> All the codes are attached to the email.
>>>>
>>>> Apparently the solutions are quite far from optimal (and even from each
>>>> other)! Can you please check with DB Tsai on the reasons for such
>>>> differences? Note that all the additional parameters are described in
>>>> the source code.
>>>>
>>>>
>>>> Thanks,
>>>> Best regards,
>>>>
>>>> Sauptik Dhar, Ph.D.
>>>> CR/RTC1.3-NA
>>>>
>>>>
>>
>> Sincerely,
>>
>> DB Tsai
>> ----------------------------------------------------------
>> Blog: https://www.dbtsai.com
>> PGP Key ID: 0xAF08DF8D

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org