Posted to user@spark.apache.org by Caron <ca...@gmail.com> on 2014/11/12 01:41:40 UTC

SVMWithSGD default threshold

I'm hoping to get a linear classifier on a dataset.
I'm using SVMWithSGD to train the data.
After running with the default options: val model =
SVMWithSGD.train(training, numIterations), 
I don't think SVM has done the classification correctly.

My observations:
1. the intercept is always 0.0
2. the predicted labels are ALL 1's, no 0's.

My questions are:
1. what should the numIterations be? I tried to set it to
10*trainingSetSize, is that sufficient?
2. since MLlib only accepts data with labels "0" or "1", shouldn't the
default threshold for SVMWithSGD be 0.5 instead of 0.0?
3. It seems counter-intuitive to me to have the default intercept be 0.0,
meaning the line has to go through the origin.
4. Does Spark MLlib provide an API to do grid search like scikit-learn does?

Any help would be greatly appreciated!




-----
Thanks!
-Caron
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SVMWithSGD-default-threshold-tp18645.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: SVMWithSGD default threshold

Posted by Xiangrui Meng <me...@gmail.com>.
regParam=1.0 may penalize too much, because we use the average loss
instead of total loss. I just sent a PR to lower the default:
https://github.com/apache/spark/pull/3232

You can try LogisticRegressionWithLBFGS (and configure parameters
through its optimizer), which should converge faster than SGD. It uses
line search, so you don't need to worry about stepSize.
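
A minimal sketch of that suggestion (assuming, as in the original post, an
RDD[LabeledPoint] named `training`; the parameter values are illustrative,
not recommendations):

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

// L-BFGS chooses its own step via line search, so only the iteration
// count and regularization need tuning.
val lr = new LogisticRegressionWithLBFGS()
lr.optimizer
  .setNumIterations(100)
  .setRegParam(0.01)
val model = lr.run(training)
```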

We recently added pipeline features with tuning. You can take a look
at the example code here:
https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala
Note that these features are experimental, as this is an alpha
component.

Best,
Xiangrui

On Wed, Nov 12, 2014 at 10:08 AM, Sean Owen <so...@cloudera.com> wrote:
> OK, it's not class imbalance. Yes, 100 iterations.
> My other guess is that the stepSize of 1 is way too big for your data.
>
> I'd suggest you look at the weights / intercept of the resulting model to
> see if it makes any sense.
>
> You can call clearThreshold on the model, and then it will 'predict' the SVM
> margin instead of a class. That could at least tell you whether it's
> predicting the same value over and over or just lots of very big values.
>
> On Wed, Nov 12, 2014 at 6:02 PM, Caron <ca...@gmail.com> wrote:
>>
>> Sean,
>>
>> Thanks a lot for your reply!
>>
>> A few follow up questions:
>> 1. numIterations should be 100, not 100*trainingSetSize, right?
>> 2. My training set has 90k positive data points (with label 1) and 60k
>> negative data points (with label 0).
>> I set my numIterations to 100 as default. I still got the same prediction
>> result: everything was predicted as label 1.
>> And I'm sure my dataset is linearly separable because it has been run on
>> other frameworks like scikit-learn.
>>
>> // code
>> val numIterations = 100;
>> val regParam = 1
>> val svm = new SVMWithSGD()
>> svm.optimizer.setNumIterations(numIterations).setRegParam(regParam)
>> svm.setIntercept(true)
>> val model = svm.run(training)
>>
>>
>> -----
>> Thanks!
>> -Caron
>>
>>
>



Re: SVMWithSGD default threshold

Posted by Sean Owen <so...@cloudera.com>.
OK, it's not class imbalance. Yes, 100 iterations.
My other guess is that the stepSize of 1 is way too big for your data.

I'd suggest you look at the weights / intercept of the resulting model to
see if it makes any sense.

You can call clearThreshold on the model, and then it will 'predict' the
SVM margin instead of a class. That could at least tell you whether it's
predicting the same value over and over or just lots of very big values.
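
As a concrete sketch of that check (assuming `model` is the trained SVMModel
and `test` is an RDD[LabeledPoint]; both names are inferred from context, not
from the original code):

```scala
// With the threshold cleared, predict() returns the raw SVM margin
// instead of a 0/1 label.
model.clearThreshold()
val margins = test.map(p => model.predict(p.features))
margins.take(10).foreach(println)

// Restore the default behavior afterwards.
model.setThreshold(0.0)
```

If every margin is a large positive number, the optimization has likely
diverged (for example, because stepSize is too big), rather than the
threshold being wrong.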

On Wed, Nov 12, 2014 at 6:02 PM, Caron <ca...@gmail.com> wrote:

> Sean,
>
> Thanks a lot for your reply!
>
> A few follow up questions:
> 1. numIterations should be 100, not 100*trainingSetSize, right?
> 2. My training set has 90k positive data points (with label 1) and 60k
> negative data points (with label 0).
> I set my numIterations to 100 as default. I still got the same prediction
> result: everything was predicted as label 1.
> And I'm sure my dataset is linearly separable because it has been run on
> other frameworks like scikit-learn.
>
> // code
> val numIterations = 100;
> val regParam = 1
> val svm = new SVMWithSGD()
> svm.optimizer.setNumIterations(numIterations).setRegParam(regParam)
> svm.setIntercept(true)
> val model = svm.run(training)
>
> -----
> Thanks!
> -Caron
>
>
>

Re: SVMWithSGD default threshold

Posted by Caron <ca...@gmail.com>.
Sean,

Thanks a lot for your reply!

A few follow up questions:
1. numIterations should be 100, not 100*trainingSetSize, right?
2. My training set has 90k positive data points (with label 1) and 60k
negative data points (with label 0).
I set my numIterations to 100 as default. I still got the same prediction
result: everything was predicted as label 1.
And I'm sure my dataset is linearly separable, because it has been
classified correctly by other frameworks like scikit-learn.

// code
import org.apache.spark.mllib.classification.SVMWithSGD

val numIterations = 100
val regParam = 1
val svm = new SVMWithSGD()
svm.optimizer.setNumIterations(numIterations).setRegParam(regParam)
svm.setIntercept(true)   // fit a non-zero intercept
val model = svm.run(training)








-----
Thanks!
-Caron
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SVMWithSGD-default-threshold-tp18645p18741.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: SVMWithSGD default threshold

Posted by Sean Owen <so...@cloudera.com>.
I think you need to use setIntercept(true) to allow a non-zero
intercept. I also agree that's not an obvious or intuitive default.

Is your data set highly imbalanced, with lots of positive examples? That
could explain why predictions are heavily skewed.

Iterations should definitely not be of the same order of magnitude as your
input, which could have millions of elements. 100 should be plenty as a
default.

Threshold is not related to the 0/1 labels in SVMs. It is a threshold on
the SVM margin. Margin is 0 at the decision boundary, not 0.5.
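
To make that distinction concrete, here is a toy illustration in plain
Scala (the weights and intercept are made up for illustration; this is not
Spark code):

```scala
// The SVM decision value is the margin w . x + b; the default
// threshold 0.0 is compared against this margin, not against a
// probability in [0, 1].
val w = Array(2.0, -1.0)
val b = 0.5

def margin(x: Array[Double]): Double =
  w.zip(x).map { case (wi, xi) => wi * xi }.sum + b

def predict(x: Array[Double], threshold: Double = 0.0): Int =
  if (margin(x) >= threshold) 1 else 0
```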

There's no grid search at this stage but it's easy to code up in a short
method.
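
One possible sketch of such a method in plain Scala (`evaluate` is a
hypothetical stand-in for "train with these parameters, then score on a
held-out validation set"):

```scala
// Try every (stepSize, regParam) combination and keep the pair with
// the highest validation score.
def gridSearch(stepSizes: Seq[Double], regParams: Seq[Double])
              (evaluate: (Double, Double) => Double): (Double, Double) = {
  val candidates = for (s <- stepSizes; r <- regParams) yield (s, r)
  candidates.maxBy { case (s, r) => evaluate(s, r) }
}
```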


On Wed, Nov 12, 2014 at 12:41 AM, Caron <ca...@gmail.com> wrote:

> I'm hoping to get a linear classifier on a dataset.
> I'm using SVMWithSGD to train the data.
> After running with the default options: val model =
> SVMWithSGD.train(training, numIterations),
> I don't think SVM has done the classification correctly.
>
> My observations:
> 1. the intercept is always 0.0
> 2. the predicted labels are ALL 1's, no 0's.
>
> My questions are:
> 1. what should the numIterations be? I tried to set it to
> 10*trainingSetSize, is that sufficient?
> 2. since MLlib only accepts data with labels "0" or "1", shouldn't the
> default threshold for SVMWithSGD be 0.5 instead of 0.0?
> 3. It seems counter-intuitive to me to have the default intercept be 0.0,
> meaning the line has to go through the origin.
> 4. Does Spark MLlib provide an API to do grid search like scikit-learn
> does?
>
> Any help would be greatly appreciated!
>
>
>
>
> -----
> Thanks!
> -Caron
>
>
>