Posted to dev@spark.apache.org by Joseph Bradley <jo...@databricks.com> on 2015/12/01 01:33:37 UTC

Re: Problem in running MLlib SVM

model.predict should return a 0/1 predicted label.  The example code is
misleading when it calls the prediction a "score."
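To illustrate the distinction, here is a plain-Java sketch (an illustration only, not Spark's actual source) of the documented threshold behavior: with a threshold set (the default is 0.0), predict yields a hard 0/1 label; after clearThreshold(), it yields the raw margin wTx instead.

```java
public class ThresholdSketch {
    // Illustration only: mimics the documented SVMModel threshold behavior.
    // With a threshold set (default 0.0), predict yields a 0/1 label;
    // after clearThreshold() (modeled here as threshold == null), it
    // yields the raw margin w^T x, which can be any real number.
    static double predict(double margin, Double threshold) {
        if (threshold == null) {
            return margin;                       // raw score
        }
        return margin >= threshold ? 1.0 : 0.0;  // hard 0/1 label
    }

    public static void main(String[] args) {
        System.out.println(predict(2.7, 0.0));   // prints 1.0 (label)
        System.out.println(predict(-1.3, 0.0));  // prints 0.0 (label)
        System.out.println(predict(2.7, null));  // prints 2.7 (raw margin)
    }
}
```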

On Mon, Nov 30, 2015 at 9:13 AM, Fazlan Nazeem <fa...@wso2.com> wrote:

> You should never use the training data to measure your prediction
> accuracy. Always use a fresh dataset (test data) for this purpose.
>
> On Sun, Nov 29, 2015 at 8:36 AM, Jeff Zhang <zj...@gmail.com> wrote:
>
>> I think this represents the label of the LabeledPoint (0 means negative,
>> 1 means positive):
>> http://spark.apache.org/docs/latest/mllib-data-types.html#labeled-point
>>
>> The document you mention is for the mathematical formula, not the
>> implementation.
>>
>> On Sun, Nov 29, 2015 at 9:13 AM, Tarek Elgamal <ta...@gmail.com>
>> wrote:
>>
>>> According to the documentation
>>> <http://spark.apache.org/docs/latest/mllib-linear-methods.html>, by
>>> default, if wTx≥0 then the outcome is positive, and negative otherwise. I
>>> suppose that wTx is the "score" in my case. If the score is greater than 0
>>> and the label is positive, then I return 1 (a correct classification), and
>>> zero otherwise. Do you have any idea how to classify a point as positive
>>> or negative using this score or another function?
>>>
>>> On Sat, Nov 28, 2015 at 5:14 AM, Jeff Zhang <zj...@gmail.com> wrote:
>>>
>>>>         if ((score >= 0 && label == 1) || (score < 0 && label == 0)) {
>>>>           return 1; // correct classification
>>>>         } else {
>>>>           return 0;
>>>>         }
>>>>
>>>>
>>>>
>>>> I suspect score is always between 0 and 1
>>>>
>>>>
>>>>
>>>> On Sat, Nov 28, 2015 at 10:39 AM, Tarek Elgamal <
>>>> tarek.elgamal@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am trying to run the straightforward SVM example, but I am getting
>>>>> low accuracy (around 50%) when I predict using the same data I used for
>>>>> training. I am probably doing the prediction the wrong way. My code is
>>>>> below. I would appreciate any help.
>>>>>
>>>>>
>>>>> import java.util.List;
>>>>>
>>>>> import org.apache.spark.SparkConf;
>>>>> import org.apache.spark.SparkContext;
>>>>> import org.apache.spark.api.java.JavaRDD;
>>>>> import org.apache.spark.api.java.function.Function;
>>>>> import org.apache.spark.api.java.function.Function2;
>>>>> import org.apache.spark.mllib.classification.SVMModel;
>>>>> import org.apache.spark.mllib.classification.SVMWithSGD;
>>>>> import org.apache.spark.mllib.regression.LabeledPoint;
>>>>> import org.apache.spark.mllib.util.MLUtils;
>>>>>
>>>>> import scala.Tuple2;
>>>>> import edu.illinois.biglbjava.readers.LabeledPointReader;
>>>>>
>>>>> public class SimpleDistSVM {
>>>>>   public static void main(String[] args) {
>>>>>     SparkConf conf = new SparkConf().setAppName("SVM Classifier Example");
>>>>>     SparkContext sc = new SparkContext(conf);
>>>>>     String inputPath=args[0];
>>>>>
>>>>>     // Read training data
>>>>>     JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(sc,
>>>>> inputPath).toJavaRDD();
>>>>>
>>>>>     // Run training algorithm to build the model.
>>>>>     int numIterations = 3;
>>>>>     final SVMModel model = SVMWithSGD.train(data.rdd(), numIterations);
>>>>>
>>>>>     // Clear the default threshold.
>>>>>     model.clearThreshold();
>>>>>
>>>>>
>>>>>     // Predict each point and map to an RDD of 0/1 values, where 0 is a
>>>>> misclassification and 1 is a correct classification
>>>>>     JavaRDD<Integer> classification = data.map(new
>>>>> Function<LabeledPoint, Integer>() {
>>>>>          public Integer call(LabeledPoint p) {
>>>>>            int label = (int) p.label();
>>>>>            Double score = model.predict(p.features());
>>>>>            if ((score >= 0 && label == 1) || (score < 0 && label == 0)) {
>>>>>              return 1; // correct classification
>>>>>            } else {
>>>>>              return 0;
>>>>>            }
>>>>>
>>>>>          }
>>>>>        }
>>>>>      );
>>>>>     // sum up all values in the rdd to get the number of correctly
>>>>> classified examples
>>>>>      int sum=classification.reduce(new Function2<Integer, Integer,
>>>>> Integer>()
>>>>>     {
>>>>>     public Integer call(Integer arg0, Integer arg1)
>>>>>     throws Exception {
>>>>>     return arg0+arg1;
>>>>>     }});
>>>>>
>>>>>      //compute accuracy as the percentage of the correctly classified
>>>>> examples
>>>>>      double accuracy=((double)sum)/((double)classification.count());
>>>>>      System.out.println("Accuracy = " + accuracy);
>>>>>
>>>>>   }
>>>>> }
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Best Regards
>>>>
>>>> Jeff Zhang
>>>>
>>>
>>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>
>
> --
> Thanks & Regards,
>
> Fazlan Nazeem
>
> *Software Engineer*
>
> *WSO2 Inc*
> Mobile : +94772338839
> fazlann@wso2.com
>

Re: Problem in running MLlib SVM

Posted by Joseph Bradley <jo...@databricks.com>.
Oh, sorry about that.  I forgot that's the behavior when the threshold is
not set.  My guess would be that you need more iterations, or that the
regParam needs to be tuned.

I'd recommend testing on some of the LibSVM datasets.  They have a lot, and
you can find existing examples (and results) for many of them.
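The LibSVM datasets use the same sparse text format that MLUtils.loadLibSVMFile reads: "<label> <index>:<value> ..." with 1-based feature indices. A toy parser for one line (an illustration of the format only, not Spark's implementation):

```java
import java.util.Arrays;

public class LibSvmLineSketch {
    // Parses the feature part of one LibSVM-format line, e.g. "1 1:0.5 3:1.2",
    // into a dense array of the given length. Indices in the file are 1-based.
    static double[] parseFeatures(String line, int numFeatures) {
        double[] features = new double[numFeatures];
        String[] parts = line.trim().split("\\s+");
        for (int i = 1; i < parts.length; i++) {      // parts[0] is the label
            String[] kv = parts[i].split(":");
            features[Integer.parseInt(kv[0]) - 1] = Double.parseDouble(kv[1]);
        }
        return features;
    }

    public static void main(String[] args) {
        // label 1, feature 1 = 0.5, feature 3 = 1.2
        System.out.println(Arrays.toString(parseFeatures("1 1:0.5 3:1.2", 3)));
        // prints [0.5, 0.0, 1.2]
    }
}
```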

On Tue, Dec 1, 2015 at 12:02 PM, Tarek Elgamal <ta...@gmail.com>
wrote:

> Thanks, actually model.predict() gives a number between 0 and 1. However,
> model.predictPoint gives me a 0/1 value, but the accuracy is still very
> low. I am using the training data just to make sure that I am using it
> correctly, but it still does not seem to work for me.
> @Joseph, do you have any benchmark data that you have tried SVM on? I am
> attaching my toy data with just 100 examples. I tried different and bigger
> datasets and still get accuracy around 57% on the training set.

Re: Problem in running MLlib SVM

Posted by Tarek Elgamal <ta...@gmail.com>.
Thanks, actually model.predict() gives a number between 0 and 1. However,
model.predictPoint gives me a 0/1 value, but the accuracy is still very
low. I am using the training data just to make sure that I am using it
correctly, but it still does not seem to work for me.
@Joseph, do you have any benchmark data that you have tried SVM on? I am
attaching my toy data with just 100 examples. I tried different and bigger
datasets and still get accuracy around 57% on the training set.
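For reference, the evaluation logic from the thread boils down to the following plain-Java sketch (no Spark needed), which counts predictions matching the 0/1 labels and divides by the total:

```java
public class AccuracySketch {
    // Plain-Java version of the accuracy computation used in the thread:
    // fraction of predictions that equal the corresponding 0/1 label.
    static double accuracy(int[] labels, int[] predictions) {
        int correct = 0;
        for (int i = 0; i < labels.length; i++) {
            if (labels[i] == predictions[i]) {
                correct++;
            }
        }
        return (double) correct / labels.length;
    }

    public static void main(String[] args) {
        int[] labels      = {1, 0, 1, 1, 0};
        int[] predictions = {1, 0, 0, 1, 0};  // one mistake out of five
        System.out.println("Accuracy = " + accuracy(labels, predictions));
        // prints Accuracy = 0.8
    }
}
```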

On Mon, Nov 30, 2015 at 6:33 PM, Joseph Bradley <jo...@databricks.com>
wrote:

> model.predict should return a 0/1 predicted label.  The example code is
> misleading when it calls the prediction a "score."