Posted to dev@spark.apache.org by Tarek Elgamal <ta...@gmail.com> on 2015/11/28 03:39:59 UTC

Problem in running MLlib SVM

Hi,

I am trying to run the straightforward SVM example, but I am getting low
accuracy (around 50%) when I predict using the same data I used for
training. I am probably doing the prediction the wrong way. My code is
below. I would appreciate any help.


import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.mllib.classification.SVMModel;
import org.apache.spark.mllib.classification.SVMWithSGD;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.util.MLUtils;

public class SimpleDistSVM {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("SVM Classifier Example");
    SparkContext sc = new SparkContext(conf);
    String inputPath = args[0];

    // Read training data in LibSVM format
    JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(sc, inputPath).toJavaRDD();

    // Run training algorithm to build the model.
    int numIterations = 3;
    final SVMModel model = SVMWithSGD.train(data.rdd(), numIterations);

    // Clear the default threshold so predict() returns the raw score (w.x + b).
    model.clearThreshold();

    // Predict points in the training set and map to an RDD of 0/1 values,
    // where 0 is a misclassification and 1 is a correct classification.
    JavaRDD<Integer> classification = data.map(new Function<LabeledPoint, Integer>() {
      public Integer call(LabeledPoint p) {
        int label = (int) p.label();
        double score = model.predict(p.features());
        if ((score >= 0 && label == 1) || (score < 0 && label == 0)) {
          return 1; // correct classification
        } else {
          return 0;
        }
      }
    });

    // Sum all values in the RDD to get the number of correctly classified examples.
    int sum = classification.reduce(new Function2<Integer, Integer, Integer>() {
      public Integer call(Integer arg0, Integer arg1) throws Exception {
        return arg0 + arg1;
      }
    });

    // Compute accuracy as the fraction of correctly classified examples.
    double accuracy = ((double) sum) / ((double) classification.count());
    System.out.println("Accuracy = " + accuracy);
  }
}

Re: Problem in running MLlib SVM

Posted by Joseph Bradley <jo...@databricks.com>.
Oh, sorry about that.  I forgot that's the behavior when the threshold is
not set.  My guess would be that you need more iterations, or that the
regParam needs to be tuned.

I'd recommend testing on some of the LibSVM datasets.  They have a lot, and
you can find existing examples (and results) for many of them.
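
For readers following along, here is a minimal sketch of tuning those two
knobs in the original Java code, using the SVMWithSGD.train overload that
also takes stepSize and regParam; the concrete values are illustrative
starting points to experiment with, not recommendations:

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.mllib.classification.SVMModel;
    import org.apache.spark.mllib.classification.SVMWithSGD;
    import org.apache.spark.mllib.regression.LabeledPoint;

    public class TunedTraining {
      // Train with more iterations and an explicit L2 regParam; both are
      // placeholder values, not tuned results.
      static SVMModel train(JavaRDD<LabeledPoint> data) {
        int numIterations = 100; // was 3 in the original post
        double stepSize = 1.0;   // SGD step size (MLlib's default)
        double regParam = 0.01;  // L2 regularization strength to tune
        return SVMWithSGD.train(data.rdd(), numIterations, stepSize, regParam);
      }
    }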


Re: Problem in running MLlib SVM

Posted by Tarek Elgamal <ta...@gmail.com>.
Thanks, actually model.predict() gives a number between 0 and 1. However,
model.predictPoint gives me a number from 0/1 but the accuracy is still
very low. I am using the training data just to make sure that I am using it
right. But it still seems not to work for me.
@Joseph, do you have any benchmark data that you tried SVM on. I am
attaching my toy data with just 100 examples. I tried it with different
data and bigger data and still getting accuracy around 57% on training set.


Re: Problem in running MLlib SVM

Posted by Joseph Bradley <jo...@databricks.com>.
model.predict should return a 0/1 predicted label.  The example code is
misleading when it calls the prediction a "score."
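
To make the distinction concrete, a small sketch of the two prediction
modes being discussed, assuming a trained SVMModel; the class, method, and
variable names here are illustrative, not from the thread:

    import org.apache.spark.mllib.classification.SVMModel;
    import org.apache.spark.mllib.linalg.Vector;

    public class ThresholdModes {
      // With a threshold set (the default is 0.0), predict() returns a 0/1
      // label; after clearThreshold(), it returns the raw margin instead.
      static void demo(SVMModel model, Vector features) {
        model.setThreshold(0.0);
        double label = model.predict(features);  // 0.0 or 1.0
        model.clearThreshold();
        double margin = model.predict(features); // unbounded raw score
        System.out.println("label = " + label + ", margin = " + margin);
      }
    }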


Re: Problem in running MLlib SVM

Posted by Fazlan Nazeem <fa...@wso2.com>.
You should never use the training data to measure your prediction accuracy.
Always use a fresh dataset (test data) for this purpose.
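
A minimal sketch of such a split applied to the data RDD from the original
post; the 70/30 ratio and the seed are arbitrary choices:

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.mllib.classification.SVMModel;
    import org.apache.spark.mllib.classification.SVMWithSGD;
    import org.apache.spark.mllib.regression.LabeledPoint;

    public class TrainTestSplit {
      static SVMModel trainWithHoldout(JavaRDD<LabeledPoint> data) {
        // Train on 70% of the data; keep 30% aside for evaluation.
        JavaRDD<LabeledPoint>[] splits = data.randomSplit(new double[]{0.7, 0.3}, 42L);
        JavaRDD<LabeledPoint> training = splits[0].cache();
        JavaRDD<LabeledPoint> test = splits[1];
        SVMModel model = SVMWithSGD.train(training.rdd(), 100);
        // ... compute accuracy against `test`, not `training`
        return model;
      }
    }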



-- 
Thanks & Regards,

Fazlan Nazeem

*Software Engineer*

*WSO2 Inc*
Mobile : +94772338839
fazlann@wso2.com

Re: Problem in running MLlib SVM

Posted by Jeff Zhang <zj...@gmail.com>.
I think this represents the label of LabeledPoint (0 means negative,
1 means positive):
http://spark.apache.org/docs/latest/mllib-data-types.html#labeled-point

The document you mention is for the mathematical formula, not the
implementation.
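
For concreteness, a sketch of what that label convention looks like when
constructing points by hand; the feature values are made up:

    import org.apache.spark.mllib.linalg.Vectors;
    import org.apache.spark.mllib.regression.LabeledPoint;

    public class LabelConvention {
      public static void main(String[] args) {
        // Binary labels in MLlib are 0.0 (negative) or 1.0 (positive).
        LabeledPoint positive = new LabeledPoint(1.0, Vectors.dense(0.5, 1.2, -0.3));
        LabeledPoint negative = new LabeledPoint(0.0, Vectors.dense(-1.1, 0.4, 0.9));
        System.out.println(positive + "\n" + negative);
      }
    }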



-- 
Best Regards

Jeff Zhang

Re: Problem in running MLlib SVM

Posted by Tarek Elgamal <ta...@gmail.com>.
According to the documentation
<http://spark.apache.org/docs/latest/mllib-linear-methods.html>, by
default, if wTx >= 0 then the outcome is positive, and negative otherwise. I
suppose that wTx is the "score" in my case. If the score is at least 0 and
the label is positive, then I return 1, which is a correct classification,
and I return zero otherwise. Do you have any idea how to classify a point
as positive or negative using this score or another function?


Re: Problem in running MLlib SVM

Posted by Jeff Zhang <zj...@gmail.com>.
        if ((score >= 0 && label == 1) || (score < 0 && label == 0)) {
          return 1; // correct classification
        } else {
          return 0;
        }



I suspect score is always between 0 and 1





-- 
Best Regards

Jeff Zhang

Re: Problem in running MLlib SVM

Posted by Robert Dodier <ro...@gmail.com>.
Tarek,

On looking at the code in SVM.scala, I see that SVMWithSGD.predictPoint
first computes dot(w, x) + b where w is the SVM weight vector, x is the
input vector, and b is a constant. If there is a threshold defined, then the
output is 1 if that's greater than the threshold and 0 otherwise. If there
is no threshold, then it just returns dot(w, x) + b. There is no requirement
that the output be constrained to a specific range. 
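
As a rough paraphrase of that decision rule in plain Java (not the actual
MLlib source; the names here are illustrative):

    public class PredictRuleSketch {
      // Compute the margin w.x + b, then threshold it if a threshold is set.
      static double predictPoint(double[] w, double[] x, double b, Double threshold) {
        double margin = b;
        for (int i = 0; i < w.length; i++) {
          margin += w[i] * x[i];
        }
        if (threshold == null) {
          return margin;                        // no threshold: raw score
        }
        return margin > threshold ? 1.0 : 0.0;  // thresholded 0/1 label
      }
    }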

For a little problem I was working on, I investigated the outputs a little
bit; here's a snippet of some stuff you could put in spark-shell:

    model.clearThreshold()
    val foo = x.map(p => (p.label, model.predict(p.features)))
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.stat.Statistics
    val summary = Statistics.colStats(foo.map { case (a, b) => Vectors.dense(a, b) })
    summary.mean
    summary.min
    summary.max

When I tried that, I found a very large range of outputs -- something like
-6*10^6 to -400, with a mean of about -30000. If you look into it, let us
know what you find; I would be interested to hear about it.

best,

Robert Dodier



