Posted to user@spark.apache.org by Adamantios Corais <ad...@gmail.com> on 2014/09/19 18:43:53 UTC

return probability \ confidence instead of actual class

Hi,

I am working with the SVMWithSGD classification algorithm on Spark. It
works fine for me; however, I would like to distinguish the instances that
are classified with high confidence from those with low confidence. How do
we define the threshold here? Ultimately, I want to keep only those for
which the algorithm is very *very* certain about its decision! How can I do
that? Is this feature already supported by any MLlib algorithm? What if I
had multiple categories?

Any input is highly appreciated!

Re: return probability \ confidence instead of actual class

Posted by Adamantios Corais <ad...@gmail.com>.
Thank you, Sean. I'll try to do it externally as you suggested; however, can
you please give me some hints on how to do that? In fact, where can I find
the 1.2 implementation you just mentioned? Thanks!




On Wed, Oct 8, 2014 at 12:58 PM, Sean Owen <so...@cloudera.com> wrote:

> Plain old SVMs don't produce an estimate of class probabilities;
> predict_proba() does some additional work to estimate class
> probabilities from the SVM output. Spark does not implement this right
> now.
>
> Spark implements the equivalent of decision_function (the wTx + b bit)
> but does not expose it, and instead gives you predict(), which gives 0
> or 1 depending on whether the decision function exceeds the specified
> threshold.
>
> Yes you can roll your own just like you did to calculate the decision
> function from weights and intercept. I suppose it would be nice to
> expose it (do I hear a PR?) but it's not hard to do externally. You'll
> have to do this anyway if you're on anything earlier than 1.2.
>
> On Wed, Oct 8, 2014 at 10:17 AM, Adamantios Corais
> <ad...@gmail.com> wrote:
> > OK, let me rephrase my question once again. Python-wise, I prefer
> > .predict_proba(X) to .decision_function(X), since it is easier for me to
> > interpret the results. As far as I can see, the latter functionality is
> > already implemented in Spark (well, in version 0.9.2, for example, I have
> > to compute the dot product on my own, otherwise I get 0 or 1), but the
> > former is not implemented (yet!). What should I do \ how can I implement
> > that one in Spark as well? What are the required inputs here, and what
> > does the formula look like?
> >
> > On Tue, Oct 7, 2014 at 10:04 PM, Sean Owen <so...@cloudera.com> wrote:
> >>
> >> It looks like you are directly computing the SVM decision function in
> >> both cases:
> >>
> >> val predictions2 = m_users_double.map{point=>
> >>   point.zip(weights).map(a=> a._1 * a._2).sum + intercept
> >> }.cache()
> >>
> >> clf.decision_function(T)
> >>
> >> This does not give you +1/-1 in SVMs (well... not for most points,
> >> which will be outside the margin around the separating hyperplane).
> >>
> >> You can use the predict() function in SVMModel -- which will give you
> >> 0 or 1 (rather than +/- 1, but that's just a difference of convention)
> >> depending on the sign of the decision function. I don't know if this
> >> was in 0.9.
> >>
> >> At the moment I assume you saw small values of the decision function
> >> in scikit because of the radial basis function.
>

Re: return probability \ confidence instead of actual class

Posted by Sean Owen <so...@cloudera.com>.
Plain old SVMs don't produce an estimate of class probabilities;
predict_proba() does some additional work to estimate class
probabilities from the SVM output. Spark does not implement this right
now.
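
What predict_proba does under the hood is, essentially, Platt scaling: fit a
sigmoid to held-out (score, label) pairs and read probabilities off it. A
minimal sketch of the resulting formula in Scala, where A and B are
placeholders for the two parameters that the fitting step would produce
(this is not an MLlib API):

// Platt scaling: map a raw SVM score (the margin) to a probability estimate.
// A and B are hypothetical values fitted beforehand on held-out data.
def plattProbability(margin: Double, A: Double, B: Double): Double =
  1.0 / (1.0 + math.exp(A * margin + B))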

Spark implements the equivalent of decision_function (the wTx + b bit)
but does not expose it, and instead gives you predict(), which gives 0
or 1 depending on whether the decision function exceeds the specified
threshold.

Yes you can roll your own just like you did to calculate the decision
function from weights and intercept. I suppose it would be nice to
expose it (do I hear a PR?) but it's not hard to do externally. You'll
have to do this anyway if you're on anything earlier than 1.2.
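
A minimal sketch of that external computation, in the style of the 0.9-era
code quoted below (trainingData, features and cutoff are placeholder names;
in newer versions model.weights is a Vector, so call .toArray on it first):

val model = SVMWithSGD.train(trainingData, iterations)
val weights = model.weights      // Array[Double] in 0.9.x
val intercept = model.intercept

// Raw decision function wTx + b; the farther it is from 0, the farther the
// point lies from the separating hyperplane.
val margins = features.map { x =>
  x.zip(weights).map { case (xi, wi) => xi * wi }.sum + intercept
}

// Keep only the confident predictions, for whatever cutoff you choose:
val confident = margins.filter(m => math.abs(m) > cutoff)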

On Wed, Oct 8, 2014 at 10:17 AM, Adamantios Corais
<ad...@gmail.com> wrote:
> OK, let me rephrase my question once again. Python-wise, I prefer
> .predict_proba(X) to .decision_function(X), since it is easier for me to
> interpret the results. As far as I can see, the latter functionality is
> already implemented in Spark (well, in version 0.9.2, for example, I have
> to compute the dot product on my own, otherwise I get 0 or 1), but the
> former is not implemented (yet!). What should I do \ how can I implement
> that one in Spark as well? What are the required inputs here, and what
> does the formula look like?
>
> On Tue, Oct 7, 2014 at 10:04 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>> It looks like you are directly computing the SVM decision function in
>> both cases:
>>
>> val predictions2 = m_users_double.map{point=>
>>   point.zip(weights).map(a=> a._1 * a._2).sum + intercept
>> }.cache()
>>
>> clf.decision_function(T)
>>
>> This does not give you +1/-1 in SVMs (well... not for most points,
>> which will be outside the margin around the separating hyperplane).
>>
>> You can use the predict() function in SVMModel -- which will give you
>> 0 or 1 (rather than +/- 1, but that's just a difference of convention)
>> depending on the sign of the decision function. I don't know if this
>> was in 0.9.
>>
>> At the moment I assume you saw small values of the decision function
>> in scikit because of the radial basis function.



Re: return probability \ confidence instead of actual class

Posted by Adamantios Corais <ad...@gmail.com>.
OK, let me rephrase my question once again. Python-wise, I prefer
.predict_proba(X) to .decision_function(X), since it is easier for me to
interpret the results. As far as I can see, the latter functionality is
already implemented in Spark (well, in version 0.9.2, for example, I have to
compute the dot product on my own, otherwise I get 0 or 1), but the former
is not implemented (yet!). What should I do \ how can I implement that one
in Spark as well? What are the required inputs here, and what does the
formula look like?

On Tue, Oct 7, 2014 at 10:04 PM, Sean Owen <so...@cloudera.com> wrote:

> It looks like you are directly computing the SVM decision function in
> both cases:
>
> val predictions2 = m_users_double.map{point=>
>   point.zip(weights).map(a=> a._1 * a._2).sum + intercept
> }.cache()
>
> clf.decision_function(T)
>
> This does not give you +1/-1 in SVMs (well... not for most points,
> which will be outside the margin around the separating hyperplane).
>
> You can use the predict() function in SVMModel -- which will give you
> 0 or 1 (rather than +/- 1, but that's just a difference of convention)
> depending on the sign of the decision function. I don't know if this
> was in 0.9.
>
> At the moment I assume you saw small values of the decision function
> in scikit because of the radial basis function.
>
> On Tue, Oct 7, 2014 at 7:45 PM, Sunny Khatri <su...@gmail.com> wrote:
> > Not familiar with the scikit SVM implementation (and I assume you are
> > using LinearSVC). To figure out an optimal decision boundary based on the
> > scores obtained, you can use an ROC curve, varying your thresholds.
> >
>

Re: return probability \ confidence instead of actual class

Posted by Sean Owen <so...@cloudera.com>.
It looks like you are directly computing the SVM decision function in
both cases:

val predictions2 = m_users_double.map{point=>
  point.zip(weights).map(a=> a._1 * a._2).sum + intercept
}.cache()

clf.decision_function(T)

This does not give you +1/-1 in SVMs (well... not for most points,
which will be outside the margin around the separating hyperplane).

You can use the predict() function in SVMModel -- which will give you
0 or 1 (rather than +/- 1, but that's just a difference of convention)
depending on the sign of the decision function. I don't know if this
was in 0.9.

At the moment I assume you saw small values of the decision function
in scikit because of the radial basis function.
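
For completeness, the clearThreshold() trick Aris mentions elsewhere in this
thread makes predict() itself return the raw score rather than the 0/1 label
(in releases that have it; a sketch, with trainingData and test as
placeholder RDD[LabeledPoint] names):

val model = SVMWithSGD.train(trainingData, iterations)
model.clearThreshold()   // predict() now returns the raw margin wTx + b

val scoreAndLabel = test.map(p => (model.predict(p.features), p.label))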

On Tue, Oct 7, 2014 at 7:45 PM, Sunny Khatri <su...@gmail.com> wrote:
> Not familiar with the scikit SVM implementation (and I assume you are using
> LinearSVC). To figure out an optimal decision boundary based on the scores
> obtained, you can use an ROC curve, varying your thresholds.
>



Re: return probability \ confidence instead of actual class

Posted by Sunny Khatri <su...@gmail.com>.
Not familiar with the scikit SVM implementation (and I assume you are using
LinearSVC). To figure out an optimal decision boundary based on the scores
obtained, you can use an ROC curve, varying your thresholds.
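
A sketch of that threshold search with MLlib's BinaryClassificationMetrics
(available in newer Spark releases; scoreAndLabels is a placeholder for an
RDD[(Double, Double)] pairing each raw score with its true 0/1 label):

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

val metrics = new BinaryClassificationMetrics(scoreAndLabels)

metrics.roc()            // RDD of (false positive rate, true positive rate)
metrics.areaUnderROC()   // scalar summary of the whole curve

// Precision at every candidate threshold, to pick a cut-off directly:
metrics.precisionByThreshold().collect().foreach { case (t, p) =>
  println(s"threshold = $t, precision = $p")
}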

On Tue, Oct 7, 2014 at 12:08 AM, Adamantios Corais <
adamantios.corais@gmail.com> wrote:

> Well, apparently, the above Python set-up is wrong. Please consider the
> following set-up, which DOES use a 'linear' kernel... And the question
> remains the same: how should I interpret the Spark results (or why are the
> Spark results NOT bounded between -1 and 1)?
>
> On Mon, Oct 6, 2014 at 8:35 PM, Sunny Khatri <su...@gmail.com> wrote:
>
>> One difference I can find is that you may have different kernel functions
>> for your training. In Spark, you end up using a linear kernel, whereas for
>> scikit you are using the RBF kernel. That can explain the difference in
>> the coefficients you are getting.
>>
>> On Mon, Oct 6, 2014 at 10:15 AM, Adamantios Corais <
>> adamantios.corais@gmail.com> wrote:
>>
>>> Hi again,
>>>
>>> Finally, I found the time to play around with your suggestions.
>>> Unfortunately, I noticed some unusual behavior in the MLlib results, which
>>> is more obvious when I compare them against their scikit-learn equivalent.
>>> Note that I am currently using Spark 0.9.2. Long story short: I find it
>>> difficult to interpret the results: scikit-learn SVM always returns a value
>>> between -1 and +1, which makes it easy for me to set up a threshold in order
>>> to keep only the most significant classifications (this is the case for
>>> both short and long input vectors). On the other hand, Spark MLlib makes it
>>> impossible to interpret the results; they are hardly ever bounded between
>>> -1 and +1, and hence it is impossible to choose a good cut-off value - the
>>> results are of no practical use. And here is the strangest thing ever:
>>> although MLlib does NOT seem to generate the right weights and intercept,
>>> when I feed MLlib the weights and intercept from scikit-learn, the results
>>> become pretty accurate! Any ideas about what is happening? Any suggestion is
>>> highly appreciated.
>>>
>>> PS: to make things easier I have quoted both of my implementations as well
>>> as the results, below.
>>>
>>> //////////////////////////////////////////////////
>>>
>>> SPARK (short input):
>>> training_error: Double = 0.0
>>> res2: Array[Double] = Array(-1.4420684459128205E-19,
>>> -1.4420684459128205E-19, -1.4420684459128205E-19, 0.3749999999999999,
>>> 0.7499999999999998, 0.7499999999999998, 0.7499999999999998)
>>>
>>> SPARK (long input):
>>> training_error: Double = 0.0
>>> res2: Array[Double] = Array(-0.782207630902241, -0.782207630902241,
>>> -0.782207630902241, 0.9522394329769612, 2.6866864968561632,
>>> 2.6866864968561632, 2.6866864968561632)
>>>
>>> PYTHON (short input):
>>> array([[-1.00000001],
>>>        [-1.00000001],
>>>        [-1.00000001],
>>>        [-0.        ],
>>>        [ 1.00000001],
>>>        [ 1.00000001],
>>>        [ 1.00000001]])
>>>
>>> PYTHON (long input):
>>> array([[-1.00000001],
>>>        [-1.00000001],
>>>        [-1.00000001],
>>>        [-0.        ],
>>>        [ 1.00000001],
>>>        [ 1.00000001],
>>>        [ 1.00000001]])
>>>
>>> //////////////////////////////////////////////////
>>>
>>> import analytics.MSC
>>>
>>> import java.util.Calendar
>>> import java.text.SimpleDateFormat
>>> import scala.collection.mutable
>>> import scala.collection.JavaConversions._
>>> import org.apache.spark.SparkContext._
>>> import org.apache.spark.mllib.classification.SVMWithSGD
>>> import org.apache.spark.mllib.regression.LabeledPoint
>>> import org.apache.spark.mllib.optimization.L1Updater
>>> import com.datastax.bdp.spark.connector.CassandraConnector
>>> import com.datastax.bdp.spark.SparkContextCassandraFunctions._
>>>
>>> val sc = MSC.sc
>>> val lg = MSC.logger
>>>
>>> //val s_users_double_2 = Seq(
>>> //  (0.0,Seq(0.0, 0.0, 0.0)),
>>> //  (0.0,Seq(0.0, 0.0, 0.0)),
>>> //  (0.0,Seq(0.0, 0.0, 0.0)),
>>> //  (1.0,Seq(1.0, 1.0, 1.0)),
>>> //  (1.0,Seq(1.0, 1.0, 1.0)),
>>> //  (1.0,Seq(1.0, 1.0, 1.0))
>>> //)
>>> val s_users_double_2 = Seq(
>>>     (0.0,Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0)),
>>>     (0.0,Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0)),
>>>     (0.0,Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0)),
>>>     (1.0,Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0)),
>>>     (1.0,Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0)),
>>>     (1.0,Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0))
>>> )
>>> val s_users_double = sc.parallelize(s_users_double_2)
>>>
>>> val s_users_parsed = s_users_double.map{line=>
>>>   LabeledPoint(line._1, line._2.toArray)
>>> }.cache()
>>>
>>> val iterations = 100
>>>
>>> val model = SVMWithSGD.train(s_users_parsed, iterations)
>>>
>>> val predictions1 = s_users_parsed.map{point=>
>>>   (point.label, model.predict(point.features))
>>> }.cache()
>>>
>>> val training_error = predictions1.filter(r=> r._1 !=
>>> r._2).count().toDouble / s_users_parsed.count()
>>>
>>> val TP = predictions1.map(s=> if (s._1==1.0 && s._2==1.0) true else
>>> false).filter(t=> t).count()
>>> val FP = predictions1.map(s=> if (s._1==0.0 && s._2==1.0) true else
>>> false).filter(t=> t).count()
>>> val TN = predictions1.map(s=> if (s._1==0.0 && s._2==0.0) true else
>>> false).filter(t=> t).count()
>>> val FN = predictions1.map(s=> if (s._1==1.0 && s._2==0.0) true else
>>> false).filter(t=> t).count()
>>>
>>> val weights = model.weights
>>>
>>> val intercept = model.intercept
>>>
>>> //val m_users_double_2 = Seq(
>>> //  Seq(0.0, 0.0, 0.0),
>>> //  Seq(0.0, 0.0, 0.0),
>>> //  Seq(0.0, 0.0, 0.0),
>>> //  Seq(0.5, 0.5, 0.5),
>>> //  Seq(1.0, 1.0, 1.0),
>>> //  Seq(1.0, 1.0, 1.0),
>>> //  Seq(1.0, 1.0, 1.0)
>>> //)
>>> val m_users_double_2 = Seq(
>>>     Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0),
>>>     Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0),
>>>     Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0),
>>>       Seq(0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
>>> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
>>> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 2.0, 0.5, 0.5, 0.5),
>>>     Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0),
>>>     Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0),
>>>     Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0)
>>> )
>>> val m_users_double = sc.parallelize(m_users_double_2)
>>>
>>> val predictions2 = m_users_double.map{point=>
>>>   point.zip(weights).map(a=> a._1 * a._2).sum + intercept
>>> }.cache()
>>>
>>> predictions2.collect()
>>>
>>> //////////////////////////////////////////////////
>>>
>>> from sklearn import svm
>>>
>>> flag = 'short' # 'long'
>>>
>>> if flag == 'long':
>>>     X = [
>>>         [0.0, 0.0, 0.0],
>>>         [0.0, 0.0, 0.0],
>>>         [0.0, 0.0, 0.0],
>>>         [1.0, 1.0, 1.0],
>>>         [1.0, 1.0, 1.0],
>>>         [1.0, 1.0, 1.0]
>>>     ]
>>>     Y = [
>>>         0.0,
>>>         0.0,
>>>         0.0,
>>>         1.0,
>>>         1.0,
>>>         1.0
>>>     ]
>>>     T = [
>>>         [0.0, 0.0, 0.0],
>>>         [0.0, 0.0, 0.0],
>>>         [0.0, 0.0, 0.0],
>>>         [0.5, 0.5, 0.5],
>>>         [1.0, 1.0, 1.0],
>>>         [1.0, 1.0, 1.0],
>>>         [1.0, 1.0, 1.0]
>>>     ]
>>>
>>> if flag == 'long':
>>>     X = [
>>>         [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0],
>>>         [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0],
>>>         [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0],
>>>         [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0],
>>>         [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0],
>>>         [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0]
>>>     ]
>>>     Y = [
>>>         0.0,
>>>         0.0,
>>>         0.0,
>>>         1.0,
>>>         1.0,
>>>         1.0
>>>     ]
>>>     T = [
>>>         [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0],
>>>         [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0],
>>>         [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0],
>>>         [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
>>> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
>>> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 2.0, 0.5, 0.5, 0.5],
>>>         [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0],
>>>         [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0],
>>>         [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>>> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0]
>>>     ]
>>>
>>> clf = svm.SVC()
>>> clf.fit(X, Y)
>>> svm.SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
>>> gamma=0.0, kernel='rbf', max_iter=-1, probability=False, random_state=None,
>>> shrinking=True, tol=0.001, verbose=False)
>>> clf.decision_function(T)
>>>
>>> ///////////////////////////////////////////////////
>>>
>>>
>>>
>>>
>>> On Thu, Sep 25, 2014 at 2:25 AM, Sunny Khatri <su...@gmail.com>
>>> wrote:
>>>
>>>> For multi-class you can use the same SVMWithSGD (for binary
>>>> classification) with a one-vs-all approach, constructing the respective
>>>> training corpora with class i as the positive samples and the rest of the
>>>> classes as the negative ones, and then use the same method provided by
>>>> Aris as a measure of how far class i is from the decision boundary.
>>>>
>>>> On Wed, Sep 24, 2014 at 4:06 PM, Aris <ar...@gmail.com> wrote:
>>>>
>>>>> Greetings, Adamantios Corais... if that is indeed your name.
>>>>>
>>>>> Just to follow up on Liquan, you might be interested in removing the
>>>>> threshold and then treating the predictions as a probability from 0..1
>>>>> inclusive. SVM with the linear kernel is a straightforward linear
>>>>> classifier -- so with model.clearThreshold() you can just get the raw
>>>>> predicted scores, removing the threshold that simply translates them
>>>>> into a positive/negative class.
>>>>>
>>>>> API is here
>>>>> http://yhuai.github.io/site/api/scala/index.html#org.apache.spark.mllib.classification.SVMModel
>>>>>
>>>>> Enjoy!
>>>>> Aris
>>>>>
>>>>> On Sun, Sep 21, 2014 at 11:50 PM, Liquan Pei <li...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Adamantios,
>>>>>>
>>>>>> For your first question: after you train the SVM, you get a model
>>>>>> with a vector of weights w and an intercept b; points x such that
>>>>>> w.dot(x) + b = 1 or w.dot(x) + b = -1 lie on the margin boundaries. The
>>>>>> quantity w.dot(x) + b for a point x is a confidence measure of the
>>>>>> classification.
>>>>>>
>>>>>> Code-wise, suppose you trained your model via
>>>>>> val model = SVMWithSGD.train(...)
>>>>>>
>>>>>> then you can set a threshold by calling
>>>>>>
>>>>>> model.setThreshold(your threshold here)
>>>>>>
>>>>>> to set the threshold that separates positive predictions from negative
>>>>>> ones.
>>>>>>
>>>>>> For more info, please take a look at
>>>>>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.classification.SVMModel
>>>>>>
>>>>>> For your second question, SVMWithSGD only supports binary
>>>>>> classification.
>>>>>>
>>>>>> Hope this helps,
>>>>>>
>>>>>> Liquan
>>>>>>
>>>>>> On Sun, Sep 21, 2014 at 11:22 PM, Adamantios Corais <
>>>>>> adamantios.corais@gmail.com> wrote:
>>>>>>
>>>>>>> Nobody?
>>>>>>>
>>>>>>> If that's not supported already, can you please, at least, give me a
>>>>>>> few hints on how to implement it?
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Sep 19, 2014 at 7:43 PM, Adamantios Corais <
>>>>>>> adamantios.corais@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I am working with the SVMWithSGD classification algorithm on Spark.
>>>>>>>> It works fine for me; however, I would like to distinguish the instances
>>>>>>>> that are classified with high confidence from those with low confidence.
>>>>>>>> How do we define the threshold here? Ultimately, I want to keep only those
>>>>>>>> for which the algorithm is very *very* certain about its decision! How can
>>>>>>>> I do that? Is this feature already supported by any MLlib algorithm? What
>>>>>>>> if I had multiple categories?
>>>>>>>>
>>>>>>>> Any input is highly appreciated!
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Liquan Pei
>>>>>> Department of Physics
>>>>>> University of Massachusetts Amherst
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: return probability \ confidence instead of actual class

Posted by Adamantios Corais <ad...@gmail.com>.
Well, apparently, the above Python set-up is wrong. Please consider the
following set-up, which DOES use a 'linear' kernel... And the question
remains the same: how should I interpret the Spark results (or why are the
Spark results NOT bounded between -1 and 1)?

On Mon, Oct 6, 2014 at 8:35 PM, Sunny Khatri <su...@gmail.com> wrote:

> One difference I can find is that you may have different kernel functions
> for your training. In Spark, you end up using a linear kernel, whereas for
> scikit you are using the RBF kernel. That can explain the difference in
> the coefficients you are getting.
>
> On Mon, Oct 6, 2014 at 10:15 AM, Adamantios Corais <
> adamantios.corais@gmail.com> wrote:
>
>> Hi again,
>>
>> Finally, I found the time to play around with your suggestions.
>> Unfortunately, I noticed some unusual behavior in the MLlib results, which
>> is more obvious when I compare them against their scikit-learn equivalent.
>> Note that I am currently using Spark 0.9.2. Long story short: I find it
>> difficult to interpret the results: scikit-learn SVM always returns a value
>> between -1 and +1, which makes it easy for me to set up a threshold in order
>> to keep only the most significant classifications (this is the case for
>> both short and long input vectors). On the other hand, Spark MLlib makes it
>> impossible to interpret the results; they are hardly ever bounded between
>> -1 and +1, and hence it is impossible to choose a good cut-off value - the
>> results are of no practical use. And here is the strangest thing ever:
>> although MLlib does NOT seem to generate the right weights and intercept,
>> when I feed MLlib the weights and intercept from scikit-learn, the results
>> become pretty accurate! Any ideas about what is happening? Any suggestion is
>> highly appreciated.
>>
>> PS: to make things easier I have quoted both of my implementations as well
>> as the results, below.
>>
>> //////////////////////////////////////////////////
>>
>> SPARK (short input):
>> training_error: Double = 0.0
>> res2: Array[Double] = Array(-1.4420684459128205E-19,
>> -1.4420684459128205E-19, -1.4420684459128205E-19, 0.3749999999999999,
>> 0.7499999999999998, 0.7499999999999998, 0.7499999999999998)
>>
>> SPARK (long input):
>> training_error: Double = 0.0
>> res2: Array[Double] = Array(-0.782207630902241, -0.782207630902241,
>> -0.782207630902241, 0.9522394329769612, 2.6866864968561632,
>> 2.6866864968561632, 2.6866864968561632)
>>
>> PYTHON (short input):
>> array([[-1.00000001],
>>        [-1.00000001],
>>        [-1.00000001],
>>        [-0.        ],
>>        [ 1.00000001],
>>        [ 1.00000001],
>>        [ 1.00000001]])
>>
>> PYTHON (long input):
>> array([[-1.00000001],
>>        [-1.00000001],
>>        [-1.00000001],
>>        [-0.        ],
>>        [ 1.00000001],
>>        [ 1.00000001],
>>        [ 1.00000001]])
>>
>> //////////////////////////////////////////////////
>>
>> import analytics.MSC
>>
>> import java.util.Calendar
>> import java.text.SimpleDateFormat
>> import scala.collection.mutable
>> import scala.collection.JavaConversions._
>> import org.apache.spark.SparkContext._
>> import org.apache.spark.mllib.classification.SVMWithSGD
>> import org.apache.spark.mllib.regression.LabeledPoint
>> import org.apache.spark.mllib.optimization.L1Updater
>> import com.datastax.bdp.spark.connector.CassandraConnector
>> import com.datastax.bdp.spark.SparkContextCassandraFunctions._
>>
>> val sc = MSC.sc
>> val lg = MSC.logger
>>
>> //val s_users_double_2 = Seq(
>> //  (0.0,Seq(0.0, 0.0, 0.0)),
>> //  (0.0,Seq(0.0, 0.0, 0.0)),
>> //  (0.0,Seq(0.0, 0.0, 0.0)),
>> //  (1.0,Seq(1.0, 1.0, 1.0)),
>> //  (1.0,Seq(1.0, 1.0, 1.0)),
>> //  (1.0,Seq(1.0, 1.0, 1.0))
>> //)
>> val s_users_double_2 = Seq(
>>     (0.0,Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0)),
>>     (0.0,Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0)),
>>     (0.0,Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0)),
>>     (1.0,Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0)),
>>     (1.0,Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0)),
>>     (1.0,Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0))
>> )
>> val s_users_double = sc.parallelize(s_users_double_2)
>>
>> val s_users_parsed = s_users_double.map{line=>
>>   LabeledPoint(line._1, line._2.toArray)
>> }.cache()
>>
>> val iterations = 100
>>
>> val model = SVMWithSGD.train(s_users_parsed, iterations)
>>
>> val predictions1 = s_users_parsed.map{point=>
>>   (point.label, model.predict(point.features))
>> }.cache()
>>
>> val training_error = predictions1.filter(r=> r._1 !=
>> r._2).count().toDouble / s_users_parsed.count()
>>
>> val TP = predictions1.map(s=> if (s._1==1.0 && s._2==1.0) true else
>> false).filter(t=> t).count()
>> val FP = predictions1.map(s=> if (s._1==0.0 && s._2==1.0) true else
>> false).filter(t=> t).count()
>> val TN = predictions1.map(s=> if (s._1==0.0 && s._2==0.0) true else
>> false).filter(t=> t).count()
>> val FN = predictions1.map(s=> if (s._1==1.0 && s._2==0.0) true else
>> false).filter(t=> t).count()
>>
>> val weights = model.weights
>>
>> val intercept = model.intercept
>>
>> //val m_users_double_2 = Seq(
>> //  Seq(0.0, 0.0, 0.0),
>> //  Seq(0.0, 0.0, 0.0),
>> //  Seq(0.0, 0.0, 0.0),
>> //  Seq(0.5, 0.5, 0.5),
>> //  Seq(1.0, 1.0, 1.0),
>> //  Seq(1.0, 1.0, 1.0),
>> //  Seq(1.0, 1.0, 1.0)
>> //)
>> val m_users_double_2 = Seq(
>>     Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0),
>>     Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0),
>>     Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0),
>>       Seq(0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
>> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
>> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 2.0, 0.5, 0.5, 0.5),
>>     Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0),
>>     Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0),
>>     Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0)
>> )
>> val m_users_double = sc.parallelize(m_users_double_2)
>>
>> val predictions2 = m_users_double.map{point=>
>>   point.zip(weights).map(a=> a._1 * a._2).sum + intercept
>> }.cache()
>>
>> predictions2.collect()
>>
>> //////////////////////////////////////////////////
>>
>> from sklearn import svm
>>
>> flag = 'short' # 'long'
>>
>> if flag == 'long':
>>     X = [
>>         [0.0, 0.0, 0.0],
>>         [0.0, 0.0, 0.0],
>>         [0.0, 0.0, 0.0],
>>         [1.0, 1.0, 1.0],
>>         [1.0, 1.0, 1.0],
>>         [1.0, 1.0, 1.0]
>>     ]
>>     Y = [
>>         0.0,
>>         0.0,
>>         0.0,
>>         1.0,
>>         1.0,
>>         1.0
>>     ]
>>     T = [
>>         [0.0, 0.0, 0.0],
>>         [0.0, 0.0, 0.0],
>>         [0.0, 0.0, 0.0],
>>         [0.5, 0.5, 0.5],
>>         [1.0, 1.0, 1.0],
>>         [1.0, 1.0, 1.0],
>>         [1.0, 1.0, 1.0]
>>     ]
>>
>> if flag == 'long':
>>     X = [
>>         [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0],
>>         [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0],
>>         [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0],
>>         [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0],
>>         [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0],
>>         [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0]
>>     ]
>>     Y = [
>>         0.0,
>>         0.0,
>>         0.0,
>>         1.0,
>>         1.0,
>>         1.0
>>     ]
>>     T = [
>>         [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0],
>>         [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0],
>>         [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
>> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0],
>>         [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
>> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
>> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 2.0, 0.5, 0.5, 0.5],
>>         [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0],
>>         [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0],
>>         [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
>> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0]
>>     ]
>>
>> clf = svm.SVC()
>> clf.fit(X, Y)
>> svm.SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
>> gamma=0.0, kernel='rbf', max_iter=-1, probability=False, random_state=None,
>> shrinking=True, tol=0.001, verbose=False)
>> clf.decision_function(T)
>>
>> ///////////////////////////////////////////////////
>>
>>
>>
>>
>> On Thu, Sep 25, 2014 at 2:25 AM, Sunny Khatri <su...@gmail.com>
>> wrote:
>>
>>> For multi-class you can use the same SVMWithSGD (for binary
>>> classification) with a one-vs-all approach, constructing the respective
>>> training corpora with class i as the positive samples and the rest of the
>>> classes as the negative ones, and then use the same method provided by
>>> Aris as a measure of how far class i is from the decision boundary.
>>>
>>> On Wed, Sep 24, 2014 at 4:06 PM, Aris <ar...@gmail.com> wrote:
>>>
>>>> Greetings, Adamantios Corais... if that is indeed your name.
>>>>
>>>> Just to follow up on Liquan, you might be interested in removing the
>>>> threshold and then treating the predictions as a probability from 0..1
>>>> inclusive. SVM with the linear kernel is a straightforward linear
>>>> classifier -- so with model.clearThreshold() you can just get the raw
>>>> predicted scores, removing the threshold that simply translates them
>>>> into a positive/negative class.
>>>>
>>>> API is here
>>>> http://yhuai.github.io/site/api/scala/index.html#org.apache.spark.mllib.classification.SVMModel
>>>>
>>>> Enjoy!
>>>> Aris
>>>>
>>>> On Sun, Sep 21, 2014 at 11:50 PM, Liquan Pei <li...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Adamantios,
>>>>>
>>>>> For your first question: after you train the SVM, you get a model
>>>>> with a vector of weights w and an intercept b; points x such that
>>>>> w.dot(x) + b = 1 or w.dot(x) + b = -1 lie on the margin boundaries. The
>>>>> quantity w.dot(x) + b for a point x is a confidence measure of the
>>>>> classification.
>>>>>
>>>>> Code-wise, suppose you trained your model via
>>>>> val model = SVMWithSGD.train(...)
>>>>>
>>>>> then you can set a threshold by calling
>>>>>
>>>>> model.setThreshold(your threshold here)
>>>>>
>>>>> to set the threshold that separates positive predictions from negative
>>>>> ones.
>>>>>
>>>>> For more info, please take a look at
>>>>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.classification.SVMModel
>>>>>
>>>>> For your second question, SVMWithSGD only supports binary
>>>>> classification.
>>>>>
>>>>> Hope this helps,
>>>>>
>>>>> Liquan
>>>>>
>>>>> On Sun, Sep 21, 2014 at 11:22 PM, Adamantios Corais <
>>>>> adamantios.corais@gmail.com> wrote:
>>>>>
>>>>>> Nobody?
>>>>>>
>>>>>> If that's not supported already, can you please, at least, give me a
>>>>>> few hints on how to implement it?
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>>
>>>>>> On Fri, Sep 19, 2014 at 7:43 PM, Adamantios Corais <
>>>>>> adamantios.corais@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am working with the SVMWithSGD classification algorithm on Spark.
>>>>>>> It works fine for me; however, I would like to distinguish the instances
>>>>>>> that are classified with high confidence from those with low confidence.
>>>>>>> How do we define the threshold here? Ultimately, I want to keep only those
>>>>>>> for which the algorithm is very *very* certain about its decision! How can
>>>>>>> I do that? Is this feature already supported by any MLlib algorithm? What
>>>>>>> if I had multiple categories?
>>>>>>>
>>>>>>> Any input is highly appreciated!
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Liquan Pei
>>>>> Department of Physics
>>>>> University of Massachusetts Amherst
>>>>>
>>>>
>>>>
>>>
>>
>

Re: return probability \ confidence instead of actual class

Posted by Sunny Khatri <su...@gmail.com>.
One difference I can find is that you may have different kernel functions
for your training. In Spark, you end up using a linear kernel, whereas for
scikit you are using the RBF kernel. That can explain the difference in the
coefficients you are getting.

On Mon, Oct 6, 2014 at 10:15 AM, Adamantios Corais <
adamantios.corais@gmail.com> wrote:

> Hi again,
>
> Finally, I found the time to play around with your suggestions.
> Unfortunately, I noticed some unusual behavior in the MLlib results, which
> is more obvious when I compare them against their scikit-learn equivalent.
> Note that I am currently using Spark 0.9.2. Long story short: I find it
> difficult to interpret the results: scikit-learn SVM always returns a value
> between -1 and +1, which makes it easy for me to set up a threshold in order
> to keep only the most significant classifications (this is the case for
> both short and long input vectors). On the other hand, Spark MLlib makes it
> impossible to interpret the results; they are hardly ever bounded between
> -1 and +1, and hence it is impossible to choose a good cut-off value - the
> results are of no practical use. And here is the strangest thing ever:
> although MLlib does NOT seem to generate the right weights and intercept,
> when I feed MLlib the weights and intercept from scikit-learn, the results
> become pretty accurate! Any ideas about what is happening? Any suggestion is
> highly appreciated.
>
> PS: to make things easier I have quoted both of my implementations as well
> as the results, below.
>
> //////////////////////////////////////////////////
>
> SPARK (short input):
> training_error: Double = 0.0
> res2: Array[Double] = Array(-1.4420684459128205E-19,
> -1.4420684459128205E-19, -1.4420684459128205E-19, 0.3749999999999999,
> 0.7499999999999998, 0.7499999999999998, 0.7499999999999998)
>
> SPARK (long input):
> training_error: Double = 0.0
> res2: Array[Double] = Array(-0.782207630902241, -0.782207630902241,
> -0.782207630902241, 0.9522394329769612, 2.6866864968561632,
> 2.6866864968561632, 2.6866864968561632)
>
> PYTHON (short input):
> array([[-1.00000001],
>        [-1.00000001],
>        [-1.00000001],
>        [-0.        ],
>        [ 1.00000001],
>        [ 1.00000001],
>        [ 1.00000001]])
>
> PYTHON (long input):
> array([[-1.00000001],
>        [-1.00000001],
>        [-1.00000001],
>        [-0.        ],
>        [ 1.00000001],
>        [ 1.00000001],
>        [ 1.00000001]])
>
> //////////////////////////////////////////////////
>
> import analytics.MSC
>
> import java.util.Calendar
> import java.text.SimpleDateFormat
> import scala.collection.mutable
> import scala.collection.JavaConversions._
> import org.apache.spark.SparkContext._
> import org.apache.spark.mllib.classification.SVMWithSGD
> import org.apache.spark.mllib.regression.LabeledPoint
> import org.apache.spark.mllib.optimization.L1Updater
> import com.datastax.bdp.spark.connector.CassandraConnector
> import com.datastax.bdp.spark.SparkContextCassandraFunctions._
>
> val sc = MSC.sc
> val lg = MSC.logger
>
> //val s_users_double_2 = Seq(
> //  (0.0,Seq(0.0, 0.0, 0.0)),
> //  (0.0,Seq(0.0, 0.0, 0.0)),
> //  (0.0,Seq(0.0, 0.0, 0.0)),
> //  (1.0,Seq(1.0, 1.0, 1.0)),
> //  (1.0,Seq(1.0, 1.0, 1.0)),
> //  (1.0,Seq(1.0, 1.0, 1.0))
> //)
> val s_users_double_2 = Seq(
>     (0.0,Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0)),
>     (0.0,Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0)),
>     (0.0,Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0)),
>     (1.0,Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0)),
>     (1.0,Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0)),
>     (1.0,Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0))
> )
> val s_users_double = sc.parallelize(s_users_double_2)
>
> val s_users_parsed = s_users_double.map{line=>
>   LabeledPoint(line._1, line._2.toArray)
> }.cache()
>
> val iterations = 100
>
> val model = SVMWithSGD.train(s_users_parsed, iterations)
>
> val predictions1 = s_users_parsed.map{point=>
>   (point.label, model.predict(point.features))
> }.cache()
>
> val training_error = predictions1.filter(r=> r._1 !=
> r._2).count().toDouble / s_users_parsed.count()
>
> val TP = predictions1.map(s=> if (s._1==1.0 && s._2==1.0) true else
> false).filter(t=> t).count()
> val FP = predictions1.map(s=> if (s._1==0.0 && s._2==1.0) true else
> false).filter(t=> t).count()
> val TN = predictions1.map(s=> if (s._1==0.0 && s._2==0.0) true else
> false).filter(t=> t).count()
> val FN = predictions1.map(s=> if (s._1==1.0 && s._2==0.0) true else
> false).filter(t=> t).count()
>
> val weights = model.weights
>
> val intercept = model.intercept
>
> //val m_users_double_2 = Seq(
> //  Seq(0.0, 0.0, 0.0),
> //  Seq(0.0, 0.0, 0.0),
> //  Seq(0.0, 0.0, 0.0),
> //  Seq(0.5, 0.5, 0.5),
> //  Seq(1.0, 1.0, 1.0),
> //  Seq(1.0, 1.0, 1.0),
> //  Seq(1.0, 1.0, 1.0)
> //)
> val m_users_double_2 = Seq(
>     Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0),
>     Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0),
>     Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0),
>       Seq(0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 2.0, 0.5, 0.5, 0.5),
>     Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0),
>     Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0),
>     Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0)
> )
> val m_users_double = sc.parallelize(m_users_double_2)
>
> val predictions2 = m_users_double.map{point=>
>   point.zip(weights).map(a=> a._1 * a._2).sum + intercept
> }.cache()
>
> predictions2.collect()
>
> //////////////////////////////////////////////////
>
> from sklearn import svm
>
> flag = 'short' # 'long'
>
> if flag == 'long':
>     X = [
>         [0.0, 0.0, 0.0],
>         [0.0, 0.0, 0.0],
>         [0.0, 0.0, 0.0],
>         [1.0, 1.0, 1.0],
>         [1.0, 1.0, 1.0],
>         [1.0, 1.0, 1.0]
>     ]
>     Y = [
>         0.0,
>         0.0,
>         0.0,
>         1.0,
>         1.0,
>         1.0
>     ]
>     T = [
>         [0.0, 0.0, 0.0],
>         [0.0, 0.0, 0.0],
>         [0.0, 0.0, 0.0],
>         [0.5, 0.5, 0.5],
>         [1.0, 1.0, 1.0],
>         [1.0, 1.0, 1.0],
>         [1.0, 1.0, 1.0]
>     ]
>
> if flag == 'long':
>     X = [
>         [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0],
>         [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0],
>         [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0],
>         [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0],
>         [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0],
>         [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0]
>     ]
>     Y = [
>         0.0,
>         0.0,
>         0.0,
>         1.0,
>         1.0,
>         1.0
>     ]
>     T = [
>         [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0],
>         [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0],
>         [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0],
>         [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 2.0, 0.5, 0.5, 0.5],
>         [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0],
>         [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0],
>         [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0]
>     ]
>
> clf = svm.SVC()
> clf.fit(X, Y)
> svm.SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
> gamma=0.0, kernel='rbf', max_iter=-1, probability=False, random_state=None,
> shrinking=True, tol=0.001, verbose=False)
> clf.decision_function(T)
>
> ///////////////////////////////////////////////////
>
>
>
>
> On Thu, Sep 25, 2014 at 2:25 AM, Sunny Khatri <su...@gmail.com>
> wrote:
>
>> For multi-class you can use the same SVMWithSGD (for binary
>> classification) with a one-vs-all approach, constructing the respective
>> training corpora with class i as the positive samples and the rest of the
>> classes as the negative ones, and then use the same method provided by
>> Aris as a measure of how far class i is from the decision boundary.
>>
>> On Wed, Sep 24, 2014 at 4:06 PM, Aris <ar...@gmail.com> wrote:
>>
>>> Greetings, Adamantios Corais... if that is indeed your name.
>>>
>>> Just to follow up on Liquan, you might be interested in removing the
>>> threshold and then treating the predictions as a probability from 0..1
>>> inclusive. SVM with the linear kernel is a straightforward linear
>>> classifier -- so with model.clearThreshold() you can just get the raw
>>> predicted scores, removing the threshold that simply translates them
>>> into a positive/negative class.
>>>
>>> API is here
>>> http://yhuai.github.io/site/api/scala/index.html#org.apache.spark.mllib.classification.SVMModel
>>>
>>> Enjoy!
>>> Aris
>>>
>>> On Sun, Sep 21, 2014 at 11:50 PM, Liquan Pei <li...@gmail.com>
>>> wrote:
>>>
>>>> Hi Adamantios,
>>>>
>>>> For your first question: after you train the SVM, you get a model
>>>> with a vector of weights w and an intercept b; points x such that
>>>> w.dot(x) + b = 1 or w.dot(x) + b = -1 lie on the margin boundaries. The
>>>> quantity w.dot(x) + b for a point x is a confidence measure of the
>>>> classification.
>>>>
>>>> Code-wise, suppose you trained your model via
>>>> val model = SVMWithSGD.train(...)
>>>>
>>>> then you can set a threshold by calling
>>>>
>>>> model.setThreshold(your threshold here)
>>>>
>>>> to set the threshold that separates positive predictions from negative
>>>> ones.
>>>>
>>>> For more info, please take a look at
>>>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.classification.SVMModel
>>>>
>>>> For your second question, SVMWithSGD only supports binary
>>>> classification.
>>>>
>>>> Hope this helps,
>>>>
>>>> Liquan
>>>>
>>>> On Sun, Sep 21, 2014 at 11:22 PM, Adamantios Corais <
>>>> adamantios.corais@gmail.com> wrote:
>>>>
>>>>> Nobody?
>>>>>
>>>>> If that's not supported already, can you please, at least, give me a
>>>>> few hints on how to implement it?
>>>>>
>>>>> Thanks!
>>>>>
>>>>>
>>>>> On Fri, Sep 19, 2014 at 7:43 PM, Adamantios Corais <
>>>>> adamantios.corais@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am working with the SVMWithSGD classification algorithm on Spark.
>>>>>> It works fine for me; however, I would like to distinguish the instances
>>>>>> that are classified with high confidence from those with low confidence.
>>>>>> How do we define the threshold here? Ultimately, I want to keep only those
>>>>>> for which the algorithm is very *very* certain about its decision! How can
>>>>>> I do that? Is this feature already supported by any MLlib algorithm? What
>>>>>> if I had multiple categories?
>>>>>>
>>>>>> Any input is highly appreciated!
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Liquan Pei
>>>> Department of Physics
>>>> University of Massachusetts Amherst
>>>>
>>>
>>>
>>
>

Re: return probability \ confidence instead of actual class

Posted by Adamantios Corais <ad...@gmail.com>.
Hi again,

Finally, I found the time to play around with your suggestions.
Unfortunately, I noticed some unusual behavior in the MLlib results, which
is more obvious when I compare them against their scikit-learn equivalent.
Note that I am currently using Spark 0.9.2. Long story short: I find it
difficult to interpret the results: scikit-learn SVM always returns a value
between -1 and +1, which makes it easy for me to set up a threshold in order
to keep only the most significant classifications (this is the case for
both short and long input vectors). On the other hand, Spark MLlib makes it
impossible to interpret the results; they are hardly ever bounded between
-1 and +1, and hence it is impossible to choose a good cut-off value - the
results are of no practical use. And here is the strangest thing ever:
although MLlib does NOT seem to generate the right weights and intercept,
when I feed MLlib the weights and intercept from scikit-learn, the results
become pretty accurate! Any ideas about what is happening? Any suggestion
is highly appreciated.

PS: to make things easier I have quoted both of my implementations as well
as the results, below.

//////////////////////////////////////////////////

SPARK (short input):
training_error: Double = 0.0
res2: Array[Double] = Array(-1.4420684459128205E-19,
-1.4420684459128205E-19, -1.4420684459128205E-19, 0.3749999999999999,
0.7499999999999998, 0.7499999999999998, 0.7499999999999998)

SPARK (long input):
training_error: Double = 0.0
res2: Array[Double] = Array(-0.782207630902241, -0.782207630902241,
-0.782207630902241, 0.9522394329769612, 2.6866864968561632,
2.6866864968561632, 2.6866864968561632)

PYTHON (short input):
array([[-1.00000001],
       [-1.00000001],
       [-1.00000001],
       [-0.        ],
       [ 1.00000001],
       [ 1.00000001],
       [ 1.00000001]])

PYTHON (long input):
array([[-1.00000001],
       [-1.00000001],
       [-1.00000001],
       [-0.        ],
       [ 1.00000001],
       [ 1.00000001],
       [ 1.00000001]])

//////////////////////////////////////////////////

import analytics.MSC

import java.util.Calendar
import java.text.SimpleDateFormat
import scala.collection.mutable
import scala.collection.JavaConversions._
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.optimization.L1Updater
import com.datastax.bdp.spark.connector.CassandraConnector
import com.datastax.bdp.spark.SparkContextCassandraFunctions._

val sc = MSC.sc
val lg = MSC.logger

//val s_users_double_2 = Seq(
//  (0.0,Seq(0.0, 0.0, 0.0)),
//  (0.0,Seq(0.0, 0.0, 0.0)),
//  (0.0,Seq(0.0, 0.0, 0.0)),
//  (1.0,Seq(1.0, 1.0, 1.0)),
//  (1.0,Seq(1.0, 1.0, 1.0)),
//  (1.0,Seq(1.0, 1.0, 1.0))
//)
val s_users_double_2 = Seq(
    (0.0,Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0)),
    (0.0,Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0)),
    (0.0,Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0)),
    (1.0,Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0)),
    (1.0,Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0)),
    (1.0,Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0))
)
val s_users_double = sc.parallelize(s_users_double_2)

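// parse each (label, features) pair into an MLlib LabeledPoint (the 0.9.x
// API takes the features as an Array[Double], hence the toArray)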
val s_users_parsed = s_users_double.map{line=>
  LabeledPoint(line._1, line._2.toArray)
}.cache()

val iterations = 100

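// train a linear SVM via SGD (hinge loss), using the default step size
// and regularization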
val model = SVMWithSGD.train(s_users_parsed, iterations)

val predictions1 = s_users_parsed.map{point=>
  (point.label, model.predict(point.features))
}.cache()

val training_error = predictions1.filter(r => r._1 != r._2).count().toDouble /
  s_users_parsed.count()

val TP = predictions1.filter(s => s._1 == 1.0 && s._2 == 1.0).count()
val FP = predictions1.filter(s => s._1 == 0.0 && s._2 == 1.0).count()
val TN = predictions1.filter(s => s._1 == 0.0 && s._2 == 0.0).count()
val FN = predictions1.filter(s => s._1 == 1.0 && s._2 == 0.0).count()

val weights = model.weights

val intercept = model.intercept

//val m_users_double_2 = Seq(
//  Seq(0.0, 0.0, 0.0),
//  Seq(0.0, 0.0, 0.0),
//  Seq(0.0, 0.0, 0.0),
//  Seq(0.5, 0.5, 0.5),
//  Seq(1.0, 1.0, 1.0),
//  Seq(1.0, 1.0, 1.0),
//  Seq(1.0, 1.0, 1.0)
//)
val m_users_double_2 = Seq(
    Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0),
    Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0),
    Seq(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0),
      Seq(0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 2.0, 0.5, 0.5, 0.5),
    Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0),
    Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0),
    Seq(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0)
)
val m_users_double = sc.parallelize(m_users_double_2)

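// raw decision values w.dot(x) + b, computed by hand because predict() in
// 0.9.2 only returns the thresholded 0/1 class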
val predictions2 = m_users_double.map{point=>
  point.zip(weights).map(a=> a._1 * a._2).sum + intercept
}.cache()

predictions2.collect()
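
// an illustrative extra step, not part of the original program: a crude
// way to turn the unbounded margins into a (0, 1) score for thresholding.
// This is NOT a calibrated probability, only a monotone logistic rescaling.
def squash(margin: Double): Double = 1.0 / (1.0 + math.exp(-margin))
val scores = predictions2.map(squash)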

//////////////////////////////////////////////////

from sklearn import svm

flag = 'short'  # or 'long'

if flag == 'short':
    X = [
        [0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0],
        [1.0, 1.0, 1.0],
        [1.0, 1.0, 1.0],
        [1.0, 1.0, 1.0]
    ]
    Y = [
        0.0,
        0.0,
        0.0,
        1.0,
        1.0,
        1.0
    ]
    T = [
        [0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0],
        [0.5, 0.5, 0.5],
        [1.0, 1.0, 1.0],
        [1.0, 1.0, 1.0],
        [1.0, 1.0, 1.0]
    ]

if flag == 'long':
    X = [
        [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0],
        [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0],
        [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0],
        [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0]
    ]
    Y = [
        0.0,
        0.0,
        0.0,
        1.0,
        1.0,
        1.0
    ]
    T = [
        [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0],
        [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 2.0, 0.5, 0.5, 0.5],
        [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0],
        [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0],
        [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0]
    ]

clf = svm.SVC()
clf.fit(X, Y)
# fit() returns the estimator; in the interactive shell it echoes:
# SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
#     gamma=0.0, kernel='rbf', max_iter=-1, probability=False,
#     random_state=None, shrinking=True, tol=0.001, verbose=False)
clf.decision_function(T)

///////////////////////////////////////////////////





Re: return probability \ confidence instead of actual class

Posted by Sunny Khatri <su...@gmail.com>.
For multi-class you can use the same SVMWithSGD (a binary classifier) with
a one-vs-all approach: construct one training corpus per class i, with
class i as the positive samples and the rest of the classes as the
negative ones, and then use the same method provided by Aris as a measure
of how far class i is from the decision boundary.
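
A minimal sketch of that idea (assuming the Spark 1.x MLlib API, an
RDD[LabeledPoint] named data with labels 0.0 through numClasses - 1, and
illustrative names such as numClasses and predictClass):

import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint

val numClasses = 3
val numIterations = 100
val models = (0 until numClasses).map { i =>
  // relabel: class i becomes the positive class (1.0), all others negative
  val binary = data.map(p =>
    LabeledPoint(if (p.label == i.toDouble) 1.0 else 0.0, p.features))
  val m = SVMWithSGD.train(binary, numIterations)
  m.clearThreshold() // keep raw margins so per-class scores can be compared
  m
}

// pick the class whose one-vs-rest model returns the largest raw margin
def predictClass(x: Vector): Int =
  models.zipWithIndex.maxBy(_._1.predict(x))._2

(Raw margins from separately trained models are not perfectly comparable
without calibration, but this is the usual starting point.)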


Re: return probability \ confidence instead of actual class

Posted by Aris <ar...@gmail.com>.
Greetings Adamantios Corais... if that is indeed your name..

Just to follow up on Liquan: you might be interested in removing the
threshold and treating the raw predictions as confidence scores. SVM with
the linear kernel is a straightforward linear classifier, so with
model.clearThreshold() you can get the raw predicted scores; the threshold
is what translates them into a positive/negative class.

API is here
http://yhuai.github.io/site/api/scala/index.html#org.apache.spark.mllib.classification.SVMModel
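
A minimal sketch (Spark 1.x; training and test are illustrative
RDD[LabeledPoint] names, not from this thread):

// train, then drop the 0/1 threshold so predict() returns the raw margin
val model = SVMWithSGD.train(training, 100)
model.clearThreshold()
val rawScores = test.map(p => (p.label, model.predict(p.features)))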

Enjoy!
Aris


Re: return probability \ confidence instead of actual class

Posted by Liquan Pei <li...@gmail.com>.
HI Adamantios,

For your first question: after you train the SVM, you get a model with a
vector of weights w and an intercept b. Points x such that w.dot(x) + b = 1
or w.dot(x) + b = -1 lie on the margin boundaries, and the quantity
w.dot(x) + b for a point x is a confidence measure of its classification.

Code wise, suppose you trained your model via

val model = SVMWithSGD.train(...)

You can then set a threshold by calling

model.setThreshold(your threshold here)

to set the cut-off that separates positive predictions from negative
predictions.
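
The margin itself can be computed directly from the fitted model (a
sketch; point stands for any feature Vector, assuming the 1.x API):

// w.dot(x) + b for a single point, from the fitted parameters
val margin = model.weights.toArray.zip(point.toArray)
  .map { case (w, x) => w * x }.sum + model.intercept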

For more info, please take a look at
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.classification.SVMModel

For your second question, SVMWithSGD only supports binary classification.

Hope this helps,

Liquan



-- 
Liquan Pei
Department of Physics
University of Massachusetts Amherst

Re: return probability \ confidence instead of actual class

Posted by Adamantios Corais <ad...@gmail.com>.
Nobody?

If that's not supported already, can you please at least give me a few
hints on how to implement it?

Thanks!

