
Posted to user@spark.apache.org by Matt Hicks <ma...@outr.com> on 2018/01/15 18:21:33 UTC

[Spark ML] Positive-Only Training Classification in Scala

I'm attempting to create a training classification, but only have positive
information.  Specifically in this case it is a donor list of users, but I want
to use it as training in order to determine classification for new contacts to
give probabilities that they will donate.
Any insights or links are appreciated. I've gone through the documentation but
have been unable to find any references to how I might do this.
Thanks
---
Matt Hicks

Chief Technology Officer

405.283.6887 | http://outr.com

Re: [Spark ML] Positive-Only Training Classification in Scala

Posted by Matt Hicks <ma...@outr.com>.

If I try to use LogisticRegression with only positive training it always gives
me positive results:

Positive Only

  import org.apache.spark.ml.classification.LogisticRegression
  import org.apache.spark.ml.linalg.{Vector, Vectors}
  import org.apache.spark.sql.Row

  // Assumes an active SparkSession named `spark` is in scope.
  private def positiveOnly(): Unit = {
    // Every training row carries label 1.0; there are no negative examples.
    val training = spark.createDataFrame(Seq(
      (1.0, Vectors.dense(0.0, 1.1, 0.1)),
      (1.0, Vectors.dense(0.0, 1.0, -1.0)),
      (1.0, Vectors.dense(0.2, 1.3, 1.0)),
      (1.0, Vectors.dense(0.1, 1.2, -0.5))
    )).toDF("label", "features")

    val lr = new LogisticRegression()
    lr.setMaxIter(10).setRegParam(0.01)
    val model = lr.fit(training)

    // The test set contains one row labelled 0.0.
    val test = spark.createDataFrame(Seq(
      (1.0, Vectors.dense(-1.0, 1.5, 1.3)),
      (0.0, Vectors.dense(3.0, 2.0, -0.1)),
      (1.0, Vectors.dense(0.0, 2.2, -1.5))
    )).toDF("label", "features")

    model.transform(test)
      .select("features", "label", "probability", "prediction")
      .collect()
      .foreach { case Row(features: Vector, label: Double, prob: Vector, prediction: Double) =>
        println(s"($features, $label) -> prob=$prob, prediction=$prediction")
      }
  }
                


The results look like this:
[info] ([-1.0,1.5,1.3], 1.0) -> prob=[0.0,1.0], prediction=1.0
[info] ([3.0,2.0,-0.1], 0.0) -> prob=[0.0,1.0], prediction=1.0
[info] ([0.0,2.2,-1.5], 1.0) -> prob=[0.0,1.0], prediction=1.0
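
One possible explanation for the output above, offered as an assumption rather than a confirmed reading of Spark's internals: when every training label is 1.0, nothing in the data pushes the model toward class 0, so a logistic model can end up with zero feature weights and an unbounded positive intercept. The plain-Scala sketch below (no Spark; the object name and the chosen weights are hypothetical) shows that such a model returns probability 1.0 for class 1 no matter what the features are, which matches prob=[0.0,1.0] on all three test rows.

```scala
object ConstantLabelDemo {
  // Standard logistic function. math.exp(-Infinity) is 0.0, so sigmoid(+Infinity) is 1.0.
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // Probability of class 1 for a linear model: sigmoid(w . x + b).
  def probPositive(weights: Array[Double], intercept: Double, x: Array[Double]): Double = {
    val margin = weights.zip(x).map { case (w, xi) => w * xi }.sum + intercept
    sigmoid(margin)
  }

  // Assumption: fitting on all-positive labels yields zero weights
  // and an unbounded positive intercept.
  val weights: Array[Double] = Array(0.0, 0.0, 0.0)
  val intercept: Double = Double.PositiveInfinity

  // The three test feature vectors from the message above.
  val testPoints: Seq[Array[Double]] = Seq(
    Array(-1.0, 1.5, 1.3),
    Array(3.0, 2.0, -0.1),
    Array(0.0, 2.2, -1.5)
  )

  // Each entry is 1.0: the features never influence the result.
  val probs: Seq[Double] = testPoints.map(probPositive(weights, intercept, _))
}
```

If this reading is right, the classifier needs some form of negative or unlabeled examples before its probabilities can carry information.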





On Tue, Jan 16, 2018 8:51 AM, Matt Hicks matt@outr.com  wrote:
Hi Hari, I'm not sure I understand.  I apologize, I'm still pretty new to Spark
and Spark ML.  Can you point me to some example code or documentation that would
more fully represent this?
Thanks  





On Tue, Jan 16, 2018 2:54 AM, hosur narahari hnr1992@gmail.com  wrote:
You can make use of the probability vector from Spark classification. When you
run a Spark classification model for prediction, along with the predicted class
Spark also gives a probability vector (the probability that the row belongs to
each individual class). So just take the probability corresponding to the donor
class; it is the probability that a person will become a donor.
Best Regards,
Hari
On 15 Jan 2018 11:51 p.m., "Matt Hicks" <ma...@outr.com> wrote:
I'm attempting to create a training classification, but only have positive
information.  Specifically in this case it is a donor list of users, but I want
to use it as training in order to determine classification for new contacts to
give probabilities that they will donate.
Any insights or links are appreciated. I've gone through the documentation but
have been unable to find any references to how I might do this.
Thanks
---
Matt Hicks

Chief Technology Officer

405.283.6887 | http://outr.com

Re: [Spark ML] Positive-Only Training Classification in Scala

Posted by Matt Hicks <ma...@outr.com>.

Hi Hari, I'm not sure I understand.  I apologize, I'm still pretty new to
Spark and Spark ML.  Can you point me to some example code or documentation that
would more fully represent this?
Thanks  





On Tue, Jan 16, 2018 2:54 AM, hosur narahari hnr1992@gmail.com  wrote:
You can make use of the probability vector from Spark classification. When you
run a Spark classification model for prediction, along with the predicted class
Spark also gives a probability vector (the probability that the row belongs to
each individual class). So just take the probability corresponding to the donor
class; it is the probability that a person will become a donor.
Best Regards,
Hari
On 15 Jan 2018 11:51 p.m., "Matt Hicks" <ma...@outr.com> wrote:
I'm attempting to create a training classification, but only have positive
information.  Specifically in this case it is a donor list of users, but I want
to use it as training in order to determine classification for new contacts to
give probabilities that they will donate.
Any insights or links are appreciated. I've gone through the documentation but
have been unable to find any references to how I might do this.
Thanks
---
Matt Hicks

Chief Technology Officer

405.283.6887 | http://outr.com

Re: [Spark ML] Positive-Only Training Classification in Scala

Posted by hosur narahari <hn...@gmail.com>.

You can make use of the probability vector from Spark classification.
When you run a Spark classification model for prediction, along with the
predicted class Spark also gives a probability vector (the probability that
the row belongs to each individual class). So just take the probability
corresponding to the donor class; it is the probability that a person will
become a donor.

Best Regards,
Hari
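
A minimal sketch of the idea above, written in plain Scala so it runs without a cluster. It assumes the model was trained with both classes present and that donors were labelled 1.0, so index 1 of the probability vector is the donor probability. The Scored case class and the contact IDs are hypothetical stand-ins; in Spark the same lookup would be prob(1) on the Vector in the "probability" column.

```scala
object DonorRanking {
  // Hypothetical container for one scored contact. `probability` mirrors the
  // per-class probability vector, where the index is the class label.
  final case class Scored(contactId: String, probability: Array[Double])

  // Assumption: donors were labelled 1.0 during training.
  val donorClass: Int = 1

  // Pick the entry of the probability vector that belongs to the donor class.
  def donorProbability(s: Scored): Double = s.probability(donorClass)

  // Pair each contact with its donor probability, most likely donors first.
  def rankByDonorProbability(scored: Seq[Scored]): Seq[(String, Double)] =
    scored.map(s => s.contactId -> donorProbability(s)).sortBy { case (_, p) => -p }
}
```

For example, DonorRanking.rankByDonorProbability(Seq(DonorRanking.Scored("a", Array(0.7, 0.3)), DonorRanking.Scored("b", Array(0.1, 0.9)))) places contact "b" ahead of contact "a".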

On 15 Jan 2018 11:51 p.m., "Matt Hicks" <ma...@outr.com> wrote:

> I'm attempting to create a training classification, but only have positive
> information.  Specifically in this case it is a donor list of users, but I
> want to use it as training in order to determine classification for new
> contacts to give probabilities that they will donate.
>
> Any insights or links are appreciated. I've gone through the documentation
> but have been unable to find any references to how I might do this.
>
> Thanks
>
> ---*Matt Hicks*
>
> *Chief Technology Officer*
>
> 405.283.6887 | http://outr.com
>

Re: [Spark ML] Positive-Only Training Classification in Scala

Posted by Georg Heiler <ge...@gmail.com>.

I do not know that module, but in the literature the exact term you should
look for is PU learning (positive-unlabeled learning, PUL).
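
To make the PU (positive-unlabeled) learning idea concrete, here is a minimal sketch of one well-known recipe, the Elkan and Noto calibration. It is not taken from pu4spark and may differ from what that module does. The assumed setup: known donors are marked "labeled", all other contacts are "unlabeled", and any probabilistic classifier is trained to score g(x) = P(labeled | x). Under the assumption that labeled donors are a random sample of all donors, the mean score c over held-out known donors estimates P(labeled | donor), and g(x) / c estimates P(donor | x). All names below are hypothetical.

```scala
object PuCalibration {
  // c estimates P(labeled | truly positive): the mean classifier score g(x)
  // over held-out examples that are known positives.
  def estimateLabelFrequency(heldOutPositiveScores: Seq[Double]): Double = {
    require(heldOutPositiveScores.nonEmpty, "need at least one held-out positive score")
    heldOutPositiveScores.sum / heldOutPositiveScores.size
  }

  // Convert g(x) = P(labeled | x) into an estimate of P(positive | x) = g(x) / c,
  // capped at 1.0 because the ratio can overshoot on noisy scores.
  def positiveProbability(score: Double, c: Double): Double = {
    require(c > 0.0, "label frequency must be positive")
    math.min(1.0, score / c)
  }
}
```

With held-out donor scores of 0.4 and 0.6, c is 0.5, so a new contact scored 0.25 by the labeled-vs-unlabeled classifier gets an estimated donor probability of 0.5.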

Matt Hicks <ma...@outr.com> wrote on Mon, 15 Jan 2018 at 20:56:

> Is it fair to assume this is what I need?
> https://github.com/ispras/pu4spark
>
>
>
> On Mon, Jan 15, 2018 1:55 PM, Georg Heiler georg.kf.heiler@gmail.com
> wrote:
>
>> As far as I know, Spark does not implement such algorithms. If the
>> dataset is small,
>> http://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html
>> might be of interest to you.
>>
>> Jörn Franke <jo...@gmail.com> wrote on Mon, 15 Jan 2018 at 20:04:
>>
>> I think you are looking more for unsupervised learning algorithms, e.g.
>> clustering.
>>
>> Depending on the characteristics, different clusters might be created, e.g.
>> donor or non-donor. Most likely you will also find more clusters (e.g. would
>> donate but has a disease preventing it, or is too old). You can verify which
>> clusters make sense for your approach, so I recommend trying not only two
>> clusters but several and seeing which number is more statistically
>> significant.
>>
>> On 15. Jan 2018, at 19:21, Matt Hicks <ma...@outr.com> wrote:
>>
>> I'm attempting to create a training classification, but only have
>> positive information.  Specifically in this case it is a donor list of
>> users, but I want to use it as training in order to determine
>> classification for new contacts to give probabilities that they will donate.
>>
>> Any insights or links are appreciated. I've gone through the
>> documentation but have been unable to find any references to how I might do
>> this.
>>
>> Thanks
>>
>> ---*Matt Hicks*
>>
>> *Chief Technology Officer*
>>
>> 405.283.6887 | http://outr.com
>>

Re: [Spark ML] Positive-Only Training Classification in Scala

Posted by Matt Hicks <ma...@outr.com>.

Is it fair to assume this is what I need? https://github.com/ispras/pu4spark  





On Mon, Jan 15, 2018 1:55 PM, Georg Heiler georg.kf.heiler@gmail.com  wrote:
As far as I know, Spark does not implement such algorithms. If the dataset
is small,
http://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html
might be of interest to you.
Jörn Franke <jo...@gmail.com> wrote on Mon, 15 Jan 2018 at 20:04:
I think you are looking more for unsupervised learning algorithms, e.g.
clustering.
Depending on the characteristics, different clusters might be created, e.g.
donor or non-donor. Most likely you will also find more clusters (e.g. would
donate but has a disease preventing it, or is too old). You can verify which
clusters make sense for your approach, so I recommend trying not only two
clusters but several and seeing which number is more statistically significant.
On 15. Jan 2018, at 19:21, Matt Hicks <ma...@outr.com> wrote:

I'm attempting to create a training classification, but only have positive
information.  Specifically in this case it is a donor list of users, but I want
to use it as training in order to determine classification for new contacts to
give probabilities that they will donate.
Any insights or links are appreciated. I've gone through the documentation but
have been unable to find any references to how I might do this.
Thanks
---
Matt Hicks

Chief Technology Officer

405.283.6887 | http://outr.com



Re: [Spark ML] Positive-Only Training Classification in Scala

Posted by Georg Heiler <ge...@gmail.com>.

As far as I know, Spark does not implement such algorithms. If the
dataset is small,
http://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html
might be of interest to you.

Jörn Franke <jo...@gmail.com> wrote on Mon, 15 Jan 2018 at 20:04:

> I think you are looking more for unsupervised learning algorithms, e.g.
> clustering.
>
> Depending on the characteristics, different clusters might be created, e.g.
> donor or non-donor. Most likely you will also find more clusters (e.g. would
> donate but has a disease preventing it, or is too old). You can verify which
> clusters make sense for your approach, so I recommend trying not only two
> clusters but several and seeing which number is more statistically
> significant.
>
> On 15. Jan 2018, at 19:21, Matt Hicks <ma...@outr.com> wrote:
>
> I'm attempting to create a training classification, but only have positive
> information.  Specifically in this case it is a donor list of users, but I
> want to use it as training in order to determine classification for new
> contacts to give probabilities that they will donate.
>
> Any insights or links are appreciated. I've gone through the documentation
> but have been unable to find any references to how I might do this.
>
> Thanks
>
> ---*Matt Hicks*
>
> *Chief Technology Officer*
>
> 405.283.6887 | http://outr.com
>

Re: [Spark ML] Positive-Only Training Classification in Scala

Posted by Jörn Franke <jo...@gmail.com>.

I think you are looking more for unsupervised learning algorithms, e.g. clustering.

Depending on the characteristics, different clusters might be created, e.g. donor or non-donor. Most likely you will also find more clusters (e.g. would donate but has a disease preventing it, or is too old). You can verify which clusters make sense for your approach, so I recommend trying not only two clusters but several and seeing which number is more statistically significant.

> On 15. Jan 2018, at 19:21, Matt Hicks <ma...@outr.com> wrote:
> 
> 
> I'm attempting to create a training classification, but only have positive information.  Specifically in this case it is a donor list of users, but I want to use it as training in order to determine classification for new contacts to give probabilities that they will donate.
> 
> Any insights or links are appreciated. I've gone through the documentation but have been unable to find any references to how I might do this.
> 
> Thanks
> 
> ---
> Matt Hicks
> Chief Technology Officer
> 405.283.6887 | http://outr.com