Posted to user@spark.apache.org by Marius FETEANU <ma...@sien.com> on 2014/09/30 11:55:48 UTC

Fwd: Actual Probabilities when Using Naive Bayes classifier

I want to use the MLlib NaiveBayes classifier to predict user responses to
an offer.


I am interested in different types of responses (not just accept/reject),
and I also need the actual probabilities for each prediction (as each
label might come with a different benefit/cost not known at training time).

Does anybody else have experience doing this with Spark? Below is a detailed
explanation of what I have done/tried to do.

The question is how to get those probabilities out of the classifier on
each prediction. A built-in way would be best, but looking at the code it
does not seem possible. Instead I added these two methods to the model class:

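  // Assumes import breeze.linalg.{DenseVector => BDV} and the model's brzPi/brzTheta.
  // Returns (per-class log scores, normalized class probabilities).
  def predictProbs(testData: Vector): (BDV[Double], BDV[Double]) = {
    // Per-class log prior plus log likelihood; these are scores, not ratios.
    val logScores = brzPi + brzTheta * new BDV[Double](testData.toArray)
    // Shift by the largest score before exponentiating so exp() cannot underflow to all zeros.
    val maxScore = logScores.reduceLeft[Double](math.max(_, _))
    val relativeLikelihoods = logScores.map(x => math.exp(x - maxScore))
    val probMass = relativeLikelihoods.reduceLeft[Double](_ + _)
    (logScores, relativeLikelihoods.map(x => x / probMass))
  }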

  def predictProbs(testData: RDD[Vector]): RDD[(BDV[Double], BDV[Double])] = {
    val bcModel = testData.context.broadcast(this)
    testData.map{ item =>
      val model = bcModel.value
      model.predictProbs(item)
    }
  }

There are two big issues here:

- I have not tested this code (especially for performance), and it requires
me to either recompile Spark or duplicate the class
- I am not sure about the math (I used to use exp(llr)/(1+exp(llr)) to do
this conversion, but it does not seem to work here)
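
On the math point, a minimal sketch of the multi-class conversion, under some assumptions: pi holds the per-class log priors and theta holds the per-class log conditional probabilities (which is how the MLlib model appears to store them), and the object and method names below are hypothetical, not part of MLlib. With more than two classes, exp(llr)/(1+exp(llr)) generalizes to a softmax over the per-class log scores; the two-class formula only applies when llr is the difference between the two scores, not a score itself. The sketch works on plain arrays, so it should not need Breeze or a recompiled Spark.

```scala
object NaiveBayesProbs {
  // Hypothetical helper, not part of MLlib.
  // pi(c): log prior of class c.
  // theta(c)(j): log conditional probability of feature j given class c.
  def posteriors(pi: Array[Double],
                 theta: Array[Array[Double]],
                 features: Array[Double]): Array[Double] = {
    // Unnormalized log posterior per class: pi(c) + theta(c) . features
    val logScores = pi.indices.map { c =>
      pi(c) + theta(c).zip(features).map { case (t, x) => t * x }.sum
    }.toArray
    // Subtract the largest score so the largest exponent is exp(0) = 1,
    // which avoids every term underflowing to zero.
    val maxScore = logScores.max
    val expScores = logScores.map(s => math.exp(s - maxScore))
    val total = expScores.sum
    // Softmax: the results are non-negative and sum to 1.
    expScores.map(_ / total)
  }
}
```

For example, with two equally likely classes, theta rows of (log 0.8, log 0.2) and (log 0.2, log 0.8), and a feature vector of (1.0, 0.0), the unnormalized scores are 0.5*0.8 = 0.4 and 0.5*0.2 = 0.1, so the probabilities should come out as roughly 0.8 and 0.2.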