Posted to user@spark.apache.org by jatinpreet <ja...@gmail.com> on 2014/11/26 07:11:06 UTC

Accessing posterior probability of Naive Bayes prediction

Hi,

I am trying to access the posterior probability of a Naive Bayes prediction
with MLlib using Java. As the member variables brzPi and brzTheta are
private, I applied a hack to access the values through reflection.
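
For readers unfamiliar with the reflection approach mentioned above, here is
a minimal, self-contained sketch of the technique. ToyModel and its field
name are hypothetical stand-ins, not MLlib's NaiveBayesModel, and the names
of fields compiled from Scala may differ from the names in the Scala source,
so treat this only as an illustration.

```java
import java.lang.reflect.Field;

// Hypothetical stand-in for a model class that hides its parameters.
final class ToyModel {
    private final double[] logPriors = {-0.7, -0.6};
}

final class PrivateFieldReader {
    // Reads a private field by name. This is fragile: it breaks if the
    // field is renamed, and the JVM may refuse access in restricted setups.
    static Object readPrivateField(Object target, String fieldName) {
        try {
            Field field = target.getClass().getDeclaredField(fieldName);
            field.setAccessible(true);
            return field.get(target);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot read field " + fieldName, e);
        }
    }

    public static void main(String[] args) {
        double[] logPriors = (double[]) readPrivateField(new ToyModel(), "logPriors");
        System.out.println(logPriors.length + " log priors read");
    }
}
```

Because this depends on private implementation details, it can stop working
whenever the library is upgraded.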

I couldn't find a way to use the Breeze library from Java. If I am correct,
the relevant calculation is at line 66 of the NaiveBayesModel class:

labels(brzArgmax(brzPi + brzTheta * testData.toBreeze))

Here the element-wise addition and multiplication of DenseVectors are written
as operators, which are not directly accessible in Java. Also, it is not clear
to me how to use brzArgmax from Java.

Can anyone please help me convert the above calculation from Scala to Java?

PS: I have raised an improvement request on Jira to make these variables
directly accessible from outside.

Thanks,
Jatin



-----
Novice Big Data Programmer
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Accessing-posterior-probability-of-Naive-Baye-s-prediction-tp19828.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

Re: Accessing posterior probability of Naive Bayes prediction

Posted by jatinpreet <ja...@gmail.com>.

Thanks Sean, it did turn out to be a simple mistake after all. I appreciate
your help.

Jatin

On Thu, Nov 27, 2014 at 7:52 PM, sowen [via Apache Spark User List] wrote:

> No, the feature vector is not converted. It contains the count n_i of how
> often each term t_i occurs (or a TF-IDF transformation of those). You
> are finding the class c such that P(c) * P(t_1|c)^n_1 * ... is
> maximized.
>
> In log space it's log(P(c)) + n_1*log(P(t_1|c)) + ...
>
> So your n_1 counts (or TF-IDF values) are used as-is and this is where
> the dot product comes from.
>
> Your bug is probably something lower-level and simple. I'd debug the
> Spark example and print exactly its values for the log priors and
> conditional probabilities, and the matrix operations, and yours too,
> and see where the difference is.

-- 
Regards,
Jatinpreet Singh




Re: Accessing posterior probability of Naive Bayes prediction

Posted by Sean Owen <so...@cloudera.com>.

No, the feature vector is not converted. It contains the count n_i of how
often each term t_i occurs (or a TF-IDF transformation of those). You
are finding the class c such that P(c) * P(t_1|c)^n_1 * ... is
maximized.

In log space it's log(P(c)) + n_1*log(P(t_1|c)) + ...

So your n_1 counts (or TF-IDF values) are used as-is and this is where
the dot product comes from.
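
Since the original question asks for posterior probabilities rather than
only the winning class, the log-space scores above can be normalized. This
is a minimal sketch, assuming the per-class scores
log(P(c)) + n_1*log(P(t_1|c)) + ... are already available in a plain
double[]. The class and method names are made up for illustration and are
not part of MLlib. Subtracting the maximum before exponentiating is a common
way to avoid underflow when the scores are large negative numbers.

```java
final class PosteriorFromLogScores {
    // Converts unnormalized log scores into probabilities that sum to 1.
    static double[] posteriors(double[] logScores) {
        double max = Double.NEGATIVE_INFINITY;
        for (double s : logScores) {
            max = Math.max(max, s);
        }
        double[] result = new double[logScores.length];
        double sum = 0.0;
        for (int c = 0; c < logScores.length; c++) {
            // exp(score - max) is at most 1, so it cannot overflow.
            result[c] = Math.exp(logScores[c] - max);
            sum += result[c];
        }
        for (int c = 0; c < result.length; c++) {
            result[c] /= sum;
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy scores; exponentiating them directly would underflow to 0.
        double[] p = posteriors(new double[] {-1050.0, -1052.0});
        System.out.println(p[0] + " " + p[1]);
    }
}
```

The class with the largest log score also gets the largest posterior, so the
predicted label is unchanged by this normalization.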

Your bug is probably something lower-level and simple. I'd debug the
Spark example and print exactly its values for the log priors and
conditional probabilities, and the matrix operations, and yours too,
and see where the difference is.

On Thu, Nov 27, 2014 at 11:37 AM, jatinpreet <ja...@gmail.com> wrote:
> Hi,
>
> I have been running into some trouble while converting the code to Java.
> I have done the matrix operations as directed and tried to find the maximum
> score for each category, but the predicted category is mostly different from
> the prediction made by MLlib.
>
> I am fetching iterators over pi, theta and testData to do my calculations.
> pi and theta are in log space while my testData vector is not. Could that
> be a problem? I didn't see an explicit conversion in MLlib either.
>
> For example, for two categories and 5 features, I am doing the following
> operation:
>
> [1,2] + [1 2 3 4 5 ] * [1,2,3,4,5]
>         [6 7 8 9 10]
>
> These are simple element-wise matrix multiplication and addition operators.

Re: Accessing posterior probability of Naive Bayes prediction

Posted by jatinpreet <ja...@gmail.com>.

Hi,

I have been running into some trouble while converting the code to Java.
I have done the matrix operations as directed and tried to find the maximum
score for each category, but the predicted category is mostly different from
the prediction made by MLlib.

I am fetching iterators over pi, theta and testData to do my calculations.
pi and theta are in log space while my testData vector is not. Could that
be a problem? I didn't see an explicit conversion in MLlib either.

For example, for two categories and 5 features, I am doing the following
operation:

[1,2] + [1 2 3 4 5 ] * [1,2,3,4,5]
        [6 7 8 9 10]

These are simple element-wise matrix multiplication and addition operators.

The code is as follows:

    // piValue and thetaValue are assumed to hold the brzPi and brzTheta
    // values obtained through reflection.
    Iterator<Tuple2<Object, Object>> piIterator = piValue.iterator();
    Iterator<Tuple2<Tuple2<Object, Object>, Object>> thetaIterator =
            thetaValue.iterator();

    double[] scores = new double[piValue.size()];
    double maxScore = Double.NEGATIVE_INFINITY;
    int predictedCategory = -1;
    while (piIterator.hasNext()) {
        double score = 0.0;
        // Restart the feature iterator at index 0 for every category.
        Iterator<Tuple2<Object, Object>> testDataIterator =
                testData.toBreeze().iterator();

        // NOTE: theta is consumed sequentially, so this is only correct if
        // thetaIterator yields one full row (category) at a time. Checking
        // the (row, column) key in thetaTuple._1() guards against a
        // different iteration order.
        while (testDataIterator.hasNext()) {
            Tuple2<Object, Object> testTuple = testDataIterator.next();
            Tuple2<Tuple2<Object, Object>, Object> thetaTuple =
                    thetaIterator.next();
            score += (double) testTuple._2() * (double) thetaTuple._2();
        }

        Tuple2<Object, Object> piTuple = piIterator.next();
        score += (double) piTuple._2();
        scores[(int) piTuple._1()] = score;
        if (maxScore < score) {
            predictedCategory = (int) piTuple._1();
            maxScore = score;
        }
    }


Where am I going wrong?

Thanks,
Jatin



Re: Accessing posterior probability of Naive Bayes prediction

Posted by jatinpreet <ja...@gmail.com>.

Hi Sean,

The values brzPi and brzTheta are of the form
breeze.linalg.DenseVector<Double>. So would I have to convert them back to
simple vectors and use a library to perform the addition and multiplication?

If yes, can you please point me to the conversion logic and a vector
operation library for Java?

Thanks,
Jatin



Re: Accessing posterior probability of Naive Bayes prediction

Posted by Sean Owen <so...@cloudera.com>.

You can call Scala code from Java, even when it involves overloaded
operators, since they are also just methods with names like $plus and
$times. In this case, though, it's not quite feasible, since the Scala API is
complex and would end up forcing you to manually supply some other
implementation details to make it work.

That said, this code is very easy to reproduce in Java. It's a vector plus
matrix-times-vector. argmax tells you the index of the largest value in
that resulting vector.
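
A minimal sketch of that computation with plain Java arrays, assuming the
log priors and log conditional probabilities have already been copied into a
double[] and a double[][] (one row per class). The class and method names
are illustrative only and are not part of MLlib or Breeze.

```java
final class NaiveBayesScorer {
    // scores[c] = logPrior[c] + sum over j of logTheta[c][j] * features[j]
    static double[] scores(double[] logPrior, double[][] logTheta, double[] features) {
        double[] out = new double[logPrior.length];
        for (int c = 0; c < logPrior.length; c++) {
            double dot = 0.0;
            for (int j = 0; j < features.length; j++) {
                dot += logTheta[c][j] * features[j];
            }
            out[c] = logPrior[c] + dot;
        }
        return out;
    }

    // Index of the largest value; the first one wins on ties.
    static int argmax(double[] values) {
        int best = 0;
        for (int i = 1; i < values.length; i++) {
            if (values[i] > values[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy numbers only, taken from the example elsewhere in this thread.
        double[] pi = {1, 2};
        double[][] theta = {{1, 2, 3, 4, 5}, {6, 7, 8, 9, 10}};
        double[] x = {1, 2, 3, 4, 5};
        double[] s = scores(pi, theta, x);
        System.out.println(s[0] + ", " + s[1] + " -> class index " + argmax(s));
    }
}
```

The index returned by argmax is then looked up in the model's label array,
which is what labels(...) does in the Scala line.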

On Wed, Nov 26, 2014 at 6:11 AM, jatinpreet <ja...@gmail.com> wrote:
> Hi,
>
> I am trying to access the posterior probability of a Naive Bayes prediction
> with MLlib using Java. As the member variables brzPi and brzTheta are
> private, I applied a hack to access the values through reflection.
>
> I couldn't find a way to use the Breeze library from Java. If I am correct,
> the relevant calculation is at line 66 of the NaiveBayesModel class:
>
> labels(brzArgmax(brzPi + brzTheta * testData.toBreeze))
>
> Here the element-wise addition and multiplication of DenseVectors are written
> as operators, which are not directly accessible in Java. Also, it is not clear
> to me how to use brzArgmax from Java.
>
> Can anyone please help me convert the above calculation from Scala to Java?
>
> PS: I have raised an improvement request on Jira to make these variables
> directly accessible from outside.
>
> Thanks,
> Jatin