Posted to user@spark.apache.org by x <wa...@gmail.com> on 2014/07/03 05:23:22 UTC

One question about RDD.zip function when trying Naive Bayes

Hello,

I am a newbie to Spark MLlib and ran into a curious case when following the
instructions on the page below.

http://spark.apache.org/docs/latest/mllib-naive-bayes.html

I ran a test program on my local machine using some data.

import org.apache.spark.{SparkConf, SparkContext}

// Run locally with a single worker thread
val spConfig = new SparkConf().setMaster("local").setAppName("SparkNaiveBayes")
val sc = new SparkContext(spConfig)

The test data was as follows; there were three labeled categories I
wanted to predict.

 1  LabeledPoint(0.0, [4.9,3.0,1.4,0.2])
 2  LabeledPoint(0.0, [4.6,3.4,1.4,0.3])
 3  LabeledPoint(0.0, [5.7,4.4,1.5,0.4])
 4  LabeledPoint(0.0, [5.2,3.4,1.4,0.2])
 5  LabeledPoint(0.0, [4.7,3.2,1.6,0.2])
 6  LabeledPoint(0.0, [4.8,3.1,1.6,0.2])
 7  LabeledPoint(0.0, [5.1,3.8,1.9,0.4])
 8  LabeledPoint(0.0, [4.8,3.0,1.4,0.3])
 9  LabeledPoint(0.0, [5.0,3.3,1.4,0.2])
10  LabeledPoint(1.0, [6.6,2.9,4.6,1.3])
11  LabeledPoint(1.0, [5.2,2.7,3.9,1.4])
12  LabeledPoint(1.0, [5.6,2.5,3.9,1.1])
13  LabeledPoint(1.0, [6.4,2.9,4.3,1.3])
14  LabeledPoint(1.0, [6.6,3.0,4.4,1.4])
15  LabeledPoint(1.0, [6.0,2.7,5.1,1.6])
16  LabeledPoint(1.0, [5.5,2.6,4.4,1.2])
17  LabeledPoint(1.0, [5.8,2.6,4.0,1.2])
18  LabeledPoint(1.0, [5.7,2.9,4.2,1.3])
19  LabeledPoint(1.0, [5.7,2.8,4.1,1.3])
20  LabeledPoint(2.0, [6.3,2.9,5.6,1.8])
21  LabeledPoint(2.0, [6.5,3.0,5.8,2.2])
22  LabeledPoint(2.0, [6.5,3.0,5.5,1.8])
23  LabeledPoint(2.0, [6.7,3.3,5.7,2.1])
24  LabeledPoint(2.0, [7.4,2.8,6.1,1.9])
25  LabeledPoint(2.0, [6.3,3.4,5.6,2.4])
26  LabeledPoint(2.0, [6.0,3.0,4.8,1.8])
27  LabeledPoint(2.0, [6.8,3.2,5.9,2.3])

The predictions from NaiveBayes are below. Compared with the test data, only
two predictions (#11 and #15) were different.

 1  0.0
 2  0.0
 3  0.0
 4  0.0
 5  0.0
 6  0.0
 7  0.0
 8  0.0
 9  0.0
10  1.0
11  2.0
12  1.0
13  1.0
14  1.0
15  2.0
16  1.0
17  1.0
18  1.0
19  1.0
20  2.0
21  2.0
22  2.0
23  2.0
24  2.0
25  2.0
26  2.0
27  2.0

After pairing the test RDD with the predicted RDD via zip, I got this.

 1  (0.0,0.0)
 2  (0.0,0.0)
 3  (0.0,0.0)
 4  (0.0,0.0)
 5  (0.0,0.0)
 6  (0.0,0.0)
 7  (0.0,0.0)
 8  (0.0,0.0)
 9  (0.0,1.0)
10  (0.0,1.0)
11  (0.0,1.0)
12  (1.0,1.0)
13  (1.0,1.0)
14  (2.0,1.0)
15  (1.0,1.0)
16  (1.0,2.0)
17  (1.0,2.0)
18  (1.0,2.0)
19  (1.0,2.0)
20  (2.0,2.0)
21  (2.0,2.0)
22  (2.0,2.0)
23  (2.0,2.0)
24  (2.0,2.0)
25  (2.0,2.0)

I expected 27 pairs, but two results were lost.
Could someone please point out what I missed here?

Regards,
xj
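
For reference, a minimal sketch of one way to build (label, prediction) pairs
without zip: each test record is mapped to both values in a single pass, so
the number of pairs always equals the number of test records. The object name,
the three sample rows (copied from the test data above), the reuse of those
rows as training data, and the smoothing value 1.0 are all assumptions made
for illustration, not the poster's actual program.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Hypothetical example object; not part of the original post.
object PairWithoutZip {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("SparkNaiveBayes")
    val sc = new SparkContext(conf)

    // Three rows copied from the test data; used here for both training
    // and testing, which is an assumption made only to keep the sketch short.
    val data = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(4.9, 3.0, 1.4, 0.2)),
      LabeledPoint(1.0, Vectors.dense(6.6, 2.9, 4.6, 1.3)),
      LabeledPoint(2.0, Vectors.dense(6.3, 2.9, 5.6, 1.8))
    ))

    // Second argument is the additive smoothing parameter (arbitrary here).
    val model = NaiveBayes.train(data, 1.0)

    // Label and prediction come from the same record, so no zip is needed.
    val labelAndPrediction = data.map(p => (p.label, model.predict(p.features)))

    // A one-to-one map cannot drop records.
    assert(labelAndPrediction.count() == data.count())
    labelAndPrediction.collect().foreach(println)

    sc.stop()
  }
}
```

Because the pairing happens inside one map, it does not depend on the two RDDs
having matching partition layouts.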

Re: One question about RDD.zip function when trying Naive Bayes

Posted by x <wa...@gmail.com>.

Thanks for the confirmation.
I will check it.

Regards,
xj


On Thu, Jul 3, 2014 at 2:31 PM, Xiangrui Meng <me...@gmail.com> wrote:

> This is due to a bug in sampling, which was fixed in 1.0.1 and latest
> master. See https://github.com/apache/spark/pull/1234 . -Xiangrui

Re: One question about RDD.zip function when trying Naive Bayes

Posted by x <wa...@gmail.com>.

I tried my test case with Spark 1.0.1 and saw the same result (27 pairs
become 25 pairs after zip).

Could someone please check it?

Regards,
xj

On Thu, Jul 3, 2014 at 2:31 PM, Xiangrui Meng <me...@gmail.com> wrote:

> This is due to a bug in sampling, which was fixed in 1.0.1 and latest
> master. See https://github.com/apache/spark/pull/1234 . -Xiangrui

Re: One question about RDD.zip function when trying Naive Bayes

Posted by Xiangrui Meng <me...@gmail.com>.

This is due to a bug in sampling, which was fixed in 1.0.1 and latest
master. See https://github.com/apache/spark/pull/1234 . -Xiangrui
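
As background, RDD.zip assumes that both RDDs have the same number of
partitions and the same number of elements in each partition. A rough sketch
of a pre-check along those lines is below. The object name, the helper
partitionSizes, and the toy label values are hypothetical, and the identity
map only stands in for a real model.predict call; this is a diagnostic idea,
not the fix referenced in the pull request.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical diagnostic object; not part of the original thread.
object ZipPrecheck {
  // Number of elements in each partition, in partition order.
  def partitionSizes[T](rdd: RDD[T]): Array[Int] =
    rdd.mapPartitions(it => Iterator(it.size)).collect()

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("ZipPrecheck")
    val sc = new SparkContext(conf)

    // Toy labels split across two partitions.
    val labels = sc.parallelize(Seq(0.0, 0.0, 1.0, 1.0, 2.0, 2.0), 2)
    // Stand-in for predictions; a one-to-one map keeps the partition layout.
    val predictions = labels.map(x => x)

    val left = partitionSizes(labels)
    val right = partitionSizes(predictions)
    println(s"labels: ${left.mkString(",")}  predictions: ${right.mkString(",")}")

    // zip is only well defined when the per-partition sizes agree.
    assert(left.sameElements(right), "partition sizes differ; zip would misalign")

    val pairs = labels.zip(predictions)
    assert(pairs.count() == labels.count())

    sc.stop()
  }
}
```

If the two size arrays differ, the mismatch comes from how the RDDs were
produced rather than from zip itself.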
