Posted to user@spark.apache.org by buring <qy...@gmail.com> on 2014/12/17 01:55:56 UTC

"toArray","first" get the different result from one element RDD

Hi,
	Recently I ran into a problem with RDD behavior, specifically the
"RDD.first" and "RDD.toArray" methods when the RDD has only one element.
	I get different results from different methods on a one-element RDD,
where they should all return the same element. I give more detail after the code.
	My code is as follows:
    import org.apache.hadoop.io.{BytesWritable, LongWritable}

    // get an RDD with just one row: RDD[(Long, Array[Byte])]
    val alsresult = sc.sequenceFile(args(0) + "/als", classOf[LongWritable], classOf[BytesWritable])
      .map { case (uid, sessions) =>
        sessions.setCapacity(sessions.getLength)
        (uid.get(), sessions.getBytes)
      }
      .filter { line =>
        line._1 == userindex.value // specified from the arguments
      }

    // log output that really surprised me
    logger.info("alsInformation:%d".format(alsresult.count()))

    alsresult.toArray().foreach(e =>
      logger.info("alstoarray:%d\t%s".format(e._1, e._2.mkString(" "))))

    alsresult.take(1).foreach(e =>
      logger.info("take1result:%d\t%s".format(e._1, e._2.mkString(" "))))

    logger.info("firstInformation:%d\t%s".format(
      alsresult.first()._1, alsresult.first()._2.mkString(" ")))

    alsresult.collect().foreach(e =>
      logger.info("alscollectresult:%d\t%s".format(e._1, e._2.mkString(" "))))

    alsresult.take(3).foreach(e =>
      logger.info("alstake3result:%d\t%s".format(e._1, e._2.mkString(" ")))) // 3 is bigger than rdd.count()

    I get an RDD that has just one element, but the different methods return
different elements. My printed output is as follows:

argument userindex = 33057172
    alsInformation      1
    alstoarray          1612242   0 22 47 37 6 19...
    take1result         1612242   21 24 3 56 16 27...
    firstInformation    1612242   21 24 3 56 16 27...
    alscollectresult    1612242   0 22 47 37 6 19...
    alstake3result      1612242   0 22 47 37 6 19...

argument userindex = 28116855
    alsInformation      1
    alstoarray          3337442   16 32 0 22 13 49...
    take1result         3337442   16 52 31 42 29 36...
    firstInformation    3337442   16 52 31 42 29 36...
    alscollectresult    3337442   16 32 0 22 13 49...
    alstake3result      3337442   16 32 0 22 13 49...

argument userindex = 3814772
    alsInformation      1
    alstoarray          3697319   16 32 0 22 13 49...
    take1result         3697319   39 21 34 56 3 37...
    firstInformation    3697319   39 21 34 56 3 37...
    alscollectresult    3697319   16 32 0 22 13 49...
    alstake3result      3697319   16 32 0 22 13 49...

argument userindex = 3209314
    alsInformation      1
    alstoarray          3         0 22 47 37 6 19...
    take1result         3         34 10 18 28 38 11...
    firstInformation    3         34 10 18 28 38 11...
    alscollectresult    3         0 22 47 37 6 19...
    alstake3result      3         0 22 47 37 6 19...
    I filter the RDD and guarantee that RDD.count() equals 1. I thought
different "userindex.value" arguments should give different alsresult
contents, but "RDD.toArray", "RDD.collect" and "RDD.take(3)" all return the
same result, while under the same argument "toArray", "take(1)" and
"take(3)" return different results. This really surprised me. The arguments
were picked at random.
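
In case it helps, this is a minimal sketch of how the outputs could be
compared more easily by logging a short fingerprint of each byte array
instead of the raw values (the fingerprint helper is only for illustration,
it is not part of my real code):

    import java.util.zip.CRC32

    // Illustration only: a stable fingerprint (length + CRC32) of a byte array,
    // so the result of each action can be compared at a glance.
    def fingerprint(bytes: Array[Byte]): String = {
      val crc = new CRC32()
      crc.update(bytes)
      f"len=${bytes.length} crc=${crc.getValue}%08x"
    }

    logger.info("first   : " + fingerprint(alsresult.first()._2))
    alsresult.take(1).foreach(e => logger.info("take(1) : " + fingerprint(e._2)))
    alsresult.collect().foreach(e => logger.info("collect : " + fingerprint(e._2)))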

Can anyone explain this or point me to some reference?

Thanks 






Re: "toArray","first" get the different result from one element RDD

Posted by buring <qy...@gmail.com>.
I found the key point. The problem is in sc.sequenceFile: the API description
warns that the RDD "will create many references to the same object", because
Hadoop's RecordReader reuses one Writable instance for every record. So I
changed "sessions.getBytes" to "sessions.getBytes.clone", and it seems to
work.
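
For anyone who hits the same thing, here is a minimal sketch of the corrected
map, assuming the same sc, args(0) and userindex from my first message (only
the getBytes line changes):

    import org.apache.hadoop.io.{BytesWritable, LongWritable}

    // Hadoop's RecordReader reuses one Writable instance for every record,
    // so the backing byte array must be copied before it escapes the map.
    val alsresult = sc.sequenceFile(args(0) + "/als", classOf[LongWritable], classOf[BytesWritable])
      .map { case (uid, sessions) =>
        sessions.setCapacity(sessions.getLength)   // trim the buffer to the real data length
        (uid.get(), sessions.getBytes.clone())     // clone() returns an owned copy of the bytes
      }
      .filter(line => line._1 == userindex.value)

If your Hadoop version has it, BytesWritable.copyBytes() should also return a
correctly sized copy and would make the setCapacity call unnecessary, but I
have only tested the clone approach.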
Thanks.


