You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Chen Song (JIRA)" <ji...@apache.org> on 2015/04/27 08:46:38 UTC

[jira] [Commented] (SPARK-7158) collect and take return different results

    [ https://issues.apache.org/jira/browse/SPARK-7158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14513623#comment-14513623 ] 

Chen Song commented on SPARK-7158:
----------------------------------

Should we execute the query in cache method?

> collect and take return different results
> -----------------------------------------
>
>                 Key: SPARK-7158
>                 URL: https://issues.apache.org/jira/browse/SPARK-7158
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Reynold Xin
>            Priority: Blocker
>
> Reported by [~rams]
> {code}
> import java.util.UUID
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> val rdd = sc.parallelize(List(1,2,3), 2)
> val schema = StructType(List(StructField("index",IntegerType,true)))
> val df = sqlContext.createDataFrame(rdd.map(p => Row(p)), schema)
> def id:() => String = () => {UUID.randomUUID().toString()}
> def square:Int => Int = (x: Int) => {x * x}
> val dfWithId = df.withColumn("id",callUDF(id, StringType)).cache() //expect the ID to have materialized at this point
> dfWithId.collect()
> //res0: Array[org.apache.spark.sql.Row] = Array([1,43c7b8e2-b4a3-43ee-beff-0bb4b7d6c1b1], [2,efd061be-e8cc-43fa-956e-cfd6e7355982], [3,79b0baab-627c-4761-af0d-8995b8c5a125])
> val dfWithIdAndSquare = dfWithId.withColumn("square",callUDF(square, IntegerType, col("index")))
> dfWithIdAndSquare.collect()
> //res1: Array[org.apache.spark.sql.Row] = Array([1,a3b2e744-a0a1-40fe-8133-87a67660b4ab,1], [2,0a7052a0-6071-4ef5-a25a-2670248ea5cd,4], [3,209f269e-207a-4dfd-a186-738be5db2eff,9])
> //why are the IDs in lines 11 and 15 different?
> {code}
> The randomly generated IDs are the same if show (which uses take under the hood) is used instead of collect.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org