Posted to user@spark.apache.org by david <da...@free.fr> on 2014/11/24 16:13:45 UTC

Spark SQL (1.0)

Hi,

 I build two tables from files and join table F1 with table F2 on c5 = d4.

 F1 has 46730613 rows
 F2 has   3386740 rows

Every d4 key exists in F1.c5, so I expect to retrieve 46730613 rows, but the
query returns only 3437 rows.
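A general note that may or may not apply here: an inner join's row count is the sum, over all keys, of the product of that key's multiplicities on the two sides. "Every d4 exists in c5" only guarantees a count of 46730613 if each c5 value matches exactly one F2 row. A toy sketch of the arithmetic (plain Scala collections; the names are invented for illustration):

```scala
// Inner-join cardinality: sum over keys of multiplicity(F1) * multiplicity(F2).
val f1Keys = Seq("k1", "k1", "k2")   // stand-in for F1.c5
val f2Keys = Seq("k1", "k2", "k2")   // stand-in for F2.d4

// Multiplicity of each key on the F2 side.
val f2Counts = f2Keys.groupBy(identity).map { case (k, v) => (k, v.size) }

// Each F1 row contributes one output row per matching F2 row.
val joinRows = f1Keys.map(k => f2Counts.getOrElse(k, 0)).sum
println(joinRows)  // 4: every key is shared, yet the count equals neither side's size
```

So the join count can legitimately land anywhere from 0 to far above either table's size depending on key multiplicity; 3437 is still surprising, but the expectation of exactly 46730613 also assumes unique matches.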

// --- begin code ---

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD


val rddFile = sc.textFile("hdfs://referential/F1/part-*")
case class F1(c1: String, c2: String, c3: Double, c4: String, c5: String)
val stkrdd = rddFile.map(_.split("|")).map(f => F1(f(44), f(3), f(10).toDouble, "", f(2)))
stkrdd.registerAsTable("F1")
sqlContext.cacheTable("F1")


val prdfile = sc.textFile("hdfs://referential/F2/part-*")
case class F2(d1: String, d2:String, d3:String,d4:String)
val productrdd = prdfile.map(_.split("|")).map(f => F2(f(0), f(2), f(101), f(3)))
productrdd.registerAsTable("F2")
sqlContext.cacheTable("F2")

val resrdd = sqlContext.sql("SELECT count(*) FROM F1, F2 WHERE F1.c5 = F2.d4").collect()

// --- end of code ---
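One thing worth double-checking (a guess, not a confirmed diagnosis): on the JVM, `String.split` interprets its argument as a regular expression, and `|` is the regex alternation operator, so `split("|")` splits between every character rather than at the pipe delimiters. A minimal sketch of the difference:

```scala
// String.split takes a regex: "|" is alternation (it matches the empty string
// at every position), while "\\|" matches a literal pipe character.
val line = "a|b|c"

val unescaped = line.split("|")
println(unescaped.mkString(","))   // splits at every position: one element per character

val escaped = line.split("\\|")
println(escaped.mkString(","))     // a,b,c
```

If that is the issue, `f(44)` and the other indexes would be picking up single characters rather than whole fields, which could plausibly produce join keys that match in only a handful of rows. The fix would be `x.split("\\|")` (or `x.split('|')`, the `Char` overload, which is not a regex) in both `map` calls.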


Does anybody know what I missed?

Thanks

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-1-0-tp19651.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
