Posted to user@spark.apache.org by david <da...@free.fr> on 2014/11/25 09:08:51 UTC
Spark SQL Join returns fewer rows than expected
Hi,
I have two files that come from a CSV import of two Oracle tables:
F1 has 46730613 rows
F2 has 3386740 rows
I build two tables with Spark and join table F1 with table F2 on c1 = d1.
Every key F2.d1 exists in F1.c1, so I expect the join to return 46730613 rows,
but it returns only 3437 rows.
// --- begin code ---
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD

val rddFile = sc.textFile("hdfs://referential/F1/part-*")
case class F1(c1: String, c2: String, c3: Double, c4: String, c5: String)
val stkrdd = rddFile.map(x => x.split("|")).map(f =>
  F1(f(44), f(3), f(10).toDouble, "", f(2)))
stkrdd.registerAsTable("F1")
sqlContext.cacheTable("F1")

val prdfile = sc.textFile("hdfs://referential/F2/part-*")
case class F2(d1: String, d2: String, d3: String, d4: String)
val productrdd = prdfile.map(x => x.split("|")).map(f =>
  F2(f(0), f(2), f(101), f(3)))
productrdd.registerAsTable("F2")
sqlContext.cacheTable("F2")

val resrdd = sqlContext.sql(
  "SELECT count(*) FROM F1, F2 WHERE F1.c1 = F2.d1").count()
// --- end of code ---
Does anybody know what I missed?
Thanks
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Join-returns-less-rows-that-expected-tp19731.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org
Re: Spark SQL Join returns fewer rows than expected
Posted by Yin Huai <hu...@gmail.com>.
I guess you want to use split("\\|") instead of split("|"). String.split
interprets its argument as a regular expression, and an unescaped | is the
regex alternation operator, so split("|") does not split on the pipe character.
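Since Scala strings are java.lang.String, the difference is easy to see in a
minimal Java sketch (illustrative data, not taken from the thread): an
unescaped "|" is an alternation of two empty patterns, which matches the empty
string at every position and splits the row into single characters, delimiters
included, while the escaped form splits on the literal pipe.

```java
public class SplitDemo {
    public static void main(String[] args) {
        String row = "a|b|c";

        // Unescaped "|" matches the empty string at every position,
        // so the row is split into single characters, pipes included.
        String[] bad = row.split("|");
        System.out.println(bad.length);   // 5: "a", "|", "b", "|", "c"

        // Escaping the pipe makes it a literal delimiter.
        String[] good = row.split("\\|");
        System.out.println(good.length);  // 3: "a", "b", "c"
    }
}
```

With the unescaped form, f(44) in the original code picks up a single
character (often a pipe) rather than the 45th column, which explains why the
join keys almost never match.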
Re: Spark SQL Join returns fewer rows than expected
Posted by Cheng Lian <li...@gmail.com>.
Which version are you using? Or if you are using the most recent master
or branch-1.2, which commit are you using?