Posted to user@spark.apache.org by david <da...@free.fr> on 2014/11/25 09:08:51 UTC

Spark SQL Join returns fewer rows than expected

Hi,

I have 2 files which come from CSV exports of 2 Oracle tables.

 F1 has 46730613 rows
 F2 has   3386740 rows

I build 2 tables with Spark.

Table F1 is joined with table F2 on c1 = d1.


Every key in F2.d1 exists in F1.c1, so I expect to retrieve 46730613 rows, but
the query returns only 3437 rows.

// --- begin code ---

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD

val rddFile = sc.textFile("hdfs://referential/F1/part-*")
case class F1(c1: String, c2: String, c3: Double, c4: String, c5: String)
val stkrdd = rddFile.map(x => x.split("|")).map(f => F1(f(44), f(3), f(10).toDouble, "", f(2)))
stkrdd.registerAsTable("F1")
sqlContext.cacheTable("F1")

val prdfile = sc.textFile("hdfs://referential/F2/part-*")
case class F2(d1: String, d2: String, d3: String, d4: String)
val productrdd = prdfile.map(x => x.split("|")).map(f => F2(f(0), f(2), f(101), f(3)))
productrdd.registerAsTable("F2")
sqlContext.cacheTable("F2")

val resrdd = sqlContext.sql("Select count(*) from F1, F2 where F1.c1 = F2.d1").count()

// --- end of code ---
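
For reference, one quick way to sanity-check what the parse actually produced
before joining, assuming the same shell session as above:

// --- begin code ---

// Peek at the first parsed row on each side and count distinct join keys;
// if the parse is correct, c1 and d1 should look like real field values.
stkrdd.take(1).foreach(println)
productrdd.take(1).foreach(println)
stkrdd.map(_.c1).distinct.count
productrdd.map(_.d1).distinct.count

// --- end of code ---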


Does anybody know what I missed?

Thanks

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Join-returns-less-rows-that-expected-tp19731.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: Spark SQL Join returns fewer rows than expected

Posted by Yin Huai <hu...@gmail.com>.
I guess you want to use split("\\|") instead of split("|").
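
That is because String.split takes a regular expression rather than a literal
delimiter, and | is the regex alternation operator, which matches the empty
string at every position. Each line therefore gets split into single
characters, so f(44) is one character of the line instead of the 45th field,
and the join ends up comparing single characters rather than the intended
keys. A minimal sketch of the difference in the Scala REPL (the exact handling
of a leading zero-width match varies across JVM versions):

// --- begin code ---

val line = "a|b|c"

// "|" as a regex matches the empty string between characters,
// so every character becomes its own element:
line.split("|").toSeq     // e.g. Seq("a", "|", "b", "|", "c")

// Escaping the pipe splits on the literal character, as intended:
line.split("\\|").toSeq   // Seq("a", "b", "c")

// --- end of code ---

Applied to the original snippet, the fix would just be
rddFile.map(_.split("\\|")) and likewise prdfile.map(_.split("\\|")).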


Re: Spark SQL Join returns fewer rows than expected

Posted by Cheng Lian <li...@gmail.com>.
Which version are you using? Or if you are using the most recent master 
or branch-1.2, which commit are you using?



---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org