Posted to user@spark.apache.org by kpeng1 <kp...@gmail.com> on 2016/05/02 18:58:23 UTC

Weird results with Spark SQL Outer joins

Hi All,

I am running into weird results with Spark SQL outer joins.  The counts
for all of them seem to be the same, which does not make sense given the
data.  Here are the queries that I am running, with the results:

sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account AS
d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s FULL OUTER JOIN
dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND s.ad =
d.ad) WHERE s.date >= '2016-01-03'    AND d.date >= '2016-01-03'").count()
RESULT:23747


sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account AS
d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s LEFT OUTER JOIN
dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND s.ad =
d.ad) WHERE s.date >= '2016-01-03'    AND d.date >= '2016-01-03'").count()
RESULT:23747

sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account AS
d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s RIGHT OUTER JOIN
dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND s.ad =
d.ad) WHERE s.date >= '2016-01-03'    AND d.date >= '2016-01-03'").count()
RESULT: 23747

I was wondering if someone has encountered this issue before.



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Weird-results-with-Spark-SQL-Outer-joins-tp26861.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Weird results with Spark SQL Outer joins

Posted by Gourav Sengupta <go...@gmail.com>.
Hi,

It's 4:40 in the morning here, so I might not be getting things right.

But there is a very high chance of getting spurious results if you have
created that variable more than once in an IPython or pyspark shell,
cached it, and are reusing it. Please close the sessions, create the
variable only once, then register the table (after ensuring that you have
dropped it first), and check again.

I faced this issue once and then realized it was due to the reasons above.


Regards,
Gourav

On Mon, May 2, 2016 at 10:13 PM, Kevin Peng <kp...@gmail.com> wrote:

> Yong,
>
> Sorry, let me explain my deduction; it is going to be difficult to get a
> sample data out since the dataset I am using is proprietary.
>
> From the above set queries (ones mentioned in above comments) both inner
> and outer join are producing the same counts.  They are basically pulling
> out selected columns from the query, but there is no roll up happening or
> anything that would possibly make it suspicious that there is any
> difference besides the type of joins.  The tables are matched based on
> three keys that are present in both tables (ad, account, and date), on top
> of this they are filtered by date being above 2016-01-03.  Since all the
> joins are producing the same counts, the natural suspicion is that the
> tables are identical, but when I run the following two queries:
>
> scala> sqlContext.sql("select * from swig_pin_promo_lt where date
> >='2016-01-03'").count
>
> res14: Long = 34158
>
> scala> sqlContext.sql("select * from dps_pin_promo_lt where date
> >='2016-01-03'").count
>
> res15: Long = 42693
>
>
> The above two queries filter the data on the same date cutoff used by the
> joins (2016-01-03), and you can see the row counts between the two tables are
> different, which is why I suspect something is wrong with the outer
> joins in Spark SQL: in this situation the right and full outer joins may
> produce the same results, but they should not be equal to the left join and
> definitely not the inner join; unless I am missing something.
>
>
> Side note: in my hasty response above I posted the wrong count for
> dps.count; the real value is res16: Long = 42694
>
>
> Thanks,
>
>
> KP
>
>
>
> On Mon, May 2, 2016 at 12:50 PM, Yong Zhang <ja...@hotmail.com> wrote:
>
>> We are still not sure what the problem is, if you cannot show us
>> some example data.
>>
>> For dps with 42632 rows and swig with 42034 rows: dps full outer joined
>> with swig on 3 columns, with additional filters, gets the same result-set
>> row count as dps left outer joined with swig on 3 columns with additional
>> filters, which again gets the same result-set row count as dps right outer
>> joined with swig on 3 columns with the same additional filters.
>>
>> Without knowing your data, I cannot see why that has to be a bug
>> in Spark.
>>
>> Am I misunderstanding your bug?
>>
>> Yong
>>
>> ------------------------------
>> From: kpeng1@gmail.com
>> Date: Mon, 2 May 2016 12:11:18 -0700
>> Subject: Re: Weird results with Spark SQL Outer joins
>> To: gourav.sengupta@gmail.com
>> CC: user@spark.apache.org
>>
>>
>> Gourav,
>>
>> I wish that were the case, but I have done a select count on each of the two
>> tables individually and they return different numbers of rows:
>>
>>
>> dps.registerTempTable("dps_pin_promo_lt")
>> swig.registerTempTable("swig_pin_promo_lt")
>>
>>
>> dps.count()
>> RESULT: 42632
>>
>>
>> swig.count()
>> RESULT: 42034
>>
>> On Mon, May 2, 2016 at 11:55 AM, Gourav Sengupta <
>> gourav.sengupta@gmail.com> wrote:
>>
>> This shows that both tables have matching records and no mismatches.
>> Therefore you obviously get the same results irrespective of whether you
>> use a right or a left join.
>>
>> I think that there is no problem here, unless I am missing something.
>>
>> Regards,
>> Gourav
>>
>> On Mon, May 2, 2016 at 7:48 PM, kpeng1 <kp...@gmail.com> wrote:
>>
>> Also, the inner join produced the same result:
>> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
>> AS
>> d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
>> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN
>> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND
>> s.ad =
>> d.ad) WHERE s.date >= '2016-01-03'    AND d.date >=
>> '2016-01-03'").count()
>> RESULT:23747
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Weird-results-with-Spark-SQL-Outer-joins-tp26861p26863.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>>
>>
>>
>>
>

Re: Weird results with Spark SQL Outer joins

Posted by Kevin Peng <kp...@gmail.com>.
Davies,

What exactly do you mean with regard to Spark 2.0 turning these joins into an
inner join?  Does this mean that Spark SQL won't be supporting WHERE
clauses in outer joins?


Cesar & Gourav,

When running the queries without the WHERE clause, they work as expected.  I
am pasting my results below:
val dps =
sqlContext.read.format("com.databricks.spark.csv").option("header",
"true").load("file:///home/ltu/dps_csv/")
val swig =
sqlContext.read.format("com.databricks.spark.csv").option("header",
"true").load("file:///home/ltu/swig_csv/")

dps.count
res0: Long = 42694

swig.count
res1: Long = 42034


dps.registerTempTable("dps_pin_promo_lt")
swig.registerTempTable("swig_pin_promo_lt")


sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
AS d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s FULL OUTER JOIN
dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND s.ad
= d.ad)").count()
res5: Long = 60919


sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
AS d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s LEFT OUTER JOIN
dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND s.ad
= d.ad)").count()
res6: Long = 42034


sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
AS d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s RIGHT OUTER JOIN
dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND s.ad
= d.ad)").count()
res7: Long = 42694

Thanks,

KP
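
The unfiltered counts above are mutually consistent. If key matches are one-to-one (which the LEFT OUTER count equaling swig.count and the RIGHT OUTER count equaling dps.count suggests), the implied number of matched key pairs follows by simple arithmetic; a quick sketch using the counts from this message:

```python
# Counts copied from the thread; the identity below assumes every key
# matches at most one row on the other side (one-to-one matches).
left_rows  = 42034   # swig.count, equal to the LEFT OUTER JOIN count (res6)
right_rows = 42694   # dps.count, equal to the RIGHT OUTER JOIN count (res7)
full_outer = 60919   # FULL OUTER JOIN count (res5)

# full outer = left + right - matched, so:
matched = left_rows + right_rows - full_outer
assert matched == 23809
```

With duplicate join keys the arithmetic would differ, but here it lines up exactly.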


On Tue, May 3, 2016 at 10:42 AM, Davies Liu <da...@databricks.com> wrote:

> Bingo, the two predicates s.date >= '2016-01-03' AND d.date >=
> '2016-01-03' are the root cause,
>  which filter out all the nulls produced by the outer join, giving the
> same result as an inner join.
>
> In Spark 2.0, we actually turn these joins into an inner join.
>
> On Tue, May 3, 2016 at 9:50 AM, Cesar Flores <ce...@gmail.com> wrote:
> > Hi
> >
> > Have you tried the joins without the WHERE clause? When you use it, you
> > are filtering out all the rows with null columns in those fields. In other
> > words, you are doing an inner join in all your queries.
> >
> > On Tue, May 3, 2016 at 11:37 AM, Gourav Sengupta <
> gourav.sengupta@gmail.com>
> > wrote:
> >>
> >> Hi Kevin,
> >>
> >> Having given it a first look I do think that you have hit something here
> >> and this does not look quite fine. I have to work on the multiple AND
> >> conditions in ON and see whether that is causing any issues.
> >>
> >> Regards,
> >> Gourav Sengupta
> >>
> >> On Tue, May 3, 2016 at 8:28 AM, Kevin Peng <kp...@gmail.com> wrote:
> >>>
> >>> Davies,
> >>>
> >>> Here is the code that I am typing into the spark-shell along with the
> >>> results (my question is at the bottom):
> >>>
> >>> val dps =
> >>> sqlContext.read.format("com.databricks.spark.csv").option("header",
> >>> "true").load("file:///home/ltu/dps_csv/")
> >>> val swig =
> >>> sqlContext.read.format("com.databricks.spark.csv").option("header",
> >>> "true").load("file:///home/ltu/swig_csv/")
> >>>
> >>> dps.count
> >>> res0: Long = 42694
> >>>
> >>> swig.count
> >>> res1: Long = 42034
> >>>
> >>>
> >>> dps.registerTempTable("dps_pin_promo_lt")
> >>> swig.registerTempTable("swig_pin_promo_lt")
> >>>
> >>> sqlContext.sql("select * from dps_pin_promo_lt where date >
> >>> '2016-01-03'").count
> >>> res4: Long = 42666
> >>>
> >>> sqlContext.sql("select * from swig_pin_promo_lt where date >
> >>> '2016-01-03'").count
> >>> res5: Long = 34131
> >>>
> >>> sqlContext.sql("select distinct date, account, ad from dps_pin_promo_lt
> >>> where date > '2016-01-03'").count
> >>> res6: Long = 42533
> >>>
> >>> sqlContext.sql("select distinct date, account, ad from
> swig_pin_promo_lt
> >>> where date > '2016-01-03'").count
> >>> res7: Long = 34131
> >>>
> >>>
> >>> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  ,
> d.account
> >>> AS d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
> >>> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN
> >>> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND
> s.ad =
> >>> d.ad) WHERE s.date >= '2016-01-03' AND d.date >=
> '2016-01-03'").count()
> >>> res9: Long = 23809
> >>>
> >>>
> >>> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  ,
> d.account
> >>> AS d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
> >>> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s FULL OUTER JOIN
> >>> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND
> s.ad =
> >>> d.ad) WHERE s.date >= '2016-01-03' AND d.date >=
> '2016-01-03'").count()
> >>> res10: Long = 23809
> >>>
> >>>
> >>> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  ,
> d.account
> >>> AS d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
> >>> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s LEFT OUTER JOIN
> >>> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND
> s.ad =
> >>> d.ad) WHERE s.date >= '2016-01-03' AND d.date >=
> '2016-01-03'").count()
> >>> res11: Long = 23809
> >>>
> >>>
> >>> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  ,
> d.account
> >>> AS d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
> >>> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s RIGHT OUTER JOIN
> >>> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND
> s.ad =
> >>> d.ad) WHERE s.date >= '2016-01-03' AND d.date >=
> '2016-01-03'").count()
> >>> res12: Long = 23809
> >>>
> >>>
> >>>
> >>> From my results above, note that the counts of distinct values based
> >>> on the join criteria and filter criteria for each individual table are
> >>> located at res6 and res7.  My question is why the outer joins produce
> >>> fewer rows than the smallest table; if there are no matches, the join
> >>> should still bring in each unmatched row.  For the full and right outer
> >>> joins I am expecting to see a minimum of res6 rows, but I get fewer; is
> >>> there something specific that I am missing here?  I am expecting that the
> >>> full outer join would give me the union of the two table sets, so I am
> >>> expecting at least 42533 rows, not 23809.
> >>>
> >>>
> >>> Gourav,
> >>>
> >>> I just ran this result set on a new session with slightly newer data...
> >>> still seeing those results.
> >>>
> >>>
> >>>
> >>> Thanks,
> >>>
> >>> KP
> >>>
> >>>
> >>> On Mon, May 2, 2016 at 11:16 PM, Davies Liu <da...@databricks.com>
> >>> wrote:
> >>>>
> >>>> As @Gourav said, all the joins with different join types show the same
> >>>> results, which means that every row from the left matched at least one
> >>>> row from the right, and every row from the right matched at least one
> >>>> row from the left, even though the number of rows on the left does not
> >>>> equal the number on the right.
> >>>>
> >>>> This is a correct result.
> >>>>
> >>>> On Mon, May 2, 2016 at 2:13 PM, Kevin Peng <kp...@gmail.com> wrote:
> >>>> > Yong,
> >>>> >
> >>>> > Sorry, let explain my deduction; it is going be difficult to get a
> >>>> > sample
> >>>> > data out since the dataset I am using is proprietary.
> >>>> >
> >>>> > From the above set queries (ones mentioned in above comments) both
> >>>> > inner and
> >>>> > outer join are producing the same counts.  They are basically
> pulling
> >>>> > out
> >>>> > selected columns from the query, but there is no roll up happening
> or
> >>>> > anything that would possible make it suspicious that there is any
> >>>> > difference
> >>>> > besides the type of joins.  The tables are matched based on three
> keys
> >>>> > that
> >>>> > are present in both tables (ad, account, and date), on top of this
> >>>> > they are
> >>>> > filtered by date being above 2016-01-03.  Since all the joins are
> >>>> > producing
> >>>> > the same counts, the natural suspicions is that the tables are
> >>>> > identical,
> >>>> > but I when I run the following two queries:
> >>>> >
> >>>> > scala> sqlContext.sql("select * from swig_pin_promo_lt where date
> >>>> >>='2016-01-03'").count
> >>>> >
> >>>> > res14: Long = 34158
> >>>> >
> >>>> > scala> sqlContext.sql("select * from dps_pin_promo_lt where date
> >>>> >>='2016-01-03'").count
> >>>> >
> >>>> > res15: Long = 42693
> >>>> >
> >>>> >
> >>>> > The above two queries filter out the data based on date used by the
> >>>> > joins of
> >>>> > 2016-01-03 and you can see the row count between the two tables are
> >>>> > different, which is why I am suspecting something is wrong with the
> >>>> > outer
> >>>> > joins in spark sql, because in this situation the right and outer
> >>>> > joins may
> >>>> > produce the same results, but it should not be equal to the left
> join
> >>>> > and
> >>>> > definitely not the inner join; unless I am missing something.
> >>>> >
> >>>> >
> >>>> > Side note: In my haste response above I posted the wrong counts for
> >>>> > dps.count, the real value is res16: Long = 42694
> >>>> >
> >>>> >
> >>>> > Thanks,
> >>>> >
> >>>> >
> >>>> > KP
> >>>> >
> >>>> >
> >>>> >
> >>>> >
> >>>> > On Mon, May 2, 2016 at 12:50 PM, Yong Zhang <ja...@hotmail.com>
> >>>> > wrote:
> >>>> >>
> >>>> >> We are still not sure what is the problem, if you cannot show us
> with
> >>>> >> some
> >>>> >> example data.
> >>>> >>
> >>>> >> For dps with 42632 rows, and swig with 42034 rows, if dps full
> outer
> >>>> >> join
> >>>> >> with swig on 3 columns; with additional filters, get the same
> >>>> >> resultSet row
> >>>> >> count as dps left outer join with swig on 3 columns, with
> >>>> >> additional
> >>>> >> filters, again get the same resultSet row count as dps right
> >>>> >> outer join
> >>>> >> with swig on 3 columns, with same additional filters.
> >>>> >>
> >>>> >> Without knowing your data, I cannot see the reason that has to be a
> >>>> >> bug in
> >>>> >> the spark.
> >>>> >>
> >>>> >> Am I misunderstanding your bug?
> >>>> >>
> >>>> >> Yong
> >>>> >>
> >>>> >> ________________________________
> >>>> >> From: kpeng1@gmail.com
> >>>> >> Date: Mon, 2 May 2016 12:11:18 -0700
> >>>> >> Subject: Re: Weird results with Spark SQL Outer joins
> >>>> >> To: gourav.sengupta@gmail.com
> >>>> >> CC: user@spark.apache.org
> >>>> >>
> >>>> >>
> >>>> >> Gourav,
> >>>> >>
> >>>> >> I wish that was the case, but I have done a select count on each of the
> >>>> >> two
> >>>> >> tables individually and they return back different number of rows:
> >>>> >>
> >>>> >>
> >>>> >> dps.registerTempTable("dps_pin_promo_lt")
> >>>> >> swig.registerTempTable("swig_pin_promo_lt")
> >>>> >>
> >>>> >>
> >>>> >> dps.count()
> >>>> >> RESULT: 42632
> >>>> >>
> >>>> >>
> >>>> >> swig.count()
> >>>> >> RESULT: 42034
> >>>> >>
> >>>> >> On Mon, May 2, 2016 at 11:55 AM, Gourav Sengupta
> >>>> >> <go...@gmail.com> wrote:
> >>>> >>
> >>>> >> This shows that both the tables have matching records and no
> >>>> >> mismatches.
> >>>> >> Therefore obviously you have the same results irrespective of
> whether
> >>>> >> you
> >>>> >> use right or left join.
> >>>> >>
> >>>> >> I think that there is no problem here, unless I am missing
> something.
> >>>> >>
> >>>> >> Regards,
> >>>> >> Gourav
> >>>> >>
> >>>> >> On Mon, May 2, 2016 at 7:48 PM, kpeng1 <kp...@gmail.com> wrote:
> >>>> >>
> >>>> >> Also, the results of the inner query produced the same results:
> >>>> >> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  ,
> >>>> >> d.account
> >>>> >> AS
> >>>> >> d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
> >>>> >> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN
> >>>> >> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account
> AND
> >>>> >> s.ad
> >>>> >> =
> >>>> >> d.ad) WHERE s.date >= '2016-01-03'    AND d.date >=
> >>>> >> '2016-01-03'").count()
> >>>> >> RESULT:23747
> >>>> >>
> >>>> >>
> >>>> >>
> >>>> >>
> >>>> >>
> >>>> >>
> >>>> >
> >>>
> >>>
> >>
> >
> >
> >
> > --
> > Cesar Flores
>

Re: Weird results with Spark SQL Outer joins

Posted by Kevin Peng <kp...@gmail.com>.
Mike,

It looks like you are right.  The results seem to be fine.  It looks like I
messed up the filtering clause.

sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
AS d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s FULL OUTER JOIN
dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND s.ad
= d.ad) WHERE (s.date >= '2016-01-03' OR s.date IS NULL) AND (d.date >=
'2016-01-03' OR d.date IS NULL)").count()
res2: Long = 53042
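
An equivalent way to avoid the IS NULL clutter is to apply the date predicate to each table before the outer join (for example in subqueries or pre-filtered temp tables), so the filter never sees join-produced nulls. A plain-Python sketch of that ordering, using hypothetical toy rows keyed by (date, account, ad), not the real tables:

```python
def full_outer_join(left, right):
    """Toy full outer join on the (date, account, ad) key prefix.

    Assumes at most one row per key on each side, for illustration only.
    """
    right_index = {r[:3]: r for r in right}
    left_keys = {r[:3] for r in left}
    pairs = [(l, right_index.get(l[:3])) for l in left]      # matched or right-null
    pairs += [(None, r) for r in right if r[:3] not in left_keys]  # left-null
    return pairs

s_rows = [("2016-01-04", "a1", "ad1"), ("2016-01-05", "a2", "ad2"),
          ("2016-01-01", "a4", "ad4")]                  # fails the cutoff
d_rows = [("2016-01-04", "a1", "ad1"), ("2016-01-06", "a3", "ad3")]

# Filter first, then join: nulls introduced by the join are never filtered out.
cutoff = "2016-01-03"
s_kept = [r for r in s_rows if r[0] >= cutoff]
d_kept = [r for r in d_rows if r[0] >= cutoff]
result = full_outer_join(s_kept, d_kept)

assert len(result) == 3   # one matched pair plus one unmatched row per side
```

Note the two rewrites are not identical in general: pre-filtering turns a matched pair whose partner is out of range into an unmatched row, while the IS NULL variant drops it; which is right depends on intent.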

Davies, Cesar, Gourav,

Thanks for the support.

KP

On Tue, May 3, 2016 at 11:26 AM, Michael Segel <ms...@hotmail.com>
wrote:

> Silly question?
>
> If you change the predicate to
> ( s.date >= ‘2016-01-03’ OR s.date IS NULL )
> AND
> (d.date >= ‘2016-01-03’ OR d.date IS NULL)
>
> What do you get?
>
> Sorry if the syntax isn’t 100% correct. The idea is to not drop null
> values from the query.
> I would imagine that this shouldn’t kill performance since it’s most likely
> a post join filter on the result set?
> (Or is that just a Hive thing?)
>
> -Mike
>
> > On May 3, 2016, at 12:42 PM, Davies Liu <da...@databricks.com> wrote:
> >
> > Bingo, the two predicates s.date >= '2016-01-03' AND d.date >=
> > '2016-01-03' are the root cause,
> > which filter out all the nulls produced by the outer join, giving the
> > same result as an inner join.
> >
> > In Spark 2.0, we actually turn these joins into an inner join.
> >
> > On Tue, May 3, 2016 at 9:50 AM, Cesar Flores <ce...@gmail.com> wrote:
> >> Hi
> >>
> >> Have you tried the joins without the WHERE clause? When you use it, you
> >> are filtering out all the rows with null columns in those fields. In
> >> other words, you are doing an inner join in all your queries.
> >>
> >> On Tue, May 3, 2016 at 11:37 AM, Gourav Sengupta <
> gourav.sengupta@gmail.com>
> >> wrote:
> >>>
> >>> Hi Kevin,
> >>>
> >>> Having given it a first look I do think that you have hit something
> here
> >>> and this does not look quite fine. I have to work on the multiple AND
> >>> conditions in ON and see whether that is causing any issues.
> >>>
> >>> Regards,
> >>> Gourav Sengupta
> >>>
> >>> On Tue, May 3, 2016 at 8:28 AM, Kevin Peng <kp...@gmail.com> wrote:
> >>>>
> >>>> Davies,
> >>>>
> >>>> Here is the code that I am typing into the spark-shell along with the
> >>>> results (my question is at the bottom):
> >>>>
> >>>> val dps =
> >>>> sqlContext.read.format("com.databricks.spark.csv").option("header",
> >>>> "true").load("file:///home/ltu/dps_csv/")
> >>>> val swig =
> >>>> sqlContext.read.format("com.databricks.spark.csv").option("header",
> >>>> "true").load("file:///home/ltu/swig_csv/")
> >>>>
> >>>> dps.count
> >>>> res0: Long = 42694
> >>>>
> >>>> swig.count
> >>>> res1: Long = 42034
> >>>>
> >>>>
> >>>> dps.registerTempTable("dps_pin_promo_lt")
> >>>> swig.registerTempTable("swig_pin_promo_lt")
> >>>>
> >>>> sqlContext.sql("select * from dps_pin_promo_lt where date >
> >>>> '2016-01-03'").count
> >>>> res4: Long = 42666
> >>>>
> >>>> sqlContext.sql("select * from swig_pin_promo_lt where date >
> >>>> '2016-01-03'").count
> >>>> res5: Long = 34131
> >>>>
> >>>> sqlContext.sql("select distinct date, account, ad from
> dps_pin_promo_lt
> >>>> where date > '2016-01-03'").count
> >>>> res6: Long = 42533
> >>>>
> >>>> sqlContext.sql("select distinct date, account, ad from
> swig_pin_promo_lt
> >>>> where date > '2016-01-03'").count
> >>>> res7: Long = 34131
> >>>>
> >>>>
> >>>> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  ,
> d.account
> >>>> AS d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
> >>>> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN
> >>>> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND
> s.ad =
> >>>> d.ad) WHERE s.date >= '2016-01-03' AND d.date >=
> '2016-01-03'").count()
> >>>> res9: Long = 23809
> >>>>
> >>>>
> >>>> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  ,
> d.account
> >>>> AS d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
> >>>> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s FULL OUTER JOIN
> >>>> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND
> s.ad =
> >>>> d.ad) WHERE s.date >= '2016-01-03' AND d.date >=
> '2016-01-03'").count()
> >>>> res10: Long = 23809
> >>>>
> >>>>
> >>>> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  ,
> d.account
> >>>> AS d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
> >>>> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s LEFT OUTER JOIN
> >>>> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND
> s.ad =
> >>>> d.ad) WHERE s.date >= '2016-01-03' AND d.date >=
> '2016-01-03'").count()
> >>>> res11: Long = 23809
> >>>>
> >>>>
> >>>> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  ,
> d.account
> >>>> AS d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
> >>>> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s RIGHT OUTER JOIN
> >>>> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND
> s.ad =
> >>>> d.ad) WHERE s.date >= '2016-01-03' AND d.date >=
> '2016-01-03'").count()
> >>>> res12: Long = 23809
> >>>>
> >>>>
> >>>>
> >>>> From my results above, note that the counts of distinct values based
> >>>> on the join criteria and filter criteria for each individual table are
> >>>> located at res6 and res7.  My question is why the outer joins produce
> >>>> fewer rows than the smallest table; if there are no matches, the join
> >>>> should still bring in each unmatched row.  For the full and right outer
> >>>> joins I am expecting to see a minimum of res6 rows, but I get fewer; is
> >>>> there something specific that I am missing here?  I am expecting that
> >>>> the full outer join would give me the union of the two table sets, so I
> >>>> am expecting at least 42533 rows, not 23809.
> >>>>
> >>>>
> >>>> Gourav,
> >>>>
> >>>> I just ran this result set on a new session with slightly newer
> data...
> >>>> still seeing those results.
> >>>>
> >>>>
> >>>>
> >>>> Thanks,
> >>>>
> >>>> KP
> >>>>
> >>>>
> >>>> On Mon, May 2, 2016 at 11:16 PM, Davies Liu <da...@databricks.com>
> >>>> wrote:
> >>>>>
> >>>>> As @Gourav said, all the joins with different join types show the
> >>>>> same results, which means that every row from the left matched at
> >>>>> least one row from the right, and every row from the right matched at
> >>>>> least one row from the left, even though the number of rows on the
> >>>>> left does not equal the number on the right.
> >>>>>
> >>>>> This is a correct result.
> >>>>>
> >>>>> On Mon, May 2, 2016 at 2:13 PM, Kevin Peng <kp...@gmail.com> wrote:
> >>>>>> Yong,
> >>>>>>
> >>>>>> Sorry, let explain my deduction; it is going be difficult to get a
> >>>>>> sample
> >>>>>> data out since the dataset I am using is proprietary.
> >>>>>>
> >>>>>> From the above set queries (ones mentioned in above comments) both
> >>>>>> inner and
> >>>>>> outer join are producing the same counts.  They are basically
> pulling
> >>>>>> out
> >>>>>> selected columns from the query, but there is no roll up happening
> or
> >>>>>> anything that would possible make it suspicious that there is any
> >>>>>> difference
> >>>>>> besides the type of joins.  The tables are matched based on three
> keys
> >>>>>> that
> >>>>>> are present in both tables (ad, account, and date), on top of this
> >>>>>> they are
> >>>>>> filtered by date being above 2016-01-03.  Since all the joins are
> >>>>>> producing
> >>>>>> the same counts, the natural suspicions is that the tables are
> >>>>>> identical,
> >>>>>> but I when I run the following two queries:
> >>>>>>
> >>>>>> scala> sqlContext.sql("select * from swig_pin_promo_lt where date
> >>>>>>> ='2016-01-03'").count
> >>>>>>
> >>>>>> res14: Long = 34158
> >>>>>>
> >>>>>> scala> sqlContext.sql("select * from dps_pin_promo_lt where date
> >>>>>>> ='2016-01-03'").count
> >>>>>>
> >>>>>> res15: Long = 42693
> >>>>>>
> >>>>>>
> >>>>>> The above two queries filter out the data based on date used by the
> >>>>>> joins of
> >>>>>> 2016-01-03 and you can see the row count between the two tables are
> >>>>>> different, which is why I am suspecting something is wrong with the
> >>>>>> outer
> >>>>>> joins in spark sql, because in this situation the right and outer
> >>>>>> joins may
> >>>>>> produce the same results, but it should not be equal to the left
> join
> >>>>>> and
> >>>>>> definitely not the inner join; unless I am missing something.
> >>>>>>
> >>>>>>
> >>>>>> Side note: In my haste response above I posted the wrong counts for
> >>>>>> dps.count, the real value is res16: Long = 42694
> >>>>>>
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>>
> >>>>>> KP
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Mon, May 2, 2016 at 12:50 PM, Yong Zhang <ja...@hotmail.com>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> We are still not sure what is the problem, if you cannot show us
> with
> >>>>>>> some
> >>>>>>> example data.
> >>>>>>>
> >>>>>>> For dps with 42632 rows, and swig with 42034 rows, if dps full
> outer
> >>>>>>> join
> >>>>>>> with swig on 3 columns; with additional filters, get the same
> >>>>>>> resultSet row
> >>>>>>> count as dps left outer join with swig on 3 columns, with
> >>>>>>> additional
> >>>>>>> filters, again get the same resultSet row count as dps right
> >>>>>>> outer join
> >>>>>>> with swig on 3 columns, with same additional filters.
> >>>>>>>
> >>>>>>> Without knowing your data, I cannot see the reason that has to be a
> >>>>>>> bug in
> >>>>>>> the spark.
> >>>>>>>
> >>>>>>> Am I misunderstanding your bug?
> >>>>>>>
> >>>>>>> Yong
> >>>>>>>
> >>>>>>> ________________________________
> >>>>>>> From: kpeng1@gmail.com
> >>>>>>> Date: Mon, 2 May 2016 12:11:18 -0700
> >>>>>>> Subject: Re: Weird results with Spark SQL Outer joins
> >>>>>>> To: gourav.sengupta@gmail.com
> >>>>>>> CC: user@spark.apache.org
> >>>>>>>
> >>>>>>>
> >>>>>>> Gourav,
> >>>>>>>
> >>>>>>> I wish that was the case, but I have done a select count on each of the
> >>>>>>> two
> >>>>>>> tables individually and they return back different number of rows:
> >>>>>>>
> >>>>>>>
> >>>>>>> dps.registerTempTable("dps_pin_promo_lt")
> >>>>>>> swig.registerTempTable("swig_pin_promo_lt")
> >>>>>>>
> >>>>>>>
> >>>>>>> dps.count()
> >>>>>>> RESULT: 42632
> >>>>>>>
> >>>>>>>
> >>>>>>> swig.count()
> >>>>>>> RESULT: 42034
> >>>>>>>
> >>>>>>> On Mon, May 2, 2016 at 11:55 AM, Gourav Sengupta
> >>>>>>> <go...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> This shows that both the tables have matching records and no
> >>>>>>> mismatches.
> >>>>>>> Therefore obviously you have the same results irrespective of
> whether
> >>>>>>> you
> >>>>>>> use right or left join.
> >>>>>>>
> >>>>>>> I think that there is no problem here, unless I am missing
> something.
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Gourav
> >>>>>>>
> >>>>>>> On Mon, May 2, 2016 at 7:48 PM, kpeng1 <kp...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> Also, the results of the inner query produced the same results:
> >>>>>>> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  ,
> >>>>>>> d.account
> >>>>>>> AS
> >>>>>>> d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
> >>>>>>> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN
> >>>>>>> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account
> AND
> >>>>>>> s.ad
> >>>>>>> =
> >>>>>>> d.ad) WHERE s.date >= '2016-01-03'    AND d.date >=
> >>>>>>> '2016-01-03'").count()
> >>>>>>> RESULT:23747
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>
> >>>>
> >>>
> >>
> >>
> >>
> >> --
> >> Cesar Flores
> >
> >
> >
>
>

Re: Weird results with Spark SQL Outer joins

Posted by Michael Segel <ms...@hotmail.com>.
Silly question:

If you change the predicate to
( s.date >= '2016-01-03' OR s.date IS NULL )
AND
( d.date >= '2016-01-03' OR d.date IS NULL )

what do you get?

Sorry if the syntax isn't 100% correct. The idea is to not drop null values from the query.
I would imagine this shouldn't kill performance, since it's most likely a post-join filter on the result set?
(Or is that just a Hive thing?)

-Mike
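That predicate shape can be sanity-checked outside Spark on any SQL engine, since it is plain SQL NULL semantics. Below is a minimal sketch using Python's standard sqlite3 module; the tables, keys, and rows are invented for illustration and are not the poster's data.

```python
import sqlite3

# Toy stand-ins for the two tables: 'b' exists only in s, 'c' only in d.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE s (k TEXT, date TEXT);
    CREATE TABLE d (k TEXT, date TEXT);
    INSERT INTO s VALUES ('a', '2016-01-04'), ('b', '2016-01-04');
    INSERT INTO d VALUES ('a', '2016-01-04'), ('c', '2016-01-04');
""")

# The plain WHERE drops the unmatched row for 'b', because d.date IS NULL
# there and NULL >= '2016-01-03' does not evaluate to true.
plain = con.execute("""
    SELECT COUNT(*) FROM s LEFT JOIN d ON s.k = d.k
    WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'
""").fetchone()[0]

# The "... OR ... IS NULL" form keeps the unmatched row.
null_safe = con.execute("""
    SELECT COUNT(*) FROM s LEFT JOIN d ON s.k = d.k
    WHERE (s.date >= '2016-01-03' OR s.date IS NULL)
      AND (d.date >= '2016-01-03' OR d.date IS NULL)
""").fetchone()[0]

print(plain, null_safe)  # 1 2
```

With the null-safe predicate the left join retains its unmatched row, so the counts for the different join types would diverge again.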

> On May 3, 2016, at 12:42 PM, Davies Liu <da...@databricks.com> wrote:
> 
> Bingo, the two predicates s.date >= '2016-01-03' AND d.date >=
> '2016-01-03' are the root cause:
> they filter out all the NULLs produced by the outer join, so it returns
> the same result as an inner join.
> 
> In Spark 2.0, we actually turn such joins into inner joins.
> 
> On Tue, May 3, 2016 at 9:50 AM, Cesar Flores <ce...@gmail.com> wrote:
>> Hi
>> 
> Have you tried the joins without the WHERE clause? When you use it you are
> filtering out all the rows with NULL values in those fields. In other words
> you are doing an inner join in all your queries.
>> 
>> On Tue, May 3, 2016 at 11:37 AM, Gourav Sengupta <go...@gmail.com>
>> wrote:
>>> 
>>> Hi Kevin,
>>> 
>>> Having given it a first look I do think that you have hit something here
>>> and this does not look quite fine. I have to work on the multiple AND
>>> conditions in ON and see whether that is causing any issues.
>>> 
>>> Regards,
>>> Gourav Sengupta
>>> 
>>> On Tue, May 3, 2016 at 8:28 AM, Kevin Peng <kp...@gmail.com> wrote:
>>>> 
>>>> Davies,
>>>> 
>>>> Here is the code that I am typing into the spark-shell along with the
>>>> results (my question is at the bottom):
>>>> 
>>>> val dps =
>>>> sqlContext.read.format("com.databricks.spark.csv").option("header",
>>>> "true").load("file:///home/ltu/dps_csv/")
>>>> val swig =
>>>> sqlContext.read.format("com.databricks.spark.csv").option("header",
>>>> "true").load("file:///home/ltu/swig_csv/")
>>>> 
>>>> dps.count
>>>> res0: Long = 42694
>>>> 
>>>> swig.count
>>>> res1: Long = 42034
>>>> 
>>>> 
>>>> dps.registerTempTable("dps_pin_promo_lt")
>>>> swig.registerTempTable("swig_pin_promo_lt")
>>>> 
>>>> sqlContext.sql("select * from dps_pin_promo_lt where date >
>>>> '2016-01-03'").count
>>>> res4: Long = 42666
>>>> 
>>>> sqlContext.sql("select * from swig_pin_promo_lt where date >
>>>> '2016-01-03'").count
>>>> res5: Long = 34131
>>>> 
>>>> sqlContext.sql("select distinct date, account, ad from dps_pin_promo_lt
>>>> where date > '2016-01-03'").count
>>>> res6: Long = 42533
>>>> 
>>>> sqlContext.sql("select distinct date, account, ad from swig_pin_promo_lt
>>>> where date > '2016-01-03'").count
>>>> res7: Long = 34131
>>>> 
>>>> 
>>>> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
>>>> AS d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
>>>> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN
>>>> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND s.ad =
>>>> d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
>>>> res9: Long = 23809
>>>> 
>>>> 
>>>> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
>>>> AS d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
>>>> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s FULL OUTER JOIN
>>>> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND s.ad =
>>>> d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
>>>> res10: Long = 23809
>>>> 
>>>> 
>>>> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
>>>> AS d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
>>>> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s LEFT OUTER JOIN
>>>> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND s.ad =
>>>> d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
>>>> res11: Long = 23809
>>>> 
>>>> 
>>>> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
>>>> AS d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
>>>> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s RIGHT OUTER JOIN
>>>> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND s.ad =
>>>> d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
>>>> res12: Long = 23809
>>>> 
>>>> 
>>>> 
> >>>> From my results above, the counts of distinct values matching the join
> >>>> keys and filter criteria for each individual table are shown at res6 and
> >>>> res7.  My question is why the outer join produces fewer rows than the
> >>>> smallest table; if there are no matches it should still bring in that
> >>>> row as part of the outer join.  For the full and right outer join I am
> >>>> expecting to see a minimum of res6 rows, but I get fewer. Is there
> >>>> something specific that I am missing here?  I am expecting the full
> >>>> outer join to give me the union of the two table sets, so I am expecting
> >>>> at least 42533 rows, not 23809.
>>>> 
>>>> 
>>>> Gourav,
>>>> 
>>>> I just ran this result set on a new session with slightly newer data...
>>>> still seeing those results.
>>>> 
>>>> 
>>>> 
>>>> Thanks,
>>>> 
>>>> KP
>>>> 
>>>> 
>>>> On Mon, May 2, 2016 at 11:16 PM, Davies Liu <da...@databricks.com>
>>>> wrote:
>>>>> 
>>>>> as @Gourav said, all the join with different join type show the same
>>>>> results,
>>>>> which meant that all the rows from left could match at least one row
>>>>> from right,
>>>>> all the rows from right could match at least one row from left, even
>>>>> the number of row from left does not equal that of right.
>>>>> 
>>>>> This is correct result.
>>>>> 
>>>>> On Mon, May 2, 2016 at 2:13 PM, Kevin Peng <kp...@gmail.com> wrote:
>>>>>> Yong,
>>>>>> 
> >>>>>> Sorry, let me explain my deduction; it is going to be difficult to
> >>>>>> get sample data out since the dataset I am using is proprietary.
> >>>>>>
> >>>>>> From the above queries (the ones mentioned in earlier comments), both
> >>>>>> the inner and outer joins are producing the same counts.  They are
> >>>>>> basically pulling selected columns from the query; there is no roll-up
> >>>>>> happening or anything else that would possibly make it suspicious that
> >>>>>> there is any difference besides the type of join.  The tables are
> >>>>>> matched on three keys that are present in both tables (ad, account,
> >>>>>> and date), and on top of this they are filtered by date being after
> >>>>>> 2016-01-03.  Since all the joins produce the same counts, the natural
> >>>>>> suspicion is that the tables are identical, but when I run the
> >>>>>> following two queries:
>>>>>> 
>>>>>> scala> sqlContext.sql("select * from swig_pin_promo_lt where date
>>>>>>> ='2016-01-03'").count
>>>>>> 
>>>>>> res14: Long = 34158
>>>>>> 
>>>>>> scala> sqlContext.sql("select * from dps_pin_promo_lt where date
>>>>>>> ='2016-01-03'").count
>>>>>> 
>>>>>> res15: Long = 42693
>>>>>> 
>>>>>> 
>>>>>> The above two queries filter out the data based on date used by the
>>>>>> joins of
>>>>>> 2016-01-03 and you can see the row count between the two tables are
>>>>>> different, which is why I am suspecting something is wrong with the
>>>>>> outer
>>>>>> joins in spark sql, because in this situation the right and outer
>>>>>> joins may
>>>>>> produce the same results, but it should not be equal to the left join
>>>>>> and
>>>>>> definitely not the inner join; unless I am missing something.
>>>>>> 
>>>>>> 
>>>>>> Side note: In my haste response above I posted the wrong counts for
>>>>>> dps.count, the real value is res16: Long = 42694
>>>>>> 
>>>>>> 
>>>>>> Thanks,
>>>>>> 
>>>>>> 
>>>>>> KP
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Mon, May 2, 2016 at 12:50 PM, Yong Zhang <ja...@hotmail.com>
>>>>>> wrote:
>>>>>>> 
> >>>>>>> We are still not sure what the problem is if you cannot show us some
> >>>>>>> example data.
> >>>>>>>
> >>>>>>> For dps with 42632 rows and swig with 42034 rows: dps full outer join
> >>>>>>> with swig on 3 columns, with additional filters, gets the same
> >>>>>>> result-set row count as dps left outer join with swig on 3 columns
> >>>>>>> with the same additional filters, which again gets the same
> >>>>>>> result-set row count as dps right outer join with swig on 3 columns
> >>>>>>> with the same additional filters.
>>>>>>> 
>>>>>>> Without knowing your data, I cannot see the reason that has to be a
>>>>>>> bug in
>>>>>>> the spark.
>>>>>>> 
>>>>>>> Am I misunderstanding your bug?
>>>>>>> 
>>>>>>> Yong
>>>>>>> 
>>>>>>> ________________________________
>>>>>>> From: kpeng1@gmail.com
>>>>>>> Date: Mon, 2 May 2016 12:11:18 -0700
>>>>>>> Subject: Re: Weird results with Spark SQL Outer joins
>>>>>>> To: gourav.sengupta@gmail.com
>>>>>>> CC: user@spark.apache.org
>>>>>>> 
>>>>>>> 
>>>>>>> Gourav,
>>>>>>> 
> >>>>>>> I wish that was the case, but I have done a select count on each of
> >>>>>>> the two tables individually and they return different numbers of
> >>>>>>> rows:
>>>>>>> 
>>>>>>> 
>>>>>>> dps.registerTempTable("dps_pin_promo_lt")
>>>>>>> swig.registerTempTable("swig_pin_promo_lt")
>>>>>>> 
>>>>>>> 
>>>>>>> dps.count()
>>>>>>> RESULT: 42632
>>>>>>> 
>>>>>>> 
>>>>>>> swig.count()
>>>>>>> RESULT: 42034
>>>>>>> 
>>>>>>> On Mon, May 2, 2016 at 11:55 AM, Gourav Sengupta
>>>>>>> <go...@gmail.com> wrote:
>>>>>>> 
>>>>>>> This shows that both the tables have matching records and no
>>>>>>> mismatches.
>>>>>>> Therefore obviously you have the same results irrespective of whether
>>>>>>> you
>>>>>>> use right or left join.
>>>>>>> 
>>>>>>> I think that there is no problem here, unless I am missing something.
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Gourav
>>>>>>> 
>>>>>>> On Mon, May 2, 2016 at 7:48 PM, kpeng1 <kp...@gmail.com> wrote:
>>>>>>> 
>>>>>>> Also, the results of the inner query produced the same results:
>>>>>>> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  ,
>>>>>>> d.account
>>>>>>> AS
>>>>>>> d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
>>>>>>> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN
>>>>>>> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND
>>>>>>> s.ad
>>>>>>> =
>>>>>>> d.ad) WHERE s.date >= '2016-01-03'    AND d.date >=
>>>>>>> '2016-01-03'").count()
>>>>>>> RESULT:23747
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> View this message in context:
>>>>>>> 
>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/Weird-results-with-Spark-SQL-Outer-joins-tp26861p26863.html
>>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>>> Nabble.com.
>>>>>>> 
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>>>>>> For additional commands, e-mail: user-help@spark.apache.org
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>> 
>>>> 
>>> 
>> 
>> 
>> 
>> --
>> Cesar Flores
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
> 
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Weird results with Spark SQL Outer joins

Posted by Davies Liu <da...@databricks.com>.
Bingo, the two predicates s.date >= '2016-01-03' AND d.date >=
'2016-01-03' are the root cause:
they filter out all the NULLs produced by the outer join, so it returns
the same result as an inner join.

In Spark 2.0, we actually turn such joins into inner joins.
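This root cause can be reproduced outside Spark in any SQL engine, since it is standard SQL NULL semantics rather than a Spark bug. Here is a minimal sketch using Python's standard sqlite3 module; the tables and rows are invented for illustration.

```python
import sqlite3

# Toy tables: key 'b' exists only in s, key 'c' only in d.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE s (k TEXT, date TEXT);
    CREATE TABLE d (k TEXT, date TEXT);
    INSERT INTO s VALUES ('a', '2016-01-04'), ('b', '2016-01-04');
    INSERT INTO d VALUES ('a', '2016-01-04'), ('c', '2016-01-04');
""")

inner = con.execute("""
    SELECT COUNT(*) FROM s JOIN d ON s.k = d.k
    WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'
""").fetchone()[0]

# The outer join adds a row for the unmatched key 'b' with d.* all NULL,
# but the WHERE predicate on d.date rejects exactly those rows -- so the
# outer-join count collapses to the inner-join count.
outer = con.execute("""
    SELECT COUNT(*) FROM s LEFT JOIN d ON s.k = d.k
    WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'
""").fetchone()[0]

print(inner, outer)  # 1 1
```

Both counts are equal, which mirrors what the thread observed: every join type returned 23809 (or 23747) rows once the WHERE clause was applied.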


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Weird results with Spark SQL Outer joins

Posted by Cesar Flores <ce...@gmail.com>.
Hi

Have you tried the joins without the WHERE clause? When you use it you are
filtering out all the rows with NULL values in those fields. In other words
you are doing an inner join in all your queries.
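One way to keep the outer-join semantics while still applying the date filter is to filter each side before joining. The sketch below uses Python's standard sqlite3 module with invented toy tables; in Spark SQL the same pre-filtering could be done with subqueries or by moving the predicates into the ON clause.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE s (k TEXT, date TEXT);
    CREATE TABLE d (k TEXT, date TEXT);
    -- 'x' fails the date filter; 'b' has no match in d.
    INSERT INTO s VALUES ('a', '2016-01-04'), ('b', '2016-01-04'), ('x', '2015-12-31');
    INSERT INTO d VALUES ('a', '2016-01-04'), ('c', '2016-01-04');
""")

# Filtering inside subqueries applies the predicates *before* the outer
# join, so they never see the NULLs the join introduces and the unmatched
# row for 'b' survives.
kept = con.execute("""
    SELECT COUNT(*)
    FROM (SELECT * FROM s WHERE date >= '2016-01-03') sf
    LEFT JOIN (SELECT * FROM d WHERE date >= '2016-01-03') df
      ON sf.k = df.k
""").fetchone()[0]

print(kept)  # 2: matched 'a' plus unmatched 'b'
```

Contrast this with filtering in the WHERE clause after the outer join, which would drop the unmatched 'b' row and leave only the inner-join result.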



-- 
Cesar Flores

Re: Weird results with Spark SQL Outer joins

Posted by Gourav Sengupta <go...@gmail.com>.
Hi Kevin,

Having given it a first look, I do think you have hit something here, and
this does not look quite right. I have to experiment with the multiple AND
conditions in the ON clause and see whether they are causing any issues.

Regards,
Gourav Sengupta

On Tue, May 3, 2016 at 8:28 AM, Kevin Peng <kp...@gmail.com> wrote:

> Davies,
>
> Here is the code that I am typing into the spark-shell along with the
> results (my question is at the bottom):
>
> val dps =
> sqlContext.read.format("com.databricks.spark.csv").option("header",
> "true").load("file:///home/ltu/dps_csv/")
> val swig =
> sqlContext.read.format("com.databricks.spark.csv").option("header",
> "true").load("file:///home/ltu/swig_csv/")
>
> dps.count
> res0: Long = 42694
>
> swig.count
> res1: Long = 42034
>
>
> dps.registerTempTable("dps_pin_promo_lt")
> swig.registerTempTable("swig_pin_promo_lt")
>
> sqlContext.sql("select * from dps_pin_promo_lt where date >
> '2016-01-03'").count
> res4: Long = 42666
>
> sqlContext.sql("select * from swig_pin_promo_lt where date >
> '2016-01-03'").count
> res5: Long = 34131
>
> sqlContext.sql("select distinct date, account, ad from dps_pin_promo_lt
> where date > '2016-01-03'").count
> res6: Long = 42533
>
> sqlContext.sql("select distinct date, account, ad from swig_pin_promo_lt
> where date > '2016-01-03'").count
> res7: Long = 34131
>
>
> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
> AS d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN
> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND s.ad
> = d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
> res9: Long = 23809
>
>
> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
> AS d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s FULL OUTER JOIN
> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND s.ad
> = d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
> res10: Long = 23809
>
>
> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
> AS d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s LEFT OUTER JOIN
> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND s.ad
> = d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
> res11: Long = 23809
>
>
> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
> AS d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s RIGHT OUTER JOIN
> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND s.ad
> = d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
> res12: Long = 23809
>
>
>
> From my results above, we notice that the counts of distinct values based
> on the join criteria and filter criteria for each individual table is
> located at res6 and res7.  My question is why is the outer join producing
> less rows than the smallest table; if there are no matches it should still
> bring in that row as part of the outer join.  For the full and right outer
> join I am expecting to see a minimum of res6 rows, but I get less, is there
> something specific that I am missing here?  I am expecting that the full
> outer join would give me the union of the two table sets so I am expecting
> at least 42533 rows not 23809.
>
>
> Gourav,
>
> I just ran this result set on a new session with slightly newer data...
> still seeing those results.
>
>
>
> Thanks,
>
> KP
>
>
> On Mon, May 2, 2016 at 11:16 PM, Davies Liu <da...@databricks.com> wrote:
>
>> as @Gourav said, all the join with different join type show the same
>> results,
>> which meant that all the rows from left could match at least one row from
>> right,
>> all the rows from right could match at least one row from left, even
>> the number of row from left does not equal that of right.
>>
>> This is correct result.
>>
>> On Mon, May 2, 2016 at 2:13 PM, Kevin Peng <kp...@gmail.com> wrote:
>> > Yong,
>> >
>> > Sorry, let explain my deduction; it is going be difficult to get a
>> sample
>> > data out since the dataset I am using is proprietary.
>> >
>> > From the above set queries (ones mentioned in above comments) both
>> inner and
>> > outer join are producing the same counts.  They are basically pulling
>> out
>> > selected columns from the query, but there is no roll up happening or
>> > anything that would possible make it suspicious that there is any
>> difference
>> > besides the type of joins.  The tables are matched based on three keys
>> that
>> > are present in both tables (ad, account, and date), on top of this they
>> are
>> > filtered by date being above 2016-01-03.  Since all the joins are
>> producing
>> > the same counts, the natural suspicions is that the tables are
>> identical,
>> > but I when I run the following two queries:
>> >
>> > scala> sqlContext.sql("select * from swig_pin_promo_lt where date
>> >>='2016-01-03'").count
>> >
>> > res14: Long = 34158
>> >
>> > scala> sqlContext.sql("select * from dps_pin_promo_lt where date
>> >>='2016-01-03'").count
>> >
>> > res15: Long = 42693
>> >
>> >
>> > The above two queries filter out the data based on date used by the
>> joins of
>> > 2016-01-03 and you can see the row count between the two tables are
>> > different, which is why I am suspecting something is wrong with the
>> outer
>> > joins in spark sql, because in this situation the right and outer joins
>> may
>> > produce the same results, but it should not be equal to the left join
>> and
>> > definitely not the inner join; unless I am missing something.
>> >
>> >
>> > Side note: In my haste response above I posted the wrong counts for
>> > dps.count, the real value is res16: Long = 42694
>> >
>> >
>> > Thanks,
>> >
>> >
>> > KP
>> >
>> >
>> >
>> >
>> > On Mon, May 2, 2016 at 12:50 PM, Yong Zhang <ja...@hotmail.com>
>> wrote:
>> >>
>> >> We are still not sure what is the problem, if you cannot show us with
>> some
>> >> example data.
>> >>
>> >> For dps with 42632 rows, and swig with 42034 rows, if dps full outer
>> join
>> >> with swig on 3 columns; with additional filters, get the same
>> resultSet row
>> >> count as dps lefter outer join with swig on 3 columns, with additional
>> >> filters, again get the the same resultSet row count as dps right outer
>> join
>> >> with swig on 3 columns, with same additional filters.
>> >>
>> >> Without knowing your data, I cannot see the reason that has to be a
>> bug in
>> >> the spark.
>> >>
>> >> Am I misunderstanding your bug?
>> >>
>> >> Yong
>> >>
>> >> ________________________________
>> >> From: kpeng1@gmail.com
>> >> Date: Mon, 2 May 2016 12:11:18 -0700
>> >> Subject: Re: Weird results with Spark SQL Outer joins
>> >> To: gourav.sengupta@gmail.com
>> >> CC: user@spark.apache.org
>> >>
>> >>
>> >> Gourav,
>> >>
>> >> I wish that was case, but I have done a select count on each of the two
>> >> tables individually and they return back different number of rows:
>> >>
>> >>
>> >> dps.registerTempTable("dps_pin_promo_lt")
>> >> swig.registerTempTable("swig_pin_promo_lt")
>> >>
>> >>
>> >> dps.count()
>> >> RESULT: 42632
>> >>
>> >>
>> >> swig.count()
>> >> RESULT: 42034
>> >>
>> >> On Mon, May 2, 2016 at 11:55 AM, Gourav Sengupta
>> >> <go...@gmail.com> wrote:
>> >>
>> >> This shows that both the tables have matching records and no
>> mismatches.
>> >> Therefore obviously you have the same results irrespective of whether
>> you
>> >> use right or left join.
>> >>
>> >> I think that there is no problem here, unless I am missing something.
>> >>
>> >> Regards,
>> >> Gourav
>> >>
>> >> On Mon, May 2, 2016 at 7:48 PM, kpeng1 <kp...@gmail.com> wrote:
>> >>
>> >> Also, the results of the inner query produced the same results:
>> >> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  ,
>> d.account
>> >> AS
>> >> d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
>> >> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN
>> >> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND
>> s.ad
>> >> =
>> >> d.ad) WHERE s.date >= '2016-01-03'    AND d.date >=
>> '2016-01-03'").count()
>> >> RESULT:23747
>> >>
>> >>
>> >>
>> >> --
>> >> View this message in context:
>> >>
>> http://apache-spark-user-list.1001560.n3.nabble.com/Weird-results-with-Spark-SQL-Outer-joins-tp26861p26863.html
>> >> Sent from the Apache Spark User List mailing list archive at
>> Nabble.com.
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>> >> For additional commands, e-mail: user-help@spark.apache.org
>> >>
>> >>
>> >>
>> >
>>
>
>

Re: Weird results with Spark SQL Outer joins

Posted by Kevin Peng <kp...@gmail.com>.
Davies,

Here is the code that I am typing into the spark-shell along with the
results (my question is at the bottom):

val dps =
sqlContext.read.format("com.databricks.spark.csv").option("header",
"true").load("file:///home/ltu/dps_csv/")
val swig =
sqlContext.read.format("com.databricks.spark.csv").option("header",
"true").load("file:///home/ltu/swig_csv/")

dps.count
res0: Long = 42694

swig.count
res1: Long = 42034


dps.registerTempTable("dps_pin_promo_lt")
swig.registerTempTable("swig_pin_promo_lt")

sqlContext.sql("select * from dps_pin_promo_lt where date >
'2016-01-03'").count
res4: Long = 42666

sqlContext.sql("select * from swig_pin_promo_lt where date >
'2016-01-03'").count
res5: Long = 34131

sqlContext.sql("select distinct date, account, ad from dps_pin_promo_lt
where date > '2016-01-03'").count
res6: Long = 42533

sqlContext.sql("select distinct date, account, ad from swig_pin_promo_lt
where date > '2016-01-03'").count
res7: Long = 34131


sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
AS d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN
dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND s.ad
= d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
res9: Long = 23809


sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
AS d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s FULL OUTER JOIN
dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND s.ad
= d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
res10: Long = 23809


sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
AS d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s LEFT OUTER JOIN
dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND s.ad
= d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
res11: Long = 23809


sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
AS d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s RIGHT OUTER JOIN
dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND s.ad
= d.ad) WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
res12: Long = 23809



From my results above, note that the distinct counts for the join and filter
criteria of each individual table are at res6 and res7.  My question is why
the outer joins are producing fewer rows than the smaller table; if a row has
no match, an outer join should still bring that row in.  For the full and
right outer joins I am expecting to see a minimum of res6 rows, but I get
fewer; is there something specific that I am missing here?  I am expecting
the full outer join to give me the union of the two tables' key sets, so I am
expecting at least 42533 rows, not 23809.
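For intuition, the expected lower bounds on the join counts can be sketched in plain Python. This is a toy illustration of standard SQL join semantics with invented data, assuming unique join keys so that each join behaves like a set operation; it is not the poster's proprietary tables:

```python
# Toy (date, account, ad) key sets; unique keys, so joins act like set ops.
left = {("2016-01-04", "a1", "ad1"), ("2016-01-05", "a2", "ad2")}
right = {("2016-01-04", "a1", "ad1"), ("2016-01-06", "a3", "ad3")}

inner = left & right                  # keys present on both sides
left_outer = inner | (left - right)   # unmatched left rows kept (null-padded)
right_outer = inner | (right - left)  # unmatched right rows kept (null-padded)
full_outer = left | right             # union of both key sets

# Without any post-join filtering, these bounds must hold:
assert len(left_outer) >= len(left)
assert len(right_outer) >= len(right)
assert len(full_outer) >= max(len(left), len(right))
print(len(inner), len(left_outer), len(right_outer), len(full_outer))
```

Under these assumptions a full outer join can never return fewer rows than the larger distinct-key set — which is exactly why the 23809 result looks surprising at first sight.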


Gourav,

I just ran this result set on a new session with slightly newer data...
still seeing those results.



Thanks,

KP


On Mon, May 2, 2016 at 11:16 PM, Davies Liu <da...@databricks.com> wrote:

> as @Gourav said, all the join with different join type show the same
> results,
> which meant that all the rows from left could match at least one row from
> right,
> all the rows from right could match at least one row from left, even
> the number of row from left does not equal that of right.
>
> This is correct result.
>
> On Mon, May 2, 2016 at 2:13 PM, Kevin Peng <kp...@gmail.com> wrote:
> > Yong,
> >
> > Sorry, let explain my deduction; it is going be difficult to get a sample
> > data out since the dataset I am using is proprietary.
> >
> > From the above set queries (ones mentioned in above comments) both inner
> and
> > outer join are producing the same counts.  They are basically pulling out
> > selected columns from the query, but there is no roll up happening or
> > anything that would possible make it suspicious that there is any
> difference
> > besides the type of joins.  The tables are matched based on three keys
> that
> > are present in both tables (ad, account, and date), on top of this they
> are
> > filtered by date being above 2016-01-03.  Since all the joins are
> producing
> > the same counts, the natural suspicions is that the tables are identical,
> > but I when I run the following two queries:
> >
> > scala> sqlContext.sql("select * from swig_pin_promo_lt where date
> >>='2016-01-03'").count
> >
> > res14: Long = 34158
> >
> > scala> sqlContext.sql("select * from dps_pin_promo_lt where date
> >>='2016-01-03'").count
> >
> > res15: Long = 42693
> >
> >
> > The above two queries filter out the data based on date used by the
> joins of
> > 2016-01-03 and you can see the row count between the two tables are
> > different, which is why I am suspecting something is wrong with the outer
> > joins in spark sql, because in this situation the right and outer joins
> may
> > produce the same results, but it should not be equal to the left join and
> > definitely not the inner join; unless I am missing something.
> >
> >
> > Side note: In my haste response above I posted the wrong counts for
> > dps.count, the real value is res16: Long = 42694
> >
> >
> > Thanks,
> >
> >
> > KP
> >
> >
> >
> >
> > On Mon, May 2, 2016 at 12:50 PM, Yong Zhang <ja...@hotmail.com>
> wrote:
> >>
> >> We are still not sure what is the problem, if you cannot show us with
> some
> >> example data.
> >>
> >> For dps with 42632 rows, and swig with 42034 rows, if dps full outer
> join
> >> with swig on 3 columns; with additional filters, get the same resultSet
> row
> >> count as dps lefter outer join with swig on 3 columns, with additional
> >> filters, again get the the same resultSet row count as dps right outer
> join
> >> with swig on 3 columns, with same additional filters.
> >>
> >> Without knowing your data, I cannot see the reason that has to be a bug
> in
> >> the spark.
> >>
> >> Am I misunderstanding your bug?
> >>
> >> Yong
> >>
> >> ________________________________
> >> From: kpeng1@gmail.com
> >> Date: Mon, 2 May 2016 12:11:18 -0700
> >> Subject: Re: Weird results with Spark SQL Outer joins
> >> To: gourav.sengupta@gmail.com
> >> CC: user@spark.apache.org
> >>
> >>
> >> Gourav,
> >>
> >> I wish that was case, but I have done a select count on each of the two
> >> tables individually and they return back different number of rows:
> >>
> >>
> >> dps.registerTempTable("dps_pin_promo_lt")
> >> swig.registerTempTable("swig_pin_promo_lt")
> >>
> >>
> >> dps.count()
> >> RESULT: 42632
> >>
> >>
> >> swig.count()
> >> RESULT: 42034
> >>
> >> On Mon, May 2, 2016 at 11:55 AM, Gourav Sengupta
> >> <go...@gmail.com> wrote:
> >>
> >> This shows that both the tables have matching records and no mismatches.
> >> Therefore obviously you have the same results irrespective of whether
> you
> >> use right or left join.
> >>
> >> I think that there is no problem here, unless I am missing something.
> >>
> >> Regards,
> >> Gourav
> >>
> >> On Mon, May 2, 2016 at 7:48 PM, kpeng1 <kp...@gmail.com> wrote:
> >>
> >> Also, the results of the inner query produced the same results:
> >> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  ,
> d.account
> >> AS
> >> d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
> >> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN
> >> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND
> s.ad
> >> =
> >> d.ad) WHERE s.date >= '2016-01-03'    AND d.date >=
> '2016-01-03'").count()
> >> RESULT:23747
> >>
> >>
> >>
> >> --
> >> View this message in context:
> >>
> http://apache-spark-user-list.1001560.n3.nabble.com/Weird-results-with-Spark-SQL-Outer-joins-tp26861p26863.html
> >> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> >> For additional commands, e-mail: user-help@spark.apache.org
> >>
> >>
> >>
> >
>

Re: Weird results with Spark SQL Outer joins

Posted by Davies Liu <da...@databricks.com>.
As @Gourav said, all of the joins with different join types show the same
result, which means that every row from the left matched at least one row
from the right, and every row from the right matched at least one row from
the left, even though the number of rows on the left does not equal the
number on the right.

This is a correct result.
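One mechanism that can legitimately produce this pattern is worth spelling out. The following is a hedged sketch of standard SQL semantics with invented data, not a claim about the poster's tables: a WHERE predicate that references columns from both sides of an outer join discards the null-padded rows that distinguish it from an inner join, because a comparison against NULL is never true. With such a filter, all four join types can return the same count:

```python
# Sketch of why "OUTER JOIN ... WHERE s.date >= x AND d.date >= x" can
# collapse to an inner join: unmatched rows get None for the other side's
# columns, and the WHERE predicate then filters those rows out.
left = [("2016-01-04", "a1"), ("2016-01-05", "a2")]
right = [("2016-01-04", "a1"), ("2016-01-06", "a3")]

def full_outer(l, r):
    rows = [(a, b) for a in l for b in r if a == b]   # matched pairs
    rows += [(a, None) for a in l if a not in r]      # left-only, null-padded
    rows += [(None, b) for b in r if b not in l]      # right-only, null-padded
    return rows

joined = full_outer(left, right)   # 1 match + 2 null-padded rows

def keep(row):
    s, d = row
    # Mimics: WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'
    # (a predicate on a None/NULL column is not satisfied)
    return (s is not None and d is not None
            and s[0] >= "2016-01-03" and d[0] >= "2016-01-03")

filtered = [row for row in joined if keep(row)]
inner = [(a, b) for a in left for b in right if a == b]
assert len(filtered) == len(inner)   # outer join + WHERE == inner join count
```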

On Mon, May 2, 2016 at 2:13 PM, Kevin Peng <kp...@gmail.com> wrote:
> Yong,
>
> Sorry, let explain my deduction; it is going be difficult to get a sample
> data out since the dataset I am using is proprietary.
>
> From the above set queries (ones mentioned in above comments) both inner and
> outer join are producing the same counts.  They are basically pulling out
> selected columns from the query, but there is no roll up happening or
> anything that would possible make it suspicious that there is any difference
> besides the type of joins.  The tables are matched based on three keys that
> are present in both tables (ad, account, and date), on top of this they are
> filtered by date being above 2016-01-03.  Since all the joins are producing
> the same counts, the natural suspicions is that the tables are identical,
> but I when I run the following two queries:
>
> scala> sqlContext.sql("select * from swig_pin_promo_lt where date
>>='2016-01-03'").count
>
> res14: Long = 34158
>
> scala> sqlContext.sql("select * from dps_pin_promo_lt where date
>>='2016-01-03'").count
>
> res15: Long = 42693
>
>
> The above two queries filter out the data based on date used by the joins of
> 2016-01-03 and you can see the row count between the two tables are
> different, which is why I am suspecting something is wrong with the outer
> joins in spark sql, because in this situation the right and outer joins may
> produce the same results, but it should not be equal to the left join and
> definitely not the inner join; unless I am missing something.
>
>
> Side note: In my haste response above I posted the wrong counts for
> dps.count, the real value is res16: Long = 42694
>
>
> Thanks,
>
>
> KP
>
>
>
>
> On Mon, May 2, 2016 at 12:50 PM, Yong Zhang <ja...@hotmail.com> wrote:
>>
>> We are still not sure what is the problem, if you cannot show us with some
>> example data.
>>
>> For dps with 42632 rows, and swig with 42034 rows, if dps full outer join
>> with swig on 3 columns; with additional filters, get the same resultSet row
>> count as dps lefter outer join with swig on 3 columns, with additional
>> filters, again get the the same resultSet row count as dps right outer join
>> with swig on 3 columns, with same additional filters.
>>
>> Without knowing your data, I cannot see the reason that has to be a bug in
>> the spark.
>>
>> Am I misunderstanding your bug?
>>
>> Yong
>>
>> ________________________________
>> From: kpeng1@gmail.com
>> Date: Mon, 2 May 2016 12:11:18 -0700
>> Subject: Re: Weird results with Spark SQL Outer joins
>> To: gourav.sengupta@gmail.com
>> CC: user@spark.apache.org
>>
>>
>> Gourav,
>>
>> I wish that was case, but I have done a select count on each of the two
>> tables individually and they return back different number of rows:
>>
>>
>> dps.registerTempTable("dps_pin_promo_lt")
>> swig.registerTempTable("swig_pin_promo_lt")
>>
>>
>> dps.count()
>> RESULT: 42632
>>
>>
>> swig.count()
>> RESULT: 42034
>>
>> On Mon, May 2, 2016 at 11:55 AM, Gourav Sengupta
>> <go...@gmail.com> wrote:
>>
>> This shows that both the tables have matching records and no mismatches.
>> Therefore obviously you have the same results irrespective of whether you
>> use right or left join.
>>
>> I think that there is no problem here, unless I am missing something.
>>
>> Regards,
>> Gourav
>>
>> On Mon, May 2, 2016 at 7:48 PM, kpeng1 <kp...@gmail.com> wrote:
>>
>> Also, the results of the inner query produced the same results:
>> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
>> AS
>> d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
>> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN
>> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND s.ad
>> =
>> d.ad) WHERE s.date >= '2016-01-03'    AND d.date >= '2016-01-03'").count()
>> RESULT:23747
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Weird-results-with-Spark-SQL-Outer-joins-tp26861p26863.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>> For additional commands, e-mail: user-help@spark.apache.org
>>
>>
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Weird results with Spark SQL Outer joins

Posted by Kevin Peng <kp...@gmail.com>.
Yong,

Sorry, let me explain my deduction; it is going to be difficult to get sample
data out since the dataset I am using is proprietary.

From the set of queries above (the ones mentioned in earlier comments), both
the inner and outer joins are producing the same counts.  They are basically
pulling selected columns out of the query; there is no roll-up happening or
anything else that would possibly make it suspicious that there is any
difference besides the type of join.  The tables are matched on three keys
that are present in both tables (ad, account, and date), and on top of this
they are filtered by date being after 2016-01-03.  Since all the joins are
producing the same counts, the natural suspicion is that the tables are
identical, but when I run the following two queries:

scala> sqlContext.sql("select * from swig_pin_promo_lt where date
>='2016-01-03'").count

res14: Long = 34158

scala> sqlContext.sql("select * from dps_pin_promo_lt where date
>='2016-01-03'").count

res15: Long = 42693


The above two queries filter the data on the date used by the joins
(2016-01-03), and you can see that the row counts of the two tables differ,
which is why I suspect something is wrong with the outer joins in Spark SQL:
in this situation the right and full outer joins may produce the same
results, but they should not equal the left join, and definitely not the
inner join; unless I am missing something.
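One way to probe this suspicion is to compare where the filter on the null-producing side is placed. The sketch below uses SQLite's LEFT OUTER JOIN as a stand-in for Spark SQL (table names and rows are invented for illustration): keeping the `d.date` filter in the WHERE clause drops the null-padded rows, while moving it into the ON clause preserves the unmatched left-side rows:

```python
import sqlite3

# Tiny in-memory tables standing in for swig (s) and dps (d).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE s (date TEXT, account TEXT)")
cur.execute("CREATE TABLE d (date TEXT, account TEXT)")
cur.executemany("INSERT INTO s VALUES (?, ?)",
                [("2016-01-04", "a1"), ("2016-01-05", "a2")])
cur.executemany("INSERT INTO d VALUES (?, ?)",
                [("2016-01-04", "a1")])

# Filter on d in WHERE: the null-padded row fails "d.date >= ..." and is lost.
where_count = cur.execute(
    "SELECT COUNT(*) FROM s LEFT OUTER JOIN d "
    "ON s.date = d.date AND s.account = d.account "
    "WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").fetchone()[0]

# Same filter moved into ON: every s row survives the outer join.
on_count = cur.execute(
    "SELECT COUNT(*) FROM s LEFT OUTER JOIN d "
    "ON s.date = d.date AND s.account = d.account "
    "AND d.date >= '2016-01-03' "
    "WHERE s.date >= '2016-01-03'").fetchone()[0]

print(where_count, on_count)
```

If the real query behaves the same way, rewriting the `d.date` predicate into the ON clause (or filtering each table before joining) should restore the expected outer-join counts.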


Side note: In my haste response above I posted the wrong counts for
dps.count, the real value is res16: Long = 42694


Thanks,


KP



On Mon, May 2, 2016 at 12:50 PM, Yong Zhang <ja...@hotmail.com> wrote:

> We are still not sure what is the problem, if you cannot show us with some
> example data.
>
> For dps with 42632 rows, and swig with 42034 rows, if dps full outer join
> with swig on 3 columns; with additional filters, get the same resultSet row
> count as dps lefter outer join with swig on 3 columns, with additional
> filters, again get the the same resultSet row count as dps right outer join
> with swig on 3 columns, with same additional filters.
>
> Without knowing your data, I cannot see the reason that has to be a bug in
> the spark.
>
> Am I misunderstanding your bug?
>
> Yong
>
> ------------------------------
> From: kpeng1@gmail.com
> Date: Mon, 2 May 2016 12:11:18 -0700
> Subject: Re: Weird results with Spark SQL Outer joins
> To: gourav.sengupta@gmail.com
> CC: user@spark.apache.org
>
>
> Gourav,
>
> I wish that was case, but I have done a select count on each of the two
> tables individually and they return back different number of rows:
>
>
> dps.registerTempTable("dps_pin_promo_lt")
> swig.registerTempTable("swig_pin_promo_lt")
>
>
> dps.count()
> RESULT: 42632
>
>
> swig.count()
> RESULT: 42034
>
> On Mon, May 2, 2016 at 11:55 AM, Gourav Sengupta <
> gourav.sengupta@gmail.com> wrote:
>
> This shows that both the tables have matching records and no mismatches.
> Therefore obviously you have the same results irrespective of whether you
> use right or left join.
>
> I think that there is no problem here, unless I am missing something.
>
> Regards,
> Gourav
>
> On Mon, May 2, 2016 at 7:48 PM, kpeng1 <kp...@gmail.com> wrote:
>
> Also, the results of the inner query produced the same results:
> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
> AS
> d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN
> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND s.ad
> =
> d.ad) WHERE s.date >= '2016-01-03'    AND d.date >= '2016-01-03'").count()
> RESULT:23747
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Weird-results-with-Spark-SQL-Outer-joins-tp26861p26863.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>
>
>
>

RE: Weird results with Spark SQL Outer joins

Posted by Yong Zhang <ja...@hotmail.com>.
We are still not sure what the problem is if you cannot show us some example data.
For dps with 42632 rows and swig with 42034 rows, it is possible that a dps full outer join with swig on 3 columns, with additional filters, gets the same result-set row count as a dps left outer join with swig on 3 columns with the same additional filters, and again the same result-set row count as a dps right outer join with swig on 3 columns with the same additional filters.
Without knowing your data, I cannot see any reason this has to be a bug in Spark.
Am I misunderstanding your bug?
Yong

From: kpeng1@gmail.com
Date: Mon, 2 May 2016 12:11:18 -0700
Subject: Re: Weird results with Spark SQL Outer joins
To: gourav.sengupta@gmail.com
CC: user@spark.apache.org

Gourav,

I wish that was the case, but I have done a select count on each of the two
tables individually and they return different numbers of rows:

dps.registerTempTable("dps_pin_promo_lt")
swig.registerTempTable("swig_pin_promo_lt")

dps.count()
RESULT: 42632

swig.count()
RESULT: 42034

On Mon, May 2, 2016 at 11:55 AM, Gourav Sengupta <go...@gmail.com> wrote:
This shows that both the tables have matching records and no mismatches.
Therefore obviously you have the same results irrespective of whether you
use right or left join.
I think that there is no problem here, unless I am missing something.
Regards,
Gourav
On Mon, May 2, 2016 at 7:48 PM, kpeng1 <kp...@gmail.com> wrote:
Also, the results of the inner query produced the same results:
sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account AS
d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN
dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND s.ad =
d.ad) WHERE s.date >= '2016-01-03'    AND d.date >= '2016-01-03'").count()
RESULT:23747

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Weird-results-with-Spark-SQL-Outer-joins-tp26861p26863.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Weird results with Spark SQL Outer joins

Posted by Kevin Peng <kp...@gmail.com>.
Gourav,

I wish that was the case, but I have done a select count on each of the two
tables individually and they return different numbers of rows:


dps.registerTempTable("dps_pin_promo_lt")

swig.registerTempTable("swig_pin_promo_lt")


dps.count()

RESULT: 42632


swig.count()

RESULT: 42034

On Mon, May 2, 2016 at 11:55 AM, Gourav Sengupta <go...@gmail.com>
wrote:

> This shows that both the tables have matching records and no mismatches.
> Therefore obviously you have the same results irrespective of whether you
> use right or left join.
>
> I think that there is no problem here, unless I am missing something.
>
> Regards,
> Gourav
>
> On Mon, May 2, 2016 at 7:48 PM, kpeng1 <kp...@gmail.com> wrote:
>
>> Also, the results of the inner query produced the same results:
>> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
>> AS
>> d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
>> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN
>> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND
>> s.ad =
>> d.ad) WHERE s.date >= '2016-01-03'    AND d.date >=
>> '2016-01-03'").count()
>> RESULT:23747
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Weird-results-with-Spark-SQL-Outer-joins-tp26861p26863.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>> For additional commands, e-mail: user-help@spark.apache.org
>>
>>
>

Re: Weird results with Spark SQL Outer joins

Posted by Gourav Sengupta <go...@gmail.com>.
This shows that both tables have matching records and no mismatches;
therefore you obviously get the same results irrespective of whether you
use a right or left join.

I think that there is no problem here, unless I am missing something.

Regards,
Gourav

On Mon, May 2, 2016 at 7:48 PM, kpeng1 <kp...@gmail.com> wrote:

> Also, the results of the inner query produced the same results:
> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account
> AS
> d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN
> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND s.ad
> =
> d.ad) WHERE s.date >= '2016-01-03'    AND d.date >= '2016-01-03'").count()
> RESULT:23747
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Weird-results-with-Spark-SQL-Outer-joins-tp26861p26863.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>
>

Re: Weird results with Spark SQL Outer joins

Posted by kpeng1 <kp...@gmail.com>.
Also, the inner join query produced the same result:
sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  , d.account AS
d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN
dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND s.ad =
d.ad) WHERE s.date >= '2016-01-03'    AND d.date >= '2016-01-03'").count() 
RESULT:23747 





Re: Weird results with Spark SQL Outer joins

Posted by Gourav Sengupta <go...@gmail.com>.
Hi Kevin,

Thanks.

Please post the result of the same query with INNER JOIN and then it will
give us a bit of insight.

Regards,
Gourav


On Mon, May 2, 2016 at 7:10 PM, Kevin Peng <kp...@gmail.com> wrote:

> Gourav,
>
> Apologies.  I edited my post with this information:
> Spark version: 1.6
> Result from spark shell
> OS: Linux version 2.6.32-431.20.3.el6.x86_64 (
> mockbuild@c6b9.bsys.dev.centos.org) (gcc version 4.4.7 20120313 (Red Hat
> 4.4.7-4) (GCC) ) #1 SMP Thu Jun 19 21:14:45 UTC 2014
>
> Thanks,
>
> KP
>
> On Mon, May 2, 2016 at 11:05 AM, Gourav Sengupta <
> gourav.sengupta@gmail.com> wrote:
>
>> Hi,
>>
>> As always, can you please write down details regarding your SPARK cluster
>> - the version, OS, IDE used, etc?
>>
>> Regards,
>> Gourav Sengupta
>>
>> On Mon, May 2, 2016 at 5:58 PM, kpeng1 <kp...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> I am running into a weird result with Spark SQL outer joins.  The results
>>> for all of them seem to be the same, which does not make sense given the
>>> data.  Here are the queries I am running, with their results:
>>>
>>> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  ,
>>> d.account AS
>>> d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
>>> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s FULL OUTER JOIN
>>> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND
>>> s.ad =
>>> d.ad) WHERE s.date >= '2016-01-03'    AND d.date >=
>>> '2016-01-03'").count()
>>> RESULT:23747
>>>
>>>
>>> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  ,
>>> d.account AS
>>> d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
>>> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s LEFT OUTER JOIN
>>> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND
>>> s.ad =
>>> d.ad) WHERE s.date >= '2016-01-03'    AND d.date >=
>>> '2016-01-03'").count()
>>> RESULT:23747
>>>
>>> sqlContext.sql("SELECT s.date AS edate  , s.account AS s_acc  ,
>>> d.account AS
>>> d_acc  , s.ad as s_ad  , d.ad as d_ad , s.spend AS s_spend  ,
>>> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s RIGHT OUTER JOIN
>>> dps_pin_promo_lt d  ON (s.date = d.date AND s.account = d.account AND
>>> s.ad =
>>> d.ad) WHERE s.date >= '2016-01-03'    AND d.date >=
>>> '2016-01-03'").count()
>>> RESULT: 23747
>>>
>>> I was wondering if someone had encountered this issue before.
>>>
>>>
>>>
>>>
>>
>

Re: Weird results with Spark SQL Outer joins

Posted by Kevin Peng <kp...@gmail.com>.
Gourav,

Apologies.  I edited my post with this information:
Spark version: 1.6
Result from spark shell
OS: Linux version 2.6.32-431.20.3.el6.x86_64 (
mockbuild@c6b9.bsys.dev.centos.org) (gcc version 4.4.7 20120313 (Red Hat
4.4.7-4) (GCC) ) #1 SMP Thu Jun 19 21:14:45 UTC 2014

Thanks,

KP

On Mon, May 2, 2016 at 11:05 AM, Gourav Sengupta <go...@gmail.com>
wrote:

> Hi,
>
> As always, can you please write down details regarding your SPARK cluster
> - the version, OS, IDE used, etc?
>
> Regards,
> Gourav Sengupta
>

Re: Weird results with Spark SQL Outer joins

Posted by Gourav Sengupta <go...@gmail.com>.
Hi,

As always, can you please share the details of your Spark cluster: the
version, OS, IDE used, etc.?

Regards,
Gourav Sengupta
