Posted to issues@ignite.apache.org by "Alexey Zinoviev (Jira)" <ji...@apache.org> on 2019/09/30 13:08:00 UTC

[jira] [Updated] (IGNITE-12245) [Spark] Support of Null Handling in JOIN condition

     [ https://issues.apache.org/jira/browse/IGNITE-12245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Zinoviev updated IGNITE-12245:
-------------------------------------
    Description: 
Also, a bug with incorrect null handling on columns in a JOIN condition was fixed in Spark:

https://issues.apache.org/jira/browse/SPARK-21479

This requires fixes in IgniteOptimizationJoinSpec (the same thing happened in the previous migration from 2.2 to 2.3).

 

I experimented with the Spark code. Version 2.3 generates the following plan:

== Parsed Logical Plan ==
'Project ['jt1.id AS id1#28, 'jt1.val1, 'jt2.id AS id2#29, 'jt2.val2]
+- 'Join LeftOuter, ('jt1.val1 = 'jt2.val2)
   :- 'UnresolvedRelation `jt1`
   +- 'UnresolvedRelation `jt2`

== Analyzed Logical Plan ==
id1: string, val1: string, id2: string, val2: string
Project [id#10 AS id1#28, val1#11, id#24 AS id2#29, val2#25]
+- Join LeftOuter, (val1#11 = val2#25)
   :- SubqueryAlias jt1
   :  +- Relation[id#10,val1#11] csv
   +- SubqueryAlias jt2
      +- Relation[id#24,val2#25] csv

== Optimized Logical Plan ==
Project [id#10 AS id1#28, val1#11, id#24 AS id2#29, val2#25]
+- Join LeftOuter, (val1#11 = val2#25)
   :- Relation[id#10,val1#11] csv
   +- Relation[id#24,val2#25] csv

 

Version 2.4 generates:

== Parsed Logical Plan ==
'Project ['jt1.id AS id1#28, 'jt1.val1, 'jt2.id AS id2#29, 'jt2.val2]
+- 'Join LeftOuter, ('jt1.val1 = 'jt2.val2)
   :- 'UnresolvedRelation `jt1`
   +- 'UnresolvedRelation `jt2`

== Analyzed Logical Plan ==
id1: string, val1: string, id2: string, val2: string
Project [id#10 AS id1#28, val1#11, id#24 AS id2#29, val2#25]
+- Join LeftOuter, (val1#11 = val2#25)
   :- SubqueryAlias `jt1`
   :  +- Relation[id#10,val1#11] csv
   +- SubqueryAlias `jt2`
      +- Relation[id#24,val2#25] csv

== Optimized Logical Plan ==
Project [id#10 AS id1#28, val1#11, id#24 AS id2#29, val2#25]
+- Join LeftOuter, (val1#11 = val2#25)
   :- Relation[id#10,val1#11] csv
   +- Filter isnotnull(val2#25)
      +- Relation[id#24,val2#25] csv

 

The +- Filter isnotnull(val2#25) node is added to the optimized logical plan.

In reality, however, this filter doesn't take effect in Spark (Spark does not filter the rows), but wow! It does take effect in Ignite, because Ignite faithfully executes the optimized Spark plan.
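As a side note, here is a minimal sketch in plain Python (hypothetical toy data, not Ignite or Spark code) of why pushing an isnotnull filter onto the right side of a LEFT OUTER JOIN is a legal optimization: a NULL join key never satisfies the equality condition, so right-side rows with a NULL key can be dropped before the join without changing the result.

```python
# Sketch (plain Python, hypothetical data): an isnotnull pre-filter on the
# right side of a LEFT OUTER JOIN preserves the join result, because a
# NULL key can never satisfy the equality condition anyway.

def left_outer_join(left, right, key_l, key_r, right_cols):
    """Naive LEFT OUTER JOIN with SQL NULL semantics:
    NULL = NULL is not true, so None keys never match."""
    out = []
    for l in left:
        matches = [r for r in right
                   if l[key_l] is not None and l[key_l] == r[key_r]]
        if matches:
            out.extend({**l, **r} for r in matches)
        else:
            # No match: extend the left row with NULLs for the right columns.
            out.append({**l, **{c: None for c in right_cols}})
    return out

jt1 = [{"id1": "1", "val1": "A"}, {"id1": "2", "val1": None}]
jt2 = [{"id2": "10", "val2": "A"}, {"id2": "11", "val2": None}]

plain = left_outer_join(jt1, jt2, "val1", "val2", ("id2", "val2"))
# Same join after the isnotnull(val2) pre-filter on the right side:
filtered = left_outer_join(jt1, [r for r in jt2 if r["val2"] is not None],
                           "val1", "val2", ("id2", "val2"))
assert plain == filtered
```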

 

If you enable the following option, the behaviour of Ignite and Spark will be the same, and the filter will not be added:

.config("ignite.disableSparkSQLOptimization", "true")
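For context, a hedged sketch of where such an option would be passed (assuming a PySpark-style session builder; the app name and builder details are illustrative, and the real setup may go through Ignite's own session wrapper instead):

```python
# Illustrative config fragment (requires pyspark plus the Ignite Spark
# module on the classpath; shown only to indicate where the key goes).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("ignite-join-example")  # hypothetical app name
         # Ignite then falls back to the non-optimized Spark plan, so the
         # extra isnotnull filter from Spark 2.4 is not applied:
         .config("ignite.disableSparkSQLOptimization", "true")
         .getOrCreate())
```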

 

The best approach for Spark 2.4 is to disable Spark SQL optimization until this is fixed on the Spark side (I could file a bug for this).

  was:
Also, a bug with incorrect null handling on columns in a JOIN condition was fixed in Spark:

https://issues.apache.org/jira/browse/SPARK-21479

This requires fixes in IgniteOptimizationJoinSpec (the same thing happened in the previous migration from 2.2 to 2.3).


> [Spark] Support of Null Handling in JOIN condition
> --------------------------------------------------
>
>                 Key: IGNITE-12245
>                 URL: https://issues.apache.org/jira/browse/IGNITE-12245
>             Project: Ignite
>          Issue Type: Sub-task
>          Components: spark
>    Affects Versions: 2.8
>            Reporter: Alexey Zinoviev
>            Assignee: Alexey Zinoviev
>            Priority: Major
>             Fix For: 2.8
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)