You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ignite.apache.org by "Maxim Muzafarov (Jira)" <ji...@apache.org> on 2019/10/01 15:15:00 UTC
[jira] [Updated] (IGNITE-12245) [Spark] Support of Null Handling in JOIN condition

     [ https://issues.apache.org/jira/browse/IGNITE-12245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maxim Muzafarov updated IGNITE-12245:
-------------------------------------
    Labels: await  (was: )

> [Spark] Support of Null Handling in JOIN condition
> --------------------------------------------------
>
>                 Key: IGNITE-12245
>                 URL: https://issues.apache.org/jira/browse/IGNITE-12245
>             Project: Ignite
>          Issue Type: Sub-task
>          Components: spark
>    Affects Versions: 2.8
>            Reporter: Alexey Zinoviev
>            Assignee: Alexey Zinoviev
>            Priority: Major
>              Labels: await
>             Fix For: 2.8
>
>
> Also, in Spark was fixed bug with incorrect null handling on columns in codition
> https://issues.apache.org/jira/browse/SPARK-21479
> It leads to IgniteOptimizationJoinSpec fixes (the same thing was in the previous migration from 2.2 to 2.3)
>  
> Also, in Spark was fixed bug with incorrect null handling on columns in codition
> https://issues.apache.org/jira/browse/SPARK-21479
> It leads to IgniteOptimizationJoinSpec fixes (the same thing was in the previous migration from 2.2 to 2.3)
>  
>  
> I made experiment with Spark code for version 2.3 it generates the next plan
> == Parsed Logical Plan ==
> 'Project 'jt1.id AS id1#28, 'jt1.val1, 'jt2.id AS id2#29, 'jt2.val2
> +- 'Join LeftOuter, ('jt1.val1 = 'jt2.val2)
> :- 'UnresolvedRelation `jt1`
> +- 'UnresolvedRelation `jt2`
> == Analyzed Logical Plan ==
> id1: string, val1: string, id2: string, val2: string
> Project id#10 AS id1#28, val1#11, id#24 AS id2#29, val2#25
> +- Join LeftOuter, (val1#11 = val2#25)
> :- SubqueryAlias jt1
> : +- Relationid#10,val1#11 csv
> +- SubqueryAlias jt2
> +- Relationid#24,val2#25 csv
> == Optimized Logical Plan ==
> Project id#10 AS id1#28, val1#11, id#24 AS id2#29, val2#25
> +- Join LeftOuter, (val1#11 = val2#25)
> :- Relationid#10,val1#11 csv
> +- Relationid#24,val2#25 csv
>  
> The 2.4 generates
> == Parsed Logical Plan ==
> 'Project 'jt1.id AS id1#28, 'jt1.val1, 'jt2.id AS id2#29, 'jt2.val2
> +- 'Join LeftOuter, ('jt1.val1 = 'jt2.val2)
> :- 'UnresolvedRelation `jt1`
> +- 'UnresolvedRelation `jt2`
> == Analyzed Logical Plan ==
> id1: string, val1: string, id2: string, val2: string
> Project id#10 AS id1#28, val1#11, id#24 AS id2#29, val2#25
> +- Join LeftOuter, (val1#11 = val2#25)
> :- SubqueryAlias `jt1`
> : +- Relationid#10,val1#11 csv
> +- SubqueryAlias `jt2`
> +- Relationid#24,val2#25 csv
> == Optimized Logical Plan ==
> Project id#10 AS id1#28, val1#11, id#24 AS id2#29, val2#25
> +- Join LeftOuter, (val1#11 = val2#25)
> :- Relationid#10,val1#11 csv
> +- Filter isnotnull(val2#25)
> +- Relationid#24,val2#25 csv
>  
> The +- Filter isnotnull(val2#25) is added in optimized logical plan
> But in reality it doesn't work properly (and doesn't filter in Spark), but wow! It works for Ignite (because we honestly work with Spark plan)
>  
> If you enable next option 
> .config("ignite.disableSparkSQLOptimization", "true") - the behaviour will be the same in Ignite and Spark and will not add the filter
>  
> The best approach for Spark 2.4 - disableSparkOptimization before fixing on Spark side (I could create a bug for this)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)