Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2015/04/24 02:31:39 UTC

[jira] [Updated] (SPARK-6550) Add PreAnalyzer to keep logical plan consistent across DataFrame

     [ https://issues.apache.org/jira/browse/SPARK-6550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-6550:
-----------------------------
    Assignee: Michael Armbrust

> Add PreAnalyzer to keep logical plan consistent across DataFrame
> ----------------------------------------------------------------
>
>                 Key: SPARK-6550
>                 URL: https://issues.apache.org/jira/browse/SPARK-6550
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Liang-Chi Hsieh
>            Assignee: Michael Armbrust
>             Fix For: 1.3.1, 1.4.0
>
>
> h2. Problems
> In some cases, the expressions in a logical plan are modified to new ones during analysis, e.g. when handling self-joins. If some expressions are then resolved against the analyzed plan, they refer to the changed expression ids, not the original ones.
> But DataFrame transformations, e.g. {{groupBy}} and aggregation, use the logical plan to construct new DataFrames. So in such cases, the expressions in these DataFrames become inconsistent.
> The problems are specified as follows:
> # Expression ids in the logical plan can become inconsistent if they are changed during analysis and some expressions are resolved afterwards.
> When we run the following code:
> {code}
> val df = Seq(1,2,3).map(i => (i, i.toString)).toDF("int", "str")
> val df2 = df.as('x).join(df.as('y), $"x.str" === $"y.str").groupBy("y.str").min("y.int")
> {code}
> Because {{groupBy}} and {{min}} perform resolution against the analyzed logical plan, their expression ids refer to the analyzed plan instead of the logical plan.
> So the logical plan of {{df2}} looks like:
> {code}
> 'Aggregate [str#5], [str#5,MIN(int#4) AS MIN(int)#6]
>  'Join Inner, Some(('x.str = 'y.str))
>   Subquery x
>    Project [_1#0 AS int#2,_2#1 AS str#3]
>     LocalRelation [_1#0,_2#1], [[1,1],[2,2],[3,3]]
>   Subquery y
>    Project [_1#0 AS int#2,_2#1 AS str#3]
>     LocalRelation [_1#0,_2#1], [[1,1],[2,2],[3,3]]
> {code}
> As you can see, the expression ids in {{Aggregate}} are different from the expression ids in {{Subquery y}}. This is the first problem.
> # {{df2}} can't be executed
> The logical plan of {{df2}} shown above can't be executed. Because the expression ids in {{Subquery y}} are modified during analysis to handle the self-join, the analyzed plan of {{df2}} becomes:
> {code}
> Aggregate [str#5], [str#5,MIN(int#4) AS MIN(int)#6]
>  Join Inner, Some((str#3 = str#8))
>   Subquery x
>    Project [_1#0 AS int#2,_2#1 AS str#3]
>     LocalRelation [_1#0,_2#1], [[1,1],[2,2],[3,3]]
>   Subquery y
>    Project [_1#0 AS int#7,_2#1 AS str#8]
>     LocalRelation [_1#0,_2#1], [[1,1],[2,2],[3,3]]
> {code}
> The expressions referenced in {{Aggregate}} do not match those in {{Subquery y}}. This is the second problem.
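> A quick way to observe both problems from the shell (a sketch assuming the same spark-shell session and implicits as the reproduction above, and that {{DataFrame.queryExecution}} is exposed as a developer API in this version):
> {code}
> // Rebuild the example, then compare the plan the DataFrame was built from
> // with the plan produced by analysis.
> val df = Seq(1,2,3).map(i => (i, i.toString)).toDF("int", "str")
> val df2 = df.as('x).join(df.as('y), $"x.str" === $"y.str").groupBy("y.str").min("y.int")
>
> // Problem 1: in the logical plan, Aggregate refers to expression ids that
> // do not appear in Subquery y's Project.
> println(df2.queryExecution.logical)
>
> // Problem 2: the self-join rewrite gives Subquery y fresh ids, so the
> // analyzed plan's Aggregate still points at attributes that no longer exist.
> println(df2.queryExecution.analyzed)
>
> // Executing the plan is therefore expected to fail in affected versions.
> df2.collect()
> {code}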
> h2. Proposed solution
> We propose to add a {{PreAnalyzer}}. When a logical plan {{rawPlan}} is given to {{SQLContext}}, it uses the {{PreAnalyzer}} to modify the logical plan before assigning it to {{QueryExecution.logical}}. Later operations are then based on the pre-analyzed logical plan instead of the original {{rawPlan}}.
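> A minimal sketch of how a {{PreAnalyzer}} could be wired up, assuming it mirrors the structure of the existing {{Analyzer}} (a {{RuleExecutor}} over {{LogicalPlan}} in the 1.3-era Catalyst API); the batch name and the rule below are hypothetical placeholders, not the actual rules this issue proposes:
> {code}
> import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
> import org.apache.spark.sql.catalyst.rules.{Rule, RuleExecutor}
>
> // Hypothetical placeholder rule: a real implementation would rewrite the
> // plan up front (e.g. resolve the self-join aliasing) so that the
> // expression ids assigned here are the ones later operations keep seeing.
> object StabilizeSelfJoins extends Rule[LogicalPlan] {
>   def apply(plan: LogicalPlan): LogicalPlan = plan
> }
>
> class PreAnalyzer extends RuleExecutor[LogicalPlan] {
>   val batches: Seq[Batch] = Seq(
>     Batch("Pre-Resolution", Once, StabilizeSelfJoins)
>   )
> }
>
> // SQLContext would run this executor on rawPlan and assign the result to
> // QueryExecution.logical, so later resolution sees stable expression ids.
> {code}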


