You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Reynold Xin (JIRA)" <ji...@apache.org> on 2018/07/19 21:53:00 UTC
[jira] [Updated] (SPARK-24865) Remove AnalysisBarrier

     [ https://issues.apache.org/jira/browse/SPARK-24865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-24865:
--------------------------------
    Description: 
AnalysisBarrier was introduced in SPARK-20392 to improve analysis speed (don't re-analyze nodes that have already been analyzed).

Before AnalysisBarrier, we already had some infrastructure in place, with analysis specific functions (resolveOperators and resolveExpressions). These functions do not recursively traverse down subplans that are already analyzed (with a mutable boolean flag _analyzed). The issue with the old system was that developers started using transformDown, which does a top-down traversal of the plan tree, because there was not top-down resolution function, and as a result analyzer performance became pretty bad.

In order to fix the issue in SPARK-20392, AnalysisBarrier was introduced as a special node and for this special node, transform/transformUp/transformDown don't traverse down. However, the introduction of this special node caused a lot more troubles than it solves. This implicit node breaks assumptions and code in a few places, and it's hard to know when analysis barrier would exist, and when it wouldn't. Just a simple search of AnalysisBarrier in PR discussions demonstrates it is a source of bugs and additional complexity.

Instead, I think a much simpler fix to the original issue is to introduce resolveOperatorsDown, and change all places that call transformDown in the analyzer to use that. We can also ban accidental uses of the various transform* methods by using a linter (which can only lint specific packages), or in test mode inspect the stack trace and fail explicitly if transform* are called in the analyzer. 

  was:
AnalysisBarrier was introduced in SPARK-20392 to improve analysis speed (don't re-analyze nodes that have already been analyzed).

 

Before AnalysisBarrier, we already had some infrastructure in place, with analysis specific functions (resolveOperators and resolveExpressions). These functions do not recursively traverse down subplans that are already analyzed (with a mutable boolean flag _analyzed). The issue with the old system was that developers started using transformDown, which does a top-down traversal of the plan tree, because there was not top-down resolution function, and as a result analyzer performance became pretty bad.

 

In order to fix the issue in SPARK-20392, AnalysisBarrier was introduced as a special node and for this special node, transform/transformUp/transformDown don't traverse down. However, the introduction of this special node caused a lot more troubles than it solves. This implicit node breaks assumptions and code in a few places, and it's hard to know when analysis barrier would exist, and when it wouldn't. Just a simple search of AnalysisBarrier in PR discussions demonstrates it is a source of bugs and additional complexity.

 

Instead, I think a much simpler fix to the original issue is to introduce resolveOperatorsDown, and change all places that call transformDown in the analyzer to use that. We can also ban accidental uses of the various transform* methods by using a linter (which can only lint specific packages), or in test mode inspect the stack trace and fail explicitly if transform* are called in the analyzer.

 

 

 

 


> Remove AnalysisBarrier
> ----------------------
>
>                 Key: SPARK-24865
>                 URL: https://issues.apache.org/jira/browse/SPARK-24865
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Reynold Xin
>            Priority: Major
>
> AnalysisBarrier was introduced in SPARK-20392 to improve analysis speed (don't re-analyze nodes that have already been analyzed).
> Before AnalysisBarrier, we already had some infrastructure in place, with analysis specific functions (resolveOperators and resolveExpressions). These functions do not recursively traverse down subplans that are already analyzed (with a mutable boolean flag _analyzed). The issue with the old system was that developers started using transformDown, which does a top-down traversal of the plan tree, because there was not top-down resolution function, and as a result analyzer performance became pretty bad.
> In order to fix the issue in SPARK-20392, AnalysisBarrier was introduced as a special node and for this special node, transform/transformUp/transformDown don't traverse down. However, the introduction of this special node caused a lot more troubles than it solves. This implicit node breaks assumptions and code in a few places, and it's hard to know when analysis barrier would exist, and when it wouldn't. Just a simple search of AnalysisBarrier in PR discussions demonstrates it is a source of bugs and additional complexity.
> Instead, I think a much simpler fix to the original issue is to introduce resolveOperatorsDown, and change all places that call transformDown in the analyzer to use that. We can also ban accidental uses of the various transform* methods by using a linter (which can only lint specific packages), or in test mode inspect the stack trace and fail explicitly if transform* are called in the analyzer. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org