You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "mingmwang (via GitHub)" <gi...@apache.org> on 2023/03/13 09:30:17 UTC

[GitHub] [arrow-datafusion] mingmwang opened a new pull request, #5570: Add Analyzer phase to DataFusion , add basic validation logic to Subquery Plans and Expressions

mingmwang opened a new pull request, #5570:
URL: https://github.com/apache/arrow-datafusion/pull/5570

   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Partially close #2361, #5463.
   
   # Rationale for this change
   
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   # What changes are included in this PR?
   This PR add an Analyzer phase to DataFusion. It also adds some basic validation logic to Subquery plans and expressions.
   It does not check all the unsupported cases. This PR just come up with a framework to do the necessary checking:
   1)  The allowed outer plans to use SubQuery expressions
   2) The allowed inner plans inside SubQuery plan
   3) The supported cases/positions to refer the correlated(outer) columns inside the inner plans/inner expressions.
   
   In the current DataFusion, there are other gaps to correctly identify the out reference columns in the `SqlToRel` process.
   Those gaps will be addressed in other PRs.
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are these changes tested?
   
   <!--
   We typically require tests for all PRs in order to:
   1. Prevent the code from being accidentally broken by subsequent changes
   2. Serve as another way to document the expected behavior of the code
   
   If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
   -->
   
   # Are there any user-facing changes?
   
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] mingmwang commented on pull request #5570: Add Analyzer phase to DataFusion , add basic validation logic to Subquery Plans and Expressions

Posted by "mingmwang (via GitHub)" <gi...@apache.org>.
mingmwang commented on PR #5570:
URL: https://github.com/apache/arrow-datafusion/pull/5570#issuecomment-1467223087

   > Thanks @mingmwang -- I think this PR's structure looks great.
   > 
   > Do you plan to move the https://github.com/apache/arrow-datafusion/blob/main/datafusion/optimizer/src/type_coercion.rs logic into the analyzer as well?
   > 
   > I had some suggestions on how to reduce replication which I think would be nice but not strictly required.
   
   Sure, I can work on this in future. But recently I plan to address the subquery related gaps first.
   
   https://github.com/apache/arrow-datafusion/issues/5483
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jackwener commented on a diff in pull request #5570: Add Analyzer phase to DataFusion , add basic validation logic to Subquery Plans and Expressions

Posted by "jackwener (via GitHub)" <gi...@apache.org>.
jackwener commented on code in PR #5570:
URL: https://github.com/apache/arrow-datafusion/pull/5570#discussion_r1134968294


##########
datafusion/optimizer/src/analyzer.rs:
##########
@@ -0,0 +1,196 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::rewrite::TreeNodeRewritable;
+use datafusion_common::config::ConfigOptions;
+use datafusion_common::{DataFusionError, Result};
+use datafusion_expr::expr_visitor::inspect_expr_pre;
+use datafusion_expr::{Expr, LogicalPlan};
+use log::{debug, trace};
+use std::sync::Arc;
+use std::time::Instant;
+
+/// `AnalyzerRule` transforms the unresolved ['LogicalPlan']s and unresolved ['Expr']s into
+/// the resolved form.
+pub trait AnalyzerRule {

Review Comment:
   I feel we should add a new dir for analyzer.
   Because we can pull some analyzer rule into it.



##########
datafusion/optimizer/src/analyzer.rs:
##########
@@ -0,0 +1,196 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use crate::rewrite::TreeNodeRewritable;
+use datafusion_common::config::ConfigOptions;
+use datafusion_common::{DataFusionError, Result};
+use datafusion_expr::expr_visitor::inspect_expr_pre;
+use datafusion_expr::{Expr, LogicalPlan};
+use log::{debug, trace};
+use std::sync::Arc;
+use std::time::Instant;
+
+/// `AnalyzerRule` transforms the unresolved ['LogicalPlan']s and unresolved ['Expr']s into
+/// the resolved form.
+pub trait AnalyzerRule {
+    /// Rewrite `plan`
+    fn analyze(&self, plan: &LogicalPlan, config: &ConfigOptions) -> Result<LogicalPlan>;
+
+    /// A human readable name for this analyzer rule
+    fn name(&self) -> &str;
+}
+/// A rule-based Analyzer.
+#[derive(Clone)]
+pub struct Analyzer {
+    /// All rules to apply
+    pub rules: Vec<Arc<dyn AnalyzerRule + Send + Sync>>,
+}
+
+impl Default for Analyzer {
+    fn default() -> Self {
+        Self::new()
+    }
+}
+
+impl Analyzer {
+    /// Create a new analyzer using the recommended list of rules
+    pub fn new() -> Self {
+        let rules = vec![];
+        Self::with_rules(rules)
+    }
+
+    /// Create a new analyzer with the given rules
+    pub fn with_rules(rules: Vec<Arc<dyn AnalyzerRule + Send + Sync>>) -> Self {
+        Self { rules }
+    }
+
+    /// Analyze the logical plan by applying analyzer rules, and
+    /// do necessary check and fail the invalid plans
+    pub fn execute_and_check(
+        &self,
+        plan: &LogicalPlan,
+        config: &ConfigOptions,
+    ) -> Result<LogicalPlan>
+where {
+        let start_time = Instant::now();
+        let mut new_plan = plan.clone();
+
+        // TODO add common rule executor for Analyzer and Optimizer
+        for rule in &self.rules {
+            new_plan = rule.analyze(&new_plan, config)?;
+        }
+        check_plan(&new_plan)?;
+        log_plan("Final analyzed plan", &new_plan);
+        debug!("Analyzer took {} ms", start_time.elapsed().as_millis());
+        Ok(new_plan)
+    }
+}
+
+/// Log the plan in debug/tracing mode after some part of the optimizer runs
+fn log_plan(description: &str, plan: &LogicalPlan) {
+    debug!("{description}:\n{}\n", plan.display_indent());
+    trace!("{description}::\n{}\n", plan.display_indent_schema());
+}
+
+/// Do necessary check and fail the invalid plan
+fn check_plan(plan: &LogicalPlan) -> Result<()> {
+    plan.for_each_up(&|plan: &LogicalPlan| {
+        plan.expressions().into_iter().try_for_each(|expr| {
+            // recursively look for subqueries
+            inspect_expr_pre(&expr, |expr| match expr {
+                Expr::Exists { subquery, .. }
+                | Expr::InSubquery { subquery, .. }
+                | Expr::ScalarSubquery(subquery) => {
+                    check_subquery_expr(plan, &subquery.subquery, expr)
+                }
+                _ => Ok(()),
+            })
+        })
+    })
+}
+
+fn check_subquery_expr(

Review Comment:
   in the future, we also can move type_coercion into analyzer



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Jefffrey commented on pull request #5570: Add Analyzer phase to DataFusion , add basic validation logic to Subquery Plans and Expressions

Posted by "Jefffrey (via GitHub)" <gi...@apache.org>.
Jefffrey commented on PR #5570:
URL: https://github.com/apache/arrow-datafusion/pull/5570#issuecomment-1467495949

   Will these new analyzer phase rules be visible in output of explain commands? If not worth a follow-on ticket to track it


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #5570: Add Analyzer phase to DataFusion , add basic validation logic to Subquery Plans and Expressions

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #5570:
URL: https://github.com/apache/arrow-datafusion/pull/5570#issuecomment-1470250592

   I think this is ready to go -- any concerns about merging @mingmwang ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] mingmwang commented on pull request #5570: Add Analyzer phase to DataFusion , add basic validation logic to Subquery Plans and Expressions

Posted by "mingmwang (via GitHub)" <gi...@apache.org>.
mingmwang commented on PR #5570:
URL: https://github.com/apache/arrow-datafusion/pull/5570#issuecomment-1468178842

   > Will these new analyzer phase rules be visible in output of explain commands? If not, then worth a follow-on ticket to track it
   
   Just log a ticket to track the effort.
   https://github.com/apache/arrow-datafusion/issues/5598


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #5570: Add Analyzer phase to DataFusion , add basic validation logic to Subquery Plans and Expressions

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #5570:
URL: https://github.com/apache/arrow-datafusion/pull/5570#issuecomment-1466590221

   cc @jackwener  @ygf11 @liukun4515 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] mingmwang commented on pull request #5570: Add Analyzer phase to DataFusion , add basic validation logic to Subquery Plans and Expressions

Posted by "mingmwang (via GitHub)" <gi...@apache.org>.
mingmwang commented on PR #5570:
URL: https://github.com/apache/arrow-datafusion/pull/5570#issuecomment-1471149840

   > I think this is ready to go -- any concerns about merging @mingmwang ?
   
   Sure, please help to merge this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] jackwener merged pull request #5570: Add Analyzer phase to DataFusion , add basic validation logic to Subquery Plans and Expressions

Posted by "jackwener (via GitHub)" <gi...@apache.org>.
jackwener merged PR #5570:
URL: https://github.com/apache/arrow-datafusion/pull/5570


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #5570: Add Analyzer phase to DataFusion , add basic validation logic to Subquery Plans and Expressions

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #5570:
URL: https://github.com/apache/arrow-datafusion/pull/5570#discussion_r1134364692


##########
datafusion/optimizer/src/rewrite.rs:
##########
@@ -0,0 +1,199 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Trait to make LogicalPlan rewritable
+
+use datafusion_common::Result;
+
+use datafusion_expr::LogicalPlan;
+
+/// a Trait for marking tree node types that are rewritable
+pub trait TreeNodeRewritable: Clone {

Review Comment:
   This appears to largely be the same as https://github.com/apache/arrow-datafusion/blob/146a949218ec970784974137277cde3b4e547d0a/datafusion/physical-expr/src/rewrite.rs#L25 but they are both generic, 
   
   Would it be possible to move the trait into `datafusion_common` and only keep the impls in logical and physical plan?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] mingmwang commented on a diff in pull request #5570: Add Analyzer phase to DataFusion , add basic validation logic to Subquery Plans and Expressions

Posted by "mingmwang (via GitHub)" <gi...@apache.org>.
mingmwang commented on code in PR #5570:
URL: https://github.com/apache/arrow-datafusion/pull/5570#discussion_r1134801584


##########
datafusion/optimizer/src/rewrite.rs:
##########
@@ -0,0 +1,199 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Trait to make LogicalPlan rewritable
+
+use datafusion_common::Result;
+
+use datafusion_expr::LogicalPlan;
+
+/// a Trait for marking tree node types that are rewritable
+pub trait TreeNodeRewritable: Clone {

Review Comment:
   In the past, I had tried to move the Trait to a common place but failed due to Rust's orphan rule limitations.
   Either the trait or the type for which we are implementing the trait must be defined in the same crate. The major problem is the `Arc` struct is from the std crate.
   
   `impl TreeNodeRewritable for Arc<dyn ExecutionPlan>`
   
   `impl TreeNodeRewritable for Arc<dyn PhysicalExpr>`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org