Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/21 16:10:10 UTC

[GitHub] [arrow-datafusion] Dandandan opened a new pull request, #3923: Inline TableScans for views and dataframes

Dandandan opened a new pull request, #3923:
URL: https://github.com/apache/arrow-datafusion/pull/3923

   # Which issue does this PR close?
   
   
   Closes #3913
   
   # Rationale for this change
   
   # What changes are included in this PR?
   
   # Are there any user-facing changes?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] isidentical commented on a diff in pull request #3923: Support inlining view / dataframes logical plan

Posted by GitBox <gi...@apache.org>.
isidentical commented on code in PR #3923:
URL: https://github.com/apache/arrow-datafusion/pull/3923#discussion_r1002065650


##########
datafusion/optimizer/src/inline_table_scan.rs:
##########
@@ -0,0 +1,188 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Optimizer rule to replace TableScan references
+//! such as DataFrames and Views and inlines the LogicalPlan
+//! to support further optimization
+use crate::{OptimizerConfig, OptimizerRule};
+use datafusion_common::Result;
+use datafusion_expr::{
+    col, logical_plan::LogicalPlan, utils::from_plan, LogicalPlanBuilder, TableScan,
+};
+
+/// Optimization rule that inlines TableScan that provide a [LogicalPlan]
+/// ([DataFrame] / [ViewTable])
+#[derive(Default)]
+pub struct InlineTableScan;
+
+impl InlineTableScan {
+    #[allow(missing_docs)]
+    pub fn new() -> Self {
+        Self {}
+    }
+}
+
+/// Inline
+fn inline_table_scan(plan: &LogicalPlan) -> Result<LogicalPlan> {
+    match plan {
+        LogicalPlan::TableScan(TableScan {
+            source,
+            table_name,
+            filters,
+            fetch,
+            projected_schema,
+            ..
+        }) => {
+            if let Some(sub_plan) = source.get_logical_plan() {
+                // Recurse into scan
+                let plan = inline_table_scan(sub_plan)?;
+                let mut plan = LogicalPlanBuilder::from(plan).project_with_alias(
+                    projected_schema
+                        .fields()
+                        .iter()
+                        .map(|field| col(field.name())),
+                    Some(table_name.clone()),
+                )?;
+                for filter in filters {

Review Comment:
   Question: would it be better to handle this logic inside `get_logical_plan` (passing limit/filter/projections)? If I am not missing anything, we make some small changes when building the plan for the view (e.g. avoiding redundant projections), and we might miss them here if we now have two paths. (I guess the following optimizer rules can take care of this particular example, but this is just future-proofing.)





[GitHub] [arrow-datafusion] Dandandan commented on a diff in pull request #3923: Support inlining view / dataframes logical plan

Posted by GitBox <gi...@apache.org>.
Dandandan commented on code in PR #3923:
URL: https://github.com/apache/arrow-datafusion/pull/3923#discussion_r1002070063


##########
datafusion/optimizer/src/inline_table_scan.rs:
##########
@@ -0,0 +1,188 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Optimizer rule to replace TableScan references
+//! such as DataFrames and Views and inlines the LogicalPlan
+//! to support further optimization
+use crate::{OptimizerConfig, OptimizerRule};
+use datafusion_common::Result;
+use datafusion_expr::{
+    col, logical_plan::LogicalPlan, utils::from_plan, LogicalPlanBuilder, TableScan,
+};
+
+/// Optimization rule that inlines TableScan that provide a [LogicalPlan]
+/// ([DataFrame] / [ViewTable])
+#[derive(Default)]
+pub struct InlineTableScan;
+
+impl InlineTableScan {
+    #[allow(missing_docs)]
+    pub fn new() -> Self {
+        Self {}
+    }
+}
+
+/// Inline
+fn inline_table_scan(plan: &LogicalPlan) -> Result<LogicalPlan> {
+    match plan {
+        LogicalPlan::TableScan(TableScan {
+            source,
+            table_name,
+            filters,
+            fetch,
+            projected_schema,
+            ..
+        }) => {
+            if let Some(sub_plan) = source.get_logical_plan() {
+                // Recurse into scan
+                let plan = inline_table_scan(sub_plan)?;
+                let mut plan = LogicalPlanBuilder::from(plan).project_with_alias(
+                    projected_schema
+                        .fields()
+                        .iter()
+                        .map(|field| col(field.name())),
+                    Some(table_name.clone()),
+                )?;
+                for filter in filters {

Review Comment:
   I think we can probably even remove the filter / limit handling here, as table scans for views/dataframes don't have those. Will remove the handling if possible.





[GitHub] [arrow-datafusion] Dandandan commented on a diff in pull request #3923: Support inlining view / dataframes logical plan

Posted by GitBox <gi...@apache.org>.
Dandandan commented on code in PR #3923:
URL: https://github.com/apache/arrow-datafusion/pull/3923#discussion_r1002498212


##########
datafusion/optimizer/src/projection_push_down.rs:
##########
@@ -527,7 +527,9 @@ fn optimize_plan(
 }
 
 fn projection_equal(p: &Projection, p2: &Projection) -> bool {
-    p.expr.len() == p2.expr.len() && p.expr.iter().zip(&p2.expr).all(|(l, r)| l == r)
+    p.expr.len() == p2.expr.len()
+        && p.alias == p2.alias

Review Comment:
   FYI @andygrove: a small change, because otherwise it also removes projections that have a (different) alias.
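
To illustrate why the alias check matters, here is a minimal sketch using a toy `Projection` struct. It is not DataFusion's real type: expressions are modelled as plain strings, and `projection_equal_old` is a hypothetical name for the pre-change comparison. The rest of the hunk is not shown in the diff above, so the final `zip`/`all` clause of the new version is an assumption carried over from the removed line.

```rust
// Toy stand-in for the optimizer's `Projection` node (assumption: the real
// struct holds `Expr` values and an input plan; only `expr` and `alias`
// appear in the diff above).
#[derive(Debug, Clone, PartialEq)]
struct Projection {
    expr: Vec<String>,
    alias: Option<String>,
}

// Comparison as it was before the change: expressions only.
fn projection_equal_old(p: &Projection, p2: &Projection) -> bool {
    p.expr.len() == p2.expr.len() && p.expr.iter().zip(&p2.expr).all(|(l, r)| l == r)
}

// Comparison after the change: the alias must match as well.
fn projection_equal(p: &Projection, p2: &Projection) -> bool {
    p.expr.len() == p2.expr.len()
        && p.alias == p2.alias
        && p.expr.iter().zip(&p2.expr).all(|(l, r)| l == r)
}
```

With the old comparison, an aliased projection over an identical unaliased one looks like a duplicate and can be dropped, which loses the alias that the inlined view relies on.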





[GitHub] [arrow-datafusion] Dandandan commented on pull request #3923: Support inlining view / dataframes logical plan

Posted by GitBox <gi...@apache.org>.
Dandandan commented on PR #3923:
URL: https://github.com/apache/arrow-datafusion/pull/3923#issuecomment-1287813937

   Going to merge this when everything is green




[GitHub] [arrow-datafusion] Dandandan commented on a diff in pull request #3923: Support inlining view / dataframes logical plan

Posted by GitBox <gi...@apache.org>.
Dandandan commented on code in PR #3923:
URL: https://github.com/apache/arrow-datafusion/pull/3923#discussion_r1002126516


##########
datafusion/optimizer/src/inline_table_scan.rs:
##########
@@ -0,0 +1,184 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Optimizer rule to replace TableScan references
+//! such as DataFrames and Views and inlines the LogicalPlan
+//! to support further optimization
+use crate::{OptimizerConfig, OptimizerRule};
+use datafusion_common::Result;
+use datafusion_expr::{
+    logical_plan::LogicalPlan, utils::from_plan, Expr, LogicalPlanBuilder, TableScan,
+};
+
+/// Optimization rule that inlines TableScan that provide a [LogicalPlan]
+/// ([DataFrame] / [ViewTable])
+#[derive(Default)]
+pub struct InlineTableScan;
+
+impl InlineTableScan {
+    #[allow(missing_docs)]
+    pub fn new() -> Self {
+        Self {}
+    }
+}
+
+/// Inline
+fn inline_table_scan(plan: &LogicalPlan) -> Result<LogicalPlan> {
+    match plan {
+        // Match only on scans without filter/projection
+        // As DataFrames / Views don't have those

Review Comment:
   I believe filters/projections/limit etc. won't be on the table scan directly for views / dataframes, so I removed the handling, as it would be mostly dead code (and would require more tests to cover those cases).
   Handling table scans with those set might be a future possibility, or there may be a good use case for it with custom table providers; not sure...





[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #3923: Support inlining view / dataframes logical plan

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #3923:
URL: https://github.com/apache/arrow-datafusion/pull/3923#discussion_r1002113077


##########
datafusion/optimizer/src/inline_table_scan.rs:
##########
@@ -0,0 +1,184 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Optimizer rule to replace TableScan references
+//! such as DataFrames and Views and inlines the LogicalPlan

Review Comment:
   👍 



##########
datafusion/optimizer/src/inline_table_scan.rs:
##########
@@ -0,0 +1,184 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Optimizer rule to replace TableScan references
+//! such as DataFrames and Views and inlines the LogicalPlan
+//! to support further optimization
+use crate::{OptimizerConfig, OptimizerRule};
+use datafusion_common::Result;
+use datafusion_expr::{
+    logical_plan::LogicalPlan, utils::from_plan, Expr, LogicalPlanBuilder, TableScan,
+};
+
+/// Optimization rule that inlines TableScan that provide a [LogicalPlan]
+/// ([DataFrame] / [ViewTable])
+#[derive(Default)]
+pub struct InlineTableScan;
+
+impl InlineTableScan {
+    #[allow(missing_docs)]
+    pub fn new() -> Self {
+        Self {}
+    }
+}
+
+/// Inline
+fn inline_table_scan(plan: &LogicalPlan) -> Result<LogicalPlan> {
+    match plan {
+        // Match only on scans without filter/projection
+        // As DataFrames / Views don't have those

Review Comment:
   I don't see why we couldn't also create a projection / limit node as part of this rewrite as well if the table scan had them -- maybe we could file that as a future optimization



##########
datafusion/optimizer/src/inline_table_scan.rs:
##########
@@ -0,0 +1,184 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Optimizer rule to replace TableScan references
+//! such as DataFrames and Views and inlines the LogicalPlan
+//! to support further optimization
+use crate::{OptimizerConfig, OptimizerRule};
+use datafusion_common::Result;
+use datafusion_expr::{
+    logical_plan::LogicalPlan, utils::from_plan, Expr, LogicalPlanBuilder, TableScan,
+};
+
+/// Optimization rule that inlines TableScan that provide a [LogicalPlan]
+/// ([DataFrame] / [ViewTable])
+#[derive(Default)]
+pub struct InlineTableScan;
+
+impl InlineTableScan {
+    #[allow(missing_docs)]
+    pub fn new() -> Self {
+        Self {}
+    }
+}
+
+/// Inline
+fn inline_table_scan(plan: &LogicalPlan) -> Result<LogicalPlan> {
+    match plan {
+        // Match only on scans without filter/projection
+        // As DataFrames / Views don't have those
+        LogicalPlan::TableScan(TableScan {
+            source,
+            table_name,
+            filters,
+            fetch: None,
+            projected_schema,
+            projection: None,
+        }) if filters.is_empty() => {

Review Comment:
   Likewise, if it has filters, we could add a LogicalPlan::Filter here I think
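
The suggested follow-up (wrapping the inlined plan in filter and limit nodes when the scan carries them) could look roughly like the sketch below. It uses a toy `Plan` enum rather than DataFusion's `LogicalPlan`; the `view` field is a hypothetical stand-in for `source.get_logical_plan()`, and predicates are plain strings. Projection handling is left out to keep it short.

```rust
// Toy logical plan (assumption: not DataFusion's `LogicalPlan`).
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    // `view` is Some(..) when the source exposes its own logical plan,
    // as a view or dataframe would.
    Scan {
        table: String,
        view: Option<Box<Plan>>,
        filters: Vec<String>,
        fetch: Option<usize>,
    },
    Filter { predicate: String, input: Box<Plan> },
    Limit { fetch: usize, input: Box<Plan> },
}

// Replace scans over views with the view's plan, re-applying any filters
// and fetch from the scan as explicit nodes on top of it.
fn inline_scan(plan: &Plan) -> Plan {
    match plan {
        Plan::Scan { view: Some(sub_plan), filters, fetch, .. } => {
            // Recurse first: the view may itself scan another view.
            let mut inlined = inline_scan(sub_plan);
            for predicate in filters {
                inlined = Plan::Filter {
                    predicate: predicate.clone(),
                    input: Box::new(inlined),
                };
            }
            // The limit goes above the filters, matching scan semantics.
            if let Some(n) = fetch {
                inlined = Plan::Limit { fetch: *n, input: Box::new(inlined) };
            }
            inlined
        }
        // Plain table scans are left untouched.
        Plan::Scan { .. } => plan.clone(),
        Plan::Filter { predicate, input } => Plan::Filter {
            predicate: predicate.clone(),
            input: Box::new(inline_scan(input)),
        },
        Plan::Limit { fetch, input } => Plan::Limit {
            fetch: *fetch,
            input: Box::new(inline_scan(input)),
        },
    }
}
```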



##########
datafusion/optimizer/src/optimizer.rs:
##########
@@ -148,6 +149,7 @@ impl Optimizer {
     /// Create a new optimizer using the recommended list of rules
     pub fn new(config: &OptimizerConfig) -> Self {
         let mut rules: Vec<Arc<dyn OptimizerRule + Sync + Send>> = vec![
+            Arc::new(InlineTableScan::new()),

Review Comment:
   I agree
   





[GitHub] [arrow-datafusion] Dandandan commented on pull request #3923: Support inlining view / dataframes logical plan

Posted by GitBox <gi...@apache.org>.
Dandandan commented on PR #3923:
URL: https://github.com/apache/arrow-datafusion/pull/3923#issuecomment-1287812067

   Turns out a small bug was introduced recently when avoiding duplicate projections during projection pushdown




[GitHub] [arrow-datafusion] Dandandan merged pull request #3923: Support inlining view / dataframes logical plan

Posted by GitBox <gi...@apache.org>.
Dandandan merged PR #3923:
URL: https://github.com/apache/arrow-datafusion/pull/3923




[GitHub] [arrow-datafusion] Dandandan commented on a diff in pull request #3923: Support inlining view / dataframes logical plan

Posted by GitBox <gi...@apache.org>.
Dandandan commented on code in PR #3923:
URL: https://github.com/apache/arrow-datafusion/pull/3923#discussion_r1002098890


##########
datafusion/optimizer/src/optimizer.rs:
##########
@@ -148,6 +149,7 @@ impl Optimizer {
     /// Create a new optimizer using the recommended list of rules
     pub fn new(config: &OptimizerConfig) -> Self {
         let mut rules: Vec<Arc<dyn OptimizerRule + Sync + Send>> = vec![
+            Arc::new(InlineTableScan::new()),

Review Comment:
   @alamb @andygrove this might also be a good candidate for an `Analysis` phase.





[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #3923: Support inlining view / dataframes logical plan

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #3923:
URL: https://github.com/apache/arrow-datafusion/pull/3923#discussion_r1002129176


##########
datafusion/optimizer/src/inline_table_scan.rs:
##########
@@ -0,0 +1,184 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Optimizer rule to replace TableScan references
+//! such as DataFrames and Views and inlines the LogicalPlan
+//! to support further optimization
+use crate::{OptimizerConfig, OptimizerRule};
+use datafusion_common::Result;
+use datafusion_expr::{
+    logical_plan::LogicalPlan, utils::from_plan, Expr, LogicalPlanBuilder, TableScan,
+};
+
+/// Optimization rule that inlines TableScan that provide a [LogicalPlan]
+/// ([DataFrame] / [ViewTable])
+#[derive(Default)]
+pub struct InlineTableScan;
+
+impl InlineTableScan {
+    #[allow(missing_docs)]
+    pub fn new() -> Self {
+        Self {}
+    }
+}
+
+/// Inline
+fn inline_table_scan(plan: &LogicalPlan) -> Result<LogicalPlan> {
+    match plan {
+        // Match only on scans without filter/projection
+        // As DataFrames / Views don't have those

Review Comment:
   So maybe the comment could be updated to say "table scan won't have projection / filters at this stage" (especially if this is run as one of the first optimizer passes)
   
   We could also potentially add a `debug!` log if they were ever not `None` to hint to someone in the future it could be improved
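
A rough sketch of that idea, assuming a toy `Scan` struct rather than DataFusion's `TableScan`: the restrictive match stays as it is, and a fallback arm records a hint when a scan over a view carries filters, a fetch, or a projection. The names `Decision`, `decide`, and `has_inner_plan` are hypothetical, and `eprintln!` stands in for the `debug!` macro from the `log` crate so the example needs no dependencies.

```rust
// Toy scan node (assumption: field names loosely follow the diff above).
#[derive(Debug, Clone, PartialEq)]
struct Scan {
    table_name: String,
    // Hypothetical stand-in for `source.get_logical_plan().is_some()`.
    has_inner_plan: bool,
    filters: Vec<String>,
    fetch: Option<usize>,
    projection: Option<Vec<usize>>,
}

#[derive(Debug, PartialEq)]
enum Decision {
    Inline,
    SkipWithHint(String),
    NotInlinable,
}

fn decide(scan: &Scan) -> Decision {
    match scan {
        // Ordinary tables have no plan to inline.
        Scan { has_inner_plan: false, .. } => Decision::NotInlinable,
        // Table scans are not expected to carry projection / filters /
        // fetch at this stage, so only the bare case is inlined.
        Scan { fetch: None, projection: None, filters, .. } if filters.is_empty() => {
            Decision::Inline
        }
        // Anything else is skipped, with a hint for future improvement.
        Scan { table_name, .. } => {
            let hint = format!(
                "scan of '{}' has filters/fetch/projection set; inlining skipped",
                table_name
            );
            eprintln!("{}", hint); // stand-in for log::debug!
            Decision::SkipWithHint(hint)
        }
    }
}
```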





[GitHub] [arrow-datafusion] Dandandan commented on pull request #3923: Support inlining view / dataframes logical plan

Posted by GitBox <gi...@apache.org>.
Dandandan commented on PR #3923:
URL: https://github.com/apache/arrow-datafusion/pull/3923#issuecomment-1287673738

   There seems to be a subtle bug where the projection with alias comes out wrong. Needs more investigation.

