You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/06 14:40:43 UTC

[GitHub] [arrow-datafusion] andygrove opened a new pull request, #2172: Add LogicalPlan::AliasedRelation

andygrove opened a new pull request, #2172:
URL: https://github.com/apache/arrow-datafusion/pull/2172

   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #https://github.com/apache/arrow-datafusion/issues/2164.
   
    # Rationale for this change
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   I would like the logical plan to better represent aliased relations to help with translation to a different execution engine. Currently, original table names are discarded during SQL planning.
   
   # What changes are included in this PR?
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   Introduces new `AliasedRelation` node in `LogicalPlan` and updates the SQL planner to wrap this around `TableScan` rather than just use the alias name in the `TableScan`.
   
   # Are there any user-facing changes?
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   The query plan will look different now.
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #2172: Add LogicalPlan::AliasedRelation

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #2172:
URL: https://github.com/apache/arrow-datafusion/pull/2172#discussion_r844285723


##########
datafusion/core/src/logical_plan/builder.rs:
##########
@@ -518,6 +518,18 @@ impl LogicalPlanBuilder {
         })))
     }
 
+    /// Apply an alias
+    pub fn alias(&self, alias: &str) -> Result<Self> {

Review Comment:
   Another way that this same aliasing can be represented is a `Projection` node
   
   So if the input has schema with columns `a`, `b`, `c`
   
   Then to implement `alias("foo")` you could build a `LogicalPlanNode` that was `Project([a foo.a, b as foo.b, c as foo.c])`
   
   There may be some good reason to introduce a new type of LogicalPlanNode too that I don't understand, but I wanted to point out this alternative
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] andygrove commented on a diff in pull request #2172: Add LogicalPlan::AliasedRelation

Posted by GitBox <gi...@apache.org>.
andygrove commented on code in PR #2172:
URL: https://github.com/apache/arrow-datafusion/pull/2172#discussion_r844554857


##########
datafusion/core/src/logical_plan/builder.rs:
##########
@@ -518,6 +518,18 @@ impl LogicalPlanBuilder {
         })))
     }
 
+    /// Apply an alias
+    pub fn alias(&self, alias: &str) -> Result<Self> {

Review Comment:
   I think I need something to very specifically say that a table is being used as an alias. An easier change might be just to add an additional field to the `TableScan` to record the original table name. I think I will put up a separate PR for that approach.
   
   So to explain why I need this. I want to use DataFusion as a SQL parser and planner but I want to execute the query in a different engine. I can provide a `TableProvider` so that DataFuision can get the schema for table `person` and I get a logical plan. When I go to translate that plan to a physical plan, it refers to a table called `peeps` (the alias) and I have no way to know the actual table name.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] mingmwang commented on pull request #2172: Add LogicalPlan::AliasedRelation

Posted by GitBox <gi...@apache.org>.
mingmwang commented on PR #2172:
URL: https://github.com/apache/arrow-datafusion/pull/2172#issuecomment-1091051845

   Looks like in DataFusion, the logical Project node already support Alias. It should be similar to SparkSQL's SubqueryAlias.
   And we can also use it to represent a NamedStruct
   
   ````
   
   pub struct Projection {
       /// The list of expressions
       pub expr: Vec<Expr>,
       /// The incoming logical plan
       pub input: Arc<LogicalPlan>,
       /// The schema description of the output
       pub schema: DFSchemaRef,
       /// Projection output relation alias
       pub alias: Option<String>,
   }
   
   
   ````


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Dandandan commented on a diff in pull request #2172: Add LogicalPlan::SubqueryAlias

Posted by GitBox <gi...@apache.org>.
Dandandan commented on code in PR #2172:
URL: https://github.com/apache/arrow-datafusion/pull/2172#discussion_r846840390


##########
datafusion/core/src/logical_plan/builder.rs:
##########
@@ -518,6 +518,18 @@ impl LogicalPlanBuilder {
         })))
     }
 
+    /// Apply an alias
+    pub fn alias(&self, alias: &str) -> Result<Self> {

Review Comment:
   I think I prefer the approach taken in this PR, as it could handle subqueries rather than just table scans (as suggested by the name).
   
   I think this is a cleaner, more generic approach.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #2172: Add LogicalPlan::SubqueryAlias

Posted by GitBox <gi...@apache.org>.
alamb commented on PR #2172:
URL: https://github.com/apache/arrow-datafusion/pull/2172#issuecomment-1095627099

   Thanks @andygrove 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] andygrove commented on a diff in pull request #2172: Add LogicalPlan::SubqueryAlias

Posted by GitBox <gi...@apache.org>.
andygrove commented on code in PR #2172:
URL: https://github.com/apache/arrow-datafusion/pull/2172#discussion_r848535785


##########
datafusion/core/src/sql/planner.rs:
##########
@@ -3458,7 +3458,8 @@ mod tests {
         let expected = "Projection: #person.first_name, #person.id\
         \n  Inner Join: Using #person.id = #person2.id\
         \n    TableScan: person projection=None\
-        \n    TableScan: person2 projection=None";
+        \n    SubqueryAlias: person2\

Review Comment:
   Ok, let's leave it as is for now since we will make use of it for subqueries (soon, hopefully).
   
   I filed follow-on issues for use this approach for projection and union:
   
   - https://github.com/apache/arrow-datafusion/issues/2212
   - https://github.com/apache/arrow-datafusion/issues/2213



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] andygrove commented on a diff in pull request #2172: Add LogicalPlan::SubqueryAlias

Posted by GitBox <gi...@apache.org>.
andygrove commented on code in PR #2172:
URL: https://github.com/apache/arrow-datafusion/pull/2172#discussion_r847906358


##########
datafusion/core/src/sql/planner.rs:
##########
@@ -3458,7 +3458,8 @@ mod tests {
         let expected = "Projection: #person.first_name, #person.id\
         \n  Inner Join: Using #person.id = #person2.id\
         \n    TableScan: person projection=None\
-        \n    TableScan: person2 projection=None";
+        \n    SubqueryAlias: person2\

Review Comment:
   I originally named this `AliasedRelation` and then changed it to `SubqueryAlias` because that is what Spark uses. I don't have a strong preference either way.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #2172: Add LogicalPlan::SubqueryAlias

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #2172:
URL: https://github.com/apache/arrow-datafusion/pull/2172#discussion_r848314537


##########
datafusion/core/src/sql/planner.rs:
##########
@@ -3458,7 +3458,8 @@ mod tests {
         let expected = "Projection: #person.first_name, #person.id\
         \n  Inner Join: Using #person.id = #person2.id\
         \n    TableScan: person projection=None\
-        \n    TableScan: person2 projection=None";
+        \n    SubqueryAlias: person2\

Review Comment:
   me neither. Whatever you prefer is fine with me



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] andygrove commented on a diff in pull request #2172: Add LogicalPlan::AliasedRelation

Posted by GitBox <gi...@apache.org>.
andygrove commented on code in PR #2172:
URL: https://github.com/apache/arrow-datafusion/pull/2172#discussion_r844557174


##########
datafusion/core/src/logical_plan/builder.rs:
##########
@@ -518,6 +518,18 @@ impl LogicalPlanBuilder {
         })))
     }
 
+    /// Apply an alias
+    pub fn alias(&self, alias: &str) -> Result<Self> {

Review Comment:
   This is the Spark plan for this use case:
   
   ```
   'Project [*]
   +- 'SubqueryAlias peeps
      +- 'UnresolvedRelation [person], [], false
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] Dandandan commented on a diff in pull request #2172: Add LogicalPlan::SubqueryAlias

Posted by GitBox <gi...@apache.org>.
Dandandan commented on code in PR #2172:
URL: https://github.com/apache/arrow-datafusion/pull/2172#discussion_r846840613


##########
datafusion/core/src/sql/planner.rs:
##########
@@ -644,16 +644,15 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
                         self.schema_provider.get_table_provider(name.try_into()?),
                     ) {
                         (Some(cte_plan), _) => Ok(cte_plan.clone()),
-                        (_, Some(provider)) => LogicalPlanBuilder::scan(
-                            // take alias into account to support `JOIN table1 as table2`
-                            alias
-                                .as_ref()
-                                .map(|a| a.name.value.as_str())

Review Comment:
   👍



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] andygrove merged pull request #2172: Add LogicalPlan::SubqueryAlias

Posted by GitBox <gi...@apache.org>.
andygrove merged PR #2172:
URL: https://github.com/apache/arrow-datafusion/pull/2172


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #2172: Add LogicalPlan::SubqueryAlias

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #2172:
URL: https://github.com/apache/arrow-datafusion/pull/2172#discussion_r847773472


##########
datafusion/core/src/optimizer/projection_push_down.rs:
##########
@@ -432,6 +432,34 @@ fn optimize_plan(
                 alias: alias.clone(),
             }))
         }
+        LogicalPlan::SubqueryAlias(SubqueryAlias { input, alias, .. }) => {

Review Comment:
   is it worth a test for projection pushdown specifically?



##########
datafusion/core/src/sql/planner.rs:
##########
@@ -3458,7 +3458,8 @@ mod tests {
         let expected = "Projection: #person.first_name, #person.id\
         \n  Inner Join: Using #person.id = #person2.id\
         \n    TableScan: person projection=None\
-        \n    TableScan: person2 projection=None";
+        \n    SubqueryAlias: person2\

Review Comment:
   the appearance of `SubqueryAlias` in this query is  somewhat strange to me as there is no subquery in this SQL (maybe a better name is `RelationAlias` or something)?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] andygrove commented on pull request #2172: Add LogicalPlan::SubqueryAlias

Posted by GitBox <gi...@apache.org>.
andygrove commented on PR #2172:
URL: https://github.com/apache/arrow-datafusion/pull/2172#issuecomment-1094361371

   @alamb @Dandandan @mingmwang I have fixed the test failures and renamed `AliasedRelation` to `SubqueryAlias`. This is only implemented for `TableScan` so far but if this approach is acceptable, I will file follow-on issues to use the same approach for projection and union, which both currently have their own `alias` fields.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #2172: Add LogicalPlan::AliasedRelation

Posted by GitBox <gi...@apache.org>.
alamb commented on code in PR #2172:
URL: https://github.com/apache/arrow-datafusion/pull/2172#discussion_r844286551


##########
datafusion/core/src/logical_plan/builder.rs:
##########
@@ -518,6 +518,18 @@ impl LogicalPlanBuilder {
         })))
     }
 
+    /// Apply an alias
+    pub fn alias(&self, alias: &str) -> Result<Self> {

Review Comment:
   Ah, and I see you don't want to do that :) in the description
   
   I wonder if you could use the `DFSchema::relation` name for this purpose?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] andygrove commented on a diff in pull request #2172: Add LogicalPlan::SubqueryAlias

Posted by GitBox <gi...@apache.org>.
andygrove commented on code in PR #2172:
URL: https://github.com/apache/arrow-datafusion/pull/2172#discussion_r848534983


##########
datafusion/core/src/optimizer/projection_push_down.rs:
##########
@@ -432,6 +432,34 @@ fn optimize_plan(
                 alias: alias.clone(),
             }))
         }
+        LogicalPlan::SubqueryAlias(SubqueryAlias { input, alias, .. }) => {

Review Comment:
   Good point. Test added.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org