Posted to reviews@spark.apache.org by "dtenedor (via GitHub)" <gi...@apache.org> on 2023/04/07 22:41:50 UTC

[GitHub] [spark] dtenedor opened a new pull request, #40710: [SPARK-43071][SQL] Support SELECT DEFAULT with ORDER BY, LIMIT, OFFSET for INSERT source relation

dtenedor opened a new pull request, #40710:
URL: https://github.com/apache/spark/pull/40710

   ### What changes were proposed in this pull request?
   
   This PR extends column default support to allow the ORDER BY, LIMIT, and OFFSET clauses at the end of a SELECT query in the INSERT source relation.
   
   For example:
   
   ```
   create table t1(i boolean, s bigint default 42) using parquet;
   insert into t1 values (true, 41), (false, default);
   create table t2(i boolean default true, s bigint default 42, 
                   t string default 'abc') using parquet;
   insert into t2 (i, s) select default, s from t1 order by s limit 1;
   select * from t2;
   > true, 41L, "abc"
   ```
   
   ### Why are the changes needed?
   
   This improves usability and helps prevent confusing error messages.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, SQL queries that previously failed will now succeed.
   
   ### How was this patch tested?
   
   This PR adds new unit test coverage.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dtenedor commented on a diff in pull request #40710: [SPARK-43071][SQL] Support SELECT DEFAULT with ORDER BY, LIMIT, OFFSET for INSERT source relation

Posted by "dtenedor (via GitHub)" <gi...@apache.org>.
dtenedor commented on code in PR #40710:
URL: https://github.com/apache/spark/pull/40710#discussion_r1161013565


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveDefaultColumns.scala:
##########
@@ -91,6 +90,25 @@ case class ResolveDefaultColumns(catalog: SessionCatalog) extends Rule[LogicalPl
     }
   }
 
+  /**
+   * Checks if a logical plan is an INSERT INTO command where the inserted data comes from a SELECT
+   * list, with possible other unary operators like sorting and/or alias(es) in between.
+   */
+  private def insertsFromProject(i: InsertIntoStatement): Option[Project] = {
+    var node = i.query
+    def matches(node: LogicalPlan): Boolean = node match {
+      case _: GlobalLimit | _: LocalLimit | _: Offset | _: SubqueryAlias | _: Sort => true

Review Comment:
   As I understand it, the intention is to allow and resolve `default` references in the top-most `Project`, but not such references within any table subqueries. For example, this should not work:
   
   ```
   insert into t2 (i, s) select i, s from (
     select default as i, default as s from t1 order by i limit 1)
   ```
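
   The traversal this implies can be sketched with simplified stand-ins for Catalyst's plan nodes (the class shapes and `topmostProject` helper below are hypothetical, for illustration only; they are not Spark's actual API):
   
   ```scala
   // Simplified stand-ins for Catalyst logical plan nodes (hypothetical).
   sealed trait Plan { def children: Seq[Plan] }
   case class Project(exprs: Seq[String], child: Plan) extends Plan { def children = Seq(child) }
   case class Sort(child: Plan) extends Plan { def children = Seq(child) }
   case class Limit(child: Plan) extends Plan { def children = Seq(child) }
   case class Relation(name: String) extends Plan { def children = Nil }
   
   // Walk through the allowed pass-through unary operators (Sort and Limit
   // here) and return the top-most Project, if the query ends in one.
   def topmostProject(query: Plan): Option[Project] = {
     var node = query
     def passThrough(p: Plan): Boolean = p match {
       case _: Sort | _: Limit => true
       case _ => false
     }
     while (passThrough(node)) node = node.children.head
     node match {
       case p: Project => Some(p)
       case _ => None
     }
   }
   ```
   
   A `Project` inside a table subquery sits below the top-most `Project`, which is not a pass-through operator, so the walk never reaches it; only the outer select list is eligible for `default` resolution.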





[GitHub] [spark] gengliangwang commented on pull request #40710: [SPARK-43071][SQL] Support SELECT DEFAULT with ORDER BY, LIMIT, OFFSET for INSERT source relation

Posted by "gengliangwang (via GitHub)" <gi...@apache.org>.
gengliangwang commented on PR #40710:
URL: https://github.com/apache/spark/pull/40710#issuecomment-1502451369

   Thanks, merging to master/3.4




[GitHub] [spark] gengliangwang closed pull request #40710: [SPARK-43071][SQL] Support SELECT DEFAULT with ORDER BY, LIMIT, OFFSET for INSERT source relation

Posted by "gengliangwang (via GitHub)" <gi...@apache.org>.
gengliangwang closed pull request #40710: [SPARK-43071][SQL] Support SELECT DEFAULT with ORDER BY, LIMIT, OFFSET for INSERT source relation
URL: https://github.com/apache/spark/pull/40710




[GitHub] [spark] dtenedor commented on a diff in pull request #40710: [SPARK-43071][SQL] Support SELECT DEFAULT with ORDER BY, LIMIT, OFFSET for INSERT source relation

Posted by "dtenedor (via GitHub)" <gi...@apache.org>.
dtenedor commented on code in PR #40710:
URL: https://github.com/apache/spark/pull/40710#discussion_r1161011837


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveDefaultColumns.scala:
##########
@@ -91,6 +90,25 @@ case class ResolveDefaultColumns(catalog: SessionCatalog) extends Rule[LogicalPl
     }
   }
 
+  /**
+   * Checks if a logical plan is an INSERT INTO command where the inserted data comes from a SELECT
+   * list, with possible other unary operators like sorting and/or alias(es) in between.
+   */
+  private def insertsFromProject(i: InsertIntoStatement): Option[Project] = {
+    var node = i.query
+    def matches(node: LogicalPlan): Boolean = node match {
+      case _: GlobalLimit | _: LocalLimit | _: Offset | _: SubqueryAlias | _: Sort => true
+      case _ => false
+    }
+    while (matches(node)) {
+      node = node.children.head

Review Comment:
   Good question, I added a test for this:
   
   ```
   insert into t2 (i, s) select default, default from t1 inner join t1 using (i, s);
   
   > true, 42L, "abc"
   > true, 42L, "abc"
   ```
   
   We want the `default` resolution to cover the topmost `select` list before the join, but not any below, per the specification. For example, this should not work:
   
   ```
   insert into t2 (i, s) select i, s from (
     select default as i, default as s from t1)
     inner join t1
     using (i, s);
   ```
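
   In a simplified model (hypothetical stand-ins, not Spark's classes), this behavior falls out of the traversal following only the single child of pass-through unary operators: a `Join` is not pass-through, so descent stops there, and only a `Project` sitting above the join is ever returned:
   
   ```scala
   // Hypothetical, simplified plan nodes for illustration.
   sealed trait Plan
   case class Project(exprs: Seq[String], child: Plan) extends Plan
   case class Join(left: Plan, right: Plan) extends Plan
   case class Relation(name: String) extends Plan
   
   // With no pass-through operators in between, only a Project at the very
   // top qualifies; Projects inside the join branches are never visited.
   def topmostProject(query: Plan): Option[Project] = query match {
     case p: Project => Some(p)
     case _ => None
   }
   
   // select default, default from t1 inner join t1 ...  -> outer Project found,
   // its `default` references are resolved.
   val allowed = topmostProject(
     Project(Seq("default", "default"), Join(Relation("t1"), Relation("t1"))))
   
   // select i, s from (select default as i, ... from t1) inner join t1 ...
   // -> only the outer Project (with no `default` references) is seen; the
   // inner one sits below the Join and is left alone, so it fails analysis.
   val outerOnly = topmostProject(
     Project(Seq("i", "s"),
       Join(Project(Seq("default", "default"), Relation("t1")), Relation("t1"))))
   ```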
   





[GitHub] [spark] gengliangwang commented on a diff in pull request #40710: [SPARK-43071][SQL] Support SELECT DEFAULT with ORDER BY, LIMIT, OFFSET for INSERT source relation

Posted by "gengliangwang (via GitHub)" <gi...@apache.org>.
gengliangwang commented on code in PR #40710:
URL: https://github.com/apache/spark/pull/40710#discussion_r1161009871


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveDefaultColumns.scala:
##########
@@ -91,6 +90,25 @@ case class ResolveDefaultColumns(catalog: SessionCatalog) extends Rule[LogicalPl
     }
   }
 
+  /**
+   * Checks if a logical plan is an INSERT INTO command where the inserted data comes from a SELECT
+   * list, with possible other unary operators like sorting and/or alias(es) in between.
+   */
+  private def insertsFromProject(i: InsertIntoStatement): Option[Project] = {
+    var node = i.query
+    def matches(node: LogicalPlan): Boolean = node match {
+      case _: GlobalLimit | _: LocalLimit | _: Offset | _: SubqueryAlias | _: Sort => true
+      case _ => false
+    }
+    while (matches(node)) {
+      node = node.children.head

Review Comment:
   Why get only the first child? What if there is a join and one of its children has a `Project` containing `default`?





[GitHub] [spark] gengliangwang commented on a diff in pull request #40710: [SPARK-43071][SQL] Support SELECT DEFAULT with ORDER BY, LIMIT, OFFSET for INSERT source relation

Posted by "gengliangwang (via GitHub)" <gi...@apache.org>.
gengliangwang commented on code in PR #40710:
URL: https://github.com/apache/spark/pull/40710#discussion_r1161010584


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveDefaultColumns.scala:
##########
@@ -91,6 +90,25 @@ case class ResolveDefaultColumns(catalog: SessionCatalog) extends Rule[LogicalPl
     }
   }
 
+  /**
+   * Checks if a logical plan is an INSERT INTO command where the inserted data comes from a SELECT
+   * list, with possible other unary operators like sorting and/or alias(es) in between.
+   */
+  private def insertsFromProject(i: InsertIntoStatement): Option[Project] = {
+    var node = i.query
+    def matches(node: LogicalPlan): Boolean = node match {
+      case _: GlobalLimit | _: LocalLimit | _: Offset | _: SubqueryAlias | _: Sort => true

Review Comment:
   Shall we resolve all the `Project`s?


