Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/08/07 21:28:57 UTC

[GitHub] [spark] dtenedor opened a new pull request, #37430: [SPARK-40000][SQL] Add config to toggle whether to automatically add default values for INSERTs without user-specified fields

dtenedor opened a new pull request, #37430:
URL: https://github.com/apache/spark/pull/37430

   ### What changes were proposed in this pull request?
   
   Add config to toggle whether to automatically add default values for INSERTs without user-specified fields.
   
   Example:
   
   ```
   CREATE TABLE t (a INT DEFAULT 1, b INT DEFAULT 2) USING PARQUET;
   INSERT INTO t VALUES (42);
   ```
   
With the new config set to true, a subsequent `SELECT * FROM t` returns `42, 2`. Otherwise, if the config is set to false, the `INSERT` command fails with an error message reporting that the table has two columns but the command only inserted one.
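
The two behaviors described above can be sketched outside of Spark. As an illustrative analogy only (this uses SQLite via Python's `sqlite3` module, not Spark itself), SQLite applies the strict "config set to false" semantics unconditionally, while an explicit column list yields the default-filling "config set to true" result:

```python
import sqlite3

# SQLite rejects an INSERT that supplies fewer values than the table has
# columns when no column list is given -- the "config set to false" case.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INT DEFAULT 1, b INT DEFAULT 2)")
try:
    conn.execute("INSERT INTO t VALUES (42)")
except sqlite3.OperationalError as e:
    print("rejected:", e)

# With an explicit column list, the omitted column receives its declared
# DEFAULT, matching the "config set to true" result of (42, 2).
conn.execute("INSERT INTO t(a) VALUES (42)")
print(conn.execute("SELECT * FROM t").fetchall())  # [(42, 2)]
```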
   
   ### Why are the changes needed?
   
Depending on the desired SQL semantics, it may be preferable to reject these INSERT commands rather than silently filling in default values, guarding against accidental mistakes.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, see above.
   
   ### How was this patch tested?
   
   Updated unit test coverage.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang closed pull request #37430: [SPARK-40000][SQL] Update INSERTs without user-specified fields to not automatically add default values

Posted by GitBox <gi...@apache.org>.
gengliangwang closed pull request #37430: [SPARK-40000][SQL] Update INSERTs without user-specified fields to not automatically add default values
URL: https://github.com/apache/spark/pull/37430




[GitHub] [spark] dtenedor commented on a diff in pull request #37430: [SPARK-40000][SQL] Add config to toggle whether to automatically add default values for INSERTs without user-specified fields

Posted by GitBox <gi...@apache.org>.
dtenedor commented on code in PR #37430:
URL: https://github.com/apache/spark/pull/37430#discussion_r943829643


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -2935,6 +2935,16 @@ object SQLConf {
       .booleanConf
       .createWithDefault(false)
 
+  val ADD_DEFAULTS_FOR_INSERTS_WITHOUT_USER_SPECIFIED_FIELDS =
+    buildConf("spark.sql.defaultColumn.addDefaultsForInsertsWithoutUserSpecifiedFields")
+      .internal()
+      .doc("When true, for each INSERT command without any user-specified fields where the " +

Review Comment:
   Good questions:
   
   * The `user-specified fields` are the explicit columns listed after the table name in the INSERT command, e.g. when the command looks like `INSERT INTO mytable(columnA, columnB) VALUES (...)`.
   * If `spark.sql.defaultColumn.useNullsForMissingDefaultValues` is true and one of these missing columns does not have an explicit DEFAULT value, then NULL will be appended. Otherwise, the command will fail with an error message reporting that the number of columns provided is incorrect.
   
   I added this information to the config description here.
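
The NULL-fallback behavior described in the second bullet can be illustrated by analogy. This is a hypothetical sketch using SQLite via Python's `sqlite3` module (not Spark): when an explicit column list omits a column that has no declared DEFAULT, the engine falls back to NULL for it, which is the behavior the `useNullsForMissingDefaultValues` config enables in Spark:

```python
import sqlite3

# Illustrative analogy only (SQLite, not Spark): column b has no declared
# DEFAULT, so omitting it from the column list yields NULL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INT, b INT)")
conn.execute("INSERT INTO t(a) VALUES (1)")
print(conn.execute("SELECT * FROM t").fetchall())  # [(1, None)]
```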





[GitHub] [spark] gengliangwang commented on pull request #37430: [SPARK-40000][SQL] Add config to toggle whether to automatically add default values for INSERTs without user-specified fields

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on PR #37430:
URL: https://github.com/apache/spark/pull/37430#issuecomment-1213340871

   Had an offline discussion with @dtenedor. Let's make it simple and strict for now: throw an exception if an INSERT provides fewer columns than the target table and there is no insert column list.
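
The "simple and strict" rule proposed here can be sketched as a standalone check. The helper `check_insert` below is hypothetical, for illustration only; it is not Spark's implementation:

```python
# Hypothetical sketch of the proposed rule (not Spark code): reject an
# INSERT that supplies fewer values than the table has columns, unless
# the user gave an explicit insert column list.
def check_insert(num_table_columns: int, num_values: int,
                 has_column_list: bool) -> None:
    if not has_column_list and num_values < num_table_columns:
        raise ValueError(
            f"target table has {num_table_columns} column(s) "
            f"but the inserted data has {num_values} column(s)")

check_insert(2, 2, has_column_list=False)  # ok: all columns supplied
check_insert(2, 1, has_column_list=True)   # ok: explicit column list
try:
    check_insert(2, 1, has_column_list=False)
except ValueError as e:
    print("rejected:", e)
```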




[GitHub] [spark] gengliangwang commented on a diff in pull request #37430: [SPARK-40000][SQL] Update INSERTs without user-specified fields to not automatically add default values

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on code in PR #37430:
URL: https://github.com/apache/spark/pull/37430#discussion_r946007905


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveDefaultColumns.scala:
##########
@@ -265,17 +265,19 @@ case class ResolveDefaultColumns(catalog: SessionCatalog) extends Rule[LogicalPl
    */
   private def addMissingDefaultValuesForInsertFromInlineTable(
       node: LogicalPlan,
-      insertTableSchemaWithoutPartitionColumns: StructType): LogicalPlan = {
+      insertTableSchemaWithoutPartitionColumns: StructType,
+      numUserSpecifiedFields: Integer): LogicalPlan = {

Review Comment:
   ```suggestion
         numUserSpecifiedFields: Int): LogicalPlan = {
   ```





[GitHub] [spark] gengliangwang commented on a diff in pull request #37430: [SPARK-40000][SQL] Update INSERTs without user-specified fields to not automatically add default values

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on code in PR #37430:
URL: https://github.com/apache/spark/pull/37430#discussion_r944804283


##########
sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala:
##########
@@ -863,25 +863,17 @@ class InsertSuite extends DataSourceTest with SharedSparkSession {
   }
 
   test("Allow user to insert specified columns into insertable view") {
-    withSQLConf(SQLConf.USE_NULLS_FOR_MISSING_DEFAULT_COLUMN_VALUES.key -> "true") {
-      sql("INSERT OVERWRITE TABLE jsonTable SELECT a FROM jt")
-      checkAnswer(
-        sql("SELECT a, b FROM jsonTable"),
-        (1 to 10).map(i => Row(i, null))
-      )
-
-      sql("INSERT OVERWRITE TABLE jsonTable(a) SELECT a FROM jt")
-      checkAnswer(
-        sql("SELECT a, b FROM jsonTable"),
-        (1 to 10).map(i => Row(i, null))
-      )
+    sql("INSERT OVERWRITE TABLE jsonTable(a) SELECT a FROM jt")
+    checkAnswer(
+      sql("SELECT a, b FROM jsonTable"),
+      (1 to 10).map(i => Row(i, null))

Review Comment:
   ok, so in Spark 3.3:
   ```
   > create table foo(a int, b int);
   > insert into foo (a) values (1);
   Error in query: `default`.`foo` requires that the data to be inserted have the same number of columns as the target table: target table has 2 column(s) but the inserted data has 1 column(s), including 0 partition column(s) having constant value(s).
   ```
   
   It seems the column DEFAULT project introduces a breaking change.





[GitHub] [spark] dtenedor commented on a diff in pull request #37430: [SPARK-40000][SQL] Update INSERTs without user-specified fields to not automatically add default values

Posted by GitBox <gi...@apache.org>.
dtenedor commented on code in PR #37430:
URL: https://github.com/apache/spark/pull/37430#discussion_r944812583


##########
sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala:
##########
@@ -863,25 +863,17 @@ class InsertSuite extends DataSourceTest with SharedSparkSession {
   }
 
   test("Allow user to insert specified columns into insertable view") {
-    withSQLConf(SQLConf.USE_NULLS_FOR_MISSING_DEFAULT_COLUMN_VALUES.key -> "true") {
-      sql("INSERT OVERWRITE TABLE jsonTable SELECT a FROM jt")
-      checkAnswer(
-        sql("SELECT a, b FROM jsonTable"),
-        (1 to 10).map(i => Row(i, null))
-      )
-
-      sql("INSERT OVERWRITE TABLE jsonTable(a) SELECT a FROM jt")
-      checkAnswer(
-        sql("SELECT a, b FROM jsonTable"),
-        (1 to 10).map(i => Row(i, null))
-      )
+    sql("INSERT OVERWRITE TABLE jsonTable(a) SELECT a FROM jt")
+    checkAnswer(
+      sql("SELECT a, b FROM jsonTable"),
+      (1 to 10).map(i => Row(i, null))

Review Comment:
   As discussed offline, this is a change but not a breaking one: the `insert into foo (a) values (1)` command fails in Spark 3.3 but now succeeds, turning an error case into a successful one. This is part of the intentional behavior change of the column DEFAULT project.





[GitHub] [spark] gengliangwang commented on pull request #37430: [SPARK-40000][SQL] Update INSERTs without user-specified fields to not automatically add default values

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on PR #37430:
URL: https://github.com/apache/spark/pull/37430#issuecomment-1213617047

   @dtenedor let's update the migration guide https://github.com/apache/spark/blob/master/docs/sql-migration-guide.md since this changes the default behavior.




[GitHub] [spark] dtenedor commented on a diff in pull request #37430: [SPARK-40000][SQL] Update INSERTs without user-specified fields to not automatically add default values

Posted by GitBox <gi...@apache.org>.
dtenedor commented on code in PR #37430:
URL: https://github.com/apache/spark/pull/37430#discussion_r944783282


##########
sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala:
##########
@@ -863,25 +863,17 @@ class InsertSuite extends DataSourceTest with SharedSparkSession {
   }
 
   test("Allow user to insert specified columns into insertable view") {
-    withSQLConf(SQLConf.USE_NULLS_FOR_MISSING_DEFAULT_COLUMN_VALUES.key -> "true") {
-      sql("INSERT OVERWRITE TABLE jsonTable SELECT a FROM jt")
-      checkAnswer(
-        sql("SELECT a, b FROM jsonTable"),
-        (1 to 10).map(i => Row(i, null))
-      )
-
-      sql("INSERT OVERWRITE TABLE jsonTable(a) SELECT a FROM jt")
-      checkAnswer(
-        sql("SELECT a, b FROM jsonTable"),
-        (1 to 10).map(i => Row(i, null))
-      )
+    sql("INSERT OVERWRITE TABLE jsonTable(a) SELECT a FROM jt")
+    checkAnswer(
+      sql("SELECT a, b FROM jsonTable"),
+      (1 to 10).map(i => Row(i, null))

Review Comment:
   Good question: no, because that config only applied to `INSERT INTO` commands without explicit user-specified column lists. Because this command specifies the explicit column list `jsonTable(a)`, we add DEFAULT values for the remaining columns in the row. This matches the previous behavior, where we added NULL values for the remaining columns in the row.
   
   Screenshot: https://user-images.githubusercontent.com/99207096/184430799-e76b62f8-8c2b-4cae-9fb1-73da8951ffc6.png
   





[GitHub] [spark] gengliangwang commented on pull request #37430: [SPARK-40000][SQL] Update INSERTs without user-specified fields to not automatically add default values

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on PR #37430:
URL: https://github.com/apache/spark/pull/37430#issuecomment-1214197686

   @dtenedor FYI there is a conflict with the master branch now.




[GitHub] [spark] gengliangwang commented on pull request #37430: [SPARK-40000][SQL] Update INSERTs without user-specified fields to not automatically add default values

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on PR #37430:
URL: https://github.com/apache/spark/pull/37430#issuecomment-1219998299

   @dongjoon-hyun Thanks for the note!




[GitHub] [spark] gengliangwang commented on a diff in pull request #37430: [SPARK-40000][SQL] Update INSERTs without user-specified fields to not automatically add default values

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on code in PR #37430:
URL: https://github.com/apache/spark/pull/37430#discussion_r944696774


##########
sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala:
##########
@@ -863,25 +863,17 @@ class InsertSuite extends DataSourceTest with SharedSparkSession {
   }
 
   test("Allow user to insert specified columns into insertable view") {
-    withSQLConf(SQLConf.USE_NULLS_FOR_MISSING_DEFAULT_COLUMN_VALUES.key -> "true") {
-      sql("INSERT OVERWRITE TABLE jsonTable SELECT a FROM jt")
-      checkAnswer(
-        sql("SELECT a, b FROM jsonTable"),
-        (1 to 10).map(i => Row(i, null))
-      )
-
-      sql("INSERT OVERWRITE TABLE jsonTable(a) SELECT a FROM jt")
-      checkAnswer(
-        sql("SELECT a, b FROM jsonTable"),
-        (1 to 10).map(i => Row(i, null))
-      )
+    sql("INSERT OVERWRITE TABLE jsonTable(a) SELECT a FROM jt")
+    checkAnswer(
+      sql("SELECT a, b FROM jsonTable"),
+      (1 to 10).map(i => Row(i, null))

Review Comment:
   So the behavior of `USE_NULLS_FOR_MISSING_DEFAULT_COLUMN_VALUES` is now on by default, without requiring the configuration?





[GitHub] [spark] gengliangwang commented on a diff in pull request #37430: [SPARK-40000][SQL] Add config to toggle whether to automatically add default values for INSERTs without user-specified fields

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on code in PR #37430:
URL: https://github.com/apache/spark/pull/37430#discussion_r943025992


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -2935,6 +2935,16 @@ object SQLConf {
       .booleanConf
       .createWithDefault(false)
 
+  val ADD_DEFAULTS_FOR_INSERTS_WITHOUT_USER_SPECIFIED_FIELDS =
+    buildConf("spark.sql.defaultColumn.addDefaultsForInsertsWithoutUserSpecifiedFields")
+      .internal()
+      .doc("When true, for each INSERT command without any user-specified fields where the " +

Review Comment:
   So the configuration is to fill in default values. If `spark.sql.defaultColumn.useNullsForMissingDefaultValues` is disabled and one of the filled columns doesn't have a default value, there will be an error, right?





[GitHub] [spark] gengliangwang commented on a diff in pull request #37430: [SPARK-40000][SQL] Update INSERTs without user-specified fields to not automatically add default values

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on code in PR #37430:
URL: https://github.com/apache/spark/pull/37430#discussion_r946009980


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveDefaultColumns.scala:
##########
@@ -298,11 +300,12 @@ case class ResolveDefaultColumns(catalog: SessionCatalog) extends Rule[LogicalPl
    */
   private def addMissingDefaultValuesForInsertFromProject(
       project: Project,
-      insertTableSchemaWithoutPartitionColumns: StructType): Project = {
+      insertTableSchemaWithoutPartitionColumns: StructType,
+      numUserSpecifiedFields: Integer): Project = {

Review Comment:
   ```suggestion
         numUserSpecifiedFields: Int): Project = {
   ```





[GitHub] [spark] gengliangwang commented on pull request #37430: [SPARK-40000][SQL] Update INSERTs without user-specified fields to not automatically add default values

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on PR #37430:
URL: https://github.com/apache/spark/pull/37430#issuecomment-1216088021

   Thanks, merging to master




[GitHub] [spark] dtenedor commented on pull request #37430: [SPARK-40000][SQL] Update INSERTs without user-specified fields to not automatically add default values

Posted by GitBox <gi...@apache.org>.
dtenedor commented on PR #37430:
URL: https://github.com/apache/spark/pull/37430#issuecomment-1213457469

   Hi @gengliangwang, I responded to the comments; this is ready for another round when you are :)
   
   




[GitHub] [spark] gengliangwang commented on a diff in pull request #37430: [SPARK-40000][SQL] Add config to toggle whether to automatically add default values for INSERTs without user-specified fields

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on code in PR #37430:
URL: https://github.com/apache/spark/pull/37430#discussion_r943025312


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -2935,6 +2935,16 @@ object SQLConf {
       .booleanConf
       .createWithDefault(false)
 
+  val ADD_DEFAULTS_FOR_INSERTS_WITHOUT_USER_SPECIFIED_FIELDS =
+    buildConf("spark.sql.defaultColumn.addDefaultsForInsertsWithoutUserSpecifiedFields")
+      .internal()
+      .doc("When true, for each INSERT command without any user-specified fields where the " +

Review Comment:
   What does `user-specified fields` mean here?





[GitHub] [spark] dtenedor commented on pull request #37430: [SPARK-40000][SQL] Update INSERTs without user-specified fields to not automatically add default values

Posted by GitBox <gi...@apache.org>.
dtenedor commented on PR #37430:
URL: https://github.com/apache/spark/pull/37430#issuecomment-1214627364

   OK @gengliangwang all the conflicts should be resolved now. 👍 




[GitHub] [spark] gengliangwang commented on a diff in pull request #37430: [SPARK-40000][SQL] Update INSERTs without user-specified fields to not automatically add default values

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on code in PR #37430:
URL: https://github.com/apache/spark/pull/37430#discussion_r946016552


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -2924,6 +2924,16 @@ object SQLConf {
       .stringConf
       .createWithDefault("csv,json,orc,parquet")
 
+  val ADD_MISSING_DEFAULT_COLUMN_VALUES_FOR_INSERTS_WITH_EXPLICIT_COLUMNS =
+    buildConf("spark.sql.defaultColumn.addMissingValuesForInsertsWithExplicitColumns")
+      .internal()
+      .doc("When true, allow INSERT INTO commands with explicit columns (such as " +
+        "INSERT INTO t(a, b)) to specify fewer columns than the target table; the analyzer will " +
+        "assign default values for remaining columns. Otherwise, if false, return an error.")

Review Comment:
   Let's also mention that the "default default" is null.
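
The "default default" mentioned here can be illustrated by analogy. This is a hypothetical sketch using SQLite via Python's `sqlite3` module (not Spark): a column with a declared DEFAULT receives that value when omitted from an explicit column list, while a column with no DEFAULT receives the "default default" of NULL:

```python
import sqlite3

# Illustrative analogy only (SQLite, not Spark): a has DEFAULT 7, so the
# insert fills it in; b has no DEFAULT, so it falls back to NULL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INT, a INT DEFAULT 7, b INT)")
conn.execute("INSERT INTO t(id) VALUES (1)")
print(conn.execute("SELECT * FROM t").fetchall())  # [(1, 7, None)]
```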





[GitHub] [spark] AmplabJenkins commented on pull request #37430: [SPARK-40000][SQL] Add config to toggle whether to automatically add default values for INSERTs without user-specified fields

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #37430:
URL: https://github.com/apache/spark/pull/37430#issuecomment-1207490512

   Can one of the admins verify this patch?




[GitHub] [spark] gengliangwang commented on a diff in pull request #37430: [SPARK-40000][SQL] Update INSERTs without user-specified fields to not automatically add default values

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on code in PR #37430:
URL: https://github.com/apache/spark/pull/37430#discussion_r946014943


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveDefaultColumns.scala:
##########
@@ -265,17 +265,19 @@ case class ResolveDefaultColumns(catalog: SessionCatalog) extends Rule[LogicalPl
    */
   private def addMissingDefaultValuesForInsertFromInlineTable(
       node: LogicalPlan,
-      insertTableSchemaWithoutPartitionColumns: StructType): LogicalPlan = {
+      insertTableSchemaWithoutPartitionColumns: StructType,
+      numUserSpecifiedFields: Integer): LogicalPlan = {
     val numQueryOutputs: Int = node match {
       case table: UnresolvedInlineTable => table.rows(0).size
       case local: LocalRelation => local.data(0).numFields
     }
     val schema = insertTableSchemaWithoutPartitionColumns
     val newDefaultExpressions: Seq[Expression] =
-      getDefaultExpressionsForInsert(numQueryOutputs, schema)
+      getDefaultExpressionsForInsert(numQueryOutputs, schema, numUserSpecifiedFields, node)
     val newNames: Seq[String] = schema.fields.drop(numQueryOutputs).map { _.name }
     node match {
       case _ if newDefaultExpressions.isEmpty => node
+      case _ if numUserSpecifiedFields > 0 && numUserSpecifiedFields != numQueryOutputs => node

Review Comment:
   This line is not reachable; there is already such a check in the method `getDefaultExpressionsForInsert`.



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveDefaultColumns.scala:
##########
@@ -265,17 +265,19 @@ case class ResolveDefaultColumns(catalog: SessionCatalog) extends Rule[LogicalPl
    */
   private def addMissingDefaultValuesForInsertFromInlineTable(
       node: LogicalPlan,
-      insertTableSchemaWithoutPartitionColumns: StructType): LogicalPlan = {
+      insertTableSchemaWithoutPartitionColumns: StructType,
+      numUserSpecifiedFields: Integer): LogicalPlan = {
     val numQueryOutputs: Int = node match {
       case table: UnresolvedInlineTable => table.rows(0).size
       case local: LocalRelation => local.data(0).numFields
     }
     val schema = insertTableSchemaWithoutPartitionColumns
     val newDefaultExpressions: Seq[Expression] =
-      getDefaultExpressionsForInsert(numQueryOutputs, schema)
+      getDefaultExpressionsForInsert(numQueryOutputs, schema, numUserSpecifiedFields, node)
     val newNames: Seq[String] = schema.fields.drop(numQueryOutputs).map { _.name }
     node match {
       case _ if newDefaultExpressions.isEmpty => node
+      case _ if numUserSpecifiedFields > 0 && numUserSpecifiedFields != numQueryOutputs => node

Review Comment:
   ```suggestion
   ```





[GitHub] [spark] gengliangwang commented on a diff in pull request #37430: [SPARK-40000][SQL] Update INSERTs without user-specified fields to not automatically add default values

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on code in PR #37430:
URL: https://github.com/apache/spark/pull/37430#discussion_r946010121


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveDefaultColumns.scala:
##########
@@ -315,20 +318,19 @@ case class ResolveDefaultColumns(catalog: SessionCatalog) extends Rule[LogicalPl
    */
   private def getDefaultExpressionsForInsert(
       numQueryOutputs: Int,
-      schema: StructType): Seq[Expression] = {
-    val remainingFields: Seq[StructField] = schema.fields.drop(numQueryOutputs)
-    val numDefaultExpressionsToAdd = getStructFieldsForDefaultExpressions(remainingFields).size
-    Seq.fill(numDefaultExpressionsToAdd)(UnresolvedAttribute(CURRENT_DEFAULT_COLUMN_NAME))
-  }
-
-  /**
-   * This is a helper for the getDefaultExpressionsForInsert methods above.
-   */
-  private def getStructFieldsForDefaultExpressions(fields: Seq[StructField]): Seq[StructField] = {
-    if (SQLConf.get.useNullsForMissingDefaultColumnValues) {
-      fields
+      schema: StructType,
+      numUserSpecifiedFields: Integer,

Review Comment:
   ```suggestion
         numUserSpecifiedFields: Int,
   ```





[GitHub] [spark] dongjoon-hyun commented on pull request #37430: [SPARK-40000][SQL] Update INSERTs without user-specified fields to not automatically add default values

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on PR #37430:
URL: https://github.com/apache/spark/pull/37430#issuecomment-1219997014

   This is reverted via https://github.com/apache/spark/commit/50c163578cfef79002fbdbc54b3b8fc10cfbcf65




[GitHub] [spark] dtenedor commented on pull request #37430: [SPARK-40000][SQL] Add config to toggle whether to automatically add default values for INSERTs without user-specified fields

Posted by GitBox <gi...@apache.org>.
dtenedor commented on PR #37430:
URL: https://github.com/apache/spark/pull/37430#issuecomment-1212376666

   Hi @gengliangwang, thanks for your review, this is ready again :)




[GitHub] [spark] dtenedor commented on pull request #37430: [SPARK-40000][SQL] Add config to toggle whether to automatically add default values for INSERTs without user-specified fields

Posted by GitBox <gi...@apache.org>.
dtenedor commented on PR #37430:
URL: https://github.com/apache/spark/pull/37430#issuecomment-1207489003

   Hi @gengliangwang, this change will be helpful to let SQL engines toggle whether to return an error if `INSERT` commands without user-specified fields include fewer columns than present in the target table.
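
   A minimal sketch of the toggled behavior (hypothetical names, not Spark's internal API): with the config enabled, missing trailing columns are filled from their defaults; with it disabled, the INSERT is rejected.

   ```scala
   // Each column carries a DEFAULT value, as in:
   // CREATE TABLE t (a INT DEFAULT 1, b INT DEFAULT 2)
   case class ColumnWithDefault(name: String, default: Int)

   def padWithDefaults(
       schema: Seq[ColumnWithDefault],
       values: Seq[Int],
       addMissingValues: Boolean): Either[String, Seq[Int]] = {
     if (values.size == schema.size) {
       Right(values) // every column already has a value
     } else if (!addMissingValues) {
       Left(s"target table has ${schema.size} columns but only ${values.size} were inserted")
     } else {
       // Fill the remaining trailing columns from their DEFAULT values.
       Right(values ++ schema.drop(values.size).map(_.default))
     }
   }
   ```

   For example, with the two-column table above, inserting the single value `42` yields `Right(Seq(42, 2))` when the flag is enabled, and a `Left` error when it is disabled, mirroring the `SELECT * FROM t` result described in the PR.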




[GitHub] [spark] dtenedor commented on a diff in pull request #37430: [SPARK-40000][SQL] Update INSERTs without user-specified fields to not automatically add default values

Posted by GitBox <gi...@apache.org>.
dtenedor commented on code in PR #37430:
URL: https://github.com/apache/spark/pull/37430#discussion_r946049952


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -2924,6 +2924,16 @@ object SQLConf {
       .stringConf
       .createWithDefault("csv,json,orc,parquet")
 
+  val ADD_MISSING_DEFAULT_COLUMN_VALUES_FOR_INSERTS_WITH_EXPLICIT_COLUMNS =
+    buildConf("spark.sql.defaultColumn.addMissingValuesForInsertsWithExplicitColumns")
+      .internal()
+      .doc("When true, allow INSERT INTO commands with explicit columns (such as " +
+        "INSERT INTO t(a, b)) to specify fewer columns than the target table; the analyzer will " +
+        "assign default values for remaining columns. Otherwise, if false, return an error.")

Review Comment:
   Good idea, done (also updated the PR description accordingly).





[GitHub] [spark] dtenedor commented on pull request #37430: [SPARK-40000][SQL] Update INSERTs without user-specified fields to not automatically add default values

Posted by GitBox <gi...@apache.org>.
dtenedor commented on PR #37430:
URL: https://github.com/apache/spark/pull/37430#issuecomment-1215631493

   Thanks @gengliangwang for the reviews, applied suggestions and updated the PR description 👍 




[GitHub] [spark] dtenedor commented on pull request #37430: [SPARK-40000][SQL] Update INSERTs without user-specified fields to not automatically add default values

Posted by GitBox <gi...@apache.org>.
dtenedor commented on PR #37430:
URL: https://github.com/apache/spark/pull/37430#issuecomment-1213579839

   Hi @gengliangwang per offline discussion, I have added the new flag `spark.sql.defaultColumn.addMissingValuesForInsertsWithExplicitColumns` to toggle whether commands like `INSERT INTO t (a, b) VALUES ...` will succeed or fail when the number of inserted columns is less than the number of columns in the target table.
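
   A sketch of the explicit-column case (assumed helper names, not Spark's implementation): when the user writes `INSERT INTO t (a) VALUES (42)`, the unlisted columns are the candidates for defaults, and the new flag decides whether they are filled or the command fails.

   ```scala
   // Each column carries a DEFAULT value, as in:
   // CREATE TABLE t (a INT DEFAULT 1, b INT DEFAULT 2)
   case class Col(name: String, default: Int)

   // Returns the full row in schema order, or an error message.
   def resolveExplicitInsert(
       schema: Seq[Col],
       userColumns: Seq[String],
       values: Seq[Int],
       addMissingValues: Boolean): Either[String, Seq[Int]] = {
     if (userColumns.size != values.size) {
       Left("the column list and the VALUES list have different lengths")
     } else if (userColumns.size < schema.size && !addMissingValues) {
       Left(s"target table has ${schema.size} columns but only ${userColumns.size} were specified")
     } else {
       val given = userColumns.zip(values).toMap
       // Columns absent from the explicit list fall back to their defaults.
       Right(schema.map(c => given.getOrElse(c.name, c.default)))
     }
   }
   ```

   With `spark.sql.defaultColumn.addMissingValuesForInsertsWithExplicitColumns` enabled this resolves `INSERT INTO t (a) VALUES (42)` to the row `(42, 2)`; disabled, the same command is rejected.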

