You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/08/05 20:40:31 UTC

[GitHub] [spark] dtenedor opened a new pull request, #37423: [SPARK-39985][SQL] Enable implicit DEFAULT column values in inserts from DataFrames

dtenedor opened a new pull request, #37423:
URL: https://github.com/apache/spark/pull/37423

   ### What changes were proposed in this pull request?
   
   Enable implicit DEFAULT column values in inserts from DataFrames.
   
   This mostly already worked since the DataFrame inserts already converted to LogicalPlans. I added testing and a small analysis change since the operators are resolved one-by-one instead of all at once.
   
   Note that explicit column "default" references are not supported in write operations from DataFrames: since the operators are resolved one-by-one, any `.select` referring to "default" generates a "column not found" error before any following `.insertInto`.
   
   ### Why are the changes needed?
   
   This makes inserts from DataFrames produce the same results as those from SQL commands, for consistency and correctness.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Extended the `InsertSuite` in this PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dtenedor commented on pull request #37423: [SPARK-39985][SQL] Enable implicit DEFAULT column values in inserts from DataFrames

Posted by GitBox <gi...@apache.org>.
dtenedor commented on PR #37423:
URL: https://github.com/apache/spark/pull/37423#issuecomment-1206863387

   Hi @gengliangwang this fixes a small bug in column DEFAULT values with DataFrames, and adds some testing.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang closed pull request #37423: [SPARK-39985][SQL] Enable implicit DEFAULT column values in inserts from DataFrames

Posted by GitBox <gi...@apache.org>.
gengliangwang closed pull request #37423: [SPARK-39985][SQL] Enable implicit DEFAULT column values in inserts from DataFrames
URL: https://github.com/apache/spark/pull/37423


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang commented on a diff in pull request #37423: [SPARK-39985][SQL] Enable implicit DEFAULT column values in inserts from DataFrames

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on code in PR #37423:
URL: https://github.com/apache/spark/pull/37423#discussion_r943016309


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveDefaultColumns.scala:
##########
@@ -262,16 +264,30 @@ case class ResolveDefaultColumns(catalog: SessionCatalog) extends Rule[LogicalPl
    * Updates an inline table to generate missing default column values.
    */
   private def addMissingDefaultValuesForInsertFromInlineTable(
-      table: UnresolvedInlineTable,
-      insertTableSchemaWithoutPartitionColumns: StructType): UnresolvedInlineTable = {
-    val numQueryOutputs: Int = table.rows(0).size
+      node: LogicalPlan,
+      insertTableSchemaWithoutPartitionColumns: StructType): LogicalPlan = {
+    val numQueryOutputs: Int = node match {
+      case table: UnresolvedInlineTable => table.rows(0).size
+      case local: LocalRelation => local.data(0).numFields
+    }
     val schema = insertTableSchemaWithoutPartitionColumns
     val newDefaultExpressions: Seq[Expression] =
       getDefaultExpressionsForInsert(numQueryOutputs, schema)
     val newNames: Seq[String] = schema.fields.drop(numQueryOutputs).map { _.name }
-    table.copy(
-      names = table.names ++ newNames,
-      rows = table.rows.map { row => row ++ newDefaultExpressions })
+    node match {
+      case table: UnresolvedInlineTable =>
+        table.copy(
+          names = table.names ++ newNames,
+          rows = table.rows.map { row => row ++ newDefaultExpressions })
+      case local: LocalRelation if newDefaultExpressions.nonEmpty =>
+        val colTypes = StructType(local.output.map(col => StructField(col.name, col.dataType)))
+        UnresolvedInlineTable(

Review Comment:
   Shall we return LocalRelation instead?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang commented on pull request #37423: [SPARK-39985][SQL] Enable implicit DEFAULT column values in inserts from DataFrames

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on PR #37423:
URL: https://github.com/apache/spark/pull/37423#issuecomment-1211444278

   LGTM except one comment


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dtenedor commented on a diff in pull request #37423: [SPARK-39985][SQL] Enable implicit DEFAULT column values in inserts from DataFrames

Posted by GitBox <gi...@apache.org>.
dtenedor commented on code in PR #37423:
URL: https://github.com/apache/spark/pull/37423#discussion_r943819176


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveDefaultColumns.scala:
##########
@@ -262,16 +264,30 @@ case class ResolveDefaultColumns(catalog: SessionCatalog) extends Rule[LogicalPl
    * Updates an inline table to generate missing default column values.
    */
   private def addMissingDefaultValuesForInsertFromInlineTable(
-      table: UnresolvedInlineTable,
-      insertTableSchemaWithoutPartitionColumns: StructType): UnresolvedInlineTable = {
-    val numQueryOutputs: Int = table.rows(0).size
+      node: LogicalPlan,
+      insertTableSchemaWithoutPartitionColumns: StructType): LogicalPlan = {
+    val numQueryOutputs: Int = node match {
+      case table: UnresolvedInlineTable => table.rows(0).size
+      case local: LocalRelation => local.data(0).numFields
+    }
     val schema = insertTableSchemaWithoutPartitionColumns
     val newDefaultExpressions: Seq[Expression] =
       getDefaultExpressionsForInsert(numQueryOutputs, schema)
     val newNames: Seq[String] = schema.fields.drop(numQueryOutputs).map { _.name }
-    table.copy(
-      names = table.names ++ newNames,
-      rows = table.rows.map { row => row ++ newDefaultExpressions })
+    node match {
+      case table: UnresolvedInlineTable =>
+        table.copy(
+          names = table.names ++ newNames,
+          rows = table.rows.map { row => row ++ newDefaultExpressions })
+      case local: LocalRelation if newDefaultExpressions.nonEmpty =>
+        val colTypes = StructType(local.output.map(col => StructField(col.name, col.dataType)))
+        UnresolvedInlineTable(

Review Comment:
   I tried this a few different ways, since the `ResolveInlineTables` rule can convert the `UnresolvedInlineTable` to a `LocalRelation`. But I found this did not work well because this method adds unresolved `DEFAULT` columns and delegates replacing them to `addMissingDefaultValuesForInsertFromProject` later. I added a comment to help explain this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang commented on a diff in pull request #37423: [SPARK-39985][SQL] Enable implicit DEFAULT column values in inserts from DataFrames

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on code in PR #37423:
URL: https://github.com/apache/spark/pull/37423#discussion_r943021311


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveDefaultColumns.scala:
##########
@@ -262,16 +264,30 @@ case class ResolveDefaultColumns(catalog: SessionCatalog) extends Rule[LogicalPl
    * Updates an inline table to generate missing default column values.
    */
   private def addMissingDefaultValuesForInsertFromInlineTable(
-      table: UnresolvedInlineTable,
-      insertTableSchemaWithoutPartitionColumns: StructType): UnresolvedInlineTable = {
-    val numQueryOutputs: Int = table.rows(0).size
+      node: LogicalPlan,
+      insertTableSchemaWithoutPartitionColumns: StructType): LogicalPlan = {
+    val numQueryOutputs: Int = node match {
+      case table: UnresolvedInlineTable => table.rows(0).size
+      case local: LocalRelation => local.data(0).numFields
+    }
     val schema = insertTableSchemaWithoutPartitionColumns
     val newDefaultExpressions: Seq[Expression] =
       getDefaultExpressionsForInsert(numQueryOutputs, schema)
     val newNames: Seq[String] = schema.fields.drop(numQueryOutputs).map { _.name }
-    table.copy(
-      names = table.names ++ newNames,
-      rows = table.rows.map { row => row ++ newDefaultExpressions })
+    node match {
+      case table: UnresolvedInlineTable =>

Review Comment:
   Shall we check `if newDefaultExpressions.nonEmpty` here as well like line 282? Or we can just return the node if `inewDefaultExpressions` is empty



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dtenedor commented on a diff in pull request #37423: [SPARK-39985][SQL] Enable implicit DEFAULT column values in inserts from DataFrames

Posted by GitBox <gi...@apache.org>.
dtenedor commented on code in PR #37423:
URL: https://github.com/apache/spark/pull/37423#discussion_r943821885


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveDefaultColumns.scala:
##########
@@ -262,16 +264,30 @@ case class ResolveDefaultColumns(catalog: SessionCatalog) extends Rule[LogicalPl
    * Updates an inline table to generate missing default column values.
    */
   private def addMissingDefaultValuesForInsertFromInlineTable(
-      table: UnresolvedInlineTable,
-      insertTableSchemaWithoutPartitionColumns: StructType): UnresolvedInlineTable = {
-    val numQueryOutputs: Int = table.rows(0).size
+      node: LogicalPlan,
+      insertTableSchemaWithoutPartitionColumns: StructType): LogicalPlan = {
+    val numQueryOutputs: Int = node match {
+      case table: UnresolvedInlineTable => table.rows(0).size
+      case local: LocalRelation => local.data(0).numFields
+    }
     val schema = insertTableSchemaWithoutPartitionColumns
     val newDefaultExpressions: Seq[Expression] =
       getDefaultExpressionsForInsert(numQueryOutputs, schema)
     val newNames: Seq[String] = schema.fields.drop(numQueryOutputs).map { _.name }
-    table.copy(
-      names = table.names ++ newNames,
-      rows = table.rows.map { row => row ++ newDefaultExpressions })
+    node match {
+      case table: UnresolvedInlineTable =>

Review Comment:
   Good idea, done!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dtenedor commented on pull request #37423: [SPARK-39985][SQL] Enable implicit DEFAULT column values in inserts from DataFrames

Posted by GitBox <gi...@apache.org>.
dtenedor commented on PR #37423:
URL: https://github.com/apache/spark/pull/37423#issuecomment-1212377150

   Hi @gengliangwang, this is ready for another round when you have a moment :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] gengliangwang commented on pull request #37423: [SPARK-39985][SQL] Enable implicit DEFAULT column values in inserts from DataFrames

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on PR #37423:
URL: https://github.com/apache/spark/pull/37423#issuecomment-1212644313

   Thanks, merging to master


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dtenedor commented on a diff in pull request #37423: [SPARK-39985][SQL] Enable implicit DEFAULT column values in inserts from DataFrames

Posted by GitBox <gi...@apache.org>.
dtenedor commented on code in PR #37423:
URL: https://github.com/apache/spark/pull/37423#discussion_r943819176


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveDefaultColumns.scala:
##########
@@ -262,16 +264,30 @@ case class ResolveDefaultColumns(catalog: SessionCatalog) extends Rule[LogicalPl
    * Updates an inline table to generate missing default column values.
    */
   private def addMissingDefaultValuesForInsertFromInlineTable(
-      table: UnresolvedInlineTable,
-      insertTableSchemaWithoutPartitionColumns: StructType): UnresolvedInlineTable = {
-    val numQueryOutputs: Int = table.rows(0).size
+      node: LogicalPlan,
+      insertTableSchemaWithoutPartitionColumns: StructType): LogicalPlan = {
+    val numQueryOutputs: Int = node match {
+      case table: UnresolvedInlineTable => table.rows(0).size
+      case local: LocalRelation => local.data(0).numFields
+    }
     val schema = insertTableSchemaWithoutPartitionColumns
     val newDefaultExpressions: Seq[Expression] =
       getDefaultExpressionsForInsert(numQueryOutputs, schema)
     val newNames: Seq[String] = schema.fields.drop(numQueryOutputs).map { _.name }
-    table.copy(
-      names = table.names ++ newNames,
-      rows = table.rows.map { row => row ++ newDefaultExpressions })
+    node match {
+      case table: UnresolvedInlineTable =>
+        table.copy(
+          names = table.names ++ newNames,
+          rows = table.rows.map { row => row ++ newDefaultExpressions })
+      case local: LocalRelation if newDefaultExpressions.nonEmpty =>
+        val colTypes = StructType(local.output.map(col => StructField(col.name, col.dataType)))
+        UnresolvedInlineTable(

Review Comment:
   Good Q: I tried this a few different ways, since the `ResolveInlineTables` rule can convert the `UnresolvedInlineTable` to a `LocalRelation`. But I found this did not work well because this method adds unresolved `DEFAULT` columns and delegates replacing them to `addMissingDefaultValuesForInsertFromProject` later. I added a comment to help explain this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #37423: [SPARK-39985][SQL] Enable implicit DEFAULT column values in inserts from DataFrames

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #37423:
URL: https://github.com/apache/spark/pull/37423#issuecomment-1207259653

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org