Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/06/02 00:35:18 UTC

[GitHub] [spark] dtenedor opened a new pull request, #36745: [SPARK-39359][SQL] Restrict DEFAULT columns to allowlist of supported data source types

dtenedor opened a new pull request, #36745:
URL: https://github.com/apache/spark/pull/36745

   ### What changes were proposed in this pull request?
   
   Restrict DEFAULT columns to allowlist of supported data source types.
   
   Example:
   
   ```
   > create table t(a string) using avro
   > alter table t add column(b int default 42)
   AnalysisException: Failed to execute command because target data source type avro does not support assigning DEFAULT column values
   ```
   
   ### Why are the changes needed?
   
   This change is necessary for correctness: each data source that allows DEFAULT column values must also support returning those values at scan time when they are missing from the underlying storage. For example, after `ALTER TABLE t ADD COLUMN b INT DEFAULT 42`, a scan over files written before the change must synthesize `42` for `b`; only the allowlisted sources implement that read-side behavior.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   This PR adds unit test coverage.
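
   For reference, a minimal sketch of the kind of test added (the suite setup, helper names, and exact error text are assumptions, not copied from the PR):

   ```
   test("DEFAULT columns restricted to allowlisted providers") {
     withSQLConf(SQLConf.DEFAULT_COLUMN_ALLOWED_PROVIDERS.key -> "csv,json,orc,parquet") {
       val err = intercept[AnalysisException] {
         sql("CREATE TABLE t(a STRING, b INT DEFAULT 42) USING avro")
       }
       assert(err.getMessage.contains("does not support assigning DEFAULT column values"))
     }
   }
   ```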




[GitHub] [spark] dtenedor commented on a diff in pull request #36745: [SPARK-39359][SQL] Restrict DEFAULT columns to allowlist of supported data source types

Posted by GitBox <gi...@apache.org>.
dtenedor commented on code in PR #36745:
URL: https://github.com/apache/spark/pull/36745#discussion_r888505435


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ResolveDefaultColumnsUtil.scala:
##########
@@ -231,4 +232,18 @@ object ResolveDefaultColumns {
       }
     }
   }
+
+  def checkDataSourceSupportsDefaultColumns(table: CatalogTable): Unit = {
+    if (table.schema.fields.map(_.metadata).exists { m =>
+      m.contains(CURRENT_DEFAULT_COLUMN_METADATA_KEY) ||
+        m.contains(EXISTS_DEFAULT_COLUMN_METADATA_KEY)
+    }) {
+      table.provider.getOrElse("").toLowerCase() match {
+        case "csv" | "json" | "parquet" | "orc" =>

Review Comment:
   Good ideas! I made this a configuration option. It does support DataSource V2; I added a couple more tests to show that.
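
   For reference, a minimal sketch of what the config-driven check could look like (placed inside the `ResolveDefaultColumns` object so the metadata-key constants are in scope, and assuming the conf stores a comma-separated string as in the `SQLConf` diff quoted later in this thread; the merged code may differ):

   ```
   import java.util.Locale

   import org.apache.spark.sql.AnalysisException
   import org.apache.spark.sql.catalyst.catalog.CatalogTable
   import org.apache.spark.sql.internal.SQLConf

   def checkDataSourceSupportsDefaultColumns(table: CatalogTable): Unit = {
     // Only run the provider check if some column actually declares a DEFAULT.
     val hasDefaultColumns = table.schema.fields.map(_.metadata).exists { m =>
       m.contains(CURRENT_DEFAULT_COLUMN_METADATA_KEY) ||
         m.contains(EXISTS_DEFAULT_COLUMN_METADATA_KEY)
     }
     if (hasDefaultColumns) {
       val allowedProviders: Array[String] =
         SQLConf.get.getConf(SQLConf.DEFAULT_COLUMN_ALLOWED_PROVIDERS)
           .toLowerCase(Locale.ROOT).split(",").map(_.trim)
       val provider = table.provider.getOrElse("").toLowerCase(Locale.ROOT)
       if (!allowedProviders.contains(provider)) {
         throw new AnalysisException("Failed to execute command because target data " +
           s"source type $provider does not support assigning DEFAULT column values")
       }
     }
   }
   ```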





[GitHub] [spark] sadikovi commented on a diff in pull request #36745: [SPARK-39359][SQL] Restrict DEFAULT columns to allowlist of supported data source types

Posted by GitBox <gi...@apache.org>.
sadikovi commented on code in PR #36745:
URL: https://github.com/apache/spark/pull/36745#discussion_r888606358


##########
sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala:
##########
@@ -41,21 +41,28 @@ import org.apache.spark.sql.types.{MetadataBuilder, StructField, StructType}
  *
  * We can remove this rule once we implement all the catalog functionality in `V2SessionCatalog`.
  */
-class ResolveSessionCatalog(val catalogManager: CatalogManager)
+class ResolveSessionCatalog(val analyzer: Analyzer)
   extends Rule[LogicalPlan] with LookupCatalog {
+  val catalogManager = analyzer.catalogManager

Review Comment:
   It is up to you to change if you like, but maybe we could move this line below the imports.
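
   For illustration, the suggested ordering (a sketch with a no-op rule body; it matches the revision quoted later in this thread):

   ```
   class ResolveSessionCatalog(val analyzer: Analyzer)
     extends Rule[LogicalPlan] with LookupCatalog {
     import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._
     import org.apache.spark.sql.connector.catalog.CatalogV2Util._
     import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Implicits._

     // the line in question, moved below the imports
     override val catalogManager = analyzer.catalogManager

     // no-op body for this sketch; the real rule resolves session-catalog commands
     override def apply(plan: LogicalPlan): LogicalPlan = plan
   }
   ```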



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ResolveDefaultColumnsUtil.scala:
##########
@@ -83,16 +83,26 @@ object ResolveDefaultColumns {
    *
    * @param analyzer      used for analyzing the result of parsing the expression stored as text.
    * @param tableSchema   represents the names and types of the columns of the statement to process.
+   * @param tableProvider provider of the target table to store default values for, if any.
    * @param statementType name of the statement being processed, such as INSERT; useful for errors.
    * @return a copy of `tableSchema` with field metadata updated with the constant-folded values.
    */
   def constantFoldCurrentDefaultsToExistDefaults(
       analyzer: Analyzer,
       tableSchema: StructType,
+      tableProvider: Option[String],
       statementType: String): StructType = {
     if (SQLConf.get.enableDefaultColumns) {
       val newFields: Seq[StructField] = tableSchema.fields.map { field =>
         if (field.metadata.contains(CURRENT_DEFAULT_COLUMN_METADATA_KEY)) {
+          // Make sure that the target table has a provider that supports default column values.
+          val allowedProviders: Array[String] =
+            SQLConf.get.getConf(SQLConf.DEFAULT_COLUMN_ALLOWED_PROVIDERS)

Review Comment:
   Maybe we could try to move the code out of the `map` loop? It is not a big deal, but it could matter for large schemas. For example, we can prepare the `allowedProviders` list first and then call `map`.
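
   A minimal sketch of that refactor inside `constantFoldCurrentDefaultsToExistDefaults` (parameter names follow the quoted diff; the error path is illustrative, not the merged code):

   ```
   // Read and parse the conf once, before mapping over the schema fields.
   val allowedProviders: Array[String] =
     SQLConf.get.getConf(SQLConf.DEFAULT_COLUMN_ALLOWED_PROVIDERS)
       .toLowerCase(java.util.Locale.ROOT).split(",").map(_.trim)
   val providerSupported: Boolean =
     allowedProviders.contains(tableProvider.getOrElse("").toLowerCase(java.util.Locale.ROOT))

   val newFields: Seq[StructField] = tableSchema.fields.map { field =>
     if (field.metadata.contains(CURRENT_DEFAULT_COLUMN_METADATA_KEY) && !providerSupported) {
       throw new AnalysisException("Failed to execute command because target data source " +
         s"type ${tableProvider.getOrElse("")} does not support assigning DEFAULT column values")
     }
     field // constant folding of the default expression elided in this sketch
   }
   ```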





[GitHub] [spark] gengliangwang commented on a diff in pull request #36745: [SPARK-39359][SQL] Restrict DEFAULT columns to allowlist of supported data source types

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on code in PR #36745:
URL: https://github.com/apache/spark/pull/36745#discussion_r890517016


##########
sql/core/src/main/scala/org/apache/spark/sql/internal/BaseSessionStateBuilder.scala:
##########
@@ -186,7 +186,7 @@ abstract class BaseSessionStateBuilder(
         new ResolveSQLOnFile(session) +:
         new FallBackFileSourceV2(session) +:
         ResolveEncodersInScalaAgg +:
-        new ResolveSessionCatalog(catalogManager) +:
+        new ResolveSessionCatalog(this) +:

Review Comment:
   I have seen this in previous PRs. It is odd that the analyzer relies on `ResolveSessionCatalog` while `ResolveSessionCatalog` relies on the analyzer too.





[GitHub] [spark] dtenedor commented on pull request #36745: [SPARK-39359][SQL] Restrict DEFAULT columns to allowlist of supported data source types

Posted by GitBox <gi...@apache.org>.
dtenedor commented on PR #36745:
URL: https://github.com/apache/spark/pull/36745#issuecomment-1144288442

   @HyukjinKwon @sadikovi this PR restricts DEFAULT columns to supported data sources; it should be the last one needed to complete this open-source work. If you could help review (or refer someone), that would be great :)




[GitHub] [spark] gengliangwang commented on a diff in pull request #36745: [SPARK-39359][SQL] Restrict DEFAULT columns to allowlist of supported data source types

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on code in PR #36745:
URL: https://github.com/apache/spark/pull/36745#discussion_r891067965


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -2881,6 +2881,15 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)
 
+  val DEFAULT_COLUMN_ALLOWED_PROVIDERS =
+    buildConf("spark.sql.defaultColumn.allowedProviders")
+      .internal()
+      .doc("List of table providers wherein SQL commands are permitted to assign DEFAULT column " +
+        "values. Comma-separated list, whitespace ignored, case-insensitive.")
+      .version("3.4.0")
+      .stringConf
+      .createWithDefault("csv,json,orc,parquet")

Review Comment:
   If the default column support of one data source has a bug, users can set this conf to ban that data source. Probably we will need both this conf and data source API support?
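
   For example (illustrative values; the conf is internal but still settable):

   ```
   // Hypothetically ban Parquet DEFAULTs if a bug were found in that reader,
   // while keeping the other built-in sources enabled:
   spark.conf.set("spark.sql.defaultColumn.allowedProviders", "csv,json,orc")
   ```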







[GitHub] [spark] dtenedor commented on pull request #36745: [SPARK-39359][SQL] Restrict DEFAULT columns to allowlist of supported data source types

Posted by GitBox <gi...@apache.org>.
dtenedor commented on PR #36745:
URL: https://github.com/apache/spark/pull/36745#issuecomment-1147641785

   @gengliangwang @HyukjinKwon @cloud-fan this is ready to merge at your convenience, or leave review comments for further iteration if desired.




[GitHub] [spark] dtenedor commented on a diff in pull request #36745: [SPARK-39359][SQL] Restrict DEFAULT columns to allowlist of supported data source types

Posted by GitBox <gi...@apache.org>.
dtenedor commented on code in PR #36745:
URL: https://github.com/apache/spark/pull/36745#discussion_r890579762


##########
sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala:
##########
@@ -41,21 +41,29 @@ import org.apache.spark.sql.types.{MetadataBuilder, StructField, StructType}
  *
  * We can remove this rule once we implement all the catalog functionality in `V2SessionCatalog`.
  */
-class ResolveSessionCatalog(val catalogManager: CatalogManager)
+class ResolveSessionCatalog(val analyzer: Analyzer)
   extends Rule[LogicalPlan] with LookupCatalog {
   import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._
   import org.apache.spark.sql.connector.catalog.CatalogV2Util._
   import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Implicits._
 
+  override val catalogManager = analyzer.catalogManager
+
   override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUp {
-    case AddColumns(ResolvedV1TableIdentifier(ident), cols) =>
+    case AddColumns(
+      ResolvedTable(catalog, ident, v1Table: V1Table, _), cols)
+        if isSessionCatalog(catalog) =>
       cols.foreach { c =>
         assertTopLevelColumn(c.name, "AlterTableAddColumnsCommand")
         if (!c.nullable) {
           throw QueryCompilationErrors.addColumnWithV1TableCannotSpecifyNotNullError
         }
       }
-      AlterTableAddColumnsCommand(ident.asTableIdentifier, cols.map(convertToStructField))
+      val prevSchema = StructType(cols.map(convertToStructField))
+      val newSchema: StructType =
+        DefaultCols.constantFoldCurrentDefaultsToExistDefaults(

Review Comment:
   Thanks for pointing this out; we don't need to duplicate it here. Updated.





[GitHub] [spark] sadikovi commented on a diff in pull request #36745: [SPARK-39359][SQL] Restrict DEFAULT columns to allowlist of supported data source types

Posted by GitBox <gi...@apache.org>.
sadikovi commented on code in PR #36745:
URL: https://github.com/apache/spark/pull/36745#discussion_r888603145


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalogSuite.scala:
##########
@@ -122,7 +122,7 @@ abstract class SessionCatalogSuite extends AnalysisTest with Eventually {
   }
 
   test("create table with default columns") {
-    withBasicCatalog { catalog =>
+    if (!isHiveExternalCatalog) withBasicCatalog { catalog =>

Review Comment:
   What happens if we remove `if (!isHiveExternalCatalog)`? What is the test failure in this case?





[GitHub] [spark] dtenedor commented on pull request #36745: [SPARK-39359][SQL] Restrict DEFAULT columns to allowlist of supported data source types

Posted by GitBox <gi...@apache.org>.
dtenedor commented on PR #36745:
URL: https://github.com/apache/spark/pull/36745#issuecomment-1145458556

   @sadikovi thanks for your review, these are helpful ideas! Please look again when you have time.




[GitHub] [spark] sadikovi commented on a diff in pull request #36745: [SPARK-39359][SQL] Restrict DEFAULT columns to allowlist of supported data source types

Posted by GitBox <gi...@apache.org>.
sadikovi commented on code in PR #36745:
URL: https://github.com/apache/spark/pull/36745#discussion_r887433430


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala:
##########
@@ -427,6 +428,7 @@ class SessionCatalog(
       tableDefinition.copy(identifier = tableIdentifier)
     }
 
+    ResolveDefaultColumns.checkDataSourceSupportsDefaultColumns(tableDefinition)

Review Comment:
   Do we also need to handle the DataFrame API? For example, suppose I create a table like this:
   ```
   val schema = ...
   val df = spark.createDataFrame(schema, ...)
   df.write.saveAsTable("...")
   ```



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ResolveDefaultColumnsUtil.scala:
##########
@@ -231,4 +232,18 @@ object ResolveDefaultColumns {
       }
     }
   }
+
+  def checkDataSourceSupportsDefaultColumns(table: CatalogTable): Unit = {
+    if (table.schema.fields.map(_.metadata).exists { m =>
+      m.contains(CURRENT_DEFAULT_COLUMN_METADATA_KEY) ||
+        m.contains(EXISTS_DEFAULT_COLUMN_METADATA_KEY)
+    }) {
+      table.provider.getOrElse("").toLowerCase() match {
+        case "csv" | "json" | "parquet" | "orc" =>

Review Comment:
   Would you like to make this a configuration option? Or does it depend on the data source implementation? For example, can we have DEFAULT in JDBC?
   
   Do we also need to consider DataSource V2? Will this feature work there?





[GitHub] [spark] dtenedor commented on a diff in pull request #36745: [SPARK-39359][SQL] Restrict DEFAULT columns to allowlist of supported data source types

Posted by GitBox <gi...@apache.org>.
dtenedor commented on code in PR #36745:
URL: https://github.com/apache/spark/pull/36745#discussion_r888505739


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala:
##########
@@ -427,6 +428,7 @@ class SessionCatalog(
       tableDefinition.copy(identifier = tableIdentifier)
     }
 
+    ResolveDefaultColumns.checkDataSourceSupportsDefaultColumns(tableDefinition)

Review Comment:
   This is not supported currently; the feature is SQL-only as of now. References to columns named "default" will simply return "column not found" errors until then. We can certainly consider adding this feature to the DataFrame API and PySpark later; I would be interested in helping with that.
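
   For illustration, the SQL-only path that is supported (mirrors the example in the PR description; `spark` is an illustrative session):

   ```
   spark.sql("CREATE TABLE t(a STRING, b INT DEFAULT 42) USING parquet")
   spark.sql("INSERT INTO t VALUES ('hello', DEFAULT)")  // b resolves to 42
   // By contrast, df.write.saveAsTable(...) offers no way to declare a DEFAULT,
   // so the DataFrame path from the question above is unaffected for now.
   ```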





[GitHub] [spark] gengliangwang commented on pull request #36745: [SPARK-39359][SQL] Restrict DEFAULT columns to allowlist of supported data source types

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on PR #36745:
URL: https://github.com/apache/spark/pull/36745#issuecomment-1148625244

   I am merging this one to master now. We can have a new DS API for this later.




[GitHub] [spark] gengliangwang commented on a diff in pull request #36745: [SPARK-39359][SQL] Restrict DEFAULT columns to allowlist of supported data source types

Posted by GitBox <gi...@apache.org>.
gengliangwang commented on code in PR #36745:
URL: https://github.com/apache/spark/pull/36745#discussion_r890519223


##########
sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala:
##########
@@ -41,21 +41,29 @@ import org.apache.spark.sql.types.{MetadataBuilder, StructField, StructType}
  *
  * We can remove this rule once we implement all the catalog functionality in `V2SessionCatalog`.
  */
-class ResolveSessionCatalog(val catalogManager: CatalogManager)
+class ResolveSessionCatalog(val analyzer: Analyzer)
   extends Rule[LogicalPlan] with LookupCatalog {
   import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._
   import org.apache.spark.sql.connector.catalog.CatalogV2Util._
   import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Implicits._
 
+  override val catalogManager = analyzer.catalogManager
+
   override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUp {
-    case AddColumns(ResolvedV1TableIdentifier(ident), cols) =>
+    case AddColumns(
+      ResolvedTable(catalog, ident, v1Table: V1Table, _), cols)
+        if isSessionCatalog(catalog) =>
       cols.foreach { c =>
         assertTopLevelColumn(c.name, "AlterTableAddColumnsCommand")
         if (!c.nullable) {
           throw QueryCompilationErrors.addColumnWithV1TableCannotSpecifyNotNullError
         }
       }
-      AlterTableAddColumnsCommand(ident.asTableIdentifier, cols.map(convertToStructField))
+      val prevSchema = StructType(cols.map(convertToStructField))
+      val newSchema: StructType =
+        DefaultCols.constantFoldCurrentDefaultsToExistDefaults(

Review Comment:
   `constantFoldCurrentDefaultsToExistDefaults` is called in `AlterTableAddColumnsCommand` too. Seems duplicated here?





[GitHub] [spark] dtenedor commented on a diff in pull request #36745: [SPARK-39359][SQL] Restrict DEFAULT columns to allowlist of supported data source types

Posted by GitBox <gi...@apache.org>.
dtenedor commented on code in PR #36745:
URL: https://github.com/apache/spark/pull/36745#discussion_r890579364


##########
sql/core/src/main/scala/org/apache/spark/sql/internal/BaseSessionStateBuilder.scala:
##########
@@ -186,7 +186,7 @@ abstract class BaseSessionStateBuilder(
         new ResolveSQLOnFile(session) +:
         new FallBackFileSourceV2(session) +:
         ResolveEncodersInScalaAgg +:
-        new ResolveSessionCatalog(catalogManager) +:
+        new ResolveSessionCatalog(this) +:

Review Comment:
   Good point; this is not needed, so we can revert this part.





[GitHub] [spark] dtenedor commented on a diff in pull request #36745: [SPARK-39359][SQL] Restrict DEFAULT columns to allowlist of supported data source types

Posted by GitBox <gi...@apache.org>.
dtenedor commented on code in PR #36745:
URL: https://github.com/apache/spark/pull/36745#discussion_r889272134


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalogSuite.scala:
##########
@@ -122,7 +122,7 @@ abstract class SessionCatalogSuite extends AnalysisTest with Eventually {
   }
 
   test("create table with default columns") {
-    withBasicCatalog { catalog =>
+    if (!isHiveExternalCatalog) withBasicCatalog { catalog =>

Review Comment:
   Good question. This test is part of the base class `SessionCatalogSuite`. It fails when running under the `HiveExternalSessionCatalogSuite` subclass, since Hive tables are not included in the allowlist of providers that allow DEFAULT columns. To simplify this, I edited the config to add Hive tables to the list of allowed providers, and the test now passes in all cases.
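
   One way to express that in the shared test (conf key from this PR; the fix itself extends the conf's default list, so this per-test form is only an illustration):

   ```
   withSQLConf(SQLConf.DEFAULT_COLUMN_ALLOWED_PROVIDERS.key ->
       "csv,json,orc,parquet,hive") {
     withBasicCatalog { catalog =>
       // the shared "create table with default columns" body now passes for
       // both the in-memory and Hive external catalog suites
     }
   }
   ```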



##########
sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala:
##########
@@ -41,21 +41,28 @@ import org.apache.spark.sql.types.{MetadataBuilder, StructField, StructType}
  *
  * We can remove this rule once we implement all the catalog functionality in `V2SessionCatalog`.
  */
-class ResolveSessionCatalog(val catalogManager: CatalogManager)
+class ResolveSessionCatalog(val analyzer: Analyzer)
   extends Rule[LogicalPlan] with LookupCatalog {
+  val catalogManager = analyzer.catalogManager

Review Comment:
   SG, done.



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ResolveDefaultColumnsUtil.scala:
##########
@@ -83,16 +83,26 @@ object ResolveDefaultColumns {
    *
    * @param analyzer      used for analyzing the result of parsing the expression stored as text.
    * @param tableSchema   represents the names and types of the columns of the statement to process.
+   * @param tableProvider provider of the target table to store default values for, if any.
    * @param statementType name of the statement being processed, such as INSERT; useful for errors.
    * @return a copy of `tableSchema` with field metadata updated with the constant-folded values.
    */
   def constantFoldCurrentDefaultsToExistDefaults(
       analyzer: Analyzer,
       tableSchema: StructType,
+      tableProvider: Option[String],
       statementType: String): StructType = {
     if (SQLConf.get.enableDefaultColumns) {
       val newFields: Seq[StructField] = tableSchema.fields.map { field =>
         if (field.metadata.contains(CURRENT_DEFAULT_COLUMN_METADATA_KEY)) {
+          // Make sure that the target table has a provider that supports default column values.
+          val allowedProviders: Array[String] =
+            SQLConf.get.getConf(SQLConf.DEFAULT_COLUMN_ALLOWED_PROVIDERS)

Review Comment:
   Good idea, done!





[GitHub] [spark] AmplabJenkins commented on pull request #36745: [SPARK-39359][SQL] Restrict DEFAULT columns to allowlist of supported data source types

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #36745:
URL: https://github.com/apache/spark/pull/36745#issuecomment-1145590247

   Can one of the admins verify this patch?




[GitHub] [spark] cloud-fan commented on a diff in pull request #36745: [SPARK-39359][SQL] Restrict DEFAULT columns to allowlist of supported data source types

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #36745:
URL: https://github.com/apache/spark/pull/36745#discussion_r890902366


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -2881,6 +2881,15 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)
 
+  val DEFAULT_COLUMN_ALLOWED_PROVIDERS =
+    buildConf("spark.sql.defaultColumn.allowedProviders")
+      .internal()
+      .doc("List of table providers wherein SQL commands are permitted to assign DEFAULT column " +
+        "values. Comma-separated list, whitespace ignored, case-insensitive.")
+      .version("3.4.0")
+      .stringConf
+      .createWithDefault("csv,json,orc,parquet")

Review Comment:
   This is fine for now. But eventually, I think we need a data source API for a source to report whether or not it supports default values, instead of asking end-users to set this conf.
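
   A hypothetical sketch of such an API (this does not exist in Spark at the time of this PR; the trait name is invented for illustration):

   ```
   // A V2 Table implementation could mix this in to advertise DEFAULT-value
   // support itself, letting Spark drop the user-facing allowlist conf.
   trait SupportsDefaultColumnValues {
     /** Whether this source can persist DEFAULT metadata and replay it on read. */
     def supportsDefaultColumns: Boolean
   }
   ```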





[GitHub] [spark] gengliangwang closed pull request #36745: [SPARK-39359][SQL] Restrict DEFAULT columns to allowlist of supported data source types

Posted by GitBox <gi...@apache.org>.
gengliangwang closed pull request #36745: [SPARK-39359][SQL] Restrict DEFAULT columns to allowlist of supported data source types
URL: https://github.com/apache/spark/pull/36745




[GitHub] [spark] dtenedor commented on a diff in pull request #36745: [SPARK-39359][SQL] Restrict DEFAULT columns to allowlist of supported data source types

Posted by GitBox <gi...@apache.org>.
dtenedor commented on code in PR #36745:
URL: https://github.com/apache/spark/pull/36745#discussion_r891452022


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -2881,6 +2881,15 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)
 
+  val DEFAULT_COLUMN_ALLOWED_PROVIDERS =
+    buildConf("spark.sql.defaultColumn.allowedProviders")
+      .internal()
+      .doc("List of table providers wherein SQL commands are permitted to assign DEFAULT column " +
+        "values. Comma-separated list, whitespace ignored, case-insensitive.")
+      .version("3.4.0")
+      .stringConf
+      .createWithDefault("csv,json,orc,parquet")

Review Comment:
   @cloud-fan @gengliangwang this is a good point. I was thinking the default value of this conf lists the four data sources for which we actually support scanning the default values. Its primary purpose is to serve as a mechanism for banning default values with other data sources; users are not expected to change it.
   
   As Gengliang mentions, it could also serve as an escape hatch if we discover a bug in one data source later, and it's useful for testing. Maybe we can extend the description of this conf to mention that in the normal/expected case, users don't have to change anything.


