You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/05/10 18:09:46 UTC

[GitHub] [spark] dtenedor opened a new pull request, #36501: [SPARK-39143][SQL] Support CSV scans with DEFAULT values

dtenedor opened a new pull request, #36501:
URL: https://github.com/apache/spark/pull/36501

   ### What changes were proposed in this pull request?
   
   Support CSV scans when the table schema has associated DEFAULT column values.
   
   Example:
   
   ```
   create table t(i int) using csv;
   insert into t values(42);
   alter table t add column s string default concat('abc', def');
   select * from t;
   > 42, 'abcdef'
   ```
   
   ### Why are the changes needed?
   
   This change makes it easier to build, query, and maintain tables backed by CSV data.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes.
   
   ### How was this patch tested?
   
   This PR includes new test coverage.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dtenedor commented on a diff in pull request #36501: [SPARK-39143][SQL] Support CSV scans with DEFAULT values

Posted by GitBox <gi...@apache.org>.

dtenedor commented on code in PR #36501:
URL: https://github.com/apache/spark/pull/36501#discussion_r876464600


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala:
##########
@@ -511,6 +511,30 @@ case class StructType(fields: Array[StructField]) extends DataType with Seq[Stru
   @transient
   private[sql] lazy val interpretedOrdering =
     InterpretedOrdering.forSchema(this.fields.map(_.dataType))
+
+  /**
+   * Parses the text representing constant-folded default column literal values.
+   * @return a sequence of either (1) NULL, if the column had no default value, or (2) an object of
+   *         Any type suitable for assigning into a row using the InternalRow.update method.
+   */
+  private [sql] lazy val defaultValues: Array[Any] =
+    fields.map { field: StructField =>
+      val defaultValue: Option[String] = field.getExistenceDefaultValue()
+      defaultValue.map { text: String =>
+        val expr = try {
+          val expr = CatalystSqlParser.parseExpression(text)
+          expr match {
+            case _: ExprLiteral | _: AnsiCast | _: Cast => expr
+          }
+        } catch {
+          case _: ParseException | _: MatchError =>
+            throw QueryCompilationErrors.failedToParseExistenceDefaultAsLiteral(field.name, text)
+        }
+        // The expression should be a literal value by this point, possibly wrapped in a cast
+        // function. This is enforced by the execution of commands that assign default values.
+        expr.eval()
+      }.getOrElse(null)

Review Comment:
   NP, I can make these updates in the PR that adds JSON support!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36501: [SPARK-39143][SQL] Support CSV scans with DEFAULT values

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on code in PR #36501:
URL: https://github.com/apache/spark/pull/36501#discussion_r871905486


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructField.scala:
##########
@@ -118,6 +118,17 @@ case class StructField(
     }
   }
 
+  /**
+   * Return the existence default value of this StructField.
+   */
+  def getExistenceDefaultValue(): Option[String] = {

Review Comment:
   Also should probably protect this via `private[sql]` because `StructField` is an API.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dtenedor commented on a diff in pull request #36501: [SPARK-39143][SQL] Support CSV scans with DEFAULT values

Posted by GitBox <gi...@apache.org>.

dtenedor commented on code in PR #36501:
URL: https://github.com/apache/spark/pull/36501#discussion_r872912517


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##########
@@ -318,7 +318,8 @@ class UnivocityParser(
         case e: SparkUpgradeException => throw e
         case NonFatal(e) =>
           badRecordException = badRecordException.orElse(Some(e))
-          row.setNullAt(i)
+          // Use the corresponding DEFAULT value associated with the column, if any.
+          row.update(i, parsedSchema.defaultValues(i))

Review Comment:
   SG, done.



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##########
@@ -318,7 +318,8 @@ class UnivocityParser(
         case e: SparkUpgradeException => throw e
         case NonFatal(e) =>
           badRecordException = badRecordException.orElse(Some(e))
-          row.setNullAt(i)
+          // Use the corresponding DEFAULT value associated with the column, if any.
+          row.update(i, parsedSchema.defaultValues(i))

Review Comment:
   no worries :)



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructField.scala:
##########
@@ -118,6 +118,17 @@ case class StructField(
     }
   }
 
+  /**
+   * Return the existence default value of this StructField.
+   */
+  def getExistenceDefaultValue(): Option[String] = {

Review Comment:
   SG, done.
   
   We can keep the previous methods `withCurrentDefaultValue`, `clearCurrentDefaultValue`, and `getCurrentDefaultValue` public, for consistency with `withComment` and `getComment` which are also public. This seems reasonable since the "current default" value is the result of the most recent CREATE TABLE or ALTER TABLE statement, whereas the "existence default" is a hidden value equivalent to the constant-folded result of the *initial* default value when the column was created. (Let me know if you want me to make those other methods `private[sql]` as well and I can do that.)



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala:
##########
@@ -511,6 +511,30 @@ case class StructType(fields: Array[StructField]) extends DataType with Seq[Stru
   @transient
   private[sql] lazy val interpretedOrdering =
     InterpretedOrdering.forSchema(this.fields.map(_.dataType))
+
+  /**
+   * Parses the text representing constant-folded default column literal values.
+   * @return a sequence of either (1) NULL, if the column had no default value, or (2) an object of
+   *         Any type suitable for assigning into a row using the InternalRow.update method.
+   */
+  lazy val defaultValues: Array[Any] =

Review Comment:
   SG, done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36501: [SPARK-39143][SQL] Support CSV scans with DEFAULT values

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on code in PR #36501:
URL: https://github.com/apache/spark/pull/36501#discussion_r872943310


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala:
##########
@@ -511,6 +511,30 @@ case class StructType(fields: Array[StructField]) extends DataType with Seq[Stru
   @transient
   private[sql] lazy val interpretedOrdering =
     InterpretedOrdering.forSchema(this.fields.map(_.dataType))
+
+  /**
+   * Parses the text representing constant-folded default column literal values.
+   * @return a sequence of either (1) NULL, if the column had no default value, or (2) an object of
+   *         Any type suitable for assigning into a row using the InternalRow.update method.
+   */
+  private [sql] lazy val defaultValues: Array[Any] =

Review Comment:
   ```suggestion
     private[sql] lazy val defaultValues: Array[Any] =
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dtenedor commented on a diff in pull request #36501: [SPARK-39143][SQL] Support CSV scans with DEFAULT values

Posted by GitBox <gi...@apache.org>.

dtenedor commented on code in PR #36501:
URL: https://github.com/apache/spark/pull/36501#discussion_r871901106


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala:
##########
@@ -511,6 +511,30 @@ case class StructType(fields: Array[StructField]) extends DataType with Seq[Stru
   @transient
   private[sql] lazy val interpretedOrdering =
     InterpretedOrdering.forSchema(this.fields.map(_.dataType))
+
+  /**
+   * Parses the text representing constant-folded default column literal values.
+   * @return a sequence of either (1) NULL, if the column had no default value, or (2) an object of
+   *         Any type suitable for assigning into a row using the InternalRow.update method.
+   */
+  lazy val defaultValues: Seq[Any] =

Review Comment:
   Yeah, that's not a bad idea since we will be accessing this for every row of data source scans. Done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon closed pull request #36501: [SPARK-39143][SQL] Support CSV scans with DEFAULT values

Posted by GitBox <gi...@apache.org>.

HyukjinKwon closed pull request #36501: [SPARK-39143][SQL] Support CSV scans with DEFAULT values
URL: https://github.com/apache/spark/pull/36501


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] gengliangwang commented on a diff in pull request #36501: [SPARK-39143][SQL] Support CSV scans with DEFAULT values

Posted by GitBox <gi...@apache.org>.

gengliangwang commented on code in PR #36501:
URL: https://github.com/apache/spark/pull/36501#discussion_r876028176


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala:
##########
@@ -511,6 +511,30 @@ case class StructType(fields: Array[StructField]) extends DataType with Seq[Stru
   @transient
   private[sql] lazy val interpretedOrdering =
     InterpretedOrdering.forSchema(this.fields.map(_.dataType))
+
+  /**
+   * Parses the text representing constant-folded default column literal values.
+   * @return a sequence of either (1) NULL, if the column had no default value, or (2) an object of
+   *         Any type suitable for assigning into a row using the InternalRow.update method.
+   */
+  private [sql] lazy val defaultValues: Array[Any] =

Review Comment:
   Shall we rename this as existenceDefaultValues instead?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36501: [SPARK-39143][SQL] Support CSV scans with DEFAULT values

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on code in PR #36501:
URL: https://github.com/apache/spark/pull/36501#discussion_r876464220


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala:
##########
@@ -511,6 +511,30 @@ case class StructType(fields: Array[StructField]) extends DataType with Seq[Stru
   @transient
   private[sql] lazy val interpretedOrdering =
     InterpretedOrdering.forSchema(this.fields.map(_.dataType))
+
+  /**
+   * Parses the text representing constant-folded default column literal values.
+   * @return a sequence of either (1) NULL, if the column had no default value, or (2) an object of
+   *         Any type suitable for assigning into a row using the InternalRow.update method.
+   */
+  private [sql] lazy val defaultValues: Array[Any] =
+    fields.map { field: StructField =>
+      val defaultValue: Option[String] = field.getExistenceDefaultValue()
+      defaultValue.map { text: String =>
+        val expr = try {
+          val expr = CatalystSqlParser.parseExpression(text)
+          expr match {
+            case _: ExprLiteral | _: AnsiCast | _: Cast => expr
+          }
+        } catch {
+          case _: ParseException | _: MatchError =>
+            throw QueryCompilationErrors.failedToParseExistenceDefaultAsLiteral(field.name, text)
+        }
+        // The expression should be a literal value by this point, possibly wrapped in a cast
+        // function. This is enforced by the execution of commands that assign default values.
+        expr.eval()
+      }.getOrElse(null)

Review Comment:
   But it's already merged. so we could just leave it as is since it's just tiny nit :-).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36501: [SPARK-39143][SQL] Support CSV scans with DEFAULT values

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on code in PR #36501:
URL: https://github.com/apache/spark/pull/36501#discussion_r876463352


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala:
##########
@@ -511,6 +511,30 @@ case class StructType(fields: Array[StructField]) extends DataType with Seq[Stru
   @transient
   private[sql] lazy val interpretedOrdering =
     InterpretedOrdering.forSchema(this.fields.map(_.dataType))
+
+  /**
+   * Parses the text representing constant-folded default column literal values.
+   * @return a sequence of either (1) NULL, if the column had no default value, or (2) an object of
+   *         Any type suitable for assigning into a row using the InternalRow.update method.
+   */
+  private [sql] lazy val defaultValues: Array[Any] =

Review Comment:
   Is this because default values do not exist sometimes, or is it because `default` value can mean something like [`DecimalType.SYSTEM_DEFAULT`](https://github.com/apache/spark/blob/58d3f1516ed812b692709991e551829aa0090578/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala#L129)/[`DecimalType.USER_DEFAULT`](https://github.com/apache/spark/blob/58d3f1516ed812b692709991e551829aa0090578/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala#L130) or `0` for non-nullable numeric types? 
   
   `definedDefaultValues` as is sounds sane to me. Maybe we can rename something like `userDefinedDefaultValues` or `userSpecifiedDefaultValues` if there's the concern.
   
   BTW, `existenceDefaultValues` would have to be `existentDefaultValues` or `existingDefaultValues`? I am okay with this. Don't feel strongly. Just thinking load :-).
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36501: [SPARK-39143][SQL] Support CSV scans with DEFAULT values

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on code in PR #36501:
URL: https://github.com/apache/spark/pull/36501#discussion_r876463352


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala:
##########
@@ -511,6 +511,30 @@ case class StructType(fields: Array[StructField]) extends DataType with Seq[Stru
   @transient
   private[sql] lazy val interpretedOrdering =
     InterpretedOrdering.forSchema(this.fields.map(_.dataType))
+
+  /**
+   * Parses the text representing constant-folded default column literal values.
+   * @return a sequence of either (1) NULL, if the column had no default value, or (2) an object of
+   *         Any type suitable for assigning into a row using the InternalRow.update method.
+   */
+  private [sql] lazy val defaultValues: Array[Any] =

Review Comment:
   Is this because default values do not exist sometimes, or is it because `default` value can mean something like [`DecimalType.SYSTEM_DEFAULT`](https://github.com/apache/spark/blob/58d3f1516ed812b692709991e551829aa0090578/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala#L129) or `0` for non-nullable numeric types? 
   
   `definedDefaultValues` as is sounds sane to me. Maybe we can rename something like `userDefinedDefaultValues` or `userSpecifiedDefaultValues` if there's the concern.
   
   BTW, `existenceDefaultValues` would have to be `existentDefaultValues` or `existingDefaultValues`? I am okay with this. Don't feel strongly. Just thinking load :-).
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36501: [SPARK-39143][SQL] Support CSV scans with DEFAULT values

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on code in PR #36501:
URL: https://github.com/apache/spark/pull/36501#discussion_r871903016


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##########
@@ -318,7 +318,8 @@ class UnivocityParser(
         case e: SparkUpgradeException => throw e
         case NonFatal(e) =>
           badRecordException = badRecordException.orElse(Some(e))
-          row.setNullAt(i)
+          // Use the corresponding DEFAULT value associated with the column, if any.
+          row.update(i, parsedSchema.defaultValues(i))

Review Comment:
   Maybe we should do similar things at L286 too. The given `tokens` can be `null`, for example, when the input line is empty (this CSV code path is a bit convoluted)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on pull request #36501: [SPARK-39143][SQL] Support CSV scans with DEFAULT values

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on PR #36501:
URL: https://github.com/apache/spark/pull/36501#issuecomment-1129420807

   Merged to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36501: [SPARK-39143][SQL] Support CSV scans with DEFAULT values

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on code in PR #36501:
URL: https://github.com/apache/spark/pull/36501#discussion_r871900359


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala:
##########
@@ -511,6 +511,30 @@ case class StructType(fields: Array[StructField]) extends DataType with Seq[Stru
   @transient
   private[sql] lazy val interpretedOrdering =
     InterpretedOrdering.forSchema(this.fields.map(_.dataType))
+
+  /**
+   * Parses the text representing constant-folded default column literal values.
+   * @return a sequence of either (1) NULL, if the column had no default value, or (2) an object of
+   *         Any type suitable for assigning into a row using the InternalRow.update method.
+   */
+  lazy val defaultValues: Seq[Any] =

Review Comment:
   Should probably use `Array[Any]` or `IndexedSeq[Any]` because it can be a view instance that causes loop-up cost via iteration.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dtenedor commented on pull request #36501: [SPARK-39143][SQL] Support CSV scans with DEFAULT values

Posted by GitBox <gi...@apache.org>.

dtenedor commented on PR #36501:
URL: https://github.com/apache/spark/pull/36501#issuecomment-1129093762

   @HyukjinKwon I fixed the bad sync, this is ready to merge now at your convenience.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dtenedor commented on a diff in pull request #36501: [SPARK-39143][SQL] Support CSV scans with DEFAULT values

Posted by GitBox <gi...@apache.org>.

dtenedor commented on code in PR #36501:
URL: https://github.com/apache/spark/pull/36501#discussion_r876485553


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala:
##########
@@ -511,6 +511,30 @@ case class StructType(fields: Array[StructField]) extends DataType with Seq[Stru
   @transient
   private[sql] lazy val interpretedOrdering =
     InterpretedOrdering.forSchema(this.fields.map(_.dataType))
+
+  /**
+   * Parses the text representing constant-folded default column literal values.
+   * @return a sequence of either (1) NULL, if the column had no default value, or (2) an object of
+   *         Any type suitable for assigning into a row using the InternalRow.update method.
+   */
+  private [sql] lazy val defaultValues: Array[Any] =

Review Comment:
   I added a comment about this to the class field in the next PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dtenedor commented on a diff in pull request #36501: [SPARK-39143][SQL] Support CSV scans with DEFAULT values

Posted by GitBox <gi...@apache.org>.

dtenedor commented on code in PR #36501:
URL: https://github.com/apache/spark/pull/36501#discussion_r876086313


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala:
##########
@@ -511,6 +511,30 @@ case class StructType(fields: Array[StructField]) extends DataType with Seq[Stru
   @transient
   private[sql] lazy val interpretedOrdering =
     InterpretedOrdering.forSchema(this.fields.map(_.dataType))
+
+  /**
+   * Parses the text representing constant-folded default column literal values.
+   * @return a sequence of either (1) NULL, if the column had no default value, or (2) an object of
+   *         Any type suitable for assigning into a row using the InternalRow.update method.
+   */
+  private [sql] lazy val defaultValues: Array[Any] =

Review Comment:
   Thanks Gengliang, that's a good idea, I can rename that in the next PR :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36501: [SPARK-39143][SQL] Support CSV scans with DEFAULT values

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on code in PR #36501:
URL: https://github.com/apache/spark/pull/36501#discussion_r871905184


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala:
##########
@@ -511,6 +511,30 @@ case class StructType(fields: Array[StructField]) extends DataType with Seq[Stru
   @transient
   private[sql] lazy val interpretedOrdering =
     InterpretedOrdering.forSchema(this.fields.map(_.dataType))
+
+  /**
+   * Parses the text representing constant-folded default column literal values.
+   * @return a sequence of either (1) NULL, if the column had no default value, or (2) an object of
+   *         Any type suitable for assigning into a row using the InternalRow.update method.
+   */
+  lazy val defaultValues: Array[Any] =

Review Comment:
   Should probably protect this via `private[sql]` because `StructType` is an API.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on pull request #36501: [SPARK-39143][SQL] Support CSV scans with DEFAULT values

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on PR #36501:
URL: https://github.com/apache/spark/pull/36501#issuecomment-1125535738

   cc @MaxGekk FYI


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36501: [SPARK-39143][SQL] Support CSV scans with DEFAULT values

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on code in PR #36501:
URL: https://github.com/apache/spark/pull/36501#discussion_r871904967


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##########
@@ -318,7 +318,8 @@ class UnivocityParser(
         case e: SparkUpgradeException => throw e
         case NonFatal(e) =>
           badRecordException = badRecordException.orElse(Some(e))
-          row.setNullAt(i)
+          // Use the corresponding DEFAULT value associated with the column, if any.
+          row.update(i, parsedSchema.defaultValues(i))

Review Comment:
   BTW, would have to use `requiredSchema` instead of `parsedSchema`. When `CSVOptions.columnPruning` is disabled, `parsedSchema` becomes `dataSchema`. However, `getToken` reorders tokens to parse according to `requiredSchema` if I didn't misread the code path.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36501: [SPARK-39143][SQL] Support CSV scans with DEFAULT values

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on code in PR #36501:
URL: https://github.com/apache/spark/pull/36501#discussion_r872071300


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala:
##########
@@ -318,7 +318,8 @@ class UnivocityParser(
         case e: SparkUpgradeException => throw e
         case NonFatal(e) =>
           badRecordException = badRecordException.orElse(Some(e))
-          row.setNullAt(i)
+          // Use the corresponding DEFAULT value associated with the column, if any.
+          row.update(i, parsedSchema.defaultValues(i))

Review Comment:
   Oh wait, sorry I misread. I think we shouldn't produce a partial record for empty lines. The change as is seems fine.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on pull request #36501: [SPARK-39143][SQL] Support CSV scans with DEFAULT values

Posted by GitBox <gi...@apache.org>.

AmplabJenkins commented on PR #36501:
URL: https://github.com/apache/spark/pull/36501#issuecomment-1122856014

   Can one of the admins verify this patch?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on pull request #36501: [SPARK-39143][SQL] Support CSV scans with DEFAULT values

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on PR #36501:
URL: https://github.com/apache/spark/pull/36501#issuecomment-1126833206

   Test results are in https://github.com/dtenedor/spark/runs/6433699950. Seems like the sync went failed for some reasons.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dtenedor commented on pull request #36501: [SPARK-39143][SQL] Support CSV scans with DEFAULT values

Posted by GitBox <gi...@apache.org>.

dtenedor commented on PR #36501:
URL: https://github.com/apache/spark/pull/36501#issuecomment-1123120067

   @gengliangwang here is the DEFAULT column support for CSV files when ready


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HyukjinKwon commented on a diff in pull request #36501: [SPARK-39143][SQL] Support CSV scans with DEFAULT values

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on code in PR #36501:
URL: https://github.com/apache/spark/pull/36501#discussion_r876461566


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala:
##########
@@ -511,6 +511,30 @@ case class StructType(fields: Array[StructField]) extends DataType with Seq[Stru
   @transient
   private[sql] lazy val interpretedOrdering =
     InterpretedOrdering.forSchema(this.fields.map(_.dataType))
+
+  /**
+   * Parses the text representing constant-folded default column literal values.
+   * @return a sequence of either (1) NULL, if the column had no default value, or (2) an object of
+   *         Any type suitable for assigning into a row using the InternalRow.update method.
+   */
+  private [sql] lazy val defaultValues: Array[Any] =
+    fields.map { field: StructField =>
+      val defaultValue: Option[String] = field.getExistenceDefaultValue()
+      defaultValue.map { text: String =>
+        val expr = try {
+          val expr = CatalystSqlParser.parseExpression(text)
+          expr match {
+            case _: ExprLiteral | _: AnsiCast | _: Cast => expr
+          }
+        } catch {
+          case _: ParseException | _: MatchError =>
+            throw QueryCompilationErrors.failedToParseExistenceDefaultAsLiteral(field.name, text)
+        }
+        // The expression should be a literal value by this point, possibly wrapped in a cast
+        // function. This is enforced by the execution of commands that assign default values.
+        expr.eval()
+      }.getOrElse(null)

Review Comment:
   Maybe `orNull`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org