You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by "cloud-fan (via GitHub)" <gi...@apache.org> on 2023/07/17 01:59:58 UTC

[GitHub] [spark] cloud-fan commented on a diff in pull request #37011: [SPARK-39625][SPARK-38904][SQL] Add Dataset.as(StructType)

cloud-fan commented on code in PR #37011:
URL: https://github.com/apache/spark/pull/37011#discussion_r1264788328


##########
sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala:
##########
@@ -464,6 +464,29 @@ class Dataset[T] private[sql](
    */
   def as[U : Encoder]: Dataset[U] = Dataset[U](sparkSession, logicalPlan)
 
+  /**
+   * Returns a new DataFrame where each row is reconciled to match the specified schema. Spark will:
+   * <ul>
+   *   <li>Reorder columns and/or inner fields by name to match the specified schema.</li>
+   *   <li>Project away columns and/or inner fields that are not needed by the specified schema.
+   *   Missing columns and/or inner fields (present in the specified schema but not input DataFrame)
+   *   lead to failures.</li>
+   *   <li>Cast the columns and/or inner fields to match the data types in the specified schema, if
+   *   the types are compatible, e.g., numeric to numeric (error if overflows), but not string to
+   *   int.</li>
+   *   <li>Carry over the metadata from the specified schema, while the columns and/or inner fields
+   *   still keep their own metadata if not overwritten by the specified schema.</li>
+   *   <li>Fail if the nullability are not compatible. For example, the column and/or inner field is
+   *   nullable but the specified schema requires them to be not nullable.</li>
+   * </ul>
+   *
+   * @group basic
+   * @since 3.4.0
+   */
+  def as(schema: StructType): DataFrame = withPlan {
+    Project.matchSchema(logicalPlan, schema, sparkSession.sessionState.conf)

Review Comment:
   Good question. char/varchar is not officially supported in the type system so I think it's OK to ignore char/varchar here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org