Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/09/03 11:53:30 UTC

[GitHub] [spark] gaborgsomogyi commented on a change in pull request #29461: [SPARK-32456][SS][FOLLOWUP] Update doc to note about using SQL statement with streaming Dataset

gaborgsomogyi commented on a change in pull request #29461:
URL: https://github.com/apache/spark/pull/29461#discussion_r482914280



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
##########
@@ -2525,14 +2525,19 @@ class Dataset[T] private[sql](
 
   /**
    * Returns a new Dataset that contains only the unique rows from this Dataset.
-   * This is an alias for `distinct`.
+   * This is an alias for `distinct` on a batch [[Dataset]]. For a streaming [[Dataset]],
+   * the behavior differs slightly (see below).
    *
   * For a static batch [[Dataset]], it just drops duplicate rows. For a streaming [[Dataset]], it
   * will keep all data across triggers as intermediate state to drop duplicate rows. You can use
   * [[withWatermark]] to limit how late the duplicate data can be, and the system will limit the
   * state accordingly. In addition, data older than the watermark will be dropped to avoid any
   * possibility of duplicates.
    *
+   * Note that for a streaming [[Dataset]], this method returns distinct rows only once,
+   * regardless of the output mode. Spark may convert the `distinct` operation to aggregation,

Review comment:
       > Spark may convert the `distinct` operation to aggregation
   
    Can we add an example for the user? The same question applies to other places.
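    
    For instance, something along these lines could go into the docs — a minimal sketch, assuming the built-in `rate` source; the app name and column choices here are illustrative, not part of this PR:
    
    ```scala
    import org.apache.spark.sql.SparkSession
    
    val spark = SparkSession.builder()
      .appName("StreamingDedupSketch") // illustrative name
      .master("local[*]")
      .getOrCreate()
    
    // The built-in rate source emits (timestamp: Timestamp, value: Long) rows.
    val stream = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "10")
      .load()
    
    // Deduplicate on (value, timestamp). The watermark bounds the intermediate
    // state kept for deduplication: rows more than 10 minutes older than the
    // watermark are dropped instead of being kept as state forever.
    val deduped = stream
      .withWatermark("timestamp", "10 minutes")
      .dropDuplicates("value", "timestamp")
    
    // Each distinct row is emitted only once, regardless of the output mode.
    val query = deduped.writeStream
      .format("console")
      .outputMode("append")
      .start()
    
    query.awaitTermination()
    ```
    
    For the SQL statement path this follow-up documents, a contrasting `spark.sql("SELECT DISTINCT ...")` snippet could then show that the same query may be planned as an aggregation instead, with the corresponding output mode restrictions.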
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org