Posted to commits@spark.apache.org by sr...@apache.org on 2017/02/25 02:18:27 UTC

spark git commit: [MINOR][DOCS] Fix a few typos in structured streaming doc

Repository: spark
Updated Branches:
  refs/heads/master fa7c582e9 -> 1b9ba258e


[MINOR][DOCS] Fix a few typos in structured streaming doc

## What changes were proposed in this pull request?

Fixes a minor typo, `even-time`, which is changed to `event-time`, and a couple of grammatical errors.

## How was this patch tested?

N/A, since this is a doc fix. I did a Jekyll build locally, though.

Author: Ramkumar Venkataraman <rv...@paypal.com>

Closes #17037 from ramkumarvenkat/doc-fix.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1b9ba258
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1b9ba258
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1b9ba258

Branch: refs/heads/master
Commit: 1b9ba258e086e2ba89a4f35a54106e2f8a38b525
Parents: fa7c582
Author: Ramkumar Venkataraman <rv...@paypal.com>
Authored: Sat Feb 25 02:18:22 2017 +0000
Committer: Sean Owen <sr...@percale.home>
Committed: Sat Feb 25 02:18:22 2017 +0000

----------------------------------------------------------------------
 docs/structured-streaming-programming-guide.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/1b9ba258/docs/structured-streaming-programming-guide.md
----------------------------------------------------------------------
diff --git a/docs/structured-streaming-programming-guide.md b/docs/structured-streaming-programming-guide.md
index ad3b2fb..6af47b6 100644
--- a/docs/structured-streaming-programming-guide.md
+++ b/docs/structured-streaming-programming-guide.md
@@ -392,7 +392,7 @@ data, thus relieving the users from reasoning about it. As an example, let's
 see how this model handles event-time based processing and late arriving data.
 
 ## Handling Event-time and Late Data
-Event-time is the time embedded in the data itself. For many applications, you may want to operate on this event-time. For example, if you want to get the number of events generated by IoT devices every minute, then you probably want to use the time when the data was generated (that is, event-time in the data), rather than the time Spark receives them. This event-time is very naturally expressed in this model -- each event from the devices is a row in the table, and event-time is a column value in the row. This allows window-based aggregations (e.g. number of events every minute) to be just a special type of grouping and aggregation on the even-time column -- each time window is a group and each row can belong to multiple windows/groups. Therefore, such event-time-window-based aggregation queries can be defined consistently on both a static dataset (e.g. from collected device events logs) as well as on a data stream, making the life of the user much easier.
+Event-time is the time embedded in the data itself. For many applications, you may want to operate on this event-time. For example, if you want to get the number of events generated by IoT devices every minute, then you probably want to use the time when the data was generated (that is, event-time in the data), rather than the time Spark receives them. This event-time is very naturally expressed in this model -- each event from the devices is a row in the table, and event-time is a column value in the row. This allows window-based aggregations (e.g. number of events every minute) to be just a special type of grouping and aggregation on the event-time column -- each time window is a group and each row can belong to multiple windows/groups. Therefore, such event-time-window-based aggregation queries can be defined consistently on both a static dataset (e.g. from collected device events logs) as well as on a data stream, making the life of the user much easier.
 
 Furthermore, this model naturally handles data that has arrived later than 
 expected based on its event-time. Since Spark is updating the Result Table, 
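
To make the aggregation idea above concrete, here is a rough Scala sketch of counting IoT events per minute by grouping on the event-time column. The schema, column names, and input path are illustrative assumptions, not part of this patch:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.window
    import org.apache.spark.sql.types.{StringType, StructType, TimestampType}

    val spark = SparkSession.builder.appName("EventTimeSketch").getOrCreate()
    import spark.implicits._

    // Hypothetical schema: each event carries its own event-time column.
    val schema = new StructType()
      .add("deviceId", StringType)
      .add("eventTime", TimestampType)

    val events = spark.readStream
      .schema(schema)
      .json("/path/to/events")   // placeholder input path

    // Event-time is just a column, so "number of events every minute" is a
    // grouped aggregation over a 1-minute window on that column.
    val perMinute = events
      .groupBy(window($"eventTime", "1 minute"))
      .count()
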
@@ -401,7 +401,7 @@ as well as cleaning up old aggregates to limit the size of intermediate
 state data. Since Spark 2.1, we have support for watermarking which 
 allows the user to specify the threshold of late data, and allows the engine
 to accordingly clean up old state. These are explained later in more 
-details in the [Window Operations](#window-operations-on-event-time) section.
+detail in the [Window Operations](#window-operations-on-event-time) section.
 
 ## Fault Tolerance Semantics
 Delivering end-to-end exactly-once semantics was one of key goals behind the design of Structured Streaming. To achieve that, we have designed the Structured Streaming sources, the sinks and the execution engine to reliably track the exact progress of the processing so that it can handle any kind of failure by restarting and/or reprocessing. Every streaming source is assumed to have offsets (similar to Kafka offsets, or Kinesis sequence numbers)
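
The progress tracking described above is exposed to users through checkpointing. A minimal Scala sketch, assuming an aggregated streaming DataFrame named aggDF and a placeholder HDFS-compatible path:

    // The checkpoint location is where offsets and intermediate state are
    // recorded, so a restarted query can resume where it left off.
    val query = aggDF.writeStream
      .outputMode("complete")
      .option("checkpointLocation", "path/to/HDFS/dir")   // placeholder path
      .format("memory")
      .queryName("agg_counts")
      .start()
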
@@ -647,7 +647,7 @@ df.groupBy("deviceType").count()
 </div>
 
 ### Window Operations on Event Time
-Aggregations over a sliding event-time window are straightforward with Structured Streaming. The key idea to understand about window-based aggregations are very similar to grouped aggregations. In a grouped aggregation, aggregate values (e.g. counts) are maintained for each unique value in the user-specified grouping column. In case of window-based aggregations, aggregate values are maintained for each window the event-time of a row falls into. Let's understand this with an illustration. 
+Aggregations over a sliding event-time window are straightforward with Structured Streaming and are very similar to grouped aggregations. In a grouped aggregation, aggregate values (e.g. counts) are maintained for each unique value in the user-specified grouping column. In case of window-based aggregations, aggregate values are maintained for each window the event-time of a row falls into. Let's understand this with an illustration. 
 
 Imagine our [quick example](#quick-example) is modified and the stream now contains lines along with the time when the line was generated. Instead of running word counts, we want to count words within 10 minute windows, updating every 5 minutes. That is, word counts in words received between 10 minute windows 12:00 - 12:10, 12:05 - 12:15, 12:10 - 12:20, etc. Note that 12:00 - 12:10 means data that arrived after 12:00 but before 12:10. Now, consider a word that was received at 12:07. This word should increment the counts corresponding to two windows 12:00 - 12:10 and 12:05 - 12:15. So the counts will be indexed by both, the grouping key (i.e. the word) and the window (can be calculated from the event-time).
 
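A short Scala sketch of the windowed count described above, assuming a streaming Dataset named words with a "timestamp" event-time column and a "word" column (the names follow the guide's quick example but are assumptions here):

    import org.apache.spark.sql.functions.window

    // Counts are indexed by both the grouping key (the word) and the
    // sliding 10-minute window that advances every 5 minutes.
    val windowedCounts = words
      .groupBy(
        window($"timestamp", "10 minutes", "5 minutes"),
        $"word")
      .count()
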
@@ -713,7 +713,7 @@ old windows correctly, as illustrated below.
 
 ![Handling Late Data](img/structured-streaming-late-data.png)
 
-However, to run this query for days, its necessary for the system to bound the amount of 
+However, to run this query for days, it's necessary for the system to bound the amount of 
 intermediate in-memory state it accumulates. This means the system needs to know when an old 
 aggregate can be dropped from the in-memory state because the application is not going to receive 
 late data for that aggregate any more. To enable this, in Spark 2.1, we have introduced 
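
The mechanism being introduced here is the watermark support added in Spark 2.1. A hedged Scala sketch of the same windowed count with a watermark, letting the engine drop state for data more than 10 minutes late (the threshold and column names are illustrative):

    // withWatermark declares how late data may arrive relative to the maximum
    // event-time seen so far; state for older windows can then be dropped.
    val windowedCounts = words
      .withWatermark("timestamp", "10 minutes")
      .groupBy(
        window($"timestamp", "10 minutes", "5 minutes"),
        $"word")
      .count()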

