Posted to commits@samza.apache.org by ja...@apache.org on 2018/10/24 06:43:30 UTC

samza git commit: Clean-up the Quick-Start and Code-Examples pages; Re-organize content

Repository: samza
Updated Branches:
  refs/heads/master 282f83494 -> 59bc23cf9


Clean-up the Quick-Start and Code-Examples pages; Re-organize content

Author: Jagadish <jv...@linkedin.com>

Reviewers: Jagadish <ja...@apache.org>

Closes #759 from vjagadish1989/website-reorg23


Project: http://git-wip-us.apache.org/repos/asf/samza/repo
Commit: http://git-wip-us.apache.org/repos/asf/samza/commit/59bc23cf
Tree: http://git-wip-us.apache.org/repos/asf/samza/tree/59bc23cf
Diff: http://git-wip-us.apache.org/repos/asf/samza/diff/59bc23cf

Branch: refs/heads/master
Commit: 59bc23cf954bf86683c499729930f126af5ebf2b
Parents: 282f834
Author: Jagadish <jv...@linkedin.com>
Authored: Tue Oct 23 23:39:54 2018 -0700
Committer: Jagadish <jv...@linkedin.com>
Committed: Tue Oct 23 23:39:54 2018 -0700

----------------------------------------------------------------------
 docs/_docs/replace-versioned.sh               |  5 +-
 docs/_menu/index.html                         |  2 +-
 docs/startup/code-examples/versioned/index.md | 49 +++++++++++++
 docs/startup/quick-start/versioned/index.md   | 83 ++++++++++------------
 4 files changed, 91 insertions(+), 48 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/samza/blob/59bc23cf/docs/_docs/replace-versioned.sh
----------------------------------------------------------------------
diff --git a/docs/_docs/replace-versioned.sh b/docs/_docs/replace-versioned.sh
index 24bf7ae..c454cac 100755
--- a/docs/_docs/replace-versioned.sh
+++ b/docs/_docs/replace-versioned.sh
@@ -44,4 +44,7 @@ echo "replaced startup/hello-samza/versioned to startup/hello-samza/"$version
 mv -f $DIR/_site/startup/hello-samza/versioned $DIR/_site/startup/hello-samza/$version
 
 echo "replaced startup/quick-start/versioned to startup/quick-start/"$version
-mv -f $DIR/_site/startup/quick-start/versioned $DIR/_site/startup/quick-start/$version
\ No newline at end of file
+mv -f $DIR/_site/startup/quick-start/versioned $DIR/_site/startup/quick-start/$version
+
+echo "replaced startup/code-examples/versioned to startup/code-examples/"$version
+mv -f $DIR/_site/startup/code-examples/versioned $DIR/_site/startup/code-examples/$version

http://git-wip-us.apache.org/repos/asf/samza/blob/59bc23cf/docs/_menu/index.html
----------------------------------------------------------------------
diff --git a/docs/_menu/index.html b/docs/_menu/index.html
index 0d1750f..a363bae 100644
--- a/docs/_menu/index.html
+++ b/docs/_menu/index.html
@@ -5,7 +5,7 @@ items:
       - menu_title: QuickStart
         url: /startup/quick-start/version/
       - menu_title: Code Examples
-        url: /learn/tutorials/version/
+        url: /startup/code-examples/version/
   - menu_title: Documentation
     has_sub: true
     has_sub_subs: true

http://git-wip-us.apache.org/repos/asf/samza/blob/59bc23cf/docs/startup/code-examples/versioned/index.md
----------------------------------------------------------------------
diff --git a/docs/startup/code-examples/versioned/index.md b/docs/startup/code-examples/versioned/index.md
new file mode 100644
index 0000000..ba1cc3e
--- /dev/null
+++ b/docs/startup/code-examples/versioned/index.md
@@ -0,0 +1,49 @@
+---
+layout: page
+title:
+---
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
+
+
+### Checking out our examples
+
+The [hello-samza](https://github.com/apache/samza-hello-samza) project contains several examples to help you create your Samza applications. To check out the hello-samza project:
+
+{% highlight bash %}
+> git clone https://git.apache.org/samza-hello-samza.git hello-samza
+{% endhighlight %}
+
+#### High-level API examples
+[The Samza Cookbook](https://github.com/apache/samza-hello-samza/tree/master/src/main/java/samza/examples/cookbook) contains various recipes using the Samza high-level API.
+These include:
+
+- The [Filter example](https://github.com/apache/samza-hello-samza/blob/latest/src/main/java/samza/examples/cookbook/FilterExample.java) demonstrates how to perform stateless operations on a stream. 
+
+- The [Join example](https://github.com/apache/samza-hello-samza/blob/latest/src/main/java/samza/examples/cookbook/JoinExample.java) demonstrates how you can join a Kafka stream of page-views with a stream of ad-clicks.
+
+- The [Stream-Table Join example](https://github.com/apache/samza-hello-samza/blob/latest/src/main/java/samza/examples/cookbook/RemoteTableJoinExample.java) demonstrates how to use the Samza Table API. It joins a Kafka stream with a remote dataset accessed through a REST service.
+
+- The [SessionWindow](https://github.com/apache/samza-hello-samza/blob/latest/src/main/java/samza/examples/cookbook/SessionWindowExample.java) and [TumblingWindow](https://github.com/apache/samza-hello-samza/blob/latest/src/main/java/samza/examples/cookbook/TumblingWindowExample.java) examples illustrate Samza's rich windowing and triggering capabilities.
+
+
+In addition to the cookbook, you can also consult these:
+
+- [Wikipedia Parser](https://github.com/apache/samza-hello-samza/tree/master/src/main/java/samza/examples/wikipedia): An advanced example that builds a streaming pipeline consuming a live feed of Wikipedia edits, parsing each message and generating statistics from them.
+
+
+- [Amazon Kinesis](https://github.com/apache/samza-hello-samza/tree/master/src/main/java/samza/examples/kinesis) and [Azure Eventhubs](https://github.com/apache/samza-hello-samza/tree/latest/src/main/java/samza/examples/azure) examples that cover how to consume input data from the respective systems.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/samza/blob/59bc23cf/docs/startup/quick-start/versioned/index.md
----------------------------------------------------------------------
diff --git a/docs/startup/quick-start/versioned/index.md b/docs/startup/quick-start/versioned/index.md
index a046ee7..30add8a 100644
--- a/docs/startup/quick-start/versioned/index.md
+++ b/docs/startup/quick-start/versioned/index.md
@@ -19,11 +19,11 @@ title: Quick Start
    limitations under the License.
 -->
 
-This tutorial will go through the steps of creating your first Samza application - `WordCount`. It demonstrates how to start writing a Samza application, consume from a kafka stream, tokenize the lines into words, and count the frequency of each word.  For this tutorial we are going to use gradle 4.9 to build the projects. The full tutorial project tar file can be downloaded [here](https://github.com/apache/samza-hello-samza/blob/latest/quickstart/wordcount.tar.gz).
+In this tutorial, we will create our first Samza application - `WordCount`. This application will consume messages from a Kafka stream, tokenize them into individual words and count the frequency of each word. You can download the entire project [here](https://github.com/apache/samza-hello-samza/blob/latest/quickstart/wordcount.tar.gz).
 
 ### Setting up a Java Project
 
-First let’s create the project structure as follows:
+The project is structured as follows:
 
 {% highlight bash %}
 wordcount
@@ -38,7 +38,7 @@ wordcount
                  |-- WordCount.java
 {% endhighlight %}
 
-You can copy build.gradle and gradle.properties files from the downloaded tutorial tgz file. The WordCount class is just an empty class for now. Once finishing this setup, you can build the project by:
+You can build the project anytime by running:
 
 {% highlight bash %}
 > cd wordcount
@@ -48,7 +48,7 @@ You can copy build.gradle and gradle.properties files from the downloaded tutori
 
 ### Create a Samza StreamApplication
 
-Now let’s write some code! The first step is to create your own Samza application by implementing the [StreamApplication](/learn/documentation/{{site.version}}/api/javadocs/org/apache/samza/application/StreamApplication.html) class:
+Now let’s write some code! An application written using Samza's [high-level API](/learn/documentation/{{site.version}}/api/api/high-level-api.html) implements the [StreamApplication](/learn/documentation/{{site.version}}/api/javadocs/org/apache/samza/application/StreamApplication.html) interface:
 
 {% highlight java %}
 package samzaapp;
@@ -63,11 +63,11 @@ public class WordCount implements StreamApplication {
 }
 {% endhighlight %}
 
-The StreamApplication interface provides an API method named describe() for you to specify your streaming pipeline. Using [StreamApplicationDescriptor](/learn/documentation/{{site.version}}/api/javadocs/org/apache/samza/application/StreamApplicationDescriptor.html), you can describe your entire data processing task from data inputs, operations and outputs.
+The interface provides a single method named `describe()`, which allows us to define our inputs, the processing logic and outputs for our application. 
 
-### Input data source using Kafka
+### Describe your inputs and outputs
 
-In this example, we are going to use Kafka as the input data source and consume the text for word count line by line. We start by defining a KafkaSystemDescriptor, which specifies the properties to establishing the connection to the local Kafka cluster. Then we create a  `KafkaInputDescriptor`/`KafkaOutputDescriptor` to set up the topic, Serializer and Deserializer. Finally we use this input in the [StreamApplicationDescriptor](/learn/documentation/{{site.version}}/api/javadocs/org/apache/samza/application/StreamApplicationDescriptor.html) so we can consume from this topic. The code is in the following:
+To interact with Kafka, we will first create a `KafkaSystemDescriptor` by providing the coordinates of the Kafka cluster. For each Kafka topic our application reads from, we create a `KafkaInputDescriptor` with the name of the topic and a serializer. Likewise, for each output topic, we instantiate a corresponding `KafkaOutputDescriptor`. 
 
 {% highlight java %}
 public class WordCount implements StreamApplication {
@@ -81,11 +81,13 @@ public class WordCount implements StreamApplication {
 
  @Override
  public void describe(StreamApplicationDescriptor streamApplicationDescriptor) {
+   // Create a KafkaSystemDescriptor providing properties of the cluster
    KafkaSystemDescriptor kafkaSystemDescriptor = new KafkaSystemDescriptor(KAFKA_SYSTEM_NAME)
        .withConsumerZkConnect(KAFKA_CONSUMER_ZK_CONNECT)
        .withProducerBootstrapServers(KAFKA_PRODUCER_BOOTSTRAP_SERVERS)
        .withDefaultStreamConfigs(KAFKA_DEFAULT_STREAM_CONFIGS);
 
+   // For each input or output stream, create a KafkaInput/Output descriptor
    KafkaInputDescriptor<KV<String, String>> inputDescriptor =
        kafkaSystemDescriptor.getInputDescriptor(INPUT_STREAM_ID,
            KVSerde.of(new StringSerde(), new StringSerde()));
@@ -93,29 +95,31 @@ public class WordCount implements StreamApplication {
        kafkaSystemDescriptor.getOutputDescriptor(OUTPUT_STREAM_ID,
            KVSerde.of(new StringSerde(), new StringSerde()));
 
+   // Obtain a handle to a MessageStream that you can chain operations on
    MessageStream<KV<String, String>> lines = streamApplicationDescriptor.getInputStream(inputDescriptor);
    OutputStream<KV<String, String>> counts = streamApplicationDescriptor.getOutputStream(outputDescriptor);
  }
 }
 {% endhighlight %}
 
-The resulting [MessageStream](/learn/documentation/{{site.version}}/api/javadocs/org/apache/samza/operators/MessageStream.html) lines contains the data set that reads from Kafka and deserialized into string of each line. We also defined the output stream counts so we can write the word count results to it. Next let’s add processing logic. 
+The above example creates a [MessageStream](/learn/documentation/{{site.version}}/api/javadocs/org/apache/samza/operators/MessageStream.html) which reads from an input topic named `sample-text`. It also defines an output stream that emits results to a topic named `word-count-output`. Next let’s add our processing logic. 
 
 ### Add word count processing logic
 
-First we are going to extract the value from lines. This is a one-to-one transform and we can use the Samza map operator as following:
+Kafka messages typically have a key and a value. Since we only care about the value here, we will apply the `map` operator on the input stream to extract the value. 
 
 {% highlight java %}
-lines .map(kv -> kv.value)
+lines.map(kv -> kv.value)
 {% endhighlight %}
 
-Then we will split the line into words by using the flatmap operator:
+Next, we will tokenize the message into individual words using the `flatMap` operator.
 
 {% highlight java %}
 .flatMap(s -> Arrays.asList(s.split("\\W+")))
 {% endhighlight %}
 
-Now let’s think about how to count the words. We need to aggregate the count based on the word as the key, and emit the aggregation results once there are no more data coming. Here we can use a session window which will trigger the output if there is no data coming within a certain interval.
+
+We now need to group the words, aggregate their respective counts and periodically emit our results. For this, we will use Samza's session-windowing feature.
 
 {% highlight java %}
 .window(Windows.keyedSessionWindow(
@@ -123,7 +127,11 @@ Now let’s think about how to count the words. We need to aggregate the count b
    new StringSerde(), new IntegerSerde()), "count")
 {% endhighlight %}
 
-The output will be captured in a [WindowPane](/learn/documentation/{{site.version}}/api/javadocs/org/apache/samza/operators/windows/WindowPane.html) type, which contains the key and the aggregation value. We add a further map to transform that into a KV. To write the output to the output Kafka stream, we used the sentTo operator in Samza:
+Let's walk through each of the parameters to the above `window` function:
+The first parameter is a "key function", which defines the key to group messages by. In our case, we can simply use the word as the key. The second parameter is the windowing interval, which is set to 5 seconds. The third parameter is a function which provides the initial value for our aggregations. We can start with an initial count of zero for each word. The fourth parameter is an aggregation function for computing counts. The next two parameters specify the key and value serializers for our window. 
+
+The output from the window operator is captured in a [WindowPane](/learn/documentation/{{site.version}}/api/javadocs/org/apache/samza/operators/windows/WindowPane.html) type, which contains the word as the key and its count as the value. We add a further `map` to format this into a `KV` that we can send to our Kafka topic. To write our results to the output topic, we use the `sendTo` operator in Samza.
+
 
 {% highlight java %}
 .map(windowPane ->
@@ -148,27 +156,31 @@ lines
 {% endhighlight %}
 
 
-### Config your application
+### Configure your application
 
-In this section we will configure the word count example to run locally in a single JVM. Please add a file named “word-count.properties” under the config folder. We will add the job configs in this file.
-
-Since there is only a single Samza processor, there is no coordination required. We use the PassthroughJobCoordinator for the example. We also group all Samza tasks into this single processor. As for the Kafka topic, we will consume from the beginning. Here is the full config needed for the job:
+In this section, we will configure our word count example to run locally in a single JVM. Let us add a file named “word-count.properties” under the config folder. 
 
 {% highlight jproperties %}
 job.name=word-count
+# Use a PassthroughJobCoordinator since there is no coordination needed
 job.coordinator.factory=org.apache.samza.standalone.PassthroughJobCoordinatorFactory
 job.coordination.utils.factory=org.apache.samza.standalone.PassthroughCoordinationUtilsFactory
+
 job.changelog.system=kafka
+
+# Use a single container to process all of the data
 task.name.grouper.factory=org.apache.samza.container.grouper.task.SingleContainerGrouperFactory
 processor.id=0
+
+# Read from the beginning of the topic
 systems.kafka.default.stream.samza.offset.default=oldest
 {% endhighlight %}
 
-For more details about Samza config, feel free to check out the latest config [here](/learn/documentation/{{site.version}}/jobs/configuration-table.html).
+For more details on Samza's configs, feel free to check out the latest [configuration reference](/learn/documentation/{{site.version}}/jobs/configuration-table.html).
 
 ### Run your application
 
-Let’s add a `main()` function to `WordCount` class first. The function reads the config file and factory from the args, and create a `LocalApplicationRunner` to run the application locally. Here is the function details:
+We are ready to add a `main()` function to the `WordCount` class. It parses the command-line arguments and instantiates a `LocalApplicationRunner` to execute the application locally.
 
 {% highlight java %}
 public static void main(String[] args) {
@@ -181,36 +193,29 @@ public static void main(String[] args) {
 }
 {% endhighlight %}
 
-In your "build.gradle" file, please add the following so we can use gradle to run it:
-
-{% highlight jproperties %}
-apply plugin:'application'
-
-mainClassName = "samzaapp.WordCount"
-{% endhighlight %}
 
-Before running `main()`, we need to create the input Kafka topic with some sample data. Let’s start a local kafka broker first. Samza examples provides a script named “grid” which you can use to start zookeeper, kafka broker and yarn. Your can download it [here](https://github.com/apache/samza-hello-samza/blob/master/bin/grid) and put it under scripts/ folder, then issue the following command:
+Before running `main()`, we will create our input Kafka topic and populate it with sample data. You can download the scripts to interact with Kafka along with the sample data from [here](https://github.com/apache/samza-hello-samza/blob/latest/quickstart/wordcount.tar.gz).
 
 {% highlight bash %}
 > ./scripts/grid install zookeeper && ./scripts/grid start zookeeper
 > ./scripts/grid install kafka && ./scripts/grid start kafka
 {% endhighlight %}
 
-Next we will create a Kafka topic named sample-text, and publish some sample data into it. A "sample-text.txt" file is included in the downloaded tutorial tgz file. In command line:
 
 {% highlight bash %}
 > ./deploy/kafka/bin/kafka-topics.sh --create --zookeeper localhost:2181 --topic sample-text --partition 1 --replication-factor 1
 > ./deploy/kafka/bin/kafka-console-producer.sh --topic sample-text --broker localhost:9092 < ./sample-text.txt
 {% endhighlight %}
 
-Now let’s fire up our application. Here we use gradle to run it. You can also run it directly within your IDE, with the same program arguments.
+Let’s kick off our application using Gradle. Alternatively, you can run it directly from your IDE with the same program arguments.
 
 {% highlight bash %}
 > export BASE_DIR=`pwd`
 > ./gradlew run --args="--config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=file://$BASE_DIR/src/main/config/word-count.properties"
 {% endhighlight %}
 
-This application will output to a Kafka topic named "word-count-output". Let’s consume this topic to check out the results:
+
+The application will output to a Kafka topic named "word-count-output". We will now fire up a Kafka consumer to read from this topic:
 
 {% highlight bash %}
 >  ./deploy/kafka/bin/kafka-console-consumer.sh --topic word-count-output --zookeeper localhost:2181 --from-beginning
@@ -235,20 +240,6 @@ and: 243
 from: 16
 {% endhighlight %}
 
-### More Examples
-
-The [hello-samza](https://github.com/apache/samza-hello-samza) project contains a lot of more examples to help you create your Samza job. To checkout the hello-samza project:
-
-{% highlight bash %}
-> git clone https://git.apache.org/samza-hello-samza.git hello-samza
-{% endhighlight %}
-
-There are four main categories of examples in this project, including:
-
-1. [wikipedia](https://github.com/apache/samza-hello-samza/tree/master/src/main/java/samza/examples/wikipedia): this is a more complex example demonstrating the entire pipeline of consuming from the live feed from wikipedia edits, parsing the message and generating statistics from them.
-
-2. [cookbook](https://github.com/apache/samza-hello-samza/tree/master/src/main/java/samza/examples/cookbook): you will find various examples in this folder to demonstrate usage of Samza high-level API, such as windowing, join and aggregations.
-
-3. [asure](https://github.com/apache/samza-hello-samza/tree/master/src/main/java/samza/examples/azure): this example shows how to run your application on Microsoft Asure.
+Congratulations! You've successfully run your first Samza application.
 
-4. [kinesis](https://github.com/apache/samza-hello-samza/tree/master/src/main/java/samza/examples/kinesis): this example shows how to consume from Kinesis streams
\ No newline at end of file
+### [More Examples >>](/startup/code-examples/{{site.version}})
\ No newline at end of file
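
The tokenize-and-count logic described in the quick-start page above (split each line on `\W+`, start each word at a count of zero, add one per occurrence) can be sketched in plain Java, outside Samza. This is only an illustration under stated assumptions: the class `WordCountSketch` and the method `countWords` are hypothetical names that are not part of this commit or of the Samza API, the sketch counts over a fixed list of lines rather than a session window, and it skips the empty token that `split` produces when a line starts with a non-word character.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Samza or of this commit.
public class WordCountSketch {

    // Counts word occurrences across all lines; keys are sorted for stable output.
    static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            // Same tokenizer as the quick-start's flatMap step.
            for (String word : line.split("\\W+")) {
                if (word.isEmpty()) {
                    continue; // assumption: ignore the leading empty token
                }
                // Start at 0 and add 1 per occurrence, like the window's fold.
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("the cat and the hat", "The end");
        countWords(lines).forEach((word, count) ->
                System.out.println(word + ": " + count));
    }
}
```

Like the quick-start pipeline, the sketch is case-sensitive, so `The` and `the` are counted separately. In the real application the same fold runs per key inside a session window, and results are emitted once a key has been idle for the configured gap.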