You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@samza.apache.org by ja...@apache.org on 2018/10/11 23:26:29 UTC

samza git commit: Samza SQL Documentation

Repository: samza
Updated Branches:
  refs/heads/master efb2180cf -> 9903539a2


Samza SQL Documentation

This is initial Samza SQL documentation. vjagadish I still am tracking your comment to add more details on various queries, Samza SQL components. I think i need more time for those. If i can i will try to add them before Samza 1.0. If not we can target that for the next release.

![snappy](https://user-images.githubusercontent.com/1783806/46564244-faae4f80-c8ba-11e8-8e1c-cc4d929d79f1.png)

Author: Srinivasulu Punuru <sp...@linkedin.com>

Reviewers: Jagadish<ja...@apache.org>

Closes #695 from srinipunuru/samza-sql-docs.1


Project: http://git-wip-us.apache.org/repos/asf/samza/repo
Commit: http://git-wip-us.apache.org/repos/asf/samza/commit/9903539a
Tree: http://git-wip-us.apache.org/repos/asf/samza/tree/9903539a
Diff: http://git-wip-us.apache.org/repos/asf/samza/diff/9903539a

Branch: refs/heads/master
Commit: 9903539a22c49f247cb17022688e80cdd515949e
Parents: efb2180
Author: Srinivasulu Punuru <sp...@linkedin.com>
Authored: Thu Oct 11 16:26:24 2018 -0700
Committer: Jagadish <jv...@linkedin.com>
Committed: Thu Oct 11 16:26:24 2018 -0700

----------------------------------------------------------------------
 .../documentation/versioned/api/samza-sql.md    | 174 +++++++++++++++++--
 1 file changed, 156 insertions(+), 18 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/samza/blob/9903539a/docs/learn/documentation/versioned/api/samza-sql.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/versioned/api/samza-sql.md b/docs/learn/documentation/versioned/api/samza-sql.md
index bad7545..769b7ec 100644
--- a/docs/learn/documentation/versioned/api/samza-sql.md
+++ b/docs/learn/documentation/versioned/api/samza-sql.md
@@ -20,33 +20,171 @@ title: Samza SQL
 -->
 
 
-# Section 1
+# Overview
+Samza SQL allows users to write Stream processing application by just writing a SQL query. SQL query is translated to a Samza job using the high level API, which can then be run on all the environments (e.g. Yarn, standalone, etc..) that Samza currently supports.
 
-# Sample Applications
+Samza SQL was created with following principles:
 
+1. Users of Samza SQL should be able to create stream processing apps without needing to write Java code unless they require UDFs.
+2. No configs needed by users to develop and execute a Samza SQL job.
+3. Samza SQL should support all the runtimes and systems that Samza supports.   
 
-# Section 2
+Samza SQL uses Apache Calcite to provide the SQL interface. Apache Calcite is a popular SQL engine used by wide variety of big data processing systems (Beam, Flink, Storm etc..)
 
-# Section 3
+# How to use Samza SQL
+There are couple of ways to use Samza SQL:
 
+* Run Samza SQL on your local machine.
+* Run Samza SQL on YARN.
 
-# Section 4
+# Running Samza SQL on your local machine
+Samza SQL console tool documented [here](https://samza.apache.org/learn/tutorials/0.14/samza-tools.html) uses Samza standalone to run Samza SQL on your local machine. This is the quickest way to play with Samza SQL. Please follow the instructions [here](https://samza.apache.org/learn/tutorials/0.14/samza-tools.html) to get access to the Samza tools on your machine.
 
-The table below summarizes table metrics:
+## Start the Kafka server
+Please follow the instructions from the [Kafka quickstart](http://kafka.apache.org/quickstart) to start the zookeeper and Kafka server.
 
+## Create ProfileChangeStream Kafka topic
+The below sql statements require a topic named ProfileChangeStream to be created on the Kafka broker. You can follow the instructions in the Kafka quick start guide to create a topic named “ProfileChangeStream”.
 
-| Metrics | Class | Description |
-|---------|-------|-------------|
-|`get-ns`|`ReadableTable`|Average latency of `get/getAsync()` operations|
-|`getAll-ns`|`ReadableTable`|Average latency of `getAll/getAllAsync()` operations|
-|`num-gets`|`ReadableTable`|Count of `get/getAsync()` operations
-|`num-getAlls`|`ReadableTable`|Count of `getAll/getAllAsync()` operations
+```bash
+>./bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic ProfileChangeStream
+```
 
+## Generate events into ProfileChangeStream topic
+Use generate-kafka-events from Samza tools to generate events into the ProfileChangeStream
 
-### Section 5 example
+```bash
+> cd samza-tools-<version>
+> ./scripts/generate-kafka-events.sh -t ProfileChangeStream -e ProfileChange
+```
 
-It is up to the developer whether to implement both `TableReadFunction` and 
-`TableWriteFunction` in one class or two separate classes. Defining them in 
-separate classes can be cleaner if their implementations are elaborate and 
-extended, whereas keeping them in a single class may be more practical if 
-they share a considerable amount of code or are relatively short.
+## Using Samza SQL Console to run Samza sql on your local machine
+
+Below are some of the sql queries that you can execute using the samza-sql-console tool from Samza tools package.
+
+This command just prints out all the events in the Kafka topic ProfileChangeStream into console output as a json serialized payload.
+
+```bash
+> ./scripts/samza-sql-console.sh --sql "insert into log.consoleoutput select * from kafka.ProfileChangeStream"
+```
+
+This command prints out the fields that are selected into the console output as a json serialized payload.
+
+```bash
+> ./scripts/samza-sql-console.sh --sql "insert into log.consoleoutput select Name, OldCompany, NewCompany from kafka.ProfileChangeStream"
+```
+
+
+This command showcases the RegexMatch udf and filtering capabilities.
+
+```bash
+> ./scripts/samza-sql-console.sh --sql "insert into log.consoleoutput select Name as __key__, Name, NewCompany, RegexMatch('.*soft', OldCompany) from kafka.ProfileChangeStream where NewCompany = 'LinkedIn'"
+```
+
+Note: Samza sql console right now doesn’t support queries that need state, for e.g. Stream-Table join, GroupBy and Stream-Stream joins.
+
+
+
+
+# Running Samza SQL on YARN
+The [hello-samza](https://github.com/apache/samza-hello-samza) project is an example project designed to help you run your first Samza application. It has examples of applications using the low level task API, high level API as well as Samza SQL.
+
+This tutorial demonstrates a simple Samza application that uses SQL to perform stream processing.
+
+## Get the hello-samza code and start the grid
+Please follow the instructions from hello-samza-high-level-yarn on how to build the hello-samza repository and start the yarn grid.
+
+## Create the topic and generate Kafka events
+Please follow the steps in the section “Create ProfileChangeStream Kafka topic” and “Generate events into ProfileChangeStream topic” above.
+
+Build a Samza Application package
+Before you can run a Samza application, you need to build a package for it. Please follow the instructions from hello-samza-high-level-yarn on how to build the hello-samza application package.
+
+## Run a Samza Application
+After you’ve built your Samza package, you can start the app on the grid using the run-app.sh script.
+
+```bash
+> ./deploy/samza/bin/run-app.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=file://$PWD/deploy/samza/config/page-view-filter-sql.properties
+```
+
+The app executes the following SQL command :
+
+
+```sql
+insert into kafka.NewLinkedInEmployees select Name from ProfileChangeStream where NewCompany = 'LinkedIn'
+```
+
+
+This SQL performs the following:
+
+* Consumes the Kafka topic ProfileChangeStream which contains the avro serialized ProfileChangeEvent(s)
+* Deserializes the events and filters out only the profile change events where NewCompany = ‘LinkedIn’ i.e. Members who have moved to LinkedIn.
+* Writes the Avro serialized event that contains the Id and Name of those profiles to Kafka topic NewLinkedInEmployees.
+
+Give the job a minute to startup, and then tail the Kafka topic:
+
+```bash
+> ./deploy/kafka/bin/kafka-console-consumer.sh  --zookeeper localhost:2181 --topic NewLinkedInEmployees
+```
+
+Congratulations! You’ve now setup a local grid that includes YARN, Kafka, and ZooKeeper, and run a Samza SQL application on it.
+## Shutdown and cleanup
+To shutdown the app, use the same run-app.sh script with an extra –operation=kill argument
+
+```bash
+> ./deploy/samza/bin/run-app.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=file://$PWD/deploy/samza/config/page-view-filter-sql.properties --operation=kill
+```
+
+Please follow the instructions from Hello Samza High Level API - YARN Deployment on how to shutdown and cleanup the app.
+
+
+# SQL Grammar
+Following BNF grammar is based on Apache Calicite’s SQL parser. It is the subset of the capabilities that Calcite supports.  
+
+
+```
+statement:
+  |   insert
+  |   query 
+  
+query:
+  values
+  | {
+      select
+    }
+ 
+insert:
+      ( INSERT | UPSERT ) INTO tablePrimary
+      [ '(' column [, column ]* ')' ]
+      query 
+
+select:
+  SELECT
+  { * | projectItem [, projectItem ]* }
+  FROM tableExpression
+  [ WHERE booleanExpression ]
+  [ GROUP BY { groupItem [, groupItem ]* } ]
+   
+projectItem:
+  expression [ [ AS ] columnAlias ]
+  | tableAlias . *
+ 
+tableExpression:
+  tableReference [, tableReference ]*
+  | tableExpression [ NATURAL ] [ LEFT | RIGHT | FULL ] JOIN tableExpression [ joinCondition ]
+ 
+joinCondition:
+  ON booleanExpression
+  | USING '(' column [, column ]* ')'
+ 
+tableReference:
+  tablePrimary
+  [ [ AS ] alias [ '(' columnAlias [, columnAlias ]* ')' ] ]
+ 
+tablePrimary:
+  [ TABLE ] [ [ catalogName . ] schemaName . ] tableName
+   
+values:
+  VALUES expression [, expression ]*
+
+```