You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@beam.apache.org by me...@apache.org on 2017/08/21 18:49:35 UTC

[beam-site] 06/10: split into 1 blog + 1 development doc.

This is an automated email from the ASF dual-hosted git repository.

mergebot-role pushed a commit to branch mergebot
in repository https://gitbox.apache.org/repos/asf/beam-site.git

commit 581ec758058ff02070ed08611608c12f0a654dce
Author: mingmxu <mi...@ebay.com>
AuthorDate: Wed Aug 2 11:23:00 2017 -0700

    split into 1 blog + 1 development doc.
---
 src/_includes/header.html                          |   4 +
 src/_posts/2017-07-21-sql-dsl.md                   | 340 +--------------------
 .../dsls/sql.md}                                   |  14 +-
 3 files changed, 16 insertions(+), 342 deletions(-)

diff --git a/src/_includes/header.html b/src/_includes/header.html
index 3bd4ced..79e72b6 100644
--- a/src/_includes/header.html
+++ b/src/_includes/header.html
@@ -62,6 +62,10 @@
             <li><a href="{{ site.baseurl }}/documentation/runners/flink/">Apache Flink Runner</a></li>
             <li><a href="{{ site.baseurl }}/documentation/runners/spark/">Apache Spark Runner</a></li>
             <li><a href="{{ site.baseurl }}/documentation/runners/dataflow/">Cloud Dataflow Runner</a></li>
+
+            <li role="separator" class="divider"></li>
+            <li class="dropdown-header">DSLs</li>
+            <li><a href="{{ site.baseurl }}/documentation/dsls/sql/">SQL</a></li>
           </ul>
         </li>
         <li class="dropdown">
diff --git a/src/_posts/2017-07-21-sql-dsl.md b/src/_posts/2017-07-21-sql-dsl.md
index b21769c..f37c24a 100644
--- a/src/_posts/2017-07-21-sql-dsl.md
+++ b/src/_posts/2017-07-21-sql-dsl.md
@@ -8,60 +8,14 @@ authors:
   - mingmxu
 ---
 
-BeamSQL provides the capability to execute standard SQL queries using Beam Java SDK. DSL interface packages the backend parsing/validation/assembling features, and delivers a SDK style API to developers, to express a processing logic using SQL statements, from simple TABLE_FILTER, to complex queries containing JOIN/GROUP_BY etc.
+Beam SQL DSL provides the capability to execute standard SQL queries using Beam Java SDK. It packages the backend parsing/validation/assembling features, and delivers a SDK style API to developers, to express a processing logic using SQL statements, from simple TABLE_FILTER, to complex queries containing JOIN/GROUP_BY etc.
 
 <!--more-->
 
-* [1. Overview](#overview)
-* [2. The Internal of Beam SQL](#internal-of-sql)
-* [3. Usage of DSL APIs](#usage)
-* [4. Functionality in Beam SQL](#functionality)
-  * [4.1. Supported Features](#features)
-  * [4.2. Intro of BeamSqlRow](#beamsqlrow)
-  * [4.3. Data Types](#data-type)
-  * [4.4. built-in SQL functions](#built-in-functions)
-
-This page describes the implementation of Beam SQL, and how to simplify a Beam pipeline with DSL APIs.
-
 # <a name="overview"></a>1. Overview
 SQL is a well-adopted standard to process data with concise syntax. With DSL APIs, now `PCollection`s can be queried with standard SQL statements, like a regular table. The DSL APIs leverage [Apache Calcite](http://calcite.apache.org/) to parse and optimize SQL queries, then translate into a composite Beam `PTransform`. In this way, both SQL and normal Beam `PTransform`s can be mixed in the same pipeline.
 
-# <a name="internal-of-sql"></a>2. The Internal of Beam SQL
-Figure 1 describes the back-end steps from a SQL statement to a Beam `PTransform`.
-
-![Workflow of Beam SQL DSL]({{ "/images/beam_sql_dsl_workflow.png" | prepend: site.baseurl }} "workflow of Beam SQL DSL")
-
-**Figure 1** workflow of Beam SQL DSL
-
-Given a `PCollection` and the query as input, first of all the input `PCollection` is registered as a table in the schema repository. Then it's processed as:
-
-1. SQL query is parsed according to grammar to generate a SQL Abstract Syntax Tree;
-2. Validate against table schema, and output a logical plan represented with relational algebras;
-3. Relational rules are applied to convert it to a physical plan, expressed with Beam components. An optimizer is optional to update the plan;
-4. Finally, the Beam physical plan is compiled as a composite `PTransform`;
-
-Here is an example to show a query that filters and projects from an input `PCollection`:
-
-```
-SELECT USER_ID, USER_NAME FROM PCOLLECTION WHERE USER_ID = 1
-```
-
-The logical plan is shown as:
-
-```
-LogicalProject(USER_ID=[$0], USER_NAME=[$1])
-  LogicalFilter(condition=[=($0, 1)])
-    LogicalTableScan(table=[[PCOLLECTION]])
-```
-
-And compiled as a composite `PTransform`
-
-```
-pCollection.apply(BeamSqlFilter...)
-           .apply(BeamSqlProject...)
-```
-
-# <a name="usage"></a>3. Usage of DSL APIs 
+# <a name="usage"></a>2. Usage of DSL APIs 
 The DSL interface (`BeamSql.query()` and `BeamSql.simpleQuery()`), is the only endpoint exposed to developers. It wraps the back-end details of parsing/validation/assembling, to deliver a Beam SDK style API that can take either simple TABLE_FILTER queries or complex queries containing JOIN/GROUP_BY etc. 
 
 *Note*, the two APIs are equivalent in functionality, `BeamSql.query()` applies on a `PCollectionTuple` with one or many input `PCollection`s, and `BeamSql.simpleQuery()` is a simplified API which applies on single `PCollection`.
@@ -96,288 +50,12 @@ Note that, SQL support is not fully completed. Queries that include unsupported
 ## <a name="features"></a>4.1. Supported Features
 The following features are supported in current repository (this chapter will be updated continually).
 
-**1. filter clauses;**
-
-**2. data field projections;**
-
-**3. aggregations;**
-
-Beam SQL supports aggregation functions COUNT/SUM/MAX/MIN/AVG with group_by in global_window, fixed_window, sliding_window and session_window. A field with type `TIMESTAMP` is required to specify fixed_window/sliding_window/session window, which is used as event timestamp for rows. See below for several examples:
-
-```
-//fixed window, one hour in duration
-SELECT f_int, COUNT(*) AS `size` FROM PCOLLECTION GROUP BY f_int, TUMBLE(f_timestamp, INTERVAL '1' HOUR)
-
-//sliding window, one hour in duration and 30 minutes period
-SELECT f_int, COUNT(*) AS `size` FROM PCOLLECTION GROUP BY f_int, HOP(f_timestamp, INTERVAL '1' HOUR, INTERVAL '30' MINUTE)
-
-//session window, with 5 minutes gap duration
-SELECT f_int, COUNT(*) AS `size` FROM PCOLLECTION GROUP BY f_int, SESSION(f_timestamp, INTERVAL '5' MINUTE)
-```
-
-Note: 
-
-1. distinct aggregation is not supported yet.
-2. the default trigger is `Repeatedly.forever(AfterWatermark.pastEndOfWindow())`;
-3. when `time` field in `HOP(dateTime, slide, size [, time ])`/`TUMBLE(dateTime, interval [, time ])`/`SESSION(dateTime, interval [, time ])` is specified, a lateFiring trigger is added as 
-
-```
-Repeatedly.forever(AfterWatermark.pastEndOfWindow().withLateFirings(AfterProcessingTime
-        .pastFirstElementInPane().plusDelayOf(Duration.millis(delayTime.getTimeInMillis()))));
-```		
-
-**4. Join (inner, left_outer, right_outer);**
-
-The scenarios of join can be categorized into 3 cases:
-
-1. BoundedTable JOIN BoundedTable
-2. UnboundedTable JOIN UnboundedTable
-3. BoundedTable JOIN UnboundedTable
-
-For case 1 and case 2, a standard join is utilized as long as the windowFn of the both sides match. For case 3, sideInput is utilized to implement the join. So there are some constraints:
-
-* Only equal-join is supported, CROSS JOIN is not supported;
-* FULL OUTER JOIN is not supported;
-* If it's a LEFT OUTER JOIN, the unbounded table should on the left side; If it's a RIGHT OUTER JOIN, the unbounded table should on the right side;
-* The trigger is inherented from upstream, which should be consistent;
-
-**5. built-in SQL functions**
-
-**6. User Defined Function (UDF) and User Defined Aggregate Function (UDAF);**
-
-If the required function is not available, developers can register their own UDF(for scalar function) and UDAF(for aggregation function).
-
-**create and specify User Defined Function (UDF)**
-
-A UDF can be any Java method that takes zero or more scalar fields, and return one scalar value. Below is an example of UDF and how to use it in DSL:
-
-```
-/**
- * A example UDF for test.
- */
-public static class CubicInteger implements BeamSqlUdf{
-  public static Integer eval(Integer input){
-    return input * input * input;
-  }
-}
-
-// register and call in SQL
-String sql = "SELECT f_int, cubic(f_int) as cubicvalue FROM PCOLLECTION WHERE f_int = 2";
-PCollection<BeamSqlRow> result =
-    input.apply("udfExample",
-        BeamSql.simpleQuery(sql).withUdf("cubic", CubicInteger.class));
-```
-
-**create and specify User Defined Aggregate Function (UDAF)**
-
-A UDAF aggregates a set of grouped scalar values, and output a single scalar value. To create a UDAF function, it's required to extend `org.apache.beam.dsls.sql.schema.BeamSqlUdaf<InputT, AccumT, OutputT>`, which defines 4 methods to process an aggregation:
-
-1. init(), to create an initial accumulate value;
-2. add(), to apply a new value to existing accumulate value;
-3. merge(), to merge accumulate values from parallel operators;
-4. result(), to generate the final result from accumulate value;
-
-Here's an example of UDAF:
-
-```
-/**
- * UDAF for test, which returns the sum of square.
- */
-public static class SquareSum extends BeamSqlUdaf<Integer, Integer, Integer> {
-  public SquareSum() {
-  }
-  
-  @Override
-  public Integer init() {
-    return 0;
-  }
-  
-  @Override
-  public Integer add(Integer accumulator, Integer input) {
-    return accumulator + input * input;
-  }
-  
-  @Override
-  public Integer merge(Iterable<Integer> accumulators) {
-    int v = 0;
-    Iterator<Integer> ite = accumulators.iterator();
-    while (ite.hasNext()) {
-      v += ite.next();
-    }
-    return v;
-  }
-
-  @Override
-  public Integer result(Integer accumulator) {
-    return accumulator;
-  }
-}
-
-//register and call in SQL
-String sql = "SELECT f_int1, squaresum(f_int2) AS `squaresum` FROM PCOLLECTION GROUP BY f_int2";
-PCollection<BeamSqlRow> result =
-    input.apply("udafExample",
-        BeamSql.simpleQuery(sql).withUdaf("squaresum", SquareSum.class));
-```
- 
-## <a name="beamsqlrow"></a>4.2. Intro of BeamSqlRow
-`BeamSqlRow`, encoded/decoded by `BeamSqlRowCoder`, represents a single, implicitly structured data item in a Beam SQL compatible `PCollection`. Similar as _row_ in the context of relational database, each `BeamSqlRow` consists of named columns with fixed types(see [4.3. Data Types](#data-type)).
-
-A Beam SQL `PCollection` can be created from an external source, in-memory data or derive from another SQL query. For `PCollection`s from external source or in-memory data, it's required to specify coder explcitly; `PCollection` derived from SQL query has the coder set already. Below is one example:
-
-```
-//define the input row format
-List<String> fieldNames = Arrays.asList("c1", "c2", "c3");
-List<Integer> fieldTypes = Arrays.asList(Types.INTEGER, Types.VARCHAR, Types.DOUBLE);
-BeamSqlRecordType type = BeamSqlRecordType.create(fieldNames, fieldTypes);
-BeamSqlRow row = new BeamSqlRow(type);
-row.addField(0, 1);
-row.addField(1, "row");
-row.addField(2, 1.0);
-
-//create a source PCollection with Create.of();
-PCollection<BeamSqlRow> inputTable = PBegin.in(p).apply(Create.of(row)
-    .withCoder(new BeamSqlRowCoder(type)));
-```
- 
-## <a name="data-type"></a>4.3. Data Types
-Each type in Beam SQL maps to a Java class to holds the value in `BeamSqlRow`. The following table lists the relation between SQL types and Java classes, which are supported in current repository:
-
-| SQL Type | Java class |
-| ---- | ---- |
-| Types.INTEGER | java.lang.Integer |
-| Types.SMALLINT | java.lang.Short |
-| Types.TINYINT | java.lang.Byte |
-| Types.BIGINT | java.lang.Long |
-| Types.FLOAT | java.lang.Float |
-| Types.DOUBLE | java.lang.Double |
-| Types.DECIMAL | java.math.BigDecimal |
-| Types.VARCHAR | java.lang.String |
-| Types.TIMESTAMP | java.util.Date |
-{:.table}
-
-## <a name="built-in-functions"></a>4.4. built-in SQL functions
-
-Beam SQL has implemented lots of build-in functions defined in [Apache Calcite](http://calcite.apache.org). The available functions are listed as below:
-
-**Comparison functions and operators**
-
-| Operator syntax | Description |
-| ---- | ---- |
-| value1 = value2 | Equals |
-| value1 <> value2 | Not equal |
-| value1 > value2 | Greater than |
-| value1 >= value2 | Greater than or equal |
-| value1 < value2 | Less than |
-| value1 <= value2 | Less than or equal |
-| value IS NULL | Whether value is null |
-| value IS NOT NULL | Whether value is not null |
-{:.table}
-
-**Logical functions and operators**
-
-| Operator syntax | Description |
-| ---- | ---- |
-| boolean1 OR boolean2 | Whether boolean1 is TRUE or boolean2 is TRUE |
-| boolean1 AND boolean2 | Whether boolean1 and boolean2 are both TRUE |
-| NOT boolean | Whether boolean is not TRUE; returns UNKNOWN if boolean is UNKNOWN |
-{:.table}
-
-**Arithmetic functions and operators**
-
-| Operator syntax | Description|
-| ---- | ---- |
-| numeric1 + numeric2 | Returns numeric1 plus numeric2| 
-| numeric1 - numeric2 | Returns numeric1 minus numeric2| 
-| numeric1 * numeric2 | Returns numeric1 multiplied by numeric2| 
-| numeric1 / numeric2 | Returns numeric1 divided by numeric2| 
-| MOD(numeric, numeric) | Returns the remainder (modulus) of numeric1 divided by numeric2. The result is negative only if numeric1 is negative| 
-{:.table}
-
-**Math functions**
-
-| Operator syntax | Description |
-| ---- | ---- |
-| ABS(numeric) | Returns the absolute value of numeric |
-| SQRT(numeric) | Returns the square root of numeric |
-| LN(numeric) | Returns the natural logarithm (base e) of numeric |
-| LOG10(numeric) | Returns the base 10 logarithm of numeric |
-| EXP(numeric) | Returns e raised to the power of numeric |
-| ACOS(numeric) | Returns the arc cosine of numeric |
-| ASIN(numeric) | Returns the arc sine of numeric |
-| ATAN(numeric) | Returns the arc tangent of numeric |
-| COT(numeric) | Returns the cotangent of numeric |
-| DEGREES(numeric) | Converts numeric from radians to degrees |
-| RADIANS(numeric) | Converts numeric from degrees to radians |
-| SIGN(numeric) | Returns the signum of numeric |
-| SIN(numeric) | Returns the sine of numeric |
-| TAN(numeric) | Returns the tangent of numeric |
-| ROUND(numeric1, numeric2) | Rounds numeric1 to numeric2 places right to the decimal point |
-{:.table}
-
-**Date functions**
-
-| Operator syntax | Description |
-| ---- | ---- |
-| LOCALTIME | Returns the current date and time in the session time zone in a value of datatype TIME |
-| LOCALTIME(precision) | Returns the current date and time in the session time zone in a value of datatype TIME, with precision digits of precision |
-| LOCALTIMESTAMP | Returns the current date and time in the session time zone in a value of datatype TIMESTAMP |
-| LOCALTIMESTAMP(precision) | Returns the current date and time in the session time zone in a value of datatype TIMESTAMP, with precision digits of precision |
-| CURRENT_TIME | Returns the current time in the session time zone, in a value of datatype TIMESTAMP WITH TIME ZONE |
-| CURRENT_DATE | Returns the current date in the session time zone, in a value of datatype DATE |
-| CURRENT_TIMESTAMP | Returns the current date and time in the session time zone, in a value of datatype TIMESTAMP WITH TIME ZONE |
-| EXTRACT(timeUnit FROM datetime) | Extracts and returns the value of a specified datetime field from a datetime value expression |
-| FLOOR(datetime TO timeUnit) | Rounds datetime down to timeUnit |
-| CEIL(datetime TO timeUnit) | Rounds datetime up to timeUnit |
-| YEAR(date) | Equivalent to EXTRACT(YEAR FROM date). Returns an integer. |
-| QUARTER(date) | Equivalent to EXTRACT(QUARTER FROM date). Returns an integer between 1 and 4. |
-| MONTH(date) | Equivalent to EXTRACT(MONTH FROM date). Returns an integer between 1 and 12. |
-| WEEK(date) | Equivalent to EXTRACT(WEEK FROM date). Returns an integer between 1 and 53. |
-| DAYOFYEAR(date) | Equivalent to EXTRACT(DOY FROM date). Returns an integer between 1 and 366. |
-| DAYOFMONTH(date) | Equivalent to EXTRACT(DAY FROM date). Returns an integer between 1 and 31. |
-| DAYOFWEEK(date) | Equivalent to EXTRACT(DOW FROM date). Returns an integer between 1 and 7. |
-| HOUR(date) | Equivalent to EXTRACT(HOUR FROM date). Returns an integer between 0 and 23. |
-| MINUTE(date) | Equivalent to EXTRACT(MINUTE FROM date). Returns an integer between 0 and 59. |
-| SECOND(date) | Equivalent to EXTRACT(SECOND FROM date). Returns an integer between 0 and 59. |
-{:.table}
-
-**String functions**
-
-| Operator syntax | Description |
-| ---- | ---- |
-| string \|\| string | Concatenates two character strings |
-| CHAR_LENGTH(string) | Returns the number of characters in a character string |
-| CHARACTER_LENGTH(string) | As CHAR_LENGTH(string) |
-| UPPER(string) | Returns a character string converted to upper case |
-| LOWER(string) | Returns a character string converted to lower case |
-| POSITION(string1 IN string2) | Returns the position of the first occurrence of string1 in string2 |
-| POSITION(string1 IN string2 FROM integer) | Returns the position of the first occurrence of string1 in string2 starting at a given point (not standard SQL) |
-| TRIM( { BOTH \| LEADING \| TRAILING } string1 FROM string2) | Removes the longest string containing only the characters in string1 from the start/end/both ends of string1 |
-| OVERLAY(string1 PLACING string2 FROM integer [ FOR integer2 ]) | Replaces a substring of string1 with string2 |
-| SUBSTRING(string FROM integer) | Returns a substring of a character string starting at a given point |
-| SUBSTRING(string FROM integer FOR integer) | Returns a substring of a character string starting at a given point with a given length |
-| INITCAP(string) | Returns string with the first letter of each word converter to upper case and the rest to lower case. Words are sequences of alphanumeric characters separated by non-alphanumeric characters. |
-{:.table}
-
-**Conditional functions**
-
-| Operator syntax | Description |
-| ---- | ---- |
-| CASE value <br>WHEN value1 [, value11 ]* THEN result1 <br>[ WHEN valueN [, valueN1 ]* THEN resultN ]* <br>[ ELSE resultZ ] <br>END | Simple case |
-| CASE <br>WHEN condition1 THEN result1 <br>[ WHEN conditionN THEN resultN ]* <br>[ ELSE resultZ ] <br>END | Searched case |
-| NULLIF(value, value) | Returns NULL if the values are the same. For example, NULLIF(5, 5) returns NULL; NULLIF(5, 0) returns 5. |
-| COALESCE(value, value [, value ]*) | Provides a value if the first value is null. For example, COALESCE(NULL, 5) returns 5. |
-{:.table}
-
-**Type conversion functions**
+1. filter clauses;
+2. data field projections;
+3. aggregations (global_window, fixed_window, sliding_window, session_window);
+4. Join (inner, left_outer, right_outer);
+5. built-in SQL functions
+6. User Defined Function (UDF) and User Defined Aggregate Function (UDAF);
 
-**Aggregate functions**
+For more deep dive, please visit [DSLs: SQL]({{ site.baseurl }}/documentation/dsls/sql/)
 
-| Operator syntax | Description |
-| ---- | ---- |
-| COUNT(*) | Returns the number of input rows |
-| AVG(numeric) | Returns the average (arithmetic mean) of numeric across all input values |
-| SUM(numeric) | Returns the sum of numeric across all input values |
-| MAX(value) | Returns the maximum value of value across all input values |
-| MIN(value) | Returns the minimum value of value across all input values |
-{:.table}
diff --git a/src/_posts/2017-07-21-sql-dsl.md b/src/documentation/dsls/sql.md
similarity index 97%
copy from src/_posts/2017-07-21-sql-dsl.md
copy to src/documentation/dsls/sql.md
index b21769c..76574e5 100644
--- a/src/_posts/2017-07-21-sql-dsl.md
+++ b/src/documentation/dsls/sql.md
@@ -1,17 +1,9 @@
 ---
-layout: post
-title:  "Use Beam SQL DSL to build a pipeline"
-date:   2017-07-21 00:00:00 -0800
-excerpt_separator: <!--more-->
-categories: blog
-authors:
-  - mingmxu
+layout: default
+title: "DSLs: SQL"
+permalink: /documentation/dsls/sql/
 ---
 
-BeamSQL provides the capability to execute standard SQL queries using Beam Java SDK. DSL interface packages the backend parsing/validation/assembling features, and delivers a SDK style API to developers, to express a processing logic using SQL statements, from simple TABLE_FILTER, to complex queries containing JOIN/GROUP_BY etc.
-
-<!--more-->
-
 * [1. Overview](#overview)
 * [2. The Internal of Beam SQL](#internal-of-sql)
 * [3. Usage of DSL APIs](#usage)

-- 
To stop receiving notification emails like this one, please contact
"commits@beam.apache.org" <co...@beam.apache.org>.