Posted to commits@druid.apache.org by da...@apache.org on 2018/08/09 20:42:56 UTC
[incubator-druid] branch master updated: Add docs for virtual columns and transform specs (#6119)

This is an automated email from the ASF dual-hosted git repository.

davidlim pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-druid.git


The following commit(s) were added to refs/heads/master by this push:
     new aa660b8  Add docs for virtual columns and transform specs (#6119)
aa660b8 is described below

commit aa660b87515cc66b40899b46a3a83a37459f9e3b
Author: Jonathan Wei <jo...@users.noreply.github.com>
AuthorDate: Thu Aug 9 13:42:52 2018 -0700

    Add docs for virtual columns and transform specs (#6119)
    
    * Add docs for virtual columns and transform specs
    
    * PR Comments
    
    * PR comment
---
 docs/content/ingestion/index.md          | 11 ++++-
 docs/content/ingestion/transform-spec.md | 84 ++++++++++++++++++++++++++++++++
 docs/content/misc/math-expr.md           |  6 +++
 docs/content/querying/virtual-columns.md | 60 +++++++++++++++++++++++
 docs/content/toc.md                      |  3 ++
 5 files changed, 163 insertions(+), 1 deletion(-)

diff --git a/docs/content/ingestion/index.md b/docs/content/ingestion/index.md
index 1529455..02fa1cc 100644
--- a/docs/content/ingestion/index.md
+++ b/docs/content/ingestion/index.md
@@ -87,7 +87,8 @@ An example dataSchema is shown below:
     "segmentGranularity" : "DAY",
     "queryGranularity" : "NONE",
     "intervals" : [ "2013-08-31/2013-09-01" ]
-  }
+  },
+  "transformSpec" : null
 }
 ```
 
@@ -97,6 +98,7 @@ An example dataSchema is shown below:
 | parser | JSON Object | Specifies how ingested data can be parsed. | yes |
 | metricsSpec | JSON Object array | A list of [aggregators](../querying/aggregations.html). | yes |
 | granularitySpec | JSON Object | Specifies how to create segments and roll up data. | yes |
+| transformSpec | JSON Object | Specifies how to filter and transform input data. See [transform specs](../ingestion/transform-spec.html).| no |
 
 ## Parser
 
@@ -244,6 +246,9 @@ for the `comment` column.
 }
 ```
 
+## metricsSpec
+The `metricsSpec` is a list of [aggregators](../querying/aggregations.html). If `rollup` is false in the granularity spec, the metrics spec should be an empty list and all columns should be defined in the `dimensionsSpec` instead (without rollup, there isn't a real distinction between dimensions and metrics at ingestion time). This is optional, however.
+
 ## GranularitySpec
 
 The default granularity spec is `uniform`, and can be changed by setting the `type` field.
@@ -270,6 +275,10 @@ This spec is used to generate segments with arbitrary intervals (it tries to cre
 | rollup | boolean | rollup or not | no (default == true) |
 | intervals | string | A list of intervals for the raw data being ingested. Ignored for real-time ingestion. | no. If specified, batch ingestion tasks may skip determining partitions phase which results in faster ingestion. |
 
+# Transform Spec
+
+Transform specs allow Druid to transform and filter input data during ingestion. See [Transform specs](../ingestion/transform-spec.html).
+
 # IO Config
 
 Stream Push Ingestion: Stream push ingestion with Tranquility does not require an IO Config.
diff --git a/docs/content/ingestion/transform-spec.md b/docs/content/ingestion/transform-spec.md
new file mode 100644
index 0000000..eedaaa6
--- /dev/null
+++ b/docs/content/ingestion/transform-spec.md
@@ -0,0 +1,84 @@
+---
+layout: doc_page
+---
+
+# Transform Specs
+
+Transform specs allow Druid to filter and transform input data during ingestion. 
+
+## Syntax
+
+The syntax for the transformSpec is shown below:
+
+```
+"transformSpec": {
+  "transforms": <List of transforms>,
+  "filter": <filter>
+}
+```
+
+|property|description|required?|
+|--------|-----------|---------|
+|transforms|A list of [transforms](#transforms) to be applied to input rows. |no|
+|filter|A [filter](../querying/filters.html) that will be applied to input rows; only rows that pass the filter will be ingested.|no|
+
+## Transforms
+
+The `transforms` list allows the user to specify a set of column transformations to be performed on input data.
+
+Transforms allow adding new fields to input rows. Each transform has a "name" (the name of the new field) which can be referred to by DimensionSpecs, AggregatorFactories, etc.
+
+A transform behaves as a "row function", taking an entire row as input and outputting a column value.
+
+If a transform has the same name as a field in an input row, then it will shadow the original field. Transforms that shadow fields may still refer to the fields they shadow. This can be used to transform a field "in-place".
+
+Transforms do have some limitations. They can only refer to fields present in the actual input rows; in particular, they cannot refer to other transforms. And they cannot remove fields, only add them. However, they can shadow a field with another field containing all nulls, which will act similarly to removing the field.
+
+Note that the transforms are applied before the filter.
+
+### Expression Transform
+
+Druid currently supports one kind of transform, the expression transform.
+
+An expression transform has the following syntax:
+
+```
+{
+  "type": "expression",
+  "name": <output field name>,
+  "expression": <expr>
+}
+```
+
+|property|description|required?|
+|--------|-----------|---------|
+|name|The output field name of the expression transform.|yes|
+|expression|An [expression](../misc/math-expr.html) that will be applied to input rows to produce a value for the transform's output field.|yes|
+
+For example, the following expression transform prepends "foo" to the values of a `page` column in the input data, and creates a `fooPage` column.
+
+```
+    {
+      "type": "expression",
+      "name": "fooPage",
+      "expression": "concat('foo', page)"
+    }
+```
+
+## Filtering
+
+The transformSpec allows Druid to filter out input rows during ingestion. A row that fails to pass the filter will not be ingested.
+
+Any of Druid's standard [filters](../querying/filters.html) can be used.
+
+Note that the filtering takes place after the transforms, so filters will operate on transformed rows and not the raw input data if transforms are present.
+
+For example, the following filter would ingest only input rows where a `country` column has the value "United States":
+
+```
+"filter": {
+  "type": "selector",
+  "dimension": "country",
+  "value": "United States"
+}
+```
\ No newline at end of file
diff --git a/docs/content/misc/math-expr.md b/docs/content/misc/math-expr.md
index abcebdd..d821491 100644
--- a/docs/content/misc/math-expr.md
+++ b/docs/content/misc/math-expr.md
@@ -2,6 +2,12 @@
 layout: doc_page
 ---
 
+# Druid Expressions
+
+<div class="note info">
+This feature is still experimental. It has not been optimized for performance yet, and its implementation is known to have significant inefficiencies.
+</div>
+ 
 This expression language supports the following operators (listed in decreasing order of precedence).
 
 |Operators|Description|
diff --git a/docs/content/querying/virtual-columns.md b/docs/content/querying/virtual-columns.md
new file mode 100644
index 0000000..117b75e
--- /dev/null
+++ b/docs/content/querying/virtual-columns.md
@@ -0,0 +1,60 @@
+---
+layout: doc_page
+---
+
+# Virtual Columns
+
+Virtual columns are queryable column "views" created from a set of columns during a query. 
+
+A virtual column can potentially draw from multiple underlying columns, although a virtual column always presents itself as a single column.
+
+Virtual columns can be used as dimensions or as inputs to aggregators.
+
+Each Druid query can accept a list of virtual columns as a parameter. The following scan query is provided as an example:
+
+```
+{
+ "queryType": "scan",
+ "dataSource": "page_data",
+ "columns":[],
+ "virtualColumns": [
+    {
+      "type": "expression",
+      "name": "fooPage",
+      "expression": "concat('foo', page)",
+      "outputType": "STRING"
+    },
+    {
+      "type": "expression",
+      "name": "tripleWordCount",
+      "expression": "wordCount * 3",
+      "outputType": "LONG"
+    }
+  ],
+ "intervals": [
+   "2013-01-01/2019-01-02"
+ ] 
+}
+```
+
+
+## Virtual Column Types
+
+### Expression virtual column
+
+The expression virtual column has the following syntax:
+
+```
+{
+  "type": "expression",
+  "name": <name of the virtual column>,
+  "expression": <row expression>,
+  "outputType": <output value type of expression>
+}
+```
+
+|property|description|required?|
+|--------|-----------|---------|
+|name|The name of the virtual column.|yes|
+|expression|An [expression](../misc/math-expr.html) that takes a row as input and outputs a value for the virtual column.|yes|
+|outputType|The expression's output will be coerced to this type. Can be LONG, FLOAT, DOUBLE, or STRING.|no, default is FLOAT|
\ No newline at end of file
diff --git a/docs/content/toc.md b/docs/content/toc.md
index 6553e96..a8cd7ed 100644
--- a/docs/content/toc.md
+++ b/docs/content/toc.md
@@ -32,6 +32,7 @@ layout: toc
     * [Stream Pull](/docs/VERSION/ingestion/stream-pull.html)
   * [Updating Existing Data](/docs/VERSION/ingestion/update-existing-data.html)
   * [Ingestion Tasks](/docs/VERSION/ingestion/tasks.html)
+  * [Transform Specs](/docs/VERSION/ingestion/transform-spec.html)
   * [FAQ](/docs/VERSION/ingestion/faq.html)
 
 ## Querying
@@ -60,6 +61,7 @@ layout: toc
   * [Multitenancy](/docs/VERSION/querying/multitenancy.html)
   * [Caching](/docs/VERSION/querying/caching.html)
   * [Sorting Orders](/docs/VERSION/querying/sorting-orders.html)
+  * [Virtual Columns](/docs/VERSION/querying/virtual-columns.html)
 
 ## Design
   * [Overview](/docs/VERSION/design/design.html)
@@ -127,5 +129,6 @@ layout: toc
 
 
 ## Misc
+  * [Druid Expressions Language](/docs/VERSION/misc/math-expr.html)
   * [Papers & Talks](/docs/VERSION/misc/papers-and-talks.html)
   * [Thanks](/thanks.html)
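
Illustration (not part of the commit): the transform-spec.md text above says that transforms read only the original input fields, that a transform may shadow a field of the same name, and that the filter runs after the transforms. The Python sketch below is a toy model of those three rules only. It is not Druid's implementation; `apply_transform_spec`, the lambdas standing in for Druid expressions, and the sample rows are all hypothetical.

```python
def apply_transform_spec(row, transforms, row_filter):
    """Toy model of the documented ordering; hypothetical, not Druid code."""
    # Every transform is evaluated against the ORIGINAL input row, so one
    # transform can never observe the output of another transform.
    outputs = {name: fn(row) for name, fn in transforms.items()}
    # Outputs are layered over the input row; an output with the same name
    # as an input field shadows that field.
    transformed = {**row, **outputs}
    # The filter is applied last, so it sees the transformed values.
    return transformed if row_filter(transformed) else None


# Python callables stand in for Druid expressions (assumption for this sketch).
transforms = {
    "fooPage": lambda r: "foo" + r["page"],      # adds a new field
    "country": lambda r: r["country"].strip(),   # shadows "country" in place
}


def selector(r):
    # Plays the role of the selector filter on "country".
    return r["country"] == "United States"


rows = [
    {"page": "Main", "country": " United States "},
    {"page": "Other", "country": "Canada"},
]

ingested = []
for r in rows:
    out = apply_transform_spec(r, transforms, selector)
    if out is not None:
        ingested.append(out)

print(ingested)
```

The first row passes the filter only because the filter sees the trimmed, shadowed `country` value; the raw value with surrounding spaces would not match.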

