Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/07/08 18:36:19 UTC

[GitHub] [beam] TheNeuralBit commented on a diff in pull request #22047: [website] Add TPC-DS benchmark documentation

TheNeuralBit commented on code in PR #22047:
URL: https://github.com/apache/beam/pull/22047#discussion_r906370782


##########
website/www/site/content/en/documentation/sdks/java/testing/tpcds.md:
##########
@@ -0,0 +1,183 @@
+---
+type: languages
+title: "TPC-DS benchmark suite"
+aliases: /documentation/sdks/java/tpcds/
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+# TPC Benchmark™ DS (TPC-DS) benchmark suite
+
+## What it is
+
+> "TPC-DS is a decision support benchmark that models several generally applicable aspects of a decision support system,
+> including queries and data maintenance. The benchmark provides a representative evaluation of performance as a general
+> purpose decision support system."
+
+- Industry standard benchmark (OLAP/Data Warehouse)
+  - http://www.tpc.org/tpcds/
+- Implemented for many analytical processing systems - RDBMS, Apache Spark, Apache Flink, etc
+- Wide range of different queries (SQL)
+- Existing tools to generate input data of different sizes
+
+## Table schemas
+TBD
+
+## The queries
+
+TPC-DS benchmark contains 99 distinct SQL-99 queries (including OLAP extensions). Each query answers a business
+question, which illustrates the business context in which the query could be used.
+
+All queries are “templated” with random input parameters and used to compare SQL implementation of completeness and
+performance.
+
+## Input data
+Input data source:
+
+- Input files (CSV) are generated with CLI tool `dsdgen`
+- Input datasets can be generated for different sizes:
+  - 1GB / 10GB / 100GB / 1000GB
+- The tool constraints the minimum amount of data to be generated to 1GB
+
+## TPC-DS extension in Beam
+
+### Reasons
+
+Beam provides a simplified implementation of TPC-DS benchmark and there are several reasons to have it in Beam:
+
+- Compare the performance boost or degradation of Beam SQL for different runners or their versions
+- Run Beam SQL on different runtime environments
+- Detect missing Beam SQL features or incompatibilities
+- Find performance issues in Beam
+
+### Queries
+All TPC-DS queries in Beam are pre-generated and stored in the provided artifacts.
+
+For the moment, 28 out of 103 SQL queries (99 + 4) successfully pass by running with Beam SQL transform since not all
+SQL-99 operations are supported.
+
+Currently supported queries:
+ - 3, 7, 10, 22, 25, 26, 29, 35, 38, 40, 42, 43, 50, 52, 55, 69, 78, 79, 83, 84, 87, 93, 96, 97, 99

Review Comment:
   Is there a way we can keep this up to date? If not, maybe it should say (as of `<commit or date or release>`)
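
One cheap way to catch drift in the meantime is a consistency check between the stated count and the list itself. The following is a minimal Python sketch, not part of the PR; the names `claimed_passing` and `supported_line` are hypothetical, and the values are copied from the diff quoted above. As quoted, the list has 25 entries while the prose says 28, so the two already disagree.

```python
# Values copied from the quoted diff; the variable names are illustrative only.
claimed_passing = 28
supported_line = (
    "3, 7, 10, 22, 25, 26, 29, 35, 38, 40, 42, 43, 50, 52, 55, "
    "69, 78, 79, 83, 84, 87, 93, 96, 97, 99"
)

# Parse the comma-separated query numbers into integers.
supported = [int(q) for q in supported_line.split(",")]
listed_count = len(supported)

# A non-zero difference means the prose and the list are out of sync.
mismatch = claimed_passing - listed_count

print(f"listed={listed_count}, claimed={claimed_passing}, difference={mismatch}")
```

A check like this could run wherever the documentation is built, but where to hook it in is left open here.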



##########
website/www/site/content/en/documentation/sdks/java/testing/tpcds.md:
##########
@@ -0,0 +1,183 @@
+## Input data
+Input data source:
+
+- Input files (CSV) are generated with CLI tool `dsdgen`
+- Input datasets can be generated for different sizes:
+  - 1GB / 10GB / 100GB / 1000GB
+- The tool constraints the minimum amount of data to be generated to 1GB

Review Comment:
   ```suggestion
   - The tool constrains the minimum amount of data to be generated to 1GB
   ```
   
   Is the maximum also constrained (is the list above the only allowed sizes)?



##########
website/www/site/content/en/documentation/sdks/java/testing/tpcds.md:
##########
@@ -0,0 +1,183 @@
+### Tables
+All TPC-DS table schemas are stored in the provided artifacts.
+
+### Input data
+Input data is already pre-generated for two data formats (CSV and Parquet) and stored in Google Cloud Storage (gs://beam-tpcds)
+
+### Runtime
+TPC-DS extension for Beam can only be run in **Batch** mode and supports these runners for the moment:
+- Spark Runner
+- Flink Runner
+- Dataflow Runner
+
+## TPC-DS output
+
+TBD
+
+## Benchmark launch configuration
+
+The TPC-DS launcher accepts the `--runner` argument as usual for programs that
+use Beam PipelineOptions to manage their command line arguments. In addition
+to this, the necessary dependencies must be configured.
+
+When running via Gradle, the following two parameters control the execution:
+
+    -P tpcds.args
+        The command line to pass to the TPC-DS main program.
+
+    -P tpcds.runner
+	The Gradle project name of the runner, such as ":runners:spark:3" or
+	":runners:flink:1.13. The project names can be found in the root
+        `settings.gradle.kts`.
+
+Test data has to be generated before running a suite and stored to accessible file system. The query results will be written into output files.
+
+### Common configuration parameters
+
+Size of input dataset (1GB / 10GB / 100GB / 1000GB):
+
+    --dataSize=<1GB|10GB|100GB|1000GB>
+

Review Comment:
   ```suggestion
   - `--dataSize=<1GB|10GB|100GB|1000GB>`: Size of input dataset
   ```
   
   You might structure this as a bulleted list.



##########
website/www/site/content/en/documentation/sdks/java/testing/tpcds.md:
##########
@@ -0,0 +1,183 @@
+Path to input datasets directory:
+
+    --dataDirectory=<path to dir>
+
+Path to results directory:
+
+    --resultsDirectory=<path to dir>
+
+Format of input files:
+
+    --sourceType=<CSV|PARQUET>
+
+Run queries (comma separated list of query numbers or `all` for all queries):
+
+    --queries=<1,2,...N|all>
+
+Number of queries **N** that are running in parallel:

Review Comment:
   ```suggestion
   Number of queries **N** to run in parallel:
   ```



##########
website/www/site/content/en/documentation/sdks/java/testing/tpcds.md:
##########
@@ -0,0 +1,183 @@
+Number of queries **N** that are running in parallel:
+
+    --tpcParallel=N
+
+## Running TPC-DS
+
+There are some examples how to run TPC-DS benchmark on different runners.
+
+Running suite on the SparkRunner (local) with Query3 against 1Gb dataset in Parquet format:
+
+    ./gradlew :sdks:java:testing:tpcds:run \
+        -Ptpcds.runner=":runners:spark:3" \
+        -Ptpcds.args="
+            --runner=SparkRunner
+            --dataSize=1GB
+            --sourceType=PARQUET
+            --dataDirectory=gs://beam-tpcds/datasets/parquet/partitioned
+            --resultsDirectory=/tmp/beam-tpcds/results/spark/
+            --tpcParallel=1
+            --queries=3"
+
+Running suite on the FlinkRunner (local) with Query7 and Query10 in parallel against 10Gb dataset in CSV format:
+
+    ./gradlew :sdks:java:testing:tpcds:run \
+        -Ptpcds.runner=":runners:flink:1.13" \
+        -Ptpcds.args="
+            --runner=FlinkRunner
+            --parallelism=2
+            --dataSize=10GB
+            --sourceType=CSV
+            --dataDirectory=gs://beam-tpcds/datasets/csv
+            --resultsDirectory=/tmp/beam-tpcds/results/flink/
+            --tpcParallel=2
+            --queries=7,10"
+
+Running suite on the DataflowRunner (local) with all queries against 100Gb dataset in PARQUET format:
+
+    ./gradlew :sdks:java:testing:tpcds:run \
+        -Ptpcds.runner=":runners:google-cloud-dataflow-java" \
+        -Ptpcds.args="
+            --runner=DataflowRunner
+            --region=<region_name>
+            --project=<project_name>
+            --numWorkers=4
+            --maxNumWorkers=4
+            --autoscalingAlgorithm=NONE
+            --dataSize=100GB
+            --sourceType=PARQUET
+            --dataDirectory=gs://beam-tpcds/datasets/parquet/partitioned
+            --resultsDirectory=/tmp/beam-tpcds/results/dataflow/

Review Comment:
   Should this be a cloud storage path?
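
If the answer is yes, a small guard could flag the problem before a run starts. The following is a minimal Python sketch under a stated assumption that the PR does not confirm: that results written to a local path such as `/tmp/...` by DataflowRunner workers would not be reachable afterwards. `is_cloud_path`, `check_results_directory`, and the `REMOTE_SCHEMES` set are hypothetical and not part of Beam.

```python
from typing import Optional
from urllib.parse import urlparse

# URI schemes treated as remote storage in this sketch; the set is an assumption.
REMOTE_SCHEMES = {"gs", "s3", "hdfs"}


def is_cloud_path(path: str) -> bool:
    """Return True if the path carries a remote-storage scheme such as gs://."""
    return urlparse(path).scheme in REMOTE_SCHEMES


def check_results_directory(runner: str, results_directory: str) -> Optional[str]:
    """Return a warning for a local results path on DataflowRunner, else None."""
    if runner == "DataflowRunner" and not is_cloud_path(results_directory):
        return (
            f"--resultsDirectory={results_directory} looks like a local path; "
            "consider a cloud storage path when running on DataflowRunner"
        )
    return None


# The value from the quoted Dataflow example triggers the warning.
warning = check_results_directory("DataflowRunner", "/tmp/beam-tpcds/results/dataflow/")
print(warning)
```

The same call returns `None` for the Spark and Flink examples, where a local `/tmp` path matches the "(local)" wording.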



##########
website/www/site/content/en/documentation/sdks/java/testing/tpcds.md:
##########
@@ -0,0 +1,183 @@
+Running suite on the DataflowRunner (local) with all queries against 100Gb dataset in PARQUET format:

Review Comment:
   Also, what is meant by "local" here? I'm assuming that's a copy-paste error.



##########
website/www/site/content/en/documentation/sdks/java/testing/tpcds.md:
##########
@@ -0,0 +1,183 @@
+## TPC-DS extension in Beam
+
+### Reasons
+
+Beam provides a simplified implementation of TPC-DS benchmark and there are several reasons to have it in Beam:

Review Comment:
   ```suggestion
   There are several reasons to have TPC-DS benchmarks in Beam:
   ```
   
   I think the first part of this belongs somewhere else, maybe in a summary under "TPC-DS extension in Beam", which also links to the location in the repo?



##########
website/www/site/content/en/documentation/sdks/java/testing/tpcds.md:
##########
@@ -0,0 +1,183 @@
+---
+type: languages
+title: "TPC-DS benchmark suite"
+aliases: /documentation/sdks/java/tpcds/
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+# TPC Benchmark™ DS (TPC-DS) benchmark suite
+
+## What it is
+
+> "TPC-DS is a decision support benchmark that models several generally applicable aspects of a decision support system,
+> including queries and data maintenance. The benchmark provides a representative evaluation of performance as a general
+> purpose decision support system."
+
+- Industry standard benchmark (OLAP/Data Warehouse)
+  - http://www.tpc.org/tpcds/
+- Implemented for many analytical processing systems - RDBMS, Apache Spark, Apache Flink, etc
+- Wide range of different queries (SQL)
+- Existing tools to generate input data of different sizes
+
+## Table schemas
+TBD
+
+## The queries
+
+TPC-DS benchmark contains 99 distinct SQL-99 queries (including OLAP extensions). Each query answers a business
+question, which illustrates the business context in which the query could be used.
+
+All queries are “templated” with random input parameters and used to compare SQL implementation of completeness and
+performance.
+
+## Input data
+Input data source:
+
+- Input files (CSV) are generated with CLI tool `dsdgen`
+- Input datasets can be generated for different sizes:
+  - 1GB / 10GB / 100GB / 1000GB
+- The tool constraints the minimum amount of data to be generated to 1GB
+
+## TPC-DS extension in Beam
+
+### Reasons
+
+Beam provides a simplified implementation of TPC-DS benchmark and there are several reasons to have it in Beam:
+
+- Compare the performance boost or degradation of Beam SQL for different runners or their versions
+- Run Beam SQL on different runtime environments
+- Detect missing Beam SQL features or incompatibilities
+- Find performance issues in Beam
+
+### Queries
+All TPC-DS queries in Beam are pre-generated and stored in the provided artifacts.
+
+For the moment, 28 out of 103 SQL queries (99 + 4) successfully pass by running with Beam SQL transform since not all
+SQL-99 operations are supported.
+
+Currently supported queries:
+ - 3, 7, 10, 22, 25, 26, 29, 35, 38, 40, 42, 43, 50, 52, 55, 69, 78, 79, 83, 84, 87, 93, 96, 97, 99
+
+### Tables
+All TPC-DS table schemas are stored in the provided artifacts.
+
+### Input data
+Input data is already pre-generated for two data formats (CSV and Parquet) and stored in Google Cloud Storage (gs://beam-tpcds)
+
+### Runtime
+TPC-DS extension for Beam can only be run in **Batch** mode and supports these runners for the moment:
+- Spark Runner
+- Flink Runner
+- Dataflow Runner
+
+## TPC-DS output
+
+TBD
+
+## Benchmark launch configuration
+
+The TPC-DS launcher accepts the `--runner` argument as usual for programs that
+use Beam PipelineOptions to manage their command line arguments. In addition
+to this, the necessary dependencies must be configured.
+
+When running via Gradle, the following two parameters control the execution:
+
+    -P tpcds.args
+        The command line to pass to the TPC-DS main program.
+
+    -P tpcds.runner
+	The Gradle project name of the runner, such as ":runners:spark:3" or
+	":runners:flink:1.13. The project names can be found in the root
+        `settings.gradle.kts`.
+
+Test data has to be generated before running a suite and stored to accessible file system. The query results will be written into output files.
+
+### Common configuration parameters
+
+Size of input dataset (1GB / 10GB / 100GB / 1000GB):
+
+    --dataSize=<1GB|10GB|100GB|1000GB>
+
+Path to input datasets directory:
+
+    --dataDirectory=<path to dir>
+
+Path to results directory:
+
+    --resultsDirectory=<path to dir>
+
+Format of input files:
+
+    --sourceType=<CSV|PARQUET>
+
+Run queries (comma separated list of query numbers or `all` for all queries):
+
+    --queries=<1,2,...N|all>
+
+Number of queries **N** that are running in parallel:
+
+    --tpcParallel=N
+
+## Running TPC-DS
+
+There are some examples how to run TPC-DS benchmark on different runners.
+
+Running suite on the SparkRunner (local) with Query3 against 1Gb dataset in Parquet format:
+
+    ./gradlew :sdks:java:testing:tpcds:run \
+        -Ptpcds.runner=":runners:spark:3" \
+        -Ptpcds.args="
+            --runner=SparkRunner
+            --dataSize=1GB
+            --sourceType=PARQUET
+            --dataDirectory=gs://beam-tpcds/datasets/parquet/partitioned
+            --resultsDirectory=/tmp/beam-tpcds/results/spark/
+            --tpcParallel=1
+            --queries=3"
+
+Running suite on the FlinkRunner (local) with Query7 and Query10 in parallel against 10Gb dataset in CSV format:
+
+    ./gradlew :sdks:java:testing:tpcds:run \
+        -Ptpcds.runner=":runners:flink:1.13" \
+        -Ptpcds.args="
+            --runner=FlinkRunner
+            --parallelism=2
+            --dataSize=10GB
+            --sourceType=CSV
+            --dataDirectory=gs://beam-tpcds/datasets/csv
+            --resultsDirectory=/tmp/beam-tpcds/results/flink/
+            --tpcParallel=2
+            --queries=7,10"
+
+Running suite on the DataflowRunner (local) with all queries against 100Gb dataset in PARQUET format:

Review Comment:
   ```suggestion
   Running suite on the DataflowRunner (local) with all queries against 100GB dataset in PARQUET format:
   ```



##########
website/www/site/content/en/documentation/sdks/java/testing/tpcds.md:
##########
@@ -0,0 +1,183 @@
+---
+type: languages
+title: "TPC-DS benchmark suite"
+aliases: /documentation/sdks/java/tpcds/
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+# TPC Benchmark™ DS (TPC-DS) benchmark suite
+
+## What it is
+
+> "TPC-DS is a decision support benchmark that models several generally applicable aspects of a decision support system,
+> including queries and data maintenance. The benchmark provides a representative evaluation of performance as a general
+> purpose decision support system."
+
+- Industry standard benchmark (OLAP/Data Warehouse)
+  - http://www.tpc.org/tpcds/
+- Implemented for many analytical processing systems - RDBMS, Apache Spark, Apache Flink, etc
+- Wide range of different queries (SQL)
+- Existing tools to generate input data of different sizes
+
+## Table schemas
+TBD
+
+## The queries
+
+TPC-DS benchmark contains 99 distinct SQL-99 queries (including OLAP extensions). Each query answers a business
+question, which illustrates the business context in which the query could be used.
+
+All queries are “templated” with random input parameters and used to compare SQL implementation of completeness and
+performance.
+
+## Input data
+Input data source:
+
+- Input files (CSV) are generated with CLI tool `dsdgen`
+- Input datasets can be generated for different sizes:
+  - 1GB / 10GB / 100GB / 1000GB
+- The tool constraints the minimum amount of data to be generated to 1GB
+
+## TPC-DS extension in Beam
+
+### Reasons
+
+Beam provides a simplified implementation of TPC-DS benchmark and there are several reasons to have it in Beam:
+
+- Compare the performance boost or degradation of Beam SQL for different runners or their versions
+- Run Beam SQL on different runtime environments
+- Detect missing Beam SQL features or incompatibilities
+- Find performance issues in Beam
+
+### Queries
+All TPC-DS queries in Beam are pre-generated and stored in the provided artifacts.
+
+For the moment, 28 out of 103 SQL queries (99 + 4) successfully pass by running with Beam SQL transform since not all
+SQL-99 operations are supported.
+
+Currently supported queries:
+ - 3, 7, 10, 22, 25, 26, 29, 35, 38, 40, 42, 43, 50, 52, 55, 69, 78, 79, 83, 84, 87, 93, 96, 97, 99
+
+### Tables
+All TPC-DS table schemas are stored in the provided artifacts.
+
+### Input data
+Input data is already pre-generated for two data formats (CSV and Parquet) and stored in Google Cloud Storage (gs://beam-tpcds)

Review Comment:
   ```suggestion
   CSV and Parquet input data has been pre-generated and staged in the Google Cloud Storage bucket `gs://beam-tpcds`.
   ```
   
   You might add a summary of the directory structure here, like:
   ```
   gs://beam-tpcds/datasets/{parquet,text}/{partitioned,nonpartitioned}/{1GB,10GB,100GB}/<table name>
   ```
   with a brief explanation. For example, `text` presumably means CSV, but that's not immediately obvious, and it's also unclear what the `partitioned` directory contains.
   
   A couple of other thoughts:
   - Maybe add a README in `gs://beam-tpcds` linking back here
   - Do we need all of `datasets`, `gcpTempLocation`, `output`, `resources`, and `results` in that bucket?



##########
website/www/site/content/en/documentation/sdks/java/testing/tpcds.md:
##########
@@ -0,0 +1,183 @@
+---
+type: languages
+title: "TPC-DS benchmark suite"
+aliases: /documentation/sdks/java/tpcds/
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+# TPC Benchmark™ DS (TPC-DS) benchmark suite
+
+## What it is
+
+> "TPC-DS is a decision support benchmark that models several generally applicable aspects of a decision support system,
+> including queries and data maintenance. The benchmark provides a representative evaluation of performance as a general
+> purpose decision support system."
+
+- Industry standard benchmark (OLAP/Data Warehouse)
+  - http://www.tpc.org/tpcds/
+- Implemented for many analytical processing systems - RDBMS, Apache Spark, Apache Flink, etc
+- Wide range of different queries (SQL)
+- Existing tools to generate input data of different sizes
+
+## Table schemas
+TBD
+
+## The queries
+
+TPC-DS benchmark contains 99 distinct SQL-99 queries (including OLAP extensions). Each query answers a business
+question, which illustrates the business context in which the query could be used.
+
+All queries are “templated” with random input parameters and used to compare SQL implementation of completeness and
+performance.

Review Comment:
   ```suggestion
   All queries are “templated” with random input parameters and used to compare completeness and
   performance of SQL implementations. 
   ```



##########
website/www/site/content/en/documentation/sdks/java/testing/tpcds.md:
##########
@@ -0,0 +1,183 @@
+---
+type: languages
+title: "TPC-DS benchmark suite"
+aliases: /documentation/sdks/java/tpcds/
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+# TPC Benchmark™ DS (TPC-DS) benchmark suite
+
+## What it is
+
+> "TPC-DS is a decision support benchmark that models several generally applicable aspects of a decision support system,
+> including queries and data maintenance. The benchmark provides a representative evaluation of performance as a general
+> purpose decision support system."
+
+- Industry standard benchmark (OLAP/Data Warehouse)
+  - http://www.tpc.org/tpcds/
+- Implemented for many analytical processing systems - RDBMS, Apache Spark, Apache Flink, etc
+- Wide range of different queries (SQL)
+- Existing tools to generate input data of different sizes
+
+## Table schemas
+TBD
+
+## The queries
+
+TPC-DS benchmark contains 99 distinct SQL-99 queries (including OLAP extensions). Each query answers a business
+question, which illustrates the business context in which the query could be used.
+
+All queries are “templated” with random input parameters and used to compare SQL implementation of completeness and
+performance.
+
+## Input data
+Input data source:
+
+- Input files (CSV) are generated with CLI tool `dsdgen`
+- Input datasets can be generated for different sizes:
+  - 1GB / 10GB / 100GB / 1000GB
+- The tool constraints the minimum amount of data to be generated to 1GB
+
+## TPC-DS extension in Beam
+
+### Reasons
+
+Beam provides a simplified implementation of TPC-DS benchmark and there are several reasons to have it in Beam:
+
+- Compare the performance boost or degradation of Beam SQL for different runners or their versions
+- Run Beam SQL on different runtime environments
+- Detect missing Beam SQL features or incompatibilities
+- Find performance issues in Beam
+
+### Queries
+All TPC-DS queries in Beam are pre-generated and stored in the provided artifacts.
+
+For the moment, 28 out of 103 SQL queries (99 + 4) successfully pass by running with Beam SQL transform since not all
+SQL-99 operations are supported.
+
+Currently supported queries:
+ - 3, 7, 10, 22, 25, 26, 29, 35, 38, 40, 42, 43, 50, 52, 55, 69, 78, 79, 83, 84, 87, 93, 96, 97, 99
+
+### Tables
+All TPC-DS table schemas are stored in the provided artifacts.
+
+### Input data
+Input data is already pre-generated for two data formats (CSV and Parquet) and stored in Google Cloud Storage (gs://beam-tpcds)
+
+### Runtime
+TPC-DS extension for Beam can only be run in **Batch** mode and supports these runners for the moment:
+- Spark Runner
+- Flink Runner
+- Dataflow Runner
+
+## TPC-DS output
+
+TBD
+
+## Benchmark launch configuration
+
+The TPC-DS launcher accepts the `--runner` argument as usual for programs that
+use Beam PipelineOptions to manage their command line arguments. In addition
+to this, the necessary dependencies must be configured.
+
+When running via Gradle, the following two parameters control the execution:
+
+    -P tpcds.args
+        The command line to pass to the TPC-DS main program.
+
+    -P tpcds.runner
+	The Gradle project name of the runner, such as ":runners:spark:3" or
+	":runners:flink:1.13. The project names can be found in the root
+        `settings.gradle.kts`.
+
+Test data has to be generated before running a suite and stored to accessible file system. The query results will be written into output files.
+
+### Common configuration parameters
+
+Size of input dataset (1GB / 10GB / 100GB / 1000GB):
+
+    --dataSize=<1GB|10GB|100GB|1000GB>
+
+Path to input datasets directory:
+
+    --dataDirectory=<path to dir>
+
+Path to results directory:
+
+    --resultsDirectory=<path to dir>
+
+Format of input files:
+
+    --sourceType=<CSV|PARQUET>
+
+Run queries (comma separated list of query numbers or `all` for all queries):
+
+    --queries=<1,2,...N|all>
+
+Number of queries **N** that are running in parallel:
+
+    --tpcParallel=N
+
+## Running TPC-DS
+
+There are some examples how to run TPC-DS benchmark on different runners.
+
+Running suite on the SparkRunner (local) with Query3 against 1Gb dataset in Parquet format:
+
+    ./gradlew :sdks:java:testing:tpcds:run \
+        -Ptpcds.runner=":runners:spark:3" \
+        -Ptpcds.args="
+            --runner=SparkRunner
+            --dataSize=1GB
+            --sourceType=PARQUET
+            --dataDirectory=gs://beam-tpcds/datasets/parquet/partitioned
+            --resultsDirectory=/tmp/beam-tpcds/results/spark/
+            --tpcParallel=1
+            --queries=3"
+
+Running suite on the FlinkRunner (local) with Query7 and Query10 in parallel against 10Gb dataset in CSV format:
+
+    ./gradlew :sdks:java:testing:tpcds:run \
+        -Ptpcds.runner=":runners:flink:1.13" \
+        -Ptpcds.args="
+            --runner=FlinkRunner
+            --parallelism=2
+            --dataSize=10GB
+            --sourceType=CSV
+            --dataDirectory=gs://beam-tpcds/datasets/csv
+            --resultsDirectory=/tmp/beam-tpcds/results/flink/
+            --tpcParallel=2
+            --queries=7,10"
+
+Running suite on the DataflowRunner (local) with all queries against 100Gb dataset in PARQUET format:
+
+    ./gradlew :sdks:java:testing:tpcds:run \
+        -Ptpcds.runner=":runners:google-cloud-dataflow-java" \
+        -Ptpcds.args="
+            --runner=DataflowRunner
+            --region=<region_name>
+            --project=<project_name>
+            --numWorkers=4
+            --maxNumWorkers=4
+            --autoscalingAlgorithm=NONE
+            --dataSize=100GB
+            --sourceType=PARQUET
+            --dataDirectory=gs://beam-tpcds/datasets/parquet/partitioned
+            --resultsDirectory=/tmp/beam-tpcds/results/dataflow/
+            --tpcParallel=4
+            --queries=all"
+
+## TPC-DS dashboards
+TBD

Review Comment:
   I don't think we should publish this with the "TBD" sections still here. Do you intend to address these? If not, we could just comment them out for now so they don't show up on the website.



##########
website/www/site/content/en/documentation/sdks/java/testing/tpcds.md:
##########
@@ -0,0 +1,183 @@
+---
+type: languages
+title: "TPC-DS benchmark suite"
+aliases: /documentation/sdks/java/tpcds/
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+# TPC Benchmark™ DS (TPC-DS) benchmark suite
+
+## What it is
+
+> "TPC-DS is a decision support benchmark that models several generally applicable aspects of a decision support system,
+> including queries and data maintenance. The benchmark provides a representative evaluation of performance as a general
+> purpose decision support system."
+
+- Industry standard benchmark (OLAP/Data Warehouse)
+  - http://www.tpc.org/tpcds/
+- Implemented for many analytical processing systems - RDBMS, Apache Spark, Apache Flink, etc
+- Wide range of different queries (SQL)
+- Existing tools to generate input data of different sizes
+
+## Table schemas
+TBD
+
+## The queries
+
+TPC-DS benchmark contains 99 distinct SQL-99 queries (including OLAP extensions). Each query answers a business
+question, which illustrates the business context in which the query could be used.
+
+All queries are “templated” with random input parameters and used to compare SQL implementation of completeness and
+performance.
+
+## Input data
+Input data source:
+
+- Input files (CSV) are generated with CLI tool `dsdgen`
+- Input datasets can be generated for different sizes:
+  - 1GB / 10GB / 100GB / 1000GB
+- The tool constraints the minimum amount of data to be generated to 1GB
+
+## TPC-DS extension in Beam
+
+### Reasons
+
+Beam provides a simplified implementation of TPC-DS benchmark and there are several reasons to have it in Beam:
+
+- Compare the performance boost or degradation of Beam SQL for different runners or their versions
+- Run Beam SQL on different runtime environments
+- Detect missing Beam SQL features or incompatibilities
+- Find performance issues in Beam
+
+### Queries
+All TPC-DS queries in Beam are pre-generated and stored in the provided artifacts.
+
+For the moment, 28 out of 103 SQL queries (99 + 4) successfully pass by running with Beam SQL transform since not all
+SQL-99 operations are supported.
+
+Currently supported queries:
+ - 3, 7, 10, 22, 25, 26, 29, 35, 38, 40, 42, 43, 50, 52, 55, 69, 78, 79, 83, 84, 87, 93, 96, 97, 99
+
+### Tables
+All TPC-DS table schemas are stored in the provided artifacts.
+
+### Input data
+Input data is already pre-generated for two data formats (CSV and Parquet) and stored in Google Cloud Storage (gs://beam-tpcds)
+
+### Runtime
+TPC-DS extension for Beam can only be run in **Batch** mode and supports these runners for the moment:
+- Spark Runner
+- Flink Runner
+- Dataflow Runner
+
+## TPC-DS output
+
+TBD
+
+## Benchmark launch configuration
+
+The TPC-DS launcher accepts the `--runner` argument as usual for programs that
+use Beam PipelineOptions to manage their command line arguments. In addition
+to this, the necessary dependencies must be configured.
+
+When running via Gradle, the following two parameters control the execution:
+
+    -P tpcds.args
+        The command line to pass to the TPC-DS main program.
+
+    -P tpcds.runner
+	The Gradle project name of the runner, such as ":runners:spark:3" or
+	":runners:flink:1.13. The project names can be found in the root
+        `settings.gradle.kts`.
+
+Test data has to be generated before running a suite and stored to accessible file system. The query results will be written into output files.
+
+### Common configuration parameters
+
+Size of input dataset (1GB / 10GB / 100GB / 1000GB):
+
+    --dataSize=<1GB|10GB|100GB|1000GB>
+
+Path to input datasets directory:
+
+    --dataDirectory=<path to dir>
+
+Path to results directory:
+
+    --resultsDirectory=<path to dir>
+
+Format of input files:
+
+    --sourceType=<CSV|PARQUET>
+
+Run queries (comma separated list of query numbers or `all` for all queries):

Review Comment:
   ```suggestion
   Select queries to run (comma separated list of query numbers or `all` for all queries):
   ```



##########
website/www/site/content/en/documentation/sdks/java/testing/tpcds.md:
##########
@@ -0,0 +1,183 @@
+---
+type: languages
+title: "TPC-DS benchmark suite"
+aliases: /documentation/sdks/java/tpcds/
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+# TPC Benchmark™ DS (TPC-DS) benchmark suite
+
+## What it is
+
+> "TPC-DS is a decision support benchmark that models several generally applicable aspects of a decision support system,
+> including queries and data maintenance. The benchmark provides a representative evaluation of performance as a general
+> purpose decision support system."
+
+- Industry standard benchmark (OLAP/Data Warehouse)
+  - http://www.tpc.org/tpcds/
+- Implemented for many analytical processing systems - RDBMS, Apache Spark, Apache Flink, etc
+- Wide range of different queries (SQL)
+- Existing tools to generate input data of different sizes
+
+## Table schemas
+TBD
+
+## The queries
+
+TPC-DS benchmark contains 99 distinct SQL-99 queries (including OLAP extensions). Each query answers a business
+question, which illustrates the business context in which the query could be used.
+
+All queries are “templated” with random input parameters and used to compare SQL implementation of completeness and
+performance.
+
+## Input data
+Input data source:
+
+- Input files (CSV) are generated with CLI tool `dsdgen`
+- Input datasets can be generated for different sizes:
+  - 1GB / 10GB / 100GB / 1000GB
+- The tool constraints the minimum amount of data to be generated to 1GB
+
+## TPC-DS extension in Beam
+
+### Reasons
+
+Beam provides a simplified implementation of TPC-DS benchmark and there are several reasons to have it in Beam:
+
+- Compare the performance boost or degradation of Beam SQL for different runners or their versions
+- Run Beam SQL on different runtime environments
+- Detect missing Beam SQL features or incompatibilities
+- Find performance issues in Beam

Review Comment:
   ```suggestion
   - Compare the performance of Beam SQL against native SQL implementations for different runners.
   - Exercise Beam SQL on different runtime environments.
   - Identify missing or incorrect Beam SQL features.
   - Identify performance issues in Beam and Beam SQL.
   ```



##########
website/www/site/content/en/documentation/sdks/java/testing/tpcds.md:
##########
@@ -0,0 +1,183 @@
+---
+type: languages
+title: "TPC-DS benchmark suite"
+aliases: /documentation/sdks/java/tpcds/
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+# TPC Benchmark™ DS (TPC-DS) benchmark suite
+
+## What it is
+
+> "TPC-DS is a decision support benchmark that models several generally applicable aspects of a decision support system,
+> including queries and data maintenance. The benchmark provides a representative evaluation of performance as a general
+> purpose decision support system."
+
+- Industry standard benchmark (OLAP/Data Warehouse)
+  - http://www.tpc.org/tpcds/
+- Implemented for many analytical processing systems - RDBMS, Apache Spark, Apache Flink, etc
+- Wide range of different queries (SQL)
+- Existing tools to generate input data of different sizes
+
+## Table schemas
+TBD
+
+## The queries
+
+TPC-DS benchmark contains 99 distinct SQL-99 queries (including OLAP extensions). Each query answers a business
+question, which illustrates the business context in which the query could be used.
+
+All queries are “templated” with random input parameters and used to compare SQL implementation of completeness and
+performance.
+
+## Input data
+Input data source:
+
+- Input files (CSV) are generated with CLI tool `dsdgen`
+- Input datasets can be generated for different sizes:
+  - 1GB / 10GB / 100GB / 1000GB
+- The tool constraints the minimum amount of data to be generated to 1GB
+
+## TPC-DS extension in Beam
+
+### Reasons
+
+Beam provides a simplified implementation of TPC-DS benchmark and there are several reasons to have it in Beam:
+
+- Compare the performance boost or degradation of Beam SQL for different runners or their versions
+- Run Beam SQL on different runtime environments
+- Detect missing Beam SQL features or incompatibilities
+- Find performance issues in Beam
+
+### Queries
+All TPC-DS queries in Beam are pre-generated and stored in the provided artifacts.
+
+For the moment, 28 out of 103 SQL queries (99 + 4) successfully pass by running with Beam SQL transform since not all
+SQL-99 operations are supported.
+
+Currently supported queries:
+ - 3, 7, 10, 22, 25, 26, 29, 35, 38, 40, 42, 43, 50, 52, 55, 69, 78, 79, 83, 84, 87, 93, 96, 97, 99
+
+### Tables
+All TPC-DS table schemas are stored in the provided artifacts.
+
+### Input data
+Input data is already pre-generated for two data formats (CSV and Parquet) and stored in Google Cloud Storage (gs://beam-tpcds)
+
+### Runtime
+TPC-DS extension for Beam can only be run in **Batch** mode and supports these runners for the moment:
+- Spark Runner
+- Flink Runner
+- Dataflow Runner
+
+## TPC-DS output
+
+TBD
+
+## Benchmark launch configuration
+
+The TPC-DS launcher accepts the `--runner` argument as usual for programs that
+use Beam PipelineOptions to manage their command line arguments. In addition
+to this, the necessary dependencies must be configured.
+
+When running via Gradle, the following two parameters control the execution:
+
+    -P tpcds.args
+        The command line to pass to the TPC-DS main program.
+
+    -P tpcds.runner
+	The Gradle project name of the runner, such as ":runners:spark:3" or
+	":runners:flink:1.13. The project names can be found in the root
+        `settings.gradle.kts`.
+
+Test data has to be generated before running a suite and stored to accessible file system. The query results will be written into output files.
+
+### Common configuration parameters
+
+Size of input dataset (1GB / 10GB / 100GB / 1000GB):
+
+    --dataSize=<1GB|10GB|100GB|1000GB>
+
+Path to input datasets directory:
+
+    --dataDirectory=<path to dir>
+
+Path to results directory:
+
+    --resultsDirectory=<path to dir>
+
+Format of input files:
+
+    --sourceType=<CSV|PARQUET>
+
+Run queries (comma separated list of query numbers or `all` for all queries):
+
+    --queries=<1,2,...N|all>
+
+Number of queries **N** that are running in parallel:
+
+    --tpcParallel=N
+
+## Running TPC-DS
+
+There are some examples how to run TPC-DS benchmark on different runners.
+
+Running suite on the SparkRunner (local) with Query3 against 1Gb dataset in Parquet format:
+
+    ./gradlew :sdks:java:testing:tpcds:run \
+        -Ptpcds.runner=":runners:spark:3" \
+        -Ptpcds.args="
+            --runner=SparkRunner
+            --dataSize=1GB
+            --sourceType=PARQUET
+            --dataDirectory=gs://beam-tpcds/datasets/parquet/partitioned

Review Comment:
   Could/should we resolve this path from `dataSize` and `sourceType`? (We could still allow the user to override it.)
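
   To make the idea concrete, here is a rough sketch of default path resolution with a user override. It is written in Python only for brevity; the real change would live in the Java launcher. The function name `resolve_data_directory`, the `partitioned` switch, and the `CSV` to `text` mapping are all hypothetical, and the bucket layout is assumed from the directory summary proposed elsewhere in this review, so it would need checking against the actual bucket. It returns the dataset root in the same form the examples pass to `--dataDirectory`, on the assumption that the launcher appends the data size itself.

```python
from typing import Optional

# Assumed layout (unverified), taken from the summary proposed in this review:
#   gs://beam-tpcds/datasets/{parquet,text}/{partitioned,nonpartitioned}/<size>/<table>
BUCKET_ROOT = "gs://beam-tpcds/datasets"

# Assumption: CSV input lives under "text". This mapping is a guess.
FORMAT_DIRS = {"PARQUET": "parquet", "CSV": "text"}


def resolve_data_directory(source_type: str,
                           override: Optional[str] = None,
                           partitioned: bool = True) -> str:
    """Return the dataset root for a source type unless the user overrides it."""
    if override:
        # An explicit --dataDirectory always wins over the derived default.
        return override.rstrip("/")
    try:
        format_dir = FORMAT_DIRS[source_type.upper()]
    except KeyError:
        raise ValueError(f"unsupported sourceType: {source_type}") from None
    layout = "partitioned" if partitioned else "nonpartitioned"
    return f"{BUCKET_ROOT}/{format_dir}/{layout}"
```

   With something like this, `--dataDirectory` could become optional in the examples and only be needed when reading from a non-default location.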



##########
website/www/site/content/en/documentation/sdks/java/testing/tpcds.md:
##########
@@ -0,0 +1,183 @@
+---
+type: languages
+title: "TPC-DS benchmark suite"
+aliases: /documentation/sdks/java/tpcds/
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+# TPC Benchmark™ DS (TPC-DS) benchmark suite
+
+## What it is
+
+> "TPC-DS is a decision support benchmark that models several generally applicable aspects of a decision support system,
+> including queries and data maintenance. The benchmark provides a representative evaluation of performance as a general
+> purpose decision support system."
+
+- Industry standard benchmark (OLAP/Data Warehouse)
+  - http://www.tpc.org/tpcds/
+- Implemented for many analytical processing systems - RDBMS, Apache Spark, Apache Flink, etc
+- Wide range of different queries (SQL)
+- Existing tools to generate input data of different sizes
+
+## Table schemas
+TBD
+
+## The queries
+
+TPC-DS benchmark contains 99 distinct SQL-99 queries (including OLAP extensions). Each query answers a business
+question, which illustrates the business context in which the query could be used.
+
+All queries are “templated” with random input parameters and used to compare SQL implementation of completeness and
+performance.
+
+## Input data
+Input data source:
+
+- Input files (CSV) are generated with CLI tool `dsdgen`
+- Input datasets can be generated for different sizes:
+  - 1GB / 10GB / 100GB / 1000GB
+- The tool constraints the minimum amount of data to be generated to 1GB
+
+## TPC-DS extension in Beam
+
+### Reasons
+
+Beam provides a simplified implementation of TPC-DS benchmark and there are several reasons to have it in Beam:
+
+- Compare the performance boost or degradation of Beam SQL for different runners or their versions
+- Run Beam SQL on different runtime environments
+- Detect missing Beam SQL features or incompatibilities
+- Find performance issues in Beam
+
+### Queries
+All TPC-DS queries in Beam are pre-generated and stored in the provided artifacts.
+
+For the moment, 28 out of 103 SQL queries (99 + 4) successfully pass by running with Beam SQL transform since not all
+SQL-99 operations are supported.
+
+Currently supported queries:
+ - 3, 7, 10, 22, 25, 26, 29, 35, 38, 40, 42, 43, 50, 52, 55, 69, 78, 79, 83, 84, 87, 93, 96, 97, 99
+
+### Tables
+All TPC-DS table schemas are stored in the provided artifacts.
+
+### Input data
+Input data is already pre-generated for two data formats (CSV and Parquet) and stored in Google Cloud Storage (gs://beam-tpcds)
+
+### Runtime
+TPC-DS extension for Beam can only be run in **Batch** mode and supports these runners for the moment:
+- Spark Runner
+- Flink Runner
+- Dataflow Runner
+
+## TPC-DS output
+
+TBD
+
+## Benchmark launch configuration
+
+The TPC-DS launcher accepts the `--runner` argument as usual for programs that
+use Beam PipelineOptions to manage their command line arguments. In addition
+to this, the necessary dependencies must be configured.
+
+When running via Gradle, the following two parameters control the execution:
+
+    -P tpcds.args
+        The command line to pass to the TPC-DS main program.
+
+    -P tpcds.runner
+	The Gradle project name of the runner, such as ":runners:spark:3" or
+	":runners:flink:1.13. The project names can be found in the root
+        `settings.gradle.kts`.
+
+Test data has to be generated before running a suite and stored to accessible file system. The query results will be written into output files.
+
+### Common configuration parameters
+
+Size of input dataset (1GB / 10GB / 100GB / 1000GB):
+
+    --dataSize=<1GB|10GB|100GB|1000GB>
+
+Path to input datasets directory:
+
+    --dataDirectory=<path to dir>
+
+Path to results directory:
+
+    --resultsDirectory=<path to dir>
+
+Format of input files:
+
+    --sourceType=<CSV|PARQUET>
+
+Run queries (comma separated list of query numbers or `all` for all queries):
+
+    --queries=<1,2,...N|all>
+
+Number of queries **N** to run in parallel:
+
+    --tpcParallel=N
+
+## Running TPC-DS
+
+There are some examples how to run TPC-DS benchmark on different runners.

Review Comment:
   ```suggestion
   Here are some examples demonstrating how to run TPC-DS benchmarks on different runners.
   ```
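
As a rough illustration of how the launch parameters in the hunk above fit together, here is a minimal Python sketch that assembles the value for `tpcds.args` and the two Gradle `-P` properties. The function names `build_tpcds_args` and `gradle_properties` are hypothetical helpers, not part of Beam, and the `-Pname=value` spelling is an assumption about how the properties would be passed. The Gradle task to invoke is not stated in this hunk, so the sketch leaves it out.

```python
# Allowed values copied from the "Common configuration parameters" list.
DATA_SIZES = {"1GB", "10GB", "100GB", "1000GB"}
SOURCE_TYPES = {"CSV", "PARQUET"}


def build_tpcds_args(runner, data_size, data_directory, results_directory,
                     source_type, queries="all", tpc_parallel=1):
    """Return the command line string for the tpcds.args property (hypothetical helper)."""
    if data_size not in DATA_SIZES:
        raise ValueError(f"unsupported dataSize: {data_size}")
    if source_type not in SOURCE_TYPES:
        raise ValueError(f"unsupported sourceType: {source_type}")
    if tpc_parallel < 1:
        raise ValueError("tpcParallel must be at least 1")
    # `queries` is either the literal "all" or an iterable of query numbers.
    if queries != "all":
        queries = ",".join(str(int(q)) for q in queries)
    return " ".join([
        f"--runner={runner}",
        f"--dataSize={data_size}",
        f"--dataDirectory={data_directory}",
        f"--resultsDirectory={results_directory}",
        f"--sourceType={source_type}",
        f"--queries={queries}",
        f"--tpcParallel={tpc_parallel}",
    ])


def gradle_properties(runner_project, tpcds_args):
    """Return the two -P properties described in the launch configuration (hypothetical helper)."""
    return [f"-Ptpcds.runner={runner_project}", f"-Ptpcds.args={tpcds_args}"]


if __name__ == "__main__":
    args = build_tpcds_args("SparkRunner", "1GB", "/path/to/tpcds_data",
                            "/path/to/tpcds_results", "PARQUET", queries=[3, 7])
    for prop in gradle_properties(":runners:spark:3", args):
        print(prop)
```

Validating `dataSize` and `sourceType` before launching is only a convenience here; the launcher itself may accept or reject values differently.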



##########
website/www/site/content/en/documentation/sdks/java/testing/tpcds.md:
##########
@@ -0,0 +1,183 @@
+---
+type: languages
+title: "TPC-DS benchmark suite"
+aliases: /documentation/sdks/java/tpcds/
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+# TPC Benchmark™ DS (TPC-DS) benchmark suite
+
+## What it is
+
+> "TPC-DS is a decision support benchmark that models several generally applicable aspects of a decision support system,
+> including queries and data maintenance. The benchmark provides a representative evaluation of performance as a general
+> purpose decision support system."
+
+- Industry standard benchmark (OLAP/Data Warehouse)
+  - http://www.tpc.org/tpcds/
+- Implemented for many analytical processing systems - RDBMS, Apache Spark, Apache Flink, etc
+- Wide range of different queries (SQL)
+- Existing tools to generate input data of different sizes
+
+## Table schemas
+TBD

Review Comment:
   +1


