Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/07/05 08:06:17 UTC

[GitHub] [beam] echauchot commented on a diff in pull request #22047: [website] Add TPC-DS benchmark documentation

echauchot commented on code in PR #22047:
URL: https://github.com/apache/beam/pull/22047#discussion_r913504297


##########
website/www/site/content/en/documentation/sdks/java/testing/tpcds.md:
##########
@@ -0,0 +1,183 @@
+---
+type: languages
+title: "TPC-DS benchmark suite"
+aliases: /documentation/sdks/java/tpcds/
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+# TPC Benchmark™ DS (TPC-DS) benchmark suite
+
+## What it is
+
+> "TPC-DS is a decision support benchmark that models several generally applicable aspects of a decision support system,
+> including queries and data maintenance. The benchmark provides a representative evaluation of performance as a general
+> purpose decision support system."
+
+- Industry standard benchmark (OLAP/Data Warehouse)
+  - http://www.tpc.org/tpcds/
+- Implemented for many analytical processing systems - RDBMS, Apache Spark, Apache Flink, etc
+- Wide range of different queries (SQL)
+- Existing tools to generate input data of different sizes
+
+## Table schemas
+TBD

Review Comment:
   Extracting the tables that are actually used from the full set of TPC-DS tables would be painful, so maybe just say a word about each of the different tables, like this:
   **store_sales**: contains the sales records of a virtual store



##########
website/www/site/content/en/documentation/sdks/java/testing/tpcds.md:
##########
@@ -0,0 +1,183 @@
+---
+type: languages
+title: "TPC-DS benchmark suite"
+aliases: /documentation/sdks/java/tpcds/
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+# TPC Benchmark™ DS (TPC-DS) benchmark suite
+
+## What it is
+
+> "TPC-DS is a decision support benchmark that models several generally applicable aspects of a decision support system,
+> including queries and data maintenance. The benchmark provides a representative evaluation of performance as a general
+> purpose decision support system."
+
+- Industry standard benchmark (OLAP/Data Warehouse)
+  - http://www.tpc.org/tpcds/
+- Implemented for many analytical processing systems - RDBMS, Apache Spark, Apache Flink, etc
+- Wide range of different queries (SQL)
+- Existing tools to generate input data of different sizes
+
+## Table schemas
+TBD
+
+## The queries
+
+TPC-DS benchmark contains 99 distinct SQL-99 queries (including OLAP extensions). Each query answers a business
+question, which illustrates the business context in which the query could be used.
+
+All queries are “templated” with random input parameters and used to compare SQL implementation of completeness and
+performance.
+
+## Input data
+Input data source:
+
+- Input files (CSV) are generated with CLI tool `dsdgen`
+- Input datasets can be generated for different sizes:
+  - 1GB / 10GB / 100GB / 1000GB
+- The tool constraints the minimum amount of data to be generated to 1GB
+
+## TPC-DS extension in Beam
+
+### Reasons
+
+Beam provides a simplified implementation of TPC-DS benchmark and there are several reasons to have it in Beam:
+
+- Compare the performance boost or degradation of Beam SQL for different runners or their versions
+- Run Beam SQL on different runtime environments
+- Detect missing Beam SQL features or incompatibilities
+- Find performance issues in Beam
+
+### Queries
+All TPC-DS queries in Beam are pre-generated and stored in the provided artifacts.
+
+For the moment, 28 out of 103 SQL queries (99 + 4) successfully pass by running with Beam SQL transform since not all
+SQL-99 operations are supported.
+
+Currently supported queries:
+ - 3, 7, 10, 22, 25, 26, 29, 35, 38, 40, 42, 43, 50, 52, 55, 69, 78, 79, 83, 84, 87, 93, 96, 97, 99
+
+### Tables
+All TPC-DS table schemas are stored in the provided artifacts.
+
+### Input data
+Input data is already pre-generated for two data formats (CSV and Parquet) and stored in Google Cloud Storage (gs://beam-tpcds)
+
+### Runtime
+TPC-DS extension for Beam can only be run in **Batch** mode and supports these runners for the moment:
+- Spark Runner
+- Flink Runner
+- Dataflow Runner
+
+## TPC-DS output
+
+TBD

Review Comment:
   For now, just talk about the response time in the output log.



##########
website/www/site/content/en/documentation/sdks/java/testing/tpcds.md:
##########
@@ -0,0 +1,183 @@
+---
+type: languages
+title: "TPC-DS benchmark suite"
+aliases: /documentation/sdks/java/tpcds/
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+# TPC Benchmark™ DS (TPC-DS) benchmark suite
+
+## What it is
+
+> "TPC-DS is a decision support benchmark that models several generally applicable aspects of a decision support system,
+> including queries and data maintenance. The benchmark provides a representative evaluation of performance as a general
+> purpose decision support system."
+
+- Industry standard benchmark (OLAP/Data Warehouse)
+  - http://www.tpc.org/tpcds/
+- Implemented for many analytical processing systems - RDBMS, Apache Spark, Apache Flink, etc
+- Wide range of different queries (SQL)
+- Existing tools to generate input data of different sizes
+
+## Table schemas
+TBD
+
+## The queries
+
+TPC-DS benchmark contains 99 distinct SQL-99 queries (including OLAP extensions). Each query answers a business
+question, which illustrates the business context in which the query could be used.
+
+All queries are “templated” with random input parameters and used to compare SQL implementation of completeness and
+performance.
+
+## Input data
+Input data source:
+
+- Input files (CSV) are generated with CLI tool `dsdgen`
+- Input datasets can be generated for different sizes:
+  - 1GB / 10GB / 100GB / 1000GB
+- The tool constraints the minimum amount of data to be generated to 1GB
+
+## TPC-DS extension in Beam
+
+### Reasons
+
+Beam provides a simplified implementation of TPC-DS benchmark and there are several reasons to have it in Beam:
+
+- Compare the performance boost or degradation of Beam SQL for different runners or their versions
+- Run Beam SQL on different runtime environments
+- Detect missing Beam SQL features or incompatibilities
+- Find performance issues in Beam
+
+### Queries
+All TPC-DS queries in Beam are pre-generated and stored in the provided artifacts.
+
+For the moment, 28 out of 103 SQL queries (99 + 4) successfully pass by running with Beam SQL transform since not all
+SQL-99 operations are supported.
+
+Currently supported queries:

Review Comment:
   Add "are", i.e. "Currently supported queries are:"


