Posted to commits@pinot.apache.org by ki...@apache.org on 2020/07/11 19:05:39 UTC

[incubator-pinot] branch master updated: Add benchmark documentation. (#5683)

This is an automated email from the ASF dual-hosted git repository.

kishoreg pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-pinot.git


The following commit(s) were added to refs/heads/master by this push:
     new 9477e01  Add benchmark documentation. (#5683)
9477e01 is described below

commit 9477e014ed7e053e9d098b15742dbbdfe7e85493
Author: Nikola Grcevski <62...@users.noreply.github.com>
AuthorDate: Sat Jul 11 15:05:26 2020 -0400

    Add benchmark documentation. (#5683)
---
 contrib/pinot-druid-benchmark/README.md | 297 ++++++++++++++++++++++++++++++++
 1 file changed, 297 insertions(+)

diff --git a/contrib/pinot-druid-benchmark/README.md b/contrib/pinot-druid-benchmark/README.md
new file mode 100644
index 0000000..f7ac5f1
--- /dev/null
+++ b/contrib/pinot-druid-benchmark/README.md
@@ -0,0 +1,297 @@
+# Running the benchmark
+
+For instructions on how to run the Pinot/Druid benchmark, please refer to the
+```run_benchmark.sh``` file.
+
+In order to run the Apache Pinot benchmark, you'll need to create the appropriate
+data segments. They are too large to be included in this GitHub repository, and
+they may need to be recreated for new Apache Pinot versions.
+
+To create the necessary segment data for the benchmark, please follow the
+instructions below.
+
+# Creating Apache Pinot benchmark segments from TPC-H data
+
+To run the Pinot/Druid benchmark with Apache Pinot, you'll need to download the
+TPC-H tools and use them to generate the benchmark data sets.
+
+## Downloading and building the TPC-H tools
+
+The TPC-H tools can be downloaded from the [TPC-H Website](http://www.tpc.org/tpch/default5.asp). 
+Registration is required.
+
+**Note:** The instructions below for dbgen assume a Linux OS.
+
+After downloading and extracting the TPC-H tools, you'll need to build the
+database generator tool, ```dbgen```. To do so, go into the dbgen subdirectory
+of the extracted package and edit the ```makefile``` file.
+
+Set the following variables in the makefile:
+
+```
+CC      = gcc
+...
+DATABASE= SQLSERVER
+MACHINE = LINUX
+WORKLOAD = TPCH
+```
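+
+If you prefer to script this step, here is a minimal sketch. It assumes the
+stock TPC-H dbgen layout, where ```makefile.suite``` is the template that gets
+copied to ```makefile``` before editing:
+
+```
+# Copy the makefile template shipped with dbgen, then set the four
+# variables in place (assumes GNU sed).
+cp makefile.suite makefile
+sed -i -e 's/^CC *=.*/CC      = gcc/' \
+       -e 's/^DATABASE *=.*/DATABASE= SQLSERVER/' \
+       -e 's/^MACHINE *=.*/MACHINE = LINUX/' \
+       -e 's/^WORKLOAD *=.*/WORKLOAD = TPCH/' makefile
+```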
+
+Next, build the dbgen tool as per the README instructions in the dbgen directory.
+
+## Generating the TPC-H data and converting it for use in Apache Pinot
+
+After building ```dbgen``` run the following command line in the ```dbgen``` directory:
+
+```
+./dbgen -TL -s8
+```
+
+The command above will generate a single large file called ```lineitem.tbl```
+(```-TL``` restricts generation to the lineitem table, and ```-s8``` sets the
+TPC-H scale factor to 8). This is the data file for the TPC-H benchmark, which
+we'll need to post-process before it can be imported into Apache Pinot.
+
+Next, build the Pinot/Druid Benchmark code if you haven't done so already.
+
+**Note:** Apache Pinot has JDK 11 support; however, for now it's
+best to use JDK 8 for all build and run operations in this manual.
+
+Inside ```pinot_directory/contrib/pinot-druid-benchmark``` run:
+
+```
+mvn clean install
+```
+
+Next, inside the same directory, split the ```lineitem``` table:
+
+```
+./target/appassembler/bin/data-separator.sh <Path to lineitem.tbl> <Output Directory> 
+```
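+
+For example, assuming ```lineitem.tbl``` was generated in ```~/tpch/dbgen``` (the
+paths here are illustrative only):
+
+```
+./target/appassembler/bin/data-separator.sh ~/tpch/dbgen/lineitem.tbl ./data_out/split
+```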
+
+Use the output directory from the split as the input directory for the merge
+command below:
+
+```
+./target/appassembler/bin/data-merger.sh <Input Directory> <Output Directory> YEAR
+```
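+
+Continuing with the illustrative paths above (note that the job spec later in
+this document expects the merged CSVs under ```data_out/raw_data```):
+
+```
+./target/appassembler/bin/data-merger.sh ./data_out/split ./data_out/raw_data YEAR
+```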
+
+If all ran well, you should see seven CSV files produced, 1992.csv through 1998.csv.
+
+These files are the starting point for creating our Apache Pinot segments.
+
+## Create the Apache Pinot segments
+
+The first step in the process is to launch a standalone Apache Pinot cluster on a
+single server. This cluster will serve as a host to hold the initial segments,
+which we'll extract and copy for later re-use in the benchmark.
+
+Follow the steps outlined in the Apache Pinot Manual Cluster setup to launch the
+cluster:
+
+https://docs.pinot.apache.org/basics/getting-started/advanced-pinot-setup
+
+You don't need the Kafka service as we won't be using it.
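+
+As a condensed sketch, the manual setup on that page boils down to starting the
+four services below on one machine (assuming default ports; see the linked page
+for the authoritative steps):
+
+```
+# Start each Pinot service in the background.
+pinot-admin.sh StartZookeeper -zkPort 2181 &
+pinot-admin.sh StartController -zkAddress localhost:2181 -controllerPort 9000 &
+pinot-admin.sh StartBroker -zkAddress localhost:2181 &
+pinot-admin.sh StartServer -zkAddress localhost:2181 &
+```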
+
+Next, we need to follow instructions similar to those described in
+the [Batch Import Example](https://docs.pinot.apache.org/basics/getting-started/pushing-your-data-to-pinot)
+in the Apache Pinot documentation.
+
+### Create the Apache Pinot tables
+
+Run:
+
+```
+pinot-admin.sh AddTable \
+  -tableConfigFile /absolute/path/to/table_config.json \
+  -schemaFile /absolute/path/to/schema.json -exec
+```
+
+For the command above you'll need the following two configuration files:
+
+```table_config.json```
+```
+{
+  "tableName": "tpch_lineitem",
+  "segmentsConfig" : {
+    "replication" : "1",
+    "schemaName" : "tpch_lineitem",
+    "segmentAssignmentStrategy" : "BalanceNumSegmentAssignmentStrategy"
+  },
+  "tenants" : {
+    "broker":"DefaultTenant",
+    "server":"DefaultTenant"
+  },
+  "tableIndexConfig" : {
+    "starTreeIndexConfigs":[{
+      "maxLeafRecords": 100,
+      "functionColumnPairs": ["SUM__l_extendedprice", "SUM__l_discount", "SUM__l_quantity"],
+      "dimensionsSplitOrder": ["l_receiptdate", "l_shipdate", "l_shipmode", "l_returnflag"],
+      "skipStarNodeCreationForDimensions": [],
+      "skipMaterializationForDimensions": ["l_partkey", "l_commitdate", "l_linestatus", "l_comment", "l_orderkey", "l_shipinstruct", "l_linenumber", "l_suppkey"]
+    }]
+  },
+  "tableType":"OFFLINE",
+  "metadata": {}
+}
+```
+
+```schema.json```
+```
+{
+  "schemaName": "tpch_lineitem",
+  "dimensionFieldSpecs": [
+    {
+      "name": "l_orderkey",
+      "dataType": "INT"
+    },
+    {
+      "name": "l_partkey",
+      "dataType": "INT"
+    },
+    {
+      "name": "l_suppkey",
+      "dataType": "INT"
+    },
+    {
+      "name": "l_linenumber",
+      "dataType": "INT"
+    },
+    {
+      "name": "l_returnflag",
+      "dataType": "STRING"
+    },
+    {
+      "name": "l_linestatus",
+      "dataType": "STRING"
+    },
+    {
+      "name": "l_shipdate",
+      "dataType": "STRING"
+    },
+    {
+      "name": "l_commitdate",
+      "dataType": "STRING"
+    },
+    {
+      "name": "l_receiptdate",
+      "dataType": "STRING"
+    },
+    {
+      "name": "l_shipinstruct",
+      "dataType": "STRING"
+    },
+    {
+      "name": "l_shipmode",
+      "dataType": "STRING"
+    },
+    {
+      "name": "l_comment",
+      "dataType": "STRING"
+    }
+  ],
+  "metricFieldSpecs": [
+    {
+      "name": "l_quantity",
+      "dataType": "LONG"
+    },
+    {
+      "name": "l_extendedprice",
+      "dataType": "DOUBLE"
+    },
+    {
+      "name": "l_discount",
+      "dataType": "DOUBLE"
+    },
+    {
+      "name": "l_tax",
+      "dataType": "DOUBLE"
+    }
+  ]
+}
+```
+
+**Note:** The configuration specified above will give you
+segments with the **optimal star tree index**. The index configuration is
+specified in the ```tableIndexConfig``` section of the ```table_config.json``` file. If
+you want to generate a different type of indexed segment,
+modify the ```tableIndexConfig``` section to reflect the correct index
+type as described in the [Indexing Section](https://docs.pinot.apache.org/basics/features/indexing)
+of the Apache Pinot Documentation.
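+
+Before moving on, you can sanity-check that the table and schema were registered
+with the controller. Assuming the controller runs on ```localhost:9000``` (the
+same URIs the job spec below uses):
+
+```
+curl http://localhost:9000/tables/tpch_lineitem/schema
+curl http://localhost:9000/tables/tpch_lineitem
+```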
+
+### Create the Apache Pinot segments
+
+Next, we'll create the segments for this Apache Pinot table using the optimal
+star tree index configuration. 
+
+For this purpose you'll need a job specification YAML file. Here's an example
+that does the TPC-H data import:
+
+```job-spec.yml```
+```
+executionFrameworkSpec:
+  name: 'standalone'
+  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
+  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
+  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
+jobType: SegmentCreationAndTarPush
+inputDirURI: '/absolute/path/to/incubator-pinot/contrib/pinot-druid-benchmark/data_out/raw_data/'
+includeFileNamePattern: 'glob:**/*.csv'
+outputDirURI: '/absolute/path/to/incubator-pinot/contrib/pinot-druid-benchmark/data_out/segments/'
+overwriteOutput: true
+pinotFSSpecs:
+  - scheme: file
+    className: org.apache.pinot.spi.filesystem.LocalPinotFS
+recordReaderSpec:
+  dataFormat: 'csv'
+  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
+  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
+  configs:
+    delimiter: '|'
+    header: 'l_orderkey|l_partkey|l_suppkey|l_linenumber|l_quantity|l_extendedprice|l_discount|l_tax|l_returnflag|l_linestatus|l_shipdate|l_commitdate|l_receiptdate|l_shipinstruct|l_shipmode|l_comment|'
+tableSpec:
+  tableName: 'tpch_lineitem'
+  schemaURI: 'http://localhost:9000/tables/tpch_lineitem/schema'
+  tableConfigURI: 'http://localhost:9000/tables/tpch_lineitem'
+pinotClusterSpecs:
+  - controllerURI: 'http://localhost:9000'
+```
+
+**Note:** Make sure you modify the absolute paths for **inputDirURI** and **outputDirURI**
+above. The inputDirURI should point to the directory where you
+generated the seven yearly CSV files, 1992.csv through 1998.csv.
+
+After you have modified the input and output directories, run the job as described in the
+[Batch Import Example](https://docs.pinot.apache.org/basics/getting-started/pushing-your-data-to-pinot) document:
+
+```
+pinot-admin.sh LaunchDataIngestionJob \
+    -jobSpecFile /absolute/path/to/job-spec.yml
+```
+
+The segment creation output on the console will tell you where Apache Pinot will
+store the created segments (it should be your output directory). You should see a
+line like the following in the output:
+
+```
+...
+outputDirURI: /absolute/path/to/incubator-pinot/contrib/pinot-druid-benchmark/data_out/segments/
+...
+```
+
+Inside that directory you'll find the tpch_lineitem_OFFLINE directory with 7 separate
+segments, 0 through 6. Wait for segment creation to finish, then tar/gzip the
+whole directory; this will be the optimal_startree_small_yearly temp segment
+that the benchmark requires.
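+
+One way to package it, as a minimal sketch (the archive name is illustrative;
+place it wherever ```run_benchmark.sh``` expects the segment archive):
+
+```
+cd /absolute/path/to/data_out/segments
+tar -czf optimal_startree_small_yearly.tar.gz tpch_lineitem_OFFLINE
+```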
+
+Try a few queries to ensure that the segments are working. You can find some
+sample queries under the benchmark directory ```src/main/resources/pinot_queries```.
+Watch the console output from the Apache Pinot cluster as you run the queries, and make sure
+it doesn't complain that a query was slow because an index wasn't found.
+If you see such a message, it means that the indexes weren't
+created properly. With the optimal star tree index your total query time should be
+a few milliseconds at most.
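+
+For example, here is a TPC-H Q1-style aggregation that the star tree index above
+is built to serve (illustrative only; the actual benchmark queries live in the
+directory noted above):
+
+```
+SELECT l_returnflag,
+       SUM(l_quantity), SUM(l_extendedprice), SUM(l_discount)
+FROM tpch_lineitem
+WHERE l_shipdate <= '1998-09-02'
+GROUP BY l_returnflag
+```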
+
+You can now shut down the Apache Pinot cluster which you started manually; when you
+launch the benchmark server cluster, it will pick up your new segments.
+

