Posted to commits@pinot.apache.org by mc...@apache.org on 2019/11/04 22:33:12 UTC
[incubator-pinot] 11/13: Update Getting Started documentation. (#4615)
This is an automated email from the ASF dual-hosted git repository.
mcvsubbu pushed a commit to branch 0.2.0
in repository https://gitbox.apache.org/repos/asf/incubator-pinot.git
commit dd0c10d645a4bb5d118b64a9631818f6ee29aedf
Author: Dominique Adapon <da...@uci.edu>
AuthorDate: Fri Oct 4 10:49:58 2019 -0700
Update Getting Started documentation. (#4615)
* Update Getting Started documentation.
Updated Getting Started documentation to include a
CSV config file and a specific CSV file. Also updated
minor grammar issues and version number.
* Update Getting Started documentation
Updated Getting Started documentation to include a
specific CSV file and a CSV config file. Also updated
minor grammar issues and created variables for
version number and working directory, as well as
shortened all commands by navigating to pinot-admin.sh.
* Update Getting Started documentation
Updated Getting Started documentation again with
clearer instructions on where to store the data and config files.
* Update Getting Started Documentation
* Update Getting Started documentation.
Cleaned up minor errors and clarified instructions.
---
docs/getting_started.rst | 123 +++++++++++++++++++++++++++++++++++++----------
1 file changed, 97 insertions(+), 26 deletions(-)
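
The updated guide in the diff below has the reader create the ``data`` and ``config`` directories and the two CSV-related files by hand. As a rough illustration only, the same layout could be produced with a short script. The function name ``create_layout`` and the use of a temporary directory (standing in for ``/Users/host1/Desktop/getting-started``) are assumptions made for this sketch and are not part of Pinot or of this commit; the file names and file contents are copied from the diff.

```python
import json
import tempfile
from pathlib import Path

# Rows and header copied from the documentation in the diff; the CSV file
# itself carries no header line.
CSV_ROWS = (
    "200,Lucy,Smith,Female,Maths,3.8\n"
    "200,Lucy,Smith,Female,English,3.5\n"
    "201,Bob,King,Male,Maths,3.2\n"
    "202,Nick,Young,Male,Physics,3.6\n"
)
READER_CONFIG = {
    "header": "studentID,firstName,lastName,gender,subject,score",
    "fileFormat": "CSV",
}


def create_layout(working_dir):
    """Create data/ and config/ under working_dir and write the two files.

    Hypothetical helper for illustration; returns the relative paths of the
    files that now exist under working_dir, sorted alphabetically.
    """
    working_dir = Path(working_dir)
    data_dir = working_dir / "data"
    config_dir = working_dir / "config"
    data_dir.mkdir(parents=True, exist_ok=True)
    config_dir.mkdir(parents=True, exist_ok=True)

    (data_dir / "test.csv").write_text(CSV_ROWS)
    (config_dir / "csv-record-reader-config.json").write_text(
        json.dumps(READER_CONFIG, indent=2) + "\n"
    )
    return sorted(
        p.relative_to(working_dir).as_posix()
        for p in working_dir.rglob("*")
        if p.is_file()
    )


if __name__ == "__main__":
    # A temporary directory is used so the sketch leaves nothing behind.
    with tempfile.TemporaryDirectory() as tmp:
        print(create_layout(Path(tmp) / "getting-started"))
```

The schema and table config files are left out here because their full contents are not shown in this diff.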
diff --git a/docs/getting_started.rst b/docs/getting_started.rst
index 4c3e5f6..577cf0e 100644
--- a/docs/getting_started.rst
+++ b/docs/getting_started.rst
@@ -41,7 +41,11 @@ Pinot requires JDK 8 or later and Apache Maven 3.
#. Check out the code from GitHub (https://github.com/apache/incubator-pinot)
#. With Maven installed, run ``mvn install package -DskipTests -Pbin-dist`` in the directory in which you checked out Pinot.
-#. Make the generated scripts executable ``cd pinot-distribution/target/apache-pinot-incubating-<version>-SNAPSHOT-bin; chmod +x bin/*.sh``
+#. Make the generated scripts executable:
+
+.. code-block:: none
+
+ cd pinot-distribution/target/apache-pinot-incubating-<version>-SNAPSHOT-bin/apache-pinot-incubating-<version>-SNAPSHOT-bin; chmod +x bin/*.sh
Trying out Offline quickstart demo
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -97,10 +101,10 @@ last events that were ingested by Pinot.
Experimenting with Pinot
~~~~~~~~~~~~~~~~~~~~~~~~
-Now we have a quick start Pinot cluster running locally. The below shows a step-by-step instruction on
-how to add a simple table to the Pinot system, how to upload segments, and how to query it.
+Now we have a quick start Pinot cluster running locally. Below are step-by-step instructions on
+how to add a simple table to the Pinot system, how to upload a segment, and how to query the segment.
-Suppose we have a transcript in CSV format containing students' basic info and their scores of each subject.
+Suppose we have a transcript in CSV format containing students' basic info and their scores for each subject.
+------------+------------+-----------+-----------+-----------+-----------+
| studentID | firstName | lastName | gender | subject | score |
@@ -114,7 +118,53 @@ Suppose we have a transcript in CSV format containing students' basic info and t
| 202 | Nick | Young | Male | Physics | 3.6 |
+------------+------------+-----------+-----------+-----------+-----------+
-Firstly in order to set up a table, we need to specify the schema of this transcript.
+Because our CSV file will have no header line, we will also need a separate JSON config file that specifies the column names.
+
+First, however, we will create a working directory called ``getting-started`` (in this example, it is on ``Desktop``), and create two additional directories within ``getting-started`` called ``data``
+and ``config``.
+
+Note that we can create a variable for the working directory called ``WORKING_DIR``.
+
+.. code-block:: none
+
+ $ mkdir getting-started
+ $ WORKING_DIR=/Users/host1/Desktop/getting-started
+ $ cd $WORKING_DIR
+ $ mkdir $WORKING_DIR/data
+ $ mkdir $WORKING_DIR/config
+
+We will create the transcript CSV file in ``data``, and the CSV config file in ``config``.
+
+.. code-block:: none
+
+ $ touch $WORKING_DIR/data/test.csv
+ $ touch $WORKING_DIR/config/csv-record-reader-config.json
+
+The ``test.csv`` file should look like this, with no header line at the top:
+
+.. code-block:: none
+
+ 200,Lucy,Smith,Female,Maths,3.8
+ 200,Lucy,Smith,Female,English,3.5
+ 201,Bob,King,Male,Maths,3.2
+ 202,Nick,Young,Male,Physics,3.6
+
+Instead of using a header line, we will use the CSV config JSON file ``csv-record-reader-config.json`` to specify the header:
+
+.. code-block:: none
+
+ {
+ "header":"studentID,firstName,lastName,gender,subject,score",
+ "fileFormat":"CSV"
+ }
+
+In order to set up a table, we need to specify the schema of this transcript in ``transcript-schema.json``, which we will store in ``config``:
+
+.. code-block:: none
+
+ $ touch $WORKING_DIR/config/transcript-schema.json
+
+``transcript-schema.json`` should look like this:
.. code-block:: none
@@ -150,15 +200,24 @@ Firstly in order to set up a table, we need to specify the schema of this transc
]
}
-To upload the schema, we can use the command below:
+To upload the schema, we can navigate to the directory in ``pinot-distribution`` that contains
+``pinot-admin.sh``, and use the command below:
.. code-block:: none
- $ ./pinot-distribution/target/apache-pinot-incubating-0.1.0-SNAPSHOT-bin/apache-pinot-incubating-0.1.0-SNAPSHOT-bin/bin/pinot-admin.sh AddSchema -schemaFile /Users/host1/transcript-schema.json -exec
- Executing command: AddSchema -controllerHost [controller_host] -controllerPort 9000 -schemaFilePath /Users/host1/transcript-schema.json -exec
- Sending request: http://[controller_host]:9000/schemas to controller: [controller_host], version: 0.1.0-SNAPSHOT-2c5d42a908213122ab0ad8b7ac9524fcf390e4cb
+ $ VERSION=0.2.0
+ $ cd ./pinot-distribution/target/apache-pinot-incubating-$VERSION-SNAPSHOT-bin/apache-pinot-incubating-$VERSION-SNAPSHOT-bin/bin
+ $ ./pinot-admin.sh AddSchema -schemaFile $WORKING_DIR/config/transcript-schema.json -exec
+ Executing command: AddSchema -controllerHost [controller_host] -controllerPort 9000 -schemaFilePath /Users/host1/Desktop/getting-started/config/transcript-schema.json -exec
+ Sending request: http://[controller_host]:9000/schemas to controller: [controller_host], version: 0.2.0-SNAPSHOT-68092ab9eb83af173d725ec685c22ba4eb5bacf9
-Then, we need to specify the table config which links the schema to this table:
+Then, we need to specify the table config in another JSON file (also stored in ``config``), which links the schema to the table:
+
+.. code-block:: none
+
+ $ touch $WORKING_DIR/config/transcript-table-config.json
+
+``transcript-table-config.json`` should look like this:
.. code-block:: none
@@ -186,17 +245,29 @@ And upload the table config to Pinot cluster:
.. code-block:: none
- $ ./pinot-distribution/target/apache-pinot-incubating-0.1.0-SNAPSHOT-bin/apache-pinot-incubating-0.1.0-SNAPSHOT-bin/bin/pinot-admin.sh AddTable -filePath /Users/host1/transcript-table-config.json -exec
- Executing command: AddTable -filePath /Users/host1/transcript-table-config.json -controllerHost [controller_host] -controllerPort 9000 -exec
+ $ ./pinot-admin.sh AddTable -filePath $WORKING_DIR/config/transcript-table-config.json -exec
+ Executing command: AddTable -filePath /Users/host1/Desktop/getting-started/config/transcript-table-config.json -controllerHost [controller_host] -controllerPort 9000 -exec
{"status":"Table transcript_OFFLINE successfully added"}
-In order to upload our data to Pinot cluster, we need to convert our CSV file to Pinot Segment:
+At this point, the directory tree for our ``getting-started`` should look like this:
+
+.. code-block:: none
+
+ |-- getting-started
+ |-- data
+ |-- test.csv
+ |-- config
+ |-- csv-record-reader-config.json
+ |-- transcript-schema.json
+ |-- transcript-table-config.json
+
+To upload our data to the Pinot cluster, we need to convert our CSV file into a Pinot Segment, which will be written to a new directory, ``$WORKING_DIR/test2``:
.. code-block:: none
- $ ./pinot-distribution/target/apache-pinot-incubating-0.1.0-SNAPSHOT-bin/apache-pinot-incubating-0.1.0-SNAPSHOT-bin/bin/pinot-admin.sh CreateSegment -dataDir /Users/host1/Desktop/test/ -format CSV -outDir /Users/host1/Desktop/test2/ -tableName transcript -segmentName transcript_0 -overwrite -schemaFile /Users/host1/transcript-schema.json
- Executing command: CreateSegment -generatorConfigFile null -dataDir /Users/host1/Desktop/test/ -format CSV -outDir /Users/host1/Desktop/test2/ -overwrite true -tableName transcript -segmentName transcript_0 -timeColumnName null -schemaFile /Users/host1/transcript-schema.json -readerConfigFile null -enableStarTreeIndex false -starTreeIndexSpecFile null -hllSize 9 -hllColumns null -hllSuffix _hll -numThreads 1
- Accepted files: [/Users/host1/Desktop/test/Transcript.csv]
+ $ ./pinot-admin.sh CreateSegment -dataDir $WORKING_DIR/data -format CSV -outDir $WORKING_DIR/test2 -tableName transcript -segmentName transcript_0 -overwrite -schemaFile $WORKING_DIR/config/transcript-schema.json -readerConfigFile $WORKING_DIR/config/csv-record-reader-config.json
+ Executing command: CreateSegment -generatorConfigFile null -dataDir /Users/host1/Desktop/getting-started/data -format CSV -outDir /Users/host1/Desktop/getting-started/test2 -overwrite true -tableName transcript -segmentName transcript_0 -timeColumnName null -schemaFile /Users/host1/Desktop/getting-started/config/transcript-schema.json -readerConfigFile /Users/host1/Desktop/getting-started/config/csv-record-reader-config.json -enableStarTreeIndex false -starTreeIndexSpecFile null -hllS [...]
+ Accepted files: [file:/Users/host1/Desktop/getting-started/data/test.csv]
Finished building StatsCollector!
Collected stats for 4 documents
Created dictionary for STRING column: studentID with cardinality: 1, max length in bytes: 4, range: null to null
@@ -208,30 +279,30 @@ In order to upload our data to Pinot cluster, we need to convert our CSV file to
Start building IndexCreator!
Finished records indexing in IndexCreator!
Finished segment seal!
- Converting segment: /Users/host1/Desktop/test2/transcript_0_0 to v3 format
- v3 segment location for segment: transcript_0_0 is /Users/host1/Desktop/test2/transcript_0_0/v3
- Deleting files in v1 segment directory: /Users/host1/Desktop/test2/transcript_0_0
+ Converting segment: /Users/host1/Desktop/getting-started/test2/transcript_0_0 to v3 format
+ v3 segment location for segment: transcript_0_0 is /Users/host1/Desktop/getting-started/test2/transcript_0_0/v3
+ Deleting files in v1 segment directory: /Users/host1/Desktop/getting-started/test2/transcript_0_0
Driver, record read time : 1
Driver, stats collector time : 0
Driver, indexing time : 0
-Once we have the Pinot segment, we can upload this segment to our cluster:
+Once we have the Pinot Segment, we can upload it to our cluster:
.. code-block:: none
- $ ./pinot-distribution/target/apache-pinot-incubating-0.1.0-SNAPSHOT-bin/apache-pinot-incubating-0.1.0-SNAPSHOT-bin/bin/pinot-admin.sh UploadSegment -segmentDir /Users/host1/Desktop/test2/
+ $ ./pinot-admin.sh UploadSegment -segmentDir $WORKING_DIR/test2/
Executing command: UploadSegment -controllerHost [controller_host] -controllerPort 9000 -segmentDir /Users/host1/Desktop/test2/
Compressing segment transcript_0_0
Uploading segment transcript_0_0.tar.gz
- Sending request: http://[controller_host]:9000/v2/segments to controller: [controller_host], version: 0.1.0-SNAPSHOT-2c5d42a908213122ab0ad8b7ac9524fcf390e4cb
+ Sending request: http://[controller_host]:9000/v2/segments to controller: [controller_host], version: 0.2.0-SNAPSHOT-68092ab9eb83af173d725ec685c22ba4eb5bacf9
-You made it! Now we can query the data in Pinot:
+You did it! Now we can query the data in Pinot.
To get all the number of rows in the table:
.. code-block:: none
- $ ./pinot-distribution/target/apache-pinot-incubating-0.1.0-SNAPSHOT-bin/apache-pinot-incubating-0.1.0-SNAPSHOT-bin/bin/pinot-admin.sh PostQuery -brokerPort 8000 -query "select count(*) from transcript"
+ $ ./pinot-admin.sh PostQuery -brokerPort 8000 -query "select count(*) from transcript"
Executing command: PostQuery -brokerHost [controller_host] -brokerPort 8000 -query select count(*) from transcript
Result: {"aggregationResults":[{"function":"count_star","value":"4"}],"exceptions":[],"numServersQueried":1,"numServersResponded":1,"numSegmentsQueried":1,"numSegmentsProcessed":1,"numSegmentsMatched":1,"numDocsScanned":4,"numEntriesScannedInFilter":0,"numEntriesScannedPostFilter":0,"numGroupsLimitReached":false,"totalDocs":4,"timeUsedMs":7,"segmentStatistics":[],"traceInfo":{}}
@@ -239,7 +310,7 @@ To get the average score of subject Maths:
.. code-block:: none
- $ ./pinot-distribution/target/apache-pinot-incubating-0.1.0-SNAPSHOT-bin/apache-pinot-incubating-0.1.0-SNAPSHOT-bin/bin/pinot-admin.sh PostQuery -brokerPort 8000 -query "select avg(score) from transcript where subject = \"Maths\""
+ $ ./pinot-admin.sh PostQuery -brokerPort 8000 -query "select avg(score) from transcript where subject = \"Maths\""
Executing command: PostQuery -brokerHost [controller_host] -brokerPort 8000 -query select avg(score) from transcript where subject = "Maths"
Result: {"aggregationResults":[{"function":"avg_score","value":"3.50000"}],"exceptions":[],"numServersQueried":1,"numServersResponded":1,"numSegmentsQueried":1,"numSegmentsProcessed":1,"numSegmentsMatched":1,"numDocsScanned":2,"numEntriesScannedInFilter":4,"numEntriesScannedPostFilter":2,"numGroupsLimitReached":false,"totalDocs":4,"timeUsedMs":33,"segmentStatistics":[],"traceInfo":{}}
@@ -247,6 +318,6 @@ To get the average score for Lucy Smith:
.. code-block:: none
- $ ./pinot-distribution/target/apache-pinot-incubating-0.1.0-SNAPSHOT-bin/apache-pinot-incubating-0.1.0-SNAPSHOT-bin/bin/pinot-admin.sh PostQuery -brokerPort 8000 -query "select avg(score) from transcript where firstName = \"Lucy\" and lastName = \"Smith\""
+ $ ./pinot-admin.sh PostQuery -brokerPort 8000 -query "select avg(score) from transcript where firstName = \"Lucy\" and lastName = \"Smith\""
Executing command: PostQuery -brokerHost [controller_host] -brokerPort 8000 -query select avg(score) from transcript where firstName = "Lucy" and lastName = "Smith"
Result: {"aggregationResults":[{"function":"avg_score","value":"3.65000"}],"exceptions":[],"numServersQueried":1,"numServersResponded":1,"numSegmentsQueried":1,"numSegmentsProcessed":1,"numSegmentsMatched":1,"numDocsScanned":2,"numEntriesScannedInFilter":6,"numEntriesScannedPostFilter":2,"numGroupsLimitReached":false,"totalDocs":4,"timeUsedMs":67,"segmentStatistics":[],"traceInfo":{}}
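
As a sanity check on the three query results shown in the diff, the same aggregations can be recomputed directly from the four CSV rows. This is only a sketch in plain Python and does not use or emulate Pinot; the helper names ``load_rows`` and ``average_score`` are made up for illustration, while the rows and header are copied from the documentation above.

```python
import csv
import io

# Rows copied from the test.csv example in the diff (no header line).
CSV_TEXT = """\
200,Lucy,Smith,Female,Maths,3.8
200,Lucy,Smith,Female,English,3.5
201,Bob,King,Male,Maths,3.2
202,Nick,Young,Male,Physics,3.6
"""

# Column names copied from the "header" field of csv-record-reader-config.json.
HEADER = "studentID,firstName,lastName,gender,subject,score".split(",")


def load_rows(text=CSV_TEXT):
    """Parse the headerless CSV text into dicts, converting score to float."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=HEADER)
    return [dict(row, score=float(row["score"])) for row in reader]


def average_score(rows, **filters):
    """Average the score of rows whose columns equal every given filter value.

    Raises ZeroDivisionError if no row matches.
    """
    matching = [
        r["score"] for r in rows if all(r[k] == v for k, v in filters.items())
    ]
    return sum(matching) / len(matching)


rows = load_rows()
# select count(*) from transcript
print(len(rows))
# select avg(score) from transcript where subject = "Maths"
print(f"{average_score(rows, subject='Maths'):.5f}")
# select avg(score) from transcript where firstName = "Lucy" and lastName = "Smith"
print(f"{average_score(rows, firstName='Lucy', lastName='Smith'):.5f}")
```

The printed values (4, 3.50000, 3.65000) agree with the ``count_star`` and ``avg_score`` values in the ``PostQuery`` results above.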