Posted to commits@pinot.apache.org by mc...@apache.org on 2019/11/04 22:33:12 UTC
[incubator-pinot] 11/13: Update Getting Started documentation. (#4615)

This is an automated email from the ASF dual-hosted git repository.

mcvsubbu pushed a commit to branch 0.2.0
in repository https://gitbox.apache.org/repos/asf/incubator-pinot.git

commit dd0c10d645a4bb5d118b64a9631818f6ee29aedf
Author: Dominique Adapon <da...@uci.edu>
AuthorDate: Fri Oct 4 10:49:58 2019 -0700

    Update Getting Started documentation. (#4615)
    
    * Update Getting Started documentation.
    Updated Getting Started documentation to include a
    CSV config file and a specific CSV file. Also updated
    minor grammar issues and version number.
    
    * Update Getting Started documentation
    
    Updated Getting Started documentation to include a
    specific CSV file and a CSV config file. Also updated
    minor grammar issues and created variables for
    version number and working directory, as well as
    shortened all commands by navigating to pinot-admin.sh.
    
    * Update Getting Started documentation
    
    Updated Getting Started documentation again with
    clearer instructions on where to store the data and config files.
    
    * Update Getting Started Documentation
    
    * Update Getting Started documentation.
    
    Cleaned up minor errors and clarified instructions.
---
 docs/getting_started.rst | 123 +++++++++++++++++++++++++++++++++++++----------
 1 file changed, 97 insertions(+), 26 deletions(-)
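
Reader's note: the working-directory setup that the patch below adds to the docs can be sketched as a single script. This is a minimal sketch, not part of the commit. It assumes a POSIX shell, and it defaults ``WORKING_DIR`` to a temporary directory (an assumption made so the sketch is safe to run anywhere), whereas the patch uses ``/Users/host1/Desktop/getting-started``. File names and contents are taken from the patch.

```shell
#!/bin/sh
# Sketch only: WORKING_DIR defaults to a temporary location here (an assumption);
# the patch itself uses /Users/host1/Desktop/getting-started.
set -eu

WORKING_DIR="${WORKING_DIR:-$(mktemp -d)/getting-started}"

# Create the working directory and its two subdirectories in one step.
mkdir -p "$WORKING_DIR/data" "$WORKING_DIR/config"

# Transcript data, with no header line at the top.
cat > "$WORKING_DIR/data/test.csv" <<'EOF'
200,Lucy,Smith,Female,Maths,3.8
200,Lucy,Smith,Female,English,3.5
201,Bob,King,Male,Maths,3.2
202,Nick,Young,Male,Physics,3.6
EOF

# CSV reader config that supplies the header instead.
cat > "$WORKING_DIR/config/csv-record-reader-config.json" <<'EOF'
{
  "header":"studentID,firstName,lastName,gender,subject,score",
  "fileFormat":"CSV"
}
EOF

# Show the resulting layout.
ls -R "$WORKING_DIR"
```

Using absolute paths under ``$WORKING_DIR`` means the commands behave the same regardless of the current directory, which matters once the walkthrough changes into the ``bin`` directory of the Pinot distribution.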

diff --git a/docs/getting_started.rst b/docs/getting_started.rst
index 4c3e5f6..577cf0e 100644
--- a/docs/getting_started.rst
+++ b/docs/getting_started.rst
@@ -41,7 +41,11 @@ Pinot requires JDK 8 or later and Apache Maven 3.
 
 #. Check out the code from GitHub (https://github.com/apache/incubator-pinot)
 #. With Maven installed, run ``mvn install package -DskipTests -Pbin-dist`` in the directory in which you checked out Pinot.
-#. Make the generated scripts executable ``cd pinot-distribution/target/apache-pinot-incubating-<version>-SNAPSHOT-bin; chmod +x bin/*.sh``
+#. Make the generated scripts executable:
+
+.. code-block:: none
+
+  cd pinot-distribution/target/apache-pinot-incubating-<version>-SNAPSHOT-bin/apache-pinot-incubating-<version>-SNAPSHOT-bin; chmod +x bin/*.sh
 
 Trying out Offline quickstart demo
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -97,10 +101,10 @@ last events that were ingested by Pinot.
 Experimenting with Pinot
 ~~~~~~~~~~~~~~~~~~~~~~~~
 
-Now we have a quick start Pinot cluster running locally. The below shows a step-by-step instruction on
-how to add a simple table to the Pinot system, how to upload segments, and how to query it.
+Now we have a quick start Pinot cluster running locally. Below are step-by-step instructions on
+how to add a simple table to the Pinot system, how to upload a segment, and how to query the segment.
 
-Suppose we have a transcript in CSV format containing students' basic info and their scores of each subject.
+Suppose we have a transcript in CSV format containing students' basic info and their scores for each subject.
 
 +------------+------------+-----------+-----------+-----------+-----------+
 | studentID  | firstName  | lastName  |   gender  |  subject  |   score   |
@@ -114,7 +118,53 @@ Suppose we have a transcript in CSV format containing students' basic info and t
 |     202    |     Nick   |   Young   |    Male   |  Physics  |    3.6    |
 +------------+------------+-----------+-----------+-----------+-----------+
 
-Firstly in order to set up a table, we need to specify the schema of this transcript.
+When we create a CSV file, we will also need a separate CSV config JSON file.
+
+First, however, we will create a working directory called ``getting-started`` (in this example, it is on ``Desktop``), and create two additional directories within ``getting-started`` called ``data``
+and ``config``.
+
+Note that we can create a variable for the working directory called ``WORKING_DIR``.
+
+.. code-block:: none
+
+  $ mkdir getting-started
+  $ WORKING_DIR=/Users/host1/Desktop/getting-started
+  $ cd $WORKING_DIR
+  $ mkdir data
+  $ mkdir config
+
+We will create the transcript CSV file in ``data``, and the CSV config file in ``config``.
+
+.. code-block:: none
+
+  $ touch $WORKING_DIR/data/test.csv
+  $ touch $WORKING_DIR/config/csv-record-reader-config.json
+
+The ``test.csv`` file should look like this, with no header line at the top:
+
+.. code-block:: none
+
+  200,Lucy,Smith,Female,Maths,3.8
+  200,Lucy,Smith,Female,English,3.5
+  201,Bob,King,Male,Maths,3.2
+  202,Nick,Young,Male,Physics,3.6
+
+Instead of using a header line, we will use the CSV config JSON file ``csv-record-reader-config.json`` to specify the header:
+
+.. code-block:: none
+
+  {
+    "header":"studentID,firstName,lastName,gender,subject,score",
+    "fileFormat":"CSV"
+  }
+
+In order to set up a table, we need to specify the schema of this transcript in ``transcript-schema.json``, which we will store in ``config``:
+
+.. code-block:: none
+
+  $ touch $WORKING_DIR/config/transcript-schema.json
+
+``transcript-schema.json`` should look like this:
 
 .. code-block:: none
 
@@ -150,15 +200,24 @@ Firstly in order to set up a table, we need to specify the schema of this transc
     ]
   }
 
-To upload the schema, we can use the command below:
+To upload the schema, we can navigate to the directory in ``pinot-distribution`` that contains
+``pinot-admin.sh``, and use the command below:
 
 .. code-block:: none
 
-  $ ./pinot-distribution/target/apache-pinot-incubating-0.1.0-SNAPSHOT-bin/apache-pinot-incubating-0.1.0-SNAPSHOT-bin/bin/pinot-admin.sh AddSchema -schemaFile /Users/host1/transcript-schema.json -exec
-  Executing command: AddSchema -controllerHost [controller_host] -controllerPort 9000 -schemaFilePath /Users/host1/transcript-schema.json -exec
-  Sending request: http://[controller_host]:9000/schemas to controller: [controller_host], version: 0.1.0-SNAPSHOT-2c5d42a908213122ab0ad8b7ac9524fcf390e4cb
+  $ VERSION=0.2.0
+  $ cd ./pinot-distribution/target/apache-pinot-incubating-$VERSION-SNAPSHOT-bin/apache-pinot-incubating-$VERSION-SNAPSHOT-bin/bin
+  $ ./pinot-admin.sh AddSchema -schemaFile $WORKING_DIR/config/transcript-schema.json -exec
+  Executing command: AddSchema -controllerHost [controller_host] -controllerPort 9000 -schemaFilePath /Users/host1/Desktop/getting-started/config/transcript-schema.json -exec
+  Sending request: http://[controller_host]:9000/schemas to controller: [controller_host], version: 0.2.0-SNAPSHOT-68092ab9eb83af173d725ec685c22ba4eb5bacf9
 
-Then, we need to specify the table config which links the schema to this table:
+Then, we need to specify the table config in another JSON file (also stored in ``config``), which links the schema to the table:
+
+.. code-block:: none
+
+  $ touch $WORKING_DIR/config/transcript-table-config.json
+
+``transcript-table-config.json`` should look like this:
 
 .. code-block:: none
 
@@ -186,17 +245,29 @@ And upload the table config to Pinot cluster:
 
 .. code-block:: none
 
-  $ ./pinot-distribution/target/apache-pinot-incubating-0.1.0-SNAPSHOT-bin/apache-pinot-incubating-0.1.0-SNAPSHOT-bin/bin/pinot-admin.sh AddTable -filePath /Users/host1/transcript-table-config.json -exec
-  Executing command: AddTable -filePath /Users/host1/transcript-table-config.json -controllerHost [controller_host] -controllerPort 9000 -exec
+  $ ./pinot-admin.sh AddTable -filePath $WORKING_DIR/config/transcript-table-config.json -exec
+  Executing command: AddTable -filePath /Users/host1/Desktop/getting-started/config/transcript-table-config.json -controllerHost [controller_host] -controllerPort 9000 -exec
   {"status":"Table transcript_OFFLINE successfully added"}
 
-In order to upload our data to Pinot cluster, we need to convert our CSV file to Pinot Segment:
+At this point, the directory tree for our ``getting-started`` working directory should look like this:
+
+.. code-block:: none
+
+  |-- getting-started
+      |-- data
+             |-- test.csv
+      |-- config
+             |-- csv-record-reader-config.json
+             |-- transcript-schema.json
+             |-- transcript-table-config.json
+
+In order to upload our data to the Pinot cluster, we need to convert our CSV file into a Pinot Segment, which will be written to a new directory, ``$WORKING_DIR/test2``:
 
 .. code-block:: none
 
-  $ ./pinot-distribution/target/apache-pinot-incubating-0.1.0-SNAPSHOT-bin/apache-pinot-incubating-0.1.0-SNAPSHOT-bin/bin/pinot-admin.sh CreateSegment -dataDir /Users/host1/Desktop/test/ -format CSV -outDir /Users/host1/Desktop/test2/ -tableName transcript -segmentName transcript_0 -overwrite -schemaFile /Users/host1/transcript-schema.json
-  Executing command: CreateSegment  -generatorConfigFile null -dataDir /Users/host1/Desktop/test/ -format CSV -outDir /Users/host1/Desktop/test2/ -overwrite true -tableName transcript -segmentName transcript_0 -timeColumnName null -schemaFile /Users/host1/transcript-schema.json -readerConfigFile null -enableStarTreeIndex false -starTreeIndexSpecFile null -hllSize 9 -hllColumns null -hllSuffix _hll -numThreads 1
-  Accepted files: [/Users/host1/Desktop/test/Transcript.csv]
+  $ ./pinot-admin.sh CreateSegment -dataDir $WORKING_DIR/data -format CSV -outDir $WORKING_DIR/test2 -tableName transcript -segmentName transcript_0 -overwrite -schemaFile $WORKING_DIR/config/transcript-schema.json -readerConfigFile $WORKING_DIR/config/csv-record-reader-config.json
+  Executing command: CreateSegment  -generatorConfigFile null -dataDir /Users/host1/Desktop/getting-started/data -format CSV -outDir /Users/host1/Desktop/getting-started/test2 -overwrite true -tableName transcript -segmentName transcript_0 -timeColumnName null -schemaFile /Users/host1/Desktop/getting-started/config/transcript-schema.json -readerConfigFile /Users/host1/Desktop/getting-started/config/csv-record-reader-config.json -enableStarTreeIndex false -starTreeIndexSpecFile null -hllS [...]
+  Accepted files: [file:/Users/host1/Desktop/getting-started/data/test.csv]
   Finished building StatsCollector!
   Collected stats for 4 documents
   Created dictionary for STRING column: studentID with cardinality: 1, max length in bytes: 4, range: null to null
@@ -208,30 +279,30 @@ In order to upload our data to Pinot cluster, we need to convert our CSV file to
   Start building IndexCreator!
   Finished records indexing in IndexCreator!
   Finished segment seal!
-  Converting segment: /Users/host1/Desktop/test2/transcript_0_0 to v3 format
-  v3 segment location for segment: transcript_0_0 is /Users/host1/Desktop/test2/transcript_0_0/v3
-  Deleting files in v1 segment directory: /Users/host1/Desktop/test2/transcript_0_0
+  Converting segment: /Users/host1/Desktop/getting-started/test2/transcript_0_0 to v3 format
+  v3 segment location for segment: transcript_0_0 is /Users/host1/Desktop/getting-started/test2/transcript_0_0/v3
+  Deleting files in v1 segment directory: /Users/host1/Desktop/getting-started/test2/transcript_0_0
   Driver, record read time : 1
   Driver, stats collector time : 0
   Driver, indexing time : 0
 
-Once we have the Pinot segment, we can upload this segment to our cluster:
+Once we have the Pinot Segment, we can upload it to our cluster:
 
 .. code-block:: none
 
-  $ ./pinot-distribution/target/apache-pinot-incubating-0.1.0-SNAPSHOT-bin/apache-pinot-incubating-0.1.0-SNAPSHOT-bin/bin/pinot-admin.sh UploadSegment -segmentDir /Users/host1/Desktop/test2/
+  $ ./pinot-admin.sh UploadSegment -segmentDir $WORKING_DIR/test2/
   Executing command: UploadSegment -controllerHost [controller_host] -controllerPort 9000 -segmentDir /Users/host1/Desktop/test2/
   Compressing segment transcript_0_0
   Uploading segment transcript_0_0.tar.gz
-  Sending request: http://[controller_host]:9000/v2/segments to controller: [controller_host], version: 0.1.0-SNAPSHOT-2c5d42a908213122ab0ad8b7ac9524fcf390e4cb
+  Sending request: http://[controller_host]:9000/v2/segments to controller: [controller_host], version: 0.2.0-SNAPSHOT-68092ab9eb83af173d725ec685c22ba4eb5bacf9
 
-You made it! Now we can query the data in Pinot:
+You did it! Now we can query the data in Pinot.
 
 To get all the number of rows in the table:
 
 .. code-block:: none
 
-  $ ./pinot-distribution/target/apache-pinot-incubating-0.1.0-SNAPSHOT-bin/apache-pinot-incubating-0.1.0-SNAPSHOT-bin/bin/pinot-admin.sh PostQuery -brokerPort 8000 -query "select count(*) from transcript"
+  $ ./pinot-admin.sh PostQuery -brokerPort 8000 -query "select count(*) from transcript"
   Executing command: PostQuery -brokerHost [controller_host] -brokerPort 8000 -query select count(*) from transcript
   Result: {"aggregationResults":[{"function":"count_star","value":"4"}],"exceptions":[],"numServersQueried":1,"numServersResponded":1,"numSegmentsQueried":1,"numSegmentsProcessed":1,"numSegmentsMatched":1,"numDocsScanned":4,"numEntriesScannedInFilter":0,"numEntriesScannedPostFilter":0,"numGroupsLimitReached":false,"totalDocs":4,"timeUsedMs":7,"segmentStatistics":[],"traceInfo":{}}
 
@@ -239,7 +310,7 @@ To get the average score of subject Maths:
 
 .. code-block:: none
 
-  $ ./pinot-distribution/target/apache-pinot-incubating-0.1.0-SNAPSHOT-bin/apache-pinot-incubating-0.1.0-SNAPSHOT-bin/bin/pinot-admin.sh PostQuery -brokerPort 8000 -query "select avg(score) from transcript where subject = \"Maths\""
+  $ ./pinot-admin.sh PostQuery -brokerPort 8000 -query "select avg(score) from transcript where subject = \"Maths\""
   Executing command: PostQuery -brokerHost [controller_host] -brokerPort 8000 -query select avg(score) from transcript where subject = "Maths"
   Result: {"aggregationResults":[{"function":"avg_score","value":"3.50000"}],"exceptions":[],"numServersQueried":1,"numServersResponded":1,"numSegmentsQueried":1,"numSegmentsProcessed":1,"numSegmentsMatched":1,"numDocsScanned":2,"numEntriesScannedInFilter":4,"numEntriesScannedPostFilter":2,"numGroupsLimitReached":false,"totalDocs":4,"timeUsedMs":33,"segmentStatistics":[],"traceInfo":{}}
 
@@ -247,6 +318,6 @@ To get the average score for Lucy Smith:
 
 .. code-block:: none
 
-  $ ./pinot-distribution/target/apache-pinot-incubating-0.1.0-SNAPSHOT-bin/apache-pinot-incubating-0.1.0-SNAPSHOT-bin/bin/pinot-admin.sh PostQuery -brokerPort 8000 -query "select avg(score) from transcript where firstName = \"Lucy\" and lastName = \"Smith\""
+  $ ./pinot-admin.sh PostQuery -brokerPort 8000 -query "select avg(score) from transcript where firstName = \"Lucy\" and lastName = \"Smith\""
   Executing command: PostQuery -brokerHost [controller_host] -brokerPort 8000 -query select avg(score) from transcript where firstName = "Lucy" and lastName = "Smith"
   Result: {"aggregationResults":[{"function":"avg_score","value":"3.65000"}],"exceptions":[],"numServersQueried":1,"numServersResponded":1,"numSegmentsQueried":1,"numSegmentsProcessed":1,"numSegmentsMatched":1,"numDocsScanned":2,"numEntriesScannedInFilter":6,"numEntriesScannedPostFilter":2,"numGroupsLimitReached":false,"totalDocs":4,"timeUsedMs":67,"segmentStatistics":[],"traceInfo":{}}
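
Reader's note: the three query results shown in the patch can be cross-checked locally against the four rows of ``test.csv``. The sketch below is not part of the commit and does not talk to Pinot. It recomputes the row count and the two averages with ``awk``, assuming a POSIX shell; the scratch file and variable names are hypothetical.

```shell
#!/bin/sh
# Sketch only: recompute the walkthrough's aggregates locally; Pinot is not involved.
set -eu

CSV_FILE="$(mktemp)"   # hypothetical scratch file holding the sample rows
cat > "$CSV_FILE" <<'EOF'
200,Lucy,Smith,Female,Maths,3.8
200,Lucy,Smith,Female,English,3.5
201,Bob,King,Male,Maths,3.2
202,Nick,Young,Male,Physics,3.6
EOF

# Columns: 1 studentID, 2 firstName, 3 lastName, 4 gender, 5 subject, 6 score
ROWS="$(awk 'END { print NR }' "$CSV_FILE")"
MATHS_AVG="$(awk -F, '$5 == "Maths" { sum += $6; n++ } END { printf "%.5f", sum / n }' "$CSV_FILE")"
LUCY_AVG="$(awk -F, '$2 == "Lucy" && $3 == "Smith" { sum += $6; n++ } END { printf "%.5f", sum / n }' "$CSV_FILE")"

echo "rows=$ROWS maths_avg=$MATHS_AVG lucy_avg=$LUCY_AVG"
rm -f "$CSV_FILE"
```

The values agree with the ``count_star`` value of 4 and the ``avg_score`` values of 3.50000 and 3.65000 in the query output above: (3.8 + 3.2) / 2 = 3.5 for Maths, and (3.8 + 3.5) / 2 = 3.65 for Lucy Smith.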

