Posted to commits@pinot.apache.org by GitBox <gi...@apache.org> on 2019/04/08 18:13:38 UTC

[GitHub] [incubator-pinot] Jackie-Jiang commented on a change in pull request #4086: Update doc for "Creating Pinot segments" to reflect the current code base

Jackie-Jiang commented on a change in pull request #4086: Update doc for "Creating Pinot segments" to reflect the current code base
URL: https://github.com/apache/incubator-pinot/pull/4086#discussion_r273175633
 
 

 ##########
 File path: docs/pinot_hadoop.rst
 ##########
 @@ -48,151 +48,132 @@ Create a job properties configuration file, such as one below:
 
 .. code-block:: none
 
-  # === Index segment creation job config ===
+   # === Index segment creation job config ===
 
-  # path.to.input: Input directory containing Avro files
-  path.to.input=/user/pinot/input/data
+   # path.to.input: Input directory containing Avro files
+   path.to.input=/user/pinot/input/data
 
-  # path.to.output: Output directory containing Pinot segments
-  path.to.output=/user/pinot/output
+   # path.to.output: Output directory containing Pinot segments
+   path.to.output=/user/pinot/output
 
-  # path.to.schema: Schema file for the table, stored locally
-  path.to.schema=flights-schema.json
+   # path.to.schema: Schema file for the table, stored locally
+   path.to.schema=flights-schema.json
 
-  # segment.table.name: Name of the table for which to generate segments
-  segment.table.name=flights
+   # segment.table.name: Name of the table for which to generate segments
+   segment.table.name=flights
 
-  # === Segment tar push job config ===
+   # === Segment tar push job config ===
 
-  # push.to.hosts: Comma separated list of controllers host names to which to push
-  push.to.hosts=controller_host_0,controller_host_1
-
-  # push.to.port: The port on which the controller runs
-  push.to.port=8888
+   # push.to.hosts: Comma-separated list of controller host names to push segments to
+   push.to.hosts=controller_host_0,controller_host_1
 
+   # push.to.port: The port on which the controller runs
+   push.to.port=8888
 
 Executing the job
 ^^^^^^^^^^^^^^^^^
 
 The Pinot Hadoop module contains a job that you can incorporate into your
 workflow to generate Pinot segments.
 
-.. code-block:: none
+.. code-block:: bash
 
-  mvn clean install -DskipTests -Pbuild-shaded-jar
-  hadoop jar pinot-hadoop-<version>-SNAPSHOT-shaded.jar SegmentCreation job.properties
+   mvn clean install -DskipTests -Pbuild-shaded-jar
+   hadoop jar pinot-hadoop-<version>-SNAPSHOT-shaded.jar SegmentCreation job.properties
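
+If you prefer not to hard-code the version, a small wrapper along these lines works; the ``pinot-hadoop/target`` location is an assumption about where the shaded jar ends up, so adjust it to your build layout:

+.. code-block:: bash

+   # Pick up the shaded jar built by the mvn command above
+   # (assumes it lands under pinot-hadoop/target; adjust if your build differs)
+   JAR=$(ls pinot-hadoop/target/pinot-hadoop-*-shaded.jar | head -n 1)
+   hadoop jar "$JAR" SegmentCreation job.properties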
 
 You can then use the SegmentTarPush job to push segments via the controller REST API.
 
-.. code-block:: none
-
-  hadoop jar pinot-hadoop-<version>-SNAPSHOT-shaded.jar SegmentTarPush job.properties
+.. code-block:: bash
 
+   hadoop jar pinot-hadoop-<version>-SNAPSHOT-shaded.jar SegmentTarPush job.properties
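
+Once the push job completes, one way to sanity-check the result is to ask the controller which segments it now knows about. The host, port and table name below are illustrative and assume the configuration shown earlier:

+.. code-block:: bash

+   # List the segments the controller tracks for the "flights" table
+   curl http://controller_host_0:8888/tables/flights/segments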
 
 Creating Pinot segments outside of Hadoop
 -----------------------------------------
 
 Here is how you can create Pinot segments from standard formats like CSV/JSON.
 
#. Follow the steps described in the section on :ref:`compiling-code-section` to build Pinot. Locate ``pinot-admin.sh`` under ``pinot-tools/target/pinot-tools-pkg/bin/``.
-#.  Create a top level directory containing all the CSV/JSON files that need to be converted into segments.
-#.  The file name extensions are expected to be the same as the format name (*i.e* ``.csv``, or ``.json``), and are case insensitive.
-    Note that the converter expects the ``.csv`` extension even if the data is delimited using tabs or spaces instead.
-#.  Prepare a schema file describing the schema of the input data. The schema needs to be in JSON format. See example later in this section.
-#.  Specifically for CSV format, an optional csv config file can be provided (also in JSON format). This is used to configure parameters like the delimiter/header for the CSV file etc.
-        A detailed description of this follows below.
+#. Create a top level directory containing all the CSV/JSON files that need to be converted into segments (a sketch of a minimal layout follows this list).
+#. The file name extensions are expected to match the format name (*i.e.* ``.csv`` or ``.json``), and are case insensitive. Note that the converter expects the ``.csv`` extension even if the data is delimited using tabs or spaces instead.
+#. Prepare a schema file describing the schema of the input data. The schema needs to be in JSON format. See the example later in this section.
+#. Specifically for the CSV format, an optional CSV config file can be provided (also in JSON format). This is used to configure parameters such as the delimiter and header for the CSV file. A detailed description of this follows below.
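
+As a concrete illustration of steps 2 and 3, the sketch below lays out a hypothetical input directory with a tab-delimited file; all paths and values here are made up for the example:

+.. code-block:: bash

+   # Step 2: a top-level directory holding all input files (hypothetical path)
+   mkdir -p /tmp/pinot/input

+   # Step 3: tab- or space-delimited data still uses the .csv extension
+   printf 'name\tage\tpercent\n'  > /tmp/pinot/input/flights.csv
+   printf 'john\t28\t0.5\n'      >> /tmp/pinot/input/flights.csv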
 
Run the pinot-admin command to generate the segments. The command can be invoked as follows; options within ``[ ]`` are optional, and for ``-format`` the default value is AVRO.
 
-.. code-block:: none
-
-    bin/pinot-admin.sh CreateSegment -dataDir <input_data_dir> [-format [CSV/JSON/AVRO]] [-readerConfigFile <csv_config_file>] [-generatorConfigFile <generator_config_file>] -segmentName <segment_name> -schemaFile <input_schema_file> -tableName <table_name> -outDir <output_data_dir> [-overwrite]
+.. code-block:: bash
 
+   bin/pinot-admin.sh CreateSegment -dataDir <input_data_dir> [-format [CSV/JSON/AVRO]] [-readerConfigFile <csv_config_file>] [-generatorConfigFile <generator_config_file>] -segmentName <segment_name> -schemaFile <input_schema_file> -tableName <table_name> -outDir <output_data_dir> [-overwrite]
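
+For instance, with the hypothetical layout sketched above, an invocation might look like the following; the segment name, schema file and reader config path are illustrative:

+.. code-block:: bash

+   bin/pinot-admin.sh CreateSegment \
+     -dataDir /tmp/pinot/input \
+     -format CSV \
+     -readerConfigFile /tmp/pinot/csv-config.json \
+     -segmentName flights_0 \
+     -schemaFile flights-schema.json \
+     -tableName flights \
+     -outDir /tmp/pinot/output \
+     -overwrite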
 
To configure various parameters for CSV, a config file in JSON format can be provided. This file is optional, as is each of its parameters. When a parameter is not provided, the default value described below is used:
 
-#.  fileFormat: Specify one of the following. Default is EXCEL.
+#. fileFormat: Specify one of the following. Default is EXCEL.
 
- ##.  EXCEL
- ##.  MYSQL
- ##.  RFC4180
- ##.  TDF
+   #. EXCEL
+   #. MYSQL
+   #. RFC4180
+   #. TDF
 
-#.  header: If the input CSV file does not contain a header, it can be specified using this field. Note, if this is specified, then the input file is expected to not contain the header row, or else it will result in parse error. The columns in the header must be delimited by the same delimiter character as the rest of the CSV file.
-#.  delimiter: Use this to specify a delimiter character. The default value is ",".
-#.  dateFormat: If there are columns that are in date format and need to be converted into Epoch (in milliseconds), use this to specify the format. Default is "mm-dd-yyyy".
-#.  dateColumns: If there are multiple date columns, use this to list those columns.
+#. header: If the input CSV file does not contain a header, it can be specified using this field. Note that if this is specified, the input file must not contain a header row, or a parse error will result. The columns in the header must be delimited by the same delimiter character as the rest of the CSV file.
+#. delimiter: Use this to specify a delimiter character. The default value is ",".
+#. multiValueDelimiter: Use this to specify the delimiter character between values within a multi-valued column. The default value is ";".
 
 Below is a sample config file.
 
-.. code-block:: none
+.. code-block:: json
 
-  {
-    "fileFormat" : "EXCEL",
-    "header" : "col1,col2,col3,col4",
-    "delimiter" : "\t",
-    "dateFormat" : "mm-dd-yy"
-    "dateColumns" : ["col1", "col2"]
-  }
+   {
+     "fileFormat": "EXCEL",
+     "header": "col1,col2,col3,col4",
+     "delimiter": "\t",
+     "multiValueDelimiter": ","
+   }
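
+To make the two delimiter settings concrete: with the config above, a tab separates columns while a comma separates the values inside a single multi-valued column. A hypothetical row matching the four-column header might be written like this:

+.. code-block:: bash

+   # col2 carries the two values "a" and "b"; columns themselves are tab-separated
+   printf 'v1\ta,b\tv3\tv4\n' >> /tmp/pinot/input/sample.csv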
 
 Sample Schema:
 
-.. code-block:: none
-
-  {
-    "dimensionFieldSpecs" : [
-      {
-        "dataType" : "STRING",
-        "delimiter" : null,
-        "singleValueField" : true,
-        "name" : "name"
-      },
-      {
-        "dataType" : "INT",
-        "delimiter" : null,
-        "singleValueField" : true,
-        "name" : "age"
-      }
-    ],
-    "timeFieldSpec" : {
-      "incomingGranularitySpec" : {
-        "timeType" : "DAYS",
-        "dataType" : "LONG",
-        "name" : "incomingName1"
-      },
-      "outgoingGranularitySpec" : {
-        "timeType" : "DAYS",
-        "dataType" : "LONG",
-        "name" : "outgoingName1"
-      }
-    },
-    "metricFieldSpecs" : [
-      {
-        "dataType" : "FLOAT",
-        "delimiter" : null,
-        "singleValueField" : true,
-        "name" : "percent"
-      }
-     ]
-    },
-    "schemaName" : "mySchema",
-  }
+.. code-block:: json
+
+   {
+     "schemaName": "flights",
+     "dimensionFieldSpecs": [
+       {
+         "name": "name",
+         "dataType": "STRING"
+       },
+       {
+         "name": "age",
+         "dataType": "INT"
+       }
 
 Review comment:
   Done
