Posted to commits@carbondata.apache.org by ku...@apache.org on 2020/05/06 08:07:15 UTC

[carbondata] branch master updated: [CARBONDATA-3791]Correct the link, grammars and content of dml-management document

This is an automated email from the ASF dual-hosted git repository.

kunalkapoor pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/carbondata.git


The following commit(s) were added to refs/heads/master by this push:
     new 8d82ab5  [CARBONDATA-3791]Correct the link, grammars and content of dml-management document
8d82ab5 is described below

commit 8d82ab5a780e9133babb95d7c6aee0d35a0d6706
Author: akashrn5 <ak...@gmail.com>
AuthorDate: Sun May 3 23:11:09 2020 +0530

    [CARBONDATA-3791]Correct the link, grammars and content of dml-management document
    
    Why is this PR needed?
    Some links were missing, there were grammar mistakes, and there were indentation errors.
    
    What changes were proposed in this PR?
    Corrected the grammar, removed unnecessary content, and fixed the indentation problems.
    
    This closes #3736
---
 docs/dml-of-carbondata.md                | 139 +++++++++++++++----------------
 docs/segment-management-on-carbondata.md |  12 +--
 2 files changed, 73 insertions(+), 78 deletions(-)

diff --git a/docs/dml-of-carbondata.md b/docs/dml-of-carbondata.md
index 2b26957..98a3289 100644
--- a/docs/dml-of-carbondata.md
+++ b/docs/dml-of-carbondata.md
@@ -50,7 +50,7 @@ CarbonData DML statements are documented here,which includes:
 | ------------------------------------------------------- | ------------------------------------------------------------ |
 | [DELIMITER](#delimiter)                                 | Character used to separate the data in the input csv file    |
 | [QUOTECHAR](#quotechar)                                 | Character used to quote the data in the input csv file       |
-| [LINE_SEPARATOR](#line_separator)                       | Characters used to specify the line separator in the input csv file. If not provide, csv parser will detect it automatically. | 
+| [LINE_SEPARATOR](#line_separator)                       | Characters used to specify the line separator in the input csv file. If not provided, the csv parser will detect it automatically. | 
 | [COMMENTCHAR](#commentchar)                             | Character used to comment the rows in the input csv file. Those rows will be skipped from processing |
 | [HEADER](#header)                                       | Whether the input csv files have header row                  |
 | [FILEHEADER](#fileheader)                               | If header is not present in the input csv, what is the column names to be used for data read from input csv |
@@ -60,6 +60,7 @@ CarbonData DML statements are documented here,which includes:
 | [SKIP_EMPTY_LINE](#skip_empty_line)                     | Whether empty lines in input csv file should be skipped or loaded as null row |
 | [COMPLEX_DELIMITER_LEVEL_1](#complex_delimiter_level_1) | Starting delimiter for complex type data in input csv file   |
 | [COMPLEX_DELIMITER_LEVEL_2](#complex_delimiter_level_2) | Ending delimiter for complex type data in input csv file     |
+| [COMPLEX_DELIMITER_LEVEL_3](#complex_delimiter_level_3) | Ending delimiter for level 3 nested complex type data in the input csv file. |
 | [DATEFORMAT](#dateformattimestampformat)                | Format of date in the input csv file                         |
 | [TIMESTAMPFORMAT](#dateformattimestampformat)           | Format of timestamp in the input csv file                    |
 | [SORT_COLUMN_BOUNDS](#sort-column-bounds)               | How to partition the sort columns to make the evenly distributed |
@@ -69,7 +70,6 @@ CarbonData DML statements are documented here,which includes:
 | [IS_EMPTY_DATA_BAD_RECORD](#bad-records-handling)       | Whether empty data of a column to be considered as bad record or not |
 | [GLOBAL_SORT_PARTITIONS](#global_sort_partitions)       | Number of partition to use for shuffling of data during sorting |
 | [SCALE_FACTOR](#scale_factor)                           | Control the partition size for RANGE_COLUMN feature          |
-| [CARBON_OPTIONS_BINARY_DECODER]                         | Support configurable decode for loading from csv             |
 -
   You can use the following options to load data:
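+  For instance, several of the csv parsing options above can be combined in one load command. The sketch below is only illustrative: the file path, table name, and the delimiter characters are assumed choices.
+
+  ```
+  LOAD DATA INPATH 'filepath.csv' INTO TABLE tablename
+  OPTIONS('DELIMITER'=',',
+          'QUOTECHAR'='"',
+          'COMPLEX_DELIMITER_LEVEL_1'='$',
+          'COMPLEX_DELIMITER_LEVEL_2'=':',
+          'COMPLEX_DELIMITER_LEVEL_3'='@')
+  ```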
 
@@ -127,12 +127,12 @@ CarbonData DML statements are documented here,which includes:
     ```
     
     Priority order for choosing Sort Scope is:
-    1. Load Data Command
-    2. CARBON.TABLE.LOAD.SORT.SCOPE.<db>.<table> session property
-    3. Table level Sort Scope
-    4. CARBON.OPTIONS.SORT.SCOPE session property
-    5. Default Value: NO_SORT
-
+    * Load Data Command
+    * ```CARBON.TABLE.LOAD.SORT.SCOPE.<db>.<table>``` session property.
+    * Table level Sort Scope
+    * ```CARBON.OPTIONS.SORT.SCOPE``` session property
+    * Default Value: NO_SORT
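+    
+    As a rough illustration of this priority order, the sketch below assumes the sort scope is also passed as a load option; the GLOBAL_SORT/LOCAL_SORT values are only examples.
+    
+    ```
+    SET CARBON.OPTIONS.SORT.SCOPE = GLOBAL_SORT;
+    
+    LOAD DATA INPATH 'filepath.csv' INTO TABLE tablename
+    OPTIONS('SORT_SCOPE'='LOCAL_SORT')
+    -- the Load Data Command has the highest priority, so this load uses LOCAL_SORT
+    ```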
+    
   - ##### MULTILINE:
 
     CSV with new line character in quotes.
@@ -189,7 +189,7 @@ CarbonData DML statements are documented here,which includes:
     ```
     OPTIONS('DATEFORMAT' = 'yyyy-MM-dd','TIMESTAMPFORMAT'='yyyy-MM-dd HH:mm:ss')
     ```
-    **NOTE:** Date formats are specified by date pattern strings. The date pattern letters in CarbonData are same as in JAVA. Refer to [SimpleDateFormat](http://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html).
+    **NOTE:** Date formats are specified by date pattern strings. The date pattern letters in CarbonData are the same as in Java. Refer to [SimpleDateFormat](http://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html).
 
   - ##### SORT COLUMN BOUNDS:
 
@@ -205,8 +205,7 @@ CarbonData DML statements are documented here,which includes:
     * SORT_COLUMN_BOUNDS will be used only when the SORT_SCOPE is 'local_sort'.
     * Carbondata will use these bounds as ranges to process data concurrently during the final sort procedure. The records will be sorted and written out inside each partition. Since the partition is sorted, all records will be sorted.
     * The option works better if your CPU usage during loading is low. If your current system CPU usage is high, better not to use this option. Besides, it depends on the user to specify the bounds. If user does not know the exactly bounds to make the data distributed evenly among the bounds, loading performance will still be better than before or at least the same as before.
-    * Users can find more information about this option in the description of PR1953.
-
+    * Users can find more information about this option in the description of [PR1953](https://github.com/apache/carbondata/pull/1953).
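+    
+    As a sketch of how the bounds are supplied, assume the table was created with 'SORT_COLUMNS'='name,id'; the bound values below are only illustrative. Each semicolon-separated group is one bound, giving one value per sort column in order.
+    
+    ```
+    LOAD DATA INPATH 'filepath.csv' INTO TABLE tablename
+    OPTIONS('SORT_COLUMN_BOUNDS'='f,250;l,500;r,750')
+    ```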
 
   - ##### BAD RECORDS HANDLING:
 
@@ -219,61 +218,57 @@ CarbonData DML statements are documented here,which includes:
     OPTIONS('BAD_RECORDS_LOGGER_ENABLE'='true', 'BAD_RECORD_PATH'='hdfs://hacluster/tmp/carbon', 'BAD_RECORDS_ACTION'='REDIRECT', 'IS_EMPTY_DATA_BAD_RECORD'='false')
     ```
 
-  **NOTE:**
-  * BAD_RECORDS_ACTION property can have four type of actions for bad records FORCE, REDIRECT, IGNORE and FAIL.
-  * FAIL option is its Default value. If the FAIL option is used, then data loading fails if any bad records are found.
-  * If the REDIRECT option is used, CarbonData will add all bad records in to a separate CSV file. However, this file must not be used for subsequent data loading because the content may not exactly match the source record. You are advised to cleanse the original source record for further data ingestion. This option is used to remind you which records are bad records.
-  * If the FORCE option is used, then it auto-converts the data by storing the bad records as NULL before Loading data.
-  * If the IGNORE option is used, then bad records are neither loaded nor written to the separate CSV file.
-  * In loaded data, if all records are bad records, the BAD_RECORDS_ACTION is invalid and the load operation fails.
-  * The default maximum number of characters per column is 32000. If there are more than 32000 characters in a column, please refer to *String longer than 32000 characters* section.
-  * Since Bad Records Path can be specified in create, load and carbon properties. 
-    Therefore, value specified in load will have the highest priority, and value specified in carbon properties will have the least priority.
+    **NOTE:**
+    * BAD_RECORDS_ACTION property can have four types of actions for bad records: FORCE, REDIRECT, IGNORE, and FAIL.
+    * FAIL is the default value. If the FAIL option is used, then data loading fails if any bad records are found.
+    * If the REDIRECT option is used, CarbonData will add all bad records into a separate CSV file. However, this file must not be used for subsequent data loading because the content may not exactly match the source record. You are advised to cleanse the source record for further data ingestion. This option is used to remind you which records are bad.
+    * If the FORCE option is used, then it auto-converts the data by storing the bad records as NULL before Loading data.
+    * If the IGNORE option is used, then bad records are neither loaded nor written to the separate CSV file.
+    * In loaded data, if all records are bad records, the BAD_RECORDS_ACTION is invalid and the load operation fails.
+    * The default maximum number of characters per column is 32000. If there are more than 32000 characters in a column, please refer to [String longer than 32000 characters](https://github.com/apache/carbondata/blob/master/docs/ddl-of-carbondata.md#string-longer-than-32000-characters) section.
+    * Bad Records Path can be specified in create, load, and carbon properties.
+      The value specified in load has the highest priority, and the value specified in carbon properties has the least priority.
 
-  Example:
+    Example:
 
-  ```
-  LOAD DATA INPATH 'filepath.csv' INTO TABLE tablename
-  OPTIONS('BAD_RECORDS_LOGGER_ENABLE'='true','BAD_RECORD_PATH'='hdfs://hacluster/tmp/carbon',
-  'BAD_RECORDS_ACTION'='REDIRECT','IS_EMPTY_DATA_BAD_RECORD'='false')
-  ```
+    ```
+    LOAD DATA INPATH 'filepath.csv' INTO TABLE tablename
+    OPTIONS('BAD_RECORDS_LOGGER_ENABLE'='true','BAD_RECORD_PATH'='hdfs://hacluster/tmp/carbon',
+    'BAD_RECORDS_ACTION'='REDIRECT','IS_EMPTY_DATA_BAD_RECORD'='false')
+    ```
 
   - ##### GLOBAL_SORT_PARTITIONS:
 
-    If the SORT_SCOPE is defined as GLOBAL_SORT, then user can specify the number of partitions to use while shuffling data for sort using GLOBAL_SORT_PARTITIONS. If it is not configured, or configured less than 1, then it uses the number of map task as reduce task. It is recommended that each reduce task deal with 512MB-1GB data.
+    If the SORT_SCOPE is defined as GLOBAL_SORT, then the user can specify the number of partitions to use while shuffling data for sort using GLOBAL_SORT_PARTITIONS. If it is not configured, or is configured to a value less than 1, then the number of map tasks is used as the number of reduce tasks. It is recommended that each reduce task deal with 512MB-1GB of data.
     For RANGE_COLUMN, GLOBAL_SORT_PARTITIONS is used to specify the number of range partitions also.
-    GLOBAL_SORT_PARTITIONS should be specified optimally during RANGE_COLUMN LOAD because if a higher number is configured then the load time may be less but it will result in creation of more files which would degrade the query and compaction performance.
-    Conversely, if less partitions are configured then the load performance may degrade due to less use of parallelism but the query and compaction will become faster. Hence the user may choose optimal number depending on the use case.
-  ```
-  OPTIONS('GLOBAL_SORT_PARTITIONS'='2')
-  ```
-
-   **NOTE:**
-   * GLOBAL_SORT_PARTITIONS should be Integer type, the range is [1,Integer.MaxValue].
-   * It is only used when the SORT_SCOPE is GLOBAL_SORT.
-
-   - ##### SCALE_FACTOR
+    GLOBAL_SORT_PARTITIONS should be specified optimally during RANGE_COLUMN LOAD because, if a higher number is configured, the load time may be less, but it will result in the creation of more files, which would degrade query and compaction performance.
+    Conversely, if fewer partitions are configured, the load performance may degrade due to less use of parallelism, but the query and compaction will become faster. Hence, the user may choose an optimal number depending on the use case.
+    ```
+    OPTIONS('GLOBAL_SORT_PARTITIONS'='2')
+    ```
 
-   For RANGE_COLUMN, SCALE_FACTOR is used to control the number of range partitions as following.
-   ```
-     splitSize = max(blocklet_size, (block_size - blocklet_size)) * scale_factor
-     numPartitions = total size of input data / splitSize
-   ```
-   The default value is 3, and the range is [1, 300].
+     **NOTE:**
+     * GLOBAL_SORT_PARTITIONS should be Integer type, the range is [1,Integer.MaxValue].
+     * It is only used when the SORT_SCOPE is GLOBAL_SORT.
 
-   ```
-     OPTIONS('SCALE_FACTOR'='10')
-   ```
-   **NOTE:**
-   * If both GLOBAL_SORT_PARTITIONS and SCALE_FACTOR are used at the same time, only GLOBAL_SORT_PARTITIONS is valid.
-   * The compaction on RANGE_COLUMN will use LOCAL_SORT by default.
+  - ##### SCALE_FACTOR
 
-   - ##### CARBON_ENABLE_RANGE_COMPACTION
+    For RANGE_COLUMN, SCALE_FACTOR is used to control the number of range partitions as follows.
+    ```
+      splitSize = max(blocklet_size, (block_size - blocklet_size)) * scale_factor
+      numPartitions = total size of input data / splitSize
+    ```
+    The default value is 3, and the range is [1, 300].
+ 
+    ```
+      OPTIONS('SCALE_FACTOR'='10')
+    ```
+    **NOTE:**
+    * If both GLOBAL_SORT_PARTITIONS and SCALE_FACTOR are used at the same time, only GLOBAL_SORT_PARTITIONS is valid.
+    * The compaction on RANGE_COLUMN will use LOCAL_SORT by default.
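+    
+    As a rough worked example of the formula above (the blocklet and block sizes are assumed values, not defaults):
+    
+    ```
+      blocklet_size = 64 MB, block_size = 1024 MB, SCALE_FACTOR = 10
+      splitSize     = max(64 MB, 1024 MB - 64 MB) * 10 = 9600 MB
+      numPartitions = 96 GB of input data / 9600 MB = ~10 range partitions
+    ```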
 
-   To configure Ranges-based Compaction to be used or not for RANGE_COLUMN.
-   The default value is 'true'.
 
-### INSERT DATA INTO CARBONDATA TABLE
+## INSERT DATA INTO CARBONDATA TABLE
 
   This command inserts data into a CarbonData table, it is defined as a combination of two queries Insert and Select query respectively. 
   It inserts records from a source table into a target CarbonData table, the source table can be a Hive table, Parquet table or a CarbonData table itself. 
@@ -284,7 +279,7 @@ CarbonData DML statements are documented here,which includes:
   [ WHERE { <filter_condition> } ]
   ```
 
-  You can also omit the `table` keyword and write your query as:
+  The user can also omit the `table` keyword and write the query as:
 
   ```
   INSERT INTO <CARBONDATA TABLE> SELECT * FROM sourceTableName 
@@ -316,12 +311,12 @@ CarbonData DML statements are documented here,which includes:
   INSERT OVERWRITE TABLE table1 SELECT * FROM TABLE2
   ```
 
-### INSERT DATA INTO CARBONDATA TABLE From Stage Input Files
+## INSERT DATA INTO CARBONDATA TABLE From Stage Input Files
 
   Stage input files are data files written by external application (such as Flink). These files 
   are committed but not loaded into the table. 
   
-  You can use this command to insert them into the table, so that making them visible for query.
+  The user can use this command to insert them into the table, thus making them visible to queries.
   
   ```
   INSERT INTO <CARBONDATA TABLE> STAGE OPTIONS(property_name=property_value, ...)
@@ -334,7 +329,7 @@ CarbonData DML statements are documented here,which includes:
 | [BATCH_FILE_ORDER](#batch_file_order)                   | The order type of stage files in per processing                     |
 
 -
-  You can use the following options to load data:
+  The user can use the following options to load data:
 
   - ##### BATCH_FILE_COUNT: 
     The number of stage files per processing.
@@ -352,18 +347,18 @@ CarbonData DML statements are documented here,which includes:
     OPTIONS('batch_file_order'='DESC')
     ```
 
-  Examples:
-  ```
-  INSERT INTO table1 STAGE
-
-  INSERT INTO table1 STAGE OPTIONS('batch_file_count' = '5')
-  Note: This command use the default file order, will insert the earliest stage files into the table.
-
-  INSERT INTO table1 STAGE OPTIONS('batch_file_count' = '5', 'batch_file_order'='DESC')
-  Note: This command will insert the latest stage files into the table.
-  ```
+    Examples:
+    ```
+    INSERT INTO table1 STAGE
+  
+    INSERT INTO table1 STAGE OPTIONS('batch_file_count' = '5')
+    Note: This command uses the default file order and will insert the earliest stage files into the table.
+  
+    INSERT INTO table1 STAGE OPTIONS('batch_file_count' = '5', 'batch_file_order'='DESC')
+    Note: This command will insert the latest stage files into the table.
+    ```
 
-### Load Data Using Static Partition 
+## Load Data Using Static Partition 
 
   This command allows you to load data using static partition.
 
@@ -386,7 +381,7 @@ CarbonData DML statements are documented here,which includes:
   SELECT <columns list excluding partition columns> FROM another_user
   ```
 
-### Load Data Using Dynamic Partition
+## Load Data Using Dynamic Partition
 
   This command allows you to load data using dynamic partition. If partition spec is not specified, then the partition is considered as dynamic.
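+
+  For example, the sketch below contrasts the two modes; the table name and the `country` partition column are assumptions for illustration only.
+
+  ```
+  -- static partition: the partition value is specified explicitly
+  LOAD DATA INPATH 'filepath.csv' INTO TABLE locationTable
+  PARTITION (country = 'US')
+
+  -- dynamic partition: no partition spec, so partition values are taken from the data itself
+  LOAD DATA INPATH 'filepath.csv' INTO TABLE locationTable
+  ```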
 
@@ -512,7 +507,7 @@ CarbonData DML statements are documented here,which includes:
 
   - **Minor Compaction**
 
-  In Minor compaction, user can specify the number of loads to be merged. 
+  In Minor compaction, the user can specify the number of loads to be merged. 
   Minor compaction triggers for every data load if the parameter carbon.enable.auto.load.merge is set to true. 
   If any segments are available to be merged, then compaction will run parallel with data load, there are 2 levels in minor compaction:
   * Level 1: Merging of the segments which are not yet compacted.
diff --git a/docs/segment-management-on-carbondata.md b/docs/segment-management-on-carbondata.md
index d18aca1..d4fe339 100644
--- a/docs/segment-management-on-carbondata.md
+++ b/docs/segment-management-on-carbondata.md
@@ -62,7 +62,7 @@ concept which helps to maintain consistency of data and easy transaction managem
 
   When more detail of the segment is required, user can issue SHOW SEGMENT by query.    
     
-  The query should against table name with '_segments' appended and select from following fields:
+  The query should be against the table name with '_segments' appended and select from the following fields:
     
 - id: String, the id of the segment
 - status: String, status of the segment
@@ -149,7 +149,7 @@ concept which helps to maintain consistency of data and easy transaction managem
   **NOTE:**
   carbon.input.segments: Specifies the segment IDs to be queried. This property allows you to query specified segments of the specified table. The CarbonScan will read data from specified segments only.
 
-  If user wants to query with segments reading in multi threading mode, then CarbonSession. threadSet can be used instead of SET query.
+  If user wants to query with segments reading in multi-threading mode, then CarbonSession.threadSet can be used instead of SET query.
   ```
   CarbonSession.threadSet ("carbon.input.segments.<database_name>.<table_name>","<list of segment IDs>");
   ```
@@ -159,14 +159,14 @@ concept which helps to maintain consistency of data and easy transaction managem
   SET carbon.input.segments.<database_name>.<table_name> = *;
   ```
 
-  If user wants to query with segments reading in multi threading mode, then CarbonSession. threadSet can be used instead of SET query. 
+  If user wants to query with segments reading in multi-threading mode, then CarbonSession.threadSet can be used instead of SET query.
   ```
   CarbonSession.threadSet ("carbon.input.segments.<database_name>.<table_name>","*");
   ```
 
   **Examples:**
 
-  * Example to show the list of segment IDs,segment status, and other required details and then specify the list of segments to be read.
+  * Example to show the list of segment IDs, segment status, and other required details and then specify the list of segments to be read.
 
   ```
   SHOW SEGMENTS FOR carbontable1;
@@ -174,13 +174,13 @@ concept which helps to maintain consistency of data and easy transaction managem
   SET carbon.input.segments.db.carbontable1 = 1,3,9;
   ```
 
-  * Example to query with segments reading in multi threading mode:
+  * Example to query with segments reading in multi-threading mode:
 
   ```
   CarbonSession.threadSet ("carbon.input.segments.db.carbontable_Multi_Thread","1,3");
   ```
 
-  * Example for threadset in multithread environment (following shows how it is used in Scala code):
+  * Example for threadset in multi-thread environment (following shows how it is used in Scala code):
 
   ```
   def main(args: Array[String]) {