Posted to commits@carbondata.apache.org by ra...@apache.org on 2017/11/11 12:02:48 UTC

[07/11] carbondata git commit: [DOCS] Removed unused parameters, added SORT_SCOPE, and updated dictionary details

[DOCS] Removed unused parameters, added SORT_SCOPE, and updated dictionary details

This closes #1426


Project: http://git-wip-us.apache.org/repos/asf/carbondata/repo
Commit: http://git-wip-us.apache.org/repos/asf/carbondata/commit/520e50f3
Tree: http://git-wip-us.apache.org/repos/asf/carbondata/tree/520e50f3
Diff: http://git-wip-us.apache.org/repos/asf/carbondata/diff/520e50f3

Branch: refs/heads/pre-aggregate
Commit: 520e50f32f3716b1335df37efb26222d37bc2b20
Parents: 9f6c8e6
Author: sgururajshetty <sg...@gmail.com>
Authored: Sun Oct 22 15:38:01 2017 +0530
Committer: chenliang613 <ch...@huawei.com>
Committed: Sat Nov 11 16:12:09 2017 +0800

----------------------------------------------------------------------
 docs/configuration-parameters.md    |  6 +-----
 docs/ddl-operation-on-carbondata.md | 31 ++++++++++++++++++++++++++++---
 2 files changed, 29 insertions(+), 8 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/carbondata/blob/520e50f3/docs/configuration-parameters.md
----------------------------------------------------------------------
diff --git a/docs/configuration-parameters.md b/docs/configuration-parameters.md
index e085317..141a60c 100644
--- a/docs/configuration-parameters.md
+++ b/docs/configuration-parameters.md
@@ -48,12 +48,8 @@ This section provides the details of all the configurations required for CarbonD
 
 | Parameter | Default Value | Description | Range |
 |--------------------------------------|---------------|----------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| carbon.sort.file.buffer.size | 20 | File read buffer size used during sorting. This value is expressed in MB. | Min=1 and Max=100 |
-| carbon.graph.rowset.size | 100000 | Rowset size exchanged between data load graph steps. | Min=500 and Max=1000000 |
 | carbon.number.of.cores.while.loading | 6 | Number of cores to be used while loading data. |  |
 | carbon.sort.size | 500000 | Record count to sort and write intermediate files to temp. |  |
-| carbon.enableXXHash | true | Algorithm for hashmap for hashkey calculation. |  |
-| carbon.number.of.cores.block.sort | 7 | Number of cores to use for block sort while loading data. |  |
 | carbon.max.driver.lru.cache.size | -1 | Max LRU cache size upto which data will be loaded at the driver side. This value is expressed in MB. Default value of -1 means there is no memory limit for caching. Only integer values greater than 0 are accepted. |  |
 | carbon.max.executor.lru.cache.size | -1 | Max LRU cache size upto which data will be loaded at the executor side. This value is expressed in MB. Default value of -1 means there is no memory limit for caching. Only integer values greater than 0 are accepted. If this parameter is not configured, then the carbon.max.driver.lru.cache.size value will be considered. |  |
 | carbon.merge.sort.prefetch | true | Enable prefetch of data during merge sort while reading data from sort temp files in data loading. |  |
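+
+ For example, a minimal conf/carbon.properties sketch setting some of these parameters (values are illustrative, not recommendations):
+```
+ carbon.number.of.cores.while.loading=6
+ carbon.sort.size=500000
+ carbon.max.driver.lru.cache.size=1024
+```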
@@ -135,7 +131,7 @@ This section provides the details of all the configurations required for CarbonD
 |---------------------------------------|---------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
 | high.cardinality.identify.enable | true | If the parameter is true, the high cardinality columns of the dictionary code are automatically recognized and these columns will not be used as global dictionary encoding. If the parameter is false, all dictionary encoding columns are used as dictionary encoding. The high cardinality column must meet the following requirements: value of cardinality > configured value of high.cardinality. <b> Note: </b> If SINGLE_PASS is used during data load, then this property will be disabled.|
 | high.cardinality.threshold | 1000000  | It is a threshold to identify high cardinality of the columns.If the value of columns' cardinality > the configured value, then the columns are excluded from dictionary encoding. |
-| carbon.cutOffTimestamp | 1970-01-01 05:30:00 | Sets the start date for calculating the timestamp. Java counts the number of milliseconds from start of "1970-01-01 00:00:00". This property is used to customize the start of position. For example "2000-01-01 00:00:00". The date must be in the form "carbon.timestamp.format". NOTE: The CarbonData supports data store up to 68 years from the cut-off time defined. For example, if the cut-off time is 1970-01-01 05:30:00, then the data can be stored up to 2038-01-01 05:30:00. |
+| carbon.cutOffTimestamp | 1970-01-01 05:30:00 | Sets the start date for calculating the timestamp. Java counts the number of milliseconds from the start of "1970-01-01 00:00:00". This property is used to customize the start position, for example "2000-01-01 00:00:00". The date must be in the form "carbon.timestamp.format". |
 | carbon.timegranularity | SECOND | The property used to set the data granularity level DAY, HOUR, MINUTE, or SECOND. |
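+
+ For example, to move the epoch to 2000-01-01 in conf/carbon.properties (illustrative; carbon.timestamp.format is shown with its default pattern):
+```
+ carbon.cutOffTimestamp=2000-01-01 00:00:00
+ carbon.timestamp.format=yyyy-MM-dd HH:mm:ss
+```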
   
 ##  Spark Configuration

http://git-wip-us.apache.org/repos/asf/carbondata/blob/520e50f3/docs/ddl-operation-on-carbondata.md
----------------------------------------------------------------------
diff --git a/docs/ddl-operation-on-carbondata.md b/docs/ddl-operation-on-carbondata.md
index 55d7063..d1fee46 100644
--- a/docs/ddl-operation-on-carbondata.md
+++ b/docs/ddl-operation-on-carbondata.md
@@ -62,14 +62,14 @@ The following DDL operations are supported in CarbonData :
 
    - **Dictionary Encoding Configuration**
 
-       Dictionary encoding is enabled by default for all String columns, and disabled for non-String columns. You can include and exclude columns for dictionary encoding.
+       Dictionary encoding is turned off for all columns by default. You can include and exclude columns for dictionary encoding.
 
 ```
        TBLPROPERTIES ('DICTIONARY_EXCLUDE'='column1, column2')
        TBLPROPERTIES ('DICTIONARY_INCLUDE'='column1, column2')
 ```
 
-   Here, DICTIONARY_EXCLUDE will exclude dictionary creation. This is applicable for high-cardinality columns and is an optional parameter. DICTIONARY_INCLUDE will generate dictionary for the columns specified in the list.
+   Here, DICTIONARY_INCLUDE will generate a dictionary for the specified columns. This can considerably improve performance for low-cardinality dimensions, especially string columns.
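+
+   As a minimal sketch (the table and column names are illustrative: country is a low-cardinality column, user_id a high-cardinality one):
+```
+   CREATE TABLE IF NOT EXISTS sales (
+     order_id BIGINT,
+     country STRING,
+     user_id STRING)
+   STORED BY 'carbondata'
+   TBLPROPERTIES ('DICTIONARY_INCLUDE'='country',
+                  'DICTIONARY_EXCLUDE'='user_id')
+```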
 
 
 
@@ -129,7 +129,7 @@ The following DDL operations are supported in CarbonData :
 
    - **SORT_COLUMNS**
 
-    This table property specifies the order of the sort column.
+      This table property specifies the sort columns and their order (a CREATE TABLE sketch follows the notes below).
 
 ```
     TBLPROPERTIES('SORT_COLUMNS'='column1, column3')
@@ -140,6 +140,31 @@ The following DDL operations are supported in CarbonData :
    - If this property is not specified, then by default SORT_COLUMNS consist of all dimension (exclude Complex Column).
 
    - If this property is specified but with empty argument, then the table will be loaded without sort. For example, ('SORT_COLUMNS'='')
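+
+   As a minimal sketch (the table and column names are illustrative):
+```
+   CREATE TABLE IF NOT EXISTS sales (
+     order_id BIGINT,
+     country STRING,
+     quantity INT)
+   STORED BY 'carbondata'
+   TBLPROPERTIES('SORT_COLUMNS'='country, order_id')
+```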
+   
+   - **SORT_SCOPE**
+
+      This option specifies the scope of the sort during data load. The following sort scopes are supported (a complete LOAD DATA sketch follows the list).
+     * BATCH_SORT: it increases the load performance, but can decrease the query performance when the number of identified blocks exceeds the parallelism.
+```
+    OPTIONS ('SORT_SCOPE'='BATCH_SORT')
+```
+      You can also specify the batch sort size (in MB) for this scope.
+```
+    OPTIONS ('SORT_SCOPE'='BATCH_SORT', 'batch_sort_size_inmb'='7')
+```
+     * GLOBAL_SORT: it increases the query performance, especially for point queries.
+```
+    OPTIONS ('SORT_SCOPE'='GLOBAL_SORT')
+```
+      You can also specify the number of partitions to use when shuffling data for the sort. If this option is not configured, or is configured to a value less than 1, then the number of map tasks is used as the number of reduce tasks. It is recommended that each reduce task process 512 MB to 1 GB of data.
+```
+    OPTIONS('SORT_SCOPE'='GLOBAL_SORT', 'GLOBAL_SORT_PARTITIONS'='2')
+```
+     * LOCAL_SORT: it is the default sort scope.
+     * NO_SORT: it loads the data in an unsorted manner.
+
+   NOTE:
+   - Increasing the number of GLOBAL_SORT partitions may require increasing spark.driver.maxResultSize, because the sampling data collected at the driver grows with the number of partitions.
+   - Increasing the number of partitions may also increase the number of B-trees.
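+
+   For context, a minimal LOAD DATA sketch combining these options (the path and table name are illustrative):
+```
+   LOAD DATA INPATH 'hdfs://hacluster/data/sales.csv' INTO TABLE sales
+   OPTIONS('SORT_SCOPE'='GLOBAL_SORT', 'GLOBAL_SORT_PARTITIONS'='4')
+```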
+	 
 
 ## SHOW TABLE