Posted to commits@carbondata.apache.org by ra...@apache.org on 2019/01/18 05:40:27 UTC
[carbondata] branch master updated: [CARBONDATA-3215] Optimize the documentation

This is an automated email from the ASF dual-hosted git repository.

raghunandan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/carbondata.git


The following commit(s) were added to refs/heads/master by this push:
     new b828d0d  [CARBONDATA-3215] Optimize the documentation
b828d0d is described below

commit b828d0da9c5b0d63b7285f7d801edc4f0a949f5a
Author: xubo245 <xu...@huawei.com>
AuthorDate: Fri Dec 28 20:37:16 2018 +0800

    [CARBONDATA-3215] Optimize the documentation
    
    When users work with global dictionary, local dictionary, and non-dictionary in the code,
    they may be confused about how these differ. The same applies to MVDataMap and IndexDataMap. This PR describes and lists them.
    
    1. Describe global dictionary, local dictionary, and non-dictionary together in the docs
    2. List the DataMaps that belong to MVDataMap and IndexDataMap
    
    This closes #3033
---
 docs/datamap-developer-guide.md |   8 +-
 docs/ddl-of-carbondata.md       | 166 ++++++++++++++++++++--------------------
 2 files changed, 89 insertions(+), 85 deletions(-)

diff --git a/docs/datamap-developer-guide.md b/docs/datamap-developer-guide.md
index c74aa1b..e1fa355 100644
--- a/docs/datamap-developer-guide.md
+++ b/docs/datamap-developer-guide.md
@@ -19,16 +19,16 @@
 
 ### Introduction
 DataMap is a data structure that can be used to accelerate certain query of the table. Different DataMap can be implemented by developers. 
-Currently, there are two 2 types of DataMap supported:
-1. IndexDataMap: DataMap that leverages index to accelerate filter query
-2. MVDataMap: DataMap that leverages Materialized View to accelerate OLAP style query, like SPJG query (select, predicate, join, groupby)
+Currently, there are two types of DataMap supported:
+1. IndexDataMap: DataMap that leverages index to accelerate filter query. Lucene DataMap and BloomFilter DataMap belong to this type of DataMap.
+2. MVDataMap: DataMap that leverages Materialized View to accelerate OLAP style query, like SPJG query (select, predicate, join, groupby). Preaggregate, timeseries and mv DataMap belong to this type of DataMap.
 
 ### DataMap Provider
 When user issues `CREATE DATAMAP dm ON TABLE main USING 'provider'`, the corresponding DataMapProvider implementation will be created and initialized. 
 Currently, the provider string can be:
 1. preaggregate: A type of MVDataMap that do pre-aggregate of single table
 2. timeseries: A type of MVDataMap that do pre-aggregate based on time dimension of the table
-3. class name IndexDataMapFactory  implementation: Developer can implement new type of IndexDataMap by extending IndexDataMapFactory
+3. class name IndexDataMapFactory implementation: Developer can implement new type of IndexDataMap by extending IndexDataMapFactory
 
 When user issues `DROP DATAMAP dm ON TABLE main`, the corresponding DataMapProvider interface will be called.
 
diff --git a/docs/ddl-of-carbondata.md b/docs/ddl-of-carbondata.md
index aaa2eda..b9b391b 100644
--- a/docs/ddl-of-carbondata.md
+++ b/docs/ddl-of-carbondata.md
@@ -21,13 +21,13 @@ CarbonData DDL statements are documented here,which includes:
 
 * [CREATE TABLE](#create-table)
   * [Dictionary Encoding](#dictionary-encoding-configuration)
+  * [Local Dictionary](#local-dictionary-configuration)
   * [Inverted Index](#inverted-index-configuration)
   * [Sort Columns](#sort-columns-configuration)
   * [Sort Scope](#sort-scope-configuration)
   * [Table Block Size](#table-block-size-configuration)
   * [Table Compaction](#table-compaction-configuration)
   * [Streaming](#streaming)
-  * [Local Dictionary](#local-dictionary-configuration)
   * [Caching Column Min/Max](#caching-minmax-value-for-required-columns)
   * [Caching Level](#caching-at-block-or-blocklet-level)
   * [Hive/Parquet folder Structure](#support-flat-folder-same-as-hiveparquet)
@@ -121,8 +121,91 @@ CarbonData DDL statements are documented here,which includes:
      TBLPROPERTIES ('DICTIONARY_INCLUDE'='column1, column2')
      ```
 
-     **NOTE**: Dictionary Include/Exclude for complex child columns is not supported.
+     **NOTE**: 
+      * Dictionary Include/Exclude for complex child columns is not supported.   
+      * Dictionary here refers to the global dictionary. Besides global dictionary, CarbonData also has local dictionary and non-dictionary.
+      
+   - ##### Local Dictionary Configuration
+
+   Columns for which dictionary is not generated need more storage space and in turn more IO. Also, since more data has to be read during query, query performance suffers. Generating dictionary per blocklet for such columns helps save storage space and improve query performance, as carbondata is optimized for handling dictionary encoded columns more effectively. Generating dictionary internally per blocklet is termed local dictionary. Please refer to [...]
 
+   Local Dictionary helps in:
+   1. Getting more compression.
+   2. Filter queries and full scan queries will be faster as filter will be done on encoded data.
+   3. Reducing the store size and memory footprint as only unique values will be stored as part of local dictionary and corresponding data will be stored as encoded data.
+   4. Getting higher IO throughput.
+
+   **NOTE:** 
+
+   * Following Data Types are Supported for Local Dictionary:
+      * STRING
+      * VARCHAR
+      * CHAR
+
+   * Following Data Types are not Supported for Local Dictionary: 
+      * SMALLINT
+      * INTEGER
+      * BIGINT
+      * DOUBLE
+      * DECIMAL
+      * TIMESTAMP
+      * DATE
+      * BOOLEAN
+      * FLOAT
+      * BYTE
+   * In case of multi-level complex dataType columns, primitive string/varchar/char columns are considered for local dictionary generation.
+
+   System Level Properties for Local Dictionary: 
+   
+   
+   | Properties | Default value | Description |
+   | ---------- | ------------- | ----------- |
+   | carbon.local.dictionary.enable | false | By default, Local Dictionary will be disabled for the carbondata table. |
+   | carbon.local.dictionary.decoder.fallback | true | Page Level data will not be maintained for the blocklet. During fallback, actual data will be retrieved from the encoded page data using local dictionary. **NOTE:** Memory footprint decreases significantly as compared to when this property is set to false |
+    
+   Local Dictionary can be configured using the following properties during create table command: 
+          
+
+| Properties | Default value | Description |
+| ---------- | ------------- | ----------- |
+| LOCAL_DICTIONARY_ENABLE | false | Whether to enable local dictionary generation. **NOTE:** If this property is defined, it will override the value configured at system level by '***carbon.local.dictionary.enable***'. Local dictionary will be generated for all string/varchar/char columns unless LOCAL_DICTIONARY_INCLUDE or LOCAL_DICTIONARY_EXCLUDE is configured. |
+| LOCAL_DICTIONARY_THRESHOLD | 10000 | The maximum cardinality of a column up to which carbondata can try to generate local dictionary (maximum - 100000). **NOTE:** When LOCAL_DICTIONARY_THRESHOLD is defined for Complex columns, the count of distinct records of all child columns is summed up. |
+| LOCAL_DICTIONARY_INCLUDE | string/varchar/char columns| Columns for which Local Dictionary has to be generated. **NOTE:** Those string/varchar/char columns which are added into DICTIONARY_INCLUDE option will not be considered for local dictionary generation. This property needs to be configured only when local dictionary needs to be generated for few columns, skipping others. This property takes effect only when **LOCAL_DICTIONARY_ENABLE** is true or **carbon.local.dictionary.enable** i [...]
+| LOCAL_DICTIONARY_EXCLUDE | none | Columns for which Local Dictionary need not be generated. This property needs to be configured only when local dictionary needs to be skipped for few columns, generating for others. This property takes effect only when **LOCAL_DICTIONARY_ENABLE** is true or **carbon.local.dictionary.enable** is true |
+
+   **Fallback behavior:** 
+
+   * When the cardinality of a column exceeds the threshold, it triggers a fallback and the generated dictionary will be reverted and data loading will be continued without dictionary encoding.
+   
+   * In case of complex columns, fallback is triggered when the summation value of all child columns' distinct records exceeds the defined LOCAL_DICTIONARY_THRESHOLD value.
+
+   **NOTE:** When fallback is triggered, the data loading performance will decrease as encoded data will be discarded and the actual data is written to the temporary sort files.
+
+   **Points to be noted:**
+
+   * Reduce Block size:
+   
+      Number of Blocks generated is less in case of Local Dictionary as compression ratio is high. This may reduce the number of tasks launched during query, resulting in degradation of query performance if the pruned blocks are less compared to the number of parallel tasks which can be run. So it is recommended to configure smaller block size which in turn generates more number of blocks.
+      
+### Example:
+
+   ```
+   CREATE TABLE carbontable(             
+     column1 string,             
+     column2 string,             
+     column3 LONG)
+   STORED AS carbondata
+   TBLPROPERTIES('LOCAL_DICTIONARY_ENABLE'='true','LOCAL_DICTIONARY_THRESHOLD'='1000',
+   'LOCAL_DICTIONARY_INCLUDE'='column1','LOCAL_DICTIONARY_EXCLUDE'='column2')
+   ```
+
+   **NOTE:** 
+
+   * We recommend using Local Dictionary when cardinality is high but is distributed across multiple loads.
+   * On a large cluster, decoding data can become a bottleneck for global dictionary as there will be many remote reads. In this scenario, it is better to use Local Dictionary.
+   * When cardinality is low, but loads are repetitive, it is better to use global dictionary as local dictionary generates multiple dictionary files at blocklet level, increasing redundancy.
+   * To use non-dictionary, set LOCAL_DICTIONARY_ENABLE to false and do not set DICTIONARY_INCLUDE.
+      
    - ##### Inverted Index Configuration
 
      By default inverted index is disabled as store size will be reduced, it can be enabled by using a table property. It might help to improve compression ratio and query speed, especially for low cardinality columns which are in reward position.
@@ -224,85 +307,6 @@ CarbonData DDL statements are documented here,which includes:
      TBLPROPERTIES ('streaming'='true')
      ```
 
-   - ##### Local Dictionary Configuration
-
-   Columns for which dictionary is not generated needs more storage space and in turn more IO. Also since more data will have to be read during query, query performance also would suffer.Generating dictionary per blocklet for such columns would help in saving storage space and assist in improving query performance as carbondata is optimized for handling dictionary encoded columns more effectively.Generating dictionary internally per blocklet is termed as local dictionary. Please refer to [...]
-
-   Local Dictionary helps in:
-   1. Getting more compression.
-   2. Filter queries and full scan queries will be faster as filter will be done on encoded data.
-   3. Reducing the store size and memory footprint as only unique values will be stored as part of local dictionary and corresponding data will be stored as encoded data.
-   4. Getting higher IO throughput.
-
-   **NOTE:** 
-
-   * Following Data Types are Supported for Local Dictionary:
-      * STRING
-      * VARCHAR
-      * CHAR
-
-   * Following Data Types are not Supported for Local Dictionary: 
-      * SMALLINT
-      * INTEGER
-      * BIGINT
-      * DOUBLE
-      * DECIMAL
-      * TIMESTAMP
-      * DATE
-      * BOOLEAN
-      * FLOAT
-      * BYTE
-   * In case of multi-level complex dataType columns, primitive string/varchar/char columns are considered for local dictionary generation.
-
-   System Level Properties for Local Dictionary: 
-   
-   
-   | Properties | Default value | Description |
-   | ---------- | ------------- | ----------- |
-   | carbon.local.dictionary.enable | false | By default, Local Dictionary will be disabled for the carbondata table. |
-   | carbon.local.dictionary.decoder.fallback | true | Page Level data will not be maintained for the blocklet. During fallback, actual data will be retrieved from the encoded page data using local dictionary. **NOTE:** Memory footprint decreases significantly as compared to when this property is set to false |
-    
-   Local Dictionary can be configured using the following properties during create table command: 
-          
-
-| Properties | Default value | Description |
-| ---------- | ------------- | ----------- |
-| LOCAL_DICTIONARY_ENABLE | false | Whether to enable local dictionary generation. **NOTE:** If this property is defined, it will override the value configured at system level by '***carbon.local.dictionary.enable***'.Local dictionary will be generated for all string/varchar/char columns unless LOCAL_DICTIONARY_INCLUDE, LOCAL_DICTIONARY_EXCLUDE is configured. |
-| LOCAL_DICTIONARY_THRESHOLD | 10000 | The maximum cardinality of a column upto which carbondata can try to generate local dictionary (maximum - 100000). **NOTE:** When LOCAL_DICTIONARY_THRESHOLD is defined for Complex columns, the count of distinct records of all child columns are summed up. |
-| LOCAL_DICTIONARY_INCLUDE | string/varchar/char columns| Columns for which Local Dictionary has to be generated.**NOTE:** Those string/varchar/char columns which are added into DICTIONARY_INCLUDE option will not be considered for local dictionary generation. This property needs to be configured only when local dictionary needs to be generated for few columns, skipping others. This property takes effect only when **LOCAL_DICTIONARY_ENABLE** is true or **carbon.local.dictionary.enable** i [...]
-| LOCAL_DICTIONARY_EXCLUDE | none | Columns for which Local Dictionary need not be generated. This property needs to be configured only when local dictionary needs to be skipped for few columns, generating for others. This property takes effect only when **LOCAL_DICTIONARY_ENABLE** is true or **carbon.local.dictionary.enable** is true |
-
-   **Fallback behavior:** 
-
-   * When the cardinality of a column exceeds the threshold, it triggers a fallback and the generated dictionary will be reverted and data loading will be continued without dictionary encoding.
-   
-   * In case of complex columns, fallback is triggered when the summation value of all child columns' distinct records exceeds the defined LOCAL_DICTIONARY_THRESHOLD value.
-
-   **NOTE:** When fallback is triggered, the data loading performance will decrease as encoded data will be discarded and the actual data is written to the temporary sort files.
-
-   **Points to be noted:**
-
-   * Reduce Block size:
-   
-      Number of Blocks generated is less in case of Local Dictionary as compression ratio is high. This may reduce the number of tasks launched during query, resulting in degradation of query performance if the pruned blocks are less compared to the number of parallel tasks which can be run. So it is recommended to configure smaller block size which in turn generates more number of blocks.
-      
-### Example:
-
-   ```
-   CREATE TABLE carbontable(             
-     column1 string,             
-     column2 string,             
-     column3 LONG)
-   STORED AS carbondata
-   TBLPROPERTIES('LOCAL_DICTIONARY_ENABLE'='true','LOCAL_DICTIONARY_THRESHOLD'='1000',
-   'LOCAL_DICTIONARY_INCLUDE'='column1','LOCAL_DICTIONARY_EXCLUDE'='column2')
-   ```
-
-   **NOTE:** 
-
-   * We recommend to use Local Dictionary when cardinality is high but is distributed across multiple loads
-   * On a large cluster, decoding data can become a bottleneck for global dictionary as there will be many remote reads. In this scenario, it is better to use Local Dictionary.
-   * When cardinality is less, but loads are repetitive, it is better to use global dictionary as local dictionary generates multiple dictionary files at blocklet level increasing redundancy.
 
    - ##### Caching Min/Max Value for Required Columns