Posted to commits@carbondata.apache.org by ra...@apache.org on 2019/01/30 10:39:11 UTC

[carbondata] 07/27: [CARBONDATA-3263] Update doc for RANGE_COLUMN

This is an automated email from the ASF dual-hosted git repository.

ravipesala pushed a commit to branch branch-1.5
in repository https://gitbox.apache.org/repos/asf/carbondata.git

commit 6dc581a575581fc1ae16c836f542a09b876b590b
Author: QiangCai <qi...@qq.com>
AuthorDate: Tue Jan 22 11:27:04 2019 +0800

    [CARBONDATA-3263] Update doc for RANGE_COLUMN
    
    Added documentation for range_column feature support
    
    This closes #3093
---
 docs/ddl-of-carbondata.md | 12 +++++++++++-
 docs/dml-of-carbondata.md | 23 ++++++++++++++++++++---
 2 files changed, 31 insertions(+), 4 deletions(-)

diff --git a/docs/ddl-of-carbondata.md b/docs/ddl-of-carbondata.md
index a1b0ce7..4f9e47b 100644
--- a/docs/ddl-of-carbondata.md
+++ b/docs/ddl-of-carbondata.md
@@ -34,7 +34,8 @@ CarbonData DDL statements are documented here,which includes:
   * [Extra Long String columns](#string-longer-than-32000-characters)
   * [Compression for Table](#compression-for-table)
   * [Bad Records Path](#bad-records-path) 
-  * [Load Minimum Input File Size](#load-minimum-data-size) 
+  * [Load Minimum Input File Size](#load-minimum-data-size)
+  * [Range Column](#range-column)
 
 * [CREATE TABLE AS SELECT](#create-table-as-select)
 * [CREATE EXTERNAL TABLE](#create-external-table)
@@ -109,6 +110,7 @@ CarbonData DDL statements are documented here,which includes:
 | [BUCKETNUMBER](#bucketing)                                   | Number of buckets to be created                              |
 | [BUCKETCOLUMNS](#bucketing)                                  | Columns which are to be placed in buckets                    |
 | [LOAD_MIN_SIZE_INMB](#load-minimum-data-size)                | Minimum input data size per node for data loading          |
+| [RANGE_COLUMN](#range-column)                                | Partition input data by range                              |
 
  Following are the guidelines for TBLPROPERTIES, CarbonData's additional table options can be set via carbon.properties.
 
@@ -495,6 +497,14 @@ CarbonData DDL statements are documented here,which includes:
      TBLPROPERTIES('LOAD_MIN_SIZE_INMB'='256')
      ```
 
+   - ##### Range Column
+     This property specifies a column used to partition the input data by range.
+     Only one column can be configured. During data loading, use "GLOBAL_SORT_PARTITIONS" or "SCALE_FACTOR" to avoid generating small files.
+
+     ```
+     TBLPROPERTIES('RANGE_COLUMN'='col1')
+     ```
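+
+     For example, a range column can be declared at table creation time. A minimal sketch (the table and column names below are illustrative, not from the original docs):
+
+     ```
+     CREATE TABLE sales_range (
+       order_id INT,
+       city STRING,
+       sale_amount DOUBLE
+     )
+     STORED AS carbondata
+     TBLPROPERTIES('RANGE_COLUMN'='order_id')
+     ```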
+
 ## CREATE TABLE AS SELECT
   This function allows user to create a Carbon table from any of the Parquet/Hive/Carbon table. This is beneficial when the user wants to create Carbon table from any other Parquet/Hive table and use the Carbon query engine to query and achieve better query results for cases where Carbon is faster than other file formats. Also this feature can be used for backing up the data.
 
diff --git a/docs/dml-of-carbondata.md b/docs/dml-of-carbondata.md
index d6e5932..b3fe517 100644
--- a/docs/dml-of-carbondata.md
+++ b/docs/dml-of-carbondata.md
@@ -66,7 +66,8 @@ CarbonData DML statements are documented here,which includes:
 | [BAD_RECORDS_ACTION](#bad-records-handling)             | Behavior of data loading when bad record is found            |
 | [IS_EMPTY_DATA_BAD_RECORD](#bad-records-handling)       | Whether empty data of a column to be considered as bad record or not |
 | [GLOBAL_SORT_PARTITIONS](#global_sort_partitions)       | Number of partition to use for shuffling of data during sorting |
-
+| [SCALE_FACTOR](#scale_factor)                           | Control the partition size for the RANGE_COLUMN feature      |
+
   You can use the following options to load data:
 
   - ##### DELIMITER: 
@@ -268,15 +269,31 @@ CarbonData DML statements are documented here,which includes:
   - ##### GLOBAL_SORT_PARTITIONS:
 
     If the SORT_SCOPE is defined as GLOBAL_SORT, then user can specify the number of partitions to use while shuffling data for sort using GLOBAL_SORT_PARTITIONS. If it is not configured, or configured less than 1, then it uses the number of map task as reduce task. It is recommended that each reduce task deal with 512MB-1GB data.
-
+    When RANGE_COLUMN is configured, GLOBAL_SORT_PARTITIONS is also used to specify the number of range partitions.
   ```
   OPTIONS('GLOBAL_SORT_PARTITIONS'='2')
   ```
 
-   NOTE:
+   **NOTE:**
    * GLOBAL_SORT_PARTITIONS should be Integer type, the range is [1,Integer.MaxValue].
    * It is only used when the SORT_SCOPE is GLOBAL_SORT.
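+
+   For illustration, a load that sets this option might look like the following sketch (the input path and table name are placeholders):
+
+   ```
+   LOAD DATA INPATH '/tmp/sample.csv' INTO TABLE test_table
+   OPTIONS('GLOBAL_SORT_PARTITIONS'='2')
+   ```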
 
+   - ##### SCALE_FACTOR:
+
+   For RANGE_COLUMN, SCALE_FACTOR is used to control the number of range partitions as follows:
+   ```
+     splitSize = max(blocklet_size, (block_size - blocklet_size)) * scale_factor
+     numPartitions = total size of input data / splitSize
+   ```
+   The default value is 3, and the range is [1, 300].
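+
+   As a worked example (assuming, purely for illustration, a 1024 MB block size and a 64 MB blocklet size), loading 100 GB of input with the default SCALE_FACTOR of 3 gives:
+   ```
+     splitSize     = max(64MB, (1024MB - 64MB)) * 3 = 2880MB
+     numPartitions = 102400MB / 2880MB ≈ 35
+   ```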
+
+   ```
+     OPTIONS('SCALE_FACTOR'='10')
+   ```
+   **NOTE:**
+   * If both GLOBAL_SORT_PARTITIONS and SCALE_FACTOR are specified at the same time, only GLOBAL_SORT_PARTITIONS takes effect.
+   * Compaction on a table with a RANGE_COLUMN will use LOCAL_SORT by default.
+
 ### INSERT DATA INTO CARBONDATA TABLE
 
   This command inserts data into a CarbonData table, it is defined as a combination of two queries Insert and Select query respectively.