Posted to commits@carbondata.apache.org by ak...@apache.org on 2020/05/06 15:59:10 UTC

[carbondata] branch master updated: [CARBONDATA-3791] Fix documentation for various features

This is an automated email from the ASF dual-hosted git repository.

akashrn5 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/carbondata.git


The following commit(s) were added to refs/heads/master by this push:
     new 9122342  [CARBONDATA-3791] Fix documentation for various features
9122342 is described below

commit 9122342bdad83b50370435357c28aab0d51a8970
Author: kunal642 <ku...@gmail.com>
AuthorDate: Sun May 3 21:43:37 2020 +0530

    [CARBONDATA-3791] Fix documentation for various features
    
    Why is this PR needed?
    Fix documentation for various features
    
    What changes were proposed in this PR?
    1. Added write with hive doc
    2. Added alter upgrade segment doc
    3. Fix other random issues
    
    Does this PR introduce any user interface change?
    No
    
    Is any new testcase added?
    No
    
    This closes #3738
---
 docs/ddl-of-carbondata.md | 32 +++++++++++++++-----------------
 docs/hive-guide.md        | 22 +++++++++++++++-------
 docs/index-server.md      | 25 ++++++++++++++++---------
 3 files changed, 46 insertions(+), 33 deletions(-)

diff --git a/docs/ddl-of-carbondata.md b/docs/ddl-of-carbondata.md
index 84b18f3..3165f4e 100644
--- a/docs/ddl-of-carbondata.md
+++ b/docs/ddl-of-carbondata.md
@@ -20,7 +20,6 @@
 CarbonData DDL statements are documented here,which includes:
 
 * [CREATE TABLE](#create-table)
-  * [Dictionary Encoding](#dictionary-encoding-configuration)
   * [Local Dictionary](#local-dictionary-configuration)
   * [Inverted Index](#inverted-index-configuration)
   * [Sort Columns](#sort-columns-configuration)
@@ -31,7 +30,7 @@ CarbonData DDL statements are documented here,which includes:
   * [Caching Column Min/Max](#caching-minmax-value-for-required-columns)
   * [Caching Level](#caching-at-block-or-blocklet-level)
   * [Hive/Parquet folder Structure](#support-flat-folder-same-as-hiveparquet)
-  * [Extra Long String columns](#string-longer-than-32000-characters)
+  * [Long String columns](#string-longer-than-32000-characters)
   * [Compression for Table](#compression-for-table)
   * [Bad Records Path](#bad-records-path) 
   * [Load Minimum Input File Size](#load-minimum-data-size)
@@ -115,7 +114,7 @@ CarbonData DDL statements are documented here,which includes:
 
    - ##### Local Dictionary Configuration
 
-   Columns for which dictionary is not generated needs more storage space and in turn more IO. Also since more data will have to be read during query, query performance also would suffer.Generating dictionary per blocklet for such columns would help in saving storage space and assist in improving query performance as carbondata is optimized for handling dictionary encoded columns more effectively.Generating dictionary internally per blocklet is termed as local dictionary. Please refer to [...]
+   Columns for which a dictionary is not generated need more storage space and, in turn, more IO. Also, since more data has to be read during queries, query performance suffers. Generating a dictionary per blocklet for such columns helps save storage space and improves query performance, as CarbonData is optimized for handling dictionary encoded columns more effectively. Generating a dictionary internally per blocklet is termed a local dictionary. Please refer t [...]
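+
+   For example, local dictionary can be enabled and tuned per column through table properties. This is a minimal sketch; the table and column names are only illustrative:
+
+   ```
+   CREATE TABLE carbontable(
+     column1 STRING,
+     column2 STRING,
+     column3 INT)
+   STORED AS carbondata
+   TBLPROPERTIES('LOCAL_DICTIONARY_ENABLE'='true',
+                 'LOCAL_DICTIONARY_INCLUDE'='column1',
+                 'LOCAL_DICTIONARY_EXCLUDE'='column2')
+   ```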
 
    Local Dictionary helps in:
    1. Getting more compression.
@@ -200,7 +199,7 @@ CarbonData DDL statements are documented here,which includes:
      **NOTE**: Columns specified in INVERTED_INDEX should also be present in SORT_COLUMNS.
 
      ```
-     TBLPROPERTIES ('SORT_COLUMNS'='column2,column3','NO_INVERTED_INDEX'='column1', 'INVERTED_INDEX'='column2, column3')
+     TBLPROPERTIES ('SORT_COLUMNS'='column2,column3', 'INVERTED_INDEX'='column2, column3')
      ```
 
    - ##### Sort Columns Configuration
@@ -215,7 +214,7 @@ CarbonData DDL statements are documented here,which includes:
      TBLPROPERTIES ('SORT_COLUMNS'='column1, column3')
      ```
 
-     **NOTE**: Sort_Columns for Complex datatype columns and binary data type is not supported.
+     **NOTE**: Sort_Columns is not supported for complex datatype, binary, double, float, and decimal columns.
 
    - ##### Sort Scope Configuration
    
@@ -240,7 +239,7 @@ CarbonData DDL statements are documented here,which includes:
      revenue INT)
    STORED AS carbondata
    TBLPROPERTIES ('SORT_COLUMNS'='productName,storeCity',
-                  'SORT_SCOPE'='NO_SORT')
+                  'SORT_SCOPE'='LOCAL_SORT')
    ```
 
    **NOTE:** CarbonData also supports "using carbondata". Find example code at [SparkSessionExample](https://github.com/apache/carbondata/blob/master/examples/spark/src/main/scala/org/apache/carbondata/examples/SparkSessionExample.scala) in the CarbonData repo.
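+
+   A minimal sketch of the "using carbondata" form mentioned above (the table name is only illustrative; the columns reuse the example above):
+
+   ```
+   CREATE TABLE IF NOT EXISTS carbon_table(
+     productName STRING,
+     storeCity STRING,
+     revenue INT)
+   USING carbondata
+   ```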
@@ -453,11 +452,11 @@ CarbonData DDL statements are documented here,which includes:
    - ##### Compression for table
 
      Data compression is also supported by CarbonData.
-     By default, Snappy is used to compress the data. CarbonData also supports ZSTD compressor.
+     By default, Snappy is used to compress the data. CarbonData also supports ZSTD and GZIP compressors.
+     
      User can specify the compressor in the table property:
-
      ```
-     TBLPROPERTIES('carbon.column.compressor'='snappy')
+     TBLPROPERTIES('carbon.column.compressor'='GZIP')
      ```
      or
      ```
@@ -588,7 +587,7 @@ CarbonData DDL statements are documented here,which includes:
          | STORED AS carbondata
          | LOCATION '$storeLocation/origin'
       """.stripMargin)
-  checkAnswer(sql("SELECT count(*) from source"), sql("SELECT count(*) from origin"))
+  sql("SELECT count(*) from source").show()
   ```
 
 ### Create external table on Non-Transactional table data location.
@@ -608,12 +607,10 @@ CarbonData DDL statements are documented here,which includes:
   This can be SDK output or C++ SDK output. Refer [SDK Guide](./sdk-guide.md) and [C++ SDK Guide](./csdk-guide.md). 
 
   **Note:**
-  1. Dropping of the external table should not delete the files present in the location.
+  1. Dropping the external table will not delete the files present in the location.
   2. When external table is created on non-transactional table data, 
     external table will be registered with the schema of carbondata files.
-    If multiple files with different schema is present, exception will be thrown.
-    So, If table registered with one schema and files are of different schema, 
-    suggest to drop the external table and create again to register table with new schema.  
+    If multiple files have the same column with different datatypes, an exception will be thrown.  
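+
+  As an illustration, an external table can be created over an SDK writer output folder and will pick up the schema from the carbondata files there (a minimal sketch; the table name and location are placeholders):
+
+  ```
+  CREATE EXTERNAL TABLE sdk_output_table STORED AS carbondata
+  LOCATION '../carbon-sdk-writer-output'
+  ```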
 
 
 ## CREATE DATABASE 
@@ -680,6 +677,7 @@ CarbonData DDL statements are documented here,which includes:
       **NOTE:** Add Complex datatype columns is not supported.
 
 Users can specify which columns to include and exclude for local dictionary generation after adding new columns. These will be appended with the already existing local dictionary include and exclude columns of main table respectively.
+     
      ```
      ALTER TABLE carbon ADD COLUMNS (a1 STRING, b1 STRING) TBLPROPERTIES('LOCAL_DICTIONARY_INCLUDE'='a1','LOCAL_DICTIONARY_EXCLUDE'='b1')
      ```
@@ -1038,7 +1036,7 @@ Users can specify which columns to include and exclude for local dictionary gene
   ``` 
   
   This shows the overall memory consumed in the cache by categories - index files, dictionary and 
-  datamaps. This also shows the cache usage by all the tables and children tables in the current 
+  indexes. This also shows the cache usage by all the tables and children tables in the current 
   database.
   
    ```sql
@@ -1054,7 +1052,7 @@ Users can specify which columns to include and exclude for local dictionary gene
   ```
   
   This shows detailed information on cache usage by the table `tableName` and its carbonindex files, 
-  its dictionary files, its datamaps and children tables.
+  its dictionary files, its indexes and children tables.
   
   This command is not allowed on child tables.
 
@@ -1063,7 +1061,7 @@ Users can specify which columns to include and exclude for local dictionary gene
    ```
     
   This clears any entry in cache by the table `tableName`, its carbonindex files, 
-  its dictionary files, its datamaps and children tables.
+  its dictionary files, its indexes and children tables.
     
   This command is not allowed on child tables.
 
diff --git a/docs/hive-guide.md b/docs/hive-guide.md
index 1941168..982ee03 100644
--- a/docs/hive-guide.md
+++ b/docs/hive-guide.md
@@ -64,7 +64,7 @@ carbon.sql("LOAD DATA INPATH '<hdfs store path>/sample.csv' INTO TABLE hive_carb
 scala>carbon.sql("SELECT * FROM hive_carbon").show()
 ```
 
-## Query Data in Hive
+## Configure Carbon in Hive
 ### Configure hive classpath
 ```
 mkdir hive/auxlibs/
@@ -93,6 +93,17 @@ Carbon Jars to be copied to the above paths.
 $HIVE_HOME/bin/beeline
 ```
 
+### Write data from hive
+
+ - Write data from hive into carbondata format.
+ 
+ ```
+create table hive_carbon(id int, name string, scale decimal, country string, salary double) stored by 'org.apache.carbondata.hive.CarbonStorageHandler';
+insert into hive_carbon select * from parquetTable;
+```
+
+**Note**: Only non-transactional tables are supported when created through hive. This means that the standard carbon folder structure would not be followed and all files would be written in a flat folder structure.
+
 ### Query data from hive
 
  - This is to read the carbon table through Hive. It is the integration of the carbon with Hive.
@@ -105,13 +116,10 @@ These properties helps to recursively traverse through the directories to read t
 
 ### Example
 ```
- - In case if the carbon table is not set with the SERDE and the INPUTFORMAT/OUTPUTFORMAT, user can create a new hive managed table like below with the required details for the hive to read.
-create table hive_carbon_1(id int, name string, scale decimal, country string, salary double) ROW FORMAT SERDE 'org.apache.carbondata.hive.CarbonHiveSerDe' WITH SERDEPROPERTIES ('mapreduce.input.carboninputformat.databaseName'='default', 'mapreduce.input.carboninputformat.tableName'='HIVE_CARBON_EXAMPLE') STORED AS INPUTFORMAT 'org.apache.carbondata.hive.MapredCarbonInputFormat' OUTPUTFORMAT 'org.apache.carbondata.hive.MapredCarbonOutputFormat' LOCATION 'location_to_the_carbon_table';
-
  - Query the table
-select * from hive_carbon_1;
-select count(*) from hive_carbon_1;
-select * from hive_carbon_1 order by id;
+select * from hive_carbon;
+select count(*) from hive_carbon;
+select * from hive_carbon order by id;
 ```
 
 ### Note
diff --git a/docs/index-server.md b/docs/index-server.md
index 62e239d..6dde633 100644
--- a/docs/index-server.md
+++ b/docs/index-server.md
@@ -19,9 +19,8 @@
 
 ## Background
 
-Carbon currently prunes and caches all block/blocklet datamap index information into the driver for
-normal table, for Bloom/Index datamaps the JDBC driver will launch a job to prune and cache the
-datamaps in executors.
+Carbon currently prunes and caches all block/blocklet index information in the driver for
+normal tables. For Bloom/Lucene indexes, the JDBC driver will launch a job to prune and cache them in the executors.
 
 This causes the driver to become a bottleneck in the following ways:
 1. If the cache size becomes huge(70-80% of the driver memory) then there can be excessive GC in
@@ -52,8 +51,7 @@ This mapping will be maintained for each table and will enable the index server
 cache location for each segment.
 
 2. Cache size held by each executor: 
-    This mapping will be used to distribute the segments equally(on the basis of size) among the 
-    executors.
+This mapping will be used to distribute the segments equally (on the basis of size) among the executors.
   
 Once a request is received each segment would be iterated over and
 checked against tableToExecutorMapping to find if a executor is already
@@ -82,6 +80,15 @@ the pruned blocklets which would be further used for result fetching.
 
 **Note:** Multiple JDBC drivers can connect to the index server to use the cache.
 
+## Enabling Size based distribution for Legacy stores
+The default round-robin distribution can distribute the cache unevenly among the executors, which can leave any one executor bloated with too much cache, resulting in performance degradation.
+This problem can be solved by running the `upgrade_segment` command, which fills in the data size values for each segment in the tablestatus file. Any cache loaded after this can use size-based distribution.
+
+#### Example
+```
+alter table table1 compact 'upgrade_segment';
+```
+
 ## Reallocation of executor
 In case executor(s) become dead/unavailable then the segments that were
 earlier being handled by those would be reassigned to some other
@@ -102,7 +109,7 @@ In case of any failure the index server would fallback to embedded mode
 which means that the JDBCServer would take care of distributed pruning.
 A similar job would be fired by the JDBCServer which would take care of
 pruning using its own executors. If for any reason the embedded mode
-also fails to prune the datamaps then the job would be passed on to
+also fails to prune the indexes then the job would be passed on to
 driver.
 
 **NOTE:** In case of embedded mode a job would be fired after pruning to clear the
@@ -120,7 +127,7 @@ The user can set the location for these files by using 'carbon.indexserver.temp.
 the files are written in the path /tmp/indexservertmp.
 
 ## Prepriming
-As each query is responsible for caching the pruned datamaps, thus a lot of execution time is wasted in reading the 
+As each query is responsible for caching the pruned indexes, a lot of execution time is wasted in reading the 
 files and caching the datmaps for the first query.
 To avoid this problem we have introduced Pre-Priming which allows each data manipulation command like load, insert etc 
 to fire a request to the index server to load the corresponding segments into the index server.
@@ -152,11 +159,11 @@ The user can enable prepriming by using 'carbon.indexserver.enable.prepriming' =
 | carbon.index.server.ip |    NA   |   Specify the IP/HOST on which the server would be started. Better to specify the private IP. | 
 | carbon.index.server.port | NA | The port on which the index server has to be started. |
 |carbon.index.server.max.worker.threads| 500 | Number of RPC handlers to open for accepting the requests from JDBC driver. Max accepted value is Integer.Max. Refer: [Hive configuration](https://github.com/apache/hive/blob/master/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L3441) |
-|carbon.max.executor.lru.cache.size|  NA | Maximum memory **(in MB)** upto which the executor process can cache the data (DataMaps and reverse dictionary values). Only integer values greater than 0 are accepted. **NOTE:** Mandatory for the user to set. |
+|carbon.max.executor.lru.cache.size|  NA | Maximum memory **(in MB)** up to which the executor process can cache the data (Indexes and reverse dictionary values). Only integer values greater than 0 are accepted. **NOTE:** Mandatory for the user to set. |
 |carbon.index.server.max.jobname.length|NA|The max length of the job to show in the index server application UI. For bigger queries this may impact performance as the whole string would be sent from JDBCServer to IndexServer.|
 |carbon.max.executor.threads.for.block.pruning|4| max executor threads used for block pruning. |
 |carbon.index.server.inmemory.serialization.threshold.inKB|300|Max in memory serialization size after reaching threshold data will be written to file. Min value that the user can set is 0KB and max is 102400KB. |
-|carbon.indexserver.temp.path|tablePath| The folder to write the split files if in memory datamap size for network transfers crossed the 'carbon.index.server.inmemory.serialization.threshold.inKB' limit.|
+|carbon.indexserver.temp.path|tablePath| The folder to write the split files to if the in-memory index cache size for network transfers crosses the 'carbon.index.server.inmemory.serialization.threshold.inKB' limit.|
 
 
 ##### spark-defaults.conf(only for secure mode)