Posted to commits@carbondata.apache.org by ra...@apache.org on 2018/02/03 19:03:50 UTC

carbondata git commit: [HOTFIX] Some basic fix for 1.3.0 release

Repository: carbondata
Updated Branches:
  refs/heads/master 4a2a2d1b7 -> fa6cd8d58


[HOTFIX] Some basic fix for 1.3.0 release

This closes #1924


Project: http://git-wip-us.apache.org/repos/asf/carbondata/repo
Commit: http://git-wip-us.apache.org/repos/asf/carbondata/commit/fa6cd8d5
Tree: http://git-wip-us.apache.org/repos/asf/carbondata/tree/fa6cd8d5
Diff: http://git-wip-us.apache.org/repos/asf/carbondata/diff/fa6cd8d5

Branch: refs/heads/master
Commit: fa6cd8d58632357cd29731d59398d1a43b282447
Parents: 4a2a2d1
Author: chenliang613 <ch...@huawei.com>
Authored: Sat Feb 3 21:06:55 2018 +0800
Committer: ravipesala <ra...@gmail.com>
Committed: Sun Feb 4 00:33:13 2018 +0530

----------------------------------------------------------------------
 docs/configuration-parameters.md                |   2 +-
 docs/data-management-on-carbondata.md           | 216 ++++++++-----------
 .../examples/StandardPartitionExample.scala     |  11 +-
 integration/spark2/pom.xml                      |   3 +
 4 files changed, 107 insertions(+), 125 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/carbondata/blob/fa6cd8d5/docs/configuration-parameters.md
----------------------------------------------------------------------
diff --git a/docs/configuration-parameters.md b/docs/configuration-parameters.md
index 621574d..91f6cf5 100644
--- a/docs/configuration-parameters.md
+++ b/docs/configuration-parameters.md
@@ -61,7 +61,7 @@ This section provides the details of all the configurations required for CarbonD
 | carbon.options.bad.record.path |  | Specifies the HDFS path where bad records are stored. By default the value is Null. This path must be configured by the user if the bad record logger is enabled or the bad record action is set to redirect. | |
 | carbon.enable.vector.reader | true | This parameter increases the performance of select queries as it fetches a columnar batch of 4*1024 rows instead of fetching data row by row. | |
 | carbon.blockletgroup.size.in.mb | 64 MB | Data is read as a group of blocklets, called a blocklet group. This parameter specifies the size of the blocklet group. A higher value results in better sequential IO access. The minimum value is 16 MB; any value lower than 16 MB will reset to the default value (64 MB). |  |
-| carbon.task.distribution | block | **block**: Setting this value will launch one task per block. This setting is suggested in case of concurrent queries and queries having big shuffling scenarios. **custom**: Setting this value will group the blocks and distribute it uniformly to the available resources in the cluster. This enhances the query performance but not suggested in case of concurrent queries and queries having big shuffling scenarios. **blocklet**: Setting this value will launch one task per blocklet. This setting is suggested in case of concurrent queries and queries having big shuffling scenarios. **merge_small_files**: Setting this value will merge all the small partitions to a size of (128 MB) during querying. The small partitions are combined to a map task to reduce the number of read task. This enhances the performance. | | 
+| carbon.task.distribution | block | **block**: Setting this value will launch one task per block. This setting is suggested in case of concurrent queries and queries having big shuffling scenarios. **custom**: Setting this value will group the blocks and distribute them uniformly to the available resources in the cluster. This enhances query performance but is not suggested in case of concurrent queries and queries having big shuffling scenarios. **blocklet**: Setting this value will launch one task per blocklet. This setting is suggested in case of concurrent queries and queries having big shuffling scenarios. **merge_small_files**: Setting this value will merge all the small partitions to a size of 128 MB (the default value of "spark.sql.files.maxPartitionBytes", which is configurable) during querying. The small partitions are combined into a map task to reduce the number of read tasks. This enhances the performance. | | 
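
For context, `merge_small_files` keys off Spark's split size. Below is a minimal sketch, not part of this commit, of how these two knobs might be set from a Spark application (it assumes a running `SparkSession` named `spark` with CarbonData on the classpath):

```scala
import org.apache.carbondata.core.util.CarbonProperties

// Group small files into fewer read tasks (see carbon.task.distribution above).
CarbonProperties.getInstance()
  .addProperty("carbon.task.distribution", "merge_small_files")

// merge_small_files folds small partitions up to the Spark split size,
// which defaults to 128 MB and comes from spark.sql.files.maxPartitionBytes.
spark.conf.set("spark.sql.files.maxPartitionBytes", 256L * 1024 * 1024)
```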
 
 * **Compaction Configuration**
   

http://git-wip-us.apache.org/repos/asf/carbondata/blob/fa6cd8d5/docs/data-management-on-carbondata.md
----------------------------------------------------------------------
diff --git a/docs/data-management-on-carbondata.md b/docs/data-management-on-carbondata.md
index 3acb711..9bb6c20 100644
--- a/docs/data-management-on-carbondata.md
+++ b/docs/data-management-on-carbondata.md
@@ -26,8 +26,7 @@ This tutorial is going to introduce all commands and data operations on CarbonDa
 * [UPDATE AND DELETE](#update-and-delete)
 * [COMPACTION](#compaction)
 * [PARTITION](#partition)
-* [HIVE STANDARD PARTITION](#hive-standard-partition)
-* [PRE-AGGREGATE TABLES](#agg-tables)
+* [PRE-AGGREGATE TABLES](#pre-aggregate-tables)
 * [BUCKETING](#bucketing)
 * [SEGMENT MANAGEMENT](#segment-management)
 
@@ -54,8 +53,6 @@ This tutorial is going to introduce all commands and data operations on CarbonDa
      ```
      TBLPROPERTIES ('DICTIONARY_INCLUDE'='column1, column2')
 	 ```
-     
-	 NOTE: DICTIONARY_EXCLUDE supports only int, string, timestamp, long, bigint, and varchar data types.
 	 
    - **Inverted Index Configuration**
 
@@ -603,34 +600,109 @@ This tutorial is going to introduce all commands and data operations on CarbonDa
   CLEAN FILES FOR TABLE carbon_table
   ```
 
-## STANDARD PARTITION
+## PARTITION
+
+### STANDARD PARTITION
+
+  This partition is similar to Spark and Hive partitioning; any column can be used to build the partition:
+  
+#### Create Partition Table
 
-  The partition is same as Spark, the creation partition command as below:
+  This command allows you to create a table with partitions.
   
   ```
-  CREATE TABLE [IF NOT EXISTS] [db_name.]table_name
-                    [(col_name data_type , ...)]
-  PARTITIONED BY (partition_col_name data_type)
+  CREATE TABLE [IF NOT EXISTS] [db_name.]table_name 
+    [(col_name data_type , ...)]
+    [COMMENT table_comment]
+    [PARTITIONED BY (col_name data_type , ...)]
+    [STORED BY file_format]
+    [TBLPROPERTIES (property_name=property_value, ...)]
+  ```
+  
+  Example:
+  ```
+   CREATE TABLE IF NOT EXISTS productSchema.productSalesTable (
+                                productNumber Int,
+                                productName String,
+                                storeCity String,
+                                storeProvince String,
+                                saleQuantity Int,
+                                revenue Int)
+  PARTITIONED BY (productCategory String, productBatch String)
   STORED BY 'carbondata'
-  [TBLPROPERTIES (property_name=property_value, ...)]
   ```
+		
+#### Load Data Using Static Partition 
+
+  This command allows you to load data using a static partition.
+  
+  ```
+  LOAD DATA [LOCAL] INPATH 'folder_path' 
+    INTO TABLE [db_name.]table_name PARTITION (partition_spec) 
+    OPTIONS(property_name=property_value, ...)
+  INSERT INTO TABLE [db_name.]table_name PARTITION (partition_spec) SELECT STATEMENT
+  ```
+  
+  Example:
+  ```
+  LOAD DATA LOCAL INPATH '${env:HOME}/staticinput.txt'
+    INTO TABLE locationTable
+    PARTITION (country = 'US', state = 'CA')
+    
+  INSERT INTO TABLE locationTable
+    PARTITION (country = 'US', state = 'AL')
+    SELECT * FROM another_user au 
+    WHERE au.country = 'US' AND au.state = 'AL';
+  ```
+
+#### Load Data Using Dynamic Partition
+
+  This command allows you to load data using a dynamic partition. If the partition spec is not specified, the partition is considered dynamic.
 
   Example:
   ```
-  CREATE TABLE partitiontable0
-                  (id Int,
-                  vin String,
-                  phonenumber Long,
-                  area String,
-                  salary Int)
-                  PARTITIONED BY (country String)
-                  STORED BY 'org.apache.carbondata.format'
-                  TBLPROPERTIES('SORT_COLUMNS'='id,vin')
-                  )
+  LOAD DATA LOCAL INPATH '${env:HOME}/staticinput.txt'
+    INTO TABLE locationTable
+          
+  INSERT INTO TABLE locationTable
+    SELECT * FROM another_user au 
+    WHERE au.country = 'US' AND au.state = 'AL';
   ```
 
+#### Show Partitions
+
+  This command gets the Hive partition information of the table.
 
-## CARBONDATA PARTITION(HASH,RANGE,LIST)
+  ```
+  SHOW PARTITIONS [db_name.]table_name
+  ```
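
A usage sketch (not part of this commit), against the `locationTable` created in the static-partition example above:

```scala
// List the Hive partitions of the table, e.g. country=US/state=CA.
spark.sql("SHOW PARTITIONS locationTable").show()
```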
+
+#### Drop Partition
+
+  This command drops the specified Hive partition only.
+  ```
+  ALTER TABLE table_name DROP [IF EXISTS] (PARTITION part_spec, ...)
+  ```
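
A usage sketch (not part of this commit), again assuming the `locationTable` from the examples above:

```scala
// Drop one Hive partition; rows in all other partitions are untouched.
spark.sql("ALTER TABLE locationTable DROP IF EXISTS PARTITION (country = 'US', state = 'CA')")
```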
+
+#### Insert OVERWRITE
+  
+  This command allows you to insert or load data, overwriting a specific partition.
+  
+  ```
+   INSERT OVERWRITE TABLE table_name
+    PARTITION (column = 'partition_name')
+    select_statement
+  ```
+  
+  Example:
+  ```
+  INSERT OVERWRITE TABLE partitioned_user
+    PARTITION (country = 'US')
+    SELECT * FROM another_user au 
+    WHERE au.country = 'US';
+  ```
+
+### CARBONDATA PARTITION (HASH, RANGE, LIST) -- Alpha feature; this partition type does not support updating and deleting data.
 
  The partition supports three types (Hash, Range, List). Similar to other systems' partition features, CarbonData's partition feature can be used to improve query performance by filtering on the partition column.
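
The full create syntax for the three types is unchanged by this hunk; as a reminder, a hash-partitioned table is declared roughly like this (a sketch recalled from the surrounding document, not part of this diff):

```scala
// Hash partition: rows are spread across NUM_PARTITIONS buckets
// computed from the partition column.
spark.sql(
  s"""
     | CREATE TABLE IF NOT EXISTS hash_partition_table(
     |   id Int,
     |   vin String,
     |   area String)
     | PARTITIONED BY (phonenumber Long)
     | STORED BY 'carbondata'
     | TBLPROPERTIES('PARTITION_TYPE'='HASH', 'NUM_PARTITIONS'='9')
   """.stripMargin)
```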
 
@@ -766,106 +838,6 @@ This tutorial is going to introduce all commands and data operations on CarbonDa
  * The partitioned column can be excluded from SORT_COLUMNS; this will let other columns do the efficient sorting.
   * When writing SQL on a partition table, try to use filters on the partition column.
 
-## HIVE STANDARD PARTITION
-
-  Carbon supports the partition which is custom implemented by carbon but due to compatibility issue does not allow you to use the feature of Hive. By using this function, you can use the feature available in Hive.
-
-### Create Partition Table
-
-  This command allows you to create table with partition.
-  
-  ```
-  CREATE TABLE [IF NOT EXISTS] [db_name.]table_name 
-    [(col_name data_type , ...)]
-    [COMMENT table_comment]
-    [PARTITIONED BY (col_name data_type , ...)]
-    [STORED BY file_format]
-    [TBLPROPERTIES (property_name=property_value, ...)]
-    [AS select_statement];
-  ```
-  
-  Example:
-  ```
-   CREATE TABLE IF NOT EXISTS productSchema.productSalesTable (
-                                productNumber Int,
-                                productName String,
-                                storeCity String,
-                                storeProvince String,
-                                saleQuantity Int,
-                                revenue Int)
-  PARTITIONED BY (productCategory String, productBatch String)
-  STORED BY 'carbondata'
-  ```
-		
-### Load Data Using Static Partition
-
-  This command allows you to load data using static partition.
-  
-  ```
-  LOAD DATA [LOCAL] INPATH 'folder_path' 
-    INTO TABLE [db_name.]table_name PARTITION (partition_spec) 
-    OPTIONS(property_name=property_value, ...)
-  NSERT INTO INTO TABLE [db_name.]table_name PARTITION (partition_spec) SELECT STATMENT 
-  ```
-  
-  Example:
-  ```
-  LOAD DATA LOCAL INPATH '${env:HOME}/staticinput.txt'
-    INTO TABLE locationTable
-    PARTITION (country = 'US', state = 'CA')
-    
-  INSERT INTO TABLE locationTable
-    PARTITION (country = 'US', state = 'AL')
-    SELECT * FROM another_user au 
-    WHERE au.country = 'US' AND au.state = 'AL';
-  ```
-
-### Load Data Using Dynamic Partition
-
-  This command allows you to load data using dynamic partition. If partition spec is not specified, then the partition is considered as dynamic.
-
-  Example:
-  ```
-  LOAD DATA LOCAL INPATH '${env:HOME}/staticinput.txt'
-    INTO TABLE locationTable
-          
-  INSERT INTO TABLE locationTable
-    SELECT * FROM another_user au 
-    WHERE au.country = 'US' AND au.state = 'AL';
-  ```
-
-### Show Partitions
-
-  This command gets the Hive partition information of the table
-
-  ```
-  SHOW PARTITIONS [db_name.]table_name
-  ```
-
-### Drop Partition
-
-  This command drops the specified Hive partition only.
-  ```
-  ALTER TABLE table_name DROP [IF EXISTS] (PARTITION part_spec, ...)
-  ```
-
-### Insert OVERWRITE
-  
-  This command allows you to insert or load overwrite on a spcific partition.
-  
-  ```
-   INSERT OVERWRITE TABLE table_name
-    PARTITION (column = 'partition_name')
-    select_statement
-  ```
-  
-  Example:
-  ```
-  INSERT OVERWRITE TABLE partitioned_user
-    PARTITION (country = 'US')
-    SELECT * FROM another_user au 
-    WHERE au.country = 'US';
-  ```
 
 ## PRE-AGGREGATE TABLES
  Carbondata supports pre-aggregating of data so that OLAP kind of queries can fetch data 
@@ -989,7 +961,7 @@ This functionality is not supported.
  before Alter Operations can be performed on the main table. Pre-aggregate tables can be rebuilt 
  manually after Alter Table operations are completed.
   
-### Supporting timeseries data
+### Supporting timeseries data (Alpha feature in 1.3.0)
 Carbondata has built-in understanding of time hierarchy and levels: year, month, day, hour, minute.
 Multiple pre-aggregate tables can be created for the hierarchy and Carbondata can do automatic 
 roll-up for the queries on these hierarchies.
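
For reference, a pre-aggregate table is declared through the datamap DDL; a minimal sketch follows (table and column names are illustrative, not from this commit):

```scala
// A pre-aggregate datamap on a hypothetical sales table; queries grouping
// by country can be transparently rewritten to the smaller aggregate table.
spark.sql(
  s"""
     | CREATE DATAMAP agg_sales
     | ON TABLE sales
     | USING 'preaggregate'
     | AS SELECT country, sum(quantity)
     |    FROM sales
     |    GROUP BY country
   """.stripMargin)
```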

http://git-wip-us.apache.org/repos/asf/carbondata/blob/fa6cd8d5/examples/spark2/src/main/scala/org/apache/carbondata/examples/StandardPartitionExample.scala
----------------------------------------------------------------------
diff --git a/examples/spark2/src/main/scala/org/apache/carbondata/examples/StandardPartitionExample.scala b/examples/spark2/src/main/scala/org/apache/carbondata/examples/StandardPartitionExample.scala
index 1126ecc..20570a2 100644
--- a/examples/spark2/src/main/scala/org/apache/carbondata/examples/StandardPartitionExample.scala
+++ b/examples/spark2/src/main/scala/org/apache/carbondata/examples/StandardPartitionExample.scala
@@ -56,7 +56,14 @@ object StandardPartitionExample {
 
     spark.sql(
       s"""
-         | SELECT country,id,vin,phonenumver,area,salary
+         | SELECT country,id,vin,phonenumber,area,salary
+         | FROM partitiontable0
+      """.stripMargin).show()
+
+    spark.sql("UPDATE partitiontable0 SET (salary) = (88888) WHERE country='UK'").show()
+    spark.sql(
+      s"""
+         | SELECT country,id,vin,phonenumber,area,salary
          | FROM partitiontable0
       """.stripMargin).show()
 
@@ -66,7 +73,7 @@ object StandardPartitionExample {
     import scala.util.Random
     import spark.implicits._
     val r = new Random()
-    val df = spark.sparkContext.parallelize(1 to 10 * 1000 * 10)
+    val df = spark.sparkContext.parallelize(1 to 10 * 100 * 1000)
       .map(x => ("No." + r.nextInt(1000), "country" + x % 8, "city" + x % 50, x % 300))
       .toDF("ID", "country", "city", "population")
 

http://git-wip-us.apache.org/repos/asf/carbondata/blob/fa6cd8d5/integration/spark2/pom.xml
----------------------------------------------------------------------
diff --git a/integration/spark2/pom.xml b/integration/spark2/pom.xml
index 60cb61f..9edb50e 100644
--- a/integration/spark2/pom.xml
+++ b/integration/spark2/pom.xml
@@ -209,6 +209,9 @@
     </profile>
     <profile>
     <id>spark-2.2</id>
+    <activation>
+      <activeByDefault>true</activeByDefault>
+    </activation>
     <properties>
       <spark.version>2.2.1</spark.version>
       <scala.binary.version>2.11</scala.binary.version>