Posted to commits@carbondata.apache.org by ku...@apache.org on 2018/09/26 10:44:01 UTC
carbondata git commit: [DOC] Add spark carbon file format documentation

Repository: carbondata
Updated Branches:
  refs/heads/master 13ecc9e7a -> d84cd817a


[DOC] Add spark carbon file format documentation

Add spark carbon file format documentation

This closes #2757


Project: http://git-wip-us.apache.org/repos/asf/carbondata/repo
Commit: http://git-wip-us.apache.org/repos/asf/carbondata/commit/d84cd817
Tree: http://git-wip-us.apache.org/repos/asf/carbondata/tree/d84cd817
Diff: http://git-wip-us.apache.org/repos/asf/carbondata/diff/d84cd817

Branch: refs/heads/master
Commit: d84cd817a8ed90e79c12dcee7334b6e52d030982
Parents: 13ecc9e
Author: ravipesala <ra...@gmail.com>
Authored: Tue Sep 25 16:19:16 2018 +0530
Committer: kunal642 <ku...@gmail.com>
Committed: Wed Sep 26 16:11:55 2018 +0530

----------------------------------------------------------------------
 docs/carbon-as-spark-datasource-guide.md | 100 ++++++++++++++++++++++++++
 1 file changed, 100 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/carbondata/blob/d84cd817/docs/carbon-as-spark-datasource-guide.md
----------------------------------------------------------------------
diff --git a/docs/carbon-as-spark-datasource-guide.md b/docs/carbon-as-spark-datasource-guide.md
new file mode 100644
index 0000000..1d286cf
--- /dev/null
+++ b/docs/carbon-as-spark-datasource-guide.md
@@ -0,0 +1,100 @@
+<!--
+    Licensed to the Apache Software Foundation (ASF) under one or more 
+    contributor license agreements.  See the NOTICE file distributed with
+    this work for additional information regarding copyright ownership. 
+    The ASF licenses this file to you under the Apache License, Version 2.0
+    (the "License"); you may not use this file except in compliance with 
+    the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+    
+    Unless required by applicable law or agreed to in writing, software 
+    distributed under the License is distributed on an "AS IS" BASIS, 
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and 
+    limitations under the License.
+-->
+
+# Carbon as Spark's datasource guide
+
+The Carbon file format can be integrated with Spark as a datasource to read and write data without using CarbonSession.
+
+
+# Create Table with DDL
+
+A Carbon table can be created with Spark's datasource DDL syntax as follows.
+
+```
+ CREATE [TEMPORARY] TABLE [IF NOT EXISTS] [db_name.]table_name
+     [(col_name1 col_type1 [COMMENT col_comment1], ...)]
+     USING carbon
+     [OPTIONS (key1=val1, key2=val2, ...)]
+     [PARTITIONED BY (col_name1, col_name2, ...)]
+     [CLUSTERED BY (col_name3, col_name4, ...) INTO num_buckets BUCKETS]
+     [LOCATION path]
+     [COMMENT table_comment]
+     [TBLPROPERTIES (key1=val1, key2=val2, ...)]
+     [AS select_statement]
+``` 
+
+## Supported OPTIONS
+
+| Property | Default Value | Description |
+|-----------|--------------|------------|
+| table_blocksize | 1024 | Size, in MB, of the blocks written to HDFS |
+| table_blocklet_size | 64 | Size, in MB, of the blocklets written |
+| local_dictionary_threshold | 10000 | Cardinality up to which the local dictionary can be generated |
+| local_dictionary_enable | false | Enable local dictionary generation |
+| sort_columns | all dimensions are sorted | Comma-separated string columns to include in the sort, listed in sort order |
+| sort_scope | local_sort | Sort scope of the load. Options include no sort, local sort, batch sort, and global sort |
+| long_string_columns | null | Comma-separated string columns whose values are longer than 32k |
+
+## Example 
+
+```
+ CREATE TABLE CARBON_TABLE (NAME STRING) USING CARBON OPTIONS('table_blocksize'='256')
+```
+
+Note: Only the features that a Spark datasource such as Parquet supports are available. CarbonSession features such as IUD and compaction are not supported.
+
+# Using DataFrame
+
+The Carbon format can also be used through the DataFrame API, as follows.
+
+Write Carbon data using a DataFrame:
+```
+df.write.format("carbon").save(path)
+```
+
+Read Carbon data using a DataFrame:
+```
+val df = spark.read.format("carbon").load(path)
+```
+
+## Example
+
+```
+import org.apache.spark.sql.SparkSession
+
+val spark = SparkSession
+  .builder()
+  .appName("Spark SQL basic example")
+  .config("spark.some.config.option", "some-value")
+  .getOrCreate()
+
+// For implicit conversions like converting RDDs to DataFrames
+import spark.implicits._
+val df = spark.sparkContext.parallelize(1 to 10 * 10 * 1000)
+  .map(x => (scala.util.Random.nextInt(100000), "name" + x % 8, "city" + x % 50, BigDecimal.apply(x % 60)))
+  .toDF("ID", "name", "city", "age")
+      
+// Write to carbon format      
+df.write.format("carbon").save("/user/person_table")
+
+// Read carbon using dataframe
+val dfread = spark.read.format("carbon").load("/user/person_table")
+dfread.show()
+```
+
+Reference: [list of carbon properties](./configuration-parameters.md)
+