Posted to commits@carbondata.apache.org by ch...@apache.org on 2018/09/07 16:54:07 UTC

[20/39] carbondata-site git commit: Handled comments

http://git-wip-us.apache.org/repos/asf/carbondata-site/blob/a51dc596/src/site/markdown/quick-start-guide.md
----------------------------------------------------------------------
diff --git a/src/site/markdown/quick-start-guide.md b/src/site/markdown/quick-start-guide.md
index 7ac5a3f..37c398c 100644
--- a/src/site/markdown/quick-start-guide.md
+++ b/src/site/markdown/quick-start-guide.md
@@ -19,9 +19,9 @@
 This tutorial provides a quick introduction to using CarbonData. To follow along with this guide, first download a packaged release of CarbonData from the [CarbonData website](https://dist.apache.org/repos/dist/release/carbondata/). Alternatively, it can be created by following the [Building CarbonData](https://github.com/apache/carbondata/tree/master/build) steps.
 
 ##  Prerequisites
-* Spark 2.2.1 version is installed and running.CarbonData supports Spark versions upto 2.2.1.Please follow steps described in [Spark docs website](https://spark.apache.org/docs/latest) for installing and running Spark.
+* CarbonData supports Spark versions up to 2.2.1. Please download the Spark package from the [Spark website](https://spark.apache.org/downloads.html)
 
-* Create a sample.csv file using the following commands. The CSV file is required for loading data into CarbonData.
+* Create a sample.csv file using the following commands. The CSV file is required for loading data into CarbonData.
 
   ```
   cd carbondata
@@ -33,7 +33,7 @@ This tutorial provides a quick introduction to using CarbonData.To follow along
   EOF
   ```
 
-## Deployment modes
+## Integration
 
 CarbonData can be integrated with the Spark and Presto execution engines. The documentation below explains how to install and configure CarbonData with these execution engines.
 
@@ -45,16 +45,13 @@ CarbonData can be integrated with Spark and Presto Execution Engines.The below d
 
 [Installing and Configuring CarbonData on Spark on YARN Cluster](#installing-and-configuring-carbondata-on-spark-on-yarn-cluster)
 
+[Installing and Configuring CarbonData Thrift Server for Query Execution](#query-execution-using-carbondata-thrift-server)
+
 
 ### Presto
 [Installing and Configuring CarbonData on Presto](#installing-and-configuring-carbondata-on-presto)
 
 
-## Querying Data
-
-[Query Execution using CarbonData Thrift Server](#query-execution-using-carbondata-thrift-server)
-
-## 
 
 ## Installing and Configuring CarbonData to run locally with Spark Shell
 
@@ -95,12 +92,12 @@ val carbon = SparkSession.builder().config(sc.getConf)
 
 ```
 scala>carbon.sql("CREATE TABLE
-                        IF NOT EXISTS test_table(
-                                  id string,
-                                  name string,
-                                  city string,
-                                  age Int)
-                       STORED BY 'carbondata'")
+                    IF NOT EXISTS test_table(
+                    id string,
+                    name string,
+                    city string,
+                    age Int)
+                  STORED AS carbondata")
 ```
 
 ###### Loading Data to a Table
@@ -296,8 +293,12 @@ hdfs://<host_name>:port/user/hive/warehouse/carbon.store
 
 ## Installing and Configuring CarbonData on Presto
 
+**NOTE:** **CarbonData tables cannot be created or loaded from Presto. Users need to create a CarbonData table and load data into it
+either with [Spark](#installing-and-configuring-carbondata-to-run-locally-with-spark-shell) or the [SDK](./sdk-guide.md).
+Once the table is created, it can be queried from Presto.**
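+
+For example, a table can be created and loaded from the Spark shell and then queried from Presto. A minimal sketch, reusing the sample.csv schema from the prerequisites (the table name and CSV path are illustrative):
+
+```
+CREATE TABLE IF NOT EXISTS carbon_table(
+  id string,
+  name string,
+  city string,
+  age Int)
+STORED AS carbondata
+
+LOAD DATA INPATH '/path/to/sample.csv' INTO TABLE carbon_table
+```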
 
-* ### Installing Presto
+
+### Installing Presto
 
  1. Download the 0.187 version of Presto using:
     `wget https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.187/presto-server-0.187.tar.gz`
@@ -429,9 +430,29 @@ select * from system.runtime.nodes;
 ```
 Now you can use the Presto CLI on the coordinator to query data sources in the catalog using the Presto workers.
 
+List the available schemas (databases)
+
+```
+show schemas;
+```
+
+Select the schema where the CarbonData table resides
+
+```
+use carbonschema;
+```
+
+List the available tables
+
+```
+show tables;
+```
+
+Query from the available tables
+
+```
+select * from carbon_table;
+```
+
 **Note:** Table creation and data loading should be done before executing queries, as carbon tables cannot be created from this interface.
 
-<script>
-// Show selected style on nav item
-$(function() { $('.b-nav__quickstart').addClass('selected'); });
-</script>

http://git-wip-us.apache.org/repos/asf/carbondata-site/blob/a51dc596/src/site/markdown/release-guide.md
----------------------------------------------------------------------
diff --git a/src/site/markdown/release-guide.md b/src/site/markdown/release-guide.md
index 40a9058..e626ccb 100644
--- a/src/site/markdown/release-guide.md
+++ b/src/site/markdown/release-guide.md
@@ -420,9 +420,3 @@ _Checklist to declare the process completed:_
 1. Release announced on the user@ mailing list.
 2. Release announced on the Incubator's general@ mailing list.
 3. Completion declared on the dev@ mailing list.
-
-
-<script>
-// Show selected style on nav item
-$(function() { $('.b-nav__release').addClass('selected'); });
-</script>

http://git-wip-us.apache.org/repos/asf/carbondata-site/blob/a51dc596/src/site/markdown/s3-guide.md
----------------------------------------------------------------------
diff --git a/src/site/markdown/s3-guide.md b/src/site/markdown/s3-guide.md
index 37f157c..a2e5f07 100644
--- a/src/site/markdown/s3-guide.md
+++ b/src/site/markdown/s3-guide.md
@@ -88,7 +88,3 @@ recommended to set the configurable lock path property([carbon.lock.path](./conf
  to a HDFS directory.
 2. Concurrent data manipulation operations are not supported. Object stores follow eventual consistency semantics, i.e., any put request might take some time to reflect when trying to list. Because of this behaviour, the data read may not be consistent or may not be the latest.
 
-<script>
-// Show selected style on nav item
-$(function() { $('.b-nav__s3').addClass('selected'); });
-</script>

http://git-wip-us.apache.org/repos/asf/carbondata-site/blob/a51dc596/src/site/markdown/sdk-guide.md
----------------------------------------------------------------------
diff --git a/src/site/markdown/sdk-guide.md b/src/site/markdown/sdk-guide.md
index 66f3d61..d786406 100644
--- a/src/site/markdown/sdk-guide.md
+++ b/src/site/markdown/sdk-guide.md
@@ -7,7 +7,7 @@
     the License.  You may obtain a copy of the License at
 
       http://www.apache.org/licenses/LICENSE-2.0
-
+    
     Unless required by applicable law or agreed to in writing, software 
     distributed under the License is distributed on an "AS IS" BASIS, 
     WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
@@ -16,8 +16,16 @@
 -->
 
 # SDK Guide
-In the carbon jars package, there exist a carbondata-store-sdk-x.x.x-SNAPSHOT.jar, including SDK writer and reader.
+
+CarbonData provides an SDK to facilitate
+
+1. [Writing carbondata files from other applications which do not use Spark](#sdk-writer)
+2. [Reading carbondata files from other applications which do not use Spark](#sdk-reader)
+
 # SDK Writer
+
+In the carbon jars package, there exists a carbondata-store-sdk-x.x.x-SNAPSHOT.jar, which includes the SDK writer and reader.
+
 This SDK writer writes a carbondata file and a carbonindex file at a given path.
 External clients can make use of this writer to convert other format data or live data into carbondata and index files.
 The SDK writer output contains just carbondata and carbonindex files. No metadata folder will be present.
@@ -867,8 +875,3 @@ public String getProperty(String key, String defaultValue);
 ```
 Reference : [list of carbon properties](./configuration-parameters.md)
 
-
-<script>
-// Show selected style on nav item
-$(function() { $('.b-nav__api').addClass('selected'); });
-</script>

http://git-wip-us.apache.org/repos/asf/carbondata-site/blob/a51dc596/src/site/markdown/segment-management-on-carbondata.md
----------------------------------------------------------------------
diff --git a/src/site/markdown/segment-management-on-carbondata.md b/src/site/markdown/segment-management-on-carbondata.md
index a519c88..fe0cbd4 100644
--- a/src/site/markdown/segment-management-on-carbondata.md
+++ b/src/site/markdown/segment-management-on-carbondata.md
@@ -140,15 +140,3 @@ concept which helps to maintain consistency of data and easy transaction managem
      }
    }
   ```
-
-
-<script>
-$(function() {
-  // Show selected style on nav item
-  $('.b-nav__docs').addClass('selected');
-  // Display docs subnav items
-  if (!$('.b-nav__docs').parent().hasClass('nav__item__with__subs--expanded')) {
-    $('.b-nav__docs').parent().toggleClass('nav__item__with__subs--expanded');
-  }
-});
-</script>
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/carbondata-site/blob/a51dc596/src/site/markdown/streaming-guide.md
----------------------------------------------------------------------
diff --git a/src/site/markdown/streaming-guide.md b/src/site/markdown/streaming-guide.md
index 2f8aa5e..3b71662 100644
--- a/src/site/markdown/streaming-guide.md
+++ b/src/site/markdown/streaming-guide.md
@@ -7,7 +7,7 @@
     the License.  You may obtain a copy of the License at
 
       http://www.apache.org/licenses/LICENSE-2.0
-
+    
     Unless required by applicable law or agreed to in writing, software 
     distributed under the License is distributed on an "AS IS" BASIS, 
     WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
@@ -17,6 +17,24 @@
 
 # CarbonData Streaming Ingestion
 
+- [Streaming Table Management](#quick-example)
+  - [Create table with streaming property](#create-table-with-streaming-property)
+  - [Alter streaming property](#alter-streaming-property)
+  - [Acquire streaming lock](#acquire-streaming-lock)
+  - [Create streaming segment](#create-streaming-segment)
+  - [Change Stream segment status](#change-segment-status)
+  - [Handoff "streaming finish" segment to columnar segment](#handoff-streaming-finish-segment-to-columnar-segment)
+  - [Auto handoff streaming segment](#auto-handoff-streaming-segment)
+  - [Stream data parser](#stream-data-parser)
+  - [Close streaming table](#close-streaming-table)
+  - [Constraints](#constraint)
+- [StreamSQL](#streamsql)
+  - [Defining Streaming Table](#streaming-table)
+  - [Streaming Job Management](#streaming-job-management)
+    - [START STREAM](#start-stream)
+    - [STOP STREAM](#stop-stream)
+    - [SHOW STREAMS](#show-streams)
+
 ## Quick example
 Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME
 
@@ -68,7 +86,7 @@ Start spark-shell in new terminal, type :paste, then copy and run the following
       | col1 INT,
       | col2 STRING
       | )
-      | STORED BY 'carbondata'
+      | STORED AS carbondata
       | TBLPROPERTIES('streaming'='true')""".stripMargin)
 
  val carbonTable = CarbonEnv.getCarbonTable(Some("default"), "carbon_table")(spark)
@@ -116,19 +134,19 @@ streaming table using following DDL.
   col1 INT,
   col2 STRING
  )
- STORED BY 'carbondata'
+ STORED AS carbondata
  TBLPROPERTIES('streaming'='true')
 ```
 
  property name | default | description
  ---|---|--- 
  streaming | false |Whether to enable streaming ingest feature for this table <br /> Value range: true, false 
- 
+
  "DESC FORMATTED" command will show streaming property.
  ```sql
  DESC FORMATTED streaming_table
  ```
- 
+
 ## Alter streaming property
 For an old table, use ALTER TABLE command to set the streaming property.
 ```sql
@@ -261,14 +279,145 @@ ALTER TABLE streaming_table COMPACT 'close_streaming'
 7. block drop the streaming table while the streaming ingestion is running.
 
 
-<script>
-$(function() {
-  // Show selected style on nav item
-  $('.b-nav__docs').addClass('selected');
 
-  // Display docs subnav items
-  if (!$('.b-nav__docs').parent().hasClass('nav__item__with__subs--expanded')) {
-    $('.b-nav__docs').parent().toggleClass('nav__item__with__subs--expanded');
-  }
-});
-</script>
+## StreamSQL
+
+
+
+### Streaming Table
+
+**Example**
+
+The following example shows how to start a streaming ingest job:
+
+```
+    sql(
+      s"""
+         |CREATE TABLE source(
+         | id INT,
+         | name STRING,
+         | city STRING,
+         | salary FLOAT,
+         | tax DECIMAL(8,2),
+         | percent double,
+         | birthday DATE,
+         | register TIMESTAMP,
+         | updated TIMESTAMP
+         |)
+         |STORED AS carbondata
+         |TBLPROPERTIES (
+         | 'format'='csv',
+         | 'path'='$csvDataDir'
+         |)
+      """.stripMargin)
+
+    sql(
+      s"""
+         |CREATE TABLE sink(
+         | id INT,
+         | name STRING,
+         | city STRING,
+         | salary FLOAT,
+         | tax DECIMAL(8,2),
+         | percent double,
+         | birthday DATE,
+         | register TIMESTAMP,
+         | updated TIMESTAMP
+         |)
+         |STORED AS carbondata
+         |TBLPROPERTIES (
+         |  'streaming'='true'
+         |)
+      """.stripMargin)
+
+    sql(
+      """
+        |START STREAM job123 ON TABLE sink
+        |STMPROPERTIES(
+        |  'trigger'='ProcessingTime',
+        |  'interval'='1 seconds')
+        |AS
+        |  SELECT *
+        |  FROM source
+        |  WHERE id % 2 = 1
+      """.stripMargin)
+
+    sql("STOP STREAM job123")
+
+    sql("SHOW STREAMS [ON TABLE tableName]")
+```
+
+
+
+In the above example, two tables are created: source and sink. The `source` table's format is `csv` and the `sink` table's format is `carbon`. Then a streaming job is created to stream data from the source table to the sink table.
+
+These two tables are normal carbon tables; they can be queried independently.
+
+
+
+### Streaming Job Management
+
+As the above example shows:
+
+- `START STREAM jobName ON TABLE tableName` is used to start a streaming ingest job.
+- `STOP STREAM jobName` is used to stop a streaming job by its name.
+- `SHOW STREAMS [ON TABLE tableName]` is used to print streaming job information.
+
+
+
+##### START STREAM
+
+When this is issued, carbon will start a Structured Streaming job to do the streaming ingestion. Before launching the job, the system will validate:
+
+- The format of the table specified in the CTAS FROM clause must be one of: csv, json, text, parquet, kafka, socket. These are the formats supported by Spark 2.2.0 Structured Streaming.
+
+- The user should pass the options of the streaming source table in its TBLPROPERTIES when creating it. StreamSQL will pass them transparently to Spark when creating the streaming job. For example:
+
+  ```SQL
+  CREATE TABLE source(
+    name STRING,
+    age INT
+  )
+  STORED AS carbondata
+  TBLPROPERTIES(
+    'format'='socket',
+    'host'='localhost',
+    'port'='8888'
+  )
+  ```
+
+  will translate to
+
+  ```Scala
+  spark.readStream
+  	 .schema(tableSchema)
+  	 .format("socket")
+  	 .option("host", "localhost")
+  	 .option("port", "8888")
+  ```
+
+
+
+- The sink table should have a TBLPROPERTY `'streaming'` equal to `true`, indicating it is a streaming table.
+- In the given STMPROPERTIES, the user must specify `'trigger'`; its value must be `ProcessingTime` (in the future, other values will be supported). The user should also specify an interval value for the streaming job.
+- If the schema specified in the sink table is different from that of the CTAS query, the streaming job will fail.
+
+
+
+##### STOP STREAM
+
+When this is issued, the streaming job will be stopped immediately. It will fail if the specified jobName does not exist.
+
+
+
+##### SHOW STREAMS
+
+The `SHOW STREAMS ON TABLE tableName` command will print the streaming job information as follows:
+
+| Job name | status  | Source | Sink | start time          | time elapsed |
+| -------- | ------- | ------ | ---- | ------------------- | ------------ |
+| job123   | Started | device | fact | 2018-02-03 14:32:42 | 10d2h32m     |
+
+The `SHOW STREAMS` command will show all stream jobs in the system.
+
+

http://git-wip-us.apache.org/repos/asf/carbondata-site/blob/a51dc596/src/site/markdown/supported-data-types-in-carbondata.md
----------------------------------------------------------------------
diff --git a/src/site/markdown/supported-data-types-in-carbondata.md b/src/site/markdown/supported-data-types-in-carbondata.md
index 35e41ba..fee80f6 100644
--- a/src/site/markdown/supported-data-types-in-carbondata.md
+++ b/src/site/markdown/supported-data-types-in-carbondata.md
@@ -46,15 +46,4 @@
 
   * Other Types
     * BOOLEAN
-    
-<script>
-$(function() {
-  // Show selected style on nav item
-  $('.b-nav__docs').addClass('selected');
-
-  // Display docs subnav items
-  if (!$('.b-nav__docs').parent().hasClass('nav__item__with__subs--expanded')) {
-    $('.b-nav__docs').parent().toggleClass('nav__item__with__subs--expanded');
-  }
-});
-</script>
+

http://git-wip-us.apache.org/repos/asf/carbondata-site/blob/a51dc596/src/site/markdown/timeseries-datamap-guide.md
----------------------------------------------------------------------
diff --git a/src/site/markdown/timeseries-datamap-guide.md b/src/site/markdown/timeseries-datamap-guide.md
index d3ef3c6..3f849c4 100644
--- a/src/site/markdown/timeseries-datamap-guide.md
+++ b/src/site/markdown/timeseries-datamap-guide.md
@@ -17,9 +17,9 @@
 
 # CarbonData Timeseries DataMap
 
-* [Timeseries DataMap Introduction](#timeseries-datamap-intoduction)
-* [Compaction](#compacting-pre-aggregate-tables)
-* [Data Management](#data-management-with-pre-aggregate-tables)
+* [Timeseries DataMap Introduction](#timeseries-datamap-introduction-alpha-feature)
+* [Compaction](#compacting-timeseries-datamp)
+* [Data Management](#data-management-on-timeseries-datamap)
 
 ## Timeseries DataMap Introduction (Alpha Feature)
 Timeseries DataMap is a pre-aggregate table implementation based on 'pre-aggregate' DataMap.
@@ -153,14 +153,3 @@ Same applies to timeseries datamap.
 Refer to Data Management section in [preaggregation datamap](./preaggregate-datamap-guide.md).
 Same applies to timeseries datamap.
 
-<script>
-$(function() {
-  // Show selected style on nav item
-  $('.b-nav__datamap').addClass('selected');
-  
-  if (!$('.b-nav__datamap').parent().hasClass('nav__item__with__subs--expanded')) {
-    // Display datamap subnav items
-    $('.b-nav__datamap').parent().toggleClass('nav__item__with__subs--expanded');
-  }
-});
-</script>

http://git-wip-us.apache.org/repos/asf/carbondata-site/blob/a51dc596/src/site/markdown/usecases.md
----------------------------------------------------------------------
diff --git a/src/site/markdown/usecases.md b/src/site/markdown/usecases.md
new file mode 100644
index 0000000..277c455
--- /dev/null
+++ b/src/site/markdown/usecases.md
@@ -0,0 +1,215 @@
+# Use Cases
+
+CarbonData is useful in various analytical workloads. Some of the most typical use cases where CarbonData is being used are documented here.
+
+CarbonData is used for, but not limited to:
+
+- ### Bank
+
+  - fraud detection analysis
+  - risk profile analysis
+  - As a zip table to update the daily balance of customers
+
+- ### Telecom
+
+  - Detection of signal anomalies for VIP customers, for providing an improved customer experience
+  - Analysis of MR, CHR records of GSM data to determine the tower load at a particular time period and rebalance the tower configuration
+  - Analysis of access sites, video, screen size, streaming bandwidth, and quality to determine the network quality and routing configuration
+
+- ### Web/Internet
+
+  - Analysis of the page or video being accessed, server loads, streaming quality, screen size
+
+- ### Smart City
+
+  - Vehicle tracking analysis
+  - Unusual behaviour analysis
+
+
+
+These use cases can be broadly classified into the below categories:
+
+- Full scan/Detailed/Interactive queries
+- Aggregation/OLAP BI queries
+- Real time ingestion (streaming) and queries
+
+
+
+## Detailed Queries in the Telecom scenario
+
+### Scenario
+
+The user wants to analyse all the CHR (Call History Records) and MR (Measurement Records) of the mobile subscribers in order to identify service failures within 10 seconds. The user also wants to run machine learning models on the data to fairly estimate the reasons and times of probable failures and take action in advance to meet the SLAs (Service Level Agreements) of VIP customers.
+
+### Challenges
+
+- The incoming data rate might vary based on the user concentration at a particular period of time. Hence, higher data load speeds are required
+- The cluster needs to be well utilised and shared among various applications for better resource consumption and savings
+- Queries need to be interactive, i.e., the queries fetch small amounts of data and need to return results within seconds
+- Data is loaded into the system every few minutes
+
+### Solution
+
+Setup a Hadoop + Spark + CarbonData cluster managed by YARN.
+
+The following configurations were proposed for CarbonData. (These tunings were proposed before CarbonData introduced the SORT_COLUMNS parameter, with which the sort order and schema order can differ.)
+
+Add the frequently used columns to the left of the table definition, in increasing order of cardinality. It was suggested to keep the msisdn and imsi columns at the beginning of the schema. With the latest CarbonData, SORT_COLUMNS needs to be configured with msisdn, imsi at the beginning.
+
+Add the timestamp column to the right of the schema as it is naturally increasing.
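+
+A minimal sketch of a table definition reflecting these recommendations with the latest CarbonData (the table and column names are illustrative):
+
+```sql
+CREATE TABLE chr_mr_records(
+  msisdn STRING,
+  imsi STRING,
+  cell_id STRING,
+  event_time TIMESTAMP)
+STORED AS carbondata
+TBLPROPERTIES('SORT_COLUMNS'='msisdn,imsi')
+```
+
+With SORT_COLUMNS, the sort order no longer has to follow the schema order, so the frequently filtered columns can be listed there directly.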
+
+Create two separate YARN queues for Query and Data Loading.
+
+Apart from these, the following CarbonData configuration was suggested to be configured in the cluster.
+
+
+
+| Configuration for | Parameter                               | Value  | Description |
+|------------------ | --------------------------------------- | ------ | ----------- |
+| Data Loading | carbon.graph.rowset.size                | 100000 | Based on the size of each row, this determines the memory required during data loading. A higher value leads to an increased memory footprint |
+| Data Loading | carbon.number.of.cores.while.loading    | 12     | More cores can improve data loading speed |
+| Data Loading | carbon.sort.size                        | 100000 | Number of records to sort at a time. Configuring more records leads to an increased memory footprint |
+| Data Loading | table_blocksize                         | 256 (MB) | To efficiently schedule multiple tasks during query |
+| Data Loading | carbon.sort.intermediate.files.limit    | 100    | Increased to 100 as the number of cores is higher. Merging can be performed in the background. If there are fewer files to merge, sort threads would be idle |
+| Data Loading | carbon.use.local.dir                    | TRUE   | The YARN application directory will usually be on a single disk. YARN would be configured with multiple disks to be used as temp or to be assigned randomly to applications. Using the YARN temp directory allows carbon to use multiple disks and improves IO performance |
+| Data Loading | carbon.use.multiple.temp.dir            | TRUE   | Writing sort files to multiple disks leads to better IO and reduces the IO bottleneck |
+| Compaction | carbon.compaction.level.threshold       | 6,6    | Since there are frequent small loads, compacting more segments will give better query results |
+| Compaction | carbon.enable.auto.load.merge           | true   | Since data loads are small, auto compaction keeps the number of segments low and compaction can complete in time |
+| Compaction | carbon.number.of.cores.while.compacting | 4      | A higher number of cores can improve the compaction speed |
+| Compaction | carbon.major.compaction.size            | 921600 (MB) | Sum of several loads to combine into a single segment |
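+
+Most of these are system-level parameters and would typically be set in carbon.properties (table_blocksize is a table-level property set via TBLPROPERTIES). A sketch of the corresponding carbon.properties entries, using the values from the table above:
+
+```
+carbon.graph.rowset.size=100000
+carbon.number.of.cores.while.loading=12
+carbon.sort.size=100000
+carbon.sort.intermediate.files.limit=100
+carbon.use.local.dir=true
+carbon.use.multiple.temp.dir=true
+carbon.compaction.level.threshold=6,6
+carbon.enable.auto.load.merge=true
+carbon.number.of.cores.while.compacting=4
+carbon.major.compaction.size=921600
+```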
+
+
+
+### Results Achieved
+
+| Parameter                                 | Results          |
+| ----------------------------------------- | ---------------- |
+| Query                                     | < 3 Sec          |
+| Data Loading Speed                        | 40 MB/s Per Node |
+| Concurrent query performance (20 queries) | < 10 Sec         |
+
+
+
+## Detailed Queries in the Smart City scenario
+
+### Scenario
+
+The user wants to analyse person/vehicle movement and behaviour during a certain time period. This output data needs to be joined with an external table for extraction of human details. The query will be run with different time periods as filters to identify potential behaviour mismatches.
+
+### Challenges
+
+The data generated per day is very huge. Data needs to be loaded multiple times per day to accommodate the incoming data size.
+
+Data loading is done once every 6 hours.
+
+### Solution
+
+Setup a Hadoop + Spark + CarbonData cluster managed by YARN.
+
+Since data needs to be queried for a time period, it was recommended to keep the time column at the beginning of the schema.
+
+Use table block size as 512MB.
+
+Use local sort mode.
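+
+A minimal sketch of a table definition reflecting these recommendations (the table and column names are illustrative):
+
+```sql
+CREATE TABLE vehicle_tracking(
+  event_time TIMESTAMP,
+  vehicle_id STRING,
+  camera_id STRING,
+  location STRING)
+STORED AS carbondata
+TBLPROPERTIES(
+  'SORT_COLUMNS'='event_time,vehicle_id',
+  'SORT_SCOPE'='LOCAL_SORT',
+  'TABLE_BLOCKSIZE'='512')
+```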
+
+Configure all columns as no-dictionary, as the cardinality is high.
+
+Apart from these, the following CarbonData configuration was suggested to be configured in the cluster.
+
+| Configuration for | Parameter                               | Value                   | Description |
+| ------------------| --------------------------------------- | ----------------------- | ------------------|
+| Data Loading | carbon.graph.rowset.size                | 100000                  | Based on the size of each row, this determines the memory required during data loading. A higher value leads to an increased memory footprint |
+| Data Loading | enable.unsafe.sort                      | TRUE                    | Temporary data generated during sort is huge, which causes GC bottlenecks. Using unsafe reduces the pressure on GC |
+| Data Loading | enable.offheap.sort                     | TRUE                    | Temporary data generated during sort is huge, which causes GC bottlenecks. Using offheap reduces the pressure on GC. Offheap can be accessed through Java unsafe; hence enable.unsafe.sort needs to be true |
+| Data Loading | offheap.sort.chunk.size.in.mb           | 128                     | Size of memory to allocate for sorting. Can be increased based on the memory available |
+| Data Loading | carbon.number.of.cores.while.loading    | 12                      | Higher cores can improve data loading speed |
+| Data Loading | carbon.sort.size                        | 100000                  | Number of records to sort at a time. Configuring more records leads to an increased memory footprint |
+| Data Loading | table_blocksize                         | 512 (MB)                | To efficiently schedule multiple tasks during query. This size depends on the data scenario. If the data is such that the filters would select fewer blocklets to scan, keeping a higher value works well. If the number of blocklets to scan is more, it is better to reduce the size so that more tasks can be scheduled in parallel |
+| Data Loading | carbon.sort.intermediate.files.limit    | 100                     | Increased to 100 as the number of cores is higher. Merging can be performed in the background. If there are fewer files to merge, sort threads would be idle |
+| Data Loading | carbon.use.local.dir                    | TRUE                    | The YARN application directory will usually be on a single disk. YARN would be configured with multiple disks to be used as temp or to be assigned randomly to applications. Using the YARN temp directory allows carbon to use multiple disks and improves IO performance |
+| Data Loading | carbon.use.multiple.temp.dir            | TRUE                    | Writing sort files to multiple disks leads to better IO and reduces the IO bottleneck |
+| Data Loading | sort.inmemory.size.in.mb                | 92160 | Memory allocated for in-memory sorting. When more memory is available on the node, configuring this will retain more sort blocks in memory so that the merge sort is faster due to little or no IO |
+| Compaction | carbon.major.compaction.size            | 921600 (MB)             | Sum of several loads to combine into a single segment |
+| Compaction | carbon.number.of.cores.while.compacting | 12                      | A higher number of cores can improve the compaction speed. The data size is huge, so compaction needs to use more threads to speed up the process |
+| Compaction | carbon.enable.auto.load.merge           | FALSE                   | Auto minor compaction is a costly process as the data size is huge. Perform manual compaction when the cluster is less loaded |
+| Query | carbon.enable.vector.reader             | true                    | To fetch results faster, supporting Spark vector processing will speed up the query |
+| Query | enable.unsafe.in.query.processing       | true                    | The data that needs to be scanned is huge, which in turn generates more short-lived Java objects. This causes pressure on GC. Using unsafe and offheap will reduce the GC overhead |
+| Query | use.offheap.in.query.processing         | true                    | The data that needs to be scanned is huge, which in turn generates more short-lived Java objects. This causes pressure on GC. Using unsafe and offheap will reduce the GC overhead. Offheap can be accessed through Java unsafe; hence enable.unsafe.in.query.processing needs to be true |
+| Query | enable.unsafe.columnpage                | TRUE                    | Keep the column pages in offheap memory so that the memory overhead due to Java objects is less and GC pressure is reduced |
+| Query | carbon.unsafe.working.memory.in.mb      | 10240                   | Amount of memory to use for offheap operations. Can be increased based on the data size |
+
+
+
+### Results Achieved
+
+| Parameter                              | Results          |
+| -------------------------------------- | ---------------- |
+| Query (Time Period spanning 1 segment) | < 10 Sec         |
+| Data Loading Speed                     | 45 MB/s Per Node |
+
+
+
+## OLAP/BI Queries in the web/Internet scenario
+
+### Scenario
+
+An Internet company wants to analyze the average download speed, the kind of handsets used in a particular region/area, the kind of apps being used, and what kind of videos are trending in a particular region, to enable them to identify the appropriate resolution size of videos to speed up transfer, and to perform many more analyses to serve the customers better.
+
+### Challenges
+
+Since the data is being queried by a BI tool, all the queries contain group by, which means CarbonData needs to return more records as the limit cannot be pushed down to the carbondata layer.
+
+Results have to be returned faster as the BI tool would not respond till the data is fetched, causing a bad user experience.
+
+Data might be loaded less frequently (once or twice a day), but the raw data size is huge, which causes the group by queries to run slower.
+
+The number of concurrent queries can be higher due to the BI dashboard
+
+### Goal
+
+1. Aggregation queries are faster
+2. Concurrency is high (number of concurrent queries supported)
+
+### Solution
+
+- Use a table block size of 128MB so that pruning is more effective
+- Use global sort mode so that the data to be fetched is grouped together
+- Create pre-aggregate tables for non-timestamp-based group by queries
+- For queries containing group by date, create timeseries-based DataMap (pre-aggregate) tables so that the data is rolled up during creation and fetches are faster (see the sketch after this list)
+- Reduce the Spark shuffle partitions (in our configuration on a 14-node cluster, it was reduced to 35 from the default of 200)
+- Enable global dictionary for columns which have low cardinality. Aggregation can be done on encoded data, thereby improving the performance
+- For columns whose cardinality is high, enable the local dictionary so that the store size is less and scans can take dictionary benefit
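+
+A sketch of how the pre-aggregate and timeseries datamaps could be defined, following the preaggregate and timeseries datamap guides (the table, column, and datamap names are illustrative):
+
+```sql
+CREATE DATAMAP agg_region ON TABLE user_access
+USING 'preaggregate'
+AS SELECT region, avg(download_speed) FROM user_access GROUP BY region
+
+CREATE DATAMAP agg_day ON TABLE user_access
+USING 'timeseries'
+DMPROPERTIES('event_time'='access_time', 'day_granularity'='1')
+AS SELECT access_time, region, count(video_id) FROM user_access GROUP BY access_time, region
+```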
+
+## Handling near realtime data ingestion scenario
+
+### Scenario
+
+Need to support storing of continuously arriving data and make it available immediately for query.
+
+### Challenges
+
+When the data ingestion is near real time and the data needs to be available for query immediately, the usual approach is to do data loading in micro batches. But this generates many small files, which poses two problems:
+
+1. Small file handling in HDFS is inefficient
+2. CarbonData will suffer in query performance as all the small files will have to be queried when the filter is on a non-time column
+
+Since data is continuously arriving, allocating resources for compaction might not be feasible.
+
+### Goal
+
+1. Data is available in near real time for query as it arrives
+2. CarbonData doesn't suffer from the small files problem
+
+### Solution
+
+- Use the streaming table support of CarbonData (see the sketch after this list)
+- Configure the carbon.streaming.segment.max.size property to a higher value (default is 1GB) if slightly slower query performance is not a concern
+- Configure carbon.streaming.auto.handoff.enabled to true so that after carbon.streaming.segment.max.size is reached, the segment is converted into a format optimized for query
+- Disable auto compaction. Manually trigger minor compaction with the default 4,3 when the cluster is not busy
+- Manually trigger Major compaction based on the size of segments and the frequency with which the segments are being created
+- Enable local dictionary
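+
+A minimal sketch of these settings, assuming an illustrative table name and schema (carbon.streaming.segment.max.size and carbon.streaming.auto.handoff.enabled are set in carbon.properties; see the [streaming guide](./streaming-guide.md) for full details):
+
+```sql
+-- streaming table with local dictionary enabled
+CREATE TABLE fact_stream(
+  device_id STRING,
+  event_time TIMESTAMP,
+  value DOUBLE)
+STORED AS carbondata
+TBLPROPERTIES(
+  'streaming'='true',
+  'LOCAL_DICTIONARY_ENABLE'='true')
+
+-- trigger minor compaction manually (default threshold 4,3) when the cluster is not busy
+ALTER TABLE fact_stream COMPACT 'MINOR'
+```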
+
+
+