You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@carbondata.apache.org by xu...@apache.org on 2019/06/19 07:01:32 UTC
[carbondata] branch master updated: [CARBONDATA-3425] Added documentation for mv

This is an automated email from the ASF dual-hosted git repository.

xubo245 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/carbondata.git


The following commit(s) were added to refs/heads/master by this push:
     new fc8c9d0  [CARBONDATA-3425] Added documentation for mv
fc8c9d0 is described below

commit fc8c9d06750ca3392312c7ac9bc46b9d240ac6a0
Author: Indhumathi27 <in...@gmail.com>
AuthorDate: Tue Jun 4 18:08:20 2019 +0530

    [CARBONDATA-3425] Added documentation for mv
    
    This closes #3275
---
 docs/datamap/datamap-management.md         |   8 +-
 docs/datamap/mv-datamap-guide.md           | 208 +++++++++++++++++++++++++++++
 docs/datamap/preaggregate-datamap-guide.md |   3 +
 3 files changed, 215 insertions(+), 4 deletions(-)

diff --git a/docs/datamap/datamap-management.md b/docs/datamap/datamap-management.md
index 087c70a..199cd14 100644
--- a/docs/datamap/datamap-management.md
+++ b/docs/datamap/datamap-management.md
@@ -49,7 +49,7 @@ Currently, there are 5 DataMap implementations in CarbonData.
 | ---------------- | ---------------------------------------- | ---------------------------------------- | ---------------- |
 | preaggregate     | single table pre-aggregate table         | No DMPROPERTY is required                | Automatic        |
 | timeseries       | time dimension rollup table              | event_time, xx_granularity, please refer to [Timeseries DataMap](./timeseries-datamap-guide.md) | Automatic        |
-| mv               | multi-table pre-aggregate table          | No DMPROPERTY is required                | Manual           |
+| mv               | multi-table pre-aggregate table          | No DMPROPERTY is required                | Manual/Automatic           |
 | lucene           | lucene indexing for text column          | index_columns to specifying the index columns | Automatic |
 | bloomfilter      | bloom filter for high cardinality column, geospatial column | index_columns to specifying the index columns | Automatic |
 
@@ -60,9 +60,6 @@ There are two kinds of management semantic for DataMap.
 1. Automatic Refresh: Create datamap without `WITH DEFERRED REBUILD` in the statement, which is by default.
 2. Manual Refresh: Create datamap with `WITH DEFERRED REBUILD` in the statement
 
-**CAUTION:**
-If user create MV datamap without specifying `WITH DEFERRED REBUILD`, carbondata will give a warning and treat the datamap as deferred rebuild.
-
 ### Automatic Refresh
 
 When user creates a datamap on the main table without using `WITH DEFERRED REBUILD` syntax, the datamap will be managed by system automatically.
@@ -142,6 +139,9 @@ There is a SHOW DATAMAPS command, when this is issued, system will read all data
 - DataMapProviderName like mv, preaggreagte, timeseries, etc
 - Associated Table
 - DataMap Properties
+- DataMap status (ENABLED/DISABLED)
+- Sync Status - which displays Last segment Id of main table synced with datamap table and its load
+  end time (Applicable only for mv datamap)
 
 ### Compaction on DataMap
 
diff --git a/docs/datamap/mv-datamap-guide.md b/docs/datamap/mv-datamap-guide.md
new file mode 100644
index 0000000..d22357c
--- /dev/null
+++ b/docs/datamap/mv-datamap-guide.md
@@ -0,0 +1,208 @@
+<!--
+    Licensed to the Apache Software Foundation (ASF) under one or more
+    contributor license agreements.  See the NOTICE file distributed with
+    this work for additional information regarding copyright ownership.
+    The ASF licenses this file to you under the Apache License, Version 2.0
+    (the "License"); you may not use this file except in compliance with
+    the License.  You may obtain a copy of the License at
+
+      http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing, software
+    distributed under the License is distributed on an "AS IS" BASIS,
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and
+    limitations under the License.
+-->
+
+# CarbonData MV DataMap
+
+* [Quick Example](#quick-example)
+* [MV DataMap](#mv-datamap-introduction)
+* [Loading Data](#loading-data)
+* [Querying Data](#querying-data)
+* [Compaction](#compacting-mv-tables)
+* [Data Management](#data-management-with-mv-tables)
+
+## Quick example
+
+Start spark-sql in terminal and run the following queries,
+```
+CREATE TABLE maintable(a int, b string, c int) stored by 'carbondata';
+insert into maintable select 1, 'ab', 2;
+CREATE DATAMAP datamap_1 on table maintable as SELECT a, sum(b) from maintable group by a;
+SELECT a, sum(b) from maintable group by a;
+// NOTE: run explain query and check if query hits the datamap table from the plan
+EXPLAIN SELECT a, sum(b) from maintable group by a;
+```
+
+## MV DataMap Introduction
+  MV tables are created as DataMaps and managed as tables internally by CarbonData. User can create
+  limitless MV datamaps on a table to improve query performance provided the storage requirements
+  and loading time is acceptable.
+
+  MV datamap can be a lazy or a non-lazy datamap. Once MV datamaps are created, CarbonData's
+  CarbonAnalyzer helps to select the most efficient MV datamap based on the user query and rewrite
+  the SQL to select the data from MV datamap instead of main table. Since the data size of MV
+  datamap is smaller and data is pre-processed, user queries are much faster.
+
+  For instance, main table called **sales** which is defined as
+
+  ```
+  CREATE TABLE sales (
+    order_time timestamp,
+    user_id string,
+    sex string,
+    country string,
+    quantity int,
+    price bigint)
+  STORED AS carbondata
+  ```
+
+  User can create MV tables using the Create DataMap DDL
+
+  ```
+  CREATE DATAMAP agg_sales
+  ON TABLE sales
+  USING "MV"
+  AS
+    SELECT country, sex, sum(quantity), avg(price)
+    FROM sales
+    GROUP BY country, sex
+  ```
+ **NOTE**:
+ * Group by/Filter columns has to be provided in projection list while creating mv datamap
+ * If only single parent table is involved in mv datamap creation, then TableProperties of Parent table
+   (if not present in a aggregate function like sum(col)) listed below will be
+   inherited to datamap table
+    1. SORT_COLUMNS
+    2. SORT_SCOPE
+    3. TABLE_BLOCKSIZE
+    4. FLAT_FOLDER
+    5. LONG_STRING_COLUMNS
+    6. LOCAL_DICTIONARY_ENABLE
+    7. LOCAL_DICTIONARY_THRESHOLD
+    8. LOCAL_DICTIONARY_EXCLUDE
+    9. DICTIONARY_INCLUDE
+   10. DICTIONARY_EXCLUDE
+   11. INVERTED_INDEX
+   12. NO_INVERTED_INDEX
+   13. COLUMN_COMPRESSOR
+
+ * All columns of main table at once cannot participate in mv datamap table creation
+ * TableProperties can be provided in DMProperties excluding LOCAL_DICTIONARY_INCLUDE,
+   LOCAL_DICTIONARY_EXCLUDE, DICTIONARY_INCLUDE, DICTIONARY_EXCLUDE, INVERTED_INDEX,
+   NO_INVERTED_INDEX, SORT_COLUMNS, LONG_STRING_COLUMNS, RANGE_COLUMN & COLUMN_META_CACHE
+ * TableProperty given in DMProperties will be considered for mv creation, eventhough if same
+   property is inherited from parent table, which allows user to provide different tableproperties
+   for child table
+ * MV creation with limit or union all ctas queries is unsupported
+
+#### How MV tables are selected
+
+When a user query is submitted, during query planning phase, CarbonData will collect modular plan
+candidates and process the the ModularPlan based on registered summary data sets. Then,
+mv datamap table for this query will be selected among the candidates.
+
+For the main table **sales** and mv table  **agg_sales** created above, following queries
+```
+SELECT country, sex, sum(quantity), avg(price) from sales GROUP BY country, sex
+
+SELECT sex, sum(quantity) from sales GROUP BY sex
+
+SELECT avg(price), country from sales GROUP BY country
+```
+
+will be transformed by CarbonData's query planner to query against mv table
+**agg_sales** instead of the main table **sales**
+
+However, for following queries
+```
+SELECT user_id, country, sex, sum(quantity), avg(price) from sales GROUP BY user_id, country, sex
+
+SELECT sex, avg(quantity) from sales GROUP BY sex
+
+SELECT country, max(price) from sales GROUP BY country
+```
+
+will query against main table **sales** only, because it does not satisfy mv table
+selection logic.
+
+## Loading data
+
+### Loading data to Non-Lazy MV Datamap
+
+In case of WITHOUT DEFERRED REBUILD, for existing table with loaded data, data load to MV table will
+be triggered by the CREATE DATAMAP statement when user creates the MV table.
+For incremental loads to main table, data to datamap will be loaded once the corresponding main
+table load is completed.
+
+### Loading data to Lazy MV Datamap
+
+In case of WITH DEFERRED REBUILD, data load to MV table will be triggered by the [Manual Refresh](./datamap-management.md#manual-refresh)
+command. MV datamap will be in DISABLED state in below scenarios,
+  * when mv datamap is created
+  * when data of main table and datamap are not in sync
+
+User should fire REBUILD DATAMAP command to sync all segments of main table with datamap table and
+which ENABLES the datamap for query
+
+### Loading data to Multiple MV's
+During load to main table, if anyone of the load to datamap table fails, then that corresponding
+datamap will be DISABLED and load to other datamaps mapped to main table will continue. User can
+fire REBUILD DATAMAP command to sync or else the subsequent table load will load the old failed
+loads along with current load and enable the disabled datamap.
+
+ **NOTE**:
+ * In case of InsertOverwrite/Update operation on parent table, all segments of datamap table will
+   be MARKED_FOR_DELETE and reload to datamap table will happen by REBUILD DATAMAP, in case of Lazy
+   mv datamap/ once InsertOverwrite/Update operation on parent table is finished, in case of
+   Non-Lazy mv.
+ * In case of full scan query, Data Size and Index Size of main table and child table will not the
+   same, as main table and child table has different column names.
+
+## Querying data
+As a technique for query acceleration, MV tables cannot be queried directly.
+Queries are to be made on main table. While doing query planning, internally CarbonData will check
+associated mv datamap tables with the main table, and do query plan transformation accordingly.
+
+User can verify whether a query can leverage mv datamap table or not by executing `EXPLAIN`
+command, which will show the transformed logical plan, and thus user can check whether mv datamap
+table is selected.
+
+
+## Compacting MV datamap
+
+### Compacting MV datamap table through Main Table compaction
+Running Compaction command (`ALTER TABLE COMPACT`)[COMPACTION TYPE-> MINOR/MAJOR] on main table will
+automatically compact the mv datamap tables created on the main table, once compaction on main table
+is done.
+
+### Compacting MV datamap table through DDL command
+Compaction on mv datamap can be triggered by running the following DDL command(supported only for mv).
+  ```
+  ALTER DATAMAP datamap_name COMPACT 'COMPACTION_TYPE'
+  ```
+
+## Data Management with mv tables
+In current implementation, data consistency needs to be maintained for both main table and mv datamap
+tables. Once there is mv datamap table created on the main table, following command on the main
+table is not supported:
+1. Data management command: `DELETE SEGMENT`.
+2. Schema management command: `ALTER TABLE DROP COLUMN`, `ALTER TABLE CHANGE DATATYPE`,
+   `ALTER TABLE RENAME`, `ALTER COLUMN RENAME`. Note that adding a new column is supported, and for
+   dropping columns and change datatype command, CarbonData will check whether it will impact the
+   mv datamap table, if not, the operation is allowed, otherwise operation will be rejected by
+   throwing exception.
+3. Partition management command: `ALTER TABLE ADD/DROP PARTITION`. Note that dropping a partition
+   will be allowed only if partition is participating in all datamaps associated with main table.
+   Drop Partition is not allowed, if any mv datamap is associated with more than one parent table.
+   Drop Partition directly on datamap table is not allowed.
+4. Complex Datatype's for mv datamap is not supported.
+
+However, there is still way to support these operations on main table, in current CarbonData
+release, user can do as following:
+1. Remove the mv datamap table by `DROP DATAMAP` command
+2. Carry out the data management operation on main table
+3. Create the mv datamap table again by `CREATE DATAMAP` command
+Basically, user can manually trigger the operation by re-building the datamap.
diff --git a/docs/datamap/preaggregate-datamap-guide.md b/docs/datamap/preaggregate-datamap-guide.md
index eff601d..5369bb7 100644
--- a/docs/datamap/preaggregate-datamap-guide.md
+++ b/docs/datamap/preaggregate-datamap-guide.md
@@ -176,6 +176,9 @@ It will show all DataMaps created on main table.
     FROM sales
     GROUP BY country, sex
   ```
+  **NOTE**:
+   * Preaggregate datamap is deprecated and it is replaced by MV datamap.
+     Please refer [CarbonData MV DataMap](./mv-datamap-guide.md) for more info.
   
 #### Functions supported in pre-aggregate table