Posted to commits@carbondata.apache.org by xu...@apache.org on 2019/06/19 07:01:32 UTC
[carbondata] branch master updated: [CARBONDATA-3425] Added documentation for mv
This is an automated email from the ASF dual-hosted git repository.
xubo245 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/carbondata.git
The following commit(s) were added to refs/heads/master by this push:
new fc8c9d0 [CARBONDATA-3425] Added documentation for mv
fc8c9d0 is described below
commit fc8c9d06750ca3392312c7ac9bc46b9d240ac6a0
Author: Indhumathi27 <in...@gmail.com>
AuthorDate: Tue Jun 4 18:08:20 2019 +0530
[CARBONDATA-3425] Added documentation for mv
This closes #3275
---
docs/datamap/datamap-management.md | 8 +-
docs/datamap/mv-datamap-guide.md | 208 +++++++++++++++++++++++++++++
docs/datamap/preaggregate-datamap-guide.md | 3 +
3 files changed, 215 insertions(+), 4 deletions(-)
diff --git a/docs/datamap/datamap-management.md b/docs/datamap/datamap-management.md
index 087c70a..199cd14 100644
--- a/docs/datamap/datamap-management.md
+++ b/docs/datamap/datamap-management.md
@@ -49,7 +49,7 @@ Currently, there are 5 DataMap implementations in CarbonData.
| ---------------- | ---------------------------------------- | ---------------------------------------- | ---------------- |
| preaggregate | single table pre-aggregate table | No DMPROPERTY is required | Automatic |
| timeseries | time dimension rollup table | event_time, xx_granularity, please refer to [Timeseries DataMap](./timeseries-datamap-guide.md) | Automatic |
-| mv | multi-table pre-aggregate table | No DMPROPERTY is required | Manual |
+| mv | multi-table pre-aggregate table | No DMPROPERTY is required | Manual/Automatic |
| lucene | lucene indexing for text column | index_columns to specifying the index columns | Automatic |
| bloomfilter | bloom filter for high cardinality column, geospatial column | index_columns to specifying the index columns | Automatic |
@@ -60,9 +60,6 @@ There are two kinds of management semantic for DataMap.
1. Automatic Refresh: Create datamap without `WITH DEFERRED REBUILD` in the statement, which is by default.
2. Manual Refresh: Create datamap with `WITH DEFERRED REBUILD` in the statement
-**CAUTION:**
-If user create MV datamap without specifying `WITH DEFERRED REBUILD`, carbondata will give a warning and treat the datamap as deferred rebuild.
-
### Automatic Refresh
When user creates a datamap on the main table without using `WITH DEFERRED REBUILD` syntax, the datamap will be managed by system automatically.
@@ -142,6 +139,9 @@ There is a SHOW DATAMAPS command, when this is issued, system will read all data
- DataMapProviderName like mv, preaggreagte, timeseries, etc
- Associated Table
- DataMap Properties
+- DataMap status (ENABLED/DISABLED)
+- Sync Status, which displays the last segment id of the main table synced with the datamap table
+ and its load end time (applicable only for mv datamap)
### Compaction on DataMap
diff --git a/docs/datamap/mv-datamap-guide.md b/docs/datamap/mv-datamap-guide.md
new file mode 100644
index 0000000..d22357c
--- /dev/null
+++ b/docs/datamap/mv-datamap-guide.md
@@ -0,0 +1,208 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements. See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to you under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+# CarbonData MV DataMap
+
+* [Quick Example](#quick-example)
+* [MV DataMap](#mv-datamap-introduction)
+* [Loading Data](#loading-data)
+* [Querying Data](#querying-data)
+* [Compaction](#compacting-mv-datamap)
+* [Data Management](#data-management-with-mv-tables)
+
+## Quick example
+
+Start spark-sql in a terminal and run the following queries:
+```
+CREATE TABLE maintable(a int, b string, c int) stored by 'carbondata';
+insert into maintable select 1, 'ab', 2;
+CREATE DATAMAP datamap_1 on table maintable using 'mv' as SELECT a, sum(c) from maintable group by a;
+SELECT a, sum(c) from maintable group by a;
+-- NOTE: run an explain query and check from the plan whether the query hits the datamap table
+EXPLAIN SELECT a, sum(c) from maintable group by a;
+```
+
+## MV DataMap Introduction
+ MV tables are created as DataMaps and managed as tables internally by CarbonData. Users can create
+ any number of MV datamaps on a table to improve query performance, provided the storage
+ requirements and loading time are acceptable.
+
+ An MV datamap can be lazy (created `WITH DEFERRED REBUILD`) or non-lazy. Once MV datamaps are
+ created, CarbonData's CarbonAnalyzer selects the most efficient MV datamap for the user query and
+ rewrites the SQL to read from the MV datamap instead of the main table. Since the MV datamap is
+ smaller than the main table and its data is pre-processed, user queries run much faster.
+
+ For instance, consider a main table called **sales**, defined as:
+
+ ```
+ CREATE TABLE sales (
+ order_time timestamp,
+ user_id string,
+ sex string,
+ country string,
+ quantity int,
+ price bigint)
+ STORED AS carbondata
+ ```
+
+ Users can create MV tables using the CREATE DATAMAP DDL:
+
+ ```
+ CREATE DATAMAP agg_sales
+ ON TABLE sales
+ USING "MV"
+ AS
+ SELECT country, sex, sum(quantity), avg(price)
+ FROM sales
+ GROUP BY country, sex
+ ```
+ **NOTE**:
+ * Group by/filter columns have to be provided in the projection list while creating an mv datamap
+ * If only a single parent table is involved in mv datamap creation, then the TableProperties of the
+ parent table listed below are inherited by the datamap table (unless the column is used in an
+ aggregate function like sum(col))
+ 1. SORT_COLUMNS
+ 2. SORT_SCOPE
+ 3. TABLE_BLOCKSIZE
+ 4. FLAT_FOLDER
+ 5. LONG_STRING_COLUMNS
+ 6. LOCAL_DICTIONARY_ENABLE
+ 7. LOCAL_DICTIONARY_THRESHOLD
+ 8. LOCAL_DICTIONARY_EXCLUDE
+ 9. DICTIONARY_INCLUDE
+ 10. DICTIONARY_EXCLUDE
+ 11. INVERTED_INDEX
+ 12. NO_INVERTED_INDEX
+ 13. COLUMN_COMPRESSOR
+
+ * An mv datamap cannot include all columns of the main table at once
+ * TableProperties can be provided in DMProperties, excluding LOCAL_DICTIONARY_INCLUDE,
+ LOCAL_DICTIONARY_EXCLUDE, DICTIONARY_INCLUDE, DICTIONARY_EXCLUDE, INVERTED_INDEX,
+ NO_INVERTED_INDEX, SORT_COLUMNS, LONG_STRING_COLUMNS, RANGE_COLUMN & COLUMN_META_CACHE
+ * A TableProperty given in DMProperties is used for mv creation even if the same
+ property is inherited from the parent table, which allows the user to set different table
+ properties for the child table
+ * MV creation is unsupported when the ctas query contains limit or union all
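+
+ As a sketch of the DMProperties note above, the statement below reuses the **sales** example and
+ sets one table property on the mv datamap. The datamap name `agg_sales_blocksize` and the value
+ `256` are assumptions for illustration, and the position of the `DMPROPERTIES` clause is assumed
+ to follow the general CREATE DATAMAP syntax in [DataMap Management](./datamap-management.md).
+
+ ```sql
+ -- Illustrative only: the datamap name and block size value are hypothetical
+ CREATE DATAMAP agg_sales_blocksize
+ ON TABLE sales
+ USING "MV"
+ DMPROPERTIES('TABLE_BLOCKSIZE'='256')
+ AS
+ SELECT country, sex, sum(quantity), avg(price)
+ FROM sales
+ GROUP BY country, sex
+ ```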
+
+#### How MV tables are selected
+
+When a user query is submitted, during the query planning phase, CarbonData collects modular plan
+candidates and processes the ModularPlan based on registered summary data sets. Then, the
+mv datamap table for this query is selected from among the candidates.
+
+For the main table **sales** and mv table **agg_sales** created above, the following queries
+```
+SELECT country, sex, sum(quantity), avg(price) from sales GROUP BY country, sex
+
+SELECT sex, sum(quantity) from sales GROUP BY sex
+
+SELECT avg(price), country from sales GROUP BY country
+```
+
+will be transformed by CarbonData's query planner to query against the mv table
+**agg_sales** instead of the main table **sales**.
+
+However, the following queries
+```
+SELECT user_id, country, sex, sum(quantity), avg(price) from sales GROUP BY user_id, country, sex
+
+SELECT sex, avg(quantity) from sales GROUP BY sex
+
+SELECT country, max(price) from sales GROUP BY country
+```
+
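+will query against the main table **sales** only, because they do not satisfy the mv table
+selection logic: **agg_sales** has no `user_id` column and stores neither `avg(quantity)` nor
+`max(price)`.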
+
+## Loading data
+
+### Loading data to Non-Lazy MV Datamap
+
+For a datamap created without `WITH DEFERRED REBUILD`, if the main table already contains loaded
+data, the data load to the MV table is triggered by the CREATE DATAMAP statement itself.
+For incremental loads to the main table, data is loaded to the datamap once the corresponding main
+table load is completed.
+
+### Loading data to Lazy MV Datamap
+
+For a datamap created with `WITH DEFERRED REBUILD`, the data load to the MV table is triggered by the [Manual Refresh](./datamap-management.md#manual-refresh)
+command. The MV datamap will be in DISABLED state in the scenarios below:
+ * when the mv datamap is created
+ * when the data of the main table and the datamap are not in sync
+
+The user should run the REBUILD DATAMAP command to sync all segments of the main table with the
+datamap table, which ENABLES the datamap for queries.
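+
+For example, a lazy mv datamap on the **sales** table could be created and then synced as sketched
+below. The name `agg_sales_lazy` is hypothetical, and the statements are assumed to follow the
+CREATE DATAMAP and REBUILD DATAMAP forms described in [DataMap Management](./datamap-management.md).
+
+```sql
+-- Create a lazy mv datamap; it stays DISABLED until it is rebuilt
+CREATE DATAMAP agg_sales_lazy
+ON TABLE sales
+USING "MV"
+WITH DEFERRED REBUILD
+AS
+  SELECT country, sex, sum(quantity), avg(price)
+  FROM sales
+  GROUP BY country, sex;
+
+-- Sync all segments of the main table and enable the datamap for queries
+REBUILD DATAMAP agg_sales_lazy;
+```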
+
+### Loading data to Multiple MVs
+During a load to the main table, if any one of the loads to a datamap table fails, the corresponding
+datamap will be DISABLED and the loads to the other datamaps mapped to the main table will continue.
+The user can run the REBUILD DATAMAP command to sync; otherwise, the subsequent table load will load
+the old failed loads along with the current load and enable the disabled datamap.
+
+ **NOTE**:
+ * In case of an InsertOverwrite/Update operation on the parent table, all segments of the datamap
+ table will be MARKED_FOR_DELETE. For a lazy mv datamap, the reload to the datamap table happens
+ on REBUILD DATAMAP; for a non-lazy mv datamap, it happens once the InsertOverwrite/Update
+ operation on the parent table is finished.
+ * In case of a full scan query, the Data Size and Index Size of the main table and the child table
+ will not be the same, as the main table and the child table have different column names.
+
+## Querying data
+As a technique for query acceleration, MV tables cannot be queried directly.
+Queries are to be made on the main table. During query planning, CarbonData internally checks
+the mv datamap tables associated with the main table and transforms the query plan accordingly.
+
+Users can verify whether a query can leverage an mv datamap table by executing the `EXPLAIN`
+command, which shows the transformed logical plan and thus whether an mv datamap
+table is selected.
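+
+A minimal check, assuming the **sales** table and the **agg_sales** mv datamap from the introduction
+exist and the datamap is enabled:
+
+```sql
+-- If the rewrite applies, the plan should reference the mv datamap table rather than sales
+EXPLAIN SELECT sex, sum(quantity) FROM sales GROUP BY sex;
+```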
+
+
+## Compacting MV datamap
+
+### Compacting MV datamap table through Main Table compaction
+Running the compaction command (`ALTER TABLE COMPACT`, with compaction type MINOR or MAJOR) on the
+main table will automatically compact the mv datamap tables created on the main table, once
+compaction on the main table is done.
+
+### Compacting MV datamap table through DDL command
+Compaction on an mv datamap can be triggered by running the following DDL command (supported only for mv).
+ ```
+ ALTER DATAMAP datamap_name COMPACT 'COMPACTION_TYPE'
+ ```
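+
+ As an illustration, assuming the **sales** table and the **agg_sales** mv datamap from the
+ introduction, either path could be triggered as follows (`MINOR` is one of the compaction types
+ named above):
+
+ ```sql
+ -- Compact the main table; its mv datamap tables are compacted afterwards
+ ALTER TABLE sales COMPACT 'MINOR';
+
+ -- Or compact only the mv datamap table
+ ALTER DATAMAP agg_sales COMPACT 'MINOR';
+ ```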
+
+## Data Management with mv tables
+In the current implementation, data consistency needs to be maintained between the main table and
+its mv datamap tables. Once an mv datamap table is created on the main table, the following commands
+on the main table are not supported:
+1. Data management command: `DELETE SEGMENT`.
+2. Schema management commands: `ALTER TABLE DROP COLUMN`, `ALTER TABLE CHANGE DATATYPE`,
+ `ALTER TABLE RENAME`, `ALTER COLUMN RENAME`. Note that adding a new column is supported, and for
+ the drop column and change datatype commands, CarbonData will check whether the operation impacts
+ the mv datamap table; if not, the operation is allowed, otherwise it is rejected by
+ throwing an exception.
+3. Partition management command: `ALTER TABLE ADD/DROP PARTITION`. Note that dropping a partition
+ is allowed only if the partition participates in all datamaps associated with the main table.
+ Drop Partition is not allowed if any mv datamap is associated with more than one parent table.
+ Drop Partition directly on a datamap table is not allowed.
+4. Complex datatypes are not supported for mv datamaps.
+
+However, there is still a way to perform these operations on the main table. In the current
+CarbonData release, the user can do the following:
+1. Remove the mv datamap table with the `DROP DATAMAP` command
+2. Carry out the data management operation on the main table
+3. Create the mv datamap table again with the `CREATE DATAMAP` command
+
+Basically, the user can manually trigger the operation by re-building the datamap.
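+
+The three steps can be sketched as follows for the **sales** table and the **agg_sales** mv datamap.
+The segment id `0` is a hypothetical value, and the segment deletion statement is assumed to follow
+CarbonData's segment management syntax.
+
+```sql
+-- 1. Remove the mv datamap
+DROP DATAMAP IF EXISTS agg_sales;
+
+-- 2. Run the otherwise unsupported operation on the main table (segment id is hypothetical)
+DELETE FROM TABLE sales WHERE SEGMENT.ID IN (0);
+
+-- 3. Re-create the mv datamap
+CREATE DATAMAP agg_sales
+ON TABLE sales
+USING "MV"
+AS
+  SELECT country, sex, sum(quantity), avg(price)
+  FROM sales
+  GROUP BY country, sex;
+```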
diff --git a/docs/datamap/preaggregate-datamap-guide.md b/docs/datamap/preaggregate-datamap-guide.md
index eff601d..5369bb7 100644
--- a/docs/datamap/preaggregate-datamap-guide.md
+++ b/docs/datamap/preaggregate-datamap-guide.md
@@ -176,6 +176,9 @@ It will show all DataMaps created on main table.
FROM sales
GROUP BY country, sex
```
+ **NOTE**:
+ * Preaggregate datamap is deprecated and has been replaced by MV datamap.
+ Please refer to [CarbonData MV DataMap](./mv-datamap-guide.md) for more info.
#### Functions supported in pre-aggregate table