You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@carbondata.apache.org by ra...@apache.org on 2018/03/03 12:43:58 UTC
[11/25] carbondata git commit: [CARBONDATA-2098] Optimize document
for datamap
[CARBONDATA-2098] Optimize document for datamap
Optimize document for datamap
This closes #2025
Project: http://git-wip-us.apache.org/repos/asf/carbondata/repo
Commit: http://git-wip-us.apache.org/repos/asf/carbondata/commit/95ac5eff
Tree: http://git-wip-us.apache.org/repos/asf/carbondata/tree/95ac5eff
Diff: http://git-wip-us.apache.org/repos/asf/carbondata/diff/95ac5eff
Branch: refs/heads/branch-1.3
Commit: 95ac5eff1673c4c595352e7ec6443f49145b2af7
Parents: 877172c
Author: Jacky Li <ja...@qq.com>
Authored: Sat Mar 3 11:34:46 2018 +0800
Committer: ravipesala <ra...@gmail.com>
Committed: Sat Mar 3 17:48:45 2018 +0530
----------------------------------------------------------------------
docs/datamap/preaggregate-datamap-guide.md | 213 ++++++++++++++++++++++++
docs/datamap/timeseries-datamap-guide.md | 135 +++++++++++++++
2 files changed, 348 insertions(+)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/carbondata/blob/95ac5eff/docs/datamap/preaggregate-datamap-guide.md
----------------------------------------------------------------------
diff --git a/docs/datamap/preaggregate-datamap-guide.md b/docs/datamap/preaggregate-datamap-guide.md
new file mode 100644
index 0000000..fabfd7d
--- /dev/null
+++ b/docs/datamap/preaggregate-datamap-guide.md
@@ -0,0 +1,213 @@
+# CarbonData Pre-aggregate DataMap
+
+## Quick example
+Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME
+
+Package carbon jar, and copy assembly/target/scala-2.11/carbondata_2.11-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar to $SPARK_HOME/jars
+```shell
+mvn clean package -DskipTests -Pspark-2.2
+```
+
+Start spark-shell in new terminal, type :paste, then copy and run the following code.
+```scala
+ import java.io.File
+ import org.apache.spark.sql.{CarbonEnv, SparkSession}
+ import org.apache.spark.sql.CarbonSession._
+ import org.apache.spark.sql.streaming.{ProcessingTime, StreamingQuery}
+ import org.apache.carbondata.core.util.path.CarbonStorePath
+
+ val warehouse = new File("./warehouse").getCanonicalPath
+ val metastore = new File("./metastore").getCanonicalPath
+
+ val spark = SparkSession
+ .builder()
+ .master("local")
+ .appName("preAggregateExample")
+ .config("spark.sql.warehouse.dir", warehouse)
+ .getOrCreateCarbonSession(warehouse, metastore)
+
+ spark.sparkContext.setLogLevel("ERROR")
+
+ // drop table if exists previously
+ spark.sql(s"DROP TABLE IF EXISTS sales")
+
+ // Create main table
+ spark.sql(
+ s"""
+ | CREATE TABLE sales (
+ | user_id string,
+ | country string,
+ | quantity int,
+ | price bigint)
+ | STORED BY 'carbondata'
+ """.stripMargin)
+
+ // Create pre-aggregate table on the main table
+ // If main table already have data, following command
+ // will trigger one immediate load to the pre-aggregate table
+ spark.sql(
+ s"""
+ | CREATE DATAMAP agg_sales
+ | ON TABLE sales
+ | USING "preaggregate"
+ | AS
+ | SELECT country, sum(quantity), avg(price)
+ | FROM sales
+ | GROUP BY country
+ """.stripMargin)
+
+ import spark.implicits._
+ import org.apache.spark.sql.SaveMode
+ import scala.util.Random
+
+ // Load data to the main table, it will also
+ // trigger immediate load to pre-aggregate table.
+ // These two loading operation is carried out in a
+ // transactional manner, meaning that the whole
+ // operation will fail if one of the loading fails
+ val r = new Random()
+ spark.sparkContext.parallelize(1 to 10)
+ .map(x => ("ID." + r.nextInt(100000), "country" + x % 8, x % 50, x % 60))
+ .toDF("user_id", "country", "quantity", "price")
+ .write
+ .format("carbondata")
+ .option("tableName", "sales")
+ .option("compress", "true")
+ .mode(SaveMode.Append)
+ .save()
+
+ spark.sql(
+ s"""
+ |SELECT country, sum(quantity), avg(price)
+ | from sales GROUP BY country
+ """.stripMargin).show
+
+ spark.stop
+```
+
+##PRE-AGGREGATE DataMap
+ Pre-aggregate tables are created as DataMaps and managed as tables internally by CarbonData.
+ User can create as many pre-aggregate datamaps required to improve query performance,
+ provided the storage requirements and loading speeds are acceptable.
+
+ Once pre-aggregate datamaps are created, CarbonData's SparkSQL optimizer extension supports to
+ select the most efficient pre-aggregate datamap and rewrite the SQL to query against the selected
+ datamap instead of the main table. Since the data size of pre-aggregate datamap is smaller,
+ user queries are much faster. In our previous experience, we have seen 5X to 100X times faster
+ in production SQLs.
+
+ For instance, main table called **sales** which is defined as
+
+ ```
+ CREATE TABLE sales (
+ order_time timestamp,
+ user_id string,
+ sex string,
+ country string,
+ quantity int,
+ price bigint)
+ STORED BY 'carbondata'
+ ```
+
+ User can create pre-aggregate tables using the Create DataMap DDL
+
+ ```
+ CREATE DATAMAP agg_sales
+ ON TABLE sales
+ USING "preaggregate"
+ AS
+ SELECT country, sex, sum(quantity), avg(price)
+ FROM sales
+ GROUP BY country, sex
+ ```
+
+#### Functions supported in pre-aggregate table
+
+| Function | Rollup supported |
+|----------|:----------------:|
+| SUM |Yes |
+| AVG |Yes |
+| MAX |Yes |
+| MIN |Yes |
+| COUNT |Yes |
+
+
+#### How pre-aggregate tables are selected
+When a user query is submitted, during query planning phase, CarbonData will collect all matched
+pre-aggregate tables as candidates according to Relational Algebra transformation rules. Then, the
+best pre-aggregate table for this query will be selected among the candidates based on cost.
+For simplicity, current cost estimation is based on the data size of the pre-aggregate table. (We
+assume that query will be faster on smaller table)
+
+For the main table **sales** and pre-aggregate table **agg_sales** created above, following queries
+```
+SELECT country, sex, sum(quantity), avg(price) from sales GROUP BY country, sex
+
+SELECT sex, sum(quantity) from sales GROUP BY sex
+
+SELECT sum(price), country from sales GROUP BY country
+```
+
+will be transformed by CarbonData's query planner to query against pre-aggregate table
+**agg_sales** instead of the main table **sales**
+
+However, for following queries
+```
+SELECT user_id, country, sex, sum(quantity), avg(price) from sales GROUP BY user_id, country, sex
+
+SELECT sex, avg(quantity) from sales GROUP BY sex
+
+SELECT country, max(price) from sales GROUP BY country
+```
+
+will query against main table **sales** only, because it does not satisfy pre-aggregate table
+selection logic.
+
+#### Loading data to pre-aggregate tables
+For existing table with loaded data, data load to pre-aggregate table will be triggered by the
+CREATE DATAMAP statement when user creates the pre-aggregate table. For incremental loads after
+aggregates tables are created, loading data to main table triggers the load to pre-aggregate tables
+once main table loading is complete.
+
+These loads are transactional
+meaning that data on main table and pre-aggregate tables are only visible to the user after all
+tables are loaded successfully, if one of these loads fails, new data are not visible in all tables
+as if the load operation is not happened.
+
+#### Querying data from pre-aggregate tables
+As a technique for query acceleration, Pre-aggregate tables cannot be queries directly.
+Queries are to be made on main table. While doing query planning, internally CarbonData will check
+associated pre-aggregate tables with the main table, and do query plan transformation accordingly.
+
+User can verify whether a query can leverage pre-aggregate table or not by executing `EXPLAIN`
+command, which will show the transformed logical plan, and thus user can check whether pre-aggregate
+table is selected.
+
+#### Compacting pre-aggregate tables
+Running Compaction command (`ALTER TABLE COMPACT`) on main table will **not automatically**
+compact the pre-aggregate tables created on the main table. User need to run Compaction command
+separately on each pre-aggregate table to compact them.
+
+Compaction is an optional operation for pre-aggregate table. If compaction is performed on
+main table but not performed on pre-aggregate table, all queries still can benefit from
+pre-aggregate tables. To further improve the query performance, compaction on pre-aggregate tables
+can be triggered to merge the segments and files in the pre-aggregate tables.
+
+#### Data Management on pre-aggregate tables
+Once there is pre-aggregate table created on the main table, following command on the main table
+is not supported:
+1. Data management command: `UPDATE/DELETE/DELETE SEGMENT`.
+2. Schema management command: `ALTER TABLE DROP COLUMN`, `ALTER TABLE CHANGE DATATYPE`,
+`ALTER TABLE RENAME`. Note that adding a new column is supported, and for dropping columns and
+change datatype command, CarbonData will check whether it will impact the pre-aggregate table, if
+ not, the operation is allowed, otherwise operation will be rejected by throwing exception.
+3. Partition management command: `ALTER TABLE ADD/DROP PARTITION`
+
+However, there is still way to support these operations on main table, in current CarbonData
+release, user can do as following:
+1. Remove the pre-aggregate table by `DROP DATAMAP` command
+2. Carry out the data management operation on main table
+3. Create the pre-aggregate table again by `CREATE DATAMAP` command
+Basically, user can manually trigger the operation by re-building the datamap.
+
+
http://git-wip-us.apache.org/repos/asf/carbondata/blob/95ac5eff/docs/datamap/timeseries-datamap-guide.md
----------------------------------------------------------------------
diff --git a/docs/datamap/timeseries-datamap-guide.md b/docs/datamap/timeseries-datamap-guide.md
new file mode 100644
index 0000000..ecd7234
--- /dev/null
+++ b/docs/datamap/timeseries-datamap-guide.md
@@ -0,0 +1,135 @@
+# CarbonData Timeseries DataMap
+
+## Supporting timeseries data (Alpha feature in 1.3.0)
+Timeseries DataMap a pre-aggregate table implementation based on 'preaggregate' DataMap.
+Difference is that Timerseries DataMap has built-in understanding of time hierarchy and
+levels: year, month, day, hour, minute, so that it supports automatic roll-up in time dimension
+for query.
+
+For instance, user can create multiple timeseries datamap on the main table which has a *event_time*
+column, one datamap for one time granularity. Then Carbondata can do automatic roll-up for queries
+on the main table.
+
+```
+CREATE DATAMAP agg_year
+ON TABLE sales
+USING "timeseries"
+DMPROPERTIES (
+ 'event_time'='order_time',
+ 'year_granualrity'='1',
+) AS
+SELECT order_time, country, sex, sum(quantity), max(quantity), count(user_id), sum(price),
+ avg(price) FROM sales GROUP BY order_time, country, sex
+
+CREATE DATAMAP agg_month
+ON TABLE sales
+USING "timeseries"
+DMPROPERTIES (
+ 'event_time'='order_time',
+ 'month_granualrity'='1',
+) AS
+SELECT order_time, country, sex, sum(quantity), max(quantity), count(user_id), sum(price),
+ avg(price) FROM sales GROUP BY order_time, country, sex
+
+CREATE DATAMAP agg_day
+ON TABLE sales
+USING "timeseries"
+DMPROPERTIES (
+ 'event_time'='order_time',
+ 'day_granualrity'='1',
+) AS
+SELECT order_time, country, sex, sum(quantity), max(quantity), count(user_id), sum(price),
+ avg(price) FROM sales GROUP BY order_time, country, sex
+
+CREATE DATAMAP agg_sales_hour
+ON TABLE sales
+USING "timeseries"
+DMPROPERTIES (
+ 'event_time'='order_time',
+ 'hour_granualrity'='1',
+) AS
+SELECT order_time, country, sex, sum(quantity), max(quantity), count(user_id), sum(price),
+ avg(price) FROM sales GROUP BY order_time, country, sex
+
+CREATE DATAMAP agg_minute
+ON TABLE sales
+USING "timeseries"
+DMPROPERTIES (
+ 'event_time'='order_time',
+ 'minute_granualrity'='1',
+) AS
+SELECT order_time, country, sex, sum(quantity), max(quantity), count(user_id), sum(price),
+ avg(price) FROM sales GROUP BY order_time, country, sex
+
+CREATE DATAMAP agg_minute
+ON TABLE sales
+USING "timeseries"
+DMPROPERTIES (
+ 'event_time'='order_time',
+ 'minute_granualrity'='1',
+) AS
+SELECT order_time, country, sex, sum(quantity), max(quantity), count(user_id), sum(price),
+ avg(price) FROM sales GROUP BY order_time, country, sex
+```
+
+For querying timeseries data, Carbondata has builtin support for following time related UDF
+to enable automatically roll-up to the desired aggregation level
+```
+timeseries(timeseries column name, 'aggregation level')
+```
+```
+SELECT timeseries(order_time, 'hour'), sum(quantity) FROM sales GROUP BY timeseries(order_time,
+'hour')
+```
+
+It is **not necessary** to create pre-aggregate tables for each granularity unless required for
+query. Carbondata can roll-up the data and fetch it.
+
+For Example: For main table **sales** , if following timeseries datamaps were created for day
+level and hour level pre-aggregate
+
+```
+ CREATE DATAMAP agg_day
+ ON TABLE sales
+ USING "timeseries"
+ DMPROPERTIES (
+ 'event_time'='order_time',
+ 'day_granualrity'='1',
+ ) AS
+ SELECT order_time, country, sex, sum(quantity), max(quantity), count(user_id), sum(price),
+ avg(price) FROM sales GROUP BY order_time, country, sex
+
+ CREATE DATAMAP agg_sales_hour
+ ON TABLE sales
+ USING "timeseries"
+ DMPROPERTIES (
+ 'event_time'='order_time',
+ 'hour_granualrity'='1',
+ ) AS
+ SELECT order_time, country, sex, sum(quantity), max(quantity), count(user_id), sum(price),
+ avg(price) FROM sales GROUP BY order_time, country, sex
+```
+
+Queries like below will be rolled-up and hit the timeseries datamaps
+```
+Select timeseries(order_time, 'month'), sum(quantity) from sales group by timeseries(order_time,
+ 'month')
+
+Select timeseries(order_time, 'year'), sum(quantity) from sales group by timeseries(order_time,
+ 'year')
+```
+
+NOTE (<b>RESTRICTION</b>):
+* Only value of 1 is supported for hierarchy levels. Other hierarchy levels will be supported in
+the future CarbonData release.
+* timeseries datamap for the desired levels needs to be created one after the other
+* timeseries datamaps created for each level needs to be dropped separately
+
+
+#### Compacting timeseries datamp
+Refer to Compaction section in [preaggregation datamap](https://github.com/apache/carbondata/blob/master/docs/datamap/preaggregate-datamap-guide.md).
+Same applies to timeseries datamap.
+
+#### Data Management on timeseries datamap
+Refer to Data Management section in [preaggregation datamap](https://github.com/apache/carbondata/blob/master/docs/datamap/preaggregate-datamap-guide.md).
+Same applies to timeseries datamap.
\ No newline at end of file