You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@carbondata.apache.org by ja...@apache.org on 2018/03/03 01:44:24 UTC
carbondata git commit: [CARBONDATA-2098] Optimize pre-aggregate
documentation
Repository: carbondata
Updated Branches:
refs/heads/master 74f5d67c0 -> 3e36639ed
[CARBONDATA-2098] Optimize pre-aggregate documentation
optimize pre-aggregate documentation
move to separate file
add more examples
This closes #2022
Project: http://git-wip-us.apache.org/repos/asf/carbondata/repo
Commit: http://git-wip-us.apache.org/repos/asf/carbondata/commit/3e36639e
Tree: http://git-wip-us.apache.org/repos/asf/carbondata/tree/3e36639e
Diff: http://git-wip-us.apache.org/repos/asf/carbondata/diff/3e36639e
Branch: refs/heads/master
Commit: 3e36639eda45933a036ea364dba1088c5f4e57ec
Parents: 74f5d67
Author: sraghunandan <ca...@gmail.com>
Authored: Fri Mar 2 17:02:39 2018 +0530
Committer: Jacky Li <ja...@qq.com>
Committed: Sat Mar 3 09:44:02 2018 +0800
----------------------------------------------------------------------
docs/data-management-on-carbondata.md | 242 --------------
docs/preaggregate-guide.md | 313 +++++++++++++++++++
.../examples/PreAggregateTableExample.scala | 50 ++-
3 files changed, 362 insertions(+), 243 deletions(-)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/carbondata/blob/3e36639e/docs/data-management-on-carbondata.md
----------------------------------------------------------------------
diff --git a/docs/data-management-on-carbondata.md b/docs/data-management-on-carbondata.md
index ea80d41..2aa4a49 100644
--- a/docs/data-management-on-carbondata.md
+++ b/docs/data-management-on-carbondata.md
@@ -26,7 +26,6 @@ This tutorial is going to introduce all commands and data operations on CarbonDa
* [UPDATE AND DELETE](#update-and-delete)
* [COMPACTION](#compaction)
* [PARTITION](#partition)
-* [PRE-AGGREGATE TABLES](#pre-aggregate-tables)
* [BUCKETING](#bucketing)
* [SEGMENT MANAGEMENT](#segment-management)
@@ -859,247 +858,6 @@ This tutorial is going to introduce all commands and data operations on CarbonDa
* The partitioned column can be excluded from SORT_COLUMNS, this will let other columns to do the efficient sorting.
* When writing SQL on a partition table, try to use filters on the partition column.
-
-## PRE-AGGREGATE TABLES
- CarbonData supports pre aggregating of data so that OLAP kind of queries can fetch data
- much faster. Aggregate tables are created as datamaps so that the handling is as efficient as
- other indexing support. Users can create as many aggregate tables they require as datamaps to
- improve their query performance, provided the storage requirements and loading speeds are
- acceptable.
-
- For main table called **sales** which is defined as
-
- ```
- CREATE TABLE sales (
- order_time timestamp,
- user_id STRING,
- sex STRING,
- country STRING,
- quantity INT,
- price BIGINT)
- STORED BY 'carbondata'
- ```
-
- user can create pre-aggregate tables using the DDL
-
- ```
- CREATE DATAMAP agg_sales
- ON TABLE sales
- USING "preaggregate"
- AS
- SELECT country, sex, sum(quantity), avg(price)
- FROM sales
- GROUP BY country, sex
- ```
-
-<b><p align="left">Functions supported in pre-aggregate tables</p></b>
-
-| Function | Rollup supported |
-|-----------|----------------|
-| SUM | Yes |
-| AVG | Yes |
-| MAX | Yes |
-| MIN | Yes |
-| COUNT | Yes |
-
-
-##### How pre-aggregate tables are selected
-For the main table **sales** and pre-aggregate table **agg_sales** created above, queries of the
-kind
-```
-SELECT country, sex, sum(quantity), avg(price) from sales GROUP BY country, sex
-
-SELECT sex, sum(quantity) from sales GROUP BY sex
-
-SELECT sum(price), country from sales GROUP BY country
-```
-
-will be transformed by Query Planner to fetch data from pre-aggregate table **agg_sales**
-
-But queries of kind
-```
-SELECT user_id, country, sex, sum(quantity), avg(price) from sales GROUP BY user_id, country, sex
-
-SELECT sex, avg(quantity) from sales GROUP BY sex
-
-SELECT country, max(price) from sales GROUP BY country
-```
-
-will fetch the data from the main table **sales**
-
-##### Loading data to pre-aggregate tables
-For existing table with loaded data, data load to pre-aggregate table will be triggered by the
-CREATE DATAMAP statement when user creates the pre-aggregate table.
-For incremental loads after aggregates tables are created, loading data to main table triggers
-the load to pre-aggregate tables once main table loading is complete. These loads are automic
-meaning that data on main table and aggregate tables are only visible to the user after all tables
-are loaded
-
-##### Querying data from pre-aggregate tables
-Pre-aggregate tables cannot be queries directly. Queries are to be made on main table. Internally
-carbondata will check associated pre-aggregate tables with the main table, and if the
-pre-aggregate tables satisfy the query condition, the plan is transformed automatically to use
-pre-aggregate table to fetch the data.
-
-##### Compacting pre-aggregate tables
-Compaction command (ALTER TABLE COMPACT) need to be run separately on each pre-aggregate table.
-Running Compaction command on main table will **not automatically** compact the pre-aggregate
-tables. Compaction is an optional operation for pre-aggregate table. If compaction is performed on
-main table but not performed on pre-aggregate table, all queries still can benefit from
-pre-aggregate tables. To further improve performance on pre-aggregate tables, compaction can be
-triggered on pre-aggregate tables directly, it will merge the segments inside pre-aggregate table.
-
-##### Update/Delete Operations on pre-aggregate tables
-This functionality is not supported.
-
- NOTE (<b>RESTRICTION</b>):
- Update/Delete operations are <b>not supported</b> on main table which has pre-aggregate tables
- created on it. All the pre-aggregate tables <b>will have to be dropped</b> before update/delete
- operations can be performed on the main table. Pre-aggregate tables can be rebuilt manually
- after update/delete operations are completed
-
-##### Delete Segment Operations on pre-aggregate tables
-This functionality is not supported.
-
- NOTE (<b>RESTRICTION</b>):
- Delete Segment operations are <b>not supported</b> on main table which has pre-aggregate tables
- created on it. All the pre-aggregate tables <b>will have to be dropped</b> before delete segment
- operations can be performed on the main table. Pre-aggregate tables can be rebuilt manually
- after delete segment operations are completed
-
-##### Alter Table Operations on pre-aggregate tables
-This functionality is not supported.
-
- NOTE (<b>RESTRICTION</b>):
- Adding new column in new table does not have any affect on pre-aggregate tables. However if
- dropping or renaming a column has impact in pre-aggregate table, such operations will be
- rejected and error will be thrown. All the pre-aggregate tables <b>will have to be dropped</b>
- before alter operations can be performed on the main table. Pre-aggregate tables can be rebuilt
- manually after alter table operations are completed
-
-### Supporting timeseries data (Alpha feature in 1.3.0)
-CarbonData has built-in understanding of time hierarchy and levels: year, month, day, hour, minute, second.
-Timeseries pre-aggregate tables can be created for the hierarchy and CarbonData can do automatic
-roll-up for the queries on these hierarchies.
-
- ```
- CREATE DATAMAP agg_year
- ON TABLE sales
- USING "timeseries"
- DMPROPERTIES (
- 'event_time'='order_time',
- 'year_granualrity'='1',
- ) AS
- SELECT order_time, country, sex, sum(quantity), max(quantity), count(user_id), sum(price),
- avg(price) FROM sales GROUP BY order_time, country, sex
-
- CREATE DATAMAP agg_month
- ON TABLE sales
- USING "timeseries"
- DMPROPERTIES (
- 'event_time'='order_time',
- 'month_granualrity'='1',
- ) AS
- SELECT order_time, country, sex, sum(quantity), max(quantity), count(user_id), sum(price),
- avg(price) FROM sales GROUP BY order_time, country, sex
-
- CREATE DATAMAP agg_day
- ON TABLE sales
- USING "timeseries"
- DMPROPERTIES (
- 'event_time'='order_time',
- 'day_granualrity'='1',
- ) AS
- SELECT order_time, country, sex, sum(quantity), max(quantity), count(user_id), sum(price),
- avg(price) FROM sales GROUP BY order_time, country, sex
-
- CREATE DATAMAP agg_sales_hour
- ON TABLE sales
- USING "timeseries"
- DMPROPERTIES (
- 'event_time'='order_time',
- 'hour_granualrity'='1',
- ) AS
- SELECT order_time, country, sex, sum(quantity), max(quantity), count(user_id), sum(price),
- avg(price) FROM sales GROUP BY order_time, country, sex
-
- CREATE DATAMAP agg_minute
- ON TABLE sales
- USING "timeseries"
- DMPROPERTIES (
- 'event_time'='order_time',
- 'minute_granualrity'='1',
- ) AS
- SELECT order_time, country, sex, sum(quantity), max(quantity), count(user_id), sum(price),
- avg(price) FROM sales GROUP BY order_time, country, sex
-
- CREATE DATAMAP agg_minute
- ON TABLE sales
- USING "timeseries"
- DMPROPERTIES (
- 'event_time'='order_time',
- 'minute_granualrity'='1',
- ) AS
- SELECT order_time, country, sex, sum(quantity), max(quantity), count(user_id), sum(price),
- avg(price) FROM sales GROUP BY order_time, country, sex
- ```
- For Querying data and automatically roll-up to the desired aggregation level, CarbonData supports
- UDF as
- ```
- timeseries(timeseries column name, 'aggregation level')
- ```
- Examples
- ```
- SELECT
- timeseries(order_time, 'hour'),
- sum(quantity)
- FROM sales
- GROUP BY timeseries(order_time, 'hour')
- ```
-
- It is **not necessary** to create pre-aggregate tables for each granularity unless required for
- query. CarbonData can roll-up the data and fetch it.
-
- For Example: For main table **sales**, If timeseries pre-aggregate tables were created as
-
- ```
- CREATE DATAMAP agg_day
- ON TABLE sales
- USING "timeseries"
- DMPROPERTIES (
- 'event_time'='order_time',
- 'day_granualrity'='1',
- ) AS
- SELECT order_time, country, sex, sum(quantity), max(quantity), count(user_id), sum(price),
- avg(price) FROM sales GROUP BY order_time, country, sex
-
- CREATE DATAMAP agg_sales_hour
- ON TABLE sales
- USING "timeseries"
- DMPROPERTIES (
- 'event_time'='order_time',
- 'hour_granualrity'='1',
- ) AS
- SELECT order_time, country, sex, sum(quantity), max(quantity), count(user_id), sum(price),
- avg(price) FROM sales GROUP BY order_time, country, sex
- ```
-
- Queries like below will be rolled-up and fetched from pre-aggregate tables
- ```
- Select timeseries(order_time, 'month'), sum(quantity) from sales group by timeseries(order_time,
- 'month')
-
- Select timeseries(order_time, 'year'), sum(quantity) from sales group by timeseries(order_time,
- 'year')
- ```
-
- NOTE (<b>RESTRICTION</b>):
- * Only 1 is supported for granularity value of timeseries pre-aggregate table. Other granularity value are not supported.
- * Only one granularity can be defined on creating one timeseries pre-aggregate table. Other granularity are created separately.
- * Pre-aggregate tables for the desired levels needs to be created one after the other
- * Pre-aggregate tables are created for each level needs to be dropped separately
-
-
## BUCKETING
Bucketing feature can be used to distribute/organize the table/partition data into multiple files such
http://git-wip-us.apache.org/repos/asf/carbondata/blob/3e36639e/docs/preaggregate-guide.md
----------------------------------------------------------------------
diff --git a/docs/preaggregate-guide.md b/docs/preaggregate-guide.md
new file mode 100644
index 0000000..411433a
--- /dev/null
+++ b/docs/preaggregate-guide.md
@@ -0,0 +1,313 @@
+# CarbonData Pre-aggregate tables
+
+## Quick example
+Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME
+
+Package carbon jar, and copy assembly/target/scala-2.11/carbondata_2.11-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar to $SPARK_HOME/jars
+```shell
+mvn clean package -DskipTests -Pspark-2.2
+```
+
+Start spark-shell in new terminal, type :paste, then copy and run the following code.
+```scala
+ import java.io.File
+ import org.apache.spark.sql.{CarbonEnv, SparkSession}
+ import org.apache.spark.sql.CarbonSession._
+ import org.apache.spark.sql.streaming.{ProcessingTime, StreamingQuery}
+ import org.apache.carbondata.core.util.path.CarbonStorePath
+
+ val warehouse = new File("./warehouse").getCanonicalPath
+ val metastore = new File("./metastore").getCanonicalPath
+
+ val spark = SparkSession
+ .builder()
+ .master("local")
+ .appName("preAggregateExample")
+ .config("spark.sql.warehouse.dir", warehouse)
+ .getOrCreateCarbonSession(warehouse, metastore)
+
+ spark.sparkContext.setLogLevel("ERROR")
+
+ // drop table if exists previously
+ spark.sql(s"DROP TABLE IF EXISTS sales")
+ // Create target carbon table and populate with initial data
+ spark.sql(
+ s"""
+ | CREATE TABLE sales (
+ | user_id string,
+ | country string,
+ | quantity int,
+ | price bigint)
+ | STORED BY 'carbondata'""".stripMargin)
+
+ spark.sql(
+ s"""
+ | CREATE DATAMAP agg_sales
+ | ON TABLE sales
+ | USING "preaggregate"
+ | AS
+ | SELECT country, sum(quantity), avg(price)
+ | FROM sales
+ | GROUP BY country""".stripMargin)
+
+ import spark.implicits._
+ import org.apache.spark.sql.SaveMode
+ import scala.util.Random
+
+ val r = new Random()
+ val df = spark.sparkContext.parallelize(1 to 10)
+ .map(x => ("ID." + r.nextInt(100000), "country" + x % 8, x % 50, x % 60))
+ .toDF("user_id", "country", "quantity", "price")
+
+ // Create table with pre-aggregate table
+ df.write.format("carbondata")
+ .option("tableName", "sales")
+ .option("compress", "true")
+ .mode(SaveMode.Append).save()
+
+ spark.sql(
+ s"""
+ |SELECT country, sum(quantity), avg(price)
+ | from sales GROUP BY country""".stripMargin).show
+
+ spark.stop
+```
+
+##PRE-AGGREGATE TABLES
+ Carbondata supports pre aggregating of data so that OLAP kind of queries can fetch data
+ much faster.Aggregate tables are created as datamaps so that the handling is as efficient as
+ other indexing support.Users can create as many aggregate tables they require as datamaps to
+ improve their query performance,provided the storage requirements and loading speeds are
+ acceptable.
+
+ For main table called **sales** which is defined as
+
+ ```
+ CREATE TABLE sales (
+ order_time timestamp,
+ user_id string,
+ sex string,
+ country string,
+ quantity int,
+ price bigint)
+ STORED BY 'carbondata'
+ ```
+
+ user can create pre-aggregate tables using the DDL
+
+ ```
+ CREATE DATAMAP agg_sales
+ ON TABLE sales
+ USING "preaggregate"
+ AS
+ SELECT country, sex, sum(quantity), avg(price)
+ FROM sales
+ GROUP BY country, sex
+ ```
+
+
+
+<b><p align="left">Functions supported in pre-aggregate tables</p></b>
+
+| Function | Rollup supported |
+|-----------|----------------|
+| SUM | Yes |
+| AVG | Yes |
+| MAX | Yes |
+| MIN | Yes |
+| COUNT | Yes |
+
+
+##### How pre-aggregate tables are selected
+For the main table **sales** and pre-aggregate table **agg_sales** created above, queries of the
+kind
+```
+SELECT country, sex, sum(quantity), avg(price) from sales GROUP BY country, sex
+
+SELECT sex, sum(quantity) from sales GROUP BY sex
+
+SELECT sum(price), country from sales GROUP BY country
+```
+
+will be transformed by Query Planner to fetch data from pre-aggregate table **agg_sales**
+
+But queries of kind
+```
+SELECT user_id, country, sex, sum(quantity), avg(price) from sales GROUP BY user_id, country, sex
+
+SELECT sex, avg(quantity) from sales GROUP BY sex
+
+SELECT country, max(price) from sales GROUP BY country
+```
+
+will fetch the data from the main table **sales**
+
+##### Loading data to pre-aggregate tables
+For existing table with loaded data, data load to pre-aggregate table will be triggered by the
+CREATE DATAMAP statement when user creates the pre-aggregate table.
+For incremental loads after aggregates tables are created, loading data to main table triggers
+the load to pre-aggregate tables once main table loading is complete. These loads are automic
+meaning that data on main table and aggregate tables are only visible to the user after all tables
+are loaded
+
+##### Querying data from pre-aggregate tables
+Pre-aggregate tables cannot be queries directly. Queries are to be made on main table. Internally
+carbondata will check associated pre-aggregate tables with the main table, and if the
+pre-aggregate tables satisfy the query condition, the plan is transformed automatically to use
+pre-aggregate table to fetch the data.
+
+##### Compacting pre-aggregate tables
+Compaction command (ALTER TABLE COMPACT) need to be run separately on each pre-aggregate table.
+Running Compaction command on main table will **not automatically** compact the pre-aggregate
+tables.Compaction is an optional operation for pre-aggregate table. If compaction is performed on
+main table but not performed on pre-aggregate table, all queries still can benefit from
+pre-aggregate tables. To further improve performance on pre-aggregate tables, compaction can be
+triggered on pre-aggregate tables directly, it will merge the segments inside pre-aggregate table.
+
+##### Update/Delete Operations on pre-aggregate tables
+This functionality is not supported.
+
+ NOTE (<b>RESTRICTION</b>):
+ Update/Delete operations are <b>not supported</b> on main table which has pre-aggregate tables
+ created on it. All the pre-aggregate tables <b>will have to be dropped</b> before update/delete
+ operations can be performed on the main table. Pre-aggregate tables can be rebuilt manually
+ after update/delete operations are completed
+
+##### Delete Segment Operations on pre-aggregate tables
+This functionality is not supported.
+
+ NOTE (<b>RESTRICTION</b>):
+ Delete Segment operations are <b>not supported</b> on main table which has pre-aggregate tables
+ created on it. All the pre-aggregate tables <b>will have to be dropped</b> before update/delete
+ operations can be performed on the main table. Pre-aggregate tables can be rebuilt manually
+ after delete segment operations are completed
+
+##### Alter Table Operations on pre-aggregate tables
+This functionality is not supported.
+
+ NOTE (<b>RESTRICTION</b>):
+ Adding new column in new table does not have any affect on pre-aggregate tables. However if
+ dropping or renaming a column has impact in pre-aggregate table, such operations will be
+ rejected and error will be thrown. All the pre-aggregate tables <b>will have to be dropped</b>
+ before Alter Operations can be performed on the main table. Pre-aggregate tables can be rebuilt
+ manually after Alter Table operations are completed
+
+### Supporting timeseries data (Alpha feature in 1.3.0)
+Carbondata has built-in understanding of time hierarchy and levels: year, month, day, hour, minute.
+Multiple pre-aggregate tables can be created for the hierarchy and Carbondata can do automatic
+roll-up for the queries on these hierarchies.
+
+ ```
+ CREATE DATAMAP agg_year
+ ON TABLE sales
+ USING "timeseries"
+ DMPROPERTIES (
+ 'event_time'='order_time',
+ 'year_granualrity'='1',
+ ) AS
+ SELECT order_time, country, sex, sum(quantity), max(quantity), count(user_id), sum(price),
+ avg(price) FROM sales GROUP BY order_time, country, sex
+
+ CREATE DATAMAP agg_month
+ ON TABLE sales
+ USING "timeseries"
+ DMPROPERTIES (
+ 'event_time'='order_time',
+ 'month_granualrity'='1',
+ ) AS
+ SELECT order_time, country, sex, sum(quantity), max(quantity), count(user_id), sum(price),
+ avg(price) FROM sales GROUP BY order_time, country, sex
+
+ CREATE DATAMAP agg_day
+ ON TABLE sales
+ USING "timeseries"
+ DMPROPERTIES (
+ 'event_time'='order_time',
+ 'day_granualrity'='1',
+ ) AS
+ SELECT order_time, country, sex, sum(quantity), max(quantity), count(user_id), sum(price),
+ avg(price) FROM sales GROUP BY order_time, country, sex
+
+ CREATE DATAMAP agg_sales_hour
+ ON TABLE sales
+ USING "timeseries"
+ DMPROPERTIES (
+ 'event_time'='order_time',
+ 'hour_granualrity'='1',
+ ) AS
+ SELECT order_time, country, sex, sum(quantity), max(quantity), count(user_id), sum(price),
+ avg(price) FROM sales GROUP BY order_time, country, sex
+
+ CREATE DATAMAP agg_minute
+ ON TABLE sales
+ USING "timeseries"
+ DMPROPERTIES (
+ 'event_time'='order_time',
+ 'minute_granualrity'='1',
+ ) AS
+ SELECT order_time, country, sex, sum(quantity), max(quantity), count(user_id), sum(price),
+ avg(price) FROM sales GROUP BY order_time, country, sex
+
+ CREATE DATAMAP agg_minute
+ ON TABLE sales
+ USING "timeseries"
+ DMPROPERTIES (
+ 'event_time'='order_time',
+ 'minute_granualrity'='1',
+ ) AS
+ SELECT order_time, country, sex, sum(quantity), max(quantity), count(user_id), sum(price),
+ avg(price) FROM sales GROUP BY order_time, country, sex
+ ```
+
+ For Querying data and automatically roll-up to the desired aggregation level,Carbondata supports
+ UDF as
+ ```
+ timeseries(timeseries column name, 'aggregation level')
+ ```
+ ```
+ Select timeseries(order_time, 'hour'), sum(quantity) from sales group by timeseries(order_time,
+ 'hour')
+ ```
+
+ It is **not necessary** to create pre-aggregate tables for each granularity unless required for
+ query. Carbondata can roll-up the data and fetch it.
+
+ For Example: For main table **sales** , If pre-aggregate tables were created as
+
+ ```
+ CREATE DATAMAP agg_day
+ ON TABLE sales
+ USING "timeseries"
+ DMPROPERTIES (
+ 'event_time'='order_time',
+ 'day_granualrity'='1',
+ ) AS
+ SELECT order_time, country, sex, sum(quantity), max(quantity), count(user_id), sum(price),
+ avg(price) FROM sales GROUP BY order_time, country, sex
+
+ CREATE DATAMAP agg_sales_hour
+ ON TABLE sales
+ USING "timeseries"
+ DMPROPERTIES (
+ 'event_time'='order_time',
+ 'hour_granualrity'='1',
+ ) AS
+ SELECT order_time, country, sex, sum(quantity), max(quantity), count(user_id), sum(price),
+ avg(price) FROM sales GROUP BY order_time, country, sex
+ ```
+
+ Queries like below will be rolled-up and fetched from pre-aggregate tables
+ ```
+ Select timeseries(order_time, 'month'), sum(quantity) from sales group by timeseries(order_time,
+ 'month')
+
+ Select timeseries(order_time, 'year'), sum(quantity) from sales group by timeseries(order_time,
+ 'year')
+ ```
+
+ NOTE (<b>RESTRICTION</b>):
+ * Only value of 1 is supported for hierarchy levels. Other hierarchy levels are not supported.
+ Other hierarchy levels are not supported
+ * pre-aggregate tables for the desired levels needs to be created one after the other
+ * pre-aggregate tables created for each level needs to be dropped separately
+
http://git-wip-us.apache.org/repos/asf/carbondata/blob/3e36639e/examples/spark2/src/main/scala/org/apache/carbondata/examples/PreAggregateTableExample.scala
----------------------------------------------------------------------
diff --git a/examples/spark2/src/main/scala/org/apache/carbondata/examples/PreAggregateTableExample.scala b/examples/spark2/src/main/scala/org/apache/carbondata/examples/PreAggregateTableExample.scala
index d27eefb..ace3dcc 100644
--- a/examples/spark2/src/main/scala/org/apache/carbondata/examples/PreAggregateTableExample.scala
+++ b/examples/spark2/src/main/scala/org/apache/carbondata/examples/PreAggregateTableExample.scala
@@ -38,6 +38,7 @@ object PreAggregateTableExample {
// 1. simple usage for Pre-aggregate tables creation and query
spark.sql("DROP TABLE IF EXISTS mainTable")
+ spark.sql("DROP TABLE IF EXISTS mainTable_other")
spark.sql("""
| CREATE TABLE mainTable
| (id Int,
@@ -47,10 +48,23 @@ object PreAggregateTableExample {
| STORED BY 'org.apache.carbondata.format'
""".stripMargin)
+ spark.sql("""
+ | CREATE TABLE mainTable_other
+ | (id Int,
+ | name String,
+ | city String,
+ | age Int)
+ | STORED BY 'org.apache.carbondata.format'
+ """.stripMargin)
+
spark.sql(s"""
LOAD DATA LOCAL INPATH '$testData' into table mainTable
""")
+ spark.sql(s"""
+ LOAD DATA LOCAL INPATH '$testData' into table mainTable_other
+ """)
+
spark.sql(
s"""create datamap preagg_sum on table mainTable using 'preaggregate' as
| select id,sum(age) from mainTable group by id"""
@@ -59,14 +73,17 @@ object PreAggregateTableExample {
s"""create datamap preagg_avg on table mainTable using 'preaggregate' as
| select id,avg(age) from mainTable group by id"""
.stripMargin)
+
spark.sql(
- s"""create datamap preagg_count on table mainTable using 'preaggregate' as
+ s"""create datamap preagg_count_age on table mainTable using 'preaggregate' as
| select id,count(age) from mainTable group by id"""
.stripMargin)
+
spark.sql(
s"""create datamap preagg_min on table mainTable using 'preaggregate' as
| select id,min(age) from mainTable group by id"""
.stripMargin)
+
spark.sql(
s"""create datamap preagg_max on table mainTable using 'preaggregate' as
| select id,max(age) from mainTable group by id"""
@@ -74,10 +91,40 @@ object PreAggregateTableExample {
spark.sql(
s"""
+ | create datamap preagg_case on table mainTable using 'preaggregate' as
+ | select name,sum(case when age=35 then id else 0 end) from mainTable group by name
+ | """.stripMargin)
+
+ spark.sql(
+ s"""create datamap preagg_count on table maintable using 'preaggregate' as
+ | select name, count(*) from maintable group by name""".stripMargin)
+
+ spark.sql(
+ s"""
| SELECT id,max(age)
| FROM mainTable group by id
""".stripMargin).show()
+ spark.sql(
+ s"""
+ | select name, count(*) from
+ | mainTable group by name
+ """.stripMargin).show()
+
+ spark.sql(
+ s"""
+ | select name as NewName,
+ | sum(case when age=35 then id else 0 end) as sum
+ | from mainTable group by name order by name
+ """.stripMargin).show()
+
+ spark.sql(
+ s"""
+ | select t1.name,t1.city from mainTable_other t1 join
+ | (select name as newnewname,sum(age) as sum
+ | from mainTable group by name )t2 on t1.name=t2.newnewname
+ """.stripMargin).show()
+
// 2.compare the performance : with pre-aggregate VS main table
// build test data, if set the data is larger than 100M, it will take 10+ mins.
@@ -160,6 +207,7 @@ object PreAggregateTableExample {
// scalastyle:on
spark.sql("DROP TABLE IF EXISTS mainTable")
+ spark.sql("DROP TABLE IF EXISTS mainTable_other")
spark.sql("DROP TABLE IF EXISTS personTable")
spark.sql("DROP TABLE IF EXISTS personTableWithoutAgg")