Posted to issues@carbondata.apache.org by sraghunandan <gi...@git.apache.org> on 2018/03/02 11:37:47 UTC

[GitHub] carbondata pull request #2022: [CARBONDATA-2098] Optimize pre-aggregate docu...

GitHub user sraghunandan opened a pull request:

    https://github.com/apache/carbondata/pull/2022

    [CARBONDATA-2098] Optimize pre-aggregate documentation

    optimize pre-aggregate documentation
    move to separate file
    add more examples
    
    Be sure to do all of the following checklist to help us incorporate 
    your contribution quickly and easily:
    
     - [x] Any interfaces changed?
     No
     - [x] Any backward compatibility impacted?
     No
     - [x] Document update required?
    Updating docs
     - [x] Testing done
            Please provide details on 
            - Whether new unit test cases have been added or why no new tests are required?
            - How it is tested? Please attach test report.
            - Is it a performance related change? Please attach the performance test report.
            - Any additional information to help reviewers in testing this change.
      NA     
     - [x] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. 
    NA


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sraghunandan/carbondata-1 agg_doc_new_file

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/2022.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2022
    
----
commit 742359d1640bab97b3c0d40d948b0bedf8fe6a30
Author: sraghunandan <ca...@...>
Date:   2018-03-02T11:32:39Z

    optimize pre-aggregate documentation;move to separate file;add more examples

----


---

[GitHub] carbondata pull request #2022: [CARBONDATA-2098] Optimize pre-aggregate docu...

Posted by jackylk <gi...@git.apache.org>.
Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2022#discussion_r172000793
  
    --- Diff: docs/preaggregate-guide.md ---
    @@ -0,0 +1,313 @@
    +# CarbonData Pre-aggregate tables
    +  
    +## Quick example
    +Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME
    +
    +Package carbon jar, and copy assembly/target/scala-2.11/carbondata_2.11-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar to $SPARK_HOME/jars
    +```shell
    +mvn clean package -DskipTests -Pspark-2.2
    +```
    +
    +Start spark-shell in a new terminal, type :paste, then copy and run the following code.
    +```scala
    + import java.io.File
    + import org.apache.spark.sql.{CarbonEnv, SparkSession}
    + import org.apache.spark.sql.CarbonSession._
    + import org.apache.spark.sql.streaming.{ProcessingTime, StreamingQuery}
    + import org.apache.carbondata.core.util.path.CarbonStorePath
    + 
    + val warehouse = new File("./warehouse").getCanonicalPath
    + val metastore = new File("./metastore").getCanonicalPath
    + 
    + val spark = SparkSession
    +   .builder()
    +   .master("local")
    +   .appName("preAggregateExample")
    +   .config("spark.sql.warehouse.dir", warehouse)
    +   .getOrCreateCarbonSession(warehouse, metastore)
    +
    + spark.sparkContext.setLogLevel("ERROR")
    +
    + // Drop the table if it already exists
    + spark.sql(s"DROP TABLE IF EXISTS sales")
    + // Create the main carbon table (data is loaded further below)
    + spark.sql(
    +   s"""
    +      | CREATE TABLE sales (
    +      | user_id string,
    +      | country string,
    +      | quantity int,
    +      | price bigint)
    +      | STORED BY 'carbondata'""".stripMargin)
    +      
    + spark.sql(
    +   s"""
    +      | CREATE DATAMAP agg_sales
    +      | ON TABLE sales
    +      | USING "preaggregate"
    +      | AS
    +      | SELECT country, sum(quantity), avg(price)
    +      | FROM sales
    +      | GROUP BY country""".stripMargin)
    +      
    + import spark.implicits._
    + import org.apache.spark.sql.SaveMode
    + import scala.util.Random
    + 
    + val r = new Random()
    + val df = spark.sparkContext.parallelize(1 to 10)
    +   .map(x => ("ID." + r.nextInt(100000), "country" + x % 8, x % 50, x % 60))
    +   .toDF("user_id", "country", "quantity", "price")
    +
    + // Load data into the main table; this also triggers the load to the pre-aggregate table
    + df.write.format("carbondata")
    +   .option("tableName", "sales")
    +   .option("compress", "true")
    +   .mode(SaveMode.Append).save()
    +      
    + spark.sql(
    +   s"""
    +      | SELECT country, sum(quantity), avg(price)
    +      | FROM sales
    +      | GROUP BY country""".stripMargin).show
    +
    + spark.stop
    +```
    +
    +## PRE-AGGREGATE TABLES
    +  CarbonData supports pre-aggregating data so that OLAP-style queries can fetch data
    +  much faster. Aggregate tables are created as datamaps so that they are handled as efficiently
    +  as other indexes. Users can create as many aggregate tables as they require as datamaps to
    +  improve query performance, provided the storage requirements and loading speeds are
    +  acceptable.
    +  
    +  For a main table called **sales**, which is defined as
    +  
    +  ```
    +  CREATE TABLE sales (
    +  order_time timestamp,
    +  user_id string,
    +  sex string,
    +  country string,
    +  quantity int,
    +  price bigint)
    +  STORED BY 'carbondata'
    +  ```
    +  
    +  users can create pre-aggregate tables using the DDL
    +  
    +  ```
    +  CREATE DATAMAP agg_sales
    +  ON TABLE sales
    +  USING "preaggregate"
    +  AS
    +  SELECT country, sex, sum(quantity), avg(price)
    +  FROM sales
    +  GROUP BY country, sex
    +  ```
    +  
    +
    +  
    +<b><p align="left">Functions supported in pre-aggregate tables</p></b>
    +
    +| Function | Rollup supported |
    +|-----------|----------------|
    +| SUM | Yes |
    +| AVG | Yes |
    +| MAX | Yes |
    +| MIN | Yes |
    +| COUNT | Yes |
    +
    +
    +##### How pre-aggregate tables are selected
    +For the main table **sales** and the pre-aggregate table **agg_sales** created above, queries
    +such as
    +```
    +SELECT country, sex, sum(quantity), avg(price) from sales GROUP BY country, sex
    +
    +SELECT sex, sum(quantity) from sales GROUP BY sex
    +
    +SELECT sum(price), country from sales GROUP BY country
    +``` 
    +
    +will be transformed by the query planner to fetch data from the pre-aggregate table **agg_sales**.
    +
    +But queries such as
    +```
    +SELECT user_id, country, sex, sum(quantity), avg(price) from sales GROUP BY user_id, country, sex
    +
    +SELECT sex, avg(quantity) from sales GROUP BY sex
    +
    +SELECT country, max(price) from sales GROUP BY country
    +```
    +
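    +will fetch the data from the main table **sales**, because **agg_sales** does not contain the
    +user_id column, the count needed for avg(quantity), or max(price).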
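    +
    +The selection rules above can be illustrated with a small, plain-Scala sketch (no Spark or
    +CarbonData needed). It assumes that avg(price) is kept in the pre-aggregate table as a sum and
    +a count, which is what would allow sum(price) and avg(price) to be rolled up to coarser groups.
    +The `AggRow` class, its field names and the sample values are hypothetical and serve only as an
    +illustration.
    +```scala
    +// One row of a hypothetical pre-aggregate table grouped by (country, sex).
    +// avg(price) is modelled as sumPrice / countPrice so that it can be re-aggregated.
    +final case class AggRow(country: String, sex: String,
    +                        sumQuantity: Long, sumPrice: Long, countPrice: Long)
    +
    +object RollupSketch {
    +  // Hypothetical content of agg_sales.
    +  val sample: Seq[AggRow] = Seq(
    +    AggRow("IN", "F", sumQuantity = 10, sumPrice = 300, countPrice = 3),
    +    AggRow("IN", "M", sumQuantity = 5, sumPrice = 100, countPrice = 1),
    +    AggRow("US", "F", sumQuantity = 7, sumPrice = 90, countPrice = 2))
    +
    +  // Answers: SELECT sex, sum(quantity) FROM sales GROUP BY sex
    +  def sumQuantityBySex(rows: Seq[AggRow]): Map[String, Long] =
    +    rows.groupBy(_.sex).map { case (sex, rs) => (sex, rs.map(_.sumQuantity).sum) }
    +
    +  // Answers: SELECT country, sum(price), avg(price) FROM sales GROUP BY country
    +  def priceByCountry(rows: Seq[AggRow]): Map[String, (Long, Double)] =
    +    rows.groupBy(_.country).map { case (country, rs) =>
    +      val total = rs.map(_.sumPrice).sum
    +      val count = rs.map(_.countPrice).sum
    +      (country, (total, total.toDouble / count))
    +    }
    +}
    +```
    +max(price) and avg(quantity) cannot be computed from these rows, since neither the maximum
    +price nor the count of quantity values is stored, which matches the queries that fall back to
    +the main table.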
    +
    +##### Loading data to pre-aggregate tables
    +For an existing table with loaded data, the data load to the pre-aggregate table is triggered by
    +the CREATE DATAMAP statement when the user creates the pre-aggregate table.
    +For incremental loads after aggregate tables are created, loading data to the main table triggers
    +the load to the pre-aggregate tables once the main table load is complete. These loads are atomic,
    +meaning that data in the main table and the aggregate tables becomes visible to the user only
    +after all tables are loaded.
    +
    +##### Querying data from pre-aggregate tables
    +Pre-aggregate tables cannot be queried directly. Queries are made on the main table. Internally,
    +CarbonData checks the pre-aggregate tables associated with the main table, and if a
    +pre-aggregate table satisfies the query condition, the plan is transformed automatically to use
    +that pre-aggregate table to fetch the data.
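    +
    +One way to check whether a query was rewritten is to print its plan. The following is a minimal
    +sketch, assuming the `spark` session and the **sales** table from the quick example; the helper
    +name `showAggregatePlan` is hypothetical. If the rewrite took place, the plan is expected to
    +reference the pre-aggregate child table instead of **sales**; the exact plan text depends on the
    +Spark and CarbonData versions.
    +```scala
    +import org.apache.spark.sql.SparkSession
    +
    +// Prints the logical and physical plans of an aggregate query on the main table.
    +def showAggregatePlan(spark: SparkSession): Unit = {
    +  spark.sql("SELECT country, sum(quantity) FROM sales GROUP BY country").explain(true)
    +}
    +```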
    +
    +##### Compacting pre-aggregate tables
    +The compaction command (ALTER TABLE COMPACT) needs to be run separately on each pre-aggregate table.
    +Running the compaction command on the main table will **not automatically** compact the
    +pre-aggregate tables. Compaction is an optional operation for pre-aggregate tables. If compaction
    +is performed on the main table but not on the pre-aggregate tables, all queries can still benefit
    +from the pre-aggregate tables. To further improve performance, compaction can be triggered on the
    +pre-aggregate tables directly; it merges the segments inside each pre-aggregate table.
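    +
    +A minimal sketch of compacting both tables, assuming the `spark` session from the quick example.
    +The child table name `sales_agg_sales` is an assumption (main table name followed by the datamap
    +name); check the actual name in your deployment, for example with SHOW TABLES. The helper name
    +`compactSalesTables` is hypothetical.
    +```scala
    +import org.apache.spark.sql.SparkSession
    +
    +// Compacts the main table, then separately compacts the pre-aggregate table.
    +// "sales_agg_sales" is an assumed child table name, not confirmed by this guide.
    +def compactSalesTables(spark: SparkSession): Unit = {
    +  spark.sql("ALTER TABLE sales COMPACT 'MINOR'")
    +  spark.sql("ALTER TABLE sales_agg_sales COMPACT 'MINOR'")
    +}
    +```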
    +
    +##### Update/Delete Operations on pre-aggregate tables
    +This functionality is not supported.
    +
    +  NOTE (<b>RESTRICTION</b>):
    +  Update/Delete operations are <b>not supported</b> on a main table that has pre-aggregate tables
    +  created on it. All the pre-aggregate tables <b>will have to be dropped</b> before update/delete
    +  operations can be performed on the main table. Pre-aggregate tables can be rebuilt manually
    +  after the update/delete operations are completed.
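    +
    +The drop, update and rebuild sequence described in the note can be sketched as follows, assuming
    +the `spark` session, the **sales** table and the **agg_sales** datamap from the quick example.
    +The helper name `updateSalesWithRebuild` and the UPDATE statement itself are only illustrative.
    +```scala
    +import org.apache.spark.sql.SparkSession
    +
    +def updateSalesWithRebuild(spark: SparkSession): Unit = {
    +  // 1. Drop the pre-aggregate table so that the update is accepted.
    +  spark.sql("DROP DATAMAP IF EXISTS agg_sales ON TABLE sales")
    +  // 2. Run the update on the main table (illustrative statement).
    +  spark.sql("UPDATE sales SET (price) = (price + 1) WHERE country = 'country1'")
    +  // 3. Rebuild the pre-aggregate table; creating it loads the existing data.
    +  spark.sql(
    +    s"""
    +       | CREATE DATAMAP agg_sales
    +       | ON TABLE sales
    +       | USING "preaggregate"
    +       | AS
    +       | SELECT country, sum(quantity), avg(price)
    +       | FROM sales
    +       | GROUP BY country""".stripMargin)
    +}
    +```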
    + 
    +##### Delete Segment Operations on pre-aggregate tables
    +This functionality is not supported.
    +
    +  NOTE (<b>RESTRICTION</b>):
    +  Delete Segment operations are <b>not supported</b> on a main table that has pre-aggregate tables
    +  created on it. All the pre-aggregate tables <b>will have to be dropped</b> before delete segment
    +  operations can be performed on the main table. Pre-aggregate tables can be rebuilt manually
    +  after the delete segment operations are completed.
    +  
    +##### Alter Table Operations on pre-aggregate tables
    +This functionality is not supported.
    +
    +  NOTE (<b>RESTRICTION</b>):
    +  Adding a new column to the main table does not have any effect on pre-aggregate tables. However,
    +  if dropping or renaming a column affects a pre-aggregate table, the operation will be
    +  rejected and an error will be thrown. All the pre-aggregate tables <b>will have to be dropped</b>
    +  before such Alter Table operations can be performed on the main table. Pre-aggregate tables can
    +  be rebuilt manually after the Alter Table operations are completed.
    +  
    +### Supporting timeseries data (Alpha feature in 1.3.0)
    --- End diff --
    
    I think it would be better to create a datamap folder under the docs folder and put the preaggregate guide and the timeseries guide there as separate documents.


---

[GitHub] carbondata issue #2022: [CARBONDATA-2098] Optimize pre-aggregate documentati...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2022
  
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/2804/



---

[GitHub] carbondata issue #2022: [CARBONDATA-2098] Optimize pre-aggregate documentati...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2022
  
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/4050/



---

[GitHub] carbondata issue #2022: [CARBONDATA-2098] Optimize pre-aggregate documentati...

Posted by jackylk <gi...@git.apache.org>.
Github user jackylk commented on the issue:

    https://github.com/apache/carbondata/pull/2022
  
    LGTM


---

[GitHub] carbondata issue #2022: [CARBONDATA-2098] Optimize pre-aggregate documentati...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2022
  
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/2795/



---

[GitHub] carbondata issue #2022: [WIP][CARBONDATA-2098] Optimize pre-aggregate docume...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2022
  
    Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/2800/



---

[GitHub] carbondata issue #2022: [CARBONDATA-2098] Optimize pre-aggregate documentati...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2022
  
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/4041/



---

[GitHub] carbondata pull request #2022: [CARBONDATA-2098] Optimize pre-aggregate docu...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/carbondata/pull/2022


---

[GitHub] carbondata issue #2022: [WIP][CARBONDATA-2098] Optimize pre-aggregate docume...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2022
  
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/4046/



---