You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@carbondata.apache.org by jackylk <gi...@git.apache.org> on 2017/03/05 16:05:46 UTC

[GitHub] incubator-carbondata pull request #624: [WIP] Add performance test

GitHub user jackylk opened a pull request:

    https://github.com/apache/incubator-carbondata/pull/624

    [WIP] Add performance test

    1. add a CompareTest for spark-2.1 integration
    2. add single pass support for dataframe.write to carbon

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jackylk/incubator-carbondata comparetest

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-carbondata/pull/624.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #624
    
----
commit 394112f72c87375965a279762d743dc279695f0d
Author: jackylk <ja...@huawei.com>
Date:   2017-03-04T02:57:31Z

    add compareTest

commit 07aa5f98e4cfb6021913ab671d66a48a5420ee19
Author: jackylk <ja...@huawei.com>
Date:   2017-03-04T15:17:40Z

    change compare test

commit ea6e88567ee139954cb7a4303e5c2cf13316f83b
Author: jackylk <ja...@huawei.com>
Date:   2017-03-05T16:02:27Z

    add singlepass

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata issue #624: [CARBONDATA-747][WIP] Add simple performanc...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/incubator-carbondata/pull/624
  
    Build Success with Spark 1.6.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/1027/



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #624: [CARBONDATA-747] Add simple performa...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/incubator-carbondata/pull/624


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #624: [CARBONDATA-747][WIP] Add simple per...

Posted by jarray888 <gi...@git.apache.org>.
Github user jarray888 commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/624#discussion_r104369679
  
    --- Diff: examples/spark2/src/main/scala/org/apache/carbondata/examples/CompareTest.scala ---
    @@ -0,0 +1,122 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.carbondata.examples
    +
    +import org.apache.spark.sql.{DataFrame, Row, SaveMode, SparkSession}
    +
    +import org.apache.carbondata.core.util.CarbonProperties
    +
    +// scalastyle:off println
    +object CompareTest {
    +
    +  val parquetTableName = "comparetest_parquet"
    +  val carbonTableName = "comparetest_carbon"
    +
    +  private def generateDataFrame(spark: SparkSession): DataFrame = {
    +    import spark.implicits._
    +    spark.sparkContext.parallelize(1 to 10 * 1000 * 1000, 4)
    +        .map(x => ("i" + x, "p" + x % 10, "j" + x % 100, x, x + 1, (x + 7) % 21, (x + 5) / 43, x
    +            * 5))
    +        .toDF("id", "country", "city", "c4", "c5", "c6", "c7", "c8")
    +  }
    +
    +  private def loadParquetTable(spark: SparkSession, input: DataFrame): Long = timeit {
    +    input.write.mode(SaveMode.Overwrite).parquet(parquetTableName)
    +  }
    +
    +  private def loadCarbonTable(spark: SparkSession, input: DataFrame): Long = {
    +    spark.sql(s"drop table if exists $carbonTableName")
    +    timeit {
    +      input.write
    +          .format("carbondata")
    +          .option("tableName", carbonTableName)
    +          .option("tempCSV", "false")
    +          .option("single_pass", "true")
    +          .option("dictionary_exclude", "id") // id is high cardinality column
    +          .mode(SaveMode.Overwrite)
    +          .save()
    +    }
    +  }
    +
    +  private def prepareTable(spark: SparkSession): Unit = {
    +    val df = generateDataFrame(spark).cache()
    +    println(s"loading dataframe into table, schema: ${df.schema}")
    +    val loadParquetTime = loadParquetTable(spark, df)
    +    val loadCarbonTime = loadCarbonTable(spark, df)
    +    println(s"load completed, time: $loadParquetTime, $loadCarbonTime")
    +    spark.read.parquet(parquetTableName).registerTempTable(parquetTableName)
    +  }
    +
    +  private def runQuery(spark: SparkSession): Unit = {
    +    val test = Array(
    +      "select count(*) from $table",
    +      "select sum(c4) from $table",
    +      "select sum(c4), sum(c5) from $table",
    +      "select sum(c4), sum(c5), sum(c6) from $table",
    +      "select sum(c4), sum(c5), sum(c6), sum(c7) from $table",
    +      "select sum(c4), sum(c5), sum(c6), sum(c7), avg(c8) from $table",
    +      "select * from $table where id = 'i9999999' ",
    +      "select * from $table where country = 'p9' ",
    +      "select * from $table where city = 'j99' ",
    +      "select * from $table where c4 < 1000 "
    --- End diff --
    
    please add more testcase, for example:
     "select sum(c4) from $table where id like 'i1%' "
     "select sum(c4) from $table where id like '%10' "
     "select sum(c4) from $table where id like '%xyz%' "



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata issue #624: [CARBONDATA-747][WIP] Add simple performanc...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/incubator-carbondata/pull/624
  
    Build Success with Spark 1.6.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/1017/



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata issue #624: [CARBONDATA-747] Add simple performance tes...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:

    https://github.com/apache/incubator-carbondata/pull/624
  
    Can you run all the queries one after another. Like run all parquet queries first then do System.gc and wait for few seconds to clear all gc. Then start running carbon queries.
    In this we can make sure that one engine memory consumption would not effect other engine.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #624: [CARBONDATA-747] Add simple performa...

Posted by jackylk <gi...@git.apache.org>.
Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/624#discussion_r106448892
  
    --- Diff: examples/spark2/src/main/scala/org/apache/carbondata/examples/CompareTest.scala ---
    @@ -0,0 +1,372 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.carbondata.examples
    +
    +import java.io.File
    +import java.text.SimpleDateFormat
    +import java.util.Date
    +
    +import scala.util.Random
    +
    +import org.apache.spark.sql.{DataFrame, Row, SaveMode, SparkSession}
    +import org.apache.spark.sql.types._
    +
    +import org.apache.carbondata.core.constants.CarbonCommonConstants
    +import org.apache.carbondata.core.util.CarbonProperties
    +
    +/**
    + * A query test case
    + * @param sqlText SQL statement
    + * @param queryType type of query: scan, filter, aggregate, topN
    + * @param desc description of the goal of this test case
    + */
    +case class Query(sqlText: String, queryType: String, desc: String)
    +
    +// scalastyle:off println
    +object CompareTest {
    +
    +  def parquetTableName: String = "comparetest_parquet"
    +  def carbonTableName(version: String): String = s"comparetest_carbonV$version"
    +
    +  // Table schema:
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | Column name | Data type | Cardinality | Column type | Dictionary |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | city        | string    | 8           | dimension   | yes        |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | country     | string    | 1103        | dimension   | yes        |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | planet      | string    | 100,007     | dimension   | yes        |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | id          | string    | 10,000,000  | dimension   | no         |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | m1          | short     | NA          | measure     | no         |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | m2          | int       | NA          | measure     | no         |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | m3          | big int   | NA          | measure     | no         |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | m4          | double    | NA          | measure     | no         |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | m5          | double    | NA          | measure     | no         |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  private def generateDataFrame(spark: SparkSession): DataFrame = {
    +    val r = new Random()
    +    val rdd = spark.sparkContext
    +        .parallelize(1 to 1 * 1000 * 1000, 4)
    +        .map { x =>
    +          ("city" + x % 8, "country" + x % 1103, "planet" + x % 10007, x.toString,
    +          (x % 16).toShort, x / 2, (x << 1).toLong, x.toDouble / 13, x.toDouble / 11)
    +        }.map { x =>
    +          Row(x._1, x._2, x._3, x._4, x._5, x._6, x._7, x._8, x._9)
    +        }
    +
    +    val schema = StructType(
    +      Seq(
    +        StructField("city", StringType, nullable = false),
    +        StructField("country", StringType, nullable = false),
    +        StructField("planet", StringType, nullable = false),
    +        StructField("id", StringType, nullable = false),
    +        StructField("m1", ShortType, nullable = false),
    +        StructField("m2", IntegerType, nullable = false),
    +        StructField("m3", LongType, nullable = false),
    +        StructField("m4", DoubleType, nullable = false),
    +        StructField("m5", DoubleType, nullable = false)
    +      )
    +    )
    +
    +    spark.createDataFrame(rdd, schema)
    +  }
    +
    +  // performance test queries, they are designed to test various data access type
    +  val queries: Array[Query] = Array(
    +    // ===========================================================================
    +    // ==                     FULL SCAN AGGREGATION                             ==
    +    // ===========================================================================
    +    Query(
    +      "select sum(m1) from $table",
    +      "full scan",
    +      "full scan query, 1 aggregate"
    +    ),
    +    Query(
    +      "select sum(m1), sum(m2) from $table",
    +      "full scan",
    +      "full scan query, 2 aggregate"
    +    ),
    +    Query(
    +      "select sum(m1), sum(m2), sum(m3) from $table",
    +      "full scan",
    +      "full scan query, 3 aggregate"
    +    ),
    +    Query(
    +      "select sum(m1), sum(m2), sum(m3), sum(m4) from $table",
    +      "full scan",
    +      "full scan query, 4 aggregate"
    +    ),
    +    Query(
    +      "select sum(m1), sum(m2), sum(m3), sum(m4), avg(m5) from $table",
    +      "full scan",
    +      "full scan query, 5 aggregate"
    +    ),
    +    Query(
    +      "select * from $table",
    +      "full scan",
    +      "full scan query, big result set"
    +    ),
    +    Query(
    +      "select count(distinct id) from $table",
    +      "full scan",
    +      "full scan and count distinct of high card column"
    +    ),
    +    Query(
    +      "select count(distinct country) from $table",
    +      "full scan",
    +      "full scan and count distinct of medium card column"
    +    ),
    +    Query(
    +      "select count(distinct city) from $table",
    +      "full scan",
    +      "full scan and count distinct of low card column"
    +    ),
    +    // ===========================================================================
    +    // ==                      FULL SCAN GROUP BY AGGREGATE                     ==
    +    // ===========================================================================
    +    Query(
    +      "select id, sum(m1) from $table group by id",
    +      "aggregate",
    +      "group by on big data, on high card column, big result set"
    +    ),
    +    Query(
    +      "select country, sum(m1) from $table group by country",
    +      "aggregate",
    +      "group by on big data, on medium card column, medium result set,"
    +    ),
    +    Query(
    +      "select city, sum(m1) from $table group by city",
    +      "aggregate",
    +      "group by on big data, on low card column, small result set,"
    +    ),
    +    Query(
    +      "select id, sum(m1) as metric from $table group by id order by metric desc limit 100",
    +      "topN",
    +      "top N on high card column"
    +    ),
    +    Query(
    +      "select country,sum(m1) as metric from $table group by country order by metric desc limit 10",
    +      "topN",
    +      "top N on medium card column"
    +    ),
    +    Query(
    +      "select city,sum(m1) as metric from $table group by city order by metric desc limit 10",
    +      "topN",
    +      "top N on low card column"
    +    ),
    +    // ===========================================================================
    +    // ==                  FILTER SCAN GROUP BY AGGREGATION                     ==
    +    // ===========================================================================
    +    Query(
    +      "select country, sum(m1) from $table where city='city8' group by country ",
    +      "filter scan and aggregate",
    +      "group by on large data, small result set"
    +    ),
    +    Query(
    +      "select id, sum(m1) from $table where planet='planet10' group by id",
    +      "filter scan and aggregate",
    +      "group by on medium data, large result set"
    +    ),
    +    Query(
    +      "select city, sum(m1) from $table where country='country12' group by city ",
    +      "filter scan and aggregate",
    +      "group by on medium data, small result set"
    +    ),
    +    // ===========================================================================
    +    // ==                             FILTER SCAN                               ==
    +    // ===========================================================================
    +    Query(
    +      "select * from $table where city = 'city3' limit 10000",
    +      "filter scan",
    +      "filter on low card dimension, limit, medium result set, fetch all columns"
    +    ),
    +    Query(
    +      "select * from $table where country = 'country9' ",
    +      "filter scan",
    +      "filter on low card dimension, medium result set, fetch all columns"
    +    ),
    +    Query(
    +      "select * from $table where planet = 'planet101' ",
    +      "filter scan",
    +      "filter on medium card dimension, small result set, fetch all columns"
    +    ),
    +    Query(
    +      "select * from $table where id = '408938' ",
    +      "filter scan",
    +      "filter on high card dimension"
    +    ),
    +    Query(
    +      "select * from $table where country='country10000'  ",
    +      "filter scan",
    +      "filter on low card dimension, not exist"
    +    ),
    +    Query(
    +      "select * from $table where country='country2' and city ='city8' ",
    +      "filter scan",
    +      "filter on 2 dimensions, small result set, fetch all columns"
    +    ),
    +    Query(
    +      "select * from $table where city='city1' and country='country2' and planet ='planet3' ",
    +      "filter scan",
    +      "filter on 3 dimensions, small result set, fetch all columns"
    +    ),
    +    Query(
    +      "select * from $table where m1 < 3",
    +      "filter scan",
    +      "filter on measure, small result set, fetch all columns"
    +    ),
    +    Query(
    +      "select * from $table where id like '1%' ",
    +      "fuzzy filter scan",
    +      "like filter, big result set"
    +    ),
    +    Query(
    +      "select * from $table where id like '%111'",
    +      "fuzzy filter scan",
    +      "like filter, medium result set"
    +    ),
    +    Query(
    +      "select * from $table where id like 'xyz%' ",
    +      "fuzzy filter scan",
    +      "like filter, full scan but not exist"
    +    )
    +  )
    +
    +  private def loadParquetTable(spark: SparkSession, input: DataFrame): Double = time {
    +    // partitioned by last 1 digit of id column
    +    val dfWithPartition = input.withColumn("partitionCol", input.col("id").%(10))
    +    dfWithPartition.write
    +        .partitionBy("partitionCol")
    +        .mode(SaveMode.Overwrite)
    +        .parquet(parquetTableName)
    +  }
    +
    +  private def loadCarbonTable(spark: SparkSession, input: DataFrame, version: String): Double = {
    +    CarbonProperties.getInstance().addProperty(
    +      CarbonCommonConstants.CARBON_DATA_FILE_VERSION,
    +      version
    +    )
    +    spark.sql(s"drop table if exists ${carbonTableName(version)}")
    +    time {
    +      input.write
    +          .format("carbondata")
    +          .option("tableName", carbonTableName(version))
    +          .option("tempCSV", "false")
    +          .option("single_pass", "true")
    +          .option("dictionary_exclude", "id") // id is high cardinality column
    +          .mode(SaveMode.Overwrite)
    +          .save()
    +    }
    +  }
    +
    +  // load data into parquet, carbonV2, carbonV3
    +  private def prepareTable(spark: SparkSession): Unit = {
    +    val df = generateDataFrame(spark)
    +    println(s"loading ${df.count} records, schema: ${df.schema}")
    +    val loadParquetTime = loadParquetTable(spark, df)
    +    val loadCarbonV3Time = loadCarbonTable(spark, df, version = "3")
    +    println(s"load completed, time: $loadParquetTime, $loadCarbonV3Time")
    +    spark.read.parquet(parquetTableName).registerTempTable(parquetTableName)
    +  }
    +
    +  // Run all queries for the specified table
    +  private def runQueries(spark: SparkSession, tableName: String): Array[(Double, Array[Row])] = {
    +    println(s"start running queries for $tableName...")
    +    var result: Array[Row] = null
    +    queries.zipWithIndex.map { case (query, index) =>
    +      val sqlText = query.sqlText.replace("$table", tableName)
    +      print(s"running query ${index + 1}: $sqlText ")
    +      val rt = time {
    +        result = spark.sql(sqlText).collect()
    +      }
    +      println(s"=> $rt sec")
    +      (rt, result)
    --- End diff --
    
    ok


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #624: [CARBONDATA-747] Add simple performa...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/624#discussion_r106432887
  
    --- Diff: examples/spark2/src/main/scala/org/apache/carbondata/examples/CompareTest.scala ---
    @@ -0,0 +1,372 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.carbondata.examples
    +
    +import java.io.File
    +import java.text.SimpleDateFormat
    +import java.util.Date
    +
    +import scala.util.Random
    +
    +import org.apache.spark.sql.{DataFrame, Row, SaveMode, SparkSession}
    +import org.apache.spark.sql.types._
    +
    +import org.apache.carbondata.core.constants.CarbonCommonConstants
    +import org.apache.carbondata.core.util.CarbonProperties
    +
    +/**
    + * A query test case
    + * @param sqlText SQL statement
    + * @param queryType type of query: scan, filter, aggregate, topN
    + * @param desc description of the goal of this test case
    + */
    +case class Query(sqlText: String, queryType: String, desc: String)
    +
    +// scalastyle:off println
    +object CompareTest {
    +
    +  def parquetTableName: String = "comparetest_parquet"
    +  def carbonTableName(version: String): String = s"comparetest_carbonV$version"
    +
    +  // Table schema:
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | Column name | Data type | Cardinality | Column type | Dictionary |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | city        | string    | 8           | dimension   | yes        |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | country     | string    | 1103        | dimension   | yes        |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | planet      | string    | 100,007     | dimension   | yes        |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | id          | string    | 10,000,000  | dimension   | no         |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | m1          | short     | NA          | measure     | no         |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | m2          | int       | NA          | measure     | no         |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | m3          | big int   | NA          | measure     | no         |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | m4          | double    | NA          | measure     | no         |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | m5          | double    | NA          | measure     | no         |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  private def generateDataFrame(spark: SparkSession): DataFrame = {
    +    val r = new Random()
    +    val rdd = spark.sparkContext
    +        .parallelize(1 to 1 * 1000 * 1000, 4)
    +        .map { x =>
    +          ("city" + x % 8, "country" + x % 1103, "planet" + x % 10007, x.toString,
    +          (x % 16).toShort, x / 2, (x << 1).toLong, x.toDouble / 13, x.toDouble / 11)
    +        }.map { x =>
    +          Row(x._1, x._2, x._3, x._4, x._5, x._6, x._7, x._8, x._9)
    +        }
    +
    +    val schema = StructType(
    +      Seq(
    +        StructField("city", StringType, nullable = false),
    +        StructField("country", StringType, nullable = false),
    +        StructField("planet", StringType, nullable = false),
    +        StructField("id", StringType, nullable = false),
    +        StructField("m1", ShortType, nullable = false),
    +        StructField("m2", IntegerType, nullable = false),
    +        StructField("m3", LongType, nullable = false),
    +        StructField("m4", DoubleType, nullable = false),
    +        StructField("m5", DoubleType, nullable = false)
    +      )
    +    )
    +
    +    spark.createDataFrame(rdd, schema)
    +  }
    +
    +  // performance test queries, they are designed to test various data access type
    +  val queries: Array[Query] = Array(
    +    // ===========================================================================
    +    // ==                     FULL SCAN AGGREGATION                             ==
    +    // ===========================================================================
    +    Query(
    +      "select sum(m1) from $table",
    +      "full scan",
    +      "full scan query, 1 aggregate"
    +    ),
    +    Query(
    +      "select sum(m1), sum(m2) from $table",
    +      "full scan",
    +      "full scan query, 2 aggregate"
    +    ),
    +    Query(
    +      "select sum(m1), sum(m2), sum(m3) from $table",
    +      "full scan",
    +      "full scan query, 3 aggregate"
    +    ),
    +    Query(
    +      "select sum(m1), sum(m2), sum(m3), sum(m4) from $table",
    +      "full scan",
    +      "full scan query, 4 aggregate"
    +    ),
    +    Query(
    +      "select sum(m1), sum(m2), sum(m3), sum(m4), avg(m5) from $table",
    +      "full scan",
    +      "full scan query, 5 aggregate"
    +    ),
    +    Query(
    +      "select * from $table",
    +      "full scan",
    +      "full scan query, big result set"
    +    ),
    +    Query(
    +      "select count(distinct id) from $table",
    +      "full scan",
    +      "full scan and count distinct of high card column"
    +    ),
    +    Query(
    +      "select count(distinct country) from $table",
    +      "full scan",
    +      "full scan and count distinct of medium card column"
    +    ),
    +    Query(
    +      "select count(distinct city) from $table",
    +      "full scan",
    +      "full scan and count distinct of low card column"
    +    ),
    +    // ===========================================================================
    +    // ==                      FULL SCAN GROUP BY AGGREGATE                     ==
    +    // ===========================================================================
    +    Query(
    +      "select id, sum(m1) from $table group by id",
    +      "aggregate",
    +      "group by on big data, on high card column, big result set"
    +    ),
    +    Query(
    +      "select country, sum(m1) from $table group by country",
    +      "aggregate",
    +      "group by on big data, on medium card column, medium result set,"
    +    ),
    +    Query(
    +      "select city, sum(m1) from $table group by city",
    +      "aggregate",
    +      "group by on big data, on low card column, small result set,"
    +    ),
    +    Query(
    +      "select id, sum(m1) as metric from $table group by id order by metric desc limit 100",
    +      "topN",
    +      "top N on high card column"
    +    ),
    +    Query(
    +      "select country,sum(m1) as metric from $table group by country order by metric desc limit 10",
    +      "topN",
    +      "top N on medium card column"
    +    ),
    +    Query(
    +      "select city,sum(m1) as metric from $table group by city order by metric desc limit 10",
    +      "topN",
    +      "top N on low card column"
    +    ),
    +    // ===========================================================================
    +    // ==                  FILTER SCAN GROUP BY AGGREGATION                     ==
    +    // ===========================================================================
    +    Query(
    +      "select country, sum(m1) from $table where city='city8' group by country ",
    +      "filter scan and aggregate",
    +      "group by on large data, small result set"
    +    ),
    +    Query(
    +      "select id, sum(m1) from $table where planet='planet10' group by id",
    +      "filter scan and aggregate",
    +      "group by on medium data, large result set"
    +    ),
    +    Query(
    +      "select city, sum(m1) from $table where country='country12' group by city ",
    +      "filter scan and aggregate",
    +      "group by on medium data, small result set"
    +    ),
    +    // ===========================================================================
    +    // ==                             FILTER SCAN                               ==
    +    // ===========================================================================
    +    Query(
    +      "select * from $table where city = 'city3' limit 10000",
    +      "filter scan",
    +      "filter on low card dimension, limit, medium result set, fetch all columns"
    +    ),
    +    Query(
    +      "select * from $table where country = 'country9' ",
    +      "filter scan",
    +      "filter on low card dimension, medium result set, fetch all columns"
    +    ),
    +    Query(
    +      "select * from $table where planet = 'planet101' ",
    +      "filter scan",
    +      "filter on medium card dimension, small result set, fetch all columns"
    +    ),
    +    Query(
    +      "select * from $table where id = '408938' ",
    +      "filter scan",
    +      "filter on high card dimension"
    +    ),
    +    Query(
    +      "select * from $table where country='country10000'  ",
    +      "filter scan",
    +      "filter on low card dimension, not exist"
    +    ),
    +    Query(
    +      "select * from $table where country='country2' and city ='city8' ",
    +      "filter scan",
    +      "filter on 2 dimensions, small result set, fetch all columns"
    +    ),
    +    Query(
    +      "select * from $table where city='city1' and country='country2' and planet ='planet3' ",
    +      "filter scan",
    +      "filter on 3 dimensions, small result set, fetch all columns"
    +    ),
    +    Query(
    +      "select * from $table where m1 < 3",
    +      "filter scan",
    +      "filter on measure, small result set, fetch all columns"
    +    ),
    +    Query(
    +      "select * from $table where id like '1%' ",
    +      "fuzzy filter scan",
    +      "like filter, big result set"
    +    ),
    +    Query(
    +      "select * from $table where id like '%111'",
    +      "fuzzy filter scan",
    +      "like filter, medium result set"
    +    ),
    +    Query(
    +      "select * from $table where id like 'xyz%' ",
    +      "fuzzy filter scan",
    +      "like filter, full scan but not exist"
    +    )
    +  )
    +
    +  private def loadParquetTable(spark: SparkSession, input: DataFrame): Double = time {
    +    // partitioned by last 1 digit of id column
    +    val dfWithPartition = input.withColumn("partitionCol", input.col("id").%(10))
    +    dfWithPartition.write
    +        .partitionBy("partitionCol")
    +        .mode(SaveMode.Overwrite)
    +        .parquet(parquetTableName)
    +  }
    +
    +  private def loadCarbonTable(spark: SparkSession, input: DataFrame, version: String): Double = {
    +    CarbonProperties.getInstance().addProperty(
    +      CarbonCommonConstants.CARBON_DATA_FILE_VERSION,
    +      version
    +    )
    +    spark.sql(s"drop table if exists ${carbonTableName(version)}")
    +    time {
    +      input.write
    +          .format("carbondata")
    +          .option("tableName", carbonTableName(version))
    +          .option("tempCSV", "false")
    +          .option("single_pass", "true")
    +          .option("dictionary_exclude", "id") // id is high cardinality column
    +          .mode(SaveMode.Overwrite)
    +          .save()
    +    }
    +  }
    +
    +  // load data into parquet, carbonV2, carbonV3
    +  private def prepareTable(spark: SparkSession): Unit = {
    +    val df = generateDataFrame(spark)
    +    println(s"loading ${df.count} records, schema: ${df.schema}")
    +    val loadParquetTime = loadParquetTable(spark, df)
    +    val loadCarbonV3Time = loadCarbonTable(spark, df, version = "3")
    +    println(s"load completed, time: $loadParquetTime, $loadCarbonV3Time")
    +    spark.read.parquet(parquetTableName).registerTempTable(parquetTableName)
    +  }
    +
    +  // Run all queries for the specified table
    +  private def runQueries(spark: SparkSession, tableName: String): Array[(Double, Array[Row])] = {
    +    println(s"start running queries for $tableName...")
    +    var result: Array[Row] = null
    +    queries.zipWithIndex.map { case (query, index) =>
    +      val sqlText = query.sqlText.replace("$table", tableName)
    +      print(s"running query ${index + 1}: $sqlText ")
    +      val rt = time {
    +        result = spark.sql(sqlText).collect()
    +      }
    +      println(s"=> $rt sec")
    +      (rt, result)
    +    }
    +  }
    +
    +  // run testcases and print comparison result
    +  private def runTest(spark: SparkSession): Unit = {
    +    val formatter = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
    +    val date = new Date
    +    val timestamp = date.getTime
    +    // run queries on parquet and carbon
    +    val parquetResult: Array[(Double, Array[Row])] = runQueries(spark, parquetTableName)
    +    // do GC and sleep for some time before running next table
    +    System.gc()
    +    Thread.sleep(1000)
    +    val carbonResult: Array[(Double, Array[Row])] = runQueries(spark, carbonTableName("3"))
    +    // check result by comparing output from parquet and carbon
    +    parquetResult.zipWithIndex.foreach { case (result, index) =>
    +      if (result._2.length != carbonResult(index)._2.length) {
    +        sys.error(s"result not matching for query ${index + 1}: " +
    +            s"${result._2.length} and ${carbonResult(index)._2.length}")
    +      }
    +    }
    +    // print all response time in JSON format, so that it can be analyzed later
    +    queries.zipWithIndex.foreach { case (query, index) =>
    +      println("{" +
    +          s""""query":"${index + 1}", """ +
    +          s""""parquetTime":${parquetResult(index)._1}, """ +
    +          s""""carbonTime":${carbonResult(index)._1}, """ +
    +          s""""fetched":${parquetResult(index)._2.length}, """ +
    +          s""""type":"${query.queryType}", """ +
    +          s""""desc":"${query.desc}",  """ +
    +          s""""timestamp": "$timestamp" """ +
    +          s""""date": "${formatter.format(date)}" """ +
    +          "}"
    +      )
    +    }
    +  }
    +
    +  def main(args: Array[String]): Unit = {
    +    CarbonProperties.getInstance()
    +        .addProperty("carbon.number.of.page.in.blocklet.column", "10")
    --- End diff --
    
    This property is remove now. so it is not required.
    Now the size calculated in mb with property `carbon.blockletgroup.size.in.mb` and default size is 64. so no need to configure here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #624: [CARBONDATA-747][WIP] Add simple per...

Posted by jackylk <gi...@git.apache.org>.
Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/624#discussion_r104642435
  
    --- Diff: examples/spark2/src/main/scala/org/apache/carbondata/examples/CompareTest.scala ---
    @@ -0,0 +1,122 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.carbondata.examples
    +
    +import org.apache.spark.sql.{DataFrame, Row, SaveMode, SparkSession}
    +
    +import org.apache.carbondata.core.util.CarbonProperties
    +
    +// scalastyle:off println
    +object CompareTest {
    +
    +  val parquetTableName = "comparetest_parquet"
    +  val carbonTableName = "comparetest_carbon"
    +
    +  private def generateDataFrame(spark: SparkSession): DataFrame = {
    +    import spark.implicits._
    +    spark.sparkContext.parallelize(1 to 10 * 1000 * 1000, 4)
    +        .map(x => ("i" + x, "p" + x % 10, "j" + x % 100, x, x + 1, (x + 7) % 21, (x + 5) / 43, x
    +            * 5))
    +        .toDF("id", "country", "city", "c4", "c5", "c6", "c7", "c8")
    +  }
    +
    +  private def loadParquetTable(spark: SparkSession, input: DataFrame): Long = timeit {
    +    input.write.mode(SaveMode.Overwrite).parquet(parquetTableName)
    --- End diff --
    
    ok


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata issue #624: [CARBONDATA-747] Add simple performance tes...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:

    https://github.com/apache/incubator-carbondata/pull/624
  
    Yes, it fits in OS cache, that is the same case with parquet as well so it would be fare comparison.  What I am suggesting is now we run the queries of parquet and carbon together. So there is a possibility of one engine jvm memory consumption can effect the other engine. It would be better to run all the queries of one engine  and flush the GC and then start running other engine queries.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #624: [CARBONDATA-747][WIP] Add simple per...

Posted by jarray888 <gi...@git.apache.org>.
Github user jarray888 commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/624#discussion_r104369883
  
    --- Diff: examples/spark2/src/main/scala/org/apache/carbondata/examples/CompareTest.scala ---
    @@ -0,0 +1,122 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.carbondata.examples
    +
    +import org.apache.spark.sql.{DataFrame, Row, SaveMode, SparkSession}
    +
    +import org.apache.carbondata.core.util.CarbonProperties
    +
    +// scalastyle:off println
    +object CompareTest {
    +
    +  val parquetTableName = "comparetest_parquet"
    +  val carbonTableName = "comparetest_carbon"
    +
    +  private def generateDataFrame(spark: SparkSession): DataFrame = {
    +    import spark.implicits._
    +    spark.sparkContext.parallelize(1 to 10 * 1000 * 1000, 4)
    +        .map(x => ("i" + x, "p" + x % 10, "j" + x % 100, x, x + 1, (x + 7) % 21, (x + 5) / 43, x
    +            * 5))
    +        .toDF("id", "country", "city", "c4", "c5", "c6", "c7", "c8")
    +  }
    +
    +  private def loadParquetTable(spark: SparkSession, input: DataFrame): Long = timeit {
    +    input.write.mode(SaveMode.Overwrite).parquet(parquetTableName)
    --- End diff --
    
    suggest to use last char of id column to do partition on parquet, so the comparison is fare.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #624: [CARBONDATA-747][WIP] Add simple per...

Posted by jarray888 <gi...@git.apache.org>.
Github user jarray888 commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/624#discussion_r104370904
  
    --- Diff: examples/spark2/src/main/scala/org/apache/carbondata/examples/CompareTest.scala ---
    @@ -0,0 +1,122 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.carbondata.examples
    +
    +import org.apache.spark.sql.{DataFrame, Row, SaveMode, SparkSession}
    +
    +import org.apache.carbondata.core.util.CarbonProperties
    +
    +// scalastyle:off println
    +object CompareTest {
    +
    +  val parquetTableName = "comparetest_parquet"
    +  val carbonTableName = "comparetest_carbon"
    +
    +  private def generateDataFrame(spark: SparkSession): DataFrame = {
    +    import spark.implicits._
    +    spark.sparkContext.parallelize(1 to 10 * 1000 * 1000, 4)
    +        .map(x => ("i" + x, "p" + x % 10, "j" + x % 100, x, x + 1, (x + 7) % 21, (x + 5) / 43, x
    --- End diff --
    
    To simulate a real-life data, please make the data unsorted, like
    `map(x => ("i" + randon number, "p" + x % 13, "j" + x % 97, ...)`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #624: [CARBONDATA-747] Add simple performa...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/624#discussion_r106432353
  
    --- Diff: examples/spark2/src/main/scala/org/apache/carbondata/examples/CompareTest.scala ---
    @@ -0,0 +1,372 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.carbondata.examples
    +
    +import java.io.File
    +import java.text.SimpleDateFormat
    +import java.util.Date
    +
    +import scala.util.Random
    +
    +import org.apache.spark.sql.{DataFrame, Row, SaveMode, SparkSession}
    +import org.apache.spark.sql.types._
    +
    +import org.apache.carbondata.core.constants.CarbonCommonConstants
    +import org.apache.carbondata.core.util.CarbonProperties
    +
    +/**
    + * A query test case
    + * @param sqlText SQL statement
    + * @param queryType type of query: scan, filter, aggregate, topN
    + * @param desc description of the goal of this test case
    + */
    +case class Query(sqlText: String, queryType: String, desc: String)
    +
    +// scalastyle:off println
    +object CompareTest {
    +
    +  def parquetTableName: String = "comparetest_parquet"
    +  def carbonTableName(version: String): String = s"comparetest_carbonV$version"
    +
    +  // Table schema:
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | Column name | Data type | Cardinality | Column type | Dictionary |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | city        | string    | 8           | dimension   | yes        |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | country     | string    | 1103        | dimension   | yes        |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | planet      | string    | 100,007     | dimension   | yes        |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | id          | string    | 10,000,000  | dimension   | no         |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | m1          | short     | NA          | measure     | no         |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | m2          | int       | NA          | measure     | no         |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | m3          | big int   | NA          | measure     | no         |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | m4          | double    | NA          | measure     | no         |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | m5          | double    | NA          | measure     | no         |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  private def generateDataFrame(spark: SparkSession): DataFrame = {
    +    val r = new Random()
    +    val rdd = spark.sparkContext
    +        .parallelize(1 to 1 * 1000 * 1000, 4)
    +        .map { x =>
    +          ("city" + x % 8, "country" + x % 1103, "planet" + x % 10007, x.toString,
    +          (x % 16).toShort, x / 2, (x << 1).toLong, x.toDouble / 13, x.toDouble / 11)
    +        }.map { x =>
    +          Row(x._1, x._2, x._3, x._4, x._5, x._6, x._7, x._8, x._9)
    +        }
    +
    +    val schema = StructType(
    +      Seq(
    +        StructField("city", StringType, nullable = false),
    +        StructField("country", StringType, nullable = false),
    +        StructField("planet", StringType, nullable = false),
    +        StructField("id", StringType, nullable = false),
    +        StructField("m1", ShortType, nullable = false),
    +        StructField("m2", IntegerType, nullable = false),
    +        StructField("m3", LongType, nullable = false),
    +        StructField("m4", DoubleType, nullable = false),
    +        StructField("m5", DoubleType, nullable = false)
    +      )
    +    )
    +
    +    spark.createDataFrame(rdd, schema)
    +  }
    +
    +  // performance test queries, they are designed to test various data access type
    +  val queries: Array[Query] = Array(
    +    // ===========================================================================
    +    // ==                     FULL SCAN AGGREGATION                             ==
    +    // ===========================================================================
    +    Query(
    +      "select sum(m1) from $table",
    +      "full scan",
    +      "full scan query, 1 aggregate"
    +    ),
    +    Query(
    +      "select sum(m1), sum(m2) from $table",
    +      "full scan",
    +      "full scan query, 2 aggregate"
    +    ),
    +    Query(
    +      "select sum(m1), sum(m2), sum(m3) from $table",
    +      "full scan",
    +      "full scan query, 3 aggregate"
    +    ),
    +    Query(
    +      "select sum(m1), sum(m2), sum(m3), sum(m4) from $table",
    +      "full scan",
    +      "full scan query, 4 aggregate"
    +    ),
    +    Query(
    +      "select sum(m1), sum(m2), sum(m3), sum(m4), avg(m5) from $table",
    +      "full scan",
    +      "full scan query, 5 aggregate"
    +    ),
    +    Query(
    +      "select * from $table",
    +      "full scan",
    +      "full scan query, big result set"
    +    ),
    +    Query(
    +      "select count(distinct id) from $table",
    +      "full scan",
    +      "full scan and count distinct of high card column"
    +    ),
    +    Query(
    +      "select count(distinct country) from $table",
    +      "full scan",
    +      "full scan and count distinct of medium card column"
    +    ),
    +    Query(
    +      "select count(distinct city) from $table",
    +      "full scan",
    +      "full scan and count distinct of low card column"
    +    ),
    +    // ===========================================================================
    +    // ==                      FULL SCAN GROUP BY AGGREGATE                     ==
    +    // ===========================================================================
    +    Query(
    +      "select id, sum(m1) from $table group by id",
    +      "aggregate",
    +      "group by on big data, on high card column, big result set"
    +    ),
    +    Query(
    +      "select country, sum(m1) from $table group by country",
    +      "aggregate",
    +      "group by on big data, on medium card column, medium result set,"
    +    ),
    +    Query(
    +      "select city, sum(m1) from $table group by city",
    +      "aggregate",
    +      "group by on big data, on low card column, small result set,"
    +    ),
    +    Query(
    +      "select id, sum(m1) as metric from $table group by id order by metric desc limit 100",
    +      "topN",
    +      "top N on high card column"
    +    ),
    +    Query(
    +      "select country,sum(m1) as metric from $table group by country order by metric desc limit 10",
    +      "topN",
    +      "top N on medium card column"
    +    ),
    +    Query(
    +      "select city,sum(m1) as metric from $table group by city order by metric desc limit 10",
    +      "topN",
    +      "top N on low card column"
    +    ),
    +    // ===========================================================================
    +    // ==                  FILTER SCAN GROUP BY AGGREGATION                     ==
    +    // ===========================================================================
    +    Query(
    +      "select country, sum(m1) from $table where city='city8' group by country ",
    +      "filter scan and aggregate",
    +      "group by on large data, small result set"
    +    ),
    +    Query(
    +      "select id, sum(m1) from $table where planet='planet10' group by id",
    +      "filter scan and aggregate",
    +      "group by on medium data, large result set"
    +    ),
    +    Query(
    +      "select city, sum(m1) from $table where country='country12' group by city ",
    +      "filter scan and aggregate",
    +      "group by on medium data, small result set"
    +    ),
    +    // ===========================================================================
    +    // ==                             FILTER SCAN                               ==
    +    // ===========================================================================
    +    Query(
    +      "select * from $table where city = 'city3' limit 10000",
    +      "filter scan",
    +      "filter on low card dimension, limit, medium result set, fetch all columns"
    +    ),
    +    Query(
    +      "select * from $table where country = 'country9' ",
    +      "filter scan",
    +      "filter on low card dimension, medium result set, fetch all columns"
    +    ),
    +    Query(
    +      "select * from $table where planet = 'planet101' ",
    +      "filter scan",
    +      "filter on medium card dimension, small result set, fetch all columns"
    +    ),
    +    Query(
    +      "select * from $table where id = '408938' ",
    +      "filter scan",
    +      "filter on high card dimension"
    +    ),
    +    Query(
    +      "select * from $table where country='country10000'  ",
    +      "filter scan",
    +      "filter on low card dimension, not exist"
    +    ),
    +    Query(
    +      "select * from $table where country='country2' and city ='city8' ",
    +      "filter scan",
    +      "filter on 2 dimensions, small result set, fetch all columns"
    +    ),
    +    Query(
    +      "select * from $table where city='city1' and country='country2' and planet ='planet3' ",
    +      "filter scan",
    +      "filter on 3 dimensions, small result set, fetch all columns"
    +    ),
    +    Query(
    +      "select * from $table where m1 < 3",
    +      "filter scan",
    +      "filter on measure, small result set, fetch all columns"
    +    ),
    +    Query(
    +      "select * from $table where id like '1%' ",
    +      "fuzzy filter scan",
    +      "like filter, big result set"
    +    ),
    +    Query(
    +      "select * from $table where id like '%111'",
    +      "fuzzy filter scan",
    +      "like filter, medium result set"
    +    ),
    +    Query(
    +      "select * from $table where id like 'xyz%' ",
    +      "fuzzy filter scan",
    +      "like filter, full scan but not exist"
    +    )
    +  )
    +
    +  private def loadParquetTable(spark: SparkSession, input: DataFrame): Double = time {
    +    // partitioned by last 1 digit of id column
    +    val dfWithPartition = input.withColumn("partitionCol", input.col("id").%(10))
    +    dfWithPartition.write
    +        .partitionBy("partitionCol")
    +        .mode(SaveMode.Overwrite)
    +        .parquet(parquetTableName)
    +  }
    +
    +  private def loadCarbonTable(spark: SparkSession, input: DataFrame, version: String): Double = {
    +    CarbonProperties.getInstance().addProperty(
    +      CarbonCommonConstants.CARBON_DATA_FILE_VERSION,
    +      version
    +    )
    +    spark.sql(s"drop table if exists ${carbonTableName(version)}")
    +    time {
    +      input.write
    +          .format("carbondata")
    +          .option("tableName", carbonTableName(version))
    +          .option("tempCSV", "false")
    +          .option("single_pass", "true")
    +          .option("dictionary_exclude", "id") // id is high cardinality column
    +          .mode(SaveMode.Overwrite)
    +          .save()
    +    }
    +  }
    +
    +  // load data into parquet, carbonV2, carbonV3
    +  private def prepareTable(spark: SparkSession): Unit = {
    +    val df = generateDataFrame(spark)
    +    println(s"loading ${df.count} records, schema: ${df.schema}")
    +    val loadParquetTime = loadParquetTable(spark, df)
    +    val loadCarbonV3Time = loadCarbonTable(spark, df, version = "3")
    +    println(s"load completed, time: $loadParquetTime, $loadCarbonV3Time")
    +    spark.read.parquet(parquetTableName).registerTempTable(parquetTableName)
    +  }
    +
    +  // Run all queries for the specified table
    +  private def runQueries(spark: SparkSession, tableName: String): Array[(Double, Array[Row])] = {
    +    println(s"start running queries for $tableName...")
    +    var result: Array[Row] = null
    +    queries.zipWithIndex.map { case (query, index) =>
    +      val sqlText = query.sqlText.replace("$table", tableName)
    +      print(s"running query ${index + 1}: $sqlText ")
    +      val rt = time {
    +        result = spark.sql(sqlText).collect()
    +      }
    +      println(s"=> $rt sec")
    +      (rt, result)
    --- End diff --
    
    Don't cache the result, it is leading to memory issues if more than 2 million data. Just keep the length of the data is enough


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #624: [CARBONDATA-747] Add simple performa...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/624#discussion_r106090560
  
    --- Diff: examples/spark2/src/main/scala/org/apache/carbondata/examples/CompareTest.scala ---
    @@ -0,0 +1,343 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.carbondata.examples
    +
    +import java.io.File
    +import java.util.Date
    +
    +import org.apache.spark.sql.{DataFrame, Row, SaveMode, SparkSession}
    +import org.apache.spark.sql.types._
    +
    +import org.apache.carbondata.core.constants.CarbonCommonConstants
    +import org.apache.carbondata.core.util.CarbonProperties
    +
    +/**
    + * A query test case
    + * @param sqlText SQL statement
    + * @param queryType type of query: scan, filter, aggregate, topN
    + * @param desc description of the goal of this test case
    + */
    +case class Query(sqlText: String, queryType: String, desc: String)
    +
    +// scalastyle:off println
    +object CompareTest {
    +
    +  def parquetTableName: String = "comparetest_parquet"
    +  def carbonTableName(version: String): String = s"comparetest_carbonV$version"
    +
    +  // Table schema:
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | Column name | Data type | Cardinality | Column type | Dictionary |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | id          | string    | 10,000,000  | dimension   | no         |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | country     | string    | 1103        | dimension   | yes        |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | city        | string    | 13          | dimension   | yes        |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | c4          | short     | NA          | measure     | no         |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | c5          | int       | NA          | measure     | no         |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | c6          | big int   | NA          | measure     | no         |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | c7          | double    | NA          | measure     | no         |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | c8          | double    | NA          | measure     | no         |
    +  // +-------------+-----------+-------------+-------------+------------+
    --- End diff --
    
    I guess it is better to add some more dimensions with little more cardinality. Now there are only 3 dimensions and one with very high cardinality and others with very less cardinality. So it is better to add some more dimensions with medium cardinality.  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata issue #624: [CARBONDATA-747] Add simple performance tes...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/incubator-carbondata/pull/624
  
    Build Success with Spark 1.6.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/1087/



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata issue #624: [WIP] Add simple performance test for spark...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/incubator-carbondata/pull/624
  
    Build Success with Spark 1.6.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/1010/



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata issue #624: [CARBONDATA-747][WIP] Add simple performanc...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/incubator-carbondata/pull/624
  
    Build Success with Spark 1.6.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/1016/



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata issue #624: [WIP] Add simple performance test for spark...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/incubator-carbondata/pull/624
  
    Build Success with Spark 1.6.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/1009/



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #624: [CARBONDATA-747][WIP] Add simple per...

Posted by jackylk <gi...@git.apache.org>.
Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/624#discussion_r104564727
  
    --- Diff: examples/spark2/src/main/scala/org/apache/carbondata/examples/CompareTest.scala ---
    @@ -0,0 +1,347 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.carbondata.examples
    +
    +import org.apache.spark.sql.{DataFrame, Row, SaveMode, SparkSession}
    +import org.apache.spark.sql.types._
    +
    +import org.apache.carbondata.core.constants.CarbonCommonConstants
    +import org.apache.carbondata.core.util.CarbonProperties
    +
    +/**
    + * A query test case
    + * @param sqlText SQL statement
    + * @param queryType type of query: scan, filter, aggregate, topN
    + * @param desc description of the goal of this test case
    + */
    +case class Query(sqlText: String, queryType: String, desc: String)
    +
    +// scalastyle:off println
    +object CompareTest {
    +
    +  def parquetTableName: String = "comparetest_parquet"
    +  def carbonTableName(version: String): String = s"comparetest_carbonV$version"
    +
    +  // Table schema:
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | Column name | Data type | Cardinality | Column type | Dictionary |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | id          | string    | 10,000,000  | dimension   | no         |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | country     | string    | 1103        | dimension   | yes        |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | city        | string    | 13          | dimension   | yes        |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | c4          | short     | NA          | measure     | no         |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | c5          | int       | NA          | measure     | no         |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | c6          | big int   | NA          | measure     | no         |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | c7          | double    | NA          | measure     | no         |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | c8          | double    | NA          | measure     | no         |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  private def generateDataFrame(spark: SparkSession): DataFrame = {
    +    val rdd = spark.sparkContext
    +        .parallelize(1 to 10 * 1000 * 1000, 4)
    +        .map { x =>
    +          (x.toString, "p" + x % 1103, "j" + x % 13, (x % 31).toShort, x, x.toLong * 1000,
    +              x.toDouble / 13, x.toDouble / 71 )
    +        }.map { x =>
    +          Row(x._1, x._2, x._3, x._4, x._5, x._6, x._7, x._8)
    +        }
    +    val schema = StructType(
    +      Seq(
    +        StructField("id", StringType, nullable = false),
    +        StructField("country", StringType, nullable = false),
    +        StructField("city", StringType, nullable = false),
    +        StructField("c4", ShortType, nullable = true),
    +        StructField("c5", IntegerType, nullable = true),
    +        StructField("c6", LongType, nullable = true),
    +        StructField("c7", DoubleType, nullable = true),
    +        StructField("c8", DoubleType, nullable = true)
    +      )
    +    )
    +    spark.createDataFrame(rdd, schema)
    +  }
    +
    +  // performance test queries
    +  val queries: Array[Query] = Array(
    +    Query(
    +      "select count(*) from $table",
    +      "warm up",
    +      "warm up query"
    +    ),
    +    // ===========================================================================
    +    // ==                               FULL SCAN                               ==
    +    // ===========================================================================
    +    Query(
    +      "select sum(c4) from $table",
    +      "full scan",
    +      "full scan query, 1 aggregate"
    +    ),
    +    Query(
    +      "select sum(c4), sum(c5) from $table",
    +      "full scan",
    +      "full scan query, 2 aggregate"
    +    ),
    +    Query(
    +      "select sum(c4), sum(c5), sum(c6) from $table",
    +      "full scan",
    +      "full scan query, 3 aggregate"
    +    ),
    +    Query(
    +      "select sum(c4), sum(c5), sum(c6), sum(c7) from $table",
    +      "full scan",
    +      "full scan query, 4 aggregate"
    +    ),
    +    Query(
    +      "select sum(c4), sum(c5), sum(c6), sum(c7), avg(c8) from $table",
    +      "full scan",
    +      "full scan query, 4 aggregate"
    +    ),
    +//    Query(
    +//      "select * from $table",
    +//      "full scan",
    +//      "full scan query, big result set"
    +//    ),
    +    Query(
    +      "select count(distinct id) from $table",
    +      "full scan",
    +      "full scan and count distinct of high card column"
    +    ),
    +    Query(
    +      "select count(distinct country) from $table",
    +      "full scan",
    +      "full scan and count distinct of medium card column"
    +    ),
    +    Query(
    +      "select count(distinct city) from $table",
    +      "full scan",
    +      "full scan and count distinct of low card column"
    +    ),
    +    // ===========================================================================
    +    // ==                               AGGREGATE                               ==
    +    // ===========================================================================
    +//    Query(
    +//      "select id, sum(c4) from $table group by id",
    +//      "aggregate",
    +//      "group by on big data, on high card column, big result set"
    +//    ),
    +    Query(
    +      "select country, sum(c4) from $table group by country",
    +      "aggregate",
    +      "group by on big data, on medium card column, medium result set,"
    +    ),
    +    Query(
    +      "select city, sum(c4) from $table group by city",
    +      "aggregate",
    +      "group by on big data, on low card column, small result set,"
    +    ),
    +    Query(
    +      "select id, sum(c4) as metric from $table group by id order by metric desc limit 100",
    +      "topN",
    +      "top N on high card column"
    +    ),
    +    Query(
    +      "select country,sum(c4) as metric from $table group by country order by metric desc limit 10",
    +      "topN",
    +      "top N on medium card column"
    +    ),
    +    Query(
    +      "select city,sum(c4) as metric from $table group by city order by metric desc limit 10",
    +      "topN",
    +      "top N on low card column"
    +    ),
    +    // ===========================================================================
    +    // ==                             FILTER SCAN                               ==
    +    // ===========================================================================
    +    Query(
    +      "select id, sum(c4) from $table where city='j9' group by id",
    +      "filter scan and aggregate",
    +      "group by on medium data, large result set"
    +    ),
    +    Query(
    +      "select country, sum(c4) from $table where city='j8' group by country ",
    +      "filter scan and aggregate",
    +      "group by on medium data, medium result set"
    +    ),
    +    Query(
    +      "select city, sum(c4) from $table where city='j7' group by city ",
    +      "filter scan and aggregate",
    +      "group by on medium data, small result set"
    +    ),
    +    Query(
    +      "select * from $table where id = '408938' ",
    +      "primary key filter",
    +      "primary key filtering"
    +    ),
    +    Query(
    +      "select * from $table where country = 'p99' ",
    +      "filter scan",
    +      "filter on c2, small result set, fetch all columns"
    +    ),
    +    Query(
    +      "select * from $table where country='p10000'  ",
    +      "filter scan",
    +      "filter on c2, not exist"
    +    ),
    +    Query(
    +      "select * from $table where city = 'j3' limit 100000",
    +      "filter scan",
    +      "filter on c3, medium result set, fetch all columns"
    +    ),
    +    Query(
    +      "select * from $table where country='p88' and city ='j8' ",
    +      "filter scan",
    +      "filter on c2 and c3, small result set, fetch all columns"
    +    ),
    +    Query(
    +      "select * from $table where id='1' and country='p1' and city ='j1' ",
    +      "filter scan",
    +      "filter on c1, c2 and c3, small result set, fetch all columns"
    +    ),
    +    Query(
    +      "select * from $table where c4 < 3",
    +      "filter scan",
    +      "filter on measure, small result set, fetch all columns"
    +    ),
    +    Query(
    +      "select * from $table where id like '1%' ",
    +      "fuzzy filter scan",
    +      "like filter, big result set"
    +    ),
    +    Query(
    +      "select * from $table where id like '%111'",
    +      "fuzzy filter scan",
    +      "like filter, medium result set"
    +    ),
    +    Query(
    +      "select * from $table where id like 'xyz%' ",
    +      "fuzzy filter scan",
    +      "like filter, full scan but not exist"
    +    )
    +  )
    +
    +  private def loadParquetTable(spark: SparkSession, input: DataFrame): Double = timeit {
    +    input.write.mode(SaveMode.Overwrite).parquet(parquetTableName)
    +  }
    +
    +  private def loadCarbonTable(spark: SparkSession, input: DataFrame, version: String): Double = {
    +    CarbonProperties.getInstance().addProperty(
    +      CarbonCommonConstants.CARBON_DATA_FILE_VERSION,
    +      version
    +    )
    +    spark.sql(s"drop table if exists ${carbonTableName(version)}")
    +    timeit {
    +      input.write
    +          .format("carbondata")
    +          .option("tableName", carbonTableName(version))
    +          .option("tempCSV", "false")
    +          .option("single_pass", "true")
    +          .option("dictionary_exclude", "id") // id is high cardinality column
    +          .mode(SaveMode.Overwrite)
    +          .save()
    +    }
    +  }
    +
    +  // load data into parquet, carbonV2, carbonV3
    +  private def prepareTable(spark: SparkSession): Unit = {
    +    val df = generateDataFrame(spark)
    +    println(s"loading ${df.count} records, schema: ${df.schema}")
    +    val loadParquetTime = loadParquetTable(spark, df)
    +    val loadCarbonV3Time = loadCarbonTable(spark, df, version = "3")
    +    val loadCarbonV2Time = loadCarbonTable(spark, df, version = "2")
    +    println(s"load completed, time: $loadParquetTime, $loadCarbonV3Time, $loadCarbonV2Time")
    +    spark.read.parquet(parquetTableName).registerTempTable(parquetTableName)
    +  }
    +
    +  private def runQuery(spark: SparkSession): Unit = {
    +    println("[")
    +    queries.zipWithIndex.foreach {
    +      case (query, index) =>
    +        var result1: Array[Row] = null
    +        var result2: Array[Row] = null
    +
    +        // query parquet, carbonV3, carbonV2, and carbonV2 usinga non-vector reader
    +        CarbonProperties.getInstance().addProperty("carbon.enable.vector.reader", "false")
    +        var sqlText = query.sqlText.replace("$table", carbonTableName("2"))
    +        val carbonV2NonVectorQueryTime = timeit {
    +          spark.sql(sqlText).collect()
    +        }
    +        CarbonProperties.getInstance().addProperty("carbon.enable.vector.reader", "true")
    +        sqlText = query.sqlText.replace("$table", carbonTableName("2"))
    +        val carbonV2QueryTime = timeit {
    +          spark.sql(sqlText).collect()
    +        }
    +        CarbonProperties.getInstance().addProperty("carbon.enable.vector.reader", "true")
    +        sqlText = query.sqlText.replace("$table", carbonTableName("3"))
    +        val carbonV3QueryTime = timeit {
    +          result2 = spark.sql(sqlText).collect()
    +        }
    +        sqlText = query.sqlText.replace("$table", parquetTableName)
    +        val parquetQueryTime = timeit {
    +          result1 = spark.sql(sqlText).collect()
    +        }
    +        checkResult(result1, result2)
    +        println("\t" +
    +            s"""{ "query":"$index", """ +
    +            s""""parquet"=$parquetQueryTime, """ +
    +            s""""carbonV3"=$carbonV3QueryTime, """ +
    +            s""""carbonV2"=$carbonV2QueryTime, """ +
    +            s""""carbonV2nonVector"=$carbonV2NonVectorQueryTime, """ +
    +            s""""fetched"=${result1.length}", """ +
    +            s""""desc"="${query.desc}" }, """ )
    +    }
    +    println("]")
    +  }
    +
    +  private def checkResult(result1: Array[Row], result2: Array[Row]): Unit = {
    +    if (result1.length != result2.length) {
    +      sys.error(s"result not matching: ${result1.length} and ${result2.length}")
    +    }
    +  }
    +
    +  def main(args: Array[String]): Unit = {
    +    CarbonProperties.getInstance()
    +        .addProperty("carbon.number.of.page.in.blocklet.column", "10")
    +        .addProperty("carbon.enable.vector.reader", "true")
    +        .addProperty("enable.unsafe.sort", "true")
    +    import org.apache.spark.sql.CarbonSession._
    +    val spark = SparkSession
    +        .builder()
    +        .master("local")
    +        .enableHiveSupport()
    +        .getOrCreateCarbonSession("./carbonstore")
    --- End diff --
    
    ok


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata issue #624: [CARBONDATA-747] Add simple performance tes...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/incubator-carbondata/pull/624
  
    Build Success with Spark 1.6.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/1117/



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata issue #624: [CARBONDATA-747] Add simple performance tes...

Posted by jackylk <gi...@git.apache.org>.
Github user jackylk commented on the issue:

    https://github.com/apache/incubator-carbondata/pull/624
  
    I am afraid that 10Million record (around 200 MB) is too less, it will fit in OS cache if I run all queries for one after another. Any suggestion on this? Make more data?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #624: [CARBONDATA-747][WIP] Add simple per...

Posted by jarray888 <gi...@git.apache.org>.
Github user jarray888 commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/624#discussion_r104374666
  
    --- Diff: examples/spark2/src/main/scala/org/apache/carbondata/examples/CompareTest.scala ---
    @@ -0,0 +1,122 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.carbondata.examples
    +
    +import org.apache.spark.sql.{DataFrame, Row, SaveMode, SparkSession}
    +
    +import org.apache.carbondata.core.util.CarbonProperties
    +
    +// scalastyle:off println
    +object CompareTest {
    +
    +  val parquetTableName = "comparetest_parquet"
    +  val carbonTableName = "comparetest_carbon"
    +
    +  private def generateDataFrame(spark: SparkSession): DataFrame = {
    +    import spark.implicits._
    +    spark.sparkContext.parallelize(1 to 10 * 1000 * 1000, 4)
    +        .map(x => ("i" + x, "p" + x % 10, "j" + x % 100, x, x + 1, (x + 7) % 21, (x + 5) / 43, x
    +            * 5))
    +        .toDF("id", "country", "city", "c4", "c5", "c6", "c7", "c8")
    --- End diff --
    
    can you add a column using decimal data type?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #624: [CARBONDATA-747] Add simple performa...

Posted by jackylk <gi...@git.apache.org>.
Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/624#discussion_r106448866
  
    --- Diff: examples/spark2/src/main/scala/org/apache/carbondata/examples/CompareTest.scala ---
    @@ -0,0 +1,372 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.carbondata.examples
    +
    +import java.io.File
    +import java.text.SimpleDateFormat
    +import java.util.Date
    +
    +import scala.util.Random
    +
    +import org.apache.spark.sql.{DataFrame, Row, SaveMode, SparkSession}
    +import org.apache.spark.sql.types._
    +
    +import org.apache.carbondata.core.constants.CarbonCommonConstants
    +import org.apache.carbondata.core.util.CarbonProperties
    +
    +/**
    + * A query test case
    + * @param sqlText SQL statement
    + * @param queryType type of query: scan, filter, aggregate, topN
    + * @param desc description of the goal of this test case
    + */
    +case class Query(sqlText: String, queryType: String, desc: String)
    +
    +// scalastyle:off println
    +object CompareTest {
    +
    +  def parquetTableName: String = "comparetest_parquet"
    +  def carbonTableName(version: String): String = s"comparetest_carbonV$version"
    +
    +  // Table schema:
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | Column name | Data type | Cardinality | Column type | Dictionary |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | city        | string    | 8           | dimension   | yes        |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | country     | string    | 1103        | dimension   | yes        |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | planet      | string    | 100,007     | dimension   | yes        |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | id          | string    | 10,000,000  | dimension   | no         |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | m1          | short     | NA          | measure     | no         |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | m2          | int       | NA          | measure     | no         |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | m3          | big int   | NA          | measure     | no         |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | m4          | double    | NA          | measure     | no         |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | m5          | double    | NA          | measure     | no         |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  private def generateDataFrame(spark: SparkSession): DataFrame = {
    +    val r = new Random()
    +    val rdd = spark.sparkContext
    +        .parallelize(1 to 1 * 1000 * 1000, 4)
    +        .map { x =>
    +          ("city" + x % 8, "country" + x % 1103, "planet" + x % 10007, x.toString,
    +          (x % 16).toShort, x / 2, (x << 1).toLong, x.toDouble / 13, x.toDouble / 11)
    +        }.map { x =>
    +          Row(x._1, x._2, x._3, x._4, x._5, x._6, x._7, x._8, x._9)
    +        }
    +
    +    val schema = StructType(
    +      Seq(
    +        StructField("city", StringType, nullable = false),
    +        StructField("country", StringType, nullable = false),
    +        StructField("planet", StringType, nullable = false),
    +        StructField("id", StringType, nullable = false),
    +        StructField("m1", ShortType, nullable = false),
    +        StructField("m2", IntegerType, nullable = false),
    +        StructField("m3", LongType, nullable = false),
    +        StructField("m4", DoubleType, nullable = false),
    +        StructField("m5", DoubleType, nullable = false)
    +      )
    +    )
    +
    +    spark.createDataFrame(rdd, schema)
    +  }
    +
    +  // performance test queries, they are designed to test various data access type
    +  val queries: Array[Query] = Array(
    +    // ===========================================================================
    +    // ==                     FULL SCAN AGGREGATION                             ==
    +    // ===========================================================================
    +    Query(
    +      "select sum(m1) from $table",
    +      "full scan",
    +      "full scan query, 1 aggregate"
    +    ),
    +    Query(
    +      "select sum(m1), sum(m2) from $table",
    +      "full scan",
    +      "full scan query, 2 aggregate"
    +    ),
    +    Query(
    +      "select sum(m1), sum(m2), sum(m3) from $table",
    +      "full scan",
    +      "full scan query, 3 aggregate"
    +    ),
    +    Query(
    +      "select sum(m1), sum(m2), sum(m3), sum(m4) from $table",
    +      "full scan",
    +      "full scan query, 4 aggregate"
    +    ),
    +    Query(
    +      "select sum(m1), sum(m2), sum(m3), sum(m4), avg(m5) from $table",
    +      "full scan",
    +      "full scan query, 5 aggregate"
    +    ),
    +    Query(
    +      "select * from $table",
    +      "full scan",
    +      "full scan query, big result set"
    +    ),
    +    Query(
    +      "select count(distinct id) from $table",
    +      "full scan",
    +      "full scan and count distinct of high card column"
    +    ),
    +    Query(
    +      "select count(distinct country) from $table",
    +      "full scan",
    +      "full scan and count distinct of medium card column"
    +    ),
    +    Query(
    +      "select count(distinct city) from $table",
    +      "full scan",
    +      "full scan and count distinct of low card column"
    +    ),
    +    // ===========================================================================
    +    // ==                      FULL SCAN GROUP BY AGGREGATE                     ==
    +    // ===========================================================================
    +    Query(
    +      "select id, sum(m1) from $table group by id",
    +      "aggregate",
    +      "group by on big data, on high card column, big result set"
    +    ),
    +    Query(
    +      "select country, sum(m1) from $table group by country",
    +      "aggregate",
    +      "group by on big data, on medium card column, medium result set,"
    +    ),
    +    Query(
    +      "select city, sum(m1) from $table group by city",
    +      "aggregate",
    +      "group by on big data, on low card column, small result set,"
    +    ),
    +    Query(
    +      "select id, sum(m1) as metric from $table group by id order by metric desc limit 100",
    +      "topN",
    +      "top N on high card column"
    +    ),
    +    Query(
    +      "select country,sum(m1) as metric from $table group by country order by metric desc limit 10",
    +      "topN",
    +      "top N on medium card column"
    +    ),
    +    Query(
    +      "select city,sum(m1) as metric from $table group by city order by metric desc limit 10",
    +      "topN",
    +      "top N on low card column"
    +    ),
    +    // ===========================================================================
    +    // ==                  FILTER SCAN GROUP BY AGGREGATION                     ==
    +    // ===========================================================================
    +    Query(
    +      "select country, sum(m1) from $table where city='city8' group by country ",
    +      "filter scan and aggregate",
    +      "group by on large data, small result set"
    +    ),
    +    Query(
    +      "select id, sum(m1) from $table where planet='planet10' group by id",
    +      "filter scan and aggregate",
    +      "group by on medium data, large result set"
    +    ),
    +    Query(
    +      "select city, sum(m1) from $table where country='country12' group by city ",
    +      "filter scan and aggregate",
    +      "group by on medium data, small result set"
    +    ),
    +    // ===========================================================================
    +    // ==                             FILTER SCAN                               ==
    +    // ===========================================================================
    +    Query(
    +      "select * from $table where city = 'city3' limit 10000",
    +      "filter scan",
    +      "filter on low card dimension, limit, medium result set, fetch all columns"
    +    ),
    +    Query(
    +      "select * from $table where country = 'country9' ",
    +      "filter scan",
    +      "filter on low card dimension, medium result set, fetch all columns"
    +    ),
    +    Query(
    +      "select * from $table where planet = 'planet101' ",
    +      "filter scan",
    +      "filter on medium card dimension, small result set, fetch all columns"
    +    ),
    +    Query(
    +      "select * from $table where id = '408938' ",
    +      "filter scan",
    +      "filter on high card dimension"
    +    ),
    +    Query(
    +      "select * from $table where country='country10000'  ",
    +      "filter scan",
    +      "filter on low card dimension, not exist"
    +    ),
    +    Query(
    +      "select * from $table where country='country2' and city ='city8' ",
    +      "filter scan",
    +      "filter on 2 dimensions, small result set, fetch all columns"
    +    ),
    +    Query(
    +      "select * from $table where city='city1' and country='country2' and planet ='planet3' ",
    +      "filter scan",
    +      "filter on 3 dimensions, small result set, fetch all columns"
    +    ),
    +    Query(
    +      "select * from $table where m1 < 3",
    +      "filter scan",
    +      "filter on measure, small result set, fetch all columns"
    +    ),
    +    Query(
    +      "select * from $table where id like '1%' ",
    +      "fuzzy filter scan",
    +      "like filter, big result set"
    +    ),
    +    Query(
    +      "select * from $table where id like '%111'",
    +      "fuzzy filter scan",
    +      "like filter, medium result set"
    +    ),
    +    Query(
    +      "select * from $table where id like 'xyz%' ",
    +      "fuzzy filter scan",
    +      "like filter, full scan but not exist"
    +    )
    +  )
    +
    +  private def loadParquetTable(spark: SparkSession, input: DataFrame): Double = time {
    +    // partitioned by last 1 digit of id column
    +    val dfWithPartition = input.withColumn("partitionCol", input.col("id").%(10))
    +    dfWithPartition.write
    +        .partitionBy("partitionCol")
    +        .mode(SaveMode.Overwrite)
    +        .parquet(parquetTableName)
    +  }
    +
    +  private def loadCarbonTable(spark: SparkSession, input: DataFrame, version: String): Double = {
    +    CarbonProperties.getInstance().addProperty(
    +      CarbonCommonConstants.CARBON_DATA_FILE_VERSION,
    +      version
    +    )
    +    spark.sql(s"drop table if exists ${carbonTableName(version)}")
    +    time {
    +      input.write
    +          .format("carbondata")
    +          .option("tableName", carbonTableName(version))
    +          .option("tempCSV", "false")
    +          .option("single_pass", "true")
    +          .option("dictionary_exclude", "id") // id is high cardinality column
    +          .mode(SaveMode.Overwrite)
    +          .save()
    +    }
    +  }
    +
    +  // load data into parquet, carbonV2, carbonV3
    +  private def prepareTable(spark: SparkSession): Unit = {
    +    val df = generateDataFrame(spark)
    +    println(s"loading ${df.count} records, schema: ${df.schema}")
    +    val loadParquetTime = loadParquetTable(spark, df)
    +    val loadCarbonV3Time = loadCarbonTable(spark, df, version = "3")
    +    println(s"load completed, time: $loadParquetTime, $loadCarbonV3Time")
    +    spark.read.parquet(parquetTableName).registerTempTable(parquetTableName)
    +  }
    +
    +  // Run all queries for the specified table
    +  private def runQueries(spark: SparkSession, tableName: String): Array[(Double, Array[Row])] = {
    +    println(s"start running queries for $tableName...")
    +    var result: Array[Row] = null
    +    queries.zipWithIndex.map { case (query, index) =>
    +      val sqlText = query.sqlText.replace("$table", tableName)
    +      print(s"running query ${index + 1}: $sqlText ")
    +      val rt = time {
    +        result = spark.sql(sqlText).collect()
    +      }
    +      println(s"=> $rt sec")
    +      (rt, result)
    +    }
    +  }
    +
    +  // run testcases and print comparison result
    +  private def runTest(spark: SparkSession): Unit = {
    +    val formatter = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
    +    val date = new Date
    +    val timestamp = date.getTime
    +    // run queries on parquet and carbon
    +    val parquetResult: Array[(Double, Array[Row])] = runQueries(spark, parquetTableName)
    +    // do GC and sleep for some time before running next table
    +    System.gc()
    +    Thread.sleep(1000)
    +    val carbonResult: Array[(Double, Array[Row])] = runQueries(spark, carbonTableName("3"))
    +    // check result by comparing output from parquet and carbon
    +    parquetResult.zipWithIndex.foreach { case (result, index) =>
    +      if (result._2.length != carbonResult(index)._2.length) {
    +        sys.error(s"result not matching for query ${index + 1}: " +
    +            s"${result._2.length} and ${carbonResult(index)._2.length}")
    +      }
    +    }
    +    // print all response time in JSON format, so that it can be analyzed later
    +    queries.zipWithIndex.foreach { case (query, index) =>
    +      println("{" +
    +          s""""query":"${index + 1}", """ +
    +          s""""parquetTime":${parquetResult(index)._1}, """ +
    +          s""""carbonTime":${carbonResult(index)._1}, """ +
    +          s""""fetched":${parquetResult(index)._2.length}, """ +
    +          s""""type":"${query.queryType}", """ +
    +          s""""desc":"${query.desc}",  """ +
    +          s""""timestamp": "$timestamp" """ +
    +          s""""date": "${formatter.format(date)}" """ +
    +          "}"
    +      )
    +    }
    +  }
    +
    +  def main(args: Array[String]): Unit = {
    +    CarbonProperties.getInstance()
    +        .addProperty("carbon.number.of.page.in.blocklet.column", "10")
    --- End diff --
    
    ok


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata issue #624: [CARBONDATA-747][WIP] Add simple performanc...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/incubator-carbondata/pull/624
  
    Build Success with Spark 1.6.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/1037/



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata issue #624: [CARBONDATA-747] Add simple performance tes...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/incubator-carbondata/pull/624
  
    Build Success with Spark 1.6.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/1193/



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata issue #624: [CARBONDATA-747] Add simple performance tes...

Posted by ravipesala <gi...@git.apache.org>.
Github user ravipesala commented on the issue:

    https://github.com/apache/incubator-carbondata/pull/624
  
    LGTM


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata issue #624: [CARBONDATA-747] Add simple performance tes...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/incubator-carbondata/pull/624
  
    Build Success with Spark 1.6.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/1101/



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #624: [CARBONDATA-747] Add simple performa...

Posted by jackylk <gi...@git.apache.org>.
Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/624#discussion_r106400558
  
    --- Diff: examples/spark2/src/main/scala/org/apache/carbondata/examples/CompareTest.scala ---
    @@ -0,0 +1,343 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.carbondata.examples
    +
    +import java.io.File
    +import java.util.Date
    +
    +import org.apache.spark.sql.{DataFrame, Row, SaveMode, SparkSession}
    +import org.apache.spark.sql.types._
    +
    +import org.apache.carbondata.core.constants.CarbonCommonConstants
    +import org.apache.carbondata.core.util.CarbonProperties
    +
    +/**
    + * A query test case
    + * @param sqlText SQL statement
    + * @param queryType type of query: scan, filter, aggregate, topN
    + * @param desc description of the goal of this test case
    + */
    +case class Query(sqlText: String, queryType: String, desc: String)
    +
    +// scalastyle:off println
    +object CompareTest {
    +
    +  def parquetTableName: String = "comparetest_parquet"
    +  def carbonTableName(version: String): String = s"comparetest_carbonV$version"
    +
    +  // Table schema:
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | Column name | Data type | Cardinality | Column type | Dictionary |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | id          | string    | 10,000,000  | dimension   | no         |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | country     | string    | 1103        | dimension   | yes        |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | city        | string    | 13          | dimension   | yes        |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | c4          | short     | NA          | measure     | no         |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | c5          | int       | NA          | measure     | no         |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | c6          | big int   | NA          | measure     | no         |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | c7          | double    | NA          | measure     | no         |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | c8          | double    | NA          | measure     | no         |
    +  // +-------------+-----------+-------------+-------------+------------+
    --- End diff --
    
    ok


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #624: [CARBONDATA-747][WIP] Add simple per...

Posted by jackylk <gi...@git.apache.org>.
Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/624#discussion_r104566012
  
    --- Diff: examples/spark2/src/main/scala/org/apache/carbondata/examples/CompareTest.scala ---
    @@ -0,0 +1,122 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.carbondata.examples
    +
    +import org.apache.spark.sql.{DataFrame, Row, SaveMode, SparkSession}
    +
    +import org.apache.carbondata.core.util.CarbonProperties
    +
    +// scalastyle:off println
    +object CompareTest {
    +
    +  val parquetTableName = "comparetest_parquet"
    +  val carbonTableName = "comparetest_carbon"
    +
    +  private def generateDataFrame(spark: SparkSession): DataFrame = {
    +    import spark.implicits._
    +    spark.sparkContext.parallelize(1 to 10 * 1000 * 1000, 4)
    +        .map(x => ("i" + x, "p" + x % 10, "j" + x % 100, x, x + 1, (x + 7) % 21, (x + 5) / 43, x
    +            * 5))
    +        .toDF("id", "country", "city", "c4", "c5", "c6", "c7", "c8")
    --- End diff --
    
    ok, I found decimal is not supported for dataframe.write, I will  raise a JIRA


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #624: [CARBONDATA-747][WIP] Add simple per...

Posted by jackylk <gi...@git.apache.org>.
Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/624#discussion_r104564975
  
    --- Diff: examples/spark2/src/main/scala/org/apache/carbondata/examples/CompareTest.scala ---
    @@ -0,0 +1,122 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.carbondata.examples
    +
    +import org.apache.spark.sql.{DataFrame, Row, SaveMode, SparkSession}
    +
    +import org.apache.carbondata.core.util.CarbonProperties
    +
    +// scalastyle:off println
    +object CompareTest {
    +
    +  val parquetTableName = "comparetest_parquet"
    +  val carbonTableName = "comparetest_carbon"
    +
    +  private def generateDataFrame(spark: SparkSession): DataFrame = {
    +    import spark.implicits._
    +    spark.sparkContext.parallelize(1 to 10 * 1000 * 1000, 4)
    +        .map(x => ("i" + x, "p" + x % 10, "j" + x % 100, x, x + 1, (x + 7) % 21, (x + 5) / 43, x
    +            * 5))
    +        .toDF("id", "country", "city", "c4", "c5", "c6", "c7", "c8")
    +  }
    +
    +  private def loadParquetTable(spark: SparkSession, input: DataFrame): Long = timeit {
    +    input.write.mode(SaveMode.Overwrite).parquet(parquetTableName)
    +  }
    +
    +  private def loadCarbonTable(spark: SparkSession, input: DataFrame): Long = {
    +    spark.sql(s"drop table if exists $carbonTableName")
    +    timeit {
    +      input.write
    +          .format("carbondata")
    +          .option("tableName", carbonTableName)
    +          .option("tempCSV", "false")
    +          .option("single_pass", "true")
    +          .option("dictionary_exclude", "id") // id is high cardinality column
    +          .mode(SaveMode.Overwrite)
    +          .save()
    +    }
    +  }
    +
    +  private def prepareTable(spark: SparkSession): Unit = {
    +    val df = generateDataFrame(spark).cache()
    +    println(s"loading dataframe into table, schema: ${df.schema}")
    +    val loadParquetTime = loadParquetTable(spark, df)
    +    val loadCarbonTime = loadCarbonTable(spark, df)
    +    println(s"load completed, time: $loadParquetTime, $loadCarbonTime")
    +    spark.read.parquet(parquetTableName).registerTempTable(parquetTableName)
    +  }
    +
    +  private def runQuery(spark: SparkSession): Unit = {
    +    val test = Array(
    +      "select count(*) from $table",
    +      "select sum(c4) from $table",
    +      "select sum(c4), sum(c5) from $table",
    +      "select sum(c4), sum(c5), sum(c6) from $table",
    +      "select sum(c4), sum(c5), sum(c6), sum(c7) from $table",
    +      "select sum(c4), sum(c5), sum(c6), sum(c7), avg(c8) from $table",
    +      "select * from $table where id = 'i9999999' ",
    +      "select * from $table where country = 'p9' ",
    +      "select * from $table where city = 'j99' ",
    +      "select * from $table where c4 < 1000 "
    --- End diff --
    
    added


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #624: [CARBONDATA-747][WIP] Add simple per...

Posted by jackylk <gi...@git.apache.org>.
Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/624#discussion_r104565728
  
    --- Diff: examples/spark2/src/main/scala/org/apache/carbondata/examples/CompareTest.scala ---
    @@ -0,0 +1,122 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.carbondata.examples
    +
    +import org.apache.spark.sql.{DataFrame, Row, SaveMode, SparkSession}
    +
    +import org.apache.carbondata.core.util.CarbonProperties
    +
    +// scalastyle:off println
    +object CompareTest {
    +
    +  val parquetTableName = "comparetest_parquet"
    +  val carbonTableName = "comparetest_carbon"
    +
    +  private def generateDataFrame(spark: SparkSession): DataFrame = {
    +    import spark.implicits._
    +    spark.sparkContext.parallelize(1 to 10 * 1000 * 1000, 4)
    +        .map(x => ("i" + x, "p" + x % 10, "j" + x % 100, x, x + 1, (x + 7) % 21, (x + 5) / 43, x
    --- End diff --
    
    ok


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata issue #624: [CARBONDATA-747] Add simple performance tes...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/incubator-carbondata/pull/624
  
    Build Success with Spark 1.6.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/1195/



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata issue #624: [CARBONDATA-747] Add simple performance tes...

Posted by CarbonDataQA <gi...@git.apache.org>.
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/incubator-carbondata/pull/624
  
    Build Success with Spark 1.6.2, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder/1194/



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-carbondata pull request #624: [CARBONDATA-747][WIP] Add simple per...

Posted by chenliang613 <gi...@git.apache.org>.
Github user chenliang613 commented on a diff in the pull request:

    https://github.com/apache/incubator-carbondata/pull/624#discussion_r104542895
  
    --- Diff: examples/spark2/src/main/scala/org/apache/carbondata/examples/CompareTest.scala ---
    @@ -0,0 +1,347 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.carbondata.examples
    +
    +import org.apache.spark.sql.{DataFrame, Row, SaveMode, SparkSession}
    +import org.apache.spark.sql.types._
    +
    +import org.apache.carbondata.core.constants.CarbonCommonConstants
    +import org.apache.carbondata.core.util.CarbonProperties
    +
    +/**
    + * A query test case
    + * @param sqlText SQL statement
    + * @param queryType type of query: scan, filter, aggregate, topN
    + * @param desc description of the goal of this test case
    + */
    +case class Query(sqlText: String, queryType: String, desc: String)
    +
    +// scalastyle:off println
    +object CompareTest {
    +
    +  def parquetTableName: String = "comparetest_parquet"
    +  def carbonTableName(version: String): String = s"comparetest_carbonV$version"
    +
    +  // Table schema:
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | Column name | Data type | Cardinality | Column type | Dictionary |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | id          | string    | 10,000,000  | dimension   | no         |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | country     | string    | 1103        | dimension   | yes        |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | city        | string    | 13          | dimension   | yes        |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | c4          | short     | NA          | measure     | no         |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | c5          | int       | NA          | measure     | no         |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | c6          | big int   | NA          | measure     | no         |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | c7          | double    | NA          | measure     | no         |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  // | c8          | double    | NA          | measure     | no         |
    +  // +-------------+-----------+-------------+-------------+------------+
    +  private def generateDataFrame(spark: SparkSession): DataFrame = {
    +    val rdd = spark.sparkContext
    +        .parallelize(1 to 10 * 1000 * 1000, 4)
    +        .map { x =>
    +          (x.toString, "p" + x % 1103, "j" + x % 13, (x % 31).toShort, x, x.toLong * 1000,
    +              x.toDouble / 13, x.toDouble / 71 )
    +        }.map { x =>
    +          Row(x._1, x._2, x._3, x._4, x._5, x._6, x._7, x._8)
    +        }
    +    val schema = StructType(
    +      Seq(
    +        StructField("id", StringType, nullable = false),
    +        StructField("country", StringType, nullable = false),
    +        StructField("city", StringType, nullable = false),
    +        StructField("c4", ShortType, nullable = true),
    +        StructField("c5", IntegerType, nullable = true),
    +        StructField("c6", LongType, nullable = true),
    +        StructField("c7", DoubleType, nullable = true),
    +        StructField("c8", DoubleType, nullable = true)
    +      )
    +    )
    +    spark.createDataFrame(rdd, schema)
    +  }
    +
    +  // performance test queries
    +  val queries: Array[Query] = Array(
    +    Query(
    +      "select count(*) from $table",
    +      "warm up",
    +      "warm up query"
    +    ),
    +    // ===========================================================================
    +    // ==                               FULL SCAN                               ==
    +    // ===========================================================================
    +    Query(
    +      "select sum(c4) from $table",
    +      "full scan",
    +      "full scan query, 1 aggregate"
    +    ),
    +    Query(
    +      "select sum(c4), sum(c5) from $table",
    +      "full scan",
    +      "full scan query, 2 aggregate"
    +    ),
    +    Query(
    +      "select sum(c4), sum(c5), sum(c6) from $table",
    +      "full scan",
    +      "full scan query, 3 aggregate"
    +    ),
    +    Query(
    +      "select sum(c4), sum(c5), sum(c6), sum(c7) from $table",
    +      "full scan",
    +      "full scan query, 4 aggregate"
    +    ),
    +    Query(
    +      "select sum(c4), sum(c5), sum(c6), sum(c7), avg(c8) from $table",
    +      "full scan",
    +      "full scan query, 4 aggregate"
    +    ),
    +//    Query(
    +//      "select * from $table",
    +//      "full scan",
    +//      "full scan query, big result set"
    +//    ),
    +    Query(
    +      "select count(distinct id) from $table",
    +      "full scan",
    +      "full scan and count distinct of high card column"
    +    ),
    +    Query(
    +      "select count(distinct country) from $table",
    +      "full scan",
    +      "full scan and count distinct of medium card column"
    +    ),
    +    Query(
    +      "select count(distinct city) from $table",
    +      "full scan",
    +      "full scan and count distinct of low card column"
    +    ),
    +    // ===========================================================================
    +    // ==                               AGGREGATE                               ==
    +    // ===========================================================================
    +//    Query(
    +//      "select id, sum(c4) from $table group by id",
    +//      "aggregate",
    +//      "group by on big data, on high card column, big result set"
    +//    ),
    +    Query(
    +      "select country, sum(c4) from $table group by country",
    +      "aggregate",
    +      "group by on big data, on medium card column, medium result set,"
    +    ),
    +    Query(
    +      "select city, sum(c4) from $table group by city",
    +      "aggregate",
    +      "group by on big data, on low card column, small result set,"
    +    ),
    +    Query(
    +      "select id, sum(c4) as metric from $table group by id order by metric desc limit 100",
    +      "topN",
    +      "top N on high card column"
    +    ),
    +    Query(
    +      "select country,sum(c4) as metric from $table group by country order by metric desc limit 10",
    +      "topN",
    +      "top N on medium card column"
    +    ),
    +    Query(
    +      "select city,sum(c4) as metric from $table group by city order by metric desc limit 10",
    +      "topN",
    +      "top N on low card column"
    +    ),
    +    // ===========================================================================
    +    // ==                             FILTER SCAN                               ==
    +    // ===========================================================================
    +    Query(
    +      "select id, sum(c4) from $table where city='j9' group by id",
    +      "filter scan and aggregate",
    +      "group by on medium data, large result set"
    +    ),
    +    Query(
    +      "select country, sum(c4) from $table where city='j8' group by country ",
    +      "filter scan and aggregate",
    +      "group by on medium data, medium result set"
    +    ),
    +    Query(
    +      "select city, sum(c4) from $table where city='j7' group by city ",
    +      "filter scan and aggregate",
    +      "group by on medium data, small result set"
    +    ),
    +    Query(
    +      "select * from $table where id = '408938' ",
    +      "primary key filter",
    +      "primary key filtering"
    +    ),
    +    Query(
    +      "select * from $table where country = 'p99' ",
    +      "filter scan",
    +      "filter on c2, small result set, fetch all columns"
    +    ),
    +    Query(
    +      "select * from $table where country='p10000'  ",
    +      "filter scan",
    +      "filter on c2, not exist"
    +    ),
    +    Query(
    +      "select * from $table where city = 'j3' limit 100000",
    +      "filter scan",
    +      "filter on c3, medium result set, fetch all columns"
    +    ),
    +    Query(
    +      "select * from $table where country='p88' and city ='j8' ",
    +      "filter scan",
    +      "filter on c2 and c3, small result set, fetch all columns"
    +    ),
    +    Query(
    +      "select * from $table where id='1' and country='p1' and city ='j1' ",
    +      "filter scan",
    +      "filter on c1, c2 and c3, small result set, fetch all columns"
    +    ),
    +    Query(
    +      "select * from $table where c4 < 3",
    +      "filter scan",
    +      "filter on measure, small result set, fetch all columns"
    +    ),
    +    Query(
    +      "select * from $table where id like '1%' ",
    +      "fuzzy filter scan",
    +      "like filter, big result set"
    +    ),
    +    Query(
    +      "select * from $table where id like '%111'",
    +      "fuzzy filter scan",
    +      "like filter, medium result set"
    +    ),
    +    Query(
    +      "select * from $table where id like 'xyz%' ",
    +      "fuzzy filter scan",
    +      "like filter, full scan but not exist"
    +    )
    +  )
    +
    +  private def loadParquetTable(spark: SparkSession, input: DataFrame): Double = timeit {
    +    input.write.mode(SaveMode.Overwrite).parquet(parquetTableName)
    +  }
    +
    +  private def loadCarbonTable(spark: SparkSession, input: DataFrame, version: String): Double = {
    +    CarbonProperties.getInstance().addProperty(
    +      CarbonCommonConstants.CARBON_DATA_FILE_VERSION,
    +      version
    +    )
    +    spark.sql(s"drop table if exists ${carbonTableName(version)}")
    +    timeit {
    +      input.write
    +          .format("carbondata")
    +          .option("tableName", carbonTableName(version))
    +          .option("tempCSV", "false")
    +          .option("single_pass", "true")
    +          .option("dictionary_exclude", "id") // id is high cardinality column
    +          .mode(SaveMode.Overwrite)
    +          .save()
    +    }
    +  }
    +
    +  // load data into parquet, carbonV2, carbonV3
    +  private def prepareTable(spark: SparkSession): Unit = {
    +    val df = generateDataFrame(spark)
    +    println(s"loading ${df.count} records, schema: ${df.schema}")
    +    val loadParquetTime = loadParquetTable(spark, df)
    +    val loadCarbonV3Time = loadCarbonTable(spark, df, version = "3")
    +    val loadCarbonV2Time = loadCarbonTable(spark, df, version = "2")
    +    println(s"load completed, time: $loadParquetTime, $loadCarbonV3Time, $loadCarbonV2Time")
    +    spark.read.parquet(parquetTableName).registerTempTable(parquetTableName)
    +  }
    +
    +  private def runQuery(spark: SparkSession): Unit = {
    +    println("[")
    +    queries.zipWithIndex.foreach {
    +      case (query, index) =>
    +        var result1: Array[Row] = null
    +        var result2: Array[Row] = null
    +
    +        // query parquet, carbonV3, carbonV2, and carbonV2 usinga non-vector reader
    +        CarbonProperties.getInstance().addProperty("carbon.enable.vector.reader", "false")
    +        var sqlText = query.sqlText.replace("$table", carbonTableName("2"))
    +        val carbonV2NonVectorQueryTime = timeit {
    +          spark.sql(sqlText).collect()
    +        }
    +        CarbonProperties.getInstance().addProperty("carbon.enable.vector.reader", "true")
    +        sqlText = query.sqlText.replace("$table", carbonTableName("2"))
    +        val carbonV2QueryTime = timeit {
    +          spark.sql(sqlText).collect()
    +        }
    +        CarbonProperties.getInstance().addProperty("carbon.enable.vector.reader", "true")
    +        sqlText = query.sqlText.replace("$table", carbonTableName("3"))
    +        val carbonV3QueryTime = timeit {
    +          result2 = spark.sql(sqlText).collect()
    +        }
    +        sqlText = query.sqlText.replace("$table", parquetTableName)
    +        val parquetQueryTime = timeit {
    +          result1 = spark.sql(sqlText).collect()
    +        }
    +        checkResult(result1, result2)
    +        println("\t" +
    +            s"""{ "query":"$index", """ +
    +            s""""parquet"=$parquetQueryTime, """ +
    +            s""""carbonV3"=$carbonV3QueryTime, """ +
    +            s""""carbonV2"=$carbonV2QueryTime, """ +
    +            s""""carbonV2nonVector"=$carbonV2NonVectorQueryTime, """ +
    +            s""""fetched"=${result1.length}", """ +
    +            s""""desc"="${query.desc}" }, """ )
    +    }
    +    println("]")
    +  }
    +
    +  private def checkResult(result1: Array[Row], result2: Array[Row]): Unit = {
    +    if (result1.length != result2.length) {
    +      sys.error(s"result not matching: ${result1.length} and ${result2.length}")
    +    }
    +  }
    +
    +  def main(args: Array[String]): Unit = {
    +    CarbonProperties.getInstance()
    +        .addProperty("carbon.number.of.page.in.blocklet.column", "10")
    +        .addProperty("carbon.enable.vector.reader", "true")
    +        .addProperty("enable.unsafe.sort", "true")
    +    import org.apache.spark.sql.CarbonSession._
    +    val spark = SparkSession
    +        .builder()
    +        .master("local")
    +        .enableHiveSupport()
    +        .getOrCreateCarbonSession("./carbonstore")
    --- End diff --
    
    Please put carbonstore to target directory.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---