Posted to reviews@spark.apache.org by sameeragarwal <gi...@git.apache.org> on 2016/05/19 05:26:42 UTC

[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

GitHub user sameeragarwal opened a pull request:

    https://github.com/apache/spark/pull/13188

    [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark queries for SparkSQL

    ## What changes were proposed in this pull request?
    
    Now that SparkSQL supports all TPC-DS queries, this patch adds all 99 benchmark queries inside SparkSQL.
    
    ## How was this patch tested?
    
    Benchmark only

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sameeragarwal/spark tpcds-all

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13188.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13188
    
----
commit e584575bb786e77b7ea1d6de3f80ec556011d291
Author: Sameer Agarwal <sa...@databricks.com>
Date:   2016-05-03T00:28:12Z

    Add all TPCDS queries

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13188#issuecomment-220728198
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59019/
    Test PASSed.



[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13188#issuecomment-220241461
  
    **[Test build #58844 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58844/consoleFull)** for PR 13188 at commit [`e584575`](https://github.com/apache/spark/commit/e584575bb786e77b7ea1d6de3f80ec556011d291).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `              |           and i_class in('personal', 'portable', 'reference', 'self-help')`
      * `              |           and i_class in('accessories', 'classical', 'fragrances', 'pants')`
      * `              |            and i_class in('personal', 'portable', 'refernece', 'self-help')`
      * `              |            and i_class in('accessories', 'classical', 'fragrances', 'pants')`
      * `              |          and i_class in('wallpaper', 'parenting', 'musical'))`
      * `              |            and i_class in('womens', 'birdal', 'pants'))`
      * `      i_class IN ('personal', 'portable', 'reference', 'self-help') AND`
      * `        i_class IN ('accessories', 'classical', 'fragrances', 'pants') AND`
      * `  AND i_class IN ('personal', 'portable', 'refernece', 'self-help')`
      * `  AND i_class IN ('accessories', 'classical', 'fragrances', 'pants')`
      * `           i_class IN ('computers', 'stereo', 'football'))`
      * `           i_class IN ('shirts', 'birdal', 'dresses')))`



[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13188#issuecomment-220503069
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13188#issuecomment-220486610
  
    Merged build finished. Test FAILed.



[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13188#discussion_r63969828
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/tpcds/queries/q77.sql ---
    @@ -0,0 +1,100 @@
    +WITH ss AS
    +(SELECT
    +    s_store_sk,
    +    sum(ss_ext_sales_price) AS sales,
    +    sum(ss_net_profit) AS profit
    +  FROM store_sales, date_dim, store
    +  WHERE ss_sold_date_sk = d_date_sk
    +    AND d_date BETWEEN cast('2000-08-03' AS DATE) AND
    +  date_add(cast('2000-08-03' AS DATE), 30)
    +    AND ss_store_sk = s_store_sk
    +  GROUP BY s_store_sk),
    +    sr AS
    +  (SELECT
    +    s_store_sk,
    +    sum(sr_return_amt) AS returns,
    +    sum(sr_net_loss) AS profit_loss
    +  FROM store_returns, date_dim, store
    +  WHERE sr_returned_date_sk = d_date_sk
    +    AND d_date BETWEEN cast('2000-08-03' AS DATE) AND
    +  date_add(cast('2000-08-03' AS DATE), 30)
    +    AND sr_store_sk = s_store_sk
    +  GROUP BY s_store_sk),
    +    cs AS
    +  (SELECT
    +    cs_call_center_sk,
    +    sum(cs_ext_sales_price) AS sales,
    +    sum(cs_net_profit) AS profit
    +  FROM catalog_sales, date_dim
    +  WHERE cs_sold_date_sk = d_date_sk
    +    AND d_date BETWEEN cast('2000-08-03' AS DATE) AND
    +  date_add(cast('2000-08-03' AS DATE), 30)
    +  GROUP BY cs_call_center_sk),
    +    cr AS
    +  (SELECT
    +    sum(cr_return_amount) AS returns,
    +    sum(cr_net_loss) AS profit_loss
    +  FROM catalog_returns, date_dim
    +  WHERE cr_returned_date_sk = d_date_sk
    +    AND d_date BETWEEN cast('2000-08-03' AS DATE) AND
    +  date_add(cast('2000-08-03' AS DATE), 30)),
    +    ws AS
    +  (SELECT
    +    wp_web_page_sk,
    +    sum(ws_ext_sales_price) AS sales,
    +    sum(ws_net_profit) AS profit
    +  FROM web_sales, date_dim, web_page
    +  WHERE ws_sold_date_sk = d_date_sk
    +    AND d_date BETWEEN cast('2000-08-03' AS DATE) AND
    +  date_add(cast('2000-08-03' AS DATE), 30)
    --- End diff --
    
    Could you change the `date_add` back to an interval expression?
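
A minimal sketch of the two forms of the date-window predicate discussed in this comment, written as plain Scala string builders so it runs without Spark. The object and method names (`DateWindowPredicate`, `withDateAdd`, `withInterval`) are hypothetical and not part of the patch, and the exact interval syntax the reviewer has in mind is an assumption; `+ INTERVAL 30 days` is one spelling that Spark SQL accepts.

```scala
object DateWindowPredicate {
  // Hypothetical helpers for illustration only; neither is part of the patch.

  // Form shown in the diff: the upper bound is computed with date_add.
  def withDateAdd(start: String, days: Int): String =
    s"d_date BETWEEN cast('$start' AS DATE) AND date_add(cast('$start' AS DATE), $days)"

  // Interval form (assumed to be what the review asks for): the upper bound
  // is the start date plus an interval literal.
  def withInterval(start: String, days: Int): String =
    s"d_date BETWEEN cast('$start' AS DATE) AND (cast('$start' AS DATE) + INTERVAL $days days)"

  def main(args: Array[String]): Unit = {
    println(withDateAdd("2000-08-03", 30))
    println(withInterval("2000-08-03", 30))
  }
}
```

Both predicates are meant to select the same 30-day window; the difference is only in how the upper bound is spelled in the query text.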



[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13188#discussion_r63968907
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/tpcds/queries/ss_max.sql ---
    @@ -0,0 +1,14 @@
    +SELECT
    --- End diff --
    
    This query is not part of TPC-DS; we could remove it.



[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13188#discussion_r63971344
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/tpcds/queries/q31.sql ---
    @@ -0,0 +1,60 @@
    +WITH ss AS
    +(SELECT
    +    ca_county,
    +    d_qoy,
    +    d_year,
    +    sum(ss_ext_sales_price) AS store_sales
    +  FROM store_sales, date_dim, customer_address
    +  WHERE ss_sold_date_sk = d_date_sk
    +    AND ss_addr_sk = ca_address_sk
    +  GROUP BY ca_county, d_qoy, d_year),
    +    ws AS
    +  (SELECT
    +    ca_county,
    +    d_qoy,
    +    d_year,
    +    sum(ws_ext_sales_price) AS web_sales
    +  FROM web_sales, date_dim, customer_address
    +  WHERE ws_sold_date_sk = d_date_sk
    +    AND ws_bill_addr_sk = ca_address_sk
    +  GROUP BY ca_county, d_qoy, d_year)
    +SELECT
    +  ss1.ca_county,
    +  ss1.d_year,
    +  ws2.web_sales / ws1.web_sales web_q1_q2_increase,
    +  ss2.store_sales / ss1.store_sales store_q1_q2_increase,
    +  ws3.web_sales / ws2.web_sales web_q2_q3_increase,
    +  ss3.store_sales / ss2.store_sales store_q2_q3_increase
    +FROM
    +  ss ss1, ss ss2, ss ss3, ws ws1, ws ws2, ws ws3
    +WHERE
    +  ss1.d_qoy = 1
    +    AND ss1.d_year = 2000
    +    AND ss1.ca_county = ss2.ca_county
    +    AND ss2.d_qoy = 2
    +    AND ss2.d_year = 2000
    +    AND ss2.ca_county = ss3.ca_county
    +    AND ss3.d_qoy = 3
    +    AND ss3.d_year = 2000
    +    AND ss1.ca_county = ws1.ca_county
    +    AND ws1.d_qoy = 1
    +    AND ws1.d_year = 2000
    +    AND ws1.ca_county = ws2.ca_county
    +    AND ws2.d_qoy = 2
    +    AND ws2.d_year = 2000
    +    AND ws1.ca_county = ws3.ca_county
    +    AND ws3.d_qoy = 3
    +    AND ws3.d_year = 2000
    +    AND CASE WHEN ws1.web_sales > 0
    +    THEN ws2.web_sales / ws1.web_sales
    +        ELSE NULL END
    +    > CASE WHEN ss1.store_sales > 0
    +    THEN ss2.store_sales / ss1.store_sales
    +      ELSE NULL END
    +    AND CASE WHEN ws2.web_sales > 0
    +    THEN ws3.web_sales / ws2.web_sales
    +        ELSE NULL END
    +    > CASE WHEN ss2.store_sales > 0
    +    THEN ss3.store_sales / ss2.store_sales
    +      ELSE NULL END
    +ORDER BY ss1.ca_county
    --- End diff --
    
    Add a newline at the end of the file.



[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13188#issuecomment-220241617
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/58844/
    Test PASSed.



[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13188#issuecomment-220478320
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/58909/
    Test PASSed.



[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13188#issuecomment-220456361
  
    **[Test build #58892 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58892/consoleFull)** for PR 13188 at commit [`15ec1fa`](https://github.com/apache/spark/commit/15ec1fab1ad7c4b1998f21a7592f867d1df1a59a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13188#issuecomment-220230908
  
    **[Test build #58844 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58844/consoleFull)** for PR 13188 at commit [`e584575`](https://github.com/apache/spark/commit/e584575bb786e77b7ea1d6de3f80ec556011d291).



[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13188#discussion_r63968968
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/tpcds/queries/q95.sql ---
    @@ -0,0 +1,28 @@
    +WITH ws_wh AS
    +(SELECT
    +    ws1.ws_order_number,
    +    ws1.ws_warehouse_sk wh1,
    +    ws2.ws_warehouse_sk wh2
    +  FROM web_sales ws1, web_sales ws2
    +  WHERE ws1.ws_order_number = ws2.ws_order_number
    +    AND ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk)
    +SELECT count(DISTINCT ws_order_number) AS ` ORDER count`
    +, sum(ws_ext_ship_cost) AS `total shipping COST `
    +, sum(ws_net_profit) AS `total net profit`
    +FROM
    +web_sales ws1, date_dim, customer_address, web_site
    +WHERE
    +d_date BETWEEN '1999-02-01' AND
    +date_add( CAST ('1999-02-01' AS DATE ), 60)
    +AND ws1.ws_ship_date_sk = d_date_sk
    --- End diff --
    
    indents?



[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13188#issuecomment-220478261
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13188#issuecomment-220503070
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/58922/
    Test PASSed.



[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/13188#issuecomment-220505375
  
    Could you remove ss_max? Otherwise LGTM.



[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by sameeragarwal <gi...@git.apache.org>.

Github user sameeragarwal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13188#discussion_r63940722
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/tpcds/TPCDSQueryBenchmark.scala ---
    @@ -0,0 +1,106 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources.parquet.tpcds
    +
    +import java.io.File
    +
    +import org.apache.spark.{SparkConf, SparkContext}
    +import org.apache.spark.sql.SQLContext
    +import org.apache.spark.sql.catalyst.TableIdentifier
    +import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
    +import org.apache.spark.sql.catalyst.util._
    +import org.apache.spark.sql.internal.SQLConf
    +import org.apache.spark.util.Benchmark
    +
    +/**
    + * Benchmark to measure TPCDS query performance.
    + * To run this:
    + *  spark-submit --class <this class> --jars <spark sql test jar>
    + */
    +object TPCDSQueryBenchmark {
    +  val conf = new SparkConf()
    +  conf.set("spark.sql.parquet.compression.codec", "snappy")
    +  conf.set("spark.sql.shuffle.partitions", "4")
    +  conf.set("spark.driver.memory", "3g")
    +  conf.set("spark.executor.memory", "3g")
    +  conf.set("spark.sql.autoBroadcastJoinThreshold", (20 * 1024 * 1024).toString)
    +
    +  val sc = new SparkContext("local[1]", "test-sql-context", conf)
    +  val sqlContext = new SQLContext(sc)
    --- End diff --
    
    Yes, for sure. Thanks!



[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13188#discussion_r63992224
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/tpcds/TPCDSQueryBenchmark.scala ---
    @@ -0,0 +1,132 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.benchmark.tpcds
    +
    +import java.io.File
    +
    +import org.apache.spark.SparkConf
    +import org.apache.spark.sql.SparkSession
    +import org.apache.spark.sql.catalyst.TableIdentifier
    +import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
    +import org.apache.spark.sql.catalyst.expressions.SubqueryExpression
    +import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    +import org.apache.spark.sql.catalyst.util._
    +import org.apache.spark.sql.internal.SQLConf
    +import org.apache.spark.util.Benchmark
    +
    +/**
    + * Benchmark to measure TPCDS query performance.
    + * To run this:
    + *  spark-submit --class <this class> --jars <spark sql test jar>
    + */
    +object TPCDSQueryBenchmark {
    +  val conf =
    +    new SparkConf()
    +      .setMaster("local[1]")
    +      .setAppName("test-sql-context")
    +      .set("spark.sql.parquet.compression.codec", "snappy")
    +      .set("spark.sql.shuffle.partitions", "4")
    +      .set("spark.driver.memory", "3g")
    +      .set("spark.executor.memory", "3g")
    +      .set("spark.sql.autoBroadcastJoinThreshold", (20 * 1024 * 1024).toString)
    +
    +  val spark = SparkSession.builder.config(conf).getOrCreate()
    +
    +  val tables = Seq("catalog_page", "catalog_returns", "customer", "customer_address",
    +    "customer_demographics", "date_dim", "household_demographics", "inventory", "item",
    +    "promotion", "store", "store_returns", "catalog_sales", "web_sales", "store_sales",
    +    "web_returns", "web_site", "reason", "call_center", "warehouse", "ship_mode", "income_band",
    +    "time_dim", "web_page")
    +
    +  def setupTables(dataLocation: String): Map[String, Long] = {
    +    tables.map { tableName =>
    +      spark.read.parquet(s"$dataLocation/$tableName").createOrReplaceTempView(tableName)
    +      tableName -> spark.table(tableName).count()
    +    }.toMap
    +  }
    +
    +  def tpcdsAll(dataLocation: String, queries: Seq[String]): Unit = {
    +    require(dataLocation.nonEmpty,
    +      "please modify the value of dataLocation to point to your local TPCDS data")
    +    val tableSizes = setupTables(dataLocation)
    +    spark.conf.set(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key, "true")
    +    spark.conf.set(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, "true")
    +    queries.foreach { name =>
    +      val queriesString = fileToString(new File(s"sql/core/src/test/scala/org/apache/spark/sql/" +
    +        s"execution/benchmark/tpcds/queries/$name.sql"))
    +
    +      // This is an indirect hack to estimate the size of each query's input by traversing the
    +      // logical plan and adding up the sizes of all tables that appear in the plan. Note that this
    +      // currently doesn't take WITH subqueries into account which might lead to fairly inaccurate
    +      // per-row processing time for those cases.
    +      val queryRelations = scala.collection.mutable.HashSet[String]()
    +      spark.sql(queriesString).queryExecution.logical.map {
    +        case ur @ UnresolvedRelation(t: TableIdentifier, _) =>
    +          queryRelations.add(t.table)
    +        case lp: LogicalPlan =>
    +          lp.expressions.foreach { _ foreach {
    +            case subquery: SubqueryExpression =>
    +              subquery.plan.foreach {
    +                case ur @ UnresolvedRelation(t: TableIdentifier, _) =>
    +                  queryRelations.add(t.table)
    +                case _ =>
    +              }
    +            case _ =>
    +          }
    +        }
    +        case _ =>
    +      }
    +      val numRows = queryRelations.map(tableSizes.getOrElse(_, 0L)).sum
    +      val benchmark = new Benchmark("TPCDS Snappy", numRows, 5)
    +      benchmark.addCase(name) { i =>
    +        spark.sql(queriesString).collect()
    +      }
    +      benchmark.run()
    +    }
    +  }
    +
    +  def main(args: Array[String]): Unit = {
    +
    +    // List of all TPC-DS queries
    +    val allQueries = Seq(
    +      "q1", "q2", "q3", "q4", "q5", "q6", "q7", "q8", "q9", "q10", "q11",
    +      "q12", "q13", "q14a", "q14b", "q15", "q16", "q17", "q18", "q19", "q20",
    +      "q21", "q22", "q23a", "q23b", "q24a", "q24b", "q25", "q26", "q27", "q28", "q29", "q30",
    +      "q31", "q32", "q33", "q34", "q35", "q36", "q37", "q38", "q39a", "q39b", "q40",
    +      "q41", "q42", "q43", "q44", "q45", "q46", "q47", "q48", "q49", "q50",
    +      "q51", "q52", "q53", "q54", "q55", "q56", "q57", "q58", "q59", "q60",
    +      "q61", "q62", "q63", "q64", "q65", "q66", "q67", "q68", "q69", "q70",
    +      "q71", "q72", "q73", "q74", "q75", "q76", "q77", "q78", "q79", "q80",
    +      "q81", "q82", "q83", "q84", "q85", "q86", "q87", "q88", "q89", "q90",
    +      "q91", "q92", "q93", "q94", "q95", "q96", "q97", "q98", "q99", "ss_max")
    --- End diff --
    
    If we end up keeping it, we should probably have a comment saying it is not part of TPC-DS, but was added from the Impala test kit.
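
Separately, the input-size estimate described in the code comment of the diff above (walk the plan, collect the tables it references, add up their row counts) can be sketched in plain Scala. This is only an illustration under stated assumptions: the `Plan`, `Relation`, `Join`, and `Filter` types are toy stand-ins invented here, not Catalyst classes, and the table sizes are made up.

```scala
object InputSizeEstimate {
  // Toy stand-ins for logical plan nodes; hypothetical, not Spark classes.
  sealed trait Plan
  final case class Relation(table: String) extends Plan
  final case class Join(left: Plan, right: Plan) extends Plan
  final case class Filter(child: Plan) extends Plan

  // Collect the distinct table names referenced anywhere in the plan.
  def relations(plan: Plan): Set[String] = plan match {
    case Relation(t) => Set(t)
    case Join(l, r)  => relations(l) ++ relations(r)
    case Filter(c)   => relations(c)
  }

  // Sum the row counts of every referenced table. Unknown tables count as 0,
  // mirroring tableSizes.getOrElse(_, 0L) in the diff.
  def estimateRows(plan: Plan, tableSizes: Map[String, Long]): Long =
    relations(plan).toSeq.map(t => tableSizes.getOrElse(t, 0L)).sum

  def main(args: Array[String]): Unit = {
    // Made-up row counts for illustration.
    val sizes = Map("store_sales" -> 1000L, "date_dim" -> 100L)
    // store_sales appears twice in the plan but is counted once.
    val plan = Filter(
      Join(Relation("store_sales"), Join(Relation("date_dim"), Relation("store_sales"))))
    println(estimateRows(plan, sizes)) // prints 1100
  }
}
```

One design choice in the sketch: the set of table names is converted to a sequence before being mapped to row counts. Mapping a Scala `Set` produces another `Set`, so two different tables with the same row count would collapse into a single entry; the line in the diff maps the set directly and may undercount in that case.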




[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13188#discussion_r63970987
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/tpcds/queries/q78.sql ---
    @@ -0,0 +1,64 @@
    +WITH ws AS
    +(SELECT
    +    d_year AS ws_sold_year,
    +    ws_item_sk,
    +    ws_bill_customer_sk ws_customer_sk,
    +    sum(ws_quantity) ws_qty,
    +    sum(ws_wholesale_cost) ws_wc,
    +    sum(ws_sales_price) ws_sp
    +  FROM web_sales
    +    LEFT JOIN web_returns ON wr_order_number = ws_order_number AND ws_item_sk = wr_item_sk
    +    JOIN date_dim ON ws_sold_date_sk = d_date_sk
    +  WHERE wr_order_number IS NULL
    +  GROUP BY d_year, ws_item_sk, ws_bill_customer_sk
    +),
    +    cs AS
    +  (SELECT
    +    d_year AS cs_sold_year,
    +    cs_item_sk,
    +    cs_bill_customer_sk cs_customer_sk,
    +    sum(cs_quantity) cs_qty,
    +    sum(cs_wholesale_cost) cs_wc,
    +    sum(cs_sales_price) cs_sp
    +  FROM catalog_sales
    +    LEFT JOIN catalog_returns ON cr_order_number = cs_order_number AND cs_item_sk = cr_item_sk
    +    JOIN date_dim ON cs_sold_date_sk = d_date_sk
    +  WHERE cr_order_number IS NULL
    +  GROUP BY d_year, cs_item_sk, cs_bill_customer_sk
    +  ),
    +    ss AS
    +  (SELECT
    +    d_year AS ss_sold_year,
    +    ss_item_sk,
    +    ss_customer_sk,
    +    sum(ss_quantity) ss_qty,
    +    sum(ss_wholesale_cost) ss_wc,
    +    sum(ss_sales_price) ss_sp
    +  FROM store_sales
    +    LEFT JOIN store_returns ON sr_ticket_number = ss_ticket_number AND ss_item_sk = sr_item_sk
    +    JOIN date_dim ON ss_sold_date_sk = d_date_sk
    +  WHERE sr_ticket_number IS NULL
    +  GROUP BY d_year, ss_item_sk, ss_customer_sk
    +  )
    +SELECT
    +  round(ss_qty / (coalesce(ws_qty + cs_qty, 1)), 2) ratio,
    +  ss_qty store_qty,
    +  ss_wc store_wholesale_cost,
    +  ss_sp store_sales_price,
    +  coalesce(ws_qty, 0) + coalesce(cs_qty, 0) other_chan_qty,
    +  coalesce(ws_wc, 0) + coalesce(cs_wc, 0) other_chan_wholesale_cost,
    +  coalesce(ws_sp, 0) + coalesce(cs_sp, 0) other_chan_sales_price
    +FROM ss
    +  LEFT JOIN ws
    +    ON (ws_sold_year = ss_sold_year AND ws_item_sk = ss_item_sk AND ws_customer_sk = ss_customer_sk)
    +  LEFT JOIN cs
    +    ON (cs_sold_year = ss_sold_year AND cs_item_sk = cs_item_sk AND cs_customer_sk = ss_customer_sk)
    --- End diff --
    
    cs_item_sk => ss_item_sk
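
The note above can be illustrated with a minimal sketch in plain Scala, with no Spark involved. The `SalesKey` type and the sample rows are hypothetical stand-ins for the `ss` and `cs` CTEs. Because `cs_item_sk = cs_item_sk` compares a column with itself, it holds for every non-null key, so the join as written matches more rows than the corrected one.

```scala
// Hypothetical stand-in for one grouped row of the ss / cs CTEs (not from the PR).
final case class SalesKey(year: Int, itemSk: Int, customerSk: Int)

object JoinPredicateSketch {
  // Assumed sample data: one store row, two catalog rows for the same customer and year.
  val ss: Seq[SalesKey] = Seq(SalesKey(2000, 1, 10))
  val cs: Seq[SalesKey] = Seq(SalesKey(2000, 1, 10), SalesKey(2000, 2, 10))

  // Predicate as written in the diff: the item key is compared with itself.
  def asWritten(s: SalesKey, c: SalesKey): Boolean =
    c.year == s.year && c.itemSk == c.itemSk && c.customerSk == s.customerSk

  // Predicate after the suggested fix: the item key is compared across both sides.
  def corrected(s: SalesKey, c: SalesKey): Boolean =
    c.year == s.year && c.itemSk == s.itemSk && c.customerSk == s.customerSk

  // Counts the (ss, cs) pairs that satisfy the given join predicate.
  def matches(p: (SalesKey, SalesKey) => Boolean): Int =
    (for (s <- ss; c <- cs if p(s, c)) yield (s, c)).size
}
```

With this sample data, `JoinPredicateSketch.matches(JoinPredicateSketch.asWritten)` counts both catalog rows, while the corrected predicate counts only the row for the same item.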



[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13188#issuecomment-220507731
  
    **[Test build #58926 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58926/consoleFull)** for PR 13188 at commit [`8386511`](https://github.com/apache/spark/commit/8386511e4544594cfb45ed3c63820965a79ed0dd).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13188#discussion_r63970061
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/tpcds/queries/q14a.sql ---
    @@ -0,0 +1,121 @@
    +WITH cross_items AS
    +(SELECT i_item_sk ss_item_sk
    +  FROM item,
    +    (SELECT
    +      iss.i_brand_id brand_id,
    +      iss.i_class_id class_id,
    +      iss.i_category_id category_id
    +    FROM store_sales, item iss, date_dim d1
    +    WHERE ss_item_sk = iss.i_item_sk
    +      AND ss_sold_date_sk = d1.d_date_sk
    +      AND d1.d_year BETWEEN 1999 AND 1999 + 2
    +    INTERSECT
    +    SELECT
    +      ics.i_brand_id,
    +      ics.i_class_id,
    +      ics.i_category_id
    +    FROM catalog_sales, item ics, date_dim d2
    +    WHERE cs_item_sk = ics.i_item_sk
    +      AND cs_sold_date_sk = d2.d_date_sk
    +      AND d2.d_year BETWEEN 1999 AND 1999 + 2
    +    INTERSECT
    +    SELECT
    +      iws.i_brand_id,
    +      iws.i_class_id,
    +      iws.i_category_id
    +    FROM web_sales, item iws, date_dim d3
    +    WHERE ws_item_sk = iws.i_item_sk
    +      AND ws_sold_date_sk = d3.d_date_sk
    +      AND d3.d_year BETWEEN 1999 AND 1999 + 2) x
    +  WHERE i_brand_id = brand_id
    +    AND i_class_id = class_id
    +    AND i_category_id = category_id
    +),
    +    avg_sales AS
    +  (SELECT avg(quantity * list_price) average_sales
    +  FROM (
    +         SELECT
    +           ss_quantity quantity,
    +           ss_list_price list_price
    +         FROM store_sales, date_dim
    +         WHERE ss_sold_date_sk = d_date_sk
    +           AND d_year BETWEEN 1999 AND 2001
    +         UNION ALL
    +         SELECT
    +           cs_quantity quantity,
    +           cs_list_price list_price
    +         FROM catalog_sales, date_dim
    +         WHERE cs_sold_date_sk = d_date_sk
    +           AND d_year BETWEEN 1999 AND 1999 + 2
    +         UNION ALL
    +         SELECT
    +           ws_quantity quantity,
    +           ws_list_price list_price
    +         FROM web_sales, date_dim
    +         WHERE ws_sold_date_sk = d_date_sk
    +           AND d_year BETWEEN 1999 AND 1999 + 2) x)
    +SELECT
    +  channel,
    +  i_brand_id,
    +  i_class_id,
    +  i_category_id,
    +  sum(sales),
    +  sum(number_sales)
    +FROM (
    +       SELECT
    +         'store' channel,
    +         i_brand_id,
    +         i_class_id,
    +         i_category_id,
    +         sum(ss_quantity * ss_list_price) sales,
    +         count(*) number_sales
    +       FROM store_sales, item, date_dim
    +       WHERE ss_item_sk IN (SELECT ss_item_sk
    +       FROM cross_items)
    +         AND ss_item_sk = i_item_sk
    +         AND ss_sold_date_sk = d_date_sk
    +         AND d_year = 1999 + 2
    +         AND d_moy = 11
    +       GROUP BY i_brand_id, i_class_id, i_category_id
    +       HAVING sum(ss_quantity * ss_list_price) > (SELECT average_sales
    +       FROM avg_sales)
    +       UNION ALL
    +       SELECT
    +         'catalog' channel,
    +         i_brand_id,
    +         i_class_id,
    +         i_category_id,
    +         sum(cs_quantity * cs_list_price) sales,
    +         count(*) number_sales
    +       FROM catalog_sales, item, date_dim
    +       WHERE cs_item_sk IN (SELECT ss_item_sk
    +       FROM cross_items)
    +         AND cs_item_sk = i_item_sk
    +         AND cs_sold_date_sk = d_date_sk
    +         AND d_year = 1999 + 2
    +         AND d_moy = 11
    +       GROUP BY i_brand_id, i_class_id, i_category_id
    +       HAVING sum(cs_quantity * cs_list_price) > (SELECT average_sales
    +       FROM avg_sales)
    --- End diff --
    
    Could you move this FROM to the same line as SELECT?



[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13188#discussion_r63992275
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/tpcds/TPCDSQueryBenchmark.scala ---
    @@ -0,0 +1,132 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.benchmark.tpcds
    +
    +import java.io.File
    +
    +import org.apache.spark.SparkConf
    +import org.apache.spark.sql.SparkSession
    +import org.apache.spark.sql.catalyst.TableIdentifier
    +import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
    +import org.apache.spark.sql.catalyst.expressions.SubqueryExpression
    +import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    +import org.apache.spark.sql.catalyst.util._
    +import org.apache.spark.sql.internal.SQLConf
    +import org.apache.spark.util.Benchmark
    +
    +/**
    + * Benchmark to measure TPCDS query performance.
    + * To run this:
    + *  spark-submit --class <this class> --jars <spark sql test jar>
    + */
    +object TPCDSQueryBenchmark {
    +  val conf =
    +    new SparkConf()
    +      .setMaster("local[1]")
    +      .setAppName("test-sql-context")
    +      .set("spark.sql.parquet.compression.codec", "snappy")
    +      .set("spark.sql.shuffle.partitions", "4")
    +      .set("spark.driver.memory", "3g")
    +      .set("spark.executor.memory", "3g")
    +      .set("spark.sql.autoBroadcastJoinThreshold", (20 * 1024 * 1024).toString)
    +
    +  val spark = SparkSession.builder.config(conf).getOrCreate()
    +
    +  val tables = Seq("catalog_page", "catalog_returns", "customer", "customer_address",
    +    "customer_demographics", "date_dim", "household_demographics", "inventory", "item",
    +    "promotion", "store", "store_returns", "catalog_sales", "web_sales", "store_sales",
    +    "web_returns", "web_site", "reason", "call_center", "warehouse", "ship_mode", "income_band",
    +    "time_dim", "web_page")
    +
    +  def setupTables(dataLocation: String): Map[String, Long] = {
    +    tables.map { tableName =>
    +      spark.read.parquet(s"$dataLocation/$tableName").createOrReplaceTempView(tableName)
    +      tableName -> spark.table(tableName).count()
    +    }.toMap
    +  }
    +
    +  def tpcdsAll(dataLocation: String, queries: Seq[String]): Unit = {
    +    require(dataLocation.nonEmpty,
    +      "please modify the value of dataLocation to point to your local TPCDS data")
    +    val tableSizes = setupTables(dataLocation)
    +    spark.conf.set(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key, "true")
    +    spark.conf.set(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, "true")
    +    queries.foreach { name =>
    +      val queriesString = fileToString(new File(s"sql/core/src/test/scala/org/apache/spark/sql/" +
    --- End diff --
    
    One thing: these files should go into test/resources, and then we can get their path using the `getResource` method on the current thread's classloader.
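
A minimal sketch of that suggestion follows. It assumes the query files end up on the test classpath; the `QueryResources` object, the `loadQuery` helper, and any resource path such as `tpcds/q1.sql` are hypothetical names used only for illustration and are not taken from the PR.

```scala
import scala.io.Source

object QueryResources {
  // Hypothetical helper: looks up `resourcePath` (for example "tpcds/q1.sql", an assumed
  // layout under test/resources) through the current thread's context classloader.
  // Returns None when no classloader is set or the resource is not on the classpath.
  def loadQuery(resourcePath: String): Option[String] =
    Option(Thread.currentThread().getContextClassLoader)
      .flatMap(loader => Option(loader.getResource(resourcePath)))
      .map { url =>
        val source = Source.fromURL(url, "UTF-8")
        try source.mkString finally source.close()
      }
}
```

Returning an `Option` keeps a missing file from surfacing as a `NullPointerException`; a benchmark would probably prefer to fail loudly with the missing path in the message instead.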



[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13188#issuecomment-220478082
  
    **[Test build #58908 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58908/consoleFull)** for PR 13188 at commit [`ab50e5f`](https://github.com/apache/spark/commit/ab50e5fcd435f249c436f7b284c82fba7276558e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13188#discussion_r63968750
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/tpcds/TPCDSQueryBenchmark.scala ---
    @@ -0,0 +1,132 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources.parquet.tpcds
    --- End diff --
    
    Should we move this outside of parquet? 



[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by dongjoon-hyun <gi...@git.apache.org>.

Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13188#discussion_r63826284
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/tpcds/TPCDSQueryBenchmark.scala ---
    @@ -0,0 +1,106 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources.parquet.tpcds
    +
    +import java.io.File
    +
    +import org.apache.spark.{SparkConf, SparkContext}
    +import org.apache.spark.sql.SQLContext
    +import org.apache.spark.sql.catalyst.TableIdentifier
    +import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
    +import org.apache.spark.sql.catalyst.util._
    +import org.apache.spark.sql.internal.SQLConf
    +import org.apache.spark.util.Benchmark
    +
    +/**
    + * Benchmark to measure TPCDS query performance.
    + * To run this:
    + *  spark-submit --class <this class> --jars <spark sql test jar>
    + */
    +object TPCDSQueryBenchmark {
    +  val conf = new SparkConf()
    +  conf.set("spark.sql.parquet.compression.codec", "snappy")
    +  conf.set("spark.sql.shuffle.partitions", "4")
    +  conf.set("spark.driver.memory", "3g")
    +  conf.set("spark.executor.memory", "3g")
    +  conf.set("spark.sql.autoBroadcastJoinThreshold", (20 * 1024 * 1024).toString)
    +
    +  val sc = new SparkContext("local[1]", "test-sql-context", conf)
    +  val sqlContext = new SQLContext(sc)
    --- End diff --
    
    Hi, @sameeragarwal !
    This PR looks great. By the way, could you update lines 36-44 to use the new `SparkSession` builder pattern?



[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13188#issuecomment-220727980
  
    **[Test build #59019 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59019/consoleFull)** for PR 13188 at commit [`1815012`](https://github.com/apache/spark/commit/18150121df04c8f0fd39c2c2fbfbc7fc39ccbd64).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `      i_class IN ('personal', 'portable', 'reference', 'self-help') AND`
      * `        i_class IN ('accessories', 'classical', 'fragrances', 'pants') AND`
      * `  AND i_class IN ('personal', 'portable', 'refernece', 'self-help')`
      * `  AND i_class IN ('accessories', 'classical', 'fragrances', 'pants')`
      * `           i_class IN ('computers', 'stereo', 'football'))`
      * `           i_class IN ('shirts', 'birdal', 'dresses')))`



[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13188#issuecomment-220486609
  
    **[Test build #58920 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58920/consoleFull)** for PR 13188 at commit [`d184942`](https://github.com/apache/spark/commit/d184942cdb7530316bb9a85e24d610580c766086).
     * This patch **fails RAT tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `      i_class IN ('personal', 'portable', 'reference', 'self-help') AND`
      * `        i_class IN ('accessories', 'classical', 'fragrances', 'pants') AND`
      * `  AND i_class IN ('personal', 'portable', 'refernece', 'self-help')`
      * `  AND i_class IN ('accessories', 'classical', 'fragrances', 'pants')`
      * `           i_class IN ('computers', 'stereo', 'football'))`
      * `           i_class IN ('shirts', 'birdal', 'dresses')))`



[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13188#issuecomment-220494664
  
    **[Test build #58926 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58926/consoleFull)** for PR 13188 at commit [`8386511`](https://github.com/apache/spark/commit/8386511e4544594cfb45ed3c63820965a79ed0dd).



[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/13188



[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/13188#issuecomment-220796056
  
    I'm going to cherry-pick this into 2.0 since it has caused confusion and people thought 2.0 couldn't run the queries.




[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13188#issuecomment-220502933
  
    **[Test build #58922 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58922/consoleFull)** for PR 13188 at commit [`8cb52d8`](https://github.com/apache/spark/commit/8cb52d869a113c985d28ac11f28227f9962e9689).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by rxin <gi...@git.apache.org>.

Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13188#discussion_r63973393
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/tpcds/queries/ss_max.sql ---
    @@ -0,0 +1,14 @@
    +SELECT
    --- End diff --
    
    Might be good to keep, since it is a decent benchmark for scan and aggregate performance.




[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13188#issuecomment-220488546
  
    **[Test build #58922 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58922/consoleFull)** for PR 13188 at commit [`8cb52d8`](https://github.com/apache/spark/commit/8cb52d869a113c985d28ac11f28227f9962e9689).



[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13188#issuecomment-220507832
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/13188#issuecomment-220709836
  
    LGTM



[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13188#discussion_r63970138
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/tpcds/queries/q16.sql ---
    @@ -0,0 +1,21 @@
    +SELECT count(DISTINCT cs_order_number) AS ` ORDER count`,
    +sum(cs_ext_ship_cost) AS `total shipping COST `,
    +sum(cs_net_profit) AS `total net profit`
    +FROM
    +catalog_sales cs1, date_dim, customer_address, call_center
    +WHERE
    +d_date BETWEEN '2002-2-01' AND ( CAST ('2002-2-01' AS DATE ) + INTERVAL 60 days)
    --- End diff --
    
    `2002-2-01` -> `2002-02-01`
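
The thread does not say why the zero-padded form is preferred, so the following is only an illustrative sketch in plain Scala using `java.time`, not a statement about how Spark evaluates the predicate. It shows two ways an unpadded month can be fragile: strict ISO-8601 parsing rejects it, and a lexicographic string comparison orders it after padded dates later in the same month. The `DateLiteralSketch` object and its field names are hypothetical.

```scala
import java.time.LocalDate
import scala.util.Try

object DateLiteralSketch {
  val unpadded = "2002-2-01"
  val padded = "2002-02-01"

  // LocalDate.parse uses the strict ISO_LOCAL_DATE format, which requires a two-digit month.
  val unpaddedParses: Boolean = Try(LocalDate.parse(unpadded)).isSuccess
  val paddedParses: Boolean = Try(LocalDate.parse(padded)).isSuccess

  // Upper bound of the BETWEEN range in q16: start date plus 60 days.
  val start: LocalDate = LocalDate.parse(padded)
  val end: LocalDate = start.plusDays(60)

  // Compared as plain strings, the unpadded literal sorts after a mid-February date,
  // because '2' is greater than '0' at the first differing character.
  val sortsAfterMidFebruary: Boolean = unpadded > "2002-02-15"
}
```

If a date column were ever compared against the literal as a string, the unpadded lower bound would exclude February dates that the query intends to include, which is one plausible motivation for the requested change.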



[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13188#issuecomment-220486512
  
    **[Test build #58920 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58920/consoleFull)** for PR 13188 at commit [`d184942`](https://github.com/apache/spark/commit/d184942cdb7530316bb9a85e24d610580c766086).



[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by sameeragarwal <gi...@git.apache.org>.

Github user sameeragarwal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13188#discussion_r63940752
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/tpcds/TPCDSQueryBenchmark.scala ---
    @@ -0,0 +1,106 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources.parquet.tpcds
    +
    +import java.io.File
    +
    +import org.apache.spark.{SparkConf, SparkContext}
    +import org.apache.spark.sql.SQLContext
    +import org.apache.spark.sql.catalyst.TableIdentifier
    +import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
    +import org.apache.spark.sql.catalyst.util._
    +import org.apache.spark.sql.internal.SQLConf
    +import org.apache.spark.util.Benchmark
    +
    +/**
    + * Benchmark to measure TPCDS query performance.
    + * To run this:
    + *  spark-submit --class <this class> --jars <spark sql test jar>
    + */
    +object TPCDSQueryBenchmark {
    +  val conf = new SparkConf()
    +  conf.set("spark.sql.parquet.compression.codec", "snappy")
    +  conf.set("spark.sql.shuffle.partitions", "4")
    +  conf.set("spark.driver.memory", "3g")
    +  conf.set("spark.executor.memory", "3g")
    +  conf.set("spark.sql.autoBroadcastJoinThreshold", (20 * 1024 * 1024).toString)
    +
    +  val sc = new SparkContext("local[1]", "test-sql-context", conf)
    +  val sqlContext = new SQLContext(sc)
    +
    +  // modified q9
    +
    +  val queries = Seq(
    +    "q1", "q2", "q3", "q4", "q5", "q6", "q7", "q8", "q9", "q10", "q11",
    +    "q12", "q13", "q14a", "q14b", "q15", "q16", "q17", "q18", "q19", "q20",
    +    "q21", "q22", "q23a", "q23b", "q24a", "q24b", "q25", "q26", "q27", "q28", "q29", "q30",
    +    "q31", "q32", "q33", "q34", "q35", "q36", "q37", "q38", "q39a", "q39b", "q40",
    +    "q41", "q42", "q43", "q44", "q45", "q46", "q47", "q48", "q49", "q50",
    +    "q51", "q52", "q53", "q54", "q55", "q56", "q57", "q58", "q59", "q60",
    +    "q61", "q62", "q63", "q64", "q65", "q66", "q67", "q68", "q69", "q70",
    +    "q71", "q72", "q73", "q74", "q75", "q76", "q77", "q78", "q79", "q80",
    +    "q81", "q82", "q83", "q84", "q85", "q86", "q87", "q88", "q89", "q90",
    +    "q91", "q92", "q93", "q94", "q95", "q96", "q97", "q98", "q99", "ss_max")
    +    .filter(_ != "q41") // Exclude 41; 72 is long!
    --- End diff --
    
    I think it had a correlated subquery that didn't work before. It works now thanks to your patch :)




[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by sameeragarwal <gi...@git.apache.org>.

Github user sameeragarwal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13188#discussion_r63992090
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/tpcds/TPCDSQueryBenchmark.scala ---
    @@ -0,0 +1,132 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.benchmark.tpcds
    +
    +import java.io.File
    +
    +import org.apache.spark.SparkConf
    +import org.apache.spark.sql.SparkSession
    +import org.apache.spark.sql.catalyst.TableIdentifier
    +import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
    +import org.apache.spark.sql.catalyst.expressions.SubqueryExpression
    +import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    +import org.apache.spark.sql.catalyst.util._
    +import org.apache.spark.sql.internal.SQLConf
    +import org.apache.spark.util.Benchmark
    +
    +/**
    + * Benchmark to measure TPCDS query performance.
    + * To run this:
    + *  spark-submit --class <this class> --jars <spark sql test jar>
    + */
    +object TPCDSQueryBenchmark {
    +  val conf =
    +    new SparkConf()
    +      .setMaster("local[1]")
    +      .setAppName("test-sql-context")
    +      .set("spark.sql.parquet.compression.codec", "snappy")
    +      .set("spark.sql.shuffle.partitions", "4")
    +      .set("spark.driver.memory", "3g")
    +      .set("spark.executor.memory", "3g")
    +      .set("spark.sql.autoBroadcastJoinThreshold", (20 * 1024 * 1024).toString)
    +
    +  val spark = SparkSession.builder.config(conf).getOrCreate()
    +
    +  val tables = Seq("catalog_page", "catalog_returns", "customer", "customer_address",
    +    "customer_demographics", "date_dim", "household_demographics", "inventory", "item",
    +    "promotion", "store", "store_returns", "catalog_sales", "web_sales", "store_sales",
    +    "web_returns", "web_site", "reason", "call_center", "warehouse", "ship_mode", "income_band",
    +    "time_dim", "web_page")
    +
    +  def setupTables(dataLocation: String): Map[String, Long] = {
    +    tables.map { tableName =>
    +      spark.read.parquet(s"$dataLocation/$tableName").createOrReplaceTempView(tableName)
    +      tableName -> spark.table(tableName).count()
    +    }.toMap
    +  }
    +
    +  def tpcdsAll(dataLocation: String, queries: Seq[String]): Unit = {
    +    require(dataLocation.nonEmpty,
    +      "please modify the value of dataLocation to point to your local TPCDS data")
    +    val tableSizes = setupTables(dataLocation)
    +    spark.conf.set(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key, "true")
    +    spark.conf.set(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, "true")
    +    queries.foreach { name =>
    +      val queriesString = fileToString(new File(s"sql/core/src/test/scala/org/apache/spark/sql/" +
    +        s"execution/benchmark/tpcds/queries/$name.sql"))
    +
    +      // This is an indirect hack to estimate the size of each query's input by traversing the
    +      // logical plan and adding up the sizes of all tables that appear in the plan. Note that this
    +      // currently doesn't take WITH subqueries into account which might lead to fairly inaccurate
    +      // per-row processing time for those cases.
    +      val queryRelations = scala.collection.mutable.HashSet[String]()
    +      spark.sql(queriesString).queryExecution.logical.map {
    +        case ur @ UnresolvedRelation(t: TableIdentifier, _) =>
    +          queryRelations.add(t.table)
    +        case lp: LogicalPlan =>
    +          lp.expressions.foreach { _ foreach {
    +            case subquery: SubqueryExpression =>
    +              subquery.plan.foreach {
    +                case ur @ UnresolvedRelation(t: TableIdentifier, _) =>
    +                  queryRelations.add(t.table)
    +                case _ =>
    +              }
    +            case _ =>
    +          }
    +        }
    +        case _ =>
    +      }
    +      val numRows = queryRelations.map(tableSizes.getOrElse(_, 0L)).sum
    +      val benchmark = new Benchmark("TPCDS Snappy", numRows, 5)
    +      benchmark.addCase(name) { i =>
    +        spark.sql(queriesString).collect()
    +      }
    +      benchmark.run()
    +    }
    +  }
    +
    +  def main(args: Array[String]): Unit = {
    +
    +    // List of all TPC-DS queries
    +    val allQueries = Seq(
    +      "q1", "q2", "q3", "q4", "q5", "q6", "q7", "q8", "q9", "q10", "q11",
    +      "q12", "q13", "q14a", "q14b", "q15", "q16", "q17", "q18", "q19", "q20",
    +      "q21", "q22", "q23a", "q23b", "q24a", "q24b", "q25", "q26", "q27", "q28", "q29", "q30",
    +      "q31", "q32", "q33", "q34", "q35", "q36", "q37", "q38", "q39a", "q39b", "q40",
    +      "q41", "q42", "q43", "q44", "q45", "q46", "q47", "q48", "q49", "q50",
    +      "q51", "q52", "q53", "q54", "q55", "q56", "q57", "q58", "q59", "q60",
    +      "q61", "q62", "q63", "q64", "q65", "q66", "q67", "q68", "q69", "q70",
    +      "q71", "q72", "q73", "q74", "q75", "q76", "q77", "q78", "q79", "q80",
    +      "q81", "q82", "q83", "q84", "q85", "q86", "q87", "q88", "q89", "q90",
    +      "q91", "q92", "q93", "q94", "q95", "q96", "q97", "q98", "q99", "ss_max")
    --- End diff --
    
    Reynold suggested that it might be a good idea to keep it around (https://github.com/apache/spark/pull/13188#discussion_r63973393). Let me know if you think otherwise.



[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13188#issuecomment-220463800
  
    **[Test build #58908 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58908/consoleFull)** for PR 13188 at commit [`ab50e5f`](https://github.com/apache/spark/commit/ab50e5fcd435f249c436f7b284c82fba7276558e).



[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13188#issuecomment-220429808
  
    **[Test build #58892 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58892/consoleFull)** for PR 13188 at commit [`15ec1fa`](https://github.com/apache/spark/commit/15ec1fab1ad7c4b1998f21a7592f867d1df1a59a).



[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by hvanhovell <gi...@git.apache.org>.

Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13188#discussion_r63915743
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/tpcds/TPCDSQueryBenchmark.scala ---
    @@ -0,0 +1,106 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources.parquet.tpcds
    +
    +import java.io.File
    +
    +import org.apache.spark.{SparkConf, SparkContext}
    +import org.apache.spark.sql.SQLContext
    +import org.apache.spark.sql.catalyst.TableIdentifier
    +import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
    +import org.apache.spark.sql.catalyst.util._
    +import org.apache.spark.sql.internal.SQLConf
    +import org.apache.spark.util.Benchmark
    +
    +/**
    + * Benchmark to measure TPCDS query performance.
    + * To run this:
    + *  spark-submit --class <this class> --jars <spark sql test jar>
    + */
    +object TPCDSQueryBenchmark {
    +  val conf = new SparkConf()
    +  conf.set("spark.sql.parquet.compression.codec", "snappy")
    +  conf.set("spark.sql.shuffle.partitions", "4")
    +  conf.set("spark.driver.memory", "3g")
    +  conf.set("spark.executor.memory", "3g")
    +  conf.set("spark.sql.autoBroadcastJoinThreshold", (20 * 1024 * 1024).toString)
    +
    +  val sc = new SparkContext("local[1]", "test-sql-context", conf)
    +  val sqlContext = new SQLContext(sc)
    +
    +  // modified q9
    +
    +  val queries = Seq(
    +    "q1", "q2", "q3", "q4", "q5", "q6", "q7", "q8", "q9", "q10", "q11",
    +    "q12", "q13", "q14a", "q14b", "q15", "q16", "q17", "q18", "q19", "q20",
    +    "q21", "q22", "q23a", "q23b", "q24a", "q24b", "q25", "q26", "q27", "q28", "q29", "q30",
    +    "q31", "q32", "q33", "q34", "q35", "q36", "q37", "q38", "q39a", "q39b", "q40",
    +    "q41", "q42", "q43", "q44", "q45", "q46", "q47", "q48", "q49", "q50",
    +    "q51", "q52", "q53", "q54", "q55", "q56", "q57", "q58", "q59", "q60",
    +    "q61", "q62", "q63", "q64", "q65", "q66", "q67", "q68", "q69", "q70",
    +    "q71", "q72", "q73", "q74", "q75", "q76", "q77", "q78", "q79", "q80",
    +    "q81", "q82", "q83", "q84", "q85", "q86", "q87", "q88", "q89", "q90",
    +    "q91", "q92", "q93", "q94", "q95", "q96", "q97", "q98", "q99", "ss_max")
    +    .filter(_ != "q41") // Exclude 41; 72 is long!
    --- End diff --
    
    Off-topic: What is wrong with q41?



[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by sameeragarwal <gi...@git.apache.org>.

Github user sameeragarwal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13188#discussion_r63993020
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/tpcds/TPCDSQueryBenchmark.scala ---
    @@ -0,0 +1,132 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.benchmark.tpcds
    +
    +import java.io.File
    +
    +import org.apache.spark.SparkConf
    +import org.apache.spark.sql.SparkSession
    +import org.apache.spark.sql.catalyst.TableIdentifier
    +import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
    +import org.apache.spark.sql.catalyst.expressions.SubqueryExpression
    +import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    +import org.apache.spark.sql.catalyst.util._
    +import org.apache.spark.sql.internal.SQLConf
    +import org.apache.spark.util.Benchmark
    +
    +/**
    + * Benchmark to measure TPCDS query performance.
    + * To run this:
    + *  spark-submit --class <this class> --jars <spark sql test jar>
    + */
    +object TPCDSQueryBenchmark {
    +  val conf =
    +    new SparkConf()
    +      .setMaster("local[1]")
    +      .setAppName("test-sql-context")
    +      .set("spark.sql.parquet.compression.codec", "snappy")
    +      .set("spark.sql.shuffle.partitions", "4")
    +      .set("spark.driver.memory", "3g")
    +      .set("spark.executor.memory", "3g")
    +      .set("spark.sql.autoBroadcastJoinThreshold", (20 * 1024 * 1024).toString)
    +
    +  val spark = SparkSession.builder.config(conf).getOrCreate()
    +
    +  val tables = Seq("catalog_page", "catalog_returns", "customer", "customer_address",
    +    "customer_demographics", "date_dim", "household_demographics", "inventory", "item",
    +    "promotion", "store", "store_returns", "catalog_sales", "web_sales", "store_sales",
    +    "web_returns", "web_site", "reason", "call_center", "warehouse", "ship_mode", "income_band",
    +    "time_dim", "web_page")
    +
    +  def setupTables(dataLocation: String): Map[String, Long] = {
    +    tables.map { tableName =>
    +      spark.read.parquet(s"$dataLocation/$tableName").createOrReplaceTempView(tableName)
    +      tableName -> spark.table(tableName).count()
    +    }.toMap
    +  }
    +
    +  def tpcdsAll(dataLocation: String, queries: Seq[String]): Unit = {
    +    require(dataLocation.nonEmpty,
    +      "please modify the value of dataLocation to point to your local TPCDS data")
    +    val tableSizes = setupTables(dataLocation)
    +    spark.conf.set(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key, "true")
    +    spark.conf.set(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, "true")
    +    queries.foreach { name =>
    +      val queriesString = fileToString(new File(s"sql/core/src/test/scala/org/apache/spark/sql/" +
    +        s"execution/benchmark/tpcds/queries/$name.sql"))
    +
    +      // This is an indirect hack to estimate the size of each query's input by traversing the
    +      // logical plan and adding up the sizes of all tables that appear in the plan. Note that this
    +      // currently doesn't take WITH subqueries into account which might lead to fairly inaccurate
    +      // per-row processing time for those cases.
    +      val queryRelations = scala.collection.mutable.HashSet[String]()
    +      spark.sql(queriesString).queryExecution.logical.map {
    +        case ur @ UnresolvedRelation(t: TableIdentifier, _) =>
    +          queryRelations.add(t.table)
    +        case lp: LogicalPlan =>
    +          lp.expressions.foreach { _ foreach {
    +            case subquery: SubqueryExpression =>
    +              subquery.plan.foreach {
    +                case ur @ UnresolvedRelation(t: TableIdentifier, _) =>
    +                  queryRelations.add(t.table)
    +                case _ =>
    +              }
    +            case _ =>
    +          }
    +        }
    +        case _ =>
    +      }
    +      val numRows = queryRelations.map(tableSizes.getOrElse(_, 0L)).sum
    +      val benchmark = new Benchmark("TPCDS Snappy", numRows, 5)
    +      benchmark.addCase(name) { i =>
    +        spark.sql(queriesString).collect()
    +      }
    +      benchmark.run()
    +    }
    +  }
    +
    +  def main(args: Array[String]): Unit = {
    +
    +    // List of all TPC-DS queries
    +    val allQueries = Seq(
    +      "q1", "q2", "q3", "q4", "q5", "q6", "q7", "q8", "q9", "q10", "q11",
    +      "q12", "q13", "q14a", "q14b", "q15", "q16", "q17", "q18", "q19", "q20",
    +      "q21", "q22", "q23a", "q23b", "q24a", "q24b", "q25", "q26", "q27", "q28", "q29", "q30",
    +      "q31", "q32", "q33", "q34", "q35", "q36", "q37", "q38", "q39a", "q39b", "q40",
    +      "q41", "q42", "q43", "q44", "q45", "q46", "q47", "q48", "q49", "q50",
    +      "q51", "q52", "q53", "q54", "q55", "q56", "q57", "q58", "q59", "q60",
    +      "q61", "q62", "q63", "q64", "q65", "q66", "q67", "q68", "q69", "q70",
    +      "q71", "q72", "q73", "q74", "q75", "q76", "q77", "q78", "q79", "q80",
    +      "q81", "q82", "q83", "q84", "q85", "q86", "q87", "q88", "q89", "q90",
    +      "q91", "q92", "q93", "q94", "q95", "q96", "q97", "q98", "q99", "ss_max")
    --- End diff --
    
    Oh, I think that's what davies meant. I'll remove `ss_max` from `allQueries`, as it's already part of `commonQueries` below. I'll also rename `allQueries` -> `allTpcdsQueries` and `commonQueries` -> `impalaKitQueries` so that the contents of each query set are more obvious.
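
A rough sketch of how the two renamed query sets might look after that change. This is not the PR's code: `QuerySetsSketch` is a hypothetical wrapper, the generated list only mirrors the query names quoted in the diff above, and the contents of `impalaKitQueries` other than `ss_max` are not shown in this thread, so they are left out here.

```scala
object QuerySetsSketch {
  // Query numbers that appear as "a"/"b" variants in the list quoted in the diff.
  private val variantQueries = Set(14, 23, 24, 39)

  // TPC-DS query names only; "ss_max" is deliberately left out of this set.
  val allTpcdsQueries: Seq[String] = (1 to 99).flatMap { n =>
    if (variantQueries.contains(n)) Seq(s"q${n}a", s"q${n}b") else Seq(s"q$n")
  }

  // Only "ss_max" is confirmed by the comment; the remaining entries of this
  // set are not visible in this excerpt and are intentionally omitted.
  val impalaKitQueries: Seq[String] = Seq("ss_max")

  def main(args: Array[String]): Unit = {
    println(allTpcdsQueries.size)
    println(allTpcdsQueries.contains("ss_max"))
  }
}
```

Keeping `ss_max` in only one of the two sets avoids running it twice when both sets are benchmarked.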



[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13188#issuecomment-220702772
  
    **[Test build #59019 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59019/consoleFull)** for PR 13188 at commit [`1815012`](https://github.com/apache/spark/commit/18150121df04c8f0fd39c2c2fbfbc7fc39ccbd64).



[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13188#issuecomment-220478319
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13188#discussion_r63985324
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/tpcds/TPCDSQueryBenchmark.scala ---
    @@ -0,0 +1,132 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.benchmark.tpcds
    +
    +import java.io.File
    +
    +import org.apache.spark.SparkConf
    +import org.apache.spark.sql.SparkSession
    +import org.apache.spark.sql.catalyst.TableIdentifier
    +import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
    +import org.apache.spark.sql.catalyst.expressions.SubqueryExpression
    +import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    +import org.apache.spark.sql.catalyst.util._
    +import org.apache.spark.sql.internal.SQLConf
    +import org.apache.spark.util.Benchmark
    +
    +/**
    + * Benchmark to measure TPCDS query performance.
    + * To run this:
    + *  spark-submit --class <this class> --jars <spark sql test jar>
    + */
    +object TPCDSQueryBenchmark {
    +  val conf =
    +    new SparkConf()
    +      .setMaster("local[1]")
    +      .setAppName("test-sql-context")
    +      .set("spark.sql.parquet.compression.codec", "snappy")
    +      .set("spark.sql.shuffle.partitions", "4")
    +      .set("spark.driver.memory", "3g")
    +      .set("spark.executor.memory", "3g")
    +      .set("spark.sql.autoBroadcastJoinThreshold", (20 * 1024 * 1024).toString)
    +
    +  val spark = SparkSession.builder.config(conf).getOrCreate()
    +
    +  val tables = Seq("catalog_page", "catalog_returns", "customer", "customer_address",
    +    "customer_demographics", "date_dim", "household_demographics", "inventory", "item",
    +    "promotion", "store", "store_returns", "catalog_sales", "web_sales", "store_sales",
    +    "web_returns", "web_site", "reason", "call_center", "warehouse", "ship_mode", "income_band",
    +    "time_dim", "web_page")
    +
    +  def setupTables(dataLocation: String): Map[String, Long] = {
    +    tables.map { tableName =>
    +      spark.read.parquet(s"$dataLocation/$tableName").createOrReplaceTempView(tableName)
    +      tableName -> spark.table(tableName).count()
    +    }.toMap
    +  }
    +
    +  def tpcdsAll(dataLocation: String, queries: Seq[String]): Unit = {
    +    require(dataLocation.nonEmpty,
    +      "please modify the value of dataLocation to point to your local TPCDS data")
    +    val tableSizes = setupTables(dataLocation)
    +    spark.conf.set(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key, "true")
    +    spark.conf.set(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, "true")
    +    queries.foreach { name =>
    +      val queriesString = fileToString(new File(s"sql/core/src/test/scala/org/apache/spark/sql/" +
    +        s"execution/benchmark/tpcds/queries/$name.sql"))
    +
    +      // This is an indirect hack to estimate the size of each query's input by traversing the
    +      // logical plan and adding up the sizes of all tables that appear in the plan. Note that this
    +      // currently doesn't take WITH subqueries into account which might lead to fairly inaccurate
    +      // per-row processing time for those cases.
    +      val queryRelations = scala.collection.mutable.HashSet[String]()
    +      spark.sql(queriesString).queryExecution.logical.map {
    +        case ur @ UnresolvedRelation(t: TableIdentifier, _) =>
    +          queryRelations.add(t.table)
    +        case lp: LogicalPlan =>
    +          lp.expressions.foreach { _ foreach {
    +            case subquery: SubqueryExpression =>
    +              subquery.plan.foreach {
    +                case ur @ UnresolvedRelation(t: TableIdentifier, _) =>
    +                  queryRelations.add(t.table)
    +                case _ =>
    +              }
    +            case _ =>
    +          }
    +        }
    +        case _ =>
    +      }
    +      val numRows = queryRelations.map(tableSizes.getOrElse(_, 0L)).sum
    +      val benchmark = new Benchmark("TPCDS Snappy", numRows, 5)
    +      benchmark.addCase(name) { i =>
    +        spark.sql(queriesString).collect()
    +      }
    +      benchmark.run()
    +    }
    +  }
    +
    +  def main(args: Array[String]): Unit = {
    +
    +    // List of all TPC-DS queries
    +    val allQueries = Seq(
    +      "q1", "q2", "q3", "q4", "q5", "q6", "q7", "q8", "q9", "q10", "q11",
    +      "q12", "q13", "q14a", "q14b", "q15", "q16", "q17", "q18", "q19", "q20",
    +      "q21", "q22", "q23a", "q23b", "q24a", "q24b", "q25", "q26", "q27", "q28", "q29", "q30",
    +      "q31", "q32", "q33", "q34", "q35", "q36", "q37", "q38", "q39a", "q39b", "q40",
    +      "q41", "q42", "q43", "q44", "q45", "q46", "q47", "q48", "q49", "q50",
    +      "q51", "q52", "q53", "q54", "q55", "q56", "q57", "q58", "q59", "q60",
    +      "q61", "q62", "q63", "q64", "q65", "q66", "q67", "q68", "q69", "q70",
    +      "q71", "q72", "q73", "q74", "q75", "q76", "q77", "q78", "q79", "q80",
    +      "q81", "q82", "q83", "q84", "q85", "q86", "q87", "q88", "q89", "q90",
    +      "q91", "q92", "q93", "q94", "q95", "q96", "q97", "q98", "q99", "ss_max")
    --- End diff --
    
    Could you remove ss_max?



[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13188#discussion_r63971900
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/tpcds/TPCDSQueryBenchmark.scala ---
    @@ -0,0 +1,132 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.datasources.parquet.tpcds
    +
    +import java.io.File
    +
    +import org.apache.spark.SparkConf
    +import org.apache.spark.sql.SparkSession
    +import org.apache.spark.sql.catalyst.TableIdentifier
    +import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
    +import org.apache.spark.sql.catalyst.expressions.SubqueryExpression
    +import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    +import org.apache.spark.sql.catalyst.util._
    +import org.apache.spark.sql.internal.SQLConf
    +import org.apache.spark.util.Benchmark
    +
    +/**
    + * Benchmark to measure TPCDS query performance.
    + * To run this:
    + *  spark-submit --class <this class> --jars <spark sql test jar>
    + */
    +object TPCDSQueryBenchmark {
    +  val conf =
    +    new SparkConf()
    +      .setMaster("local[1]")
    +      .setAppName("test-sql-context")
    +      .set("spark.sql.parquet.compression.codec", "snappy")
    +      .set("spark.sql.shuffle.partitions", "4")
    +      .set("spark.driver.memory", "3g")
    +      .set("spark.executor.memory", "3g")
    +      .set("spark.sql.autoBroadcastJoinThreshold", (20 * 1024 * 1024).toString)
    +
    +  val spark = SparkSession.builder.config(conf).getOrCreate()
    +
    +  val tables = Seq("catalog_page", "catalog_returns", "customer", "customer_address",
    +    "customer_demographics", "date_dim", "household_demographics", "inventory", "item",
    +    "promotion", "store", "store_returns", "catalog_sales", "web_sales", "store_sales",
    +    "web_returns", "web_site", "reason", "call_center", "warehouse", "ship_mode", "income_band",
    +    "time_dim", "web_page")
    +
    +  def setupTables(dataLocation: String): Map[String, Long] = {
    +    tables.map { tableName =>
    +      spark.read.parquet(s"$dataLocation/$tableName").createOrReplaceTempView(tableName)
    +      tableName -> spark.table(tableName).count()
    +    }.toMap
    +  }
    +
    +  def tpcdsAll(dataLocation: String, queries: Seq[String]): Unit = {
    +    require(dataLocation.nonEmpty,
    +      "please modify the value of dataLocation to point to your local TPCDS data")
    +    val tableSizes = setupTables(dataLocation)
    +    spark.conf.set(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key, "true")
    +    spark.conf.set(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, "true")
    +    queries.foreach { name =>
    +      val queriesString = fileToString(new File(s"sql/core/src/test/scala/org/apache/spark/sql/" +
    +        s"execution/datasources/parquet/tpcds/queries/$name.sql"))
    +
    +      // This is an indirect hack to estimate the size of each query's input by traversing the
    +      // logical plan and adding up the sizes of all tables that appear in the plan. Note that this
    +      // currently doesn't take WITH subqueries into account which might lead to fairly inaccurate
    +      // per-row processing time for those cases.
    +      val queryRelations = scala.collection.mutable.HashSet[String]()
    +      spark.sql(queriesString).queryExecution.logical.map {
    +        case ur @ UnresolvedRelation(t: TableIdentifier, _) =>
    +          queryRelations.add(t.table)
    +        case lp: LogicalPlan =>
    +          lp.expressions.foreach { _ foreach {
    +            case subquery: SubqueryExpression =>
    +              subquery.plan.foreach {
    +                case ur @ UnresolvedRelation(t: TableIdentifier, _) =>
    +                  queryRelations.add(t.table)
    +                case _ =>
    +              }
    +            case _ =>
    +          }
    +        }
    +        case _ =>
    +      }
    +      val numRows = queryRelations.map(tableSizes.getOrElse(_, 0L)).sum
    +      val benchmark = new Benchmark("TPCDS Snappy (scale = 5)", numRows, 1)
    --- End diff --
    
    The scale may not be 5; it shouldn't be hardcoded in the benchmark name.
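
A minimal sketch of one way to keep the scale out of the hardcoded title, assuming the scale factor is supplied by the caller. The `BenchmarkLabel` object, its `label` helper, and the use of the first command-line argument are hypothetical and not part of the patch under review:

```scala
// Hypothetical helper, not part of the patch: builds the benchmark title from an
// explicit scale factor instead of baking "scale = 5" into the string literal.
object BenchmarkLabel {
  def label(codec: String, scaleFactor: Int): String =
    s"TPCDS $codec (scale = $scaleFactor)"

  def main(args: Array[String]): Unit = {
    // Assumption: the scale factor arrives as the first argument; defaults to 1 if absent.
    val scaleFactor = args.headOption.map(_.toInt).getOrElse(1)
    println(label("Snappy", scaleFactor))
  }
}
```

The resulting string could then be passed as the name argument to `new Benchmark(...)`; alternatively, the scale can simply be dropped from the title.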



[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13188#issuecomment-220478265
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/58908/
    Test PASSed.



[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13188#issuecomment-220728197
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13188#issuecomment-220456693
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/58892/
    Test PASSed.



[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/13188#issuecomment-220731956
  
    Merging this into master and 2.0, thanks!



[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13188#issuecomment-220478172
  
    **[Test build #58909 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58909/consoleFull)** for PR 13188 at commit [`85b4ba2`](https://github.com/apache/spark/commit/85b4ba2a77fc8876ef85fb2fd978c085e1c26d14).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13188#issuecomment-220464858
  
    **[Test build #58909 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58909/consoleFull)** for PR 13188 at commit [`85b4ba2`](https://github.com/apache/spark/commit/85b4ba2a77fc8876ef85fb2fd978c085e1c26d14).



[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by sameeragarwal <gi...@git.apache.org>.

Github user sameeragarwal commented on the pull request:

    https://github.com/apache/spark/pull/13188#issuecomment-220488738
  
    Thanks, comments addressed!



[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by sameeragarwal <gi...@git.apache.org>.

Github user sameeragarwal commented on the pull request:

    https://github.com/apache/spark/pull/13188#issuecomment-220464114
  
    cc @davies 



[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13188#issuecomment-220486612
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/58920/
    Test FAILed.



[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13188#issuecomment-220241614
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-15078][SQL] Add all TPCDS 1.4 benchmark...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13188#issuecomment-220456688
  
    Merged build finished. Test PASSed.

