Posted to reviews@spark.apache.org by gengliangwang <gi...@git.apache.org> on 2018/05/23 10:29:32 UTC

[GitHub] spark pull request #21409: [SPARK-24365][SQL] Add Parquet write benchmark

GitHub user gengliangwang opened a pull request:

    https://github.com/apache/spark/pull/21409

    [SPARK-24365][SQL] Add Parquet write benchmark

    ## What changes were proposed in this pull request?
    
    Add a Parquet write benchmark so that it is easier to measure writer performance.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gengliangwang/spark parquetWriteBenchmark

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21409.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21409
    
----
commit 11d612aaa52b7f8ad5006d692e046a0a8c6f410d
Author: Gengliang Wang <ge...@...>
Date:   2018-05-23T09:43:23Z

    Add parquet write benchmark

----


---



[GitHub] spark issue #21409: [SPARK-24365][SQL] Add Parquet write benchmark

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21409
  
    **[Test build #91043 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91043/testReport)** for PR 21409 at commit [`abf3837`](https://github.com/apache/spark/commit/abf38376f6121cb48c6e8203400c4aaf732db0a1).


---



[GitHub] spark issue #21409: [SPARK-24365][SQL] Add Data Source write benchmark

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21409
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21409: [SPARK-24365][SQL] Add Data Source write benchmark

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21409
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #21409: [SPARK-24365][SQL] Add Data Source write benchmar...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21409#discussion_r191297180
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceWriteBenchmark.scala ---
    @@ -0,0 +1,145 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.spark.sql.execution.benchmark
    +
    +import org.apache.spark.SparkConf
    +import org.apache.spark.sql.SparkSession
    +import org.apache.spark.sql.internal.SQLConf
    +import org.apache.spark.util.Benchmark
    +
    +/**
    + * Benchmark to measure data source write performance.
    + * By default it measures 4 data source format: Parquet, ORC, JSON, CSV:
    + *  spark-submit --class <this class> <spark sql test jar>
    + * To measure specified formats, run it with arguments:
    + *  spark-submit --class <this class> <spark sql test jar> format1 [format2] [...]
    + */
    +object DataSourceWriteBenchmark {
    +  val conf = new SparkConf()
    +    .setAppName("DataSourceWriteBenchmark")
    +    .setIfMissing("spark.master", "local[1]")
    +    .set("spark.sql.parquet.compression.codec", "snappy")
    +    .set("spark.sql.orc.compression.codec", "snappy")
    +
    +  val spark = SparkSession.builder.config(conf).getOrCreate()
    +
    +  // Set default configs. Individual cases will change them if necessary.
    +  spark.conf.set(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, "true")
    +
    +  val tempTable = "temp"
    +  val numRows = 1024 * 1024 * 15
    +
    +  def withTempTable(tableNames: String*)(f: => Unit): Unit = {
    +    try f finally tableNames.foreach(spark.catalog.dropTempView)
    +  }
    +
    +  def withTable(tableNames: String*)(f: => Unit): Unit = {
    +    try f finally {
    +      tableNames.foreach { name =>
    +        spark.sql(s"DROP TABLE IF EXISTS $name")
    +      }
    +    }
    +  }
    +
    +  def writeInt(table: String, format: String, benchmark: Benchmark): Unit = {
    +    spark.sql(s"create table $table(c1 INT, c2 STRING) using $format")
    +    benchmark.addCase("Output Single Int Column") { _ =>
    +      spark.sql(s"INSERT overwrite table $table select cast(id as INT) as " +
    +        s"c1, cast(id as STRING) as c2 from $tempTable")
    +    }
    +  }
    +
    +  def writeIntString(table: String, format: String, benchmark: Benchmark): Unit = {
    +    spark.sql(s"create table $table(c1 INT, c2 STRING) using $format")
    +    benchmark.addCase("Output Int and String Column") { _ =>
    +      spark.sql(s"INSERT overwrite table $table select cast(id as INT) as " +
    +        s"c1, cast(id as STRING) as c2 from $tempTable")
    +    }
    +  }
    +
    +  def writePartition(table: String, format: String, benchmark: Benchmark): Unit = {
    +    spark.sql(s"create table $table(p INT, id INT) using $format PARTITIONED BY (p)")
    +    benchmark.addCase("Output Partitions") { _ =>
    +      spark.sql(s"INSERT overwrite table $table select cast(id as INT) as id," +
    +        s" cast(id % 2 as INT) as p from $tempTable")
    +    }
    +  }
    +
    +  def writeBucket(table: String, format: String, benchmark: Benchmark): Unit = {
    +    spark.sql(s"create table $table(c1 INT, c2 INT) using $format CLUSTERED BY (c2) INTO 2 BUCKETS")
    +    benchmark.addCase("Output Buckets") { _ =>
    +      spark.sql(s"INSERT overwrite table $table select cast(id as INT) as " +
    +        s"c1, cast(id as INT) as c2 from $tempTable")
    +    }
    +  }
    +
    +  def main(args: Array[String]): Unit = {
    +    val tableInt = "tableInt"
    +    val tableIntString = "tableIntString"
    +    val tablePartition = "tablePartition"
    +    val tableBucket = "tableBucket"
    +    // If the
    +    val formats: Seq[String] = if (args.isEmpty) {
    +      Seq("Parquet", "ORC", "JSON", "CSV")
    +    } else {
    +      args
    +    }
    +    /*
    +    Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
    +    Parquet writer benchmark:                Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    +    ------------------------------------------------------------------------------------------------
    +    Output Single Int Column                      6054 / 6070          2.6         384.9       1.0X
    +    Output Int and String Column                  5784 / 5800          2.7         367.8       1.0X
    +    Output Partitions                             3891 / 3904          4.0         247.4       1.6X
    +    Output Buckets                                5446 / 5729          2.9         346.2       1.1X
    +
    --- End diff --
    
    To make the results easy to compare, how about benchmarking the 4 cases for each type? e.g.,
    ```
    Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
    Output Single Int Column                Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    ------------------------------------------------------------------------------------------------
    CSV  ...
    JSON  ...
    Parquet ...
    ORC ...
    ```
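
    As a rough sketch of the layout being suggested here (not the code in this PR), the existing helpers could be inverted so that each Benchmark holds one case and adds one `addCase` per format; the helper name and table naming below are made up for illustration, and table cleanup is omitted:
    ```
    def benchmarkCaseAcrossFormats(caseName: String, formats: Seq[String])(
        insertSql: String => String): Unit = {
      val benchmark = new Benchmark(caseName, numRows)
      formats.foreach { format =>
        // One table per format; dropping the tables afterwards is omitted in this sketch.
        val table = s"${format.toLowerCase}Table"
        spark.sql(s"CREATE TABLE $table(c1 INT, c2 STRING) USING $format")
        benchmark.addCase(format) { _ =>
          spark.sql(insertSql(table))
        }
      }
      benchmark.run()
    }

    // Usage, mirroring writeInt:
    // benchmarkCaseAcrossFormats("Output Single Int Column", Seq("CSV", "JSON", "Parquet", "ORC")) { table =>
    //   s"INSERT OVERWRITE TABLE $table SELECT CAST(id AS INT) AS c1, CAST(id AS STRING) AS c2 FROM $tempTable"
    // }
    ```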


---



[GitHub] spark pull request #21409: [SPARK-24365][SQL] Add Data Source write benchmar...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21409#discussion_r191412739
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceWriteBenchmark.scala ---
    @@ -0,0 +1,145 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.spark.sql.execution.benchmark
    +
    +import org.apache.spark.SparkConf
    +import org.apache.spark.sql.SparkSession
    +import org.apache.spark.sql.internal.SQLConf
    +import org.apache.spark.util.Benchmark
    +
    +/**
    + * Benchmark to measure data source write performance.
    + * By default it measures 4 data source format: Parquet, ORC, JSON, CSV:
    + *  spark-submit --class <this class> <spark sql test jar>
    + * To measure specified formats, run it with arguments:
    + *  spark-submit --class <this class> <spark sql test jar> format1 [format2] [...]
    + */
    +object DataSourceWriteBenchmark {
    +  val conf = new SparkConf()
    +    .setAppName("DataSourceWriteBenchmark")
    +    .setIfMissing("spark.master", "local[1]")
    +    .set("spark.sql.parquet.compression.codec", "snappy")
    +    .set("spark.sql.orc.compression.codec", "snappy")
    +
    +  val spark = SparkSession.builder.config(conf).getOrCreate()
    +
    +  // Set default configs. Individual cases will change them if necessary.
    +  spark.conf.set(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, "true")
    +
    +  val tempTable = "temp"
    +  val numRows = 1024 * 1024 * 15
    +
    +  def withTempTable(tableNames: String*)(f: => Unit): Unit = {
    +    try f finally tableNames.foreach(spark.catalog.dropTempView)
    +  }
    +
    +  def withTable(tableNames: String*)(f: => Unit): Unit = {
    +    try f finally {
    +      tableNames.foreach { name =>
    +        spark.sql(s"DROP TABLE IF EXISTS $name")
    +      }
    +    }
    +  }
    +
    +  def writeInt(table: String, format: String, benchmark: Benchmark): Unit = {
    +    spark.sql(s"create table $table(c1 INT, c2 STRING) using $format")
    +    benchmark.addCase("Output Single Int Column") { _ =>
    +      spark.sql(s"INSERT overwrite table $table select cast(id as INT) as " +
    +        s"c1, cast(id as STRING) as c2 from $tempTable")
    +    }
    +  }
    +
    +  def writeIntString(table: String, format: String, benchmark: Benchmark): Unit = {
    +    spark.sql(s"create table $table(c1 INT, c2 STRING) using $format")
    +    benchmark.addCase("Output Int and String Column") { _ =>
    +      spark.sql(s"INSERT overwrite table $table select cast(id as INT) as " +
    +        s"c1, cast(id as STRING) as c2 from $tempTable")
    +    }
    +  }
    +
    +  def writePartition(table: String, format: String, benchmark: Benchmark): Unit = {
    +    spark.sql(s"create table $table(p INT, id INT) using $format PARTITIONED BY (p)")
    +    benchmark.addCase("Output Partitions") { _ =>
    +      spark.sql(s"INSERT overwrite table $table select cast(id as INT) as id," +
    +        s" cast(id % 2 as INT) as p from $tempTable")
    +    }
    +  }
    +
    +  def writeBucket(table: String, format: String, benchmark: Benchmark): Unit = {
    +    spark.sql(s"create table $table(c1 INT, c2 INT) using $format CLUSTERED BY (c2) INTO 2 BUCKETS")
    +    benchmark.addCase("Output Buckets") { _ =>
    +      spark.sql(s"INSERT overwrite table $table select cast(id as INT) as " +
    +        s"c1, cast(id as INT) as c2 from $tempTable")
    +    }
    +  }
    +
    +  def main(args: Array[String]): Unit = {
    +    val tableInt = "tableInt"
    +    val tableIntString = "tableIntString"
    +    val tablePartition = "tablePartition"
    +    val tableBucket = "tableBucket"
    +    // If the
    +    val formats: Seq[String] = if (args.isEmpty) {
    +      Seq("Parquet", "ORC", "JSON", "CSV")
    +    } else {
    +      args
    +    }
    +    /*
    +    Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
    +    Parquet writer benchmark:                Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    +    ------------------------------------------------------------------------------------------------
    +    Output Single Int Column                      6054 / 6070          2.6         384.9       1.0X
    +    Output Int and String Column                  5784 / 5800          2.7         367.8       1.0X
    +    Output Partitions                             3891 / 3904          4.0         247.4       1.6X
    +    Output Buckets                                5446 / 5729          2.9         346.2       1.1X
    +
    --- End diff --
    
    +1, I think it's more useful to compare the different cases within a certain data source, rather than between data sources.


---



[GitHub] spark pull request #21409: [SPARK-24365][SQL] Add Parquet write benchmark

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21409#discussion_r190207779
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteBenchmark.scala ---
    @@ -0,0 +1,128 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.spark.sql.execution.datasources.parquet
    +
    +import java.io.File
    +
    +import scala.util.Try
    +
    +import org.apache.spark.SparkConf
    +import org.apache.spark.sql.SparkSession
    +import org.apache.spark.sql.internal.SQLConf
    +import org.apache.spark.util.{Benchmark, Utils}
    +
    +/**
    + * Benchmark to measure parquet write performance.
    + * To run this:
    + *  spark-submit --class <this class> --jars <spark sql test jar>
    + */
    +object ParquetWriteBenchmark {
    +  val conf = new SparkConf()
    +  conf.set("spark.sql.parquet.compression.codec", "snappy")
    +
    +  val spark = SparkSession.builder
    +    .master("local[1]")
    +    .appName("parquet-write-benchmark")
    +    .config(conf)
    +    .getOrCreate()
    +
    +  // Set default configs. Individual cases will change them if necessary.
    +  spark.conf.set(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, "true")
    +
    +  def withTempPath(f: File => Unit): Unit = {
    +    val path = Utils.createTempDir()
    +    path.delete()
    +    try f(path) finally Utils.deleteRecursively(path)
    +  }
    +
    +  def withTempTable(tableNames: String*)(f: => Unit): Unit = {
    +    try f finally tableNames.foreach(spark.catalog.dropTempView)
    +  }
    +
    +  def withSQLConf(pairs: (String, String)*)(f: => Unit): Unit = {
    +    val (keys, values) = pairs.unzip
    +    val currentValues = keys.map(key => Try(spark.conf.get(key)).toOption)
    +    (keys, values).zipped.foreach(spark.conf.set)
    +    try f finally {
    +      keys.zip(currentValues).foreach {
    +        case (key, Some(value)) => spark.conf.set(key, value)
    +        case (key, None) => spark.conf.unset(key)
    +      }
    +    }
    +  }
    +
    +  def runSQL(name: String, sql: String, values: Int): Unit = {
    +    withTempTable("t1") {
    +      spark.range(values).createOrReplaceTempView("t1")
    +      val benchmark = new Benchmark(name, values)
    +      benchmark.addCase("Parquet Writer") { _ =>
    +        withTempPath { dir =>
    +          spark.sql("select cast(id as INT) as id from t1").write.parquet(dir.getCanonicalPath)
    +        }
    +      }
    +      benchmark.run()
    +    }
    +  }
    +
    +  def intWriteBenchmark(values: Int): Unit = {
    +    runSQL("Output Single Int Column", "select cast(id as INT) as id from t1", values)
    +  }
    +
    +  def intStringScanBenchmark(values: Int): Unit = {
    +    runSQL(name = "Output Int and String Column",
    +      sql = "select cast(id as INT) as c1, cast(id as STRING) as c2 from t1",
    +      values = values)
    +  }
    +
    +  def stringWithNullsScanBenchmark(values: Int, fractionOfNulls: Double): Unit = {
    --- End diff --
    
    we don't need this


---



[GitHub] spark pull request #21409: [SPARK-24365][SQL] Add Data Source write benchmar...

Posted by gengliangwang <gi...@git.apache.org>.
Github user gengliangwang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21409#discussion_r191302688
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceWriteBenchmark.scala ---
    @@ -0,0 +1,145 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.spark.sql.execution.benchmark
    +
    +import org.apache.spark.SparkConf
    +import org.apache.spark.sql.SparkSession
    +import org.apache.spark.sql.internal.SQLConf
    +import org.apache.spark.util.Benchmark
    +
    +/**
    + * Benchmark to measure data source write performance.
    + * By default it measures 4 data source format: Parquet, ORC, JSON, CSV:
    + *  spark-submit --class <this class> <spark sql test jar>
    + * To measure specified formats, run it with arguments:
    + *  spark-submit --class <this class> <spark sql test jar> format1 [format2] [...]
    + */
    +object DataSourceWriteBenchmark {
    +  val conf = new SparkConf()
    +    .setAppName("DataSourceWriteBenchmark")
    +    .setIfMissing("spark.master", "local[1]")
    +    .set("spark.sql.parquet.compression.codec", "snappy")
    +    .set("spark.sql.orc.compression.codec", "snappy")
    +
    +  val spark = SparkSession.builder.config(conf).getOrCreate()
    +
    +  // Set default configs. Individual cases will change them if necessary.
    +  spark.conf.set(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, "true")
    +
    +  val tempTable = "temp"
    +  val numRows = 1024 * 1024 * 15
    +
    +  def withTempTable(tableNames: String*)(f: => Unit): Unit = {
    +    try f finally tableNames.foreach(spark.catalog.dropTempView)
    +  }
    +
    +  def withTable(tableNames: String*)(f: => Unit): Unit = {
    +    try f finally {
    +      tableNames.foreach { name =>
    +        spark.sql(s"DROP TABLE IF EXISTS $name")
    +      }
    +    }
    +  }
    +
    +  def writeInt(table: String, format: String, benchmark: Benchmark): Unit = {
    +    spark.sql(s"create table $table(c1 INT, c2 STRING) using $format")
    --- End diff --
    
    Here I am not sure if we need to compare all numeric types: ByteType, ShortType, IntegerType, LongType, FloatType, DoubleType.
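
    If we did want to cover them, one way (a sketch only, not what this PR does; the SQL type names below are the standard mappings for those Catalyst types) is to parametrize the single-column case over the type:
    ```
    // Sketch: ByteType/ShortType/IntegerType/LongType/FloatType/DoubleType as SQL type names.
    val numericTypes = Seq("TINYINT", "SMALLINT", "INT", "BIGINT", "FLOAT", "DOUBLE")

    def writeNumeric(format: String, benchmark: Benchmark): Unit = {
      numericTypes.foreach { sqlType =>
        val table = s"table${sqlType.toLowerCase}"
        spark.sql(s"CREATE TABLE $table(c1 $sqlType) USING $format")
        benchmark.addCase(s"Output Single $sqlType Column") { _ =>
          spark.sql(s"INSERT OVERWRITE TABLE $table SELECT CAST(id AS $sqlType) AS c1 FROM $tempTable")
        }
      }
    }
    ```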


---



[GitHub] spark issue #21409: [SPARK-24365][SQL] Add Parquet write benchmark

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21409
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3512/
    Test PASSed.


---



[GitHub] spark issue #21409: [SPARK-24365][SQL] Add Parquet write benchmark

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21409
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #21409: [SPARK-24365][SQL] Add Data Source write benchmark

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21409
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91269/
    Test PASSed.


---



[GitHub] spark issue #21409: [SPARK-24365][SQL] Add Parquet write benchmark

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21409
  
    **[Test build #91237 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91237/testReport)** for PR 21409 at commit [`8ffba61`](https://github.com/apache/spark/commit/8ffba61a3ebd6e06eec2fdf03e19a65cb5b40787).


---



[GitHub] spark issue #21409: [SPARK-24365][SQL] Add Data Source write benchmark

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21409
  
    **[Test build #91237 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91237/testReport)** for PR 21409 at commit [`8ffba61`](https://github.com/apache/spark/commit/8ffba61a3ebd6e06eec2fdf03e19a65cb5b40787).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21409: [SPARK-24365][SQL] Add Data Source write benchmark

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21409
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #21409: [SPARK-24365][SQL] Add Data Source write benchmar...

Posted by gengliangwang <gi...@git.apache.org>.
Github user gengliangwang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21409#discussion_r191302141
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceWriteBenchmark.scala ---
    @@ -0,0 +1,145 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.spark.sql.execution.benchmark
    +
    +import org.apache.spark.SparkConf
    +import org.apache.spark.sql.SparkSession
    +import org.apache.spark.sql.internal.SQLConf
    +import org.apache.spark.util.Benchmark
    +
    +/**
    + * Benchmark to measure data source write performance.
    + * By default it measures 4 data source format: Parquet, ORC, JSON, CSV:
    + *  spark-submit --class <this class> <spark sql test jar>
    + * To measure specified formats, run it with arguments:
    + *  spark-submit --class <this class> <spark sql test jar> format1 [format2] [...]
    + */
    +object DataSourceWriteBenchmark {
    +  val conf = new SparkConf()
    +    .setAppName("DataSourceWriteBenchmark")
    +    .setIfMissing("spark.master", "local[1]")
    +    .set("spark.sql.parquet.compression.codec", "snappy")
    +    .set("spark.sql.orc.compression.codec", "snappy")
    +
    +  val spark = SparkSession.builder.config(conf).getOrCreate()
    +
    +  // Set default configs. Individual cases will change them if necessary.
    +  spark.conf.set(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, "true")
    +
    +  val tempTable = "temp"
    +  val numRows = 1024 * 1024 * 15
    +
    +  def withTempTable(tableNames: String*)(f: => Unit): Unit = {
    +    try f finally tableNames.foreach(spark.catalog.dropTempView)
    +  }
    +
    +  def withTable(tableNames: String*)(f: => Unit): Unit = {
    +    try f finally {
    +      tableNames.foreach { name =>
    +        spark.sql(s"DROP TABLE IF EXISTS $name")
    +      }
    +    }
    +  }
    +
    +  def writeInt(table: String, format: String, benchmark: Benchmark): Unit = {
    +    spark.sql(s"create table $table(c1 INT, c2 STRING) using $format")
    +    benchmark.addCase("Output Single Int Column") { _ =>
    +      spark.sql(s"INSERT overwrite table $table select cast(id as INT) as " +
    +        s"c1, cast(id as STRING) as c2 from $tempTable")
    +    }
    +  }
    +
    +  def writeIntString(table: String, format: String, benchmark: Benchmark): Unit = {
    +    spark.sql(s"create table $table(c1 INT, c2 STRING) using $format")
    +    benchmark.addCase("Output Int and String Column") { _ =>
    +      spark.sql(s"INSERT overwrite table $table select cast(id as INT) as " +
    +        s"c1, cast(id as STRING) as c2 from $tempTable")
    +    }
    +  }
    +
    +  def writePartition(table: String, format: String, benchmark: Benchmark): Unit = {
    +    spark.sql(s"create table $table(p INT, id INT) using $format PARTITIONED BY (p)")
    +    benchmark.addCase("Output Partitions") { _ =>
    +      spark.sql(s"INSERT overwrite table $table select cast(id as INT) as id," +
    +        s" cast(id % 2 as INT) as p from $tempTable")
    +    }
    +  }
    +
    +  def writeBucket(table: String, format: String, benchmark: Benchmark): Unit = {
    +    spark.sql(s"create table $table(c1 INT, c2 INT) using $format CLUSTERED BY (c2) INTO 2 BUCKETS")
    +    benchmark.addCase("Output Buckets") { _ =>
    +      spark.sql(s"INSERT overwrite table $table select cast(id as INT) as " +
    +        s"c1, cast(id as INT) as c2 from $tempTable")
    +    }
    +  }
    +
    +  def main(args: Array[String]): Unit = {
    +    val tableInt = "tableInt"
    +    val tableIntString = "tableIntString"
    +    val tablePartition = "tablePartition"
    +    val tableBucket = "tableBucket"
    +    // If the
    +    val formats: Seq[String] = if (args.isEmpty) {
    +      Seq("Parquet", "ORC", "JSON", "CSV")
    +    } else {
    +      args
    +    }
    +    /*
    +    Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
    +    Parquet writer benchmark:                Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    +    ------------------------------------------------------------------------------------------------
    +    Output Single Int Column                      6054 / 6070          2.6         384.9       1.0X
    +    Output Int and String Column                  5784 / 5800          2.7         367.8       1.0X
    +    Output Partitions                             3891 / 3904          4.0         247.4       1.6X
    +    Output Buckets                                5446 / 5729          2.9         346.2       1.1X
    +
    --- End diff --
    
    In my opinion, the write benchmark is mainly for measuring performance before and after code changes, and such changes usually target a certain data source, e.g., Parquet.
    In that case, we can run this benchmark with the format passed as an argument and get straightforward results.
    
    To me, comparing different data sources on each workload is less meaningful. I know we already do that in the read benchmark.
    Let's discuss this and make the two benchmarks consistent.
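
    For example, to measure only the Parquet writer, the run would look something like this (the test jar name below is illustrative):
    ```
    spark-submit --class org.apache.spark.sql.execution.benchmark.DataSourceWriteBenchmark \
      spark-sql_2.11-tests.jar Parquet
    ```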


---



[GitHub] spark issue #21409: [SPARK-24365][SQL] Add Data Source write benchmark

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/21409
  
    thanks, merging to master!


---



[GitHub] spark issue #21409: [SPARK-24365][SQL] Add Data Source write benchmark

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21409
  
    **[Test build #91269 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91269/testReport)** for PR 21409 at commit [`e90fa00`](https://github.com/apache/spark/commit/e90fa00e8963eb985bdd30d9a262c61f6ca1ce61).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21409: [SPARK-24365][SQL] Add Parquet write benchmark

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21409
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91043/
    Test FAILed.


---



[GitHub] spark issue #21409: [SPARK-24365][SQL] Add Parquet write benchmark

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21409
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21409: [SPARK-24365][SQL] Add Parquet write benchmark

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21409
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #21409: [SPARK-24365][SQL] Add Parquet write benchmark

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21409
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21409: [SPARK-24365][SQL] Add Parquet write benchmark

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21409
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3504/
    Test PASSed.


---



[GitHub] spark issue #21409: [SPARK-24365][SQL] Add Parquet write benchmark

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21409
  
    **[Test build #91043 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91043/testReport)** for PR 21409 at commit [`abf3837`](https://github.com/apache/spark/commit/abf38376f6121cb48c6e8203400c4aaf732db0a1).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #21409: [SPARK-24365][SQL] Add Data Source write benchmar...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21409#discussion_r191621006
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceWriteBenchmark.scala ---
    @@ -0,0 +1,145 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.spark.sql.execution.benchmark
    +
    +import org.apache.spark.SparkConf
    +import org.apache.spark.sql.SparkSession
    +import org.apache.spark.sql.internal.SQLConf
    +import org.apache.spark.util.Benchmark
    +
    +/**
    + * Benchmark to measure data source write performance.
    + * By default it measures 4 data source format: Parquet, ORC, JSON, CSV:
    + *  spark-submit --class <this class> <spark sql test jar>
    + * To measure specified formats, run it with arguments:
    + *  spark-submit --class <this class> <spark sql test jar> format1 [format2] [...]
    + */
    +object DataSourceWriteBenchmark {
    +  val conf = new SparkConf()
    +    .setAppName("DataSourceWriteBenchmark")
    +    .setIfMissing("spark.master", "local[1]")
    +    .set("spark.sql.parquet.compression.codec", "snappy")
    +    .set("spark.sql.orc.compression.codec", "snappy")
    +
    +  val spark = SparkSession.builder.config(conf).getOrCreate()
    +
    +  // Set default configs. Individual cases will change them if necessary.
    +  spark.conf.set(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, "true")
    +
    +  val tempTable = "temp"
    +  val numRows = 1024 * 1024 * 15
    +
    +  def withTempTable(tableNames: String*)(f: => Unit): Unit = {
    +    try f finally tableNames.foreach(spark.catalog.dropTempView)
    +  }
    +
    +  def withTable(tableNames: String*)(f: => Unit): Unit = {
    +    try f finally {
    +      tableNames.foreach { name =>
    +        spark.sql(s"DROP TABLE IF EXISTS $name")
    +      }
    +    }
    +  }
    +
    +  def writeInt(table: String, format: String, benchmark: Benchmark): Unit = {
    +    spark.sql(s"create table $table(c1 INT, c2 STRING) using $format")
    +    benchmark.addCase("Output Single Int Column") { _ =>
    +      spark.sql(s"INSERT overwrite table $table select cast(id as INT) as " +
    +        s"c1, cast(id as STRING) as c2 from $tempTable")
    +    }
    +  }
    +
    +  def writeIntString(table: String, format: String, benchmark: Benchmark): Unit = {
    +    spark.sql(s"create table $table(c1 INT, c2 STRING) using $format")
    +    benchmark.addCase("Output Int and String Column") { _ =>
    +      spark.sql(s"INSERT overwrite table $table select cast(id as INT) as " +
    +        s"c1, cast(id as STRING) as c2 from $tempTable")
    +    }
    +  }
    +
    +  def writePartition(table: String, format: String, benchmark: Benchmark): Unit = {
    +    spark.sql(s"create table $table(p INT, id INT) using $format PARTITIONED BY (p)")
    +    benchmark.addCase("Output Partitions") { _ =>
    +      spark.sql(s"INSERT overwrite table $table select cast(id as INT) as id," +
    +        s" cast(id % 2 as INT) as p from $tempTable")
    +    }
    +  }
    +
    +  def writeBucket(table: String, format: String, benchmark: Benchmark): Unit = {
    +    spark.sql(s"create table $table(c1 INT, c2 INT) using $format CLUSTERED BY (c2) INTO 2 BUCKETS")
    +    benchmark.addCase("Output Buckets") { _ =>
    +      spark.sql(s"INSERT overwrite table $table select cast(id as INT) as " +
    +        s"c1, cast(id as INT) as c2 from $tempTable")
    +    }
    +  }
    +
    +  def main(args: Array[String]): Unit = {
    +    val tableInt = "tableInt"
    +    val tableIntString = "tableIntString"
    +    val tablePartition = "tablePartition"
    +    val tableBucket = "tableBucket"
    +    // If the
    +    val formats: Seq[String] = if (args.isEmpty) {
    +      Seq("Parquet", "ORC", "JSON", "CSV")
    +    } else {
    +      args
    +    }
    +    /*
    +    Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
    +    Parquet writer benchmark:                Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    +    ------------------------------------------------------------------------------------------------
    +    Output Single Int Column                      6054 / 6070          2.6         384.9       1.0X
    +    Output Int and String Column                  5784 / 5800          2.7         367.8       1.0X
    +    Output Partitions                             3891 / 3904          4.0         247.4       1.6X
    +    Output Buckets                                5446 / 5729          2.9         346.2       1.1X
    +
    --- End diff --
    
    ok


---



[GitHub] spark issue #21409: [SPARK-24365][SQL] Add Parquet write benchmark

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21409
  
    Merged build finished. Test FAILed.


---



[GitHub] spark pull request #21409: [SPARK-24365][SQL] Add Data Source write benchmar...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/21409


---



[GitHub] spark pull request #21409: [SPARK-24365][SQL] Add Parquet write benchmark

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21409#discussion_r190259448
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteBenchmark.scala ---
    @@ -0,0 +1,132 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.spark.sql.execution.datasources.parquet
    +
    +import java.io.File
    +
    +import scala.util.Try
    +
    +import org.apache.spark.SparkConf
    +import org.apache.spark.sql.SparkSession
    +import org.apache.spark.sql.internal.SQLConf
    +import org.apache.spark.util.{Benchmark, Utils}
    +
    +/**
    + * Benchmark to measure parquet write performance.
    + * To run this:
    + *  spark-submit --class <this class> --jars <spark sql test jar>
    + */
    +object ParquetWriteBenchmark {
    +  val conf = new SparkConf()
    +  conf.set("spark.sql.parquet.compression.codec", "snappy")
    +
    +  val spark = SparkSession.builder
    +    .master("local[1]")
    +    .appName("parquet-write-benchmark")
    +    .config(conf)
    +    .getOrCreate()
    +
    +  // Set default configs. Individual cases will change them if necessary.
    +  spark.conf.set(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, "true")
    +
    +  def withTempPath(f: File => Unit): Unit = {
    +    val path = Utils.createTempDir()
    +    path.delete()
    +    try f(path) finally Utils.deleteRecursively(path)
    +  }
    +
    +  def withTempTable(tableNames: String*)(f: => Unit): Unit = {
    +    try f finally tableNames.foreach(spark.catalog.dropTempView)
    +  }
    +
    +  def withSQLConf(pairs: (String, String)*)(f: => Unit): Unit = {
    +    val (keys, values) = pairs.unzip
    +    val currentValues = keys.map(key => Try(spark.conf.get(key)).toOption)
    +    (keys, values).zipped.foreach(spark.conf.set)
    +    try f finally {
    +      keys.zip(currentValues).foreach {
    +        case (key, Some(value)) => spark.conf.set(key, value)
    +        case (key, None) => spark.conf.unset(key)
    +      }
    +    }
    +  }
    +
    +  def runSQL(name: String, sql: String, values: Int): Unit = {
    +    withTempTable("t1") {
    +      spark.range(values).createOrReplaceTempView("t1")
    +      val benchmark = new Benchmark(name, values)
    +      benchmark.addCase("Parquet Writer") { _ =>
    +        withTempPath { dir =>
    +          spark.sql(sql).write.parquet(dir.getCanonicalPath)
    +        }
    +      }
    +      benchmark.run()
    +    }
    +  }
    +
    +  def intWriteBenchmark(values: Int): Unit = {
    +    /*
    +    Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
    +
    +    Output Single Int Column:                Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    +    ------------------------------------------------------------------------------------------------
    +    Parquet Writer                                2536 / 2610          6.2         161.3       1.0X
    +    */
    +    runSQL("Output Single Int Column", "select cast(id as INT) as id from t1", values)
    +  }
    +
    +  def intStringWriteBenchmark(values: Int): Unit = {
    +    /*
    +    Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
    +
    +    Output Int and String Column:            Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    +    ------------------------------------------------------------------------------------------------
    +    Parquet Writer                                4644 / 4673          2.3         442.9       1.0X
    +    */
    +    runSQL(name = "Output Int and String Column",
    +      sql = "select cast(id as INT) as c1, cast(id as STRING) as c2 from t1",
    +      values = values)
    +  }
    +
    +  def partitionTableWriteBenchmark(values: Int): Unit = {
    +    withTempTable("t1") {
    +      spark.range(values).createOrReplaceTempView("t1")
    +      val benchmark = new Benchmark("Partitioned Table", values)
    +      benchmark.addCase("Parquet Writer") { _ =>
    +        withTempPath { dir =>
    +          spark.sql("select id % 2 as p, cast(id as INT) as id from t1")
    +            .write.partitionBy("p").parquet(dir.getCanonicalPath)
    +        }
    +      }
    +
    +      /*
    +      Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
    +
    +      Partitioned Table:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    +      ---------------------------------------------------------------------------------------------
    +      Parquet Writer                             4163 / 4173          3.8         264.7       1.0X
    +      */
    +      benchmark.run()
    +    }
    +  }
    +
    +  def main(args: Array[String]): Unit = {
    +    intWriteBenchmark(1024 * 1024 * 15)
    +    intStringWriteBenchmark(1024 * 1024 * 10)
    +    partitionTableWriteBenchmark(1024 * 1024 * 15)
    --- End diff --
    
    we should also test bucket.
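
    A minimal sketch of such a case for this version of the benchmark (not the code that was eventually merged, which moves to `CREATE TABLE ... CLUSTERED BY`, as shown later in the thread); note that DataFrameWriter bucketing requires saveAsTable, so this writes a managed table rather than a temp path:
    ```
    def bucketedTableWriteBenchmark(values: Int): Unit = {
      withTempTable("t1") {
        spark.range(values).createOrReplaceTempView("t1")
        val benchmark = new Benchmark("Bucketed Table", values)
        benchmark.addCase("Parquet Writer") { _ =>
          // Drop the managed table so every iteration starts from scratch.
          spark.sql("DROP TABLE IF EXISTS bucketed_table")
          spark.sql("select cast(id as INT) as c1, cast(id as INT) as c2 from t1")
            .write.bucketBy(2, "c2").format("parquet").saveAsTable("bucketed_table")
        }
        benchmark.run()
        spark.sql("DROP TABLE IF EXISTS bucketed_table")
      }
    }
    ```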


---



[GitHub] spark issue #21409: [SPARK-24365][SQL] Add Parquet write benchmark

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21409
  
    **[Test build #91030 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91030/testReport)** for PR 21409 at commit [`11d612a`](https://github.com/apache/spark/commit/11d612aaa52b7f8ad5006d692e046a0a8c6f410d).


---



[GitHub] spark issue #21409: [SPARK-24365][SQL] Add Data Source write benchmark

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21409
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91237/
    Test PASSed.


---



[GitHub] spark pull request #21409: [SPARK-24365][SQL] Add Data Source write benchmar...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21409#discussion_r191412111
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceWriteBenchmark.scala ---
    @@ -0,0 +1,145 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.spark.sql.execution.benchmark
    +
    +import org.apache.spark.SparkConf
    +import org.apache.spark.sql.SparkSession
    +import org.apache.spark.sql.internal.SQLConf
    +import org.apache.spark.util.Benchmark
    +
    +/**
    + * Benchmark to measure data source write performance.
    + * By default it measures 4 data source format: Parquet, ORC, JSON, CSV:
    + *  spark-submit --class <this class> <spark sql test jar>
    + * To measure specified formats, run it with arguments:
    + *  spark-submit --class <this class> <spark sql test jar> format1 [format2] [...]
    + */
    +object DataSourceWriteBenchmark {
    +  val conf = new SparkConf()
    +    .setAppName("DataSourceWriteBenchmark")
    +    .setIfMissing("spark.master", "local[1]")
    +    .set("spark.sql.parquet.compression.codec", "snappy")
    +    .set("spark.sql.orc.compression.codec", "snappy")
    +
    +  val spark = SparkSession.builder.config(conf).getOrCreate()
    +
    +  // Set default configs. Individual cases will change them if necessary.
    +  spark.conf.set(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, "true")
    +
    +  val tempTable = "temp"
    +  val numRows = 1024 * 1024 * 15
    +
    +  def withTempTable(tableNames: String*)(f: => Unit): Unit = {
    +    try f finally tableNames.foreach(spark.catalog.dropTempView)
    +  }
    +
    +  def withTable(tableNames: String*)(f: => Unit): Unit = {
    +    try f finally {
    +      tableNames.foreach { name =>
    +        spark.sql(s"DROP TABLE IF EXISTS $name")
    +      }
    +    }
    +  }
    +
    +  def writeInt(table: String, format: String, benchmark: Benchmark): Unit = {
    +    spark.sql(s"create table $table(c1 INT, c2 STRING) using $format")
    +    benchmark.addCase("Output Single Int Column") { _ =>
    +      spark.sql(s"INSERT overwrite table $table select cast(id as INT) as " +
    +        s"c1, cast(id as STRING) as c2 from $tempTable")
    +    }
    +  }
    +
    +  def writeIntString(table: String, format: String, benchmark: Benchmark): Unit = {
    +    spark.sql(s"create table $table(c1 INT, c2 STRING) using $format")
    +    benchmark.addCase("Output Int and String Column") { _ =>
    +      spark.sql(s"INSERT overwrite table $table select cast(id as INT) as " +
    +        s"c1, cast(id as STRING) as c2 from $tempTable")
    +    }
    +  }
    +
    +  def writePartition(table: String, format: String, benchmark: Benchmark): Unit = {
    +    spark.sql(s"create table $table(p INT, id INT) using $format PARTITIONED BY (p)")
    +    benchmark.addCase("Output Partitions") { _ =>
    +      spark.sql(s"INSERT overwrite table $table select cast(id as INT) as id," +
    +        s" cast(id % 2 as INT) as p from $tempTable")
    +    }
    +  }
    +
    +  def writeBucket(table: String, format: String, benchmark: Benchmark): Unit = {
    +    spark.sql(s"create table $table(c1 INT, c2 INT) using $format CLUSTERED BY (c2) INTO 2 BUCKETS")
    +    benchmark.addCase("Output Buckets") { _ =>
    +      spark.sql(s"INSERT overwrite table $table select cast(id as INT) as " +
    +        s"c1, cast(id as INT) as c2 from $tempTable")
    +    }
    +  }
    +
    +  def main(args: Array[String]): Unit = {
    +    val tableInt = "tableInt"
    +    val tableIntString = "tableIntString"
    +    val tablePartition = "tablePartition"
    +    val tableBucket = "tableBucket"
    +    // If the
    --- End diff --
    
    ?


---



[GitHub] spark pull request #21409: [SPARK-24365][SQL] Add Data Source write benchmar...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21409#discussion_r191411987
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceWriteBenchmark.scala ---
    @@ -0,0 +1,145 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.spark.sql.execution.benchmark
    +
    +import org.apache.spark.SparkConf
    +import org.apache.spark.sql.SparkSession
    +import org.apache.spark.sql.internal.SQLConf
    +import org.apache.spark.util.Benchmark
    +
    +/**
    + * Benchmark to measure data source write performance.
    + * By default it measures 4 data source format: Parquet, ORC, JSON, CSV:
    + *  spark-submit --class <this class> <spark sql test jar>
    + * To measure specified formats, run it with arguments:
    + *  spark-submit --class <this class> <spark sql test jar> format1 [format2] [...]
    + */
    +object DataSourceWriteBenchmark {
    +  val conf = new SparkConf()
    +    .setAppName("DataSourceWriteBenchmark")
    +    .setIfMissing("spark.master", "local[1]")
    +    .set("spark.sql.parquet.compression.codec", "snappy")
    +    .set("spark.sql.orc.compression.codec", "snappy")
    +
    +  val spark = SparkSession.builder.config(conf).getOrCreate()
    +
    +  // Set default configs. Individual cases will change them if necessary.
    +  spark.conf.set(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, "true")
    +
    +  val tempTable = "temp"
    +  val numRows = 1024 * 1024 * 15
    +
    +  def withTempTable(tableNames: String*)(f: => Unit): Unit = {
    +    try f finally tableNames.foreach(spark.catalog.dropTempView)
    +  }
    +
    +  def withTable(tableNames: String*)(f: => Unit): Unit = {
    +    try f finally {
    +      tableNames.foreach { name =>
    +        spark.sql(s"DROP TABLE IF EXISTS $name")
    +      }
    +    }
    +  }
    +
    +  def writeInt(table: String, format: String, benchmark: Benchmark): Unit = {
    +    spark.sql(s"create table $table(c1 INT, c2 STRING) using $format")
    +    benchmark.addCase("Output Single Int Column") { _ =>
    +      spark.sql(s"INSERT overwrite table $table select cast(id as INT) as " +
    --- End diff --
    
    nit: we can use a multiline string to format the SQL
    ```
    s"""
      |INSERT OVERWRITE TABLE $table
      |SELECT ...
    """.stripMargin
    ```
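    
    For example, `writeInt` could then look roughly like this (just a sketch; `table`, `format` and `$tempTable` are the names from the quoted diff):
    ```
    def writeInt(table: String, format: String, benchmark: Benchmark): Unit = {
      spark.sql(s"CREATE TABLE $table(c1 INT, c2 STRING) USING $format")
      benchmark.addCase("Output Single Int Column") { _ =>
        spark.sql(
          s"""
            |INSERT OVERWRITE TABLE $table
            |SELECT CAST(id AS INT) AS c1, CAST(id AS STRING) AS c2
            |FROM $tempTable
          """.stripMargin)
      }
    }
    ```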


---



[GitHub] spark issue #21409: [SPARK-24365][SQL] Add Data Source write benchmark

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21409
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3651/
    Test PASSed.


---



[GitHub] spark pull request #21409: [SPARK-24365][SQL] Add Data Source write benchmar...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21409#discussion_r191411473
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceWriteBenchmark.scala ---
    @@ -0,0 +1,145 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.spark.sql.execution.benchmark
    +
    +import org.apache.spark.SparkConf
    +import org.apache.spark.sql.SparkSession
    +import org.apache.spark.sql.internal.SQLConf
    +import org.apache.spark.util.Benchmark
    +
    +/**
    + * Benchmark to measure data source write performance.
    + * By default it measures 4 data source format: Parquet, ORC, JSON, CSV:
    + *  spark-submit --class <this class> <spark sql test jar>
    + * To measure specified formats, run it with arguments:
    + *  spark-submit --class <this class> <spark sql test jar> format1 [format2] [...]
    + */
    +object DataSourceWriteBenchmark {
    +  val conf = new SparkConf()
    +    .setAppName("DataSourceWriteBenchmark")
    +    .setIfMissing("spark.master", "local[1]")
    +    .set("spark.sql.parquet.compression.codec", "snappy")
    +    .set("spark.sql.orc.compression.codec", "snappy")
    +
    +  val spark = SparkSession.builder.config(conf).getOrCreate()
    +
    +  // Set default configs. Individual cases will change them if necessary.
    +  spark.conf.set(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, "true")
    +
    +  val tempTable = "temp"
    +  val numRows = 1024 * 1024 * 15
    +
    +  def withTempTable(tableNames: String*)(f: => Unit): Unit = {
    +    try f finally tableNames.foreach(spark.catalog.dropTempView)
    +  }
    +
    +  def withTable(tableNames: String*)(f: => Unit): Unit = {
    +    try f finally {
    +      tableNames.foreach { name =>
    +        spark.sql(s"DROP TABLE IF EXISTS $name")
    +      }
    +    }
    +  }
    +
    +  def writeInt(table: String, format: String, benchmark: Benchmark): Unit = {
    +    spark.sql(s"create table $table(c1 INT, c2 STRING) using $format")
    --- End diff --
    
    I think int and double should be good.
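    
    A minimal sketch of what such a case could look like, modeled on the existing `writeInt` (the method name and case label below are hypothetical, not part of the patch):
    ```
    // Hypothetical extra case writing an INT and a DOUBLE column.
    def writeIntDouble(table: String, format: String, benchmark: Benchmark): Unit = {
      spark.sql(s"CREATE TABLE $table(c1 INT, c2 DOUBLE) USING $format")
      benchmark.addCase("Output Int and Double Column") { _ =>
        spark.sql(s"INSERT OVERWRITE TABLE $table SELECT CAST(id AS INT) AS c1, " +
          s"CAST(id AS DOUBLE) AS c2 FROM $tempTable")
      }
    }
    ```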


---



[GitHub] spark issue #21409: [SPARK-24365][SQL] Add Parquet write benchmark

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21409
  
    **[Test build #91030 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91030/testReport)** for PR 21409 at commit [`11d612a`](https://github.com/apache/spark/commit/11d612aaa52b7f8ad5006d692e046a0a8c6f410d).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #21409: [SPARK-24365][SQL] Add Parquet write benchmark

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21409#discussion_r190258642
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteBenchmark.scala ---
    @@ -0,0 +1,132 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.spark.sql.execution.datasources.parquet
    +
    +import java.io.File
    +
    +import scala.util.Try
    +
    +import org.apache.spark.SparkConf
    +import org.apache.spark.sql.SparkSession
    +import org.apache.spark.sql.internal.SQLConf
    +import org.apache.spark.util.{Benchmark, Utils}
    +
    +/**
    + * Benchmark to measure parquet write performance.
    + * To run this:
    + *  spark-submit --class <this class> --jars <spark sql test jar>
    + */
    +object ParquetWriteBenchmark {
    +  val conf = new SparkConf()
    +  conf.set("spark.sql.parquet.compression.codec", "snappy")
    +
    +  val spark = SparkSession.builder
    +    .master("local[1]")
    +    .appName("parquet-write-benchmark")
    +    .config(conf)
    +    .getOrCreate()
    +
    +  // Set default configs. Individual cases will change them if necessary.
    +  spark.conf.set(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, "true")
    +
    +  def withTempPath(f: File => Unit): Unit = {
    +    val path = Utils.createTempDir()
    +    path.delete()
    +    try f(path) finally Utils.deleteRecursively(path)
    +  }
    +
    +  def withTempTable(tableNames: String*)(f: => Unit): Unit = {
    +    try f finally tableNames.foreach(spark.catalog.dropTempView)
    +  }
    +
    +  def withSQLConf(pairs: (String, String)*)(f: => Unit): Unit = {
    +    val (keys, values) = pairs.unzip
    +    val currentValues = keys.map(key => Try(spark.conf.get(key)).toOption)
    +    (keys, values).zipped.foreach(spark.conf.set)
    +    try f finally {
    +      keys.zip(currentValues).foreach {
    +        case (key, Some(value)) => spark.conf.set(key, value)
    +        case (key, None) => spark.conf.unset(key)
    +      }
    +    }
    +  }
    +
    +  def runSQL(name: String, sql: String, values: Int): Unit = {
    +    withTempTable("t1") {
    +      spark.range(values).createOrReplaceTempView("t1")
    +      val benchmark = new Benchmark(name, values)
    +      benchmark.addCase("Parquet Writer") { _ =>
    +        withTempPath { dir =>
    +          spark.sql(sql).write.parquet(dir.getCanonicalPath)
    +        }
    +      }
    +      benchmark.run()
    +    }
    +  }
    +
    +  def intWriteBenchmark(values: Int): Unit = {
    +    /*
    +    Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
    +
    +    Output Single Int Column:                Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    +    ------------------------------------------------------------------------------------------------
    +    Parquet Writer                                2536 / 2610          6.2         161.3       1.0X
    +    */
    +    runSQL("Output Single Int Column", "select cast(id as INT) as id from t1", values)
    --- End diff --
    
    let's also use named parameters here.
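    
    i.e. the call would become something like:
    ```
    // same call as above, with named arguments matching runSQL(name, sql, values)
    runSQL(
      name = "Output Single Int Column",
      sql = "select cast(id as INT) as id from t1",
      values = values)
    ```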


---



[GitHub] spark issue #21409: [SPARK-24365][SQL] Add Parquet write benchmark

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21409
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91030/
    Test FAILed.


---



[GitHub] spark issue #21409: [SPARK-24365][SQL] Add Parquet write benchmark

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21409
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #21409: [SPARK-24365][SQL] Add Parquet write benchmark

Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21409#discussion_r190231900
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteBenchmark.scala ---
    @@ -0,0 +1,132 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.spark.sql.execution.datasources.parquet
    +
    +import java.io.File
    +
    +import scala.util.Try
    +
    +import org.apache.spark.SparkConf
    +import org.apache.spark.sql.SparkSession
    +import org.apache.spark.sql.internal.SQLConf
    +import org.apache.spark.util.{Benchmark, Utils}
    +
    +/**
    + * Benchmark to measure parquet write performance.
    + * To run this:
    + *  spark-submit --class <this class> --jars <spark sql test jar>
    + */
    +object ParquetWriteBenchmark {
    +  val conf = new SparkConf()
    +  conf.set("spark.sql.parquet.compression.codec", "snappy")
    +
    +  val spark = SparkSession.builder
    +    .master("local[1]")
    +    .appName("parquet-write-benchmark")
    +    .config(conf)
    +    .getOrCreate()
    +
    +  // Set default configs. Individual cases will change them if necessary.
    +  spark.conf.set(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, "true")
    +
    +  def withTempPath(f: File => Unit): Unit = {
    +    val path = Utils.createTempDir()
    +    path.delete()
    +    try f(path) finally Utils.deleteRecursively(path)
    +  }
    +
    +  def withTempTable(tableNames: String*)(f: => Unit): Unit = {
    +    try f finally tableNames.foreach(spark.catalog.dropTempView)
    +  }
    +
    +  def withSQLConf(pairs: (String, String)*)(f: => Unit): Unit = {
    +    val (keys, values) = pairs.unzip
    +    val currentValues = keys.map(key => Try(spark.conf.get(key)).toOption)
    +    (keys, values).zipped.foreach(spark.conf.set)
    +    try f finally {
    +      keys.zip(currentValues).foreach {
    +        case (key, Some(value)) => spark.conf.set(key, value)
    +        case (key, None) => spark.conf.unset(key)
    +      }
    +    }
    +  }
    +
    +  def runSQL(name: String, sql: String, values: Int): Unit = {
    +    withTempTable("t1") {
    +      spark.range(values).createOrReplaceTempView("t1")
    +      val benchmark = new Benchmark(name, values)
    +      benchmark.addCase("Parquet Writer") { _ =>
    +        withTempPath { dir =>
    +          spark.sql(sql).write.parquet(dir.getCanonicalPath)
    +        }
    +      }
    +      benchmark.run()
    +    }
    +  }
    +
    +  def intWriteBenchmark(values: Int): Unit = {
    +    /*
    +    Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
    +
    +    Output Single Int Column:                Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    +    ------------------------------------------------------------------------------------------------
    +    Parquet Writer                                2536 / 2610          6.2         161.3       1.0X
    +    */
    +    runSQL("Output Single Int Column", "select cast(id as INT) as id from t1", values)
    +  }
    +
    +  def intStringWriteBenchmark(values: Int): Unit = {
    +    /*
    +    Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
    +
    +    Output Int and String Column:            Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    +    ------------------------------------------------------------------------------------------------
    +    Parquet Writer                                4644 / 4673          2.3         442.9       1.0X
    +    */
    +    runSQL(name = "Output Int and String Column",
    --- End diff --
    
    nit: Is there any reason to use named arguments for this `runSQL` call but not for the other `runSQL` call?


---



[GitHub] spark pull request #21409: [SPARK-24365][SQL] Add Parquet write benchmark

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21409#discussion_r190207562
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteBenchmark.scala ---
    @@ -0,0 +1,128 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.spark.sql.execution.datasources.parquet
    +
    +import java.io.File
    +
    +import scala.util.Try
    +
    +import org.apache.spark.SparkConf
    +import org.apache.spark.sql.SparkSession
    +import org.apache.spark.sql.internal.SQLConf
    +import org.apache.spark.util.{Benchmark, Utils}
    +
    +/**
    + * Benchmark to measure parquet write performance.
    + * To run this:
    + *  spark-submit --class <this class> --jars <spark sql test jar>
    + */
    +object ParquetWriteBenchmark {
    +  val conf = new SparkConf()
    +  conf.set("spark.sql.parquet.compression.codec", "snappy")
    +
    +  val spark = SparkSession.builder
    +    .master("local[1]")
    +    .appName("parquet-write-benchmark")
    +    .config(conf)
    +    .getOrCreate()
    +
    +  // Set default configs. Individual cases will change them if necessary.
    +  spark.conf.set(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, "true")
    +
    +  def withTempPath(f: File => Unit): Unit = {
    +    val path = Utils.createTempDir()
    +    path.delete()
    +    try f(path) finally Utils.deleteRecursively(path)
    +  }
    +
    +  def withTempTable(tableNames: String*)(f: => Unit): Unit = {
    +    try f finally tableNames.foreach(spark.catalog.dropTempView)
    +  }
    +
    +  def withSQLConf(pairs: (String, String)*)(f: => Unit): Unit = {
    +    val (keys, values) = pairs.unzip
    +    val currentValues = keys.map(key => Try(spark.conf.get(key)).toOption)
    +    (keys, values).zipped.foreach(spark.conf.set)
    +    try f finally {
    +      keys.zip(currentValues).foreach {
    +        case (key, Some(value)) => spark.conf.set(key, value)
    +        case (key, None) => spark.conf.unset(key)
    +      }
    +    }
    +  }
    +
    +  def runSQL(name: String, sql: String, values: Int): Unit = {
    +    withTempTable("t1") {
    +      spark.range(values).createOrReplaceTempView("t1")
    +      val benchmark = new Benchmark(name, values)
    +      benchmark.addCase("Parquet Writer") { _ =>
    +        withTempPath { dir =>
    +          spark.sql("select cast(id as INT) as id from t1").write.parquet(dir.getCanonicalPath)
    --- End diff --
    
    shouldn't we use `sql` here? 
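    
    i.e. presumably:
    ```
    // use the sql argument passed to runSQL instead of the hard-coded query
    spark.sql(sql).write.parquet(dir.getCanonicalPath)
    ```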


---



[GitHub] spark issue #21409: [SPARK-24365][SQL] Add Data Source write benchmark

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21409
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3679/
    Test PASSed.


---



[GitHub] spark pull request #21409: [SPARK-24365][SQL] Add Parquet write benchmark

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21409#discussion_r190260242
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteBenchmark.scala ---
    @@ -0,0 +1,132 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.spark.sql.execution.datasources.parquet
    +
    +import java.io.File
    +
    +import scala.util.Try
    +
    +import org.apache.spark.SparkConf
    +import org.apache.spark.sql.SparkSession
    +import org.apache.spark.sql.internal.SQLConf
    +import org.apache.spark.util.{Benchmark, Utils}
    +
    +/**
    + * Benchmark to measure parquet write performance.
    + * To run this:
    + *  spark-submit --class <this class> --jars <spark sql test jar>
    + */
    +object ParquetWriteBenchmark {
    +  val conf = new SparkConf()
    +  conf.set("spark.sql.parquet.compression.codec", "snappy")
    +
    +  val spark = SparkSession.builder
    +    .master("local[1]")
    +    .appName("parquet-write-benchmark")
    +    .config(conf)
    +    .getOrCreate()
    +
    +  // Set default configs. Individual cases will change them if necessary.
    +  spark.conf.set(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, "true")
    +
    +  def withTempPath(f: File => Unit): Unit = {
    +    val path = Utils.createTempDir()
    +    path.delete()
    +    try f(path) finally Utils.deleteRecursively(path)
    +  }
    +
    +  def withTempTable(tableNames: String*)(f: => Unit): Unit = {
    +    try f finally tableNames.foreach(spark.catalog.dropTempView)
    +  }
    +
    +  def withSQLConf(pairs: (String, String)*)(f: => Unit): Unit = {
    +    val (keys, values) = pairs.unzip
    +    val currentValues = keys.map(key => Try(spark.conf.get(key)).toOption)
    +    (keys, values).zipped.foreach(spark.conf.set)
    +    try f finally {
    +      keys.zip(currentValues).foreach {
    +        case (key, Some(value)) => spark.conf.set(key, value)
    +        case (key, None) => spark.conf.unset(key)
    +      }
    +    }
    +  }
    +
    +  def runSQL(name: String, sql: String, values: Int): Unit = {
    --- End diff --
    
    the benchmark workflow should be
    1. create a table
    2. run INSERT OVERWRITE
    
    then all the cases can use `runSQL`
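    
    A rough sketch of a `runSQL` shaped that way (the `createSql`/`insertSql` split is an assumption, not the actual patch; dropping the created table afterwards is omitted here):
    ```
    def runSQL(name: String, createSql: String, insertSql: String, values: Int): Unit = {
      withTempTable("t1") {
        spark.range(values).createOrReplaceTempView("t1")
        val benchmark = new Benchmark(name, values)
        spark.sql(createSql)                  // 1. create the target table once
        benchmark.addCase("Parquet Writer") { _ =>
          spark.sql(insertSql)                // 2. benchmark the INSERT OVERWRITE
        }
        benchmark.run()
      }
    }
    ```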


---



[GitHub] spark issue #21409: [SPARK-24365][SQL] Add Parquet write benchmark

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21409
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3502/
    Test PASSed.


---



[GitHub] spark issue #21409: [SPARK-24365][SQL] Add Data Source write benchmark

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21409
  
    **[Test build #91269 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91269/testReport)** for PR 21409 at commit [`e90fa00`](https://github.com/apache/spark/commit/e90fa00e8963eb985bdd30d9a262c61f6ca1ce61).


---



[GitHub] spark issue #21409: [SPARK-24365][SQL] Add Data Source write benchmark

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21409
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21409: [SPARK-24365][SQL] Add Parquet write benchmark

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21409
  
    **[Test build #91033 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91033/testReport)** for PR 21409 at commit [`0f0908e`](https://github.com/apache/spark/commit/0f0908ee2e92850671d84bdae4dfe07ad1135c24).


---



[GitHub] spark issue #21409: [SPARK-24365][SQL] Add Parquet write benchmark

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21409
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91033/
    Test FAILed.


---



[GitHub] spark issue #21409: [SPARK-24365][SQL] Add Parquet write benchmark

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21409
  
    **[Test build #91033 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91033/testReport)** for PR 21409 at commit [`0f0908e`](https://github.com/apache/spark/commit/0f0908ee2e92850671d84bdae4dfe07ad1135c24).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
