Posted to reviews@spark.apache.org by dbtsai <gi...@git.apache.org> on 2018/08/02 01:27:58 UTC

[GitHub] spark pull request #21952: [SPARK-24993] [SQL] [WIP] Make Avro Fast Again

GitHub user dbtsai opened a pull request:

    https://github.com/apache/spark/pull/21952

    [SPARK-24993] [SQL] [WIP] Make Avro Fast Again

    ## What changes were proposed in this pull request?
    
    When @lindblombr developed [SPARK-24855](https://github.com/apache/spark/pull/21847) at Apple to support a user-specified schema on write, we found a performance regression in the Avro writer for our dataset.
    
    The benchmark result for Spark 2.3 + databricks avro is:
    ```
    +-------+-------------------+
    |summary|         writeTimes|
    +-------+-------------------+
    |  count|                100|
    |   mean| 1.3629600000000002|
    | stddev|0.10027788863700186|
    |    min|              1.197|
    |    max|              1.791|
    +-------+-------------------+
    
    +-------+-------------------+
    |summary|          readTimes|
    +-------+-------------------+
    |  count|                100|
    |   mean| 0.5118100000000001|
    | stddev|0.03879333874923806|
    |    min|              0.463|
    |    max|              0.636|
    +-------+-------------------+
    ```
    
    The benchmark result for current master is:
    ```
    +-------+-------------------+
    |summary|         writeTimes|
    +-------+-------------------+
    |  count|                100|
    |   mean| 2.2086099999999997|
    | stddev|0.03511191199061028|
    |    min|              2.119|
    |    max|              2.352|
    +-------+-------------------+
    
    +-------+--------------------+
    |summary|           readTimes|
    +-------+--------------------+
    |  count|                 100|
    |   mean|              0.4224|
    | stddev|0.023321642092678414|
    |    min|                 0.4|
    |    max|               0.523|
    +-------+--------------------+
    ```
    
    With this PR, write performance improves slightly, but it is still well short of the old avro writer. There must be something we are missing, which we need to investigate.
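    
    To put the two sets of tables above in perspective, the ratio of the reported means can be computed directly. This is only a minimal sketch: the numbers are the `mean` rows copied from the tables, and the object and helper names are illustrative, not part of the patch.
    
    ```scala
    object WriteRegression {
      // Mean wall-clock times in seconds, copied from the tables above.
      val oldWriteMean = 1.3629600000000002    // Spark 2.3 + databricks avro
      val masterWriteMean = 2.2086099999999997 // current master
      val oldReadMean = 0.5118100000000001     // Spark 2.3 + databricks avro
      val masterReadMean = 0.4224              // current master
    
      // A ratio above 1 means master is slower than the baseline.
      def ratio(master: Double, baseline: Double): Double = master / baseline
    
      // Fixed locale so the decimal separator does not depend on the machine.
      def fmt(x: Double): String = "%.2f".formatLocal(java.util.Locale.ROOT, x)
    
      def main(args: Array[String]): Unit = {
        println(s"write: ${fmt(ratio(masterWriteMean, oldWriteMean))}x") // write: 1.62x
        println(s"read:  ${fmt(ratio(masterReadMean, oldReadMean))}x")   // read:  0.83x
      }
    }
    ```
    
    So writes on master take roughly 1.6 times as long as on Spark 2.3 with databricks avro, while reads are somewhat faster.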
    
    The following is the test code to reproduce the result.
    ```scala
        spark.sqlContext.setConf("spark.sql.avro.compression.codec", "uncompressed")
        val sparkSession = spark
        import sparkSession.implicits._
        val df = spark.sparkContext.range(1, 3000).repartition(1).map { uid =>
          val features = Array.fill(16000)(scala.math.random)
          (uid, scala.math.random, java.util.UUID.randomUUID().toString, java.util.UUID.randomUUID().toString, features)
        }.toDF("uid", "random", "uuid1", "uuid2", "features").cache()
        val size = df.count()
    
        // Write into ramdisk to rule out the disk IO impact
        val tempSaveDir = s"/Volumes/ramdisk/${java.util.UUID.randomUUID()}/"
        val n = 150
        val writeTimes = new Array[Double](n)
        var i = 0
        while (i < n) {
          val t1 = System.currentTimeMillis()
          df.write
            .format("com.databricks.spark.avro")
            .mode("overwrite")
            .save(tempSaveDir)
          val t2 = System.currentTimeMillis()
          writeTimes(i) = (t2 - t1) / 1000.0
          i += 1
        }
    
        df.unpersist()
    
        // The first 50 runs are for warm-up
        val readTimes = new Array[Double](n)
        i = 0
        while (i < n) {
          val t1 = System.currentTimeMillis()
          val readDF = spark.read.format("com.databricks.spark.avro").load(tempSaveDir)
          assert(readDF.count() == size)
          val t2 = System.currentTimeMillis()
          readTimes(i) = (t2 - t1) / 1000.0
          i += 1
        }
    
        spark.sparkContext.parallelize(writeTimes.slice(50, 150)).toDF("writeTimes").describe("writeTimes").show()
        spark.sparkContext.parallelize(readTimes.slice(50, 150)).toDF("readTimes").describe("readTimes").show()
    ```
    
    ## How was this patch tested?
    
    Existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dbtsai/spark avro-performance-fix

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21952.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21952
    
----
commit 3be6906a310c100153316d0d144e6c4180071c8e
Author: DB Tsai <d_...@...>
Date:   2018-08-01T20:58:05Z

    Make avro fast again

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #21952: [SPARK-24993] [SQL] [WIP] Make Avro Fast Again

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94062/
    Test PASSed.


---

[GitHub] spark pull request #21952: [SPARK-24993] [SQL] Make Avro Fast Again

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21952#discussion_r207444381
  
    --- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroSerializer.scala ---
    @@ -151,11 +155,12 @@ class AvroSerializer(rootCatalystType: DataType, rootAvroType: Schema, nullable:
           case (f1, f2) => newConverter(f1.dataType, resolveNullableType(f2.schema(), f1.nullable))
         }
         val numFields = catalystStruct.length
    +    val containsNull = catalystStruct.exists(_.nullable)
    --- End diff --
    
    I was addressing the feedback from @gengliangwang. We can remove it, since cases where none of the fields is nullable will probably be fairly rare.


---

[GitHub] spark issue #21952: [SPARK-24993] [SQL] [WIP] Make Avro Fast Again

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94069/
    Test FAILed.


---

[GitHub] spark issue #21952: [SPARK-24993] [SQL] Make Avro Fast Again

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    Ah, I can finally reproduce this. The `features` array needs to be allocated with length 16000; I had reduced it to 1600, which largely hides the regression. `com.databricks.spark.avro` is faster only on Spark 2.3. When used with the current master branch, it isn't faster than the built-in avro datasource. Maybe something elsewhere causes this regression.
    
    ```scala
    > "com.databricks.spark.avro - Spark 2.3"
    
    scala> spark.sparkContext.parallelize(writeTimes.slice(50, 150)).toDF("writeTimes").describe("writeTimes").show()
    +-------+-------------------+
    |summary|         writeTimes|
    +-------+-------------------+
    |  count|                100|
    |   mean| 0.9711099999999999|
    | stddev|0.01940836797556013|
    |    min|              0.941|
    |    max|              1.037|
    +-------+-------------------+
    
    
    scala> spark.sparkContext.parallelize(readTimes.slice(50, 150)).toDF("readTimes").describe("readTimes").show()
    +-------+-------------------+
    |summary|          readTimes|
    +-------+-------------------+
    |  count|                100|
    |   mean|            0.36022|
    | stddev|0.05807476546520342|
    |    min|              0.287|
    |    max|              0.626|
    +-------+-------------------+
    
    > "avro"
    
    scala> spark.sparkContext.parallelize(writeTimes.slice(50, 150)).toDF("writeTimes").describe("writeTimes").show()
    +-------+-------------------+
    |summary|         writeTimes|
    +-------+-------------------+
    |  count|                100|
    |   mean| 1.7371699999999999|
    | stddev|0.03504399976018602|
    |    min|              1.695|
    |    max|              1.886|
    +-------+-------------------+
    
    
    scala> spark.sparkContext.parallelize(readTimes.slice(50, 150)).toDF("readTimes").describe("readTimes").show()
    +-------+-------------------+
    |summary|          readTimes|
    +-------+-------------------+
    |  count|                100|
    |   mean|0.32348999999999994|
    | stddev|0.06235617714615632|
    |    min|              0.263|
    |    max|              0.781|
    +-------+-------------------+
    ```


---

[GitHub] spark issue #21952: [SPARK-24993] [SQL] Make Avro Fast Again

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    I noticed that the benchmark uses `df.count`. Is it possible that column pruning has some issues in master?


---

[GitHub] spark issue #21952: [SPARK-24993] [SQL] [WIP] Make Avro Fast Again

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    Merged build finished. Test PASSed.


---

[GitHub] spark issue #21952: [SPARK-24993] [SQL] [WIP] Make Avro Fast Again

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    Merged build finished. Test PASSed.


---

[GitHub] spark issue #21952: [SPARK-24993] [SQL] Make Avro Fast Again

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94104/
    Test PASSed.


---

[GitHub] spark pull request #21952: [SPARK-24993] [SQL] Make Avro Fast Again

Posted by dbtsai <gi...@git.apache.org>.

Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21952#discussion_r207444037
  
    --- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroSerializer.scala ---
    @@ -100,17 +100,20 @@ class AvroSerializer(rootCatalystType: DataType, rootAvroType: Schema, nullable:
               et, resolveNullableType(avroType.getElementType, containsNull))
             (getter, ordinal) => {
               val arrayData = getter.getArray(ordinal)
    -          val result = new java.util.ArrayList[Any]
    --- End diff --
    
    My previous experience on ML projects tells me that `ArrayList` has slower setter performance because of one extra function call, so I prefer to use a plain array as much as possible and wrap it in the right container at the end.
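    
    The "array first, wrap at the end" pattern described above can be sketched roughly as follows. This is an illustration, not the code in this PR: the object name, the method name and the `Array[Double]` input are assumptions made to keep the example self-contained.
    
    ```scala
    import java.util.{Arrays, List => JList}
    
    object ArrayWrapSketch {
      // Fill a pre-sized array with direct indexed writes, then hand it to
      // Arrays.asList, which returns a fixed-size java.util.List backed by
      // the array it receives.
      def toJavaList(values: Array[Double]): JList[AnyRef] = {
        val len = values.length
        val result = new Array[AnyRef](len)
        var i = 0
        while (i < len) {
          result(i) = Double.box(values(i)) // plain array store, no list method call
          i += 1
        }
        Arrays.asList(result: _*)
      }
    }
    ```
    
    For example, `ArrayWrapSketch.toJavaList(Array(1.0, 2.5))` returns a two-element list. The returned list cannot grow or shrink, which should be fine when the consumer only iterates over it.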


---

[GitHub] spark pull request #21952: [SPARK-24993] [SQL] Make Avro Fast Again

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21952#discussion_r207443196
  
    --- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroSerializer.scala ---
    @@ -100,17 +100,20 @@ class AvroSerializer(rootCatalystType: DataType, rootAvroType: Schema, nullable:
               et, resolveNullableType(avroType.getElementType, containsNull))
             (getter, ordinal) => {
               val arrayData = getter.getArray(ordinal)
    -          val result = new java.util.ArrayList[Any]
    --- End diff --
    
    Can we just use `new java.util.ArrayList[Any](len)` here instead of creating an array and wrapping it in an array list?
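    
    A rough sketch of what this suggestion would look like, for comparison. The object name, the method name and the `Array[Double]` input are assumptions for illustration only; the real converter reads from `arrayData`.
    
    ```scala
    object PresizedListSketch {
      // Pre-size the ArrayList so it never has to grow its backing array
      // while elements are appended.
      def toJavaList(values: Array[Double]): java.util.ArrayList[Any] = {
        val len = values.length
        // The constructor argument is the initial capacity; size() is still 0 here.
        val result = new java.util.ArrayList[Any](len)
        var i = 0
        while (i < len) {
          result.add(values(i)) // one method call per element
          i += 1
        }
        result
      }
    }
    ```
    
    This avoids the repeated resizing of the default constructor but still goes through `add` for every element.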


---

[GitHub] spark issue #21952: [SPARK-24993] [SQL] [WIP] Make Avro Fast Again

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    @dbtsai I didn't use Spark 2.3 when testing databricks-avro; I also used current master. But because a recent change to schema verification (`FileFormat.supportDataType`) causes an incompatibility, I manually skipped the call to `supportDataType`.
    
    So basically I tested built-in avro and databricks-avro both on current master. I think the differences between Spark 2.3 and current master may account for the gap.
    
    Btw, for the following benchmark numbers I changed the `features` array length from 16000 to 1600.
    
    ```scala
    > "com.databricks.spark.avro"
     
    scala> spark.sparkContext.parallelize(writeTimes.slice(50, 150)).toDF("writeTimes").describe("writeTimes").show()
    +-------+--------------------+
    |summary|          writeTimes|
    +-------+--------------------+
    |  count|                 100|
    |   mean|             0.21102|
    | stddev|0.010737435692590912|
    |    min|               0.195|
    |    max|               0.247|
    +-------+--------------------+
    
    
    scala> spark.sparkContext.parallelize(readTimes.slice(50, 150)).toDF("readTimes").describe("readTimes").show()
    +-------+--------------------+
    |summary|           readTimes|
    +-------+--------------------+
    |  count|                 100|
    |   mean| 0.09441999999999999|
    | stddev|0.016021563751722395|
    |    min|                0.07|
    |    max|               0.134|
    +-------+--------------------+
    
    > "avro"
    
    scala> spark.sparkContext.parallelize(writeTimes.slice(50, 150)).toDF("writeTimes").describe("writeTimes").show()
    +-------+--------------------+
    |summary|          writeTimes|
    +-------+--------------------+
    |  count|                 100|
    |   mean|             0.21445|
    | stddev|0.008952596824329237|
    |    min|               0.201|
    |    max|                0.25|
    +-------+--------------------+
    
    
    scala> spark.sparkContext.parallelize(readTimes.slice(50, 150)).toDF("readTimes").describe("readTimes").show()
    +-------+--------------------+
    |summary|           readTimes|
    +-------+--------------------+
    |  count|                 100|
    |   mean|             0.10792|
    | stddev|0.015983375201386058|
    |    min|                0.08|
    |    max|                0.15|
    +-------+--------------------+
    ```


---

[GitHub] spark pull request #21952: [SPARK-24993] [SQL] Make Avro Fast Again

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21952#discussion_r207443421
  
    --- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroSerializer.scala ---
    @@ -151,11 +155,12 @@ class AvroSerializer(rootCatalystType: DataType, rootAvroType: Schema, nullable:
           case (f1, f2) => newConverter(f1.dataType, resolveNullableType(f2.schema(), f1.nullable))
         }
         val numFields = catalystStruct.length
    +    val containsNull = catalystStruct.exists(_.nullable)
    --- End diff --
    
    This only helps when none of the fields is nullable, so I don't think it's very useful.
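    
    To make the point concrete, here is a simplified sketch of such a flag. `Field` is a hypothetical stand-in for a Catalyst struct field, and `toString` stands in for the real per-field converter; none of this is the actual `AvroSerializer` code.
    
    ```scala
    object NullableFastPathSketch {
      // Hypothetical stand-in for a struct field; only the flag matters here.
      final case class Field(name: String, nullable: Boolean)
    
      // Returns a row converter. The per-field null check is skipped only when
      // no field at all is nullable; one nullable field disables the fast path.
      def converterFor(fields: Seq[Field]): Array[Any] => Array[Any] = {
        val numFields = fields.length
        val containsNull = fields.exists(_.nullable)
        if (containsNull) {
          row => {
            val out = new Array[Any](numFields)
            var i = 0
            while (i < numFields) {
              out(i) = if (row(i) == null) null else row(i).toString
              i += 1
            }
            out
          }
        } else {
          row => {
            val out = new Array[Any](numFields)
            var i = 0
            while (i < numFields) {
              out(i) = row(i).toString // no null check on this path
              i += 1
            }
            out
          }
        }
      }
    }
    ```
    
    With a schema such as `Seq(Field("uid", false), Field("uuid1", true))`, the flag is already true and every row takes the checked path.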


---

[GitHub] spark issue #21952: [SPARK-24993] [SQL] Make Avro Fast Again

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    Merged build finished. Test PASSed.


---

[GitHub] spark issue #21952: [SPARK-24993] [SQL] [WIP] Make Avro Fast Again

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94079/
    Test PASSed.


---

[GitHub] spark issue #21952: [SPARK-24993] [SQL] Make Avro Fast Again

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1734/
    Test PASSed.


---

[GitHub] spark issue #21952: [SPARK-24993] [SQL] Make Avro Fast Again

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    **[Test build #94104 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94104/testReport)** for PR 21952 at commit [`ec17d58`](https://github.com/apache/spark/commit/ec17d58ea674ffba6e2c07284a26f6b3a1e7357e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

[GitHub] spark issue #21952: [SPARK-24993] [SQL] Make Avro Fast Again

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    Merged build finished. Test PASSed.


---

[GitHub] spark issue #21952: [SPARK-24993] [SQL] [WIP] Make Avro Fast Again

Posted by holdensmagicalunicorn <gi...@git.apache.org>.

Github user holdensmagicalunicorn commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    @dbtsai, thanks! I am a bot who has found some folks who might be able to help with the review: @cloud-fan


---

[GitHub] spark issue #21952: [SPARK-24993] [SQL] Make Avro Fast Again

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    **[Test build #94117 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94117/testReport)** for PR 21952 at commit [`2df8142`](https://github.com/apache/spark/commit/2df81420871cfeff0707b8712ad25f7cbf0f45ce).


---

[GitHub] spark issue #21952: [SPARK-24993] [SQL] [WIP] Make Avro Fast Again

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    **[Test build #93922 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93922/testReport)** for PR 21952 at commit [`3be6906`](https://github.com/apache/spark/commit/3be6906a310c100153316d0d144e6c4180071c8e).


---
