Posted to reviews@spark.apache.org by dbtsai <gi...@git.apache.org> on 2018/08/02 01:27:58 UTC

[GitHub] spark pull request #21952: [SPARK-24993] [SQL] [WIP] Make Avro Fast Again

GitHub user dbtsai opened a pull request:

    https://github.com/apache/spark/pull/21952

    [SPARK-24993] [SQL] [WIP] Make Avro Fast Again

    ## What changes were proposed in this pull request?
    
    While @lindblombr was developing [SPARK-24855](https://github.com/apache/spark/pull/21847) at Apple to support a user-specified schema on write, we found a performance regression in the Avro writer on our dataset.
    
    The benchmark results for Spark 2.3 with databricks-avro are:
    ```
    +-------+-------------------+                                                   
    |summary|         writeTimes|
    +-------+-------------------+
    |  count|                100|
    |   mean| 1.3629600000000002|
    | stddev|0.10027788863700186|
    |    min|              1.197|
    |    max|              1.791|
    +-------+-------------------+
    
    +-------+-------------------+
    |summary|          readTimes|
    +-------+-------------------+
    |  count|                100|
    |   mean| 0.5118100000000001|
    | stddev|0.03879333874923806|
    |    min|              0.463|
    |    max|              0.636|
    +-------+-------------------+
    ``` 
    
    The benchmark results for current master are:
    ```
    +-------+-------------------+                                                   
    |summary|         writeTimes|
    +-------+-------------------+
    |  count|                100|
    |   mean| 2.2086099999999997|
    | stddev|0.03511191199061028|
    |    min|              2.119|
    |    max|              2.352|
    +-------+-------------------+
    
    +-------+--------------------+
    |summary|           readTimes|
    +-------+--------------------+
    |  count|                 100|
    |   mean|              0.4224|
    | stddev|0.023321642092678414|
    |    min|                 0.4|
    |    max|               0.523|
    +-------+--------------------+
    ```
    
    With this PR, performance improves slightly, but it still falls well short of the old Avro writer. There must be something we are missing that we need to investigate.
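
    To put a number on the gap, here is a small sketch (my own arithmetic on the means reported above, not something from the patch) that computes the relative slowdown. The object and method names are made up for illustration.

    ```scala
    object WriteRegression {
      // Mean write times in seconds, copied from the two tables above.
      val spark23DatabricksAvro = 1.36296
      val currentMaster = 2.20861

      // Ratio of the new mean to the old mean; values above 1.0 mean slower.
      def slowdown(oldMean: Double, newMean: Double): Double = newMean / oldMean

      def main(args: Array[String]): Unit = {
        val ratio = slowdown(spark23DatabricksAvro, currentMaster)
        println(s"write slowdown: ${ratio}x")
      }
    }
    ```

    By this measure writes on master take roughly 1.6 times as long, while the read means (0.4224 vs 0.51181) go slightly the other way.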
    
    The following is the test code to reproduce the result.
    ```scala
        spark.sqlContext.setConf("spark.sql.avro.compression.codec", "uncompressed")
        val sparkSession = spark
        import sparkSession.implicits._
        val df = spark.sparkContext.range(1, 3000).repartition(1).map { uid =>
          val features = Array.fill(16000)(scala.math.random)
          (uid, scala.math.random, java.util.UUID.randomUUID().toString, java.util.UUID.randomUUID().toString, features)
        }.toDF("uid", "random", "uuid1", "uuid2", "features").cache()
        val size = df.count()
    
        // Write into ramdisk to rule out the disk IO impact
        val tempSaveDir = s"/Volumes/ramdisk/${java.util.UUID.randomUUID()}/"
        val n = 150
        val writeTimes = new Array[Double](n)
        var i = 0
        while (i < n) {
          val t1 = System.currentTimeMillis()
          df.write
            .format("com.databricks.spark.avro")
            .mode("overwrite")
            .save(tempSaveDir)
          val t2 = System.currentTimeMillis()
          writeTimes(i) = (t2 - t1) / 1000.0
          i += 1
        }
    
        df.unpersist()
    
        // The first 50 runs are for warm-up
        val readTimes = new Array[Double](n)
        i = 0
        while (i < n) {
          val t1 = System.currentTimeMillis()
          val readDF = spark.read.format("com.databricks.spark.avro").load(tempSaveDir)
          assert(readDF.count() == size)
          val t2 = System.currentTimeMillis()
          readTimes(i) = (t2 - t1) / 1000.0
          i += 1
        }
    
        spark.sparkContext.parallelize(writeTimes.slice(50, 150)).toDF("writeTimes").describe("writeTimes").show()
        spark.sparkContext.parallelize(readTimes.slice(50, 150)).toDF("readTimes").describe("readTimes").show()
    ```
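
    The warm-up-then-measure pattern in the test code can be isolated from Spark. The following is only a sketch under my own assumptions: the helper names are hypothetical, it swaps `System.currentTimeMillis` for `System.nanoTime`, and the standard deviation uses the sample (n - 1) form, which I believe is what `describe` reports.

    ```scala
    object WarmupTiming {
      // Runs `body` `total` times and returns the elapsed seconds of every run
      // after the first `warmup` runs, which are discarded as warm-up.
      def timeRuns(total: Int, warmup: Int)(body: => Unit): Array[Double] = {
        require(warmup >= 0 && warmup <= total, "warmup must be within [0, total]")
        val times = new Array[Double](total)
        var i = 0
        while (i < total) {
          val start = System.nanoTime() // not affected by wall-clock adjustments
          body
          times(i) = (System.nanoTime() - start) / 1e9
          i += 1
        }
        times.slice(warmup, total)
      }

      def mean(xs: Array[Double]): Double = xs.sum / xs.length

      // Sample standard deviation (n - 1 denominator).
      def stddev(xs: Array[Double]): Double = {
        val m = mean(xs)
        math.sqrt(xs.map(x => (x - m) * (x - m)).sum / (xs.length - 1))
      }
    }
    ```

    With this helper, the write loop above would reduce to something like `WarmupTiming.timeRuns(150, 50) { ... }` around the `df.write` call.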
    
    ## How was this patch tested?
    
    Existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dbtsai/spark avro-performance-fix

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21952.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21952
    
----
commit 3be6906a310c100153316d0d144e6c4180071c8e
Author: DB Tsai <d_...@...>
Date:   2018-08-01T20:58:05Z

    Make avro fast again

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21952: [SPARK-24993] [SQL] [WIP] Make Avro Fast Again

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94062/
    Test PASSed.


---



[GitHub] spark pull request #21952: [SPARK-24993] [SQL] Make Avro Fast Again

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21952#discussion_r207444381
  
    --- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroSerializer.scala ---
    @@ -151,11 +155,12 @@ class AvroSerializer(rootCatalystType: DataType, rootAvroType: Schema, nullable:
           case (f1, f2) => newConverter(f1.dataType, resolveNullableType(f2.schema(), f1.nullable))
         }
         val numFields = catalystStruct.length
    +    val containsNull = catalystStruct.exists(_.nullable)
    --- End diff --
    
    This was addressing feedback from @gengliangwang. We can remove it, since the case where none of the fields is nullable will probably be fairly rare.


---



[GitHub] spark issue #21952: [SPARK-24993] [SQL] [WIP] Make Avro Fast Again

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94069/
    Test FAILed.


---



[GitHub] spark issue #21952: [SPARK-24993] [SQL] Make Avro Fast Again

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    Ah, I can finally reproduce this. The `features` array needs to be allocated with length 16000. I had reduced it to 1600, which largely hid the regression. `com.databricks.spark.avro` is faster only on Spark 2.3. When used with the current master branch, it isn't faster than the built-in Avro data source. Something in master may be causing this regression.
    
    ```scala
    > "com.databricks.spark.avro - Spark 2.3"
    
    scala> spark.sparkContext.parallelize(writeTimes.slice(50, 150)).toDF("writeTimes").describe("writeTimes").show()
    +-------+-------------------+
    |summary|         writeTimes|
    +-------+-------------------+
    |  count|                100|
    |   mean| 0.9711099999999999|
    | stddev|0.01940836797556013|
    |    min|              0.941|
    |    max|              1.037|
    +-------+-------------------+
    
    
    scala> spark.sparkContext.parallelize(readTimes.slice(50, 150)).toDF("readTimes").describe("readTimes").show()
    +-------+-------------------+
    |summary|          readTimes|
    +-------+-------------------+
    |  count|                100|
    |   mean|            0.36022|
    | stddev|0.05807476546520342|
    |    min|              0.287|
    |    max|              0.626|
    +-------+-------------------+
    
    > "avro"
    
    scala> spark.sparkContext.parallelize(writeTimes.slice(50, 150)).toDF("writeTimes").describe("writeTimes").show()
    +-------+-------------------+
    |summary|         writeTimes|
    +-------+-------------------+
    |  count|                100|
    |   mean| 1.7371699999999999|
    | stddev|0.03504399976018602|
    |    min|              1.695|
    |    max|              1.886|
    +-------+-------------------+
    
    
    scala> spark.sparkContext.parallelize(readTimes.slice(50, 150)).toDF("readTimes").describe("readTimes").show()
    +-------+-------------------+
    |summary|          readTimes|
    +-------+-------------------+
    |  count|                100|
    |   mean|0.32348999999999994|
    | stddev|0.06235617714615632|
    |    min|              0.263|
    |    max|              0.781|
    +-------+-------------------+
    ```
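
    For a sense of scale, a back-of-the-envelope sketch (my own arithmetic, not from the thread) of the raw `features` payload per write at the two array lengths. It counts only 8 bytes per `Double` and ignores Avro framing and the other four columns; the names are hypothetical.

    ```scala
    object PayloadSize {
      // range(1, 3000) is end-exclusive, so the benchmark produces 2999 rows.
      val rows: Long = 3000L - 1L

      // Raw bytes of the `features` column alone at 8 bytes per Double.
      def featureBytes(arrayLength: Int): Long = rows * arrayLength * 8L
    }
    ```

    That works out to about 384 MB per write at length 16000 versus about 38 MB at length 1600, so the array conversion path dominates much more in the larger case.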


---



[GitHub] spark issue #21952: [SPARK-24993] [SQL] Make Avro Fast Again

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    I noticed that the benchmark uses `df.count`. Is it possible that column pruning has some issues in master?


---



[GitHub] spark issue #21952: [SPARK-24993] [SQL] [WIP] Make Avro Fast Again

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21952: [SPARK-24993] [SQL] [WIP] Make Avro Fast Again

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21952: [SPARK-24993] [SQL] Make Avro Fast Again

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94104/
    Test PASSed.


---



[GitHub] spark pull request #21952: [SPARK-24993] [SQL] Make Avro Fast Again

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21952#discussion_r207444037
  
    --- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroSerializer.scala ---
    @@ -100,17 +100,20 @@ class AvroSerializer(rootCatalystType: DataType, rootAvroType: Schema, nullable:
               et, resolveNullableType(avroType.getElementType, containsNull))
             (getter, ordinal) => {
               val arrayData = getter.getArray(ordinal)
    -          val result = new java.util.ArrayList[Any]
    --- End diff --
    
    My previous experience on ML projects suggests that `ArrayList` has slower setter performance because of one extra function call, so my preference is to use an array as much as possible and wrap it in the right container at the end.


---



[GitHub] spark pull request #21952: [SPARK-24993] [SQL] Make Avro Fast Again

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21952#discussion_r207443196
  
    --- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroSerializer.scala ---
    @@ -100,17 +100,20 @@ class AvroSerializer(rootCatalystType: DataType, rootAvroType: Schema, nullable:
               et, resolveNullableType(avroType.getElementType, containsNull))
             (getter, ordinal) => {
               val arrayData = getter.getArray(ordinal)
    -          val result = new java.util.ArrayList[Any]
    --- End diff --
    
    Can we just use `new java.util.ArrayList[Any](len)` here instead of creating an array and wrapping it in an array list?
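
    For comparison, a minimal sketch of the two options being discussed, written against plain `Array[Double]` input rather than Spark's `ArrayData` so that it stands alone. The object and method names are hypothetical and this is not the code in the patch.

    ```scala
    import java.util.{ArrayList => JArrayList, Arrays => JArrays, List => JList}

    object ListBuilding {
      // Option 1: pre-size the ArrayList so it never has to grow, then add elements.
      def viaPresizedList(values: Array[Double]): JList[Any] = {
        val result = new JArrayList[Any](values.length)
        var i = 0
        while (i < values.length) {
          result.add(values(i)) // each Double is boxed here
          i += 1
        }
        result
      }

      // Option 2: fill a plain array, then wrap it with Arrays.asList at the end.
      def viaWrappedArray(values: Array[Double]): JList[Any] = {
        val result = new Array[Any](values.length)
        var i = 0
        while (i < values.length) {
          result(i) = values(i) // boxed here as well
          i += 1
        }
        JArrays.asList(result: _*) // fixed-size list
      }
    }
    ```

    Both produce lists with equal contents; the difference is only in how elements are stored while the container is being filled.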


---



[GitHub] spark issue #21952: [SPARK-24993] [SQL] [WIP] Make Avro Fast Again

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    @dbtsai I didn't use Spark 2.3 when testing databricks-avro. I used current master for that too. But because a recent change to schema verification (`FileFormat.supportDataType`) causes an incompatibility, I manually skipped the call to `supportDataType`.
    
    So basically I tested both the built-in Avro data source and databricks-avro on current master. I think the differences between Spark 2.3 and current master may account for the difference in results.
    
    Btw, for the following benchmark numbers I changed the `features` array length from 16000 to 1600.
    
    ```scala
    > "com.databricks.spark.avro"
     
    scala> spark.sparkContext.parallelize(writeTimes.slice(50, 150)).toDF("writeTimes").describe("writeTimes").show()
    +-------+--------------------+
    |summary|          writeTimes|
    +-------+--------------------+
    |  count|                 100|
    |   mean|             0.21102|
    | stddev|0.010737435692590912|
    |    min|               0.195|
    |    max|               0.247|
    +-------+--------------------+
    
    
    scala> spark.sparkContext.parallelize(readTimes.slice(50, 150)).toDF("readTimes").describe("readTimes").show()
    +-------+--------------------+
    |summary|           readTimes|
    +-------+--------------------+
    |  count|                 100|
    |   mean| 0.09441999999999999|
    | stddev|0.016021563751722395|
    |    min|                0.07|
    |    max|               0.134|
    +-------+--------------------+
    
    > "avro"
    
    scala> spark.sparkContext.parallelize(writeTimes.slice(50, 150)).toDF("writeTimes").describe("writeTimes").show()
    +-------+--------------------+
    |summary|          writeTimes|
    +-------+--------------------+
    |  count|                 100|
    |   mean|             0.21445|
    | stddev|0.008952596824329237|
    |    min|               0.201|
    |    max|                0.25|
    +-------+--------------------+
    
    
    scala> spark.sparkContext.parallelize(readTimes.slice(50, 150)).toDF("readTimes").describe("readTimes").show()
    +-------+--------------------+
    |summary|           readTimes|
    +-------+--------------------+
    |  count|                 100|
    |   mean|             0.10792|
    | stddev|0.015983375201386058|
    |    min|                0.08|
    |    max|                0.15|
    +-------+--------------------+
    ```


---



[GitHub] spark pull request #21952: [SPARK-24993] [SQL] Make Avro Fast Again

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21952#discussion_r207443421
  
    --- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroSerializer.scala ---
    @@ -151,11 +155,12 @@ class AvroSerializer(rootCatalystType: DataType, rootAvroType: Schema, nullable:
           case (f1, f2) => newConverter(f1.dataType, resolveNullableType(f2.schema(), f1.nullable))
         }
         val numFields = catalystStruct.length
    +    val containsNull = catalystStruct.exists(_.nullable)
    --- End diff --
    
    This only helps when none of the fields is nullable, so I don't think it's very useful.
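
    A toy illustration of why the flag rarely pays off, assuming it gates the null checks for the whole struct (the diff context does not show how it is used, so that is a guess). `Field` is a hypothetical stand-in for a Catalyst struct field.

    ```scala
    object NullChecks {
      // Hypothetical stand-in for a struct field; only the nullable flag matters here.
      final case class Field(name: String, nullable: Boolean)

      // Mirrors the `catalystStruct.exists(_.nullable)` line in the diff.
      def containsNull(fields: Seq[Field]): Boolean = fields.exists(_.nullable)

      // Number of per-field null checks per row if they are skipped
      // only when no field at all is nullable.
      def nullChecksPerRow(fields: Seq[Field]): Int =
        if (containsNull(fields)) fields.length else 0
    }
    ```

    Under that assumption a single nullable field brings back the checks for every field, so the fast path applies only to fully non-nullable schemas.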


---



[GitHub] spark issue #21952: [SPARK-24993] [SQL] Make Avro Fast Again

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21952: [SPARK-24993] [SQL] [WIP] Make Avro Fast Again

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94079/
    Test PASSed.


---



[GitHub] spark issue #21952: [SPARK-24993] [SQL] Make Avro Fast Again

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1734/
    Test PASSed.


---



[GitHub] spark issue #21952: [SPARK-24993] [SQL] Make Avro Fast Again

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    **[Test build #94104 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94104/testReport)** for PR 21952 at commit [`ec17d58`](https://github.com/apache/spark/commit/ec17d58ea674ffba6e2c07284a26f6b3a1e7357e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21952: [SPARK-24993] [SQL] Make Avro Fast Again

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21952: [SPARK-24993] [SQL] [WIP] Make Avro Fast Again

Posted by holdensmagicalunicorn <gi...@git.apache.org>.
Github user holdensmagicalunicorn commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    @dbtsai, thanks! I am a bot who has found some folks who might be able to help with the review: @cloud-fan


---



[GitHub] spark issue #21952: [SPARK-24993] [SQL] Make Avro Fast Again

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    **[Test build #94117 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94117/testReport)** for PR 21952 at commit [`2df8142`](https://github.com/apache/spark/commit/2df81420871cfeff0707b8712ad25f7cbf0f45ce).


---



[GitHub] spark issue #21952: [SPARK-24993] [SQL] [WIP] Make Avro Fast Again

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    **[Test build #93922 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93922/testReport)** for PR 21952 at commit [`3be6906`](https://github.com/apache/spark/commit/3be6906a310c100153316d0d144e6c4180071c8e).


---



[GitHub] spark pull request #21952: [SPARK-24993] [SQL] [WIP] Make Avro Fast Again

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21952#discussion_r207405634
  
    --- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroSerializer.scala ---
    @@ -100,13 +100,14 @@ class AvroSerializer(rootCatalystType: DataType, rootAvroType: Schema, nullable:
               et, resolveNullableType(avroType.getElementType, containsNull))
             (getter, ordinal) => {
               val arrayData = getter.getArray(ordinal)
    -          val result = new java.util.ArrayList[Any]
    +          val len = arrayData.numElements()
    +          val result = new Array[Any](len)
    --- End diff --
    
    I tested this out, and this doesn't help much. 
    
    I guess the reason is that the Avro writer expects a boxed `ArrayList`, so even if we call the primitive APIs, Scala will still auto-box the values, which is not much different from the current code.
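
    The boxing can be seen in isolation with a small sketch (hypothetical names, plain Scala, no Spark or Avro involved): storing a primitive `Double` into an `Array[Any]` yields `java.lang.Double` objects regardless of how the value was read.

    ```scala
    object BoxingDemo {
      // Array[Any] is an Object[] on the JVM, so every stored Double is boxed.
      def fillBoxed(values: Array[Double]): Array[Any] = {
        val out = new Array[Any](values.length)
        var i = 0
        while (i < values.length) {
          out(i) = values(i)
          i += 1
        }
        out
      }

      // True when every element is a java.lang.Double object.
      def allBoxed(xs: Array[Any]): Boolean =
        xs.forall(_.isInstanceOf[java.lang.Double])
    }
    ```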


---



[GitHub] spark issue #21952: [SPARK-24993] [SQL] [WIP] Make Avro Fast Again

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1702/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21952: [SPARK-24993] [SQL] Make Avro Fast Again

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    The regression happens on write. It looks like we don't use `df.count` when benchmarking write time?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21952: [SPARK-24993] [SQL] Make Avro Fast Again

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    **[Test build #94104 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94104/testReport)** for PR 21952 at commit [`ec17d58`](https://github.com/apache/spark/commit/ec17d58ea674ffba6e2c07284a26f6b3a1e7357e).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21952: [SPARK-24993] [SQL] [WIP] Make Avro Fast Again

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21952#discussion_r207102304
  
    --- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroSerializer.scala ---
    @@ -100,13 +100,14 @@ class AvroSerializer(rootCatalystType: DataType, rootAvroType: Schema, nullable:
               et, resolveNullableType(avroType.getElementType, containsNull))
             (getter, ordinal) => {
               val arrayData = getter.getArray(ordinal)
    -          val result = new java.util.ArrayList[Any]
    +          val len = arrayData.numElements()
    +          val result = new Array[Any](len)
    --- End diff --
    
    One more improvement: if the element is a primitive type, we can call `arrayData.toBoolean/Int/...Array` directly.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21952: [SPARK-24993] [SQL] Make Avro Fast Again

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    **[Test build #94117 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94117/testReport)** for PR 21952 at commit [`2df8142`](https://github.com/apache/spark/commit/2df81420871cfeff0707b8712ad25f7cbf0f45ce).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21952: [SPARK-24993] [SQL] [WIP] Make Avro Fast Again

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    I ran the benchmark code above but didn't notice a significant regression, perhaps because of differences in test environment. Let's see if others can confirm the regression too.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #21952: [SPARK-24993] [SQL] Make Avro Fast Again

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21952#discussion_r207444698
  
    --- Diff: external/avro/src/main/scala/org/apache/spark/sql/avro/AvroSerializer.scala ---
    @@ -151,11 +155,12 @@ class AvroSerializer(rootCatalystType: DataType, rootAvroType: Schema, nullable:
           case (f1, f2) => newConverter(f1.dataType, resolveNullableType(f2.schema(), f1.nullable))
         }
         val numFields = catalystStruct.length
    +    val containsNull = catalystStruct.exists(_.nullable)
    --- End diff --
    
    Let's remove it. We can later fix the issue that Spark always turns the schema nullable.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21952: [SPARK-24993] [SQL] [WIP] Make Avro Fast Again

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21952: [SPARK-24993] [SQL] [WIP] Make Avro Fast Again

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    **[Test build #94062 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94062/testReport)** for PR 21952 at commit [`069625d`](https://github.com/apache/spark/commit/069625d0e1107adc66b1c045410492c72900c16e).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21952: [SPARK-24993] [SQL] [WIP] Make Avro Fast Again

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    **[Test build #94069 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94069/testReport)** for PR 21952 at commit [`8d35f06`](https://github.com/apache/spark/commit/8d35f062f6bf384fc915cdeed7ac0d83dccafdde).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21952: [SPARK-24993] [SQL] [WIP] Make Avro Fast Again

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21952: [SPARK-24993] [SQL] Make Avro Fast Again

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94117/
    Test PASSed.


---



[GitHub] spark issue #21952: [SPARK-24993] [SQL] [WIP] Make Avro Fast Again

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21952: [SPARK-24993] [SQL] [WIP] Make Avro Fast Again

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    @dbtsai I was thinking the same thing. I will run the test later, once I am back at my laptop.


---



[GitHub] spark issue #21952: [SPARK-24993] [SQL] [WIP] Make Avro Fast Again

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93922/
    Test FAILed.


---



[GitHub] spark issue #21952: [SPARK-24993] [SQL] [WIP] Make Avro Fast Again

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    **[Test build #94079 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94079/testReport)** for PR 21952 at commit [`3309020`](https://github.com/apache/spark/commit/3309020e7d9102a2ef92021d43c006289d0fdd3d).


---



[GitHub] spark pull request #21952: [SPARK-24993] [SQL] Make Avro Fast Again

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai closed the pull request at:

    https://github.com/apache/spark/pull/21952


---



[GitHub] spark issue #21952: [SPARK-24993] [SQL] [WIP] Make Avro Fast Again

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1708/
    Test PASSed.


---



[GitHub] spark issue #21952: [SPARK-24993] [SQL] [WIP] Make Avro Fast Again

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #21952: [SPARK-24993] [SQL] Make Avro Fast Again

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21952: [SPARK-24993] [SQL] [WIP] Make Avro Fast Again

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1594/
    Test PASSed.


---



[GitHub] spark issue #21952: [SPARK-24993] [SQL] [WIP] Make Avro Fast Again

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    @dbtsai This is what I see when testing on Spark 2.3. Compared with the numbers above, I don't see a difference as significant as the one in your findings.
    
    ```scala
    > "com.databricks.spark.avro - Spark 2.3"
    
    scala> spark.sparkContext.parallelize(writeTimes.slice(50, 150)).toDF("writeTimes").describe("writeTimes").show()
    +-------+-------------------+
    |summary|         writeTimes|
    +-------+-------------------+
    |  count|                100|
    |   mean|0.21722999999999998|
    | stddev|0.04375479309963559|
    |    min|              0.176|
    |    max|              0.481|
    +-------+-------------------+
    
    
    scala> spark.sparkContext.parallelize(readTimes.slice(50, 150)).toDF("readTimes").describe("readTimes").show()
    +-------+-------------------+
    |summary|          readTimes|
    +-------+-------------------+
    |  count|                100|
    |   mean|0.12025999999999999|
    | stddev|0.04034638406438311|
    |    min|              0.072|
    |    max|               0.26|
    +-------+-------------------+
    ```
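    
    A minimal sketch of how the summary figures above can be reproduced outside Spark, assuming the timings are plain `Double` values in seconds. `TimingSummary` and `Summary` are hypothetical names introduced for illustration. The standard deviation uses the sample (n - 1) formula, which is what `describe` reports as `stddev`.
    
    ```scala
    object TimingSummary {
      // Hypothetical container mirroring the rows printed by describe().
      final case class Summary(count: Int, mean: Double, stddev: Double, min: Double, max: Double)
    
      // Summarize a sequence of timings: count, mean, sample standard
      // deviation (n - 1 denominator), min and max.
      def summarize(times: Seq[Double]): Summary = {
        require(times.nonEmpty, "need at least one measurement")
        val n = times.length
        val mean = times.sum / n
        // In this sketch a single measurement yields NaN for the deviation.
        val variance =
          if (n > 1) times.map(t => (t - mean) * (t - mean)).sum / (n - 1)
          else Double.NaN
        Summary(n, mean, math.sqrt(variance), times.min, times.max)
      }
    }
    ```
    
    With the benchmark's own data, `TimingSummary.summarize(writeTimes.slice(50, 150))` would keep runs 50 through 149 (100 measurements) and skip the first 50, presumably to discard warm-up runs.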


---



[GitHub] spark issue #21952: [SPARK-24993] [SQL] [WIP] Make Avro Fast Again

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    @viirya How did you run the benchmark? I tried again on my desktop and still got a consistent regression. Thanks.
    
    Spark 2.4
    ```
    spark git:(master) ./build/mvn -DskipTests clean package
    spark git:(master) bin/spark-shell --jars external/avro/target/spark-avro_2.11-2.4.0-SNAPSHOT.jar
    ``` 
    
    Spark 2.3 + databricks avro
    ```
    spark git:(branch-2.3) ./build/mvn -DskipTests clean package
    spark git:(branch-2.3) bin/spark-shell --packages com.databricks:spark-avro_2.11:4.0.0               
    ```
    
    Current master:
    ```
    +-------+--------------------+                                                  
    |summary|          writeTimes|
    +-------+--------------------+
    |  count|                 100|
    |   mean|             2.95621|
    | stddev|0.030895815479469294|
    |    min|               2.915|
    |    max|               3.049|
    +-------+--------------------+
    
    +-------+--------------------+
    |summary|           readTimes|
    +-------+--------------------+
    |  count|                 100|
    |   mean| 0.31072999999999995|
    | stddev|0.054139709842390006|
    |    min|               0.259|
    |    max|               0.692|
    +-------+--------------------+
    ```
    
    Current master with this PR:
    ```
    +-------+--------------------+                                                  
    |summary|          writeTimes|
    +-------+--------------------+
    |  count|                 100|
    |   mean|  2.5804300000000002|
    | stddev|0.011175600225672079|
    |    min|               2.558|
    |    max|                2.62|
    +-------+--------------------+
    
    +-------+--------------------+
    |summary|           readTimes|
    +-------+--------------------+
    |  count|                 100|
    |   mean| 0.29922000000000004|
    | stddev|0.058261961532514166|
    |    min|               0.251|
    |    max|               0.732|
    +-------+--------------------+
    ```
    
    Spark 2.3 + databricks avro:
    ```
    +-------+--------------------+                                                  
    |summary|          writeTimes|
    +-------+--------------------+
    |  count|                 100|
    |   mean|  1.7730500000000005|
    | stddev|0.025199156230863575|
    |    min|               1.729|
    |    max|               1.833|
    +-------+--------------------+
    
    +-------+-------------------+
    |summary|          readTimes|
    +-------+-------------------+
    |  count|                100|
    |   mean|            0.29715|
    | stddev|0.05685643358850465|
    |    min|              0.258|
    |    max|              0.718|
    +-------+-------------------+
    
    ```


---



[GitHub] spark issue #21952: [SPARK-24993] [SQL] [WIP] Make Avro Fast Again

Posted by gengliangwang <gi...@git.apache.org>.
Github user gengliangwang commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    Hi @dbtsai , nice catch!
    I think we can also check the nullability here:
    https://github.com/apache/spark/pull/21952/files#diff-01fea32e6ec6bcf6f34d06282e08705aR160
    
    If the input data comes from a data source, I doubt this PR will bring an improvement, because the data schema is always nullable for data sources:
    https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L379
    
    Anyway, we should add these checks.
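    
    A minimal sketch of the kind of nullability check discussed here, not the PR's actual code. `NullabilitySketch`, `Field` and `makeConverter` are hypothetical names. The idea is to decide once, from the schema, whether a per-value null check is needed, so that non-nullable fields skip it in the hot path.
    
    ```scala
    object NullabilitySketch {
      // Hypothetical stand-in for a schema field; not Spark's StructField.
      final case class Field(name: String, nullable: Boolean)
    
      // Build the per-value converter once, outside the hot loop.
      def makeConverter(field: Field)(convert: Any => Any): Any => Any =
        if (field.nullable) {
          // Nullable field: guard every value before converting it.
          value => if (value == null) null else convert(value)
        } else {
          // Schema promises no nulls, so the per-value check is skipped.
          convert
        }
    }
    ```
    
    As noted above, a schema that is always nullable would take the first branch every time, so this only helps when the schema really declares a field as non-nullable.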


---



[GitHub] spark issue #21952: [SPARK-24993] [SQL] [WIP] Make Avro Fast Again

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    **[Test build #94069 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94069/testReport)** for PR 21952 at commit [`8d35f06`](https://github.com/apache/spark/commit/8d35f062f6bf384fc915cdeed7ac0d83dccafdde).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21952: [SPARK-24993] [SQL] Make Avro Fast Again

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    We can keep investigating the perf regression; this patch itself LGTM.


---



[GitHub] spark issue #21952: [SPARK-24993] [SQL] [WIP] Make Avro Fast Again

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    **[Test build #93922 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93922/testReport)** for PR 21952 at commit [`3be6906`](https://github.com/apache/spark/commit/3be6906a310c100153316d0d144e6c4180071c8e).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21952: [SPARK-24993] [SQL] Make Avro Fast Again

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21952: [SPARK-24993] [SQL] [WIP] Make Avro Fast Again

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #21952: [SPARK-24993] [SQL] [WIP] Make Avro Fast Again

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    @cloud-fan As you suggested, I benchmarked cache read performance, and it is the same on both versions. This makes sense, since it is unlikely that cache reads are slow enough to have a visible impact on Avro writing.
    
    Spark 2.4
    ```
    +-------+--------------------+
    |summary|           cacheRead|
    +-------+--------------------+
    |  count|                 100|
    |   mean|0.061929999999999985|
    | stddev|0.002450541065795...|
    |    min|               0.059|
    |    max|               0.071|
    +-------+--------------------+
    ```
    Spark 2.3
    ```
    +-------+-------------------+
    |summary|          cacheRead|
    +-------+-------------------+
    |  count|                100|
    |   mean|0.06026999999999999|
    | stddev|0.00201937584116449|
    |    min|              0.058|
    |    max|              0.069|
    +-------+-------------------+
    ```


---



[GitHub] spark issue #21952: [SPARK-24993] [SQL] [WIP] Make Avro Fast Again

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    **[Test build #94062 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94062/testReport)** for PR 21952 at commit [`069625d`](https://github.com/apache/spark/commit/069625d0e1107adc66b1c045410492c72900c16e).


---



[GitHub] spark issue #21952: [SPARK-24993] [SQL] [WIP] Make Avro Fast Again

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1716/
    Test PASSed.


---



[GitHub] spark issue #21952: [SPARK-24993] [SQL] Make Avro Fast Again

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    Do we have the same regression for Parquet? I'm wondering if the regression comes from the `FileFormat` framework.


---



[GitHub] spark issue #21952: [SPARK-24993] [SQL] Make Avro Fast Again

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    Merged into master. Thanks all for reviewing. 


---



[GitHub] spark issue #21952: [SPARK-24993] [SQL] [WIP] Make Avro Fast Again

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    This is serious and we should fix it before Spark 2.4.
    
    For the benchmark, I have 2 questions:
    1. Is the regression caused by the df cache? We can run `df.queryExecution.toRdd.foreach(_ => ())` to verify the cache reading performance.
    2. Is this related to the array type only? If we remove the array column, does the regression disappear?
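    
    A minimal sketch of how the cache-read check in question 1 could be timed, assuming wall-clock seconds per run are what is compared. `CacheReadTiming` and `timeRuns` are hypothetical helpers, and the DataFrame `df` in the trailing comment is assumed to be already cached in a spark-shell session.
    
    ```scala
    object CacheReadTiming {
      // Run `action` `runs` times and return each wall-clock duration in seconds.
      def timeRuns(runs: Int)(action: () => Unit): Seq[Double] =
        (1 to runs).map { _ =>
          val start = System.nanoTime()
          action()
          (System.nanoTime() - start) / 1e9
        }
    }
    
    // In spark-shell, with a cached DataFrame `df` (assumed to exist):
    //   val cacheRead = CacheReadTiming.timeRuns(150) { () =>
    //     df.queryExecution.toRdd.foreach(_ => ())
    //   }
    ```
    
    Running the same helper on both Spark versions would show whether reading the cached data alone differs, independent of the Avro writer.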


---



[GitHub] spark issue #21952: [SPARK-24993] [SQL] [WIP] Make Avro Fast Again

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    **[Test build #94079 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94079/testReport)** for PR 21952 at commit [`3309020`](https://github.com/apache/spark/commit/3309020e7d9102a2ef92021d43c006289d0fdd3d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21952: [SPARK-24993] [SQL] Make Avro Fast Again

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1744/
    Test PASSed.


---



[GitHub] spark issue #21952: [SPARK-24993] [SQL] [WIP] Make Avro Fast Again

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1717/
    Test PASSed.


---



[GitHub] spark issue #21952: [SPARK-24993] [SQL] Make Avro Fast Again

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    LGTM


---



[GitHub] spark issue #21952: [SPARK-24993] [SQL] [WIP] Make Avro Fast Again

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21952: [SPARK-24993] [SQL] [WIP] Make Avro Fast Again

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21952: [SPARK-24993] [SQL] [WIP] Make Avro Fast Again

Posted by dbtsai <gi...@git.apache.org>.
Github user dbtsai commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    @viirya Since you don't see the performance regression between 2.4 + builtin reader and 2.4 + databricks reader, do you think the regression is somewhere else in Spark?
    
    Can you try the 2.3 branch to confirm my finding?
    
    Thanks.


---
