You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by JoshRosen <gi...@git.apache.org> on 2016/04/28 05:48:12 UTC

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

GitHub user JoshRosen opened a pull request:

    https://github.com/apache/spark/pull/12750

    [SPARK-14972] Improve performance of JSON schema inference's compatibleType method

    This patch improves the performance of `InferSchema.compatibleType` and `inferField`. The net result of this patch is a 2x-4x speedup in local benchmarks running against cached data with a massive nested schema.
    
    The key idea is to remove unnecessary sorting in `compatibleType`'s `StructType` merging code. This code takes two structs, merges the fields with matching names, and copies over the unique fields, producing a new schema which is the union of the two structs' schemas. Previously, this code performed a very inefficient `groupBy()` to match up fields with the same name, but this is unnecessary because `inferField` already sorts structs' fields by name: since both lists of fields are sorted, we can simply merge them in a single pass.
    
    This patch also speeds up the existing field sorting in `inferField`: the old sorting code allocated unnecessary intermediate collections, while the new code uses mutable collects and performs in-place sorting.
    
    Finally, I replaced a `treeAggregate` call with `fold`: I doubt that `treeAggregate` will benefit us very much because the schemas would have to be enormous to realize large savings in network traffic. Since most schemas are probably fairly small in serialized form, they should typically fit within a direct task result and therefore can be incrementally merged at the driver as individual tasks finish. This change eliminates an entire (short) scheduler stage. 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JoshRosen/spark schema-inference-speedups

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12750.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #12750
    
----
commit 5406092f5c16189f1b6c46c1ed324f13dadd57b1
Author: Josh Rosen <jo...@databricks.com>
Date:   2016-04-28T02:12:29Z

    Stop using groupByKey to merge struct schemas.

commit 5319a6abfb37cb43c500385b21e9d905e831d24d
Author: Josh Rosen <jo...@databricks.com>
Date:   2016-04-28T02:38:21Z

    Take advantage of fact that structs' fields are already sorted by name.

commit d9793f71cf5d9e2ff5bf7d62b561e9d6e377d1a1
Author: Josh Rosen <jo...@databricks.com>
Date:   2016-04-28T03:01:47Z

    Perform in-place sort without additional allocation.

commit 4bbf4292802e475d84ec55994a4ebae3ddc2f4da
Author: Josh Rosen <jo...@databricks.com>
Date:   2016-04-28T03:04:27Z

    Replace treeAggregate with fold.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/12750


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12750#discussion_r61666430
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/Metadata.scala ---
    @@ -85,17 +85,18 @@ sealed class Metadata private[types] (private[types] val map: Map[String, Any])
       override def equals(obj: Any): Boolean = {
         obj match {
           case that: Metadata =>
    -        if (map.keySet == that.map.keySet) {
    -          map.keys.forall { k =>
    -            (map(k), that.map(k)) match {
    -              case (v0: Array[_], v1: Array[_]) =>
    -                v0.view == v1.view
    -              case (v0, v1) =>
    -                v0 == v1
    -            }
    +        map.keysIterator.forall { key =>
    --- End diff --
    
    I think it's sufficient to just check that the key sets are the same size.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12750#issuecomment-215943950
  
    **[Test build #57425 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57425/consoleFull)** for PR 12750 at commit [`48be267`](https://github.com/apache/spark/commit/48be267a03c5f8990633015e07d43e92d94fa504).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/12750#issuecomment-215945786
  
    Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12750#issuecomment-215316921
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12750#discussion_r61665640
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/InferSchema.scala ---
    @@ -246,12 +263,39 @@ private[sql] object InferSchema {
               }
     
             case (StructType(fields1), StructType(fields2)) =>
    -          val newFields = (fields1 ++ fields2).groupBy(field => field.name).map {
    -            case (name, fieldTypes) =>
    -              val dataType = fieldTypes.view.map(_.dataType).reduce(compatibleType)
    -              StructField(name, dataType, nullable = true)
    +          // Both fields1 and fields2 should be sorted by name, since inferField performs sorting.
    +          // Therefore, we can take advantage of the fact that we're merging sorted lists and skip
    +          // building a hash map or performing additional sorting.
    +          val newFields = new mutable.ArrayBuffer[StructField]()
    +
    +          var f1Idx = 0
    +          var f2Idx = 0
    +
    +          while (f1Idx < fields1.length && f2Idx < fields2.length) {
    +            val f1Name = fields1(f1Idx).name
    +            val f2Name = fields2(f2Idx).name
    +            if (f1Name == f2Name) {
    +              val dataType = compatibleType(fields1(f1Idx).dataType, fields2(f2Idx).dataType)
    +              newFields += StructField(f1Name, dataType, nullable = true)
    +              f1Idx += 1
    +              f2Idx += 1
    +            } else if (f1Name < f2Name) {
    +              newFields += fields1(f1Idx)
    +              f1Idx += 1
    +            } else { // f1Name > f2Name
    +              newFields += fields2(f2Idx)
    +              f2Idx += 1
    +            }
    +          }
    +          while (f1Idx < fields1.length) {
    +            newFields += fields1(f1Idx)
    +            f1Idx += 1
    +          }
    +          while (f2Idx < fields2.length) {
    +            newFields += fields2(f2Idx)
    +            f2Idx += 1
               }
    -          StructType(newFields.toSeq.sortBy(_.name))
    +          StructType(newFields.toSeq)
    --- End diff --
    
    Agreed; good catch.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12750#issuecomment-215949797
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57433/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12750#issuecomment-215946397
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57421/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/12750#issuecomment-215941197
  
    @NathanHowell, I played around with your code from https://github.com/apache/spark/pull/12750#issuecomment-215607825 and it was a bit slower than mine because it turns out that comparing `StructType`s for equality is really expensive and the benefits of avoiding object allocation were outweighed by the additional equality checks.
    
    There are a couple of minor tricks which can improve the performance of your approach, though. In
    
    ```scala
    if (biggerVal.dataType != smallerVal.dataType) {
              val merged = compatibleType(biggerVal.dataType, smallerVal.dataType)
              // test to see if the merged type is equivalent to one of the existing
              // StructField instances, reuse will reduce GC pressure
              if (smallerVal.dataType == merged) {
                bigger.update(biggerIdx, smallerVal)
              } else if (biggerVal.dataType == merged) {
                // do nothing, the bigger struct already has the correct field
              } else {
                // we can't reuse an existing field so allocate a new one
                bigger.update(biggerIdx, biggerVal.copy(dataType = merged))
              }
    [...]`
    ```
    
    note that  `compatibleType`'s first step is to compare its inputs for equality, so if the two input types are equal then one of those inputs will be re-used as the value of `merged`. Thus, I think you can remove some checks by rewriting this as
    
    ```scala
              val merged = compatibleType(biggerVal.dataType, smallerVal.dataType)
              if (smallerVal.dataType eq merged) {
                bigger.update(biggerIdx, smallerVal)
              } else if (biggerVal.dataType eq merged) {
                // do nothing, the bigger struct already has the correct field
              } else {
                // we can't reuse an existing field so allocate a new one
                bigger.update(biggerIdx, biggerVal.copy(dataType = merged))
              }
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by NathanHowell <gi...@git.apache.org>.

Github user NathanHowell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12750#discussion_r61526786
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/InferSchema.scala ---
    @@ -76,6 +78,15 @@ private[sql] object InferSchema {
         }
       }
     
    +  private def sortFieldsInPlace(fields: Array[StructField]): Unit = {
    +    // Note: other code relies on this sorting for correctness, so don't remove it!
    +    java.util.Arrays.sort(fields, new Comparator[StructField] {
    --- End diff --
    
    Should the `Comparator` instance be cached?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12750#issuecomment-216010060
  
    **[Test build #57461 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57461/consoleFull)** for PR 12750 at commit [`48be267`](https://github.com/apache/spark/commit/48be267a03c5f8990633015e07d43e92d94fa504).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12750#issuecomment-215308136
  
    **[Test build #57215 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57215/consoleFull)** for PR 12750 at commit [`5d34a64`](https://github.com/apache/spark/commit/5d34a646046727ffaf8f5932b2dfeae3a5c10d32).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12750#issuecomment-215946369
  
    **[Test build #57421 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57421/consoleFull)** for PR 12750 at commit [`6bf5ee6`](https://github.com/apache/spark/commit/6bf5ee634ee7e509a88e64236f18e3c6e7a07aa2).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12750#issuecomment-217966086
  
    **[Test build #58151 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58151/consoleFull)** for PR 12750 at commit [`8c7f7ad`](https://github.com/apache/spark/commit/8c7f7ad9cef1993f817e35056a9d7b572829e349).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12750#issuecomment-216376862
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57557/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12750#discussion_r61666426
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/Metadata.scala ---
    @@ -85,17 +85,18 @@ sealed class Metadata private[types] (private[types] val map: Map[String, Any])
       override def equals(obj: Any): Boolean = {
         obj match {
           case that: Metadata =>
    -        if (map.keySet == that.map.keySet) {
    -          map.keys.forall { k =>
    -            (map(k), that.map(k)) match {
    -              case (v0: Array[_], v1: Array[_]) =>
    -                v0.view == v1.view
    -              case (v0, v1) =>
    -                v0 == v1
    -            }
    +        map.keysIterator.forall { key =>
    --- End diff --
    
    Whoops, realized this is wrong if `that.map.keySet` contains keys that aren't in this key set. Will fix now.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/12750#issuecomment-217974955
  
    Merging this into master and 2.0, thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/12750#issuecomment-216396595
  
    LGTM, just one minor comment.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12750#issuecomment-216376855
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12750#issuecomment-215310152
  
    **[Test build #57212 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57212/consoleFull)** for PR 12750 at commit [`4bbf429`](https://github.com/apache/spark/commit/4bbf4292802e475d84ec55994a4ebae3ddc2f4da).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12750#issuecomment-215947085
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12750#issuecomment-215948577
  
    **[Test build #57433 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57433/consoleFull)** for PR 12750 at commit [`48be267`](https://github.com/apache/spark/commit/48be267a03c5f8990633015e07d43e92d94fa504).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12750#issuecomment-215949796
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by NathanHowell <gi...@git.apache.org>.

Github user NathanHowell commented on the pull request:

    https://github.com/apache/spark/pull/12750#issuecomment-215607825
  
    Alright, here's a few ideas that will at least reduce allocations by a bit. Your version with the merge sort is likely better than the insertion sort here but I thought I'd try something different :trollface: any chance you can run it through your benchmark?
    
    I'll put my other comments inline, generally looks good to me.
    
    ``` scala
      /**
        * Combine two sorted arrays of StructFields into a new StructType
        */
      private def compatibleFields(
          fields1: Array[StructField],
          fields2: Array[StructField]): StructType = {
        // perform an insertion sort of the smaller struct into the larger one
        val (bigger, smaller) = if (fields1.length > fields2.length) {
          (fields1.toBuffer, fields2)
        } else {
          (fields2.toBuffer, fields1)
        }
    
        var biggerIdx = 0
        var smallerIdx = 0
        while (biggerIdx < bigger.length && smallerIdx < smaller.length) {
          val biggerVal = bigger(biggerIdx)
          val smallerVal = smaller(smallerIdx)
          val comp = biggerVal.name.compareTo(smallerVal.name)
          if (comp == 0) {
            if (biggerVal.dataType != smallerVal.dataType) {
              val merged = compatibleType(biggerVal.dataType, smallerVal.dataType)
              // test to see if the merged type is equivalent to one of the existing
              // StructField instances, reuse will reduce GC pressure
              if (smallerVal.dataType == merged) {
                bigger.update(biggerIdx, smallerVal)
              } else if (biggerVal.dataType == merged) {
                // do nothing, the bigger struct already has the correct field
              } else {
                // we can't reuse an existing field so allocate a new one
                bigger.update(biggerIdx, biggerVal.copy(dataType = merged))
              }
            }
            biggerIdx += 1
            smallerIdx += 1
          } else if (comp > 0) {
            bigger.insert(biggerIdx, smallerVal)
            // bump both indexes, the bigger struct has grown
            biggerIdx += 1
            smallerIdx += 1
          } else { // comp < 0
            // advance to the next field on the bigger struct
            // nothing else to do here
            biggerIdx += 1
          }
        }
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12750#issuecomment-216014586
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12750#issuecomment-215316922
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57215/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by NathanHowell <gi...@git.apache.org>.

Github user NathanHowell commented on the pull request:

    https://github.com/apache/spark/pull/12750#issuecomment-215585269
  
    Would Guava's `Iterables.mergeSorted[T]` help out here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12750#issuecomment-215945856
  
    **[Test build #57429 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57429/consoleFull)** for PR 12750 at commit [`48be267`](https://github.com/apache/spark/commit/48be267a03c5f8990633015e07d43e92d94fa504).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12750#discussion_r61666404
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala ---
    @@ -95,7 +95,8 @@ object HiveTypeCoercion {
           Some(t1)
     
         // Promote numeric types to the highest of the two
    -    case (t1, t2) if Seq(t1, t2).forall(numericPrecedence.contains) =>
    --- End diff --
    
    The `Seq` construction, `forall` / traversable code, and `IndexedSeq.contains` calls made this line pretty expensive in the old code.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12750#issuecomment-215945376
  
    **[Test build #57425 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57425/consoleFull)** for PR 12750 at commit [`48be267`](https://github.com/apache/spark/commit/48be267a03c5f8990633015e07d43e92d94fa504).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/12750#issuecomment-215943176
  
    After a bit more time in a profiler, I was able to make this 2x faster than my previous best time.
    
    To give a rough idea of the structure of my benchmark:
    
    I have a folder containing a bunch of gigantic JSON documents with huge, deeply-nested schemas. Imagine the schema being the union of a huge number of smaller schemas, so it's pretty big and sparse.
    
    My benchmark harness reads in the raw JSON, caches it, unions it together a bunch of times to inflate the benchmark runtime (to avoid small constant factor noise), then coalesces it to a single partition to avoid task-launch overheads (and so we measure single-core performance):
    
    ```scala
    val lines = spark.read.text("...").rdd
    lines.cache.count()
    val coalescedLines =  (lines ++ lines ++ lines ++ lines ++ lines).coalesce(1)
    
    for (i <- 1 to 10) {
      val startTime = System.currentTimeMillis
      spark.read.json(coalescedLines)
      println(s"Took ${System.currentTimeMillis - startTime}")
    }
    ```
    
    Before (f5da592fc63b8d3bc09d49c196d6c5d98cd2a013), this took around 88 seconds.
    
    After (6bf5ee634ee7e509a88e64236f18e3c6e7a07aa2), this takes about **14.5 seconds**.
    
    Coupled with the changes in #12741, a change that led to a massive speedup in the `reduce` / `treeAggregate` step, this should be a huge speedup vs 1.6.x.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12750#issuecomment-215945382
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by NathanHowell <gi...@git.apache.org>.

Github user NathanHowell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12750#discussion_r61526935
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/InferSchema.scala ---
    @@ -246,12 +263,39 @@ private[sql] object InferSchema {
               }
     
             case (StructType(fields1), StructType(fields2)) =>
    -          val newFields = (fields1 ++ fields2).groupBy(field => field.name).map {
    -            case (name, fieldTypes) =>
    -              val dataType = fieldTypes.view.map(_.dataType).reduce(compatibleType)
    -              StructField(name, dataType, nullable = true)
    +          // Both fields1 and fields2 should be sorted by name, since inferField performs sorting.
    +          // Therefore, we can take advantage of the fact that we're merging sorted lists and skip
    +          // building a hash map or performing additional sorting.
    +          val newFields = new mutable.ArrayBuffer[StructField]()
    +
    +          var f1Idx = 0
    +          var f2Idx = 0
    +
    +          while (f1Idx < fields1.length && f2Idx < fields2.length) {
    +            val f1Name = fields1(f1Idx).name
    +            val f2Name = fields2(f2Idx).name
    +            if (f1Name == f2Name) {
    +              val dataType = compatibleType(fields1(f1Idx).dataType, fields2(f2Idx).dataType)
    +              newFields += StructField(f1Name, dataType, nullable = true)
    +              f1Idx += 1
    +              f2Idx += 1
    +            } else if (f1Name < f2Name) {
    +              newFields += fields1(f1Idx)
    +              f1Idx += 1
    +            } else { // f1Name > f2Name
    +              newFields += fields2(f2Idx)
    +              f2Idx += 1
    +            }
    +          }
    +          while (f1Idx < fields1.length) {
    +            newFields += fields1(f1Idx)
    +            f1Idx += 1
    +          }
    +          while (f2Idx < fields2.length) {
    +            newFields += fields2(f2Idx)
    +            f2Idx += 1
               }
    -          StructType(newFields.toSeq.sortBy(_.name))
    +          StructType(newFields.toSeq)
    --- End diff --
    
    `StructType(Seq[StructField])` just calls `toArray` on the fields... so I would use `toArray` instead of `toSeq` to reduce one allocation.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/12750#issuecomment-215948518
  
    Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12750#issuecomment-215310236
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12750#issuecomment-215316797
  
    **[Test build #57215 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57215/consoleFull)** for PR 12750 at commit [`5d34a64`](https://github.com/apache/spark/commit/5d34a646046727ffaf8f5932b2dfeae3a5c10d32).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12750#issuecomment-217942314
  
    **[Test build #58151 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/58151/consoleFull)** for PR 12750 at commit [`8c7f7ad`](https://github.com/apache/spark/commit/8c7f7ad9cef1993f817e35056a9d7b572829e349).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/12750#issuecomment-216353667
  
    ;retest


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12750#issuecomment-215303075
  
    **[Test build #57212 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57212/consoleFull)** for PR 12750 at commit [`4bbf429`](https://github.com/apache/spark/commit/4bbf4292802e475d84ec55994a4ebae3ddc2f4da).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/12750#issuecomment-216001074
  
    It looks like the failing test recently became flaky in the master branch (https://spark-tests.appspot.com/tests/org.apache.spark.mllib.stat.JavaStatisticsSuite/testCorr), so I'll come back and retest this after it's ignored / fixed there.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12750#issuecomment-217966399
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/58151/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/12750#issuecomment-215569915
  
    /cc @NathanHowell, FYI.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12750#issuecomment-215946395
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by NathanHowell <gi...@git.apache.org>.

Github user NathanHowell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12750#discussion_r61526900
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/InferSchema.scala ---
    @@ -246,12 +263,39 @@ private[sql] object InferSchema {
               }
     
             case (StructType(fields1), StructType(fields2)) =>
    -          val newFields = (fields1 ++ fields2).groupBy(field => field.name).map {
    -            case (name, fieldTypes) =>
    -              val dataType = fieldTypes.view.map(_.dataType).reduce(compatibleType)
    -              StructField(name, dataType, nullable = true)
    +          // Both fields1 and fields2 should be sorted by name, since inferField performs sorting.
    +          // Therefore, we can take advantage of the fact that we're merging sorted lists and skip
    +          // building a hash map or performing additional sorting.
    +          val newFields = new mutable.ArrayBuffer[StructField]()
    --- End diff --
    
    Consider calling `newFields.sizeHint(fields1.length max fields2.length)` here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12750#issuecomment-215949786
  
    **[Test build #57433 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57433/consoleFull)** for PR 12750 at commit [`48be267`](https://github.com/apache/spark/commit/48be267a03c5f8990633015e07d43e92d94fa504).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12750#issuecomment-216354301
  
    **[Test build #57557 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57557/consoleFull)** for PR 12750 at commit [`48be267`](https://github.com/apache/spark/commit/48be267a03c5f8990633015e07d43e92d94fa504).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12750#issuecomment-216014548
  
    **[Test build #57461 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57461/consoleFull)** for PR 12750 at commit [`48be267`](https://github.com/apache/spark/commit/48be267a03c5f8990633015e07d43e92d94fa504).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by davies <gi...@git.apache.org>.

Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12750#discussion_r61821978
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/InferSchema.scala ---
    @@ -246,12 +260,40 @@ private[sql] object InferSchema {
               }
     
             case (StructType(fields1), StructType(fields2)) =>
    -          val newFields = (fields1 ++ fields2).groupBy(field => field.name).map {
    -            case (name, fieldTypes) =>
    -              val dataType = fieldTypes.view.map(_.dataType).reduce(compatibleType)
    -              StructField(name, dataType, nullable = true)
    +          // Both fields1 and fields2 should be sorted by name, since inferField performs sorting.
    +          // Therefore, we can take advantage of the fact that we're merging sorted lists and skip
    +          // building a hash map or performing additional sorting.
    +          val newFields = new java.util.ArrayList[StructField]()
    +
    +          var f1Idx = 0
    +          var f2Idx = 0
    +
    +          while (f1Idx < fields1.length && f2Idx < fields2.length) {
    +            val f1Name = fields1(f1Idx).name
    +            val f2Name = fields2(f2Idx).name
    +            val comp = f1Name.compareTo(f2Name)
    --- End diff --
    
    Can we have a check to make sure the two list are sorted? (try to capture the bug if we forgot to sort them in some cases)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12750#issuecomment-216376350
  
    **[Test build #57557 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57557/consoleFull)** for PR 12750 at commit [`48be267`](https://github.com/apache/spark/commit/48be267a03c5f8990633015e07d43e92d94fa504).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12750#issuecomment-216014587
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57461/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12750#discussion_r61666400
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala ---
    @@ -103,6 +103,17 @@ case class StructType(fields: Array[StructField]) extends DataType with Seq[Stru
       private lazy val nameToField: Map[String, StructField] = fields.map(f => f.name -> f).toMap
       private lazy val nameToIndex: Map[String, Int] = fieldNames.zipWithIndex.toMap
     
    +  override def equals(that: Any): Boolean = {
    --- End diff --
    
    The old `StructType.equals()` was inherited from `Seq` and was fairly slow.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12750#issuecomment-217966396
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/12750#issuecomment-216353700
  
    Jenkins, retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12750#issuecomment-215310239
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57212/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12750#discussion_r61666385
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/Metadata.scala ---
    @@ -85,17 +85,18 @@ sealed class Metadata private[types] (private[types] val map: Map[String, Any])
       override def equals(obj: Any): Boolean = {
         obj match {
           case that: Metadata =>
    -        if (map.keySet == that.map.keySet) {
    --- End diff --
    
    This old implementation of `Metadata.equals()` was insanely inefficient because `map.keySet()` performs tons of object allocation. The old code also used `array.view` to compare array contents for equality, which is going to be slower than `Arrays.equals()`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/12750#issuecomment-215592580
  
    Maybe, but my hunch is that it's going to be slower and won't save much code.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12750#discussion_r61666410
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/InferSchema.scala ---
    @@ -246,12 +260,40 @@ private[sql] object InferSchema {
               }
     
             case (StructType(fields1), StructType(fields2)) =>
    -          val newFields = (fields1 ++ fields2).groupBy(field => field.name).map {
    -            case (name, fieldTypes) =>
    -              val dataType = fieldTypes.view.map(_.dataType).reduce(compatibleType)
    -              StructField(name, dataType, nullable = true)
    +          // Both fields1 and fields2 should be sorted by name, since inferField performs sorting.
    +          // Therefore, we can take advantage of the fact that we're merging sorted lists and skip
    +          // building a hash map or performing additional sorting.
    +          val newFields = new java.util.ArrayList[StructField]()
    --- End diff --
    
    Using a Java `ArrayList` here is _way_ faster than Scala's own ArrayBuffer.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/12750#discussion_r61665637
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/InferSchema.scala ---
    @@ -76,6 +78,15 @@ private[sql] object InferSchema {
         }
       }
     
    +  private def sortFieldsInPlace(fields: Array[StructField]): Unit = {
    +    // Note: other code relies on this sorting for correctness, so don't remove it!
    +    java.util.Arrays.sort(fields, new Comparator[StructField] {
    --- End diff --
    
    Good catch. Yes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12750#issuecomment-215943346
  
    **[Test build #57421 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57421/consoleFull)** for PR 12750 at commit [`6bf5ee6`](https://github.com/apache/spark/commit/6bf5ee634ee7e509a88e64236f18e3c6e7a07aa2).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12750#issuecomment-215947083
  
    **[Test build #57429 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57429/consoleFull)** for PR 12750 at commit [`48be267`](https://github.com/apache/spark/commit/48be267a03c5f8990633015e07d43e92d94fa504).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12750#issuecomment-215947086
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57429/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/12750#issuecomment-216008877
  
    Jenkins, retest this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-14972] Improve performance of JSON sche...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12750#issuecomment-215945383
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57425/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org