Posted to reviews@spark.apache.org by yanbohappy <gi...@git.apache.org> on 2015/02/11 09:46:01 UTC

[GitHub] spark pull request: [SQL] Reuse mutable row for each record at jso...

GitHub user yanbohappy opened a pull request:

    https://github.com/apache/spark/pull/4527

    [SQL] Reuse mutable row for each record at jsonStringToRow

    When converting a JSON string to a row, reuse a single mutable row across records instead of creating a new one for every record. For each nested struct type in a record, however, we still need to create a new row.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yanbohappy/spark jsonStringToRowOptimization

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4527.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4527
    
----
commit b0c2b145950c18e30a8e88086e018cb66931fbec
Author: Yanbo Liang <ya...@gmail.com>
Date:   2015-02-11T08:36:50Z

    [SQL] Reuse mutable row for each record at jsonStringToRow

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5738] [SQL] Reuse mutable row for each ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4527#issuecomment-74059744
  
      [Test build #27351 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27351/consoleFull) for   PR 4527 at commit [`c30a358`](https://github.com/apache/spark/commit/c30a358b927de171487b1ce5063714e4d6ef25bf).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-5738] [SQL] Reuse mutable row for each ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4527#issuecomment-74421311
  
      [Test build #27522 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27522/consoleFull) for   PR 4527 at commit [`2d45c68`](https://github.com/apache/spark/commit/2d45c68f4c61408ec650fba3214a19446d96ce37).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-5738] [SQL] Reuse mutable row for each ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/4527#issuecomment-73924548
  
    Thank you for working on it.
    
    It seems that `new SpecificMutableRow(schema.fields.map(_.dataType))` cannot handle nested structures. I think we need to use the schema to create the top-level mutable row and all inner rows (one for each inner `StructType`).
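
A minimal sketch of allocating reusable rows from a schema, including one reusable row per inner struct. The `DataType`, `AtomicType`, `StructType`, and `MutableRow` definitions here are simplified, hypothetical stand-ins written for illustration; they are not Spark's classes and this is not the code from the pull request.

```scala
object NestedRowSketch {
  // Simplified stand-ins for Spark SQL's data types (assumption: not the real classes).
  sealed trait DataType
  case object AtomicType extends DataType
  final case class StructType(fields: Vector[(String, DataType)]) extends DataType

  // A reusable row with one slot per field. Every nested struct field gets
  // its own reusable child row, allocated once from the schema.
  final class MutableRow(val schema: StructType) {
    val values = new Array[Any](schema.fields.length)

    private val children: Array[MutableRow] =
      Array.tabulate(schema.fields.length) { i =>
        schema.fields(i)._2 match {
          case nested: StructType => new MutableRow(nested)
          case _                  => null
        }
      }

    // Overwrites all slots from one parsed record and returns this same row.
    def fill(record: Map[String, Any]): MutableRow = {
      var i = 0
      while (i < values.length) {
        val (name, dataType) = schema.fields(i)
        values(i) = (dataType, record.get(name)) match {
          case (_: StructType, Some(inner: Map[_, _])) =>
            children(i).fill(inner.asInstanceOf[Map[String, Any]])
          case (_: StructType, _) => null // struct missing or malformed
          case (_, Some(value))   => value
          case (_, None)          => null // missing field
        }
        i += 1
      }
      this
    }
  }
}
```

With this layout, filling the same `MutableRow` twice returns the same top-level instance and the same inner instance, so anything that needs to keep an earlier record has to copy it before the next `fill`.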




[GitHub] spark pull request: [SPARK-5738] [SQL] Reuse mutable row for each ...

Posted by yanbohappy <gi...@git.apache.org>.
Github user yanbohappy commented on the pull request:

    https://github.com/apache/spark/pull/4527#issuecomment-74059428
  
    @chenghao-intel @yhuai 
    Thank you for your advice; it's very useful.
    We can now use mutable rows for both top-level records and inner structures.




[GitHub] spark pull request: [SPARK-5738] [SQL] Reuse mutable row for each ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4527#issuecomment-74410038
  
      [Test build #27513 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27513/consoleFull) for   PR 4527 at commit [`6cd26fe`](https://github.com/apache/spark/commit/6cd26fe6fbd5a2aa88a325933810d43c7dd39b57).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-5738] [SQL] Reuse mutable row for each ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/4527#issuecomment-73927673
  
    Oh, `enforceCorrectType` will take care of inner structures by calling `asRow`.
    
    It would be great if we could use mutable rows for inner structures as well.




[GitHub] spark pull request: [SPARK-5738] [SQL] Reuse mutable row for each ...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4527#discussion_r24503309
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala ---
    @@ -39,7 +39,19 @@ private[sql] object JsonRDD extends Logging {
           json: RDD[String],
           schema: StructType,
           columnNameOfCorruptRecords: String): RDD[Row] = {
    -    parseJson(json, columnNameOfCorruptRecords).map(parsed => asRow(parsed, schema))
    +    // Reuse the mutable row for each record, however we still need to 
    +    // create a new row for every nested struct type in each record
    +    val mutableRow = new SpecificMutableRow(schema.fields.map(_.dataType))
    --- End diff --
    
    Move this inside `mapPartitions` to reduce the closure serialization overhead. Also, I don't see any benefit from using `SpecificMutableRow` here; why not just use `GenericMutableRow` instead?
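
A minimal sketch of the per-partition allocation pattern suggested above, written against plain Scala iterators so that it runs without Spark. `convertPartition` is a hypothetical name, and an `Array[Any]` stands in for the mutable row; in the pull request the same shape would sit inside the function passed to `mapPartitions`.

```scala
object RowReuseSketch {
  // Hypothetical stand-in for one parsed JSON record: field name -> value.
  type Parsed = Map[String, Any]

  // Converts all records of one partition. The buffer is created inside this
  // function, once per partition, so no outer closure has to capture it.
  def convertPartition(
      fieldNames: Array[String],
      records: Iterator[Parsed]): Iterator[Array[Any]] = {
    val buffer = new Array[Any](fieldNames.length)
    records.map { parsed =>
      var i = 0
      while (i < fieldNames.length) {
        // Missing fields become null, overwriting the previous record's value.
        buffer(i) = parsed.getOrElse(fieldNames(i), null)
        i += 1
      }
      buffer // the same instance is handed out for every record
    }
  }
}
```

Because every element of the returned iterator is the same array, a consumer that retains rows (for example by collecting them into a list) needs to copy each one first, for instance with `convertPartition(names, records).map(_.clone())`.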




[GitHub] spark pull request: [SPARK-5738] [SQL] Reuse mutable row for each ...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4527#discussion_r24504239
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala ---
    @@ -39,7 +39,19 @@ private[sql] object JsonRDD extends Logging {
           json: RDD[String],
           schema: StructType,
           columnNameOfCorruptRecords: String): RDD[Row] = {
    -    parseJson(json, columnNameOfCorruptRecords).map(parsed => asRow(parsed, schema))
    +    // Reuse the mutable row for each record, however we still need to 
    +    // create a new row for every nested struct type in each record
    +    val mutableRow = new SpecificMutableRow(schema.fields.map(_.dataType))
    +    parseJson(json, columnNameOfCorruptRecords).mapPartitions( iter => {
    +      iter.map { parsed =>
    +        schema.fields.zipWithIndex.foreach {
    --- End diff --
    
    BTW, I believe that `schema.fields.zipWithIndex` will create temporary objects; use a `for` loop instead.
    ```
    for (i <- 0 until schema.fields.length) {
      val fname = schema.fields(i).name
      val ftype = schema.fields(i).dataType
    
      mutableRow(i) = ...
    }
    ```
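
A small sketch contrasting the two loop styles on plain arrays. The names `fillWithZip`, `fillWithIndex`, and `lookup` are hypothetical. Note that the second variant uses a `while` loop rather than the `for` loop from the comment above: this is a substitution on my part, since a Scala `for` over a `Range` is itself translated into a `foreach` call with a closure, whereas `while` involves no intermediate objects.

```scala
object FieldLoopSketch {
  // zipWithIndex first builds a new Array[(String, Int)] of tuples,
  // then iterates over it.
  def fillWithZip(names: Array[String], lookup: String => Any, out: Array[Any]): Unit =
    names.zipWithIndex.foreach { case (name, i) => out(i) = lookup(name) }

  // Index-based loop: no intermediate collection and no tuples.
  def fillWithIndex(names: Array[String], lookup: String => Any, out: Array[Any]): Unit = {
    var i = 0
    while (i < names.length) {
      out(i) = lookup(names(i))
      i += 1
    }
  }
}
```

Both functions write the same values into `out`; whether the difference is measurable depends on the workload and should be checked with a benchmark.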




[GitHub] spark pull request: [SPARK-5738] [SQL] Reuse mutable row for each ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4527#issuecomment-74424393
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27524/
    Test PASSed.




[GitHub] spark pull request: [SQL] Reuse mutable row for each record at jso...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4527#issuecomment-73853540
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27284/
    Test FAILed.




[GitHub] spark pull request: [SPARK-5738] [SQL] Reuse mutable row for each ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4527#issuecomment-74424550
  
      [Test build #27522 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27522/consoleFull) for   PR 4527 at commit [`2d45c68`](https://github.com/apache/spark/commit/2d45c68f4c61408ec650fba3214a19446d96ce37).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-5738] [SQL] Reuse mutable row for each ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4527#issuecomment-73864166
  
      [Test build #27288 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27288/consoleFull) for   PR 4527 at commit [`b0c2b14`](https://github.com/apache/spark/commit/b0c2b145950c18e30a8e88086e018cb66931fbec).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SQL] Reuse mutable row for each record at jso...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4527#issuecomment-73849199
  
      [Test build #27284 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27284/consoleFull) for   PR 4527 at commit [`b0c2b14`](https://github.com/apache/spark/commit/b0c2b145950c18e30a8e88086e018cb66931fbec).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-5738] [SQL] Reuse mutable row for each ...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/4527#issuecomment-73925301
  
    Also, can you add performance numbers?




[GitHub] spark pull request: [SPARK-5738] [SQL] Reuse mutable row for each ...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/4527#issuecomment-87971697
  
    Thanks - sorry for not having looked at this earlier. Do you see any performance gains with this change? My understanding is that JSON is already very slow, and thus the code path is hard to optimize.




[GitHub] spark pull request: [SQL] Reuse mutable row for each record at jso...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4527#issuecomment-73853535
  
      [Test build #27284 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27284/consoleFull) for   PR 4527 at commit [`b0c2b14`](https://github.com/apache/spark/commit/b0c2b145950c18e30a8e88086e018cb66931fbec).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-5738] [SQL] Reuse mutable row for each ...

Posted by yanbohappy <gi...@git.apache.org>.
Github user yanbohappy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4527#discussion_r24574169
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala ---
    @@ -39,7 +39,19 @@ private[sql] object JsonRDD extends Logging {
           json: RDD[String],
           schema: StructType,
           columnNameOfCorruptRecords: String): RDD[Row] = {
    -    parseJson(json, columnNameOfCorruptRecords).map(parsed => asRow(parsed, schema))
    +    // Reuse the mutable row for each record, however we still need to 
    +    // create a new row for every nested struct type in each record
    +    val mutableRow = new SpecificMutableRow(schema.fields.map(_.dataType))
    --- End diff --
    
    You are right; it's not appropriate to use `SpecificMutableRow` here. I will change back to `GenericMutableRow`.




[GitHub] spark pull request: [SPARK-5738] [SQL] Reuse mutable row for each ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4527#issuecomment-74424392
  
      [Test build #27524 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27524/consoleFull) for   PR 4527 at commit [`2286ac5`](https://github.com/apache/spark/commit/2286ac550a16ac3c1c2c7dfaf9b8b20924bdf56a).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-5738] [SQL] Reuse mutable row for each ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4527#issuecomment-74410182
  
      [Test build #27514 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27514/consoleFull) for   PR 4527 at commit [`7039fa7`](https://github.com/apache/spark/commit/7039fa7914f36141ea0ba2d340d365d5254acbd3).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-5738] [SQL] Reuse mutable row for each ...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang closed the pull request at:

    https://github.com/apache/spark/pull/4527




[GitHub] spark pull request: [SPARK-5738] [SQL] Reuse mutable row for each ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4527#issuecomment-74066906
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27351/
    Test PASSed.




[GitHub] spark pull request: [SPARK-5738] [SQL] Reuse mutable row for each ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4527#issuecomment-74421514
  
      [Test build #27524 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27524/consoleFull) for   PR 4527 at commit [`2286ac5`](https://github.com/apache/spark/commit/2286ac550a16ac3c1c2c7dfaf9b8b20924bdf56a).
     * This patch merges cleanly.




[GitHub] spark pull request: [SQL] Reuse mutable row for each record at jso...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4527#issuecomment-73855676
  
      [Test build #27288 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27288/consoleFull) for   PR 4527 at commit [`b0c2b14`](https://github.com/apache/spark/commit/b0c2b145950c18e30a8e88086e018cb66931fbec).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-5738] [SQL] Reuse mutable row for each ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4527#issuecomment-74424555
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27522/
    Test PASSed.




[GitHub] spark pull request: [SPARK-5738] [SQL] Reuse mutable row for each ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4527#issuecomment-74410071
  
      [Test build #27513 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27513/consoleFull) for   PR 4527 at commit [`6cd26fe`](https://github.com/apache/spark/commit/6cd26fe6fbd5a2aa88a325933810d43c7dd39b57).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-5738] [SQL] Reuse mutable row for each ...

Posted by chenghao-intel <gi...@git.apache.org>.
Github user chenghao-intel commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4527#discussion_r24503655
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala ---
    @@ -39,7 +39,19 @@ private[sql] object JsonRDD extends Logging {
           json: RDD[String],
           schema: StructType,
           columnNameOfCorruptRecords: String): RDD[Row] = {
    -    parseJson(json, columnNameOfCorruptRecords).map(parsed => asRow(parsed, schema))
    +    // Reuse the mutable row for each record, however we still need to 
    +    // create a new row for every nested struct type in each record
    +    val mutableRow = new SpecificMutableRow(schema.fields.map(_.dataType))
    +    parseJson(json, columnNameOfCorruptRecords).mapPartitions( iter => {
    +      iter.map { parsed =>
    +        schema.fields.zipWithIndex.foreach {
    --- End diff --
    
    This duplicates the logic in the function `asRow`. Can we add an additional parameter to `asRow`, say:
    ```
    def asRow(json: Map[String, Any], schema: StructType, mutable: GenericMutableRow = null): Row = {
      // Reuse the caller's row when one is supplied; otherwise allocate a new one.
      val row = if (mutable == null) {
        new GenericMutableRow(schema.fields.length)
      } else {
        mutable
      }
      // ... populate `row` from `json` as before ...
      row
    }
    ```
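
A runnable sketch of the same idea on plain Scala types, assuming an `Array[Any]` in place of `GenericMutableRow` and an array of field names in place of `StructType`. The parameter name `reuse` and the `require` check are illustrative additions, not part of the review comment.

```scala
object AsRowSketch {
  // Converts one parsed record into a row. When `reuse` is supplied it is
  // overwritten and returned; otherwise a fresh array is allocated.
  def asRow(
      json: Map[String, Any],
      fieldNames: Array[String],
      reuse: Array[Any] = null): Array[Any] = {
    val row = if (reuse == null) new Array[Any](fieldNames.length) else reuse
    require(row.length == fieldNames.length, "row width must match the schema")
    var i = 0
    while (i < fieldNames.length) {
      row(i) = json.getOrElse(fieldNames(i), null)
      i += 1
    }
    row
  }
}
```

Keeping one function for both paths means existing callers that omit the third argument continue to receive a fresh row per call, while the scan path can pass in a shared one.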




[GitHub] spark pull request: [SPARK-5738] [SQL] Reuse mutable row for each ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4527#issuecomment-74066899
  
      [Test build #27351 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27351/consoleFull) for   PR 4527 at commit [`c30a358`](https://github.com/apache/spark/commit/c30a358b927de171487b1ce5063714e4d6ef25bf).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-5738] [SQL] Reuse mutable row for each ...

Posted by yanbohappy <gi...@git.apache.org>.
Github user yanbohappy commented on the pull request:

    https://github.com/apache/spark/pull/4527#issuecomment-74424908
  
    This improvement is very similar to #758, so I ran a similar performance test.
    The benchmark suggests that this optimization makes scanning a JSON table about 1.5x to 2x faster, but the gain depends on the JSON schema, especially on whether different records have different schemas.
    For a JSON file with 188010 lines, the time consumed by the build scan is:
    original: Takes 15598 ms
    optimized: Takes 10152 ms
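
As a quick check of the figures above, the speedup for this particular file can be computed directly from the two timings. The object and value names below are illustrative only.

```scala
object ScanSpeedup {
  // Timings reported above for the 188010-line JSON file.
  val originalMs = 15598.0
  val optimizedMs = 10152.0

  // Ratio of old to new scan time: roughly 1.54 for this file.
  val speedup = originalMs / optimizedMs

  // Fraction of the original scan time that was saved: roughly 0.35.
  val timeSaved = 1.0 - optimizedMs / originalMs
}
```

This single measurement sits at the lower end of the 1.5x to 2x range quoted above.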





[GitHub] spark pull request: [SPARK-5738] [SQL] Reuse mutable row for each ...

Posted by yanbohappy <gi...@git.apache.org>.
Github user yanbohappy commented on the pull request:

    https://github.com/apache/spark/pull/4527#issuecomment-74388397
  
    @yhuai 
    This improvement is very similar to #758, so I reused the performance test from there.
    The benchmark suggests that this optimization makes scanning a JSON table about 1.5x faster, but the result is not very stable.
    For a JSON file with 188010 lines, the time consumed by the build scan is:
    original: Takes 15598 ms
    optimized: Takes 10152 ms




[GitHub] spark pull request: [SPARK-5738] [SQL] Reuse mutable row for each ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4527#issuecomment-74410073
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27513/
    Test FAILed.




[GitHub] spark pull request: [SQL] Reuse mutable row for each record at jso...

Posted by yanbohappy <gi...@git.apache.org>.
Github user yanbohappy commented on the pull request:

    https://github.com/apache/spark/pull/4527#issuecomment-73855155
  
    retest this please




[GitHub] spark pull request: [SPARK-5738] [SQL] Reuse mutable row for each ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4527#issuecomment-74410504
  
      [Test build #27514 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27514/consoleFull) for   PR 4527 at commit [`7039fa7`](https://github.com/apache/spark/commit/7039fa7914f36141ea0ba2d340d365d5254acbd3).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-5738] [SQL] Reuse mutable row for each ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4527#issuecomment-74410507
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27514/
    Test FAILed.




[GitHub] spark pull request: [SPARK-5738] [SQL] Reuse mutable row for each ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4527#issuecomment-73864173
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27288/
    Test PASSed.




[GitHub] spark pull request: [SQL] Reuse mutable row for each record at jso...

Posted by yanbohappy <gi...@git.apache.org>.
Github user yanbohappy commented on the pull request:

    https://github.com/apache/spark/pull/4527#issuecomment-73850498
  
    https://issues.apache.org/jira/browse/SPARK-5738




[GitHub] spark pull request: [SPARK-5738] [SQL] Reuse mutable row for each ...

Posted by yanbohappy <gi...@git.apache.org>.
Github user yanbohappy commented on the pull request:

    https://github.com/apache/spark/pull/4527#issuecomment-74424959
  
    cc @liancheng @rxin @yhuai 

