Posted to reviews@spark.apache.org by NathanHowell <gi...@git.apache.org> on 2015/04/30 09:24:12 UTC

[GitHub] spark pull request: [SPARK-5938][SQL] Improve JsonRDD performance

GitHub user NathanHowell opened a pull request:

    https://github.com/apache/spark/pull/5801

    [SPARK-5938][SQL] Improve JsonRDD performance

    This patch comprises a few related pieces of work:
    
    * Schema inference is performed directly on the JSON token stream
    * `String => Row` conversion populates Spark SQL structures without intermediate types
    * Projection pushdown is implemented via CatalystScan for DataFrame queries
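
As an illustration of the first item, here is a minimal sketch of inferring a schema in a single pass over a token stream, without first building a tree or map of parsed values. It is not the code from this patch: the `Token` and `FieldType` hierarchies, `SchemaInference`, and the widening rules in `merge` are all hypothetical stand-ins, and only flat records are handled.

```scala
// Hypothetical tokens standing in for what a streaming JSON parser would emit.
sealed trait Token
case object StartObject extends Token
case object EndObject extends Token
final case class FieldName(name: String) extends Token
final case class StringValue(value: String) extends Token
final case class IntValue(value: Long) extends Token
final case class FloatValue(value: Double) extends Token
final case class BoolValue(value: Boolean) extends Token
case object NullValue extends Token

// Hypothetical, simplified column types (not Spark SQL's DataType classes).
sealed trait FieldType
case object NullType extends FieldType
case object BooleanType extends FieldType
case object LongType extends FieldType
case object DoubleType extends FieldType
case object StringType extends FieldType

object SchemaInference {
  // Widen two observed types to a type that can hold both (assumed rules).
  def merge(a: FieldType, b: FieldType): FieldType = (a, b) match {
    case (x, y) if x == y => x
    case (NullType, other) => other
    case (other, NullType) => other
    case (LongType, DoubleType) | (DoubleType, LongType) => DoubleType
    case _ => StringType // incompatible types fall back to string
  }

  // Map a value token to its type; structural tokens carry no type.
  def typeOf(token: Token): Option[FieldType] = token match {
    case _: StringValue => Some(StringType)
    case _: IntValue    => Some(LongType)
    case _: FloatValue  => Some(DoubleType)
    case _: BoolValue   => Some(BooleanType)
    case NullValue      => Some(NullType)
    case StartObject | EndObject | FieldName(_) => None
  }

  // Infer the schema of one flat record by walking its tokens once.
  def inferRecord(tokens: Iterator[Token]): Map[String, FieldType] = {
    var schema = Map.empty[String, FieldType]
    var currentField = Option.empty[String]
    tokens.foreach {
      case FieldName(name) => currentField = Some(name)
      case token =>
        for (name <- currentField; observed <- typeOf(token)) {
          val merged = schema.get(name).map(merge(_, observed)).getOrElse(observed)
          schema = schema.updated(name, merged)
        }
    }
    schema
  }

  // Combine two per-record schemas; a field missing on one side counts as null.
  def mergeSchemas(a: Map[String, FieldType], b: Map[String, FieldType]): Map[String, FieldType] =
    (a.keySet ++ b.keySet).map { k =>
      k -> merge(a.getOrElse(k, NullType), b.getOrElse(k, NullType))
    }.toMap
}

object SchemaInferenceDemo {
  def main(args: Array[String]): Unit = {
    val rec1 = Iterator[Token](StartObject, FieldName("id"), IntValue(1),
      FieldName("score"), IntValue(3), EndObject)
    val rec2 = Iterator[Token](StartObject, FieldName("id"), IntValue(2),
      FieldName("score"), FloatValue(2.5), FieldName("tag"), NullValue, EndObject)
    val schema = SchemaInference.mergeSchemas(
      SchemaInference.inferRecord(rec1), SchemaInference.inferRecord(rec2))
    println(schema)
  }
}
```

Because `mergeSchemas` is associative under these rules, per-record schemas could be combined in any order, which is what makes this style of inference easy to run as a distributed reduce.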
    
    I've run some basic queries on a 300MB/100k row dataset with a flat schema and the results are promising:
    
    * Before: ```INFO DAGScheduler: Job 8 finished: count at <console>:20, took 2.916653 s```
    * After: ```INFO DAGScheduler: Job 8 finished: count at <console>:20, took 2.184896 s```
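
The second and third items in the description above (populating rows directly and projection pushdown) can be sketched together. This is only an illustration under assumptions, not the patch's implementation: `ProjectedRowBuilder` is a hypothetical name, the `(String, Any)` pairs stand in for field events from a streaming parser, and a plain `Array[Any]` stands in for a Spark SQL row.

```scala
// Builds rows that contain only the requested columns, in the requested order.
final class ProjectedRowBuilder(requiredColumns: Seq[String]) {
  // Column name -> position in the output row, computed once per scan.
  private val index: Map[String, Int] = requiredColumns.zipWithIndex.toMap

  // Write each requested field straight into its slot; fields that were not
  // requested are skipped, and requested fields that never appear stay null.
  def buildRow(fields: Iterator[(String, Any)]): Array[Any] = {
    val row = new Array[Any](requiredColumns.length)
    fields.foreach { case (name, value) =>
      index.get(name).foreach(i => row(i) = value)
    }
    row
  }
}

object ProjectedRowBuilderDemo {
  def main(args: Array[String]): Unit = {
    val builder = new ProjectedRowBuilder(Seq("col3", "col1"))
    val row = builder.buildRow(Iterator("col1" -> 1L, "col2" -> "x", "col3" -> 2.5))
    println(row.mkString("[", ", ", "]"))
  }
}
```

The point of the design is that no per-record map of all fields is ever materialized; the only allocation per record is the output array, sized to the projection.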

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/NathanHowell/spark json-performance

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/5801.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5801
    
----
commit 1e441e23a2cfd8712720a728056e363e41538d1f
Author: Nathan Howell <nh...@godaddy.com>
Date:   2015-04-29T05:44:19Z

    Eliminate arrow pattern, replace with pattern matches

commit 73a56927d09c670eb62317f611c47a90096fe693
Author: Nathan Howell <nh...@godaddy.com>
Date:   2015-04-27T22:38:28Z

    Improve JSON parsing and type inference performance

commit 1abf1d6010c71cd1cffa97d7564f8fb71eb19f10
Author: Nathan Howell <nh...@godaddy.com>
Date:   2015-04-30T02:16:33Z

    Add projection pushdown support to JsonRDD

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5938][SPARK-5443][SQL] Improve JsonRDD ...

Posted by NathanHowell <gi...@git.apache.org>.
Github user NathanHowell commented on the pull request:

    https://github.com/apache/spark/pull/5801#issuecomment-97959395
  
    Benchmarked a small-ish real dataset... Runs are with 5 executors (for 5 input splits) with data in HDFS:
    
    step | before | after
    -----|--------|------
    `val df = sqlContext.jsonRDD(...)` - schema inference | 37.14s | 18.16s
    `df.count()` | 125.8s | 25.7s
    `df.select("col1").count()` | 96.9s | 26.5s
    
    Not sure why, but the new code seems a bit slower when using projection pushdown (26.5s for `df.select("col1").count()` vs. 25.7s for `df.count()`). It may be schema-dependent, or it may be overhead from evaluating the projection expression.
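
For reference, the speedups implied by the table can be computed directly from the reported timings. The snippet below is only a convenience calculation; the `Timing` and `Speedup` names are made up for this illustration, and the numbers are copied from the table above.

```scala
object Speedup {
  // One benchmark step with wall-clock times in seconds.
  final case class Timing(step: String, beforeSec: Double, afterSec: Double) {
    def speedup: Double = beforeSec / afterSec
  }

  // Timings as reported in the comment above.
  val timings: Seq[Timing] = Seq(
    Timing("schema inference", 37.14, 18.16),
    Timing("df.count()", 125.8, 25.7),
    Timing("df.select(\"col1\").count()", 96.9, 26.5)
  )

  def main(args: Array[String]): Unit =
    timings.foreach(t => println(f"${t.step}%-28s ${t.speedup}%.2fx"))
}
```

That works out to roughly 2.0x for schema inference, 4.9x for `df.count()`, and 3.7x for `df.select("col1").count()`. After the patch, the projected count is about 0.8s (around 3%) slower than the plain count, which is the gap the comment refers to.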




[GitHub] spark pull request: [SPARK-5938][SPARK-5443][SQL] Improve JsonRDD ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5801#issuecomment-97989149
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31449/
    Test PASSed.




[GitHub] spark pull request: [SPARK-5938][SQL] Improve JsonRDD performance

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5801#issuecomment-97696753
  
      [Test build #31399 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31399/consoleFull) for   PR 5801 at commit [`1abf1d6`](https://github.com/apache/spark/commit/1abf1d6010c71cd1cffa97d7564f8fb71eb19f10).




[GitHub] spark pull request: [SPARK-5938][SPARK-5443][SQL] Improve JsonRDD ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5801#issuecomment-97989148
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-5938][SQL] Improve JsonRDD performance

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/5801#issuecomment-97696293
  
    Jenkins, ok to test.




[GitHub] spark pull request: [SPARK-5938][SQL] Improve JsonRDD performance

Posted by NathanHowell <gi...@git.apache.org>.
Github user NathanHowell commented on the pull request:

    https://github.com/apache/spark/pull/5801#issuecomment-97699564
  
    Looks like it may also resolve [SPARK-5443](https://issues.apache.org/jira/browse/SPARK-5443).




[GitHub] spark pull request: [SPARK-5938][SPARK-5443][SQL] Improve JsonRDD ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5801#issuecomment-97732136
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31399/
    Test PASSed.




[GitHub] spark pull request: [SPARK-5938][SQL] Improve JsonRDD performance

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5801#issuecomment-97696611
  
    Merged build started.




[GitHub] spark pull request: [SPARK-5938][SPARK-5443][SQL] Improve JsonRDD ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5801#issuecomment-97960946
  
    Merged build started.




[GitHub] spark pull request: [SPARK-5938][SPARK-5443][SQL] Improve JsonRDD ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5801#issuecomment-97732135
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-5938][SPARK-5443][SQL] Improve JsonRDD ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5801#issuecomment-97961787
  
      [Test build #31449 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31449/consoleFull) for   PR 5801 at commit [`55c2f39`](https://github.com/apache/spark/commit/55c2f391aa727db5ebd62716f3219bb13d236fb2).




[GitHub] spark pull request: [SPARK-5938][SQL] Improve JsonRDD performance

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/5801#issuecomment-97699857
  
    Can you put both JIRA tickets in the title? It will then be automatically linked to both tickets.




[GitHub] spark pull request: [SPARK-5938][SPARK-5443][SQL] Improve JsonRDD ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5801#issuecomment-97732132
  
      [Test build #31399 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31399/consoleFull) for   PR 5801 at commit [`1abf1d6`](https://github.com/apache/spark/commit/1abf1d6010c71cd1cffa97d7564f8fb71eb19f10).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class KMeansModel (`
      * `trait PMMLExportable `
    
     * This patch **adds the following new dependencies:**
       * `jaxb-api-2.2.7.jar`
       * `jaxb-core-2.2.7.jar`
       * `jaxb-impl-2.2.7.jar`
       * `pmml-agent-1.1.15.jar`
       * `pmml-model-1.1.15.jar`
       * `pmml-schema-1.1.15.jar`
    
     * This patch **removes the following dependencies:**
       * `activation-1.1.jar`
       * `jaxb-api-2.2.2.jar`
       * `jaxb-impl-2.2.3-1.jar`





[GitHub] spark pull request: [SPARK-5938][SQL] Improve JsonRDD performance

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/5801#issuecomment-97696331
  
    I won't have time to look at this today, but this is pretty cool.





[GitHub] spark pull request: [SPARK-5938][SPARK-5443][SQL] Improve JsonRDD ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5801#issuecomment-97960884
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-5938][SQL] Improve JsonRDD performance

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5801#issuecomment-97692744
  
    Can one of the admins verify this patch?




[GitHub] spark pull request: [SPARK-5938][SQL] Improve JsonRDD performance

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5801#issuecomment-97696599
  
     Merged build triggered.




[GitHub] spark pull request: [SPARK-5938][SPARK-5443][SQL] Improve JsonRDD ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5801#issuecomment-97989141
  
      [Test build #31449 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31449/consoleFull) for   PR 5801 at commit [`55c2f39`](https://github.com/apache/spark/commit/55c2f391aa727db5ebd62716f3219bb13d236fb2).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.

