You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by yhuai <gi...@git.apache.org> on 2014/07/21 03:14:30 UTC

[GitHub] spark pull request: [SPARK-2603][SQL] Remove unnecessary toMap and...

GitHub user yhuai opened a pull request:

    https://github.com/apache/spark/pull/1504

    [SPARK-2603][SQL] Remove unnecessary toMap and toList in converting Java collections to Scala collections JsonRDD.scala

    In JsonRDD.scalafy, we are using toMap/toList to convert a Java Map/List to a Scala one. These two operations are pretty expensive because they read elements from a Java Map/List and then load to a Scala Map/List. We can use Scala wrappers to wrap those Java collections instead of using toMap/toList.
    
    I did a quick test to see the performance. I had a 2.9GB cached RDD[String] storing one JSON object per record. I tried `sqlContext.jsonRDD(jsonData)` to see the performance. The job running for schema inference had one stage and there were 48 tasks. These tasks were executed sequentially.
    
    ```
    Original:
    Run 1: 1.5 min
    Run 2: 1.4 min
    Run 3: 1.5 min
    
    With this change:
    Run 1: 1.2 min
    Run 2: 1.2 min
    Run 3: 1.2 min
    ```
    
    JIRA: https://issues.apache.org/jira/browse/SPARK-2603

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yhuai/spark removeToMapToList

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1504.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1504
    
----
commit d1abdb8dc8dc2c8d9abedd5e2f8eff5f1f754e2b
Author: Yin Huai <hu...@cse.ohio-state.edu>
Date:   2014-07-21T00:29:32Z

    Remove unnecessary toMap and toList.

commit 09b9bca2773c259a8d660a302252b994f1bf821e
Author: Yin Huai <hu...@cse.ohio-state.edu>
Date:   2014-07-21T01:01:54Z

    Merge remote-tracking branch 'upstream/master' into removeToMapToList

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2603][SQL] Remove unnecessary toMap and...

Posted by advancedxy <gi...@git.apache.org>.
Github user advancedxy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1504#discussion_r15208849
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala ---
    @@ -239,9 +239,9 @@ private[sql] object JsonRDD extends Logging {
           // .map(identity) is used as a workaround of non-serializable Map
           // generated by .mapValues.
           // This issue is documented at https://issues.scala-lang.org/browse/SI-7005
    -      map.toMap.mapValues(scalafy).map(identity)
    +      JMapWrapper(map).mapValues(scalafy).map(identity)
    --- End diff --
    
    Just a little notice: `.asScala` returns mutable collections. Maybe that's not what you want.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2603][SQL] Remove unnecessary toMap and...

Posted by aarondav <gi...@git.apache.org>.
Github user aarondav commented on the pull request:

    https://github.com/apache/spark/pull/1504#issuecomment-49568566
  
    Looks like this failed a couple JsonSuite tests.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2603][SQL] Remove unnecessary toMap and...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1504#issuecomment-49568518
  
    QA results for PR 1504:<br>- This patch FAILED unit tests.<br>- This patch merges cleanly<br>- This patch adds no public classes<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16888/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2603][SQL] Remove unnecessary toMap and...

Posted by yhuai <gi...@git.apache.org>.
Github user yhuai commented on the pull request:

    https://github.com/apache/spark/pull/1504#issuecomment-49569594
  
    yeah... I did not change the pattern matching logic in the type inference part accordingly. Also, with this fix, my test did not show significant performance improvement on the schema inference. But, seems the query performance was improved. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2603][SQL] Remove unnecessary toMap and...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/1504


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2603][SQL] Remove unnecessary toMap and...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/1504#issuecomment-50057075
  
    Thanks!  Merged into Master and 1.0


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2603][SQL] Remove unnecessary toMap and...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1504#issuecomment-49569588
  
    QA tests have started for PR 1504. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16891/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2603][SQL] Remove unnecessary toMap and...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1504#issuecomment-49572345
  
    QA results for PR 1504:<br>- This patch PASSES unit tests.<br>- This patch merges cleanly<br>- This patch adds no public classes<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16891/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2603][SQL] Remove unnecessary toMap and...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1504#issuecomment-49565895
  
    QA tests have started for PR 1504. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16888/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2603][SQL] Remove unnecessary toMap and...

Posted by marmbrus <gi...@git.apache.org>.
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1504#discussion_r15197742
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala ---
    @@ -239,9 +239,9 @@ private[sql] object JsonRDD extends Logging {
           // .map(identity) is used as a workaround of non-serializable Map
           // generated by .mapValues.
           // This issue is documented at https://issues.scala-lang.org/browse/SI-7005
    -      map.toMap.mapValues(scalafy).map(identity)
    +      JMapWrapper(map).mapValues(scalafy).map(identity)
    --- End diff --
    
    Should we be using the `.asScala` / `.asJava` methods in [`JavaConverters`](http://www.scala-lang.org/api/2.10.2/index.html#scala.collection.JavaConverters$) instead of creating these classes manually.  It seems like thats more robust to changes in the Scala library, and they handle cases like going back and forth efficiently.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---