Posted to reviews@spark.apache.org by pwendell <gi...@git.apache.org> on 2014/04/01 20:19:06 UTC

[GitHub] spark pull request: Merge Hadoop Into Spark

GitHub user pwendell opened a pull request:

    https://github.com/apache/spark/pull/286

    Merge Hadoop Into Spark

    This patch merges the Hadoop 0.20.2 source code into the Spark project. I've thought about this a bunch and this will provide us with several benefits:
    
    ### More source code
    Let's be honest, to be taken seriously as a project Spark needs to have _way more_ lines of code. Spark is currently 70,000 lines of Scala code - this patch adds 452,000 lines of XML alone (!) This will make our github stats look great!
    
    ### Seamless builds
    Sometimes users stumble trying to build Spark against Hadoop. Not anymore!! With Hadoop inside of Spark this won't be a problem at all. I mean, there's basically only one Hadoop version, right? So this should work for pretty much everyone. Wait, hold on, is hadoop-0.20.2 the same as hadoop-2.2.0? I'm assuming it's the same because they both have the same number of "2"s.
    
    ### Your favorite old configs
    This patch will give users access to some of their favorite old configs from Hadoop. Did you just figure out what `dfs.namenode.path.based.cache.block.map.allocation.percent` was?! Now you can use it in Spark!! Pining for your old friend `mapreduce.map.skip.proc.count.autoincr`... fuggedabodit - we got ya!
    
    I plan to contribute tests and docs in a subsequent patch. Please merge this ASAP and include in Spark 0.9.1.
    
    **NB: This diff is too large for github to render. Users will have to download and play with this on their own.**

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/pwendell/spark 4-1

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/286.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #286
    
----
commit 2ce4bb837bd551501542683f20cf15bb6f574efc
Author: Patrick Wendell <pw...@gmail.com>
Date:   2014-04-01T18:10:12Z

    Merge Hadoop Into Spark
    
    This patch merges the Hadoop 0.20.2 source code into the Spark project. I've thought about this a bunch and this will provide us with several benefits:
    
    Let's be honest, to be taken seriously as a project Spark needs to have _way more_ lines of code. Spark is currently 70,000 lines of Scala code - this patch adds 452,000 lines of XML alone (!) This will make our github stats look great!
    
    Sometimes users stumble trying to build Spark against Hadoop. Not anymore!! With Hadoop inside of Spark this won't be a problem at all. I mean, there's basically only one Hadoop version, right? So this should work for pretty much everyone. Wait, hold on, is hadoop-0.20.2 the same as hadoop-2.2.0? I'm assuming it's the same because they both have the same number of "2"s.
    
    This patch will give users access to some of their favorite old configs from Hadoop. Did you just figure out what `dfs.namenode.path.based.cache.block.map.allocation.percent` was?! Now you can use it in Spark!! Pining for your old friend `mapreduce.map.skip.proc.count.autoincr`... fuggedabodit - we got ya!
    
    I plan to contribute tests and docs in a subsequent patch. Please merge this ASAP and include in Spark 0.9.1.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: Merge Hadoop Into Spark

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell closed the pull request at:

    https://github.com/apache/spark/pull/286



[GitHub] spark pull request: Merge Hadoop Into Spark

Posted by mattf <gi...@git.apache.org>.
Github user mattf commented on the pull request:

    https://github.com/apache/spark/pull/286#issuecomment-39260062
  
    +1, lgtm, but only today!



[GitHub] spark pull request: Merge Hadoop Into Spark

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/286#issuecomment-39240275
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13636/



[GitHub] spark pull request: Merge Hadoop Into Spark

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/286#issuecomment-39240088
  
    Merged build started. 



[GitHub] spark pull request: Merge Hadoop Into Spark

Posted by yinxusen <gi...@git.apache.org>.
Github user yinxusen commented on the pull request:

    https://github.com/apache/spark/pull/286#issuecomment-39281046
  
    Yep, I find that each time I do `sbt clean gen-idea` or `sbt update` or even `sbt testOnly xxx`, I can do the cooking, take a shower, and have a rest.



[GitHub] spark pull request: Merge Hadoop Into Spark

Posted by aarondav <gi...@git.apache.org>.
Github user aarondav commented on the pull request:

    https://github.com/apache/spark/pull/286#issuecomment-39240317
  
    But will this work on YARN?



[GitHub] spark pull request: Merge Hadoop Into Spark

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/286#issuecomment-39239988
  
    +1 !!! I have been asking for this multiple times on the mailing list and finally see the light!!!



[GitHub] spark pull request: Merge Hadoop Into Spark

Posted by shivaram <gi...@git.apache.org>.
Github user shivaram commented on the pull request:

    https://github.com/apache/spark/pull/286#issuecomment-39243472
  
    I love hadoop-0.20.2 -- It is the best Hadoop I have ever used. Thanks @pwendell for pulling this in.



[GitHub] spark pull request: Merge Hadoop Into Spark

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/286#issuecomment-39240734
  
    The best way to have more lines than some project is to merge it!



[GitHub] spark pull request: Merge Hadoop Into Spark

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/286#issuecomment-39240273
  
    Merged build finished. 



[GitHub] spark pull request: Merge Hadoop Into Spark

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/286#issuecomment-39240076
  
     Merged build triggered. 



[GitHub] spark pull request: Merge Hadoop Into Spark

Posted by aarondav <gi...@git.apache.org>.
Github user aarondav commented on the pull request:

    https://github.com/apache/spark/pull/286#issuecomment-39254250
  
    Guys, I just realized that this includes a really cool submodule called "Map Reduce" which seems to be a generalized form of Spark! It also has around 3x the amount of code, and that's in Java (which is somewhat more concise than Scala), so I estimate it's roughly 5x as awesome as Spark. Are there plans for deprecating the Spark API in favor of "Map Reduce"?
    
    If we deprecate Spark in 0.9.1 and replace it with Map Reduce in 1.0.0, that should give users enough time to migrate to the new API (which is much simpler -- only two functions!).



[GitHub] spark pull request: Merge Hadoop Into Spark

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/286#issuecomment-39270210
  
    Wait, so did it pass or fail Jenkins tests? Jenkins isn't saying what happened, only that the build finished.
    
    Anyway I'd be okay with this as long as you make it meet our code style guidelines.



[GitHub] spark pull request: Merge Hadoop Into Spark

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on the pull request:

    https://github.com/apache/spark/pull/286#issuecomment-39251945
  
    LGTM, merged !
    
    
    On Wed, Apr 2, 2014 at 12:20 AM, Shivaram Venkataraman <
    notifications@github.com> wrote:
    
    > I love hadoop-0.20.2 -- It is the best Hadoop I have ever used. Thanks
    > @pwendell <https://github.com/pwendell> for pulling this in.
    >
    > --
    > Reply to this email directly or view it on GitHub<https://github.com/apache/spark/pull/286#issuecomment-39243472>
    > .
    >



[GitHub] spark pull request: Merge Hadoop Into Spark

Posted by win2cs <gi...@git.apache.org>.
Github user win2cs commented on the pull request:

    https://github.com/apache/spark/pull/286#issuecomment-39281738
  
    "But will this work on YARN?", I've the same question.
    And does it still support specifying other Hadoop distributions, as before?



[GitHub] spark pull request: Merge Hadoop Into Spark

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/286#issuecomment-39280558
  
    This is a good start! Why not merge all transitive dependencies into Spark altogether? That way we can save a lot of effort on SBT/Maven and avoid time-consuming Ivy dependency resolution.
    
    You know, ever since those .orbit things were added, `sbt update` costs nearly 40 minutes to resolve all the dependencies in mainland China... I have to turn it off by `skip in update := true` and `offline := true`.
    
    OK, the 2nd paragraph is not part of the joke :-)



[GitHub] spark pull request: Merge Hadoop Into Spark

Posted by CrazyJvm <gi...@git.apache.org>.
Github user CrazyJvm commented on the pull request:

    https://github.com/apache/spark/pull/286#issuecomment-39278059
  
    +1 amazing! I've been looking forward to it for a long time! thx!



[GitHub] spark pull request: Merge Hadoop Into Spark

Posted by CodingCat <gi...@git.apache.org>.
Github user CodingCat commented on the pull request:

    https://github.com/apache/spark/pull/286#issuecomment-39253624
  
    Great job!



[GitHub] spark pull request: Merge Hadoop Into Spark

Posted by pelick <gi...@git.apache.org>.
Github user pelick commented on the pull request:

    https://github.com/apache/spark/pull/286#issuecomment-39281241
  
    Is it true about this => “ Wait, hold on, is hadoop-0.20.2 the same as hadoop-2.2.0? I'm assuming it's the same because they both have the same number of "2"s. ”



[GitHub] spark pull request: Merge Hadoop Into Spark

Posted by CrazyJvm <gi...@git.apache.org>.
Github user CrazyJvm commented on the pull request:

    https://github.com/apache/spark/pull/286#issuecomment-39278167
  
    amazing! I've been looking forward to it for a long time!



[GitHub] spark pull request: Merge Hadoop Into Spark

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/286#issuecomment-39253647
  
    I'm OK with holding off on this for a separate JIRA, but for completeness I'd like to propose merging in CDH code as well.  I believe that CDH4.2.2 would be the best choice of version, as it maintains 2-compatibility with 0.20.2 and 2.2.0.



[GitHub] spark pull request: Merge Hadoop Into Spark

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on the pull request:

    https://github.com/apache/spark/pull/286#issuecomment-39284253
  
    > You know, ever since those .orbit things were added, `sbt update` costs nearly 40 minutes to resolve all the dependencies in mainland China... I have to turn it off by `skip in update := true` and `offline := true`.
    
    +1. Really terrible nightmare. If the version of some jar is 6.4, it usually can not be downloaded in mainland China without a proxy :(



[GitHub] spark pull request: Merge Hadoop Into Spark

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/286#issuecomment-39253343
  
    LGTMT



[GitHub] spark pull request: Merge Hadoop Into Spark

Posted by DavyLin <gi...@git.apache.org>.
Github user DavyLin commented on the pull request:

    https://github.com/apache/spark/pull/286#issuecomment-39280455
  
    I do not know, ah, but I'm looking forward to it

