Posted to reviews@spark.apache.org by nkronenfeld <gi...@git.apache.org> on 2015/04/18 00:14:23 UTC

[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...

GitHub user nkronenfeld opened a pull request:

    https://github.com/apache/spark/pull/5565

    Common interfaces between RDD, DStream, and DataFrame

    This PR is the beginning of an attempt to put a common interface between the main distributed collection types in Spark.
    
    I've tried to make this first check-in towards such a common interface as simple as possible.  To this end, I've taken RDDApi from the sql project, pulled it up into core (as RDDLike), and changed it as necessary to allow all three main distributed collection classes to implement it.
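
    For a rough sense of the shape, here is a minimal sketch - illustrative only, not the PR's actual code - in which `Self` abstracts over the concrete collection type, so that `map` on an `RDD` yields an `RDD` and on a `DStream` yields a `DStream`:

        import scala.language.higherKinds
        import scala.reflect.ClassTag

        // Sketch of a shared parent interface for RDD, DStream, and DataFrame.
        trait RDDLike[T, Self[_]] {
          def map[U: ClassTag](f: T => U): Self[U]
          def filter(f: T => Boolean): Self[T]
          def flatMap[U: ClassTag](f: T => TraversableOnce[U]): Self[U]
        }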
    
    I've then done something similar for the pair methods, between RDD and DStream (I don't think there is an equivalent for DataFrames).
    
    This involves a few small interface changes - things like `reduceByKey` having different method signatures in different classes - but they are, for the moment, minor.  That being said, they are still interface changes, and I don't expect this to get merged without discussion.  So - suggestions and help are welcome, encouraged, etc.
    
    In the very near future, if this PR is accepted, I would like to expand on it in a few simple ways:
    
    * I want to try to pull more functions up into this interface
    * There are a lot of functions with 3 versions:
      * foo(...)
      * foo(..., numPartitions: Int)
      * foo(..., partitioner: Partitioner)
    
      These should all be replaceable by 
    
      * foo(..., partitioner: Partitioner = defaultPartitioner)
    
      with the implicit Int => Partitioner conversion I've put in here.  I did half of this reduction in one case (reduceByKey), out of necessity, to get the implementation contained herein to compile; extending it as far as possible would make a lot of things much cleaner.
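
      A minimal sketch of that conversion (the object name is hypothetical, and I'm assuming a `HashPartitioner` to match what the `Int` overloads do today):

          import scala.language.implicitConversions
          import org.apache.spark.{HashPartitioner, Partitioner}

          object PartitionerConversions {
            // An Int argument is lifted to a HashPartitioner, making the
            // foo(..., numPartitions: Int) overloads unnecessary.
            implicit def intToPartitioner(numPartitions: Int): Partitioner =
              new HashPartitioner(numPartitions)
          }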


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/nkronenfeld/spark-1 feature/common-interface2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/5565.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5565
    
----
commit 9dbbd9ea0e69fd0d5fc5056aeabe4f7efc842cee
Author: Nathan Kronenfeld <nk...@oculusinfo.com>
Date:   2015-04-17T20:23:26Z

    Common interface between RDD, DStream, DataFrame - non-pair methods

commit fb920ffc6e30897e19626f6556af3f0ffc5248bb
Author: Nathan Kronenfeld <nk...@oculusinfo.com>
Date:   2015-04-17T22:02:20Z

    Common interface for PairRDD functions

----




[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/5565#issuecomment-94410749
  
    What has to be written differently for an `RDD` from a `DStream` vs. an `RDD` from a batch process? They're both `RDD`s. Maybe a very small example?
    
    I think methods like `DStream.map` are mostly redundant, yes. I seem to recall @tdas saying something to that effect. I imagine they might not have even been added if this were done over again.
    
    What's different about the API, though? The `ClassTag`s? I only skimmed your change, and that's all I saw.
    
    But that makes many of the `DStream` methods at worst superfluous and doesn't change your ability to access the data through the same `RDD` API. 
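
    For instance, `DStream.map` can be expressed through `transform` plus the plain `RDD` API - a sketch of the equivalence, not how Spark actually implements it:

        import scala.reflect.ClassTag
        import org.apache.spark.streaming.dstream.DStream

        object DStreamMapSketch {
          // map on a DStream, written using only transform and RDD.map.
          def mapViaTransform[T, U: ClassTag](stream: DStream[T])(f: T => U): DStream[U] =
            stream.transform(_.map(f))
        }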




[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5565#issuecomment-94477339
  
      [Test build #30589 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30589/consoleFull) for   PR 5565 at commit [`07c2a13`](https://github.com/apache/spark/commit/07c2a133919fb986fc724af6d8026f8a72c0d198).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.




[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...

Posted by nkronenfeld <gi...@git.apache.org>.
Github user nkronenfeld commented on the pull request:

    https://github.com/apache/spark/pull/5565#issuecomment-94332668
  
    Ok, I think there are only 3 real API changes, though one of them appears in multiple places.
    
    * `RDD`
      * `flatMap[U](f: T => TraversableOnce[U])` changed to `flatMap[U](f: T => Traversable[U])`.  My thinking at the time was that `DStream` used a function to `Traversable` and `RDD` one to `TraversableOnce`, and that the `Traversable` form was necessary in `DStream`.  Thinking about it again, I'm not completely sure of that - it might be just as valid to change the `DStream` version to `TraversableOnce`, which would be more general and require fewer people to change code.
    * `PairRDDFunctions`
      * Several of the functions now require a `ClassTag`, as the `DStream` versions already did (see the signature comparison after this list).  I tried to take the requirement out of the `DStream` versions instead, since that would be more inclusive, but ran into some compiler problems, and this was easier in the short run.  If people think it's doable to change in that direction, I can try.
        * `combineByKey`
        * `mapValues`
        * `flatMapValues`
      * `reduceByKey`
        * changed the (partitioner, function) version to (function, partitioner) (see below).  I wanted to leave the old version in, deprecated, but for the same reason as the next point, couldn't.
        * eliminated the (function, Int) version.  Unfortunately, the compiler seemed to resolve the function's type parameters before choosing which overload to use, and because of that couldn't handle the type inference.  Someone with more Scala-fu than I have might well be able to solve this; I'll copy the error in here when I can get hold of it again.
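
    For reference, the `ClassTag` discrepancy looks roughly like this, using `mapValues` as the example (paraphrased from the 1.x sources, not exact copies):

        // PairRDDFunctions - no ClassTag needed:
        //   def mapValues[U](f: V => U): RDD[(K, U)]
        // PairDStreamFunctions - ClassTag required:
        //   def mapValues[U: ClassTag](mapValuesFunc: V => U): DStream[(K, U)]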
    
    However, regarding the last two, there is a general pattern in a lot of the Spark functions: they all have three versions:
    
    * `foo(...)` - an N-argument version that uses the default partitioner
    * `foo(..., Int)` - an N+1-argument version that uses a `HashPartitioner`, and
    * `foo(..., Partitioner)` - an N+1-argument version that uses the passed-in partitioner.
    
    `PairRDDFunctions.reduceByKey` broke this pattern, but `PairDStreamFunctions.reduceByKey` did not - so, with differing method signatures, one had to change.  Since the RDD one was out of sync with pretty much every similar function in the class, it seemed natural to change it - and, I would argue, only makes sense that way.
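
    Concretely, the two signatures being reconciled (again paraphrased from the 1.x sources):

        // PairRDDFunctions - partitioner first:
        //   def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)]
        // PairDStreamFunctions - function first, partitioner last:
        //   def reduceByKey(reduceFunc: (V, V) => V, partitioner: Partitioner): DStream[(K, V)]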
    
    As my initial PR notes mention, I think the ideal thing would be to reduce all these function triplets to a single version, with auto-conversion from `Int` to `HashPartitioner`, which would leave all associated classes cleaner, smaller, and less confusing.
    
    Besides these three changes, I don't see any other API changes, but I'll go through it again later to be sure.




[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...

Posted by nkronenfeld <gi...@git.apache.org>.
Github user nkronenfeld commented on the pull request:

    https://github.com/apache/spark/pull/5565#issuecomment-95221601
  
    Could you leave it open a little bit longer? I want to see if I can shorten the list of differences, and having it active here makes it a little easier to see if I've done it correctly.




[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/5565#issuecomment-95221846
  
    OK, mark it `[WIP]` in the title - but does it help you much to have it as an open PR? You can always test things locally.




[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/5565#issuecomment-94126381
  
    This is such a significant change that it certainly should have a JIRA, and should probably start with design discussion first. This can accompany it as a straw-man, sure. But before even that happens --
    
    My initial reaction is that this is introducing a lot of change to the APIs, and that isn't backwards compatible. That's a non-starter for the short term, of course; it's not necessarily one in the longer term, but even then, changes to core methods would have to pull their weight.
    
    Can these changes be made without changing the API though?
    
    Unifying RDD and DStream may not be important. They expose similar-ish APIs but are different things, and as I understand it, DStream's methods are partly redundant at this point anyway. I don't have as good a view into how interchangeable an RDD and a DataFrame are intended to be.
    
    How much does this much change buy -- what's the upside?




[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5565#issuecomment-94087406
  
      [Test build #30503 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30503/consoleFull) for   PR 5565 at commit [`fb920ff`](https://github.com/apache/spark/commit/fb920ffc6e30897e19626f6556af3f0ffc5248bb).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.
     * This patch does not change any dependencies.




[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...

Posted by nkronenfeld <gi...@git.apache.org>.
Github user nkronenfeld commented on the pull request:

    https://github.com/apache/spark/pull/5565#issuecomment-94507802
  
    @pwendell I get all that... does that mean we are stuck with those deprecated versions forever then? Or is there some timeframe over which deprecated functions will eventually be removed?  Would it be possible to prepare for this, say, 4 releases in advance, so eventually it could be done?




[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/5565#issuecomment-94366735
  
    Hey @nkronenfeld - simply put, we cannot make binary-incompatible changes to APIs in Spark, due to our API guarantees; this rules out many of your proposed changes. This has been true since Spark 1.0. Basically, code compiled against older versions needs to work with newer ones, and that sometimes means swallowing minor API inconsistencies.
    
    If you'd like to add new versions of functions with more arguments, or add shared parent interfaces in a way that is binary compatible, that's not a violation of the policy. In those cases, though, a JIRA might be helpful to summarize your high-level thoughts.
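
    A minimal sketch of that kind of binary-compatible evolution, using the `reduceByKey` example (names and version string illustrative; note that type inference across such overloads can be fragile, as this PR found):

        import scala.reflect.ClassTag
        import org.apache.spark.Partitioner
        import org.apache.spark.rdd.RDD

        class PairOpsSketch[K: ClassTag, V: ClassTag](self: RDD[(K, V)]) {
          // New, consistent signature: partitioner last.
          def reduceByKeyCompat(func: (V, V) => V, partitioner: Partitioner): RDD[(K, V)] =
            self.reduceByKey(partitioner, func)

          // Old argument order kept for binary compatibility, deprecated and forwarded.
          @deprecated("use reduceByKeyCompat(func, partitioner)", "1.4.0")
          def reduceByKeyCompat(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] =
            reduceByKeyCompat(func, partitioner)
        }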




[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...

Posted by nkronenfeld <gi...@git.apache.org>.
Github user nkronenfeld commented on the pull request:

    https://github.com/apache/spark/pull/5565#issuecomment-94519103
  
    @srowen - If I understand your question, you're asking why we can't take our application, written for an `RDD`, and just put the whole thing inside a call to `DStream.foreachRDD`... I'm not ignoring it; I'm trying to figure out what would happen if we did that, and that might take a bit.
    
    Regarding your answer to @harishreedharan, I would say, of course we don't have to - but we really should have from the start.  To have two nearly, but not quite, identical APIs is confusing, and encourages inconsistency; having a common interface enforces that the methods in both mean the same thing now, change together in the future, and are consistent.




[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/5565#issuecomment-94492320
  
    @nkronenfeld that's a good example. `reduceByKey` with a `Partitioner` has it as the first arg in `RDD` but second in `DStream`. I'm almost sure that was not intentional. It could be fixed by deprecating the existing method and introducing a corrected one. If it's purely for consistency, I suppose I'd shrug at whether it's useful.
    
    You're saying it's for a much more important purpose, which is being able to use the same code consistently between batch and streaming. However I am still missing why in your example you can't apply the same `PipelineData` to an `RDD` from `DStream.foreachRDD`, for example, as to any other `RDD`?
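
    In code, the suggestion looks something like this (a sketch, assuming the `PipelineData` wrapper from the example, with an element type filled in):

        import java.util.Properties
        import org.apache.spark.rdd.RDD
        import org.apache.spark.streaming.dstream.DStream

        case class PipelineData(config: Properties, data: RDD[String])

        object StreamingBoundary {
          def runBatch(pipeline: PipelineData): Unit = ???  // existing batch entry point

          // Reuse the batch code on each micro-batch at the streaming boundary.
          def runStreaming(config: Properties, stream: DStream[String]): Unit =
            stream.foreachRDD(rdd => runBatch(PipelineData(config, rdd)))
        }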




[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5565#issuecomment-94475597
  
      [Test build #30589 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30589/consoleFull) for   PR 5565 at commit [`07c2a13`](https://github.com/apache/spark/commit/07c2a133919fb986fc724af6d8026f8a72c0d198).




[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/5565#issuecomment-94515362
  
    @harishreedharan Yeah I'm saying the `DStream` methods are convenience methods. If that's right, you could argue they should be removed, but I'm not arguing that. The question is more like, do they have to be formally unified with the `RDD` API to enable operating across batch and streaming?




[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/5565#issuecomment-94087234
  
      [Test build #30503 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30503/consoleFull) for   PR 5565 at commit [`fb920ff`](https://github.com/apache/spark/commit/fb920ffc6e30897e19626f6556af3f0ffc5248bb).




[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...

Posted by nkronenfeld <gi...@git.apache.org>.
Github user nkronenfeld commented on the pull request:

    https://github.com/apache/spark/pull/5565#issuecomment-94554558
  
    @srowen I'm sure, for what it's worth, that part works out - it's whether the surrounding application structure can fit in that paradigm that is the issue.
    
    Just to be clear, I didn't really expect this to ever get merged in as is :-)  I am hoping for a few things from this PR:
    
    1. I wanted a list of the points at which the various APIs are not yet unified
    2. I wanted to know what sort of compatibility was needed (code-compatible, which, with one exception, this is, vs. binary-compatible, which it definitely isn't)
    3. I was hoping to foster in the community some sense that this was a goal we could work towards
      * So that when we got to the next point where we could make compatibility changes, we would be ready to do so
      * And to try to prevent further changes from moving the APIs even farther apart.
    4. And, lastly, I was hoping that someone whose Scala is better than mine might have some ideas on how to do this with fewer API changes (for instance, modifying this so that we could have all 4 versions of `reduceByKey`, with the bad one deprecated, and still have the compiler infer types correctly)
    





[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...

Posted by nkronenfeld <gi...@git.apache.org>.
Github user nkronenfeld commented on the pull request:

    https://github.com/apache/spark/pull/5565#issuecomment-94351472
  
    No, we do that at the moment.
    
    But doing it that way results in some rather ugly constructs in the application code, and those get painful as soon as one starts passing data structures around.  Once the collection types flow between modules - say, between stages in a pipeline - one instantly needs to duplicate the entire pipeline for the batch and streaming cases.
    
    It isn't just one place where one has to do this duplication - it's every little pipeline operation, for every algorithm; 90% of them use just the most basic RDD and DStream functions and should be easy to consolidate (the sketch at the end of this comment spells this out).
    
    I'd also note that, where there is an interface change, it is there because the original methods in RDD and DStream were declared inconsistently.  Unless there is a good reason to keep them inconsistent (and so far I don't see one in any of these three cases), I would suggest that the inconsistency isn't a good thing to begin with - in terms of the consistency and usability of the library, where the APIs can be the same, they should be.  That reduces the learning curve, and removes some esoteric, hard-to-track-down gotchas that are bound to occasionally bite people newly switching from one case to the other.
    
    On a final note, if this is the intended use of `DStream`, why have the `map`, `flatMap`, `reduceByKey`, etc. functions on it at all?  It seems clear it was intended to be used this way (hm, that reminds me of a fourth small interface change I'll add above, but as you'll see, it's very, very minor), so why not make sure the use is the same?
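
    For concreteness, the kind of duplication I mean - an illustrative sketch in which the two bodies are character-for-character identical, yet must be declared twice:

        import org.apache.spark.rdd.RDD
        import org.apache.spark.streaming.dstream.DStream

        object WordFilterStage {
          def batch(in: RDD[(String, Int)]): RDD[(String, Int)] =
            in.filter(_._2 > 0).reduceByKey(_ + _)

          def streaming(in: DStream[(String, Int)]): DStream[(String, Int)] =
            in.filter(_._2 > 0).reduceByKey(_ + _)
        }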




[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...

Posted by harishreedharan <gi...@git.apache.org>.
Github user harishreedharan commented on the pull request:

    https://github.com/apache/spark/pull/5565#issuecomment-94513583
  
    I think the DStream API does expose calls that basically call foreachRDD, like `map` etc. But from an app developer's point of view, it makes my application much cleaner if I can do something like `dstream.map(_.getValue)` than having a foreachRDD call for every transformation. I don't even see a major maintenance issue as far as the codebase is concerned, since the DStream API mostly just uses public APIs. Really, removing these methods would just make streaming apps more complex.




[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...

Posted by nkronenfeld <gi...@git.apache.org>.
Github user nkronenfeld commented on the pull request:

    https://github.com/apache/spark/pull/5565#issuecomment-94330557
  
    Thanks for taking a look at it.
    
    I'll take a look and write up here exactly what API changes this makes - while I don't think there are many, there aren't none, and that's why, once I'd figured out that it was possible, I was looking for comments before going any further.
    
    As far as I could tell, I could not get this to compile without making these changes, for reasons I'll comment on individually when I get together a list.
    
    As far as the importance of unifying RDD and DStream - I would put that far above unifying RDD and DataFrame.  I only included DataFrame because someone had already made it inherit from a class they called RDDApi, which looked to me like a bit of prep work for doing what I've done here.  RDD and DStream, while different things, are used for the same purpose.  One of the chief benefits of Spark, touted from early on, was that one could use the same code in varying circumstances - batch jobs, live jobs, streaming jobs, etc.  Yet from the beginning of the streaming side, that wasn't really true.  I think making it really true is a huge upside.  I know that, for my company, the ability to take our batch jobs and apply them to streaming data without changing our code would be huge, and I can't imagine this isn't true for many other people.




[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...

Posted by harishreedharan <gi...@git.apache.org>.
Github user harishreedharan commented on the pull request:

    https://github.com/apache/spark/pull/5565#issuecomment-94529830
  
    I agree that they should have been unified from the beginning, but at this point merging the two APIs into one would have to wait until we can break version compatibility. I'd still like to have something like `DStream.map` (in fact, `DStream` could really have been an `RDD` implementation itself, so we wouldn't have multiple implementations).
    
    At this point, as long as we can unify APIs without breaking them I am ok with it - and in a future 2.0 release, we can make `DStream` an `RDD` implementation.




[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/5565#issuecomment-94509143
  
    Yes, you would deprecate rather than remove methods and then likely remove them in the next major release, like a 2.0.0 release. That's a straightforward process, but first the question is whether it is worth doing, and it's probably not purely for consistency. That's why the question is still whether there is something this is buying that you can't accomplish with the `RDD` API itself in both batch and streaming.




[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/5565#issuecomment-94522384
  
    Sure, I hope it's good news that the theory and promise match reality - that you can do just about anything in Streaming by manipulating an RDD, and that manipulating an RDD is the same whether you're in streaming or not.
    
    I think @tdas would also agree it should have been consistent -- seems like just a little oversight -- and maybe `DStream` didn't need this similar API at all. I agree with that. What can or should be done going forward is a different question. I tend to not use the `DStream` methods at all where I can manipulate an `RDD`, myself, so it doesn't create much of an issue in practice from my POV.




[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/5565#issuecomment-94350194
  
    @nkronenfeld the thrust of my comment comes from the fact that `DStream`s give you an `RDD`, which you can do whatever you want with, including make a `DStream` again with the result via `transform`. Is that the missing piece here? This change is certainly not a prerequisite for reusing code across batch and streaming.
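
    That is, logic written once against `RDD` can be lifted into streaming with `transform` - a sketch, with `wordCount` as a made-up example function:

        import org.apache.spark.rdd.RDD
        import org.apache.spark.streaming.dstream.DStream

        object TransformReuse {
          def wordCount(rdd: RDD[String]): RDD[(String, Int)] =
            rdd.map(w => (w, 1)).reduceByKey(_ + _)

          // The same function, applied per micro-batch.
          def wordCountStream(stream: DStream[String]): DStream[(String, Int)] =
            stream.transform(wordCount _)
        }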




[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...

Posted by nkronenfeld <gi...@git.apache.org>.
Github user nkronenfeld commented on the pull request:

    https://github.com/apache/spark/pull/5565#issuecomment-94472545
  
    @srowen - the examples are pretty simple - it happens as soon as you are passing an RDD to something, rather than passing something to an RDD.  For instance:
    
        case class PipelineData(config: Properties, data: RDD[String])  // element type added for illustration
    
    In each individual case, there are ways around it, of course, but in total they get ugly rather quickly.
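
    Spelled out a bit more (a sketch continuing the example above; the names are hypothetical): once the collection is wrapped, the wrapper - and everything consuming it - must be duplicated for the streaming case:

        import java.util.Properties
        import org.apache.spark.streaming.dstream.DStream

        case class StreamingPipelineData(config: Properties, data: DStream[String])

        object PipelineStages {
          // Every consumer of PipelineData now needs a near-identical twin.
          def enrich(p: PipelineData): PipelineData = ???
          def enrichStream(p: StreamingPipelineData): StreamingPipelineData = ???
        }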




[GitHub] spark pull request: [WIP] Common interfaces between RDD, DStream, ...

Posted by nkronenfeld <gi...@git.apache.org>.
Github user nkronenfeld closed the pull request at:

    https://github.com/apache/spark/pull/5565




[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/5565#issuecomment-94503276
  
    @nkronenfeld we can do clean-ups in a way that's backwards compatible, but we cannot remove old method signatures for the purpose of clean-up. This is why we vet new public APIs pretty closely in Spark post-1.0. However, there is some legacy stuff that, to your point, is sometimes messy. The reason is that otherwise you break old programs that depend on the existing APIs.




[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5565#issuecomment-94477349
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30589/
    Test FAILed.




[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/5565#issuecomment-95139395
  
    Yes, good to see a list of the differences. I suspect that many of these changes (removing `DStream` methods? at least making them consistent?) could easily happen in 2.x. You can file an Improvement JIRA targeted for 2+ as a TODO; that's better than a PR in the long term, since this should be resolved one way or the other in the short term.
    
    If it's not going to be committed, do you mind closing it? It will still be here to inspect.
    
    Adding methods in a backwards-compatible way mostly amounts to keeping and deprecating existing methods. It's possible where needed. I think it's not worth bothering with that at this stage, given that the methods in question are arguably superfluous. That's not a claim that it couldn't be migrated, or that the existing API is ideal.




[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/5565#issuecomment-94087409
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30503/
    Test FAILed.




[GitHub] spark pull request: [WIP] Common interfaces between RDD, DStream, ...

Posted by nkronenfeld <gi...@git.apache.org>.
Github user nkronenfeld commented on the pull request:

    https://github.com/apache/spark/pull/5565#issuecomment-96656969
  
    I'll see, I guess I can always reopen.
    
    If I can get some work towards this done, some deprecation with new versions of the necessary functions, I'll resubmit.
    
    Any idea how to deal with the `ClassTag` differences?




[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...

Posted by nkronenfeld <gi...@git.apache.org>.
Github user nkronenfeld commented on the pull request:

    https://github.com/apache/spark/pull/5565#issuecomment-94471584
  
    @pwendell - something like `reduceByKey` having the opposite argument order from everything else in the code seems like an obvious mistake in the API - is there no mechanism that allows for fixing such things?





[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...

Posted by nkronenfeld <gi...@git.apache.org>.
Github user nkronenfeld commented on the pull request:

    https://github.com/apache/spark/pull/5565#issuecomment-94510832
  
    @srowen are you suggesting putting the entire application inside the foreachRDD call?




[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/5565#issuecomment-94513545
  
    @nkronenfeld I suppose this argument does depend on whether I'm right about this: you could express anything you can express with a bunch of `DStream` operations ending in a bunch of actions, using one big call to `foreachRDD`. Where does this most conflict with what you're trying to do? That might suss out the problem one way or the other rapidly.

