You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@spark.apache.org by Patrick Wendell <pw...@gmail.com> on 2014/05/13 10:36:33 UTC

[VOTE] Release Apache Spark 1.0.0 (rc5)

Please vote on releasing the following candidate as Apache Spark version 1.0.0!

The tag to be voted on is v1.0.0-rc5 (commit 18f0623):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=18f062303303824139998e8fc8f4158217b0dbc3

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.0.0-rc5/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1012/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/

Please vote on releasing this package as Apache Spark 1.0.0!

The vote is open until Friday, May 16, at 09:30 UTC and passes if a
majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.0.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/

== API Changes ==
We welcome users to compile Spark applications against 1.0. There are
a few API changes in this release. Here are links to the associated
upgrade guides - user facing changes have been kept as small as
possible.

changes to ML vector specification:
http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/mllib-guide.html#from-09-to-10

changes to the Java API:
http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark

changes to the streaming API:
http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x

changes to the GraphX API:
http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091

coGroup and related functions now return Iterable[T] instead of Seq[T]
==> Call toSeq on the result to restore the old behavior

SparkContext.jarOfClass returns Option[String] instead of Seq[String]
==> Call toSeq on the result to restore old behavior

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

Posted by Matei Zaharia <ma...@gmail.com>.
Yup, this is a good point, the interface includes stuff like launch scripts and environment variables. However I do think that the current features of spark-submit can all be supported in future releases. We’ll definitely have a very strict standard for modifying these later on.

Matei

On May 17, 2014, at 2:05 PM, Mridul Muralidharan <mr...@gmail.com> wrote:

> I would make the case for interface stability not just api stability.
> Particularly given that we have significantly changed some of our
> interfaces, I want to ensure developers/users are not seeing red flags.
> 
> Bugs and code stability can be addressed in minor releases if found, but
> behavioral change and/or interface changes would be a much more invasive
> issue for our users.
> 
> Regards
> Mridul
> On 18-May-2014 2:19 am, "Matei Zaharia" <ma...@gmail.com> wrote:
> 
>> As others have said, the 1.0 milestone is about API stability, not about
>> saying “we’ve eliminated all bugs”. The sooner you declare 1.0, the sooner
>> users can confidently build on Spark, knowing that the application they
>> build today will still run on Spark 1.9.9 three years from now. This is
>> something that I’ve seen done badly (and experienced the effects thereof)
>> in other big data projects, such as MapReduce and even YARN. The result is
>> that you annoy users, you end up with a fragmented userbase where everyone
>> is building against a different version, and you drastically slow down
>> development.
>> 
>> With a project as fast-growing as fast-growing as Spark in particular,
>> there will be new bugs discovered and reported continuously, especially in
>> the non-core components. Look at the graph of # of contributors in time to
>> Spark: https://www.ohloh.net/p/apache-spark (bottom-most graph; “commits”
>> changed when we started merging each patch as a single commit). This is not
>> slowing down, and we need to have the culture now that we treat API
>> stability and release numbers at the level expected for a 1.0 project
>> instead of having people come in and randomly change the API.
>> 
>> I’ll also note that the issues marked “blocker” were marked so by their
>> reporters, since the reporter can set the priority. I don’t consider stuff
>> like parallelize() not partitioning ranges in the same way as other
>> collections a blocker — it’s a bug, it would be good to fix it, but it only
>> affects a small number of use cases. Of course if we find a real blocker
>> (in particular a regression from a previous version, or a feature that’s
>> just completely broken), we will delay the release for that, but at some
>> point you have to say “okay, this fix will go into the next maintenance
>> release”. Maybe we need to write a clear policy for what the issue
>> priorities mean.
>> 
>> Finally, I believe it’s much better to have a culture where you can make
>> releases on a regular schedule, and have the option to make a maintenance
>> release in 3-4 days if you find new bugs, than one where you pile up stuff
>> into each release. This is what much large project than us, like Linux, do,
>> and it’s the only way to avoid indefinite stalling with a large contributor
>> base. In the worst case, if you find a new bug that warrants immediate
>> release, it goes into 1.0.1 a week after 1.0.0 (we can vote on 1.0.1 in
>> three days with just your bug fix in it). And if you find an API that you’d
>> like to improve, just add a new one and maybe deprecate the old one — at
>> some point we have to respect our users and let them know that code they
>> write today will still run tomorrow.
>> 
>> Matei
>> 
>> On May 17, 2014, at 10:32 AM, Kan Zhang <kz...@apache.org> wrote:
>> 
>>> +1 on the running commentary here, non-binding of course :-)
>>> 
>>> 
>>> On Sat, May 17, 2014 at 8:44 AM, Andrew Ash <an...@andrewash.com>
>> wrote:
>>> 
>>>> +1 on the next release feeling more like a 0.10 than a 1.0
>>>> On May 17, 2014 4:38 AM, "Mridul Muralidharan" <mr...@gmail.com>
>> wrote:
>>>> 
>>>>> I had echoed similar sentiments a while back when there was a
>> discussion
>>>>> around 0.10 vs 1.0 ... I would have preferred 0.10 to stabilize the api
>>>>> changes, add missing functionality, go through a hardening release
>> before
>>>>> 1.0
>>>>> 
>>>>> But the community preferred a 1.0 :-)
>>>>> 
>>>>> Regards,
>>>>> Mridul
>>>>> 
>>>>> On 17-May-2014 3:19 pm, "Sean Owen" <so...@cloudera.com> wrote:
>>>>>> 
>>>>>> On this note, non-binding commentary:
>>>>>> 
>>>>>> Releases happen in local minima of change, usually created by
>>>>>> internally enforced code freeze. Spark is incredibly busy now due to
>>>>>> external factors -- recently a TLP, recently discovered by a large new
>>>>>> audience, ease of contribution enabled by Github. It's getting like
>>>>>> the first year of mainstream battle-testing in a month. It's been very
>>>>>> hard to freeze anything! I see a number of non-trivial issues being
>>>>>> reported, and I don't think it has been possible to triage all of
>>>>>> them, even.
>>>>>> 
>>>>>> Given the high rate of change, my instinct would have been to release
>>>>>> 0.10.0 now. But won't it always be very busy? I do think the rate of
>>>>>> significant issues will slow down.
>>>>>> 
>>>>>> Version ain't nothing but a number, but if it has any meaning it's the
>>>>>> semantic versioning meaning. 1.0 imposes extra handicaps around
>>>>>> striving to maintain backwards-compatibility. That may end up being
>>>>>> bent to fit in important changes that are going to be required in this
>>>>>> continuing period of change. Hadoop does this all the time
>>>>>> unfortunately and gets away with it, I suppose -- minor version
>>>>>> releases are really major. (On the other extreme, HBase is at 0.98 and
>>>>>> quite production-ready.)
>>>>>> 
>>>>>> Just consider this a second vote for focus on fixes and 1.0.x rather
>>>>>> than new features and 1.x. I think there are a few steps that could
>>>>>> streamline triage of this flood of contributions, and make all of this
>>>>>> easier, but that's for another thread.
>>>>>> 
>>>>>> 
>>>>>> On Fri, May 16, 2014 at 8:50 PM, Mark Hamstra <
>> mark@clearstorydata.com
>>>>> 
>>>>> wrote:
>>>>>>> +1, but just barely.  We've got quite a number of outstanding bugs
>>>>>>> identified, and many of them have fixes in progress.  I'd hate to see
>>>>> those
>>>>>>> efforts get lost in a post-1.0.0 flood of new features targeted at
>>>>> 1.1.0 --
>>>>>>> in other words, I'd like to see 1.0.1 retain a high priority relative
>>>>> to
>>>>>>> 1.1.0.
>>>>>>> 
>>>>>>> Looking through the unresolved JIRAs, it doesn't look like any of the
>>>>>>> identified bugs are show-stoppers or strictly regressions (although I
>>>>> will
>>>>>>> note that one that I have in progress, SPARK-1749, is a bug that we
>>>>>>> introduced with recent work -- it's not strictly a regression because
>>>>> we
>>>>>>> had equally bad but different behavior when the DAGScheduler
>>>> exceptions
>>>>>>> weren't previously being handled at all vs. being slightly
>>>> mis-handled
>>>>>>> now), so I'm not currently seeing a reason not to release.
>>>>> 
>>>> 
>> 
>> 


Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

Posted by Marcelo Vanzin <va...@cloudera.com>.
Hi Patrick,

On Fri, May 30, 2014 at 2:11 PM, Patrick Wendell <pw...@gmail.com> wrote:
> 2. private[spark]
> 3. @Experimental or @DeveloperApi

I understand @Experimental, but when would you use @DeveloperApi
instead of private[spark]? Seems to me that, for the API user, they
both mean very similar, if not exactly the same, thing. And the second
is actually more user-friendly since the compiler will yell at you.

Who's the "Developer" that the annotation refers to?

-- 
Marcelo

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

Posted by Colin McCabe <cm...@alumni.cmu.edu>.
On Sat, May 31, 2014 at 10:45 AM, Patrick Wendell <pw...@gmail.com>
wrote:

> One other consideration popped into my head:
>
> 5. Shading our dependencies could mess up our external API's if we
> ever return types that are outside of the spark package because we'd
> then be returned shaded types that users have to deal with. E.g. where
> before we returned an o.a.flume.AvroFlumeEvent we'd have to return a
> some.namespace.AvroFlumeEvent. Then users downstream would have to
> deal with converting our types into the correct namespace if they want
> to inter-operate with other libraries. We generally try to avoid ever
> returning types from other libraries, but it would be good to audit
> our API's and see if we ever do this.


That's a good point.  It seems to me that if Spark is returning a type in
the public API, that type is part of the public API (for better or worse).
 So this is a case where you wouldn't want to shade that type.  But it
would be nice to avoid doing this, for exactly the reasons you state...

On Fri, May 30, 2014 at 10:54 PM, Patrick Wendell <pw...@gmail.com>
> wrote:
> > Spark is a bit different than Hadoop MapReduce, so maybe that's a
> > source of some confusion. Spark is often used as a substrate for
> > building different types of analytics applications, so @DeveloperAPI
> > are internal API's that we'd like to expose to application writers,
> > but that might be more volatile. This is like the internal API's in
> > the linux kernel, they aren't stable, but of course we try to minimize
> > changes to them. If people want to write lower-level modules against
> > them, that's fine with us, but they know the interfaces might change.
>

MapReduce is used as a substrate in a lot of cases, too.  Hive has
traditionally created MR jobs to do what it needs to do.  Similarly, Oozie
can create MR jobs.  It seems that what @DeveloperAPI is pretty similar to
@LimitedPrivate in Hadoop.  If I understand correctly, your hope is that
frameworks will use @DeveloperAPI, but individual application developers
will steer clear.  That is a good plan, as long as you can ensure that the
framework developers are willing to lock their versions to a certain Spark
version.  Otherwise they will make the same arguments we've heard before,
that they don't want to transition off of a deprecated @DeveloperAPI
because they want to keep support for Spark 1.0.0 (or whatever).  We hear
these arguments in Hadoop all the time...  now that spark as a 1.0 release
they will carry more weight.  Remember, Hadoop APIs started nice and simple
too :)

>
> > This has worked pretty well over the years, even with many different
> > companies writing against those API's.
> >
> > @Experimental are user-facing features we are trying out. Hopefully
> > that one is more clear.
> >
> > In terms of making a big jar that shades all of our dependencies - I'm
> > curious how that would actually work in practice. It would be good to
> > explore. There are a few potential challenges I see:
> >
> > 1. If any of our dependencies encode class name information in IPC
> > messages, this would break. E.g. can you definitely shade the Hadoop
> > client, protobuf, hbase client, etc and have them send messages over
> > the wire? This could break things if class names are ever encoded in a
> > wire format.
>

Google protobuffers assume a fixed schema.  That is to say, they do not
include metadata identifying the types of what is placed in them.  The
types are determined by convention.  It is possible to change the java
package in which the protobuf classes reside with no harmful effects.  (See
HDFS-4909 for an example of this).  The RPC itself does include a java
class name for the interface we're talking to, though.  The code for
handling this is all under our control, though, so if we had to make any
minor modifications to make shading work, we could.

> 2. Many libraries like logging subsystems, configuration systems, etc
> > rely on static state and initialization. I'm not totally sure how e.g.
> > slf4j initializes itself if you have both a shaded and non-shaded copy
> > of slf4j present.
>

I guess the worst case scenario would be that the shaded version of slf4j
creates a log file, but then the app's unshaded version overwrites that log
file.  I don't see how the two versions could "cooperate" since they aren't
sharing static state.  The only solutions I can see are leaving slf4j
unshaded, or setting up separate log files for the spark-core versus the
application.  I haven't thought this through completely, but my gut feeling
is that if you're sharing a log file, you probably want to share the
logging code too.


> > 3. This would mean the spark-core jar would be really massive because
> > it would inline all of our deps. We've actually been thinking of
> > avoiding the current assembly jar approach because, due to scala
> > specialized classes, our assemblies now have more than 65,000 class
> > files in them leading to all kinds of bad issues. We'd have to stick
> > with a big uber assembly-like jar if we decide to shade stuff.
> > 4. I'm not totally sure how this would work when people want to e.g.
> > build Spark with different Hadoop versions. Would we publish different
> > shaded uber-jars for every Hadoop version? Would the Hadoop dep just
> > not be shaded... if so what about all it's dependencies.
>

I wonder if it would be possible to put Hadoop and its dependencies "in a
box," (as it were) by using a separate classloader for them.  That might
solve this without requiring an uber-jar.  It would be nice to not have to
transfer all that stuff each time you start a job... in a perfect world,
the stuff that had not changed would not need to be transferred (thinking
out loud here)

best,
Colin


>
> > Anyways just some things to consider... simplifying our classpath is
> > definitely an avenue worth exploring!
> >
> >
> >
> >
> > On Fri, May 30, 2014 at 2:56 PM, Colin McCabe <cm...@alumni.cmu.edu>
> wrote:
> >> On Fri, May 30, 2014 at 2:11 PM, Patrick Wendell <pw...@gmail.com>
> wrote:
> >>
> >>> Hey guys, thanks for the insights. Also, I realize Hadoop has gotten
> >>> way better about this with 2.2+ and I think it's great progress.
> >>>
> >>> We have well defined API levels in Spark and also automated checking
> >>> of API violations for new pull requests. When doing code reviews we
> >>> always enforce the narrowest possible visibility:
> >>>
> >>> 1. private
> >>> 2. private[spark]
> >>> 3. @Experimental or @DeveloperApi
> >>> 4. public
> >>>
> >>> Our automated checks exclude 1-3. Anything that breaks 4 will trigger
> >>> a build failure.
> >>>
> >>>
> >> That's really excellent.  Great job.
> >>
> >> I like the private[spark] visibility level-- sounds like this is another
> >> way Scala has greatly improved on Java.
> >>
> >> The Scala compiler prevents anyone external from using 1 or 2. We do
> >>> have "bytecode public but annotated" (3) API's that we might change.
> >>> We spent a lot of time looking into whether these can offer compiler
> >>> warnings, but we haven't found a way to do this and do not see a
> >>> better alternative at this point.
> >>>
> >>
> >> It would be nice if the production build could strip this stuff out.
> >>  Otherwise, it feels a lot like a @private, @unstable Hadoop API... and
> we
> >> know how those turned out.
> >>
> >>
> >>> Regarding Scala compatibility, Scala 2.11+ is "source code
> >>> compatible", meaning we'll be able to cross-compile Spark for
> >>> different versions of Scala. We've already been in touch with Typesafe
> >>> about this and they've offered to integrate Spark into their
> >>> compatibility test suite. They've also committed to patching 2.11 with
> >>> a minor release if bugs are found.
> >>>
> >>
> >> Thanks, I hadn't heard about this plan.  Hopefully we can get everyone
> on
> >> 2.11 ASAP.
> >>
> >>
> >>> Anyways, my point is we've actually thought a lot about this already.
> >>>
> >>> The CLASSPATH thing is different than API stability, but indeed also a
> >>> form of compatibility. This is something where I'd also like to see
> >>> Spark have better isolation of user classes from Spark's own
> >>> execution...
> >>>
> >>>
> >> I think the best thing to do is just "shade" all the dependencies.  Then
> >> they will be in a different namespace, and clients can have their own
> >> versions of whatever dependencies they like without conflicting.  As
> >> Marcelo mentioned, there might be a few edge cases where this breaks
> >> reflection, but I don't think that's an issue for most libraries.  So at
> >> worst case we could end up needing apps to follow us in lockstep for
> Kryo
> >> or maybe Akka, but not the whole kit and caboodle like with Hadoop.
> >>
> >> best,
> >> Colin
> >>
> >>
> >> - Patrick
> >>>
> >>>
> >>>
> >>> On Fri, May 30, 2014 at 12:30 PM, Marcelo Vanzin <va...@cloudera.com>
> >>> wrote:
> >>> > On Fri, May 30, 2014 at 12:05 PM, Colin McCabe <
> cmccabe@alumni.cmu.edu>
> >>> wrote:
> >>> >> I don't know if Scala provides any mechanisms to do this beyond what
> >>> Java provides.
> >>> >
> >>> > In fact it does. You can say something like "private[foo]" and the
> >>> > annotated element will be visible for all classes under "foo" (where
> >>> > "foo" is any package in the hierarchy leading up to the class).
> That's
> >>> > used a lot in Spark.
> >>> >
> >>> > I haven't fully looked at how the @DeveloperApi is used, but I agree
> >>> > with you - annotations are not a good way to do this. The Scala
> >>> > feature above would be much better, but it might still leak things at
> >>> > the Java bytecode level (don't know how Scala implements it under the
> >>> > cover, but I assume it's not by declaring the element as a Java
> >>> > "private").
> >>> >
> >>> > Another thing is that in Scala the default visibility is public,
> which
> >>> > makes it very easy to inadvertently add things to the API. I'd like
> to
> >>> > see more care in making things have the proper visibility - I
> >>> > generally declare things private first, and relax that as needed.
> >>> > Using @VisibleForTesting would be great too, when the Scala
> >>> > private[foo] approach doesn't work.
> >>> >
> >>> >> Does Spark also expose its CLASSPATH in
> >>> >> this way to executors?  I was under the impression that it did.
> >>> >
> >>> > If you're using the Spark assemblies, yes, there is a lot of things
> >>> > that your app gets exposed to. For example, you can see Guava and
> >>> > Jetty (and many other things) there. This is something that has
> always
> >>> > bugged me, but I don't really have a good suggestion of how to fix
> it;
> >>> > shading goes a certain way, but it also breaks codes that uses
> >>> > reflection (e.g. Class.forName()-style class loading).
> >>> >
> >>> > What is worse is that Spark doesn't even agree with the Hadoop code
> it
> >>> > depends on; e.g., Spark uses Guava 14.x while Hadoop is still in
> Guava
> >>> > 11.x. So when you run your Scala app, what gets loaded?
> >>> >
> >>> >> At some point we will also have to confront the Scala version issue.
> >>>  Will
> >>> >> there be flag days where Spark jobs need to be upgraded to a new,
> >>> >> incompatible version of Scala to run on the latest Spark?
> >>> >
> >>> > Yes, this could be an issue - I'm not sure Scala has a policy towards
> >>> > this, but updates (at least minor, e.g. 2.9 -> 2.10) tend to break
> >>> > binary compatibility.
> >>> >
> >>> > Scala also makes some API updates tricky - e.g., adding a new named
> >>> > argument to a Scala method is not a binary compatible change (while,
> >>> > e.g., adding a new keyword argument in a python method is just fine).
> >>> > The use of implicits and other Scala features make this even more
> >>> > opaque...
> >>> >
> >>> > Anyway, not really any solutions in this message, just a few comments
> >>> > I wanted to throw out there. :-)
> >>> >
> >>> > --
> >>> > Marcelo
> >>>
>

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

Posted by Patrick Wendell <pw...@gmail.com>.
One other consideration popped into my head:

5. Shading our dependencies could mess up our external API's if we
ever return types that are outside of the spark package because we'd
then be returned shaded types that users have to deal with. E.g. where
before we returned an o.a.flume.AvroFlumeEvent we'd have to return a
some.namespace.AvroFlumeEvent. Then users downstream would have to
deal with converting our types into the correct namespace if they want
to inter-operate with other libraries. We generally try to avoid ever
returning types from other libraries, but it would be good to audit
our API's and see if we ever do this.

On Fri, May 30, 2014 at 10:54 PM, Patrick Wendell <pw...@gmail.com> wrote:
> Spark is a bit different than Hadoop MapReduce, so maybe that's a
> source of some confusion. Spark is often used as a substrate for
> building different types of analytics applications, so @DeveloperAPI
> are internal API's that we'd like to expose to application writers,
> but that might be more volatile. This is like the internal API's in
> the linux kernel, they aren't stable, but of course we try to minimize
> changes to them. If people want to write lower-level modules against
> them, that's fine with us, but they know the interfaces might change.
>
> This has worked pretty well over the years, even with many different
> companies writing against those API's.
>
> @Experimental are user-facing features we are trying out. Hopefully
> that one is more clear.
>
> In terms of making a big jar that shades all of our dependencies - I'm
> curious how that would actually work in practice. It would be good to
> explore. There are a few potential challenges I see:
>
> 1. If any of our dependencies encode class name information in IPC
> messages, this would break. E.g. can you definitely shade the Hadoop
> client, protobuf, hbase client, etc and have them send messages over
> the wire? This could break things if class names are ever encoded in a
> wire format.
> 2. Many libraries like logging subsystems, configuration systems, etc
> rely on static state and initialization. I'm not totally sure how e.g.
> slf4j initializes itself if you have both a shaded and non-shaded copy
> of slf4j present.
> 3. This would mean the spark-core jar would be really massive because
> it would inline all of our deps. We've actually been thinking of
> avoiding the current assembly jar approach because, due to scala
> specialized classes, our assemblies now have more than 65,000 class
> files in them leading to all kinds of bad issues. We'd have to stick
> with a big uber assembly-like jar if we decide to shade stuff.
> 4. I'm not totally sure how this would work when people want to e.g.
> build Spark with different Hadoop versions. Would we publish different
> shaded uber-jars for every Hadoop version? Would the Hadoop dep just
> not be shaded... if so what about all it's dependencies.
>
> Anyways just some things to consider... simplifying our classpath is
> definitely an avenue worth exploring!
>
>
>
>
> On Fri, May 30, 2014 at 2:56 PM, Colin McCabe <cm...@alumni.cmu.edu> wrote:
>> On Fri, May 30, 2014 at 2:11 PM, Patrick Wendell <pw...@gmail.com> wrote:
>>
>>> Hey guys, thanks for the insights. Also, I realize Hadoop has gotten
>>> way better about this with 2.2+ and I think it's great progress.
>>>
>>> We have well defined API levels in Spark and also automated checking
>>> of API violations for new pull requests. When doing code reviews we
>>> always enforce the narrowest possible visibility:
>>>
>>> 1. private
>>> 2. private[spark]
>>> 3. @Experimental or @DeveloperApi
>>> 4. public
>>>
>>> Our automated checks exclude 1-3. Anything that breaks 4 will trigger
>>> a build failure.
>>>
>>>
>> That's really excellent.  Great job.
>>
>> I like the private[spark] visibility level-- sounds like this is another
>> way Scala has greatly improved on Java.
>>
>> The Scala compiler prevents anyone external from using 1 or 2. We do
>>> have "bytecode public but annotated" (3) API's that we might change.
>>> We spent a lot of time looking into whether these can offer compiler
>>> warnings, but we haven't found a way to do this and do not see a
>>> better alternative at this point.
>>>
>>
>> It would be nice if the production build could strip this stuff out.
>>  Otherwise, it feels a lot like a @private, @unstable Hadoop API... and we
>> know how those turned out.
>>
>>
>>> Regarding Scala compatibility, Scala 2.11+ is "source code
>>> compatible", meaning we'll be able to cross-compile Spark for
>>> different versions of Scala. We've already been in touch with Typesafe
>>> about this and they've offered to integrate Spark into their
>>> compatibility test suite. They've also committed to patching 2.11 with
>>> a minor release if bugs are found.
>>>
>>
>> Thanks, I hadn't heard about this plan.  Hopefully we can get everyone on
>> 2.11 ASAP.
>>
>>
>>> Anyways, my point is we've actually thought a lot about this already.
>>>
>>> The CLASSPATH thing is different than API stability, but indeed also a
>>> form of compatibility. This is something where I'd also like to see
>>> Spark have better isolation of user classes from Spark's own
>>> execution...
>>>
>>>
>> I think the best thing to do is just "shade" all the dependencies.  Then
>> they will be in a different namespace, and clients can have their own
>> versions of whatever dependencies they like without conflicting.  As
>> Marcelo mentioned, there might be a few edge cases where this breaks
>> reflection, but I don't think that's an issue for most libraries.  So at
>> worst case we could end up needing apps to follow us in lockstep for Kryo
>> or maybe Akka, but not the whole kit and caboodle like with Hadoop.
>>
>> best,
>> Colin
>>
>>
>> - Patrick
>>>
>>>
>>>
>>> On Fri, May 30, 2014 at 12:30 PM, Marcelo Vanzin <va...@cloudera.com>
>>> wrote:
>>> > On Fri, May 30, 2014 at 12:05 PM, Colin McCabe <cm...@alumni.cmu.edu>
>>> wrote:
>>> >> I don't know if Scala provides any mechanisms to do this beyond what
>>> Java provides.
>>> >
>>> > In fact it does. You can say something like "private[foo]" and the
>>> > annotated element will be visible for all classes under "foo" (where
>>> > "foo" is any package in the hierarchy leading up to the class). That's
>>> > used a lot in Spark.
>>> >
>>> > I haven't fully looked at how the @DeveloperApi is used, but I agree
>>> > with you - annotations are not a good way to do this. The Scala
>>> > feature above would be much better, but it might still leak things at
>>> > the Java bytecode level (don't know how Scala implements it under the
>>> > cover, but I assume it's not by declaring the element as a Java
>>> > "private").
>>> >
>>> > Another thing is that in Scala the default visibility is public, which
>>> > makes it very easy to inadvertently add things to the API. I'd like to
>>> > see more care in making things have the proper visibility - I
>>> > generally declare things private first, and relax that as needed.
>>> > Using @VisibleForTesting would be great too, when the Scala
>>> > private[foo] approach doesn't work.
>>> >
>>> >> Does Spark also expose its CLASSPATH in
>>> >> this way to executors?  I was under the impression that it did.
>>> >
>>> > If you're using the Spark assemblies, yes, there is a lot of things
>>> > that your app gets exposed to. For example, you can see Guava and
>>> > Jetty (and many other things) there. This is something that has always
>>> > bugged me, but I don't really have a good suggestion of how to fix it;
>>> > shading goes a certain way, but it also breaks codes that uses
>>> > reflection (e.g. Class.forName()-style class loading).
>>> >
>>> > What is worse is that Spark doesn't even agree with the Hadoop code it
>>> > depends on; e.g., Spark uses Guava 14.x while Hadoop is still in Guava
>>> > 11.x. So when you run your Scala app, what gets loaded?
>>> >
>>> >> At some point we will also have to confront the Scala version issue.
>>>  Will
>>> >> there be flag days where Spark jobs need to be upgraded to a new,
>>> >> incompatible version of Scala to run on the latest Spark?
>>> >
>>> > Yes, this could be an issue - I'm not sure Scala has a policy towards
>>> > this, but updates (at least minor, e.g. 2.9 -> 2.10) tend to break
>>> > binary compatibility.
>>> >
>>> > Scala also makes some API updates tricky - e.g., adding a new named
>>> > argument to a Scala method is not a binary compatible change (while,
>>> > e.g., adding a new keyword argument in a python method is just fine).
>>> > The use of implicits and other Scala features make this even more
>>> > opaque...
>>> >
>>> > Anyway, not really any solutions in this message, just a few comments
>>> > I wanted to throw out there. :-)
>>> >
>>> > --
>>> > Marcelo
>>>

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

Posted by Sean Owen <so...@cloudera.com>.
On Mon, Jun 2, 2014 at 6:05 PM, Marcelo Vanzin <va...@cloudera.com> wrote:
> You mentioned something in your shading argument that kinda reminded
> me of something. Spark currently depends on slf4j implementations and
> log4j with "compile" scope. I'd argue that's the wrong approach if
> we're talking about Spark being used embedded inside applications;
> Spark should only depend on the slf4j API package, and let the
> application provide the underlying implementation.

Good idea in general; in practice, the drawback is that you can't do
things like set log levels if you only depend on the SLF4J API. There
are a few cases where that's nice to control, and that's only possible
if you bind to a particular logger as well.

You typically bundle a SLF4J binding anyway, to give a default, or
else the end-user has to know to also bind some SLF4J logger to get
output. Of course it does make for a bit more surgery if you want to
override the binding this way.

Shading can bring a whole new level of confusion; I myself would only
use it where essential as a workaround. Same with trying to make more
elaborate custom classloading schemes -- never in my darkest
nightmares have I imagine the failure modes that probably pop up when
that goes wrong. I think the library collisions will get better over
time as only later versions of Hadoop are in scope, for example,
and/or one build system is in play. I like tackling complexity along
those lines first.

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

Posted by Marcelo Vanzin <va...@cloudera.com>.
Hi Patrick,

Thanks for all the explanations, that makes sense. @DeveloperApi
worries me a little bit especially because of the things Colin
mentions - it's sort of hard to make people move off of APIs, or
support different versions of the same API. But maybe if expectations
(or lack thereof) are set up front, there will be less issues.

You mentioned something in your shading argument that kinda reminded
me of something. Spark currently depends on slf4j implementations and
log4j with "compile" scope. I'd argue that's the wrong approach if
we're talking about Spark being used embedded inside applications;
Spark should only depend on the slf4j API package, and let the
application provide the underlying implementation.

The assembly jars could include an implementation (since I assume
those are currently targeted at cluster deployment and not embedding).

That way there is less sources of conflict at runtime (i.e. the
"multiple implementation jars" messages you can see when running some
Spark programs).

On Fri, May 30, 2014 at 10:54 PM, Patrick Wendell <pw...@gmail.com> wrote:
> 2. Many libraries like logging subsystems, configuration systems, etc
> rely on static state and initialization. I'm not totally sure how e.g.
> slf4j initializes itself if you have both a shaded and non-shaded copy
> of slf4j present.

-- 
Marcelo

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

Posted by Patrick Wendell <pw...@gmail.com>.
Spark is a bit different than Hadoop MapReduce, so maybe that's a
source of some confusion. Spark is often used as a substrate for
building different types of analytics applications, so @DeveloperAPI
are internal API's that we'd like to expose to application writers,
but that might be more volatile. This is like the internal API's in
the linux kernel, they aren't stable, but of course we try to minimize
changes to them. If people want to write lower-level modules against
them, that's fine with us, but they know the interfaces might change.

This has worked pretty well over the years, even with many different
companies writing against those API's.

@Experimental are user-facing features we are trying out. Hopefully
that one is more clear.

In terms of making a big jar that shades all of our dependencies - I'm
curious how that would actually work in practice. It would be good to
explore. There are a few potential challenges I see:

1. If any of our dependencies encode class name information in IPC
messages, this would break. E.g. can you definitely shade the Hadoop
client, protobuf, hbase client, etc and have them send messages over
the wire? This could break things if class names are ever encoded in a
wire format.
2. Many libraries like logging subsystems, configuration systems, etc
rely on static state and initialization. I'm not totally sure how e.g.
slf4j initializes itself if you have both a shaded and non-shaded copy
of slf4j present.
3. This would mean the spark-core jar would be really massive because
it would inline all of our deps. We've actually been thinking of
avoiding the current assembly jar approach because, due to scala
specialized classes, our assemblies now have more than 65,000 class
files in them leading to all kinds of bad issues. We'd have to stick
with a big uber assembly-like jar if we decide to shade stuff.
4. I'm not totally sure how this would work when people want to e.g.
build Spark with different Hadoop versions. Would we publish different
shaded uber-jars for every Hadoop version? Would the Hadoop dep just
not be shaded... if so what about all it's dependencies.

Anyways just some things to consider... simplifying our classpath is
definitely an avenue worth exploring!




On Fri, May 30, 2014 at 2:56 PM, Colin McCabe <cm...@alumni.cmu.edu> wrote:
> On Fri, May 30, 2014 at 2:11 PM, Patrick Wendell <pw...@gmail.com> wrote:
>
>> Hey guys, thanks for the insights. Also, I realize Hadoop has gotten
>> way better about this with 2.2+ and I think it's great progress.
>>
>> We have well defined API levels in Spark and also automated checking
>> of API violations for new pull requests. When doing code reviews we
>> always enforce the narrowest possible visibility:
>>
>> 1. private
>> 2. private[spark]
>> 3. @Experimental or @DeveloperApi
>> 4. public
>>
>> Our automated checks exclude 1-3. Anything that breaks 4 will trigger
>> a build failure.
>>
>>
> That's really excellent.  Great job.
>
> I like the private[spark] visibility level-- sounds like this is another
> way Scala has greatly improved on Java.
>
> The Scala compiler prevents anyone external from using 1 or 2. We do
>> have "bytecode public but annotated" (3) API's that we might change.
>> We spent a lot of time looking into whether these can offer compiler
>> warnings, but we haven't found a way to do this and do not see a
>> better alternative at this point.
>>
>
> It would be nice if the production build could strip this stuff out.
>  Otherwise, it feels a lot like a @private, @unstable Hadoop API... and we
> know how those turned out.
>
>
>> Regarding Scala compatibility, Scala 2.11+ is "source code
>> compatible", meaning we'll be able to cross-compile Spark for
>> different versions of Scala. We've already been in touch with Typesafe
>> about this and they've offered to integrate Spark into their
>> compatibility test suite. They've also committed to patching 2.11 with
>> a minor release if bugs are found.
>>
>
> Thanks, I hadn't heard about this plan.  Hopefully we can get everyone on
> 2.11 ASAP.
>
>
>> Anyways, my point is we've actually thought a lot about this already.
>>
>> The CLASSPATH thing is different than API stability, but indeed also a
>> form of compatibility. This is something where I'd also like to see
>> Spark have better isolation of user classes from Spark's own
>> execution...
>>
>>
> I think the best thing to do is just "shade" all the dependencies.  Then
> they will be in a different namespace, and clients can have their own
> versions of whatever dependencies they like without conflicting.  As
> Marcelo mentioned, there might be a few edge cases where this breaks
> reflection, but I don't think that's an issue for most libraries.  So at
> worst case we could end up needing apps to follow us in lockstep for Kryo
> or maybe Akka, but not the whole kit and caboodle like with Hadoop.
>
> best,
> Colin
>
>
> - Patrick
>>
>>
>>
>> On Fri, May 30, 2014 at 12:30 PM, Marcelo Vanzin <va...@cloudera.com>
>> wrote:
>> > On Fri, May 30, 2014 at 12:05 PM, Colin McCabe <cm...@alumni.cmu.edu>
>> wrote:
>> >> I don't know if Scala provides any mechanisms to do this beyond what
>> Java provides.
>> >
>> > In fact it does. You can say something like "private[foo]" and the
>> > annotated element will be visible for all classes under "foo" (where
>> > "foo" is any package in the hierarchy leading up to the class). That's
>> > used a lot in Spark.
>> >
>> > I haven't fully looked at how the @DeveloperApi is used, but I agree
>> > with you - annotations are not a good way to do this. The Scala
>> > feature above would be much better, but it might still leak things at
>> > the Java bytecode level (don't know how Scala implements it under the
>> > cover, but I assume it's not by declaring the element as a Java
>> > "private").
>> >
>> > Another thing is that in Scala the default visibility is public, which
>> > makes it very easy to inadvertently add things to the API. I'd like to
>> > see more care in making things have the proper visibility - I
>> > generally declare things private first, and relax that as needed.
>> > Using @VisibleForTesting would be great too, when the Scala
>> > private[foo] approach doesn't work.
>> >
>> >> Does Spark also expose its CLASSPATH in
>> >> this way to executors?  I was under the impression that it did.
>> >
>> > If you're using the Spark assemblies, yes, there is a lot of things
>> > that your app gets exposed to. For example, you can see Guava and
>> > Jetty (and many other things) there. This is something that has always
>> > bugged me, but I don't really have a good suggestion of how to fix it;
>> > shading goes a certain way, but it also breaks codes that uses
>> > reflection (e.g. Class.forName()-style class loading).
>> >
>> > What is worse is that Spark doesn't even agree with the Hadoop code it
>> > depends on; e.g., Spark uses Guava 14.x while Hadoop is still in Guava
>> > 11.x. So when you run your Scala app, what gets loaded?
>> >
>> >> At some point we will also have to confront the Scala version issue.
>>  Will
>> >> there be flag days where Spark jobs need to be upgraded to a new,
>> >> incompatible version of Scala to run on the latest Spark?
>> >
>> > Yes, this could be an issue - I'm not sure Scala has a policy towards
>> > this, but updates (at least minor, e.g. 2.9 -> 2.10) tend to break
>> > binary compatibility.
>> >
>> > Scala also makes some API updates tricky - e.g., adding a new named
>> > argument to a Scala method is not a binary compatible change (while,
>> > e.g., adding a new keyword argument in a python method is just fine).
>> > The use of implicits and other Scala features make this even more
>> > opaque...
>> >
>> > Anyway, not really any solutions in this message, just a few comments
>> > I wanted to throw out there. :-)
>> >
>> > --
>> > Marcelo
>>

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

Posted by Colin McCabe <cm...@alumni.cmu.edu>.
On Fri, May 30, 2014 at 2:11 PM, Patrick Wendell <pw...@gmail.com> wrote:

> Hey guys, thanks for the insights. Also, I realize Hadoop has gotten
> way better about this with 2.2+ and I think it's great progress.
>
> We have well defined API levels in Spark and also automated checking
> of API violations for new pull requests. When doing code reviews we
> always enforce the narrowest possible visibility:
>
> 1. private
> 2. private[spark]
> 3. @Experimental or @DeveloperApi
> 4. public
>
> Our automated checks exclude 1-3. Anything that breaks 4 will trigger
> a build failure.
>
>
That's really excellent.  Great job.

I like the private[spark] visibility level-- sounds like this is another
way Scala has greatly improved on Java.

The Scala compiler prevents anyone external from using 1 or 2. We do
> have "bytecode public but annotated" (3) API's that we might change.
> We spent a lot of time looking into whether these can offer compiler
> warnings, but we haven't found a way to do this and do not see a
> better alternative at this point.
>

It would be nice if the production build could strip this stuff out.
 Otherwise, it feels a lot like a @private, @unstable Hadoop API... and we
know how those turned out.


> Regarding Scala compatibility, Scala 2.11+ is "source code
> compatible", meaning we'll be able to cross-compile Spark for
> different versions of Scala. We've already been in touch with Typesafe
> about this and they've offered to integrate Spark into their
> compatibility test suite. They've also committed to patching 2.11 with
> a minor release if bugs are found.
>

Thanks, I hadn't heard about this plan.  Hopefully we can get everyone on
2.11 ASAP.


> Anyways, my point is we've actually thought a lot about this already.
>
> The CLASSPATH thing is different than API stability, but indeed also a
> form of compatibility. This is something where I'd also like to see
> Spark have better isolation of user classes from Spark's own
> execution...
>
>
I think the best thing to do is just "shade" all the dependencies.  Then
they will be in a different namespace, and clients can have their own
versions of whatever dependencies they like without conflicting.  As
Marcelo mentioned, there might be a few edge cases where this breaks
reflection, but I don't think that's an issue for most libraries.  So at
worst case we could end up needing apps to follow us in lockstep for Kryo
or maybe Akka, but not the whole kit and caboodle like with Hadoop.

best,
Colin


- Patrick
>
>
>
> On Fri, May 30, 2014 at 12:30 PM, Marcelo Vanzin <va...@cloudera.com>
> wrote:
> > On Fri, May 30, 2014 at 12:05 PM, Colin McCabe <cm...@alumni.cmu.edu>
> wrote:
> >> I don't know if Scala provides any mechanisms to do this beyond what
> Java provides.
> >
> > In fact it does. You can say something like "private[foo]" and the
> > annotated element will be visible for all classes under "foo" (where
> > "foo" is any package in the hierarchy leading up to the class). That's
> > used a lot in Spark.
> >
> > I haven't fully looked at how the @DeveloperApi is used, but I agree
> > with you - annotations are not a good way to do this. The Scala
> > feature above would be much better, but it might still leak things at
> > the Java bytecode level (don't know how Scala implements it under the
> > cover, but I assume it's not by declaring the element as a Java
> > "private").
> >
> > Another thing is that in Scala the default visibility is public, which
> > makes it very easy to inadvertently add things to the API. I'd like to
> > see more care in making things have the proper visibility - I
> > generally declare things private first, and relax that as needed.
> > Using @VisibleForTesting would be great too, when the Scala
> > private[foo] approach doesn't work.
> >
> >> Does Spark also expose its CLASSPATH in
> >> this way to executors?  I was under the impression that it did.
> >
> > If you're using the Spark assemblies, yes, there is a lot of things
> > that your app gets exposed to. For example, you can see Guava and
> > Jetty (and many other things) there. This is something that has always
> > bugged me, but I don't really have a good suggestion of how to fix it;
> > shading goes a certain way, but it also breaks codes that uses
> > reflection (e.g. Class.forName()-style class loading).
> >
> > What is worse is that Spark doesn't even agree with the Hadoop code it
> > depends on; e.g., Spark uses Guava 14.x while Hadoop is still in Guava
> > 11.x. So when you run your Scala app, what gets loaded?
> >
> >> At some point we will also have to confront the Scala version issue.
>  Will
> >> there be flag days where Spark jobs need to be upgraded to a new,
> >> incompatible version of Scala to run on the latest Spark?
> >
> > Yes, this could be an issue - I'm not sure Scala has a policy towards
> > this, but updates (at least minor, e.g. 2.9 -> 2.10) tend to break
> > binary compatibility.
> >
> > Scala also makes some API updates tricky - e.g., adding a new named
> > argument to a Scala method is not a binary compatible change (while,
> > e.g., adding a new keyword argument in a python method is just fine).
> > The use of implicits and other Scala features make this even more
> > opaque...
> >
> > Anyway, not really any solutions in this message, just a few comments
> > I wanted to throw out there. :-)
> >
> > --
> > Marcelo
>

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

Posted by Patrick Wendell <pw...@gmail.com>.
Hey guys, thanks for the insights. Also, I realize Hadoop has gotten
way better about this with 2.2+ and I think it's great progress.

We have well defined API levels in Spark and also automated checking
of API violations for new pull requests. When doing code reviews we
always enforce the narrowest possible visibility:

1. private
2. private[spark]
3. @Experimental or @DeveloperApi
4. public

Our automated checks exclude 1-3. Anything that breaks 4 will trigger
a build failure.

The Scala compiler prevents anyone external from using 1 or 2. We do
have "bytecode public but annotated" (3) API's that we might change.
We spent a lot of time looking into whether these can offer compiler
warnings, but we haven't found a way to do this and do not see a
better alternative at this point.

Regarding Scala compatibility, Scala 2.11+ is "source code
compatible", meaning we'll be able to cross-compile Spark for
different versions of Scala. We've already been in touch with Typesafe
about this and they've offered to integrate Spark into their
compatibility test suite. They've also committed to patching 2.11 with
a minor release if bugs are found.

Anyways, my point is we've actually thought a lot about this already.

The CLASSPATH thing is different than API stability, but indeed also a
form of compatibility. This is something where I'd also like to see
Spark have better isolation of user classes from Spark's own
execution...

- Patrick



On Fri, May 30, 2014 at 12:30 PM, Marcelo Vanzin <va...@cloudera.com> wrote:
> On Fri, May 30, 2014 at 12:05 PM, Colin McCabe <cm...@alumni.cmu.edu> wrote:
>> I don't know if Scala provides any mechanisms to do this beyond what Java provides.
>
> In fact it does. You can say something like "private[foo]" and the
> annotated element will be visible for all classes under "foo" (where
> "foo" is any package in the hierarchy leading up to the class). That's
> used a lot in Spark.
>
> I haven't fully looked at how the @DeveloperApi is used, but I agree
> with you - annotations are not a good way to do this. The Scala
> feature above would be much better, but it might still leak things at
> the Java bytecode level (don't know how Scala implements it under the
> cover, but I assume it's not by declaring the element as a Java
> "private").
>
> Another thing is that in Scala the default visibility is public, which
> makes it very easy to inadvertently add things to the API. I'd like to
> see more care in making things have the proper visibility - I
> generally declare things private first, and relax that as needed.
> Using @VisibleForTesting would be great too, when the Scala
> private[foo] approach doesn't work.
>
>> Does Spark also expose its CLASSPATH in
>> this way to executors?  I was under the impression that it did.
>
> If you're using the Spark assemblies, yes, there is a lot of things
> that your app gets exposed to. For example, you can see Guava and
> Jetty (and many other things) there. This is something that has always
> bugged me, but I don't really have a good suggestion of how to fix it;
> shading goes a certain way, but it also breaks codes that uses
> reflection (e.g. Class.forName()-style class loading).
>
> What is worse is that Spark doesn't even agree with the Hadoop code it
> depends on; e.g., Spark uses Guava 14.x while Hadoop is still in Guava
> 11.x. So when you run your Scala app, what gets loaded?
>
>> At some point we will also have to confront the Scala version issue.  Will
>> there be flag days where Spark jobs need to be upgraded to a new,
>> incompatible version of Scala to run on the latest Spark?
>
> Yes, this could be an issue - I'm not sure Scala has a policy towards
> this, but updates (at least minor, e.g. 2.9 -> 2.10) tend to break
> binary compatibility.
>
> Scala also makes some API updates tricky - e.g., adding a new named
> argument to a Scala method is not a binary compatible change (while,
> e.g., adding a new keyword argument in a python method is just fine).
> The use of implicits and other Scala features make this even more
> opaque...
>
> Anyway, not really any solutions in this message, just a few comments
> I wanted to throw out there. :-)
>
> --
> Marcelo

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

Posted by Marcelo Vanzin <va...@cloudera.com>.
On Fri, May 30, 2014 at 12:05 PM, Colin McCabe <cm...@alumni.cmu.edu> wrote:
> I don't know if Scala provides any mechanisms to do this beyond what Java provides.

In fact it does. You can say something like "private[foo]" and the
annotated element will be visible for all classes under "foo" (where
"foo" is any package in the hierarchy leading up to the class). That's
used a lot in Spark.

I haven't fully looked at how the @DeveloperApi is used, but I agree
with you - annotations are not a good way to do this. The Scala
feature above would be much better, but it might still leak things at
the Java bytecode level (don't know how Scala implements it under the
cover, but I assume it's not by declaring the element as a Java
"private").

Another thing is that in Scala the default visibility is public, which
makes it very easy to inadvertently add things to the API. I'd like to
see more care in making things have the proper visibility - I
generally declare things private first, and relax that as needed.
Using @VisibleForTesting would be great too, when the Scala
private[foo] approach doesn't work.

> Does Spark also expose its CLASSPATH in
> this way to executors?  I was under the impression that it did.

If you're using the Spark assemblies, yes, there is a lot of things
that your app gets exposed to. For example, you can see Guava and
Jetty (and many other things) there. This is something that has always
bugged me, but I don't really have a good suggestion of how to fix it;
shading goes a certain way, but it also breaks codes that uses
reflection (e.g. Class.forName()-style class loading).

What is worse is that Spark doesn't even agree with the Hadoop code it
depends on; e.g., Spark uses Guava 14.x while Hadoop is still in Guava
11.x. So when you run your Scala app, what gets loaded?

> At some point we will also have to confront the Scala version issue.  Will
> there be flag days where Spark jobs need to be upgraded to a new,
> incompatible version of Scala to run on the latest Spark?

Yes, this could be an issue - I'm not sure Scala has a policy towards
this, but updates (at least minor, e.g. 2.9 -> 2.10) tend to break
binary compatibility.

Scala also makes some API updates tricky - e.g., adding a new named
argument to a Scala method is not a binary compatible change (while,
e.g., adding a new keyword argument in a python method is just fine).
The use of implicits and other Scala features make this even more
opaque...

Anyway, not really any solutions in this message, just a few comments
I wanted to throw out there. :-)

-- 
Marcelo

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

Posted by Colin McCabe <cm...@alumni.cmu.edu>.
First of all, I think it's great that you're thinking about this.  API
stability is super important and it would be good to see Spark get on top
of this.

I want to clarify a bit about Hadoop.  The problem that Hadoop faces is
that the Java package system isn't very flexible.  If you have a method in,
say, the org.apache.hadoop.hdfs.shortcircuit package that should only ever
be used by the org.apache.hadoop.hdfs.client package, there is no way to
express that.  You have to make the method public.  You can hide things by
making them package-private, but that only works if your entire project is
a single giant package, and that is not the road Hadoop devs wanted to go
down.

So a lot of internal stuff ended up being public.  Once things are public,
of course, they can be called by anyone.  To get around this limitation,
Hadoop came up with a pretty rigorous compatibility policy, discussed here:
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/InterfaceClassification.html
The basic idea is that we'd put "interface annotations" on every public
class.  The "Private" annotation meant that it was only supposed to be used
in the project itself.  "Limited-Private" was kind of the project and maybe
one or two closely related projects.  And "Public" was supposed to be the
public API.  At a finer granularity, for specific public methods, you could
add the "VisibileForTesting" annotation to indicate that they were only
visible to make a unit test possible.

This sounds great in theory.  But in practice, users often ignore the
annotation and just do whatever they want.  This is not because they're
mustache-twirling villains, but because they have legitimate (to them)
reasons.  For example, HBase would often find that they could get better
performance by hooking into supposedly private HDFS APIs.  Of course, they
could always ask HDFS to add public versions of those APIs.  But that takes
time, and could be contentious.  In the best case, they'd have to wait for
another Hadoop release to happen before HBase could benefit.  From their
perspective, supporting the feature on more Hadoop releases was better than
supporting it on fewer, even if the latter was the "correct" way of doing
things.  Then of course there were the cases where there were simple
oversights... there either was no interface annotation or the user of the
downstream project forgot to check it.

Ideally, we'd later add a @stable API and transition everyone to it.  But
that's much easier said than done.  A lot of projects just don't want to
change, because it would mean giving up compatibility with older releases
without the "blessed" API.  Basically, it's a tragedy of the commons.  It
would be much better for everyone if we all used public stable APIs and
never used private or unstable ones.  But each individual project feels
that it can get advantages by cheating and using (or continuing to use) the
private / unstable APIs.  Candidly, Spark is one of those projects that
continues to use deprecated and private Hadoop APIs-- mostly for
compatibility reasons, as I understand.

I think that the lesson learned here is that the compiler needs to be in
charge of preventing people from using APIs, not an annotation.
 Public/private annotations "Just Don't Work."  I don't know if Scala
provides any mechanisms to do this beyond what Java provides.  Even if not,
there are probably classloader and CLASSPATH tricks that could be used to
hide internals.  I also think that it makes sense to put a lot of thought
into APIs up front, because changing them later can be very painful.  On a
related note, there were definitely cases where Hadoop changed an API, and
the pain outweighed the gain.

There are other dimensions to compatibility... for example, Hadoop
currently leaks its CLASSPATH, so that you can't easily write a MapReduce
job without using the same versions of Guava (just to pick one random
example) that it does.  In practice, this led to a pathological fear of
updating dependencies, since we didn't want to break users who needed
specific version of their deps.  Does Spark also expose its CLASSPATH in
this way to executors?  I was under the impression that it did.

At some point we will also have to confront the Scala version issue.  Will
there be flag days where Spark jobs need to be upgraded to a new,
incompatible version of Scala to run on the latest Spark?  There are pros
and cons, but I think users will mostly see the cons.

On Thu, May 29, 2014 at 1:23 PM, Patrick Wendell <pw...@gmail.com> wrote:

> 1. Hadoop projects don't do any rigorous checking that new patches
> don't break API's. Of course, the results in regular API breaks and a
> poor understanding of what is a public API.
>

I agree with this.  We should test these compatibility scenarios, and we
don't.  It would be awesome to do this in an automated way for Spark.


> 2. In several cases it's not possible to do basic things in Hadoop
> without using deprecated or private API's.
>

Disagree.  The problem is that we have stable APIs, but users don't want to
use them (they prefer the ancient API Doug Cutting wrote in 2008, because
it works on some old version of Hadoop).  It's hard to argue against this
kind of reasoning, since (to reiterate) it's rational from the point of
view of the individual.  This is the problem with deprecation in general--
once you've let an API out into the wild, it's very difficult to get it
back into its cage.

3. There is significant vendor fragmentation of API's.
>

The big difference in the last few years was that some people were creating
distributions based on Hadoop 1.x and others were creating distributions
based on 2.x.  But nobody added vendor specific APIs (or at least I haven't
heard of any).  (I can't speak for MapR... since they are proprietary, I
have not seen the code.)  Now that Hadoop 1.x is starting to die a natural
death, any differences between 2.x and 1.x are becoming less important.
 Sadly, Yahoo continues to use and develop 0.23, for now at least... But I
think their efforts are mostly directed at backporting.  They have not
added divergent APIs, to my knowledge.

best,
Colin


The main focus of the Hadoop vendors is making consistent cuts of the
> core projects work together (HDFS/Pig/Hive/etc) - so API breaks are
> sometimes considered "fixed" as long as the other projects work around
> them (see [1]). We also regularly need to do archaeology (see [2]) and
> directly interact with Hadoop committers to understand what API's are
> stable and in which versions.
>
> One goal of Spark is to deal with the pain of inter-operating with
> Hadoop so that application writers don't to. We'd like to retain the
> property that if you build an application against the (well defined,
> stable) Spark API's right now, you'll be able to run it across many
> Hadoop vendors and versions for the entire Spark 1.X release cycle.
>
> Writing apps against Hadoop can be very difficult... consider how much
> more engineering effort we spent maintaining YARN support than Mesos
> support. There are many factors, but one is that Mesos has a single,
> narrow, stable API. We've never had to make a change in Mesos due to
> an API change, for several years. YARN on the other hand, there are at
> least 3 YARN API's that currently exist, which are all binary
> incompatible. We'd like to offer apps the ability to build against
> Spark's API and just let us deal with it.
>
> As more vendors packaging Spark, I'd like to see us put tools in the
> upstream Spark repo that do validation for vendor packages of Spark,
> so that we don't end up with fragmentation. Of course, vendors can
> enhance the API and are encouraged to, but we need a kernel of API's
> that vendors must maintain (think POSIX) to be considered compliant
> with Apache Spark. I believe some other projects like OpenStack have
> done this to avoid fragmentation.
>
> - Patrick
>
> [1] https://issues.apache.org/jira/browse/MAPREDUCE-5830
> [2]
> http://2.bp.blogspot.com/-GO6HF0OAFHw/UOfNEH-4sEI/AAAAAAAAAD0/dEWFFYTRgYw/s1600/output-file.png
>
> On Sun, May 18, 2014 at 2:13 AM, Mridul Muralidharan <mr...@gmail.com>
> wrote:
> > So I think I need to clarify a few things here - particularly since
> > this mail went to the wrong mailing list and a much wider audience
> > than I intended it for :-)
> >
> >
> > Most of the issues I mentioned are internal implementation detail of
> > spark core : which means, we can enhance them in future without
> > disruption to our userbase (ability to support large number of
> > input/output partitions. Note: this is of order of 100k input and
> > output partitions with uniform spread of keys - very rarely seen
> > outside of some crazy jobs).
> >
> > Some of the issues I mentioned would reqiure DeveloperApi changes -
> > which are not user exposed : they would impact developer use of these
> > api's - which are mostly internally provided by spark. (Like fixing
> > blocks > 2G would require change to Serializer api)
> >
> > A smaller faction might require interface changes - note, I am
> > referring specifically to configuration changes (removing/deprecating
> > some) and possibly newer options to submit/env, etc - I dont envision
> > any programming api change itself.
> > The only api change we did was from Seq -> Iterable - which is
> > actually to address some of the issues I mentioned (join/cogroup).
> >
> > Remaining are bugs which need to be addressed or the feature
> > removed/enhanced like shuffle consolidation.
> >
> > There might be semantic extension of some things like OFF_HEAP storage
> > level to address other computation models - but that would not have an
> > impact on end user - since other options would be pluggable with
> > default set to Tachyon so that there is no user expectation change.
> >
> >
> > So will the interface possibly change ? Sure though we will try to
> > keep it backwardly compatible (as we did with 1.0).
> > Will the api change - other than backward compatible enhancements,
> probably not.
> >
> >
> > Regards,
> > Mridul
> >
> >
> > On Sun, May 18, 2014 at 12:11 PM, Mridul Muralidharan <mr...@gmail.com>
> wrote:
> >>
> >> On 18-May-2014 5:05 am, "Mark Hamstra" <ma...@clearstorydata.com> wrote:
> >>>
> >>> I don't understand.  We never said that interfaces wouldn't change from
> >>> 0.9
> >>
> >> Agreed.
> >>
> >>> to 1.0.  What we are committing to is stability going forward from the
> >>> 1.0.0 baseline.  Nobody is disputing that backward-incompatible
> behavior
> >>> or
> >>> interface changes would be an issue post-1.0.0.  The question is
> whether
> >>
> >> The point is, how confident are we that these are the right set of
> interface
> >> definitions.
> >> We think it is, but we could also have gone through a 0.10 to vet the
> >> proposed 1.0 changes to stabilize them.
> >>
> >> To give examples for which we don't have solutions currently (which we
> are
> >> facing internally here btw, so not academic exercise) :
> >>
> >> - Current spark shuffle model breaks very badly as number of partitions
> >> increases (input and output).
> >>
> >> - As number of nodes increase, the overhead per node keeps going up.
> Spark
> >> currently is more geared towards large memory machines; when the RAM per
> >> node is modest (8 to 16 gig) but large number of them are available, it
> does
> >> not do too well.
> >>
> >> - Current block abstraction breaks as data per block goes beyond 2 gig.
> >>
> >> - Cogroup/join when value per key or number of keys (or both) is high
> breaks
> >> currently.
> >>
> >> - Shuffle consolidation is so badly broken it is not funny.
> >>
> >> - Currently there is no way of effectively leveraging accelerator
> >> cards/coprocessors/gpus from spark - to do so, I suspect we will need to
> >> redefine OFF_HEAP.
> >>
> >> - Effectively leveraging ssd is still an open question IMO when you
> have mix
> >> of both available.
> >>
> >> We have resolved some of these and looking at the rest. These are not
> unique
> >> to our internal usage profile, I have seen most of these asked elsewhere
> >> too.
> >>
> >> Thankfully some of the 1.0 changes actually are geared towards helping
> to
> >> alleviate some of the above (Iterable change for ex), most of the rest
> are
> >> internal impl detail of spark core which helps a lot - but there are
> cases
> >> where this is not so.
> >>
> >> Unfortunately I don't know yet if the unresolved/uninvestigated issues
> will
> >> require more changes or not.
> >>
> >> Given this I am very skeptical of expecting current spark interfaces to
> be
> >> sufficient for next 1 year (forget 3)
> >>
> >> I understand this is an argument which can be made to never release 1.0
> :-)
> >> Which is why I was ok with a 1.0 instead of 0.10 release in spite of my
> >> preference.
> >>
> >> This is a good problem to have IMO ... People are using spark
> extensively
> >> and in circumstances that we did not envision : necessitating changes
> even
> >> to spark core.
> >>
> >> But the claim that 1.0 interfaces are stable is not something I buy -
> they
> >> are not, we will need to break them soon and cost of maintaining
> backward
> >> compatibility will be high.
> >>
> >> We just need to make an informed decision to live with that cost, not
> hand
> >> wave it away.
> >>
> >> Regards
> >> Mridul
> >>
> >>> there is anything apparent now that is expected to require such
> disruptive
> >>> changes if we were to commit to the current release candidate as our
> >>> guaranteed 1.0.0 baseline.
> >>>
> >>>
> >>> On Sat, May 17, 2014 at 2:05 PM, Mridul Muralidharan
> >>> <mr...@gmail.com>wrote:
> >>>
> >>> > I would make the case for interface stability not just api stability.
> >>> > Particularly given that we have significantly changed some of our
> >>> > interfaces, I want to ensure developers/users are not seeing red
> flags.
> >>> >
> >>> > Bugs and code stability can be addressed in minor releases if found,
> but
> >>> > behavioral change and/or interface changes would be a much more
> invasive
> >>> > issue for our users.
> >>> >
> >>> > Regards
> >>> > Mridul
> >>> > On 18-May-2014 2:19 am, "Matei Zaharia" <ma...@gmail.com>
> wrote:
> >>> >
> >>> > > As others have said, the 1.0 milestone is about API stability, not
> >>> > > about
> >>> > > saying "we've eliminated all bugs". The sooner you declare 1.0, the
> >>> > sooner
> >>> > > users can confidently build on Spark, knowing that the application
> >>> > > they
> >>> > > build today will still run on Spark 1.9.9 three years from now.
> This
> >>> > > is
> >>> > > something that I've seen done badly (and experienced the effects
> >>> > > thereof)
> >>> > > in other big data projects, such as MapReduce and even YARN. The
> >>> > > result
> >>> > is
> >>> > > that you annoy users, you end up with a fragmented userbase where
> >>> > everyone
> >>> > > is building against a different version, and you drastically slow
> down
> >>> > > development.
> >>> > >
> >>> > > With a project as fast-growing as fast-growing as Spark in
> particular,
> >>> > > there will be new bugs discovered and reported continuously,
> >>> > > especially
> >>> > in
> >>> > > the non-core components. Look at the graph of # of contributors in
> >>> > > time
> >>> > to
> >>> > > Spark: https://www.ohloh.net/p/apache-spark (bottom-most graph;
> >>> > "commits"
> >>> > > changed when we started merging each patch as a single commit).
> This
> >>> > > is
> >>> > not
> >>> > > slowing down, and we need to have the culture now that we treat API
> >>> > > stability and release numbers at the level expected for a 1.0
> project
> >>> > > instead of having people come in and randomly change the API.
> >>> > >
> >>> > > I'll also note that the issues marked "blocker" were marked so by
> >>> > > their
> >>> > > reporters, since the reporter can set the priority. I don't
> consider
> >>> > stuff
> >>> > > like parallelize() not partitioning ranges in the same way as other
> >>> > > collections a blocker -- it's a bug, it would be good to fix it,
> but it
> >>> > only
> >>> > > affects a small number of use cases. Of course if we find a real
> >>> > > blocker
> >>> > > (in particular a regression from a previous version, or a feature
> >>> > > that's
> >>> > > just completely broken), we will delay the release for that, but at
> >>> > > some
> >>> > > point you have to say "okay, this fix will go into the next
> >>> > > maintenance
> >>> > > release". Maybe we need to write a clear policy for what the issue
> >>> > > priorities mean.
> >>> > >
> >>> > > Finally, I believe it's much better to have a culture where you can
> >>> > > make
> >>> > > releases on a regular schedule, and have the option to make a
> >>> > > maintenance
> >>> > > release in 3-4 days if you find new bugs, than one where you pile
> up
> >>> > stuff
> >>> > > into each release. This is what much large project than us, like
> >>> > > Linux,
> >>> > do,
> >>> > > and it's the only way to avoid indefinite stalling with a large
> >>> > contributor
> >>> > > base. In the worst case, if you find a new bug that warrants
> immediate
> >>> > > release, it goes into 1.0.1 a week after 1.0.0 (we can vote on
> 1.0.1
> >>> > > in
> >>> > > three days with just your bug fix in it). And if you find an API
> that
> >>> > you'd
> >>> > > like to improve, just add a new one and maybe deprecate the old
> one --
> >>> > > at
> >>> > > some point we have to respect our users and let them know that code
> >>> > > they
> >>> > > write today will still run tomorrow.
> >>> > >
> >>> > > Matei
> >>> > >
> >>> > > On May 17, 2014, at 10:32 AM, Kan Zhang <kz...@apache.org> wrote:
> >>> > >
> >>> > > > +1 on the running commentary here, non-binding of course :-)
> >>> > > >
> >>> > > >
> >>> > > > On Sat, May 17, 2014 at 8:44 AM, Andrew Ash <
> andrew@andrewash.com>
> >>> > > wrote:
> >>> > > >
> >>> > > >> +1 on the next release feeling more like a 0.10 than a 1.0
> >>> > > >> On May 17, 2014 4:38 AM, "Mridul Muralidharan" <
> mridul@gmail.com>
> >>> > > wrote:
> >>> > > >>
> >>> > > >>> I had echoed similar sentiments a while back when there was a
> >>> > > discussion
> >>> > > >>> around 0.10 vs 1.0 ... I would have preferred 0.10 to stabilize
> >>> > > >>> the
> >>> > api
> >>> > > >>> changes, add missing functionality, go through a hardening
> release
> >>> > > before
> >>> > > >>> 1.0
> >>> > > >>>
> >>> > > >>> But the community preferred a 1.0 :-)
> >>> > > >>>
> >>> > > >>> Regards,
> >>> > > >>> Mridul
> >>> > > >>>
> >>> > > >>> On 17-May-2014 3:19 pm, "Sean Owen" <so...@cloudera.com>
> wrote:
> >>> > > >>>>
> >>> > > >>>> On this note, non-binding commentary:
> >>> > > >>>>
> >>> > > >>>> Releases happen in local minima of change, usually created by
> >>> > > >>>> internally enforced code freeze. Spark is incredibly busy now
> due
> >>> > > >>>> to
> >>> > > >>>> external factors -- recently a TLP, recently discovered by a
> >>> > > >>>> large
> >>> > new
> >>> > > >>>> audience, ease of contribution enabled by Github. It's getting
> >>> > > >>>> like
> >>> > > >>>> the first year of mainstream battle-testing in a month. It's
> been
> >>> > very
> >>> > > >>>> hard to freeze anything! I see a number of non-trivial issues
> >>> > > >>>> being
> >>> > > >>>> reported, and I don't think it has been possible to triage
> all of
> >>> > > >>>> them, even.
> >>> > > >>>>
> >>> > > >>>> Given the high rate of change, my instinct would have been to
> >>> > release
> >>> > > >>>> 0.10.0 now. But won't it always be very busy? I do think the
> rate
> >>> > > >>>> of
> >>> > > >>>> significant issues will slow down.
> >>> > > >>>>
> >>> > > >>>> Version ain't nothing but a number, but if it has any meaning
> >>> > > >>>> it's
> >>> > the
> >>> > > >>>> semantic versioning meaning. 1.0 imposes extra handicaps
> around
> >>> > > >>>> striving to maintain backwards-compatibility. That may end up
> >>> > > >>>> being
> >>> > > >>>> bent to fit in important changes that are going to be
> required in
> >>> > this
> >>> > > >>>> continuing period of change. Hadoop does this all the time
> >>> > > >>>> unfortunately and gets away with it, I suppose -- minor
> version
> >>> > > >>>> releases are really major. (On the other extreme, HBase is at
> >>> > > >>>> 0.98
> >>> > and
> >>> > > >>>> quite production-ready.)
> >>> > > >>>>
> >>> > > >>>> Just consider this a second vote for focus on fixes and 1.0.x
> >>> > > >>>> rather
> >>> > > >>>> than new features and 1.x. I think there are a few steps that
> >>> > > >>>> could
> >>> > > >>>> streamline triage of this flood of contributions, and make
> all of
> >>> > this
> >>> > > >>>> easier, but that's for another thread.
> >>> > > >>>>
> >>> > > >>>>
> >>> > > >>>> On Fri, May 16, 2014 at 8:50 PM, Mark Hamstra <
> >>> > > mark@clearstorydata.com
> >>> > > >>>
> >>> > > >>> wrote:
> >>> > > >>>>> +1, but just barely.  We've got quite a number of outstanding
> >>> > > >>>>> bugs
> >>> > > >>>>> identified, and many of them have fixes in progress.  I'd
> hate
> >>> > > >>>>> to
> >>> > see
> >>> > > >>> those
> >>> > > >>>>> efforts get lost in a post-1.0.0 flood of new features
> targeted
> >>> > > >>>>> at
> >>> > > >>> 1.1.0 --
> >>> > > >>>>> in other words, I'd like to see 1.0.1 retain a high priority
> >>> > relative
> >>> > > >>> to
> >>> > > >>>>> 1.1.0.
> >>> > > >>>>>
> >>> > > >>>>> Looking through the unresolved JIRAs, it doesn't look like
> any
> >>> > > >>>>> of
> >>> > the
> >>> > > >>>>> identified bugs are show-stoppers or strictly regressions
> >>> > (although I
> >>> > > >>> will
> >>> > > >>>>> note that one that I have in progress, SPARK-1749, is a bug
> that
> >>> > > >>>>> we
> >>> > > >>>>> introduced with recent work -- it's not strictly a regression
> >>> > because
> >>> > > >>> we
> >>> > > >>>>> had equally bad but different behavior when the DAGScheduler
> >>> > > >> exceptions
> >>> > > >>>>> weren't previously being handled at all vs. being slightly
> >>> > > >> mis-handled
> >>> > > >>>>> now), so I'm not currently seeing a reason not to release.
> >>> > > >>>
> >>> > > >>
> >>> > >
> >>> > >
> >>> >
>

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

Posted by Patrick Wendell <pw...@gmail.com>.
[tl;dr stable API's are important - sorry, this is slightly meandering]

Hey - just wanted to chime in on this as I was travelling. Sean, you
bring up great points here about the velocity and stability of Spark.
Many projects have fairly customized semantics around what versions
actually mean (HBase is a good, if somewhat hard-to-comprehend,
example).

What the 1.X label means to Spark is that we are willing to guarantee
stability for Spark's core API. This is something that actually, Spark
has been doing for a while already (we've made few or no breaking
changes to the Spark core API for several years) and we want to codify
this for application developers. In this regard Spark has made a bunch
of changes to enforce the integrity of our API's:

- We went through and clearly annotated internal, or experimental
API's. This was a huge project-wide effort and included Scaladoc and
several other components to make it clear to users.
- We implemented automated byte-code verification of all proposed pull
requests that they don't break public API's. Pull requests after 1.0
will fail if they break API's that are not explicitly declared private
or experimental.

I can't possibly emphasize enough the importance of API stability.
What we want to avoid is the Hadoop approach. Candidly, Hadoop does a
poor job on this. There really isn't a well defined stable API for any
of the Hadoop components, for a few reasons:

1. Hadoop projects don't do any rigorous checking that new patches
don't break API's. Of course, the results in regular API breaks and a
poor understanding of what is a public API.
2. In several cases it's not possible to do basic things in Hadoop
without using deprecated or private API's.
3. There is significant vendor fragmentation of API's.

The main focus of the Hadoop vendors is making consistent cuts of the
core projects work together (HDFS/Pig/Hive/etc) - so API breaks are
sometimes considered "fixed" as long as the other projects work around
them (see [1]). We also regularly need to do archaeology (see [2]) and
directly interact with Hadoop committers to understand what API's are
stable and in which versions.

One goal of Spark is to deal with the pain of inter-operating with
Hadoop so that application writers don't to. We'd like to retain the
property that if you build an application against the (well defined,
stable) Spark API's right now, you'll be able to run it across many
Hadoop vendors and versions for the entire Spark 1.X release cycle.

Writing apps against Hadoop can be very difficult... consider how much
more engineering effort we spent maintaining YARN support than Mesos
support. There are many factors, but one is that Mesos has a single,
narrow, stable API. We've never had to make a change in Mesos due to
an API change, for several years. YARN on the other hand, there are at
least 3 YARN API's that currently exist, which are all binary
incompatible. We'd like to offer apps the ability to build against
Spark's API and just let us deal with it.

As more vendors packaging Spark, I'd like to see us put tools in the
upstream Spark repo that do validation for vendor packages of Spark,
so that we don't end up with fragmentation. Of course, vendors can
enhance the API and are encouraged to, but we need a kernel of API's
that vendors must maintain (think POSIX) to be considered compliant
with Apache Spark. I believe some other projects like OpenStack have
done this to avoid fragmentation.

- Patrick

[1] https://issues.apache.org/jira/browse/MAPREDUCE-5830
[2] http://2.bp.blogspot.com/-GO6HF0OAFHw/UOfNEH-4sEI/AAAAAAAAAD0/dEWFFYTRgYw/s1600/output-file.png

On Sun, May 18, 2014 at 2:13 AM, Mridul Muralidharan <mr...@gmail.com> wrote:
> So I think I need to clarify a few things here - particularly since
> this mail went to the wrong mailing list and a much wider audience
> than I intended it for :-)
>
>
> Most of the issues I mentioned are internal implementation detail of
> spark core : which means, we can enhance them in future without
> disruption to our userbase (ability to support large number of
> input/output partitions. Note: this is of order of 100k input and
> output partitions with uniform spread of keys - very rarely seen
> outside of some crazy jobs).
>
> Some of the issues I mentioned would reqiure DeveloperApi changes -
> which are not user exposed : they would impact developer use of these
> api's - which are mostly internally provided by spark. (Like fixing
> blocks > 2G would require change to Serializer api)
>
> A smaller faction might require interface changes - note, I am
> referring specifically to configuration changes (removing/deprecating
> some) and possibly newer options to submit/env, etc - I dont envision
> any programming api change itself.
> The only api change we did was from Seq -> Iterable - which is
> actually to address some of the issues I mentioned (join/cogroup).
>
> Remaining are bugs which need to be addressed or the feature
> removed/enhanced like shuffle consolidation.
>
> There might be semantic extension of some things like OFF_HEAP storage
> level to address other computation models - but that would not have an
> impact on end user - since other options would be pluggable with
> default set to Tachyon so that there is no user expectation change.
>
>
> So will the interface possibly change ? Sure though we will try to
> keep it backwardly compatible (as we did with 1.0).
> Will the api change - other than backward compatible enhancements, probably not.
>
>
> Regards,
> Mridul
>
>
> On Sun, May 18, 2014 at 12:11 PM, Mridul Muralidharan <mr...@gmail.com> wrote:
>>
>> On 18-May-2014 5:05 am, "Mark Hamstra" <ma...@clearstorydata.com> wrote:
>>>
>>> I don't understand.  We never said that interfaces wouldn't change from
>>> 0.9
>>
>> Agreed.
>>
>>> to 1.0.  What we are committing to is stability going forward from the
>>> 1.0.0 baseline.  Nobody is disputing that backward-incompatible behavior
>>> or
>>> interface changes would be an issue post-1.0.0.  The question is whether
>>
>> The point is, how confident are we that these are the right set of interface
>> definitions.
>> We think it is, but we could also have gone through a 0.10 to vet the
>> proposed 1.0 changes to stabilize them.
>>
>> To give examples for which we don't have solutions currently (which we are
>> facing internally here btw, so not academic exercise) :
>>
>> - Current spark shuffle model breaks very badly as number of partitions
>> increases (input and output).
>>
>> - As number of nodes increase, the overhead per node keeps going up. Spark
>> currently is more geared towards large memory machines; when the RAM per
>> node is modest (8 to 16 gig) but large number of them are available, it does
>> not do too well.
>>
>> - Current block abstraction breaks as data per block goes beyond 2 gig.
>>
>> - Cogroup/join when value per key or number of keys (or both) is high breaks
>> currently.
>>
>> - Shuffle consolidation is so badly broken it is not funny.
>>
>> - Currently there is no way of effectively leveraging accelerator
>> cards/coprocessors/gpus from spark - to do so, I suspect we will need to
>> redefine OFF_HEAP.
>>
>> - Effectively leveraging ssd is still an open question IMO when you have mix
>> of both available.
>>
>> We have resolved some of these and looking at the rest. These are not unique
>> to our internal usage profile, I have seen most of these asked elsewhere
>> too.
>>
>> Thankfully some of the 1.0 changes actually are geared towards helping to
>> alleviate some of the above (Iterable change for ex), most of the rest are
>> internal impl detail of spark core which helps a lot - but there are cases
>> where this is not so.
>>
>> Unfortunately I don't know yet if the unresolved/uninvestigated issues will
>> require more changes or not.
>>
>> Given this I am very skeptical of expecting current spark interfaces to be
>> sufficient for next 1 year (forget 3)
>>
>> I understand this is an argument which can be made to never release 1.0 :-)
>> Which is why I was ok with a 1.0 instead of 0.10 release in spite of my
>> preference.
>>
>> This is a good problem to have IMO ... People are using spark extensively
>> and in circumstances that we did not envision : necessitating changes even
>> to spark core.
>>
>> But the claim that 1.0 interfaces are stable is not something I buy - they
>> are not, we will need to break them soon and cost of maintaining backward
>> compatibility will be high.
>>
>> We just need to make an informed decision to live with that cost, not hand
>> wave it away.
>>
>> Regards
>> Mridul
>>
>>> there is anything apparent now that is expected to require such disruptive
>>> changes if we were to commit to the current release candidate as our
>>> guaranteed 1.0.0 baseline.
>>>
>>>
>>> On Sat, May 17, 2014 at 2:05 PM, Mridul Muralidharan
>>> <mr...@gmail.com>wrote:
>>>
>>> > I would make the case for interface stability not just api stability.
>>> > Particularly given that we have significantly changed some of our
>>> > interfaces, I want to ensure developers/users are not seeing red flags.
>>> >
>>> > Bugs and code stability can be addressed in minor releases if found, but
>>> > behavioral change and/or interface changes would be a much more invasive
>>> > issue for our users.
>>> >
>>> > Regards
>>> > Mridul
>>> > On 18-May-2014 2:19 am, "Matei Zaharia" <ma...@gmail.com> wrote:
>>> >
>>> > > As others have said, the 1.0 milestone is about API stability, not
>>> > > about
>>> > > saying "we've eliminated all bugs". The sooner you declare 1.0, the
>>> > sooner
>>> > > users can confidently build on Spark, knowing that the application
>>> > > they
>>> > > build today will still run on Spark 1.9.9 three years from now. This
>>> > > is
>>> > > something that I've seen done badly (and experienced the effects
>>> > > thereof)
>>> > > in other big data projects, such as MapReduce and even YARN. The
>>> > > result
>>> > is
>>> > > that you annoy users, you end up with a fragmented userbase where
>>> > everyone
>>> > > is building against a different version, and you drastically slow down
>>> > > development.
>>> > >
>>> > > With a project as fast-growing as fast-growing as Spark in particular,
>>> > > there will be new bugs discovered and reported continuously,
>>> > > especially
>>> > in
>>> > > the non-core components. Look at the graph of # of contributors in
>>> > > time
>>> > to
>>> > > Spark: https://www.ohloh.net/p/apache-spark (bottom-most graph;
>>> > "commits"
>>> > > changed when we started merging each patch as a single commit). This
>>> > > is
>>> > not
>>> > > slowing down, and we need to have the culture now that we treat API
>>> > > stability and release numbers at the level expected for a 1.0 project
>>> > > instead of having people come in and randomly change the API.
>>> > >
>>> > > I'll also note that the issues marked "blocker" were marked so by
>>> > > their
>>> > > reporters, since the reporter can set the priority. I don't consider
>>> > stuff
>>> > > like parallelize() not partitioning ranges in the same way as other
>>> > > collections a blocker -- it's a bug, it would be good to fix it, but it
>>> > only
>>> > > affects a small number of use cases. Of course if we find a real
>>> > > blocker
>>> > > (in particular a regression from a previous version, or a feature
>>> > > that's
>>> > > just completely broken), we will delay the release for that, but at
>>> > > some
>>> > > point you have to say "okay, this fix will go into the next
>>> > > maintenance
>>> > > release". Maybe we need to write a clear policy for what the issue
>>> > > priorities mean.
>>> > >
>>> > > Finally, I believe it's much better to have a culture where you can
>>> > > make
>>> > > releases on a regular schedule, and have the option to make a
>>> > > maintenance
>>> > > release in 3-4 days if you find new bugs, than one where you pile up
>>> > stuff
>>> > > into each release. This is what much large project than us, like
>>> > > Linux,
>>> > do,
>>> > > and it's the only way to avoid indefinite stalling with a large
>>> > contributor
>>> > > base. In the worst case, if you find a new bug that warrants immediate
>>> > > release, it goes into 1.0.1 a week after 1.0.0 (we can vote on 1.0.1
>>> > > in
>>> > > three days with just your bug fix in it). And if you find an API that
>>> > you'd
>>> > > like to improve, just add a new one and maybe deprecate the old one --
>>> > > at
>>> > > some point we have to respect our users and let them know that code
>>> > > they
>>> > > write today will still run tomorrow.
>>> > >
>>> > > Matei
>>> > >
>>> > > On May 17, 2014, at 10:32 AM, Kan Zhang <kz...@apache.org> wrote:
>>> > >
>>> > > > +1 on the running commentary here, non-binding of course :-)
>>> > > >
>>> > > >
>>> > > > On Sat, May 17, 2014 at 8:44 AM, Andrew Ash <an...@andrewash.com>
>>> > > wrote:
>>> > > >
>>> > > >> +1 on the next release feeling more like a 0.10 than a 1.0
>>> > > >> On May 17, 2014 4:38 AM, "Mridul Muralidharan" <mr...@gmail.com>
>>> > > wrote:
>>> > > >>
>>> > > >>> I had echoed similar sentiments a while back when there was a
>>> > > discussion
>>> > > >>> around 0.10 vs 1.0 ... I would have preferred 0.10 to stabilize
>>> > > >>> the
>>> > api
>>> > > >>> changes, add missing functionality, go through a hardening release
>>> > > before
>>> > > >>> 1.0
>>> > > >>>
>>> > > >>> But the community preferred a 1.0 :-)
>>> > > >>>
>>> > > >>> Regards,
>>> > > >>> Mridul
>>> > > >>>
>>> > > >>> On 17-May-2014 3:19 pm, "Sean Owen" <so...@cloudera.com> wrote:
>>> > > >>>>
>>> > > >>>> On this note, non-binding commentary:
>>> > > >>>>
>>> > > >>>> Releases happen in local minima of change, usually created by
>>> > > >>>> internally enforced code freeze. Spark is incredibly busy now due
>>> > > >>>> to
>>> > > >>>> external factors -- recently a TLP, recently discovered by a
>>> > > >>>> large
>>> > new
>>> > > >>>> audience, ease of contribution enabled by Github. It's getting
>>> > > >>>> like
>>> > > >>>> the first year of mainstream battle-testing in a month. It's been
>>> > very
>>> > > >>>> hard to freeze anything! I see a number of non-trivial issues
>>> > > >>>> being
>>> > > >>>> reported, and I don't think it has been possible to triage all of
>>> > > >>>> them, even.
>>> > > >>>>
>>> > > >>>> Given the high rate of change, my instinct would have been to
>>> > release
>>> > > >>>> 0.10.0 now. But won't it always be very busy? I do think the rate
>>> > > >>>> of
>>> > > >>>> significant issues will slow down.
>>> > > >>>>
>>> > > >>>> Version ain't nothing but a number, but if it has any meaning
>>> > > >>>> it's
>>> > the
>>> > > >>>> semantic versioning meaning. 1.0 imposes extra handicaps around
>>> > > >>>> striving to maintain backwards-compatibility. That may end up
>>> > > >>>> being
>>> > > >>>> bent to fit in important changes that are going to be required in
>>> > this
>>> > > >>>> continuing period of change. Hadoop does this all the time
>>> > > >>>> unfortunately and gets away with it, I suppose -- minor version
>>> > > >>>> releases are really major. (On the other extreme, HBase is at
>>> > > >>>> 0.98
>>> > and
>>> > > >>>> quite production-ready.)
>>> > > >>>>
>>> > > >>>> Just consider this a second vote for focus on fixes and 1.0.x
>>> > > >>>> rather
>>> > > >>>> than new features and 1.x. I think there are a few steps that
>>> > > >>>> could
>>> > > >>>> streamline triage of this flood of contributions, and make all of
>>> > this
>>> > > >>>> easier, but that's for another thread.
>>> > > >>>>
>>> > > >>>>
>>> > > >>>> On Fri, May 16, 2014 at 8:50 PM, Mark Hamstra <
>>> > > mark@clearstorydata.com
>>> > > >>>
>>> > > >>> wrote:
>>> > > >>>>> +1, but just barely.  We've got quite a number of outstanding
>>> > > >>>>> bugs
>>> > > >>>>> identified, and many of them have fixes in progress.  I'd hate
>>> > > >>>>> to
>>> > see
>>> > > >>> those
>>> > > >>>>> efforts get lost in a post-1.0.0 flood of new features targeted
>>> > > >>>>> at
>>> > > >>> 1.1.0 --
>>> > > >>>>> in other words, I'd like to see 1.0.1 retain a high priority
>>> > relative
>>> > > >>> to
>>> > > >>>>> 1.1.0.
>>> > > >>>>>
>>> > > >>>>> Looking through the unresolved JIRAs, it doesn't look like any
>>> > > >>>>> of
>>> > the
>>> > > >>>>> identified bugs are show-stoppers or strictly regressions
>>> > (although I
>>> > > >>> will
>>> > > >>>>> note that one that I have in progress, SPARK-1749, is a bug that
>>> > > >>>>> we
>>> > > >>>>> introduced with recent work -- it's not strictly a regression
>>> > because
>>> > > >>> we
>>> > > >>>>> had equally bad but different behavior when the DAGScheduler
>>> > > >> exceptions
>>> > > >>>>> weren't previously being handled at all vs. being slightly
>>> > > >> mis-handled
>>> > > >>>>> now), so I'm not currently seeing a reason not to release.
>>> > > >>>
>>> > > >>
>>> > >
>>> > >
>>> >

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

Posted by Mridul Muralidharan <mr...@gmail.com>.
So I think I need to clarify a few things here - particularly since
this mail went to the wrong mailing list and a much wider audience
than I intended it for :-)


Most of the issues I mentioned are internal implementation detail of
spark core : which means, we can enhance them in future without
disruption to our userbase (ability to support large number of
input/output partitions. Note: this is of order of 100k input and
output partitions with uniform spread of keys - very rarely seen
outside of some crazy jobs).

Some of the issues I mentioned would reqiure DeveloperApi changes -
which are not user exposed : they would impact developer use of these
api's - which are mostly internally provided by spark. (Like fixing
blocks > 2G would require change to Serializer api)

A smaller faction might require interface changes - note, I am
referring specifically to configuration changes (removing/deprecating
some) and possibly newer options to submit/env, etc - I dont envision
any programming api change itself.
The only api change we did was from Seq -> Iterable - which is
actually to address some of the issues I mentioned (join/cogroup).

Remaining are bugs which need to be addressed or the feature
removed/enhanced like shuffle consolidation.

There might be semantic extension of some things like OFF_HEAP storage
level to address other computation models - but that would not have an
impact on end user - since other options would be pluggable with
default set to Tachyon so that there is no user expectation change.


So will the interface possibly change ? Sure though we will try to
keep it backwardly compatible (as we did with 1.0).
Will the api change - other than backward compatible enhancements, probably not.


Regards,
Mridul


On Sun, May 18, 2014 at 12:11 PM, Mridul Muralidharan <mr...@gmail.com> wrote:
>
> On 18-May-2014 5:05 am, "Mark Hamstra" <ma...@clearstorydata.com> wrote:
>>
>> I don't understand.  We never said that interfaces wouldn't change from
>> 0.9
>
> Agreed.
>
>> to 1.0.  What we are committing to is stability going forward from the
>> 1.0.0 baseline.  Nobody is disputing that backward-incompatible behavior
>> or
>> interface changes would be an issue post-1.0.0.  The question is whether
>
> The point is, how confident are we that these are the right set of interface
> definitions.
> We think it is, but we could also have gone through a 0.10 to vet the
> proposed 1.0 changes to stabilize them.
>
> To give examples for which we don't have solutions currently (which we are
> facing internally here btw, so not academic exercise) :
>
> - Current spark shuffle model breaks very badly as number of partitions
> increases (input and output).
>
> - As number of nodes increase, the overhead per node keeps going up. Spark
> currently is more geared towards large memory machines; when the RAM per
> node is modest (8 to 16 gig) but large number of them are available, it does
> not do too well.
>
> - Current block abstraction breaks as data per block goes beyond 2 gig.
>
> - Cogroup/join when value per key or number of keys (or both) is high breaks
> currently.
>
> - Shuffle consolidation is so badly broken it is not funny.
>
> - Currently there is no way of effectively leveraging accelerator
> cards/coprocessors/gpus from spark - to do so, I suspect we will need to
> redefine OFF_HEAP.
>
> - Effectively leveraging ssd is still an open question IMO when you have mix
> of both available.
>
> We have resolved some of these and looking at the rest. These are not unique
> to our internal usage profile, I have seen most of these asked elsewhere
> too.
>
> Thankfully some of the 1.0 changes actually are geared towards helping to
> alleviate some of the above (Iterable change for ex), most of the rest are
> internal impl detail of spark core which helps a lot - but there are cases
> where this is not so.
>
> Unfortunately I don't know yet if the unresolved/uninvestigated issues will
> require more changes or not.
>
> Given this I am very skeptical of expecting current spark interfaces to be
> sufficient for next 1 year (forget 3)
>
> I understand this is an argument which can be made to never release 1.0 :-)
> Which is why I was ok with a 1.0 instead of 0.10 release in spite of my
> preference.
>
> This is a good problem to have IMO ... People are using spark extensively
> and in circumstances that we did not envision : necessitating changes even
> to spark core.
>
> But the claim that 1.0 interfaces are stable is not something I buy - they
> are not, we will need to break them soon and cost of maintaining backward
> compatibility will be high.
>
> We just need to make an informed decision to live with that cost, not hand
> wave it away.
>
> Regards
> Mridul
>
>> there is anything apparent now that is expected to require such disruptive
>> changes if we were to commit to the current release candidate as our
>> guaranteed 1.0.0 baseline.
>>
>>
>> On Sat, May 17, 2014 at 2:05 PM, Mridul Muralidharan
>> <mr...@gmail.com>wrote:
>>
>> > I would make the case for interface stability not just api stability.
>> > Particularly given that we have significantly changed some of our
>> > interfaces, I want to ensure developers/users are not seeing red flags.
>> >
>> > Bugs and code stability can be addressed in minor releases if found, but
>> > behavioral change and/or interface changes would be a much more invasive
>> > issue for our users.
>> >
>> > Regards
>> > Mridul
>> > On 18-May-2014 2:19 am, "Matei Zaharia" <ma...@gmail.com> wrote:
>> >
>> > > As others have said, the 1.0 milestone is about API stability, not
>> > > about
>> > > saying “we’ve eliminated all bugs”. The sooner you declare 1.0, the
>> > sooner
>> > > users can confidently build on Spark, knowing that the application
>> > > they
>> > > build today will still run on Spark 1.9.9 three years from now. This
>> > > is
>> > > something that I’ve seen done badly (and experienced the effects
>> > > thereof)
>> > > in other big data projects, such as MapReduce and even YARN. The
>> > > result
>> > is
>> > > that you annoy users, you end up with a fragmented userbase where
>> > everyone
>> > > is building against a different version, and you drastically slow down
>> > > development.
>> > >
>> > > With a project as fast-growing as fast-growing as Spark in particular,
>> > > there will be new bugs discovered and reported continuously,
>> > > especially
>> > in
>> > > the non-core components. Look at the graph of # of contributors in
>> > > time
>> > to
>> > > Spark: https://www.ohloh.net/p/apache-spark (bottom-most graph;
>> > “commits”
>> > > changed when we started merging each patch as a single commit). This
>> > > is
>> > not
>> > > slowing down, and we need to have the culture now that we treat API
>> > > stability and release numbers at the level expected for a 1.0 project
>> > > instead of having people come in and randomly change the API.
>> > >
>> > > I’ll also note that the issues marked “blocker” were marked so by
>> > > their
>> > > reporters, since the reporter can set the priority. I don’t consider
>> > stuff
>> > > like parallelize() not partitioning ranges in the same way as other
>> > > collections a blocker — it’s a bug, it would be good to fix it, but it
>> > only
>> > > affects a small number of use cases. Of course if we find a real
>> > > blocker
>> > > (in particular a regression from a previous version, or a feature
>> > > that’s
>> > > just completely broken), we will delay the release for that, but at
>> > > some
>> > > point you have to say “okay, this fix will go into the next
>> > > maintenance
>> > > release”. Maybe we need to write a clear policy for what the issue
>> > > priorities mean.
>> > >
>> > > Finally, I believe it’s much better to have a culture where you can
>> > > make
>> > > releases on a regular schedule, and have the option to make a
>> > > maintenance
>> > > release in 3-4 days if you find new bugs, than one where you pile up
>> > stuff
>> > > into each release. This is what much large project than us, like
>> > > Linux,
>> > do,
>> > > and it’s the only way to avoid indefinite stalling with a large
>> > contributor
>> > > base. In the worst case, if you find a new bug that warrants immediate
>> > > release, it goes into 1.0.1 a week after 1.0.0 (we can vote on 1.0.1
>> > > in
>> > > three days with just your bug fix in it). And if you find an API that
>> > you’d
>> > > like to improve, just add a new one and maybe deprecate the old one —
>> > > at
>> > > some point we have to respect our users and let them know that code
>> > > they
>> > > write today will still run tomorrow.
>> > >
>> > > Matei
>> > >
>> > > On May 17, 2014, at 10:32 AM, Kan Zhang <kz...@apache.org> wrote:
>> > >
>> > > > +1 on the running commentary here, non-binding of course :-)
>> > > >
>> > > >
>> > > > On Sat, May 17, 2014 at 8:44 AM, Andrew Ash <an...@andrewash.com>
>> > > wrote:
>> > > >
>> > > >> +1 on the next release feeling more like a 0.10 than a 1.0
>> > > >> On May 17, 2014 4:38 AM, "Mridul Muralidharan" <mr...@gmail.com>
>> > > wrote:
>> > > >>
>> > > >>> I had echoed similar sentiments a while back when there was a
>> > > discussion
>> > > >>> around 0.10 vs 1.0 ... I would have preferred 0.10 to stabilize
>> > > >>> the
>> > api
>> > > >>> changes, add missing functionality, go through a hardening release
>> > > before
>> > > >>> 1.0
>> > > >>>
>> > > >>> But the community preferred a 1.0 :-)
>> > > >>>
>> > > >>> Regards,
>> > > >>> Mridul
>> > > >>>
>> > > >>> On 17-May-2014 3:19 pm, "Sean Owen" <so...@cloudera.com> wrote:
>> > > >>>>
>> > > >>>> On this note, non-binding commentary:
>> > > >>>>
>> > > >>>> Releases happen in local minima of change, usually created by
>> > > >>>> internally enforced code freeze. Spark is incredibly busy now due
>> > > >>>> to
>> > > >>>> external factors -- recently a TLP, recently discovered by a
>> > > >>>> large
>> > new
>> > > >>>> audience, ease of contribution enabled by Github. It's getting
>> > > >>>> like
>> > > >>>> the first year of mainstream battle-testing in a month. It's been
>> > very
>> > > >>>> hard to freeze anything! I see a number of non-trivial issues
>> > > >>>> being
>> > > >>>> reported, and I don't think it has been possible to triage all of
>> > > >>>> them, even.
>> > > >>>>
>> > > >>>> Given the high rate of change, my instinct would have been to
>> > release
>> > > >>>> 0.10.0 now. But won't it always be very busy? I do think the rate
>> > > >>>> of
>> > > >>>> significant issues will slow down.
>> > > >>>>
>> > > >>>> Version ain't nothing but a number, but if it has any meaning
>> > > >>>> it's
>> > the
>> > > >>>> semantic versioning meaning. 1.0 imposes extra handicaps around
>> > > >>>> striving to maintain backwards-compatibility. That may end up
>> > > >>>> being
>> > > >>>> bent to fit in important changes that are going to be required in
>> > this
>> > > >>>> continuing period of change. Hadoop does this all the time
>> > > >>>> unfortunately and gets away with it, I suppose -- minor version
>> > > >>>> releases are really major. (On the other extreme, HBase is at
>> > > >>>> 0.98
>> > and
>> > > >>>> quite production-ready.)
>> > > >>>>
>> > > >>>> Just consider this a second vote for focus on fixes and 1.0.x
>> > > >>>> rather
>> > > >>>> than new features and 1.x. I think there are a few steps that
>> > > >>>> could
>> > > >>>> streamline triage of this flood of contributions, and make all of
>> > this
>> > > >>>> easier, but that's for another thread.
>> > > >>>>
>> > > >>>>
>> > > >>>> On Fri, May 16, 2014 at 8:50 PM, Mark Hamstra <
>> > > mark@clearstorydata.com
>> > > >>>
>> > > >>> wrote:
>> > > >>>>> +1, but just barely.  We've got quite a number of outstanding
>> > > >>>>> bugs
>> > > >>>>> identified, and many of them have fixes in progress.  I'd hate
>> > > >>>>> to
>> > see
>> > > >>> those
>> > > >>>>> efforts get lost in a post-1.0.0 flood of new features targeted
>> > > >>>>> at
>> > > >>> 1.1.0 --
>> > > >>>>> in other words, I'd like to see 1.0.1 retain a high priority
>> > relative
>> > > >>> to
>> > > >>>>> 1.1.0.
>> > > >>>>>
>> > > >>>>> Looking through the unresolved JIRAs, it doesn't look like any
>> > > >>>>> of
>> > the
>> > > >>>>> identified bugs are show-stoppers or strictly regressions
>> > (although I
>> > > >>> will
>> > > >>>>> note that one that I have in progress, SPARK-1749, is a bug that
>> > > >>>>> we
>> > > >>>>> introduced with recent work -- it's not strictly a regression
>> > because
>> > > >>> we
>> > > >>>>> had equally bad but different behavior when the DAGScheduler
>> > > >> exceptions
>> > > >>>>> weren't previously being handled at all vs. being slightly
>> > > >> mis-handled
>> > > >>>>> now), so I'm not currently seeing a reason not to release.
>> > > >>>
>> > > >>
>> > >
>> > >
>> >

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

Posted by Mridul Muralidharan <mr...@gmail.com>.
On 18-May-2014 5:05 am, "Mark Hamstra" <ma...@clearstorydata.com> wrote:
>
> I don't understand.  We never said that interfaces wouldn't change from
0.9

Agreed.

> to 1.0.  What we are committing to is stability going forward from the
> 1.0.0 baseline.  Nobody is disputing that backward-incompatible behavior
or
> interface changes would be an issue post-1.0.0.  The question is whether

The point is, how confident are we that these are the right set of
interface definitions.
We think it is, but we could also have gone through a 0.10 to vet the
proposed 1.0 changes to stabilize them.

To give examples for which we don't have solutions currently (which we are
facing internally here btw, so not academic exercise) :

- Current spark shuffle model breaks very badly as number of partitions
increases (input and output).

- As number of nodes increase, the overhead per node keeps going up. Spark
currently is more geared towards large memory machines; when the RAM per
node is modest (8 to 16 gig) but large number of them are available, it
does not do too well.

- Current block abstraction breaks as data per block goes beyond 2 gig.

- Cogroup/join when value per key or number of keys (or both) is high
breaks currently.

- Shuffle consolidation is so badly broken it is not funny.

- Currently there is no way of effectively leveraging accelerator
cards/coprocessors/gpus from spark - to do so, I suspect we will need to
redefine OFF_HEAP.

- Effectively leveraging ssd is still an open question IMO when you have
mix of both available.

We have resolved some of these and looking at the rest. These are not
unique to our internal usage profile, I have seen most of these asked
elsewhere too.

Thankfully some of the 1.0 changes actually are geared towards helping to
alleviate some of the above (Iterable change for ex), most of the rest are
internal impl detail of spark core which helps a lot - but there are cases
where this is not so.

Unfortunately I don't know yet if the unresolved/uninvestigated issues will
require more changes or not.

Given this I am very skeptical of expecting current spark interfaces to be
sufficient for next 1 year (forget 3)

I understand this is an argument which can be made to never release 1.0 :-)
Which is why I was ok with a 1.0 instead of 0.10 release in spite of my
preference.

This is a good problem to have IMO ... People are using spark extensively
and in circumstances that we did not envision : necessitating changes even
to spark core.

But the claim that 1.0 interfaces are stable is not something I buy - they
are not, we will need to break them soon and cost of maintaining backward
compatibility will be high.

We just need to make an informed decision to live with that cost, not hand
wave it away.

Regards
Mridul

> there is anything apparent now that is expected to require such disruptive
> changes if we were to commit to the current release candidate as our
> guaranteed 1.0.0 baseline.
>
>
> On Sat, May 17, 2014 at 2:05 PM, Mridul Muralidharan <mridul@gmail.com
>wrote:
>
> > I would make the case for interface stability not just api stability.
> > Particularly given that we have significantly changed some of our
> > interfaces, I want to ensure developers/users are not seeing red flags.
> >
> > Bugs and code stability can be addressed in minor releases if found, but
> > behavioral change and/or interface changes would be a much more invasive
> > issue for our users.
> >
> > Regards
> > Mridul
> > On 18-May-2014 2:19 am, "Matei Zaharia" <ma...@gmail.com> wrote:
> >
> > > As others have said, the 1.0 milestone is about API stability, not
about
> > > saying “we’ve eliminated all bugs”. The sooner you declare 1.0, the
> > sooner
> > > users can confidently build on Spark, knowing that the application
they
> > > build today will still run on Spark 1.9.9 three years from now. This
is
> > > something that I’ve seen done badly (and experienced the effects
thereof)
> > > in other big data projects, such as MapReduce and even YARN. The
result
> > is
> > > that you annoy users, you end up with a fragmented userbase where
> > everyone
> > > is building against a different version, and you drastically slow down
> > > development.
> > >
> > > With a project as fast-growing as fast-growing as Spark in particular,
> > > there will be new bugs discovered and reported continuously,
especially
> > in
> > > the non-core components. Look at the graph of # of contributors in
time
> > to
> > > Spark: https://www.ohloh.net/p/apache-spark (bottom-most graph;
> > “commits”
> > > changed when we started merging each patch as a single commit). This
is
> > not
> > > slowing down, and we need to have the culture now that we treat API
> > > stability and release numbers at the level expected for a 1.0 project
> > > instead of having people come in and randomly change the API.
> > >
> > > I’ll also note that the issues marked “blocker” were marked so by
their
> > > reporters, since the reporter can set the priority. I don’t consider
> > stuff
> > > like parallelize() not partitioning ranges in the same way as other
> > > collections a blocker — it’s a bug, it would be good to fix it, but it
> > only
> > > affects a small number of use cases. Of course if we find a real
blocker
> > > (in particular a regression from a previous version, or a feature
that’s
> > > just completely broken), we will delay the release for that, but at
some
> > > point you have to say “okay, this fix will go into the next
maintenance
> > > release”. Maybe we need to write a clear policy for what the issue
> > > priorities mean.
> > >
> > > Finally, I believe it’s much better to have a culture where you can
make
> > > releases on a regular schedule, and have the option to make a
maintenance
> > > release in 3-4 days if you find new bugs, than one where you pile up
> > stuff
> > > into each release. This is what much large project than us, like
Linux,
> > do,
> > > and it’s the only way to avoid indefinite stalling with a large
> > contributor
> > > base. In the worst case, if you find a new bug that warrants immediate
> > > release, it goes into 1.0.1 a week after 1.0.0 (we can vote on 1.0.1
in
> > > three days with just your bug fix in it). And if you find an API that
> > you’d
> > > like to improve, just add a new one and maybe deprecate the old one —
at
> > > some point we have to respect our users and let them know that code
they
> > > write today will still run tomorrow.
> > >
> > > Matei
> > >
> > > On May 17, 2014, at 10:32 AM, Kan Zhang <kz...@apache.org> wrote:
> > >
> > > > +1 on the running commentary here, non-binding of course :-)
> > > >
> > > >
> > > > On Sat, May 17, 2014 at 8:44 AM, Andrew Ash <an...@andrewash.com>
> > > wrote:
> > > >
> > > >> +1 on the next release feeling more like a 0.10 than a 1.0
> > > >> On May 17, 2014 4:38 AM, "Mridul Muralidharan" <mr...@gmail.com>
> > > wrote:
> > > >>
> > > >>> I had echoed similar sentiments a while back when there was a
> > > discussion
> > > >>> around 0.10 vs 1.0 ... I would have preferred 0.10 to stabilize
the
> > api
> > > >>> changes, add missing functionality, go through a hardening release
> > > before
> > > >>> 1.0
> > > >>>
> > > >>> But the community preferred a 1.0 :-)
> > > >>>
> > > >>> Regards,
> > > >>> Mridul
> > > >>>
> > > >>> On 17-May-2014 3:19 pm, "Sean Owen" <so...@cloudera.com> wrote:
> > > >>>>
> > > >>>> On this note, non-binding commentary:
> > > >>>>
> > > >>>> Releases happen in local minima of change, usually created by
> > > >>>> internally enforced code freeze. Spark is incredibly busy now
due to
> > > >>>> external factors -- recently a TLP, recently discovered by a
large
> > new
> > > >>>> audience, ease of contribution enabled by Github. It's getting
like
> > > >>>> the first year of mainstream battle-testing in a month. It's been
> > very
> > > >>>> hard to freeze anything! I see a number of non-trivial issues
being
> > > >>>> reported, and I don't think it has been possible to triage all of
> > > >>>> them, even.
> > > >>>>
> > > >>>> Given the high rate of change, my instinct would have been to
> > release
> > > >>>> 0.10.0 now. But won't it always be very busy? I do think the
rate of
> > > >>>> significant issues will slow down.
> > > >>>>
> > > >>>> Version ain't nothing but a number, but if it has any meaning
it's
> > the
> > > >>>> semantic versioning meaning. 1.0 imposes extra handicaps around
> > > >>>> striving to maintain backwards-compatibility. That may end up
being
> > > >>>> bent to fit in important changes that are going to be required in
> > this
> > > >>>> continuing period of change. Hadoop does this all the time
> > > >>>> unfortunately and gets away with it, I suppose -- minor version
> > > >>>> releases are really major. (On the other extreme, HBase is at
0.98
> > and
> > > >>>> quite production-ready.)
> > > >>>>
> > > >>>> Just consider this a second vote for focus on fixes and 1.0.x
rather
> > > >>>> than new features and 1.x. I think there are a few steps that
could
> > > >>>> streamline triage of this flood of contributions, and make all of
> > this
> > > >>>> easier, but that's for another thread.
> > > >>>>
> > > >>>>
> > > >>>> On Fri, May 16, 2014 at 8:50 PM, Mark Hamstra <
> > > mark@clearstorydata.com
> > > >>>
> > > >>> wrote:
> > > >>>>> +1, but just barely.  We've got quite a number of outstanding
bugs
> > > >>>>> identified, and many of them have fixes in progress.  I'd hate
to
> > see
> > > >>> those
> > > >>>>> efforts get lost in a post-1.0.0 flood of new features targeted
at
> > > >>> 1.1.0 --
> > > >>>>> in other words, I'd like to see 1.0.1 retain a high priority
> > relative
> > > >>> to
> > > >>>>> 1.1.0.
> > > >>>>>
> > > >>>>> Looking through the unresolved JIRAs, it doesn't look like any
of
> > the
> > > >>>>> identified bugs are show-stoppers or strictly regressions
> > (although I
> > > >>> will
> > > >>>>> note that one that I have in progress, SPARK-1749, is a bug
that we
> > > >>>>> introduced with recent work -- it's not strictly a regression
> > because
> > > >>> we
> > > >>>>> had equally bad but different behavior when the DAGScheduler
> > > >> exceptions
> > > >>>>> weren't previously being handled at all vs. being slightly
> > > >> mis-handled
> > > >>>>> now), so I'm not currently seeing a reason not to release.
> > > >>>
> > > >>
> > >
> > >
> >

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

Posted by Mark Hamstra <ma...@clearstorydata.com>.
I don't understand.  We never said that interfaces wouldn't change from 0.9
to 1.0.  What we are committing to is stability going forward from the
1.0.0 baseline.  Nobody is disputing that backward-incompatible behavior or
interface changes would be an issue post-1.0.0.  The question is whether
there is anything apparent now that is expected to require such disruptive
changes if we were to commit to the current release candidate as our
guaranteed 1.0.0 baseline.


On Sat, May 17, 2014 at 2:05 PM, Mridul Muralidharan <mr...@gmail.com>wrote:

> I would make the case for interface stability not just api stability.
> Particularly given that we have significantly changed some of our
> interfaces, I want to ensure developers/users are not seeing red flags.
>
> Bugs and code stability can be addressed in minor releases if found, but
> behavioral change and/or interface changes would be a much more invasive
> issue for our users.
>
> Regards
> Mridul
> On 18-May-2014 2:19 am, "Matei Zaharia" <ma...@gmail.com> wrote:
>
> > As others have said, the 1.0 milestone is about API stability, not about
> > saying “we’ve eliminated all bugs”. The sooner you declare 1.0, the
> sooner
> > users can confidently build on Spark, knowing that the application they
> > build today will still run on Spark 1.9.9 three years from now. This is
> > something that I’ve seen done badly (and experienced the effects thereof)
> > in other big data projects, such as MapReduce and even YARN. The result
> is
> > that you annoy users, you end up with a fragmented userbase where
> everyone
> > is building against a different version, and you drastically slow down
> > development.
> >
> > With a project as fast-growing as fast-growing as Spark in particular,
> > there will be new bugs discovered and reported continuously, especially
> in
> > the non-core components. Look at the graph of # of contributors in time
> to
> > Spark: https://www.ohloh.net/p/apache-spark (bottom-most graph;
> “commits”
> > changed when we started merging each patch as a single commit). This is
> not
> > slowing down, and we need to have the culture now that we treat API
> > stability and release numbers at the level expected for a 1.0 project
> > instead of having people come in and randomly change the API.
> >
> > I’ll also note that the issues marked “blocker” were marked so by their
> > reporters, since the reporter can set the priority. I don’t consider
> stuff
> > like parallelize() not partitioning ranges in the same way as other
> > collections a blocker — it’s a bug, it would be good to fix it, but it
> only
> > affects a small number of use cases. Of course if we find a real blocker
> > (in particular a regression from a previous version, or a feature that’s
> > just completely broken), we will delay the release for that, but at some
> > point you have to say “okay, this fix will go into the next maintenance
> > release”. Maybe we need to write a clear policy for what the issue
> > priorities mean.
> >
> > Finally, I believe it’s much better to have a culture where you can make
> > releases on a regular schedule, and have the option to make a maintenance
> > release in 3-4 days if you find new bugs, than one where you pile up
> stuff
> > into each release. This is what much large project than us, like Linux,
> do,
> > and it’s the only way to avoid indefinite stalling with a large
> contributor
> > base. In the worst case, if you find a new bug that warrants immediate
> > release, it goes into 1.0.1 a week after 1.0.0 (we can vote on 1.0.1 in
> > three days with just your bug fix in it). And if you find an API that
> you’d
> > like to improve, just add a new one and maybe deprecate the old one — at
> > some point we have to respect our users and let them know that code they
> > write today will still run tomorrow.
> >
> > Matei
> >
> > On May 17, 2014, at 10:32 AM, Kan Zhang <kz...@apache.org> wrote:
> >
> > > +1 on the running commentary here, non-binding of course :-)
> > >
> > >
> > > On Sat, May 17, 2014 at 8:44 AM, Andrew Ash <an...@andrewash.com>
> > wrote:
> > >
> > >> +1 on the next release feeling more like a 0.10 than a 1.0
> > >> On May 17, 2014 4:38 AM, "Mridul Muralidharan" <mr...@gmail.com>
> > wrote:
> > >>
> > >>> I had echoed similar sentiments a while back when there was a
> > discussion
> > >>> around 0.10 vs 1.0 ... I would have preferred 0.10 to stabilize the
> api
> > >>> changes, add missing functionality, go through a hardening release
> > before
> > >>> 1.0
> > >>>
> > >>> But the community preferred a 1.0 :-)
> > >>>
> > >>> Regards,
> > >>> Mridul
> > >>>
> > >>> On 17-May-2014 3:19 pm, "Sean Owen" <so...@cloudera.com> wrote:
> > >>>>
> > >>>> On this note, non-binding commentary:
> > >>>>
> > >>>> Releases happen in local minima of change, usually created by
> > >>>> internally enforced code freeze. Spark is incredibly busy now due to
> > >>>> external factors -- recently a TLP, recently discovered by a large
> new
> > >>>> audience, ease of contribution enabled by Github. It's getting like
> > >>>> the first year of mainstream battle-testing in a month. It's been
> very
> > >>>> hard to freeze anything! I see a number of non-trivial issues being
> > >>>> reported, and I don't think it has been possible to triage all of
> > >>>> them, even.
> > >>>>
> > >>>> Given the high rate of change, my instinct would have been to
> release
> > >>>> 0.10.0 now. But won't it always be very busy? I do think the rate of
> > >>>> significant issues will slow down.
> > >>>>
> > >>>> Version ain't nothing but a number, but if it has any meaning it's
> the
> > >>>> semantic versioning meaning. 1.0 imposes extra handicaps around
> > >>>> striving to maintain backwards-compatibility. That may end up being
> > >>>> bent to fit in important changes that are going to be required in
> this
> > >>>> continuing period of change. Hadoop does this all the time
> > >>>> unfortunately and gets away with it, I suppose -- minor version
> > >>>> releases are really major. (On the other extreme, HBase is at 0.98
> and
> > >>>> quite production-ready.)
> > >>>>
> > >>>> Just consider this a second vote for focus on fixes and 1.0.x rather
> > >>>> than new features and 1.x. I think there are a few steps that could
> > >>>> streamline triage of this flood of contributions, and make all of
> this
> > >>>> easier, but that's for another thread.
> > >>>>
> > >>>>
> > >>>> On Fri, May 16, 2014 at 8:50 PM, Mark Hamstra <
> > mark@clearstorydata.com
> > >>>
> > >>> wrote:
> > >>>>> +1, but just barely.  We've got quite a number of outstanding bugs
> > >>>>> identified, and many of them have fixes in progress.  I'd hate to
> see
> > >>> those
> > >>>>> efforts get lost in a post-1.0.0 flood of new features targeted at
> > >>> 1.1.0 --
> > >>>>> in other words, I'd like to see 1.0.1 retain a high priority
> relative
> > >>> to
> > >>>>> 1.1.0.
> > >>>>>
> > >>>>> Looking through the unresolved JIRAs, it doesn't look like any of
> the
> > >>>>> identified bugs are show-stoppers or strictly regressions
> (although I
> > >>> will
> > >>>>> note that one that I have in progress, SPARK-1749, is a bug that we
> > >>>>> introduced with recent work -- it's not strictly a regression
> because
> > >>> we
> > >>>>> had equally bad but different behavior when the DAGScheduler
> > >> exceptions
> > >>>>> weren't previously being handled at all vs. being slightly
> > >> mis-handled
> > >>>>> now), so I'm not currently seeing a reason not to release.
> > >>>
> > >>
> >
> >
>

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

Posted by Mridul Muralidharan <mr...@gmail.com>.
I would make the case for interface stability not just api stability.
Particularly given that we have significantly changed some of our
interfaces, I want to ensure developers/users are not seeing red flags.

Bugs and code stability can be addressed in minor releases if found, but
behavioral change and/or interface changes would be a much more invasive
issue for our users.

Regards
Mridul
On 18-May-2014 2:19 am, "Matei Zaharia" <ma...@gmail.com> wrote:

> As others have said, the 1.0 milestone is about API stability, not about
> saying “we’ve eliminated all bugs”. The sooner you declare 1.0, the sooner
> users can confidently build on Spark, knowing that the application they
> build today will still run on Spark 1.9.9 three years from now. This is
> something that I’ve seen done badly (and experienced the effects thereof)
> in other big data projects, such as MapReduce and even YARN. The result is
> that you annoy users, you end up with a fragmented userbase where everyone
> is building against a different version, and you drastically slow down
> development.
>
> With a project as fast-growing as fast-growing as Spark in particular,
> there will be new bugs discovered and reported continuously, especially in
> the non-core components. Look at the graph of # of contributors in time to
> Spark: https://www.ohloh.net/p/apache-spark (bottom-most graph; “commits”
> changed when we started merging each patch as a single commit). This is not
> slowing down, and we need to have the culture now that we treat API
> stability and release numbers at the level expected for a 1.0 project
> instead of having people come in and randomly change the API.
>
> I’ll also note that the issues marked “blocker” were marked so by their
> reporters, since the reporter can set the priority. I don’t consider stuff
> like parallelize() not partitioning ranges in the same way as other
> collections a blocker — it’s a bug, it would be good to fix it, but it only
> affects a small number of use cases. Of course if we find a real blocker
> (in particular a regression from a previous version, or a feature that’s
> just completely broken), we will delay the release for that, but at some
> point you have to say “okay, this fix will go into the next maintenance
> release”. Maybe we need to write a clear policy for what the issue
> priorities mean.
>
> Finally, I believe it’s much better to have a culture where you can make
> releases on a regular schedule, and have the option to make a maintenance
> release in 3-4 days if you find new bugs, than one where you pile up stuff
> into each release. This is what much large project than us, like Linux, do,
> and it’s the only way to avoid indefinite stalling with a large contributor
> base. In the worst case, if you find a new bug that warrants immediate
> release, it goes into 1.0.1 a week after 1.0.0 (we can vote on 1.0.1 in
> three days with just your bug fix in it). And if you find an API that you’d
> like to improve, just add a new one and maybe deprecate the old one — at
> some point we have to respect our users and let them know that code they
> write today will still run tomorrow.
>
> Matei
>
> On May 17, 2014, at 10:32 AM, Kan Zhang <kz...@apache.org> wrote:
>
> > +1 on the running commentary here, non-binding of course :-)
> >
> >
> > On Sat, May 17, 2014 at 8:44 AM, Andrew Ash <an...@andrewash.com>
> wrote:
> >
> >> +1 on the next release feeling more like a 0.10 than a 1.0
> >> On May 17, 2014 4:38 AM, "Mridul Muralidharan" <mr...@gmail.com>
> wrote:
> >>
> >>> I had echoed similar sentiments a while back when there was a
> discussion
> >>> around 0.10 vs 1.0 ... I would have preferred 0.10 to stabilize the api
> >>> changes, add missing functionality, go through a hardening release
> before
> >>> 1.0
> >>>
> >>> But the community preferred a 1.0 :-)
> >>>
> >>> Regards,
> >>> Mridul
> >>>
> >>> On 17-May-2014 3:19 pm, "Sean Owen" <so...@cloudera.com> wrote:
> >>>>
> >>>> On this note, non-binding commentary:
> >>>>
> >>>> Releases happen in local minima of change, usually created by
> >>>> internally enforced code freeze. Spark is incredibly busy now due to
> >>>> external factors -- recently a TLP, recently discovered by a large new
> >>>> audience, ease of contribution enabled by Github. It's getting like
> >>>> the first year of mainstream battle-testing in a month. It's been very
> >>>> hard to freeze anything! I see a number of non-trivial issues being
> >>>> reported, and I don't think it has been possible to triage all of
> >>>> them, even.
> >>>>
> >>>> Given the high rate of change, my instinct would have been to release
> >>>> 0.10.0 now. But won't it always be very busy? I do think the rate of
> >>>> significant issues will slow down.
> >>>>
> >>>> Version ain't nothing but a number, but if it has any meaning it's the
> >>>> semantic versioning meaning. 1.0 imposes extra handicaps around
> >>>> striving to maintain backwards-compatibility. That may end up being
> >>>> bent to fit in important changes that are going to be required in this
> >>>> continuing period of change. Hadoop does this all the time
> >>>> unfortunately and gets away with it, I suppose -- minor version
> >>>> releases are really major. (On the other extreme, HBase is at 0.98 and
> >>>> quite production-ready.)
> >>>>
> >>>> Just consider this a second vote for focus on fixes and 1.0.x rather
> >>>> than new features and 1.x. I think there are a few steps that could
> >>>> streamline triage of this flood of contributions, and make all of this
> >>>> easier, but that's for another thread.
> >>>>
> >>>>
> >>>> On Fri, May 16, 2014 at 8:50 PM, Mark Hamstra <
> mark@clearstorydata.com
> >>>
> >>> wrote:
> >>>>> +1, but just barely.  We've got quite a number of outstanding bugs
> >>>>> identified, and many of them have fixes in progress.  I'd hate to see
> >>> those
> >>>>> efforts get lost in a post-1.0.0 flood of new features targeted at
> >>> 1.1.0 --
> >>>>> in other words, I'd like to see 1.0.1 retain a high priority relative
> >>> to
> >>>>> 1.1.0.
> >>>>>
> >>>>> Looking through the unresolved JIRAs, it doesn't look like any of the
> >>>>> identified bugs are show-stoppers or strictly regressions (although I
> >>> will
> >>>>> note that one that I have in progress, SPARK-1749, is a bug that we
> >>>>> introduced with recent work -- it's not strictly a regression because
> >>> we
> >>>>> had equally bad but different behavior when the DAGScheduler
> >> exceptions
> >>>>> weren't previously being handled at all vs. being slightly
> >> mis-handled
> >>>>> now), so I'm not currently seeing a reason not to release.
> >>>
> >>
>
>

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

Posted by Michael Malak <mi...@yahoo.com>.
While developers may appreciate "1.0 == API stability," I'm not sure that will be the understanding of the VP who gives the green light to a Spark-based development effort.

I fear a bug that silently produces erroneous results will be perceived like the FDIV bug, but in this case without the momentum of an existing large installed base and with a number of "competitors" (GridGain, H20, Stratosphere). Despite the stated intention of API stability, the perception (which becomes the reality) of "1.0" is that it's ready for production use -- not bullet-proof, but also not with known silent generation of erroneous results. Exceptions and crashes are much more tolerated than silent corruption of data. The result may be a reputation of the Spark team unconcerned about data integrity.

I ran into (and submitted) https://issues.apache.org/jira/browse/SPARK-1817 due to the lack of zipWithIndex(). zip() with a self-created partitioned range was the way I was trying to number with IDs a collection of nodes in preparation for the GraphX constructor. For the record, it was a frequent Spark committer who escalated it to "blocker"; I did not submit it as such. Partitioning a Scala range isn't just a toy example; it has a real-life use.

I also wonder about the REPL. Cloudera, for example, touts it as key to making Spark a "crossover tool" that Data Scientists can also use. The REPL can be considered an API of sorts -- not a traditional Scala or Java API, of course, but the "API" that a human data analyst would use. With the Scala REPL exhibiting some of the same bad behaviors as the Spark REPL, there is a question of whether the Spark REPL can even be fixed. If the Spark REPL has to be eliminated after 1.0 due to an inability to repair it, that would constitute API instability.


 
On Saturday, May 17, 2014 2:49 PM, Matei Zaharia <ma...@gmail.com> wrote:
 
As others have said, the 1.0 milestone is about API stability, not about saying “we’ve eliminated all bugs”. The sooner you declare 1.0, the sooner users can confidently build on Spark, knowing that the application they build today will still run on Spark 1.9.9 three years from now. This is something that I’ve seen done badly (and experienced the effects thereof) in other big data projects, such as MapReduce and even YARN. The result is that you annoy users, you end up with a fragmented userbase where everyone is building against a different version, and you drastically slow down development.

With a project as fast-growing as fast-growing as Spark in particular, there will be new bugs discovered and reported continuously, especially in the non-core components. Look at the graph of # of contributors in time to Spark: https://www.ohloh.net/p/apache-spark (bottom-most graph; “commits” changed when we started merging each patch as a single commit). This is not slowing down, and we need to have the culture now that we treat API stability and release numbers at the level expected for a 1.0 project instead of having people come in and randomly change the API.

I’ll also note that the issues marked “blocker” were marked so by their reporters, since the reporter can set the priority. I don’t consider stuff like parallelize() not partitioning ranges in the same way as other collections a blocker — it’s a bug, it would be good to fix it, but it only affects a small number of use cases. Of course if we find a real blocker (in particular a regression from a previous version, or a feature that’s just completely broken), we will delay the release for that, but at some point you have to say “okay, this fix will go into the next maintenance release”. Maybe we need to write a clear policy for what the issue priorities mean.

Finally, I believe it’s much better to have a culture where you can make releases on a regular schedule, and have the option to make a maintenance release in 3-4 days if you find new bugs, than one where you pile up stuff into each release. This is what much large project than us, like Linux, do, and it’s the only way to avoid indefinite stalling with a large contributor base. In the worst case, if you find a new bug that warrants immediate release, it goes into 1.0.1 a week after 1.0.0 (we can vote on 1.0.1 in three days with just your bug fix in it). And if you find an API that you’d like to improve, just add a new one and maybe deprecate the old one — at some point we have to respect our users and let them know that code they write today will still run tomorrow.

Matei


On May 17, 2014, at 10:32 AM, Kan Zhang <kz...@apache.org> wrote:

> +1 on the running commentary here, non-binding of course :-)
> 
> 
> On Sat, May 17, 2014 at 8:44 AM, Andrew Ash <an...@andrewash.com> wrote:
> 
>> +1 on the next release feeling more like a 0.10 than a 1.0
>> On May 17, 2014 4:38 AM, "Mridul Muralidharan" <mr...@gmail.com> wrote:
>> 
>>> I had echoed similar sentiments a while back when there was a discussion
>>> around 0.10 vs 1.0 ... I would have preferred 0.10 to stabilize the api
>>> changes, add missing functionality, go through a hardening release before
>>> 1.0
>>> 
>>> But the community preferred a 1.0 :-)
>>> 
>>> Regards,
>>> Mridul
>>> 
>>> On 17-May-2014 3:19 pm, "Sean Owen" <so...@cloudera.com> wrote:
>>>> 
>>>> On this note, non-binding commentary:
>>>> 
>>>> Releases happen in local minima of change, usually created by
>>>> internally enforced code freeze. Spark is incredibly busy now due to
>>>> external factors -- recently a TLP, recently discovered by a large new
>>>> audience, ease of contribution enabled by Github. It's getting like
>>>> the first year of mainstream battle-testing in a month. It's been very
>>>> hard to freeze anything! I see a number of non-trivial issues being
>>>> reported, and I don't think it has been possible to triage all of
>>>> them, even.
>>>> 
>>>> Given the high rate of change, my instinct would have been to release
>>>> 0.10.0 now. But won't it always be very busy? I do think the rate of
>>>> significant issues will slow down.
>>>> 
>>>> Version ain't nothing but a number, but if it has any meaning it's the
>>>> semantic versioning meaning. 1.0 imposes extra handicaps around
>>>> striving to maintain backwards-compatibility. That may end up being
>>>> bent to fit in important changes that are going to be required in this
>>>> continuing period of change. Hadoop does this all the time
>>>> unfortunately and gets away with it, I suppose -- minor version
>>>> releases are really major. (On the other extreme, HBase is at 0.98 and
>>>> quite production-ready.)
>>>> 
>>>> Just consider this a second vote for focus on fixes and 1.0.x rather
>>>> than new features and 1.x. I think there are a few steps that could
>>>> streamline triage of this flood of contributions, and make all of this
>>>> easier, but that's for another thread.
>>>> 
>>>> 
>>>> On Fri, May 16, 2014 at 8:50 PM, Mark Hamstra <mark@clearstorydata.com
>>> 
>>> wrote:
>>>>> +1, but just barely.  We've got quite a number of outstanding bugs
>>>>> identified, and many of them have fixes in progress.  I'd hate to see
>>> those
>>>>> efforts get lost in a post-1.0.0 flood of new features targeted at
>>> 1.1.0 --
>>>>> in other words, I'd like to see 1.0.1 retain a high priority relative
>>> to
>>>>> 1.1.0.
>>>>> 
>>>>> Looking through the unresolved JIRAs, it doesn't look like any of the
>>>>> identified bugs are show-stoppers or strictly regressions (although I
>>> will
>>>>> note that one that I have in progress, SPARK-1749, is a bug that we
>>>>> introduced with recent work -- it's not strictly a regression because
>>> we
>>>>> had equally bad but different behavior when the DAGScheduler
>> exceptions
>>>>> weren't previously being handled at all vs. being slightly
>> mis-handled
>>>>> now), so I'm not currently seeing a reason not to release.
>>> 
>> 

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

Posted by Matei Zaharia <ma...@gmail.com>.
As others have said, the 1.0 milestone is about API stability, not about saying “we’ve eliminated all bugs”. The sooner you declare 1.0, the sooner users can confidently build on Spark, knowing that the application they build today will still run on Spark 1.9.9 three years from now. This is something that I’ve seen done badly (and experienced the effects thereof) in other big data projects, such as MapReduce and even YARN. The result is that you annoy users, you end up with a fragmented userbase where everyone is building against a different version, and you drastically slow down development.

With a project as fast-growing as fast-growing as Spark in particular, there will be new bugs discovered and reported continuously, especially in the non-core components. Look at the graph of # of contributors in time to Spark: https://www.ohloh.net/p/apache-spark (bottom-most graph; “commits” changed when we started merging each patch as a single commit). This is not slowing down, and we need to have the culture now that we treat API stability and release numbers at the level expected for a 1.0 project instead of having people come in and randomly change the API.

I’ll also note that the issues marked “blocker” were marked so by their reporters, since the reporter can set the priority. I don’t consider stuff like parallelize() not partitioning ranges in the same way as other collections a blocker — it’s a bug, it would be good to fix it, but it only affects a small number of use cases. Of course if we find a real blocker (in particular a regression from a previous version, or a feature that’s just completely broken), we will delay the release for that, but at some point you have to say “okay, this fix will go into the next maintenance release”. Maybe we need to write a clear policy for what the issue priorities mean.

Finally, I believe it’s much better to have a culture where you can make releases on a regular schedule, and have the option to make a maintenance release in 3-4 days if you find new bugs, than one where you pile up stuff into each release. This is what much large project than us, like Linux, do, and it’s the only way to avoid indefinite stalling with a large contributor base. In the worst case, if you find a new bug that warrants immediate release, it goes into 1.0.1 a week after 1.0.0 (we can vote on 1.0.1 in three days with just your bug fix in it). And if you find an API that you’d like to improve, just add a new one and maybe deprecate the old one — at some point we have to respect our users and let them know that code they write today will still run tomorrow.

Matei

On May 17, 2014, at 10:32 AM, Kan Zhang <kz...@apache.org> wrote:

> +1 on the running commentary here, non-binding of course :-)
> 
> 
> On Sat, May 17, 2014 at 8:44 AM, Andrew Ash <an...@andrewash.com> wrote:
> 
>> +1 on the next release feeling more like a 0.10 than a 1.0
>> On May 17, 2014 4:38 AM, "Mridul Muralidharan" <mr...@gmail.com> wrote:
>> 
>>> I had echoed similar sentiments a while back when there was a discussion
>>> around 0.10 vs 1.0 ... I would have preferred 0.10 to stabilize the api
>>> changes, add missing functionality, go through a hardening release before
>>> 1.0
>>> 
>>> But the community preferred a 1.0 :-)
>>> 
>>> Regards,
>>> Mridul
>>> 
>>> On 17-May-2014 3:19 pm, "Sean Owen" <so...@cloudera.com> wrote:
>>>> 
>>>> On this note, non-binding commentary:
>>>> 
>>>> Releases happen in local minima of change, usually created by
>>>> internally enforced code freeze. Spark is incredibly busy now due to
>>>> external factors -- recently a TLP, recently discovered by a large new
>>>> audience, ease of contribution enabled by Github. It's getting like
>>>> the first year of mainstream battle-testing in a month. It's been very
>>>> hard to freeze anything! I see a number of non-trivial issues being
>>>> reported, and I don't think it has been possible to triage all of
>>>> them, even.
>>>> 
>>>> Given the high rate of change, my instinct would have been to release
>>>> 0.10.0 now. But won't it always be very busy? I do think the rate of
>>>> significant issues will slow down.
>>>> 
>>>> Version ain't nothing but a number, but if it has any meaning it's the
>>>> semantic versioning meaning. 1.0 imposes extra handicaps around
>>>> striving to maintain backwards-compatibility. That may end up being
>>>> bent to fit in important changes that are going to be required in this
>>>> continuing period of change. Hadoop does this all the time
>>>> unfortunately and gets away with it, I suppose -- minor version
>>>> releases are really major. (On the other extreme, HBase is at 0.98 and
>>>> quite production-ready.)
>>>> 
>>>> Just consider this a second vote for focus on fixes and 1.0.x rather
>>>> than new features and 1.x. I think there are a few steps that could
>>>> streamline triage of this flood of contributions, and make all of this
>>>> easier, but that's for another thread.
>>>> 
>>>> 
>>>> On Fri, May 16, 2014 at 8:50 PM, Mark Hamstra <mark@clearstorydata.com
>>> 
>>> wrote:
>>>>> +1, but just barely.  We've got quite a number of outstanding bugs
>>>>> identified, and many of them have fixes in progress.  I'd hate to see
>>> those
>>>>> efforts get lost in a post-1.0.0 flood of new features targeted at
>>> 1.1.0 --
>>>>> in other words, I'd like to see 1.0.1 retain a high priority relative
>>> to
>>>>> 1.1.0.
>>>>> 
>>>>> Looking through the unresolved JIRAs, it doesn't look like any of the
>>>>> identified bugs are show-stoppers or strictly regressions (although I
>>> will
>>>>> note that one that I have in progress, SPARK-1749, is a bug that we
>>>>> introduced with recent work -- it's not strictly a regression because
>>> we
>>>>> had equally bad but different behavior when the DAGScheduler
>> exceptions
>>>>> weren't previously being handled at all vs. being slightly
>> mis-handled
>>>>> now), so I'm not currently seeing a reason not to release.
>>> 
>> 


Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

Posted by Kan Zhang <kz...@apache.org>.
+1 on the running commentary here, non-binding of course :-)


On Sat, May 17, 2014 at 8:44 AM, Andrew Ash <an...@andrewash.com> wrote:

> +1 on the next release feeling more like a 0.10 than a 1.0
> On May 17, 2014 4:38 AM, "Mridul Muralidharan" <mr...@gmail.com> wrote:
>
> > I had echoed similar sentiments a while back when there was a discussion
> > around 0.10 vs 1.0 ... I would have preferred 0.10 to stabilize the api
> > changes, add missing functionality, go through a hardening release before
> > 1.0
> >
> > But the community preferred a 1.0 :-)
> >
> > Regards,
> > Mridul
> >
> > On 17-May-2014 3:19 pm, "Sean Owen" <so...@cloudera.com> wrote:
> > >
> > > On this note, non-binding commentary:
> > >
> > > Releases happen in local minima of change, usually created by
> > > internally enforced code freeze. Spark is incredibly busy now due to
> > > external factors -- recently a TLP, recently discovered by a large new
> > > audience, ease of contribution enabled by Github. It's getting like
> > > the first year of mainstream battle-testing in a month. It's been very
> > > hard to freeze anything! I see a number of non-trivial issues being
> > > reported, and I don't think it has been possible to triage all of
> > > them, even.
> > >
> > > Given the high rate of change, my instinct would have been to release
> > > 0.10.0 now. But won't it always be very busy? I do think the rate of
> > > significant issues will slow down.
> > >
> > > Version ain't nothing but a number, but if it has any meaning it's the
> > > semantic versioning meaning. 1.0 imposes extra handicaps around
> > > striving to maintain backwards-compatibility. That may end up being
> > > bent to fit in important changes that are going to be required in this
> > > continuing period of change. Hadoop does this all the time
> > > unfortunately and gets away with it, I suppose -- minor version
> > > releases are really major. (On the other extreme, HBase is at 0.98 and
> > > quite production-ready.)
> > >
> > > Just consider this a second vote for focus on fixes and 1.0.x rather
> > > than new features and 1.x. I think there are a few steps that could
> > > streamline triage of this flood of contributions, and make all of this
> > > easier, but that's for another thread.
> > >
> > >
> > > On Fri, May 16, 2014 at 8:50 PM, Mark Hamstra <mark@clearstorydata.com
> >
> > wrote:
> > > > +1, but just barely.  We've got quite a number of outstanding bugs
> > > > identified, and many of them have fixes in progress.  I'd hate to see
> > those
> > > > efforts get lost in a post-1.0.0 flood of new features targeted at
> > 1.1.0 --
> > > > in other words, I'd like to see 1.0.1 retain a high priority relative
> > to
> > > > 1.1.0.
> > > >
> > > > Looking through the unresolved JIRAs, it doesn't look like any of the
> > > > identified bugs are show-stoppers or strictly regressions (although I
> > will
> > > > note that one that I have in progress, SPARK-1749, is a bug that we
> > > > introduced with recent work -- it's not strictly a regression because
> > we
> > > > had equally bad but different behavior when the DAGScheduler
> exceptions
> > > > weren't previously being handled at all vs. being slightly
> mis-handled
> > > > now), so I'm not currently seeing a reason not to release.
> >
>

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

Posted by Andrew Ash <an...@andrewash.com>.
+1 on the next release feeling more like a 0.10 than a 1.0
On May 17, 2014 4:38 AM, "Mridul Muralidharan" <mr...@gmail.com> wrote:

> I had echoed similar sentiments a while back when there was a discussion
> around 0.10 vs 1.0 ... I would have preferred 0.10 to stabilize the api
> changes, add missing functionality, go through a hardening release before
> 1.0
>
> But the community preferred a 1.0 :-)
>
> Regards,
> Mridul
>
> On 17-May-2014 3:19 pm, "Sean Owen" <so...@cloudera.com> wrote:
> >
> > On this note, non-binding commentary:
> >
> > Releases happen in local minima of change, usually created by
> > internally enforced code freeze. Spark is incredibly busy now due to
> > external factors -- recently a TLP, recently discovered by a large new
> > audience, ease of contribution enabled by Github. It's getting like
> > the first year of mainstream battle-testing in a month. It's been very
> > hard to freeze anything! I see a number of non-trivial issues being
> > reported, and I don't think it has been possible to triage all of
> > them, even.
> >
> > Given the high rate of change, my instinct would have been to release
> > 0.10.0 now. But won't it always be very busy? I do think the rate of
> > significant issues will slow down.
> >
> > Version ain't nothing but a number, but if it has any meaning it's the
> > semantic versioning meaning. 1.0 imposes extra handicaps around
> > striving to maintain backwards-compatibility. That may end up being
> > bent to fit in important changes that are going to be required in this
> > continuing period of change. Hadoop does this all the time
> > unfortunately and gets away with it, I suppose -- minor version
> > releases are really major. (On the other extreme, HBase is at 0.98 and
> > quite production-ready.)
> >
> > Just consider this a second vote for focus on fixes and 1.0.x rather
> > than new features and 1.x. I think there are a few steps that could
> > streamline triage of this flood of contributions, and make all of this
> > easier, but that's for another thread.
> >
> >
> > On Fri, May 16, 2014 at 8:50 PM, Mark Hamstra <ma...@clearstorydata.com>
> wrote:
> > > +1, but just barely.  We've got quite a number of outstanding bugs
> > > identified, and many of them have fixes in progress.  I'd hate to see
> those
> > > efforts get lost in a post-1.0.0 flood of new features targeted at
> 1.1.0 --
> > > in other words, I'd like to see 1.0.1 retain a high priority relative
> to
> > > 1.1.0.
> > >
> > > Looking through the unresolved JIRAs, it doesn't look like any of the
> > > identified bugs are show-stoppers or strictly regressions (although I
> will
> > > note that one that I have in progress, SPARK-1749, is a bug that we
> > > introduced with recent work -- it's not strictly a regression because
> we
> > > had equally bad but different behavior when the DAGScheduler exceptions
> > > weren't previously being handled at all vs. being slightly mis-handled
> > > now), so I'm not currently seeing a reason not to release.
>

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

Posted by Sean Owen <so...@cloudera.com>.
On Sat, May 17, 2014 at 4:52 PM, Mark Hamstra <ma...@clearstorydata.com> wrote:
> Which of the unresolved bugs in spark-core do you think will require an
> API-breaking change to fix?  If there are none of those, then we are still
> essentially on track for a 1.0.0 release.

I don't have a particular one in mind, but look at
https://issues.apache.org/jira/browse/SPARK-1817?filter=12327229 for
example. There are 10 issues marked blocker or critical, that are
targeted at Core / 1.0.0 (or unset). Many are probably not critical,
not for 1.0, or wouldn't require a big change to fix. But has this
been reviewed then -- can you tell? I'd be happy for someone to tell
me to stop worrying, yeah, there's nothing too big here.


> The number of contributions and pace of change now is quite high, but I
> don't think that waiting for the pace to slow before releasing 1.0 is
> viable.  If Spark's short history is any guide to its near future, the pace
> will not slow by any significant amount for any noteworthy length of time,

I think we'd agree core is the most important part. I'd humbly suggest
fixes and improvements to core remain exceptionally important after
1.0 and there is a long line of proposed changes, most good. Would be
great to really burn that down. Maybe that is the kind of thing I
personally would have preferred to see before a 1.0, but it's not up
to me and there are other factors at work here. I don't object
strongly or anything.

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

Posted by Mridul Muralidharan <mr...@gmail.com>.
On 18-May-2014 1:45 am, "Mark Hamstra" <ma...@clearstorydata.com> wrote:
>
> I'm not trying to muzzle the discussion.  All I am saying is that we don't
> need to have the same discussion about 0.10 vs. 1.0 that we already had.

Agreed, no point in repeating the same discussion ... I am also trying to
understand what the concerns are.

Specifically though, the scope of 1.0 (in terms of changes) went up quite a
bit - a lot of which are new changes and features; not just the initially
envisioned api changes and stability fixes.

If this is raising concerns, particularly since lot of users are depending
on stability of spark interfaces (api, env, scripts, behavior); I want to
understand better what they are - and if they are legitimately serious
enough, we will need to revisit decision to go to 1.0 instead of 0.10 ...
I hope we don't need to though given how late we are in dev cycle

Regards
Mridul

>  If you can tell me about specific changes in the current release
candidate
> that occasion new arguments for why a 1.0 release is an unacceptable idea,
> then I'm listening.
>
>
> On Sat, May 17, 2014 at 11:59 AM, Mridul Muralidharan <mridul@gmail.com
>wrote:
>
> > On 17-May-2014 11:40 pm, "Mark Hamstra" <ma...@clearstorydata.com> wrote:
> > >
> > > That is a past issue that we don't need to be re-opening now.  The
> > present
> >
> > Huh ? If we need to revisit based on changed circumstances, we must -
the
> > scope of changes introduced in this release was definitely not
anticipated
> > when 1.0 vs 0.10 discussion happened.
> >
> > If folks are worried about stability of core; it is a valid concern IMO.
> >
> > Having said that, I am still ok with going to 1.0; but if a conversation
> > starts about need for 1.0 vs going to 0.10 I want to hear more and
possibly
> > allay the concerns and not try to muzzle the discussion.
> >
> >
> > Regards
> > Mridul
> >
> > > issue, and what I am asking, is which pending bug fixes does anyone
> > > anticipate will require breaking the public API guaranteed in rc9
> > >
> > >
> > > On Sat, May 17, 2014 at 9:44 AM, Mridul Muralidharan <mridul@gmail.com
> > >wrote:
> > >
> > > > We made incompatible api changes whose impact we don't know yet
> > completely
> > > > : both from implementation and usage point of view.
> > > >
> > > > We had the option of getting real-world feedback from the user
> > community if
> > > > we had gone to 0.10 but the spark developers seemed to be in a
hurry to
> > get
> > > > to 1.0 - so I made my opinion known but left it to the wisdom of
larger
> > > > group of committers to decide ... I did not think it was critical
> > enough to
> > > > do a binding -1 on.
> > > >
> > > > Regards
> > > > Mridul
> > > > On 17-May-2014 9:43 pm, "Mark Hamstra" <ma...@clearstorydata.com>
> > wrote:
> > > >
> > > > > Which of the unresolved bugs in spark-core do you think will
require
> > an
> > > > > API-breaking change to fix?  If there are none of those, then we
are
> > > > still
> > > > > essentially on track for a 1.0.0 release.
> > > > >
> > > > > The number of contributions and pace of change now is quite high,
but
> > I
> > > > > don't think that waiting for the pace to slow before releasing
1.0 is
> > > > > viable.  If Spark's short history is any guide to its near future,
> > the
> > > > pace
> > > > > will not slow by any significant amount for any noteworthy length
of
> > > > time,
> > > > > but rather will continue to increase.  What we need to be aiming
for,
> > I
> > > > > think, is to have the great majority of those new contributions
being
> > > > made
> > > > > to MLLlib, GraphX, SparkSQL and other areas of the code that we
have
> > > > > clearly marked as not frozen in 1.x. I think we are already seeing
> > that,
> > > > > but if I am just not recognizing breakage of our semantic
versioning
> > > > > guarantee that will be forced on us by some pending changes, now
> > would
> > > > be a
> > > > > good time to set me straight.
> > > > >
> > > > >
> > > > > On Sat, May 17, 2014 at 4:26 AM, Mridul Muralidharan <
> > mridul@gmail.com
> > > > > >wrote:
> > > > >
> > > > > > I had echoed similar sentiments a while back when there was a
> > > > discussion
> > > > > > around 0.10 vs 1.0 ... I would have preferred 0.10 to stabilize
the
> > api
> > > > > > changes, add missing functionality, go through a hardening
release
> > > > before
> > > > > > 1.0
> > > > > >
> > > > > > But the community preferred a 1.0 :-)
> > > > > >
> > > > > > Regards,
> > > > > > Mridul
> > > > > >
> > > > > > On 17-May-2014 3:19 pm, "Sean Owen" <so...@cloudera.com> wrote:
> > > > > > >
> > > > > > > On this note, non-binding commentary:
> > > > > > >
> > > > > > > Releases happen in local minima of change, usually created by
> > > > > > > internally enforced code freeze. Spark is incredibly busy now
due
> > to
> > > > > > > external factors -- recently a TLP, recently discovered by a
> > large
> > > > new
> > > > > > > audience, ease of contribution enabled by Github. It's getting
> > like
> > > > > > > the first year of mainstream battle-testing in a month. It's
been
> > > > very
> > > > > > > hard to freeze anything! I see a number of non-trivial issues
> > being
> > > > > > > reported, and I don't think it has been possible to triage
all of
> > > > > > > them, even.
> > > > > > >
> > > > > > > Given the high rate of change, my instinct would have been to
> > release
> > > > > > > 0.10.0 now. But won't it always be very busy? I do think the
rate
> > of
> > > > > > > significant issues will slow down.
> > > > > > >
> > > > > > > Version ain't nothing but a number, but if it has any meaning
> > it's
> > > > the
> > > > > > > semantic versioning meaning. 1.0 imposes extra handicaps
around
> > > > > > > striving to maintain backwards-compatibility. That may end up
> > being
> > > > > > > bent to fit in important changes that are going to be
required in
> > > > this
> > > > > > > continuing period of change. Hadoop does this all the time
> > > > > > > unfortunately and gets away with it, I suppose -- minor
version
> > > > > > > releases are really major. (On the other extreme, HBase is at
> > 0.98
> > > > and
> > > > > > > quite production-ready.)
> > > > > > >
> > > > > > > Just consider this a second vote for focus on fixes and 1.0.x
> > rather
> > > > > > > than new features and 1.x. I think there are a few steps that
> > could
> > > > > > > streamline triage of this flood of contributions, and make
all of
> > > > this
> > > > > > > easier, but that's for another thread.
> > > > > > >
> > > > > > >
> > > > > > > On Fri, May 16, 2014 at 8:50 PM, Mark Hamstra <
> > > > mark@clearstorydata.com
> > > > > >
> > > > > > wrote:
> > > > > > > > +1, but just barely.  We've got quite a number of
outstanding
> > bugs
> > > > > > > > identified, and many of them have fixes in progress.  I'd
hate
> > to
> > > > see
> > > > > > those
> > > > > > > > efforts get lost in a post-1.0.0 flood of new features
targeted
> > at
> > > > > > 1.1.0 --
> > > > > > > > in other words, I'd like to see 1.0.1 retain a high priority
> > > > relative
> > > > > > to
> > > > > > > > 1.1.0.
> > > > > > > >
> > > > > > > > Looking through the unresolved JIRAs, it doesn't look like
any
> > of
> > > > the
> > > > > > > > identified bugs are show-stoppers or strictly regressions
> > > > (although I
> > > > > > will
> > > > > > > > note that one that I have in progress, SPARK-1749, is a bug
> > that we
> > > > > > > > introduced with recent work -- it's not strictly a
regression
> > > > because
> > > > > > we
> > > > > > > > had equally bad but different behavior when the DAGScheduler
> > > > > exceptions
> > > > > > > > weren't previously being handled at all vs. being slightly
> > > > > mis-handled
> > > > > > > > now), so I'm not currently seeing a reason not to release.
> > > > > >
> > > > >
> > > >
> >

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

Posted by Mark Hamstra <ma...@clearstorydata.com>.
I'm not trying to muzzle the discussion.  All I am saying is that we don't
need to have the same discussion about 0.10 vs. 1.0 that we already had.
 If you can tell me about specific changes in the current release candidate
that occasion new arguments for why a 1.0 release is an unacceptable idea,
then I'm listening.


On Sat, May 17, 2014 at 11:59 AM, Mridul Muralidharan <mr...@gmail.com>wrote:

> On 17-May-2014 11:40 pm, "Mark Hamstra" <ma...@clearstorydata.com> wrote:
> >
> > That is a past issue that we don't need to be re-opening now.  The
> present
>
> Huh ? If we need to revisit based on changed circumstances, we must - the
> scope of changes introduced in this release was definitely not anticipated
> when 1.0 vs 0.10 discussion happened.
>
> If folks are worried about stability of core; it is a valid concern IMO.
>
> Having said that, I am still ok with going to 1.0; but if a conversation
> starts about need for 1.0 vs going to 0.10 I want to hear more and possibly
> allay the concerns and not try to muzzle the discussion.
>
>
> Regards
> Mridul
>
> > issue, and what I am asking, is which pending bug fixes does anyone
> > anticipate will require breaking the public API guaranteed in rc9
> >
> >
> > On Sat, May 17, 2014 at 9:44 AM, Mridul Muralidharan <mridul@gmail.com
> >wrote:
> >
> > > We made incompatible api changes whose impact we don't know yet
> completely
> > > : both from implementation and usage point of view.
> > >
> > > We had the option of getting real-world feedback from the user
> community if
> > > we had gone to 0.10 but the spark developers seemed to be in a hurry to
> get
> > > to 1.0 - so I made my opinion known but left it to the wisdom of larger
> > > group of committers to decide ... I did not think it was critical
> enough to
> > > do a binding -1 on.
> > >
> > > Regards
> > > Mridul
> > > On 17-May-2014 9:43 pm, "Mark Hamstra" <ma...@clearstorydata.com>
> wrote:
> > >
> > > > Which of the unresolved bugs in spark-core do you think will require
> an
> > > > API-breaking change to fix?  If there are none of those, then we are
> > > still
> > > > essentially on track for a 1.0.0 release.
> > > >
> > > > The number of contributions and pace of change now is quite high, but
> I
> > > > don't think that waiting for the pace to slow before releasing 1.0 is
> > > > viable.  If Spark's short history is any guide to its near future,
> the
> > > pace
> > > > will not slow by any significant amount for any noteworthy length of
> > > time,
> > > > but rather will continue to increase.  What we need to be aiming for,
> I
> > > > think, is to have the great majority of those new contributions being
> > > made
> > > > to MLLlib, GraphX, SparkSQL and other areas of the code that we have
> > > > clearly marked as not frozen in 1.x. I think we are already seeing
> that,
> > > > but if I am just not recognizing breakage of our semantic versioning
> > > > guarantee that will be forced on us by some pending changes, now
> would
> > > be a
> > > > good time to set me straight.
> > > >
> > > >
> > > > On Sat, May 17, 2014 at 4:26 AM, Mridul Muralidharan <
> mridul@gmail.com
> > > > >wrote:
> > > >
> > > > > I had echoed similar sentiments a while back when there was a
> > > discussion
> > > > > around 0.10 vs 1.0 ... I would have preferred 0.10 to stabilize the
> api
> > > > > changes, add missing functionality, go through a hardening release
> > > before
> > > > > 1.0
> > > > >
> > > > > But the community preferred a 1.0 :-)
> > > > >
> > > > > Regards,
> > > > > Mridul
> > > > >
> > > > > On 17-May-2014 3:19 pm, "Sean Owen" <so...@cloudera.com> wrote:
> > > > > >
> > > > > > On this note, non-binding commentary:
> > > > > >
> > > > > > Releases happen in local minima of change, usually created by
> > > > > > internally enforced code freeze. Spark is incredibly busy now due
> to
> > > > > > external factors -- recently a TLP, recently discovered by a
> large
> > > new
> > > > > > audience, ease of contribution enabled by Github. It's getting
> like
> > > > > > the first year of mainstream battle-testing in a month. It's been
> > > very
> > > > > > hard to freeze anything! I see a number of non-trivial issues
> being
> > > > > > reported, and I don't think it has been possible to triage all of
> > > > > > them, even.
> > > > > >
> > > > > > Given the high rate of change, my instinct would have been to
> release
> > > > > > 0.10.0 now. But won't it always be very busy? I do think the rate
> of
> > > > > > significant issues will slow down.
> > > > > >
> > > > > > Version ain't nothing but a number, but if it has any meaning
> it's
> > > the
> > > > > > semantic versioning meaning. 1.0 imposes extra handicaps around
> > > > > > striving to maintain backwards-compatibility. That may end up
> being
> > > > > > bent to fit in important changes that are going to be required in
> > > this
> > > > > > continuing period of change. Hadoop does this all the time
> > > > > > unfortunately and gets away with it, I suppose -- minor version
> > > > > > releases are really major. (On the other extreme, HBase is at
> 0.98
> > > and
> > > > > > quite production-ready.)
> > > > > >
> > > > > > Just consider this a second vote for focus on fixes and 1.0.x
> rather
> > > > > > than new features and 1.x. I think there are a few steps that
> could
> > > > > > streamline triage of this flood of contributions, and make all of
> > > this
> > > > > > easier, but that's for another thread.
> > > > > >
> > > > > >
> > > > > > On Fri, May 16, 2014 at 8:50 PM, Mark Hamstra <
> > > mark@clearstorydata.com
> > > > >
> > > > > wrote:
> > > > > > > +1, but just barely.  We've got quite a number of outstanding
> bugs
> > > > > > > identified, and many of them have fixes in progress.  I'd hate
> to
> > > see
> > > > > those
> > > > > > > efforts get lost in a post-1.0.0 flood of new features targeted
> at
> > > > > 1.1.0 --
> > > > > > > in other words, I'd like to see 1.0.1 retain a high priority
> > > relative
> > > > > to
> > > > > > > 1.1.0.
> > > > > > >
> > > > > > > Looking through the unresolved JIRAs, it doesn't look like any
> of
> > > the
> > > > > > > identified bugs are show-stoppers or strictly regressions
> > > (although I
> > > > > will
> > > > > > > note that one that I have in progress, SPARK-1749, is a bug
> that we
> > > > > > > introduced with recent work -- it's not strictly a regression
> > > because
> > > > > we
> > > > > > > had equally bad but different behavior when the DAGScheduler
> > > > exceptions
> > > > > > > weren't previously being handled at all vs. being slightly
> > > > mis-handled
> > > > > > > now), so I'm not currently seeing a reason not to release.
> > > > >
> > > >
> > >
>

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

Posted by Mridul Muralidharan <mr...@gmail.com>.
On 17-May-2014 11:40 pm, "Mark Hamstra" <ma...@clearstorydata.com> wrote:
>
> That is a past issue that we don't need to be re-opening now.  The present

Huh ? If we need to revisit based on changed circumstances, we must - the
scope of changes introduced in this release was definitely not anticipated
when 1.0 vs 0.10 discussion happened.

If folks are worried about stability of core; it is a valid concern IMO.

Having said that, I am still ok with going to 1.0; but if a conversation
starts about need for 1.0 vs going to 0.10 I want to hear more and possibly
allay the concerns and not try to muzzle the discussion.


Regards
Mridul

> issue, and what I am asking, is which pending bug fixes does anyone
> anticipate will require breaking the public API guaranteed in rc9
>
>
> On Sat, May 17, 2014 at 9:44 AM, Mridul Muralidharan <mridul@gmail.com
>wrote:
>
> > We made incompatible api changes whose impact we don't know yet
completely
> > : both from implementation and usage point of view.
> >
> > We had the option of getting real-world feedback from the user
community if
> > we had gone to 0.10 but the spark developers seemed to be in a hurry to
get
> > to 1.0 - so I made my opinion known but left it to the wisdom of larger
> > group of committers to decide ... I did not think it was critical
enough to
> > do a binding -1 on.
> >
> > Regards
> > Mridul
> > On 17-May-2014 9:43 pm, "Mark Hamstra" <ma...@clearstorydata.com> wrote:
> >
> > > Which of the unresolved bugs in spark-core do you think will require
an
> > > API-breaking change to fix?  If there are none of those, then we are
> > still
> > > essentially on track for a 1.0.0 release.
> > >
> > > The number of contributions and pace of change now is quite high, but
I
> > > don't think that waiting for the pace to slow before releasing 1.0 is
> > > viable.  If Spark's short history is any guide to its near future, the
> > pace
> > > will not slow by any significant amount for any noteworthy length of
> > time,
> > > but rather will continue to increase.  What we need to be aiming for,
I
> > > think, is to have the great majority of those new contributions being
> > made
> > > to MLLlib, GraphX, SparkSQL and other areas of the code that we have
> > > clearly marked as not frozen in 1.x. I think we are already seeing
that,
> > > but if I am just not recognizing breakage of our semantic versioning
> > > guarantee that will be forced on us by some pending changes, now would
> > be a
> > > good time to set me straight.
> > >
> > >
> > > On Sat, May 17, 2014 at 4:26 AM, Mridul Muralidharan <mridul@gmail.com
> > > >wrote:
> > >
> > > > I had echoed similar sentiments a while back when there was a
> > discussion
> > > > around 0.10 vs 1.0 ... I would have preferred 0.10 to stabilize the
api
> > > > changes, add missing functionality, go through a hardening release
> > before
> > > > 1.0
> > > >
> > > > But the community preferred a 1.0 :-)
> > > >
> > > > Regards,
> > > > Mridul
> > > >
> > > > On 17-May-2014 3:19 pm, "Sean Owen" <so...@cloudera.com> wrote:
> > > > >
> > > > > On this note, non-binding commentary:
> > > > >
> > > > > Releases happen in local minima of change, usually created by
> > > > > internally enforced code freeze. Spark is incredibly busy now due
to
> > > > > external factors -- recently a TLP, recently discovered by a large
> > new
> > > > > audience, ease of contribution enabled by Github. It's getting
like
> > > > > the first year of mainstream battle-testing in a month. It's been
> > very
> > > > > hard to freeze anything! I see a number of non-trivial issues
being
> > > > > reported, and I don't think it has been possible to triage all of
> > > > > them, even.
> > > > >
> > > > > Given the high rate of change, my instinct would have been to
release
> > > > > 0.10.0 now. But won't it always be very busy? I do think the rate
of
> > > > > significant issues will slow down.
> > > > >
> > > > > Version ain't nothing but a number, but if it has any meaning it's
> > the
> > > > > semantic versioning meaning. 1.0 imposes extra handicaps around
> > > > > striving to maintain backwards-compatibility. That may end up
being
> > > > > bent to fit in important changes that are going to be required in
> > this
> > > > > continuing period of change. Hadoop does this all the time
> > > > > unfortunately and gets away with it, I suppose -- minor version
> > > > > releases are really major. (On the other extreme, HBase is at 0.98
> > and
> > > > > quite production-ready.)
> > > > >
> > > > > Just consider this a second vote for focus on fixes and 1.0.x
rather
> > > > > than new features and 1.x. I think there are a few steps that
could
> > > > > streamline triage of this flood of contributions, and make all of
> > this
> > > > > easier, but that's for another thread.
> > > > >
> > > > >
> > > > > On Fri, May 16, 2014 at 8:50 PM, Mark Hamstra <
> > mark@clearstorydata.com
> > > >
> > > > wrote:
> > > > > > +1, but just barely.  We've got quite a number of outstanding
bugs
> > > > > > identified, and many of them have fixes in progress.  I'd hate
to
> > see
> > > > those
> > > > > > efforts get lost in a post-1.0.0 flood of new features targeted
at
> > > > 1.1.0 --
> > > > > > in other words, I'd like to see 1.0.1 retain a high priority
> > relative
> > > > to
> > > > > > 1.1.0.
> > > > > >
> > > > > > Looking through the unresolved JIRAs, it doesn't look like any
of
> > the
> > > > > > identified bugs are show-stoppers or strictly regressions
> > (although I
> > > > will
> > > > > > note that one that I have in progress, SPARK-1749, is a bug
that we
> > > > > > introduced with recent work -- it's not strictly a regression
> > because
> > > > we
> > > > > > had equally bad but different behavior when the DAGScheduler
> > > exceptions
> > > > > > weren't previously being handled at all vs. being slightly
> > > mis-handled
> > > > > > now), so I'm not currently seeing a reason not to release.
> > > >
> > >
> >

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

Posted by Mark Hamstra <ma...@clearstorydata.com>.
That is a past issue that we don't need to be re-opening now.  The present
issue, and what I am asking, is which pending bug fixes does anyone
anticipate will require breaking the public API guaranteed in rc9?


On Sat, May 17, 2014 at 9:44 AM, Mridul Muralidharan <mr...@gmail.com>wrote:

> We made incompatible api changes whose impact we don't know yet completely
> : both from implementation and usage point of view.
>
> We had the option of getting real-world feedback from the user community if
> we had gone to 0.10 but the spark developers seemed to be in a hurry to get
> to 1.0 - so I made my opinion known but left it to the wisdom of larger
> group of committers to decide ... I did not think it was critical enough to
> do a binding -1 on.
>
> Regards
> Mridul
> On 17-May-2014 9:43 pm, "Mark Hamstra" <ma...@clearstorydata.com> wrote:
>
> > Which of the unresolved bugs in spark-core do you think will require an
> > API-breaking change to fix?  If there are none of those, then we are
> still
> > essentially on track for a 1.0.0 release.
> >
> > The number of contributions and pace of change now is quite high, but I
> > don't think that waiting for the pace to slow before releasing 1.0 is
> > viable.  If Spark's short history is any guide to its near future, the
> pace
> > will not slow by any significant amount for any noteworthy length of
> time,
> > but rather will continue to increase.  What we need to be aiming for, I
> > think, is to have the great majority of those new contributions being
> made
> > to MLLlib, GraphX, SparkSQL and other areas of the code that we have
> > clearly marked as not frozen in 1.x. I think we are already seeing that,
> > but if I am just not recognizing breakage of our semantic versioning
> > guarantee that will be forced on us by some pending changes, now would
> be a
> > good time to set me straight.
> >
> >
> > On Sat, May 17, 2014 at 4:26 AM, Mridul Muralidharan <mridul@gmail.com
> > >wrote:
> >
> > > I had echoed similar sentiments a while back when there was a
> discussion
> > > around 0.10 vs 1.0 ... I would have preferred 0.10 to stabilize the api
> > > changes, add missing functionality, go through a hardening release
> before
> > > 1.0
> > >
> > > But the community preferred a 1.0 :-)
> > >
> > > Regards,
> > > Mridul
> > >
> > > On 17-May-2014 3:19 pm, "Sean Owen" <so...@cloudera.com> wrote:
> > > >
> > > > On this note, non-binding commentary:
> > > >
> > > > Releases happen in local minima of change, usually created by
> > > > internally enforced code freeze. Spark is incredibly busy now due to
> > > > external factors -- recently a TLP, recently discovered by a large
> new
> > > > audience, ease of contribution enabled by Github. It's getting like
> > > > the first year of mainstream battle-testing in a month. It's been
> very
> > > > hard to freeze anything! I see a number of non-trivial issues being
> > > > reported, and I don't think it has been possible to triage all of
> > > > them, even.
> > > >
> > > > Given the high rate of change, my instinct would have been to release
> > > > 0.10.0 now. But won't it always be very busy? I do think the rate of
> > > > significant issues will slow down.
> > > >
> > > > Version ain't nothing but a number, but if it has any meaning it's
> the
> > > > semantic versioning meaning. 1.0 imposes extra handicaps around
> > > > striving to maintain backwards-compatibility. That may end up being
> > > > bent to fit in important changes that are going to be required in
> this
> > > > continuing period of change. Hadoop does this all the time
> > > > unfortunately and gets away with it, I suppose -- minor version
> > > > releases are really major. (On the other extreme, HBase is at 0.98
> and
> > > > quite production-ready.)
> > > >
> > > > Just consider this a second vote for focus on fixes and 1.0.x rather
> > > > than new features and 1.x. I think there are a few steps that could
> > > > streamline triage of this flood of contributions, and make all of
> this
> > > > easier, but that's for another thread.
> > > >
> > > >
> > > > On Fri, May 16, 2014 at 8:50 PM, Mark Hamstra <
> mark@clearstorydata.com
> > >
> > > wrote:
> > > > > +1, but just barely.  We've got quite a number of outstanding bugs
> > > > > identified, and many of them have fixes in progress.  I'd hate to
> see
> > > those
> > > > > efforts get lost in a post-1.0.0 flood of new features targeted at
> > > 1.1.0 --
> > > > > in other words, I'd like to see 1.0.1 retain a high priority
> relative
> > > to
> > > > > 1.1.0.
> > > > >
> > > > > Looking through the unresolved JIRAs, it doesn't look like any of
> the
> > > > > identified bugs are show-stoppers or strictly regressions
> (although I
> > > will
> > > > > note that one that I have in progress, SPARK-1749, is a bug that we
> > > > > introduced with recent work -- it's not strictly a regression
> because
> > > we
> > > > > had equally bad but different behavior when the DAGScheduler
> > exceptions
> > > > > weren't previously being handled at all vs. being slightly
> > mis-handled
> > > > > now), so I'm not currently seeing a reason not to release.
> > >
> >
>

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

Posted by Mridul Muralidharan <mr...@gmail.com>.
We made incompatible api changes whose impact we don't know yet completely
: both from implementation and usage point of view.

We had the option of getting real-world feedback from the user community if
we had gone to 0.10 but the spark developers seemed to be in a hurry to get
to 1.0 - so I made my opinion known but left it to the wisdom of larger
group of committers to decide ... I did not think it was critical enough to
do a binding -1 on.

Regards
Mridul
On 17-May-2014 9:43 pm, "Mark Hamstra" <ma...@clearstorydata.com> wrote:

> Which of the unresolved bugs in spark-core do you think will require an
> API-breaking change to fix?  If there are none of those, then we are still
> essentially on track for a 1.0.0 release.
>
> The number of contributions and pace of change now is quite high, but I
> don't think that waiting for the pace to slow before releasing 1.0 is
> viable.  If Spark's short history is any guide to its near future, the pace
> will not slow by any significant amount for any noteworthy length of time,
> but rather will continue to increase.  What we need to be aiming for, I
> think, is to have the great majority of those new contributions being made
> to MLLlib, GraphX, SparkSQL and other areas of the code that we have
> clearly marked as not frozen in 1.x. I think we are already seeing that,
> but if I am just not recognizing breakage of our semantic versioning
> guarantee that will be forced on us by some pending changes, now would be a
> good time to set me straight.
>
>
> On Sat, May 17, 2014 at 4:26 AM, Mridul Muralidharan <mridul@gmail.com
> >wrote:
>
> > I had echoed similar sentiments a while back when there was a discussion
> > around 0.10 vs 1.0 ... I would have preferred 0.10 to stabilize the api
> > changes, add missing functionality, go through a hardening release before
> > 1.0
> >
> > But the community preferred a 1.0 :-)
> >
> > Regards,
> > Mridul
> >
> > On 17-May-2014 3:19 pm, "Sean Owen" <so...@cloudera.com> wrote:
> > >
> > > On this note, non-binding commentary:
> > >
> > > Releases happen in local minima of change, usually created by
> > > internally enforced code freeze. Spark is incredibly busy now due to
> > > external factors -- recently a TLP, recently discovered by a large new
> > > audience, ease of contribution enabled by Github. It's getting like
> > > the first year of mainstream battle-testing in a month. It's been very
> > > hard to freeze anything! I see a number of non-trivial issues being
> > > reported, and I don't think it has been possible to triage all of
> > > them, even.
> > >
> > > Given the high rate of change, my instinct would have been to release
> > > 0.10.0 now. But won't it always be very busy? I do think the rate of
> > > significant issues will slow down.
> > >
> > > Version ain't nothing but a number, but if it has any meaning it's the
> > > semantic versioning meaning. 1.0 imposes extra handicaps around
> > > striving to maintain backwards-compatibility. That may end up being
> > > bent to fit in important changes that are going to be required in this
> > > continuing period of change. Hadoop does this all the time
> > > unfortunately and gets away with it, I suppose -- minor version
> > > releases are really major. (On the other extreme, HBase is at 0.98 and
> > > quite production-ready.)
> > >
> > > Just consider this a second vote for focus on fixes and 1.0.x rather
> > > than new features and 1.x. I think there are a few steps that could
> > > streamline triage of this flood of contributions, and make all of this
> > > easier, but that's for another thread.
> > >
> > >
> > > On Fri, May 16, 2014 at 8:50 PM, Mark Hamstra <mark@clearstorydata.com
> >
> > wrote:
> > > > +1, but just barely.  We've got quite a number of outstanding bugs
> > > > identified, and many of them have fixes in progress.  I'd hate to see
> > those
> > > > efforts get lost in a post-1.0.0 flood of new features targeted at
> > 1.1.0 --
> > > > in other words, I'd like to see 1.0.1 retain a high priority relative
> > to
> > > > 1.1.0.
> > > >
> > > > Looking through the unresolved JIRAs, it doesn't look like any of the
> > > > identified bugs are show-stoppers or strictly regressions (although I
> > will
> > > > note that one that I have in progress, SPARK-1749, is a bug that we
> > > > introduced with recent work -- it's not strictly a regression because
> > we
> > > > had equally bad but different behavior when the DAGScheduler
> exceptions
> > > > weren't previously being handled at all vs. being slightly
> mis-handled
> > > > now), so I'm not currently seeing a reason not to release.
> >
>

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

Posted by Mark Hamstra <ma...@clearstorydata.com>.
Which of the unresolved bugs in spark-core do you think will require an
API-breaking change to fix?  If there are none of those, then we are still
essentially on track for a 1.0.0 release.

The number of contributions and pace of change now is quite high, but I
don't think that waiting for the pace to slow before releasing 1.0 is
viable.  If Spark's short history is any guide to its near future, the pace
will not slow by any significant amount for any noteworthy length of time,
but rather will continue to increase.  What we need to be aiming for, I
think, is to have the great majority of those new contributions being made
to MLLlib, GraphX, SparkSQL and other areas of the code that we have
clearly marked as not frozen in 1.x. I think we are already seeing that,
but if I am just not recognizing breakage of our semantic versioning
guarantee that will be forced on us by some pending changes, now would be a
good time to set me straight.


On Sat, May 17, 2014 at 4:26 AM, Mridul Muralidharan <mr...@gmail.com>wrote:

> I had echoed similar sentiments a while back when there was a discussion
> around 0.10 vs 1.0 ... I would have preferred 0.10 to stabilize the api
> changes, add missing functionality, go through a hardening release before
> 1.0
>
> But the community preferred a 1.0 :-)
>
> Regards,
> Mridul
>
> On 17-May-2014 3:19 pm, "Sean Owen" <so...@cloudera.com> wrote:
> >
> > On this note, non-binding commentary:
> >
> > Releases happen in local minima of change, usually created by
> > internally enforced code freeze. Spark is incredibly busy now due to
> > external factors -- recently a TLP, recently discovered by a large new
> > audience, ease of contribution enabled by Github. It's getting like
> > the first year of mainstream battle-testing in a month. It's been very
> > hard to freeze anything! I see a number of non-trivial issues being
> > reported, and I don't think it has been possible to triage all of
> > them, even.
> >
> > Given the high rate of change, my instinct would have been to release
> > 0.10.0 now. But won't it always be very busy? I do think the rate of
> > significant issues will slow down.
> >
> > Version ain't nothing but a number, but if it has any meaning it's the
> > semantic versioning meaning. 1.0 imposes extra handicaps around
> > striving to maintain backwards-compatibility. That may end up being
> > bent to fit in important changes that are going to be required in this
> > continuing period of change. Hadoop does this all the time
> > unfortunately and gets away with it, I suppose -- minor version
> > releases are really major. (On the other extreme, HBase is at 0.98 and
> > quite production-ready.)
> >
> > Just consider this a second vote for focus on fixes and 1.0.x rather
> > than new features and 1.x. I think there are a few steps that could
> > streamline triage of this flood of contributions, and make all of this
> > easier, but that's for another thread.
> >
> >
> > On Fri, May 16, 2014 at 8:50 PM, Mark Hamstra <ma...@clearstorydata.com>
> wrote:
> > > +1, but just barely.  We've got quite a number of outstanding bugs
> > > identified, and many of them have fixes in progress.  I'd hate to see
> those
> > > efforts get lost in a post-1.0.0 flood of new features targeted at
> 1.1.0 --
> > > in other words, I'd like to see 1.0.1 retain a high priority relative
> to
> > > 1.1.0.
> > >
> > > Looking through the unresolved JIRAs, it doesn't look like any of the
> > > identified bugs are show-stoppers or strictly regressions (although I
> will
> > > note that one that I have in progress, SPARK-1749, is a bug that we
> > > introduced with recent work -- it's not strictly a regression because
> we
> > > had equally bad but different behavior when the DAGScheduler exceptions
> > > weren't previously being handled at all vs. being slightly mis-handled
> > > now), so I'm not currently seeing a reason not to release.
>

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

Posted by Mridul Muralidharan <mr...@gmail.com>.
I had echoed similar sentiments a while back when there was a discussion
around 0.10 vs 1.0 ... I would have preferred 0.10 to stabilize the api
changes, add missing functionality, go through a hardening release before
1.0

But the community preferred a 1.0 :-)

Regards,
Mridul

On 17-May-2014 3:19 pm, "Sean Owen" <so...@cloudera.com> wrote:
>
> On this note, non-binding commentary:
>
> Releases happen in local minima of change, usually created by
> internally enforced code freeze. Spark is incredibly busy now due to
> external factors -- recently a TLP, recently discovered by a large new
> audience, ease of contribution enabled by Github. It's getting like
> the first year of mainstream battle-testing in a month. It's been very
> hard to freeze anything! I see a number of non-trivial issues being
> reported, and I don't think it has been possible to triage all of
> them, even.
>
> Given the high rate of change, my instinct would have been to release
> 0.10.0 now. But won't it always be very busy? I do think the rate of
> significant issues will slow down.
>
> Version ain't nothing but a number, but if it has any meaning it's the
> semantic versioning meaning. 1.0 imposes extra handicaps around
> striving to maintain backwards-compatibility. That may end up being
> bent to fit in important changes that are going to be required in this
> continuing period of change. Hadoop does this all the time
> unfortunately and gets away with it, I suppose -- minor version
> releases are really major. (On the other extreme, HBase is at 0.98 and
> quite production-ready.)
>
> Just consider this a second vote for focus on fixes and 1.0.x rather
> than new features and 1.x. I think there are a few steps that could
> streamline triage of this flood of contributions, and make all of this
> easier, but that's for another thread.
>
>
> On Fri, May 16, 2014 at 8:50 PM, Mark Hamstra <ma...@clearstorydata.com>
wrote:
> > +1, but just barely.  We've got quite a number of outstanding bugs
> > identified, and many of them have fixes in progress.  I'd hate to see
those
> > efforts get lost in a post-1.0.0 flood of new features targeted at
1.1.0 --
> > in other words, I'd like to see 1.0.1 retain a high priority relative to
> > 1.1.0.
> >
> > Looking through the unresolved JIRAs, it doesn't look like any of the
> > identified bugs are show-stoppers or strictly regressions (although I
will
> > note that one that I have in progress, SPARK-1749, is a bug that we
> > introduced with recent work -- it's not strictly a regression because we
> > had equally bad but different behavior when the DAGScheduler exceptions
> > weren't previously being handled at all vs. being slightly mis-handled
> > now), so I'm not currently seeing a reason not to release.

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

Posted by Sean Owen <so...@cloudera.com>.
On this note, non-binding commentary:

Releases happen in local minima of change, usually created by
internally enforced code freeze. Spark is incredibly busy now due to
external factors -- recently a TLP, recently discovered by a large new
audience, ease of contribution enabled by Github. It's getting like
the first year of mainstream battle-testing in a month. It's been very
hard to freeze anything! I see a number of non-trivial issues being
reported, and I don't think it has been possible to triage all of
them, even.

Given the high rate of change, my instinct would have been to release
0.10.0 now. But won't it always be very busy? I do think the rate of
significant issues will slow down.

Version ain't nothing but a number, but if it has any meaning it's the
semantic versioning meaning. 1.0 imposes extra handicaps around
striving to maintain backwards-compatibility. That may end up being
bent to fit in important changes that are going to be required in this
continuing period of change. Hadoop does this all the time
unfortunately and gets away with it, I suppose -- minor version
releases are really major. (On the other extreme, HBase is at 0.98 and
quite production-ready.)

Just consider this a second vote for focus on fixes and 1.0.x rather
than new features and 1.x. I think there are a few steps that could
streamline triage of this flood of contributions, and make all of this
easier, but that's for another thread.


On Fri, May 16, 2014 at 8:50 PM, Mark Hamstra <ma...@clearstorydata.com> wrote:
> +1, but just barely.  We've got quite a number of outstanding bugs
> identified, and many of them have fixes in progress.  I'd hate to see those
> efforts get lost in a post-1.0.0 flood of new features targeted at 1.1.0 --
> in other words, I'd like to see 1.0.1 retain a high priority relative to
> 1.1.0.
>
> Looking through the unresolved JIRAs, it doesn't look like any of the
> identified bugs are show-stoppers or strictly regressions (although I will
> note that one that I have in progress, SPARK-1749, is a bug that we
> introduced with recent work -- it's not strictly a regression because we
> had equally bad but different behavior when the DAGScheduler exceptions
> weren't previously being handled at all vs. being slightly mis-handled
> now), so I'm not currently seeing a reason not to release.

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

Posted by Mark Hamstra <ma...@clearstorydata.com>.
+1, but just barely.  We've got quite a number of outstanding bugs
identified, and many of them have fixes in progress.  I'd hate to see those
efforts get lost in a post-1.0.0 flood of new features targeted at 1.1.0 --
in other words, I'd like to see 1.0.1 retain a high priority relative to
1.1.0.

Looking through the unresolved JIRAs, it doesn't look like any of the
identified bugs are show-stoppers or strictly regressions (although I will
note that one that I have in progress, SPARK-1749, is a bug that we
introduced with recent work -- it's not strictly a regression because we
had equally bad but different behavior when the DAGScheduler exceptions
weren't previously being handled at all vs. being slightly mis-handled
now), so I'm not currently seeing a reason not to release.


On Tue, May 13, 2014 at 1:36 AM, Patrick Wendell <pw...@gmail.com> wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.0.0!
>
> The tag to be voted on is v1.0.0-rc5 (commit 18f0623):
>
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=18f062303303824139998e8fc8f4158217b0dbc3
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-1.0.0-rc5/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1012/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/
>
> Please vote on releasing this package as Apache Spark 1.0.0!
>
> The vote is open until Friday, May 16, at 09:30 UTC and passes if a
> majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.0.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see
> http://spark.apache.org/
>
> == API Changes ==
> We welcome users to compile Spark applications against 1.0. There are
> a few API changes in this release. Here are links to the associated
> upgrade guides - user facing changes have been kept as small as
> possible.
>
> changes to ML vector specification:
>
> http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/mllib-guide.html#from-09-to-10
>
> changes to the Java API:
>
> http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
>
> changes to the streaming API:
>
> http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x
>
> changes to the GraphX API:
>
> http://people.apache.org/~pwendell/spark-1.0.0-rc5-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091
>
> coGroup and related functions now return Iterable[T] instead of Seq[T]
> ==> Call toSeq on the result to restore the old behavior
>
> SparkContext.jarOfClass returns Option[String] instead of Seq[String]
> ==> Call toSeq on the result to restore old behavior
>

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

Posted by Andrew Or <an...@databricks.com>.
+1


2014-05-13 6:49 GMT-07:00 Sean Owen <so...@cloudera.com>:

> On Tue, May 13, 2014 at 9:36 AM, Patrick Wendell <pw...@gmail.com>
> wrote:
> > The release files, including signatures, digests, etc. can be found at:
> > http://people.apache.org/~pwendell/spark-1.0.0-rc5/
>
> Good news is that the sigs, MD5 and SHA are all correct.
>
> Tiny note: the Maven artifacts use SHA1, while the binary artifacts
> use SHA512, which took me a bit of head-scratching to figure out.
>
> If another RC comes out, I might suggest making it SHA1 everywhere?
> But there is nothing wrong with these signatures and checksums.
>
> Now to look at the contents...
>

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

Posted by Patrick Wendell <pw...@gmail.com>.
I'm cancelling this vote in favor of rc6.

On Tue, May 13, 2014 at 8:01 AM, Sean Owen <so...@cloudera.com> wrote:
> On Tue, May 13, 2014 at 2:49 PM, Sean Owen <so...@cloudera.com> wrote:
>> On Tue, May 13, 2014 at 9:36 AM, Patrick Wendell <pw...@gmail.com> wrote:
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://people.apache.org/~pwendell/spark-1.0.0-rc5/
>>
>> Good news is that the sigs, MD5 and SHA are all correct.
>>
>> Tiny note: the Maven artifacts use SHA1, while the binary artifacts
>> use SHA512, which took me a bit of head-scratching to figure out.
>>
>> If another RC comes out, I might suggest making it SHA1 everywhere?
>> But there is nothing wrong with these signatures and checksums.
>>
>> Now to look at the contents...
>
> This is a bit of drudgery that probably needs to be done too: a review
> of the LICENSE and NOTICE file. Having dumped the licenses of
> dependencies, I don't believe these reflect all of the software that's
> going to be distributed in 1.0.
>
> (Good news is there's no forbidden license stuff included AFAICT.)
>
> And good news is that NOTICE can be auto-generated, largely, with a
> Maven plugin. This can be done manually for now.
>
> And there is a license plugin that will list all known licenses of
> transitive dependencies so that LICENSE can be filled out fairly
> easily.
>
> What say? want a JIRA with details?

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

Posted by Sean Owen <so...@cloudera.com>.
On Tue, May 13, 2014 at 2:49 PM, Sean Owen <so...@cloudera.com> wrote:
> On Tue, May 13, 2014 at 9:36 AM, Patrick Wendell <pw...@gmail.com> wrote:
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-1.0.0-rc5/
>
> Good news is that the sigs, MD5 and SHA are all correct.
>
> Tiny note: the Maven artifacts use SHA1, while the binary artifacts
> use SHA512, which took me a bit of head-scratching to figure out.
>
> If another RC comes out, I might suggest making it SHA1 everywhere?
> But there is nothing wrong with these signatures and checksums.
>
> Now to look at the contents...

This is a bit of drudgery that probably needs to be done too: a review
of the LICENSE and NOTICE file. Having dumped the licenses of
dependencies, I don't believe these reflect all of the software that's
going to be distributed in 1.0.

(Good news is there's no forbidden license stuff included AFAICT.)

And good news is that NOTICE can be auto-generated, largely, with a
Maven plugin. This can be done manually for now.

And there is a license plugin that will list all known licenses of
transitive dependencies so that LICENSE can be filled out fairly
easily.

What say? want a JIRA with details?

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

Posted by Patrick Wendell <pw...@gmail.com>.
Hey Everyone,

Just a heads up - I've sent other release candidates to the list, but
they appear to be getting swallowed (i.e. they are not on nabble). I
think there is an issue with Apache mail servers.

I'm going to keep trying... if you get duplicate e-mails I apologize in advance.

On Thu, May 15, 2014 at 10:23 AM, Patrick Wendell <pw...@gmail.com> wrote:
> Thanks for your feedback. Since it's not a regression, it won't block
> the release.
>
> On Wed, May 14, 2014 at 12:17 AM, witgo <wi...@qq.com> wrote:
>> SPARK-1817 will cause users to get incorrect results  and RDD.zip is common usage .
>> This should be the highest priority. I think we should fix the bug,and should also test the previous release
>> ------------------ Original ------------------
>> From:  "Patrick Wendell";<pw...@gmail.com>;
>> Date:  Wed, May 14, 2014 03:02 PM
>> To:  "dev@spark.apache.org"<de...@spark.apache.org>;
>>
>> Subject:  Re: [VOTE] Release Apache Spark 1.0.0 (rc5)
>>
>>
>>
>> Hey @witgo - those bugs are not severe enough to block the release,
>> but it would be nice to get them fixed.
>>
>> At this point we are focused on severe bugs with an immediate fix, or
>> regressions from previous versions of Spark. Anything that misses this
>> release will get merged into the branch-1.0 branch and make it into
>> the 1.0.1 release, so people will have access to it.
>>
>> On Tue, May 13, 2014 at 5:32 PM, witgo <wi...@qq.com> wrote:
>>> -1
>>> The following bug should be fixed:
>>> https://issues.apache.org/jira/browse/SPARK-1817
>>> https://issues.apache.org/jira/browse/SPARK-1712
>>>
>>>
>>> ------------------ Original ------------------
>>> From:  "Patrick Wendell";<pw...@gmail.com>;
>>> Date:  Wed, May 14, 2014 04:07 AM
>>> To:  "dev@spark.apache.org"<de...@spark.apache.org>;
>>>
>>> Subject:  Re: [VOTE] Release Apache Spark 1.0.0 (rc5)
>>>
>>>
>>>
>>> Hey all - there were some earlier RC's that were not presented to the
>>> dev list because issues were found with them. Also, there seems to be
>>> some issues with the reliability of the dev list e-mail. Just a heads
>>> up.
>>>
>>> I'll lead with a +1 for this.
>>>
>>> On Tue, May 13, 2014 at 8:07 AM, Nan Zhu <zh...@gmail.com> wrote:
>>>> just curious, where is rc4 VOTE?
>>>>
>>>> I searched my gmail but didn't find that?
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, May 13, 2014 at 9:49 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>
>>>>> On Tue, May 13, 2014 at 9:36 AM, Patrick Wendell <pw...@gmail.com>
>>>>> wrote:
>>>>> > The release files, including signatures, digests, etc. can be found at:
>>>>> > http://people.apache.org/~pwendell/spark-1.0.0-rc5/
>>>>>
>>>>> Good news is that the sigs, MD5 and SHA are all correct.
>>>>>
>>>>> Tiny note: the Maven artifacts use SHA1, while the binary artifacts
>>>>> use SHA512, which took me a bit of head-scratching to figure out.
>>>>>
>>>>> If another RC comes out, I might suggest making it SHA1 everywhere?
>>>>> But there is nothing wrong with these signatures and checksums.
>>>>>
>>>>> Now to look at the contents...
>>>>>
>>> .
>> .

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

Posted by Patrick Wendell <pw...@gmail.com>.
Thanks for your feedback. Since it's not a regression, it won't block
the release.

On Wed, May 14, 2014 at 12:17 AM, witgo <wi...@qq.com> wrote:
> SPARK-1817 will cause users to get incorrect results  and RDD.zip is common usage .
> This should be the highest priority. I think we should fix the bug,and should also test the previous release
> ------------------ Original ------------------
> From:  "Patrick Wendell";<pw...@gmail.com>;
> Date:  Wed, May 14, 2014 03:02 PM
> To:  "dev@spark.apache.org"<de...@spark.apache.org>;
>
> Subject:  Re: [VOTE] Release Apache Spark 1.0.0 (rc5)
>
>
>
> Hey @witgo - those bugs are not severe enough to block the release,
> but it would be nice to get them fixed.
>
> At this point we are focused on severe bugs with an immediate fix, or
> regressions from previous versions of Spark. Anything that misses this
> release will get merged into the branch-1.0 branch and make it into
> the 1.0.1 release, so people will have access to it.
>
> On Tue, May 13, 2014 at 5:32 PM, witgo <wi...@qq.com> wrote:
>> -1
>> The following bug should be fixed:
>> https://issues.apache.org/jira/browse/SPARK-1817
>> https://issues.apache.org/jira/browse/SPARK-1712
>>
>>
>> ------------------ Original ------------------
>> From:  "Patrick Wendell";<pw...@gmail.com>;
>> Date:  Wed, May 14, 2014 04:07 AM
>> To:  "dev@spark.apache.org"<de...@spark.apache.org>;
>>
>> Subject:  Re: [VOTE] Release Apache Spark 1.0.0 (rc5)
>>
>>
>>
>> Hey all - there were some earlier RC's that were not presented to the
>> dev list because issues were found with them. Also, there seems to be
>> some issues with the reliability of the dev list e-mail. Just a heads
>> up.
>>
>> I'll lead with a +1 for this.
>>
>> On Tue, May 13, 2014 at 8:07 AM, Nan Zhu <zh...@gmail.com> wrote:
>>> just curious, where is rc4 VOTE?
>>>
>>> I searched my gmail but didn't find that?
>>>
>>>
>>>
>>>
>>> On Tue, May 13, 2014 at 9:49 AM, Sean Owen <so...@cloudera.com> wrote:
>>>
>>>> On Tue, May 13, 2014 at 9:36 AM, Patrick Wendell <pw...@gmail.com>
>>>> wrote:
>>>> > The release files, including signatures, digests, etc. can be found at:
>>>> > http://people.apache.org/~pwendell/spark-1.0.0-rc5/
>>>>
>>>> Good news is that the sigs, MD5 and SHA are all correct.
>>>>
>>>> Tiny note: the Maven artifacts use SHA1, while the binary artifacts
>>>> use SHA512, which took me a bit of head-scratching to figure out.
>>>>
>>>> If another RC comes out, I might suggest making it SHA1 everywhere?
>>>> But there is nothing wrong with these signatures and checksums.
>>>>
>>>> Now to look at the contents...
>>>>
>> .
> .

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

Posted by witgo <wi...@qq.com>.
SPARK-1817 will cause users to get incorrect results  and RDD.zip is common usage .
This should be the highest priority. I think we should fix the bug,and should also test the previous release
------------------ Original ------------------
From:  "Patrick Wendell";<pw...@gmail.com>;
Date:  Wed, May 14, 2014 03:02 PM
To:  "dev@spark.apache.org"<de...@spark.apache.org>; 

Subject:  Re: [VOTE] Release Apache Spark 1.0.0 (rc5)



Hey @witgo - those bugs are not severe enough to block the release,
but it would be nice to get them fixed.

At this point we are focused on severe bugs with an immediate fix, or
regressions from previous versions of Spark. Anything that misses this
release will get merged into the branch-1.0 branch and make it into
the 1.0.1 release, so people will have access to it.

On Tue, May 13, 2014 at 5:32 PM, witgo <wi...@qq.com> wrote:
> -1
> The following bug should be fixed:
> https://issues.apache.org/jira/browse/SPARK-1817
> https://issues.apache.org/jira/browse/SPARK-1712
>
>
> ------------------ Original ------------------
> From:  "Patrick Wendell";<pw...@gmail.com>;
> Date:  Wed, May 14, 2014 04:07 AM
> To:  "dev@spark.apache.org"<de...@spark.apache.org>;
>
> Subject:  Re: [VOTE] Release Apache Spark 1.0.0 (rc5)
>
>
>
> Hey all - there were some earlier RC's that were not presented to the
> dev list because issues were found with them. Also, there seems to be
> some issues with the reliability of the dev list e-mail. Just a heads
> up.
>
> I'll lead with a +1 for this.
>
> On Tue, May 13, 2014 at 8:07 AM, Nan Zhu <zh...@gmail.com> wrote:
>> just curious, where is rc4 VOTE?
>>
>> I searched my gmail but didn't find that?
>>
>>
>>
>>
>> On Tue, May 13, 2014 at 9:49 AM, Sean Owen <so...@cloudera.com> wrote:
>>
>>> On Tue, May 13, 2014 at 9:36 AM, Patrick Wendell <pw...@gmail.com>
>>> wrote:
>>> > The release files, including signatures, digests, etc. can be found at:
>>> > http://people.apache.org/~pwendell/spark-1.0.0-rc5/
>>>
>>> Good news is that the sigs, MD5 and SHA are all correct.
>>>
>>> Tiny note: the Maven artifacts use SHA1, while the binary artifacts
>>> use SHA512, which took me a bit of head-scratching to figure out.
>>>
>>> If another RC comes out, I might suggest making it SHA1 everywhere?
>>> But there is nothing wrong with these signatures and checksums.
>>>
>>> Now to look at the contents...
>>>
> .
.

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

Posted by Patrick Wendell <pw...@gmail.com>.
Hey @witgo - those bugs are not severe enough to block the release,
but it would be nice to get them fixed.

At this point we are focused on severe bugs with an immediate fix, or
regressions from previous versions of Spark. Anything that misses this
release will get merged into the branch-1.0 branch and make it into
the 1.0.1 release, so people will have access to it.

On Tue, May 13, 2014 at 5:32 PM, witgo <wi...@qq.com> wrote:
> -1
> The following bug should be fixed:
> https://issues.apache.org/jira/browse/SPARK-1817
> https://issues.apache.org/jira/browse/SPARK-1712
>
>
> ------------------ Original ------------------
> From:  "Patrick Wendell";<pw...@gmail.com>;
> Date:  Wed, May 14, 2014 04:07 AM
> To:  "dev@spark.apache.org"<de...@spark.apache.org>;
>
> Subject:  Re: [VOTE] Release Apache Spark 1.0.0 (rc5)
>
>
>
> Hey all - there were some earlier RC's that were not presented to the
> dev list because issues were found with them. Also, there seems to be
> some issues with the reliability of the dev list e-mail. Just a heads
> up.
>
> I'll lead with a +1 for this.
>
> On Tue, May 13, 2014 at 8:07 AM, Nan Zhu <zh...@gmail.com> wrote:
>> just curious, where is rc4 VOTE?
>>
>> I searched my gmail but didn't find that?
>>
>>
>>
>>
>> On Tue, May 13, 2014 at 9:49 AM, Sean Owen <so...@cloudera.com> wrote:
>>
>>> On Tue, May 13, 2014 at 9:36 AM, Patrick Wendell <pw...@gmail.com>
>>> wrote:
>>> > The release files, including signatures, digests, etc. can be found at:
>>> > http://people.apache.org/~pwendell/spark-1.0.0-rc5/
>>>
>>> Good news is that the sigs, MD5 and SHA are all correct.
>>>
>>> Tiny note: the Maven artifacts use SHA1, while the binary artifacts
>>> use SHA512, which took me a bit of head-scratching to figure out.
>>>
>>> If another RC comes out, I might suggest making it SHA1 everywhere?
>>> But there is nothing wrong with these signatures and checksums.
>>>
>>> Now to look at the contents...
>>>
> .

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

Posted by Henry Saputra <he...@gmail.com>.
HI Sandy,

Just curious if the Vote is for rc5 or rc6? Gmail shows me that you
replied to the rc5 thread.

Thanks,

- Henry

On Wed, May 14, 2014 at 1:28 PM, Sandy Ryza <sa...@cloudera.com> wrote:
> +1 (non-binding)
>
> * Built the release from source.
> * Compiled Java and Scala apps that interact with HDFS against it.
> * Ran them in local mode.
> * Ran them against a pseudo-distributed YARN cluster in both yarn-client
> mode and yarn-cluster mode.
>
>
> On Tue, May 13, 2014 at 9:09 PM, witgo <wi...@qq.com> wrote:
>
>> You need to set:
>> spark.akka.frameSize         5
>> spark.default.parallelism    1
>>
>>
>>
>>
>>
>> ------------------ Original ------------------
>> From:  "Madhu";<ma...@madhu.com>;
>> Date:  Wed, May 14, 2014 09:15 AM
>> To:  "dev"<de...@spark.incubator.apache.org>;
>>
>> Subject:  Re: [VOTE] Release Apache Spark 1.0.0 (rc5)
>>
>>
>>
>> I just built rc5 on Windows 7 and tried to reproduce the problem described
>> in
>>
>> https://issues.apache.org/jira/browse/SPARK-1712
>>
>> It works on my machine:
>>
>> 14/05/13 21:06:47 INFO DAGScheduler: Stage 1 (sum at <console>:17) finished
>> in 4.548 s
>> 14/05/13 21:06:47 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks
>> have all completed, from pool
>> 14/05/13 21:06:47 INFO SparkContext: Job finished: sum at <console>:17,
>> took
>> 4.814991993 s
>> res1: Double = 5.000005E11
>>
>> I used all defaults, no config files were changed.
>> Not sure if that makes a difference...
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-0-0-rc5-tp6542p6560.html
>> Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com.
>> .

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

Posted by Sandy Ryza <sa...@cloudera.com>.
+1 (non-binding)

* Built the release from source.
* Compiled Java and Scala apps that interact with HDFS against it.
* Ran them in local mode.
* Ran them against a pseudo-distributed YARN cluster in both yarn-client
mode and yarn-cluster mode.


On Tue, May 13, 2014 at 9:09 PM, witgo <wi...@qq.com> wrote:

> You need to set:
> spark.akka.frameSize         5
> spark.default.parallelism    1
>
>
>
>
>
> ------------------ Original ------------------
> From:  "Madhu";<ma...@madhu.com>;
> Date:  Wed, May 14, 2014 09:15 AM
> To:  "dev"<de...@spark.incubator.apache.org>;
>
> Subject:  Re: [VOTE] Release Apache Spark 1.0.0 (rc5)
>
>
>
> I just built rc5 on Windows 7 and tried to reproduce the problem described
> in
>
> https://issues.apache.org/jira/browse/SPARK-1712
>
> It works on my machine:
>
> 14/05/13 21:06:47 INFO DAGScheduler: Stage 1 (sum at <console>:17) finished
> in 4.548 s
> 14/05/13 21:06:47 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks
> have all completed, from pool
> 14/05/13 21:06:47 INFO SparkContext: Job finished: sum at <console>:17,
> took
> 4.814991993 s
> res1: Double = 5.000005E11
>
> I used all defaults, no config files were changed.
> Not sure if that makes a difference...
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-0-0-rc5-tp6542p6560.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
> .

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

Posted by witgo <wi...@qq.com>.
You need to set: 
spark.akka.frameSize         5
spark.default.parallelism    1





------------------ Original ------------------
From:  "Madhu";<ma...@madhu.com>;
Date:  Wed, May 14, 2014 09:15 AM
To:  "dev"<de...@spark.incubator.apache.org>; 

Subject:  Re: [VOTE] Release Apache Spark 1.0.0 (rc5)



I just built rc5 on Windows 7 and tried to reproduce the problem described in

https://issues.apache.org/jira/browse/SPARK-1712

It works on my machine:

14/05/13 21:06:47 INFO DAGScheduler: Stage 1 (sum at <console>:17) finished
in 4.548 s
14/05/13 21:06:47 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks
have all completed, from pool
14/05/13 21:06:47 INFO SparkContext: Job finished: sum at <console>:17, took
4.814991993 s
res1: Double = 5.000005E11

I used all defaults, no config files were changed.
Not sure if that makes a difference...



--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-0-0-rc5-tp6542p6560.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
.

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

Posted by Madhu <ma...@madhu.com>.
I just built rc5 on Windows 7 and tried to reproduce the problem described in

https://issues.apache.org/jira/browse/SPARK-1712

It works on my machine:

14/05/13 21:06:47 INFO DAGScheduler: Stage 1 (sum at <console>:17) finished
in 4.548 s
14/05/13 21:06:47 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks
have all completed, from pool
14/05/13 21:06:47 INFO SparkContext: Job finished: sum at <console>:17, took
4.814991993 s
res1: Double = 5.000005E11

I used all defaults, no config files were changed.
Not sure if that makes a difference...



--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-0-0-rc5-tp6542p6560.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

Posted by witgo <wi...@qq.com>.
-1 
The following bug should be fixed: 
https://issues.apache.org/jira/browse/SPARK-1817
https://issues.apache.org/jira/browse/SPARK-1712


------------------ Original ------------------
From:  "Patrick Wendell";<pw...@gmail.com>;
Date:  Wed, May 14, 2014 04:07 AM
To:  "dev@spark.apache.org"<de...@spark.apache.org>; 

Subject:  Re: [VOTE] Release Apache Spark 1.0.0 (rc5)



Hey all - there were some earlier RC's that were not presented to the
dev list because issues were found with them. Also, there seems to be
some issues with the reliability of the dev list e-mail. Just a heads
up.

I'll lead with a +1 for this.

On Tue, May 13, 2014 at 8:07 AM, Nan Zhu <zh...@gmail.com> wrote:
> just curious, where is rc4 VOTE?
>
> I searched my gmail but didn't find that?
>
>
>
>
> On Tue, May 13, 2014 at 9:49 AM, Sean Owen <so...@cloudera.com> wrote:
>
>> On Tue, May 13, 2014 at 9:36 AM, Patrick Wendell <pw...@gmail.com>
>> wrote:
>> > The release files, including signatures, digests, etc. can be found at:
>> > http://people.apache.org/~pwendell/spark-1.0.0-rc5/
>>
>> Good news is that the sigs, MD5 and SHA are all correct.
>>
>> Tiny note: the Maven artifacts use SHA1, while the binary artifacts
>> use SHA512, which took me a bit of head-scratching to figure out.
>>
>> If another RC comes out, I might suggest making it SHA1 everywhere?
>> But there is nothing wrong with these signatures and checksums.
>>
>> Now to look at the contents...
>>
.

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

Posted by Patrick Wendell <pw...@gmail.com>.
Hey all - there were some earlier RC's that were not presented to the
dev list because issues were found with them. Also, there seems to be
some issues with the reliability of the dev list e-mail. Just a heads
up.

I'll lead with a +1 for this.

On Tue, May 13, 2014 at 8:07 AM, Nan Zhu <zh...@gmail.com> wrote:
> just curious, where is rc4 VOTE?
>
> I searched my gmail but didn't find that?
>
>
>
>
> On Tue, May 13, 2014 at 9:49 AM, Sean Owen <so...@cloudera.com> wrote:
>
>> On Tue, May 13, 2014 at 9:36 AM, Patrick Wendell <pw...@gmail.com>
>> wrote:
>> > The release files, including signatures, digests, etc. can be found at:
>> > http://people.apache.org/~pwendell/spark-1.0.0-rc5/
>>
>> Good news is that the sigs, MD5 and SHA are all correct.
>>
>> Tiny note: the Maven artifacts use SHA1, while the binary artifacts
>> use SHA512, which took me a bit of head-scratching to figure out.
>>
>> If another RC comes out, I might suggest making it SHA1 everywhere?
>> But there is nothing wrong with these signatures and checksums.
>>
>> Now to look at the contents...
>>

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

Posted by Nan Zhu <zh...@gmail.com>.
Ah, I see, thanks 

-- 
Nan Zhu


On Tuesday, May 13, 2014 at 12:59 PM, Mark Hamstra wrote:

> There were a few early/test RCs this cycle that were never put to a vote.
> 
> 
> On Tue, May 13, 2014 at 8:07 AM, Nan Zhu <zhunanmcgill@gmail.com (mailto:zhunanmcgill@gmail.com)> wrote:
> 
> > just curious, where is rc4 VOTE?
> > 
> > I searched my gmail but didn't find that?
> > 
> > 
> > 
> > 
> > On Tue, May 13, 2014 at 9:49 AM, Sean Owen <sowen@cloudera.com (mailto:sowen@cloudera.com)> wrote:
> > 
> > > On Tue, May 13, 2014 at 9:36 AM, Patrick Wendell <pwendell@gmail.com (mailto:pwendell@gmail.com)>
> > > wrote:
> > > > The release files, including signatures, digests, etc. can be found at:
> > > > http://people.apache.org/~pwendell/spark-1.0.0-rc5/
> > > > 
> > > 
> > > 
> > > Good news is that the sigs, MD5 and SHA are all correct.
> > > 
> > > Tiny note: the Maven artifacts use SHA1, while the binary artifacts
> > > use SHA512, which took me a bit of head-scratching to figure out.
> > > 
> > > If another RC comes out, I might suggest making it SHA1 everywhere?
> > > But there is nothing wrong with these signatures and checksums.
> > > 
> > > Now to look at the contents... 


Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

Posted by Mark Hamstra <ma...@clearstorydata.com>.
There were a few early/test RCs this cycle that were never put to a vote.


On Tue, May 13, 2014 at 8:07 AM, Nan Zhu <zh...@gmail.com> wrote:

> just curious, where is rc4 VOTE?
>
> I searched my gmail but didn't find that?
>
>
>
>
> On Tue, May 13, 2014 at 9:49 AM, Sean Owen <so...@cloudera.com> wrote:
>
> > On Tue, May 13, 2014 at 9:36 AM, Patrick Wendell <pw...@gmail.com>
> > wrote:
> > > The release files, including signatures, digests, etc. can be found at:
> > > http://people.apache.org/~pwendell/spark-1.0.0-rc5/
> >
> > Good news is that the sigs, MD5 and SHA are all correct.
> >
> > Tiny note: the Maven artifacts use SHA1, while the binary artifacts
> > use SHA512, which took me a bit of head-scratching to figure out.
> >
> > If another RC comes out, I might suggest making it SHA1 everywhere?
> > But there is nothing wrong with these signatures and checksums.
> >
> > Now to look at the contents...
> >
>

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

Posted by Nan Zhu <zh...@gmail.com>.
just curious, where is rc4 VOTE?

I searched my gmail but didn't find that?




On Tue, May 13, 2014 at 9:49 AM, Sean Owen <so...@cloudera.com> wrote:

> On Tue, May 13, 2014 at 9:36 AM, Patrick Wendell <pw...@gmail.com>
> wrote:
> > The release files, including signatures, digests, etc. can be found at:
> > http://people.apache.org/~pwendell/spark-1.0.0-rc5/
>
> Good news is that the sigs, MD5 and SHA are all correct.
>
> Tiny note: the Maven artifacts use SHA1, while the binary artifacts
> use SHA512, which took me a bit of head-scratching to figure out.
>
> If another RC comes out, I might suggest making it SHA1 everywhere?
> But there is nothing wrong with these signatures and checksums.
>
> Now to look at the contents...
>

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

Posted by Matei Zaharia <ma...@gmail.com>.
SHA-1 is being end-of-lived so I’d actually say switch to 512 for all of them instead.

On May 13, 2014, at 6:49 AM, Sean Owen <so...@cloudera.com> wrote:

> On Tue, May 13, 2014 at 9:36 AM, Patrick Wendell <pw...@gmail.com> wrote:
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-1.0.0-rc5/
> 
> Good news is that the sigs, MD5 and SHA are all correct.
> 
> Tiny note: the Maven artifacts use SHA1, while the binary artifacts
> use SHA512, which took me a bit of head-scratching to figure out.
> 
> If another RC comes out, I might suggest making it SHA1 everywhere?
> But there is nothing wrong with these signatures and checksums.
> 
> Now to look at the contents...


Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

Posted by Sean Owen <so...@cloudera.com>.
On Tue, May 13, 2014 at 9:36 AM, Patrick Wendell <pw...@gmail.com> wrote:
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-1.0.0-rc5/

Good news is that the sigs, MD5 and SHA are all correct.

Tiny note: the Maven artifacts use SHA1, while the binary artifacts
use SHA512, which took me a bit of head-scratching to figure out.

If another RC comes out, I might suggest making it SHA1 everywhere?
But there is nothing wrong with these signatures and checksums.

Now to look at the contents...

Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

Posted by Nan Zhu <zh...@gmail.com>.
+1, replaced rc3 with rc5, all applications are working fine

Best, 

-- 
Nan Zhu


On Tuesday, May 13, 2014 at 8:03 PM, Madhu wrote:

> I built rc5 using sbt/sbt assembly on Linux without any problems.
> There used to be an sbt.cmd for Windows build, has that been deprecated?
> If so, I can document the Windows build steps that worked for me.
> 
> 
> 
> --
> View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-0-0-rc5-tp6542p6558.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com (http://Nabble.com).
> 
> 



Re: [VOTE] Release Apache Spark 1.0.0 (rc5)

Posted by Madhu <ma...@madhu.com>.
I built rc5 using sbt/sbt assembly on Linux without any problems.
There used to be an sbt.cmd for Windows build, has that been deprecated?
If so, I can document the Windows build steps that worked for me.



--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-0-0-rc5-tp6542p6558.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.