You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@spark.apache.org by Reynold Xin <rx...@databricks.com> on 2016/11/07 18:10:54 UTC

Re: Spark Improvement Proposals

Oops. Let me try figure that out.

On Monday, November 7, 2016, Cody Koeninger <co...@koeninger.org> wrote:

> Thanks for picking up on this.
>
> Maybe I fail at google docs, but I can't see any edits on the document
> you linked.
>
> Regarding lazy consensus, if the board in general has less of an issue
> with that, sure.  As long as it is clearly announced, lasts at least
> 72 hours, and has a clear outcome.
>
> The other points are hard to comment on without being able to see the
> text in question.
>
>
> On Mon, Nov 7, 2016 at 3:11 AM, Reynold Xin <rxin@databricks.com
> <javascript:;>> wrote:
> > I just looked through the entire thread again tonight - there are a lot
> of
> > great ideas being discussed. Thanks Cody for taking the first crack at
> the
> > proposal.
> >
> > I want to first comment on the context. Spark is one of the most
> innovative
> > and important projects in (big) data -- overall technical decisions made
> in
> > Apache Spark are sound. But of course, a project as large and active as
> > Spark always have room for improvement, and we as a community should
> strive
> > to take it to the next level.
> >
> > To that end, the two biggest areas for improvements in my opinion are:
> >
> > 1. Visibility: There are so much happening that it is difficult to know
> what
> > really is going on. For people that don't follow closely, it is
> difficult to
> > know what the important initiatives are. Even for people that do follow,
> it
> > is difficult to know what specific things require their attention, since
> the
> > number of pull requests and JIRA tickets are high and it's difficult to
> > extract signal from noise.
> >
> > 2. Solicit user (broadly defined, including developers themselves) input
> > more proactively: At the end of the day the project provides value
> because
> > users use it. Users can't tell us exactly what to build, but it is
> important
> > to get their inputs.
> >
> >
> > I've taken Cody's doc and edited it:
> > https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-
> nRanvXmnZ7SUi4qMljg/edit#heading=h.36ut37zh7w2b
> > (I've made all my modifications trackable)
> >
> > There are couple high level changes I made:
> >
> > 1. I've consulted a board member and he recommended lazy consensus as
> > opposed to voting. The reason being in voting there can easily be a
> "loser'
> > that gets outvoted.
> >
> > 2. I made it lighter weight, and renamed "strategy" to "optional design
> > sketch". Echoing one of the earlier email: "IMHO so far aside from
> tagging
> > things and linking them elsewhere simply having design docs and
> prototypes
> > implementations in PRs is not something that has not worked so far".
> >
> > 3. I made some the language tweaks to focus more on visibility. For
> example,
> > "The purpose of an SIP is to inform and involve", rather than just
> > "involve". SIPs should also have at least two emails that go to dev@.
> >
> >
> > While I was editing this, I thought we really needed a suggested template
> > for design doc too. I will get to that too ...
> >
> >
> > On Tue, Nov 1, 2016 at 12:09 AM, Reynold Xin <rxin@databricks.com
> <javascript:;>> wrote:
> >>
> >> Most things looked OK to me too, although I do plan to take a closer
> look
> >> after Nov 1st when we cut the release branch for 2.1.
> >>
> >>
> >> On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin <vanzin@cloudera.com
> <javascript:;>>
> >> wrote:
> >>>
> >>> The proposal looks OK to me. I assume, even though it's not explicitly
> >>> called, that voting would happen by e-mail? A template for the
> >>> proposal document (instead of just a bullet nice) would also be nice,
> >>> but that can be done at any time.
> >>>
> >>> BTW, shameless plug: I filed SPARK-18085 which I consider a candidate
> >>> for a SIP, given the scope of the work. The document attached even
> >>> somewhat matches the proposed format. So if anyone wants to try out
> >>> the process...
> >>>
> >>> On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger <cody@koeninger.org
> <javascript:;>>
> >>> wrote:
> >>> > Now that spark summit europe is over, are any committers interested
> in
> >>> > moving forward with this?
> >>> >
> >>> >
> >>> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-
> improvement-proposals.md
> >>> >
> >>> > Or are we going to let this discussion die on the vine?
> >>> >
> >>> > On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda
> >>> > <tomasz.gaweda@outlook.com <javascript:;>> wrote:
> >>> >> Maybe my mail was not clear enough.
> >>> >>
> >>> >>
> >>> >> I didn't want to write "lets focus on Flink" or any other framework.
> >>> >> The
> >>> >> idea with benchmarks was to show two things:
> >>> >>
> >>> >> - why some people are doing bad PR for Spark
> >>> >>
> >>> >> - how - in easy way - we can change it and show that Spark is still
> on
> >>> >> the
> >>> >> top
> >>> >>
> >>> >>
> >>> >> No more, no less. Benchmarks will be helpful, but I don't think
> >>> >> they're the
> >>> >> most important thing in Spark :) On the Spark main page there is
> still
> >>> >> chart
> >>> >> "Spark vs Hadoop". It is important to show that framework is not the
> >>> >> same
> >>> >> Spark with other API, but much faster and optimized, comparable or
> >>> >> even
> >>> >> faster than other frameworks.
> >>> >>
> >>> >>
> >>> >> About real-time streaming, I think it would be just good to see it
> in
> >>> >> Spark.
> >>> >> I very like current Spark model, but many voices that says "we need
> >>> >> more" -
> >>> >> community should listen also them and try to help them. With SIPs it
> >>> >> would
> >>> >> be easier, I've just posted this example as "thing that may be
> changed
> >>> >> with
> >>> >> SIP".
> >>> >>
> >>> >>
> >>> >> I very like unification via Datasets, but there is a lot of
> algorithms
> >>> >> inside - let's make easy API, but with strong background (articles,
> >>> >> benchmarks, descriptions, etc) that shows that Spark is still modern
> >>> >> framework.
> >>> >>
> >>> >>
> >>> >> Maybe now my intention will be clearer :) As I said organizational
> >>> >> ideas
> >>> >> were already mentioned and I agree with them, my mail was just to
> show
> >>> >> some
> >>> >> aspects from my side, so from theside of developer and person who is
> >>> >> trying
> >>> >> to help others with Spark (via StackOverflow or other ways)
> >>> >>
> >>> >>
> >>> >> Pozdrawiam / Best regards,
> >>> >>
> >>> >> Tomasz
> >>> >>
> >>> >>
> >>> >> ________________________________
> >>> >> Od: Cody Koeninger <cody@koeninger.org <javascript:;>>
> >>> >> Wysłane: 17 października 2016 16:46
> >>> >> Do: Debasish Das
> >>> >> DW: Tomasz Gawęda; dev@spark.apache.org <javascript:;>
> >>> >> Temat: Re: Spark Improvement Proposals
> >>> >>
> >>> >> I think narrowly focusing on Flink or benchmarks is missing my
> point.
> >>> >>
> >>> >> My point is evolve or die.  Spark's governance and organization is
> >>> >> hampering its ability to evolve technologically, and it needs to
> >>> >> change.
> >>> >>
> >>> >> On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das
> >>> >> <debasish.das83@gmail.com <javascript:;>>
> >>> >> wrote:
> >>> >>> Thanks Cody for bringing up a valid point...I picked up Spark in
> 2014
> >>> >>> as
> >>> >>> soon as I looked into it since compared to writing Java map-reduce
> >>> >>> and
> >>> >>> Cascading code, Spark made writing distributed code fun...But now
> as
> >>> >>> we
> >>> >>> went
> >>> >>> deeper with Spark and real-time streaming use-case gets more
> >>> >>> prominent, I
> >>> >>> think it is time to bring a messaging model in conjunction with the
> >>> >>> batch/micro-batch API that Spark is good at....akka-streams close
> >>> >>> integration with spark micro-batching APIs looks like a great
> >>> >>> direction to
> >>> >>> stay in the game with Apache Flink...Spark 2.0 integrated streaming
> >>> >>> with
> >>> >>> batch with the assumption is that micro-batching is sufficient to
> run
> >>> >>> SQL
> >>> >>> commands on stream but do we really have time to do SQL processing
> at
> >>> >>> streaming data within 1-2 seconds ?
> >>> >>>
> >>> >>> After reading the email chain, I started to look into Flink
> >>> >>> documentation
> >>> >>> and if you compare it with Spark documentation, I think we have
> major
> >>> >>> work
> >>> >>> to do detailing out Spark internals so that more people from
> >>> >>> community
> >>> >>> start
> >>> >>> to take active role in improving the issues so that Spark stays
> >>> >>> strong
> >>> >>> compared to Flink.
> >>> >>>
> >>> >>> https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
> >>> >>>
> >>> >>> https://cwiki.apache.org/confluence/display/FLINK/Flink+Internals
> >>> >>>
> >>> >>> Spark is no longer an engine that works for micro-batch and
> >>> >>> batch...We
> >>> >>> (and
> >>> >>> I am sure many others) are pushing spark as an engine for stream
> and
> >>> >>> query
> >>> >>> processing.....we need to make it a state-of-the-art engine for
> high
> >>> >>> speed
> >>> >>> streaming data and user queries as well !
> >>> >>>
> >>> >>> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda
> >>> >>> <tomasz.gaweda@outlook.com <javascript:;>>
> >>> >>> wrote:
> >>> >>>>
> >>> >>>> Hi everyone,
> >>> >>>>
> >>> >>>> I'm quite late with my answer, but I think my suggestions may
> help a
> >>> >>>> little bit. :) Many technical and organizational topics were
> >>> >>>> mentioned,
> >>> >>>> but I want to focus on these negative posts about Spark and about
> >>> >>>> "haters"
> >>> >>>>
> >>> >>>> I really like Spark. Easy of use, speed, very good community -
> it's
> >>> >>>> everything here. But Every project has to "flight" on "framework
> >>> >>>> market"
> >>> >>>> to be still no 1. I'm following many Spark and Big Data
> communities,
> >>> >>>> maybe my mail will inspire someone :)
> >>> >>>>
> >>> >>>> You (every Spark developer; so far I didn't have enough time to
> join
> >>> >>>> contributing to Spark) has done excellent job. So why are some
> >>> >>>> people
> >>> >>>> saying that Flink (or other framework) is better, like it was
> posted
> >>> >>>> in
> >>> >>>> this mailing list? No, not because that framework is better in all
> >>> >>>> cases.. In my opinion, many of these discussions where started
> after
> >>> >>>> Flink marketing-like posts. Please look at StackOverflow "Flink vs
> >>> >>>> ...."
> >>> >>>> posts, almost every post in "winned" by Flink. Answers are
> sometimes
> >>> >>>> saying nothing about other frameworks, Flink's users (often PMC's)
> >>> >>>> are
> >>> >>>> just posting same information about real-time streaming, about
> delta
> >>> >>>> iterations, etc. It look smart and very often it is marked as an
> >>> >>>> aswer,
> >>> >>>> even if - in my opinion - there wasn't told all the truth.
> >>> >>>>
> >>> >>>>
> >>> >>>> My suggestion: I don't have enough money and knowledgle to perform
> >>> >>>> huge
> >>> >>>> performance test. Maybe some company, that supports Spark
> >>> >>>> (Databricks,
> >>> >>>> Cloudera? - just saying you're most visible in community :) )
> could
> >>> >>>> perform performance test of:
> >>> >>>>
> >>> >>>> - streaming engine - probably Spark will loose because of
> mini-batch
> >>> >>>> model, however currently the difference should be much lower that
> in
> >>> >>>> previous versions
> >>> >>>>
> >>> >>>> - Machine Learning models
> >>> >>>>
> >>> >>>> - batch jobs
> >>> >>>>
> >>> >>>> - Graph jobs
> >>> >>>>
> >>> >>>> - SQL queries
> >>> >>>>
> >>> >>>> People will see that Spark is envolving and is also a modern
> >>> >>>> framework,
> >>> >>>> because after reading posts mentioned above people may think "it
> is
> >>> >>>> outdated, future is in framework X".
> >>> >>>>
> >>> >>>> Matei Zaharia posted excellent blog post about how Spark
> Structured
> >>> >>>> Streaming beats every other framework in terms of easy-of-use and
> >>> >>>> reliability. Performance tests, done in various environments (in
> >>> >>>> example: laptop, small 2 node cluster, 10-node cluster, 20-node
> >>> >>>> cluster), could be also very good marketing stuff to say "hey,
> >>> >>>> you're
> >>> >>>> telling that you're better, but Spark is still faster and is still
> >>> >>>> getting even more fast!". This would be based on facts (just
> >>> >>>> numbers),
> >>> >>>> not opinions. It would be good for companies, for marketing
> puproses
> >>> >>>> and
> >>> >>>> for every Spark developer
> >>> >>>>
> >>> >>>>
> >>> >>>> Second: real-time streaming. I've written some time ago about
> >>> >>>> real-time
> >>> >>>> streaming support in Spark Structured Streaming. Some work should
> be
> >>> >>>> done to make SSS more low-latency, but I think it's possible.
> Maybe
> >>> >>>> Spark may look at Gearpump, which is also built on top of Akka? I
> >>> >>>> don't
> >>> >>>> know yet, it is good topic for SIP. However I think that Spark
> >>> >>>> should
> >>> >>>> have real-time streaming support. Currently I see many
> >>> >>>> posts/comments
> >>> >>>> that "Spark has too big latency". Spark Streaming is doing very
> good
> >>> >>>> jobs with micro-batches, however I think it is possible to add
> also
> >>> >>>> more
> >>> >>>> real-time processing.
> >>> >>>>
> >>> >>>> Other people said much more and I agree with proposal of SIP. I'm
> >>> >>>> also
> >>> >>>> happy that PMC's are not saying that they will not listen to
> users,
> >>> >>>> but
> >>> >>>> they really want to make Spark better for every user.
> >>> >>>>
> >>> >>>>
> >>> >>>> What do you think about these two topics? Especially I'm looking
> at
> >>> >>>> Cody
> >>> >>>> (who has started this topic) and PMCs :)
> >>> >>>>
> >>> >>>> Pozdrawiam / Best regards,
> >>> >>>>
> >>> >>>> Tomasz
> >>> >>>>
> >>> >>>>
> >>>
> >>
> >
> >
>

Re: Spark Improvement Proposals

Posted by Ryan Blue <rb...@netflix.com.INVALID>.

The current proposal seems process-heavy to me. That's not necessarily bad,
but there are a couple areas I haven't seen discussed.

Why is there a shepherd? If the person proposing a change has a good idea,
I don't see why one is either a good idea or necessary. The result of this
requirement is that each SPIP must attract the attention of a PMC member,
and that PMC member has then taken on extra responsibility. Why can't the
SPIP author simply call a vote when an idea has been sufficiently
discussed? I think *this* proposal would have moved faster if Cody had felt
empowered to bring it to a vote. More steps out of the author's control
will cause fewer ideas to move forward, regardless of quality, so we should
make sure this is balanced by a real benefit.

Why are only PMC members allowed a binding vote? I don't have a strong
inclination one way or another, but until recently this was an open
question. I'd like to hear the argument for restricting voting to PMC
members, or I think we should change it to allow all commiters. If this
decision is left to default, let's be more inclusive.

I would be fine with the proposal overall if there are good reasons behind
these choices.

rb

On Thu, Feb 16, 2017 at 8:22 AM, Reynold Xin <rx...@databricks.com> wrote:

> Updated. Any feedback from other community members?
>
>
> On Wed, Feb 15, 2017 at 2:53 AM, Cody Koeninger <co...@koeninger.org>
> wrote:
>
>> Thanks for doing that.
>>
>> Given that there are at least 4 different Apache voting processes,
>> "typical Apache vote process" isn't meaningful to me.
>>
>> I think the intention is that in order to pass, it needs at least 3 +1
>> votes from PMC members *and no -1 votes from PMC members*.  But the
>> document doesn't explicitly say that second part.
>>
>> There's also no mention of the duration a vote should remain open.
>> There's a mention of a month for finding a shepherd, but that's different.
>>
>> Other than that, LGTM.
>>
>> On Mon, Feb 13, 2017 at 9:02 AM, Reynold Xin <rx...@databricks.com> wrote:
>>
>>> Here's a new draft that incorporated most of the feedback:
>>> https://docs.google.com/document/d/1-Zdi_W-wtuxS9h
>>> TK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#
>>>
>>> I added a specific role for SPIP Author and another one for SPIP
>>> Shepherd.
>>>
>>> On Sat, Feb 11, 2017 at 6:13 PM, Xiao Li <ga...@gmail.com> wrote:
>>>
>>>> During the summit, I also had a lot of discussions over similar topics
>>>> with multiple Committers and active users. I heard many fantastic ideas. I
>>>> believe Spark improvement proposals are good channels to collect the
>>>> requirements/designs.
>>>>
>>>>
>>>> IMO, we also need to consider the priority when working on these items.
>>>> Even if the proposal is accepted, it does not mean it will be implemented
>>>> and merged immediately. It is not a FIFO queue.
>>>>
>>>>
>>>> Even if some PRs are merged, sometimes, we still have to revert them
>>>> back, if the design and implementation are not reviewed carefully. We have
>>>> to ensure our quality. Spark is not an application software. It is an
>>>> infrastructure software that is being used by many many companies. We have
>>>> to be very careful in the design and implementation, especially
>>>> adding/changing the external APIs.
>>>>
>>>>
>>>> When I developed the Mainframe infrastructure/middleware software in
>>>> the past 6 years, I were involved in the discussions with external/internal
>>>> customers. The to-do feature list was always above 100. Sometimes, the
>>>> customers are feeling frustrated when we are unable to deliver them on time
>>>> due to the resource limits and others. Even if they paid us billions, we
>>>> still need to do it phase by phase or sometimes they have to accept the
>>>> workarounds. That is the reality everyone has to face, I think.
>>>>
>>>>
>>>> Thanks,
>>>>
>>>>
>>>> Xiao Li
>>>>
>>>> 2017-02-11 7:57 GMT-08:00 Cody Koeninger <co...@koeninger.org>:
>>>>
>>>>> At the spark summit this week, everyone from PMC members to users I
>>>>> had never met before were asking me about the Spark improvement proposals
>>>>> idea.  It's clear that it's a real community need.
>>>>>
>>>>> But it's been almost half a year, and nothing visible has been done.
>>>>>
>>>>> Reynold, are you going to do this?
>>>>>
>>>>> If so, when?
>>>>>
>>>>> If not, why?
>>>>>
>>>>> You already did the right thing by including long-deserved
>>>>> committers.  Please keep doing the right thing for the community.
>>>>>
>>>>> On Wed, Jan 11, 2017 at 4:13 AM, Reynold Xin <rx...@databricks.com>
>>>>> wrote:
>>>>>
>>>>>> +1 on all counts (consensus, time bound, define roles)
>>>>>>
>>>>>> I can update the doc in the next few days and share back. Then maybe
>>>>>> we can just officially vote on this. As Tim suggested, we might not get it
>>>>>> 100% right the first time and would need to re-iterate. But that's fine.
>>>>>>
>>>>>>
>>>>>> On Thu, Jan 5, 2017 at 3:29 PM, Tim Hunter <ti...@databricks.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Cody,
>>>>>>> thank you for bringing up this topic, I agree it is very important
>>>>>>> to keep a cohesive community around some common, fluid goals. Here are a
>>>>>>> few comments about the current document:
>>>>>>>
>>>>>>> 1. name: it should not overlap with an existing one such as SIP. Can
>>>>>>> you imagine someone trying to discuss a scala spore proposal for spark?
>>>>>>> "[Spark] SIP-3 is intended to evolve in tandem with [Scala] SIP-21". SPIP
>>>>>>> sounds great.
>>>>>>>
>>>>>>> 2. roles: at a high level, SPIPs are meant to reach consensus for
>>>>>>> technical decisions with a lasting impact. As such, the template should
>>>>>>> emphasize the role of the various parties during this process:
>>>>>>>
>>>>>>>  - the SPIP author is responsible for building consensus. She is the
>>>>>>> champion driving the process forward and is responsible for ensuring that
>>>>>>> the SPIP follows the general guidelines. The author should be identified in
>>>>>>> the SPIP. The authorship of a SPIP can be transferred if the current author
>>>>>>> is not interested and someone else wants to move the SPIP forward. There
>>>>>>> should probably be 2-3 authors at most for each SPIP.
>>>>>>>
>>>>>>>  - someone with voting power should probably shepherd the SPIP (and
>>>>>>> be recorded as such): ensuring that the final decision over the SPIP is
>>>>>>> recorded (rejected, accepted, etc.), and advising about the technical
>>>>>>> quality of the SPIP: this person need not be a champion for the SPIP or
>>>>>>> contribute to it, but rather makes sure it stands a chance of being
>>>>>>> approved when the vote happens. Also, if the author cannot find anyone who
>>>>>>> would want to take this role, this proposal is likely to be rejected anyway.
>>>>>>>
>>>>>>>  - users, committers, contributors have the roles already outlined
>>>>>>> in the document
>>>>>>>
>>>>>>> 3. timeline: ideally, once a SPIP has been offered for voting, it
>>>>>>> should move swiftly into either being accepted or rejected, so that we do
>>>>>>> not end up with a distracting long tail of half-hearted proposals.
>>>>>>>
>>>>>>> These rules are meant to be flexible, but the current document
>>>>>>> should be clear about who is in charge of a SPIP, and the state it is
>>>>>>> currently in.
>>>>>>>
>>>>>>> We have had long discussions over some very important questions such
>>>>>>> as approval. I do not have an opinion on these, but why not make a pick and
>>>>>>> reevaluate this decision later? This is not a binding process at this point.
>>>>>>>
>>>>>>> Tim
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jan 3, 2017 at 3:16 PM, Cody Koeninger <co...@koeninger.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I don't have a concern about voting vs consensus.
>>>>>>>>
>>>>>>>> I have a concern that whatever the decision making process is, it
>>>>>>>> is explicitly announced on the ticket for the given proposal, with an
>>>>>>>> explicit deadline, and an explicit outcome.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Jan 3, 2017 at 4:08 PM, Imran Rashid <ir...@cloudera.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I'm also in favor of this.  Thanks for your persistence Cody.
>>>>>>>>>
>>>>>>>>> My take on the specific issues Joseph mentioned:
>>>>>>>>>
>>>>>>>>> 1) voting vs. consensus -- I agree with the argument Ryan Blue
>>>>>>>>> made earlier for consensus:
>>>>>>>>>
>>>>>>>>> > Majority vs consensus: My rationale is that I don't think we
>>>>>>>>> want to consider a proposal approved if it had objections serious enough
>>>>>>>>> that committers down-voted (or PMC depending on who gets a vote). If these
>>>>>>>>> proposals are like PEPs, then they represent a significant amount of
>>>>>>>>> community effort and I wouldn't want to move forward if up to half of the
>>>>>>>>> community thinks it's an untenable idea.
>>>>>>>>>
>>>>>>>>> 2) Design doc template -- agree this would be useful, but also
>>>>>>>>> seems totally orthogonal to moving forward on the SIP proposal.
>>>>>>>>>
>>>>>>>>> 3) agree w/ Joseph's proposal for updating the template.
>>>>>>>>>
>>>>>>>>> One small addition:
>>>>>>>>>
>>>>>>>>> 4) Deciding on a name -- minor, but I think its wroth
>>>>>>>>> disambiguating from Scala's SIPs, and the best proposal I've heard is
>>>>>>>>> "SPIP".   At least, no one has objected.  (don't care enough that I'd
>>>>>>>>> object to anything else, though.)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Jan 3, 2017 at 3:30 PM, Joseph Bradley <
>>>>>>>>> joseph@databricks.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Cody,
>>>>>>>>>>
>>>>>>>>>> Thanks for being persistent about this.  I too would like to see
>>>>>>>>>> this happen.  Reviewing the thread, it sounds like the main things
>>>>>>>>>> remaining are:
>>>>>>>>>> * Decide about a few issues
>>>>>>>>>> * Finalize the doc(s)
>>>>>>>>>> * Vote on this proposal
>>>>>>>>>>
>>>>>>>>>> Issues & TODOs:
>>>>>>>>>>
>>>>>>>>>> (1) The main issue I see above is voting vs. consensus.  I have
>>>>>>>>>> little preference here.  It sounds like something which could be tailored
>>>>>>>>>> based on whether we see too many or too few SIPs being approved.
>>>>>>>>>>
>>>>>>>>>> (2) Design doc template  (This would be great to have for Spark
>>>>>>>>>> regardless of this SIP discussion.)
>>>>>>>>>> * Reynold, are you still putting this together?
>>>>>>>>>>
>>>>>>>>>> (3) Template cleanups.  Listing some items mentioned above + a
>>>>>>>>>> new one w.r.t. Reynold's draft
>>>>>>>>>> <https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#>
>>>>>>>>>> :
>>>>>>>>>> * Reinstate the "Where" section with links to current and past
>>>>>>>>>> SIPs
>>>>>>>>>> * Add field for stating explicit deadlines for approval
>>>>>>>>>> * Add field for stating Author & Committer shepherd
>>>>>>>>>>
>>>>>>>>>> Thanks all!
>>>>>>>>>> Joseph
>>>>>>>>>>
>>>>>>>>>> On Mon, Jan 2, 2017 at 7:45 AM, Cody Koeninger <
>>>>>>>>>> cody@koeninger.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> I'm bumping this one more time for the new year, and then I'm
>>>>>>>>>>> giving up.
>>>>>>>>>>>
>>>>>>>>>>> Please, fix your process, even if it isn't exactly the way I
>>>>>>>>>>> suggested.
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Nov 8, 2016 at 11:14 AM, Ryan Blue <rb...@netflix.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>> > On lazy consensus as opposed to voting:
>>>>>>>>>>> >
>>>>>>>>>>> > First, why lazy consensus? The proposal was for consensus,
>>>>>>>>>>> which is at least
>>>>>>>>>>> > three +1 votes and no vetos. Consensus has no losing side, it
>>>>>>>>>>> requires
>>>>>>>>>>> > getting to a point where there is agreement. Isn't that
>>>>>>>>>>> agreement what we
>>>>>>>>>>> > want to achieve with these proposals?
>>>>>>>>>>> >
>>>>>>>>>>> > Second, lazy consensus only removes the requirement for three
>>>>>>>>>>> +1 votes. Why
>>>>>>>>>>> > would we not want at least three committers to think something
>>>>>>>>>>> is a good
>>>>>>>>>>> > idea before adopting the proposal?
>>>>>>>>>>> >
>>>>>>>>>>> > rb
>>>>>>>>>>> >
>>>>>>>>>>> > On Tue, Nov 8, 2016 at 8:13 AM, Cody Koeninger <
>>>>>>>>>>> cody@koeninger.org> wrote:
>>>>>>>>>>> >>
>>>>>>>>>>> >> So there are some minor things (the Where section heading
>>>>>>>>>>> appears to
>>>>>>>>>>> >> be dropped; wherever this document is posted it needs to
>>>>>>>>>>> actually link
>>>>>>>>>>> >> to a jira filter showing current / past SIPs) but it doesn't
>>>>>>>>>>> look like
>>>>>>>>>>> >> I can comment on the google doc.
>>>>>>>>>>> >>
>>>>>>>>>>> >> The major substantive issue that I have is that this version
>>>>>>>>>>> is
>>>>>>>>>>> >> significantly less clear as to the outcome of an SIP.
>>>>>>>>>>> >>
>>>>>>>>>>> >> The apache example of lazy consensus at
>>>>>>>>>>> >> http://apache.org/foundation/voting.html#LazyConsensus
>>>>>>>>>>> involves an
>>>>>>>>>>> >> explicit announcement of an explicit deadline, which I think
>>>>>>>>>>> are
>>>>>>>>>>> >> necessary for clarity.
>>>>>>>>>>> >>
>>>>>>>>>>> >>
>>>>>>>>>>> >>
>>>>>>>>>>> >> On Mon, Nov 7, 2016 at 1:55 PM, Reynold Xin <
>>>>>>>>>>> rxin@databricks.com> wrote:
>>>>>>>>>>> >> > It turned out suggested edits (trackable) don't show up for
>>>>>>>>>>> non-owners,
>>>>>>>>>>> >> > so
>>>>>>>>>>> >> > I've just merged all the edits in place. It should be
>>>>>>>>>>> visible now.
>>>>>>>>>>> >> >
>>>>>>>>>>> >> > On Mon, Nov 7, 2016 at 10:10 AM, Reynold Xin <
>>>>>>>>>>> rxin@databricks.com>
>>>>>>>>>>> >> > wrote:
>>>>>>>>>>> >> >>
>>>>>>>>>>> >> >> Oops. Let me try figure that out.
>>>>>>>>>>> >> >>
>>>>>>>>>>> >> >>
>>>>>>>>>>> >> >> On Monday, November 7, 2016, Cody Koeninger <
>>>>>>>>>>> cody@koeninger.org> wrote:
>>>>>>>>>>> >> >>>
>>>>>>>>>>> >> >>> Thanks for picking up on this.
>>>>>>>>>>> >> >>>
>>>>>>>>>>> >> >>> Maybe I fail at google docs, but I can't see any edits on
>>>>>>>>>>> the document
>>>>>>>>>>> >> >>> you linked.
>>>>>>>>>>> >> >>>
>>>>>>>>>>> >> >>> Regarding lazy consensus, if the board in general has
>>>>>>>>>>> less of an issue
>>>>>>>>>>> >> >>> with that, sure.  As long as it is clearly announced,
>>>>>>>>>>> lasts at least
>>>>>>>>>>> >> >>> 72 hours, and has a clear outcome.
>>>>>>>>>>> >> >>>
>>>>>>>>>>> >> >>> The other points are hard to comment on without being
>>>>>>>>>>> able to see the
>>>>>>>>>>> >> >>> text in question.
>>>>>>>>>>> >> >>>
>>>>>>>>>>> >> >>>
>>>>>>>>>>> >> >>> On Mon, Nov 7, 2016 at 3:11 AM, Reynold Xin <
>>>>>>>>>>> rxin@databricks.com>
>>>>>>>>>>> >> >>> wrote:
>>>>>>>>>>> >> >>> > I just looked through the entire thread again tonight -
>>>>>>>>>>> there are a
>>>>>>>>>>> >> >>> > lot
>>>>>>>>>>> >> >>> > of
>>>>>>>>>>> >> >>> > great ideas being discussed. Thanks Cody for taking the
>>>>>>>>>>> first crack
>>>>>>>>>>> >> >>> > at
>>>>>>>>>>> >> >>> > the
>>>>>>>>>>> >> >>> > proposal.
>>>>>>>>>>> >> >>> >
>>>>>>>>>>> >> >>> > I want to first comment on the context. Spark is one of
>>>>>>>>>>> the most
>>>>>>>>>>> >> >>> > innovative
>>>>>>>>>>> >> >>> > and important projects in (big) data -- overall
>>>>>>>>>>> technical decisions
>>>>>>>>>>> >> >>> > made in
>>>>>>>>>>> >> >>> > Apache Spark are sound. But of course, a project as
>>>>>>>>>>> large and active
>>>>>>>>>>> >> >>> > as
>>>>>>>>>>> >> >>> > Spark always have room for improvement, and we as a
>>>>>>>>>>> community should
>>>>>>>>>>> >> >>> > strive
>>>>>>>>>>> >> >>> > to take it to the next level.
>>>>>>>>>>> >> >>> >
>>>>>>>>>>> >> >>> > To that end, the two biggest areas for improvements in
>>>>>>>>>>> my opinion
>>>>>>>>>>> >> >>> > are:
>>>>>>>>>>> >> >>> >
>>>>>>>>>>> >> >>> > 1. Visibility: There are so much happening that it is
>>>>>>>>>>> difficult to
>>>>>>>>>>> >> >>> > know
>>>>>>>>>>> >> >>> > what
>>>>>>>>>>> >> >>> > really is going on. For people that don't follow
>>>>>>>>>>> closely, it is
>>>>>>>>>>> >> >>> > difficult to
>>>>>>>>>>> >> >>> > know what the important initiatives are. Even for
>>>>>>>>>>> people that do
>>>>>>>>>>> >> >>> > follow, it
>>>>>>>>>>> >> >>> > is difficult to know what specific things require their
>>>>>>>>>>> attention,
>>>>>>>>>>> >> >>> > since the
>>>>>>>>>>> >> >>> > number of pull requests and JIRA tickets are high and
>>>>>>>>>>> it's difficult
>>>>>>>>>>> >> >>> > to
>>>>>>>>>>> >> >>> > extract signal from noise.
>>>>>>>>>>> >> >>> >
>>>>>>>>>>> >> >>> > 2. Solicit user (broadly defined, including developers
>>>>>>>>>>> themselves)
>>>>>>>>>>> >> >>> > input
>>>>>>>>>>> >> >>> > more proactively: At the end of the day the project
>>>>>>>>>>> provides value
>>>>>>>>>>> >> >>> > because
>>>>>>>>>>> >> >>> > users use it. Users can't tell us exactly what to
>>>>>>>>>>> build, but it is
>>>>>>>>>>> >> >>> > important
>>>>>>>>>>> >> >>> > to get their inputs.
>>>>>>>>>>> >> >>> >
>>>>>>>>>>> >> >>> >
>>>>>>>>>>> >> >>> > I've taken Cody's doc and edited it:
>>>>>>>>>>> >> >>> >
>>>>>>>>>>> >> >>> >
>>>>>>>>>>> >> >>> > https://docs.google.com/docume
>>>>>>>>>>> nt/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#headi
>>>>>>>>>>> ng=h.36ut37zh7w2b
>>>>>>>>>>> >> >>> > (I've made all my modifications trackable)
>>>>>>>>>>> >> >>> >
>>>>>>>>>>> >> >>> > There are couple high level changes I made:
>>>>>>>>>>> >> >>> >
>>>>>>>>>>> >> >>> > 1. I've consulted a board member and he recommended
>>>>>>>>>>> lazy consensus
>>>>>>>>>>> >> >>> > as
>>>>>>>>>>> >> >>> > opposed to voting. The reason being in voting there can
>>>>>>>>>>> easily be a
>>>>>>>>>>> >> >>> > "loser'
>>>>>>>>>>> >> >>> > that gets outvoted.
>>>>>>>>>>> >> >>> >
>>>>>>>>>>> >> >>> > 2. I made it lighter weight, and renamed "strategy" to
>>>>>>>>>>> "optional
>>>>>>>>>>> >> >>> > design
>>>>>>>>>>> >> >>> > sketch". Echoing one of the earlier email: "IMHO so far
>>>>>>>>>>> aside from
>>>>>>>>>>> >> >>> > tagging
>>>>>>>>>>> >> >>> > things and linking them elsewhere simply having design
>>>>>>>>>>> docs and
>>>>>>>>>>> >> >>> > prototypes
>>>>>>>>>>> >> >>> > implementations in PRs is not something that has not
>>>>>>>>>>> worked so far".
>>>>>>>>>>> >> >>> >
>>>>>>>>>>> >> >>> > 3. I made some the language tweaks to focus more on
>>>>>>>>>>> visibility. For
>>>>>>>>>>> >> >>> > example,
>>>>>>>>>>> >> >>> > "The purpose of an SIP is to inform and involve",
>>>>>>>>>>> rather than just
>>>>>>>>>>> >> >>> > "involve". SIPs should also have at least two emails
>>>>>>>>>>> that go to
>>>>>>>>>>> >> >>> > dev@.
>>>>>>>>>>> >> >>> >
>>>>>>>>>>> >> >>> >
>>>>>>>>>>> >> >>> > While I was editing this, I thought we really needed a
>>>>>>>>>>> suggested
>>>>>>>>>>> >> >>> > template
>>>>>>>>>>> >> >>> > for design doc too. I will get to that too ...
>>>>>>>>>>> >> >>> >
>>>>>>>>>>> >> >>> >
>>>>>>>>>>> >> >>> > On Tue, Nov 1, 2016 at 12:09 AM, Reynold Xin <
>>>>>>>>>>> rxin@databricks.com>
>>>>>>>>>>> >> >>> > wrote:
>>>>>>>>>>> >> >>> >>
>>>>>>>>>>> >> >>> >> Most things looked OK to me too, although I do plan to
>>>>>>>>>>> take a
>>>>>>>>>>> >> >>> >> closer
>>>>>>>>>>> >> >>> >> look
>>>>>>>>>>> >> >>> >> after Nov 1st when we cut the release branch for 2.1.
>>>>>>>>>>> >> >>> >>
>>>>>>>>>>> >> >>> >>
>>>>>>>>>>> >> >>> >> On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin
>>>>>>>>>>> >> >>> >> <va...@cloudera.com>
>>>>>>>>>>> >> >>> >> wrote:
>>>>>>>>>>> >> >>> >>>
>>>>>>>>>>> >> >>> >>> The proposal looks OK to me. I assume, even though
>>>>>>>>>>> it's not
>>>>>>>>>>> >> >>> >>> explicitly
>>>>>>>>>>> >> >>> >>> called, that voting would happen by e-mail? A
>>>>>>>>>>> template for the
>>>>>>>>>>> >> >>> >>> proposal document (instead of just a bullet nice)
>>>>>>>>>>> would also be
>>>>>>>>>>> >> >>> >>> nice,
>>>>>>>>>>> >> >>> >>> but that can be done at any time.
>>>>>>>>>>> >> >>> >>>
>>>>>>>>>>> >> >>> >>> BTW, shameless plug: I filed SPARK-18085 which I
>>>>>>>>>>> consider a
>>>>>>>>>>> >> >>> >>> candidate
>>>>>>>>>>> >> >>> >>> for a SIP, given the scope of the work. The document
>>>>>>>>>>> attached even
>>>>>>>>>>> >> >>> >>> somewhat matches the proposed format. So if anyone
>>>>>>>>>>> wants to try
>>>>>>>>>>> >> >>> >>> out
>>>>>>>>>>> >> >>> >>> the process...
>>>>>>>>>>> >> >>> >>>
>>>>>>>>>>> >> >>> >>> On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger
>>>>>>>>>>> >> >>> >>> <co...@koeninger.org>
>>>>>>>>>>> >> >>> >>> wrote:
>>>>>>>>>>> >> >>> >>> > Now that spark summit europe is over, are any
>>>>>>>>>>> committers
>>>>>>>>>>> >> >>> >>> > interested
>>>>>>>>>>> >> >>> >>> > in
>>>>>>>>>>> >> >>> >>> > moving forward with this?
>>>>>>>>>>> >> >>> >>> >
>>>>>>>>>>> >> >>> >>> >
>>>>>>>>>>> >> >>> >>> >
>>>>>>>>>>> >> >>> >>> >
>>>>>>>>>>> >> >>> >>> > https://github.com/koeninger/s
>>>>>>>>>>> park-1/blob/SIP-0/docs/spark-improvement-proposals.md
>>>>>>>>>>> >> >>> >>> >
>>>>>>>>>>> >> >>> >>> > Or are we going to let this discussion die on the
>>>>>>>>>>> vine?
>>>>>>>>>>> >> >>> >>> >
>>>>>>>>>>> >> >>> >>> > On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda
>>>>>>>>>>> >> >>> >>> > <to...@outlook.com> wrote:
>>>>>>>>>>> >> >>> >>> >> Maybe my mail was not clear enough.
>>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>>> >> >>> >>> >> I didn't want to write "lets focus on Flink" or
>>>>>>>>>>> any other
>>>>>>>>>>> >> >>> >>> >> framework.
>>>>>>>>>>> >> >>> >>> >> The
>>>>>>>>>>> >> >>> >>> >> idea with benchmarks was to show two things:
>>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>>> >> >>> >>> >> - why some people are doing bad PR for Spark
>>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>>> >> >>> >>> >> - how - in easy way - we can change it and show
>>>>>>>>>>> that Spark is
>>>>>>>>>>> >> >>> >>> >> still on
>>>>>>>>>>> >> >>> >>> >> the
>>>>>>>>>>> >> >>> >>> >> top
>>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>>> >> >>> >>> >> No more, no less. Benchmarks will be helpful, but
>>>>>>>>>>> I don't think
>>>>>>>>>>> >> >>> >>> >> they're the
>>>>>>>>>>> >> >>> >>> >> most important thing in Spark :) On the Spark main
>>>>>>>>>>> page there
>>>>>>>>>>> >> >>> >>> >> is
>>>>>>>>>>> >> >>> >>> >> still
>>>>>>>>>>> >> >>> >>> >> chart
>>>>>>>>>>> >> >>> >>> >> "Spark vs Hadoop". It is important to show that
>>>>>>>>>>> framework is
>>>>>>>>>>> >> >>> >>> >> not
>>>>>>>>>>> >> >>> >>> >> the
>>>>>>>>>>> >> >>> >>> >> same
>>>>>>>>>>> >> >>> >>> >> Spark with other API, but much faster and
>>>>>>>>>>> optimized, comparable
>>>>>>>>>>> >> >>> >>> >> or
>>>>>>>>>>> >> >>> >>> >> even
>>>>>>>>>>> >> >>> >>> >> faster than other frameworks.
>>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>>> >> >>> >>> >> About real-time streaming, I think it would be
>>>>>>>>>>> just good to see
>>>>>>>>>>> >> >>> >>> >> it
>>>>>>>>>>> >> >>> >>> >> in
>>>>>>>>>>> >> >>> >>> >> Spark.
>>>>>>>>>>> >> >>> >>> >> I very like current Spark model, but many voices
>>>>>>>>>>> that says "we
>>>>>>>>>>> >> >>> >>> >> need
>>>>>>>>>>> >> >>> >>> >> more" -
>>>>>>>>>>> >> >>> >>> >> community should listen also them and try to help
>>>>>>>>>>> them. With
>>>>>>>>>>> >> >>> >>> >> SIPs
>>>>>>>>>>> >> >>> >>> >> it
>>>>>>>>>>> >> >>> >>> >> would
>>>>>>>>>>> >> >>> >>> >> be easier, I've just posted this example as "thing
>>>>>>>>>>> that may be
>>>>>>>>>>> >> >>> >>> >> changed
>>>>>>>>>>> >> >>> >>> >> with
>>>>>>>>>>> >> >>> >>> >> SIP".
>>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>>> >> >>> >>> >> I very like unification via Datasets, but there is
>>>>>>>>>>> a lot of
>>>>>>>>>>> >> >>> >>> >> algorithms
>>>>>>>>>>> >> >>> >>> >> inside - let's make easy API, but with strong
>>>>>>>>>>> background
>>>>>>>>>>> >> >>> >>> >> (articles,
>>>>>>>>>>> >> >>> >>> >> benchmarks, descriptions, etc) that shows that
>>>>>>>>>>> Spark is still
>>>>>>>>>>> >> >>> >>> >> modern
>>>>>>>>>>> >> >>> >>> >> framework.
>>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>>> >> >>> >>> >> Maybe now my intention will be clearer :) As I said
>>>>>>>>>>> >> >>> >>> >> organizational
>>>>>>>>>>> >> >>> >>> >> ideas
>>>>>>>>>>> >> >>> >>> >> were already mentioned and I agree with them, my
>>>>>>>>>>> mail was just
>>>>>>>>>>> >> >>> >>> >> to
>>>>>>>>>>> >> >>> >>> >> show
>>>>>>>>>>> >> >>> >>> >> some
>>>>>>>>>>> >> >>> >>> >> aspects from my side, so from theside of developer
>>>>>>>>>>> and person
>>>>>>>>>>> >> >>> >>> >> who
>>>>>>>>>>> >> >>> >>> >> is
>>>>>>>>>>> >> >>> >>> >> trying
>>>>>>>>>>> >> >>> >>> >> to help others with Spark (via StackOverflow or
>>>>>>>>>>> other ways)
>>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>>> >> >>> >>> >> Pozdrawiam / Best regards,
>>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>>> >> >>> >>> >> Tomasz
>>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>>> >> >>> >>> >> ________________________________
>>>>>>>>>>> >> >>> >>> >> Od: Cody Koeninger <co...@koeninger.org>
>>>>>>>>>>> >> >>> >>> >> Wysłane: 17 października 2016 16:46
>>>>>>>>>>> >> >>> >>> >> Do: Debasish Das
>>>>>>>>>>> >> >>> >>> >> DW: Tomasz Gawęda; dev@spark.apache.org
>>>>>>>>>>> >> >>> >>> >> Temat: Re: Spark Improvement Proposals
>>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>>> >> >>> >>> >> I think narrowly focusing on Flink or benchmarks
>>>>>>>>>>> is missing my
>>>>>>>>>>> >> >>> >>> >> point.
>>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>>> >> >>> >>> >> My point is evolve or die.  Spark's governance and
>>>>>>>>>>> organization
>>>>>>>>>>> >> >>> >>> >> is
>>>>>>>>>>> >> >>> >>> >> hampering its ability to evolve technologically,
>>>>>>>>>>> and it needs
>>>>>>>>>>> >> >>> >>> >> to
>>>>>>>>>>> >> >>> >>> >> change.
>>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>>> >> >>> >>> >> On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das
>>>>>>>>>>> >> >>> >>> >> <de...@gmail.com>
>>>>>>>>>>> >> >>> >>> >> wrote:
>>>>>>>>>>> >> >>> >>> >>> Thanks Cody for bringing up a valid point...I
>>>>>>>>>>> picked up Spark
>>>>>>>>>>> >> >>> >>> >>> in
>>>>>>>>>>> >> >>> >>> >>> 2014
>>>>>>>>>>> >> >>> >>> >>> as
>>>>>>>>>>> >> >>> >>> >>> soon as I looked into it since compared to
>>>>>>>>>>> writing Java
>>>>>>>>>>> >> >>> >>> >>> map-reduce
>>>>>>>>>>> >> >>> >>> >>> and
>>>>>>>>>>> >> >>> >>> >>> Cascading code, Spark made writing distributed
>>>>>>>>>>> code fun...But
>>>>>>>>>>> >> >>> >>> >>> now
>>>>>>>>>>> >> >>> >>> >>> as
>>>>>>>>>>> >> >>> >>> >>> we
>>>>>>>>>>> >> >>> >>> >>> went
>>>>>>>>>>> >> >>> >>> >>> deeper with Spark and real-time streaming
>>>>>>>>>>> use-case gets more
>>>>>>>>>>> >> >>> >>> >>> prominent, I
>>>>>>>>>>> >> >>> >>> >>> think it is time to bring a messaging model in
>>>>>>>>>>> conjunction
>>>>>>>>>>> >> >>> >>> >>> with
>>>>>>>>>>> >> >>> >>> >>> the
>>>>>>>>>>> >> >>> >>> >>> batch/micro-batch API that Spark is good
>>>>>>>>>>> at....akka-streams
>>>>>>>>>>> >> >>> >>> >>> close
>>>>>>>>>>> >> >>> >>> >>> integration with spark micro-batching APIs looks
>>>>>>>>>>> like a great
>>>>>>>>>>> >> >>> >>> >>> direction to
>>>>>>>>>>> >> >>> >>> >>> stay in the game with Apache Flink...Spark 2.0
>>>>>>>>>>> integrated
>>>>>>>>>>> >> >>> >>> >>> streaming
>>>>>>>>>>> >> >>> >>> >>> with
>>>>>>>>>>> >> >>> >>> >>> batch with the assumption is that micro-batching
>>>>>>>>>>> is sufficient
>>>>>>>>>>> >> >>> >>> >>> to
>>>>>>>>>>> >> >>> >>> >>> run
>>>>>>>>>>> >> >>> >>> >>> SQL
>>>>>>>>>>> >> >>> >>> >>> commands on stream but do we really have time to
>>>>>>>>>>> do SQL
>>>>>>>>>>> >> >>> >>> >>> processing at
>>>>>>>>>>> >> >>> >>> >>> streaming data within 1-2 seconds ?
>>>>>>>>>>> >> >>> >>> >>>
>>>>>>>>>>> >> >>> >>> >>> After reading the email chain, I started to look
>>>>>>>>>>> into Flink
>>>>>>>>>>> >> >>> >>> >>> documentation
>>>>>>>>>>> >> >>> >>> >>> and if you compare it with Spark documentation, I
>>>>>>>>>>> think we
>>>>>>>>>>> >> >>> >>> >>> have
>>>>>>>>>>> >> >>> >>> >>> major
>>>>>>>>>>> >> >>> >>> >>> work
>>>>>>>>>>> >> >>> >>> >>> to do detailing out Spark internals so that more
>>>>>>>>>>> people from
>>>>>>>>>>> >> >>> >>> >>> community
>>>>>>>>>>> >> >>> >>> >>> start
>>>>>>>>>>> >> >>> >>> >>> to take active role in improving the issues so
>>>>>>>>>>> that Spark
>>>>>>>>>>> >> >>> >>> >>> stays
>>>>>>>>>>> >> >>> >>> >>> strong
>>>>>>>>>>> >> >>> >>> >>> compared to Flink.
>>>>>>>>>>> >> >>> >>> >>>
>>>>>>>>>>> >> >>> >>> >>>
>>>>>>>>>>> >> >>> >>> >>> https://cwiki.apache.org/confl
>>>>>>>>>>> uence/display/SPARK/Spark+Internals
>>>>>>>>>>> >> >>> >>> >>>
>>>>>>>>>>> >> >>> >>> >>>
>>>>>>>>>>> >> >>> >>> >>> https://cwiki.apache.org/confl
>>>>>>>>>>> uence/display/FLINK/Flink+Internals
>>>>>>>>>>> >> >>> >>> >>>
>>>>>>>>>>> >> >>> >>> >>> Spark is no longer an engine that works for
>>>>>>>>>>> micro-batch and
>>>>>>>>>>> >> >>> >>> >>> batch...We
>>>>>>>>>>> >> >>> >>> >>> (and
>>>>>>>>>>> >> >>> >>> >>> I am sure many others) are pushing spark as an
>>>>>>>>>>> engine for
>>>>>>>>>>> >> >>> >>> >>> stream
>>>>>>>>>>> >> >>> >>> >>> and
>>>>>>>>>>> >> >>> >>> >>> query
>>>>>>>>>>> >> >>> >>> >>> processing.....we need to make it a
>>>>>>>>>>> state-of-the-art engine
>>>>>>>>>>> >> >>> >>> >>> for
>>>>>>>>>>> >> >>> >>> >>> high
>>>>>>>>>>> >> >>> >>> >>> speed
>>>>>>>>>>> >> >>> >>> >>> streaming data and user queries as well !
>>>>>>>>>>> >> >>> >>> >>>
>>>>>>>>>>> >> >>> >>> >>> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda
>>>>>>>>>>> >> >>> >>> >>> <to...@outlook.com>
>>>>>>>>>>> >> >>> >>> >>> wrote:
>>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>>> >> >>> >>> >>>> Hi everyone,
>>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>>> >> >>> >>> >>>> I'm quite late with my answer, but I think my
>>>>>>>>>>> suggestions may
>>>>>>>>>>> >> >>> >>> >>>> help a
>>>>>>>>>>> >> >>> >>> >>>> little bit. :) Many technical and organizational
>>>>>>>>>>> topics were
>>>>>>>>>>> >> >>> >>> >>>> mentioned,
>>>>>>>>>>> >> >>> >>> >>>> but I want to focus on these negative posts
>>>>>>>>>>> about Spark and
>>>>>>>>>>> >> >>> >>> >>>> about
>>>>>>>>>>> >> >>> >>> >>>> "haters"
>>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>>> >> >>> >>> >>>> I really like Spark. Easy of use, speed, very
>>>>>>>>>>> good community
>>>>>>>>>>> >> >>> >>> >>>> -
>>>>>>>>>>> >> >>> >>> >>>> it's
>>>>>>>>>>> >> >>> >>> >>>> everything here. But Every project has to
>>>>>>>>>>> "flight" on
>>>>>>>>>>> >> >>> >>> >>>> "framework
>>>>>>>>>>> >> >>> >>> >>>> market"
>>>>>>>>>>> >> >>> >>> >>>> to be still no 1. I'm following many Spark and
>>>>>>>>>>> Big Data
>>>>>>>>>>> >> >>> >>> >>>> communities,
>>>>>>>>>>> >> >>> >>> >>>> maybe my mail will inspire someone :)
>>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>>> >> >>> >>> >>>> You (every Spark developer; so far I didn't have
>>>>>>>>>>> enough time
>>>>>>>>>>> >> >>> >>> >>>> to
>>>>>>>>>>> >> >>> >>> >>>> join
>>>>>>>>>>> >> >>> >>> >>>> contributing to Spark) has done excellent job.
>>>>>>>>>>> So why are
>>>>>>>>>>> >> >>> >>> >>>> some
>>>>>>>>>>> >> >>> >>> >>>> people
>>>>>>>>>>> >> >>> >>> >>>> saying that Flink (or other framework) is
>>>>>>>>>>> better, like it was
>>>>>>>>>>> >> >>> >>> >>>> posted
>>>>>>>>>>> >> >>> >>> >>>> in
>>>>>>>>>>> >> >>> >>> >>>> this mailing list? No, not because that
>>>>>>>>>>> framework is better
>>>>>>>>>>> >> >>> >>> >>>> in
>>>>>>>>>>> >> >>> >>> >>>> all
>>>>>>>>>>> >> >>> >>> >>>> cases.. In my opinion, many of these discussions
>>>>>>>>>>> where
>>>>>>>>>>> >> >>> >>> >>>> started
>>>>>>>>>>> >> >>> >>> >>>> after
>>>>>>>>>>> >> >>> >>> >>>> Flink marketing-like posts. Please look at
>>>>>>>>>>> StackOverflow
>>>>>>>>>>> >> >>> >>> >>>> "Flink
>>>>>>>>>>> >> >>> >>> >>>> vs
>>>>>>>>>>> >> >>> >>> >>>> ...."
>>>>>>>>>>> >> >>> >>> >>>> posts, almost every post in "winned" by Flink.
>>>>>>>>>>> Answers are
>>>>>>>>>>> >> >>> >>> >>>> sometimes
>>>>>>>>>>> >> >>> >>> >>>> saying nothing about other frameworks, Flink's
>>>>>>>>>>> users (often
>>>>>>>>>>> >> >>> >>> >>>> PMC's)
>>>>>>>>>>> >> >>> >>> >>>> are
>>>>>>>>>>> >> >>> >>> >>>> just posting same information about real-time
>>>>>>>>>>> streaming,
>>>>>>>>>>> >> >>> >>> >>>> about
>>>>>>>>>>> >> >>> >>> >>>> delta
>>>>>>>>>>> >> >>> >>> >>>> iterations, etc. It look smart and very often it
>>>>>>>>>>> is marked as
>>>>>>>>>>> >> >>> >>> >>>> an
>>>>>>>>>>> >> >>> >>> >>>> aswer,
>>>>>>>>>>> >> >>> >>> >>>> even if - in my opinion - there wasn't told all
>>>>>>>>>>> the truth.
>>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>>> >> >>> >>> >>>> My suggestion: I don't have enough money and
>>>>>>>>>>> knowledgle to
>>>>>>>>>>> >> >>> >>> >>>> perform
>>>>>>>>>>> >> >>> >>> >>>> huge
>>>>>>>>>>> >> >>> >>> >>>> performance test. Maybe some company, that
>>>>>>>>>>> supports Spark
>>>>>>>>>>> >> >>> >>> >>>> (Databricks,
>>>>>>>>>>> >> >>> >>> >>>> Cloudera? - just saying you're most visible in
>>>>>>>>>>> community :) )
>>>>>>>>>>> >> >>> >>> >>>> could
>>>>>>>>>>> >> >>> >>> >>>> perform performance test of:
>>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>>> >> >>> >>> >>>> - streaming engine - probably Spark will loose
>>>>>>>>>>> because of
>>>>>>>>>>> >> >>> >>> >>>> mini-batch
>>>>>>>>>>> >> >>> >>> >>>> model, however currently the difference should
>>>>>>>>>>> be much lower
>>>>>>>>>>> >> >>> >>> >>>> that in
>>>>>>>>>>> >> >>> >>> >>>> previous versions
>>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>>> >> >>> >>> >>>> - Machine Learning models
>>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>>> >> >>> >>> >>>> - batch jobs
>>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>>> >> >>> >>> >>>> - Graph jobs
>>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>>> >> >>> >>> >>>> - SQL queries
>>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>>> >> >>> >>> >>>> People will see that Spark is envolving and is
>>>>>>>>>>> also a modern
>>>>>>>>>>> >> >>> >>> >>>> framework,
>>>>>>>>>>> >> >>> >>> >>>> because after reading posts mentioned above
>>>>>>>>>>> people may think
>>>>>>>>>>> >> >>> >>> >>>> "it
>>>>>>>>>>> >> >>> >>> >>>> is
>>>>>>>>>>> >> >>> >>> >>>> outdated, future is in framework X".
>>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>>> >> >>> >>> >>>> Matei Zaharia posted excellent blog post about
>>>>>>>>>>> how Spark
>>>>>>>>>>> >> >>> >>> >>>> Structured
>>>>>>>>>>> >> >>> >>> >>>> Streaming beats every other framework in terms
>>>>>>>>>>> of easy-of-use
>>>>>>>>>>> >> >>> >>> >>>> and
>>>>>>>>>>> >> >>> >>> >>>> reliability. Performance tests, done in various
>>>>>>>>>>> environments
>>>>>>>>>>> >> >>> >>> >>>> (in
>>>>>>>>>>> >> >>> >>> >>>> example: laptop, small 2 node cluster, 10-node
>>>>>>>>>>> cluster,
>>>>>>>>>>> >> >>> >>> >>>> 20-node
>>>>>>>>>>> >> >>> >>> >>>> cluster), could be also very good marketing
>>>>>>>>>>> stuff to say
>>>>>>>>>>> >> >>> >>> >>>> "hey,
>>>>>>>>>>> >> >>> >>> >>>> you're
>>>>>>>>>>> >> >>> >>> >>>> telling that you're better, but Spark is still
>>>>>>>>>>> faster and is
>>>>>>>>>>> >> >>> >>> >>>> still
>>>>>>>>>>> >> >>> >>> >>>> getting even more fast!". This would be based on
>>>>>>>>>>> facts (just
>>>>>>>>>>> >> >>> >>> >>>> numbers),
>>>>>>>>>>> >> >>> >>> >>>> not opinions. It would be good for companies,
>>>>>>>>>>> for marketing
>>>>>>>>>>> >> >>> >>> >>>> puproses
>>>>>>>>>>> >> >>> >>> >>>> and
>>>>>>>>>>> >> >>> >>> >>>> for every Spark developer
>>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>>> >> >>> >>> >>>> Second: real-time streaming. I've written some
>>>>>>>>>>> time ago about
>>>>>>>>>>> >> >>> >>> >>>> real-time
>>>>>>>>>>> >> >>> >>> >>>> streaming support in Spark Structured Streaming.
>>>>>>>>>>> Some work
>>>>>>>>>>> >> >>> >>> >>>> should be
>>>>>>>>>>> >> >>> >>> >>>> done to make SSS more low-latency, but I think
>>>>>>>>>>> it's possible.
>>>>>>>>>>> >> >>> >>> >>>> Maybe
>>>>>>>>>>> >> >>> >>> >>>> Spark may look at Gearpump, which is also built
>>>>>>>>>>> on top of
>>>>>>>>>>> >> >>> >>> >>>> Akka?
>>>>>>>>>>> >> >>> >>> >>>> I
>>>>>>>>>>> >> >>> >>> >>>> don't
>>>>>>>>>>> >> >>> >>> >>>> know yet, it is good topic for SIP. However I
>>>>>>>>>>> think that
>>>>>>>>>>> >> >>> >>> >>>> Spark
>>>>>>>>>>> >> >>> >>> >>>> should
>>>>>>>>>>> >> >>> >>> >>>> have real-time streaming support. Currently I
>>>>>>>>>>> see many
>>>>>>>>>>> >> >>> >>> >>>> posts/comments
>>>>>>>>>>> >> >>> >>> >>>> that "Spark has too big latency". Spark
>>>>>>>>>>> Streaming is doing
>>>>>>>>>>> >> >>> >>> >>>> very
>>>>>>>>>>> >> >>> >>> >>>> good
>>>>>>>>>>> >> >>> >>> >>>> jobs with micro-batches, however I think it is
>>>>>>>>>>> possible to
>>>>>>>>>>> >> >>> >>> >>>> add
>>>>>>>>>>> >> >>> >>> >>>> also
>>>>>>>>>>> >> >>> >>> >>>> more
>>>>>>>>>>> >> >>> >>> >>>> real-time processing.
>>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>>> >> >>> >>> >>>> Other people said much more and I agree with
>>>>>>>>>>> proposal of SIP.
>>>>>>>>>>> >> >>> >>> >>>> I'm
>>>>>>>>>>> >> >>> >>> >>>> also
>>>>>>>>>>> >> >>> >>> >>>> happy that PMC's are not saying that they will
>>>>>>>>>>> not listen to
>>>>>>>>>>> >> >>> >>> >>>> users,
>>>>>>>>>>> >> >>> >>> >>>> but
>>>>>>>>>>> >> >>> >>> >>>> they really want to make Spark better for every
>>>>>>>>>>> user.
>>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>>> >> >>> >>> >>>> What do you think about these two topics?
>>>>>>>>>>> Especially I'm
>>>>>>>>>>> >> >>> >>> >>>> looking
>>>>>>>>>>> >> >>> >>> >>>> at
>>>>>>>>>>> >> >>> >>> >>>> Cody
>>>>>>>>>>> >> >>> >>> >>>> (who has started this topic) and PMCs :)
>>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>>> >> >>> >>> >>>> Pozdrawiam / Best regards,
>>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>>> >> >>> >>> >>>> Tomasz
>>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>>> >> >>> >>>
>>>>>>>>>>> >> >>> >>
>>>>>>>>>>> >> >>> >
>>>>>>>>>>> >> >>> >
>>>>>>>>>>> >> >
>>>>>>>>>>> >> >
>>>>>>>>>>> >>
>>>>>>>>>>> >> ------------------------------------------------------------
>>>>>>>>>>> ---------
>>>>>>>>>>> >> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>>>>> >>
>>>>>>>>>>> >
>>>>>>>>>>> >
>>>>>>>>>>> >
>>>>>>>>>>> > --
>>>>>>>>>>> > Ryan Blue
>>>>>>>>>>> > Software Engineer
>>>>>>>>>>> > Netflix
>>>>>>>>>>>
>>>>>>>>>>> ------------------------------------------------------------
>>>>>>>>>>> ---------
>>>>>>>>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>>
>>>>>>>>>> Joseph Bradley
>>>>>>>>>>
>>>>>>>>>> Software Engineer - Machine Learning
>>>>>>>>>>
>>>>>>>>>> Databricks, Inc.
>>>>>>>>>>
>>>>>>>>>> [image: http://databricks.com] <http://databricks.com/>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


-- 
Ryan Blue
Software Engineer
Netflix

Re: Spark Improvement Proposals

Posted by vaquar khan <va...@gmail.com>.

Many of us have issue with "shepherd role " , i think we should go with
vote.

Regards,
Vaquar khan

On Thu, Mar 9, 2017 at 11:00 AM, Reynold Xin <rx...@databricks.com> wrote:

> I'm fine without a vote. (are we voting on wether we need a vote?)
>
>
> On Thu, Mar 9, 2017 at 8:55 AM, Sean Owen <so...@cloudera.com> wrote:
>
>> I think a VOTE is over-thinking it, and is rarely used, but, can't hurt.
>> Nah, anyone can call a vote. This really isn't that formal. We just want to
>> declare and document consensus.
>>
>> I think SPIP is just a remix of existing process anyway, and don't think
>> it will actually do much anyway, which is why I am sanguine about the whole
>> thing.
>>
>> To bring this to a conclusion, I will just put the contents of the doc in
>> an email tomorrow for a VOTE. Raise any objections now.
>>
>> On Thu, Mar 9, 2017 at 3:39 PM Cody Koeninger <co...@koeninger.org> wrote:
>>
>>> I started this idea as a fork with a merge-able change to docs.
>>> Reynold moved it to his google doc, and has suggested during this
>>> email thread that a vote should occur.
>>> If a vote needs to occur, I can't see anything on
>>> http://apache.org/foundation/voting.html suggesting that I can call
>>> for a vote, which is why I'm asking PMC members to do it since they're
>>> the ones who would vote anyway.
>>> Now Sean is saying this is a code/doc change that can just be reviewed
>>> and merged as usual...which is what I tried to do to begin with.
>>>
>>> The fact that you haven't agreed on a process to agree on your process
>>> is, I think, an indication that the process really does need
>>> improvement ;)
>>>
>>>
>


-- 
Regards,
Vaquar Khan
+1 -224-436-0783

IT Architect / Lead Consultant
Greater Chicago

Re: Spark Improvement Proposals

Posted by Koert Kuipers <ko...@tresata.com>.

gonna end up with a stackoverflow on recursive votes here

On Thu, Mar 9, 2017 at 1:17 PM, Mark Hamstra <ma...@clearstorydata.com>
wrote:

> -0 on voting on whether we need a vote.
>
> On Thu, Mar 9, 2017 at 9:00 AM, Reynold Xin <rx...@databricks.com> wrote:
>
>> I'm fine without a vote. (are we voting on wether we need a vote?)
>>
>>
>> On Thu, Mar 9, 2017 at 8:55 AM, Sean Owen <so...@cloudera.com> wrote:
>>
>>> I think a VOTE is over-thinking it, and is rarely used, but, can't hurt.
>>> Nah, anyone can call a vote. This really isn't that formal. We just want to
>>> declare and document consensus.
>>>
>>> I think SPIP is just a remix of existing process anyway, and don't think
>>> it will actually do much anyway, which is why I am sanguine about the whole
>>> thing.
>>>
>>> To bring this to a conclusion, I will just put the contents of the doc
>>> in an email tomorrow for a VOTE. Raise any objections now.
>>>
>>> On Thu, Mar 9, 2017 at 3:39 PM Cody Koeninger <co...@koeninger.org>
>>> wrote:
>>>
>>>> I started this idea as a fork with a merge-able change to docs.
>>>> Reynold moved it to his google doc, and has suggested during this
>>>> email thread that a vote should occur.
>>>> If a vote needs to occur, I can't see anything on
>>>> http://apache.org/foundation/voting.html suggesting that I can call
>>>> for a vote, which is why I'm asking PMC members to do it since they're
>>>> the ones who would vote anyway.
>>>> Now Sean is saying this is a code/doc change that can just be reviewed
>>>> and merged as usual...which is what I tried to do to begin with.
>>>>
>>>> The fact that you haven't agreed on a process to agree on your process
>>>> is, I think, an indication that the process really does need
>>>> improvement ;)
>>>>
>>>>
>>
>

Re: Spark Improvement Proposals

Posted by Mark Hamstra <ma...@clearstorydata.com>.

-0 on voting on whether we need a vote.

On Thu, Mar 9, 2017 at 9:00 AM, Reynold Xin <rx...@databricks.com> wrote:

> I'm fine without a vote. (are we voting on wether we need a vote?)
>
>
> On Thu, Mar 9, 2017 at 8:55 AM, Sean Owen <so...@cloudera.com> wrote:
>
>> I think a VOTE is over-thinking it, and is rarely used, but, can't hurt.
>> Nah, anyone can call a vote. This really isn't that formal. We just want to
>> declare and document consensus.
>>
>> I think SPIP is just a remix of existing process anyway, and don't think
>> it will actually do much anyway, which is why I am sanguine about the whole
>> thing.
>>
>> To bring this to a conclusion, I will just put the contents of the doc in
>> an email tomorrow for a VOTE. Raise any objections now.
>>
>> On Thu, Mar 9, 2017 at 3:39 PM Cody Koeninger <co...@koeninger.org> wrote:
>>
>>> I started this idea as a fork with a merge-able change to docs.
>>> Reynold moved it to his google doc, and has suggested during this
>>> email thread that a vote should occur.
>>> If a vote needs to occur, I can't see anything on
>>> http://apache.org/foundation/voting.html suggesting that I can call
>>> for a vote, which is why I'm asking PMC members to do it since they're
>>> the ones who would vote anyway.
>>> Now Sean is saying this is a code/doc change that can just be reviewed
>>> and merged as usual...which is what I tried to do to begin with.
>>>
>>> The fact that you haven't agreed on a process to agree on your process
>>> is, I think, an indication that the process really does need
>>> improvement ;)
>>>
>>>
>

Re: Spark Improvement Proposals

Posted by Reynold Xin <rx...@databricks.com>.

I'm fine without a vote. (are we voting on wether we need a vote?)


On Thu, Mar 9, 2017 at 8:55 AM, Sean Owen <so...@cloudera.com> wrote:

> I think a VOTE is over-thinking it, and is rarely used, but, can't hurt.
> Nah, anyone can call a vote. This really isn't that formal. We just want to
> declare and document consensus.
>
> I think SPIP is just a remix of existing process anyway, and don't think
> it will actually do much anyway, which is why I am sanguine about the whole
> thing.
>
> To bring this to a conclusion, I will just put the contents of the doc in
> an email tomorrow for a VOTE. Raise any objections now.
>
> On Thu, Mar 9, 2017 at 3:39 PM Cody Koeninger <co...@koeninger.org> wrote:
>
>> I started this idea as a fork with a merge-able change to docs.
>> Reynold moved it to his google doc, and has suggested during this
>> email thread that a vote should occur.
>> If a vote needs to occur, I can't see anything on
>> http://apache.org/foundation/voting.html suggesting that I can call
>> for a vote, which is why I'm asking PMC members to do it since they're
>> the ones who would vote anyway.
>> Now Sean is saying this is a code/doc change that can just be reviewed
>> and merged as usual...which is what I tried to do to begin with.
>>
>> The fact that you haven't agreed on a process to agree on your process
>> is, I think, an indication that the process really does need
>> improvement ;)
>>
>>

Re: Spark Improvement Proposals

Posted by Sean Owen <so...@cloudera.com>.

Responding to your request for a vote, I meant that this isn't required per
se and the consensus here was not to vote on it. Hence the jokes about
meta-voting protocol. In that sense nothing new happened process-wise,
nothing against ASF norms, if that's your concern.

I think it's just an agreed convention now, that we will VOTE, as normal,
on particular types of changes that we call SPIPs. I mean it's no new
process in the ASF sense because VOTEs are an existing mechanic. I
personally view it as, simply, additional guidance about how to manage huge
JIRAs in a way that makes them stand a chance of moving forward. I suppose
we could VOTE about any JIRA if we wanted. They all proceed via lazy
consensus at the moment.

Practically -- I heard support for codifying this process and no objections
to the final form. This was bouncing around in process purgatory, when no
particular new process was called for.

It takes effect immediately, implicitly, like anything else I guess, like
amendments to code style guidelines. Please uses SPIPs to propose big
changes from here.

As to finding it hard to pick out of the noise, sure, I sympathize. Many
big things happen without a VOTE tag though. It does take a time investment
to triage these email lists. I don't know that this by itself means a VOTE
should have happened.

On Mon, Mar 13, 2017 at 6:15 PM Tom Graves <tg...@yahoo.com> wrote:

> Another thing I think you should send out is when exactly does this take
> affect.  Is it any major new feature without a pull request?   Is it
> anything major starting with the 2.3 release?
>
> Tom
>
>
> On Monday, March 13, 2017 1:08 PM, Tom Graves <tg...@yahoo.com.INVALID>
> wrote:
>
>
> I'm not sure how you can say its not a new process.  If that is the case
> why do we need a page documenting it?
> As a developer if I want to put up a major improvement I have to now
> follow the SPIP whereas before I didn't, that certain seems like a new
> process.  As a PMC member I now have the ability to vote on these SPIPs,
> that seems like something new again.
>
> There are  apache bylaws and then there are project specific bylaws.  As
> far as I know Spark doesn't document any of its project specific bylaws so
> I guess this isn't officially a change to them, but it was implicit before
> that you didn't need any review for major improvements before, now you need
> an explicit vote for them to be approved.  Certainly seems to fall under
> the "Procedural" section in the voting link you sent.
>
> I understand this was under discussion for a while and you have asked for
> peoples feedback multiple times.  But sometimes long threads are easy to
> ignore.  That is why personally I like to see things labelled [VOTE],
> [ANNOUNCE], [DISCUSS] when it gets close to finalizing on something like
> this.
>
> I don't really want to draw this out or argue anymore about it, if I
> really wanted a vote I guess I would -1 the change. I'm not going to do
> that.
> I would at least like to see an announcement go out about it.  The last
> thing I saw you say was you were going to call a vote.  A few people chimed
> in with their thoughts on that vote, but nothing was said after that.
>
> Tom
>
>
>
> On Monday, March 13, 2017 12:36 PM, Sean Owen <so...@cloudera.com> wrote:
>
>
> It's not a new process, in that it doesn't entail anything not already in
> http://apache.org/foundation/voting.html . We're just deciding to call a
> VOTE for this type of code modification.
>
> To your point -- yes, it's been around a long time with no further
> comment, and I called several times for more input. That's pretty strong
> lazy consensus of the form we use every day.
>
> On Mon, Mar 13, 2017 at 5:30 PM Tom Graves <tg...@yahoo.com> wrote:
>
> It seems like if you are adding responsibilities you should do a vote.
> SPIP'S require votes from PMC members so you are now putting more
> responsibility on them. It feels like we should have an official vote to
> make sure they (PMC members) agree with that and to make sure everyone pays
> attention to it.  That thread has been there for a while just as discussion
> and now all of a sudden its implemented without even an announcement being
> sent out about it.
>
> Tom
>
>
>
>
>
>

Re: Spark Improvement Proposals

Posted by Tom Graves <tg...@yahoo.com.INVALID>.

Another thing I think you should send out is when exactly does this take affect. Is it any major new feature without a pull request? Is it anything major starting with the 2.3 release?
Tom

On Monday, March 13, 2017 1:08 PM, Tom Graves <tg...@yahoo.com.INVALID> wrote:

I'm not sure how you can say its not a new process. If that is the case why do we need a page documenting it?
As a developer if I want to put up a major improvement I have to now follow the SPIP whereas before I didn't, that certain seems like a new process. As a PMC member I now have the ability to vote on these SPIPs, that seems like something new again.
There are apache bylaws and then there are project specific bylaws. As far as I know Spark doesn't document any of its project specific bylaws so I guess this isn't officially a change to them, but it was implicit before that you didn't need any review for major improvements before, now you need an explicit vote for them to be approved. Certainly seems to fall under the "Procedural" section in the voting link you sent.
I understand this was under discussion for a while and you have asked for peoples feedback multiple times. But sometimes long threads are easy to ignore. That is why personally I like to see things labelled [VOTE], [ANNOUNCE], [DISCUSS] when it gets close to finalizing on something like this.
I don't really want to draw this out or argue anymore about it, if I really wanted a vote I guess I would -1 the change. I'm not going to do that. I would at least like to see an announcement go out about it. The last thing I saw you say was you were going to call a vote. A few people chimed in with their thoughts on that vote, but nothing was said after that.
Tom

On Monday, March 13, 2017 12:36 PM, Sean Owen <so...@cloudera.com> wrote:

It's not a new process, in that it doesn't entail anything not already in http://apache.org/foundation/voting.html . We're just deciding to call a VOTE for this type of code modification.
To your point -- yes, it's been around a long time with no further comment, and I called several times for more input. That's pretty strong lazy consensus of the form we use every day.

On Mon, Mar 13, 2017 at 5:30 PM Tom Graves <tg...@yahoo.com> wrote:

It seems like if you are adding responsibilities you should do a vote. SPIP'S require votes from PMC members so you are now putting more responsibility on them. It feels like we should have an official vote to make sure they (PMC members) agree with that and to make sure everyone pays attention to it. That thread has been there for a while just as discussion and now all of a sudden its implemented without even an announcement being sent out about it.
Tom

Re: Spark Improvement Proposals

Posted by Tom Graves <tg...@yahoo.com.INVALID>.

On Monday, March 13, 2017 12:36 PM, Sean Owen <so...@cloudera.com> wrote:

On Mon, Mar 13, 2017 at 5:30 PM Tom Graves <tg...@yahoo.com> wrote:

Re: Spark Improvement Proposals

Posted by Sean Owen <so...@cloudera.com>.

It's not a new process, in that it doesn't entail anything not already in
http://apache.org/foundation/voting.html . We're just deciding to call a
VOTE for this type of code modification.

To your point -- yes, it's been around a long time with no further comment,
and I called several times for more input. That's pretty strong lazy
consensus of the form we use every day.

On Mon, Mar 13, 2017 at 5:30 PM Tom Graves <tg...@yahoo.com> wrote:

> It seems like if you are adding responsibilities you should do a vote.
> SPIP'S require votes from PMC members so you are now putting more
> responsibility on them. It feels like we should have an official vote to
> make sure they (PMC members) agree with that and to make sure everyone pays
> attention to it.  That thread has been there for a while just as discussion
> and now all of a sudden its implemented without even an announcement being
> sent out about it.
>
> Tom
>
>

Re: Spark Improvement Proposals

Posted by Tom Graves <tg...@yahoo.com.INVALID>.

On Monday, March 13, 2017 11:37 AM, Sean Owen <so...@cloudera.com> wrote:

This ended up proceeding as a normal doc change, instead of precipitating a meta-vote.However, the text that's on the web site now can certainly be further amended if anyone wants to propose a change from here.
On Mon, Mar 13, 2017 at 1:50 PM Tom Graves <tg...@yahoo.com> wrote:

I think a vote here would be good. I think most of the discussion was done by 4 or 5 people and its a long thread. If nothing else it summarizes everything and gets people attention to the change.
Tom

On Thursday, March 9, 2017 10:55 AM, Sean Owen <so...@cloudera.com> wrote:

I think a VOTE is over-thinking it, and is rarely used, but, can't hurt. Nah, anyone can call a vote. This really isn't that formal. We just want to declare and document consensus.
I think SPIP is just a remix of existing process anyway, and don't think it will actually do much anyway, which is why I am sanguine about the whole thing.
To bring this to a conclusion, I will just put the contents of the doc in an email tomorrow for a VOTE. Raise any objections now.
On Thu, Mar 9, 2017 at 3:39 PM Cody Koeninger <co...@koeninger.org> wrote:

I started this idea as a fork with a merge-able change to docs.
Reynold moved it to his google doc, and has suggested during this
email thread that a vote should occur.
If a vote needs to occur, I can't see anything on
http://apache.org/foundation/voting.html suggesting that I can call
for a vote, which is why I'm asking PMC members to do it since they're
the ones who would vote anyway.
Now Sean is saying this is a code/doc change that can just be reviewed
and merged as usual...which is what I tried to do to begin with.

The fact that you haven't agreed on a process to agree on your process
is, I think, an indication that the process really does need
improvement ;)

Re: Spark Improvement Proposals

Posted by Sean Owen <so...@cloudera.com>.

This ended up proceeding as a normal doc change, instead of precipitating a
meta-vote.
However, the text that's on the web site now can certainly be further
amended if anyone wants to propose a change from here.

On Mon, Mar 13, 2017 at 1:50 PM Tom Graves <tg...@yahoo.com> wrote:

> I think a vote here would be good. I think most of the discussion was done
> by 4 or 5 people and its a long thread.  If nothing else it summarizes
> everything and gets people attention to the change.
>
> Tom
>
>
> On Thursday, March 9, 2017 10:55 AM, Sean Owen <so...@cloudera.com> wrote:
>
>
> I think a VOTE is over-thinking it, and is rarely used, but, can't hurt.
> Nah, anyone can call a vote. This really isn't that formal. We just want to
> declare and document consensus.
>
> I think SPIP is just a remix of existing process anyway, and don't think
> it will actually do much anyway, which is why I am sanguine about the whole
> thing.
>
> To bring this to a conclusion, I will just put the contents of the doc in
> an email tomorrow for a VOTE. Raise any objections now.
>
> On Thu, Mar 9, 2017 at 3:39 PM Cody Koeninger <co...@koeninger.org> wrote:
>
> I started this idea as a fork with a merge-able change to docs.
> Reynold moved it to his google doc, and has suggested during this
> email thread that a vote should occur.
> If a vote needs to occur, I can't see anything on
> http://apache.org/foundation/voting.html suggesting that I can call
> for a vote, which is why I'm asking PMC members to do it since they're
> the ones who would vote anyway.
> Now Sean is saying this is a code/doc change that can just be reviewed
> and merged as usual...which is what I tried to do to begin with.
>
> The fact that you haven't agreed on a process to agree on your process
> is, I think, an indication that the process really does need
> improvement ;)
>
>
>
>

Re: Spark Improvement Proposals

Posted by Tom Graves <tg...@yahoo.com.INVALID>.

On Thursday, March 9, 2017 10:55 AM, Sean Owen <so...@cloudera.com> wrote:

The fact that you haven't agreed on a process to agree on your process
is, I think, an indication that the process really does need
improvement ;)

Re: Spark Improvement Proposals

Posted by Cody Koeninger <co...@koeninger.org>.

Can someone with filter share permissions can make a filter for open
SPIP and one for closed SPIP and share it?

e.g.

project = SPARK AND status in (Open, Reopened, "In Progress") AND
labels=SPIP ORDER BY createdDate DESC

and another with the status closed equivalent

I just made an open ticket with the SPIP label show it should show up

On Fri, Mar 10, 2017 at 11:19 AM, Reynold Xin <rx...@databricks.com> wrote:
> We can just start using spip label and link to it.
>
>
>
> On Fri, Mar 10, 2017 at 9:18 AM, Cody Koeninger <co...@koeninger.org> wrote:
>>
>> So to be clear, if I translate that google doc to markup and submit a
>> PR, you will merge it?
>>
>> If we're just using "spip" label, that's probably fine, but we still
>> need shared filters for open and closed SPIPs so the page can link to
>> them.
>>
>> I do not believe I have jira permissions to share filters, I just
>> attempted to edit one of mine and do not see an add shares field.
>>
>> On Fri, Mar 10, 2017 at 10:54 AM, Sean Owen <so...@cloudera.com> wrote:
>> > Sure, that seems OK to me. I can merge anything like that.
>> > I think anyone can make a new label in JIRA; I don't know if even the
>> > admins
>> > can make a new issue type unfortunately. We may just have to mention a
>> > convention involving title and label or something.
>> >
>> > On Fri, Mar 10, 2017 at 4:52 PM Cody Koeninger <co...@koeninger.org>
>> > wrote:
>> >>
>> >> I think it ought to be its own page, linked from the more / community
>> >> menu dropdowns.
>> >>
>> >> We also need the jira tag, and for the page to clearly link to filters
>> >> that show proposed / completed SPIPs
>> >>
>> >> On Fri, Mar 10, 2017 at 3:39 AM, Sean Owen <so...@cloudera.com> wrote:
>> >> > Alrighty, if nobody is objecting, and nobody calls for a VOTE, then,
>> >> > let's
>> >> > say this document is the SPIP 1.0 process.
>> >> >
>> >> > I think the next step is just to translate the text to some suitable
>> >> > location. I suggest adding it to
>> >> > https://github.com/apache/spark-website/blob/asf-site/contributing.md
>> >> >
>> >> > On Thu, Mar 9, 2017 at 4:55 PM Sean Owen <so...@cloudera.com> wrote:
>> >> >>
>> >> >> I think a VOTE is over-thinking it, and is rarely used, but, can't
>> >> >> hurt.
>> >> >> Nah, anyone can call a vote. This really isn't that formal. We just
>> >> >> want to
>> >> >> declare and document consensus.
>> >> >>
>> >> >> I think SPIP is just a remix of existing process anyway, and don't
>> >> >> think
>> >> >> it will actually do much anyway, which is why I am sanguine about
>> >> >> the
>> >> >> whole
>> >> >> thing.
>> >> >>
>> >> >> To bring this to a conclusion, I will just put the contents of the
>> >> >> doc
>> >> >> in
>> >> >> an email tomorrow for a VOTE. Raise any objections now.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>
>

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org

Re: Spark Improvement Proposals

Posted by Reynold Xin <rx...@databricks.com>.

We can just start using spip label and link to it.



On Fri, Mar 10, 2017 at 9:18 AM, Cody Koeninger <co...@koeninger.org> wrote:

> So to be clear, if I translate that google doc to markup and submit a
> PR, you will merge it?
>
> If we're just using "spip" label, that's probably fine, but we still
> need shared filters for open and closed SPIPs so the page can link to
> them.
>
> I do not believe I have jira permissions to share filters, I just
> attempted to edit one of mine and do not see an add shares field.
>
> On Fri, Mar 10, 2017 at 10:54 AM, Sean Owen <so...@cloudera.com> wrote:
> > Sure, that seems OK to me. I can merge anything like that.
> > I think anyone can make a new label in JIRA; I don't know if even the
> admins
> > can make a new issue type unfortunately. We may just have to mention a
> > convention involving title and label or something.
> >
> > On Fri, Mar 10, 2017 at 4:52 PM Cody Koeninger <co...@koeninger.org>
> wrote:
> >>
> >> I think it ought to be its own page, linked from the more / community
> >> menu dropdowns.
> >>
> >> We also need the jira tag, and for the page to clearly link to filters
> >> that show proposed / completed SPIPs
> >>
> >> On Fri, Mar 10, 2017 at 3:39 AM, Sean Owen <so...@cloudera.com> wrote:
> >> > Alrighty, if nobody is objecting, and nobody calls for a VOTE, then,
> >> > let's
> >> > say this document is the SPIP 1.0 process.
> >> >
> >> > I think the next step is just to translate the text to some suitable
> >> > location. I suggest adding it to
> >> > https://github.com/apache/spark-website/blob/asf-site/contributing.md
> >> >
> >> > On Thu, Mar 9, 2017 at 4:55 PM Sean Owen <so...@cloudera.com> wrote:
> >> >>
> >> >> I think a VOTE is over-thinking it, and is rarely used, but, can't
> >> >> hurt.
> >> >> Nah, anyone can call a vote. This really isn't that formal. We just
> >> >> want to
> >> >> declare and document consensus.
> >> >>
> >> >> I think SPIP is just a remix of existing process anyway, and don't
> >> >> think
> >> >> it will actually do much anyway, which is why I am sanguine about the
> >> >> whole
> >> >> thing.
> >> >>
> >> >> To bring this to a conclusion, I will just put the contents of the
> doc
> >> >> in
> >> >> an email tomorrow for a VOTE. Raise any objections now.
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>
>

Re: Spark Improvement Proposals

Posted by Cody Koeninger <co...@koeninger.org>.

So to be clear, if I translate that google doc to markup and submit a
PR, you will merge it?

If we're just using "spip" label, that's probably fine, but we still
need shared filters for open and closed SPIPs so the page can link to
them.

I do not believe I have jira permissions to share filters, I just
attempted to edit one of mine and do not see an add shares field.

On Fri, Mar 10, 2017 at 10:54 AM, Sean Owen <so...@cloudera.com> wrote:
> Sure, that seems OK to me. I can merge anything like that.
> I think anyone can make a new label in JIRA; I don't know if even the admins
> can make a new issue type unfortunately. We may just have to mention a
> convention involving title and label or something.
>
> On Fri, Mar 10, 2017 at 4:52 PM Cody Koeninger <co...@koeninger.org> wrote:
>>
>> I think it ought to be its own page, linked from the more / community
>> menu dropdowns.
>>
>> We also need the jira tag, and for the page to clearly link to filters
>> that show proposed / completed SPIPs
>>
>> On Fri, Mar 10, 2017 at 3:39 AM, Sean Owen <so...@cloudera.com> wrote:
>> > Alrighty, if nobody is objecting, and nobody calls for a VOTE, then,
>> > let's
>> > say this document is the SPIP 1.0 process.
>> >
>> > I think the next step is just to translate the text to some suitable
>> > location. I suggest adding it to
>> > https://github.com/apache/spark-website/blob/asf-site/contributing.md
>> >
>> > On Thu, Mar 9, 2017 at 4:55 PM Sean Owen <so...@cloudera.com> wrote:
>> >>
>> >> I think a VOTE is over-thinking it, and is rarely used, but, can't
>> >> hurt.
>> >> Nah, anyone can call a vote. This really isn't that formal. We just
>> >> want to
>> >> declare and document consensus.
>> >>
>> >> I think SPIP is just a remix of existing process anyway, and don't
>> >> think
>> >> it will actually do much anyway, which is why I am sanguine about the
>> >> whole
>> >> thing.
>> >>
>> >> To bring this to a conclusion, I will just put the contents of the doc
>> >> in
>> >> an email tomorrow for a VOTE. Raise any objections now.

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org

Re: Spark Improvement Proposals

Posted by Cody Koeninger <co...@koeninger.org>.

I think it ought to be its own page, linked from the more / community
menu dropdowns.

We also need the jira tag, and for the page to clearly link to filters
that show proposed / completed SPIPs

On Fri, Mar 10, 2017 at 3:39 AM, Sean Owen <so...@cloudera.com> wrote:
> Alrighty, if nobody is objecting, and nobody calls for a VOTE, then, let's
> say this document is the SPIP 1.0 process.
>
> I think the next step is just to translate the text to some suitable
> location. I suggest adding it to
> https://github.com/apache/spark-website/blob/asf-site/contributing.md
>
> On Thu, Mar 9, 2017 at 4:55 PM Sean Owen <so...@cloudera.com> wrote:
>>
>> I think a VOTE is over-thinking it, and is rarely used, but, can't hurt.
>> Nah, anyone can call a vote. This really isn't that formal. We just want to
>> declare and document consensus.
>>
>> I think SPIP is just a remix of existing process anyway, and don't think
>> it will actually do much anyway, which is why I am sanguine about the whole
>> thing.
>>
>> To bring this to a conclusion, I will just put the contents of the doc in
>> an email tomorrow for a VOTE. Raise any objections now.

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org

Re: Spark Improvement Proposals

Posted by Sean Owen <so...@cloudera.com>.

Alrighty, if nobody is objecting, and nobody calls for a VOTE, then, let's
say this document is the SPIP 1.0 process.

I think the next step is just to translate the text to some suitable
location. I suggest adding it to
https://github.com/apache/spark-website/blob/asf-site/contributing.md

On Thu, Mar 9, 2017 at 4:55 PM Sean Owen <so...@cloudera.com> wrote:

> I think a VOTE is over-thinking it, and is rarely used, but, can't hurt.
> Nah, anyone can call a vote. This really isn't that formal. We just want to
> declare and document consensus.
>
> I think SPIP is just a remix of existing process anyway, and don't think
> it will actually do much anyway, which is why I am sanguine about the whole
> thing.
>
> To bring this to a conclusion, I will just put the contents of the doc in
> an email tomorrow for a VOTE. Raise any objections now.
>

Re: Spark Improvement Proposals

Posted by Sean Owen <so...@cloudera.com>.

I think a VOTE is over-thinking it, and is rarely used, but, can't hurt.
Nah, anyone can call a vote. This really isn't that formal. We just want to
declare and document consensus.

I think SPIP is just a remix of existing process anyway, and don't think it
will actually do much anyway, which is why I am sanguine about the whole
thing.

To bring this to a conclusion, I will just put the contents of the doc in
an email tomorrow for a VOTE. Raise any objections now.

On Thu, Mar 9, 2017 at 3:39 PM Cody Koeninger <co...@koeninger.org> wrote:

> I started this idea as a fork with a merge-able change to docs.
> Reynold moved it to his google doc, and has suggested during this
> email thread that a vote should occur.
> If a vote needs to occur, I can't see anything on
> http://apache.org/foundation/voting.html suggesting that I can call
> for a vote, which is why I'm asking PMC members to do it since they're
> the ones who would vote anyway.
> Now Sean is saying this is a code/doc change that can just be reviewed
> and merged as usual...which is what I tried to do to begin with.
>
> The fact that you haven't agreed on a process to agree on your process
> is, I think, an indication that the process really does need
> improvement ;)
>
>

Re: Spark Improvement Proposals

Posted by Cody Koeninger <co...@koeninger.org>.

I started this idea as a fork with a merge-able change to docs.
Reynold moved it to his google doc, and has suggested during this
email thread that a vote should occur.
If a vote needs to occur, I can't see anything on
http://apache.org/foundation/voting.html suggesting that I can call
for a vote, which is why I'm asking PMC members to do it since they're
the ones who would vote anyway.
Now Sean is saying this is a code/doc change that can just be reviewed
and merged as usual...which is what I tried to do to begin with.

The fact that you haven't agreed on a process to agree on your process
is, I think, an indication that the process really does need
improvement ;)

On Tue, Mar 7, 2017 at 11:05 AM, Sean Owen <so...@cloudera.com> wrote:
> Do we need a VOTE? heck I think anyone can call one, anyway.
>
> Pre-flight vote check: anyone have objections to the text as-is?
> See
> https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#
>
> If so let's hash out specific suggest changes.
>
> If not, then I think the next step is to probably update the
> github.com/apache/spark-website repo with the text here. That's a code/doc
> change we can just review and merge as usual.
>
> On Tue, Mar 7, 2017 at 3:15 PM Cody Koeninger <co...@koeninger.org> wrote:
>>
>> Another week, another ping.  Anyone on the PMC willing to call a vote on
>> this?

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org

Re: Spark Improvement Proposals

Posted by Sean Owen <so...@cloudera.com>.

Do we need a VOTE? heck I think anyone can call one, anyway.

Pre-flight vote check: anyone have objections to the text as-is?
See
https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#

If so let's hash out specific suggest changes.

If not, then I think the next step is to probably update the
github.com/apache/spark-website repo with the text here. That's a code/doc
change we can just review and merge as usual.

On Tue, Mar 7, 2017 at 3:15 PM Cody Koeninger <co...@koeninger.org> wrote:

> Another week, another ping.  Anyone on the PMC willing to call a vote on
> this?
>

Re: Spark Improvement Proposals

Posted by Cody Koeninger <co...@koeninger.org>.

Another week, another ping.  Anyone on the PMC willing to call a vote on
this?

On Mon, Feb 27, 2017 at 3:08 PM, Ryan Blue <rb...@netflix.com> wrote:

> I'd like to see more discussion on the issues I raised. I don't think
> there was a response for why voting is limited to PMC members.
>
> Tim was kind enough to reply with his rationale for a shepherd, but I
> don't think that it justifies failing proposals. I think it boiled down to
> "shepherds can be helpful", which isn't a good reason to require them in my
> opinion. Sam also had some good comments on this and I think that there's
> more to talk about.
>
> That said, I'd rather not have this proposal fail because we're tired of
> talking about it. If most people are okay with it as it stands and want a
> vote, I'm fine testing this out and fixing it later.
>
> rb
>
> On Fri, Feb 24, 2017 at 8:28 PM, Joseph Bradley <jo...@databricks.com>
> wrote:
>
>> The current draft LGTM.  I agree some of the various concerns may need to
>> be addressed in the future, depending on how SPIPs progress in practice.
>> If others agree, let's put it to a vote and revisit the proposal in a few
>> months.
>> Joseph
>>
>> On Fri, Feb 24, 2017 at 5:35 AM, Cody Koeninger <co...@koeninger.org>
>> wrote:
>>
>>> It's been a week since any further discussion.
>>>
>>> Do PMC members think the current draft is OK to vote on?
>>>
>>> On Fri, Feb 17, 2017 at 10:41 PM, vaquar khan <va...@gmail.com>
>>> wrote:
>>> > I like document and happy to see SPIP draft version however i feel
>>> shepherd
>>> > role is again hurdle in process improvement ,It's like everything
>>> depends
>>> > only on shepherd .
>>> >
>>> > Also want to add point that SPIP  should be time bound with define SLA
>>> else
>>> > will defeats purpose.
>>> >
>>> >
>>> > Regards,
>>> > Vaquar khan
>>> >
>>> > On Thu, Feb 16, 2017 at 3:26 PM, Ryan Blue <rb...@netflix.com.invalid>
>>> > wrote:
>>> >>
>>> >> > [The shepherd] can advise on technical and procedural
>>> considerations for
>>> >> > people outside the community
>>> >>
>>> >> The sentiment is good, but this doesn't justify requiring a shepherd
>>> for a
>>> >> proposal. There are plenty of people that wouldn't need this, would
>>> get
>>> >> feedback during discussion, or would ask a committer or PMC member if
>>> it
>>> >> weren't a formal requirement.
>>> >>
>>> >> > if no one is willing to be a shepherd, the proposed idea is
>>> probably not
>>> >> > going to receive much traction in the first place.
>>> >>
>>> >> This also doesn't sound like a reason for needing a shepherd. Saying
>>> that
>>> >> a shepherd probably won't hurt the process doesn't give me an idea of
>>> why a
>>> >> shepherd should be required in the first place.
>>> >>
>>> >> What was the motivation for adding a shepherd originally? It may not
>>> be
>>> >> bad and it could be helpful, but neither of those makes me think that
>>> they
>>> >> should be required or else the proposal fails.
>>> >>
>>> >> rb
>>> >>
>>> >> On Thu, Feb 16, 2017 at 12:23 PM, Tim Hunter <
>>> timhunter@databricks.com>
>>> >> wrote:
>>> >>>
>>> >>> The doc looks good to me.
>>> >>>
>>> >>> Ryan, the role of the shepherd is to make sure that someone
>>> >>> knowledgeable with Spark processes is involved: this person can
>>> advise
>>> >>> on technical and procedural considerations for people outside the
>>> >>> community. Also, if no one is willing to be a shepherd, the proposed
>>> >>> idea is probably not going to receive much traction in the first
>>> >>> place.
>>> >>>
>>> >>> Tim
>>> >>>
>>> >>> On Thu, Feb 16, 2017 at 9:17 AM, Cody Koeninger <co...@koeninger.org>
>>> >>> wrote:
>>> >>> > Reynold, thanks, LGTM.
>>> >>> >
>>> >>> > Sean, great concerns.  I agree that behavior is largely cultural
>>> and
>>> >>> > writing down a process won't necessarily solve any problems one
>>> way or
>>> >>> > the other.  But one outwardly visible change I'm hoping for out of
>>> >>> > this a way for people who have a stake in Spark, but can't follow
>>> >>> > jiras closely, to go to the Spark website, see the list of proposed
>>> >>> > major changes, contribute discussion on issues that are relevant to
>>> >>> > their needs, and see a clear direction once a vote has passed.  We
>>> >>> > don't have that now.
>>> >>> >
>>> >>> > Ryan, realistically speaking any PMC member can and will stop any
>>> >>> > changes they don't like anyway, so might as well be up front about
>>> the
>>> >>> > reality of the situation.
>>> >>> >
>>> >>> > On Thu, Feb 16, 2017 at 10:43 AM, Sean Owen <so...@cloudera.com>
>>> wrote:
>>> >>> >> The text seems fine to me. Really, this is not describing a
>>> >>> >> fundamentally
>>> >>> >> new process, which is good. We've always had JIRAs, we've always
>>> been
>>> >>> >> able
>>> >>> >> to call a VOTE for a big question. This just writes down a
>>> sensible
>>> >>> >> set of
>>> >>> >> guidelines for putting those two together when a major change is
>>> >>> >> proposed. I
>>> >>> >> look forward to turning some big JIRAs into a request for a SPIP.
>>> >>> >>
>>> >>> >> My only hesitation is that this seems to be perceived by some as
>>> a new
>>> >>> >> or
>>> >>> >> different thing, that is supposed to solve some problems that
>>> aren't
>>> >>> >> otherwise solvable. I see mentioned problems like: clear process
>>> for
>>> >>> >> managing work, public communication, more committers, some sort of
>>> >>> >> binding
>>> >>> >> outcome and deadline.
>>> >>> >>
>>> >>> >> If SPIP is supposed to be a way to make people design in public
>>> and a
>>> >>> >> way to
>>> >>> >> force attention to a particular change, then, this doesn't do
>>> that by
>>> >>> >> itself. Therefore I don't want to let a detailed discussion of
>>> SPIP
>>> >>> >> detract
>>> >>> >> from the discussion about doing what SPIP implies. It's just a
>>> process
>>> >>> >> document.
>>> >>> >>
>>> >>> >> Still, a fine step IMHO.
>>> >>> >>
>>> >>> >> On Thu, Feb 16, 2017 at 4:22 PM Reynold Xin <rx...@databricks.com>
>>> >>> >> wrote:
>>> >>> >>>
>>> >>> >>> Updated. Any feedback from other community members?
>>> >>> >>>
>>> >>> >>>
>>> >>> >>> On Wed, Feb 15, 2017 at 2:53 AM, Cody Koeninger <
>>> cody@koeninger.org>
>>> >>> >>> wrote:
>>> >>> >>>>
>>> >>> >>>> Thanks for doing that.
>>> >>> >>>>
>>> >>> >>>> Given that there are at least 4 different Apache voting
>>> processes,
>>> >>> >>>> "typical Apache vote process" isn't meaningful to me.
>>> >>> >>>>
>>> >>> >>>> I think the intention is that in order to pass, it needs at
>>> least 3
>>> >>> >>>> +1
>>> >>> >>>> votes from PMC members *and no -1 votes from PMC members*.  But
>>> the
>>> >>> >>>> document
>>> >>> >>>> doesn't explicitly say that second part.
>>> >>> >>>>
>>> >>> >>>> There's also no mention of the duration a vote should remain
>>> open.
>>> >>> >>>> There's a mention of a month for finding a shepherd, but that's
>>> >>> >>>> different.
>>> >>> >>>>
>>> >>> >>>> Other than that, LGTM.
>>> >>> >>>>
>>> >>> >>>> On Mon, Feb 13, 2017 at 9:02 AM, Reynold Xin <
>>> rxin@databricks.com>
>>> >>> >>>> wrote:
>>> >>> >>>>>
>>> >>> >>>>> Here's a new draft that incorporated most of the feedback:
>>> >>> >>>>>
>>> >>> >>>>> https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-
>>> nRanvXmnZ7SUi4qMljg/edit#
>>> >>> >>>>>
>>> >>> >>>>> I added a specific role for SPIP Author and another one for
>>> SPIP
>>> >>> >>>>> Shepherd.
>>> >>> >>>>>
>>> >>> >>>>> On Sat, Feb 11, 2017 at 6:13 PM, Xiao Li <gatorsmile@gmail.com
>>> >
>>> >>> >>>>> wrote:
>>> >>> >>>>>>
>>> >>> >>>>>> During the summit, I also had a lot of discussions over
>>> similar
>>> >>> >>>>>> topics
>>> >>> >>>>>> with multiple Committers and active users. I heard many
>>> fantastic
>>> >>> >>>>>> ideas. I
>>> >>> >>>>>> believe Spark improvement proposals are good channels to
>>> collect
>>> >>> >>>>>> the
>>> >>> >>>>>> requirements/designs.
>>> >>> >>>>>>
>>> >>> >>>>>>
>>> >>> >>>>>> IMO, we also need to consider the priority when working on
>>> these
>>> >>> >>>>>> items.
>>> >>> >>>>>> Even if the proposal is accepted, it does not mean it will be
>>> >>> >>>>>> implemented
>>> >>> >>>>>> and merged immediately. It is not a FIFO queue.
>>> >>> >>>>>>
>>> >>> >>>>>>
>>> >>> >>>>>> Even if some PRs are merged, sometimes, we still have to
>>> revert
>>> >>> >>>>>> them
>>> >>> >>>>>> back, if the design and implementation are not reviewed
>>> carefully.
>>> >>> >>>>>> We have
>>> >>> >>>>>> to ensure our quality. Spark is not an application software.
>>> It is
>>> >>> >>>>>> an
>>> >>> >>>>>> infrastructure software that is being used by many many
>>> companies.
>>> >>> >>>>>> We have
>>> >>> >>>>>> to be very careful in the design and implementation,
>>> especially
>>> >>> >>>>>> adding/changing the external APIs.
>>> >>> >>>>>>
>>> >>> >>>>>>
>>> >>> >>>>>> When I developed the Mainframe infrastructure/middleware
>>> software
>>> >>> >>>>>> in
>>> >>> >>>>>> the past 6 years, I were involved in the discussions with
>>> >>> >>>>>> external/internal
>>> >>> >>>>>> customers. The to-do feature list was always above 100.
>>> Sometimes,
>>> >>> >>>>>> the
>>> >>> >>>>>> customers are feeling frustrated when we are unable to deliver
>>> >>> >>>>>> them on time
>>> >>> >>>>>> due to the resource limits and others. Even if they paid us
>>> >>> >>>>>> billions, we
>>> >>> >>>>>> still need to do it phase by phase or sometimes they have to
>>> >>> >>>>>> accept the
>>> >>> >>>>>> workarounds. That is the reality everyone has to face, I
>>> think.
>>> >>> >>>>>>
>>> >>> >>>>>>
>>> >>> >>>>>> Thanks,
>>> >>> >>>>>>
>>> >>> >>>>>>
>>> >>> >>>>>> Xiao Li
>>> >>> >>>>>>>
>>> >>> >>>>>>>
>>> >>> >>
>>> >>> >
>>> >>> > ------------------------------------------------------------
>>> ---------
>>> >>> > To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>> >>> >
>>> >>>
>>> >>> ------------------------------------------------------------
>>> ---------
>>> >>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>> >>>
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Ryan Blue
>>> >> Software Engineer
>>> >> Netflix
>>> >
>>> >
>>> >
>>> >
>>> > --
>>> > Regards,
>>> > Vaquar Khan
>>> > +1 -224-436-0783
>>> >
>>> > IT Architect / Lead Consultant
>>> > Greater Chicago
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>
>>>
>>
>>
>> --
>>
>> Joseph Bradley
>>
>> Software Engineer - Machine Learning
>>
>> Databricks, Inc.
>>
>> [image: http://databricks.com] <http://databricks.com/>
>>
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: Spark Improvement Proposals

Posted by Sean Owen <so...@cloudera.com>.

To me, no new process is being invented here, on purpose, and so we should
just rely on whatever governs any large JIRA or vote, because SPIPs are
really just guidance for making a big JIRA.

http://apache.org/foundation/voting.html suggests that PMC members have the
binding votes in general, and for code-modification votes in particular,
which is what this is. Absent a strong reason to diverge from that, I'd go
with that.

(PS: On reading this, I didn't realize that the guidance was that releases
are blessed just by majority vote. Oh well, not that it has mattered.)

I also don't see a need to require a shepherd, because JIRAs don't have
such a process, though I also can't see a situation where nobody with a
vote cares to endorse the SPIP ever, but three people vote for it and
nobody objects?

Perhaps downgrade this to "strongly suggested, so that you don't waste your
time."

Or, implicitly, that proposing a SPIP calls a vote that lasts for, dunno, a
month. If fewer than 3 PMC vote for it, it doesn't pass anyway. If at least
1 does, OK, they're the shepherd(s). No new process.

On Mon, Feb 27, 2017 at 9:09 PM Ryan Blue <rb...@netflix.com> wrote:

> I'd like to see more discussion on the issues I raised. I don't think
> there was a response for why voting is limited to PMC members.
>
> Tim was kind enough to reply with his rationale for a shepherd, but I
> don't think that it justifies failing proposals. I think it boiled down to
> "shepherds can be helpful", which isn't a good reason to require them in my
> opinion. Sam also had some good comments on this and I think that there's
> more to talk about.
>
> That said, I'd rather not have this proposal fail because we're tired of
> talking about it. If most people are okay with it as it stands and want a
> vote, I'm fine testing this out and fixing it later.
>
> rb
>
>

Re: Spark Improvement Proposals

Posted by Ryan Blue <rb...@netflix.com.INVALID>.

I'd like to see more discussion on the issues I raised. I don't think there
was a response for why voting is limited to PMC members.

Tim was kind enough to reply with his rationale for a shepherd, but I don't
think that it justifies failing proposals. I think it boiled down to
"shepherds can be helpful", which isn't a good reason to require them in my
opinion. Sam also had some good comments on this and I think that there's
more to talk about.

That said, I'd rather not have this proposal fail because we're tired of
talking about it. If most people are okay with it as it stands and want a
vote, I'm fine testing this out and fixing it later.

rb

On Fri, Feb 24, 2017 at 8:28 PM, Joseph Bradley <jo...@databricks.com>
wrote:

> The current draft LGTM.  I agree some of the various concerns may need to
> be addressed in the future, depending on how SPIPs progress in practice.
> If others agree, let's put it to a vote and revisit the proposal in a few
> months.
> Joseph
>
> On Fri, Feb 24, 2017 at 5:35 AM, Cody Koeninger <co...@koeninger.org>
> wrote:
>
>> It's been a week since any further discussion.
>>
>> Do PMC members think the current draft is OK to vote on?
>>
>> On Fri, Feb 17, 2017 at 10:41 PM, vaquar khan <va...@gmail.com>
>> wrote:
>> > I like document and happy to see SPIP draft version however i feel
>> shepherd
>> > role is again hurdle in process improvement ,It's like everything
>> depends
>> > only on shepherd .
>> >
>> > Also want to add point that SPIP  should be time bound with define SLA
>> else
>> > will defeats purpose.
>> >
>> >
>> > Regards,
>> > Vaquar khan
>> >
>> > On Thu, Feb 16, 2017 at 3:26 PM, Ryan Blue <rb...@netflix.com.invalid>
>> > wrote:
>> >>
>> >> > [The shepherd] can advise on technical and procedural considerations
>> for
>> >> > people outside the community
>> >>
>> >> The sentiment is good, but this doesn't justify requiring a shepherd
>> for a
>> >> proposal. There are plenty of people that wouldn't need this, would get
>> >> feedback during discussion, or would ask a committer or PMC member if
>> it
>> >> weren't a formal requirement.
>> >>
>> >> > if no one is willing to be a shepherd, the proposed idea is probably
>> not
>> >> > going to receive much traction in the first place.
>> >>
>> >> This also doesn't sound like a reason for needing a shepherd. Saying
>> that
>> >> a shepherd probably won't hurt the process doesn't give me an idea of
>> why a
>> >> shepherd should be required in the first place.
>> >>
>> >> What was the motivation for adding a shepherd originally? It may not be
>> >> bad and it could be helpful, but neither of those makes me think that
>> they
>> >> should be required or else the proposal fails.
>> >>
>> >> rb
>> >>
>> >> On Thu, Feb 16, 2017 at 12:23 PM, Tim Hunter <timhunter@databricks.com
>> >
>> >> wrote:
>> >>>
>> >>> The doc looks good to me.
>> >>>
>> >>> Ryan, the role of the shepherd is to make sure that someone
>> >>> knowledgeable with Spark processes is involved: this person can advise
>> >>> on technical and procedural considerations for people outside the
>> >>> community. Also, if no one is willing to be a shepherd, the proposed
>> >>> idea is probably not going to receive much traction in the first
>> >>> place.
>> >>>
>> >>> Tim
>> >>>
>> >>> On Thu, Feb 16, 2017 at 9:17 AM, Cody Koeninger <co...@koeninger.org>
>> >>> wrote:
>> >>> > Reynold, thanks, LGTM.
>> >>> >
>> >>> > Sean, great concerns.  I agree that behavior is largely cultural and
>> >>> > writing down a process won't necessarily solve any problems one way
>> or
>> >>> > the other.  But one outwardly visible change I'm hoping for out of
>> >>> > this a way for people who have a stake in Spark, but can't follow
>> >>> > jiras closely, to go to the Spark website, see the list of proposed
>> >>> > major changes, contribute discussion on issues that are relevant to
>> >>> > their needs, and see a clear direction once a vote has passed.  We
>> >>> > don't have that now.
>> >>> >
>> >>> > Ryan, realistically speaking any PMC member can and will stop any
>> >>> > changes they don't like anyway, so might as well be up front about
>> the
>> >>> > reality of the situation.
>> >>> >
>> >>> > On Thu, Feb 16, 2017 at 10:43 AM, Sean Owen <so...@cloudera.com>
>> wrote:
>> >>> >> The text seems fine to me. Really, this is not describing a
>> >>> >> fundamentally
>> >>> >> new process, which is good. We've always had JIRAs, we've always
>> been
>> >>> >> able
>> >>> >> to call a VOTE for a big question. This just writes down a sensible
>> >>> >> set of
>> >>> >> guidelines for putting those two together when a major change is
>> >>> >> proposed. I
>> >>> >> look forward to turning some big JIRAs into a request for a SPIP.
>> >>> >>
>> >>> >> My only hesitation is that this seems to be perceived by some as a
>> new
>> >>> >> or
>> >>> >> different thing, that is supposed to solve some problems that
>> aren't
>> >>> >> otherwise solvable. I see mentioned problems like: clear process
>> for
>> >>> >> managing work, public communication, more committers, some sort of
>> >>> >> binding
>> >>> >> outcome and deadline.
>> >>> >>
>> >>> >> If SPIP is supposed to be a way to make people design in public
>> and a
>> >>> >> way to
>> >>> >> force attention to a particular change, then, this doesn't do that
>> by
>> >>> >> itself. Therefore I don't want to let a detailed discussion of SPIP
>> >>> >> detract
>> >>> >> from the discussion about doing what SPIP implies. It's just a
>> process
>> >>> >> document.
>> >>> >>
>> >>> >> Still, a fine step IMHO.
>> >>> >>
>> >>> >> On Thu, Feb 16, 2017 at 4:22 PM Reynold Xin <rx...@databricks.com>
>> >>> >> wrote:
>> >>> >>>
>> >>> >>> Updated. Any feedback from other community members?
>> >>> >>>
>> >>> >>>
>> >>> >>> On Wed, Feb 15, 2017 at 2:53 AM, Cody Koeninger <
>> cody@koeninger.org>
>> >>> >>> wrote:
>> >>> >>>>
>> >>> >>>> Thanks for doing that.
>> >>> >>>>
>> >>> >>>> Given that there are at least 4 different Apache voting
>> processes,
>> >>> >>>> "typical Apache vote process" isn't meaningful to me.
>> >>> >>>>
>> >>> >>>> I think the intention is that in order to pass, it needs at
>> least 3
>> >>> >>>> +1
>> >>> >>>> votes from PMC members *and no -1 votes from PMC members*.  But
>> the
>> >>> >>>> document
>> >>> >>>> doesn't explicitly say that second part.
>> >>> >>>>
>> >>> >>>> There's also no mention of the duration a vote should remain
>> open.
>> >>> >>>> There's a mention of a month for finding a shepherd, but that's
>> >>> >>>> different.
>> >>> >>>>
>> >>> >>>> Other than that, LGTM.
>> >>> >>>>
>> >>> >>>> On Mon, Feb 13, 2017 at 9:02 AM, Reynold Xin <
>> rxin@databricks.com>
>> >>> >>>> wrote:
>> >>> >>>>>
>> >>> >>>>> Here's a new draft that incorporated most of the feedback:
>> >>> >>>>>
>> >>> >>>>> https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-
>> nRanvXmnZ7SUi4qMljg/edit#
>> >>> >>>>>
>> >>> >>>>> I added a specific role for SPIP Author and another one for SPIP
>> >>> >>>>> Shepherd.
>> >>> >>>>>
>> >>> >>>>> On Sat, Feb 11, 2017 at 6:13 PM, Xiao Li <ga...@gmail.com>
>> >>> >>>>> wrote:
>> >>> >>>>>>
>> >>> >>>>>> During the summit, I also had a lot of discussions over similar
>> >>> >>>>>> topics
>> >>> >>>>>> with multiple Committers and active users. I heard many
>> fantastic
>> >>> >>>>>> ideas. I
>> >>> >>>>>> believe Spark improvement proposals are good channels to
>> collect
>> >>> >>>>>> the
>> >>> >>>>>> requirements/designs.
>> >>> >>>>>>
>> >>> >>>>>>
>> >>> >>>>>> IMO, we also need to consider the priority when working on
>> these
>> >>> >>>>>> items.
>> >>> >>>>>> Even if the proposal is accepted, it does not mean it will be
>> >>> >>>>>> implemented
>> >>> >>>>>> and merged immediately. It is not a FIFO queue.
>> >>> >>>>>>
>> >>> >>>>>>
>> >>> >>>>>> Even if some PRs are merged, sometimes, we still have to revert
>> >>> >>>>>> them
>> >>> >>>>>> back, if the design and implementation are not reviewed
>> carefully.
>> >>> >>>>>> We have
>> >>> >>>>>> to ensure our quality. Spark is not an application software.
>> It is
>> >>> >>>>>> an
>> >>> >>>>>> infrastructure software that is being used by many many
>> companies.
>> >>> >>>>>> We have
>> >>> >>>>>> to be very careful in the design and implementation, especially
>> >>> >>>>>> adding/changing the external APIs.
>> >>> >>>>>>
>> >>> >>>>>>
>> >>> >>>>>> When I developed the Mainframe infrastructure/middleware
>> software
>> >>> >>>>>> in
>> >>> >>>>>> the past 6 years, I were involved in the discussions with
>> >>> >>>>>> external/internal
>> >>> >>>>>> customers. The to-do feature list was always above 100.
>> Sometimes,
>> >>> >>>>>> the
>> >>> >>>>>> customers are feeling frustrated when we are unable to deliver
>> >>> >>>>>> them on time
>> >>> >>>>>> due to the resource limits and others. Even if they paid us
>> >>> >>>>>> billions, we
>> >>> >>>>>> still need to do it phase by phase or sometimes they have to
>> >>> >>>>>> accept the
>> >>> >>>>>> workarounds. That is the reality everyone has to face, I think.
>> >>> >>>>>>
>> >>> >>>>>>
>> >>> >>>>>> Thanks,
>> >>> >>>>>>
>> >>> >>>>>>
>> >>> >>>>>> Xiao Li
>> >>> >>>>>>>
>> >>> >>>>>>>
>> >>> >>
>> >>> >
>> >>> > ------------------------------------------------------------
>> ---------
>> >>> > To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>> >>> >
>> >>>
>> >>> ---------------------------------------------------------------------
>> >>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>> >>>
>> >>
>> >>
>> >>
>> >> --
>> >> Ryan Blue
>> >> Software Engineer
>> >> Netflix
>> >
>> >
>> >
>> >
>> > --
>> > Regards,
>> > Vaquar Khan
>> > +1 -224-436-0783
>> >
>> > IT Architect / Lead Consultant
>> > Greater Chicago
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>
>>
>
>
> --
>
> Joseph Bradley
>
> Software Engineer - Machine Learning
>
> Databricks, Inc.
>
> [image: http://databricks.com] <http://databricks.com/>
>



-- 
Ryan Blue
Software Engineer
Netflix

Re: Spark Improvement Proposals

Posted by Joseph Bradley <jo...@databricks.com>.

The current draft LGTM.  I agree some of the various concerns may need to
be addressed in the future, depending on how SPIPs progress in practice.
If others agree, let's put it to a vote and revisit the proposal in a few
months.
Joseph

On Fri, Feb 24, 2017 at 5:35 AM, Cody Koeninger <co...@koeninger.org> wrote:

> It's been a week since any further discussion.
>
> Do PMC members think the current draft is OK to vote on?
>
> On Fri, Feb 17, 2017 at 10:41 PM, vaquar khan <va...@gmail.com>
> wrote:
> > I like document and happy to see SPIP draft version however i feel
> shepherd
> > role is again hurdle in process improvement ,It's like everything depends
> > only on shepherd .
> >
> > Also want to add point that SPIP  should be time bound with define SLA
> else
> > will defeats purpose.
> >
> >
> > Regards,
> > Vaquar khan
> >
> > On Thu, Feb 16, 2017 at 3:26 PM, Ryan Blue <rb...@netflix.com.invalid>
> > wrote:
> >>
> >> > [The shepherd] can advise on technical and procedural considerations
> for
> >> > people outside the community
> >>
> >> The sentiment is good, but this doesn't justify requiring a shepherd
> for a
> >> proposal. There are plenty of people that wouldn't need this, would get
> >> feedback during discussion, or would ask a committer or PMC member if it
> >> weren't a formal requirement.
> >>
> >> > if no one is willing to be a shepherd, the proposed idea is probably
> not
> >> > going to receive much traction in the first place.
> >>
> >> This also doesn't sound like a reason for needing a shepherd. Saying
> that
> >> a shepherd probably won't hurt the process doesn't give me an idea of
> why a
> >> shepherd should be required in the first place.
> >>
> >> What was the motivation for adding a shepherd originally? It may not be
> >> bad and it could be helpful, but neither of those makes me think that
> they
> >> should be required or else the proposal fails.
> >>
> >> rb
> >>
> >> On Thu, Feb 16, 2017 at 12:23 PM, Tim Hunter <ti...@databricks.com>
> >> wrote:
> >>>
> >>> The doc looks good to me.
> >>>
> >>> Ryan, the role of the shepherd is to make sure that someone
> >>> knowledgeable with Spark processes is involved: this person can advise
> >>> on technical and procedural considerations for people outside the
> >>> community. Also, if no one is willing to be a shepherd, the proposed
> >>> idea is probably not going to receive much traction in the first
> >>> place.
> >>>
> >>> Tim
> >>>
> >>> On Thu, Feb 16, 2017 at 9:17 AM, Cody Koeninger <co...@koeninger.org>
> >>> wrote:
> >>> > Reynold, thanks, LGTM.
> >>> >
> >>> > Sean, great concerns.  I agree that behavior is largely cultural and
> >>> > writing down a process won't necessarily solve any problems one way
> or
> >>> > the other.  But one outwardly visible change I'm hoping for out of
> >>> > this a way for people who have a stake in Spark, but can't follow
> >>> > jiras closely, to go to the Spark website, see the list of proposed
> >>> > major changes, contribute discussion on issues that are relevant to
> >>> > their needs, and see a clear direction once a vote has passed.  We
> >>> > don't have that now.
> >>> >
> >>> > Ryan, realistically speaking any PMC member can and will stop any
> >>> > changes they don't like anyway, so might as well be up front about
> the
> >>> > reality of the situation.
> >>> >
> >>> > On Thu, Feb 16, 2017 at 10:43 AM, Sean Owen <so...@cloudera.com>
> wrote:
> >>> >> The text seems fine to me. Really, this is not describing a
> >>> >> fundamentally
> >>> >> new process, which is good. We've always had JIRAs, we've always
> been
> >>> >> able
> >>> >> to call a VOTE for a big question. This just writes down a sensible
> >>> >> set of
> >>> >> guidelines for putting those two together when a major change is
> >>> >> proposed. I
> >>> >> look forward to turning some big JIRAs into a request for a SPIP.
> >>> >>
> >>> >> My only hesitation is that this seems to be perceived by some as a
> new
> >>> >> or
> >>> >> different thing, that is supposed to solve some problems that aren't
> >>> >> otherwise solvable. I see mentioned problems like: clear process for
> >>> >> managing work, public communication, more committers, some sort of
> >>> >> binding
> >>> >> outcome and deadline.
> >>> >>
> >>> >> If SPIP is supposed to be a way to make people design in public and
> a
> >>> >> way to
> >>> >> force attention to a particular change, then, this doesn't do that
> by
> >>> >> itself. Therefore I don't want to let a detailed discussion of SPIP
> >>> >> detract
> >>> >> from the discussion about doing what SPIP implies. It's just a
> process
> >>> >> document.
> >>> >>
> >>> >> Still, a fine step IMHO.
> >>> >>
> >>> >> On Thu, Feb 16, 2017 at 4:22 PM Reynold Xin <rx...@databricks.com>
> >>> >> wrote:
> >>> >>>
> >>> >>> Updated. Any feedback from other community members?
> >>> >>>
> >>> >>>
> >>> >>> On Wed, Feb 15, 2017 at 2:53 AM, Cody Koeninger <
> cody@koeninger.org>
> >>> >>> wrote:
> >>> >>>>
> >>> >>>> Thanks for doing that.
> >>> >>>>
> >>> >>>> Given that there are at least 4 different Apache voting processes,
> >>> >>>> "typical Apache vote process" isn't meaningful to me.
> >>> >>>>
> >>> >>>> I think the intention is that in order to pass, it needs at least
> 3
> >>> >>>> +1
> >>> >>>> votes from PMC members *and no -1 votes from PMC members*.  But
> the
> >>> >>>> document
> >>> >>>> doesn't explicitly say that second part.
> >>> >>>>
> >>> >>>> There's also no mention of the duration a vote should remain open.
> >>> >>>> There's a mention of a month for finding a shepherd, but that's
> >>> >>>> different.
> >>> >>>>
> >>> >>>> Other than that, LGTM.
> >>> >>>>
> >>> >>>> On Mon, Feb 13, 2017 at 9:02 AM, Reynold Xin <rxin@databricks.com
> >
> >>> >>>> wrote:
> >>> >>>>>
> >>> >>>>> Here's a new draft that incorporated most of the feedback:
> >>> >>>>>
> >>> >>>>> https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-
> nRanvXmnZ7SUi4qMljg/edit#
> >>> >>>>>
> >>> >>>>> I added a specific role for SPIP Author and another one for SPIP
> >>> >>>>> Shepherd.
> >>> >>>>>
> >>> >>>>> On Sat, Feb 11, 2017 at 6:13 PM, Xiao Li <ga...@gmail.com>
> >>> >>>>> wrote:
> >>> >>>>>>
> >>> >>>>>> During the summit, I also had a lot of discussions over similar
> >>> >>>>>> topics
> >>> >>>>>> with multiple Committers and active users. I heard many
> fantastic
> >>> >>>>>> ideas. I
> >>> >>>>>> believe Spark improvement proposals are good channels to collect
> >>> >>>>>> the
> >>> >>>>>> requirements/designs.
> >>> >>>>>>
> >>> >>>>>>
> >>> >>>>>> IMO, we also need to consider the priority when working on these
> >>> >>>>>> items.
> >>> >>>>>> Even if the proposal is accepted, it does not mean it will be
> >>> >>>>>> implemented
> >>> >>>>>> and merged immediately. It is not a FIFO queue.
> >>> >>>>>>
> >>> >>>>>>
> >>> >>>>>> Even if some PRs are merged, sometimes, we still have to revert
> >>> >>>>>> them
> >>> >>>>>> back, if the design and implementation are not reviewed
> carefully.
> >>> >>>>>> We have
> >>> >>>>>> to ensure our quality. Spark is not an application software. It
> is
> >>> >>>>>> an
> >>> >>>>>> infrastructure software that is being used by many many
> companies.
> >>> >>>>>> We have
> >>> >>>>>> to be very careful in the design and implementation, especially
> >>> >>>>>> adding/changing the external APIs.
> >>> >>>>>>
> >>> >>>>>>
> >>> >>>>>> When I developed the Mainframe infrastructure/middleware
> software
> >>> >>>>>> in
> >>> >>>>>> the past 6 years, I were involved in the discussions with
> >>> >>>>>> external/internal
> >>> >>>>>> customers. The to-do feature list was always above 100.
> Sometimes,
> >>> >>>>>> the
> >>> >>>>>> customers are feeling frustrated when we are unable to deliver
> >>> >>>>>> them on time
> >>> >>>>>> due to the resource limits and others. Even if they paid us
> >>> >>>>>> billions, we
> >>> >>>>>> still need to do it phase by phase or sometimes they have to
> >>> >>>>>> accept the
> >>> >>>>>> workarounds. That is the reality everyone has to face, I think.
> >>> >>>>>>
> >>> >>>>>>
> >>> >>>>>> Thanks,
> >>> >>>>>>
> >>> >>>>>>
> >>> >>>>>> Xiao Li
> >>> >>>>>>>
> >>> >>>>>>>
> >>> >>
> >>> >
> >>> > ------------------------------------------------------------
> ---------
> >>> > To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
> >>> >
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
> >>>
> >>
> >>
> >>
> >> --
> >> Ryan Blue
> >> Software Engineer
> >> Netflix
> >
> >
> >
> >
> > --
> > Regards,
> > Vaquar Khan
> > +1 -224-436-0783
> >
> > IT Architect / Lead Consultant
> > Greater Chicago
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>
>


-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[image: http://databricks.com] <http://databricks.com/>

Re: Spark Improvement Proposals

Posted by Cody Koeninger <co...@koeninger.org>.

It's been a week since any further discussion.

Do PMC members think the current draft is OK to vote on?

On Fri, Feb 17, 2017 at 10:41 PM, vaquar khan <va...@gmail.com> wrote:
> I like document and happy to see SPIP draft version however i feel shepherd
> role is again hurdle in process improvement ,It's like everything depends
> only on shepherd .
>
> Also want to add point that SPIP  should be time bound with define SLA else
> will defeats purpose.
>
>
> Regards,
> Vaquar khan
>
> On Thu, Feb 16, 2017 at 3:26 PM, Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>>
>> > [The shepherd] can advise on technical and procedural considerations for
>> > people outside the community
>>
>> The sentiment is good, but this doesn't justify requiring a shepherd for a
>> proposal. There are plenty of people that wouldn't need this, would get
>> feedback during discussion, or would ask a committer or PMC member if it
>> weren't a formal requirement.
>>
>> > if no one is willing to be a shepherd, the proposed idea is probably not
>> > going to receive much traction in the first place.
>>
>> This also doesn't sound like a reason for needing a shepherd. Saying that
>> a shepherd probably won't hurt the process doesn't give me an idea of why a
>> shepherd should be required in the first place.
>>
>> What was the motivation for adding a shepherd originally? It may not be
>> bad and it could be helpful, but neither of those makes me think that they
>> should be required or else the proposal fails.
>>
>> rb
>>
>> On Thu, Feb 16, 2017 at 12:23 PM, Tim Hunter <ti...@databricks.com>
>> wrote:
>>>
>>> The doc looks good to me.
>>>
>>> Ryan, the role of the shepherd is to make sure that someone
>>> knowledgeable with Spark processes is involved: this person can advise
>>> on technical and procedural considerations for people outside the
>>> community. Also, if no one is willing to be a shepherd, the proposed
>>> idea is probably not going to receive much traction in the first
>>> place.
>>>
>>> Tim
>>>
>>> On Thu, Feb 16, 2017 at 9:17 AM, Cody Koeninger <co...@koeninger.org>
>>> wrote:
>>> > Reynold, thanks, LGTM.
>>> >
>>> > Sean, great concerns.  I agree that behavior is largely cultural and
>>> > writing down a process won't necessarily solve any problems one way or
>>> > the other.  But one outwardly visible change I'm hoping for out of
>>> > this a way for people who have a stake in Spark, but can't follow
>>> > jiras closely, to go to the Spark website, see the list of proposed
>>> > major changes, contribute discussion on issues that are relevant to
>>> > their needs, and see a clear direction once a vote has passed.  We
>>> > don't have that now.
>>> >
>>> > Ryan, realistically speaking any PMC member can and will stop any
>>> > changes they don't like anyway, so might as well be up front about the
>>> > reality of the situation.
>>> >
>>> > On Thu, Feb 16, 2017 at 10:43 AM, Sean Owen <so...@cloudera.com> wrote:
>>> >> The text seems fine to me. Really, this is not describing a
>>> >> fundamentally
>>> >> new process, which is good. We've always had JIRAs, we've always been
>>> >> able
>>> >> to call a VOTE for a big question. This just writes down a sensible
>>> >> set of
>>> >> guidelines for putting those two together when a major change is
>>> >> proposed. I
>>> >> look forward to turning some big JIRAs into a request for a SPIP.
>>> >>
>>> >> My only hesitation is that this seems to be perceived by some as a new
>>> >> or
>>> >> different thing, that is supposed to solve some problems that aren't
>>> >> otherwise solvable. I see mentioned problems like: clear process for
>>> >> managing work, public communication, more committers, some sort of
>>> >> binding
>>> >> outcome and deadline.
>>> >>
>>> >> If SPIP is supposed to be a way to make people design in public and a
>>> >> way to
>>> >> force attention to a particular change, then, this doesn't do that by
>>> >> itself. Therefore I don't want to let a detailed discussion of SPIP
>>> >> detract
>>> >> from the discussion about doing what SPIP implies. It's just a process
>>> >> document.
>>> >>
>>> >> Still, a fine step IMHO.
>>> >>
>>> >> On Thu, Feb 16, 2017 at 4:22 PM Reynold Xin <rx...@databricks.com>
>>> >> wrote:
>>> >>>
>>> >>> Updated. Any feedback from other community members?
>>> >>>
>>> >>>
>>> >>> On Wed, Feb 15, 2017 at 2:53 AM, Cody Koeninger <co...@koeninger.org>
>>> >>> wrote:
>>> >>>>
>>> >>>> Thanks for doing that.
>>> >>>>
>>> >>>> Given that there are at least 4 different Apache voting processes,
>>> >>>> "typical Apache vote process" isn't meaningful to me.
>>> >>>>
>>> >>>> I think the intention is that in order to pass, it needs at least 3
>>> >>>> +1
>>> >>>> votes from PMC members *and no -1 votes from PMC members*.  But the
>>> >>>> document
>>> >>>> doesn't explicitly say that second part.
>>> >>>>
>>> >>>> There's also no mention of the duration a vote should remain open.
>>> >>>> There's a mention of a month for finding a shepherd, but that's
>>> >>>> different.
>>> >>>>
>>> >>>> Other than that, LGTM.
>>> >>>>
>>> >>>> On Mon, Feb 13, 2017 at 9:02 AM, Reynold Xin <rx...@databricks.com>
>>> >>>> wrote:
>>> >>>>>
>>> >>>>> Here's a new draft that incorporated most of the feedback:
>>> >>>>>
>>> >>>>> https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#
>>> >>>>>
>>> >>>>> I added a specific role for SPIP Author and another one for SPIP
>>> >>>>> Shepherd.
>>> >>>>>
>>> >>>>> On Sat, Feb 11, 2017 at 6:13 PM, Xiao Li <ga...@gmail.com>
>>> >>>>> wrote:
>>> >>>>>>
>>> >>>>>> During the summit, I also had a lot of discussions over similar
>>> >>>>>> topics
>>> >>>>>> with multiple Committers and active users. I heard many fantastic
>>> >>>>>> ideas. I
>>> >>>>>> believe Spark improvement proposals are good channels to collect
>>> >>>>>> the
>>> >>>>>> requirements/designs.
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> IMO, we also need to consider the priority when working on these
>>> >>>>>> items.
>>> >>>>>> Even if the proposal is accepted, it does not mean it will be
>>> >>>>>> implemented
>>> >>>>>> and merged immediately. It is not a FIFO queue.
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> Even if some PRs are merged, sometimes, we still have to revert
>>> >>>>>> them
>>> >>>>>> back, if the design and implementation are not reviewed carefully.
>>> >>>>>> We have
>>> >>>>>> to ensure our quality. Spark is not an application software. It is
>>> >>>>>> an
>>> >>>>>> infrastructure software that is being used by many many companies.
>>> >>>>>> We have
>>> >>>>>> to be very careful in the design and implementation, especially
>>> >>>>>> adding/changing the external APIs.
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> When I developed the Mainframe infrastructure/middleware software
>>> >>>>>> in
>>> >>>>>> the past 6 years, I were involved in the discussions with
>>> >>>>>> external/internal
>>> >>>>>> customers. The to-do feature list was always above 100. Sometimes,
>>> >>>>>> the
>>> >>>>>> customers are feeling frustrated when we are unable to deliver
>>> >>>>>> them on time
>>> >>>>>> due to the resource limits and others. Even if they paid us
>>> >>>>>> billions, we
>>> >>>>>> still need to do it phase by phase or sometimes they have to
>>> >>>>>> accept the
>>> >>>>>> workarounds. That is the reality everyone has to face, I think.
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> Thanks,
>>> >>>>>>
>>> >>>>>>
>>> >>>>>> Xiao Li
>>> >>>>>>>
>>> >>>>>>>
>>> >>
>>> >
>>> > ---------------------------------------------------------------------
>>> > To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>> >
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>
>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>
>
>
>
> --
> Regards,
> Vaquar Khan
> +1 -224-436-0783
>
> IT Architect / Lead Consultant
> Greater Chicago

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org

Re: Spark Improvement Proposals

Posted by vaquar khan <va...@gmail.com>.

I like document and happy to see SPIP draft version however i feel shepherd
role is again hurdle in process improvement ,It's like everything depends
only on shepherd .

Also want to add point that SPIP  should be time bound with define SLA else
will defeats purpose.


Regards,
Vaquar khan

On Thu, Feb 16, 2017 at 3:26 PM, Ryan Blue <rb...@netflix.com.invalid>
wrote:

> > [The shepherd] can advise on technical and procedural considerations for
> people outside the community
>
> The sentiment is good, but this doesn't justify requiring a shepherd for a
> proposal. There are plenty of people that wouldn't need this, would get
> feedback during discussion, or would ask a committer or PMC member if it
> weren't a formal requirement.
>
> > if no one is willing to be a shepherd, the proposed idea is probably not
> going to receive much traction in the first place.
>
> This also doesn't sound like a reason for needing a shepherd. Saying that
> a shepherd probably won't hurt the process doesn't give me an idea of why a
> shepherd should be required in the first place.
>
> What was the motivation for adding a shepherd originally? It may not be
> bad and it could be helpful, but neither of those makes me think that they
> should be required or else the proposal fails.
>
> rb
>
> On Thu, Feb 16, 2017 at 12:23 PM, Tim Hunter <ti...@databricks.com>
> wrote:
>
>> The doc looks good to me.
>>
>> Ryan, the role of the shepherd is to make sure that someone
>> knowledgeable with Spark processes is involved: this person can advise
>> on technical and procedural considerations for people outside the
>> community. Also, if no one is willing to be a shepherd, the proposed
>> idea is probably not going to receive much traction in the first
>> place.
>>
>> Tim
>>
>> On Thu, Feb 16, 2017 at 9:17 AM, Cody Koeninger <co...@koeninger.org>
>> wrote:
>> > Reynold, thanks, LGTM.
>> >
>> > Sean, great concerns.  I agree that behavior is largely cultural and
>> > writing down a process won't necessarily solve any problems one way or
>> > the other.  But one outwardly visible change I'm hoping for out of
>> > this a way for people who have a stake in Spark, but can't follow
>> > jiras closely, to go to the Spark website, see the list of proposed
>> > major changes, contribute discussion on issues that are relevant to
>> > their needs, and see a clear direction once a vote has passed.  We
>> > don't have that now.
>> >
>> > Ryan, realistically speaking any PMC member can and will stop any
>> > changes they don't like anyway, so might as well be up front about the
>> > reality of the situation.
>> >
>> > On Thu, Feb 16, 2017 at 10:43 AM, Sean Owen <so...@cloudera.com> wrote:
>> >> The text seems fine to me. Really, this is not describing a
>> fundamentally
>> >> new process, which is good. We've always had JIRAs, we've always been
>> able
>> >> to call a VOTE for a big question. This just writes down a sensible
>> set of
>> >> guidelines for putting those two together when a major change is
>> proposed. I
>> >> look forward to turning some big JIRAs into a request for a SPIP.
>> >>
>> >> My only hesitation is that this seems to be perceived by some as a new
>> or
>> >> different thing, that is supposed to solve some problems that aren't
>> >> otherwise solvable. I see mentioned problems like: clear process for
>> >> managing work, public communication, more committers, some sort of
>> binding
>> >> outcome and deadline.
>> >>
>> >> If SPIP is supposed to be a way to make people design in public and a
>> way to
>> >> force attention to a particular change, then, this doesn't do that by
>> >> itself. Therefore I don't want to let a detailed discussion of SPIP
>> detract
>> >> from the discussion about doing what SPIP implies. It's just a process
>> >> document.
>> >>
>> >> Still, a fine step IMHO.
>> >>
>> >> On Thu, Feb 16, 2017 at 4:22 PM Reynold Xin <rx...@databricks.com>
>> wrote:
>> >>>
>> >>> Updated. Any feedback from other community members?
>> >>>
>> >>>
>> >>> On Wed, Feb 15, 2017 at 2:53 AM, Cody Koeninger <co...@koeninger.org>
>> >>> wrote:
>> >>>>
>> >>>> Thanks for doing that.
>> >>>>
>> >>>> Given that there are at least 4 different Apache voting processes,
>> >>>> "typical Apache vote process" isn't meaningful to me.
>> >>>>
>> >>>> I think the intention is that in order to pass, it needs at least 3
>> +1
>> >>>> votes from PMC members *and no -1 votes from PMC members*.  But the
>> document
>> >>>> doesn't explicitly say that second part.
>> >>>>
>> >>>> There's also no mention of the duration a vote should remain open.
>> >>>> There's a mention of a month for finding a shepherd, but that's
>> different.
>> >>>>
>> >>>> Other than that, LGTM.
>> >>>>
>> >>>> On Mon, Feb 13, 2017 at 9:02 AM, Reynold Xin <rx...@databricks.com>
>> wrote:
>> >>>>>
>> >>>>> Here's a new draft that incorporated most of the feedback:
>> >>>>> https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-
>> nRanvXmnZ7SUi4qMljg/edit#
>> >>>>>
>> >>>>> I added a specific role for SPIP Author and another one for SPIP
>> >>>>> Shepherd.
>> >>>>>
>> >>>>> On Sat, Feb 11, 2017 at 6:13 PM, Xiao Li <ga...@gmail.com>
>> wrote:
>> >>>>>>
>> >>>>>> During the summit, I also had a lot of discussions over similar
>> topics
>> >>>>>> with multiple Committers and active users. I heard many fantastic
>> ideas. I
>> >>>>>> believe Spark improvement proposals are good channels to collect
>> the
>> >>>>>> requirements/designs.
>> >>>>>>
>> >>>>>>
>> >>>>>> IMO, we also need to consider the priority when working on these
>> items.
>> >>>>>> Even if the proposal is accepted, it does not mean it will be
>> implemented
>> >>>>>> and merged immediately. It is not a FIFO queue.
>> >>>>>>
>> >>>>>>
>> >>>>>> Even if some PRs are merged, sometimes, we still have to revert
>> them
>> >>>>>> back, if the design and implementation are not reviewed carefully.
>> We have
>> >>>>>> to ensure our quality. Spark is not an application software. It is
>> an
>> >>>>>> infrastructure software that is being used by many many companies.
>> We have
>> >>>>>> to be very careful in the design and implementation, especially
>> >>>>>> adding/changing the external APIs.
>> >>>>>>
>> >>>>>>
>> >>>>>> When I developed the Mainframe infrastructure/middleware software
>> in
>> >>>>>> the past 6 years, I were involved in the discussions with
>> external/internal
>> >>>>>> customers. The to-do feature list was always above 100. Sometimes,
>> the
>> >>>>>> customers are feeling frustrated when we are unable to deliver
>> them on time
>> >>>>>> due to the resource limits and others. Even if they paid us
>> billions, we
>> >>>>>> still need to do it phase by phase or sometimes they have to
>> accept the
>> >>>>>> workarounds. That is the reality everyone has to face, I think.
>> >>>>>>
>> >>>>>>
>> >>>>>> Thanks,
>> >>>>>>
>> >>>>>>
>> >>>>>> Xiao Li
>> >>>>>>>
>> >>>>>>>
>> >>
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>



-- 
Regards,
Vaquar Khan
+1 -224-436-0783

IT Architect / Lead Consultant
Greater Chicago

Re: Spark Improvement Proposals

Posted by Ryan Blue <rb...@netflix.com.INVALID>.

> [The shepherd] can advise on technical and procedural considerations for
people outside the community

The sentiment is good, but this doesn't justify requiring a shepherd for a
proposal. There are plenty of people that wouldn't need this, would get
feedback during discussion, or would ask a committer or PMC member if it
weren't a formal requirement.

> if no one is willing to be a shepherd, the proposed idea is probably not
going to receive much traction in the first place.

This also doesn't sound like a reason for needing a shepherd. Saying that a
shepherd probably won't hurt the process doesn't give me an idea of why a
shepherd should be required in the first place.

What was the motivation for adding a shepherd originally? It may not be bad
and it could be helpful, but neither of those makes me think that they
should be required or else the proposal fails.

rb

On Thu, Feb 16, 2017 at 12:23 PM, Tim Hunter <ti...@databricks.com>
wrote:

> The doc looks good to me.
>
> Ryan, the role of the shepherd is to make sure that someone
> knowledgeable with Spark processes is involved: this person can advise
> on technical and procedural considerations for people outside the
> community. Also, if no one is willing to be a shepherd, the proposed
> idea is probably not going to receive much traction in the first
> place.
>
> Tim
>
> On Thu, Feb 16, 2017 at 9:17 AM, Cody Koeninger <co...@koeninger.org>
> wrote:
> > Reynold, thanks, LGTM.
> >
> > Sean, great concerns.  I agree that behavior is largely cultural and
> > writing down a process won't necessarily solve any problems one way or
> > the other.  But one outwardly visible change I'm hoping for out of
> > this a way for people who have a stake in Spark, but can't follow
> > jiras closely, to go to the Spark website, see the list of proposed
> > major changes, contribute discussion on issues that are relevant to
> > their needs, and see a clear direction once a vote has passed.  We
> > don't have that now.
> >
> > Ryan, realistically speaking any PMC member can and will stop any
> > changes they don't like anyway, so might as well be up front about the
> > reality of the situation.
> >
> > On Thu, Feb 16, 2017 at 10:43 AM, Sean Owen <so...@cloudera.com> wrote:
> >> The text seems fine to me. Really, this is not describing a
> fundamentally
> >> new process, which is good. We've always had JIRAs, we've always been
> able
> >> to call a VOTE for a big question. This just writes down a sensible set
> of
> >> guidelines for putting those two together when a major change is
> proposed. I
> >> look forward to turning some big JIRAs into a request for a SPIP.
> >>
> >> My only hesitation is that this seems to be perceived by some as a new
> or
> >> different thing, that is supposed to solve some problems that aren't
> >> otherwise solvable. I see mentioned problems like: clear process for
> >> managing work, public communication, more committers, some sort of
> binding
> >> outcome and deadline.
> >>
> >> If SPIP is supposed to be a way to make people design in public and a
> way to
> >> force attention to a particular change, then, this doesn't do that by
> >> itself. Therefore I don't want to let a detailed discussion of SPIP
> detract
> >> from the discussion about doing what SPIP implies. It's just a process
> >> document.
> >>
> >> Still, a fine step IMHO.
> >>
> >> On Thu, Feb 16, 2017 at 4:22 PM Reynold Xin <rx...@databricks.com>
> wrote:
> >>>
> >>> Updated. Any feedback from other community members?
> >>>
> >>>
> >>> On Wed, Feb 15, 2017 at 2:53 AM, Cody Koeninger <co...@koeninger.org>
> >>> wrote:
> >>>>
> >>>> Thanks for doing that.
> >>>>
> >>>> Given that there are at least 4 different Apache voting processes,
> >>>> "typical Apache vote process" isn't meaningful to me.
> >>>>
> >>>> I think the intention is that in order to pass, it needs at least 3 +1
> >>>> votes from PMC members *and no -1 votes from PMC members*.  But the
> document
> >>>> doesn't explicitly say that second part.
> >>>>
> >>>> There's also no mention of the duration a vote should remain open.
> >>>> There's a mention of a month for finding a shepherd, but that's
> different.
> >>>>
> >>>> Other than that, LGTM.
> >>>>
> >>>> On Mon, Feb 13, 2017 at 9:02 AM, Reynold Xin <rx...@databricks.com>
> wrote:
> >>>>>
> >>>>> Here's a new draft that incorporated most of the feedback:
> >>>>> https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-
> nRanvXmnZ7SUi4qMljg/edit#
> >>>>>
> >>>>> I added a specific role for SPIP Author and another one for SPIP
> >>>>> Shepherd.
> >>>>>
> >>>>> On Sat, Feb 11, 2017 at 6:13 PM, Xiao Li <ga...@gmail.com>
> wrote:
> >>>>>>
> >>>>>> During the summit, I also had a lot of discussions over similar
> topics
> >>>>>> with multiple Committers and active users. I heard many fantastic
> ideas. I
> >>>>>> believe Spark improvement proposals are good channels to collect the
> >>>>>> requirements/designs.
> >>>>>>
> >>>>>>
> >>>>>> IMO, we also need to consider the priority when working on these
> items.
> >>>>>> Even if the proposal is accepted, it does not mean it will be
> implemented
> >>>>>> and merged immediately. It is not a FIFO queue.
> >>>>>>
> >>>>>>
> >>>>>> Even if some PRs are merged, sometimes, we still have to revert them
> >>>>>> back, if the design and implementation are not reviewed carefully.
> We have
> >>>>>> to ensure our quality. Spark is not an application software. It is
> an
> >>>>>> infrastructure software that is being used by many many companies.
> We have
> >>>>>> to be very careful in the design and implementation, especially
> >>>>>> adding/changing the external APIs.
> >>>>>>
> >>>>>>
> >>>>>> When I developed the Mainframe infrastructure/middleware software in
> >>>>>> the past 6 years, I were involved in the discussions with
> external/internal
> >>>>>> customers. The to-do feature list was always above 100. Sometimes,
> the
> >>>>>> customers are feeling frustrated when we are unable to deliver them
> on time
> >>>>>> due to the resource limits and others. Even if they paid us
> billions, we
> >>>>>> still need to do it phase by phase or sometimes they have to accept
> the
> >>>>>> workarounds. That is the reality everyone has to face, I think.
> >>>>>>
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>>
> >>>>>> Xiao Li
> >>>>>>>
> >>>>>>>
> >>
> >
> > ---------------------------------------------------------------------
> > To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
> >
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>
>


-- 
Ryan Blue
Software Engineer
Netflix

Re: Spark Improvement Proposals

Posted by Sam Elamin <hu...@gmail.com>.

Hi Folks

I thought id chime in as someone new to the process so feel free to
disregard it if it doesn't make sense.

I definitely agree that we need a new forum to identify or discuss changes
as JIRA isnt exactly the best place to do that, its a Bug tracker first and
foremost.

For example I was cycling in through the JIRAs to see if theres anything I
can contribute to and there isnt one overall story or goal. There are bugs
or wish lists but I am talking overall high level goals or wishes.

If the point of the SPIP is to focus the discussions on the problems but
also encourage non PMC to be more active in the project the requirement
that every single idea must have a shepherd might do more harm than good.

As Cody mentioned, any PMC member can veto a change, thats fine for
stopping potential detrimental changes, but if this process is hoping to
open the flood gates to allow literally anyone to come up with an idea or
at least facilitate conversations on positive changes then I foresee that
the PMCs will be far too stretched to govern it, since the number of ideas
or discussions will be a number of magnitudes greater than the number of
PMCs who can manage it. We will end up needing to get "project managers"
and "scrum masters" and nobody wants that!

Hence it would deter anyone from raising ideas in the first place unless
they are willing to find a shepherd for it. It reminds me of the need to
raise a BOC (Business Operation Case) when I worked in big corporations.
Overall it reduces morale since by default this isn't necessarily the
average engineers strong point. There is also the issue that the PMCs might
just be too busy to shepherd all the ideas regardless of merit so potential
great additions might die off purely because there just arent enough PMCs
around to take time out of their already busy schedules and give all the
SPIPs the attention they deserve

My point is this, allow anyone to raise ideas, facilitate discussions
proposals in a safe environment

At the end of the day  PMC members will have the guide/veto anything since
they have the experience.

I hope I was able to articulate what I meant, I really am loving working on
Spark and I think the future looks very promising for it and very much look
forward to being involved in the evolution of it

Kind Regards
Sam




On Thu, Feb 16, 2017 at 8:23 PM, Tim Hunter <ti...@databricks.com>
wrote:

> The doc looks good to me.
>
> Ryan, the role of the shepherd is to make sure that someone
> knowledgeable with Spark processes is involved: this person can advise
> on technical and procedural considerations for people outside the
> community. Also, if no one is willing to be a shepherd, the proposed
> idea is probably not going to receive much traction in the first
> place.
>
> Tim
>
> On Thu, Feb 16, 2017 at 9:17 AM, Cody Koeninger <co...@koeninger.org>
> wrote:
> > Reynold, thanks, LGTM.
> >
> > Sean, great concerns.  I agree that behavior is largely cultural and
> > writing down a process won't necessarily solve any problems one way or
> > the other.  But one outwardly visible change I'm hoping for out of
> > this a way for people who have a stake in Spark, but can't follow
> > jiras closely, to go to the Spark website, see the list of proposed
> > major changes, contribute discussion on issues that are relevant to
> > their needs, and see a clear direction once a vote has passed.  We
> > don't have that now.
> >
> > Ryan, realistically speaking any PMC member can and will stop any
> > changes they don't like anyway, so might as well be up front about the
> > reality of the situation.
> >
> > On Thu, Feb 16, 2017 at 10:43 AM, Sean Owen <so...@cloudera.com> wrote:
> >> The text seems fine to me. Really, this is not describing a
> fundamentally
> >> new process, which is good. We've always had JIRAs, we've always been
> able
> >> to call a VOTE for a big question. This just writes down a sensible set
> of
> >> guidelines for putting those two together when a major change is
> proposed. I
> >> look forward to turning some big JIRAs into a request for a SPIP.
> >>
> >> My only hesitation is that this seems to be perceived by some as a new
> or
> >> different thing, that is supposed to solve some problems that aren't
> >> otherwise solvable. I see mentioned problems like: clear process for
> >> managing work, public communication, more committers, some sort of
> binding
> >> outcome and deadline.
> >>
> >> If SPIP is supposed to be a way to make people design in public and a
> way to
> >> force attention to a particular change, then, this doesn't do that by
> >> itself. Therefore I don't want to let a detailed discussion of SPIP
> detract
> >> from the discussion about doing what SPIP implies. It's just a process
> >> document.
> >>
> >> Still, a fine step IMHO.
> >>
> >> On Thu, Feb 16, 2017 at 4:22 PM Reynold Xin <rx...@databricks.com>
> wrote:
> >>>
> >>> Updated. Any feedback from other community members?
> >>>
> >>>
> >>> On Wed, Feb 15, 2017 at 2:53 AM, Cody Koeninger <co...@koeninger.org>
> >>> wrote:
> >>>>
> >>>> Thanks for doing that.
> >>>>
> >>>> Given that there are at least 4 different Apache voting processes,
> >>>> "typical Apache vote process" isn't meaningful to me.
> >>>>
> >>>> I think the intention is that in order to pass, it needs at least 3 +1
> >>>> votes from PMC members *and no -1 votes from PMC members*.  But the
> document
> >>>> doesn't explicitly say that second part.
> >>>>
> >>>> There's also no mention of the duration a vote should remain open.
> >>>> There's a mention of a month for finding a shepherd, but that's
> different.
> >>>>
> >>>> Other than that, LGTM.
> >>>>
> >>>> On Mon, Feb 13, 2017 at 9:02 AM, Reynold Xin <rx...@databricks.com>
> wrote:
> >>>>>
> >>>>> Here's a new draft that incorporated most of the feedback:
> >>>>> https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-
> nRanvXmnZ7SUi4qMljg/edit#
> >>>>>
> >>>>> I added a specific role for SPIP Author and another one for SPIP
> >>>>> Shepherd.
> >>>>>
> >>>>> On Sat, Feb 11, 2017 at 6:13 PM, Xiao Li <ga...@gmail.com>
> wrote:
> >>>>>>
> >>>>>> During the summit, I also had a lot of discussions over similar
> topics
> >>>>>> with multiple Committers and active users. I heard many fantastic
> ideas. I
> >>>>>> believe Spark improvement proposals are good channels to collect the
> >>>>>> requirements/designs.
> >>>>>>
> >>>>>>
> >>>>>> IMO, we also need to consider the priority when working on these
> items.
> >>>>>> Even if the proposal is accepted, it does not mean it will be
> implemented
> >>>>>> and merged immediately. It is not a FIFO queue.
> >>>>>>
> >>>>>>
> >>>>>> Even if some PRs are merged, sometimes, we still have to revert them
> >>>>>> back, if the design and implementation are not reviewed carefully.
> We have
> >>>>>> to ensure our quality. Spark is not an application software. It is
> an
> >>>>>> infrastructure software that is being used by many many companies.
> We have
> >>>>>> to be very careful in the design and implementation, especially
> >>>>>> adding/changing the external APIs.
> >>>>>>
> >>>>>>
> >>>>>> When I developed the Mainframe infrastructure/middleware software in
> >>>>>> the past 6 years, I were involved in the discussions with
> external/internal
> >>>>>> customers. The to-do feature list was always above 100. Sometimes,
> the
> >>>>>> customers are feeling frustrated when we are unable to deliver them
> on time
> >>>>>> due to the resource limits and others. Even if they paid us
> billions, we
> >>>>>> still need to do it phase by phase or sometimes they have to accept
> the
> >>>>>> workarounds. That is the reality everyone has to face, I think.
> >>>>>>
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>>
> >>>>>> Xiao Li
> >>>>>>>
> >>>>>>>
> >>
> >
> > ---------------------------------------------------------------------
> > To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
> >
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>
>

Re: Spark Improvement Proposals

Posted by Tim Hunter <ti...@databricks.com>.

The doc looks good to me.

Ryan, the role of the shepherd is to make sure that someone
knowledgeable with Spark processes is involved: this person can advise
on technical and procedural considerations for people outside the
community. Also, if no one is willing to be a shepherd, the proposed
idea is probably not going to receive much traction in the first
place.

Tim

On Thu, Feb 16, 2017 at 9:17 AM, Cody Koeninger <co...@koeninger.org> wrote:
> Reynold, thanks, LGTM.
>
> Sean, great concerns.  I agree that behavior is largely cultural and
> writing down a process won't necessarily solve any problems one way or
> the other.  But one outwardly visible change I'm hoping for out of
> this a way for people who have a stake in Spark, but can't follow
> jiras closely, to go to the Spark website, see the list of proposed
> major changes, contribute discussion on issues that are relevant to
> their needs, and see a clear direction once a vote has passed.  We
> don't have that now.
>
> Ryan, realistically speaking any PMC member can and will stop any
> changes they don't like anyway, so might as well be up front about the
> reality of the situation.
>
> On Thu, Feb 16, 2017 at 10:43 AM, Sean Owen <so...@cloudera.com> wrote:
>> The text seems fine to me. Really, this is not describing a fundamentally
>> new process, which is good. We've always had JIRAs, we've always been able
>> to call a VOTE for a big question. This just writes down a sensible set of
>> guidelines for putting those two together when a major change is proposed. I
>> look forward to turning some big JIRAs into a request for a SPIP.
>>
>> My only hesitation is that this seems to be perceived by some as a new or
>> different thing, that is supposed to solve some problems that aren't
>> otherwise solvable. I see mentioned problems like: clear process for
>> managing work, public communication, more committers, some sort of binding
>> outcome and deadline.
>>
>> If SPIP is supposed to be a way to make people design in public and a way to
>> force attention to a particular change, then, this doesn't do that by
>> itself. Therefore I don't want to let a detailed discussion of SPIP detract
>> from the discussion about doing what SPIP implies. It's just a process
>> document.
>>
>> Still, a fine step IMHO.
>>
>> On Thu, Feb 16, 2017 at 4:22 PM Reynold Xin <rx...@databricks.com> wrote:
>>>
>>> Updated. Any feedback from other community members?
>>>
>>>
>>> On Wed, Feb 15, 2017 at 2:53 AM, Cody Koeninger <co...@koeninger.org>
>>> wrote:
>>>>
>>>> Thanks for doing that.
>>>>
>>>> Given that there are at least 4 different Apache voting processes,
>>>> "typical Apache vote process" isn't meaningful to me.
>>>>
>>>> I think the intention is that in order to pass, it needs at least 3 +1
>>>> votes from PMC members *and no -1 votes from PMC members*.  But the document
>>>> doesn't explicitly say that second part.
>>>>
>>>> There's also no mention of the duration a vote should remain open.
>>>> There's a mention of a month for finding a shepherd, but that's different.
>>>>
>>>> Other than that, LGTM.
>>>>
>>>> On Mon, Feb 13, 2017 at 9:02 AM, Reynold Xin <rx...@databricks.com> wrote:
>>>>>
>>>>> Here's a new draft that incorporated most of the feedback:
>>>>> https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#
>>>>>
>>>>> I added a specific role for SPIP Author and another one for SPIP
>>>>> Shepherd.
>>>>>
>>>>> On Sat, Feb 11, 2017 at 6:13 PM, Xiao Li <ga...@gmail.com> wrote:
>>>>>>
>>>>>> During the summit, I also had a lot of discussions over similar topics
>>>>>> with multiple Committers and active users. I heard many fantastic ideas. I
>>>>>> believe Spark improvement proposals are good channels to collect the
>>>>>> requirements/designs.
>>>>>>
>>>>>>
>>>>>> IMO, we also need to consider the priority when working on these items.
>>>>>> Even if the proposal is accepted, it does not mean it will be implemented
>>>>>> and merged immediately. It is not a FIFO queue.
>>>>>>
>>>>>>
>>>>>> Even if some PRs are merged, sometimes, we still have to revert them
>>>>>> back, if the design and implementation are not reviewed carefully. We have
>>>>>> to ensure our quality. Spark is not an application software. It is an
>>>>>> infrastructure software that is being used by many many companies. We have
>>>>>> to be very careful in the design and implementation, especially
>>>>>> adding/changing the external APIs.
>>>>>>
>>>>>>
>>>>>> When I developed the Mainframe infrastructure/middleware software in
>>>>>> the past 6 years, I were involved in the discussions with external/internal
>>>>>> customers. The to-do feature list was always above 100. Sometimes, the
>>>>>> customers are feeling frustrated when we are unable to deliver them on time
>>>>>> due to the resource limits and others. Even if they paid us billions, we
>>>>>> still need to do it phase by phase or sometimes they have to accept the
>>>>>> workarounds. That is the reality everyone has to face, I think.
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>>
>>>>>> Xiao Li
>>>>>>>
>>>>>>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org

Re: Spark Improvement Proposals

Posted by Cody Koeninger <co...@koeninger.org>.

Reynold, thanks, LGTM.

Sean, great concerns.  I agree that behavior is largely cultural and
writing down a process won't necessarily solve any problems one way or
the other.  But one outwardly visible change I'm hoping for out of
this a way for people who have a stake in Spark, but can't follow
jiras closely, to go to the Spark website, see the list of proposed
major changes, contribute discussion on issues that are relevant to
their needs, and see a clear direction once a vote has passed.  We
don't have that now.

Ryan, realistically speaking any PMC member can and will stop any
changes they don't like anyway, so might as well be up front about the
reality of the situation.

On Thu, Feb 16, 2017 at 10:43 AM, Sean Owen <so...@cloudera.com> wrote:
> The text seems fine to me. Really, this is not describing a fundamentally
> new process, which is good. We've always had JIRAs, we've always been able
> to call a VOTE for a big question. This just writes down a sensible set of
> guidelines for putting those two together when a major change is proposed. I
> look forward to turning some big JIRAs into a request for a SPIP.
>
> My only hesitation is that this seems to be perceived by some as a new or
> different thing, that is supposed to solve some problems that aren't
> otherwise solvable. I see mentioned problems like: clear process for
> managing work, public communication, more committers, some sort of binding
> outcome and deadline.
>
> If SPIP is supposed to be a way to make people design in public and a way to
> force attention to a particular change, then, this doesn't do that by
> itself. Therefore I don't want to let a detailed discussion of SPIP detract
> from the discussion about doing what SPIP implies. It's just a process
> document.
>
> Still, a fine step IMHO.
>
> On Thu, Feb 16, 2017 at 4:22 PM Reynold Xin <rx...@databricks.com> wrote:
>>
>> Updated. Any feedback from other community members?
>>
>>
>> On Wed, Feb 15, 2017 at 2:53 AM, Cody Koeninger <co...@koeninger.org>
>> wrote:
>>>
>>> Thanks for doing that.
>>>
>>> Given that there are at least 4 different Apache voting processes,
>>> "typical Apache vote process" isn't meaningful to me.
>>>
>>> I think the intention is that in order to pass, it needs at least 3 +1
>>> votes from PMC members *and no -1 votes from PMC members*.  But the document
>>> doesn't explicitly say that second part.
>>>
>>> There's also no mention of the duration a vote should remain open.
>>> There's a mention of a month for finding a shepherd, but that's different.
>>>
>>> Other than that, LGTM.
>>>
>>> On Mon, Feb 13, 2017 at 9:02 AM, Reynold Xin <rx...@databricks.com> wrote:
>>>>
>>>> Here's a new draft that incorporated most of the feedback:
>>>> https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#
>>>>
>>>> I added a specific role for SPIP Author and another one for SPIP
>>>> Shepherd.
>>>>
>>>> On Sat, Feb 11, 2017 at 6:13 PM, Xiao Li <ga...@gmail.com> wrote:
>>>>>
>>>>> During the summit, I also had a lot of discussions over similar topics
>>>>> with multiple Committers and active users. I heard many fantastic ideas. I
>>>>> believe Spark improvement proposals are good channels to collect the
>>>>> requirements/designs.
>>>>>
>>>>>
>>>>> IMO, we also need to consider the priority when working on these items.
>>>>> Even if the proposal is accepted, it does not mean it will be implemented
>>>>> and merged immediately. It is not a FIFO queue.
>>>>>
>>>>>
>>>>> Even if some PRs are merged, sometimes, we still have to revert them
>>>>> back, if the design and implementation are not reviewed carefully. We have
>>>>> to ensure our quality. Spark is not an application software. It is an
>>>>> infrastructure software that is being used by many many companies. We have
>>>>> to be very careful in the design and implementation, especially
>>>>> adding/changing the external APIs.
>>>>>
>>>>>
>>>>> When I developed the Mainframe infrastructure/middleware software in
>>>>> the past 6 years, I were involved in the discussions with external/internal
>>>>> customers. The to-do feature list was always above 100. Sometimes, the
>>>>> customers are feeling frustrated when we are unable to deliver them on time
>>>>> due to the resource limits and others. Even if they paid us billions, we
>>>>> still need to do it phase by phase or sometimes they have to accept the
>>>>> workarounds. That is the reality everyone has to face, I think.
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>>
>>>>> Xiao Li
>>>>>>
>>>>>>
>

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org

Re: Spark Improvement Proposals

Posted by Sean Owen <so...@cloudera.com>.

The text seems fine to me. Really, this is not describing a fundamentally
new process, which is good. We've always had JIRAs, we've always been able
to call a VOTE for a big question. This just writes down a sensible set of
guidelines for putting those two together when a major change is proposed.
I look forward to turning some big JIRAs into a request for a SPIP.

My only hesitation is that this seems to be perceived by some as a new or
different thing, that is supposed to solve some problems that aren't
otherwise solvable. I see mentioned problems like: clear process for
managing work, public communication, more committers, some sort of binding
outcome and deadline.

If SPIP is supposed to be a way to make people design in public and a way
to force attention to a particular change, then, this doesn't do that by
itself. Therefore I don't want to let a detailed discussion of SPIP detract
from the discussion about doing what SPIP implies. It's just a process
document.

Still, a fine step IMHO.

On Thu, Feb 16, 2017 at 4:22 PM Reynold Xin <rx...@databricks.com> wrote:

> Updated. Any feedback from other community members?
>
>
> On Wed, Feb 15, 2017 at 2:53 AM, Cody Koeninger <co...@koeninger.org>
> wrote:
>
> Thanks for doing that.
>
> Given that there are at least 4 different Apache voting processes,
> "typical Apache vote process" isn't meaningful to me.
>
> I think the intention is that in order to pass, it needs at least 3 +1
> votes from PMC members *and no -1 votes from PMC members*.  But the
> document doesn't explicitly say that second part.
>
> There's also no mention of the duration a vote should remain open.
> There's a mention of a month for finding a shepherd, but that's different.
>
> Other than that, LGTM.
>
> On Mon, Feb 13, 2017 at 9:02 AM, Reynold Xin <rx...@databricks.com> wrote:
>
> Here's a new draft that incorporated most of the feedback:
> https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#
>
> I added a specific role for SPIP Author and another one for SPIP Shepherd.
>
> On Sat, Feb 11, 2017 at 6:13 PM, Xiao Li <ga...@gmail.com> wrote:
>
> During the summit, I also had a lot of discussions over similar topics
> with multiple Committers and active users. I heard many fantastic ideas. I
> believe Spark improvement proposals are good channels to collect the
> requirements/designs.
>
>
> IMO, we also need to consider the priority when working on these items.
> Even if the proposal is accepted, it does not mean it will be implemented
> and merged immediately. It is not a FIFO queue.
>
>
> Even if some PRs are merged, sometimes, we still have to revert them back,
> if the design and implementation are not reviewed carefully. We have to
> ensure our quality. Spark is not an application software. It is an
> infrastructure software that is being used by many many companies. We have
> to be very careful in the design and implementation, especially
> adding/changing the external APIs.
>
>
> When I developed the Mainframe infrastructure/middleware software in the
> past 6 years, I were involved in the discussions with external/internal
> customers. The to-do feature list was always above 100. Sometimes, the
> customers are feeling frustrated when we are unable to deliver them on time
> due to the resource limits and others. Even if they paid us billions, we
> still need to do it phase by phase or sometimes they have to accept the
> workarounds. That is the reality everyone has to face, I think.
>
>
> Thanks,
>
>
> Xiao Li
>
>
>

Re: Spark Improvement Proposals

Posted by Reynold Xin <rx...@databricks.com>.

Updated. Any feedback from other community members?


On Wed, Feb 15, 2017 at 2:53 AM, Cody Koeninger <co...@koeninger.org> wrote:

> Thanks for doing that.
>
> Given that there are at least 4 different Apache voting processes,
> "typical Apache vote process" isn't meaningful to me.
>
> I think the intention is that in order to pass, it needs at least 3 +1
> votes from PMC members *and no -1 votes from PMC members*.  But the
> document doesn't explicitly say that second part.
>
> There's also no mention of the duration a vote should remain open.
> There's a mention of a month for finding a shepherd, but that's different.
>
> Other than that, LGTM.
>
> On Mon, Feb 13, 2017 at 9:02 AM, Reynold Xin <rx...@databricks.com> wrote:
>
>> Here's a new draft that incorporated most of the feedback:
>> https://docs.google.com/document/d/1-Zdi_W-wtuxS9h
>> TK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#
>>
>> I added a specific role for SPIP Author and another one for SPIP Shepherd.
>>
>> On Sat, Feb 11, 2017 at 6:13 PM, Xiao Li <ga...@gmail.com> wrote:
>>
>>> During the summit, I also had a lot of discussions over similar topics
>>> with multiple Committers and active users. I heard many fantastic ideas. I
>>> believe Spark improvement proposals are good channels to collect the
>>> requirements/designs.
>>>
>>>
>>> IMO, we also need to consider the priority when working on these items.
>>> Even if the proposal is accepted, it does not mean it will be implemented
>>> and merged immediately. It is not a FIFO queue.
>>>
>>>
>>> Even if some PRs are merged, sometimes, we still have to revert them
>>> back, if the design and implementation are not reviewed carefully. We have
>>> to ensure our quality. Spark is not an application software. It is an
>>> infrastructure software that is being used by many many companies. We have
>>> to be very careful in the design and implementation, especially
>>> adding/changing the external APIs.
>>>
>>>
>>> When I developed the Mainframe infrastructure/middleware software in the
>>> past 6 years, I were involved in the discussions with external/internal
>>> customers. The to-do feature list was always above 100. Sometimes, the
>>> customers are feeling frustrated when we are unable to deliver them on time
>>> due to the resource limits and others. Even if they paid us billions, we
>>> still need to do it phase by phase or sometimes they have to accept the
>>> workarounds. That is the reality everyone has to face, I think.
>>>
>>>
>>> Thanks,
>>>
>>>
>>> Xiao Li
>>>
>>> 2017-02-11 7:57 GMT-08:00 Cody Koeninger <co...@koeninger.org>:
>>>
>>>> At the spark summit this week, everyone from PMC members to users I had
>>>> never met before were asking me about the Spark improvement proposals
>>>> idea.  It's clear that it's a real community need.
>>>>
>>>> But it's been almost half a year, and nothing visible has been done.
>>>>
>>>> Reynold, are you going to do this?
>>>>
>>>> If so, when?
>>>>
>>>> If not, why?
>>>>
>>>> You already did the right thing by including long-deserved committers.
>>>> Please keep doing the right thing for the community.
>>>>
>>>> On Wed, Jan 11, 2017 at 4:13 AM, Reynold Xin <rx...@databricks.com>
>>>> wrote:
>>>>
>>>>> +1 on all counts (consensus, time bound, define roles)
>>>>>
>>>>> I can update the doc in the next few days and share back. Then maybe
>>>>> we can just officially vote on this. As Tim suggested, we might not get it
>>>>> 100% right the first time and would need to re-iterate. But that's fine.
>>>>>
>>>>>
>>>>> On Thu, Jan 5, 2017 at 3:29 PM, Tim Hunter <ti...@databricks.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Cody,
>>>>>> thank you for bringing up this topic, I agree it is very important to
>>>>>> keep a cohesive community around some common, fluid goals. Here are a few
>>>>>> comments about the current document:
>>>>>>
>>>>>> 1. name: it should not overlap with an existing one such as SIP. Can
>>>>>> you imagine someone trying to discuss a scala spore proposal for spark?
>>>>>> "[Spark] SIP-3 is intended to evolve in tandem with [Scala] SIP-21". SPIP
>>>>>> sounds great.
>>>>>>
>>>>>> 2. roles: at a high level, SPIPs are meant to reach consensus for
>>>>>> technical decisions with a lasting impact. As such, the template should
>>>>>> emphasize the role of the various parties during this process:
>>>>>>
>>>>>>  - the SPIP author is responsible for building consensus. She is the
>>>>>> champion driving the process forward and is responsible for ensuring that
>>>>>> the SPIP follows the general guidelines. The author should be identified in
>>>>>> the SPIP. The authorship of a SPIP can be transferred if the current author
>>>>>> is not interested and someone else wants to move the SPIP forward. There
>>>>>> should probably be 2-3 authors at most for each SPIP.
>>>>>>
>>>>>>  - someone with voting power should probably shepherd the SPIP (and
>>>>>> be recorded as such): ensuring that the final decision over the SPIP is
>>>>>> recorded (rejected, accepted, etc.), and advising about the technical
>>>>>> quality of the SPIP: this person need not be a champion for the SPIP or
>>>>>> contribute to it, but rather makes sure it stands a chance of being
>>>>>> approved when the vote happens. Also, if the author cannot find anyone who
>>>>>> would want to take this role, this proposal is likely to be rejected anyway.
>>>>>>
>>>>>>  - users, committers, contributors have the roles already outlined in
>>>>>> the document
>>>>>>
>>>>>> 3. timeline: ideally, once a SPIP has been offered for voting, it
>>>>>> should move swiftly into either being accepted or rejected, so that we do
>>>>>> not end up with a distracting long tail of half-hearted proposals.
>>>>>>
>>>>>> These rules are meant to be flexible, but the current document should
>>>>>> be clear about who is in charge of a SPIP, and the state it is currently in.
>>>>>>
>>>>>> We have had long discussions over some very important questions such
>>>>>> as approval. I do not have an opinion on these, but why not make a pick and
>>>>>> reevaluate this decision later? This is not a binding process at this point.
>>>>>>
>>>>>> Tim
>>>>>>
>>>>>>
>>>>>> On Tue, Jan 3, 2017 at 3:16 PM, Cody Koeninger <co...@koeninger.org>
>>>>>> wrote:
>>>>>>
>>>>>>> I don't have a concern about voting vs consensus.
>>>>>>>
>>>>>>> I have a concern that whatever the decision making process is, it is
>>>>>>> explicitly announced on the ticket for the given proposal, with an explicit
>>>>>>> deadline, and an explicit outcome.
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jan 3, 2017 at 4:08 PM, Imran Rashid <ir...@cloudera.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I'm also in favor of this.  Thanks for your persistence Cody.
>>>>>>>>
>>>>>>>> My take on the specific issues Joseph mentioned:
>>>>>>>>
>>>>>>>> 1) voting vs. consensus -- I agree with the argument Ryan Blue made
>>>>>>>> earlier for consensus:
>>>>>>>>
>>>>>>>> > Majority vs consensus: My rationale is that I don't think we want
>>>>>>>> to consider a proposal approved if it had objections serious enough that
>>>>>>>> committers down-voted (or PMC depending on who gets a vote). If these
>>>>>>>> proposals are like PEPs, then they represent a significant amount of
>>>>>>>> community effort and I wouldn't want to move forward if up to half of the
>>>>>>>> community thinks it's an untenable idea.
>>>>>>>>
>>>>>>>> 2) Design doc template -- agree this would be useful, but also
>>>>>>>> seems totally orthogonal to moving forward on the SIP proposal.
>>>>>>>>
>>>>>>>> 3) agree w/ Joseph's proposal for updating the template.
>>>>>>>>
>>>>>>>> One small addition:
>>>>>>>>
>>>>>>>> 4) Deciding on a name -- minor, but I think its wroth
>>>>>>>> disambiguating from Scala's SIPs, and the best proposal I've heard is
>>>>>>>> "SPIP".   At least, no one has objected.  (don't care enough that I'd
>>>>>>>> object to anything else, though.)
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Jan 3, 2017 at 3:30 PM, Joseph Bradley <
>>>>>>>> joseph@databricks.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Cody,
>>>>>>>>>
>>>>>>>>> Thanks for being persistent about this.  I too would like to see
>>>>>>>>> this happen.  Reviewing the thread, it sounds like the main things
>>>>>>>>> remaining are:
>>>>>>>>> * Decide about a few issues
>>>>>>>>> * Finalize the doc(s)
>>>>>>>>> * Vote on this proposal
>>>>>>>>>
>>>>>>>>> Issues & TODOs:
>>>>>>>>>
>>>>>>>>> (1) The main issue I see above is voting vs. consensus.  I have
>>>>>>>>> little preference here.  It sounds like something which could be tailored
>>>>>>>>> based on whether we see too many or too few SIPs being approved.
>>>>>>>>>
>>>>>>>>> (2) Design doc template  (This would be great to have for Spark
>>>>>>>>> regardless of this SIP discussion.)
>>>>>>>>> * Reynold, are you still putting this together?
>>>>>>>>>
>>>>>>>>> (3) Template cleanups.  Listing some items mentioned above + a new
>>>>>>>>> one w.r.t. Reynold's draft
>>>>>>>>> <https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#>
>>>>>>>>> :
>>>>>>>>> * Reinstate the "Where" section with links to current and past SIPs
>>>>>>>>> * Add field for stating explicit deadlines for approval
>>>>>>>>> * Add field for stating Author & Committer shepherd
>>>>>>>>>
>>>>>>>>> Thanks all!
>>>>>>>>> Joseph
>>>>>>>>>
>>>>>>>>> On Mon, Jan 2, 2017 at 7:45 AM, Cody Koeninger <cody@koeninger.org
>>>>>>>>> > wrote:
>>>>>>>>>
>>>>>>>>>> I'm bumping this one more time for the new year, and then I'm
>>>>>>>>>> giving up.
>>>>>>>>>>
>>>>>>>>>> Please, fix your process, even if it isn't exactly the way I
>>>>>>>>>> suggested.
>>>>>>>>>>
>>>>>>>>>> On Tue, Nov 8, 2016 at 11:14 AM, Ryan Blue <rb...@netflix.com>
>>>>>>>>>> wrote:
>>>>>>>>>> > On lazy consensus as opposed to voting:
>>>>>>>>>> >
>>>>>>>>>> > First, why lazy consensus? The proposal was for consensus,
>>>>>>>>>> which is at least
>>>>>>>>>> > three +1 votes and no vetos. Consensus has no losing side, it
>>>>>>>>>> requires
>>>>>>>>>> > getting to a point where there is agreement. Isn't that
>>>>>>>>>> agreement what we
>>>>>>>>>> > want to achieve with these proposals?
>>>>>>>>>> >
>>>>>>>>>> > Second, lazy consensus only removes the requirement for three
>>>>>>>>>> +1 votes. Why
>>>>>>>>>> > would we not want at least three committers to think something
>>>>>>>>>> is a good
>>>>>>>>>> > idea before adopting the proposal?
>>>>>>>>>> >
>>>>>>>>>> > rb
>>>>>>>>>> >
>>>>>>>>>> > On Tue, Nov 8, 2016 at 8:13 AM, Cody Koeninger <
>>>>>>>>>> cody@koeninger.org> wrote:
>>>>>>>>>> >>
>>>>>>>>>> >> So there are some minor things (the Where section heading
>>>>>>>>>> appears to
>>>>>>>>>> >> be dropped; wherever this document is posted it needs to
>>>>>>>>>> actually link
>>>>>>>>>> >> to a jira filter showing current / past SIPs) but it doesn't
>>>>>>>>>> look like
>>>>>>>>>> >> I can comment on the google doc.
>>>>>>>>>> >>
>>>>>>>>>> >> The major substantive issue that I have is that this version is
>>>>>>>>>> >> significantly less clear as to the outcome of an SIP.
>>>>>>>>>> >>
>>>>>>>>>> >> The apache example of lazy consensus at
>>>>>>>>>> >> http://apache.org/foundation/voting.html#LazyConsensus
>>>>>>>>>> involves an
>>>>>>>>>> >> explicit announcement of an explicit deadline, which I think
>>>>>>>>>> are
>>>>>>>>>> >> necessary for clarity.
>>>>>>>>>> >>
>>>>>>>>>> >>
>>>>>>>>>> >>
>>>>>>>>>> >> On Mon, Nov 7, 2016 at 1:55 PM, Reynold Xin <
>>>>>>>>>> rxin@databricks.com> wrote:
>>>>>>>>>> >> > It turned out suggested edits (trackable) don't show up for
>>>>>>>>>> non-owners,
>>>>>>>>>> >> > so
>>>>>>>>>> >> > I've just merged all the edits in place. It should be
>>>>>>>>>> visible now.
>>>>>>>>>> >> >
>>>>>>>>>> >> > On Mon, Nov 7, 2016 at 10:10 AM, Reynold Xin <
>>>>>>>>>> rxin@databricks.com>
>>>>>>>>>> >> > wrote:
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> Oops. Let me try figure that out.
>>>>>>>>>> >> >>
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> On Monday, November 7, 2016, Cody Koeninger <
>>>>>>>>>> cody@koeninger.org> wrote:
>>>>>>>>>> >> >>>
>>>>>>>>>> >> >>> Thanks for picking up on this.
>>>>>>>>>> >> >>>
>>>>>>>>>> >> >>> Maybe I fail at google docs, but I can't see any edits on
>>>>>>>>>> the document
>>>>>>>>>> >> >>> you linked.
>>>>>>>>>> >> >>>
>>>>>>>>>> >> >>> Regarding lazy consensus, if the board in general has less
>>>>>>>>>> of an issue
>>>>>>>>>> >> >>> with that, sure.  As long as it is clearly announced,
>>>>>>>>>> lasts at least
>>>>>>>>>> >> >>> 72 hours, and has a clear outcome.
>>>>>>>>>> >> >>>
>>>>>>>>>> >> >>> The other points are hard to comment on without being able
>>>>>>>>>> to see the
>>>>>>>>>> >> >>> text in question.
>>>>>>>>>> >> >>>
>>>>>>>>>> >> >>>
>>>>>>>>>> >> >>> On Mon, Nov 7, 2016 at 3:11 AM, Reynold Xin <
>>>>>>>>>> rxin@databricks.com>
>>>>>>>>>> >> >>> wrote:
>>>>>>>>>> >> >>> > I just looked through the entire thread again tonight -
>>>>>>>>>> there are a
>>>>>>>>>> >> >>> > lot
>>>>>>>>>> >> >>> > of
>>>>>>>>>> >> >>> > great ideas being discussed. Thanks Cody for taking the
>>>>>>>>>> first crack
>>>>>>>>>> >> >>> > at
>>>>>>>>>> >> >>> > the
>>>>>>>>>> >> >>> > proposal.
>>>>>>>>>> >> >>> >
>>>>>>>>>> >> >>> > I want to first comment on the context. Spark is one of
>>>>>>>>>> the most
>>>>>>>>>> >> >>> > innovative
>>>>>>>>>> >> >>> > and important projects in (big) data -- overall
>>>>>>>>>> technical decisions
>>>>>>>>>> >> >>> > made in
>>>>>>>>>> >> >>> > Apache Spark are sound. But of course, a project as
>>>>>>>>>> large and active
>>>>>>>>>> >> >>> > as
>>>>>>>>>> >> >>> > Spark always have room for improvement, and we as a
>>>>>>>>>> community should
>>>>>>>>>> >> >>> > strive
>>>>>>>>>> >> >>> > to take it to the next level.
>>>>>>>>>> >> >>> >
>>>>>>>>>> >> >>> > To that end, the two biggest areas for improvements in
>>>>>>>>>> my opinion
>>>>>>>>>> >> >>> > are:
>>>>>>>>>> >> >>> >
>>>>>>>>>> >> >>> > 1. Visibility: There are so much happening that it is
>>>>>>>>>> difficult to
>>>>>>>>>> >> >>> > know
>>>>>>>>>> >> >>> > what
>>>>>>>>>> >> >>> > really is going on. For people that don't follow
>>>>>>>>>> closely, it is
>>>>>>>>>> >> >>> > difficult to
>>>>>>>>>> >> >>> > know what the important initiatives are. Even for people
>>>>>>>>>> that do
>>>>>>>>>> >> >>> > follow, it
>>>>>>>>>> >> >>> > is difficult to know what specific things require their
>>>>>>>>>> attention,
>>>>>>>>>> >> >>> > since the
>>>>>>>>>> >> >>> > number of pull requests and JIRA tickets are high and
>>>>>>>>>> it's difficult
>>>>>>>>>> >> >>> > to
>>>>>>>>>> >> >>> > extract signal from noise.
>>>>>>>>>> >> >>> >
>>>>>>>>>> >> >>> > 2. Solicit user (broadly defined, including developers
>>>>>>>>>> themselves)
>>>>>>>>>> >> >>> > input
>>>>>>>>>> >> >>> > more proactively: At the end of the day the project
>>>>>>>>>> provides value
>>>>>>>>>> >> >>> > because
>>>>>>>>>> >> >>> > users use it. Users can't tell us exactly what to build,
>>>>>>>>>> but it is
>>>>>>>>>> >> >>> > important
>>>>>>>>>> >> >>> > to get their inputs.
>>>>>>>>>> >> >>> >
>>>>>>>>>> >> >>> >
>>>>>>>>>> >> >>> > I've taken Cody's doc and edited it:
>>>>>>>>>> >> >>> >
>>>>>>>>>> >> >>> >
>>>>>>>>>> >> >>> > https://docs.google.com/docume
>>>>>>>>>> nt/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#headi
>>>>>>>>>> ng=h.36ut37zh7w2b
>>>>>>>>>> >> >>> > (I've made all my modifications trackable)
>>>>>>>>>> >> >>> >
>>>>>>>>>> >> >>> > There are couple high level changes I made:
>>>>>>>>>> >> >>> >
>>>>>>>>>> >> >>> > 1. I've consulted a board member and he recommended lazy
>>>>>>>>>> consensus
>>>>>>>>>> >> >>> > as
>>>>>>>>>> >> >>> > opposed to voting. The reason being in voting there can
>>>>>>>>>> easily be a
>>>>>>>>>> >> >>> > "loser'
>>>>>>>>>> >> >>> > that gets outvoted.
>>>>>>>>>> >> >>> >
>>>>>>>>>> >> >>> > 2. I made it lighter weight, and renamed "strategy" to
>>>>>>>>>> "optional
>>>>>>>>>> >> >>> > design
>>>>>>>>>> >> >>> > sketch". Echoing one of the earlier email: "IMHO so far
>>>>>>>>>> aside from
>>>>>>>>>> >> >>> > tagging
>>>>>>>>>> >> >>> > things and linking them elsewhere simply having design
>>>>>>>>>> docs and
>>>>>>>>>> >> >>> > prototypes
>>>>>>>>>> >> >>> > implementations in PRs is not something that has not
>>>>>>>>>> worked so far".
>>>>>>>>>> >> >>> >
>>>>>>>>>> >> >>> > 3. I made some the language tweaks to focus more on
>>>>>>>>>> visibility. For
>>>>>>>>>> >> >>> > example,
>>>>>>>>>> >> >>> > "The purpose of an SIP is to inform and involve", rather
>>>>>>>>>> than just
>>>>>>>>>> >> >>> > "involve". SIPs should also have at least two emails
>>>>>>>>>> that go to
>>>>>>>>>> >> >>> > dev@.
>>>>>>>>>> >> >>> >
>>>>>>>>>> >> >>> >
>>>>>>>>>> >> >>> > While I was editing this, I thought we really needed a
>>>>>>>>>> suggested
>>>>>>>>>> >> >>> > template
>>>>>>>>>> >> >>> > for design doc too. I will get to that too ...
>>>>>>>>>> >> >>> >
>>>>>>>>>> >> >>> >
>>>>>>>>>> >> >>> > On Tue, Nov 1, 2016 at 12:09 AM, Reynold Xin <
>>>>>>>>>> rxin@databricks.com>
>>>>>>>>>> >> >>> > wrote:
>>>>>>>>>> >> >>> >>
>>>>>>>>>> >> >>> >> Most things looked OK to me too, although I do plan to
>>>>>>>>>> take a
>>>>>>>>>> >> >>> >> closer
>>>>>>>>>> >> >>> >> look
>>>>>>>>>> >> >>> >> after Nov 1st when we cut the release branch for 2.1.
>>>>>>>>>> >> >>> >>
>>>>>>>>>> >> >>> >>
>>>>>>>>>> >> >>> >> On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin
>>>>>>>>>> >> >>> >> <va...@cloudera.com>
>>>>>>>>>> >> >>> >> wrote:
>>>>>>>>>> >> >>> >>>
>>>>>>>>>> >> >>> >>> The proposal looks OK to me. I assume, even though
>>>>>>>>>> it's not
>>>>>>>>>> >> >>> >>> explicitly
>>>>>>>>>> >> >>> >>> called, that voting would happen by e-mail? A template
>>>>>>>>>> for the
>>>>>>>>>> >> >>> >>> proposal document (instead of just a bullet nice)
>>>>>>>>>> would also be
>>>>>>>>>> >> >>> >>> nice,
>>>>>>>>>> >> >>> >>> but that can be done at any time.
>>>>>>>>>> >> >>> >>>
>>>>>>>>>> >> >>> >>> BTW, shameless plug: I filed SPARK-18085 which I
>>>>>>>>>> consider a
>>>>>>>>>> >> >>> >>> candidate
>>>>>>>>>> >> >>> >>> for a SIP, given the scope of the work. The document
>>>>>>>>>> attached even
>>>>>>>>>> >> >>> >>> somewhat matches the proposed format. So if anyone
>>>>>>>>>> wants to try
>>>>>>>>>> >> >>> >>> out
>>>>>>>>>> >> >>> >>> the process...
>>>>>>>>>> >> >>> >>>
>>>>>>>>>> >> >>> >>> On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger
>>>>>>>>>> >> >>> >>> <co...@koeninger.org>
>>>>>>>>>> >> >>> >>> wrote:
>>>>>>>>>> >> >>> >>> > Now that spark summit europe is over, are any
>>>>>>>>>> committers
>>>>>>>>>> >> >>> >>> > interested
>>>>>>>>>> >> >>> >>> > in
>>>>>>>>>> >> >>> >>> > moving forward with this?
>>>>>>>>>> >> >>> >>> >
>>>>>>>>>> >> >>> >>> >
>>>>>>>>>> >> >>> >>> >
>>>>>>>>>> >> >>> >>> >
>>>>>>>>>> >> >>> >>> > https://github.com/koeninger/s
>>>>>>>>>> park-1/blob/SIP-0/docs/spark-improvement-proposals.md
>>>>>>>>>> >> >>> >>> >
>>>>>>>>>> >> >>> >>> > Or are we going to let this discussion die on the
>>>>>>>>>> vine?
>>>>>>>>>> >> >>> >>> >
>>>>>>>>>> >> >>> >>> > On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda
>>>>>>>>>> >> >>> >>> > <to...@outlook.com> wrote:
>>>>>>>>>> >> >>> >>> >> Maybe my mail was not clear enough.
>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>> >> >>> >>> >> I didn't want to write "lets focus on Flink" or any
>>>>>>>>>> other
>>>>>>>>>> >> >>> >>> >> framework.
>>>>>>>>>> >> >>> >>> >> The
>>>>>>>>>> >> >>> >>> >> idea with benchmarks was to show two things:
>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>> >> >>> >>> >> - why some people are doing bad PR for Spark
>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>> >> >>> >>> >> - how - in easy way - we can change it and show
>>>>>>>>>> that Spark is
>>>>>>>>>> >> >>> >>> >> still on
>>>>>>>>>> >> >>> >>> >> the
>>>>>>>>>> >> >>> >>> >> top
>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>> >> >>> >>> >> No more, no less. Benchmarks will be helpful, but I
>>>>>>>>>> don't think
>>>>>>>>>> >> >>> >>> >> they're the
>>>>>>>>>> >> >>> >>> >> most important thing in Spark :) On the Spark main
>>>>>>>>>> page there
>>>>>>>>>> >> >>> >>> >> is
>>>>>>>>>> >> >>> >>> >> still
>>>>>>>>>> >> >>> >>> >> chart
>>>>>>>>>> >> >>> >>> >> "Spark vs Hadoop". It is important to show that
>>>>>>>>>> framework is
>>>>>>>>>> >> >>> >>> >> not
>>>>>>>>>> >> >>> >>> >> the
>>>>>>>>>> >> >>> >>> >> same
>>>>>>>>>> >> >>> >>> >> Spark with other API, but much faster and
>>>>>>>>>> optimized, comparable
>>>>>>>>>> >> >>> >>> >> or
>>>>>>>>>> >> >>> >>> >> even
>>>>>>>>>> >> >>> >>> >> faster than other frameworks.
>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>> >> >>> >>> >> About real-time streaming, I think it would be just
>>>>>>>>>> good to see
>>>>>>>>>> >> >>> >>> >> it
>>>>>>>>>> >> >>> >>> >> in
>>>>>>>>>> >> >>> >>> >> Spark.
>>>>>>>>>> >> >>> >>> >> I very like current Spark model, but many voices
>>>>>>>>>> that says "we
>>>>>>>>>> >> >>> >>> >> need
>>>>>>>>>> >> >>> >>> >> more" -
>>>>>>>>>> >> >>> >>> >> community should listen also them and try to help
>>>>>>>>>> them. With
>>>>>>>>>> >> >>> >>> >> SIPs
>>>>>>>>>> >> >>> >>> >> it
>>>>>>>>>> >> >>> >>> >> would
>>>>>>>>>> >> >>> >>> >> be easier, I've just posted this example as "thing
>>>>>>>>>> that may be
>>>>>>>>>> >> >>> >>> >> changed
>>>>>>>>>> >> >>> >>> >> with
>>>>>>>>>> >> >>> >>> >> SIP".
>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>> >> >>> >>> >> I very like unification via Datasets, but there is
>>>>>>>>>> a lot of
>>>>>>>>>> >> >>> >>> >> algorithms
>>>>>>>>>> >> >>> >>> >> inside - let's make easy API, but with strong
>>>>>>>>>> background
>>>>>>>>>> >> >>> >>> >> (articles,
>>>>>>>>>> >> >>> >>> >> benchmarks, descriptions, etc) that shows that
>>>>>>>>>> Spark is still
>>>>>>>>>> >> >>> >>> >> modern
>>>>>>>>>> >> >>> >>> >> framework.
>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>> >> >>> >>> >> Maybe now my intention will be clearer :) As I said
>>>>>>>>>> >> >>> >>> >> organizational
>>>>>>>>>> >> >>> >>> >> ideas
>>>>>>>>>> >> >>> >>> >> were already mentioned and I agree with them, my
>>>>>>>>>> mail was just
>>>>>>>>>> >> >>> >>> >> to
>>>>>>>>>> >> >>> >>> >> show
>>>>>>>>>> >> >>> >>> >> some
>>>>>>>>>> >> >>> >>> >> aspects from my side, so from theside of developer
>>>>>>>>>> and person
>>>>>>>>>> >> >>> >>> >> who
>>>>>>>>>> >> >>> >>> >> is
>>>>>>>>>> >> >>> >>> >> trying
>>>>>>>>>> >> >>> >>> >> to help others with Spark (via StackOverflow or
>>>>>>>>>> other ways)
>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>> >> >>> >>> >> Pozdrawiam / Best regards,
>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>> >> >>> >>> >> Tomasz
>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>> >> >>> >>> >> ________________________________
>>>>>>>>>> >> >>> >>> >> Od: Cody Koeninger <co...@koeninger.org>
>>>>>>>>>> >> >>> >>> >> Wysłane: 17 października 2016 16:46
>>>>>>>>>> >> >>> >>> >> Do: Debasish Das
>>>>>>>>>> >> >>> >>> >> DW: Tomasz Gawęda; dev@spark.apache.org
>>>>>>>>>> >> >>> >>> >> Temat: Re: Spark Improvement Proposals
>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>> >> >>> >>> >> I think narrowly focusing on Flink or benchmarks is
>>>>>>>>>> missing my
>>>>>>>>>> >> >>> >>> >> point.
>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>> >> >>> >>> >> My point is evolve or die.  Spark's governance and
>>>>>>>>>> organization
>>>>>>>>>> >> >>> >>> >> is
>>>>>>>>>> >> >>> >>> >> hampering its ability to evolve technologically,
>>>>>>>>>> and it needs
>>>>>>>>>> >> >>> >>> >> to
>>>>>>>>>> >> >>> >>> >> change.
>>>>>>>>>> >> >>> >>> >>
>>>>>>>>>> >> >>> >>> >> On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das
>>>>>>>>>> >> >>> >>> >> <de...@gmail.com>
>>>>>>>>>> >> >>> >>> >> wrote:
>>>>>>>>>> >> >>> >>> >>> Thanks Cody for bringing up a valid point...I
>>>>>>>>>> picked up Spark
>>>>>>>>>> >> >>> >>> >>> in
>>>>>>>>>> >> >>> >>> >>> 2014
>>>>>>>>>> >> >>> >>> >>> as
>>>>>>>>>> >> >>> >>> >>> soon as I looked into it since compared to writing
>>>>>>>>>> Java
>>>>>>>>>> >> >>> >>> >>> map-reduce
>>>>>>>>>> >> >>> >>> >>> and
>>>>>>>>>> >> >>> >>> >>> Cascading code, Spark made writing distributed
>>>>>>>>>> code fun...But
>>>>>>>>>> >> >>> >>> >>> now
>>>>>>>>>> >> >>> >>> >>> as
>>>>>>>>>> >> >>> >>> >>> we
>>>>>>>>>> >> >>> >>> >>> went
>>>>>>>>>> >> >>> >>> >>> deeper with Spark and real-time streaming use-case
>>>>>>>>>> gets more
>>>>>>>>>> >> >>> >>> >>> prominent, I
>>>>>>>>>> >> >>> >>> >>> think it is time to bring a messaging model in
>>>>>>>>>> conjunction
>>>>>>>>>> >> >>> >>> >>> with
>>>>>>>>>> >> >>> >>> >>> the
>>>>>>>>>> >> >>> >>> >>> batch/micro-batch API that Spark is good
>>>>>>>>>> at....akka-streams
>>>>>>>>>> >> >>> >>> >>> close
>>>>>>>>>> >> >>> >>> >>> integration with spark micro-batching APIs looks
>>>>>>>>>> like a great
>>>>>>>>>> >> >>> >>> >>> direction to
>>>>>>>>>> >> >>> >>> >>> stay in the game with Apache Flink...Spark 2.0
>>>>>>>>>> integrated
>>>>>>>>>> >> >>> >>> >>> streaming
>>>>>>>>>> >> >>> >>> >>> with
>>>>>>>>>> >> >>> >>> >>> batch with the assumption is that micro-batching
>>>>>>>>>> is sufficient
>>>>>>>>>> >> >>> >>> >>> to
>>>>>>>>>> >> >>> >>> >>> run
>>>>>>>>>> >> >>> >>> >>> SQL
>>>>>>>>>> >> >>> >>> >>> commands on stream but do we really have time to
>>>>>>>>>> do SQL
>>>>>>>>>> >> >>> >>> >>> processing at
>>>>>>>>>> >> >>> >>> >>> streaming data within 1-2 seconds ?
>>>>>>>>>> >> >>> >>> >>>
>>>>>>>>>> >> >>> >>> >>> After reading the email chain, I started to look
>>>>>>>>>> into Flink
>>>>>>>>>> >> >>> >>> >>> documentation
>>>>>>>>>> >> >>> >>> >>> and if you compare it with Spark documentation, I
>>>>>>>>>> think we
>>>>>>>>>> >> >>> >>> >>> have
>>>>>>>>>> >> >>> >>> >>> major
>>>>>>>>>> >> >>> >>> >>> work
>>>>>>>>>> >> >>> >>> >>> to do detailing out Spark internals so that more
>>>>>>>>>> people from
>>>>>>>>>> >> >>> >>> >>> community
>>>>>>>>>> >> >>> >>> >>> start
>>>>>>>>>> >> >>> >>> >>> to take active role in improving the issues so
>>>>>>>>>> that Spark
>>>>>>>>>> >> >>> >>> >>> stays
>>>>>>>>>> >> >>> >>> >>> strong
>>>>>>>>>> >> >>> >>> >>> compared to Flink.
>>>>>>>>>> >> >>> >>> >>>
>>>>>>>>>> >> >>> >>> >>>
>>>>>>>>>> >> >>> >>> >>> https://cwiki.apache.org/confl
>>>>>>>>>> uence/display/SPARK/Spark+Internals
>>>>>>>>>> >> >>> >>> >>>
>>>>>>>>>> >> >>> >>> >>>
>>>>>>>>>> >> >>> >>> >>> https://cwiki.apache.org/confl
>>>>>>>>>> uence/display/FLINK/Flink+Internals
>>>>>>>>>> >> >>> >>> >>>
>>>>>>>>>> >> >>> >>> >>> Spark is no longer an engine that works for
>>>>>>>>>> micro-batch and
>>>>>>>>>> >> >>> >>> >>> batch...We
>>>>>>>>>> >> >>> >>> >>> (and
>>>>>>>>>> >> >>> >>> >>> I am sure many others) are pushing spark as an
>>>>>>>>>> engine for
>>>>>>>>>> >> >>> >>> >>> stream
>>>>>>>>>> >> >>> >>> >>> and
>>>>>>>>>> >> >>> >>> >>> query
>>>>>>>>>> >> >>> >>> >>> processing.....we need to make it a
>>>>>>>>>> state-of-the-art engine
>>>>>>>>>> >> >>> >>> >>> for
>>>>>>>>>> >> >>> >>> >>> high
>>>>>>>>>> >> >>> >>> >>> speed
>>>>>>>>>> >> >>> >>> >>> streaming data and user queries as well !
>>>>>>>>>> >> >>> >>> >>>
>>>>>>>>>> >> >>> >>> >>> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda
>>>>>>>>>> >> >>> >>> >>> <to...@outlook.com>
>>>>>>>>>> >> >>> >>> >>> wrote:
>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>> >> >>> >>> >>>> Hi everyone,
>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>> >> >>> >>> >>>> I'm quite late with my answer, but I think my
>>>>>>>>>> suggestions may
>>>>>>>>>> >> >>> >>> >>>> help a
>>>>>>>>>> >> >>> >>> >>>> little bit. :) Many technical and organizational
>>>>>>>>>> topics were
>>>>>>>>>> >> >>> >>> >>>> mentioned,
>>>>>>>>>> >> >>> >>> >>>> but I want to focus on these negative posts about
>>>>>>>>>> Spark and
>>>>>>>>>> >> >>> >>> >>>> about
>>>>>>>>>> >> >>> >>> >>>> "haters"
>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>> >> >>> >>> >>>> I really like Spark. Easy of use, speed, very
>>>>>>>>>> good community
>>>>>>>>>> >> >>> >>> >>>> -
>>>>>>>>>> >> >>> >>> >>>> it's
>>>>>>>>>> >> >>> >>> >>>> everything here. But Every project has to
>>>>>>>>>> "flight" on
>>>>>>>>>> >> >>> >>> >>>> "framework
>>>>>>>>>> >> >>> >>> >>>> market"
>>>>>>>>>> >> >>> >>> >>>> to be still no 1. I'm following many Spark and
>>>>>>>>>> Big Data
>>>>>>>>>> >> >>> >>> >>>> communities,
>>>>>>>>>> >> >>> >>> >>>> maybe my mail will inspire someone :)
>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>> >> >>> >>> >>>> You (every Spark developer; so far I didn't have
>>>>>>>>>> enough time
>>>>>>>>>> >> >>> >>> >>>> to
>>>>>>>>>> >> >>> >>> >>>> join
>>>>>>>>>> >> >>> >>> >>>> contributing to Spark) has done excellent job. So
>>>>>>>>>> why are
>>>>>>>>>> >> >>> >>> >>>> some
>>>>>>>>>> >> >>> >>> >>>> people
>>>>>>>>>> >> >>> >>> >>>> saying that Flink (or other framework) is better,
>>>>>>>>>> like it was
>>>>>>>>>> >> >>> >>> >>>> posted
>>>>>>>>>> >> >>> >>> >>>> in
>>>>>>>>>> >> >>> >>> >>>> this mailing list? No, not because that framework
>>>>>>>>>> is better
>>>>>>>>>> >> >>> >>> >>>> in
>>>>>>>>>> >> >>> >>> >>>> all
>>>>>>>>>> >> >>> >>> >>>> cases.. In my opinion, many of these discussions
>>>>>>>>>> where
>>>>>>>>>> >> >>> >>> >>>> started
>>>>>>>>>> >> >>> >>> >>>> after
>>>>>>>>>> >> >>> >>> >>>> Flink marketing-like posts. Please look at
>>>>>>>>>> StackOverflow
>>>>>>>>>> >> >>> >>> >>>> "Flink
>>>>>>>>>> >> >>> >>> >>>> vs
>>>>>>>>>> >> >>> >>> >>>> ...."
>>>>>>>>>> >> >>> >>> >>>> posts, almost every post in "winned" by Flink.
>>>>>>>>>> Answers are
>>>>>>>>>> >> >>> >>> >>>> sometimes
>>>>>>>>>> >> >>> >>> >>>> saying nothing about other frameworks, Flink's
>>>>>>>>>> users (often
>>>>>>>>>> >> >>> >>> >>>> PMC's)
>>>>>>>>>> >> >>> >>> >>>> are
>>>>>>>>>> >> >>> >>> >>>> just posting same information about real-time
>>>>>>>>>> streaming,
>>>>>>>>>> >> >>> >>> >>>> about
>>>>>>>>>> >> >>> >>> >>>> delta
>>>>>>>>>> >> >>> >>> >>>> iterations, etc. It look smart and very often it
>>>>>>>>>> is marked as
>>>>>>>>>> >> >>> >>> >>>> an
>>>>>>>>>> >> >>> >>> >>>> aswer,
>>>>>>>>>> >> >>> >>> >>>> even if - in my opinion - there wasn't told all
>>>>>>>>>> the truth.
>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>> >> >>> >>> >>>> My suggestion: I don't have enough money and
>>>>>>>>>> knowledgle to
>>>>>>>>>> >> >>> >>> >>>> perform
>>>>>>>>>> >> >>> >>> >>>> huge
>>>>>>>>>> >> >>> >>> >>>> performance test. Maybe some company, that
>>>>>>>>>> supports Spark
>>>>>>>>>> >> >>> >>> >>>> (Databricks,
>>>>>>>>>> >> >>> >>> >>>> Cloudera? - just saying you're most visible in
>>>>>>>>>> community :) )
>>>>>>>>>> >> >>> >>> >>>> could
>>>>>>>>>> >> >>> >>> >>>> perform performance test of:
>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>> >> >>> >>> >>>> - streaming engine - probably Spark will loose
>>>>>>>>>> because of
>>>>>>>>>> >> >>> >>> >>>> mini-batch
>>>>>>>>>> >> >>> >>> >>>> model, however currently the difference should be
>>>>>>>>>> much lower
>>>>>>>>>> >> >>> >>> >>>> that in
>>>>>>>>>> >> >>> >>> >>>> previous versions
>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>> >> >>> >>> >>>> - Machine Learning models
>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>> >> >>> >>> >>>> - batch jobs
>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>> >> >>> >>> >>>> - Graph jobs
>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>> >> >>> >>> >>>> - SQL queries
>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>> >> >>> >>> >>>> People will see that Spark is envolving and is
>>>>>>>>>> also a modern
>>>>>>>>>> >> >>> >>> >>>> framework,
>>>>>>>>>> >> >>> >>> >>>> because after reading posts mentioned above
>>>>>>>>>> people may think
>>>>>>>>>> >> >>> >>> >>>> "it
>>>>>>>>>> >> >>> >>> >>>> is
>>>>>>>>>> >> >>> >>> >>>> outdated, future is in framework X".
>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>> >> >>> >>> >>>> Matei Zaharia posted excellent blog post about
>>>>>>>>>> how Spark
>>>>>>>>>> >> >>> >>> >>>> Structured
>>>>>>>>>> >> >>> >>> >>>> Streaming beats every other framework in terms of
>>>>>>>>>> easy-of-use
>>>>>>>>>> >> >>> >>> >>>> and
>>>>>>>>>> >> >>> >>> >>>> reliability. Performance tests, done in various
>>>>>>>>>> environments
>>>>>>>>>> >> >>> >>> >>>> (in
>>>>>>>>>> >> >>> >>> >>>> example: laptop, small 2 node cluster, 10-node
>>>>>>>>>> cluster,
>>>>>>>>>> >> >>> >>> >>>> 20-node
>>>>>>>>>> >> >>> >>> >>>> cluster), could be also very good marketing stuff
>>>>>>>>>> to say
>>>>>>>>>> >> >>> >>> >>>> "hey,
>>>>>>>>>> >> >>> >>> >>>> you're
>>>>>>>>>> >> >>> >>> >>>> telling that you're better, but Spark is still
>>>>>>>>>> faster and is
>>>>>>>>>> >> >>> >>> >>>> still
>>>>>>>>>> >> >>> >>> >>>> getting even more fast!". This would be based on
>>>>>>>>>> facts (just
>>>>>>>>>> >> >>> >>> >>>> numbers),
>>>>>>>>>> >> >>> >>> >>>> not opinions. It would be good for companies, for
>>>>>>>>>> marketing
>>>>>>>>>> >> >>> >>> >>>> puproses
>>>>>>>>>> >> >>> >>> >>>> and
>>>>>>>>>> >> >>> >>> >>>> for every Spark developer
>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>> >> >>> >>> >>>> Second: real-time streaming. I've written some
>>>>>>>>>> time ago about
>>>>>>>>>> >> >>> >>> >>>> real-time
>>>>>>>>>> >> >>> >>> >>>> streaming support in Spark Structured Streaming.
>>>>>>>>>> Some work
>>>>>>>>>> >> >>> >>> >>>> should be
>>>>>>>>>> >> >>> >>> >>>> done to make SSS more low-latency, but I think
>>>>>>>>>> it's possible.
>>>>>>>>>> >> >>> >>> >>>> Maybe
>>>>>>>>>> >> >>> >>> >>>> Spark may look at Gearpump, which is also built
>>>>>>>>>> on top of
>>>>>>>>>> >> >>> >>> >>>> Akka?
>>>>>>>>>> >> >>> >>> >>>> I
>>>>>>>>>> >> >>> >>> >>>> don't
>>>>>>>>>> >> >>> >>> >>>> know yet, it is good topic for SIP. However I
>>>>>>>>>> think that
>>>>>>>>>> >> >>> >>> >>>> Spark
>>>>>>>>>> >> >>> >>> >>>> should
>>>>>>>>>> >> >>> >>> >>>> have real-time streaming support. Currently I see
>>>>>>>>>> many
>>>>>>>>>> >> >>> >>> >>>> posts/comments
>>>>>>>>>> >> >>> >>> >>>> that "Spark has too big latency". Spark Streaming
>>>>>>>>>> is doing
>>>>>>>>>> >> >>> >>> >>>> very
>>>>>>>>>> >> >>> >>> >>>> good
>>>>>>>>>> >> >>> >>> >>>> jobs with micro-batches, however I think it is
>>>>>>>>>> possible to
>>>>>>>>>> >> >>> >>> >>>> add
>>>>>>>>>> >> >>> >>> >>>> also
>>>>>>>>>> >> >>> >>> >>>> more
>>>>>>>>>> >> >>> >>> >>>> real-time processing.
>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>> >> >>> >>> >>>> Other people said much more and I agree with
>>>>>>>>>> proposal of SIP.
>>>>>>>>>> >> >>> >>> >>>> I'm
>>>>>>>>>> >> >>> >>> >>>> also
>>>>>>>>>> >> >>> >>> >>>> happy that PMC's are not saying that they will
>>>>>>>>>> not listen to
>>>>>>>>>> >> >>> >>> >>>> users,
>>>>>>>>>> >> >>> >>> >>>> but
>>>>>>>>>> >> >>> >>> >>>> they really want to make Spark better for every
>>>>>>>>>> user.
>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>> >> >>> >>> >>>> What do you think about these two topics?
>>>>>>>>>> Especially I'm
>>>>>>>>>> >> >>> >>> >>>> looking
>>>>>>>>>> >> >>> >>> >>>> at
>>>>>>>>>> >> >>> >>> >>>> Cody
>>>>>>>>>> >> >>> >>> >>>> (who has started this topic) and PMCs :)
>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>> >> >>> >>> >>>> Pozdrawiam / Best regards,
>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>> >> >>> >>> >>>> Tomasz
>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>>> >> >>> >>>
>>>>>>>>>> >> >>> >>
>>>>>>>>>> >> >>> >
>>>>>>>>>> >> >>> >
>>>>>>>>>> >> >
>>>>>>>>>> >> >
>>>>>>>>>> >>
>>>>>>>>>> >> ------------------------------------------------------------
>>>>>>>>>> ---------
>>>>>>>>>> >> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>>>> >>
>>>>>>>>>> >
>>>>>>>>>> >
>>>>>>>>>> >
>>>>>>>>>> > --
>>>>>>>>>> > Ryan Blue
>>>>>>>>>> > Software Engineer
>>>>>>>>>> > Netflix
>>>>>>>>>>
>>>>>>>>>> ------------------------------------------------------------
>>>>>>>>>> ---------
>>>>>>>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>>
>>>>>>>>> Joseph Bradley
>>>>>>>>>
>>>>>>>>> Software Engineer - Machine Learning
>>>>>>>>>
>>>>>>>>> Databricks, Inc.
>>>>>>>>>
>>>>>>>>> [image: http://databricks.com] <http://databricks.com/>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Spark Improvement Proposals

Posted by Cody Koeninger <co...@koeninger.org>.

Thanks for doing that.

Given that there are at least 4 different Apache voting processes, "typical
Apache vote process" isn't meaningful to me.

I think the intention is that in order to pass, it needs at least 3 +1
votes from PMC members *and no -1 votes from PMC members*.  But the
document doesn't explicitly say that second part.

There's also no mention of the duration a vote should remain open.  There's
a mention of a month for finding a shepherd, but that's different.

Other than that, LGTM.

On Mon, Feb 13, 2017 at 9:02 AM, Reynold Xin <rx...@databricks.com> wrote:

> Here's a new draft that incorporated most of the feedback:
> https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-
> nRanvXmnZ7SUi4qMljg/edit#
>
> I added a specific role for SPIP Author and another one for SPIP Shepherd.
>
> On Sat, Feb 11, 2017 at 6:13 PM, Xiao Li <ga...@gmail.com> wrote:
>
>> During the summit, I also had a lot of discussions over similar topics
>> with multiple Committers and active users. I heard many fantastic ideas. I
>> believe Spark improvement proposals are good channels to collect the
>> requirements/designs.
>>
>>
>> IMO, we also need to consider the priority when working on these items.
>> Even if the proposal is accepted, it does not mean it will be implemented
>> and merged immediately. It is not a FIFO queue.
>>
>>
>> Even if some PRs are merged, sometimes, we still have to revert them
>> back, if the design and implementation are not reviewed carefully. We have
>> to ensure our quality. Spark is not an application software. It is an
>> infrastructure software that is being used by many many companies. We have
>> to be very careful in the design and implementation, especially
>> adding/changing the external APIs.
>>
>>
>> When I developed the Mainframe infrastructure/middleware software in the
>> past 6 years, I were involved in the discussions with external/internal
>> customers. The to-do feature list was always above 100. Sometimes, the
>> customers are feeling frustrated when we are unable to deliver them on time
>> due to the resource limits and others. Even if they paid us billions, we
>> still need to do it phase by phase or sometimes they have to accept the
>> workarounds. That is the reality everyone has to face, I think.
>>
>>
>> Thanks,
>>
>>
>> Xiao Li
>>
>> 2017-02-11 7:57 GMT-08:00 Cody Koeninger <co...@koeninger.org>:
>>
>>> At the spark summit this week, everyone from PMC members to users I had
>>> never met before were asking me about the Spark improvement proposals
>>> idea.  It's clear that it's a real community need.
>>>
>>> But it's been almost half a year, and nothing visible has been done.
>>>
>>> Reynold, are you going to do this?
>>>
>>> If so, when?
>>>
>>> If not, why?
>>>
>>> You already did the right thing by including long-deserved committers.
>>> Please keep doing the right thing for the community.
>>>
>>> On Wed, Jan 11, 2017 at 4:13 AM, Reynold Xin <rx...@databricks.com>
>>> wrote:
>>>
>>>> +1 on all counts (consensus, time bound, define roles)
>>>>
>>>> I can update the doc in the next few days and share back. Then maybe we
>>>> can just officially vote on this. As Tim suggested, we might not get it
>>>> 100% right the first time and would need to re-iterate. But that's fine.
>>>>
>>>>
>>>> On Thu, Jan 5, 2017 at 3:29 PM, Tim Hunter <ti...@databricks.com>
>>>> wrote:
>>>>
>>>>> Hi Cody,
>>>>> thank you for bringing up this topic, I agree it is very important to
>>>>> keep a cohesive community around some common, fluid goals. Here are a few
>>>>> comments about the current document:
>>>>>
>>>>> 1. name: it should not overlap with an existing one such as SIP. Can
>>>>> you imagine someone trying to discuss a scala spore proposal for spark?
>>>>> "[Spark] SIP-3 is intended to evolve in tandem with [Scala] SIP-21". SPIP
>>>>> sounds great.
>>>>>
>>>>> 2. roles: at a high level, SPIPs are meant to reach consensus for
>>>>> technical decisions with a lasting impact. As such, the template should
>>>>> emphasize the role of the various parties during this process:
>>>>>
>>>>>  - the SPIP author is responsible for building consensus. She is the
>>>>> champion driving the process forward and is responsible for ensuring that
>>>>> the SPIP follows the general guidelines. The author should be identified in
>>>>> the SPIP. The authorship of a SPIP can be transferred if the current author
>>>>> is not interested and someone else wants to move the SPIP forward. There
>>>>> should probably be 2-3 authors at most for each SPIP.
>>>>>
>>>>>  - someone with voting power should probably shepherd the SPIP (and be
>>>>> recorded as such): ensuring that the final decision over the SPIP is
>>>>> recorded (rejected, accepted, etc.), and advising about the technical
>>>>> quality of the SPIP: this person need not be a champion for the SPIP or
>>>>> contribute to it, but rather makes sure it stands a chance of being
>>>>> approved when the vote happens. Also, if the author cannot find anyone who
>>>>> would want to take this role, this proposal is likely to be rejected anyway.
>>>>>
>>>>>  - users, committers, contributors have the roles already outlined in
>>>>> the document
>>>>>
>>>>> 3. timeline: ideally, once a SPIP has been offered for voting, it
>>>>> should move swiftly into either being accepted or rejected, so that we do
>>>>> not end up with a distracting long tail of half-hearted proposals.
>>>>>
>>>>> These rules are meant to be flexible, but the current document should
>>>>> be clear about who is in charge of a SPIP, and the state it is currently in.
>>>>>
>>>>> We have had long discussions over some very important questions such
>>>>> as approval. I do not have an opinion on these, but why not make a pick and
>>>>> reevaluate this decision later? This is not a binding process at this point.
>>>>>
>>>>> Tim
>>>>>
>>>>>
>>>>> On Tue, Jan 3, 2017 at 3:16 PM, Cody Koeninger <co...@koeninger.org>
>>>>> wrote:
>>>>>
>>>>>> I don't have a concern about voting vs consensus.
>>>>>>
>>>>>> I have a concern that whatever the decision making process is, it is
>>>>>> explicitly announced on the ticket for the given proposal, with an explicit
>>>>>> deadline, and an explicit outcome.
>>>>>>
>>>>>>
>>>>>> On Tue, Jan 3, 2017 at 4:08 PM, Imran Rashid <ir...@cloudera.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I'm also in favor of this.  Thanks for your persistence Cody.
>>>>>>>
>>>>>>> My take on the specific issues Joseph mentioned:
>>>>>>>
>>>>>>> 1) voting vs. consensus -- I agree with the argument Ryan Blue made
>>>>>>> earlier for consensus:
>>>>>>>
>>>>>>> > Majority vs consensus: My rationale is that I don't think we want
>>>>>>> to consider a proposal approved if it had objections serious enough that
>>>>>>> committers down-voted (or PMC depending on who gets a vote). If these
>>>>>>> proposals are like PEPs, then they represent a significant amount of
>>>>>>> community effort and I wouldn't want to move forward if up to half of the
>>>>>>> community thinks it's an untenable idea.
>>>>>>>
>>>>>>> 2) Design doc template -- agree this would be useful, but also seems
>>>>>>> totally orthogonal to moving forward on the SIP proposal.
>>>>>>>
>>>>>>> 3) agree w/ Joseph's proposal for updating the template.
>>>>>>>
>>>>>>> One small addition:
>>>>>>>
>>>>>>> 4) Deciding on a name -- minor, but I think its wroth disambiguating
>>>>>>> from Scala's SIPs, and the best proposal I've heard is "SPIP".   At least,
>>>>>>> no one has objected.  (don't care enough that I'd object to anything else,
>>>>>>> though.)
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jan 3, 2017 at 3:30 PM, Joseph Bradley <
>>>>>>> joseph@databricks.com> wrote:
>>>>>>>
>>>>>>>> Hi Cody,
>>>>>>>>
>>>>>>>> Thanks for being persistent about this.  I too would like to see
>>>>>>>> this happen.  Reviewing the thread, it sounds like the main things
>>>>>>>> remaining are:
>>>>>>>> * Decide about a few issues
>>>>>>>> * Finalize the doc(s)
>>>>>>>> * Vote on this proposal
>>>>>>>>
>>>>>>>> Issues & TODOs:
>>>>>>>>
>>>>>>>> (1) The main issue I see above is voting vs. consensus.  I have
>>>>>>>> little preference here.  It sounds like something which could be tailored
>>>>>>>> based on whether we see too many or too few SIPs being approved.
>>>>>>>>
>>>>>>>> (2) Design doc template  (This would be great to have for Spark
>>>>>>>> regardless of this SIP discussion.)
>>>>>>>> * Reynold, are you still putting this together?
>>>>>>>>
>>>>>>>> (3) Template cleanups.  Listing some items mentioned above + a new
>>>>>>>> one w.r.t. Reynold's draft
>>>>>>>> <https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#>
>>>>>>>> :
>>>>>>>> * Reinstate the "Where" section with links to current and past SIPs
>>>>>>>> * Add field for stating explicit deadlines for approval
>>>>>>>> * Add field for stating Author & Committer shepherd
>>>>>>>>
>>>>>>>> Thanks all!
>>>>>>>> Joseph
>>>>>>>>
>>>>>>>> On Mon, Jan 2, 2017 at 7:45 AM, Cody Koeninger <co...@koeninger.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I'm bumping this one more time for the new year, and then I'm
>>>>>>>>> giving up.
>>>>>>>>>
>>>>>>>>> Please, fix your process, even if it isn't exactly the way I
>>>>>>>>> suggested.
>>>>>>>>>
>>>>>>>>> On Tue, Nov 8, 2016 at 11:14 AM, Ryan Blue <rb...@netflix.com>
>>>>>>>>> wrote:
>>>>>>>>> > On lazy consensus as opposed to voting:
>>>>>>>>> >
>>>>>>>>> > First, why lazy consensus? The proposal was for consensus, which
>>>>>>>>> is at least
>>>>>>>>> > three +1 votes and no vetos. Consensus has no losing side, it
>>>>>>>>> requires
>>>>>>>>> > getting to a point where there is agreement. Isn't that
>>>>>>>>> agreement what we
>>>>>>>>> > want to achieve with these proposals?
>>>>>>>>> >
>>>>>>>>> > Second, lazy consensus only removes the requirement for three +1
>>>>>>>>> votes. Why
>>>>>>>>> > would we not want at least three committers to think something
>>>>>>>>> is a good
>>>>>>>>> > idea before adopting the proposal?
>>>>>>>>> >
>>>>>>>>> > rb
>>>>>>>>> >
>>>>>>>>> > On Tue, Nov 8, 2016 at 8:13 AM, Cody Koeninger <
>>>>>>>>> cody@koeninger.org> wrote:
>>>>>>>>> >>
>>>>>>>>> >> So there are some minor things (the Where section heading
>>>>>>>>> appears to
>>>>>>>>> >> be dropped; wherever this document is posted it needs to
>>>>>>>>> actually link
>>>>>>>>> >> to a jira filter showing current / past SIPs) but it doesn't
>>>>>>>>> look like
>>>>>>>>> >> I can comment on the google doc.
>>>>>>>>> >>
>>>>>>>>> >> The major substantive issue that I have is that this version is
>>>>>>>>> >> significantly less clear as to the outcome of an SIP.
>>>>>>>>> >>
>>>>>>>>> >> The apache example of lazy consensus at
>>>>>>>>> >> http://apache.org/foundation/voting.html#LazyConsensus
>>>>>>>>> involves an
>>>>>>>>> >> explicit announcement of an explicit deadline, which I think are
>>>>>>>>> >> necessary for clarity.
>>>>>>>>> >>
>>>>>>>>> >>
>>>>>>>>> >>
>>>>>>>>> >> On Mon, Nov 7, 2016 at 1:55 PM, Reynold Xin <
>>>>>>>>> rxin@databricks.com> wrote:
>>>>>>>>> >> > It turned out suggested edits (trackable) don't show up for
>>>>>>>>> non-owners,
>>>>>>>>> >> > so
>>>>>>>>> >> > I've just merged all the edits in place. It should be visible
>>>>>>>>> now.
>>>>>>>>> >> >
>>>>>>>>> >> > On Mon, Nov 7, 2016 at 10:10 AM, Reynold Xin <
>>>>>>>>> rxin@databricks.com>
>>>>>>>>> >> > wrote:
>>>>>>>>> >> >>
>>>>>>>>> >> >> Oops. Let me try figure that out.
>>>>>>>>> >> >>
>>>>>>>>> >> >>
>>>>>>>>> >> >> On Monday, November 7, 2016, Cody Koeninger <
>>>>>>>>> cody@koeninger.org> wrote:
>>>>>>>>> >> >>>
>>>>>>>>> >> >>> Thanks for picking up on this.
>>>>>>>>> >> >>>
>>>>>>>>> >> >>> Maybe I fail at google docs, but I can't see any edits on
>>>>>>>>> the document
>>>>>>>>> >> >>> you linked.
>>>>>>>>> >> >>>
>>>>>>>>> >> >>> Regarding lazy consensus, if the board in general has less
>>>>>>>>> of an issue
>>>>>>>>> >> >>> with that, sure.  As long as it is clearly announced, lasts
>>>>>>>>> at least
>>>>>>>>> >> >>> 72 hours, and has a clear outcome.
>>>>>>>>> >> >>>
>>>>>>>>> >> >>> The other points are hard to comment on without being able
>>>>>>>>> to see the
>>>>>>>>> >> >>> text in question.
>>>>>>>>> >> >>>
>>>>>>>>> >> >>>
>>>>>>>>> >> >>> On Mon, Nov 7, 2016 at 3:11 AM, Reynold Xin <
>>>>>>>>> rxin@databricks.com>
>>>>>>>>> >> >>> wrote:
>>>>>>>>> >> >>> > I just looked through the entire thread again tonight -
>>>>>>>>> there are a
>>>>>>>>> >> >>> > lot
>>>>>>>>> >> >>> > of
>>>>>>>>> >> >>> > great ideas being discussed. Thanks Cody for taking the
>>>>>>>>> first crack
>>>>>>>>> >> >>> > at
>>>>>>>>> >> >>> > the
>>>>>>>>> >> >>> > proposal.
>>>>>>>>> >> >>> >
>>>>>>>>> >> >>> > I want to first comment on the context. Spark is one of
>>>>>>>>> the most
>>>>>>>>> >> >>> > innovative
>>>>>>>>> >> >>> > and important projects in (big) data -- overall technical
>>>>>>>>> decisions
>>>>>>>>> >> >>> > made in
>>>>>>>>> >> >>> > Apache Spark are sound. But of course, a project as large
>>>>>>>>> and active
>>>>>>>>> >> >>> > as
>>>>>>>>> >> >>> > Spark always have room for improvement, and we as a
>>>>>>>>> community should
>>>>>>>>> >> >>> > strive
>>>>>>>>> >> >>> > to take it to the next level.
>>>>>>>>> >> >>> >
>>>>>>>>> >> >>> > To that end, the two biggest areas for improvements in my
>>>>>>>>> opinion
>>>>>>>>> >> >>> > are:
>>>>>>>>> >> >>> >
>>>>>>>>> >> >>> > 1. Visibility: There are so much happening that it is
>>>>>>>>> difficult to
>>>>>>>>> >> >>> > know
>>>>>>>>> >> >>> > what
>>>>>>>>> >> >>> > really is going on. For people that don't follow closely,
>>>>>>>>> it is
>>>>>>>>> >> >>> > difficult to
>>>>>>>>> >> >>> > know what the important initiatives are. Even for people
>>>>>>>>> that do
>>>>>>>>> >> >>> > follow, it
>>>>>>>>> >> >>> > is difficult to know what specific things require their
>>>>>>>>> attention,
>>>>>>>>> >> >>> > since the
>>>>>>>>> >> >>> > number of pull requests and JIRA tickets are high and
>>>>>>>>> it's difficult
>>>>>>>>> >> >>> > to
>>>>>>>>> >> >>> > extract signal from noise.
>>>>>>>>> >> >>> >
>>>>>>>>> >> >>> > 2. Solicit user (broadly defined, including developers
>>>>>>>>> themselves)
>>>>>>>>> >> >>> > input
>>>>>>>>> >> >>> > more proactively: At the end of the day the project
>>>>>>>>> provides value
>>>>>>>>> >> >>> > because
>>>>>>>>> >> >>> > users use it. Users can't tell us exactly what to build,
>>>>>>>>> but it is
>>>>>>>>> >> >>> > important
>>>>>>>>> >> >>> > to get their inputs.
>>>>>>>>> >> >>> >
>>>>>>>>> >> >>> >
>>>>>>>>> >> >>> > I've taken Cody's doc and edited it:
>>>>>>>>> >> >>> >
>>>>>>>>> >> >>> >
>>>>>>>>> >> >>> > https://docs.google.com/docume
>>>>>>>>> nt/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#headi
>>>>>>>>> ng=h.36ut37zh7w2b
>>>>>>>>> >> >>> > (I've made all my modifications trackable)
>>>>>>>>> >> >>> >
>>>>>>>>> >> >>> > There are couple high level changes I made:
>>>>>>>>> >> >>> >
>>>>>>>>> >> >>> > 1. I've consulted a board member and he recommended lazy
>>>>>>>>> consensus
>>>>>>>>> >> >>> > as
>>>>>>>>> >> >>> > opposed to voting. The reason being in voting there can
>>>>>>>>> easily be a
>>>>>>>>> >> >>> > "loser'
>>>>>>>>> >> >>> > that gets outvoted.
>>>>>>>>> >> >>> >
>>>>>>>>> >> >>> > 2. I made it lighter weight, and renamed "strategy" to
>>>>>>>>> "optional
>>>>>>>>> >> >>> > design
>>>>>>>>> >> >>> > sketch". Echoing one of the earlier email: "IMHO so far
>>>>>>>>> aside from
>>>>>>>>> >> >>> > tagging
>>>>>>>>> >> >>> > things and linking them elsewhere simply having design
>>>>>>>>> docs and
>>>>>>>>> >> >>> > prototypes
>>>>>>>>> >> >>> > implementations in PRs is not something that has not
>>>>>>>>> worked so far".
>>>>>>>>> >> >>> >
>>>>>>>>> >> >>> > 3. I made some the language tweaks to focus more on
>>>>>>>>> visibility. For
>>>>>>>>> >> >>> > example,
>>>>>>>>> >> >>> > "The purpose of an SIP is to inform and involve", rather
>>>>>>>>> than just
>>>>>>>>> >> >>> > "involve". SIPs should also have at least two emails that
>>>>>>>>> go to
>>>>>>>>> >> >>> > dev@.
>>>>>>>>> >> >>> >
>>>>>>>>> >> >>> >
>>>>>>>>> >> >>> > While I was editing this, I thought we really needed a
>>>>>>>>> suggested
>>>>>>>>> >> >>> > template
>>>>>>>>> >> >>> > for design doc too. I will get to that too ...
>>>>>>>>> >> >>> >
>>>>>>>>> >> >>> >
>>>>>>>>> >> >>> > On Tue, Nov 1, 2016 at 12:09 AM, Reynold Xin <
>>>>>>>>> rxin@databricks.com>
>>>>>>>>> >> >>> > wrote:
>>>>>>>>> >> >>> >>
>>>>>>>>> >> >>> >> Most things looked OK to me too, although I do plan to
>>>>>>>>> take a
>>>>>>>>> >> >>> >> closer
>>>>>>>>> >> >>> >> look
>>>>>>>>> >> >>> >> after Nov 1st when we cut the release branch for 2.1.
>>>>>>>>> >> >>> >>
>>>>>>>>> >> >>> >>
>>>>>>>>> >> >>> >> On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin
>>>>>>>>> >> >>> >> <va...@cloudera.com>
>>>>>>>>> >> >>> >> wrote:
>>>>>>>>> >> >>> >>>
>>>>>>>>> >> >>> >>> The proposal looks OK to me. I assume, even though it's
>>>>>>>>> not
>>>>>>>>> >> >>> >>> explicitly
>>>>>>>>> >> >>> >>> called, that voting would happen by e-mail? A template
>>>>>>>>> for the
>>>>>>>>> >> >>> >>> proposal document (instead of just a bullet nice) would
>>>>>>>>> also be
>>>>>>>>> >> >>> >>> nice,
>>>>>>>>> >> >>> >>> but that can be done at any time.
>>>>>>>>> >> >>> >>>
>>>>>>>>> >> >>> >>> BTW, shameless plug: I filed SPARK-18085 which I
>>>>>>>>> consider a
>>>>>>>>> >> >>> >>> candidate
>>>>>>>>> >> >>> >>> for a SIP, given the scope of the work. The document
>>>>>>>>> attached even
>>>>>>>>> >> >>> >>> somewhat matches the proposed format. So if anyone
>>>>>>>>> wants to try
>>>>>>>>> >> >>> >>> out
>>>>>>>>> >> >>> >>> the process...
>>>>>>>>> >> >>> >>>
>>>>>>>>> >> >>> >>> On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger
>>>>>>>>> >> >>> >>> <co...@koeninger.org>
>>>>>>>>> >> >>> >>> wrote:
>>>>>>>>> >> >>> >>> > Now that spark summit europe is over, are any
>>>>>>>>> committers
>>>>>>>>> >> >>> >>> > interested
>>>>>>>>> >> >>> >>> > in
>>>>>>>>> >> >>> >>> > moving forward with this?
>>>>>>>>> >> >>> >>> >
>>>>>>>>> >> >>> >>> >
>>>>>>>>> >> >>> >>> >
>>>>>>>>> >> >>> >>> >
>>>>>>>>> >> >>> >>> > https://github.com/koeninger/s
>>>>>>>>> park-1/blob/SIP-0/docs/spark-improvement-proposals.md
>>>>>>>>> >> >>> >>> >
>>>>>>>>> >> >>> >>> > Or are we going to let this discussion die on the
>>>>>>>>> vine?
>>>>>>>>> >> >>> >>> >
>>>>>>>>> >> >>> >>> > On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda
>>>>>>>>> >> >>> >>> > <to...@outlook.com> wrote:
>>>>>>>>> >> >>> >>> >> Maybe my mail was not clear enough.
>>>>>>>>> >> >>> >>> >>
>>>>>>>>> >> >>> >>> >>
>>>>>>>>> >> >>> >>> >> I didn't want to write "lets focus on Flink" or any
>>>>>>>>> other
>>>>>>>>> >> >>> >>> >> framework.
>>>>>>>>> >> >>> >>> >> The
>>>>>>>>> >> >>> >>> >> idea with benchmarks was to show two things:
>>>>>>>>> >> >>> >>> >>
>>>>>>>>> >> >>> >>> >> - why some people are doing bad PR for Spark
>>>>>>>>> >> >>> >>> >>
>>>>>>>>> >> >>> >>> >> - how - in easy way - we can change it and show that
>>>>>>>>> Spark is
>>>>>>>>> >> >>> >>> >> still on
>>>>>>>>> >> >>> >>> >> the
>>>>>>>>> >> >>> >>> >> top
>>>>>>>>> >> >>> >>> >>
>>>>>>>>> >> >>> >>> >>
>>>>>>>>> >> >>> >>> >> No more, no less. Benchmarks will be helpful, but I
>>>>>>>>> don't think
>>>>>>>>> >> >>> >>> >> they're the
>>>>>>>>> >> >>> >>> >> most important thing in Spark :) On the Spark main
>>>>>>>>> page there
>>>>>>>>> >> >>> >>> >> is
>>>>>>>>> >> >>> >>> >> still
>>>>>>>>> >> >>> >>> >> chart
>>>>>>>>> >> >>> >>> >> "Spark vs Hadoop". It is important to show that
>>>>>>>>> framework is
>>>>>>>>> >> >>> >>> >> not
>>>>>>>>> >> >>> >>> >> the
>>>>>>>>> >> >>> >>> >> same
>>>>>>>>> >> >>> >>> >> Spark with other API, but much faster and optimized,
>>>>>>>>> comparable
>>>>>>>>> >> >>> >>> >> or
>>>>>>>>> >> >>> >>> >> even
>>>>>>>>> >> >>> >>> >> faster than other frameworks.
>>>>>>>>> >> >>> >>> >>
>>>>>>>>> >> >>> >>> >>
>>>>>>>>> >> >>> >>> >> About real-time streaming, I think it would be just
>>>>>>>>> good to see
>>>>>>>>> >> >>> >>> >> it
>>>>>>>>> >> >>> >>> >> in
>>>>>>>>> >> >>> >>> >> Spark.
>>>>>>>>> >> >>> >>> >> I very like current Spark model, but many voices
>>>>>>>>> that says "we
>>>>>>>>> >> >>> >>> >> need
>>>>>>>>> >> >>> >>> >> more" -
>>>>>>>>> >> >>> >>> >> community should listen also them and try to help
>>>>>>>>> them. With
>>>>>>>>> >> >>> >>> >> SIPs
>>>>>>>>> >> >>> >>> >> it
>>>>>>>>> >> >>> >>> >> would
>>>>>>>>> >> >>> >>> >> be easier, I've just posted this example as "thing
>>>>>>>>> that may be
>>>>>>>>> >> >>> >>> >> changed
>>>>>>>>> >> >>> >>> >> with
>>>>>>>>> >> >>> >>> >> SIP".
>>>>>>>>> >> >>> >>> >>
>>>>>>>>> >> >>> >>> >>
>>>>>>>>> >> >>> >>> >> I very like unification via Datasets, but there is a
>>>>>>>>> lot of
>>>>>>>>> >> >>> >>> >> algorithms
>>>>>>>>> >> >>> >>> >> inside - let's make easy API, but with strong
>>>>>>>>> background
>>>>>>>>> >> >>> >>> >> (articles,
>>>>>>>>> >> >>> >>> >> benchmarks, descriptions, etc) that shows that Spark
>>>>>>>>> is still
>>>>>>>>> >> >>> >>> >> modern
>>>>>>>>> >> >>> >>> >> framework.
>>>>>>>>> >> >>> >>> >>
>>>>>>>>> >> >>> >>> >>
>>>>>>>>> >> >>> >>> >> Maybe now my intention will be clearer :) As I said
>>>>>>>>> >> >>> >>> >> organizational
>>>>>>>>> >> >>> >>> >> ideas
>>>>>>>>> >> >>> >>> >> were already mentioned and I agree with them, my
>>>>>>>>> mail was just
>>>>>>>>> >> >>> >>> >> to
>>>>>>>>> >> >>> >>> >> show
>>>>>>>>> >> >>> >>> >> some
>>>>>>>>> >> >>> >>> >> aspects from my side, so from theside of developer
>>>>>>>>> and person
>>>>>>>>> >> >>> >>> >> who
>>>>>>>>> >> >>> >>> >> is
>>>>>>>>> >> >>> >>> >> trying
>>>>>>>>> >> >>> >>> >> to help others with Spark (via StackOverflow or
>>>>>>>>> other ways)
>>>>>>>>> >> >>> >>> >>
>>>>>>>>> >> >>> >>> >>
>>>>>>>>> >> >>> >>> >> Pozdrawiam / Best regards,
>>>>>>>>> >> >>> >>> >>
>>>>>>>>> >> >>> >>> >> Tomasz
>>>>>>>>> >> >>> >>> >>
>>>>>>>>> >> >>> >>> >>
>>>>>>>>> >> >>> >>> >> ________________________________
>>>>>>>>> >> >>> >>> >> Od: Cody Koeninger <co...@koeninger.org>
>>>>>>>>> >> >>> >>> >> Wysłane: 17 października 2016 16:46
>>>>>>>>> >> >>> >>> >> Do: Debasish Das
>>>>>>>>> >> >>> >>> >> DW: Tomasz Gawęda; dev@spark.apache.org
>>>>>>>>> >> >>> >>> >> Temat: Re: Spark Improvement Proposals
>>>>>>>>> >> >>> >>> >>
>>>>>>>>> >> >>> >>> >> I think narrowly focusing on Flink or benchmarks is
>>>>>>>>> missing my
>>>>>>>>> >> >>> >>> >> point.
>>>>>>>>> >> >>> >>> >>
>>>>>>>>> >> >>> >>> >> My point is evolve or die.  Spark's governance and
>>>>>>>>> organization
>>>>>>>>> >> >>> >>> >> is
>>>>>>>>> >> >>> >>> >> hampering its ability to evolve technologically, and
>>>>>>>>> it needs
>>>>>>>>> >> >>> >>> >> to
>>>>>>>>> >> >>> >>> >> change.
>>>>>>>>> >> >>> >>> >>
>>>>>>>>> >> >>> >>> >> On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das
>>>>>>>>> >> >>> >>> >> <de...@gmail.com>
>>>>>>>>> >> >>> >>> >> wrote:
>>>>>>>>> >> >>> >>> >>> Thanks Cody for bringing up a valid point...I
>>>>>>>>> picked up Spark
>>>>>>>>> >> >>> >>> >>> in
>>>>>>>>> >> >>> >>> >>> 2014
>>>>>>>>> >> >>> >>> >>> as
>>>>>>>>> >> >>> >>> >>> soon as I looked into it since compared to writing
>>>>>>>>> Java
>>>>>>>>> >> >>> >>> >>> map-reduce
>>>>>>>>> >> >>> >>> >>> and
>>>>>>>>> >> >>> >>> >>> Cascading code, Spark made writing distributed code
>>>>>>>>> fun...But
>>>>>>>>> >> >>> >>> >>> now
>>>>>>>>> >> >>> >>> >>> as
>>>>>>>>> >> >>> >>> >>> we
>>>>>>>>> >> >>> >>> >>> went
>>>>>>>>> >> >>> >>> >>> deeper with Spark and real-time streaming use-case
>>>>>>>>> gets more
>>>>>>>>> >> >>> >>> >>> prominent, I
>>>>>>>>> >> >>> >>> >>> think it is time to bring a messaging model in
>>>>>>>>> conjunction
>>>>>>>>> >> >>> >>> >>> with
>>>>>>>>> >> >>> >>> >>> the
>>>>>>>>> >> >>> >>> >>> batch/micro-batch API that Spark is good
>>>>>>>>> at....akka-streams
>>>>>>>>> >> >>> >>> >>> close
>>>>>>>>> >> >>> >>> >>> integration with spark micro-batching APIs looks
>>>>>>>>> like a great
>>>>>>>>> >> >>> >>> >>> direction to
>>>>>>>>> >> >>> >>> >>> stay in the game with Apache Flink...Spark 2.0
>>>>>>>>> integrated
>>>>>>>>> >> >>> >>> >>> streaming
>>>>>>>>> >> >>> >>> >>> with
>>>>>>>>> >> >>> >>> >>> batch with the assumption is that micro-batching is
>>>>>>>>> sufficient
>>>>>>>>> >> >>> >>> >>> to
>>>>>>>>> >> >>> >>> >>> run
>>>>>>>>> >> >>> >>> >>> SQL
>>>>>>>>> >> >>> >>> >>> commands on stream but do we really have time to do
>>>>>>>>> SQL
>>>>>>>>> >> >>> >>> >>> processing at
>>>>>>>>> >> >>> >>> >>> streaming data within 1-2 seconds ?
>>>>>>>>> >> >>> >>> >>>
>>>>>>>>> >> >>> >>> >>> After reading the email chain, I started to look
>>>>>>>>> into Flink
>>>>>>>>> >> >>> >>> >>> documentation
>>>>>>>>> >> >>> >>> >>> and if you compare it with Spark documentation, I
>>>>>>>>> think we
>>>>>>>>> >> >>> >>> >>> have
>>>>>>>>> >> >>> >>> >>> major
>>>>>>>>> >> >>> >>> >>> work
>>>>>>>>> >> >>> >>> >>> to do detailing out Spark internals so that more
>>>>>>>>> people from
>>>>>>>>> >> >>> >>> >>> community
>>>>>>>>> >> >>> >>> >>> start
>>>>>>>>> >> >>> >>> >>> to take active role in improving the issues so that
>>>>>>>>> Spark
>>>>>>>>> >> >>> >>> >>> stays
>>>>>>>>> >> >>> >>> >>> strong
>>>>>>>>> >> >>> >>> >>> compared to Flink.
>>>>>>>>> >> >>> >>> >>>
>>>>>>>>> >> >>> >>> >>>
>>>>>>>>> >> >>> >>> >>> https://cwiki.apache.org/confl
>>>>>>>>> uence/display/SPARK/Spark+Internals
>>>>>>>>> >> >>> >>> >>>
>>>>>>>>> >> >>> >>> >>>
>>>>>>>>> >> >>> >>> >>> https://cwiki.apache.org/confl
>>>>>>>>> uence/display/FLINK/Flink+Internals
>>>>>>>>> >> >>> >>> >>>
>>>>>>>>> >> >>> >>> >>> Spark is no longer an engine that works for
>>>>>>>>> micro-batch and
>>>>>>>>> >> >>> >>> >>> batch...We
>>>>>>>>> >> >>> >>> >>> (and
>>>>>>>>> >> >>> >>> >>> I am sure many others) are pushing spark as an
>>>>>>>>> engine for
>>>>>>>>> >> >>> >>> >>> stream
>>>>>>>>> >> >>> >>> >>> and
>>>>>>>>> >> >>> >>> >>> query
>>>>>>>>> >> >>> >>> >>> processing.....we need to make it a
>>>>>>>>> state-of-the-art engine
>>>>>>>>> >> >>> >>> >>> for
>>>>>>>>> >> >>> >>> >>> high
>>>>>>>>> >> >>> >>> >>> speed
>>>>>>>>> >> >>> >>> >>> streaming data and user queries as well !
>>>>>>>>> >> >>> >>> >>>
>>>>>>>>> >> >>> >>> >>> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda
>>>>>>>>> >> >>> >>> >>> <to...@outlook.com>
>>>>>>>>> >> >>> >>> >>> wrote:
>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>> >> >>> >>> >>>> Hi everyone,
>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>> >> >>> >>> >>>> I'm quite late with my answer, but I think my
>>>>>>>>> suggestions may
>>>>>>>>> >> >>> >>> >>>> help a
>>>>>>>>> >> >>> >>> >>>> little bit. :) Many technical and organizational
>>>>>>>>> topics were
>>>>>>>>> >> >>> >>> >>>> mentioned,
>>>>>>>>> >> >>> >>> >>>> but I want to focus on these negative posts about
>>>>>>>>> Spark and
>>>>>>>>> >> >>> >>> >>>> about
>>>>>>>>> >> >>> >>> >>>> "haters"
>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>> >> >>> >>> >>>> I really like Spark. Easy of use, speed, very good
>>>>>>>>> community
>>>>>>>>> >> >>> >>> >>>> -
>>>>>>>>> >> >>> >>> >>>> it's
>>>>>>>>> >> >>> >>> >>>> everything here. But Every project has to "flight"
>>>>>>>>> on
>>>>>>>>> >> >>> >>> >>>> "framework
>>>>>>>>> >> >>> >>> >>>> market"
>>>>>>>>> >> >>> >>> >>>> to be still no 1. I'm following many Spark and Big
>>>>>>>>> Data
>>>>>>>>> >> >>> >>> >>>> communities,
>>>>>>>>> >> >>> >>> >>>> maybe my mail will inspire someone :)
>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>> >> >>> >>> >>>> You (every Spark developer; so far I didn't have
>>>>>>>>> enough time
>>>>>>>>> >> >>> >>> >>>> to
>>>>>>>>> >> >>> >>> >>>> join
>>>>>>>>> >> >>> >>> >>>> contributing to Spark) has done excellent job. So
>>>>>>>>> why are
>>>>>>>>> >> >>> >>> >>>> some
>>>>>>>>> >> >>> >>> >>>> people
>>>>>>>>> >> >>> >>> >>>> saying that Flink (or other framework) is better,
>>>>>>>>> like it was
>>>>>>>>> >> >>> >>> >>>> posted
>>>>>>>>> >> >>> >>> >>>> in
>>>>>>>>> >> >>> >>> >>>> this mailing list? No, not because that framework
>>>>>>>>> is better
>>>>>>>>> >> >>> >>> >>>> in
>>>>>>>>> >> >>> >>> >>>> all
>>>>>>>>> >> >>> >>> >>>> cases.. In my opinion, many of these discussions
>>>>>>>>> where
>>>>>>>>> >> >>> >>> >>>> started
>>>>>>>>> >> >>> >>> >>>> after
>>>>>>>>> >> >>> >>> >>>> Flink marketing-like posts. Please look at
>>>>>>>>> StackOverflow
>>>>>>>>> >> >>> >>> >>>> "Flink
>>>>>>>>> >> >>> >>> >>>> vs
>>>>>>>>> >> >>> >>> >>>> ...."
>>>>>>>>> >> >>> >>> >>>> posts, almost every post in "winned" by Flink.
>>>>>>>>> Answers are
>>>>>>>>> >> >>> >>> >>>> sometimes
>>>>>>>>> >> >>> >>> >>>> saying nothing about other frameworks, Flink's
>>>>>>>>> users (often
>>>>>>>>> >> >>> >>> >>>> PMC's)
>>>>>>>>> >> >>> >>> >>>> are
>>>>>>>>> >> >>> >>> >>>> just posting same information about real-time
>>>>>>>>> streaming,
>>>>>>>>> >> >>> >>> >>>> about
>>>>>>>>> >> >>> >>> >>>> delta
>>>>>>>>> >> >>> >>> >>>> iterations, etc. It look smart and very often it
>>>>>>>>> is marked as
>>>>>>>>> >> >>> >>> >>>> an
>>>>>>>>> >> >>> >>> >>>> aswer,
>>>>>>>>> >> >>> >>> >>>> even if - in my opinion - there wasn't told all
>>>>>>>>> the truth.
>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>> >> >>> >>> >>>> My suggestion: I don't have enough money and
>>>>>>>>> knowledgle to
>>>>>>>>> >> >>> >>> >>>> perform
>>>>>>>>> >> >>> >>> >>>> huge
>>>>>>>>> >> >>> >>> >>>> performance test. Maybe some company, that
>>>>>>>>> supports Spark
>>>>>>>>> >> >>> >>> >>>> (Databricks,
>>>>>>>>> >> >>> >>> >>>> Cloudera? - just saying you're most visible in
>>>>>>>>> community :) )
>>>>>>>>> >> >>> >>> >>>> could
>>>>>>>>> >> >>> >>> >>>> perform performance test of:
>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>> >> >>> >>> >>>> - streaming engine - probably Spark will loose
>>>>>>>>> because of
>>>>>>>>> >> >>> >>> >>>> mini-batch
>>>>>>>>> >> >>> >>> >>>> model, however currently the difference should be
>>>>>>>>> much lower
>>>>>>>>> >> >>> >>> >>>> that in
>>>>>>>>> >> >>> >>> >>>> previous versions
>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>> >> >>> >>> >>>> - Machine Learning models
>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>> >> >>> >>> >>>> - batch jobs
>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>> >> >>> >>> >>>> - Graph jobs
>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>> >> >>> >>> >>>> - SQL queries
>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>> >> >>> >>> >>>> People will see that Spark is envolving and is
>>>>>>>>> also a modern
>>>>>>>>> >> >>> >>> >>>> framework,
>>>>>>>>> >> >>> >>> >>>> because after reading posts mentioned above people
>>>>>>>>> may think
>>>>>>>>> >> >>> >>> >>>> "it
>>>>>>>>> >> >>> >>> >>>> is
>>>>>>>>> >> >>> >>> >>>> outdated, future is in framework X".
>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>> >> >>> >>> >>>> Matei Zaharia posted excellent blog post about how
>>>>>>>>> Spark
>>>>>>>>> >> >>> >>> >>>> Structured
>>>>>>>>> >> >>> >>> >>>> Streaming beats every other framework in terms of
>>>>>>>>> easy-of-use
>>>>>>>>> >> >>> >>> >>>> and
>>>>>>>>> >> >>> >>> >>>> reliability. Performance tests, done in various
>>>>>>>>> environments
>>>>>>>>> >> >>> >>> >>>> (in
>>>>>>>>> >> >>> >>> >>>> example: laptop, small 2 node cluster, 10-node
>>>>>>>>> cluster,
>>>>>>>>> >> >>> >>> >>>> 20-node
>>>>>>>>> >> >>> >>> >>>> cluster), could be also very good marketing stuff
>>>>>>>>> to say
>>>>>>>>> >> >>> >>> >>>> "hey,
>>>>>>>>> >> >>> >>> >>>> you're
>>>>>>>>> >> >>> >>> >>>> telling that you're better, but Spark is still
>>>>>>>>> faster and is
>>>>>>>>> >> >>> >>> >>>> still
>>>>>>>>> >> >>> >>> >>>> getting even more fast!". This would be based on
>>>>>>>>> facts (just
>>>>>>>>> >> >>> >>> >>>> numbers),
>>>>>>>>> >> >>> >>> >>>> not opinions. It would be good for companies, for
>>>>>>>>> marketing
>>>>>>>>> >> >>> >>> >>>> puproses
>>>>>>>>> >> >>> >>> >>>> and
>>>>>>>>> >> >>> >>> >>>> for every Spark developer
>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>> >> >>> >>> >>>> Second: real-time streaming. I've written some
>>>>>>>>> time ago about
>>>>>>>>> >> >>> >>> >>>> real-time
>>>>>>>>> >> >>> >>> >>>> streaming support in Spark Structured Streaming.
>>>>>>>>> Some work
>>>>>>>>> >> >>> >>> >>>> should be
>>>>>>>>> >> >>> >>> >>>> done to make SSS more low-latency, but I think
>>>>>>>>> it's possible.
>>>>>>>>> >> >>> >>> >>>> Maybe
>>>>>>>>> >> >>> >>> >>>> Spark may look at Gearpump, which is also built on
>>>>>>>>> top of
>>>>>>>>> >> >>> >>> >>>> Akka?
>>>>>>>>> >> >>> >>> >>>> I
>>>>>>>>> >> >>> >>> >>>> don't
>>>>>>>>> >> >>> >>> >>>> know yet, it is good topic for SIP. However I
>>>>>>>>> think that
>>>>>>>>> >> >>> >>> >>>> Spark
>>>>>>>>> >> >>> >>> >>>> should
>>>>>>>>> >> >>> >>> >>>> have real-time streaming support. Currently I see
>>>>>>>>> many
>>>>>>>>> >> >>> >>> >>>> posts/comments
>>>>>>>>> >> >>> >>> >>>> that "Spark has too big latency". Spark Streaming
>>>>>>>>> is doing
>>>>>>>>> >> >>> >>> >>>> very
>>>>>>>>> >> >>> >>> >>>> good
>>>>>>>>> >> >>> >>> >>>> jobs with micro-batches, however I think it is
>>>>>>>>> possible to
>>>>>>>>> >> >>> >>> >>>> add
>>>>>>>>> >> >>> >>> >>>> also
>>>>>>>>> >> >>> >>> >>>> more
>>>>>>>>> >> >>> >>> >>>> real-time processing.
>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>> >> >>> >>> >>>> Other people said much more and I agree with
>>>>>>>>> proposal of SIP.
>>>>>>>>> >> >>> >>> >>>> I'm
>>>>>>>>> >> >>> >>> >>>> also
>>>>>>>>> >> >>> >>> >>>> happy that PMC's are not saying that they will not
>>>>>>>>> listen to
>>>>>>>>> >> >>> >>> >>>> users,
>>>>>>>>> >> >>> >>> >>>> but
>>>>>>>>> >> >>> >>> >>>> they really want to make Spark better for every
>>>>>>>>> user.
>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>> >> >>> >>> >>>> What do you think about these two topics?
>>>>>>>>> Especially I'm
>>>>>>>>> >> >>> >>> >>>> looking
>>>>>>>>> >> >>> >>> >>>> at
>>>>>>>>> >> >>> >>> >>>> Cody
>>>>>>>>> >> >>> >>> >>>> (who has started this topic) and PMCs :)
>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>> >> >>> >>> >>>> Pozdrawiam / Best regards,
>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>> >> >>> >>> >>>> Tomasz
>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>> >> >>> >>> >>>>
>>>>>>>>> >> >>> >>>
>>>>>>>>> >> >>> >>
>>>>>>>>> >> >>> >
>>>>>>>>> >> >>> >
>>>>>>>>> >> >
>>>>>>>>> >> >
>>>>>>>>> >>
>>>>>>>>> >> ------------------------------------------------------------
>>>>>>>>> ---------
>>>>>>>>> >> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>>> >>
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> > --
>>>>>>>>> > Ryan Blue
>>>>>>>>> > Software Engineer
>>>>>>>>> > Netflix
>>>>>>>>>
>>>>>>>>> ------------------------------------------------------------
>>>>>>>>> ---------
>>>>>>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> Joseph Bradley
>>>>>>>>
>>>>>>>> Software Engineer - Machine Learning
>>>>>>>>
>>>>>>>> Databricks, Inc.
>>>>>>>>
>>>>>>>> [image: http://databricks.com] <http://databricks.com/>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Spark Improvement Proposals

Posted by Reynold Xin <rx...@databricks.com>.

Here's a new draft that incorporated most of the feedback:
https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#

I added a specific role for SPIP Author and another one for SPIP Shepherd.

On Sat, Feb 11, 2017 at 6:13 PM, Xiao Li <ga...@gmail.com> wrote:

> During the summit, I also had a lot of discussions over similar topics
> with multiple Committers and active users. I heard many fantastic ideas. I
> believe Spark improvement proposals are good channels to collect the
> requirements/designs.
>
>
> IMO, we also need to consider the priority when working on these items.
> Even if the proposal is accepted, it does not mean it will be implemented
> and merged immediately. It is not a FIFO queue.
>
>
> Even if some PRs are merged, sometimes, we still have to revert them back,
> if the design and implementation are not reviewed carefully. We have to
> ensure our quality. Spark is not an application software. It is an
> infrastructure software that is being used by many many companies. We have
> to be very careful in the design and implementation, especially
> adding/changing the external APIs.
>
>
> When I developed the Mainframe infrastructure/middleware software in the
> past 6 years, I were involved in the discussions with external/internal
> customers. The to-do feature list was always above 100. Sometimes, the
> customers are feeling frustrated when we are unable to deliver them on time
> due to the resource limits and others. Even if they paid us billions, we
> still need to do it phase by phase or sometimes they have to accept the
> workarounds. That is the reality everyone has to face, I think.
>
>
> Thanks,
>
>
> Xiao Li
>
> 2017-02-11 7:57 GMT-08:00 Cody Koeninger <co...@koeninger.org>:
>
>> At the spark summit this week, everyone from PMC members to users I had
>> never met before were asking me about the Spark improvement proposals
>> idea.  It's clear that it's a real community need.
>>
>> But it's been almost half a year, and nothing visible has been done.
>>
>> Reynold, are you going to do this?
>>
>> If so, when?
>>
>> If not, why?
>>
>> You already did the right thing by including long-deserved committers.
>> Please keep doing the right thing for the community.
>>
>> On Wed, Jan 11, 2017 at 4:13 AM, Reynold Xin <rx...@databricks.com> wrote:
>>
>>> +1 on all counts (consensus, time bound, define roles)
>>>
>>> I can update the doc in the next few days and share back. Then maybe we
>>> can just officially vote on this. As Tim suggested, we might not get it
>>> 100% right the first time and would need to re-iterate. But that's fine.
>>>
>>>
>>> On Thu, Jan 5, 2017 at 3:29 PM, Tim Hunter <ti...@databricks.com>
>>> wrote:
>>>
>>>> Hi Cody,
>>>> thank you for bringing up this topic, I agree it is very important to
>>>> keep a cohesive community around some common, fluid goals. Here are a few
>>>> comments about the current document:
>>>>
>>>> 1. name: it should not overlap with an existing one such as SIP. Can
>>>> you imagine someone trying to discuss a scala spore proposal for spark?
>>>> "[Spark] SIP-3 is intended to evolve in tandem with [Scala] SIP-21". SPIP
>>>> sounds great.
>>>>
>>>> 2. roles: at a high level, SPIPs are meant to reach consensus for
>>>> technical decisions with a lasting impact. As such, the template should
>>>> emphasize the role of the various parties during this process:
>>>>
>>>>  - the SPIP author is responsible for building consensus. She is the
>>>> champion driving the process forward and is responsible for ensuring that
>>>> the SPIP follows the general guidelines. The author should be identified in
>>>> the SPIP. The authorship of a SPIP can be transferred if the current author
>>>> is not interested and someone else wants to move the SPIP forward. There
>>>> should probably be 2-3 authors at most for each SPIP.
>>>>
>>>>  - someone with voting power should probably shepherd the SPIP (and be
>>>> recorded as such): ensuring that the final decision over the SPIP is
>>>> recorded (rejected, accepted, etc.), and advising about the technical
>>>> quality of the SPIP: this person need not be a champion for the SPIP or
>>>> contribute to it, but rather makes sure it stands a chance of being
>>>> approved when the vote happens. Also, if the author cannot find anyone who
>>>> would want to take this role, this proposal is likely to be rejected anyway.
>>>>
>>>>  - users, committers, contributors have the roles already outlined in
>>>> the document
>>>>
>>>> 3. timeline: ideally, once a SPIP has been offered for voting, it
>>>> should move swiftly into either being accepted or rejected, so that we do
>>>> not end up with a distracting long tail of half-hearted proposals.
>>>>
>>>> These rules are meant to be flexible, but the current document should
>>>> be clear about who is in charge of a SPIP, and the state it is currently in.
>>>>
>>>> We have had long discussions over some very important questions such as
>>>> approval. I do not have an opinion on these, but why not make a pick and
>>>> reevaluate this decision later? This is not a binding process at this point.
>>>>
>>>> Tim
>>>>
>>>>
>>>> On Tue, Jan 3, 2017 at 3:16 PM, Cody Koeninger <co...@koeninger.org>
>>>> wrote:
>>>>
>>>>> I don't have a concern about voting vs consensus.
>>>>>
>>>>> I have a concern that whatever the decision making process is, it is
>>>>> explicitly announced on the ticket for the given proposal, with an explicit
>>>>> deadline, and an explicit outcome.
>>>>>
>>>>>
>>>>> On Tue, Jan 3, 2017 at 4:08 PM, Imran Rashid <ir...@cloudera.com>
>>>>> wrote:
>>>>>
>>>>>> I'm also in favor of this.  Thanks for your persistence Cody.
>>>>>>
>>>>>> My take on the specific issues Joseph mentioned:
>>>>>>
>>>>>> 1) voting vs. consensus -- I agree with the argument Ryan Blue made
>>>>>> earlier for consensus:
>>>>>>
>>>>>> > Majority vs consensus: My rationale is that I don't think we want
>>>>>> to consider a proposal approved if it had objections serious enough that
>>>>>> committers down-voted (or PMC depending on who gets a vote). If these
>>>>>> proposals are like PEPs, then they represent a significant amount of
>>>>>> community effort and I wouldn't want to move forward if up to half of the
>>>>>> community thinks it's an untenable idea.
>>>>>>
>>>>>> 2) Design doc template -- agree this would be useful, but also seems
>>>>>> totally orthogonal to moving forward on the SIP proposal.
>>>>>>
>>>>>> 3) agree w/ Joseph's proposal for updating the template.
>>>>>>
>>>>>> One small addition:
>>>>>>
>>>>>> 4) Deciding on a name -- minor, but I think its wroth disambiguating
>>>>>> from Scala's SIPs, and the best proposal I've heard is "SPIP".   At least,
>>>>>> no one has objected.  (don't care enough that I'd object to anything else,
>>>>>> though.)
>>>>>>
>>>>>>
>>>>>> On Tue, Jan 3, 2017 at 3:30 PM, Joseph Bradley <joseph@databricks.com
>>>>>> > wrote:
>>>>>>
>>>>>>> Hi Cody,
>>>>>>>
>>>>>>> Thanks for being persistent about this.  I too would like to see
>>>>>>> this happen.  Reviewing the thread, it sounds like the main things
>>>>>>> remaining are:
>>>>>>> * Decide about a few issues
>>>>>>> * Finalize the doc(s)
>>>>>>> * Vote on this proposal
>>>>>>>
>>>>>>> Issues & TODOs:
>>>>>>>
>>>>>>> (1) The main issue I see above is voting vs. consensus.  I have
>>>>>>> little preference here.  It sounds like something which could be tailored
>>>>>>> based on whether we see too many or too few SIPs being approved.
>>>>>>>
>>>>>>> (2) Design doc template  (This would be great to have for Spark
>>>>>>> regardless of this SIP discussion.)
>>>>>>> * Reynold, are you still putting this together?
>>>>>>>
>>>>>>> (3) Template cleanups.  Listing some items mentioned above + a new
>>>>>>> one w.r.t. Reynold's draft
>>>>>>> <https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#>
>>>>>>> :
>>>>>>> * Reinstate the "Where" section with links to current and past SIPs
>>>>>>> * Add field for stating explicit deadlines for approval
>>>>>>> * Add field for stating Author & Committer shepherd
>>>>>>>
>>>>>>> Thanks all!
>>>>>>> Joseph
>>>>>>>
>>>>>>> On Mon, Jan 2, 2017 at 7:45 AM, Cody Koeninger <co...@koeninger.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I'm bumping this one more time for the new year, and then I'm
>>>>>>>> giving up.
>>>>>>>>
>>>>>>>> Please, fix your process, even if it isn't exactly the way I
>>>>>>>> suggested.
>>>>>>>>
>>>>>>>> On Tue, Nov 8, 2016 at 11:14 AM, Ryan Blue <rb...@netflix.com>
>>>>>>>> wrote:
>>>>>>>> > On lazy consensus as opposed to voting:
>>>>>>>> >
>>>>>>>> > First, why lazy consensus? The proposal was for consensus, which
>>>>>>>> is at least
>>>>>>>> > three +1 votes and no vetos. Consensus has no losing side, it
>>>>>>>> requires
>>>>>>>> > getting to a point where there is agreement. Isn't that agreement
>>>>>>>> what we
>>>>>>>> > want to achieve with these proposals?
>>>>>>>> >
>>>>>>>> > Second, lazy consensus only removes the requirement for three +1
>>>>>>>> votes. Why
>>>>>>>> > would we not want at least three committers to think something is
>>>>>>>> a good
>>>>>>>> > idea before adopting the proposal?
>>>>>>>> >
>>>>>>>> > rb
>>>>>>>> >
>>>>>>>> > On Tue, Nov 8, 2016 at 8:13 AM, Cody Koeninger <
>>>>>>>> cody@koeninger.org> wrote:
>>>>>>>> >>
>>>>>>>> >> So there are some minor things (the Where section heading
>>>>>>>> appears to
>>>>>>>> >> be dropped; wherever this document is posted it needs to
>>>>>>>> actually link
>>>>>>>> >> to a jira filter showing current / past SIPs) but it doesn't
>>>>>>>> look like
>>>>>>>> >> I can comment on the google doc.
>>>>>>>> >>
>>>>>>>> >> The major substantive issue that I have is that this version is
>>>>>>>> >> significantly less clear as to the outcome of an SIP.
>>>>>>>> >>
>>>>>>>> >> The apache example of lazy consensus at
>>>>>>>> >> http://apache.org/foundation/voting.html#LazyConsensus involves
>>>>>>>> an
>>>>>>>> >> explicit announcement of an explicit deadline, which I think are
>>>>>>>> >> necessary for clarity.
>>>>>>>> >>
>>>>>>>> >>
>>>>>>>> >>
>>>>>>>> >> On Mon, Nov 7, 2016 at 1:55 PM, Reynold Xin <rx...@databricks.com>
>>>>>>>> wrote:
>>>>>>>> >> > It turned out suggested edits (trackable) don't show up for
>>>>>>>> non-owners,
>>>>>>>> >> > so
>>>>>>>> >> > I've just merged all the edits in place. It should be visible
>>>>>>>> now.
>>>>>>>> >> >
>>>>>>>> >> > On Mon, Nov 7, 2016 at 10:10 AM, Reynold Xin <
>>>>>>>> rxin@databricks.com>
>>>>>>>> >> > wrote:
>>>>>>>> >> >>
>>>>>>>> >> >> Oops. Let me try figure that out.
>>>>>>>> >> >>
>>>>>>>> >> >>
>>>>>>>> >> >> On Monday, November 7, 2016, Cody Koeninger <
>>>>>>>> cody@koeninger.org> wrote:
>>>>>>>> >> >>>
>>>>>>>> >> >>> Thanks for picking up on this.
>>>>>>>> >> >>>
>>>>>>>> >> >>> Maybe I fail at google docs, but I can't see any edits on
>>>>>>>> the document
>>>>>>>> >> >>> you linked.
>>>>>>>> >> >>>
>>>>>>>> >> >>> Regarding lazy consensus, if the board in general has less
>>>>>>>> of an issue
>>>>>>>> >> >>> with that, sure.  As long as it is clearly announced, lasts
>>>>>>>> at least
>>>>>>>> >> >>> 72 hours, and has a clear outcome.
>>>>>>>> >> >>>
>>>>>>>> >> >>> The other points are hard to comment on without being able
>>>>>>>> to see the
>>>>>>>> >> >>> text in question.
>>>>>>>> >> >>>
>>>>>>>> >> >>>
>>>>>>>> >> >>> On Mon, Nov 7, 2016 at 3:11 AM, Reynold Xin <
>>>>>>>> rxin@databricks.com>
>>>>>>>> >> >>> wrote:
>>>>>>>> >> >>> > I just looked through the entire thread again tonight -
>>>>>>>> there are a
>>>>>>>> >> >>> > lot
>>>>>>>> >> >>> > of
>>>>>>>> >> >>> > great ideas being discussed. Thanks Cody for taking the
>>>>>>>> first crack
>>>>>>>> >> >>> > at
>>>>>>>> >> >>> > the
>>>>>>>> >> >>> > proposal.
>>>>>>>> >> >>> >
>>>>>>>> >> >>> > I want to first comment on the context. Spark is one of
>>>>>>>> the most
>>>>>>>> >> >>> > innovative
>>>>>>>> >> >>> > and important projects in (big) data -- overall technical
>>>>>>>> decisions
>>>>>>>> >> >>> > made in
>>>>>>>> >> >>> > Apache Spark are sound. But of course, a project as large
>>>>>>>> and active
>>>>>>>> >> >>> > as
>>>>>>>> >> >>> > Spark always have room for improvement, and we as a
>>>>>>>> community should
>>>>>>>> >> >>> > strive
>>>>>>>> >> >>> > to take it to the next level.
>>>>>>>> >> >>> >
>>>>>>>> >> >>> > To that end, the two biggest areas for improvements in my
>>>>>>>> opinion
>>>>>>>> >> >>> > are:
>>>>>>>> >> >>> >
>>>>>>>> >> >>> > 1. Visibility: There are so much happening that it is
>>>>>>>> difficult to
>>>>>>>> >> >>> > know
>>>>>>>> >> >>> > what
>>>>>>>> >> >>> > really is going on. For people that don't follow closely,
>>>>>>>> it is
>>>>>>>> >> >>> > difficult to
>>>>>>>> >> >>> > know what the important initiatives are. Even for people
>>>>>>>> that do
>>>>>>>> >> >>> > follow, it
>>>>>>>> >> >>> > is difficult to know what specific things require their
>>>>>>>> attention,
>>>>>>>> >> >>> > since the
>>>>>>>> >> >>> > number of pull requests and JIRA tickets are high and it's
>>>>>>>> difficult
>>>>>>>> >> >>> > to
>>>>>>>> >> >>> > extract signal from noise.
>>>>>>>> >> >>> >
>>>>>>>> >> >>> > 2. Solicit user (broadly defined, including developers
>>>>>>>> themselves)
>>>>>>>> >> >>> > input
>>>>>>>> >> >>> > more proactively: At the end of the day the project
>>>>>>>> provides value
>>>>>>>> >> >>> > because
>>>>>>>> >> >>> > users use it. Users can't tell us exactly what to build,
>>>>>>>> but it is
>>>>>>>> >> >>> > important
>>>>>>>> >> >>> > to get their inputs.
>>>>>>>> >> >>> >
>>>>>>>> >> >>> >
>>>>>>>> >> >>> > I've taken Cody's doc and edited it:
>>>>>>>> >> >>> >
>>>>>>>> >> >>> >
>>>>>>>> >> >>> > https://docs.google.com/docume
>>>>>>>> nt/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#headi
>>>>>>>> ng=h.36ut37zh7w2b
>>>>>>>> >> >>> > (I've made all my modifications trackable)
>>>>>>>> >> >>> >
>>>>>>>> >> >>> > There are couple high level changes I made:
>>>>>>>> >> >>> >
>>>>>>>> >> >>> > 1. I've consulted a board member and he recommended lazy
>>>>>>>> consensus
>>>>>>>> >> >>> > as
>>>>>>>> >> >>> > opposed to voting. The reason being in voting there can
>>>>>>>> easily be a
>>>>>>>> >> >>> > "loser'
>>>>>>>> >> >>> > that gets outvoted.
>>>>>>>> >> >>> >
>>>>>>>> >> >>> > 2. I made it lighter weight, and renamed "strategy" to
>>>>>>>> "optional
>>>>>>>> >> >>> > design
>>>>>>>> >> >>> > sketch". Echoing one of the earlier email: "IMHO so far
>>>>>>>> aside from
>>>>>>>> >> >>> > tagging
>>>>>>>> >> >>> > things and linking them elsewhere simply having design
>>>>>>>> docs and
>>>>>>>> >> >>> > prototypes
>>>>>>>> >> >>> > implementations in PRs is not something that has not
>>>>>>>> worked so far".
>>>>>>>> >> >>> >
>>>>>>>> >> >>> > 3. I made some the language tweaks to focus more on
>>>>>>>> visibility. For
>>>>>>>> >> >>> > example,
>>>>>>>> >> >>> > "The purpose of an SIP is to inform and involve", rather
>>>>>>>> than just
>>>>>>>> >> >>> > "involve". SIPs should also have at least two emails that
>>>>>>>> go to
>>>>>>>> >> >>> > dev@.
>>>>>>>> >> >>> >
>>>>>>>> >> >>> >
>>>>>>>> >> >>> > While I was editing this, I thought we really needed a
>>>>>>>> suggested
>>>>>>>> >> >>> > template
>>>>>>>> >> >>> > for design doc too. I will get to that too ...
>>>>>>>> >> >>> >
>>>>>>>> >> >>> >
>>>>>>>> >> >>> > On Tue, Nov 1, 2016 at 12:09 AM, Reynold Xin <
>>>>>>>> rxin@databricks.com>
>>>>>>>> >> >>> > wrote:
>>>>>>>> >> >>> >>
>>>>>>>> >> >>> >> Most things looked OK to me too, although I do plan to
>>>>>>>> take a
>>>>>>>> >> >>> >> closer
>>>>>>>> >> >>> >> look
>>>>>>>> >> >>> >> after Nov 1st when we cut the release branch for 2.1.
>>>>>>>> >> >>> >>
>>>>>>>> >> >>> >>
>>>>>>>> >> >>> >> On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin
>>>>>>>> >> >>> >> <va...@cloudera.com>
>>>>>>>> >> >>> >> wrote:
>>>>>>>> >> >>> >>>
>>>>>>>> >> >>> >>> The proposal looks OK to me. I assume, even though it's
>>>>>>>> not
>>>>>>>> >> >>> >>> explicitly
>>>>>>>> >> >>> >>> called, that voting would happen by e-mail? A template
>>>>>>>> for the
>>>>>>>> >> >>> >>> proposal document (instead of just a bullet nice) would
>>>>>>>> also be
>>>>>>>> >> >>> >>> nice,
>>>>>>>> >> >>> >>> but that can be done at any time.
>>>>>>>> >> >>> >>>
>>>>>>>> >> >>> >>> BTW, shameless plug: I filed SPARK-18085 which I
>>>>>>>> consider a
>>>>>>>> >> >>> >>> candidate
>>>>>>>> >> >>> >>> for a SIP, given the scope of the work. The document
>>>>>>>> attached even
>>>>>>>> >> >>> >>> somewhat matches the proposed format. So if anyone wants
>>>>>>>> to try
>>>>>>>> >> >>> >>> out
>>>>>>>> >> >>> >>> the process...
>>>>>>>> >> >>> >>>
>>>>>>>> >> >>> >>> On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger
>>>>>>>> >> >>> >>> <co...@koeninger.org>
>>>>>>>> >> >>> >>> wrote:
>>>>>>>> >> >>> >>> > Now that spark summit europe is over, are any
>>>>>>>> committers
>>>>>>>> >> >>> >>> > interested
>>>>>>>> >> >>> >>> > in
>>>>>>>> >> >>> >>> > moving forward with this?
>>>>>>>> >> >>> >>> >
>>>>>>>> >> >>> >>> >
>>>>>>>> >> >>> >>> >
>>>>>>>> >> >>> >>> >
>>>>>>>> >> >>> >>> > https://github.com/koeninger/s
>>>>>>>> park-1/blob/SIP-0/docs/spark-improvement-proposals.md
>>>>>>>> >> >>> >>> >
>>>>>>>> >> >>> >>> > Or are we going to let this discussion die on the vine?
>>>>>>>> >> >>> >>> >
>>>>>>>> >> >>> >>> > On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda
>>>>>>>> >> >>> >>> > <to...@outlook.com> wrote:
>>>>>>>> >> >>> >>> >> Maybe my mail was not clear enough.
>>>>>>>> >> >>> >>> >>
>>>>>>>> >> >>> >>> >>
>>>>>>>> >> >>> >>> >> I didn't want to write "lets focus on Flink" or any
>>>>>>>> other
>>>>>>>> >> >>> >>> >> framework.
>>>>>>>> >> >>> >>> >> The
>>>>>>>> >> >>> >>> >> idea with benchmarks was to show two things:
>>>>>>>> >> >>> >>> >>
>>>>>>>> >> >>> >>> >> - why some people are doing bad PR for Spark
>>>>>>>> >> >>> >>> >>
>>>>>>>> >> >>> >>> >> - how - in easy way - we can change it and show that
>>>>>>>> Spark is
>>>>>>>> >> >>> >>> >> still on
>>>>>>>> >> >>> >>> >> the
>>>>>>>> >> >>> >>> >> top
>>>>>>>> >> >>> >>> >>
>>>>>>>> >> >>> >>> >>
>>>>>>>> >> >>> >>> >> No more, no less. Benchmarks will be helpful, but I
>>>>>>>> don't think
>>>>>>>> >> >>> >>> >> they're the
>>>>>>>> >> >>> >>> >> most important thing in Spark :) On the Spark main
>>>>>>>> page there
>>>>>>>> >> >>> >>> >> is
>>>>>>>> >> >>> >>> >> still
>>>>>>>> >> >>> >>> >> chart
>>>>>>>> >> >>> >>> >> "Spark vs Hadoop". It is important to show that
>>>>>>>> framework is
>>>>>>>> >> >>> >>> >> not
>>>>>>>> >> >>> >>> >> the
>>>>>>>> >> >>> >>> >> same
>>>>>>>> >> >>> >>> >> Spark with other API, but much faster and optimized,
>>>>>>>> comparable
>>>>>>>> >> >>> >>> >> or
>>>>>>>> >> >>> >>> >> even
>>>>>>>> >> >>> >>> >> faster than other frameworks.
>>>>>>>> >> >>> >>> >>
>>>>>>>> >> >>> >>> >>
>>>>>>>> >> >>> >>> >> About real-time streaming, I think it would be just
>>>>>>>> good to see
>>>>>>>> >> >>> >>> >> it
>>>>>>>> >> >>> >>> >> in
>>>>>>>> >> >>> >>> >> Spark.
>>>>>>>> >> >>> >>> >> I very like current Spark model, but many voices that
>>>>>>>> says "we
>>>>>>>> >> >>> >>> >> need
>>>>>>>> >> >>> >>> >> more" -
>>>>>>>> >> >>> >>> >> community should listen also them and try to help
>>>>>>>> them. With
>>>>>>>> >> >>> >>> >> SIPs
>>>>>>>> >> >>> >>> >> it
>>>>>>>> >> >>> >>> >> would
>>>>>>>> >> >>> >>> >> be easier, I've just posted this example as "thing
>>>>>>>> that may be
>>>>>>>> >> >>> >>> >> changed
>>>>>>>> >> >>> >>> >> with
>>>>>>>> >> >>> >>> >> SIP".
>>>>>>>> >> >>> >>> >>
>>>>>>>> >> >>> >>> >>
>>>>>>>> >> >>> >>> >> I very like unification via Datasets, but there is a
>>>>>>>> lot of
>>>>>>>> >> >>> >>> >> algorithms
>>>>>>>> >> >>> >>> >> inside - let's make easy API, but with strong
>>>>>>>> background
>>>>>>>> >> >>> >>> >> (articles,
>>>>>>>> >> >>> >>> >> benchmarks, descriptions, etc) that shows that Spark
>>>>>>>> is still
>>>>>>>> >> >>> >>> >> modern
>>>>>>>> >> >>> >>> >> framework.
>>>>>>>> >> >>> >>> >>
>>>>>>>> >> >>> >>> >>
>>>>>>>> >> >>> >>> >> Maybe now my intention will be clearer :) As I said
>>>>>>>> >> >>> >>> >> organizational
>>>>>>>> >> >>> >>> >> ideas
>>>>>>>> >> >>> >>> >> were already mentioned and I agree with them, my mail
>>>>>>>> was just
>>>>>>>> >> >>> >>> >> to
>>>>>>>> >> >>> >>> >> show
>>>>>>>> >> >>> >>> >> some
>>>>>>>> >> >>> >>> >> aspects from my side, so from theside of developer
>>>>>>>> and person
>>>>>>>> >> >>> >>> >> who
>>>>>>>> >> >>> >>> >> is
>>>>>>>> >> >>> >>> >> trying
>>>>>>>> >> >>> >>> >> to help others with Spark (via StackOverflow or other
>>>>>>>> ways)
>>>>>>>> >> >>> >>> >>
>>>>>>>> >> >>> >>> >>
>>>>>>>> >> >>> >>> >> Pozdrawiam / Best regards,
>>>>>>>> >> >>> >>> >>
>>>>>>>> >> >>> >>> >> Tomasz
>>>>>>>> >> >>> >>> >>
>>>>>>>> >> >>> >>> >>
>>>>>>>> >> >>> >>> >> ________________________________
>>>>>>>> >> >>> >>> >> Od: Cody Koeninger <co...@koeninger.org>
>>>>>>>> >> >>> >>> >> Wysłane: 17 października 2016 16:46
>>>>>>>> >> >>> >>> >> Do: Debasish Das
>>>>>>>> >> >>> >>> >> DW: Tomasz Gawęda; dev@spark.apache.org
>>>>>>>> >> >>> >>> >> Temat: Re: Spark Improvement Proposals
>>>>>>>> >> >>> >>> >>
>>>>>>>> >> >>> >>> >> I think narrowly focusing on Flink or benchmarks is
>>>>>>>> missing my
>>>>>>>> >> >>> >>> >> point.
>>>>>>>> >> >>> >>> >>
>>>>>>>> >> >>> >>> >> My point is evolve or die.  Spark's governance and
>>>>>>>> organization
>>>>>>>> >> >>> >>> >> is
>>>>>>>> >> >>> >>> >> hampering its ability to evolve technologically, and
>>>>>>>> it needs
>>>>>>>> >> >>> >>> >> to
>>>>>>>> >> >>> >>> >> change.
>>>>>>>> >> >>> >>> >>
>>>>>>>> >> >>> >>> >> On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das
>>>>>>>> >> >>> >>> >> <de...@gmail.com>
>>>>>>>> >> >>> >>> >> wrote:
>>>>>>>> >> >>> >>> >>> Thanks Cody for bringing up a valid point...I picked
>>>>>>>> up Spark
>>>>>>>> >> >>> >>> >>> in
>>>>>>>> >> >>> >>> >>> 2014
>>>>>>>> >> >>> >>> >>> as
>>>>>>>> >> >>> >>> >>> soon as I looked into it since compared to writing
>>>>>>>> Java
>>>>>>>> >> >>> >>> >>> map-reduce
>>>>>>>> >> >>> >>> >>> and
>>>>>>>> >> >>> >>> >>> Cascading code, Spark made writing distributed code
>>>>>>>> fun...But
>>>>>>>> >> >>> >>> >>> now
>>>>>>>> >> >>> >>> >>> as
>>>>>>>> >> >>> >>> >>> we
>>>>>>>> >> >>> >>> >>> went
>>>>>>>> >> >>> >>> >>> deeper with Spark and real-time streaming use-case
>>>>>>>> gets more
>>>>>>>> >> >>> >>> >>> prominent, I
>>>>>>>> >> >>> >>> >>> think it is time to bring a messaging model in
>>>>>>>> conjunction
>>>>>>>> >> >>> >>> >>> with
>>>>>>>> >> >>> >>> >>> the
>>>>>>>> >> >>> >>> >>> batch/micro-batch API that Spark is good
>>>>>>>> at....akka-streams
>>>>>>>> >> >>> >>> >>> close
>>>>>>>> >> >>> >>> >>> integration with spark micro-batching APIs looks
>>>>>>>> like a great
>>>>>>>> >> >>> >>> >>> direction to
>>>>>>>> >> >>> >>> >>> stay in the game with Apache Flink...Spark 2.0
>>>>>>>> integrated
>>>>>>>> >> >>> >>> >>> streaming
>>>>>>>> >> >>> >>> >>> with
>>>>>>>> >> >>> >>> >>> batch with the assumption is that micro-batching is
>>>>>>>> sufficient
>>>>>>>> >> >>> >>> >>> to
>>>>>>>> >> >>> >>> >>> run
>>>>>>>> >> >>> >>> >>> SQL
>>>>>>>> >> >>> >>> >>> commands on stream but do we really have time to do
>>>>>>>> SQL
>>>>>>>> >> >>> >>> >>> processing at
>>>>>>>> >> >>> >>> >>> streaming data within 1-2 seconds ?
>>>>>>>> >> >>> >>> >>>
>>>>>>>> >> >>> >>> >>> After reading the email chain, I started to look
>>>>>>>> into Flink
>>>>>>>> >> >>> >>> >>> documentation
>>>>>>>> >> >>> >>> >>> and if you compare it with Spark documentation, I
>>>>>>>> think we
>>>>>>>> >> >>> >>> >>> have
>>>>>>>> >> >>> >>> >>> major
>>>>>>>> >> >>> >>> >>> work
>>>>>>>> >> >>> >>> >>> to do detailing out Spark internals so that more
>>>>>>>> people from
>>>>>>>> >> >>> >>> >>> community
>>>>>>>> >> >>> >>> >>> start
>>>>>>>> >> >>> >>> >>> to take active role in improving the issues so that
>>>>>>>> Spark
>>>>>>>> >> >>> >>> >>> stays
>>>>>>>> >> >>> >>> >>> strong
>>>>>>>> >> >>> >>> >>> compared to Flink.
>>>>>>>> >> >>> >>> >>>
>>>>>>>> >> >>> >>> >>>
>>>>>>>> >> >>> >>> >>> https://cwiki.apache.org/confl
>>>>>>>> uence/display/SPARK/Spark+Internals
>>>>>>>> >> >>> >>> >>>
>>>>>>>> >> >>> >>> >>>
>>>>>>>> >> >>> >>> >>> https://cwiki.apache.org/confl
>>>>>>>> uence/display/FLINK/Flink+Internals
>>>>>>>> >> >>> >>> >>>
>>>>>>>> >> >>> >>> >>> Spark is no longer an engine that works for
>>>>>>>> micro-batch and
>>>>>>>> >> >>> >>> >>> batch...We
>>>>>>>> >> >>> >>> >>> (and
>>>>>>>> >> >>> >>> >>> I am sure many others) are pushing spark as an
>>>>>>>> engine for
>>>>>>>> >> >>> >>> >>> stream
>>>>>>>> >> >>> >>> >>> and
>>>>>>>> >> >>> >>> >>> query
>>>>>>>> >> >>> >>> >>> processing.....we need to make it a state-of-the-art
>>>>>>>> engine
>>>>>>>> >> >>> >>> >>> for
>>>>>>>> >> >>> >>> >>> high
>>>>>>>> >> >>> >>> >>> speed
>>>>>>>> >> >>> >>> >>> streaming data and user queries as well !
>>>>>>>> >> >>> >>> >>>
>>>>>>>> >> >>> >>> >>> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda
>>>>>>>> >> >>> >>> >>> <to...@outlook.com>
>>>>>>>> >> >>> >>> >>> wrote:
>>>>>>>> >> >>> >>> >>>>
>>>>>>>> >> >>> >>> >>>> Hi everyone,
>>>>>>>> >> >>> >>> >>>>
>>>>>>>> >> >>> >>> >>>> I'm quite late with my answer, but I think my
>>>>>>>> suggestions may
>>>>>>>> >> >>> >>> >>>> help a
>>>>>>>> >> >>> >>> >>>> little bit. :) Many technical and organizational
>>>>>>>> topics were
>>>>>>>> >> >>> >>> >>>> mentioned,
>>>>>>>> >> >>> >>> >>>> but I want to focus on these negative posts about
>>>>>>>> Spark and
>>>>>>>> >> >>> >>> >>>> about
>>>>>>>> >> >>> >>> >>>> "haters"
>>>>>>>> >> >>> >>> >>>>
>>>>>>>> >> >>> >>> >>>> I really like Spark. Easy of use, speed, very good
>>>>>>>> community
>>>>>>>> >> >>> >>> >>>> -
>>>>>>>> >> >>> >>> >>>> it's
>>>>>>>> >> >>> >>> >>>> everything here. But Every project has to "flight"
>>>>>>>> on
>>>>>>>> >> >>> >>> >>>> "framework
>>>>>>>> >> >>> >>> >>>> market"
>>>>>>>> >> >>> >>> >>>> to be still no 1. I'm following many Spark and Big
>>>>>>>> Data
>>>>>>>> >> >>> >>> >>>> communities,
>>>>>>>> >> >>> >>> >>>> maybe my mail will inspire someone :)
>>>>>>>> >> >>> >>> >>>>
>>>>>>>> >> >>> >>> >>>> You (every Spark developer; so far I didn't have
>>>>>>>> enough time
>>>>>>>> >> >>> >>> >>>> to
>>>>>>>> >> >>> >>> >>>> join
>>>>>>>> >> >>> >>> >>>> contributing to Spark) has done excellent job. So
>>>>>>>> why are
>>>>>>>> >> >>> >>> >>>> some
>>>>>>>> >> >>> >>> >>>> people
>>>>>>>> >> >>> >>> >>>> saying that Flink (or other framework) is better,
>>>>>>>> like it was
>>>>>>>> >> >>> >>> >>>> posted
>>>>>>>> >> >>> >>> >>>> in
>>>>>>>> >> >>> >>> >>>> this mailing list? No, not because that framework
>>>>>>>> is better
>>>>>>>> >> >>> >>> >>>> in
>>>>>>>> >> >>> >>> >>>> all
>>>>>>>> >> >>> >>> >>>> cases.. In my opinion, many of these discussions
>>>>>>>> where
>>>>>>>> >> >>> >>> >>>> started
>>>>>>>> >> >>> >>> >>>> after
>>>>>>>> >> >>> >>> >>>> Flink marketing-like posts. Please look at
>>>>>>>> StackOverflow
>>>>>>>> >> >>> >>> >>>> "Flink
>>>>>>>> >> >>> >>> >>>> vs
>>>>>>>> >> >>> >>> >>>> ...."
>>>>>>>> >> >>> >>> >>>> posts, almost every post in "winned" by Flink.
>>>>>>>> Answers are
>>>>>>>> >> >>> >>> >>>> sometimes
>>>>>>>> >> >>> >>> >>>> saying nothing about other frameworks, Flink's
>>>>>>>> users (often
>>>>>>>> >> >>> >>> >>>> PMC's)
>>>>>>>> >> >>> >>> >>>> are
>>>>>>>> >> >>> >>> >>>> just posting same information about real-time
>>>>>>>> streaming,
>>>>>>>> >> >>> >>> >>>> about
>>>>>>>> >> >>> >>> >>>> delta
>>>>>>>> >> >>> >>> >>>> iterations, etc. It look smart and very often it is
>>>>>>>> marked as
>>>>>>>> >> >>> >>> >>>> an
>>>>>>>> >> >>> >>> >>>> aswer,
>>>>>>>> >> >>> >>> >>>> even if - in my opinion - there wasn't told all the
>>>>>>>> truth.
>>>>>>>> >> >>> >>> >>>>
>>>>>>>> >> >>> >>> >>>>
>>>>>>>> >> >>> >>> >>>> My suggestion: I don't have enough money and
>>>>>>>> knowledgle to
>>>>>>>> >> >>> >>> >>>> perform
>>>>>>>> >> >>> >>> >>>> huge
>>>>>>>> >> >>> >>> >>>> performance test. Maybe some company, that supports
>>>>>>>> Spark
>>>>>>>> >> >>> >>> >>>> (Databricks,
>>>>>>>> >> >>> >>> >>>> Cloudera? - just saying you're most visible in
>>>>>>>> community :) )
>>>>>>>> >> >>> >>> >>>> could
>>>>>>>> >> >>> >>> >>>> perform performance test of:
>>>>>>>> >> >>> >>> >>>>
>>>>>>>> >> >>> >>> >>>> - streaming engine - probably Spark will loose
>>>>>>>> because of
>>>>>>>> >> >>> >>> >>>> mini-batch
>>>>>>>> >> >>> >>> >>>> model, however currently the difference should be
>>>>>>>> much lower
>>>>>>>> >> >>> >>> >>>> that in
>>>>>>>> >> >>> >>> >>>> previous versions
>>>>>>>> >> >>> >>> >>>>
>>>>>>>> >> >>> >>> >>>> - Machine Learning models
>>>>>>>> >> >>> >>> >>>>
>>>>>>>> >> >>> >>> >>>> - batch jobs
>>>>>>>> >> >>> >>> >>>>
>>>>>>>> >> >>> >>> >>>> - Graph jobs
>>>>>>>> >> >>> >>> >>>>
>>>>>>>> >> >>> >>> >>>> - SQL queries
>>>>>>>> >> >>> >>> >>>>
>>>>>>>> >> >>> >>> >>>> People will see that Spark is envolving and is also
>>>>>>>> a modern
>>>>>>>> >> >>> >>> >>>> framework,
>>>>>>>> >> >>> >>> >>>> because after reading posts mentioned above people
>>>>>>>> may think
>>>>>>>> >> >>> >>> >>>> "it
>>>>>>>> >> >>> >>> >>>> is
>>>>>>>> >> >>> >>> >>>> outdated, future is in framework X".
>>>>>>>> >> >>> >>> >>>>
>>>>>>>> >> >>> >>> >>>> Matei Zaharia posted excellent blog post about how
>>>>>>>> Spark
>>>>>>>> >> >>> >>> >>>> Structured
>>>>>>>> >> >>> >>> >>>> Streaming beats every other framework in terms of
>>>>>>>> easy-of-use
>>>>>>>> >> >>> >>> >>>> and
>>>>>>>> >> >>> >>> >>>> reliability. Performance tests, done in various
>>>>>>>> environments
>>>>>>>> >> >>> >>> >>>> (in
>>>>>>>> >> >>> >>> >>>> example: laptop, small 2 node cluster, 10-node
>>>>>>>> cluster,
>>>>>>>> >> >>> >>> >>>> 20-node
>>>>>>>> >> >>> >>> >>>> cluster), could be also very good marketing stuff
>>>>>>>> to say
>>>>>>>> >> >>> >>> >>>> "hey,
>>>>>>>> >> >>> >>> >>>> you're
>>>>>>>> >> >>> >>> >>>> telling that you're better, but Spark is still
>>>>>>>> faster and is
>>>>>>>> >> >>> >>> >>>> still
>>>>>>>> >> >>> >>> >>>> getting even more fast!". This would be based on
>>>>>>>> facts (just
>>>>>>>> >> >>> >>> >>>> numbers),
>>>>>>>> >> >>> >>> >>>> not opinions. It would be good for companies, for
>>>>>>>> marketing
>>>>>>>> >> >>> >>> >>>> puproses
>>>>>>>> >> >>> >>> >>>> and
>>>>>>>> >> >>> >>> >>>> for every Spark developer
>>>>>>>> >> >>> >>> >>>>
>>>>>>>> >> >>> >>> >>>>
>>>>>>>> >> >>> >>> >>>> Second: real-time streaming. I've written some time
>>>>>>>> ago about
>>>>>>>> >> >>> >>> >>>> real-time
>>>>>>>> >> >>> >>> >>>> streaming support in Spark Structured Streaming.
>>>>>>>> Some work
>>>>>>>> >> >>> >>> >>>> should be
>>>>>>>> >> >>> >>> >>>> done to make SSS more low-latency, but I think it's
>>>>>>>> possible.
>>>>>>>> >> >>> >>> >>>> Maybe
>>>>>>>> >> >>> >>> >>>> Spark may look at Gearpump, which is also built on
>>>>>>>> top of
>>>>>>>> >> >>> >>> >>>> Akka?
>>>>>>>> >> >>> >>> >>>> I
>>>>>>>> >> >>> >>> >>>> don't
>>>>>>>> >> >>> >>> >>>> know yet, it is good topic for SIP. However I think
>>>>>>>> that
>>>>>>>> >> >>> >>> >>>> Spark
>>>>>>>> >> >>> >>> >>>> should
>>>>>>>> >> >>> >>> >>>> have real-time streaming support. Currently I see
>>>>>>>> many
>>>>>>>> >> >>> >>> >>>> posts/comments
>>>>>>>> >> >>> >>> >>>> that "Spark has too big latency". Spark Streaming
>>>>>>>> is doing
>>>>>>>> >> >>> >>> >>>> very
>>>>>>>> >> >>> >>> >>>> good
>>>>>>>> >> >>> >>> >>>> jobs with micro-batches, however I think it is
>>>>>>>> possible to
>>>>>>>> >> >>> >>> >>>> add
>>>>>>>> >> >>> >>> >>>> also
>>>>>>>> >> >>> >>> >>>> more
>>>>>>>> >> >>> >>> >>>> real-time processing.
>>>>>>>> >> >>> >>> >>>>
>>>>>>>> >> >>> >>> >>>> Other people said much more and I agree with
>>>>>>>> proposal of SIP.
>>>>>>>> >> >>> >>> >>>> I'm
>>>>>>>> >> >>> >>> >>>> also
>>>>>>>> >> >>> >>> >>>> happy that PMC's are not saying that they will not
>>>>>>>> listen to
>>>>>>>> >> >>> >>> >>>> users,
>>>>>>>> >> >>> >>> >>>> but
>>>>>>>> >> >>> >>> >>>> they really want to make Spark better for every
>>>>>>>> user.
>>>>>>>> >> >>> >>> >>>>
>>>>>>>> >> >>> >>> >>>>
>>>>>>>> >> >>> >>> >>>> What do you think about these two topics?
>>>>>>>> Especially I'm
>>>>>>>> >> >>> >>> >>>> looking
>>>>>>>> >> >>> >>> >>>> at
>>>>>>>> >> >>> >>> >>>> Cody
>>>>>>>> >> >>> >>> >>>> (who has started this topic) and PMCs :)
>>>>>>>> >> >>> >>> >>>>
>>>>>>>> >> >>> >>> >>>> Pozdrawiam / Best regards,
>>>>>>>> >> >>> >>> >>>>
>>>>>>>> >> >>> >>> >>>> Tomasz
>>>>>>>> >> >>> >>> >>>>
>>>>>>>> >> >>> >>> >>>>
>>>>>>>> >> >>> >>>
>>>>>>>> >> >>> >>
>>>>>>>> >> >>> >
>>>>>>>> >> >>> >
>>>>>>>> >> >
>>>>>>>> >> >
>>>>>>>> >>
>>>>>>>> >> ------------------------------------------------------------
>>>>>>>> ---------
>>>>>>>> >> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>> >>
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > --
>>>>>>>> > Ryan Blue
>>>>>>>> > Software Engineer
>>>>>>>> > Netflix
>>>>>>>>
>>>>>>>> ------------------------------------------------------------
>>>>>>>> ---------
>>>>>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> Joseph Bradley
>>>>>>>
>>>>>>> Software Engineer - Machine Learning
>>>>>>>
>>>>>>> Databricks, Inc.
>>>>>>>
>>>>>>> [image: http://databricks.com] <http://databricks.com/>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Spark Improvement Proposals

Posted by Xiao Li <ga...@gmail.com>.

During the summit, I also had a lot of discussions over similar topics with
multiple Committers and active users. I heard many fantastic ideas. I
believe Spark improvement proposals are good channels to collect the
requirements/designs.


IMO, we also need to consider the priority when working on these items.
Even if the proposal is accepted, it does not mean it will be implemented
and merged immediately. It is not a FIFO queue.


Even if some PRs are merged, sometimes, we still have to revert them back,
if the design and implementation are not reviewed carefully. We have to
ensure our quality. Spark is not an application software. It is an
infrastructure software that is being used by many many companies. We have
to be very careful in the design and implementation, especially
adding/changing the external APIs.


When I developed the Mainframe infrastructure/middleware software in the
past 6 years, I were involved in the discussions with external/internal
customers. The to-do feature list was always above 100. Sometimes, the
customers are feeling frustrated when we are unable to deliver them on time
due to the resource limits and others. Even if they paid us billions, we
still need to do it phase by phase or sometimes they have to accept the
workarounds. That is the reality everyone has to face, I think.


Thanks,


Xiao Li

2017-02-11 7:57 GMT-08:00 Cody Koeninger <co...@koeninger.org>:

> At the spark summit this week, everyone from PMC members to users I had
> never met before were asking me about the Spark improvement proposals
> idea.  It's clear that it's a real community need.
>
> But it's been almost half a year, and nothing visible has been done.
>
> Reynold, are you going to do this?
>
> If so, when?
>
> If not, why?
>
> You already did the right thing by including long-deserved committers.
> Please keep doing the right thing for the community.
>
> On Wed, Jan 11, 2017 at 4:13 AM, Reynold Xin <rx...@databricks.com> wrote:
>
>> +1 on all counts (consensus, time bound, define roles)
>>
>> I can update the doc in the next few days and share back. Then maybe we
>> can just officially vote on this. As Tim suggested, we might not get it
>> 100% right the first time and would need to re-iterate. But that's fine.
>>
>>
>> On Thu, Jan 5, 2017 at 3:29 PM, Tim Hunter <ti...@databricks.com>
>> wrote:
>>
>>> Hi Cody,
>>> thank you for bringing up this topic, I agree it is very important to
>>> keep a cohesive community around some common, fluid goals. Here are a few
>>> comments about the current document:
>>>
>>> 1. name: it should not overlap with an existing one such as SIP. Can you
>>> imagine someone trying to discuss a scala spore proposal for spark?
>>> "[Spark] SIP-3 is intended to evolve in tandem with [Scala] SIP-21". SPIP
>>> sounds great.
>>>
>>> 2. roles: at a high level, SPIPs are meant to reach consensus for
>>> technical decisions with a lasting impact. As such, the template should
>>> emphasize the role of the various parties during this process:
>>>
>>>  - the SPIP author is responsible for building consensus. She is the
>>> champion driving the process forward and is responsible for ensuring that
>>> the SPIP follows the general guidelines. The author should be identified in
>>> the SPIP. The authorship of a SPIP can be transferred if the current author
>>> is not interested and someone else wants to move the SPIP forward. There
>>> should probably be 2-3 authors at most for each SPIP.
>>>
>>>  - someone with voting power should probably shepherd the SPIP (and be
>>> recorded as such): ensuring that the final decision over the SPIP is
>>> recorded (rejected, accepted, etc.), and advising about the technical
>>> quality of the SPIP: this person need not be a champion for the SPIP or
>>> contribute to it, but rather makes sure it stands a chance of being
>>> approved when the vote happens. Also, if the author cannot find anyone who
>>> would want to take this role, this proposal is likely to be rejected anyway.
>>>
>>>  - users, committers, contributors have the roles already outlined in
>>> the document
>>>
>>> 3. timeline: ideally, once a SPIP has been offered for voting, it should
>>> move swiftly into either being accepted or rejected, so that we do not end
>>> up with a distracting long tail of half-hearted proposals.
>>>
>>> These rules are meant to be flexible, but the current document should be
>>> clear about who is in charge of a SPIP, and the state it is currently in.
>>>
>>> We have had long discussions over some very important questions such as
>>> approval. I do not have an opinion on these, but why not make a pick and
>>> reevaluate this decision later? This is not a binding process at this point.
>>>
>>> Tim
>>>
>>>
>>> On Tue, Jan 3, 2017 at 3:16 PM, Cody Koeninger <co...@koeninger.org>
>>> wrote:
>>>
>>>> I don't have a concern about voting vs consensus.
>>>>
>>>> I have a concern that whatever the decision making process is, it is
>>>> explicitly announced on the ticket for the given proposal, with an explicit
>>>> deadline, and an explicit outcome.
>>>>
>>>>
>>>> On Tue, Jan 3, 2017 at 4:08 PM, Imran Rashid <ir...@cloudera.com>
>>>> wrote:
>>>>
>>>>> I'm also in favor of this.  Thanks for your persistence Cody.
>>>>>
>>>>> My take on the specific issues Joseph mentioned:
>>>>>
>>>>> 1) voting vs. consensus -- I agree with the argument Ryan Blue made
>>>>> earlier for consensus:
>>>>>
>>>>> > Majority vs consensus: My rationale is that I don't think we want to
>>>>> consider a proposal approved if it had objections serious enough that
>>>>> committers down-voted (or PMC depending on who gets a vote). If these
>>>>> proposals are like PEPs, then they represent a significant amount of
>>>>> community effort and I wouldn't want to move forward if up to half of the
>>>>> community thinks it's an untenable idea.
>>>>>
>>>>> 2) Design doc template -- agree this would be useful, but also seems
>>>>> totally orthogonal to moving forward on the SIP proposal.
>>>>>
>>>>> 3) agree w/ Joseph's proposal for updating the template.
>>>>>
>>>>> One small addition:
>>>>>
>>>>> 4) Deciding on a name -- minor, but I think its wroth disambiguating
>>>>> from Scala's SIPs, and the best proposal I've heard is "SPIP".   At least,
>>>>> no one has objected.  (don't care enough that I'd object to anything else,
>>>>> though.)
>>>>>
>>>>>
>>>>> On Tue, Jan 3, 2017 at 3:30 PM, Joseph Bradley <jo...@databricks.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Cody,
>>>>>>
>>>>>> Thanks for being persistent about this.  I too would like to see this
>>>>>> happen.  Reviewing the thread, it sounds like the main things remaining are:
>>>>>> * Decide about a few issues
>>>>>> * Finalize the doc(s)
>>>>>> * Vote on this proposal
>>>>>>
>>>>>> Issues & TODOs:
>>>>>>
>>>>>> (1) The main issue I see above is voting vs. consensus.  I have
>>>>>> little preference here.  It sounds like something which could be tailored
>>>>>> based on whether we see too many or too few SIPs being approved.
>>>>>>
>>>>>> (2) Design doc template  (This would be great to have for Spark
>>>>>> regardless of this SIP discussion.)
>>>>>> * Reynold, are you still putting this together?
>>>>>>
>>>>>> (3) Template cleanups.  Listing some items mentioned above + a new
>>>>>> one w.r.t. Reynold's draft
>>>>>> <https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#>
>>>>>> :
>>>>>> * Reinstate the "Where" section with links to current and past SIPs
>>>>>> * Add field for stating explicit deadlines for approval
>>>>>> * Add field for stating Author & Committer shepherd
>>>>>>
>>>>>> Thanks all!
>>>>>> Joseph
>>>>>>
>>>>>> On Mon, Jan 2, 2017 at 7:45 AM, Cody Koeninger <co...@koeninger.org>
>>>>>> wrote:
>>>>>>
>>>>>>> I'm bumping this one more time for the new year, and then I'm giving
>>>>>>> up.
>>>>>>>
>>>>>>> Please, fix your process, even if it isn't exactly the way I
>>>>>>> suggested.
>>>>>>>
>>>>>>> On Tue, Nov 8, 2016 at 11:14 AM, Ryan Blue <rb...@netflix.com>
>>>>>>> wrote:
>>>>>>> > On lazy consensus as opposed to voting:
>>>>>>> >
>>>>>>> > First, why lazy consensus? The proposal was for consensus, which
>>>>>>> is at least
>>>>>>> > three +1 votes and no vetos. Consensus has no losing side, it
>>>>>>> requires
>>>>>>> > getting to a point where there is agreement. Isn't that agreement
>>>>>>> what we
>>>>>>> > want to achieve with these proposals?
>>>>>>> >
>>>>>>> > Second, lazy consensus only removes the requirement for three +1
>>>>>>> votes. Why
>>>>>>> > would we not want at least three committers to think something is
>>>>>>> a good
>>>>>>> > idea before adopting the proposal?
>>>>>>> >
>>>>>>> > rb
>>>>>>> >
>>>>>>> > On Tue, Nov 8, 2016 at 8:13 AM, Cody Koeninger <co...@koeninger.org>
>>>>>>> wrote:
>>>>>>> >>
>>>>>>> >> So there are some minor things (the Where section heading appears
>>>>>>> to
>>>>>>> >> be dropped; wherever this document is posted it needs to actually
>>>>>>> link
>>>>>>> >> to a jira filter showing current / past SIPs) but it doesn't look
>>>>>>> like
>>>>>>> >> I can comment on the google doc.
>>>>>>> >>
>>>>>>> >> The major substantive issue that I have is that this version is
>>>>>>> >> significantly less clear as to the outcome of an SIP.
>>>>>>> >>
>>>>>>> >> The apache example of lazy consensus at
>>>>>>> >> http://apache.org/foundation/voting.html#LazyConsensus involves
>>>>>>> an
>>>>>>> >> explicit announcement of an explicit deadline, which I think are
>>>>>>> >> necessary for clarity.
>>>>>>> >>
>>>>>>> >>
>>>>>>> >>
>>>>>>> >> On Mon, Nov 7, 2016 at 1:55 PM, Reynold Xin <rx...@databricks.com>
>>>>>>> wrote:
>>>>>>> >> > It turned out suggested edits (trackable) don't show up for
>>>>>>> non-owners,
>>>>>>> >> > so
>>>>>>> >> > I've just merged all the edits in place. It should be visible
>>>>>>> now.
>>>>>>> >> >
>>>>>>> >> > On Mon, Nov 7, 2016 at 10:10 AM, Reynold Xin <
>>>>>>> rxin@databricks.com>
>>>>>>> >> > wrote:
>>>>>>> >> >>
>>>>>>> >> >> Oops. Let me try figure that out.
>>>>>>> >> >>
>>>>>>> >> >>
>>>>>>> >> >> On Monday, November 7, 2016, Cody Koeninger <
>>>>>>> cody@koeninger.org> wrote:
>>>>>>> >> >>>
>>>>>>> >> >>> Thanks for picking up on this.
>>>>>>> >> >>>
>>>>>>> >> >>> Maybe I fail at google docs, but I can't see any edits on the
>>>>>>> document
>>>>>>> >> >>> you linked.
>>>>>>> >> >>>
>>>>>>> >> >>> Regarding lazy consensus, if the board in general has less of
>>>>>>> an issue
>>>>>>> >> >>> with that, sure.  As long as it is clearly announced, lasts
>>>>>>> at least
>>>>>>> >> >>> 72 hours, and has a clear outcome.
>>>>>>> >> >>>
>>>>>>> >> >>> The other points are hard to comment on without being able to
>>>>>>> see the
>>>>>>> >> >>> text in question.
>>>>>>> >> >>>
>>>>>>> >> >>>
>>>>>>> >> >>> On Mon, Nov 7, 2016 at 3:11 AM, Reynold Xin <
>>>>>>> rxin@databricks.com>
>>>>>>> >> >>> wrote:
>>>>>>> >> >>> > I just looked through the entire thread again tonight -
>>>>>>> there are a
>>>>>>> >> >>> > lot
>>>>>>> >> >>> > of
>>>>>>> >> >>> > great ideas being discussed. Thanks Cody for taking the
>>>>>>> first crack
>>>>>>> >> >>> > at
>>>>>>> >> >>> > the
>>>>>>> >> >>> > proposal.
>>>>>>> >> >>> >
>>>>>>> >> >>> > I want to first comment on the context. Spark is one of the
>>>>>>> most
>>>>>>> >> >>> > innovative
>>>>>>> >> >>> > and important projects in (big) data -- overall technical
>>>>>>> decisions
>>>>>>> >> >>> > made in
>>>>>>> >> >>> > Apache Spark are sound. But of course, a project as large
>>>>>>> and active
>>>>>>> >> >>> > as
>>>>>>> >> >>> > Spark always have room for improvement, and we as a
>>>>>>> community should
>>>>>>> >> >>> > strive
>>>>>>> >> >>> > to take it to the next level.
>>>>>>> >> >>> >
>>>>>>> >> >>> > To that end, the two biggest areas for improvements in my
>>>>>>> opinion
>>>>>>> >> >>> > are:
>>>>>>> >> >>> >
>>>>>>> >> >>> > 1. Visibility: There are so much happening that it is
>>>>>>> difficult to
>>>>>>> >> >>> > know
>>>>>>> >> >>> > what
>>>>>>> >> >>> > really is going on. For people that don't follow closely,
>>>>>>> it is
>>>>>>> >> >>> > difficult to
>>>>>>> >> >>> > know what the important initiatives are. Even for people
>>>>>>> that do
>>>>>>> >> >>> > follow, it
>>>>>>> >> >>> > is difficult to know what specific things require their
>>>>>>> attention,
>>>>>>> >> >>> > since the
>>>>>>> >> >>> > number of pull requests and JIRA tickets are high and it's
>>>>>>> difficult
>>>>>>> >> >>> > to
>>>>>>> >> >>> > extract signal from noise.
>>>>>>> >> >>> >
>>>>>>> >> >>> > 2. Solicit user (broadly defined, including developers
>>>>>>> themselves)
>>>>>>> >> >>> > input
>>>>>>> >> >>> > more proactively: At the end of the day the project
>>>>>>> provides value
>>>>>>> >> >>> > because
>>>>>>> >> >>> > users use it. Users can't tell us exactly what to build,
>>>>>>> but it is
>>>>>>> >> >>> > important
>>>>>>> >> >>> > to get their inputs.
>>>>>>> >> >>> >
>>>>>>> >> >>> >
>>>>>>> >> >>> > I've taken Cody's doc and edited it:
>>>>>>> >> >>> >
>>>>>>> >> >>> >
>>>>>>> >> >>> > https://docs.google.com/docume
>>>>>>> nt/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#headi
>>>>>>> ng=h.36ut37zh7w2b
>>>>>>> >> >>> > (I've made all my modifications trackable)
>>>>>>> >> >>> >
>>>>>>> >> >>> > There are couple high level changes I made:
>>>>>>> >> >>> >
>>>>>>> >> >>> > 1. I've consulted a board member and he recommended lazy
>>>>>>> consensus
>>>>>>> >> >>> > as
>>>>>>> >> >>> > opposed to voting. The reason being in voting there can
>>>>>>> easily be a
>>>>>>> >> >>> > "loser'
>>>>>>> >> >>> > that gets outvoted.
>>>>>>> >> >>> >
>>>>>>> >> >>> > 2. I made it lighter weight, and renamed "strategy" to
>>>>>>> "optional
>>>>>>> >> >>> > design
>>>>>>> >> >>> > sketch". Echoing one of the earlier email: "IMHO so far
>>>>>>> aside from
>>>>>>> >> >>> > tagging
>>>>>>> >> >>> > things and linking them elsewhere simply having design docs
>>>>>>> and
>>>>>>> >> >>> > prototypes
>>>>>>> >> >>> > implementations in PRs is not something that has not worked
>>>>>>> so far".
>>>>>>> >> >>> >
>>>>>>> >> >>> > 3. I made some the language tweaks to focus more on
>>>>>>> visibility. For
>>>>>>> >> >>> > example,
>>>>>>> >> >>> > "The purpose of an SIP is to inform and involve", rather
>>>>>>> than just
>>>>>>> >> >>> > "involve". SIPs should also have at least two emails that
>>>>>>> go to
>>>>>>> >> >>> > dev@.
>>>>>>> >> >>> >
>>>>>>> >> >>> >
>>>>>>> >> >>> > While I was editing this, I thought we really needed a
>>>>>>> suggested
>>>>>>> >> >>> > template
>>>>>>> >> >>> > for design doc too. I will get to that too ...
>>>>>>> >> >>> >
>>>>>>> >> >>> >
>>>>>>> >> >>> > On Tue, Nov 1, 2016 at 12:09 AM, Reynold Xin <
>>>>>>> rxin@databricks.com>
>>>>>>> >> >>> > wrote:
>>>>>>> >> >>> >>
>>>>>>> >> >>> >> Most things looked OK to me too, although I do plan to
>>>>>>> take a
>>>>>>> >> >>> >> closer
>>>>>>> >> >>> >> look
>>>>>>> >> >>> >> after Nov 1st when we cut the release branch for 2.1.
>>>>>>> >> >>> >>
>>>>>>> >> >>> >>
>>>>>>> >> >>> >> On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin
>>>>>>> >> >>> >> <va...@cloudera.com>
>>>>>>> >> >>> >> wrote:
>>>>>>> >> >>> >>>
>>>>>>> >> >>> >>> The proposal looks OK to me. I assume, even though it's
>>>>>>> not
>>>>>>> >> >>> >>> explicitly
>>>>>>> >> >>> >>> called, that voting would happen by e-mail? A template
>>>>>>> for the
>>>>>>> >> >>> >>> proposal document (instead of just a bullet nice) would
>>>>>>> also be
>>>>>>> >> >>> >>> nice,
>>>>>>> >> >>> >>> but that can be done at any time.
>>>>>>> >> >>> >>>
>>>>>>> >> >>> >>> BTW, shameless plug: I filed SPARK-18085 which I consider
>>>>>>> a
>>>>>>> >> >>> >>> candidate
>>>>>>> >> >>> >>> for a SIP, given the scope of the work. The document
>>>>>>> attached even
>>>>>>> >> >>> >>> somewhat matches the proposed format. So if anyone wants
>>>>>>> to try
>>>>>>> >> >>> >>> out
>>>>>>> >> >>> >>> the process...
>>>>>>> >> >>> >>>
>>>>>>> >> >>> >>> On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger
>>>>>>> >> >>> >>> <co...@koeninger.org>
>>>>>>> >> >>> >>> wrote:
>>>>>>> >> >>> >>> > Now that spark summit europe is over, are any committers
>>>>>>> >> >>> >>> > interested
>>>>>>> >> >>> >>> > in
>>>>>>> >> >>> >>> > moving forward with this?
>>>>>>> >> >>> >>> >
>>>>>>> >> >>> >>> >
>>>>>>> >> >>> >>> >
>>>>>>> >> >>> >>> >
>>>>>>> >> >>> >>> > https://github.com/koeninger/s
>>>>>>> park-1/blob/SIP-0/docs/spark-improvement-proposals.md
>>>>>>> >> >>> >>> >
>>>>>>> >> >>> >>> > Or are we going to let this discussion die on the vine?
>>>>>>> >> >>> >>> >
>>>>>>> >> >>> >>> > On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda
>>>>>>> >> >>> >>> > <to...@outlook.com> wrote:
>>>>>>> >> >>> >>> >> Maybe my mail was not clear enough.
>>>>>>> >> >>> >>> >>
>>>>>>> >> >>> >>> >>
>>>>>>> >> >>> >>> >> I didn't want to write "lets focus on Flink" or any
>>>>>>> other
>>>>>>> >> >>> >>> >> framework.
>>>>>>> >> >>> >>> >> The
>>>>>>> >> >>> >>> >> idea with benchmarks was to show two things:
>>>>>>> >> >>> >>> >>
>>>>>>> >> >>> >>> >> - why some people are doing bad PR for Spark
>>>>>>> >> >>> >>> >>
>>>>>>> >> >>> >>> >> - how - in easy way - we can change it and show that
>>>>>>> Spark is
>>>>>>> >> >>> >>> >> still on
>>>>>>> >> >>> >>> >> the
>>>>>>> >> >>> >>> >> top
>>>>>>> >> >>> >>> >>
>>>>>>> >> >>> >>> >>
>>>>>>> >> >>> >>> >> No more, no less. Benchmarks will be helpful, but I
>>>>>>> don't think
>>>>>>> >> >>> >>> >> they're the
>>>>>>> >> >>> >>> >> most important thing in Spark :) On the Spark main
>>>>>>> page there
>>>>>>> >> >>> >>> >> is
>>>>>>> >> >>> >>> >> still
>>>>>>> >> >>> >>> >> chart
>>>>>>> >> >>> >>> >> "Spark vs Hadoop". It is important to show that
>>>>>>> framework is
>>>>>>> >> >>> >>> >> not
>>>>>>> >> >>> >>> >> the
>>>>>>> >> >>> >>> >> same
>>>>>>> >> >>> >>> >> Spark with other API, but much faster and optimized,
>>>>>>> comparable
>>>>>>> >> >>> >>> >> or
>>>>>>> >> >>> >>> >> even
>>>>>>> >> >>> >>> >> faster than other frameworks.
>>>>>>> >> >>> >>> >>
>>>>>>> >> >>> >>> >>
>>>>>>> >> >>> >>> >> About real-time streaming, I think it would be just
>>>>>>> good to see
>>>>>>> >> >>> >>> >> it
>>>>>>> >> >>> >>> >> in
>>>>>>> >> >>> >>> >> Spark.
>>>>>>> >> >>> >>> >> I very like current Spark model, but many voices that
>>>>>>> says "we
>>>>>>> >> >>> >>> >> need
>>>>>>> >> >>> >>> >> more" -
>>>>>>> >> >>> >>> >> community should listen also them and try to help
>>>>>>> them. With
>>>>>>> >> >>> >>> >> SIPs
>>>>>>> >> >>> >>> >> it
>>>>>>> >> >>> >>> >> would
>>>>>>> >> >>> >>> >> be easier, I've just posted this example as "thing
>>>>>>> that may be
>>>>>>> >> >>> >>> >> changed
>>>>>>> >> >>> >>> >> with
>>>>>>> >> >>> >>> >> SIP".
>>>>>>> >> >>> >>> >>
>>>>>>> >> >>> >>> >>
>>>>>>> >> >>> >>> >> I very like unification via Datasets, but there is a
>>>>>>> lot of
>>>>>>> >> >>> >>> >> algorithms
>>>>>>> >> >>> >>> >> inside - let's make easy API, but with strong
>>>>>>> background
>>>>>>> >> >>> >>> >> (articles,
>>>>>>> >> >>> >>> >> benchmarks, descriptions, etc) that shows that Spark
>>>>>>> is still
>>>>>>> >> >>> >>> >> modern
>>>>>>> >> >>> >>> >> framework.
>>>>>>> >> >>> >>> >>
>>>>>>> >> >>> >>> >>
>>>>>>> >> >>> >>> >> Maybe now my intention will be clearer :) As I said
>>>>>>> >> >>> >>> >> organizational
>>>>>>> >> >>> >>> >> ideas
>>>>>>> >> >>> >>> >> were already mentioned and I agree with them, my mail
>>>>>>> was just
>>>>>>> >> >>> >>> >> to
>>>>>>> >> >>> >>> >> show
>>>>>>> >> >>> >>> >> some
>>>>>>> >> >>> >>> >> aspects from my side, so from theside of developer and
>>>>>>> person
>>>>>>> >> >>> >>> >> who
>>>>>>> >> >>> >>> >> is
>>>>>>> >> >>> >>> >> trying
>>>>>>> >> >>> >>> >> to help others with Spark (via StackOverflow or other
>>>>>>> ways)
>>>>>>> >> >>> >>> >>
>>>>>>> >> >>> >>> >>
>>>>>>> >> >>> >>> >> Pozdrawiam / Best regards,
>>>>>>> >> >>> >>> >>
>>>>>>> >> >>> >>> >> Tomasz
>>>>>>> >> >>> >>> >>
>>>>>>> >> >>> >>> >>
>>>>>>> >> >>> >>> >> ________________________________
>>>>>>> >> >>> >>> >> Od: Cody Koeninger <co...@koeninger.org>
>>>>>>> >> >>> >>> >> Wysłane: 17 października 2016 16:46
>>>>>>> >> >>> >>> >> Do: Debasish Das
>>>>>>> >> >>> >>> >> DW: Tomasz Gawęda; dev@spark.apache.org
>>>>>>> >> >>> >>> >> Temat: Re: Spark Improvement Proposals
>>>>>>> >> >>> >>> >>
>>>>>>> >> >>> >>> >> I think narrowly focusing on Flink or benchmarks is
>>>>>>> missing my
>>>>>>> >> >>> >>> >> point.
>>>>>>> >> >>> >>> >>
>>>>>>> >> >>> >>> >> My point is evolve or die.  Spark's governance and
>>>>>>> organization
>>>>>>> >> >>> >>> >> is
>>>>>>> >> >>> >>> >> hampering its ability to evolve technologically, and
>>>>>>> it needs
>>>>>>> >> >>> >>> >> to
>>>>>>> >> >>> >>> >> change.
>>>>>>> >> >>> >>> >>
>>>>>>> >> >>> >>> >> On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das
>>>>>>> >> >>> >>> >> <de...@gmail.com>
>>>>>>> >> >>> >>> >> wrote:
>>>>>>> >> >>> >>> >>> Thanks Cody for bringing up a valid point...I picked
>>>>>>> up Spark
>>>>>>> >> >>> >>> >>> in
>>>>>>> >> >>> >>> >>> 2014
>>>>>>> >> >>> >>> >>> as
>>>>>>> >> >>> >>> >>> soon as I looked into it since compared to writing
>>>>>>> Java
>>>>>>> >> >>> >>> >>> map-reduce
>>>>>>> >> >>> >>> >>> and
>>>>>>> >> >>> >>> >>> Cascading code, Spark made writing distributed code
>>>>>>> fun...But
>>>>>>> >> >>> >>> >>> now
>>>>>>> >> >>> >>> >>> as
>>>>>>> >> >>> >>> >>> we
>>>>>>> >> >>> >>> >>> went
>>>>>>> >> >>> >>> >>> deeper with Spark and real-time streaming use-case
>>>>>>> gets more
>>>>>>> >> >>> >>> >>> prominent, I
>>>>>>> >> >>> >>> >>> think it is time to bring a messaging model in
>>>>>>> conjunction
>>>>>>> >> >>> >>> >>> with
>>>>>>> >> >>> >>> >>> the
>>>>>>> >> >>> >>> >>> batch/micro-batch API that Spark is good
>>>>>>> at....akka-streams
>>>>>>> >> >>> >>> >>> close
>>>>>>> >> >>> >>> >>> integration with spark micro-batching APIs looks like
>>>>>>> a great
>>>>>>> >> >>> >>> >>> direction to
>>>>>>> >> >>> >>> >>> stay in the game with Apache Flink...Spark 2.0
>>>>>>> integrated
>>>>>>> >> >>> >>> >>> streaming
>>>>>>> >> >>> >>> >>> with
>>>>>>> >> >>> >>> >>> batch with the assumption is that micro-batching is
>>>>>>> sufficient
>>>>>>> >> >>> >>> >>> to
>>>>>>> >> >>> >>> >>> run
>>>>>>> >> >>> >>> >>> SQL
>>>>>>> >> >>> >>> >>> commands on stream but do we really have time to do
>>>>>>> SQL
>>>>>>> >> >>> >>> >>> processing at
>>>>>>> >> >>> >>> >>> streaming data within 1-2 seconds ?
>>>>>>> >> >>> >>> >>>
>>>>>>> >> >>> >>> >>> After reading the email chain, I started to look into
>>>>>>> Flink
>>>>>>> >> >>> >>> >>> documentation
>>>>>>> >> >>> >>> >>> and if you compare it with Spark documentation, I
>>>>>>> think we
>>>>>>> >> >>> >>> >>> have
>>>>>>> >> >>> >>> >>> major
>>>>>>> >> >>> >>> >>> work
>>>>>>> >> >>> >>> >>> to do detailing out Spark internals so that more
>>>>>>> people from
>>>>>>> >> >>> >>> >>> community
>>>>>>> >> >>> >>> >>> start
>>>>>>> >> >>> >>> >>> to take active role in improving the issues so that
>>>>>>> Spark
>>>>>>> >> >>> >>> >>> stays
>>>>>>> >> >>> >>> >>> strong
>>>>>>> >> >>> >>> >>> compared to Flink.
>>>>>>> >> >>> >>> >>>
>>>>>>> >> >>> >>> >>>
>>>>>>> >> >>> >>> >>> https://cwiki.apache.org/confl
>>>>>>> uence/display/SPARK/Spark+Internals
>>>>>>> >> >>> >>> >>>
>>>>>>> >> >>> >>> >>>
>>>>>>> >> >>> >>> >>> https://cwiki.apache.org/confl
>>>>>>> uence/display/FLINK/Flink+Internals
>>>>>>> >> >>> >>> >>>
>>>>>>> >> >>> >>> >>> Spark is no longer an engine that works for
>>>>>>> micro-batch and
>>>>>>> >> >>> >>> >>> batch...We
>>>>>>> >> >>> >>> >>> (and
>>>>>>> >> >>> >>> >>> I am sure many others) are pushing spark as an engine
>>>>>>> for
>>>>>>> >> >>> >>> >>> stream
>>>>>>> >> >>> >>> >>> and
>>>>>>> >> >>> >>> >>> query
>>>>>>> >> >>> >>> >>> processing.....we need to make it a state-of-the-art
>>>>>>> engine
>>>>>>> >> >>> >>> >>> for
>>>>>>> >> >>> >>> >>> high
>>>>>>> >> >>> >>> >>> speed
>>>>>>> >> >>> >>> >>> streaming data and user queries as well !
>>>>>>> >> >>> >>> >>>
>>>>>>> >> >>> >>> >>> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda
>>>>>>> >> >>> >>> >>> <to...@outlook.com>
>>>>>>> >> >>> >>> >>> wrote:
>>>>>>> >> >>> >>> >>>>
>>>>>>> >> >>> >>> >>>> Hi everyone,
>>>>>>> >> >>> >>> >>>>
>>>>>>> >> >>> >>> >>>> I'm quite late with my answer, but I think my
>>>>>>> suggestions may
>>>>>>> >> >>> >>> >>>> help a
>>>>>>> >> >>> >>> >>>> little bit. :) Many technical and organizational
>>>>>>> topics were
>>>>>>> >> >>> >>> >>>> mentioned,
>>>>>>> >> >>> >>> >>>> but I want to focus on these negative posts about
>>>>>>> Spark and
>>>>>>> >> >>> >>> >>>> about
>>>>>>> >> >>> >>> >>>> "haters"
>>>>>>> >> >>> >>> >>>>
>>>>>>> >> >>> >>> >>>> I really like Spark. Easy of use, speed, very good
>>>>>>> community
>>>>>>> >> >>> >>> >>>> -
>>>>>>> >> >>> >>> >>>> it's
>>>>>>> >> >>> >>> >>>> everything here. But Every project has to "flight" on
>>>>>>> >> >>> >>> >>>> "framework
>>>>>>> >> >>> >>> >>>> market"
>>>>>>> >> >>> >>> >>>> to be still no 1. I'm following many Spark and Big
>>>>>>> Data
>>>>>>> >> >>> >>> >>>> communities,
>>>>>>> >> >>> >>> >>>> maybe my mail will inspire someone :)
>>>>>>> >> >>> >>> >>>>
>>>>>>> >> >>> >>> >>>> You (every Spark developer; so far I didn't have
>>>>>>> enough time
>>>>>>> >> >>> >>> >>>> to
>>>>>>> >> >>> >>> >>>> join
>>>>>>> >> >>> >>> >>>> contributing to Spark) has done excellent job. So
>>>>>>> why are
>>>>>>> >> >>> >>> >>>> some
>>>>>>> >> >>> >>> >>>> people
>>>>>>> >> >>> >>> >>>> saying that Flink (or other framework) is better,
>>>>>>> like it was
>>>>>>> >> >>> >>> >>>> posted
>>>>>>> >> >>> >>> >>>> in
>>>>>>> >> >>> >>> >>>> this mailing list? No, not because that framework is
>>>>>>> better
>>>>>>> >> >>> >>> >>>> in
>>>>>>> >> >>> >>> >>>> all
>>>>>>> >> >>> >>> >>>> cases.. In my opinion, many of these discussions
>>>>>>> where
>>>>>>> >> >>> >>> >>>> started
>>>>>>> >> >>> >>> >>>> after
>>>>>>> >> >>> >>> >>>> Flink marketing-like posts. Please look at
>>>>>>> StackOverflow
>>>>>>> >> >>> >>> >>>> "Flink
>>>>>>> >> >>> >>> >>>> vs
>>>>>>> >> >>> >>> >>>> ...."
>>>>>>> >> >>> >>> >>>> posts, almost every post in "winned" by Flink.
>>>>>>> Answers are
>>>>>>> >> >>> >>> >>>> sometimes
>>>>>>> >> >>> >>> >>>> saying nothing about other frameworks, Flink's users
>>>>>>> (often
>>>>>>> >> >>> >>> >>>> PMC's)
>>>>>>> >> >>> >>> >>>> are
>>>>>>> >> >>> >>> >>>> just posting same information about real-time
>>>>>>> streaming,
>>>>>>> >> >>> >>> >>>> about
>>>>>>> >> >>> >>> >>>> delta
>>>>>>> >> >>> >>> >>>> iterations, etc. It look smart and very often it is
>>>>>>> marked as
>>>>>>> >> >>> >>> >>>> an
>>>>>>> >> >>> >>> >>>> aswer,
>>>>>>> >> >>> >>> >>>> even if - in my opinion - there wasn't told all the
>>>>>>> truth.
>>>>>>> >> >>> >>> >>>>
>>>>>>> >> >>> >>> >>>>
>>>>>>> >> >>> >>> >>>> My suggestion: I don't have enough money and
>>>>>>> knowledgle to
>>>>>>> >> >>> >>> >>>> perform
>>>>>>> >> >>> >>> >>>> huge
>>>>>>> >> >>> >>> >>>> performance test. Maybe some company, that supports
>>>>>>> Spark
>>>>>>> >> >>> >>> >>>> (Databricks,
>>>>>>> >> >>> >>> >>>> Cloudera? - just saying you're most visible in
>>>>>>> community :) )
>>>>>>> >> >>> >>> >>>> could
>>>>>>> >> >>> >>> >>>> perform performance test of:
>>>>>>> >> >>> >>> >>>>
>>>>>>> >> >>> >>> >>>> - streaming engine - probably Spark will loose
>>>>>>> because of
>>>>>>> >> >>> >>> >>>> mini-batch
>>>>>>> >> >>> >>> >>>> model, however currently the difference should be
>>>>>>> much lower
>>>>>>> >> >>> >>> >>>> that in
>>>>>>> >> >>> >>> >>>> previous versions
>>>>>>> >> >>> >>> >>>>
>>>>>>> >> >>> >>> >>>> - Machine Learning models
>>>>>>> >> >>> >>> >>>>
>>>>>>> >> >>> >>> >>>> - batch jobs
>>>>>>> >> >>> >>> >>>>
>>>>>>> >> >>> >>> >>>> - Graph jobs
>>>>>>> >> >>> >>> >>>>
>>>>>>> >> >>> >>> >>>> - SQL queries
>>>>>>> >> >>> >>> >>>>
>>>>>>> >> >>> >>> >>>> People will see that Spark is envolving and is also
>>>>>>> a modern
>>>>>>> >> >>> >>> >>>> framework,
>>>>>>> >> >>> >>> >>>> because after reading posts mentioned above people
>>>>>>> may think
>>>>>>> >> >>> >>> >>>> "it
>>>>>>> >> >>> >>> >>>> is
>>>>>>> >> >>> >>> >>>> outdated, future is in framework X".
>>>>>>> >> >>> >>> >>>>
>>>>>>> >> >>> >>> >>>> Matei Zaharia posted excellent blog post about how
>>>>>>> Spark
>>>>>>> >> >>> >>> >>>> Structured
>>>>>>> >> >>> >>> >>>> Streaming beats every other framework in terms of
>>>>>>> easy-of-use
>>>>>>> >> >>> >>> >>>> and
>>>>>>> >> >>> >>> >>>> reliability. Performance tests, done in various
>>>>>>> environments
>>>>>>> >> >>> >>> >>>> (in
>>>>>>> >> >>> >>> >>>> example: laptop, small 2 node cluster, 10-node
>>>>>>> cluster,
>>>>>>> >> >>> >>> >>>> 20-node
>>>>>>> >> >>> >>> >>>> cluster), could be also very good marketing stuff to
>>>>>>> say
>>>>>>> >> >>> >>> >>>> "hey,
>>>>>>> >> >>> >>> >>>> you're
>>>>>>> >> >>> >>> >>>> telling that you're better, but Spark is still
>>>>>>> faster and is
>>>>>>> >> >>> >>> >>>> still
>>>>>>> >> >>> >>> >>>> getting even more fast!". This would be based on
>>>>>>> facts (just
>>>>>>> >> >>> >>> >>>> numbers),
>>>>>>> >> >>> >>> >>>> not opinions. It would be good for companies, for
>>>>>>> marketing
>>>>>>> >> >>> >>> >>>> puproses
>>>>>>> >> >>> >>> >>>> and
>>>>>>> >> >>> >>> >>>> for every Spark developer
>>>>>>> >> >>> >>> >>>>
>>>>>>> >> >>> >>> >>>>
>>>>>>> >> >>> >>> >>>> Second: real-time streaming. I've written some time
>>>>>>> ago about
>>>>>>> >> >>> >>> >>>> real-time
>>>>>>> >> >>> >>> >>>> streaming support in Spark Structured Streaming.
>>>>>>> Some work
>>>>>>> >> >>> >>> >>>> should be
>>>>>>> >> >>> >>> >>>> done to make SSS more low-latency, but I think it's
>>>>>>> possible.
>>>>>>> >> >>> >>> >>>> Maybe
>>>>>>> >> >>> >>> >>>> Spark may look at Gearpump, which is also built on
>>>>>>> top of
>>>>>>> >> >>> >>> >>>> Akka?
>>>>>>> >> >>> >>> >>>> I
>>>>>>> >> >>> >>> >>>> don't
>>>>>>> >> >>> >>> >>>> know yet, it is good topic for SIP. However I think
>>>>>>> that
>>>>>>> >> >>> >>> >>>> Spark
>>>>>>> >> >>> >>> >>>> should
>>>>>>> >> >>> >>> >>>> have real-time streaming support. Currently I see
>>>>>>> many
>>>>>>> >> >>> >>> >>>> posts/comments
>>>>>>> >> >>> >>> >>>> that "Spark has too big latency". Spark Streaming is
>>>>>>> doing
>>>>>>> >> >>> >>> >>>> very
>>>>>>> >> >>> >>> >>>> good
>>>>>>> >> >>> >>> >>>> jobs with micro-batches, however I think it is
>>>>>>> possible to
>>>>>>> >> >>> >>> >>>> add
>>>>>>> >> >>> >>> >>>> also
>>>>>>> >> >>> >>> >>>> more
>>>>>>> >> >>> >>> >>>> real-time processing.
>>>>>>> >> >>> >>> >>>>
>>>>>>> >> >>> >>> >>>> Other people said much more and I agree with
>>>>>>> proposal of SIP.
>>>>>>> >> >>> >>> >>>> I'm
>>>>>>> >> >>> >>> >>>> also
>>>>>>> >> >>> >>> >>>> happy that PMC's are not saying that they will not
>>>>>>> listen to
>>>>>>> >> >>> >>> >>>> users,
>>>>>>> >> >>> >>> >>>> but
>>>>>>> >> >>> >>> >>>> they really want to make Spark better for every user.
>>>>>>> >> >>> >>> >>>>
>>>>>>> >> >>> >>> >>>>
>>>>>>> >> >>> >>> >>>> What do you think about these two topics? Especially
>>>>>>> I'm
>>>>>>> >> >>> >>> >>>> looking
>>>>>>> >> >>> >>> >>>> at
>>>>>>> >> >>> >>> >>>> Cody
>>>>>>> >> >>> >>> >>>> (who has started this topic) and PMCs :)
>>>>>>> >> >>> >>> >>>>
>>>>>>> >> >>> >>> >>>> Pozdrawiam / Best regards,
>>>>>>> >> >>> >>> >>>>
>>>>>>> >> >>> >>> >>>> Tomasz
>>>>>>> >> >>> >>> >>>>
>>>>>>> >> >>> >>> >>>>
>>>>>>> >> >>> >>>
>>>>>>> >> >>> >>
>>>>>>> >> >>> >
>>>>>>> >> >>> >
>>>>>>> >> >
>>>>>>> >> >
>>>>>>> >>
>>>>>>> >> ------------------------------------------------------------
>>>>>>> ---------
>>>>>>> >> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>>> >>
>>>>>>> >
>>>>>>> >
>>>>>>> >
>>>>>>> > --
>>>>>>> > Ryan Blue
>>>>>>> > Software Engineer
>>>>>>> > Netflix
>>>>>>>
>>>>>>> ------------------------------------------------------------
>>>>>>> ---------
>>>>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Joseph Bradley
>>>>>>
>>>>>> Software Engineer - Machine Learning
>>>>>>
>>>>>> Databricks, Inc.
>>>>>>
>>>>>> [image: http://databricks.com] <http://databricks.com/>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Spark Improvement Proposals

Posted by Cody Koeninger <co...@koeninger.org>.

At the spark summit this week, everyone from PMC members to users I had
never met before were asking me about the Spark improvement proposals
idea.  It's clear that it's a real community need.

But it's been almost half a year, and nothing visible has been done.

Reynold, are you going to do this?

If so, when?

If not, why?

You already did the right thing by including long-deserved committers.
Please keep doing the right thing for the community.

On Wed, Jan 11, 2017 at 4:13 AM, Reynold Xin <rx...@databricks.com> wrote:

> +1 on all counts (consensus, time bound, define roles)
>
> I can update the doc in the next few days and share back. Then maybe we
> can just officially vote on this. As Tim suggested, we might not get it
> 100% right the first time and would need to re-iterate. But that's fine.
>
>
> On Thu, Jan 5, 2017 at 3:29 PM, Tim Hunter <ti...@databricks.com>
> wrote:
>
>> Hi Cody,
>> thank you for bringing up this topic, I agree it is very important to
>> keep a cohesive community around some common, fluid goals. Here are a few
>> comments about the current document:
>>
>> 1. name: it should not overlap with an existing one such as SIP. Can you
>> imagine someone trying to discuss a scala spore proposal for spark?
>> "[Spark] SIP-3 is intended to evolve in tandem with [Scala] SIP-21". SPIP
>> sounds great.
>>
>> 2. roles: at a high level, SPIPs are meant to reach consensus for
>> technical decisions with a lasting impact. As such, the template should
>> emphasize the role of the various parties during this process:
>>
>>  - the SPIP author is responsible for building consensus. She is the
>> champion driving the process forward and is responsible for ensuring that
>> the SPIP follows the general guidelines. The author should be identified in
>> the SPIP. The authorship of a SPIP can be transferred if the current author
>> is not interested and someone else wants to move the SPIP forward. There
>> should probably be 2-3 authors at most for each SPIP.
>>
>>  - someone with voting power should probably shepherd the SPIP (and be
>> recorded as such): ensuring that the final decision over the SPIP is
>> recorded (rejected, accepted, etc.), and advising about the technical
>> quality of the SPIP: this person need not be a champion for the SPIP or
>> contribute to it, but rather makes sure it stands a chance of being
>> approved when the vote happens. Also, if the author cannot find anyone who
>> would want to take this role, this proposal is likely to be rejected anyway.
>>
>>  - users, committers, contributors have the roles already outlined in the
>> document
>>
>> 3. timeline: ideally, once a SPIP has been offered for voting, it should
>> move swiftly into either being accepted or rejected, so that we do not end
>> up with a distracting long tail of half-hearted proposals.
>>
>> These rules are meant to be flexible, but the current document should be
>> clear about who is in charge of a SPIP, and the state it is currently in.
>>
>> We have had long discussions over some very important questions such as
>> approval. I do not have an opinion on these, but why not make a pick and
>> reevaluate this decision later? This is not a binding process at this point.
>>
>> Tim
>>
>>
>> On Tue, Jan 3, 2017 at 3:16 PM, Cody Koeninger <co...@koeninger.org>
>> wrote:
>>
>>> I don't have a concern about voting vs consensus.
>>>
>>> I have a concern that whatever the decision making process is, it is
>>> explicitly announced on the ticket for the given proposal, with an explicit
>>> deadline, and an explicit outcome.
>>>
>>>
>>> On Tue, Jan 3, 2017 at 4:08 PM, Imran Rashid <ir...@cloudera.com>
>>> wrote:
>>>
>>>> I'm also in favor of this.  Thanks for your persistence Cody.
>>>>
>>>> My take on the specific issues Joseph mentioned:
>>>>
>>>> 1) voting vs. consensus -- I agree with the argument Ryan Blue made
>>>> earlier for consensus:
>>>>
>>>> > Majority vs consensus: My rationale is that I don't think we want to
>>>> consider a proposal approved if it had objections serious enough that
>>>> committers down-voted (or PMC depending on who gets a vote). If these
>>>> proposals are like PEPs, then they represent a significant amount of
>>>> community effort and I wouldn't want to move forward if up to half of the
>>>> community thinks it's an untenable idea.
>>>>
>>>> 2) Design doc template -- agree this would be useful, but also seems
>>>> totally orthogonal to moving forward on the SIP proposal.
>>>>
>>>> 3) agree w/ Joseph's proposal for updating the template.
>>>>
>>>> One small addition:
>>>>
>>>> 4) Deciding on a name -- minor, but I think its wroth disambiguating
>>>> from Scala's SIPs, and the best proposal I've heard is "SPIP".   At least,
>>>> no one has objected.  (don't care enough that I'd object to anything else,
>>>> though.)
>>>>
>>>>
>>>> On Tue, Jan 3, 2017 at 3:30 PM, Joseph Bradley <jo...@databricks.com>
>>>> wrote:
>>>>
>>>>> Hi Cody,
>>>>>
>>>>> Thanks for being persistent about this.  I too would like to see this
>>>>> happen.  Reviewing the thread, it sounds like the main things remaining are:
>>>>> * Decide about a few issues
>>>>> * Finalize the doc(s)
>>>>> * Vote on this proposal
>>>>>
>>>>> Issues & TODOs:
>>>>>
>>>>> (1) The main issue I see above is voting vs. consensus.  I have little
>>>>> preference here.  It sounds like something which could be tailored based on
>>>>> whether we see too many or too few SIPs being approved.
>>>>>
>>>>> (2) Design doc template  (This would be great to have for Spark
>>>>> regardless of this SIP discussion.)
>>>>> * Reynold, are you still putting this together?
>>>>>
>>>>> (3) Template cleanups.  Listing some items mentioned above + a new one
>>>>> w.r.t. Reynold's draft
>>>>> <https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#>
>>>>> :
>>>>> * Reinstate the "Where" section with links to current and past SIPs
>>>>> * Add field for stating explicit deadlines for approval
>>>>> * Add field for stating Author & Committer shepherd
>>>>>
>>>>> Thanks all!
>>>>> Joseph
>>>>>
>>>>> On Mon, Jan 2, 2017 at 7:45 AM, Cody Koeninger <co...@koeninger.org>
>>>>> wrote:
>>>>>
>>>>>> I'm bumping this one more time for the new year, and then I'm giving
>>>>>> up.
>>>>>>
>>>>>> Please, fix your process, even if it isn't exactly the way I
>>>>>> suggested.
>>>>>>
>>>>>> On Tue, Nov 8, 2016 at 11:14 AM, Ryan Blue <rb...@netflix.com> wrote:
>>>>>> > On lazy consensus as opposed to voting:
>>>>>> >
>>>>>> > First, why lazy consensus? The proposal was for consensus, which is
>>>>>> at least
>>>>>> > three +1 votes and no vetos. Consensus has no losing side, it
>>>>>> requires
>>>>>> > getting to a point where there is agreement. Isn't that agreement
>>>>>> what we
>>>>>> > want to achieve with these proposals?
>>>>>> >
>>>>>> > Second, lazy consensus only removes the requirement for three +1
>>>>>> votes. Why
>>>>>> > would we not want at least three committers to think something is a
>>>>>> good
>>>>>> > idea before adopting the proposal?
>>>>>> >
>>>>>> > rb
>>>>>> >
>>>>>> > On Tue, Nov 8, 2016 at 8:13 AM, Cody Koeninger <co...@koeninger.org>
>>>>>> wrote:
>>>>>> >>
>>>>>> >> So there are some minor things (the Where section heading appears
>>>>>> to
>>>>>> >> be dropped; wherever this document is posted it needs to actually
>>>>>> link
>>>>>> >> to a jira filter showing current / past SIPs) but it doesn't look
>>>>>> like
>>>>>> >> I can comment on the google doc.
>>>>>> >>
>>>>>> >> The major substantive issue that I have is that this version is
>>>>>> >> significantly less clear as to the outcome of an SIP.
>>>>>> >>
>>>>>> >> The apache example of lazy consensus at
>>>>>> >> http://apache.org/foundation/voting.html#LazyConsensus involves an
>>>>>> >> explicit announcement of an explicit deadline, which I think are
>>>>>> >> necessary for clarity.
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >> On Mon, Nov 7, 2016 at 1:55 PM, Reynold Xin <rx...@databricks.com>
>>>>>> wrote:
>>>>>> >> > It turned out suggested edits (trackable) don't show up for
>>>>>> non-owners,
>>>>>> >> > so
>>>>>> >> > I've just merged all the edits in place. It should be visible
>>>>>> now.
>>>>>> >> >
>>>>>> >> > On Mon, Nov 7, 2016 at 10:10 AM, Reynold Xin <
>>>>>> rxin@databricks.com>
>>>>>> >> > wrote:
>>>>>> >> >>
>>>>>> >> >> Oops. Let me try figure that out.
>>>>>> >> >>
>>>>>> >> >>
>>>>>> >> >> On Monday, November 7, 2016, Cody Koeninger <co...@koeninger.org>
>>>>>> wrote:
>>>>>> >> >>>
>>>>>> >> >>> Thanks for picking up on this.
>>>>>> >> >>>
>>>>>> >> >>> Maybe I fail at google docs, but I can't see any edits on the
>>>>>> document
>>>>>> >> >>> you linked.
>>>>>> >> >>>
>>>>>> >> >>> Regarding lazy consensus, if the board in general has less of
>>>>>> an issue
>>>>>> >> >>> with that, sure.  As long as it is clearly announced, lasts at
>>>>>> least
>>>>>> >> >>> 72 hours, and has a clear outcome.
>>>>>> >> >>>
>>>>>> >> >>> The other points are hard to comment on without being able to
>>>>>> see the
>>>>>> >> >>> text in question.
>>>>>> >> >>>
>>>>>> >> >>>
>>>>>> >> >>> On Mon, Nov 7, 2016 at 3:11 AM, Reynold Xin <
>>>>>> rxin@databricks.com>
>>>>>> >> >>> wrote:
>>>>>> >> >>> > I just looked through the entire thread again tonight -
>>>>>> there are a
>>>>>> >> >>> > lot
>>>>>> >> >>> > of
>>>>>> >> >>> > great ideas being discussed. Thanks Cody for taking the
>>>>>> first crack
>>>>>> >> >>> > at
>>>>>> >> >>> > the
>>>>>> >> >>> > proposal.
>>>>>> >> >>> >
>>>>>> >> >>> > I want to first comment on the context. Spark is one of the
>>>>>> most
>>>>>> >> >>> > innovative
>>>>>> >> >>> > and important projects in (big) data -- overall technical
>>>>>> decisions
>>>>>> >> >>> > made in
>>>>>> >> >>> > Apache Spark are sound. But of course, a project as large
>>>>>> and active
>>>>>> >> >>> > as
>>>>>> >> >>> > Spark always have room for improvement, and we as a
>>>>>> community should
>>>>>> >> >>> > strive
>>>>>> >> >>> > to take it to the next level.
>>>>>> >> >>> >
>>>>>> >> >>> > To that end, the two biggest areas for improvements in my
>>>>>> opinion
>>>>>> >> >>> > are:
>>>>>> >> >>> >
>>>>>> >> >>> > 1. Visibility: There are so much happening that it is
>>>>>> difficult to
>>>>>> >> >>> > know
>>>>>> >> >>> > what
>>>>>> >> >>> > really is going on. For people that don't follow closely, it
>>>>>> is
>>>>>> >> >>> > difficult to
>>>>>> >> >>> > know what the important initiatives are. Even for people
>>>>>> that do
>>>>>> >> >>> > follow, it
>>>>>> >> >>> > is difficult to know what specific things require their
>>>>>> attention,
>>>>>> >> >>> > since the
>>>>>> >> >>> > number of pull requests and JIRA tickets are high and it's
>>>>>> difficult
>>>>>> >> >>> > to
>>>>>> >> >>> > extract signal from noise.
>>>>>> >> >>> >
>>>>>> >> >>> > 2. Solicit user (broadly defined, including developers
>>>>>> themselves)
>>>>>> >> >>> > input
>>>>>> >> >>> > more proactively: At the end of the day the project provides
>>>>>> value
>>>>>> >> >>> > because
>>>>>> >> >>> > users use it. Users can't tell us exactly what to build, but
>>>>>> it is
>>>>>> >> >>> > important
>>>>>> >> >>> > to get their inputs.
>>>>>> >> >>> >
>>>>>> >> >>> >
>>>>>> >> >>> > I've taken Cody's doc and edited it:
>>>>>> >> >>> >
>>>>>> >> >>> >
>>>>>> >> >>> > https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-
>>>>>> nRanvXmnZ7SUi4qMljg/edit#heading=h.36ut37zh7w2b
>>>>>> >> >>> > (I've made all my modifications trackable)
>>>>>> >> >>> >
>>>>>> >> >>> > There are couple high level changes I made:
>>>>>> >> >>> >
>>>>>> >> >>> > 1. I've consulted a board member and he recommended lazy
>>>>>> consensus
>>>>>> >> >>> > as
>>>>>> >> >>> > opposed to voting. The reason being in voting there can
>>>>>> easily be a
>>>>>> >> >>> > "loser'
>>>>>> >> >>> > that gets outvoted.
>>>>>> >> >>> >
>>>>>> >> >>> > 2. I made it lighter weight, and renamed "strategy" to
>>>>>> "optional
>>>>>> >> >>> > design
>>>>>> >> >>> > sketch". Echoing one of the earlier email: "IMHO so far
>>>>>> aside from
>>>>>> >> >>> > tagging
>>>>>> >> >>> > things and linking them elsewhere simply having design docs
>>>>>> and
>>>>>> >> >>> > prototypes
>>>>>> >> >>> > implementations in PRs is not something that has not worked
>>>>>> so far".
>>>>>> >> >>> >
>>>>>> >> >>> > 3. I made some the language tweaks to focus more on
>>>>>> visibility. For
>>>>>> >> >>> > example,
>>>>>> >> >>> > "The purpose of an SIP is to inform and involve", rather
>>>>>> than just
>>>>>> >> >>> > "involve". SIPs should also have at least two emails that go
>>>>>> to
>>>>>> >> >>> > dev@.
>>>>>> >> >>> >
>>>>>> >> >>> >
>>>>>> >> >>> > While I was editing this, I thought we really needed a
>>>>>> suggested
>>>>>> >> >>> > template
>>>>>> >> >>> > for design doc too. I will get to that too ...
>>>>>> >> >>> >
>>>>>> >> >>> >
>>>>>> >> >>> > On Tue, Nov 1, 2016 at 12:09 AM, Reynold Xin <
>>>>>> rxin@databricks.com>
>>>>>> >> >>> > wrote:
>>>>>> >> >>> >>
>>>>>> >> >>> >> Most things looked OK to me too, although I do plan to take
>>>>>> a
>>>>>> >> >>> >> closer
>>>>>> >> >>> >> look
>>>>>> >> >>> >> after Nov 1st when we cut the release branch for 2.1.
>>>>>> >> >>> >>
>>>>>> >> >>> >>
>>>>>> >> >>> >> On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin
>>>>>> >> >>> >> <va...@cloudera.com>
>>>>>> >> >>> >> wrote:
>>>>>> >> >>> >>>
>>>>>> >> >>> >>> The proposal looks OK to me. I assume, even though it's not
>>>>>> >> >>> >>> explicitly
>>>>>> >> >>> >>> called, that voting would happen by e-mail? A template for
>>>>>> the
>>>>>> >> >>> >>> proposal document (instead of just a bullet nice) would
>>>>>> also be
>>>>>> >> >>> >>> nice,
>>>>>> >> >>> >>> but that can be done at any time.
>>>>>> >> >>> >>>
>>>>>> >> >>> >>> BTW, shameless plug: I filed SPARK-18085 which I consider a
>>>>>> >> >>> >>> candidate
>>>>>> >> >>> >>> for a SIP, given the scope of the work. The document
>>>>>> attached even
>>>>>> >> >>> >>> somewhat matches the proposed format. So if anyone wants
>>>>>> to try
>>>>>> >> >>> >>> out
>>>>>> >> >>> >>> the process...
>>>>>> >> >>> >>>
>>>>>> >> >>> >>> On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger
>>>>>> >> >>> >>> <co...@koeninger.org>
>>>>>> >> >>> >>> wrote:
>>>>>> >> >>> >>> > Now that spark summit europe is over, are any committers
>>>>>> >> >>> >>> > interested
>>>>>> >> >>> >>> > in
>>>>>> >> >>> >>> > moving forward with this?
>>>>>> >> >>> >>> >
>>>>>> >> >>> >>> >
>>>>>> >> >>> >>> >
>>>>>> >> >>> >>> >
>>>>>> >> >>> >>> > https://github.com/koeninger/s
>>>>>> park-1/blob/SIP-0/docs/spark-improvement-proposals.md
>>>>>> >> >>> >>> >
>>>>>> >> >>> >>> > Or are we going to let this discussion die on the vine?
>>>>>> >> >>> >>> >
>>>>>> >> >>> >>> > On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda
>>>>>> >> >>> >>> > <to...@outlook.com> wrote:
>>>>>> >> >>> >>> >> Maybe my mail was not clear enough.
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >> I didn't want to write "lets focus on Flink" or any
>>>>>> other
>>>>>> >> >>> >>> >> framework.
>>>>>> >> >>> >>> >> The
>>>>>> >> >>> >>> >> idea with benchmarks was to show two things:
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >> - why some people are doing bad PR for Spark
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >> - how - in easy way - we can change it and show that
>>>>>> Spark is
>>>>>> >> >>> >>> >> still on
>>>>>> >> >>> >>> >> the
>>>>>> >> >>> >>> >> top
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >> No more, no less. Benchmarks will be helpful, but I
>>>>>> don't think
>>>>>> >> >>> >>> >> they're the
>>>>>> >> >>> >>> >> most important thing in Spark :) On the Spark main page
>>>>>> there
>>>>>> >> >>> >>> >> is
>>>>>> >> >>> >>> >> still
>>>>>> >> >>> >>> >> chart
>>>>>> >> >>> >>> >> "Spark vs Hadoop". It is important to show that
>>>>>> framework is
>>>>>> >> >>> >>> >> not
>>>>>> >> >>> >>> >> the
>>>>>> >> >>> >>> >> same
>>>>>> >> >>> >>> >> Spark with other API, but much faster and optimized,
>>>>>> comparable
>>>>>> >> >>> >>> >> or
>>>>>> >> >>> >>> >> even
>>>>>> >> >>> >>> >> faster than other frameworks.
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >> About real-time streaming, I think it would be just
>>>>>> good to see
>>>>>> >> >>> >>> >> it
>>>>>> >> >>> >>> >> in
>>>>>> >> >>> >>> >> Spark.
>>>>>> >> >>> >>> >> I very like current Spark model, but many voices that
>>>>>> says "we
>>>>>> >> >>> >>> >> need
>>>>>> >> >>> >>> >> more" -
>>>>>> >> >>> >>> >> community should listen also them and try to help them.
>>>>>> With
>>>>>> >> >>> >>> >> SIPs
>>>>>> >> >>> >>> >> it
>>>>>> >> >>> >>> >> would
>>>>>> >> >>> >>> >> be easier, I've just posted this example as "thing that
>>>>>> may be
>>>>>> >> >>> >>> >> changed
>>>>>> >> >>> >>> >> with
>>>>>> >> >>> >>> >> SIP".
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >> I very like unification via Datasets, but there is a
>>>>>> lot of
>>>>>> >> >>> >>> >> algorithms
>>>>>> >> >>> >>> >> inside - let's make easy API, but with strong background
>>>>>> >> >>> >>> >> (articles,
>>>>>> >> >>> >>> >> benchmarks, descriptions, etc) that shows that Spark is
>>>>>> still
>>>>>> >> >>> >>> >> modern
>>>>>> >> >>> >>> >> framework.
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >> Maybe now my intention will be clearer :) As I said
>>>>>> >> >>> >>> >> organizational
>>>>>> >> >>> >>> >> ideas
>>>>>> >> >>> >>> >> were already mentioned and I agree with them, my mail
>>>>>> was just
>>>>>> >> >>> >>> >> to
>>>>>> >> >>> >>> >> show
>>>>>> >> >>> >>> >> some
>>>>>> >> >>> >>> >> aspects from my side, so from theside of developer and
>>>>>> person
>>>>>> >> >>> >>> >> who
>>>>>> >> >>> >>> >> is
>>>>>> >> >>> >>> >> trying
>>>>>> >> >>> >>> >> to help others with Spark (via StackOverflow or other
>>>>>> ways)
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >> Pozdrawiam / Best regards,
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >> Tomasz
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >> ________________________________
>>>>>> >> >>> >>> >> Od: Cody Koeninger <co...@koeninger.org>
>>>>>> >> >>> >>> >> Wysłane: 17 października 2016 16:46
>>>>>> >> >>> >>> >> Do: Debasish Das
>>>>>> >> >>> >>> >> DW: Tomasz Gawęda; dev@spark.apache.org
>>>>>> >> >>> >>> >> Temat: Re: Spark Improvement Proposals
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >> I think narrowly focusing on Flink or benchmarks is
>>>>>> missing my
>>>>>> >> >>> >>> >> point.
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >> My point is evolve or die.  Spark's governance and
>>>>>> organization
>>>>>> >> >>> >>> >> is
>>>>>> >> >>> >>> >> hampering its ability to evolve technologically, and it
>>>>>> needs
>>>>>> >> >>> >>> >> to
>>>>>> >> >>> >>> >> change.
>>>>>> >> >>> >>> >>
>>>>>> >> >>> >>> >> On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das
>>>>>> >> >>> >>> >> <de...@gmail.com>
>>>>>> >> >>> >>> >> wrote:
>>>>>> >> >>> >>> >>> Thanks Cody for bringing up a valid point...I picked
>>>>>> up Spark
>>>>>> >> >>> >>> >>> in
>>>>>> >> >>> >>> >>> 2014
>>>>>> >> >>> >>> >>> as
>>>>>> >> >>> >>> >>> soon as I looked into it since compared to writing Java
>>>>>> >> >>> >>> >>> map-reduce
>>>>>> >> >>> >>> >>> and
>>>>>> >> >>> >>> >>> Cascading code, Spark made writing distributed code
>>>>>> fun...But
>>>>>> >> >>> >>> >>> now
>>>>>> >> >>> >>> >>> as
>>>>>> >> >>> >>> >>> we
>>>>>> >> >>> >>> >>> went
>>>>>> >> >>> >>> >>> deeper with Spark and real-time streaming use-case
>>>>>> gets more
>>>>>> >> >>> >>> >>> prominent, I
>>>>>> >> >>> >>> >>> think it is time to bring a messaging model in
>>>>>> conjunction
>>>>>> >> >>> >>> >>> with
>>>>>> >> >>> >>> >>> the
>>>>>> >> >>> >>> >>> batch/micro-batch API that Spark is good
>>>>>> at....akka-streams
>>>>>> >> >>> >>> >>> close
>>>>>> >> >>> >>> >>> integration with spark micro-batching APIs looks like
>>>>>> a great
>>>>>> >> >>> >>> >>> direction to
>>>>>> >> >>> >>> >>> stay in the game with Apache Flink...Spark 2.0
>>>>>> integrated
>>>>>> >> >>> >>> >>> streaming
>>>>>> >> >>> >>> >>> with
>>>>>> >> >>> >>> >>> batch with the assumption is that micro-batching is
>>>>>> sufficient
>>>>>> >> >>> >>> >>> to
>>>>>> >> >>> >>> >>> run
>>>>>> >> >>> >>> >>> SQL
>>>>>> >> >>> >>> >>> commands on stream but do we really have time to do SQL
>>>>>> >> >>> >>> >>> processing at
>>>>>> >> >>> >>> >>> streaming data within 1-2 seconds ?
>>>>>> >> >>> >>> >>>
>>>>>> >> >>> >>> >>> After reading the email chain, I started to look into
>>>>>> Flink
>>>>>> >> >>> >>> >>> documentation
>>>>>> >> >>> >>> >>> and if you compare it with Spark documentation, I
>>>>>> think we
>>>>>> >> >>> >>> >>> have
>>>>>> >> >>> >>> >>> major
>>>>>> >> >>> >>> >>> work
>>>>>> >> >>> >>> >>> to do detailing out Spark internals so that more
>>>>>> people from
>>>>>> >> >>> >>> >>> community
>>>>>> >> >>> >>> >>> start
>>>>>> >> >>> >>> >>> to take active role in improving the issues so that
>>>>>> Spark
>>>>>> >> >>> >>> >>> stays
>>>>>> >> >>> >>> >>> strong
>>>>>> >> >>> >>> >>> compared to Flink.
>>>>>> >> >>> >>> >>>
>>>>>> >> >>> >>> >>>
>>>>>> >> >>> >>> >>> https://cwiki.apache.org/confl
>>>>>> uence/display/SPARK/Spark+Internals
>>>>>> >> >>> >>> >>>
>>>>>> >> >>> >>> >>>
>>>>>> >> >>> >>> >>> https://cwiki.apache.org/confl
>>>>>> uence/display/FLINK/Flink+Internals
>>>>>> >> >>> >>> >>>
>>>>>> >> >>> >>> >>> Spark is no longer an engine that works for
>>>>>> micro-batch and
>>>>>> >> >>> >>> >>> batch...We
>>>>>> >> >>> >>> >>> (and
>>>>>> >> >>> >>> >>> I am sure many others) are pushing spark as an engine
>>>>>> for
>>>>>> >> >>> >>> >>> stream
>>>>>> >> >>> >>> >>> and
>>>>>> >> >>> >>> >>> query
>>>>>> >> >>> >>> >>> processing.....we need to make it a state-of-the-art
>>>>>> engine
>>>>>> >> >>> >>> >>> for
>>>>>> >> >>> >>> >>> high
>>>>>> >> >>> >>> >>> speed
>>>>>> >> >>> >>> >>> streaming data and user queries as well !
>>>>>> >> >>> >>> >>>
>>>>>> >> >>> >>> >>> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda
>>>>>> >> >>> >>> >>> <to...@outlook.com>
>>>>>> >> >>> >>> >>> wrote:
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>> Hi everyone,
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>> I'm quite late with my answer, but I think my
>>>>>> suggestions may
>>>>>> >> >>> >>> >>>> help a
>>>>>> >> >>> >>> >>>> little bit. :) Many technical and organizational
>>>>>> topics were
>>>>>> >> >>> >>> >>>> mentioned,
>>>>>> >> >>> >>> >>>> but I want to focus on these negative posts about
>>>>>> Spark and
>>>>>> >> >>> >>> >>>> about
>>>>>> >> >>> >>> >>>> "haters"
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>> I really like Spark. Easy of use, speed, very good
>>>>>> community
>>>>>> >> >>> >>> >>>> -
>>>>>> >> >>> >>> >>>> it's
>>>>>> >> >>> >>> >>>> everything here. But Every project has to "flight" on
>>>>>> >> >>> >>> >>>> "framework
>>>>>> >> >>> >>> >>>> market"
>>>>>> >> >>> >>> >>>> to be still no 1. I'm following many Spark and Big
>>>>>> Data
>>>>>> >> >>> >>> >>>> communities,
>>>>>> >> >>> >>> >>>> maybe my mail will inspire someone :)
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>> You (every Spark developer; so far I didn't have
>>>>>> enough time
>>>>>> >> >>> >>> >>>> to
>>>>>> >> >>> >>> >>>> join
>>>>>> >> >>> >>> >>>> contributing to Spark) has done excellent job. So why
>>>>>> are
>>>>>> >> >>> >>> >>>> some
>>>>>> >> >>> >>> >>>> people
>>>>>> >> >>> >>> >>>> saying that Flink (or other framework) is better,
>>>>>> like it was
>>>>>> >> >>> >>> >>>> posted
>>>>>> >> >>> >>> >>>> in
>>>>>> >> >>> >>> >>>> this mailing list? No, not because that framework is
>>>>>> better
>>>>>> >> >>> >>> >>>> in
>>>>>> >> >>> >>> >>>> all
>>>>>> >> >>> >>> >>>> cases.. In my opinion, many of these discussions where
>>>>>> >> >>> >>> >>>> started
>>>>>> >> >>> >>> >>>> after
>>>>>> >> >>> >>> >>>> Flink marketing-like posts. Please look at
>>>>>> StackOverflow
>>>>>> >> >>> >>> >>>> "Flink
>>>>>> >> >>> >>> >>>> vs
>>>>>> >> >>> >>> >>>> ...."
>>>>>> >> >>> >>> >>>> posts, almost every post in "winned" by Flink.
>>>>>> Answers are
>>>>>> >> >>> >>> >>>> sometimes
>>>>>> >> >>> >>> >>>> saying nothing about other frameworks, Flink's users
>>>>>> (often
>>>>>> >> >>> >>> >>>> PMC's)
>>>>>> >> >>> >>> >>>> are
>>>>>> >> >>> >>> >>>> just posting same information about real-time
>>>>>> streaming,
>>>>>> >> >>> >>> >>>> about
>>>>>> >> >>> >>> >>>> delta
>>>>>> >> >>> >>> >>>> iterations, etc. It look smart and very often it is
>>>>>> marked as
>>>>>> >> >>> >>> >>>> an
>>>>>> >> >>> >>> >>>> aswer,
>>>>>> >> >>> >>> >>>> even if - in my opinion - there wasn't told all the
>>>>>> truth.
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>> My suggestion: I don't have enough money and
>>>>>> knowledgle to
>>>>>> >> >>> >>> >>>> perform
>>>>>> >> >>> >>> >>>> huge
>>>>>> >> >>> >>> >>>> performance test. Maybe some company, that supports
>>>>>> Spark
>>>>>> >> >>> >>> >>>> (Databricks,
>>>>>> >> >>> >>> >>>> Cloudera? - just saying you're most visible in
>>>>>> community :) )
>>>>>> >> >>> >>> >>>> could
>>>>>> >> >>> >>> >>>> perform performance test of:
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>> - streaming engine - probably Spark will loose
>>>>>> because of
>>>>>> >> >>> >>> >>>> mini-batch
>>>>>> >> >>> >>> >>>> model, however currently the difference should be
>>>>>> much lower
>>>>>> >> >>> >>> >>>> that in
>>>>>> >> >>> >>> >>>> previous versions
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>> - Machine Learning models
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>> - batch jobs
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>> - Graph jobs
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>> - SQL queries
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>> People will see that Spark is envolving and is also a
>>>>>> modern
>>>>>> >> >>> >>> >>>> framework,
>>>>>> >> >>> >>> >>>> because after reading posts mentioned above people
>>>>>> may think
>>>>>> >> >>> >>> >>>> "it
>>>>>> >> >>> >>> >>>> is
>>>>>> >> >>> >>> >>>> outdated, future is in framework X".
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>> Matei Zaharia posted excellent blog post about how
>>>>>> Spark
>>>>>> >> >>> >>> >>>> Structured
>>>>>> >> >>> >>> >>>> Streaming beats every other framework in terms of
>>>>>> easy-of-use
>>>>>> >> >>> >>> >>>> and
>>>>>> >> >>> >>> >>>> reliability. Performance tests, done in various
>>>>>> environments
>>>>>> >> >>> >>> >>>> (in
>>>>>> >> >>> >>> >>>> example: laptop, small 2 node cluster, 10-node
>>>>>> cluster,
>>>>>> >> >>> >>> >>>> 20-node
>>>>>> >> >>> >>> >>>> cluster), could be also very good marketing stuff to
>>>>>> say
>>>>>> >> >>> >>> >>>> "hey,
>>>>>> >> >>> >>> >>>> you're
>>>>>> >> >>> >>> >>>> telling that you're better, but Spark is still faster
>>>>>> and is
>>>>>> >> >>> >>> >>>> still
>>>>>> >> >>> >>> >>>> getting even more fast!". This would be based on
>>>>>> facts (just
>>>>>> >> >>> >>> >>>> numbers),
>>>>>> >> >>> >>> >>>> not opinions. It would be good for companies, for
>>>>>> marketing
>>>>>> >> >>> >>> >>>> puproses
>>>>>> >> >>> >>> >>>> and
>>>>>> >> >>> >>> >>>> for every Spark developer
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>> Second: real-time streaming. I've written some time
>>>>>> ago about
>>>>>> >> >>> >>> >>>> real-time
>>>>>> >> >>> >>> >>>> streaming support in Spark Structured Streaming. Some
>>>>>> work
>>>>>> >> >>> >>> >>>> should be
>>>>>> >> >>> >>> >>>> done to make SSS more low-latency, but I think it's
>>>>>> possible.
>>>>>> >> >>> >>> >>>> Maybe
>>>>>> >> >>> >>> >>>> Spark may look at Gearpump, which is also built on
>>>>>> top of
>>>>>> >> >>> >>> >>>> Akka?
>>>>>> >> >>> >>> >>>> I
>>>>>> >> >>> >>> >>>> don't
>>>>>> >> >>> >>> >>>> know yet, it is good topic for SIP. However I think
>>>>>> that
>>>>>> >> >>> >>> >>>> Spark
>>>>>> >> >>> >>> >>>> should
>>>>>> >> >>> >>> >>>> have real-time streaming support. Currently I see many
>>>>>> >> >>> >>> >>>> posts/comments
>>>>>> >> >>> >>> >>>> that "Spark has too big latency". Spark Streaming is
>>>>>> doing
>>>>>> >> >>> >>> >>>> very
>>>>>> >> >>> >>> >>>> good
>>>>>> >> >>> >>> >>>> jobs with micro-batches, however I think it is
>>>>>> possible to
>>>>>> >> >>> >>> >>>> add
>>>>>> >> >>> >>> >>>> also
>>>>>> >> >>> >>> >>>> more
>>>>>> >> >>> >>> >>>> real-time processing.
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>> Other people said much more and I agree with proposal
>>>>>> of SIP.
>>>>>> >> >>> >>> >>>> I'm
>>>>>> >> >>> >>> >>>> also
>>>>>> >> >>> >>> >>>> happy that PMC's are not saying that they will not
>>>>>> listen to
>>>>>> >> >>> >>> >>>> users,
>>>>>> >> >>> >>> >>>> but
>>>>>> >> >>> >>> >>>> they really want to make Spark better for every user.
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>> What do you think about these two topics? Especially
>>>>>> I'm
>>>>>> >> >>> >>> >>>> looking
>>>>>> >> >>> >>> >>>> at
>>>>>> >> >>> >>> >>>> Cody
>>>>>> >> >>> >>> >>>> (who has started this topic) and PMCs :)
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>> Pozdrawiam / Best regards,
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>> Tomasz
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>> >>>>
>>>>>> >> >>> >>>
>>>>>> >> >>> >>
>>>>>> >> >>> >
>>>>>> >> >>> >
>>>>>> >> >
>>>>>> >> >
>>>>>> >>
>>>>>> >> ------------------------------------------------------------
>>>>>> ---------
>>>>>> >> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>> >>
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > --
>>>>>> > Ryan Blue
>>>>>> > Software Engineer
>>>>>> > Netflix
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Joseph Bradley
>>>>>
>>>>> Software Engineer - Machine Learning
>>>>>
>>>>> Databricks, Inc.
>>>>>
>>>>> [image: http://databricks.com] <http://databricks.com/>
>>>>>
>>>>
>>>>
>>>
>>
>

Re: Spark Improvement Proposals

Posted by Reynold Xin <rx...@databricks.com>.

+1 on all counts (consensus, time bound, define roles)

I can update the doc in the next few days and share back. Then maybe we can
just officially vote on this. As Tim suggested, we might not get it 100%
right the first time and would need to re-iterate. But that's fine.


On Thu, Jan 5, 2017 at 3:29 PM, Tim Hunter <ti...@databricks.com> wrote:

> Hi Cody,
> thank you for bringing up this topic, I agree it is very important to keep
> a cohesive community around some common, fluid goals. Here are a few
> comments about the current document:
>
> 1. name: it should not overlap with an existing one such as SIP. Can you
> imagine someone trying to discuss a scala spore proposal for spark?
> "[Spark] SIP-3 is intended to evolve in tandem with [Scala] SIP-21". SPIP
> sounds great.
>
> 2. roles: at a high level, SPIPs are meant to reach consensus for
> technical decisions with a lasting impact. As such, the template should
> emphasize the role of the various parties during this process:
>
>  - the SPIP author is responsible for building consensus. She is the
> champion driving the process forward and is responsible for ensuring that
> the SPIP follows the general guidelines. The author should be identified in
> the SPIP. The authorship of a SPIP can be transferred if the current author
> is not interested and someone else wants to move the SPIP forward. There
> should probably be 2-3 authors at most for each SPIP.
>
>  - someone with voting power should probably shepherd the SPIP (and be
> recorded as such): ensuring that the final decision over the SPIP is
> recorded (rejected, accepted, etc.), and advising about the technical
> quality of the SPIP: this person need not be a champion for the SPIP or
> contribute to it, but rather makes sure it stands a chance of being
> approved when the vote happens. Also, if the author cannot find anyone who
> would want to take this role, this proposal is likely to be rejected anyway.
>
>  - users, committers, contributors have the roles already outlined in the
> document
>
> 3. timeline: ideally, once a SPIP has been offered for voting, it should
> move swiftly into either being accepted or rejected, so that we do not end
> up with a distracting long tail of half-hearted proposals.
>
> These rules are meant to be flexible, but the current document should be
> clear about who is in charge of a SPIP, and the state it is currently in.
>
> We have had long discussions over some very important questions such as
> approval. I do not have an opinion on these, but why not make a pick and
> reevaluate this decision later? This is not a binding process at this point.
>
> Tim
>
>
> On Tue, Jan 3, 2017 at 3:16 PM, Cody Koeninger <co...@koeninger.org> wrote:
>
>> I don't have a concern about voting vs consensus.
>>
>> I have a concern that whatever the decision making process is, it is
>> explicitly announced on the ticket for the given proposal, with an explicit
>> deadline, and an explicit outcome.
>>
>>
>> On Tue, Jan 3, 2017 at 4:08 PM, Imran Rashid <ir...@cloudera.com>
>> wrote:
>>
>>> I'm also in favor of this.  Thanks for your persistence Cody.
>>>
>>> My take on the specific issues Joseph mentioned:
>>>
>>> 1) voting vs. consensus -- I agree with the argument Ryan Blue made
>>> earlier for consensus:
>>>
>>> > Majority vs consensus: My rationale is that I don't think we want to
>>> consider a proposal approved if it had objections serious enough that
>>> committers down-voted (or PMC depending on who gets a vote). If these
>>> proposals are like PEPs, then they represent a significant amount of
>>> community effort and I wouldn't want to move forward if up to half of the
>>> community thinks it's an untenable idea.
>>>
>>> 2) Design doc template -- agree this would be useful, but also seems
>>> totally orthogonal to moving forward on the SIP proposal.
>>>
>>> 3) agree w/ Joseph's proposal for updating the template.
>>>
>>> One small addition:
>>>
>>> 4) Deciding on a name -- minor, but I think its wroth disambiguating
>>> from Scala's SIPs, and the best proposal I've heard is "SPIP".   At least,
>>> no one has objected.  (don't care enough that I'd object to anything else,
>>> though.)
>>>
>>>
>>> On Tue, Jan 3, 2017 at 3:30 PM, Joseph Bradley <jo...@databricks.com>
>>> wrote:
>>>
>>>> Hi Cody,
>>>>
>>>> Thanks for being persistent about this.  I too would like to see this
>>>> happen.  Reviewing the thread, it sounds like the main things remaining are:
>>>> * Decide about a few issues
>>>> * Finalize the doc(s)
>>>> * Vote on this proposal
>>>>
>>>> Issues & TODOs:
>>>>
>>>> (1) The main issue I see above is voting vs. consensus.  I have little
>>>> preference here.  It sounds like something which could be tailored based on
>>>> whether we see too many or too few SIPs being approved.
>>>>
>>>> (2) Design doc template  (This would be great to have for Spark
>>>> regardless of this SIP discussion.)
>>>> * Reynold, are you still putting this together?
>>>>
>>>> (3) Template cleanups.  Listing some items mentioned above + a new one
>>>> w.r.t. Reynold's draft
>>>> <https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#>
>>>> :
>>>> * Reinstate the "Where" section with links to current and past SIPs
>>>> * Add field for stating explicit deadlines for approval
>>>> * Add field for stating Author & Committer shepherd
>>>>
>>>> Thanks all!
>>>> Joseph
>>>>
>>>> On Mon, Jan 2, 2017 at 7:45 AM, Cody Koeninger <co...@koeninger.org>
>>>> wrote:
>>>>
>>>>> I'm bumping this one more time for the new year, and then I'm giving
>>>>> up.
>>>>>
>>>>> Please, fix your process, even if it isn't exactly the way I suggested.
>>>>>
>>>>> On Tue, Nov 8, 2016 at 11:14 AM, Ryan Blue <rb...@netflix.com> wrote:
>>>>> > On lazy consensus as opposed to voting:
>>>>> >
>>>>> > First, why lazy consensus? The proposal was for consensus, which is
>>>>> at least
>>>>> > three +1 votes and no vetos. Consensus has no losing side, it
>>>>> requires
>>>>> > getting to a point where there is agreement. Isn't that agreement
>>>>> what we
>>>>> > want to achieve with these proposals?
>>>>> >
>>>>> > Second, lazy consensus only removes the requirement for three +1
>>>>> votes. Why
>>>>> > would we not want at least three committers to think something is a
>>>>> good
>>>>> > idea before adopting the proposal?
>>>>> >
>>>>> > rb
>>>>> >
>>>>> > On Tue, Nov 8, 2016 at 8:13 AM, Cody Koeninger <co...@koeninger.org>
>>>>> wrote:
>>>>> >>
>>>>> >> So there are some minor things (the Where section heading appears to
>>>>> >> be dropped; wherever this document is posted it needs to actually
>>>>> link
>>>>> >> to a jira filter showing current / past SIPs) but it doesn't look
>>>>> like
>>>>> >> I can comment on the google doc.
>>>>> >>
>>>>> >> The major substantive issue that I have is that this version is
>>>>> >> significantly less clear as to the outcome of an SIP.
>>>>> >>
>>>>> >> The apache example of lazy consensus at
>>>>> >> http://apache.org/foundation/voting.html#LazyConsensus involves an
>>>>> >> explicit announcement of an explicit deadline, which I think are
>>>>> >> necessary for clarity.
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> On Mon, Nov 7, 2016 at 1:55 PM, Reynold Xin <rx...@databricks.com>
>>>>> wrote:
>>>>> >> > It turned out suggested edits (trackable) don't show up for
>>>>> non-owners,
>>>>> >> > so
>>>>> >> > I've just merged all the edits in place. It should be visible now.
>>>>> >> >
>>>>> >> > On Mon, Nov 7, 2016 at 10:10 AM, Reynold Xin <rxin@databricks.com
>>>>> >
>>>>> >> > wrote:
>>>>> >> >>
>>>>> >> >> Oops. Let me try figure that out.
>>>>> >> >>
>>>>> >> >>
>>>>> >> >> On Monday, November 7, 2016, Cody Koeninger <co...@koeninger.org>
>>>>> wrote:
>>>>> >> >>>
>>>>> >> >>> Thanks for picking up on this.
>>>>> >> >>>
>>>>> >> >>> Maybe I fail at google docs, but I can't see any edits on the
>>>>> document
>>>>> >> >>> you linked.
>>>>> >> >>>
>>>>> >> >>> Regarding lazy consensus, if the board in general has less of
>>>>> an issue
>>>>> >> >>> with that, sure.  As long as it is clearly announced, lasts at
>>>>> least
>>>>> >> >>> 72 hours, and has a clear outcome.
>>>>> >> >>>
>>>>> >> >>> The other points are hard to comment on without being able to
>>>>> see the
>>>>> >> >>> text in question.
>>>>> >> >>>
>>>>> >> >>>
>>>>> >> >>> On Mon, Nov 7, 2016 at 3:11 AM, Reynold Xin <
>>>>> rxin@databricks.com>
>>>>> >> >>> wrote:
>>>>> >> >>> > I just looked through the entire thread again tonight - there
>>>>> are a
>>>>> >> >>> > lot
>>>>> >> >>> > of
>>>>> >> >>> > great ideas being discussed. Thanks Cody for taking the first
>>>>> crack
>>>>> >> >>> > at
>>>>> >> >>> > the
>>>>> >> >>> > proposal.
>>>>> >> >>> >
>>>>> >> >>> > I want to first comment on the context. Spark is one of the
>>>>> most
>>>>> >> >>> > innovative
>>>>> >> >>> > and important projects in (big) data -- overall technical
>>>>> decisions
>>>>> >> >>> > made in
>>>>> >> >>> > Apache Spark are sound. But of course, a project as large and
>>>>> active
>>>>> >> >>> > as
>>>>> >> >>> > Spark always have room for improvement, and we as a community
>>>>> should
>>>>> >> >>> > strive
>>>>> >> >>> > to take it to the next level.
>>>>> >> >>> >
>>>>> >> >>> > To that end, the two biggest areas for improvements in my
>>>>> opinion
>>>>> >> >>> > are:
>>>>> >> >>> >
>>>>> >> >>> > 1. Visibility: There are so much happening that it is
>>>>> difficult to
>>>>> >> >>> > know
>>>>> >> >>> > what
>>>>> >> >>> > really is going on. For people that don't follow closely, it
>>>>> is
>>>>> >> >>> > difficult to
>>>>> >> >>> > know what the important initiatives are. Even for people that
>>>>> do
>>>>> >> >>> > follow, it
>>>>> >> >>> > is difficult to know what specific things require their
>>>>> attention,
>>>>> >> >>> > since the
>>>>> >> >>> > number of pull requests and JIRA tickets are high and it's
>>>>> difficult
>>>>> >> >>> > to
>>>>> >> >>> > extract signal from noise.
>>>>> >> >>> >
>>>>> >> >>> > 2. Solicit user (broadly defined, including developers
>>>>> themselves)
>>>>> >> >>> > input
>>>>> >> >>> > more proactively: At the end of the day the project provides
>>>>> value
>>>>> >> >>> > because
>>>>> >> >>> > users use it. Users can't tell us exactly what to build, but
>>>>> it is
>>>>> >> >>> > important
>>>>> >> >>> > to get their inputs.
>>>>> >> >>> >
>>>>> >> >>> >
>>>>> >> >>> > I've taken Cody's doc and edited it:
>>>>> >> >>> >
>>>>> >> >>> >
>>>>> >> >>> > https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-
>>>>> nRanvXmnZ7SUi4qMljg/edit#heading=h.36ut37zh7w2b
>>>>> >> >>> > (I've made all my modifications trackable)
>>>>> >> >>> >
>>>>> >> >>> > There are couple high level changes I made:
>>>>> >> >>> >
>>>>> >> >>> > 1. I've consulted a board member and he recommended lazy
>>>>> consensus
>>>>> >> >>> > as
>>>>> >> >>> > opposed to voting. The reason being in voting there can
>>>>> easily be a
>>>>> >> >>> > "loser'
>>>>> >> >>> > that gets outvoted.
>>>>> >> >>> >
>>>>> >> >>> > 2. I made it lighter weight, and renamed "strategy" to
>>>>> "optional
>>>>> >> >>> > design
>>>>> >> >>> > sketch". Echoing one of the earlier email: "IMHO so far aside
>>>>> from
>>>>> >> >>> > tagging
>>>>> >> >>> > things and linking them elsewhere simply having design docs
>>>>> and
>>>>> >> >>> > prototypes
>>>>> >> >>> > implementations in PRs is not something that has not worked
>>>>> so far".
>>>>> >> >>> >
>>>>> >> >>> > 3. I made some the language tweaks to focus more on
>>>>> visibility. For
>>>>> >> >>> > example,
>>>>> >> >>> > "The purpose of an SIP is to inform and involve", rather than
>>>>> just
>>>>> >> >>> > "involve". SIPs should also have at least two emails that go
>>>>> to
>>>>> >> >>> > dev@.
>>>>> >> >>> >
>>>>> >> >>> >
>>>>> >> >>> > While I was editing this, I thought we really needed a
>>>>> suggested
>>>>> >> >>> > template
>>>>> >> >>> > for design doc too. I will get to that too ...
>>>>> >> >>> >
>>>>> >> >>> >
>>>>> >> >>> > On Tue, Nov 1, 2016 at 12:09 AM, Reynold Xin <
>>>>> rxin@databricks.com>
>>>>> >> >>> > wrote:
>>>>> >> >>> >>
>>>>> >> >>> >> Most things looked OK to me too, although I do plan to take a
>>>>> >> >>> >> closer
>>>>> >> >>> >> look
>>>>> >> >>> >> after Nov 1st when we cut the release branch for 2.1.
>>>>> >> >>> >>
>>>>> >> >>> >>
>>>>> >> >>> >> On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin
>>>>> >> >>> >> <va...@cloudera.com>
>>>>> >> >>> >> wrote:
>>>>> >> >>> >>>
>>>>> >> >>> >>> The proposal looks OK to me. I assume, even though it's not
>>>>> >> >>> >>> explicitly
>>>>> >> >>> >>> called, that voting would happen by e-mail? A template for
>>>>> the
>>>>> >> >>> >>> proposal document (instead of just a bullet nice) would
>>>>> also be
>>>>> >> >>> >>> nice,
>>>>> >> >>> >>> but that can be done at any time.
>>>>> >> >>> >>>
>>>>> >> >>> >>> BTW, shameless plug: I filed SPARK-18085 which I consider a
>>>>> >> >>> >>> candidate
>>>>> >> >>> >>> for a SIP, given the scope of the work. The document
>>>>> attached even
>>>>> >> >>> >>> somewhat matches the proposed format. So if anyone wants to
>>>>> try
>>>>> >> >>> >>> out
>>>>> >> >>> >>> the process...
>>>>> >> >>> >>>
>>>>> >> >>> >>> On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger
>>>>> >> >>> >>> <co...@koeninger.org>
>>>>> >> >>> >>> wrote:
>>>>> >> >>> >>> > Now that spark summit europe is over, are any committers
>>>>> >> >>> >>> > interested
>>>>> >> >>> >>> > in
>>>>> >> >>> >>> > moving forward with this?
>>>>> >> >>> >>> >
>>>>> >> >>> >>> >
>>>>> >> >>> >>> >
>>>>> >> >>> >>> >
>>>>> >> >>> >>> > https://github.com/koeninger/s
>>>>> park-1/blob/SIP-0/docs/spark-improvement-proposals.md
>>>>> >> >>> >>> >
>>>>> >> >>> >>> > Or are we going to let this discussion die on the vine?
>>>>> >> >>> >>> >
>>>>> >> >>> >>> > On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda
>>>>> >> >>> >>> > <to...@outlook.com> wrote:
>>>>> >> >>> >>> >> Maybe my mail was not clear enough.
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >> I didn't want to write "lets focus on Flink" or any other
>>>>> >> >>> >>> >> framework.
>>>>> >> >>> >>> >> The
>>>>> >> >>> >>> >> idea with benchmarks was to show two things:
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >> - why some people are doing bad PR for Spark
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >> - how - in easy way - we can change it and show that
>>>>> Spark is
>>>>> >> >>> >>> >> still on
>>>>> >> >>> >>> >> the
>>>>> >> >>> >>> >> top
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >> No more, no less. Benchmarks will be helpful, but I
>>>>> don't think
>>>>> >> >>> >>> >> they're the
>>>>> >> >>> >>> >> most important thing in Spark :) On the Spark main page
>>>>> there
>>>>> >> >>> >>> >> is
>>>>> >> >>> >>> >> still
>>>>> >> >>> >>> >> chart
>>>>> >> >>> >>> >> "Spark vs Hadoop". It is important to show that
>>>>> framework is
>>>>> >> >>> >>> >> not
>>>>> >> >>> >>> >> the
>>>>> >> >>> >>> >> same
>>>>> >> >>> >>> >> Spark with other API, but much faster and optimized,
>>>>> comparable
>>>>> >> >>> >>> >> or
>>>>> >> >>> >>> >> even
>>>>> >> >>> >>> >> faster than other frameworks.
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >> About real-time streaming, I think it would be just good
>>>>> to see
>>>>> >> >>> >>> >> it
>>>>> >> >>> >>> >> in
>>>>> >> >>> >>> >> Spark.
>>>>> >> >>> >>> >> I very like current Spark model, but many voices that
>>>>> says "we
>>>>> >> >>> >>> >> need
>>>>> >> >>> >>> >> more" -
>>>>> >> >>> >>> >> community should listen also them and try to help them.
>>>>> With
>>>>> >> >>> >>> >> SIPs
>>>>> >> >>> >>> >> it
>>>>> >> >>> >>> >> would
>>>>> >> >>> >>> >> be easier, I've just posted this example as "thing that
>>>>> may be
>>>>> >> >>> >>> >> changed
>>>>> >> >>> >>> >> with
>>>>> >> >>> >>> >> SIP".
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >> I very like unification via Datasets, but there is a lot
>>>>> of
>>>>> >> >>> >>> >> algorithms
>>>>> >> >>> >>> >> inside - let's make easy API, but with strong background
>>>>> >> >>> >>> >> (articles,
>>>>> >> >>> >>> >> benchmarks, descriptions, etc) that shows that Spark is
>>>>> still
>>>>> >> >>> >>> >> modern
>>>>> >> >>> >>> >> framework.
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >> Maybe now my intention will be clearer :) As I said
>>>>> >> >>> >>> >> organizational
>>>>> >> >>> >>> >> ideas
>>>>> >> >>> >>> >> were already mentioned and I agree with them, my mail
>>>>> was just
>>>>> >> >>> >>> >> to
>>>>> >> >>> >>> >> show
>>>>> >> >>> >>> >> some
>>>>> >> >>> >>> >> aspects from my side, so from theside of developer and
>>>>> person
>>>>> >> >>> >>> >> who
>>>>> >> >>> >>> >> is
>>>>> >> >>> >>> >> trying
>>>>> >> >>> >>> >> to help others with Spark (via StackOverflow or other
>>>>> ways)
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >> Pozdrawiam / Best regards,
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >> Tomasz
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >> ________________________________
>>>>> >> >>> >>> >> Od: Cody Koeninger <co...@koeninger.org>
>>>>> >> >>> >>> >> Wysłane: 17 października 2016 16:46
>>>>> >> >>> >>> >> Do: Debasish Das
>>>>> >> >>> >>> >> DW: Tomasz Gawęda; dev@spark.apache.org
>>>>> >> >>> >>> >> Temat: Re: Spark Improvement Proposals
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >> I think narrowly focusing on Flink or benchmarks is
>>>>> missing my
>>>>> >> >>> >>> >> point.
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >> My point is evolve or die.  Spark's governance and
>>>>> organization
>>>>> >> >>> >>> >> is
>>>>> >> >>> >>> >> hampering its ability to evolve technologically, and it
>>>>> needs
>>>>> >> >>> >>> >> to
>>>>> >> >>> >>> >> change.
>>>>> >> >>> >>> >>
>>>>> >> >>> >>> >> On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das
>>>>> >> >>> >>> >> <de...@gmail.com>
>>>>> >> >>> >>> >> wrote:
>>>>> >> >>> >>> >>> Thanks Cody for bringing up a valid point...I picked up
>>>>> Spark
>>>>> >> >>> >>> >>> in
>>>>> >> >>> >>> >>> 2014
>>>>> >> >>> >>> >>> as
>>>>> >> >>> >>> >>> soon as I looked into it since compared to writing Java
>>>>> >> >>> >>> >>> map-reduce
>>>>> >> >>> >>> >>> and
>>>>> >> >>> >>> >>> Cascading code, Spark made writing distributed code
>>>>> fun...But
>>>>> >> >>> >>> >>> now
>>>>> >> >>> >>> >>> as
>>>>> >> >>> >>> >>> we
>>>>> >> >>> >>> >>> went
>>>>> >> >>> >>> >>> deeper with Spark and real-time streaming use-case gets
>>>>> more
>>>>> >> >>> >>> >>> prominent, I
>>>>> >> >>> >>> >>> think it is time to bring a messaging model in
>>>>> conjunction
>>>>> >> >>> >>> >>> with
>>>>> >> >>> >>> >>> the
>>>>> >> >>> >>> >>> batch/micro-batch API that Spark is good
>>>>> at....akka-streams
>>>>> >> >>> >>> >>> close
>>>>> >> >>> >>> >>> integration with spark micro-batching APIs looks like a
>>>>> great
>>>>> >> >>> >>> >>> direction to
>>>>> >> >>> >>> >>> stay in the game with Apache Flink...Spark 2.0
>>>>> integrated
>>>>> >> >>> >>> >>> streaming
>>>>> >> >>> >>> >>> with
>>>>> >> >>> >>> >>> batch with the assumption is that micro-batching is
>>>>> sufficient
>>>>> >> >>> >>> >>> to
>>>>> >> >>> >>> >>> run
>>>>> >> >>> >>> >>> SQL
>>>>> >> >>> >>> >>> commands on stream but do we really have time to do SQL
>>>>> >> >>> >>> >>> processing at
>>>>> >> >>> >>> >>> streaming data within 1-2 seconds ?
>>>>> >> >>> >>> >>>
>>>>> >> >>> >>> >>> After reading the email chain, I started to look into
>>>>> Flink
>>>>> >> >>> >>> >>> documentation
>>>>> >> >>> >>> >>> and if you compare it with Spark documentation, I think
>>>>> we
>>>>> >> >>> >>> >>> have
>>>>> >> >>> >>> >>> major
>>>>> >> >>> >>> >>> work
>>>>> >> >>> >>> >>> to do detailing out Spark internals so that more people
>>>>> from
>>>>> >> >>> >>> >>> community
>>>>> >> >>> >>> >>> start
>>>>> >> >>> >>> >>> to take active role in improving the issues so that
>>>>> Spark
>>>>> >> >>> >>> >>> stays
>>>>> >> >>> >>> >>> strong
>>>>> >> >>> >>> >>> compared to Flink.
>>>>> >> >>> >>> >>>
>>>>> >> >>> >>> >>>
>>>>> >> >>> >>> >>> https://cwiki.apache.org/confl
>>>>> uence/display/SPARK/Spark+Internals
>>>>> >> >>> >>> >>>
>>>>> >> >>> >>> >>>
>>>>> >> >>> >>> >>> https://cwiki.apache.org/confl
>>>>> uence/display/FLINK/Flink+Internals
>>>>> >> >>> >>> >>>
>>>>> >> >>> >>> >>> Spark is no longer an engine that works for micro-batch
>>>>> and
>>>>> >> >>> >>> >>> batch...We
>>>>> >> >>> >>> >>> (and
>>>>> >> >>> >>> >>> I am sure many others) are pushing spark as an engine
>>>>> for
>>>>> >> >>> >>> >>> stream
>>>>> >> >>> >>> >>> and
>>>>> >> >>> >>> >>> query
>>>>> >> >>> >>> >>> processing.....we need to make it a state-of-the-art
>>>>> engine
>>>>> >> >>> >>> >>> for
>>>>> >> >>> >>> >>> high
>>>>> >> >>> >>> >>> speed
>>>>> >> >>> >>> >>> streaming data and user queries as well !
>>>>> >> >>> >>> >>>
>>>>> >> >>> >>> >>> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda
>>>>> >> >>> >>> >>> <to...@outlook.com>
>>>>> >> >>> >>> >>> wrote:
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>> Hi everyone,
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>> I'm quite late with my answer, but I think my
>>>>> suggestions may
>>>>> >> >>> >>> >>>> help a
>>>>> >> >>> >>> >>>> little bit. :) Many technical and organizational
>>>>> topics were
>>>>> >> >>> >>> >>>> mentioned,
>>>>> >> >>> >>> >>>> but I want to focus on these negative posts about
>>>>> Spark and
>>>>> >> >>> >>> >>>> about
>>>>> >> >>> >>> >>>> "haters"
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>> I really like Spark. Easy of use, speed, very good
>>>>> community
>>>>> >> >>> >>> >>>> -
>>>>> >> >>> >>> >>>> it's
>>>>> >> >>> >>> >>>> everything here. But Every project has to "flight" on
>>>>> >> >>> >>> >>>> "framework
>>>>> >> >>> >>> >>>> market"
>>>>> >> >>> >>> >>>> to be still no 1. I'm following many Spark and Big Data
>>>>> >> >>> >>> >>>> communities,
>>>>> >> >>> >>> >>>> maybe my mail will inspire someone :)
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>> You (every Spark developer; so far I didn't have
>>>>> enough time
>>>>> >> >>> >>> >>>> to
>>>>> >> >>> >>> >>>> join
>>>>> >> >>> >>> >>>> contributing to Spark) has done excellent job. So why
>>>>> are
>>>>> >> >>> >>> >>>> some
>>>>> >> >>> >>> >>>> people
>>>>> >> >>> >>> >>>> saying that Flink (or other framework) is better, like
>>>>> it was
>>>>> >> >>> >>> >>>> posted
>>>>> >> >>> >>> >>>> in
>>>>> >> >>> >>> >>>> this mailing list? No, not because that framework is
>>>>> better
>>>>> >> >>> >>> >>>> in
>>>>> >> >>> >>> >>>> all
>>>>> >> >>> >>> >>>> cases.. In my opinion, many of these discussions where
>>>>> >> >>> >>> >>>> started
>>>>> >> >>> >>> >>>> after
>>>>> >> >>> >>> >>>> Flink marketing-like posts. Please look at
>>>>> StackOverflow
>>>>> >> >>> >>> >>>> "Flink
>>>>> >> >>> >>> >>>> vs
>>>>> >> >>> >>> >>>> ...."
>>>>> >> >>> >>> >>>> posts, almost every post in "winned" by Flink. Answers
>>>>> are
>>>>> >> >>> >>> >>>> sometimes
>>>>> >> >>> >>> >>>> saying nothing about other frameworks, Flink's users
>>>>> (often
>>>>> >> >>> >>> >>>> PMC's)
>>>>> >> >>> >>> >>>> are
>>>>> >> >>> >>> >>>> just posting same information about real-time
>>>>> streaming,
>>>>> >> >>> >>> >>>> about
>>>>> >> >>> >>> >>>> delta
>>>>> >> >>> >>> >>>> iterations, etc. It look smart and very often it is
>>>>> marked as
>>>>> >> >>> >>> >>>> an
>>>>> >> >>> >>> >>>> aswer,
>>>>> >> >>> >>> >>>> even if - in my opinion - there wasn't told all the
>>>>> truth.
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>> My suggestion: I don't have enough money and
>>>>> knowledgle to
>>>>> >> >>> >>> >>>> perform
>>>>> >> >>> >>> >>>> huge
>>>>> >> >>> >>> >>>> performance test. Maybe some company, that supports
>>>>> Spark
>>>>> >> >>> >>> >>>> (Databricks,
>>>>> >> >>> >>> >>>> Cloudera? - just saying you're most visible in
>>>>> community :) )
>>>>> >> >>> >>> >>>> could
>>>>> >> >>> >>> >>>> perform performance test of:
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>> - streaming engine - probably Spark will loose because
>>>>> of
>>>>> >> >>> >>> >>>> mini-batch
>>>>> >> >>> >>> >>>> model, however currently the difference should be much
>>>>> lower
>>>>> >> >>> >>> >>>> that in
>>>>> >> >>> >>> >>>> previous versions
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>> - Machine Learning models
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>> - batch jobs
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>> - Graph jobs
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>> - SQL queries
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>> People will see that Spark is envolving and is also a
>>>>> modern
>>>>> >> >>> >>> >>>> framework,
>>>>> >> >>> >>> >>>> because after reading posts mentioned above people may
>>>>> think
>>>>> >> >>> >>> >>>> "it
>>>>> >> >>> >>> >>>> is
>>>>> >> >>> >>> >>>> outdated, future is in framework X".
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>> Matei Zaharia posted excellent blog post about how
>>>>> Spark
>>>>> >> >>> >>> >>>> Structured
>>>>> >> >>> >>> >>>> Streaming beats every other framework in terms of
>>>>> easy-of-use
>>>>> >> >>> >>> >>>> and
>>>>> >> >>> >>> >>>> reliability. Performance tests, done in various
>>>>> environments
>>>>> >> >>> >>> >>>> (in
>>>>> >> >>> >>> >>>> example: laptop, small 2 node cluster, 10-node cluster,
>>>>> >> >>> >>> >>>> 20-node
>>>>> >> >>> >>> >>>> cluster), could be also very good marketing stuff to
>>>>> say
>>>>> >> >>> >>> >>>> "hey,
>>>>> >> >>> >>> >>>> you're
>>>>> >> >>> >>> >>>> telling that you're better, but Spark is still faster
>>>>> and is
>>>>> >> >>> >>> >>>> still
>>>>> >> >>> >>> >>>> getting even more fast!". This would be based on facts
>>>>> (just
>>>>> >> >>> >>> >>>> numbers),
>>>>> >> >>> >>> >>>> not opinions. It would be good for companies, for
>>>>> marketing
>>>>> >> >>> >>> >>>> puproses
>>>>> >> >>> >>> >>>> and
>>>>> >> >>> >>> >>>> for every Spark developer
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>> Second: real-time streaming. I've written some time
>>>>> ago about
>>>>> >> >>> >>> >>>> real-time
>>>>> >> >>> >>> >>>> streaming support in Spark Structured Streaming. Some
>>>>> work
>>>>> >> >>> >>> >>>> should be
>>>>> >> >>> >>> >>>> done to make SSS more low-latency, but I think it's
>>>>> possible.
>>>>> >> >>> >>> >>>> Maybe
>>>>> >> >>> >>> >>>> Spark may look at Gearpump, which is also built on top
>>>>> of
>>>>> >> >>> >>> >>>> Akka?
>>>>> >> >>> >>> >>>> I
>>>>> >> >>> >>> >>>> don't
>>>>> >> >>> >>> >>>> know yet, it is good topic for SIP. However I think
>>>>> that
>>>>> >> >>> >>> >>>> Spark
>>>>> >> >>> >>> >>>> should
>>>>> >> >>> >>> >>>> have real-time streaming support. Currently I see many
>>>>> >> >>> >>> >>>> posts/comments
>>>>> >> >>> >>> >>>> that "Spark has too big latency". Spark Streaming is
>>>>> doing
>>>>> >> >>> >>> >>>> very
>>>>> >> >>> >>> >>>> good
>>>>> >> >>> >>> >>>> jobs with micro-batches, however I think it is
>>>>> possible to
>>>>> >> >>> >>> >>>> add
>>>>> >> >>> >>> >>>> also
>>>>> >> >>> >>> >>>> more
>>>>> >> >>> >>> >>>> real-time processing.
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>> Other people said much more and I agree with proposal
>>>>> of SIP.
>>>>> >> >>> >>> >>>> I'm
>>>>> >> >>> >>> >>>> also
>>>>> >> >>> >>> >>>> happy that PMC's are not saying that they will not
>>>>> listen to
>>>>> >> >>> >>> >>>> users,
>>>>> >> >>> >>> >>>> but
>>>>> >> >>> >>> >>>> they really want to make Spark better for every user.
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>> What do you think about these two topics? Especially
>>>>> I'm
>>>>> >> >>> >>> >>>> looking
>>>>> >> >>> >>> >>>> at
>>>>> >> >>> >>> >>>> Cody
>>>>> >> >>> >>> >>>> (who has started this topic) and PMCs :)
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>> Pozdrawiam / Best regards,
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>> Tomasz
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>> >>>>
>>>>> >> >>> >>>
>>>>> >> >>> >>
>>>>> >> >>> >
>>>>> >> >>> >
>>>>> >> >
>>>>> >> >
>>>>> >>
>>>>> >> ------------------------------------------------------------
>>>>> ---------
>>>>> >> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>> >>
>>>>> >
>>>>> >
>>>>> >
>>>>> > --
>>>>> > Ryan Blue
>>>>> > Software Engineer
>>>>> > Netflix
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Joseph Bradley
>>>>
>>>> Software Engineer - Machine Learning
>>>>
>>>> Databricks, Inc.
>>>>
>>>> [image: http://databricks.com] <http://databricks.com/>
>>>>
>>>
>>>
>>
>

Re: Spark Improvement Proposals

Posted by Tim Hunter <ti...@databricks.com>.

Hi Cody,
thank you for bringing up this topic, I agree it is very important to keep
a cohesive community around some common, fluid goals. Here are a few
comments about the current document:

1. name: it should not overlap with an existing one such as SIP. Can you
imagine someone trying to discuss a scala spore proposal for spark?
"[Spark] SIP-3 is intended to evolve in tandem with [Scala] SIP-21". SPIP
sounds great.

2. roles: at a high level, SPIPs are meant to reach consensus for technical
decisions with a lasting impact. As such, the template should emphasize the
role of the various parties during this process:

 - the SPIP author is responsible for building consensus. She is the
champion driving the process forward and is responsible for ensuring that
the SPIP follows the general guidelines. The author should be identified in
the SPIP. The authorship of a SPIP can be transferred if the current author
is not interested and someone else wants to move the SPIP forward. There
should probably be 2-3 authors at most for each SPIP.

 - someone with voting power should probably shepherd the SPIP (and be
recorded as such): ensuring that the final decision over the SPIP is
recorded (rejected, accepted, etc.), and advising about the technical
quality of the SPIP: this person need not be a champion for the SPIP or
contribute to it, but rather makes sure it stands a chance of being
approved when the vote happens. Also, if the author cannot find anyone who
would want to take this role, this proposal is likely to be rejected anyway.

 - users, committers, contributors have the roles already outlined in the
document

3. timeline: ideally, once a SPIP has been offered for voting, it should
move swiftly into either being accepted or rejected, so that we do not end
up with a distracting long tail of half-hearted proposals.

These rules are meant to be flexible, but the current document should be
clear about who is in charge of a SPIP, and the state it is currently in.

We have had long discussions over some very important questions such as
approval. I do not have an opinion on these, but why not make a pick and
reevaluate this decision later? This is not a binding process at this point.

Tim


On Tue, Jan 3, 2017 at 3:16 PM, Cody Koeninger <co...@koeninger.org> wrote:

> I don't have a concern about voting vs consensus.
>
> I have a concern that whatever the decision making process is, it is
> explicitly announced on the ticket for the given proposal, with an explicit
> deadline, and an explicit outcome.
>
>
> On Tue, Jan 3, 2017 at 4:08 PM, Imran Rashid <ir...@cloudera.com> wrote:
>
>> I'm also in favor of this.  Thanks for your persistence Cody.
>>
>> My take on the specific issues Joseph mentioned:
>>
>> 1) voting vs. consensus -- I agree with the argument Ryan Blue made
>> earlier for consensus:
>>
>> > Majority vs consensus: My rationale is that I don't think we want to
>> consider a proposal approved if it had objections serious enough that
>> committers down-voted (or PMC depending on who gets a vote). If these
>> proposals are like PEPs, then they represent a significant amount of
>> community effort and I wouldn't want to move forward if up to half of the
>> community thinks it's an untenable idea.
>>
>> 2) Design doc template -- agree this would be useful, but also seems
>> totally orthogonal to moving forward on the SIP proposal.
>>
>> 3) agree w/ Joseph's proposal for updating the template.
>>
>> One small addition:
>>
>> 4) Deciding on a name -- minor, but I think its wroth disambiguating from
>> Scala's SIPs, and the best proposal I've heard is "SPIP".   At least, no
>> one has objected.  (don't care enough that I'd object to anything else,
>> though.)
>>
>>
>> On Tue, Jan 3, 2017 at 3:30 PM, Joseph Bradley <jo...@databricks.com>
>> wrote:
>>
>>> Hi Cody,
>>>
>>> Thanks for being persistent about this.  I too would like to see this
>>> happen.  Reviewing the thread, it sounds like the main things remaining are:
>>> * Decide about a few issues
>>> * Finalize the doc(s)
>>> * Vote on this proposal
>>>
>>> Issues & TODOs:
>>>
>>> (1) The main issue I see above is voting vs. consensus.  I have little
>>> preference here.  It sounds like something which could be tailored based on
>>> whether we see too many or too few SIPs being approved.
>>>
>>> (2) Design doc template  (This would be great to have for Spark
>>> regardless of this SIP discussion.)
>>> * Reynold, are you still putting this together?
>>>
>>> (3) Template cleanups.  Listing some items mentioned above + a new one
>>> w.r.t. Reynold's draft
>>> <https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#>
>>> :
>>> * Reinstate the "Where" section with links to current and past SIPs
>>> * Add field for stating explicit deadlines for approval
>>> * Add field for stating Author & Committer shepherd
>>>
>>> Thanks all!
>>> Joseph
>>>
>>> On Mon, Jan 2, 2017 at 7:45 AM, Cody Koeninger <co...@koeninger.org>
>>> wrote:
>>>
>>>> I'm bumping this one more time for the new year, and then I'm giving up.
>>>>
>>>> Please, fix your process, even if it isn't exactly the way I suggested.
>>>>
>>>> On Tue, Nov 8, 2016 at 11:14 AM, Ryan Blue <rb...@netflix.com> wrote:
>>>> > On lazy consensus as opposed to voting:
>>>> >
>>>> > First, why lazy consensus? The proposal was for consensus, which is
>>>> at least
>>>> > three +1 votes and no vetos. Consensus has no losing side, it requires
>>>> > getting to a point where there is agreement. Isn't that agreement
>>>> what we
>>>> > want to achieve with these proposals?
>>>> >
>>>> > Second, lazy consensus only removes the requirement for three +1
>>>> votes. Why
>>>> > would we not want at least three committers to think something is a
>>>> good
>>>> > idea before adopting the proposal?
>>>> >
>>>> > rb
>>>> >
>>>> > On Tue, Nov 8, 2016 at 8:13 AM, Cody Koeninger <co...@koeninger.org>
>>>> wrote:
>>>> >>
>>>> >> So there are some minor things (the Where section heading appears to
>>>> >> be dropped; wherever this document is posted it needs to actually
>>>> link
>>>> >> to a jira filter showing current / past SIPs) but it doesn't look
>>>> like
>>>> >> I can comment on the google doc.
>>>> >>
>>>> >> The major substantive issue that I have is that this version is
>>>> >> significantly less clear as to the outcome of an SIP.
>>>> >>
>>>> >> The apache example of lazy consensus at
>>>> >> http://apache.org/foundation/voting.html#LazyConsensus involves an
>>>> >> explicit announcement of an explicit deadline, which I think are
>>>> >> necessary for clarity.
>>>> >>
>>>> >>
>>>> >>
>>>> >> On Mon, Nov 7, 2016 at 1:55 PM, Reynold Xin <rx...@databricks.com>
>>>> wrote:
>>>> >> > It turned out suggested edits (trackable) don't show up for
>>>> non-owners,
>>>> >> > so
>>>> >> > I've just merged all the edits in place. It should be visible now.
>>>> >> >
>>>> >> > On Mon, Nov 7, 2016 at 10:10 AM, Reynold Xin <rx...@databricks.com>
>>>> >> > wrote:
>>>> >> >>
>>>> >> >> Oops. Let me try figure that out.
>>>> >> >>
>>>> >> >>
>>>> >> >> On Monday, November 7, 2016, Cody Koeninger <co...@koeninger.org>
>>>> wrote:
>>>> >> >>>
>>>> >> >>> Thanks for picking up on this.
>>>> >> >>>
>>>> >> >>> Maybe I fail at google docs, but I can't see any edits on the
>>>> document
>>>> >> >>> you linked.
>>>> >> >>>
>>>> >> >>> Regarding lazy consensus, if the board in general has less of an
>>>> issue
>>>> >> >>> with that, sure.  As long as it is clearly announced, lasts at
>>>> least
>>>> >> >>> 72 hours, and has a clear outcome.
>>>> >> >>>
>>>> >> >>> The other points are hard to comment on without being able to
>>>> see the
>>>> >> >>> text in question.
>>>> >> >>>
>>>> >> >>>
>>>> >> >>> On Mon, Nov 7, 2016 at 3:11 AM, Reynold Xin <rxin@databricks.com
>>>> >
>>>> >> >>> wrote:
>>>> >> >>> > I just looked through the entire thread again tonight - there
>>>> are a
>>>> >> >>> > lot
>>>> >> >>> > of
>>>> >> >>> > great ideas being discussed. Thanks Cody for taking the first
>>>> crack
>>>> >> >>> > at
>>>> >> >>> > the
>>>> >> >>> > proposal.
>>>> >> >>> >
>>>> >> >>> > I want to first comment on the context. Spark is one of the
>>>> most
>>>> >> >>> > innovative
>>>> >> >>> > and important projects in (big) data -- overall technical
>>>> decisions
>>>> >> >>> > made in
>>>> >> >>> > Apache Spark are sound. But of course, a project as large and
>>>> active
>>>> >> >>> > as
>>>> >> >>> > Spark always have room for improvement, and we as a community
>>>> should
>>>> >> >>> > strive
>>>> >> >>> > to take it to the next level.
>>>> >> >>> >
>>>> >> >>> > To that end, the two biggest areas for improvements in my
>>>> opinion
>>>> >> >>> > are:
>>>> >> >>> >
>>>> >> >>> > 1. Visibility: There are so much happening that it is
>>>> difficult to
>>>> >> >>> > know
>>>> >> >>> > what
>>>> >> >>> > really is going on. For people that don't follow closely, it is
>>>> >> >>> > difficult to
>>>> >> >>> > know what the important initiatives are. Even for people that
>>>> do
>>>> >> >>> > follow, it
>>>> >> >>> > is difficult to know what specific things require their
>>>> attention,
>>>> >> >>> > since the
>>>> >> >>> > number of pull requests and JIRA tickets are high and it's
>>>> difficult
>>>> >> >>> > to
>>>> >> >>> > extract signal from noise.
>>>> >> >>> >
>>>> >> >>> > 2. Solicit user (broadly defined, including developers
>>>> themselves)
>>>> >> >>> > input
>>>> >> >>> > more proactively: At the end of the day the project provides
>>>> value
>>>> >> >>> > because
>>>> >> >>> > users use it. Users can't tell us exactly what to build, but
>>>> it is
>>>> >> >>> > important
>>>> >> >>> > to get their inputs.
>>>> >> >>> >
>>>> >> >>> >
>>>> >> >>> > I've taken Cody's doc and edited it:
>>>> >> >>> >
>>>> >> >>> >
>>>> >> >>> > https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-
>>>> nRanvXmnZ7SUi4qMljg/edit#heading=h.36ut37zh7w2b
>>>> >> >>> > (I've made all my modifications trackable)
>>>> >> >>> >
>>>> >> >>> > There are couple high level changes I made:
>>>> >> >>> >
>>>> >> >>> > 1. I've consulted a board member and he recommended lazy
>>>> consensus
>>>> >> >>> > as
>>>> >> >>> > opposed to voting. The reason being in voting there can easily
>>>> be a
>>>> >> >>> > "loser'
>>>> >> >>> > that gets outvoted.
>>>> >> >>> >
>>>> >> >>> > 2. I made it lighter weight, and renamed "strategy" to
>>>> "optional
>>>> >> >>> > design
>>>> >> >>> > sketch". Echoing one of the earlier email: "IMHO so far aside
>>>> from
>>>> >> >>> > tagging
>>>> >> >>> > things and linking them elsewhere simply having design docs and
>>>> >> >>> > prototypes
>>>> >> >>> > implementations in PRs is not something that has not worked so
>>>> far".
>>>> >> >>> >
>>>> >> >>> > 3. I made some the language tweaks to focus more on
>>>> visibility. For
>>>> >> >>> > example,
>>>> >> >>> > "The purpose of an SIP is to inform and involve", rather than
>>>> just
>>>> >> >>> > "involve". SIPs should also have at least two emails that go to
>>>> >> >>> > dev@.
>>>> >> >>> >
>>>> >> >>> >
>>>> >> >>> > While I was editing this, I thought we really needed a
>>>> suggested
>>>> >> >>> > template
>>>> >> >>> > for design doc too. I will get to that too ...
>>>> >> >>> >
>>>> >> >>> >
>>>> >> >>> > On Tue, Nov 1, 2016 at 12:09 AM, Reynold Xin <
>>>> rxin@databricks.com>
>>>> >> >>> > wrote:
>>>> >> >>> >>
>>>> >> >>> >> Most things looked OK to me too, although I do plan to take a
>>>> >> >>> >> closer
>>>> >> >>> >> look
>>>> >> >>> >> after Nov 1st when we cut the release branch for 2.1.
>>>> >> >>> >>
>>>> >> >>> >>
>>>> >> >>> >> On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin
>>>> >> >>> >> <va...@cloudera.com>
>>>> >> >>> >> wrote:
>>>> >> >>> >>>
>>>> >> >>> >>> The proposal looks OK to me. I assume, even though it's not
>>>> >> >>> >>> explicitly
>>>> >> >>> >>> called, that voting would happen by e-mail? A template for
>>>> the
>>>> >> >>> >>> proposal document (instead of just a bullet nice) would also
>>>> be
>>>> >> >>> >>> nice,
>>>> >> >>> >>> but that can be done at any time.
>>>> >> >>> >>>
>>>> >> >>> >>> BTW, shameless plug: I filed SPARK-18085 which I consider a
>>>> >> >>> >>> candidate
>>>> >> >>> >>> for a SIP, given the scope of the work. The document
>>>> attached even
>>>> >> >>> >>> somewhat matches the proposed format. So if anyone wants to
>>>> try
>>>> >> >>> >>> out
>>>> >> >>> >>> the process...
>>>> >> >>> >>>
>>>> >> >>> >>> On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger
>>>> >> >>> >>> <co...@koeninger.org>
>>>> >> >>> >>> wrote:
>>>> >> >>> >>> > Now that spark summit europe is over, are any committers
>>>> >> >>> >>> > interested
>>>> >> >>> >>> > in
>>>> >> >>> >>> > moving forward with this?
>>>> >> >>> >>> >
>>>> >> >>> >>> >
>>>> >> >>> >>> >
>>>> >> >>> >>> >
>>>> >> >>> >>> > https://github.com/koeninger/s
>>>> park-1/blob/SIP-0/docs/spark-improvement-proposals.md
>>>> >> >>> >>> >
>>>> >> >>> >>> > Or are we going to let this discussion die on the vine?
>>>> >> >>> >>> >
>>>> >> >>> >>> > On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda
>>>> >> >>> >>> > <to...@outlook.com> wrote:
>>>> >> >>> >>> >> Maybe my mail was not clear enough.
>>>> >> >>> >>> >>
>>>> >> >>> >>> >>
>>>> >> >>> >>> >> I didn't want to write "lets focus on Flink" or any other
>>>> >> >>> >>> >> framework.
>>>> >> >>> >>> >> The
>>>> >> >>> >>> >> idea with benchmarks was to show two things:
>>>> >> >>> >>> >>
>>>> >> >>> >>> >> - why some people are doing bad PR for Spark
>>>> >> >>> >>> >>
>>>> >> >>> >>> >> - how - in easy way - we can change it and show that
>>>> Spark is
>>>> >> >>> >>> >> still on
>>>> >> >>> >>> >> the
>>>> >> >>> >>> >> top
>>>> >> >>> >>> >>
>>>> >> >>> >>> >>
>>>> >> >>> >>> >> No more, no less. Benchmarks will be helpful, but I don't
>>>> think
>>>> >> >>> >>> >> they're the
>>>> >> >>> >>> >> most important thing in Spark :) On the Spark main page
>>>> there
>>>> >> >>> >>> >> is
>>>> >> >>> >>> >> still
>>>> >> >>> >>> >> chart
>>>> >> >>> >>> >> "Spark vs Hadoop". It is important to show that framework
>>>> is
>>>> >> >>> >>> >> not
>>>> >> >>> >>> >> the
>>>> >> >>> >>> >> same
>>>> >> >>> >>> >> Spark with other API, but much faster and optimized,
>>>> comparable
>>>> >> >>> >>> >> or
>>>> >> >>> >>> >> even
>>>> >> >>> >>> >> faster than other frameworks.
>>>> >> >>> >>> >>
>>>> >> >>> >>> >>
>>>> >> >>> >>> >> About real-time streaming, I think it would be just good
>>>> to see
>>>> >> >>> >>> >> it
>>>> >> >>> >>> >> in
>>>> >> >>> >>> >> Spark.
>>>> >> >>> >>> >> I very like current Spark model, but many voices that
>>>> says "we
>>>> >> >>> >>> >> need
>>>> >> >>> >>> >> more" -
>>>> >> >>> >>> >> community should listen also them and try to help them.
>>>> With
>>>> >> >>> >>> >> SIPs
>>>> >> >>> >>> >> it
>>>> >> >>> >>> >> would
>>>> >> >>> >>> >> be easier, I've just posted this example as "thing that
>>>> may be
>>>> >> >>> >>> >> changed
>>>> >> >>> >>> >> with
>>>> >> >>> >>> >> SIP".
>>>> >> >>> >>> >>
>>>> >> >>> >>> >>
>>>> >> >>> >>> >> I very like unification via Datasets, but there is a lot
>>>> of
>>>> >> >>> >>> >> algorithms
>>>> >> >>> >>> >> inside - let's make easy API, but with strong background
>>>> >> >>> >>> >> (articles,
>>>> >> >>> >>> >> benchmarks, descriptions, etc) that shows that Spark is
>>>> still
>>>> >> >>> >>> >> modern
>>>> >> >>> >>> >> framework.
>>>> >> >>> >>> >>
>>>> >> >>> >>> >>
>>>> >> >>> >>> >> Maybe now my intention will be clearer :) As I said
>>>> >> >>> >>> >> organizational
>>>> >> >>> >>> >> ideas
>>>> >> >>> >>> >> were already mentioned and I agree with them, my mail was
>>>> just
>>>> >> >>> >>> >> to
>>>> >> >>> >>> >> show
>>>> >> >>> >>> >> some
>>>> >> >>> >>> >> aspects from my side, so from theside of developer and
>>>> person
>>>> >> >>> >>> >> who
>>>> >> >>> >>> >> is
>>>> >> >>> >>> >> trying
>>>> >> >>> >>> >> to help others with Spark (via StackOverflow or other
>>>> ways)
>>>> >> >>> >>> >>
>>>> >> >>> >>> >>
>>>> >> >>> >>> >> Pozdrawiam / Best regards,
>>>> >> >>> >>> >>
>>>> >> >>> >>> >> Tomasz
>>>> >> >>> >>> >>
>>>> >> >>> >>> >>
>>>> >> >>> >>> >> ________________________________
>>>> >> >>> >>> >> Od: Cody Koeninger <co...@koeninger.org>
>>>> >> >>> >>> >> Wysłane: 17 października 2016 16:46
>>>> >> >>> >>> >> Do: Debasish Das
>>>> >> >>> >>> >> DW: Tomasz Gawęda; dev@spark.apache.org
>>>> >> >>> >>> >> Temat: Re: Spark Improvement Proposals
>>>> >> >>> >>> >>
>>>> >> >>> >>> >> I think narrowly focusing on Flink or benchmarks is
>>>> missing my
>>>> >> >>> >>> >> point.
>>>> >> >>> >>> >>
>>>> >> >>> >>> >> My point is evolve or die.  Spark's governance and
>>>> organization
>>>> >> >>> >>> >> is
>>>> >> >>> >>> >> hampering its ability to evolve technologically, and it
>>>> needs
>>>> >> >>> >>> >> to
>>>> >> >>> >>> >> change.
>>>> >> >>> >>> >>
>>>> >> >>> >>> >> On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das
>>>> >> >>> >>> >> <de...@gmail.com>
>>>> >> >>> >>> >> wrote:
>>>> >> >>> >>> >>> Thanks Cody for bringing up a valid point...I picked up
>>>> Spark
>>>> >> >>> >>> >>> in
>>>> >> >>> >>> >>> 2014
>>>> >> >>> >>> >>> as
>>>> >> >>> >>> >>> soon as I looked into it since compared to writing Java
>>>> >> >>> >>> >>> map-reduce
>>>> >> >>> >>> >>> and
>>>> >> >>> >>> >>> Cascading code, Spark made writing distributed code
>>>> fun...But
>>>> >> >>> >>> >>> now
>>>> >> >>> >>> >>> as
>>>> >> >>> >>> >>> we
>>>> >> >>> >>> >>> went
>>>> >> >>> >>> >>> deeper with Spark and real-time streaming use-case gets
>>>> more
>>>> >> >>> >>> >>> prominent, I
>>>> >> >>> >>> >>> think it is time to bring a messaging model in
>>>> conjunction
>>>> >> >>> >>> >>> with
>>>> >> >>> >>> >>> the
>>>> >> >>> >>> >>> batch/micro-batch API that Spark is good
>>>> at....akka-streams
>>>> >> >>> >>> >>> close
>>>> >> >>> >>> >>> integration with spark micro-batching APIs looks like a
>>>> great
>>>> >> >>> >>> >>> direction to
>>>> >> >>> >>> >>> stay in the game with Apache Flink...Spark 2.0 integrated
>>>> >> >>> >>> >>> streaming
>>>> >> >>> >>> >>> with
>>>> >> >>> >>> >>> batch with the assumption is that micro-batching is
>>>> sufficient
>>>> >> >>> >>> >>> to
>>>> >> >>> >>> >>> run
>>>> >> >>> >>> >>> SQL
>>>> >> >>> >>> >>> commands on stream but do we really have time to do SQL
>>>> >> >>> >>> >>> processing at
>>>> >> >>> >>> >>> streaming data within 1-2 seconds ?
>>>> >> >>> >>> >>>
>>>> >> >>> >>> >>> After reading the email chain, I started to look into
>>>> Flink
>>>> >> >>> >>> >>> documentation
>>>> >> >>> >>> >>> and if you compare it with Spark documentation, I think
>>>> we
>>>> >> >>> >>> >>> have
>>>> >> >>> >>> >>> major
>>>> >> >>> >>> >>> work
>>>> >> >>> >>> >>> to do detailing out Spark internals so that more people
>>>> from
>>>> >> >>> >>> >>> community
>>>> >> >>> >>> >>> start
>>>> >> >>> >>> >>> to take active role in improving the issues so that Spark
>>>> >> >>> >>> >>> stays
>>>> >> >>> >>> >>> strong
>>>> >> >>> >>> >>> compared to Flink.
>>>> >> >>> >>> >>>
>>>> >> >>> >>> >>>
>>>> >> >>> >>> >>> https://cwiki.apache.org/confl
>>>> uence/display/SPARK/Spark+Internals
>>>> >> >>> >>> >>>
>>>> >> >>> >>> >>>
>>>> >> >>> >>> >>> https://cwiki.apache.org/confl
>>>> uence/display/FLINK/Flink+Internals
>>>> >> >>> >>> >>>
>>>> >> >>> >>> >>> Spark is no longer an engine that works for micro-batch
>>>> and
>>>> >> >>> >>> >>> batch...We
>>>> >> >>> >>> >>> (and
>>>> >> >>> >>> >>> I am sure many others) are pushing spark as an engine for
>>>> >> >>> >>> >>> stream
>>>> >> >>> >>> >>> and
>>>> >> >>> >>> >>> query
>>>> >> >>> >>> >>> processing.....we need to make it a state-of-the-art
>>>> engine
>>>> >> >>> >>> >>> for
>>>> >> >>> >>> >>> high
>>>> >> >>> >>> >>> speed
>>>> >> >>> >>> >>> streaming data and user queries as well !
>>>> >> >>> >>> >>>
>>>> >> >>> >>> >>> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda
>>>> >> >>> >>> >>> <to...@outlook.com>
>>>> >> >>> >>> >>> wrote:
>>>> >> >>> >>> >>>>
>>>> >> >>> >>> >>>> Hi everyone,
>>>> >> >>> >>> >>>>
>>>> >> >>> >>> >>>> I'm quite late with my answer, but I think my
>>>> suggestions may
>>>> >> >>> >>> >>>> help a
>>>> >> >>> >>> >>>> little bit. :) Many technical and organizational topics
>>>> were
>>>> >> >>> >>> >>>> mentioned,
>>>> >> >>> >>> >>>> but I want to focus on these negative posts about Spark
>>>> and
>>>> >> >>> >>> >>>> about
>>>> >> >>> >>> >>>> "haters"
>>>> >> >>> >>> >>>>
>>>> >> >>> >>> >>>> I really like Spark. Easy of use, speed, very good
>>>> community
>>>> >> >>> >>> >>>> -
>>>> >> >>> >>> >>>> it's
>>>> >> >>> >>> >>>> everything here. But Every project has to "flight" on
>>>> >> >>> >>> >>>> "framework
>>>> >> >>> >>> >>>> market"
>>>> >> >>> >>> >>>> to be still no 1. I'm following many Spark and Big Data
>>>> >> >>> >>> >>>> communities,
>>>> >> >>> >>> >>>> maybe my mail will inspire someone :)
>>>> >> >>> >>> >>>>
>>>> >> >>> >>> >>>> You (every Spark developer; so far I didn't have enough
>>>> time
>>>> >> >>> >>> >>>> to
>>>> >> >>> >>> >>>> join
>>>> >> >>> >>> >>>> contributing to Spark) has done excellent job. So why
>>>> are
>>>> >> >>> >>> >>>> some
>>>> >> >>> >>> >>>> people
>>>> >> >>> >>> >>>> saying that Flink (or other framework) is better, like
>>>> it was
>>>> >> >>> >>> >>>> posted
>>>> >> >>> >>> >>>> in
>>>> >> >>> >>> >>>> this mailing list? No, not because that framework is
>>>> better
>>>> >> >>> >>> >>>> in
>>>> >> >>> >>> >>>> all
>>>> >> >>> >>> >>>> cases.. In my opinion, many of these discussions where
>>>> >> >>> >>> >>>> started
>>>> >> >>> >>> >>>> after
>>>> >> >>> >>> >>>> Flink marketing-like posts. Please look at StackOverflow
>>>> >> >>> >>> >>>> "Flink
>>>> >> >>> >>> >>>> vs
>>>> >> >>> >>> >>>> ...."
>>>> >> >>> >>> >>>> posts, almost every post in "winned" by Flink. Answers
>>>> are
>>>> >> >>> >>> >>>> sometimes
>>>> >> >>> >>> >>>> saying nothing about other frameworks, Flink's users
>>>> (often
>>>> >> >>> >>> >>>> PMC's)
>>>> >> >>> >>> >>>> are
>>>> >> >>> >>> >>>> just posting same information about real-time streaming,
>>>> >> >>> >>> >>>> about
>>>> >> >>> >>> >>>> delta
>>>> >> >>> >>> >>>> iterations, etc. It look smart and very often it is
>>>> marked as
>>>> >> >>> >>> >>>> an
>>>> >> >>> >>> >>>> aswer,
>>>> >> >>> >>> >>>> even if - in my opinion - there wasn't told all the
>>>> truth.
>>>> >> >>> >>> >>>>
>>>> >> >>> >>> >>>>
>>>> >> >>> >>> >>>> My suggestion: I don't have enough money and knowledgle
>>>> to
>>>> >> >>> >>> >>>> perform
>>>> >> >>> >>> >>>> huge
>>>> >> >>> >>> >>>> performance test. Maybe some company, that supports
>>>> Spark
>>>> >> >>> >>> >>>> (Databricks,
>>>> >> >>> >>> >>>> Cloudera? - just saying you're most visible in
>>>> community :) )
>>>> >> >>> >>> >>>> could
>>>> >> >>> >>> >>>> perform performance test of:
>>>> >> >>> >>> >>>>
>>>> >> >>> >>> >>>> - streaming engine - probably Spark will loose because
>>>> of
>>>> >> >>> >>> >>>> mini-batch
>>>> >> >>> >>> >>>> model, however currently the difference should be much
>>>> lower
>>>> >> >>> >>> >>>> that in
>>>> >> >>> >>> >>>> previous versions
>>>> >> >>> >>> >>>>
>>>> >> >>> >>> >>>> - Machine Learning models
>>>> >> >>> >>> >>>>
>>>> >> >>> >>> >>>> - batch jobs
>>>> >> >>> >>> >>>>
>>>> >> >>> >>> >>>> - Graph jobs
>>>> >> >>> >>> >>>>
>>>> >> >>> >>> >>>> - SQL queries
>>>> >> >>> >>> >>>>
>>>> >> >>> >>> >>>> People will see that Spark is envolving and is also a
>>>> modern
>>>> >> >>> >>> >>>> framework,
>>>> >> >>> >>> >>>> because after reading posts mentioned above people may
>>>> think
>>>> >> >>> >>> >>>> "it
>>>> >> >>> >>> >>>> is
>>>> >> >>> >>> >>>> outdated, future is in framework X".
>>>> >> >>> >>> >>>>
>>>> >> >>> >>> >>>> Matei Zaharia posted excellent blog post about how Spark
>>>> >> >>> >>> >>>> Structured
>>>> >> >>> >>> >>>> Streaming beats every other framework in terms of
>>>> easy-of-use
>>>> >> >>> >>> >>>> and
>>>> >> >>> >>> >>>> reliability. Performance tests, done in various
>>>> environments
>>>> >> >>> >>> >>>> (in
>>>> >> >>> >>> >>>> example: laptop, small 2 node cluster, 10-node cluster,
>>>> >> >>> >>> >>>> 20-node
>>>> >> >>> >>> >>>> cluster), could be also very good marketing stuff to say
>>>> >> >>> >>> >>>> "hey,
>>>> >> >>> >>> >>>> you're
>>>> >> >>> >>> >>>> telling that you're better, but Spark is still faster
>>>> and is
>>>> >> >>> >>> >>>> still
>>>> >> >>> >>> >>>> getting even more fast!". This would be based on facts
>>>> (just
>>>> >> >>> >>> >>>> numbers),
>>>> >> >>> >>> >>>> not opinions. It would be good for companies, for
>>>> marketing
>>>> >> >>> >>> >>>> puproses
>>>> >> >>> >>> >>>> and
>>>> >> >>> >>> >>>> for every Spark developer
>>>> >> >>> >>> >>>>
>>>> >> >>> >>> >>>>
>>>> >> >>> >>> >>>> Second: real-time streaming. I've written some time ago
>>>> about
>>>> >> >>> >>> >>>> real-time
>>>> >> >>> >>> >>>> streaming support in Spark Structured Streaming. Some
>>>> work
>>>> >> >>> >>> >>>> should be
>>>> >> >>> >>> >>>> done to make SSS more low-latency, but I think it's
>>>> possible.
>>>> >> >>> >>> >>>> Maybe
>>>> >> >>> >>> >>>> Spark may look at Gearpump, which is also built on top
>>>> of
>>>> >> >>> >>> >>>> Akka?
>>>> >> >>> >>> >>>> I
>>>> >> >>> >>> >>>> don't
>>>> >> >>> >>> >>>> know yet, it is good topic for SIP. However I think that
>>>> >> >>> >>> >>>> Spark
>>>> >> >>> >>> >>>> should
>>>> >> >>> >>> >>>> have real-time streaming support. Currently I see many
>>>> >> >>> >>> >>>> posts/comments
>>>> >> >>> >>> >>>> that "Spark has too big latency". Spark Streaming is
>>>> doing
>>>> >> >>> >>> >>>> very
>>>> >> >>> >>> >>>> good
>>>> >> >>> >>> >>>> jobs with micro-batches, however I think it is possible
>>>> to
>>>> >> >>> >>> >>>> add
>>>> >> >>> >>> >>>> also
>>>> >> >>> >>> >>>> more
>>>> >> >>> >>> >>>> real-time processing.
>>>> >> >>> >>> >>>>
>>>> >> >>> >>> >>>> Other people said much more and I agree with proposal
>>>> of SIP.
>>>> >> >>> >>> >>>> I'm
>>>> >> >>> >>> >>>> also
>>>> >> >>> >>> >>>> happy that PMC's are not saying that they will not
>>>> listen to
>>>> >> >>> >>> >>>> users,
>>>> >> >>> >>> >>>> but
>>>> >> >>> >>> >>>> they really want to make Spark better for every user.
>>>> >> >>> >>> >>>>
>>>> >> >>> >>> >>>>
>>>> >> >>> >>> >>>> What do you think about these two topics? Especially I'm
>>>> >> >>> >>> >>>> looking
>>>> >> >>> >>> >>>> at
>>>> >> >>> >>> >>>> Cody
>>>> >> >>> >>> >>>> (who has started this topic) and PMCs :)
>>>> >> >>> >>> >>>>
>>>> >> >>> >>> >>>> Pozdrawiam / Best regards,
>>>> >> >>> >>> >>>>
>>>> >> >>> >>> >>>> Tomasz
>>>> >> >>> >>> >>>>
>>>> >> >>> >>> >>>>
>>>> >> >>> >>>
>>>> >> >>> >>
>>>> >> >>> >
>>>> >> >>> >
>>>> >> >
>>>> >> >
>>>> >>
>>>> >> ------------------------------------------------------------
>>>> ---------
>>>> >> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>> >>
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> > Ryan Blue
>>>> > Software Engineer
>>>> > Netflix
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>
>>>>
>>>
>>>
>>> --
>>>
>>> Joseph Bradley
>>>
>>> Software Engineer - Machine Learning
>>>
>>> Databricks, Inc.
>>>
>>> [image: http://databricks.com] <http://databricks.com/>
>>>
>>
>>
>

Re: Spark Improvement Proposals

Posted by Cody Koeninger <co...@koeninger.org>.

I don't have a concern about voting vs consensus.

I have a concern that whatever the decision making process is, it is
explicitly announced on the ticket for the given proposal, with an explicit
deadline, and an explicit outcome.


On Tue, Jan 3, 2017 at 4:08 PM, Imran Rashid <ir...@cloudera.com> wrote:

> I'm also in favor of this.  Thanks for your persistence Cody.
>
> My take on the specific issues Joseph mentioned:
>
> 1) voting vs. consensus -- I agree with the argument Ryan Blue made
> earlier for consensus:
>
> > Majority vs consensus: My rationale is that I don't think we want to
> consider a proposal approved if it had objections serious enough that
> committers down-voted (or PMC depending on who gets a vote). If these
> proposals are like PEPs, then they represent a significant amount of
> community effort and I wouldn't want to move forward if up to half of the
> community thinks it's an untenable idea.
>
> 2) Design doc template -- agree this would be useful, but also seems
> totally orthogonal to moving forward on the SIP proposal.
>
> 3) agree w/ Joseph's proposal for updating the template.
>
> One small addition:
>
> 4) Deciding on a name -- minor, but I think its wroth disambiguating from
> Scala's SIPs, and the best proposal I've heard is "SPIP".   At least, no
> one has objected.  (don't care enough that I'd object to anything else,
> though.)
>
>
> On Tue, Jan 3, 2017 at 3:30 PM, Joseph Bradley <jo...@databricks.com>
> wrote:
>
>> Hi Cody,
>>
>> Thanks for being persistent about this.  I too would like to see this
>> happen.  Reviewing the thread, it sounds like the main things remaining are:
>> * Decide about a few issues
>> * Finalize the doc(s)
>> * Vote on this proposal
>>
>> Issues & TODOs:
>>
>> (1) The main issue I see above is voting vs. consensus.  I have little
>> preference here.  It sounds like something which could be tailored based on
>> whether we see too many or too few SIPs being approved.
>>
>> (2) Design doc template  (This would be great to have for Spark
>> regardless of this SIP discussion.)
>> * Reynold, are you still putting this together?
>>
>> (3) Template cleanups.  Listing some items mentioned above + a new one
>> w.r.t. Reynold's draft
>> <https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#>
>> :
>> * Reinstate the "Where" section with links to current and past SIPs
>> * Add field for stating explicit deadlines for approval
>> * Add field for stating Author & Committer shepherd
>>
>> Thanks all!
>> Joseph
>>
>> On Mon, Jan 2, 2017 at 7:45 AM, Cody Koeninger <co...@koeninger.org>
>> wrote:
>>
>>> I'm bumping this one more time for the new year, and then I'm giving up.
>>>
>>> Please, fix your process, even if it isn't exactly the way I suggested.
>>>
>>> On Tue, Nov 8, 2016 at 11:14 AM, Ryan Blue <rb...@netflix.com> wrote:
>>> > On lazy consensus as opposed to voting:
>>> >
>>> > First, why lazy consensus? The proposal was for consensus, which is at
>>> least
>>> > three +1 votes and no vetos. Consensus has no losing side, it requires
>>> > getting to a point where there is agreement. Isn't that agreement what
>>> we
>>> > want to achieve with these proposals?
>>> >
>>> > Second, lazy consensus only removes the requirement for three +1
>>> votes. Why
>>> > would we not want at least three committers to think something is a
>>> good
>>> > idea before adopting the proposal?
>>> >
>>> > rb
>>> >
>>> > On Tue, Nov 8, 2016 at 8:13 AM, Cody Koeninger <co...@koeninger.org>
>>> wrote:
>>> >>
>>> >> So there are some minor things (the Where section heading appears to
>>> >> be dropped; wherever this document is posted it needs to actually link
>>> >> to a jira filter showing current / past SIPs) but it doesn't look like
>>> >> I can comment on the google doc.
>>> >>
>>> >> The major substantive issue that I have is that this version is
>>> >> significantly less clear as to the outcome of an SIP.
>>> >>
>>> >> The apache example of lazy consensus at
>>> >> http://apache.org/foundation/voting.html#LazyConsensus involves an
>>> >> explicit announcement of an explicit deadline, which I think are
>>> >> necessary for clarity.
>>> >>
>>> >>
>>> >>
>>> >> On Mon, Nov 7, 2016 at 1:55 PM, Reynold Xin <rx...@databricks.com>
>>> wrote:
>>> >> > It turned out suggested edits (trackable) don't show up for
>>> non-owners,
>>> >> > so
>>> >> > I've just merged all the edits in place. It should be visible now.
>>> >> >
>>> >> > On Mon, Nov 7, 2016 at 10:10 AM, Reynold Xin <rx...@databricks.com>
>>> >> > wrote:
>>> >> >>
>>> >> >> Oops. Let me try figure that out.
>>> >> >>
>>> >> >>
>>> >> >> On Monday, November 7, 2016, Cody Koeninger <co...@koeninger.org>
>>> wrote:
>>> >> >>>
>>> >> >>> Thanks for picking up on this.
>>> >> >>>
>>> >> >>> Maybe I fail at google docs, but I can't see any edits on the
>>> document
>>> >> >>> you linked.
>>> >> >>>
>>> >> >>> Regarding lazy consensus, if the board in general has less of an
>>> issue
>>> >> >>> with that, sure.  As long as it is clearly announced, lasts at
>>> least
>>> >> >>> 72 hours, and has a clear outcome.
>>> >> >>>
>>> >> >>> The other points are hard to comment on without being able to see
>>> the
>>> >> >>> text in question.
>>> >> >>>
>>> >> >>>
>>> >> >>> On Mon, Nov 7, 2016 at 3:11 AM, Reynold Xin <rx...@databricks.com>
>>> >> >>> wrote:
>>> >> >>> > I just looked through the entire thread again tonight - there
>>> are a
>>> >> >>> > lot
>>> >> >>> > of
>>> >> >>> > great ideas being discussed. Thanks Cody for taking the first
>>> crack
>>> >> >>> > at
>>> >> >>> > the
>>> >> >>> > proposal.
>>> >> >>> >
>>> >> >>> > I want to first comment on the context. Spark is one of the most
>>> >> >>> > innovative
>>> >> >>> > and important projects in (big) data -- overall technical
>>> decisions
>>> >> >>> > made in
>>> >> >>> > Apache Spark are sound. But of course, a project as large and
>>> active
>>> >> >>> > as
>>> >> >>> > Spark always have room for improvement, and we as a community
>>> should
>>> >> >>> > strive
>>> >> >>> > to take it to the next level.
>>> >> >>> >
>>> >> >>> > To that end, the two biggest areas for improvements in my
>>> opinion
>>> >> >>> > are:
>>> >> >>> >
>>> >> >>> > 1. Visibility: There are so much happening that it is difficult
>>> to
>>> >> >>> > know
>>> >> >>> > what
>>> >> >>> > really is going on. For people that don't follow closely, it is
>>> >> >>> > difficult to
>>> >> >>> > know what the important initiatives are. Even for people that do
>>> >> >>> > follow, it
>>> >> >>> > is difficult to know what specific things require their
>>> attention,
>>> >> >>> > since the
>>> >> >>> > number of pull requests and JIRA tickets are high and it's
>>> difficult
>>> >> >>> > to
>>> >> >>> > extract signal from noise.
>>> >> >>> >
>>> >> >>> > 2. Solicit user (broadly defined, including developers
>>> themselves)
>>> >> >>> > input
>>> >> >>> > more proactively: At the end of the day the project provides
>>> value
>>> >> >>> > because
>>> >> >>> > users use it. Users can't tell us exactly what to build, but it
>>> is
>>> >> >>> > important
>>> >> >>> > to get their inputs.
>>> >> >>> >
>>> >> >>> >
>>> >> >>> > I've taken Cody's doc and edited it:
>>> >> >>> >
>>> >> >>> >
>>> >> >>> > https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-
>>> nRanvXmnZ7SUi4qMljg/edit#heading=h.36ut37zh7w2b
>>> >> >>> > (I've made all my modifications trackable)
>>> >> >>> >
>>> >> >>> > There are couple high level changes I made:
>>> >> >>> >
>>> >> >>> > 1. I've consulted a board member and he recommended lazy
>>> consensus
>>> >> >>> > as
>>> >> >>> > opposed to voting. The reason being in voting there can easily
>>> be a
>>> >> >>> > "loser'
>>> >> >>> > that gets outvoted.
>>> >> >>> >
>>> >> >>> > 2. I made it lighter weight, and renamed "strategy" to "optional
>>> >> >>> > design
>>> >> >>> > sketch". Echoing one of the earlier email: "IMHO so far aside
>>> from
>>> >> >>> > tagging
>>> >> >>> > things and linking them elsewhere simply having design docs and
>>> >> >>> > prototypes
>>> >> >>> > implementations in PRs is not something that has not worked so
>>> far".
>>> >> >>> >
>>> >> >>> > 3. I made some the language tweaks to focus more on visibility.
>>> For
>>> >> >>> > example,
>>> >> >>> > "The purpose of an SIP is to inform and involve", rather than
>>> just
>>> >> >>> > "involve". SIPs should also have at least two emails that go to
>>> >> >>> > dev@.
>>> >> >>> >
>>> >> >>> >
>>> >> >>> > While I was editing this, I thought we really needed a suggested
>>> >> >>> > template
>>> >> >>> > for design doc too. I will get to that too ...
>>> >> >>> >
>>> >> >>> >
>>> >> >>> > On Tue, Nov 1, 2016 at 12:09 AM, Reynold Xin <
>>> rxin@databricks.com>
>>> >> >>> > wrote:
>>> >> >>> >>
>>> >> >>> >> Most things looked OK to me too, although I do plan to take a
>>> >> >>> >> closer
>>> >> >>> >> look
>>> >> >>> >> after Nov 1st when we cut the release branch for 2.1.
>>> >> >>> >>
>>> >> >>> >>
>>> >> >>> >> On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin
>>> >> >>> >> <va...@cloudera.com>
>>> >> >>> >> wrote:
>>> >> >>> >>>
>>> >> >>> >>> The proposal looks OK to me. I assume, even though it's not
>>> >> >>> >>> explicitly
>>> >> >>> >>> called, that voting would happen by e-mail? A template for the
>>> >> >>> >>> proposal document (instead of just a bullet nice) would also
>>> be
>>> >> >>> >>> nice,
>>> >> >>> >>> but that can be done at any time.
>>> >> >>> >>>
>>> >> >>> >>> BTW, shameless plug: I filed SPARK-18085 which I consider a
>>> >> >>> >>> candidate
>>> >> >>> >>> for a SIP, given the scope of the work. The document attached
>>> even
>>> >> >>> >>> somewhat matches the proposed format. So if anyone wants to
>>> try
>>> >> >>> >>> out
>>> >> >>> >>> the process...
>>> >> >>> >>>
>>> >> >>> >>> On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger
>>> >> >>> >>> <co...@koeninger.org>
>>> >> >>> >>> wrote:
>>> >> >>> >>> > Now that spark summit europe is over, are any committers
>>> >> >>> >>> > interested
>>> >> >>> >>> > in
>>> >> >>> >>> > moving forward with this?
>>> >> >>> >>> >
>>> >> >>> >>> >
>>> >> >>> >>> >
>>> >> >>> >>> >
>>> >> >>> >>> > https://github.com/koeninger/s
>>> park-1/blob/SIP-0/docs/spark-improvement-proposals.md
>>> >> >>> >>> >
>>> >> >>> >>> > Or are we going to let this discussion die on the vine?
>>> >> >>> >>> >
>>> >> >>> >>> > On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda
>>> >> >>> >>> > <to...@outlook.com> wrote:
>>> >> >>> >>> >> Maybe my mail was not clear enough.
>>> >> >>> >>> >>
>>> >> >>> >>> >>
>>> >> >>> >>> >> I didn't want to write "lets focus on Flink" or any other
>>> >> >>> >>> >> framework.
>>> >> >>> >>> >> The
>>> >> >>> >>> >> idea with benchmarks was to show two things:
>>> >> >>> >>> >>
>>> >> >>> >>> >> - why some people are doing bad PR for Spark
>>> >> >>> >>> >>
>>> >> >>> >>> >> - how - in easy way - we can change it and show that Spark
>>> is
>>> >> >>> >>> >> still on
>>> >> >>> >>> >> the
>>> >> >>> >>> >> top
>>> >> >>> >>> >>
>>> >> >>> >>> >>
>>> >> >>> >>> >> No more, no less. Benchmarks will be helpful, but I don't
>>> think
>>> >> >>> >>> >> they're the
>>> >> >>> >>> >> most important thing in Spark :) On the Spark main page
>>> there
>>> >> >>> >>> >> is
>>> >> >>> >>> >> still
>>> >> >>> >>> >> chart
>>> >> >>> >>> >> "Spark vs Hadoop". It is important to show that framework
>>> is
>>> >> >>> >>> >> not
>>> >> >>> >>> >> the
>>> >> >>> >>> >> same
>>> >> >>> >>> >> Spark with other API, but much faster and optimized,
>>> comparable
>>> >> >>> >>> >> or
>>> >> >>> >>> >> even
>>> >> >>> >>> >> faster than other frameworks.
>>> >> >>> >>> >>
>>> >> >>> >>> >>
>>> >> >>> >>> >> About real-time streaming, I think it would be just good
>>> to see
>>> >> >>> >>> >> it
>>> >> >>> >>> >> in
>>> >> >>> >>> >> Spark.
>>> >> >>> >>> >> I very like current Spark model, but many voices that says
>>> "we
>>> >> >>> >>> >> need
>>> >> >>> >>> >> more" -
>>> >> >>> >>> >> community should listen also them and try to help them.
>>> With
>>> >> >>> >>> >> SIPs
>>> >> >>> >>> >> it
>>> >> >>> >>> >> would
>>> >> >>> >>> >> be easier, I've just posted this example as "thing that
>>> may be
>>> >> >>> >>> >> changed
>>> >> >>> >>> >> with
>>> >> >>> >>> >> SIP".
>>> >> >>> >>> >>
>>> >> >>> >>> >>
>>> >> >>> >>> >> I very like unification via Datasets, but there is a lot of
>>> >> >>> >>> >> algorithms
>>> >> >>> >>> >> inside - let's make easy API, but with strong background
>>> >> >>> >>> >> (articles,
>>> >> >>> >>> >> benchmarks, descriptions, etc) that shows that Spark is
>>> still
>>> >> >>> >>> >> modern
>>> >> >>> >>> >> framework.
>>> >> >>> >>> >>
>>> >> >>> >>> >>
>>> >> >>> >>> >> Maybe now my intention will be clearer :) As I said
>>> >> >>> >>> >> organizational
>>> >> >>> >>> >> ideas
>>> >> >>> >>> >> were already mentioned and I agree with them, my mail was
>>> just
>>> >> >>> >>> >> to
>>> >> >>> >>> >> show
>>> >> >>> >>> >> some
>>> >> >>> >>> >> aspects from my side, so from theside of developer and
>>> person
>>> >> >>> >>> >> who
>>> >> >>> >>> >> is
>>> >> >>> >>> >> trying
>>> >> >>> >>> >> to help others with Spark (via StackOverflow or other ways)
>>> >> >>> >>> >>
>>> >> >>> >>> >>
>>> >> >>> >>> >> Pozdrawiam / Best regards,
>>> >> >>> >>> >>
>>> >> >>> >>> >> Tomasz
>>> >> >>> >>> >>
>>> >> >>> >>> >>
>>> >> >>> >>> >> ________________________________
>>> >> >>> >>> >> Od: Cody Koeninger <co...@koeninger.org>
>>> >> >>> >>> >> Wysłane: 17 października 2016 16:46
>>> >> >>> >>> >> Do: Debasish Das
>>> >> >>> >>> >> DW: Tomasz Gawęda; dev@spark.apache.org
>>> >> >>> >>> >> Temat: Re: Spark Improvement Proposals
>>> >> >>> >>> >>
>>> >> >>> >>> >> I think narrowly focusing on Flink or benchmarks is
>>> missing my
>>> >> >>> >>> >> point.
>>> >> >>> >>> >>
>>> >> >>> >>> >> My point is evolve or die.  Spark's governance and
>>> organization
>>> >> >>> >>> >> is
>>> >> >>> >>> >> hampering its ability to evolve technologically, and it
>>> needs
>>> >> >>> >>> >> to
>>> >> >>> >>> >> change.
>>> >> >>> >>> >>
>>> >> >>> >>> >> On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das
>>> >> >>> >>> >> <de...@gmail.com>
>>> >> >>> >>> >> wrote:
>>> >> >>> >>> >>> Thanks Cody for bringing up a valid point...I picked up
>>> Spark
>>> >> >>> >>> >>> in
>>> >> >>> >>> >>> 2014
>>> >> >>> >>> >>> as
>>> >> >>> >>> >>> soon as I looked into it since compared to writing Java
>>> >> >>> >>> >>> map-reduce
>>> >> >>> >>> >>> and
>>> >> >>> >>> >>> Cascading code, Spark made writing distributed code
>>> fun...But
>>> >> >>> >>> >>> now
>>> >> >>> >>> >>> as
>>> >> >>> >>> >>> we
>>> >> >>> >>> >>> went
>>> >> >>> >>> >>> deeper with Spark and real-time streaming use-case gets
>>> more
>>> >> >>> >>> >>> prominent, I
>>> >> >>> >>> >>> think it is time to bring a messaging model in conjunction
>>> >> >>> >>> >>> with
>>> >> >>> >>> >>> the
>>> >> >>> >>> >>> batch/micro-batch API that Spark is good
>>> at....akka-streams
>>> >> >>> >>> >>> close
>>> >> >>> >>> >>> integration with spark micro-batching APIs looks like a
>>> great
>>> >> >>> >>> >>> direction to
>>> >> >>> >>> >>> stay in the game with Apache Flink...Spark 2.0 integrated
>>> >> >>> >>> >>> streaming
>>> >> >>> >>> >>> with
>>> >> >>> >>> >>> batch with the assumption is that micro-batching is
>>> sufficient
>>> >> >>> >>> >>> to
>>> >> >>> >>> >>> run
>>> >> >>> >>> >>> SQL
>>> >> >>> >>> >>> commands on stream but do we really have time to do SQL
>>> >> >>> >>> >>> processing at
>>> >> >>> >>> >>> streaming data within 1-2 seconds ?
>>> >> >>> >>> >>>
>>> >> >>> >>> >>> After reading the email chain, I started to look into
>>> Flink
>>> >> >>> >>> >>> documentation
>>> >> >>> >>> >>> and if you compare it with Spark documentation, I think we
>>> >> >>> >>> >>> have
>>> >> >>> >>> >>> major
>>> >> >>> >>> >>> work
>>> >> >>> >>> >>> to do detailing out Spark internals so that more people
>>> from
>>> >> >>> >>> >>> community
>>> >> >>> >>> >>> start
>>> >> >>> >>> >>> to take active role in improving the issues so that Spark
>>> >> >>> >>> >>> stays
>>> >> >>> >>> >>> strong
>>> >> >>> >>> >>> compared to Flink.
>>> >> >>> >>> >>>
>>> >> >>> >>> >>>
>>> >> >>> >>> >>> https://cwiki.apache.org/confl
>>> uence/display/SPARK/Spark+Internals
>>> >> >>> >>> >>>
>>> >> >>> >>> >>>
>>> >> >>> >>> >>> https://cwiki.apache.org/confl
>>> uence/display/FLINK/Flink+Internals
>>> >> >>> >>> >>>
>>> >> >>> >>> >>> Spark is no longer an engine that works for micro-batch
>>> and
>>> >> >>> >>> >>> batch...We
>>> >> >>> >>> >>> (and
>>> >> >>> >>> >>> I am sure many others) are pushing spark as an engine for
>>> >> >>> >>> >>> stream
>>> >> >>> >>> >>> and
>>> >> >>> >>> >>> query
>>> >> >>> >>> >>> processing.....we need to make it a state-of-the-art
>>> engine
>>> >> >>> >>> >>> for
>>> >> >>> >>> >>> high
>>> >> >>> >>> >>> speed
>>> >> >>> >>> >>> streaming data and user queries as well !
>>> >> >>> >>> >>>
>>> >> >>> >>> >>> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda
>>> >> >>> >>> >>> <to...@outlook.com>
>>> >> >>> >>> >>> wrote:
>>> >> >>> >>> >>>>
>>> >> >>> >>> >>>> Hi everyone,
>>> >> >>> >>> >>>>
>>> >> >>> >>> >>>> I'm quite late with my answer, but I think my
>>> suggestions may
>>> >> >>> >>> >>>> help a
>>> >> >>> >>> >>>> little bit. :) Many technical and organizational topics
>>> were
>>> >> >>> >>> >>>> mentioned,
>>> >> >>> >>> >>>> but I want to focus on these negative posts about Spark
>>> and
>>> >> >>> >>> >>>> about
>>> >> >>> >>> >>>> "haters"
>>> >> >>> >>> >>>>
>>> >> >>> >>> >>>> I really like Spark. Easy of use, speed, very good
>>> community
>>> >> >>> >>> >>>> -
>>> >> >>> >>> >>>> it's
>>> >> >>> >>> >>>> everything here. But Every project has to "flight" on
>>> >> >>> >>> >>>> "framework
>>> >> >>> >>> >>>> market"
>>> >> >>> >>> >>>> to be still no 1. I'm following many Spark and Big Data
>>> >> >>> >>> >>>> communities,
>>> >> >>> >>> >>>> maybe my mail will inspire someone :)
>>> >> >>> >>> >>>>
>>> >> >>> >>> >>>> You (every Spark developer; so far I didn't have enough
>>> time
>>> >> >>> >>> >>>> to
>>> >> >>> >>> >>>> join
>>> >> >>> >>> >>>> contributing to Spark) has done excellent job. So why are
>>> >> >>> >>> >>>> some
>>> >> >>> >>> >>>> people
>>> >> >>> >>> >>>> saying that Flink (or other framework) is better, like
>>> it was
>>> >> >>> >>> >>>> posted
>>> >> >>> >>> >>>> in
>>> >> >>> >>> >>>> this mailing list? No, not because that framework is
>>> better
>>> >> >>> >>> >>>> in
>>> >> >>> >>> >>>> all
>>> >> >>> >>> >>>> cases.. In my opinion, many of these discussions where
>>> >> >>> >>> >>>> started
>>> >> >>> >>> >>>> after
>>> >> >>> >>> >>>> Flink marketing-like posts. Please look at StackOverflow
>>> >> >>> >>> >>>> "Flink
>>> >> >>> >>> >>>> vs
>>> >> >>> >>> >>>> ...."
>>> >> >>> >>> >>>> posts, almost every post in "winned" by Flink. Answers
>>> are
>>> >> >>> >>> >>>> sometimes
>>> >> >>> >>> >>>> saying nothing about other frameworks, Flink's users
>>> (often
>>> >> >>> >>> >>>> PMC's)
>>> >> >>> >>> >>>> are
>>> >> >>> >>> >>>> just posting same information about real-time streaming,
>>> >> >>> >>> >>>> about
>>> >> >>> >>> >>>> delta
>>> >> >>> >>> >>>> iterations, etc. It look smart and very often it is
>>> marked as
>>> >> >>> >>> >>>> an
>>> >> >>> >>> >>>> aswer,
>>> >> >>> >>> >>>> even if - in my opinion - there wasn't told all the
>>> truth.
>>> >> >>> >>> >>>>
>>> >> >>> >>> >>>>
>>> >> >>> >>> >>>> My suggestion: I don't have enough money and knowledgle
>>> to
>>> >> >>> >>> >>>> perform
>>> >> >>> >>> >>>> huge
>>> >> >>> >>> >>>> performance test. Maybe some company, that supports Spark
>>> >> >>> >>> >>>> (Databricks,
>>> >> >>> >>> >>>> Cloudera? - just saying you're most visible in community
>>> :) )
>>> >> >>> >>> >>>> could
>>> >> >>> >>> >>>> perform performance test of:
>>> >> >>> >>> >>>>
>>> >> >>> >>> >>>> - streaming engine - probably Spark will loose because of
>>> >> >>> >>> >>>> mini-batch
>>> >> >>> >>> >>>> model, however currently the difference should be much
>>> lower
>>> >> >>> >>> >>>> that in
>>> >> >>> >>> >>>> previous versions
>>> >> >>> >>> >>>>
>>> >> >>> >>> >>>> - Machine Learning models
>>> >> >>> >>> >>>>
>>> >> >>> >>> >>>> - batch jobs
>>> >> >>> >>> >>>>
>>> >> >>> >>> >>>> - Graph jobs
>>> >> >>> >>> >>>>
>>> >> >>> >>> >>>> - SQL queries
>>> >> >>> >>> >>>>
>>> >> >>> >>> >>>> People will see that Spark is envolving and is also a
>>> modern
>>> >> >>> >>> >>>> framework,
>>> >> >>> >>> >>>> because after reading posts mentioned above people may
>>> think
>>> >> >>> >>> >>>> "it
>>> >> >>> >>> >>>> is
>>> >> >>> >>> >>>> outdated, future is in framework X".
>>> >> >>> >>> >>>>
>>> >> >>> >>> >>>> Matei Zaharia posted excellent blog post about how Spark
>>> >> >>> >>> >>>> Structured
>>> >> >>> >>> >>>> Streaming beats every other framework in terms of
>>> easy-of-use
>>> >> >>> >>> >>>> and
>>> >> >>> >>> >>>> reliability. Performance tests, done in various
>>> environments
>>> >> >>> >>> >>>> (in
>>> >> >>> >>> >>>> example: laptop, small 2 node cluster, 10-node cluster,
>>> >> >>> >>> >>>> 20-node
>>> >> >>> >>> >>>> cluster), could be also very good marketing stuff to say
>>> >> >>> >>> >>>> "hey,
>>> >> >>> >>> >>>> you're
>>> >> >>> >>> >>>> telling that you're better, but Spark is still faster
>>> and is
>>> >> >>> >>> >>>> still
>>> >> >>> >>> >>>> getting even more fast!". This would be based on facts
>>> (just
>>> >> >>> >>> >>>> numbers),
>>> >> >>> >>> >>>> not opinions. It would be good for companies, for
>>> marketing
>>> >> >>> >>> >>>> puproses
>>> >> >>> >>> >>>> and
>>> >> >>> >>> >>>> for every Spark developer
>>> >> >>> >>> >>>>
>>> >> >>> >>> >>>>
>>> >> >>> >>> >>>> Second: real-time streaming. I've written some time ago
>>> about
>>> >> >>> >>> >>>> real-time
>>> >> >>> >>> >>>> streaming support in Spark Structured Streaming. Some
>>> work
>>> >> >>> >>> >>>> should be
>>> >> >>> >>> >>>> done to make SSS more low-latency, but I think it's
>>> possible.
>>> >> >>> >>> >>>> Maybe
>>> >> >>> >>> >>>> Spark may look at Gearpump, which is also built on top of
>>> >> >>> >>> >>>> Akka?
>>> >> >>> >>> >>>> I
>>> >> >>> >>> >>>> don't
>>> >> >>> >>> >>>> know yet, it is good topic for SIP. However I think that
>>> >> >>> >>> >>>> Spark
>>> >> >>> >>> >>>> should
>>> >> >>> >>> >>>> have real-time streaming support. Currently I see many
>>> >> >>> >>> >>>> posts/comments
>>> >> >>> >>> >>>> that "Spark has too big latency". Spark Streaming is
>>> doing
>>> >> >>> >>> >>>> very
>>> >> >>> >>> >>>> good
>>> >> >>> >>> >>>> jobs with micro-batches, however I think it is possible
>>> to
>>> >> >>> >>> >>>> add
>>> >> >>> >>> >>>> also
>>> >> >>> >>> >>>> more
>>> >> >>> >>> >>>> real-time processing.
>>> >> >>> >>> >>>>
>>> >> >>> >>> >>>> Other people said much more and I agree with proposal of
>>> SIP.
>>> >> >>> >>> >>>> I'm
>>> >> >>> >>> >>>> also
>>> >> >>> >>> >>>> happy that PMC's are not saying that they will not
>>> listen to
>>> >> >>> >>> >>>> users,
>>> >> >>> >>> >>>> but
>>> >> >>> >>> >>>> they really want to make Spark better for every user.
>>> >> >>> >>> >>>>
>>> >> >>> >>> >>>>
>>> >> >>> >>> >>>> What do you think about these two topics? Especially I'm
>>> >> >>> >>> >>>> looking
>>> >> >>> >>> >>>> at
>>> >> >>> >>> >>>> Cody
>>> >> >>> >>> >>>> (who has started this topic) and PMCs :)
>>> >> >>> >>> >>>>
>>> >> >>> >>> >>>> Pozdrawiam / Best regards,
>>> >> >>> >>> >>>>
>>> >> >>> >>> >>>> Tomasz
>>> >> >>> >>> >>>>
>>> >> >>> >>> >>>>
>>> >> >>> >>>
>>> >> >>> >>
>>> >> >>> >
>>> >> >>> >
>>> >> >
>>> >> >
>>> >>
>>> >> ---------------------------------------------------------------------
>>> >> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>> >>
>>> >
>>> >
>>> >
>>> > --
>>> > Ryan Blue
>>> > Software Engineer
>>> > Netflix
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>
>>>
>>
>>
>> --
>>
>> Joseph Bradley
>>
>> Software Engineer - Machine Learning
>>
>> Databricks, Inc.
>>
>> [image: http://databricks.com] <http://databricks.com/>
>>
>
>

Re: Spark Improvement Proposals

Posted by Imran Rashid <ir...@cloudera.com>.

I'm also in favor of this.  Thanks for your persistence Cody.

My take on the specific issues Joseph mentioned:

1) voting vs. consensus -- I agree with the argument Ryan Blue made earlier
for consensus:

> Majority vs consensus: My rationale is that I don't think we want to
consider a proposal approved if it had objections serious enough that
committers down-voted (or PMC depending on who gets a vote). If these
proposals are like PEPs, then they represent a significant amount of
community effort and I wouldn't want to move forward if up to half of the
community thinks it's an untenable idea.

2) Design doc template -- agree this would be useful, but also seems
totally orthogonal to moving forward on the SIP proposal.

3) agree w/ Joseph's proposal for updating the template.

One small addition:

4) Deciding on a name -- minor, but I think its wroth disambiguating from
Scala's SIPs, and the best proposal I've heard is "SPIP".   At least, no
one has objected.  (don't care enough that I'd object to anything else,
though.)


On Tue, Jan 3, 2017 at 3:30 PM, Joseph Bradley <jo...@databricks.com>
wrote:

> Hi Cody,
>
> Thanks for being persistent about this.  I too would like to see this
> happen.  Reviewing the thread, it sounds like the main things remaining are:
> * Decide about a few issues
> * Finalize the doc(s)
> * Vote on this proposal
>
> Issues & TODOs:
>
> (1) The main issue I see above is voting vs. consensus.  I have little
> preference here.  It sounds like something which could be tailored based on
> whether we see too many or too few SIPs being approved.
>
> (2) Design doc template  (This would be great to have for Spark regardless
> of this SIP discussion.)
> * Reynold, are you still putting this together?
>
> (3) Template cleanups.  Listing some items mentioned above + a new one
> w.r.t. Reynold's draft
> <https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#>
> :
> * Reinstate the "Where" section with links to current and past SIPs
> * Add field for stating explicit deadlines for approval
> * Add field for stating Author & Committer shepherd
>
> Thanks all!
> Joseph
>
> On Mon, Jan 2, 2017 at 7:45 AM, Cody Koeninger <co...@koeninger.org> wrote:
>
>> I'm bumping this one more time for the new year, and then I'm giving up.
>>
>> Please, fix your process, even if it isn't exactly the way I suggested.
>>
>> On Tue, Nov 8, 2016 at 11:14 AM, Ryan Blue <rb...@netflix.com> wrote:
>> > On lazy consensus as opposed to voting:
>> >
>> > First, why lazy consensus? The proposal was for consensus, which is at
>> least
>> > three +1 votes and no vetos. Consensus has no losing side, it requires
>> > getting to a point where there is agreement. Isn't that agreement what
>> we
>> > want to achieve with these proposals?
>> >
>> > Second, lazy consensus only removes the requirement for three +1 votes.
>> Why
>> > would we not want at least three committers to think something is a good
>> > idea before adopting the proposal?
>> >
>> > rb
>> >
>> > On Tue, Nov 8, 2016 at 8:13 AM, Cody Koeninger <co...@koeninger.org>
>> wrote:
>> >>
>> >> So there are some minor things (the Where section heading appears to
>> >> be dropped; wherever this document is posted it needs to actually link
>> >> to a jira filter showing current / past SIPs) but it doesn't look like
>> >> I can comment on the google doc.
>> >>
>> >> The major substantive issue that I have is that this version is
>> >> significantly less clear as to the outcome of an SIP.
>> >>
>> >> The apache example of lazy consensus at
>> >> http://apache.org/foundation/voting.html#LazyConsensus involves an
>> >> explicit announcement of an explicit deadline, which I think are
>> >> necessary for clarity.
>> >>
>> >>
>> >>
>> >> On Mon, Nov 7, 2016 at 1:55 PM, Reynold Xin <rx...@databricks.com>
>> wrote:
>> >> > It turned out suggested edits (trackable) don't show up for
>> non-owners,
>> >> > so
>> >> > I've just merged all the edits in place. It should be visible now.
>> >> >
>> >> > On Mon, Nov 7, 2016 at 10:10 AM, Reynold Xin <rx...@databricks.com>
>> >> > wrote:
>> >> >>
>> >> >> Oops. Let me try figure that out.
>> >> >>
>> >> >>
>> >> >> On Monday, November 7, 2016, Cody Koeninger <co...@koeninger.org>
>> wrote:
>> >> >>>
>> >> >>> Thanks for picking up on this.
>> >> >>>
>> >> >>> Maybe I fail at google docs, but I can't see any edits on the
>> document
>> >> >>> you linked.
>> >> >>>
>> >> >>> Regarding lazy consensus, if the board in general has less of an
>> issue
>> >> >>> with that, sure.  As long as it is clearly announced, lasts at
>> least
>> >> >>> 72 hours, and has a clear outcome.
>> >> >>>
>> >> >>> The other points are hard to comment on without being able to see
>> the
>> >> >>> text in question.
>> >> >>>
>> >> >>>
>> >> >>> On Mon, Nov 7, 2016 at 3:11 AM, Reynold Xin <rx...@databricks.com>
>> >> >>> wrote:
>> >> >>> > I just looked through the entire thread again tonight - there
>> are a
>> >> >>> > lot
>> >> >>> > of
>> >> >>> > great ideas being discussed. Thanks Cody for taking the first
>> crack
>> >> >>> > at
>> >> >>> > the
>> >> >>> > proposal.
>> >> >>> >
>> >> >>> > I want to first comment on the context. Spark is one of the most
>> >> >>> > innovative
>> >> >>> > and important projects in (big) data -- overall technical
>> decisions
>> >> >>> > made in
>> >> >>> > Apache Spark are sound. But of course, a project as large and
>> active
>> >> >>> > as
>> >> >>> > Spark always have room for improvement, and we as a community
>> should
>> >> >>> > strive
>> >> >>> > to take it to the next level.
>> >> >>> >
>> >> >>> > To that end, the two biggest areas for improvements in my opinion
>> >> >>> > are:
>> >> >>> >
>> >> >>> > 1. Visibility: There are so much happening that it is difficult
>> to
>> >> >>> > know
>> >> >>> > what
>> >> >>> > really is going on. For people that don't follow closely, it is
>> >> >>> > difficult to
>> >> >>> > know what the important initiatives are. Even for people that do
>> >> >>> > follow, it
>> >> >>> > is difficult to know what specific things require their
>> attention,
>> >> >>> > since the
>> >> >>> > number of pull requests and JIRA tickets are high and it's
>> difficult
>> >> >>> > to
>> >> >>> > extract signal from noise.
>> >> >>> >
>> >> >>> > 2. Solicit user (broadly defined, including developers
>> themselves)
>> >> >>> > input
>> >> >>> > more proactively: At the end of the day the project provides
>> value
>> >> >>> > because
>> >> >>> > users use it. Users can't tell us exactly what to build, but it
>> is
>> >> >>> > important
>> >> >>> > to get their inputs.
>> >> >>> >
>> >> >>> >
>> >> >>> > I've taken Cody's doc and edited it:
>> >> >>> >
>> >> >>> >
>> >> >>> > https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-
>> nRanvXmnZ7SUi4qMljg/edit#heading=h.36ut37zh7w2b
>> >> >>> > (I've made all my modifications trackable)
>> >> >>> >
>> >> >>> > There are couple high level changes I made:
>> >> >>> >
>> >> >>> > 1. I've consulted a board member and he recommended lazy
>> consensus
>> >> >>> > as
>> >> >>> > opposed to voting. The reason being in voting there can easily
>> be a
>> >> >>> > "loser'
>> >> >>> > that gets outvoted.
>> >> >>> >
>> >> >>> > 2. I made it lighter weight, and renamed "strategy" to "optional
>> >> >>> > design
>> >> >>> > sketch". Echoing one of the earlier email: "IMHO so far aside
>> from
>> >> >>> > tagging
>> >> >>> > things and linking them elsewhere simply having design docs and
>> >> >>> > prototypes
>> >> >>> > implementations in PRs is not something that has not worked so
>> far".
>> >> >>> >
>> >> >>> > 3. I made some the language tweaks to focus more on visibility.
>> For
>> >> >>> > example,
>> >> >>> > "The purpose of an SIP is to inform and involve", rather than
>> just
>> >> >>> > "involve". SIPs should also have at least two emails that go to
>> >> >>> > dev@.
>> >> >>> >
>> >> >>> >
>> >> >>> > While I was editing this, I thought we really needed a suggested
>> >> >>> > template
>> >> >>> > for design doc too. I will get to that too ...
>> >> >>> >
>> >> >>> >
>> >> >>> > On Tue, Nov 1, 2016 at 12:09 AM, Reynold Xin <
>> rxin@databricks.com>
>> >> >>> > wrote:
>> >> >>> >>
>> >> >>> >> Most things looked OK to me too, although I do plan to take a
>> >> >>> >> closer
>> >> >>> >> look
>> >> >>> >> after Nov 1st when we cut the release branch for 2.1.
>> >> >>> >>
>> >> >>> >>
>> >> >>> >> On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin
>> >> >>> >> <va...@cloudera.com>
>> >> >>> >> wrote:
>> >> >>> >>>
>> >> >>> >>> The proposal looks OK to me. I assume, even though it's not
>> >> >>> >>> explicitly
>> >> >>> >>> called, that voting would happen by e-mail? A template for the
>> >> >>> >>> proposal document (instead of just a bullet nice) would also be
>> >> >>> >>> nice,
>> >> >>> >>> but that can be done at any time.
>> >> >>> >>>
>> >> >>> >>> BTW, shameless plug: I filed SPARK-18085 which I consider a
>> >> >>> >>> candidate
>> >> >>> >>> for a SIP, given the scope of the work. The document attached
>> even
>> >> >>> >>> somewhat matches the proposed format. So if anyone wants to try
>> >> >>> >>> out
>> >> >>> >>> the process...
>> >> >>> >>>
>> >> >>> >>> On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger
>> >> >>> >>> <co...@koeninger.org>
>> >> >>> >>> wrote:
>> >> >>> >>> > Now that spark summit europe is over, are any committers
>> >> >>> >>> > interested
>> >> >>> >>> > in
>> >> >>> >>> > moving forward with this?
>> >> >>> >>> >
>> >> >>> >>> >
>> >> >>> >>> >
>> >> >>> >>> >
>> >> >>> >>> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-i
>> mprovement-proposals.md
>> >> >>> >>> >
>> >> >>> >>> > Or are we going to let this discussion die on the vine?
>> >> >>> >>> >
>> >> >>> >>> > On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda
>> >> >>> >>> > <to...@outlook.com> wrote:
>> >> >>> >>> >> Maybe my mail was not clear enough.
>> >> >>> >>> >>
>> >> >>> >>> >>
>> >> >>> >>> >> I didn't want to write "lets focus on Flink" or any other
>> >> >>> >>> >> framework.
>> >> >>> >>> >> The
>> >> >>> >>> >> idea with benchmarks was to show two things:
>> >> >>> >>> >>
>> >> >>> >>> >> - why some people are doing bad PR for Spark
>> >> >>> >>> >>
>> >> >>> >>> >> - how - in easy way - we can change it and show that Spark
>> is
>> >> >>> >>> >> still on
>> >> >>> >>> >> the
>> >> >>> >>> >> top
>> >> >>> >>> >>
>> >> >>> >>> >>
>> >> >>> >>> >> No more, no less. Benchmarks will be helpful, but I don't
>> think
>> >> >>> >>> >> they're the
>> >> >>> >>> >> most important thing in Spark :) On the Spark main page
>> there
>> >> >>> >>> >> is
>> >> >>> >>> >> still
>> >> >>> >>> >> chart
>> >> >>> >>> >> "Spark vs Hadoop". It is important to show that framework is
>> >> >>> >>> >> not
>> >> >>> >>> >> the
>> >> >>> >>> >> same
>> >> >>> >>> >> Spark with other API, but much faster and optimized,
>> comparable
>> >> >>> >>> >> or
>> >> >>> >>> >> even
>> >> >>> >>> >> faster than other frameworks.
>> >> >>> >>> >>
>> >> >>> >>> >>
>> >> >>> >>> >> About real-time streaming, I think it would be just good to
>> see
>> >> >>> >>> >> it
>> >> >>> >>> >> in
>> >> >>> >>> >> Spark.
>> >> >>> >>> >> I very like current Spark model, but many voices that says
>> "we
>> >> >>> >>> >> need
>> >> >>> >>> >> more" -
>> >> >>> >>> >> community should listen also them and try to help them. With
>> >> >>> >>> >> SIPs
>> >> >>> >>> >> it
>> >> >>> >>> >> would
>> >> >>> >>> >> be easier, I've just posted this example as "thing that may
>> be
>> >> >>> >>> >> changed
>> >> >>> >>> >> with
>> >> >>> >>> >> SIP".
>> >> >>> >>> >>
>> >> >>> >>> >>
>> >> >>> >>> >> I very like unification via Datasets, but there is a lot of
>> >> >>> >>> >> algorithms
>> >> >>> >>> >> inside - let's make easy API, but with strong background
>> >> >>> >>> >> (articles,
>> >> >>> >>> >> benchmarks, descriptions, etc) that shows that Spark is
>> still
>> >> >>> >>> >> modern
>> >> >>> >>> >> framework.
>> >> >>> >>> >>
>> >> >>> >>> >>
>> >> >>> >>> >> Maybe now my intention will be clearer :) As I said
>> >> >>> >>> >> organizational
>> >> >>> >>> >> ideas
>> >> >>> >>> >> were already mentioned and I agree with them, my mail was
>> just
>> >> >>> >>> >> to
>> >> >>> >>> >> show
>> >> >>> >>> >> some
>> >> >>> >>> >> aspects from my side, so from theside of developer and
>> person
>> >> >>> >>> >> who
>> >> >>> >>> >> is
>> >> >>> >>> >> trying
>> >> >>> >>> >> to help others with Spark (via StackOverflow or other ways)
>> >> >>> >>> >>
>> >> >>> >>> >>
>> >> >>> >>> >> Pozdrawiam / Best regards,
>> >> >>> >>> >>
>> >> >>> >>> >> Tomasz
>> >> >>> >>> >>
>> >> >>> >>> >>
>> >> >>> >>> >> ________________________________
>> >> >>> >>> >> Od: Cody Koeninger <co...@koeninger.org>
>> >> >>> >>> >> Wysłane: 17 października 2016 16:46
>> >> >>> >>> >> Do: Debasish Das
>> >> >>> >>> >> DW: Tomasz Gawęda; dev@spark.apache.org
>> >> >>> >>> >> Temat: Re: Spark Improvement Proposals
>> >> >>> >>> >>
>> >> >>> >>> >> I think narrowly focusing on Flink or benchmarks is missing
>> my
>> >> >>> >>> >> point.
>> >> >>> >>> >>
>> >> >>> >>> >> My point is evolve or die.  Spark's governance and
>> organization
>> >> >>> >>> >> is
>> >> >>> >>> >> hampering its ability to evolve technologically, and it
>> needs
>> >> >>> >>> >> to
>> >> >>> >>> >> change.
>> >> >>> >>> >>
>> >> >>> >>> >> On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das
>> >> >>> >>> >> <de...@gmail.com>
>> >> >>> >>> >> wrote:
>> >> >>> >>> >>> Thanks Cody for bringing up a valid point...I picked up
>> Spark
>> >> >>> >>> >>> in
>> >> >>> >>> >>> 2014
>> >> >>> >>> >>> as
>> >> >>> >>> >>> soon as I looked into it since compared to writing Java
>> >> >>> >>> >>> map-reduce
>> >> >>> >>> >>> and
>> >> >>> >>> >>> Cascading code, Spark made writing distributed code
>> fun...But
>> >> >>> >>> >>> now
>> >> >>> >>> >>> as
>> >> >>> >>> >>> we
>> >> >>> >>> >>> went
>> >> >>> >>> >>> deeper with Spark and real-time streaming use-case gets
>> more
>> >> >>> >>> >>> prominent, I
>> >> >>> >>> >>> think it is time to bring a messaging model in conjunction
>> >> >>> >>> >>> with
>> >> >>> >>> >>> the
>> >> >>> >>> >>> batch/micro-batch API that Spark is good at....akka-streams
>> >> >>> >>> >>> close
>> >> >>> >>> >>> integration with spark micro-batching APIs looks like a
>> great
>> >> >>> >>> >>> direction to
>> >> >>> >>> >>> stay in the game with Apache Flink...Spark 2.0 integrated
>> >> >>> >>> >>> streaming
>> >> >>> >>> >>> with
>> >> >>> >>> >>> batch with the assumption is that micro-batching is
>> sufficient
>> >> >>> >>> >>> to
>> >> >>> >>> >>> run
>> >> >>> >>> >>> SQL
>> >> >>> >>> >>> commands on stream but do we really have time to do SQL
>> >> >>> >>> >>> processing at
>> >> >>> >>> >>> streaming data within 1-2 seconds ?
>> >> >>> >>> >>>
>> >> >>> >>> >>> After reading the email chain, I started to look into Flink
>> >> >>> >>> >>> documentation
>> >> >>> >>> >>> and if you compare it with Spark documentation, I think we
>> >> >>> >>> >>> have
>> >> >>> >>> >>> major
>> >> >>> >>> >>> work
>> >> >>> >>> >>> to do detailing out Spark internals so that more people
>> from
>> >> >>> >>> >>> community
>> >> >>> >>> >>> start
>> >> >>> >>> >>> to take active role in improving the issues so that Spark
>> >> >>> >>> >>> stays
>> >> >>> >>> >>> strong
>> >> >>> >>> >>> compared to Flink.
>> >> >>> >>> >>>
>> >> >>> >>> >>>
>> >> >>> >>> >>> https://cwiki.apache.org/confluence/display/SPARK/Spark+
>> Internals
>> >> >>> >>> >>>
>> >> >>> >>> >>>
>> >> >>> >>> >>> https://cwiki.apache.org/confluence/display/FLINK/Flink+
>> Internals
>> >> >>> >>> >>>
>> >> >>> >>> >>> Spark is no longer an engine that works for micro-batch and
>> >> >>> >>> >>> batch...We
>> >> >>> >>> >>> (and
>> >> >>> >>> >>> I am sure many others) are pushing spark as an engine for
>> >> >>> >>> >>> stream
>> >> >>> >>> >>> and
>> >> >>> >>> >>> query
>> >> >>> >>> >>> processing.....we need to make it a state-of-the-art engine
>> >> >>> >>> >>> for
>> >> >>> >>> >>> high
>> >> >>> >>> >>> speed
>> >> >>> >>> >>> streaming data and user queries as well !
>> >> >>> >>> >>>
>> >> >>> >>> >>> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda
>> >> >>> >>> >>> <to...@outlook.com>
>> >> >>> >>> >>> wrote:
>> >> >>> >>> >>>>
>> >> >>> >>> >>>> Hi everyone,
>> >> >>> >>> >>>>
>> >> >>> >>> >>>> I'm quite late with my answer, but I think my suggestions
>> may
>> >> >>> >>> >>>> help a
>> >> >>> >>> >>>> little bit. :) Many technical and organizational topics
>> were
>> >> >>> >>> >>>> mentioned,
>> >> >>> >>> >>>> but I want to focus on these negative posts about Spark
>> and
>> >> >>> >>> >>>> about
>> >> >>> >>> >>>> "haters"
>> >> >>> >>> >>>>
>> >> >>> >>> >>>> I really like Spark. Easy of use, speed, very good
>> community
>> >> >>> >>> >>>> -
>> >> >>> >>> >>>> it's
>> >> >>> >>> >>>> everything here. But Every project has to "flight" on
>> >> >>> >>> >>>> "framework
>> >> >>> >>> >>>> market"
>> >> >>> >>> >>>> to be still no 1. I'm following many Spark and Big Data
>> >> >>> >>> >>>> communities,
>> >> >>> >>> >>>> maybe my mail will inspire someone :)
>> >> >>> >>> >>>>
>> >> >>> >>> >>>> You (every Spark developer; so far I didn't have enough
>> time
>> >> >>> >>> >>>> to
>> >> >>> >>> >>>> join
>> >> >>> >>> >>>> contributing to Spark) has done excellent job. So why are
>> >> >>> >>> >>>> some
>> >> >>> >>> >>>> people
>> >> >>> >>> >>>> saying that Flink (or other framework) is better, like it
>> was
>> >> >>> >>> >>>> posted
>> >> >>> >>> >>>> in
>> >> >>> >>> >>>> this mailing list? No, not because that framework is
>> better
>> >> >>> >>> >>>> in
>> >> >>> >>> >>>> all
>> >> >>> >>> >>>> cases.. In my opinion, many of these discussions where
>> >> >>> >>> >>>> started
>> >> >>> >>> >>>> after
>> >> >>> >>> >>>> Flink marketing-like posts. Please look at StackOverflow
>> >> >>> >>> >>>> "Flink
>> >> >>> >>> >>>> vs
>> >> >>> >>> >>>> ...."
>> >> >>> >>> >>>> posts, almost every post in "winned" by Flink. Answers are
>> >> >>> >>> >>>> sometimes
>> >> >>> >>> >>>> saying nothing about other frameworks, Flink's users
>> (often
>> >> >>> >>> >>>> PMC's)
>> >> >>> >>> >>>> are
>> >> >>> >>> >>>> just posting same information about real-time streaming,
>> >> >>> >>> >>>> about
>> >> >>> >>> >>>> delta
>> >> >>> >>> >>>> iterations, etc. It look smart and very often it is
>> marked as
>> >> >>> >>> >>>> an
>> >> >>> >>> >>>> aswer,
>> >> >>> >>> >>>> even if - in my opinion - there wasn't told all the truth.
>> >> >>> >>> >>>>
>> >> >>> >>> >>>>
>> >> >>> >>> >>>> My suggestion: I don't have enough money and knowledgle to
>> >> >>> >>> >>>> perform
>> >> >>> >>> >>>> huge
>> >> >>> >>> >>>> performance test. Maybe some company, that supports Spark
>> >> >>> >>> >>>> (Databricks,
>> >> >>> >>> >>>> Cloudera? - just saying you're most visible in community
>> :) )
>> >> >>> >>> >>>> could
>> >> >>> >>> >>>> perform performance test of:
>> >> >>> >>> >>>>
>> >> >>> >>> >>>> - streaming engine - probably Spark will loose because of
>> >> >>> >>> >>>> mini-batch
>> >> >>> >>> >>>> model, however currently the difference should be much
>> lower
>> >> >>> >>> >>>> that in
>> >> >>> >>> >>>> previous versions
>> >> >>> >>> >>>>
>> >> >>> >>> >>>> - Machine Learning models
>> >> >>> >>> >>>>
>> >> >>> >>> >>>> - batch jobs
>> >> >>> >>> >>>>
>> >> >>> >>> >>>> - Graph jobs
>> >> >>> >>> >>>>
>> >> >>> >>> >>>> - SQL queries
>> >> >>> >>> >>>>
>> >> >>> >>> >>>> People will see that Spark is envolving and is also a
>> modern
>> >> >>> >>> >>>> framework,
>> >> >>> >>> >>>> because after reading posts mentioned above people may
>> think
>> >> >>> >>> >>>> "it
>> >> >>> >>> >>>> is
>> >> >>> >>> >>>> outdated, future is in framework X".
>> >> >>> >>> >>>>
>> >> >>> >>> >>>> Matei Zaharia posted excellent blog post about how Spark
>> >> >>> >>> >>>> Structured
>> >> >>> >>> >>>> Streaming beats every other framework in terms of
>> easy-of-use
>> >> >>> >>> >>>> and
>> >> >>> >>> >>>> reliability. Performance tests, done in various
>> environments
>> >> >>> >>> >>>> (in
>> >> >>> >>> >>>> example: laptop, small 2 node cluster, 10-node cluster,
>> >> >>> >>> >>>> 20-node
>> >> >>> >>> >>>> cluster), could be also very good marketing stuff to say
>> >> >>> >>> >>>> "hey,
>> >> >>> >>> >>>> you're
>> >> >>> >>> >>>> telling that you're better, but Spark is still faster and
>> is
>> >> >>> >>> >>>> still
>> >> >>> >>> >>>> getting even more fast!". This would be based on facts
>> (just
>> >> >>> >>> >>>> numbers),
>> >> >>> >>> >>>> not opinions. It would be good for companies, for
>> marketing
>> >> >>> >>> >>>> puproses
>> >> >>> >>> >>>> and
>> >> >>> >>> >>>> for every Spark developer
>> >> >>> >>> >>>>
>> >> >>> >>> >>>>
>> >> >>> >>> >>>> Second: real-time streaming. I've written some time ago
>> about
>> >> >>> >>> >>>> real-time
>> >> >>> >>> >>>> streaming support in Spark Structured Streaming. Some work
>> >> >>> >>> >>>> should be
>> >> >>> >>> >>>> done to make SSS more low-latency, but I think it's
>> possible.
>> >> >>> >>> >>>> Maybe
>> >> >>> >>> >>>> Spark may look at Gearpump, which is also built on top of
>> >> >>> >>> >>>> Akka?
>> >> >>> >>> >>>> I
>> >> >>> >>> >>>> don't
>> >> >>> >>> >>>> know yet, it is good topic for SIP. However I think that
>> >> >>> >>> >>>> Spark
>> >> >>> >>> >>>> should
>> >> >>> >>> >>>> have real-time streaming support. Currently I see many
>> >> >>> >>> >>>> posts/comments
>> >> >>> >>> >>>> that "Spark has too big latency". Spark Streaming is doing
>> >> >>> >>> >>>> very
>> >> >>> >>> >>>> good
>> >> >>> >>> >>>> jobs with micro-batches, however I think it is possible to
>> >> >>> >>> >>>> add
>> >> >>> >>> >>>> also
>> >> >>> >>> >>>> more
>> >> >>> >>> >>>> real-time processing.
>> >> >>> >>> >>>>
>> >> >>> >>> >>>> Other people said much more and I agree with proposal of
>> SIP.
>> >> >>> >>> >>>> I'm
>> >> >>> >>> >>>> also
>> >> >>> >>> >>>> happy that PMC's are not saying that they will not listen
>> to
>> >> >>> >>> >>>> users,
>> >> >>> >>> >>>> but
>> >> >>> >>> >>>> they really want to make Spark better for every user.
>> >> >>> >>> >>>>
>> >> >>> >>> >>>>
>> >> >>> >>> >>>> What do you think about these two topics? Especially I'm
>> >> >>> >>> >>>> looking
>> >> >>> >>> >>>> at
>> >> >>> >>> >>>> Cody
>> >> >>> >>> >>>> (who has started this topic) and PMCs :)
>> >> >>> >>> >>>>
>> >> >>> >>> >>>> Pozdrawiam / Best regards,
>> >> >>> >>> >>>>
>> >> >>> >>> >>>> Tomasz
>> >> >>> >>> >>>>
>> >> >>> >>> >>>>
>> >> >>> >>>
>> >> >>> >>
>> >> >>> >
>> >> >>> >
>> >> >
>> >> >
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>> >>
>> >
>> >
>> >
>> > --
>> > Ryan Blue
>> > Software Engineer
>> > Netflix
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>
>>
>
>
> --
>
> Joseph Bradley
>
> Software Engineer - Machine Learning
>
> Databricks, Inc.
>
> [image: http://databricks.com] <http://databricks.com/>
>

Re: Spark Improvement Proposals

Posted by Joseph Bradley <jo...@databricks.com>.

Hi Cody,

Thanks for being persistent about this.  I too would like to see this
happen.  Reviewing the thread, it sounds like the main things remaining are:
* Decide about a few issues
* Finalize the doc(s)
* Vote on this proposal

Issues & TODOs:

(1) The main issue I see above is voting vs. consensus.  I have little
preference here.  It sounds like something which could be tailored based on
whether we see too many or too few SIPs being approved.

(2) Design doc template  (This would be great to have for Spark regardless
of this SIP discussion.)
* Reynold, are you still putting this together?

(3) Template cleanups.  Listing some items mentioned above + a new one
w.r.t. Reynold's draft
<https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#>
:
* Reinstate the "Where" section with links to current and past SIPs
* Add field for stating explicit deadlines for approval
* Add field for stating Author & Committer shepherd

Thanks all!
Joseph

On Mon, Jan 2, 2017 at 7:45 AM, Cody Koeninger <co...@koeninger.org> wrote:

> I'm bumping this one more time for the new year, and then I'm giving up.
>
> Please, fix your process, even if it isn't exactly the way I suggested.
>
> On Tue, Nov 8, 2016 at 11:14 AM, Ryan Blue <rb...@netflix.com> wrote:
> > On lazy consensus as opposed to voting:
> >
> > First, why lazy consensus? The proposal was for consensus, which is at
> least
> > three +1 votes and no vetos. Consensus has no losing side, it requires
> > getting to a point where there is agreement. Isn't that agreement what we
> > want to achieve with these proposals?
> >
> > Second, lazy consensus only removes the requirement for three +1 votes.
> Why
> > would we not want at least three committers to think something is a good
> > idea before adopting the proposal?
> >
> > rb
> >
> > On Tue, Nov 8, 2016 at 8:13 AM, Cody Koeninger <co...@koeninger.org>
> wrote:
> >>
> >> So there are some minor things (the Where section heading appears to
> >> be dropped; wherever this document is posted it needs to actually link
> >> to a jira filter showing current / past SIPs) but it doesn't look like
> >> I can comment on the google doc.
> >>
> >> The major substantive issue that I have is that this version is
> >> significantly less clear as to the outcome of an SIP.
> >>
> >> The apache example of lazy consensus at
> >> http://apache.org/foundation/voting.html#LazyConsensus involves an
> >> explicit announcement of an explicit deadline, which I think are
> >> necessary for clarity.
> >>
> >>
> >>
> >> On Mon, Nov 7, 2016 at 1:55 PM, Reynold Xin <rx...@databricks.com>
> wrote:
> >> > It turned out suggested edits (trackable) don't show up for
> non-owners,
> >> > so
> >> > I've just merged all the edits in place. It should be visible now.
> >> >
> >> > On Mon, Nov 7, 2016 at 10:10 AM, Reynold Xin <rx...@databricks.com>
> >> > wrote:
> >> >>
> >> >> Oops. Let me try figure that out.
> >> >>
> >> >>
> >> >> On Monday, November 7, 2016, Cody Koeninger <co...@koeninger.org>
> wrote:
> >> >>>
> >> >>> Thanks for picking up on this.
> >> >>>
> >> >>> Maybe I fail at google docs, but I can't see any edits on the
> document
> >> >>> you linked.
> >> >>>
> >> >>> Regarding lazy consensus, if the board in general has less of an
> issue
> >> >>> with that, sure.  As long as it is clearly announced, lasts at least
> >> >>> 72 hours, and has a clear outcome.
> >> >>>
> >> >>> The other points are hard to comment on without being able to see
> the
> >> >>> text in question.
> >> >>>
> >> >>>
> >> >>> On Mon, Nov 7, 2016 at 3:11 AM, Reynold Xin <rx...@databricks.com>
> >> >>> wrote:
> >> >>> > I just looked through the entire thread again tonight - there are
> a
> >> >>> > lot
> >> >>> > of
> >> >>> > great ideas being discussed. Thanks Cody for taking the first
> crack
> >> >>> > at
> >> >>> > the
> >> >>> > proposal.
> >> >>> >
> >> >>> > I want to first comment on the context. Spark is one of the most
> >> >>> > innovative
> >> >>> > and important projects in (big) data -- overall technical
> decisions
> >> >>> > made in
> >> >>> > Apache Spark are sound. But of course, a project as large and
> active
> >> >>> > as
> >> >>> > Spark always have room for improvement, and we as a community
> should
> >> >>> > strive
> >> >>> > to take it to the next level.
> >> >>> >
> >> >>> > To that end, the two biggest areas for improvements in my opinion
> >> >>> > are:
> >> >>> >
> >> >>> > 1. Visibility: There are so much happening that it is difficult to
> >> >>> > know
> >> >>> > what
> >> >>> > really is going on. For people that don't follow closely, it is
> >> >>> > difficult to
> >> >>> > know what the important initiatives are. Even for people that do
> >> >>> > follow, it
> >> >>> > is difficult to know what specific things require their attention,
> >> >>> > since the
> >> >>> > number of pull requests and JIRA tickets are high and it's
> difficult
> >> >>> > to
> >> >>> > extract signal from noise.
> >> >>> >
> >> >>> > 2. Solicit user (broadly defined, including developers themselves)
> >> >>> > input
> >> >>> > more proactively: At the end of the day the project provides value
> >> >>> > because
> >> >>> > users use it. Users can't tell us exactly what to build, but it is
> >> >>> > important
> >> >>> > to get their inputs.
> >> >>> >
> >> >>> >
> >> >>> > I've taken Cody's doc and edited it:
> >> >>> >
> >> >>> >
> >> >>> > https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-
> nRanvXmnZ7SUi4qMljg/edit#heading=h.36ut37zh7w2b
> >> >>> > (I've made all my modifications trackable)
> >> >>> >
> >> >>> > There are couple high level changes I made:
> >> >>> >
> >> >>> > 1. I've consulted a board member and he recommended lazy consensus
> >> >>> > as
> >> >>> > opposed to voting. The reason being in voting there can easily be
> a
> >> >>> > "loser'
> >> >>> > that gets outvoted.
> >> >>> >
> >> >>> > 2. I made it lighter weight, and renamed "strategy" to "optional
> >> >>> > design
> >> >>> > sketch". Echoing one of the earlier email: "IMHO so far aside from
> >> >>> > tagging
> >> >>> > things and linking them elsewhere simply having design docs and
> >> >>> > prototypes
> >> >>> > implementations in PRs is not something that has not worked so
> far".
> >> >>> >
> >> >>> > 3. I made some the language tweaks to focus more on visibility.
> For
> >> >>> > example,
> >> >>> > "The purpose of an SIP is to inform and involve", rather than just
> >> >>> > "involve". SIPs should also have at least two emails that go to
> >> >>> > dev@.
> >> >>> >
> >> >>> >
> >> >>> > While I was editing this, I thought we really needed a suggested
> >> >>> > template
> >> >>> > for design doc too. I will get to that too ...
> >> >>> >
> >> >>> >
> >> >>> > On Tue, Nov 1, 2016 at 12:09 AM, Reynold Xin <rxin@databricks.com
> >
> >> >>> > wrote:
> >> >>> >>
> >> >>> >> Most things looked OK to me too, although I do plan to take a
> >> >>> >> closer
> >> >>> >> look
> >> >>> >> after Nov 1st when we cut the release branch for 2.1.
> >> >>> >>
> >> >>> >>
> >> >>> >> On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin
> >> >>> >> <va...@cloudera.com>
> >> >>> >> wrote:
> >> >>> >>>
> >> >>> >>> The proposal looks OK to me. I assume, even though it's not
> >> >>> >>> explicitly
> >> >>> >>> called, that voting would happen by e-mail? A template for the
> >> >>> >>> proposal document (instead of just a bullet nice) would also be
> >> >>> >>> nice,
> >> >>> >>> but that can be done at any time.
> >> >>> >>>
> >> >>> >>> BTW, shameless plug: I filed SPARK-18085 which I consider a
> >> >>> >>> candidate
> >> >>> >>> for a SIP, given the scope of the work. The document attached
> even
> >> >>> >>> somewhat matches the proposed format. So if anyone wants to try
> >> >>> >>> out
> >> >>> >>> the process...
> >> >>> >>>
> >> >>> >>> On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger
> >> >>> >>> <co...@koeninger.org>
> >> >>> >>> wrote:
> >> >>> >>> > Now that spark summit europe is over, are any committers
> >> >>> >>> > interested
> >> >>> >>> > in
> >> >>> >>> > moving forward with this?
> >> >>> >>> >
> >> >>> >>> >
> >> >>> >>> >
> >> >>> >>> >
> >> >>> >>> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-
> improvement-proposals.md
> >> >>> >>> >
> >> >>> >>> > Or are we going to let this discussion die on the vine?
> >> >>> >>> >
> >> >>> >>> > On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda
> >> >>> >>> > <to...@outlook.com> wrote:
> >> >>> >>> >> Maybe my mail was not clear enough.
> >> >>> >>> >>
> >> >>> >>> >>
> >> >>> >>> >> I didn't want to write "lets focus on Flink" or any other
> >> >>> >>> >> framework.
> >> >>> >>> >> The
> >> >>> >>> >> idea with benchmarks was to show two things:
> >> >>> >>> >>
> >> >>> >>> >> - why some people are doing bad PR for Spark
> >> >>> >>> >>
> >> >>> >>> >> - how - in easy way - we can change it and show that Spark is
> >> >>> >>> >> still on
> >> >>> >>> >> the
> >> >>> >>> >> top
> >> >>> >>> >>
> >> >>> >>> >>
> >> >>> >>> >> No more, no less. Benchmarks will be helpful, but I don't
> think
> >> >>> >>> >> they're the
> >> >>> >>> >> most important thing in Spark :) On the Spark main page there
> >> >>> >>> >> is
> >> >>> >>> >> still
> >> >>> >>> >> chart
> >> >>> >>> >> "Spark vs Hadoop". It is important to show that framework is
> >> >>> >>> >> not
> >> >>> >>> >> the
> >> >>> >>> >> same
> >> >>> >>> >> Spark with other API, but much faster and optimized,
> comparable
> >> >>> >>> >> or
> >> >>> >>> >> even
> >> >>> >>> >> faster than other frameworks.
> >> >>> >>> >>
> >> >>> >>> >>
> >> >>> >>> >> About real-time streaming, I think it would be just good to
> see
> >> >>> >>> >> it
> >> >>> >>> >> in
> >> >>> >>> >> Spark.
> >> >>> >>> >> I very like current Spark model, but many voices that says
> "we
> >> >>> >>> >> need
> >> >>> >>> >> more" -
> >> >>> >>> >> community should listen also them and try to help them. With
> >> >>> >>> >> SIPs
> >> >>> >>> >> it
> >> >>> >>> >> would
> >> >>> >>> >> be easier, I've just posted this example as "thing that may
> be
> >> >>> >>> >> changed
> >> >>> >>> >> with
> >> >>> >>> >> SIP".
> >> >>> >>> >>
> >> >>> >>> >>
> >> >>> >>> >> I very like unification via Datasets, but there is a lot of
> >> >>> >>> >> algorithms
> >> >>> >>> >> inside - let's make easy API, but with strong background
> >> >>> >>> >> (articles,
> >> >>> >>> >> benchmarks, descriptions, etc) that shows that Spark is still
> >> >>> >>> >> modern
> >> >>> >>> >> framework.
> >> >>> >>> >>
> >> >>> >>> >>
> >> >>> >>> >> Maybe now my intention will be clearer :) As I said
> >> >>> >>> >> organizational
> >> >>> >>> >> ideas
> >> >>> >>> >> were already mentioned and I agree with them, my mail was
> just
> >> >>> >>> >> to
> >> >>> >>> >> show
> >> >>> >>> >> some
> >> >>> >>> >> aspects from my side, so from theside of developer and person
> >> >>> >>> >> who
> >> >>> >>> >> is
> >> >>> >>> >> trying
> >> >>> >>> >> to help others with Spark (via StackOverflow or other ways)
> >> >>> >>> >>
> >> >>> >>> >>
> >> >>> >>> >> Pozdrawiam / Best regards,
> >> >>> >>> >>
> >> >>> >>> >> Tomasz
> >> >>> >>> >>
> >> >>> >>> >>
> >> >>> >>> >> ________________________________
> >> >>> >>> >> Od: Cody Koeninger <co...@koeninger.org>
> >> >>> >>> >> Wysłane: 17 października 2016 16:46
> >> >>> >>> >> Do: Debasish Das
> >> >>> >>> >> DW: Tomasz Gawęda; dev@spark.apache.org
> >> >>> >>> >> Temat: Re: Spark Improvement Proposals
> >> >>> >>> >>
> >> >>> >>> >> I think narrowly focusing on Flink or benchmarks is missing
> my
> >> >>> >>> >> point.
> >> >>> >>> >>
> >> >>> >>> >> My point is evolve or die.  Spark's governance and
> organization
> >> >>> >>> >> is
> >> >>> >>> >> hampering its ability to evolve technologically, and it needs
> >> >>> >>> >> to
> >> >>> >>> >> change.
> >> >>> >>> >>
> >> >>> >>> >> On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das
> >> >>> >>> >> <de...@gmail.com>
> >> >>> >>> >> wrote:
> >> >>> >>> >>> Thanks Cody for bringing up a valid point...I picked up
> Spark
> >> >>> >>> >>> in
> >> >>> >>> >>> 2014
> >> >>> >>> >>> as
> >> >>> >>> >>> soon as I looked into it since compared to writing Java
> >> >>> >>> >>> map-reduce
> >> >>> >>> >>> and
> >> >>> >>> >>> Cascading code, Spark made writing distributed code
> fun...But
> >> >>> >>> >>> now
> >> >>> >>> >>> as
> >> >>> >>> >>> we
> >> >>> >>> >>> went
> >> >>> >>> >>> deeper with Spark and real-time streaming use-case gets more
> >> >>> >>> >>> prominent, I
> >> >>> >>> >>> think it is time to bring a messaging model in conjunction
> >> >>> >>> >>> with
> >> >>> >>> >>> the
> >> >>> >>> >>> batch/micro-batch API that Spark is good at....akka-streams
> >> >>> >>> >>> close
> >> >>> >>> >>> integration with spark micro-batching APIs looks like a
> great
> >> >>> >>> >>> direction to
> >> >>> >>> >>> stay in the game with Apache Flink...Spark 2.0 integrated
> >> >>> >>> >>> streaming
> >> >>> >>> >>> with
> >> >>> >>> >>> batch with the assumption is that micro-batching is
> sufficient
> >> >>> >>> >>> to
> >> >>> >>> >>> run
> >> >>> >>> >>> SQL
> >> >>> >>> >>> commands on stream but do we really have time to do SQL
> >> >>> >>> >>> processing at
> >> >>> >>> >>> streaming data within 1-2 seconds ?
> >> >>> >>> >>>
> >> >>> >>> >>> After reading the email chain, I started to look into Flink
> >> >>> >>> >>> documentation
> >> >>> >>> >>> and if you compare it with Spark documentation, I think we
> >> >>> >>> >>> have
> >> >>> >>> >>> major
> >> >>> >>> >>> work
> >> >>> >>> >>> to do detailing out Spark internals so that more people from
> >> >>> >>> >>> community
> >> >>> >>> >>> start
> >> >>> >>> >>> to take active role in improving the issues so that Spark
> >> >>> >>> >>> stays
> >> >>> >>> >>> strong
> >> >>> >>> >>> compared to Flink.
> >> >>> >>> >>>
> >> >>> >>> >>>
> >> >>> >>> >>> https://cwiki.apache.org/confluence/display/SPARK/
> Spark+Internals
> >> >>> >>> >>>
> >> >>> >>> >>>
> >> >>> >>> >>> https://cwiki.apache.org/confluence/display/FLINK/
> Flink+Internals
> >> >>> >>> >>>
> >> >>> >>> >>> Spark is no longer an engine that works for micro-batch and
> >> >>> >>> >>> batch...We
> >> >>> >>> >>> (and
> >> >>> >>> >>> I am sure many others) are pushing spark as an engine for
> >> >>> >>> >>> stream
> >> >>> >>> >>> and
> >> >>> >>> >>> query
> >> >>> >>> >>> processing.....we need to make it a state-of-the-art engine
> >> >>> >>> >>> for
> >> >>> >>> >>> high
> >> >>> >>> >>> speed
> >> >>> >>> >>> streaming data and user queries as well !
> >> >>> >>> >>>
> >> >>> >>> >>> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda
> >> >>> >>> >>> <to...@outlook.com>
> >> >>> >>> >>> wrote:
> >> >>> >>> >>>>
> >> >>> >>> >>>> Hi everyone,
> >> >>> >>> >>>>
> >> >>> >>> >>>> I'm quite late with my answer, but I think my suggestions
> may
> >> >>> >>> >>>> help a
> >> >>> >>> >>>> little bit. :) Many technical and organizational topics
> were
> >> >>> >>> >>>> mentioned,
> >> >>> >>> >>>> but I want to focus on these negative posts about Spark and
> >> >>> >>> >>>> about
> >> >>> >>> >>>> "haters"
> >> >>> >>> >>>>
> >> >>> >>> >>>> I really like Spark. Easy of use, speed, very good
> community
> >> >>> >>> >>>> -
> >> >>> >>> >>>> it's
> >> >>> >>> >>>> everything here. But Every project has to "flight" on
> >> >>> >>> >>>> "framework
> >> >>> >>> >>>> market"
> >> >>> >>> >>>> to be still no 1. I'm following many Spark and Big Data
> >> >>> >>> >>>> communities,
> >> >>> >>> >>>> maybe my mail will inspire someone :)
> >> >>> >>> >>>>
> >> >>> >>> >>>> You (every Spark developer; so far I didn't have enough
> time
> >> >>> >>> >>>> to
> >> >>> >>> >>>> join
> >> >>> >>> >>>> contributing to Spark) has done excellent job. So why are
> >> >>> >>> >>>> some
> >> >>> >>> >>>> people
> >> >>> >>> >>>> saying that Flink (or other framework) is better, like it
> was
> >> >>> >>> >>>> posted
> >> >>> >>> >>>> in
> >> >>> >>> >>>> this mailing list? No, not because that framework is better
> >> >>> >>> >>>> in
> >> >>> >>> >>>> all
> >> >>> >>> >>>> cases.. In my opinion, many of these discussions where
> >> >>> >>> >>>> started
> >> >>> >>> >>>> after
> >> >>> >>> >>>> Flink marketing-like posts. Please look at StackOverflow
> >> >>> >>> >>>> "Flink
> >> >>> >>> >>>> vs
> >> >>> >>> >>>> ...."
> >> >>> >>> >>>> posts, almost every post in "winned" by Flink. Answers are
> >> >>> >>> >>>> sometimes
> >> >>> >>> >>>> saying nothing about other frameworks, Flink's users (often
> >> >>> >>> >>>> PMC's)
> >> >>> >>> >>>> are
> >> >>> >>> >>>> just posting same information about real-time streaming,
> >> >>> >>> >>>> about
> >> >>> >>> >>>> delta
> >> >>> >>> >>>> iterations, etc. It look smart and very often it is marked
> as
> >> >>> >>> >>>> an
> >> >>> >>> >>>> aswer,
> >> >>> >>> >>>> even if - in my opinion - there wasn't told all the truth.
> >> >>> >>> >>>>
> >> >>> >>> >>>>
> >> >>> >>> >>>> My suggestion: I don't have enough money and knowledgle to
> >> >>> >>> >>>> perform
> >> >>> >>> >>>> huge
> >> >>> >>> >>>> performance test. Maybe some company, that supports Spark
> >> >>> >>> >>>> (Databricks,
> >> >>> >>> >>>> Cloudera? - just saying you're most visible in community
> :) )
> >> >>> >>> >>>> could
> >> >>> >>> >>>> perform performance test of:
> >> >>> >>> >>>>
> >> >>> >>> >>>> - streaming engine - probably Spark will loose because of
> >> >>> >>> >>>> mini-batch
> >> >>> >>> >>>> model, however currently the difference should be much
> lower
> >> >>> >>> >>>> that in
> >> >>> >>> >>>> previous versions
> >> >>> >>> >>>>
> >> >>> >>> >>>> - Machine Learning models
> >> >>> >>> >>>>
> >> >>> >>> >>>> - batch jobs
> >> >>> >>> >>>>
> >> >>> >>> >>>> - Graph jobs
> >> >>> >>> >>>>
> >> >>> >>> >>>> - SQL queries
> >> >>> >>> >>>>
> >> >>> >>> >>>> People will see that Spark is envolving and is also a
> modern
> >> >>> >>> >>>> framework,
> >> >>> >>> >>>> because after reading posts mentioned above people may
> think
> >> >>> >>> >>>> "it
> >> >>> >>> >>>> is
> >> >>> >>> >>>> outdated, future is in framework X".
> >> >>> >>> >>>>
> >> >>> >>> >>>> Matei Zaharia posted excellent blog post about how Spark
> >> >>> >>> >>>> Structured
> >> >>> >>> >>>> Streaming beats every other framework in terms of
> easy-of-use
> >> >>> >>> >>>> and
> >> >>> >>> >>>> reliability. Performance tests, done in various
> environments
> >> >>> >>> >>>> (in
> >> >>> >>> >>>> example: laptop, small 2 node cluster, 10-node cluster,
> >> >>> >>> >>>> 20-node
> >> >>> >>> >>>> cluster), could be also very good marketing stuff to say
> >> >>> >>> >>>> "hey,
> >> >>> >>> >>>> you're
> >> >>> >>> >>>> telling that you're better, but Spark is still faster and
> is
> >> >>> >>> >>>> still
> >> >>> >>> >>>> getting even more fast!". This would be based on facts
> (just
> >> >>> >>> >>>> numbers),
> >> >>> >>> >>>> not opinions. It would be good for companies, for marketing
> >> >>> >>> >>>> puproses
> >> >>> >>> >>>> and
> >> >>> >>> >>>> for every Spark developer
> >> >>> >>> >>>>
> >> >>> >>> >>>>
> >> >>> >>> >>>> Second: real-time streaming. I've written some time ago
> about
> >> >>> >>> >>>> real-time
> >> >>> >>> >>>> streaming support in Spark Structured Streaming. Some work
> >> >>> >>> >>>> should be
> >> >>> >>> >>>> done to make SSS more low-latency, but I think it's
> possible.
> >> >>> >>> >>>> Maybe
> >> >>> >>> >>>> Spark may look at Gearpump, which is also built on top of
> >> >>> >>> >>>> Akka?
> >> >>> >>> >>>> I
> >> >>> >>> >>>> don't
> >> >>> >>> >>>> know yet, it is good topic for SIP. However I think that
> >> >>> >>> >>>> Spark
> >> >>> >>> >>>> should
> >> >>> >>> >>>> have real-time streaming support. Currently I see many
> >> >>> >>> >>>> posts/comments
> >> >>> >>> >>>> that "Spark has too big latency". Spark Streaming is doing
> >> >>> >>> >>>> very
> >> >>> >>> >>>> good
> >> >>> >>> >>>> jobs with micro-batches, however I think it is possible to
> >> >>> >>> >>>> add
> >> >>> >>> >>>> also
> >> >>> >>> >>>> more
> >> >>> >>> >>>> real-time processing.
> >> >>> >>> >>>>
> >> >>> >>> >>>> Other people said much more and I agree with proposal of
> SIP.
> >> >>> >>> >>>> I'm
> >> >>> >>> >>>> also
> >> >>> >>> >>>> happy that PMC's are not saying that they will not listen
> to
> >> >>> >>> >>>> users,
> >> >>> >>> >>>> but
> >> >>> >>> >>>> they really want to make Spark better for every user.
> >> >>> >>> >>>>
> >> >>> >>> >>>>
> >> >>> >>> >>>> What do you think about these two topics? Especially I'm
> >> >>> >>> >>>> looking
> >> >>> >>> >>>> at
> >> >>> >>> >>>> Cody
> >> >>> >>> >>>> (who has started this topic) and PMCs :)
> >> >>> >>> >>>>
> >> >>> >>> >>>> Pozdrawiam / Best regards,
> >> >>> >>> >>>>
> >> >>> >>> >>>> Tomasz
> >> >>> >>> >>>>
> >> >>> >>> >>>>
> >> >>> >>>
> >> >>> >>
> >> >>> >
> >> >>> >
> >> >
> >> >
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
> >>
> >
> >
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>
>


-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[image: http://databricks.com] <http://databricks.com/>

Re: Spark Improvement Proposals

Posted by Cody Koeninger <co...@koeninger.org>.

I'm bumping this one more time for the new year, and then I'm giving up.

Please, fix your process, even if it isn't exactly the way I suggested.

On Tue, Nov 8, 2016 at 11:14 AM, Ryan Blue <rb...@netflix.com> wrote:
> On lazy consensus as opposed to voting:
>
> First, why lazy consensus? The proposal was for consensus, which is at least
> three +1 votes and no vetos. Consensus has no losing side, it requires
> getting to a point where there is agreement. Isn't that agreement what we
> want to achieve with these proposals?
>
> Second, lazy consensus only removes the requirement for three +1 votes. Why
> would we not want at least three committers to think something is a good
> idea before adopting the proposal?
>
> rb
>
> On Tue, Nov 8, 2016 at 8:13 AM, Cody Koeninger <co...@koeninger.org> wrote:
>>
>> So there are some minor things (the Where section heading appears to
>> be dropped; wherever this document is posted it needs to actually link
>> to a jira filter showing current / past SIPs) but it doesn't look like
>> I can comment on the google doc.
>>
>> The major substantive issue that I have is that this version is
>> significantly less clear as to the outcome of an SIP.
>>
>> The apache example of lazy consensus at
>> http://apache.org/foundation/voting.html#LazyConsensus involves an
>> explicit announcement of an explicit deadline, which I think are
>> necessary for clarity.
>>
>>
>>
>> On Mon, Nov 7, 2016 at 1:55 PM, Reynold Xin <rx...@databricks.com> wrote:
>> > It turned out suggested edits (trackable) don't show up for non-owners,
>> > so
>> > I've just merged all the edits in place. It should be visible now.
>> >
>> > On Mon, Nov 7, 2016 at 10:10 AM, Reynold Xin <rx...@databricks.com>
>> > wrote:
>> >>
>> >> Oops. Let me try figure that out.
>> >>
>> >>
>> >> On Monday, November 7, 2016, Cody Koeninger <co...@koeninger.org> wrote:
>> >>>
>> >>> Thanks for picking up on this.
>> >>>
>> >>> Maybe I fail at google docs, but I can't see any edits on the document
>> >>> you linked.
>> >>>
>> >>> Regarding lazy consensus, if the board in general has less of an issue
>> >>> with that, sure.  As long as it is clearly announced, lasts at least
>> >>> 72 hours, and has a clear outcome.
>> >>>
>> >>> The other points are hard to comment on without being able to see the
>> >>> text in question.
>> >>>
>> >>>
>> >>> On Mon, Nov 7, 2016 at 3:11 AM, Reynold Xin <rx...@databricks.com>
>> >>> wrote:
>> >>> > I just looked through the entire thread again tonight - there are a
>> >>> > lot
>> >>> > of
>> >>> > great ideas being discussed. Thanks Cody for taking the first crack
>> >>> > at
>> >>> > the
>> >>> > proposal.
>> >>> >
>> >>> > I want to first comment on the context. Spark is one of the most
>> >>> > innovative
>> >>> > and important projects in (big) data -- overall technical decisions
>> >>> > made in
>> >>> > Apache Spark are sound. But of course, a project as large and active
>> >>> > as
>> >>> > Spark always have room for improvement, and we as a community should
>> >>> > strive
>> >>> > to take it to the next level.
>> >>> >
>> >>> > To that end, the two biggest areas for improvements in my opinion
>> >>> > are:
>> >>> >
>> >>> > 1. Visibility: There are so much happening that it is difficult to
>> >>> > know
>> >>> > what
>> >>> > really is going on. For people that don't follow closely, it is
>> >>> > difficult to
>> >>> > know what the important initiatives are. Even for people that do
>> >>> > follow, it
>> >>> > is difficult to know what specific things require their attention,
>> >>> > since the
>> >>> > number of pull requests and JIRA tickets are high and it's difficult
>> >>> > to
>> >>> > extract signal from noise.
>> >>> >
>> >>> > 2. Solicit user (broadly defined, including developers themselves)
>> >>> > input
>> >>> > more proactively: At the end of the day the project provides value
>> >>> > because
>> >>> > users use it. Users can't tell us exactly what to build, but it is
>> >>> > important
>> >>> > to get their inputs.
>> >>> >
>> >>> >
>> >>> > I've taken Cody's doc and edited it:
>> >>> >
>> >>> >
>> >>> > https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#heading=h.36ut37zh7w2b
>> >>> > (I've made all my modifications trackable)
>> >>> >
>> >>> > There are couple high level changes I made:
>> >>> >
>> >>> > 1. I've consulted a board member and he recommended lazy consensus
>> >>> > as
>> >>> > opposed to voting. The reason being in voting there can easily be a
>> >>> > "loser'
>> >>> > that gets outvoted.
>> >>> >
>> >>> > 2. I made it lighter weight, and renamed "strategy" to "optional
>> >>> > design
>> >>> > sketch". Echoing one of the earlier email: "IMHO so far aside from
>> >>> > tagging
>> >>> > things and linking them elsewhere simply having design docs and
>> >>> > prototypes
>> >>> > implementations in PRs is not something that has not worked so far".
>> >>> >
>> >>> > 3. I made some the language tweaks to focus more on visibility. For
>> >>> > example,
>> >>> > "The purpose of an SIP is to inform and involve", rather than just
>> >>> > "involve". SIPs should also have at least two emails that go to
>> >>> > dev@.
>> >>> >
>> >>> >
>> >>> > While I was editing this, I thought we really needed a suggested
>> >>> > template
>> >>> > for design doc too. I will get to that too ...
>> >>> >
>> >>> >
>> >>> > On Tue, Nov 1, 2016 at 12:09 AM, Reynold Xin <rx...@databricks.com>
>> >>> > wrote:
>> >>> >>
>> >>> >> Most things looked OK to me too, although I do plan to take a
>> >>> >> closer
>> >>> >> look
>> >>> >> after Nov 1st when we cut the release branch for 2.1.
>> >>> >>
>> >>> >>
>> >>> >> On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin
>> >>> >> <va...@cloudera.com>
>> >>> >> wrote:
>> >>> >>>
>> >>> >>> The proposal looks OK to me. I assume, even though it's not
>> >>> >>> explicitly
>> >>> >>> called, that voting would happen by e-mail? A template for the
>> >>> >>> proposal document (instead of just a bullet nice) would also be
>> >>> >>> nice,
>> >>> >>> but that can be done at any time.
>> >>> >>>
>> >>> >>> BTW, shameless plug: I filed SPARK-18085 which I consider a
>> >>> >>> candidate
>> >>> >>> for a SIP, given the scope of the work. The document attached even
>> >>> >>> somewhat matches the proposed format. So if anyone wants to try
>> >>> >>> out
>> >>> >>> the process...
>> >>> >>>
>> >>> >>> On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger
>> >>> >>> <co...@koeninger.org>
>> >>> >>> wrote:
>> >>> >>> > Now that spark summit europe is over, are any committers
>> >>> >>> > interested
>> >>> >>> > in
>> >>> >>> > moving forward with this?
>> >>> >>> >
>> >>> >>> >
>> >>> >>> >
>> >>> >>> >
>> >>> >>> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>> >>> >>> >
>> >>> >>> > Or are we going to let this discussion die on the vine?
>> >>> >>> >
>> >>> >>> > On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda
>> >>> >>> > <to...@outlook.com> wrote:
>> >>> >>> >> Maybe my mail was not clear enough.
>> >>> >>> >>
>> >>> >>> >>
>> >>> >>> >> I didn't want to write "lets focus on Flink" or any other
>> >>> >>> >> framework.
>> >>> >>> >> The
>> >>> >>> >> idea with benchmarks was to show two things:
>> >>> >>> >>
>> >>> >>> >> - why some people are doing bad PR for Spark
>> >>> >>> >>
>> >>> >>> >> - how - in easy way - we can change it and show that Spark is
>> >>> >>> >> still on
>> >>> >>> >> the
>> >>> >>> >> top
>> >>> >>> >>
>> >>> >>> >>
>> >>> >>> >> No more, no less. Benchmarks will be helpful, but I don't think
>> >>> >>> >> they're the
>> >>> >>> >> most important thing in Spark :) On the Spark main page there
>> >>> >>> >> is
>> >>> >>> >> still
>> >>> >>> >> chart
>> >>> >>> >> "Spark vs Hadoop". It is important to show that framework is
>> >>> >>> >> not
>> >>> >>> >> the
>> >>> >>> >> same
>> >>> >>> >> Spark with other API, but much faster and optimized, comparable
>> >>> >>> >> or
>> >>> >>> >> even
>> >>> >>> >> faster than other frameworks.
>> >>> >>> >>
>> >>> >>> >>
>> >>> >>> >> About real-time streaming, I think it would be just good to see
>> >>> >>> >> it
>> >>> >>> >> in
>> >>> >>> >> Spark.
>> >>> >>> >> I very like current Spark model, but many voices that says "we
>> >>> >>> >> need
>> >>> >>> >> more" -
>> >>> >>> >> community should listen also them and try to help them. With
>> >>> >>> >> SIPs
>> >>> >>> >> it
>> >>> >>> >> would
>> >>> >>> >> be easier, I've just posted this example as "thing that may be
>> >>> >>> >> changed
>> >>> >>> >> with
>> >>> >>> >> SIP".
>> >>> >>> >>
>> >>> >>> >>
>> >>> >>> >> I very like unification via Datasets, but there is a lot of
>> >>> >>> >> algorithms
>> >>> >>> >> inside - let's make easy API, but with strong background
>> >>> >>> >> (articles,
>> >>> >>> >> benchmarks, descriptions, etc) that shows that Spark is still
>> >>> >>> >> modern
>> >>> >>> >> framework.
>> >>> >>> >>
>> >>> >>> >>
>> >>> >>> >> Maybe now my intention will be clearer :) As I said
>> >>> >>> >> organizational
>> >>> >>> >> ideas
>> >>> >>> >> were already mentioned and I agree with them, my mail was just
>> >>> >>> >> to
>> >>> >>> >> show
>> >>> >>> >> some
>> >>> >>> >> aspects from my side, so from theside of developer and person
>> >>> >>> >> who
>> >>> >>> >> is
>> >>> >>> >> trying
>> >>> >>> >> to help others with Spark (via StackOverflow or other ways)
>> >>> >>> >>
>> >>> >>> >>
>> >>> >>> >> Pozdrawiam / Best regards,
>> >>> >>> >>
>> >>> >>> >> Tomasz
>> >>> >>> >>
>> >>> >>> >>
>> >>> >>> >> ________________________________
>> >>> >>> >> Od: Cody Koeninger <co...@koeninger.org>
>> >>> >>> >> Wysłane: 17 października 2016 16:46
>> >>> >>> >> Do: Debasish Das
>> >>> >>> >> DW: Tomasz Gawęda; dev@spark.apache.org
>> >>> >>> >> Temat: Re: Spark Improvement Proposals
>> >>> >>> >>
>> >>> >>> >> I think narrowly focusing on Flink or benchmarks is missing my
>> >>> >>> >> point.
>> >>> >>> >>
>> >>> >>> >> My point is evolve or die.  Spark's governance and organization
>> >>> >>> >> is
>> >>> >>> >> hampering its ability to evolve technologically, and it needs
>> >>> >>> >> to
>> >>> >>> >> change.
>> >>> >>> >>
>> >>> >>> >> On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das
>> >>> >>> >> <de...@gmail.com>
>> >>> >>> >> wrote:
>> >>> >>> >>> Thanks Cody for bringing up a valid point...I picked up Spark
>> >>> >>> >>> in
>> >>> >>> >>> 2014
>> >>> >>> >>> as
>> >>> >>> >>> soon as I looked into it since compared to writing Java
>> >>> >>> >>> map-reduce
>> >>> >>> >>> and
>> >>> >>> >>> Cascading code, Spark made writing distributed code fun...But
>> >>> >>> >>> now
>> >>> >>> >>> as
>> >>> >>> >>> we
>> >>> >>> >>> went
>> >>> >>> >>> deeper with Spark and real-time streaming use-case gets more
>> >>> >>> >>> prominent, I
>> >>> >>> >>> think it is time to bring a messaging model in conjunction
>> >>> >>> >>> with
>> >>> >>> >>> the
>> >>> >>> >>> batch/micro-batch API that Spark is good at....akka-streams
>> >>> >>> >>> close
>> >>> >>> >>> integration with spark micro-batching APIs looks like a great
>> >>> >>> >>> direction to
>> >>> >>> >>> stay in the game with Apache Flink...Spark 2.0 integrated
>> >>> >>> >>> streaming
>> >>> >>> >>> with
>> >>> >>> >>> batch with the assumption is that micro-batching is sufficient
>> >>> >>> >>> to
>> >>> >>> >>> run
>> >>> >>> >>> SQL
>> >>> >>> >>> commands on stream but do we really have time to do SQL
>> >>> >>> >>> processing at
>> >>> >>> >>> streaming data within 1-2 seconds ?
>> >>> >>> >>>
>> >>> >>> >>> After reading the email chain, I started to look into Flink
>> >>> >>> >>> documentation
>> >>> >>> >>> and if you compare it with Spark documentation, I think we
>> >>> >>> >>> have
>> >>> >>> >>> major
>> >>> >>> >>> work
>> >>> >>> >>> to do detailing out Spark internals so that more people from
>> >>> >>> >>> community
>> >>> >>> >>> start
>> >>> >>> >>> to take active role in improving the issues so that Spark
>> >>> >>> >>> stays
>> >>> >>> >>> strong
>> >>> >>> >>> compared to Flink.
>> >>> >>> >>>
>> >>> >>> >>>
>> >>> >>> >>> https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
>> >>> >>> >>>
>> >>> >>> >>>
>> >>> >>> >>> https://cwiki.apache.org/confluence/display/FLINK/Flink+Internals
>> >>> >>> >>>
>> >>> >>> >>> Spark is no longer an engine that works for micro-batch and
>> >>> >>> >>> batch...We
>> >>> >>> >>> (and
>> >>> >>> >>> I am sure many others) are pushing spark as an engine for
>> >>> >>> >>> stream
>> >>> >>> >>> and
>> >>> >>> >>> query
>> >>> >>> >>> processing.....we need to make it a state-of-the-art engine
>> >>> >>> >>> for
>> >>> >>> >>> high
>> >>> >>> >>> speed
>> >>> >>> >>> streaming data and user queries as well !
>> >>> >>> >>>
>> >>> >>> >>> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda
>> >>> >>> >>> <to...@outlook.com>
>> >>> >>> >>> wrote:
>> >>> >>> >>>>
>> >>> >>> >>>> Hi everyone,
>> >>> >>> >>>>
>> >>> >>> >>>> I'm quite late with my answer, but I think my suggestions may
>> >>> >>> >>>> help a
>> >>> >>> >>>> little bit. :) Many technical and organizational topics were
>> >>> >>> >>>> mentioned,
>> >>> >>> >>>> but I want to focus on these negative posts about Spark and
>> >>> >>> >>>> about
>> >>> >>> >>>> "haters"
>> >>> >>> >>>>
>> >>> >>> >>>> I really like Spark. Easy of use, speed, very good community
>> >>> >>> >>>> -
>> >>> >>> >>>> it's
>> >>> >>> >>>> everything here. But Every project has to "flight" on
>> >>> >>> >>>> "framework
>> >>> >>> >>>> market"
>> >>> >>> >>>> to be still no 1. I'm following many Spark and Big Data
>> >>> >>> >>>> communities,
>> >>> >>> >>>> maybe my mail will inspire someone :)
>> >>> >>> >>>>
>> >>> >>> >>>> You (every Spark developer; so far I didn't have enough time
>> >>> >>> >>>> to
>> >>> >>> >>>> join
>> >>> >>> >>>> contributing to Spark) has done excellent job. So why are
>> >>> >>> >>>> some
>> >>> >>> >>>> people
>> >>> >>> >>>> saying that Flink (or other framework) is better, like it was
>> >>> >>> >>>> posted
>> >>> >>> >>>> in
>> >>> >>> >>>> this mailing list? No, not because that framework is better
>> >>> >>> >>>> in
>> >>> >>> >>>> all
>> >>> >>> >>>> cases.. In my opinion, many of these discussions where
>> >>> >>> >>>> started
>> >>> >>> >>>> after
>> >>> >>> >>>> Flink marketing-like posts. Please look at StackOverflow
>> >>> >>> >>>> "Flink
>> >>> >>> >>>> vs
>> >>> >>> >>>> ...."
>> >>> >>> >>>> posts, almost every post in "winned" by Flink. Answers are
>> >>> >>> >>>> sometimes
>> >>> >>> >>>> saying nothing about other frameworks, Flink's users (often
>> >>> >>> >>>> PMC's)
>> >>> >>> >>>> are
>> >>> >>> >>>> just posting same information about real-time streaming,
>> >>> >>> >>>> about
>> >>> >>> >>>> delta
>> >>> >>> >>>> iterations, etc. It look smart and very often it is marked as
>> >>> >>> >>>> an
>> >>> >>> >>>> aswer,
>> >>> >>> >>>> even if - in my opinion - there wasn't told all the truth.
>> >>> >>> >>>>
>> >>> >>> >>>>
>> >>> >>> >>>> My suggestion: I don't have enough money and knowledgle to
>> >>> >>> >>>> perform
>> >>> >>> >>>> huge
>> >>> >>> >>>> performance test. Maybe some company, that supports Spark
>> >>> >>> >>>> (Databricks,
>> >>> >>> >>>> Cloudera? - just saying you're most visible in community :) )
>> >>> >>> >>>> could
>> >>> >>> >>>> perform performance test of:
>> >>> >>> >>>>
>> >>> >>> >>>> - streaming engine - probably Spark will loose because of
>> >>> >>> >>>> mini-batch
>> >>> >>> >>>> model, however currently the difference should be much lower
>> >>> >>> >>>> that in
>> >>> >>> >>>> previous versions
>> >>> >>> >>>>
>> >>> >>> >>>> - Machine Learning models
>> >>> >>> >>>>
>> >>> >>> >>>> - batch jobs
>> >>> >>> >>>>
>> >>> >>> >>>> - Graph jobs
>> >>> >>> >>>>
>> >>> >>> >>>> - SQL queries
>> >>> >>> >>>>
>> >>> >>> >>>> People will see that Spark is envolving and is also a modern
>> >>> >>> >>>> framework,
>> >>> >>> >>>> because after reading posts mentioned above people may think
>> >>> >>> >>>> "it
>> >>> >>> >>>> is
>> >>> >>> >>>> outdated, future is in framework X".
>> >>> >>> >>>>
>> >>> >>> >>>> Matei Zaharia posted excellent blog post about how Spark
>> >>> >>> >>>> Structured
>> >>> >>> >>>> Streaming beats every other framework in terms of easy-of-use
>> >>> >>> >>>> and
>> >>> >>> >>>> reliability. Performance tests, done in various environments
>> >>> >>> >>>> (in
>> >>> >>> >>>> example: laptop, small 2 node cluster, 10-node cluster,
>> >>> >>> >>>> 20-node
>> >>> >>> >>>> cluster), could be also very good marketing stuff to say
>> >>> >>> >>>> "hey,
>> >>> >>> >>>> you're
>> >>> >>> >>>> telling that you're better, but Spark is still faster and is
>> >>> >>> >>>> still
>> >>> >>> >>>> getting even more fast!". This would be based on facts (just
>> >>> >>> >>>> numbers),
>> >>> >>> >>>> not opinions. It would be good for companies, for marketing
>> >>> >>> >>>> puproses
>> >>> >>> >>>> and
>> >>> >>> >>>> for every Spark developer
>> >>> >>> >>>>
>> >>> >>> >>>>
>> >>> >>> >>>> Second: real-time streaming. I've written some time ago about
>> >>> >>> >>>> real-time
>> >>> >>> >>>> streaming support in Spark Structured Streaming. Some work
>> >>> >>> >>>> should be
>> >>> >>> >>>> done to make SSS more low-latency, but I think it's possible.
>> >>> >>> >>>> Maybe
>> >>> >>> >>>> Spark may look at Gearpump, which is also built on top of
>> >>> >>> >>>> Akka?
>> >>> >>> >>>> I
>> >>> >>> >>>> don't
>> >>> >>> >>>> know yet, it is good topic for SIP. However I think that
>> >>> >>> >>>> Spark
>> >>> >>> >>>> should
>> >>> >>> >>>> have real-time streaming support. Currently I see many
>> >>> >>> >>>> posts/comments
>> >>> >>> >>>> that "Spark has too big latency". Spark Streaming is doing
>> >>> >>> >>>> very
>> >>> >>> >>>> good
>> >>> >>> >>>> jobs with micro-batches, however I think it is possible to
>> >>> >>> >>>> add
>> >>> >>> >>>> also
>> >>> >>> >>>> more
>> >>> >>> >>>> real-time processing.
>> >>> >>> >>>>
>> >>> >>> >>>> Other people said much more and I agree with proposal of SIP.
>> >>> >>> >>>> I'm
>> >>> >>> >>>> also
>> >>> >>> >>>> happy that PMC's are not saying that they will not listen to
>> >>> >>> >>>> users,
>> >>> >>> >>>> but
>> >>> >>> >>>> they really want to make Spark better for every user.
>> >>> >>> >>>>
>> >>> >>> >>>>
>> >>> >>> >>>> What do you think about these two topics? Especially I'm
>> >>> >>> >>>> looking
>> >>> >>> >>>> at
>> >>> >>> >>>> Cody
>> >>> >>> >>>> (who has started this topic) and PMCs :)
>> >>> >>> >>>>
>> >>> >>> >>>> Pozdrawiam / Best regards,
>> >>> >>> >>>>
>> >>> >>> >>>> Tomasz
>> >>> >>> >>>>
>> >>> >>> >>>>
>> >>> >>>
>> >>> >>
>> >>> >
>> >>> >
>> >
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org

Re: Spark Improvement Proposals

Posted by Ryan Blue <rb...@netflix.com.INVALID>.

On lazy consensus as opposed to voting:

First, why lazy consensus? The proposal was for consensus, which is at
least three +1 votes and no vetos. Consensus has no losing side, it
requires getting to a point where there is agreement. Isn't that agreement
what we want to achieve with these proposals?

Second, lazy consensus only removes the requirement for three +1 votes. Why
would we not want at least three committers to think something is a good
idea before adopting the proposal?

rb

On Tue, Nov 8, 2016 at 8:13 AM, Cody Koeninger <co...@koeninger.org> wrote:

> So there are some minor things (the Where section heading appears to
> be dropped; wherever this document is posted it needs to actually link
> to a jira filter showing current / past SIPs) but it doesn't look like
> I can comment on the google doc.
>
> The major substantive issue that I have is that this version is
> significantly less clear as to the outcome of an SIP.
>
> The apache example of lazy consensus at
> http://apache.org/foundation/voting.html#LazyConsensus involves an
> explicit announcement of an explicit deadline, which I think are
> necessary for clarity.
>
>
>
> On Mon, Nov 7, 2016 at 1:55 PM, Reynold Xin <rx...@databricks.com> wrote:
> > It turned out suggested edits (trackable) don't show up for non-owners,
> so
> > I've just merged all the edits in place. It should be visible now.
> >
> > On Mon, Nov 7, 2016 at 10:10 AM, Reynold Xin <rx...@databricks.com>
> wrote:
> >>
> >> Oops. Let me try figure that out.
> >>
> >>
> >> On Monday, November 7, 2016, Cody Koeninger <co...@koeninger.org> wrote:
> >>>
> >>> Thanks for picking up on this.
> >>>
> >>> Maybe I fail at google docs, but I can't see any edits on the document
> >>> you linked.
> >>>
> >>> Regarding lazy consensus, if the board in general has less of an issue
> >>> with that, sure.  As long as it is clearly announced, lasts at least
> >>> 72 hours, and has a clear outcome.
> >>>
> >>> The other points are hard to comment on without being able to see the
> >>> text in question.
> >>>
> >>>
> >>> On Mon, Nov 7, 2016 at 3:11 AM, Reynold Xin <rx...@databricks.com>
> wrote:
> >>> > I just looked through the entire thread again tonight - there are a
> lot
> >>> > of
> >>> > great ideas being discussed. Thanks Cody for taking the first crack
> at
> >>> > the
> >>> > proposal.
> >>> >
> >>> > I want to first comment on the context. Spark is one of the most
> >>> > innovative
> >>> > and important projects in (big) data -- overall technical decisions
> >>> > made in
> >>> > Apache Spark are sound. But of course, a project as large and active
> as
> >>> > Spark always have room for improvement, and we as a community should
> >>> > strive
> >>> > to take it to the next level.
> >>> >
> >>> > To that end, the two biggest areas for improvements in my opinion
> are:
> >>> >
> >>> > 1. Visibility: There are so much happening that it is difficult to
> know
> >>> > what
> >>> > really is going on. For people that don't follow closely, it is
> >>> > difficult to
> >>> > know what the important initiatives are. Even for people that do
> >>> > follow, it
> >>> > is difficult to know what specific things require their attention,
> >>> > since the
> >>> > number of pull requests and JIRA tickets are high and it's difficult
> to
> >>> > extract signal from noise.
> >>> >
> >>> > 2. Solicit user (broadly defined, including developers themselves)
> >>> > input
> >>> > more proactively: At the end of the day the project provides value
> >>> > because
> >>> > users use it. Users can't tell us exactly what to build, but it is
> >>> > important
> >>> > to get their inputs.
> >>> >
> >>> >
> >>> > I've taken Cody's doc and edited it:
> >>> >
> >>> > https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-
> nRanvXmnZ7SUi4qMljg/edit#heading=h.36ut37zh7w2b
> >>> > (I've made all my modifications trackable)
> >>> >
> >>> > There are couple high level changes I made:
> >>> >
> >>> > 1. I've consulted a board member and he recommended lazy consensus as
> >>> > opposed to voting. The reason being in voting there can easily be a
> >>> > "loser'
> >>> > that gets outvoted.
> >>> >
> >>> > 2. I made it lighter weight, and renamed "strategy" to "optional
> design
> >>> > sketch". Echoing one of the earlier email: "IMHO so far aside from
> >>> > tagging
> >>> > things and linking them elsewhere simply having design docs and
> >>> > prototypes
> >>> > implementations in PRs is not something that has not worked so far".
> >>> >
> >>> > 3. I made some the language tweaks to focus more on visibility. For
> >>> > example,
> >>> > "The purpose of an SIP is to inform and involve", rather than just
> >>> > "involve". SIPs should also have at least two emails that go to dev@
> .
> >>> >
> >>> >
> >>> > While I was editing this, I thought we really needed a suggested
> >>> > template
> >>> > for design doc too. I will get to that too ...
> >>> >
> >>> >
> >>> > On Tue, Nov 1, 2016 at 12:09 AM, Reynold Xin <rx...@databricks.com>
> >>> > wrote:
> >>> >>
> >>> >> Most things looked OK to me too, although I do plan to take a closer
> >>> >> look
> >>> >> after Nov 1st when we cut the release branch for 2.1.
> >>> >>
> >>> >>
> >>> >> On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin <
> vanzin@cloudera.com>
> >>> >> wrote:
> >>> >>>
> >>> >>> The proposal looks OK to me. I assume, even though it's not
> >>> >>> explicitly
> >>> >>> called, that voting would happen by e-mail? A template for the
> >>> >>> proposal document (instead of just a bullet nice) would also be
> nice,
> >>> >>> but that can be done at any time.
> >>> >>>
> >>> >>> BTW, shameless plug: I filed SPARK-18085 which I consider a
> candidate
> >>> >>> for a SIP, given the scope of the work. The document attached even
> >>> >>> somewhat matches the proposed format. So if anyone wants to try out
> >>> >>> the process...
> >>> >>>
> >>> >>> On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger <
> cody@koeninger.org>
> >>> >>> wrote:
> >>> >>> > Now that spark summit europe is over, are any committers
> interested
> >>> >>> > in
> >>> >>> > moving forward with this?
> >>> >>> >
> >>> >>> >
> >>> >>> >
> >>> >>> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-
> improvement-proposals.md
> >>> >>> >
> >>> >>> > Or are we going to let this discussion die on the vine?
> >>> >>> >
> >>> >>> > On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda
> >>> >>> > <to...@outlook.com> wrote:
> >>> >>> >> Maybe my mail was not clear enough.
> >>> >>> >>
> >>> >>> >>
> >>> >>> >> I didn't want to write "lets focus on Flink" or any other
> >>> >>> >> framework.
> >>> >>> >> The
> >>> >>> >> idea with benchmarks was to show two things:
> >>> >>> >>
> >>> >>> >> - why some people are doing bad PR for Spark
> >>> >>> >>
> >>> >>> >> - how - in easy way - we can change it and show that Spark is
> >>> >>> >> still on
> >>> >>> >> the
> >>> >>> >> top
> >>> >>> >>
> >>> >>> >>
> >>> >>> >> No more, no less. Benchmarks will be helpful, but I don't think
> >>> >>> >> they're the
> >>> >>> >> most important thing in Spark :) On the Spark main page there is
> >>> >>> >> still
> >>> >>> >> chart
> >>> >>> >> "Spark vs Hadoop". It is important to show that framework is not
> >>> >>> >> the
> >>> >>> >> same
> >>> >>> >> Spark with other API, but much faster and optimized, comparable
> or
> >>> >>> >> even
> >>> >>> >> faster than other frameworks.
> >>> >>> >>
> >>> >>> >>
> >>> >>> >> About real-time streaming, I think it would be just good to see
> it
> >>> >>> >> in
> >>> >>> >> Spark.
> >>> >>> >> I very like current Spark model, but many voices that says "we
> >>> >>> >> need
> >>> >>> >> more" -
> >>> >>> >> community should listen also them and try to help them. With
> SIPs
> >>> >>> >> it
> >>> >>> >> would
> >>> >>> >> be easier, I've just posted this example as "thing that may be
> >>> >>> >> changed
> >>> >>> >> with
> >>> >>> >> SIP".
> >>> >>> >>
> >>> >>> >>
> >>> >>> >> I very like unification via Datasets, but there is a lot of
> >>> >>> >> algorithms
> >>> >>> >> inside - let's make easy API, but with strong background
> >>> >>> >> (articles,
> >>> >>> >> benchmarks, descriptions, etc) that shows that Spark is still
> >>> >>> >> modern
> >>> >>> >> framework.
> >>> >>> >>
> >>> >>> >>
> >>> >>> >> Maybe now my intention will be clearer :) As I said
> organizational
> >>> >>> >> ideas
> >>> >>> >> were already mentioned and I agree with them, my mail was just
> to
> >>> >>> >> show
> >>> >>> >> some
> >>> >>> >> aspects from my side, so from theside of developer and person
> who
> >>> >>> >> is
> >>> >>> >> trying
> >>> >>> >> to help others with Spark (via StackOverflow or other ways)
> >>> >>> >>
> >>> >>> >>
> >>> >>> >> Pozdrawiam / Best regards,
> >>> >>> >>
> >>> >>> >> Tomasz
> >>> >>> >>
> >>> >>> >>
> >>> >>> >> ________________________________
> >>> >>> >> Od: Cody Koeninger <co...@koeninger.org>
> >>> >>> >> Wysłane: 17 października 2016 16:46
> >>> >>> >> Do: Debasish Das
> >>> >>> >> DW: Tomasz Gawęda; dev@spark.apache.org
> >>> >>> >> Temat: Re: Spark Improvement Proposals
> >>> >>> >>
> >>> >>> >> I think narrowly focusing on Flink or benchmarks is missing my
> >>> >>> >> point.
> >>> >>> >>
> >>> >>> >> My point is evolve or die.  Spark's governance and organization
> is
> >>> >>> >> hampering its ability to evolve technologically, and it needs to
> >>> >>> >> change.
> >>> >>> >>
> >>> >>> >> On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das
> >>> >>> >> <de...@gmail.com>
> >>> >>> >> wrote:
> >>> >>> >>> Thanks Cody for bringing up a valid point...I picked up Spark
> in
> >>> >>> >>> 2014
> >>> >>> >>> as
> >>> >>> >>> soon as I looked into it since compared to writing Java
> >>> >>> >>> map-reduce
> >>> >>> >>> and
> >>> >>> >>> Cascading code, Spark made writing distributed code fun...But
> now
> >>> >>> >>> as
> >>> >>> >>> we
> >>> >>> >>> went
> >>> >>> >>> deeper with Spark and real-time streaming use-case gets more
> >>> >>> >>> prominent, I
> >>> >>> >>> think it is time to bring a messaging model in conjunction with
> >>> >>> >>> the
> >>> >>> >>> batch/micro-batch API that Spark is good at....akka-streams
> close
> >>> >>> >>> integration with spark micro-batching APIs looks like a great
> >>> >>> >>> direction to
> >>> >>> >>> stay in the game with Apache Flink...Spark 2.0 integrated
> >>> >>> >>> streaming
> >>> >>> >>> with
> >>> >>> >>> batch with the assumption is that micro-batching is sufficient
> to
> >>> >>> >>> run
> >>> >>> >>> SQL
> >>> >>> >>> commands on stream but do we really have time to do SQL
> >>> >>> >>> processing at
> >>> >>> >>> streaming data within 1-2 seconds ?
> >>> >>> >>>
> >>> >>> >>> After reading the email chain, I started to look into Flink
> >>> >>> >>> documentation
> >>> >>> >>> and if you compare it with Spark documentation, I think we have
> >>> >>> >>> major
> >>> >>> >>> work
> >>> >>> >>> to do detailing out Spark internals so that more people from
> >>> >>> >>> community
> >>> >>> >>> start
> >>> >>> >>> to take active role in improving the issues so that Spark stays
> >>> >>> >>> strong
> >>> >>> >>> compared to Flink.
> >>> >>> >>>
> >>> >>> >>> https://cwiki.apache.org/confluence/display/SPARK/
> Spark+Internals
> >>> >>> >>>
> >>> >>> >>> https://cwiki.apache.org/confluence/display/FLINK/
> Flink+Internals
> >>> >>> >>>
> >>> >>> >>> Spark is no longer an engine that works for micro-batch and
> >>> >>> >>> batch...We
> >>> >>> >>> (and
> >>> >>> >>> I am sure many others) are pushing spark as an engine for
> stream
> >>> >>> >>> and
> >>> >>> >>> query
> >>> >>> >>> processing.....we need to make it a state-of-the-art engine for
> >>> >>> >>> high
> >>> >>> >>> speed
> >>> >>> >>> streaming data and user queries as well !
> >>> >>> >>>
> >>> >>> >>> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda
> >>> >>> >>> <to...@outlook.com>
> >>> >>> >>> wrote:
> >>> >>> >>>>
> >>> >>> >>>> Hi everyone,
> >>> >>> >>>>
> >>> >>> >>>> I'm quite late with my answer, but I think my suggestions may
> >>> >>> >>>> help a
> >>> >>> >>>> little bit. :) Many technical and organizational topics were
> >>> >>> >>>> mentioned,
> >>> >>> >>>> but I want to focus on these negative posts about Spark and
> >>> >>> >>>> about
> >>> >>> >>>> "haters"
> >>> >>> >>>>
> >>> >>> >>>> I really like Spark. Easy of use, speed, very good community -
> >>> >>> >>>> it's
> >>> >>> >>>> everything here. But Every project has to "flight" on
> "framework
> >>> >>> >>>> market"
> >>> >>> >>>> to be still no 1. I'm following many Spark and Big Data
> >>> >>> >>>> communities,
> >>> >>> >>>> maybe my mail will inspire someone :)
> >>> >>> >>>>
> >>> >>> >>>> You (every Spark developer; so far I didn't have enough time
> to
> >>> >>> >>>> join
> >>> >>> >>>> contributing to Spark) has done excellent job. So why are some
> >>> >>> >>>> people
> >>> >>> >>>> saying that Flink (or other framework) is better, like it was
> >>> >>> >>>> posted
> >>> >>> >>>> in
> >>> >>> >>>> this mailing list? No, not because that framework is better in
> >>> >>> >>>> all
> >>> >>> >>>> cases.. In my opinion, many of these discussions where started
> >>> >>> >>>> after
> >>> >>> >>>> Flink marketing-like posts. Please look at StackOverflow
> "Flink
> >>> >>> >>>> vs
> >>> >>> >>>> ...."
> >>> >>> >>>> posts, almost every post in "winned" by Flink. Answers are
> >>> >>> >>>> sometimes
> >>> >>> >>>> saying nothing about other frameworks, Flink's users (often
> >>> >>> >>>> PMC's)
> >>> >>> >>>> are
> >>> >>> >>>> just posting same information about real-time streaming, about
> >>> >>> >>>> delta
> >>> >>> >>>> iterations, etc. It look smart and very often it is marked as
> an
> >>> >>> >>>> aswer,
> >>> >>> >>>> even if - in my opinion - there wasn't told all the truth.
> >>> >>> >>>>
> >>> >>> >>>>
> >>> >>> >>>> My suggestion: I don't have enough money and knowledgle to
> >>> >>> >>>> perform
> >>> >>> >>>> huge
> >>> >>> >>>> performance test. Maybe some company, that supports Spark
> >>> >>> >>>> (Databricks,
> >>> >>> >>>> Cloudera? - just saying you're most visible in community :) )
> >>> >>> >>>> could
> >>> >>> >>>> perform performance test of:
> >>> >>> >>>>
> >>> >>> >>>> - streaming engine - probably Spark will loose because of
> >>> >>> >>>> mini-batch
> >>> >>> >>>> model, however currently the difference should be much lower
> >>> >>> >>>> that in
> >>> >>> >>>> previous versions
> >>> >>> >>>>
> >>> >>> >>>> - Machine Learning models
> >>> >>> >>>>
> >>> >>> >>>> - batch jobs
> >>> >>> >>>>
> >>> >>> >>>> - Graph jobs
> >>> >>> >>>>
> >>> >>> >>>> - SQL queries
> >>> >>> >>>>
> >>> >>> >>>> People will see that Spark is envolving and is also a modern
> >>> >>> >>>> framework,
> >>> >>> >>>> because after reading posts mentioned above people may think
> "it
> >>> >>> >>>> is
> >>> >>> >>>> outdated, future is in framework X".
> >>> >>> >>>>
> >>> >>> >>>> Matei Zaharia posted excellent blog post about how Spark
> >>> >>> >>>> Structured
> >>> >>> >>>> Streaming beats every other framework in terms of easy-of-use
> >>> >>> >>>> and
> >>> >>> >>>> reliability. Performance tests, done in various environments
> (in
> >>> >>> >>>> example: laptop, small 2 node cluster, 10-node cluster,
> 20-node
> >>> >>> >>>> cluster), could be also very good marketing stuff to say "hey,
> >>> >>> >>>> you're
> >>> >>> >>>> telling that you're better, but Spark is still faster and is
> >>> >>> >>>> still
> >>> >>> >>>> getting even more fast!". This would be based on facts (just
> >>> >>> >>>> numbers),
> >>> >>> >>>> not opinions. It would be good for companies, for marketing
> >>> >>> >>>> puproses
> >>> >>> >>>> and
> >>> >>> >>>> for every Spark developer
> >>> >>> >>>>
> >>> >>> >>>>
> >>> >>> >>>> Second: real-time streaming. I've written some time ago about
> >>> >>> >>>> real-time
> >>> >>> >>>> streaming support in Spark Structured Streaming. Some work
> >>> >>> >>>> should be
> >>> >>> >>>> done to make SSS more low-latency, but I think it's possible.
> >>> >>> >>>> Maybe
> >>> >>> >>>> Spark may look at Gearpump, which is also built on top of
> Akka?
> >>> >>> >>>> I
> >>> >>> >>>> don't
> >>> >>> >>>> know yet, it is good topic for SIP. However I think that Spark
> >>> >>> >>>> should
> >>> >>> >>>> have real-time streaming support. Currently I see many
> >>> >>> >>>> posts/comments
> >>> >>> >>>> that "Spark has too big latency". Spark Streaming is doing
> very
> >>> >>> >>>> good
> >>> >>> >>>> jobs with micro-batches, however I think it is possible to add
> >>> >>> >>>> also
> >>> >>> >>>> more
> >>> >>> >>>> real-time processing.
> >>> >>> >>>>
> >>> >>> >>>> Other people said much more and I agree with proposal of SIP.
> >>> >>> >>>> I'm
> >>> >>> >>>> also
> >>> >>> >>>> happy that PMC's are not saying that they will not listen to
> >>> >>> >>>> users,
> >>> >>> >>>> but
> >>> >>> >>>> they really want to make Spark better for every user.
> >>> >>> >>>>
> >>> >>> >>>>
> >>> >>> >>>> What do you think about these two topics? Especially I'm
> looking
> >>> >>> >>>> at
> >>> >>> >>>> Cody
> >>> >>> >>>> (who has started this topic) and PMCs :)
> >>> >>> >>>>
> >>> >>> >>>> Pozdrawiam / Best regards,
> >>> >>> >>>>
> >>> >>> >>>> Tomasz
> >>> >>> >>>>
> >>> >>> >>>>
> >>> >>>
> >>> >>
> >>> >
> >>> >
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>
>


-- 
Ryan Blue
Software Engineer
Netflix

Re: Spark Improvement Proposals

Posted by Cody Koeninger <co...@koeninger.org>.

So there are some minor things (the Where section heading appears to
be dropped; wherever this document is posted it needs to actually link
to a jira filter showing current / past SIPs) but it doesn't look like
I can comment on the google doc.

The major substantive issue that I have is that this version is
significantly less clear as to the outcome of an SIP.

The apache example of lazy consensus at
http://apache.org/foundation/voting.html#LazyConsensus involves an
explicit announcement of an explicit deadline, which I think are
necessary for clarity.



On Mon, Nov 7, 2016 at 1:55 PM, Reynold Xin <rx...@databricks.com> wrote:
> It turned out suggested edits (trackable) don't show up for non-owners, so
> I've just merged all the edits in place. It should be visible now.
>
> On Mon, Nov 7, 2016 at 10:10 AM, Reynold Xin <rx...@databricks.com> wrote:
>>
>> Oops. Let me try figure that out.
>>
>>
>> On Monday, November 7, 2016, Cody Koeninger <co...@koeninger.org> wrote:
>>>
>>> Thanks for picking up on this.
>>>
>>> Maybe I fail at google docs, but I can't see any edits on the document
>>> you linked.
>>>
>>> Regarding lazy consensus, if the board in general has less of an issue
>>> with that, sure.  As long as it is clearly announced, lasts at least
>>> 72 hours, and has a clear outcome.
>>>
>>> The other points are hard to comment on without being able to see the
>>> text in question.
>>>
>>>
>>> On Mon, Nov 7, 2016 at 3:11 AM, Reynold Xin <rx...@databricks.com> wrote:
>>> > I just looked through the entire thread again tonight - there are a lot
>>> > of
>>> > great ideas being discussed. Thanks Cody for taking the first crack at
>>> > the
>>> > proposal.
>>> >
>>> > I want to first comment on the context. Spark is one of the most
>>> > innovative
>>> > and important projects in (big) data -- overall technical decisions
>>> > made in
>>> > Apache Spark are sound. But of course, a project as large and active as
>>> > Spark always have room for improvement, and we as a community should
>>> > strive
>>> > to take it to the next level.
>>> >
>>> > To that end, the two biggest areas for improvements in my opinion are:
>>> >
>>> > 1. Visibility: There are so much happening that it is difficult to know
>>> > what
>>> > really is going on. For people that don't follow closely, it is
>>> > difficult to
>>> > know what the important initiatives are. Even for people that do
>>> > follow, it
>>> > is difficult to know what specific things require their attention,
>>> > since the
>>> > number of pull requests and JIRA tickets are high and it's difficult to
>>> > extract signal from noise.
>>> >
>>> > 2. Solicit user (broadly defined, including developers themselves)
>>> > input
>>> > more proactively: At the end of the day the project provides value
>>> > because
>>> > users use it. Users can't tell us exactly what to build, but it is
>>> > important
>>> > to get their inputs.
>>> >
>>> >
>>> > I've taken Cody's doc and edited it:
>>> >
>>> > https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-nRanvXmnZ7SUi4qMljg/edit#heading=h.36ut37zh7w2b
>>> > (I've made all my modifications trackable)
>>> >
>>> > There are couple high level changes I made:
>>> >
>>> > 1. I've consulted a board member and he recommended lazy consensus as
>>> > opposed to voting. The reason being in voting there can easily be a
>>> > "loser'
>>> > that gets outvoted.
>>> >
>>> > 2. I made it lighter weight, and renamed "strategy" to "optional design
>>> > sketch". Echoing one of the earlier email: "IMHO so far aside from
>>> > tagging
>>> > things and linking them elsewhere simply having design docs and
>>> > prototypes
>>> > implementations in PRs is not something that has not worked so far".
>>> >
>>> > 3. I made some the language tweaks to focus more on visibility. For
>>> > example,
>>> > "The purpose of an SIP is to inform and involve", rather than just
>>> > "involve". SIPs should also have at least two emails that go to dev@.
>>> >
>>> >
>>> > While I was editing this, I thought we really needed a suggested
>>> > template
>>> > for design doc too. I will get to that too ...
>>> >
>>> >
>>> > On Tue, Nov 1, 2016 at 12:09 AM, Reynold Xin <rx...@databricks.com>
>>> > wrote:
>>> >>
>>> >> Most things looked OK to me too, although I do plan to take a closer
>>> >> look
>>> >> after Nov 1st when we cut the release branch for 2.1.
>>> >>
>>> >>
>>> >> On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin <va...@cloudera.com>
>>> >> wrote:
>>> >>>
>>> >>> The proposal looks OK to me. I assume, even though it's not
>>> >>> explicitly
>>> >>> called, that voting would happen by e-mail? A template for the
>>> >>> proposal document (instead of just a bullet nice) would also be nice,
>>> >>> but that can be done at any time.
>>> >>>
>>> >>> BTW, shameless plug: I filed SPARK-18085 which I consider a candidate
>>> >>> for a SIP, given the scope of the work. The document attached even
>>> >>> somewhat matches the proposed format. So if anyone wants to try out
>>> >>> the process...
>>> >>>
>>> >>> On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger <co...@koeninger.org>
>>> >>> wrote:
>>> >>> > Now that spark summit europe is over, are any committers interested
>>> >>> > in
>>> >>> > moving forward with this?
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>>> >>> >
>>> >>> > Or are we going to let this discussion die on the vine?
>>> >>> >
>>> >>> > On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda
>>> >>> > <to...@outlook.com> wrote:
>>> >>> >> Maybe my mail was not clear enough.
>>> >>> >>
>>> >>> >>
>>> >>> >> I didn't want to write "lets focus on Flink" or any other
>>> >>> >> framework.
>>> >>> >> The
>>> >>> >> idea with benchmarks was to show two things:
>>> >>> >>
>>> >>> >> - why some people are doing bad PR for Spark
>>> >>> >>
>>> >>> >> - how - in easy way - we can change it and show that Spark is
>>> >>> >> still on
>>> >>> >> the
>>> >>> >> top
>>> >>> >>
>>> >>> >>
>>> >>> >> No more, no less. Benchmarks will be helpful, but I don't think
>>> >>> >> they're the
>>> >>> >> most important thing in Spark :) On the Spark main page there is
>>> >>> >> still
>>> >>> >> chart
>>> >>> >> "Spark vs Hadoop". It is important to show that framework is not
>>> >>> >> the
>>> >>> >> same
>>> >>> >> Spark with other API, but much faster and optimized, comparable or
>>> >>> >> even
>>> >>> >> faster than other frameworks.
>>> >>> >>
>>> >>> >>
>>> >>> >> About real-time streaming, I think it would be just good to see it
>>> >>> >> in
>>> >>> >> Spark.
>>> >>> >> I very like current Spark model, but many voices that says "we
>>> >>> >> need
>>> >>> >> more" -
>>> >>> >> community should listen also them and try to help them. With SIPs
>>> >>> >> it
>>> >>> >> would
>>> >>> >> be easier, I've just posted this example as "thing that may be
>>> >>> >> changed
>>> >>> >> with
>>> >>> >> SIP".
>>> >>> >>
>>> >>> >>
>>> >>> >> I very like unification via Datasets, but there is a lot of
>>> >>> >> algorithms
>>> >>> >> inside - let's make easy API, but with strong background
>>> >>> >> (articles,
>>> >>> >> benchmarks, descriptions, etc) that shows that Spark is still
>>> >>> >> modern
>>> >>> >> framework.
>>> >>> >>
>>> >>> >>
>>> >>> >> Maybe now my intention will be clearer :) As I said organizational
>>> >>> >> ideas
>>> >>> >> were already mentioned and I agree with them, my mail was just to
>>> >>> >> show
>>> >>> >> some
>>> >>> >> aspects from my side, so from theside of developer and person who
>>> >>> >> is
>>> >>> >> trying
>>> >>> >> to help others with Spark (via StackOverflow or other ways)
>>> >>> >>
>>> >>> >>
>>> >>> >> Pozdrawiam / Best regards,
>>> >>> >>
>>> >>> >> Tomasz
>>> >>> >>
>>> >>> >>
>>> >>> >> ________________________________
>>> >>> >> Od: Cody Koeninger <co...@koeninger.org>
>>> >>> >> Wysłane: 17 października 2016 16:46
>>> >>> >> Do: Debasish Das
>>> >>> >> DW: Tomasz Gawęda; dev@spark.apache.org
>>> >>> >> Temat: Re: Spark Improvement Proposals
>>> >>> >>
>>> >>> >> I think narrowly focusing on Flink or benchmarks is missing my
>>> >>> >> point.
>>> >>> >>
>>> >>> >> My point is evolve or die.  Spark's governance and organization is
>>> >>> >> hampering its ability to evolve technologically, and it needs to
>>> >>> >> change.
>>> >>> >>
>>> >>> >> On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das
>>> >>> >> <de...@gmail.com>
>>> >>> >> wrote:
>>> >>> >>> Thanks Cody for bringing up a valid point...I picked up Spark in
>>> >>> >>> 2014
>>> >>> >>> as
>>> >>> >>> soon as I looked into it since compared to writing Java
>>> >>> >>> map-reduce
>>> >>> >>> and
>>> >>> >>> Cascading code, Spark made writing distributed code fun...But now
>>> >>> >>> as
>>> >>> >>> we
>>> >>> >>> went
>>> >>> >>> deeper with Spark and real-time streaming use-case gets more
>>> >>> >>> prominent, I
>>> >>> >>> think it is time to bring a messaging model in conjunction with
>>> >>> >>> the
>>> >>> >>> batch/micro-batch API that Spark is good at....akka-streams close
>>> >>> >>> integration with spark micro-batching APIs looks like a great
>>> >>> >>> direction to
>>> >>> >>> stay in the game with Apache Flink...Spark 2.0 integrated
>>> >>> >>> streaming
>>> >>> >>> with
>>> >>> >>> batch with the assumption is that micro-batching is sufficient to
>>> >>> >>> run
>>> >>> >>> SQL
>>> >>> >>> commands on stream but do we really have time to do SQL
>>> >>> >>> processing at
>>> >>> >>> streaming data within 1-2 seconds ?
>>> >>> >>>
>>> >>> >>> After reading the email chain, I started to look into Flink
>>> >>> >>> documentation
>>> >>> >>> and if you compare it with Spark documentation, I think we have
>>> >>> >>> major
>>> >>> >>> work
>>> >>> >>> to do detailing out Spark internals so that more people from
>>> >>> >>> community
>>> >>> >>> start
>>> >>> >>> to take active role in improving the issues so that Spark stays
>>> >>> >>> strong
>>> >>> >>> compared to Flink.
>>> >>> >>>
>>> >>> >>> https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
>>> >>> >>>
>>> >>> >>> https://cwiki.apache.org/confluence/display/FLINK/Flink+Internals
>>> >>> >>>
>>> >>> >>> Spark is no longer an engine that works for micro-batch and
>>> >>> >>> batch...We
>>> >>> >>> (and
>>> >>> >>> I am sure many others) are pushing spark as an engine for stream
>>> >>> >>> and
>>> >>> >>> query
>>> >>> >>> processing.....we need to make it a state-of-the-art engine for
>>> >>> >>> high
>>> >>> >>> speed
>>> >>> >>> streaming data and user queries as well !
>>> >>> >>>
>>> >>> >>> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda
>>> >>> >>> <to...@outlook.com>
>>> >>> >>> wrote:
>>> >>> >>>>
>>> >>> >>>> Hi everyone,
>>> >>> >>>>
>>> >>> >>>> I'm quite late with my answer, but I think my suggestions may
>>> >>> >>>> help a
>>> >>> >>>> little bit. :) Many technical and organizational topics were
>>> >>> >>>> mentioned,
>>> >>> >>>> but I want to focus on these negative posts about Spark and
>>> >>> >>>> about
>>> >>> >>>> "haters"
>>> >>> >>>>
>>> >>> >>>> I really like Spark. Easy of use, speed, very good community -
>>> >>> >>>> it's
>>> >>> >>>> everything here. But Every project has to "flight" on "framework
>>> >>> >>>> market"
>>> >>> >>>> to be still no 1. I'm following many Spark and Big Data
>>> >>> >>>> communities,
>>> >>> >>>> maybe my mail will inspire someone :)
>>> >>> >>>>
>>> >>> >>>> You (every Spark developer; so far I didn't have enough time to
>>> >>> >>>> join
>>> >>> >>>> contributing to Spark) has done excellent job. So why are some
>>> >>> >>>> people
>>> >>> >>>> saying that Flink (or other framework) is better, like it was
>>> >>> >>>> posted
>>> >>> >>>> in
>>> >>> >>>> this mailing list? No, not because that framework is better in
>>> >>> >>>> all
>>> >>> >>>> cases.. In my opinion, many of these discussions where started
>>> >>> >>>> after
>>> >>> >>>> Flink marketing-like posts. Please look at StackOverflow "Flink
>>> >>> >>>> vs
>>> >>> >>>> ...."
>>> >>> >>>> posts, almost every post in "winned" by Flink. Answers are
>>> >>> >>>> sometimes
>>> >>> >>>> saying nothing about other frameworks, Flink's users (often
>>> >>> >>>> PMC's)
>>> >>> >>>> are
>>> >>> >>>> just posting same information about real-time streaming, about
>>> >>> >>>> delta
>>> >>> >>>> iterations, etc. It look smart and very often it is marked as an
>>> >>> >>>> aswer,
>>> >>> >>>> even if - in my opinion - there wasn't told all the truth.
>>> >>> >>>>
>>> >>> >>>>
>>> >>> >>>> My suggestion: I don't have enough money and knowledgle to
>>> >>> >>>> perform
>>> >>> >>>> huge
>>> >>> >>>> performance test. Maybe some company, that supports Spark
>>> >>> >>>> (Databricks,
>>> >>> >>>> Cloudera? - just saying you're most visible in community :) )
>>> >>> >>>> could
>>> >>> >>>> perform performance test of:
>>> >>> >>>>
>>> >>> >>>> - streaming engine - probably Spark will loose because of
>>> >>> >>>> mini-batch
>>> >>> >>>> model, however currently the difference should be much lower
>>> >>> >>>> that in
>>> >>> >>>> previous versions
>>> >>> >>>>
>>> >>> >>>> - Machine Learning models
>>> >>> >>>>
>>> >>> >>>> - batch jobs
>>> >>> >>>>
>>> >>> >>>> - Graph jobs
>>> >>> >>>>
>>> >>> >>>> - SQL queries
>>> >>> >>>>
>>> >>> >>>> People will see that Spark is envolving and is also a modern
>>> >>> >>>> framework,
>>> >>> >>>> because after reading posts mentioned above people may think "it
>>> >>> >>>> is
>>> >>> >>>> outdated, future is in framework X".
>>> >>> >>>>
>>> >>> >>>> Matei Zaharia posted excellent blog post about how Spark
>>> >>> >>>> Structured
>>> >>> >>>> Streaming beats every other framework in terms of easy-of-use
>>> >>> >>>> and
>>> >>> >>>> reliability. Performance tests, done in various environments (in
>>> >>> >>>> example: laptop, small 2 node cluster, 10-node cluster, 20-node
>>> >>> >>>> cluster), could be also very good marketing stuff to say "hey,
>>> >>> >>>> you're
>>> >>> >>>> telling that you're better, but Spark is still faster and is
>>> >>> >>>> still
>>> >>> >>>> getting even more fast!". This would be based on facts (just
>>> >>> >>>> numbers),
>>> >>> >>>> not opinions. It would be good for companies, for marketing
>>> >>> >>>> puproses
>>> >>> >>>> and
>>> >>> >>>> for every Spark developer
>>> >>> >>>>
>>> >>> >>>>
>>> >>> >>>> Second: real-time streaming. I've written some time ago about
>>> >>> >>>> real-time
>>> >>> >>>> streaming support in Spark Structured Streaming. Some work
>>> >>> >>>> should be
>>> >>> >>>> done to make SSS more low-latency, but I think it's possible.
>>> >>> >>>> Maybe
>>> >>> >>>> Spark may look at Gearpump, which is also built on top of Akka?
>>> >>> >>>> I
>>> >>> >>>> don't
>>> >>> >>>> know yet, it is good topic for SIP. However I think that Spark
>>> >>> >>>> should
>>> >>> >>>> have real-time streaming support. Currently I see many
>>> >>> >>>> posts/comments
>>> >>> >>>> that "Spark has too big latency". Spark Streaming is doing very
>>> >>> >>>> good
>>> >>> >>>> jobs with micro-batches, however I think it is possible to add
>>> >>> >>>> also
>>> >>> >>>> more
>>> >>> >>>> real-time processing.
>>> >>> >>>>
>>> >>> >>>> Other people said much more and I agree with proposal of SIP.
>>> >>> >>>> I'm
>>> >>> >>>> also
>>> >>> >>>> happy that PMC's are not saying that they will not listen to
>>> >>> >>>> users,
>>> >>> >>>> but
>>> >>> >>>> they really want to make Spark better for every user.
>>> >>> >>>>
>>> >>> >>>>
>>> >>> >>>> What do you think about these two topics? Especially I'm looking
>>> >>> >>>> at
>>> >>> >>>> Cody
>>> >>> >>>> (who has started this topic) and PMCs :)
>>> >>> >>>>
>>> >>> >>>> Pozdrawiam / Best regards,
>>> >>> >>>>
>>> >>> >>>> Tomasz
>>> >>> >>>>
>>> >>> >>>>
>>> >>>
>>> >>
>>> >
>>> >
>
>

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org

Re: Spark Improvement Proposals

Posted by Reynold Xin <rx...@databricks.com>.

It turned out suggested edits (trackable) don't show up for non-owners, so
I've just merged all the edits in place. It should be visible now.

On Mon, Nov 7, 2016 at 10:10 AM, Reynold Xin <rx...@databricks.com> wrote:

> Oops. Let me try figure that out.
>
>
> On Monday, November 7, 2016, Cody Koeninger <co...@koeninger.org> wrote:
>
>> Thanks for picking up on this.
>>
>> Maybe I fail at google docs, but I can't see any edits on the document
>> you linked.
>>
>> Regarding lazy consensus, if the board in general has less of an issue
>> with that, sure.  As long as it is clearly announced, lasts at least
>> 72 hours, and has a clear outcome.
>>
>> The other points are hard to comment on without being able to see the
>> text in question.
>>
>>
>> On Mon, Nov 7, 2016 at 3:11 AM, Reynold Xin <rx...@databricks.com> wrote:
>> > I just looked through the entire thread again tonight - there are a lot
>> of
>> > great ideas being discussed. Thanks Cody for taking the first crack at
>> the
>> > proposal.
>> >
>> > I want to first comment on the context. Spark is one of the most
>> innovative
>> > and important projects in (big) data -- overall technical decisions
>> made in
>> > Apache Spark are sound. But of course, a project as large and active as
>> > Spark always have room for improvement, and we as a community should
>> strive
>> > to take it to the next level.
>> >
>> > To that end, the two biggest areas for improvements in my opinion are:
>> >
>> > 1. Visibility: There are so much happening that it is difficult to know
>> what
>> > really is going on. For people that don't follow closely, it is
>> difficult to
>> > know what the important initiatives are. Even for people that do
>> follow, it
>> > is difficult to know what specific things require their attention,
>> since the
>> > number of pull requests and JIRA tickets are high and it's difficult to
>> > extract signal from noise.
>> >
>> > 2. Solicit user (broadly defined, including developers themselves) input
>> > more proactively: At the end of the day the project provides value
>> because
>> > users use it. Users can't tell us exactly what to build, but it is
>> important
>> > to get their inputs.
>> >
>> >
>> > I've taken Cody's doc and edited it:
>> > https://docs.google.com/document/d/1-Zdi_W-wtuxS9hTK0P9qb2x-
>> nRanvXmnZ7SUi4qMljg/edit#heading=h.36ut37zh7w2b
>> > (I've made all my modifications trackable)
>> >
>> > There are couple high level changes I made:
>> >
>> > 1. I've consulted a board member and he recommended lazy consensus as
>> > opposed to voting. The reason being in voting there can easily be a
>> "loser'
>> > that gets outvoted.
>> >
>> > 2. I made it lighter weight, and renamed "strategy" to "optional design
>> > sketch". Echoing one of the earlier email: "IMHO so far aside from
>> tagging
>> > things and linking them elsewhere simply having design docs and
>> prototypes
>> > implementations in PRs is not something that has not worked so far".
>> >
>> > 3. I made some the language tweaks to focus more on visibility. For
>> example,
>> > "The purpose of an SIP is to inform and involve", rather than just
>> > "involve". SIPs should also have at least two emails that go to dev@.
>> >
>> >
>> > While I was editing this, I thought we really needed a suggested
>> template
>> > for design doc too. I will get to that too ...
>> >
>> >
>> > On Tue, Nov 1, 2016 at 12:09 AM, Reynold Xin <rx...@databricks.com>
>> wrote:
>> >>
>> >> Most things looked OK to me too, although I do plan to take a closer
>> look
>> >> after Nov 1st when we cut the release branch for 2.1.
>> >>
>> >>
>> >> On Mon, Oct 31, 2016 at 3:12 PM, Marcelo Vanzin <va...@cloudera.com>
>> >> wrote:
>> >>>
>> >>> The proposal looks OK to me. I assume, even though it's not explicitly
>> >>> called, that voting would happen by e-mail? A template for the
>> >>> proposal document (instead of just a bullet nice) would also be nice,
>> >>> but that can be done at any time.
>> >>>
>> >>> BTW, shameless plug: I filed SPARK-18085 which I consider a candidate
>> >>> for a SIP, given the scope of the work. The document attached even
>> >>> somewhat matches the proposed format. So if anyone wants to try out
>> >>> the process...
>> >>>
>> >>> On Mon, Oct 31, 2016 at 10:34 AM, Cody Koeninger <co...@koeninger.org>
>> >>> wrote:
>> >>> > Now that spark summit europe is over, are any committers interested
>> in
>> >>> > moving forward with this?
>> >>> >
>> >>> >
>> >>> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-i
>> mprovement-proposals.md
>> >>> >
>> >>> > Or are we going to let this discussion die on the vine?
>> >>> >
>> >>> > On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda
>> >>> > <to...@outlook.com> wrote:
>> >>> >> Maybe my mail was not clear enough.
>> >>> >>
>> >>> >>
>> >>> >> I didn't want to write "lets focus on Flink" or any other
>> framework.
>> >>> >> The
>> >>> >> idea with benchmarks was to show two things:
>> >>> >>
>> >>> >> - why some people are doing bad PR for Spark
>> >>> >>
>> >>> >> - how - in easy way - we can change it and show that Spark is
>> still on
>> >>> >> the
>> >>> >> top
>> >>> >>
>> >>> >>
>> >>> >> No more, no less. Benchmarks will be helpful, but I don't think
>> >>> >> they're the
>> >>> >> most important thing in Spark :) On the Spark main page there is
>> still
>> >>> >> chart
>> >>> >> "Spark vs Hadoop". It is important to show that framework is not
>> the
>> >>> >> same
>> >>> >> Spark with other API, but much faster and optimized, comparable or
>> >>> >> even
>> >>> >> faster than other frameworks.
>> >>> >>
>> >>> >>
>> >>> >> About real-time streaming, I think it would be just good to see it
>> in
>> >>> >> Spark.
>> >>> >> I very like current Spark model, but many voices that says "we need
>> >>> >> more" -
>> >>> >> community should listen also them and try to help them. With SIPs
>> it
>> >>> >> would
>> >>> >> be easier, I've just posted this example as "thing that may be
>> changed
>> >>> >> with
>> >>> >> SIP".
>> >>> >>
>> >>> >>
>> >>> >> I very like unification via Datasets, but there is a lot of
>> algorithms
>> >>> >> inside - let's make easy API, but with strong background (articles,
>> >>> >> benchmarks, descriptions, etc) that shows that Spark is still
>> modern
>> >>> >> framework.
>> >>> >>
>> >>> >>
>> >>> >> Maybe now my intention will be clearer :) As I said organizational
>> >>> >> ideas
>> >>> >> were already mentioned and I agree with them, my mail was just to
>> show
>> >>> >> some
>> >>> >> aspects from my side, so from theside of developer and person who
>> is
>> >>> >> trying
>> >>> >> to help others with Spark (via StackOverflow or other ways)
>> >>> >>
>> >>> >>
>> >>> >> Pozdrawiam / Best regards,
>> >>> >>
>> >>> >> Tomasz
>> >>> >>
>> >>> >>
>> >>> >> ________________________________
>> >>> >> Od: Cody Koeninger <co...@koeninger.org>
>> >>> >> Wysłane: 17 października 2016 16:46
>> >>> >> Do: Debasish Das
>> >>> >> DW: Tomasz Gawęda; dev@spark.apache.org
>> >>> >> Temat: Re: Spark Improvement Proposals
>> >>> >>
>> >>> >> I think narrowly focusing on Flink or benchmarks is missing my
>> point.
>> >>> >>
>> >>> >> My point is evolve or die.  Spark's governance and organization is
>> >>> >> hampering its ability to evolve technologically, and it needs to
>> >>> >> change.
>> >>> >>
>> >>> >> On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das
>> >>> >> <de...@gmail.com>
>> >>> >> wrote:
>> >>> >>> Thanks Cody for bringing up a valid point...I picked up Spark in
>> 2014
>> >>> >>> as
>> >>> >>> soon as I looked into it since compared to writing Java map-reduce
>> >>> >>> and
>> >>> >>> Cascading code, Spark made writing distributed code fun...But now
>> as
>> >>> >>> we
>> >>> >>> went
>> >>> >>> deeper with Spark and real-time streaming use-case gets more
>> >>> >>> prominent, I
>> >>> >>> think it is time to bring a messaging model in conjunction with
>> the
>> >>> >>> batch/micro-batch API that Spark is good at....akka-streams close
>> >>> >>> integration with spark micro-batching APIs looks like a great
>> >>> >>> direction to
>> >>> >>> stay in the game with Apache Flink...Spark 2.0 integrated
>> streaming
>> >>> >>> with
>> >>> >>> batch with the assumption is that micro-batching is sufficient to
>> run
>> >>> >>> SQL
>> >>> >>> commands on stream but do we really have time to do SQL
>> processing at
>> >>> >>> streaming data within 1-2 seconds ?
>> >>> >>>
>> >>> >>> After reading the email chain, I started to look into Flink
>> >>> >>> documentation
>> >>> >>> and if you compare it with Spark documentation, I think we have
>> major
>> >>> >>> work
>> >>> >>> to do detailing out Spark internals so that more people from
>> >>> >>> community
>> >>> >>> start
>> >>> >>> to take active role in improving the issues so that Spark stays
>> >>> >>> strong
>> >>> >>> compared to Flink.
>> >>> >>>
>> >>> >>> https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
>> >>> >>>
>> >>> >>> https://cwiki.apache.org/confluence/display/FLINK/Flink+Internals
>> >>> >>>
>> >>> >>> Spark is no longer an engine that works for micro-batch and
>> >>> >>> batch...We
>> >>> >>> (and
>> >>> >>> I am sure many others) are pushing spark as an engine for stream
>> and
>> >>> >>> query
>> >>> >>> processing.....we need to make it a state-of-the-art engine for
>> high
>> >>> >>> speed
>> >>> >>> streaming data and user queries as well !
>> >>> >>>
>> >>> >>> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda
>> >>> >>> <to...@outlook.com>
>> >>> >>> wrote:
>> >>> >>>>
>> >>> >>>> Hi everyone,
>> >>> >>>>
>> >>> >>>> I'm quite late with my answer, but I think my suggestions may
>> help a
>> >>> >>>> little bit. :) Many technical and organizational topics were
>> >>> >>>> mentioned,
>> >>> >>>> but I want to focus on these negative posts about Spark and about
>> >>> >>>> "haters"
>> >>> >>>>
>> >>> >>>> I really like Spark. Easy of use, speed, very good community -
>> it's
>> >>> >>>> everything here. But Every project has to "flight" on "framework
>> >>> >>>> market"
>> >>> >>>> to be still no 1. I'm following many Spark and Big Data
>> communities,
>> >>> >>>> maybe my mail will inspire someone :)
>> >>> >>>>
>> >>> >>>> You (every Spark developer; so far I didn't have enough time to
>> join
>> >>> >>>> contributing to Spark) has done excellent job. So why are some
>> >>> >>>> people
>> >>> >>>> saying that Flink (or other framework) is better, like it was
>> posted
>> >>> >>>> in
>> >>> >>>> this mailing list? No, not because that framework is better in
>> all
>> >>> >>>> cases.. In my opinion, many of these discussions where started
>> after
>> >>> >>>> Flink marketing-like posts. Please look at StackOverflow "Flink
>> vs
>> >>> >>>> ...."
>> >>> >>>> posts, almost every post in "winned" by Flink. Answers are
>> sometimes
>> >>> >>>> saying nothing about other frameworks, Flink's users (often
>> PMC's)
>> >>> >>>> are
>> >>> >>>> just posting same information about real-time streaming, about
>> delta
>> >>> >>>> iterations, etc. It look smart and very often it is marked as an
>> >>> >>>> aswer,
>> >>> >>>> even if - in my opinion - there wasn't told all the truth.
>> >>> >>>>
>> >>> >>>>
>> >>> >>>> My suggestion: I don't have enough money and knowledgle to
>> perform
>> >>> >>>> huge
>> >>> >>>> performance test. Maybe some company, that supports Spark
>> >>> >>>> (Databricks,
>> >>> >>>> Cloudera? - just saying you're most visible in community :) )
>> could
>> >>> >>>> perform performance test of:
>> >>> >>>>
>> >>> >>>> - streaming engine - probably Spark will loose because of
>> mini-batch
>> >>> >>>> model, however currently the difference should be much lower
>> that in
>> >>> >>>> previous versions
>> >>> >>>>
>> >>> >>>> - Machine Learning models
>> >>> >>>>
>> >>> >>>> - batch jobs
>> >>> >>>>
>> >>> >>>> - Graph jobs
>> >>> >>>>
>> >>> >>>> - SQL queries
>> >>> >>>>
>> >>> >>>> People will see that Spark is envolving and is also a modern
>> >>> >>>> framework,
>> >>> >>>> because after reading posts mentioned above people may think "it
>> is
>> >>> >>>> outdated, future is in framework X".
>> >>> >>>>
>> >>> >>>> Matei Zaharia posted excellent blog post about how Spark
>> Structured
>> >>> >>>> Streaming beats every other framework in terms of easy-of-use and
>> >>> >>>> reliability. Performance tests, done in various environments (in
>> >>> >>>> example: laptop, small 2 node cluster, 10-node cluster, 20-node
>> >>> >>>> cluster), could be also very good marketing stuff to say "hey,
>> >>> >>>> you're
>> >>> >>>> telling that you're better, but Spark is still faster and is
>> still
>> >>> >>>> getting even more fast!". This would be based on facts (just
>> >>> >>>> numbers),
>> >>> >>>> not opinions. It would be good for companies, for marketing
>> puproses
>> >>> >>>> and
>> >>> >>>> for every Spark developer
>> >>> >>>>
>> >>> >>>>
>> >>> >>>> Second: real-time streaming. I've written some time ago about
>> >>> >>>> real-time
>> >>> >>>> streaming support in Spark Structured Streaming. Some work
>> should be
>> >>> >>>> done to make SSS more low-latency, but I think it's possible.
>> Maybe
>> >>> >>>> Spark may look at Gearpump, which is also built on top of Akka? I
>> >>> >>>> don't
>> >>> >>>> know yet, it is good topic for SIP. However I think that Spark
>> >>> >>>> should
>> >>> >>>> have real-time streaming support. Currently I see many
>> >>> >>>> posts/comments
>> >>> >>>> that "Spark has too big latency". Spark Streaming is doing very
>> good
>> >>> >>>> jobs with micro-batches, however I think it is possible to add
>> also
>> >>> >>>> more
>> >>> >>>> real-time processing.
>> >>> >>>>
>> >>> >>>> Other people said much more and I agree with proposal of SIP. I'm
>> >>> >>>> also
>> >>> >>>> happy that PMC's are not saying that they will not listen to
>> users,
>> >>> >>>> but
>> >>> >>>> they really want to make Spark better for every user.
>> >>> >>>>
>> >>> >>>>
>> >>> >>>> What do you think about these two topics? Especially I'm looking
>> at
>> >>> >>>> Cody
>> >>> >>>> (who has started this topic) and PMCs :)
>> >>> >>>>
>> >>> >>>> Pozdrawiam / Best regards,
>> >>> >>>>
>> >>> >>>> Tomasz
>> >>> >>>>
>> >>> >>>>
>> >>>
>> >>
>> >
>> >
>>
>