Posted to dev@spark.apache.org by "andymhuang (黄明)" <an...@tencent.com> on 2016/10/17 07:01:01 UTC

Re: Spark Improvement Proposals(Internet mail)

There’s no need to compare to Flink’s Streaming Model. Spark should focus more on how to go beyond itself.


From the beginning, Spark’s success has come from its unified model, which can satisfy SQL, streaming, machine learning and graph jobs all in one. But from 1.6 to 2.0, the move in abstraction from RDD to DataFrame brought no substantial progress to two of these important areas (ML and graph). Most of the work was for SQL and streaming, which forces Spark to compete with Flink. But guys, this is not supposed to be the battle Spark should be fighting.


SIP is a good start. Voices from technical discussions should be heard and accepted, not buried in PR bodies. Nowadays, Spark doesn’t lack committers or contributors. The right direction and focus areas will decide where it goes, which competitors it encounters, and finally what it can become.

---------------
Sincerely
Andy

 Original Message
From: Debasish Das<de...@gmail.com>
To: Tomasz Gawęda<to...@outlook.com>
Cc: dev@spark.apache.org<de...@spark.apache.org>; Cody Koeninger<co...@koeninger.org>
Sent: Monday, October 17, 2016 10:21
Subject: Re: Spark Improvement Proposals(Internet mail)

Thanks Cody for bringing up a valid point... I picked up Spark in 2014 as soon as I looked into it, since compared to writing Java map-reduce and Cascading code, Spark made writing distributed code fun... But now that we have gone deeper with Spark and the real-time streaming use case is getting more prominent, I think it is time to bring in a messaging model in conjunction with the batch/micro-batch API that Spark is good at... Close integration of akka-streams with Spark's micro-batching APIs looks like a great direction to stay in the game with Apache Flink... Spark 2.0 integrated streaming with batch on the assumption that micro-batching is sufficient to run SQL commands on a stream, but do we really have time to do SQL processing on streaming data within 1-2 seconds?

After reading the email chain, I started to look into the Flink documentation, and if you compare it with the Spark documentation, I think we have major work to do in detailing Spark internals, so that more people from the community start to take an active role in fixing issues and Spark stays strong compared to Flink.

https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals

https://cwiki.apache.org/confluence/display/FLINK/Flink+Internals

Spark is no longer just an engine that works for micro-batch and batch... We (and I am sure many others) are pushing Spark as an engine for stream and query processing... We need to make it a state-of-the-art engine for high-speed streaming data and user queries as well!

On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda <to...@outlook.com> wrote:
Hi everyone,

I'm quite late with my answer, but I think my suggestions may help a
little bit. :) Many technical and organizational topics were mentioned,
but I want to focus on the negative posts about Spark and on the "haters".

I really like Spark. Ease of use, speed, a very good community - it has
everything. But every project has to "fight" on the "framework market"
to stay no. 1. I'm following many Spark and Big Data communities;
maybe my mail will inspire someone :)

You (every Spark developer; so far I haven't had enough time to start
contributing to Spark) have done an excellent job. So why are some people
saying that Flink (or another framework) is better, as was posted on
this mailing list? No, not because that framework is better in all
cases. In my opinion, many of these discussions were started after
marketing-like posts about Flink. Please look at the StackOverflow
"Flink vs ...." posts: almost every one is "won" by Flink. Answers
sometimes say nothing about other frameworks; Flink's users (often PMC
members) are just posting the same information about real-time streaming,
about delta iterations, etc. It looks smart and is very often marked as
the answer, even if - in my opinion - the whole truth wasn't told.


My suggestion: I don't have enough money and knowledge to perform a huge
performance test. Maybe some company that supports Spark (Databricks,
Cloudera? - just saying you're the most visible in the community :) )
could perform performance tests of:

- streaming engine - Spark will probably lose because of its micro-batch
model; however, the difference should currently be much lower than in
previous versions

- Machine Learning models

- batch jobs

- Graph jobs

- SQL queries

People will see that Spark is evolving and is also a modern framework,
because after reading the posts mentioned above people may think "it is
outdated, the future is in framework X".

Matei Zaharia posted an excellent blog post about how Spark Structured
Streaming beats every other framework in terms of ease of use and
reliability. Performance tests done in various environments (for
example: a laptop, a small 2-node cluster, a 10-node cluster, a 20-node
cluster) could also be very good marketing material to say "hey, you're
saying that you're better, but Spark is still faster and is still
getting even faster!". This would be based on facts (just numbers),
not opinions. It would be good for companies, for marketing purposes and
for every Spark developer.


Second: real-time streaming. Some time ago I wrote about real-time
streaming support in Spark Structured Streaming. Some work needs to be
done to make SSS lower-latency, but I think it's possible. Maybe
Spark could look at Gearpump, which is also built on top of Akka? I don't
know yet; it is a good topic for a SIP. However, I think that Spark should
have real-time streaming support. Currently I see many posts/comments
saying that "Spark has too much latency". Spark Streaming does a very good
job with micro-batches, but I think it is possible to also add more
real-time processing.

Other people have said much more, and I agree with the SIP proposal. I'm
also happy that the PMC members are not saying that they will not listen
to users, but really want to make Spark better for every user.


What do you think about these two topics? I'm looking especially at Cody
(who started this topic) and the PMC members :)

Pozdrawiam / Best regards,

Tomasz


On 2016-10-07 at 04:51, Cody Koeninger wrote:
> I love Spark.  3 or 4 years ago it was the first distributed computing
> environment that felt usable, and the community was welcoming.
>
> But I just got back from the Reactive Summit, and this is what I observed:
>
> - Industry leaders on stage making fun of Spark's streaming model
> - Open source project leaders saying they looked at Spark's governance
> as a model to avoid
> - Users saying they chose Flink because it was technically superior
> and they couldn't get any answers on the Spark mailing lists
>
> Whether you agree with the substance of any of this, when this stuff
> gets repeated enough people will believe it.
>
> Right now Spark is suffering from its own success, and I think
> something needs to change.
>
> - We need a clear process for planning significant changes to the codebase.
> I'm not saying you need to adopt Kafka Improvement Proposals exactly,
> but you need a documented process with a clear outcome (e.g. a vote).
> Passing around google docs after an implementation has largely been
> decided on doesn't cut it.
>
> - All technical communication needs to be public.
> Things getting decided in private chat, or when 1/3 of the committers
> work for the same company and can just talk to each other...
> Yes, it's convenient, but it's ultimately detrimental to the health of
> the project.
> The way structured streaming has played out has shown that there are
> significant technical blind spots (myself included).
> One way to address that is to get the people who have domain knowledge
> involved, and listen to them.
>
> - We need more committers, and more committer diversity.
> Per committer there are, what, more than 20 contributors and 10 new
> jira tickets a month?  It's too much.
> There are people (I am _not_ referring to myself) who have been around
> for years, contributed thousands of lines of code, helped educate the
> public around Spark... and yet are never going to be voted in.
>
> - We need a clear process for managing volunteer work.
> Too many tickets sit around unowned, unclosed, uncertain.
> If someone proposed something and it isn't up to snuff, tell them and
> close it.  It may be blunt, but it's clearer than "silent no".
> If someone wants to work on something, let them own the ticket and set
> a deadline. If they don't meet it, close it or reassign it.
>
> This is not me putting on an Apache Bureaucracy hat.  This is me
> saying, as a fellow hacker and loyal dissenter, something is wrong
> with the culture and process.
>
> Please, let's change it.
>