Posted to dev@spark.apache.org by Marcelo Vanzin <va...@cloudera.com> on 2016/03/17 19:14:49 UTC

SPARK-13843 and future of streaming backends

Hello all,

Recently a lot of the streaming backends were moved to a separate
project on github and removed from the main Spark repo.

While I think the idea is great, I'm a little worried about the
execution. Some concerns were already raised on the bug mentioned
above, but I'd like to have a more explicit discussion about this so
things don't fall through the cracks.

Mainly I have three concerns.

i. Ownership

That code used to be run by the ASF, but now it's hosted in a github
repo not owned by the ASF. That sounds a little sub-optimal, if not
problematic.

ii. Governance

Similar to the above; who has commit access to the above repos? Will
all the Spark committers, present and future, have commit access to
all of those repos? Are they still going to be considered part of
Spark and have release management done through the Spark community?


For both of the questions above, why are they not turned into
sub-projects of Spark and hosted on the ASF repos? I believe there is
a mechanism to do that, without the need to keep the code in the main
Spark repo, right?

iii. Usability

This is another thing I don't see discussed. For Scala-based code
things don't change much, I guess, if the artifact names don't change
(another reason to keep things in the ASF?), but what about python?
How are pyspark users expected to get that code going forward, since
it's not in Spark's pyspark.zip anymore?


Is there an easy way of keeping these things within the ASF Spark
project? I think that would be better for everybody.

-- 
Marcelo

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: SPARK-13843 and future of streaming backends

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Hi Marcelo,

I quickly discussed with Reynold this morning about this.

I share your concerns.

I fully understand that it's painful for users to wait for a Spark
release to include a fix in a streaming backend, as such fixes are not
really tied to Spark core.
It makes sense to provide backends "outside" of the ASF, especially for
legal reasons: it's what we do at Camel with Camel-Extra.

Don't you think it could be interesting to have another ASF git repo
dedicated to streaming backends, where each backend can manage its
release cycle following the ASF "rules" (staging, vote, ...)?

Regards
JB


-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com



Re: SPARK-13843 and future of streaming backends

Posted by Marcelo Vanzin <va...@cloudera.com>.
On Thu, Mar 17, 2016 at 12:01 PM, Cody Koeninger <co...@koeninger.org> wrote:
> i.  An ASF project can clearly decide that some of its code is no
> longer worth maintaining and delete it.  This isn't really any
> different. It's still apache licensed so ultimately whoever wants the
> code can get it.

Absolutely. But I don't remember this being discussed either way. Was
the intention, as you mention later, just to decouple the release of
those components from the main Spark release, or to completely disown
that code?

If the latter, is the ASF ok with it still retaining the current
package and artifact names? Changing those would break backwards
compatibility. Which is why I believe that keeping them as a
sub-project, even if their release cadence is much slower, would be a
better solution for both developers and users.
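To make the compatibility concern concrete, here is roughly what changes for a user pulling one of the moved connectors at submit time (a sketch: the MQTT coordinates below are the real Spark 1.6-era artifacts, but the renamed coordinates are hypothetical, since nothing had been decided at this point in the thread):

```shell
# Spark 1.6-era usage: the connector resolves via its ASF coordinates.
spark-submit \
  --packages org.apache.spark:spark-streaming-mqtt_2.10:1.6.1 \
  job.py

# If the code leaves the ASF, the org.apache.spark group id and package
# names must change, e.g. (hypothetical coordinates):
#   --packages org.example.spark-extras:spark-streaming-mqtt_2.10:2.0.0
# which breaks every existing build file and import statement.
```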

> ii.  I think part of the rationale is to not tie release management to
> Spark, so it can proceed on a schedule that makes sense.  I'm fine
> with helping out with release management for the Kafka subproject, for
> instance.  I agree that practical governance questions need to be
> worked out.
>
> iii.  How is this any different from how python users get access to
> any other third party Spark package?

True, but that requires the modules to be published somewhere, not
just to live as a bunch of .py files in a github repo. Basically, I'm
worried that there's work to be done to keep those modules working in
this new environment - how to build, test, and publish things, remove
potential uses of internal Spark APIs, just to cite a couple of
things.
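For context, the distribution channel Cody alludes to works well for JVM connectors published to Maven Central, but not for unpublished Python modules (a sketch: the Kafka coordinates are the real Spark 1.6-era artifacts, while mqtt.py is a hypothetical stand-in for the moved Python helpers):

```shell
# A published JVM connector is one flag away for any user:
spark-submit \
  --packages org.apache.spark:spark-streaming-kafka_2.10:1.6.1 \
  job.py

# Pure-Python helpers that only exist as .py files in a github repo
# have no such channel; every user must fetch and ship them by hand:
spark-submit --py-files mqtt.py job.py
```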

-- 
Marcelo



Re: SPARK-13843 and future of streaming backends

Posted by Cody Koeninger <co...@koeninger.org>.
i.  An ASF project can clearly decide that some of its code is no
longer worth maintaining and delete it.  This isn't really any
different. It's still apache licensed so ultimately whoever wants the
code can get it.

ii.  I think part of the rationale is to not tie release management to
Spark, so it can proceed on a schedule that makes sense.  I'm fine
with helping out with release management for the Kafka subproject, for
instance.  I agree that practical governance questions need to be
worked out.

iii.  How is this any different from how python users get access to
any other third party Spark package?





Re: SPARK-13843 and future of streaming backends

Posted by Sean Owen <so...@cloudera.com>.
Code can be removed from an ASF project.
That code can live on elsewhere (in accordance with the license)

It can't be presented as part of the official ASF project, like any
other 3rd party project
The package name certainly must change from org.apache.spark

I don't know of a protocol, but common sense dictates a good-faith
effort to offer equivalent access to the code (e.g. interested
committers should probably be repo owners too.)

This differs from "any other code deletion" in that there's an intent
to keep working on the code but outside the project.
More discussion -- like this one -- would have been useful beforehand,
but nothing here is irreversible.

Backwards-compatibility is not a good reason for things, because we're
talking about Spark 2.x, and we're already talking about distributing
the code differently.

Is the reason for this change decoupling releases? or changing governance?
Seems like the former, but we don't actually need the latter to achieve that.
There's an argument for a new repo, but this is not an argument for
moving X out of the project per se

I'm sure doing this in the ASF is more overhead, but if changing
governance is a non-goal, there's no choice.
Convenience can't trump that.

Kafka integration is clearly more important than the others.
It seems to need to stay within the project.
However this still leaves a packaging problem to solve, which might
need a new repo. This is orthogonal.


Here's what I think:

1. Leave the moved modules outside the project entirely
  (why not Kinesis though? that one was not made clear)
2. Change package names and make sure it's clearly presented as external
3. Add any committers that want to be repo owners as owners
4. Keep Kafka within the project
5. Add some subproject within the current project as needed to
accomplish distribution goals




Re: SPARK-13843 and future of streaming backends

Posted by Imran Rashid <ir...@cloudera.com>.
On Thu, Mar 17, 2016 at 2:55 PM, Cody Koeninger <co...@koeninger.org> wrote:

> Why would a PMC vote be necessary on every code deletion?
>

Certainly PMC votes are not necessary on *every* code deletion.  I don't
think there is a very clear rule on when such discussion is warranted, just
a soft expectation that committers understand which changes require more
discussion before getting merged.  I believe the only formal requirement
for a PMC vote is when there is a release.  But I think as a community we'd
much rather deal with these issues ahead of time, rather than having
contentious discussions around releases because some are strongly opposed
to changes that have already been merged.

I'm all for the idea of removing these modules in general (for all of the
reasons already mentioned), but it seems that there are important questions
about how the new packages get distributed and how they are managed that
merit further discussion.

I'm somewhat torn on the question of sub-project vs. independent, and
how it's governed.  I think Steve has summarized the tradeoffs very well.  I
do want to emphasize, though, that if they are entirely external from the
ASF, the artifact ids and the package names must change at the very least.

Re: SPARK-13843 and future of streaming backends

Posted by Marcelo Vanzin <va...@cloudera.com>.
Note the non-kafka bug was filed right before the change was pushed.
So there really wasn't any discussion before the decision was made to
remove that code.

I'm just trying to merge both discussions here in the list where it's
a little bit more dynamic than bug updates that end up getting lost in
the noise.

On Thu, Mar 17, 2016 at 12:55 PM, Cody Koeninger <co...@koeninger.org> wrote:
> Why would a PMC vote be necessary on every code deletion?
>
> There was a Jira and pull request discussion about the submodules that
> have been removed so far.
>
> https://issues.apache.org/jira/browse/SPARK-13843
>
> There's another ongoing one about Kafka specifically
>
> https://issues.apache.org/jira/browse/SPARK-13877
>
>
> On Thu, Mar 17, 2016 at 2:49 PM, Mridul Muralidharan <mr...@gmail.com> wrote:
>>
>> I was not aware of a discussion in Dev list about this - agree with most of
>> the observations.
>> In addition, I did not see PMC signoff on moving (sub-)modules out.
>>
>> Regards
>> Mridul



-- 
Marcelo



Re: SPARK-13843 and future of streaming backends

Posted by Marcelo Vanzin <va...@cloudera.com>.
Also, just wanted to point out something:

On Thu, Mar 17, 2016 at 2:18 PM, Reynold Xin <rx...@databricks.com> wrote:
> Thanks for initiating this discussion. I merged the pull request because it
> was unblocking another major piece of work for Spark 2.0: not requiring
> assembly jars

While I do agree that's more important, the streaming assemblies
weren't really blocking that work. The fact that there are still
streaming assemblies in the build kinda proves that point. :-)

I even filed a task to look at getting rid of the streaming assemblies
(SPARK-13575; just the assemblies though, not the code), but while
working on it I found it would be more complicated than expected, and
decided against it, given that it didn't really affect work on the
other assemblies.

-- 
Marcelo



Re: SPARK-13843 and future of streaming backends

Posted by Luciano Resende <lu...@gmail.com>.
If the intention is to actually decouple these connectors and give them
a life of their own, I would have expected them to still be hosted as
separate git repositories inside Apache; users would not really see much
difference, as the repositories would still be mirrored on GitHub. This
also makes things much easier on the legal departments of upstream
consumers and customers, because the code would still follow the
well-received and trusted Apache governance and release policies. As for
implementation details, we could have multiple repositories if we expect
a lot of fragmented releases, or a single "connectors" repository, which
on our side would make administration easier.

On Thu, Mar 17, 2016 at 2:33 PM, Marcelo Vanzin <va...@cloudera.com> wrote:

> Hi Reynold, thanks for the info.
>
> On Thu, Mar 17, 2016 at 2:18 PM, Reynold Xin <rx...@databricks.com> wrote:
> > If one really feels strongly that we should go through all the overhead
> to
> > setup an ASF subproject for these modules that won't work with the new
> > structured streaming, and want to spearhead to setup separate repos
> > (preferably one subproject per connector), CI, separate JIRA, governance,
> > READMEs, voting, we can discuss that. Until then, I'd keep the github
> option
> > open because IMHO it is what works the best for end users (including
> > discoverability, issue tracking, release publishing, ...).
>

Agree that there might be a little overhead, but there are ways to
minimize it, and I am sure there are volunteers willing to help in favor
of having a more unified project. Breaking things into multiple
projects, and having to manage the matrix of supported versions, would
be a far worse overhead.


>
> For those of us who are not exactly familiar with the inner workings
> of administrating ASF projects, would you mind explaining in more
> detail what this overhead is?
>
> From my naive point of view, when I say "sub project" I assume that
> it's a simple as having a separate git repo for it, tied to the same
> parent project. Everything else - JIRA, committers, bylaws, etc -
> remains the same. And since the project we're talking about are very
> small, CI should be very simple (Travis?) and, assuming sporadic
> releases, things overall should not be that expensive to maintain.
>
>
Subprojects, or even sending this back to the incubator as a
"connectors" project, would be better than a public github per package,
in my opinion.



Now, if this move is signaling to users that the Streaming API as in 1.x
is going away in favor of the new structured streaming APIs, then I
guess this is a completely different discussion.


-- 
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/

Re: SPARK-13843 and future of streaming backends

Posted by Luciano Resende <lu...@gmail.com>.
On Fri, Mar 18, 2016 at 10:07 AM, Marcelo Vanzin <va...@cloudera.com>
wrote:

> Hi Steve, thanks for the write up.
>
> On Fri, Mar 18, 2016 at 3:12 AM, Steve Loughran <st...@hortonworks.com>
> wrote:
> > If you want a separate project, eg. SPARK-EXTRAS, then it *generally*
> needs to go through incubation. While normally its the incubator PMC which
> sponsors/oversees the incubating project, it doesn't have to be the case:
> the spark project can do it.
> >
> > Also Apache Arrow managed to make it straight to toplevel without that
> process. Given that the spark extras are already ASF source files, you
> could try the same thing, add all the existing committers, then look for
> volunteers to keep things.
>
> Am I to understand from your reply that it's not possible for a single
> project to have multiple repos?
>
>
It can have multiple repos, but this still brings maintenance overhead
to the PMC, which was brought up previously on this thread, and it might
not be the direction the PMC wants to take (but I might be mistaken).

Another approach is to make this "extras" just a subproject, with its
own set of committers, etc., which puts less burden on the Spark PMC.

Anyway, my main issue here is not who manages it or how, but that it
continues under Apache governance.

-- 
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/

Re: SPARK-13843 and future of streaming backends

Posted by Adam Kocoloski <ko...@apache.org>.
> On Mar 19, 2016, at 8:32 AM, Steve Loughran <st...@hortonworks.com> wrote:
> 
> 
>> On 18 Mar 2016, at 17:07, Marcelo Vanzin <va...@cloudera.com> wrote:
>> 
>> Hi Steve, thanks for the write up.
>> 
>> On Fri, Mar 18, 2016 at 3:12 AM, Steve Loughran <st...@hortonworks.com> wrote:
>>> If you want a separate project, eg. SPARK-EXTRAS, then it *generally* needs to go through incubation. While normally its the incubator PMC which sponsors/oversees the incubating project, it doesn't have to be the case: the spark project can do it.
>>> 
>>> Also Apache Arrow managed to make it straight to toplevel without that process. Given that the spark extras are already ASF source files, you could try the same thing, add all the existing committers, then look for volunteers to keep things.
>> 
>> Am I to understand from your reply that it's not possible for a single
>> project to have multiple repos?
>> 
> 
> 
> I don't know. there's generally a 1 project -> 1x issue, 1x JIRA.
> 
> but: hadoop core has 3x JIRA, 1x repo, and one set of write permissions to that repo, with the special exception of branches (encryption, ipv6) that have their own committers.
> 
> oh, and I know that hadoop site is on SVN, as are other projects, just to integrate with asf site publishing, so you can certainly have 1x git + 1 x svn
> 
> ASF won't normally let you have 1 repo with different bits of the tree having different access rights, so you couldn't open up spark-extras to people with less permissions/rights than others.
> 
> A separate repo will, separate issue tracking helps you isolate stuff

Multiple repositories per project are certainly allowed without incurring the overhead of a subproject; Cordova and CouchDB are two projects that have taken this approach:

https://github.com/apache?utf8=✓&query=cordova-
https://github.com/apache?utf8=✓&query=couchdb-

I believe Cordova also generates independent release artifacts in different cycles (e.g. cordova-ios releases independently from cordova-android).

If the goal is to enable a divergent set of committers to spark-extras then an independent project makes sense. If you’re just looking to streamline the main repo and decouple some of these other streaming “backends” from the normal release cycle then there are low impact ways to accomplish this inside a single Apache Spark project. Cheers,

Adam




Re: SPARK-13843 and future of streaming backends

Posted by Steve Loughran <st...@hortonworks.com>.
> On 18 Mar 2016, at 17:07, Marcelo Vanzin <va...@cloudera.com> wrote:
> 
> Hi Steve, thanks for the write up.
> 
> On Fri, Mar 18, 2016 at 3:12 AM, Steve Loughran <st...@hortonworks.com> wrote:
>> If you want a separate project, eg. SPARK-EXTRAS, then it *generally* needs to go through incubation. While normally its the incubator PMC which sponsors/oversees the incubating project, it doesn't have to be the case: the spark project can do it.
>> 
>> Also Apache Arrow managed to make it straight to toplevel without that process. Given that the spark extras are already ASF source files, you could try the same thing, add all the existing committers, then look for volunteers to keep things.
> 
> Am I to understand from your reply that it's not possible for a single
> project to have multiple repos?
> 


I don't know. there's generally a 1 project -> 1x issue, 1x JIRA.

but: hadoop core has 3x JIRA, 1x repo, and one set of write permissions to that repo, with the special exception of branches (encryption, ipv6) that have their own committers.

oh, and I know that hadoop site is on SVN, as are other projects, just to integrate with asf site publishing, so you can certainly have 1x git + 1 x svn

ASF won't normally let you have 1 repo with different bits of the tree having different access rights, so you couldn't open up spark-extras to people with less permissions/rights than others.

A separate repo will, separate issue tracking helps you isolate stuff



Re: SPARK-13843 and future of streaming backends

Posted by Marcelo Vanzin <va...@cloudera.com>.
On Fri, Mar 18, 2016 at 10:09 AM, Jean-Baptiste Onofré <jb...@nanthrax.net> wrote:
> a project can have multiple repos: it's what we have in ServiceMix, in
> Karaf.
> For the *-extra on github, if the code has been in the ASF, the PMC members
> have to vote to move the code on *-extra.

That's good to know. To me that sounds like the best solution.

I've heard that top-level projects have some requirements with regard
to having active development, and these components probably will not see
that much activity. And top-level does sound like too much bureaucracy
for this.

-- 
Marcelo



Re: SPARK-13843 and future of streaming backends

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Hi Marcelo,

a project can have multiple repos: it's what we have in ServiceMix, in 
Karaf.

For the *-extra on github, if the code has been in the ASF, the PMC 
members have to vote to move the code to *-extra.

Regards
JB

On 03/18/2016 06:07 PM, Marcelo Vanzin wrote:
> Hi Steve, thanks for the write up.
>
> On Fri, Mar 18, 2016 at 3:12 AM, Steve Loughran <st...@hortonworks.com> wrote:
>> If you want a separate project, eg. SPARK-EXTRAS, then it *generally* needs to go through incubation. While normally its the incubator PMC which sponsors/oversees the incubating project, it doesn't have to be the case: the spark project can do it.
>>
>> Also Apache Arrow managed to make it straight to toplevel without that process. Given that the spark extras are already ASF source files, you could try the same thing, add all the existing committers, then look for volunteers to keep things.
>
> Am I to understand from your reply that it's not possible for a single
> project to have multiple repos?
>

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com



Re: SPARK-13843 and future of streaming backends

Posted by Marcelo Vanzin <va...@cloudera.com>.
Hi Steve, thanks for the write up.

On Fri, Mar 18, 2016 at 3:12 AM, Steve Loughran <st...@hortonworks.com> wrote:
> If you want a separate project, eg. SPARK-EXTRAS, then it *generally* needs to go through incubation. While normally its the incubator PMC which sponsors/oversees the incubating project, it doesn't have to be the case: the spark project can do it.
>
> Also Apache Arrow managed to make it straight to toplevel without that process. Given that the spark extras are already ASF source files, you could try the same thing, add all the existing committers, then look for volunteers to keep things.

Am I to understand from your reply that it's not possible for a single
project to have multiple repos?

-- 
Marcelo



Re: SPARK-13843 and future of streaming backends

Posted by Steve Loughran <st...@hortonworks.com>.
> On 17 Mar 2016, at 21:33, Marcelo Vanzin <va...@cloudera.com> wrote:
> 
> Hi Reynold, thanks for the info.
> 
> On Thu, Mar 17, 2016 at 2:18 PM, Reynold Xin <rx...@databricks.com> wrote:
>> If one really feels strongly that we should go through all the overhead to
>> setup an ASF subproject for these modules that won't work with the new
>> structured streaming, and want to spearhead to setup separate repos
>> (preferably one subproject per connector), CI, separate JIRA, governance,
>> READMEs, voting, we can discuss that. Until then, I'd keep the github option
>> open because IMHO it is what works the best for end users (including
>> discoverability, issue tracking, release publishing, ...).
> 
> For those of us who are not exactly familiar with the inner workings
> of administrating ASF projects, would you mind explaining in more
> detail what this overhead is?
> 
> From my naive point of view, when I say "sub project" I assume that
> it's as simple as having a separate git repo for it, tied to the same
> parent project. Everything else - JIRA, committers, bylaws, etc -
> remains the same. And since the project we're talking about are very
> small, CI should be very simple (Travis?) and, assuming sporadic
> releases, things overall should not be that expensive to maintain.
> 


If you want a separate project, e.g. SPARK-EXTRAS, then it *generally* needs to go through incubation. While normally it's the Incubator PMC which sponsors/oversees the incubating project, it doesn't have to be the case: the Spark project can do it.


Also, Apache Arrow managed to make it straight to top level without that process. Given that the Spark extras are already ASF source files, you could try the same thing: add all the existing committers, then look for volunteers to keep things going.


You'd get:
 - a JIRA project of your own, making it easy to reassign bugs from SPARK to SPARK-EXTRAS
 - a git repo
 - the ability to set up builds on ASF Jenkins. Regression testing against Spark nightlies would be invaluable here.
 - the ability to stage and publish through ASF Nexus


-Steve

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: SPARK-13843 and future of streaming backends

Posted by Marcelo Vanzin <va...@cloudera.com>.
Hi Reynold, thanks for the info.

On Thu, Mar 17, 2016 at 2:18 PM, Reynold Xin <rx...@databricks.com> wrote:
> If one really feels strongly that we should go through all the overhead to
> setup an ASF subproject for these modules that won't work with the new
> structured streaming, and want to spearhead to setup separate repos
> (preferably one subproject per connector), CI, separate JIRA, governance,
> READMEs, voting, we can discuss that. Until then, I'd keep the github option
> open because IMHO it is what works the best for end users (including
> discoverability, issue tracking, release publishing, ...).

For those of us who are not exactly familiar with the inner workings
of administrating ASF projects, would you mind explaining in more
detail what this overhead is?

From my naive point of view, when I say "sub project" I assume that
it's as simple as having a separate git repo for it, tied to the same
parent project. Everything else - JIRA, committers, bylaws, etc -
remains the same. And since the project we're talking about are very
small, CI should be very simple (Travis?) and, assuming sporadic
releases, things overall should not be that expensive to maintain.

-- 
Marcelo

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: SPARK-13843 and future of streaming backends

Posted by Reynold Xin <rx...@databricks.com>.
Thanks for initiating this discussion. I merged the pull request because it
was unblocking another major piece of work for Spark 2.0: not requiring
assembly jars, which is arguably a lot more important than sources that are
less frequently used. I take full responsibility for that.

I think it's inaccurate to call them "backends" because it makes these
things sound a lot more serious, when in reality they are a bunch of
connectors to less frequently used streaming data sources (e.g. MQTT,
Flume). But that's not that important here.

Another important factor is that over time, with the development of
structured streaming, we'd provide a new API for streaming sources that
unifies the way to connect arbitrary sources, and as a result all of these
sources need to be rewritten anyway. This is similar to the RDD ->
DataFrame transition for data sources, which, although initially painful,
in the long run provides a much better experience for end users because
they only need to learn a single API for all sources, and it becomes
trivial to transition from one source to another without actually
impacting business logic.

So the truth is that in the long run, the existing connectors will be
replaced by new ones, and they have been causing minor issues here and
there in the code base. Now issues like these are never black and white. By
moving them out, we'd require users to at least change the maven coordinate
in their build file (although things can still be made binary and source
compatible). So I made the call and asked the contributor to keep Kafka and
Kinesis in, because those are the most widely used (and could be more
contentious), and move everything else out.
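
To make the impact concrete, here is a sketch of what that build-file
change might look like for an sbt user. This is illustrative only: the
post-move group ID below is a placeholder, since the final publishing
coordinates for the externalized connectors weren't settled in this thread.

```scala
// build.sbt fragment -- illustrative only.
// Before the move: the MQTT connector was published with the rest of Spark.
libraryDependencies += "org.apache.spark" %% "spark-streaming-mqtt" % "1.6.1"

// After the move: the same connector would come from wherever the new
// GitHub-hosted project publishes it ("org.spark-project" is hypothetical).
libraryDependencies += "org.spark-project" %% "spark-streaming-mqtt" % "1.6.1"
```

If the new project keeps the artifact name and package namespace, this one
line is the only change users would need, which is what "binary and source
compatible" amounts to in practice.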

I have personally done enough data sources and third-party packages for
Spark on GitHub that I can set up a GitHub repo with CI and Maven
publishing in just under an hour. I do not expect a lot of changes to
these packages because the APIs have been fairly stable. So the thing I
was optimizing for was to minimize the time we need to spend on these
packages given the (expected) low activity and the shift in focus to
structured streaming, and also to minimize the chance of breaking user
apps, to provide the best user experience.
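
For context on the "under an hour" claim: the minimum viable setup for such
a connector repo is roughly a build file plus a CI config. A Travis CI
config for an sbt-based connector might look like the following sketch;
the contents are illustrative, not taken from any actual repo.

```yaml
# .travis.yml -- illustrative minimal CI for an sbt-based Spark connector.
language: scala
scala:
  - 2.10.5
  - 2.11.8
jdk:
  - oraclejdk7
script:
  # Build and test against both Scala versions Spark supported at the time.
  - sbt ++$TRAVIS_SCALA_VERSION clean test
```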

A GitHub repo seems the simplest choice to me. I also made another decision
to provide separate repos (and thus issue trackers) on GitHub for these
packages. The reason is that these connectors have very disjoint
communities. For example, the community that cares about MQTT is likely
very different from the community that cares about Akka. This makes it
much easier to track each of them.

Logistics wise -- things are still in flux. I think it'd make a lot of
sense to give existing Spark committers (or at least the ones that have
contributed to streaming) write access to the github repos. IMHO, it is not
in any of the major Spark contributing organizations' strategic interest to
"own" these projects, especially considering most of the activities will
switch to structured streaming.

If one really feels strongly that we should go through all the overhead to
setup an ASF subproject for these modules that won't work with the new
structured streaming, and want to spearhead to setup separate repos
(preferably one subproject per connector), CI, separate JIRA, governance,
READMEs, voting, we can discuss that. Until then, I'd keep the github
option open because IMHO it is what works the best for end users (including
discoverability, issue tracking, release publishing, ...).






On Thu, Mar 17, 2016 at 1:50 PM, Cody Koeninger <co...@koeninger.org> wrote:

> Anyone can fork apache licensed code.  Committers can approve pull
> requests that delete code from asf repos.  Because those two things
> happen near each other in time, it's somehow a process violation?
>
> I think the discussion would be better served by concentrating on how
> we're going to solve the problem and move forward.
>
> On Thu, Mar 17, 2016 at 3:13 PM, Mridul Muralidharan <mr...@gmail.com>
> wrote:
> > I am not referring to code edits - but to migrating submodules and
> > code currently in Apache Spark to 'outside' of it.
> > If I understand correctly, assets from Apache Spark are being moved
> > out of it into thirdparty external repositories - not owned by Apache.
> >
> > At a minimum, dev@ discussion (like this one) should be initiated.
> > As PMC is responsible for the project assets (including code), signoff
> > is required for it IMO.
> >
> > More experienced Apache members might opine better in case I got it
> > wrong!
> >
> >
> > Regards,
> > Mridul
> >
> >
> > On Thu, Mar 17, 2016 at 12:55 PM, Cody Koeninger <co...@koeninger.org>
> wrote:
> >> Why would a PMC vote be necessary on every code deletion?
> >>
> >> There was a Jira and pull request discussion about the submodules that
> >> have been removed so far.
> >>
> >> https://issues.apache.org/jira/browse/SPARK-13843
> >>
> >> There's another ongoing one about Kafka specifically
> >>
> >> https://issues.apache.org/jira/browse/SPARK-13877
> >>
> >>
> >> On Thu, Mar 17, 2016 at 2:49 PM, Mridul Muralidharan <mr...@gmail.com>
> wrote:
> >>>
> >>> I was not aware of a discussion in Dev list about this - agree with
> most of
> >>> the observations.
> >>> In addition, I did not see PMC signoff on moving (sub-)modules out.
> >>>
> >>> Regards
> >>> Mridul
> >>>
> >>>
> >>>
> >>> On Thursday, March 17, 2016, Marcelo Vanzin <va...@cloudera.com>
> wrote:
> >>>>
> >>>> Hello all,
> >>>>
> >>>> Recently a lot of the streaming backends were moved to a separate
> >>>> project on github and removed from the main Spark repo.
> >>>>
> >>>> While I think the idea is great, I'm a little worried about the
> >>>> execution. Some concerns were already raised on the bug mentioned
> >>>> above, but I'd like to have a more explicit discussion about this so
> >>>> things don't fall through the cracks.
> >>>>
> >>>> Mainly I have three concerns.
> >>>>
> >>>> i. Ownership
> >>>>
> >>>> That code used to be run by the ASF, but now it's hosted in a github
> >>>> repo owned not by the ASF. That sounds a little sub-optimal, if not
> >>>> problematic.
> >>>>
> >>>> ii. Governance
> >>>>
> >>>> Similar to the above; who has commit access to the above repos? Will
> >>>> all the Spark committers, present and future, have commit access to
> >>>> all of those repos? Are they still going to be considered part of
> >>>> Spark and have release management done through the Spark community?
> >>>>
> >>>>
> >>>> For both of the questions above, why are they not turned into
> >>>> sub-projects of Spark and hosted on the ASF repos? I believe there is
> >>>> a mechanism to do that, without the need to keep the code in the main
> >>>> Spark repo, right?
> >>>>
> >>>> iii. Usability
> >>>>
> >>>> This is another thing I don't see discussed. For Scala-based code
> >>>> things don't change much, I guess, if the artifact names don't change
> >>>> (another reason to keep things in the ASF?), but what about python?
> >>>> How are pyspark users expected to get that code going forward, since
> >>>> it's not in Spark's pyspark.zip anymore?
> >>>>
> >>>>
> >>>> Is there an easy way of keeping these things within the ASF Spark
> >>>> project? I think that would be better for everybody.
> >>>>
> >>>> --
> >>>> Marcelo
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> >>>> For additional commands, e-mail: dev-help@spark.apache.org
> >>>>
> >>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> For additional commands, e-mail: dev-help@spark.apache.org
>
>

Re: SPARK-13843 and future of streaming backends

Posted by Cody Koeninger <co...@koeninger.org>.
Anyone can fork apache licensed code.  Committers can approve pull
requests that delete code from asf repos.  Because those two things
happen near each other in time, it's somehow a process violation?

I think the discussion would be better served by concentrating on how
we're going to solve the problem and move forward.

On Thu, Mar 17, 2016 at 3:13 PM, Mridul Muralidharan <mr...@gmail.com> wrote:
> I am not referring to code edits - but to migrating submodules and
> code currently in Apache Spark to 'outside' of it.
> If I understand correctly, assets from Apache Spark are being moved
> out of it into thirdparty external repositories - not owned by Apache.
>
> At a minimum, dev@ discussion (like this one) should be initiated.
> As PMC is responsible for the project assets (including code), signoff
> is required for it IMO.
>
> More experienced Apache members might opine better in case I got it wrong!
>
>
> Regards,
> Mridul
>
>
> On Thu, Mar 17, 2016 at 12:55 PM, Cody Koeninger <co...@koeninger.org> wrote:
>> Why would a PMC vote be necessary on every code deletion?
>>
>> There was a Jira and pull request discussion about the submodules that
>> have been removed so far.
>>
>> https://issues.apache.org/jira/browse/SPARK-13843
>>
>> There's another ongoing one about Kafka specifically
>>
>> https://issues.apache.org/jira/browse/SPARK-13877
>>
>>
>> On Thu, Mar 17, 2016 at 2:49 PM, Mridul Muralidharan <mr...@gmail.com> wrote:
>>>
>>> I was not aware of a discussion in Dev list about this - agree with most of
>>> the observations.
>>> In addition, I did not see PMC signoff on moving (sub-)modules out.
>>>
>>> Regards
>>> Mridul
>>>
>>>
>>>
>>> On Thursday, March 17, 2016, Marcelo Vanzin <va...@cloudera.com> wrote:
>>>>
>>>> Hello all,
>>>>
>>>> Recently a lot of the streaming backends were moved to a separate
>>>> project on github and removed from the main Spark repo.
>>>>
>>>> While I think the idea is great, I'm a little worried about the
>>>> execution. Some concerns were already raised on the bug mentioned
>>>> above, but I'd like to have a more explicit discussion about this so
>>>> things don't fall through the cracks.
>>>>
>>>> Mainly I have three concerns.
>>>>
>>>> i. Ownership
>>>>
>>>> That code used to be run by the ASF, but now it's hosted in a github
>>>> repo owned not by the ASF. That sounds a little sub-optimal, if not
>>>> problematic.
>>>>
>>>> ii. Governance
>>>>
>>>> Similar to the above; who has commit access to the above repos? Will
>>>> all the Spark committers, present and future, have commit access to
>>>> all of those repos? Are they still going to be considered part of
>>>> Spark and have release management done through the Spark community?
>>>>
>>>>
>>>> For both of the questions above, why are they not turned into
>>>> sub-projects of Spark and hosted on the ASF repos? I believe there is
>>>> a mechanism to do that, without the need to keep the code in the main
>>>> Spark repo, right?
>>>>
>>>> iii. Usability
>>>>
>>>> This is another thing I don't see discussed. For Scala-based code
>>>> things don't change much, I guess, if the artifact names don't change
>>>> (another reason to keep things in the ASF?), but what about python?
>>>> How are pyspark users expected to get that code going forward, since
>>>> it's not in Spark's pyspark.zip anymore?
>>>>
>>>>
>>>> Is there an easy way of keeping these things within the ASF Spark
>>>> project? I think that would be better for everybody.
>>>>
>>>> --
>>>> Marcelo
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>>> For additional commands, e-mail: dev-help@spark.apache.org
>>>>
>>>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: SPARK-13843 and future of streaming backends

Posted by Mridul Muralidharan <mr...@gmail.com>.
On Saturday, March 26, 2016, Sean Owen <so...@cloudera.com> wrote:

> This has been resolved; see the JIRA and related PRs but also
>
> http://apache-spark-developers-list.1001551.n3.nabble.com/SPARK-13843-Next-steps-td16783.html
>
>
This change happened subsequent to the current thread (thanks Marcelo) and
could just as well have gone unnoticed until the release vote.



> This is not a scenario where a [VOTE] needs to take place, and code
> changes don't proceed through PMC votes. From the project perspective,
> code was deleted/retired for lack of interest, and this is controlled
> by the normal lazy consensus protocol which wasn't vetoed.


I have not seen Apache-owned artifacts moved out of its governance without
discussion - this was not refactoring or cleanup (as was disingenuously
suggested) but a migration of submodules/functionality (though from
Reynold's clarification, it looks like it was for good enough reasons).

A vote might or might not have been required, but a discussion should have
happened - at least going forward, that will help us not to miss things
(the artifact and project namespace, license, ownership, release cycle,
version compatibility, etc. of the sub-project could be of interest to
users and developers).

Regards
Mridul


> The subsequent discussion was in part about whether other modules
> should go, or whether one should come back, which it did. The latter
> suggests that change could have been left open for some discussion
> longer. Ideally, you would have commented before the initial change
> happened, but it sounds like several people would have liked more
> time. I don't think I'd call that "improper conduct" though, no. It
> was reversed via the same normal code management process.
>
> The rest of the question concerned what becomes of the code that was
> removed. It was revived outside the project for anyone who cares to
> continue collaborating. There seemed to be no disagreement about that,
> mostly because the code in question was of minimal interest. PMC
> doesn't need to rule on anything. There may still be some loose ends
> there like namespace changes. I'll add to the other thread about this.
>
>
>
> On Sat, Mar 26, 2016 at 1:17 PM, Jacek Laskowski <jacek@japila.pl> wrote:
> > Hi,
> >
> > Although I'm not that much experienced member of ASF, I share your
> > concerns. I haven't looked at the issue from this point of view, but
> > after having read the thread I think PMC should've signed off the
> > migration of ASF-owned code to a non-ASF repo. At least a vote is
> > required (and this discussion is a sign that the process has not been
> > conducted properly as people have concerns, me including).
> >
> > Thanks Mridul!
> >
> > Pozdrawiam,
> > Jacek Laskowski
> > ----
> > https://medium.com/@jaceklaskowski/
> > Mastering Apache Spark http://bit.ly/mastering-apache-spark
> > Follow me at https://twitter.com/jaceklaskowski
> >
> >
> > On Thu, Mar 17, 2016 at 9:13 PM, Mridul Muralidharan <mridul@gmail.com> wrote:
> >> I am not referring to code edits - but to migrating submodules and
> >> code currently in Apache Spark to 'outside' of it.
> >> If I understand correctly, assets from Apache Spark are being moved
> >> out of it into thirdparty external repositories - not owned by Apache.
> >>
> >> At a minimum, dev@ discussion (like this one) should be initiated.
> >> As PMC is responsible for the project assets (including code), signoff
> >> is required for it IMO.
> >>
> >> More experienced Apache members might opine better in case I got it
> >> wrong!
> >>
> >>
> >> Regards,
> >> Mridul
> >>
> >>
> >> On Thu, Mar 17, 2016 at 12:55 PM, Cody Koeninger <cody@koeninger.org> wrote:
> >>> Why would a PMC vote be necessary on every code deletion?
> >>>
> >>> There was a Jira and pull request discussion about the submodules that
> >>> have been removed so far.
> >>>
> >>> https://issues.apache.org/jira/browse/SPARK-13843
> >>>
> >>> There's another ongoing one about Kafka specifically
> >>>
> >>> https://issues.apache.org/jira/browse/SPARK-13877
> >>>
> >>>
> >>> On Thu, Mar 17, 2016 at 2:49 PM, Mridul Muralidharan <mridul@gmail.com> wrote:
> >>>>
> >>>> I was not aware of a discussion in Dev list about this - agree with
> most of
> >>>> the observations.
> >>>> In addition, I did not see PMC signoff on moving (sub-)modules out.
> >>>>
> >>>> Regards
> >>>> Mridul
> >>>>
> >>>>
> >>>>
> >>>> On Thursday, March 17, 2016, Marcelo Vanzin <vanzin@cloudera.com> wrote:
> >>>>>
> >>>>> Hello all,
> >>>>>
> >>>>> Recently a lot of the streaming backends were moved to a separate
> >>>>> project on github and removed from the main Spark repo.
> >>>>>
> >>>>> While I think the idea is great, I'm a little worried about the
> >>>>> execution. Some concerns were already raised on the bug mentioned
> >>>>> above, but I'd like to have a more explicit discussion about this so
> >>>>> things don't fall through the cracks.
> >>>>>
> >>>>> Mainly I have three concerns.
> >>>>>
> >>>>> i. Ownership
> >>>>>
> >>>>> That code used to be run by the ASF, but now it's hosted in a github
> >>>>> repo owned not by the ASF. That sounds a little sub-optimal, if not
> >>>>> problematic.
> >>>>>
> >>>>> ii. Governance
> >>>>>
> >>>>> Similar to the above; who has commit access to the above repos? Will
> >>>>> all the Spark committers, present and future, have commit access to
> >>>>> all of those repos? Are they still going to be considered part of
> >>>>> Spark and have release management done through the Spark community?
> >>>>>
> >>>>>
> >>>>> For both of the questions above, why are they not turned into
> >>>>> sub-projects of Spark and hosted on the ASF repos? I believe there is
> >>>>> a mechanism to do that, without the need to keep the code in the main
> >>>>> Spark repo, right?
> >>>>>
> >>>>> iii. Usability
> >>>>>
> >>>>> This is another thing I don't see discussed. For Scala-based code
> >>>>> things don't change much, I guess, if the artifact names don't change
> >>>>> (another reason to keep things in the ASF?), but what about python?
> >>>>> How are pyspark users expected to get that code going forward, since
> >>>>> it's not in Spark's pyspark.zip anymore?
> >>>>>
> >>>>>
> >>>>> Is there an easy way of keeping these things within the ASF Spark
> >>>>> project? I think that would be better for everybody.
> >>>>>
> >>>>> --
> >>>>> Marcelo
> >>>>>
> >>>>> ---------------------------------------------------------------------
> >>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> >>>>> For additional commands, e-mail: dev-help@spark.apache.org
> >>>>>
> >>>>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> >> For additional commands, e-mail: dev-help@spark.apache.org
> >>
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> > For additional commands, e-mail: dev-help@spark.apache.org
> >
>

Re: SPARK-13843 and future of streaming backends

Posted by Sean Owen <so...@cloudera.com>.
This has been resolved; see the JIRA and related PRs but also
http://apache-spark-developers-list.1001551.n3.nabble.com/SPARK-13843-Next-steps-td16783.html

This is not a scenario where a [VOTE] needs to take place, and code
changes don't proceed through PMC votes. From the project perspective,
code was deleted/retired for lack of interest, and this is controlled
by the normal lazy consensus protocol which wasn't vetoed.

The subsequent discussion was in part about whether other modules
should go, or whether one should come back, which it did. The latter
suggests that change could have been left open for some discussion
longer. Ideally, you would have commented before the initial change
happened, but it sounds like several people would have liked more
time. I don't think I'd call that "improper conduct" though, no. It
was reversed via the same normal code management process.

The rest of the question concerned what becomes of the code that was
removed. It was revived outside the project for anyone who cares to
continue collaborating. There seemed to be no disagreement about that,
mostly because the code in question was of minimal interest. PMC
doesn't need to rule on anything. There may still be some loose ends
there like namespace changes. I'll add to the other thread about this.



On Sat, Mar 26, 2016 at 1:17 PM, Jacek Laskowski <ja...@japila.pl> wrote:
> Hi,
>
> Although I'm not that much experienced member of ASF, I share your
> concerns. I haven't looked at the issue from this point of view, but
> after having read the thread I think PMC should've signed off the
> migration of ASF-owned code to a non-ASF repo. At least a vote is
> required (and this discussion is a sign that the process has not been
> conducted properly as people have concerns, me including).
>
> Thanks Mridul!
>
> Pozdrawiam,
> Jacek Laskowski
> ----
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Thu, Mar 17, 2016 at 9:13 PM, Mridul Muralidharan <mr...@gmail.com> wrote:
>> I am not referring to code edits - but to migrating submodules and
>> code currently in Apache Spark to 'outside' of it.
>> If I understand correctly, assets from Apache Spark are being moved
>> out of it into thirdparty external repositories - not owned by Apache.
>>
>> At a minimum, dev@ discussion (like this one) should be initiated.
>> As PMC is responsible for the project assets (including code), signoff
>> is required for it IMO.
>>
> >> More experienced Apache members might opine better in case I got it wrong!
>>
>>
>> Regards,
>> Mridul
>>
>>
>> On Thu, Mar 17, 2016 at 12:55 PM, Cody Koeninger <co...@koeninger.org> wrote:
>>> Why would a PMC vote be necessary on every code deletion?
>>>
>>> There was a Jira and pull request discussion about the submodules that
>>> have been removed so far.
>>>
>>> https://issues.apache.org/jira/browse/SPARK-13843
>>>
>>> There's another ongoing one about Kafka specifically
>>>
>>> https://issues.apache.org/jira/browse/SPARK-13877
>>>
>>>
>>> On Thu, Mar 17, 2016 at 2:49 PM, Mridul Muralidharan <mr...@gmail.com> wrote:
>>>>
>>>> I was not aware of a discussion in Dev list about this - agree with most of
>>>> the observations.
>>>> In addition, I did not see PMC signoff on moving (sub-)modules out.
>>>>
>>>> Regards
>>>> Mridul
>>>>
>>>>
>>>>
>>>> On Thursday, March 17, 2016, Marcelo Vanzin <va...@cloudera.com> wrote:
>>>>>
>>>>> Hello all,
>>>>>
>>>>> Recently a lot of the streaming backends were moved to a separate
>>>>> project on github and removed from the main Spark repo.
>>>>>
>>>>> While I think the idea is great, I'm a little worried about the
>>>>> execution. Some concerns were already raised on the bug mentioned
>>>>> above, but I'd like to have a more explicit discussion about this so
>>>>> things don't fall through the cracks.
>>>>>
>>>>> Mainly I have three concerns.
>>>>>
>>>>> i. Ownership
>>>>>
>>>>> That code used to be run by the ASF, but now it's hosted in a github
>>>>> repo owned not by the ASF. That sounds a little sub-optimal, if not
>>>>> problematic.
>>>>>
>>>>> ii. Governance
>>>>>
>>>>> Similar to the above; who has commit access to the above repos? Will
>>>>> all the Spark committers, present and future, have commit access to
>>>>> all of those repos? Are they still going to be considered part of
>>>>> Spark and have release management done through the Spark community?
>>>>>
>>>>>
>>>>> For both of the questions above, why are they not turned into
>>>>> sub-projects of Spark and hosted on the ASF repos? I believe there is
>>>>> a mechanism to do that, without the need to keep the code in the main
>>>>> Spark repo, right?
>>>>>
>>>>> iii. Usability
>>>>>
>>>>> This is another thing I don't see discussed. For Scala-based code
>>>>> things don't change much, I guess, if the artifact names don't change
>>>>> (another reason to keep things in the ASF?), but what about python?
>>>>> How are pyspark users expected to get that code going forward, since
>>>>> it's not in Spark's pyspark.zip anymore?
>>>>>
>>>>>
>>>>> Is there an easy way of keeping these things within the ASF Spark
>>>>> project? I think that would be better for everybody.
>>>>>
>>>>> --
>>>>> Marcelo
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>>>> For additional commands, e-mail: dev-help@spark.apache.org
>>>>>
>>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> For additional commands, e-mail: dev-help@spark.apache.org
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> For additional commands, e-mail: dev-help@spark.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: SPARK-13843 and future of streaming backends

Posted by Jacek Laskowski <ja...@japila.pl>.
Hi,

Although I'm not that experienced a member of the ASF, I share your
concerns. I hadn't looked at the issue from this point of view, but
after having read the thread I think the PMC should've signed off on the
migration of ASF-owned code to a non-ASF repo. At least a vote is
required (and this discussion is a sign that the process has not been
conducted properly, as people have concerns, me included).

Thanks Mridul!

Pozdrawiam,
Jacek Laskowski
----
https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Thu, Mar 17, 2016 at 9:13 PM, Mridul Muralidharan <mr...@gmail.com> wrote:
> I am not referring to code edits - but to migrating submodules and
> code currently in Apache Spark to 'outside' of it.
> If I understand correctly, assets from Apache Spark are being moved
> out of it into thirdparty external repositories - not owned by Apache.
>
> At a minimum, dev@ discussion (like this one) should be initiated.
> As PMC is responsible for the project assets (including code), signoff
> is required for it IMO.
>
> More experienced Apache members might opine better in case I got it wrong!
>
>
> Regards,
> Mridul
>
>
> On Thu, Mar 17, 2016 at 12:55 PM, Cody Koeninger <co...@koeninger.org> wrote:
>> Why would a PMC vote be necessary on every code deletion?
>>
>> There was a Jira and pull request discussion about the submodules that
>> have been removed so far.
>>
>> https://issues.apache.org/jira/browse/SPARK-13843
>>
>> There's another ongoing one about Kafka specifically
>>
>> https://issues.apache.org/jira/browse/SPARK-13877
>>
>>
>> On Thu, Mar 17, 2016 at 2:49 PM, Mridul Muralidharan <mr...@gmail.com> wrote:
>>>
>>> I was not aware of a discussion in Dev list about this - agree with most of
>>> the observations.
>>> In addition, I did not see PMC signoff on moving (sub-)modules out.
>>>
>>> Regards
>>> Mridul
>>>
>>>
>>>
>>> On Thursday, March 17, 2016, Marcelo Vanzin <va...@cloudera.com> wrote:
>>>>
>>>> Hello all,
>>>>
>>>> Recently a lot of the streaming backends were moved to a separate
>>>> project on github and removed from the main Spark repo.
>>>>
>>>> While I think the idea is great, I'm a little worried about the
>>>> execution. Some concerns were already raised on the bug mentioned
>>>> above, but I'd like to have a more explicit discussion about this so
>>>> things don't fall through the cracks.
>>>>
>>>> Mainly I have three concerns.
>>>>
>>>> i. Ownership
>>>>
>>>> That code used to be run by the ASF, but now it's hosted in a github
>>>> repo owned not by the ASF. That sounds a little sub-optimal, if not
>>>> problematic.
>>>>
>>>> ii. Governance
>>>>
>>>> Similar to the above; who has commit access to the above repos? Will
>>>> all the Spark committers, present and future, have commit access to
>>>> all of those repos? Are they still going to be considered part of
>>>> Spark and have release management done through the Spark community?
>>>>
>>>>
>>>> For both of the questions above, why are they not turned into
>>>> sub-projects of Spark and hosted on the ASF repos? I believe there is
>>>> a mechanism to do that, without the need to keep the code in the main
>>>> Spark repo, right?
>>>>
>>>> iii. Usability
>>>>
>>>> This is another thing I don't see discussed. For Scala-based code
>>>> things don't change much, I guess, if the artifact names don't change
>>>> (another reason to keep things in the ASF?), but what about python?
>>>> How are pyspark users expected to get that code going forward, since
>>>> it's not in Spark's pyspark.zip anymore?
>>>>
>>>>
>>>> Is there an easy way of keeping these things within the ASF Spark
>>>> project? I think that would be better for everybody.
>>>>
>>>> --
>>>> Marcelo
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>>> For additional commands, e-mail: dev-help@spark.apache.org
>>>>
>>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> For additional commands, e-mail: dev-help@spark.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: SPARK-13843 and future of streaming backends

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Hi Marcelo,

Thanks for your reply. As a committer on the project, you *can* VETO
code. For sure. Unfortunately you don’t have a binding vote on adding
new PMC members/committers or on releasing the software, but you do
have the ability to VETO.

That said, if that’s not your intent, sorry for misreading it.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++





-----Original Message-----
From: Marcelo Vanzin <va...@cloudera.com>
Date: Friday, March 18, 2016 at 3:24 PM
To: jpluser <ma...@apache.org>
Cc: "dev@spark.apache.org" <de...@spark.apache.org>
Subject: Re: SPARK-13843 and future of streaming backends

>On Fri, Mar 18, 2016 at 2:12 PM, chrismattmann <ma...@apache.org>
>wrote:
>> So, my comment here is that any code *cannot* be removed from an Apache
>> project if there is a VETO issued which so far I haven't seen, though
>> maybe
>> Marcelo can clarify that.
>
>No, my intention was not to veto the change. I'm actually for the
>removal of components if the community thinks they don't add much to
>the project. (I'm also not sure I can even veto things, not being a
>PMC member.)
>
>I mainly wanted to know what was the path forward for those components
>because, with Cloudera's hat on, we care about one of them (streaming
>integration with flume), and we'd prefer if that code remained under
>the ASF umbrella in some way.
>
>-- 
>Marcelo


Re: SPARK-13843 and future of streaming backends

Posted by Steve Loughran <st...@hortonworks.com>.
> On 18 Mar 2016, at 22:24, Marcelo Vanzin <va...@cloudera.com> wrote:
> 
> On Fri, Mar 18, 2016 at 2:12 PM, chrismattmann <ma...@apache.org> wrote:
>> So, my comment here is that any code *cannot* be removed from an Apache
>> project if there is a VETO issued which so far I haven't seen, though maybe
>> Marcelo can clarify that.
> 
> No, my intention was not to veto the change. I'm actually for the
> removal of components if the community thinks they don't add much to
> the project. (I'm also not sure I can even veto things, not being a
> PMC member.)
> 
> I mainly wanted to know what was the path forward for those components
> because, with Cloudera's hat on, we care about one of them (streaming
> integration with flume), and we'd prefer if that code remained under
> the ASF umbrella in some way.
> 

I'd be supportive of a spark-extras project; it'd actually be a place to keep stuff I've worked on:
 - the YARN ATS 1/1.5 integration
 - that mutant Hive JAR which has the consistent Kryo dependency and different shadings

... etc

There's also the fact that the Twitter streaming connector is a common example to play with, and Flume is popular in places too.
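For the play-with use case, moving a connector out needn't make it hard to reach: Spark's `--packages` flag can pull a separately released artifact in at submit time. A rough sketch (the group/artifact coordinates and class name below are hypothetical, not real artifacts):

```shell
# Hypothetical sketch: resolving an externally-released Twitter connector
# at submit time. The org.spark-project coordinates are illustrative only.
# --packages fetches the artifact (plus transitive dependencies) from
# Maven Central and puts it on the driver and executor classpaths.
spark-submit \
  --packages org.spark-project:spark-streaming-twitter_2.11:2.0.0 \
  --class com.example.TwitterWordCount \
  my-streaming-app.jar
```

The same flag works with pyspark, which may partly answer the usability question raised for Python users.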

If you want to set up a new incubator project with a goal of graduating fast, I'd help. Since a key metric for getting out of the incubator is active development, you just need to "recruit" contributors and keep them engaged.






Re: SPARK-13843 and future of streaming backends

Posted by Marcelo Vanzin <va...@cloudera.com>.
On Fri, Mar 18, 2016 at 2:12 PM, chrismattmann <ma...@apache.org> wrote:
> So, my comment here is that any code *cannot* be removed from an Apache
> project if there is a VETO issued which so far I haven't seen, though maybe
> Marcelo can clarify that.

No, my intention was not to veto the change. I'm actually for the
removal of components if the community thinks they don't add much to
the project. (I'm also not sure I can even veto things, not being a
PMC member.)

I mainly wanted to know what was the path forward for those components
because, with Cloudera's hat on, we care about one of them (streaming
integration with flume), and we'd prefer if that code remained under
the ASF umbrella in some way.

-- 
Marcelo



Re: SPARK-13843 and future of streaming backends

Posted by chrismattmann <ma...@apache.org>.
So, my comment here is that any code *cannot* be removed from an Apache
project if there is a VETO issued, which so far I haven't seen, though maybe
Marcelo can clarify that.

However, if a VETO was issued, then the code cannot be removed and must be
put back. Anyone can fork anything; our license allows that. But the
community itself must steward the code, and part of that is hearing
everyone's voice within that community before acting.

Cheers,
Chris



--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/SPARK-13843-and-future-of-streaming-backends-tp16711p16749.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.



Re: SPARK-13843 and future of streaming backends

Posted by Luciano Resende <lu...@gmail.com>.
On Fri, Mar 18, 2016 at 7:58 AM, Cody Koeninger <co...@koeninger.org> wrote:

> >  Or, as Cody Koeniger suggests, having a spark-extras project in the ASF
> with a focus on extras with their own support channel.
>
> To be clear, I didn't suggest that and don't think that's the best
> solution.  I said to the people who want things done that way, which
> committer is going to step up and do that organizational work?
>

I am currently not a committer, but if we are willing to go in the
direction of having another project as spark-extras, I can help drive the
bureaucratic work to make this a reality.


-- 
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/

Re: SPARK-13843 and future of streaming backends

Posted by Cody Koeninger <co...@koeninger.org>.
>  Or, as Cody Koeniger suggests, having a spark-extras project in the ASF with a focus on extras with their own support channel.

To be clear, I didn't suggest that and don't think that's the best
solution. I said to the people who want things done that way: which
committer is going to step up and do that organizational work?

I think there are advantages to moving everything currently in extras/
and external/ out of the spark project, but the current Kafka
packaging issue can be solved straightforwardly by just adding another
artifact and code tree under external/.

On Fri, Mar 18, 2016 at 5:04 AM, Steve Loughran <st...@hortonworks.com> wrote:
>
> Spark has hit one of the eternal problems of OSS projects, one hit by: ant,
> maven, hadoop, ... anything with a plugin model.
>
> Take in the plugin: you're in control, but also down for maintenance
>
> Leave out the plugin: other people can maintain it, be more agile, etc.
>
> But you've lost control, and you can't even manage the links. Here I think
> maven suffered the most by keeping stuff in codehaus; migrating off there is
> still hard —not only did they lose the links: they lost the JIRA.
>
> Maven's relationship with codehaus was very tightly coupled, lots of
> committers on both; I don't know how that relationship was handled at a
> higher level.
>
>
> On 17 Mar 2016, at 20:51, Hari Shreedharan <hs...@cloudera.com>
> wrote:
>
> I have worked with various ASF projects for 4+ years now. Sure, ASF projects
> can delete code as they see fit. But this is the first time I have really
> seen code being "moved out" of a project without discussion. I am sure you
> can do this without violating ASF policy, but the explanation for that would
> be convoluted (someone decided to make a copy and then the ASF project
> deleted it?).
>
>
> +1 for discussion. Dev changes should -> dev list; PMC for process in
> general. Don't think the ASF will overlook stuff like that.
>
> Might want to raise this issue on the next board report
>
>
> FWIW, it may be better to just see if you can have committers to work on
> these projects: recruit the people and say 'please, only work in this area
> — for now'. That gets developers on your team, which is generally considered
> a metric of health in a project.
>
> Or, as Cody Koeniger suggests, having a spark-extras project in the ASF with
> a focus on extras with their own support channel.
>
>
> Also, moving the code out would break compatibility. AFAIK, there is no way
> to push org.apache.* artifacts directly to maven central. That happens via
> mirroring from the ASF maven repos. Even if you could somehow directly
> push the artifacts to mvn, you really can push to org.apache.* groups only
> if you are part of the repo and acting as an agent of that project (which in
> this case would be Apache Spark). Once you move the code out, even a
> committer/PMC member would not be representing the ASF when pushing the
> code. I am not sure if there is a way to fix this issue.
>
>
>
>
> This topic has cropped up in the general context of third party repos
> publishing artifacts with org.apache names but vendor-specific suffixes (e.g.
> org.apache.hadoop/hadoop-common.5.3-cdh.jar)
>
> Some people were pretty unhappy about this, but the conclusion reached was
> "maven doesn't let you do anything else and still let downstream people use
> it". Furthermore, as all ASF releases are nominally the source releases *not
> the binaries*, you can look at the POMs and say "we've released source code
> designed to publish artifacts to repos — this is 'use as intended'."
>
> People are also free to cut their own full project distributions, etc, etc.
> For example, I stick up the binaries of Windows builds independent of the
> ASF releases; these were originally just those from HDP on windows installs,
> now I check out the commit of the specific ASF release on a windows 2012 VM,
> do the build, copy the binaries. Free for all to use. But I do suspect that
> the ASF legal protections get a bit blurred here. These aren't ASF binaries,
> but binaries built directly from unmodified ASF releases.
>
> In contrast to sticking stuff into a github repo, the moved artifacts cannot
> be published as org.apache artifacts on maven central. That's non-negotiable
> as far as the ASF are concerned. The process for releasing ASF artifacts
> there goes downstream of the ASF public release process: you stage the
> artifacts, they are part of the vote process, everything with org.apache
> goes through it.
>
> That said: there is nothing to stop a set of shell org.apache artifacts
> being written which do nothing but contain transitive dependencies on
> artifacts in different groups, such as org.spark-project. The shells would
> be released by the ASF; they pull in the new stuff. And, therefore, it'd be
> possible to build a spark-assembly with the files. (I'm ignoring a loop in
> the build DAG here; playing with git submodules would let someone eliminate
> this by adding the removed libraries under a modified project.)
>
> I think there might be some issues related to package names; you could make a
> case for having public APIs with the original names —they're the API, after
> all, and that's exactly what Apache Harmony did with the java.* packages.
>
>
> Thanks,
> Hari
>
> On Thu, Mar 17, 2016 at 1:13 PM, Mridul Muralidharan <mr...@gmail.com>
> wrote:
>>
>> I am not referring to code edits - but to migrating submodules and
>> code currently in Apache Spark to 'outside' of it.
>> If I understand correctly, assets from Apache Spark are being moved
>> out of it into thirdparty external repositories - not owned by Apache.
>>
>> At a minimum, dev@ discussion (like this one) should be initiated.
>> As PMC is responsible for the project assets (including code), signoff
>> is required for it IMO.
>>
>> More experienced Apache members might opine better in case I got it
>> wrong!
>>
>>
>> Regards,
>> Mridul
>>
>>
>> On Thu, Mar 17, 2016 at 12:55 PM, Cody Koeninger <co...@koeninger.org>
>> wrote:
>> > Why would a PMC vote be necessary on every code deletion?
>> >
>> > There was a Jira and pull request discussion about the submodules that
>> > have been removed so far.
>> >
>> > https://issues.apache.org/jira/browse/SPARK-13843
>> >
>> > There's another ongoing one about Kafka specifically
>> >
>> > https://issues.apache.org/jira/browse/SPARK-13877
>> >
>> >
>> > On Thu, Mar 17, 2016 at 2:49 PM, Mridul Muralidharan <mr...@gmail.com>
>> > wrote:
>> >>
>> >> I was not aware of a discussion in Dev list about this - agree with
>> >> most of
>> >> the observations.
>> >> In addition, I did not see PMC signoff on moving (sub-)modules out.
>> >>
>> >> Regards
>> >> Mridul
>> >>
>> >>
>> >>
>> >> On Thursday, March 17, 2016, Marcelo Vanzin <va...@cloudera.com>
>> >> wrote:
>> >>>
>> >>> Hello all,
>> >>>
>> >>> Recently a lot of the streaming backends were moved to a separate
>> >>> project on github and removed from the main Spark repo.
>> >>>
>> >>> While I think the idea is great, I'm a little worried about the
>> >>> execution. Some concerns were already raised on the bug mentioned
>> >>> above, but I'd like to have a more explicit discussion about this so
>> >>> things don't fall through the cracks.
>> >>>
>> >>> Mainly I have three concerns.
>> >>>
>> >>> i. Ownership
>> >>>
>> >>> That code used to be run by the ASF, but now it's hosted in a github
>> >>> repo owned not by the ASF. That sounds a little sub-optimal, if not
>> >>> problematic.
>> >>>
>> >>> ii. Governance
>> >>>
>> >>> Similar to the above; who has commit access to the above repos? Will
>> >>> all the Spark committers, present and future, have commit access to
>> >>> all of those repos? Are they still going to be considered part of
>> >>> Spark and have release management done through the Spark community?
>> >>>
>> >>>
>> >>> For both of the questions above, why are they not turned into
>> >>> sub-projects of Spark and hosted on the ASF repos? I believe there is
>> >>> a mechanism to do that, without the need to keep the code in the main
>> >>> Spark repo, right?
>> >>>
>> >>> iii. Usability
>> >>>
>> >>> This is another thing I don't see discussed. For Scala-based code
>> >>> things don't change much, I guess, if the artifact names don't change
>> >>> (another reason to keep things in the ASF?), but what about python?
>> >>> How are pyspark users expected to get that code going forward, since
>> >>> it's not in Spark's pyspark.zip anymore?
>> >>>
>> >>>
>> >>> Is there an easy way of keeping these things within the ASF Spark
>> >>> project? I think that would be better for everybody.
>> >>>
>> >>> --
>> >>> Marcelo
>> >>>
>> >>> ---------------------------------------------------------------------
>> >>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> >>> For additional commands, e-mail: dev-help@spark.apache.org
>> >>>
>> >>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> For additional commands, e-mail: dev-help@spark.apache.org
>>
>
>



Re: SPARK-13843 and future of streaming backends

Posted by Steve Loughran <st...@hortonworks.com>.
Spark has hit one of the eternal problems of OSS projects, one hit by: ant, maven, hadoop, ... anything with a plugin model.

Take in the plugin: you're in control, but also down for maintenance

Leave out the plugin: other people can maintain it, be more agile, etc.

But you've lost control, and you can't even manage the links. Here I think maven suffered the most by keeping stuff in codehaus; migrating off there is still hard —not only did they lose the links: they lost the JIRA.

Maven's relationship with codehaus was very tightly coupled, lots of committers on both; I don't know how that relationship was handled at a higher level.


On 17 Mar 2016, at 20:51, Hari Shreedharan <hs...@cloudera.com> wrote:

I have worked with various ASF projects for 4+ years now. Sure, ASF projects can delete code as they see fit. But this is the first time I have really seen code being "moved out" of a project without discussion. I am sure you can do this without violating ASF policy, but the explanation for that would be convoluted (someone decided to make a copy and then the ASF project deleted it?).

+1 for discussion. Dev changes should -> dev list; PMC for process in general. Don't think the ASF will overlook stuff like that.

Might want to raise this issue on the next board report


FWIW, it may be better to just see if you can have committers to work on these projects: recruit the people and say 'please, only work in this area — for now'. That gets developers on your team, which is generally considered a metric of health in a project.

Or, as Cody Koeniger suggests, having a spark-extras project in the ASF with a focus on extras with their own support channel.


Also, moving the code out would break compatibility. AFAIK, there is no way to push org.apache.* artifacts directly to maven central. That happens via mirroring from the ASF maven repos. Even if you could somehow directly push the artifacts to mvn, you really can push to org.apache.* groups only if you are part of the repo and acting as an agent of that project (which in this case would be Apache Spark). Once you move the code out, even a committer/PMC member would not be representing the ASF when pushing the code. I am not sure if there is a way to fix this issue.




This topic has cropped up in the general context of third party repos publishing artifacts with org.apache names but vendor-specific suffixes (e.g. org.apache.hadoop/hadoop-common.5.3-cdh.jar)

Some people were pretty unhappy about this, but the conclusion reached was "maven doesn't let you do anything else and still let downstream people use it". Furthermore, as all ASF releases are nominally the source releases *not the binaries*, you can look at the POMs and say "we've released source code designed to publish artifacts to repos — this is 'use as intended'."

People are also free to cut their own full project distributions, etc, etc. For example, I stick up the binaries of Windows builds independent of the ASF releases; these were originally just those from HDP on windows installs, now I check out the commit of the specific ASF release on a windows 2012 VM, do the build, copy the binaries. Free for all to use. But I do suspect that the ASF legal protections get a bit blurred here. These aren't ASF binaries, but binaries built directly from unmodified ASF releases.

In contrast to sticking stuff into a github repo, the moved artifacts cannot be published as org.apache artifacts on maven central. That's non-negotiable as far as the ASF are concerned. The process for releasing ASF artifacts there goes downstream of the ASF public release process: you stage the artifacts, they are part of the vote process, everything with org.apache goes through it.

That said: there is nothing to stop a set of shell org.apache artifacts being written which do nothing but contain transitive dependencies on artifacts in different groups, such as org.spark-project. The shells would be released by the ASF; they pull in the new stuff. And, therefore, it'd be possible to build a spark-assembly with the files. (I'm ignoring a loop in the build DAG here; playing with git submodules would let someone eliminate this by adding the removed libraries under a modified project.)
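A minimal sketch of what such a shell artifact could be, assuming a hypothetical org.spark-project group for the relocated code (all coordinates here are illustrative, not real artifacts):

```xml
<!-- Hypothetical "shell" POM released by the ASF under org.apache.spark.
     It contains no code of its own; it only forwards, via a transitive
     dependency, to an externally-released implementation artifact. -->
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming-twitter_2.11</artifactId>
  <version>2.0.0</version>
  <packaging>pom</packaging>
  <dependencies>
    <!-- Pulls in the real implementation transitively -->
    <dependency>
      <groupId>org.spark-project</groupId>
      <artifactId>spark-streaming-twitter_2.11</artifactId>
      <version>2.0.0</version>
    </dependency>
  </dependencies>
</project>
```

Because it is `<packaging>pom</packaging>`, downstream builds depending on the org.apache.spark coordinates would transparently resolve the external jar.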

I think there might be some issues related to package names; you could make a case for having public APIs with the original names —they're the API, after all, and that's exactly what Apache Harmony did with the java.* packages.


Thanks,
Hari

On Thu, Mar 17, 2016 at 1:13 PM, Mridul Muralidharan <mr...@gmail.com> wrote:
I am not referring to code edits - but to migrating submodules and
code currently in Apache Spark to 'outside' of it.
If I understand correctly, assets from Apache Spark are being moved
out of it into thirdparty external repositories - not owned by Apache.

At a minimum, dev@ discussion (like this one) should be initiated.
As PMC is responsible for the project assets (including code), signoff
is required for it IMO.

More experienced Apache members might opine better in case I got it wrong!


Regards,
Mridul


On Thu, Mar 17, 2016 at 12:55 PM, Cody Koeninger <co...@koeninger.org> wrote:
> Why would a PMC vote be necessary on every code deletion?
>
> There was a Jira and pull request discussion about the submodules that
> have been removed so far.
>
> https://issues.apache.org/jira/browse/SPARK-13843
>
> There's another ongoing one about Kafka specifically
>
> https://issues.apache.org/jira/browse/SPARK-13877
>
>
> On Thu, Mar 17, 2016 at 2:49 PM, Mridul Muralidharan <mr...@gmail.com> wrote:
>>
>> I was not aware of a discussion in Dev list about this - agree with most of
>> the observations.
>> In addition, I did not see PMC signoff on moving (sub-)modules out.
>>
>> Regards
>> Mridul
>>
>>
>>
>> On Thursday, March 17, 2016, Marcelo Vanzin <va...@cloudera.com> wrote:
>>>
>>> Hello all,
>>>
>>> Recently a lot of the streaming backends were moved to a separate
>>> project on github and removed from the main Spark repo.
>>>
>>> While I think the idea is great, I'm a little worried about the
>>> execution. Some concerns were already raised on the bug mentioned
>>> above, but I'd like to have a more explicit discussion about this so
>>> things don't fall through the cracks.
>>>
>>> Mainly I have three concerns.
>>>
>>> i. Ownership
>>>
>>> That code used to be run by the ASF, but now it's hosted in a github
>>> repo owned not by the ASF. That sounds a little sub-optimal, if not
>>> problematic.
>>>
>>> ii. Governance
>>>
>>> Similar to the above; who has commit access to the above repos? Will
>>> all the Spark committers, present and future, have commit access to
>>> all of those repos? Are they still going to be considered part of
>>> Spark and have release management done through the Spark community?
>>>
>>>
>>> For both of the questions above, why are they not turned into
>>> sub-projects of Spark and hosted on the ASF repos? I believe there is
>>> a mechanism to do that, without the need to keep the code in the main
>>> Spark repo, right?
>>>
>>> iii. Usability
>>>
>>> This is another thing I don't see discussed. For Scala-based code
>>> things don't change much, I guess, if the artifact names don't change
>>> (another reason to keep things in the ASF?), but what about python?
>>> How are pyspark users expected to get that code going forward, since
>>> it's not in Spark's pyspark.zip anymore?
>>>
>>>
>>> Is there an easy way of keeping these things within the ASF Spark
>>> project? I think that would be better for everybody.
>>>
>>> --
>>> Marcelo
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org<ma...@spark.apache.org>
>>> For additional commands, e-mail: dev-help@spark.apache.org<ma...@spark.apache.org>
>>>
>>





Re: SPARK-13843 and future of streaming backends

Posted by Cody Koeninger <co...@koeninger.org>.
Are you talking about group/identifier name, or contained classes?

Because there are plenty of org.apache.* classes distributed via maven
with non-apache group / identifiers.

On Fri, Mar 25, 2016 at 6:54 PM, David Nalley <ke...@apache.org> wrote:
>
>> As far as group / artifact name compatibility, at least in the case of
>> Kafka we need different artifact names anyway, and people are going to
>> have to make changes to their build files for spark 2.0 anyway.   As
>> far as keeping the actual classes in org.apache.spark to not break
>> code despite the group name being different, I don't know whether that
>> would be enforced by maven central, just looked at as poor taste, or
>> ASF suing for trademark violation :)
>
>
> Sonatype has strict instructions to only permit org.apache.* to originate from repository.apache.org. Exceptions to that must be approved by VP, Infrastructure.
> ------
> Sent via Pony Mail for dev@spark.apache.org.
> View this email online at:
> https://pony-poc.apache.org/list.html?dev@spark.apache.org
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> For additional commands, e-mail: dev-help@spark.apache.org
>



Re: SPARK-13843 and future of streaming backends

Posted by David Nalley <ke...@apache.org>.
> As far as group / artifact name compatibility, at least in the case of
> Kafka we need different artifact names anyway, and people are going to
> have to make changes to their build files for spark 2.0 anyway.   As
> far as keeping the actual classes in org.apache.spark to not break
> code despite the group name being different, I don't know whether that
> would be enforced by maven central, just looked at as poor taste, or
> ASF suing for trademark violation :)


Sonatype has strict instructions to only permit org.apache.* to originate from repository.apache.org. Exceptions to that must be approved by VP, Infrastructure.
------
Sent via Pony Mail for dev@spark.apache.org. 
View this email online at:
https://pony-poc.apache.org/list.html?dev@spark.apache.org



Re: SPARK-13843 and future of streaming backends

Posted by Cody Koeninger <co...@koeninger.org>.
There's a difference between "without discussion" and "without as much
discussion as I would have liked to have a chance to notice it".
There are plenty of PRs that got merged before I noticed them that I
would rather have not gotten merged.

As far as group / artifact name compatibility, at least in the case of
Kafka we need different artifact names anyway, and people are going to
have to make changes to their build files for spark 2.0 anyway.   As
far as keeping the actual classes in org.apache.spark to not break
code despite the group name being different, I don't know whether that
would be enforced by maven central, just looked at as poor taste, or
ASF suing for trademark violation :)
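For illustration, the build-file churn in question would look something like this in sbt (the exact new artifact name and versions are assumptions about the 2.0 plan, not settled facts):

```scala
// Spark 1.x dependency for the Kafka DStream integration:
// libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka" % "1.6.1"

// Proposed Spark 2.x shape, with one artifact per supported Kafka version:
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-8" % "2.0.0"
```

Either way, users touch their build files once for 2.0, so renaming the artifact adds little extra cost.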

For people who would rather the problem be solved with official ASF
subprojects, which committers are volunteering to help do that work?
Reynold already said he doesn't want to mess with that overhead.

I'm fine with continuing to help work on the Kafka integration
wherever it ends up, I'd just like the color of the bikeshed to get
decided so we can build a decent bike...


On Thu, Mar 17, 2016 at 3:51 PM, Hari Shreedharan
<hs...@cloudera.com> wrote:
> I have worked with various ASF projects for 4+ years now. Sure, ASF projects
> can delete code as they see fit. But this is the first time I have really
> seen code being "moved out" of a project without discussion. I am sure you
> can do this without violating ASF policy, but the explanation for that would
> be convoluted (someone decided to make a copy and then the ASF project
> deleted it?).
>
> Also, moving the code out would break compatibility. AFAIK, there is no way
> to push org.apache.* artifacts directly to maven central. That happens via
> mirroring from the ASF maven repos. Even if you could somehow directly
> push the artifacts to mvn, you really can push to org.apache.* groups only
> if you are part of the repo and acting as an agent of that project (which in
> this case would be Apache Spark). Once you move the code out, even a
> committer/PMC member would not be representing the ASF when pushing the
> code. I am not sure if there is a way to fix this issue.
>
>
> Thanks,
> Hari
>
> On Thu, Mar 17, 2016 at 1:13 PM, Mridul Muralidharan <mr...@gmail.com>
> wrote:
>>
>> I am not referring to code edits - but to migrating submodules and
>> code currently in Apache Spark to 'outside' of it.
>> If I understand correctly, assets from Apache Spark are being moved
>> out of it into thirdparty external repositories - not owned by Apache.
>>
>> At a minimum, dev@ discussion (like this one) should be initiated.
>> As PMC is responsible for the project assets (including code), signoff
>> is required for it IMO.
>>
>> More experienced Apache members might opine better in case I got it
>> wrong!
>>
>>
>> Regards,
>> Mridul
>>
>>
>> On Thu, Mar 17, 2016 at 12:55 PM, Cody Koeninger <co...@koeninger.org>
>> wrote:
>> > Why would a PMC vote be necessary on every code deletion?
>> >
>> > There was a Jira and pull request discussion about the submodules that
>> > have been removed so far.
>> >
>> > https://issues.apache.org/jira/browse/SPARK-13843
>> >
>> > There's another ongoing one about Kafka specifically
>> >
>> > https://issues.apache.org/jira/browse/SPARK-13877
>> >
>> >
>> > On Thu, Mar 17, 2016 at 2:49 PM, Mridul Muralidharan <mr...@gmail.com>
>> > wrote:
>> >>
>> >> I was not aware of a discussion in Dev list about this - agree with
>> >> most of
>> >> the observations.
>> >> In addition, I did not see PMC signoff on moving (sub-)modules out.
>> >>
>> >> Regards
>> >> Mridul
>> >>
>> >>
>> >>
>> >> On Thursday, March 17, 2016, Marcelo Vanzin <va...@cloudera.com>
>> >> wrote:
>> >>>
>> >>> Hello all,
>> >>>
>> >>> Recently a lot of the streaming backends were moved to a separate
>> >>> project on github and removed from the main Spark repo.
>> >>>
>> >>> While I think the idea is great, I'm a little worried about the
>> >>> execution. Some concerns were already raised on the bug mentioned
>> >>> above, but I'd like to have a more explicit discussion about this so
>> >>> things don't fall through the cracks.
>> >>>
>> >>> Mainly I have three concerns.
>> >>>
>> >>> i. Ownership
>> >>>
>> >>> That code used to be run by the ASF, but now it's hosted in a github
>> >>> repo owned not by the ASF. That sounds a little sub-optimal, if not
>> >>> problematic.
>> >>>
>> >>> ii. Governance
>> >>>
>> >>> Similar to the above; who has commit access to the above repos? Will
>> >>> all the Spark committers, present and future, have commit access to
>> >>> all of those repos? Are they still going to be considered part of
>> >>> Spark and have release management done through the Spark community?
>> >>>
>> >>>
>> >>> For both of the questions above, why are they not turned into
>> >>> sub-projects of Spark and hosted on the ASF repos? I believe there is
>> >>> a mechanism to do that, without the need to keep the code in the main
>> >>> Spark repo, right?
>> >>>
>> >>> iii. Usability
>> >>>
>> >>> This is another thing I don't see discussed. For Scala-based code
>> >>> things don't change much, I guess, if the artifact names don't change
>> >>> (another reason to keep things in the ASF?), but what about python?
>> >>> How are pyspark users expected to get that code going forward, since
>> >>> it's not in Spark's pyspark.zip anymore?
>> >>>
>> >>>
>> >>> Is there an easy way of keeping these things within the ASF Spark
>> >>> project? I think that would be better for everybody.
>> >>>
>> >>> --
>> >>> Marcelo
>> >>>
>> >>> ---------------------------------------------------------------------
>> >>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> >>> For additional commands, e-mail: dev-help@spark.apache.org
>> >>>
>> >>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> For additional commands, e-mail: dev-help@spark.apache.org
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: SPARK-13843 and future of streaming backends

Posted by Hari Shreedharan <hs...@cloudera.com>.
I have worked with various ASF projects for 4+ years now. Sure, ASF
projects can delete code as they see fit. But this is the first time I
have seen code being "moved out" of a project without discussion. I am
sure you can do this without violating ASF policy, but the explanation
for that would be convoluted (someone decided to make a copy, and then
the ASF project deleted the original?).

Also, moving the code out would break compatibility. AFAIK, there is no way
to push org.apache.* artifacts directly to Maven Central; that happens via
mirroring from the ASF Maven repos. Even if you could somehow push the
artifacts directly to Maven Central, you can push to org.apache.* groups only
if you are part of the project and acting as an agent of that project (which
in this case would be Apache Spark). Once you move the code out, even a
committer/PMC member would not be representing the ASF when pushing the
code. I am not sure there is a way to fix this issue.
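
To make the compatibility concern concrete, here is a sketch of what a
downstream build would look like before and after such a move. The pre-move
coordinate is the ASF-released artifact; the post-move groupId and version are
purely hypothetical placeholders, since nothing has been announced:

```scala
// sbt build fragment (sketch only).

// Before the move: connectors released by the ASF under org.apache.spark.
libraryDependencies += "org.apache.spark" %% "spark-streaming-mqtt" % "1.6.1"

// After the move: the external project cannot publish under org.apache.*,
// so users would have to switch to some new, non-ASF coordinate, e.g.:
libraryDependencies += "com.example" %% "spark-streaming-mqtt" % "2.0.0"
```

Because the groupId changes, every downstream build file would have to be
edited, which is exactly the compatibility break described above.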


Thanks,
Hari

On Thu, Mar 17, 2016 at 1:13 PM, Mridul Muralidharan <mr...@gmail.com>
wrote:

> I am not referring to code edits - but to migrating submodules and
> code currently in Apache Spark to 'outside' of it.
> If I understand correctly, assets from Apache Spark are being moved
> out of it into thirdparty external repositories - not owned by Apache.
>
> At a minimum, dev@ discussion (like this one) should be initiated.
> As PMC is responsible for the project assets (including code), signoff
> is required for it IMO.
>
> More experienced Apache members might be opine better in case I got it
> wrong !
>
>
> Regards,
> Mridul

Re: SPARK-13843 and future of streaming backends

Posted by Mridul Muralidharan <mr...@gmail.com>.
I am not referring to code edits, but to migrating submodules and
code currently in Apache Spark to 'outside' of it.
If I understand correctly, assets from Apache Spark are being moved
out of it into third-party external repositories not owned by Apache.

At a minimum, dev@ discussion (like this one) should be initiated.
As the PMC is responsible for the project's assets (including code), sign-off
is required for it, IMO.

More experienced Apache members might be able to opine better in case I got it wrong!


Regards,
Mridul


On Thu, Mar 17, 2016 at 12:55 PM, Cody Koeninger <co...@koeninger.org> wrote:
> Why would a PMC vote be necessary on every code deletion?
>
> There was a Jira and pull request discussion about the submodules that
> have been removed so far.
>
> https://issues.apache.org/jira/browse/SPARK-13843
>
> There's another ongoing one about Kafka specifically
>
> https://issues.apache.org/jira/browse/SPARK-13877

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: SPARK-13843 and future of streaming backends

Posted by Cody Koeninger <co...@koeninger.org>.
Why would a PMC vote be necessary on every code deletion?

There was a Jira and pull request discussion about the submodules that
have been removed so far.

https://issues.apache.org/jira/browse/SPARK-13843

There's another ongoing discussion about Kafka specifically:

https://issues.apache.org/jira/browse/SPARK-13877


On Thu, Mar 17, 2016 at 2:49 PM, Mridul Muralidharan <mr...@gmail.com> wrote:
>
> I was not aware of a discussion in Dev list about this - agree with most of
> the observations.
> In addition, I did not see PMC signoff on moving (sub-)modules out.
>
> Regards
> Mridul

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: SPARK-13843 and future of streaming backends

Posted by Mridul Muralidharan <mr...@gmail.com>.
I was not aware of a discussion on the dev list about this; I agree with most
of the observations.
In addition, I did not see PMC signoff on moving (sub-)modules out.

Regards
Mridul


On Thursday, March 17, 2016, Marcelo Vanzin <va...@cloudera.com> wrote:

> Hello all,
>
> Recently a lot of the streaming backends were moved to a separate
> project on github and removed from the main Spark repo.
>
> While I think the idea is great, I'm a little worried about the
> execution. Some concerns were already raised on the bug mentioned
> above, but I'd like to have a more explicit discussion about this so
> things don't fall through the cracks.
>
> Mainly I have three concerns.
>
> i. Ownership
>
> That code used to be run by the ASF, but now it's hosted in a github
> repo owned not by the ASF. That sounds a little sub-optimal, if not
> problematic.
>
> ii. Governance
>
> Similar to the above; who has commit access to the above repos? Will
> all the Spark committers, present and future, have commit access to
> all of those repos? Are they still going to be considered part of
> Spark and have release management done through the Spark community?
>
>
> For both of the questions above, why are they not turned into
> sub-projects of Spark and hosted on the ASF repos? I believe there is
> a mechanism to do that, without the need to keep the code in the main
> Spark repo, right?
>
> iii. Usability
>
> This is another thing I don't see discussed. For Scala-based code
> things don't change much, I guess, if the artifact names don't change
> (another reason to keep things in the ASF?), but what about python?
> How are pyspark users expected to get that code going forward, since
> it's not in Spark's pyspark.zip anymore?
>
>
> Is there an easy way of keeping these things within the ASF Spark
> project? I think that would be better for everybody.
>
> --
> Marcelo