Posted to dev@spark.apache.org by Lars Francke <la...@gmail.com> on 2019/07/26 13:59:09 UTC

Apache Training contribution for Spark - Feedback welcome

Hi Spark community,

you may or may not have heard of a new-ish (February 2019) project at
Apache: Apache Training (incubating). We aim to develop training material
about various projects inside and outside the ASF:
<http://training.apache.org/>

One of our users wants to contribute material on Spark[1]

We've done something similar for ZooKeeper[2] in the past and the ZooKeeper
community provided excellent feedback which helped make the product much
better[3].

That's why I'd like to invite everyone here to provide any kind of feedback
on the content donation. It is currently in PowerPoint format which makes
it a bit harder to review so we're happy to accept feedback in any form.

The idea is to convert the material to AsciiDoc at some point.

Cheers,
Lars

(I didn't want to cross post to user@ as well but this is obviously not
limited to dev@ users)

[1] <https://issues.apache.org/jira/browse/TRAINING-17>
[2] <https://issues.apache.org/jira/browse/TRAINING-13>
[3] You can see the content here
<https://github.com/apache/incubator-training/blob/master/content/ZooKeeper/src/main/asciidoc/index_en.adoc>

Re: Apache Training contribution for Spark - Feedback welcome

Posted by Lars Francke <la...@gmail.com>.
On Mon, Jul 29, 2019 at 2:46 PM Sean Owen <sr...@gmail.com> wrote:

> TL;DR is: take the below as feedback to consider, and proceed as you
> see fit. Nobody's suggesting you can't do this.
>
> On Mon, Jul 29, 2019 at 2:58 AM Lars Francke <la...@gmail.com> wrote:
> > The way I read your point is that anyone can publish material (which
> > includes source code) under the ALv2 outside of the ASF so why should they
> > donate anything to the ASF?
> > If that's what you meant why have Apache Spark or any other Apache
> > project for that matter.
> >> I think your premise is that people will _collaborate_ on training
> >> materials if there's an ASF project around it. Maybe so but see below.
> > That's our hope, yes. Should we not do this because it _could_ fail?
>
> Yep this is the answer to your question. The ASF exists to facilitate
> collaboration, not just host. I think the dynamics around
> collaboration on open standard software vs training materials are
> materially different.
>

I don't see a big difference between the two things.
Content is already being collaborated on today (see documentation, websites
and the few instances of training that exist or Wikipedia for that matter).
I'm afraid we'll need to agree to disagree on this one.


> > We - as a company - have created material and sold it for years but
> > every time I give a training I see something that I should have updated and
> > it's become impossible to keep up. I see the same outdated material from
> > other organizations, we've talked to half a dozen or so training companies
> > and they all have the same problem. To create quality training material you
> > really need someone with deep insider knowledge, and those people are hard
> > to come by.
> > So we're trying to shift and collaborate on the material and then
> > differentiate ourselves by the trainer itself.
>
> I think this hand-waves past a lot of the concern raised here, but OK
> it's an experiment.
> I don't think it's 'wrong' to try to get people to collaborate on
> slides, sure. It may work well. If it doesn't for reasons raised here,
> well, worse things have happened.
> Consider how you might mitigate possible problems:
> a) what happens when another company wants to donate its Spark content?
>

This has been decided at the ASF level already (allow competing projects,
e.g. Flink & Spark). At the Apache Training level we briefly talked about
that as well. I don't want to go into details of the process but the short
version is: We'd accept anything and would then try to incorporate it into
existing stuff.

> b) can you enshrine some best practices like making sure the content
> disclaims official association with the ASF? e.g. a trainer delivering
> it has to note the source but make clear it's not Apache training,
>

Yes.


> etc.
>

Re: Apache Training contribution for Spark - Feedback welcome

Posted by Sean Owen <sr...@gmail.com>.
TL;DR is: take the below as feedback to consider, and proceed as you
see fit. Nobody's suggesting you can't do this.

On Mon, Jul 29, 2019 at 2:58 AM Lars Francke <la...@gmail.com> wrote:
> The way I read your point is that anyone can publish material (which includes source code) under the ALv2 outside of the ASF so why should they donate anything to the ASF?
> If that's what you meant why have Apache Spark or any other Apache project for that matter.
>> I think your premise is that people will _collaborate_ on training
>> materials if there's an ASF project around it. Maybe so but see below.
> That's our hope, yes. Should we not do this because it _could_ fail?

Yep this is the answer to your question. The ASF exists to facilitate
collaboration, not just host. I think the dynamics around
collaboration on open standard software vs training materials are
materially different.

> We - as a company - have created material and sold it for years but every time I give a training I see something that I should have updated and it's become impossible to keep up. I see the same outdated material from other organizations, we've talked to half a dozen or so training companies and they all have the same problem. To create quality training material you really need someone with deep insider knowledge, and those people are hard to come by.
> So we're trying to shift and collaborate on the material and then differentiate ourselves by the trainer itself.

I think this hand-waves past a lot of the concern raised here, but OK
it's an experiment.
I don't think it's 'wrong' to try to get people to collaborate on
slides, sure. It may work well. If it doesn't for reasons raised here,
well, worse things have happened.
Consider how you might mitigate possible problems:
a) what happens when another company wants to donate its Spark content?
b) can you enshrine some best practices like making sure the content
disclaims official association with the ASF? e.g. a trainer delivering
it has to note the source but make clear it's not Apache training,
etc.

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Re: Apache Training contribution for Spark - Feedback welcome

Posted by Lars Francke <la...@gmail.com>.
Happy to discuss this here but you're also invited to bring those points up
at dev@training as other projects might have similar concerns.

The request for assistance still stands. If anyone here is interested in
helping out reviewing and improving the material please reach out.


On Sat, Jul 27, 2019 at 12:01 AM Sean Owen <sr...@gmail.com> wrote:

> On Fri, Jul 26, 2019 at 4:01 PM Lars Francke <la...@gmail.com> wrote:
> > I understand why it might be seen that way and we need to make sure to
> > point out that we have no intention of becoming "The official Apache Spark
> > training" because that's not our intention at all.
>
> Of course that's the intention; the problem is perception, and I think
> that's a real problem no matter the intention.
>

Agreed. But that won't stop us from accepting or publishing content. If
that were a dealbreaker then we could move the Training project to the
Attic now.
Along with Livy, Toree, Phoenix, Hivemall and probably dozens of other ASF
projects which provide things on top of other ASF projects.
None of those is endorsed as "The official X for Y".


> > In this case, however, a company decided to donate their internal
> > material - they didn't create this from scratch for the Apache Training
> > project.
> > We want to encourage contributions and just because someone else has
> > already created material shouldn't stop us from accepting this.
>
> This much doesn't seem like a compelling motive. Anyone can already
> donate their materials to the public domain or publish under the ALv2.
> The existence of an Apache project around it doesn't do anything...
> except your point below maybe:
>
>
> > Every company creates its own material as an asset to sell. There's very
> > little quality open-source material out there.
>
> (Except the example I already gave, among many others! There's a lot
> of free content)
>

The way I read your point is that anyone can publish material (which
includes source code) under the ALv2 outside of the ASF so why should they
donate anything to the ASF?
If that's what you meant why have Apache Spark or any other Apache project
for that matter.

But I don't think that's what you're trying to say.
Hence I believe I must have misunderstood and would ask you to
rephrase/reiterate your point, please.


> > We did some research around training and especially open-source training
> > before we started the initiative and there are some projects out there that
> > do this but all we found were silos with a relatively narrow focus and no
> > greater community.
>
> I think your premise is that people will _collaborate_ on training
> materials if there's an ASF project around it. Maybe so but see below.
>

That's our hope, yes. Should we not do this because it _could_ fail?


> > Regarding your "outlines" comment: No, this is the "final" material
> > (pending review of course). With "Training" we mean training in the sense
> > that Cloudera, Databricks et al. sell as well where an instructor-led
> > course is being given using slides. These slides can, but don't have to
> > speak for themselves. We're fine with the requirement that an experienced
> > instructor needs to give this training. But this is just this content.
> > We're also happy to accept other forms of content that are meant for a
> > different way of consumption (self-serve). We don't intend to write
> > exhaustive or authoritative documentation for projects.
>
> Are we talking about the content attached at TRAINING-17? It doesn't
> look nearly complete or comprehensive enough to endorse as Spark
> training material, IMHO. Again compare to even Jacek's site and
> content for an example of what I think that would look like. It's
> orders of magnitude more complete. I speak for myself, but I would not
> want to endorse that as Spark training with my Apache hat.
>
> I know the premise is, I think, these are _slides_ that trainers can
> deliver, but by themselves there is not enough content for trainers to
> know what to train.
>

No one wants to endorse anything as "official" anything.
And yes: This material is not perfect, but that's how open source works,
isn't it?
This is an initial patch which can be used to collaborate and improve upon.
This is also how Spark works; otherwise it'd have been perfect from version
0.1.

Again: I agree Jacek's material is more complete and we could reach out to
him (assuming he reads this anyway) but the fact is that this company did
so first and I want to encourage contributions.

All we're asking for here is help from the Spark community in making our
content better hoping that someone is interested. If not we'll do the best
we can ourselves. But this is where the experts are.


> What is the need this solves -- is there really demand for 'open
> source' training materials? My experience is that training is by
> definition professional services, and has to be delivered by people as
> a for-pay business, and they need to differentiate on the quality they
> provide. It's just materially different from having open standard
> software.
>

Yes, there is a demand and I disagree that it's materially different from
having open standard software.
I have not compared Jacek's material to the one in TRAINING-17 or to my own
but I'm willing to bet that there are lots and lots of redundancies.
The same concepts explained over and over in similar terms.
What's the value in that?

We - as a company - have created material and sold it for years but every
time I give a training I see something that I should have updated and it's
become impossible to keep up. I see the same outdated material from other
organizations, we've talked to half a dozen or so training companies and
they all have the same problem. To create quality training material you
really need someone with deep insider knowledge, and those people are hard
to come by.
So we're trying to shift and collaborate on the material and then
differentiate ourselves by the trainer itself.
We'll see how that works out.

Cheers,
Lars

Re: Apache Training contribution for Spark - Feedback welcome

Posted by Sean Owen <sr...@gmail.com>.
On Fri, Jul 26, 2019 at 4:01 PM Lars Francke <la...@gmail.com> wrote:
> I understand why it might be seen that way and we need to make sure to point out that we have no intention of becoming "The official Apache Spark training" because that's not our intention at all.

Of course that's the intention; the problem is perception, and I think
that's a real problem no matter the intention.


> In this case, however, a company decided to donate their internal material - they didn't create this from scratch for the Apache Training project.
> We want to encourage contributions and just because someone else has already created material shouldn't stop us from accepting this.

This much doesn't seem like a compelling motive. Anyone can already
donate their materials to the public domain or publish under the ALv2.
The existence of an Apache project around it doesn't do anything...
except your point below maybe:


> Every company creates its own material as an asset to sell. There's very little quality open-source material out there.

(Except the example I already gave, among many others! There's a lot
of free content)


> We did some research around training and especially open-source training before we started the initiative and there are some projects out there that do this but all we found were silos with a relatively narrow focus and no greater community.

I think your premise is that people will _collaborate_ on training
materials if there's an ASF project around it. Maybe so but see below.


> Regarding your "outlines" comment: No, this is the "final" material (pending review of course). With "Training" we mean training in the sense that Cloudera, Databricks et al. sell as well where an instructor-led course is being given using slides. These slides can, but don't have to speak for themselves. We're fine with the requirement that an experienced instructor needs to give this training. But this is just this content. We're also happy to accept other forms of content that are meant for a different way of consumption (self-serve). We don't intend to write exhaustive or authoritative documentation for projects.

Are we talking about the content attached at TRAINING-17? It doesn't
look nearly complete or comprehensive enough to endorse as Spark
training material, IMHO. Again compare to even Jacek's site and
content for an example of what I think that would look like. It's
orders of magnitude more complete. I speak for myself, but I would not
want to endorse that as Spark training with my Apache hat.

I know the premise is, I think, these are _slides_ that trainers can
deliver, but by themselves there is not enough content for trainers to
know what to train.

What is the need this solves -- is there really demand for 'open
source' training materials? My experience is that training is by
definition professional services, and has to be delivered by people as
a for-pay business, and they need to differentiate on the quality they
provide. It's just materially different from having open standard
software.



Re: Apache Training contribution for Spark - Feedback welcome

Posted by Lars Francke <la...@gmail.com>.
Sean,

thanks for taking the time to comment.

We've discussed those issues during the proposal stage for the Incubator as
others brought them up as well. I can't remember all the details but let me
go through your points inline.

> My reservation here is that as an Apache project, it might appear to
> 'bless' one set of materials as authoritative over all the others out
> there.


I understand why it might be seen that way and we need to make sure to
point out that we have no intention of becoming "The official Apache Spark
training" because that's not our intention at all.


> And there are already lots of good ones. For example, Jacek has
> long maintained a very comprehensive set of free Spark training
> materials at https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
> In comparison the slides I see proposed so far only seem like
> outlines?
>

Jacek is indeed doing a fantastic job (and I'm sure others as well).

In this case, however, a company decided to donate their internal material
- they didn't create this from scratch for the Apache Training project.
We want to encourage contributions and just because someone else has
already created material shouldn't stop us from accepting this.

The opposite in fact: There's very little collaboration - in general -
around training material.
Every company creates its own material as an asset to sell. There's very
little quality open-source material out there.
I'm not sure how many companies have created Spark training courses. I
wouldn't be surprised if it goes into the hundreds. And everyone draws the
same or very similar slides (what's an RDD, what's a DataFrame etc.)
We hope to change that and this contribution can be a first step.

We did some research around training and especially open-source training
before we started the initiative and there are some projects out there that
do this but all we found were silos with a relatively narrow focus and no
greater community.

Regarding your "outlines" comment: No, this is the "final" material
(pending review of course). With "Training" we mean training in the sense
that Cloudera, Databricks et al. sell as well where an instructor-led
course is being given using slides. These slides can, but don't have to
speak for themselves. We're fine with the requirement that an experienced
instructor needs to give this training. But this is just this content.
We're also happy to accept other forms of content that are meant for a
different way of consumption (self-serve). We don't intend to write
exhaustive or authoritative documentation for projects.

It just frees people from having to do the tedious work of creating (and
updating) hundreds of slides.

> It's also a separate project from Spark. We might have trouble
> ensuring the info is maintained and up to date, and sometimes outdated
> or incorrect info is worse than none - especially if it appears quasi
> official. The Spark project already maintains and updates its docs
> (which can always be better), so already has its hands full there.
>

Definitely. Outdated information is always a danger and I have no guarantee
that this isn't going to happen here.
The fact that this is hosted and governed by the ASF makes it less likely
to be completely abandoned though as there are clear processes in place for
collaboration that don't depend on a single person (which might be the case
with some of the other things that already exist).
We also hope that communities - like Spark - are also interested in
collaborating and while patches are always welcome so is creating a Jira to
point out outdated information.


> Personally, no strong objection here, but, what's the upside to
> running this as an ASF project vs just letting people continue to
> publish quality tutorials online?
>

Some points come to mind. This list is not exhaustive, and not all points
apply equally to all the material that others have published:

- Clear and easy guidelines for collaboration
- Not a "bus factor" of one
- Everything is open-source with a friendly license and customizable
- We're still just getting started but because we already have four or five
different contributions we can share one technology stack between all of
them making it easier to collaborate ("everything looks familiar") and
every piece of content benefits from improvements in the technical stack
- We hope to have non-tool focused sessions later as well (e.g. Ingesting
data from Kafka into Elasticsearch using Spark [okay, this would maybe be a
bit too specific for now but something along the lines of a "Data
Ingestion" training]) where we can mix and match from the content we have

I'd have to dig into the original discuss threads in the incubator to find
more but I hope this helps a bit?

Cheers,
Lars


>
>
> On Fri, Jul 26, 2019 at 9:00 AM Lars Francke <la...@gmail.com> wrote:
> >
> > Hi Spark community,
> >
> > you may or may not have heard of a new-ish (February 2019) project at
> > Apache: Apache Training (incubating). We aim to develop training material
> > about various projects inside and outside the ASF:
> > <http://training.apache.org/>
> >
> > One of our users wants to contribute material on Spark[1]
> >
> > We've done something similar for ZooKeeper[2] in the past and the
> > ZooKeeper community provided excellent feedback which helped make the
> > product much better[3].
> >
> > That's why I'd like to invite everyone here to provide any kind of
> > feedback on the content donation. It is currently in PowerPoint format
> > which makes it a bit harder to review so we're happy to accept feedback in
> > any form.
> >
> > The idea is to convert the material to AsciiDoc at some point.
> >
> > Cheers,
> > Lars
> >
> > (I didn't want to cross post to user@ as well but this is obviously not
> > limited to dev@ users)
> >
> > [1] <https://issues.apache.org/jira/browse/TRAINING-17>
> > [2] <https://issues.apache.org/jira/browse/TRAINING-13>
> > [3] You can see the content here
> > <https://github.com/apache/incubator-training/blob/master/content/ZooKeeper/src/main/asciidoc/index_en.adoc>
>

Re: Apache Training contribution for Spark - Feedback welcome

Posted by Sean Owen <sr...@gmail.com>.
Generally speaking, I think we want to encourage more training and
tutorial content out there, for sure, so, the more the merrier.

My reservation here is that as an Apache project, it might appear to
'bless' one set of materials as authoritative over all the others out
there. And there are already lots of good ones. For example, Jacek has
long maintained a very comprehensive set of free Spark training
materials at https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
In comparison the slides I see proposed so far only seem like
outlines?

It's also a separate project from Spark. We might have trouble
ensuring the info is maintained and up to date, and sometimes outdated
or incorrect info is worse than none - especially if it appears quasi
official. The Spark project already maintains and updates its docs
(which can always be better), so already has its hands full there.

Personally, no strong objection here, but, what's the upside to
running this as an ASF project vs just letting people continue to
publish quality tutorials online?



On Fri, Jul 26, 2019 at 9:00 AM Lars Francke <la...@gmail.com> wrote:
>
> Hi Spark community,
>
> you may or may not have heard of a new-ish (February 2019) project at Apache: Apache Training (incubating). We aim to develop training material about various projects inside and outside the ASF: <http://training.apache.org/>
>
> One of our users wants to contribute material on Spark[1]
>
> We've done something similar for ZooKeeper[2] in the past and the ZooKeeper community provided excellent feedback which helped make the product much better[3].
>
> That's why I'd like to invite everyone here to provide any kind of feedback on the content donation. It is currently in PowerPoint format which makes it a bit harder to review so we're happy to accept feedback in any form.
>
> The idea is to convert the material to AsciiDoc at some point.
>
> Cheers,
> Lars
>
> (I didn't want to cross post to user@ as well but this is obviously not limited to dev@ users)
>
> [1] <https://issues.apache.org/jira/browse/TRAINING-17>
> [2] <https://issues.apache.org/jira/browse/TRAINING-13>
> [3] You can see the content here <https://github.com/apache/incubator-training/blob/master/content/ZooKeeper/src/main/asciidoc/index_en.adoc>
