Posted to dev@spark.apache.org by Erik Erlandson <ee...@redhat.com> on 2018/02/15 20:25:25 UTC

Re: Publishing official docker images for KubernetesSchedulerBackend

I've been spending some time learning more about GPL, as it may apply to
container images. Briefly, I believe that certain limits to the default
"viral" nature of GPL apply in the case of a hypothetical Apache Spark
image. For one example, the "mere aggregation"
<https://www.gnu.org/licenses/gpl-faq.en.html#MereAggregation> clause
firewalls interactions via command-line calls; this should apply to Spark's
use of bash and invocations of the jvm. There is a "classpath exception"
<https://www.gnu.org/software/classpath/license.html> that also applies to
linking against java libs (iiuc this is a common exception but is not
guaranteed by default, and so might need to be verified on a per-case
basis). There are similar exceptions for gcc
<https://www.gnu.org/licenses/gcc-exception-faq.html> and libgcc
<https://www.gnu.org/licenses/gpl-faq.en.html#LibGCCException>.

There is an interesting discussion of some of these issues in this blog
post:
https://opensource.com/article/18/1/containers-gpl-and-copyleft

It would be nice if these weren't bolted-on exceptions, but the general
spirit of them seems to suggest that the FSF does not want GPL to be
contagious in ways that discourage its use in other open source projects.
My "IANAL" take-away is that publishing images for Apache Spark would not
run afoul of GPL contamination.

There are also considerations around "making source available"; however, the GPL
is clear that this need not be "on the same site" as any runnable
distribution. Possibly this depends on the semantics of "make available",
but the package source for GPL components such as bash, openjdk, etc. IS
readily available, and it's hard to imagine a scenario where it wouldn't be.

Lastly, the ASF may have other motivations besides purely "GPL firewalling"
for disapproving of container images as a downstream release channel.  If
this is the case, then GPL firewalling exceptions don't necessarily change
the policy outcome; but I felt it was worth mentioning that the inclusion
of GPL-licensed dependencies in a container image is not (by itself)
necessarily toxic to ASF licensing.


On Tue, Dec 19, 2017 at 3:06 PM, Sean Owen <so...@cloudera.com> wrote:

> I'd follow LEGAL-270, yes.  The best resource on licensing is
> https://www.apache.org/legal/resolved.html ; it doesn't all have to be
> AL2, but needs to be compatible (sometimes with additional conditions).
> Auditing is basically entrusted to the PMC when voting on releases. I'll
> look at it with you.
>
> Only bits that are redistributed officially matter. That is, a Dockerfile
> itself has no licensing issues. Images with copies of software would be the
> issue. Distributing a whole JVM and Python distro is probably going to
> bring in far too much.
>
> On Tue, Dec 19, 2017 at 2:59 PM Erik Erlandson <ee...@redhat.com>
> wrote:
>
>>
>> Here are some specific questions I'd recommend for the Apache Spark PMC
>> to bring to ASF legal counsel:
>>
>> 1) Does the philosophy described on LEGAL-270 still represent a
>> sanctioned approach to publishing releases via container image?
>> 2) If the transitive closure of pulled-in licenses on each of these
>> images is limited to licenses that are defined as compatible with
>> Apache-2 <https://www.apache.org/legal/resolved.html>, does that satisfy
>> ASF licensing and legal guidelines?
>> 3) What form of documentation/auditing for (2) should be provided to meet
>> legal requirements?
>>
>> I would define the proposed action this way: to include, as part of the
>> Apache Spark official release process, publishing a "spark-base" image, to
>> be tagged with the specific release, that consists of a build of the spark
>> code for that release installed on a base-image (currently alpine, but
>> possibly some other alternative like centos), combined with the jvm and
>> python (and any of their transitive deps).  Additionally, some number of
>> images derived from "spark-base" would be built, which consist of
>> spark-base and a small layer of bash scripting for ENTRYPOINT and CMD, to
>> support the kubernetes back-end.  Optionally, similar images targeted for
>> mesos or yarn might also be created.
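>>
>> For concreteness, here is a minimal sketch of what such a "spark-base"
>> Dockerfile might look like (the base image, package names, and install
>> path below are illustrative assumptions, not the actual build):
>>
>>     FROM openjdk:8-jre-alpine
>>
>>     # bash is used by the launcher scripts; python enables PySpark
>>     RUN apk add --no-cache bash python3
>>
>>     # install the Spark distribution built from the release source
>>     ENV SPARK_HOME=/opt/spark
>>     COPY dist/ ${SPARK_HOME}/
>>     ENV PATH="${SPARK_HOME}/bin:${PATH}"
>>     WORKDIR ${SPARK_HOME}
>>
>> Note that a base layer like this would define no ENTRYPOINT or CMD, so
>> that the kubernetes, mesos, or yarn specific layers can each add their own.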
>>
>>
>> On Tue, Dec 19, 2017 at 1:28 PM, Mark Hamstra <ma...@clearstorydata.com>
>> wrote:
>>
>>> Reasoning by analogy to other Apache projects is generally not
>>> sufficient when it comes to securing legally permissible form or behavior --
>>> that another project is doing something is not a guarantee that they are
>>> doing it right. If we have issues or legal questions, we need to formulate
>>> them and our proposed actions as clearly and concretely as possible so that
>>> the PMC can take those issues, questions and proposed actions to Apache
>>> counsel for advice or guidance.
>>>
>>> On Tue, Dec 19, 2017 at 10:34 AM, Erik Erlandson <ee...@redhat.com>
>>> wrote:
>>>
>>>> I've been looking a bit more into ASF legal posture on licensing and
>>>> container images. What I have found indicates that ASF considers container
>>>> images to be just another variety of distribution channel.  As such, it is
>>>> acceptable to publish official releases; for example an image such as
>>>> spark:v2.3.0 built from the v2.3.0 source is fine.  It is not acceptable to
>>>> do something like regularly publish spark:latest built from the head of
>>>> master.
>>>>
>>>> More detail here:
>>>> https://issues.apache.org/jira/browse/LEGAL-270
>>>>
>>>> So as I understand it, making a release-tagged public image as part of
>>>> each official release does not pose any problems.
>>>>
>>>> With respect to considering the licenses of other ancillary
>>>> dependencies that are also installed on such container images, I noticed
>>>> this clause in the legal boilerplate for the Flink images
>>>> <https://hub.docker.com/r/library/flink/>:
>>>>
>>>> As with all Docker images, these likely also contain other software
>>>>> which may be under other licenses (such as Bash, etc from the base
>>>>> distribution, along with any direct or indirect dependencies of the primary
>>>>> software being contained).
>>>>>
>>>>
>>>> So it may be sufficient to resolve this via disclaimer.
>>>>
>>>> -Erik
>>>>
>>>> On Thu, Dec 14, 2017 at 7:55 PM, Erik Erlandson <ee...@redhat.com>
>>>> wrote:
>>>>
>>>>> Currently the containers are based off alpine, which pulls in BSD2 and
>>>>> MIT licensing:
>>>>> https://github.com/apache/spark/pull/19717#discussion_r154502824
>>>>>
>>>>> To the best of my understanding, neither of those poses a problem.  If
>>>>> we based the image off of centos I'd also expect the licensing of any image
>>>>> deps to be compatible.
>>>>>
>>>>> On Thu, Dec 14, 2017 at 7:19 PM, Mark Hamstra <mark@clearstorydata.com
>>>>> > wrote:
>>>>>
>>>>>> What licensing issues come into play?
>>>>>>
>>>>>> On Thu, Dec 14, 2017 at 4:00 PM, Erik Erlandson <ee...@redhat.com>
>>>>>> wrote:
>>>>>>
>>>>>>> We've been discussing the topic of container images a bit more.  The
>>>>>>> kubernetes back-end operates by executing some specific CMD and ENTRYPOINT
>>>>>>> logic, which is different from mesos, and which is probably not practical
>>>>>>> to unify at this level.
>>>>>>>
>>>>>>> However: These CMD and ENTRYPOINT configurations are essentially
>>>>>>> just a thin skin on top of an image which is just an install of a spark
>>>>>>> distro.  We feel that a single "spark-base" image should be publishable:
>>>>>>> one that is consumable by kube-spark images, mesos-spark images, and likely
>>>>>>> any other community image whose primary purpose is running spark
>>>>>>> components.  The kube-specific dockerfiles would be written "FROM
>>>>>>> spark-base" and just add the small command and entrypoint layers.
>>>>>>> Likewise, the mesos images could add any specialization layers that are
>>>>>>> necessary on top of the "spark-base" image.
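>>>>>>>
>>>>>>> As a rough sketch of how thin such a kube-specific layer could be
>>>>>>> (the tag, script name, and default CMD here are placeholders, not the
>>>>>>> actual entrypoint logic from the kubernetes work):
>>>>>>>
>>>>>>>     FROM spark-base:v2.3.0
>>>>>>>
>>>>>>>     # kubernetes-specific layer: only the entrypoint/cmd scripting
>>>>>>>     COPY entrypoint.sh /opt/entrypoint.sh
>>>>>>>     RUN chmod +x /opt/entrypoint.sh
>>>>>>>
>>>>>>>     ENTRYPOINT ["/opt/entrypoint.sh"]
>>>>>>>     CMD ["driver"]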
>>>>>>>
>>>>>>> Does this factorization sound reasonable to others?
>>>>>>> Cheers,
>>>>>>> Erik
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Nov 29, 2017 at 10:04 AM, Mridul Muralidharan <
>>>>>>> mridul@gmail.com> wrote:
>>>>>>>
>>>>>>>> We do support running on Apache Mesos via docker images - so this
>>>>>>>> would not be restricted to k8s.
>>>>>>>> But unlike mesos support, which has other modes of running, I
>>>>>>>> believe
>>>>>>>> k8s support more heavily depends on availability of docker images.
>>>>>>>>
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Mridul
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Nov 29, 2017 at 8:56 AM, Sean Owen <so...@cloudera.com>
>>>>>>>> wrote:
>>>>>>>> > Would it be logical to provide Docker-based distributions of
>>>>>>>> other pieces of
>>>>>>>> > Spark? Or is this specific to K8S?
>>>>>>>> > The problem is we wouldn't generally also provide a distribution
>>>>>>>> of Spark
>>>>>>>> > for the reasons you give, because if we did that, then why not RPMs and
>>>>>>>> so on.
>>>>>>>> >
>>>>>>>> > On Wed, Nov 29, 2017 at 10:41 AM Anirudh Ramanathan <
>>>>>>>> ramanathana@google.com>
>>>>>>>> > wrote:
>>>>>>>> >>
>>>>>>>> >> In this context, I think the docker images are similar to the
>>>>>>>> binaries
>>>>>>>> >> rather than an extension.
>>>>>>>> >> It's packaging the compiled distribution to save people the
>>>>>>>> effort of
>>>>>>>> >> building one themselves, akin to binaries or the python package.
>>>>>>>> >>
>>>>>>>> >> For reference, this is the base dockerfile for the main image
>>>>>>>> that we
>>>>>>>> >> intend to publish. It's not particularly complicated.
>>>>>>>> >> The driver and executor images are based on said base image and
>>>>>>>> only
>>>>>>>> >> customize the CMD (any file/directory inclusions are extraneous
>>>>>>>> and will be
>>>>>>>> >> removed).
>>>>>>>> >>
>>>>>>>> >> Is there only one way to build it? That's a bit harder to reason
>>>>>>>> about.
>>>>>>>> >> The base image I'd argue is likely going to always be built that
>>>>>>>> >> way. For the driver and executor images, there may be cases where
>>>>>>>> >> people want to customize them (like putting all dependencies into
>>>>>>>> >> them, for example).
>>>>>>>> >> In those cases, as long as our images are bare bones, they can
>>>>>>>> use the
>>>>>>>> >> spark-driver/spark-executor images we publish as the base, and
>>>>>>>> build their
>>>>>>>> >> customization as a layer on top of it.
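>>>>>>>> >>
>>>>>>>> >> For example, a user's customization layer might look something
>>>>>>>> >> like this (the published image name and the paths are just
>>>>>>>> >> illustrative assumptions):
>>>>>>>> >>
>>>>>>>> >>     FROM spark-executor:v2.3.0
>>>>>>>> >>
>>>>>>>> >>     # layer application jars on top of the published image
>>>>>>>> >>     COPY my-app.jar /opt/spark/jars/
>>>>>>>> >>     COPY extra-libs/ /opt/spark/jars/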
>>>>>>>> >>
>>>>>>>> >> I think the composability of docker images makes this a bit
>>>>>>>> >> different from, say, debian packages.
>>>>>>>> >> We can publish canonical images that serve both as a complete
>>>>>>>> >> image for most Spark applications and as a stable substrate to build
>>>>>>>> >> customization upon.
>>>>>>>> >>
>>>>>>>> >> On Wed, Nov 29, 2017 at 7:38 AM, Mark Hamstra <
>>>>>>>> mark@clearstorydata.com>
>>>>>>>> >> wrote:
>>>>>>>> >>>
>>>>>>>> >>> It's probably also worth considering whether there is only one,
>>>>>>>> >>> well-defined, correct way to create such an image or whether
>>>>>>>> this is a
>>>>>>>> >>> reasonable avenue for customization. Part of why we don't do
>>>>>>>> something like
>>>>>>>> >>> maintain and publish canonical Debian packages for Spark is
>>>>>>>> because
>>>>>>>> >>> different organizations doing packaging and distribution of
>>>>>>>> infrastructures
>>>>>>>> >>> or operating systems can reasonably want to do this in a custom
>>>>>>>> (or
>>>>>>>> >>> non-customary) way. If there is really only one reasonable way
>>>>>>>> to do a
>>>>>>>> >>> docker image, then my bias starts to tend more toward the Spark
>>>>>>>> PMC taking
>>>>>>>> >>> on the responsibility to maintain and publish that image. If
>>>>>>>> there is more
>>>>>>>> >>> than one way to do it and publishing a particular image is more
>>>>>>>> just a
>>>>>>>> >>> convenience, then my bias tends more away from maintaining and
>>>>>>>> >>> publishing it.
>>>>>>>> >>>
>>>>>>>> >>> On Wed, Nov 29, 2017 at 5:14 AM, Sean Owen <so...@cloudera.com>
>>>>>>>> wrote:
>>>>>>>> >>>>
>>>>>>>> >>>> Source code is the primary release; compiled binary releases
>>>>>>>> are
>>>>>>>> >>>> conveniences that are also released. A docker image sounds
>>>>>>>> fairly different
>>>>>>>> >>>> though. To the extent it's the standard delivery mechanism for
>>>>>>>> some artifact
>>>>>>>> >>>> (think: pyspark on PyPI as well) that makes sense, but is that
>>>>>>>> the
>>>>>>>> >>>> situation? If it's more of an extension or alternate
>>>>>>>> presentation of Spark
>>>>>>>> >>>> components, that typically wouldn't be part of a Spark
>>>>>>>> release. The ones the
>>>>>>>> >>>> PMC takes responsibility for maintaining ought to be the core,
>>>>>>>> critical
>>>>>>>> >>>> means of distribution alone.
>>>>>>>> >>>>
>>>>>>>> >>>> On Wed, Nov 29, 2017 at 2:52 AM Anirudh Ramanathan
>>>>>>>> >>>> <ra...@google.com.invalid> wrote:
>>>>>>>> >>>>>
>>>>>>>> >>>>> Hi all,
>>>>>>>> >>>>>
>>>>>>>> >>>>> We're all working towards the Kubernetes scheduler backend
>>>>>>>> (full steam
>>>>>>>> >>>>> ahead!) that's targeted towards Spark 2.3. One of the
>>>>>>>> questions that comes
>>>>>>>> >>>>> up often is docker images.
>>>>>>>> >>>>>
>>>>>>>> >>>>> While we're making available dockerfiles to allow people to
>>>>>>>> create
>>>>>>>> >>>>> their own docker images from source, ideally, we'd want to
>>>>>>>> publish official
>>>>>>>> >>>>> docker images as part of the release process.
>>>>>>>> >>>>>
>>>>>>>> >>>>> I understand that the ASF has procedures around this, and we
>>>>>>>> would want
>>>>>>>> >>>>> to get that started to help us get these artifacts published
>>>>>>>> by 2.3. I'd
>>>>>>>> >>>>> love to get a discussion around this started and hear the
>>>>>>>> >>>>> community's thoughts on it.
>>>>>>>> >>>>>
>>>>>>>> >>>>> --
>>>>>>>> >>>>> Thanks,
>>>>>>>> >>>>> Anirudh Ramanathan
>>>>>>>> >>>
>>>>>>>> >>>
>>>>>>>> >>
>>>>>>>> >>
>>>>>>>> >>
>>>>>>>> >> --
>>>>>>>> >> Anirudh Ramanathan
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>