Posted to dev@spark.apache.org by Dongjoon Hyun <do...@gmail.com> on 2020/02/05 20:16:08 UTC

Apache Spark Docker image repository

Hi, All.

From 2020, shall we have an official Docker image repository as an
additional distribution channel?

I'm considering the following images.

    - Public binary release (no snapshot image)
    - Public non-Spark base image (OS + R + Python)
      (This can be used in GitHub Actions jobs and Jenkins K8s Integration
Tests to speed up jobs and to have more stable environments)

Bests,
Dongjoon.
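
To make the two proposed images concrete, here is a rough sketch of what
they might look like, as two separate Dockerfiles. The base OS, package
set, and layering below are illustrative placeholders only, not decisions
from this thread; only the distribution URL comes from the discussion.

    # Sketch 1: non-Spark base image (OS + R + Python), mainly for CI
    # (GitHub Actions jobs, Jenkins K8s integration tests).
    FROM debian:buster-slim
    RUN apt-get update && \
        apt-get install -y --no-install-recommends \
            openjdk-11-jre-headless python3 r-base && \
        rm -rf /var/lib/apt/lists/*

    # Sketch 2: public binary release image, built only from a released
    # distribution tarball (no snapshot images).
    FROM debian:buster-slim
    ARG SPARK_VERSION=2.4.5
    ADD https://www.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop2.7.tgz /opt/
    RUN tar -xzf /opt/spark-${SPARK_VERSION}-bin-hadoop2.7.tgz -C /opt && \
        rm /opt/spark-${SPARK_VERSION}-bin-hadoop2.7.tgz && \
        ln -s /opt/spark-${SPARK_VERSION}-bin-hadoop2.7 /opt/spark
    ENV SPARK_HOME=/opt/spark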

Re: Apache Spark Docker image repository

Posted by Maciej Szymkiewicz <ms...@gmail.com>.
On 2/6/20 2:53 AM, Jiaxin Shan wrote:
> I will vote for this. It's pretty helpful to have managed Spark
> images. Currently, user have to download Spark binaries and build
> their own. 
> With this supported, user journey will be simplified and we only need
> to build an application image on top of base image provided by community. 
>
> Do we have different OS or architecture support? If not, there will be
> Java, R, Python total 3 container images for every release.

Well, technically speaking there are 3 non-deprecated Python versions (4
if you count PyPy), 3 non-deprecated R versions, luckily only one
non-deprecated Scala version, and possible JDK variations. The latest and
greatest are not necessarily the most popular or the most useful.

That's on top of native dependencies like BLAS (possibly in different
flavors, and accounting for the break in netlib-java development),
libparquet, and libarrow.

Not all of these combinations must be generated, but complexity grows
pretty fast, especially when native dependencies are involved. It gets
worse if you actually want to support Spark builds and tests ‒ for example,
to build and fully test SparkR you need half of the universe, including
some awkward LaTeX style patches and such
(https://github.com/zero323/sparkr-build-sandbox).

And even without that, images tend to grow pretty large.

A few years back Elias <https://github.com/eliasah> and I experimented
with the idea of generating different sets of Dockerfiles ‒
https://github.com/spark-in-a-box/spark-in-a-box ‒ though the intended use
cases were rather different (mostly quick setup of testbeds). The project
has been inactive for a while, with some private patches to fit this or
that use case.

>
> On Wed, Feb 5, 2020 at 2:56 PM Sean Owen <srowen@gmail.com
> <ma...@gmail.com>> wrote:
>
>     What would the images have - just the image for a worker?
>     We wouldn't want to publish N permutations of Python, R, OS, Java,
>     etc.
>     But if we don't then we make one or a few choices of that combo, and
>     then I wonder how many people find the image useful.
>     If the goal is just to support Spark testing, that seems fine and
>     tractable, but does it need to be 'public' as in advertised as a
>     convenience binary? vs just some image that's hosted somewhere for the
>     benefit of project infra.
>
>     On Wed, Feb 5, 2020 at 12:16 PM Dongjoon Hyun
>     <dongjoon.hyun@gmail.com <ma...@gmail.com>> wrote:
>     >
>     > Hi, All.
>     >
>     > From 2020, shall we have an official Docker image repository as
>     an additional distribution channel?
>     >
>     > I'm considering the following images.
>     >
>     >     - Public binary release (no snapshot image)
>     >     - Public non-Spark base image (OS + R + Python)
>     >       (This can be used in GitHub Action Jobs and Jenkins K8s
>     Integration Tests to speed up jobs and to have more stabler
>     environments)
>     >
>     > Bests,
>     > Dongjoon.
>
>     ---------------------------------------------------------------------
>     To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>     <ma...@spark.apache.org>
>
>
>
> -- 
> Best Regards!
> Jiaxin Shan
> Tel:  412-230-7670
> Address: 470 2nd Ave S, Kirkland, WA
>
-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: C095AA7F33E6123A


Re: Apache Spark Docker image repository

Posted by Jiaxin Shan <se...@gmail.com>.
I will vote for this. It's pretty helpful to have managed Spark images.
Currently, users have to download the Spark binaries and build their own.
With this supported, the user journey will be simplified and we only need
to build an application image on top of the base image provided by the
community.

Do we have different OS or architecture support? If not, there will be
three container images (Java, R, Python) for every release.


On Wed, Feb 5, 2020 at 2:56 PM Sean Owen <sr...@gmail.com> wrote:

> What would the images have - just the image for a worker?
> We wouldn't want to publish N permutations of Python, R, OS, Java, etc.
> But if we don't then we make one or a few choices of that combo, and
> then I wonder how many people find the image useful.
> If the goal is just to support Spark testing, that seems fine and
> tractable, but does it need to be 'public' as in advertised as a
> convenience binary? vs just some image that's hosted somewhere for the
> benefit of project infra.
>
> On Wed, Feb 5, 2020 at 12:16 PM Dongjoon Hyun <do...@gmail.com>
> wrote:
> >
> > Hi, All.
> >
> > From 2020, shall we have an official Docker image repository as an
> additional distribution channel?
> >
> > I'm considering the following images.
> >
> >     - Public binary release (no snapshot image)
> >     - Public non-Spark base image (OS + R + Python)
> >       (This can be used in GitHub Action Jobs and Jenkins K8s
> Integration Tests to speed up jobs and to have more stabler environments)
> >
> > Bests,
> > Dongjoon.
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>
>

-- 
Best Regards!
Jiaxin Shan
Tel:  412-230-7670
Address: 470 2nd Ave S, Kirkland, WA

Re: Apache Spark Docker image repository

Posted by Sean Owen <sr...@gmail.com>.
What would the images have - just the image for a worker?
We wouldn't want to publish N permutations of Python, R, OS, Java, etc.
But if we don't, then we make one or a few choices of that combination,
and then I wonder how many people would find the image useful.
If the goal is just to support Spark testing, that seems fine and
tractable, but does it need to be 'public', as in advertised as a
convenience binary, versus just some image that's hosted somewhere for
the benefit of project infra?

On Wed, Feb 5, 2020 at 12:16 PM Dongjoon Hyun <do...@gmail.com> wrote:
>
> Hi, All.
>
> From 2020, shall we have an official Docker image repository as an additional distribution channel?
>
> I'm considering the following images.
>
>     - Public binary release (no snapshot image)
>     - Public non-Spark base image (OS + R + Python)
>       (This can be used in GitHub Action Jobs and Jenkins K8s Integration Tests to speed up jobs and to have more stabler environments)
>
> Bests,
> Dongjoon.



Re: Apache Spark Docker image repository

Posted by Ismaël Mejía <ie...@gmail.com>.
Since Spark 3.1.1 is out now, I was wondering if it would make sense to
try to get some consensus about starting to release Docker images as
part of Spark 3.2.
Having ready-to-use images would definitely benefit adoption, in
particular now that containerized runs via k8s have become GA.

WDYT? Are there still some issues/blockers, or reasons not to move forward?

On Tue, Feb 18, 2020 at 2:29 PM Ismaël Mejía <ie...@gmail.com> wrote:
>
> +1 to have Spark docker images for Dongjoon's arguments, having a container
> based distribution is definitely something in the benefit of users and the
> project too. Having this in the Apache Spark repo matters because of multiple
> eyes to fix/ímprove the images for the benefit of everyone.
>
> What still needs to be tested is the best distribution approach. I have been
> involved in both Flink and Beam's docker images processes (and passed the whole
> 'docker official image' validation and some of the learnt lessons is that the
> less you put in an image the best it is for everyone. So I wonder if the whole
> include everything in the world (Python, R, etc) would scale or if those should
> be overlays on top of a more core minimal image,  but well those are details to
> fix once consensus on this is agreed.
>
> On the Apache INFRA side there is some stuff to deal with at the beginning, but
> things become smoother once they are in place.  In any case fantastic idea and
> if I can help around I would be glad to.
>
> Regards,
> Ismaël
>
> On Tue, Feb 11, 2020 at 10:56 PM Dongjoon Hyun <do...@gmail.com> wrote:
>>
>> Hi, Sean.
>>
>> Yes. We should keep this minimal.
>>
>> BTW, for the following questions,
>>
>>     > But how much value does that add?
>>
>> How much value do you think we have at our binary distribution in the following link?
>>
>>     - https://www.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
>>
>> Docker image can have a similar value with the above for the users who are using Dockerized environment.
>>
>> If you are assuming the users who build from the source code or lives on vendor distributions, both the above existing binary distribution link and Docker image have no value.
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Tue, Feb 11, 2020 at 8:51 AM Sean Owen <sr...@gmail.com> wrote:
>>>
>>> To be clear this is a convenience 'binary' for end users, not just an
>>> internal packaging to aid the testing framework?
>>>
>>> There's nothing wrong with providing an additional official packaging
>>> if we vote on it and it follows all the rules. There is an open
>>> question about how much value it adds vs that maintenance. I see we do
>>> already have some Dockerfiles, sure. Is it possible to reuse or
>>> repurpose these so that we don't have more to maintain? or: what is
>>> different from the existing Dockerfiles here? (dumb question, never
>>> paid much attention to them)
>>>
>>> We definitely can't release GPL bits or anything, yes. Just releasing
>>> a Dockerfile referring to GPL bits is a gray area - no bits are being
>>> redistributed, but, does it constitute a derived work where the GPL
>>> stuff is a non-optional dependency? Would any publishing of these
>>> images cause us to put a copy of third party GPL code anywhere?
>>>
>>> At the least, we should keep this minimal. One image if possible, that
>>> you overlay on top of your preferred OS/Java/Python image. But how
>>> much value does that add? I have no info either way that people want
>>> or don't need such a thing.
>>>
>>> On Tue, Feb 11, 2020 at 10:13 AM Erik Erlandson <ee...@redhat.com> wrote:
>>> >
>>> > My takeaway from the last time we discussed this was:
>>> > 1) To be ASF compliant, we needed to only publish images at official releases
>>> > 2) There was some ambiguity about whether or not a container image that included GPL'ed packages (spark images do) might trip over the GPL "viral propagation" due to integrating ASL and GPL in a "binary release".  The "air gap" GPL provision may apply - the GPL software interacts only at command-line boundaries.
>>> >
>>> > On Wed, Feb 5, 2020 at 1:23 PM Dongjoon Hyun <do...@gmail.com> wrote:
>>> >>
>>> >> Hi, All.
>>> >>
>>> >> From 2020, shall we have an official Docker image repository as an additional distribution channel?
>>> >>
>>> >> I'm considering the following images.
>>> >>
>>> >>     - Public binary release (no snapshot image)
>>> >>     - Public non-Spark base image (OS + R + Python)
>>> >>       (This can be used in GitHub Action Jobs and Jenkins K8s Integration Tests to speed up jobs and to have more stabler environments)
>>> >>
>>> >> Bests,
>>> >> Dongjoon.



Re: Apache Spark Docker image repository

Posted by Ismaël Mejía <ie...@gmail.com>.
+1 to having Spark Docker images, for Dongjoon's arguments: a container-based
distribution is definitely something that benefits users and the project too.
Having this in the Apache Spark repo matters because of the multiple eyes
that can fix/improve the images for the benefit of everyone.

What still needs to be tested is the best distribution approach. I have been
involved in both Flink's and Beam's Docker image processes (and passed the
whole 'docker official image' validation), and one of the lessons learnt is
that the less you put in an image, the better it is for everyone. So I wonder
whether the whole "include everything in the world" approach (Python, R,
etc.) would scale, or whether those should be overlays on top of a more
minimal core image. But those are details to fix once consensus on this is
reached.

On the Apache INFRA side there is some stuff to deal with at the beginning,
but things become smoother once they are in place. In any case, fantastic
idea, and if I can help around I would be glad to.

Regards,
Ismaël

On Tue, Feb 11, 2020 at 10:56 PM Dongjoon Hyun <do...@gmail.com>
wrote:

> Hi, Sean.
>
> Yes. We should keep this minimal.
>
> BTW, for the following questions,
>
>     > But how much value does that add?
>
> How much value do you think we have at our binary distribution in the
> following link?
>
>     -
> https://www.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
>
> Docker image can have a similar value with the above for the users who are
> using Dockerized environment.
>
> If you are assuming the users who build from the source code or lives on
> vendor distributions, both the above existing binary distribution link
> and Docker image have no value.
>
> Bests,
> Dongjoon.
>
>
> On Tue, Feb 11, 2020 at 8:51 AM Sean Owen <sr...@gmail.com> wrote:
>
>> To be clear this is a convenience 'binary' for end users, not just an
>> internal packaging to aid the testing framework?
>>
>> There's nothing wrong with providing an additional official packaging
>> if we vote on it and it follows all the rules. There is an open
>> question about how much value it adds vs that maintenance. I see we do
>> already have some Dockerfiles, sure. Is it possible to reuse or
>> repurpose these so that we don't have more to maintain? or: what is
>> different from the existing Dockerfiles here? (dumb question, never
>> paid much attention to them)
>>
>> We definitely can't release GPL bits or anything, yes. Just releasing
>> a Dockerfile referring to GPL bits is a gray area - no bits are being
>> redistributed, but, does it constitute a derived work where the GPL
>> stuff is a non-optional dependency? Would any publishing of these
>> images cause us to put a copy of third party GPL code anywhere?
>>
>> At the least, we should keep this minimal. One image if possible, that
>> you overlay on top of your preferred OS/Java/Python image. But how
>> much value does that add? I have no info either way that people want
>> or don't need such a thing.
>>
>> On Tue, Feb 11, 2020 at 10:13 AM Erik Erlandson <ee...@redhat.com>
>> wrote:
>> >
>> > My takeaway from the last time we discussed this was:
>> > 1) To be ASF compliant, we needed to only publish images at official
>> releases
>> > 2) There was some ambiguity about whether or not a container image that
>> included GPL'ed packages (spark images do) might trip over the GPL "viral
>> propagation" due to integrating ASL and GPL in a "binary release".  The
>> "air gap" GPL provision may apply - the GPL software interacts only at
>> command-line boundaries.
>> >
>> > On Wed, Feb 5, 2020 at 1:23 PM Dongjoon Hyun <do...@gmail.com>
>> wrote:
>> >>
>> >> Hi, All.
>> >>
>> >> From 2020, shall we have an official Docker image repository as an
>> additional distribution channel?
>> >>
>> >> I'm considering the following images.
>> >>
>> >>     - Public binary release (no snapshot image)
>> >>     - Public non-Spark base image (OS + R + Python)
>> >>       (This can be used in GitHub Action Jobs and Jenkins K8s
>> Integration Tests to speed up jobs and to have more stabler environments)
>> >>
>> >> Bests,
>> >> Dongjoon.
>>
>

Re: Apache Spark Docker image repository

Posted by Dongjoon Hyun <do...@gmail.com>.
Hi, Sean.

Yes. We should keep this minimal.

BTW, regarding the following question,

    > But how much value does that add?

How much value do you think our binary distribution at the following link
has?

    - https://www.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz

A Docker image can have similar value for the users who are in a Dockerized
environment.

If you are assuming users who build from the source code or live on vendor
distributions, then both the existing binary distribution link above and a
Docker image have no value.

Bests,
Dongjoon.


On Tue, Feb 11, 2020 at 8:51 AM Sean Owen <sr...@gmail.com> wrote:

> To be clear this is a convenience 'binary' for end users, not just an
> internal packaging to aid the testing framework?
>
> There's nothing wrong with providing an additional official packaging
> if we vote on it and it follows all the rules. There is an open
> question about how much value it adds vs that maintenance. I see we do
> already have some Dockerfiles, sure. Is it possible to reuse or
> repurpose these so that we don't have more to maintain? or: what is
> different from the existing Dockerfiles here? (dumb question, never
> paid much attention to them)
>
> We definitely can't release GPL bits or anything, yes. Just releasing
> a Dockerfile referring to GPL bits is a gray area - no bits are being
> redistributed, but, does it constitute a derived work where the GPL
> stuff is a non-optional dependency? Would any publishing of these
> images cause us to put a copy of third party GPL code anywhere?
>
> At the least, we should keep this minimal. One image if possible, that
> you overlay on top of your preferred OS/Java/Python image. But how
> much value does that add? I have no info either way that people want
> or don't need such a thing.
>
> On Tue, Feb 11, 2020 at 10:13 AM Erik Erlandson <ee...@redhat.com>
> wrote:
> >
> > My takeaway from the last time we discussed this was:
> > 1) To be ASF compliant, we needed to only publish images at official
> releases
> > 2) There was some ambiguity about whether or not a container image that
> included GPL'ed packages (spark images do) might trip over the GPL "viral
> propagation" due to integrating ASL and GPL in a "binary release".  The
> "air gap" GPL provision may apply - the GPL software interacts only at
> command-line boundaries.
> >
> > On Wed, Feb 5, 2020 at 1:23 PM Dongjoon Hyun <do...@gmail.com>
> wrote:
> >>
> >> Hi, All.
> >>
> >> From 2020, shall we have an official Docker image repository as an
> additional distribution channel?
> >>
> >> I'm considering the following images.
> >>
> >>     - Public binary release (no snapshot image)
> >>     - Public non-Spark base image (OS + R + Python)
> >>       (This can be used in GitHub Action Jobs and Jenkins K8s
> Integration Tests to speed up jobs and to have more stabler environments)
> >>
> >> Bests,
> >> Dongjoon.
>

Re: Apache Spark Docker image repository

Posted by Sean Owen <sr...@gmail.com>.
To be clear, this is a convenience 'binary' for end users, not just an
internal packaging to aid the testing framework?

There's nothing wrong with providing an additional official packaging
if we vote on it and it follows all the rules. There is an open
question about how much value it adds versus the maintenance. I see we do
already have some Dockerfiles, sure. Is it possible to reuse or
repurpose these so that we don't have more to maintain? Or: what is
different from the existing Dockerfiles here? (Dumb question, I never
paid much attention to them.)

We definitely can't release GPL bits or anything, yes. Just releasing
a Dockerfile referring to GPL bits is a gray area - no bits are being
redistributed, but does it constitute a derived work where the GPL
stuff is a non-optional dependency? Would any publishing of these
images cause us to put a copy of third-party GPL code anywhere?

At the least, we should keep this minimal: one image if possible, which
you overlay on top of your preferred OS/Java/Python image. But how
much value does that add? I have no info either way on whether people
want or don't need such a thing.

On Tue, Feb 11, 2020 at 10:13 AM Erik Erlandson <ee...@redhat.com> wrote:
>
> My takeaway from the last time we discussed this was:
> 1) To be ASF compliant, we needed to only publish images at official releases
> 2) There was some ambiguity about whether or not a container image that included GPL'ed packages (spark images do) might trip over the GPL "viral propagation" due to integrating ASL and GPL in a "binary release".  The "air gap" GPL provision may apply - the GPL software interacts only at command-line boundaries.
>
> On Wed, Feb 5, 2020 at 1:23 PM Dongjoon Hyun <do...@gmail.com> wrote:
>>
>> Hi, All.
>>
>> From 2020, shall we have an official Docker image repository as an additional distribution channel?
>>
>> I'm considering the following images.
>>
>>     - Public binary release (no snapshot image)
>>     - Public non-Spark base image (OS + R + Python)
>>       (This can be used in GitHub Action Jobs and Jenkins K8s Integration Tests to speed up jobs and to have more stabler environments)
>>
>> Bests,
>> Dongjoon.



Re: Apache Spark Docker image repository

Posted by Erik Erlandson <ee...@redhat.com>.
My takeaway from the last time we discussed this was:
1) To be ASF compliant, we would need to publish images only for official
releases.
2) There was some ambiguity about whether or not a container image that
includes GPL'ed packages (Spark images do) might trip over the GPL "viral
propagation" clause, due to integrating ASL and GPL code in a "binary
release". The "air gap" GPL provision may apply - the GPL software
interacts only at command-line boundaries.

On Wed, Feb 5, 2020 at 1:23 PM Dongjoon Hyun <do...@gmail.com>
wrote:

> Hi, All.
>
> From 2020, shall we have an official Docker image repository as an
> additional distribution channel?
>
> I'm considering the following images.
>
>     - Public binary release (no snapshot image)
>     - Public non-Spark base image (OS + R + Python)
>       (This can be used in GitHub Action Jobs and Jenkins K8s Integration
> Tests to speed up jobs and to have more stabler environments)
>
> Bests,
> Dongjoon.
>

Re: Apache Spark Docker image repository

Posted by shane knapp ☠ <sk...@berkeley.edu>.
>
>         (This can be used in GitHub Action Jobs and Jenkins K8s
> Integration Tests to speed up jobs and to have more stabler environments)
>

yep!

not only that, if we ever get around (hopefully this year) to
containerizing (the majority of) the master and branch builds, i think it'd
be nice to have those available there as well.

ah, an atomic build environment...  one can dream.  :)

shane
-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu

Re: Apache Spark Docker image repository

Posted by Dongjoon Hyun <do...@gmail.com>.
Thank you, Hyukjin.

The maintenance overhead only occurs when we add a new release.

And we can prevent accidental upstream changes by avoiding 'latest' tags.

The overhead will be much smaller than our existing Dockerfile maintenance
(e.g., 'spark-rm').

Also, if we have a Docker repository, we can publish the 'spark-rm' image
together as a tool. This will save release managers a lot of time and
effort.

Bests,
Dongjoon
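
As a sketch of what avoiding 'latest' could look like (the image names and
tags below are hypothetical): publish one immutable tag per release and let
consumers pin it exactly, optionally together with the content digest, so an
upstream rebuild cannot silently change what CI pulls.

    # Hypothetical naming: one immutable tag per released version, no 'latest'.
    #   apache/spark:2.4.5
    #   apache/spark:3.0.0
    # A consumer pins the exact tag, and optionally the digest as well:
    FROM apache/spark:2.4.5
    # FROM apache/spark:2.4.5@sha256:<digest of the published image>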

On Mon, Feb 10, 2020 at 00:25 Hyukjin Kwon <gu...@gmail.com> wrote:

> Quick question. Roughly how much overhead is it required to maintain
> minimal version?
> If that looks not too much, I think it's fine to give a shot.
>
>
> On Sat, Feb 8, 2020 at 6:51 AM, Dongjoon Hyun <do...@gmail.com> wrote:
>
>> Thank you, Sean, Jiaxin, Shane, and Tom, for feedbacks.
>>
>> 1. For legal questions, please see the following three Apache-approved
>> approaches. We can follow one of them.
>>
>>        1. https://hub.docker.com/u/apache (93 repositories,
>> Airflow/NiFi/Beam/Druid/Zeppelin/Hadoop/...)
>>        2. https://hub.docker.com/_/solr (This is also official. There
>> are more instances like this.)
>>        3. https://hub.docker.com/u/apachestreampipes (Some projects
>> tries this form.)
>>
>> 2. For non-Spark dev-environment images, definitely it will help both our
>> Jenkins and GitHub Action jobs. Apache Infra team also supports GitHub
>> Action secret like the following.
>>
>>        https://issues.apache.org/jira/browse/INFRA-19565 Create a Docker
>> Hub secret for Github Actions
>>
>> 3. For Spark image content questions, we should not do the following.
>> It's because not only for legal issues, but also we cannot contain or
>> maintain all popular libraries like Nvidia library/TensorFlow in our image.
>>
>>        https://issues.apache.org/jira/browse/SPARK-26398 Support
>> building GPU docker images
>>
>> 4. The way I see this is a minimal legal image containing only our
>> artifacts from the followings. We can check the other Apache repos's best
>> practice.
>>
>>        https://www.apache.org/dist/spark/
>>
>> 5. For OS/Java/Python/R runtimes and libraries, those (except OS) can
>> be overlayed as an additional layers by the users in general. I don't think
>> we need to provide every combination (Debian/Ubuntu/CentOS/Alpine) x
>> (JDK/JRE) x (Python2/Python3/PyPy) x (R 3.6/3.6) x (many libraries).
>> Specifically, I don't think we need to install all libraries like `arrow`.
>>
>> 6. For the target users, this is a general docker image. We don't need to
>> assume that this is for K8s-only environment. This can be used in any
>> Docker environment.
>>
>> 7. For the number of images, as suggested in this thread, we may want to
>> follow our existing K8s integration test suite way by splitting PySpark and
>> R images from Java. But, I don't have any requirement for this.
>>
>> What I want to propose in this thread is that we can start with a minimal
>> viable product and evolve them (if needed) as an open source community.
>>
>> Bests,
>> Dongjoon.
>>
>> PS. BTW, Apache Spark 2.4.5 artifacts are published into our doc website,
>> our distribution repo, Maven Central, PyPi, CRAN, Homebrew.
>>        I'm preparing website news and download page update.
>>
>>
>> On Thu, Feb 6, 2020 at 11:19 AM Tom Graves <tg...@yahoo.com> wrote:
>>
>>> When discussions of docker have occurred in the past - mostly related to
>>> k8s - there is a lot of discussion about what is the right image to
>>> publish, as well as making sure Apache is ok with it. Apache official
>>> release is the source code so we may need to make sure to have disclaimer
>>> and we need to make sure it doesn't contain anything licensed that it
>>> shouldn't.  What happens when one of the docker images we publish has
>>> security update. We would need to make sure all the legal bases are covered
>>> first.
>>>
>>> Then the discussion comes into what is in the docker images and how
>>> useful it is. People run different os's, different python versions, etc.
>>> And like Sean mentioned how useful really is it other then a few examples.
>>> Some discussions on https://issues.apache.org/jira/browse/SPARK-24655
>>>
>>> Tom
>>>
>>>
>>>
>>> On Wednesday, February 5, 2020, 02:16:37 PM CST, Dongjoon Hyun <
>>> dongjoon.hyun@gmail.com> wrote:
>>>
>>>
>>> Hi, All.
>>>
>>> From 2020, shall we have an official Docker image repository as an
>>> additional distribution channel?
>>>
>>> I'm considering the following images.
>>>
>>>     - Public binary release (no snapshot image)
>>>     - Public non-Spark base image (OS + R + Python)
>>>       (This can be used in GitHub Action Jobs and Jenkins K8s
>>> Integration Tests to speed up jobs and to have more stabler environments)
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>

Re: Apache Spark Docker image repository

Posted by Hyukjin Kwon <gu...@gmail.com>.
Quick question: roughly how much overhead is required to maintain a
minimal version?
If that doesn't look like too much, I think it's fine to give it a shot.


On Sat, Feb 8, 2020 at 6:51 AM, Dongjoon Hyun <do...@gmail.com> wrote:

> Thank you, Sean, Jiaxin, Shane, and Tom, for feedbacks.
>
> 1. For legal questions, please see the following three Apache-approved
> approaches. We can follow one of them.
>
>        1. https://hub.docker.com/u/apache (93 repositories,
> Airflow/NiFi/Beam/Druid/Zeppelin/Hadoop/...)
>        2. https://hub.docker.com/_/solr (This is also official. There are
> more instances like this.)
>        3. https://hub.docker.com/u/apachestreampipes (Some projects tries
> this form.)
>
> 2. For non-Spark dev-environment images, definitely it will help both our
> Jenkins and GitHub Action jobs. Apache Infra team also supports GitHub
> Action secret like the following.
>
>        https://issues.apache.org/jira/browse/INFRA-19565 Create a Docker
> Hub secret for Github Actions
>
> 3. For Spark image content questions, we should not do the following. It's
> because not only for legal issues, but also we cannot contain or maintain
> all popular libraries like Nvidia library/TensorFlow in our image.
>
>        https://issues.apache.org/jira/browse/SPARK-26398 Support building
> GPU docker images
>
> 4. The way I see this is a minimal legal image containing only our
> artifacts from the followings. We can check the other Apache repos's best
> practice.
>
>        https://www.apache.org/dist/spark/
>
> 5. For OS/Java/Python/R runtimes and libraries, those (except OS) can
> be overlayed as an additional layers by the users in general. I don't think
> we need to provide every combination (Debian/Ubuntu/CentOS/Alpine) x
> (JDK/JRE) x (Python2/Python3/PyPy) x (R 3.6/3.6) x (many libraries).
> Specifically, I don't think we need to install all libraries like `arrow`.
>
> 6. For the target users, this is a general docker image. We don't need to
> assume that this is for K8s-only environment. This can be used in any
> Docker environment.
>
> 7. For the number of images, as suggested in this thread, we may want to
> follow our existing K8s integration test suite way by splitting PySpark and
> R images from Java. But, I don't have any requirement for this.
>
> What I want to propose in this thread is that we can start with a minimal
> viable product and evolve them (if needed) as an open source community.
>
> Bests,
> Dongjoon.
>
> PS. BTW, Apache Spark 2.4.5 artifacts are published into our doc website,
> our distribution repo, Maven Central, PyPi, CRAN, Homebrew.
>        I'm preparing website news and download page update.
>
>
> On Thu, Feb 6, 2020 at 11:19 AM Tom Graves <tg...@yahoo.com> wrote:
>
>> When discussions of docker have occurred in the past - mostly related to
>> k8s - there is a lot of discussion about what is the right image to
>> publish, as well as making sure Apache is ok with it. Apache official
>> release is the source code so we may need to make sure to have disclaimer
>> and we need to make sure it doesn't contain anything licensed that it
>> shouldn't.  What happens when one of the docker images we publish has
>> security update. We would need to make sure all the legal bases are covered
>> first.
>>
>> Then the discussion comes into what is in the docker images and how
>> useful it is. People run different os's, different python versions, etc.
>> And like Sean mentioned how useful really is it other then a few examples.
>> Some discussions on https://issues.apache.org/jira/browse/SPARK-24655
>>
>> Tom
>>
>>
>>
>> On Wednesday, February 5, 2020, 02:16:37 PM CST, Dongjoon Hyun <
>> dongjoon.hyun@gmail.com> wrote:
>>
>>
>> Hi, All.
>>
>> From 2020, shall we have an official Docker image repository as an
>> additional distribution channel?
>>
>> I'm considering the following images.
>>
>>     - Public binary release (no snapshot image)
>>     - Public non-Spark base image (OS + R + Python)
>>       (This can be used in GitHub Action Jobs and Jenkins K8s Integration
>> Tests to speed up jobs and to have more stabler environments)
>>
>> Bests,
>> Dongjoon.
>>
>

Re: Apache Spark Docker image repository

Posted by Dongjoon Hyun <do...@gmail.com>.
Thank you, Sean, Jiaxin, Shane, and Tom, for the feedback.

1. For legal questions, please see the following three Apache-approved
approaches. We can follow one of them.

       1. https://hub.docker.com/u/apache (93 repositories,
Airflow/NiFi/Beam/Druid/Zeppelin/Hadoop/...)
       2. https://hub.docker.com/_/solr (This is also official. There are
more instances like this.)
       3. https://hub.docker.com/u/apachestreampipes (Some projects try
this form.)

2. For non-Spark dev-environment images, they will definitely help both our
Jenkins and GitHub Actions jobs. The Apache Infra team also supports GitHub
Actions secrets, like the following.

       https://issues.apache.org/jira/browse/INFRA-19565 Create a Docker
Hub secret for Github Actions

3. For Spark image content questions, we should not do the following. This
is not only because of legal issues, but also because we cannot include or
maintain all popular libraries like the Nvidia libraries or TensorFlow in
our image.

       https://issues.apache.org/jira/browse/SPARK-26398 Support building
GPU docker images

4. The way I see this is a minimal, legal image containing only our
artifacts from the following. We can check the other Apache repos' best
practices.

       https://www.apache.org/dist/spark/

5. For OS/Java/Python/R runtimes and libraries, those (except the OS) can
generally be overlaid as additional layers by the users (see the sketch
below). I don't think we need to provide every combination of
(Debian/Ubuntu/CentOS/Alpine) x (JDK/JRE) x (Python2/Python3/PyPy) x
(R 3.6/3.6) x (many libraries). Specifically, I don't think we need to
install all libraries like `arrow`.

6. For the target users, this is a general Docker image. We don't need to
assume that it is for a K8s-only environment. It can be used in any
Docker environment.

7. For the number of images, as suggested in this thread, we may want to
follow the approach of our existing K8s integration test suite by splitting
the PySpark and R images from the Java one. But I don't have any requirement
for this.
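
A sketch of the user-side overlay from point 5 (the base image name, tag,
packages, and file names below are hypothetical, and the base is assumed to
be Debian-based):

    # Hypothetical user Dockerfile: extend a minimal official image with the
    # exact Python runtime and libraries one application needs.
    FROM apache/spark:2.4.5
    # Switch to root only if the base image runs as a non-root user.
    USER root
    RUN apt-get update && \
        apt-get install -y --no-install-recommends python3 python3-pip && \
        rm -rf /var/lib/apt/lists/* && \
        pip3 install --no-cache-dir pandas pyarrow
    COPY my_app.py /opt/app/my_app.py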

What I want to propose in this thread is that we start with a minimum
viable product and evolve it (if needed) as an open source community.

Bests,
Dongjoon.

PS. BTW, Apache Spark 2.4.5 artifacts are published to our doc website,
our distribution repo, Maven Central, PyPI, CRAN, and Homebrew.
       I'm preparing the website news and download page update.


On Thu, Feb 6, 2020 at 11:19 AM Tom Graves <tg...@yahoo.com> wrote:

> When discussions of docker have occurred in the past - mostly related to
> k8s - there is a lot of discussion about what is the right image to
> publish, as well as making sure Apache is ok with it. Apache official
> release is the source code so we may need to make sure to have disclaimer
> and we need to make sure it doesn't contain anything licensed that it
> shouldn't.  What happens when one of the docker images we publish has
> security update. We would need to make sure all the legal bases are covered
> first.
>
> Then the discussion comes into what is in the docker images and how useful
> it is. People run different os's, different python versions, etc. And like
> Sean mentioned how useful really is it other then a few examples.  Some
> discussions on https://issues.apache.org/jira/browse/SPARK-24655
>
> Tom
>
>
>
> On Wednesday, February 5, 2020, 02:16:37 PM CST, Dongjoon Hyun <
> dongjoon.hyun@gmail.com> wrote:
>
>
> Hi, All.
>
> From 2020, shall we have an official Docker image repository as an
> additional distribution channel?
>
> I'm considering the following images.
>
>     - Public binary release (no snapshot image)
>     - Public non-Spark base image (OS + R + Python)
>       (This can be used in GitHub Action Jobs and Jenkins K8s Integration
> Tests to speed up jobs and to have more stabler environments)
>
> Bests,
> Dongjoon.
>

Re: Apache Spark Docker image repository

Posted by Tom Graves <tg...@yahoo.com.INVALID>.
When discussions of Docker have occurred in the past - mostly related to k8s - there has been a lot of discussion about what the right image to publish is, as well as making sure Apache is OK with it. The Apache official release is the source code, so we may need to make sure to have a disclaimer, and we need to make sure it doesn't contain anything licensed that it shouldn't. What happens when one of the Docker images we publish needs a security update? We would need to make sure all the legal bases are covered first.

Then the discussion comes down to what is in the Docker images and how useful they are. People run different OSes, different Python versions, etc. And, like Sean mentioned, how useful is it really other than for a few examples? Some discussions on https://issues.apache.org/jira/browse/SPARK-24655
Tom


    On Wednesday, February 5, 2020, 02:16:37 PM CST, Dongjoon Hyun <do...@gmail.com> wrote:

Hi, All.

From 2020, shall we have an official Docker image repository as an additional distribution channel?

I'm considering the following images.

    - Public binary release (no snapshot image)
    - Public non-Spark base image (OS + R + Python)
      (This can be used in GitHub Action Jobs and Jenkins K8s Integration Tests to speed up jobs and to have more stabler environments)

Bests,
Dongjoon.