Posted to dev@spark.apache.org by Mich Talebzadeh <mi...@gmail.com> on 2022/02/21 22:08:29 UTC

Re: Time to start publishing Spark Docker Images?

forwarded



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 22 Jul 2021 at 04:13, Holden Karau <ho...@pigscanfly.ca> wrote:

> Hi Folks,
>
> Many other distributed computing (https://hub.docker.com/r/rayproject/ray
> https://hub.docker.com/u/daskdev) and ASF projects (
> https://hub.docker.com/u/apache) now publish their images to dockerhub.
>
> We've already got the docker image tooling in place, I think we'd need to
> ask the ASF to grant permissions to the PMC to publish containers and
> update the release steps but I think this could be useful for folks.
>
> Cheers,
>
> Holden
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>

Fwd: Time to start publishing Spark Docker Images?

Posted by Mich Talebzadeh <mi...@gmail.com>.
Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



Forwarded Conversation
Subject: Time to start publishing Spark Docker Images?
------------------------

From: Holden Karau <ho...@pigscanfly.ca>
Date: Thu, 22 Jul 2021 at 04:13
To: dev <de...@spark.apache.org>


Hi Folks,

Many other distributed computing (https://hub.docker.com/r/rayproject/ray
https://hub.docker.com/u/daskdev) and ASF projects (
https://hub.docker.com/u/apache) now publish their images to dockerhub.

We've already got the docker image tooling in place, I think we'd need to
ask the ASF to grant permissions to the PMC to publish containers and
update the release steps but I think this could be useful for folks.

Cheers,

Holden

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


----------
From: Kent Yao <ya...@gmail.com>
Date: Thu, 22 Jul 2021 at 04:22
To: Holden Karau <ho...@pigscanfly.ca>
Cc: dev <de...@spark.apache.org>


+1

Bests,

*Kent Yao *
@ Data Science Center, Hangzhou Research Institute, NetEase Corp.
*a spark enthusiast*
*kyuubi <https://github.com/yaooqinn/kyuubi>is a unified multi-tenant JDBC
interface for large-scale data processing and analytics, built on top
of Apache Spark <http://spark.apache.org/>.*
*spark-authorizer <https://github.com/yaooqinn/spark-authorizer>A Spark SQL
extension which provides SQL Standard Authorization for **Apache Spark
<http://spark.apache.org/>.*
*spark-postgres <https://github.com/yaooqinn/spark-postgres> A library for
reading data from and transferring data to Postgres / Greenplum with Spark
SQL and DataFrames, 10~100x faster.*
*itatchi <https://github.com/yaooqinn/spark-func-extras>A** library t**hat
brings useful functions from various modern database management
systems to **Apache
Spark <http://spark.apache.org/>.*




----------
From: Hyukjin Kwon <gu...@gmail.com>
Date: Fri, 13 Aug 2021 at 01:44
To: Kent Yao <ya...@gmail.com>, Dongjoon Hyun <do...@apache.org>
Cc: Holden Karau <ho...@pigscanfly.ca>, dev <de...@spark.apache.org>


+1, I think we generally agreed upon having it. Thanks Holden for headsup
and driving this.

+@Dongjoon Hyun <do...@apache.org> FYI

On Thu, 22 Jul 2021 at 12:22, Kent Yao <ya...@gmail.com> wrote:


----------
From: John Zhuge <jz...@apache.org>
Date: Fri, 13 Aug 2021 at 01:48
To: Hyukjin Kwon <gu...@gmail.com>
Cc: Dongjoon Hyun <do...@apache.org>, Holden Karau <ho...@pigscanfly.ca>,
Kent Yao <ya...@gmail.com>, dev <de...@spark.apache.org>


+1
-- 
John Zhuge


----------
From: Holden Karau <ho...@pigscanfly.ca>
Date: Fri, 13 Aug 2021 at 01:54
To: John Zhuge <jz...@apache.org>
Cc: Hyukjin Kwon <gu...@gmail.com>, Dongjoon Hyun <do...@apache.org>,
Kent Yao <ya...@gmail.com>, dev <de...@spark.apache.org>


Awesome, I've filed an INFRA ticket to get the ball rolling.


----------
From: Mich Talebzadeh <mi...@gmail.com>
Date: Fri, 13 Aug 2021 at 07:45
To:
Cc: dev <de...@spark.apache.org>


I concur this is a good idea and certainly worth exploring.

In practice, preparing deployable Docker images will throw up some
challenges, because a Docker image for Spark is not really a single modular
unit in the way that, say, a Jenkins image is. It involves different
versions and different images for Spark and PySpark, and will most likely
end up as part of a Kubernetes deployment.


Individuals and organisations will deploy it as a first cut. Great, but I
equally feel that good documentation on how to build a consumable,
deployable image will be more valuable. From my own experience the current
documentation should be enhanced, for example on how to deploy working
directories, additional Python packages, and builds with different Java
versions (version 8 or version 11), etc.


HTH







----------
From: Bode, Meikel, NMA-CFD <Me...@bertelsmann.de>
Date: Fri, 13 Aug 2021 at 08:13
To: dev <de...@spark.apache.org>


Hi all,



I am Meikel Bode, an interested reader of the dev and user lists.
Anyway, I would appreciate having official Docker images available.

Maybe one could take inspiration from the Jupyter Docker stacks and provide
a hierarchy of different images like this:



https://jupyter-docker-stacks.readthedocs.io/en/latest/using/selecting.html#image-relationships



Having a core image supporting only Java, and extended images supporting
Python and/or R, etc.
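
For illustration only, a rough sketch of what such a hierarchy could look
like (image names and layering here are hypothetical, not an actual
proposal):

# spark-base: hypothetical JVM-only image
# spark-py/Dockerfile ‒ extends the base with a Python runtime
FROM spark-base:3.1.1
USER 0
RUN apt-get update && \
    apt-get install -y --no-install-recommends python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*
USER 185

# spark-py-datascience/Dockerfile ‒ extends spark-py with common libraries
FROM spark-py:3.1.1
USER 0
RUN pip3 install --no-cache-dir numpy pandas pyarrow
USER 185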



Looking forward to the discussion.



Best,

Meikel


----------
From: Mich Talebzadeh <mi...@gmail.com>
Date: Fri, 13 Aug 2021 at 08:51
To: Bode, Meikel, NMA-CFD <Me...@bertelsmann.de>
Cc: dev <de...@spark.apache.org>


Agreed.

I have already built a few of the latest images for Spark and PYSpark on
3.1.1 with Java 8, as I found out that Java 11 does not work with the
Google BigQuery data warehouse. However, how to hack the Dockerfile is
something one finds out the hard way.

For example, how to add additional Python libraries like tensorflow etc.
Loading these libraries through Kubernetes is not practical, as unzipping
and installing them through --py-files etc. will take considerable time, so
they need to be added to the Dockerfile at build time, in the directory for
Python under Kubernetes:

/opt/spark/kubernetes/dockerfiles/spark/bindings/python

RUN pip install pyyaml numpy cx_Oracle tensorflow ....

Also, you will need curl to test the ports from inside the container:

RUN apt-get update && apt-get install -y curl
RUN ["apt-get","install","-y","vim"]

As I said, I am happy to build these specific Dockerfiles plus the complete
documentation for them. I have already built one for Google Cloud (GCP). The
difference between the Spark and PySpark versions is that in Spark/Scala a
fat jar file will contain everything needed. That is not the case with
Python, I am afraid.
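
To make that concrete, a minimal sketch of such a layer (the base image name
and package list are placeholders for whatever a given build actually uses):

FROM <your-spark-py-base-image>
USER 0
# OS tools for debugging inside the container
RUN apt-get update && \
    apt-get install -y --no-install-recommends curl vim && \
    rm -rf /var/lib/apt/lists/*
# extra Python libraries baked in at build time
RUN pip install --no-cache-dir pyyaml numpy cx_Oracle tensorflow
USER 185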


----------
From: Mich Talebzadeh <mi...@gmail.com>
Date: Fri, 13 Aug 2021 at 08:59
To: Bode, Meikel, NMA-CFD <Me...@bertelsmann.de>
Cc: dev <de...@spark.apache.org>


should read PySpark


----------
From: Holden Karau <ho...@pigscanfly.ca>
Date: Fri, 13 Aug 2021 at 17:26
To: Mich Talebzadeh <mi...@gmail.com>
Cc: Bode, Meikel, NMA-CFD <Me...@bertelsmann.de>, dev <
dev@spark.apache.org>


So we actually do have a script that does the build already; it's more a
matter of publishing the results for easier use. Currently the script
produces three images: spark, spark-py, and spark-r. I can certainly see a
solid reason to publish with, say, jdk11 and jdk8 suffixes as well if there
is interest in the community. If we want to have, say, a spark-py-pandas
Spark container image with everything necessary for the Koalas stuff to
work, then I think that could be a great PR for someone to add :)


----------
From: Mich Talebzadeh <mi...@gmail.com>
Date: Fri, 13 Aug 2021 at 23:43
To: Holden Karau <ho...@pigscanfly.ca>
Cc: Bode, Meikel, NMA-CFD <Me...@bertelsmann.de>, dev <
dev@spark.apache.org>


Hi,

We can cater for multiple types (spark, spark-py and spark-r) and Spark
versions (assuming they are downloaded and available).
The challenge is that these Docker images, once built, are snapshots. They
cannot be amended later, and if you change anything by going inside the
container, whatever you did is reversed as soon as you log out.

For example, I want to add tensorflow to my docker image. These are my
images

REPOSITORY                             TAG           IMAGE ID       CREATED      SIZE
eu.gcr.io/axial-glow-224522/spark-py   java8_3.1.1   cfbb0e69f204   5 days ago   2.37GB
eu.gcr.io/axial-glow-224522/spark      3.1.1         8d1bf8e7e47d   5 days ago   805MB

Using the image ID, I try to log in to the image as root:

docker run -u0 -it cfbb0e69f204 bash

root@b542b0f1483d:/opt/spark/work-dir# pip install keras
Collecting keras
  Downloading keras-2.6.0-py2.py3-none-any.whl (1.3 MB)
     |████████████████████████████████| 1.3 MB 1.1 MB/s
Installing collected packages: keras
Successfully installed keras-2.6.0
WARNING: Running pip as the 'root' user can result in broken permissions
and conflicting behaviour with the system package manager. It is
recommended to use a virtual environment instead:
https://pip.pypa.io/warnings/venv
root@b542b0f1483d:/opt/spark/work-dir# pip list
Package       Version
------------- -------
asn1crypto    0.24.0
cryptography  2.6.1
cx-Oracle     8.2.1
entrypoints   0.3
keras         2.6.0      <--- it is here
keyring       17.1.1
keyrings.alt  3.1.1
numpy         1.21.1
pip           21.2.3
py4j          0.10.9
pycrypto      2.6.1
PyGObject     3.30.4
pyspark       3.1.2
pyxdg         0.25
PyYAML        5.4.1
SecretStorage 2.3.1
setuptools    57.4.0
six           1.12.0
wheel         0.32.3
root@b542b0f1483d:/opt/spark/work-dir# exit

Now I exit from the image and try to log in again:

(pyspark_venv) hduser@rhes76: /home/hduser/dba/bin/build> docker run -u0 -it cfbb0e69f204 bash

root@5231ee95aa83:/opt/spark/work-dir# pip list
Package       Version
------------- -------
asn1crypto    0.24.0
cryptography  2.6.1
cx-Oracle     8.2.1
entrypoints   0.3
keyring       17.1.1
keyrings.alt  3.1.1
numpy         1.21.1
pip           21.2.3
py4j          0.10.9
pycrypto      2.6.1
PyGObject     3.30.4
pyspark       3.1.2
pyxdg         0.25
PyYAML        5.4.1
SecretStorage 2.3.1
setuptools    57.4.0
six           1.12.0
wheel         0.32.3

Hm, that keras is not there. The Docker image cannot be altered after the
build! So once the Docker image is created, it is just a snapshot. However,
it will still have tons of useful stuff for most users/organisations. My
suggestion is to create, for a given type (spark, spark-py etc.):


   1. One vanilla flavour for everyday use with few useful packages
   2. One for medium use with most common packages for ETL/ELT stuff
   3. One specialist for ML etc with keras, tensorflow and anything else
   needed


These images should be maintained as we currently maintain Spark releases,
with accompanying documentation. Any reason why we cannot maintain them
ourselves?
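
To illustrate the snapshot point above (the tag names are placeholders), the
extra library has to go into a new image layer at build time rather than
into a running container:

# changes made inside "docker run ... bash" vanish when the container exits;
# to keep keras, extend the image and rebuild instead:
cat > Dockerfile.keras <<'EOF'
FROM eu.gcr.io/<PROJECT_ID>/spark-py:java8_3.1.1
USER 0
RUN pip install --no-cache-dir keras
USER 185
EOF
docker build -f Dockerfile.keras -t eu.gcr.io/<PROJECT_ID>/spark-py:java8_3.1.1-ml .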


----------
From: Maciej <ms...@gmail.com>
Date: Mon, 16 Aug 2021 at 18:46
To: <de...@spark.apache.org>


I have a few concerns regarding PySpark and SparkR images.

First of all, how do we plan to handle interpreter versions? Ideally, we
should provide images for all supported variants, but based on the
preceding discussion and the proposed naming convention, I assume it is not
going to happen. If that's the case, it would be great if we could fix
interpreter versions based on some support criteria (lowest supported,
lowest non-deprecated, highest supported at the time of release, etc.)

Currently, we use the following:

   - for R, we use the buster-cran35 Debian repositories, which install R 3.6
   (the provided version already changed in the past and broke the image
   build ‒ SPARK-28606).
   - for Python, we depend on the system-provided python3 packages, which
   currently provide Python 3.7.

which don't guarantee stability over time and might be hard to synchronize
with our support matrix.

Secondly, omitting libraries which are required for the full functionality
and performance, specifically

   - Numpy, Pandas and Arrow for PySpark
   - Arrow for SparkR

is likely to severely limit usability of the images (out of these, Arrow is
probably the hardest to manage, especially when you already depend on
system packages to provide R or Python interpreter).
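
Purely as an illustration (the pins below are my guesses, not agreed
versions), adding those libraries to the Python image would be one extra
layer:

RUN pip install --no-cache-dir \
    "numpy==1.21.*" \
    "pandas==1.3.*" \
    "pyarrow==4.*"

The real question, as noted above, is which versions to pin against the
support matrix.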

-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: A30CEF0C31A501EC



----------
From: Holden Karau <ho...@pigscanfly.ca>
Date: Tue, 17 Aug 2021 at 03:05
To: Maciej <ms...@gmail.com>
Cc: <de...@spark.apache.org>


These are some really good points all around.

I think, in the interest of simplicity, we'll start with just the 3 current
Dockerfiles in the Spark repo, but for the next release (3.3) we should
explore adding some more Dockerfiles/build options.


----------
From: Mich Talebzadeh <mi...@gmail.com>
Date: Tue, 17 Aug 2021 at 07:27
To:
Cc: Spark dev list <de...@spark.apache.org>


Thanks for the notes all.

I think we ought to consider what general Docker usage is.

A Docker image is by definition a self-contained, general-purpose entity
providing the Spark service at the common denominator. Some Docker images,
like the one for Jenkins, are far simpler to build as they have fewer moving
parts. With Spark you either deploy your own set-up, get a service provider
like Google to provide it as a service (and they do not provide the most
recent version; for example, a Dataproc cluster runs on Spark 3.1.1), or use
Docker in a Kubernetes cluster such as GKE. They provide the cluster and
cluster management, but you deploy your own Docker image.

I don't know much about spark-r, but I think if we take the latest Spark,
spark-py built on Python 3.7.3, and a spark-py for data science with the
most widely used and current packages, we should be OK.

This set-up will work as long as the documentation goes into the details of
interpreter and package versions and provides a blueprint for building your
own custom Docker image with whatever versions you prefer.

As I said, if we look at Flink, they provide images tagged by Scala and Java
version, and of course latest:

1.13.2-scala_2.12-java8, 1.13-scala_2.12-java8, scala_2.12-java8,
1.13.2-scala_2.12, 1.13-scala_2.12, scala_2.12, 1.13.2-java8, 1.13-java8,
java8, 1.13.2, 1.13, latest

I personally believe that providing the most popular ones serves the purpose
for the community, and anything above and beyond has to be tailor-made.


----------
From: Holden Karau <ho...@pigscanfly.ca>
Date: Tue, 17 Aug 2021 at 07:36
To: Mich Talebzadeh <mi...@gmail.com>
Cc: Spark dev list <de...@spark.apache.org>


I as well think the largest use case of docker images would be on
Kubernetes.

While I have nothing against us adding more varieties, I think it’s
important for us to get this started with our current containers, so I’ll
do that, but let’s certainly continue exploring improvements after that.


----------
From: Mich Talebzadeh <mi...@gmail.com>
Date: Tue, 17 Aug 2021 at 07:40
To: Holden Karau <ho...@pigscanfly.ca>
Cc: Spark dev list <de...@spark.apache.org>


An interesting point. Do we have a repository for the current containers? I
am not aware of one.


----------
From: Mich Talebzadeh <mi...@gmail.com>
Date: Tue, 17 Aug 2021 at 09:05
To: Maciej <ms...@gmail.com>
Cc: Spark dev list <de...@spark.apache.org>


Of course with PySpark there is the option of putting your packages in gz
format and sending them at spark-submit time:

--conf "spark.yarn.dist.archives"=pyspark_venv.tar.gz#environment \

However, in the Kubernetes cluster that file is going to be fairly massive
and will take time to unzip and share. The interpreter will be whatever
comes with the Docker image!
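
A rough sketch of that workflow (the use of venv-pack and the application
file name are my assumptions, not something prescribed in this thread):

python3 -m venv pyspark_venv
source pyspark_venv/bin/activate
pip install --no-cache-dir numpy pyyaml venv-pack
venv-pack -o pyspark_venv.tar.gz

# ship the archive at submit time; it is unpacked under ./environment
spark-submit \
  --conf "spark.yarn.dist.archives"=pyspark_venv.tar.gz#environment \
  your_app.py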










On Mon, 16 Aug 2021 at 18:46, Maciej <ms...@gmail.com> wrote:


----------
From: Maciej <ms...@gmail.com>
Date: Tue, 17 Aug 2021 at 09:42
To: <de...@spark.apache.org>


You're right, but with native dependencies (which is the case for the
packages I've mentioned before) we have to bundle complete environments. It
is doable, but if you do that, you're actually better off with a base image.
I don't insist it is something we have to address right now, just something
to keep in mind.


----------
From: Maciej <ms...@gmail.com>
Date: Tue, 17 Aug 2021 at 09:51
To: Holden Karau <ho...@pigscanfly.ca>
Cc: <de...@spark.apache.org>


On 8/17/21 4:04 AM, Holden Karau wrote:

These are some really good points all around.

I think, in the interest of simplicity, we'll start with just the 3 current
Dockerfiles in the Spark repo, but for the next release (3.3) we should
explore adding some more Dockerfiles/build options.

Sounds good.

However, I'd consider adding the guest language version to the tag names, i.e.

3.1.2_sparkpy_3.7-scala_2.12-java11

3.1.2_sparkR_3.6-scala_2.12-java11

and some basic safeguards in the layers, to make sure that these are really
the versions we use.
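
A possible shape for such a safeguard (the build argument and expected value
below are made up for illustration):

ARG expected_python=3.7
RUN python3 --version 2>&1 | grep -q "Python ${expected_python}" || \
    (echo "unexpected Python version" && exit 1)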


----------
From: Mich Talebzadeh <mi...@gmail.com>
Date: Tue, 17 Aug 2021 at 10:31
To: Maciej <ms...@gmail.com>
Cc: Holden Karau <ho...@pigscanfly.ca>, Spark dev list <
dev@spark.apache.org>


3.1.2_sparkpy_3.7-scala_2.12-java11

3.1.2_sparkR_3.6-scala_2.12-java11
Yes, let us go with that, and remember that we can change the tags at any
time. The accompanying release note should detail what is inside the
downloaded image.

+1 for me


-- 
>> Best regards,
>> Maciej Szymkiewicz
>>
>> Web: https://zero323.net
>> Keybase: https://keybase.io/zero323
>> Gigs: https://www.codementor.io/@zero323
>> PGP: A30CEF0C31A501EC
>>
>> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>
>

----------
From: Mich Talebzadeh <mi...@gmail.com>
Date: Tue, 17 Aug 2021 at 13:24
To:
Cc: Spark dev list <de...@spark.apache.org>


Examples:

docker images

REPOSITORY       TAG                                  IMAGE ID       CREATED          SIZE
spark/spark-py   3.1.1_sparkpy_3.7-scala_2.12-java8   ba3c17bc9337   2 minutes ago    2.19GB
spark            3.1.1-scala_2.12-java11              4595c4e78879   18 minutes ago   635MB


----------
From: Maciej <ms...@gmail.com>
Date: Tue, 17 Aug 2021 at 16:17
To: <de...@spark.apache.org>


Quick question ‒ is this actual output? If so, do we know what accounts for
the 1.5GB overhead of the PySpark image? Even without --no-install-recommends
this seems like a lot (if I recall correctly it was around 400MB for existing
images).


----------
From: Mich Talebzadeh <mi...@gmail.com>
Date: Tue, 17 Aug 2021 at 16:24
To: Maciej <ms...@gmail.com>
Cc: Spark dev list <de...@spark.apache.org>


Yes, I will double check. It includes Java 8 in addition to the base Java 11.

In addition, it has these Python packages for now (added for my own needs):

root@ce6773017a14:/opt/spark/work-dir# pip list
Package       Version
------------- -------
asn1crypto    0.24.0
cryptography  2.6.1
cx-Oracle     8.2.1
entrypoints   0.3
keyring       17.1.1
keyrings.alt  3.1.1
numpy         1.21.2
pip           21.2.4
py4j          0.10.9
pycrypto      2.6.1
PyGObject     3.30.4
pyspark       3.1.2
pyxdg         0.25
PyYAML        5.4.1
SecretStorage 2.3.1
setuptools    57.4.0
six           1.12.0
wheel         0.32.3



----------
From: Mich Talebzadeh <mi...@gmail.com>
Date: Tue, 17 Aug 2021 at 16:55
To: Maciej <ms...@gmail.com>
Cc: Spark dev list <de...@spark.apache.org>


With no additional Python packages etc. we get 1.41GB, compared to 2.19GB
before:

REPOSITORY       TAG                                      IMAGE ID       CREATED                  SIZE
spark/spark-py   3.1.1_sparkpy_3.7-scala_2.12-java8only   faee4dbb95dd   Less than a second ago   1.41GB
spark/spark-py   3.1.1_sparkpy_3.7-scala_2.12-java8       ba3c17bc9337   4 hours ago              2.19GB

root@233a81199b43:/opt/spark/work-dir# pip list
Package       Version
------------- -------
asn1crypto    0.24.0
cryptography  2.6.1
entrypoints   0.3
keyring       17.1.1
keyrings.alt  3.1.1
pip           21.2.4
pycrypto      2.6.1
PyGObject     3.30.4
pyxdg         0.25


----------
From: Andrew Melo <an...@gmail.com>
Date: Tue, 17 Aug 2021 at 16:57
To: Mich Talebzadeh <mi...@gmail.com>
Cc: Maciej <ms...@gmail.com>, Spark dev list <de...@spark.apache.org>


Silly Q, did you blow away the pip cache before committing the layer? That
always trips me up.

Cheers
Andrew
-- 
It's dark in this basement.


----------
From: Mich Talebzadeh <mi...@gmail.com>
Date: Tue, 17 Aug 2021 at 20:29
To: Andrew Melo <an...@gmail.com>
Cc: Maciej <ms...@gmail.com>, Spark dev list <de...@spark.apache.org>


Hi Andrew,

Can you please elaborate on blowing away the pip cache before committing the
layer?

Thanks,

Mich


----------
From: Andrew Melo <an...@gmail.com>
Date: Tue, 17 Aug 2021 at 20:44
To: Mich Talebzadeh <mi...@gmail.com>
Cc: Maciej <ms...@gmail.com>, Spark dev list <de...@spark.apache.org>


Hi Mich,

By default, pip caches downloaded binaries somewhere like
$HOME/.cache/pip. So after doing any "pip install", you'll want to either
delete that directory, or pass the "--no-cache-dir" option to pip to
prevent the downloaded binaries from being added to the image.

HTH
Andrew
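
For example, either of these (illustrative) forms keeps the cache out of the
image layer:

RUN pip install --no-cache-dir pyyaml numpy cx_Oracle
# or, if the flag is omitted, clean up in the same layer:
RUN pip install pyyaml numpy cx_Oracle && rm -rf /root/.cache/pip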


----------
From: Mich Talebzadeh <mi...@gmail.com>
Date: Tue, 17 Aug 2021 at 23:05
To: Andrew Melo <an...@gmail.com>
Cc: Maciej <ms...@gmail.com>, Spark dev list <de...@spark.apache.org>


Thanks Andrew, that was helpful.

Step 10/23 : RUN pip install pyyaml numpy cx_Oracle pyspark --no-cache-dir

And the reduction in size is considerable: 1.75GB vs 2.19GB. Note that the
original image has now been invalidated (it shows as <none>):

REPOSITORY       TAG                                  IMAGE ID       CREATED                  SIZE
spark/spark-py   3.1.1_sparkpy_3.7-scala_2.12-java8   ecef8bd15731   Less than a second ago   1.75GB
<none>           <none>                               ba3c17bc9337   10 hours ago             2.19GB


----------
From: Holden Karau <ho...@pigscanfly.ca>
Date: Tue, 17 Aug 2021 at 23:26
To: Mich Talebzadeh <mi...@gmail.com>
Cc: Andrew Melo <an...@gmail.com>, Maciej <ms...@gmail.com>,
Spark dev list <de...@spark.apache.org>


pip installing pyspark like that probably isn't a great idea since there
isn't a version pinned to it. Probably better to install from the local
files copied in than potentially from PyPI. Might be able to install in -e
mode where it'll do symlinks to save space, but I'm not sure.


----------
From: Mich Talebzadeh <mi...@gmail.com>
Date: Tue, 17 Aug 2021 at 23:52
To: Holden Karau <ho...@pigscanfly.ca>
Cc: Andrew Melo <an...@gmail.com>, Maciej <ms...@gmail.com>,
Spark dev list <de...@spark.apache.org>


Well, we need to decide what packages need to be installed with spark-py.
PySpark is not one of them, true.

The Docker build itself takes care of PySpark by copying it from the
$SPARK_HOME directory:

COPY python/pyspark ${SPARK_HOME}/python/pyspark
COPY python/lib ${SPARK_HOME}/python/lib

Please review the Dockerfile for Python in
$SPARK_HOME/kubernetes/dockerfiles/spark/bindings/python/Dockerfile
and make the changes needed:

ARG base_img
FROM $base_img
WORKDIR /
# Reset to root to run installation tasks
USER 0
RUN mkdir ${SPARK_HOME}/python
RUN apt-get update && \
    apt install -y python3 python3-pip && \
    pip3 install --upgrade pip setuptools && \
    # Removed the .cache to save space
    rm -r /root/.cache && rm -rf /var/cache/apt/*

COPY python/pyspark ${SPARK_HOME}/python/pyspark
COPY python/lib ${SPARK_HOME}/python/lib

WORKDIR /opt/spark/work-dir
ENTRYPOINT [ "/opt/entrypoint.sh" ]

# Specify the User that the actual main process will run as
ARG spark_uid=185
USER ${spark_uid}
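
(For anyone following along, this Dockerfile is normally consumed through
the build tool rather than docker build directly; a sketch, with the
repository name as a placeholder:)

cd $SPARK_HOME
./bin/docker-image-tool.sh -r <repo> -t 3.1.1 \
  -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile \
  build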


----------
From: Mich Talebzadeh <mi...@gmail.com>
Date: Wed, 18 Aug 2021 at 10:10
To: Holden Karau <ho...@pigscanfly.ca>
Cc: Andrew Melo <an...@gmail.com>, Maciej <ms...@gmail.com>,
Spark dev list <de...@spark.apache.org>


A rather related point.

The Docker image comes with the following Java:

root@73a798cc3303:/opt/spark/work-dir# java -version
openjdk version "11.0.12" 2021-07-20
OpenJDK Runtime Environment 18.9 (build 11.0.12+7)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.12+7, mixed mode, sharing)

For Java 8, I believe Debian buster does not provide Java 8, so it will have
to be added to the Docker image.

Which particular Java 8 should we go for?

For now I am using jdk1.8.0_201, which is Oracle Java. Current Debian
versions built in GCP use:

openjdk version "1.8.0_292"

Shall we choose and adopt one Java 8 version for the Docker images? This
would be in addition to the Java 11 already installed with the base image.

----------
From: Holden Karau <ho...@pigscanfly.ca>
Date: Wed, 18 Aug 2021 at 21:27
To: Mich Talebzadeh <mi...@gmail.com>
Cc: Andrew Melo <an...@gmail.com>, Maciej <ms...@gmail.com>,
Spark dev list <de...@spark.apache.org>


So the default image we use right now for the 3.2 line is 11-jre-slim; in
3.0 we used 8-jre-slim. I think these are OK bases for us to build from
unless someone has a good reason otherwise?
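
For what it's worth, the base can already be switched per build through the
existing build argument (the repo name and tag below are placeholders):

./bin/docker-image-tool.sh \
  -r <repo> \
  -t 3.1.1-java8 \
  -b java_image_tag=8-jre-slim \
  build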


----------
From: Mich Talebzadeh <mi...@gmail.com>
Date: Wed, 18 Aug 2021 at 23:42
To: Holden Karau <ho...@pigscanfly.ca>
Cc: Andrew Melo <an...@gmail.com>, Maciej <ms...@gmail.com>,
Spark dev list <de...@spark.apache.org>


We have both base images now:

REPOSITORY   TAG           IMAGE ID       CREATED        SIZE
openjdk      8-jre-slim    0d0a85fdf642   40 hours ago   187MB
openjdk      11-jre-slim   eb77da2ec13c   3 weeks ago    221MB


The only difference is the Java version.

For 11-jre-slim we have:

ARG java_image_tag=11-jre-slim

FROM openjdk:${java_image_tag}

And for 8-jre-slim:

ARG java_image_tag=8-jre-slim

FROM openjdk:${java_image_tag}



----------
From: Ankit Gupta <in...@gmail.com>
Date: Sat, 21 Aug 2021 at 15:50
To: Mich Talebzadeh <mi...@gmail.com>
Cc: Holden Karau <ho...@pigscanfly.ca>, Andrew Melo <an...@gmail.com>,
Maciej <ms...@gmail.com>, Spark dev list <de...@spark.apache.org>


Hey all,

Just a suggestion, or maybe a future enhancement: we should also try to use
different base OSs like buster, alpine, slim, stretch, etc. and add that to
the tag as well. This will help users choose images according to their
requirements.

Thanks and Regards.

Ankit Prakash Gupta
info.ankitp@gmail.com
LinkedIn : https://www.linkedin.com/in/infoankitp/
Medium: https://medium.com/@info.ankitp



----------
From: Mich Talebzadeh <mi...@gmail.com>
Date: Sat, 25 Dec 2021 at 16:24
To:
Cc: Spark dev list <de...@spark.apache.org>



Season's greetings to all.

A while back we discussed publishing docker images, mainly for Kubernetes.

An increasing number of people are using Spark on Kubernetes.

Following our previous discussions, what matters is the tag, which is the
detailed identifier of the image used. These images are normally loaded into
the container/artefact registries in the cloud.

For example, with SPARK_VERSION, SCALA_VERSION, DOCKERIMAGETAG, BASE_OS and
the DOCKERFILE used:


export PROJECT_ID=$(gcloud info --format='value(config.project)')
export GCP_CR=eu.gcr.io/${PROJECT_ID}

BASE_OS="buster"
SPARK_VERSION="3.1.1"
SCALA_VERSION="scala_2.12"
DOCKERFILE="java8PlusPackages"
DOCKERIMAGETAG="8-jre-slim"

# Building Docker image from the provided Dockerfile
cd $SPARK_HOME
/opt/spark/bin/docker-image-tool.sh \
  -r $GCP_CR \
  -t ${SPARK_VERSION}-${SCALA_VERSION}-${DOCKERIMAGETAG}-${BASE_OS}-${DOCKERFILE} \
  -b java_image_tag=${DOCKERIMAGETAG} \
  -p ./kubernetes/dockerfiles/spark/bindings/python/${DOCKERFILE} \
  build

This results in a Docker image created with the tag:

IMAGEDRIVER="eu.gcr.io/<PROJECT_ID>/spark-py:3.1.1-scala_2.12-8-jre-slim-buster-java8PlusPackages"

and

--conf spark.kubernetes.driver.container.image=${IMAGEDRIVER} \
 --conf spark.kubernetes.executor.container.image=${IMAGEDRIVER} \

The question is: do we need anything else in the tag itself, or is enough
information provided?
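
For context, a hedged sketch of how such a tag ends up being consumed on
Kubernetes (the API server address, namespace and application are
placeholders):

spark-submit \
  --master k8s://https://<K8S_API_SERVER>:443 \
  --deploy-mode cluster \
  --name spark-pi \
  --conf spark.kubernetes.namespace=<NAMESPACE> \
  --conf spark.kubernetes.driver.container.image=${IMAGEDRIVER} \
  --conf spark.kubernetes.executor.container.image=${IMAGEDRIVER} \
  local:///opt/spark/examples/src/main/python/pi.py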

Cheers


----------
From: Mich Talebzadeh <mi...@gmail.com>
Date: Mon, 21 Feb 2022 at 22:08
To: Holden Karau <ho...@pigscanfly.ca>
Cc: dev <de...@spark.apache.org>


forwarded



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.





----------
From: Mich Talebzadeh <mi...@gmail.com>
Date: Mon, 8 May 2023 at 10:30
To: Spark dev list <de...@spark.apache.org>



Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



Forwarded Conversation
Subject: Time to start publishing Spark Docker Images?
------------------------

From: Holden Karau <ho...@pigscanfly.ca>
Date: Thu, 22 Jul 2021 at 04:13
To: dev <de...@spark.apache.org>


Hi Folks,

Many other distributed computing (https://hub.docker.com/r/rayproject/ray
https://hub.docker.com/u/daskdev) and ASF projects (
https://hub.docker.com/u/apache) now publish their images to dockerhub.

We've already got the docker image tooling in place, I think we'd need to
ask the ASF to grant permissions to the PMC to publish containers and
update the release steps but I think this could be useful for folks.

Cheers,

Holden

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


----------
From: Kent Yao <ya...@gmail.com>
Date: Thu, 22 Jul 2021 at 04:22
To: Holden Karau <ho...@pigscanfly.ca>
Cc: dev <de...@spark.apache.org>


+1

Bests,

*Kent Yao *
@ Data Science Center, Hangzhou Research Institute, NetEase Corp.
*a spark enthusiast*
*kyuubi <https://github.com/yaooqinn/kyuubi>is a unified multi-tenant JDBC
interface for large-scale data processing and analytics, built on top
of Apache Spark <http://spark.apache.org/>.*
*spark-authorizer <https://github.com/yaooqinn/spark-authorizer>A Spark SQL
extension which provides SQL Standard Authorization for **Apache Spark
<http://spark.apache.org/>.*
*spark-postgres <https://github.com/yaooqinn/spark-postgres> A library for
reading data from and transferring data to Postgres / Greenplum with Spark
SQL and DataFrames, 10~100x faster.*
*itatchi <https://github.com/yaooqinn/spark-func-extras>A** library t**hat
brings useful functions from various modern database management
systems to **Apache
Spark <http://spark.apache.org/>.*



--------------------------------------------------------------------- To
unsubscribe e-mail: dev-unsubscribe@spark.apache.org

----------
From: Hyukjin Kwon <gu...@gmail.com>
Date: Fri, 13 Aug 2021 at 01:44
To: Kent Yao <ya...@gmail.com>, Dongjoon Hyun <do...@apache.org>
Cc: Holden Karau <ho...@pigscanfly.ca>, dev <de...@spark.apache.org>


+1, I think we generally agreed upon having it. Thanks Holden for headsup
and driving this.

+@Dongjoon Hyun <do...@apache.org> FYI

2021년 7월 22일 (목) 오후 12:22, Kent Yao <ya...@gmail.com>님이 작성:


----------
From: John Zhuge <jz...@apache.org>
Date: Fri, 13 Aug 2021 at 01:48
To: Hyukjin Kwon <gu...@gmail.com>
Cc: Dongjoon Hyun <do...@apache.org>, Holden Karau <ho...@pigscanfly.ca>,
Kent Yao <ya...@gmail.com>, dev <de...@spark.apache.org>


+1
-- 
John Zhuge


----------
From: Holden Karau <ho...@pigscanfly.ca>
Date: Fri, 13 Aug 2021 at 01:54
To: John Zhuge <jz...@apache.org>
Cc: Hyukjin Kwon <gu...@gmail.com>, Dongjoon Hyun <do...@apache.org>,
Kent Yao <ya...@gmail.com>, dev <de...@spark.apache.org>


Awesome, I've filed an INFRA ticket to get the ball rolling.


----------
From: Mich Talebzadeh <mi...@gmail.com>
Date: Fri, 13 Aug 2021 at 07:45
To:
Cc: dev <de...@spark.apache.org>


I concur this is a good idea and certainly worth exploring.

In practice, preparing docker images as deployable will throw some
challenges because creating docker for Spark  is not really a singular
modular unit, say  creating docker for Jenkins. It involves different
versions and different images for Spark and PySpark and most likely will
end up as part of Kubernetes deployment.


Individuals and organisations will deploy it as the first cut. Great but I
equally feel that good documentation on how to build a consumable
deployable image will be more valuable.  FRom my own experience the current
documentation should be enhanced, for example how to deploy working
directories, additional Python packages, build with different Java
versions  (version 8 or version 11) etc.


HTH


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.





----------
From: Bode, Meikel, NMA-CFD <Me...@bertelsmann.de>
Date: Fri, 13 Aug 2021 at 08:13
To: dev <de...@spark.apache.org>


Hi all,



I am Meikel Bode and only an interested reader of dev and user list.
Anyway, I would appreciate to have official docker images available.

Maybe one could get inspiration from the Jupyter docker stacks and provide
an hierarchy of different images like this:



https://jupyter-docker-stacks.readthedocs.io/en/latest/using/selecting.html#image-relationships



Having a core image only supporting Java, an extended supporting Python
and/or R etc.



Looking forward to the discussion.



Best,

Meikel


----------
From: Mich Talebzadeh <mi...@gmail.com>
Date: Fri, 13 Aug 2021 at 08:51
To: Bode, Meikel, NMA-CFD <Me...@bertelsmann.de>
Cc: dev <de...@spark.apache.org>


Agreed.

I have already built a few latest for Spark and PYSpark on 3.1.1 with Java
8 as I found out Java 11 does not work with Google BigQuery data warehouse.
However, to hack the Dockerfile one finds out the hard way.

For example how to add additional Python libraries like tensorflow etc.
Loading these libraries through Kubernetes is not practical as unzipping
and installing it through --py-files etc will take considerable time so
they need to be added to the dockerfile at the built time in directory for
Python under Kubernetes

/opt/spark/kubernetes/dockerfiles/spark/bindings/python

RUN pip install pyyaml numpy cx_Oracle tensorflow ....

Also you will need curl to test the ports from inside the docker

RUN apt-get update && apt-get install -y curl
RUN ["apt-get","install","-y","vim"]

As I said I am happy to build these specific dockerfiles plus the complete
documentation for it. I have already built one for Google (GCP). The
difference between Spark and PySpark version is that in Spark/scala a fat
jar file will contain all needed. That is not the case with Python I am
afraid.


----------
From: Mich Talebzadeh <mi...@gmail.com>
Date: Fri, 13 Aug 2021 at 08:59
To: Bode, Meikel, NMA-CFD <Me...@bertelsmann.de>
Cc: dev <de...@spark.apache.org>


should read PySpark


----------
From: Holden Karau <ho...@pigscanfly.ca>
Date: Fri, 13 Aug 2021 at 17:26
To: Mich Talebzadeh <mi...@gmail.com>
Cc: Bode, Meikel, NMA-CFD <Me...@bertelsmann.de>, dev <
dev@spark.apache.org>


So we actually do have a script that does the build already it's more a
matter of publishing the results for easier use. Currently the script
produces three images spark, spark-py, and spark-r. I can certainly see a
solid reason to publish like with a jdk11 & jdk8 suffix as well if there is
interest in the community. If we want to have a say spark-py-pandas for a
Spark container image with everything necessary for the Koalas stuff to
work then I think that could be a great PR from someone to add :)


----------
From: Mich Talebzadeh <mi...@gmail.com>
Date: Fri, 13 Aug 2021 at 23:43
To: Holden Karau <ho...@pigscanfly.ca>
Cc: Bode, Meikel, NMA-CFD <Me...@bertelsmann.de>, dev <
dev@spark.apache.org>
----------
From: Maciej <ms...@gmail.com>
Date: Mon, 16 Aug 2021 at 18:46
To: <de...@spark.apache.org>


I have a few concerns regarding PySpark and SparkR images.

First of all, how do we plan to handle interpreter versions? Ideally, we
should provide images for all supported variants, but based on the
preceding discussion and the proposed naming convention, I assume it is not
going to happen. If that's the case, it would be great if we could fix
interpreter versions based on some support criteria (lowest supported,
lowest non-deprecated, highest supported at the time of release, etc.)

Currently, we use the following:

   - for R use buster-cran35 Debian repositories which install R 3.6
   (provided version already changed in the past and broke image build ‒
   SPARK-28606).
   - for Python we depend on the system provided python3 packages, which
   currently provides Python 3.7.

which don't guarantee stability over time and might be hard to synchronize
with our support matrix.

Secondly, omitting libraries which are required for the full functionality
and performance, specifically

   - Numpy, Pandas and Arrow for PySpark
   - Arrow for SparkR

is likely to severely limit usability of the images (out of these, Arrow is
probably the hardest to manage, especially when you already depend on
system packages to provide R or Python interpreter).

-- 
Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
Keybase: https://keybase.io/zero323
Gigs: https://www.codementor.io/@zero323
PGP: A30CEF0C31A501EC



----------
From: Holden Karau <ho...@pigscanfly.ca>
Date: Tue, 17 Aug 2021 at 03:05
To: Maciej <ms...@gmail.com>
Cc: <de...@spark.apache.org>


These are some really good points all around.

I think, in the interest of simplicity, well start with just the 3 current
Dockerfiles in the Spark repo but for the next release (3.3) we should
explore adding some more Dockerfiles/build options.


----------
From: Mich Talebzadeh <mi...@gmail.com>
Date: Tue, 17 Aug 2021 at 07:27
To:
Cc: Spark dev list <de...@spark.apache.org>


Thanks for the notes all.

I think we ought to consider what docker general usage is.

Docker image by definition is a self contained general purpose entity
providing spark service at the common denominator. Some docker images like
the one for jenkins are far simpler to build as they have less moving parts
With spark you either deploy your own set-up, get a service provider like
Google to provide it as a service (and they provide not the most recent
version, for example Dataproc cluster runs on Spark 3.1.1) or use docker in
Kubernetes cluster GKE. They provide the cluster and cluster management but
you deploy your own docker image.

I don't know much about spark-R but I think if we take Spark latest and
spark-py built on 3.7.3 and Spark-py for data science with the most widely
used and current packages we should be OK.

This set-up will work as long as the documentation goes into details of
interpreter and package versions and provides a blueprint to build your own
custom version of docker with whatever version you prefer.

As I said if we look at flink, they provide flink with scala and java
version and of course latest

1.13.2-scala_2.12-java8, 1.13-scala_2.12-java8, scala_2.12-java8,
1.13.2-scala_2.12, 1.13-scala_2.12, scala_2.12, 1.13.2-java8, 1.13-java8,
java8, 1.13.2, 1.13, latest

I personally believe that providing the most popular ones serves the
purpose for the community and anything above and beyond has to be tailor
made.


----------
From: Holden Karau <ho...@pigscanfly.ca>
Date: Tue, 17 Aug 2021 at 07:36
To: Mich Talebzadeh <mi...@gmail.com>
Cc: Spark dev list <de...@spark.apache.org>


I as well think the largest use case of docker images would be on
Kubernetes.

While I have nothing against us adding more varieties I think it’s
important for us to get this started with our current containers, so I’ll
do that but let’s certainly continue exploring improvements after that.


----------
From: Mich Talebzadeh <mi...@gmail.com>
Date: Tue, 17 Aug 2021 at 07:40
To: Holden Karau <ho...@pigscanfly.ca>
Cc: Spark dev list <de...@spark.apache.org>


An interesting point. Do we have a repository for current containers. I am
not aware of it.


----------
From: Mich Talebzadeh <mi...@gmail.com>
Date: Tue, 17 Aug 2021 at 09:05
To: Maciej <ms...@gmail.com>
Cc: Spark dev list <de...@spark.apache.org>


Of course with PySpark, there is the option of putting your packages in gz
format and send them at spark-submit time

--conf "spark.yarn.dist.archives"=pyspark_venv.tar.gz#environment \

However, in the Kubernetes cluster that file is going to be fairly massive
and will take time to unzip and share. The interpreter will be what it
comes with the docker image!






   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 16 Aug 2021 at 18:46, Maciej <ms...@gmail.com> wrote:


----------
From: Maciej <ms...@gmail.com>
Date: Tue, 17 Aug 2021 at 09:42
To: <de...@spark.apache.org>


You're right, but with the native dependencies (this is the case for the
packages I've mentioned before) we have to bundle complete environments. It
is doable, but if you do that, you're actually better off with base image.
I don't insist it is something we have to address right now, just something
to keep in mind.


----------
From: Maciej <ms...@gmail.com>
Date: Tue, 17 Aug 2021 at 09:51
To: Holden Karau <ho...@pigscanfly.ca>
Cc: <de...@spark.apache.org>


On 8/17/21 4:04 AM, Holden Karau wrote:

These are some really good points all around.

I think, in the interest of simplicity, well start with just the 3 current
Dockerfiles in the Spark repo but for the next release (3.3) we should
explore adding some more Dockerfiles/build options.

Sounds good.

However, I'd consider adding guest lang version to the tag names, i.e.

3.1.2_sparkpy_3.7-scala_2.12-java11

3.1.2_sparkR_3.6-scala_2.12-java11

and some basics safeguards in the layers, to make sure that these are
really the versions we use.


----------
From: Mich Talebzadeh <mi...@gmail.com>
Date: Tue, 17 Aug 2021 at 10:31
To: Maciej <ms...@gmail.com>
Cc: Holden Karau <ho...@pigscanfly.ca>, Spark dev list <
dev@spark.apache.org>


3.1.2_sparkpy_3.7-scala_2.12-java11

3.1.2_sparkR_3.6-scala_2.12-java11
Yes let us go with that and remember that we can change the tags anytime.
The accompanying release note should detail what is inside the image
downloaded.

+1 for me


-- 
>> Best regards,
>> Maciej Szymkiewicz
>>
>> Web: https://zero323.net
>> Keybase: https://keybase.io/zero323
>> Gigs: https://www.codementor.io/@zero323
>> PGP: A30CEF0C31A501EC
>>
>> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>
>

----------
From: Mich Talebzadeh <mi...@gmail.com>
Date: Tue, 17 Aug 2021 at 13:24
To:
Cc: Spark dev list <de...@spark.apache.org>


Examples:

*docker images*

REPOSITORY       TAG                                  IMAGE ID
 CREATED          SIZE

spark/spark-py   3.1.1_sparkpy_3.7-scala_2.12-java8   ba3c17bc9337   2
minutes ago    2.19GB

spark            3.1.1-scala_2.12-java11              4595c4e78879   18
minutes ago   635MB



----------
From: Maciej <ms...@gmail.com>
Date: Tue, 17 Aug 2021 at 16:17
To: <de...@spark.apache.org>


Quick question ‒ is this actual output? If so, do we know what accounts
1.5GB overhead for PySpark image. Even without --no-install-recommends this
seems like a lot (if I recall correctly it was around 400MB for existing
images).


----------
From: Mich Talebzadeh <mi...@gmail.com>
Date: Tue, 17 Aug 2021 at 16:24
To: Maciej <ms...@gmail.com>
Cc: Spark dev list <de...@spark.apache.org>


Yes, I will double check. it includes java 8 in addition to base java 11.

in addition it has these Python packages for now (added for my own needs
for now)

root@ce6773017a14:/opt/spark/work-dir# pip list
Package       Version
------------- -------
asn1crypto    0.24.0
cryptography  2.6.1
cx-Oracle     8.2.1
entrypoints   0.3
keyring       17.1.1
keyrings.alt  3.1.1
numpy         1.21.2
pip           21.2.4
py4j          0.10.9
pycrypto      2.6.1
PyGObject     3.30.4
pyspark       3.1.2
pyxdg         0.25
PyYAML        5.4.1
SecretStorage 2.3.1
setuptools    57.4.0
six           1.12.0
wheel         0.32.3



----------
From: Mich Talebzadeh <mi...@gmail.com>
Date: Tue, 17 Aug 2021 at 16:55
To: Maciej <ms...@gmail.com>
Cc: Spark dev list <de...@spark.apache.org>


With no additional python packages etc we get 1.4GB compared to 2.19GB
before

REPOSITORY       TAG                                      IMAGE ID
 CREATED                  SIZE
spark/spark-py   3.1.1_sparkpy_3.7-scala_2.12-java8only   faee4dbb95dd
 Less than a second ago   1.41GB
spark/spark-py   3.1.1_sparkpy_3.7-scala_2.12-java8       ba3c17bc9337   4
hours ago              2.19GB

root@233a81199b43:/opt/spark/work-dir# pip list
Package       Version
------------- -------
asn1crypto    0.24.0
cryptography  2.6.1
entrypoints   0.3
keyring       17.1.1
keyrings.alt  3.1.1
pip           21.2.4
pycrypto      2.6.1
PyGObject     3.30.4
pyxdg         0.25


----------
From: Andrew Melo <an...@gmail.com>
Date: Tue, 17 Aug 2021 at 16:57
To: Mich Talebzadeh <mi...@gmail.com>
Cc: Maciej <ms...@gmail.com>, Spark dev list <de...@spark.apache.org>


Silly Q, did you blow away the pip cache before committing the layer? That
always trips me up.

Cheers
Andrew
-- 
It's dark in this basement.


----------
From: Mich Talebzadeh <mi...@gmail.com>
Date: Tue, 17 Aug 2021 at 20:29
To: Andrew Melo <an...@gmail.com>
Cc: Maciej <ms...@gmail.com>, Spark dev list <de...@spark.apache.org>


Hi Andrew,

Can you please elaborate on blowing pip cache before committing the layer?

Thanks,

Much


----------
From: Andrew Melo <an...@gmail.com>
Date: Tue, 17 Aug 2021 at 20:44
To: Mich Talebzadeh <mi...@gmail.com>
Cc: Maciej <ms...@gmail.com>, Spark dev list <de...@spark.apache.org>


Hi Mich,

By default, pip caches downloaded binaries to somewhere like
$HOME/.cache/pip. So after doing any "pip install", you'll want to either
delete that directory, or pass the "--no-cache-dir" option to pip to
prevent the download binaries from being added to the image.

HTH
Andrew


----------
From: Mich Talebzadeh <mi...@gmail.com>
Date: Tue, 17 Aug 2021 at 23:05
To: Andrew Melo <an...@gmail.com>
Cc: Maciej <ms...@gmail.com>, Spark dev list <de...@spark.apache.org>


Thanks Andrew, that was helpful.

Step 10/23 : RUN pip install pyyaml numpy cx_Oracle pyspark --no-cache-dir

And the reduction in size is considerable, 1.75GB vs 2.19GB . Note that the
original run has now been invalidated

REPOSITORY       TAG                                      IMAGE ID
 CREATED                  SIZE
spark/spark-py   3.1.1_sparkpy_3.7-scala_2.12-java8       ecef8bd15731
 Less than a second ago   1.75GB
<none>           <none>                                   ba3c17bc9337   10
hours ago             2.19GB


----------
From: Holden Karau <ho...@pigscanfly.ca>
Date: Tue, 17 Aug 2021 at 23:26
To: Mich Talebzadeh <mi...@gmail.com>
Cc: Andrew Melo <an...@gmail.com>, Maciej <ms...@gmail.com>,
Spark dev list <de...@spark.apache.org>


pip installing pyspark like that probably isn't a great idea since there
isn't a version tagged to it. Probably better to install from the local
files copied in than potentially from pypi. Might be able to install in -e
mode where it'll do symlinks to save space I'm not sure.


----------
From: Mich Talebzadeh <mi...@gmail.com>
Date: Tue, 17 Aug 2021 at 23:52
To: Holden Karau <ho...@pigscanfly.ca>
Cc: Andrew Melo <an...@gmail.com>, Maciej <ms...@gmail.com>,
Spark dev list <de...@spark.apache.org>


Well, we need to decide what packages need to be installed with spark-py.
PySpark is not one of them. true

The docker build itself takes care of PySpark by copying them from the
$SPARK_HOME directory

COPY python/pyspark ${SPARK_HOME}/python/pyspark
COPY python/lib ${SPARK_HOME}/python/lib

Please review the docker file for python in
$SPARK_HOME/kubernetes/dockerfiles/spark/bindings/python/Dockerfile
and make changes needed.

ARG base_img
FROM $base_img
WORKDIR /
# Reset to root to run installation tasks
USER 0
RUN mkdir ${SPARK_HOME}/python
RUN apt-get update && \
    apt install -y python3 python3-pip && \
    pip3 install --upgrade pip setuptools && \
    # Removed the .cache to save space
    rm -r /root/.cache && rm -rf /var/cache/apt/*

COPY python/pyspark ${SPARK_HOME}/python/pyspark
COPY python/lib ${SPARK_HOME}/python/lib

WORKDIR /opt/spark/work-dir
ENTRYPOINT [ "/opt/entrypoint.sh" ]

# Specify the User that the actual main process will run as
ARG spark_uid=185
USER ${spark_uid}


----------
From: Mich Talebzadeh <mi...@gmail.com>
Date: Wed, 18 Aug 2021 at 10:10
To: Holden Karau <ho...@pigscanfly.ca>
Cc: Andrew Melo <an...@gmail.com>, Maciej <ms...@gmail.com>,
Spark dev list <de...@spark.apache.org>


A rather related point

The docker image comes with the following java

root@73a798cc3303:/opt/spark/work-dir# java -version
openjdk version "11.0.12" 2021-07-20
OpenJDK Runtime Environment 18.9 (build 11.0.12+7)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.12+7, mixed mode, sharing)

For Java 8 I believe debian buster does not support Java 8,. This will be
added to the docker image.

Any particular java 8 we should go for.

For now I am using jdk1.8.0_201 which is Oracle Java. Current debian
versions built in GCP use

openjdk version "1.8.0_292"

Shall we choose and adopt one java 8 version for docker images? This will
be in addition to java 11 already installed with base


----------
From: Holden Karau <ho...@pigscanfly.ca>
Date: Wed, 18 Aug 2021 at 21:27
To: Mich Talebzadeh <mi...@gmail.com>
Cc: Andrew Melo <an...@gmail.com>, Maciej <ms...@gmail.com>,
Spark dev list <de...@spark.apache.org>


So the default image we use right now for the 3.2 line is 11-jre-slim, in
3.0 we used 8-jre-slim, I think these are ok bases for us to build from
unless someone has a good reason otherwise?


----------
From: Mich Talebzadeh <mi...@gmail.com>
Date: Wed, 18 Aug 2021 at 23:42
To: Holden Karau <ho...@pigscanfly.ca>
Cc: Andrew Melo <an...@gmail.com>, Maciej <ms...@gmail.com>,
Spark dev list <de...@spark.apache.org>


We have both base images now

REPOSITORY       TAG                                      IMAGE ID
 CREATED              SIZE
openjdk          8-jre-slim                               0d0a85fdf642   40
hours ago         187MB
openjdk          11-jre-slim                              eb77da2ec13c   3
weeks ago          221MB


Only java version differences:

For 11-jre-slim we have:

ARG java_image_tag=*11-jre-slim*

FROM openjdk:${java_image_tag}


And for 8-jre-slim


ARG java_image_tag=*8-jre-slim*

FROM openjdk:${java_image_tag}



----------
From: Ankit Gupta <in...@gmail.com>
Date: Sat, 21 Aug 2021 at 15:50
To: Mich Talebzadeh <mi...@gmail.com>
Cc: Holden Karau <ho...@pigscanfly.ca>, Andrew Melo <an...@gmail.com>,
Maciej <ms...@gmail.com>, Spark dev list <de...@spark.apache.org>


Hey All

Just a suggestion, or maybe a future enhancement, we should also try and
use different base OSs like buster, alpine, slim, stretch, etc. and add
that in the tag as well. This will help the users to choose the images
according to their requirements.

Thanks and Regards.

Ankit Prakash Gupta
info.ankitp@gmail.com
LinkedIn : https://www.linkedin.com/in/infoankitp/
Medium: https://medium.com/@info.ankitp



----------
From: Mich Talebzadeh <mi...@gmail.com>
Date: Sat, 25 Dec 2021 at 16:24
To:
Cc: Spark dev list <de...@spark.apache.org>



Season's greetings to all.

A while back we discussed publishing docker images, mainly for Kubernetes.

An increasing number of people are using Spark on Kubernetes.

Following our previous discussions, what matters is the tag, which is the
detailed identifier of the image used. These images are normally pushed to
container/artefact registries in the cloud.

For example, with SPARK_VERSION, SCALA_VERSION, DOCKERIMAGETAG, BASE_OS and
the DOCKERFILE used:


export PROJECT_ID=$(gcloud info --format='value(config.project)')
export GCP_CR=eu.gcr.io/${PROJECT_ID}

BASE_OS="buster"
SPARK_VERSION="3.1.1"
SCALA_VERSION="scala_2.12"
DOCKERFILE="java8PlusPackages"
DOCKERIMAGETAG="8-jre-slim"

# Build the Docker image from the provided Dockerfile
cd $SPARK_HOME
/opt/spark/bin/docker-image-tool.sh \
              -r $GCP_CR \
              -t ${SPARK_VERSION}-${SCALA_VERSION}-${DOCKERIMAGETAG}-${BASE_OS}-${DOCKERFILE} \
              -b java_image_tag=${DOCKERIMAGETAG} \
              -p ./kubernetes/dockerfiles/spark/bindings/python/${DOCKERFILE} \
              build

This results in a docker image with the tag

IMAGEDRIVER="eu.gcr.io/<PROJECT_ID>/spark-py:3.1.1-scala_2.12-8-jre-slim-buster-java8PlusPackages"
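
Once built, the image can be pushed to the registry with the same tool, for
example (a sketch using the same variables as above):

/opt/spark/bin/docker-image-tool.sh \
              -r $GCP_CR \
              -t ${SPARK_VERSION}-${SCALA_VERSION}-${DOCKERIMAGETAG}-${BASE_OS}-${DOCKERFILE} \
              push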

and referenced at submission time with

--conf spark.kubernetes.driver.container.image=${IMAGEDRIVER} \
--conf spark.kubernetes.executor.container.image=${IMAGEDRIVER}
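
For completeness, a minimal sketch of a Kubernetes submission using such an
image (the API server address, namespace and service account below are
placeholders, not values from this thread):

spark-submit \
  --master k8s://https://<K8S_API_SERVER>:443 \
  --deploy-mode cluster \
  --name pyspark-pi \
  --conf spark.kubernetes.namespace=spark \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-sa \
  --conf spark.kubernetes.driver.container.image=${IMAGEDRIVER} \
  --conf spark.kubernetes.executor.container.image=${IMAGEDRIVER} \
  --conf spark.executor.instances=2 \
  local:///opt/spark/examples/src/main/python/pi.py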

The question is: do we need anything else in the tag itself, or is the
information provided enough?

Cheers


----------
From: Mich Talebzadeh <mi...@gmail.com>
Date: Mon, 21 Feb 2022 at 22:08
To: Holden Karau <ho...@pigscanfly.ca>
Cc: dev <de...@spark.apache.org>


forwarded



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.