Posted to user@spark.apache.org by Mich Talebzadeh <mi...@gmail.com> on 2022/02/23 11:16:21 UTC
Spark 3.1.3 docker pre-built with Python Data science packages
Some people asked me whether it was possible to create a Dockerfile (Spark
3.1.3) with Python packages geared towards data science, with the following
packages pre-installed:
pyyaml TensorFlow Theano Pandas Keras NumPy SciPy Scrapy SciKit-Learn
XGBoost Matplotlib Seaborn Bokeh Plotly pydot Statsmodels
OK, I built the image and pushed it to the Docker repository. It is called
spark-py-pthonpackages-3.1.3-scala_2.12-11-jre-slim-buster
<https://hub.docker.com/layers/michtalebzadeh/spark_dockerfiles/spark-py-pthonpackages-3.1.3-scala_2.12-11-jre-slim-buster/images/sha256-1a76a9279e9dbaeb9c554fba601b85ecd76cbf3956c81b94eb4552c2d1435366?context=repo>
It is 1.3 GB, compared with 432.79 MB for the normal spark-py image,
and you can download it from
https://hub.docker.com/repository/docker/michtalebzadeh/spark_dockerfiles/tags?page=1&ordering=last_updated
These are the packages installed inside this Docker image:
docker run -u 0 -it 7621929f9c97 bash
root@bb71cb7a89de:/opt/spark/work-dir# pip list
Package Version
---------------------------- -------------------
absl-py 1.0.0
astunparse 1.6.3
attrs 21.4.0
Automat 20.2.0
bokeh 2.4.2
cachetools 5.0.0
certifi 2021.10.8
cffi 1.15.0
charset-normalizer 2.0.12
constantly 15.1.0
cryptography 36.0.1
cssselect 1.1.0
cycler 0.11.0
flatbuffers 2.0
fonttools 4.29.1
gast 0.5.3
google-auth 2.6.0
google-auth-oauthlib 0.4.6
google-pasta 0.2.0
grpcio 1.44.0
h2 3.2.0
h5py 3.6.0
hpack 3.0.0
hyperframe 5.2.0
hyperlink 21.0.0
idna 3.3
importlib-metadata 4.11.1
incremental 21.3.0
itemadapter 0.4.0
itemloaders 1.0.4
Jinja2 3.0.3
jmespath 0.10.0
joblib 1.1.0
keras 2.8.0
Keras-Preprocessing 1.1.2
kiwisolver 1.3.2
libclang 13.0.0
lxml 4.8.0
Markdown 3.3.6
MarkupSafe 2.1.0
matplotlib 3.5.1
numpy 1.22.2
oauthlib 3.2.0
opt-einsum 3.3.0
packaging 21.3
pandas 1.4.1
parsel 1.6.0
patsy 0.5.2
Pillow 9.0.1
pip 22.0.3
plotly 5.6.0
priority 1.3.0
Protego 0.2.1
protobuf 3.19.4
pyasn1 0.4.8
pyasn1-modules 0.2.8
pycparser 2.21
PyDispatcher 2.0.5
pydot 1.4.2
pyOpenSSL 22.0.0
pyparsing 3.0.7
python-dateutil 2.8.2
pytz 2021.3
PyYAML 6.0
queuelib 1.6.2
requests 2.27.1
requests-oauthlib 1.3.1
rsa 4.8
scikit-learn 1.0.2
scipy 1.8.0
Scrapy 2.5.1
seaborn 0.11.2
service-identity 21.1.0
setuptools 60.9.3
six 1.16.0
statsmodels 0.13.2
tenacity 8.0.1
tensorboard 2.8.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.1
tensorflow 2.8.0
tensorflow-io-gcs-filesystem 0.24.0
termcolor 1.1.0
tf-estimator-nightly 2.8.0.dev2021122109
Theano 1.0.5
threadpoolctl 3.1.0
tornado 6.1
Twisted 22.1.0
typing_extensions 4.1.1
urllib3 1.26.8
w3lib 1.22.0
Werkzeug 2.0.3
wheel 0.34.2
wrapt 1.13.3
xgboost 1.5.2
zipp 3.7.0
zope.interface 5.4.0
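As a quick sanity check, here is a minimal sketch (not part of the image itself) that could be run with python3 inside the container to confirm that the requested packages are present. It assumes Python 3.8 or later, since it relies on importlib.metadata from the standard library. The REQUESTED list and the installed_versions helper are illustrative names; the entries use the distribution names as printed by pip list above.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as they appear in the `pip list` output above.
# This list is an illustrative assumption, not something shipped in the image.
REQUESTED = [
    "PyYAML", "tensorflow", "Theano", "pandas", "keras", "numpy",
    "scipy", "Scrapy", "scikit-learn", "xgboost", "matplotlib",
    "seaborn", "bokeh", "plotly", "pydot", "statsmodels",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Print one line per requested package, flagging anything not installed.
    for name, ver in installed_versions(REQUESTED).items():
        print(f"{name:<15} {ver or 'MISSING'}")
```

Any package reported as MISSING would point to a mismatch between the image and the list above.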
Let me know how it works for you.
View my LinkedIn profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
https://en.everybodywiki.com/Mich_Talebzadeh
*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.