Posted to dev@spark.apache.org by Holden Karau <ho...@pigscanfly.ca> on 2018/01/22 07:24:16 UTC

Work-in-progress PySpark dependency management package (external)

Hi Y'all,

Over the past few years a lot of people have talked to me about their
difficulty managing dependencies with PySpark. Some folks and I (not as
part of the official Spark project or anything official like that) put
together a quick proof-of-concept library called "coffee-boat" to make it a
bit easier with PySpark and we'd love your early feedback.

It should support both packaging all of your dependencies in advance for
efficiency and (less efficiently) adding that one last package you forgot
in the middle of your notebook.

This _should_ work regardless of the cluster manager (with the exception of
local mode, which has very different addFile behaviour), but I've only
tested* it on standalone & YARN. My limited testing shows it to be
resilient to worker resets/restarts, but I'm sure the real world will come
up with more ways to make things fail. It is more than a little hacky (e.g.
it depends on sed).

The repo is at https://github.com/nteract/coffee_boat & we have some
starter issues if anyone is looking to contribute
https://github.com/nteract/coffee_boat/issues.

If this looks like an OK path forward and we work out some of the kinks,
I'll send this over to user@ in a while.


Cheers,

Holden

*For a very loose version of the word "tested"
-- 
Twitter: https://twitter.com/holdenkarau