Posted to dev@toree.apache.org by Mana M <ma...@gmail.com> on 2017/12/11 19:07:34 UTC

Installing Python Dependencies on Spark Cluster Hosts with Toree

Hello,

I am new to Spark and Jupyter and am setting them up for our data analysis
team. I have a question that I cannot find an answer to anywhere; I hope
someone here can help.

I have set up a multi-host Spark cluster and have also successfully installed
Jupyter with JupyterHub. This setup will be shared among several data
analysis teams.

The Spark cluster is set up with some common Python libraries, but each user
may need additional libraries for their experiments from time to time. Is
it possible for a Jupyter user to install Python dependencies for their
notebook, so that the dependencies are available on all Spark cluster nodes
before the user runs the notebook through Jupyter?

I read about the line magics (such as %AddDeps) in Apache Toree, but I did
not find any information on adding Python dependencies.

Thanks,
Mana

Re: Installing Python Dependencies on Spark Cluster Hosts with Toree

Posted by Luciano Resende <lu...@gmail.com>.

Toree does not provide any capabilities for managing the required
dependencies on remote execution nodes. Some approaches used in the
community are:

- Anaconda or Anaconda Enterprise, which lets you build environments that
  are available/replicated on all nodes
- Mapped user folders that can be used to host the necessary packages

I have also seen some discussions in the Spark community about handling this
better, but I don't believe it has been completely solved.
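
For pure-Python packages, one possible way to picture the "folder that hosts
the necessary packages" idea is to zip a package from a user folder and hand
the archive to Spark from the notebook. The sketch below is not something
described in this thread: zip_package_dir and ship_to_executors are
hypothetical helper names, and it assumes a PySpark notebook where a
SparkContext is available. The one Spark call it relies on is
SparkContext.addPyFile, which accepts .py, .zip, and .egg files.

```python
import zipfile
from pathlib import Path


def zip_package_dir(package_dir, zip_path):
    """Bundle the .py files of a package directory into a zip archive.

    Entries are stored relative to the package's parent directory, so
    `import <package name>` works once the archive is on sys.path.
    (Hypothetical helper, not part of Toree or Spark.)
    """
    package_dir = Path(package_dir)
    zip_path = Path(zip_path)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for source in sorted(package_dir.rglob("*.py")):
            arcname = source.relative_to(package_dir.parent).as_posix()
            archive.write(source, arcname)
    return zip_path


def ship_to_executors(sc, zip_path):
    """Ask Spark to distribute the archive to the executors.

    `sc` is assumed to be a pyspark SparkContext. (Hypothetical helper.)
    """
    sc.addPyFile(str(zip_path))
```

In a notebook this might be used as
ship_to_executors(sc, zip_package_dir("/path/to/mypkg", "/tmp/mypkg.zip")),
where both paths are placeholders. This only helps with pure-Python code;
packages with compiled extensions generally still need an environment that
is installed on every node, as in the Anaconda approach above.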


-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/