Posted to dev@spark.apache.org by Yanbo Liang <yb...@gmail.com> on 2016/09/07 06:36:48 UTC

Discuss SparkR executors/workers support virtualenv

Hi All,


Many users need to use third-party R packages in executors/workers, but
SparkR cannot satisfy this requirement elegantly. For example, you have to
ask the IT/administrators of the cluster to deploy these R packages on each
executor/worker node, which is very inflexible.

I think we should support third-party R packages for SparkR users, as we do
for jar packages, in the following two scenarios (a sketch of the manual
workaround users currently resort to follows the list):
1. Users can install R packages from CRAN or a custom CRAN-like repository
on each executor.
2. Users can load their local R packages and install them on each executor.
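
For illustration, here is a rough sketch of what users currently have to do
by hand to approximate scenario 1, using only the existing SparkR API
(spark.lapply); the package name "forecast", the library path, and the CRAN
mirror are just examples:

    library(SparkR)
    sparkR.session()

    versions <- spark.lapply(1:4, function(i) {
      # Install into a private, per-worker library if the package is missing
      lib <- file.path(tempdir(), "r-libs")
      dir.create(lib, showWarnings = FALSE, recursive = TRUE)
      .libPaths(c(lib, .libPaths()))
      if (!requireNamespace("forecast", quietly = TRUE)) {
        install.packages("forecast", lib = lib,
                         repos = "https://cloud.r-project.org")
      }
      as.character(packageVersion("forecast"))
    })

This has to be repeated in every job and offers no isolation or caching
across applications, which is what the proposed virtualenv support would
take care of.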

To achieve this goal, the first step is to make SparkR executors support a
virtualenv-like mechanism, similar to conda for Python. I have investigated
and found that Packrat (http://rstudio.github.io/packrat/) is one of the
candidates for supporting this in R. Packrat is a dependency management
system for R that can isolate the dependent R packages in its own private
package library. SparkR users could then install third-party packages at
application scope (destroyed after the application exits) and would not
need to bother IT/administrators to install these packages manually.
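
To make the isolation model concrete, here is a minimal sketch of how
Packrat works on its own, outside of SparkR; the package "data.table" is
just an example:

    # Run inside a fresh project directory
    install.packages("packrat")
    packrat::init()                 # creates a project-private library under packrat/
    install.packages("data.table")  # goes into the private library, not the site library
    packrat::snapshot()             # records exact dependency versions in the lockfile
    packrat::restore()              # rebuilds the same private library elsewhere

The idea would be to ship or rebuild such a private library on each executor
so that the worker R processes only see the application's own packages.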

I would like to know whether this makes sense.


Thanks

Yanbo

Re: Discuss SparkR executors/workers support virtualenv

Posted by Shivaram Venkataraman <sh...@eecs.berkeley.edu>.
I think this makes sense -- making it easier to use additional R
packages would be a good feature. I am not sure we need Packrat for
this use case, though. Let's continue the discussion on the JIRA at
https://issues.apache.org/jira/browse/SPARK-17428

Thanks
Shivaram

