Posted to dev@spark.apache.org by Hossein <fa...@gmail.com> on 2015/09/22 03:18:50 UTC

SparkR package path

Hi dev list,

The SparkR backend assumes SparkR source files are located under
"SPARK_HOME/R/lib/." This directory is created by running R/install-dev.sh.
This setting makes sense for Spark developers, but if an R user downloads
and installs the SparkR source package, the source files are going to be
placed in different locations.

In the R runtime it is easy to find the location of package files using
path.package("SparkR"). But we need to make some changes to the R backend
and/or spark-submit so that the JVM process learns the location of worker.R,
daemon.R, and shell.R from the R runtime.

Do you think this change is feasible?

Thanks,
--Hossein

Re: SparkR package path

Posted by Luciano Resende <lu...@gmail.com>.
For host information, are you looking for something like this (which is
already available in Spark 1.5 today)?

# Spark-related configuration
Sys.setenv("SPARK_MASTER_IP" = "127.0.0.1")
Sys.setenv("SPARK_LOCAL_IP" = "127.0.0.1")

# Load libraries
library("rJava")
library(SparkR, lib.loc = "/...../spark-bin/R/lib")

# Initialize the Spark context
sc <- sparkR.init(sparkHome = "/...../spark-bin",
                  sparkPackages = "com.databricks:spark-csv_2.11:1.2.0")
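
As a follow-up illustration (the sqlContext and CSV path below are just an
example, not part of the snippet above), the spark-csv package pulled in via
sparkPackages could then be used like this:

# Create a SQLContext on top of the SparkContext
sqlContext <- sparkRSQL.init(sc)

# Read a CSV file through the com.databricks.spark.csv data source
df <- read.df(sqlContext, "/path/to/data.csv",
              source = "com.databricks.spark.csv", header = "true")
head(df)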



-- 
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/

RE: SparkR package path

Posted by "Sun, Rui" <ru...@intel.com>.
Yes, the current implementation requires the backend to be on the same host as the SparkR package. But this does not prevent SparkR from connecting to a remote Spark cluster specified by a Spark master URL. The only thing needed is a Spark JAR co-located with the SparkR package on the same client machine. This is similar to any Spark application, which also depends on the Spark JAR.
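
For example (the master URL and paths below are placeholders), something along these lines works against a remote cluster as long as a local Spark distribution provides the JAR:

# Load SparkR from a local Spark distribution on the client machine
library(SparkR, lib.loc = "/path/to/spark/R/lib")

# Point the context at the remote cluster via its master URL
sc <- sparkR.init(master = "spark://cluster-host:7077",
                  sparkHome = "/path/to/spark")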

Theoretically, as the SparkR package communicates with the backend via a socket, the backend could run on a different host. But this would make the launching of SparkR more complex, requiring a non-trivial change to spark-submit, and additional network traffic overhead would be incurred. I can't see any compelling demand for this.



Re: SparkR package path

Posted by Hossein <fa...@gmail.com>.
Right now in sparkR.R the backend hostname is hard-coded to "localhost" (
https://github.com/apache/spark/blob/master/R/pkg/R/sparkR.R#L156).

If we make that address configurable / parameterized, then a user can
connect to a remote Spark cluster with no need to have the Spark jars on
their local machine. I have received this request from some R users. Their
company has a Spark cluster (usually managed by another team), and they
want to connect to it from their workstation (e.g., from within RStudio).
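
Purely to illustrate the kind of change I have in mind (none of the names
below exist today; they are made up), the connection could be parameterized
roughly like this:

# Hypothetical: point SparkR at an already-running backend on another machine
Sys.setenv(SPARKR_BACKEND_HOST = "edge-node.example.com",  # hypothetical variable
           SPARKR_BACKEND_PORT = "12345")                  # hypothetical variable
sc <- sparkR.init(master = "spark://cluster-host:7077")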



--Hossein


Re: SparkR package path

Posted by Shivaram Venkataraman <sh...@eecs.berkeley.edu>.
I don't think the crux of the problem is about users who download the
source -- Spark's source distribution is clearly marked as something
that needs to be built, and they can run `mvn -DskipTests -Psparkr
package` based on the instructions in the Spark docs.

The crux of the problem is that with a source or binary R package, the
client-side SparkR code needs the Spark JARs to be available. So we
can't just connect to a remote Spark cluster using only the R scripts,
as we need the Scala classes around to create a Spark context etc.

But this is a use case that I've heard from a lot of users -- my take
is that this should be a separate package / layer on top of SparkR.
Dan Putler (cc'd) had a proposal for a client package for this and
may be able to add more.

Thanks
Shivaram




RE: SparkR package path

Posted by "Sun, Rui" <ru...@intel.com>.
If a user downloads the Spark source, of course he needs to build it before running it. But a user can download a pre-built Spark binary distribution and then directly use sparkR after deployment of the Spark cluster.


Re: SparkR package path

Posted by Hossein <fa...@gmail.com>.
Requiring users to download the entire Spark distribution to connect to a
remote cluster (which is already running Spark) seems like overkill. Even
for most Spark users who download the Spark source, it is very unintuitive
that they need to run a script named "install-dev.sh" before they can run
SparkR.

--Hossein


RE: SparkR package path

Posted by "Sun, Rui" <ru...@intel.com>.
The SparkR package is not a standalone R package; it is actually the R API of Spark and needs to cooperate with a matching version of Spark, so exposing it on CRAN does not make things easier for R users, as they still need to download a matching Spark distribution -- unless we publish a SparkR package bundled with Spark to CRAN. Is this desirable? Actually, normal users who are not developers are not required to download the Spark source, build, and install the SparkR package. They just need to download a Spark distribution and then use SparkR.

For using SparkR in RStudio, there is documentation at https://github.com/apache/spark/tree/master/R
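
Roughly, that setup boils down to pointing the R session at a Spark distribution (the SPARK_HOME path below is a placeholder):

# Tell the session where the pre-built Spark distribution lives
Sys.setenv(SPARK_HOME = "/path/to/spark")

# Add the SparkR package shipped inside that distribution to the library path
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))

library(SparkR)
sc <- sparkR.init(master = "local[*]")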





Re: SparkR package path

Posted by Hossein <fa...@gmail.com>.
Yes, I think exposing SparkR on CRAN can significantly expand the reach of
both SparkR and Spark itself to a larger community of data scientists (and
statisticians).

I have been getting questions on how to use SparkR in RStudio. Most of
these folks have a Spark cluster and wish to talk to it from RStudio. While
that is a bigger task, for now a first step could be not requiring them to
download the Spark source and run a script named install-dev.sh. I filed
SPARK-10776 to track this.


--Hossein


Re: SparkR package path

Posted by Shivaram Venkataraman <sh...@eecs.berkeley.edu>.
As Rui says, it would be good to understand the use case we want to
support (supporting CRAN installs could be one, for example). I don't
think it should be very hard to do, as the RBackend itself doesn't use
the R source files. The RRDD does use them, and the value comes from
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/RUtils.scala#L29
AFAIK -- so we could introduce a new config flag that can be used for
this new mode.
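
To make the idea concrete (the "spark.r.sparkr.package.dir" property name
below is made up; no such flag exists yet), such a flag could be passed from
R through the existing sparkEnvir argument of sparkR.init:

library(SparkR)
# Pass the installed package location as a (hypothetical) Spark property
sc <- sparkR.init(master = "local[*]",
                  sparkEnvir = list("spark.r.sparkr.package.dir" = path.package("SparkR")))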

Thanks
Shivaram




RE: SparkR package path

Posted by "Sun, Rui" <ru...@intel.com>.
Hossein,

Any strong reason to download and install the SparkR source package separately from the Spark distribution?
An R user can simply download the Spark distribution, which contains the SparkR source and binary packages, and directly use sparkR. No need to install the SparkR package at all.
