Posted to dev@zeppelin.apache.org by Jeff Steinmetz <je...@gmail.com> on 2016/02/23 18:03:25 UTC

R and SparkR Support

Hello zeppelin dev group,

Regarding the R interpreter pull requests 208 and 702: I am trying to figure out whether their functionality overlaps, or whether one supports something the other does not.  Is 702 a superset of 208 (702 being a fork of 208)?

Can you pass a reference to a distributed (parallelized) dataframe built in %spark (Scala) to the R interpreter, similar to z.put("myDF", myDF)?

Similarly, since R doesn't support serialization of functions (unless you use something from the SparkR library), is there an example of collecting the parallel DF to a local DF? (I realize this means the dataset needs to fit in local memory on the Zeppelin server.)
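
For reference, the ZeppelinContext hand-off in the first question looks roughly like this on the Scala side (a sketch with made-up names and path; whether the R interpreter can read the object back out of the context is exactly what is being asked here):

    %spark
    val myDF = sqlContext.read.json("/tmp/people.json")   // any DataFrame
    z.put("myDF", myDF)

    %spark
    // a later Scala paragraph can retrieve it from the Zeppelin context
    val df = z.get("myDF").asInstanceOf[org.apache.spark.sql.DataFrame]
    df.count()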

I can dig into this a bit and help out where appropriate; however, it's unclear which PR to focus my efforts on.

Best,
Jeff Steinmetz
Principal Architect
Akili Interactive Labs







On 2/23/16, 8:01 AM, "elbamos" <gi...@git.apache.org> wrote:

>Github user elbamos commented on the pull request:
>
>    https://github.com/apache/incubator-zeppelin/pull/702#issuecomment-187764059
>  
>    @btiernay support for that has been in 208 all along... 
>    
>    > On Feb 23, 2016, at 9:27 AM, Bob Tiernay <no...@github.com> wrote:
>    > 
>    > @echarles This is great! Thanks for all your hard work. Very much appreciated!
>    > 
>    > --
>    > Reply to this email directly or view it on GitHub.
>    > 
>
>
>
>---
>If your project is set up for it, you can reply to this email and have your
>reply appear on GitHub as well. If your project does not have this feature
>enabled and wishes so, or if the feature is enabled but not working, please
>contact infrastructure at infrastructure@apache.org or file a JIRA ticket
>with INFRA.
>---


Re: R and SparkR Support

Posted by Eric Charles <er...@apache.org>.

On 23/02/16 18:03, Jeff Steinmetz wrote:
> Hello zeppelin dev group,
>
> Regarding the R Interpreter Pull requests 208 and 702.  I am trying to figure out if the functionality between these are overlapping, or one supports something different than the other.  Is 702 a super set of 208 (702 is a fork of 208)?
>

702 is not a fork of 208; it is code that had been sitting in a public repo for a 
long time, and I finally decided to open a PR to deal with the points expressed 
in [1]

[1] 
https://github.com/apache/incubator-zeppelin/pull/208#issuecomment-170337289

> Can you pass the reference of a distributed (parallelized) dataframe built in %spark (scala) to the R interpreter?   Similar to z.put(“myDF", myDF)?
>

I expect passing a dataframe via the Zeppelin context will fail, but as 
the interpreters run on the same Spark REPL, the dataframes are 
accessible in both R and Scala (see links to screenshots):

https://raw.githubusercontent.com/datalayer/datalayer-zeppelin/rscala/_Rimg/r-scala-dataframe-binding.png

https://raw.githubusercontent.com/datalayer/datalayer-zeppelin/rscala/_Rimg/scala-r-dataframe-binding.png
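
For readers following the thread, the shared-REPL pattern in those screenshots typically amounts to registering a temp table on one side and querying it on the other. A minimal sketch against Spark 1.5/1.6 SparkR (the file path is made up, sqlContext is assumed to be exposed to the R paragraph, and the exact interpreter keyword, %r, %sparkr or %spark.r, depends on which PR you build):

    %spark
    val people = sqlContext.read.json("/tmp/people.json")
    people.registerTempTable("people")

    %r
    # the R paragraph shares the same SQLContext, so the registered table is visible
    adults <- sql(sqlContext, "SELECT name, age FROM people WHERE age >= 18")
    head(adults)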


> Similarly, since R doesn’t support serialization of functions (unless you use something from the SparkR library) is there an example of collecting the parallel DF to a local DF (which I realize it means the dataset needs to fit in local memory on the zeppelin server).


That's something I have been thinking about for a long time, especially to 
allow visualizing small datasets, or subsets, with R visualizations. I will 
try to show an example.
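
A minimal sketch of what that could look like, reusing the adults DataFrame from the previous example (collect() pulls the rows into the Zeppelin server's R process, so it should only be done on small or sampled data):

    %r
    # sample ~10% of the distributed DataFrame, then collect it to a local data.frame
    local_df <- collect(sample(adults, withReplacement = FALSE, fraction = 0.1))
    summary(local_df)   # local_df is an ordinary R data.frame, usable by any R plotting library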

>
> I can to dig into this a bit and help out where appropriate, however its unclear which PR to focus my efforts on.
>

I guess you can check out both and play with them to get a better feel for 
what they offer.

> Best,
> Jeff Steinmetz
> Principal Architect
> Akili Interactive Labs
>
>
>
>
>
>
>
> On 2/23/16, 8:01 AM, "elbamos" <gi...@git.apache.org> wrote:
>
>> Github user elbamos commented on the pull request:
>>
>>     https://github.com/apache/incubator-zeppelin/pull/702#issuecomment-187764059
>>
>>     @btiernay support for that has been in 208 all along...
>>
>>     > On Feb 23, 2016, at 9:27 AM, Bob Tiernay <no...@github.com> wrote:
>>     >
>>     > @echarles This is great! Thanks for all your hard work. Very much appreciated!
>>     >
>>     > --
>>     > Reply to this email directly or view it on GitHub.
>>     >
>>
>>
>>
>> ---
>> If your project is set up for it, you can reply to this email and have your
>> reply appear on GitHub as well. If your project does not have this feature
>> enabled and wishes so, or if the feature is enabled but not working, please
>> contact infrastructure at infrastructure@apache.org or file a JIRA ticket
>> with INFRA.
>> ---
>

Re: R and SparkR Support

Posted by "Amos B. Elberg" <am...@gmail.com>.
I continue to see no point in engaging in this as a debate. 

The user acceptance speaks for itself. (As just one thing, the only person who hasn't gotten the display system working in 208 is Eric.) So does the rate of change - there has been a series of pushes to 702 in the past month or two, either fixing problems related to (1) or adding functionality that Eric originally said wasn't required or was a bad idea, but added after I pointed it out or users complained.  The reason that process slowed is that I've stopped highlighting the gaps. 

If anyone has a question about any of this, I'll address it.

> On Feb 23, 2016, at 2:06 PM, Eric Charles <er...@apache.org> wrote:
> 
> 
>> On 23/02/16 19:52, Amos B. Elberg wrote:
>> Eric, they're not equivalent. 208 continues to have functionality 702 doesn't, including the display system.
>> 
>> I'm not going to tell you what you're doing wrong in your implementation and "test" of 208, because the users don't seem to have the same confusion, and I've essentially been guiding your development process by pointing out the issues.
>> 
>> All three of the issues you raise were addressed already in other threads:
>> 
>> 1. The proposed approach to rscala actually introduces maintenance issues that have already broken 702. 702 was then revised to work around that, by distributing part of rscala in binary form. But the workaround doesn't deal with the issue of R users updating their own installations, and it eliminates the purported benefit of the approach.
> 
> Using binary form with a specific version at build time is the classical way to deploy on machines. Upgrading machines with a new rscala library implies rebuilding and redeploying.
> 
> This flexibility is only possible with binaries and not with forked fixed source code. With 702, you can choose to build with scala 2.xx and rscala 1.0.8 or the version you want to align with the library available on your machines.
> 
>> 2. This is purely cosmetic. 208 is outside the spark module because it made development, testing and merging cleaner.
> 
> Sure, this is cosmetic, but I have tried to stick to the existing pyspark implementation to avoid additional maven modules. Btw, having two magic keywords as 208 offers is also something I have avoided to align with current practices and make it simple for the end user.
> 
>> 
>> 3. 208 has supported the HTML, TABLE and IMG display system all along, in an R-consistent manner. 702 originally did not support any of it. After I pointed out the gap and users complained, 702 was revised to implement it partially. 702 still does not. That's why the user questions about this all get asked on 702 - the people using 208 don't need to ask about it, because it works as expected.
> 
> I quickly pulled and tested today your branch but running print("%html <h1>hello</h1>") didn't work. Will try again tomorrow.
> 
>>> On Feb 23, 2016, at 1:20 PM, Eric Charles <er...@apache.org> wrote:
>>> 
>>> It would make no sense merging both.
>>> 
>>> From an end-user perspective, I guess both are equivalent, although with the last commit I made, the Zeppelin Display system is supported in 702 (I had no luck when testing this functionality with 208). As I said, feel free to test both and send feature requests.
>>> 
>>> From a developer perspective, I will reiterate the points I sent on [1] which are addressed in 702 (these points make sense to me but didn't receive echo so far - would like to get feedback on these):
>>> 
>>> 1.- Use rscala jar instead of forking -> allows to support the platform version (scala version...) and benefit from the rscala project new versions with patches without having to maintain in the zeppelin source tree fork.
>>> 
>>> 2.- Just like Python, develop R in the Spark module
>>> 
>>> 3.- Support the same behavior asthe rest (no TABLE when output is a dataframe, support the HTML, TABLE and IMG display system, support the Dynamic Form system).
>>> 
>>> I still have the Dynamic Form system operational.
>>> 
>>> [1] http://mail-archives.apache.org/mod_mbox/incubator-zeppelin-dev/201512.mbox/%3C5683E471.9010001%40apache.org%3E
>>> 
>>>> On 23/02/16 19:09, Jeff Steinmetz wrote:
>>>> Thank you Amos Elberg & Eric Charles:
>>>> Is the goal of the community to merge both 208 and 702 at some point as two “different” R interpreters?
>>>> 
>>>> One that is
>>>>   %r
>>>> And another that is
>>>>   %spark.r
>>>> 
>>>> Still trying to wrap my head around the difference.
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> On 2/23/16, 9:34 AM, "Amos B. Elberg" <am...@gmail.com> wrote:
>>>>> 
>>>>> Jeff - 702 isn't a fork, it's an alternative based on 208 that has a subset of 208's features.  208 is the superset. 208 is also what the community is now attempting to integrate.
>>>>> 
>>>>> R does support serialization of functions.
>>>>> 
>>>>> 208 does support passing a spark table back and forth between R and scala. Passing a data.frame through the Zeppelin context will fail in spark up to 1.5. It may now be working for some data frames in 1.6.
>>>>> 
>>>>> There are examples that do all these things in the documentation for 208 on my repo at github.com/elbamos/Zeppelin-With-R
>>>>> 
>>>>>> On Feb 23, 2016, at 12:03 PM, Jeff Steinmetz <je...@gmail.com> wrote:
>>>>>> 
>>>>>> Hello zeppelin dev group,
>>>>>> 
>>>>>> Regarding the R Interpreter Pull requests 208 and 702.  I am trying to figure out if the functionality between these are overlapping, or one supports something different than the other.  Is 702 a super set of 208 (702 is a fork of 208)?
>>>>>> 
>>>>>> Can you pass the reference of a distributed (parallelized) dataframe built in %spark (scala) to the R interpreter?   Similar to z.put(“myDF", myDF)?
>>>>>> 
>>>>>> Similarly, since R doesn’t support serialization of functions (unless you use something from the SparkR library) is there an example of collecting the parallel DF to a local DF (which I realize it means the dataset needs to fit in local memory on the zeppelin server).
>>>>>> 
>>>>>> I can to dig into this a bit and help out where appropriate, however its unclear which PR to focus my efforts on.
>>>>>> 
>>>>>> Best,
>>>>>> Jeff Steinmetz
>>>>>> Principal Architect
>>>>>> Akili Interactive Labs
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On 2/23/16, 8:01 AM, "elbamos" <gi...@git.apache.org> wrote:
>>>>>>> 
>>>>>>> Github user elbamos commented on the pull request:
>>>>>>> 
>>>>>>>   https://github.com/apache/incubator-zeppelin/pull/702#issuecomment-187764059
>>>>>>> 
>>>>>>>   @btiernay support for that has been in 208 all along...
>>>>>>> 
>>>>>>>> On Feb 23, 2016, at 9:27 AM, Bob Tiernay <no...@github.com> wrote:
>>>>>>>> 
>>>>>>>> @echarles This is great! Thanks for all your hard work. Very much appreciated!
>>>>>>>> 
>>>>>>>> --
>>>>>>>> Reply to this email directly or view it on GitHub.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> ---
>>>>>>> If your project is set up for it, you can reply to this email and have your
>>>>>>> reply appear on GitHub as well. If your project does not have this feature
>>>>>>> enabled and wishes so, or if the feature is enabled but not working, please
>>>>>>> contact infrastructure at infrastructure@apache.org or file a JIRA ticket
>>>>>>> with INFRA.
>>>>>>> ---
>>>> 

Re: R and SparkR Support

Posted by Eric Charles <er...@apache.org>.
On 23/02/16 19:52, Amos B. Elberg wrote:
> Eric, they're not equivalent. 208 continues to have functionality 702 doesn't, including the display system.
>
> I'm not going to tell you what you're doing wrong in your implementation and "test" of 208, because the users don't seem to have the same confusion, and I've essentially been guiding your development process by pointing out the issues.
>
> All three of the issues you raise were addressed already in other threads:
>
> 1. The proposed approach to rscala actually introduces maintenance issues that have already broken 702. 702 was then revised to work around that, by distributing part of rscala in binary form. But the workaround doesn't deal with the issue of R users updating their own installations, and it eliminates the purported benefit of the approach.
>

Using a binary with a specific version at build time is the classical 
way to deploy to machines. Upgrading machines to a new rscala library 
implies rebuilding and redeploying.

This flexibility is only possible with binaries, not with a forked, fixed 
copy of the source code. With 702, you can choose to build with Scala 2.xx 
and rscala 1.0.8, or whatever version you need, to align with the library 
available on your machines.

> 2. This is purely cosmetic. 208 is outside the spark module because it made development, testing and merging cleaner.

Sure, this is cosmetic, but I have tried to stick to the existing 
PySpark implementation to avoid additional Maven modules. By the way, having 
two magic keywords, as 208 offers, is also something I have avoided, to 
align with current practices and keep it simple for the end user.

>
> 3. 208 has supported the HTML, TABLE and IMG display system all along, in an R-consistent manner. 702 originally did not support any of it. After I pointed out the gap and users complained, 702 was revised to implement it partially. 702 still does not. That's why the user questions about this all get asked on 702 - the people using 208 don't need to ask about it, because it works as expected.
>

I quickly pulled and tested your branch today, but running print("%html 
<h1>hello</h1>") didn't work. I will try again tomorrow.

>> On Feb 23, 2016, at 1:20 PM, Eric Charles <er...@apache.org> wrote:
>>
>> It would make no sense merging both.
>>
>>  From an end-user perspective, I guess both are equivalent, although with the last commit I made, the Zeppelin Display system is supported in 702 (I had no luck when testing this functionality with 208). As I said, feel free to test both and send feature requests.
>>
>>  From a developer perspective, I will reiterate the points I sent on [1] which are addressed in 702 (these points make sense to me but didn't receive echo so far - would like to get feedback on these):
>>
>> 1.- Use rscala jar instead of forking -> allows to support the platform version (scala version...) and benefit from the rscala project new versions with patches without having to maintain in the zeppelin source tree fork.
>>
>> 2.- Just like Python, develop R in the Spark module
>>
>> 3.- Support the same behavior asthe rest (no TABLE when output is a dataframe, support the HTML, TABLE and IMG display system, support the Dynamic Form system).
>>
>> I still have the Dynamic Form system operational.
>>
>> [1] http://mail-archives.apache.org/mod_mbox/incubator-zeppelin-dev/201512.mbox/%3C5683E471.9010001%40apache.org%3E
>>
>>> On 23/02/16 19:09, Jeff Steinmetz wrote:
>>> Thank you Amos Elberg & Eric Charles:
>>> Is the goal of the community to merge both 208 and 702 at some point as two “different” R interpreters?
>>>
>>> One that is
>>>    %r
>>> And another that is
>>>    %spark.r
>>>
>>> Still trying to wrap my head around the difference.
>>>
>>>
>>>
>>>
>>>> On 2/23/16, 9:34 AM, "Amos B. Elberg" <am...@gmail.com> wrote:
>>>>
>>>> Jeff - 702 isn't a fork, it's an alternative based on 208 that has a subset of 208's features.  208 is the superset. 208 is also what the community is now attempting to integrate.
>>>>
>>>> R does support serialization of functions.
>>>>
>>>> 208 does support passing a spark table back and forth between R and scala. Passing a data.frame through the Zeppelin context will fail in spark up to 1.5. It may now be working for some data frames in 1.6.
>>>>
>>>> There are examples that do all these things in the documentation for 208 on my repo at github.com/elbamos/Zeppelin-With-R
>>>>
>>>>> On Feb 23, 2016, at 12:03 PM, Jeff Steinmetz <je...@gmail.com> wrote:
>>>>>
>>>>> Hello zeppelin dev group,
>>>>>
>>>>> Regarding the R Interpreter Pull requests 208 and 702.  I am trying to figure out if the functionality between these are overlapping, or one supports something different than the other.  Is 702 a super set of 208 (702 is a fork of 208)?
>>>>>
>>>>> Can you pass the reference of a distributed (parallelized) dataframe built in %spark (scala) to the R interpreter?   Similar to z.put(“myDF", myDF)?
>>>>>
>>>>> Similarly, since R doesn’t support serialization of functions (unless you use something from the SparkR library) is there an example of collecting the parallel DF to a local DF (which I realize it means the dataset needs to fit in local memory on the zeppelin server).
>>>>>
>>>>> I can to dig into this a bit and help out where appropriate, however its unclear which PR to focus my efforts on.
>>>>>
>>>>> Best,
>>>>> Jeff Steinmetz
>>>>> Principal Architect
>>>>> Akili Interactive Labs
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> On 2/23/16, 8:01 AM, "elbamos" <gi...@git.apache.org> wrote:
>>>>>>
>>>>>> Github user elbamos commented on the pull request:
>>>>>>
>>>>>>    https://github.com/apache/incubator-zeppelin/pull/702#issuecomment-187764059
>>>>>>
>>>>>>    @btiernay support for that has been in 208 all along...
>>>>>>
>>>>>>> On Feb 23, 2016, at 9:27 AM, Bob Tiernay <no...@github.com> wrote:
>>>>>>>
>>>>>>> @echarles This is great! Thanks for all your hard work. Very much appreciated!
>>>>>>>
>>>>>>> --
>>>>>>> Reply to this email directly or view it on GitHub.
>>>>>>
>>>>>>
>>>>>>
>>>>>> ---
>>>>>> If your project is set up for it, you can reply to this email and have your
>>>>>> reply appear on GitHub as well. If your project does not have this feature
>>>>>> enabled and wishes so, or if the feature is enabled but not working, please
>>>>>> contact infrastructure at infrastructure@apache.org or file a JIRA ticket
>>>>>> with INFRA.
>>>>>> ---
>>>

Re: R and SparkR Support

Posted by "Amos B. Elberg" <am...@gmail.com>.
Eric, they're not equivalent. 208 continues to have functionality 702 doesn't, including the display system.

I'm not going to tell you what you're doing wrong in your implementation and "test" of 208, because the users don't seem to have the same confusion, and I've essentially been guiding your development process by pointing out the issues. 

All three of the issues you raise were addressed already in other threads:

1. The proposed approach to rscala actually introduces maintenance issues that have already broken 702. 702 was then revised to work around that, by distributing part of rscala in binary form. But the workaround doesn't deal with the issue of R users updating their own installations, and it eliminates the purported benefit of the approach.

2. This is purely cosmetic. 208 is outside the spark module because it made development, testing and merging cleaner.

3. 208 has supported the HTML, TABLE and IMG display system all along, in an R-consistent manner. 702 originally did not support any of it. After I pointed out the gap and users complained, 702 was revised to implement it partially. 702 still does not support it fully. That's why the user questions about this all get asked on 702 - the people using 208 don't need to ask about it, because it works as expected.

> On Feb 23, 2016, at 1:20 PM, Eric Charles <er...@apache.org> wrote:
> 
> It would make no sense merging both.
> 
> From an end-user perspective, I guess both are equivalent, although with the last commit I made, the Zeppelin Display system is supported in 702 (I had no luck when testing this functionality with 208). As I said, feel free to test both and send feature requests.
> 
> From a developer perspective, I will reiterate the points I sent on [1] which are addressed in 702 (these points make sense to me but didn't receive echo so far - would like to get feedback on these):
> 
> 1.- Use rscala jar instead of forking -> allows to support the platform version (scala version...) and benefit from the rscala project new versions with patches without having to maintain in the zeppelin source tree fork.
> 
> 2.- Just like Python, develop R in the Spark module
> 
> 3.- Support the same behavior asthe rest (no TABLE when output is a dataframe, support the HTML, TABLE and IMG display system, support the Dynamic Form system).
> 
> I still have the Dynamic Form system operational.
> 
> [1] http://mail-archives.apache.org/mod_mbox/incubator-zeppelin-dev/201512.mbox/%3C5683E471.9010001%40apache.org%3E
> 
>> On 23/02/16 19:09, Jeff Steinmetz wrote:
>> Thank you Amos Elberg & Eric Charles:
>> Is the goal of the community to merge both 208 and 702 at some point as two “different” R interpreters?
>> 
>> One that is
>>   %r
>> And another that is
>>   %spark.r
>> 
>> Still trying to wrap my head around the difference.
>> 
>> 
>> 
>> 
>>> On 2/23/16, 9:34 AM, "Amos B. Elberg" <am...@gmail.com> wrote:
>>> 
>>> Jeff - 702 isn't a fork, it's an alternative based on 208 that has a subset of 208's features.  208 is the superset. 208 is also what the community is now attempting to integrate.
>>> 
>>> R does support serialization of functions.
>>> 
>>> 208 does support passing a spark table back and forth between R and scala. Passing a data.frame through the Zeppelin context will fail in spark up to 1.5. It may now be working for some data frames in 1.6.
>>> 
>>> There are examples that do all these things in the documentation for 208 on my repo at github.com/elbamos/Zeppelin-With-R
>>> 
>>>> On Feb 23, 2016, at 12:03 PM, Jeff Steinmetz <je...@gmail.com> wrote:
>>>> 
>>>> Hello zeppelin dev group,
>>>> 
>>>> Regarding the R Interpreter Pull requests 208 and 702.  I am trying to figure out if the functionality between these are overlapping, or one supports something different than the other.  Is 702 a super set of 208 (702 is a fork of 208)?
>>>> 
>>>> Can you pass the reference of a distributed (parallelized) dataframe built in %spark (scala) to the R interpreter?   Similar to z.put(“myDF", myDF)?
>>>> 
>>>> Similarly, since R doesn’t support serialization of functions (unless you use something from the SparkR library) is there an example of collecting the parallel DF to a local DF (which I realize it means the dataset needs to fit in local memory on the zeppelin server).
>>>> 
>>>> I can to dig into this a bit and help out where appropriate, however its unclear which PR to focus my efforts on.
>>>> 
>>>> Best,
>>>> Jeff Steinmetz
>>>> Principal Architect
>>>> Akili Interactive Labs
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> On 2/23/16, 8:01 AM, "elbamos" <gi...@git.apache.org> wrote:
>>>>> 
>>>>> Github user elbamos commented on the pull request:
>>>>> 
>>>>>   https://github.com/apache/incubator-zeppelin/pull/702#issuecomment-187764059
>>>>> 
>>>>>   @btiernay support for that has been in 208 all along...
>>>>> 
>>>>>> On Feb 23, 2016, at 9:27 AM, Bob Tiernay <no...@github.com> wrote:
>>>>>> 
>>>>>> @echarles This is great! Thanks for all your hard work. Very much appreciated!
>>>>>> 
>>>>>> --
>>>>>> Reply to this email directly or view it on GitHub.
>>>>> 
>>>>> 
>>>>> 
>>>>> ---
>>>>> If your project is set up for it, you can reply to this email and have your
>>>>> reply appear on GitHub as well. If your project does not have this feature
>>>>> enabled and wishes so, or if the feature is enabled but not working, please
>>>>> contact infrastructure at infrastructure@apache.org or file a JIRA ticket
>>>>> with INFRA.
>>>>> ---
>> 

Re: R and SparkR Support

Posted by Eric Charles <er...@apache.org>.
It would make no sense to merge both.

From an end-user perspective, I guess both are equivalent, although 
with the last commit I made, the Zeppelin Display system is supported in 
702 (I had no luck when testing this functionality with 208). As I said, 
feel free to test both and send feature requests.

From a developer perspective, I will reiterate the points I sent in [1], 
which are addressed in 702 (these points make sense to me but haven't 
received much echo so far; I would like to get feedback on them):

1.- Use the rscala jar instead of forking -> this allows supporting the 
platform version (Scala version...) and benefiting from new rscala 
releases with patches, without having to maintain a fork in the Zeppelin 
source tree.

2.- Just like Python, develop R in the Spark module

3.- Support the same behavior as the rest (no TABLE when output is a 
dataframe, support the HTML, TABLE and IMG display system, support the 
Dynamic Form system); see the sketch below.

I still have the Dynamic Form system operational.
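
To make point 3 concrete: Zeppelin's display system is driven by the output prefix, so anything a paragraph emits that starts with %html, %table or %img is rendered by the front end instead of shown as plain text. A small sketch of what that looks like from R (the values are made up; how each PR hooks R output into this mechanism is exactly what is being discussed in this thread):

    %r
    print("%html <h3>Model summary</h3>")

    %r
    # %table expects tab-separated columns and newline-separated rows
    cat("%table name\tage\n", "alice\t34\n", "bob\t29\n", sep = "")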

[1] 
http://mail-archives.apache.org/mod_mbox/incubator-zeppelin-dev/201512.mbox/%3C5683E471.9010001%40apache.org%3E

On 23/02/16 19:09, Jeff Steinmetz wrote:
> Thank you Amos Elberg & Eric Charles:
> Is the goal of the community to merge both 208 and 702 at some point as two “different” R interpreters?
>
> One that is
>    %r
> And another that is
>    %spark.r
>
> Still trying to wrap my head around the difference.
>
>
>
>
> On 2/23/16, 9:34 AM, "Amos B. Elberg" <am...@gmail.com> wrote:
>
>> Jeff - 702 isn't a fork, it's an alternative based on 208 that has a subset of 208's features.  208 is the superset. 208 is also what the community is now attempting to integrate.
>>
>> R does support serialization of functions.
>>
>> 208 does support passing a spark table back and forth between R and scala. Passing a data.frame through the Zeppelin context will fail in spark up to 1.5. It may now be working for some data frames in 1.6.
>>
>> There are examples that do all these things in the documentation for 208 on my repo at github.com/elbamos/Zeppelin-With-R
>>
>>> On Feb 23, 2016, at 12:03 PM, Jeff Steinmetz <je...@gmail.com> wrote:
>>>
>>> Hello zeppelin dev group,
>>>
>>> Regarding the R Interpreter Pull requests 208 and 702.  I am trying to figure out if the functionality between these are overlapping, or one supports something different than the other.  Is 702 a super set of 208 (702 is a fork of 208)?
>>>
>>> Can you pass the reference of a distributed (parallelized) dataframe built in %spark (scala) to the R interpreter?   Similar to z.put(“myDF", myDF)?
>>>
>>> Similarly, since R doesn’t support serialization of functions (unless you use something from the SparkR library) is there an example of collecting the parallel DF to a local DF (which I realize it means the dataset needs to fit in local memory on the zeppelin server).
>>>
>>> I can to dig into this a bit and help out where appropriate, however its unclear which PR to focus my efforts on.
>>>
>>> Best,
>>> Jeff Steinmetz
>>> Principal Architect
>>> Akili Interactive Labs
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>> On 2/23/16, 8:01 AM, "elbamos" <gi...@git.apache.org> wrote:
>>>>
>>>> Github user elbamos commented on the pull request:
>>>>
>>>>    https://github.com/apache/incubator-zeppelin/pull/702#issuecomment-187764059
>>>>
>>>>    @btiernay support for that has been in 208 all along...
>>>>
>>>>> On Feb 23, 2016, at 9:27 AM, Bob Tiernay <no...@github.com> wrote:
>>>>>
>>>>> @echarles This is great! Thanks for all your hard work. Very much appreciated!
>>>>>
>>>>> --
>>>>> Reply to this email directly or view it on GitHub.
>>>>
>>>>
>>>>
>>>> ---
>>>> If your project is set up for it, you can reply to this email and have your
>>>> reply appear on GitHub as well. If your project does not have this feature
>>>> enabled and wishes so, or if the feature is enabled but not working, please
>>>> contact infrastructure at infrastructure@apache.org or file a JIRA ticket
>>>> with INFRA.
>>>> ---
>>>
>

Re: R and SparkR Support

Posted by "Amos B. Elberg" <am...@gmail.com>.
The community is working toward merging 208. %spark.r and %r are the same thing - just two different ways Zeppelin lets you identify an interpreter. 

> On Feb 23, 2016, at 1:09 PM, Jeff Steinmetz <je...@gmail.com> wrote:
> 
> Thank you Amos Elberg & Eric Charles:
> Is the goal of the community to merge both 208 and 702 at some point as two “different” R interpreters?
> 
> One that is
>  %r
> And another that is
>  %spark.r
> 
> Still trying to wrap my head around the difference.
> 
> 
> 
> 
>> On 2/23/16, 9:34 AM, "Amos B. Elberg" <am...@gmail.com> wrote:
>> 
>> Jeff - 702 isn't a fork, it's an alternative based on 208 that has a subset of 208's features.  208 is the superset. 208 is also what the community is now attempting to integrate.
>> 
>> R does support serialization of functions. 
>> 
>> 208 does support passing a spark table back and forth between R and scala. Passing a data.frame through the Zeppelin context will fail in spark up to 1.5. It may now be working for some data frames in 1.6.
>> 
>> There are examples that do all these things in the documentation for 208 on my repo at github.com/elbamos/Zeppelin-With-R 
>> 
>>> On Feb 23, 2016, at 12:03 PM, Jeff Steinmetz <je...@gmail.com> wrote:
>>> 
>>> Hello zeppelin dev group,
>>> 
>>> Regarding the R Interpreter Pull requests 208 and 702.  I am trying to figure out if the functionality between these are overlapping, or one supports something different than the other.  Is 702 a super set of 208 (702 is a fork of 208)?
>>> 
>>> Can you pass the reference of a distributed (parallelized) dataframe built in %spark (scala) to the R interpreter?   Similar to z.put(“myDF", myDF)?
>>> 
>>> Similarly, since R doesn’t support serialization of functions (unless you use something from the SparkR library) is there an example of collecting the parallel DF to a local DF (which I realize it means the dataset needs to fit in local memory on the zeppelin server).
>>> 
>>> I can to dig into this a bit and help out where appropriate, however its unclear which PR to focus my efforts on.
>>> 
>>> Best,
>>> Jeff Steinmetz
>>> Principal Architect
>>> Akili Interactive Labs
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>>> On 2/23/16, 8:01 AM, "elbamos" <gi...@git.apache.org> wrote:
>>>> 
>>>> Github user elbamos commented on the pull request:
>>>> 
>>>>  https://github.com/apache/incubator-zeppelin/pull/702#issuecomment-187764059
>>>> 
>>>>  @btiernay support for that has been in 208 all along... 
>>>> 
>>>>> On Feb 23, 2016, at 9:27 AM, Bob Tiernay <no...@github.com> wrote:
>>>>> 
>>>>> @echarles This is great! Thanks for all your hard work. Very much appreciated!
>>>>> 
>>>>> --
>>>>> Reply to this email directly or view it on GitHub.
>>>> 
>>>> 
>>>> 
>>>> ---
>>>> If your project is set up for it, you can reply to this email and have your
>>>> reply appear on GitHub as well. If your project does not have this feature
>>>> enabled and wishes so, or if the feature is enabled but not working, please
>>>> contact infrastructure at infrastructure@apache.org or file a JIRA ticket
>>>> with INFRA.
>>>> ---
> 

Re: R and SparkR Support

Posted by Jeff Steinmetz <je...@gmail.com>.
Thank you Amos Elberg & Eric Charles:
Is the goal of the community to merge both 208 and 702 at some point as two “different” R interpreters?

One that is
  %r
And another that is
  %spark.r

Still trying to wrap my head around the difference.




On 2/23/16, 9:34 AM, "Amos B. Elberg" <am...@gmail.com> wrote:

>Jeff - 702 isn't a fork, it's an alternative based on 208 that has a subset of 208's features.  208 is the superset. 208 is also what the community is now attempting to integrate.
>
>R does support serialization of functions. 
>
>208 does support passing a spark table back and forth between R and scala. Passing a data.frame through the Zeppelin context will fail in spark up to 1.5. It may now be working for some data frames in 1.6.
>
>There are examples that do all these things in the documentation for 208 on my repo at github.com/elbamos/Zeppelin-With-R 
>
>> On Feb 23, 2016, at 12:03 PM, Jeff Steinmetz <je...@gmail.com> wrote:
>> 
>> Hello zeppelin dev group,
>> 
>> Regarding the R Interpreter Pull requests 208 and 702.  I am trying to figure out if the functionality between these are overlapping, or one supports something different than the other.  Is 702 a super set of 208 (702 is a fork of 208)?
>> 
>> Can you pass the reference of a distributed (parallelized) dataframe built in %spark (scala) to the R interpreter?   Similar to z.put(“myDF", myDF)?
>> 
>> Similarly, since R doesn’t support serialization of functions (unless you use something from the SparkR library) is there an example of collecting the parallel DF to a local DF (which I realize it means the dataset needs to fit in local memory on the zeppelin server).
>> 
>> I can to dig into this a bit and help out where appropriate, however its unclear which PR to focus my efforts on.
>> 
>> Best,
>> Jeff Steinmetz
>> Principal Architect
>> Akili Interactive Labs
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>>> On 2/23/16, 8:01 AM, "elbamos" <gi...@git.apache.org> wrote:
>>> 
>>> Github user elbamos commented on the pull request:
>>> 
>>>   https://github.com/apache/incubator-zeppelin/pull/702#issuecomment-187764059
>>> 
>>>   @btiernay support for that has been in 208 all along... 
>>> 
>>>> On Feb 23, 2016, at 9:27 AM, Bob Tiernay <no...@github.com> wrote:
>>>> 
>>>> @echarles This is great! Thanks for all your hard work. Very much appreciated!
>>>> 
>>>> --
>>>> Reply to this email directly or view it on GitHub.
>>> 
>>> 
>>> 
>>> ---
>>> If your project is set up for it, you can reply to this email and have your
>>> reply appear on GitHub as well. If your project does not have this feature
>>> enabled and wishes so, or if the feature is enabled but not working, please
>>> contact infrastructure at infrastructure@apache.org or file a JIRA ticket
>>> with INFRA.
>>> ---
>> 


Re: R and SparkR Support

Posted by "Amos B. Elberg" <am...@gmail.com>.
Jeff - 702 isn't a fork, it's an alternative based on 208 that has a subset of 208's features.  208 is the superset. 208 is also what the community is now attempting to integrate.

R does support serialization of functions. 
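For what it's worth, base R can serialize a closure with serialize()/unserialize(); a quick illustration outside of Spark:

    f <- function(x) x^2
    bytes <- serialize(f, connection = NULL)   # serialize the closure to a raw vector
    g <- unserialize(bytes)
    g(3)                                       # returns 9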

208 does support passing a spark table back and forth between R and scala. Passing a data.frame through the Zeppelin context will fail in spark up to 1.5. It may now be working for some data frames in 1.6.

There are examples that do all these things in the documentation for 208 on my repo at github.com/elbamos/Zeppelin-With-R 

> On Feb 23, 2016, at 12:03 PM, Jeff Steinmetz <je...@gmail.com> wrote:
> 
> Hello zeppelin dev group,
> 
> Regarding the R Interpreter Pull requests 208 and 702.  I am trying to figure out if the functionality between these are overlapping, or one supports something different than the other.  Is 702 a super set of 208 (702 is a fork of 208)?
> 
> Can you pass the reference of a distributed (parallelized) dataframe built in %spark (scala) to the R interpreter?   Similar to z.put(“myDF", myDF)?
> 
> Similarly, since R doesn’t support serialization of functions (unless you use something from the SparkR library) is there an example of collecting the parallel DF to a local DF (which I realize it means the dataset needs to fit in local memory on the zeppelin server).
> 
> I can to dig into this a bit and help out where appropriate, however its unclear which PR to focus my efforts on.
> 
> Best,
> Jeff Steinmetz
> Principal Architect
> Akili Interactive Labs
> 
> 
> 
> 
> 
> 
> 
>> On 2/23/16, 8:01 AM, "elbamos" <gi...@git.apache.org> wrote:
>> 
>> Github user elbamos commented on the pull request:
>> 
>>   https://github.com/apache/incubator-zeppelin/pull/702#issuecomment-187764059
>> 
>>   @btiernay support for that has been in 208 all along... 
>> 
>>> On Feb 23, 2016, at 9:27 AM, Bob Tiernay <no...@github.com> wrote:
>>> 
>>> @echarles This is great! Thanks for all your hard work. Very much appreciated!
>>> 
>>> --
>>> Reply to this email directly or view it on GitHub.
>> 
>> 
>> 
>> ---
>> If your project is set up for it, you can reply to this email and have your
>> reply appear on GitHub as well. If your project does not have this feature
>> enabled and wishes so, or if the feature is enabled but not working, please
>> contact infrastructure at infrastructure@apache.org or file a JIRA ticket
>> with INFRA.
>> ---
>