You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@zeppelin.apache.org by doanduyhai <gi...@git.apache.org> on 2015/05/23 22:36:06 UTC
[GitHub] incubator-zeppelin pull request: Add Scala utility functions for d...
GitHub user doanduyhai opened a pull request:
https://github.com/apache/incubator-zeppelin/pull/80
Add Scala utility functions for display
Until now, to display data as a table, there are 2 alternatives:
1. Either use **Spark DataFrame** and Zeppeline built-in support
2. Or generate manually a `println(%table ...)`. As an example of displaying an `RDD[(String,String,Int)]` representing a collection of users:
```scala
val data = new java.lang.StringBuilder("%table Login\tName\tAge\n")
rdd.foreach {
case (login,name,age) => data.append(s"$login\t$name\t$age\n")
}
println(data.toString())
```
My proposal is to add a new utility function to make creating tables easier that the code example above. Of course one can always use **Spark DataFrame** but I find it quite restrictive. People using Spark versions lesser than 1.3 cannot rely on DataFrame and sometimes one does not want to transform an RDD to DataFrame for display.
How are the utility functions implemented ?
1. I added a new module **spark-utils** which provide Scala code for display utility functions. This module will use the **maven-scala-plugin** to compile all the classes in package `org.apache.zeppelin.spark.utils`.
2. Right now the package `org.apache.zeppelin.spark.utils` only contains 1 object `DisplayUtils` which augments RDDs of Tuples or RDDs of Scala case classes (all of them sub-class of trait `Product`) with the new method `displayAsTable(columnLabels: String*)`.
3. The `DisplayUtils` object is imported automatically into the `SparkInterpreter` with `intp.interpret("import org.apache.zeppelin.spark.utils.DisplayUtils._");`
4. The Maven module **interpreter** will now have a **runtime** dependency on the module **spark-utils** so that the utility class will be loaded at runtime
5. Usage of the new display utility function is:
**Paragraph1**
```scala
case class Person(login: String, name: String, age: Int)
val rddTuples:RDD[(String,String,Int)] = sc.parallelize(List(("jdoe","John DOE",32),("hsue","Helen SUE",27))
val rddCaseClass:RDD[(String,String,Int)] = sc.parallelize(List(Person("jdoe","John DOE",32),Person("hsue","Helen SUE",27))
```
**Paragraph2**
```scala
rddTuples.displayAsTable("Login","Name","Age")
```
**Paragraph3**
```scala
rddCaseClass.displayAsTable("Login","Name","Age")
```
6. The `displayAsTable()` method is error-proof, meaning that if the user provides **more** columns label that the number of elements in the tuples/case class, the extra column labels will ignored. If the user provides **less** column labels than expected, the method will pad missing column headers with **Column2**, **Column3** etc ...
7. In addition to the `displayAsTable` methods, I added some other utility methods to make it easier to handle custom HTML and images:
a. calling `html()` will generate the string `"%html "`
b. calling `html("<p> This is a test</p>)` will generate the string `"%html <p> This is a test</p>"`
c. calling `img("http://www.google.com")` will generate the string `"<img src='http://www.google.com' />"`
d. calling `img64()` will generate the string `"%img "`
e. calling `img64("ABCDE123")` will generate the string `"%img ABCDE123"`
Of course the `DisplayUtils` object can be extended with new other functions to support future advanced displaying features
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/doanduyhai/incubator-zeppelin DisplayUtils
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/incubator-zeppelin/pull/80.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #80
----
commit 5edab7130e70cf9f765dc268648d8ef294251b37
Author: DuyHai DOAN <do...@gmail.com>
Date: 2015-05-23T19:49:33Z
Add new module spark-utils to expose utility functions for display
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] incubator-zeppelin pull request: Add Scala utility functions for d...
Posted by Leemoonsoo <gi...@git.apache.org>.
Github user Leemoonsoo commented on the pull request:
https://github.com/apache/incubator-zeppelin/pull/80#issuecomment-111664894
Great work!
Looks good to me. +1 for merge.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] incubator-zeppelin pull request: Add Scala utility functions for d...
Posted by doanduyhai <gi...@git.apache.org>.
Github user doanduyhai commented on the pull request:
https://github.com/apache/incubator-zeppelin/pull/80#issuecomment-108464392
@Leemoonsoo
Thank you for your interesting questions. Indeed I see 2 main points here:
1. How to guarantee that the end-user do not fetch a massive amount of data from RDD to the Driver programe, e.g. add some mechanism to limit the amount of fetched data for display
2. Normalize the `showRDD()` and `displayAsTable()` code to put them at the same place and not spread over
So, my answers:
1.Default limit for fetched data
For this we can take the same approach as the one taken by `showRdd()`:
```scala
val defaultMaxLimit = 10000
def displayAsTable(maxLimit: Long, columnsLabel: String*) {...}
def displayAsTable(columnsLabel: String*) {
displayAsTable(defaultMaxLimit, columnsLabel)
}
```
The above solution works, but I propose another approach. In my last commit, the `displayAsTable()` method also works for plain **Scala collection** so I suggest to remove the support from RDD. If the end-user want to use `displayAsTable()`, **they must convert their RDD first to a Scala collection**. In some way, we force the end-user to call either `collect()` or `take(n)` **explicitly**.
Example:
```scala
rdd
.map(...)
.filter(...)
.take(30) // User responsibility
.displayAsTable(...)
```
This way, we put the responsibility of collecting data explicitly on the end-user, no surprise, what do you think ?
2. Normalize the `showRDD()` and `displayAsTable()` code
I had a look into the **[ZeppelinContext]** code and basically it is dealing with Scala code in Java. What we can do is:
a. either put the `displayAsTable()` method inside this **`ZeppelinContext`** class and code everything in Java but then we loose the power of Scala implicit conversion. We'll need to do:
```
val collection = rdd.map(...).filter(...).take(10)
z.displayAsTable(collection,"Header1","Header2",...)
```
b. or we port the code of **`ZeppelinContext`** in Scala directly so that both `displayAsTable()` and `showRDD()` are written directly in **Scala**. It would make sense because lots of the code in this class is dealing already with Scala code and it would avoid us converting back and forth between Java and Scala:
I volunteer to port this class from Java to Scala (with unit tests of course). What do you think @Leemoonsoo ?
[ZeppelinContext]: https://github.com/apache/incubator-zeppelin/blob/master/spark/src/main/java/org/apache/zeppelin/spark/ZeppelinContext.java
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] incubator-zeppelin pull request: Add Scala utility functions for d...
Posted by Leemoonsoo <gi...@git.apache.org>.
Github user Leemoonsoo commented on the pull request:
https://github.com/apache/incubator-zeppelin/pull/80#issuecomment-108290927
@doanduyhai
Thanks for great contribution.
There was a simple display help function ZeppelinContext.show(Object o).
How do you want to deal with it? i mean utility function for the same purpose is placed in two different place.
Another thing is, it's very easy to user make mistake that use display function against very large RDD. it's hard to do .collect() large RDD and transfer data to front-end side. ZeppelinContext.show(Object o) does use .take(n) to survive from user mistake. Could you share your experience and opinion about it?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] incubator-zeppelin pull request: Add Scala utility functions for d...
Posted by aseigneurin <gi...@git.apache.org>.
Github user aseigneurin commented on the pull request:
https://github.com/apache/incubator-zeppelin/pull/80#issuecomment-105449050
I guess there is a typo:
val rddCaseClass:RDD[(String,String,Int)] = sc.parallelize(List(Person("jdoe","John DOE",32),Person("hsue","Helen SUE",27))
Should be:
val rddCaseClass:RDD[(Person)] = sc.parallelize(List(Person("jdoe","John DOE",32),Person("hsue","Helen SUE",27))
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] incubator-zeppelin pull request: Add Scala utility functions for d...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/incubator-zeppelin/pull/80
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] incubator-zeppelin pull request: Add Scala utility functions for d...
Posted by doanduyhai <gi...@git.apache.org>.
Github user doanduyhai commented on the pull request:
https://github.com/apache/incubator-zeppelin/pull/80#issuecomment-105576107
Thanks aseigneurin for spotting the typo, I updated the PR comment accordingly
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] incubator-zeppelin pull request: Add Scala utility functions for d...
Posted by doanduyhai <gi...@git.apache.org>.
Github user doanduyhai commented on the pull request:
https://github.com/apache/incubator-zeppelin/pull/80#issuecomment-111465199
Hello @Leemoonsoo
So,
1) I've renamed `displayAsTable()` to `display()` as suggested
2) I've introduce a default limit to RDD ( `rdd.take()`) using the `zeppeline.spark.maxResult` property
3) User can override the default limit to pass in their own limit value in the `display()` method
4) There is no default limit for **Scala** collection
5) I've rebased from master branch
Here is a screen capture showing the new default limit:
![image](https://cloud.githubusercontent.com/assets/1532977/8129319/6fe1f298-1108-11e5-9708-f7891ea52fa1.png)
For code normalization, I would prefer we merge and close this pull request and open a new JIRA to prepare the port of ZeppelinContext class in **Scala** with appropriate unit tests.
Baby steps.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] incubator-zeppelin pull request: Add Scala utility functions for d...
Posted by bzz <gi...@git.apache.org>.
Github user bzz commented on the pull request:
https://github.com/apache/incubator-zeppelin/pull/80#issuecomment-105366771
@doanduyhai thanks for really great contribution with an awesome design docs!
Utility functions that you have implemented are very useful indeed, the only question is: what is the benefit of having a separate sub-module for that, comparing with just adding those to spark interpreter itself?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] incubator-zeppelin pull request: Add Scala utility functions for d...
Posted by Leemoonsoo <gi...@git.apache.org>.
Github user Leemoonsoo commented on the pull request:
https://github.com/apache/incubator-zeppelin/pull/80#issuecomment-110051119
@doanduyhai
Thanks for sharing the idea.
To me, it still make sense to display function support RDD.
That i think make user feeling the similar to DataFrame.show().
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] incubator-zeppelin pull request: Add Scala utility functions for d...
Posted by Leemoonsoo <gi...@git.apache.org>.
Github user Leemoonsoo commented on the pull request:
https://github.com/apache/incubator-zeppelin/pull/80#issuecomment-111033924
@doanduyhai
Also i'd like to know if you're planning to add more commit here for Java to Scala and some improvements. Or if you prefer to merge it first and make new pull request for further improvements.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] incubator-zeppelin pull request: Add Scala utility functions for d...
Posted by doanduyhai <gi...@git.apache.org>.
Github user doanduyhai commented on the pull request:
https://github.com/apache/incubator-zeppelin/pull/80#issuecomment-108090033
Ok, I have updated my commit with:
1. A rebase from master to be up-to-date
2. I also enhanced the code to accept not only RDDs but also **plain Scala collection** of Tuples/case class. So now you'll be able to do:
```scala
List(
Person("jdoe","John DOE",32),
Person("hsue","Helen SUE",27),
Person("rsmith","Richard SMITH",45)
).displayAsTable("Login","Name","Age")
```
It just works with plain **Scala collection**, no need to transform them to an RDD by calling `sc.parallelize(myScalaCollection)`
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] incubator-zeppelin pull request: Add Scala utility functions for d...
Posted by swkimme <gi...@git.apache.org>.
Github user swkimme commented on the pull request:
https://github.com/apache/incubator-zeppelin/pull/80#issuecomment-105708858
Thanks for the great contribution!
I'm +1 for adding new utility module, it seems like a really great way to add some utilities.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] incubator-zeppelin pull request: Add Scala utility functions for d...
Posted by doanduyhai <gi...@git.apache.org>.
Github user doanduyhai commented on the pull request:
https://github.com/apache/incubator-zeppelin/pull/80#issuecomment-107146055
Ok, I re-pushed the commit, by putting Scala code directly into the **spark** module. I also removed the **spark-utils** module and it works (tested locally)
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---