You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@zeppelin.apache.org by doanduyhai <gi...@git.apache.org> on 2015/05/23 22:36:06 UTC

[GitHub] incubator-zeppelin pull request: Add Scala utility functions for d...

GitHub user doanduyhai opened a pull request:

    https://github.com/apache/incubator-zeppelin/pull/80

    Add Scala utility functions for display

    Until now, to display data as a table, there are 2 alternatives:
    
    1. Either use **Spark DataFrame** and Zeppeline built-in support
    2. Or generate manually a `println(%table ...)`. As an example of displaying an `RDD[(String,String,Int)]` representing a collection of users:
    
    ```scala
    val data = new java.lang.StringBuilder("%table Login\tName\tAge\n")
    rdd.foreach {
       case (login,name,age) => data.append(s"$login\t$name\t$age\n")
    }
    
    println(data.toString())
    ```
    
    My proposal is to add a new utility function to make creating tables easier that the code example above. Of course one can always use **Spark DataFrame** but I find it quite restrictive. People using Spark versions lesser than 1.3 cannot rely on DataFrame and sometimes one does not want to transform an RDD to DataFrame for display.
    
    How are the utility functions implemented ?
    
    1. I added a new module **spark-utils** which provide Scala code for display utility functions. This module will use the **maven-scala-plugin** to compile all the classes in package `org.apache.zeppelin.spark.utils`. 
    
    2. Right now the package `org.apache.zeppelin.spark.utils` only contains 1 object `DisplayUtils` which augments RDDs of Tuples or RDDs of Scala case classes (all of them sub-class of trait `Product`) with the new method `displayAsTable(columnLabels: String*)`.
    
    3. The `DisplayUtils` object is imported automatically into the `SparkInterpreter` with `intp.interpret("import org.apache.zeppelin.spark.utils.DisplayUtils._");`
    
    4. The Maven module **interpreter** will now have a **runtime** dependency on the module **spark-utils** so that the utility class will be loaded at runtime
    
    5. Usage of the new display utility function is:
    
        **Paragraph1**
        ```scala
        case class Person(login: String, name: String, age: Int)
        val rddTuples:RDD[(String,String,Int)] = sc.parallelize(List(("jdoe","John DOE",32),("hsue","Helen     SUE",27))
        val rddCaseClass:RDD[(String,String,Int)] = sc.parallelize(List(Person("jdoe","John DOE",32),Person("hsue","Helen SUE",27))
        ```
        **Paragraph2**
        ```scala
        rddTuples.displayAsTable("Login","Name","Age")
        ```
        
        **Paragraph3**
        ```scala
        rddCaseClass.displayAsTable("Login","Name","Age")
        ```
    
    6. The `displayAsTable()` method is error-proof, meaning that if the user provides **more** columns label that the number of elements in the tuples/case class, the extra column labels will ignored. If the user provides **less** column labels than expected, the method will pad missing column headers with **Column2**, **Column3** etc ...
    
    7. In addition to the `displayAsTable` methods, I added some other utility methods to make it easier to handle custom HTML and images:
        a. calling `html()` will generate the string `"%html "`
        b. calling `html("<p> This is a test</p>)` will generate the string `"%html <p> This is a test</p>"`
        c. calling `img("http://www.google.com")` will generate the string `"<img src='http://www.google.com' />"`
       d. calling `img64()` will generate the string `"%img "`
       e. calling `img64("ABCDE123")` will generate the string `"%img ABCDE123"`
    
    Of course the `DisplayUtils` object can be extended with new other functions to support future advanced displaying features


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/doanduyhai/incubator-zeppelin DisplayUtils

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-zeppelin/pull/80.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #80
    
----
commit 5edab7130e70cf9f765dc268648d8ef294251b37
Author: DuyHai DOAN <do...@gmail.com>
Date:   2015-05-23T19:49:33Z

    Add new module spark-utils to expose utility functions for display

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-zeppelin pull request: Add Scala utility functions for d...

Posted by Leemoonsoo <gi...@git.apache.org>.
Github user Leemoonsoo commented on the pull request:

    https://github.com/apache/incubator-zeppelin/pull/80#issuecomment-111664894
  
    Great work!
    Looks good to me. +1 for merge.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-zeppelin pull request: Add Scala utility functions for d...

Posted by doanduyhai <gi...@git.apache.org>.
Github user doanduyhai commented on the pull request:

    https://github.com/apache/incubator-zeppelin/pull/80#issuecomment-108464392
  
    @Leemoonsoo 
    
    Thank you for  your interesting questions. Indeed I see 2 main points here:
    
    1. How to guarantee that the end-user do not fetch a massive amount of data from RDD to the Driver programe, e.g. add some mechanism to limit the amount of fetched data for display
    
    2. Normalize the `showRDD()` and `displayAsTable()` code to put them at the same place and not spread over
    
    
    So, my answers:
    
    1.Default limit for fetched data
    
     For this we can take the same approach as the one taken by `showRdd()`: 
     
    ```scala
      val defaultMaxLimit = 10000 
    
      def displayAsTable(maxLimit: Long, columnsLabel: String*) {...}
    
      def displayAsTable(columnsLabel: String*) {
          displayAsTable(defaultMaxLimit, columnsLabel)
      }
    ```
     The above solution works, but I propose another approach. In my last commit, the `displayAsTable()` method also works for plain **Scala collection** so I suggest to remove the support from RDD. If the end-user want to use `displayAsTable()`, **they must convert their RDD first to a Scala collection**. In some way, we force the end-user to call either `collect()` or `take(n)` **explicitly**.
    
    Example:
    
    ```scala
    rdd
    .map(...)
    .filter(...)
    .take(30) // User responsibility
    .displayAsTable(...)
    ```
    
     This way, we put the responsibility of collecting data explicitly on the end-user, no surprise, what do you think ?
     
    2. Normalize the `showRDD()` and `displayAsTable()` code 
    
    I had a look into the **[ZeppelinContext]** code and basically it is dealing with Scala code in Java. What we can do is:
    
     a. either put the `displayAsTable()` method inside this **`ZeppelinContext`** class and code everything in Java but then we loose the power of Scala implicit conversion. We'll need to do:
    
    ```
    val collection = rdd.map(...).filter(...).take(10)
    z.displayAsTable(collection,"Header1","Header2",...)
    ```
    
     b. or we port the code of **`ZeppelinContext`** in Scala directly so that both `displayAsTable()` and `showRDD()` are written directly in **Scala**. It would make sense because lots of the code in this class is dealing already with Scala code and it would avoid us converting back and forth between Java and Scala:
    
    I volunteer to port this class from Java to Scala (with unit tests of course). What do you think @Leemoonsoo  ?
    
    [ZeppelinContext]: https://github.com/apache/incubator-zeppelin/blob/master/spark/src/main/java/org/apache/zeppelin/spark/ZeppelinContext.java



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-zeppelin pull request: Add Scala utility functions for d...

Posted by Leemoonsoo <gi...@git.apache.org>.
Github user Leemoonsoo commented on the pull request:

    https://github.com/apache/incubator-zeppelin/pull/80#issuecomment-108290927
  
    @doanduyhai 
    Thanks for great contribution.
    
    There was a simple display help function ZeppelinContext.show(Object o).
    How do you want to deal with it? i mean utility function for the same purpose is placed in two different place. 
    
    Another thing is, it's very easy to user make mistake that use display function against very large RDD. it's hard to do .collect() large RDD and transfer data to front-end side. ZeppelinContext.show(Object o) does use .take(n) to survive from user mistake. Could you share your experience and opinion about it?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-zeppelin pull request: Add Scala utility functions for d...

Posted by aseigneurin <gi...@git.apache.org>.
Github user aseigneurin commented on the pull request:

    https://github.com/apache/incubator-zeppelin/pull/80#issuecomment-105449050
  
    I guess there is a typo:
    
        val rddCaseClass:RDD[(String,String,Int)] = sc.parallelize(List(Person("jdoe","John DOE",32),Person("hsue","Helen SUE",27))
    
    Should be:
    
        val rddCaseClass:RDD[(Person)] = sc.parallelize(List(Person("jdoe","John DOE",32),Person("hsue","Helen SUE",27))


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-zeppelin pull request: Add Scala utility functions for d...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/incubator-zeppelin/pull/80


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-zeppelin pull request: Add Scala utility functions for d...

Posted by doanduyhai <gi...@git.apache.org>.
Github user doanduyhai commented on the pull request:

    https://github.com/apache/incubator-zeppelin/pull/80#issuecomment-105576107
  
    Thanks aseigneurin for spotting the typo, I updated the PR comment accordingly


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-zeppelin pull request: Add Scala utility functions for d...

Posted by doanduyhai <gi...@git.apache.org>.
Github user doanduyhai commented on the pull request:

    https://github.com/apache/incubator-zeppelin/pull/80#issuecomment-111465199
  
    Hello @Leemoonsoo 
    
    So,
    
    1) I've renamed `displayAsTable()` to `display()` as suggested
    2) I've introduce a default limit to RDD ( `rdd.take()`) using the `zeppeline.spark.maxResult` property
    3) User can override the default limit to pass in their own limit value in the `display()` method
    4) There is no default limit for **Scala** collection
    5) I've rebased from master branch
    
    Here is a screen capture showing the new default limit: 
    ![image](https://cloud.githubusercontent.com/assets/1532977/8129319/6fe1f298-1108-11e5-9708-f7891ea52fa1.png)
    
    For code normalization, I would prefer we merge and close this pull request and open a new JIRA to prepare the port of ZeppelinContext class in **Scala** with appropriate unit tests.
    
    Baby steps.
    
    
     


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-zeppelin pull request: Add Scala utility functions for d...

Posted by bzz <gi...@git.apache.org>.
Github user bzz commented on the pull request:

    https://github.com/apache/incubator-zeppelin/pull/80#issuecomment-105366771
  
    @doanduyhai thanks for really great contribution with an awesome design docs!
    
    Utility functions that you have implemented are very useful indeed, the only question is: what is the benefit of having a separate sub-module for that, comparing with just adding those to spark interpreter itself?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-zeppelin pull request: Add Scala utility functions for d...

Posted by Leemoonsoo <gi...@git.apache.org>.
Github user Leemoonsoo commented on the pull request:

    https://github.com/apache/incubator-zeppelin/pull/80#issuecomment-110051119
  
    @doanduyhai 
    Thanks for sharing the idea.
    To me, it still make sense to display function support RDD.
    That i think make user feeling the similar to DataFrame.show().
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-zeppelin pull request: Add Scala utility functions for d...

Posted by Leemoonsoo <gi...@git.apache.org>.
Github user Leemoonsoo commented on the pull request:

    https://github.com/apache/incubator-zeppelin/pull/80#issuecomment-111033924
  
    @doanduyhai 
    Also i'd like to know if you're planning to add more commit here for Java to Scala and some improvements. Or if you prefer to merge it first and make new pull request for further improvements.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-zeppelin pull request: Add Scala utility functions for d...

Posted by doanduyhai <gi...@git.apache.org>.
Github user doanduyhai commented on the pull request:

    https://github.com/apache/incubator-zeppelin/pull/80#issuecomment-108090033
  
    Ok, I have updated my commit with:
    
    1. A rebase from master to be up-to-date
    2. I also enhanced the code to accept not only RDDs but also **plain Scala collection** of Tuples/case class. So now you'll be able to do:
    
    ```scala
    List(
        Person("jdoe","John DOE",32),
        Person("hsue","Helen SUE",27),
        Person("rsmith","Richard SMITH",45)
    ).displayAsTable("Login","Name","Age")
    ```
    
    It just works with plain **Scala collection**, no need to transform them to an RDD by calling `sc.parallelize(myScalaCollection)`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-zeppelin pull request: Add Scala utility functions for d...

Posted by swkimme <gi...@git.apache.org>.
Github user swkimme commented on the pull request:

    https://github.com/apache/incubator-zeppelin/pull/80#issuecomment-105708858
  
    Thanks for the great contribution! 
    I'm +1 for adding new utility module, it seems like a really great way to add some utilities.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] incubator-zeppelin pull request: Add Scala utility functions for d...

Posted by doanduyhai <gi...@git.apache.org>.
Github user doanduyhai commented on the pull request:

    https://github.com/apache/incubator-zeppelin/pull/80#issuecomment-107146055
  
    Ok, I re-pushed the commit, by putting Scala code directly into the **spark** module. I also removed the **spark-utils** module and it works (tested locally)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---