Posted to dev@zeppelin.apache.org by cloverhearts <gi...@git.apache.org> on 2016/12/23 07:37:17 UTC

[GitHub] zeppelin pull request #1799: [ZEPPELIN-1165 : WIP] Code-based job workflow

GitHub user cloverhearts opened a pull request:

    https://github.com/apache/zeppelin/pull/1799

    [ZEPPELIN-1165 : WIP] Code-based job workflow

    ### What is this PR for?
    
    Code-based workflow (**work in progress**).
    
    This is a re-implementation of this PR:
    https://github.com/apache/zeppelin/pull/1176
    
    
    
    Workflow processing feature.
    (Paragraphs can be run consecutively, so that each one runs only after the previous one has succeeded.)
    #### Case 1
    
    Through a dynamic form, you can control the order in which paragraphs are executed.
    This differs from the traditional method.
    Please see the following flowchart.
    ![workflowdynamicformcontrol](https://cloud.githubusercontent.com/assets/10525473/16791726/a4b96ff0-48fc-11e6-8e23-9ec577066bb7.png)
    #### Case 2
    
    In general, when you want to run multiple paragraphs, you run the entire note.
    This is a good way to run many paragraphs contained in a note.
    However, a problem occurs when the paragraphs use different interpreters.
    ![notebook_example](https://cloud.githubusercontent.com/assets/10525473/16803203/175ad01a-4940-11e6-8949-72d0c49bdf9e.png)
    When the paragraphs in a note each use a different interpreter, they are started in sequence but all finish at different times.
    
    ![normal notebook run](https://cloud.githubusercontent.com/assets/10525473/16803193/069c4d94-4940-11e6-9293-888b6c6288a0.png)
    For example, Markdown is a very fast interpreter.
    Its paragraphs complete very quickly.
    This is a problem when paragraphs must be executed sequentially.
    
    ![workflow run](https://cloud.githubusercontent.com/assets/10525473/16803192/06998122-4940-11e6-8f01-43cdf64f2eef.png)
    This feature guarantees a fixed execution order for the paragraphs of a notebook, whichever interpreter each one uses.
    #### Case 3
    
    For concurrent jobs in the workflow ...
    
    ![job_repl](https://cloud.githubusercontent.com/assets/10525473/16828860/96e601b8-49ce-11e6-87d0-6f7fc30ce751.png)
    
    In the current design, jobs that are meant to run at the same time share the results of the job, as shown above.
    But when jobs need to run at the same time, each one is subject to its own execution flow.
    
    *\* The following paragraph is executed only if the preceding paragraph's result is a success.*
    ### What type of PR is it?
    
    Improvement
    ### jira
    https://issues.apache.org/jira/browse/ZEPPELIN-1165
    
    ### Questions:
    - Does the licenses files need update? no
    - Is there breaking changes for older versions? no
    - Does this needs documentation? yes


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloverhearts/zeppelin ZEPPELIN-workflow

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/zeppelin/pull/1799.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1799
    
----
commit 84daa37f132a968ca381c3d86b6f74e834256e26
Author: cloverhearts <cl...@gmail.com>
Date:   2016-12-23T07:31:10Z

    added remote work job status class and get method on interface

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] zeppelin issue #1799: [ZEPPELIN-1165 : WIP] Code-based job workflow

Posted by zjffdu <gi...@git.apache.org>.
Github user zjffdu commented on the issue:

    https://github.com/apache/zeppelin/pull/1799
  
    @cloverhearts What I mean is that code like the following would be called many times by users:
    ```
    if (z.getZeppelinJobStatus("execute note id", "execute paragraph id").getJobStatus().isFinished() == true)
    { z.run("execute note id", "execute paragraph id") }
    ```
    It is just like a code template, so what I suggest is that we create a high-level workflow framework which uses these APIs internally. Users would then only need to specify the dependencies between paragraphs using this framework; they would not need to check the job status as in the code above.
     


---

[GitHub] zeppelin issue #1799: [ZEPPELIN-1165 : WIP] Code-based job workflow

Posted by cloverhearts <gi...@git.apache.org>.
Github user cloverhearts commented on the issue:

    https://github.com/apache/zeppelin/pull/1799
  
    
    Yes, apart from workflow, this feature is essential. (Get paragraph status)
    I want to split `getZeppelinJobStatus()` out into a separate PR, and I want to improve the workflow based on the feedback gathered here.
    And many Zeppelin users seem to want to work with a DAG type workflow outside of the interpreter.
    I will put your opinions on this together and present a new alternative to this PR.
    
    And we will separate the functions related to the workflow into other PRs.
    
    For example, getting paragraph status, deleting paragraph output.
    
    Thank you very much for your opinion.


---

[GitHub] zeppelin issue #1799: [ZEPPELIN-1165 : WIP] Code-based job workflow

Posted by rasehorn <gi...@git.apache.org>.
Github user rasehorn commented on the issue:

    https://github.com/apache/zeppelin/pull/1799
  
    @cloverhearts 
    I think a picture and some pseudocode tell more than a thousand words, so I created one.
    
    Also: I'm only talking about the use case of ensuring a certain sequence of paragraph executions when runAll is called for the notebook. If you explicitly call z.run(paragraphId) within a certain notebook after runAll() was called, you will probably execute those paragraphs twice.
    
    The easiest way to ensure a certain sequence of paragraph execution after runAll() was issued is to make the paragraphs wait for the one they depend on to finish. 
    
    Let's say we have three paragraphs. 
    The first one is necessary to prepare the data and define temporary tables. The second and third paragraphs depend on that data, so it does not make sense to execute them before paragraph 1 has finished.
    Since the last two paragraphs are in status "running" and wait in parallel for the first paragraph to finish, they will be executed in parallel.
    
    Please see the picture 
    ![wait pseudocode](https://cloud.githubusercontent.com/assets/22585000/21642718/903ba9be-d284-11e6-8efb-958adca7861a.jpg)
    
    From my point of view this would be the easiest way for a Zeppelin user to ensure a certain sequence of paragraph execution, including control over which paragraphs are executed in parallel. 
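    
    A minimal sketch of the waiting behaviour described above, written with plain Scala `Future`s rather than any Zeppelin API. The names `WaitSketch`, `prepare` and `runAll` are made up for illustration, and the linked picture is not reproduced here. Paragraph 1 prepares the data, and paragraphs 2 and 3 each wait for it and are then free to run in parallel.
    
    ```scala
    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration._
    
    object WaitSketch {
      // Paragraph 1: prepare the data (hypothetical stand-in for real work).
      def prepare(): List[Int] = List(3, 1, 2)
    
      def runAll(): (Int, Int) = {
        val p1 = Future(prepare())
        // Paragraphs 2 and 3 both wait for p1, then run independently of each other.
        val p2 = p1.map(data => data.sum)
        val p3 = p1.map(data => data.max)
        // If p1 fails, p2 and p3 fail as well and their bodies never run.
        Await.result(p2.zip(p3), 10.seconds)
      }
    }
    ```
    
    With this shape, a failure in the first step propagates to the dependent steps, which is one possible reading of "wait for the one they depend on".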


---

[GitHub] zeppelin issue #1799: [ZEPPELIN-1165 : WIP] Code-based job workflow

Posted by Leemoonsoo <gi...@git.apache.org>.
Github user Leemoonsoo commented on the issue:

    https://github.com/apache/zeppelin/pull/1799
  
    ```
    z.getZeppelinJobStatus("execute note id", "execute paragraph id").getJobStatus()
    ```
    
    How about not repeating `Job` and `Status`, and omitting `Zeppelin` (since `z.` already represents Zeppelin), in the method name?
    i.e. something like
    
    ```
    z.getJob("note id", "paragraph id").getStatus()
    ```
    
    or just
    
    ```
    z.getJobStatus("note id", "paragraph id")
    ```


---

[GitHub] zeppelin issue #1799: [ZEPPELIN-1165 : WIP] Code-based job workflow

Posted by zjffdu <gi...@git.apache.org>.
Github user zjffdu commented on the issue:

    https://github.com/apache/zeppelin/pull/1799
  
    Thanks @cloverhearts. After reading #1176, I understand that this PR is the first phase of this feature (implementing the low-level API for workflow). Is that correct?


---

[GitHub] zeppelin issue #1799: [ZEPPELIN-1165 : WIP] Code-based job workflow

Posted by cloverhearts <gi...@git.apache.org>.
Github user cloverhearts commented on the issue:

    https://github.com/apache/zeppelin/pull/1799
  
    @zjffdu 
    Yes you are right.



---

[GitHub] zeppelin issue #1799: [ZEPPELIN-1165 : WIP] Code-based job workflow

Posted by rasehorn <gi...@git.apache.org>.
Github user rasehorn commented on the issue:

    https://github.com/apache/zeppelin/pull/1799
  
    From my point of view this kind of functionality shall be provided by the core framework. 
    I have not created many notebooks, but what I've always done is create one paragraph after the other to separate data preparation from processing and visualization. So for the approach I apply it would be sufficient to execute the paragraphs in the sequence they are ordered in the notebook, and this should be the default behaviour. 
    To support control over parallel execution of paragraphs, it would be sufficient from my point of view to have a flag on each paragraph telling whether this paragraph could be executed in parallel, so that all subsequent paragraphs (by their order within the notebook, not their ID) having this flag set could also be executed in parallel. 
    
    This is a way of defining the paragraph execution workflow implicitly, without the need to program it explicitly.
    But again: I'm not a power user. :-)
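    
    One way to read the flag idea above, as a rough sketch only: walk the paragraphs in notebook order and let consecutive flagged paragraphs share a stage, while an unflagged paragraph gets a stage of its own. The `Paragraph` case class, its `parallel` field and the `stages` function are assumptions for illustration, not Zeppelin types.
    
    ```scala
    object ParallelFlagSketch {
      // Hypothetical model: a paragraph name plus a "may run in parallel" flag.
      final case class Paragraph(name: String, parallel: Boolean)
    
      // Group paragraphs, in notebook order, into stages.
      // Consecutive flagged paragraphs share one stage; an unflagged one stands alone.
      def stages(paragraphs: List[Paragraph]): List[List[String]] =
        paragraphs.foldLeft(List.empty[(Boolean, List[String])]) {
          case ((true, names) :: rest, p) if p.parallel =>
            (true, p.name :: names) :: rest
          case (acc, p) =>
            (p.parallel, List(p.name)) :: acc
        }.reverse.map(_._2.reverse)
    }
    ```
    
    Stages would then run one after another, and the paragraphs inside a stage could be submitted together.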


---

[GitHub] zeppelin pull request #1799: [ZEPPELIN-1165 : WIP] Code-based job workflow

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/zeppelin/pull/1799


---

[GitHub] zeppelin issue #1799: [ZEPPELIN-1165 : WIP] Code-based job workflow

Posted by cloverhearts <gi...@git.apache.org>.
Github user cloverhearts commented on the issue:

    https://github.com/apache/zeppelin/pull/1799
  
    I created a new issue on JIRA:
    https://issues.apache.org/jira/browse/ZEPPELIN-1886
    



---

[GitHub] zeppelin issue #1799: [ZEPPELIN-1165 : WIP] Code-based job workflow

Posted by cloverhearts <gi...@git.apache.org>.
Github user cloverhearts commented on the issue:

    https://github.com/apache/zeppelin/pull/1799
  
    @Leemoonsoo 
    Yes, that seems good. I will make a new change.


---

[GitHub] zeppelin issue #1799: [ZEPPELIN-1165 : WIP] Code-based job workflow

Posted by cloverhearts <gi...@git.apache.org>.
Github user cloverhearts commented on the issue:

    https://github.com/apache/zeppelin/pull/1799
  
    @rasehorn @zjffdu 
    Thank you very much!
    I understand the `wait` function now.
    I will try to organize it again based on your opinion.
    Thank you for your kind comments.


---

[GitHub] zeppelin issue #1799: [ZEPPELIN-1165 : WIP] Code-based job workflow

Posted by rasehorn <gi...@git.apache.org>.
Github user rasehorn commented on the issue:

    https://github.com/apache/zeppelin/pull/1799
  
    As far as I remember from another discussion, the paragraph IDs will change if you export/import or copy a notebook (I am not sure which one applies). If that is the case, the workflow will be broken after import. If the user in front of the screen is not familiar with the code and logic of the notebook, it might be difficult to fix.  
    
    What about a simple "z.wait(ordernumber or paragraphId)" function which makes the paragraph wait for the paragraph referenced by the ordernumber or id to finish successfully or cancel the paragraph execution in case of an error? 
    
    This way all paragraphs without z.wait will be executed in parallel and those calling z.wait would be executed in sequence to the ones they depend on. And additionally this kind of functionality would not be mixed with the job handling on notebook level.
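    
    To illustrate why a reference by order number could survive an import while a reference by ID might not, here is a small sketch. `ParagraphRef`, `ByOrder`, `ById` and `resolve` are hypothetical names, not part of Zeppelin; the note is modelled simply as its paragraph IDs in notebook order.
    
    ```scala
    object ParagraphRefSketch {
      // Hypothetical reference types for a z.wait-style call.
      sealed trait ParagraphRef
      final case class ByOrder(position: Int) extends ParagraphRef // 1-based position in the note
      final case class ById(id: String) extends ParagraphRef
    
      // Resolve a reference against the note's paragraph IDs, listed in notebook order.
      // Returns None when the position is out of range or the ID is unknown.
      def resolve(ref: ParagraphRef, idsInOrder: Vector[String]): Option[String] = ref match {
        case ByOrder(n) => idsInOrder.lift(n - 1)
        case ById(id)   => idsInOrder.find(_ == id)
      }
    }
    ```
    
    If an import assigns fresh IDs, `ById` stops resolving, whereas `ByOrder` still points at the paragraph in the same position.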


---

[GitHub] zeppelin issue #1799: [ZEPPELIN-1165 : WIP] Code-based job workflow

Posted by xiufengliu <gi...@git.apache.org>.
Github user xiufengliu commented on the issue:

    https://github.com/apache/zeppelin/pull/1799
  
    @cloverhearts Is this feature available now? I am really looking forward to it. Thanks!


---

[GitHub] zeppelin issue #1799: [ZEPPELIN-1165 : WIP] Code-based job workflow

Posted by cloverhearts <gi...@git.apache.org>.
Github user cloverhearts commented on the issue:

    https://github.com/apache/zeppelin/pull/1799
  
    @rasehorn 
    Thank you for your opinion.
    I have a question.
    In fact, I am a little confused about the meaning of `wait`.
    It is difficult to understand because it mixes parallel and sequential execution.
    If you put the code in one paragraph, that paragraph's code is executed sequentially from the top.
    If so, execution has to wait at the `z.wait` call, even if the paragraph is declared to run in parallel.
    Which of the following do you mean?
    
    Case 1
    ```
    z.run("paragraph") // parallel run (currently supported)
    ```
    
    ```
    z.runForWait("paragraphID-A") // waiting for job of paragraphID-A, finished or error
    ```
    
    Or
    
    Case 2
    ```
    z.run("paragraphID-A") // parallel run (currently supported)
    z.wait("paragraphID-A") // waiting for job finished or error
    ```
    
    And would you please explain more about `cancel`?
    Does "cancel" mean "cancel the next job"?


---

[GitHub] zeppelin issue #1799: [ZEPPELIN-1165 : WIP] Code-based job workflow

Posted by rasehorn <gi...@git.apache.org>.
Github user rasehorn commented on the issue:

    https://github.com/apache/zeppelin/pull/1799
  
    I'm also a little bit confused about what this PR is really about: the pictures above point to paragraph execution order and control, but the discussion also points to notebook execution workflows. 
    From my point of view, control over paragraph execution within a notebook is something different from defining a workflow for notebook execution, and mixing different features leads to poor design. 
    
    Often paragraphs within notebooks depend on others and therefore need to be executed in a certain order. I feel this kind of paragraph execution control should be handled by the core framework, based on settings for each paragraph within the notebook.
    
    Additionally: In some places within the discussion the implementation of that feature on interpreter level was mentioned. It is not clear to me why the notebook workflow definition feature shall be reimplemented in different interpreters in different ways. Instead the internals of a notebook are of no interest when it is executed within a workflow - all that matters is success or failure and a definition at the workflow level what shall happen in case of a failure. So from my point of view the notebook workflow feature should also be implemented in the core code independently from the different interpreters available.


---

[GitHub] zeppelin issue #1799: [ZEPPELIN-1165 : WIP] Code-based job workflow

Posted by cloverhearts <gi...@git.apache.org>.
Github user cloverhearts commented on the issue:

    https://github.com/apache/zeppelin/pull/1799
  
    @zjffdu 
    Actually, my English is not good.
    If you do not mind, please give me your opinion at any time.
    Thank you :)


---

[GitHub] zeppelin issue #1799: [ZEPPELIN-1165 : WIP] Code-based job workflow

Posted by cloverhearts <gi...@git.apache.org>.
Github user cloverhearts commented on the issue:

    https://github.com/apache/zeppelin/pull/1799
  
    @zjffdu 
    Thank you for your best advice!! :)
    And sorry, the description was missing this.
    Actually, I made this code-based. (The dynamic form has been removed.)
    
    In fact, this feature has a dependency on Spark.
    However, it is designed to be easily re-implemented in other interpreters.
    There are also advantages.
    
    Since calls can be made at any time in the code, we can use them together during analysis or in combination with external libraries.
    (DAG is the same)
    
    By default, this is not a complete implementation of the workflow.
    However, I think this feature provides a basic building block that users can use freely.
    
    
    ```
    // Run the paragraph only once its job is reported as finished.
    if (z.getZeppelinJobStatus("execute note id", "execute paragraph id").getJobStatus().isFinished()) {
      z.run("execute note id", "execute paragraph id")
    }
    ```
    or
    ```
    // Run synchronously and inspect the result.
    val result = z.runSync("execute note id", "execute paragraph id")
    if (result.isFinished) {
      println("job is done")
    }
    ```
    or
    ```
    // Start the paragraph, then poll until it is no longer running.
    z.run("execute note id", "execute paragraph id")
    while (z.getZeppelinJobStatus("execute note id", "execute paragraph id").getJobStatus().isRunning) {
      // busy-wait; the loop body is intentionally empty
    }
    println("next job or done.")
    ```
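    
    The polling loop in the last variant spins without pausing and never gives up. A hedged sketch of a gentler version follows, assuming a hypothetical `JobStatusSource` trait as a stand-in for the status call proposed in this PR; `awaitNotRunning`, `pollMillis` and `timeoutMillis` are illustrative names only.
    
    ```scala
    object PollingSketch {
      // Hypothetical status source; stands in for the proposed job-status call.
      trait JobStatusSource {
        def isRunning(noteId: String, paragraphId: String): Boolean
      }
    
      // Poll until the job stops running or the timeout elapses.
      // Returns true if the job stopped in time, false on timeout.
      def awaitNotRunning(source: JobStatusSource, noteId: String, paragraphId: String,
                          pollMillis: Long = 100L, timeoutMillis: Long = 60000L): Boolean = {
        val deadline = System.currentTimeMillis() + timeoutMillis
        @scala.annotation.tailrec
        def loop(): Boolean =
          if (!source.isRunning(noteId, paragraphId)) true
          else if (System.currentTimeMillis() >= deadline) false
          else { Thread.sleep(pollMillis); loop() }
        loop()
      }
    }
    ```
    
    Sleeping between checks avoids burning a CPU core, and the timeout keeps a stuck job from blocking the caller forever.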
    



---

[GitHub] zeppelin issue #1799: [ZEPPELIN-1165 : WIP] Code-based job workflow

Posted by zjffdu <gi...@git.apache.org>.
Github user zjffdu commented on the issue:

    https://github.com/apache/zeppelin/pull/1799
  
    BTW, in the first phase we can provide the high-level framework to allow users to call it programmatically. In the second phase, it would be better to allow users to do it through drag & drop in the UI. 


---

[GitHub] zeppelin issue #1799: [ZEPPELIN-1165 : WIP] Code-based job workflow

Posted by zjffdu <gi...@git.apache.org>.
Github user zjffdu commented on the issue:

    https://github.com/apache/zeppelin/pull/1799
  
    I agree with @rasehorn that workflow execution should be done in a high-level framework. Users just need to define the workflow (specify the dependencies between paragraphs). I have also pasted an image to illustrate my current idea. In the following screenshot we have 4 paragraphs: paragraph 1 needs to run first, and paragraphs 2, 3 and 4 can run concurrently after paragraph 1. So in each paragraph's top right area, we can allow the user to specify that paragraph's dependencies. Here, paragraph 1 has no dependencies, and paragraphs 2, 3 and 4 depend on paragraph 1. After the workflow is defined (dependencies are specified), we can click the button on the top right of the note to run all the paragraphs in the note. We could also provide a REST API for running the whole note. 
    
    ![image](https://cloud.githubusercontent.com/assets/164491/21643626/594117e2-d2c4-11e6-8c8f-658a05adee3e.png)
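    
    The example above (paragraph 1 first, then 2, 3 and 4 together) can be sketched as a grouping of paragraphs into "waves", where every paragraph in a wave has all of its dependencies in earlier waves. This is only an illustration of the idea; `WaveSketch` and `waves` are made-up names, and the dependencies are modelled as a plain map rather than any Zeppelin structure.
    
    ```scala
    object WaveSketch {
      // deps maps each paragraph to the set of paragraphs it depends on.
      // Returns waves in run order; paragraphs within one wave can run concurrently.
      def waves(deps: Map[String, Set[String]]): List[List[String]] = {
        @scala.annotation.tailrec
        def loop(remaining: Map[String, Set[String]], done: Set[String],
                 acc: List[List[String]]): List[List[String]] =
          if (remaining.isEmpty) acc.reverse
          else {
            // A paragraph is ready once everything it depends on has finished.
            val ready = remaining.keys.filter(p => remaining(p).subsetOf(done)).toList.sorted
            require(ready.nonEmpty, "cyclic or unknown dependency")
            loop(remaining -- ready, done ++ ready, ready :: acc)
          }
        loop(deps, Set.empty, Nil)
      }
    }
    ```
    
    A cycle, or a dependency on a paragraph that is not in the map, leaves nothing ready and is rejected instead of looping forever.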



---

[GitHub] zeppelin issue #1799: [ZEPPELIN-1165 : WIP] Code-based job workflow

Posted by zjffdu <gi...@git.apache.org>.
Github user zjffdu commented on the issue:

    https://github.com/apache/zeppelin/pull/1799
  
    @cloverhearts This is very interesting. I have a few questions:
    1. Do the dynamic forms here mean more control flow (like if conditions and for loops)?
    2. In case 2, if the markdown interpreter paragraph does not depend on the spark interpreter paragraph, we can execute them in parallel rather than sequentially. 
    3. I think the most important thing in a workflow is to define the DAG (the dependencies between paragraphs). Your idea is to run the paragraphs programmatically. Would it be more intuitive to just define the DAG (directed acyclic graph) and let the framework run the DAG automatically? 
    e.g.
    
    ```
    val flow = new JobFlow(noteId)
    val note = z.getNote(noteId)
    val p1 = z.getParagraph(pId1)
    val p2 = z.getParagraph(pId2)
    val p3 = z.getParagraph(pId3)
    // p3 depends on p2, and p2 depends on p1
    p3.addDependency(p2)
    p2.addDependency(p1)
    flow.add(p1).add(p2).add(p3).run()
    ```
    4. Currently we use noteId and paragraphId, but I think these are not readable. We'd better use note name and paragraph name. 
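    
    For illustration, here is one way a `JobFlow` like the one in point 3 could honour the declared dependencies: a depth-first walk that runs every dependency before the paragraph itself and rejects cycles. Everything in this sketch is hypothetical (the `Paragraph` handle, its `body` callback, and this `JobFlow` itself); it only mirrors the shape of the pseudocode and is not an existing Zeppelin class.
    
    ```scala
    import scala.collection.mutable
    
    object JobFlowSketch {
      // Hypothetical paragraph handle; `body` stands in for running the paragraph.
      final class Paragraph(val id: String, body: () => Unit) {
        val dependencies = mutable.ListBuffer.empty[Paragraph]
        def addDependency(p: Paragraph): Unit = dependencies += p
        def execute(): Unit = body()
      }
    
      final class JobFlow {
        private val paragraphs = mutable.ListBuffer.empty[Paragraph]
        def add(p: Paragraph): JobFlow = { paragraphs += p; this }
    
        // Depth-first walk: run each dependency before the paragraph, each paragraph once.
        def run(): Unit = {
          val done = mutable.Set.empty[String]
          val inProgress = mutable.Set.empty[String]
          def visit(p: Paragraph): Unit =
            if (!done(p.id)) {
              // add returns false if the id is already being visited, which means a cycle.
              require(inProgress.add(p.id), s"dependency cycle at ${p.id}")
              p.dependencies.foreach(visit)
              p.execute()
              inProgress -= p.id
              done += p.id
            }
          paragraphs.foreach(visit)
        }
      }
    }
    ```
    
    With this approach the order in which paragraphs are added does not matter; only the declared dependencies decide what runs first.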



---

[GitHub] zeppelin issue #1799: [ZEPPELIN-1165 : WIP] Code-based job workflow

Posted by cloverhearts <gi...@git.apache.org>.
Github user cloverhearts commented on the issue:

    https://github.com/apache/zeppelin/pull/1799
  
    @zjffdu 
    I agree with you.
    But I am a bit cautious about this part.
    In fact, we've re-implemented this functionality in a variety of ways, and we actually implemented it at the parent framework level before. (The former PR.)
    If I re-implement it according to your opinion, it will take a form that combines my previous PR with the current PR.
    I need many people's opinions.
    
    Perhaps committers and Zeppelin users could give me their opinions on this?


---