Posted to users@zeppelin.apache.org by "Piyush Mukati (Data Platform)" <pi...@flipkart.com> on 2015/08/10 10:20:46 UTC

why zeppelin SparkInterpreter use FIFOScheduler

Hi Moon,
Any suggestions on it? We have to wait a lot when multiple people are
working with Spark.
Can we create a separate instance of SparkILoop, SparkIMain and
PrintStreams for each notebook, while sharing the SparkContext,
ZeppelinContext, SQLContext and DependencyResolver, and then use a
parallel scheduler?
thanks

-piyush


Hi Moon,

How about tracking a dedicated SparkContext for each notebook in Spark's
remote interpreter? This would allow multiple users to run their Spark
paragraphs in parallel, while within a notebook only one paragraph is
executed at a time.

Regards,
-Pranav.


On 15/07/15 7:15 pm, moon soo Lee wrote:
> Hi,
>
> Thanks for asking the question.
>
> The reason is simply that it is running code statements. The
> statements can have order and dependencies. Imagine I have two paragraphs:
>
> %spark
> val a = 1
>
> %spark
> print(a)
>
> If they're not run one by one, they may run in random
> order and the output will differ from run to run: either '1' or
> 'error: not found: value a'.
>
> This is the reason why. But if there is a nice idea to handle this
> problem, I agree that using a parallel scheduler would help a lot.
>
> Thanks,
> moon
> On Tue, Jul 14, 2015 at 7:59 PM linxi zeng
> <linxizeng0615@gmail.com> wrote:
>
>     anyone who has the same question as me? or is this not a question?
>
>     2015-07-14 11:47 GMT+08:00 linxi zeng <linxizeng0615@gmail.com>:
>
>         hi, Moon:
>            I notice that the getScheduler function in
>         SparkInterpreter.java returns a FIFOScheduler, which makes the
>         Spark interpreter run Spark jobs one by one. It's not a good
>         experience when a couple of users are working on Zeppelin at
>         the same time, because they have to wait for each other.
>         At the same time, SparkSqlInterpreter can choose which
>         scheduler to use via "zeppelin.spark.concurrentSQL".
>         My question is: what considerations was this decision based
>         on?
>
>


Re: why zeppelin SparkInterpreter use FIFOScheduler

Posted by Pranav Kumar Agarwal <pr...@gmail.com>.
It had nothing to do with the changes related to the completion code.
The issue was reproducible on master as well.
It's due to the recent fix for ZEPPELIN-173.

On one of our environments the hostname lookup didn't return the domain
name after the hostname, while the request coming from the browser
included hostname.domain.name. As a result, the equality check
currentHost.equals(sourceHost) in NotebookServer.java's checkOrigin
method was failing.

I think the code should also fetch getCanonicalHostName and try it as
one of the combinations before returning false. Since this was not much
of a concern for us, we just commented out the newly added checkOrigin
method in our local copy.

Regards,
-Pranav.

On 02/09/15 11:59 am, moon soo Lee wrote:
> Hi,
>
> I'm not sure what could be wrong.
> Can you see any existing notebooks?
>
> Best,
> moon
>
> On Mon, Aug 31, 2015 at 8:48 PM Piyush Mukati (Data Platform)
> <piyush.mukati@flipkart.com> wrote:
>
>     Hi,
>     we have passed the InterpreterContext to completion(); it is
>     working well on my local dev setup.
>     But after
>     mvn clean package -P build-distr -Pspark-1.4
>     -Dhadoop.version=2.6.0 -Phadoop-2.6 -Pyarn
>     I copied zeppelin-0.6.0-incubating-SNAPSHOT.tar.gz to another
>     machine, and when running from there it always shows disconnected:
>     no notebooks are shown, and I am not able to create any notebook
>     either.
>
>     Screenshot 2015-09-01 09.14.54.png
>     I am not seeing anything in the logs. Can anyone please suggest
>     how I can debug this further?
>     thanks.
>
>     On Wed, Aug 26, 2015 at 8:27 PM, moon soo Lee <moon@apache.org>
>     wrote:
>
>         Hi Pranav,
>
>         Thanks for sharing the plan.
>         I think passing the InterpreterContext to completion() makes
>         sense. Although it changes the interpreter API, changing it
>         now looks better than later.
>
>         Thanks.
>         moon
>
>         On Tue, Aug 25, 2015 at 11:22 PM Pranav Kumar Agarwal
>         <praagarw@gmail.com> wrote:
>
>             Hi Moon,
>
>             > I think releasing SparkIMain and related objects
>             By packaging I meant to ask: what is the process for
>             getting the "release SparkIMain and related objects"
>             change picked up into Zeppelin's code?
>
>             I have one more question:
>             Most of the changes to make SparkInterpreter support the
>             ParallelScheduler are implemented, but I'm struggling with
>             the completion feature. Since I have a SparkIMain
>             interpreter for each notebook, the completion functionality
>             is not working as expected, because the completion method
>             doesn't receive an InterpreterContext. I need to be able to
>             pull the notebook-specific SparkIMain interpreter to return
>             correct completion results, and for that I need to know the
>             notebook id at the time of the completion call.
>
>             I'm planning to change the Interpreter.java abstract method
>             completion to pass the InterpreterContext along with the
>             buffer and cursor location. This will require refactoring
>             all the interpreters. It's a change in the contract, so I
>             thought I'd run it by you before embarking on it...
>
>             Please let me know your thoughts.
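>
>             For illustration, the contract change would look roughly
>             like this (a sketch only; the exact result type and the
>             InterpreterContext accessor are my assumptions):
>
>                 // Current form, with no notebook information available:
>                 //   public abstract List<String> completion(String buf, int cursor);
>                 // Proposed form, passing the InterpreterContext through:
>                 public abstract List<String> completion(String buf, int cursor,
>                                                         InterpreterContext context);
>
>                 // Inside SparkInterpreter, the context would then select the
>                 // per-notebook SparkIMain (map and accessor names illustrative):
>                 // SparkIMain imain = imainByNotebook.get(context.getNoteId());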
>
>             Regards,
>             -Pranav.
>
>             On 18/08/15 8:04 am, moon soo Lee wrote:
>             > Could you explain a little bit more about the package
>             > changes you mean?
>             >
>             > Thanks,
>             > moon
>             >
>             > On Mon, Aug 17, 2015 at 10:27 AM Pranav Agarwal
>             > <praagarw@gmail.com> wrote:
>             >
>             >     Any thoughts on how to package changes related to Spark?
>             >
>             >     On 17-Aug-2015 7:58 pm, "moon soo Lee"
>             >     <moon@apache.org> wrote:
>             >
>             >         I think releasing SparkIMain and related objects
>             >         after a configurable period of inactivity would be
>             >         good for now.
>             >
>             >         About the scheduler, I can help implement such a
>             >         scheduler.
>             >
>             >         Thanks,
>             >         moon
>             >
>             >         On Sun, Aug 16, 2015 at 11:54 PM Pranav Kumar
>             >         Agarwal <praagarw@gmail.com> wrote:
>             >
>             >             Hi Moon,
>             >
>             >             Yes, the notebook id comes from the
>             >             InterpreterContext. At the moment, destroying
>             >             the SparkIMain on deletion of a notebook is not
>             >             handled. I think SparkIMain is a lightweight
>             >             object; do you see a concern with keeping these
>             >             objects in a map? One possible option could be
>             >             to destroy notebook-related objects when the
>             >             inactivity on a notebook exceeds, say, 8 hours.
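>             >
>             >             As a rough sketch of what I have in mind (the
>             >             map, the names and the idle policy are
>             >             illustrative, not final):
>             >
>             >                 // Per-notebook SparkIMain instances keyed by notebook id,
>             >                 // with a last-used timestamp for inactivity-based cleanup.
>             >                 Map<String, SparkIMain> imainByNotebook = new ConcurrentHashMap<>();
>             >                 Map<String, Long> lastUsed = new ConcurrentHashMap<>();
>             >
>             >                 void evictInactive(long maxIdleMillis) {
>             >                   long now = System.currentTimeMillis();
>             >                   for (Map.Entry<String, Long> e : lastUsed.entrySet()) {
>             >                     if (now - e.getValue() > maxIdleMillis) {
>             >                       lastUsed.remove(e.getKey());
>             >                       SparkIMain imain = imainByNotebook.remove(e.getKey());
>             >                       if (imain != null) imain.close(); // assuming a close/cleanup hook
>             >                     }
>             >                   }
>             >                 }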
>             >
>             >
>             >>             >> 4. Build a queue inside the interpreter to
>             >>             >> allow only one paragraph execution at a
>             >>             >> time per notebook.
>             >>
>             >>             One downside of this approach is that the GUI
>             >>             will display RUNNING instead of PENDING for
>             >>             jobs queued inside the interpreter.
>             >             Yes, that's a good point. Having a scheduler at
>             >             the Zeppelin server that is parallel across
>             >             notebooks and FIFO across paragraphs within a
>             >             notebook would be nice. Is there any plan for
>             >             such a scheduler?
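>             >
>             >             A minimal sketch of that idea (illustrative
>             >             only, not an existing Zeppelin scheduler): keep
>             >             one single-threaded executor per notebook, so
>             >             paragraphs of one notebook run FIFO while
>             >             different notebooks run in parallel.
>             >
>             >                 // One single-thread executor per notebook id:
>             >                 // FIFO within a notebook, parallel across notebooks.
>             >                 Map<String, ExecutorService> byNote = new ConcurrentHashMap<>();
>             >
>             >                 void submit(String noteId, Runnable paragraph) {
>             >                   byNote.computeIfAbsent(noteId,
>             >                       id -> Executors.newSingleThreadExecutor())
>             >                         .submit(paragraph);
>             >                 }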
>             >
>             >             Regards,
>             >             -Pranav.
>             >
>             >
>             >             On 17/08/15 5:38 am, moon soo Lee wrote:
>             >>             Pranav, proposal looks awesome!
>             >>
>             >>             I have a question and some feedback.
>             >>
>             >>             You said you tested 1, 2 and 3. To create a
>             >>             SparkIMain per notebook, you need the notebook
>             >>             id. Did you get it from the InterpreterContext?
>             >>             Then how did you handle destroying the
>             >>             SparkIMain (when a notebook is deleted)?
>             >>             As far as I know, the interpreter is not able
>             >>             to get notification of notebook deletion.
>             >>
>             >>             >> 4. Build a queue inside the interpreter to
>             >>             >> allow only one paragraph execution at a
>             >>             >> time per notebook.
>             >>
>             >>             One downside of this approach is that the GUI
>             >>             will display RUNNING instead of PENDING for
>             >>             jobs queued inside the interpreter.
>             >>
>             >>             Best,
>             >>             moon
>             >>
>             >>             On Sun, Aug 16, 2015 at 12:55 AM IT CTO
>             >>             <goi.cto@gmail.com> wrote:
>             >>
>             >>                 +1 for "to re-factor the Zeppelin
>             >>                 architecture so that it can handle
>             >>                 multi-tenancy easily"
>             >>
>             >>                 On Sun, Aug 16, 2015 at 9:47 AM DuyHai Doan
>             >>                 <doanduyhai@gmail.com> wrote:
>             >>
>             >>                     Agree with Joel, we may think about
>             >>                     re-factoring the Zeppelin architecture
>             >>                     so that it can handle multi-tenancy
>             >>                     easily. The technical solution proposed
>             >>                     by Pranav is great, but it only applies
>             >>                     to Spark. Right now, each interpreter
>             >>                     has to manage multi-tenancy in its own
>             >>                     way. Ultimately, Zeppelin could propose
>             >>                     a multi-tenancy contract/info (like a
>             >>                     UserContext, similar to the
>             >>                     InterpreterContext) that each
>             >>                     interpreter can choose to use or not.
>             >>
>             >>
>             >>                     On Sun, Aug 16, 2015 at 3:09 AM,
>             >>                     Joel Zambrano <djoelz@gmail.com> wrote:
>             >>
>             >>                         I think the idea of running
>             >>                         multiple notes simultaneously is
>             >>                         great, but it is really dancing
>             >>                         around the lack of true multi-user
>             >>                         support in Zeppelin. The proposed
>             >>                         solution would work if the
>             >>                         application's resources are those
>             >>                         of the whole cluster, but if the
>             >>                         app is limited (say it has 8 cores
>             >>                         of 16, with some corresponding
>             >>                         share of memory), then your note
>             >>                         can potentially hog all the
>             >>                         resources, and the scheduler will
>             >>                         have to throttle all other
>             >>                         executions, leaving you exactly
>             >>                         where you are now.
>             >>                         While I think the solution is a
>             >>                         good one, maybe this question
>             >>                         should make us think about adding
>             >>                         true multi-user support: isolating
>             >>                         resources (the cluster and the
>             >>                         notebooks themselves), having
>             >>                         separate login/identity, and (I
>             >>                         don't know if it's possible)
>             >>                         sharing the same context.
>             >>
>             >>                         Thanks,
>             >>                         Joel
>             >>
>             >>                         > On Aug 15, 2015, at 1:58 PM,
>             >>                         > Rohit Agarwal
>             >>                         > <mindprince@gmail.com> wrote:
>             >>                         >
>             >>                         > If the problem is that multiple
>             >>                         > users have to wait for each other
>             >>                         > while using Zeppelin, the
>             >>                         > solution already exists: they can
>             >>                         > create a new interpreter by going
>             >>                         > to the interpreter page and
>             >>                         > attach it to their notebook -
>             >>                         > then they don't have to wait for
>             >>                         > others to submit their jobs.
>             >>                         >
>             >>                         > But I agree, having paragraphs
>             >>                         > from one note wait for paragraphs
>             >>                         > from other notes is a confusing
>             >>                         > default. We can get around that
>             >>                         > in two ways:
>             >>                         >
>             >>                         >  1. Create a new interpreter for
>             >>                         >  each note and attach that
>             >>                         >  interpreter to that note. This
>             >>                         >  approach requires the least
>             >>                         >  amount of code changes, but it
>             >>                         >  is resource heavy and doesn't
>             >>                         >  let you share the Spark Context
>             >>                         >  between different notes.
>             >>                         >  2. If we want to share the
>             >>                         >  Spark Context between different
>             >>                         >  notes, we can submit jobs from
>             >>                         >  different notes into different
>             >>                         >  fair scheduler pools (
>             >>                         >  https://spark.apache.org/docs/1.4.0/job-scheduling.html#scheduling-within-an-application).
>             >>                         >  This can be done by submitting
>             >>                         >  jobs from different notes in
>             >>                         >  different threads. This will
>             >>                         >  make sure that jobs from one
>             >>                         >  note run sequentially, but jobs
>             >>                         >  from different notes will be
>             >>                         >  able to run in parallel (see the
>             >>                         >  sketch below).
>             >>                         >
>             >>                         > Neither of these options requires
>             >>                         > any change in the Spark code.
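>             >>                         >
>             >>                         > A minimal sketch of option 2
>             >>                         > (the pool naming and threading
>             >>                         > are illustrative; it assumes the
>             >>                         > fair scheduler is enabled via
>             >>                         > spark.scheduler.mode=FAIR):
>             >>                         >
>             >>                         >     // Tag every job submitted from this thread with a
>             >>                         >     // per-note fair scheduler pool, then run the action.
>             >>                         >     void runParagraph(SparkContext sc, String noteId, Runnable action) {
>             >>                         >       new Thread(() -> {
>             >>                         >         sc.setLocalProperty("spark.scheduler.pool", "note-" + noteId);
>             >>                         >         action.run(); // jobs from this thread go to the note's pool
>             >>                         >       }).start();
>             >>                         >     }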
>             >>                         >
>             >>                         > --
>             >>                         > Thanks & Regards
>             >>                         > Rohit Agarwal
>             >>                         > https://www.linkedin.com/in/rohitagarwal003
>             >>                         >
>             >>                         > On Sat, Aug 15, 2015 at 12:01 PM,
>             >>                         > Pranav Kumar Agarwal
>             >>                         > <praagarw@gmail.com> wrote:
>             >>                         >
>             >>  >>> If someone can share about the idea of sharing a
>             >>  >>> single SparkContext through multiple SparkILoops
>             >>  >>> safely, it'll be really helpful.
>             >>  >> Here is a proposal:
>             >>  >> 1. In the Spark code, change SparkIMain.scala to
>             >>  >> allow setting the virtual directory. While creating
>             >>  >> new instances of SparkIMain per notebook from the
>             >>  >> Zeppelin Spark interpreter, set all the instances of
>             >>  >> SparkIMain to the same virtual directory.
>             >>  >> 2. Start an HTTP server on that virtual directory
>             >>  >> and set this HTTP server in the SparkContext using
>             >>  >> the classServerUri method.
>             >>  >> 3. Scala-generated code has a notion of packages.
>             >>  >> The default package name is "line$<linenumber>". The
>             >>  >> package name can be controlled using the system
>             >>  >> property scala.repl.name.line. Setting this property
>             >>  >> to the notebook id ensures that the code generated
>             >>  >> by individual instances of SparkIMain is isolated
>             >>  >> from other instances of SparkIMain (see the sketch
>             >>  >> below).
>             >>  >> 4. Build a queue inside the interpreter to allow
>             >>  >> only one paragraph execution at a time per notebook.
>             >>  >>
>             >>  >> I have tested 1, 2 and 3, and this seems to provide
>             >>  >> isolation across classnames. I'll work towards
>             >>  >> submitting a formal patch soon - is there already a
>             >>  >> Jira for this that I can take up? Also, I need to
>             >>  >> understand:
>             >>  >> 1. How does Zeppelin take up Spark fixes? Or do I
>             >>  >> need to first get the Spark changes merged into
>             >>  >> Apache Spark's github?
>             >>  >>
>             >>  >> Any suggestions or comments on the proposal are
>             >>  >> highly welcome.
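>             >>  >>
>             >>  >> Roughly, in the interpreter this would look like the
>             >>  >> following (a sketch under the proposal's assumptions;
>             >>  >> the SparkIMain constructor arguments and property
>             >>  >> handling are simplified, not the actual patch):
>             >>  >>
>             >>  >>     // Isolate generated class names per notebook (step 3):
>             >>  >>     System.setProperty("scala.repl.name.line", "nb" + noteId);
>             >>  >>     // All SparkIMain instances share one virtual directory served
>             >>  >>     // over HTTP (steps 1 and 2), so executors fetch classes from
>             >>  >>     // a single URI via SparkContext's classServerUri:
>             >>  >>     SparkIMain imain = new SparkIMain(settings, out); // simplified ctor
>             >>  >>     imainByNotebook.put(noteId, imain);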
>             >>  >>
>             >>  >> Regards,
>             >>  >> -Pranav.
>             >>  >>
>             >>  >>> On 10/08/15 11:36 pm, moon soo Lee wrote:
>             >>  >>>
>             >>  >>> Hi piyush,
>             >>  >>>
>             >>  >>> A separate instance of SparkILoop and SparkIMain for
>             >>  >>> each notebook while sharing the SparkContext sounds
>             >>  >>> great.
>             >>  >>>
>             >>  >>> Actually, I tried to do it and found the problem
>             >>  >>> that multiple SparkILoops could generate the same
>             >>  >>> class name, and the Spark executor confuses the
>             >>  >>> classnames, since they're reading classes from a
>             >>  >>> single SparkContext.
>             >>  >>>
>             >>  >>> If someone can share an idea for sharing a single
>             >>  >>> SparkContext through multiple SparkILoops safely,
>             >>  >>> it'll be really helpful.
>             >>  >>>
>             >>  >>> Thanks,
>             >>  >>> moon


Re: why zeppelin SparkInterpreter use FIFOScheduler

Posted by Pranav Kumar Agarwal <pr...@gmail.com>.
It had nothing to do the changes related to completion code. The issue 
was reproducible on master also.
Its due to the recent fix for ZEPPELIN-173

On one of our environment the hostname didn't returned the domain name 
after the hostname, however since the query coming from the browser 
included the hostname.domain.name. Basically the equality check in 
NotebookServer.java at checkOrigin method for 
currentHost.equals(sourceHost) was failing.

I think the code should fetch getCanonicalHostName also and try it as 
one of the combination before returning false. Since this was not much 
of a concern we just commented the newly added method checkOrigin in our 
local copy.

Regards,
-Pranav.

On 02/09/15 11:59 am, moon soo Lee wrote:
> Hi,
>
> I'm not sure what could be wrong.
> can you see any existing notebook?
>
> Best,
> moon
>
> On Mon, Aug 31, 2015 at 8:48 PM Piyush Mukati (Data Platform) 
> <piyush.mukati@flipkart.com <ma...@flipkart.com>> wrote:
>
>     Hi,
>     we have passed the InterpreterContext to  completion() , it is
>     working good on my local dev setup.
>     but after
>     mvn  clean package  -P build-distr  -Pspark-1.4
>     -Dhadoop.version=2.6.0 -Phadoop-2.6 -Pyarn
>     I copied zeppelin-0.6.0-incubating-SNAPSHOT.tar.gz to some other
>     machine,
>     while running from there it always shows disconnected and no
>     notebook are shown, even i am not able to create any notebook as
>     well.
>
>     Screenshot 2015-09-01 09.14.54.png
>     i am not seeing anything in logs. can anyone please suggest me how
>     can i further debug into it.
>     thanks.
>
>     On Wed, Aug 26, 2015 at 8:27 PM, moon soo Lee <moon@apache.org
>     <ma...@apache.org>> wrote:
>
>         Hi Pranav,
>
>         Thanks for sharing the plan.
>         I think passing InterpreterContext to completion()  make sense.
>         Although it changes interpreter api, changing now looks better
>         than later.
>
>         Thanks.
>         moon
>
>         On Tue, Aug 25, 2015 at 11:22 PM Pranav Kumar Agarwal
>         <praagarw@gmail.com <ma...@gmail.com>> wrote:
>
>             Hi Moon,
>
>             > I think releasing SparkIMain and related objects
>             By packaging I meant to ask what is the process to
>             "release SparkIMain
>             and related objects"? for Zeppelin's code uptake?
>
>             I have one more question:
>             Most the changes to allow SparkInterpreter support
>             ParallelScheduler are
>             implemented but I'm struggling with the completion
>             feature. Since I have
>             SparkIMain interpreter for each notebook, completion
>             functionality is
>             not working as expected cause the completion method
>             doesn't have
>             InterpreterContext. I need to be able to pull notebook
>             specific
>             SparkIMain interpreter to return correct completion
>             results, and for
>             that I need to know the notbook-id at the time of
>             completion call.
>
>             I'm planning to change the Interpreter.java abstract
>             method completion
>             to pass InterpreterContext along with buffer and cursor
>             location. This
>             will require refactoring all the Interpreter's. It's a
>             change in the
>             contract, so thought will run with you before embarking on
>             it...
>
>             Please let me know your thoughts.
>
>             Regards,
>             -Pranav.
>
>             On 18/08/15 8:04 am, moon soo Lee wrote:
>             > Could you explain little bit more about package changes
>             you mean?
>             >
>             > Thanks,
>             > moon
>             >
>             > On Mon, Aug 17, 2015 at 10:27 AM Pranav Agarwal
>             <praagarw@gmail.com <ma...@gmail.com>
>             > <mailto:praagarw@gmail.com <ma...@gmail.com>>>
>             wrote:
>             >
>             >     Any thoughts on how to package changes related to Spark?
>             >
>             >     On 17-Aug-2015 7:58 pm, "moon soo Lee"
>             <moon@apache.org <ma...@apache.org>
>             >     <mailto:moon@apache.org <ma...@apache.org>>>
>             wrote:
>             >
>             >         I think releasing SparkIMain and related objects
>             after
>             >         configurable inactivity would be good for now.
>             >
>             >         About scheduler, I can help implementing such
>             scheduler.
>             >
>             >         Thanks,
>             >         moon
>             >
>             >         On Sun, Aug 16, 2015 at 11:54 PM Pranav Kumar
>             Agarwal
>             >         <praagarw@gmail.com <ma...@gmail.com>
>             <mailto:praagarw@gmail.com <ma...@gmail.com>>>
>             wrote:
>             >
>             >             Hi Moon,
>             >
>             >             Yes, the notebookid comes from
>             InterpreterContext. At the
>             >             moment destroying SparkIMain on deletion of
>             notebook is
>             >             not handled. I think SparkIMain is a
>             lightweight object,
>             >             do you see a concern having these objects in
>             a map? One
>             >             possible option could be to destroy notebook
>             related
>             >             objects when the inactivity on a notebook is
>             greater than
>             >             say 8 hours.
>             >
>             >
>             >>             >> 4. Build a queue inside interpreter to
>             allow only one
>             >>             paragraph execution
>             >>             >> at a time per notebook.
>             >>
>             >>             One downside of this approach is, GUI will
>             display
>             >>             RUNNING instead of PENDING for jobs inside
>             of queue in
>             >>             interpreter.
>             >             Yes that's an good point. Having a scheduler
>             at Zeppelin
>             >             server to build a scheduler that is parallel
>             across
>             >             notebook's and FIFO across paragraph's will
>             be nice. Is
>             >             there any plan for having such a scheduler?
>             >
>             >             Regards,
>             >             -Pranav.
>             >
>             >
>             >             On 17/08/15 5:38 am, moon soo Lee wrote:
>             >>             Pranav, proposal looks awesome!
>             >>
>             >>             I have a question and feedback,
>             >>
>             >>             You said you tested 1,2 and 3. To create
>             SparkIMain per
>             >>             notebook, you need information of notebook
>             id. Did you
>             >>             get it from InterpreterContext?
>             >>             Then how did you handle destroying of
>             SparkIMain (when
>             >>             notebook is deleting)?
>             >>             As far as i know, interpreter not able to
>             get information
>             >>             of notebook deletion.
>             >>
>             >>             >> 4. Build a queue inside interpreter to
>             allow only one
>             >>             paragraph execution
>             >>             >> at a time per notebook.
>             >>
>             >>             One downside of this approach is, GUI will
>             display
>             >>             RUNNING instead of PENDING for jobs inside
>             of queue in
>             >>             interpreter.
>             >>
>             >>             Best,
>             >>             moon
>             >>
>             >>             On Sun, Aug 16, 2015 at 12:55 AM IT CTO
>             >>             <goi.cto@gmail.com
>             <ma...@gmail.com> <mailto:goi.cto@gmail.com
>             <ma...@gmail.com>>> wrote:
>             >>
>             >>                 +1 for "to re-factor the Zeppelin
>             architecture so
>             >>                 that it can handle multi-tenancy easily"
>             >>
>             >>                 On Sun, Aug 16, 2015 at 9:47 AM DuyHai Doan
>             >>                 <doanduyhai@gmail.com
>             <ma...@gmail.com> <mailto:doanduyhai@gmail.com
>             <ma...@gmail.com>>>
>             >>                 wrote:
>             >>
>             >>                     Agree with Joel, we may think to
>             re-factor the
>             >>                     Zeppelin architecture so that it
>             can handle
>             >>                     multi-tenancy easily. The technical
>             solution
>             >>                     proposed by Pranav is great but it
>             only applies
>             >>                     to Spark. Right now, each
>             interpreter has to
>             >>                     manage multi-tenancy its own way.
>             Ultimately
>             >>                     Zeppelin can propose a multi-tenancy
>             >>                     contract/info (like UserContext,
>             similar to
>             >>  InterpreterContext) so that each interpreter can
>             >>                     choose to use or not.
>             >>
>             >>
>             >>                     On Sun, Aug 16, 2015 at 3:09 AM,
>             Joel Zambrano
>             >>                     <djoelz@gmail.com
>             <ma...@gmail.com> <mailto:djoelz@gmail.com
>             <ma...@gmail.com>>> wrote:
>             >>
>             >>                         I think while the idea of
>             running multiple
>             >>                         notes simultaneously is great.
>             It is really
>             >>                         dancing around the lack of true
>             multi user
>             >>                         support in Zeppelin. While the
>             proposed
>             >>                         solution would work if the
>             applications
>             >>                         resources are those of the
>             whole cluster, if
>             >>                         the app is limited (say they
>             are 8 cores of
>             >>                         16, with some distribution in
>             memory) then
>             >>  potentially your note can hog all the
>             >>                         resources and the scheduler
>             will have to
>             >>                         throttle all other executions
>             leaving you
>             >>                         exactly where you are now.
>             >>                         While I think the solution is a
>             good one,
>             >>                         maybe this question makes us
>             think in adding
>             >>                         true multiuser support.
>             >>                         Where we isolate resources
>             (cluster and the
>             >>                         notebooks themselves), have
>             separate
>             >>  login/identity and (I don't know if it's
>             >>                         possible) share the same context.
>             >>
>             >>                         Thanks,
>             >>                         Joel
>             >>
>             >>                         > On Aug 15, 2015, at 1:58 PM,
>             Rohit Agarwal
>             >>                         <mindprince@gmail.com
>             <ma...@gmail.com>
>             >>  <mailto:mindprince@gmail.com
>             <ma...@gmail.com>>> wrote:
>             >>                         >
>             >>                         > If the problem is that
>             multiple users have
>             >>                         to wait for each other while
>             >>                         > using Zeppelin, the solution
>             already
>             >>                         exists: they can create a new
>             >>                         > interpreter by going to the
>             interpreter
>             >>                         page and attach it to their
>             >>                         > notebook - then they don't
>             have to wait for
>             >>                         others to submit their job.
>             >>                         >
>             >>                         > But I agree, having
>             paragraphs from one
>             >>                         note wait for paragraphs from other
>             >>                         > notes is a confusing default.
>             We can get
>             >>                         around that in two ways:
>             >>                         >
>             >>                         >  1. Create a new interpreter
>             for each note
>             >>                         and attach that interpreter to
>             >>                         >  that note. This approach
>             would require the least amount
>             >>                         of code changes but
>             >>                         >  is resource heavy and
>             doesn't let you
>             >>                         share Spark Context between
>             different
>             >>                         >  notes.
>             >>                         >  2. If we want to share the
>             Spark Context
>             >>                         between different notes, we can
>             >>                         >  submit jobs from different
>             notes into
>             >>  different fairscheduler pools (
>             >>                         >
>             >>
>             https://spark.apache.org/docs/1.4.0/job-scheduling.html#scheduling-within-an-application).
>             >>                         >  This can be done by
>             submitting jobs from
>             >>  different notes in different
>             >>                         >  threads. This will make sure
>             that jobs
>             >>                         from one note are run sequentially
>             >>                         >  but jobs from different
>             notes will be
>             >>                         able to run in parallel.
>             >>                         >
>             >>                         > Neither of these options
>             require any change
>             >>                         in the Spark code.
>             >>                         >
>             >>                         > --
>             >>                         > Thanks & Regards
>             >>                         > Rohit Agarwal
>             >>                         >
>             https://www.linkedin.com/in/rohitagarwal003
>             >>                         >
>             >>                         > On Sat, Aug 15, 2015 at 12:01
>             PM, Pranav
>             >>                         Kumar Agarwal
>             <praagarw@gmail.com <ma...@gmail.com>
>             >>                         <mailto:praagarw@gmail.com
>             <ma...@gmail.com>>>
>
>             >>                         > wrote:
>             >>                         >
>             >>  >> If someone can share about the idea of
>             >>                         sharing single SparkContext through
>             >>  >>> multiple SparkILoop safely, it'll be
>             >>                         really helpful.
>             >>  >> Here is a proposal:
>             >>  >> 1. In Spark code, change SparkIMain.scala
>             >>                         to allow setting the virtual
>             >>  >> directory. While creating new instances of
>             >>  SparkIMain per notebook from
>             >>  >> zeppelin spark interpreter set all the
>             >>  instances of SparkIMain to the same
>             >>  >> virtual directory.
>             >>  >> 2. Start HTTP server on that virtual
>             >>  directory and set this HTTP server in
>             >>  >> Spark Context using classserverUri method
>             >>  >> 3. Scala generated code has a notion of
>             >>  packages. The default package name
>             >>  >> is "line$<linenumber>". Package name can
>             >>                         be controlled using System
>             >>  >> Property scala.repl.name.line. Setting
>             >>                         this property to "notebook id"
>             >>  >> ensures that code generated by individual
>             >>  instances of SparkIMain is
>             >>  >> isolated from other instances of SparkIMain
>             >>  >> 4. Build a queue inside interpreter to
>             >>                         allow only one paragraph execution
>             >>  >> at a time per notebook.
>             >>  >>
>             >>  >> I have tested 1, 2, and 3 and it seems to
>             >>                         provide isolation across
>             >>  >> classnames. I'll work towards submitting a
>             >>                         formal patch soon - Is there any
>             >>  >> Jira already for the same that I can
>             >>                         uptake? Also I need to understand:
>             >>  >> 1. How does Zeppelin uptake Spark fixes?
>             >>                         OR do I need to first work
>             >>  >> towards getting Spark changes merged in
>             >>                         Apache Spark github?
>             >>  >>
>             >>  >> Any suggestions on comments on the
>             >>  proposal are highly welcome.
>             >>  >>
>             >>  >> Regards,
>             >>  >> -Pranav.
>             >>  >>
>             >>  >>> On 10/08/15 11:36 pm, moon soo Lee wrote:
>             >>  >>>
>             >>  >>> Hi piyush,
>             >>  >>>
>             >>  >>> Separate instance of SparkILoop
>             >>  SparkIMain for each notebook while
>             >>  >>> sharing the SparkContext sounds great.
>             >>  >>>
>             >>  >>> Actually, i tried to do it, found problem
>             >>                         that multiple SparkILoop could
>             >>  >>> generates the same class name, and spark
>             >>  executor confuses classname since
>             >>  >>> they're reading classes from single
>             >>  SparkContext.
>             >>  >>>
>             >>  >>> If someone can share about the idea of
>             >>                         sharing single SparkContext
>             >>  >>> through multiple SparkILoop safely, it'll
>             >>                         be really helpful.
>             >>  >>>
>             >>  >>> Thanks,
>             >>  >>> moon
>             >>  >>>
>             >>  >>>
>             >>  >>> On Mon, Aug 10, 2015 at 1:21 AM Piyush
>             >>                         Mukati (Data Platform) <
>             >>  >>> piyush.mukati@flipkart.com
>             <ma...@flipkart.com>
>             >>  <mailto:piyush.mukati@flipkart.com
>             <ma...@flipkart.com>>
>             >>                       
>              <mailto:piyush.mukati@flipkart.com
>             <ma...@flipkart.com>
>
>             >>  <mailto:piyush.mukati@flipkart.com
>             <ma...@flipkart.com>>>> wrote:
>             >>  >>>
>             >>  >>>    Hi Moon,
>             >>  >>>    Any suggestion on it, have to wait lot
>             >>                         when multiple people  working
>             >>  >>> with spark.
>             >>  >>>    Can we create separate instance of
>             >> SparkILoop SparkIMain and
>             >>  >>> printstrems  for each notebook while
>             >>                         sharing theSparkContext
>             >>  >>> ZeppelinContext SQLContext and
>             >>  DependencyResolver and then use parallel
>             >>  >>> scheduler ?
>             >>  >>> thanks
>             >>  >>>
>             >>  >>> -piyush
>             >>  >>>
>             >>  >>>    Hi Moon,
>             >>  >>>
>             >>  >>>    How about tracking dedicated
>             >>  SparkContext for a notebook in Spark's
>             >>  >>> remote interpreter - this will allow
>             >>  multiple users to run their spark
>             >>  >>> paragraphs in parallel. Also, within a
>             >>  notebook only one paragraph is
>             >>  >>> executed at a time.
>             >>  >>>
>             >>  >>> Regards,
>             >>  >>> -Pranav.
>             >>  >>>
>             >>  >>>
>             >>  >>>> On 15/07/15 7:15 pm, moon soo Lee wrote:
>             >>  >>>> Hi,
>             >>  >>>>
>             >>  >>>> Thanks for asking question.
>             >>  >>>>
>             >>  >>>> The reason is simply because of it is
>             >>                         running code statements. The
>             >>  >>>> statements can have order and
>             >>  dependency. Imagine i have two
>             >>  >>> paragraphs
>             >>  >>>>
>             >>  >>>> %spark
>             >>  >>>> val a = 1
>             >>  >>>>
>             >>  >>>> %spark
>             >>  >>>> print(a)
>             >>  >>>>
>             >>  >>>> If they're not running one by one, that means they
>             >>  >>>> possibly run in random order and the output will be
>             >>  >>>> always different. Either '1' or 'val a can not found'.
>             >>  >>>>
>             >>  >>>> This is the reason why. But if there is a nice idea to
>             >>  >>>> handle this problem, I agree using a parallel scheduler
>             >>  >>>> would help a lot.
>             >>  >>>>
>             >>  >>>> Thanks,
>             >>  >>>> moon
>             >>  >>>> On Jul 14, 2015 (Tue) at 7:59 PM linxi zeng
>             >>  >>>> <linxizeng0615@gmail.com> wrote:
>             >>  >>>>
>             >>  >>>> anyone who has the same question as me? or is this
>             >>  >>> not a question?
>             >>  >>>>
>             >>  >>>> 2015-07-14 11:47 GMT+08:00 linxi zeng
>             >>  >>>> <linxizeng0615@gmail.com>:
>             >>  >>>>
>             >>  >>>>     hi, Moon:
>             >>  >>>>        I notice that the getScheduler function in
>             >>  >>>>     SparkInterpreter.java returns a FIFOScheduler, which
>             >>  >>>>     makes the spark interpreter run spark jobs one by
>             >>  >>>>     one. It's not a good experience when a couple of
>             >>  >>>>     users do some work on zeppelin at the same time,
>             >>  >>>>     because they have to wait for each other.
>             >>  >>>>     And at the same time, SparkSqlInterpreter can choose
>             >>  >>>>     what scheduler to use via
>             >>  >>>>     "zeppelin.spark.concurrentSQL".
>             >>  >>>>     My question is: what kind of consideration did you
>             >>  >>>>     base such a decision on?
>             >>  >>>
>             >>  >>>
>             >>  >>>
>             >>  >>>
>             >>  >>>
>             >>
>             >>  >>
>             >>
>             >>
>             >
>
>
>
>
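
For reference, the scheduler choice linxi zeng asks about above comes down
to a single getScheduler() override. A minimal sketch, assuming Zeppelin's
SchedulerFactory API as SparkSqlInterpreter uses it behind
"zeppelin.spark.concurrentSQL" (method names may differ between versions;
shown in Scala although the real interpreter classes are Java):

    import org.apache.zeppelin.scheduler.{Scheduler, SchedulerFactory}

    // Pick a parallel scheduler when concurrent execution is allowed,
    // otherwise the FIFO scheduler that SparkInterpreter returns today.
    def getScheduler(concurrent: Boolean, name: String): Scheduler =
      if (concurrent)
        SchedulerFactory.singleton().createOrGetParallelScheduler(name, 10)
      else
        SchedulerFactory.singleton().createOrGetFIFOScheduler(name)

The maxConcurrency value of 10 mirrors the concurrent-SQL case and is an
assumption here.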


Re: why zeppelin SparkInterpreter use FIFOScheduler

Posted by moon soo Lee <mo...@apache.org>.
Hi,

I'm not sure what could be wrong.
Can you see any existing notebook?

Best,
moon

On Mon, Aug 31, 2015 at 8:48 PM Piyush Mukati (Data Platform) <
piyush.mukati@flipkart.com> wrote:

> Hi,
> we have passed the InterpreterContext to completion(); it is working
> well on my local dev setup.
> but after
> mvn clean package -P build-distr -Pspark-1.4 -Dhadoop.version=2.6.0
> -Phadoop-2.6 -Pyarn
> I copied zeppelin-0.6.0-incubating-SNAPSHOT.tar.gz to some other machine;
> while running from there it always shows disconnected, no notebooks are
> shown, and I am not able to create any notebook either.
>
>  [image: Screenshot 2015-09-01 09.14.54.png]
> I am not seeing anything in the logs. Can anyone please suggest how I can
> debug further into it?
> thanks.
>
> On Wed, Aug 26, 2015 at 8:27 PM, moon soo Lee <mo...@apache.org> wrote:
>
>> Hi Pranav,
>>
>> Thanks for sharing the plan.
>> I think passing InterpreterContext to completion() makes sense.
>> Although it changes the interpreter API, changing it now looks better than later.
>>
>> Thanks.
>> moon
>>
>> On Tue, Aug 25, 2015 at 11:22 PM Pranav Kumar Agarwal <pr...@gmail.com>
>> wrote:
>>
>>> Hi Moon,
>>>
>>> > I think releasing SparkIMain and related objects
>>> By packaging I meant to ask: what is the process to "release SparkIMain
>>> and related objects" for Zeppelin's code uptake?
>>>
>>> I have one more question:
>>> Most of the changes to allow SparkInterpreter to support ParallelScheduler
>>> are implemented, but I'm struggling with the completion feature. Since I
>>> have a SparkIMain interpreter for each notebook, completion functionality
>>> is not working as expected because the completion method doesn't have an
>>> InterpreterContext. I need to be able to pull the notebook-specific
>>> SparkIMain interpreter to return correct completion results, and for
>>> that I need to know the notebook id at the time of the completion call.
>>>
>>> I'm planning to change the Interpreter.java abstract method completion
>>> to pass InterpreterContext along with the buffer and cursor location. This
>>> will require refactoring all the Interpreters. It's a change in the
>>> contract, so I thought I would run it by you before embarking on it...
>>>
>>> Please let me know your thoughts.
>>>
>>> Regards,
>>> -Pranav.
>>>
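
A minimal sketch of the contract change Pranav describes, with stand-in
types so the snippet is self-contained (the real Interpreter and
InterpreterContext are Java classes in Zeppelin; all names below are
illustrative, not Zeppelin's actual API):

    // Stand-ins for illustration; not Zeppelin's actual classes.
    case class InterpreterContext(noteId: String)
    class NotebookRepl // placeholder for a per-notebook SparkIMain

    abstract class Interpreter {
      // before: def completion(buf: String, cursor: Int): java.util.List[String]
      // after:  the context carries the note id needed to pick the right REPL
      def completion(buf: String, cursor: Int,
                     context: InterpreterContext): java.util.List[String]
    }

    class PerNotebookInterpreter extends Interpreter {
      private val repls =
        new java.util.concurrent.ConcurrentHashMap[String, NotebookRepl]()

      override def completion(buf: String, cursor: Int,
                              context: InterpreterContext): java.util.List[String] = {
        val repl = repls.get(context.noteId) // notebook-specific SparkIMain
        // ... delegate to that REPL's completer here; stubbed in this sketch
        java.util.Collections.emptyList[String]()
      }
    }

The point is only the extra parameter: with it, the interpreter can key its
SparkIMain map by the note id instead of guessing which REPL to ask.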
>>> On 18/08/15 8:04 am, moon soo Lee wrote:
>>> > Could you explain little bit more about package changes you mean?
>>> >
>>> > Thanks,
>>> > moon
>>> >
>>> > On Mon, Aug 17, 2015 at 10:27 AM Pranav Agarwal <praagarw@gmail.com> wrote:
>>> >
>>> >     Any thoughts on how to package changes related to Spark?
>>> >
>>> >     On 17-Aug-2015 7:58 pm, "moon soo Lee" <moon@apache.org> wrote:
>>> >
>>> >         I think releasing SparkIMain and related objects after
>>> >         configurable inactivity would be good for now.
>>> >
>>> >         About scheduler, I can help implementing such scheduler.
>>> >
>>> >         Thanks,
>>> >         moon
>>> >
>>> >         On Sun, Aug 16, 2015 at 11:54 PM Pranav Kumar Agarwal
>>> >         <praagarw@gmail.com> wrote:
>>> >
>>> >             Hi Moon,
>>> >
>>> >             Yes, the notebookid comes from InterpreterContext. At the
>>> >             moment, destroying SparkIMain on deletion of a notebook is
>>> >             not handled. I think SparkIMain is a lightweight object;
>>> >             do you see a concern with having these objects in a map? One
>>> >             possible option could be to destroy notebook-related
>>> >             objects when the inactivity on a notebook is greater than,
>>> >             say, 8 hours.
>>> >
>>> >
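
A sketch of the map-plus-inactivity idea, assuming nothing beyond the
thread itself: per-notebook REPLs with a last-used timestamp, evicted
after a configurable idle window (8 hours above). All names here are
illustrative.

    import scala.collection.mutable

    class ReplPool[R](newRepl: String => R,
                      maxIdleMillis: Long = 8L * 60 * 60 * 1000L) {
      private case class Entry(repl: R, var lastUsed: Long)
      private val entries = mutable.Map.empty[String, Entry]

      // Get (or lazily create) the REPL for a note, refreshing its timestamp.
      def get(noteId: String): R = synchronized {
        val e = entries.getOrElseUpdate(
          noteId, Entry(newRepl(noteId), System.currentTimeMillis()))
        e.lastUsed = System.currentTimeMillis()
        e.repl
      }

      // Call periodically; returns evicted note ids so the caller can
      // close the underlying REPLs.
      def evictIdle(): Seq[String] = synchronized {
        val cutoff = System.currentTimeMillis() - maxIdleMillis
        val idle = entries.collect {
          case (id, e) if e.lastUsed < cutoff => id
        }.toSeq
        idle.foreach(entries.remove)
        idle
      }
    }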
>>> >>             >> 4. Build a queue inside interpreter to allow only one
>>> >>             paragraph execution
>>> >>             >> at a time per notebook.
>>> >>
>>> >>             One downside of this approach is, GUI will display
>>> >>             RUNNING instead of PENDING for jobs inside of queue in
>>> >>             interpreter.
>>> >             Yes, that's a good point. Having a scheduler at the
>>> >             Zeppelin server that is parallel across notebooks and
>>> >             FIFO across paragraphs would be nice. Is there any plan
>>> >             for having such a scheduler?
>>> >
>>> >             Regards,
>>> >             -Pranav.
>>> >
>>> >
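
One way to read "parallel across notebooks, FIFO across paragraphs" - a
sketch, not Zeppelin's actual scheduler: give each note its own
single-threaded queue, so its paragraphs run in submission order while
different notes proceed concurrently.

    import java.util.concurrent.{ExecutorService, Executors}
    import scala.collection.mutable

    class NoteScheduler {
      private val queues = mutable.Map.empty[String, ExecutorService]

      // Paragraphs for one note share a single thread (FIFO); distinct
      // notes get distinct threads and therefore run in parallel.
      def submit(noteId: String, paragraph: Runnable): Unit = synchronized {
        queues.getOrElseUpdate(noteId, Executors.newSingleThreadExecutor())
          .submit(paragraph)
      }
    }

A real implementation would also have to report PENDING for queued
paragraphs, which is exactly the GUI problem moon points out.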
>>> >             On 17/08/15 5:38 am, moon soo Lee wrote:
>>> >>             Pranav, proposal looks awesome!
>>> >>
>>> >>             I have a question and feedback,
>>> >>
>>> >>             You said you tested 1,2 and 3. To create SparkIMain per
>>> >>             notebook, you need information of notebook id. Did you
>>> >>             get it from InterpreterContext?
>>> >>             Then how did you handle destroying SparkIMain (when a
>>> >>             notebook is deleted)?
>>> >>             As far as I know, the interpreter is not able to get
>>> >>             information about notebook deletion.
>>> >>
>>> >>             >> 4. Build a queue inside interpreter to allow only one
>>> >>             paragraph execution
>>> >>             >> at a time per notebook.
>>> >>
>>> >>             One downside of this approach is, GUI will display
>>> >>             RUNNING instead of PENDING for jobs inside of queue in
>>> >>             interpreter.
>>> >>
>>> >>             Best,
>>> >>             moon
>>> >>
>>> >>             On Sun, Aug 16, 2015 at 12:55 AM IT CTO
>>> >>             <goi.cto@gmail.com> wrote:
>>> >>
>>> >>                 +1 for "to re-factor the Zeppelin architecture so
>>> >>                 that it can handle multi-tenancy easily"
>>> >>
>>> >>                 On Sun, Aug 16, 2015 at 9:47 AM DuyHai Doan
>>> >>                 <doanduyhai@gmail.com> wrote:
>>> >>
>>> >>                     Agree with Joel, we may think to re-factor the
>>> >>                     Zeppelin architecture so that it can handle
>>> >>                     multi-tenancy easily. The technical solution
>>> >>                     proposed by Pranav is great but it only applies
>>> >>                     to Spark. Right now, each interpreter has to
>>> >>                     manage multi-tenancy its own way. Ultimately
>>> >>                     Zeppelin can propose a multi-tenancy
>>> >>                     contract/info (like UserContext, similar to
>>> >>                     InterpreterContext) so that each interpreter can
>>> >>                     choose to use or not.
>>> >>
>>> >>
>>> >>                     On Sun, Aug 16, 2015 at 3:09 AM, Joel Zambrano
>>> >>                     <djoelz@gmail.com> wrote:
>>> >>
>>> >>                         I think while the idea of running multiple
>>> >>                         notes simultaneously is great. It is really
>>> >>                         dancing around the lack of true multi user
>>> >>                         support in Zeppelin. While the proposed
>>> >>                         solution would work if the application's
>>> >>                         resources are those of the whole cluster, if
>>> >>                         the app is limited (say they are 8 cores of
>>> >>                         16, with some distribution in memory) then
>>> >>                         potentially your note can hog all the
>>> >>                         resources and the scheduler will have to
>>> >>                         throttle all other executions leaving you
>>> >>                         exactly where you are now.
>>> >>                         While I think the solution is a good one,
>>> >>                         maybe this question makes us think in adding
>>> >>                         true multiuser support.
>>> >>                         Where we isolate resources (cluster and the
>>> >>                         notebooks themselves), have separate
>>> >>                         login/identity and (I don't know if it's
>>> >>                         possible) share the same context.
>>> >>
>>> >>                         Thanks,
>>> >>                         Joel
>>> >>
>>> >>                         > On Aug 15, 2015, at 1:58 PM, Rohit Agarwal
>>> >>                         > <mindprince@gmail.com> wrote:
>>> >>                         >
>>> >>                         > If the problem is that multiple users have to
>>> >>                         > wait for each other while using Zeppelin, the
>>> >>                         > solution already exists: they can create a new
>>> >>                         > interpreter by going to the interpreter page
>>> >>                         > and attach it to their notebook - then they
>>> >>                         > don't have to wait for others to submit their
>>> >>                         > job.
>>> >>                         >
>>> >>                         > But I agree, having paragraphs from one note
>>> >>                         > wait for paragraphs from other notes is a
>>> >>                         > confusing default. We can get around that in
>>> >>                         > two ways:
>>> >>                         >
>>> >>                         >   1. Create a new interpreter for each note
>>> >>                         >   and attach that interpreter to that note.
>>> >>                         >   This approach would require the least amount
>>> >>                         >   of code changes but is resource heavy and
>>> >>                         >   doesn't let you share the Spark Context
>>> >>                         >   between different notes.
>>> >>                         >   2. If we want to share the Spark Context
>>> >>                         >   between different notes, we can submit jobs
>>> >>                         >   from different notes into different
>>> >>                         >   fairscheduler pools (
>>> >>                         >   https://spark.apache.org/docs/1.4.0/job-scheduling.html#scheduling-within-an-application ).
>>> >>                         >   This can be done by submitting jobs from
>>> >>                         >   different notes in different threads. This
>>> >>                         >   will make sure that jobs from one note are
>>> >>                         >   run sequentially but jobs from different
>>> >>                         >   notes will be able to run in parallel.
>>> >>                         >
>>> >>                         > Neither of these options requires any change
>>> >>                         > in the Spark code.
>>> >>                         >
>>> >>                         > --
>>> >>                         > Thanks & Regards
>>> >>                         > Rohit Agarwal
>>> >>                         > https://www.linkedin.com/in/rohitagarwal003
>>> >>                         >
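
Rohit's option 2 in code - a sketch assuming a shared, already-created
SparkContext; setLocalProperty and the "spark.scheduler.pool" key are
Spark API, the wrapper itself is illustrative:

    import org.apache.spark.SparkContext

    // Spark reads "spark.scheduler.pool" from thread-local properties,
    // so tagging each note's thread routes its jobs to that note's pool
    // while every note shares the one SparkContext.
    def runInNotePool(sc: SparkContext, noteId: String)(job: => Unit): Thread = {
      val t = new Thread(new Runnable {
        override def run(): Unit = {
          sc.setLocalProperty("spark.scheduler.pool", noteId)
          try job
          finally sc.setLocalProperty("spark.scheduler.pool", null)
        }
      })
      t.start()
      t
    }

This assumes the context was created with spark.scheduler.mode=FAIR, as
the scheduling guide linked above describes.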
>>> >>                         > On Sat, Aug 15, 2015 at 12:01 PM, Pranav
>>> >>                         > Kumar Agarwal <praagarw@gmail.com> wrote:
>>> >>                         >
>>> >>                         >>> If someone can share about the idea of
>>> >>                         >>> sharing a single SparkContext through
>>> >>                         >>> multiple SparkILoops safely, it'll be
>>> >>                         >>> really helpful.
>>> >>                         >> Here is a proposal:
>>> >>                         >> 1. In Spark code, change SparkIMain.scala to
>>> >>                         >> allow setting the virtual directory. While
>>> >>                         >> creating new instances of SparkIMain per
>>> >>                         >> notebook from the zeppelin spark interpreter,
>>> >>                         >> set all the instances of SparkIMain to the
>>> >>                         >> same virtual directory.
>>> >>                         >> 2. Start an HTTP server on that virtual
>>> >>                         >> directory and set this HTTP server in the
>>> >>                         >> Spark Context using the classserverUri method.
>>> >>                         >> 3. Scala generated code has a notion of
>>> >>                         >> packages. The default package name is
>>> >>                         >> "line$<linenumber>". The package name can be
>>> >>                         >> controlled using the System Property
>>> >>                         >> scala.repl.name.line. Setting this property
>>> >>                         >> to the "notebook id" ensures that code
>>> >>                         >> generated by individual instances of
>>> >>                         >> SparkIMain is isolated from other instances
>>> >>                         >> of SparkIMain.
>>> >>                         >> 4. Build a queue inside the interpreter to
>>> >>                         >> allow only one paragraph execution at a time
>>> >>                         >> per notebook.
>>> >>                         >>
>>> >>                         >> I have tested 1, 2, and 3, and it seems to
>>> >>                         >> provide isolation across classnames. I'll
>>> >>                         >> work towards submitting a formal patch soon -
>>> >>                         >> is there any Jira already for the same that I
>>> >>                         >> can uptake? Also I need to understand:
>>> >>                         >> 1. How does Zeppelin uptake Spark fixes? OR
>>> >>                         >> do I need to first work towards getting Spark
>>> >>                         >> changes merged in the Apache Spark github?
>>> >>                         >>
>>> >>                         >> Any suggestions or comments on the proposal
>>> >>                         >> are highly welcome.
>>> >>                         >>
>>> >>                         >> Regards,
>>> >>                         >> -Pranav.
>>> >>                         >>
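
A sketch of steps 1 and 3 of this proposal against Scala 2.10/Spark
1.4-era REPL internals (the HTTP server and classserverUri wiring of
step 2 is omitted); the property and settings names are as cited in the
thread, the wrapper itself is illustrative:

    import java.io.File
    import scala.tools.nsc.Settings

    // Compiler settings for one notebook's SparkIMain: every notebook
    // compiles into the same shared directory (step 1), and generated
    // package names embed the note id (step 3) so class names from
    // different notebooks cannot collide on the executors.
    def replSettingsFor(noteId: String, sharedClassDir: File): Settings = {
      // JVM-global property, so it must be set just before constructing
      // each SparkIMain instance.
      System.setProperty("scala.repl.name.line", "$note_" + noteId)
      val settings = new Settings()
      settings.outputDirs.setSingleOutput(sharedClassDir.getAbsolutePath)
      settings.usejavacp.value = true
      settings
    }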
>>> >>                         >>> On 10/08/15 11:36 pm, moon soo Lee wrote:
>>> >>                         >>>
>>> >>                         >>> Hi piyush,
>>> >>                         >>>
>>> >>                         >>> Separate instances of SparkILoop and
>>> >>                         >>> SparkIMain for each notebook while sharing
>>> >>                         >>> the SparkContext sounds great.
>>> >>                         >>>
>>> >>                         >>> Actually, I tried to do it and found a
>>> >>                         >>> problem: multiple SparkILoops could generate
>>> >>                         >>> the same class name, and the spark executor
>>> >>                         >>> confuses classnames since they're reading
>>> >>                         >>> classes from a single SparkContext.
>>> >>                         >>>
>>> >>                         >>> If someone can share an idea for sharing a
>>> >>                         >>> single SparkContext through multiple
>>> >>                         >>> SparkILoops safely, it'll be really helpful.
>>> >>                         >>>
>>> >>                         >>> Thanks,
>>> >>                         >>> moon
>>> >>                         >>>
>>> >>                         >>>
>>> >>                         >>> On Mon, Aug 10, 2015 at 1:21 AM Piyush
>>> >>                         >>> Mukati (Data Platform)
>>> >>                         >>> <piyush.mukati@flipkart.com> wrote:
>>> >>                         >>>
>>> >>                         >>>    Hi Moon,
>>> >>                         >>>    Any suggestion on it? We have to wait a
>>> >>                         >>> lot when multiple people are working with
>>> >>                         >>> spark.
>>> >>                         >>>    Can we create separate instances of
>>> >>                         >>> SparkILoop, SparkIMain and printstreams for
>>> >>                         >>> each notebook while sharing the SparkContext,
>>> >>                         >>> ZeppelinContext, SQLContext and
>>> >>                         >>> DependencyResolver, and then use a parallel
>>> >>                         >>> scheduler?
>>> >>                         >>> thanks
>>> >>                         >>>
>>> >>                         >>> -piyush
>>> >>                         >>>
>>> >>                         >>>    Hi Moon,
>>> >>                         >>>
>>> >>                         >>>    How about tracking a dedicated
>>> >>                         >>> SparkContext for a notebook in Spark's
>>> >>                         >>> remote interpreter - this will allow
>>> >>                         >>> multiple users to run their spark paragraphs
>>> >>                         >>> in parallel. Also, within a notebook only
>>> >>                         >>> one paragraph is executed at a time.
>>> >>                         >>>
>>> >>                         >>> Regards,
>>> >>                         >>> -Pranav.
>>> >>                         >>>
>>> >>                         >>>
>>> >>                         >>>> On 15/07/15 7:15 pm, moon soo Lee wrote:
>>> >>                         >>>> Hi,
>>> >>                         >>>>
>>> >>                         >>>> Thanks for asking the question.
>>> >>                         >>>>
>>> >>                         >>>> The reason is simply because it is running
>>> >>                         >>>> code statements. The statements can have
>>> >>                         >>>> order and dependency. Imagine I have two
>>> >>                         >>> paragraphs:
>>> >>                         >>>>
>>> >>                         >>>> %spark
>>> >>                         >>>> val a = 1
>>> >>                         >>>>
>>> >>                         >>>> %spark
>>> >>                         >>>> print(a)
>>> >>                         >>>>
>>> >>                         >>>> If they're not running one by one, that
>>> >>                         >>>> means they possibly run in random order and
>>> >>                         >>>> the output will be always different. Either
>>> >>                         >>>> '1' or 'val a can not found'.
>>> >>                         >>>>
>>> >>                         >>>> This is the reason why. But if there is a
>>> >>                         >>>> nice idea to handle this problem, I agree
>>> >>                         >>>> using a parallel scheduler would help a lot.
>>> >>                         >>>>
>>> >>                         >>>> Thanks,
>>> >>                         >>>> moon
>>> >>                         >>>> On Jul 14, 2015 (Tue) at 7:59 PM linxi zeng
>>> >>                         >>>> <linxizeng0615@gmail.com> wrote:
>>> >>                         >>>>
>>> >>                         >>>> anyone who has the same question as me? or
>>> >>                         >>> is this not a question?
>>> >>                         >>>>
>>> >>                         >>>> 2015-07-14 11:47 GMT+08:00 linxi zeng
>>> >>                         >>>> <linxizeng0615@gmail.com>:
>>> >>                         >>>>
>>> >>                         >>>>     hi, Moon:
>>> >>                         >>>>        I notice that the getScheduler
>>> >>                         >>>> function in SparkInterpreter.java returns a
>>> >>                         >>>> FIFOScheduler, which makes the spark
>>> >>                         >>>> interpreter run spark jobs one by one. It's
>>> >>                         >>>> not a good experience when a couple of users
>>> >>                         >>>> do some work on zeppelin at the same time,
>>> >>                         >>>> because they have to wait for each other.
>>> >>                         >>>>     And at the same time, SparkSqlInterpreter
>>> >>                         >>>> can choose what scheduler to use via
>>> >>                         >>>> "zeppelin.spark.concurrentSQL".
>>> >>                         >>>>     My question is: what kind of
>>> >>                         >>>> consideration did you base such a decision
>>> >>                         >>>> on?
>>> >>                         >>>
>>> >>                         >>>
>>> >>                         >>>
>>> >>                         >>>
>>> >>                         >>>
>>> >>
>>> >>                         >>
>>> >>
>>> >>
>>> >
>>>
>>>
>
>

Re: why zeppelin SparkInterpreter use FIFOScheduler

Posted by "Piyush Mukati (Data Platform)" <pi...@flipkart.com>.
Hi,
we have passed the InterpreterContext to completion(); it is working well
on my local dev setup.
but after
mvn clean package -P build-distr -Pspark-1.4 -Dhadoop.version=2.6.0
-Phadoop-2.6 -Pyarn
I copied zeppelin-0.6.0-incubating-SNAPSHOT.tar.gz to some other machine;
while running from there it always shows disconnected, no notebooks are
shown, and I am not able to create any notebook either.


I am not seeing anything in the logs. Can anyone please suggest how I can
debug further into it?
thanks.

On Wed, Aug 26, 2015 at 8:27 PM, moon soo Lee <mo...@apache.org> wrote:

> Hi Pranav,
>
> Thanks for sharing the plan.
> I think passing InterpreterContext to completion() makes sense.
> Although it changes the interpreter API, changing it now looks better than later.
>
> Thanks.
> moon
>
> On Tue, Aug 25, 2015 at 11:22 PM Pranav Kumar Agarwal <pr...@gmail.com>
> wrote:
>
>> Hi Moon,
>>
>> > I think releasing SparkIMain and related objects
>> By packaging I meant to ask: what is the process to "release SparkIMain
>> and related objects" for Zeppelin's code uptake?
>>
>> I have one more question:
>> Most of the changes to allow SparkInterpreter to support ParallelScheduler
>> are implemented, but I'm struggling with the completion feature. Since I
>> have a SparkIMain interpreter for each notebook, completion functionality
>> is not working as expected because the completion method doesn't have an
>> InterpreterContext. I need to be able to pull the notebook-specific
>> SparkIMain interpreter to return correct completion results, and for that
>> I need to know the notebook id at the time of the completion call.
>>
>> I'm planning to change the Interpreter.java abstract method completion
>> to pass InterpreterContext along with the buffer and cursor location. This
>> will require refactoring all the Interpreters. It's a change in the
>> contract, so I thought I would run it by you before embarking on it...
>>
>> Please let me know your thoughts.
>>
>> Regards,
>> -Pranav.
>>
>> On 18/08/15 8:04 am, moon soo Lee wrote:
>> > Could you explain a little bit more about the package changes you mean?
>> >
>> > Thanks,
>> > moon
>> >
>> > On Mon, Aug 17, 2015 at 10:27 AM Pranav Agarwal <praagarw@gmail.com> wrote:
>> >
>> >     Any thoughts on how to package changes related to Spark?
>> >
>> >     On 17-Aug-2015 7:58 pm, "moon soo Lee" <moon@apache.org> wrote:
>> >
>> >         I think releasing SparkIMain and related objects after
>> >         configurable inactivity would be good for now.
>> >
>> >         About scheduler, I can help implementing such scheduler.
>> >
>> >         Thanks,
>> >         moon
>> >
>> >         On Sun, Aug 16, 2015 at 11:54 PM Pranav Kumar Agarwal
>> >         <praagarw@gmail.com> wrote:
>> >
>> >             Hi Moon,
>> >
>> >             Yes, the notebook id comes from InterpreterContext. At the
>> >             moment, destroying SparkIMain on deletion of a notebook is
>> >             not handled. I think SparkIMain is a lightweight object;
>> >             do you see a concern with keeping these objects in a map?
>> >             One possible option could be to destroy notebook-related
>> >             objects when the inactivity on a notebook is greater than,
>> >             say, 8 hours.
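
A minimal sketch of the per-notebook bookkeeping discussed above, assuming a
createSparkIMain() helper and a configurable idle window; the names are
illustrative, not Zeppelin's actual code:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import org.apache.spark.repl.SparkIMain;

    abstract class NotebookInterpreters {
      private static final long MAX_IDLE_MS = 8 * 60 * 60 * 1000L; // 8 hours
      private final Map<String, SparkIMain> byNote = new ConcurrentHashMap<>();
      private final Map<String, Long> lastUsed = new ConcurrentHashMap<>();

      /** Builds a fresh per-notebook interpreter; implementation not shown. */
      abstract SparkIMain createSparkIMain(String noteId);

      synchronized SparkIMain forNote(String noteId) {
        lastUsed.put(noteId, System.currentTimeMillis());
        SparkIMain intp = byNote.get(noteId);
        if (intp == null) {
          intp = createSparkIMain(noteId);
          byNote.put(noteId, intp);
        }
        return intp;
      }

      /** Called periodically; frees interpreters idle longer than the window. */
      synchronized void evictIdle() {
        long now = System.currentTimeMillis();
        for (Map.Entry<String, Long> e : lastUsed.entrySet()) {
          if (now - e.getValue() > MAX_IDLE_MS) {
            byNote.remove(e.getKey()).close(); // IMain exposes close()
            lastUsed.remove(e.getKey());
          }
        }
      }
    }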
>> >
>> >
>> >>             >> 4. Build a queue inside the interpreter to allow only one
>> >>             >> paragraph execution at a time per notebook.
>> >>
>> >>             One downside of this approach is that the GUI will display
>> >>             RUNNING instead of PENDING for jobs inside the queue in the
>> >>             interpreter.
>> >             Yes, that's a good point. Having a scheduler at the
>> >             Zeppelin server that is parallel across notebooks and
>> >             FIFO across paragraphs within a notebook would be nice.
>> >             Is there any plan for having such a scheduler?
>> >
>> >             Regards,
>> >             -Pranav.
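
A rough sketch of such a scheduler - parallel across notebooks, FIFO within
a notebook - assuming paragraph jobs can be modeled as Runnables keyed by
note id:

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    class PerNotebookScheduler {
      // One single-threaded queue per notebook: paragraphs of a note run
      // FIFO, while different notes proceed in parallel.
      private final ConcurrentHashMap<String, ExecutorService> lanes =
          new ConcurrentHashMap<>();

      void submit(String noteId, Runnable paragraphJob) {
        lanes.computeIfAbsent(noteId, id -> Executors.newSingleThreadExecutor())
             .submit(paragraphJob);
      }
    }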
>> >
>> >
>> >             On 17/08/15 5:38 am, moon soo Lee wrote:
>> >>             Pranav, proposal looks awesome!
>> >>
>> >>             I have a question and feedback,
>> >>
>> >>             You said you tested 1, 2 and 3. To create a SparkIMain per
>> >>             notebook, you need the notebook id. Did you
>> >>             get it from InterpreterContext?
>> >>             Then how did you handle destroying the SparkIMain (when
>> >>             a notebook is deleted)?
>> >>             As far as I know, the interpreter is not able to get
>> >>             information about notebook deletion.
>> >>
>> >>             >> 4. Build a queue inside the interpreter to allow only one
>> >>             >> paragraph execution at a time per notebook.
>> >>
>> >>             One downside of this approach is that the GUI will display
>> >>             RUNNING instead of PENDING for jobs inside the queue in the
>> >>             interpreter.
>> >>
>> >>             Best,
>> >>             moon
>> >>
>> >>             On Sun, Aug 16, 2015 at 12:55 AM IT CTO
>> >>             <goi.cto@gmail.com> wrote:
>> >>
>> >>                 +1 for "to re-factor the Zeppelin architecture so
>> >>                 that it can handle multi-tenancy easily"
>> >>
>> >>                 On Sun, Aug 16, 2015 at 9:47 AM DuyHai Doan
>> >>                 <doanduyhai@gmail.com> wrote:
>> >>
>> >>                     Agree with Joel, we may think about re-factoring the
>> >>                     Zeppelin architecture so that it can handle
>> >>                     multi-tenancy easily. The technical solution
>> >>                     proposed by Pranav is great, but it only applies
>> >>                     to Spark. Right now, each interpreter has to
>> >>                     manage multi-tenancy its own way. Ultimately
>> >>                     Zeppelin could propose a multi-tenancy
>> >>                     contract/info (like a UserContext, similar to
>> >>                     InterpreterContext) that each interpreter can
>> >>                     choose to use or not.
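
A purely hypothetical shape for such a contract - no UserContext exists in
Zeppelin today, and the names are illustrative:

    // Passed alongside InterpreterContext so any interpreter that cares
    // about multi-tenancy can partition its state per user.
    public interface UserContext {
      String getUserName();  // identity the paragraph runs as
      String getNoteId();    // notebook the request belongs to
    }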
>> >>
>> >>
>> >>                     On Sun, Aug 16, 2015 at 3:09 AM, Joel Zambrano
>> >>                     <djoelz@gmail.com> wrote:
>> >>
>> >>                         I think the idea of running multiple
>> >>                         notes simultaneously is great, but it is really
>> >>                         dancing around the lack of true multi-user
>> >>                         support in Zeppelin. While the proposed
>> >>                         solution would work if the application's
>> >>                         resources are those of the whole cluster, if
>> >>                         the app is limited (say it has 8 cores of
>> >>                         16, with some distribution in memory) then
>> >>                         potentially your note can hog all the
>> >>                         resources and the scheduler will have to
>> >>                         throttle all other executions, leaving you
>> >>                         exactly where you are now.
>> >>                         While I think the solution is a good one,
>> >>                         maybe this question makes us think about adding
>> >>                         true multi-user support,
>> >>                         where we isolate resources (cluster and the
>> >>                         notebooks themselves), have separate
>> >>                         login/identity and (I don't know if it's
>> >>                         possible) share the same context.
>> >>
>> >>                         Thanks,
>> >>                         Joel
>> >>
>> >>                         > On Aug 15, 2015, at 1:58 PM, Rohit Agarwal
>> >>                         > <mindprince@gmail.com> wrote:
>> >>                         >
>> >>                         > If the problem is that multiple users have to
>> >>                         > wait for each other while using Zeppelin, the
>> >>                         > solution already exists: they can create a new
>> >>                         > interpreter by going to the interpreter page
>> >>                         > and attaching it to their notebook - then they
>> >>                         > don't have to wait for others to submit their
>> >>                         > job.
>> >>                         >
>> >>                         > But I agree, having paragraphs from one note
>> >>                         > wait for paragraphs from other notes is a
>> >>                         > confusing default. We can get around that in
>> >>                         > two ways:
>> >>                         >
>> >>                         >   1. Create a new interpreter for each note
>> >>                         >   and attach that interpreter to that note.
>> >>                         >   This approach would require the least amount
>> >>                         >   of code changes, but it is resource heavy
>> >>                         >   and doesn't let you share the Spark Context
>> >>                         >   between different notes.
>> >>                         >   2. If we want to share the Spark Context
>> >>                         >   between different notes, we can submit jobs
>> >>                         >   from different notes into different
>> >>                         >   fair-scheduler pools (
>> >>                         >   https://spark.apache.org/docs/1.4.0/job-scheduling.html#scheduling-within-an-application ).
>> >>                         >   This can be done by submitting jobs from
>> >>                         >   different notes in different threads. This
>> >>                         >   will make sure that jobs from one note are
>> >>                         >   run sequentially, but jobs from different
>> >>                         >   notes will be able to run in parallel.
>> >>                         >
>> >>                         > Neither of these options requires any change
>> >>                         > in the Spark code.
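
A minimal sketch of option 2 above, assuming the SparkContext was created
with spark.scheduler.mode=FAIR and that each note submits its jobs from a
dedicated thread:

    import org.apache.spark.SparkContext;

    class PoolSubmitter {
      // Local properties are per-thread, so every job submitted from this
      // thread lands in the note's own fair-scheduler pool.
      static void runForNote(final SparkContext sc, final String noteId,
                             final Runnable sparkJob) {
        new Thread(new Runnable() {
          public void run() {
            sc.setLocalProperty("spark.scheduler.pool", noteId);
            sparkJob.run();  // any Spark actions here share the note's pool
          }
        }).start();
      }
    }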
>> >>                         >
>> >>                         > --
>> >>                         > Thanks & Regards
>> >>                         > Rohit Agarwal
>> >>                         > https://www.linkedin.com/in/rohitagarwal003
>> >>                         >
>> >>                         > On Sat, Aug 15, 2015 at 12:01 PM, Pranav
>> >>                         > Kumar Agarwal <praagarw@gmail.com> wrote:
>> >>                         >
>> >>                         >>> If someone can share an idea for sharing a
>> >>                         >>> single SparkContext through multiple
>> >>                         >>> SparkILoops safely, it'll be really helpful.
>> >>                         >> Here is a proposal:
>> >>                         >> 1. In Spark code, change SparkIMain.scala
>> >>                         >> to allow setting the virtual directory.
>> >>                         >> While creating new instances of SparkIMain
>> >>                         >> per notebook from the Zeppelin Spark
>> >>                         >> interpreter, set all the instances of
>> >>                         >> SparkIMain to the same virtual directory.
>> >>                         >> 2. Start an HTTP server on that virtual
>> >>                         >> directory and set this HTTP server in the
>> >>                         >> Spark Context using the classserverUri method.
>> >>                         >> 3. Scala generated code has a notion of
>> >>                         >> packages. The default package name is
>> >>                         >> "line$<linenumber>". The package name can be
>> >>                         >> controlled using the system property
>> >>                         >> scala.repl.name.line. Setting this property
>> >>                         >> to the "notebook id" ensures that code
>> >>                         >> generated by individual instances of
>> >>                         >> SparkIMain is isolated from other instances
>> >>                         >> of SparkIMain.
>> >>                         >> 4. Build a queue inside the interpreter to
>> >>                         >> allow only one paragraph execution at a time
>> >>                         >> per notebook.
>> >>                         >>
>> >>                         >> I have tested 1, 2, and 3, and this seems to
>> >>                         >> provide isolation across class names. I'll
>> >>                         >> work towards submitting a formal patch soon -
>> >>                         >> is there already a Jira for this that I can
>> >>                         >> pick up? Also I need to understand:
>> >>                         >> 1. How does Zeppelin pick up Spark fixes? Or
>> >>                         >> do I need to first work towards getting the
>> >>                         >> Spark changes merged into Apache Spark on
>> >>                         >> GitHub?
>> >>                         >>
>> >>                         >> Any suggestions or comments on the proposal
>> >>                         >> are highly welcome.
>> >>                         >>
>> >>                         >> Regards,
>> >>                         >> -Pranav.
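
A condensed sketch of points 1-3 above; the property name comes from the
proposal itself, but the surrounding details are approximations of the
Spark 1.x REPL internals rather than a verified patch:

    // Before building a notebook's SparkIMain, isolate the names of its
    // generated classes, so the shared SparkContext's executors never see
    // two different classes with the same "line..." name.
    System.setProperty("scala.repl.name.line", "notebook_" + noteId);

    // Points 1 and 2: every SparkIMain instance would write to the same
    // virtual directory, served by one HTTP server, and the SparkContext
    // would be pointed at that single server (the classserverUri step).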
>> >>                         >>
>> >>                         >>> On 10/08/15 11:36 pm, moon soo Lee wrote:
>> >>                         >>>
>> >>                         >>> Hi piyush,
>> >>                         >>>
>> >>                         >>> Separate instances of SparkILoop and
>> >>                         >>> SparkIMain for each notebook while
>> >>                         >>> sharing the SparkContext sounds great.
>> >>                         >>>
>> >>                         >>> Actually, I tried to do it and found a
>> >>                         >>> problem: multiple SparkILoops could generate
>> >>                         >>> the same class name, and the Spark executor
>> >>                         >>> confuses class names since it reads classes
>> >>                         >>> from a single SparkContext.
>> >>                         >>>
>> >>                         >>> If someone can share an idea for sharing a
>> >>                         >>> single SparkContext through multiple
>> >>                         >>> SparkILoops safely, it'll be really helpful.
>> >>                         >>>
>> >>                         >>> Thanks,
>> >>                         >>> moon
>> >>                         >>>
>> >>                         >>>
>> >>                         >>> On Mon, Aug 10, 2015 at 1:21 AM Piyush
>> >>                         >>> Mukati (Data Platform)
>> >>                         >>> <piyush.mukati@flipkart.com> wrote:
>> >>                         >>>
>> >>                         >>>    Hi Moon,
>> >>                         >>>    Any suggestion on it? We have to wait a
>> >>                         >>>    lot when multiple people are working
>> >>                         >>>    with Spark.
>> >>                         >>>    Can we create separate instances of
>> >>                         >>>    SparkILoop, SparkIMain and print streams
>> >>                         >>>    for each notebook while sharing the
>> >>                         >>>    SparkContext, ZeppelinContext, SQLContext
>> >>                         >>>    and DependencyResolver, and then use a
>> >>                         >>>    parallel scheduler?
>> >>                         >>>    thanks
>> >>                         >>>
>> >>                         >>>    -piyush
>> >>                         >>>
>> >>                         >>>    Hi Moon,
>> >>                         >>>
>> >>                         >>>    How about tracking a dedicated
>> >>                         >>> SparkContext for a notebook in Spark's
>> >>                         >>> remote interpreter - this will allow
>> >>                         >>> multiple users to run their Spark
>> >>                         >>> paragraphs in parallel. Also, within a
>> >>                         >>> notebook only one paragraph is
>> >>                         >>> executed at a time.
>> >>                         >>>
>> >>                         >>> Regards,
>> >>                         >>> -Pranav.
>> >>                         >>>
>> >>                         >>>
>> >>                         >>>> On 15/07/15 7:15 pm, moon soo Lee wrote:
>> >>                         >>>> Hi,
>> >>                         >>>>
>> >>                         >>>> Thanks for asking the question.
>> >>                         >>>>
>> >>                         >>>> The reason is simply that it is running
>> >>                         >>>> code statements. The statements can have
>> >>                         >>>> order and dependency. Imagine I have two
>> >>                         >>>> paragraphs:
>> >>                         >>>>
>> >>                         >>>> %spark
>> >>                         >>>> val a = 1
>> >>                         >>>>
>> >>                         >>>> %spark
>> >>                         >>>> print(a)
>> >>                         >>>>
>> >>                         >>>> If they're not run one by one, they may run
>> >>                         >>>> in random order and the output will differ
>> >>                         >>>> between runs: either '1' or 'val a cannot
>> >>                         >>>> be found'.
>> >>                         >>>>
>> >>                         >>>> This is the reason why. But if there is a
>> >>                         >>>> nice idea for handling this problem, I
>> >>                         >>>> agree a parallel scheduler would help a lot.
>> >>                         >>>>
>> >>                         >>>> Thanks,
>> >>                         >>>> moon
>> >>                         >>>> On July 14, 2015 (Tue) at 7:59 PM,
>> >>                         >>>> linxi zeng <linxizeng0615@gmail.com>
>> >>                         >>>> wrote:
>> >>                         >>>>
>> >>                         >>>> Anyone who has the same question as me?
>> >>                         >>>> Or is this not a question?
>> >>                         >>>>
>> >>                         >>>> 2015-07-14 11:47 GMT+08:00 linxi zeng
>> >>                         >>>> <linxizeng0615@gmail.com>:
>> >>                         >>>>
>> >>                         >>>>     hi, Moon:
>> >>                         >>>>        I notice that the getScheduler
>> >>                         >>>>     function in SparkInterpreter.java
>> >>                         >>>>     returns a FIFOScheduler, which makes
>> >>                         >>>>     the Spark interpreter run Spark jobs
>> >>                         >>>>     one by one. It's not a good experience
>> >>                         >>>>     when a couple of users do some work on
>> >>                         >>>>     Zeppelin at the same time, because they
>> >>                         >>>>     have to wait for each other.
>> >>                         >>>>     And at the same time,
>> >>                         >>>>     SparkSqlInterpreter can choose which
>> >>                         >>>>     scheduler to use via
>> >>                         >>>>     "zeppelin.spark.concurrentSQL".
>> >>                         >>>>     My question is, what kind of
>> >>                         >>>>     consideration was such a decision
>> >>                         >>>>     based on?
>> >>                         >>>
>> >>                         >>>
>> >>                         >>>
>> >>                         >>>
>> >>                         >>>
>> >>
>> >>                         >>
>> >>
>> >>
>> >
>>
>>



Re: why zeppelin SparkInterpreter use FIFOScheduler

Posted by moon soo Lee <mo...@apache.org>.
Hi Pranav,

Thanks for sharing the plan.
I think passing InterpreterContext to completion() makes sense.
Although it changes the interpreter API, changing it now looks better than later.

Thanks.
moon

On Tue, Aug 25, 2015 at 11:22 PM Pranav Kumar Agarwal <pr...@gmail.com>
wrote:

> Hi Moon,
>
> > I think releasing SparkIMain and related objects
> By packaging I meant to ask: what is the process for getting the "release
> SparkIMain and related objects" change into Zeppelin's codebase?
>
> I have one more question:
> Most of the changes to allow SparkInterpreter to support ParallelScheduler
> are implemented, but I'm struggling with the completion feature. Since I
> have a SparkIMain interpreter for each notebook, completion functionality
> is not working as expected, because the completion method doesn't receive
> an InterpreterContext. I need to be able to pull the notebook-specific
> SparkIMain interpreter to return correct completion results, and for
> that I need to know the notebook id at the time of the completion call.
>
> I'm planning to change the Interpreter.java abstract method completion()
> to pass InterpreterContext along with the buffer and cursor location. This
> will require refactoring all the Interpreters. It's a change in the
> contract, so I thought I would run it by you before embarking on it...
>
> Please let me know your thoughts.
>
> Regards,
> -Pranav.
>
> On 18/08/15 8:04 am, moon soo Lee wrote:
> > Could you explain a little bit more about the package changes you mean?
> >
> > Thanks,
> > moon
> >
> > On Mon, Aug 17, 2015 at 10:27 AM Pranav Agarwal <praagarw@gmail.com> wrote:
> >
> >     Any thoughts on how to package changes related to Spark?
> >
> >     On 17-Aug-2015 7:58 pm, "moon soo Lee" <moon@apache.org> wrote:
> >
> >         I think releasing SparkIMain and related objects after
> >         configurable inactivity would be good for now.
> >
> >         About the scheduler, I can help implement such a scheduler.
> >
> >         Thanks,
> >         moon
> >
> >         On Sun, Aug 16, 2015 at 11:54 PM Pranav Kumar Agarwal
> >         <praagarw@gmail.com> wrote:
> >
> >             Hi Moon,
> >
> >             Yes, the notebook id comes from InterpreterContext. At the
> >             moment, destroying SparkIMain on deletion of a notebook is
> >             not handled. I think SparkIMain is a lightweight object;
> >             do you see a concern with keeping these objects in a map?
> >             One possible option could be to destroy notebook-related
> >             objects when the inactivity on a notebook is greater than,
> >             say, 8 hours.
> >
> >
> >>             >> 4. Build a queue inside the interpreter to allow only one
> >>             >> paragraph execution at a time per notebook.
> >>
> >>             One downside of this approach is that the GUI will display
> >>             RUNNING instead of PENDING for jobs inside the queue in the
> >>             interpreter.
> >             Yes, that's a good point. Having a scheduler at the
> >             Zeppelin server that is parallel across notebooks and
> >             FIFO across paragraphs within a notebook would be nice.
> >             Is there any plan for having such a scheduler?
> >
> >             Regards,
> >             -Pranav.
> >
> >
> >             On 17/08/15 5:38 am, moon soo Lee wrote:
> >>             Pranav, proposal looks awesome!
> >>
> >>             I have a question and feedback,
> >>
> >>             You said you tested 1, 2 and 3. To create a SparkIMain per
> >>             notebook, you need the notebook id. Did you
> >>             get it from InterpreterContext?
> >>             Then how did you handle destroying the SparkIMain (when
> >>             a notebook is deleted)?
> >>             As far as I know, the interpreter is not able to get
> >>             information about notebook deletion.
> >>
> >>             >> 4. Build a queue inside the interpreter to allow only one
> >>             >> paragraph execution at a time per notebook.
> >>
> >>             One downside of this approach is that the GUI will display
> >>             RUNNING instead of PENDING for jobs inside the queue in the
> >>             interpreter.
> >>
> >>             Best,
> >>             moon
> >>
> >>             On Sun, Aug 16, 2015 at 12:55 AM IT CTO
> >>             <goi.cto@gmail.com> wrote:
> >>
> >>                 +1 for "to re-factor the Zeppelin architecture so
> >>                 that it can handle multi-tenancy easily"
> >>
> >>                 On Sun, Aug 16, 2015 at 9:47 AM DuyHai Doan
> >>                 <doanduyhai@gmail.com> wrote:
> >>
> >>                     Agree with Joel, we may think about re-factoring the
> >>                     Zeppelin architecture so that it can handle
> >>                     multi-tenancy easily. The technical solution
> >>                     proposed by Pranav is great, but it only applies
> >>                     to Spark. Right now, each interpreter has to
> >>                     manage multi-tenancy its own way. Ultimately
> >>                     Zeppelin could propose a multi-tenancy
> >>                     contract/info (like a UserContext, similar to
> >>                     InterpreterContext) that each interpreter can
> >>                     choose to use or not.
> >>
> >>
> >>                     On Sun, Aug 16, 2015 at 3:09 AM, Joel Zambrano
> >>                     <djoelz@gmail.com> wrote:
> >>
> >>                         I think the idea of running multiple
> >>                         notes simultaneously is great, but it is really
> >>                         dancing around the lack of true multi-user
> >>                         support in Zeppelin. While the proposed
> >>                         solution would work if the application's
> >>                         resources are those of the whole cluster, if
> >>                         the app is limited (say it has 8 cores of
> >>                         16, with some distribution in memory) then
> >>                         potentially your note can hog all the
> >>                         resources and the scheduler will have to
> >>                         throttle all other executions, leaving you
> >>                         exactly where you are now.
> >>                         While I think the solution is a good one,
> >>                         maybe this question makes us think about adding
> >>                         true multi-user support,
> >>                         where we isolate resources (cluster and the
> >>                         notebooks themselves), have separate
> >>                         login/identity and (I don't know if it's
> >>                         possible) share the same context.
> >>
> >>                         Thanks,
> >>                         Joel
> >>
> >>                         > On Aug 15, 2015, at 1:58 PM, Rohit Agarwal
> >>                         > <mindprince@gmail.com> wrote:
> >>                         >
> >>                         > If the problem is that multiple users have to
> >>                         > wait for each other while using Zeppelin, the
> >>                         > solution already exists: they can create a new
> >>                         > interpreter by going to the interpreter page
> >>                         > and attaching it to their notebook - then they
> >>                         > don't have to wait for others to submit their
> >>                         > job.
> >>                         >
> >>                         > But I agree, having paragraphs from one note
> >>                         > wait for paragraphs from other notes is a
> >>                         > confusing default. We can get around that in
> >>                         > two ways:
> >>                         >
> >>                         >   1. Create a new interpreter for each note
> >>                         >   and attach that interpreter to that note.
> >>                         >   This approach would require the least amount
> >>                         >   of code changes, but it is resource heavy
> >>                         >   and doesn't let you share the Spark Context
> >>                         >   between different notes.
> >>                         >   2. If we want to share the Spark Context
> >>                         >   between different notes, we can submit jobs
> >>                         >   from different notes into different
> >>                         >   fair-scheduler pools (
> >>                         >   https://spark.apache.org/docs/1.4.0/job-scheduling.html#scheduling-within-an-application ).
> >>                         >   This can be done by submitting jobs from
> >>                         >   different notes in different threads. This
> >>                         >   will make sure that jobs from one note are
> >>                         >   run sequentially, but jobs from different
> >>                         >   notes will be able to run in parallel.
> >>                         >
> >>                         > Neither of these options requires any change
> >>                         > in the Spark code.
> >>                         >
> >>                         > --
> >>                         > Thanks & Regards
> >>                         > Rohit Agarwal
> >>                         > https://www.linkedin.com/in/rohitagarwal003
> >>                         >
> >>                         > On Sat, Aug 15, 2015 at 12:01 PM, Pranav
> >>                         > Kumar Agarwal <praagarw@gmail.com> wrote:
> >>                         >
> >>                         >> If someone can share about the idea of
> >>                         sharing single SparkContext through
> >>                         >>> multiple SparkILoop safely, it'll be
> >>                         really helpful.
> >>                         >> Here is a proposal:
> >>                         >> 1. In Spark code, change SparkIMain.scala
> >>                         to allow setting the virtual
> >>                         >> directory. While creating new instances of
> >>                         SparkIMain per notebook from
> >>                         >> zeppelin spark interpreter set all the
> >>                         instances of SparkIMain to the same
> >>                         >> virtual directory.
> >>                         >> 2. Start HTTP server on that virtual
> >>                         directory and set this HTTP server in
> >>                         >> Spark Context using classserverUri method
> >>                         >> 3. Scala generated code has a notion of
> >>                         packages. The default package name
> >>                         >> is "line$<linenumber>". Package name can
> >>                         be controlled using System
> >>                         >> Property scala.repl.name.line. Setting
> >>                         this property to "notebook id"
> >>                         >> ensures that code generated by individual
> >>                         instances of SparkIMain is
> >>                         >> isolated from other instances of SparkIMain
> >>                         >> 4. Build a queue inside interpreter to
> >>                         allow only one paragraph execution
> >>                         >> at a time per notebook.
> >>                         >>
> >>                         >> I have tested 1, 2, and 3 and it seems to
> >>                         provide isolation across
> >>                         >> classnames. I'll work towards submitting a
> >>                         formal patch soon - Is there any
> >>                         >> Jira already for the same that I can
> >>                         uptake? Also I need to understand:
> >>                         >> 1. How does Zeppelin uptake Spark fixes?
> >>                         OR do I need to first work
> >>                         >> towards getting Spark changes merged in
> >>                         Apache Spark github?
> >>                         >>
> >>                         >> Any suggestions or comments on the
> >>                         proposal are highly welcome.
> >>                         >>
> >>                         >> Regards,
> >>                         >> -Pranav.
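
A rough sketch of how steps 1-4 could fit together, assuming the
SparkIMain(Settings, PrintWriter, boolean) constructor shape of Spark 1.x
and the scala.repl.name.line property mentioned above; illustrative only,
not the actual patch:

import java.io.PrintWriter;
import java.util.HashMap;
import java.util.Map;
import org.apache.spark.repl.SparkIMain;
import scala.tools.nsc.Settings;

public class NotebookScopedRepl {
  // all instances compile into one shared (virtual) output directory,
  // which is what the proposed SparkIMain.scala change would allow
  private final Settings settings;
  private final Map<String, SparkIMain> repls = new HashMap<>();

  public NotebookScopedRepl(Settings settings) {
    this.settings = settings;
  }

  public synchronized SparkIMain forNotebook(String noteId) {
    SparkIMain repl = repls.get(noteId);
    if (repl == null) {
      // assumption: prefixing generated packages with the notebook id
      // (instead of the default "line$<n>") keeps class names from
      // colliding inside the shared SparkContext
      System.setProperty("scala.repl.name.line", noteId);
      repl = new SparkIMain(settings, new PrintWriter(System.out), false);
      repls.put(noteId, repl);
    }
    return repl;
  }
}
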
> >>                         >>
> >>                         >>> On 10/08/15 11:36 pm, moon soo Lee wrote:
> >>                         >>>
> >>                         >>> Hi piyush,
> >>                         >>>
> >>                         >>> Separate instance of SparkILoop
> >>                         SparkIMain for each notebook while
> >>                         >>> sharing the SparkContext sounds great.
> >>                         >>>
> >>                         >>> Actually, I tried to do it, and found a
> >>                         problem: multiple SparkILoops could
> >>                         >>> generate the same class name, and the spark
> >>                         executor confuses classnames since
> >>                         >>> they're all reading classes from a single
> >>                         SparkContext.
> >>                         >>>
> >>                         >>> If someone can share about the idea of
> >>                         sharing single SparkContext
> >>                         >>> through multiple SparkILoop safely, it'll
> >>                         be really helpful.
> >>                         >>>
> >>                         >>> Thanks,
> >>                         >>> moon
> >>                         >>>
> >>                         >>>
> >>                         >>> On Mon, Aug 10, 2015 at 1:21 AM Piyush
> >>                         Mukati (Data Platform) <
> >>                         >>> piyush.mukati@flipkart.com
> >>                         <ma...@flipkart.com>
> >>                         <mailto:piyush.mukati@flipkart.com
> >>                         <ma...@flipkart.com>>> wrote:
> >>                         >>>
> >>                         >>>    Hi Moon,
> >>                         >>>    Any suggestions on it? We have to wait a
> >>                         lot when multiple people are working
> >>                         >>> with spark.
> >>                         >>>    Can we create separate instances of
> >>                          SparkILoop, SparkIMain, and
> >>                         >>> printstreams for each notebook while
> >>                         sharing the SparkContext,
> >>                         >>> ZeppelinContext, SQLContext, and
> >>                         DependencyResolver, and then use the parallel
> >>                         >>> scheduler?
> >>                         >>> thanks
> >>                         >>>
> >>                         >>> -piyush
> >>                         >>>
> >>                         >>>    Hi Moon,
> >>                         >>>
> >>                         >>>    How about tracking dedicated
> >>                         SparkContext for a notebook in Spark's
> >>                         >>> remote interpreter - this will allow
> >>                         multiple users to run their spark
> >>                         >>> paragraphs in parallel. Also, within a
> >>                         notebook only one paragraph is
> >>                         >>> executed at a time.
> >>                         >>>
> >>                         >>> Regards,
> >>                         >>> -Pranav.
> >>                         >>>
> >>                         >>>
> >>                         >>>> On 15/07/15 7:15 pm, moon soo Lee wrote:
> >>                         >>>> Hi,
> >>                         >>>>
> >>                         >>>> Thanks for asking question.
> >>                         >>>>
> >>                         >>>> The reason is simply that it is
> >>                         running code statements. The
> >>                         >>>> statements can have order and
> >>                         dependency. Imagine I have two
> >>                         >>> paragraphs
> >>                         >>>>
> >>                         >>>> %spark
> >>                         >>>> val a = 1
> >>                         >>>>
> >>                         >>>> %spark
> >>                         >>>> print(a)
> >>                         >>>>
> >>                         >>>> If they're not running one by one, that
> >>                         means they may run in
> >>                         >>>> random order and the output will be
> >>                         different every time: either '1' or
> >>                         >>>> 'val a cannot be found'.
> >>                         >>>>
> >>                         >>>> This is the reason why. But if there is a
> >>                         nice idea to handle this
> >>                         >>>> problem, I agree using a parallel scheduler
> >>                         would help a lot.
> >>                         >>>>
> >>                         >>>> Thanks,
> >>                         >>>> moon
> >>                         >>>> On Tue, Jul 14, 2015 at 7:59 PM
> >>                         linxi zeng
> >>                         >>>> <linxizeng0615@gmail.com
> >>                         <ma...@gmail.com>
> >>                         <mailto:linxizeng0615@gmail.com
> >>                         <ma...@gmail.com>>
> >>                         >>> <mailto:linxizeng0615@gmail.com
> >>                         <ma...@gmail.com>
> >>                         <mailto:linxizeng0615@gmail.com
> >>                         <ma...@gmail.com>>>>
> >>                         >>> wrote:
> >>                         >>>>
> >>                         >>>> any one who have the same question with
> >>                         me? or this is not a
> >>                         >>> question?
> >>                         >>>>
> >>                         >>>> 2015-07-14 11:47 GMT+08:00 linxi zeng
> >>                         <linxizeng0615@gmail.com
> >>                         <ma...@gmail.com>
> >>                         >>> <mailto:linxizeng0615@gmail.com
> >>                         <ma...@gmail.com>>
> >>                         >>>> <mailto:linxizeng0615@gmail.com
> >>                         <ma...@gmail.com> <mailto:
> >>                         >>> linxizeng0615@gmail.com
> >>                         <ma...@gmail.com>>>>:
> >>                         >>>>
> >>                         >>>>     hi, Moon:
> >>                         >>>>        I notice that the getScheduler
> >>                         >>>>     function in SparkInterpreter.java returns
> >>                         >>>>     a FIFOScheduler, which makes the spark
> >>                         >>>>     interpreter run spark jobs one by one.
> >>                         >>>>     It's not a good experience when a couple
> >>                         >>>>     of users do some work on zeppelin at the
> >>                         >>>>     same time, because they have to wait for
> >>                         >>>>     each other.
> >>                         >>>>     And at the same time, SparkSqlInterpreter
> >>                         >>>>     can choose which scheduler to use via
> >>                         >>>>     "zeppelin.spark.concurrentSQL".
> >>                         >>>>     My question is: what considerations did
> >>                         >>>>     you base this decision on?
> >>                         >>>
> >>                         >>>
> >>                         >>>
> >>                         >>>
> >>                         >>>
> >>
> >>                         >>
> >>
> >>
> >
>
>

Re: why zeppelin SparkInterpreter use FIFOScheduler

Posted by moon soo Lee <mo...@apache.org>.
Hi Pranav,

Thanks for sharing the plan.
I think passing InterpreterContext to completion() makes sense.
Although it changes the interpreter API, changing it now looks better than later.

Thanks.
moon
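
For illustration, the contract change being discussed could look roughly
like this; the stub type below stands in for Zeppelin's real
InterpreterContext, and the names are assumed:

import java.util.List;

// minimal stand-in: the real InterpreterContext carries, among other
// things, the note id that completion() needs
class InterpreterContext {
  final String noteId;
  InterpreterContext(String noteId) { this.noteId = noteId; }
}

abstract class Interpreter {
  // was: completion(String buf, int cursor) - no way to tell which
  // notebook is asking; the added context identifies the notebook so
  // the call can be routed to that notebook's own SparkIMain
  abstract List<String> completion(String buf, int cursor,
                                   InterpreterContext context);
}
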

On Tue, Aug 25, 2015 at 11:22 PM Pranav Kumar Agarwal <pr...@gmail.com>
wrote:

> Hi Moon,
>
> > I think releasing SparkIMain and related objects
> By packaging I meant to ask: what is the process for getting "release
> SparkIMain and related objects" into Zeppelin's code?
>
> I have one more question:
> Most of the changes to make SparkInterpreter support the ParallelScheduler
> are implemented, but I'm struggling with the completion feature. Since I
> have a SparkIMain interpreter per notebook, completion functionality is
> not working as expected because the completion method doesn't receive an
> InterpreterContext. I need to be able to pull the notebook-specific
> SparkIMain interpreter to return correct completion results, and for
> that I need to know the notebook id at the time of the completion call.
>
> I'm planning to change the Interpreter.java abstract method completion
> to pass InterpreterContext along with the buffer and cursor location. This
> will require refactoring all the Interpreters. It's a change in the
> contract, so I thought I'd run it by you before embarking on it...
>
> Please let me know your thoughts.
>
> Regards,
> -Pranav.
>
> On 18/08/15 8:04 am, moon soo Lee wrote:
> > Could you explain a little bit more about the package changes you mean?
> >
> > Thanks,
> > moon
> >
> > On Mon, Aug 17, 2015 at 10:27 AM Pranav Agarwal <praagarw@gmail.com
> > <ma...@gmail.com>> wrote:
> >
> >     Any thoughts on how to package changes related to Spark?
> >
> >     On 17-Aug-2015 7:58 pm, "moon soo Lee" <moon@apache.org
> >     <ma...@apache.org>> wrote:
> >
> >         I think releasing SparkIMain and related objects after a
> >         configurable period of inactivity would be good for now.
> >
> >         About the scheduler, I can help implement such a scheduler.
> >
> >         Thanks,
> >         moon
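
One possible shape for that inactivity-based release, with the TTL and
all names made up for illustration:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class IdleNotebookReaper {
  private static final long TTL_MS = TimeUnit.HOURS.toMillis(8);
  private final Map<String, Long> lastUsed = new ConcurrentHashMap<>();
  private final Map<String, ?> interpreters; // note id -> SparkIMain etc.

  IdleNotebookReaper(Map<String, ?> interpreters) {
    this.interpreters = interpreters;
    // sweep once a minute, releasing notebooks idle longer than the TTL
    Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
      long now = System.currentTimeMillis();
      lastUsed.forEach((noteId, ts) -> {
        if (now - ts > TTL_MS) {
          interpreters.remove(noteId);
          lastUsed.remove(noteId);
        }
      });
    }, 1, 1, TimeUnit.MINUTES);
  }

  // call on every paragraph run or completion request for the note
  void touch(String noteId) {
    lastUsed.put(noteId, System.currentTimeMillis());
  }
}
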
> >
> >         On Sun, Aug 16, 2015 at 11:54 PM Pranav Kumar Agarwal
> >         <praagarw@gmail.com <ma...@gmail.com>> wrote:
> >
> >             Hi Moon,
> >
> >             Yes, the notebookid comes from InterpreterContext. At the
> >             moment, destroying SparkIMain on deletion of a notebook is
> >             not handled. I think SparkIMain is a lightweight object;
> >             do you see a concern with keeping these objects in a map? One
> >             possible option could be to destroy notebook-related
> >             objects when the inactivity on a notebook is greater than,
> >             say, 8 hours.
> >
> >
> >>             >> 4. Build a queue inside interpreter to allow only one
> >>             paragraph execution
> >>             >> at a time per notebook.
> >>
> >>             One downside of this approach is that the GUI will display
> >>             RUNNING instead of PENDING for jobs inside the
> >>             interpreter's queue.
> >             Yes, that's a good point. Having a scheduler at the
> >             Zeppelin server that is parallel across
> >             notebooks and FIFO across paragraphs would be nice. Is
> >             there any plan for having such a scheduler?
> >
> >             Regards,
> >             -Pranav.
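
A sketch of what such a scheduler could look like (parallel across
notebooks, FIFO within one), using a single-threaded queue per note;
illustrative only:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class NoteScopedScheduler {
  // one single-threaded executor per notebook
  private final Map<String, ExecutorService> queues = new ConcurrentHashMap<>();

  void submit(String noteId, Runnable paragraphJob) {
    // paragraphs of one note run in submission order, while
    // different notes get different threads and run in parallel
    queues.computeIfAbsent(noteId, id -> Executors.newSingleThreadExecutor())
          .submit(paragraphJob);
  }
}
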
> >
> >
> >             On 17/08/15 5:38 am, moon soo Lee wrote:
> >>             Pranav, proposal looks awesome!
> >>
> >>             I have a question and feedback,
> >>
> >>             You said you tested 1, 2 and 3. To create SparkIMain per
> >>             notebook, you need the notebook id. Did you
> >>             get it from InterpreterContext?
> >>             Then how did you handle destroying the SparkIMain (when
> >>             a notebook is deleted)?
> >>             As far as I know, the interpreter is not able to get
> >>             information about notebook deletion.
> >>
> >>             >> 4. Build a queue inside interpreter to allow only one
> >>             paragraph execution
> >>             >> at a time per notebook.
> >>
> >>             One downside of this approach is that the GUI will display
> >>             RUNNING instead of PENDING for jobs inside the
> >>             interpreter's queue.
> >>
> >>             Best,
> >>             moon
> >>
> >>             On Sun, Aug 16, 2015 at 12:55 AM IT CTO
> >>             <goi.cto@gmail.com <ma...@gmail.com>> wrote:
> >>
> >>                 +1 for "to re-factor the Zeppelin architecture so
> >>                 that it can handle multi-tenancy easily"
> >>
> >>                 On Sun, Aug 16, 2015 at 9:47 AM DuyHai Doan
> >>                 <doanduyhai@gmail.com <ma...@gmail.com>>
> >>                 wrote:
> >>
> >>                     Agree with Joel, we may think about re-factoring the
> >>                     Zeppelin architecture so that it can handle
> >>                     multi-tenancy easily. The technical solution
> >>                     proposed by Pranav is great, but it only applies
> >>                     to Spark. Right now, each interpreter has to
> >>                     manage multi-tenancy in its own way. Ultimately
> >>                     Zeppelin can propose a multi-tenancy
> >>                     contract/info (like UserContext, similar to
> >>                     InterpreterContext) so that each interpreter can
> >>                     choose to use or not.
> >>
> >>
> >>                     On Sun, Aug 16, 2015 at 3:09 AM, Joel Zambrano
> >>                     <djoelz@gmail.com <ma...@gmail.com>> wrote:
> >>
> >>                         I think that while the idea of running multiple
> >>                         notes simultaneously is great, it is really
> >>                         dancing around the lack of true multi-user
> >>                         support in Zeppelin. While the proposed
> >>                         solution would work if the applications
> >>                         resources are those of the whole cluster, if
> >>                         the app is limited (say they are 8 cores of
> >>                         16, with some distribution in memory) then
> >>                         potentially your note can hog all the
> >>                         resources and the scheduler will have to
> >>                         throttle all other executions leaving you
> >>                         exactly where you are now.
> >>                         While I think the solution is a good one,
> >>                         maybe this question should make us think about
> >>                         adding true multiuser support.
> >>                         Where we isolate resources (cluster and the
> >>                         notebooks themselves), have separate
> >>                         login/identity and (I don't know if it's
> >>                         possible) share the same context.
> >>
> >>                         Thanks,
> >>                         Joel
> >>
> >>                         > On Aug 15, 2015, at 1:58 PM, Rohit Agarwal
> >>                         <mindprince@gmail.com
> >>                         <ma...@gmail.com>> wrote:
> >>                         >
> >>                         > If the problem is that multiple users have
> >>                         to wait for each other while
> >>                         > using Zeppelin, the solution already
> >>                         exists: they can create a new
> >>                         > interpreter by going to the interpreter
> >>                         page and attaching it to their
> >>                         > notebook - then they don't have to wait for
> >>                         others to submit their job.
> >>                         >
> >>                         > But I agree, having paragraphs from one
> >>                         note wait for paragraphs from other
> >>                         > notes is a confusing default. We can get
> >>                         around that in two ways:
> >>                         >
> >>                         >   1. Create a new interpreter for each note
> >>                         and attach that interpreter to
> >>                         >   that note. This approach would require the
> least amount
> >>                         of code changes but
> >>                         >   is resource heavy and doesn't let you
> >>                         share Spark Context between different
> >>                         >   notes.
> >>                         >   2. If we want to share the Spark Context
> >>                         between different notes, we can
> >>                         >   submit jobs from different notes into
> >>                         different fairscheduler pools (
> >>                         >
> >>
> https://spark.apache.org/docs/1.4.0/job-scheduling.html#scheduling-within-an-application
> ).
> >>                         >   This can be done by submitting jobs from
> >>                         different notes in different
> >>                         >   threads. This will make sure that jobs
> >>                         from one note are run sequentially
> >>                         >   but jobs from different notes will be
> >>                         able to run in parallel.
> >>                         >
> >>                         > Neither of these options requires any change
> >>                         in the Spark code.
> >>                         >
> >>                         > --
> >>                         > Thanks & Regards
> >>                         > Rohit Agarwal
> >>                         > https://www.linkedin.com/in/rohitagarwal003
> >>                         >
> >>                         > On Sat, Aug 15, 2015 at 12:01 PM, Pranav
> >>                         Kumar Agarwal <praagarw@gmail.com
> >>                         <ma...@gmail.com>>
> >>                         > wrote:
> >>                         >
> >>                         >> If someone can share about the idea of
> >>                         sharing single SparkContext through
> >>                         >>> multiple SparkILoop safely, it'll be
> >>                         really helpful.
> >>                         >> Here is a proposal:
> >>                         >> 1. In Spark code, change SparkIMain.scala
> >>                         to allow setting the virtual
> >>                         >> directory. While creating new instances of
> >>                         SparkIMain per notebook from
> >>                         >> zeppelin spark interpreter set all the
> >>                         instances of SparkIMain to the same
> >>                         >> virtual directory.
> >>                         >> 2. Start HTTP server on that virtual
> >>                         directory and set this HTTP server in
> >>                         >> Spark Context using classserverUri method
> >>                         >> 3. Scala generated code has a notion of
> >>                         packages. The default package name
> >>                         >> is "line$<linenumber>". Package name can
> >>                         be controlled using System
> >>                         >> Property scala.repl.name.line. Setting
> >>                         this property to "notebook id"
> >>                         >> ensures that code generated by individual
> >>                         instances of SparkIMain is
> >>                         >> isolated from other instances of SparkIMain
> >>                         >> 4. Build a queue inside interpreter to
> >>                         allow only one paragraph execution
> >>                         >> at a time per notebook.
> >>                         >>
> >>                         >> I have tested 1, 2, and 3 and it seems to
> >>                         provide isolation across
> >>                         >> classnames. I'll work towards submitting a
> >>                         formal patch soon - Is there any
> >>                         >> Jira already for the same that I can
> >>                         uptake? Also I need to understand:
> >>                         >> 1. How does Zeppelin uptake Spark fixes?
> >>                         OR do I need to first work
> >>                         >> towards getting Spark changes merged in
> >>                         Apache Spark github?
> >>                         >>
> >>                         >> Any suggestions or comments on the
> >>                         proposal are highly welcome.
> >>                         >>
> >>                         >> Regards,
> >>                         >> -Pranav.
> >>                         >>
> >>                         >>> On 10/08/15 11:36 pm, moon soo Lee wrote:
> >>                         >>>
> >>                         >>> Hi piyush,
> >>                         >>>
> >>                         >>> Separate instance of SparkILoop
> >>                         SparkIMain for each notebook while
> >>                         >>> sharing the SparkContext sounds great.
> >>                         >>>
> >>                         >>> Actually, I tried to do it, and found a
> >>                         problem: multiple SparkILoops could
> >>                         >>> generate the same class name, and the spark
> >>                         executor confuses classnames since
> >>                         >>> they're all reading classes from a single
> >>                         SparkContext.
> >>                         >>>
> >>                         >>> If someone can share about the idea of
> >>                         sharing single SparkContext
> >>                         >>> through multiple SparkILoop safely, it'll
> >>                         be really helpful.
> >>                         >>>
> >>                         >>> Thanks,
> >>                         >>> moon
> >>                         >>>
> >>                         >>>
> >>                         >>> On Mon, Aug 10, 2015 at 1:21 AM Piyush
> >>                         Mukati (Data Platform) <
> >>                         >>> piyush.mukati@flipkart.com
> >>                         <ma...@flipkart.com>
> >>                         <mailto:piyush.mukati@flipkart.com
> >>                         <ma...@flipkart.com>>> wrote:
> >>                         >>>
> >>                         >>>    Hi Moon,
> >>                         >>>    Any suggestions on it? We have to wait a
> >>                         lot when multiple people are working
> >>                         >>> with spark.
> >>                         >>>    Can we create separate instances of
> >>                          SparkILoop, SparkIMain, and
> >>                         >>> printstreams for each notebook while
> >>                         sharing the SparkContext,
> >>                         >>> ZeppelinContext, SQLContext, and
> >>                         DependencyResolver, and then use the parallel
> >>                         >>> scheduler?
> >>                         >>> thanks
> >>                         >>>
> >>                         >>> -piyush
> >>                         >>>
> >>                         >>>    Hi Moon,
> >>                         >>>
> >>                         >>>    How about tracking dedicated
> >>                         SparkContext for a notebook in Spark's
> >>                         >>> remote interpreter - this will allow
> >>                         multiple users to run their spark
> >>                         >>> paragraphs in parallel. Also, within a
> >>                         notebook only one paragraph is
> >>                         >>> executed at a time.
> >>                         >>>
> >>                         >>> Regards,
> >>                         >>> -Pranav.
> >>                         >>>
> >>                         >>>
> >>                         >>>> On 15/07/15 7:15 pm, moon soo Lee wrote:
> >>                         >>>> Hi,
> >>                         >>>>
> >>                         >>>> Thanks for asking question.
> >>                         >>>>
> >>                         >>>> The reason is simply that it is
> >>                         running code statements. The
> >>                         >>>> statements can have order and
> >>                         dependency. Imagine I have two
> >>                         >>> paragraphs
> >>                         >>>>
> >>                         >>>> %spark
> >>                         >>>> val a = 1
> >>                         >>>>
> >>                         >>>> %spark
> >>                         >>>> print(a)
> >>                         >>>>
> >>                         >>>> If they're not running one by one, that
> >>                         means they may run in
> >>                         >>>> random order and the output will be
> >>                         different every time: either '1' or
> >>                         >>>> 'val a cannot be found'.
> >>                         >>>>
> >>                         >>>> This is the reason why. But if there is a
> >>                         nice idea to handle this
> >>                         >>>> problem, I agree using a parallel scheduler
> >>                         would help a lot.
> >>                         >>>>
> >>                         >>>> Thanks,
> >>                         >>>> moon
> >>                         >>>> On Tue, Jul 14, 2015 at 7:59 PM
> >>                         linxi zeng
> >>                         >>>> <linxizeng0615@gmail.com
> >>                         <ma...@gmail.com>
> >>                         <mailto:linxizeng0615@gmail.com
> >>                         <ma...@gmail.com>>
> >>                         >>> <mailto:linxizeng0615@gmail.com
> >>                         <ma...@gmail.com>
> >>                         <mailto:linxizeng0615@gmail.com
> >>                         <ma...@gmail.com>>>>
> >>                         >>> wrote:
> >>                         >>>>
> >>                         >>>> any one who have the same question with
> >>                         me? or this is not a
> >>                         >>> question?
> >>                         >>>>
> >>                         >>>> 2015-07-14 11:47 GMT+08:00 linxi zeng
> >>                         <linxizeng0615@gmail.com
> >>                         <ma...@gmail.com>
> >>                         >>> <mailto:linxizeng0615@gmail.com
> >>                         <ma...@gmail.com>>
> >>                         >>>> <mailto:linxizeng0615@gmail.com
> >>                         <ma...@gmail.com> <mailto:
> >>                         >>> linxizeng0615@gmail.com
> >>                         <ma...@gmail.com>>>>:
> >>                         >>>>
> >>                         >>>>     hi, Moon:
> >>                         >>>>        I notice that the getScheduler
> >>                         >>>>     function in SparkInterpreter.java returns
> >>                         >>>>     a FIFOScheduler, which makes the spark
> >>                         >>>>     interpreter run spark jobs one by one.
> >>                         >>>>     It's not a good experience when a couple
> >>                         >>>>     of users do some work on zeppelin at the
> >>                         >>>>     same time, because they have to wait for
> >>                         >>>>     each other.
> >>                         >>>>     And at the same time, SparkSqlInterpreter
> >>                         >>>>     can choose which scheduler to use via
> >>                         >>>>     "zeppelin.spark.concurrentSQL".
> >>                         >>>>     My question is: what considerations did
> >>                         >>>>     you base this decision on?
> >>                         >>>
> >>                         >>>
> >>                         >>>
> >>                         >>>
> >>                         >>>
> >>
> >>                         >>
> >>
> >>
> >
>
>

Re: why zeppelin SparkInterpreter use FIFOScheduler

Posted by Pranav Kumar Agarwal <pr...@gmail.com>.
Hi Moon,

> I think releasing SparkIMain and related objects 
By packaging I meant to ask: what is the process for getting "release
SparkIMain and related objects" into Zeppelin's code?

I have one more question:
Most of the changes to make SparkInterpreter support the ParallelScheduler
are implemented, but I'm struggling with the completion feature. Since I
have a SparkIMain interpreter per notebook, completion functionality is
not working as expected because the completion method doesn't receive an
InterpreterContext. I need to be able to pull the notebook-specific
SparkIMain interpreter to return correct completion results, and for
that I need to know the notebook id at the time of the completion call.

I'm planning to change the Interpreter.java abstract method completion
to pass InterpreterContext along with the buffer and cursor location. This
will require refactoring all the Interpreters. It's a change in the
contract, so I thought I'd run it by you before embarking on it...
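
Sketched out, the routing that needs the notebook id might look like this
(names assumed, not the actual Zeppelin code):

import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class CompletionRouter {
  // hypothetical per-note completer wrapping that note's SparkIMain
  interface NoteCompleter {
    List<String> complete(String buf, int cursor);
  }

  // one completer per notebook, keyed by note id
  private final Map<String, NoteCompleter> completers = new ConcurrentHashMap<>();

  List<String> completion(String buf, int cursor, String noteId) {
    NoteCompleter completer = completers.get(noteId);
    // without the note id there is no way to pick the right completer,
    // which is exactly why completion() needs the InterpreterContext
    return (completer == null) ? Collections.<String>emptyList()
                               : completer.complete(buf, cursor);
  }
}
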

Please let me know your thoughts.

Regards,
-Pranav.

On 18/08/15 8:04 am, moon soo Lee wrote:
> Could you explain a little bit more about the package changes you mean?
>
> Thanks,
> moon
>
> On Mon, Aug 17, 2015 at 10:27 AM Pranav Agarwal <praagarw@gmail.com 
> <ma...@gmail.com>> wrote:
>
>     Any thoughts on how to package changes related to Spark?
>
>     On 17-Aug-2015 7:58 pm, "moon soo Lee" <moon@apache.org
>     <ma...@apache.org>> wrote:
>
>         I think releasing SparkIMain and related objects after a
>         configurable period of inactivity would be good for now.
>
>         About the scheduler, I can help implement such a scheduler.
>
>         Thanks,
>         moon
>
>         On Sun, Aug 16, 2015 at 11:54 PM Pranav Kumar Agarwal
>         <praagarw@gmail.com <ma...@gmail.com>> wrote:
>
>             Hi Moon,
>
>             Yes, the notebookid comes from InterpreterContext. At the
>             moment, destroying SparkIMain on deletion of a notebook is
>             not handled. I think SparkIMain is a lightweight object;
>             do you see a concern with keeping these objects in a map? One
>             possible option could be to destroy notebook-related
>             objects when the inactivity on a notebook is greater than,
>             say, 8 hours.
>
>
>>             >> 4. Build a queue inside interpreter to allow only one
>>             paragraph execution
>>             >> at a time per notebook.
>>
>>             One downside of this approach is that the GUI will display
>>             RUNNING instead of PENDING for jobs inside the
>>             interpreter's queue.
>             Yes, that's a good point. Having a scheduler at the
>             Zeppelin server that is parallel across
>             notebooks and FIFO across paragraphs would be nice. Is
>             there any plan for having such a scheduler?
>
>             Regards,
>             -Pranav.
>
>
>             On 17/08/15 5:38 am, moon soo Lee wrote:
>>             Pranav, proposal looks awesome!
>>
>>             I have a question and feedback,
>>
>>             You said you tested 1, 2 and 3. To create SparkIMain per
>>             notebook, you need the notebook id. Did you
>>             get it from InterpreterContext?
>>             Then how did you handle destroying the SparkIMain (when
>>             a notebook is deleted)?
>>             As far as I know, the interpreter is not able to get
>>             information about notebook deletion.
>>
>>             >> 4. Build a queue inside interpreter to allow only one
>>             paragraph execution
>>             >> at a time per notebook.
>>
>>             One downside of this approach is that the GUI will display
>>             RUNNING instead of PENDING for jobs inside the
>>             interpreter's queue.
>>
>>             Best,
>>             moon
>>
>>             On Sun, Aug 16, 2015 at 12:55 AM IT CTO
>>             <goi.cto@gmail.com <ma...@gmail.com>> wrote:
>>
>>                 +1 for "to re-factor the Zeppelin architecture so
>>                 that it can handle multi-tenancy easily"
>>
>>                 On Sun, Aug 16, 2015 at 9:47 AM DuyHai Doan
>>                 <doanduyhai@gmail.com <ma...@gmail.com>>
>>                 wrote:
>>
>>                     Agree with Joel, we may think about re-factoring the
>>                     Zeppelin architecture so that it can handle
>>                     multi-tenancy easily. The technical solution
>>                     proposed by Pranav is great, but it only applies
>>                     to Spark. Right now, each interpreter has to
>>                     manage multi-tenancy in its own way. Ultimately
>>                     Zeppelin can propose a multi-tenancy
>>                     contract/info (like UserContext, similar to
>>                     InterpreterContext) so that each interpreter can
>>                     choose to use or not.
>>
>>
>>                     On Sun, Aug 16, 2015 at 3:09 AM, Joel Zambrano
>>                     <djoelz@gmail.com <ma...@gmail.com>> wrote:
>>
>>                         I think that while the idea of running multiple
>>                         notes simultaneously is great, it is really
>>                         dancing around the lack of true multi-user
>>                         support in Zeppelin. While the proposed
>>                         solution would work if the applications
>>                         resources are those of the whole cluster, if
>>                         the app is limited (say they are 8 cores of
>>                         16, with some distribution in memory) then
>>                         potentially your note can hog all the
>>                         resources and the scheduler will have to
>>                         throttle all other executions leaving you
>>                         exactly where you are now.
>>                         While I think the solution is a good one,
>>                         maybe this question should make us think about
>>                         adding true multiuser support.
>>                         Where we isolate resources (cluster and the
>>                         notebooks themselves), have separate
>>                         login/identity and (I don't know if it's
>>                         possible) share the same context.
>>
>>                         Thanks,
>>                         Joel
>>
>>                         > On Aug 15, 2015, at 1:58 PM, Rohit Agarwal
>>                         <mindprince@gmail.com
>>                         <ma...@gmail.com>> wrote:
>>                         >
>>                         > If the problem is that multiple users have
>>                         to wait for each other while
>>                         > using Zeppelin, the solution already
>>                         exists: they can create a new
>>                         > interpreter by going to the interpreter
>>                         page and attaching it to their
>>                         > notebook - then they don't have to wait for
>>                         others to submit their job.
>>                         >
>>                         > But I agree, having paragraphs from one
>>                         note wait for paragraphs from other
>>                         > notes is a confusing default. We can get
>>                         around that in two ways:
>>                         >
>>                         >   1. Create a new interpreter for each note
>>                         and attach that interpreter to
>>                         >   that note. This approach would require the least amount
>>                         of code changes but
>>                         >   is resource heavy and doesn't let you
>>                         share Spark Context between different
>>                         >   notes.
>>                         >   2. If we want to share the Spark Context
>>                         between different notes, we can
>>                         >   submit jobs from different notes into
>>                         different fairscheduler pools (
>>                         >
>>                         https://spark.apache.org/docs/1.4.0/job-scheduling.html#scheduling-within-an-application).
>>                         >   This can be done by submitting jobs from
>>                         different notes in different
>>                         >   threads. This will make sure that jobs
>>                         from one note are run sequentially
>>                         >   but jobs from different notes will be
>>                         able to run in parallel.
>>                         >
>>                         > Neither of these options requires any change
>>                         in the Spark code.
>>                         >
>>                         > --
>>                         > Thanks & Regards
>>                         > Rohit Agarwal
>>                         > https://www.linkedin.com/in/rohitagarwal003
>>                         >
>>                         > On Sat, Aug 15, 2015 at 12:01 PM, Pranav
>>                         Kumar Agarwal <praagarw@gmail.com
>>                         <ma...@gmail.com>>
>>                         > wrote:
>>                         >
>>                         >> If someone can share about the idea of
>>                         sharing single SparkContext through
>>                         >>> multiple SparkILoop safely, it'll be
>>                         really helpful.
>>                         >> Here is a proposal:
>>                         >> 1. In Spark code, change SparkIMain.scala
>>                         to allow setting the virtual
>>                         >> directory. While creating new instances of
>>                         SparkIMain per notebook from
>>                         >> zeppelin spark interpreter set all the
>>                         instances of SparkIMain to the same
>>                         >> virtual directory.
>>                         >> 2. Start HTTP server on that virtual
>>                         directory and set this HTTP server in
>>                         >> Spark Context using classserverUri method
>>                         >> 3. Scala generated code has a notion of
>>                         packages. The default package name
>>                         >> is "line$<linenumber>". Package name can
>>                         be controlled using System
>>                         >> Property scala.repl.name.line. Setting
>>                         this property to "notebook id"
>>                         >> ensures that code generated by individual
>>                         instances of SparkIMain is
>>                         >> isolated from other instances of SparkIMain
>>                         >> 4. Build a queue inside interpreter to
>>                         allow only one paragraph execution
>>                         >> at a time per notebook.
>>                         >>
>>                         >> I have tested 1, 2, and 3 and it seems to
>>                         provide isolation across
>>                         >> classnames. I'll work towards submitting a
>>                         formal patch soon - Is there any
>>                         >> Jira already for the same that I can
>>                         uptake? Also I need to understand:
>>                         >> 1. How does Zeppelin uptake Spark fixes?
>>                         OR do I need to first work
>>                         >> towards getting Spark changes merged in
>>                         Apache Spark github?
>>                         >>
>>                         >> Any suggestions or comments on the
>>                         proposal are highly welcome.
>>                         >>
>>                         >> Regards,
>>                         >> -Pranav.
>>                         >>
>>                         >>> On 10/08/15 11:36 pm, moon soo Lee wrote:
>>                         >>>
>>                         >>> Hi piyush,
>>                         >>>
>>                         >>> Separate instance of SparkILoop
>>                         SparkIMain for each notebook while
>>                         >>> sharing the SparkContext sounds great.
>>                         >>>
>>                         >>> Actually, I tried to do it, and found a
>>                         problem: multiple SparkILoops could
>>                         >>> generate the same class name, and the spark
>>                         executor confuses classnames since
>>                         >>> they're all reading classes from a single
>>                         SparkContext.
>>                         >>>
>>                         >>> If someone can share about the idea of
>>                         sharing single SparkContext
>>                         >>> through multiple SparkILoop safely, it'll
>>                         be really helpful.
>>                         >>>
>>                         >>> Thanks,
>>                         >>> moon
>>                         >>>
>>                         >>>
>>                         >>> On Mon, Aug 10, 2015 at 1:21 AM Piyush
>>                         Mukati (Data Platform) <
>>                         >>> piyush.mukati@flipkart.com
>>                         <ma...@flipkart.com>
>>                         <mailto:piyush.mukati@flipkart.com
>>                         <ma...@flipkart.com>>> wrote:
>>                         >>>
>>                         >>>    Hi Moon,
>>                         >>>    Any suggestions on it? We have to wait a
>>                         lot when multiple people are working
>>                         >>> with spark.
>>                         >>>    Can we create separate instances of
>>                          SparkILoop, SparkIMain, and
>>                         >>> printstreams for each notebook while
>>                         sharing the SparkContext,
>>                         >>> ZeppelinContext, SQLContext, and
>>                         DependencyResolver, and then use the parallel
>>                         >>> scheduler?
>>                         >>> thanks
>>                         >>>
>>                         >>> -piyush
>>                         >>>
>>                         >>>    Hi Moon,
>>                         >>>
>>                         >>>    How about tracking dedicated
>>                         SparkContext for a notebook in Spark's
>>                         >>> remote interpreter - this will allow
>>                         multiple users to run their spark
>>                         >>> paragraphs in parallel. Also, within a
>>                         notebook only one paragraph is
>>                         >>> executed at a time.
>>                         >>>
>>                         >>> Regards,
>>                         >>> -Pranav.
>>                         >>>
>>                         >>>
>>                         >>>> On 15/07/15 7:15 pm, moon soo Lee wrote:
>>                         >>>> Hi,
>>                         >>>>
>>                         >>>> Thanks for asking question.
>>                         >>>>
>>                         >>>> The reason is simply that it is
>>                         running code statements. The
>>                         >>>> statements can have order and
>>                         dependency. Imagine I have two
>>                         >>> paragraphs
>>                         >>>>
>>                         >>>> %spark
>>                         >>>> val a = 1
>>                         >>>>
>>                         >>>> %spark
>>                         >>>> print(a)
>>                         >>>>
>>                         >>>> If they're not running one by one, that
>>                         means they may run in
>>                         >>>> random order and the output will be
>>                         different every time: either '1' or
>>                         >>>> 'val a cannot be found'.
>>                         >>>>
>>                         >>>> This is the reason why. But if there is a
>>                         nice idea to handle this
>>                         >>>> problem, I agree using a parallel scheduler
>>                         would help a lot.
>>                         >>>>
>>                         >>>> Thanks,
>>                         >>>> moon
>>                         >>>> On Tue, Jul 14, 2015 at 7:59 PM
>>                         linxi zeng
>>                         >>>> <linxizeng0615@gmail.com
>>                         <ma...@gmail.com>
>>                         <mailto:linxizeng0615@gmail.com
>>                         <ma...@gmail.com>>
>>                         >>> <mailto:linxizeng0615@gmail.com
>>                         <ma...@gmail.com>
>>                         <mailto:linxizeng0615@gmail.com
>>                         <ma...@gmail.com>>>>
>>                         >>> wrote:
>>                         >>>>
>>                         >>>> any one who have the same question with
>>                         me? or this is not a
>>                         >>> question?
>>                         >>>>
>>                         >>>> 2015-07-14 11:47 GMT+08:00 linxi zeng
>>                         <linxizeng0615@gmail.com
>>                         <ma...@gmail.com>
>>                         >>> <mailto:linxizeng0615@gmail.com
>>                         <ma...@gmail.com>>
>>                         >>>> <mailto:linxizeng0615@gmail.com
>>                         <ma...@gmail.com> <mailto:
>>                         >>> linxizeng0615@gmail.com
>>                         <ma...@gmail.com>>>>:
>>                         >>>>
>>                         >>>>     hi, Moon:
>>                         >>>>        I notice that the getScheduler
>>                         >>>>     function in SparkInterpreter.java returns
>>                         >>>>     a FIFOScheduler, which makes the spark
>>                         >>>>     interpreter run spark jobs one by one.
>>                         >>>>     It's not a good experience when a couple
>>                         >>>>     of users do some work on zeppelin at the
>>                         >>>>     same time, because they have to wait for
>>                         >>>>     each other.
>>                         >>>>     And at the same time, SparkSqlInterpreter
>>                         >>>>     can choose which scheduler to use via
>>                         >>>>     "zeppelin.spark.concurrentSQL".
>>                         >>>>     My question is: what considerations did
>>                         >>>>     you base this decision on?
>>                         >>>
>>                         >>>
>>                         >>>
>>                         >>>
>>                         >>>
>>                         >>
>>
>>
>


>>
>>                         > On Aug 15, 2015, at 1:58 PM, Rohit Agarwal
>>                         <mindprince@gmail.com
>>                         <ma...@gmail.com>> wrote:
>>                         >
>>                         > If the problem is that multiple users have
>>                         to wait for each other while
>>                         > using Zeppelin, the solution already
>>                         exists: they can create a new
>>                         > interpreter by going to the interpreter
>>                         page and attach it to their
>>                         > notebook - then they don't have to wait for
>>                         others to submit their job.
>>                         >
>>                         > But I agree, having paragraphs from one
>>                         note wait for paragraphs from other
>>                         > notes is a confusing default. We can get
>>                         around that in two ways:
>>                         >
>>                         >   1. Create a new interpreter for each note
>>                         and attach that interpreter to
>>                         >   that note. This approach would require the least amount
>>                         of code changes but
>>                         >   is resource heavy and doesn't let you
>>                         share Spark Context between different
>>                         >   notes.
>>                         >   2. If we want to share the Spark Context
>>                         between different notes, we can
>>                         >   submit jobs from different notes into
>>                         different fairscheduler pools (
>>                         >
>>                         https://spark.apache.org/docs/1.4.0/job-scheduling.html#scheduling-within-an-application).
>>                         >   This can be done by submitting jobs from
>>                         different notes in different
>>                         >   threads. This will make sure that jobs
>>                         from one note are run sequentially
>>                         >   but jobs from different notes will be
>>                         able to run in parallel.
>>                         >
>>                         > Neither of these options require any change
>>                         in the Spark code.
>>                         >
>>                         > --
>>                         > Thanks & Regards
>>                         > Rohit Agarwal
>>                         > https://www.linkedin.com/in/rohitagarwal003
>>                         >
>>                         > On Sat, Aug 15, 2015 at 12:01 PM, Pranav
>>                         Kumar Agarwal <praagarw@gmail.com
>>                         <ma...@gmail.com>>
>>                         > wrote:
>>                         >
>>                         >> If someone can share about the idea of
>>                         sharing single SparkContext through
>>                         >>> multiple SparkILoop safely, it'll be
>>                         really helpful.
>>                         >> Here is a proposal:
>>                         >> 1. In Spark code, change SparkIMain.scala
>>                         to allow setting the virtual
>>                         >> directory. While creating new instances of
>>                         SparkIMain per notebook from
>>                         >> zeppelin spark interpreter set all the
>>                         instances of SparkIMain to the same
>>                         >> virtual directory.
>>                         >> 2. Start HTTP server on that virtual
>>                         directory and set this HTTP server in
>>                         >> Spark Context using classserverUri method
>>                         >> 3. Scala generated code has a notion of
>>                         packages. The default package name
>>                         >> is "line$<linenumber>". Package name can
>>                         be controlled using System
>>                         >> Property scala.repl.name.line. Setting
>>                         this property to "notebook id"
>>                         >> ensures that code generated by individual
>>                         instances of SparkIMain is
>>                         >> isolated from other instances of SparkIMain
>>                         >> 4. Build a queue inside interpreter to
>>                         allow only one paragraph execution
>>                         >> at a time per notebook.
>>                         >>
>>                         >> I have tested 1, 2, and 3 and it seems to
>>                         provide isolation across
>>                         >> classnames. I'll work towards submitting a
>>                         formal patch soon - Is there any
>>                         >> Jira already for the same that I can
>>                         uptake? Also I need to understand:
>>                         >> 1. How does Zeppelin uptake Spark fixes?
>>                         OR do I need to first work
>>                         >> towards getting Spark changes merged in
>>                         Apache Spark github?
>>                         >>
>>                         >> Any suggestions on comments on the
>>                         proposal are highly welcome.
>>                         >>
>>                         >> Regards,
>>                         >> -Pranav.
>>                         >>
>>                         >>> On 10/08/15 11:36 pm, moon soo Lee wrote:
>>                         >>>
>>                         >>> Hi piyush,
>>                         >>>
>>                         >>> Separate instance of SparkILoop
>>                         SparkIMain for each notebook while
>>                         >>> sharing the SparkContext sounds great.
>>                         >>>
>>                         >>> Actually, i tried to do it, found problem
>>                         that multiple SparkILoop could
>>                         >>> generates the same class name, and spark
>>                         executor confuses classname since
>>                         >>> they're reading classes from single
>>                         SparkContext.
>>                         >>>
>>                         >>> If someone can share about the idea of
>>                         sharing single SparkContext
>>                         >>> through multiple SparkILoop safely, it'll
>>                         be really helpful.
>>                         >>>
>>                         >>> Thanks,
>>                         >>> moon
>>                         >>>
>>                         >>>
>>                         >>> On Mon, Aug 10, 2015 at 1:21 AM Piyush
>>                         Mukati (Data Platform) <
>>                         >>> piyush.mukati@flipkart.com
>>                         <ma...@flipkart.com>
>>                         <mailto:piyush.mukati@flipkart.com
>>                         <ma...@flipkart.com>>> wrote:
>>                         >>>
>>                         >>>    Hi Moon,
>>                         >>>    Any suggestion on it, have to wait lot
>>                         when multiple people  working
>>                         >>> with spark.
>>                         >>>    Can we create separate instance of
>>                          SparkILoop SparkIMain and
>>                         >>> printstrems  for each notebook while
>>                         sharing theSparkContext
>>                         >>> ZeppelinContext  SQLContext and
>>                         DependencyResolver and then use parallel
>>                         >>> scheduler ?
>>                         >>> thanks
>>                         >>>
>>                         >>> -piyush
>>                         >>>
>>                         >>>    Hi Moon,
>>                         >>>
>>                         >>>    How about tracking dedicated
>>                         SparkContext for a notebook in Spark's
>>                         >>> remote interpreter - this will allow
>>                         multiple users to run their spark
>>                         >>> paragraphs in parallel. Also, within a
>>                         notebook only one paragraph is
>>                         >>> executed at a time.
>>                         >>>
>>                         >>> Regards,
>>                         >>> -Pranav.
>>                         >>>
>>                         >>>
>>                         >>>> On 15/07/15 7:15 pm, moon soo Lee wrote:
>>                         >>>> Hi,
>>                         >>>>
>>                         >>>> Thanks for asking question.
>>                         >>>>
>>                         >>>> The reason is simply because of it is
>>                         running code statements. The
>>                         >>>> statements can have order and
>>                         dependency. Imagine i have two
>>                         >>> paragraphs
>>                         >>>>
>>                         >>>> %spark
>>                         >>>> val a = 1
>>                         >>>>
>>                         >>>> %spark
>>                         >>>> print(a)
>>                         >>>>
>>                         >>>> If they're not running one by one, that
>>                         means they possibly runs in
>>                         >>>> random order and the output will be
>>                         always different. Either '1' or
>>                         >>>> 'val a can not found'.
>>                         >>>>
>>                         >>>> This is the reason why. But if there are
>>                         nice idea to handle this
>>                         >>>> problem i agree using parallel scheduler
>>                         would help a lot.
>>                         >>>>
>>                         >>>> Thanks,
>>                         >>>> moon
>>                         >>>> On Tue, 14 Jul 2015 at 7:59 PM
>>                         linxi zeng
>>                         >>>> <linxizeng0615@gmail.com
>>                         <ma...@gmail.com>
>>                         <mailto:linxizeng0615@gmail.com
>>                         <ma...@gmail.com>>
>>                         >>> <mailto:linxizeng0615@gmail.com
>>                         <ma...@gmail.com>
>>                         <mailto:linxizeng0615@gmail.com
>>                         <ma...@gmail.com>>>>
>>                         >>> wrote:
>>                         >>>>
>>                         >>>> any one who have the same question with
>>                         me? or this is not a
>>                         >>> question?
>>                         >>>>
>>                         >>>> 2015-07-14 11:47 GMT+08:00 linxi zeng
>>                         <linxizeng0615@gmail.com
>>                         <ma...@gmail.com>
>>                         >>> <mailto:linxizeng0615@gmail.com
>>                         <ma...@gmail.com>>
>>                         >>>> <mailto:linxizeng0615@gmail.com
>>                         <ma...@gmail.com> <mailto:
>>                         >>> linxizeng0615@gmail.com
>>                         <ma...@gmail.com>>>>:
>>                         >>>>
>>                         >>>>     hi, Moon:
>>                         >>>>        I notice that the getScheduler
>>                         function in the
>>                         >>>> SparkInterpreter.java return a
>>                         FIFOScheduler which makes the
>>                         >>>>     spark interpreter run spark job one
>>                         by one. It's not a good
>>                         >>>>     experience when couple of users do
>>                         some work on zeppelin at
>>                         >>>>     the same time, because they have to
>>                         wait for each other.
>>                         >>>>     And at the same time,
>>                         SparkSqlInterpreter can chose what
>>                         >>>>     scheduler to use by
>>                         "zeppelin.spark.concurrentSQL".
>>                         >>>>     My question is, what kind of
>>                         consideration do you based on
>>                         >>> to
>>                         >>>>     make such a decision?
>>                         >>>
>>                         >>>
>>                         >>>
>>                         >>>
>>                         >>>
>>                         >>
>>
>>
>


Re: why zeppelin SparkInterpreter use FIFOScheduler

Posted by moon soo Lee <mo...@apache.org>.
Could you explain a little bit more about the package changes you mean?

Thanks,
moon

On Mon, Aug 17, 2015 at 10:27 AM Pranav Agarwal <pr...@gmail.com> wrote:

> Any thoughts on how to package changes related to Spark?
> On 17-Aug-2015 7:58 pm, "moon soo Lee" <mo...@apache.org> wrote:
>
>> I think releasing SparkIMain and related objects after configurable
>> inactivity would be good for now.
>>
>> About scheduler, I can help implementing such scheduler.
>>
>> Thanks,
>> moon
>>
>> On Sun, Aug 16, 2015 at 11:54 PM Pranav Kumar Agarwal <pr...@gmail.com>
>> wrote:
>>
>>> Hi Moon,
>>>
>>> Yes, the notebookid comes from InterpreterContext. At the moment
>>> destroying SparkIMain on deletion of notebook is not handled. I think
>>> SparkIMain is a lightweight object, do you see a concern having these
>>> objects in a map? One possible option could be to destroy notebook related
>>> objects when the inactivity on a notebook is greater than say 8 hours.
>>>
>>>
>>> >> 4. Build a queue inside interpreter to allow only one paragraph
>>> execution
>>> >> at a time per notebook.
>>>
>>> One downside of this approach is, GUI will display RUNNING instead of
>>> PENDING for jobs inside of queue in interpreter.
>>>
>>> Yes that's an good point. Having a scheduler at Zeppelin server to build
>>> a scheduler that is parallel across notebook's and FIFO across paragraph's
>>> will be nice. Is there any plan for having such a scheduler?
>>>
>>> Regards,
>>> -Pranav.
>>>
>>>
>>> On 17/08/15 5:38 am, moon soo Lee wrote:
>>>
>>> Pranav, proposal looks awesome!
>>>
>>> I have a question and feedback,
>>>
>>> You said you tested 1,2 and 3. To create SparkIMain per notebook, you
>>> need information of notebook id. Did you get it from InterpreterContext?
>>> Then how did you handle destroying of SparkIMain (when notebook is
>>> deleting)?
>>> As far as i know, interpreter not able to get information of notebook
>>> deletion.
>>>
>>> >> 4. Build a queue inside interpreter to allow only one paragraph
>>> execution
>>> >> at a time per notebook.
>>>
>>> One downside of this approach is, GUI will display RUNNING instead of
>>> PENDING for jobs inside of queue in interpreter.
>>>
>>> Best,
>>> moon
>>>
>>> On Sun, Aug 16, 2015 at 12:55 AM IT CTO <go...@gmail.com> wrote:
>>>
>>>> +1 for "to re-factor the Zeppelin architecture so that it can handle
>>>> multi-tenancy easily"
>>>>
>>>> On Sun, Aug 16, 2015 at 9:47 AM DuyHai Doan <do...@gmail.com>
>>>> wrote:
>>>>
>>>>> Agree with Joel, we may think to re-factor the Zeppelin architecture
>>>>> so that it can handle multi-tenancy easily. The technical solution proposed
>>>>> by Pranav is great but it only applies to Spark. Right now, each
>>>>> interpreter has to manage multi-tenancy its own way. Ultimately Zeppelin
>>>>> can propose a multi-tenancy contract/info (like UserContext, similar to
>>>>> InterpreterContext) so that each interpreter can choose to use or not.
>>>>>
>>>>>
>>>>> On Sun, Aug 16, 2015 at 3:09 AM, Joel Zambrano <dj...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I think while the idea of running multiple notes simultaneously is
>>>>>> great. It is really dancing around the lack of true multi user support in
>>>>>> Zeppelin. While the proposed solution would work if the applications
>>>>>> resources are those of the whole cluster, if the app is limited (say they
>>>>>> are 8 cores of 16, with some distribution in memory) then potentially your
>>>>>> note can hog all the resources and the scheduler will have to throttle all
>>>>>> other executions leaving you exactly where you are now.
>>>>>> While I think the solution is a good one, maybe this question makes
>>>>>> us think in adding true multiuser support.
>>>>>> Where we isolate resources (cluster and the notebooks themselves),
>>>>>> have separate login/identity and (I don't know if it's possible) share the
>>>>>> same context.
>>>>>>
>>>>>> Thanks,
>>>>>> Joel
>>>>>>
>>>>>> > On Aug 15, 2015, at 1:58 PM, Rohit Agarwal <mi...@gmail.com>
>>>>>> wrote:
>>>>>> >
>>>>>> > If the problem is that multiple users have to wait for each other
>>>>>> while
>>>>>> > using Zeppelin, the solution already exists: they can create a new
>>>>>> > interpreter by going to the interpreter page and attach it to their
>>>>>> > notebook - then they don't have to wait for others to submit their
>>>>>> job.
>>>>>> >
>>>>>> > But I agree, having paragraphs from one note wait for paragraphs
>>>>>> from other
>>>>>> > notes is a confusing default. We can get around that in two ways:
>>>>>> >
>>>>>> >   1. Create a new interpreter for each note and attach that
>>>>>> interpreter to
>>>>>> >   that note. This approach would require the least amount of code
>>>>>> changes but
>>>>>> >   is resource heavy and doesn't let you share Spark Context between
>>>>>> different
>>>>>> >   notes.
>>>>>> >   2. If we want to share the Spark Context between different notes,
>>>>>> we can
>>>>>> >   submit jobs from different notes into different fairscheduler
>>>>>> pools (
>>>>>> >
>>>>>> https://spark.apache.org/docs/1.4.0/job-scheduling.html#scheduling-within-an-application
>>>>>> ).
>>>>>> >   This can be done by submitting jobs from different notes in
>>>>>> different
>>>>>> >   threads. This will make sure that jobs from one note are run
>>>>>> sequentially
>>>>>> >   but jobs from different notes will be able to run in parallel.
>>>>>> >
>>>>>> > Neither of these options require any change in the Spark code.
>>>>>> >
>>>>>> > --
>>>>>> > Thanks & Regards
>>>>>> > Rohit Agarwal
>>>>>> > https://www.linkedin.com/in/rohitagarwal003
>>>>>> >
>>>>>> > On Sat, Aug 15, 2015 at 12:01 PM, Pranav Kumar Agarwal <
>>>>>> praagarw@gmail.com>
>>>>>> > wrote:
>>>>>> >
>>>>>> >> If someone can share about the idea of sharing single SparkContext
>>>>>> through
>>>>>> >>> multiple SparkILoop safely, it'll be really helpful.
>>>>>> >> Here is a proposal:
>>>>>> >> 1. In Spark code, change SparkIMain.scala to allow setting the
>>>>>> virtual
>>>>>> >> directory. While creating new instances of SparkIMain per notebook
>>>>>> from
>>>>>> >> zeppelin spark interpreter set all the instances of SparkIMain to
>>>>>> the same
>>>>>> >> virtual directory.
>>>>>> >> 2. Start HTTP server on that virtual directory and set this HTTP
>>>>>> server in
>>>>>> >> Spark Context using classserverUri method
>>>>>> >> 3. Scala generated code has a notion of packages. The default
>>>>>> package name
>>>>>> >> is "line$<linenumber>". Package name can be controlled using System
>>>>>> >> Property scala.repl.name.line. Setting this property to "notebook
>>>>>> id"
>>>>>> >> ensures that code generated by individual instances of SparkIMain
>>>>>> is
>>>>>> >> isolated from other instances of SparkIMain
>>>>>> >> 4. Build a queue inside interpreter to allow only one paragraph
>>>>>> execution
>>>>>> >> at a time per notebook.
>>>>>> >>
>>>>>> >> I have tested 1, 2, and 3 and it seems to provide isolation across
>>>>>> >> classnames. I'll work towards submitting a formal patch soon - Is
>>>>>> there any
>>>>>> >> Jira already for the same that I can uptake? Also I need to
>>>>>> understand:
>>>>>> >> 1. How does Zeppelin uptake Spark fixes? OR do I need to first work
>>>>>> >> towards getting Spark changes merged in Apache Spark github?
>>>>>> >>
>>>>>> >> Any suggestions on comments on the proposal are highly welcome.
>>>>>> >>
>>>>>> >> Regards,
>>>>>> >> -Pranav.
>>>>>> >>
>>>>>> >>> On 10/08/15 11:36 pm, moon soo Lee wrote:
>>>>>> >>>
>>>>>> >>> Hi piyush,
>>>>>> >>>
>>>>>> >>> Separate instance of SparkILoop SparkIMain for each notebook while
>>>>>> >>> sharing the SparkContext sounds great.
>>>>>> >>>
>>>>>> >>> Actually, i tried to do it, found problem that multiple
>>>>>> SparkILoop could
>>>>>> >>> generates the same class name, and spark executor confuses
>>>>>> classname since
>>>>>> >>> they're reading classes from single SparkContext.
>>>>>> >>>
>>>>>> >>> If someone can share about the idea of sharing single SparkContext
>>>>>> >>> through multiple SparkILoop safely, it'll be really helpful.
>>>>>> >>>
>>>>>> >>> Thanks,
>>>>>> >>> moon
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> On Mon, Aug 10, 2015 at 1:21 AM Piyush Mukati (Data Platform) <
>>>>>> >>> piyush.mukati@flipkart.com <ma...@flipkart.com>>
>>>>>> wrote:
>>>>>> >>>
>>>>>> >>>    Hi Moon,
>>>>>> >>>    Any suggestion on it, have to wait lot when multiple people
>>>>>> working
>>>>>> >>> with spark.
>>>>>> >>>    Can we create separate instance of   SparkILoop  SparkIMain and
>>>>>> >>> printstrems  for each notebook while sharing theSparkContext
>>>>>> >>> ZeppelinContext   SQLContext and DependencyResolver and then use
>>>>>> parallel
>>>>>> >>> scheduler ?
>>>>>> >>>    thanks
>>>>>> >>>
>>>>>> >>>    -piyush
>>>>>> >>>
>>>>>> >>>    Hi Moon,
>>>>>> >>>
>>>>>> >>>    How about tracking dedicated SparkContext for a notebook in
>>>>>> Spark's
>>>>>> >>>    remote interpreter - this will allow multiple users to run
>>>>>> their spark
>>>>>> >>>    paragraphs in parallel. Also, within a notebook only one
>>>>>> paragraph is
>>>>>> >>>    executed at a time.
>>>>>> >>>
>>>>>> >>>    Regards,
>>>>>> >>>    -Pranav.
>>>>>> >>>
>>>>>> >>>
>>>>>> >>>>    On 15/07/15 7:15 pm, moon soo Lee wrote:
>>>>>> >>>> Hi,
>>>>>> >>>>
>>>>>> >>>> Thanks for asking question.
>>>>>> >>>>
>>>>>> >>>> The reason is simply because of it is running code statements.
>>>>>> The
>>>>>> >>>> statements can have order and dependency. Imagine i have two
>>>>>> >>> paragraphs
>>>>>> >>>>
>>>>>> >>>> %spark
>>>>>> >>>> val a = 1
>>>>>> >>>>
>>>>>> >>>> %spark
>>>>>> >>>> print(a)
>>>>>> >>>>
>>>>>> >>>> If they're not running one by one, that means they possibly runs
>>>>>> in
>>>>>> >>>> random order and the output will be always different. Either '1'
>>>>>> or
>>>>>> >>>> 'val a can not found'.
>>>>>> >>>>
>>>>>> >>>> This is the reason why. But if there are nice idea to handle this
>>>>>> >>>> problem i agree using parallel scheduler would help a lot.
>>>>>> >>>>
>>>>>> >>>> Thanks,
>>>>>> >>>> moon
>>>>>> >>>> On Tue, 14 Jul 2015 at 7:59 PM linxi zeng
>>>>>> >>>> <linxizeng0615@gmail.com  <ma...@gmail.com>
>>>>>> >>> <mailto:linxizeng0615@gmail.com  <mailto:linxizeng0615@gmail.com
>>>>>> >>>
>>>>>> >>> wrote:
>>>>>> >>>>
>>>>>> >>>>    any one who have the same question with me? or this is not a
>>>>>> >>> question?
>>>>>> >>>>
>>>>>> >>>>    2015-07-14 11:47 GMT+08:00 linxi zeng <
>>>>>> linxizeng0615@gmail.com
>>>>>> >>> <ma...@gmail.com>
>>>>>> >>>>    <mailto:linxizeng0615@gmail.com  <mailto:
>>>>>> >>> linxizeng0615@gmail.com>>>:
>>>>>> >>>>
>>>>>> >>>>        hi, Moon:
>>>>>> >>>>           I notice that the getScheduler function in the
>>>>>> >>>>        SparkInterpreter.java return a FIFOScheduler which makes
>>>>>> the
>>>>>> >>>>        spark interpreter run spark job one by one. It's not a
>>>>>> good
>>>>>> >>>>        experience when couple of users do some work on zeppelin
>>>>>> at
>>>>>> >>>>        the same time, because they have to wait for each other.
>>>>>> >>>>        And at the same time, SparkSqlInterpreter can chose what
>>>>>> >>>>        scheduler to use by "zeppelin.spark.concurrentSQL".
>>>>>> >>>>        My question is, what kind of consideration do you based on
>>>>>> >>> to
>>>>>> >>>>        make such a decision?
>>>>>> >>>
>>>>>> >>>
>>>>>> >>>
>>>>>> >>>
>>>>>> >>>
>>>>>> >>
>>>>>>
>>>>>
>>>>>
>>>

Re: why zeppelin SparkInterpreter use FIFOScheduler

Posted by Pranav Agarwal <pr...@gmail.com>.
Any thoughts on how to package changes related to Spark?
On 17-Aug-2015 7:58 pm, "moon soo Lee" <mo...@apache.org> wrote:

> I think releasing SparkIMain and related objects after configurable
> inactivity would be good for now.
>
> About scheduler, I can help implementing such scheduler.
>
> Thanks,
> moon
>
> On Sun, Aug 16, 2015 at 11:54 PM Pranav Kumar Agarwal <pr...@gmail.com>
> wrote:
>
>> Hi Moon,
>>
>> Yes, the notebookid comes from InterpreterContext. At the moment
>> destroying SparkIMain on deletion of notebook is not handled. I think
>> SparkIMain is a lightweight object, do you see a concern having these
>> objects in a map? One possible option could be to destroy notebook related
>> objects when the inactivity on a notebook is greater than say 8 hours.
>>
>>
>> >> 4. Build a queue inside interpreter to allow only one paragraph
>> execution
>> >> at a time per notebook.
>>
>> One downside of this approach is, GUI will display RUNNING instead of
>> PENDING for jobs inside of queue in interpreter.
>>
>> Yes that's an good point. Having a scheduler at Zeppelin server to build
>> a scheduler that is parallel across notebook's and FIFO across paragraph's
>> will be nice. Is there any plan for having such a scheduler?
>>
>> Regards,
>> -Pranav.
>>
>>
>> On 17/08/15 5:38 am, moon soo Lee wrote:
>>
>> Pranav, proposal looks awesome!
>>
>> I have a question and feedback,
>>
>> You said you tested 1,2 and 3. To create SparkIMain per notebook, you
>> need information of notebook id. Did you get it from InterpreterContext?
>> Then how did you handle destroying of SparkIMain (when notebook is
>> deleting)?
>> As far as i know, interpreter not able to get information of notebook
>> deletion.
>>
>> >> 4. Build a queue inside interpreter to allow only one paragraph
>> execution
>> >> at a time per notebook.
>>
>> One downside of this approach is, GUI will display RUNNING instead of
>> PENDING for jobs inside of queue in interpreter.
>>
>> Best,
>> moon
>>
>> On Sun, Aug 16, 2015 at 12:55 AM IT CTO <go...@gmail.com> wrote:
>>
>>> +1 for "to re-factor the Zeppelin architecture so that it can handle
>>> multi-tenancy easily"
>>>
>>> On Sun, Aug 16, 2015 at 9:47 AM DuyHai Doan <do...@gmail.com>
>>> wrote:
>>>
>>>> Agree with Joel, we may think to re-factor the Zeppelin architecture so
>>>> that it can handle multi-tenancy easily. The technical solution proposed by Pranav
>>>> is great but it only applies to Spark. Right now, each interpreter has to
>>>> manage multi-tenancy its own way. Ultimately Zeppelin can propose a
>>>> multi-tenancy contract/info (like UserContext, similar to
>>>> InterpreterContext) so that each interpreter can choose to use or not.
>>>>
>>>>
>>>> On Sun, Aug 16, 2015 at 3:09 AM, Joel Zambrano <dj...@gmail.com>
>>>> wrote:
>>>>
>>>>> I think while the idea of running multiple notes simultaneously is
>>>>> great. It is really dancing around the lack of true multi user support in
>>>>> Zeppelin. While the proposed solution would work if the applications
>>>>> resources are those of the whole cluster, if the app is limited (say they
>>>>> are 8 cores of 16, with some distribution in memory) then potentially your
>>>>> note can hog all the resources and the scheduler will have to throttle all
>>>>> other executions leaving you exactly where you are now.
>>>>> While I think the solution is a good one, maybe this question makes us
>>>>> think in adding true multiuser support.
>>>>> Where we isolate resources (cluster and the notebooks themselves),
>>>>> have separate login/identity and (I don't know if it's possible) share the
>>>>> same context.
>>>>>
>>>>> Thanks,
>>>>> Joel
>>>>>
>>>>> > On Aug 15, 2015, at 1:58 PM, Rohit Agarwal <mi...@gmail.com>
>>>>> wrote:
>>>>> >
>>>>> > If the problem is that multiple users have to wait for each other
>>>>> while
>>>>> > using Zeppelin, the solution already exists: they can create a new
>>>>> > interpreter by going to the interpreter page and attach it to their
>>>>> > notebook - then they don't have to wait for others to submit their
>>>>> job.
>>>>> >
>>>>> > But I agree, having paragraphs from one note wait for paragraphs
>>>>> from other
>>>>> > notes is a confusing default. We can get around that in two ways:
>>>>> >
>>>>> >   1. Create a new interpreter for each note and attach that
>>>>> interpreter to
>>>>> >   that note. This approach would require the least amount of code
>>>>> changes but
>>>>> >   is resource heavy and doesn't let you share Spark Context between
>>>>> different
>>>>> >   notes.
>>>>> >   2. If we want to share the Spark Context between different notes,
>>>>> we can
>>>>> >   submit jobs from different notes into different fairscheduler
>>>>> pools (
>>>>> >
>>>>> https://spark.apache.org/docs/1.4.0/job-scheduling.html#scheduling-within-an-application
>>>>> ).
>>>>> >   This can be done by submitting jobs from different notes in
>>>>> different
>>>>> >   threads. This will make sure that jobs from one note are run
>>>>> sequentially
>>>>> >   but jobs from different notes will be able to run in parallel.
>>>>> >
>>>>> > Neither of these options require any change in the Spark code.
>>>>> >
>>>>> > --
>>>>> > Thanks & Regards
>>>>> > Rohit Agarwal
>>>>> > https://www.linkedin.com/in/rohitagarwal003
>>>>> >
>>>>> > On Sat, Aug 15, 2015 at 12:01 PM, Pranav Kumar Agarwal <
>>>>> praagarw@gmail.com>
>>>>> > wrote:
>>>>> >
>>>>> >> If someone can share about the idea of sharing single SparkContext
>>>>> through
>>>>> >>> multiple SparkILoop safely, it'll be really helpful.
>>>>> >> Here is a proposal:
>>>>> >> 1. In Spark code, change SparkIMain.scala to allow setting the
>>>>> virtual
>>>>> >> directory. While creating new instances of SparkIMain per notebook
>>>>> from
>>>>> >> zeppelin spark interpreter set all the instances of SparkIMain to
>>>>> the same
>>>>> >> virtual directory.
>>>>> >> 2. Start HTTP server on that virtual directory and set this HTTP
>>>>> server in
>>>>> >> Spark Context using classserverUri method
>>>>> >> 3. Scala generated code has a notion of packages. The default
>>>>> package name
>>>>> >> is "line$<linenumber>". Package name can be controlled using System
>>>>> >> Property scala.repl.name.line. Setting this property to "notebook
>>>>> id"
>>>>> >> ensures that code generated by individual instances of SparkIMain is
>>>>> >> isolated from other instances of SparkIMain
>>>>> >> 4. Build a queue inside interpreter to allow only one paragraph
>>>>> execution
>>>>> >> at a time per notebook.
>>>>> >>
>>>>> >> I have tested 1, 2, and 3 and it seems to provide isolation across
>>>>> >> classnames. I'll work towards submitting a formal patch soon - Is
>>>>> there any
>>>>> >> Jira already for the same that I can uptake? Also I need to
>>>>> understand:
>>>>> >> 1. How does Zeppelin uptake Spark fixes? OR do I need to first work
>>>>> >> towards getting Spark changes merged in Apache Spark github?
>>>>> >>
>>>>> >> Any suggestions on comments on the proposal are highly welcome.
>>>>> >>
>>>>> >> Regards,
>>>>> >> -Pranav.
>>>>> >>
>>>>> >>> On 10/08/15 11:36 pm, moon soo Lee wrote:
>>>>> >>>
>>>>> >>> Hi piyush,
>>>>> >>>
>>>>> >>> Separate instance of SparkILoop SparkIMain for each notebook while
>>>>> >>> sharing the SparkContext sounds great.
>>>>> >>>
>>>>> >>> Actually, i tried to do it, found problem that multiple SparkILoop
>>>>> could
>>>>> >>> generates the same class name, and spark executor confuses
>>>>> classname since
>>>>> >>> they're reading classes from single SparkContext.
>>>>> >>>
>>>>> >>> If someone can share about the idea of sharing single SparkContext
>>>>> >>> through multiple SparkILoop safely, it'll be really helpful.
>>>>> >>>
>>>>> >>> Thanks,
>>>>> >>> moon
>>>>> >>>
>>>>> >>>
>>>>> >>> On Mon, Aug 10, 2015 at 1:21 AM Piyush Mukati (Data Platform) <
>>>>> >>> piyush.mukati@flipkart.com <ma...@flipkart.com>>
>>>>> wrote:
>>>>> >>>
>>>>> >>>    Hi Moon,
>>>>> >>>    Any suggestion on it, have to wait lot when multiple people
>>>>> working
>>>>> >>> with spark.
>>>>> >>>    Can we create separate instance of   SparkILoop  SparkIMain and
>>>>> >>> printstrems  for each notebook while sharing theSparkContext
>>>>> >>> ZeppelinContext   SQLContext and DependencyResolver and then use
>>>>> parallel
>>>>> >>> scheduler ?
>>>>> >>>    thanks
>>>>> >>>
>>>>> >>>    -piyush
>>>>> >>>
>>>>> >>>    Hi Moon,
>>>>> >>>
>>>>> >>>    How about tracking dedicated SparkContext for a notebook in
>>>>> Spark's
>>>>> >>>    remote interpreter - this will allow multiple users to run
>>>>> their spark
>>>>> >>>    paragraphs in parallel. Also, within a notebook only one
>>>>> paragraph is
>>>>> >>>    executed at a time.
>>>>> >>>
>>>>> >>>    Regards,
>>>>> >>>    -Pranav.
>>>>> >>>
>>>>> >>>
>>>>> >>>>    On 15/07/15 7:15 pm, moon soo Lee wrote:
>>>>> >>>> Hi,
>>>>> >>>>
>>>>> >>>> Thanks for asking question.
>>>>> >>>>
>>>>> >>>> The reason is simply because of it is running code statements. The
>>>>> >>>> statements can have order and dependency. Imagine i have two
>>>>> >>> paragraphs
>>>>> >>>>
>>>>> >>>> %spark
>>>>> >>>> val a = 1
>>>>> >>>>
>>>>> >>>> %spark
>>>>> >>>> print(a)
>>>>> >>>>
>>>>> >>>> If they're not running one by one, that means they possibly runs
>>>>> in
>>>>> >>>> random order and the output will be always different. Either '1'
>>>>> or
>>>>> >>>> 'val a can not found'.
>>>>> >>>>
>>>>> >>>> This is the reason why. But if there are nice idea to handle this
>>>>> >>>> problem i agree using parallel scheduler would help a lot.
>>>>> >>>>
>>>>> >>>> Thanks,
>>>>> >>>> moon
>>>>> >>>> On Tue, 14 Jul 2015 at 7:59 PM linxi zeng
>>>>> >>>> <linxizeng0615@gmail.com  <ma...@gmail.com>
>>>>> >>> <mailto:linxizeng0615@gmail.com  <mailto:linxizeng0615@gmail.com
>>>>> >>>
>>>>> >>> wrote:
>>>>> >>>>
>>>>> >>>>    any one who have the same question with me? or this is not a
>>>>> >>> question?
>>>>> >>>>
>>>>> >>>>    2015-07-14 11:47 GMT+08:00 linxi zeng <linxizeng0615@gmail.com
>>>>> >>> <ma...@gmail.com>
>>>>> >>>>    <mailto:linxizeng0615@gmail.com  <mailto:
>>>>> >>> linxizeng0615@gmail.com>>>:
>>>>> >>>>
>>>>> >>>>        hi, Moon:
>>>>> >>>>           I notice that the getScheduler function in the
>>>>> >>>>        SparkInterpreter.java return a FIFOScheduler which makes
>>>>> the
>>>>> >>>>        spark interpreter run spark job one by one. It's not a good
>>>>> >>>>        experience when couple of users do some work on zeppelin at
>>>>> >>>>        the same time, because they have to wait for each other.
>>>>> >>>>        And at the same time, SparkSqlInterpreter can chose what
>>>>> >>>>        scheduler to use by "zeppelin.spark.concurrentSQL".
>>>>> >>>>        My question is, what kind of consideration do you based on
>>>>> >>> to
>>>>> >>>>        make such a decision?
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>
>>>>>
>>>>
>>>>
>>

Re: why zeppelin SparkInterpreter use FIFOScheduler

Posted by Pranav Agarwal <pr...@gmail.com>.
Any thoughts on how to package changes related to Spark?
On 17-Aug-2015 7:58 pm, "moon soo Lee" <mo...@apache.org> wrote:

> I think releasing SparkIMain and related objects after configurable
> inactivity would be good for now.
>
> About scheduler, I can help implementing such scheduler.
>
> Thanks,
> moon
>
> On Sun, Aug 16, 2015 at 11:54 PM Pranav Kumar Agarwal <pr...@gmail.com>
> wrote:
>
>> Hi Moon,
>>
> Yes, the notebook id comes from the InterpreterContext. At the moment,
> destroying the SparkIMain on deletion of a notebook is not handled. I think
> SparkIMain is a lightweight object; do you see a concern with having these
> objects in a map? One possible option could be to destroy notebook-related
> objects when the inactivity on a notebook is greater than, say, 8 hours.
>>
>>
>> >> 4. Build a queue inside interpreter to allow only one paragraph
>> execution
>> >> at a time per notebook.
>>
>> One downside of this approach is that the GUI will display RUNNING instead
>> of PENDING for jobs inside the queue in the interpreter.
>>
>> Yes, that's a good point. Having a scheduler at the Zeppelin server
>> that is parallel across notebooks and FIFO across paragraphs
>> will be nice. Is there any plan for having such a scheduler?
>>
>> Regards,
>> -Pranav.
>>
>>
>> On 17/08/15 5:38 am, moon soo Lee wrote:
>>
>> Pranav, proposal looks awesome!
>>
>> I have a question and feedback,
>>
>> You said you tested 1, 2 and 3. To create a SparkIMain per notebook, you
>> need the notebook id. Did you get it from the InterpreterContext?
>> Then how did you handle destroying the SparkIMain (when a notebook is
>> deleted)?
>> As far as I know, the interpreter is not able to get information about
>> notebook deletion.
>>
>> >> 4. Build a queue inside interpreter to allow only one paragraph
>> execution
>> >> at a time per notebook.
>>
>> One downside of this approach is that the GUI will display RUNNING instead
>> of PENDING for jobs inside the queue in the interpreter.
>>
>> Best,
>> moon
>>
>> On Sun, Aug 16, 2015 at 12:55 AM IT CTO <go...@gmail.com> wrote:
>>
>>> +1 for "to re-factor the Zeppelin architecture so that it can handle
>>> multi-tenancy easily"
>>>
>>> On Sun, Aug 16, 2015 at 9:47 AM DuyHai Doan <do...@gmail.com>
>>> wrote:
>>>
>>>> Agree with Joel, we may think of re-factoring the Zeppelin architecture so
>>>> that it can handle multi-tenancy easily. The technical solution proposed by Pranav
>>>> is great but it only applies to Spark. Right now, each interpreter has to
>>>> manage multi-tenancy its own way. Ultimately Zeppelin can propose a
>>>> multi-tenancy contract/info (like UserContext, similar to
>>>> InterpreterContext) so that each interpreter can choose to use or not.
>>>>
>>>>
>>>> On Sun, Aug 16, 2015 at 3:09 AM, Joel Zambrano <dj...@gmail.com>
>>>> wrote:
>>>>
>>>>> I think that while the idea of running multiple notes simultaneously is
>>>>> great, it is really dancing around the lack of true multi-user support in
>>>>> Zeppelin. While the proposed solution would work if the application's
>>>>> resources are those of the whole cluster, if the app is limited (say it
>>>>> has 8 cores of 16, with some distribution in memory) then potentially your
>>>>> note can hog all the resources and the scheduler will have to throttle all
>>>>> other executions, leaving you exactly where you are now.
>>>>> While I think the solution is a good one, maybe this question makes us
>>>>> think about adding true multi-user support.
>>>>> Where we isolate resources (cluster and the notebooks themselves),
>>>>> have separate login/identity and (I don't know if it's possible) share the
>>>>> same context.
>>>>>
>>>>> Thanks,
>>>>> Joel
>>>>>
>>>>> > On Aug 15, 2015, at 1:58 PM, Rohit Agarwal <mi...@gmail.com>
>>>>> wrote:
>>>>> >
>>>>> > If the problem is that multiple users have to wait for each other
>>>>> > while using Zeppelin, the solution already exists: they can create a
>>>>> > new interpreter by going to the interpreter page and attaching it to
>>>>> > their notebook - then they don't have to wait for others to submit
>>>>> > their job.
>>>>> >
>>>>> > But I agree, having paragraphs from one note wait for paragraphs
>>>>> from other
>>>>> > notes is a confusing default. We can get around that in two ways:
>>>>> >
>>>>> >   1. Create a new interpreter for each note and attach that
>>>>> interpreter to
>>>>> >   that note. This approach would require the least amount of code
>>>>> changes but
>>>>> >   is resource heavy and doesn't let you share Spark Context between
>>>>> different
>>>>> >   notes.
>>>>> >   2. If we want to share the Spark Context between different notes,
>>>>> we can
>>>>> >   submit jobs from different notes into different fairscheduler
>>>>> pools (
>>>>> >
>>>>> https://spark.apache.org/docs/1.4.0/job-scheduling.html#scheduling-within-an-application
>>>>> ).
>>>>> >   This can be done by submitting jobs from different notes in
>>>>> different
>>>>> >   threads. This will make sure that jobs from one note are run
>>>>> sequentially
>>>>> >   but jobs from different notes will be able to run in parallel.
>>>>> >
>>>>> > Neither of these options requires any change in the Spark code.
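>>>>> >
>>>>> > A rough sketch of option 2, assuming the shared context was created
>>>>> > with spark.scheduler.mode=FAIR (the function name is illustrative,
>>>>> > not an existing Zeppelin API):
>>>>> >
>>>>> > import org.apache.spark.SparkContext
>>>>> >
>>>>> > // Run each note's jobs from a dedicated thread tagged with a
>>>>> > // per-note fair-scheduler pool: jobs within a note stay sequential,
>>>>> > // while jobs from different notes can run in parallel.
>>>>> > def runInNotePool(sc: SparkContext, noteId: String)(job: => Unit): Thread = {
>>>>> >   val t = new Thread(new Runnable {
>>>>> >     def run(): Unit = {
>>>>> >       // setLocalProperty is thread-local, so it only affects jobs
>>>>> >       // submitted from this thread
>>>>> >       sc.setLocalProperty("spark.scheduler.pool", noteId)
>>>>> >       try job finally sc.setLocalProperty("spark.scheduler.pool", null)
>>>>> >     }
>>>>> >   })
>>>>> >   t.start()
>>>>> >   t
>>>>> > }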
>>>>> >
>>>>> > --
>>>>> > Thanks & Regards
>>>>> > Rohit Agarwal
>>>>> > https://www.linkedin.com/in/rohitagarwal003
>>>>> >
>>>>> > On Sat, Aug 15, 2015 at 12:01 PM, Pranav Kumar Agarwal <
>>>>> praagarw@gmail.com>
>>>>> > wrote:
>>>>> >
>>>>> >> If someone can share an idea of sharing a single SparkContext
>>>>> >>> through multiple SparkILoops safely, it'll be really helpful.
>>>>> >> Here is a proposal:
>>>>> >> 1. In Spark code, change SparkIMain.scala to allow setting the
>>>>> virtual
>>>>> >> directory. While creating new instances of SparkIMain per notebook
>>>>> from
>>>>> >> zeppelin spark interpreter set all the instances of SparkIMain to
>>>>> the same
>>>>> >> virtual directory.
>>>>> >> 2. Start HTTP server on that virtual directory and set this HTTP
>>>>> server in
>>>>> >> Spark Context using classserverUri method
>>>>> >> 3. Scala generated code has a notion of packages. The default
>>>>> package name
>>>>> >> is "line$<linenumber>". Package name can be controlled using System
>>>>> >> Property scala.repl.name.line. Setting this property to "notebook
>>>>> id"
>>>>> >> ensures that code generated by individual instances of SparkIMain is
>>>>> >> isolated from other instances of SparkIMain
>>>>> >> 4. Build a queue inside interpreter to allow only one paragraph
>>>>> execution
>>>>> >> at a time per notebook.
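>>>>> >>
>>>>> >> To make step 3 concrete, a rough sketch (the note-to-package mapping
>>>>> >> below is only an example, not the final patch):
>>>>> >>
>>>>> >> // Give each note's SparkIMain a unique generated-package prefix so
>>>>> >> // that class names emitted by different notes cannot collide on the
>>>>> >> // shared SparkContext.
>>>>> >> def configureReplNamesFor(noteId: String): Unit = {
>>>>> >>   val prefix = "note" + noteId.replaceAll("[^A-Za-z0-9]", "")
>>>>> >>   // the property is JVM-global, so it must be set just before
>>>>> >>   // instantiating that note's SparkIMain
>>>>> >>   System.setProperty("scala.repl.name.line", prefix)
>>>>> >> }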
>>>>> >>
>>>>> >> I have tested 1, 2, and 3 and it seems to provide isolation across
>>>>> >> classnames. I'll work towards submitting a formal patch soon - Is
>>>>> there any
>>>>> >> Jira already for the same that I can uptake? Also I need to
>>>>> understand:
>>>>> >> 1. How does Zeppelin uptake Spark fixes? OR do I need to first work
>>>>> >> towards getting Spark changes merged in Apache Spark github?
>>>>> >>
>>>>> >> Any suggestions or comments on the proposal are highly welcome.
>>>>> >>
>>>>> >> Regards,
>>>>> >> -Pranav.
>>>>> >>
>>>>> >>> On 10/08/15 11:36 pm, moon soo Lee wrote:
>>>>> >>>
>>>>> >>> Hi piyush,
>>>>> >>>
>>>>> >>> Separate instance of SparkILoop SparkIMain for each notebook while
>>>>> >>> sharing the SparkContext sounds great.
>>>>> >>>
>>>>> >>> Actually, I tried to do it and found a problem: multiple SparkILoops
>>>>> >>> could generate the same class name, and the Spark executor confuses
>>>>> >>> classnames since they're reading classes from a single SparkContext.
>>>>> >>>
>>>>> >>> If someone can share an idea of sharing a single SparkContext
>>>>> >>> through multiple SparkILoops safely, it'll be really helpful.
>>>>> >>>
>>>>> >>> Thanks,
>>>>> >>> moon
>>>>> >>>
>>>>> >>>
>>>>> >>> On Mon, Aug 10, 2015 at 1:21 AM Piyush Mukati (Data Platform) <
>>>>> >>> piyush.mukati@flipkart.com <ma...@flipkart.com>>
>>>>> wrote:
>>>>> >>>
>>>>> >>>    Hi Moon,
>>>>> >>>    Any suggestion on it? We have to wait a lot when multiple people
>>>>> >>> are working with Spark.
>>>>> >>>    Can we create separate instances of SparkILoop, SparkIMain and
>>>>> >>> print streams for each notebook while sharing the SparkContext,
>>>>> >>> ZeppelinContext, SQLContext and DependencyResolver, and then use the
>>>>> >>> parallel scheduler?
>>>>> >>>    thanks
>>>>> >>>
>>>>> >>>    -piyush
>>>>> >>>
>>>>> >>>    Hi Moon,
>>>>> >>>
>>>>> >>>    How about tracking dedicated SparkContext for a notebook in
>>>>> Spark's
>>>>> >>>    remote interpreter - this will allow multiple users to run
>>>>> their spark
>>>>> >>>    paragraphs in parallel. Also, within a notebook only one
>>>>> paragraph is
>>>>> >>>    executed at a time.
>>>>> >>>
>>>>> >>>    Regards,
>>>>> >>>    -Pranav.
>>>>> >>>
>>>>> >>>
>>>>> >>>>    On 15/07/15 7:15 pm, moon soo Lee wrote:
>>>>> >>>> Hi,
>>>>> >>>>
>>>>> >>>> Thanks for asking the question.
>>>>> >>>>
>>>>> >>>> The reason is simply because it is running code statements. The
>>>>> >>>> statements can have order and dependency. Imagine I have two
>>>>> >>> paragraphs
>>>>> >>>>
>>>>> >>>> %spark
>>>>> >>>> val a = 1
>>>>> >>>>
>>>>> >>>> %spark
>>>>> >>>> print(a)
>>>>> >>>>
>>>>> >>>> If they're not running one by one, that means they possibly run in
>>>>> >>>> random order and the output will always be different: either '1'
>>>>> >>>> or a 'value a not found' error.
>>>>> >>>>
>>>>> >>>> This is the reason why. But if there is a nice idea to handle this
>>>>> >>>> problem, I agree using a parallel scheduler would help a lot.
>>>>> >>>>
>>>>> >>>> Thanks,
>>>>> >>>> moon
>>>>> >>>> On Tue, Jul 14, 2015 at 7:59 PM linxi zeng
>>>>> >>>> <linxizeng0615@gmail.com> wrote:
>>>>> >>>>
>>>>> >>>>    anyone who has the same question as me? or is this not a
>>>>> >>> question?
>>>>> >>>>
>>>>> >>>>    2015-07-14 11:47 GMT+08:00 linxi zeng <linxizeng0615@gmail.com>:
>>>>> >>>>
>>>>> >>>>        hi, Moon:
>>>>> >>>>           I notice that the getScheduler function in
>>>>> >>>>        SparkInterpreter.java returns a FIFOScheduler, which makes
>>>>> >>>>        the spark interpreter run Spark jobs one by one. It's not a
>>>>> >>>>        good experience when a couple of users do some work on
>>>>> >>>>        zeppelin at the same time, because they have to wait for
>>>>> >>>>        each other. And at the same time, SparkSqlInterpreter can
>>>>> >>>>        choose which scheduler to use via
>>>>> >>>>        "zeppelin.spark.concurrentSQL".
>>>>> >>>>        My question is, what kind of consideration did you base
>>>>> >>>>        such a decision on?
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>
>>>>>
>>>>
>>>>
>>

Re: why zeppelin SparkInterpreter use FIFOScheduler

Posted by moon soo Lee <mo...@apache.org>.
I think releasing SparkIMain and related objects after configurable
inactivity would be good for now.

About the scheduler, I can help implement such a scheduler.
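
A minimal sketch of the kind of scheduler I have in mind (the names are
made up, not the actual Zeppelin scheduler API): one single-threaded
executor per note keeps paragraphs of a note FIFO while separate notes
run in parallel.

import java.util.concurrent.{ConcurrentHashMap, ExecutorService, Executors}

object NoteScheduler {
  // one single-threaded executor per note id
  private val pools = new ConcurrentHashMap[String, ExecutorService]()

  // paragraphs of the same note run FIFO; different notes run in parallel
  def submit(noteId: String, paragraph: Runnable): Unit = {
    pools.computeIfAbsent(noteId,
      (_: String) => Executors.newSingleThreadExecutor())
      .submit(paragraph)
  }

  // called when a note is deleted or has been idle too long
  def release(noteId: String): Unit = {
    val pool = pools.remove(noteId)
    if (pool != null) pool.shutdown()
  }
}

The PENDING vs RUNNING display issue mentioned below would still need the
queue state reported back to the Zeppelin server.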

Thanks,
moon

On Sun, Aug 16, 2015 at 11:54 PM Pranav Kumar Agarwal <pr...@gmail.com>
wrote:

> Hi Moon,
>
> Yes, the notebook id comes from the InterpreterContext. At the moment,
> destroying the SparkIMain on deletion of a notebook is not handled. I think
> SparkIMain is a lightweight object; do you see a concern with having these
> objects in a map? One possible option could be to destroy notebook-related
> objects when the inactivity on a notebook is greater than, say, 8 hours.
>
>
> >> 4. Build a queue inside interpreter to allow only one paragraph
> execution
> >> at a time per notebook.
>
> One downside of this approach is that the GUI will display RUNNING instead
> of PENDING for jobs inside the queue in the interpreter.
>
> Yes, that's a good point. Having a scheduler at the Zeppelin server
> that is parallel across notebooks and FIFO across paragraphs
> will be nice. Is there any plan for having such a scheduler?
>
> Regards,
> -Pranav.
>
>
> On 17/08/15 5:38 am, moon soo Lee wrote:
>
> Pranav, proposal looks awesome!
>
> I have a question and feedback,
>
> You said you tested 1, 2 and 3. To create a SparkIMain per notebook, you
> need the notebook id. Did you get it from the InterpreterContext?
> Then how did you handle destroying the SparkIMain (when a notebook is
> deleted)?
> As far as I know, the interpreter is not able to get information about
> notebook deletion.
>
> >> 4. Build a queue inside interpreter to allow only one paragraph
> execution
> >> at a time per notebook.
>
> One downside of this approach is that the GUI will display RUNNING instead
> of PENDING for jobs inside the queue in the interpreter.
>
> Best,
> moon
>
> On Sun, Aug 16, 2015 at 12:55 AM IT CTO <go...@gmail.com> wrote:
>
>> +1 for "to re-factor the Zeppelin architecture so that it can handle
>> multi-tenancy easily"
>>
>> On Sun, Aug 16, 2015 at 9:47 AM DuyHai Doan <do...@gmail.com> wrote:
>>
>>> Agree with Joel, we may think of re-factoring the Zeppelin architecture so
>>> that it can handle multi-tenancy easily. The technical solution proposed by Pranav
>>> is great but it only applies to Spark. Right now, each interpreter has to
>>> manage multi-tenancy its own way. Ultimately Zeppelin can propose a
>>> multi-tenancy contract/info (like UserContext, similar to
>>> InterpreterContext) so that each interpreter can choose to use or not.
>>>
>>>
>>> On Sun, Aug 16, 2015 at 3:09 AM, Joel Zambrano <dj...@gmail.com> wrote:
>>>
>>>> I think that while the idea of running multiple notes simultaneously is
>>>> great, it is really dancing around the lack of true multi-user support in
>>>> Zeppelin. While the proposed solution would work if the application's
>>>> resources are those of the whole cluster, if the app is limited (say it
>>>> has 8 cores of 16, with some distribution in memory) then potentially your
>>>> note can hog all the resources and the scheduler will have to throttle all
>>>> other executions, leaving you exactly where you are now.
>>>> While I think the solution is a good one, maybe this question makes us
>>>> think about adding true multi-user support.
>>>> Where we isolate resources (cluster and the notebooks themselves), have
>>>> separate login/identity and (I don't know if it's possible) share the same
>>>> context.
>>>>
>>>> Thanks,
>>>> Joel
>>>>
>>>> > On Aug 15, 2015, at 1:58 PM, Rohit Agarwal <mi...@gmail.com>
>>>> wrote:
>>>> >
>>>> > If the problem is that multiple users have to wait for each other
>>>> > while using Zeppelin, the solution already exists: they can create a
>>>> > new interpreter by going to the interpreter page and attaching it to
>>>> > their notebook - then they don't have to wait for others to submit
>>>> > their job.
>>>> >
>>>> > But I agree, having paragraphs from one note wait for paragraphs from
>>>> other
>>>> > notes is a confusing default. We can get around that in two ways:
>>>> >
>>>> >   1. Create a new interpreter for each note and attach that
>>>> interpreter to
>>>> >   that note. This approach would require the least amount of code
>>>> changes but
>>>> >   is resource heavy and doesn't let you share Spark Context between
>>>> different
>>>> >   notes.
>>>> >   2. If we want to share the Spark Context between different notes,
>>>> we can
>>>> >   submit jobs from different notes into different fairscheduler pools
>>>> (
>>>> >
>>>> https://spark.apache.org/docs/1.4.0/job-scheduling.html#scheduling-within-an-application
>>>> ).
>>>> >   This can be done by submitting jobs from different notes in
>>>> different
>>>> >   threads. This will make sure that jobs from one note are run
>>>> sequentially
>>>> >   but jobs from different notes will be able to run in parallel.
>>>> >
>>>> > Neither of these options requires any change in the Spark code.
>>>> >
>>>> > --
>>>> > Thanks & Regards
>>>> > Rohit Agarwal
>>>> > https://www.linkedin.com/in/rohitagarwal003
>>>> >
>>>> > On Sat, Aug 15, 2015 at 12:01 PM, Pranav Kumar Agarwal <
>>>> praagarw@gmail.com>
>>>> > wrote:
>>>> >
>>>> >> If someone can share an idea of sharing a single SparkContext
>>>> >>> through multiple SparkILoops safely, it'll be really helpful.
>>>> >> Here is a proposal:
>>>> >> 1. In Spark code, change SparkIMain.scala to allow setting the
>>>> virtual
>>>> >> directory. While creating new instances of SparkIMain per notebook
>>>> from
>>>> >> zeppelin spark interpreter set all the instances of SparkIMain to
>>>> the same
>>>> >> virtual directory.
>>>> >> 2. Start HTTP server on that virtual directory and set this HTTP
>>>> server in
>>>> >> Spark Context using classserverUri method
>>>> >> 3. Scala generated code has a notion of packages. The default
>>>> package name
>>>> >> is "line$<linenumber>". Package name can be controlled using System
>>>> >> Property scala.repl.name.line. Setting this property to "notebook id"
>>>> >> ensures that code generated by individual instances of SparkIMain is
>>>> >> isolated from other instances of SparkIMain
>>>> >> 4. Build a queue inside interpreter to allow only one paragraph
>>>> execution
>>>> >> at a time per notebook.
>>>> >>
>>>> >> I have tested 1, 2, and 3 and it seems to provide isolation across
>>>> >> classnames. I'll work towards submitting a formal patch soon - Is
>>>> there any
>>>> >> Jira already for the same that I can uptake? Also I need to
>>>> understand:
>>>> >> 1. How does Zeppelin uptake Spark fixes? OR do I need to first work
>>>> >> towards getting Spark changes merged in Apache Spark github?
>>>> >>
>>>> >> Any suggestions or comments on the proposal are highly welcome.
>>>> >>
>>>> >> Regards,
>>>> >> -Pranav.
>>>> >>
>>>> >>> On 10/08/15 11:36 pm, moon soo Lee wrote:
>>>> >>>
>>>> >>> Hi piyush,
>>>> >>>
>>>> >>> Separate instance of SparkILoop SparkIMain for each notebook while
>>>> >>> sharing the SparkContext sounds great.
>>>> >>>
>>>> >>> Actually, I tried to do it and found a problem: multiple SparkILoops
>>>> >>> could generate the same class name, and the Spark executor confuses
>>>> >>> classnames since they're reading classes from a single SparkContext.
>>>> >>>
>>>> >>> If someone can share an idea of sharing a single SparkContext
>>>> >>> through multiple SparkILoops safely, it'll be really helpful.
>>>> >>>
>>>> >>> Thanks,
>>>> >>> moon
>>>> >>>
>>>> >>>
>>>> >>> On Mon, Aug 10, 2015 at 1:21 AM Piyush Mukati (Data Platform) <
>>>> >>> piyush.mukati@flipkart.com <ma...@flipkart.com>>
>>>> wrote:
>>>> >>>
>>>> >>>    Hi Moon,
>>>> >>>    Any suggestion on it? We have to wait a lot when multiple people
>>>> >>> are working with Spark.
>>>> >>>    Can we create separate instances of SparkILoop, SparkIMain and
>>>> >>> print streams for each notebook while sharing the SparkContext,
>>>> >>> ZeppelinContext, SQLContext and DependencyResolver, and then use the
>>>> >>> parallel scheduler?
>>>> >>>    thanks
>>>> >>>
>>>> >>>    -piyush
>>>> >>>
>>>> >>>    Hi Moon,
>>>> >>>
>>>> >>>    How about tracking dedicated SparkContext for a notebook in
>>>> Spark's
>>>> >>>    remote interpreter - this will allow multiple users to run their
>>>> spark
>>>> >>>    paragraphs in parallel. Also, within a notebook only one
>>>> paragraph is
>>>> >>>    executed at a time.
>>>> >>>
>>>> >>>    Regards,
>>>> >>>    -Pranav.
>>>> >>>
>>>> >>>
>>>> >>>>    On 15/07/15 7:15 pm, moon soo Lee wrote:
>>>> >>>> Hi,
>>>> >>>>
>>>> >>>> Thanks for asking the question.
>>>> >>>>
>>>> >>>> The reason is simply because it is running code statements. The
>>>> >>>> statements can have order and dependency. Imagine I have two
>>>> >>> paragraphs
>>>> >>>>
>>>> >>>> %spark
>>>> >>>> val a = 1
>>>> >>>>
>>>> >>>> %spark
>>>> >>>> print(a)
>>>> >>>>
>>>> >>>> If they're not running one by one, that means they possibly run in
>>>> >>>> random order and the output will always be different: either '1' or
>>>> >>>> a 'value a not found' error.
>>>> >>>>
>>>> >>>> This is the reason why. But if there is a nice idea to handle this
>>>> >>>> problem, I agree using a parallel scheduler would help a lot.
>>>> >>>>
>>>> >>>> Thanks,
>>>> >>>> moon
>>>> >>>> On Tue, Jul 14, 2015 at 7:59 PM linxi zeng
>>>> >>>> <linxizeng0615@gmail.com> wrote:
>>>> >>>>
>>>> >>>>    anyone who has the same question as me? or is this not a
>>>> >>> question?
>>>> >>>>
>>>> >>>>    2015-07-14 11:47 GMT+08:00 linxi zeng <linxizeng0615@gmail.com>:
>>>> >>>>
>>>> >>>>        hi, Moon:
>>>> >>>>           I notice that the getScheduler function in
>>>> >>>>        SparkInterpreter.java returns a FIFOScheduler, which makes the
>>>> >>>>        spark interpreter run Spark jobs one by one. It's not a good
>>>> >>>>        experience when a couple of users do some work on zeppelin at
>>>> >>>>        the same time, because they have to wait for each other.
>>>> >>>>        And at the same time, SparkSqlInterpreter can choose which
>>>> >>>>        scheduler to use via "zeppelin.spark.concurrentSQL".
>>>> >>>>        My question is, what kind of consideration did you base
>>>> >>>>        such a decision on?
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>
>>>>
>>>
>>>
>

Re: why zeppelin SparkInterpreter use FIFOScheduler

Posted by Pranav Kumar Agarwal <pr...@gmail.com>.
Hi Moon,

Yes, the notebook id comes from the InterpreterContext. At the moment,
destroying the SparkIMain on deletion of a notebook is not handled. I think
SparkIMain is a lightweight object; do you see a concern with having these
objects in a map? One possible option could be to destroy notebook-related
objects when the inactivity on a notebook is greater than, say, 8 hours.
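
Roughly what I have in mind, as a sketch (the names are invented; the real
patch would hook into the interpreter's interpret path, and it assumes
SparkIMain exposes a close()-style teardown):

import java.util.concurrent.ConcurrentHashMap

// Cache per-notebook REPL state (e.g. a SparkIMain) and destroy entries
// that have been idle longer than a configurable TTL.
class NoteStateCache[T](ttlMs: Long)(destroy: T => Unit) {
  private class Entry(val value: T) {
    var lastUsed: Long = System.currentTimeMillis
  }
  private val entries = new ConcurrentHashMap[String, Entry]()

  // look up (or build) the state for a note and mark it as used
  def getOrCreate(noteId: String)(create: => T): T = {
    val e = entries.computeIfAbsent(noteId, (_: String) => new Entry(create))
    e.lastUsed = System.currentTimeMillis
    e.value
  }

  // run periodically (or on each access) to drop idle notebooks
  def evictIdle(): Unit = {
    val now = System.currentTimeMillis
    val it = entries.entrySet.iterator
    while (it.hasNext) {
      val entry = it.next()
      if (now - entry.getValue.lastUsed > ttlMs) {
        it.remove()
        destroy(entry.getValue.value)
      }
    }
  }
}

For the 8-hour policy above, something like
new NoteStateCache[SparkIMain](8 * 60 * 60 * 1000L)(_.close()) on the
interpreter side.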

> >> 4. Build a queue inside interpreter to allow only one paragraph 
> execution
> >> at a time per notebook.
>
> One downside of this approach is that the GUI will display RUNNING instead
> of PENDING for jobs inside the queue in the interpreter.
Yes, that's a good point. Having a scheduler at the Zeppelin server
that is parallel across notebooks and FIFO across paragraphs
will be nice. Is there any plan for having such a scheduler?

Regards,
-Pranav.

On 17/08/15 5:38 am, moon soo Lee wrote:
> Pranav, proposal looks awesome!
>
> I have a question and feedback,
>
> You said you tested 1, 2 and 3. To create a SparkIMain per notebook, you
> need the notebook id. Did you get it from the InterpreterContext?
> Then how did you handle destroying the SparkIMain (when a notebook is
> deleted)?
> As far as I know, the interpreter is not able to get information about
> notebook deletion.
>
> >> 4. Build a queue inside interpreter to allow only one paragraph 
> execution
> >> at a time per notebook.
>
> One downside of this approach is that the GUI will display RUNNING instead
> of PENDING for jobs inside the queue in the interpreter.
>
> Best,
> moon
>
> On Sun, Aug 16, 2015 at 12:55 AM IT CTO <goi.cto@gmail.com 
> <ma...@gmail.com>> wrote:
>
>     +1 for "to re-factor the Zeppelin architecture so that it can
>     handle multi-tenancy easily"
>
>     On Sun, Aug 16, 2015 at 9:47 AM DuyHai Doan <doanduyhai@gmail.com
>     <ma...@gmail.com>> wrote:
>
>         Agree with Joel, we may think of re-factoring the Zeppelin
>         architecture so that it can handle multi-tenancy easily. The
>         technical solution proposed by Pranav is great but it only
>         applies to Spark. Right now, each interpreter has to manage
>         multi-tenancy its own way. Ultimately Zeppelin can propose a
>         multi-tenancy contract/info (like UserContext, similar to
>         InterpreterContext) so that each interpreter can choose to use
>         or not.
>
>
>         On Sun, Aug 16, 2015 at 3:09 AM, Joel Zambrano
>         <djoelz@gmail.com <ma...@gmail.com>> wrote:
>
>             I think that while the idea of running multiple notes
>             simultaneously is great, it is really dancing around the
>             lack of true multi-user support in Zeppelin. While the
>             proposed solution would work if the application's resources
>             are those of the whole cluster, if the app is limited (say
>             it has 8 cores of 16, with some distribution in memory)
>             then potentially your note can hog all the resources and
>             the scheduler will have to throttle all other executions,
>             leaving you exactly where you are now.
>             While I think the solution is a good one, maybe this
>             question makes us think about adding true multi-user support.
>             Where we isolate resources (cluster and the notebooks
>             themselves), have separate login/identity and (I don't
>             know if it's possible) share the same context.
>
>             Thanks,
>             Joel
>
>             > On Aug 15, 2015, at 1:58 PM, Rohit Agarwal
>             <mindprince@gmail.com <ma...@gmail.com>> wrote:
>             >
>             > If the problem is that multiple users have to wait for
>             > each other while using Zeppelin, the solution already
>             > exists: they can create a new interpreter by going to the
>             > interpreter page and attaching it to their notebook - then
>             > they don't have to wait for others to submit their job.
>             >
>             > But I agree, having paragraphs from one note wait for
>             paragraphs from other
>             > notes is a confusing default. We can get around that in
>             two ways:
>             >
>             >   1. Create a new interpreter for each note and attach
>             that interpreter to
>             >   that note. This approach would require the least amount of code changes but
>             >   is resource heavy and doesn't let you share Spark
>             Context between different
>             >   notes.
>             >   2. If we want to share the Spark Context between
>             different notes, we can
>             >   submit jobs from different notes into different
>             fairscheduler pools (
>             >
>             https://spark.apache.org/docs/1.4.0/job-scheduling.html#scheduling-within-an-application).
>             >   This can be done by submitting jobs from different
>             notes in different
>             >   threads. This will make sure that jobs from one note
>             are run sequentially
>             >   but jobs from different notes will be able to run in
>             parallel.
>             >
>             > Neither of these options requires any change in the Spark
>             > code.
>             >
>             > --
>             > Thanks & Regards
>             > Rohit Agarwal
>             > https://www.linkedin.com/in/rohitagarwal003
>             >
>             > On Sat, Aug 15, 2015 at 12:01 PM, Pranav Kumar Agarwal
>             <praagarw@gmail.com <ma...@gmail.com>>
>             > wrote:
>             >
>             >> If someone can share an idea of sharing a single
>             >>> SparkContext through multiple SparkILoops safely, it'll
>             >>> be really helpful.
>             >> Here is a proposal:
>             >> 1. In Spark code, change SparkIMain.scala to allow
>             setting the virtual
>             >> directory. While creating new instances of SparkIMain
>             per notebook from
>             >> zeppelin spark interpreter set all the instances of
>             SparkIMain to the same
>             >> virtual directory.
>             >> 2. Start HTTP server on that virtual directory and set
>             this HTTP server in
>             >> Spark Context using classserverUri method
>             >> 3. Scala generated code has a notion of packages. The
>             default package name
>             >> is "line$<linenumber>". Package name can be controlled
>             using System
>             >> Property scala.repl.name.line. Setting this property to
>             "notebook id"
>             >> ensures that code generated by individual instances of
>             SparkIMain is
>             >> isolated from other instances of SparkIMain
>             >> 4. Build a queue inside interpreter to allow only one
>             paragraph execution
>             >> at a time per notebook.
>             >>
>             >> I have tested 1, 2, and 3 and it seems to provide
>             isolation across
>             >> classnames. I'll work towards submitting a formal patch
>             soon - Is there any
>             >> Jira already for the same that I can uptake? Also I
>             need to understand:
>             >> 1. How does Zeppelin uptake Spark fixes? OR do I need
>             to first work
>             >> towards getting Spark changes merged in Apache Spark
>             github?
>             >>
>             >> Any suggestions or comments on the proposal are highly
>             >> welcome.
>             >>
>             >> Regards,
>             >> -Pranav.
>             >>
>             >>> On 10/08/15 11:36 pm, moon soo Lee wrote:
>             >>>
>             >>> Hi piyush,
>             >>>
>             >>> Separate instance of SparkILoop SparkIMain for each
>             notebook while
>             >>> sharing the SparkContext sounds great.
>             >>>
>             >>> Actually, I tried to do it and found a problem: multiple
>             >>> SparkILoops could generate the same class name, and the
>             >>> Spark executor confuses classnames since they're reading
>             >>> classes from a single SparkContext.
>             >>>
>             >>> If someone can share an idea of sharing a single
>             >>> SparkContext through multiple SparkILoops safely, it'll
>             >>> be really helpful.
>             >>>
>             >>> Thanks,
>             >>> moon
>             >>>
>             >>>
>             >>> On Mon, Aug 10, 2015 at 1:21 AM Piyush Mukati (Data
>             Platform) <
>             >>> piyush.mukati@flipkart.com
>             <ma...@flipkart.com>
>             <mailto:piyush.mukati@flipkart.com
>             <ma...@flipkart.com>>> wrote:
>             >>>
>             >>>    Hi Moon,
>             >>>    Any suggestion on it? We have to wait a lot when
>             >>> multiple people are working with Spark.
>             >>>    Can we create separate instances of SparkILoop,
>             >>> SparkIMain and print streams for each notebook while
>             >>> sharing the SparkContext, ZeppelinContext, SQLContext and
>             >>> DependencyResolver, and then use the parallel scheduler?
>             >>>    thanks
>             >>>
>             >>>    -piyush
>             >>>
>             >>>    Hi Moon,
>             >>>
>             >>>    How about tracking dedicated SparkContext for a
>             notebook in Spark's
>             >>>    remote interpreter - this will allow multiple users
>             to run their spark
>             >>>    paragraphs in parallel. Also, within a notebook
>             only one paragraph is
>             >>>    executed at a time.
>             >>>
>             >>>    Regards,
>             >>>    -Pranav.
>             >>>
>             >>>
>             >>>>    On 15/07/15 7:15 pm, moon soo Lee wrote:
>             >>>> Hi,
>             >>>>
>             >>>> Thanks for asking the question.
>             >>>>
>             >>>> The reason is simply because it is running code
>             >>>> statements. The statements can have order and
>             >>>> dependency. Imagine I have two
>             >>> paragraphs
>             >>>>
>             >>>> %spark
>             >>>> val a = 1
>             >>>>
>             >>>> %spark
>             >>>> print(a)
>             >>>>
>             >>>> If they're not running one by one, that means they
>             >>>> possibly run in random order and the output will always
>             >>>> be different: either '1' or a 'value a not found' error.
>             >>>>
>             >>>> This is the reason why. But if there is a nice idea to
>             >>>> handle this problem, I agree using a parallel scheduler
>             >>>> would help a lot.
>             >>>>
>             >>>> Thanks,
>             >>>> moon
>             >>>> On Tue, Jul 14, 2015 at 7:59 PM linxi zeng
>             >>>> <linxizeng0615@gmail.com> wrote:
>             >>>>
>             >>>>    anyone who has the same question as me? or is this
>             >>> not a question?
>             >>>>
>             >>>>    2015-07-14 11:47 GMT+08:00 linxi zeng
>             >>>> <linxizeng0615@gmail.com>:
>             >>>>
>             >>>>        hi, Moon:
>             >>>>           I notice that the getScheduler function in
>             >>>>        SparkInterpreter.java returns a FIFOScheduler,
>             >>>>        which makes the spark interpreter run Spark jobs
>             >>>>        one by one. It's not a good experience when a
>             >>>>        couple of users do some work on zeppelin at the
>             >>>>        same time, because they have to wait for each
>             >>>>        other. And at the same time, SparkSqlInterpreter
>             >>>>        can choose which scheduler to use via
>             >>>>        "zeppelin.spark.concurrentSQL".
>             >>>>        My question is, what kind of consideration did
>             >>>>        you base such a decision on?
>             >>>
>             >>>
>             >>>
>             >>>
>             >>>
>             >>
>
>


Re: why zeppelin SparkInterpreter use FIFOScheduler

Posted by Dimp Bhat <di...@gmail.com>.
Thanks Piyush. Do we have any ETA for this to be sent for review?

Dimple

On Wed, Jan 13, 2016 at 6:23 PM, Piyush Mukati (Data Platform) <
piyush.mukati@flipkart.com> wrote:

> Hi,
>  The code is available here
>
> https://github.com/piyush-mukati/incubator-zeppelin/tree/parallel_scheduler_support_spark
>
>
> some of the testing is still left.
>
> On Wed, Jan 13, 2016 at 11:47 PM, Dimp Bhat <di...@gmail.com> wrote:
>
> > Hi Pranav,
> > When do you plan to send out the code for running notebooks in parallel ?
> >
> > Dimple
> >
> > On Tue, Nov 17, 2015 at 3:27 AM, Pranav Kumar Agarwal <
> praagarw@gmail.com>
> > wrote:
> >
> >> Hi Rohit,
> >>
> >> We implemented the proposal and are able to run Zeppelin as a hosted
> >> service inside our organization. Our internal forked version has
> >> pluggable authentication and type-ahead.
> >>
> >> I need to get the work ported to the latest codebase and chop out the
> >> auth-changes portion. We'll be submitting it soon.
> >>
> >> We'll target to get this out for review by 11/26.
> >>
> >> Regards,
> >> -Pranav.
> >>
> >>
> >>
> >> On 17/11/15 4:34 am, Rohit Agarwal wrote:
> >>
> >> Hey Pranav,
> >>
> >> Did you make any progress on this?
> >>
> >> --
> >> Rohit
> >>
> >> On Sunday, August 16, 2015, moon soo Lee <mo...@apache.org> wrote:
> >>
> >>> Pranav, proposal looks awesome!
> >>>
> >>> I have a question and feedback,
> >>>
> >>> You said you tested 1, 2 and 3. To create a SparkIMain per notebook, you
> >>> need the notebook id. Did you get it from the
> >>> InterpreterContext?
> >>> Then how did you handle destroying the SparkIMain (when a notebook is
> >>> deleted)?
> >>> As far as I know, the interpreter is not able to get information about
> >>> notebook deletion.
> >>>
> >>> >> 4. Build a queue inside interpreter to allow only one paragraph
> >>> execution
> >>> >> at a time per notebook.
> >>>
> >>> One downside of this approach is that the GUI will display RUNNING
> >>> instead of PENDING for jobs inside the queue in the interpreter.
> >>>
> >>> Best,
> >>> moon
> >>>
> >>> On Sun, Aug 16, 2015 at 12:55 AM IT CTO <go...@gmail.com> wrote:
> >>>
> >>>> +1 for "to re-factor the Zeppelin architecture so that it can handle
> >>>> multi-tenancy easily"
> >>>>
> >>>> On Sun, Aug 16, 2015 at 9:47 AM DuyHai Doan <do...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Agree with Joel, we may think of re-factoring the Zeppelin architecture
> >>>>> so that it can handle multi-tenancy easily. The technical solution
> proposed
> >>>>> by Pranav is great but it only applies to Spark. Right now, each
> >>>>> interpreter has to manage multi-tenancy its own way. Ultimately
> Zeppelin
> >>>>> can propose a multi-tenancy contract/info (like UserContext, similar
> to
> >>>>> InterpreterContext) so that each interpreter can choose to use or
> not.
> >>>>>
> >>>>>
> >>>>> On Sun, Aug 16, 2015 at 3:09 AM, Joel Zambrano <dj...@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> I think that while the idea of running multiple notes simultaneously
> >>>>>> is great, it is really dancing around the lack of true multi-user
> >>>>>> support in Zeppelin. While the proposed solution would work if the
> >>>>>> application's resources are those of the whole cluster, if the app is
> >>>>>> limited (say it has 8 cores of 16, with some distribution in memory)
> >>>>>> then potentially your note can hog all the resources and the
> >>>>>> scheduler will have to throttle all other executions, leaving you
> >>>>>> exactly where you are now.
> >>>>>> While I think the solution is a good one, maybe this question makes
> >>>>>> us think about adding true multi-user support.
> >>>>>> Where we isolate resources (cluster and the notebooks themselves),
> >>>>>> have separate login/identity and (I don't know if it's possible)
> share the
> >>>>>> same context.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Joel
> >>>>>>
> >>>>>> > On Aug 15, 2015, at 1:58 PM, Rohit Agarwal <mi...@gmail.com>
> >>>>>> wrote:
> >>>>>> >
> >>>>>> > If the problem is that multiple users have to wait for each other
> >>>>>> > while using Zeppelin, the solution already exists: they can create
> >>>>>> > a new interpreter by going to the interpreter page and attaching it
> >>>>>> > to their notebook - then they don't have to wait for others to
> >>>>>> > submit their job.
> >>>>>> >
> >>>>>> > But I agree, having paragraphs from one note wait for paragraphs
> >>>>>> from other
> >>>>>> > notes is a confusing default. We can get around that in two ways:
> >>>>>> >
> >>>>>> >   1. Create a new interpreter for each note and attach that
> >>>>>> interpreter to
> >>>>>> >   that note. This approach would require the least amount of code
> >>>>>> changes but
> >>>>>> >   is resource heavy and doesn't let you share Spark Context
> between
> >>>>>> different
> >>>>>> >   notes.
> >>>>>> >   2. If we want to share the Spark Context between different
> notes,
> >>>>>> we can
> >>>>>> >   submit jobs from different notes into different fairscheduler
> >>>>>> pools (
> >>>>>> >
> >>>>>>
> https://spark.apache.org/docs/1.4.0/job-scheduling.html#scheduling-within-an-application
> >>>>>> ).
> >>>>>> >   This can be done by submitting jobs from different notes in
> >>>>>> different
> >>>>>> >   threads. This will make sure that jobs from one note are run
> >>>>>> sequentially
> >>>>>> >   but jobs from different notes will be able to run in parallel.
> >>>>>> >
> >>>>>> > Neither of these options requires any change in the Spark code.
> >>>>>> >
> >>>>>> > --
> >>>>>> > Thanks & Regards
> >>>>>> > Rohit Agarwal
> >>>>>> > https://www.linkedin.com/in/rohitagarwal003
> >>>>>> >
> >>>>>> > On Sat, Aug 15, 2015 at 12:01 PM, Pranav Kumar Agarwal <
> >>>>>> praagarw@gmail.com>
> >>>>>> > wrote:
> >>>>>> >
> >>>>>> >> If someone can share an idea of sharing a single SparkContext
> >>>>>> >>> through multiple SparkILoops safely, it'll be really helpful.
> >>>>>> >> Here is a proposal:
> >>>>>> >> 1. In Spark code, change SparkIMain.scala to allow setting the
> >>>>>> virtual
> >>>>>> >> directory. While creating new instances of SparkIMain per
> notebook
> >>>>>> from
> >>>>>> >> zeppelin spark interpreter set all the instances of SparkIMain to
> >>>>>> the same
> >>>>>> >> virtual directory.
> >>>>>> >> 2. Start HTTP server on that virtual directory and set this HTTP
> >>>>>> server in
> >>>>>> >> Spark Context using classserverUri method
> >>>>>> >> 3. Scala generated code has a notion of packages. The default
> >>>>>> package name
> >>>>>> >> is "line$<linenumber>". Package name can be controlled using
> System
> >>>>>> >> Property scala.repl.name.line. Setting this property to "notebook
> >>>>>> id"
> >>>>>> >> ensures that code generated by individual instances of SparkIMain
> >>>>>> is
> >>>>>> >> isolated from other instances of SparkIMain
> >>>>>> >> 4. Build a queue inside interpreter to allow only one paragraph
> >>>>>> execution
> >>>>>> >> at a time per notebook.
> >>>>>> >>
> >>>>>> >> I have tested 1, 2, and 3 and it seems to provide isolation
> across
> >>>>>> >> classnames. I'll work towards submitting a formal patch soon - Is
> >>>>>> there any
> >>>>>> >> Jira already for the same that I can uptake? Also I need to
> >>>>>> understand:
> >>>>>> >> 1. How does Zeppelin uptake Spark fixes? OR do I need to first
> work
> >>>>>> >> towards getting Spark changes merged in Apache Spark github?
> >>>>>> >>
> >>>>>> >> Any suggestions on comments on the proposal are highly welcome.
> >>>>>> >>
> >>>>>> >> Regards,
> >>>>>> >> -Pranav.
> >>>>>> >>
> >>>>>> >>> On 10/08/15 11:36 pm, moon soo Lee wrote:
> >>>>>> >>>
> >>>>>> >>> Hi piyush,
> >>>>>> >>>
> >>>>>> >>> A separate instance of SparkILoop and SparkIMain for each
> >>>>>> >>> notebook while sharing the SparkContext sounds great.
> >>>>>> >>>
> >>>>>> >>> Actually, I tried to do it and found a problem: multiple
> >>>>>> >>> SparkILoops can generate the same class name, and the Spark
> >>>>>> >>> executor confuses class names since it reads classes from a
> >>>>>> >>> single SparkContext.
> >>>>>> >>>
> >>>>>> >>> If someone can share ideas on sharing a single SparkContext
> >>>>>> >>> through multiple SparkILoops safely, it'll be really helpful.
> >>>>>> >>>
> >>>>>> >>> Thanks,
> >>>>>> >>> moon
> >>>>>> >>>
> >>>>>> >>>
> >>>>>> >>> On Mon, Aug 10, 2015 at 1:21 AM Piyush Mukati (Data Platform) <
> >>>>>> >>> piyush.mukati@flipkart.com <ma...@flipkart.com>>
> >>>>>> wrote:
> >>>>>> >>>
> >>>>>> >>>    Hi Moon,
> >>>>>> >>>    Any suggestion on it? We have to wait a lot when multiple
> >>>>>> >>> people are working with Spark.
> >>>>>> >>>    Can we create a separate instance of SparkILoop, SparkIMain
> >>>>>> >>> and print streams for each notebook while sharing the
> >>>>>> >>> SparkContext, ZeppelinContext, SQLContext and DependencyResolver,
> >>>>>> >>> and then use the parallel scheduler?
> >>>>>> >>>    thanks
> >>>>>> >>>
> >>>>>> >>>    -piyush
> >>>>>> >>>
> >>>>>> >>>    Hi Moon,
> >>>>>> >>>
> >>>>>> >>>    How about tracking dedicated SparkContext for a notebook in
> >>>>>> >>>    Spark's remote interpreter - this will allow multiple users
> >>>>>> >>>    to run their spark paragraphs in parallel. Also, within a
> >>>>>> >>>    notebook only one paragraph is executed at a time.
> >>>>>> >>>
> >>>>>> >>>    Regards,
> >>>>>> >>>    -Pranav.
> >>>>>> >>>
> >>>>>> >>>
> >>>>>> >>>>    On 15/07/15 7:15 pm, moon soo Lee wrote:
> >>>>>> >>>> Hi,
> >>>>>> >>>>
> >>>>>> >>>> Thanks for asking the question.
> >>>>>> >>>>
> >>>>>> >>>> The reason is simply that it is running code statements. The
> >>>>>> >>>> statements can have order and dependencies. Imagine I have two
> >>>>>> >>> paragraphs
> >>>>>> >>>>
> >>>>>> >>>> %spark
> >>>>>> >>>> val a = 1
> >>>>>> >>>>
> >>>>>> >>>> %spark
> >>>>>> >>>> print(a)
> >>>>>> >>>>
> >>>>>> >>>> If they're not run one by one, they may run in random order
> >>>>>> >>>> and the output will differ from run to run: either '1' or
> >>>>>> >>>> 'not found: value a'.
> >>>>>> >>>>
> >>>>>> >>>> This is the reason why. But if there is a nice idea to handle
> >>>>>> >>>> this problem, I agree using a parallel scheduler would help a lot.
> >>>>>> >>>>
> >>>>>> >>>> Thanks,
> >>>>>> >>>> moon
> >>>>>> >>>> On 2015년 7월 14일 (화) at 오후 7:59 linxi zeng
> >>>>>> >>>> <linxizeng0615@gmail.com  <ma...@gmail.com>
> >>>>>> >>> <mailto:linxizeng0615@gmail.com  <mailto:
> linxizeng0615@gmail.com
> >>>>>> >>>
> >>>>>> >>> wrote:
> >>>>>> >>>>
> >>>>>> >>>>    Anyone who has the same question as me? Or is this not a
> >>>>>> >>> question?
> >>>>>> >>>>
> >>>>>> >>>>    2015-07-14 11:47 GMT+08:00 linxi zeng <
> >>>>>> linxizeng0615@gmail.com
> >>>>>> >>> <ma...@gmail.com>
> >>>>>> >>>>    <mailto:linxizeng0615@gmail.com  <mailto:
> >>>>>> >>> linxizeng0615@gmail.com>>>:
> >>>>>> >>>>
> >>>>>> >>>>        hi, Moon:
> >>>>>> >>>>           I notice that the getScheduler function in
> >>>>>> >>>>        SparkInterpreter.java returns a FIFOScheduler, which
> >>>>>> >>>>        makes the spark interpreter run spark jobs one by one.
> >>>>>> >>>>        It's not a good experience when a couple of users do
> >>>>>> >>>>        some work on zeppelin at the same time, because they
> >>>>>> >>>>        have to wait for each other. At the same time,
> >>>>>> >>>>        SparkSqlInterpreter can choose what scheduler to use via
> >>>>>> >>>>        "zeppelin.spark.concurrentSQL".
> >>>>>> >>>>        My question is, what considerations did you base such a
> >>>>>> >>>>        decision on?
> >>>>>> >>>
> >>>>>> >>>
> >>>>>> >>>
> >>>>>> >>>
> >>>>>> >>>
> >>>>>>
> >>>>>> >>
> >>>>>>
> >>>>>
> >>>>>
> >>
> >> --
> >> Sent from a mobile device. Excuse my thumbs.
> >>
> >>
> >>
> >
>
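
For reference, here is a minimal sketch of Rohit's option 2 above: a shared
SparkContext with fair scheduling enabled, where each note submits its jobs
from its own single thread tagged with its own scheduler pool. The names
(FairPoolSketch, runAsNote, the note ids) are invented for illustration and
this is not Zeppelin's actual code; only setLocalProperty and the
spark.scheduler.mode setting are Spark's documented API.

import java.util.concurrent.{ExecutorService, Executors}

import org.apache.spark.{SparkConf, SparkContext}

object FairPoolSketch {
  // Shared context with fair scheduling enabled (the default mode is FIFO).
  val sc = new SparkContext(new SparkConf()
    .setMaster("local[*]")
    .setAppName("shared-context")
    .set("spark.scheduler.mode", "FAIR"))

  // One single-threaded executor per note: paragraphs of a note stay
  // sequential while different notes run concurrently.
  private val noteThreads = scala.collection.mutable.Map.empty[String, ExecutorService]

  private def threadFor(noteId: String): ExecutorService = noteThreads.synchronized {
    noteThreads.getOrElseUpdate(noteId, Executors.newSingleThreadExecutor())
  }

  def runAsNote(noteId: String)(job: => Unit): Unit = {
    threadFor(noteId).submit(new Runnable {
      def run(): Unit = {
        // setLocalProperty is thread-local, so every job submitted from
        // this note's thread lands in the note's own fair-scheduler pool.
        sc.setLocalProperty("spark.scheduler.pool", noteId)
        job
      }
    })
  }
}

// Usage: FairPoolSketch.runAsNote("noteA") {
//   FairPoolSketch.sc.parallelize(1 to 1000).count()
// }

Pools named this way are created on demand with default settings; per the
Spark docs they can also be pre-defined in a fairscheduler.xml file
referenced by spark.scheduler.allocation.file.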


Re: why zeppelin SparkInterpreter use FIFOScheduler

Posted by "Piyush Mukati (Data Platform)" <pi...@flipkart.com>.
Hi,
 The code is available here
https://github.com/piyush-mukati/incubator-zeppelin/tree/parallel_scheduler_support_spark


Some testing work is still left.

On Wed, Jan 13, 2016 at 11:47 PM, Dimp Bhat <di...@gmail.com> wrote:

> Hi Pranav,
> When do you plan to send out the code for running notebooks in parallel ?
>
> Dimple
>
> On Tue, Nov 17, 2015 at 3:27 AM, Pranav Kumar Agarwal <pr...@gmail.com>
> wrote:
>
>> Hi Rohit,
>>
>> We implemented the proposal and are able to run Zeppelin as a hosted
>> service inside my organization. Our internal forked version has pluggable
>> authentication and type ahead.
>>
>> I need to get the work ported to the latest and chop out the auth changes
>> portion. We'll be submitting it soon.
>>
>> We'll target to get this out for review by 11/26.
>>
>> Regards,
>> -Pranav.
>>
>>
>>
>> On 17/11/15 4:34 am, Rohit Agarwal wrote:
>>
>> Hey Pranav,
>>
>> Did you make any progress on this?
>>
>> --
>> Rohit
>>
>> On Sunday, August 16, 2015, moon soo Lee <mo...@apache.org> wrote:
>>
>>> Pranav, proposal looks awesome!
>>>
>>> I have a question and feedback,
>>>
>>> You said you tested 1, 2 and 3. To create a SparkIMain per notebook, you
>>> need the notebook id. Did you get it from the InterpreterContext?
>>> Then how did you handle destroying the SparkIMain (when a notebook is
>>> deleted)?
>>> As far as I know, the interpreter is not able to get information about
>>> notebook deletion.
>>>
>>> >> 4. Build a queue inside interpreter to allow only one paragraph
>>> execution
>>> >> at a time per notebook.
>>>
>>> One downside of this approach is that the GUI will display RUNNING
>>> instead of PENDING for jobs waiting in the interpreter's queue.
>>>
>>> Best,
>>> moon
>>>
>>> On Sun, Aug 16, 2015 at 12:55 AM IT CTO <go...@gmail.com> wrote:
>>>
>>>> +1 for "to re-factor the Zeppelin architecture so that it can handle
>>>> multi-tenancy easily"
>>>>
>>>> On Sun, Aug 16, 2015 at 9:47 AM DuyHai Doan <do...@gmail.com>
>>>> wrote:
>>>>
>>>>> Agree with Joel, we may think about re-factoring the Zeppelin
>>>>> architecture so that it can handle multi-tenancy easily. The technical
>>>>> solution proposed by Pranav is great but it only applies to Spark.
>>>>> Right now, each interpreter has to manage multi-tenancy its own way.
>>>>> Ultimately Zeppelin could propose a multi-tenancy contract/info (like a
>>>>> UserContext, similar to the InterpreterContext) so that each
>>>>> interpreter can choose to use it or not.
>>>>>
>>>>>
>>>>> On Sun, Aug 16, 2015 at 3:09 AM, Joel Zambrano <dj...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> While the idea of running multiple notes simultaneously is great, it
>>>>>> is really dancing around the lack of true multi-user support in
>>>>>> Zeppelin. While the proposed solution would work if the application's
>>>>>> resources are those of the whole cluster, if the app is limited (say they
>>>>>> are 8 cores of 16, with some distribution in memory) then potentially your
>>>>>> note can hog all the resources and the scheduler will have to throttle all
>>>>>> other executions leaving you exactly where you are now.
>>>>>> While I think the solution is a good one, maybe this question makes
>>>>>> us think about adding true multiuser support.
>>>>>> Where we isolate resources (cluster and the notebooks themselves),
>>>>>> have separate login/identity and (I don't know if it's possible) share the
>>>>>> same context.
>>>>>>
>>>>>> Thanks,
>>>>>> Joel
>>>>>>
>>>>>> > On Aug 15, 2015, at 1:58 PM, Rohit Agarwal <mi...@gmail.com>
>>>>>> wrote:
>>>>>> >
>>>>>> > If the problem is that multiple users have to wait for each other
>>>>>> > while using Zeppelin, the solution already exists: they can create a
>>>>>> > new interpreter by going to the interpreter page and attaching it to
>>>>>> > their notebook - then they don't have to wait for others to submit
>>>>>> > their job.
>>>>>> >
>>>>>> > But I agree, having paragraphs from one note wait for paragraphs
>>>>>> > from other notes is a confusing default. We can get around that in
>>>>>> > two ways:
>>>>>> >
>>>>>> >   1. Create a new interpreter for each note and attach that
>>>>>> >   interpreter to that note. This approach would require the least
>>>>>> >   amount of code changes but is resource heavy and doesn't let you
>>>>>> >   share Spark Context between different notes.
>>>>>> >   2. If we want to share the Spark Context between different notes,
>>>>>> >   we can submit jobs from different notes into different
>>>>>> >   fairscheduler pools (
>>>>>> > https://spark.apache.org/docs/1.4.0/job-scheduling.html#scheduling-within-an-application
>>>>>> >   ). This can be done by submitting jobs from different notes in
>>>>>> >   different threads. This will make sure that jobs from one note are
>>>>>> >   run sequentially but jobs from different notes will be able to run
>>>>>> >   in parallel.
>>>>>> >
>>>>>> > Neither of these options requires any change in the Spark code.
>>>>>> >
>>>>>> > --
>>>>>> > Thanks & Regards
>>>>>> > Rohit Agarwal
>>>>>> > https://www.linkedin.com/in/rohitagarwal003
>>>>>> >
>>>>>> > On Sat, Aug 15, 2015 at 12:01 PM, Pranav Kumar Agarwal <
>>>>>> praagarw@gmail.com>
>>>>>> > wrote:
>>>>>> >
>>>>>> >>> If someone can share ideas on sharing a single SparkContext
>>>>>> >>> through multiple SparkILoops safely, it'll be really helpful.
>>>>>> >> Here is a proposal:
>>>>>> >> 1. In Spark code, change SparkIMain.scala to allow setting the
>>>>>> >> virtual directory. While creating new instances of SparkIMain per
>>>>>> >> notebook from the zeppelin spark interpreter, set all the instances
>>>>>> >> of SparkIMain to the same virtual directory.
>>>>>> >> 2. Start an HTTP server on that virtual directory and set this HTTP
>>>>>> >> server in the Spark Context using the classserverUri method.
>>>>>> >> 3. Scala generated code has a notion of packages. The default
>>>>>> >> package name is "line$<linenumber>". The package name can be
>>>>>> >> controlled using the System Property scala.repl.name.line. Setting
>>>>>> >> this property to the "notebook id" ensures that code generated by
>>>>>> >> individual instances of SparkIMain is isolated from other instances
>>>>>> >> of SparkIMain.
>>>>>> >> 4. Build a queue inside the interpreter to allow only one paragraph
>>>>>> >> execution at a time per notebook.
>>>>>> >>
>>>>>> >> I have tested 1, 2, and 3 and it seems to provide isolation across
>>>>>> >> classnames. I'll work towards submitting a formal patch soon - is
>>>>>> >> there any Jira already for the same that I can take up? Also I need
>>>>>> >> to understand:
>>>>>> >> 1. How does Zeppelin uptake Spark fixes? OR do I need to first work
>>>>>> >> towards getting Spark changes merged in the Apache Spark github?
>>>>>> >>
>>>>>> >> Any suggestions or comments on the proposal are highly welcome.
>>>>>> >>
>>>>>> >> Regards,
>>>>>> >> -Pranav.
>>>>>> >>
>>>>>> >>> On 10/08/15 11:36 pm, moon soo Lee wrote:
>>>>>> >>>
>>>>>> >>> Hi piyush,
>>>>>> >>>
>>>>>> >>> A separate instance of SparkILoop and SparkIMain for each notebook
>>>>>> >>> while sharing the SparkContext sounds great.
>>>>>> >>>
>>>>>> >>> Actually, I tried to do it and found a problem: multiple
>>>>>> >>> SparkILoops can generate the same class name, and the Spark
>>>>>> >>> executor confuses class names since it reads classes from a
>>>>>> >>> single SparkContext.
>>>>>> >>>
>>>>>> >>> If someone can share ideas on sharing a single SparkContext
>>>>>> >>> through multiple SparkILoops safely, it'll be really helpful.
>>>>>> >>>
>>>>>> >>> Thanks,
>>>>>> >>> moon
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> On Mon, Aug 10, 2015 at 1:21 AM Piyush Mukati (Data Platform) <
>>>>>> >>> piyush.mukati@flipkart.com <ma...@flipkart.com>>
>>>>>> wrote:
>>>>>> >>>
>>>>>> >>>    Hi Moon,
>>>>>> >>>    Any suggestion on it? We have to wait a lot when multiple
>>>>>> >>> people are working with Spark.
>>>>>> >>>    Can we create a separate instance of SparkILoop, SparkIMain and
>>>>>> >>> print streams for each notebook while sharing the SparkContext,
>>>>>> >>> ZeppelinContext, SQLContext and DependencyResolver, and then use
>>>>>> >>> the parallel scheduler?
>>>>>> >>>    thanks
>>>>>> >>>
>>>>>> >>>    -piyush
>>>>>> >>>
>>>>>> >>>    Hi Moon,
>>>>>> >>>
>>>>>> >>>    How about tracking dedicated SparkContext for a notebook in
>>>>>> Spark's
>>>>>> >>>    remote interpreter - this will allow multiple users to run
>>>>>> their spark
>>>>>> >>>    paragraphs in parallel. Also, within a notebook only one
>>>>>> paragraph is
>>>>>> >>>    executed at a time.
>>>>>> >>>
>>>>>> >>>    Regards,
>>>>>> >>>    -Pranav.
>>>>>> >>>
>>>>>> >>>
>>>>>> >>>>    On 15/07/15 7:15 pm, moon soo Lee wrote:
>>>>>> >>>> Hi,
>>>>>> >>>>
>>>>>> >>>> Thanks for asking the question.
>>>>>> >>>>
>>>>>> >>>> The reason is simply that it is running code statements. The
>>>>>> >>>> statements can have order and dependencies. Imagine I have two
>>>>>> >>> paragraphs
>>>>>> >>>>
>>>>>> >>>> %spark
>>>>>> >>>> val a = 1
>>>>>> >>>>
>>>>>> >>>> %spark
>>>>>> >>>> print(a)
>>>>>> >>>>
>>>>>> >>>> If they're not run one by one, they may run in random order and
>>>>>> >>>> the output will differ from run to run: either '1' or
>>>>>> >>>> 'not found: value a'.
>>>>>> >>>>
>>>>>> >>>> This is the reason why. But if there is a nice idea to handle this
>>>>>> >>>> problem, I agree using a parallel scheduler would help a lot.
>>>>>> >>>>
>>>>>> >>>> Thanks,
>>>>>> >>>> moon
>>>>>> >>>> On 2015년 7월 14일 (화) at 오후 7:59 linxi zeng
>>>>>> >>>> <linxizeng0615@gmail.com  <ma...@gmail.com>
>>>>>> >>> <mailto:linxizeng0615@gmail.com  <mailto:linxizeng0615@gmail.com
>>>>>> >>>
>>>>>> >>> wrote:
>>>>>> >>>>
>>>>>> >>>>    Anyone who has the same question as me? Or is this not a
>>>>>> >>> question?
>>>>>> >>>>
>>>>>> >>>>    2015-07-14 11:47 GMT+08:00 linxi zeng <
>>>>>> linxizeng0615@gmail.com
>>>>>> >>> <ma...@gmail.com>
>>>>>> >>>>    <mailto:linxizeng0615@gmail.com  <mailto:
>>>>>> >>> linxizeng0615@gmail.com>>>:
>>>>>> >>>>
>>>>>> >>>>        hi, Moon:
>>>>>> >>>>           I notice that the getScheduler function in
>>>>>> >>>>        SparkInterpreter.java returns a FIFOScheduler, which makes
>>>>>> >>>>        the spark interpreter run spark jobs one by one. It's not
>>>>>> >>>>        a good experience when a couple of users do some work on
>>>>>> >>>>        zeppelin at the same time, because they have to wait for
>>>>>> >>>>        each other. At the same time, SparkSqlInterpreter can
>>>>>> >>>>        choose what scheduler to use via
>>>>>> >>>>        "zeppelin.spark.concurrentSQL".
>>>>>> >>>>        My question is, what considerations did you base such a
>>>>>> >>>>        decision on?
>>>>>> >>>
>>>>>> >>>
>>>>>> >>>
>>>>>> >>>
>>>>>> >>>
>>>>>> >>
>>>>>>
>>>>>
>>>>>
>>
>> --
>> Sent from a mobile device. Excuse my thumbs.
>>
>>
>>
>
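
For reference, a rough sketch of the class-name isolation in step 3 of the
proposal quoted above, written against the Spark 1.x REPL. scala.repl.name.line
is a real Scala REPL property, but the constructor details below vary by Spark
version and are unverified assumptions; steps 1 and 2 (the shared virtual
directory and the HTTP class server) require the SparkIMain patch the proposal
describes and are omitted here.

import scala.tools.nsc.Settings

import org.apache.spark.repl.SparkIMain

object NoteIsolationSketch {
  // Returns a REPL interpreter whose generated code is packaged under a
  // notebook-specific prefix instead of the shared "$line" one.
  def sparkIMainForNote(noteId: String): SparkIMain = {
    // scala.repl.name.line controls the prefix of generated line objects
    // (normally "$line1", "$line2", ...). Keying it on the notebook id
    // keeps class names from two notebooks' interpreters distinct, which
    // is what avoids the executor-side clash moon described.
    System.setProperty("scala.repl.name.line", "$" + noteId)

    val settings = new Settings()
    settings.usejavacp.value = true
    new SparkIMain(settings)
  }
}

// Usage: NoteIsolationSketch.sparkIMainForNote("note1").interpret("val a = 1")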


Re: why zeppelin SparkInterpreter use FIFOScheduler

Posted by Dimp Bhat <di...@gmail.com>.
Hi Pranav,
When do you plan to send out the code for running notebooks in parallel?

Dimple

On Tue, Nov 17, 2015 at 3:27 AM, Pranav Kumar Agarwal <pr...@gmail.com>
wrote:

> Hi Rohit,
>
> We implemented the proposal and are able to run Zeppelin as a hosted
> service inside my organization. Our internal forked version has pluggable
> authentication and type ahead.
>
> I need to get the work ported to the latest and chop out the auth changes
> portion. We'll be submitting it soon.
>
> We'll target to get this out for review by 11/26.
>
> Regards,
> -Pranav.
>
>
>
> On 17/11/15 4:34 am, Rohit Agarwal wrote:
>
> Hey Pranav,
>
> Did you make any progress on this?
>
> --
> Rohit
>
> On Sunday, August 16, 2015, moon soo Lee <mo...@apache.org> wrote:
>
>> Pranav, proposal looks awesome!
>>
>> I have a question and feedback,
>>
>> You said you tested 1, 2 and 3. To create a SparkIMain per notebook, you
>> need the notebook id. Did you get it from the InterpreterContext?
>> Then how did you handle destroying the SparkIMain (when a notebook is
>> deleted)?
>> As far as I know, the interpreter is not able to get information about
>> notebook deletion.
>>
>> >> 4. Build a queue inside interpreter to allow only one paragraph
>> execution
>> >> at a time per notebook.
>>
>> One downside of this approach is that the GUI will display RUNNING
>> instead of PENDING for jobs waiting in the interpreter's queue.
>>
>> Best,
>> moon
>>
>> On Sun, Aug 16, 2015 at 12:55 AM IT CTO <go...@gmail.com> wrote:
>>
>>> +1 for "to re-factor the Zeppelin architecture so that it can handle
>>> multi-tenancy easily"
>>>
>>> On Sun, Aug 16, 2015 at 9:47 AM DuyHai Doan <do...@gmail.com>
>>> wrote:
>>>
>>>> Agree with Joel, we may think about re-factoring the Zeppelin
>>>> architecture so that it can handle multi-tenancy easily. The technical
>>>> solution proposed by Pranav is great but it only applies to Spark. Right
>>>> now, each interpreter has to manage multi-tenancy its own way.
>>>> Ultimately Zeppelin could propose a multi-tenancy contract/info (like a
>>>> UserContext, similar to the InterpreterContext) so that each interpreter
>>>> can choose to use it or not.
>>>>
>>>>
>>>> On Sun, Aug 16, 2015 at 3:09 AM, Joel Zambrano <dj...@gmail.com>
>>>> wrote:
>>>>
>>>>> While the idea of running multiple notes simultaneously is great, it
>>>>> is really dancing around the lack of true multi-user support in
>>>>> Zeppelin. While the proposed solution would work if the application's
>>>>> resources are those of the whole cluster, if the app is limited (say they
>>>>> are 8 cores of 16, with some distribution in memory) then potentially your
>>>>> note can hog all the resources and the scheduler will have to throttle all
>>>>> other executions leaving you exactly where you are now.
>>>>> While I think the solution is a good one, maybe this question makes us
>>>>> think about adding true multiuser support.
>>>>> Where we isolate resources (cluster and the notebooks themselves),
>>>>> have separate login/identity and (I don't know if it's possible) share the
>>>>> same context.
>>>>>
>>>>> Thanks,
>>>>> Joel
>>>>>
>>>>> > On Aug 15, 2015, at 1:58 PM, Rohit Agarwal <mi...@gmail.com>
>>>>> wrote:
>>>>> >
>>>>> > If the problem is that multiple users have to wait for each other
>>>>> > while using Zeppelin, the solution already exists: they can create a
>>>>> > new interpreter by going to the interpreter page and attaching it to
>>>>> > their notebook - then they don't have to wait for others to submit
>>>>> > their job.
>>>>> >
>>>>> > But I agree, having paragraphs from one note wait for paragraphs
>>>>> > from other notes is a confusing default. We can get around that in
>>>>> > two ways:
>>>>> >
>>>>> >   1. Create a new interpreter for each note and attach that
>>>>> >   interpreter to that note. This approach would require the least
>>>>> >   amount of code changes but is resource heavy and doesn't let you
>>>>> >   share Spark Context between different notes.
>>>>> >   2. If we want to share the Spark Context between different notes,
>>>>> >   we can submit jobs from different notes into different
>>>>> >   fairscheduler pools (
>>>>> > https://spark.apache.org/docs/1.4.0/job-scheduling.html#scheduling-within-an-application
>>>>> >   ). This can be done by submitting jobs from different notes in
>>>>> >   different threads. This will make sure that jobs from one note are
>>>>> >   run sequentially but jobs from different notes will be able to run
>>>>> >   in parallel.
>>>>> >
>>>>> > Neither of these options requires any change in the Spark code.
>>>>> >
>>>>> > --
>>>>> > Thanks & Regards
>>>>> > Rohit Agarwal
>>>>> > https://www.linkedin.com/in/rohitagarwal003
>>>>> >
>>>>> > On Sat, Aug 15, 2015 at 12:01 PM, Pranav Kumar Agarwal <
>>>>> praagarw@gmail.com>
>>>>> > wrote:
>>>>> >
>>>>> >>> If someone can share ideas on sharing a single SparkContext
>>>>> >>> through multiple SparkILoops safely, it'll be really helpful.
>>>>> >> Here is a proposal:
>>>>> >> 1. In Spark code, change SparkIMain.scala to allow setting the
>>>>> >> virtual directory. While creating new instances of SparkIMain per
>>>>> >> notebook from the zeppelin spark interpreter, set all the instances
>>>>> >> of SparkIMain to the same virtual directory.
>>>>> >> 2. Start an HTTP server on that virtual directory and set this HTTP
>>>>> >> server in the Spark Context using the classserverUri method.
>>>>> >> 3. Scala generated code has a notion of packages. The default
>>>>> >> package name is "line$<linenumber>". The package name can be
>>>>> >> controlled using the System Property scala.repl.name.line. Setting
>>>>> >> this property to the "notebook id" ensures that code generated by
>>>>> >> individual instances of SparkIMain is isolated from other instances
>>>>> >> of SparkIMain.
>>>>> >> 4. Build a queue inside the interpreter to allow only one paragraph
>>>>> >> execution at a time per notebook.
>>>>> >>
>>>>> >> I have tested 1, 2, and 3 and it seems to provide isolation across
>>>>> >> classnames. I'll work towards submitting a formal patch soon - is
>>>>> >> there any Jira already for the same that I can take up? Also I need
>>>>> >> to understand:
>>>>> >> 1. How does Zeppelin uptake Spark fixes? OR do I need to first work
>>>>> >> towards getting Spark changes merged in the Apache Spark github?
>>>>> >>
>>>>> >> Any suggestions or comments on the proposal are highly welcome.
>>>>> >>
>>>>> >> Regards,
>>>>> >> -Pranav.
>>>>> >>
>>>>> >>> On 10/08/15 11:36 pm, moon soo Lee wrote:
>>>>> >>>
>>>>> >>> Hi piyush,
>>>>> >>>
>>>>> >>> A separate instance of SparkILoop and SparkIMain for each notebook
>>>>> >>> while sharing the SparkContext sounds great.
>>>>> >>>
>>>>> >>> Actually, I tried to do it and found a problem: multiple
>>>>> >>> SparkILoops can generate the same class name, and the Spark
>>>>> >>> executor confuses class names since it reads classes from a
>>>>> >>> single SparkContext.
>>>>> >>>
>>>>> >>> If someone can share ideas on sharing a single SparkContext
>>>>> >>> through multiple SparkILoops safely, it'll be really helpful.
>>>>> >>>
>>>>> >>> Thanks,
>>>>> >>> moon
>>>>> >>>
>>>>> >>>
>>>>> >>> On Mon, Aug 10, 2015 at 1:21 AM Piyush Mukati (Data Platform) <
>>>>> >>> piyush.mukati@flipkart.com <ma...@flipkart.com>>
>>>>> wrote:
>>>>> >>>
>>>>> >>>    Hi Moon,
>>>>> >>>    Any suggestion on it? We have to wait a lot when multiple
>>>>> >>> people are working with Spark.
>>>>> >>>    Can we create a separate instance of SparkILoop, SparkIMain and
>>>>> >>> print streams for each notebook while sharing the SparkContext,
>>>>> >>> ZeppelinContext, SQLContext and DependencyResolver, and then use
>>>>> >>> the parallel scheduler?
>>>>> >>>    thanks
>>>>> >>>
>>>>> >>>    -piyush
>>>>> >>>
>>>>> >>>    Hi Moon,
>>>>> >>>
>>>>> >>>    How about tracking dedicated SparkContext for a notebook in
>>>>> Spark's
>>>>> >>>    remote interpreter - this will allow multiple users to run
>>>>> their spark
>>>>> >>>    paragraphs in parallel. Also, within a notebook only one
>>>>> paragraph is
>>>>> >>>    executed at a time.
>>>>> >>>
>>>>> >>>    Regards,
>>>>> >>>    -Pranav.
>>>>> >>>
>>>>> >>>
>>>>> >>>>    On 15/07/15 7:15 pm, moon soo Lee wrote:
>>>>> >>>> Hi,
>>>>> >>>>
>>>>> >>>> Thanks for asking the question.
>>>>> >>>>
>>>>> >>>> The reason is simply that it is running code statements. The
>>>>> >>>> statements can have order and dependencies. Imagine I have two
>>>>> >>> paragraphs
>>>>> >>>>
>>>>> >>>> %spark
>>>>> >>>> val a = 1
>>>>> >>>>
>>>>> >>>> %spark
>>>>> >>>> print(a)
>>>>> >>>>
>>>>> >>>> If they're not run one by one, they may run in random order and
>>>>> >>>> the output will differ from run to run: either '1' or
>>>>> >>>> 'not found: value a'.
>>>>> >>>>
>>>>> >>>> This is the reason why. But if there is a nice idea to handle this
>>>>> >>>> problem, I agree using a parallel scheduler would help a lot.
>>>>> >>>>
>>>>> >>>> Thanks,
>>>>> >>>> moon
>>>>> >>>> On 2015년 7월 14일 (화) at 오후 7:59 linxi zeng
>>>>> >>>> <linxizeng0615@gmail.com  <ma...@gmail.com>
>>>>> >>> <mailto:linxizeng0615@gmail.com  <mailto:linxizeng0615@gmail.com
>>>>> >>>
>>>>> >>> wrote:
>>>>> >>>>
>>>>> >>>>    any one who have the same question with me? or this is not a
>>>>> >>> question?
>>>>> >>>>
>>>>> >>>>    2015-07-14 11:47 GMT+08:00 linxi zeng <linxizeng0615@gmail.com
>>>>> >>> <ma...@gmail.com>
>>>>> >>>>    <mailto:linxizeng0615@gmail.com  <mailto:
>>>>> >>> linxizeng0615@gmail.com>>>:
>>>>> >>>>
>>>>> >>>>        hi, Moon:
>>>>> >>>>           I notice that the getScheduler function in
>>>>> >>>>        SparkInterpreter.java returns a FIFOScheduler, which makes
>>>>> >>>>        the spark interpreter run spark jobs one by one. It's not a
>>>>> >>>>        good experience when a couple of users do some work on
>>>>> >>>>        zeppelin at the same time, because they have to wait for
>>>>> >>>>        each other. At the same time, SparkSqlInterpreter can
>>>>> >>>>        choose what scheduler to use via
>>>>> >>>>        "zeppelin.spark.concurrentSQL".
>>>>> >>>>        My question is, what considerations did you base such a
>>>>> >>>>        decision on?
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>
>>>>>
>>>>
>>>>
>
> --
> Sent from a mobile device. Excuse my thumbs.
>
>
>
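
Finally, a sketch of step 4 of the proposal - the queue inside the
interpreter that serializes paragraph execution per notebook - with moon's
two caveats marked in comments. NoteQueues and its method names are invented
for this example; this is not Zeppelin or Spark API.

import java.util.concurrent.{ExecutorService, Executors}

import scala.collection.mutable

class NoteQueues {
  private val queues = mutable.Map.empty[String, ExecutorService]

  private def queueFor(noteId: String): ExecutorService = synchronized {
    queues.getOrElseUpdate(noteId, Executors.newSingleThreadExecutor())
  }

  // A paragraph waits only for earlier paragraphs of the same note, while
  // other notes' paragraphs run in parallel. Caveat (moon): once a job is
  // in this internal queue, the interpreter's scheduler has already
  // accepted it, so the GUI shows RUNNING rather than PENDING while it
  // waits here.
  def submitParagraph(noteId: String, paragraph: Runnable): Unit = {
    queueFor(noteId).submit(paragraph)
  }

  // Would be called on notebook deletion to free the note's thread -
  // which, as moon points out, the interpreter currently has no way to
  // learn about.
  def dispose(noteId: String): Unit = synchronized {
    queues.remove(noteId).foreach(_.shutdown())
  }
}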

Re: why zeppelin SparkInterpreter use FIFOScheduler

Posted by Dimp Bhat <di...@gmail.com>.
Hi Pranav,
When do you plan to send out the code for running notebooks in parallel ?

Dimple

On Tue, Nov 17, 2015 at 3:27 AM, Pranav Kumar Agarwal <pr...@gmail.com>
wrote:

> Hi Rohit,
>
> We implemented the proposal and are able to run Zeppelin as a hosted
> service inside my organization. Our internal forked version has pluggable
> authentication and type ahead.
>
> I need to get the work ported to the latest and chop out the auth changes
> portion. We'll be submitting it soon.
>
> We'll target to get this out for review by 11/26.
>
> Regards,
> -Pranav.
>
>
>
> On 17/11/15 4:34 am, Rohit Agarwal wrote:
>
> Hey Pranav,
>
> Did you make any progress on this?
>
> --
> Rohit
>
> On Sunday, August 16, 2015, moon soo Lee <mo...@apache.org> wrote:
>
>> Pranav, proposal looks awesome!
>>
>> I have a question and feedback,
>>
>> You said you tested 1, 2 and 3. To create a SparkIMain per notebook, you
>> need the notebook id. Did you get it from InterpreterContext?
>> Then how did you handle destroying the SparkIMain (when a notebook is
>> deleted)?
>> As far as I know, the interpreter is not able to get information about
>> notebook deletion.
>>
>> >> 4. Build a queue inside interpreter to allow only one paragraph
>> execution
>> >> at a time per notebook.
>>
>> One downside of this approach is that the GUI will display RUNNING instead
>> of PENDING for jobs queued inside the interpreter.
>>
>> Best,
>> moon
>>
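
For reference, a minimal sketch of what the per-notebook queue in point 4
could look like - one single-threaded executor per note id, so paragraphs
of one note run serially while different notes proceed in parallel. The
names are illustrative, not Zeppelin's actual scheduler API:

    import java.util.concurrent.{ConcurrentHashMap, ExecutorService, Executors}

    object NoteQueues {
      private val queues = new ConcurrentHashMap[String, ExecutorService]()

      // Paragraphs submitted for the same noteId run one at a time, in
      // order; paragraphs of different notes land on different executors.
      def submit(noteId: String, paragraph: Runnable): Unit = {
        var q = queues.get(noteId)
        if (q == null) {
          // Benign race: if another thread wins putIfAbsent, our extra
          // executor is discarded before it ever starts a thread.
          queues.putIfAbsent(noteId, Executors.newSingleThreadExecutor())
          q = queues.get(noteId)
        }
        q.submit(paragraph)
      }
    }

This is also where the RUNNING-vs-PENDING caveat above comes from: once a
paragraph is handed to an executor's internal queue, Zeppelin's own
scheduler has already accepted it, even though it is still waiting its turn.
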
>> On Sun, Aug 16, 2015 at 12:55 AM IT CTO <go...@gmail.com> wrote:
>>
>>> +1 for "to re-factor the Zeppelin architecture so that it can handle
>>> multi-tenancy easily"
>>>
>>> On Sun, Aug 16, 2015 at 9:47 AM DuyHai Doan <do...@gmail.com>
>>> wrote:
>>>
>>>> Agree with Joel, we may think to re-factor the Zeppelin architecture so
>>>> that it can handle multi-tenancy easily. The technical solution proposed
>>>> by Pranav is great, but it only applies to Spark. Right now, each
>>>> interpreter has to manage multi-tenancy in its own way. Ultimately
>>>> Zeppelin can propose a multi-tenancy contract/info (like a UserContext,
>>>> similar to InterpreterContext) so that each interpreter can choose
>>>> whether to use it.
>>>>
>>>>
>>>> On Sun, Aug 16, 2015 at 3:09 AM, Joel Zambrano <dj...@gmail.com>
>>>> wrote:
>>>>
>>>>> I think that while the idea of running multiple notes simultaneously is
>>>>> great, it is really dancing around the lack of true multi-user support in
>>>>> Zeppelin. The proposed solution would work if the application's resources
>>>>> are those of the whole cluster, but if the app is limited (say it has 8
>>>>> cores of 16, with some distribution in memory) then potentially your note
>>>>> can hog all the resources and the scheduler will have to throttle all
>>>>> other executions, leaving you exactly where you are now.
>>>>> While I think the solution is a good one, maybe this question makes us
>>>>> think about adding true multi-user support,
>>>>> where we isolate resources (cluster and the notebooks themselves),
>>>>> have separate login/identity and (I don't know if it's possible) share the
>>>>> same context.
>>>>>
>>>>> Thanks,
>>>>> Joel
>>>>>
>>>>> > On Aug 15, 2015, at 1:58 PM, Rohit Agarwal <mi...@gmail.com>
>>>>> wrote:
>>>>> >
>>>>> > If the problem is that multiple users have to wait for each other
>>>>> while
>>>>> > using Zeppelin, the solution already exists: they can create a new
>>>>> > interpreter by going to the interpreter page and attach it to their
>>>>> > notebook - then they don't have to wait for others to submit their
>>>>> job.
>>>>> >
>>>>> > But I agree, having paragraphs from one note wait for paragraphs
>>>>> from other
>>>>> > notes is a confusing default. We can get around that in two ways:
>>>>> >
>>>>> >   1. Create a new interpreter for each note and attach that
>>>>> interpreter to
>>>>> >   that note. This approach would require the least amount of code
>>>>> changes but
>>>>> >   is resource heavy and doesn't let you share Spark Context between
>>>>> different
>>>>> >   notes.
>>>>> >   2. If we want to share the Spark Context between different notes,
>>>>> we can
>>>>> >   submit jobs from different notes into different fairscheduler
>>>>> pools (
>>>>> >
>>>>> https://spark.apache.org/docs/1.4.0/job-scheduling.html#scheduling-within-an-application
>>>>> ).
>>>>> >   This can be done by submitting jobs from different notes in
>>>>> different
>>>>> >   threads. This will make sure that jobs from one note are run
>>>>> sequentially
>>>>> >   but jobs from different notes will be able to run in parallel.
>>>>> >
>>>>> > Neither of these options requires any change in the Spark code.
>>>>> >
>>>>> > --
>>>>> > Thanks & Regards
>>>>> > Rohit Agarwal
>>>>> > https://www.linkedin.com/in/rohitagarwal003
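
For reference, a minimal sketch of option 2 above: one shared SparkContext,
with each note's jobs submitted from its own thread into its own
fair-scheduler pool. The per-note pool naming is an assumption; note that
spark.scheduler.mode must be set to FAIR, otherwise pool assignments are
ignored:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("zeppelin-shared-context")
      .setMaster("local[4]")                 // any master works
      .set("spark.scheduler.mode", "FAIR")   // pools are ignored in FIFO mode
    val sc = new SparkContext(conf)

    // spark.scheduler.pool is a thread-local property, so running each
    // note's jobs on a dedicated thread puts notes into separate pools.
    def runForNote(noteId: String)(job: SparkContext => Unit): Unit = {
      val t = new Thread(new Runnable {
        override def run(): Unit = {
          sc.setLocalProperty("spark.scheduler.pool", "note-" + noteId)
          job(sc)
        }
      })
      t.start()
    }

    // Jobs from two different notes now run concurrently on one context;
    // jobs submitted from the same thread still run one after another.
    runForNote("A") { ctx => println(ctx.parallelize(1 to 1000).count()) }
    runForNote("B") { ctx => println(ctx.parallelize(1 to 1000).sum()) }
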
>>>>> >
>>>>> > On Sat, Aug 15, 2015 at 12:01 PM, Pranav Kumar Agarwal <
>>>>> praagarw@gmail.com>
>>>>> > wrote:
>>>>> >
>>>>> >> If someone can share an idea for safely sharing a single SparkContext
>>>>> >>> across multiple SparkILoops, it'll be really helpful.
>>>>> >> Here is a proposal:
>>>>> >> 1. In Spark code, change SparkIMain.scala to allow setting the
>>>>> virtual
>>>>> >> directory. While creating new instances of SparkIMain per notebook
>>>>> from
>>>>> >> zeppelin spark interpreter set all the instances of SparkIMain to
>>>>> the same
>>>>> >> virtual directory.
>>>>> >> 2. Start HTTP server on that virtual directory and set this HTTP
>>>>> server in
>>>>> >> Spark Context using classserverUri method
>>>>> >> 3. Scala generated code has a notion of packages. The default
>>>>> package name
>>>>> >> is "line$<linenumber>". Package name can be controlled using System
>>>>> >> Property scala.repl.name.line. Setting this property to "notebook
>>>>> id"
>>>>> >> ensures that code generated by individual instances of SparkIMain is
>>>>> >> isolated from other instances of SparkIMain
>>>>> >> 4. Build a queue inside interpreter to allow only one paragraph
>>>>> execution
>>>>> >> at a time per notebook.
>>>>> >>
>>>>> >> I have tested 1, 2, and 3 and it seems to provide isolation across
>>>>> >> classnames. I'll work towards submitting a formal patch soon - Is
>>>>> there any
>>>>> >> Jira already for the same that I can uptake? Also I need to
>>>>> understand:
>>>>> >> 1. How does Zeppelin uptake Spark fixes? OR do I need to first work
>>>>> >> towards getting Spark changes merged in Apache Spark github?
>>>>> >>
>>>>> >> Any suggestions or comments on the proposal are highly welcome.
>>>>> >>
>>>>> >> Regards,
>>>>> >> -Pranav.
>>>>> >>
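
To make steps 1-3 above concrete, a rough sketch against the Spark 1.x repl
internals. The setVirtualDirectory call is the change the proposal would add
to SparkIMain.scala - it does not exist in stock Spark - and the exact
constructor signature may differ between Spark versions:

    import java.io.PrintWriter
    import scala.tools.nsc.Settings
    import org.apache.spark.repl.SparkIMain

    def interpreterForNote(noteId: String): SparkIMain = {
      // Step 3: prefix generated class names with the note id so that two
      // notes never both emit the colliding defaults "line1", "line2", ...
      System.setProperty("scala.repl.name.line", noteId)

      val settings = new Settings()
      settings.usejavacp.value = true
      val imain = new SparkIMain(settings, new PrintWriter(Console.out))

      // Step 1 (the proposed patch): compile every note's classes into one
      // shared virtual directory...
      // imain.setVirtualDirectory(sharedVirtualDir)
      // Step 2: ...and serve that directory from a single HTTP class server,
      // whose address is handed to the SparkContext via the classserverUri
      // method, so executors fetch all notes' classes from one place.
      imain
    }
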
>>>>> >>> On 10/08/15 11:36 pm, moon soo Lee wrote:
>>>>> >>>
>>>>> >>> Hi piyush,
>>>>> >>>
>>>>> >>> Separate instances of SparkILoop and SparkIMain for each notebook
>>>>> >>> while sharing the SparkContext sounds great.
>>>>> >>>
>>>>> >>> Actually, I tried to do it and found a problem: multiple SparkILoops
>>>>> >>> could generate the same class name, and the spark executor confuses
>>>>> >>> classnames since it reads classes from a single SparkContext.
>>>>> >>>
>>>>> >>> If someone can share an idea for safely sharing a single
>>>>> >>> SparkContext across multiple SparkILoops, it'll be really helpful.
>>>>> >>>
>>>>> >>> Thanks,
>>>>> >>> moon
>>>>> >>>
>>>>> >>>
>>>>> >>> On Mon, Aug 10, 2015 at 1:21 AM Piyush Mukati (Data Platform) <
>>>>> >>> piyush.mukati@flipkart.com> wrote:
>>>>> >>>
>>>>> >>>    Hi Moon,
>>>>> >>>    Any suggestion on it? We have to wait a lot when multiple
>>>>> >>> people are working with spark.
>>>>> >>>    Can we create separate instances of SparkILoop, SparkIMain and
>>>>> >>> print streams for each notebook while sharing the SparkContext,
>>>>> >>> ZeppelinContext, SQLContext and DependencyResolver, and then use a
>>>>> >>> parallel scheduler?
>>>>> >>>    thanks
>>>>> >>>
>>>>> >>>    -piyush
>>>>> >>>
>>>>> >>>    Hi Moon,
>>>>> >>>
>>>>> >>>    How about tracking a dedicated SparkContext for a notebook in
>>>>> >>>    Spark's remote interpreter - this will allow multiple users to
>>>>> >>>    run their spark paragraphs in parallel. Also, within a notebook
>>>>> >>>    only one paragraph is executed at a time.
>>>>> >>>
>>>>> >>>    Regards,
>>>>> >>>    -Pranav.
>>>>> >>>
>>>>> >>>
>>>>> >>>>    On 15/07/15 7:15 pm, moon soo Lee wrote:
>>>>> >>>> Hi,
>>>>> >>>>
>>>>> >>>> Thanks for asking question.
>>>>> >>>>
>>>>> >>>> The reason is simply that it is running code statements. The
>>>>> >>>> statements can have order and dependency. Imagine I have two
>>>>> >>> paragraphs
>>>>> >>>>
>>>>> >>>> %spark
>>>>> >>>> val a = 1
>>>>> >>>>
>>>>> >>>> %spark
>>>>> >>>> print(a)
>>>>> >>>>
>>>>> >>>> If they're not running one by one, they could run in random
>>>>> >>>> order and the output would differ from run to run: either '1'
>>>>> >>>> or 'val a cannot be found'.
>>>>> >>>>
>>>>> >>>> This is the reason why. But if there is a nice idea to handle this
>>>>> >>>> problem, I agree using a parallel scheduler would help a lot.
>>>>> >>>>
>>>>> >>>> Thanks,
>>>>> >>>> moon
>>>>> >>>> On 2015년 7월 14일 (화) at 오후 7:59 linxi zeng
>>>>> >>>> <linxizeng0615@gmail.com> wrote:
>>>>> >>>>
>>>>> >>>>    anyone who has the same question as me? or is this not a
>>>>> >>> question?
>>>>> >>>>
>>>>> >>>>    2015-07-14 11:47 GMT+08:00 linxi zeng <linxizeng0615@gmail.com>:
>>>>> >>>>
>>>>> >>>>        hi, Moon:
>>>>> >>>>           I notice that the getScheduler function in
>>>>> >>>>        SparkInterpreter.java returns a FIFOScheduler, which
>>>>> >>>>        makes the spark interpreter run spark jobs one by one.
>>>>> >>>>        It's not a good experience when a couple of users do
>>>>> >>>>        some work on zeppelin at the same time, because they
>>>>> >>>>        have to wait for each other. At the same time,
>>>>> >>>>        SparkSqlInterpreter can choose which scheduler to use
>>>>> >>>>        via "zeppelin.spark.concurrentSQL".
>>>>> >>>>        My question is: what considerations was this decision
>>>>> >>>>        based on?
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>
>>>>>
>>>>
>>>>
>
> --
> Sent from a mobile device. Excuse my thumbs.
>
>
>

Re: why zeppelin SparkInterpreter use FIFOScheduler

Posted by Pranav Kumar Agarwal <pr...@gmail.com>.
Hi Rohit,

We implemented the proposal and are able to run Zeppelin as a hosted 
service inside my organization. Our internal forked version has 
pluggable authentication and type ahead.

I need to get the work ported to the latest and chop out the auth 
changes portion. We'll be submitting it soon.

We'll target to get this out for review by 11/26.

Regards,
-Pranav.


On 17/11/15 4:34 am, Rohit Agarwal wrote:
> Hey Pranav,
>
> Did you make any progress on this?
>
> --
> Rohit
>
> On Sunday, August 16, 2015, moon soo Lee <moon@apache.org> wrote:
>
>     Pranav, proposal looks awesome!
>
>     I have a question and feedback,
>
>     You said you tested 1, 2 and 3. To create a SparkIMain per notebook,
>     you need the notebook id. Did you get it from
>     InterpreterContext?
>     Then how did you handle destroying the SparkIMain (when a notebook
>     is deleted)?
>     As far as I know, the interpreter is not able to get information
>     about notebook deletion.
>
>     >> 4. Build a queue inside interpreter to allow only one paragraph
>     execution
>     >> at a time per notebook.
>
>     One downside of this approach is that the GUI will display RUNNING
>     instead of PENDING for jobs queued inside the interpreter.
>
>     Best,
>     moon
>
>     On Sun, Aug 16, 2015 at 12:55 AM IT CTO <goi.cto@gmail.com> wrote:
>
>         +1 for "to re-factor the Zeppelin architecture so that it can
>         handle multi-tenancy easily"
>
>         On Sun, Aug 16, 2015 at 9:47 AM DuyHai Doan
>         <doanduyhai@gmail.com> wrote:
>
>             Agree with Joel, we may think to re-factor the Zeppelin
>             architecture so that it can handle multi-tenancy easily.
>             The technical solution proposed by Pranav is great, but it
>             only applies to Spark. Right now, each interpreter has to
>             manage multi-tenancy in its own way. Ultimately Zeppelin can
>             propose a multi-tenancy contract/info (like a UserContext,
>             similar to InterpreterContext) so that each interpreter
>             can choose whether to use it.
>
>
>             On Sun, Aug 16, 2015 at 3:09 AM, Joel Zambrano
>             <djoelz@gmail.com> wrote:
>
>                 I think that while the idea of running multiple notes
>                 simultaneously is great, it is really dancing around
>                 the lack of true multi-user support in Zeppelin. The
>                 proposed solution would work if the application's
>                 resources are those of the whole cluster, but if the app
>                 is limited (say it has 8 cores of 16, with some
>                 distribution in memory) then potentially your note can
>                 hog all the resources and the scheduler will have to
>                 throttle all other executions, leaving you exactly
>                 where you are now.
>                 While I think the solution is a good one, maybe this
>                 question makes us think about adding true multi-user
>                 support, where we isolate resources (cluster and the
>                 notebooks themselves), have separate login/identity and
>                 (I don't know if it's possible) share the same context.
>
>                 Thanks,
>                 Joel
>
>                 > On Aug 15, 2015, at 1:58 PM, Rohit Agarwal
>                 <mindprince@gmail.com> wrote:
>                 >
>                 > If the problem is that multiple users have to wait
>                 for each other while
>                 > using Zeppelin, the solution already exists: they
>                 can create a new
>                 > interpreter by going to the interpreter page and
>                 attach it to their
>                 > notebook - then they don't have to wait for others
>                 to submit their job.
>                 >
>                 > But I agree, having paragraphs from one note wait
>                 for paragraphs from other
>                 > notes is a confusing default. We can get around that
>                 in two ways:
>                 >
>                 >   1. Create a new interpreter for each note and
>                 attach that interpreter to
>                 >   that note. This approach would require the least amount of code changes but
>                 >   is resource heavy and doesn't let you share Spark
>                 Context between different
>                 >   notes.
>                 >   2. If we want to share the Spark Context between
>                 different notes, we can
>                 >   submit jobs from different notes into different
>                 fairscheduler pools (
>                 >
>                 https://spark.apache.org/docs/1.4.0/job-scheduling.html#scheduling-within-an-application).
>                 >   This can be done by submitting jobs from different
>                 notes in different
>                 >   threads. This will make sure that jobs from one
>                 note are run sequentially
>                 >   but jobs from different notes will be able to run
>                 in parallel.
>                 >
>                 > Neither of these options requires any change in the
>                 Spark code.
>                 >
>                 > --
>                 > Thanks & Regards
>                 > Rohit Agarwal
>                 > https://www.linkedin.com/in/rohitagarwal003
>                 >
>                 > On Sat, Aug 15, 2015 at 12:01 PM, Pranav Kumar
>                 Agarwal <praagarw@gmail.com>
>                 > wrote:
>                 >
>                 >> If someone can share about the idea of sharing
>                 single SparkContext through
>                 >>> multiple SparkILoop safely, it'll be really helpful.
>                 >> Here is a proposal:
>                 >> 1. In Spark code, change SparkIMain.scala to allow
>                 setting the virtual
>                 >> directory. While creating new instances of
>                 SparkIMain per notebook from
>                 >> zeppelin spark interpreter set all the instances of
>                 SparkIMain to the same
>                 >> virtual directory.
>                 >> 2. Start HTTP server on that virtual directory and
>                 set this HTTP server in
>                 >> Spark Context using classserverUri method
>                 >> 3. Scala generated code has a notion of packages.
>                 The default package name
>                 >> is "line$<linenumber>". Package name can be
>                 controlled using System
>                 >> Property scala.repl.name.line. Setting this
>                 property to "notebook id"
>                 >> ensures that code generated by individual instances
>                 of SparkIMain is
>                 >> isolated from other instances of SparkIMain
>                 >> 4. Build a queue inside interpreter to allow only
>                 one paragraph execution
>                 >> at a time per notebook.
>                 >>
>                 >> I have tested 1, 2, and 3 and it seems to provide
>                 isolation across
>                 >> classnames. I'll work towards submitting a formal
>                 patch soon - Is there any
>                 >> Jira already for the same that I can uptake? Also I
>                 need to understand:
>                 >> 1. How does Zeppelin uptake Spark fixes? OR do I
>                 need to first work
>                 >> towards getting Spark changes merged in Apache
>                 Spark github?
>                 >>
>                 >> Any suggestions or comments on the proposal are
>                 highly welcome.
>                 >>
>                 >> Regards,
>                 >> -Pranav.
>                 >>
>                 >>> On 10/08/15 11:36 pm, moon soo Lee wrote:
>                 >>>
>                 >>> Hi piyush,
>                 >>>
>                 >>> Separate instances of SparkILoop and SparkIMain for
>                 >>> each notebook while sharing the SparkContext sounds
>                 >>> great.
>                 >>>
>                 >>> Actually, I tried to do it and found a problem:
>                 >>> multiple SparkILoops could generate the same class
>                 >>> name, and the spark executor confuses classnames
>                 >>> since it reads classes from a single SparkContext.
>                 >>>
>                 >>> If someone can share an idea for safely sharing a
>                 >>> single SparkContext across multiple SparkILoops,
>                 >>> it'll be really helpful.
>                 >>>
>                 >>> Thanks,
>                 >>> moon
>                 >>>
>                 >>>
>                 >>> On Mon, Aug 10, 2015 at 1:21 AM Piyush Mukati
>                 (Data Platform)
>                 >>> <piyush.mukati@flipkart.com> wrote:
>                 >>>
>                 >>>    Hi Moon,
>                 >>>    Any suggestion on it? We have to wait a lot when
>                 >>> multiple people are working with spark.
>                 >>>    Can we create separate instances of SparkILoop,
>                 >>> SparkIMain and print streams for each notebook while
>                 >>> sharing the SparkContext, ZeppelinContext, SQLContext
>                 >>> and DependencyResolver, and then use a parallel
>                 >>> scheduler?
>                 >>>    thanks
>                 >>>
>                 >>>    -piyush
>                 >>>
>                 >>>    Hi Moon,
>                 >>>
>                 >>>    How about tracking a dedicated SparkContext for a
>                 notebook in Spark's
>                 >>>    remote interpreter - this will allow multiple
>                 users to run their spark
>                 >>>    paragraphs in parallel. Also, within a notebook
>                 only one paragraph is
>                 >>>    executed at a time.
>                 >>>
>                 >>>    Regards,
>                 >>>    -Pranav.
>                 >>>
>                 >>>
>                 >>>>    On 15/07/15 7:15 pm, moon soo Lee wrote:
>                 >>>> Hi,
>                 >>>>
>                 >>>> Thanks for asking question.
>                 >>>>
>                 >>>> The reason is simply because of it is running
>                 code statements. The
>                 >>>> statements can have order and dependency. Imagine
>                 i have two
>                 >>> paragraphs
>                 >>>>
>                 >>>> %spark
>                 >>>> val a = 1
>                 >>>>
>                 >>>> %spark
>                 >>>> print(a)
>                 >>>>
>                 >>>> If they're not running one by one, they could run in
>                 >>>> random order and the output would differ from run to
>                 >>>> run: either '1' or 'val a cannot be found'.
>                 >>>>
>                 >>>> This is the reason why. But if there is a nice idea
>                 >>>> to handle this problem, I agree using a parallel
>                 >>>> scheduler would help a lot.
>                 >>>>
>                 >>>> Thanks,
>                 >>>> moon
>                 >>>> On 2015년 7월 14일 (화) at 오후 7:59 linxi zeng
>                 >>>> <linxizeng0615@gmail.com> wrote:
>                 >>>>
>                 >>>>    anyone who has the same question as me? or is
>                 >>> this not a question?
>                 >>>>
>                 >>>>    2015-07-14 11:47 GMT+08:00 linxi zeng
>                 <linxizeng0615@gmail.com>:
>                 >>>>
>                 >>>>        hi, Moon:
>                 >>>>           I notice that the getScheduler function in
>                 >>>>        SparkInterpreter.java returns a FIFOScheduler,
>                 >>>>        which makes the spark interpreter run spark
>                 >>>>        jobs one by one. It's not a good experience
>                 >>>>        when a couple of users do some work on zeppelin
>                 >>>>        at the same time, because they have to wait for
>                 >>>>        each other. At the same time, SparkSqlInterpreter
>                 >>>>        can choose which scheduler to use via
>                 >>>>        "zeppelin.spark.concurrentSQL".
>                 >>>>        My question is: what considerations was this
>                 >>>>        decision based on?
>                 >>>
>                 >>>
>                 >>>
>                 >>>
>                 >>>
>                 ------------------------------------------------------------------------------------------------------------------------------------------
>                 >>>
>                 >>>    This email and any files transmitted with it
>                 are confidential and
>                 >>>    intended solely for the use of the individual
>                 or entity to whom
>                 >>>    they are addressed. If you have received this
>                 email in error
>                 >>>    please notify the system manager. This message
>                 contains
>                 >>>    confidential information and is intended only
>                 for the individual
>                 >>>    named. If you are not the named addressee you
>                 should not
>                 >>>    disseminate, distribute or copy this e-mail.
>                 Please notify the
>                 >>>    sender immediately by e-mail if you have
>                 received this e-mail by
>                 >>>    mistake and delete this e-mail from your
>                 system. If you are not
>                 >>>    the intended recipient you are notified that
>                 disclosing, copying,
>                 >>>    distributing or taking any action in reliance
>                 on the contents of
>                 >>>    this information is strictly prohibited.
>                 Although Flipkart has
>                 >>>    taken reasonable precautions to ensure no
>                 viruses are present in
>                 >>>    this email, the company cannot accept
>                 responsibility for any loss
>                 >>>    or damage arising from the use of this email or
>                 attachments
>                 >>
>
>
>
>
> -- 
> Sent from a mobile device. Excuse my thumbs.


Re: why zeppelin SparkInterpreter use FIFOScheduler

Posted by Rohit Agarwal <mi...@gmail.com>.
Hey Pranav,

Did you make any progress on this?

--
Rohit

On Sunday, August 16, 2015, moon soo Lee <mo...@apache.org> wrote:

> Pranav, proposal looks awesome!
>
> I have a question and feedback,
>
> You said you tested 1, 2 and 3. To create a SparkIMain per notebook, you
> need the notebook id. Did you get it from InterpreterContext?
> Then how did you handle destroying the SparkIMain (when a notebook is
> deleted)?
> As far as I know, the interpreter is not able to get information about
> notebook deletion.
>
> >> 4. Build a queue inside interpreter to allow only one paragraph
> execution
> >> at a time per notebook.
>
> One downside of this approach is that the GUI will display RUNNING instead
> of PENDING for jobs queued inside the interpreter.
>
> Best,
> moon
>
> On Sun, Aug 16, 2015 at 12:55 AM IT CTO <goi.cto@gmail.com> wrote:
>
>> +1 for "to re-factor the Zeppelin architecture so that it can handle
>> multi-tenancy easily"
>>
>> On Sun, Aug 16, 2015 at 9:47 AM DuyHai Doan <doanduyhai@gmail.com> wrote:
>>
>>> Agree with Joel, we may think to re-factor the Zeppelin architecture so
>>> that it can handle multi-tenancy easily. The technical solution proposed
>>> by Pranav is great, but it only applies to Spark. Right now, each
>>> interpreter has to manage multi-tenancy in its own way. Ultimately
>>> Zeppelin can propose a multi-tenancy contract/info (like a UserContext,
>>> similar to InterpreterContext) so that each interpreter can choose
>>> whether to use it.
>>>
>>>
>>> On Sun, Aug 16, 2015 at 3:09 AM, Joel Zambrano <djoelz@gmail.com> wrote:
>>>
>>>> I think that while the idea of running multiple notes simultaneously is
>>>> great, it is really dancing around the lack of true multi-user support in
>>>> Zeppelin. The proposed solution would work if the application's resources
>>>> are those of the whole cluster, but if the app is limited (say it has 8
>>>> cores of 16, with some distribution in memory) then potentially your note
>>>> can hog all the resources and the scheduler will have to throttle all
>>>> other executions, leaving you exactly where you are now.
>>>> While I think the solution is a good one, maybe this question makes us
>>>> think about adding true multi-user support,
>>>> where we isolate resources (cluster and the notebooks themselves), have
>>>> separate login/identity and (I don't know if it's possible) share the same
>>>> context.
>>>>
>>>> Thanks,
>>>> Joel
>>>>
>>>> > On Aug 15, 2015, at 1:58 PM, Rohit Agarwal <mindprince@gmail.com> wrote:
>>>> >
>>>> > If the problem is that multiple users have to wait for each other
>>>> while
>>>> > using Zeppelin, the solution already exists: they can create a new
>>>> > interpreter by going to the interpreter page and attach it to their
>>>> > notebook - then they don't have to wait for others to submit their
>>>> job.
>>>> >
>>>> > But I agree, having paragraphs from one note wait for paragraphs from
>>>> other
>>>> > notes is a confusing default. We can get around that in two ways:
>>>> >
>>>> >   1. Create a new interpreter for each note and attach that
>>>> interpreter to
>>>> >   that note. This approach would require the least amount of code
>>>> changes but
>>>> >   is resource heavy and doesn't let you share Spark Context between
>>>> different
>>>> >   notes.
>>>> >   2. If we want to share the Spark Context between different notes,
>>>> we can
>>>> >   submit jobs from different notes into different fairscheduler pools
>>>> (
>>>> >
>>>> https://spark.apache.org/docs/1.4.0/job-scheduling.html#scheduling-within-an-application
>>>> ).
>>>> >   This can be done by submitting jobs from different notes in
>>>> different
>>>> >   threads. This will make sure that jobs from one note are run
>>>> sequentially
>>>> >   but jobs from different notes will be able to run in parallel.
>>>> >
>>>> > Neither of these options requires any change in the Spark code.
>>>> >
>>>> > --
>>>> > Thanks & Regards
>>>> > Rohit Agarwal
>>>> > https://www.linkedin.com/in/rohitagarwal003
>>>> >
>>>> > On Sat, Aug 15, 2015 at 12:01 PM, Pranav Kumar Agarwal <
>>>> praagarw@gmail.com>
>>>> > wrote:
>>>> >
>>>> >> If someone can share an idea for safely sharing a single SparkContext
>>>> >>> across multiple SparkILoops, it'll be really helpful.
>>>> >> Here is a proposal:
>>>> >> 1. In Spark code, change SparkIMain.scala to allow setting the
>>>> virtual
>>>> >> directory. While creating new instances of SparkIMain per notebook
>>>> from
>>>> >> zeppelin spark interpreter set all the instances of SparkIMain to
>>>> the same
>>>> >> virtual directory.
>>>> >> 2. Start HTTP server on that virtual directory and set this HTTP
>>>> server in
>>>> >> Spark Context using classserverUri method
>>>> >> 3. Scala generated code has a notion of packages. The default
>>>> package name
>>>> >> is "line$<linenumber>". Package name can be controlled using System
>>>> >> Property scala.repl.name.line. Setting this property to "notebook id"
>>>> >> ensures that code generated by individual instances of SparkIMain is
>>>> >> isolated from other instances of SparkIMain
>>>> >> 4. Build a queue inside interpreter to allow only one paragraph
>>>> execution
>>>> >> at a time per notebook.
>>>> >>
>>>> >> I have tested 1, 2, and 3 and it seems to provide isolation across
>>>> >> classnames. I'll work towards submitting a formal patch soon - Is
>>>> there any
>>>> >> Jira already for the same that I can uptake? Also I need to
>>>> understand:
>>>> >> 1. How does Zeppelin uptake Spark fixes? OR do I need to first work
>>>> >> towards getting Spark changes merged in Apache Spark github?
>>>> >>
>>>> >> Any suggestions or comments on the proposal are highly welcome.
>>>> >>
>>>> >> Regards,
>>>> >> -Pranav.
>>>> >>
>>>> >>> On 10/08/15 11:36 pm, moon soo Lee wrote:
>>>> >>>
>>>> >>> Hi piyush,
>>>> >>>
>>>> >>> Separate instances of SparkILoop and SparkIMain for each notebook
>>>> >>> while sharing the SparkContext sounds great.
>>>> >>>
>>>> >>> Actually, I tried to do it and found a problem: multiple SparkILoops
>>>> >>> could generate the same class name, and the spark executor confuses
>>>> >>> classnames since it reads classes from a single SparkContext.
>>>> >>>
>>>> >>> If someone can share an idea for safely sharing a single SparkContext
>>>> >>> across multiple SparkILoops, it'll be really helpful.
>>>> >>>
>>>> >>> Thanks,
>>>> >>> moon
>>>> >>>
>>>> >>>
>>>> >>> On Mon, Aug 10, 2015 at 1:21 AM Piyush Mukati (Data Platform) <
>>>> >>> piyush.mukati@flipkart.com> wrote:
>>>> >>>
>>>> >>>    Hi Moon,
>>>> >>>    Any suggestion on it? We have to wait a lot when multiple
>>>> >>> people are working with spark.
>>>> >>>    Can we create separate instances of SparkILoop, SparkIMain and
>>>> >>> print streams for each notebook while sharing the SparkContext,
>>>> >>> ZeppelinContext, SQLContext and DependencyResolver, and then use a
>>>> >>> parallel scheduler?
>>>> >>>    thanks
>>>> >>>
>>>> >>>    -piyush
>>>> >>>
>>>> >>>    Hi Moon,
>>>> >>>
>>>> >>>    How about tracking a dedicated SparkContext for a notebook in
>>>> Spark's
>>>> >>>    remote interpreter - this will allow multiple users to run their
>>>> spark
>>>> >>>    paragraphs in parallel. Also, within a notebook only one
>>>> paragraph is
>>>> >>>    executed at a time.
>>>> >>>
>>>> >>>    Regards,
>>>> >>>    -Pranav.
>>>> >>>
>>>> >>>
>>>> >>>>    On 15/07/15 7:15 pm, moon soo Lee wrote:
>>>> >>>> Hi,
>>>> >>>>
>>>> >>>> Thanks for asking question.
>>>> >>>>
>>>> >>>> The reason is simply that it is running code statements. The
>>>> >>>> statements can have order and dependency. Imagine I have two
>>>> >>> paragraphs
>>>> >>>>
>>>> >>>> %spark
>>>> >>>> val a = 1
>>>> >>>>
>>>> >>>> %spark
>>>> >>>> print(a)
>>>> >>>>
>>>> >>>> If they're not running one by one, they could run in random order
>>>> >>>> and the output would differ from run to run: either '1' or
>>>> >>>> 'val a cannot be found'.
>>>> >>>>
>>>> >>>> This is the reason why. But if there is a nice idea to handle this
>>>> >>>> problem, I agree using a parallel scheduler would help a lot.
>>>> >>>>
>>>> >>>> Thanks,
>>>> >>>> moon
>>>> >>>> On 2015년 7월 14일 (화) at 오후 7:59 linxi zeng
>>>> >>>> <linxizeng0615@gmail.com> wrote:
>>>> >>>>
>>>> >>>>    anyone who has the same question as me? or is this not a
>>>> >>> question?
>>>> >>>>
>>>> >>>>    2015-07-14 11:47 GMT+08:00 linxi zeng <linxizeng0615@gmail.com>:
>>>> >>>>
>>>> >>>>        hi, Moon:
>>>> >>>>           I notice that the getScheduler function in
>>>> >>>>        SparkInterpreter.java returns a FIFOScheduler, which makes the
>>>> >>>>        spark interpreter run spark jobs one by one. It's not a good
>>>> >>>>        experience when a couple of users do some work on zeppelin at
>>>> >>>>        the same time, because they have to wait for each other.
>>>> >>>>        At the same time, SparkSqlInterpreter can choose which
>>>> >>>>        scheduler to use via "zeppelin.spark.concurrentSQL".
>>>> >>>>        My question is: what considerations was this decision
>>>> >>>>        based on?
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>
>>>>
>>>
>>>

-- 
Sent from a mobile device. Excuse my thumbs.

Re: why zeppelin SparkInterpreter use FIFOScheduler

Posted by Rohit Agarwal <mi...@gmail.com>.
Hey Pranav,

Did you make any progress on this?

--
Rohit

On Sunday, August 16, 2015, moon soo Lee <mo...@apache.org> wrote:

> Pranav, proposal looks awesome!
>
> I have a question and feedback,
>
> You said you tested 1,2 and 3. To create SparkIMain per notebook, you need
> information of notebook id. Did you get it from InterpreterContext?
> Then how did you handle destroying of SparkIMain (when notebook is
> deleting)?
> As far as i know, interpreter not able to get information of notebook
> deletion.
>
> >> 4. Build a queue inside interpreter to allow only one paragraph
> execution
> >> at a time per notebook.
>
> One downside of this approach is, GUI will display RUNNING instead of
> PENDING for jobs inside of queue in interpreter.
>
> Best,
> moon
>
> On Sun, Aug 16, 2015 at 12:55 AM IT CTO <goi.cto@gmail.com
> <javascript:_e(%7B%7D,'cvml','goi.cto@gmail.com');>> wrote:
>
>> +1 for "to re-factor the Zeppelin architecture so that it can handle
>> multi-tenancy easily"
>>
>> On Sun, Aug 16, 2015 at 9:47 AM DuyHai Doan <doanduyhai@gmail.com> wrote:
>>
>>> Agree with Joel, we may think to re-factor the Zeppelin architecture so
>>> that it can handle multi-tenancy easily. The technical solution proposed by Pranav
>>> is great but it only applies to Spark. Right now, each interpreter has to
>>> manage multi-tenancy its own way. Ultimately Zeppelin can propose a
>>> multi-tenancy contract/info (like UserContext, similar to
>>> InterpreterContext) so that each interpreter can choose to use or not.
>>>
>>>
>>> On Sun, Aug 16, 2015 at 3:09 AM, Joel Zambrano <djoelz@gmail.com> wrote:
>>>
>>>> I think while the idea of running multiple notes simultaneously is
>>>> great. It is really dancing around the lack of true multi user support in
>>>> Zeppelin. While the proposed solution would work if the applications
>>>> resources are those of the whole cluster, if the app is limited (say they
>>>> are 8 cores of 16, with some distribution in memory) then potentially your
>>>> note can hog all the resources and the scheduler will have to throttle all
>>>> other executions leaving you exactly where you are now.
>>>> While I think the solution is a good one, maybe this question makes us
>>>> think in adding true multiuser support.
>>>> Where we isolate resources (cluster and the notebooks themselves), have
>>>> separate login/identity and (I don't know if it's possible) share the same
>>>> context.
>>>>
>>>> Thanks,
>>>> Joel
>>>>
>>>> > On Aug 15, 2015, at 1:58 PM, Rohit Agarwal <mindprince@gmail.com> wrote:
>>>> >
>>>> > If the problem is that multiple users have to wait for each other
>>>> while
>>>> > using Zeppelin, the solution already exists: they can create a new
>>>> > interpreter by going to the interpreter page and attach it to their
>>>> > notebook - then they don't have to wait for others to submit their
>>>> job.
>>>> >
>>>> > But I agree, having paragraphs from one note wait for paragraphs from
>>>> other
>>>> > notes is a confusing default. We can get around that in two ways:
>>>> >
>>>> >   1. Create a new interpreter for each note and attach that
>>>> interpreter to
>>>> >   that note. This approach would require the least amount of code
>>>> changes but
>>>> >   is resource heavy and doesn't let you share Spark Context between
>>>> different
>>>> >   notes.
>>>> >   2. If we want to share the Spark Context between different notes,
>>>> we can
>>>> >   submit jobs from different notes into different fairscheduler pools
>>>> (
>>>> >
>>>> https://spark.apache.org/docs/1.4.0/job-scheduling.html#scheduling-within-an-application
>>>> ).
>>>> >   This can be done by submitting jobs from different notes in
>>>> different
>>>> >   threads. This will make sure that jobs from one note are run
>>>> sequentially
>>>> >   but jobs from different notes will be able to run in parallel.
>>>> >
>>>> > Neither of these options require any change in the Spark code.
>>>> >
>>>> > --
>>>> > Thanks & Regards
>>>> > Rohit Agarwal
>>>> > https://www.linkedin.com/in/rohitagarwal003
>>>> >
>>>> > On Sat, Aug 15, 2015 at 12:01 PM, Pranav Kumar Agarwal <praagarw@gmail.com>
>>>> > wrote:
>>>> >
>>>> >> If someone can share about the idea of sharing single SparkContext
>>>> through
>>>> >>> multiple SparkILoop safely, it'll be really helpful.
>>>> >> Here is a proposal:
>>>> >> 1. In Spark code, change SparkIMain.scala to allow setting the
>>>> virtual
>>>> >> directory. While creating new instances of SparkIMain per notebook
>>>> from
>>>> >> zeppelin spark interpreter set all the instances of SparkIMain to
>>>> the same
>>>> >> virtual directory.
>>>> >> 2. Start HTTP server on that virtual directory and set this HTTP
>>>> server in
>>>> >> Spark Context using classserverUri method
>>>> >> 3. Scala generated code has a notion of packages. The default
>>>> package name
>>>> >> is "line$<linenumber>". Package name can be controlled using System
>>>> >> Property scala.repl.name.line. Setting this property to "notebook id"
>>>> >> ensures that code generated by individual instances of SparkIMain is
>>>> >> isolated from other instances of SparkIMain
>>>> >> 4. Build a queue inside interpreter to allow only one paragraph
>>>> execution
>>>> >> at a time per notebook.
>>>> >>
>>>> >> I have tested 1, 2, and 3 and it seems to provide isolation across
>>>> >> classnames. I'll work towards submitting a formal patch soon - Is
>>>> there any
>>>> >> Jira already for the same that I can uptake? Also I need to
>>>> understand:
>>>> >> 1. How does Zeppelin uptake Spark fixes? OR do I need to first work
>>>> >> towards getting Spark changes merged in Apache Spark github?
>>>> >>
>>>> >> Any suggestions on comments on the proposal are highly welcome.
>>>> >>
>>>> >> Regards,
>>>> >> -Pranav.
>>>> >>
>>>> >>> On 10/08/15 11:36 pm, moon soo Lee wrote:
>>>> >>>
>>>> >>> Hi piyush,
>>>> >>>
>>>> >>> Separate instance of SparkILoop SparkIMain for each notebook while
>>>> >>> sharing the SparkContext sounds great.
>>>> >>>
>>>> >>> Actually, i tried to do it, found problem that multiple SparkILoop
>>>> could
>>>> >>> generates the same class name, and spark executor confuses
>>>> classname since
>>>> >>> they're reading classes from single SparkContext.
>>>> >>>
>>>> >>> If someone can share about the idea of sharing single SparkContext
>>>> >>> through multiple SparkILoop safely, it'll be really helpful.
>>>> >>>
>>>> >>> Thanks,
>>>> >>> moon
>>>> >>>
>>>> >>>
>>>> >>> On Mon, Aug 10, 2015 at 1:21 AM Piyush Mukati (Data Platform) <
>>>> >>> piyush.mukati@flipkart.com <mailto:piyush.mukati@flipkart.com>> wrote:
>>>> >>>
>>>> >>>    Hi Moon,
>>>> >>>    Any suggestion on it, have to wait lot when multiple people
>>>> working
>>>> >>> with spark.
>>>> >>>    Can we create separate instance of   SparkILoop  SparkIMain and
>>>> >>> printstreams  for each notebook while sharing the SparkContext
>>>> >>> ZeppelinContext   SQLContext and DependencyResolver and then use
>>>> parallel
>>>> >>> scheduler ?
>>>> >>>    thanks
>>>> >>>
>>>> >>>    -piyush
>>>> >>>
>>>> >>>    Hi Moon,
>>>> >>>
>>>> >>>    How about tracking dedicated SparkContext for a notebook in
>>>> Spark's
>>>> >>>    remote interpreter - this will allow multiple users to run their
>>>> spark
>>>> >>>    paragraphs in parallel. Also, within a notebook only one
>>>> paragraph is
>>>> >>>    executed at a time.
>>>> >>>
>>>> >>>    Regards,
>>>> >>>    -Pranav.
>>>> >>>
>>>> >>>
>>>> >>>>    On 15/07/15 7:15 pm, moon soo Lee wrote:
>>>> >>>> Hi,
>>>> >>>>
>>>> >>>> Thanks for asking question.
>>>> >>>>
>>>> >>>> The reason is simply because of it is running code statements. The
>>>> >>>> statements can have order and dependency. Imagine i have two
>>>> >>> paragraphs
>>>> >>>>
>>>> >>>> %spark
>>>> >>>> val a = 1
>>>> >>>>
>>>> >>>> %spark
>>>> >>>> print(a)
>>>> >>>>
>>>> >>>> If they're not running one by one, that means they possibly runs in
>>>> >>>> random order and the output will be always different. Either '1' or
>>>> >>>> 'val a can not found'.
>>>> >>>>
>>>> >>>> This is the reason why. But if there are nice idea to handle this
>>>> >>>> problem i agree using parallel scheduler would help a lot.
>>>> >>>>
>>>> >>>> Thanks,
>>>> >>>> moon
>>>> >>>> On Tue, Jul 14, 2015 at 7:59 PM linxi zeng
>>>> >>>> <linxizeng0615@gmail.com <mailto:linxizeng0615@gmail.com>>
>>>> >>> wrote:
>>>> >>>>
>>>> >>>>    any one who have the same question with me? or this is not a
>>>> >>> question?
>>>> >>>>
>>>> >>>>    2015-07-14 11:47 GMT+08:00 linxi zeng <linxizeng0615@gmail.com
>>>> >>>>    <mailto:linxizeng0615@gmail.com>>:
>>>> >>>>
>>>> >>>>        hi, Moon:
>>>> >>>>           I notice that the getScheduler function in the
>>>> >>>>        SparkInterpreter.java return a FIFOScheduler which makes the
>>>> >>>>        spark interpreter run spark job one by one. It's not a good
>>>> >>>>        experience when couple of users do some work on zeppelin at
>>>> >>>>        the same time, because they have to wait for each other.
>>>> >>>>        And at the same time, SparkSqlInterpreter can chose what
>>>> >>>>        scheduler to use by "zeppelin.spark.concurrentSQL".
>>>> >>>>        My question is, what kind of consideration do you based on
>>>> >>> to
>>>> >>>>        make such a decision?
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>
>>>>
>>>
>>>

-- 
Sent from a mobile device. Excuse my thumbs.

Re: why zeppelin SparkInterpreter use FIFOScheduler

Posted by Pranav Kumar Agarwal <pr...@gmail.com>.
Hi Moon,

Yes, the notebook id comes from InterpreterContext. At the moment,
destroying the SparkIMain when a notebook is deleted is not handled. I think
SparkIMain is a lightweight object; do you see a concern with holding these
objects in a map? One possible option could be to destroy the notebook-related
objects once a notebook has been inactive for more than, say, 8 hours.
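
To make it concrete, here is a rough sketch of the kind of per-notebook
registry I have in mind (NoteSession, the sweep, and the close() call are
illustrative assumptions, not actual Zeppelin or Spark API):

    import java.util.concurrent.ConcurrentHashMap
    import org.apache.spark.repl.SparkIMain

    class NoteSessions(createIMain: () => SparkIMain) {
      // Hypothetical holder for per-notebook REPL state.
      private case class NoteSession(imain: SparkIMain, var lastUsedMs: Long)
      private val sessions = new ConcurrentHashMap[String, NoteSession]()
      private val timeoutMs = 8L * 60 * 60 * 1000  // say, 8 hours

      // noteId comes from InterpreterContext, as mentioned above.
      def get(noteId: String): SparkIMain = synchronized {
        var s = sessions.get(noteId)
        if (s == null) {
          // point 3 of the proposal: name generated classes after the note
          System.setProperty("scala.repl.name.line", noteId)
          s = NoteSession(createIMain(), 0L)
          sessions.put(noteId, s)
        }
        s.lastUsedMs = System.currentTimeMillis()
        s.imain
      }

      // Run periodically: drop sessions idle longer than the timeout.
      def sweep(): Unit = synchronized {
        val it = sessions.entrySet().iterator()
        while (it.hasNext) {
          val e = it.next()
          if (System.currentTimeMillis() - e.getValue.lastUsedMs > timeoutMs) {
            e.getValue.imain.close()  // assuming SparkIMain exposes close()
            it.remove()
          }
        }
      }
    }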

> >> 4. Build a queue inside interpreter to allow only one paragraph 
> execution
> >> at a time per notebook.
>
> One downside of this approach is, GUI will display RUNNING instead of 
> PENDING for jobs inside of queue in interpreter.
Yes, that's a good point. A scheduler at the Zeppelin server that runs
notebooks in parallel while keeping paragraphs FIFO within each notebook
would be nice. Is there any plan for such a scheduler?
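
To illustrate, a minimal sketch of such a scheduler (class name and wiring
are made up, not actual Zeppelin code):

    import java.util.concurrent.{ConcurrentHashMap, Executors, ExecutorService}

    // FIFO per notebook, parallel across notebooks: one single-threaded
    // executor per note id, so a note's paragraphs run in submission order
    // while different notes proceed independently.
    class PerNoteFIFOScheduler {
      private val queues = new ConcurrentHashMap[String, ExecutorService]()

      def submit(noteId: String, paragraph: Runnable): Unit = {
        var q = queues.get(noteId)
        if (q == null) {
          queues.synchronized {
            q = queues.get(noteId)
            if (q == null) {
              q = Executors.newSingleThreadExecutor()
              queues.put(noteId, q)
            }
          }
        }
        q.submit(paragraph)
      }
    }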

Regards,
-Pranav.

On 17/08/15 5:38 am, moon soo Lee wrote:
> Pranav, proposal looks awesome!
>
> I have a question and feedback,
>
> You said you tested 1,2 and 3. To create SparkIMain per notebook, you 
> need information of notebook id. Did you get it from InterpreterContext?
> Then how did you handle destroying of SparkIMain (when notebook is 
> deleting)?
> As far as i know, interpreter not able to get information of notebook 
> deletion.
>
> >> 4. Build a queue inside interpreter to allow only one paragraph 
> execution
> >> at a time per notebook.
>
> One downside of this approach is, GUI will display RUNNING instead of 
> PENDING for jobs inside of queue in interpreter.
>
> Best,
> moon
>
> On Sun, Aug 16, 2015 at 12:55 AM IT CTO <goi.cto@gmail.com 
> <ma...@gmail.com>> wrote:
>
>     +1 for "to re-factor the Zeppelin architecture so that it can
>     handle multi-tenancy easily"
>
>     On Sun, Aug 16, 2015 at 9:47 AM DuyHai Doan <doanduyhai@gmail.com
>     <ma...@gmail.com>> wrote:
>
>         Agree with Joel, we may think to re-factor the Zeppelin
>         architecture so that it can handle multi-tenancy easily. The
>         technical solution proposed by Pranav is great but it only
>         applies to Spark. Right now, each interpreter has to manage
>         multi-tenancy its own way. Ultimately Zeppelin can propose a
>         multi-tenancy contract/info (like UserContext, similar to
>         InterpreterContext) so that each interpreter can choose to use
>         or not.
>
>
>         On Sun, Aug 16, 2015 at 3:09 AM, Joel Zambrano
>         <djoelz@gmail.com <ma...@gmail.com>> wrote:
>
>             I think while the idea of running multiple notes
>             simultaneously is great. It is really dancing around the
>             lack of true multi user support in Zeppelin. While the
>             proposed solution would work if the applications resources
>             are those of the whole cluster, if the app is limited (say
>             they are 8 cores of 16, with some distribution in memory)
>             then potentially your note can hog all the resources and
>             the scheduler will have to throttle all other executions
>             leaving you exactly where you are now.
>             While I think the solution is a good one, maybe this
>             question makes us think in adding true multiuser support.
>             Where we isolate resources (cluster and the notebooks
>             themselves), have separate login/identity and (I don't
>             know if it's possible) share the same context.
>
>             Thanks,
>             Joel
>
>             > On Aug 15, 2015, at 1:58 PM, Rohit Agarwal
>             <mindprince@gmail.com <ma...@gmail.com>> wrote:
>             >
>             > If the problem is that multiple users have to wait for
>             each other while
>             > using Zeppelin, the solution already exists: they can
>             create a new
>             > interpreter by going to the interpreter page and attach
>             it to their
>             > notebook - then they don't have to wait for others to
>             submit their job.
>             >
>             > But I agree, having paragraphs from one note wait for
>             paragraphs from other
>             > notes is a confusing default. We can get around that in
>             two ways:
>             >
>             >   1. Create a new interpreter for each note and attach
>             that interpreter to
>             >   that note. This approach would require the least amount of code changes but
>             >   is resource heavy and doesn't let you share Spark
>             Context between different
>             >   notes.
>             >   2. If we want to share the Spark Context between
>             different notes, we can
>             >   submit jobs from different notes into different
>             fairscheduler pools (
>             >
>             https://spark.apache.org/docs/1.4.0/job-scheduling.html#scheduling-within-an-application).
>             >   This can be done by submitting jobs from different
>             notes in different
>             >   threads. This will make sure that jobs from one note
>             are run sequentially
>             >   but jobs from different notes will be able to run in
>             parallel.
>             >
>             > Neither of these options require any change in the Spark
>             code.
>             >
>             > --
>             > Thanks & Regards
>             > Rohit Agarwal
>             > https://www.linkedin.com/in/rohitagarwal003
>             >
>             > On Sat, Aug 15, 2015 at 12:01 PM, Pranav Kumar Agarwal
>             <praagarw@gmail.com <ma...@gmail.com>>
>             > wrote:
>             >
>             >> If someone can share about the idea of sharing single
>             SparkContext through
>             >>> multiple SparkILoop safely, it'll be really helpful.
>             >> Here is a proposal:
>             >> 1. In Spark code, change SparkIMain.scala to allow
>             setting the virtual
>             >> directory. While creating new instances of SparkIMain
>             per notebook from
>             >> zeppelin spark interpreter set all the instances of
>             SparkIMain to the same
>             >> virtual directory.
>             >> 2. Start HTTP server on that virtual directory and set
>             this HTTP server in
>             >> Spark Context using classserverUri method
>             >> 3. Scala generated code has a notion of packages. The
>             default package name
>             >> is "line$<linenumber>". Package name can be controlled
>             using System
>             >> Property scala.repl.name.line. Setting this property to
>             "notebook id"
>             >> ensures that code generated by individual instances of
>             SparkIMain is
>             >> isolated from other instances of SparkIMain
>             >> 4. Build a queue inside interpreter to allow only one
>             paragraph execution
>             >> at a time per notebook.
>             >>
>             >> I have tested 1, 2, and 3 and it seems to provide
>             isolation across
>             >> classnames. I'll work towards submitting a formal patch
>             soon - Is there any
>             >> Jira already for the same that I can uptake? Also I
>             need to understand:
>             >> 1. How does Zeppelin uptake Spark fixes? OR do I need
>             to first work
>             >> towards getting Spark changes merged in Apache Spark
>             github?
>             >>
>             >> Any suggestions on comments on the proposal are highly
>             welcome.
>             >>
>             >> Regards,
>             >> -Pranav.
>             >>
>             >>> On 10/08/15 11:36 pm, moon soo Lee wrote:
>             >>>
>             >>> Hi piyush,
>             >>>
>             >>> Separate instance of SparkILoop SparkIMain for each
>             notebook while
>             >>> sharing the SparkContext sounds great.
>             >>>
>             >>> Actually, i tried to do it, found problem that
>             multiple SparkILoop could
>             >>> generates the same class name, and spark executor
>             confuses classname since
>             >>> they're reading classes from single SparkContext.
>             >>>
>             >>> If someone can share about the idea of sharing single
>             SparkContext
>             >>> through multiple SparkILoop safely, it'll be really
>             helpful.
>             >>>
>             >>> Thanks,
>             >>> moon
>             >>>
>             >>>
>             >>> On Mon, Aug 10, 2015 at 1:21 AM Piyush Mukati (Data
>             Platform) <
>             >>> piyush.mukati@flipkart.com
>             <ma...@flipkart.com>
>             <mailto:piyush.mukati@flipkart.com
>             <ma...@flipkart.com>>> wrote:
>             >>>
>             >>>    Hi Moon,
>             >>>    Any suggestion on it, have to wait lot when
>             multiple people  working
>             >>> with spark.
>             >>>    Can we create separate instance of   SparkILoop 
>             SparkIMain and
>             >>> printstreams  for each notebook while sharing
>             the SparkContext
>             >>> ZeppelinContext   SQLContext and DependencyResolver
>             and then use parallel
>             >>> scheduler ?
>             >>>    thanks
>             >>>
>             >>>    -piyush
>             >>>
>             >>>    Hi Moon,
>             >>>
>             >>>    How about tracking dedicated SparkContext for a
>             notebook in Spark's
>             >>>    remote interpreter - this will allow multiple users
>             to run their spark
>             >>>    paragraphs in parallel. Also, within a notebook
>             only one paragraph is
>             >>>    executed at a time.
>             >>>
>             >>>    Regards,
>             >>>    -Pranav.
>             >>>
>             >>>
>             >>>>    On 15/07/15 7:15 pm, moon soo Lee wrote:
>             >>>> Hi,
>             >>>>
>             >>>> Thanks for asking question.
>             >>>>
>             >>>> The reason is simply because of it is running code
>             statements. The
>             >>>> statements can have order and dependency. Imagine i
>             have two
>             >>> paragraphs
>             >>>>
>             >>>> %spark
>             >>>> val a = 1
>             >>>>
>             >>>> %spark
>             >>>> print(a)
>             >>>>
>             >>>> If they're not running one by one, that means they
>             possibly runs in
>             >>>> random order and the output will be always different.
>             Either '1' or
>             >>>> 'val a can not found'.
>             >>>>
>             >>>> This is the reason why. But if there are nice idea to
>             handle this
>             >>>> problem i agree using parallel scheduler would help a
>             lot.
>             >>>>
>             >>>> Thanks,
>             >>>> moon
>             >>>> On Tue, Jul 14, 2015 at 7:59 PM linxi zeng
>             >>>> <linxizeng0615@gmail.com
>             <ma...@gmail.com>
>             <mailto:linxizeng0615@gmail.com
>             <ma...@gmail.com>>
>             >>> <mailto:linxizeng0615@gmail.com
>             <ma...@gmail.com>
>             <mailto:linxizeng0615@gmail.com
>             <ma...@gmail.com>>>>
>             >>> wrote:
>             >>>>
>             >>>>    any one who have the same question with me? or
>             this is not a
>             >>> question?
>             >>>>
>             >>>>    2015-07-14 11:47 GMT+08:00 linxi zeng
>             <linxizeng0615@gmail.com <ma...@gmail.com>
>             >>> <mailto:linxizeng0615@gmail.com
>             <ma...@gmail.com>>
>             >>>>    <mailto:linxizeng0615@gmail.com
>             <ma...@gmail.com> <mailto:
>             >>> linxizeng0615@gmail.com
>             <ma...@gmail.com>>>>:
>             >>>>
>             >>>>        hi, Moon:
>             >>>>           I notice that the getScheduler function in the
>             >>>> SparkInterpreter.java return a FIFOScheduler which
>             makes the
>             >>>>        spark interpreter run spark job one by one.
>             It's not a good
>             >>>>        experience when couple of users do some work
>             on zeppelin at
>             >>>>        the same time, because they have to wait for
>             each other.
>             >>>>        And at the same time, SparkSqlInterpreter can
>             chose what
>             >>>>        scheduler to use by
>             "zeppelin.spark.concurrentSQL".
>             >>>>        My question is, what kind of consideration do
>             you based on
>             >>> to
>             >>>>        make such a decision?
>             >>>
>             >>>
>             >>>
>             >>>
>             >>>
>             >>
>
>


Re: why zeppelin SparkInterpreter use FIFOScheduler

Posted by moon soo Lee <mo...@apache.org>.
Pranav, the proposal looks awesome!

I have a question and some feedback.

You said you tested 1, 2, and 3. To create a SparkIMain per notebook, you need
the notebook id. Did you get it from InterpreterContext?
Then how did you handle destroying the SparkIMain (when a notebook is
deleted)?
As far as I know, the interpreter is not able to get notified of notebook
deletion.

>> 4. Build a queue inside interpreter to allow only one paragraph execution
>> at a time per notebook.

One downside of this approach is that the GUI will display RUNNING instead of
PENDING for jobs sitting in the interpreter's queue.
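
(For comparison, Rohit's fair scheduler pool suggestion quoted below keeps
that queueing inside Spark itself. A rough sketch: setLocalProperty and
spark.scheduler.pool are real Spark API, but the helper around them is
assumed.)

    import org.apache.spark.SparkContext

    object NotePools {
      // Fair scheduling between pools also needs spark.scheduler.mode=FAIR
      // to be set on the SparkContext.
      def runInNotePool[T](sc: SparkContext, noteId: String)(body: => T): T = {
        sc.setLocalProperty("spark.scheduler.pool", "note-" + noteId)
        try body
        finally sc.setLocalProperty("spark.scheduler.pool", null)  // clear pool
      }
    }

Jobs a paragraph submits from that thread then land in the note's pool, so
Spark can interleave different notes even with a queue per note.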

Best,
moon

On Sun, Aug 16, 2015 at 12:55 AM IT CTO <go...@gmail.com> wrote:

> +1 for "to re-factor the Zeppelin architecture so that it can handle
> multi-tenancy easily"
>
> On Sun, Aug 16, 2015 at 9:47 AM DuyHai Doan <do...@gmail.com> wrote:
>
>> Agree with Joel, we may think to re-factor the Zeppelin architecture so
>> that it can handle multi-tenancy easily. The technical solution proposed by Pranav
>> is great but it only applies to Spark. Right now, each interpreter has to
>> manage multi-tenancy its own way. Ultimately Zeppelin can propose a
>> multi-tenancy contract/info (like UserContext, similar to
>> InterpreterContext) so that each interpreter can choose to use or not.
>>
>>
>> On Sun, Aug 16, 2015 at 3:09 AM, Joel Zambrano <dj...@gmail.com> wrote:
>>
>>> I think while the idea of running multiple notes simultaneously is
>>> great. It is really dancing around the lack of true multi user support in
>>> Zeppelin. While the proposed solution would work if the applications
>>> resources are those of the whole cluster, if the app is limited (say they
>>> are 8 cores of 16, with some distribution in memory) then potentially your
>>> note can hog all the resources and the scheduler will have to throttle all
>>> other executions leaving you exactly where you are now.
>>> While I think the solution is a good one, maybe this question makes us
>>> think in adding true multiuser support.
>>> Where we isolate resources (cluster and the notebooks themselves), have
>>> separate login/identity and (I don't know if it's possible) share the same
>>> context.
>>>
>>> Thanks,
>>> Joel
>>>
>>> > On Aug 15, 2015, at 1:58 PM, Rohit Agarwal <mi...@gmail.com>
>>> wrote:
>>> >
>>> > If the problem is that multiple users have to wait for each other while
>>> > using Zeppelin, the solution already exists: they can create a new
>>> > interpreter by going to the interpreter page and attach it to their
>>> > notebook - then they don't have to wait for others to submit their job.
>>> >
>>> > But I agree, having paragraphs from one note wait for paragraphs from
>>> other
>>> > notes is a confusing default. We can get around that in two ways:
>>> >
>>> >   1. Create a new interpreter for each note and attach that
>>> interpreter to
>>> >   that note. This approach would require the least amount of code
>>> changes but
>>> >   is resource heavy and doesn't let you share Spark Context between
>>> different
>>> >   notes.
>>> >   2. If we want to share the Spark Context between different notes, we
>>> can
>>> >   submit jobs from different notes into different fairscheduler pools (
>>> >
>>> https://spark.apache.org/docs/1.4.0/job-scheduling.html#scheduling-within-an-application
>>> ).
>>> >   This can be done by submitting jobs from different notes in different
>>> >   threads. This will make sure that jobs from one note are run
>>> sequentially
>>> >   but jobs from different notes will be able to run in parallel.
>>> >
>>> > Neither of these options require any change in the Spark code.
>>> >
>>> > --
>>> > Thanks & Regards
>>> > Rohit Agarwal
>>> > https://www.linkedin.com/in/rohitagarwal003
>>> >
>>> > On Sat, Aug 15, 2015 at 12:01 PM, Pranav Kumar Agarwal <
>>> praagarw@gmail.com>
>>> > wrote:
>>> >
>>> >> If someone can share about the idea of sharing single SparkContext
>>> through
>>> >>> multiple SparkILoop safely, it'll be really helpful.
>>> >> Here is a proposal:
>>> >> 1. In Spark code, change SparkIMain.scala to allow setting the virtual
>>> >> directory. While creating new instances of SparkIMain per notebook
>>> from
>>> >> zeppelin spark interpreter set all the instances of SparkIMain to the
>>> same
>>> >> virtual directory.
>>> >> 2. Start HTTP server on that virtual directory and set this HTTP
>>> server in
>>> >> Spark Context using classserverUri method
>>> >> 3. Scala generated code has a notion of packages. The default package
>>> name
>>> >> is "line$<linenumber>". Package name can be controlled using System
>>> >> Property scala.repl.name.line. Setting this property to "notebook id"
>>> >> ensures that code generated by individual instances of SparkIMain is
>>> >> isolated from other instances of SparkIMain
>>> >> 4. Build a queue inside interpreter to allow only one paragraph
>>> execution
>>> >> at a time per notebook.
>>> >>
>>> >> I have tested 1, 2, and 3 and it seems to provide isolation across
>>> >> classnames. I'll work towards submitting a formal patch soon - Is
>>> there any
>>> >> Jira already for the same that I can uptake? Also I need to
>>> understand:
>>> >> 1. How does Zeppelin uptake Spark fixes? OR do I need to first work
>>> >> towards getting Spark changes merged in Apache Spark github?
>>> >>
>>> >> Any suggestions on comments on the proposal are highly welcome.
>>> >>
>>> >> Regards,
>>> >> -Pranav.
>>> >>
>>> >>> On 10/08/15 11:36 pm, moon soo Lee wrote:
>>> >>>
>>> >>> Hi piyush,
>>> >>>
>>> >>> Separate instance of SparkILoop SparkIMain for each notebook while
>>> >>> sharing the SparkContext sounds great.
>>> >>>
>>> >>> Actually, i tried to do it, found problem that multiple SparkILoop
>>> could
>>> >>> generates the same class name, and spark executor confuses classname
>>> since
>>> >>> they're reading classes from single SparkContext.
>>> >>>
>>> >>> If someone can share about the idea of sharing single SparkContext
>>> >>> through multiple SparkILoop safely, it'll be really helpful.
>>> >>>
>>> >>> Thanks,
>>> >>> moon
>>> >>>
>>> >>>
>>> >>> On Mon, Aug 10, 2015 at 1:21 AM Piyush Mukati (Data Platform) <
>>> >>> piyush.mukati@flipkart.com <ma...@flipkart.com>>
>>> wrote:
>>> >>>
>>> >>>    Hi Moon,
>>> >>>    Any suggestion on it, have to wait lot when multiple people
>>> working
>>> >>> with spark.
>>> >>>    Can we create separate instance of   SparkILoop  SparkIMain and
>>> >>> printstreams  for each notebook while sharing the SparkContext
>>> >>> ZeppelinContext   SQLContext and DependencyResolver and then use
>>> parallel
>>> >>> scheduler ?
>>> >>>    thanks
>>> >>>
>>> >>>    -piyush
>>> >>>
>>> >>>    Hi Moon,
>>> >>>
>>> >>>    How about tracking dedicated SparkContext for a notebook in
>>> Spark's
>>> >>>    remote interpreter - this will allow multiple users to run their
>>> spark
>>> >>>    paragraphs in parallel. Also, within a notebook only one
>>> paragraph is
>>> >>>    executed at a time.
>>> >>>
>>> >>>    Regards,
>>> >>>    -Pranav.
>>> >>>
>>> >>>
>>> >>>>    On 15/07/15 7:15 pm, moon soo Lee wrote:
>>> >>>> Hi,
>>> >>>>
>>> >>>> Thanks for asking question.
>>> >>>>
>>> >>>> The reason is simply because of it is running code statements. The
>>> >>>> statements can have order and dependency. Imagine i have two
>>> >>> paragraphs
>>> >>>>
>>> >>>> %spark
>>> >>>> val a = 1
>>> >>>>
>>> >>>> %spark
>>> >>>> print(a)
>>> >>>>
>>> >>>> If they're not running one by one, that means they possibly runs in
>>> >>>> random order and the output will be always different. Either '1' or
>>> >>>> 'val a can not found'.
>>> >>>>
>>> >>>> This is the reason why. But if there are nice idea to handle this
>>> >>>> problem i agree using parallel scheduler would help a lot.
>>> >>>>
>>> >>>> Thanks,
>>> >>>> moon
>>> >>>> On Tue, Jul 14, 2015 at 7:59 PM linxi zeng
>>> >>>> <linxizeng0615@gmail.com  <ma...@gmail.com>
>>> >>> <mailto:linxizeng0615@gmail.com  <ma...@gmail.com>>>
>>> >>> wrote:
>>> >>>>
>>> >>>>    any one who have the same question with me? or this is not a
>>> >>> question?
>>> >>>>
>>> >>>>    2015-07-14 11:47 GMT+08:00 linxi zeng <linxizeng0615@gmail.com
>>> >>> <ma...@gmail.com>
>>> >>>>    <mailto:linxizeng0615@gmail.com  <mailto:
>>> >>> linxizeng0615@gmail.com>>>:
>>> >>>>
>>> >>>>        hi, Moon:
>>> >>>>           I notice that the getScheduler function in the
>>> >>>>        SparkInterpreter.java return a FIFOScheduler which makes the
>>> >>>>        spark interpreter run spark job one by one. It's not a good
>>> >>>>        experience when couple of users do some work on zeppelin at
>>> >>>>        the same time, because they have to wait for each other.
>>> >>>>        And at the same time, SparkSqlInterpreter can chose what
>>> >>>>        scheduler to use by "zeppelin.spark.concurrentSQL".
>>> >>>>        My question is, what kind of consideration do you based on
>>> >>> to
>>> >>>>        make such a decision?
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>
>>>
>>
>>

Re: why zeppelin SparkInterpreter use FIFOScheduler

Posted by IT CTO <go...@gmail.com>.
+1 for "to re-factor the Zeppelin architecture so that it can handle
multi-tenancy easily"

On Sun, Aug 16, 2015 at 9:47 AM DuyHai Doan <do...@gmail.com> wrote:

> Agree with Joel, we may think to re-factor the Zeppelin architecture so
> that it can handle multi-tenancy easily. The technical solution proposed by Pranav
> is great but it only applies to Spark. Right now, each interpreter has to
> manage multi-tenancy its own way. Ultimately Zeppelin can propose a
> multi-tenancy contract/info (like UserContext, similar to
> InterpreterContext) so that each interpreter can choose to use or not.
>
>
> On Sun, Aug 16, 2015 at 3:09 AM, Joel Zambrano <dj...@gmail.com> wrote:
>
>> I think while the idea of running multiple notes simultaneously is great.
>> It is really dancing around the lack of true multi user support in
>> Zeppelin. While the proposed solution would work if the applications
>> resources are those of the whole cluster, if the app is limited (say they
>> are 8 cores of 16, with some distribution in memory) then potentially your
>> note can hog all the resources and the scheduler will have to throttle all
>> other executions leaving you exactly where you are now.
>> While I think the solution is a good one, maybe this question makes us
>> think in adding true multiuser support.
>> Where we isolate resources (cluster and the notebooks themselves), have
>> separate login/identity and (I don't know if it's possible) share the same
>> context.
>>
>> Thanks,
>> Joel
>>
>> > On Aug 15, 2015, at 1:58 PM, Rohit Agarwal <mi...@gmail.com>
>> wrote:
>> >
>> > If the problem is that multiple users have to wait for each other while
>> > using Zeppelin, the solution already exists: they can create a new
>> > interpreter by going to the interpreter page and attach it to their
>> > notebook - then they don't have to wait for others to submit their job.
>> >
>> > But I agree, having paragraphs from one note wait for paragraphs from
>> other
>> > notes is a confusing default. We can get around that in two ways:
>> >
>> >   1. Create a new interpreter for each note and attach that interpreter
>> to
>> >   that note. This approach would require the least amount of code
>> changes but
>> >   is resource heavy and doesn't let you share Spark Context between
>> different
>> >   notes.
>> >   2. If we want to share the Spark Context between different notes, we
>> can
>> >   submit jobs from different notes into different fairscheduler pools (
>> >
>> https://spark.apache.org/docs/1.4.0/job-scheduling.html#scheduling-within-an-application
>> ).
>> >   This can be done by submitting jobs from different notes in different
>> >   threads. This will make sure that jobs from one note are run
>> sequentially
>> >   but jobs from different notes will be able to run in parallel.
>> >
>> > Neither of these options require any change in the Spark code.
>> >
>> > --
>> > Thanks & Regards
>> > Rohit Agarwal
>> > https://www.linkedin.com/in/rohitagarwal003
>> >
>> > On Sat, Aug 15, 2015 at 12:01 PM, Pranav Kumar Agarwal <
>> praagarw@gmail.com>
>> > wrote:
>> >
>> >> If someone can share about the idea of sharing single SparkContext
>> through
>> >>> multiple SparkILoop safely, it'll be really helpful.
>> >> Here is a proposal:
>> >> 1. In Spark code, change SparkIMain.scala to allow setting the virtual
>> >> directory. While creating new instances of SparkIMain per notebook from
>> >> zeppelin spark interpreter set all the instances of SparkIMain to the
>> same
>> >> virtual directory.
>> >> 2. Start HTTP server on that virtual directory and set this HTTP
>> server in
>> >> Spark Context using classserverUri method
>> >> 3. Scala generated code has a notion of packages. The default package
>> name
>> >> is "line$<linenumber>". Package name can be controlled using System
>> >> Property scala.repl.name.line. Setting this property to "notebook id"
>> >> ensures that code generated by individual instances of SparkIMain is
>> >> isolated from other instances of SparkIMain
>> >> 4. Build a queue inside interpreter to allow only one paragraph
>> execution
>> >> at a time per notebook.
>> >>
>> >> I have tested 1, 2, and 3 and it seems to provide isolation across
>> >> classnames. I'll work towards submitting a formal patch soon - Is
>> there any
>> >> Jira already for the same that I can uptake? Also I need to understand:
>> >> 1. How does Zeppelin uptake Spark fixes? OR do I need to first work
>> >> towards getting Spark changes merged in Apache Spark github?
>> >>
>> >> Any suggestions on comments on the proposal are highly welcome.
>> >>
>> >> Regards,
>> >> -Pranav.
>> >>
>> >>> On 10/08/15 11:36 pm, moon soo Lee wrote:
>> >>>
>> >>> Hi piyush,
>> >>>
>> >>> Separate instance of SparkILoop SparkIMain for each notebook while
>> >>> sharing the SparkContext sounds great.
>> >>>
>> >>> Actually, i tried to do it, found problem that multiple SparkILoop
>> could
>> >>> generates the same class name, and spark executor confuses classname
>> since
>> >>> they're reading classes from single SparkContext.
>> >>>
>> >>> If someone can share about the idea of sharing single SparkContext
>> >>> through multiple SparkILoop safely, it'll be really helpful.
>> >>>
>> >>> Thanks,
>> >>> moon
>> >>>
>> >>>
>> >>> On Mon, Aug 10, 2015 at 1:21 AM Piyush Mukati (Data Platform) <
>> >>> piyush.mukati@flipkart.com <ma...@flipkart.com>>
>> wrote:
>> >>>
>> >>>    Hi Moon,
>> >>>    Any suggestion on it, have to wait lot when multiple people
>> working
>> >>> with spark.
>> >>>    Can we create separate instance of   SparkILoop  SparkIMain and
>> >>> printstreams  for each notebook while sharing the SparkContext
>> >>> ZeppelinContext   SQLContext and DependencyResolver and then use
>> parallel
>> >>> scheduler ?
>> >>>    thanks
>> >>>
>> >>>    -piyush
>> >>>
>> >>>    Hi Moon,
>> >>>
>> >>>    How about tracking dedicated SparkContext for a notebook in Spark's
>> >>>    remote interpreter - this will allow multiple users to run their
>> spark
>> >>>    paragraphs in parallel. Also, within a notebook only one paragraph
>> is
>> >>>    executed at a time.
>> >>>
>> >>>    Regards,
>> >>>    -Pranav.
>> >>>
>> >>>
>> >>>>    On 15/07/15 7:15 pm, moon soo Lee wrote:
>> >>>> Hi,
>> >>>>
>> >>>> Thanks for asking question.
>> >>>>
>> >>>> The reason is simply because of it is running code statements. The
>> >>>> statements can have order and dependency. Imagine i have two
>> >>> paragraphs
>> >>>>
>> >>>> %spark
>> >>>> val a = 1
>> >>>>
>> >>>> %spark
>> >>>> print(a)
>> >>>>
>> >>>> If they're not running one by one, that means they possibly runs in
>> >>>> random order and the output will be always different. Either '1' or
>> >>>> 'val a can not found'.
>> >>>>
>> >>>> This is the reason why. But if there are nice idea to handle this
>> >>>> problem i agree using parallel scheduler would help a lot.
>> >>>>
>> >>>> Thanks,
>> >>>> moon
>> >>>> On Tue, Jul 14, 2015 at 7:59 PM linxi zeng
>> >>>> <linxizeng0615@gmail.com  <ma...@gmail.com>
>> >>> <mailto:linxizeng0615@gmail.com  <ma...@gmail.com>>>
>> >>> wrote:
>> >>>>
>> >>>>    any one who have the same question with me? or this is not a
>> >>> question?
>> >>>>
>> >>>>    2015-07-14 11:47 GMT+08:00 linxi zeng <linxizeng0615@gmail.com
>> >>> <ma...@gmail.com>
>> >>>>    <mailto:linxizeng0615@gmail.com  <mailto:
>> >>> linxizeng0615@gmail.com>>>:
>> >>>>
>> >>>>        hi, Moon:
>> >>>>           I notice that the getScheduler function in the
>> >>>>        SparkInterpreter.java return a FIFOScheduler which makes the
>> >>>>        spark interpreter run spark job one by one. It's not a good
>> >>>>        experience when couple of users do some work on zeppelin at
>> >>>>        the same time, because they have to wait for each other.
>> >>>>        And at the same time, SparkSqlInterpreter can chose what
>> >>>>        scheduler to use by "zeppelin.spark.concurrentSQL".
>> >>>>        My question is, what kind of consideration do you based on
>> >>> to
>> >>>>        make such a decision?
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>
>>
>
>

>> >>>
>> >>> Actually, i tried to do it, found problem that multiple SparkILoop
>> could
>> >>> generates the same class name, and spark executor confuses classname
>> since
>> >>> they're reading classes from single SparkContext.
>> >>>
>> >>> If someone can share about the idea of sharing single SparkContext
>> >>> through multiple SparkILoop safely, it'll be really helpful.
>> >>>
>> >>> Thanks,
>> >>> moon
>> >>>
>> >>>
>> >>> On Mon, Aug 10, 2015 at 1:21 AM Piyush Mukati (Data Platform) <
>> >>> piyush.mukati@flipkart.com <ma...@flipkart.com>>
>> wrote:
>> >>>
>> >>>    Hi Moon,
>> >>>    Any suggestion on it, have to wait lot when multiple people
>> working
>> >>> with spark.
>> >>>    Can we create separate instance of   SparkILoop  SparkIMain and
>> >>> printstrems  for each notebook while sharing theSparkContext
>> >>> ZeppelinContext   SQLContext and DependencyResolver and then use
>> parallel
>> >>> scheduler ?
>> >>>    thanks
>> >>>
>> >>>    -piyush
>> >>>
>> >>>    Hi Moon,
>> >>>
>> >>>    How about tracking dedicated SparkContext for a notebook in Spark's
>> >>>    remote interpreter - this will allow multiple users to run their
>> spark
>> >>>    paragraphs in parallel. Also, within a notebook only one paragraph
>> is
>> >>>    executed at a time.
>> >>>
>> >>>    Regards,
>> >>>    -Pranav.
>> >>>
>> >>>
>> >>>>    On 15/07/15 7:15 pm, moon soo Lee wrote:
>> >>>> Hi,
>> >>>>
>> >>>> Thanks for asking question.
>> >>>>
>> >>>> The reason is simply because of it is running code statements. The
>> >>>> statements can have order and dependency. Imagine i have two
>> >>> paragraphs
>> >>>>
>> >>>> %spark
>> >>>> val a = 1
>> >>>>
>> >>>> %spark
>> >>>> print(a)
>> >>>>
>> >>>> If they're not running one by one, that means they possibly runs in
>> >>>> random order and the output will be always different. Either '1' or
>> >>>> 'val a can not found'.
>> >>>>
>> >>>> This is the reason why. But if there are nice idea to handle this
>> >>>> problem i agree using parallel scheduler would help a lot.
>> >>>>
>> >>>> Thanks,
>> >>>> moon
>> >>>> On 2015년 7월 14일 (화) at 오후 7:59 linxi zeng
>> >>>> <linxizeng0615@gmail.com  <ma...@gmail.com>
>> >>> <mailto:linxizeng0615@gmail.com  <ma...@gmail.com>>>
>> >>> wrote:
>> >>>>
>> >>>>    any one who have the same question with me? or this is not a
>> >>> question?
>> >>>>
>> >>>>    2015-07-14 11:47 GMT+08:00 linxi zeng <linxizeng0615@gmail.com
>> >>> <ma...@gmail.com>
>> >>>>    <mailto:linxizeng0615@gmail.com  <mailto:
>> >>> linxizeng0615@gmail.com>>>:
>> >>>>
>> >>>>        hi, Moon:
>> >>>>           I notice that the getScheduler function in the
>> >>>>        SparkInterpreter.java return a FIFOScheduler which makes the
>> >>>>        spark interpreter run spark job one by one. It's not a good
>> >>>>        experience when couple of users do some work on zeppelin at
>> >>>>        the same time, because they have to wait for each other.
>> >>>>        And at the same time, SparkSqlInterpreter can chose what
>> >>>>        scheduler to use by "zeppelin.spark.concurrentSQL".
>> >>>>        My question is, what kind of consideration do you based on
>> >>> to
>> >>>>        make such a decision?

Re: why zeppelin SparkInterpreter use FIFOScheduler

Posted by DuyHai Doan <do...@gmail.com>.
Agree with Joel, we might think about re-factoring the Zeppelin architecture
so that it can handle multi-tenancy easily. The technical solution proposed
by Pranav is great, but it only applies to Spark. Right now, each interpreter
has to manage multi-tenancy in its own way. Ultimately, Zeppelin could
propose a multi-tenancy contract (e.g. a UserContext, similar to
InterpreterContext) so that each interpreter can choose whether to use it.
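
For illustration only -- nothing like this exists in Zeppelin today, and
every name below is invented -- such a contract could be as small as:

    // Hypothetical sketch of a multi-tenancy contract, not a real API.
    // A per-user context handed to interpreters alongside
    // InterpreterContext; each interpreter may use it or ignore it.
    case class UserContext(
      userId: String,               // authenticated Zeppelin user
      noteId: String,               // note the paragraph belongs to
      resourcePool: Option[String]  // e.g. a fair-scheduler pool name
    )

    trait MultiTenantInterpreter {
      // Interpreters that opt in receive the UserContext on each run.
      def interpret(code: String, context: UserContext): Unit
    }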


On Sun, Aug 16, 2015 at 3:09 AM, Joel Zambrano <dj...@gmail.com> wrote:

> I think while the idea of running multiple notes simultaneously is great.
> It is really dancing around the lack of true multi user support in
> Zeppelin. While the proposed solution would work if the applications
> resources are those of the whole cluster, if the app is limited (say they
> are 8 cores of 16, with some distribution in memory) then potentially your
> note can hog all the resources and the scheduler will have to throttle all
> other executions leaving you exactly where you are now.
> While I think the solution is a good one, maybe this question makes us
> think in adding true multiuser support.
> Where we isolate resources (cluster and the notebooks themselves), have
> separate login/identity and (I don't know if it's possible) share the same
> context.
>
> Thanks,
> Joel
>
> > On Aug 15, 2015, at 1:58 PM, Rohit Agarwal <mi...@gmail.com> wrote:
> >
> > If the problem is that multiple users have to wait for each other while
> > using Zeppelin, the solution already exists: they can create a new
> > interpreter by going to the interpreter page and attach it to their
> > notebook - then they don't have to wait for others to submit their job.
> >
> > But I agree, having paragraphs from one note wait for paragraphs from
> other
> > notes is a confusing default. We can get around that in two ways:
> >
> >   1. Create a new interpreter for each note and attach that interpreter
> to
> >   that note. This approach would require the least amount of code
> changes but
> >   is resource heavy and doesn't let you share Spark Context between
> different
> >   notes.
> >   2. If we want to share the Spark Context between different notes, we
> can
> >   submit jobs from different notes into different fairscheduler pools (
> >
> https://spark.apache.org/docs/1.4.0/job-scheduling.html#scheduling-within-an-application
> ).
> >   This can be done by submitting jobs from different notes in different
> >   threads. This will make sure that jobs from one note are run
> sequentially
> >   but jobs from different notes will be able to run in parallel.
> >
> > Neither of these options require any change in the Spark code.
> >
> > --
> > Thanks & Regards
> > Rohit Agarwal
> > https://www.linkedin.com/in/rohitagarwal003
> >
> > On Sat, Aug 15, 2015 at 12:01 PM, Pranav Kumar Agarwal <
> praagarw@gmail.com>
> > wrote:
> >
> >> If someone can share about the idea of sharing single SparkContext
> through
> >>> multiple SparkILoop safely, it'll be really helpful.
> >> Here is a proposal:
> >> 1. In Spark code, change SparkIMain.scala to allow setting the virtual
> >> directory. While creating new instances of SparkIMain per notebook from
> >> zeppelin spark interpreter set all the instances of SparkIMain to the
> same
> >> virtual directory.
> >> 2. Start HTTP server on that virtual directory and set this HTTP server
> in
> >> Spark Context using classserverUri method
> >> 3. Scala generated code has a notion of packages. The default package
> name
> >> is "line$<linenumber>". Package name can be controlled using System
> >> Property scala.repl.name.line. Setting this property to "notebook id"
> >> ensures that code generated by individual instances of SparkIMain is
> >> isolated from other instances of SparkIMain
> >> 4. Build a queue inside interpreter to allow only one paragraph
> execution
> >> at a time per notebook.
> >>
> >> I have tested 1, 2, and 3 and it seems to provide isolation across
> >> classnames. I'll work towards submitting a formal patch soon - Is there
> any
> >> Jira already for the same that I can uptake? Also I need to understand:
> >> 1. How does Zeppelin uptake Spark fixes? OR do I need to first work
> >> towards getting Spark changes merged in Apache Spark github?
> >>
> >> Any suggestions on comments on the proposal are highly welcome.
> >>
> >> Regards,
> >> -Pranav.
> >>
> >>> On 10/08/15 11:36 pm, moon soo Lee wrote:
> >>>
> >>> Hi piyush,
> >>>
> >>> Separate instance of SparkILoop SparkIMain for each notebook while
> >>> sharing the SparkContext sounds great.
> >>>
> >>> Actually, i tried to do it, found problem that multiple SparkILoop
> could
> >>> generates the same class name, and spark executor confuses classname
> since
> >>> they're reading classes from single SparkContext.
> >>>
> >>> If someone can share about the idea of sharing single SparkContext
> >>> through multiple SparkILoop safely, it'll be really helpful.
> >>>
> >>> Thanks,
> >>> moon
> >>>
> >>>
> >>> On Mon, Aug 10, 2015 at 1:21 AM Piyush Mukati (Data Platform) <
> >>> piyush.mukati@flipkart.com <ma...@flipkart.com>> wrote:
> >>>
> >>>    Hi Moon,
> >>>    Any suggestion on it, have to wait lot when multiple people  working
> >>> with spark.
> >>>    Can we create separate instance of   SparkILoop  SparkIMain and
> >>> printstrems  for each notebook while sharing theSparkContext
> >>> ZeppelinContext   SQLContext and DependencyResolver and then use
> parallel
> >>> scheduler ?
> >>>    thanks
> >>>
> >>>    -piyush
> >>>
> >>>    Hi Moon,
> >>>
> >>>    How about tracking dedicated SparkContext for a notebook in Spark's
> >>>    remote interpreter - this will allow multiple users to run their
> spark
> >>>    paragraphs in parallel. Also, within a notebook only one paragraph
> is
> >>>    executed at a time.
> >>>
> >>>    Regards,
> >>>    -Pranav.
> >>>
> >>>
> >>>>    On 15/07/15 7:15 pm, moon soo Lee wrote:
> >>>> Hi,
> >>>>
> >>>> Thanks for asking question.
> >>>>
> >>>> The reason is simply because of it is running code statements. The
> >>>> statements can have order and dependency. Imagine i have two
> >>> paragraphs
> >>>>
> >>>> %spark
> >>>> val a = 1
> >>>>
> >>>> %spark
> >>>> print(a)
> >>>>
> >>>> If they're not running one by one, that means they possibly runs in
> >>>> random order and the output will be always different. Either '1' or
> >>>> 'val a can not found'.
> >>>>
> >>>> This is the reason why. But if there are nice idea to handle this
> >>>> problem i agree using parallel scheduler would help a lot.
> >>>>
> >>>> Thanks,
> >>>> moon
> >>>> On 2015년 7월 14일 (화) at 오후 7:59 linxi zeng
> >>>> <linxizeng0615@gmail.com  <ma...@gmail.com>
> >>> <mailto:linxizeng0615@gmail.com  <ma...@gmail.com>>>
> >>> wrote:
> >>>>
> >>>>    any one who have the same question with me? or this is not a
> >>> question?
> >>>>
> >>>>    2015-07-14 11:47 GMT+08:00 linxi zeng <linxizeng0615@gmail.com
> >>> <ma...@gmail.com>
> >>>>    <mailto:linxizeng0615@gmail.com  <mailto:
> >>> linxizeng0615@gmail.com>>>:
> >>>>
> >>>>        hi, Moon:
> >>>>           I notice that the getScheduler function in the
> >>>>        SparkInterpreter.java return a FIFOScheduler which makes the
> >>>>        spark interpreter run spark job one by one. It's not a good
> >>>>        experience when couple of users do some work on zeppelin at
> >>>>        the same time, because they have to wait for each other.
> >>>>        And at the same time, SparkSqlInterpreter can chose what
> >>>>        scheduler to use by "zeppelin.spark.concurrentSQL".
> >>>>        My question is, what kind of consideration do you based on
> >>> to
> >>>>        make such a decision?

Re: why zeppelin SparkInterpreter use FIFOScheduler

Posted by Joel Zambrano <dj...@gmail.com>.
I think that while the idea of running multiple notes simultaneously is great, it is really dancing around the lack of true multi-user support in Zeppelin. The proposed solution would work if the application's resources are those of the whole cluster; but if the app is limited (say it has 8 of 16 cores, with a similar split of memory), then one note can potentially hog all the resources and the scheduler will have to throttle all other executions, leaving you exactly where you are now.
While I think the solution is a good one, maybe this question should make us think about adding true multi-user support,
where we isolate resources (the cluster and the notebooks themselves), have separate login/identity and (I don't know if it's possible) share the same context.

Thanks,
Joel

> On Aug 15, 2015, at 1:58 PM, Rohit Agarwal <mi...@gmail.com> wrote:
> 
> If the problem is that multiple users have to wait for each other while
> using Zeppelin, the solution already exists: they can create a new
> interpreter by going to the interpreter page and attach it to their
> notebook - then they don't have to wait for others to submit their job.
> 
> But I agree, having paragraphs from one note wait for paragraphs from other
> notes is a confusing default. We can get around that in two ways:
> 
>   1. Create a new interpreter for each note and attach that interpreter to
>   that note. This approach would require the least amount of code changes but
>   is resource heavy and doesn't let you share Spark Context between different
>   notes.
>   2. If we want to share the Spark Context between different notes, we can
>   submit jobs from different notes into different fairscheduler pools (
>   https://spark.apache.org/docs/1.4.0/job-scheduling.html#scheduling-within-an-application).
>   This can be done by submitting jobs from different notes in different
>   threads. This will make sure that jobs from one note are run sequentially
>   but jobs from different notes will be able to run in parallel.
> 
> Neither of these options require any change in the Spark code.
> 
> --
> Thanks & Regards
> Rohit Agarwal
> https://www.linkedin.com/in/rohitagarwal003
> 
> On Sat, Aug 15, 2015 at 12:01 PM, Pranav Kumar Agarwal <pr...@gmail.com>
> wrote:
> 
>> If someone can share about the idea of sharing single SparkContext through
>>> multiple SparkILoop safely, it'll be really helpful.
>> Here is a proposal:
>> 1. In Spark code, change SparkIMain.scala to allow setting the virtual
>> directory. While creating new instances of SparkIMain per notebook from
>> zeppelin spark interpreter set all the instances of SparkIMain to the same
>> virtual directory.
>> 2. Start HTTP server on that virtual directory and set this HTTP server in
>> Spark Context using classserverUri method
>> 3. Scala generated code has a notion of packages. The default package name
>> is "line$<linenumber>". Package name can be controlled using System
>> Property scala.repl.name.line. Setting this property to "notebook id"
>> ensures that code generated by individual instances of SparkIMain is
>> isolated from other instances of SparkIMain
>> 4. Build a queue inside interpreter to allow only one paragraph execution
>> at a time per notebook.
>> 
>> I have tested 1, 2, and 3 and it seems to provide isolation across
>> classnames. I'll work towards submitting a formal patch soon - Is there any
>> Jira already for the same that I can uptake? Also I need to understand:
>> 1. How does Zeppelin uptake Spark fixes? OR do I need to first work
>> towards getting Spark changes merged in Apache Spark github?
>> 
>> Any suggestions on comments on the proposal are highly welcome.
>> 
>> Regards,
>> -Pranav.
>> 
>>> On 10/08/15 11:36 pm, moon soo Lee wrote:
>>> 
>>> Hi piyush,
>>> 
>>> Separate instance of SparkILoop SparkIMain for each notebook while
>>> sharing the SparkContext sounds great.
>>> 
>>> Actually, i tried to do it, found problem that multiple SparkILoop could
>>> generates the same class name, and spark executor confuses classname since
>>> they're reading classes from single SparkContext.
>>> 
>>> If someone can share about the idea of sharing single SparkContext
>>> through multiple SparkILoop safely, it'll be really helpful.
>>> 
>>> Thanks,
>>> moon
>>> 
>>> 
>>> On Mon, Aug 10, 2015 at 1:21 AM Piyush Mukati (Data Platform) <
>>> piyush.mukati@flipkart.com <ma...@flipkart.com>> wrote:
>>> 
>>>    Hi Moon,
>>>    Any suggestion on it, have to wait lot when multiple people  working
>>> with spark.
>>>    Can we create separate instance of   SparkILoop  SparkIMain and
>>> printstrems  for each notebook while sharing theSparkContext
>>> ZeppelinContext   SQLContext and DependencyResolver and then use parallel
>>> scheduler ?
>>>    thanks
>>> 
>>>    -piyush
>>> 
>>>    Hi Moon,
>>> 
>>>    How about tracking dedicated SparkContext for a notebook in Spark's
>>>    remote interpreter - this will allow multiple users to run their spark
>>>    paragraphs in parallel. Also, within a notebook only one paragraph is
>>>    executed at a time.
>>> 
>>>    Regards,
>>>    -Pranav.
>>> 
>>> 
>>>>    On 15/07/15 7:15 pm, moon soo Lee wrote:
>>>> Hi,
>>>> 
>>>> Thanks for asking question.
>>>> 
>>>> The reason is simply because of it is running code statements. The
>>>> statements can have order and dependency. Imagine i have two
>>> paragraphs
>>>> 
>>>> %spark
>>>> val a = 1
>>>> 
>>>> %spark
>>>> print(a)
>>>> 
>>>> If they're not running one by one, that means they possibly runs in
>>>> random order and the output will be always different. Either '1' or
>>>> 'val a can not found'.
>>>> 
>>>> This is the reason why. But if there are nice idea to handle this
>>>> problem i agree using parallel scheduler would help a lot.
>>>> 
>>>> Thanks,
>>>> moon
>>>> On 2015년 7월 14일 (화) at 오후 7:59 linxi zeng
>>>> <linxizeng0615@gmail.com  <ma...@gmail.com>
>>> <mailto:linxizeng0615@gmail.com  <ma...@gmail.com>>>
>>> wrote:
>>>> 
>>>>    any one who have the same question with me? or this is not a
>>> question?
>>>> 
>>>>    2015-07-14 11:47 GMT+08:00 linxi zeng <linxizeng0615@gmail.com
>>> <ma...@gmail.com>
>>>>    <mailto:linxizeng0615@gmail.com  <mailto:
>>> linxizeng0615@gmail.com>>>:
>>>> 
>>>>        hi, Moon:
>>>>           I notice that the getScheduler function in the
>>>>        SparkInterpreter.java return a FIFOScheduler which makes the
>>>>        spark interpreter run spark job one by one. It's not a good
>>>>        experience when couple of users do some work on zeppelin at
>>>>        the same time, because they have to wait for each other.
>>>>        And at the same time, SparkSqlInterpreter can chose what
>>>>        scheduler to use by "zeppelin.spark.concurrentSQL".
>>>>        My question is, what kind of consideration do you based on
>>> to
>>>>        make such a decision?

Re: why zeppelin SparkInterpreter use FIFOScheduler

Posted by Rohit Agarwal <mi...@gmail.com>.
If the problem is that multiple users have to wait for each other while
using Zeppelin, the solution already exists: they can create a new
interpreter by going to the interpreter page and attaching it to their
notebook - then they don't have to wait for others to submit their jobs.

But I agree, having paragraphs from one note wait for paragraphs from other
notes is a confusing default. We can get around that in two ways:

   1. Create a new interpreter for each note and attach that interpreter to
   that note. This approach would require the least amount of code changes,
   but it is resource-heavy and doesn't let you share the Spark Context
   between different notes.
   2. If we want to share the Spark Context between different notes, we can
   submit jobs from different notes into different fair scheduler pools (
   https://spark.apache.org/docs/1.4.0/job-scheduling.html#scheduling-within-an-application).
   This can be done by submitting jobs from different notes in different
   threads. This will make sure that jobs from one note run sequentially
   while jobs from different notes are able to run in parallel.

Neither of these options requires any change in the Spark code.
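
To make option 2 concrete, here is a minimal sketch. The fair-scheduler
calls are the documented Spark API from the link above; the note ids and
the runInPool helper are purely illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    // One shared context, with fair scheduling instead of the default FIFO.
    val conf = new SparkConf()
      .setAppName("zeppelin-shared-context")
      .set("spark.scheduler.mode", "FAIR")
    val sc = new SparkContext(conf)

    // Run a job on its own thread, charged to a pool named after the note.
    // setLocalProperty is per-thread, so it must be called on the thread
    // that triggers the Spark actions.
    def runInPool(noteId: String)(job: => Unit): Thread = {
      val t = new Thread(new Runnable {
        override def run(): Unit = {
          sc.setLocalProperty("spark.scheduler.pool", noteId)
          job
        }
      })
      t.start()
      t
    }

    // Jobs from different notes share the context but run in parallel;
    // jobs submitted from the same note's thread remain sequential.
    runInPool("note-A") { sc.parallelize(1 to 1000000).count() }
    runInPool("note-B") { sc.parallelize(1 to 1000000).count() }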

--
Thanks & Regards
Rohit Agarwal
https://www.linkedin.com/in/rohitagarwal003

On Sat, Aug 15, 2015 at 12:01 PM, Pranav Kumar Agarwal <pr...@gmail.com>
wrote:

> If someone can share about the idea of sharing single SparkContext through
>> multiple SparkILoop safely, it'll be really helpful.
>>
> Here is a proposal:
> 1. In Spark code, change SparkIMain.scala to allow setting the virtual
> directory. While creating new instances of SparkIMain per notebook from
> zeppelin spark interpreter set all the instances of SparkIMain to the same
> virtual directory.
> 2. Start HTTP server on that virtual directory and set this HTTP server in
> Spark Context using classserverUri method
> 3. Scala generated code has a notion of packages. The default package name
> is "line$<linenumber>". Package name can be controlled using System
> Property scala.repl.name.line. Setting this property to "notebook id"
> ensures that code generated by individual instances of SparkIMain is
> isolated from other instances of SparkIMain
> 4. Build a queue inside interpreter to allow only one paragraph execution
> at a time per notebook.
>
> I have tested 1, 2, and 3 and it seems to provide isolation across
> classnames. I'll work towards submitting a formal patch soon - Is there any
> Jira already for the same that I can uptake? Also I need to understand:
> 1. How does Zeppelin uptake Spark fixes? OR do I need to first work
> towards getting Spark changes merged in Apache Spark github?
>
> Any suggestions on comments on the proposal are highly welcome.
>
> Regards,
> -Pranav.
>
> On 10/08/15 11:36 pm, moon soo Lee wrote:
>
>> Hi piyush,
>>
>> Separate instance of SparkILoop SparkIMain for each notebook while
>> sharing the SparkContext sounds great.
>>
>> Actually, i tried to do it, found problem that multiple SparkILoop could
>> generates the same class name, and spark executor confuses classname since
>> they're reading classes from single SparkContext.
>>
>> If someone can share about the idea of sharing single SparkContext
>> through multiple SparkILoop safely, it'll be really helpful.
>>
>> Thanks,
>> moon
>>
>>
>> On Mon, Aug 10, 2015 at 1:21 AM Piyush Mukati (Data Platform) <
>> piyush.mukati@flipkart.com <ma...@flipkart.com>> wrote:
>>
>>     Hi Moon,
>>     Any suggestion on it, have to wait lot when multiple people  working
>> with spark.
>>     Can we create separate instance of   SparkILoop  SparkIMain and
>> printstrems  for each notebook while sharing theSparkContext
>> ZeppelinContext   SQLContext and DependencyResolver and then use parallel
>> scheduler ?
>>     thanks
>>
>>     -piyush
>>
>>     Hi Moon,
>>
>>     How about tracking dedicated SparkContext for a notebook in Spark's
>>     remote interpreter - this will allow multiple users to run their spark
>>     paragraphs in parallel. Also, within a notebook only one paragraph is
>>     executed at a time.
>>
>>     Regards,
>>     -Pranav.
>>
>>
>>     On 15/07/15 7:15 pm, moon soo Lee wrote:
>>     > Hi,
>>     >
>>     > Thanks for asking question.
>>     >
>>     > The reason is simply because of it is running code statements. The
>>     > statements can have order and dependency. Imagine i have two
>> paragraphs
>>     >
>>     > %spark
>>     > val a = 1
>>     >
>>     > %spark
>>     > print(a)
>>     >
>>     > If they're not running one by one, that means they possibly runs in
>>     > random order and the output will be always different. Either '1' or
>>     > 'val a can not found'.
>>     >
>>     > This is the reason why. But if there are nice idea to handle this
>>     > problem i agree using parallel scheduler would help a lot.
>>     >
>>     > Thanks,
>>     > moon
>>     > On 2015년 7월 14일 (화) at 오후 7:59 linxi zeng
>>     > <linxizeng0615@gmail.com  <ma...@gmail.com>
>> <mailto:linxizeng0615@gmail.com  <ma...@gmail.com>>>
>> wrote:
>>     >
>>     >     any one who have the same question with me? or this is not a
>> question?
>>     >
>>     >     2015-07-14 11:47 GMT+08:00 linxi zeng <linxizeng0615@gmail.com
>> <ma...@gmail.com>
>>     >     <mailto:linxizeng0615@gmail.com  <mailto:
>> linxizeng0615@gmail.com>>>:
>>     >
>>     >         hi, Moon:
>>     >            I notice that the getScheduler function in the
>>     >         SparkInterpreter.java return a FIFOScheduler which makes the
>>     >         spark interpreter run spark job one by one. It's not a good
>>     >         experience when couple of users do some work on zeppelin at
>>     >         the same time, because they have to wait for each other.
>>     >         And at the same time, SparkSqlInterpreter can chose what
>>     >         scheduler to use by "zeppelin.spark.concurrentSQL".
>>     >         My question is, what kind of consideration do you based on
>> to
>>     >         make such a decision?

Re: why zeppelin SparkInterpreter use FIFOScheduler

Posted by Pranav Kumar Agarwal <pr...@gmail.com>.
> If someone can share about the idea of sharing single SparkContext 
> through multiple SparkILoop safely, it'll be really helpful.
Here is a proposal (rough sketches follow the list):
1. In Spark, change SparkIMain.scala to allow setting the virtual 
directory. While creating the per-notebook SparkIMain instances from the 
Zeppelin Spark interpreter, point all of them at the same virtual 
directory.
2. Start an HTTP server over that virtual directory and register it with 
the SparkContext using the classServerUri method.
3. Scala-generated code has a notion of packages. The default package 
name is "$line<linenumber>". The package name can be controlled through 
the system property scala.repl.name.line. Setting this property to the 
notebook id ensures that code generated by one SparkIMain instance is 
isolated from the code generated by the others.
4. Build a queue inside the interpreter so that only one paragraph 
executes at a time per notebook.
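
To make this concrete, here is a rough, untested sketch of steps 1-3. The 
setVirtualDirectory(...) call is the hypothetical hook that step 1 would 
add to SparkIMain - it does not exist in stock Spark 1.x - while 
VirtualDirectory, scala.repl.name.line and spark.repl.class.uri do exist 
today:

import java.io.PrintWriter
import scala.reflect.io.VirtualDirectory
import scala.tools.nsc.Settings
import org.apache.spark.repl.SparkIMain

object NotebookInterpreters {
  // Step 1: one in-memory directory shared by every notebook's SparkIMain.
  val sharedDir = new VirtualDirectory("(memory)", None)

  def newInterpreterFor(notebookId: String): SparkIMain = synchronized {
    // Step 3: prefix the generated wrapper packages with the notebook id
    // so one notebook's "$line3..." cannot collide with another's. The
    // property is JVM-global and read lazily by each interpreter, so this
    // is fragile under concurrent creation; patching SparkIMain to take
    // the prefix directly would be cleaner.
    System.setProperty("scala.repl.name.line", "$" + notebookId)

    val settings = new Settings
    settings.usejavacp.value = true
    val imain = new SparkIMain(settings, new PrintWriter(Console.out, true))
    imain.setVirtualDirectory(sharedDir)  // hypothetical setter (step 1)
    imain
  }
}

// Step 2: serve sharedDir over a single HTTP server and point the
// executors at it, e.g. conf.set("spark.repl.class.uri", serverUri).

For step 4, a minimal per-notebook queue could be a map from notebook id 
to a single-threaded executor - FIFO within a notebook, parallel across 
notebooks (again a sketch, not Zeppelin's actual scheduler code):

import java.util.concurrent.{ConcurrentHashMap, ExecutorService, Executors}

object PerNotebookQueue {
  private val queues = new ConcurrentHashMap[String, ExecutorService]()

  def submit(notebookId: String, paragraph: Runnable): Unit = {
    var q = queues.get(notebookId)
    if (q == null) {
      // Benign race: a losing executor is simply never used.
      queues.putIfAbsent(notebookId, Executors.newSingleThreadExecutor())
      q = queues.get(notebookId)
    }
    q.submit(paragraph)  // one paragraph at a time per notebook
  }
}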

I have tested 1, 2, and 3, and this seems to provide isolation across 
class names. I'll work towards submitting a formal patch soon - is there 
already a JIRA for this that I can pick up? I also need to understand:
1. How does Zeppelin pick up Spark fixes? Or do I need to first get the 
Spark changes merged into Apache Spark on GitHub?

Any suggestions or comments on the proposal are highly welcome.

Regards,
-Pranav.

On 10/08/15 11:36 pm, moon soo Lee wrote:
> Hi piyush,
>
> Separate instances of SparkILoop/SparkIMain for each notebook while 
> sharing the SparkContext sounds great.
>
> Actually, I tried to do it and found a problem: multiple SparkILoop 
> instances can generate the same class name, and the Spark executors 
> confuse the class names since they're reading classes from a single 
> SparkContext.
>
> If someone can share an idea for sharing a single SparkContext across 
> multiple SparkILoop instances safely, it'll be really helpful.
>
> Thanks,
> moon
>
>
> On Mon, Aug 10, 2015 at 1:21 AM Piyush Mukati (Data Platform) 
> <piyush.mukati@flipkart.com <ma...@flipkart.com>> wrote:
>
>     Hi Moon,
>     Any suggestion on it? We have to wait a lot when multiple people are working with Spark.
>     Can we create separate instances of SparkILoop, SparkIMain and printstreams for each notebook while sharing the SparkContext, ZeppelinContext, SQLContext and DependencyResolver, and then use the parallel scheduler?
>     thanks
>
>     -piyush
>
>     Hi Moon,
>
>     How about tracking dedicated SparkContext for a notebook in Spark's
>     remote interpreter - this will allow multiple users to run their spark
>     paragraphs in parallel. Also, within a notebook only one paragraph is
>     executed at a time.
>
>     Regards,
>     -Pranav.
>
>
>     On 15/07/15 7:15 pm, moon soo Lee wrote:
>     > Hi,
>     >
>     > Thanks for asking question.
>     >
>     > The reason is simply because of it is running code statements. The
>     > statements can have order and dependency. Imagine i have two paragraphs
>     >
>     > %spark
>     > val a = 1
>     >
>     > %spark
>     > print(a)
>     >
>     > If they're not running one by one, that means they possibly runs in
>     > random order and the output will be always different. Either '1' or
>     > 'val a can not found'.
>     >
>     > This is the reason why. But if there are nice idea to handle this
>     > problem i agree using parallel scheduler would help a lot.
>     >
>     > Thanks,
>     > moon
>     > On 2015년 7월 14일 (화) at 오후 7:59 linxi zeng
>     > <linxizeng0615@gmail.com  <ma...@gmail.com>  <mailto:linxizeng0615@gmail.com  <ma...@gmail.com>>> wrote:
>     >
>     >     any one who have the same question with me? or this is not a question?
>     >
>     >     2015-07-14 11:47 GMT+08:00 linxi zeng <linxizeng0615@gmail.com  <ma...@gmail.com>
>     >     <mailto:linxizeng0615@gmail.com  <ma...@gmail.com>>>:
>     >
>     >         hi, Moon:
>     >            I notice that the getScheduler function in the
>     >         SparkInterpreter.java return a FIFOScheduler which makes the
>     >         spark interpreter run spark job one by one. It's not a good
>     >         experience when couple of users do some work on zeppelin at
>     >         the same time, because they have to wait for each other.
>     >         And at the same time, SparkSqlInterpreter can chose what
>     >         scheduler to use by "zeppelin.spark.concurrentSQL".
>     >         My question is, what kind of consideration do you based on to
>     >         make such a decision?
>     >
>     >
>
>
>
>


Re: why zeppelin SparkInterpreter use FIFOScheduler

Posted by moon soo Lee <mo...@apache.org>.
Hi piyush,

Separate instances of SparkILoop/SparkIMain for each notebook while sharing
the SparkContext sounds great.

Actually, I tried to do it and found a problem: multiple SparkILoop
instances can generate the same class name, and the Spark executors confuse
the class names since they're reading classes from a single SparkContext.
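
For illustration, this is roughly the wrapper the Scala 2.10 REPL emits for
a statement like val a = 1 (names approximate, counter value arbitrary).
Two independent SparkILoop sessions each keep their own line counter
starting from the same value, so both eventually emit identical
fully-qualified names:

package $line3 {
  object $read {
    object $iw {
      object $iw {
        val a = 1
      }
    }
  }
}

// An executor asking the shared class server for "$line3.$read$$iw$$iw"
// can therefore receive bytecode compiled by either notebook.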

If someone can share an idea for sharing a single SparkContext across
multiple SparkILoop instances safely, it'll be really helpful.

Thanks,
moon


On Mon, Aug 10, 2015 at 1:21 AM Piyush Mukati (Data Platform) <
piyush.mukati@flipkart.com> wrote:

> Hi Moon,
> Any suggestion on it? We have to wait a lot when multiple people are working with Spark.
> Can we create separate instances of SparkILoop, SparkIMain and printstreams for each notebook while sharing the SparkContext, ZeppelinContext, SQLContext and DependencyResolver, and then use the parallel scheduler?
> thanks
>
> -piyush
>
>
> Hi Moon,
>
> How about tracking dedicated SparkContext for a notebook in Spark's
> remote interpreter - this will allow multiple users to run their spark
> paragraphs in parallel. Also, within a notebook only one paragraph is
> executed at a time.
>
> Regards,
> -Pranav.
>
>
> On 15/07/15 7:15 pm, moon soo Lee wrote:
> > Hi,
> >
> > Thanks for asking question.
> >
> > The reason is simply because of it is running code statements. The
> > statements can have order and dependency. Imagine i have two paragraphs
> >
> > %spark
> > val a = 1
> >
> > %spark
> > print(a)
> >
> > If they're not running one by one, that means they possibly runs in
> > random order and the output will be always different. Either '1' or
> > 'val a can not found'.
> >
> > This is the reason why. But if there are nice idea to handle this
> > problem i agree using parallel scheduler would help a lot.
> >
> > Thanks,
> > moon
> > On 2015년 7월 14일 (화) at 오후 7:59 linxi zeng
> > <linxizeng0615@gmail.com <ma...@gmail.com>> wrote:
> >
> >     any one who have the same question with me? or this is not a question?
> >
> >     2015-07-14 11:47 GMT+08:00 linxi zeng <linxizeng0615@gmail.com
> >     <ma...@gmail.com>>:
> >
> >         hi, Moon:
> >            I notice that the getScheduler function in the
> >         SparkInterpreter.java return a FIFOScheduler which makes the
> >         spark interpreter run spark job one by one. It's not a good
> >         experience when couple of users do some work on zeppelin at
> >         the same time, because they have to wait for each other.
> >         And at the same time, SparkSqlInterpreter can chose what
> >         scheduler to use by "zeppelin.spark.concurrentSQL".
> >         My question is, what kind of consideration do you based on to
> >         make such a decision?
> >
> >
>
>
>
>