Posted to users@zeppelin.apache.org by Dimp Bhat <di...@gmail.com> on 2016/01/13 19:17:24 UTC

Re: why zeppelin SparkInterpreter use FIFOScheduler

Hi Pranav,
When do you plan to send out the code for running notebooks in parallel?

Dimple

On Tue, Nov 17, 2015 at 3:27 AM, Pranav Kumar Agarwal <pr...@gmail.com>
wrote:

> Hi Rohit,
>
> We implemented the proposal and are able to run Zeppelin as a hosted
> service inside our organization. Our internal forked version has pluggable
> authentication and type-ahead.
>
> I need to port the work to the latest codebase and strip out the auth
> changes. We'll be submitting it soon.
>
> We'll aim to get this out for review by 11/26.
>
> Regards,
> -Pranav.
>
>
>
> On 17/11/15 4:34 am, Rohit Agarwal wrote:
>
> Hey Pranav,
>
> Did you make any progress on this?
>
> --
> Rohit
>
> On Sunday, August 16, 2015, moon soo Lee <mo...@apache.org> wrote:
>
>> Pranav, the proposal looks awesome!
>>
>> I have a question and some feedback.
>>
>> You said you tested 1, 2 and 3. To create a SparkIMain per notebook, you
>> need the notebook id. Did you get it from the InterpreterContext?
>> And how did you handle destroying the SparkIMain (when a notebook is
>> deleted)?
>> As far as I know, the interpreter is not able to get notified of notebook
>> deletion.
>>
>> >> 4. Build a queue inside the interpreter to allow only one paragraph
>> >> execution at a time per notebook.
>>
>> One downside of this approach is that the GUI will display RUNNING instead
>> of PENDING for jobs queued inside the interpreter.
>>
>> Best,
>> moon
>>
>> On Sun, Aug 16, 2015 at 12:55 AM IT CTO <go...@gmail.com> wrote:
>>
>>> +1 for "to re-factor the Zeppelin architecture so that it can handle
>>> multi-tenancy easily"
>>>
>>> On Sun, Aug 16, 2015 at 9:47 AM DuyHai Doan <do...@gmail.com>
>>> wrote:
>>>
>>>> Agree with Joel; we may want to re-factor the Zeppelin architecture so
>>>> that it can handle multi-tenancy easily. The technical solution proposed by
>>>> Pranav is great, but it only applies to Spark. Right now, each interpreter
>>>> has to manage multi-tenancy in its own way. Ultimately Zeppelin could
>>>> propose a multi-tenancy contract/info (like a UserContext, similar to the
>>>> InterpreterContext) that each interpreter can choose whether to use.
>>>>
>>>>
>>>> On Sun, Aug 16, 2015 at 3:09 AM, Joel Zambrano <dj...@gmail.com>
>>>> wrote:
>>>>
>>>>> I think that while the idea of running multiple notes simultaneously is
>>>>> great, it is really dancing around the lack of true multi-user support in
>>>>> Zeppelin. The proposed solution would work if the application's resources
>>>>> were those of the whole cluster, but if the app is limited (say it has 8
>>>>> cores of 16, with some distribution in memory) then potentially your
>>>>> note can hog all the resources and the scheduler will have to throttle all
>>>>> other executions, leaving you exactly where you are now.
>>>>> While I think the solution is a good one, maybe this question should make
>>>>> us think about adding true multi-user support,
>>>>> where we isolate resources (cluster and the notebooks themselves),
>>>>> have separate login/identity and (I don't know if it's possible) share the
>>>>> same context.
>>>>>
>>>>> Thanks,
>>>>> Joel
>>>>>
>>>>> > On Aug 15, 2015, at 1:58 PM, Rohit Agarwal <mi...@gmail.com>
>>>>> wrote:
>>>>> >
>>>>> > If the problem is that multiple users have to wait for each other
>>>>> > while using Zeppelin, the solution already exists: they can create a
>>>>> > new interpreter on the interpreter page and attach it to their
>>>>> > notebook - then they don't have to wait for others to submit their job.
>>>>> >
>>>>> > But I agree, having paragraphs from one note wait for paragraphs from
>>>>> > other notes is a confusing default. We can get around that in two ways:
>>>>> >
>>>>> >   1. Create a new interpreter for each note and attach that interpreter
>>>>> >   to that note. This approach would require the least amount of code
>>>>> >   changes, but it is resource heavy and doesn't let you share the Spark
>>>>> >   context between different notes.
>>>>> >   2. If we want to share the Spark context between different notes, we
>>>>> >   can submit jobs from different notes into different fair-scheduler
>>>>> >   pools (
>>>>> https://spark.apache.org/docs/1.4.0/job-scheduling.html#scheduling-within-an-application
>>>>> ).
>>>>> >   This can be done by submitting jobs from different notes in different
>>>>> >   threads (see the sketch after this message). This will make sure that
>>>>> >   jobs from one note run sequentially, but jobs from different notes
>>>>> >   will be able to run in parallel.
>>>>> >
>>>>> > Neither of these options requires any change in the Spark code.
>>>>> >
>>>>> > --
>>>>> > Thanks & Regards
>>>>> > Rohit Agarwal
>>>>> > https://www.linkedin.com/in/rohitagarwal003
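
For illustration, a minimal sketch in Scala of option 2 above (the app name, pool names, and job are invented for the example): each note gets its own thread, and each thread tags itself with its own fair-scheduler pool, so jobs within one note stay sequential while different notes run in parallel.

import org.apache.spark.{SparkConf, SparkContext}

object PerNotePools {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("per-note-pools")
      .set("spark.scheduler.mode", "FAIR") // enable the fair scheduler
    val sc = new SparkContext(conf)

    // Hypothetical note ids; in Zeppelin these would identify the notes.
    val noteIds = Seq("note-1", "note-2")
    val threads = noteIds.map { noteId =>
      new Thread(new Runnable {
        override def run(): Unit = {
          // The pool assignment is a thread-local property of the context.
          sc.setLocalProperty("spark.scheduler.pool", noteId)
          // Jobs submitted from this thread run one after another, but in
          // parallel with jobs submitted from the other note's thread.
          val sum = sc.parallelize(1 to 1000000).map(_.toLong * 2).sum()
          println(s"$noteId -> $sum")
        }
      })
    }
    threads.foreach(_.start())
    threads.foreach(_.join())
    sc.stop()
  }
}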
>>>>> >
>>>>> > On Sat, Aug 15, 2015 at 12:01 PM, Pranav Kumar Agarwal <
>>>>> praagarw@gmail.com>
>>>>> > wrote:
>>>>> >
>>>>> >>> If someone can share an idea of sharing a single SparkContext
>>>>> >>> through multiple SparkILoops safely, it'll be really helpful.
>>>>> >> Here is a proposal:
>>>>> >> 1. In the Spark code, change SparkIMain.scala to allow setting the
>>>>> >> virtual directory. When creating the per-notebook instances of
>>>>> >> SparkIMain from the Zeppelin Spark interpreter, point all the
>>>>> >> SparkIMain instances at the same virtual directory.
>>>>> >> 2. Start an HTTP server on that virtual directory and set this HTTP
>>>>> >> server in the Spark context using the classServerUri method.
>>>>> >> 3. Scala-generated code has a notion of packages. The default package
>>>>> >> name is "$line<linenumber>". The package name can be controlled with
>>>>> >> the system property scala.repl.name.line. Setting this property to the
>>>>> >> notebook id ensures that code generated by one instance of SparkIMain
>>>>> >> is isolated from the other instances of SparkIMain.
>>>>> >> 4. Build a queue inside the interpreter to allow only one paragraph
>>>>> >> execution at a time per notebook (see the sketch after this message).
>>>>> >>
>>>>> >> I have tested 1, 2, and 3, and the approach seems to provide isolation
>>>>> >> across class names. I'll work towards submitting a formal patch soon -
>>>>> >> is there already a Jira for this that I can take up? Also, I need to
>>>>> >> understand:
>>>>> >> 1. How does Zeppelin pick up Spark fixes? Or do I need to first get
>>>>> >> the Spark changes merged into the Apache Spark GitHub repo?
>>>>> >>
>>>>> >> Any suggestions or comments on the proposal are highly welcome.
>>>>> >>
>>>>> >> Regards,
>>>>> >> -Pranav.
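
For illustration, a minimal sketch of step 4 in Scala, with invented names (this is not code from the patch): one single-threaded executor per notebook, so paragraphs of the same note run one at a time while different notes proceed in parallel.

import java.util.concurrent.{ConcurrentHashMap, Executors, ExecutorService}

class PerNoteQueue {
  // One executor per notebook id; a single thread per executor means only
  // one paragraph of that note executes at a time.
  private val queues = new ConcurrentHashMap[String, ExecutorService]()

  def submit(noteId: String, paragraph: Runnable): Unit = {
    var executor = queues.get(noteId)
    if (executor == null) {
      val fresh = Executors.newSingleThreadExecutor()
      val existing = queues.putIfAbsent(noteId, fresh)
      executor = if (existing != null) { fresh.shutdown(); existing } else fresh
    }
    executor.execute(paragraph)
  }

  // Would be called on notebook deletion (moon's question above): let any
  // queued paragraphs finish, then drop the note's executor.
  def remove(noteId: String): Unit = {
    val executor = queues.remove(noteId)
    if (executor != null) executor.shutdown()
  }
}

Note that anything sitting in these per-note queues is exactly what moon's feedback is about: the GUI would show such paragraphs as RUNNING even though they are effectively PENDING.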
>>>>> >>
>>>>> >>> On 10/08/15 11:36 pm, moon soo Lee wrote:
>>>>> >>>
>>>>> >>> Hi piyush,
>>>>> >>>
>>>>> >>> A separate instance of SparkILoop/SparkIMain for each notebook while
>>>>> >>> sharing the SparkContext sounds great.
>>>>> >>>
>>>>> >>> Actually, I tried to do it and found a problem: multiple SparkILoops
>>>>> >>> could generate the same class name, and the Spark executors confuse
>>>>> >>> the class names since they're reading classes from a single
>>>>> >>> SparkContext.
>>>>> >>>
>>>>> >>> If someone can share an idea of sharing a single SparkContext
>>>>> >>> through multiple SparkILoops safely, it'll be really helpful.
>>>>> >>>
>>>>> >>> Thanks,
>>>>> >>> moon
>>>>> >>>
>>>>> >>>
>>>>> >>> On Mon, Aug 10, 2015 at 1:21 AM Piyush Mukati (Data Platform) <
>>>>> >>> piyush.mukati@flipkart.com> wrote:
>>>>> >>>
>>>>> >>>    Hi Moon,
>>>>> >>>    Any suggestion on it? We have to wait a lot when multiple people
>>>>> >>> are working with Spark.
>>>>> >>>    Can we create a separate instance of SparkILoop, SparkIMain and
>>>>> >>> print streams for each notebook, while sharing the SparkContext,
>>>>> >>> ZeppelinContext, SQLContext and DependencyResolver, and then use the
>>>>> >>> parallel scheduler?
>>>>> >>>    thanks
>>>>> >>>
>>>>> >>>    -piyush
>>>>> >>>
>>>>> >>>    Hi Moon,
>>>>> >>>
>>>>> >>>    How about tracking a dedicated SparkContext per notebook in Spark's
>>>>> >>>    remote interpreter - this would allow multiple users to run their
>>>>> >>>    Spark paragraphs in parallel, while within a notebook only one
>>>>> >>>    paragraph is executed at a time.
>>>>> >>>
>>>>> >>>    Regards,
>>>>> >>>    -Pranav.
>>>>> >>>
>>>>> >>>
>>>>> >>>>    On 15/07/15 7:15 pm, moon soo Lee wrote:
>>>>> >>>> Hi,
>>>>> >>>>
>>>>> >>>> Thanks for asking the question.
>>>>> >>>>
>>>>> >>>> The reason is simply that it is running code statements. The
>>>>> >>>> statements can have order and dependencies. Imagine I have two
>>>>> >>> paragraphs:
>>>>> >>>>
>>>>> >>>> %spark
>>>>> >>>> val a = 1
>>>>> >>>>
>>>>> >>>> %spark
>>>>> >>>> print(a)
>>>>> >>>>
>>>>> >>>> If they're not run one by one, they may run in random order and the
>>>>> >>>> output will not always be the same: either '1' or an error like
>>>>> >>>> 'not found: value a'.
>>>>> >>>>
>>>>> >>>> This is the reason why. But if there is a nice idea for handling this
>>>>> >>>> problem, I agree that using a parallel scheduler would help a lot.
>>>>> >>>>
>>>>> >>>> Thanks,
>>>>> >>>> moon
>>>>> >>>> On Tue, Jul 14, 2015 at 7:59 PM linxi zeng
>>>>> >>>> <linxizeng0615@gmail.com>
>>>>> >>> wrote:
>>>>> >>>>
>>>>> >>>>    Anyone have the same question as me? Or is this not a question?
>>>>> >>>>
>>>>> >>>>    2015-07-14 11:47 GMT+08:00 linxi zeng <linxizeng0615@gmail.com>:
>>>>> >>>>
>>>>> >>>>        hi, Moon:
>>>>> >>>>           I notice that the getScheduler function in
>>>>> >>>>        SparkInterpreter.java returns a FIFOScheduler, which makes the
>>>>> >>>>        Spark interpreter run Spark jobs one by one. That's not a good
>>>>> >>>>        experience when a couple of users work on Zeppelin at the same
>>>>> >>>>        time, because they have to wait for each other. At the same
>>>>> >>>>        time, SparkSqlInterpreter can choose which scheduler to use
>>>>> >>>>        via "zeppelin.spark.concurrentSQL".
>>>>> >>>>        My question is: what considerations was such a decision
>>>>> >>>>        based on?
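
For context, a rough sketch of the scheduler choice being asked about, written in Scala here for consistency with the rest of this thread (the real code is Java, and the scheduler-factory names below are paraphrased from memory, so treat them as approximate):

// SparkInterpreter always returns a FIFO scheduler, so Spark paragraphs run
// one at a time. SparkSqlInterpreter instead picks a parallel scheduler when
// zeppelin.spark.concurrentSQL is set to true.
def getScheduler(): Scheduler =
  if (concurrentSQL()) {
    // up to maxConcurrency SQL paragraphs may run at once
    SchedulerFactory.singleton().createOrGetParallelScheduler("spark-sql", 10)
  } else {
    SchedulerFactory.singleton().createOrGetFIFOScheduler("spark")
  }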
>>>>> >>>
>>>>> >>
>>>>>
>>>>
>>>>
>
> --
> Sent from a mobile device. Excuse my thumbs.
>
>
>

Re: why zeppelin SparkInterpreter use FIFOScheduler

Posted by Dimp Bhat <di...@gmail.com>.
Thanks Piyush. Do we have any ETA for this to be sent for review?

Dimple

On Wed, Jan 13, 2016 at 6:23 PM, Piyush Mukati (Data Platform) <
piyush.mukati@flipkart.com> wrote:

> Hi,
>  The code is available here
>
> https://github.com/piyush-mukati/incubator-zeppelin/tree/parallel_scheduler_support_spark
>
>
> Some testing work is still left.
>

Re: why zeppelin SparkInterpreter use FIFOScheduler

Posted by "Piyush Mukati (Data Platform)" <pi...@flipkart.com>.
Hi,
 The code is available here
https://github.com/piyush-mukati/incubator-zeppelin/tree/parallel_scheduler_support_spark


Some testing work is still left.

On Wed, Jan 13, 2016 at 11:47 PM, Dimp Bhat <di...@gmail.com> wrote:

> Hi Pranav,
> When do you plan to send out the code for running notebooks in parallel?
>
> Dimple
>