Posted to dev@zeppelin.apache.org by Ankit Jain <an...@gmail.com> on 2018/07/24 23:12:24 UTC

Parallel Execution of Spark Jobs

Hi,
I am playing around with the execution policy of Spark jobs (and of all
Zeppelin paragraphs, actually).

Looks like there are a couple of control points:
1) Spark scheduling - FIFO vs FAIR, as documented in
https://spark.apache.org/docs/2.1.1/job-scheduling.html#fair-scheduler-pools.

Since we are still on the 0.7 version and don't have
https://issues.apache.org/jira/browse/ZEPPELIN-3563, I am forcing
sc.setLocalProperty("spark.scheduler.pool", "fair");
in both SparkInterpreter.java and SparkSqlInterpreter.java.
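For anyone curious, the hack plus the Spark side of the setup looks roughly
like this - a sketch only, assuming a plain JavaSparkContext; the pool name
"fair" and the allocation-file path are just examples:

    // Sketch only (not the actual Zeppelin patch): enable FAIR scheduling
    // on the shared context, then pin jobs from the current thread to a pool.
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class FairPoolSketch {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            .setAppName("zeppelin-shared")
            .setMaster("local[*]") // just for a standalone test
            .set("spark.scheduler.mode", "FAIR")
            // optional pool definitions; the path is an assumption
            .set("spark.scheduler.allocation.file", "/etc/spark/fairscheduler.xml");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Jobs submitted from this thread now run in the "fair" pool.
        sc.setLocalProperty("spark.scheduler.pool", "fair");
      }
    }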

Also, because we are exposing Zeppelin to multiple users, we may not actually
want users to hog the cluster, so we always use FAIR.

This may complicate our merge to .8 though.

2) On top of Spark scheduling, each Zeppelin interpreter itself seems to
have a scheduler queue. Each task is submitted to a FIFOScheduler, except for
SparkSqlInterpreter, which creates a ParallelScheduler if the concurrentsql
flag is turned on.

I am changing SparkInterpreter.java to use ParallelScheduler too and that
seems to do the trick.

Now multiple notebooks are able to run in parallel.

My question is whether other people have tested SparkInterpreter with
ParallelScheduler?
Also, ideally this should be configurable: users should be able to specify
fifo or parallel.
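Something like this is what I have in mind - only a sketch against the
0.7-era scheduler API, and the property name "zeppelin.spark.scheduler" is
invented for illustration, not an existing setting:

    // Hypothetical getScheduler() override inside SparkInterpreter.java.
    import org.apache.zeppelin.scheduler.Scheduler;
    import org.apache.zeppelin.scheduler.SchedulerFactory;

    @Override
    public Scheduler getScheduler() {
      // "zeppelin.spark.scheduler" is a made-up property: "fifo" | "parallel"
      String mode = getProperty("zeppelin.spark.scheduler");
      if ("parallel".equals(mode)) {
        int maxConcurrency = 10; // assumed default; would also be configurable
        return SchedulerFactory.singleton().createOrGetParallelScheduler(
            SparkInterpreter.class.getName() + this.hashCode(), maxConcurrency);
      }
      return SchedulerFactory.singleton().createOrGetFIFOScheduler(
          SparkInterpreter.class.getName() + this.hashCode());
    }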

Executing all paragraphs does add more complication, and maybe
https://issues.apache.org/jira/browse/ZEPPELIN-2368 will help us keep the
execution order sane.


Thoughts?

-- 
Thanks & Regards,
Ankit.

Re: Parallel Execution of Spark Jobs

Posted by Ankit Jain <an...@gmail.com>.
Thanks for the further clarification, Jeff.

> On Jul 26, 2018, at 8:11 PM, Jeff Zhang <zj...@gmail.com> wrote:
> 
> Let me rephrase it. In scoped mode, there are multiple Interpreter Groups (personally I prefer to call them multiple sessions) in one JVM (for the Spark interpreter, there are multiple SparkInterpreter instances).
> And there is one SparkContext in this JVM which is shared by all the SparkInterpreter instances. Regarding the scheduler, there are multiple schedulers in scoped mode in this JVM; each SparkInterpreter instance owns its own scheduler. Let me know if you have any other questions.
> 
> 
> 
> Ankit Jain <an...@gmail.com> wrote on Wed, Jul 25, 2018 at 10:27 PM:
>> Jeff, what you said seems to be in conflict with what is detailed here - https://medium.com/@leemoonsoo/apache-zeppelin-interpreter-mode-explained-bae0525d0555
>> 
>> "In Scoped mode, Zeppelin still runs single interpreter JVM process but multiple Interpreter Group serve each Note."
>> 
>> In practice as well we see one Interpreter process for scoped mode.
>> 
>> Can you please clarify?
>> 
>> Adding Moon too.
>> 
>> Thanks
>> Ankit
>> 
>>> On Tue, Jul 24, 2018 at 11:09 PM, Ankit Jain <an...@gmail.com> wrote:
>>> Aah that makes sense - so jobs will only block within a single user's FIFOScheduler.
>>> 
>>> By moving to ParallelScheduler, the only gain is that jobs from the same user can also run in parallel, though they may hit dependency resolution issues.
>>> 
>>> Just to confirm I have it right - if "Run all" for a notebook is not a requirement and users run one paragraph at a time from different notebooks, ParallelScheduler should be OK?
>>> 
>>> Thanks
>>> Ankit
>>> 
>>>> On Tue, Jul 24, 2018 at 10:38 PM, Jeff Zhang <zj...@gmail.com> wrote:
>>>> 
>>>> 1. ZEPPELIN-3563 forces FAIR scheduling and just allows specifying the pool.
>>>> 2. The scheduler cannot figure out the dependencies between paragraphs. That's why SparkInterpreter uses a FIFOScheduler. 
>>>> If you use per-user scoped mode, the SparkContext is shared between users but the SparkInterpreter is not. That means there are multiple SparkInterpreter instances that share the same SparkContext but don't share the same FIFOScheduler; each SparkInterpreter uses its own FIFOScheduler. 
>>>> 
>>>> Ankit Jain <an...@gmail.com> wrote on Wed, Jul 25, 2018 at 12:58 PM:
>>>>> Thanks for the quick feedback Jeff.
>>>>> 
>>>>> Re:1 - I did see ZEPPELIN-3563, but we are not on 0.8 yet, and also we may want to force FAIR execution instead of letting the user control it.
>>>>> 
>>>>> Re:2 - Is there an architecture issue here, or do we just need better thread safety? Ideally the scheduler should be able to figure out the dependencies and run whatever can run in parallel.
>>>>> 
>>>>> Re:Interpreter mode - I may not have been clear, but we are running per-user scoped mode, so the Spark context is shared among all users. 
>>>>> 
>>>>> Doesn't that mean all jobs from different users go to one FIFOScheduler, forcing all small jobs to block on a big one? That is specifically what we are trying to avoid.
>>>>> 
>>>>> Thanks
>>>>> Ankit
>>>>> 
>>>>>> On Tue, Jul 24, 2018 at 5:40 PM, Jeff Zhang <zj...@gmail.com> wrote:
>>>>>> Regarding 1.  ZEPPELIN-3563 should be helpful. See https://github.com/apache/zeppelin/blob/master/docs/interpreter/spark.md#running-spark-sql-concurrently
>>>>>> for more details. 
>>>>>> https://issues.apache.org/jira/browse/ZEPPELIN-3563
>>>>>> 
>>>>>> Regarding 2. If you use ParallelScheduler for SparkInterpreter, you may hit weird issues if your paragraphs have dependencies on each other, e.g. paragraph p1 uses a variable v1 which is defined in paragraph p2. The order of paragraph execution matters then, and ParallelScheduler cannot guarantee the order of execution.
>>>>>> That's why we use FIFOScheduler for SparkInterpreter. 
>>>>>> 
>>>>>> In your scenario where multiple users share the same SparkContext, I would suggest you use per-user scoped mode. Then all users share the same SparkContext, which saves resources, and each user gets their own FIFOScheduler, isolated from the others. 
>>>>>> 
>>>>>> Ankit Jain <an...@gmail.com> wrote on Wed, Jul 25, 2018 at 8:14 AM:
>>>>>>> Forgot to mention this is for shared scoped mode, so same Spark application and context for all users on a single Zeppelin instance.
>>>>>>> 
>>>>>>> Thanks
>>>>>>> Ankit
>>>>>>> 
>>>>>>>> On Jul 24, 2018, at 4:12 PM, Ankit Jain <an...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>> Hi,
>>>>>>>> I am playing around with the execution policy of Spark jobs (and of all Zeppelin paragraphs, actually).
>>>>>>>> 
>>>>>>>> Looks like there are a couple of control points:
>>>>>>>> 1) Spark scheduling - FIFO vs Fair as documented in https://spark.apache.org/docs/2.1.1/job-scheduling.html#fair-scheduler-pools.
>>>>>>>> 
>>>>>>>> Since we are still on the 0.7 version and don't have https://issues.apache.org/jira/browse/ZEPPELIN-3563, I am forcing sc.setLocalProperty("spark.scheduler.pool", "fair");
>>>>>>>> in both SparkInterpreter.java and SparkSqlInterpreter.java.
>>>>>>>> 
>>>>>>>> Also, because we are exposing Zeppelin to multiple users, we may not actually want users to hog the cluster, so we always use FAIR.
>>>>>>>> 
>>>>>>>> This may complicate our merge to .8 though.
>>>>>>>> 
>>>>>>>> 2) On top of Spark scheduling, each Zeppelin interpreter itself seems to have a scheduler queue. Each task is submitted to a FIFOScheduler, except for SparkSqlInterpreter, which creates a ParallelScheduler if the concurrentsql flag is turned on.
>>>>>>>> 
>>>>>>>> I am changing SparkInterpreter.java to use ParallelScheduler too and that seems to do the trick.
>>>>>>>> 
>>>>>>>> Now multiple notebooks are able to run in parallel.
>>>>>>>> 
>>>>>>>> My question is whether other people have tested SparkInterpreter with ParallelScheduler? Also, ideally this should be configurable: users should be able to specify fifo or parallel.
>>>>>>>> 
>>>>>>>> Executing all paragraphs does add more complication and maybe
>>>>>>>> https://issues.apache.org/jira/browse/ZEPPELIN-2368 will help us keep the execution order sane.
>>>>>>>> 
>>>>>>>> Thoughts?
>>>>>>>> 
>>>>>>>> -- 
>>>>>>>> Thanks & Regards,
>>>>>>>> Ankit.
>>>>> 
>>>>> 
>>>>> 
>>>>> -- 
>>>>> Thanks & Regards,
>>>>> Ankit.
>>> 
>>> 
>>> 
>>> -- 
>>> Thanks & Regards,
>>> Ankit.
>> 
>> 
>> 
>> -- 
>> Thanks & Regards,
>> Ankit.

Re: Parallel Execution of Spark Jobs

Posted by Jeff Zhang <zj...@gmail.com>.
Let me rephrase it. In scoped mode, there are multiple Interpreter Groups
(personally I prefer to call them multiple sessions) in one JVM (for the
Spark interpreter, there are multiple SparkInterpreter instances).
And there is one SparkContext in this JVM which is shared by all the
SparkInterpreter instances. Regarding the scheduler, there are multiple
schedulers in scoped mode in this JVM; each SparkInterpreter instance owns
its own scheduler. Let me know if you have any other questions.
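To put the same thing as a picture (sessions map to users or notes,
depending on the configured binding):

    one interpreter JVM (scoped mode)
      SparkContext                <- single, shared by all sessions
      |- session 1: SparkInterpreter instance 1 -> its own scheduler
      |- session 2: SparkInterpreter instance 2 -> its own scheduler
      \- ...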



Ankit Jain <an...@gmail.com> wrote on Wed, Jul 25, 2018 at 10:27 PM:

> Jeff, what you said seems to be in conflict with what is detailed here -
> https://medium.com/@leemoonsoo/apache-zeppelin-interpreter-mode-explained-bae0525d0555
>
> "In *Scoped* mode, Zeppelin still runs single interpreter JVM process but
> multiple *Interpreter Group* serve each Note."
>
> In practice as well we see one Interpreter process for scoped mode.
>
> Can you please clarify?
>
> Adding Moon too.
>
> Thanks
> Ankit
>
> On Tue, Jul 24, 2018 at 11:09 PM, Ankit Jain <an...@gmail.com>
> wrote:
>
>> Aah that makes sense - so jobs will only block within a single user's
>> FIFOScheduler.
>>
>> By moving to ParallelScheduler, the only gain is that jobs from the same
>> user can also run in parallel, though they may hit dependency resolution issues.
>>
>> Just to confirm I have it right - if "Run all" for a notebook is not a
>> requirement and users run one paragraph at a time from different notebooks,
>> ParallelScheduler should be OK?
>>
>> Thanks
>> Ankit
>>
>> On Tue, Jul 24, 2018 at 10:38 PM, Jeff Zhang <zj...@gmail.com> wrote:
>>
>>>
>>> 1. ZEPPELIN-3563 forces FAIR scheduling and just allows specifying the pool.
>>> 2. The scheduler cannot figure out the dependencies between paragraphs.
>>> That's why SparkInterpreter uses a FIFOScheduler.
>>> If you use per-user scoped mode, the SparkContext is shared between users
>>> but the SparkInterpreter is not. That means there are multiple
>>> SparkInterpreter instances that share the same SparkContext but don't
>>> share the same FIFOScheduler; each SparkInterpreter uses its own
>>> FIFOScheduler.
>>>
>>> Ankit Jain <an...@gmail.com> wrote on Wed, Jul 25, 2018 at 12:58 PM:
>>>
>>>> Thanks for the quick feedback Jeff.
>>>>
>>>> Re:1 - I did see ZEPPELIN-3563, but we are not on 0.8 yet, and also we may
>>>> want to force FAIR execution instead of letting the user control it.
>>>>
>>>> Re:2 - Is there an architecture issue here, or do we just need better
>>>> thread safety? Ideally the scheduler should be able to figure out the
>>>> dependencies and run whatever can run in parallel.
>>>>
>>>> Re:Interpreter mode - I may not have been clear, but we are running
>>>> per-user scoped mode, so the Spark context is shared among all users.
>>>>
>>>> Doesn't that mean all jobs from different users go to one FIFOScheduler,
>>>> forcing all small jobs to block on a big one? That is specifically what
>>>> we are trying to avoid.
>>>>
>>>> Thanks
>>>> Ankit
>>>>
>>>> On Tue, Jul 24, 2018 at 5:40 PM, Jeff Zhang <zj...@gmail.com> wrote:
>>>>
>>>>> Regarding 1.  ZEPPELIN-3563 should be helpful. See
>>>>> https://github.com/apache/zeppelin/blob/master/docs/interpreter/spark.md#running-spark-sql-concurrently
>>>>> for more details.
>>>>> https://issues.apache.org/jira/browse/ZEPPELIN-3563
>>>>>
>>>>> Regarding 2. If you use ParallelScheduler for SparkInterpreter, you
>>>>> may hit weird issues if your paragraphs have dependencies on each other,
>>>>> e.g. paragraph p1 uses a variable v1 which is defined in paragraph p2.
>>>>> The order of paragraph execution matters then, and ParallelScheduler
>>>>> cannot guarantee the order of execution.
>>>>> That's why we use FIFOScheduler for SparkInterpreter.
>>>>>
>>>>> In your scenario where multiple users share the same SparkContext, I
>>>>> would suggest you use per-user scoped mode. Then all users share the same
>>>>> SparkContext, which saves resources, and each user gets their own
>>>>> FIFOScheduler, isolated from the others.
>>>>>
>>>>> Ankit Jain <an...@gmail.com> wrote on Wed, Jul 25, 2018 at 8:14 AM:
>>>>>
>>>>>> Forgot to mention this is for shared scoped mode, so same Spark
>>>>>> application and context for all users on a single Zeppelin instance.
>>>>>>
>>>>>> Thanks
>>>>>> Ankit
>>>>>>
>>>>>> On Jul 24, 2018, at 4:12 PM, Ankit Jain <an...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> Hi,
>>>>>> I am playing around with the execution policy of Spark jobs (and of all
>>>>>> Zeppelin paragraphs, actually).
>>>>>>
>>>>>> Looks like there are a couple of control points:
>>>>>> 1) Spark scheduling - FIFO vs FAIR, as documented in
>>>>>> https://spark.apache.org/docs/2.1.1/job-scheduling.html#fair-scheduler-pools.
>>>>>>
>>>>>> Since we are still on the 0.7 version and don't have
>>>>>> https://issues.apache.org/jira/browse/ZEPPELIN-3563, I am forcing
>>>>>> sc.setLocalProperty("spark.scheduler.pool", "fair");
>>>>>> in both SparkInterpreter.java and SparkSqlInterpreter.java.
>>>>>>
>>>>>> Also, because we are exposing Zeppelin to multiple users, we may not
>>>>>> actually want users to hog the cluster, so we always use FAIR.
>>>>>>
>>>>>> This may complicate our merge to .8 though.
>>>>>>
>>>>>> 2) On top of Spark scheduling, each Zeppelin interpreter itself seems
>>>>>> to have a scheduler queue. Each task is submitted to a FIFOScheduler, except
>>>>>> for SparkSqlInterpreter, which creates a ParallelScheduler if the concurrentsql
>>>>>> flag is turned on.
>>>>>>
>>>>>> I am changing SparkInterpreter.java to use ParallelScheduler too and
>>>>>> that seems to do the trick.
>>>>>>
>>>>>> Now multiple notebooks are able to run in parallel.
>>>>>>
>>>>>> My question is whether other people have tested SparkInterpreter with ParallelScheduler?
>>>>>> Also, ideally this should be configurable: users should be able to specify
>>>>>> fifo or parallel.
>>>>>>
>>>>>> Executing all paragraphs does add more complication and maybe
>>>>>>
>>>>>> https://issues.apache.org/jira/browse/ZEPPELIN-2368 will help us
>>>>>> keep the execution order sane.
>>>>>>
>>>>>>
>>>>>> Thoughts?
>>>>>>
>>>>>> --
>>>>>> Thanks & Regards,
>>>>>> Ankit.
>>>>>>
>>>>>>
>>>>
>>>>
>>>> --
>>>> Thanks & Regards,
>>>> Ankit.
>>>>
>>>
>>
>>
>> --
>> Thanks & Regards,
>> Ankit.
>>
>
>
>
> --
> Thanks & Regards,
> Ankit.
>

Re: Parallel Execution of Spark Jobs

Posted by Ankit Jain <an...@gmail.com>.
Jeff, what you said seems to be in conflict with what is detailed here -
https://medium.com/@leemoonsoo/apache-zeppelin-interpreter-mode-explained-bae0525d0555

"In *Scoped* mode, Zeppelin still runs single interpreter JVM process but
multiple *Interpreter Group* serve each Note."

In practice as well we see one Interpreter process for scoped mode.

Can you please clarify?

Adding Moon too.

Thanks
Ankit

On Tue, Jul 24, 2018 at 11:09 PM, Ankit Jain <an...@gmail.com>
wrote:

> Aah that makes sense - so jobs will only block within a single user's
> FIFOScheduler.
>
> By moving to ParallelScheduler, the only gain is that jobs from the same
> user can also run in parallel, though they may hit dependency resolution issues.
>
> Just to confirm I have it right - if "Run all" for a notebook is not a
> requirement and users run one paragraph at a time from different notebooks,
> ParallelScheduler should be OK?
>
> Thanks
> Ankit
>
> On Tue, Jul 24, 2018 at 10:38 PM, Jeff Zhang <zj...@gmail.com> wrote:
>
>>
>> 1. ZEPPELIN-3563 forces FAIR scheduling and just allows specifying the pool.
>> 2. The scheduler cannot figure out the dependencies between paragraphs.
>> That's why SparkInterpreter uses a FIFOScheduler.
>> If you use per-user scoped mode, the SparkContext is shared between users
>> but the SparkInterpreter is not. That means there are multiple
>> SparkInterpreter instances that share the same SparkContext but don't
>> share the same FIFOScheduler; each SparkInterpreter uses its own
>> FIFOScheduler.
>>
>> Ankit Jain <an...@gmail.com> wrote on Wed, Jul 25, 2018 at 12:58 PM:
>>
>>> Thanks for the quick feedback Jeff.
>>>
>>> Re:1 - I did see ZEPPELIN-3563, but we are not on 0.8 yet, and also we may
>>> want to force FAIR execution instead of letting the user control it.
>>>
>>> Re:2 - Is there an architecture issue here, or do we just need better thread
>>> safety? Ideally the scheduler should be able to figure out the dependencies
>>> and run whatever can run in parallel.
>>>
>>> Re:Interpreter mode - I may not have been clear, but we are running
>>> per-user scoped mode, so the Spark context is shared among all users.
>>>
>>> Doesn't that mean all jobs from different users go to one FIFOScheduler,
>>> forcing all small jobs to block on a big one? That is specifically what
>>> we are trying to avoid.
>>>
>>> Thanks
>>> Ankit
>>>
>>> On Tue, Jul 24, 2018 at 5:40 PM, Jeff Zhang <zj...@gmail.com> wrote:
>>>
>>>> Regarding 1.  ZEPPELIN-3563 should be helpful. See
>>>> https://github.com/apache/zeppelin/blob/master/docs/interpreter/spark.md#running-spark-sql-concurrently
>>>> for more details.
>>>> https://issues.apache.org/jira/browse/ZEPPELIN-3563
>>>>
>>>> Regarding 2. If you use ParallelScheduler for SparkInterpreter, you may
>>>> hit weird issues if your paragraphs have dependencies on each other, e.g.
>>>> paragraph p1 uses a variable v1 which is defined in paragraph p2. The
>>>> order of paragraph execution matters then, and ParallelScheduler cannot
>>>> guarantee the order of execution.
>>>> That's why we use FIFOScheduler for SparkInterpreter.
>>>>
>>>> In your scenario where multiple users share the same SparkContext, I
>>>> would suggest you use per-user scoped mode. Then all users share the same
>>>> SparkContext, which saves resources, and each user gets their own
>>>> FIFOScheduler, isolated from the others.
>>>>
>>>> Ankit Jain <an...@gmail.com> wrote on Wed, Jul 25, 2018 at 8:14 AM:
>>>>
>>>>> Forgot to mention this is for shared scoped mode, so same Spark
>>>>> application and context for all users on a single Zeppelin instance.
>>>>>
>>>>> Thanks
>>>>> Ankit
>>>>>
>>>>> On Jul 24, 2018, at 4:12 PM, Ankit Jain <an...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Hi,
>>>>> I am playing around with the execution policy of Spark jobs (and of all
>>>>> Zeppelin paragraphs, actually).
>>>>>
>>>>> Looks like there are a couple of control points:
>>>>> 1) Spark scheduling - FIFO vs FAIR, as documented in
>>>>> https://spark.apache.org/docs/2.1.1/job-scheduling.html#fair-scheduler-pools.
>>>>>
>>>>> Since we are still on the 0.7 version and don't have
>>>>> https://issues.apache.org/jira/browse/ZEPPELIN-3563, I am forcing
>>>>> sc.setLocalProperty("spark.scheduler.pool", "fair");
>>>>> in both SparkInterpreter.java and SparkSqlInterpreter.java.
>>>>>
>>>>> Also, because we are exposing Zeppelin to multiple users, we may not
>>>>> actually want users to hog the cluster, so we always use FAIR.
>>>>>
>>>>> This may complicate our merge to .8 though.
>>>>>
>>>>> 2) On top of Spark scheduling, each Zeppelin interpreter itself seems
>>>>> to have a scheduler queue. Each task is submitted to a FIFOScheduler, except
>>>>> for SparkSqlInterpreter, which creates a ParallelScheduler if the concurrentsql
>>>>> flag is turned on.
>>>>>
>>>>> I am changing SparkInterpreter.java to use ParallelScheduler too and
>>>>> that seems to do the trick.
>>>>>
>>>>> Now multiple notebooks are able to run in parallel.
>>>>>
>>>>> My question is whether other people have tested SparkInterpreter with ParallelScheduler?
>>>>> Also, ideally this should be configurable: users should be able to specify
>>>>> fifo or parallel.
>>>>>
>>>>> Executing all paragraphs does add more complication and maybe
>>>>>
>>>>> https://issues.apache.org/jira/browse/ZEPPELIN-2368 will help us keep
>>>>> the execution order sane.
>>>>>
>>>>>
>>>>> Thoughts?
>>>>>
>>>>> --
>>>>> Thanks & Regards,
>>>>> Ankit.
>>>>>
>>>>>
>>>
>>>
>>> --
>>> Thanks & Regards,
>>> Ankit.
>>>
>>
>
>
> --
> Thanks & Regards,
> Ankit.
>



-- 
Thanks & Regards,
Ankit.

Re: Parallel Execution of Spark Jobs

Posted by Ankit Jain <an...@gmail.com>.
Aah that makes sense - so jobs will only block within a single user's
FIFOScheduler.

By moving to ParallelScheduler, the only gain is that jobs from the same user
can also run in parallel, though they may hit dependency resolution issues.
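To make that hazard concrete, here is a toy sketch - plain Java with an
ExecutorService standing in for the ParallelScheduler, not Zeppelin code -
of two "paragraphs" where p1 reads a value that p2 defines:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.atomic.AtomicReference;

    public class OrderingHazard {
      public static void main(String[] args) {
        ExecutorService parallel = Executors.newFixedThreadPool(2);
        AtomicReference<String> v1 = new AtomicReference<>(); // null until p2 runs

        Runnable p2 = () -> v1.set("dataset");                     // defines v1
        Runnable p1 = () -> System.out.println(v1.get().length()); // uses v1

        // Submission order is not execution order: if p1 wins the race,
        // v1.get() is still null and p1 fails with a NullPointerException.
        parallel.submit(p1);
        parallel.submit(p2);
        parallel.shutdown();
      }
    }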

Just to confirm I have it right - if "Run all" for a notebook is not a
requirement and users run one paragraph at a time from different notebooks,
ParallelScheduler should be OK?

Thanks
Ankit

On Tue, Jul 24, 2018 at 10:38 PM, Jeff Zhang <zj...@gmail.com> wrote:

>
> 1. ZEPPELIN-3563 forces FAIR scheduling and just allows specifying the pool.
> 2. The scheduler cannot figure out the dependencies between paragraphs.
> That's why SparkInterpreter uses a FIFOScheduler.
> If you use per-user scoped mode, the SparkContext is shared between users
> but the SparkInterpreter is not. That means there are multiple
> SparkInterpreter instances that share the same SparkContext but don't
> share the same FIFOScheduler; each SparkInterpreter uses its own
> FIFOScheduler.
>
> Ankit Jain <an...@gmail.com> wrote on Wed, Jul 25, 2018 at 12:58 PM:
>
>> Thanks for the quick feedback Jeff.
>>
>> Re:1 - I did see ZEPPELIN-3563, but we are not on 0.8 yet, and also we may
>> want to force FAIR execution instead of letting the user control it.
>>
>> Re:2 - Is there an architecture issue here, or do we just need better thread
>> safety? Ideally the scheduler should be able to figure out the dependencies
>> and run whatever can run in parallel.
>>
>> Re:Interpreter mode - I may not have been clear, but we are running
>> per-user scoped mode, so the Spark context is shared among all users.
>>
>> Doesn't that mean all jobs from different users go to one FIFOScheduler,
>> forcing all small jobs to block on a big one? That is specifically what
>> we are trying to avoid.
>>
>> Thanks
>> Ankit
>>
>> On Tue, Jul 24, 2018 at 5:40 PM, Jeff Zhang <zj...@gmail.com> wrote:
>>
>>> Regarding 1.  ZEPPELIN-3563 should be helpful. See
>>> https://github.com/apache/zeppelin/blob/master/docs/interpreter/spark.md#running-spark-sql-concurrently
>>> for more details.
>>> https://issues.apache.org/jira/browse/ZEPPELIN-3563
>>>
>>> Regarding 2. If you use ParallelScheduler for SparkInterpreter, you may
>>> hit weird issues if your paragraphs have dependencies on each other, e.g.
>>> paragraph p1 uses a variable v1 which is defined in paragraph p2. The
>>> order of paragraph execution matters then, and ParallelScheduler cannot
>>> guarantee the order of execution.
>>> That's why we use FIFOScheduler for SparkInterpreter.
>>>
>>> In your scenario where multiple users share the same SparkContext, I
>>> would suggest you use per-user scoped mode. Then all users share the same
>>> SparkContext, which saves resources, and each user gets their own
>>> FIFOScheduler, isolated from the others.
>>>
>>> Ankit Jain <an...@gmail.com> wrote on Wed, Jul 25, 2018 at 8:14 AM:
>>>
>>>> Forgot to mention this is for shared scoped mode, so same Spark
>>>> application and context for all users on a single Zeppelin instance.
>>>>
>>>> Thanks
>>>> Ankit
>>>>
>>>> On Jul 24, 2018, at 4:12 PM, Ankit Jain <an...@gmail.com>
>>>> wrote:
>>>>
>>>> Hi,
>>>> I am playing around with the execution policy of Spark jobs (and of all
>>>> Zeppelin paragraphs, actually).
>>>>
>>>> Looks like there are a couple of control points:
>>>> 1) Spark scheduling - FIFO vs FAIR, as documented in
>>>> https://spark.apache.org/docs/2.1.1/job-scheduling.html#fair-scheduler-pools.
>>>>
>>>> Since we are still on the 0.7 version and don't have
>>>> https://issues.apache.org/jira/browse/ZEPPELIN-3563, I am forcing
>>>> sc.setLocalProperty("spark.scheduler.pool", "fair");
>>>> in both SparkInterpreter.java and SparkSqlInterpreter.java.
>>>>
>>>> Also, because we are exposing Zeppelin to multiple users, we may not
>>>> actually want users to hog the cluster, so we always use FAIR.
>>>>
>>>> This may complicate our merge to .8 though.
>>>>
>>>> 2) On top of Spark scheduling, each Zeppelin interpreter itself seems
>>>> to have a scheduler queue. Each task is submitted to a FIFOScheduler, except
>>>> for SparkSqlInterpreter, which creates a ParallelScheduler if the concurrentsql
>>>> flag is turned on.
>>>>
>>>> I am changing SparkInterpreter.java to use ParallelScheduler too and
>>>> that seems to do the trick.
>>>>
>>>> Now multiple notebooks are able to run in parallel.
>>>>
>>>> My question is whether other people have tested SparkInterpreter with ParallelScheduler?
>>>> Also, ideally this should be configurable: users should be able to specify
>>>> fifo or parallel.
>>>>
>>>> Executing all paragraphs does add more complication and maybe
>>>>
>>>> https://issues.apache.org/jira/browse/ZEPPELIN-2368 will help us keep
>>>> the execution order sane.
>>>>
>>>>
>>>> Thoughts?
>>>>
>>>> --
>>>> Thanks & Regards,
>>>> Ankit.
>>>>
>>>>
>>
>>
>> --
>> Thanks & Regards,
>> Ankit.
>>
>


-- 
Thanks & Regards,
Ankit.

Re: Parallel Execution of Spark Jobs

Posted by Jeff Zhang <zj...@gmail.com>.
1. ZEPPELIN-3563 forces FAIR scheduling and just allows specifying the pool.
2. The scheduler cannot figure out the dependencies between paragraphs.
That's why SparkInterpreter uses a FIFOScheduler.
If you use per-user scoped mode, the SparkContext is shared between users but
the SparkInterpreter is not. That means there are multiple SparkInterpreter
instances that share the same SparkContext but don't share the same
FIFOScheduler; each SparkInterpreter uses its own FIFOScheduler.

Ankit Jain <an...@gmail.com> wrote on Wed, Jul 25, 2018 at 12:58 PM:

> Thanks for the quick feedback Jeff.
>
> Re:1 - I did see ZEPPELIN-3563, but we are not on 0.8 yet, and also we may
> want to force FAIR execution instead of letting the user control it.
>
> Re:2 - Is there an architecture issue here, or do we just need better thread
> safety? Ideally the scheduler should be able to figure out the dependencies
> and run whatever can run in parallel.
>
> Re:Interpreter mode - I may not have been clear, but we are running per-user
> scoped mode, so the Spark context is shared among all users.
>
> Doesn't that mean all jobs from different users go to one FIFOScheduler,
> forcing all small jobs to block on a big one? That is specifically what we
> are trying to avoid.
>
> Thanks
> Ankit
>
> On Tue, Jul 24, 2018 at 5:40 PM, Jeff Zhang <zj...@gmail.com> wrote:
>
>> Regarding 1.  ZEPPELIN-3563 should be helpful. See
>> https://github.com/apache/zeppelin/blob/master/docs/interpreter/spark.md#running-spark-sql-concurrently
>> for more details.
>> https://issues.apache.org/jira/browse/ZEPPELIN-3563
>>
>> Regarding 2. If you use ParallelScheduler for SparkInterpreter, you may
>> hit weird issues if your paragraphs have dependencies on each other, e.g.
>> paragraph p1 uses a variable v1 which is defined in paragraph p2. The
>> order of paragraph execution matters then, and ParallelScheduler cannot
>> guarantee the order of execution.
>> That's why we use FIFOScheduler for SparkInterpreter.
>>
>> In your scenario where multiple users share the same SparkContext, I
>> would suggest you use per-user scoped mode. Then all users share the same
>> SparkContext, which saves resources, and each user gets their own
>> FIFOScheduler, isolated from the others.
>>
>> Ankit Jain <an...@gmail.com> wrote on Wed, Jul 25, 2018 at 8:14 AM:
>>
>>> Forgot to mention this is for shared scoped mode, so same Spark
>>> application and context for all users on a single Zeppelin instance.
>>>
>>> Thanks
>>> Ankit
>>>
>>> On Jul 24, 2018, at 4:12 PM, Ankit Jain <an...@gmail.com> wrote:
>>>
>>> Hi,
>>> I am playing around with the execution policy of Spark jobs (and of all
>>> Zeppelin paragraphs, actually).
>>>
>>> Looks like there are a couple of control points:
>>> 1) Spark scheduling - FIFO vs FAIR, as documented in
>>> https://spark.apache.org/docs/2.1.1/job-scheduling.html#fair-scheduler-pools.
>>>
>>> Since we are still on the 0.7 version and don't have
>>> https://issues.apache.org/jira/browse/ZEPPELIN-3563, I am forcing
>>> sc.setLocalProperty("spark.scheduler.pool", "fair");
>>> in both SparkInterpreter.java and SparkSqlInterpreter.java.
>>>
>>> Also, because we are exposing Zeppelin to multiple users, we may not
>>> actually want users to hog the cluster, so we always use FAIR.
>>>
>>> This may complicate our merge to .8 though.
>>>
>>> 2) On top of Spark scheduling, each Zeppelin interpreter itself seems to
>>> have a scheduler queue. Each task is submitted to a FIFOScheduler, except
>>> for SparkSqlInterpreter, which creates a ParallelScheduler if the concurrentsql
>>> flag is turned on.
>>>
>>> I am changing SparkInterpreter.java to use ParallelScheduler too and
>>> that seems to do the trick.
>>>
>>> Now multiple notebooks are able to run in parallel.
>>>
>>> My question is whether other people have tested SparkInterpreter with ParallelScheduler?
>>> Also, ideally this should be configurable: users should be able to specify
>>> fifo or parallel.
>>>
>>> Executing all paragraphs does add more complication and maybe
>>>
>>> https://issues.apache.org/jira/browse/ZEPPELIN-2368 will help us keep
>>> the execution order sane.
>>>
>>>
>>> Thoughts?
>>>
>>> --
>>> Thanks & Regards,
>>> Ankit.
>>>
>>>
>
>
> --
> Thanks & Regards,
> Ankit.
>

Re: Parallel Execution of Spark Jobs

Posted by Ankit Jain <an...@gmail.com>.
Thanks for the quick feedback Jeff.

Re:1 - I did see Zeppelin-3563 but we are not on .8 yet and also we may
want to force FAIR execution instead of letting user control it.

Re: 2 - Is there an architectural issue here, or do we just need better
thread safety? Ideally the scheduler should be able to figure out the
dependencies and run in parallel whatever it can.
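
For reference, the configurable scheduler choice floated in the original
note could look roughly like the sketch below, using Zeppelin's own
Scheduler/SchedulerFactory classes. The property name
zeppelin.spark.parallelScheduler and the concurrency cap of 10 are invented
for illustration.

    // Sketch of a change inside SparkInterpreter.java, not the actual patch:
    import org.apache.zeppelin.scheduler.Scheduler;
    import org.apache.zeppelin.scheduler.SchedulerFactory;

    @Override
    public Scheduler getScheduler() {
      // Hypothetical property; absent or false keeps today's FIFO behavior.
      boolean parallel = "true".equalsIgnoreCase(
          getProperty("zeppelin.spark.parallelScheduler"));
      String name = SparkInterpreter.class.getName() + this.hashCode();
      if (parallel) {
        // Same pattern SparkSqlInterpreter uses when concurrentSQL is on.
        return SchedulerFactory.singleton().createOrGetParallelScheduler(name, 10);
      }
      return SchedulerFactory.singleton().createOrGetFIFOScheduler(name);
    }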

Re: interpreter mode, I may not have been clear, but we are running per-user
scoped mode - so the SparkContext is shared among all users.

Doesn't that mean all jobs from different users go to one FIFOScheduler,
forcing all small jobs to block on a big one? That is specifically what we
are trying to avoid.

Thanks
Ankit

On Tue, Jul 24, 2018 at 5:40 PM, Jeff Zhang <zj...@gmail.com> wrote:

> Regarding 1. ZEPPELIN-3563 should be helpful. See
> https://github.com/apache/zeppelin/blob/master/docs/interpreter/spark.md#running-spark-sql-concurrently
> for more details.
> https://issues.apache.org/jira/browse/ZEPPELIN-3563
>
> Regarding 2. If you use ParallelScheduler for SparkInterpreter, you may
> hit weird issues if your paragraphs have dependencies on each other, e.g.
> paragraph p1 uses a variable v1 which is defined in paragraph p2. Then the
> order of paragraph execution matters, and ParallelScheduler cannot
> guarantee the order of execution.
> That's why we use FIFOScheduler for SparkInterpreter.
>
> In your scenario where multiple users share the same SparkContext, I would
> suggest you use per-user scoped mode. Then all users share the same
> SparkContext, which saves resources, and each user gets their own
> FIFOScheduler, isolated from the others.
>
> Ankit Jain <an...@gmail.com> wrote on Wed, Jul 25, 2018 at 8:14 AM:
>
>> Forgot to mention this is for shared scoped mode, so same Spark
>> application and context for all users on a single Zeppelin instance.
>>
>> Thanks
>> Ankit
>>
>> On Jul 24, 2018, at 4:12 PM, Ankit Jain <an...@gmail.com> wrote:
>>
>> Hi,
>> I am playing around with the execution policy of Spark jobs (and all
>> Zeppelin paragraphs, actually).
>>
>> Looks like there are a couple of control points:
>> 1) Spark scheduling - FIFO vs Fair, as documented in
>> https://spark.apache.org/docs/2.1.1/job-scheduling.html#fair-scheduler-pools.
>>
>> Since we are still on the 0.7 version and don't have
>> https://issues.apache.org/jira/browse/ZEPPELIN-3563, I am forcefully doing
>> sc.setLocalProperty("spark.scheduler.pool", "fair");
>> in both SparkInterpreter.java and SparkSqlInterpreter.java.
>>
>> Also, because we are exposing Zeppelin to multiple users, we may not want
>> any single user to hog the cluster, so we always use FAIR.
>>
>> This may complicate our merge to 0.8, though.
>>
>> 2) On top of Spark scheduling, each Zeppelin Interpreter itself seems to
>> have a scheduler queue. Each task is submitted to a FIFOScheduler, except
>> for SparkSqlInterpreter, which creates a ParallelScheduler if the
>> concurrentSQL flag is turned on.
>>
>> I am changing SparkInterpreter.java to use ParallelScheduler too and
>> that seems to do the trick.
>>
>> Now multiple notebooks are able to run in parallel.
>>
>> My question is whether other people have tested SparkInterpreter with
>> ParallelScheduler. Also, ideally this should be configurable: users should
>> be able to specify fifo or parallel.
>>
>> Executing all paragraphs does add more complication, and maybe
>> https://issues.apache.org/jira/browse/ZEPPELIN-2368 will help us keep
>> the execution order sane.
>>
>>
>> Thoughts?
>>
>> --
>> Thanks & Regards,
>> Ankit.
>>
>>


-- 
Thanks & Regards,
Ankit.

Re: Parallel Execution of Spark Jobs

Posted by Jeff Zhang <zj...@gmail.com>.
Regarding 1.  ZEPPELIN-3563 should be helpful. See
https://github.com/apache/zeppelin/blob/master/docs/interpreter/spark.md#running-spark-sql-concurrently
for more details.
https://issues.apache.org/jira/browse/ZEPPELIN-3563
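
The setup described there is roughly the following. The property names and
paragraph syntax are paraphrased from that doc, so double-check them against
your Zeppelin version:

    # In the Spark interpreter settings:
    zeppelin.spark.concurrentSQL      true   # run sql paragraphs concurrently
    zeppelin.spark.concurrentSQL.max  10     # cap on concurrent statements

    # A paragraph then picks its fair-scheduler pool via a local property
    # (pool1 and some_table are made-up names):
    %spark.sql(pool=pool1)
    select count(*) from some_table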

Regarding 2. If you use ParallelScheduler for SparkInterpreter, you may hit
weird issues if your paragraphs have dependencies on each other, e.g.
paragraph p1 uses a variable v1 which is defined in paragraph p2. Then the
order of paragraph execution matters, and ParallelScheduler cannot
guarantee the order of execution.
That's why we use FIFOScheduler for SparkInterpreter.
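
To make the hazard concrete, consider two %spark (Scala) paragraphs; the
variable and table names are made up:

    %spark
    // paragraph p2 - defines the variable
    val v1 = spark.sql("select * from events")

    %spark
    // paragraph p1 - consumes it; under a ParallelScheduler this can run
    // before p2 has completed and fail with "not found: value v1"
    v1.count()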

In your scenario where multiple users share the same SparkContext, I would
suggest you use per-user scoped mode. Then all users share the same
SparkContext, which saves resources, and each user gets their own
FIFOScheduler, isolated from the others.
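
For reference, per-user scoped mode corresponds to an interpreter option
block along these lines in conf/interpreter.json (a sketch from memory, so
verify the field names against your version), or equivalently the "scoped
per user" choice in the interpreter settings UI:

    "option": {
      "perNote": "shared",
      "perUser": "scoped"
    }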

Ankit Jain <an...@gmail.com> wrote on Wed, Jul 25, 2018 at 8:14 AM:

> Forgot to mention this is for shared scoped mode, so same Spark
> application and context for all users on a single Zeppelin instance.
>
> Thanks
> Ankit
>
> On Jul 24, 2018, at 4:12 PM, Ankit Jain <an...@gmail.com> wrote:
>
> Hi,
> I am playing around with the execution policy of Spark jobs (and all
> Zeppelin paragraphs, actually).
>
> Looks like there are a couple of control points:
> 1) Spark scheduling - FIFO vs Fair, as documented in
> https://spark.apache.org/docs/2.1.1/job-scheduling.html#fair-scheduler-pools.
>
> Since we are still on the 0.7 version and don't have
> https://issues.apache.org/jira/browse/ZEPPELIN-3563, I am forcefully
> doing sc.setLocalProperty("spark.scheduler.pool", "fair");
> in both SparkInterpreter.java and SparkSqlInterpreter.java.
>
> Also, because we are exposing Zeppelin to multiple users, we may not want
> any single user to hog the cluster, so we always use FAIR.
>
> This may complicate our merge to 0.8, though.
>
> 2) On top of Spark scheduling, each Zeppelin Interpreter itself seems to
> have a scheduler queue. Each task is submitted to a FIFOScheduler, except
> for SparkSqlInterpreter, which creates a ParallelScheduler if the
> concurrentSQL flag is turned on.
>
> I am changing SparkInterpreter.java to use ParallelScheduler too and that
> seems to do the trick.
>
> Now multiple notebooks are able to run in parallel.
>
> My question is whether other people have tested SparkInterpreter with
> ParallelScheduler. Also, ideally this should be configurable: users should
> be able to specify fifo or parallel.
>
> Executing all paragraphs does add more complication, and maybe
> https://issues.apache.org/jira/browse/ZEPPELIN-2368 will help us keep the
> execution order sane.
>
>
> Thoughts?
>
> --
> Thanks & Regards,
> Ankit.
>
>

Re: Parallel Execution of Spark Jobs

Posted by Ankit Jain <an...@gmail.com>.
Forgot to mention this is for shared scoped mode, so same Spark application and context for all users on a single Zeppelin instance.

Thanks
Ankit

> On Jul 24, 2018, at 4:12 PM, Ankit Jain <an...@gmail.com> wrote:
> 
> Hi,
> I am playing around with the execution policy of Spark jobs (and all Zeppelin paragraphs, actually).
> 
> Looks like there are a couple of control points:
> 1) Spark scheduling - FIFO vs Fair, as documented in https://spark.apache.org/docs/2.1.1/job-scheduling.html#fair-scheduler-pools.
> 
> Since we are still on the 0.7 version and don't have https://issues.apache.org/jira/browse/ZEPPELIN-3563, I am forcefully doing sc.setLocalProperty("spark.scheduler.pool", "fair");
> in both SparkInterpreter.java and SparkSqlInterpreter.java.
> 
> Also, because we are exposing Zeppelin to multiple users, we may not want any single user to hog the cluster, so we always use FAIR.
> 
> This may complicate our merge to 0.8, though.
> 
> 2) On top of Spark scheduling, each Zeppelin Interpreter itself seems to have a scheduler queue. Each task is submitted to a FIFOScheduler, except for SparkSqlInterpreter, which creates a ParallelScheduler if the concurrentSQL flag is turned on.
> 
> I am changing SparkInterpreter.java to use ParallelScheduler too and that seems to do the trick.
> 
> Now multiple notebooks are able to run in parallel.
> 
> My question is whether other people have tested SparkInterpreter with ParallelScheduler. Also, ideally this should be configurable: users should be able to specify fifo or parallel.
> 
> Executing all paragraphs does add more complication and maybe
> https://issues.apache.org/jira/browse/ZEPPELIN-2368 will help us keep the execution order sane.
> 
> Thoughts?
> 
> -- 
> Thanks & Regards,
> Ankit.
