Posted to dev@zeppelin.apache.org by moon soo Lee <mo...@apache.org> on 2015/01/13 12:26:01 UTC

Run interpreter in separate process.

Hi guys,

I'm bringing an issue https://github.com/NFLabs/zeppelin/issues/278 to this
mailing list for discussion.

Zeppelin creates each interpreter instance with its own classloader to
avoid interference (dependency conflicts, singletons, static members) with
other interpreter instances. This has worked well so far, but I can see
some limitations.

a) When multiple interpreter instances run concurrently, they cannot
avoid interfering with each other's stdin/stdout/stderr.
b) When one of an interpreter's dependencies is designed (i.e. hardcoded) to
use the Application classloader, it won't work within Zeppelin, because
Zeppelin loads the interpreter's dependency jars in its thread context
classloader, not the Application classloader.
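
Limitation (a) can be illustrated with a minimal sketch. It assumes nothing
about Zeppelin's real classes: SharedStdoutDemo and its methods are
hypothetical names. Two threads each get their own context classloader, yet
both still write to the same JVM-wide System.out.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical demo class; not part of Zeppelin.
class SharedStdoutDemo {

    // Runs two stand-in "interpreters" on separate threads, each with its own
    // context classloader, and returns everything written to System.out.
    static String runTwoInterpreters() {
        PrintStream original = System.out;
        ByteArrayOutputStream captured = new ByteArrayOutputStream();
        // System.out is a static of the whole JVM: replacing it affects every
        // thread and every classloader in the process.
        System.setOut(new PrintStream(captured, true));
        try {
            Thread a = interpreterThread("A");
            Thread b = interpreterThread("B");
            a.start();
            b.start();
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            System.setOut(original);
        }
        return captured.toString();
    }

    static Thread interpreterThread(String name) {
        Thread t = new Thread(() -> System.out.println("output from interpreter " + name));
        // A separate classloader isolates classes, but not the standard streams.
        t.setContextClassLoader(new URLClassLoader(new URL[0]));
        return t;
    }

    public static void main(String[] args) {
        System.out.print(runTwoInterpreters());
    }
}
```

Both lines end up in the same captured stream, in whichever order the threads
happen to run, which is the interference described above.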

Running the interpreter in a separate process is the solution I can think of.
In detail, because an interpreter is abstracted by its public methods,
everything can be done simply if we can call those methods remotely through
some sort of RPC mechanism.
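
The idea of calling the public methods remotely can be sketched with a JDK
dynamic proxy. This is only an illustration under assumptions:
SimpleInterpreter, EchoInterpreter and Transport are hypothetical names, and
the "transport" here dispatches in-process instead of crossing a real process
boundary.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

// Hypothetical sketch; none of these types are Zeppelin's actual API.
class RemoteCallSketch {

    // Simplified stand-in for the interpreter's public methods.
    interface SimpleInterpreter {
        String interpret(String code);
    }

    // Stand-in for the implementation living in the separate process.
    static class EchoInterpreter implements SimpleInterpreter {
        public String interpret(String code) {
            return "result of: " + code;
        }
    }

    // Stand-in for the wire. A real RPC layer would serialise the method
    // name and arguments, send them to the other process, and read the reply.
    static class Transport {
        private final SimpleInterpreter target;

        Transport(SimpleInterpreter target) {
            this.target = target;
        }

        Object send(String methodName, Class<?>[] paramTypes, Object[] args) throws Exception {
            Method m = SimpleInterpreter.class.getMethod(methodName, paramTypes);
            return m.invoke(target, args);
        }
    }

    // The caller sees an ordinary SimpleInterpreter; every call is forwarded
    // through the transport.
    static SimpleInterpreter remoteProxy(Transport transport) {
        InvocationHandler handler = (proxy, method, args) ->
                transport.send(method.getName(), method.getParameterTypes(), args);
        return (SimpleInterpreter) Proxy.newProxyInstance(
                SimpleInterpreter.class.getClassLoader(),
                new Class<?>[] { SimpleInterpreter.class },
                handler);
    }

    public static void main(String[] args) {
        SimpleInterpreter remote = remoteProxy(new Transport(new EchoInterpreter()));
        System.out.println(remote.interpret("1 + 1"));
    }
}
```

Because the caller only depends on the interface, the in-process Transport
could later be swapped for a real RPC client without changing calling code.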

Therefore

a) A main entry point and run script to run an interpreter in a separate
process
b) An RPC mechanism between Zeppelin and the separate interpreter process
c) An option to enable/disable this capability

are the major tasks I have in mind.
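
Task (a) could look roughly like the sketch below, which prepares (but does
not start) a child JVM. Everything specific is an assumption: the class name
InterpreterLauncherSketch, the idea of passing a port as the only argument,
and whatever main class is supplied are all hypothetical.

```java
import java.io.File;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; not Zeppelin's actual launcher.
class InterpreterLauncherSketch {

    // Prepares a child JVM for one interpreter. The child gets its own
    // stdout/stderr, appended to a per-interpreter log file, so output from
    // different interpreters can no longer mix.
    static ProcessBuilder buildLauncher(String classpath, String mainClass, int port, File logFile) {
        String javaBin = Paths.get(System.getProperty("java.home"), "bin", "java").toString();
        List<String> command = Arrays.asList(
                javaBin, "-cp", classpath, mainClass, String.valueOf(port));
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectErrorStream(true); // merge stderr into stdout
        pb.redirectOutput(ProcessBuilder.Redirect.appendTo(logFile));
        return pb;
    }

    public static void main(String[] args) {
        ProcessBuilder pb = buildLauncher(
                "interpreter/spark/*",            // assumed classpath layout
                "org.example.InterpreterServer",  // hypothetical entry point
                9000,
                new File("interpreter-spark.log"));
        System.out.println(pb.command());
        // pb.start() would launch the process; omitted in this sketch.
    }
}
```

The child's classpath is passed with -cp, so its dependency jars are loaded by
the child's own Application classloader, which is what limitation (b) needs.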

What do you guys think? Please share any ideas you have.

Best,
moon

Re: Run interpreter in separate process.

Posted by Roman Shaposhnik <ro...@shaposhnik.org>.
On Wed, Jan 21, 2015 at 5:29 PM, Alex B. <bz...@apache.org> wrote:
> Thanks for the advice, and I cannot agree more with you guys: as soon as
> I start developing a feature of Zeppelin, the simple branching model you
> describe would be my first preference!
>
> The code that I refer to below, though, does not depend on or contain any
> part of Zeppelin itself.
> Basically it is a side project, a throw-away prototype for the
> IPC strategy; it was never meant to be either directly used or merged into
> Zeppelin itself.
>
> Given that, although the benefit of visibility is huge, would you still
> advise the project repo as the best place for it?

Then it becomes your call ;-)

Thanks,
Roman.

Re: Run interpreter in separate process.

Posted by "Alex B." <bz...@apache.org>.
Thanks for the advice, and I cannot agree more with you guys: as soon as
I start developing a feature of Zeppelin, the simple branching model you
describe would be my first preference!

The code that I refer to below, though, does not depend on or contain any
part of Zeppelin itself.
Basically it is a side project, a throw-away prototype for the
IPC strategy; it was never meant to be either directly used or merged into
Zeppelin itself.

Given that, although the benefit of visibility is huge, would you still
advise the project repo as the best place for it?

--
Kind regards,
Alexander

On Thu, Jan 22, 2015 at 8:05 AM, Konstantin Boudnik <co...@apache.org> wrote:

> +1 [a big one!]

Re: Run interpreter in separate process.

Posted by Konstantin Boudnik <co...@apache.org>.
+1 [a big one!]

On Wed, Jan 21, 2015 at 02:50PM, Roman Shaposhnik wrote:
> You may want to consider developing features like this one directly in
> the ASF repo on a feature branch. This is the model used by some of the
> most well-established projects, such as Apache Hadoop. It helps keep
> the work exposed to as many potential contributors as possible.
> 
> Thanks,
> Roman.

Re: Run interpreter in separate process.

Posted by Roman Shaposhnik <ro...@shaposhnik.org>.
On Wed, Jan 21, 2015 at 12:28 AM, Alex B. <bz...@apache.org> wrote:
> Quick update: I've started to hack on a PoC for this feature here
> https://github.com/bzz/zeppelin-multiprocess-interpreter-poc , it's at a
> very early stage but please feel free to check it out and provide any
> feedback.
>
> On enabling/disabling this feature: if it is easy to provide such a
> choice (which we are not sure of yet), then there is no reason not to
> implement option 4.

You may want to consider developing features like this one directly in
the ASF repo on a feature branch. This is the model used by some of the
most well-established projects, such as Apache Hadoop. It helps keep
the work exposed to as many potential contributors as possible.

Thanks,
Roman.

Re: Run interpreter in separate process.

Posted by "Alex B." <bz...@apache.org>.
Quick update: I've started to hack on a PoC for this feature here
https://github.com/bzz/zeppelin-multiprocess-interpreter-poc , it's at a
very early stage but please feel free to check it out and provide any
feedback.

On enabling/disabling this feature: if it is easy to provide such a
choice (which we are not sure of yet), then there is no reason not to
implement option 4.


On Sun, Jan 18, 2015 at 9:46 AM, moon soo Lee <le...@gmail.com> wrote:

> Alexander, thanks for your interest in implementing this feature.
>
> I think there are some alternatives for enabling/disabling this feature:
>
> 1) Run all interpreters in separate processes
> 2) Let the user select which interpreters run in a separate process
> 3) Let the interpreter choose whether it runs in a separate process or not
> 4) Let the user select, but have the interpreter provide a default
> selection
>
> What do you guys think? To me, 4) gives the user flexibility as well as
> simplicity.
>
>
> And I can easily think of some possible improvements after the first step:
>
> a) Run the interpreter process not only on the local machine but also on
> a remote machine (would it be helpful for anything that works with YARN?)
> b) An option to keep the separate process running when Zeppelin
> terminates, so Zeppelin can reconnect when it is restarted
> c) Implement remote interpreters in different languages (e.g. pyspark)
>
> So I think it will be better for the future if the IPC implementation
> leaves open the possibility of RPC and support for various languages.
>
>
> Best,
> moon

Re: Run interpreter in separate process.

Posted by moon soo Lee <le...@gmail.com>.
Alexander, thanks for your interest in implementing this feature.

I think there are some alternatives for enabling/disabling this feature:

1) Run all interpreters in separate processes
2) Let the user select which interpreters run in a separate process
3) Let the interpreter choose whether it runs in a separate process or not
4) Let the user select, but have the interpreter provide a default
selection

What do you guys think? To me, 4) gives the user flexibility as well as
simplicity.
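
Option 4 amounts to a simple precedence rule, sketched below. The names
RunModeSketch, RunMode and resolve are hypothetical and only illustrate the
rule: an explicit user choice wins, otherwise the interpreter's default
applies.

```java
import java.util.Optional;

// Hypothetical sketch of option 4; not Zeppelin's actual configuration code.
class RunModeSketch {

    enum RunMode { SAME_PROCESS, SEPARATE_PROCESS }

    // The user's explicit setting takes precedence; if the user has not
    // chosen, fall back to the default declared by the interpreter.
    static RunMode resolve(Optional<RunMode> userChoice, RunMode interpreterDefault) {
        return userChoice.orElse(interpreterDefault);
    }

    public static void main(String[] args) {
        // No user setting: the interpreter's default is used.
        System.out.println(resolve(Optional.empty(), RunMode.SEPARATE_PROCESS));
        // User setting present: it overrides the interpreter's default.
        System.out.println(resolve(Optional.of(RunMode.SAME_PROCESS), RunMode.SEPARATE_PROCESS));
    }
}
```

Options 1 to 3 are special cases of the same rule: option 2 ignores the
interpreter default, and option 3 ignores the user choice.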


And I can easily think of some possible improvements after the first step:

a) Run the interpreter process not only on the local machine but also on
a remote machine (would it be helpful for anything that works with YARN?)
b) An option to keep the separate process running when Zeppelin
terminates, so Zeppelin can reconnect when it is restarted
c) Implement remote interpreters in different languages (e.g. pyspark)

So I think it will be better for the future if the IPC implementation
leaves open the possibility of RPC and support for various languages.


Best,
moon



On Thu, Jan 15, 2015 at 8:58 PM, Alex B <ab...@nflabs.com> wrote:

> I think I'd like to volunteer to implement this feature.
>
> My perspective is: we solve 2 immediate problems and at the end
> have an interpreter API mature enough to be able to add pyspark
> support.
>
> The immediate problems we solve are:
>  - Multiple interpreters running at the same time currently mix their
> stdout/stderr
>  - In the case of JVMs there is also a classloader collision problem,
> which does not allow SparkSQL to work with Spark 1.2
>
> Suggested solution:
> Separate each interpreter into its own process.
>
> This means bringing to the codebase things like:
>  - An API for managing the runtime state of that process
>  - The IPC implementation itself (Thrift?)
>  - Basic classloading for JVM-based interpreters
>
> Please let me know if there is something I have missed here!
>
> --
> Kind regards,
> Alexander

Re: Run interpreter in separate process.

Posted by Alex B <ab...@nflabs.com>.
I think I'd like to volunteer to implement this feature.

My perspective is: we solve 2 immediate problems and at the end
have an interpreter API mature enough to be able to add pyspark
support.

The immediate problems we solve are:
 - Multiple interpreters running at the same time currently mix their
stdout/stderr
 - In the case of JVMs there is also a classloader collision problem,
which does not allow SparkSQL to work with Spark 1.2

Suggested solution:
Separate each interpreter into its own process.

This means bringing to the codebase things like:
 - An API for managing the runtime state of that process
 - The IPC implementation itself (Thrift?)
 - Basic classloading for JVM-based interpreters

Please let me know if there is something I have missed here!

--
Kind regards,
Alexander

> On 13 Jan 2015, at 20:26, moon soo Lee <mo...@apache.org> wrote:
>
> Hi guys,
>
> I'm bringing an issue https://github.com/NFLabs/zeppelin/issues/278 to this
> mailing list for discussion.
>
> Zeppelin creates each interpreter instance with its own classloader to
> avoid interference (dependency conflicts, singletons, static members) with
> other interpreter instances. This has worked well so far, but I can see
> some limitations.
>
> a) When multiple interpreter instances run concurrently, they cannot
> avoid interfering with each other's stdin/stdout/stderr.
> b) When one of an interpreter's dependencies is designed (i.e. hardcoded)
> to use the Application classloader, it won't work within Zeppelin, because
> Zeppelin loads the interpreter's dependency jars in its thread context
> classloader, not the Application classloader.
>
> Running the interpreter in a separate process is the solution I can think
> of. In detail, because an interpreter is abstracted by its public methods,
> everything can be done simply if we can call those methods remotely
> through some sort of RPC mechanism.
>
> Therefore
>
> a) A main entry point and run script to run an interpreter in a separate
> process
> b) An RPC mechanism between Zeppelin and the separate interpreter process
> c) An option to enable/disable this capability
>
> are the major tasks I have in mind.
>
> What do you guys think? Please share any ideas you have.
>
> Best,
> moon