Posted to dev@crunch.apache.org by Chao Shi <st...@live.com> on 2013/03/17 07:57:38 UTC

Re: About status web page

My previous post seems not to have been delivered successfully, so this time
I'm attaching the patch gzipped. The patch is large since it contains jquery
and viz.js.

On Fri, Mar 15, 2013 at 11:02 PM, Chao Shi <st...@live.com> wrote:

> Hey guys,
>
> I have a very simple prototype for this. It uses DotfileWriter to generate
> the dot file and renders it with viz.js.
>
> There are lots of things that could be improved:
> - show completed/running jobs in different colors, and perhaps job
> progress as a percentage
> - interactive things in the UI, e.g. clicking on a job navigates to the JT
> page; auto refresh
> - configurable port
> - ... and more
>
> I'd like to hear what you think of the prototype before continuing. A
> quick way to demo it is to apply the patch and run some integration tests.
> While the integration tests run, you can navigate to http://localhost:10080.
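For anyone who wants a feel for the moving parts before reading the patch: the server side can be as small as one handler that exposes the DotfileWriter output over HTTP, with viz.js doing the rendering in the browser. This is only a sketch using the JDK's built-in HttpServer, not the code from the patch; the path and class name are made up.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class DotStatusServer {

    // Expose the DOT text over HTTP; a page with viz.js can fetch
    // /plan.dot and render it client-side. Port 0 means "pick a free port".
    public static HttpServer serve(String dot, int port) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/plan.dot", exchange -> {
            byte[] body = dot.getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "text/vnd.graphviz");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
        return server;
    }
}
```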
>
> On Wed, Feb 27, 2013 at 3:30 PM, Matthias Friedrich <ma...@mafr.de> wrote:
>
>> On Wednesday, 2013-02-27, Chao Shi wrote:
>> > I'm developing a complex pipeline (30+ MRs plus lots of joins). I have a
>> > hard time understanding which part of the pipeline spends the most
>> > running time and how much intermediate output it produces. Crunch's
>> > optimization work is great, but it makes the execution plan difficult
>> > to understand. Each time I modify the pipeline, I have to dump the dot
>> > file and run graphviz to generate a new picture and check whether
>> > anything is wrong.
>> >
>> > About security, I'm not familiar with how Hadoop does it. I will try to
>> > reuse Hadoop's HttpServer (does it have something to do with security?).
>> > The bottom line is to keep this feature disabled by default and let
>> > users enable it at their own risk.
>>
>> OK, sounds good.
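The disabled-by-default idea can be a one-line check on the job configuration. A sketch below; the property names are invented for illustration, and java.util.Properties stands in for Hadoop's Configuration (which offers the same getBoolean/getInt-style lookups):

```java
import java.util.Properties;

public class StatusServerConfig {
    // Hypothetical keys -- not existing Crunch or Hadoop properties.
    static final String ENABLED_KEY = "crunch.status.server.enabled";
    static final String PORT_KEY    = "crunch.status.server.port";

    // Off unless the user explicitly opts in.
    public static boolean isEnabled(Properties conf) {
        return Boolean.parseBoolean(conf.getProperty(ENABLED_KEY, "false"));
    }

    // 0 asks the OS for any free port; a fixed port can be configured instead.
    public static int port(Properties conf) {
        return Integer.parseInt(conf.getProperty(PORT_KEY, "0"));
    }
}
```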
>>
>> > If this feature is enabled, the user can choose either an unused port
>> > or a specified port. I haven't yet figured out how the user would learn
>> > the randomly picked port (via the log?). I will work on a prototype
>> > first, and see whether the status page is generally useful.
>>
>> Yeah, logging the URL would probably be the only thing that works. Not
>> counting fancy stuff like MDNS ;-)
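The "log the URL" idea maps directly onto how the JDK reports an OS-assigned port: bind to port 0, ask the bound socket which port it got, and print that. A minimal sketch (a real server would keep the socket it bound rather than a throwaway ServerSocket):

```java
import java.net.InetSocketAddress;
import java.net.ServerSocket;

public class StatusUrlLogger {
    // Bind to the requested port (0 = let the OS pick) and return the port
    // actually bound, so the caller can log the resulting URL.
    public static int bindAndLog(int requestedPort) throws Exception {
        try (ServerSocket socket = new ServerSocket()) {
            socket.bind(new InetSocketAddress(requestedPort));
            int port = socket.getLocalPort();
            System.out.println("Status page at http://localhost:" + port + "/");
            return port;
        }
    }
}
```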
>>
>> In my opinion, we should try to get this done with the dependencies that
>> we already get through Hadoop. Each additional library we add to Crunch
>> will cause interoperability problems for someone.
>>
>> Regards,
>>   Matthias
>>
>>
>

Re: About status web page

Posted by Chao Shi <st...@live.com>.
Thanks, Matthias, for the advice.

I agree with you that it is not useful for small jobs, and I'm OK with
keeping it out of crunch core.

The challenge is that there may be a lot of communication between the
StatusServer and crunch core: the StatusServer needs to know the DAG and
track the running status of each job. The ideal way is to give
PipelineExecution some APIs for this. In this approach, a user constructs
the StatusServer and specifies whatever port he likes.
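To make the coupling concrete: the kind of API PipelineExecution would need to grow is a small read-only view that a StatusServer can poll. The names below are purely illustrative -- this is not an existing Crunch interface:

```java
import java.util.Collections;
import java.util.Map;

// A read-only snapshot a StatusServer could poll. Illustrative only;
// PipelineExecution does not expose anything like this today.
interface PipelineStatus {
    String planDot();                  // the DAG, as DotfileWriter output
    Map<String, Double> jobProgress(); // job name -> fraction complete
}

// A fixed snapshot standing in for a live pipeline, for demonstration.
class FixedSnapshot implements PipelineStatus {
    public String planDot() {
        return "digraph plan { job1 -> job2; }";
    }
    public Map<String, Double> jobProgress() {
        return Collections.singletonMap("job1", 0.5);
    }
}
```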

The POST approach has similar difficulties: we need to define the protocol.
The good thing is that one could build a central status server for all
crunch jobs, whose lifetime is also longer. BTW, besides debugging,
monitoring is another goal (e.g. seeing which stage the pipeline is stuck
at, and where the critical path is).

The license of viz.js is here
<https://github.com/mdaines/viz.js/blob/master/COPYING>.
I'm not sure whether we can use it. If viz.js is not possible, the worst
case is to call "graphviz" server-side via a pipe.
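The server-side fallback is straightforward with ProcessBuilder: pipe the DOT text into the local `dot` binary's stdin and read the rendered bytes back. A sketch (assumes graphviz is installed on the server):

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class GraphvizPipe {

    // The command line for rendering, e.g. "dot -Tsvg".
    static ProcessBuilder dotCommand(String format) {
        return new ProcessBuilder("dot", "-T" + format);
    }

    // Feed DOT text to graphviz on stdin and collect the rendered output.
    public static byte[] render(String dot, String format) throws Exception {
        Process p = dotCommand(format).start();
        try (OutputStream stdin = p.getOutputStream()) {
            stdin.write(dot.getBytes(StandardCharsets.UTF_8));
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        try (InputStream stdout = p.getInputStream()) {
            for (int n; (n = stdout.read(buf)) != -1; ) {
                out.write(buf, 0, n);
            }
        }
        p.waitFor();
        return out.toByteArray();
    }
}
```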


Re: About status web page

Posted by Matthias Friedrich <ma...@mafr.de>.
Hi,

I'm still not convinced that running a web service from a batch job is
a good technical fit, because the job is transient in nature. For small
jobs you only have a second or two to hit reload in your browser.

How about leaving the server out of crunch core and just adding
functionality for a Pipeline to post its Configuration to an external
web service? In debug mode, the Pipeline could do an HTTP PUT to a
well-known address (http://localhost:10080/jobs/, but that could be
configurable). When debugging, users would start the web service
separately if they need it.
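The PUT described above needs nothing beyond the JDK. A sketch of the client side (the endpoint and payload format would be whatever protocol the external service defines; text/plain here is just a placeholder):

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class StatusPoster {

    // PUT the pipeline description to the external status service and
    // return the HTTP response code.
    public static int put(String endpoint, String body) throws Exception {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(endpoint).openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "text/plain; charset=utf-8");
        try (OutputStream os = conn.getOutputStream()) {
            os.write(body.getBytes(StandardCharsets.UTF_8));
        }
        return conn.getResponseCode();
    }
}
```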

The advantage is that crunch core stays clean and the web service
sees more than just one Pipeline, so it can display a history of
executed Pipelines.

BTW, what's the license of viz.js?

Regards,
  Matthias
