Posted to user@pig.apache.org by "soren@dopeness.org" <so...@dopeness.org> on 2011/01/11 01:06:47 UTC

calling pig from a web app

I'd be interested to hear people's experience / best practices for running
pig scripts on demand from a web app. What do you use as the calling
mechanism? How do you handle priority / scheduling for ad-hoc or
user-generated tasks?

Best,
Soren

-- 
http://about.me/soren/bio

Re: calling pig from a web app

Posted by Alejandro Abdelnur <tu...@cloudera.com>.
Soren,

Adding to the 'oozie' alternative ...

With Oozie you can do something like:

$ oozie pig -file SCRIPT

The command-line options are aligned with Pig's (you can pass options
straight through). You'll get a job ID back (as if it were a Pig server),
and later you can monitor the progress of the job via the command line, the
API or the web console.
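
For a web app, the call above would usually be wrapped in a small helper.
Here is a minimal Python sketch, not a definitive implementation. Only
`oozie pig -file SCRIPT` comes from this message; the helper names, the
`passthrough` argument, and the idea that the job ID can be read from the
command's output are assumptions made for illustration.

```python
import subprocess

def oozie_pig_argv(script_path, passthrough=()):
    """Build the argument list for `oozie pig -file SCRIPT`.

    `passthrough` holds extra options forwarded unchanged. Which options a
    given Oozie release accepts is not checked here (assumption).
    """
    return ["oozie", "pig", "-file", script_path, *passthrough]

def submit(script_path, passthrough=()):
    """Run the command and return whatever the CLI printed, stripped.

    Assumes the `oozie` client is on PATH and already configured to reach
    the server. The printed text is expected to contain the job ID.
    """
    result = subprocess.run(
        oozie_pig_argv(script_path, passthrough),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.strip()
```

Passing the arguments as a list (no shell) keeps a user-supplied script name
from being interpreted by a shell.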

And you don't need to write an Oozie workflow.xml.

And with Oozie 2.3, about to be released, it becomes even simpler, as you
don't have to worry about the Pig JARs (Oozie now supports a sharelib).

Hope this helps.

Thanks.

Alejandro

On Wed, Jan 12, 2011 at 6:51 AM, soren@dopeness.org <so...@dopeness.org>wrote:

> Thanks Dmitriy, exactly the information I was looking for.

Re: calling pig from a web app

Posted by "soren@dopeness.org" <so...@dopeness.org>.
Thanks Dmitriy, exactly the information I was looking for.

On Tue, Jan 11, 2011 at 1:40 PM, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> Soren,
> The "real" answer is probably to use Oozie under the covers in order to
> handle all kinds of edge conditions w.r.t cluster availability, job
> configuration, etc.
>
> If you don't want to deal with Oozie or Azkaban, you could do something
> like
> the following:
>
> 1) web app that works with a simple "pig job" model. The pig job model
> specifies the script, parameters, status (submitted / running / done /
> killed / died), and a few timestamps as needed.
>
> 2) a daemon process that monitors the table for new jobs and starts them on
> the cluster, updating the table appropriately. You can add all the resource
> constraints, access restrictions, etc here.
>
> The more you develop the daemon and the web app (how about monitoring the
> Pig job through the new PigStats?...), the more you will realize you are
> rebuilding Oozie and start thinking about how to integrate it. But if you
> need something to work by the end of the week, a quickly rolled daemon +
> rails app is probably faster to set up in the short term.
>
> D



-- 
http://about.me/soren/bio

Re: calling pig from a web app

Posted by Julien Le Dem <le...@yahoo-inc.com>.
Also, Pig is not thread-safe so far, so you cannot have multiple threads firing different Pig "queries" in parallel.
Oozie works around this by running Pig from a map task on a slave node. That way each Pig script runs in its own process.
Julien
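
Without Oozie, the same one-process-per-script isolation can be approximated
by launching the `pig` command line as a child process for every request.
The Python sketch below assumes a `pig` executable on PATH; the function
names and the `max_parallel` limit are hypothetical. The threads here only
wait on child processes, so no Pig code is shared between them.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

def pig_argv(script_path, params=None, executable="pig"):
    """Command line for one Pig run: pig -param k=v ... -f script."""
    argv = [executable]
    for key, value in sorted((params or {}).items()):
        argv += ["-param", f"{key}={value}"]
    return argv + ["-f", script_path]

def run_isolated(argvs, max_parallel=4):
    """Run each command in its own child process; return exit codes in order."""
    def run_one(argv):
        return subprocess.run(argv).returncode

    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(run_one, argvs))

if __name__ == "__main__":
    # Stand-in commands so the sketch runs without a Pig installation.
    demo = [[sys.executable, "-c", "pass"] for _ in range(3)]
    print(run_isolated(demo))
```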

On 1/11/11 1:40 PM, "Dmitriy Ryaboy" <dv...@gmail.com> wrote:

Soren,
The "real" answer is probably to use Oozie under the covers in order to
handle all kinds of edge conditions w.r.t cluster availability, job
configuration, etc.

If you don't want to deal with Oozie or Azkaban, you could do something like
the following:

1) web app that works with a simple "pig job" model. The pig job model
specifies the script, parameters, status (submitted / running / done /
killed / died), and a few timestamps as needed.

2) a daemon process that monitors the table for new jobs and starts them on
the cluster, updating the table appropriately. You can add all the resource
constraints, access restrictions, etc here.

The more you develop the daemon and the web app (how about monitoring the
Pig job through the new PigStats?...), the more you will realize you are
rebuilding Oozie and start thinking about how to integrate it. But if you
need something to work by the end of the week, a quickly rolled daemon +
rails app is probably faster to set up in the short term.

D


Re: calling pig from a web app

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Soren,
The "real" answer is probably to use Oozie under the covers in order to
handle all kinds of edge conditions w.r.t cluster availability, job
configuration, etc.

If you don't want to deal with Oozie or Azkaban, you could do something like
the following:

1) web app that works with a simple "pig job" model. The pig job model
specifies the script, parameters, status (submitted / running / done /
killed / died), and a few timestamps as needed.

2) a daemon process that monitors the table for new jobs and starts them on
the cluster, updating the table appropriately. You can add all the resource
constraints, access restrictions, etc here.
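
The two pieces above can be sketched in a few lines of Python. This is an
illustration under assumptions, not a recommended design: the table name,
the column names, SQLite as the store, and the `launch` callback are all
made up here. Only the status values come from the description above.

```python
import sqlite3
import time

SCHEMA = """
CREATE TABLE IF NOT EXISTS pig_jobs (
    id INTEGER PRIMARY KEY,
    script TEXT NOT NULL,
    params TEXT NOT NULL DEFAULT '',
    status TEXT NOT NULL DEFAULT 'submitted',
    submitted_at REAL NOT NULL,
    started_at REAL,
    finished_at REAL
)
"""

def submit_job(db, script, params=""):
    """Web-app side: record a new job and return its id."""
    cur = db.execute(
        "INSERT INTO pig_jobs (script, params, submitted_at) VALUES (?, ?, ?)",
        (script, params, time.time()),
    )
    db.commit()
    return cur.lastrowid

def poll_once(db, launch):
    """Daemon side: start the oldest submitted job, if there is one.

    `launch(script, params)` is a caller-supplied callable that returns
    True on success; it stands in for whatever actually starts Pig.
    Returns the id of the job handled, or None when nothing is waiting.
    """
    row = db.execute(
        "SELECT id, script, params FROM pig_jobs "
        "WHERE status = 'submitted' ORDER BY submitted_at, id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    job_id, script, params = row
    db.execute(
        "UPDATE pig_jobs SET status = 'running', started_at = ? WHERE id = ?",
        (time.time(), job_id),
    )
    db.commit()
    try:
        ok = launch(script, params)
    except Exception:
        ok = False  # a crash in the launcher counts as a dead job
    db.execute(
        "UPDATE pig_jobs SET status = ?, finished_at = ? WHERE id = ?",
        ("done" if ok else "died", time.time(), job_id),
    )
    db.commit()
    return job_id
```

A real daemon would call `poll_once` in a loop with a sleep between polls,
and resource constraints or access checks would go before the `launch` call.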

The more you develop the daemon and the web app (how about monitoring the
Pig job through the new PigStats?...), the more you will realize you are
rebuilding Oozie and start thinking about how to integrate it. But if you
need something to work by the end of the week, a quickly rolled daemon +
rails app is probably faster to set up in the short term.

D

On Tue, Jan 11, 2011 at 1:34 PM, soren@dopeness.org <so...@dopeness.org>wrote:

> Thanks Jeff. I am aware of the Java API; I was hoping to hear from people
> who might already be doing this and learn from their experiences before
> I go down any one particular path.

Re: calling pig from a web app

Posted by "soren@dopeness.org" <so...@dopeness.org>.
Thanks Jeff. I am aware of the Java API; I was hoping to hear from people
who might already be doing this and learn from their experiences before
I go down any one particular path.

On Mon, Jan 10, 2011 at 8:39 PM, Jeff Zhang <zj...@gmail.com> wrote:

> You can use the Java API of Pig. Regarding priority, you can let the user
> choose the priority on the web page. You can also use a scheduler other
> than Hadoop's default FIFO one.



-- 
http://about.me/soren/bio

Re: calling pig from a web app

Posted by Jeff Zhang <zj...@gmail.com>.
You can use the Java API of Pig. Regarding priority, you can let the user choose
the priority on the web page. You can also use a scheduler other than Hadoop's
default FIFO one.
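
One way to honor a user-chosen priority before anything reaches the cluster
is to keep pending requests in a priority queue inside the web app. The
Python sketch below is an assumption-laden illustration: the class name is
hypothetical, the level names mirror Hadoop's job priority levels, and how
the chosen level is passed on to Hadoop or to a non-FIFO scheduler is
deliberately left out.

```python
import heapq
import itertools

# Levels named after Hadoop's job priorities; a lower rank is served first.
RANK = {"VERY_HIGH": 0, "HIGH": 1, "NORMAL": 2, "LOW": 3, "VERY_LOW": 4}

class PendingJobs:
    """Pending scripts ordered by priority, then by arrival order."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO within a level

    def add(self, script, priority="NORMAL"):
        heapq.heappush(self._heap, (RANK[priority], next(self._counter), script))

    def pop(self):
        """Remove and return the next script to run."""
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

An unknown priority name raises `KeyError` in `add`, so the web form should
only offer the names listed in `RANK`.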

On Tue, Jan 11, 2011 at 8:37 AM, Charles Gonçalves <ch...@gmail.com>wrote:

> I second the interest in this topic.
> I'll soon need to create a web interface for my marketing colleagues ...



-- 
Best Regards

Jeff Zhang

Re: calling pig from a web app

Posted by Charles Gonçalves <ch...@gmail.com>.
I second the interest in this topic.
I'll soon need to create a web interface for my marketing colleagues ...

On Mon, Jan 10, 2011 at 10:06 PM, soren@dopeness.org <so...@dopeness.org>wrote:

> I'd be interested to hear people's experience / best practices for running
> pig scripts on demand from a web app. What do you use as the calling
> mechanism? How do you handle priority / scheduling for ad-hoc or
> user-generated tasks?
>
> Best,
> Soren
>
> --
> http://about.me/soren/bio
>



-- 
*Charles Ferreira Gonçalves *
http://homepages.dcc.ufmg.br/~charles/
UFMG - ICEx - Dcc
Cel.: 55 31 87741485
Tel.:  55 31 34741485
Lab.: 55 31 34095840