Posted to user@oozie.apache.org by Micah Whitacre <mk...@gmail.com> on 2014/05/27 20:53:29 UTC

Retrieving Oozie Child Job Ids

A while ago I logged OOZIE-1767[1] to help track child jobs that are
launched by running Crunch or Cascading code inside of an Oozie Java
action.  Oozie currently only tracks the job ids of the launching job and
not the ids of the jobs that might get spawned.  While the suggestion on
the issue is not necessarily the right solution, it got me thinking about
whether tracking the child jobs would even solve what I was looking for.

I'm curious: where are those child job ids stored?  Are they retrievable, or
are they only usable as a parameter/property inside of the workflow
instance (e.g. inside of the workflow spec)?  Or is that information
stored in a way that I could retrieve it later from Oozie using something
like the REST API[2]?

[1] - https://issues.apache.org/jira/browse/OOZIE-1767
[2] - http://oozie.apache.org/docs/3.3.2/WebServicesAPI.html#Job_Information

Re: Retrieving Oozie Child Job Ids

Posted by Mona Chitnis <ch...@yahoo-inc.com>.

The action data, comprising child job ids, stats (Pig/MapReduce), and error
info, is uploaded to HDFS as a sequence file (.seq) in the action directory,
e.g. /user/micah/oozie-mica/<wf-id>-<action-name>

Direct links to all child jobs launched by Pig/Hive are also shown under
the 'Child Jobs' tab in the Oozie web console info for your Oozie job.

The following command can parse the sequence file:
$ hdfs dfs -text <path-to-sequence-file>
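
As a rough sketch of how the decoded output could be post-processed, the
Python below shells out to the same command and pulls anything that looks
like a Hadoop job id out of the text.  The function names are made up for
illustration, and the regular expression only assumes the usual
job_<timestamp>_<sequence> id format; it does not rely on any particular
layout of the action data.

```python
import re
import subprocess

# Hadoop MapReduce job ids have the form job_<cluster-timestamp>_<sequence>.
JOB_ID_PATTERN = re.compile(r"job_\d+_\d+")


def extract_job_ids(text):
    """Return the distinct job ids found in text, in first-seen order."""
    seen = []
    for job_id in JOB_ID_PATTERN.findall(text):
        if job_id not in seen:
            seen.append(job_id)
    return seen


def read_action_data(path):
    """Decode the sequence file at `path` by running `hdfs dfs -text`."""
    result = subprocess.run(
        ["hdfs", "dfs", "-text", path],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


# Hypothetical usage, with the path left as a placeholder:
#   ids = extract_job_ids(read_action_data("<path-to-sequence-file>"))
```

Keeping the parsing separate from the subprocess call means the id
extraction can be tried on saved output without a cluster at hand.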


On 5/28/14, 10:36 AM, "Micah Whitacre" <mk...@gmail.com> wrote:

>>> In other words, the Oozie server doesn't find out about launched MR
>>> jobs until after they're done.
>
>Where is that information searchable or retrievable inside of Oozie after
>the MR jobs are done?  Through the REST endpoint?  Hive might be a more
>analogous example than MR.  If a Hive query kicked off 1:M jobs, where in
>Oozie can I retrieve those after they have completed?  Can I only do it
>inside of the workflow, or is the data stored somewhere more persistent?
>
>>> You/We should make Crunch and Cascading actions instead of using the
>>> Java action.  Is there some reason this would be a bad idea?
>
>I'm not opposed to the idea, and while working on CRUNCH-272[1] and
>digging through the Oozie code I think that is needed, based on how the
>data is inserted into the WorkflowActionBean.  For disclosure, I'm not
>actually looking for Cascading support, but figured if this was solved
>generically then both projects would get the benefit.
>
>[1] - https://issues.apache.org/jira/browse/CRUNCH-272

Re: Retrieving Oozie Child Job Ids

Posted by Micah Whitacre <mk...@gmail.com>.

>> In other words, the Oozie server doesn't find out about launched MR
>> jobs until after they're done.

Where is that information searchable or retrievable inside of Oozie after
the MR jobs are done?  Through the REST endpoint?  Hive might be a more
analogous example than MR.  If a Hive query kicked off 1:M jobs, where in
Oozie can I retrieve those after they have completed?  Can I only do it
inside of the workflow, or is the data stored somewhere more persistent?

>> You/We should make Crunch and Cascading actions instead of using the
>> Java action.  Is there some reason this would be a bad idea?

I'm not opposed to the idea, and while working on CRUNCH-272[1] and digging
through the Oozie code I think that is needed, based on how the data is
inserted into the WorkflowActionBean.  For disclosure, I'm not actually
looking for Cascading support, but figured if this was solved generically
then both projects would get the benefit.

[1] - https://issues.apache.org/jira/browse/CRUNCH-272


On Wed, May 28, 2014 at 12:16 PM, Robert Kanter <rk...@cloudera.com> wrote:

> Oozie doesn't actually track the child IDs (except for the MR action, which
> has slightly different behavior in that the launcher job exits
> immediately); it only reports them once the launcher has finished, which
> happens after the actions have actually finished.  In other words, the
> Oozie server doesn't find out about launched MR jobs until after they're
> done.
>
> If you're using Hadoop 2.4.0 or later (or a Hadoop with YARN-1461 and
> MAPREDUCE-5699), you should take a look at OOZIE-1722, where Oozie uses
> YARN tags to search for jobs that may have already been launched.  Would
> the tags be helpful?
>
> That said, I think my original comment on OOZIE-1767 makes sense:
>
> > Why not just make Crunch and Cascading actions? We can then also give
> > them their own sharelibs, handle any other custom logic, and give them
> > easier schemas. I think this would make it easier for other users too.
>
> You/We should make Crunch and Cascading actions instead of using the Java
> action.  Is there some reason this would be a bad idea?

Re: Retrieving Oozie Child Job Ids

Posted by Robert Kanter <rk...@cloudera.com>.

Oozie doesn't actually track the child IDs (except for the MR action, which
has slightly different behavior in that the launcher job exits
immediately); it only reports them once the launcher has finished, which
happens after the actions have actually finished.  In other words, the
Oozie server doesn't find out about launched MR jobs until after they're
done.
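
For what it's worth, once the launcher has finished, whatever child IDs
were reported might be read back through the job information call of the
web services API.  The Python below is only a sketch under assumptions: it
assumes the job-info JSON has an "actions" list whose entries carry an
"externalChildIDs" field holding a comma-separated string (or null), which
should be checked against the docs for your Oozie version.  The function
names and the example server URL are placeholders.

```python
import json
from urllib.request import urlopen


def job_info_url(oozie_url, job_id):
    """Build the job-information URL for the Oozie web services API (v1)."""
    return "{}/v1/job/{}?show=info".format(oozie_url.rstrip("/"), job_id)


def child_ids_by_action(job_info):
    """Map each action name to the list of child job ids reported for it.

    Assumption: every action has an `externalChildIDs` field that is either
    null or a comma-separated string of job ids.
    """
    result = {}
    for action in job_info.get("actions", []):
        raw = action.get("externalChildIDs") or ""
        result[action["name"]] = [i.strip() for i in raw.split(",") if i.strip()]
    return result


def fetch_child_ids(oozie_url, job_id):
    """Fetch job info from the server and extract the child ids per action."""
    with urlopen(job_info_url(oozie_url, job_id)) as response:
        return child_ids_by_action(json.load(response))


# Hypothetical usage (placeholder server and workflow id):
#   fetch_child_ids("http://localhost:11000/oozie", "<wf-id>")
```

With a Java action the list would presumably come back empty, which is
the gap OOZIE-1767 is about.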

If you're using Hadoop 2.4.0 or later (or a Hadoop with YARN-1461 and
MAPREDUCE-5699), you should take a look at OOZIE-1722, where Oozie uses
YARN tags to search for jobs that may have already been launched.  Would
the tags be helpful?
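
To give an idea of what searching by tag could look like from the outside,
here is a small Python sketch against the ResourceManager REST API.  It
assumes the cluster applications endpoint accepts an applicationTags query
parameter and returns JSON shaped like {"apps": {"app": [...]}}; please
verify both against your Hadoop version.  The tag value is whatever was
set on the job, and nothing here claims to know the tag format Oozie
generates.  All function names are made up for the example.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def apps_by_tag_url(rm_url, tag):
    """Build a ResourceManager URL that filters applications by tag."""
    query = urlencode({"applicationTags": tag})
    return "{}/ws/v1/cluster/apps?{}".format(rm_url.rstrip("/"), query)


def application_ids(apps_response):
    """Pull application ids out of the response; tolerate an empty result."""
    apps = (apps_response.get("apps") or {}).get("app") or []
    return [app["id"] for app in apps]


def find_tagged_applications(rm_url, tag):
    """Ask the ResourceManager for applications carrying the given tag."""
    request = Request(apps_by_tag_url(rm_url, tag),
                      headers={"Accept": "application/json"})
    with urlopen(request) as response:
        return application_ids(json.load(response))


# Hypothetical usage (placeholder host and tag):
#   find_tagged_applications("http://resourcemanager:8088", "<tag>")
```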

That said, I think my original comment on OOZIE-1767 makes sense:

> Why not just make Crunch and Cascading actions? We can then also give them
> their own sharelibs, handle any other custom logic, and give them easier
> schemas. I think this would make it easier for other users too.

You/We should make Crunch and Cascading actions instead of using the Java
action.  Is there some reason this would be a bad idea?
