Posted to dev@crunch.apache.org by "Micah Whitacre (JIRA)" <ji...@apache.org> on 2014/04/14 01:30:15 UTC

[jira] [Updated] (CRUNCH-272) Unable to correlate crunch jobs within Oozie

     [ https://issues.apache.org/jira/browse/CRUNCH-272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Micah Whitacre updated CRUNCH-272:
----------------------------------

    Attachment: CRUNCH-272_prototype.patch

Thinking about how to solve this, I see three possible solutions:

1. We create a custom Oozie action that does something similar to what the HiveAction does[1][2] and simply searches the log files for strings that match a regex.  The Oozie community called this suboptimal when I suggested a small tweak to JavaAction to do this, and they weren't enthused by it.  Additionally, we'd have to pick a regex that fits all of our pipeline job types (MR + Spark).
2. We create a launcher framework, most likely with a Launcher API that all Crunch consumers wanting to use Oozie would have to implement so that they can report their PipelineResults back to us and we pull the job ids off of those.  This would then be coupled with a custom Oozie action.
3. We continue to let consumers use the standard Oozie Java action and instead provide facilities/helpers to report the child job ids in a manner consistent with how Oozie expects the job ids to be reported.  This is the approach the attached patch takes.  The downside is that if jobs invoked by a materialize/iterator call are not tracked in the PipelineResult stages, we would miss some jobs.  This would be a problem with approach #2 as well.
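
To make approach #3 concrete, here is a rough sketch of what such a helper might look like.  It is not the code from the attached patch.  It assumes that Oozie reads child job ids from a "hadoopJobs" entry in the properties file named by the "oozie.action.output.properties" system property, which is my reading of how HiveMain/LauncherMain report them; both names should be checked against the Oozie version in use.  ChildJobIdReporter and its methods are hypothetical names, and how the job ids are pulled off the PipelineResult is deliberately left out.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

/** Hypothetical helper: reports child job ids the way a Java action reports output. */
public class ChildJobIdReporter {

    // Assumption: the property key Oozie inspects for child job ids.
    static final String HADOOP_JOBS_KEY = "hadoopJobs";

    // Assumption: the system property naming the action's output properties file.
    static final String OUTPUT_PROPS = "oozie.action.output.properties";

    /** Builds a Properties object holding the job ids as one comma-separated value. */
    static Properties toProperties(List<String> jobIds) {
        Properties props = new Properties();
        props.setProperty(HADOOP_JOBS_KEY, String.join(",", jobIds));
        return props;
    }

    /** Writes the job ids to the given file in java.util.Properties format. */
    static void report(List<String> jobIds, File target) throws IOException {
        try (OutputStream os = new FileOutputStream(target)) {
            toProperties(jobIds).store(os, "");
        }
    }

    public static void main(String[] args) throws IOException {
        // In a real Java action the ids would come from the pipeline run;
        // here they are simply taken from the command line.
        String path = System.getProperty(OUTPUT_PROPS);
        if (path == null) {
            System.err.println(OUTPUT_PROPS + " is not set; not running under Oozie?");
            return;
        }
        report(Arrays.asList(args), new File(path));
    }
}
```

A consumer would call report(...) at the end of their Java action's main method, after the pipeline has finished, passing whatever job ids they were able to collect.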

I still need to test out my prototype code on a cluster with Oozie, but I'm reasonably sure it will work.  Thoughts?

[1] - https://github.com/apache/oozie/blob/master/sharelib/hive/src/main/java/org/apache/oozie/action/hadoop/HiveMain.java#L298
[2] - https://github.com/apache/oozie/blob/master/sharelib/oozie/src/main/java/org/apache/oozie/action/hadoop/LauncherMain.java#L42

> Unable to correlate crunch jobs within Oozie
> --------------------------------------------
>
>                 Key: CRUNCH-272
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-272
>             Project: Crunch
>          Issue Type: Improvement
>            Reporter: Mike Zimmerman
>            Assignee: Micah Whitacre
>         Attachments: CRUNCH-272_prototype.patch
>
>
> I'm not really sure if this should be logged to Oozie or to Crunch, so please feel free to move as needed.
> I would like to request a way to decorate map/reduce jobs that are spawned by a Crunch pipeline so that I can programmatically determine their origin.  The primary use case for this is integration with Oozie.  Oozie launches a single map job to run a java action (in our case this java action runs a crunch job).  Traceability from this original "launcher" job to the jobs created by the crunch job is impossible without trolling logs.  This leaves a big black hole for the system operator to assess the performance/impact of these jobs.  My initial thought was to provide a simple way to indicate a correlationId or similar on a map/reduce job and then make it accessible within Oozie to query for.  Obviously, that request would have to come after the correlation feature was available within map/reduce.



--
This message was sent by Atlassian JIRA
(v6.2#6252)