Posted to dev@oozie.apache.org by "Hadoop QA (JIRA)" <ji...@apache.org> on 2015/04/30 00:55:07 UTC

[jira] [Commented] (OOZIE-2223) Improve documentation with regard to Java action retries

    [ https://issues.apache.org/jira/browse/OOZIE-2223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14520460#comment-14520460 ] 

Hadoop QA commented on OOZIE-2223:
----------------------------------

Testing JIRA OOZIE-2223

Cleaning local git workspace

----------------------------

{color:green}+1 PATCH_APPLIES{color}
{color:green}+1 CLEAN{color}
{color:red}-1 RAW_PATCH_ANALYSIS{color}
.    {color:green}+1{color} the patch does not introduce any @author tags
.    {color:green}+1{color} the patch does not introduce any tabs
.    {color:green}+1{color} the patch does not introduce any trailing spaces
.    {color:red}-1{color} the patch contains 1 line(s) longer than 132 characters
.    {color:red}-1{color} the patch does not add/modify any testcase
{color:green}+1 RAT{color}
.    {color:green}+1{color} the patch does not seem to introduce new RAT warnings
{color:green}+1 JAVADOC{color}
.    {color:green}+1{color} the patch does not seem to introduce new Javadoc warnings
{color:green}+1 COMPILE{color}
.    {color:green}+1{color} HEAD compiles
.    {color:green}+1{color} patch compiles
.    {color:green}+1{color} the patch does not seem to introduce new javac warnings
{color:green}+1 BACKWARDS_COMPATIBILITY{color}
.    {color:green}+1{color} the patch does not change any JPA Entity/Column/Basic/Lob/Transient annotations
.    {color:green}+1{color} the patch does not modify JPA files
{color:green}+1 TESTS{color}
.    Tests run: 1651
{color:green}+1 DISTRO{color}
.    {color:green}+1{color} distro tarball builds with the patch 

----------------------------
{color:red}*-1 Overall result, please check the reported -1(s)*{color}


The full output of the test-patch run is available at

.   https://builds.apache.org/job/oozie-trunk-precommit-build/2339/

> Improve documentation with regard to Java action retries
> --------------------------------------------------------
>
>                 Key: OOZIE-2223
>                 URL: https://issues.apache.org/jira/browse/OOZIE-2223
>             Project: Oozie
>          Issue Type: Improvement
>          Components: docs
>    Affects Versions: 3.3.2, 4.1.0, 4.0.1
>            Reporter: Ben Roling
>         Attachments: OOZIE-2223-1.patch
>
>
> My organization has been bitten by a mistake in the way we have written Java action applications.  I would like to introduce a documentation change that might reduce the likelihood that others new to Oozie make the same mistake.
> The mistake is not accounting for the possibility that launcher tasks will fail for reasons such as cluster maintenance.  We have a number of jobs that take input and output paths as arguments.  Our code had been deliberately written so that the job fails if the output path already exists, to avoid inadvertently deleting an output that may have been consumed by a downstream job.
> This has bitten us during cluster maintenance that requires TaskTracker restarts.  During such an event any launcher running on a TaskTracker at the time of the TaskTracker restart fails and is retried on another TaskTracker.  The new attempt of the launcher task fails due to the output directory already existing.  This in turn fails the whole workflow.  Maintenance that requires restarting all TaskTrackers can end up causing a lot of workflow failures.
> The current documentation does hint at such issues via mention of the “prepare” block, but I don’t think the explanation of this block is clear enough for newcomers to understand its use.  Furthermore, I’m not sure the prepare block is the best answer for the specific types of issues I am referring to.  A “delete” action in a prepare block will delete content regardless of its state, which makes it possible for a previously completed, good output to be deleted.  This can break traceability when there is a need to trace an output back to the inputs that produced it.
> I believe a more appropriate way to handle the possibility of launcher task failure is to write the action so that it reuses a previously completed output without deleting or reprocessing it.  Only if it detects an incomplete output does it delete the output and re-run the processing.  This protects against accidental output destruction.
> Furthermore, some types of actions spawn activity that runs asynchronously outside the context of the launcher task itself.  In such cases the action author must take care to clean up any stray activity spawned before the initial launcher task failed, so that it does not collide with activity produced by the new attempt of the launcher.  In the case of my organization, such activity includes child M/R jobs spawned from the Apache Crunch pipelines we invoke from our Java actions.  Depending on the design of the action, it may be necessary to find and kill such child jobs before invoking the new pipeline and spawning new child jobs.
> I will attach a patch that demonstrates one possible documentation improvement to shed light on these issues, but I would appreciate feedback and any other ideas.
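
For readers of this thread, the "reuse a complete output, redo an incomplete one" idea from the description could be sketched roughly as follows. This is only an illustration under stated assumptions, not code from the attached patch and not an Oozie API: the class RetrySafeOutput, the Work interface, and the ensureOutput method are hypothetical names, the local filesystem (java.nio.file) stands in for HDFS, and completeness is assumed to be signalled by a marker file written last (named _SUCCESS here, in the spirit of the marker Hadoop's FileOutputCommitter writes).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper: makes a processing step safe to re-run after a launcher retry. */
public class RetrySafeOutput {

    /** Assumed completion marker; it is written only after the work has finished. */
    public static final String MARKER = "_SUCCESS";

    /** The actual processing step, supplied by the caller (hypothetical interface). */
    public interface Work {
        void produce(Path outputDir) throws IOException;
    }

    /**
     * Runs the work unless a complete output already exists.
     * Returns true if the work ran, false if an earlier complete output was reused.
     */
    public static boolean ensureOutput(Path outputDir, Work work) throws IOException {
        if (Files.exists(outputDir.resolve(MARKER))) {
            // A previous attempt finished: keep its output untouched.
            return false;
        }
        if (Files.exists(outputDir)) {
            // Output exists without the marker: treat it as incomplete and discard it.
            deleteRecursively(outputDir);
        }
        Files.createDirectories(outputDir);
        work.produce(outputDir);
        // Write the marker last so a crash mid-way never looks like success.
        Files.createFile(outputDir.resolve(MARKER));
        return true;
    }

    /** Deletes a directory tree, children before parents. */
    static void deleteRecursively(Path dir) throws IOException {
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }
}
```

With this shape, a retried launcher attempt that calls ensureOutput with the same output path either returns immediately (the first attempt completed) or cleans up and redoes the work (the first attempt died part-way). A real Java action would presumably do the same checks through the Hadoop FileSystem API rather than java.nio.file, and this sketch does not address killing stray child jobs.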


