Posted to common-dev@hadoop.apache.org by "Alejandro Abdelnur (JIRA)" <ji...@apache.org> on 2007/07/02 15:05:04 UTC

[jira] Created: (HADOOP-1558) changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)

changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)
------------------------------------------------------------------------------------------------------

                 Key: HADOOP-1558
                 URL: https://issues.apache.org/jira/browse/HADOOP-1558
             Project: Hadoop
          Issue Type: Improvement
          Components: mapred
         Environment: all
            Reporter: Alejandro Abdelnur
             Fix For: 0.14.0


Add OutputFormat methods like:

/** Called to initialize output for this job. */
void initialize(JobConf job) throws IOException;

/** Called to finalize output for this job. */
void commit(JobConf job) throws IOException;

In the base implementation for FileSystem output, initialize() might then create a temporary directory for the job, removing any that already exists, and commit() could rename the temporary output directory to the final name.

The existing checkOutputSpecs() would continue to throw an exception if the final output already exists.
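
A minimal sketch of the flow described above, written against the local file system (java.nio.file) rather than Hadoop's FileSystem API. The class name TempDirOutput and the "_tmp_" naming scheme are assumptions made for illustration only; they are not taken from the patch.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical stand-in for a file-based output format; not Hadoop's API. */
class TempDirOutput {
  private final Path finalDir;
  private final Path tempDir;

  TempDirOutput(Path finalDir) {
    this.finalDir = finalDir;
    // Assumed naming scheme: a sibling of the final output directory.
    this.tempDir = finalDir.resolveSibling("_tmp_" + finalDir.getFileName());
  }

  Path getTempDir() {
    return tempDir;
  }

  /** Mirrors checkOutputSpecs(): refuse to run if the final output exists. */
  void checkOutputSpecs() throws IOException {
    if (Files.exists(finalDir)) {
      throw new FileAlreadyExistsException(finalDir.toString());
    }
  }

  /** Removes any leftover temporary directory, then creates a fresh one. */
  void initialize() throws IOException {
    if (Files.exists(tempDir)) {
      List<Path> paths;
      try (Stream<Path> walk = Files.walk(tempDir)) {
        // Reverse order so that children are deleted before their parents.
        paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
      }
      for (Path p : paths) {
        Files.delete(p);
      }
    }
    Files.createDirectories(tempDir);
  }

  /** Publishes the output by renaming the temporary directory. */
  void commit() throws IOException {
    Files.move(tempDir, finalDir);
  }
}
```

With this shape, a job that crashed before commit() can simply be re-run: checkOutputSpecs() still passes because the final directory was never created, and initialize() discards whatever the crashed run left behind.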

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Re: [jira] Updated: (HADOOP-1558) changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)

Posted by Alejandro Abdelnur <tu...@gmail.com>.

Suggestions make sense.

I was looking at the Task class and it seems too Map/Reduce Task
specific so I'll need some help here.

Is it your intention to run the initialize/commit tasks on the JT box,
or should they run on the slaves?

Thxs.

A

On 7/11/07, Doug Cutting (JIRA) <ji...@apache.org> wrote:
>
>      [ https://issues.apache.org/jira/browse/HADOOP-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Doug Cutting updated HADOOP-1558:
> ---------------------------------
>
>     Fix Version/s:     (was: 0.14.0)
>            Status: Open  (was: Patch Available)
>
> This is a good feature, but it's going to be more complicated to implement.  We only instantiate user classes in task and client jvms, never in jobtracker or tasktracker jvms.  So initialize() and commit() need to be run as tasks: InitializeTask and CommitTask.  Adding new task classes should be easy in principle, but it might not be in practice.  Also, getUncommittedOutputDirectory() is specific to file-based output formats and so does not belong in the OutputFormat interface, but rather on a base class for file-based outputs.  We should probably rename OutputFormatBase to be FileOutputFormat, just as we renamed InputFormatBase to be FileInputFormat.
>
> > changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)
> > ------------------------------------------------------------------------------------------------------
> >
> >                 Key: HADOOP-1558
> >                 URL: https://issues.apache.org/jira/browse/HADOOP-1558
> >             Project: Hadoop
> >          Issue Type: Improvement
> >          Components: mapred
> >         Environment: all
> >            Reporter: Alejandro Abdelnur
> >         Attachments: hadoop-1558-JUN1007-1934.txt
>
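
Doug's quoted point about keeping getUncommittedOutputDirectory() off the OutputFormat interface could be sketched as follows. This is only an illustration: JobSettings is a hypothetical stand-in for JobConf, the "Sketch" class names and the "_uncommitted" suffix are assumptions, and file operations are recorded as strings instead of being executed.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for Hadoop's JobConf; it only carries an output path. */
class JobSettings {
  final String outputDir;

  JobSettings(String outputDir) {
    this.outputDir = outputDir;
  }
}

/** Storage-agnostic contract: nothing here mentions directories. */
interface OutputFormatSketch {
  void initialize(JobSettings job) throws IOException;

  void commit(JobSettings job) throws IOException;
}

/** Base class for file-based outputs; the only place that knows about directories. */
abstract class FileOutputFormatSketch implements OutputFormatSketch {
  /** Operations are recorded rather than executed, to keep the sketch side-effect free. */
  final List<String> operations = new ArrayList<>();

  String getUncommittedOutputDirectory(JobSettings job) {
    return job.outputDir + "/_uncommitted"; // assumed naming scheme
  }

  @Override
  public void initialize(JobSettings job) throws IOException {
    operations.add("recreate " + getUncommittedOutputDirectory(job));
  }

  @Override
  public void commit(JobSettings job) throws IOException {
    operations.add("rename " + getUncommittedOutputDirectory(job) + " to " + job.outputDir);
  }
}

/** A concrete file-based format inherits the directory handling unchanged. */
class TextOutputFormatSketch extends FileOutputFormatSketch {
}
```

A non-file output (for example, one writing to a database) would implement OutputFormatSketch directly and never see a directory-specific method.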

[jira] Commented: (HADOOP-1558) changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518653 ] 

Alejandro Abdelnur commented on HADOOP-1558:
--------------------------------------------

A comment related to 1416.

1558 issue/patch is to address the handling of the output directory from working to final location.

1416 issue is to address the handling of the output of a part from working to final location with speculative execution considerations.

The OutputHandler (of 1558) gives the task the base path under which to create the part file.

To solve 1416 a similar pattern (of 1558) could be used.

A way of doing it could be by adding to the OutputHandler interface the following methods:

  Path getUncommittedPartFile(JobConf conf, String taskId);
  void initializePartFile(JobConf conf, String taskId);
  void commitPartFile(JobConf conf, String taskId);

The speculative execution logic would invoke the init/commit methods, and these methods would take care of any cleanup/discard that needs to be done.

As this is handled in the OutputHandler implementation, it would work for non-file-based scenarios as well, transparently to the TT.
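
A rough sketch of how such part-file hooks might behave on a local file system. Everything here is an assumption for illustration: the class name LocalPartFileHandler, the "_uncommitted" directory, and the convention that an id has the form "<part>_<attempt>" (for example "part-00000_1") so that commit can derive the final part name. The JobConf parameter is dropped to keep the example self-contained.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical file-based handler; not the OutputHandler from the patch. */
class LocalPartFileHandler {
  private final Path outputDir;

  LocalPartFileHandler(Path outputDir) {
    this.outputDir = outputDir;
  }

  /** Each attempt writes to its own private file, so attempts never collide. */
  Path getUncommittedPartFile(String attemptId) {
    return outputDir.resolve("_uncommitted").resolve(attemptId);
  }

  /** Prepares a clean slot for this attempt, dropping leftovers from an earlier run. */
  void initializePartFile(String attemptId) throws IOException {
    Path file = getUncommittedPartFile(attemptId);
    Files.createDirectories(file.getParent());
    Files.deleteIfExists(file);
  }

  /**
   * Promotes the attempt's file to the final part name. If another attempt of
   * the same part was committed first, this attempt's output is discarded.
   * The exists-then-move check is not atomic; a real implementation would
   * need the framework to serialize commits for a given part.
   */
  void commitPartFile(String attemptId) throws IOException {
    int separator = attemptId.lastIndexOf('_');
    if (separator <= 0) {
      throw new IllegalArgumentException("expected <part>_<attempt>: " + attemptId);
    }
    Path source = getUncommittedPartFile(attemptId);
    Path target = outputDir.resolve(attemptId.substring(0, separator));
    if (Files.exists(target)) {
      Files.delete(source);
    } else {
      Files.move(source, target);
    }
  }
}
```

Because the discard decision lives inside commitPartFile(), the caller (the speculative execution logic) does not need to know whether the output is file-based.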


> changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1558
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1558
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>             Fix For: 0.15.0
>
>         Attachments: hadoop-1558-JUL2607-1600.txt

[jira] Updated: (HADOOP-1558) changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-1558:
---------------------------------------

    Attachment: hadoop-1558-JUN0907-1550.txt

Added 3 new methods to the OutputFormat interface:

  getUncommittedOutputDirectory(): returns the temporary output directory for the job.
  initialize(): initializes/cleans-up temporary output directory for the job.
  commit(): moves data from temporary output directory to the job output directory.

The getUncommittedOutputDirectory() method must be used by getRecordWriter implementations to create the output file.

The initialize() method is called by the JobInProgress constructor.

The commit() method is called by the JobTracker at job finalization time if the job has been completed successfully.

Including testcase.

All Hadoop implementations of OutputFormat have been retrofitted to the new signature behavior.

The LocalJobRunner has also been retrofitted to invoke initialize() and commit() on the OutputFormat.


> changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1558
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1558
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>             Fix For: 0.14.0
>
>         Attachments: hadoop-1558-JUN0907-1550.txt

[jira] Updated: (HADOOP-1558) changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-1558:
---------------------------------------

    Fix Version/s: 0.15.0
           Status: Patch Available  (was: Open)

synched up patch with trunk

> changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1558
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1558
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>             Fix For: 0.15.0
>
>         Attachments: hadoop-1558-JUL2607-1600.txt

[jira] Resolved: (HADOOP-1558) changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur resolved HADOOP-1558.
----------------------------------------

    Resolution: Won't Fix

This issue was opened to address HADOOP-1121. HADOOP-1121 has been resolved as Won't Fix, so this issue is no longer applicable in the context of HADOOP-1121.

> changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1558
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1558
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>         Attachments: hadoop-1558-JUL2607-1600.txt

[jira] Commented: (HADOOP-1558) changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12521163 ] 

Doug Cutting commented on HADOOP-1558:
--------------------------------------

> this solution means the JT has to (potentially) execute client code

No, the simplest implementation might do that, but that wouldn't be acceptable.  We can probably promote tasks in the task process.  Tasks might check with the jobtracker if they were the winning invocation of the task, and promote themselves if they are.  We might even be able to promote the job this way: when the last task is promoted, the job could be promoted in that same jvm.  Abandoning failed tasks is trickier, since the task process may no longer exist.  Job abandonment is similarly tricky.  In these cases I can see no way to avoid running a special task.  Perhaps we can run a single cleanup task to abandon all failed tasks and the job?
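
The bookkeeping this idea would need on the jobtracker side might look roughly like the sketch below. PromotionCoordinator and its method names are hypothetical, and the RPC between task and jobtracker is reduced to plain method calls; only the "first attempt wins, last promoted task promotes the job" logic is shown.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical jobtracker-side bookkeeping; no user code runs here. */
class PromotionCoordinator {
  private final Map<String, String> winners = new HashMap<>();
  private final Set<String> pendingTasks;

  PromotionCoordinator(Set<String> taskIds) {
    this.pendingTasks = new HashSet<>(taskIds);
  }

  /**
   * Asked by a finished attempt: returns true only for the first attempt of a
   * task to ask (or for that same attempt asking again).
   */
  synchronized boolean isWinningAttempt(String taskId, String attemptId) {
    String previous = winners.putIfAbsent(taskId, attemptId);
    return previous == null || previous.equals(attemptId);
  }

  /**
   * Reported by the winning attempt after it has promoted its own output in
   * its own JVM. Returns true if this was the last outstanding task, in which
   * case the caller could go on to promote the whole job in that same JVM.
   */
  synchronized boolean taskPromoted(String taskId) {
    return pendingTasks.remove(taskId) && pendingTasks.isEmpty();
  }
}
```

The sketch deliberately leaves out abandonment: as noted above, a failed attempt may have no live process left to call back, which is why a separate cleanup task is suggested for that case.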

> changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1558
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1558
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>             Fix For: 0.15.0
>
>         Attachments: hadoop-1558-JUL2607-1600.txt

[jira] Updated: (HADOOP-1558) changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-1558:
---------------------------------------

    Attachment:     (was: hadoop-1558-JUN1007-1934.txt)

> changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1558
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1558
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>         Attachments: hadoop-1558-JUL2607-1600.txt

[jira] Commented: (HADOOP-1558) changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12520632 ] 

Owen O'Malley commented on HADOOP-1558:
---------------------------------------

One last point is that TaskContext would need to have at least:

{code}
interface TaskContext {
  JobConf getJobConf();
  String getTaskId();
  String getJobId();
}
{code}

> changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1558
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1558
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>             Fix For: 0.15.0
>
>         Attachments: hadoop-1558-JUL2607-1600.txt

[jira] Updated: (HADOOP-1558) changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-1558:
---------------------------------------

    Attachment: hadoop-1558-JUN0907-1620.txt

new testcase was missing in previous patch.

The initialize() and commit() methods now check whether the output dir is present; if it is missing (which could be the case when an MR job is designed to produce no output), both methods are no-ops.
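
One way to read the no-op behavior, assuming "present" means that an output directory is configured for the job. OptionalOutputLifecycle is a hypothetical name, and file operations are recorded as strings rather than executed.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the guard; not the code from the patch. */
class OptionalOutputLifecycle {
  private final String outputDir; // null when the job declares no output
  final List<String> operations = new ArrayList<>();

  OptionalOutputLifecycle(String outputDir) {
    this.outputDir = outputDir;
  }

  void initialize() {
    if (outputDir == null) {
      return; // nothing to prepare for a job without output
    }
    operations.add("recreate " + outputDir + "/_tmp");
  }

  void commit() {
    if (outputDir == null) {
      return; // nothing to publish for a job without output
    }
    operations.add("rename " + outputDir + "/_tmp to " + outputDir);
  }
}
```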


> changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1558
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1558
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>             Fix For: 0.14.0
>
>         Attachments: hadoop-1558-JUN0907-1550.txt, hadoop-1558-JUN0907-1620.txt

[jira] Updated: (HADOOP-1558) changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-1558:
---------------------------------------

    Status: Open  (was: Patch Available)

A contrib testcase is failing with an NPE.

> changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1558
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1558
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>             Fix For: 0.14.0
>
>         Attachments: hadoop-1558-JUN0907-1550.txt

[jira] Updated: (HADOOP-1558) changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-1558:
---------------------------------------

    Status: Open  (was: Patch Available)

To resubmit with a diff that ignores whitespace changes (my IDE settings are to blame).

> changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1558
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1558
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>             Fix For: 0.14.0
>
>         Attachments: hadoop-1558-JUN0907-1620.txt

[jira] Commented: (HADOOP-1558) changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511085 ] 

Hadoop QA commented on HADOOP-1558:
-----------------------------------

-1, build or testing failed

2 attempts failed to build and test the latest attachment http://issues.apache.org/jira/secure/attachment/12361406/hadoop-1558-JUN0907-1550.txt against trunk revision r554144.

Test results:   http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/375/testReport/
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/375/console

Please note that this message is automatically generated and may represent a problem with the automation system and not the patch.

> changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1558
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1558
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>             Fix For: 0.14.0
>
>         Attachments: hadoop-1558-JUN0907-1620.txt

[jira] Updated: (HADOOP-1558) changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-1558:
---------------------------------------

    Attachment:     (was: hadoop-1558-JUN0907-1620.txt)

> changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1558
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1558
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>             Fix For: 0.14.0
>
>         Attachments: hadoop-1558-JUN0907-1721.txt

[jira] Updated: (HADOOP-1558) changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-1558:
---------------------------------------

    Attachment: hadoop-1558-JUN0907-1721.txt

ignoring whitespaces now

> changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1558
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1558
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>             Fix For: 0.14.0
>
>         Attachments: hadoop-1558-JUN0907-1721.txt

[jira] Commented: (HADOOP-1558) changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541227 ] 

Alejandro Abdelnur commented on HADOOP-1558:
--------------------------------------------

A couple of follow-up comments on the solution proposed by this patch:

* The OutputHandler interface is not meant to be a public interface; M/R developers are not expected to write alternate implementations of it. It just allows Hadoop to easily plug in an alternate implementation for another type of storage if required.

* The logic in the OutputHandler interface is not much different from what is done today in the JobTracker; it is just a refactoring/encapsulation of the FS operations for handling the output dir/files.

While I understand Doug's and Owen's motivations for making some major changes (either via a new RPC call or a new task), the proposed changes are minor hooks in the current implementation, and they mostly move FS operations from one place to another.

Until a major refactoring is done, these changes would allow implementing functionality to restart jobs automatically after a JT failure.



> changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1558
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1558
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>         Attachments: hadoop-1558-JUL2607-1600.txt
>
>
> Add  OutputFormat methods like:
> /** Called to initialize output for this job. */
> void initialize(JobConf job) throws IOException;
> /** Called to finalize output for this job. */
> void commit(JobConf job) throws IOException;
> In the base implemenation for FileSystem output, initialize() might then create a temporary directory for the job, removing any that already exists, and commit could rename the temporary output directory to the final name. 
> The existing checkOutputSpecs() would continue to throw an exception if the final output already exists.


[jira] Commented: (HADOOP-1558) changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12512206 ] 

Arun C Murthy commented on HADOOP-1558:
---------------------------------------

I am firmly on Doug's side of the fence about the need to keep the kernel free of user code; however, with respect to this issue, I'd like to bring some complications to everyone's attention:

Eventually we need to move {{Task.saveOutput}} and {{Task.discardOutput}} to the OutputFormats. However, this means that we *have* to call these methods from the {{JobTracker}} (only there do we have a global picture of the tasks and the job), thereby ruling out doing this in the child JVMs, since I believe that isn't feasible performance-wise (I'd love to hear thoughts/arguments/ideas). Hence I'd agree with Alejandro's take on static output-file handlers which cannot be user-supplied or user-overridden for now.

> changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1558
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1558
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>             Fix For: 0.14.0
>
>         Attachments: hadoop-1558-JUN1007-1934.txt, hadoop-1558-JUN1107-1533.txt
>
>
> Add  OutputFormat methods like:
> /** Called to initialize output for this job. */
> void initialize(JobConf job) throws IOException;
> /** Called to finalize output for this job. */
> void commit(JobConf job) throws IOException;
> In the base implemenation for FileSystem output, initialize() might then create a temporary directory for the job, removing any that already exists, and commit could rename the temporary output directory to the final name. 
> The existing checkOutputSpecs() would continue to throw an exception if the final output already exists.


[jira] Updated: (HADOOP-1558) changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-1558:
---------------------------------------

    Status: Patch Available  (was: Open)

> changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1558
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1558
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>             Fix For: 0.14.0
>
>         Attachments: hadoop-1558-JUN0907-1620.txt
>
>
> Add  OutputFormat methods like:
> /** Called to initialize output for this job. */
> void initialize(JobConf job) throws IOException;
> /** Called to finalize output for this job. */
> void commit(JobConf job) throws IOException;
> In the base implemenation for FileSystem output, initialize() might then create a temporary directory for the job, removing any that already exists, and commit could rename the temporary output directory to the final name. 
> The existing checkOutputSpecs() would continue to throw an exception if the final output already exists.


[jira] Commented: (HADOOP-1558) changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518505 ] 

Owen O'Malley commented on HADOOP-1558:
---------------------------------------

-1

I'm sorry, I still think this is the wrong direction. I've been putting it off trying to think up a reasonable alternative, but this is just too limited and confusing to the user to be useful. In particular, this change can't fix HADOOP-1416, which is very strongly related to it.

I'll try to work out a proposal for what we could move OutputFormat to.

> changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1558
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1558
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>             Fix For: 0.15.0
>
>         Attachments: hadoop-1558-JUL2607-1600.txt
>
>
> Add  OutputFormat methods like:
> /** Called to initialize output for this job. */
> void initialize(JobConf job) throws IOException;
> /** Called to finalize output for this job. */
> void commit(JobConf job) throws IOException;
> In the base implemenation for FileSystem output, initialize() might then create a temporary directory for the job, removing any that already exists, and commit could rename the temporary output directory to the final name. 
> The existing checkOutputSpecs() would continue to throw an exception if the final output already exists.


[jira] Commented: (HADOOP-1558) changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511745 ] 

Hadoop QA commented on HADOOP-1558:
-----------------------------------

-1, build or testing failed

2 attempts failed to build and test the latest attachment http://issues.apache.org/jira/secure/attachment/12361574/hadoop-1558-JUN1107-1533.txt against trunk revision r555114.

Test results:   http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/397/testReport/
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/397/console

Please note that this message is automatically generated and may represent a problem with the automation system and not the patch.

> changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1558
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1558
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>             Fix For: 0.14.0
>
>         Attachments: hadoop-1558-JUN1007-1934.txt, hadoop-1558-JUN1107-1533.txt
>
>
> Add  OutputFormat methods like:
> /** Called to initialize output for this job. */
> void initialize(JobConf job) throws IOException;
> /** Called to finalize output for this job. */
> void commit(JobConf job) throws IOException;
> In the base implemenation for FileSystem output, initialize() might then create a temporary directory for the job, removing any that already exists, and commit could rename the temporary output directory to the final name. 
> The existing checkOutputSpecs() would continue to throw an exception if the final output already exists.


[jira] Updated: (HADOOP-1558) changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-1558:
---------------------------------------

    Status: Patch Available  (was: Open)

> changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1558
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1558
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>             Fix For: 0.14.0
>
>         Attachments: hadoop-1558-JUN0907-1721.txt
>
>
> Add  OutputFormat methods like:
> /** Called to initialize output for this job. */
> void initialize(JobConf job) throws IOException;
> /** Called to finalize output for this job. */
> void commit(JobConf job) throws IOException;
> In the base implemenation for FileSystem output, initialize() might then create a temporary directory for the job, removing any that already exists, and commit could rename the temporary output directory to the final name. 
> The existing checkOutputSpecs() would continue to throw an exception if the final output already exists.


[jira] Updated: (HADOOP-1558) changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-1558:
---------------------------------------

    Fix Version/s: 0.14.0

> changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1558
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1558
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>             Fix For: 0.14.0
>
>         Attachments: hadoop-1558-JUN1007-1934.txt, hadoop-1558-JUN1107-1533.txt
>
>
> Add  OutputFormat methods like:
> /** Called to initialize output for this job. */
> void initialize(JobConf job) throws IOException;
> /** Called to finalize output for this job. */
> void commit(JobConf job) throws IOException;
> In the base implemenation for FileSystem output, initialize() might then create a temporary directory for the job, removing any that already exists, and commit could rename the temporary output directory to the final name. 
> The existing checkOutputSpecs() would continue to throw an exception if the final output already exists.


[jira] Issue Comment Edited: (HADOOP-1558) changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12520629 ] 

owen.omalley edited comment on HADOOP-1558 at 8/17/07 11:48 AM:
-----------------------------------------------------------------

My point was that I think that OutputFormat should look like:

{code}
interface OutputFormat {
  void checkOutputSpecs(JobConf job) throws IOException;
  RecordWriter getRecordWriter(TaskContext context) throws IOException;
  // handle promotion or abandonment of Tasks for completion or failure
  void promoteTask(TaskContext context) throws IOException;
  void abandonTask(TaskContext context) throws IOException;
  // handle promotion or abandonment of the entire Job
  void promoteJob(TaskContext context) throws IOException;
  void abandonJob(TaskContext context) throws IOException;
}
{code}

The task handling is for HADOOP-1416 and the job handling is for this problem. It makes more sense to make these cases symmetric rather than completely different.
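To make the symmetry concrete, here is a minimal in-memory sketch of the promote/abandon idea. It is not Hadoop code: the class names, the String attempt ids (standing in for TaskContext), and the three internal collections are all assumptions made for illustration. Output from an attempt stays private until the attempt is promoted, and promoted output stays invisible until the job itself is promoted.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** In-memory stand-in (hypothetical) for an output with symmetric task- and job-level hooks. */
class PromotingOutputSketch {
  // Records written by each task attempt, held until the attempt is promoted or abandoned.
  private final Map<String, List<String>> pendingByAttempt = new HashMap<>();
  // Records from promoted attempts, held until the job is promoted or abandoned.
  private final List<String> promoted = new ArrayList<>();
  // Records that are part of the final, visible job output.
  private final List<String> visible = new ArrayList<>();

  void write(String attemptId, String record) {
    pendingByAttempt.computeIfAbsent(attemptId, k -> new ArrayList<>()).add(record);
  }

  void promoteTask(String attemptId) {
    List<String> records = pendingByAttempt.remove(attemptId);
    if (records != null) {
      promoted.addAll(records);
    }
  }

  void abandonTask(String attemptId) {
    pendingByAttempt.remove(attemptId);
  }

  void promoteJob() {
    visible.addAll(promoted);
    promoted.clear();
    pendingByAttempt.clear();
  }

  void abandonJob() {
    promoted.clear();
    pendingByAttempt.clear();
  }

  List<String> visibleOutput() {
    return new ArrayList<>(visible);
  }
}

class PromotingOutputDemo {
  /** One failed attempt is abandoned, its retry and another task are promoted, then the job is promoted. */
  static List<String> run() {
    PromotingOutputSketch out = new PromotingOutputSketch();
    out.write("task0_attempt0", "a"); // attempt that later fails
    out.write("task0_attempt1", "a"); // retry of the same task
    out.write("task1_attempt0", "b");
    out.abandonTask("task0_attempt0");
    out.promoteTask("task0_attempt1");
    out.promoteTask("task1_attempt0");
    out.promoteJob();
    return out.visibleOutput();
  }

  public static void main(String[] args) {
    System.out.println(run());
  }
}
```

The abandoned attempt contributes nothing, so the record "a" appears once in the visible output even though two attempts wrote it.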

      was (Author: owen.omalley):
    My point was that I think that OutputFormat should look like:

{{code}}
interface OutputFormat {
  void checkOutputSpecs(JobConf) throws IOException;
  RecordWriter getRecordWriter(TaskContext) throws IOException;
  // handle promotion or abandonment of Tasks for completion or failure
  void promoteTask(TaskContext) throws IOException;
  void abandonTask(TaskContext) throws IOException;
  // handle promotion or abandoment of the entire Job
  void promoteJob(TaskContext) throws IOException;
  void abandonJob(TaskContext) throws IOException;
}
{{code}}

the task handling is for HADOOP-1416 and the job handling is for this problem. It makes more sense to make these cases symmetric rather than completely different.
  
> changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1558
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1558
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>             Fix For: 0.15.0
>
>         Attachments: hadoop-1558-JUL2607-1600.txt
>
>
> Add  OutputFormat methods like:
> /** Called to initialize output for this job. */
> void initialize(JobConf job) throws IOException;
> /** Called to finalize output for this job. */
> void commit(JobConf job) throws IOException;
> In the base implemenation for FileSystem output, initialize() might then create a temporary directory for the job, removing any that already exists, and commit could rename the temporary output directory to the final name. 
> The existing checkOutputSpecs() would continue to throw an exception if the final output already exists.


[jira] Updated: (HADOOP-1558) changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated HADOOP-1558:
---------------------------------

    Status: Open  (was: Patch Available)

This does not seem like the best solution.  We need inputs and outputs to be user-extensible, including things like the output 'commit' hook added here.  So jobs must be able to provide custom implementations of these methods.  But, for reliability, we've worked hard to remove all user code from the jobtracker.

So these must be run either in a separate jvm on the jobtracker or as new task subclasses run on a tasktracker.  I think the latter is preferable, since it would avoid a lot of code duplication (keeping track of child processes) but would require chasing down all of the places in the code where we assume that tasks are either map or reduce, which may not be easy.

A third option might be to run them in JobClient.  The initialize() method can certainly be run there, no?  The commit() method is trickier, since we want it to be run even if the JobClient process dies.  Perhaps we could have the jobtracker advance jobs to a state where they're complete but not committed, and then, when a JobClient polls for completion and finds it in this state, it runs the commit method for the job, regardless of whether it was the originally submitting jvm.  Could something like that work?  Probably not, but it's worth consideration...

Finally, your interface still contains file-specifics.  We must not assume that inputs or outputs are files.  We want to permit input and output from, e.g., HBase.  So a top-level output interface must not use Path.
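The third option above (a "complete but not committed" state that any polling client can act on) could be sketched roughly as follows. This is only an illustration under assumptions: the state names, the claimCommit() step, and all class names are invented here and are not part of Hadoop or of any patch on this issue.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical job states; COMPLETE_UNCOMMITTED is the intermediate state described above. */
enum JobStateSketch { RUNNING, COMPLETE_UNCOMMITTED, COMMITTING, COMMITTED }

/** Stand-in for the jobtracker side: it only tracks state and never runs user code. */
class CommitTrackerSketch {
  private final AtomicReference<JobStateSketch> state =
      new AtomicReference<>(JobStateSketch.RUNNING);

  void allTasksFinished() {
    state.compareAndSet(JobStateSketch.RUNNING, JobStateSketch.COMPLETE_UNCOMMITTED);
  }

  /** Returns true for exactly one caller, which then owns the commit. */
  boolean claimCommit() {
    return state.compareAndSet(JobStateSketch.COMPLETE_UNCOMMITTED, JobStateSketch.COMMITTING);
  }

  void commitFinished() {
    state.compareAndSet(JobStateSketch.COMMITTING, JobStateSketch.COMMITTED);
  }

  JobStateSketch currentState() {
    return state.get();
  }
}

/** Stand-in for a polling client; any client may end up running the commit. */
class PollingClientSketch {
  static JobStateSketch poll(CommitTrackerSketch tracker, Runnable userCommit) {
    if (tracker.claimCommit()) {
      userCommit.run(); // user code runs in the client JVM, not in the tracker
      tracker.commitFinished();
    }
    return tracker.currentState();
  }
}

class PollingCommitDemo {
  /** Returns how many times the commit hook ran across three polls, or -1 if the job never committed. */
  static int run() {
    CommitTrackerSketch tracker = new CommitTrackerSketch();
    int[] commits = {0};
    Runnable commit = () -> commits[0]++;
    PollingClientSketch.poll(tracker, commit); // job still running: nothing happens
    tracker.allTasksFinished();                // the submitting client is assumed to have died
    PollingClientSketch.poll(tracker, commit); // a different client claims and runs the commit
    PollingClientSketch.poll(tracker, commit); // already committed: no second commit
    return tracker.currentState() == JobStateSketch.COMMITTED ? commits[0] : -1;
  }

  public static void main(String[] args) {
    System.out.println("commits run: " + run());
  }
}
```

The sketch leaves open what happens if the client that claimed the commit dies before finishing; the job would stay in COMMITTING, which is one reason the idea may not hold up without a timeout or retry rule.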

> changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1558
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1558
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>             Fix For: 0.14.0
>
>         Attachments: hadoop-1558-JUN1007-1934.txt, hadoop-1558-JUN1107-1533.txt
>
>
> Add  OutputFormat methods like:
> /** Called to initialize output for this job. */
> void initialize(JobConf job) throws IOException;
> /** Called to finalize output for this job. */
> void commit(JobConf job) throws IOException;
> In the base implemenation for FileSystem output, initialize() might then create a temporary directory for the job, removing any that already exists, and commit could rename the temporary output directory to the final name. 
> The existing checkOutputSpecs() would continue to throw an exception if the final output already exists.


[jira] Updated: (HADOOP-1558) changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-1558:
---------------------------------------

    Attachment: hadoop-1558-JUN1107-1533.txt

New patch using the approach described in the last comment.

All testcases pass except the TestSymLink from streaming contrib (due to test bug 1587).


> changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1558
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1558
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>         Attachments: hadoop-1558-JUN1007-1934.txt, hadoop-1558-JUN1107-1533.txt
>
>
> Add  OutputFormat methods like:
> /** Called to initialize output for this job. */
> void initialize(JobConf job) throws IOException;
> /** Called to finalize output for this job. */
> void commit(JobConf job) throws IOException;
> In the base implemenation for FileSystem output, initialize() might then create a temporary directory for the job, removing any that already exists, and commit could rename the temporary output directory to the final name. 
> The existing checkOutputSpecs() would continue to throw an exception if the final output already exists.


[jira] Commented: (HADOOP-1558) changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511679 ] 

Alejandro Abdelnur commented on HADOOP-1558:
--------------------------------------------

Adding answer by email:
-----
Suggestions make sense. I was looking at the Task class and it seems too Map/Reduce Task specific so I'll need some help here. Is it your intention to run the initialize/commit Tasks in the JT box, or should they run in the slaves?
---

I just had another idea that would not require a separate task for initialize/commit, nor running custom code in the JT.

A new interface:

public interface OutputHandler {
  public void initialize(JobConf conf) throws IOException;
  public void commit(JobConf conf) throws IOException;
  public Path getOutputDirPath(JobConf conf) throws IOException;
}

Provide 2 implementations of it:

1. FileOutputHandler that does the handling implemented by the patch.
2. NOPOutputHandler that does a no-operation.

Add to the OutputFormat interface a method:

  public Class getOutputHandlerClass();

This method must return the OutputHandler implementation the OutputFormat requires. It must be a Hadoop-provided implementation (for now, one of the 2 above).

The JobConf, upon setting the OutputFormat class, will set an internal property with the declared OutputHandler class.

The JobTracker and JobInProgress will use this property to instantiate and run the OutputHandler initialize/commit methods.

Thus no custom code in the JT and no need for new Task classes.
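As a rough illustration of the two handlers described above, the sketch below uses java.nio.file on the local file system instead of Hadoop's FileSystem, and passes the final output directory directly instead of a JobConf. The "_tmp_" prefix, the "Sketch" class names, and the demo are assumptions for illustration; the attached patch may do this differently.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

/** Stand-in for the proposed OutputHandler; a plain Path replaces JobConf here. */
interface OutputHandlerSketch {
  void initialize(Path finalOutputDir) throws IOException;
  void commit(Path finalOutputDir) throws IOException;
  Path getOutputDirPath(Path finalOutputDir);
}

/** File-based handler: output goes to a temporary sibling directory until commit. */
class FileOutputHandlerSketch implements OutputHandlerSketch {
  public Path getOutputDirPath(Path finalOutputDir) {
    // Hypothetical naming scheme for the temporary directory.
    return finalOutputDir.resolveSibling("_tmp_" + finalOutputDir.getFileName());
  }

  public void initialize(Path finalOutputDir) throws IOException {
    Path tmp = getOutputDirPath(finalOutputDir);
    deleteRecursively(tmp); // drop leftovers from a crashed run
    Files.createDirectories(tmp);
  }

  public void commit(Path finalOutputDir) throws IOException {
    // Rename the temporary directory to the final name.
    Files.move(getOutputDirPath(finalOutputDir), finalOutputDir);
  }

  private static void deleteRecursively(Path dir) throws IOException {
    if (!Files.exists(dir)) {
      return;
    }
    try (Stream<Path> walk = Files.walk(dir)) {
      // Reverse order deletes children before their parent directories.
      walk.sorted(Comparator.reverseOrder()).forEach(p -> {
        try {
          Files.delete(p);
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      });
    }
  }
}

/** No-op handler for outputs that need no directory handling. */
class NopOutputHandlerSketch implements OutputHandlerSketch {
  public Path getOutputDirPath(Path finalOutputDir) { return finalOutputDir; }
  public void initialize(Path finalOutputDir) { }
  public void commit(Path finalOutputDir) { }
}

class OutputHandlerDemo {
  /** Returns true if output is hidden before commit and present in the final directory after it. */
  static boolean run() {
    try {
      Path base = Files.createTempDirectory("handler-demo");
      Path finalDir = base.resolve("out");
      OutputHandlerSketch handler = new FileOutputHandlerSketch();
      handler.initialize(finalDir);
      Path part = handler.getOutputDirPath(finalDir).resolve("part-00000");
      Files.write(part, "data".getBytes(StandardCharsets.UTF_8));
      boolean hiddenBeforeCommit = !Files.exists(finalDir);
      handler.commit(finalDir);
      return hiddenBeforeCommit && Files.exists(finalDir.resolve("part-00000"));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("committed: " + run());
  }
}
```

Because initialize() removes any existing temporary directory, re-running a crashed job starts from a clean slate, while the final directory only appears once commit() succeeds.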




> changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1558
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1558
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>         Attachments: hadoop-1558-JUN1007-1934.txt
>
>
> Add  OutputFormat methods like:
> /** Called to initialize output for this job. */
> void initialize(JobConf job) throws IOException;
> /** Called to finalize output for this job. */
> void commit(JobConf job) throws IOException;
> In the base implemenation for FileSystem output, initialize() might then create a temporary directory for the job, removing any that already exists, and commit could rename the temporary output directory to the final name. 
> The existing checkOutputSpecs() would continue to throw an exception if the final output already exists.


[jira] Updated: (HADOOP-1558) changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-1558:
---------------------------------------

    Status: Patch Available  (was: Open)

> changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1558
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1558
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>         Attachments: hadoop-1558-JUN1007-1934.txt, hadoop-1558-JUN1107-1533.txt
>
>
> Add  OutputFormat methods like:
> /** Called to initialize output for this job. */
> void initialize(JobConf job) throws IOException;
> /** Called to finalize output for this job. */
> void commit(JobConf job) throws IOException;
> In the base implemenation for FileSystem output, initialize() might then create a temporary directory for the job, removing any that already exists, and commit could rename the temporary output directory to the final name. 
> The existing checkOutputSpecs() would continue to throw an exception if the final output already exists.


[jira] Commented: (HADOOP-1558) changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511171 ] 

Hadoop QA commented on HADOOP-1558:
-----------------------------------

-1, build or testing failed

2 attempts failed to build and test the latest attachment http://issues.apache.org/jira/secure/attachment/12361413/hadoop-1558-JUN0907-1721.txt against trunk revision r554144.

Test results:   http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/376/testReport/
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/376/console

Please note that this message is automatically generated and may represent a problem with the automation system and not the patch.

> changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1558
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1558
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>             Fix For: 0.14.0
>
>         Attachments: hadoop-1558-JUN0907-1721.txt
>
>
> Add  OutputFormat methods like:
> /** Called to initialize output for this job. */
> void initialize(JobConf job) throws IOException;
> /** Called to finalize output for this job. */
> void commit(JobConf job) throws IOException;
> In the base implemenation for FileSystem output, initialize() might then create a temporary directory for the job, removing any that already exists, and commit could rename the temporary output directory to the final name. 
> The existing checkOutputSpecs() would continue to throw an exception if the final output already exists.


[jira] Updated: (HADOOP-1558) changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-1558:
---------------------------------------

    Attachment: hadoop-1558-JUN1007-1934.txt

> changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1558
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1558
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>             Fix For: 0.14.0
>
>         Attachments: hadoop-1558-JUN0907-1721.txt, hadoop-1558-JUN1007-1934.txt
>
>
> Add  OutputFormat methods like:
> /** Called to initialize output for this job. */
> void initialize(JobConf job) throws IOException;
> /** Called to finalize output for this job. */
> void commit(JobConf job) throws IOException;
> In the base implemenation for FileSystem output, initialize() might then create a temporary directory for the job, removing any that already exists, and commit could rename the temporary output directory to the final name. 
> The existing checkOutputSpecs() would continue to throw an exception if the final output already exists.


[jira] Updated: (HADOOP-1558) changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated HADOOP-1558:
---------------------------------

    Status: Open  (was: Patch Available)

> changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1558
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1558
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>             Fix For: 0.15.0
>
>         Attachments: hadoop-1558-JUL2607-1600.txt
>
>
> Add  OutputFormat methods like:
> /** Called to initialize output for this job. */
> void initialize(JobConf job) throws IOException;
> /** Called to finalize output for this job. */
> void commit(JobConf job) throws IOException;
> In the base implemenation for FileSystem output, initialize() might then create a temporary directory for the job, removing any that already exists, and commit could rename the temporary output directory to the final name. 
> The existing checkOutputSpecs() would continue to throw an exception if the final output already exists.


[jira] Updated: (HADOOP-1558) changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-1558:
---------------------------------------

    Attachment:     (was: hadoop-1558-JUN0907-1550.txt)

> changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1558
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1558
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>             Fix For: 0.14.0
>
>         Attachments: hadoop-1558-JUN0907-1620.txt
>
>

[jira] Updated: (HADOOP-1558) changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated HADOOP-1558:
---------------------------------

    Fix Version/s:     (was: 0.14.0)
           Status: Open  (was: Patch Available)

This is a good feature, but it's going to be more complicated to implement.  We only instantiate user classes in task and client jvms, never in jobtracker or tasktracker jvms.  So initialize() and commit() need to be run as tasks: InitializeTask and CommitTask.  Adding new task classes should be easy in principle, but it might not be in practice.  Also, getUncommittedOutputDirectory() is specific to file-based output formats and so does not belong in the OutputFormat interface, but rather on a base class for file-based outputs.  We should probably rename OutputFormatBase to be FileOutputFormat, just as we renamed InputFormatBase to be FileInputFormat.
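As a rough illustration of the file-based behaviour discussed here, the sketch below keeps the temporary-directory logic on a file-specific class rather than on the OutputFormat interface. It is only a sketch: the class name FileOutputSketch and the "_tmp_" prefix are hypothetical, and it uses java.nio.file on a local file system instead of Hadoop's FileSystem API.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical stand-in for a file-based output base class; not Hadoop's API. */
public class FileOutputSketch {
    private final Path finalDir;

    public FileOutputSketch(Path finalDir) {
        this.finalDir = finalDir;
    }

    /** Temporary sibling directory that holds output until commit() (assumed naming). */
    public Path getUncommittedOutputDirectory() {
        return finalDir.resolveSibling("_tmp_" + finalDir.getFileName());
    }

    /** Mirrors checkOutputSpecs(): refuse to run if the final output exists. */
    public void checkOutputSpecs() throws IOException {
        if (Files.exists(finalDir)) {
            throw new IOException("Output directory already exists: " + finalDir);
        }
    }

    /** Removes a temporary directory left by a crashed run, then recreates it. */
    public void initialize() throws IOException {
        Path tmp = getUncommittedOutputDirectory();
        if (Files.exists(tmp)) {
            List<Path> paths;
            try (Stream<Path> walk = Files.walk(tmp)) {
                // Reverse order lists children before their parent directories.
                paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
            }
            for (Path p : paths) {
                Files.delete(p);
            }
        }
        Files.createDirectories(tmp);
    }

    /** Publishes the output by renaming the temporary directory to the final name. */
    public void commit() throws IOException {
        Files.move(getUncommittedOutputDirectory(), finalDir);
    }
}
```

A crashed job leaves only the temporary directory behind, so a re-run passes checkOutputSpecs() and initialize() clears the stale output before any task writes again.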

> changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1558
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1558
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>         Attachments: hadoop-1558-JUN1007-1934.txt
>
>

[jira] Updated: (HADOOP-1558) changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-1558:
---------------------------------------

    Status: Open  (was: Patch Available)

> changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1558
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1558
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>             Fix For: 0.14.0
>
>         Attachments: hadoop-1558-JUN0907-1721.txt
>
>

[jira] Commented: (HADOOP-1558) changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12520629 ] 

Owen O'Malley commented on HADOOP-1558:
---------------------------------------

My point was that I think that OutputFormat should look like:

{{code}}
interface OutputFormat {
  void checkOutputSpecs(JobConf job) throws IOException;
  RecordWriter getRecordWriter(TaskContext context) throws IOException;
  // handle promotion or abandonment of Tasks for completion or failure
  void promoteTask(TaskContext context) throws IOException;
  void abandonTask(TaskContext context) throws IOException;
  // handle promotion or abandonment of the entire Job
  void promoteJob(TaskContext context) throws IOException;
  void abandonJob(TaskContext context) throws IOException;
}
{{code}}

The task handling is for HADOOP-1416, and the job handling is for this issue. It makes more sense to make these two cases symmetric rather than completely different.
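To make the symmetry concrete, here is a minimal in-memory sketch of the promote/abandon lifecycle at both levels. Everything in it is an assumption for illustration: the class PromotionSketch, the String task ids, and the list-based "storage" are hypothetical and stand in for TaskContext and real output files.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of symmetric task-level and job-level promotion. */
public class PromotionSketch {
    // Output written by task attempts that have not been promoted yet.
    private final Map<String, List<String>> taskOutput = new HashMap<>();
    // Output of promoted tasks, waiting for the job itself to be promoted.
    private final List<String> jobStaging = new ArrayList<>();
    // Output visible to consumers after the job is promoted.
    private final List<String> published = new ArrayList<>();

    public void write(String taskId, String record) {
        taskOutput.computeIfAbsent(taskId, k -> new ArrayList<>()).add(record);
    }

    /** Task completed: move its output into the job's staging area. */
    public void promoteTask(String taskId) {
        List<String> out = taskOutput.remove(taskId);
        if (out != null) {
            jobStaging.addAll(out);
        }
    }

    /** Task failed or was killed: discard whatever it wrote. */
    public void abandonTask(String taskId) {
        taskOutput.remove(taskId);
    }

    /** Job completed: publish the staged output and drop unpromoted leftovers. */
    public void promoteJob() {
        published.addAll(jobStaging);
        jobStaging.clear();
        taskOutput.clear();
    }

    /** Job failed: discard all staged and pending output. */
    public void abandonJob() {
        jobStaging.clear();
        taskOutput.clear();
    }

    public List<String> published() {
        return new ArrayList<>(published);
    }
}
```

In this sketch nothing becomes visible until promoteJob() runs, so an abandoned job leaves no partial output, and the same two-step shape applies to individual tasks.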

> changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1558
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1558
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>             Fix For: 0.15.0
>
>         Attachments: hadoop-1558-JUL2607-1600.txt
>
>

[jira] Updated: (HADOOP-1558) changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-1558:
---------------------------------------

    Attachment:     (was: hadoop-1558-JUN0907-1721.txt)

> changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1558
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1558
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>             Fix For: 0.14.0
>
>         Attachments: hadoop-1558-JUN1007-1934.txt
>
>

[jira] Updated: (HADOOP-1558) changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-1558:
---------------------------------------

    Status: Patch Available  (was: Open)

> changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1558
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1558
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>             Fix For: 0.14.0
>
>         Attachments: hadoop-1558-JUN0907-1550.txt
>
>

[jira] Updated: (HADOOP-1558) changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-1558:
---------------------------------------

    Attachment: hadoop-1558-JUL2607-1600.txt

> changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1558
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1558
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>         Attachments: hadoop-1558-JUL2607-1600.txt
>
>

[jira] Commented: (HADOOP-1558) changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511478 ] 

Hadoop QA commented on HADOOP-1558:
-----------------------------------

-1, build or testing failed

2 attempts failed to build and test the latest attachment http://issues.apache.org/jira/secure/attachment/12361499/hadoop-1558-JUN1007-1934.txt against trunk revision r554811.

Test results:   http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/388/testReport/
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/388/console

Please note that this message is automatically generated and may represent a problem with the automation system and not the patch.

> changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1558
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1558
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>             Fix For: 0.14.0
>
>         Attachments: hadoop-1558-JUN1007-1934.txt
>
>

[jira] Updated: (HADOOP-1558) changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-1558:
---------------------------------------

    Status: Patch Available  (was: Open)

Resubmitting the patch; added debug logging to the OutputFormatBase temporary output directory handling.

The testcase TestSymLink from contrib/streaming is failing.

It is failing because of a build/environment issue with the contrib testcases:

The contrib testcases use src/contrib/test/hadoop-site.xml.
The property 'mapred.system.dir' in this file is defined using the variable '${contrib.name}'.
The src/build/build-contrib.xml ant file sets the sysproperty 'contrib.name' to the name of the contrib component for the JVM running the testcase.

The problem is that when a testcase uses MiniMRCluster, the TaskRunner forks a JVM for the task, and in this JVM (which uses the above hadoop-site.xml) the variable 'contrib.name' is undefined.

If I hardcode 'streaming' in the hadoop-site.xml for TestSymLink, the testcase works fine.
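The failure mode can be reproduced in miniature with a simplified stand-in for '${...}' expansion. This sketch is not Hadoop's Configuration code: the class VarExpansionSketch is hypothetical, the path used in the usage note is made up, and it assumes an undefined variable is simply left unexpanded.

```java
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical, simplified ${name} expansion against a set of properties. */
public class VarExpansionSketch {
    private static final Pattern VAR = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Replaces each ${name} found in props; unknown names are left as-is. */
    public static String expand(String value, Properties props) {
        Matcher m = VAR.matcher(value);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = props.getProperty(m.group(1));
            if (replacement == null) {
                // Undefined variable: keep the literal ${name} text (assumed behaviour).
                replacement = m.group(0);
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With 'contrib.name' set to 'streaming', a made-up value such as 'build/contrib/${contrib.name}/test/system' expands to a usable path; with an empty Properties object, as in the forked task JVM, the literal '${contrib.name}' stays in the path, so the two JVMs disagree about where the system directory is.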




> changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1558
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1558
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>             Fix For: 0.14.0
>
>         Attachments: hadoop-1558-JUN1007-1934.txt
>
>

[jira] Updated: (HADOOP-1558) changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur updated HADOOP-1558:
---------------------------------------

    Attachment:     (was: hadoop-1558-JUN1107-1533.txt)

> changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1558
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1558
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>         Attachments: hadoop-1558-JUL2607-1600.txt
>
>

[jira] Commented: (HADOOP-1558) changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12520932 ] 

Alejandro Abdelnur commented on HADOOP-1558:
--------------------------------------------

I'm fine with it, but this solution means the JT has to (potentially) execute client code, and I thought that was a no-no.

This would be the case when the OutputFormat implementation is custom.

IMO using a custom OutputFormat is a much more likely scenario than using a different data store (which is what the OutputHandler addresses).


> changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1558
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1558
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>             Fix For: 0.15.0
>
>         Attachments: hadoop-1558-JUL2607-1600.txt
>
>

[jira] Commented: (HADOOP-1558) changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511920 ] 

Alejandro Abdelnur commented on HADOOP-1558:
--------------------------------------------

Doug, I understand the angle you are coming from. 

I've spent some time looking at making this a Task but decided against it, mostly because refactoring Task to do things other than run Map/Reduce, and to allow such tasks to run on the JT box, brings much higher risk into the code.

Because of that, I've taken the compromise path implemented in the patch.

Decoupling initialize/commit from the OutputFormat into the OutputHandler relies on a couple of assumptions:

* It is far more common for jobs to use custom OutputFormats than custom persistent stores. In other words, as a MapReduce developer I may come up with custom OutputFormats on a per-job basis, but I will hardly ever introduce a new persistent store (DFS, HBase, S3) on a per-job basis.

* Leaving the initialize/commit logic to the MapReduce developer implementing the OutputFormat carries a high risk in shared cluster environments, as the decision on where temporary output directories are created could clash with OutputFormat implementations from other jobs. IMO it seems a good thing for Hadoop code to keep control of this.

Regarding extensibility:

* The OutputHandler is an interface, and custom implementations can be added to the Hadoop cluster classpath to make them available to MapReduce jobs. This works even for existing OutputFormats, as the default OutputHandler can be overridden in the JobConf. As I think this is a much less frequent situation, I see this approach as acceptable.

Regarding stores that are not file-based and the 'Path getUncommitedPath(Job)' method, I see 2 options:

* This method could be ignored by non-file-based OutputHandlers; they would only care about the initialize and commit methods.

* Change this method to 'String getUncommittedName(Job)'. For a file-based OutputHandler, this would be interpreted as the Path to be used by the OutputFormats. For a non-file-based one, it would be interpreted according to the store implementation. For example, in the case of HBase it could be the value of an 'uncommitted' column, so that records of non-completed jobs could be easily tracked and cleaned up: initialize() would remove all records with this name (from a failed prior run), and commit() would set this column to null for all records of the job.
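The second option can be sketched against an in-memory table standing in for a non-file store. All names here are hypothetical: the OutputHandler shape shown is a guess for illustration and not the interface in the patch, job names are plain Strings instead of a Job object, and the "_uncommitted_" marker format is invented.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of an 'uncommitted' marker column for a non-file store. */
public class UncommittedColumnSketch {

    /** Assumed handler shape; the patch's real signatures may differ. */
    public interface OutputHandler {
        String getUncommittedName(String jobName);
        void initialize(String jobName);
        void commit(String jobName);
    }

    /** One stored record; 'uncommitted' becomes null once its job commits. */
    public static final class Row {
        final String value;
        String uncommitted;

        Row(String value, String uncommitted) {
            this.value = value;
            this.uncommitted = uncommitted;
        }
    }

    public static final class TableHandler implements OutputHandler {
        private final List<Row> table = new ArrayList<>();

        @Override
        public String getUncommittedName(String jobName) {
            return "_uncommitted_" + jobName;
        }

        /** Removes records left behind by a failed prior run of the same job. */
        @Override
        public void initialize(String jobName) {
            String marker = getUncommittedName(jobName);
            table.removeIf(r -> marker.equals(r.uncommitted));
        }

        /** Every record written during the job carries the job's marker. */
        public void write(String jobName, String value) {
            table.add(new Row(value, getUncommittedName(jobName)));
        }

        /** Clears the marker on all of the job's records. */
        @Override
        public void commit(String jobName) {
            String marker = getUncommittedName(jobName);
            for (Row r : table) {
                if (marker.equals(r.uncommitted)) {
                    r.uncommitted = null;
                }
            }
        }

        public int size() {
            return table.size();
        }

        public long committedCount() {
            return table.stream().filter(r -> r.uncommitted == null).count();
        }
    }
}
```

Readers of such a table would filter on a null marker, which plays the same role as the rename of the temporary directory in the file-based case.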

Thoughts?


> changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1558
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1558
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>             Fix For: 0.14.0
>
>         Attachments: hadoop-1558-JUN1007-1934.txt, hadoop-1558-JUN1107-1533.txt
>
>

[jira] Commented: (HADOOP-1558) changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515776 ] 

Hadoop QA commented on HADOOP-1558:
-----------------------------------

+1

http://issues.apache.org/jira/secure/attachment/12362627/hadoop-1558-JUL2607-1600.txt applied and successfully tested against trunk revision r559819.

Test results:   http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/471/testReport/
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/471/console

> changes to OutputFormat to work on temporary directory to enable re-running crashed jobs (Issue: 1121)
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1558
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1558
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>             Fix For: 0.15.0
>
>         Attachments: hadoop-1558-JUL2607-1600.txt
>
>