Posted to common-dev@hadoop.apache.org by "stack@archive.org (JIRA)" <ji...@apache.org> on 2007/04/06 20:05:32 UTC

[jira] Updated: (HADOOP-1199) want InputFormat for task logs

     [ https://issues.apache.org/jira/browse/HADOOP-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack@archive.org updated HADOOP-1199:
--------------------------------------

    Attachment: hadoop1199.patch

Here's a first cut at a TaskLogInputFormat.

+ When it runs, the TaskLogInputFormat class reads all local userlogs associated with the configured jobid.  The number of splits equals the number of configured map tasks; a rough sketch of this scheme follows the list.  (Do folks have better ideas regarding how to do the split? See the next item for problems with this scheme.)
+ If there are no associated logs on the local host for the configured jobid, the task fails and will usually be rescheduled elsewhere.  The 'elsewhere' may already have had a log-analysis task run against it, so logs can be double-counted.  This makes TaskLogInputFormat, as currently written, inappropriate for precise reporting based on log output.
+ As currently written, it reads the content of the tasks' stdout, stderr, or syslog subdirectory; in other words, you must run a job per subdirectory.
+ It gives JobTracker#idFormat package-private rather than private access to avoid duplicating the number formatting.
+ It includes the patch attached to HADOOP-1181 (I've resolved that issue as "won't fix").  That patch opens access to TaskLog so it can be used outside the mapred package, and it makes it so TaskLog$Reader can take URLs to userlog subdirectories.  Folks need to be able to get to their logs.
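
For concreteness, here is a rough, standalone sketch of the split scheme above: one split per configured map task, each naming a task whose local stdout, stderr, or syslog subdirectory would be read.  This is plain Java for illustration only; it is not the attached patch, it does not use the real mapred interfaces, and the task-id format, userlogs layout, and all names in it are made up.

import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class TaskLogSplitSketch {

  // One "split" descriptor per configured map task of the target job.
  static List<String> planSplits(String jobId, int numMapTasks) {
    List<String> splits = new ArrayList<String>();
    for (int i = 0; i < numMapTasks; i++) {
      // Each split names a map task whose local userlog directory should be read.
      splits.add(jobId + "_m_" + i);  // task-id format is illustrative only
    }
    return splits;
  }

  // Locate the requested stream (stdout, stderr or syslog) for one task.
  static File taskLogDir(File userlogRoot, String taskId, String stream) {
    // e.g. <local-dir>/userlogs/<task-id>/<stream>; layout is illustrative only
    return new File(new File(userlogRoot, taskId), stream);
  }

  public static void main(String[] args) {
    for (String taskId : planSplits("job_0001", 4)) {
      File dir = taskLogDir(new File("/tmp/hadoop/userlogs"), taskId, "syslog");
      // A task whose logs are missing locally would fail and be rescheduled
      // elsewhere, which is the double-counting caveat in the second item above.
      System.out.println(taskId + " -> " + dir + (dir.exists() ? "" : " (missing)"));
    }
  }
}

The real patch of course goes through the InputFormat/RecordReader interfaces; this only shows where the split count and the one-job-per-subdirectory restriction come from.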

> want InputFormat for task logs
> ------------------------------
>
>                 Key: HADOOP-1199
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1199
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Doug Cutting
>         Attachments: hadoop1199.patch
>
>
> We should provide an InputFormat implementation that includes all the task logs from a job. Folks should be able to do something like:
> job = new JobConf();
> job.setInputFormatClass(TaskLogInputFormat.class);
> TaskLogInputFormat.setJobId(jobId);
> ...
> Tasks should ideally be localized to the node that each log is on.
> Examining logs should be as lightweight as possible, to facilitate debugging. It should not require a copy to HDFS. A faster debug loop is like a faster search engine: it makes people more productive. The sooner one can find that, e.g., most tasks failed with a NullPointerException on line 723, the better. 
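
A slightly fuller driver along the lines of the snippet quoted above might look roughly like this.  TaskLogInputFormat and its setJobId method are the API proposed in this issue, not existing code; wiring it up through the old-style JobConf/JobClient calls is an assumption about how it would be used, and the mapper is omitted.

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class GrepTaskLogs {
  public static void main(String[] args) throws Exception {
    String jobId = args[0];                        // id of the job whose task logs to scan
    JobConf job = new JobConf(GrepTaskLogs.class);
    job.setInputFormat(TaskLogInputFormat.class);  // proposed input format (not yet in mapred)
    TaskLogInputFormat.setJobId(jobId);            // proposed setter from the description above
    // ... set a mapper/reducer that scan log lines, e.g. for NullPointerExceptions ...
    JobClient.runJob(job);
  }
}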

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.