Posted to mapreduce-dev@hadoop.apache.org by "Sudharsan Sampath (JIRA)" <ji...@apache.org> on 2011/07/01 08:10:28 UTC

[jira] [Created] (MAPREDUCE-2635) Jobs hang indefinitely on failure.

Jobs hang indefinitely on failure.
----------------------------------

                 Key: MAPREDUCE-2635
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2635
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: jobtracker, task-controller, tasktracker
    Affects Versions: 0.20.2, 0.20.1
         Environment: SUSE Linux cluster with 2 nodes. One runs the jobtracker, namenode, a datanode and a tasktracker; the other runs a tasktracker and a datanode.
            Reporter: Sudharsan Sampath
            Priority: Blocker


Running the following example hangs the child job indefinitely.

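import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.MultipleInputs;
import org.apache.hadoop.mapred.lib.NullOutputFormat;

// Submits one map-only job. When started with an argument it submits
// "ParentJob", whose mapper calls main() again to submit "ChildJob".
public class HaltCluster
{

  public static void main(String[] args) throws IOException
  {
    JobConf jobConf = new JobConf();
    prepareConf(jobConf);
    if (args != null && args.length > 0)
    {
      // Parent run: hand the flag to the mapper and allow one map attempt only.
      jobConf.set("callonceagain", args[0]);
      jobConf.setMaxMapAttempts(1);
      jobConf.setJobName("ParentJob");

    }
    // Submits the job and blocks until it completes.
    JobClient.runJob(jobConf);

  }

  public static void prepareConf(JobConf jobConf)
  {
    jobConf.setJarByClass(HaltCluster.class);
    jobConf.set("mapred.job.tracker", "<<jobtracker>>");
    jobConf.set("fs.default.name", "<<hdfs>>");
    // MyInputFormat is not included in this report.
    MultipleInputs.addInputPath(jobConf, new Path("/ignore" + System.currentTimeMillis()), MyInputFormat.class);
    // Default name; main() overrides it for the parent run.
    jobConf.setJobName("ChildJob");
    jobConf.setMapperClass(MyMapper.class);
    jobConf.setOutputFormat(NullOutputFormat.class);
    jobConf.setNumReduceTasks(0);
  }

}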

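import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Mapper that always fails. In the parent run it first submits the child
// job from inside the map task and waits for it.
public class MyMapper implements Mapper<IntWritable, Text, NullWritable, NullWritable>
{
  JobConf myConf = null;

  @Override
  public void map(IntWritable arg0, Text arg1, OutputCollector<NullWritable, NullWritable> arg2, Reporter arg3) throws IOException
  {
    if (myConf != null && "true".equals(myConf.get("callonceagain")))
    {
      startBackGroundReporting(arg3);
      // No arguments, so this submits "ChildJob" and blocks in JobClient.runJob().
      HaltCluster.main(new String[] {});
    }

    // Every map call fails, in the parent and in the child.
    throw new RuntimeException("Throwing exception");
  }

  // Keeps setting the task status from a daemon thread, intended to show the
  // task as alive while map() waits on the child job.
  private void startBackGroundReporting(final Reporter arg3)
  {
    Thread t = new Thread()
    {
      @Override
      public void run()
      {
        while (true)
        {
          arg3.setStatus("Reporting to be alive at " + System.currentTimeMillis());
        }
      }
    };
    t.setDaemon(true);
    t.start();
  }

  @Override
  public void configure(JobConf arg0)
  {
    // Keep the job configuration so map() can read the "callonceagain" flag.
    myConf = arg0;

  }

  @Override
  public void close() throws IOException
  {
    // Nothing to clean up.

  }

}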

Run it using the following command:

java -cp <<classpath>> HaltCluster true

If only one job is triggered instead, with java -cp <<classpath>> HaltCluster,
it fails after the maximum number of attempts and quits as expected.


Also, when the jobs hang, running the child job once more brings them out of the deadlock, and all three jobs then complete.
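
The report does not establish why the jobs hang. As a plain-Java analogy only, the sketch below shows one way a "parent task blocks on a child it submitted" pattern can stall: the parent holds the only free worker while the child waits in the queue for one. Whether this is what happened on the cluster above is an assumption, not something this report confirms. The class NestedSubmitSketch, the method parentFinishes and the notion of "slots" as pool threads are hypothetical names used for illustration; no Hadoop API is involved.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical analogy: pool threads stand in for task slots.
class NestedSubmitSketch
{
  // Returns true if the parent task ends (even by failing) within timeoutMs.
  static boolean parentFinishes(int slots, long timeoutMs)
  {
    final ExecutorService pool = Executors.newFixedThreadPool(slots);
    try
    {
      // The child always fails, like MyMapper.map() in the report.
      final Callable<Void> child = () -> {
        throw new RuntimeException("Throwing exception");
      };
      // The parent submits the child to the same pool and waits for it,
      // loosely mirroring a map task that blocks in JobClient.runJob().
      final Callable<Void> parent = () -> {
        try
        {
          pool.submit(child).get();
        }
        catch (ExecutionException expected)
        {
          // The child failed; the parent then fails as well.
        }
        throw new RuntimeException("Throwing exception");
      };

      Future<Void> result = pool.submit(parent);
      try
      {
        result.get(timeoutMs, TimeUnit.MILLISECONDS);
        return true;
      }
      catch (ExecutionException e)
      {
        return true; // the parent ended by throwing, which still counts as finished
      }
      catch (TimeoutException e)
      {
        return false; // the parent is still waiting for its child
      }
      catch (InterruptedException e)
      {
        Thread.currentThread().interrupt();
        return false;
      }
    }
    finally
    {
      pool.shutdownNow(); // interrupts a parent that is still blocked
    }
  }

  public static void main(String[] args)
  {
    System.out.println("slots=1 finished=" + parentFinishes(1, 500));
    System.out.println("slots=2 finished=" + parentFinishes(2, 5000));
  }
}
```

With one slot the parent never gets its child scheduled and the call times out; with two slots the child runs, fails, and the parent fails right after it.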



--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Resolved] (MAPREDUCE-2635) Jobs hang indefinitely on failure.

Posted by "Harsh J (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAPREDUCE-2635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Harsh J resolved MAPREDUCE-2635.
--------------------------------

    Resolution: Cannot Reproduce

I'm still unable to reproduce this. Oozie does similar things, and we have never run into such a situation there. Most likely this was a local issue. If we run into it again someday and have logs, we can reopen it.

Thanks all!
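
If the hang does recur, stack traces from the stuck JVM are one kind of evidence that could accompany the logs. The following is a minimal sketch, assuming it can be called from inside the affected process (for example from a diagnostic thread); the class ThreadDumpSketch and the method dumpAllThreads are hypothetical names. It only uses Thread.getAllStackTraces() from the JDK. Running jstack against the process from outside gives a similar result without code changes.

```java
import java.util.Map;

// Hypothetical helper: formats the stack traces of all live threads in this JVM.
class ThreadDumpSketch
{
  static String dumpAllThreads()
  {
    StringBuilder out = new StringBuilder();
    for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet())
    {
      Thread t = entry.getKey();
      // One header line per thread: name, daemon flag, current state.
      out.append('"').append(t.getName()).append('"')
         .append(" daemon=").append(t.isDaemon())
         .append(" state=").append(t.getState())
         .append('\n');
      // One line per stack frame, innermost first.
      for (StackTraceElement frame : entry.getValue())
      {
        out.append("    at ").append(frame).append('\n');
      }
    }
    return out.toString();
  }

  public static void main(String[] args)
  {
    System.out.print(dumpAllThreads());
  }
}
```

A thread stuck in a blocking call shows up with that call at the top of its frames, which helps tell a task waiting on a child job from one that is merely slow.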
                
