You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-dev@hadoop.apache.org by "Andrew McNabb (JIRA)" <ji...@apache.org> on 2007/01/29 22:10:49 UTC

[jira] Created: (HADOOP-950) Job completes but command doesn't return

Job completes but command doesn't return
----------------------------------------

                 Key: HADOOP-950
                 URL: https://issues.apache.org/jira/browse/HADOOP-950
             Project: Hadoop
          Issue Type: Bug
          Components: contrib/streaming
    Affects Versions: 0.10.1
         Environment: Linux, Java 1.6.0-b105
            Reporter: Andrew McNabb


I've had a job submission command hang on many different occasions.  I can't tell exactly what makes it complete some times and hang others.  Here's some information about one time when it hanged.

I started a job at 12:40.  Here is the info from 'ps aux' including the full command line:

/ibrix/home/awm27/jdk1.6.0/bin/java -Xmx1000m -Dhadoop.log.dir=/ibrix/home/awm27/hadoop/logs/log.f17880 -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/ibrix/home/awm27/hadoop/bin/.. -Dhadoop.id.str= -Dhadoop.root.logger=INFO,console -Djava.library.path=/ibrix/home/awm27/hadoop/bin/../lib/native/Linux-amd64-64 -classpath /ibrix/home/awm27/hadoop/dynamic_conf/tmp.FkrUc17883:/ibrix/home/awm27/jdk1.6.0/lib/tools.jar:/ibrix/home/awm27/hadoop/bin/../build/classes:/ibrix/home/awm27/hadoop/bin/../build:/ibrix/home/awm27/hadoop/bin/../build/test/classes:/ibrix/home/awm27/hadoop/bin/..:/ibrix/home/awm27/hadoop/bin/../hadoop-0.10.1-core.jar:/ibrix/home/awm27/hadoop/bin/../lib/commons-cli-2.0-SNAPSHOT.jar:/ibrix/home/awm27/hadoop/bin/../lib/commons-codec-1.3.jar:/ibrix/home/awm27/hadoop/bin/../lib/commons-httpclient-3.0.1.jar:/ibrix/home/awm27/hadoop/bin/../lib/commons-logging-1.0.4.jar:/ibrix/home/awm27/hadoop/bin/../lib/commons-logging-api-1.0.4.jar:/ibrix/home/awm27/hadoop/bin/../lib/jets3t.jar:/ibrix/home/awm27/hadoop/bin/../lib/jetty-5.1.4.jar:/ibrix/home/awm27/hadoop/bin/../lib/junit-3.8.1.jar:/ibrix/home/awm27/hadoop/bin/../lib/log4j-1.2.13.jar:/ibrix/home/awm27/hadoop/bin/../lib/lucene-core-1.9.1.jar:/ibrix/home/awm27/hadoop/bin/../lib/servlet-api.jar:/ibrix/home/awm27/hadoop/bin/../lib/jetty-ext/ant.jar:/ibrix/home/awm27/hadoop/bin/../lib/jetty-ext/commons-el.jar:/ibrix/home/awm27/hadoop/bin/../lib/jetty-ext/jasper-compiler.jar:/ibrix/home/awm27/hadoop/bin/../lib/jetty-ext/jasper-runtime.jar:/ibrix/home/awm27/hadoop/bin/../lib/jetty-ext/jsp-api.jar org.apache.hadoop.util.RunJar /ibrix/home/awm27/hadoop/build/hadoop-streaming.jar -jobconf mapred.job.name="MRPSO_RBF_2" -input MRPSO_RBF/MRPSO_RBF_1 -output MRPSO_RBF/MRPSO_RBF_2 -mapper /ibrix/home/awm27/svn/mrpso/mapper.py -combiner /ibrix/home/awm27/svn/mrpso/reducer.py -reducer /ibrix/home/awm27/svn/mrpso/reducer.py -cmdenv PYTHONPATH=/ibrix/home/awm27/svn/python -cmdenv MRPSO_FUNCTION=RBF -cmdenv MRPSO_DATAFILES=/tmp/tmpT9M4cq


At the time of submission, the jobtracker reported receiving the job and began processing it.  The first line in this part of the logs is:

2007-01-29 12:40:44,072 INFO org.apache.hadoop.mapred.JobInProgress: Choosing cached task tip_0002_m_000002


At 13:16, the job completed, with the following normal log messages:

2007-01-29 13:16:36,115 INFO org.apache.hadoop.mapred.JobInProgress: Task 'task_0002_r_000001_0' has completed tip_0002_r_000001 successfully.
2007-01-29 13:16:36,117 INFO org.apache.hadoop.mapred.TaskInProgress: Task 'task_0002_r_000001_0' has completed.
2007-01-29 13:16:36,566 INFO org.apache.hadoop.mapred.JobInProgress: Task 'task_0002_r_000002_0' has completed tip_0002_r_000002 successfully.
2007-01-29 13:16:36,566 INFO org.apache.hadoop.mapred.TaskInProgress: Task 'task_0002_r_000002_0' has completed.
2007-01-29 13:16:36,879 INFO org.apache.hadoop.mapred.JobInProgress: Task 'task_0002_r_000003_0' has completed tip_0002_r_000003 successfully.
2007-01-29 13:16:36,879 INFO org.apache.hadoop.mapred.TaskInProgress: Task 'task_0002_r_000003_0' has completed.
2007-01-29 13:16:41,808 INFO org.apache.hadoop.mapred.JobInProgress: Task 'task_0002_r_000000_0' has completed tip_0002_r_000000 successfully.
2007-01-29 13:16:41,930 INFO org.apache.hadoop.mapred.TaskInProgress: Task 'task_0002_r_000000_0' has completed.
2007-01-29 13:16:41,940 INFO org.apache.hadoop.mapred.JobInProgress: Job job_0002 has completed successfully.
2007-01-29 13:16:41,942 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'task_0002_m_000002_0' from 'tracker_m4b-3-2.local:50050'
...
2007-01-29 13:16:50,660 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'task_0002_m_000016_0' from 'tracker_m4b-3-8.local:50050'
2007-01-29 13:16:50,661 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'task_0002_m_000044_0' from 'tracker_m4b-3-8.local:50050'
2007-01-29 13:16:50,662 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'task_0002_m_000046_0' from 'tracker_m4b-3-8.local:50050'


At 14:03, the hadoop-streaming command (process 21261) that started at 12:40 was still running and using 0.0% of the CPU.  The output of the command ended with:

07/01/29 13:16:41 INFO streaming.StreamJob:  map 100%  reduce 93%
07/01/29 13:16:42 INFO streaming.StreamJob:  map 100%  reduce 100%
07/01/29 13:16:42 INFO streaming.StreamJob: Job complete: job_0002

It seems odd that hadoop-streaming would hang instead of returning.

I hope this information is helpful.  Thanks.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-950) Job completes but command doesn't return

Posted by "Andrew McNabb (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12468446 ] 

Andrew McNabb commented on HADOOP-950:
--------------------------------------

The hanging seems to have been caused by blocked writes to a pipe.  I have a workaround, but it would be nice if hadoop-streaming could be configured to quiet mode.  Thanks.

> Job completes but command doesn't return
> ----------------------------------------
>
>                 Key: HADOOP-950
>                 URL: https://issues.apache.org/jira/browse/HADOOP-950
>             Project: Hadoop
>          Issue Type: Bug
>          Components: contrib/streaming
>    Affects Versions: 0.10.1
>         Environment: Linux, Java 1.6.0-b105
>            Reporter: Andrew McNabb
>
> I've had a job submission command hang on many different occasions.  I can't tell exactly what makes it complete some times and hang others.  Here's some information about one time when it hanged.
> I started a job at 12:40.  Here is the info from 'ps aux' including the full command line:
> /ibrix/home/awm27/jdk1.6.0/bin/java -Xmx1000m -Dhadoop.log.dir=/ibrix/home/awm27/hadoop/logs/log.f17880 -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/ibrix/home/awm27/hadoop/bin/.. -Dhadoop.id.str= -Dhadoop.root.logger=INFO,console -Djava.library.path=/ibrix/home/awm27/hadoop/bin/../lib/native/Linux-amd64-64 -classpath /ibrix/home/awm27/hadoop/dynamic_conf/tmp.FkrUc17883:/ibrix/home/awm27/jdk1.6.0/lib/tools.jar:/ibrix/home/awm27/hadoop/bin/../build/classes:/ibrix/home/awm27/hadoop/bin/../build:/ibrix/home/awm27/hadoop/bin/../build/test/classes:/ibrix/home/awm27/hadoop/bin/..:/ibrix/home/awm27/hadoop/bin/../hadoop-0.10.1-core.jar:/ibrix/home/awm27/hadoop/bin/../lib/commons-cli-2.0-SNAPSHOT.jar:/ibrix/home/awm27/hadoop/bin/../lib/commons-codec-1.3.jar:/ibrix/home/awm27/hadoop/bin/../lib/commons-httpclient-3.0.1.jar:/ibrix/home/awm27/hadoop/bin/../lib/commons-logging-1.0.4.jar:/ibrix/home/awm27/hadoop/bin/../lib/commons-logging-api-1.0.4.jar:/ibrix/home/awm27/hadoop/bin/../lib/jets3t.jar:/ibrix/home/awm27/hadoop/bin/../lib/jetty-5.1.4.jar:/ibrix/home/awm27/hadoop/bin/../lib/junit-3.8.1.jar:/ibrix/home/awm27/hadoop/bin/../lib/log4j-1.2.13.jar:/ibrix/home/awm27/hadoop/bin/../lib/lucene-core-1.9.1.jar:/ibrix/home/awm27/hadoop/bin/../lib/servlet-api.jar:/ibrix/home/awm27/hadoop/bin/../lib/jetty-ext/ant.jar:/ibrix/home/awm27/hadoop/bin/../lib/jetty-ext/commons-el.jar:/ibrix/home/awm27/hadoop/bin/../lib/jetty-ext/jasper-compiler.jar:/ibrix/home/awm27/hadoop/bin/../lib/jetty-ext/jasper-runtime.jar:/ibrix/home/awm27/hadoop/bin/../lib/jetty-ext/jsp-api.jar org.apache.hadoop.util.RunJar /ibrix/home/awm27/hadoop/build/hadoop-streaming.jar -jobconf mapred.job.name="MRPSO_RBF_2" -input MRPSO_RBF/MRPSO_RBF_1 -output MRPSO_RBF/MRPSO_RBF_2 -mapper /ibrix/home/awm27/svn/mrpso/mapper.py -combiner /ibrix/home/awm27/svn/mrpso/reducer.py -reducer /ibrix/home/awm27/svn/mrpso/reducer.py -cmdenv PYTHONPATH=/ibrix/home/awm27/svn/python -cmdenv MRPSO_FUNCTION=RBF -cmdenv MRPSO_DATAFILES=/tmp/tmpT9M4cq
> At the time of submission, the jobtracker reported receiving the job and began processing it.  The first line in this part of the logs is:
> 2007-01-29 12:40:44,072 INFO org.apache.hadoop.mapred.JobInProgress: Choosing cached task tip_0002_m_000002
> At 13:16, the job completed, with the following normal log messages:
> 2007-01-29 13:16:36,115 INFO org.apache.hadoop.mapred.JobInProgress: Task 'task_0002_r_000001_0' has completed tip_0002_r_000001 successfully.
> 2007-01-29 13:16:36,117 INFO org.apache.hadoop.mapred.TaskInProgress: Task 'task_0002_r_000001_0' has completed.
> 2007-01-29 13:16:36,566 INFO org.apache.hadoop.mapred.JobInProgress: Task 'task_0002_r_000002_0' has completed tip_0002_r_000002 successfully.
> 2007-01-29 13:16:36,566 INFO org.apache.hadoop.mapred.TaskInProgress: Task 'task_0002_r_000002_0' has completed.
> 2007-01-29 13:16:36,879 INFO org.apache.hadoop.mapred.JobInProgress: Task 'task_0002_r_000003_0' has completed tip_0002_r_000003 successfully.
> 2007-01-29 13:16:36,879 INFO org.apache.hadoop.mapred.TaskInProgress: Task 'task_0002_r_000003_0' has completed.
> 2007-01-29 13:16:41,808 INFO org.apache.hadoop.mapred.JobInProgress: Task 'task_0002_r_000000_0' has completed tip_0002_r_000000 successfully.
> 2007-01-29 13:16:41,930 INFO org.apache.hadoop.mapred.TaskInProgress: Task 'task_0002_r_000000_0' has completed.
> 2007-01-29 13:16:41,940 INFO org.apache.hadoop.mapred.JobInProgress: Job job_0002 has completed successfully.
> 2007-01-29 13:16:41,942 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'task_0002_m_000002_0' from 'tracker_m4b-3-2.local:50050'
> ...
> 2007-01-29 13:16:50,660 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'task_0002_m_000016_0' from 'tracker_m4b-3-8.local:50050'
> 2007-01-29 13:16:50,661 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'task_0002_m_000044_0' from 'tracker_m4b-3-8.local:50050'
> 2007-01-29 13:16:50,662 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'task_0002_m_000046_0' from 'tracker_m4b-3-8.local:50050'
> At 14:03, the hadoop-streaming command (process 21261) that started at 12:40 was still running and using 0.0% of the CPU.  The output of the command ended with:
> 07/01/29 13:16:41 INFO streaming.StreamJob:  map 100%  reduce 93%
> 07/01/29 13:16:42 INFO streaming.StreamJob:  map 100%  reduce 100%
> 07/01/29 13:16:42 INFO streaming.StreamJob: Job complete: job_0002
> It seems odd that hadoop-streaming would hang instead of returning.
> I hope this information is helpful.  Thanks.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-950) Job completes but command doesn't return

Posted by "Andrew McNabb (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12468419 ] 

Andrew McNabb commented on HADOOP-950:
--------------------------------------

I'm doing a test right now.  The output of hadoop-streaming is being sent to a pipe, so one theory for what's happening is that hadoop-streaming blocks when it writes because the output buffer is full.

If this is the cause, then there's no bug in hadoop-streaming per se, but the problem would go away if there were an option for quiet output (don't write out the percent complete, etc.).

I'll add another comment to this when I find out for sure.  Thanks.

> Job completes but command doesn't return
> ----------------------------------------
>
>                 Key: HADOOP-950
>                 URL: https://issues.apache.org/jira/browse/HADOOP-950
>             Project: Hadoop
>          Issue Type: Bug
>          Components: contrib/streaming
>    Affects Versions: 0.10.1
>         Environment: Linux, Java 1.6.0-b105
>            Reporter: Andrew McNabb
>
> I've had a job submission command hang on many different occasions.  I can't tell exactly what makes it complete some times and hang others.  Here's some information about one time when it hanged.
> I started a job at 12:40.  Here is the info from 'ps aux' including the full command line:
> /ibrix/home/awm27/jdk1.6.0/bin/java -Xmx1000m -Dhadoop.log.dir=/ibrix/home/awm27/hadoop/logs/log.f17880 -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/ibrix/home/awm27/hadoop/bin/.. -Dhadoop.id.str= -Dhadoop.root.logger=INFO,console -Djava.library.path=/ibrix/home/awm27/hadoop/bin/../lib/native/Linux-amd64-64 -classpath /ibrix/home/awm27/hadoop/dynamic_conf/tmp.FkrUc17883:/ibrix/home/awm27/jdk1.6.0/lib/tools.jar:/ibrix/home/awm27/hadoop/bin/../build/classes:/ibrix/home/awm27/hadoop/bin/../build:/ibrix/home/awm27/hadoop/bin/../build/test/classes:/ibrix/home/awm27/hadoop/bin/..:/ibrix/home/awm27/hadoop/bin/../hadoop-0.10.1-core.jar:/ibrix/home/awm27/hadoop/bin/../lib/commons-cli-2.0-SNAPSHOT.jar:/ibrix/home/awm27/hadoop/bin/../lib/commons-codec-1.3.jar:/ibrix/home/awm27/hadoop/bin/../lib/commons-httpclient-3.0.1.jar:/ibrix/home/awm27/hadoop/bin/../lib/commons-logging-1.0.4.jar:/ibrix/home/awm27/hadoop/bin/../lib/commons-logging-api-1.0.4.jar:/ibrix/home/awm27/hadoop/bin/../lib/jets3t.jar:/ibrix/home/awm27/hadoop/bin/../lib/jetty-5.1.4.jar:/ibrix/home/awm27/hadoop/bin/../lib/junit-3.8.1.jar:/ibrix/home/awm27/hadoop/bin/../lib/log4j-1.2.13.jar:/ibrix/home/awm27/hadoop/bin/../lib/lucene-core-1.9.1.jar:/ibrix/home/awm27/hadoop/bin/../lib/servlet-api.jar:/ibrix/home/awm27/hadoop/bin/../lib/jetty-ext/ant.jar:/ibrix/home/awm27/hadoop/bin/../lib/jetty-ext/commons-el.jar:/ibrix/home/awm27/hadoop/bin/../lib/jetty-ext/jasper-compiler.jar:/ibrix/home/awm27/hadoop/bin/../lib/jetty-ext/jasper-runtime.jar:/ibrix/home/awm27/hadoop/bin/../lib/jetty-ext/jsp-api.jar org.apache.hadoop.util.RunJar /ibrix/home/awm27/hadoop/build/hadoop-streaming.jar -jobconf mapred.job.name="MRPSO_RBF_2" -input MRPSO_RBF/MRPSO_RBF_1 -output MRPSO_RBF/MRPSO_RBF_2 -mapper /ibrix/home/awm27/svn/mrpso/mapper.py -combiner /ibrix/home/awm27/svn/mrpso/reducer.py -reducer /ibrix/home/awm27/svn/mrpso/reducer.py -cmdenv PYTHONPATH=/ibrix/home/awm27/svn/python -cmdenv MRPSO_FUNCTION=RBF -cmdenv MRPSO_DATAFILES=/tmp/tmpT9M4cq
> At the time of submission, the jobtracker reported receiving the job and began processing it.  The first line in this part of the logs is:
> 2007-01-29 12:40:44,072 INFO org.apache.hadoop.mapred.JobInProgress: Choosing cached task tip_0002_m_000002
> At 13:16, the job completed, with the following normal log messages:
> 2007-01-29 13:16:36,115 INFO org.apache.hadoop.mapred.JobInProgress: Task 'task_0002_r_000001_0' has completed tip_0002_r_000001 successfully.
> 2007-01-29 13:16:36,117 INFO org.apache.hadoop.mapred.TaskInProgress: Task 'task_0002_r_000001_0' has completed.
> 2007-01-29 13:16:36,566 INFO org.apache.hadoop.mapred.JobInProgress: Task 'task_0002_r_000002_0' has completed tip_0002_r_000002 successfully.
> 2007-01-29 13:16:36,566 INFO org.apache.hadoop.mapred.TaskInProgress: Task 'task_0002_r_000002_0' has completed.
> 2007-01-29 13:16:36,879 INFO org.apache.hadoop.mapred.JobInProgress: Task 'task_0002_r_000003_0' has completed tip_0002_r_000003 successfully.
> 2007-01-29 13:16:36,879 INFO org.apache.hadoop.mapred.TaskInProgress: Task 'task_0002_r_000003_0' has completed.
> 2007-01-29 13:16:41,808 INFO org.apache.hadoop.mapred.JobInProgress: Task 'task_0002_r_000000_0' has completed tip_0002_r_000000 successfully.
> 2007-01-29 13:16:41,930 INFO org.apache.hadoop.mapred.TaskInProgress: Task 'task_0002_r_000000_0' has completed.
> 2007-01-29 13:16:41,940 INFO org.apache.hadoop.mapred.JobInProgress: Job job_0002 has completed successfully.
> 2007-01-29 13:16:41,942 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'task_0002_m_000002_0' from 'tracker_m4b-3-2.local:50050'
> ...
> 2007-01-29 13:16:50,660 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'task_0002_m_000016_0' from 'tracker_m4b-3-8.local:50050'
> 2007-01-29 13:16:50,661 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'task_0002_m_000044_0' from 'tracker_m4b-3-8.local:50050'
> 2007-01-29 13:16:50,662 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'task_0002_m_000046_0' from 'tracker_m4b-3-8.local:50050'
> At 14:03, the hadoop-streaming command (process 21261) that started at 12:40 was still running and using 0.0% of the CPU.  The output of the command ended with:
> 07/01/29 13:16:41 INFO streaming.StreamJob:  map 100%  reduce 93%
> 07/01/29 13:16:42 INFO streaming.StreamJob:  map 100%  reduce 100%
> 07/01/29 13:16:42 INFO streaming.StreamJob: Job complete: job_0002
> It seems odd that hadoop-streaming would hang instead of returning.
> I hope this information is helpful.  Thanks.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.