Posted to user@nutch.apache.org by lei wang <nu...@gmail.com> on 2009/07/10 03:56:49 UTC

Arc to segments failed for "Task attempt_200907091108_0001_m_000520_0 failed to report status for 602 seconds. Killing!"

Hi, I have been trying to convert ARC files to segments these days. Nutch works
fine converting 2 million pages, but it fails with "Task
attempt_200907091108_0001_m_000520_0 failed to report status for 602
seconds. Killing!" when I increase the page count to 7 million. I have 10
nodes; my hadoop-site.xml config is below.
Any help would be appreciated.
==============================================================
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
  <name>fs.default.name</name>
  <value>hdfs://distributed1:9000/</value>
  <description>The name of the default file system. Either the literal
string "local" or a host:port for DFS.</description>
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>distributed1:9001</value>
  <description>The host and port that the MapReduce job tracker runs at. If
"local", then jobs are run in-process as a single map and reduce
task.</description>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/had/nutch-1.0/tmp</value>
  <description>A base for other temporary directories.</description>
</property>
<property>
  <name>dfs.name.dir</name>
  <value>/home/had/nutch-1.0/filesystem/name</value>
  <description>Determines where on the local filesystem the DFS name node
should store the name table. If this is a comma-delimited list of
directories then the name table is replicated in all of the directories, for
redundancy. </description>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/home/had/nutch-1.0/filesystem/data</value>
  <description>Determines where on the local filesystem an DFS data node
should store its blocks. If this is a comma-delimited list of directories,
then data will be stored in all named directories, typically on different
devices. Directories that do not exist are ignored.</description>
</property>
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication. The actual number of replications
can be specified when the file is created. The default is used if
replication is not specified in create time.</description>
</property>
<property>
  <name>mapred.tasktracker.tasks.maximum</name>
  <value>4</value>
  <description>
    The maximum number of tasks that will be run simultaneously by
    a task tracker. This should be adjusted according to the heap size
    per task, the amount of RAM available, and CPU consumption of each task.
  </description>
</property>
<property>
  <name>mapred.map.tasks</name>
  <value>997</value>
  <description>The default number of map tasks per job.  Typically set
to a prime several times greater than number of available hosts.
  Ignored when mapred.job.tracker is "local".
  </description>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>79</value>
  <description>The default number of reduce tasks per job.  Typically set
  to a prime close to the number of available hosts.  Ignored when
  mapred.job.tracker is "local".
  </description>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx2000m</value>
  <description>
    You can specify other Java options for each map or reduce task here,
    but most likely you will want to adjust the heap size.
  </description>
</property>
<property>
  <name>mapred.system.dir</name>
  <value>/home/had/nutch-1.0/filesystem/mapreduce/system</value>
</property>

<property>
  <name>mapred.local.dir</name>
  <value>/home/had/nutch-1.0/filesystem/mapreduce/local</value>
</property>

</configuration>

Re: Arc to segments failed for "Task attempt_200907091108_0001_m_000520_0 failed to report status for 602 seconds. Killing!"

Posted by Ken Krugler <kk...@transpac.com>.
>Hi, I have been trying to convert ARC files to segments these days. Nutch works
>fine converting 2 million pages, but it fails with "Task
>attempt_200907091108_0001_m_000520_0 failed to report status for 602
>seconds. Killing!" when I increase the page count to 7 million. I have 10
>nodes; my hadoop-site.xml config is below.
>Any help would be appreciated.

Sounds like ArcSegmentCreator isn't calling Hadoop's reporter often
enough, though a quick check of that code didn't expose an obvious
bug. If every record were failing, this could happen (the reporter
would never get called).
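The fix on the job side, if that is what's happening, would be to make sure the map logic touches the reporter on every record, including the failing ones. Here's a minimal standalone sketch of that pattern; note the Reporter interface below is a simplified stand-in for org.apache.hadoop.mapred.Reporter, and processRecords is a hypothetical helper, not ArcSegmentCreator's actual code:

```java
import java.util.Arrays;
import java.util.List;

public class ProgressSketch {

    // Simplified stand-in for org.apache.hadoop.mapred.Reporter.
    interface Reporter {
        void progress();             // heartbeat: resets the task timeout clock
        void setStatus(String msg);  // status string; also counts as progress
    }

    // Counts heartbeats so the pattern can be verified outside Hadoop.
    static class CountingReporter implements Reporter {
        int heartbeats = 0;
        String lastStatus = "";
        public void progress() { heartbeats++; }
        public void setStatus(String msg) { lastStatus = msg; progress(); }
    }

    // Process a batch of records, reporting progress even when a record
    // fails, so the task is never silent for mapred.task.timeout ms.
    static int processRecords(List<String> records, Reporter reporter) {
        int parsed = 0;
        for (String record : records) {
            try {
                if (record.isEmpty()) {
                    throw new IllegalArgumentException("bad record");
                }
                parsed++;
            } catch (Exception e) {
                // skip the bad record, but fall through to the heartbeat
            }
            // Key point: heartbeat on every record, success or failure.
            reporter.setStatus("records parsed: " + parsed);
        }
        return parsed;
    }

    public static void main(String[] args) {
        CountingReporter r = new CountingReporter();
        int ok = processRecords(Arrays.asList("rec1", "", "rec3"), r);
        System.out.println("parsed=" + ok + " heartbeats=" + r.heartbeats);
        // prints: parsed=2 heartbeats=3
    }
}
```

If every record were throwing before the setStatus() call, the heartbeat count would stay at zero and the TaskTracker would kill the attempt, which matches the symptom above.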

You're sure that you get valid results when the record count is 2M?

You could increase the Hadoop timeout by adding an entry like this to 
your hadoop-site.xml file:

<property>
   <name>mapred.task.timeout</name>
   <value>600000</value>
   <description>The number of milliseconds before a task will be
   terminated if it neither reads an input, writes an output, nor
   updates its status string.
   </description>
</property>

The default is 600000 ms (600 seconds), so bump it up as necessary.
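For instance, to triple the allowance to 30 minutes (1800000 ms is only an illustrative value; tune it to how long your slowest batch of records actually takes):

```xml
<property>
   <name>mapred.task.timeout</name>
   <value>1800000</value>
   <description>Raised from the 600000 ms default so that long-running
   ARC record batches are not killed before they report progress.
   </description>
</property>
```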

-- Ken



-- 
Ken Krugler
+1 530-210-6378