Posted to dev@nutch.apache.org by Rod Taylor <rb...@sitesell.com> on 2005/10/14 16:53:01 UTC

All trackers exited on all nodes

Overnight the trackers appeared to have difficulty finding a block
from the datanode and eventually exited. The datanode reports serving
the block mentioned below successfully on several earlier occasions.

Instead of the tracker exiting, I would have expected the current task
to be aborted and the tracker to continue taking other tasks. Below are
the logs from the problem with the block through to the point where the
tracker was no longer running.


051014 072850 No node available for block blk_-6130354727587666838
051014 072850 Could not obtain block from any node: java.io.IOException: No live nodes contain current block
051014 072900 No node available for block blk_-6130354727587666838
051014 072900 Could not obtain block from any node: java.io.IOException: No live nodes contain current block
051014 072910 parsing file:/opt/nutch-0.8_2/conf/nutch-default.xml
051014 072910 parsing file:/opt/nutch-0.8_2/conf/mapred-default.xml
051014 072910 parsing /home/sitesell/local/tracker/task_m_xlwltz/job.xml
051014 072910 parsing file:/opt/nutch-0.8_2/conf/nutch-site.xml
051014 072910 Task task_m_xlwltz timed out.  Killing.
051014 072910 parsing file:/opt/nutch-0.8_2/conf/nutch-default.xml
051014 072910 parsing file:/opt/nutch-0.8_2/conf/mapred-default.xml
051014 072910 parsing /home/sitesell/local/tracker/task_m_xlwltz/job.xml
051014 072910 parsing file:/opt/nutch-0.8_2/conf/nutch-site.xml
051014 072910 task_m_1mewgo done; removing files.
Exception in thread "main" java.util.ConcurrentModificationException
        at java.util.TreeMap$EntryIterator.nextEntry(TreeMap.java:1048)
        at java.util.TreeMap$ValueIterator.next(TreeMap.java:1079)
        at org.apache.nutch.mapred.TaskTracker.close(TaskTracker.java:130)
        at org.apache.nutch.mapred.TaskTracker.run(TaskTracker.java:281)
        at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:625)
051014 072911 task_m_xlwltz  parsing file:/opt/nutch-0.8_2/conf/nutch-default.xml
051014 072911 task_m_xlwltz  parsing file:/opt/nutch-0.8_2/conf/nutch-site.xml
051014 072911 task_m_xlwltz  Child starting
051014 072911 Server connection on port 64071 from 192.168.100.13: starting
051014 072911 task_m_xlwltz  Client connection to 0.0.0.0:64071: starting
051014 072911 task_m_xlwltz  parsing file:/opt/nutch-0.8_2/conf/nutch-default.xml
051014 072911 task_m_xlwltz  parsing file:/opt/nutch-0.8_2/conf/mapred-default.xml
051014 072911 task_m_xlwltz  parsing /home/sitesell/local/tracker/task_m_xlwltz/job.xml
051014 072911 task_m_xlwltz  parsing file:/opt/nutch-0.8_2/conf/nutch-site.xml
051014 072911 task_m_xlwltz  Client connection to 192.168.100.14:5466: starting
051014 072912 task_m_xlwltz  0.07144528% /user/sitesell/generate-temp-1200428493/part-00000:3924416+230848
051014 072912 task_m_xlwltz  1.0% /user/sitesell/generate-temp-1200428493/part-00000:3924416+230848
051014 072912 Task task_m_xlwltz is done.
051014 072912 Task task_m_xlwltz is done.
051014 072912 Server connection on port 64071 from 192.168.100.13: exiting
051014 072912 task_m_xlwltz done; removing files.

-- 
Rod Taylor <rb...@sitesell.com>


Re: All trackers exited on all nodes

Posted by Rod Taylor <rb...@sitesell.com>.
On Mon, 2005-10-17 at 11:46 -0700, Doug Cutting wrote:
> Rod Taylor wrote:
> > The machine the namenode is running on does have very high load at
> > times. Do you recommend a separate box for the namenode and jobtracker
> > that runs only those two processes?
> 
> That would be optimal, but it shouldn't be required.  If a tasktracker 
> or datanode is sluggish then its impact is small, but if the jobtracker 
> or namenode becomes sluggish the impact is systemic.  That said, so long 

Machine ordered. Thanks.

> >>What's in the jobtracker logs around this time?  Did it report this 
> >>tasktracker as lost?
> > 
> > The jobtracker did not indicate such a thing (via an exception anyway).
> > Tasktracker connections seem to be established and disconnected
> > fairly frequently. Perhaps this is what you mean?
> 
> No, there's a "lost tracker" message when the jobtracker times out a 
> tasktracker.  These are bad, since the jobtracker then assumes that all 
> of the temporary map data at that tasktracker is gone, and re-schedules 
> those map tasks.

Ahh. I recall seeing those in the past when tasktrackers actually were
dying (you fixed that bug). No, nothing like that was happening this
time.

-- 
Rod Taylor <rb...@sitesell.com>


Re: All trackers exited on all nodes

Posted by Doug Cutting <cu...@nutch.org>.
Rod Taylor wrote:
> The machine the namenode is running on does have very high load at
> times. Do you recommend a separate box for the namenode and jobtracker
> that runs only those two processes?

That would be optimal, but it shouldn't be required.  If a tasktracker 
or datanode is sluggish then its impact is small, but if the jobtracker 
or namenode becomes sluggish the impact is systemic.  That said, so long 
as these don't crash, things should work.  The problem is that the code 
paths for recovery when namenodes and jobtrackers are sluggish have not 
been tested as much.

>>What's in the jobtracker logs around this time?  Did it report this 
>>tasktracker as lost?
> 
> The jobtracker did not indicate such a thing (via an exception anyway).
> Tasktracker connections seem to be established and disconnected
> fairly frequently. Perhaps this is what you mean?

No, there's a "lost tracker" message when the jobtracker times out a 
tasktracker.  These are bad, since the jobtracker then assumes that all 
of the temporary map data at that tasktracker is gone, and re-schedules 
those map tasks.

Doug

Re: All trackers exited on all nodes

Posted by Rod Taylor <rb...@sitesell.com>.
On Mon, 2005-10-17 at 11:06 -0700, Doug Cutting wrote:
> Rod Taylor wrote:
> > Overnight the trackers appeared to have difficulty finding a block
> > from the datanode and eventually exited. The datanode reports serving
> > the block mentioned below successfully on several earlier occasions.
> 
> Were all of the datanodes still alive?  If so, then the problem is 
> probably that heartbeats from the datanodes to the namenode were 
> delayed, so that the namenode assumed the datanodes were dead.  This can 

Yes. All of the datanodes, the namenode, and the jobtracker were alive,
and I don't recall any indication of errors in their logs, but the logs
have been rotated since and I don't have them anymore.

The machine the namenode is running on does have very high load at
times. Do you recommend a separate box for the namenode and jobtracker
that runs only those two processes?

> sometimes happen when the namenode becomes severely loaded.  What is in 
> the namenode's logs around this time?  Does it report some datanodes as 
> lost?

Sorry, I don't have the logs anymore. I will be sure to check for it
next time the problem occurs.

> > Instead of the tracker exiting, I would have expected the current task
> > to be aborted and the tracker to continue taking other tasks. Below are
> > the logs from the problem with the block through to the point where the
> > tracker was no longer running.
> [ ... ]
> > Exception in thread "main" java.util.ConcurrentModificationException
> >         at java.util.TreeMap$EntryIterator.nextEntry(TreeMap.java:1048)
> >         at java.util.TreeMap$ValueIterator.next(TreeMap.java:1079)
> >         at org.apache.nutch.mapred.TaskTracker.close(TaskTracker.java:130)
> >         at org.apache.nutch.mapred.TaskTracker.run(TaskTracker.java:281)
> >         at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:625)
> 
> I have seen this before.  It is a bug in the TaskTracker that I have not 
> yet had time to fix.  It is triggered by an unresponsive jobtracker, 
> which times out the tasktracker, assuming it is dead.  Then, when the 
> tasktracker's heartbeat arrives, the jobtracker does not recognize it, 
> which causes the tasktracker to close and restart.  The bug is in the 
> tasktracker's close method, which uses an iterator in an unsafe way.
> 
> It should instead use something like:
> 
>    while (tasks.size() != 0) {
>      TaskInProgress tip = (TaskInProgress)tasks.first();
>      tip.jobHasFinished();
>    }
> 
> What's in the jobtracker logs around this time?  Did it report this 
> tasktracker as lost?

The jobtracker did not indicate such a thing (via an exception anyway).
Tasktracker connections seem to be established and disconnected
fairly frequently. Perhaps this is what you mean?

> Was this all by chance at the start of an Indexer job?  That's when I've 
> seen this sort of thing.  This job has more input files than any other, 

Middle of a fetch, actually.

-- 
Rod Taylor <rb...@sitesell.com>


Re: All trackers exited on all nodes

Posted by Doug Cutting <cu...@nutch.org>.
Rod Taylor wrote:
> Overnight the trackers appeared to have difficulty finding a block
> from the datanode and eventually exited. The datanode reports serving
> the block mentioned below successfully on several earlier occasions.

Were all of the datanodes still alive?  If so, then the problem is 
probably that heartbeats from the datanodes to the namenode were 
delayed, so that the namenode assumed the datanodes were dead.  This can 
sometimes happen when the namenode becomes severely loaded.  What is in 
the namenode's logs around this time?  Does it report some datanodes as 
lost?
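
To make that concrete, the bookkeeping looks roughly like the sketch
below. This is only an illustration with made-up names (HeartbeatSketch,
the ten-minute expiry interval), not the actual NDFS namenode code: the
namenode records when it last processed a heartbeat from each datanode,
and a periodic check declares dead any node whose last heartbeat is older
than the expiry interval. If the namenode is too loaded to process
heartbeats promptly, perfectly healthy datanodes can cross that threshold,
and clients then see errors like the "No live nodes contain current block"
message in your log.

   import java.util.HashMap;
   import java.util.Iterator;
   import java.util.Map;

   // Hypothetical sketch of heartbeat-expiry bookkeeping; the real logic
   // lives in the NDFS namenode and differs in detail.
   public class HeartbeatSketch {
     static final long EXPIRE_INTERVAL = 10 * 60 * 1000;  // assumed: ten minutes

     private final Map lastHeartbeat = new HashMap();      // datanode name -> Long

     // Called when a heartbeat is processed.  On an overloaded namenode this
     // may run long after the heartbeat was actually sent.
     public synchronized void heartbeat(String datanode) {
       lastHeartbeat.put(datanode, new Long(System.currentTimeMillis()));
     }

     // Periodic check: a node whose last processed heartbeat is too old is
     // treated as dead and forgotten, even if it is actually fine and merely
     // waiting on a busy namenode.
     public synchronized void expireStaleNodes() {
       long now = System.currentTimeMillis();
       for (Iterator it = lastHeartbeat.entrySet().iterator(); it.hasNext();) {
         Map.Entry e = (Map.Entry) it.next();
         long last = ((Long) e.getValue()).longValue();
         if (now - last > EXPIRE_INTERVAL) {
           System.out.println("Lost heartbeat from " + e.getKey());
           it.remove();  // blocks on this node become unreadable to clients
         }
       }
     }
   }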

> Instead of the tracker exiting, I would have expected the current task
> to be aborted and the tracker to continue taking other tasks. Below are
> the logs from the problem with the block through to the point where the
> tracker was no longer running.
[ ... ]
> Exception in thread "main" java.util.ConcurrentModificationException
>         at java.util.TreeMap$EntryIterator.nextEntry(TreeMap.java:1048)
>         at java.util.TreeMap$ValueIterator.next(TreeMap.java:1079)
>         at org.apache.nutch.mapred.TaskTracker.close(TaskTracker.java:130)
>         at org.apache.nutch.mapred.TaskTracker.run(TaskTracker.java:281)
>         at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:625)

I have seen this before.  It is a bug in the TaskTracker that I have not 
yet had time to fix.  It is triggered by an unresponsive jobtracker, 
which times out the tasktracker, assuming it is dead.  Then, when the 
tasktracker's heartbeat arrives, the jobtracker does not recognize it, 
which causes the tasktracker to close and restart.  The bug is in the 
tasktracker's close method, which uses an iterator in an unsafe way.

It should instead use something like:

   while (tasks.size() != 0) {
     TaskInProgress tip = (TaskInProgress)tasks.first();
     tip.jobHasFinished();
   }
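
For reference, the failure can be reproduced outside Nutch. Below is a
standalone sketch; the Task class and its self-removing jobHasFinished()
are made up for illustration and are not the real TaskTracker code. It
shows that removing entries from a TreeMap while iterating over its
values() throws ConcurrentModificationException, exactly as in the stack
trace above, while draining the map by repeatedly taking the first entry
does not:

   import java.util.Iterator;
   import java.util.TreeMap;

   // Standalone illustration of the iterator bug; not the real TaskTracker code.
   public class CloseSketch {

     // Stand-in for the tracker's per-task state.  jobHasFinished() cleans up
     // and removes the task from the owning map, as the real method is meant to.
     static class Task {
       final String id;
       final TreeMap owner;
       Task(String id, TreeMap owner) { this.id = id; this.owner = owner; }
       void jobHasFinished() { owner.remove(id); }
     }

     static void addTasks(TreeMap tasks) {
       tasks.put("task_m_one", new Task("task_m_one", tasks));
       tasks.put("task_m_two", new Task("task_m_two", tasks));
     }

     public static void main(String[] args) {
       TreeMap tasks = new TreeMap();

       // Unsafe: the map is modified underneath a live values() iterator, so
       // the next call to next() throws ConcurrentModificationException.
       addTasks(tasks);
       try {
         for (Iterator it = tasks.values().iterator(); it.hasNext();) {
           ((Task) it.next()).jobHasFinished();
         }
       } catch (java.util.ConcurrentModificationException e) {
         System.out.println("unsafe iteration failed: " + e);
       }

       // Safe: take the first entry each time around; no iterator is held
       // across the removal, so the loop simply drains the map.
       addTasks(tasks);
       while (tasks.size() != 0) {
         Task tip = (Task) tasks.get(tasks.firstKey());
         tip.jobHasFinished();
       }
       System.out.println("drain loop left " + tasks.size() + " tasks");
     }
   }

Note that the drain loop only terminates if jobHasFinished() reliably
removes the task from the map, which is worth keeping in mind for the
real fix.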

What's in the jobtracker logs around this time?  Did it report this 
tasktracker as lost?

Was this all by chance at the start of an Indexer job?  That's when I've 
seen this sort of thing.  This job has more input files than any other, 
and my guess is that constructing the splits (which ties up the 
jobtracker) can for some reason take longer than the timeout, although 
it really shouldn't...

Doug