Posted to common-user@hadoop.apache.org by Nathan Fiedler <na...@gmail.com> on 2008/04/01 00:55:13 UTC

Re: reduce task hanging or just slow?

I don't have any particular experience with this, but perhaps X-Trace
[1] can help. The presentation given at the Hadoop Summit was very
impressive; it looks like a great debugging tool. There are already hooks
in Hadoop, so I think it's just a matter of enabling them, collecting
the data, and generating the pretty graphs, at which point hopefully
the cause becomes clear.

n

[1] http://www.x-trace.net/

On Mon, Mar 31, 2008 at 12:07 PM, Colin Freas <co...@gmail.com> wrote:
> I've set up a job to run on my small 4 (sometimes 5) node cluster of
>  dual-processor server boxes with 2-8GB of memory.
>
>  My job processes 24 100-300MB files that make up a day's worth of logs; the
>  total data is about 6GB.
>
>  I've modified the word count example to do what I need, and it works fine on
>  small test files.
>
>  I've set the number of map tasks to 200 and the number of reduce tasks to 14
>  (see the JobConf sketch below this quoted message for how these are set).
>  Things seem to go along fine, the map % climbs nicely, along with the
>  reduce.  Once the map hits 100% though, the reduce % stops increasing.
>  Right now it's stuck around 58%.  I was hoping changing the number of reduce
>  tasks would help, but I'm not really sure it did.  I had tried this once
>  before with the default number of reduce tasks, and I got to 100% (Map) and
>  14% (Reduce) before I saw this hanging behavior.
>
>  I'm just trying to understand what's happening here, and if there's
>  something I can do to increase the performance, short of adding nodes.  Is
>  it likely I've set something up incorrectly somewhere?
>
>  Any help appreciated.
>
>  Thanks!
>
>  -Colin
>
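
For reference, those map/reduce counts are normally set on the JobConf.
A minimal sketch of a modified word count driver (the class name LogCount
and the paths below are made-up placeholders, not Colin's actual job):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class LogCount {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(LogCount.class);
        conf.setJobName("logcount");

        // Output types, as in the stock word count example.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        // The modified word count mapper/reducer would be set here:
        // conf.setMapperClass(...); conf.setReducerClass(...);

        // setNumMapTasks() is only a hint -- the real number of map tasks
        // follows the input splits (roughly one per HDFS block).
        // setNumReduceTasks() is used as given.
        conf.setNumMapTasks(200);
        conf.setNumReduceTasks(14);

        FileInputFormat.setInputPaths(conf, new Path("/logs/2008-03-30"));
        FileOutputFormat.setOutputPath(conf, new Path("/logs/counted"));

        JobClient.runJob(conf);
      }
    }

Keep in mind that the reduce percentage the job reports also covers the
copy (shuffle) and sort phases, so it can sit still for a while even when
nothing is actually wrong.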

Re: reduce task hanging or just slow?

Posted by Colin Freas <co...@gmail.com>.
I believe that this is exactly what happened.

I'm not sure of the exact cause, but the networking stack on the master
node was all screwed up somehow.  All the machines serve double duty as
development boxes, and they're on two different networks.  The master node
could contact the cluster network but not the open net.  Once we got that
working, things seemed alright, even though before that all the cluster
machines could already contact the master node on the private gig-e network.

So, this is a pain in the ass.  Is there a way to get it to bind hostnames
to the IPs in my slaves file?  Or to just use the IPs in slaves outright?  And
is there some way to know for sure that this was the problem?  Is this
related to HADOOP-1374?  Could that bug be this hostname thing?
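
(For what it's worth, as far as I can tell the conf/slaves file is only read
by the start/stop scripts that ssh out to the worker nodes, so I'm guessing
plain IPs would work there.  Mine would look something like this, with
made-up addresses:

    # conf/slaves -- one entry per worker node
    192.168.0.11
    192.168.0.12
    192.168.0.13
    192.168.0.14

The trackers still identify themselves by hostname once they're running,
though, so it seems like the hostname resolution has to be consistent
everywhere either way.)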

-Colin



On Mon, Mar 31, 2008 at 8:58 PM, Mafish Liu <ma...@gmail.com> wrote:

> Hi:
>    I have run into a similar problem.  In the end, I found that it was
> caused by hostname resolution, because Hadoop uses hostnames to reach the
> other nodes.
>    To check for this, open your jobtracker log file (it usually resides in
> $HADOOP_HOME/logs/hadoop-xxxx-jobtracker-xxxx.log) and look for an error
> like:
> "FATAL org.apache.hadoop.mapred.JobTracker: java.net.UnknownHostException:
> Invalid hostname for server: local"
>    If it is there, adding IP-hostname pairs to the /etc/hosts file on all
> of your nodes may fix the problem.
>
> Good luck and best regards.
>
> Mafish
>
> --
> Mafish@gmail.com
> Institute of Computing Technology, Chinese Academy of Sciences, Beijing.
>

Re: reduce task hanging or just slow?

Posted by Mafish Liu <ma...@gmail.com>.
Hi:
    I have run into a similar problem.  In the end, I found that it was
caused by hostname resolution, because Hadoop uses hostnames to reach the
other nodes.
    To check for this, open your jobtracker log file (it usually resides in
$HADOOP_HOME/logs/hadoop-xxxx-jobtracker-xxxx.log) and look for an error
like:
"FATAL org.apache.hadoop.mapred.JobTracker: java.net.UnknownHostException:
Invalid hostname for server: local"
    If it is there, adding IP-hostname pairs to the /etc/hosts file on all
of your nodes may fix the problem.
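
For example, /etc/hosts on every node could look roughly like this (the
addresses and hostnames are only placeholders -- use the real ones from
your cluster):

    127.0.0.1      localhost
    192.168.0.10   master     # namenode + jobtracker
    192.168.0.11   slave1
    192.168.0.12   slave2
    192.168.0.13   slave3
    192.168.0.14   slave4

Also make sure the master's own hostname is not mapped only to 127.0.0.1,
or the daemons may advertise the loopback address and the slaves will not
be able to reach them.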

Good luck and best regards.

Mafish

-- 
Mafish@gmail.com
Institute of Computing Technology, Chinese Academy of Sciences, Beijing.