Posted to common-user@hadoop.apache.org by Pat Ferrel <pa...@occamsmachete.com> on 2012/05/03 19:09:27 UTC

Problem with cluster

I'm trying to use a small cluster to make sure I understand the setup 
and have my code running before going to a big cluster. I have two 
machines. I've followed the tutorial here: 
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/ 
I have been using 0.20.203 -- is this the most stable version of pre-1.0 
code?

The cluster seemed fine for some time except for the occasional HDFS 
corruption, a known issue. I have run mostly unaltered Mahout code with 
success.

However I am now getting some consistent errors with mahout and bixo 
(only recently started using this). When I start a job from the master, 
say a command line mahout job, the slave dies pretty quickly. It looks 
like spawned threads never complete and kill the slave. Hadoop may 
recover or it may not depending on what it is doing.

In any case when I go to the slave and do ps -e I get a huge list of

    "fuser <defunct>" with a long list of pids.

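As a rough way to quantify that list, the sketch below counts defunct (zombie) entries per command name. It assumes the default Linux `ps -e` column layout (PID, TTY, TIME, CMD); the function names are made up for illustration and are not part of Hadoop or any other tool.

```python
import subprocess
from collections import Counter


def count_defunct(ps_output: str) -> Counter:
    """Count zombie processes per command name in `ps -e` style output.

    Assumes four whitespace-separated columns: PID TTY TIME CMD.
    """
    counts = Counter()
    suffix = "<defunct>"
    for line in ps_output.splitlines():
        fields = line.split(None, 3)
        # Skip the header row and anything that is not a process row.
        if len(fields) != 4 or not fields[0].isdigit():
            continue
        cmd = fields[3].strip()
        if cmd.endswith(suffix):
            counts[cmd[: -len(suffix)].strip()] += 1
    return counts


def defunct_on_this_host() -> Counter:
    """Run `ps -e` locally and summarise the zombies it reports."""
    result = subprocess.run(
        ["ps", "-e"], capture_output=True, text=True, check=True
    )
    return count_defunct(result.stdout)
```

Calling `defunct_on_this_host()` on the slave before and after starting a job would show whether the number of `fuser` zombies grows while the job runs.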

The datanode logs on the slave have this warning:

    pat@occam:~$ tail -f hadoop-0.20.203.0/logs/hadoop-pat-datanode-occam.log
    2012-05-03 08:39:39,035 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Waiting for threadgroup to exit, active threads is 1
    2012-05-03 08:39:40,035 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Waiting for threadgroup to exit, active threads is 1
    2012-05-03 08:39:41,035 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Waiting for threadgroup to exit, active threads is 1
    2012-05-03 08:39:42,036 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Waiting for threadgroup to exit, active threads is 1
    etc....

So far I have removed the slave from the master's config and set 
replication to 1 and all works, just slower.

Any ideas? And should I upgrade to a newer version?

Re: Problem with cluster

Posted by Ravi Prakash <ra...@gmail.com>.

Hi Pat,

0.20.205 is the stable version before 1.0. 1.0 is not substantially different
from 0.20. Any reasons you don't wanna use it?

I don't think "occasional HDFS corruption" is a known issue. That would be,
umm... let's just say pretty severe. Are you sure you've configured it
properly?

Your task is killing the Hadoop daemons? :-o You might wanna check with the
developers of Mahout / bixo if that is a known issue. Obviously it should
not happen. Hadoop daemons are known to be quite long lasting (many months
at least), and there are ways you can set up security to prevent tasks from
doing that (but guessing you have 2 nodes, maybe you don't want to invest
in that).

The message is displayed when the DN is trying to shut down but cannot
because it is waiting on some (apparently 1) thread.
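
To see when the shutdown started and how long the DN has been stuck, something like the sketch below could pull those messages out of the datanode log. It assumes each entry sits on a single line in the log file on disk, and it matches only the message text quoted in this thread; the names `WAITING` and `shutdown_waits` are made up for illustration.

```python
import re

# Matches the "Waiting for threadgroup to exit" entries quoted in this thread.
WAITING = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} INFO "
    r"\S*DataNode: Waiting for threadgroup to exit, "
    r"active threads is (?P<n>\d+)"
)


def shutdown_waits(lines):
    """Return (timestamp, active_threads) for each matching log line."""
    hits = []
    for line in lines:
        match = WAITING.match(line.strip())
        if match:
            hits.append((match.group("ts"), int(match.group("n"))))
    return hits


def first_and_last(hits):
    """Return the first and last matching timestamps, or None if no hits."""
    if not hits:
        return None
    return hits[0][0], hits[-1][0]
```

Passing an open log file to `shutdown_waits` and then looking at the entries logged just before the first timestamp is one way to find what triggered the shutdown.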

HTH
Ravi
