Posted to common-user@hadoop.apache.org by George Kousiouris <gk...@mail.ntua.gr> on 2011/09/21 15:58:29 UTC

Problem with MR job

Hi all,

We are trying to run a Mahout job on a Hadoop cluster, but we keep
getting the same result. The job passes the initial Mahout stages, but
when it is submitted as an MR job it stays stuck at 0% progress. In the
UI we can see that it has been submitted but is not running. After a
while it gets killed. The logs show this error:

2011-09-21 07:47:50,507 INFO org.apache.hadoop.mapred.JobTracker: problem cleaning system directory: hdfs://master/var/lib/hadoop-0.20/cache/hdfs/mapred/system
org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot create directory /var/lib/hadoop-0.20/cache/hdfs/mapred/system. Name nod$
The reported blocks 0 needs additional 12 blocks to reach the threshold 0.9990 of total blocks 13. Safe mode will be turned off automatically.
         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInternal(FSNamesystem.java:1966)
         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:1940)
         at org.apache.hadoop.hdfs.server.namenode.NameNode.mkdirs(NameNode.java:770)
         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
         at java.lang.reflect.Method.invoke(Method.java:597)


Some staging files seem to have been created, however.

I was thinking of sending this to the Mahout mailing list, but it seems
to be more of a core Hadoop issue.

We are using the following command to launch the mahout example:
./mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job 
--input hdfs://master/user/hdfs/testdata/synthetic_control.data --output 
hdfs://master/user/hdfs/testdata/output --t1 0.5 --t2 1 --maxIter 50

Any clues?
George

-- 

---------------------------

George Kousiouris
Electrical and Computer Engineer
Division of Communications,
Electronics and Information Engineering
School of Electrical and Computer Engineering
Tel: +30 210 772 2546
Mobile: +30 6939354121
Fax: +30 210 772 2569
Email: gkousiou@mail.ntua.gr
Site: http://users.ntua.gr/gkousiou/

National Technical University of Athens
9 Heroon Polytechniou str., 157 73 Zografou, Athens, Greece

Re: Problem with MR job

Posted by Harsh J <ha...@cloudera.com>.

Hello George,

Have you looked at your DFS health page (http://NN:50070/)? I believe
you have missing or fallen DataNode instances.

I'd start them back up, after checking their (DataNode's) logs to
figure out why they died.
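
Besides the health page, the safe-mode state named in the original error can be checked from a script. The following is only a minimal sketch, assuming the `hadoop` command is on the PATH and that `hadoop dfsadmin -safemode get` prints a line containing "Safe mode is ON" or "Safe mode is OFF". The helper names `parse_safemode` and `namenode_in_safemode` are hypothetical and not part of Hadoop.

```python
import subprocess


def parse_safemode(output):
    """Return True if the text says safe mode is ON, False if OFF.

    Assumption: the output contains 'Safe mode is ON' or 'Safe mode is OFF'.
    """
    text = output.upper()
    if "SAFE MODE IS ON" in text:
        return True
    if "SAFE MODE IS OFF" in text:
        return False
    raise ValueError("unrecognised safemode output: %r" % output)


def namenode_in_safemode():
    """Ask the NameNode for its safe-mode state via the hadoop CLI.

    Assumption: 'hadoop' is on the PATH of the user running this script.
    """
    out = subprocess.check_output(["hadoop", "dfsadmin", "-safemode", "get"])
    return parse_safemode(out.decode("utf-8", "replace"))
```

If `namenode_in_safemode()` keeps returning True long after startup, that would be consistent with DataNodes not having reported their blocks.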


-- 
Harsh J

Re: Problem with MR job

Posted by Uma Maheswara Rao G 72686 <ma...@huawei.com>.

Can you check the DN data directories to see whether the blocks are present?

Can you also share the DN and NN logs? Please upload them somewhere and post the link here.
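
One way to do the block check is to count the block files on each DataNode. This is a rough sketch, assuming block files are stored under the data directory with names starting with `blk_` and that their checksum companions end in `.meta`. The function name `count_block_files` is hypothetical, and the directory to pass in is whatever `dfs.data.dir` points to on that node.

```python
import os


def count_block_files(data_dir):
    """Count files named blk_<id> under data_dir, skipping .meta checksum files."""
    count = 0
    for _root, _dirs, files in os.walk(data_dir):
        for name in files:
            if name.startswith("blk_") and not name.endswith(".meta"):
                count += 1
    return count
```

A result of 0 on every DataNode would match the "reported blocks 0" figure in the NameNode message.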

Regards,
Uma
----- Original Message -----
From: George Kousiouris <gk...@mail.ntua.gr>
Date: Wednesday, September 21, 2011 8:06 pm
Subject: Re: Problem with MR job
To: common-user@hadoop.apache.org
Cc: Uma Maheswara Rao G 72686 <ma...@huawei.com>


Re: Problem with MR job

Posted by George Kousiouris <gk...@mail.ntua.gr>.

Hi,

Some more logs, specifically from the JobTracker:

2011-09-21 10:22:43,482 INFO org.apache.hadoop.mapred.JobInProgress: Initializing job_201109211018_0001
2011-09-21 10:22:43,538 ERROR org.apache.hadoop.mapred.JobHistory: Failed creating job history log file for job job_201109211018_0001
java.io.FileNotFoundException: /usr/lib/hadoop-0.20/logs/history/master_1316614721548_job_201109211018_0001_hdfs_Input+Driver+running+over+input%3A+hdfs%3A%2F%2Fmaster%2Fuse (P$
         at java.io.FileOutputStream.open(Native Method)
         at java.io.FileOutputStream.<init>(FileOutputStream.java:179)
         at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:189)
         at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:185)
         at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:243)
         at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.<init>(ChecksumFileSystem.java:336)
         at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:369)
         at org.apache.hadoop.mapred.JobHistory$JobInfo.logSubmitted(JobHistory.java:1223)
         at org.apache.hadoop.mapred.JobInProgress$3.run(JobInProgress.java:681)
         at java.security.AccessController.doPrivileged(Native Method)
         at javax.security.auth.Subject.doAs(Subject.java:396)
         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
         at org.apache.hadoop.mapred.JobInProgress.initTasks(JobInProgress.java:678)
         at org.apache.hadoop.mapred.JobTracker.initJob(JobTracker.java:4013)
         at org.apache.hadoop.mapred.EagerTaskInitializationListener$InitJob.run(EagerTaskInitializationListener.java:79)
         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
         at java.lang.Thread.run(Thread.java:662)
2011-09-21 10:22:43,666 ERROR org.apache.hadoop.mapred.JobHistory: Failed to store job conf in the log dir
java.io.FileNotFoundException: /usr/lib/hadoop-0.20/logs/history/master_1316614721548_job_201109211018_0001_conf.xml (Permission denied)
         at java.io.FileOutputStream.open(Native Method)
         at java.io.FileOutputStream.<init>(FileOutputStream.java:179)
         at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:189)
         at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:185)
         at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:243)
         at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.<init>(ChecksumFileSystem.java:336)
         at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:369)
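
The second trace ends in "(Permission denied)" for a file under /usr/lib/hadoop-0.20/logs/history, which suggests that the account running the JobTracker may not be allowed to create files in that local directory. A quick way to test this is sketched below. It is only a sketch: the function name `can_create_files` is hypothetical, and the result is meaningful only when the script runs as the same user as the JobTracker.

```python
import os


def can_create_files(directory):
    """Return True if the current user may create files in directory.

    Creating a file needs both write and search (execute) permission
    on the directory itself.
    """
    return os.path.isdir(directory) and os.access(directory, os.W_OK | os.X_OK)
```

For example, `can_create_files("/usr/lib/hadoop-0.20/logs/history")` returning False would point to an ownership or mode problem on that directory rather than to HDFS.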


On 9/21/2011 5:15 PM, George Kousiouris wrote:

Re: Problem with MR job

Posted by George Kousiouris <gk...@mail.ntua.gr>.

Hi,

fsck reports the filesystem as healthy, and the DataNodes appear to be live:
Status: HEALTHY
  Total size:    118805326 B
  Total dirs:    31
  Total files:    38
  Total blocks (validated):    38 (avg. block size 3126455 B)
  Minimally replicated blocks:    38 (100.0 %)
  Over-replicated blocks:    0 (0.0 %)
  Under-replicated blocks:    9 (23.68421 %)
  Mis-replicated blocks:        0 (0.0 %)
  Default replication factor:    1
  Average block replication:    1.2368422
  Corrupt blocks:        0
  Missing replicas:        72 (153.19148 %)
  Number of data-nodes:        2
  Number of racks:        1
FSCK ended at Wed Sep 21 10:06:17 EDT 2011 in 9 milliseconds


The filesystem under path '/' is HEALTHY

The jps command has the following output:
hdfs@master:~$ jps
24292 SecondaryNameNode
30010 Jps
24109 DataNode
23962 NameNode

Shouldn't this show two DataNode entries? In our setup, one of the
DataNodes runs on the same machine as the NameNode, but I seem to
remember that in the past, even with this setup, two DataNode entries
appeared in the jps output.
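
Note that jps, run without a host argument, lists only the JVMs on the local machine, so a DataNode on another host would not show up here. The sketch below counts daemon names in jps output so that the same check can be repeated on each node. The function name `count_daemons` is hypothetical, and the sample text is the output quoted above.

```python
from collections import Counter


def count_daemons(jps_output):
    """Count process names in jps output (one 'PID Name' pair per line)."""
    counts = Counter()
    for line in jps_output.splitlines():
        parts = line.split()
        # Skip blank lines and anything that is not exactly "PID Name".
        if len(parts) == 2 and parts[0].isdigit():
            counts[parts[1]] += 1
    return counts


sample = """24292 SecondaryNameNode
30010 Jps
24109 DataNode
23962 NameNode"""
print(count_daemons(sample)["DataNode"])  # prints 1
```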

Thanks,
George




On 9/21/2011 5:08 PM, Uma Maheswara Rao G 72686 wrote:

Re: Problem with MR job

Posted by Uma Maheswara Rao G 72686 <ma...@huawei.com>.

Hi,

 Was the cluster restarted? Is your NameNode detecting the DataNodes as live?
 It looks like the DNs have not reported any blocks to the NN yet. You have 13 blocks persisted in the NameNode namespace, and at least 12 of them must be reported by your DNs. Otherwise the NameNode will not leave safe mode automatically.
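
The figure of 12 can be reproduced from the numbers in the log message. This is a sketch of the arithmetic only, assuming the required count is the total number of blocks multiplied by the threshold and truncated to an integer, which matches the message (13 x 0.999 = 12.987, and the log asks for 12). The function name `blocks_needed` is hypothetical.

```python
def blocks_needed(total_blocks, threshold, reported_blocks):
    """Additional block reports needed before safe mode can end.

    Assumption: the required count is total_blocks * threshold truncated
    to an integer, which matches the numbers in the log message.
    """
    required = int(total_blocks * threshold)
    return max(required - reported_blocks, 0)


print(blocks_needed(13, 0.999, 0))  # prints 12
```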

Regards,
Uma