Posted to common-user@hadoop.apache.org by C G <pa...@yahoo.com> on 2008/07/06 17:32:15 UTC

Hadoop 0.17.0 - lots of I/O problems and can't run small datasets?

Hi All:
 
I've got 0.17.0 set up on a 7-node grid (6 slaves w/datanodes, 1 master running the namenode).  I'm trying to process a small (180G) dataset.  I've done this successfully and painlessly on 0.15.0.  When I run 0.17.0 with the same data and same code (with API changes for 0.17.0 and recompiled, of course), I get a ton of failures.  I've increased the number of namenode threads trying to resolve this (see the configuration sketch after the error excerpt), but that doesn't seem to help.  The errors are of the following flavor:
 
java.io.IOException: Could not get block locations. Aborting...
java.io.IOException: All datanodes 10.2.11.2:50010 are bad. Aborting...
Exception in thread "Thread-2" java.util.ConcurrentModificationException
Exception closing file /blah/_temporary/_task_200807052311_0001_r_000004_0/baz/part-xxxxx
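
(The "number of namenode threads" mentioned above is normally raised through the NameNode's RPC handler count. A minimal sketch of the idea, assuming the 0.17-era property name dfs.namenode.handler.count; verify the name and default against your release's hadoop-default.xml:)

    import org.apache.hadoop.conf.Configuration;

    public class RaiseHandlerCount {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Assumed 0.17-era property; the default is 10 handler threads.
            // More handlers let the NameNode serve more concurrent client RPCs.
            conf.setInt("dfs.namenode.handler.count", 40);
            System.out.println(conf.getInt("dfs.namenode.handler.count", 10));
        }
    }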
 
As things stand right now, I can't deploy to 0.17.0 (or 0.16.4 or 0.17.1).  I am wondering if anybody can shed some light on this, or if others are having similar problems.  
 
Any thoughts, insights, etc. would be greatly appreciated.
 
Thanks,
C G
 
Here's an ugly trace:
08/07/06 01:43:29 INFO mapred.JobClient:  map 100% reduce 93%
08/07/06 01:43:29 INFO mapred.JobClient: Task Id : task_200807052311_0001_r_000003_0, Status : FAILED
java.io.IOException: Could not get block locations. Aborting...
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2080)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1300(DFSClient.java:1702)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1818)
task_200807052311_0001_r_000003_0: Exception closing file /output/_temporary/_task_200807052311_0001_r_000003_0/a/b/part-00003
task_200807052311_0001_r_000003_0: java.io.IOException: All datanodes 10.2.11.2:50010 are bad. Aborting...
task_200807052311_0001_r_000003_0:      at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2095)
task_200807052311_0001_r_000003_0:      at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1300(DFSClient.java:1702)
task_200807052311_0001_r_000003_0:      at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1818)
task_200807052311_0001_r_000003_0: Exception in thread "Thread-2" java.util.ConcurrentModificationException
task_200807052311_0001_r_000003_0:      at java.util.TreeMap$PrivateEntryIterator.nextEntry(TreeMap.java:1100)
task_200807052311_0001_r_000003_0:      at java.util.TreeMap$KeyIterator.next(TreeMap.java:1154)
task_200807052311_0001_r_000003_0:      at org.apache.hadoop.dfs.DFSClient.close(DFSClient.java:217)
task_200807052311_0001_r_000003_0:      at org.apache.hadoop.dfs.DistributedFileSystem.close(DistributedFileSystem.java:214)
task_200807052311_0001_r_000003_0:      at org.apache.hadoop.fs.FileSystem$Cache.closeAll(FileSystem.java:1324)
task_200807052311_0001_r_000003_0:      at org.apache.hadoop.fs.FileSystem.closeAll(FileSystem.java:224)
task_200807052311_0001_r_000003_0:      at org.apache.hadoop.fs.FileSystem$ClientFinalizer.run(FileSystem.java:209)
08/07/06 01:44:32 INFO mapred.JobClient:  map 100% reduce 74%
08/07/06 01:44:32 INFO mapred.JobClient: Task Id : task_200807052311_0001_r_000001_0, Status : FAILED
java.io.IOException: Could not get block locations. Aborting...
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2080)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1300(DFSClient.java:1702)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1818)
task_200807052311_0001_r_000001_0: Exception in thread "Thread-2" java.util.ConcurrentModificationException
task_200807052311_0001_r_000001_0:      at java.util.TreeMap$PrivateEntryIterator.nextEntry(TreeMap.java:1100)
task_200807052311_0001_r_000001_0:      at java.util.TreeMap$KeyIterator.next(TreeMap.java:1154)
task_200807052311_0001_r_000001_0:      at org.apache.hadoop.dfs.DFSClient.close(DFSClient.java:217)
task_200807052311_0001_r_000001_0:      at org.apache.hadoop.dfs.DistributedFileSystem.close(DistributedFileSystem.java:214)
task_200807052311_0001_r_000001_0:      at org.apache.hadoop.fs.FileSystem$Cache.closeAll(FileSystem.java:1324)
task_200807052311_0001_r_000001_0:      at org.apache.hadoop.fs.FileSystem.closeAll(FileSystem.java:224)
task_200807052311_0001_r_000001_0:      at org.apache.hadoop.fs.FileSystem$ClientFinalizer.run(FileSystem.java:209)
08/07/06 01:44:45 INFO mapred.JobClient:  map 100% reduce 54%



      

Re: Re: Hadoop 0.17.0 - lots of I/O problems and can't run small datasets?

Posted by heyongqiang <he...@software.ict.ac.cn>.
Is the ConcurrentModificationException a Java bug, or something else?
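
(It is not a JVM defect: the JDK throws ConcurrentModificationException by design whenever a collection is structurally modified while an iterator is walking it, so the trace points at unsynchronized access to the TreeMap in DFSClient.close() rather than a Java bug. A minimal self-contained sketch that reproduces the same failure mode; the map and its contents are hypothetical stand-ins for the client's table of open files:)

    import java.util.Iterator;
    import java.util.TreeMap;

    public class CmeDemo {
        public static void main(String[] args) {
            // Stand-in for the map of open output streams that
            // DFSClient.close() iterates over.
            TreeMap<String, String> openFiles = new TreeMap<String, String>();
            openFiles.put("/out/part-00001", "stream1");
            openFiles.put("/out/part-00002", "stream2");

            Iterator<String> it = openFiles.keySet().iterator();
            while (it.hasNext()) {
                String path = it.next();
                // Removing through the map (not the iterator) is a structural
                // modification; the following it.next() call fails with
                // java.util.ConcurrentModificationException, as in the trace.
                openFiles.remove(path);
            }
        }
    }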




Best regards,
 
Yongqiang He
2008-07-08

Email: heyongqiang@software.ict.ac.cn
Tel:   86-10-62600966(O)
 
Research Center for Grid and Service Computing,
Institute of Computing Technology, 
Chinese Academy of Sciences
P.O.Box 2704, 100080, Beijing, China 



From: Raghu Angadi
Sent: 2008-07-08 01:45:19
To: core-user@hadoop.apache.org
Subject: Re: Hadoop 0.17.0 - lots of I/O problems and can't run small datasets?

[quoted message snipped; Raghu's reply appears in full below]

Re: Re: Hadoop 0.17.0 - lots of I/O problems and can't run small datasets?

Posted by C G <pa...@yahoo.com>.
Yongqiang:
 
Thanks for this information.  I'll try your changes and see if the experiment runs better.
 
Thanks,
C G

--- On Mon, 7/7/08, heyongqiang <he...@software.ict.ac.cn> wrote:

[quoted message snipped; heyongqiang's full reply appears below]

Re: Re: Hadoop 0.17.0 - lots of I/O problems and can't run small datasets?

Posted by heyongqiang <he...@software.ict.ac.cn>.
I suspect this error happens because a datanode quit during the client write, and that same datanode keeps being chosen by the namenode as the target the client should write to (this is what DFSClient.DFSOutputStream.nextBlockOutputStream does).
By default the client side retries 3 times and sleeps a total of 3*xxx seconds, but the NameNode needs more time to detect the dead node. So every time the client wakes up, there is a chance the dead node is chosen again.
Maybe you should change the NameNode's interval for detecting dead nodes and make the client sleep longer?
I have changed the sleep code in DFSClient.DFSOutputStream.nextBlockOutputStream like below:
    if (!success) {
      LOG.info("Abandoning block " + block + " and retry...");
      namenode.abandonBlock(block, src, clientName);

      // Connection failed. Let's wait a little bit and retry.
      retry = true;
      try {
        if (System.currentTimeMillis() - startTime > 5000) {
          LOG.info("Waiting to find target node: " + nodes[0].getName());
        }
        // Sleep for one full dead-node recheck interval instead of the
        // stock short sleep, so the NameNode has had a chance to mark the
        // failed datanode dead before the next block allocation is tried.
        long time = heartbeatRecheckInterval;
        Thread.sleep(time);
      } catch (InterruptedException iex) {
        // Ignore the interrupt and fall through to the retry.
      }
    }

heartbeatRecheckInterval is exactly the interval at which the NameNode's dead-node monitor rechecks for dead datanodes.    And I also changed the NameNode's dead-node recheck interval to be double the heartbeat interval.
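
(Sleeping for one full recheck interval means the dead-node monitor is guaranteed to run at least once before the client retries, so the namenode can stop handing out the failed datanode. Below is a hedged sketch of tuning the two intervals from configuration; the property names dfs.heartbeat.interval (in seconds) and heartbeat.recheck.interval (in milliseconds) match the 0.17-era FSNamesystem and should be checked against your release:)

    import org.apache.hadoop.conf.Configuration;

    public class TuneDeadNodeDetection {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Datanodes heartbeat every 3 seconds by default (seconds).
            long heartbeatMs = conf.getLong("dfs.heartbeat.interval", 3) * 1000;
            // The dead-node monitor rechecks every 5 minutes by default
            // (milliseconds); set it to double the heartbeat interval,
            // as described above.
            conf.setInt("heartbeat.recheck.interval", (int) (2 * heartbeatMs));
            System.out.println("recheck every "
                + conf.getInt("heartbeat.recheck.interval", 5 * 60 * 1000) + " ms");
        }
    }

(With stock settings, releases of this era declared a datanode dead only after roughly 2 * heartbeat.recheck.interval + 10 * dfs.heartbeat.interval, i.e. on the order of ten minutes, while the writing client gave up after three short retries; that gap is consistent with the race described above.)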




Best regards,
 
Yongqiang He
2008-07-08

Email: heyongqiang@software.ict.ac.cn
Tel:   86-10-62600966(O)
 
Research Center for Grid and Service Computing,
Institute of Computing Technology, 
Chinese Academy of Sciences
P.O.Box 2704, 100080, Beijing, China 



From: Raghu Angadi
Sent: 2008-07-08 01:45:19
To: core-user@hadoop.apache.org
Subject: Re: Hadoop 0.17.0 - lots of I/O problems and can't run small datasets?

[quoted message snipped; Raghu's reply appears in full below]

Re: Hadoop 0.17.0 - lots of I/O problems and can't run small datasets?

Posted by Raghu Angadi <ra...@yahoo-inc.com>.
ConcurrentModificationException looks like a bug; we should file a JIRA.

Regarding why the writes are failing, we need to look at more logs. Could
you attach the complete log from one of the failed tasks? Also see if
there is anything in the NameNode log around that time.

Raghu.

C G wrote:
> [quoted message snipped; the full text and trace appear at the top of this thread]