Posted to dev@nutch.apache.org by Jesse Hires <jh...@gmail.com> on 2009/10/21 01:22:05 UTC

datanode.BlockAlreadyExistsException

I tried asking this over at the nutch-user alias, but I am seeing very
little traction, so I thought I'd ask the developers. I realize this is most
likely a configuration problem on my end, but I am very new to using nutch,
so I am having a difficult time understanding where I need to look.

Does anyone have any insight into the following error I am seeing in the
hadoop logs? Is this something I should be concerned with, or is it expected
that this shows up in the logs from time to time? If it is not expected,
where can I look for more information on what is going on?

2009-10-16 17:02:43,061 ERROR datanode.DataNode -
DatanodeRegistration(192.168.1.7:50010,
storageID=DS-1226842861-192.168.1.7-50010-1254609174303,
infoPort=50075, ipcPort=50020):DataXceiver

org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException:
Block blk_909837363833332565_3277 is valid, and cannot be written to.
	at org.apache.hadoop.hdfs.server.datanode.FSDataset.writeToBlock(FSDataset.java:975)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:97)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:259)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103)
	at java.lang.Thread.run(Thread.java:636)



I am able to reproduce this just by injecting the URLs (2 of them), but it
shows up on both datanodes and happens whenever I run an operation that uses
DFS.
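
For reference, the inject step is just the standard command, along the lines
of (the crawldb and urls paths below are only placeholders, not necessarily my
exact layout):

  bin/nutch inject crawl/crawldb urls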

I am running the latest sources from the trunk.
I've verified that only one instance of each of the following is running on
the datanodes:
org.apache.hadoop.hdfs.server.datanode.DataNode
org.apache.hadoop.mapred.TaskTracker

I've also verified that only one instance of each of the following is running
on the name node:
org.apache.hadoop.hdfs.server.namenode.NameNode
org.apache.hadoop.mapred.JobTracker
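
For clarity, by "verified" I mean checking the running Java processes on each
box, e.g. with the JDK's jps (ps ax | grep java shows the same thing if jps
isn't on the path):

  jps -l
  # on a datanode, each of these should appear exactly once:
  #   org.apache.hadoop.hdfs.server.datanode.DataNode
  #   org.apache.hadoop.mapred.TaskTracker
  # on the name node, likewise:
  #   org.apache.hadoop.hdfs.server.namenode.NameNode
  #   org.apache.hadoop.mapred.JobTracker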


The hardware is as follows:
Two data nodes, both configured identically: Atom 330 CPU, 2 GB RAM, 320 GB
SATA 3.0 hard drive, Fedora Core 10.
One name node running some AMD x86 CPU, 2 GB RAM, 750 GB SATA, Fedora Core 10
(pieced together from spare parts).
All across a 100 Mbit network.
Admittedly this is low-end hardware, but I am doing this specifically as an
exercise in using low-power (as in electricity) hardware.

I can also provide config files if needed.

Jesse

int GetRandomNumber()
{
   return 4; // Chosen by fair roll of dice
                // Guaranteed to be random
} // xkcd.com

Re: datanode.BlockAlreadyExistsException

Posted by Jesse Hires <jh...@gmail.com>.
I am still getting the same errors.

I ran fsck (rebooted with a forced fsck at startup). No issues.
I increased the ulimit to 8192.

I was using /etc/hosts for all name lookups (common across all machines and
copied from the same location). I have since modified hadoop-site.xml and the
slaves file to use IP addresses only.

Using ifconfig, ping, and looking at /etc/sysconfig/network, I've determined
that all the machines are who they think they are.
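
Roughly the kind of check I mean, run on every node (the answers should match
everywhere; the IP addresses are the ones from my cluster):

  hostname                                          # what the machine calls itself
  getent hosts 192.168.1.3 192.168.1.6 192.168.1.7  # what the resolver says
  md5sum /etc/hosts                                 # confirm the copies are identical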

Of note: I also get the following WARN in the logs after the
BlockAlreadyExistsException. I see the same on both datanodes (just with the
IP addresses swapped).

2009-10-21 21:13:03,415 WARN  datanode.DataNode -
DatanodeRegistration(192.168.1.7:50010,
storageID=DS-1226842861-192.168.1.7-50010-1254609174303,
infoPort=50075, ipcPort=50020):Failed to transfer
blk_-2053461958845826983_3919 to 192.168.1.6:50010 got
java.net.SocketException: Original Exception : java.io.IOException:
Connection reset by peer
	at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
	at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:456)
	at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:557)
	at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:199)
	at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:313)
	at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:400)
	at org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:1108)
	at java.lang.Thread.run(Thread.java:636)
Caused by: java.io.IOException: Connection reset by peer


I am able to generate/fetch/updatedb/etc. As near as I can tell, things seem
to be working, but I really wouldn't know if I am missing anything anyway. No
errors are being displayed on the command line. Every iteration seems to grow
the index, segments, and linkdb accordingly.




Here is the hadoop-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

  <property>
    <name>fs.default.name</name>
    <value>hdfs://192.168.1.3:9000</value>
  </property>

  <property>
    <name>mapred.job.tracker</name>
    <value>hdfs://192.168.1.3:9001</value>
  </property>

  <property>
    <name>mapred.tasktracker.tasks.maximum</name>
    <value>1</value>
  </property>

  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m</value>
  </property>

  <property>
    <name>dfs.name.dir</name>
    <value>/home/nutch/crawl/filesystem/name</value>
  </property>

  <property>
    <name>dfs.data.dir</name>
    <value>/home/nutch/crawl/filesystem/data</value>
  </property>

  <property>
    <name>mapred.system.dir</name>
    <value>/home/nutch/crawl/filesystem/mapreduce/system</value>
  </property>

  <property>
    <name>mapred.local.dir</name>
    <value>/home/nutch/crawl/filesystem/mapreduce/local</value>
  </property>

  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>

</configuration>
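
A quick sanity check I can run against this config (assuming the stock Hadoop
scripts under bin/ in the install directory) is:

  bin/hadoop dfsadmin -report

which should list both datanodes (192.168.1.6 and 192.168.1.7) as registered
and in service.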








Jesse

int GetRandomNumber()
{
   return 4; // Chosen by fair roll of dice
                // Guaranteed to be random
} // xkcd.com



On Wed, Oct 21, 2009 at 6:01 PM, Jesse Hires <jh...@gmail.com> wrote:

> Thanks for the pointers!
> As soon as I have some results, I'll post them back and let you know if the
> problem is solved.
>
> Jesse
>
> int GetRandomNumber()
> {
>    return 4; // Chosen by fair roll of dice
>                 // Guaranteed to be random
> } // xkcd.com
>
>
>
> On Wed, Oct 21, 2009 at 4:46 AM, Andrzej Bialecki <ab...@getopt.org> wrote:
>
>> Jesse Hires wrote:
>>
>>> I tried asking this over at the nutch-user alias, but I am seeing very
>>> little traction, so I thought I'd ask the developers. I realize this is most
>>> likely a configuration problem on my end, but I am very new to using nutch,
>>> so I am having a difficult time understanding where I need to look.
>>>
>>> Does anyone have any insight into the following error I am seeing in the
>>> hadoop logs? Is this something I should be concerned with, or is it expected
>>> that this shows up in the logs from time to time? If it is not expected,
>>> where can I look for more information on what is going on?
>>>
>>
>> It's not expected at all - this usually indicates some config error, or FS
>> corruption; it may also be caused by conflicting DNS (e.g. the same name
>> resolving to different addresses on different nodes), or by a problem with
>> permissions (e.g. a daemon started remotely uses uid/permissions/env that
>> doesn't allow it to create/delete files in the data dir). It may also be some
>> weird corner case where processes run out of file descriptors - you should
>> check ulimit -n and set it to a value higher than 4096.
>>
>> Please also run fsck / and see what it says.
>>
>>  I can also provide config files if needed.
>>>
>>
>> We need just the modifications in hadoop-site.xml; that's where the
>> problem may be located.
>>
>>
>> --
>> Best regards,
>> Andrzej Bialecki     <><
>>  ___. ___ ___ ___ _ _   __________________________________
>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>> http://www.sigram.com  Contact: info at sigram dot com
>>
>>
>

Re: datanode.BlockAlreadyExistsException

Posted by Jesse Hires <jh...@gmail.com>.
Thanks for the pointers!
As soon as I have some results, I'll post them back and let you know if the
problem is solved.

Jesse

int GetRandomNumber()
{
   return 4; // Chosen by fair roll of dice
                // Guaranteed to be random
} // xkcd.com



On Wed, Oct 21, 2009 at 4:46 AM, Andrzej Bialecki <ab...@getopt.org> wrote:

> Jesse Hires wrote:
>
>> I tried asking this over at the nutch-user alias, but I am seeing very
>> little traction, so I thought I'd ask the developers. I realize this is most
>> likely a configuration problem on my end, but I am very new to using nutch,
>> so I am having a difficult time understanding where I need to look.
>>
>> Does anyone have any insight into the following error I am seeing in the
>> hadoop logs? Is this something I should be concerned with, or is it expected
>> that this shows up in the logs from time to time? If it is not expected,
>> where can I look for more information on what is going on?
>>
>
> It's not expected at all - this usually indicates some config error, or FS
> corruption; it may also be caused by conflicting DNS (e.g. the same name
> resolving to different addresses on different nodes), or by a problem with
> permissions (e.g. a daemon started remotely uses uid/permissions/env that
> doesn't allow it to create/delete files in the data dir). It may also be some
> weird corner case where processes run out of file descriptors - you should
> check ulimit -n and set it to a value higher than 4096.
>
> Please also run fsck / and see what it says.
>
>  I can also provide config files if needed.
>>
>
> We need just the modifications in hadoop-site.xml; that's where the problem
> may be located.
>
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>

Re: datanode.BlockAlreadyExistsException

Posted by Andrzej Bialecki <ab...@getopt.org>.
Jesse Hires wrote:
> I tried asking this over at the nutch-user alias, but I am seeing very 
> little traction, so I thought I'd ask the developers. I realize this is 
> most likely a configuration problem on my end, but I am very new to 
> using nutch, so I am having a difficult time understanding where I need 
> to look.
> 
> Does anyone have any insight into the following error I am seeing in the 
> hadoop logs? Is this something I should be concerned with, or is it 
> expected that this shows up in the logs from time to time? If it is not 
> expected, where can I look for more information on what is going on?

It's not expected at all - this usually indicates some config error, or
FS corruption; it may also be caused by conflicting DNS (e.g. the same
name resolving to different addresses on different nodes), or by a
problem with permissions (e.g. a daemon started remotely uses
uid/permissions/env that doesn't allow it to create/delete files in the
data dir). It may also be some weird corner case where processes run out
of file descriptors - you should check ulimit -n and set it to a value
higher than 4096.
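
For example, something like this (the "nutch" account name below is just a
placeholder for whichever user runs the daemons):

  # as the daemon user, check the current limit
  ulimit -n
  # to raise it persistently, add to /etc/security/limits.conf (as root):
  #   nutch  soft  nofile  8192
  #   nutch  hard  nofile  8192
  # then restart the daemons from a fresh login so they inherit the new limit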

Please also run fsck / and see what it says.
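
That is, something like (assuming the hadoop launcher script in bin/):

  bin/hadoop fsck / -files -blocks -locations

which will report any missing, corrupt or under-replicated blocks.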

> I can also provide config files if needed.

We need just the modifications in hadoop-site.xml; that's where the
problem may be located.


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com