Posted to common-user@hadoop.apache.org by Juho Mäkinen <ju...@gmail.com> on 2008/12/22 09:59:17 UTC

Hadoop corrupting files if file block size is 4GB and file size is over 2GB

I have been storing log data in an HDFS cluster (just one datanode at
the moment) with a 4GB block size. It worked fine at the beginning,
but now my individual files have grown past 2GB and I can no longer
access them from the HDFS cluster. This seems to happen whenever a
file is over 2GB; all files under 2GB work fine. There has always been
enough disk space, and time doesn't seem to be a factor (for example,
2008-11-24 doesn't work, but 2008-12-05 does).

"hadoop dfs -lsr /events/eventlog"
-rw-r--r--   1 garo supergroup 2177143062 2008-11-25 04:04
/events/eventlog/eventlog-2008-11-24 (doesn't work)
-rw-r--r--   1 garo supergroup 2121109956 2008-12-06 04:04
/events/eventlog/eventlog-2008-12-05 (works)

Note that the 2008-12-05 file size is below 2^31 bytes (2GB), but the
2008-11-24 file size is above it.
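
For reference, 2^31 - 1 = 2147483647 bytes is the largest value a signed
32-bit int can hold. A quick check in plain Java (illustration only, not
Hadoop code) of where the two file sizes fall and what an int cast does
to the larger one:

    public class SizeCheck {
        public static void main(String[] args) {
            long works  = 2121109956L; // eventlog-2008-12-05
            long broken = 2177143062L; // eventlog-2008-11-24

            System.out.println(works  > Integer.MAX_VALUE); // false
            System.out.println(broken > Integer.MAX_VALUE); // true

            // Narrowing the larger size to 32 bits wraps it negative.
            System.out.println((int) broken);               // -2117824234
        }
    }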


Example:
[garo@postmetal tmp]$ hadoop dfs -get /events/eventlog/eventlog-2008-11-24 .
get: null

Error log:
==> hadoop-garo-datanode-postmetal.pri.log <==
2008-12-22 10:52:12,325 ERROR org.apache.hadoop.dfs.DataNode:
DatanodeRegistration(127.0.0.1:50010,
storageID=DS-1049869337-10.157.67.82-50010-1221647796455,
infoPort=50075, ipcPort=50020):DataXceiver:
java.lang.IndexOutOfBoundsException
        at java.io.DataInputStream.readFully(DataInputStream.java:175)
        at org.apache.hadoop.dfs.DataNode$BlockSender.sendChunks(DataNode.java:1821)
        at org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1967)
        at org.apache.hadoop.dfs.DataNode$DataXceiver.readBlock(DataNode.java:1109)
        at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:1037)
        at java.lang.Thread.run(Thread.java:619)

The datanode web interface at
http://postmetal.pri:50075/browseBlock.jsp?blockId=-7907060692488773710&blockSize=2177143062&genstamp=6286&filename=/events/eventlog/eventlog-2008-11-24&datanodePort=50010&namenodeInfoPort=50070
displays this:

Total number of blocks: 1
-7907060692488773710:	 	127.0.0.1:50010

Is this a known problem? Has Hadoop ever been tested with block sizes
over 2GB? Are my files corrupted? (I do have working backups in a
non-Hadoop system.) If Hadoop doesn't support block sizes this large,
there should be a clear error message when trying to add files with
such a block size. Or is the problem not the block size but something
else entirely?

 - Juho Mäkinen

Re: Hadoop corrupting files if file block size is 4GB and file size is over 2GB

Posted by Doug Cutting <cu...@apache.org>.
Why are you using such a big block size?  I suspect this problem will go 
away if you decrease your block size to less than 2GB.
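
For example, re-uploading a file from your backups with a smaller block
size might look roughly like this (a sketch only: the 128MB figure and
the local backup path are placeholders, and the API shown is the
FileSystem.create() overload that takes an explicit per-file block size):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.io.OutputStream;

    public class ReuploadWithSmallerBlocks {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Local backup path is hypothetical; substitute your own.
            InputStream in = new FileInputStream("/backup/eventlog-2008-11-24");

            OutputStream out = fs.create(
                new Path("/events/eventlog/eventlog-2008-11-24"),
                true,                                      // overwrite
                conf.getInt("io.file.buffer.size", 4096),  // buffer size
                (short) 1,                                 // replication (one datanode)
                128L * 1024 * 1024);                       // 128MB blocks, well under 2GB

            IOUtils.copyBytes(in, out, conf, true);        // copies and closes both streams
        }
    }

If I recall correctly, the same effect can be had cluster-wide by
lowering dfs.block.size in hadoop-site.xml, or per upload with
something like 'hadoop dfs -D dfs.block.size=134217728 -put ...'.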

This sounds like a bug, probably related to integer overflow: some part 
of Hadoop is using an 'int' where it should be using a 'long'.  Please 
file an issue in Jira, ideally with a test case.
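
To illustrate the suspected failure mode (plain JDK code, a hypothetical
reconstruction rather than the actual Hadoop code path): once a length
above 2^31 - 1 is narrowed to an int it goes negative, and
DataInputStream.readFully rejects a negative length with exactly the
IndexOutOfBoundsException shown in the datanode log above:

    import java.io.ByteArrayInputStream;
    import java.io.DataInputStream;

    public class ReadFullyOverflowDemo {
        public static void main(String[] args) throws Exception {
            long bytesToSend = 2177143062L;   // block/file size > 2^31 - 1
            int len = (int) bytesToSend;      // wraps to -2117824234

            DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(new byte[4096]));

            // Throws java.lang.IndexOutOfBoundsException because len < 0.
            in.readFully(new byte[4096], 0, len);
        }
    }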

A short-term fix might be to simply prohibit block sizes greater than 
2GB: that's what most folks are using, and that's what's tested, so 
that's effectively all that's supported.  If we incorporate tests for 
larger block sizes and fix this bug, then we might remove such a 
restriction.
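
Such a restriction could be little more than a guard at file-creation
time; a sketch of the idea (not an actual patch):

    // Reject block sizes that do not fit in a signed 32-bit int, since
    // code paths that narrow offsets/lengths to int cannot handle them.
    static void checkBlockSize(long blockSize) throws java.io.IOException {
        if (blockSize > Integer.MAX_VALUE) {
            throw new java.io.IOException("Unsupported block size " + blockSize
                + ": block sizes larger than " + Integer.MAX_VALUE
                + " bytes are not supported");
        }
    }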

Doug

Juho Mäkinen wrote:
> I have been storing log data in an HDFS cluster (just one datanode at
> the moment) with a 4GB block size. It worked fine at the beginning,
> but now my individual files have grown past 2GB and I can no longer
> access them from the HDFS cluster.
> [...]