Posted to common-user@hadoop.apache.org by Max Hansmire <ha...@gmail.com> on 2012/09/04 18:30:28 UTC

Data loss on EMR cluster running Hadoop and Hive

I ran into an issue yesterday where one of the blocks on HDFS seems to
have gone away. I would appreciate any help that you can provide.

I am running Hadoop on Amazon's Elastic Map Reduce (EMR), with Hadoop
version 0.20.205 and Hive version 0.8.1.

I have a Hive table that is written out in the reduce step of a
map-reduce job created by Hive. That step completed with no errors, but
the next map-reduce job that tried to read the table failed with the
following error message.

"Caused by: java.io.IOException: No live nodes contain current block"

I ran hadoop fs -cat on the same file and got the same error.
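
Roughly the command I ran (the real warehouse path is replaced by a
placeholder here); it dies with the same "No live nodes contain
current block" exception:

    hadoop fs -cat /user/hive/warehouse/my_table/000000_0 > /dev/null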

Looking more closely at the data node and name node logs, I see the
errors below for the same problem block. They are from the data node
log, written while it was trying to serve the block.

2012-09-03 11:56:05,054 WARN
org.apache.hadoop.hdfs.server.datanode.DataNode
(org.apache.hadoop.hdfs.server.datanode.DataXceiver@4a7cdff0):
DatanodeRegistration(10.193.39.159:9200,
storageID=DS-2147477684-10.193.39.159-9200-1346659207926,
infoPort=9102, ipcPort=9201):sendBlock() :  Offset 134217727 and
length 1 don't match block blk_-7100869813617535842_5426 ( blockLen
120152064 )
2012-09-03 11:56:05,054 WARN
org.apache.hadoop.hdfs.server.datanode.DataNode
(org.apache.hadoop.hdfs.server.datanode.DataXceiver@4a7cdff0):
DatanodeRegistration(10.193.39.159:9200,
storageID=DS-2147477684-10.193.39.159-9200-1346659207926,
infoPort=9102, ipcPort=9201):Got exception while serving
blk_-7100869813617535842_5426 to /10.96.57.112:
java.io.IOException:  Offset 134217727 and length 1 don't match block
blk_-7100869813617535842_5426 ( blockLen 120152064 )
	at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:141)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:189)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:99)
	at java.lang.Thread.run(Thread.java:662)

2012-09-03 11:56:05,054 ERROR
org.apache.hadoop.hdfs.server.datanode.DataNode
(org.apache.hadoop.hdfs.server.datanode.DataXceiver@4a7cdff0):
DatanodeRegistration(10.193.39.159:9200,
storageID=DS-2147477684-10.193.39.159-9200-1346659207926,
infoPort=9102, ipcPort=9201):DataXceiver
java.io.IOException:  Offset 134217727 and length 1 don't match block
blk_-7100869813617535842_5426 ( blockLen 120152064 )
	at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:141)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:189)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:99)
	at java.lang.Thread.run(Thread.java:662)

Unfortunately, the EMR cluster that had the data on it has since been
terminated. I still have access to the logs, but I can't run an fsck. I
can provide more detailed stack traces and so on if you think it would
be helpful. Rerunning my process to regenerate the corrupted block
resolved the issue.

I would really appreciate it if anyone has a reasonable explanation of
what happened and how to avoid it in the future.

Max

Re: Data loss on EMR cluster running Hadoop and Hive

Posted by Michael Segel <mi...@hotmail.com>.
Max, 
Yes, you will get better performance if your data is on HDFS (local/ephemeral) versus S3. 

I'm not sure why you couldn't see the bad block. 
Next time this happens, try running hadoop fsck from the name node. 
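
Something along these lines (the path below is just an example; point
it at the table's directory under the Hive warehouse, or at / for the
whole namespace) will list which files have missing or corrupt blocks
and where the surviving replicas are:

    hadoop fsck /user/hive/warehouse/my_table -files -blocks -locations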

The reason I suggested running against S3 is that, while slower, it's still faster than copying the data to local disk, running the job, and then pushing the results back to S3. 

Again, I would suggest that you try to contact AWS support.

HTH

-Mike

On Sep 4, 2012, at 12:08 PM, Max Hansmire <ha...@gmail.com> wrote:

> Especially where I am reading from from the file using a Map-Reduce
> job in the next step I am not sure that it makes sense in terms of
> performance to put the file on S3. I have not tested, but my suspicion
> is that the local disk reads on HDFS would outperform reading and
> writing the file to S3.
> 
> This is a bad block on HDFS and not the underlying filesystem. I
> thought that HDFS was supposed to be tolerant of native file system
> failures.
> 
> Max
> 
> On Tue, Sep 4, 2012 at 12:43 PM, Michael Segel
> <mi...@hotmail.com> wrote:
>> Next time, try reading and writing to S3 directly from your hive job.
>> 
>> Not sure why the block was bad... What did the AWS folks have to say?
>> 
>> -Mike
>> 
>> On Sep 4, 2012, at 11:30 AM, Max Hansmire <ha...@gmail.com> wrote:
>> 
>>> I ran into an issue yesterday where one of the blocks on HDFS seems to
>>> have gone away. I would appreciate any help that you can provide.
>>> 
>>> I am running Hadoop on Amazon's Elastic Map Reduce (EMR). I am running
>>> hadoop version 0.20.205 and hive version 0.8.1.
>>> 
>>> I have a hive table that is written out in the reduce step of a map
>>> reduce job created by hive. This step completed with no errors, but
>>> the next map-reduce job that tries to read it failed with the
>>> following error message.
>>> 
>>> "Caused by: java.io.IOException: No live nodes contain current block"
>>> 
>>> I ran hadoop fs -cat on the same file and got the same error.
>>> 
>>> Looking more closely at the data and name node logs, I see this error
>>> for the same problem block. It is in the name node when trying to read
>>> the data.
>>> 
>>> 2012-09-03 11:56:05,054 WARN
>>> org.apache.hadoop.hdfs.server.datanode.DataNode
>>> (org.apache.hadoop.hdfs.server.datanode.DataXceiver@4a7cdff0):
>>> DatanodeRegistration(10.193.39.159:9200,
>>> storageID=DS-2147477684-10.193.39.159-9200-1346659207926,
>>> infoPort=9102, ipcPort=9201):sendBlock() :  Offset 134217727 and
>>> length 1 don't match block blk_-7100869813617535842_5426 ( blockLen
>>> 120152064 )
>>> 2012-09-03 11:56:05,054 WARN
>>> org.apache.hadoop.hdfs.server.datanode.DataNode
>>> (org.apache.hadoop.hdfs.server.datanode.DataXceiver@4a7cdff0):
>>> DatanodeRegistration(10.193.39.159:9200,
>>> storageID=DS-2147477684-10.193.39.159-9200-1346659207926,
>>> infoPort=9102, ipcPort=9201):Got exception while serving
>>> blk_-7100869813617535842_5426 to /10.96.57.112:
>>> java.io.IOException:  Offset 134217727 and length 1 don't match block
>>> blk_-7100869813617535842_5426 ( blockLen 120152064 )
>>>      at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:141)
>>>      at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:189)
>>>      at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:99)
>>>      at java.lang.Thread.run(Thread.java:662)
>>> 
>>> 2012-09-03 11:56:05,054 ERROR
>>> org.apache.hadoop.hdfs.server.datanode.DataNode
>>> (org.apache.hadoop.hdfs.server.datanode.DataXceiver@4a7cdff0):
>>> DatanodeRegistration(10.193.39.159:9200,
>>> storageID=DS-2147477684-10.193.39.159-9200-1346659207926,
>>> infoPort=9102, ipcPort=9201):DataXceiver
>>> java.io.IOException:  Offset 134217727 and length 1 don't match block
>>> blk_-7100869813617535842_5426 ( blockLen 120152064 )
>>>      at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:141)
>>>      at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:189)
>>>      at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:99)
>>>      at java.lang.Thread.run(Thread.java:662)
>>> 
>>> Unfortunately the EMR cluster that had the data on it has since been
>>> terminated. I have access to the logs, but I can't run an fsck. I can
>>> provide more detailed stack traces etc. if you think it would be
>>> helpful. Rerunning my process by re-generating the corrupted block
>>> resolved the issue.
>>> 
>>> Would really appreciate if anyone has a reasonable explanation of what
>>> happened and how to avoid in the future.
>>> 
>>> Max
>>> 
>> 
> 


Re: Data loss on EMR cluster running Hadoop and Hive

Posted by Max Hansmire <ha...@gmail.com>.
Especially since I am reading the file with a map-reduce job in the
next step, I am not sure that it makes sense in terms of performance to
put the file on S3. I have not tested it, but my suspicion is that
local disk reads on HDFS would outperform reading and writing the file
to S3.

This is a bad block on HDFS, not in the underlying filesystem. In any
case, I thought that HDFS was supposed to be tolerant of native file
system failures.
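
One thing I still want to check is what replication factor EMR actually
gave those files. If dfs.replication ended up at 1, a single bad disk
or data node would be enough to lose a block. Something like this (the
path is made up) would show the current replication factor in the
second column of the listing and let me raise it on anything important:

    hadoop fs -ls /user/hive/warehouse/my_table
    hadoop fs -setrep -R 3 /user/hive/warehouse/my_table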

Max

On Tue, Sep 4, 2012 at 12:43 PM, Michael Segel
<mi...@hotmail.com> wrote:
> Next time, try reading and writing to S3 directly from your hive job.
>
> Not sure why the block was bad... What did the AWS folks have to say?
>
> -Mike
>
> On Sep 4, 2012, at 11:30 AM, Max Hansmire <ha...@gmail.com> wrote:
>
>> I ran into an issue yesterday where one of the blocks on HDFS seems to
>> have gone away. I would appreciate any help that you can provide.
>>
>> I am running Hadoop on Amazon's Elastic Map Reduce (EMR). I am running
>> hadoop version 0.20.205 and hive version 0.8.1.
>>
>> I have a hive table that is written out in the reduce step of a map
>> reduce job created by hive. This step completed with no errors, but
>> the next map-reduce job that tries to read it failed with the
>> following error message.
>>
>> "Caused by: java.io.IOException: No live nodes contain current block"
>>
>> I ran hadoop fs -cat on the same file and got the same error.
>>
>> Looking more closely at the data and name node logs, I see this error
>> for the same problem block. It is in the name node when trying to read
>> the data.
>>
>> 2012-09-03 11:56:05,054 WARN
>> org.apache.hadoop.hdfs.server.datanode.DataNode
>> (org.apache.hadoop.hdfs.server.datanode.DataXceiver@4a7cdff0):
>> DatanodeRegistration(10.193.39.159:9200,
>> storageID=DS-2147477684-10.193.39.159-9200-1346659207926,
>> infoPort=9102, ipcPort=9201):sendBlock() :  Offset 134217727 and
>> length 1 don't match block blk_-7100869813617535842_5426 ( blockLen
>> 120152064 )
>> 2012-09-03 11:56:05,054 WARN
>> org.apache.hadoop.hdfs.server.datanode.DataNode
>> (org.apache.hadoop.hdfs.server.datanode.DataXceiver@4a7cdff0):
>> DatanodeRegistration(10.193.39.159:9200,
>> storageID=DS-2147477684-10.193.39.159-9200-1346659207926,
>> infoPort=9102, ipcPort=9201):Got exception while serving
>> blk_-7100869813617535842_5426 to /10.96.57.112:
>> java.io.IOException:  Offset 134217727 and length 1 don't match block
>> blk_-7100869813617535842_5426 ( blockLen 120152064 )
>>       at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:141)
>>       at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:189)
>>       at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:99)
>>       at java.lang.Thread.run(Thread.java:662)
>>
>> 2012-09-03 11:56:05,054 ERROR
>> org.apache.hadoop.hdfs.server.datanode.DataNode
>> (org.apache.hadoop.hdfs.server.datanode.DataXceiver@4a7cdff0):
>> DatanodeRegistration(10.193.39.159:9200,
>> storageID=DS-2147477684-10.193.39.159-9200-1346659207926,
>> infoPort=9102, ipcPort=9201):DataXceiver
>> java.io.IOException:  Offset 134217727 and length 1 don't match block
>> blk_-7100869813617535842_5426 ( blockLen 120152064 )
>>       at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:141)
>>       at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:189)
>>       at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:99)
>>       at java.lang.Thread.run(Thread.java:662)
>>
>> Unfortunately the EMR cluster that had the data on it has since been
>> terminated. I have access to the logs, but I can't run an fsck. I can
>> provide more detailed stack traces etc. if you think it would be
>> helpful. Rerunning my process by re-generating the corrupted block
>> resolved the issue.
>>
>> Would really appreciate if anyone has a reasonable explanation of what
>> happened and how to avoid in the future.
>>
>> Max
>>
>

Re: Data loss on EMR cluster running Hadoop and Hive

Posted by Michael Segel <mi...@hotmail.com>.
Next time, try reading and writing to S3 directly from your hive job.
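
For example, something like this would write the step's output to an
external table whose location is on S3, so the data never has to sit on
the cluster's HDFS at all. (Table names, columns, and the bucket are
placeholders, and the URI scheme may be s3:// or s3n:// depending on
how the cluster's S3 filesystem is configured.)

    hive -e "
      CREATE EXTERNAL TABLE my_output (id STRING, value STRING)
      LOCATION 's3://my-bucket/hive/my_output/';

      INSERT OVERWRITE TABLE my_output
      SELECT id, value FROM my_input;
    "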

Not sure why the block was bad... What did the AWS folks have to say? 

-Mike

On Sep 4, 2012, at 11:30 AM, Max Hansmire <ha...@gmail.com> wrote:

> I ran into an issue yesterday where one of the blocks on HDFS seems to
> have gone away. I would appreciate any help that you can provide.
> 
> I am running Hadoop on Amazon's Elastic Map Reduce (EMR). I am running
> hadoop version 0.20.205 and hive version 0.8.1.
> 
> I have a hive table that is written out in the reduce step of a map
> reduce job created by hive. This step completed with no errors, but
> the next map-reduce job that tries to read it failed with the
> following error message.
> 
> "Caused by: java.io.IOException: No live nodes contain current block"
> 
> I ran hadoop fs -cat on the same file and got the same error.
> 
> Looking more closely at the data and name node logs, I see this error
> for the same problem block. It is in the name node when trying to read
> the data.
> 
> 2012-09-03 11:56:05,054 WARN
> org.apache.hadoop.hdfs.server.datanode.DataNode
> (org.apache.hadoop.hdfs.server.datanode.DataXceiver@4a7cdff0):
> DatanodeRegistration(10.193.39.159:9200,
> storageID=DS-2147477684-10.193.39.159-9200-1346659207926,
> infoPort=9102, ipcPort=9201):sendBlock() :  Offset 134217727 and
> length 1 don't match block blk_-7100869813617535842_5426 ( blockLen
> 120152064 )
> 2012-09-03 11:56:05,054 WARN
> org.apache.hadoop.hdfs.server.datanode.DataNode
> (org.apache.hadoop.hdfs.server.datanode.DataXceiver@4a7cdff0):
> DatanodeRegistration(10.193.39.159:9200,
> storageID=DS-2147477684-10.193.39.159-9200-1346659207926,
> infoPort=9102, ipcPort=9201):Got exception while serving
> blk_-7100869813617535842_5426 to /10.96.57.112:
> java.io.IOException:  Offset 134217727 and length 1 don't match block
> blk_-7100869813617535842_5426 ( blockLen 120152064 )
> 	at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:141)
> 	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:189)
> 	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:99)
> 	at java.lang.Thread.run(Thread.java:662)
> 
> 2012-09-03 11:56:05,054 ERROR
> org.apache.hadoop.hdfs.server.datanode.DataNode
> (org.apache.hadoop.hdfs.server.datanode.DataXceiver@4a7cdff0):
> DatanodeRegistration(10.193.39.159:9200,
> storageID=DS-2147477684-10.193.39.159-9200-1346659207926,
> infoPort=9102, ipcPort=9201):DataXceiver
> java.io.IOException:  Offset 134217727 and length 1 don't match block
> blk_-7100869813617535842_5426 ( blockLen 120152064 )
> 	at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:141)
> 	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:189)
> 	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:99)
> 	at java.lang.Thread.run(Thread.java:662)
> 
> Unfortunately the EMR cluster that had the data on it has since been
> terminated. I have access to the logs, but I can't run an fsck. I can
> provide more detailed stack traces etc. if you think it would be
> helpful. Rerunning my process by re-generating the corrupted block
> resolved the issue.
> 
> Would really appreciate if anyone has a reasonable explanation of what
> happened and how to avoid in the future.
> 
> Max
> 

