Posted to hdfs-user@hadoop.apache.org by John Lilley <jo...@redpoint.net> on 2014/01/27 16:06:37 UTC

BlockMissingException reading HDFS file, but the block exists and fsck shows OK

I am getting this perplexing error.  Our YARN application launches tasks that attempt to simultaneously open a large number of files for merge.  There seems to be a load threshold in terms of the number of simultaneous tasks attempting to open a set of HDFS files on a four-node cluster: the threshold is hit at 32 tasks each opening 450 files, but not at 16 tasks each opening 250 files.

The files are stored in HDFS with replication=1.  I know that low replication leaves me open to node-failure issues, but bear with me, nothing is actually failing.

I get this exception when attempting to open a file:
org/apache/hadoop/fs/FSDataInputStream.read:org.apache.hadoop.hdfs.BlockMissingException:
Could not obtain block: BP-1827033441-192.168.57.112-1384284857542:blk_1073964234_223411
file=/rpdm/tmp/ProjectTemp_34_1/TempFolder_6/data00001_000003.dld
    org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:838)
    org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:889)
    org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1154)
    org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:77)

However, the block is definitely *not* missing.  I can be running the following command continuously while all of this is going on:
hdfs fsck /rpdm/tmp/ProjectTemp_34_1/TempFolder_6 -files -blocks -locations
Well before the tasks start it is showing good files all around, including:
/rpdm/tmp/ProjectTemp_34_1/TempFolder_6/data00001_000003.dld 144838614 bytes, 2 block(s):  OK
0. BP-1827033441-192.168.57.112-1384284857542:blk_1073964208_223385 len=134217728 repl=1 [192.168.57.110:50010]
1. BP-1827033441-192.168.57.112-1384284857542:blk_1073964234_223411 len=10620886 repl=1 [192.168.57.110:50010]

My application logs also show that *some* tasks are able to open the files for which a missing block is reported.
In case you suspect it, the files are not being deleted.  The fsck output continues to show good status for these files well after the error is reported.
I've also checked to ensure that the files are not still being held open by their creators.

This leads me to believe that I've hit an HDFS open-file limit of some kind.  We can compensate pretty easily by doing a two-phase merge that opens far fewer files simultaneously, keeping a limited pool of open files, etc.  However, I would still like to know what limit is being hit, and how best to predict that limit on various cluster configurations.
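
For reference, the limits I can think to rule out first (these are guesses on my part, not a diagnosis) can be checked on each datanode with something like:
# per-process file descriptor limit for the user running the DataNode
ulimit -n
# system-wide file handle limit
cat /proc/sys/fs/file-max
# DataNode cap on concurrent block transfer threads (I believe the default is 4096);
# with replication=1 every reader of a given block hits the same datanode
hdfs getconf -confKey dfs.datanode.max.transfer.threads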

Thanks,
john

Re: BlockMissingException reading HDFS file, but the block exists and fsck shows OK

Posted by Peyman Mohajerian <mo...@gmail.com>.
Maybe it's inode exhaustion; the 'df -i' command can tell you more.
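
For example, across the datanodes (the hostnames below are placeholders):
for h in dn1 dn2 dn3 dn4; do ssh "$h" 'hostname; df -i'; done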


On Mon, Jan 27, 2014 at 12:00 PM, John Lilley <jo...@redpoint.net> wrote:

>  I've found that the error occurs right around a threshold where 20 tasks
> attempt to open 220 files each.  This is ... slightly over 4k total files
> open.
>
> But that's the total number of open files across the 4-node cluster, and
> since the blocks are evenly distributed, that amounts to 1k connections per
> node, which should not be a problem.
>
> I've run tests wherein a single process on a single node can open over 8k
> files without issue.
>
> I think that there is some other factor at work, perhaps one of:
>
> 1)      Timing (because the files were just written),
>
> 2)      Multi-node, multi-process access to the same set of files.
>
> 3)      Replication=1 having an influence.
>
>
>
> Any ideas?  I am not seeing any errors in the datanode logs.
>
>
>
> I will run some other tests with replication=3 to see what happens.
>
>
>
> John
>
>
>
>
>
> *From:* John Lilley [mailto:john.lilley@redpoint.net]
> *Sent:* Monday, January 27, 2014 8:41 AM
> *To:* user@hadoop.apache.org
> *Subject:* RE: BlockMissingException reading HDFS file, but the block
> exists and fsck shows OK
>
>
>
> None of the datanode logs have error messages.
>
>
>
> *From:* Harsh J [mailto:harsh@cloudera.com]
> *Sent:* Monday, January 27, 2014 8:15 AM
> *To:* <us...@hadoop.apache.org>
> *Subject:* Re: BlockMissingException reading HDFS file, but the block
> exists and fsck shows OK
>
>
>
> Can you check the log of the DN that is holding the specific block for any
> errors?
>
> On Jan 27, 2014 8:37 PM, "John Lilley" <jo...@redpoint.net> wrote:
>
> I am getting this perplexing error.  Our YARN application launches tasks
> that attempt to simultaneously open a large number of files for merge.
> There seems to be a load threshold in terms of number of simultaneous tasks
> attempting to open a set of HDFS files on a four-node cluster.  The
> threshold is hit at 32 tasks, each opening 450 files.  The threshold is not
> hit at 16 tasks, each opening 250 files.
>
>
>
> The files are stored in HDFS with replication=1.  I know that low
> replication leaves me open to node-failure issues, but bear with me,
> nothing is actually failing.
>
>
>
> I get this exception when attempting to open a file:
>
> org/apache/hadoop/fs/FSDataInputStream.read:org.apache.hadoop.hdfs.BlockMissingException:
> Could not obtain block: BP-1827033441-192.168.57.112-1384284857542:blk_1073964234_223411
> file=/rpdm/tmp/ProjectTemp_34_1/TempFolder_6/data00001_000003.dld
>     org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:838)
>     org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:889)
>     org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1154)
>     org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:77)
>
>
>
> However, the block is definitely *not* missing.  I can be running the
> following command continuously while all of this is going on:
>
> hdfs fsck /rpdm/tmp/ProjectTemp_34_1/TempFolder_6 -files -blocks -locations
>
> Well before the tasks start it is showing good files all around, including:
>
> /rpdm/tmp/ProjectTemp_34_1/TempFolder_6/data00001_000003.dld 144838614 bytes, 2 block(s):  OK
> 0. BP-1827033441-192.168.57.112-1384284857542:blk_1073964208_223385 len=134217728 repl=1 [192.168.57.110:50010]
> 1. BP-1827033441-192.168.57.112-1384284857542:blk_1073964234_223411 len=10620886 repl=1 [192.168.57.110:50010]
>
>
>
> My application logs also show that *some* tasks are able to open the
> files for which a missing block is reported.
>
> In case you suspect, the files are not being deleted.  The fsck continues
> to show good status for these files well after the error report.
>
> I've also checked to ensure that the files are not being held open by the
> creators of the files.
>
>
>
> This leads me to believe that I've hit an HDFS open-file limit of some
> kind.  We can compensate pretty easily, by doing a two-phase merge that
> opens far fewer files simultaneously, keeping a limited pool of open files,
> etc.  However, I would still like to know what limit is being hit, and how
> to best predict that limit on various cluster configurations.
>
>
>
> Thanks,
>
> john
>

RE: BlockMissingException reading HDFS file, but the block exists and fsck shows OK

Posted by John Lilley <jo...@redpoint.net>.
I've found that the error occurs right around a threshold where 20 tasks attempt to open 220 files each.  This is ... slightly over 4k total files open.
But that's the total number of open files across the 4-node cluster, and since the blocks are evenly distributed, that amounts to 1k connections per node, which should not be a problem.
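
One way to spot-check the actual descriptor count on a datanode while the tasks run (the process match below is approximate; run it as the user that owns the DataNode process, or as root):
ls /proc/$(pgrep -f 'datanode.DataNode' | head -1)/fd | wc -l
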
I've run tests wherein a single process on a single node can open over 8k files without issue.
I think that there is some other factor at work, perhaps one of:

1)      Timing (because the files were just written),

2)      Multi-node, multi-process access to the same set of files.

3)      Replication=1 having an influence.

Any ideas?  I am not seeing any errors in the datanode logs.

I will run some other tests with replication=3 to see what happens.
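
For the existing temp files that should just be a matter of something like:
hdfs dfs -setrep -R -w 3 /rpdm/tmp/ProjectTemp_34_1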

John


From: John Lilley [mailto:john.lilley@redpoint.net]
Sent: Monday, January 27, 2014 8:41 AM
To: user@hadoop.apache.org
Subject: RE: BlockMissingException reading HDFS file, but the block exists and fsck shows OK

None of the datanode logs have error messages.

From: Harsh J [mailto:harsh@cloudera.com]
Sent: Monday, January 27, 2014 8:15 AM
To: <us...@hadoop.apache.org>
Subject: Re: BlockMissingException reading HDFS file, but the block exists and fsck shows OK


Can you check the log of the DN that is holding the specific block for any errors?
On Jan 27, 2014 8:37 PM, "John Lilley" <jo...@redpoint.net> wrote:
I am getting this perplexing error.  Our YARN application launches tasks that attempt to simultaneously open a large number of files for merge.  There seems to be a load threshold in terms of number of simultaneous tasks attempting to open a set of HDFS files on a four-node cluster.  The threshold is hit at 32 tasks, each opening 450 files.  The threshold is not hit at 16 tasks, each opening 250 files.

The files are stored in HDFS with replication=1.  I know that low replication leaves me open to node-failure issues, but bear with me, nothing is actually failing.

I get this exception when attempting to open a file:
org/apache/hadoop/fs/FSDataInputStream.read:org.apache.hadoop.hdfs.BlockMissingException:
Could not obtain block: BP-1827033441-192.168.57.112-1384284857542:blk_1073964234_223411
file=/rpdm/tmp/ProjectTemp_34_1/TempFolder_6/data00001_000003.dld
    org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:838)
    org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:889)
    org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1154)
    org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:77)

However, the block is definitely *not* missing.  I can be running the following command continuously while all of this is going on:
hdfs fsck /rpdm/tmp/ProjectTemp_34_1/TempFolder_6 -files -blocks -locations
Well before the tasks start it is showing good files all around, including:
/rpdm/tmp/ProjectTemp_34_1/TempFolder_6/data00001_000003.dld 144838614 bytes, 2 block(s):  OK
0. BP-1827033441-192.168.57.112-1384284857542:blk_1073964208_223385 len=134217728 repl=1 [192.168.57.110:50010]
1. BP-1827033441-192.168.57.112-1384284857542:blk_1073964234_223411 len=10620886 repl=1 [192.168.57.110:50010]

My application logs also show that *some* tasks are able to open the files for which a missing block is reported.
In case you suspect, the files are not being deleted.  The fsck continues to show good status for these files well after the error report.
I've also checked to ensure that the files are not being held open by the creators of the files.

This leads me to believe that I've hit an HDFS open-file limit of some kind.  We can compensate pretty easily, by doing a two-phase merge that opens far fewer files simultaneously, keeping a limited pool of open files, etc.  However, I would still like to know what limit is being hit, and how to best predict that limit on various cluster configurations.

Thanks,
john

RE: BlockMissingException reading HDFS file, but the block exists and fsck shows OK

Posted by John Lilley <jo...@redpoint.net>.
None of the datanode logs have error messages.

From: Harsh J [mailto:harsh@cloudera.com]
Sent: Monday, January 27, 2014 8:15 AM
To: <us...@hadoop.apache.org>
Subject: Re: BlockMissingException reading HDFS file, but the block exists and fsck shows OK


Can you check the log of the DN that is holding the specific block for any errors?
On Jan 27, 2014 8:37 PM, "John Lilley" <jo...@redpoint.net> wrote:
I am getting this perplexing error.  Our YARN application launches tasks that attempt to simultaneously open a large number of files for merge.  There seems to be a load threshold in terms of number of simultaneous tasks attempting to open a set of HDFS files on a four-node cluster.  The threshold is hit at 32 tasks, each opening 450 files.  The threshold is not hit at 16 tasks, each opening 250 files.

The files are stored in HDFS with replication=1.  I know that low replication leaves me open to node-failure issues, but bear with me, nothing is actually failing.

I get this exception when attempting to open a file:
org/apache/hadoop/fs/FSDataInputStream.read:org.apache.hadoop.hdfs.BlockMissingException:
Could not obtain block: BP-1827033441-192.168.57.112-1384284857542:blk_1073964234_223411
file=/rpdm/tmp/ProjectTemp_34_1/TempFolder_6/data00001_000003.dld
    org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:838)
    org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:889)
    org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1154)
    org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:77)

However, the block is definitely *not* missing.  I can be running the following command continuously while all of this is going on:
hdfs fsck /rpdm/tmp/ProjectTemp_34_1/TempFolder_6 -files -blocks -locations
Well before the tasks start it is showing good files all around, including:
/rpdm/tmp/ProjectTemp_34_1/TempFolder_6/data00001_000003.dld 144838614 bytes, 2 block(s):  OK
0. BP-1827033441-192.168.57.112-1384284857542:blk_1073964208_223385 len=134217728 repl=1 [192.168.57.110:50010]
1. BP-1827033441-192.168.57.112-1384284857542:blk_1073964234_223411 len=10620886 repl=1 [192.168.57.110:50010]

My application logs also show that *some* tasks are able to open the files for which a missing block is reported.
In case you suspect, the files are not being deleted.  The fsck continues to show good status for these files well after the error report.
I've also checked to ensure that the files are not being held open by the creators of the files.

This leads me to believe that I've hit an HDFS open-file limit of some kind.  We can compensate pretty easily, by doing a two-phase merge that opens far fewer files simultaneously, keeping a limited pool of open files, etc.  However, I would still like to know what limit is being hit, and how to best predict that limit on various cluster configurations.

Thanks,
john

Re: BlockMissingException reading HDFS file, but the block exists and fsck shows OK

Posted by Harsh J <ha...@cloudera.com>.
Can you check the log of the DN that is holding the specific block for any
errors?
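
For example, on the datanode that fsck lists for that block (the log location below is typical for a packaged install and may differ on your cluster):
grep blk_1073964234 /var/log/hadoop-hdfs/hadoop-hdfs-datanode-*.log
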
On Jan 27, 2014 8:37 PM, "John Lilley" <jo...@redpoint.net> wrote:

>  I am getting this perplexing error.  Our YARN application launches tasks
> that attempt to simultaneously open a large number of files for merge.
> There seems to be a load threshold in terms of number of simultaneous tasks
> attempting to open a set of HDFS files on a four-node cluster.  The
> threshold is hit at 32 tasks, each opening 450 files.  The threshold is not
> hit at 16 tasks, each opening 250 files.
>
>
>
> The files are stored in HDFS with replication=1.  I know that low
> replication leaves me open to node-failure issues, but bear with me,
> nothing is actually failing.
>
>
>
> I get this exception when attempting to open a file:
>
> org/apache/hadoop/fs/FSDataInputStream.read:org.apache.hadoop.hdfs.BlockMissingException:
> Could not obtain block: BP-1827033441-192.168.57.112-1384284857542:blk_1073964234_223411
> file=/rpdm/tmp/ProjectTemp_34_1/TempFolder_6/data00001_000003.dld
>     org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:838)
>     org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:889)
>     org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1154)
>     org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:77)
>
>
>
> However, the block is definitely *not* missing.  I can be running the
> following command continuously while all of this is going on:
>
> hdfs fsck /rpdm/tmp/ProjectTemp_34_1/TempFolder_6 -files -blocks -locations
>
> Well before the tasks start it is showing good files all around, including:
>
> /rpdm/tmp/ProjectTemp_34_1/TempFolder_6/data00001_000003.dld 144838614 bytes, 2 block(s):  OK
> 0. BP-1827033441-192.168.57.112-1384284857542:blk_1073964208_223385 len=134217728 repl=1 [192.168.57.110:50010]
> 1. BP-1827033441-192.168.57.112-1384284857542:blk_1073964234_223411 len=10620886 repl=1 [192.168.57.110:50010]
>
>
>
> My application logs also show that *some* tasks are able to open the
> files for which a missing block is reported.
>
> In case you suspect, the files are not being deleted.  The fsck continues
> to show good status for these files well after the error report.
>
> I've also checked to ensure that the files are not being held open by the
> creators of the files.
>
>
>
> This leads me to believe that I've hit an HDFS open-file limit of some
> kind.  We can compensate pretty easily, by doing a two-phase merge that
> opens far fewer files simultaneously, keeping a limited pool of open files,
> etc.  However, I would still like to know what limit is being hit, and how
> to best predict that limit on various cluster configurations.
>
>
>
> Thanks,
>
> john
>
