Posted to user@hadoop.apache.org by blah blah <tm...@gmail.com> on 2013/06/22 12:51:12 UTC

How can a YarnTask read/write local-host HDFS blocks?

Hi all

*Disclaimer*
I am creating a prototype Application Master. I am using an old YARN
development version: revision 1437315, from 2013-01-23 (3.0.0-SNAPSHOT). I
cannot update to the current trunk version, as the prototype deadline is
soon and I don't have time to incorporate the YARN API changes.

My cluster setup is as follows:
- each computational node acts as NodeManager and as a DataNode
- dedicated single node for the ResourceManager and NameNode

I have scheduled Containers/Tasks on the hosts which hold the input HDFS
blocks to achieve data locality (new
AMRMClient.ContainerRequest(capability, *blocksHosts*, racks, pri,
numContainers)). I know that this placement is not guaranteed (but let's
assume the Tasks were scheduled directly on the hosts holding the input
HDFS blocks). I have 3 questions regarding reading/writing data from HDFS.
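
(For context, before the questions: here is roughly how the block hosts can
be obtained from the NameNode and fed into the request above; the input path
and class name are only illustrative.)

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockHostsSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path input = new Path("/data/input.bin"); // illustrative input path

    FileStatus status = fs.getFileStatus(input);
    // One entry per HDFS block: the hosts holding a replica of that block.
    List<String[]> blockHosts = new ArrayList<String[]>();
    for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
      blockHosts.add(loc.getHosts());
    }
    // Each String[] is then used as the blocksHosts argument of the
    // AMRMClient.ContainerRequest shown above, one request per block.
  }
}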

1. How can a Container/Task read a local HDFS block?
Since the Container/Task was scheduled on the same computational node as its
input HDFS block, how can I read the local block? Should I use
LocalFileSystem, since the HDFS block is stored locally? Any code snippet or
source code reference will be greatly appreciated.

2. With multiple Containers on the same host, how do I determine which local
block should be read by which Container/Task?
When multiple Containers/Tasks are scheduled to the same host, and different
input HDFS blocks are also stored on that host, how can I ensure that each
Container/Task reads "its" local HDFS block? For example: the input consists
of 10 blocks, the job uses 5 nodes, 2 containers were scheduled on each node,
and each node holds 2 distinct HDFS blocks. How can I read Block_A in
Container_2_Host_A and Block_B in Container_3_Host_A?
Again, any code snippet or source code reference will be greatly appreciated.

3. Writing an HDFS block to the local node (not to the local file system).
How can I write processed HDFS blocks back to HDFS, but store them on the
same local host? As far as I know (please correct me if I am wrong),
whenever a Task writes data to HDFS, HDFS tries to store it on the same
host, then on the same rack, then as close as possible (assuming replication
factor 3). Is this process automatic, and will a simple hdfs.write() do the
trick? You know that any code snippet or source code reference will be
greatly appreciated.

Thank you for your help in advance.

regards
tmp

RE: How can a YarnTask read/write local-host HDFS blocks?

Posted by John Lilley <jo...@redpoint.net>.
Blah blah,
One point you might have missed: multiple tasks cannot all write the same HDFS file at the same time.  So you can't just split an output file into sections and say "task1 write block1, etc".  Typically each task outputs a separate file and these file-parts are read or merged later.
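
As a sketch (the output directory and the way the task learns its ID are only
illustrative), each task writes its own part file and a later step lists and
merges them:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PerTaskOutputSketch {
  public static void main(String[] args) throws Exception {
    // Assumed: the AM puts a unique task/container ID on the launch command.
    String taskId = args[0]; // e.g. "container_..._000002"
    FileSystem fs = FileSystem.get(new Configuration());
    Path part = new Path("/output/myjob/part-" + taskId); // illustrative dir

    try (FSDataOutputStream out = fs.create(part, true)) {
      out.write("this task's results\n".getBytes("UTF-8"));
    }
    // A later step lists /output/myjob/part-* and reads or merges the pieces.
  }
}
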
john

-----Original Message-----
From: Harsh J [mailto:harsh@cloudera.com] 
Sent: Saturday, June 22, 2013 5:33 AM
To: <us...@hadoop.apache.org>
Subject: Re: How can a YarnTask read/write local-host HDFS blocks?

Hi,

On Sat, Jun 22, 2013 at 4:21 PM, blah blah <tm...@gmail.com> wrote:
> Hi all
>
> Disclaimer
> I am creating a prototype Application Master. I am using old Yarn 
> development version. Revision 1437315, from 2013-01-23 (SNAPSHOT 
> 3.0.0). I can not update to current trunk version, as prototype 
> deadline is soon, and I don't have time to include Yarn API changes.
>
> My cluster setup is as follows:
> - each computational node acts as NodeManager and as a DataNode
> - dedicated single node for the ResourceManager and NameNode
>
> I have scheduled Containers/Tasks to the hosts which hold input data 
> HDFS blocks to achieve data locality (new 
> AMRMClient.ContainerRequest(capability,
> blocksHosts, racks, pri, numContainers)). I know that the Task 
> schedule is not guaranteed (but lets assume Tasks were scheduled 
> directly to hosts with input HDFS blocks). I have 3 questions 
> regarding reading/writing data from HDFS.
>
> 1. How can a Container/Task read local HDFS block?
> Since Container/Task was scheduled on the same computational node as 
> its input HDFS block, how can I read the local block? Should I use 
> LocalFileSystem, since HDFS block is stored locally? Any code snippet 
> or source code reference will be greatly appreciated.

The HDFS client does local reads automatically if there is a local DN where they are running (and it has the block they request). A developer needn't concern themselves with explicitly trying to read local data - it is done automatically by the framework.


> 2. Multiple Containers on same Host, how to differ which local block 
> should be read by which Container/Task?
> In case there are multiple Containers/Tasks scheduled to the same 
> Host, and also different input HDFS blocks are stored on the same 
> Host. How can I ensure that Container/Task will read "its" HDFS local 
> block. For example INPUT consists of 10 blocks, Job uses 5 nodes, and 
> for each node 2 containers were scheduled, also each node holds 2 
> distinct HDFS blocks. How can I read Block_A in Container_2_Host_A and Block_B in Container_3_Host_A.
> Again any code snippet or source code reference will be greatly appreciated.

You basically have to assign a file (offset + len) to each container you launch. Each container then reads only what it was assigned. You can pass this read info to it via CLI options, a serialized file, etc.

> 3. Write HDFS block to local node (not local file system).
> How can I write read-processed HDFS blocks back to HDFS, but store it 
> on the same local host. As far as I know (if I am wrong please correct 
> me), whenever Task writes some data to HDFS, HDFS tries to store it on 
> the same host, then rack, then as close as possible (assuming replication factor 3).
> Is this process automated, and simple hdfs.write() will do the trick? 
> You know that any code snippet or source code reference will be 
> greatly appreciated.

This process is automatic in the same way a local read is automatic.
You needn't write special code for this.

> Thank you for your help in advance.
>
> regards
> tmp



--
Harsh J

Re: How can a YarnTask read/write local-host HDFS blocks?

Posted by Harsh J <ha...@cloudera.com>.
Hi,

On Sat, Jun 22, 2013 at 4:21 PM, blah blah <tm...@gmail.com> wrote:
> Hi all
>
> Disclaimer
> I am creating a prototype Application Master. I am using old Yarn
> development version. Revision 1437315, from 2013-01-23 (SNAPSHOT 3.0.0). I
> can not update to current trunk version, as prototype deadline is soon, and
> I don't have time to include Yarn API changes.
>
> My cluster setup is as follows:
> - each computational node acts as NodeManager and as a DataNode
> - dedicated single node for the ResourceManager and NameNode
>
> I have scheduled Containers/Tasks to the hosts which hold input data HDFS
> blocks to achieve data locality (new AMRMClient.ContainerRequest(capability,
> blocksHosts, racks, pri, numContainers)). I know that the Task schedule is
> not guaranteed (but lets assume Tasks were scheduled directly to hosts with
> input HDFS blocks). I have 3 questions regarding reading/writing data from
> HDFS.
>
> 1. How can a Container/Task read local HDFS block?
> Since Container/Task was scheduled on the same computational node as its
> input HDFS block, how can I read the local block? Should I use
> LocalFileSystem, since HDFS block is stored locally? Any code snippet or
> source code reference will be greatly appreciated.

The HDFS client does local reads automatically if there is a local DN
where they are running (and it has the block they request). A
developer needn't concern themselves with explicitly trying to read
local data - it is done automatically by the framework.
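
For example, a plain FileSystem read is all a task needs; nothing
locality-specific (the path below is made up, and I assume the AM tells each
task which file to read):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LocalReadSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf); // hdfs:// per the cluster config
    Path input = new Path("/data/input/part-00000"); // made-up path

    byte[] buffer = new byte[64 * 1024];
    try (FSDataInputStream in = fs.open(input)) {
      int read;
      // The DFS client transparently reads from the local DataNode when it
      // holds a replica of the requested block.
      while ((read = in.read(buffer)) != -1) {
        // process 'read' bytes here
      }
    }
  }
}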

> 2. Multiple Containers on same Host, how to differ which local block should
> be read by which Container/Task?
> In case there are multiple Containers/Tasks scheduled to the same Host, and
> also different input HDFS blocks are stored on the same Host. How can I
> ensure that Container/Task will read "its" HDFS local block. For example
> INPUT consists of 10 blocks, Job uses 5 nodes, and for each node 2
> containers were scheduled, also each node holds 2 distinct HDFS blocks. How
> can I read Block_A in Container_2_Host_A and Block_B in Container_3_Host_A.
> Again any code snippet or source code reference will be greatly appreciated.

You basically have to assign a file (offset + len) to each container
you launch. Each container then reads only what it was assigned. You
can pass this read info to it via CLI options, a serialized file, etc.
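
A rough sketch of the container side (the <path> <offset> <length> launch
arguments are made up; the AM would put them on each container's command
line or into a serialized file):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AssignedSplitReader {
  // Assumed launch command: java AssignedSplitReader <path> <offset> <length>
  public static void main(String[] args) throws Exception {
    Path path = new Path(args[0]);
    long offset = Long.parseLong(args[1]);
    long length = Long.parseLong(args[2]);

    FileSystem fs = FileSystem.get(new Configuration());
    byte[] buffer = new byte[64 * 1024];
    long remaining = length;

    try (FSDataInputStream in = fs.open(path)) {
      in.seek(offset); // jump to this container's assigned region
      while (remaining > 0) {
        int read = in.read(buffer, 0, (int) Math.min(buffer.length, remaining));
        if (read == -1) break;
        remaining -= read;
        // process 'read' bytes of this container's split here
      }
    }
  }
}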

> 3. Write HDFS block to local node (not local file system).
> How can I write read-processed HDFS blocks back to HDFS, but store it on the
> same local host. As far as I know (if I am wrong please correct me),
> whenever Task writes some data to HDFS, HDFS tries to store it on the same
> host, then rack, then as close as possible (assuming replication factor 3).
> Is this process automated, and simple hdfs.write() will do the trick? You
> know that any code snippet or source code reference will be greatly
> appreciated.

This process is automatic in the same way a local read is automatic.
You needn't write special code for this.
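
So a plain create/write is enough. As a sketch (the output path is made up),
you can even check afterwards where the replicas of what you wrote ended up:

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LocalWriteSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path out = new Path("/output/result-host-a"); // made-up output path

    try (FSDataOutputStream os = fs.create(out, true)) {
      os.write("processed records\n".getBytes("UTF-8"));
    }

    // With the default placement policy, the writing host should show up
    // among the replica locations of each block it wrote.
    FileStatus st = fs.getFileStatus(out);
    for (BlockLocation loc : fs.getFileBlockLocations(st, 0, st.getLen())) {
      System.out.println(Arrays.toString(loc.getHosts()));
    }
  }
}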

> Thank you for your help in advance.
>
> regards
> tmp



--
Harsh J
