Posted to common-user@hadoop.apache.org by Jiaqi Tan <ji...@gmail.com> on 2008/03/23 11:25:35 UTC

HoD and locality of TaskTrackers to data (on DataNodes)

Hi,

I have a question about using HoD and the locality of the assigned
TaskTrackers to the data.

Suppose I have a long-running HDFS installation with
TaskTrackers/JobTracker nodes dynamically allocated by HoD, and I
uploaded my data to HDFS prior to running my job/allocating nodes
using "dfs -put". Then, I allocate some nodes and run my job on that
data using HoD. Would the nodes allocated by HoD take into account the
HDFS nodes on which my data resides (e.g. by looking at which
DataNodes hold blocks that belong to the current user)? If the nodes
are just arbitrarily allocated, doesn't that break Hadoop's design
principle of having processing take place near the data?

And if HoD doesn't currently take block location into account when
allocating nodes, are there future plans for that to be incorporated?
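For concreteness, here is a toy sketch of what "taking block location into account" might look like. This is purely hypothetical (HoD exposes no such API): given a map from each block of the user's input to the DataNodes holding it, an allocator could greedily pick the nodes that cover the most not-yet-covered blocks, i.e. a greedy set cover.

```python
def pick_nodes(block_locations, num_nodes):
    """Greedily choose up to num_nodes machines covering the most blocks.

    block_locations: dict mapping block id -> set of DataNode names.
    Hypothetical illustration only; not how HoD actually allocates nodes.
    """
    covered = set()   # blocks already local to some chosen node
    chosen = []
    # Sort for deterministic tie-breaking among equally good nodes.
    candidates = sorted({n for locs in block_locations.values() for n in locs})
    for _ in range(num_nodes):
        remaining = [n for n in candidates if n not in chosen]
        if not remaining:
            break
        # Pick the node that makes the most uncovered blocks local.
        best = max(remaining,
                   key=lambda n: sum(1 for b, locs in block_locations.items()
                                     if n in locs and b not in covered))
        chosen.append(best)
        covered.update(b for b, locs in block_locations.items() if best in locs)
    return chosen

# Example: three blocks, replication 2, spread over four DataNodes.
blocks = {
    "blk_1": {"dn1", "dn2"},
    "blk_2": {"dn2", "dn3"},
    "blk_3": {"dn3", "dn4"},
}
print(pick_nodes(blocks, 2))  # ['dn2', 'dn3'] -- together cover all blocks
```

Two well-chosen nodes cover all three blocks here, whereas two arbitrarily allocated nodes (say dn1 and dn4) would leave blk_2 with no local copy.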

Thanks,
Jiaqi Tan

Re: HoD and locality of TaskTrackers to data (on DataNodes)

Posted by Jiaqi Tan <ji...@gmail.com>.
Hi Hemanth,

More design questions I'm wondering about:

So what determines the spread/location of data blocks that are
uploaded/added to HDFS outside of the Map/Reduce framework? For
instance, if I use a dfs -put to upload files to the HDFS, does the
dfs system try to spread the blocks out across machines as far as
possible? Or is the priority to balance disk usage so that the disks
that have the most free capacity get the blocks?

In that case, it would seem the system's overall priority is to first
spread the blocks out to equalize disk usage, after which HoD could
come in and suggest nodes to use for computation based on where the
blocks are. But then HoD users would be constrained (either from below
or above) in the number of nodes to use: e.g. if HDFS spread the
blocks out over 20 DataNodes, would optimal locality require exactly
20 TaskTrackers? This seems to restrict the number of TaskTrackers a
HoD user would want to allocate (not too few, and not too many).
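The constraint above can be put in back-of-the-envelope terms. As a deliberate simplification (ignoring replication, rack topology, and scheduling effects), if a file's blocks sit on some number of DataNodes and HoD co-locates TaskTrackers with only some of them, the best achievable fraction of data-local map tasks is capped as follows:

```python
def best_local_fraction(tasktrackers, datanodes):
    """Upper bound on the fraction of map tasks that can read a local block.

    Simplified model: blocks spread evenly over `datanodes` machines,
    replication and rack locality ignored. Illustration only.
    """
    return min(tasktrackers, datanodes) / datanodes

# With blocks on 20 DataNodes: 10 TaskTrackers cap locality at 50%,
# 20 achieve the maximum, and a 21st adds compute but no extra locality.
print(best_local_fraction(10, 20))  # 0.5
print(best_local_fraction(20, 20))  # 1.0
print(best_local_fraction(30, 20))  # 1.0
```

This is the "not too few, not too many" effect: fewer TaskTrackers than block-holding DataNodes forfeits locality, while extra ones past that point can only run non-local tasks.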

Thanks,
Jiaqi

On Sun, Mar 23, 2008 at 11:56 PM, Hemanth Yamijala
<yh...@yahoo-inc.com> wrote:
> Jiaqi,
>
>
> > Hi,
>  >
>  > I have a question about using HoD and the locality of the assigned
>  > TaskTrackers to the data.
>  >
>  > Suppose I have a long-running HDFS installation with
>  > TaskTrackers/JobTracker nodes dynamically allocated by HoD, and I
>  > uploaded my data to HDFS prior to running my job/allocating nodes
>  > using "dfs -put". Then, I allocate some nodes and run my job on that
>  > data using HoD. Would the nodes allocated by HoD take into account the
>  > HDFS nodes on which my data resides (e.g. by looking at which
>  > DataNodes hold blocks that belong to the current user)? If the nodes
>  > are just arbitrarily allocated, doesn't that break Hadoop's design
>  > principle of having processing take place near the data?
>  >
>  > And if HoD doesn't currently take block location into account when
>  > allocating nodes, are there future plans for that to be incorporated?
>  >
>  >
>  Excellent point ! HOD does not currently take this into account.  We are
>  working on ways in which we can accomplish this using configuration
>  outside HOD (i.e. in Torque / some Hadoop features in 0.17 like
>  HADOOP-1985). I will update this list (and possibly also documentation)
>  on how this can be setup, after we have some more concrete results.
>
>  Thanks
>  Hemanth
>


Re: HoD and locality of TaskTrackers to data (on DataNodes)

Posted by Hemanth Yamijala <yh...@yahoo-inc.com>.
Jiaqi,
> Hi,
>
> I have a question about using HoD and the locality of the assigned
> TaskTrackers to the data.
>
> Suppose I have a long-running HDFS installation with
> TaskTrackers/JobTracker nodes dynamically allocated by HoD, and I
> uploaded my data to HDFS prior to running my job/allocating nodes
> using "dfs -put". Then, I allocate some nodes and run my job on that
> data using HoD. Would the nodes allocated by HoD take into account the
> HDFS nodes on which my data resides (e.g. by looking at which
> DataNodes hold blocks that belong to the current user)? If the nodes
> are just arbitrarily allocated, doesn't that break Hadoop's design
> principle of having processing take place near the data?
>
> And if HoD doesn't currently take block location into account when
> allocating nodes, are there future plans for that to be incorporated?
>
>   
Excellent point! HOD does not currently take this into account. We are
working on ways to accomplish this using configuration outside HOD
(e.g. in Torque, or via some Hadoop features in 0.17 like
HADOOP-1985). I will update this list (and possibly also the
documentation) on how this can be set up, once we have some more
concrete results.

Thanks
Hemanth
