Posted to common-user@hadoop.apache.org by James Cipar <jc...@andrew.cmu.edu> on 2008/12/02 00:54:07 UTC

Which replica?

Is there any way to determine which replica of each chunk is read by a  
map-reduce program?  I've been looking through the hadoop code, and it  
seems like it tries to hide those kinds of details from the higher  
level API.  Ideally, I'd like the host the task was running on, the  
file name and chunk number, and the host the chunk was read from.

Re: Which replica?

Posted by Jim Cipar <jc...@cmu.edu>.
I'm looking at alternative policies for task and data placement.  As a 
first step, I'd like to be able to observe what Hadoop is doing without 
modifying our cluster's software.  We saw that the datanodes log every 
block that is read from them, but we didn't see any way to map from 
those block names to a (filename, chunk) pair.
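One way to build that mapping offline is to join the datanode read-log lines against the
output of `hadoop fsck <path> -files -blocks`, which lists each file's blocks in order, so
the block's position in the list is the chunk number. The sketch below is a rough
illustration under assumed formats: both the fsck output layout and the datanode
"clienttrace" log line shape here are guesses modeled on 0.18-era Hadoop, not an
authoritative spec, so the regexes will likely need adjusting against real output.

```python
import re

def parse_fsck(fsck_text):
    """Map block id -> (filename, chunk index) from assumed
    `hadoop fsck -files -blocks` output, where each file line is followed
    by numbered block lines like '0. blk_100_1 len=67108864 ...'."""
    block_map = {}
    current_file = None
    for line in fsck_text.splitlines():
        # Assumed file header line: "/path/to/file 1234 bytes, 2 block(s):  OK"
        m = re.match(r'^(/\S+) \d+ bytes, \d+ block\(s\):', line)
        if m:
            current_file = m.group(1)
            continue
        # Assumed block line: "<chunk>. blk_<id>_<genstamp> len=..."
        m = re.match(r'^(\d+)\.\s+blk_(-?\d+)', line.strip())
        if m and current_file is not None:
            block_map['blk_' + m.group(2)] = (current_file, int(m.group(1)))
    return block_map

def parse_datanode_log(log_text):
    """Extract (serving host, block id) pairs from assumed datanode
    clienttrace lines: '... src: /<host>:<port>, ... blk_<id>_<genstamp>'."""
    reads = []
    for line in log_text.splitlines():
        m = re.search(r'src: /(\S+?):\d+.*?blk_(-?\d+)', line)
        if m:
            reads.append((m.group(1), 'blk_' + m.group(2)))
    return reads

def join_reads(log_text, fsck_text):
    """Yield (serving host, filename, chunk index) for each logged block read."""
    block_map = parse_fsck(fsck_text)
    for host, blk in parse_datanode_log(log_text):
        if blk in block_map:
            filename, chunk = block_map[blk]
            yield host, filename, chunk
```

With fsck run once against the input directory and the datanode logs gathered from each
host, the join gives which host served which (file, chunk) pair without touching the
cluster's software, though it still cannot recover which replica a dead-datanode retry
originally tried.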



Doug Cutting wrote:
> A task may read from more than one block.  For example, in 
> line-oriented input, lines frequently cross block boundaries.  And a 
> block may be read from more than one host.  For example, if a datanode 
> dies midway through providing a block, the client will switch to using 
> a different datanode.  So the mapping is not simple.  This information 
> is also not, as you inferred, available to applications.  Why do you 
> need this?  Do you have a compelling reason?
>
> Doug
>
> James Cipar wrote:
>> Is there any way to determine which replica of each chunk is read by 
>> a map-reduce program?  I've been looking through the hadoop code, and 
>> it seems like it tries to hide those kinds of details from the higher 
>> level API.  Ideally, I'd like the host the task was running on, the 
>> file name and chunk number, and the host the chunk was read from.
>


Re: Which replica?

Posted by Doug Cutting <cu...@apache.org>.
A task may read from more than one block.  For example, in line-oriented 
input, lines frequently cross block boundaries.  And a block may be read 
from more than one host.  For example, if a datanode dies midway through 
providing a block, the client will switch to using a different datanode. 
So the mapping is not simple.  This information is also not, as you 
inferred, available to applications.  Why do you need this?  Do you have 
a compelling reason?

Doug

James Cipar wrote:
> Is there any way to determine which replica of each chunk is read by a 
> map-reduce program?  I've been looking through the hadoop code, and it 
> seems like it tries to hide those kinds of details from the higher level 
> API.  Ideally, I'd like the host the task was running on, the file name 
> and chunk number, and the host the chunk was read from.