You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-user@hadoop.apache.org by Johannes Zillmann <jz...@googlemail.com> on 2011/04/20 12:07:58 UTC

use of inputSplit#getLocations()

Dear Folks,

i'm having a custom implementation of InputSplit which contains a combination of multiple blocks (similar to CombineFileInputFormat).
Each splits can have a different "data-locality category ":
a) host-local: there is one host which contains one replica of each block
b) rack-local: there is one rack which contains one replica of each block
c) mixed: blocks coming from different rack

I'm wondering what InputSplit#getLocation() should return in all those cases so hadoop can make optimal use of data-locality...

Should it contain all hosts which contains a replica of any of the blocks, sorted in a way the the hosts which contributes the most data come first ?
Or should it contains only those host which were determined as most optimal regarding the data-locality during the splitting-process. 

F.e. in case (a). Should the location array only contain this one host, or should it contain all hosts but the one host with all the blocks should simply be on the first position ?

best regards
Johannes


Re: use of inputSplit#getLocations()

Posted by Johannes Zillmann <jz...@googlemail.com>.
Hey Harsh,

thanks for you answer - thats what i wanted to know!
I wasn't sure after reading discussion in https://issues.apache.org/jira/browse/HADOOP-3293 if it makes sense to provide 
a) all locations sorted:
	hostA - 100% datalocality
	hostB - 80% datalocality
	hostC - 30% datalocality
b) the only the top locations sorted and cut the very poor choices:
	hostA - 100% datalocality
	hostB - 80% datalocality
c) or only the number one:
	hostA - 100% datalocality

Based on your input and internal discussion i would tend to (b) or maybe even (c).
best regards
Johannes


On Apr 21, 2011, at 3:58 PM, Harsh J wrote:

> Hello again,
> 
> The scheduler does take care of locality with a good algorithm.
> However, the String[] of locations is for one given logical split. It
> does not matter to the scheduler what your split really contains
> (single/multiple block or offsets), it only looks at 'where' to run
> the particular split favorably according to the location hostnames
> supplied. For every host supplied, it adds an entry to a cache set of
> <Node, Set<Local_TIPs>>.
> 
> For a split that's got 2 file offsets of 100% + 60%; giving the 100%
> local place alone, would be the best idea.
> 
> Have I understood your question right here?
> 
> On Thu, Apr 21, 2011 at 6:54 PM, Johannes Zillmann
> <jz...@googlemail.com> wrote:
>> So how does inputSplit#getLocations() influence the task distribution ? I assume a split is assigned in favor to a tasktracker which matches one of its location !?
>> Assuming that we have a combined split and there is one location with 100% data-locality and one location with 60% data-locality.
>> If the task-sceheduler doesn't care about the order of specified locations wouldn't it be better to specify as split-location only the 100% location so that its more likely that the split is assigned to that host ?
>> 
>> 
>> Johannes
>> 
>> On Apr 21, 2011, at 9:37 AM, Harsh J wrote:
>> 
>>> Hey Johannes,
>>> 
>>> On Wed, Apr 20, 2011 at 3:37 PM, Johannes Zillmann
>>> <jz...@googlemail.com> wrote:
>>>> Should it contain all hosts which contains a replica of any of the blocks, sorted in a way the the hosts which contributes the most data come first ?
>>>> Or should it contains only those host which were determined as most optimal regarding the data-locality during the splitting-process.
>>>> 
>>>> F.e. in case (a). Should the location array only contain this one host, or should it contain all hosts but the one host with all the blocks should simply be on the first position ?
>>> 
>>> Its better to send all locations for maximal locality, but the order
>>> is not considered AFAIK. Its the order of TT heartbeats at the JT that
>>> matters, instead.
>>> 
>>> --
>>> Harsh J
>> 
>> 
> 
> 
> 
> -- 
> Harsh J


Re: use of inputSplit#getLocations()

Posted by Harsh J <ha...@cloudera.com>.
Hello again,

The scheduler does take care of locality with a good algorithm.
However, the String[] of locations is for one given logical split. It
does not matter to the scheduler what your split really contains
(single/multiple block or offsets), it only looks at 'where' to run
the particular split favorably according to the location hostnames
supplied. For every host supplied, it adds an entry to a cache set of
<Node, Set<Local_TIPs>>.

For a split that's got 2 file offsets of 100% + 60%; giving the 100%
local place alone, would be the best idea.

Have I understood your question right here?

On Thu, Apr 21, 2011 at 6:54 PM, Johannes Zillmann
<jz...@googlemail.com> wrote:
> So how does inputSplit#getLocations() influence the task distribution ? I assume a split is assigned in favor to a tasktracker which matches one of its location !?
> Assuming that we have a combined split and there is one location with 100% data-locality and one location with 60% data-locality.
> If the task-sceheduler doesn't care about the order of specified locations wouldn't it be better to specify as split-location only the 100% location so that its more likely that the split is assigned to that host ?
>
>
> Johannes
>
> On Apr 21, 2011, at 9:37 AM, Harsh J wrote:
>
>> Hey Johannes,
>>
>> On Wed, Apr 20, 2011 at 3:37 PM, Johannes Zillmann
>> <jz...@googlemail.com> wrote:
>>> Should it contain all hosts which contains a replica of any of the blocks, sorted in a way the the hosts which contributes the most data come first ?
>>> Or should it contains only those host which were determined as most optimal regarding the data-locality during the splitting-process.
>>>
>>> F.e. in case (a). Should the location array only contain this one host, or should it contain all hosts but the one host with all the blocks should simply be on the first position ?
>>
>> Its better to send all locations for maximal locality, but the order
>> is not considered AFAIK. Its the order of TT heartbeats at the JT that
>> matters, instead.
>>
>> --
>> Harsh J
>
>



-- 
Harsh J

Re: use of inputSplit#getLocations()

Posted by Johannes Zillmann <jz...@googlemail.com>.
So how does inputSplit#getLocations() influence the task distribution ? I assume a split is assigned in favor to a tasktracker which matches one of its location !? 
Assuming that we have a combined split and there is one location with 100% data-locality and one location with 60% data-locality.
If the task-sceheduler doesn't care about the order of specified locations wouldn't it be better to specify as split-location only the 100% location so that its more likely that the split is assigned to that host ?


Johannes

On Apr 21, 2011, at 9:37 AM, Harsh J wrote:

> Hey Johannes,
> 
> On Wed, Apr 20, 2011 at 3:37 PM, Johannes Zillmann
> <jz...@googlemail.com> wrote:
>> Should it contain all hosts which contains a replica of any of the blocks, sorted in a way the the hosts which contributes the most data come first ?
>> Or should it contains only those host which were determined as most optimal regarding the data-locality during the splitting-process.
>> 
>> F.e. in case (a). Should the location array only contain this one host, or should it contain all hosts but the one host with all the blocks should simply be on the first position ?
> 
> Its better to send all locations for maximal locality, but the order
> is not considered AFAIK. Its the order of TT heartbeats at the JT that
> matters, instead.
> 
> -- 
> Harsh J


Re: use of inputSplit#getLocations()

Posted by Harsh J <ha...@cloudera.com>.
Hey Johannes,

On Wed, Apr 20, 2011 at 3:37 PM, Johannes Zillmann
<jz...@googlemail.com> wrote:
> Should it contain all hosts which contains a replica of any of the blocks, sorted in a way the the hosts which contributes the most data come first ?
> Or should it contains only those host which were determined as most optimal regarding the data-locality during the splitting-process.
>
> F.e. in case (a). Should the location array only contain this one host, or should it contain all hosts but the one host with all the blocks should simply be on the first position ?

Its better to send all locations for maximal locality, but the order
is not considered AFAIK. Its the order of TT heartbeats at the JT that
matters, instead.

-- 
Harsh J