Posted to hdfs-user@hadoop.apache.org by "Agarwal, Nikhil" <Ni...@netapp.com> on 2013/05/14 12:55:26 UTC

Map Tasks do not obey data locality principle........

Hi,

I have a 3-node cluster, with the JobTracker running on one machine and TaskTrackers on the other two (say, slave1 and slave2). Instead of using HDFS, I have written my own FileSystem implementation. Since, unlike HDFS, I am unable to provide a shared filesystem view to the JobTracker and TaskTrackers, I mounted the root container of slave2 on a directory in slave1 (NFS mount). By doing this I am able to submit an MR job to the JobTracker, with an input path such as my_scheme://slave1_IP:Port/dir1, etc.  The MR job runs successfully, but data locality is not ensured, i.e. if files A,B,C are kept on slave1 and D,E,F on slave2, then according to data locality the map tasks for A,B,C should be submitted to the TaskTracker running on slave1 and those for D,E,F to the TaskTracker on slave2. Instead, it randomly schedules the map tasks to any of the TaskTrackers. If the map task for file A is submitted to the TaskTracker running on slave2, it implies that file A is being fetched over the network by slave2.

How do I prevent this from happening?

Thanks,
Nikhil



Re: Map Tasks do not obey data locality principle........

Posted by Sandy Ryza <sa...@cloudera.com>.
Hi Nikhil,

Which scheduler are you using?

-Sandy


On Tue, May 14, 2013 at 3:55 AM, Agarwal, Nikhil
<Ni...@netapp.com> wrote:

>  Hi,
>
> I  have a 3-node cluster, with JobTracker running on one machine and
> TaskTrackers on other two (say, slave1 and slave2). Instead of using HDFS,
> I have written my own FileSystem implementation. Since, unlike HDFS I am
> unable to provide a shared filesystem view to JobTrackers and TaskTracker
> thus, I mounted the root container of slave2 on a directory in slave1 (nfs
> mount). By doing this I am able to submit MR job to JobTracker, with input
> path as my_scheme://slave1_IP:Port/dir1, etc.  MR runs successfully but
> what happens is that data locality is not ensured i.e. if files A,B,C are
> kept on slave1 and D,E,F on slave2 then according to data locality, map
> tasks should be submitted such that map task of A,B,C are submitted to
> TaskTracker running on slave1 and D,E,F on slave2. Instead of this, it
> randomly schedules the map task to any of the tasktrackers. If map task of
> file A is submitted to TaskTracker running on slave2 then it implies that
> file A is being fetched over the network by slave2.
>
> How do I avoid this from happening?
>
> Thanks,
> Nikhil
>

Re: Map Tasks do not obey data locality principle........

Posted by Harsh J <ha...@cloudera.com>.
Hi Nikhil,

For (1) - It's hard to tell specifically what you may be doing wrong or
differently than expected, because I don't have the source to look at,
but do you at least see a line in the JT log saying that task X has a
split on node Y? Does that line match your InputSplit's actual data location?

For (2) - I think the answer is fairly obvious, so perhaps the question
isn't clear/specific enough? A map task pulls the data it operates on, and
if that data is "remote" (i.e. not on its local filesystem in direct or
indirect form) then the bytes will be pulled over some form of network.
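
If it helps, here is a rough, untested sketch (old mapred API) for dumping
the locations the computed input splits actually advertise, so you can
compare them against where the files really live. The input path is a
placeholder and it assumes your custom scheme is registered in the
configuration.

import java.util.Arrays;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class PrintSplitLocations {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf();
    // Placeholder path; point this at your actual input directory.
    FileInputFormat.setInputPaths(conf, new Path("my_scheme://slave1:9000/dir1"));
    TextInputFormat format = new TextInputFormat();
    format.configure(conf);
    // One entry per split: which hosts the split claims to be local to.
    for (InputSplit split : format.getSplits(conf, 1)) {
      System.out.println(split + " -> " + Arrays.toString(split.getLocations()));
    }
  }
}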

On Thu, May 16, 2013 at 11:51 AM, Agarwal, Nikhil
<Ni...@netapp.com> wrote:
> Agreed. Thanks for replying. As hints what I have given is the ip address of the node where the file is residing but still it does not follow data locality.
>
> One clarification -  If map task for file A is being submitted to a TaskTracker running on different node then does it necessarily mean that entire file A was transferred to the other node?
>
> Regards,
> Nikhil
>
> -----Original Message-----
> From: Harsh J [mailto:harsh@cloudera.com]
> Sent: Thursday, May 16, 2013 11:47 AM
> To: <us...@hadoop.apache.org>
> Subject: Re: Map Tasks do not obey data locality principle........
>
> The scheduling is done based on block locations filled in by the input splits. If there's no hints being provided by your FS, then the result you're seeing is correct.
>
> Note that if you don't use a block concept, you ought to consider a whole file as one block and return a location based on that.
>
> Essentially, your
> http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#getFileBlockLocations(org.apache.hadoop.fs.FileStatus,%20long,%20long)
> form of API calls has to return valid values for scheduling to work.
>
> On Thu, May 16, 2013 at 11:38 AM, Agarwal, Nikhil <Ni...@netapp.com> wrote:
>> No, it does not.  I have kept the granularity at file level rather than a block. I do not think that should affect the mapping of tasks ?
>>
>> Regards,
>> Nikhil
>>
>> -----Original Message-----
>> From: Harsh J [mailto:harsh@cloudera.com]
>> Sent: Thursday, May 16, 2013 2:31 AM
>> To: <us...@hadoop.apache.org>
>> Subject: Re: Map Tasks do not obey data locality principle........
>>
>> Also, does your custom FS report block locations in the exact same format as how HDFS does?
>>
>> On Tue, May 14, 2013 at 4:25 PM, Agarwal, Nikhil <Ni...@netapp.com> wrote:
>>> Hi,
>>>
>>>
>>>
>>> I  have a 3-node cluster, with JobTracker running on one machine and
>>> TaskTrackers on other two (say, slave1 and slave2). Instead of using
>>> HDFS, I have written my own FileSystem implementation. Since, unlike
>>> HDFS I am unable to provide a shared filesystem view to JobTrackers
>>> and TaskTracker thus, I mounted the root container of slave2 on a
>>> directory in slave1 (nfs mount). By doing this I am able to submit MR
>>> job to JobTracker, with input path as
>>> my_scheme://slave1_IP:Port/dir1, etc.  MR runs successfully but what
>>> happens is that data locality is not ensured i.e. if files A,B,C are
>>> kept on
>>> slave1 and D,E,F on slave2 then according to data locality, map tasks
>>> should be submitted such that map task of A,B,C are submitted to
>>> TaskTracker running on slave1 and D,E,F on slave2. Instead of this,
>>> it randomly schedules the map task to any of the tasktrackers. If map
>>> task of file A is submitted to TaskTracker running on slave2 then it
>>> implies that file A is being fetched over the network by slave2.
>>>
>>>
>>>
>>> How do I avoid this from happening?
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Nikhil
>>>
>>>
>>>
>>>
>>
>>
>>
>> --
>> Harsh J
>
>
>
> --
> Harsh J



-- 
Harsh J

RE: Map Tasks do not obey data locality principle........

Posted by "Agarwal, Nikhil" <Ni...@netapp.com>.
Agreed. Thanks for replying. As hints, I have given the IP address of the node where the file resides, but it still does not follow data locality.

One clarification - if the map task for file A is submitted to a TaskTracker running on a different node, does it necessarily mean that the entire file A was transferred to that node?

Regards,
Nikhil

-----Original Message-----
From: Harsh J [mailto:harsh@cloudera.com] 
Sent: Thursday, May 16, 2013 11:47 AM
To: <us...@hadoop.apache.org>
Subject: Re: Map Tasks do not obey data locality principle........

The scheduling is done based on block locations filled in by the input splits. If there's no hints being provided by your FS, then the result you're seeing is correct.

Note that if you don't use a block concept, you ought to consider a whole file as one block and return a location based on that.

Essentially, your
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#getFileBlockLocations(org.apache.hadoop.fs.FileStatus,%20long,%20long)
form of API calls has to return valid values for scheduling to work.

On Thu, May 16, 2013 at 11:38 AM, Agarwal, Nikhil <Ni...@netapp.com> wrote:
> No, it does not.  I have kept the granularity at file level rather than a block. I do not think that should affect the mapping of tasks ?
>
> Regards,
> Nikhil
>
> -----Original Message-----
> From: Harsh J [mailto:harsh@cloudera.com]
> Sent: Thursday, May 16, 2013 2:31 AM
> To: <us...@hadoop.apache.org>
> Subject: Re: Map Tasks do not obey data locality principle........
>
> Also, does your custom FS report block locations in the exact same format as how HDFS does?
>
> On Tue, May 14, 2013 at 4:25 PM, Agarwal, Nikhil <Ni...@netapp.com> wrote:
>> Hi,
>>
>>
>>
>> I  have a 3-node cluster, with JobTracker running on one machine and 
>> TaskTrackers on other two (say, slave1 and slave2). Instead of using 
>> HDFS, I have written my own FileSystem implementation. Since, unlike 
>> HDFS I am unable to provide a shared filesystem view to JobTrackers 
>> and TaskTracker thus, I mounted the root container of slave2 on a 
>> directory in slave1 (nfs mount). By doing this I am able to submit MR 
>> job to JobTracker, with input path as 
>> my_scheme://slave1_IP:Port/dir1, etc.  MR runs successfully but what 
>> happens is that data locality is not ensured i.e. if files A,B,C are 
>> kept on
>> slave1 and D,E,F on slave2 then according to data locality, map tasks 
>> should be submitted such that map task of A,B,C are submitted to 
>> TaskTracker running on slave1 and D,E,F on slave2. Instead of this, 
>> it randomly schedules the map task to any of the tasktrackers. If map 
>> task of file A is submitted to TaskTracker running on slave2 then it 
>> implies that file A is being fetched over the network by slave2.
>>
>>
>>
>> How do I avoid this from happening?
>>
>>
>>
>> Thanks,
>>
>> Nikhil
>>
>>
>>
>>
>
>
>
> --
> Harsh J



--
Harsh J

Re: Map Tasks do not obey data locality principle........

Posted by Harsh J <ha...@cloudera.com>.
Scheduling is done based on the block locations filled in by the input
splits. If no hints are being provided by your FS, then the result
you're seeing is the expected behaviour.

Note that if you don't use a block concept, you ought to consider a
whole file as one block and return a location based on that.

Essentially, your implementation of
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#getFileBlockLocations(org.apache.hadoop.fs.FileStatus,%20long,%20long)
has to return valid values for locality-aware scheduling to work.
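
To illustrate (a rough, untested sketch, not your actual code): in an FS
with no block concept, the override could report the whole file as a
single location, along the lines of the fragment below inside your
FileSystem subclass. The lookupOwningHost helper and the port are
placeholders for whatever placement metadata your FS keeps, and the host
strings must match the hostnames the TaskTrackers register with.

import java.io.IOException;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;

// Inside your custom FileSystem subclass (other required overrides omitted):
@Override
public BlockLocation[] getFileBlockLocations(FileStatus file, long start, long len)
    throws IOException {
  if (file == null) {
    return null;
  }
  if (start < 0 || len < 0) {
    throw new IllegalArgumentException("Invalid start or len parameter");
  }
  if (file.getLen() <= start) {
    return new BlockLocation[0];
  }
  // Placeholder: resolve which node actually stores this file.
  String host = lookupOwningHost(file.getPath()); // e.g. "slave1"
  String name = host + ":50010";                  // "host:port" entry; port is illustrative
  // Report the whole file as one block located on that host.
  return new BlockLocation[] {
      new BlockLocation(new String[] { name }, new String[] { host }, 0, file.getLen())
  };
}

private String lookupOwningHost(Path path) {
  // Placeholder for whatever metadata your FS keeps about file placement.
  return "slave1";
}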

On Thu, May 16, 2013 at 11:38 AM, Agarwal, Nikhil
<Ni...@netapp.com> wrote:
> No, it does not.  I have kept the granularity at file level rather than a block. I do not think that should affect the mapping of tasks ?
>
> Regards,
> Nikhil
>
> -----Original Message-----
> From: Harsh J [mailto:harsh@cloudera.com]
> Sent: Thursday, May 16, 2013 2:31 AM
> To: <us...@hadoop.apache.org>
> Subject: Re: Map Tasks do not obey data locality principle........
>
> Also, does your custom FS report block locations in the exact same format as how HDFS does?
>
> On Tue, May 14, 2013 at 4:25 PM, Agarwal, Nikhil <Ni...@netapp.com> wrote:
>> Hi,
>>
>>
>>
>> I  have a 3-node cluster, with JobTracker running on one machine and
>> TaskTrackers on other two (say, slave1 and slave2). Instead of using
>> HDFS, I have written my own FileSystem implementation. Since, unlike
>> HDFS I am unable to provide a shared filesystem view to JobTrackers
>> and TaskTracker thus, I mounted the root container of slave2 on a
>> directory in slave1 (nfs mount). By doing this I am able to submit MR
>> job to JobTracker, with input path as my_scheme://slave1_IP:Port/dir1,
>> etc.  MR runs successfully but what happens is that data locality is
>> not ensured i.e. if files A,B,C are kept on
>> slave1 and D,E,F on slave2 then according to data locality, map tasks
>> should be submitted such that map task of A,B,C are submitted to
>> TaskTracker running on slave1 and D,E,F on slave2. Instead of this, it
>> randomly schedules the map task to any of the tasktrackers. If map
>> task of file A is submitted to TaskTracker running on slave2 then it
>> implies that file A is being fetched over the network by slave2.
>>
>>
>>
>> How do I avoid this from happening?
>>
>>
>>
>> Thanks,
>>
>> Nikhil
>>
>>
>>
>>
>
>
>
> --
> Harsh J



-- 
Harsh J

RE: Map Tasks do not obey data locality principle........

Posted by "Agarwal, Nikhil" <Ni...@netapp.com>.
No, it does not. I have kept the granularity at the file level rather than at the block level. I do not think that should affect the mapping of tasks, should it?

Regards,
Nikhil 

-----Original Message-----
From: Harsh J [mailto:harsh@cloudera.com] 
Sent: Thursday, May 16, 2013 2:31 AM
To: <us...@hadoop.apache.org>
Subject: Re: Map Tasks do not obey data locality principle........

Also, does your custom FS report block locations in the exact same format as how HDFS does?

On Tue, May 14, 2013 at 4:25 PM, Agarwal, Nikhil <Ni...@netapp.com> wrote:
> Hi,
>
>
>
> I  have a 3-node cluster, with JobTracker running on one machine and 
> TaskTrackers on other two (say, slave1 and slave2). Instead of using 
> HDFS, I have written my own FileSystem implementation. Since, unlike 
> HDFS I am unable to provide a shared filesystem view to JobTrackers 
> and TaskTracker thus, I mounted the root container of slave2 on a 
> directory in slave1 (nfs mount). By doing this I am able to submit MR 
> job to JobTracker, with input path as my_scheme://slave1_IP:Port/dir1, 
> etc.  MR runs successfully but what happens is that data locality is 
> not ensured i.e. if files A,B,C are kept on
> slave1 and D,E,F on slave2 then according to data locality, map tasks 
> should be submitted such that map task of A,B,C are submitted to 
> TaskTracker running on slave1 and D,E,F on slave2. Instead of this, it 
> randomly schedules the map task to any of the tasktrackers. If map 
> task of file A is submitted to TaskTracker running on slave2 then it 
> implies that file A is being fetched over the network by slave2.
>
>
>
> How do I avoid this from happening?
>
>
>
> Thanks,
>
> Nikhil
>
>
>
>



--
Harsh J

Re: Map Tasks do not obey data locality principle........

Posted by Harsh J <ha...@cloudera.com>.
Also, does your custom FS report block locations in the exact same
format as HDFS does?
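
(A quick, untested way to compare: run something like the below once
against an hdfs:// path and once against your my_scheme:// path, and see
whether the hosts and offsets come out in the same shape. The path is a
placeholder and assumes the custom scheme is registered in the config.)

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PrintBlockLocations {
  public static void main(String[] args) throws Exception {
    // Placeholder path; pass a real one on the command line.
    Path p = new Path(args.length > 0 ? args[0] : "my_scheme://slave1:9000/dir1/fileA");
    FileSystem fs = p.getFileSystem(new Configuration());
    FileStatus st = fs.getFileStatus(p);
    // One line per reported block: where it starts, how long it is, and its hosts.
    for (BlockLocation loc : fs.getFileBlockLocations(st, 0, st.getLen())) {
      System.out.println("offset=" + loc.getOffset() + " length=" + loc.getLength()
          + " hosts=" + Arrays.toString(loc.getHosts()));
    }
  }
}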

On Tue, May 14, 2013 at 4:25 PM, Agarwal, Nikhil
<Ni...@netapp.com> wrote:
> Hi,
>
>
>
> I  have a 3-node cluster, with JobTracker running on one machine and
> TaskTrackers on other two (say, slave1 and slave2). Instead of using HDFS, I
> have written my own FileSystem implementation. Since, unlike HDFS I am
> unable to provide a shared filesystem view to JobTrackers and TaskTracker
> thus, I mounted the root container of slave2 on a directory in slave1 (nfs
> mount). By doing this I am able to submit MR job to JobTracker, with input
> path as my_scheme://slave1_IP:Port/dir1, etc.  MR runs successfully but what
> happens is that data locality is not ensured i.e. if files A,B,C are kept on
> slave1 and D,E,F on slave2 then according to data locality, map tasks should
> be submitted such that map task of A,B,C are submitted to TaskTracker
> running on slave1 and D,E,F on slave2. Instead of this, it randomly
> schedules the map task to any of the tasktrackers. If map task of file A is
> submitted to TaskTracker running on slave2 then it implies that file A is
> being fetched over the network by slave2.
>
>
>
> How do I avoid this from happening?
>
>
>
> Thanks,
>
> Nikhil
>
>
>
>



-- 
Harsh J
