Posted to user@hadoop.apache.org by ricky lee <ri...@gmail.com> on 2013/10/29 02:56:37 UTC

question about preserving data locality in MapReduce with Yarn

Hi,

I have a question about maintaining data locality in a MapReduce job
launched through Yarn. Based on the Yarn tutorial, it seems that an
application master can specify a resource name, memory, and CPU when
requesting containers. By carefully choosing resource names, I think
data locality can be achieved. I am curious how the current MapReduce
application master does this. Does it check all the blocks needed for a job
and choose a subset of nodes holding the most needed blocks? If someone can
point me to the source code that makes this decision, it would be very
much appreciated. thx.

-r

Re: question about preserving data locality in MapReduce with Yarn

Posted by Arun C Murthy <ac...@hortonworks.com>.

The code is slightly hard to follow since it's split between the client and the ApplicationMaster.

The client invokes InputFormat.getSplits to compute the split locations and writes them to a file in HDFS.
The ApplicationMaster then reads that file and creates resource requests based on the locations of each input split (3 replicas by default). See TaskAttemptImpl.dataLocalHosts and TaskAttemptImpl.dataLocalRacks, and follow those variables around in the code base.
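
To make the hosts/racks idea concrete, here is a minimal sketch in plain Java. It is not the actual TaskAttemptImpl code: the class name LocalitySketch, the hostToRack map and the sample node names are all made up for illustration, and a real AM would resolve racks through the cluster topology rather than a hand-built map.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative model only; names here are hypothetical, not Hadoop classes.
class LocalitySketch {
    // Fallback rack for hosts missing from the topology map (assumption).
    static final String DEFAULT_RACK = "/default-rack";

    /** Hosts holding a replica of the split's data, de-duplicated, order kept. */
    static Set<String> dataLocalHosts(List<String> splitLocations) {
        return new LinkedHashSet<>(splitLocations);
    }

    /** Racks that contain at least one of those hosts. */
    static Set<String> dataLocalRacks(Set<String> hosts, Map<String, String> hostToRack) {
        Set<String> racks = new LinkedHashSet<>();
        for (String host : hosts) {
            racks.add(hostToRack.getOrDefault(host, DEFAULT_RACK));
        }
        return racks;
    }

    public static void main(String[] args) {
        // Hypothetical topology: two nodes on one rack, one node on another.
        Map<String, String> topology = new HashMap<>();
        topology.put("node1", "/rack1");
        topology.put("node2", "/rack1");
        topology.put("node3", "/rack2");

        // Locations as they might be read back from the split file.
        Set<String> hosts = dataLocalHosts(Arrays.asList("node1", "node2", "node3"));
        Set<String> racks = dataLocalRacks(hosts, topology);
        System.out.println("hosts=" + hosts + " racks=" + racks);
    }
}
```

The point is only that each task ends up with a set of preferred hosts and a set of preferred racks derived from its split locations.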

hth,
Arun

On Oct 28, 2013, at 11:10 PM, ricky l <ri...@gmail.com> wrote:

> Hi Sandy, thank you very much for the information. It is good to know that MapReduce AM considers the block location information. BTW, I am not very familiar with the concept of splits. Is it specific to MR jobs? If possible, code location would be very helpful for reference as I am trying to implement an application master that needs to consider HDFS data-locality. thx.
> 
> r.
> 
> 
> On Mon, Oct 28, 2013 at 10:21 PM, Sandy Ryza <sa...@cloudera.com> wrote:
> Hi Ricky,
> 
> The input splits contain the locations of the blocks they cover.  The AM gets the information from the input splits and submits requests for those location.  Each container request spans all the replicas that the block is located on.  Are you interested in something more specific?
> 
> -Sandy
> 
> 
> On Mon, Oct 28, 2013 at 7:09 PM, ricky lee <ri...@gmail.com> wrote:
> Well, I thought an application master can somewhat ask where the data exist to a namenode.... isn't it true? If it does not know where the data reside, does a MapReduce application master specify the resource name as "*" which means data locality might not be preserved at all? thx,
> 
> r
> 
> 

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/



Re: question about preserving data locality in MapReduce with Yarn

Posted by Sandy Ryza <sa...@cloudera.com>.

Splits are a MapReduce concept. Check out FileInputFormat for an
example of how to get block locations.  You can then pass these locations
into an AMRMClient.ContainerRequest.
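
As a rough illustration of what that looks like, the sketch below models in plain Java how a file can be cut into splits that each inherit the hosts of the block at their start offset. It is a simplification and not the FileInputFormat source: the Block and Split classes, the sample sizes and the host names are hypothetical, and in real code the block list would come from HDFS (for example via FileSystem.getFileBlockLocations).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model of split computation; not the FileInputFormat source.
class SplitSketch {
    /** One block of a file: byte range plus the hosts holding a replica. */
    static final class Block {
        final long offset; final long length; final List<String> hosts;
        Block(long offset, long length, List<String> hosts) {
            this.offset = offset; this.length = length; this.hosts = hosts;
        }
    }

    /** One input split: byte range plus the preferred hosts. */
    static final class Split {
        final long start; final long length; final List<String> hosts;
        Split(long start, long length, List<String> hosts) {
            this.start = start; this.length = length; this.hosts = hosts;
        }
    }

    /** Index of the block that contains the given file offset. */
    static int blockIndex(List<Block> blocks, long offset) {
        for (int i = 0; i < blocks.size(); i++) {
            Block b = blocks.get(i);
            if (offset >= b.offset && offset < b.offset + b.length) {
                return i;
            }
        }
        throw new IllegalArgumentException("offset " + offset + " is outside the file");
    }

    /** Cut the file into splitSize-byte pieces; each takes the hosts of its first block. */
    static List<Split> computeSplits(List<Block> blocks, long fileLength, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<Split> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            long length = Math.min(splitSize, fileLength - start);
            splits.add(new Split(start, length, blocks.get(blockIndex(blocks, start)).hosts));
        }
        return splits;
    }

    public static void main(String[] args) {
        // Hypothetical 200-byte file stored as two blocks with three replicas each.
        List<Block> blocks = Arrays.asList(
            new Block(0, 128, Arrays.asList("node1", "node2", "node3")),
            new Block(128, 72, Arrays.asList("node2", "node3", "node4")));
        for (Split s : computeSplits(blocks, 200, 128)) {
            System.out.println("split@" + s.start + " len=" + s.length + " hosts=" + s.hosts);
        }
    }
}
```

The hosts list of each split is what you would hand to the container request as the preferred nodes.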

-Sandy


On Mon, Oct 28, 2013 at 8:10 PM, ricky l <ri...@gmail.com> wrote:

> Hi Sandy, thank you very much for the information. It is good to know that
> MapReduce AM considers the block location information. BTW, I am not very
> familiar with the concept of splits. Is it specific to MR jobs? If
> possible, code location would be very helpful for reference as I am trying
> to implement an application master that needs to consider HDFS
> data-locality. thx.
>
> r.
>
>
> On Mon, Oct 28, 2013 at 10:21 PM, Sandy Ryza <sa...@cloudera.com> wrote:
>
>> Hi Ricky,
>>
>> The input splits contain the locations of the blocks they cover.  The AM
>> gets the information from the input splits and submits requests for those
>> location.  Each container request spans all the replicas that the block is
>> located on.  Are you interested in something more specific?
>>
>> -Sandy
>>
>>
>> On Mon, Oct 28, 2013 at 7:09 PM, ricky lee <ri...@gmail.com> wrote:
>>
>>> Well, I thought an application master can somewhat ask where the data
>>> exist to a namenode.... isn't it true? If it does not know where the data
>>> reside, does a MapReduce application master specify the resource name as
>>> "*" which means data locality might not be preserved at all? thx,
>>>
>>> r
>>>
>>
>>
>

Re: question about preserving data locality in MapReduce with Yarn

Posted by ricky l <ri...@gmail.com>.

Hi Sandy, thank you very much for the information. It is good to know that
MapReduce AM considers the block location information. BTW, I am not very
familiar with the concept of splits. Is it specific to MR jobs? If
possible, code location would be very helpful for reference as I am trying
to implement an application master that needs to consider HDFS
data-locality. thx.

r.

On Mon, Oct 28, 2013 at 10:21 PM, Sandy Ryza <sa...@cloudera.com> wrote:

> Hi Ricky,
>
> The input splits contain the locations of the blocks they cover.  The AM
> gets the information from the input splits and submits requests for those
> location.  Each container request spans all the replicas that the block is
> located on.  Are you interested in something more specific?
>
> -Sandy
>
>
> On Mon, Oct 28, 2013 at 7:09 PM, ricky lee <ri...@gmail.com> wrote:
>
>> Well, I thought an application master can somewhat ask where the data
>> exist to a namenode.... isn't it true? If it does not know where the data
>> reside, does a MapReduce application master specify the resource name as
>> "*" which means data locality might not be preserved at all? thx,
>>
>> r
>>
>
>

Re: question about preserving data locality in MapReduce with Yarn

Posted by Sandy Ryza <sa...@cloudera.com>.

Hi Ricky,

The input splits contain the locations of the blocks they cover.  The AM
gets the information from the input splits and submits requests for those
locations.  Each container request spans all the replicas that the block is
located on.  Are you interested in something more specific?
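
A minimal sketch of what "spans all the replicas" can look like, in plain Java with no Hadoop classes. Everything named here (RequestTableSketch, the hostToRack map, the node names) is hypothetical; it only models the idea that one wanted container is recorded under every replica host, under each rack those hosts sit on, and under the "*" wildcard, so the scheduler can fall back from node to rack to anywhere.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative model only; not the MapReduce AM's request bookkeeping.
class RequestTableSketch {
    static final String ANY = "*";                     // wildcard resource name
    static final String DEFAULT_RACK = "/default-rack"; // fallback rack (assumption)

    /** For each resource name, how many containers are asked for across all splits. */
    static Map<String, Integer> buildRequests(List<List<String>> splitHosts,
                                              Map<String, String> hostToRack) {
        Map<String, Integer> table = new LinkedHashMap<>();
        for (List<String> hosts : splitHosts) {
            Set<String> racks = new LinkedHashSet<>();
            // One entry per distinct replica host of this split.
            for (String host : new LinkedHashSet<>(hosts)) {
                table.merge(host, 1, Integer::sum);
                racks.add(hostToRack.getOrDefault(host, DEFAULT_RACK));
            }
            // One entry per distinct rack, counted once per split.
            for (String rack : racks) {
                table.merge(rack, 1, Integer::sum);
            }
            // The wildcard count ends up equal to the number of containers wanted.
            table.merge(ANY, 1, Integer::sum);
        }
        return table;
    }

    public static void main(String[] args) {
        // Hypothetical topology and two splits with three replicas each.
        Map<String, String> topology = new HashMap<>();
        topology.put("node1", "/rack1");
        topology.put("node2", "/rack1");
        topology.put("node3", "/rack2");
        topology.put("node4", "/rack2");

        List<List<String>> splits = Arrays.asList(
            Arrays.asList("node1", "node2", "node3"),
            Arrays.asList("node2", "node3", "node4"));
        System.out.println(buildRequests(splits, topology));
    }
}
```

In this model the AM does not pick a "best" subset of nodes itself; it lists every location that would be local for each task and leaves the choice to the scheduler.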

-Sandy

On Mon, Oct 28, 2013 at 7:09 PM, ricky lee <ri...@gmail.com> wrote:

> Well, I thought an application master can somewhat ask where the data
> exist to a namenode.... isn't it true? If it does not know where the data
> reside, does a MapReduce application master specify the resource name as
> "*" which means data locality might not be preserved at all? thx,
>
> r
>

Re: question about preserving data locality in MapReduce with Yarn

Posted by ricky lee <ri...@gmail.com>.

Well, I thought an application master could somehow ask a namenode where
the data exist.... isn't that true? If it does not know where the data reside,
does a MapReduce application master specify the resource name as "*", which
means data locality might not be preserved at all? thx,

r

Re: question about preserving data locality in MapReduce with Yarn

Posted by Michael Segel <ms...@hotmail.com>.

How do you know where the data exists when you begin?

Sent from a remote device. Please excuse any typos...

Mike Segel

> On Oct 28, 2013, at 8:57 PM, "ricky lee" <ri...@gmail.com> wrote:
> 
> Hi,
> 
> I have a question about maintaining data locality in a MapReduce job launched through Yarn. Based on the Yarn tutorial, it seems like an application master can specify resource name, memory, and cpu when requesting containers. By carefully choosing resource names, I think the data locality can be achieved. I am curious how the current MapReduce application master is doing this. Does it check all needed blocks for a job and choose subset of nodes with the most needed blocks? If someone can point me source code snippets that make this decision, it would be very much appreciated. thx.
> 
> -r
