Posted to user@hadoop.apache.org by Sai Sai <sa...@yahoo.in> on 2013/09/27 07:25:55 UTC

Re: Retrieve and compute input splits

Hi,
I have attached the anatomy of MR from the Definitive Guide.

In step 6 it says the JT/scheduler retrieves the input splits computed by the client from HDFS.

So according to that line, it is the client that computes the input splits.


1. Why does the JT/scheduler retrieve the input splits, and what does it do with them?
If it is retrieving the input splits, does this mean it goes to each block, reads each record,
and sends the records back to the JT? If so, that is a lot of data movement for large files,
which goes against data locality, so I am getting confused.

2. How does the client know how to calculate the input splits?

Any help, please.
Thanks
Sai

Re: Retrieve and compute input splits

Posted by Sai Sai <sa...@yahoo.in>.

Thanks for your suggestions and replies.
I am still confused about this:

To create the list of tasks to run, the job scheduler first retrieves the input splits computed by the JobClient from the shared filesystem (step 6).

My question:

Does the input split in the above statement refer to the physical block or to the logical input split?
I understand that the client splits the file into blocks when writing it to the cluster, and that the metadata
about the blocks is kept in the NameNode.
Since the NN is the only place that holds the block metadata, can we assume that in step 6 the scheduler goes to
the NN to retrieve this metadata, and that this is what the diagram shows as Shared File System HDFS?
If this is right, the input split is the physical block info and not the logical input split info, which could be just a single line
if we are using TextInputFormat, the default one.
Any suggestions?
Thanks
Sai

________________________________
 From: Jay Vyas <ja...@gmail.com>
To: "common-user@hadoop.apache.org" <us...@hadoop.apache.org> 
Cc: Sai Sai <sa...@yahoo.in> 
Sent: Saturday, 28 September 2013 5:35 AM
Subject: Re: Retrieve and compute input splits

Technically, the block locations are provided by the InputSplit which in the FileInputFormat case, is provided by the FileSystem Interface.

http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/InputSplit.html

The thing to realize here is that the FileSystem implementation is provided at runtime, so the InputSplit class is responsible for creating a FileSystem implementation using reflection, and then calling getFileBlockLocations on the file or set of files that the input split corresponds to.

I think your confusion here lies in the fact that the input splits create a filesystem; however, they don't know what the filesystem implementation actually is. They rely only on the abstract contract, which provides a set of block locations.

See the FileSystem abstract class for details on that.

On Fri, Sep 27, 2013 at 7:02 PM, Peyman Mohajerian <mo...@gmail.com> wrote:

For the JobClient to compute the input splits doesn't it need to contact Name Node. Only Name Node knows where the splits are, how can it compute it without that additional call?
>
>
>
>
>On Fri, Sep 27, 2013 at 1:41 AM, Sonal Goyal <so...@gmail.com> wrote:
>
>The input splits are not copied, only the information on the location of the splits is copied to the jobtracker so that it can assign tasktrackers which are local to the split.
>>
>>
>>Check the Job Initialization section at 
>>http://answers.oreilly.com/topic/459-anatomy-of-a-mapreduce-job-run-with-hadoop/
>>
>>
>>
>>To create the list of tasks to run, the job scheduler first retrieves the input splits computed by the JobClient from the shared filesystem (step 6). It then creates one map task for each split. The number of reduce tasks to create is determined by the mapred.reduce.tasks property in the JobConf, which is set by the setNumReduceTasks() method, and the scheduler simply creates this number of reduce tasks to be run. Tasks are given IDs at this point.
>>
>>
>>
>>Best Regards,
>>Sonal
>>Nube Technologies 
>>
>>
>>
>>
>>
>>
>>
>>
>>On Fri, Sep 27, 2013 at 10:55 AM, Sai Sai <sa...@yahoo.in> wrote:
>>
>>Hi
>>>I have attached the anatomy of MR from definitive guide.
>>>
>>>
>>>In step 6 it says JT/Scheduler  retrieve  input splits computed by the client from hdfs.
>>>
>>>
>>>In the above line it refers to as the client computes input splits.
>>>
>>>
>>>
>>>1. Why does the JT/Scheduler retrieve the input splits and what does it do.
>>>If it is retrieving the input split does this mean it goes to the block and reads each record 
>>>and gets the record back to JT. If so this is a lot of data movement for large files.
>>>which is not data locality. so i m getting confused.
>>>
>>>
>>>2. How does the client know how to calculate the input splits.
>>>
>>>
>>>Any help please.
>>>Thanks
>>>Sai
>>
>

-- 
Jay Vyas
http://jayunit100.blogspot.com

Re: Retrieve and compute input splits

Posted by Jay Vyas <ja...@gmail.com>.

Technically, the block locations are provided by the InputSplit, which in
the FileInputFormat case gets them from the FileSystem interface.

http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/InputSplit.html

The thing to realize here is that the FileSystem implementation is provided
at runtime, so the InputSplit class is responsible for creating a FileSystem
implementation using reflection, and then calling getFileBlockLocations on
the file or set of files that the input split corresponds to.

I think your confusion here lies in the fact that the input splits create a
filesystem; however, they don't know what the filesystem implementation
actually is. They rely only on the abstract contract, which provides a set
of block locations.

See the FileSystem abstract class for details on that.
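
As a rough illustration of the above, here is a minimal, self-contained sketch of
how a client could turn block metadata into logical splits. It is not Hadoop's
actual code: the class and method names (SplitSketch, Block, Split, hostsFor) are
made up for this example, the block list stands in for what a call such as
FileSystem.getFileBlockLocations would return, and details such as the split
"slop" factor and record boundaries are left out. The split size formula,
max(minSize, min(maxSize, blockSize)), is the one FileInputFormat uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class SplitSketch {
    /** Physical block as the NameNode would describe it: byte range plus replica hosts. */
    static final class Block {
        final long offset;
        final long length;
        final List<String> hosts;

        Block(long offset, long length, String... hosts) {
            this.offset = offset;
            this.length = length;
            this.hosts = Arrays.asList(hosts);
        }
    }

    /** Logical split: a byte range plus preferred hosts. It carries no record data. */
    static final class Split {
        final long start;
        final long length;
        final List<String> hosts;

        Split(long start, long length, List<String> hosts) {
            this.start = start;
            this.length = length;
            this.hosts = hosts;
        }
    }

    /** Same formula as FileInputFormat: clamp the block size between min and max. */
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /** Hosts of the block that contains the given byte offset (hypothetical helper). */
    static List<String> hostsFor(List<Block> blocks, long offset) {
        for (Block b : blocks) {
            if (offset >= b.offset && offset < b.offset + b.length) {
                return b.hosts;
            }
        }
        return Collections.emptyList();
    }

    /** Cut the file into byte ranges of splitSize; only metadata is produced. */
    static List<Split> getSplits(long fileLength, List<Block> blocks, long splitSize) {
        List<Split> splits = new ArrayList<>();
        long offset = 0;
        while (offset < fileLength) {
            long length = Math.min(splitSize, fileLength - offset);
            splits.add(new Split(offset, length, hostsFor(blocks, offset)));
            offset += length;
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Assumed example: a 300 MB file stored as 128 MB blocks on made-up hosts.
        List<Block> blocks = Arrays.asList(
                new Block(0, 128 * mb, "dn1", "dn2", "dn3"),
                new Block(128 * mb, 128 * mb, "dn2", "dn4", "dn5"),
                new Block(256 * mb, 44 * mb, "dn1", "dn4", "dn6"));
        long splitSize = computeSplitSize(128 * mb, 1L, Long.MAX_VALUE);
        for (Split s : getSplits(300 * mb, blocks, splitSize)) {
            System.out.println(s.start / mb + " MB +" + s.length / mb + " MB on " + s.hosts);
        }
    }
}
```

The point of the sketch is that a split is only an offset, a length and a list of
hosts. No records are read while the splits are computed, so nothing large moves
to the client or to the JT at this stage.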


On Fri, Sep 27, 2013 at 7:02 PM, Peyman Mohajerian <mo...@gmail.com> wrote:

> For the JobClient to compute the input splits doesn't it need to contact
> Name Node. Only Name Node knows where the splits are, how can it compute it
> without that additional call?
>
>
> On Fri, Sep 27, 2013 at 1:41 AM, Sonal Goyal <so...@gmail.com> wrote:
>
>> The input splits are not copied, only the information on the location of
>> the splits is copied to the jobtracker so that it can assign tasktrackers
>> which are local to the split.
>>
>> Check the Job Initialization section at
>>
>> http://answers.oreilly.com/topic/459-anatomy-of-a-mapreduce-job-run-with-hadoop/
>>
>> To create the list of tasks to run, the job scheduler first retrieves
>> the input splits computed by the JobClient from the shared filesystem
>> (step 6). It then creates one map task for each split. The number of reduce
>> tasks to create is determined by the mapred.reduce.tasks property in the
>> JobConf, which is set by the setNumReduceTasks() method, and the
>> scheduler simply creates this number of reduce tasks to be run. Tasks are
>> given IDs at this point.
>>
>> Best Regards,
>> Sonal
>> Nube Technologies <http://www.nubetech.co>
>>
>>  <http://in.linkedin.com/in/sonalgoyal>
>>
>>
>>
>>
>> On Fri, Sep 27, 2013 at 10:55 AM, Sai Sai <sa...@yahoo.in> wrote:
>>
>>> Hi
>>> I have attached the anatomy of MR from definitive guide.
>>>
>>> In step 6 it says JT/Scheduler  retrieve  input splits computed by the
>>> client from hdfs.
>>>
>>> In the above line it refers to as the client computes input splits.
>>>
>>> 1. Why does the JT/Scheduler retrieve the input splits and what does it
>>> do.
>>> If it is retrieving the input split does this mean it goes to the block
>>> and reads each record
>>> and gets the record back to JT. If so this is a lot of data movement for
>>> large files.
>>> which is not data locality. so i m getting confused.
>>>
>>> 2. How does the client know how to calculate the input splits.
>>>
>>> Any help please.
>>> Thanks
>>> Sai
>>>
>>
>>
>


-- 
Jay Vyas
http://jayunit100.blogspot.com

Re: Retrieve and compute input splits

Posted by Peyman Mohajerian <mo...@gmail.com>.

For the JobClient to compute the input splits, doesn't it need to contact the
NameNode? Only the NameNode knows where the splits are, so how can the client
compute them without that additional call?


On Fri, Sep 27, 2013 at 1:41 AM, Sonal Goyal <so...@gmail.com> wrote:

> The input splits are not copied, only the information on the location of
> the splits is copied to the jobtracker so that it can assign tasktrackers
> which are local to the split.
>
> Check the Job Initialization section at
>
> http://answers.oreilly.com/topic/459-anatomy-of-a-mapreduce-job-run-with-hadoop/
>
> To create the list of tasks to run, the job scheduler first retrieves the
> input splits computed by the JobClient from the shared filesystem (step
> 6). It then creates one map task for each split. The number of reduce tasks
> to create is determined by the mapred.reduce.tasks property in the JobConf,
> which is set by the setNumReduceTasks() method, and the scheduler simply
> creates this number of reduce tasks to be run. Tasks are given IDs at this
> point.
>
> Best Regards,
> Sonal
> Nube Technologies <http://www.nubetech.co>
>
>  <http://in.linkedin.com/in/sonalgoyal>
>
>
>
>
> On Fri, Sep 27, 2013 at 10:55 AM, Sai Sai <sa...@yahoo.in> wrote:
>
>> Hi
>> I have attached the anatomy of MR from definitive guide.
>>
>> In step 6 it says JT/Scheduler  retrieve  input splits computed by the
>> client from hdfs.
>>
>> In the above line it refers to as the client computes input splits.
>>
>> 1. Why does the JT/Scheduler retrieve the input splits and what does it
>> do.
>> If it is retrieving the input split does this mean it goes to the block
>> and reads each record
>> and gets the record back to JT. If so this is a lot of data movement for
>> large files.
>> which is not data locality. so i m getting confused.
>>
>> 2. How does the client know how to calculate the input splits.
>>
>> Any help please.
>> Thanks
>> Sai
>>
>
>
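
To make the quoted point about locality concrete, here is a small sketch of how a
scheduler could use nothing but the split location info to hand a tasktracker a
local split. This is only an illustration under simple assumptions, not the
JobTracker's real scheduling code: the names (LocalitySketch, SplitInfo,
pickSplitFor) and the host names are invented, and rack awareness is ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class LocalitySketch {
    /** What the scheduler knows about a split: an id and the hosts holding its data. */
    static final class SplitInfo {
        final int id;
        final List<String> hosts;

        SplitInfo(int id, String... hosts) {
            this.id = id;
            this.hosts = Arrays.asList(hosts);
        }
    }

    /**
     * Picks a pending split for the tracker that asked for work and removes it
     * from the pending list. A split stored on the tracker's host is preferred;
     * otherwise the first pending split is used. Returns -1 if nothing is pending.
     */
    static int pickSplitFor(String trackerHost, List<SplitInfo> pending) {
        for (Iterator<SplitInfo> it = pending.iterator(); it.hasNext();) {
            SplitInfo s = it.next();
            if (s.hosts.contains(trackerHost)) {
                it.remove();
                return s.id; // data-local assignment
            }
        }
        if (pending.isEmpty()) {
            return -1;
        }
        return pending.remove(0).id; // no local split left, fall back to any split
    }

    public static void main(String[] args) {
        List<SplitInfo> pending = new ArrayList<>(Arrays.asList(
                new SplitInfo(0, "dn1", "dn2"),
                new SplitInfo(1, "dn3", "dn4")));
        System.out.println("dn3 gets split " + pickSplitFor("dn3", pending));
        System.out.println("dn9 gets split " + pickSplitFor("dn9", pending));
    }
}
```

One map task is created per split, and the map task itself opens the file and reads
the records for its byte range once it is running on the tasktracker.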

Re: Retrieve and compute input splits

Posted by Peyman Mohajerian <mo...@gmail.com>.

For the JobClient to compute the input splits doesn't it need to contact
Name Node. Only Name Node knows where the splits are, how can it compute it
without that additional call?


On Fri, Sep 27, 2013 at 1:41 AM, Sonal Goyal <so...@gmail.com> wrote:

> The input splits are not copied, only the information on the location of
> the splits is copied to the jobtracker so that it can assign tasktrackers
> which are local to the split.
>
> Check the Job Initialization section at
>
> http://answers.oreilly.com/topic/459-anatomy-of-a-mapreduce-job-run-with-hadoop/
>
> To create the list of tasks to run, the job scheduler first retrieves the
> input splits computed by the JobClient from the shared filesystem (step
> 6). It then creates one map task for each split. The number of reduce tasks
> to create is determined by the mapred.reduce.tasks property in the JobConf,
> which is set by the setNumReduceTasks() method, and the scheduler simply
> creates this number of reduce tasks to be run. Tasks are given IDs at this
> point.
>
> Best Regards,
> Sonal
> Nube Technologies <http://www.nubetech.co>
>
>  <http://in.linkedin.com/in/sonalgoyal>
>
>
>
>
> On Fri, Sep 27, 2013 at 10:55 AM, Sai Sai <sa...@yahoo.in> wrote:
>
>> Hi
>> I have attached the anatomy of MR from definitive guide.
>>
>> In step 6 it says JT/Scheduler  retrieve  input splits computed by the
>> client from hdfs.
>>
>> In the above line it refers to as the client computes input splits.
>>
>> 1. Why does the JT/Scheduler retrieve the input splits and what does it
>> do.
>> If it is retrieving the input split does this mean it goes to the block
>> and reads each record
>> and gets the record back to JT. If so this is a lot of data movement for
>> large files.
>> which is not data locality. so i m getting confused.
>>
>> 2. How does the client know how to calculate the input splits.
>>
>> Any help please.
>> Thanks
>> Sai
>>
>
>

Re: Retrieve and compute input splits

Posted by Sonal Goyal <so...@gmail.com>.

The input data itself is not copied. Only the information on the location
of the splits is copied to the jobtracker, so that it can assign
tasktrackers that are local to each split.
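
To make the "only location information" point concrete, here is a rough
sketch in Python of what a client-side split computation might look like.
It is a simplified model, not Hadoop's actual code: the names Split,
compute_split_size and compute_splits are made up for illustration, the
block host lists are passed in by hand (in Hadoop the client obtains them
from the NameNode, e.g. through FileSystem.getFileBlockLocations), and
details such as the slop factor for the last split are left out. The split
size rule max(min_size, min(max_size, block_size)) follows the one used by
FileInputFormat.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Split:
    """Metadata describing one logical split; it holds no record data."""
    path: str
    start: int
    length: int
    hosts: List[str]


def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    # Same shape as the FileInputFormat rule: clamp the block size
    # between the configured minimum and maximum split sizes.
    return max(min_size, min(max_size, block_size))


def compute_splits(path, file_length, block_size, block_hosts,
                   min_size=1, max_size=2**63 - 1):
    # block_hosts[i] is the list of hosts holding block i (hypothetical
    # input; in a real cluster this comes from the NameNode).
    split_size = compute_split_size(block_size, min_size, max_size)
    splits = []
    offset = 0
    while offset < file_length:
        length = min(split_size, file_length - offset)
        block_index = offset // block_size
        splits.append(Split(path, offset, length, block_hosts[block_index]))
        offset += length
    return splits


# A 300-unit file stored in 128-unit blocks on three made-up datanodes.
splits = compute_splits(
    "/data/input.txt", 300, 128,
    [["dn1", "dn2"], ["dn2", "dn3"], ["dn1", "dn3"]],
)
for s in splits:
    print(s)
```

Each Split here is just a path, an offset, a length and a list of hosts,
which is why handing the splits to the jobtracker does not move any of the
file's records.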

Check the Job Initialization section at
http://answers.oreilly.com/topic/459-anatomy-of-a-mapreduce-job-run-with-hadoop/

To create the list of tasks to run, the job scheduler first retrieves the
input splits computed by the JobClient from the shared filesystem (step 6).
It then creates one map task for each split. The number of reduce tasks to
create is determined by the mapred.reduce.tasks property in the JobConf,
which is set by the setNumReduceTasks() method, and the scheduler simply
creates this number of reduce tasks to be run. Tasks are given IDs at this
point.
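
The task creation step described above could be sketched roughly as
follows. This is only an illustration in Python, not the jobtracker's real
code: create_tasks, the dictionary layout and the ID format are
assumptions made for the example. It shows one map task per split, each
remembering the hosts of its split, and a fixed number of reduce tasks
taken from the job configuration.

```python
def create_tasks(job_id, split_hosts, num_reduce_tasks):
    # One map task per input split; keep the split's hosts so a scheduler
    # could prefer a tasktracker running on one of them.
    map_tasks = [
        {"id": f"task_{job_id}_m_{i:06d}", "preferred_hosts": hosts}
        for i, hosts in enumerate(split_hosts)
    ]
    # The number of reduce tasks does not depend on the splits; it comes
    # from the job configuration (mapred.reduce.tasks in the text above).
    reduce_tasks = [
        {"id": f"task_{job_id}_r_{i:06d}"}
        for i in range(num_reduce_tasks)
    ]
    return map_tasks, reduce_tasks


# Three splits with made-up host lists, and two reduce tasks.
map_tasks, reduce_tasks = create_tasks(
    "201309270001_0001",
    [["dn1", "dn2"], ["dn2", "dn3"], ["dn1", "dn3"]],
    num_reduce_tasks=2,
)
for task in map_tasks + reduce_tasks:
    print(task["id"])
```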

Best Regards,
Sonal
Nube Technologies <http://www.nubetech.co>

<http://in.linkedin.com/in/sonalgoyal>

On Fri, Sep 27, 2013 at 10:55 AM, Sai Sai <sa...@yahoo.in> wrote:

> Hi
> I have attached the anatomy of MR from definitive guide.
>
> In step 6 it says JT/Scheduler  retrieve  input splits computed by the
> client from hdfs.
>
> In the above line it refers to as the client computes input splits.
>
> 1. Why does the JT/Scheduler retrieve the input splits and what does it do.
> If it is retrieving the input split does this mean it goes to the block
> and reads each record
> and gets the record back to JT. If so this is a lot of data movement for
> large files.
> which is not data locality. so i m getting confused.
>
> 2. How does the client know how to calculate the input splits.
>
> Any help please.
> Thanks
> Sai
>
