Posted to mapreduce-user@hadoop.apache.org by Todd <bi...@163.com> on 2014/12/17 15:16:58 UTC

How many blocks does one input split have?

Hi Hadoopers,

I have a question: how many blocks does one input split have? Is the number random, configurable, or fixed (can't be changed)?
Thanks!

Re: Re: How many blocks does one input split have?

Posted by Dieter De Witte <dr...@gmail.com>.
1 map task = 1 input split, but a Mapper class can handle multiple tasks,
albeit one at a time.
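
For reference, FileInputFormat sizes each split as max(minSplitSize, min(maxSplitSize, blockSize)), so with the default settings one split covers exactly one block and the number of splits simply follows from the file size and the block size. A minimal sketch of how those bounds can be tuned with the org.apache.hadoop.mapreduce API (the class name, job name and sizes below are made up for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "split-size-demo");

        // FileInputFormat computes splitSize = max(minSize, min(maxSize, blockSize)),
        // so with the defaults (minSize = 1, maxSize = Long.MAX_VALUE) each split
        // covers exactly one block.
        FileInputFormat.setMinInputSplitSize(job, 1L);
        FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024); // cap splits at 256 MB

        // Raising the minimum above the block size makes one split span several blocks;
        // lowering the maximum below the block size produces several splits per block.
    }
}

So the split count is not random: by default it is one block per split, and the bounds can be adjusted per job.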

2014-12-18 4:54 GMT+01:00 bit1129@163.com <bi...@163.com>:
>
> Sure, thanks Mark. That means the completed mapper task is not reused to
> work on the pending input splits.
>
> ------------------------------
> bit1129@163.com
>
>
> *From:* daemeon reiydelle <da...@gmail.com>
> *Date:* 2014-12-18 11:11
> *To:* user <us...@hadoop.apache.org>
> *CC:* mark charts <mc...@yahoo.com>
> *Subject:* Re: Re: How many blocks does one input split have?
> There would be thousands of tasks, but not all fired off at the same time.
> The number of parallel tasks is configurable but typically 1 per data node
> core.
>
>
> *.......*
>
> On Wed, Dec 17, 2014 at 6:31 PM, bit1129@163.com <bi...@163.com> wrote:
>>
>> Thanks Mark and Dieter for the reply.
>>
>> Actually, I have another question in mind. What's the relationship between
>> an input split and a mapper task? Is it a one-to-one relation, or can a mapper
>> task handle more than one input split?
>>
>> If a mapper task can only handle one input split, and there are many
>> input splits (say the original file is 1TB or larger, so there may be
>> thousands of input splits), then thousands of mapper tasks would be created.
>>
>> ------------------------------
>> bit1129@163.com
>>
>>
>> *From:* mark charts <mc...@yahoo.com>
>> *Date:* 2014-12-18 00:15
>> *To:* user@hadoop.apache.org
>> *Subject:* Re: How many blocks does one input split have?
>> Hello.
>>
>>
>> FYI.
>>
>> "The way HDFS has been set up, it breaks down very large files into large
>> blocks
>> (for example, measuring 128MB), and stores three copies of these blocks on
>> different nodes in the cluster. HDFS has no awareness of the content of
>> these
>> files.
>>
>> In YARN, when a MapReduce job is started, the Resource Manager (the
>> cluster resource management and job scheduling facility) creates an
>> Application Master daemon to look after the lifecycle of the job. (In
>> Hadoop 1,
>> the JobTracker monitored individual jobs as well as handling job scheduling
>> and cluster resource management.) One of the first things the Application
>> Master
>> does is determine which file blocks are needed for processing. The
>> Application
>> Master requests details from the NameNode on where the replicas of the
>> needed data blocks are stored. Using the location data for the file blocks,
>> the Application
>> Master makes requests to the Resource Manager to have map tasks process
>> specific
>> blocks on the slave nodes where they’re stored.
>> The key to efficient MapReduce processing is that, wherever possible,
>> data is
>> processed locally — on the slave node where it’s stored.
>> Before looking at how the data blocks are processed, you need to look more
>> closely at how Hadoop stores data. In Hadoop, files are composed of
>> individual
>> records, which are ultimately processed one-by-one by mapper tasks. For
>> example, the sample data set we use in this book contains information
>> about
>> completed flights within the United States between 1987 and 2008. We have
>> one
>> large file for each year, and within every file, each individual line
>> represents a
>> single flight. In other words, one line represents one record. Now,
>> remember
>> that the block size for the Hadoop cluster is 64MB, which means that the
>> flight
>> data files are broken into chunks of exactly 64MB.
>>
>> Do you see the problem? If each map task processes all records in a
>> specific
>> data block, what happens to those records that span block boundaries?
>> File blocks are exactly 64MB (or whatever you set the block size to be),
>> and
>> because HDFS has no conception of what’s inside the file blocks, it can’t
>> gauge
>> when a record might spill over into another block. To solve this problem,
>> Hadoop uses a logical representation of the data stored in file blocks,
>> known as
>> input splits. When a MapReduce job client calculates the input splits, it
>> figures
>> out where the first whole record in a block begins and where the last
>> record
>> in the block ends. In cases where the last record in a block is
>> incomplete, the
>> input split includes location information for the next block and the byte
>> offset
>> of the data needed to complete the record.
>> You can configure the Application Master daemon (or JobTracker, if you’re
>> in
>> Hadoop 1) to calculate the input splits instead of the job client, which
>> would
>> be faster for jobs processing a large number of data blocks.
>> MapReduce data processing is driven by this concept of input splits. The
>> number of input splits that are calculated for a specific application
>> determines
>> the number of mapper tasks. Each of these mapper tasks is assigned, where
>> possible, to a slave node where the input split is stored. The Resource
>> Manager
>> (or JobTracker, if you’re in Hadoop 1) does its best to ensure that input
>> splits
>> are processed locally."                                          *sic*
>>
>> Courtesy of Dirk deRoos, Paul C. Zikopoulos, Bruce Brown,
>> Rafael Coss, and Roman B. Melnyk
>>
>>
>>
>> Mark Charts
>>
>>
>>
>>
>>   On Wednesday, December 17, 2014 10:32 AM, Dieter De Witte <
>> drdwitte@gmail.com> wrote:
>>
>>
>> Hi,
>>
>> Check this post:
>> http://stackoverflow.com/questions/17727468/hadoop-input-split-size-vs-block-size
>>
>> Regards, D
>>
>>
>> 2014-12-17 15:16 GMT+01:00 Todd <bi...@163.com>:
>>
>> Hi Hadoopers,
>>
>> I have a question: how many blocks does one input split have? Is the number
>> random, configurable, or fixed (can't be changed)?
>> Thanks!
>>
>>
>>
>>

Re: Re: How many blocks does one input split have?

Posted by "bit1129@163.com" <bi...@163.com>.
Sure, thanks Mark. That means the completed mapper task is not reused to work on the pending input splits.
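
If the concern is ending up with thousands of small map tasks, the usual fix is not to reuse finished tasks but to make each split cover more data. One hedged sketch (assuming Hadoop 2.x and the org.apache.hadoop.mapreduce API; the class name and input path are placeholders) uses CombineTextInputFormat, which packs records from several blocks, or even several files, into a single split so that one map task handles them:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class FewerMapTasksDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "fewer-map-tasks");

        // CombineTextInputFormat builds splits that pool data from several blocks.
        job.setInputFormatClass(CombineTextInputFormat.class);

        // This sets mapreduce.input.fileinputformat.split.maxsize, which (as far as
        // I know) is the upper bound CombineFileInputFormat applies to each pooled split.
        FileInputFormat.setMaxInputSplitSize(job, 512L * 1024 * 1024); // about 512 MB per split

        FileInputFormat.addInputPath(job, new Path("/data/flights")); // placeholder path
    }
}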



bit1129@163.com
 
From: daemeon reiydelle
Date: 2014-12-18 11:11
To: user
CC: mark charts
Subject: Re: Re: How many blocks does one input split have?
There would be thousands of tasks, but not all fired off at the same time. The number of parallel tasks is configurable but typically 1 per data node core.


.......

On Wed, Dec 17, 2014 at 6:31 PM, bit1129@163.com <bi...@163.com> wrote:
Thanks Mark and Dieter for the reply.

Actually, I have another question in mind. What's the relationship between an input split and a mapper task? Is it a one-to-one relation, or can a mapper task handle more than one input split?

If a mapper task can only handle one input split, and there are many input splits (say the original file is 1TB or larger, so there may be thousands of input splits), then thousands of mapper tasks would be created.



bit1129@163.com
 
From: mark charts
Date: 2014-12-18 00:15
To: user@hadoop.apache.org
Subject: Re: How many blocks does one input split have?
Hello.


FYI.

"The way HDFS has been set up, it breaks down very large files into large blocks
(for example, measuring 128MB), and stores three copies of these blocks on
different nodes in the cluster. HDFS has no awareness of the content of these
files.
 
In YARN, when a MapReduce job is started, the Resource Manager (the
cluster resource management and job scheduling facility) creates an
Application Master daemon to look after the lifecycle of the job. (In Hadoop 1,
the JobTracker monitored individual jobs as well as handling job scheduling
and cluster resource management.) One of the first things the Application Master
does is determine which file blocks are needed for processing. The Application 
Master requests details from the NameNode on where the replicas of the needed data blocks are stored. Using the location data for the file blocks, the Application 
Master makes requests to the Resource Manager to have map tasks process specific 
blocks on the slave nodes where they’re stored.
The key to efficient MapReduce processing is that, wherever possible, data is
processed locally ― on the slave node where it’s stored.
Before looking at how the data blocks are processed, you need to look more
closely at how Hadoop stores data. In Hadoop, files are composed of individual
records, which are ultimately processed one-by-one by mapper tasks. For
example, the sample data set we use in this book contains information about
completed flights within the United States between 1987 and 2008. We have one
large file for each year, and within every file, each individual line represents a
single flight. In other words, one line represents one record. Now, remember
that the block size for the Hadoop cluster is 64MB, which means that the flight
data files are broken into chunks of exactly 64MB.

Do you see the problem? If each map task processes all records in a specific
data block, what happens to those records that span block boundaries?
File blocks are exactly 64MB (or whatever you set the block size to be), and
because HDFS has no conception of what’s inside the file blocks, it can’t gauge
when a record might spill over into another block. To solve this problem,
Hadoop uses a logical representation of the data stored in file blocks, known as
input splits. When a MapReduce job client calculates the input splits, it figures
out where the first whole record in a block begins and where the last record
in the block ends. In cases where the last record in a block is incomplete, the
input split includes location information for the next block and the byte offset
of the data needed to complete the record. 
You can configure the Application Master daemon (or JobTracker, if you’re in
Hadoop 1) to calculate the input splits instead of the job client, which would
be faster for jobs processing a large number of data blocks.
MapReduce data processing is driven by this concept of input splits. The
number of input splits that are calculated for a specific application determines
the number of mapper tasks. Each of these mapper tasks is assigned, where
possible, to a slave node where the input split is stored. The Resource Manager
(or JobTracker, if you’re in Hadoop 1) does its best to ensure that input splits
are processed locally."                                          sic

Courtesy of Dirk deRoos, Paul C. Zikopoulos, Bruce Brown,
Rafael Coss, and Roman B. Melnyk



Mark Charts
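
To see that "logical representation" concretely, you can ask the input format for the splits it would hand to the map tasks: each FileSplit is just a file, a byte offset, a length (usually one block) and the hosts that store the underlying block. A small sketch, assuming the org.apache.hadoop.mapreduce API; the class name is made up and args[0] is expected to be an existing input path:

import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class ListSplits {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "list-splits");
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // The job client normally does this itself; each split is a logical byte
        // range plus the locations of the block(s) backing it.
        List<InputSplit> splits = new TextInputFormat().getSplits(job);
        for (InputSplit s : splits) {
            FileSplit fs = (FileSplit) s;
            System.out.printf("%s start=%d length=%d hosts=%s%n",
                    fs.getPath(), fs.getStart(), fs.getLength(),
                    String.join(",", fs.getLocations()));
        }
    }
}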




On Wednesday, December 17, 2014 10:32 AM, Dieter De Witte <dr...@gmail.com> wrote:


Hi,

Check this post: http://stackoverflow.com/questions/17727468/hadoop-input-split-size-vs-block-size

Regards, D


2014-12-17 15:16 GMT+01:00 Todd <bi...@163.com>:
Hi Hadoopers,

I have a question: how many blocks does one input split have? Is the number random, configurable, or fixed (can't be changed)?
Thanks!



Re: Re: How many blocks does one input split have?

Posted by daemeon reiydelle <da...@gmail.com>.
There would be thousands of tasks, but not all fired off at the same time.
The number of parallel tasks is configurable but typically 1 per data node
core.
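
Roughly, that per-node parallelism comes from what each NodeManager advertises versus what each map task requests: concurrent map tasks per node is about min(node vcores / task vcores, node memory / task memory). A hedged sketch of the knobs involved; the numbers are examples only, and the yarn.nodemanager.* settings normally live in yarn-site.xml on the cluster rather than in job code:

import org.apache.hadoop.conf.Configuration;

public class ParallelismKnobs {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // What each NodeManager offers (cluster-side, usually yarn-site.xml):
        conf.setInt("yarn.nodemanager.resource.cpu-vcores", 16);
        conf.setInt("yarn.nodemanager.resource.memory-mb", 49152);

        // What each map task asks for (job-side):
        conf.setInt("mapreduce.map.cpu.vcores", 1);
        conf.setInt("mapreduce.map.memory.mb", 3072);

        // With these example numbers a node runs up to min(16/1, 49152/3072) = 16
        // map containers at once, i.e. roughly one per core, as described above.
    }
}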


*.......*

On Wed, Dec 17, 2014 at 6:31 PM, bit1129@163.com <bi...@163.com> wrote:
>
> Thanks Mark and Dieter for the reply.
>
> Actually, I have another question in mind. What's the relationship between
> an input split and a mapper task? Is it a one-to-one relation, or can a mapper
> task handle more than one input split?
>
> If a mapper task can only handle one input split, and there are many
> input splits (say the original file is 1TB or larger, so there may be
> thousands of input splits), then thousands of mapper tasks would be created.
>
> ------------------------------
> bit1129@163.com
>
>
> *From:* mark charts <mc...@yahoo.com>
> *Date:* 2014-12-18 00:15
> *To:* user@hadoop.apache.org
> *Subject:* Re: How many blocks does one input split have?
> Hello.
>
>
> FYI.
>
> "The way HDFS has been set up, it breaks down very large files into large
> blocks
> (for example, measuring 128MB), and stores three copies of these blocks on
> different nodes in the cluster. HDFS has no awareness of the content of
> these
> files.
>
> In YARN, when a MapReduce job is started, the Resource Manager (the
> cluster resource management and job scheduling facility) creates an
> Application Master daemon to look after the lifecycle of the job. (In
> Hadoop 1,
> the JobTracker monitored individual jobs as well as handling job scheduling
> and cluster resource management.) One of the first things the Application
> Master
> does is determine which file blocks are needed for processing. The
> Application
> Master requests details from the NameNode on where the replicas of the
> needed data blocks are stored. Using the location data for the file blocks,
> the Application
> Master makes requests to the Resource Manager to have map tasks process
> specific
> blocks on the slave nodes where they’re stored.
> The key to efficient MapReduce processing is that, wherever possible, data
> is
> processed locally — on the slave node where it’s stored.
> Before looking at how the data blocks are processed, you need to look more
> closely at how Hadoop stores data. In Hadoop, files are composed of
> individual
> records, which are ultimately processed one-by-one by mapper tasks. For
> example, the sample data set we use in this book contains information about
> completed flights within the United States between 1987 and 2008. We have
> one
> large file for each year, and within every file, each individual line
> represents a
> single flight. In other words, one line represents one record. Now,
> remember
> that the block size for the Hadoop cluster is 64MB, which means that the
> flight
> data files are broken into chunks of exactly 64MB.
>
> Do you see the problem? If each map task processes all records in a
> specific
> data block, what happens to those records that span block boundaries?
> File blocks are exactly 64MB (or whatever you set the block size to be),
> and
> because HDFS has no conception of what’s inside the file blocks, it can’t
> gauge
> when a record might spill over into another block. To solve this problem,
> Hadoop uses a logical representation of the data stored in file blocks,
> known as
> input splits. When a MapReduce job client calculates the input splits, it
> figures
> out where the first whole record in a block begins and where the last
> record
> in the block ends. In cases where the last record in a block is
> incomplete, the
> input split includes location information for the next block and the byte
> offset
> of the data needed to complete the record.
> You can configure the Application Master daemon (or JobTracker, if you’re
> in
> Hadoop 1) to calculate the input splits instead of the job client, which
> would
> be faster for jobs processing a large number of data blocks.
> MapReduce data processing is driven by this concept of input splits. The
> number of input splits that are calculated for a specific application
> determines
> the number of mapper tasks. Each of these mapper tasks is assigned, where
> possible, to a slave node where the input split is stored. The Resource
> Manager
> (or JobTracker, if you’re in Hadoop 1) does its best to ensure that input
> splits
> are processed locally."                                          *sic*
>
> Courtesy of Dirk deRoos, Paul C. Zikopoulos, Bruce Brown,
> Rafael Coss, and Roman B. Melnyk
>
>
>
> Mark Charts
>
>
>
>
>   On Wednesday, December 17, 2014 10:32 AM, Dieter De Witte <
> drdwitte@gmail.com> wrote:
>
>
> Hi,
>
> Check this post:
> http://stackoverflow.com/questions/17727468/hadoop-input-split-size-vs-block-size
>
> Regards, D
>
>
> 2014-12-17 15:16 GMT+01:00 Todd <bi...@163.com>:
>
> Hi Hadoopers,
>
> I have a question: how many blocks does one input split have? Is the number
> random, configurable, or fixed (can't be changed)?
> Thanks!
>
>
>
>

Re: Re: How many blocks does one input split have?

Posted by daemeon reiydelle <da...@gmail.com>.
There would be thousands of tasks, but not all fired off at the same time.
The number of parallel tasks is configurable but typically 1 per data node
core.


*.......*

On Wed, Dec 17, 2014 at 6:31 PM, bit1129@163.com <bi...@163.com> wrote:
>
> Thanks Mark and Dieter for the reply.
>
> Actually, I got another question in mind. What's the relationship between
> input split and mapper task?Is it one one relation or a mapper task can
> handle more than one input splits?
>
> If mapper task can only handle one input split, then if there are many
> input splits(say, the the original file is 1TB or larger,then there may be
> thousands of input splits), thousands of mapper tasks would be created.
>
> ------------------------------
> bit1129@163.com
>
>
> *From:* mark charts <mc...@yahoo.com>
> *Date:* 2014-12-18 00:15
> *To:* user@hadoop.apache.org
> *Subject:* Re: How many blocks does one input split have?
> Hello.
>
>
> FYI.
>
> "The way HDFS has been set up, it breaks down very large files into large
> blocks
> (for example, measuring 128MB), and stores three copies of these blocks on
> different nodes in the cluster. HDFS has no awareness of the content of
> these
> files.
>
> In YARN, when a MapReduce job is started, the Resource Manager (the
> cluster resource management and job scheduling facility) creates an
> Application Master daemon to look after the lifecycle of the job. (In
> Hadoop 1,
> the JobTracker monitored individual jobs as well as handling job
> ­scheduling
> and cluster resource management. One of the first things the Application
> Master
> does is determine which file blocks are needed for processing. The
> Application
> Master requests details from the NameNode on where the replicas of the
> needed data blocks are stored. Using the location data for the file blocks,
> the Application
> Master makes requests to the Resource Manager to have map tasks process
> specific
> blocks on the slave nodes where they’re stored.
> The key to efficient MapReduce processing is that, wherever possible, data
> is
> processed locally — on the slave node where it’s stored.
> Before looking at how the data blocks are processed, you need to look more
> closely at how Hadoop stores data. In Hadoop, files are composed of
> individual
> records, which are ultimately processed one-by-one by mapper tasks. For
> example, the sample data set we use in this book contains information about
> completed flights within the United States between 1987 and 2008. We have
> one
> large file for each year, and within every file, each individual line
> represents a
> single flight. In other words, one line represents one record. Now,
> remember
> that the block size for the Hadoop cluster is 64MB, which means that the
> light
> data files are broken into chunks of exactly 64MB.
>
> Do you see the problem? If each map task processes all records in a
> specific
> data block, what happens to those records that span block boundaries?
> File blocks are exactly 64MB (or whatever you set the block size to be),
> and
> because HDFS has no conception of what’s inside the file blocks, it can’t
> gauge
> when a record might spill over into another block. To solve this problem,
> Hadoop uses a logical representation of the data stored in file blocks,
> known as
> input splits. When a MapReduce job client calculates the input splits, it
> figures
> out where the first whole record in a block begins and where the last
> record
> in the block ends. In cases where the last record in a block is
> incomplete, the
> input split includes location information for the next block and the byte
> offset
> of the data needed to complete the record.
> You can configure the Application Master daemon (or JobTracker, if you’re
> in
> Hadoop 1) to calculate the input splits instead of the job client, which
> would
> be faster for jobs processing a large number of data blocks.
> MapReduce data processing is driven by this concept of input splits. The
> number of input splits that are calculated for a specific application
> determines
> the number of mapper tasks. Each of these mapper tasks is assigned, where
> possible, to a slave node where the input split is stored. The Resource
> Manager
> (or JobTracker, if you’re in Hadoop 1) does its best to ensure that input
> splits
> are processed locally."                                          *sic*
>
> Courtesy of Dirk deRoos, Paul C. Zikopoulos, Bruce Brown,
> Rafael Coss, and Roman B. Melnyk
>
>
>
> Mark Charts
>
>
>
>
>   On Wednesday, December 17, 2014 10:32 AM, Dieter De Witte <
> drdwitte@gmail.com> wrote:
>
>
> Hi,
>
> Check this post:
> http://stackoverflow.com/questions/17727468/hadoop-input-split-size-vs-block-size
>
> Regards, D
>
>
> 2014-12-17 15:16 GMT+01:00 Todd <bi...@163.com>:
>
> Hi Hadoopers,
>
> I got a question about how many blocks does one input split have? It is
> random or the number can be configured or fixed(can't be changed)?
> Thanks!
>
>
>
>

Re: Re: How many blocks does one input split have?

Posted by daemeon reiydelle <da...@gmail.com>.
There would be thousands of tasks, but not all fired off at the same time.
The number of parallel tasks is configurable but typically 1 per data node
core.


*.......*

On Wed, Dec 17, 2014 at 6:31 PM, bit1129@163.com <bi...@163.com> wrote:
>
> Thanks Mark and Dieter for the reply.
>
> Actually, I got another question in mind. What's the relationship between
> input split and mapper task?Is it one one relation or a mapper task can
> handle more than one input splits?
>
> If mapper task can only handle one input split, then if there are many
> input splits(say, the the original file is 1TB or larger,then there may be
> thousands of input splits), thousands of mapper tasks would be created.
>
> ------------------------------
> bit1129@163.com
>
>
> *From:* mark charts <mc...@yahoo.com>
> *Date:* 2014-12-18 00:15
> *To:* user@hadoop.apache.org
> *Subject:* Re: How many blocks does one input split have?
> Hello.
>
>
> FYI.
>
> "The way HDFS has been set up, it breaks down very large files into large
> blocks
> (for example, measuring 128MB), and stores three copies of these blocks on
> different nodes in the cluster. HDFS has no awareness of the content of
> these
> files.
>
> In YARN, when a MapReduce job is started, the Resource Manager (the
> cluster resource management and job scheduling facility) creates an
> Application Master daemon to look after the lifecycle of the job. (In
> Hadoop 1,
> the JobTracker monitored individual jobs as well as handling job
> ­scheduling
> and cluster resource management. One of the first things the Application
> Master
> does is determine which file blocks are needed for processing. The
> Application
> Master requests details from the NameNode on where the replicas of the
> needed data blocks are stored. Using the location data for the file blocks,
> the Application
> Master makes requests to the Resource Manager to have map tasks process
> specific
> blocks on the slave nodes where they’re stored.
> The key to efficient MapReduce processing is that, wherever possible, data
> is
> processed locally — on the slave node where it’s stored.
> Before looking at how the data blocks are processed, you need to look more
> closely at how Hadoop stores data. In Hadoop, files are composed of
> individual
> records, which are ultimately processed one-by-one by mapper tasks. For
> example, the sample data set we use in this book contains information about
> completed flights within the United States between 1987 and 2008. We have
> one
> large file for each year, and within every file, each individual line
> represents a
> single flight. In other words, one line represents one record. Now,
> remember
> that the block size for the Hadoop cluster is 64MB, which means that the
> light
> data files are broken into chunks of exactly 64MB.
>
> Do you see the problem? If each map task processes all records in a
> specific
> data block, what happens to those records that span block boundaries?
> File blocks are exactly 64MB (or whatever you set the block size to be),
> and
> because HDFS has no conception of what’s inside the file blocks, it can’t
> gauge
> when a record might spill over into another block. To solve this problem,
> Hadoop uses a logical representation of the data stored in file blocks,
> known as
> input splits. When a MapReduce job client calculates the input splits, it
> figures
> out where the first whole record in a block begins and where the last
> record
> in the block ends. In cases where the last record in a block is
> incomplete, the
> input split includes location information for the next block and the byte
> offset
> of the data needed to complete the record.
> You can configure the Application Master daemon (or JobTracker, if you’re
> in
> Hadoop 1) to calculate the input splits instead of the job client, which
> would
> be faster for jobs processing a large number of data blocks.
> MapReduce data processing is driven by this concept of input splits. The
> number of input splits that are calculated for a specific application
> determines
> the number of mapper tasks. Each of these mapper tasks is assigned, where
> possible, to a slave node where the input split is stored. The Resource
> Manager
> (or JobTracker, if you’re in Hadoop 1) does its best to ensure that input
> splits
> are processed locally."                                          *sic*
>
> Courtesy of Dirk deRoos, Paul C. Zikopoulos, Bruce Brown,
> Rafael Coss, and Roman B. Melnyk
>
>
>
> Mark Charts
>
>
>
>
>   On Wednesday, December 17, 2014 10:32 AM, Dieter De Witte <
> drdwitte@gmail.com> wrote:
>
>
> Hi,
>
> Check this post:
> http://stackoverflow.com/questions/17727468/hadoop-input-split-size-vs-block-size
>
> Regards, D
>
>
> 2014-12-17 15:16 GMT+01:00 Todd <bi...@163.com>:
>
> Hi Hadoopers,
>
> I got a question about how many blocks one input split has. Is it
> random, or can the number be configured, or is it fixed (can't be changed)?
> Thanks!
>
>
>
>

Re: Re: How many blocks does one input split have?

Posted by "bit1129@163.com" <bi...@163.com>.
Thanks Mark and Dieter for the reply.

Actually, I got another question in mind. What's the relationship between an input split and a mapper task? Is it a one-to-one relation, or can a mapper task handle more than one input split?

If a mapper task can only handle one input split, then when there are many input splits (say, the original file is 1TB or larger, so there may be thousands of input splits), thousands of mapper tasks would be created.
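
One way to keep that mapper count down is to pack several blocks (or many small files) into a single split with CombineTextInputFormat, so one map task handles more than one block. A minimal, map-only driver sketch, assuming the new org.apache.hadoop.mapreduce API; the 512MB cap and the args[0]/args[1] paths are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombinedSplitsDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combined-splits");
        job.setJarByClass(CombinedSplitsDriver.class);

        // Pack neighbouring blocks/files into splits of up to ~512MB each, so one
        // map task covers several HDFS blocks instead of roughly one.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 512L * 1024 * 1024);

        // Map-only identity job: the default Mapper passes records through,
        // which is enough to observe how many map tasks get launched.
        job.setNumReduceTasks(0);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}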



bit1129@163.com
 
From: mark charts
Date: 2014-12-18 00:15
To: user@hadoop.apache.org
Subject: Re: How many blocks does one input split have?
Hello.


FYI.

"The way HDFS has been set up, it breaks down very large files into large blocks
(for example, measuring 128MB), and stores three copies of these blocks on
different nodes in the cluster. HDFS has no awareness of the content of these
files.
 
In YARN, when a MapReduce job is started, the Resource Manager (the
cluster resource management and job scheduling facility) creates an
Application Master daemon to look after the lifecycle of the job. (In Hadoop 1,
the JobTracker monitored individual jobs as well as handling job scheduling
and cluster resource management.) One of the first things the Application Master
does is determine which file blocks are needed for processing. The Application 
Master requests details from the NameNode on where the replicas of the needed data blocks are stored. Using the location data for the file blocks, the Application 
Master makes requests to the Resource Manager to have map tasks process specific 
blocks on the slave nodes where they’re stored.
The key to efficient MapReduce processing is that, wherever possible, data is
processed locally — on the slave node where it’s stored.
Before looking at how the data blocks are processed, you need to look more
closely at how Hadoop stores data. In Hadoop, files are composed of individual
records, which are ultimately processed one-by-one by mapper tasks. For
example, the sample data set we use in this book contains information about
completed flights within the United States between 1987 and 2008. We have one
large file for each year, and within every file, each individual line represents a
single flight. In other words, one line represents one record. Now, remember
that the block size for the Hadoop cluster is 64MB, which means that the flight
data files are broken into chunks of exactly 64MB.

Do you see the problem? If each map task processes all records in a specific
data block, what happens to those records that span block boundaries?
File blocks are exactly 64MB (or whatever you set the block size to be), and
because HDFS has no conception of what’s inside the file blocks, it can’t gauge
when a record might spill over into another block. To solve this problem,
Hadoop uses a logical representation of the data stored in file blocks, known as
input splits. When a MapReduce job client calculates the input splits, it figures
out where the first whole record in a block begins and where the last record
in the block ends. In cases where the last record in a block is incomplete, the
input split includes location information for the next block and the byte offset
of the data needed to complete the record. 
You can configure the Application Master daemon (or JobTracker, if you’re in
Hadoop 1) to calculate the input splits instead of the job client, which would
be faster for jobs processing a large number of data blocks.
MapReduce data processing is driven by this concept of input splits. The
number of input splits that are calculated for a specific application determines
the number of mapper tasks. Each of these mapper tasks is assigned, where
possible, to a slave node where the input split is stored. The Resource Manager
(or JobTracker, if you’re in Hadoop 1) does its best to ensure that input splits
are processed locally."                                          sic

Courtesy of Dirk deRoos, Paul C. Zikopoulos, Bruce Brown,
Rafael Coss, and Roman B. Melnyk



Mark Charts




On Wednesday, December 17, 2014 10:32 AM, Dieter De Witte <dr...@gmail.com> wrote:


Hi,

Check this post: http://stackoverflow.com/questions/17727468/hadoop-input-split-size-vs-block-size

Regards, D


2014-12-17 15:16 GMT+01:00 Todd <bi...@163.com>:
Hi Hadoopers,

I got a question about how many blocks does one input split have? It is random or the number can be configured or fixed(can't be changed)?
Thanks!



Re: How many blocks does one input split have?

Posted by Dieter De Witte <dr...@gmail.com>.
Well formulated answer, thanks for sharing!
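
As a small, concrete illustration of the one-split-per-map-task point (the excerpt quoted below spells out why): each map task is handed exactly one InputSplit, which it can inspect. A hedged sketch, assuming the new mapreduce API and a plain FileInputFormat such as TextInputFormat:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Each map task is constructed around exactly one InputSplit; logging it shows
// which file region (path, start offset, length) this particular task covers.
public class SplitLoggingMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Assumes a FileInputFormat such as TextInputFormat, which hands out FileSplit instances.
        FileSplit split = (FileSplit) context.getInputSplit();
        System.err.println("Map task over " + split.getPath()
                + " offset " + split.getStart()
                + " length " + split.getLength());
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Pass the record through unchanged; the point here is only the setup() log.
        context.write(value, NullWritable.get());
    }
}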

2014-12-17 17:15 GMT+01:00 mark charts <mc...@yahoo.com>:
>
> Hello.
>
>
> FYI.
>
> "The way HDFS has been set up, it breaks down very large files into large
> blocks
> (for example, measuring 128MB), and stores three copies of these blocks on
> different nodes in the cluster. HDFS has no awareness of the content of
> these
> files.
>
> In YARN, when a MapReduce job is started, the Resource Manager (the
> cluster resource management and job scheduling facility) creates an
> Application Master daemon to look after the lifecycle of the job. (In
> Hadoop 1, the JobTracker monitored individual jobs as well as handling job
> scheduling and cluster resource management.) One of the first things the
> Application Master
> does is determine which file blocks are needed for processing. The
> Application
> Master requests details from the NameNode on where the replicas of the
> needed data blocks are stored. Using the location data for the file blocks,
> the Application
> Master makes requests to the Resource Manager to have map tasks process
> specific
> blocks on the slave nodes where they’re stored.
> The key to efficient MapReduce processing is that, wherever possible, data
> is
> processed locally — on the slave node where it’s stored.
> Before looking at how the data blocks are processed, you need to look more
> closely at how Hadoop stores data. In Hadoop, files are composed of
> individual
> records, which are ultimately processed one-by-one by mapper tasks. For
> example, the sample data set we use in this book contains information about
> completed flights within the United States between 1987 and 2008. We have
> one
> large file for each year, and within every file, each individual line
> represents a
> single flight. In other words, one line represents one record. Now,
> remember
> that the block size for the Hadoop cluster is 64MB, which means that the
> flight
> data files are broken into chunks of exactly 64MB.
>
> Do you see the problem? If each map task processes all records in a
> specific
> data block, what happens to those records that span block boundaries?
> File blocks are exactly 64MB (or whatever you set the block size to be),
> and
> because HDFS has no conception of what’s inside the file blocks, it can’t
> gauge
> when a record might spill over into another block. To solve this problem,
> Hadoop uses a logical representation of the data stored in file blocks,
> known as
> input splits. When a MapReduce job client calculates the input splits, it
> figures
> out where the first whole record in a block begins and where the last
> record
> in the block ends. In cases where the last record in a block is
> incomplete, the
> input split includes location information for the next block and the byte
> offset
> of the data needed to complete the record.
> You can configure the Application Master daemon (or JobTracker, if you’re
> in
> Hadoop 1) to calculate the input splits instead of the job client, which
> would
> be faster for jobs processing a large number of data blocks.
> MapReduce data processing is driven by this concept of input splits. The
> number of input splits that are calculated for a specific application
> determines
> the number of mapper tasks. Each of these mapper tasks is assigned, where
> possible, to a slave node where the input split is stored. The Resource
> Manager
> (or JobTracker, if you’re in Hadoop 1) does its best to ensure that input
> splits
> are processed locally."                                          *sic*
>
> Courtesy of Dirk deRoos, Paul C. Zikopoulos, Bruce Brown,
> Rafael Coss, and Roman B. Melnyk
>
>
>
> Mark Charts
>
>
>
>
>   On Wednesday, December 17, 2014 10:32 AM, Dieter De Witte <
> drdwitte@gmail.com> wrote:
>
>
> Hi,
>
> Check this post:
> http://stackoverflow.com/questions/17727468/hadoop-input-split-size-vs-block-size
>
> Regards, D
>
>
> 2014-12-17 15:16 GMT+01:00 Todd <bi...@163.com>:
>
> Hi Hadoopers,
>
> I got a question about how many blocks one input split has. Is it
> random, or can the number be configured, or is it fixed (can't be changed)?
> Thanks!
>
>
>
>

Re: How many blocks does one input split have?

Posted by mark charts <mc...@yahoo.com>.
Hello.

FYI.
"The way HDFS has been set up, it breaks down very large files into large blocks(for example, measuring 128MB), and stores three copies of these blocks ondifferent nodes in the cluster. HDFS has no awareness of the content of thesefiles. In YARN, when a MapReduce job is started, the Resource Manager (thecluster resource management and job scheduling facility) creates anApplication Master daemon to look after the lifecycle of the job. (In Hadoop 1,the JobTracker monitored individual jobs as well as handling job ­schedulingand cluster resource management. One of the first things the Application Masterdoes is determine which file blocks are needed for processing. The Application Master requests details from the NameNode on where the replicas of the needed data blocks are stored. Using the location data for the file blocks, the Application Master makes requests to the Resource Manager to have map tasks process specific blocks on the slave nodes where they’re stored. The key to efficient MapReduce processing is that, wherever possible, data isprocessed locally — on the slave node where it’s stored.Before looking at how the data blocks are processed, you need to look moreclosely at how Hadoop stores data. In Hadoop, files are composed of individualrecords, which are ultimately processed one-by-one by mapper tasks. Forexample, the sample data set we use in this book contains information aboutcompleted flights within the United States between 1987 and 2008. We have onelarge file for each year, and within every file, each individual line represents asingle flight. In other words, one line represents one record. Now, rememberthat the block size for the Hadoop cluster is 64MB, which means that the lightdata files are broken into chunks of exactly 64MB.
Do you see the problem? If each map task processes all records in a specificdata block, what happens to those records that span block boundaries?File blocks are exactly 64MB (or whatever you set the block size to be), andbecause HDFS has no conception of what’s inside the file blocks, it can’t gaugewhen a record might spill over into another block. To solve this problem,Hadoop uses a logical representation of the data stored in file blocks, known asinput splits. When a MapReduce job client calculates the input splits, it figuresout where the first whole record in a block begins and where the last recordin the block ends. In cases where the last record in a block is incomplete, theinput split includes location information for the next block and the byte offsetof the data needed to complete the record.  You can configure the Application Master daemon (or JobTracker, if you’re inHadoop 1) to calculate the input splits instead of the job client, which wouldbe faster for jobs processing a large number of data blocks.MapReduce data processing is driven by this concept of input splits. Thenumber of input splits that are calculated for a specific application determinesthe number of mapper tasks. Each of these mapper tasks is assigned, wherepossible, to a slave node where the input split is stored. The Resource Manager(or JobTracker, if you’re in Hadoop 1) does its best to ensure that input splitsare processed locally."                                          sic
Courtesy of Dirk deRoos, Paul C. Zikopoulos, Bruce Brown,Rafael Coss, and Roman B. Melnyk


Mark Charts
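
To connect the excerpt above back to the original question: for FileInputFormat-based inputs the split size is, roughly, max(minSize, min(maxSize, blockSize)), so by default a split covers about one block, but the minimum and maximum are configurable. A hedged driver sketch, assuming the new mapreduce API; the sizes and args[0]/args[1] paths are illustrative only:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SplitSizeDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-size-demo");
        job.setJarByClass(SplitSizeDriver.class);
        job.setInputFormatClass(TextInputFormat.class);

        // Raising the minimum split size above the block size makes each split
        // span more than one block, e.g. 256MB splits over 128MB blocks.
        FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);
        // Conversely, capping the maximum below the block size yields splits
        // smaller than a block (and therefore more map tasks):
        // FileInputFormat.setMaxInputSplitSize(job, 32L * 1024 * 1024);

        // Map-only identity job; the interesting part is only how the input gets split.
        job.setNumReduceTasks(0);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}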

 

     On Wednesday, December 17, 2014 10:32 AM, Dieter De Witte <dr...@gmail.com> wrote:
   

 Hi,

Check this post: http://stackoverflow.com/questions/17727468/hadoop-input-split-size-vs-block-size

Regards, D


2014-12-17 15:16 GMT+01:00 Todd <bi...@163.com>:
Hi Hadoopers,

I got a question about how many blocks one input split has. Is it random, or can the number be configured, or is it fixed (can't be changed)?
Thanks!



   

Re: How many blocks does one input split have?

Posted by Dieter De Witte <dr...@gmail.com>.
Hi,

Check this post:
http://stackoverflow.com/questions/17727468/hadoop-input-split-size-vs-block-size

Regards, D
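
To see the split/block relationship concretely, you can also ask an InputFormat how it would split a given input before running any job. A minimal sketch, assuming the new mapreduce API and that args[0] is a placeholder HDFS (or local) path:

import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class CountSplits {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // picks up *-site.xml files on the classpath
        Job job = Job.getInstance(conf);
        FileInputFormat.addInputPath(job, new Path(args[0]));  // placeholder input path

        // One map task would be created per element of this list.
        List<InputSplit> splits = new TextInputFormat().getSplits(job);
        System.out.println("Input splits (= map tasks): " + splits.size());
        for (InputSplit split : splits) {
            System.out.println("  " + split);  // FileSplit prints its path, start offset and length
        }
    }
}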


2014-12-17 15:16 GMT+01:00 Todd <bi...@163.com>:
>
> Hi Hadoopers,
>
> I got a question about how many blocks one input split has. Is it
> random, or can the number be configured, or is it fixed (can't be changed)?
> Thanks!
>
