Posted to mapreduce-user@hadoop.apache.org by Keith Wiley <kw...@keithwiley.com> on 2014/01/21 20:28:32 UTC

Is perfect control over mapper num AND split distribution possible?

I am running a job that takes no input from the mapper-input key/value interface.  Each job reads the same small file from the distributed cache and processes it independently (to generate Monte Carlo sampling of the problem space).  I am using MR purely to parallelize the otherwise redundant and separated sampling process.  To maximize parallelism, I want to set the number of mappers explicitly, such that 10 samples run in exactly 1X time by distributing perfectly over 10 mappers.  I am accomplishing this by generating a dummy MR input file of meaningless data.  Each row is identical, so I know the exact length of every row.  I then simply set the split size to the row length, with the intention that Hadoop assign exactly the intended number of mappers.  This approach mostly works.  However, I get a few extraneous empty mappers.  Since they get no input, they do no work and exit almost immediately, so they aren't a serious drain on cluster resources, but I'm confused about why I get extra mappers in the first place.

My working theory was that the end-lines of the input file must be accounted for when calculating split sizes (so my splits were too small and I got a few extra splits hanging off the end of the input file).  I attempted to fix this by adding one to the calculated split size (one greater than the actual row length now).  This works perfectly, generating exactly the intended number of mappers, exactly the same number as there are rows in the input file.  However, the labor distribution is not perfect.  Almost every single run produces one mapper which receives no input (and ends immediately) and another mapper which receives two inputs, thus triggering two "processing sessions" on that particular mapper such that it takes twice as long to complete as the other mappers.  Obviously, this wrecks the potential parallelism by literally doubling the overall job time.

Which split size is correct: row length without end-line or row length with end-line?  The former yields extra empty mappers while the latter yields exactly the right number.  However, if the latter is correct, why is the task distribution uneven (albeit NEARLY even) and what (if anything) can be done about it?

Thanks.

________________________________________________________________________________
Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com

"The easy confidence with which I know another man's religion is folly teaches
me to suspect that my own is also."
                                           --  Mark Twain
________________________________________________________________________________
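
The behaviour described above can be reproduced with a small toy model. This is only a sketch under stated assumptions, not Hadoop code: the class SplitSimulation and the method rowsPerSplit are hypothetical names, and the ownership rule in the comments is an assumption modelled on how Hadoop 2.x-era line readers appear to treat split boundaries (every split except the first discards the line it starts in, and every split keeps reading as long as a line starts at or before its end offset). The model also ignores the slack Hadoop allows on the final split. Under those assumptions, a split size equal to the row length without the newline yields a trailing empty split, and a split size that includes the newline yields the right number of splits but gives the first split two rows and the last split none.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of byte-sized splits over fixed-length rows; not Hadoop code. */
final class SplitSimulation {

    /**
     * Returns the number of rows each split would read.
     *
     * @param rows       number of rows in the dummy input file
     * @param rowBytes   bytes per row, including the trailing newline
     * @param splitBytes requested split size in bytes
     */
    static List<Integer> rowsPerSplit(int rows, int rowBytes, int splitBytes) {
        long fileBytes = (long) rows * rowBytes;
        List<Integer> counts = new ArrayList<>();
        for (long start = 0; start < fileBytes; start += splitBytes) {
            long end = Math.min(start + splitBytes, fileBytes);
            int count = 0;
            for (int r = 0; r < rows; r++) {
                long rowStart = (long) r * rowBytes;
                // Assumed rule: the first split owns rows starting in [0, end];
                // later splits skip the row found at `start` and own rows
                // starting in (start, end].
                boolean owned = (start == 0)
                        ? rowStart <= end
                        : rowStart > start && rowStart <= end;
                if (owned) {
                    count++;
                }
            }
            counts.add(count);
        }
        return counts;
    }

    public static void main(String[] args) {
        // 10 rows of 20 characters plus a newline (21 bytes each).
        System.out.println(rowsPerSplit(10, 21, 20)); // [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
        System.out.println(rowsPerSplit(10, 21, 21)); // [2, 1, 1, 1, 1, 1, 1, 1, 1, 0]
    }
}
```

If the assumed rule matches the Hadoop version in use, neither byte-based split size gives one row per mapper, which is consistent with both symptoms reported in the question.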

Re: Is perfect control over mapper num AND split distribution possible?

Posted by Keith Wiley <kw...@keithwiley.com>.

Seems to work well.  Thank you very much!

On Jan 21, 2014, at 12:42 , Keith Wiley wrote:

> I'll look it up.  Thanks.
> 
> On Jan 21, 2014, at 11:43 , java8964 wrote:
> 
>> You cannot use hadoop "NLineInputFormat"?
>> 
>> If you generate 100 lines of text file, by default, one line will trigger one mapper task.
>> 
>> As long as you have 100 task slot available, you will get 100 mapper running concurrently.
>> 
>> You want perfect control over mapper num? NLineInputFormat is designed for your purpose.
>> 
>> Yong


________________________________________________________________________________
Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com

"It's a fine line between meticulous and obsessive-compulsive and a slippery
rope between obsessive-compulsive and debilitatingly slow."
                                           --  Keith Wiley
________________________________________________________________________________

Re: Is perfect control over mapper num AND split distribution possible?

Posted by Keith Wiley <kw...@keithwiley.com>.

I'll look it up.  Thanks.

On Jan 21, 2014, at 11:43 , java8964 wrote:

> You cannot use hadoop "NLineInputFormat"?
> 
> If you generate 100 lines of text file, by default, one line will trigger one mapper task.
> 
> As long as you have 100 task slot available, you will get 100 mapper running concurrently.
> 
> You want perfect control over mapper num? NLineInputFormat is designed for your purpose.
> 
> Yong

________________________________________________________________________________
Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com

"What I primarily learned in grad school is how much I *don't* know.
Consequently, I left grad school with a higher ignorance to knowledge ratio than
when I entered."
                                           --  Keith Wiley
________________________________________________________________________________

RE: Is perfect control over mapper num AND split distribution possible?

Posted by java8964 <ja...@hotmail.com>.

Can you not use Hadoop's "NLineInputFormat"?
If you generate a text file of 100 lines, then by default each line will trigger one mapper task.
As long as you have 100 task slots available, you will get 100 mappers running concurrently.
You want perfect control over the number of mappers? NLineInputFormat is designed for that purpose.
Yong
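
The idea behind this suggestion can be illustrated with a small stand-alone sketch. It is not the Hadoop implementation: the class LineCountSplits, the nested Split type, and the method splitsByLineCount are hypothetical names used only to show splitting by line count instead of by byte size. In a real job the format would be selected with job.setInputFormatClass(NLineInputFormat.class), and the lines per split would be set with NLineInputFormat.setNumLinesPerSplit(job, 1), where 1 is already the default.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Toy model of splitting by line count; not Hadoop code. */
final class LineCountSplits {

    /** A split described by a byte offset and a byte length. */
    static final class Split {
        final long start;
        final long length;

        Split(long start, long length) {
            this.start = start;
            this.length = length;
        }
    }

    /** Groups every `linesPerSplit` consecutive lines into one split. */
    static List<Split> splitsByLineCount(String fileContents, int linesPerSplit) {
        byte[] bytes = fileContents.getBytes(StandardCharsets.UTF_8);
        List<Split> splits = new ArrayList<>();
        long splitStart = 0;
        int linesInSplit = 0;
        for (int i = 0; i < bytes.length; i++) {
            // A line ends at a newline, or at the last byte of the file.
            boolean endOfLine = bytes[i] == '\n' || i == bytes.length - 1;
            if (!endOfLine) {
                continue;
            }
            linesInSplit++;
            if (linesInSplit == linesPerSplit) {
                splits.add(new Split(splitStart, i + 1 - splitStart));
                splitStart = i + 1;
                linesInSplit = 0;
            }
        }
        // Any leftover lines form a final, shorter split.
        if (linesInSplit > 0) {
            splits.add(new Split(splitStart, bytes.length - splitStart));
        }
        return splits;
    }

    public static void main(String[] args) {
        StringBuilder dummy = new StringBuilder();
        for (int i = 0; i < 10; i++) {
            dummy.append("sample\n");
        }
        // One line per split gives one split (and so one mapper) per line.
        System.out.println(splitsByLineCount(dummy.toString(), 1).size()); // 10
    }
}
```

Because the boundaries come from scanning for line ends instead of dividing the file by a byte size, the number of splits follows the number of lines directly, with no guessing about whether to count the newline.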

> From: kwiley@keithwiley.com
> Subject: Is perfect control over mapper num AND split distribution possible?
> Date: Tue, 21 Jan 2014 11:28:32 -0800
> To: user@hadoop.apache.org
> 
> I am running a job that takes no input from the mapper-input key/value interface.  Each job reads the same small file from the distributed cache and processes it independently (to generate Monte Carlo sampling of the problem space).  I am using MR purely to parallelize the otherwise redundant and separated sampling process.  To maximize parallelism, I want to set the number of mappers explicitly, such that 10 samples run in exact 1X time by perfectly distributing over 10 mappers.  I am accomplishing this by generating a dummy MR input file of nonvalue data.  Each row is identical so I know the exact row length of all rows.  I then simply set the split size to the row length with the intention that Hadoop perfectly assign the intended number of mappers.  This approach mostly works.  However, I get a few extraneous empty mappers.  Since they get no input, they do no work and exit almost immediately, so they aren't a serious drain on cluster resources, but I'm confused why I get extra mappers in the first place.
> 
> My working theory was that the end-lines of the input file must be accounted for when calculating split sizes (so my splits were too small and I got a few extra splits hanging off the end of the input file).  I attempted to fix this by adding one to the calculated split size (one greater than the actual row length now).  This works perfectly, generating exactly the intended number of mappers, exactly the same number as there are rows in the input file.  However, the labor distribution is not perfect.  Almost every single run produces one mapper which receives no input (and ends immediately) and another mapper which receives two inputs, thus triggering two "processing sessions" on that particular mapper such that it takes twice as long to complete as the other mappers.  Obviously, this wrecks the potential parallelism by literally doubling the overall job time.
> 
> Which split size is correct: row length without end-line or row length with end-line?  The former yields extra empty mappers while the latter yields exactly the right number.  However, if the latter is correct, why is the task distribution uneven (albeit NEARLY even) and what (if anything) can be done about it?
> 
> Thanks.
> 
> ________________________________________________________________________________
> Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com
> 
> "The easy confidence with which I know another man's religion is folly teaches
> me to suspect that my own is also."
>                                            --  Mark Twain
> ________________________________________________________________________________
>
