Posted to mapreduce-user@hadoop.apache.org by Vincent Xue <xu...@gmail.com> on 2011/05/23 11:09:12 UTC

Mapping one key per Map Task

Hello Hadoop Users,

I would like to know if anyone has ever tried splitting an input
sequence file by key instead of by size. I know this is unusual for
the MapReduce paradigm, but I am in a situation where I need to
perform some large tasks on each key-value pair in a load-balanced
fashion.

To describe it in more detail:
I have one sequence file of 2000 key-value pairs. I want to distribute
each key-value pair to its own map task, where it will run a series of
MapReduce jobs. This means each map task is launching a series of
jobs. Once the jobs in each task are complete, I want to reduce all of
the output into one sequence file.

I am stuck because I am limited by the number of splits the sequence
file is divided into. Hadoop only splits my sequence file into 80 map
tasks, while my cluster can run around 250 map tasks at once. This
means I am not fully utilizing my cluster, and my job will not scale.

Can anyone shed some light on this problem? I have looked at the
InputFormats, but I am not sure if that is where I should continue
looking.

Best Regards
Vincent

Re: Mapping one key per Map Task

Posted by Jason <ur...@gmail.com>.
Look at NLineInputFormat

Sent from my iPhone
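
A minimal sketch of wiring this up with one line per split (old mapred
API; the input here is assumed to be a plain text file, for example a
listing of the 2000 keys, and IdentityMapper is just a placeholder for
the real per-key work):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class OneLinePerMapDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(OneLinePerMapDriver.class);
    conf.setJobName("one-line-per-map");

    // NLineInputFormat splits a text input so that each split carries N lines;
    // with N = 1, every line (e.g. one key) becomes its own map task.
    conf.setInputFormat(NLineInputFormat.class);
    conf.setInt("mapred.line.input.format.linespermap", 1);

    conf.setMapperClass(IdentityMapper.class); // placeholder for the real per-key work
    conf.setNumReduceTasks(0);                 // map-only for this sketch
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}

Note that NLineInputFormat reads text lines through a LineReader, so it
does not apply to a SequenceFile directly; the keys would first have to
be written out as a plain text listing.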


Re: Mapping one key per Map Task

Posted by Joey Echeverria <jo...@cloudera.com>.
Look at getSplits() of SequenceFileInputFormat.

-Joey
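
One way to act on this is a hypothetical subclass that overrides
getSplits() to scan the file once and emit one split per record (old
mapred API; note that the record reader re-syncs to the next sync
marker inside each split, so this only yields exactly one record per
map task when records are large relative to the file's sync interval):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.util.ReflectionUtils;

// Hypothetical input format: one FileSplit per SequenceFile record, so a file
// with 2000 records produces 2000 map tasks (subject to the sync-marker caveat).
public class OneRecordPerSplitSequenceFileInputFormat<K, V>
    extends SequenceFileInputFormat<K, V> {

  @Override
  public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (FileStatus status : listStatus(job)) {
      Path path = status.getPath();
      FileSystem fs = path.getFileSystem(job);
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, job);
      try {
        Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), job);
        Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), job);
        long start = reader.getPosition();      // position of the first record
        while (reader.next(key, value)) {
          long end = reader.getPosition();      // position just past this record
          splits.add(new FileSplit(path, start, end - start, (String[]) null));
          start = end;
        }
      } finally {
        reader.close();
      }
    }
    return splits.toArray(new InputSplit[splits.size()]);
  }
}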

Re: Mapping one key per Map Task

Posted by Moustafa Gaber <mo...@gmail.com>.
I don't think you need to split your input file so that each map task
is assigned exactly one key. Your goal is load balancing. Each of your
map tasks will initiate a new MR sub-job, and that sub-job gets its own
master/workers, which means the sub-job's map tasks may be scheduled on
machines other than the ones your parent map tasks are running on.
Therefore, you can still achieve load balance without splitting your
input file one key per split.
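
A rough sketch of such a launcher mapper (old mapred API; the paths are
placeholders and the identity mapper/reducer stand in for the real
per-key work):

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

// Hypothetical outer mapper: for every key it receives, it configures and
// submits a separate sub-job, waits for it, and emits the sub-job's output path.
public class SubJobLauncherMapper extends MapReduceBase
    implements Mapper<Text, Text, Text, Text> {

  private JobConf outerConf;

  @Override
  public void configure(JobConf job) {
    this.outerConf = job;
  }

  @Override
  public void map(Text key, Text value, OutputCollector<Text, Text> out,
                  Reporter reporter) throws IOException {
    JobConf sub = new JobConf(outerConf);
    sub.setJobName("sub-job-for-" + key);

    // Placeholder classes and paths; a real sub-job would set its own mapper,
    // reducer, and per-key input derived from the key.
    sub.setMapperClass(IdentityMapper.class);
    sub.setReducerClass(IdentityReducer.class);
    FileInputFormat.setInputPaths(sub, new Path("/data/per-key/" + key));
    Path outDir = new Path("/out/per-key/" + key);
    FileOutputFormat.setOutputPath(sub, outDir);

    // The JobTracker schedules the sub-job's tasks across the whole cluster,
    // independently of where this parent map task happens to be running.
    JobClient.runJob(sub);

    out.collect(key, new Text(outDir.toString()));
  }
}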

On Mon, May 23, 2011 at 1:02 PM, Vincent Xue <xu...@gmail.com> wrote:

> Thanks for the suggestions!



-- 
Best Regards,
Mostafa Ead

Re: Mapping one key per Map Task

Posted by Vincent Xue <xu...@gmail.com>.
Thanks for the suggestions!


Re: Mapping one key per Map Task

Posted by Harsh J <ha...@cloudera.com>.
Vincent,

You _might_ lose locality by splitting beyond the block splits, and
the tasks, although better 'parallelized', may only end up performing
worse. A good way to instead increase the number of tasks is to go the
block-size route: lower the block size and get more splits, at the cost
of a little extra NN space. After all, block sizes are per-file properties.

But if you really want to go the per-record route, an NLine-like
implementation for SequenceFiles, using what Joey and Jason have
pointed out, would be the best way. (NLineInputFormat doesn't cover
SequenceFiles directly; it's implemented with a LineReader.)




-- 
Harsh J