Posted to common-user@hadoop.apache.org by Jim the Standing Bear <st...@gmail.com> on 2008/01/16 16:59:39 UTC

a question on number of parallel tasks

Hi,

How do I make Hadoop split its output?  The program I am writing
crawls a catalog tree from a single URL, so initially the input
contains only one entry.  After a few iterations, it will have tens of
thousands of URLs.  But what I noticed is that the output always ends
up in a single file (part-00000).  What I would like is for the job to
parallelize once the number of entries increases, but currently that
doesn't seem to be the case.

-- 
--------------------------------------
Standing Bear Has Spoken
--------------------------------------

Re: a question on number of parallel tasks

Posted by Jim the Standing Bear <st...@gmail.com>.
Thanks, Miles.

On Jan 16, 2008 11:51 AM, Miles Osborne <mi...@inf.ed.ac.uk> wrote:
> The number of reduces should be a function of the amount of data needing
> reducing, not the number of mappers.
>
> For example,  your mappers might delete  90% of the input data, in which
> case you should only need 1/10 of the number of reducers as mappers.
>
> Miles
>
>
> On 16/01/2008, Jim the Standing Bear <st...@gmail.com> wrote:
> >
> > hmm.. interesting... these are supposed to be the output from mappers
> > (and default reducers since I didn't specify any for those jobs)...
> > but shouldn't the number of reducers match the number of mappers?  If
> > there was only one reducer, it would mean I only had one mapper task
> > running??  That is why I asked my question in the first place, because
> > I suspect my jobs were not being running in parallel.
> >
> > On Jan 16, 2008 11:42 AM, Ted Dunning <td...@veoh.com> wrote:
> > >
> > > The part nomenclature does not refer to splits.  It refers to how many
> > > reduce processes were involved in actually writing the output
> > file.  Files
> > > are split at read-time as necessary.
> > >
> > > You will get more of them if you have more reducers.
> > >
> > >
> > >
> > > On 1/16/08 8:25 AM, "Jim the Standing Bear" <st...@gmail.com>
> > wrote:
> > >
> > > > Thanks Ted.  I just didn't ask it right.  Here is a stupid 101
> > > > question, which I am sure the answer lies in the documentation
> > > > somewhere, just that I was having some difficulties in finding it...
> > > >
> > > > when I do an "ls" on the dfs,  I would see this:
> > > > /user/bear/output/part-00000 <r 4>
> > > >
> > > > I probably got confused on what the part-##### means... I thought
> > > > part-##### tells how many splits a file has... so far, I have only
> > > > seen part-00000.  When will it have part-00001, 00002, etc?
> > > >
> > > >
> > > >
> > > > On Jan 16, 2008 11:04 AM, Ted Dunning <td...@veoh.com> wrote:
> > > >>
> > > >>
> > > >> Parallelizing the processing of data occurs at two steps.  The first
> > is
> > > >> during the map phase where the input data file is (hopefully) split
> > across
> > > >> multiple tasks.  This should happen transparently most of the time
> > unless
> > > >> you have a perverse data format or use unsplittable compression on
> > your
> > > >> file.
> > > >>
> > > >> This parallelism can occur whether you have one input file or many.
> > > >>
> > > >> The second level of parallelism is at reduce phase.  You set this by
> > setting
> > > >> the number of reducers.  This will also determine the number of
> > output files
> > > >> that you get.
> > > >>
> > > >> Depending on your algorithm, it may help or hurt to have one or many
> > > >> reducers.  The recent example of a program to find the 10 largest
> > elements
> > > >> is an example that pretty much requires a single reducer.  Other
> > programs
> > > >> where the mapper produces huge amounts of output would be better
> > served by
> > > >> having many reducers.
> > > >>
> > > >> This is a general answer since the question is kind of non-specific.
> > > >>
> > > >>
> > > >>
> > > >> On 1/16/08 7:59 AM, "Jim the Standing Bear" <st...@gmail.com>
> > wrote:
> > > >>
> > > >>> Hi,
> > > >>>
> > > >>> How do I make hadoop split its output?  The program I am writing
> > > >>> crawls a catalog tree from a single url, so initially the input
> > > >>> contains only one entry.  after a few iterations, it will have tens
> > of
> > > >>> thousands of urls.  But what I noticed is that the file is always in
> > > >>> one block (part-00000).   What I would like to have is once the
> > number
> > > >>> of entries increases, it can parallelize the job.  Currently it
> > > >>> doesn't seem to be case.
> > > >>
> > > >>
> > > >
> > > >
> > >
> > >
> >
> >
> >
> > --
> > --------------------------------------
> > Standing Bear Has Spoken
> > --------------------------------------
> >
>



-- 
--------------------------------------
Standing Bear Has Spoken
--------------------------------------

Re: a question on number of parallel tasks

Posted by Miles Osborne <mi...@inf.ed.ac.uk>.
The number of reduces should be a function of the amount of data that
needs reducing, not of the number of mappers.

For example, your mappers might discard 90% of the input data, in
which case you should need only about 1/10 as many reducers as
mappers.
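
A back-of-the-envelope sketch of that sizing heuristic; the byte
counts and the 128 MB-per-reducer target below are made-up numbers,
not recommendations:

public class ReducerSizing {
  public static void main(String[] args) {
    // Rough inputs: how much data the mappers read, and what fraction survives.
    long mapInputBytes = 10L * 1024 * 1024 * 1024;   // 10 GB of map input
    double keptFraction = 0.10;                      // mappers keep ~10% of it
    long mapOutputBytes = (long) (mapInputBytes * keptFraction);

    // Target amount of data for each reducer to handle.
    long bytesPerReducer = 128L * 1024 * 1024;       // 128 MB per reducer

    long numReducers = Math.max(1, mapOutputBytes / bytesPerReducer);
    System.out.println("Suggested reducers: " + numReducers);  // prints 8
  }
}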

Miles

On 16/01/2008, Jim the Standing Bear <st...@gmail.com> wrote:
>
> hmm.. interesting... these are supposed to be the output from mappers
> (and default reducers since I didn't specify any for those jobs)...
> but shouldn't the number of reducers match the number of mappers?  If
> there was only one reducer, it would mean I only had one mapper task
> running??  That is why I asked my question in the first place, because
> I suspect my jobs were not being running in parallel.
>
> On Jan 16, 2008 11:42 AM, Ted Dunning <td...@veoh.com> wrote:
> >
> > The part nomenclature does not refer to splits.  It refers to how many
> > reduce processes were involved in actually writing the output
> file.  Files
> > are split at read-time as necessary.
> >
> > You will get more of them if you have more reducers.
> >
> >
> >
> > On 1/16/08 8:25 AM, "Jim the Standing Bear" <st...@gmail.com>
> wrote:
> >
> > > Thanks Ted.  I just didn't ask it right.  Here is a stupid 101
> > > question, which I am sure the answer lies in the documentation
> > > somewhere, just that I was having some difficulties in finding it...
> > >
> > > when I do an "ls" on the dfs,  I would see this:
> > > /user/bear/output/part-00000 <r 4>
> > >
> > > I probably got confused on what the part-##### means... I thought
> > > part-##### tells how many splits a file has... so far, I have only
> > > seen part-00000.  When will it have part-00001, 00002, etc?
> > >
> > >
> > >
> > > On Jan 16, 2008 11:04 AM, Ted Dunning <td...@veoh.com> wrote:
> > >>
> > >>
> > >> Parallelizing the processing of data occurs at two steps.  The first
> is
> > >> during the map phase where the input data file is (hopefully) split
> across
> > >> multiple tasks.  This should happen transparently most of the time
> unless
> > >> you have a perverse data format or use unsplittable compression on
> your
> > >> file.
> > >>
> > >> This parallelism can occur whether you have one input file or many.
> > >>
> > >> The second level of parallelism is at reduce phase.  You set this by
> setting
> > >> the number of reducers.  This will also determine the number of
> output files
> > >> that you get.
> > >>
> > >> Depending on your algorithm, it may help or hurt to have one or many
> > >> reducers.  The recent example of a program to find the 10 largest
> elements
> > >> is an example that pretty much requires a single reducer.  Other
> programs
> > >> where the mapper produces huge amounts of output would be better
> served by
> > >> having many reducers.
> > >>
> > >> This is a general answer since the question is kind of non-specific.
> > >>
> > >>
> > >>
> > >> On 1/16/08 7:59 AM, "Jim the Standing Bear" <st...@gmail.com>
> wrote:
> > >>
> > >>> Hi,
> > >>>
> > >>> How do I make hadoop split its output?  The program I am writing
> > >>> crawls a catalog tree from a single url, so initially the input
> > >>> contains only one entry.  after a few iterations, it will have tens
> of
> > >>> thousands of urls.  But what I noticed is that the file is always in
> > >>> one block (part-00000).   What I would like to have is once the
> number
> > >>> of entries increases, it can parallelize the job.  Currently it
> > >>> doesn't seem to be case.
> > >>
> > >>
> > >
> > >
> >
> >
>
>
>
> --
> --------------------------------------
> Standing Bear Has Spoken
> --------------------------------------
>

Re: a question on number of parallel tasks

Posted by Jim the Standing Bear <st...@gmail.com>.
Hmm.. interesting... these are supposed to be the output from the
mappers (and the default reducers, since I didn't specify any for
those jobs)... but shouldn't the number of reducers match the number
of mappers?  If there was only one reducer, would that mean I only had
one mapper task running?  That is why I asked my question in the first
place: I suspect my jobs were not being run in parallel.

On Jan 16, 2008 11:42 AM, Ted Dunning <td...@veoh.com> wrote:
>
> The part nomenclature does not refer to splits.  It refers to how many
> reduce processes were involved in actually writing the output file.  Files
> are split at read-time as necessary.
>
> You will get more of them if you have more reducers.
>
>
>
> On 1/16/08 8:25 AM, "Jim the Standing Bear" <st...@gmail.com> wrote:
>
> > Thanks Ted.  I just didn't ask it right.  Here is a stupid 101
> > question, which I am sure the answer lies in the documentation
> > somewhere, just that I was having some difficulties in finding it...
> >
> > when I do an "ls" on the dfs,  I would see this:
> > /user/bear/output/part-00000 <r 4>
> >
> > I probably got confused on what the part-##### means... I thought
> > part-##### tells how many splits a file has... so far, I have only
> > seen part-00000.  When will it have part-00001, 00002, etc?
> >
> >
> >
> > On Jan 16, 2008 11:04 AM, Ted Dunning <td...@veoh.com> wrote:
> >>
> >>
> >> Parallelizing the processing of data occurs at two steps.  The first is
> >> during the map phase where the input data file is (hopefully) split across
> >> multiple tasks.  This should happen transparently most of the time unless
> >> you have a perverse data format or use unsplittable compression on your
> >> file.
> >>
> >> This parallelism can occur whether you have one input file or many.
> >>
> >> The second level of parallelism is at reduce phase.  You set this by setting
> >> the number of reducers.  This will also determine the number of output files
> >> that you get.
> >>
> >> Depending on your algorithm, it may help or hurt to have one or many
> >> reducers.  The recent example of a program to find the 10 largest elements
> >> is an example that pretty much requires a single reducer.  Other programs
> >> where the mapper produces huge amounts of output would be better served by
> >> having many reducers.
> >>
> >> This is a general answer since the question is kind of non-specific.
> >>
> >>
> >>
> >> On 1/16/08 7:59 AM, "Jim the Standing Bear" <st...@gmail.com> wrote:
> >>
> >>> Hi,
> >>>
> >>> How do I make hadoop split its output?  The program I am writing
> >>> crawls a catalog tree from a single url, so initially the input
> >>> contains only one entry.  after a few iterations, it will have tens of
> >>> thousands of urls.  But what I noticed is that the file is always in
> >>> one block (part-00000).   What I would like to have is once the number
> >>> of entries increases, it can parallelize the job.  Currently it
> >>> doesn't seem to be case.
> >>
> >>
> >
> >
>
>



-- 
--------------------------------------
Standing Bear Has Spoken
--------------------------------------

Re: a question on number of parallel tasks

Posted by Ted Dunning <td...@veoh.com>.
The part nomenclature does not refer to splits.  It refers to how many
reduce processes were involved in actually writing the output file.  Files
are split at read-time as necessary.

You will get more part files if you have more reducers.
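
If you want to check that from code rather than "dfs -ls", here is a
minimal sketch; it assumes a Hadoop release where FileSystem.listStatus
is available, and it uses the output path mentioned earlier in this
thread:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListParts {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Output directory from the example in this thread.
    Path output = new Path("/user/bear/output");

    // One part-NNNNN file is written per reducer, so counting them
    // shows how many reducers actually ran.
    for (FileStatus status : fs.listStatus(output)) {
      String name = status.getPath().getName();
      if (name.startsWith("part-")) {
        System.out.println(name + "\t" + status.getLen() + " bytes");
      }
    }
  }
}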


On 1/16/08 8:25 AM, "Jim the Standing Bear" <st...@gmail.com> wrote:

> Thanks Ted.  I just didn't ask it right.  Here is a stupid 101
> question, which I am sure the answer lies in the documentation
> somewhere, just that I was having some difficulties in finding it...
> 
> when I do an "ls" on the dfs,  I would see this:
> /user/bear/output/part-00000 <r 4>
> 
> I probably got confused on what the part-##### means... I thought
> part-##### tells how many splits a file has... so far, I have only
> seen part-00000.  When will it have part-00001, 00002, etc?
> 
> 
> 
> On Jan 16, 2008 11:04 AM, Ted Dunning <td...@veoh.com> wrote:
>> 
>> 
>> Parallelizing the processing of data occurs at two steps.  The first is
>> during the map phase where the input data file is (hopefully) split across
>> multiple tasks.  This should happen transparently most of the time unless
>> you have a perverse data format or use unsplittable compression on your
>> file.
>> 
>> This parallelism can occur whether you have one input file or many.
>> 
>> The second level of parallelism is at reduce phase.  You set this by setting
>> the number of reducers.  This will also determine the number of output files
>> that you get.
>> 
>> Depending on your algorithm, it may help or hurt to have one or many
>> reducers.  The recent example of a program to find the 10 largest elements
>> is an example that pretty much requires a single reducer.  Other programs
>> where the mapper produces huge amounts of output would be better served by
>> having many reducers.
>> 
>> This is a general answer since the question is kind of non-specific.
>> 
>> 
>> 
>> On 1/16/08 7:59 AM, "Jim the Standing Bear" <st...@gmail.com> wrote:
>> 
>>> Hi,
>>> 
>>> How do I make hadoop split its output?  The program I am writing
>>> crawls a catalog tree from a single url, so initially the input
>>> contains only one entry.  after a few iterations, it will have tens of
>>> thousands of urls.  But what I noticed is that the file is always in
>>> one block (part-00000).   What I would like to have is once the number
>>> of entries increases, it can parallelize the job.  Currently it
>>> doesn't seem to be case.
>> 
>> 
> 
> 


Re: a question on number of parallel tasks

Posted by Jim the Standing Bear <st...@gmail.com>.
Thanks Ted.  I just didn't ask it right.  Here is a stupid 101
question; I am sure the answer lies in the documentation somewhere,
it's just that I was having some difficulty finding it...

When I do an "ls" on the DFS, I see this:
/user/bear/output/part-00000 <r 4>

I probably got confused about what part-##### means... I thought
part-##### tells how many splits a file has... so far, I have only
seen part-00000.  When will there be part-00001, part-00002, etc.?



On Jan 16, 2008 11:04 AM, Ted Dunning <td...@veoh.com> wrote:
>
>
> Parallelizing the processing of data occurs at two steps.  The first is
> during the map phase where the input data file is (hopefully) split across
> multiple tasks.  This should happen transparently most of the time unless
> you have a perverse data format or use unsplittable compression on your
> file.
>
> This parallelism can occur whether you have one input file or many.
>
> The second level of parallelism is at reduce phase.  You set this by setting
> the number of reducers.  This will also determine the number of output files
> that you get.
>
> Depending on your algorithm, it may help or hurt to have one or many
> reducers.  The recent example of a program to find the 10 largest elements
> is an example that pretty much requires a single reducer.  Other programs
> where the mapper produces huge amounts of output would be better served by
> having many reducers.
>
> This is a general answer since the question is kind of non-specific.
>
>
>
> On 1/16/08 7:59 AM, "Jim the Standing Bear" <st...@gmail.com> wrote:
>
> > Hi,
> >
> > How do I make hadoop split its output?  The program I am writing
> > crawls a catalog tree from a single url, so initially the input
> > contains only one entry.  after a few iterations, it will have tens of
> > thousands of urls.  But what I noticed is that the file is always in
> > one block (part-00000).   What I would like to have is once the number
> > of entries increases, it can parallelize the job.  Currently it
> > doesn't seem to be case.
>
>



-- 
--------------------------------------
Standing Bear Has Spoken
--------------------------------------

Re: a question on number of parallel tasks

Posted by Ted Dunning <td...@veoh.com>.

Parallelizing the processing of data occurs in two steps.  The first
is during the map phase, where the input data file is (hopefully)
split across multiple tasks.  This should happen transparently most of
the time, unless you have a perverse data format or use unsplittable
compression on your file.

This parallelism can occur whether you have one input file or many.

The second level of parallelism is at the reduce phase.  You set this
by setting the number of reducers, and it also determines the number
of output files that you get.
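
As a concrete illustration, here is a minimal driver sketch using the
classic org.apache.hadoop.mapred (JobConf) API as it looks in later
0.x releases; the class name, job name and paths are placeholders, and
IdentityMapper/IdentityReducer are used only to keep the sketch
self-contained:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class CrawlDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(CrawlDriver.class);
    conf.setJobName("catalog-crawl");

    // Identity map/reduce classes; a real job would plug in its own.
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);

    // With the default TextInputFormat the map keys are byte offsets
    // (LongWritable) and the values are the lines themselves (Text).
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // Map-side parallelism comes from the input splits; this is only a hint.
    conf.setNumMapTasks(10);

    // Reduce-side parallelism is explicit: 4 reducers write
    // part-00000 through part-00003 into the output directory.
    conf.setNumReduceTasks(4);

    JobClient.runJob(conf);
  }
}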

Depending on your algorithm, it may help or hurt to have one or many
reducers.  The recent example of a program to find the 10 largest
elements pretty much requires a single reducer.  Other programs where
the mapper produces huge amounts of output would be better served by
having many reducers.

This is a general answer since the question is kind of non-specific.


On 1/16/08 7:59 AM, "Jim the Standing Bear" <st...@gmail.com> wrote:

> Hi,
> 
> How do I make hadoop split its output?  The program I am writing
> crawls a catalog tree from a single url, so initially the input
> contains only one entry.  after a few iterations, it will have tens of
> thousands of urls.  But what I noticed is that the file is always in
> one block (part-00000).   What I would like to have is once the number
> of entries increases, it can parallelize the job.  Currently it
> doesn't seem to be case.