Posted to user@hadoop.apache.org by Jon Allen <ja...@gmail.com> on 2012/11/23 13:02:25 UTC

Hadoop 1.0.4 Performance Problem

Hi,

We've just upgraded our cluster from Hadoop 0.20.203 to 1.0.4 and have hit performance problems.  Before the upgrade a 15TB terasort took about 45 minutes, afterwards it takes just over an hour.  Looking in more detail it appears the shuffle phase has increased from 20 minutes to 40 minutes.  Does anyone have any thoughts about what's changed between these releases that may have caused this?

The only change to the system has been to Hadoop.  We moved from a tarball install of 0.20.203 with all processes running as hadoop to an RPM deployment of 1.0.4 with processes running as hdfs and mapred.  Nothing else has changed.

As a related question, we're still running with a configuration that was tuned for version 0.20.1. Are there any recommendations for tuning properties that have been introduced in recent versions that are worth investigating?

Thanks,
Jon

Re: Hadoop 1.0.4 Performance Problem

Posted by Jon Allen <ja...@gmail.com>.
Amit,

As Harsh says, I was referring to the mapred.fairscheduler.assignmultiple
config property.  This only shows up as an issue if you're running
significantly fewer reduce tasks than there are available slots.  We ran
the terasort as a simple smoke test of the upgrade rather than as a useful
performance measure; it just worried me that there was such a noticeable
degradation.  We're now moving on to some more realistic tests based on
our operational jobs.
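
For anyone else who hits this, a minimal sketch of the relevant
mapred-site.xml entries on the JobTracker (this assumes the fair scheduler
is enabled via mapred.jobtracker.taskScheduler and that the JobTracker is
restarted to pick up the change):

    <property>
      <name>mapred.jobtracker.taskScheduler</name>
      <value>org.apache.hadoop.mapred.FairScheduler</value>
    </property>
    <property>
      <!-- turning this off restored the even reducer spread described in the quote below -->
      <name>mapred.fairscheduler.assignmultiple</name>
      <value>false</value>
    </property>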

Thanks,
Jon


On Tue, Nov 27, 2012 at 10:00 AM, Harsh J <ha...@cloudera.com> wrote:

> Hi Amit,
>
> He means the mapred.fairscheduler.assignmultiple FairScheduler
> property. It is true by default, which works well for most workloads
> if not benchmark style workloads. I would not usually trust that as a
> base perf. measure of everything that comes out of an upgrade.
>
> The other JIRA, MAPREDUCE-4451, has been resolved for 1.2.0.
>
> On Tue, Nov 27, 2012 at 3:20 PM, Amit Sela <am...@infolinks.com> wrote:
> > Hi Jon,
> >
> > I recently upgraded our cluster from Hadoop 0.20.3-append to Hadoop 1.0.4
> > and I haven't noticed any performance issues. By  "multiple assignment
> > feature" do you mean speculative execution
> > (mapred.map.tasks.speculative.execution and
> > mapred.reduce.tasks.speculative.execution) ?
> >
> >
> > On Mon, Nov 26, 2012 at 11:49 PM, Jon Allen <ja...@gmail.com> wrote:
> >>
> >> Problem solved, but worth warning others about.
> >>
> >> Before the upgrade the reducers for the terasort process had been evenly
> >> distributed around the cluster - one per task tracker in turn, looping
> >> around the cluster until all tasks were allocated.  After the upgrade
> all
> >> reduce task had been submitted to small number of task trackers - submit
> >> tasks until the task tracker slots were full and then move onto the next
> >> task tracker.  Skewing the reducers like this quite clearly hit the
> >> benchmark performance.
> >>
> >> The reason for this turns out to be the fair scheduler rewrite
> >> (MAPREDUCE-2981) that appears to have subtly modified the behaviour of
> the
> >> assign multiple property. Previously this property caused a single map
> and a
> >> single reduce task to be allocated in a task tracker heartbeat (rather
> than
> >> the default of a map or a reduce).  After the upgrade it allocates as
> many
> >> tasks as there are available task slots.  Turning off the multiple
> >> assignment feature returned the terasort to its pre-upgrade performance.
> >>
> >> I can see potential benefits to this change and need to think through
> the
> >> consequences to real world applications (though in practice we're
> likely to
> >> move away from fair scheduler due to MAPREDUCE-4451).  Investigating
> this
> >> has been a pain so to warn other user is there anywhere central that
> can be
> >> used to record upgrade gotchas like this?
> >>
> >>
> >> On Fri, Nov 23, 2012 at 12:02 PM, Jon Allen <ja...@gmail.com>
> wrote:
> >>>
> >>> Hi,
> >>>
> >>> We've just upgraded our cluster from Hadoop 0.20.203 to 1.0.4 and have
> >>> hit performance problems.  Before the upgrade a 15TB terasort took
> about 45
> >>> minutes, afterwards it takes just over an hour.  Looking in more
> detail it
> >>> appears the shuffle phase has increased from 20 minutes to 40 minutes.
>  Does
> >>> anyone have any thoughts about what's changed between these releases
> that
> >>> may have caused this?
> >>>
> >>> The only change to the system has been to Hadoop.  We moved from a
> >>> tarball install of 0.20.203 with all processes running as hadoop to an
> RPM
> >>> deployment of 1.0.4 with processes running as hdfs and mapred.
>  Nothing else
> >>> has changed.
> >>>
> >>> As a related question, we're still running with a configuration that
> was
> >>> tuned for version 0.20.1. Are there any recommendations for tuning
> >>> properties that have been introduced in recent versions that are worth
> >>> investigating?
> >>>
> >>> Thanks,
> >>> Jon
> >>
> >>
> >
>
>
>
> --
> Harsh J
>

Re: Hadoop 1.0.4 Performance Problem

Posted by Jon Allen <ja...@gmail.com>.
Yes, it's a fair scheduler thing - I don't think it's right to call it a
problem; it's just a change in behaviour that came as a surprise.

My reason for using the fair scheduler is that we have multiple users
running simultaneous jobs and the default FIFO scheduler doesn't share the
cluster between them.  My choice of the fair scheduler rather than the
capacity scheduler was made a couple of years ago and was entirely
arbitrary.
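
As an aside, a minimal sketch of the sort of multi-user setup this gives
us (the pool name and file path below are made up for illustration; by
default the fair scheduler pools jobs by the submitting user, via
mapred.fairscheduler.poolnameproperty):

    <!-- mapred-site.xml: point the fair scheduler at an allocation file -->
    <property>
      <name>mapred.fairscheduler.allocation.file</name>
      <value>/etc/hadoop/conf/fair-scheduler.xml</value>
    </property>

    <!-- fair-scheduler.xml: give a team's pool a guaranteed minimum share -->
    <allocations>
      <pool name="analytics">
        <minMaps>20</minMaps>
        <minReduces>10</minReduces>
      </pool>
      <userMaxJobsDefault>5</userMaxJobsDefault>
    </allocations>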

On Tue, Nov 27, 2012 at 10:21 AM, Amit Sela <am...@infolinks.com> wrote:

> So this is a FairScheduler problem ?
> We are using the default Hadoop scheduler. Is there a reason to use the
> Fair Scheduler if most of the time we don't have more than 4 jobs
> running simultaneously ?
>
>
> On Tue, Nov 27, 2012 at 12:00 PM, Harsh J <ha...@cloudera.com> wrote:
>
>> Hi Amit,
>>
>> He means the mapred.fairscheduler.assignmultiple FairScheduler
>> property. It is true by default, which works well for most workloads
>> if not benchmark style workloads. I would not usually trust that as a
>> base perf. measure of everything that comes out of an upgrade.
>>
>> The other JIRA, MAPREDUCE-4451, has been resolved for 1.2.0.
>>
>> On Tue, Nov 27, 2012 at 3:20 PM, Amit Sela <am...@infolinks.com> wrote:
>> > Hi Jon,
>> >
>> > I recently upgraded our cluster from Hadoop 0.20.3-append to Hadoop
>> 1.0.4
>> > and I haven't noticed any performance issues. By  "multiple assignment
>> > feature" do you mean speculative execution
>> > (mapred.map.tasks.speculative.execution and
>> > mapred.reduce.tasks.speculative.execution) ?
>> >
>> >
>> > On Mon, Nov 26, 2012 at 11:49 PM, Jon Allen <ja...@gmail.com>
>> wrote:
>> >>
>> >> Problem solved, but worth warning others about.
>> >>
>> >> Before the upgrade the reducers for the terasort process had been
>> evenly
>> >> distributed around the cluster - one per task tracker in turn, looping
>> >> around the cluster until all tasks were allocated.  After the upgrade
>> all
>> >> reduce task had been submitted to small number of task trackers -
>> submit
>> >> tasks until the task tracker slots were full and then move onto the
>> next
>> >> task tracker.  Skewing the reducers like this quite clearly hit the
>> >> benchmark performance.
>> >>
>> >> The reason for this turns out to be the fair scheduler rewrite
>> >> (MAPREDUCE-2981) that appears to have subtly modified the behaviour of
>> the
>> >> assign multiple property. Previously this property caused a single map
>> and a
>> >> single reduce task to be allocated in a task tracker heartbeat (rather
>> than
>> >> the default of a map or a reduce).  After the upgrade it allocates as
>> many
>> >> tasks as there are available task slots.  Turning off the multiple
>> >> assignment feature returned the terasort to its pre-upgrade
>> performance.
>> >>
>> >> I can see potential benefits to this change and need to think through
>> the
>> >> consequences to real world applications (though in practice we're
>> likely to
>> >> move away from fair scheduler due to MAPREDUCE-4451).  Investigating
>> this
>> >> has been a pain so to warn other user is there anywhere central that
>> can be
>> >> used to record upgrade gotchas like this?
>> >>
>> >>
>> >> On Fri, Nov 23, 2012 at 12:02 PM, Jon Allen <ja...@gmail.com>
>> wrote:
>> >>>
>> >>> Hi,
>> >>>
>> >>> We've just upgraded our cluster from Hadoop 0.20.203 to 1.0.4 and have
>> >>> hit performance problems.  Before the upgrade a 15TB terasort took
>> about 45
>> >>> minutes, afterwards it takes just over an hour.  Looking in more
>> detail it
>> >>> appears the shuffle phase has increased from 20 minutes to 40
>> minutes.  Does
>> >>> anyone have any thoughts about what's changed between these releases
>> that
>> >>> may have caused this?
>> >>>
>> >>> The only change to the system has been to Hadoop.  We moved from a
>> >>> tarball install of 0.20.203 with all processes running as hadoop to
>> an RPM
>> >>> deployment of 1.0.4 with processes running as hdfs and mapred.
>>  Nothing else
>> >>> has changed.
>> >>>
>> >>> As a related question, we're still running with a configuration that
>> was
>> >>> tuned for version 0.20.1. Are there any recommendations for tuning
>> >>> properties that have been introduced in recent versions that are worth
>> >>> investigating?
>> >>>
>> >>> Thanks,
>> >>> Jon
>> >>
>> >>
>> >
>>
>>
>>
>> --
>> Harsh J
>>
>
>

Re: Hadoop 1.0.4 Performance Problem

Posted by anand sharma <an...@gmail.com>.
+1 to the way Jon elaborated it.


On Fri, Dec 21, 2012 at 6:36 AM, Todd Lipcon <to...@cloudera.com> wrote:

> Hi Jon,
>
> FYI, this issue in the fair scheduler was fixed by
> https://issues.apache.org/jira/browse/MAPREDUCE-2905 for 1.1.0.
> Though it is present again in MR2:
> https://issues.apache.org/jira/browse/MAPREDUCE-3268
>
> -Todd
>
> On Wed, Nov 28, 2012 at 2:32 PM, Jon Allen <ja...@gmail.com> wrote:
> > Jie,
> >
> > Simple answer - I got lucky (though obviously there are thing you need to
> > have in place to allow you to be lucky).
> >
> > Before running the upgrade I ran a set of tests to baseline the cluster
> > performance, e.g. terasort, gridmix and some operational jobs.  Terasort
> by
> > itself isn't very realistic as a cluster test but it's nice and simple to
> > run and is good for regression testing things after a change.
> >
> > After the upgrade the intention was to run the same tests and show that
> the
> > performance hadn't degraded (improved would have been nice but not worse
> was
> > the minimum).  When we ran the terasort we found that performance was
> about
> > 50% worse - execution time had gone from 40 minutes to 60 minutes.  As
> I've
> > said, terasort doesn't provide a realistic view of operational
> performance
> > but this showed that something major had changed and we needed to
> understand
> > it before going further.  So how to go about diagnosing this ...
> >
> > First rule - understand what you're trying to achieve.  It's very easy to
> > say performance isn't good enough but performance can always be better so
> > you need to know what's realistic and at what point you're going to stop
> > tuning things.  I had a previous baseline that I was trying to match so I
> > knew what I was trying to achieve.
> >
> > Next thing to do is profile your job and identify where the problem is.
>  We
> > had the full job history from the before and after jobs and comparing
> these
> > we saw that map performance was fairly consistent as were the reduce sort
> > and reduce phases.  The problem was with the shuffle, which had gone
> from 20
> > minutes pre-upgrade to 40 minutes afterwards.  The important thing here
> is
> > to make sure you've got as much information as possible.  If we'd just
> kept
> > the overall job time then there would have been a lot more areas to look
> at
> > but knowing the problem was with shuffle allowed me to focus effort in
> this
> > area.
> >
> > So what had changed in the shuffle that may have slowed things down.  The
> > first thing we thought of was that we'd moved from a tarball deployment
> to
> > using the RPM so what effect might this have had on things.  Our
> operational
> > configuration compresses the map output and in the past we've had
> problems
> > with Java compression libraries being used rather than native ones and
> this
> > has affected performance.  We knew the RPM deployment had moved the
> native
> > library so spent some time confirming to ourselves that these were being
> > used correctly (but this turned out to not be the problem).  We then
> spent
> > time doing some process and server profiling - using dstat to look at the
> > server bottlenecks and jstack/jmap to check what the task tracker and
> reduce
> > processes were doing.  Although not directly relevant to this particular
> > problem doing this was useful just to get my head around what Hadoop is
> > doing at various points of the process.
> >
> > The next bit was one place where I got lucky - I happened to be logged
> onto
> > one of the worker nodes when a test job was running and I noticed that
> there
> > weren't any reduce tasks running on the server.  This was odd as we'd
> > submitted more reducers than we have servers so I'd expected at least one
> > task to be running on each server.  Checking the job tracker log file it
> > turned out that since the upgrade the job tracker had been submitting
> reduce
> > tasks to only 10% of the available nodes.  A different 10% each time the
> job
> > was run so clearly the individual task trackers were working OK but there
> > was something odd going on with the task allocation.  Checking the job
> > tracker log file showed that before the upgrade tasks had been fairly
> evenly
> > distributed so something had changed.  After that it was a case of
> digging
> > around the source code to find out which classes were available for task
> > allocation and what inside them had changed.  This can be quite daunting
> but
> > if you're comfortable with Java then it's just a case of following the
> calls
> > through the code.  Once I found the cause it was just a case of working
> out
> > what my options were for working around it (in this case turning off the
> > multiple assignment option - I can work out whether I want to turn it
> back
> > on in slower time).
> >
> > Where I think we got very lucky is that we hit this problem.  The
> > configuration we use for the terasort has just over 1 reducer per worker
> > node rather than maxing out the available reducer slots.  This decision
> was
> > made several years and I can't remember the reasons for it.  If we'd been
> > using a larger number of reducers then the number of worker nodes in use
> > would have been similar regardless of the allocation algorithm and so the
> > performance would have looked similar before and after the upgrade.  We
> > would have hit this problem eventually but probably not until we started
> > running user jobs and by then it would be too late to do the intrusive
> > investigations that were possible now.
> >
> > Hope this has been useful.
> >
> > Regards,
> > Jon
> >
> >
> >
> > On Tue, Nov 27, 2012 at 3:08 PM, Jie Li <ji...@cs.duke.edu> wrote:
> >>
> >> Jon:
> >>
> >> This is interesting and helpful! How did you figure out the cause? And
> how
> >> much time did you spend? Could you share some experience of performance
> >> diagnosis?
> >>
> >> Jie
> >>
> >> On Tuesday, November 27, 2012, Harsh J wrote:
> >>>
> >>> Hi Amit,
> >>>
> >>> The default scheduler is FIFO, and may not work for all forms of
> >>> workloads. Read the multiple schedulers available to see if they have
> >>> features that may benefit your workload:
> >>>
> >>> Capacity Scheduler:
> >>> http://hadoop.apache.org/docs/stable/capacity_scheduler.html
> >>> FairScheduler:
> >>> http://hadoop.apache.org/docs/stable/fair_scheduler.html
> >>>
> >>> While there's a good overlap of features between them, there are a few
> >>> differences that set them apart and make them each useful for
> >>> different use-cases. If I had to summarize on some such differences,
> >>> FairScheduler is better suited to SLA form of job execution situations
> >>> due to its preemptive features (which make it useful in user and
> >>> service mix scenarios), while CapacityScheduler provides
> >>> manual-resource-request oriented scheduling for odd jobs with high
> >>> memory workloads, etc. (which make it useful for running certain
> >>> specific kind of jobs along side the regular ones).
> >>>
> >>> On Tue, Nov 27, 2012 at 3:51 PM, Amit Sela <am...@infolinks.com>
> wrote:
> >>> > So this is a FairScheduler problem ?
> >>> > We are using the default Hadoop scheduler. Is there a reason to use
> the
> >>> > Fair
> >>> > Scheduler if most of the time we don't have more than 4 jobs running
> >>> > simultaneously ?
> >>> >
> >>> >
> >>> > On Tue, Nov 27, 2012 at 12:00 PM, Harsh J <ha...@cloudera.com>
> wrote:
> >>> >>
> >>> >> Hi Amit,
> >>> >>
> >>> >> He means the mapred.fairscheduler.assignmultiple FairScheduler
> >>> >> property. It is true by default, which works well for most workloads
> >>> >> if not benchmark style workloads. I would not usually trust that as
> a
> >>> >> base perf. measure of everything that comes out of an upgrade.
> >>> >>
> >>> >> The other JIRA, MAPREDUCE-4451, has been resolved for 1.2.0.
> >>> >>
> >>> >> On Tue, Nov 27, 2012 at 3:20 PM, Amit Sela <am...@infolinks.com>
> >>> >> wrote:
> >>> >> > Hi Jon,
> >>> >> >
> >>> >> > I recently upgraded our cluster from Hadoop 0.20.3-append to
> Hadoop
> >>> >> > 1.0.4
> >>> >> > and I haven't noticed any performance issues. By  "multiple
> >>> >> > assignment
> >>> >> > feature" do you mean speculative execution
> >>> >> > (mapred.map.tasks.speculative.execution and
> >>> >> > mapred.reduce.tasks.speculative.execution) ?
> >>> >> >
> >>> >> >
> >>> >> > On Mon, Nov 26, 2012 at 11:49 PM, Jon Allen <ja...@gmail.com>
> >>> >> > wrote:
> >>> >> >>
> >>> >> >> Problem solved, but worth warning others about.
> >>> >> >>
> >>> >> >> Before the upgrade the reducers for the terasort process had been
> >>> >> >> evenly
> >>> >> >> distributed around the cluster - one per task tracker in turn,
> >>> >> >> looping
> >>> >> >> around the cluster until all tasks were allocated.  After the
> >>> >> >> upgrade
> >>> >> >> all
> >>> >> >> reduce task had been submitted to small number of task trackers -
> >>> >> >> submit
> >>> >> >> tasks until the task tracker slots were full and then move onto
> the
> >>> >> >> next
> >>> >> >> task tracker.  Skewing the reducers like this quite clearly hit
> the
> >>> >> >> benchmark performance.
> >>> >> >>
> >>> >> >> The reason for this turns out to be the fair scheduler rewrite
> >>> >> >> (MAPREDUCE-2981) that appears to have subtly modified the
> behaviour
> >>> >> >> of
> >>> >> >> the
> >>> >> >> assign multiple property. Previously this property caused a
> single
> >>> >> >> map
> >>> >> >> and a
> >>> >> >> single reduce task to be allocated in a task tracker heartbeat
> >>> >> >> (rather
> >>> >> >> than
> >>> >> >> the default of a map or a reduce).  After the upgrade it
> allocates
> >>> >> >> as
> >>> >> >> many
> >>> >> >> tasks as there are available task slots.  Turning off the
> multiple
> >>> >> >> assignment feature returned the terasort to its pre-upgrade
> >>> >> >> performance.
> >>> >> >>
> >>> >> >> I can see potential benefits to this change and need to think
> >>> >> >> through
> >>> >> >> the
> >>> >> >> consequences to real world applications (though in practice we're
> >>> >> >> likely to
> >>> >> >> move away from fair scheduler due to MAPREDUCE-4451).
> >>> >> >> Investigating
> >>> >> >> this
> >>> >> >> has been a pain so to warn other user is there anywhere central
> >>> >> >> that
> >>> >> >> can be
> >>> >> >> used to record upgrade gotchas like this?
> >>> >> >>
> >>> >> >>
> >>> >> >> On Fri, Nov 23, 2012 at 12:02 PM, Jon Allen <jayayedev@gmail.com
> >
> >>> >> >> wrote:
> >>> >> >>>
> >>> >> >>> Hi,
> >>> >> >>>
> >>> >> >>> We've just upgraded our cluster from Hadoop 0.20.203 to 1.0.4
> and
> >>> >> >>> have
> >>> >> >>> hit performance problems.  Before the upgrade a 15TB terasort
> took
> >>> >> >>> about 45
> >>> >> >>> minutes, afterwards it takes just over an hour.  Looking in more
> >>> >> >>> detail it
> >>> >> >>> appears the shuffle phase has increased from 20 minutes to 40
> >>> >> >>> minutes.
> >>> >> >>> Does
> >>> >> >>> anyone have any thoughts about what'--
> >>> Harsh J
> >>>
> >
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
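
One of the checks Jon describes above - confirming that the native
compression libraries are actually being used rather than the pure Java
implementations - can be done with a tiny standalone probe.  This is only
a sketch (the class name is made up), and it assumes the cluster's Hadoop
1.x jars are on the classpath and the RPM's native library directory is on
java.library.path:

    import org.apache.hadoop.util.NativeCodeLoader;

    // Hypothetical helper: prints whether libhadoop was found and loaded.
    // A value of false means the Java compression code paths are in use.
    public class NativeCheck {
        public static void main(String[] args) {
            System.out.println("native-hadoop loaded: "
                    + NativeCodeLoader.isNativeCodeLoaded());
        }
    }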

Re: Hadoop 1.0.4 Performance Problem

Posted by Todd Lipcon <to...@cloudera.com>.
Hi Jon,

FYI, this issue in the fair scheduler was fixed by
https://issues.apache.org/jira/browse/MAPREDUCE-2905 for 1.1.0.
Though it is present again in MR2:
https://issues.apache.org/jira/browse/MAPREDUCE-3268

-Todd

On Wed, Nov 28, 2012 at 2:32 PM, Jon Allen <ja...@gmail.com> wrote:
> Jie,
>
> Simple answer - I got lucky (though obviously there are things you need to
> have in place to allow you to be lucky).
>
> Before running the upgrade I ran a set of tests to baseline the cluster
> performance, e.g. terasort, gridmix and some operational jobs.  Terasort by
> itself isn't very realistic as a cluster test but it's nice and simple to
> run and is good for regression testing things after a change.
>
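A sketch of the commands behind this kind of terasort smoke test, for anyone
who wants to reproduce it - the examples jar name is the one shipped with the
1.0.4 release, while the row count and HDFS paths are illustrative only:

    # teragen writes 100-byte rows, so ~15TB is roughly 150 billion rows
    hadoop jar hadoop-examples-1.0.4.jar teragen 150000000000 /benchmarks/tera-in
    hadoop jar hadoop-examples-1.0.4.jar terasort /benchmarks/tera-in /benchmarks/tera-out
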
> After the upgrade the intention was to run the same tests and show that the
> performance hadn't degraded (improved would have been nice but not worse was
> the minimum).  When we ran the terasort we found that performance was about
> 50% worse - execution time had gone from 40 minutes to 60 minutes.  As I've
> said, terasort doesn't provide a realistic view of operational performance
> but this showed that something major had changed and we needed to understand
> it before going further.  So how to go about diagnosing this ...
>
> First rule - understand what you're trying to achieve.  It's very easy to
> say performance isn't good enough but performance can always be better so
> you need to know what's realistic and at what point you're going to stop
> tuning things.  I had a previous baseline that I was trying to match so I
> knew what I was trying to achieve.
>
> Next thing to do is profile your job and identify where the problem is.  We
> had the full job history from the before and after jobs and comparing these
> we saw that map performance was fairly consistent as were the reduce sort
> and reduce phases.  The problem was with the shuffle, which had gone from 20
> minutes pre-upgrade to 40 minutes afterwards.  The important thing here is
> to make sure you've got as much information as possible.  If we'd just kept
> the overall job time then there would have been a lot more areas to look at
> but knowing the problem was with shuffle allowed me to focus effort in this
> area.
>
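The per-phase timings referred to above can be pulled from the retained job
history with the job client. A rough sketch (the output directory is
illustrative; the history viewer prints an analysis section with best, worst
and average map, shuffle and reduce times):

    hadoop job -history /benchmarks/tera-out > terasort-history.txt
    grep -A 12 'Analysis' terasort-history.txt
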
> So what had changed in the shuffle that might have slowed things down?  The
> first thing we thought of was that we'd moved from a tarball deployment to
> using the RPM so what effect might this have had on things.  Our operational
> configuration compresses the map output and in the past we've had problems
> with Java compression libraries being used rather than native ones and this
> has affected performance.  We knew the RPM deployment had moved the native
> library so spent some time confirming to ourselves that these were being
> used correctly (but this turned out to not be the problem).  We then spent
> time doing some process and server profiling - using dstat to look at the
> server bottlenecks and jstack/jmap to check what the task tracker and reduce
> processes were doing.  Although not directly relevant to this particular
> problem doing this was useful just to get my head around what Hadoop is
> doing at various points of the process.
>
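The per-node checks described above look roughly like this. The userlogs path
is an assumption that depends on how the RPM install laid out the log
directories, and the PIDs are placeholders:

    grep -ri 'native-hadoop' /var/log/hadoop/userlogs/ | head   # did the tasks load the native codecs?
    dstat -cdngy 5                               # CPU, disk, network and paging while the job runs
    jstack <tasktracker-pid> > tt-threads.txt    # thread dump of the TaskTracker
    jmap -histo <reduce-task-pid> | head -20     # heap histogram of a reduce attempt
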
> The next bit was one place where I got lucky - I happened to be logged onto
> one of the worker nodes when a test job was running and I noticed that there
> weren't any reduce tasks running on the server.  This was odd as we'd
> submitted more reducers than we have servers so I'd expected at least one
> task to be running on each server.  Checking the job tracker log file it
> turned out that since the upgrade the job tracker had been submitting reduce
> tasks to only 10% of the available nodes.  A different 10% each time the job
> was run so clearly the individual task trackers were working OK but there
> was something odd going on with the task allocation.  Checking the job
> tracker log file showed that before the upgrade tasks had been fairly evenly
> distributed so something had changed.  After that it was a case of digging
> around the source code to find out which classes were available for task
> allocation and what inside them had changed.  This can be quite daunting but
> if you're comfortable with Java then it's just a case of following the calls
> through the code.  Once I found the cause it was just a case of working out
> what my options were for working around it (in this case turning off the
> multiple assignment option - I can work out whether I want to turn it back
> on in slower time).
>
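A rough way to see the same skew from the JobTracker log - the exact wording
of the assignment message is an assumption, so check the pattern against your
own log:

    # count reduce-task assignments per TaskTracker
    grep 'Adding task (REDUCE)' hadoop-*-jobtracker-*.log \
      | grep -o "tracker_[^ ']*" | sort | uniq -c | sort -rn | head

The workaround itself is a single property in mapred-site.xml on the
JobTracker (the property name is the one Harsh quotes further down the
thread):

    <property>
      <name>mapred.fairscheduler.assignmultiple</name>
      <value>false</value>
    </property>
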
> Where I think we got very lucky is that we hit this problem.  The
> configuration we use for the terasort has just over 1 reducer per worker
> node rather than maxing out the available reducer slots.  This decision was
> made several years ago and I can't remember the reasons for it.  If we'd been
> using a larger number of reducers then the number of worker nodes in use
> would have been similar regardless of the allocation algorithm and so the
> performance would have looked similar before and after the upgrade.  We
> would have hit this problem eventually but probably not until we started
> running user jobs and by then it would be too late to do the intrusive
> investigations that were possible now.
>
> Hope this has been useful.
>
> Regards,
> Jon
>
>
>
> On Tue, Nov 27, 2012 at 3:08 PM, Jie Li <ji...@cs.duke.edu> wrote:
>>
>> Jon:
>>
>> This is interesting and helpful! How did you figure out the cause? And how
>> much time did you spend? Could you share some experience of performance
>> diagnosis?
>>
>> Jie
>>
>> On Tuesday, November 27, 2012, Harsh J wrote:
>>>
>>> Hi Amit,
>>>
>>> The default scheduler is FIFO, and may not work for all forms of
>>> workloads. Read the multiple schedulers available to see if they have
>>> features that may benefit your workload:
>>>
>>> Capacity Scheduler:
>>> http://hadoop.apache.org/docs/stable/capacity_scheduler.html
>>> FairScheduler:
>>> http://hadoop.apache.org/docs/stable/fair_scheduler.html
>>>
>>> While there's a good overlap of features between them, there are a few
>>> differences that set them apart and make them each useful for
>>> different use-cases. If I had to summarize on some such differences,
>>> FairScheduler is better suited to SLA form of job execution situations
>>> due to its preemptive features (which make it useful in user and
>>> service mix scenarios), while CapacityScheduler provides
>>> manual-resource-request oriented scheduling for odd jobs with high
>>> memory workloads, etc. (which make it useful for running certain
>>> specific kinds of jobs alongside the regular ones).
>>>
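For anyone wanting to try either scheduler, it is selected through
mapred-site.xml. The class names below are the stock contrib ones in the 1.x
line, and the relevant scheduler jar has to be on the JobTracker's classpath:

    <property>
      <name>mapred.jobtracker.taskScheduler</name>
      <value>org.apache.hadoop.mapred.FairScheduler</value>
      <!-- or org.apache.hadoop.mapred.CapacityTaskScheduler -->
    </property>
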
>>> On Tue, Nov 27, 2012 at 3:51 PM, Amit Sela <am...@infolinks.com> wrote:
>>> > So this is a FairScheduler problem ?
>>> > We are using the default Hadoop scheduler. Is there a reason to use the
>>> > Fair
>>> > Scheduler if most of the time we don't have more than 4 jobs running
>>> > simultaneously ?
>>> >
>>> >
>>> > On Tue, Nov 27, 2012 at 12:00 PM, Harsh J <ha...@cloudera.com> wrote:
>>> >>
>>> >> Hi Amit,
>>> >>
>>> >> He means the mapred.fairscheduler.assignmultiple FairScheduler
>>> >> property. It is true by default, which works well for most workloads
>>> >> if not for benchmark-style workloads. I would not usually trust that as a
>>> >> base perf. measure of everything that comes out of an upgrade.
>>> >>
>>> >> The other JIRA, MAPREDUCE-4451, has been resolved for 1.2.0.
>>> >>
>>> >> On Tue, Nov 27, 2012 at 3:20 PM, Amit Sela <am...@infolinks.com>
>>> >> wrote:
>>> >> > Hi Jon,
>>> >> >
>>> >> > I recently upgraded our cluster from Hadoop 0.20.3-append to Hadoop
>>> >> > 1.0.4
>>> >> > and I haven't noticed any performance issues. By  "multiple
>>> >> > assignment
>>> >> > feature" do you mean speculative execution
>>> >> > (mapred.map.tasks.speculative.execution and
>>> >> > mapred.reduce.tasks.speculative.execution) ?
>>> >> >
>>> >> >
>>> >> > On Mon, Nov 26, 2012 at 11:49 PM, Jon Allen <ja...@gmail.com>
>>> >> > wrote:
>>> >> >>
>>> >> >> Problem solved, but worth warning others about.
>>> >> >>
>>> >> >> Before the upgrade the reducers for the terasort process had been
>>> >> >> evenly
>>> >> >> distributed around the cluster - one per task tracker in turn,
>>> >> >> looping
>>> >> >> around the cluster until all tasks were allocated.  After the
>>> >> >> upgrade
>>> >> >> all
>>> >> >> reduce tasks had been submitted to a small number of task trackers -
>>> >> >> submit
>>> >> >> tasks until the task tracker slots were full and then move onto the
>>> >> >> next
>>> >> >> task tracker.  Skewing the reducers like this quite clearly hit the
>>> >> >> benchmark performance.
>>> >> >>
>>> >> >> The reason for this turns out to be the fair scheduler rewrite
>>> >> >> (MAPREDUCE-2981) that appears to have subtly modified the behaviour
>>> >> >> of
>>> >> >> the
>>> >> >> assign multiple property. Previously this property caused a single
>>> >> >> map
>>> >> >> and a
>>> >> >> single reduce task to be allocated in a task tracker heartbeat
>>> >> >> (rather
>>> >> >> than
>>> >> >> the default of a map or a reduce).  After the upgrade it allocates
>>> >> >> as
>>> >> >> many
>>> >> >> tasks as there are available task slots.  Turning off the multiple
>>> >> >> assignment feature returned the terasort to its pre-upgrade
>>> >> >> performance.
>>> >> >>
>>> >> >> I can see potential benefits to this change and need to think
>>> >> >> through
>>> >> >> the
>>> >> >> consequences to real world applications (though in practice we're
>>> >> >> likely to
>>> >> >> move away from fair scheduler due to MAPREDUCE-4451).
>>> >> >> Investigating
>>> >> >> this
>>> >> >> has been a pain, so to warn other users: is there anywhere central
>>> >> >> that
>>> >> >> can be
>>> >> >> used to record upgrade gotchas like this?
>>> >> >>
>>> >> >>
>>> >> >> On Fri, Nov 23, 2012 at 12:02 PM, Jon Allen <ja...@gmail.com>
>>> >> >> wrote:
>>> >> >>>
>>> >> >>> Hi,
>>> >> >>>
>>> >> >>> We've just upgraded our cluster from Hadoop 0.20.203 to 1.0.4 and
>>> >> >>> have
>>> >> >>> hit performance problems.  Before the upgrade a 15TB terasort took
>>> >> >>> about 45
>>> >> >>> minutes, afterwards it takes just over an hour.  Looking in more
>>> >> >>> detail it
>>> >> >>> appears the shuffle phase has increased from 20 minutes to 40
>>> >> >>> minutes.
>>> >> >>> Does
>>> >> >>> anyone have any thoughts about what'--
>>> Harsh J
>>>
>



-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Hadoop 1.0.4 Performance Problem

Posted by Jie Li <ji...@cs.duke.edu>.
Hi Chris,

The standalone log analyzer was released in December and designed to
be easier to use.

Regarding the license, I think it's OK to use it in a commercial
environment for evaluation purposes, and your feedback would help
us to improve it.

Jie

On Tue, Dec 18, 2012 at 1:02 AM, Chris Smith <cs...@gmail.com> wrote:
> Jie,
>
> Recent was over 11 months ago.  :-)
>
> Unfortunately the software licence requires that most of us 'negotiate' a
> commercial use license before we trial the software in a commercial
> environment:
> http://www.cs.duke.edu/starfish/files/SOFTWARE_LICENSE_AGREEMENT.txt and as
> clarified here:  http://www.cs.duke.edu/starfish/previous.html
>
> Under that last URL was a note that you were soon to distribute the source
> code under the Apache Software License.  Last time I asked the reply was
> that this would not happen.  Perhaps it is time to update your web pages or
> your license arrangements.  :-)
>
> I like what I saw on my home 'cluster' but have not had the time to sort out
> licensing to trial this in a commercial environment.
>
> Chris
>
>
>
>
>
> On 14 December 2012 01:46, Jie Li <ji...@cs.duke.edu> wrote:
>>
>> Hi Jon,
>>
>> Thanks for sharing these insights! Can't agree with you more!
>>
>> Recently we released a tool called Starfish Hadoop Log Analyzer for
>> analyzing the job histories. I believe it can quickly point out this
>> reduce problem you hit!
>>
>> http://www.cs.duke.edu/starfish/release.html
>>
>> Jie
>>
>> On Wed, Nov 28, 2012 at 5:32 PM, Jon Allen <ja...@gmail.com> wrote:
>> > Jie,
>> >
>> > Simple answer - I got lucky (though obviously there are thing you need
>> > to
>> > have in place to allow you to be lucky).
>> >
>> > Before running the upgrade I ran a set of tests to baseline the cluster
>> > performance, e.g. terasort, gridmix and some operational jobs.  Terasort
>> > by
>> > itself isn't very realistic as a cluster test but it's nice and simple
>> > to
>> > run and is good for regression testing things after a change.
>> >
>> > After the upgrade the intention was to run the same tests and show that
>> > the
>> > performance hadn't degraded (improved would have been nice but not worse
>> > was
>> > the minimum).  When we ran the terasort we found that performance was
>> > about
>> > 50% worse - execution time had gone from 40 minutes to 60 minutes.  As
>> > I've
>> > said, terasort doesn't provide a realistic view of operational
>> > performance
>> > but this showed that something major had changed and we needed to
>> > understand
>> > it before going further.  So how to go about diagnosing this ...
>> >
>> > First rule - understand what you're trying to achieve.  It's very easy
>> > to
>> > say performance isn't good enough but performance can always be better
>> > so
>> > you need to know what's realistic and at what point you're going to stop
>> > tuning things.  I had a previous baseline that I was trying to match so
>> > I
>> > knew what I was trying to achieve.
>> >
>> > Next thing to do is profile your job and identify where the problem is.
>> > We
>> > had the full job history from the before and after jobs and comparing
>> > these
>> > we saw that map performance was fairly consistent as were the reduce
>> > sort
>> > and reduce phases.  The problem was with the shuffle, which had gone
>> > from 20
>> > minutes pre-upgrade to 40 minutes afterwards.  The important thing here
>> > is
>> > to make sure you've got as much information as possible.  If we'd just
>> > kept
>> > the overall job time then there would have been a lot more areas to look
>> > at
>> > but knowing the problem was with shuffle allowed me to focus effort in
>> > this
>> > area.
>> >
>> > So what had changed in the shuffle that may have slowed things down.
>> > The
>> > first thing we thought of was that we'd moved from a tarball deployment
>> > to
>> > using the RPM so what effect might this have had on things.  Our
>> > operational
>> > configuration compresses the map output and in the past we've had
>> > problems
>> > with Java compression libraries being used rather than native ones and
>> > this
>> > has affected performance.  We knew the RPM deployment had moved the
>> > native
>> > library so spent some time confirming to ourselves that these were being
>> > used correctly (but this turned out to not be the problem).  We then
>> > spent
>> > time doing some process and server profiling - using dstat to look at
>> > the
>> > server bottlenecks and jstack/jmap to check what the task tracker and
>> > reduce
>> > processes were doing.  Although not directly relevant to this particular
>> > problem doing this was useful just to get my head around what Hadoop is
>> > doing at various points of the process.
>> >
>> > The next bit was one place where I got lucky - I happened to be logged

Re: Hadoop 1.0.4 Performance Problem

Posted by Jie Li <ji...@cs.duke.edu>.
Hi Chris,

The standalone log analyzer was released in December and is designed to
be easier to use.

Regarding the license, I think it's OK to use it in a commercial
environment for evaluation purposes, and your feedback would help us
to improve it.

Jie

On Tue, Dec 18, 2012 at 1:02 AM, Chris Smith <cs...@gmail.com> wrote:
> Jie,
>
> 'Recently' was over 11 months ago.  :-)
>
> Unfortunately the software licence requires that most of us 'negotiate' a
> commercial use license before we trial the software in a commercial
> environment:
> http://www.cs.duke.edu/starfish/files/SOFTWARE_LICENSE_AGREEMENT.txt and as
> clarified here:  http://www.cs.duke.edu/starfish/previous.html
>
> Under that last URL was a note that you were soon to distribute the source
> code under the Apache Software License.  Last time I asked the reply was
> that this would not happen.  Perhaps it is time to update your web pages or
> your license arrangements.  :-)
>
> I like what I saw on my home 'cluster' but have not the time to sort out
> licensing to trial this in a commercial environment.
>
> Chris
>
>
>
>
>
> On 14 December 2012 01:46, Jie Li <ji...@cs.duke.edu> wrote:
>>
>> Hi Jon,
>>
>> Thanks for sharing these insights! Can't agree with you more!
>>
>> Recently we released a tool called Starfish Hadoop Log Analyzer for
>> analyzing the job histories. I believe it can quickly point out this
>> reduce problem you met!
>>
>> http://www.cs.duke.edu/starfish/release.html
>>
>> Jie
>>
>> On Wed, Nov 28, 2012 at 5:32 PM, Jon Allen <ja...@gmail.com> wrote:
>> > Jie,
>> >
>> > Simple answer - I got lucky (though obviously there are things you need
>> > to
>> > have in place to allow you to be lucky).
>> >
>> > Before running the upgrade I ran a set of tests to baseline the cluster
>> > performance, e.g. terasort, gridmix and some operational jobs.  Terasort
>> > by
>> > itself isn't very realistic as a cluster test but it's nice and simple
>> > to
>> > run and is good for regression testing things after a change.
>> >
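
As a rough illustration of that kind of repeatable baseline (this is not
taken from the thread), here is a minimal sketch that times a
teragen/terasort pair with the stock examples jar.  The jar path, HDFS
directories and row count are assumptions to adjust for your own install
and cluster size.

    #!/usr/bin/env python
    # Minimal sketch: time a teragen + terasort pair so that before/after
    # upgrade runs can be compared like for like.  The examples jar path,
    # HDFS directories and row count below are placeholders, not a known
    # layout - point them at your own install (each row is 100 bytes).
    import subprocess
    import time

    EXAMPLES_JAR = '/usr/share/hadoop/hadoop-examples-1.0.4.jar'  # assumed path
    ROWS = 10 * 1000 * 1000 * 1000            # ~1 TB of 100-byte rows
    GEN_DIR = '/benchmarks/terasort-input'
    SORT_DIR = '/benchmarks/terasort-output'

    def run(label, args):
        """Run one hadoop examples job and report its wall-clock time."""
        start = time.time()
        subprocess.check_call(['hadoop', 'jar', EXAMPLES_JAR] + args)
        print('%-8s finished in %.1f minutes' % (label, (time.time() - start) / 60.0))

    run('teragen', ['teragen', str(ROWS), GEN_DIR])
    run('terasort', ['terasort', GEN_DIR, SORT_DIR])

Keeping the timings from each run alongside the full job history files is
what makes the later phase-by-phase comparison possible.
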
>> > After the upgrade the intention was to run the same tests and show that
>> > the
>> > performance hadn't degraded (improved would have been nice but not worse
>> > was
>> > the minimum).  When we ran the terasort we found that performance was
>> > about
>> > 50% worse - execution time had gone from 40 minutes to 60 minutes.  As
>> > I've
>> > said, terasort doesn't provide a realistic view of operational
>> > performance
>> > but this showed that something major had changed and we needed to
>> > understand
>> > it before going further.  So how to go about diagnosing this ...
>> >
>> > First rule - understand what you're trying to achieve.  It's very easy
>> > to
>> > say performance isn't good enough but performance can always be better
>> > so
>> > you need to know what's realistic and at what point you're going to stop
>> > tuning things.  I had a previous baseline that I was trying to match so
>> > I
>> > knew what I was trying to achieve.
>> >
>> > Next thing to do is profile your job and identify where the problem is.
>> > We
>> > had the full job history from the before and after jobs and comparing
>> > these
>> > we saw that map performance was fairly consistent as were the reduce
>> > sort
>> > and reduce phases.  The problem was with the shuffle, which had gone
>> > from 20
>> > minutes pre-upgrade to 40 minutes afterwards.  The important thing here
>> > is
>> > to make sure you've got as much information as possible.  If we'd just
>> > kept
>> > the overall job time then there would have been a lot more areas to look
>> > at
>> > but knowing the problem was with shuffle allowed me to focus effort in
>> > this
>> > area.
>> >
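
For that phase-by-phase comparison, something along these lines can pull
shuffle times out of a Hadoop 1.x job history file.  It is only a sketch:
it assumes the pre-YARN text format in which ReduceAttempt records carry
key="value" pairs such as START_TIME and SHUFFLE_FINISHED, and those field
names can vary slightly between releases.

    #!/usr/bin/env python
    # Sketch: summarise reduce-side shuffle durations from a Hadoop 1.x job
    # history file so two runs can be compared phase by phase.  Assumes the
    # pre-YARN text format, where ReduceAttempt records carry key="value"
    # pairs including START_TIME and SHUFFLE_FINISHED (millis since epoch);
    # field names may differ slightly between releases.
    import re
    import sys
    from collections import defaultdict

    KV_RE = re.compile(r'(\w+)="([^"]*)"')
    attempts = defaultdict(dict)   # attempt id -> merged key/value fields

    with open(sys.argv[1]) as history:
        for line in history:
            if not line.startswith('ReduceAttempt'):
                continue
            fields = dict(KV_RE.findall(line))
            attempts[fields.get('TASK_ATTEMPT_ID', '?')].update(fields)

    shuffle_mins = []
    for fields in attempts.values():
        try:
            start = int(fields['START_TIME'])
            shuffled = int(fields['SHUFFLE_FINISHED'])
        except (KeyError, ValueError):
            continue   # attempt record without a completed shuffle
        shuffle_mins.append((shuffled - start) / 60000.0)

    if shuffle_mins:
        print('%d reduce attempts with a recorded shuffle' % len(shuffle_mins))
        print('mean shuffle %.1f min, max %.1f min' %
              (sum(shuffle_mins) / len(shuffle_mins), max(shuffle_mins)))
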
>> > So what had changed in the shuffle that may have slowed things down.
>> > The
>> > first thing we thought of was that we'd moved from a tarball deployment
>> > to
>> > using the RPM so what effect might this have had on things.  Our
>> > operational
>> > configuration compresses the map output and in the past we've had
>> > problems
>> > with Java compression libraries being used rather than native ones and
>> > this
>> > has affected performance.  We knew the RPM deployment had moved the
>> > native
>> > library so spent some time confirming to ourselves that these were being
>> > used correctly (but this turned out to not be the problem).  We then
>> > spent
>> > time doing some process and server profiling - using dstat to look at
>> > the
>> > server bottlenecks and jstack/jmap to check what the task tracker and
>> > reduce
>> > processes were doing.  Although not directly relevant to this particular
>> > problem doing this was useful just to get my head around what Hadoop is
>> > doing at various points of the process.
>> >
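
A tiny helper along these lines covers the jstack side of that sort of
poking around: it just snapshots a JVM's thread stacks at a fixed interval
so you can see where a TaskTracker or reduce task spends its time during
the shuffle.  Run it as the user that owns the process; the pid, interval
and snapshot count are whatever suits the investigation.

    #!/usr/bin/env python
    # Sketch: capture periodic jstack snapshots of one JVM (e.g. a TaskTracker
    # or a reduce task child) so its thread stacks can be compared over time.
    # Usage: <script> <pid> [interval_seconds] [snapshot_count]
    import subprocess
    import sys
    import time

    pid = sys.argv[1]
    interval = int(sys.argv[2]) if len(sys.argv) > 2 else 10
    count = int(sys.argv[3]) if len(sys.argv) > 3 else 6

    for i in range(count):
        stacks = subprocess.check_output(['jstack', pid])
        with open('jstack-%s-%02d.txt' % (pid, i), 'wb') as dump:
            dump.write(stacks)
        time.sleep(interval)
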
>> > The next bit was one place where I got lucky - I happened to be logged
>> > onto
>> > one of the worker nodes when a test job was running and I noticed that
>> > there
>> > weren't any reduce tasks running on the server.  This was odd as we'd
>> > submitted more reducers than we have servers so I'd expected at least
>> > one
>> > task to be running on each server.  Checking the job tracker log file it
>> > turned out that since the upgrade the job tracker had been submitting
>> > reduce
>> > tasks to only 10% of the available nodes.  A different 10% each time the
>> > job
>> > was run so clearly the individual task trackers were working OK but
>> > there
>> > was something odd going on with the task allocation.  Checking the job
>> > tracker log file showed that before the upgrade tasks had been fairly
>> > evenly
>> > distributed so something had changed.  After that it was a case of
>> > digging
>> > around the source code to find out which classes were available for task
>> > allocation and what inside them had changed.  This can be quite daunting
>> > but
>> > if you're comfortable with Java then it's just a case of following the
>> > calls
>> > through the code.  Once I found the cause it was just a case of working
>> > out
>> > what my options were for working around it (in this case turning off the
>> > multiple assignment option - I can work out whether I want to turn it
>> > back
>> > on in slower time).
>> >
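
The log check described above boils down to counting how many reduce
attempts each TaskTracker was given.  A rough sketch, assuming Hadoop 1.x
JobTracker log lines of the shape "Adding task (REDUCE) 'attempt_..._r_...'
... for tracker '...'"; the exact wording moves around between releases, so
treat the regex as a starting point rather than a known format.

    #!/usr/bin/env python
    # Sketch: tally reduce task attempts per TaskTracker from a JobTracker log
    # to spot the kind of allocation skew described above.  The regex assumes
    # log lines roughly like
    #   ... Adding task (REDUCE) 'attempt_..._r_..._0' ... for tracker 'tracker_...'
    # and will need adjusting if your release words the message differently.
    import re
    import sys
    from collections import Counter

    ASSIGN_RE = re.compile(r"Adding task \(REDUCE\) '(attempt_\S+_r_\S+)'.*"
                           r"for tracker '([^']+)'")

    counts = Counter()
    with open(sys.argv[1]) as log:
        for line in log:
            match = ASSIGN_RE.search(line)
            if match:
                counts[match.group(2)] += 1

    for tracker, assigned in counts.most_common():
        print('%6d  %s' % (assigned, tracker))
    print('%d reduce attempts across %d trackers' %
          (sum(counts.values()), len(counts)))

A healthy run should spread the attempts across roughly as many trackers
as there are reducers; a handful of trackers soaking up everything is the
skew being described here.
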
>> > Where I think we got very lucky is that we hit this problem.  The
>> > configuration we use for the terasort has just over 1 reducer per worker
>> > node rather than maxing out the available reducer slots.  This decision
>> > was
>> > made several years ago and I can't remember the reasons for it.  If we'd
>> > been
>> > using a larger number of reducers then the number of worker nodes in use
>> > would have been similar regardless of the allocation algorithm and so
>> > the
>> > performance would have looked similar before and after the upgrade.  We
>> > would have hit this problem eventually but probably not until we started
>> > running user jobs and by then it would be too late to do the intrusive
>> > investigations that were possible now.
>> >
>> > Hope this has been useful.
>> >
>> > Regards,
>> > Jon
>> >
>> >
>> >
>> > On Tue, Nov 27, 2012 at 3:08 PM, Jie Li <ji...@cs.duke.edu> wrote:
>> >>
>> >> Jon:
>> >>
>> >> This is interesting and helpful! How did you figure out the cause? And
>> >> how
>> >> much time did you spend? Could you share some experience of performance
>> >> diagnosis?
>> >>
>> >> Jie
>> >>
>> >> On Tuesday, November 27, 2012, Harsh J wrote:
>> >>>
>> >>> Hi Amit,
>> >>>
>> >>> The default scheduler is FIFO, and may not work for all forms of
>> >>> workloads. Read the multiple schedulers available to see if they have
>> >>> features that may benefit your workload:
>> >>>
>> >>> Capacity Scheduler:
>> >>> http://hadoop.apache.org/docs/stable/capacity_scheduler.html
>> >>> FairScheduler:
>> >>> http://hadoop.apache.org/docs/stable/fair_scheduler.html
>> >>>
>> >>> While there's a good overlap of features between them, there are a few
>> >>> differences that set them apart and make them each useful for
>> >>> different use-cases. If I had to summarize on some such differences,
>> >>> FairScheduler is better suited to SLA form of job execution situations
>> >>> due to its preemptive features (which make it useful in user and
>> >>> service mix scenarios), while CapacityScheduler provides
>> >>> manual-resource-request oriented scheduling for odd jobs with high
>> >>> memory workloads, etc. (which make it useful for running certain
>> >>> specific kind of jobs along side the regular ones).
>> >>>
>> >>> On Tue, Nov 27, 2012 at 3:51 PM, Amit Sela <am...@infolinks.com>
>> >>> wrote:
>> >>> > So this is a FairScheduler problem ?
>> >>> > We are using the default Hadoop scheduler. Is there a reason to use
>> >>> > the
>> >>> > Fair
>> >>> > Scheduler if most of the time we don't have more than 4 jobs running
>> >>> > simultaneously ?
>> >>> >
>> >>> >
>> >>> > On Tue, Nov 27, 2012 at 12:00 PM, Harsh J <ha...@cloudera.com>
>> >>> > wrote:
>> >>> >>
>> >>> >> Hi Amit,
>> >>> >>
>> >>> >> He means the mapred.fairscheduler.assignmultiple FairScheduler
>> >>> >> property. It is true by default, which works well for most
>> >>> >> workloads
>> >>> >> if not benchmark style workloads. I would not usually trust that as
>> >>> >> a
>> >>> >> base perf. measure of everything that comes out of an upgrade.
>> >>> >>
>> >>> >> The other JIRA, MAPREDUCE-4451, has been resolved for 1.2.0.
>> >>> >>
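
For anyone wanting to confirm what their own JobTracker is running with, a
quick sketch that reads the scheduler and assign-multiple settings out of
mapred-site.xml.  The /etc/hadoop path is an assumption for an RPM-style
install, and the property only has an effect when the FairScheduler is the
configured task scheduler.

    #!/usr/bin/env python
    # Sketch: report the task scheduler and fair-scheduler assign-multiple
    # settings from a Hadoop 1.x mapred-site.xml.  The default config path is
    # an assumption (RPM-style install); pass another path as an argument.
    import sys
    import xml.etree.ElementTree as ET

    conf_path = sys.argv[1] if len(sys.argv) > 1 else '/etc/hadoop/mapred-site.xml'

    props = {}
    for prop in ET.parse(conf_path).getroot().findall('property'):
        name = prop.findtext('name')
        if name:
            props[name] = prop.findtext('value')

    print('mapred.jobtracker.taskScheduler      = %s'
          % props.get('mapred.jobtracker.taskScheduler', '<unset: default FIFO>'))
    print('mapred.fairscheduler.assignmultiple  = %s'
          % props.get('mapred.fairscheduler.assignmultiple',
                      '<unset: defaults to true>'))
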
>> >>> >> On Tue, Nov 27, 2012 at 3:20 PM, Amit Sela <am...@infolinks.com>
>> >>> >> wrote:
>> >>> >> > Hi Jon,
>> >>> >> >
>> >>> >> > I recently upgraded our cluster from Hadoop 0.20.3-append to
>> >>> >> > Hadoop
>> >>> >> > 1.0.4
>> >>> >> > and I haven't noticed any performance issues. By  "multiple
>> >>> >> > assignment
>> >>> >> > feature" do you mean speculative execution
>> >>> >> > (mapred.map.tasks.speculative.execution and
>> >>> >> > mapred.reduce.tasks.speculative.execution) ?
>> >>> >> >
>> >>> >> >
>> >>> >> > On Mon, Nov 26, 2012 at 11:49 PM, Jon Allen <ja...@gmail.com>
>> >>> >> > wrote:
>> >>> >> >>
>> >>> >> >> Problem solved, but worth warning others about.
>> >>> >> >>
>> >>> >> >> Before the upgrade the reducers for the terasort process had
>> >>> >> >> been
>> >>> >> >> evenly
>> >>> >> >> distributed around the cluster - one per task tracker in turn,
>> >>> >> >> looping
>> >>> >> >> around the cluster until all tasks were allocated.  After the
>> >>> >> >> upgrade
>> >>> >> >> all
>> >>> >> >> reduce task had been submitted to small number of task trackers
>> >>> >> >> -
>> >>> >> >> submit
>> >>> >> >> tasks until the task tracker slots were full and then move onto
>> >>> >> >> the
>> >>> >> >> next
>> >>> >> >> task tracker.  Skewing the reducers like this quite clearly hit
>> >>> >> >> the
>> >>> >> >> benchmark performance.
>> >>> >> >>
>> >>> >> >> The reason for this turns out to be the fair scheduler rewrite
>> >>> >> >> (MAPREDUCE-2981) that appears to have subtly modified the
>> >>> >> >> behaviour
>> >>> >> >> of
>> >>> >> >> the
>> >>> >> >> assign multiple property. Previously this property caused a
>> >>> >> >> single
>> >>> >> >> map
>> >>> >> >> and a
>> >>> >> >> single reduce task to be allocated in a task tracker heartbeat
>> >>> >> >> (rather
>> >>> >> >> than
>> >>> >> >> the default of a map or a reduce).  After the upgrade it
>> >>> >> >> allocates
>> >>> >> >> as
>> >>> >> >> many
>> >>> >> >> tasks as there are available task slots.  Turning off the
>> >>> >> >> multiple
>> >>> >> >> assignment feature returned the terasort to its pre-upgrade
>> >>> >> >> performance.
>> >>> >> >>
>> >>> >> >> I can see potential benefits to this change and need to think
>> >>> >> >> through
>> >>> >> >> the
>> >>> >> >> consequences to real world applications (though in practice
>> >>> >> >> we're
>> >>> >> >> likely to
>> >>> >> >> move away from fair scheduler due to MAPREDUCE-4451).
>> >>> >> >> Investigating
>> >>> >> >> this
>> >>> >> >> has been a pain, so to warn other users, is there anywhere central
>> >>> >> >> that
>> >>> >> >> can be
>> >>> >> >> used to record upgrade gotchas like this?
>> >>> >> >>
>> >>> >> >>
>> >>> >> >> On Fri, Nov 23, 2012 at 12:02 PM, Jon Allen
>> >>> >> >> <ja...@gmail.com>
>> >>> >> >> wrote:
>> >>> >> >>>
>> >>> >> >>> Hi,
>> >>> >> >>>
>> >>> >> >>> We've just upgraded our cluster from Hadoop 0.20.203 to 1.0.4
>> >>> >> >>> and
>> >>> >> >>> have
>> >>> >> >>> hit performance problems.  Before the upgrade a 15TB terasort
>> >>> >> >>> took
>> >>> >> >>> about 45
>> >>> >> >>> minutes, afterwards it takes just over an hour.  Looking in
>> >>> >> >>> more
>> >>> >> >>> detail it
>> >>> >> >>> appears the shuffle phase has increased from 20 minutes to 40
>> >>> >> >>> minutes.
>> >>> >> >>> Does
>> >>> >> >>> anyone have any thoughts about what'--
>> >>> Harsh J
>> >>>
>> >
>
>

Re: Hadoop 1.0.4 Performance Problem

Posted by Chris Smith <cs...@gmail.com>.
Jie,

'Recently' was over 11 months ago.  :-)

Unfortunately the software licence requires that most of us 'negotiate' a
commercial use license before we trial the software in a commercial
environment:
http://www.cs.duke.edu/starfish/files/SOFTWARE_LICENSE_AGREEMENT.txt and as
clarified here:  http://www.cs.duke.edu/starfish/previous.html

Under that last URL was a note that you were soon to distribute the source
code under the Apache Software License.  Last time I asked the reply was
that this would not happen.  Perhaps it is time to update your web pages or
your license arrangements.  :-)

I like what I saw on my home 'cluster' but have not the time to sort out
licensing to trial this in a commercial environment.

Chris




On 14 December 2012 01:46, Jie Li <ji...@cs.duke.edu> wrote:

> Hi Jon,
>
> Thanks for sharing these insights! Can't agree with you more!
>
> Recently we released a tool called Starfish Hadoop Log Analyzer for
> analyzing the job histories. I believe it can quickly point out this
> reduce problem you met!
>
> http://www.cs.duke.edu/starfish/release.html
>
> Jie
>
> On Wed, Nov 28, 2012 at 5:32 PM, Jon Allen <ja...@gmail.com> wrote:
> > Jie,
> >
> > Simple answer - I got lucky (though obviously there are things you need to
> > have in place to allow you to be lucky).
> >
> > Before running the upgrade I ran a set of tests to baseline the cluster
> > performance, e.g. terasort, gridmix and some operational jobs.  Terasort
> by
> > itself isn't very realistic as a cluster test but it's nice and simple to
> > run and is good for regression testing things after a change.
> >
> > After the upgrade the intention was to run the same tests and show that
> the
> > performance hadn't degraded (improved would have been nice but not worse
> was
> > the minimum).  When we ran the terasort we found that performance was
> about
> > 50% worse - execution time had gone from 40 minutes to 60 minutes.  As
> I've
> > said, terasort doesn't provide a realistic view of operational
> performance
> > but this showed that something major had changed and we needed to
> understand
> > it before going further.  So how to go about diagnosing this ...
> >
> > First rule - understand what you're trying to achieve.  It's very easy to
> > say performance isn't good enough but performance can always be better so
> > you need to know what's realistic and at what point you're going to stop
> > tuning things.  I had a previous baseline that I was trying to match so I
> > knew what I was trying to achieve.
> >
> > Next thing to do is profile your job and identify where the problem is.
>  We
> > had the full job history from the before and after jobs and comparing
> these
> > we saw that map performance was fairly consistent as were the reduce sort
> > and reduce phases.  The problem was with the shuffle, which had gone
> from 20
> > minutes pre-upgrade to 40 minutes afterwards.  The important thing here
> is
> > to make sure you've got as much information as possible.  If we'd just
> kept
> > the overall job time then there would have been a lot more areas to look
> at
> > but knowing the problem was with shuffle allowed me to focus effort in
> this
> > area.
> >
> > So what had changed in the shuffle that may have slowed things down.  The
> > first thing we thought of was that we'd moved from a tarball deployment
> to
> > using the RPM so what effect might this have had on things.  Our
> operational
> > configuration compresses the map output and in the past we've had
> problems
> > with Java compression libraries being used rather than native ones and
> this
> > has affected performance.  We knew the RPM deployment had moved the
> native
> > library so spent some time confirming to ourselves that these were being
> > used correctly (but this turned out to not be the problem).  We then
> spent
> > time doing some process and server profiling - using dstat to look at the
> > server bottlenecks and jstack/jmap to check what the task tracker and
> reduce
> > processes were doing.  Although not directly relevant to this particular
> > problem doing this was useful just to get my head around what Hadoop is
> > doing at various points of the process.
> >
> > The next bit was one place where I got lucky - I happened to be logged
> onto
> > one of the worker nodes when a test job was running and I noticed that
> there
> > weren't any reduce tasks running on the server.  This was odd as we'd
> > submitted more reducers than we have servers so I'd expected at least one
> > task to be running on each server.  Checking the job tracker log file it
> > turned out that since the upgrade the job tracker had been submitting
> reduce
> > tasks to only 10% of the available nodes.  A different 10% each time the
> job
> > was run so clearly the individual task trackers were working OK but there
> > was something odd going on with the task allocation.  Checking the job
> > tracker log file showed that before the upgrade tasks had been fairly
> evenly
> > distributed so something had changed.  After that it was a case of
> digging
> > around the source code to find out which classes were available for task
> > allocation and what inside them had changed.  This can be quite daunting
> but
> > if you're comfortable with Java then it's just a case of following the
> calls
> > through the code.  Once I found the cause it was just a case of working
> out
> > what my options were for working around it (in this case turning off the
> > multiple assignment option - I can work out whether I want to turn it
> back
> > on in slower time).
> >
> > Where I think we got very lucky is that we hit this problem.  The
> > configuration we use for the terasort has just over 1 reducer per worker
> > node rather than maxing out the available reducer slots.  This decision
> was
> > made several years ago and I can't remember the reasons for it.  If we'd been
> > using a larger number of reducers then the number of worker nodes in use
> > would have been similar regardless of the allocation algorithm and so the
> > performance would have looked similar before and after the upgrade.  We
> > would have hit this problem eventually but probably not until we started
> > running user jobs and by then it would be too late to do the intrusive
> > investigations that were possible now.
> >
> > Hope this has been useful.
> >
> > Regards,
> > Jon
> >
> >
> >
> > On Tue, Nov 27, 2012 at 3:08 PM, Jie Li <ji...@cs.duke.edu> wrote:
> >>
> >> Jon:
> >>
> >> This is interesting and helpful! How did you figure out the cause? And
> how
> >> much time did you spend? Could you share some experience of performance
> >> diagnosis?
> >>
> >> Jie
> >>
> >> On Tuesday, November 27, 2012, Harsh J wrote:
> >>>
> >>> Hi Amit,
> >>>
> >>> The default scheduler is FIFO, and may not work for all forms of
> >>> workloads. Read the multiple schedulers available to see if they have
> >>> features that may benefit your workload:
> >>>
> >>> Capacity Scheduler:
> >>> http://hadoop.apache.org/docs/stable/capacity_scheduler.html
> >>> FairScheduler:
> >>> http://hadoop.apache.org/docs/stable/fair_scheduler.html
> >>>
> >>> While there's a good overlap of features between them, there are a few
> >>> differences that set them apart and make them each useful for
> >>> different use-cases. If I had to summarize on some such differences,
> >>> FairScheduler is better suited to SLA form of job execution situations
> >>> due to its preemptive features (which make it useful in user and
> >>> service mix scenarios), while CapacityScheduler provides
> >>> manual-resource-request oriented scheduling for odd jobs with high
> >>> memory workloads, etc. (which make it useful for running certain
> >>> specific kind of jobs along side the regular ones).
> >>>
> >>> On Tue, Nov 27, 2012 at 3:51 PM, Amit Sela <am...@infolinks.com>
> wrote:
> >>> > So this is a FairScheduler problem ?
> >>> > We are using the default Hadoop scheduler. Is there a reason to use
> the
> >>> > Fair
> >>> > Scheduler if most of the time we don't have more than 4 jobs running
> >>> > simultaneously ?
> >>> >
> >>> >
> >>> > On Tue, Nov 27, 2012 at 12:00 PM, Harsh J <ha...@cloudera.com>
> wrote:
> >>> >>
> >>> >> Hi Amit,
> >>> >>
> >>> >> He means the mapred.fairscheduler.assignmultiple FairScheduler
> >>> >> property. It is true by default, which works well for most workloads
> >>> >> if not benchmark style workloads. I would not usually trust that as
> a
> >>> >> base perf. measure of everything that comes out of an upgrade.
> >>> >>
> >>> >> The other JIRA, MAPREDUCE-4451, has been resolved for 1.2.0.
> >>> >>
> >>> >> On Tue, Nov 27, 2012 at 3:20 PM, Amit Sela <am...@infolinks.com>
> >>> >> wrote:
> >>> >> > Hi Jon,
> >>> >> >
> >>> >> > I recently upgraded our cluster from Hadoop 0.20.3-append to
> Hadoop
> >>> >> > 1.0.4
> >>> >> > and I haven't noticed any performance issues. By  "multiple
> >>> >> > assignment
> >>> >> > feature" do you mean speculative execution
> >>> >> > (mapred.map.tasks.speculative.execution and
> >>> >> > mapred.reduce.tasks.speculative.execution) ?
> >>> >> >
> >>> >> >
> >>> >> > On Mon, Nov 26, 2012 at 11:49 PM, Jon Allen <ja...@gmail.com>
> >>> >> > wrote:
> >>> >> >>
> >>> >> >> Problem solved, but worth warning others about.
> >>> >> >>
> >>> >> >> Before the upgrade the reducers for the terasort process had been
> >>> >> >> evenly
> >>> >> >> distributed around the cluster - one per task tracker in turn,
> >>> >> >> looping
> >>> >> >> around the cluster until all tasks were allocated.  After the
> >>> >> >> upgrade
> >>> >> >> all
> >>> >> >> reduce task had been submitted to small number of task trackers -
> >>> >> >> submit
> >>> >> >> tasks until the task tracker slots were full and then move onto
> the
> >>> >> >> next
> >>> >> >> task tracker.  Skewing the reducers like this quite clearly hit
> the
> >>> >> >> benchmark performance.
> >>> >> >>
> >>> >> >> The reason for this turns out to be the fair scheduler rewrite
> >>> >> >> (MAPREDUCE-2981) that appears to have subtly modified the
> behaviour
> >>> >> >> of
> >>> >> >> the
> >>> >> >> assign multiple property. Previously this property caused a
> single
> >>> >> >> map
> >>> >> >> and a
> >>> >> >> single reduce task to be allocated in a task tracker heartbeat
> >>> >> >> (rather
> >>> >> >> than
> >>> >> >> the default of a map or a reduce).  After the upgrade it
> allocates
> >>> >> >> as
> >>> >> >> many
> >>> >> >> tasks as there are available task slots.  Turning off the
> multiple
> >>> >> >> assignment feature returned the terasort to its pre-upgrade
> >>> >> >> performance.
> >>> >> >>
> >>> >> >> I can see potential benefits to this change and need to think
> >>> >> >> through
> >>> >> >> the
> >>> >> >> consequences to real world applications (though in practice we're
> >>> >> >> likely to
> >>> >> >> move away from fair scheduler due to MAPREDUCE-4451).
> >>> >> >> Investigating
> >>> >> >> this
> >>> >> >> has been a pain, so to warn other users, is there anywhere central
> >>> >> >> that
> >>> >> >> can be
> >>> >> >> used to record upgrade gotchas like this?
> >>> >> >>
> >>> >> >>
> >>> >> >> On Fri, Nov 23, 2012 at 12:02 PM, Jon Allen <jayayedev@gmail.com
> >
> >>> >> >> wrote:
> >>> >> >>>
> >>> >> >>> Hi,
> >>> >> >>>
> >>> >> >>> We've just upgraded our cluster from Hadoop 0.20.203 to 1.0.4
> and
> >>> >> >>> have
> >>> >> >>> hit performance problems.  Before the upgrade a 15TB terasort
> took
> >>> >> >>> about 45
> >>> >> >>> minutes, afterwards it takes just over an hour.  Looking in more
> >>> >> >>> detail it
> >>> >> >>> appears the shuffle phase has increased from 20 minutes to 40
> >>> >> >>> minutes.
> >>> >> >>> Does
> >>> >> >>> anyone have any thoughts about what'--
> >>> Harsh J
> >>>
> >
>

> >>> manual-resource-request oriented scheduling for odd jobs with high
> >>> memory workloads, etc. (which make it useful for running certain
> >>> specific kind of jobs along side the regular ones).
> >>>
> >>> On Tue, Nov 27, 2012 at 3:51 PM, Amit Sela <am...@infolinks.com>
> wrote:
> >>> > So this is a FairScheduler problem ?
> >>> > We are using the default Hadoop scheduler. Is there a reason to use
> the
> >>> > Fair
> >>> > Scheduler if most of the time we don't have more than 4 jobs running
> >>> > simultaneously ?
> >>> >
> >>> >
> >>> > On Tue, Nov 27, 2012 at 12:00 PM, Harsh J <ha...@cloudera.com>
> wrote:
> >>> >>
> >>> >> Hi Amit,
> >>> >>
> >>> >> He means the mapred.fairscheduler.assignmultiple FairScheduler
> >>> >> property. It is true by default, which works well for most workloads
> >>> >> if not benchmark style workloads. I would not usually trust that as
> a
> >>> >> base perf. measure of everything that comes out of an upgrade.
> >>> >>
> >>> >> The other JIRA, MAPREDUCE-4451, has been resolved for 1.2.0.
> >>> >>
> >>> >> On Tue, Nov 27, 2012 at 3:20 PM, Amit Sela <am...@infolinks.com>
> >>> >> wrote:
> >>> >> > Hi Jon,
> >>> >> >
> >>> >> > I recently upgraded our cluster from Hadoop 0.20.3-append to
> Hadoop
> >>> >> > 1.0.4
> >>> >> > and I haven't noticed any performance issues. By  "multiple
> >>> >> > assignment
> >>> >> > feature" do you mean speculative execution
> >>> >> > (mapred.map.tasks.speculative.execution and
> >>> >> > mapred.reduce.tasks.speculative.execution) ?
> >>> >> >
> >>> >> >
> >>> >> > On Mon, Nov 26, 2012 at 11:49 PM, Jon Allen <ja...@gmail.com>
> >>> >> > wrote:
> >>> >> >>
> >>> >> >> Problem solved, but worth warning others about.
> >>> >> >>
> >>> >> >> Before the upgrade the reducers for the terasort process had been
> >>> >> >> evenly
> >>> >> >> distributed around the cluster - one per task tracker in turn,
> >>> >> >> looping
> >>> >> >> around the cluster until all tasks were allocated.  After the
> >>> >> >> upgrade
> >>> >> >> all
> >>> >> >> reduce task had been submitted to small number of task trackers -
> >>> >> >> submit
> >>> >> >> tasks until the task tracker slots were full and then move onto
> the
> >>> >> >> next
> >>> >> >> task tracker.  Skewing the reducers like this quite clearly hit
> the
> >>> >> >> benchmark performance.
> >>> >> >>
> >>> >> >> The reason for this turns out to be the fair scheduler rewrite
> >>> >> >> (MAPREDUCE-2981) that appears to have subtly modified the
> behaviour
> >>> >> >> of
> >>> >> >> the
> >>> >> >> assign multiple property. Previously this property caused a
> single
> >>> >> >> map
> >>> >> >> and a
> >>> >> >> single reduce task to be allocated in a task tracker heartbeat
> >>> >> >> (rather
> >>> >> >> than
> >>> >> >> the default of a map or a reduce).  After the upgrade it
> allocates
> >>> >> >> as
> >>> >> >> many
> >>> >> >> tasks as there are available task slots.  Turning off the
> multiple
> >>> >> >> assignment feature returned the terasort to its pre-upgrade
> >>> >> >> performance.
> >>> >> >>
> >>> >> >> I can see potential benefits to this change and need to think
> >>> >> >> through
> >>> >> >> the
> >>> >> >> consequences to real world applications (though in practice we're
> >>> >> >> likely to
> >>> >> >> move away from fair scheduler due to MAPREDUCE-4451).
> >>> >> >> Investigating
> >>> >> >> this
> >>> >> >> has been a pain so to warn other user is there anywhere central
> >>> >> >> that
> >>> >> >> can be
> >>> >> >> used to record upgrade gotchas like this?
> >>> >> >>
> >>> >> >>
> >>> >> >> On Fri, Nov 23, 2012 at 12:02 PM, Jon Allen <jayayedev@gmail.com
> >
> >>> >> >> wrote:
> >>> >> >>>
> >>> >> >>> Hi,
> >>> >> >>>
> >>> >> >>> We've just upgraded our cluster from Hadoop 0.20.203 to 1.0.4
> and
> >>> >> >>> have
> >>> >> >>> hit performance problems.  Before the upgrade a 15TB terasort
> took
> >>> >> >>> about 45
> >>> >> >>> minutes, afterwards it takes just over an hour.  Looking in more
> >>> >> >>> detail it
> >>> >> >>> appears the shuffle phase has increased from 20 minutes to 40
> >>> >> >>> minutes.
> >>> >> >>> Does
> >>> >> >>> anyone have any thoughts about what'--
> >>> Harsh J
> >>>
> >
>

Re: Hadoop 1.0.4 Performance Problem

Posted by Jie Li <ji...@cs.duke.edu>.
Hi Jon,

Thanks for sharing these insights! Can't agree with you more!

Recently we released a tool called Starfish Hadoop Log Analyzer for
analyzing the job histories. I believe it can quickly point out this
reduce problem you met!

http://www.cs.duke.edu/starfish/release.html

Jie

On Wed, Nov 28, 2012 at 5:32 PM, Jon Allen <ja...@gmail.com> wrote:
> Jie,
>
> Simple answer - I got lucky (though obviously there are things you need to
> have in place to allow you to be lucky).
>
> Before running the upgrade I ran a set of tests to baseline the cluster
> performance, e.g. terasort, gridmix and some operational jobs.  Terasort by
> itself isn't very realistic as a cluster test but it's nice and simple to
> run and is good for regression testing things after a change.
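
For anyone wanting to reproduce that kind of baseline, a minimal sketch of the
terasort round trip on a 1.x cluster follows; the jar location, row count,
paths and reducer count are illustrative assumptions rather than Jon's actual
settings:

    # Generate, sort and validate; size the reducer count to roughly one per worker node
    hadoop jar $HADOOP_HOME/hadoop-examples-1.0.4.jar teragen 1000000000 /bench/tera-in
    hadoop jar $HADOOP_HOME/hadoop-examples-1.0.4.jar terasort \
        -D mapred.reduce.tasks=120 /bench/tera-in /bench/tera-out
    hadoop jar $HADOOP_HOME/hadoop-examples-1.0.4.jar teravalidate /bench/tera-out /bench/tera-report
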
>
> After the upgrade the intention was to run the same tests and show that the
> performance hadn't degraded (improved would have been nice but not worse was
> the minimum).  When we ran the terasort we found that performance was about
> 50% worse - execution time had gone from 40 minutes to 60 minutes.  As I've
> said, terasort doesn't provide a realistic view of operational performance
> but this showed that something major had changed and we needed to understand
> it before going further.  So how to go about diagnosing this ...
>
> First rule - understand what you're trying to achieve.  It's very easy to
> say performance isn't good enough but performance can always be better so
> you need to know what's realistic and at what point you're going to stop
> tuning things.  I had a previous baseline that I was trying to match so I
> knew what I was trying to achieve.
>
> Next thing to do is profile your job and identify where the problem is.  We
> had the full job history from the before and after jobs and comparing these
> we saw that map performance was fairly consistent as were the reduce sort
> and reduce phases.  The problem was with the shuffle, which had gone from 20
> minutes pre-upgrade to 40 minutes afterwards.  The important thing here is
> to make sure you've got as much information as possible.  If we'd just kept
> the overall job time then there would have been a lot more areas to look at
> but knowing the problem was with shuffle allowed me to focus effort in this
> area.
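
A sketch of how that per-phase comparison can be pulled out of the job history
on 1.x (the output paths here are hypothetical; the history files are read
from under the job's output directory by default):

    # Per-phase breakdown (map, shuffle, sort, reduce) before and after the upgrade
    hadoop job -history /bench/tera-out-0.20.203
    hadoop job -history /bench/tera-out-1.0.4
    # Add 'all' for per-task detail when chasing individual stragglers
    hadoop job -history all /bench/tera-out-1.0.4
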
>
> So what had changed in the shuffle that may have slowed things down.  The
> first thing we thought of was that we'd moved from a tarball deployment to
> using the RPM so what effect might this have had on things.  Our operational
> configuration compresses the map output and in the past we've had problems
> with Java compression libraries being used rather than native ones and this
> has affected performance.  We knew the RPM deployment had moved the native
> library so spent some time confirming to ourselves that these were being
> used correctly (but this turned out to not be the problem).  We then spent
> time doing some process and server profiling - using dstat to look at the
> server bottlenecks and jstack/jmap to check what the task tracker and reduce
> processes were doing.  Although not directly relevant to this particular
> problem doing this was useful just to get my head around what Hadoop is
> doing at various points of the process.
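
A rough sketch of that per-node poking, assuming standard tooling and an
RPM-style log layout; paths and process names will differ between installs:

    # Watch cpu, disk, network and memory while the shuffle is running
    dstat -cdnm 5
    # Thread and heap snapshots of the TaskTracker (the same approach works on a reduce child)
    TT_PID=$(jps | grep TaskTracker | awk '{print $1}')
    jstack $TT_PID > tasktracker-threads.txt
    jmap -histo $TT_PID | head -40
    # Check task logs for the NativeCodeLoader messages to confirm native compression is in use
    grep -R "native-hadoop" /var/log/hadoop/*/userlogs/ | tail
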
>
> The next bit was one place where I got lucky - I happened to be logged onto
> one of the worker nodes when a test job was running and I noticed that there
> weren't any reduce tasks running on the server.  This was odd as we'd
> submitted more reducers than we have servers so I'd expected at least one
> task to be running on each server.  Checking the job tracker log file it
> turned out that since the upgrade the job tracker had been submitting reduce
> tasks to only 10% of the available nodes.  A different 10% each time the job
> was run so clearly the individual task trackers were working OK but there
> was something odd going on with the task allocation.  Checking the job
> tracker log file showed that before the upgrade tasks had been fairly evenly
> distributed so something had changed.  After that it was a case of digging
> around the source code to find out which classes were available for task
> allocation and what inside them had changed.  This can be quite daunting but
> if you're comfortable with Java then it's just a case of following the calls
> through the code.  Once I found the cause it was just a case of working out
> what my options were for working around it (in this case turning off the
> multiple assignment option - I can work out whether I want to turn it back
> on in slower time).
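
A hedged sketch of that kind of check; the JobTracker log path and the exact
"Adding task ... for tracker" wording are assumptions about the 1.x log
format, so adjust the grep to match what the log actually says:

    # Count how many reduce attempts ('_r_') each task tracker was handed
    grep "Adding task" /var/log/hadoop/hadoop-mapred-jobtracker-*.log | grep "_r_" \
        | sed "s/.*for tracker '//; s/'.*//" | sort | uniq -c | sort -rn
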
>
> Where I think we got very lucky is that we hit this problem.  The
> configuration we use for the terasort has just over 1 reducer per worker
> node rather than maxing out the available reducer slots.  This decision was
> made several years ago and I can't remember the reasons for it.  If we'd been
> using a larger number of reducers then the number of worker nodes in use
> would have been similar regardless of the allocation algorithm and so the
> performance would have looked similar before and after the upgrade.  We
> would have hit this problem eventually but probably not until we started
> running user jobs and by then it would be too late to do the intrusive
> investigations that were possible now.
>
> Hope this has been useful.
>
> Regards,
> Jon
>
>
>
> On Tue, Nov 27, 2012 at 3:08 PM, Jie Li <ji...@cs.duke.edu> wrote:
>>
>> Jon:
>>
>> This is interesting and helpful! How did you figure out the cause? And how
>> much time did you spend? Could you share some experience of performance
>> diagnosis?
>>
>> Jie
>>
>> On Tuesday, November 27, 2012, Harsh J wrote:
>>>
>>> Hi Amit,
>>>
>>> The default scheduler is FIFO, and may not work for all forms of
>>> workloads. Read the multiple schedulers available to see if they have
>>> features that may benefit your workload:
>>>
>>> Capacity Scheduler:
>>> http://hadoop.apache.org/docs/stable/capacity_scheduler.html
>>> FairScheduler:
>>> http://hadoop.apache.org/docs/stable/fair_scheduler.html
>>>
>>> While there's a good overlap of features between them, there are a few
>>> differences that set them apart and make them each useful for
>>> different use-cases. If I had to summarize on some such differences,
>>> FairScheduler is better suited to SLA form of job execution situations
>>> due to its preemptive features (which make it useful in user and
>>> service mix scenarios), while CapacityScheduler provides
>>> manual-resource-request oriented scheduling for odd jobs with high
>>> memory workloads, etc. (which make it useful for running certain
>>> specific kind of jobs along side the regular ones).
>>>
>>> On Tue, Nov 27, 2012 at 3:51 PM, Amit Sela <am...@infolinks.com> wrote:
>>> > So this is a FairScheduler problem ?
>>> > We are using the default Hadoop scheduler. Is there a reason to use the
>>> > Fair
>>> > Scheduler if most of the time we don't have more than 4 jobs running
>>> > simultaneously ?
>>> >
>>> >
>>> > On Tue, Nov 27, 2012 at 12:00 PM, Harsh J <ha...@cloudera.com> wrote:
>>> >>
>>> >> Hi Amit,
>>> >>
>>> >> He means the mapred.fairscheduler.assignmultiple FairScheduler
>>> >> property. It is true by default, which works well for most workloads
>>> >> if not benchmark style workloads. I would not usually trust that as a
>>> >> base perf. measure of everything that comes out of an upgrade.
>>> >>
>>> >> The other JIRA, MAPREDUCE-4451, has been resolved for 1.2.0.
>>> >>
>>> >> On Tue, Nov 27, 2012 at 3:20 PM, Amit Sela <am...@infolinks.com>
>>> >> wrote:
>>> >> > Hi Jon,
>>> >> >
>>> >> > I recently upgraded our cluster from Hadoop 0.20.3-append to Hadoop
>>> >> > 1.0.4
>>> >> > and I haven't noticed any performance issues. By  "multiple
>>> >> > assignment
>>> >> > feature" do you mean speculative execution
>>> >> > (mapred.map.tasks.speculative.execution and
>>> >> > mapred.reduce.tasks.speculative.execution) ?
>>> >> >
>>> >> >
>>> >> > On Mon, Nov 26, 2012 at 11:49 PM, Jon Allen <ja...@gmail.com>
>>> >> > wrote:
>>> >> >>
>>> >> >> Problem solved, but worth warning others about.
>>> >> >>
>>> >> >> Before the upgrade the reducers for the terasort process had been
>>> >> >> evenly
>>> >> >> distributed around the cluster - one per task tracker in turn,
>>> >> >> looping
>>> >> >> around the cluster until all tasks were allocated.  After the
>>> >> >> upgrade
>>> >> >> all
>>> >> >> reduce task had been submitted to small number of task trackers -
>>> >> >> submit
>>> >> >> tasks until the task tracker slots were full and then move onto the
>>> >> >> next
>>> >> >> task tracker.  Skewing the reducers like this quite clearly hit the
>>> >> >> benchmark performance.
>>> >> >>
>>> >> >> The reason for this turns out to be the fair scheduler rewrite
>>> >> >> (MAPREDUCE-2981) that appears to have subtly modified the behaviour
>>> >> >> of
>>> >> >> the
>>> >> >> assign multiple property. Previously this property caused a single
>>> >> >> map
>>> >> >> and a
>>> >> >> single reduce task to be allocated in a task tracker heartbeat
>>> >> >> (rather
>>> >> >> than
>>> >> >> the default of a map or a reduce).  After the upgrade it allocates
>>> >> >> as
>>> >> >> many
>>> >> >> tasks as there are available task slots.  Turning off the multiple
>>> >> >> assignment feature returned the terasort to its pre-upgrade
>>> >> >> performance.
>>> >> >>
>>> >> >> I can see potential benefits to this change and need to think
>>> >> >> through
>>> >> >> the
>>> >> >> consequences to real world applications (though in practice we're
>>> >> >> likely to
>>> >> >> move away from fair scheduler due to MAPREDUCE-4451).
>>> >> >> Investigating
>>> >> >> this
>>> >> >> has been a pain so to warn other user is there anywhere central
>>> >> >> that
>>> >> >> can be
>>> >> >> used to record upgrade gotchas like this?
>>> >> >>
>>> >> >>
>>> >> >> On Fri, Nov 23, 2012 at 12:02 PM, Jon Allen <ja...@gmail.com>
>>> >> >> wrote:
>>> >> >>>
>>> >> >>> Hi,
>>> >> >>>
>>> >> >>> We've just upgraded our cluster from Hadoop 0.20.203 to 1.0.4 and
>>> >> >>> have
>>> >> >>> hit performance problems.  Before the upgrade a 15TB terasort took
>>> >> >>> about 45
>>> >> >>> minutes, afterwards it takes just over an hour.  Looking in more
>>> >> >>> detail it
>>> >> >>> appears the shuffle phase has increased from 20 minutes to 40
>>> >> >>> minutes.
>>> >> >>> Does
>>> >> >>> anyone have any thoughts about what'--
>>> Harsh J
>>>
>
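
As an aside, the scheduler choice Harsh describes in the quoted thread above
comes down to one JobTracker property. A hypothetical mapred-site.xml fragment
for switching schedulers, assuming the standard 1.x class names and that the
relevant scheduler jar is on the JobTracker classpath:

    <property>
      <name>mapred.jobtracker.taskScheduler</name>
      <!-- or org.apache.hadoop.mapred.CapacityTaskScheduler; omit entirely for the FIFO default -->
      <value>org.apache.hadoop.mapred.FairScheduler</value>
    </property>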

Re: Hadoop 1.0.4 Performance Problem

Posted by Todd Lipcon <to...@cloudera.com>.
Hi Jon,

FYI, this issue in the fair scheduler was fixed by
https://issues.apache.org/jira/browse/MAPREDUCE-2905 for 1.1.0.
Though it is present again in MR2:
https://issues.apache.org/jira/browse/MAPREDUCE-3268

-Todd
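
For reference, the workaround Jon settled on amounts to a single
JobTracker-side setting. A minimal mapred-site.xml fragment might look like
the following (the property name comes from this thread, the comment
summarises the behaviour change described in it, and the JobTracker needs a
restart to pick it up):

    <property>
      <name>mapred.fairscheduler.assignmultiple</name>
      <!-- false: hand out a single task per heartbeat; true (the default) now fills every free slot -->
      <value>false</value>
    </property>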

On Wed, Nov 28, 2012 at 2:32 PM, Jon Allen <ja...@gmail.com> wrote:
> Jie,
>
> Simple answer - I got lucky (though obviously there are things you need to
> have in place to allow you to be lucky).
>
> Before running the upgrade I ran a set of tests to baseline the cluster
> performance, e.g. terasort, gridmix and some operational jobs.  Terasort by
> itself isn't very realistic as a cluster test but it's nice and simple to
> run and is good for regression testing things after a change.
>
> After the upgrade the intention was to run the same tests and show that the
> performance hadn't degraded (improved would have been nice but not worse was
> the minimum).  When we ran the terasort we found that performance was about
> 50% worse - execution time had gone from 40 minutes to 60 minutes.  As I've
> said, terasort doesn't provide a realistic view of operational performance
> but this showed that something major had changed and we needed to understand
> it before going further.  So how to go about diagnosing this ...
>
> First rule - understand what you're trying to achieve.  It's very easy to
> say performance isn't good enough but performance can always be better so
> you need to know what's realistic and at what point you're going to stop
> tuning things.  I had a previous baseline that I was trying to match so I
> knew what I was trying to achieve.
>
> Next thing to do is profile your job and identify where the problem is.  We
> had the full job history from the before and after jobs and comparing these
> we saw that map performance was fairly consistent as were the reduce sort
> and reduce phases.  The problem was with the shuffle, which had gone from 20
> minutes pre-upgrade to 40 minutes afterwards.  The important thing here is
> to make sure you've got as much information as possible.  If we'd just kept
> the overall job time then there would have been a lot more areas to look at
> but knowing the problem was with shuffle allowed me to focus effort in this
> area.
>
> So what had changed in the shuffle that may have slowed things down.  The
> first thing we thought of was that we'd moved from a tarball deployment to
> using the RPM so what effect might this have had on things.  Our operational
> configuration compresses the map output and in the past we've had problems
> with Java compression libraries being used rather than native ones and this
> has affected performance.  We knew the RPM deployment had moved the native
> library so spent some time confirming to ourselves that these were being
> used correctly (but this turned out to not be the problem).  We then spent
> time doing some process and server profiling - using dstat to look at the
> server bottlenecks and jstack/jmap to check what the task tracker and reduce
> processes were doing.  Although not directly relevant to this particular
> problem doing this was useful just to get my head around what Hadoop is
> doing at various points of the process.
>
> The next bit was one place where I got lucky - I happened to be logged onto
> one of the worker nodes when a test job was running and I noticed that there
> weren't any reduce tasks running on the server.  This was odd as we'd
> submitted more reducers than we have servers so I'd expected at least one
> task to be running on each server.  Checking the job tracker log file it
> turned out that since the upgrade the job tracker had been submitting reduce
> tasks to only 10% of the available nodes.  A different 10% each time the job
> was run so clearly the individual task trackers were working OK but there
> was something odd going on with the task allocation.  Checking the job
> tracker log file showed that before the upgrade tasks had been fairly evenly
> distributed so something had changed.  After that it was a case of digging
> around the source code to find out which classes were available for task
> allocation and what inside them had changed.  This can be quite daunting but
> if you're comfortable with Java then it's just a case of following the calls
> through the code.  Once I found the cause it was just a case of working out
> what my options were for working around it (in this case turning off the
> multiple assignment option - I can work out whether I want to turn it back
> on in slower time).
>
> Where I think we got very lucky is that we hit this problem.  The
> configuration we use for the terasort has just over 1 reducer per worker
> node rather than maxing out the available reducer slots.  This decision was
> made several years ago and I can't remember the reasons for it.  If we'd been
> using a larger number of reducers then the number of worker nodes in use
> would have been similar regardless of the allocation algorithm and so the
> performance would have looked similar before and after the upgrade.  We
> would have hit this problem eventually but probably not until we started
> running user jobs and by then it would be too late to do the intrusive
> investigations that were possible now.
>
> Hope this has been useful.
>
> Regards,
> Jon
>
>
>
> On Tue, Nov 27, 2012 at 3:08 PM, Jie Li <ji...@cs.duke.edu> wrote:
>>
>> Jon:
>>
>> This is interesting and helpful! How did you figure out the cause? And how
>> much time did you spend? Could you share some experience of performance
>> diagnosis?
>>
>> Jie
>>
>> On Tuesday, November 27, 2012, Harsh J wrote:
>>>
>>> Hi Amit,
>>>
>>> The default scheduler is FIFO, and may not work for all forms of
>>> workloads. Read the multiple schedulers available to see if they have
>>> features that may benefit your workload:
>>>
>>> Capacity Scheduler:
>>> http://hadoop.apache.org/docs/stable/capacity_scheduler.html
>>> FairScheduler:
>>> http://hadoop.apache.org/docs/stable/fair_scheduler.html
>>>
>>> While there's a good overlap of features between them, there are a few
>>> differences that set them apart and make them each useful for
>>> different use-cases. If I had to summarize on some such differences,
>>> FairScheduler is better suited to SLA form of job execution situations
>>> due to its preemptive features (which make it useful in user and
>>> service mix scenarios), while CapacityScheduler provides
>>> manual-resource-request oriented scheduling for odd jobs with high
>>> memory workloads, etc. (which make it useful for running certain
>>> specific kind of jobs along side the regular ones).
>>>
>>> On Tue, Nov 27, 2012 at 3:51 PM, Amit Sela <am...@infolinks.com> wrote:
>>> > So this is a FairScheduler problem ?
>>> > We are using the default Hadoop scheduler. Is there a reason to use the
>>> > Fair
>>> > Scheduler if most of the time we don't have more than 4 jobs running
>>> > simultaneously ?
>>> >
>>> >
>>> > On Tue, Nov 27, 2012 at 12:00 PM, Harsh J <ha...@cloudera.com> wrote:
>>> >>
>>> >> Hi Amit,
>>> >>
>>> >> He means the mapred.fairscheduler.assignmultiple FairScheduler
>>> >> property. It is true by default, which works well for most workloads
>>> >> if not benchmark style workloads. I would not usually trust that as a
>>> >> base perf. measure of everything that comes out of an upgrade.
>>> >>
>>> >> The other JIRA, MAPREDUCE-4451, has been resolved for 1.2.0.
>>> >>
>>> >> On Tue, Nov 27, 2012 at 3:20 PM, Amit Sela <am...@infolinks.com>
>>> >> wrote:
>>> >> > Hi Jon,
>>> >> >
>>> >> > I recently upgraded our cluster from Hadoop 0.20.3-append to Hadoop
>>> >> > 1.0.4
>>> >> > and I haven't noticed any performance issues. By  "multiple
>>> >> > assignment
>>> >> > feature" do you mean speculative execution
>>> >> > (mapred.map.tasks.speculative.execution and
>>> >> > mapred.reduce.tasks.speculative.execution) ?
>>> >> >
>>> >> >
>>> >> > On Mon, Nov 26, 2012 at 11:49 PM, Jon Allen <ja...@gmail.com>
>>> >> > wrote:
>>> >> >>
>>> >> >> Problem solved, but worth warning others about.
>>> >> >>
>>> >> >> Before the upgrade the reducers for the terasort process had been
>>> >> >> evenly
>>> >> >> distributed around the cluster - one per task tracker in turn,
>>> >> >> looping
>>> >> >> around the cluster until all tasks were allocated.  After the
>>> >> >> upgrade
>>> >> >> all
>>> >> >> reduce task had been submitted to small number of task trackers -
>>> >> >> submit
>>> >> >> tasks until the task tracker slots were full and then move onto the
>>> >> >> next
>>> >> >> task tracker.  Skewing the reducers like this quite clearly hit the
>>> >> >> benchmark performance.
>>> >> >>
>>> >> >> The reason for this turns out to be the fair scheduler rewrite
>>> >> >> (MAPREDUCE-2981) that appears to have subtly modified the behaviour
>>> >> >> of
>>> >> >> the
>>> >> >> assign multiple property. Previously this property caused a single
>>> >> >> map
>>> >> >> and a
>>> >> >> single reduce task to be allocated in a task tracker heartbeat
>>> >> >> (rather
>>> >> >> than
>>> >> >> the default of a map or a reduce).  After the upgrade it allocates
>>> >> >> as
>>> >> >> many
>>> >> >> tasks as there are available task slots.  Turning off the multiple
>>> >> >> assignment feature returned the terasort to its pre-upgrade
>>> >> >> performance.
>>> >> >>
>>> >> >> I can see potential benefits to this change and need to think
>>> >> >> through
>>> >> >> the
>>> >> >> consequences to real world applications (though in practice we're
>>> >> >> likely to
>>> >> >> move away from fair scheduler due to MAPREDUCE-4451).
>>> >> >> Investigating
>>> >> >> this
>>> >> >> has been a pain so to warn other user is there anywhere central
>>> >> >> that
>>> >> >> can be
>>> >> >> used to record upgrade gotchas like this?
>>> >> >>
>>> >> >>
>>> >> >> On Fri, Nov 23, 2012 at 12:02 PM, Jon Allen <ja...@gmail.com>
>>> >> >> wrote:
>>> >> >>>
>>> >> >>> Hi,
>>> >> >>>
>>> >> >>> We've just upgraded our cluster from Hadoop 0.20.203 to 1.0.4 and
>>> >> >>> have
>>> >> >>> hit performance problems.  Before the upgrade a 15TB terasort took
>>> >> >>> about 45
>>> >> >>> minutes, afterwards it takes just over an hour.  Looking in more
>>> >> >>> detail it
>>> >> >>> appears the shuffle phase has increased from 20 minutes to 40
>>> >> >>> minutes.
>>> >> >>> Does
>>> >> >>> anyone have any thoughts about what'--
>>> Harsh J
>>>
>



-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Hadoop 1.0.4 Performance Problem

Posted by Todd Lipcon <to...@cloudera.com>.
Hi Jon,

FYI, this issue in the fair scheduler was fixed by
https://issues.apache.org/jira/browse/MAPREDUCE-2905 for 1.1.0.
Though it is present again in MR2:
https://issues.apache.org/jira/browse/MAPREDUCE-3268

-Todd

On Wed, Nov 28, 2012 at 2:32 PM, Jon Allen <ja...@gmail.com> wrote:
> Jie,
>
> Simple answer - I got lucky (though obviously there are things you need to
> have in place to allow you to be lucky).
>
> Before running the upgrade I ran a set of tests to baseline the cluster
> performance, e.g. terasort, gridmix and some operational jobs.  Terasort by
> itself isn't very realistic as a cluster test but it's nice and simple to
> run and is good for regression testing things after a change.
>
> After the upgrade the intention was to run the same tests and show that the
> performance hadn't degraded (improved would have been nice but not worse was
> the minimum).  When we ran the terasort we found that performance was about
> 50% worse - execution time had gone from 40 minutes to 60 minutes.  As I've
> said, terasort doesn't provide a realistic view of operational performance
> but this showed that something major had changed and we needed to understand
> it before going further.  So how to go about diagnosing this ...
>
> First rule - understand what you're trying to achieve.  It's very easy to
> say performance isn't good enough but performance can always be better so
> you need to know what's realistic and at what point you're going to stop
> tuning things.  I had a previous baseline that I was trying to match so I
> knew what I was trying to achieve.
>
> Next thing to do is profile your job and identify where the problem is.  We
> had the full job history from the before and after jobs and comparing these
> we saw that map performance was fairly consistent as were the reduce sort
> and reduce phases.  The problem was with the shuffle, which had gone from 20
> minutes pre-upgrade to 40 minutes afterwards.  The important thing here is
> to make sure you've got as much information as possible.  If we'd just kept
> the overall job time then there would have been a lot more areas to look at
> but knowing the problem was with shuffle allowed me to focus effort in this
> area.
>
> So what had changed in the shuffle that may have slowed things down.  The
> first thing we thought of was that we'd moved from a tarball deployment to
> using the RPM so what effect might this have had on things.  Our operational
> configuration compresses the map output and in the past we've had problems
> with Java compression libraries being used rather than native ones and this
> has affected performance.  We knew the RPM deployment had moved the native
> library so spent some time confirming to ourselves that these were being
> used correctly (but this turned out to not be the problem).  We then spent
> time doing some process and server profiling - using dstat to look at the
> server bottlenecks and jstack/jmap to check what the task tracker and reduce
> processes were doing.  Although not directly relevant to this particular
> problem doing this was useful just to get my head around what Hadoop is
> doing at various points of the process.
>
> The next bit was one place where I got lucky - I happened to be logged onto
> one of the worker nodes when a test job was running and I noticed that there
> weren't any reduce tasks running on the server.  This was odd as we'd
> submitted more reducers than we have servers so I'd expected at least one
> task to be running on each server.  Checking the job tracker log file it
> turned out that since the upgrade the job tracker had been submitting reduce
> tasks to only 10% of the available nodes.  A different 10% each time the job
> was run so clearly the individual task trackers were working OK but there
> was something odd going on with the task allocation.  Checking the job
> tracker log file showed that before the upgrade tasks had been fairly evenly
> distributed so something had changed.  After that it was a case of digging
> around the source code to find out which classes were available for task
> allocation and what inside them had changed.  This can be quite daunting but
> if you're comfortable with Java then it's just a case of following the calls
> through the code.  Once I found the cause it was just a case of working out
> what my options were for working around it (in this case turning off the
> multiple assignment option - I can work out whether I want to turn it back
> on in slower time).
>
> Where I think we got very lucky is that we hit this problem.  The
> configuration we use for the terasort has just over 1 reducer per worker
> node rather than maxing out the available reducer slots.  This decision was
> made several years ago and I can't remember the reasons for it.  If we'd been
> using a larger number of reducers then the number of worker nodes in use
> would have been similar regardless of the allocation algorithm and so the
> performance would have looked similar before and after the upgrade.  We
> would have hit this problem eventually but probably not until we started
> running user jobs and by then it would be too late to do the intrusive
> investigations that were possible now.
>
> Hope this has been useful.
>
> Regards,
> Jon
>
>
>
> On Tue, Nov 27, 2012 at 3:08 PM, Jie Li <ji...@cs.duke.edu> wrote:
>>
>> Jon:
>>
>> This is interesting and helpful! How did you figure out the cause? And how
>> much time did you spend? Could you share some experience of performance
>> diagnosis?
>>
>> Jie
>>
>> On Tuesday, November 27, 2012, Harsh J wrote:
>>>
>>> Hi Amit,
>>>
>>> The default scheduler is FIFO, and may not work for all forms of
>>> workloads. Read the multiple schedulers available to see if they have
>>> features that may benefit your workload:
>>>
>>> Capacity Scheduler:
>>> http://hadoop.apache.org/docs/stable/capacity_scheduler.html
>>> FairScheduler:
>>> http://hadoop.apache.org/docs/stable/fair_scheduler.html
>>>
>>> While there's a good overlap of features between them, there are a few
>>> differences that set them apart and make them each useful for
>>> different use-cases. If I had to summarize on some such differences,
>>> FairScheduler is better suited to SLA form of job execution situations
>>> due to its preemptive features (which make it useful in user and
>>> service mix scenarios), while CapacityScheduler provides
>>> manual-resource-request oriented scheduling for odd jobs with high
>>> memory workloads, etc. (which make it useful for running certain
>>> specific kind of jobs along side the regular ones).
>>>
>>> On Tue, Nov 27, 2012 at 3:51 PM, Amit Sela <am...@infolinks.com> wrote:
>>> > So this is a FairScheduler problem ?
>>> > We are using the default Hadoop scheduler. Is there a reason to use the
>>> > Fair
>>> > Scheduler if most of the time we don't have more than 4 jobs running
>>> > simultaneously ?
>>> >
>>> >
>>> > On Tue, Nov 27, 2012 at 12:00 PM, Harsh J <ha...@cloudera.com> wrote:
>>> >>
>>> >> Hi Amit,
>>> >>
>>> >> He means the mapred.fairscheduler.assignmultiple FairScheduler
>>> >> property. It is true by default, which works well for most workloads
>>> >> if not benchmark style workloads. I would not usually trust that as a
>>> >> base perf. measure of everything that comes out of an upgrade.
>>> >>
>>> >> The other JIRA, MAPREDUCE-4451, has been resolved for 1.2.0.
>>> >>
>>> >> On Tue, Nov 27, 2012 at 3:20 PM, Amit Sela <am...@infolinks.com>
>>> >> wrote:
>>> >> > Hi Jon,
>>> >> >
>>> >> > I recently upgraded our cluster from Hadoop 0.20.3-append to Hadoop
>>> >> > 1.0.4
>>> >> > and I haven't noticed any performance issues. By  "multiple
>>> >> > assignment
>>> >> > feature" do you mean speculative execution
>>> >> > (mapred.map.tasks.speculative.execution and
>>> >> > mapred.reduce.tasks.speculative.execution) ?
>>> >> >
>>> >> >
>>> >> > On Mon, Nov 26, 2012 at 11:49 PM, Jon Allen <ja...@gmail.com>
>>> >> > wrote:
>>> >> >>
>>> >> >> Problem solved, but worth warning others about.
>>> >> >>
>>> >> >> Before the upgrade the reducers for the terasort process had been
>>> >> >> evenly
>>> >> >> distributed around the cluster - one per task tracker in turn,
>>> >> >> looping
>>> >> >> around the cluster until all tasks were allocated.  After the
>>> >> >> upgrade
>>> >> >> all
>>> >> >> reduce task had been submitted to small number of task trackers -
>>> >> >> submit
>>> >> >> tasks until the task tracker slots were full and then move onto the
>>> >> >> next
>>> >> >> task tracker.  Skewing the reducers like this quite clearly hit the
>>> >> >> benchmark performance.
>>> >> >>
>>> >> >> The reason for this turns out to be the fair scheduler rewrite
>>> >> >> (MAPREDUCE-2981) that appears to have subtly modified the behaviour
>>> >> >> of
>>> >> >> the
>>> >> >> assign multiple property. Previously this property caused a single
>>> >> >> map
>>> >> >> and a
>>> >> >> single reduce task to be allocated in a task tracker heartbeat
>>> >> >> (rather
>>> >> >> than
>>> >> >> the default of a map or a reduce).  After the upgrade it allocates
>>> >> >> as
>>> >> >> many
>>> >> >> tasks as there are available task slots.  Turning off the multiple
>>> >> >> assignment feature returned the terasort to its pre-upgrade
>>> >> >> performance.
>>> >> >>
>>> >> >> I can see potential benefits to this change and need to think
>>> >> >> through
>>> >> >> the
>>> >> >> consequences to real world applications (though in practice we're
>>> >> >> likely to
>>> >> >> move away from fair scheduler due to MAPREDUCE-4451).
>>> >> >> Investigating
>>> >> >> this
>>> >> >> has been a pain so to warn other user is there anywhere central
>>> >> >> that
>>> >> >> can be
>>> >> >> used to record upgrade gotchas like this?
>>> >> >>
>>> >> >>
>>> >> >> On Fri, Nov 23, 2012 at 12:02 PM, Jon Allen <ja...@gmail.com>
>>> >> >> wrote:
>>> >> >>>
>>> >> >>> Hi,
>>> >> >>>
>>> >> >>> We've just upgraded our cluster from Hadoop 0.20.203 to 1.0.4 and
>>> >> >>> have
>>> >> >>> hit performance problems.  Before the upgrade a 15TB terasort took
>>> >> >>> about 45
>>> >> >>> minutes, afterwards it takes just over an hour.  Looking in more
>>> >> >>> detail it
>>> >> >>> appears the shuffle phase has increased from 20 minutes to 40
>>> >> >>> minutes.
>>> >> >>> Does
>>> >> >>> anyone have any thoughts about what'--
>>> Harsh J
>>>
>



-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Hadoop 1.0.4 Performance Problem

Posted by Jon Allen <ja...@gmail.com>.
Jie,

Simple answer - I got lucky (though obviously there are things you need to
have in place to allow you to be lucky).

Before running the upgrade I ran a set of tests to baseline the cluster
performance, e.g. terasort, gridmix and some operational jobs.  Terasort by
itself isn't very realistic as a cluster test but it's nice and simple to
run and is good for regression testing things after a change.
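If you want to run a similar baseline, the stock examples jar is all it
takes - something roughly like the following, with the jar name and paths
adjusted for your install (teragen takes the number of 100-byte rows, so
around 150 billion rows for 15TB):

  hadoop jar hadoop-examples-1.0.4.jar teragen 150000000000 /bench/tera-in
  hadoop jar hadoop-examples-1.0.4.jar terasort /bench/tera-in /bench/tera-out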

After the upgrade the intention was to run the same tests and show that the
performance hadn't degraded (improved would have been nice but not worse
was the minimum).  When we ran the terasort we found that performance was
about 50% worse - execution time had gone from 40 minutes to 60 minutes.
 As I've said, terasort doesn't provide a realistic view of operational
performance but this showed that something major had changed and we needed
to understand it before going further.  So how to go about diagnosing this
...

First rule - understand what you're trying to achieve.  It's very easy to
say performance isn't good enough but performance can always be better so
you need to know what's realistic and at what point you're going to stop
tuning things.  I had a previous baseline that I was trying to match so I
knew what I was trying to achieve.

Next thing to do is profile your job and identify where the problem is.  We
had the full job history from the before and after jobs and comparing these
we saw that map performance was fairly consistent as were the reduce sort
and reduce phases.  The problem was with the shuffle, which had gone from
20 minutes pre-upgrade to 40 minutes afterwards.  The important thing here
is to make sure you've got as much information as possible.  If we'd just
kept the overall job time then there would have been a lot more areas to
look at but knowing the problem was with shuffle allowed me to focus effort
in this area.
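As an aside, those phase timings come straight out of the retained job
history - something like the following against the job output directory
should print the per-attempt details, including shuffle and sort finish
times (assuming the _logs/history files are still around):

  hadoop job -history all <job output directory>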

So what had changed in the shuffle that might have slowed things down?  The
first thing we thought of was that we'd moved from a tarball deployment to
using the RPM, so what effect might that have had on things?  Our
operational configuration compresses the map output and in the past we've
had problems with Java compression libraries being used rather than native
ones and this has affected performance.  We knew the RPM deployment had
moved the native library so spent some time confirming to ourselves that
these were being used correctly (but this turned out to not be the
problem).  We then spent time doing some process and server profiling -
using dstat to look at the server bottlenecks and jstack/jmap to check what
the task tracker and reduce processes were doing.  Although not directly
relevant to this particular problem doing this was useful just to get my
head around what Hadoop is doing at various points of the process.
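Nothing clever was needed for this - roughly the following on a worker
node, with the task tracker and reduce child PIDs substituted in and the
dstat flags to taste:

  dstat -cdngy 5                  # cpu, disk, net, paging, system each 5s
  jstack <tasktracker pid>        # thread dump - where is the time going?
  jmap -histo <reduce child pid>  # heap object histogram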

The next bit was one place where I got lucky - I happened to be logged onto
one of the worker nodes when a test job was running and I noticed that
there weren't any reduce tasks running on the server.  This was odd as we'd
submitted more reducers than we have servers so I'd expected at least one
task to be running on each server.  Checking the job tracker log file it
turned out that since the upgrade the job tracker had been submitting
reduce tasks to only 10% of the available nodes.  A different 10% each time
the job was run so clearly the individual task trackers were working OK but
there was something odd going on with the task allocation.  Checking the
job tracker log file showed that before the upgrade tasks had been fairly
evenly distributed so something had changed.  After that it was a case of
digging around the source code to find out which classes were available for
task allocation and what inside them had changed.  This can be quite
daunting but if you're comfortable with Java then it's just a case of
following the calls through the code.  Once I found the cause it was just a
case of working out what my options were for working around it (in this
case turning off the multiple assignment option - I can work out whether I
want to turn it back on in slower time).
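For anyone hitting the same thing, the workaround amounts to a single
property in the JobTracker's mapred-site.xml - I believe the fair scheduler
only reads it at startup, so plan on a JobTracker restart:

  <property>
    <name>mapred.fairscheduler.assignmultiple</name>
    <value>false</value>
  </property>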

Where I think we got very lucky is that we hit this problem.  The
configuration we use for the terasort has just over 1 reducer per worker
node rather than maxing out the available reducer slots.  This decision was
made several years ago and I can't remember the reasons for it.  If we'd been
using a larger number of reducers then the number of worker nodes in use
would have been similar regardless of the allocation algorithm and so the
performance would have looked similar before and after the upgrade.  We
would have hit this problem eventually but probably not until we started
running user jobs and by then it would be too late to do the intrusive
investigations that were possible now.

Hope this has been useful.

Regards,
Jon



On Tue, Nov 27, 2012 at 3:08 PM, Jie Li <ji...@cs.duke.edu> wrote:

> Jon:
>
> This is interesting and helpful! How did you figure out the cause? And how
> much time did you spend? Could you share some experience of performance
> diagnosis?
>
> Jie
>
> On Tuesday, November 27, 2012, Harsh J wrote:
>
>> Hi Amit,
>>
>> The default scheduler is FIFO, and may not work for all forms of
>> workloads. Read the multiple schedulers available to see if they have
>> features that may benefit your workload:
>>
>> Capacity Scheduler:
>> http://hadoop.apache.org/docs/stable/capacity_scheduler.html
>> FairScheduler:
>> http://hadoop.apache.org/docs/stable/fair_scheduler.html
>>
>> While there's a good overlap of features between them, there are a few
>> differences that set them apart and make them each useful for
>> different use-cases. If I had to summarize on some such differences,
>> FairScheduler is better suited to SLA form of job execution situations
>> due to its preemptive features (which make it useful in user and
>> service mix scenarios), while CapacityScheduler provides
>> manual-resource-request oriented scheduling for odd jobs with high
>> memory workloads, etc. (which make it useful for running certain
>> specific kind of jobs along side the regular ones).
>>
>> On Tue, Nov 27, 2012 at 3:51 PM, Amit Sela <am...@infolinks.com> wrote:
>> > So this is a FairScheduler problem ?
>> > We are using the default Hadoop scheduler. Is there a reason to use the
>> Fair
>> > Scheduler if most of the time we don't have more than 4 jobs running
>> > simultaneously ?
>> >
>> >
>> > On Tue, Nov 27, 2012 at 12:00 PM, Harsh J <ha...@cloudera.com> wrote:
>> >>
>> >> Hi Amit,
>> >>
>> >> He means the mapred.fairscheduler.assignmultiple FairScheduler
>> >> property. It is true by default, which works well for most workloads
>> >> if not benchmark style workloads. I would not usually trust that as a
>> >> base perf. measure of everything that comes out of an upgrade.
>> >>
>> >> The other JIRA, MAPREDUCE-4451, has been resolved for 1.2.0.
>> >>
>> >> On Tue, Nov 27, 2012 at 3:20 PM, Amit Sela <am...@infolinks.com>
>> wrote:
>> >> > Hi Jon,
>> >> >
>> >> > I recently upgraded our cluster from Hadoop 0.20.3-append to Hadoop
>> >> > 1.0.4
>> >> > and I haven't noticed any performance issues. By  "multiple
>> assignment
>> >> > feature" do you mean speculative execution
>> >> > (mapred.map.tasks.speculative.execution and
>> >> > mapred.reduce.tasks.speculative.execution) ?
>> >> >
>> >> >
>> >> > On Mon, Nov 26, 2012 at 11:49 PM, Jon Allen <ja...@gmail.com>
>> wrote:
>> >> >>
>> >> >> Problem solved, but worth warning others about.
>> >> >>
>> >> >> Before the upgrade the reducers for the terasort process had been
>> >> >> evenly
>> >> >> distributed around the cluster - one per task tracker in turn,
>> looping
>> >> >> around the cluster until all tasks were allocated.  After the
>> upgrade
>> >> >> all
>> >> >> reduce task had been submitted to small number of task trackers -
>> >> >> submit
>> >> >> tasks until the task tracker slots were full and then move onto the
>> >> >> next
>> >> >> task tracker.  Skewing the reducers like this quite clearly hit the
>> >> >> benchmark performance.
>> >> >>
>> >> >> The reason for this turns out to be the fair scheduler rewrite
>> >> >> (MAPREDUCE-2981) that appears to have subtly modified the behaviour
>> of
>> >> >> the
>> >> >> assign multiple property. Previously this property caused a single
>> map
>> >> >> and a
>> >> >> single reduce task to be allocated in a task tracker heartbeat
>> (rather
>> >> >> than
>> >> >> the default of a map or a reduce).  After the upgrade it allocates
>> as
>> >> >> many
>> >> >> tasks as there are available task slots.  Turning off the multiple
>> >> >> assignment feature returned the terasort to its pre-upgrade
>> >> >> performance.
>> >> >>
>> >> >> I can see potential benefits to this change and need to think
>> through
>> >> >> the
>> >> >> consequences to real world applications (though in practice we're
>> >> >> likely to
>> >> >> move away from fair scheduler due to MAPREDUCE-4451).  Investigating
>> >> >> this
>> >> >> has been a pain so to warn other user is there anywhere central that
>> >> >> can be
>> >> >> used to record upgrade gotchas like this?
>> >> >>
>> >> >>
>> >> >> On Fri, Nov 23, 2012 at 12:02 PM, Jon Allen <ja...@gmail.com>
>> >> >> wrote:
>> >> >>>
>> >> >>> Hi,
>> >> >>>
>> >> >>> We've just upgraded our cluster from Hadoop 0.20.203 to 1.0.4 and
>> have
>> >> >>> hit performance problems.  Before the upgrade a 15TB terasort took
>> >> >>> about 45
>> >> >>> minutes, afterwards it takes just over an hour.  Looking in more
>> >> >>> detail it
>> >> >>> appears the shuffle phase has increased from 20 minutes to 40
>> minutes.
>> >> >>> Does
>> >> >>> anyone have any thoughts about what'--
>> Harsh J
>>
>>

Re: Hadoop 1.0.4 Performance Problem

Posted by Jie Li <ji...@cs.duke.edu>.
Jon:

This is interesting and helpful! How did you figure out the cause? And how
much time did you spend? Could you share some experience of performance
diagnosis?

Jie

On Tuesday, November 27, 2012, Harsh J wrote:

> Hi Amit,
>
> The default scheduler is FIFO, and may not work for all forms of
> workloads. Read the multiple schedulers available to see if they have
> features that may benefit your workload:
>
> Capacity Scheduler:
> http://hadoop.apache.org/docs/stable/capacity_scheduler.html
> FairScheduler:
> http://hadoop.apache.org/docs/stable/fair_scheduler.html
>
> While there's a good overlap of features between them, there are a few
> differences that set them apart and make them each useful for
> different use-cases. If I had to summarize on some such differences,
> FairScheduler is better suited to SLA form of job execution situations
> due to its preemptive features (which make it useful in user and
> service mix scenarios), while CapacityScheduler provides
> manual-resource-request oriented scheduling for odd jobs with high
> memory workloads, etc. (which make it useful for running certain
> specific kind of jobs along side the regular ones).
>
> On Tue, Nov 27, 2012 at 3:51 PM, Amit Sela <am...@infolinks.com> wrote:
> > So this is a FairScheduler problem ?
> > We are using the default Hadoop scheduler. Is there a reason to use the
> Fair
> > Scheduler if most of the time we don't have more than 4 jobs running
> > simultaneously ?
> >
> >
> > On Tue, Nov 27, 2012 at 12:00 PM, Harsh J <ha...@cloudera.com> wrote:
> >>
> >> Hi Amit,
> >>
> >> He means the mapred.fairscheduler.assignmultiple FairScheduler
> >> property. It is true by default, which works well for most workloads
> >> if not benchmark style workloads. I would not usually trust that as a
> >> base perf. measure of everything that comes out of an upgrade.
> >>
> >> The other JIRA, MAPREDUCE-4451, has been resolved for 1.2.0.
> >>
> >> On Tue, Nov 27, 2012 at 3:20 PM, Amit Sela <am...@infolinks.com> wrote:
> >> > Hi Jon,
> >> >
> >> > I recently upgraded our cluster from Hadoop 0.20.3-append to Hadoop
> >> > 1.0.4
> >> > and I haven't noticed any performance issues. By  "multiple assignment
> >> > feature" do you mean speculative execution
> >> > (mapred.map.tasks.speculative.execution and
> >> > mapred.reduce.tasks.speculative.execution) ?
> >> >
> >> >
> >> > On Mon, Nov 26, 2012 at 11:49 PM, Jon Allen <ja...@gmail.com>
> wrote:
> >> >>
> >> >> Problem solved, but worth warning others about.
> >> >>
> >> >> Before the upgrade the reducers for the terasort process had been
> >> >> evenly
> >> >> distributed around the cluster - one per task tracker in turn,
> looping
> >> >> around the cluster until all tasks were allocated.  After the upgrade
> >> >> all
> >> >> reduce task had been submitted to small number of task trackers -
> >> >> submit
> >> >> tasks until the task tracker slots were full and then move onto the
> >> >> next
> >> >> task tracker.  Skewing the reducers like this quite clearly hit the
> >> >> benchmark performance.
> >> >>
> >> >> The reason for this turns out to be the fair scheduler rewrite
> >> >> (MAPREDUCE-2981) that appears to have subtly modified the behaviour
> of
> >> >> the
> >> >> assign multiple property. Previously this property caused a single
> map
> >> >> and a
> >> >> single reduce task to be allocated in a task tracker heartbeat
> (rather
> >> >> than
> >> >> the default of a map or a reduce).  After the upgrade it allocates as
> >> >> many
> >> >> tasks as there are available task slots.  Turning off the multiple
> >> >> assignment feature returned the terasort to its pre-upgrade
> >> >> performance.
> >> >>
> >> >> I can see potential benefits to this change and need to think through
> >> >> the
> >> >> consequences to real world applications (though in practice we're
> >> >> likely to
> >> >> move away from fair scheduler due to MAPREDUCE-4451).  Investigating
> >> >> this
> >> >> has been a pain so to warn other user is there anywhere central that
> >> >> can be
> >> >> used to record upgrade gotchas like this?
> >> >>
> >> >>
> >> >> On Fri, Nov 23, 2012 at 12:02 PM, Jon Allen <ja...@gmail.com>
> >> >> wrote:
> >> >>>
> >> >>> Hi,
> >> >>>
> >> >>> We've just upgraded our cluster from Hadoop 0.20.203 to 1.0.4 and
> have
> >> >>> hit performance problems.  Before the upgrade a 15TB terasort took
> >> >>> about 45
> >> >>> minutes, afterwards it takes just over an hour.  Looking in more
> >> >>> detail it
> >> >>> appears the shuffle phase has increased from 20 minutes to 40
> minutes.
> >> >>> Does
> >> >>> anyone have any thoughts about what's changed between these releases
> >> >>> that may have caused this? [...]
> --
> Harsh J
>
>

Re: Hadoop 1.0.4 Performance Problem

Posted by Harsh J <ha...@cloudera.com>.
Hi Amit,

The default scheduler is FIFO, and it may not work well for all forms of
workloads. Read up on the other schedulers available to see if they have
features that would benefit your workload:

Capacity Scheduler: http://hadoop.apache.org/docs/stable/capacity_scheduler.html
FairScheduler:         http://hadoop.apache.org/docs/stable/fair_scheduler.html

While there's a good overlap of features between them, there are a few
differences that set them apart and make each useful for different
use-cases. To summarize some of those differences: FairScheduler is
better suited to SLA-style job execution because of its preemption
features (which makes it useful where user and service workloads are
mixed), while CapacityScheduler provides manual-resource-request
oriented scheduling, e.g. for odd jobs with high memory requirements
(which makes it useful for running certain specific kinds of jobs
alongside the regular ones).
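
As a rough sketch of how one might switch: the scheduler is a JobTracker
property in mapred-site.xml. The property and class names below should match
the 1.x docs linked above, but double-check them against your build (the
scheduler contrib jar must also be on the JobTracker classpath), and the
allocation file path is only an example:

  <!-- mapred-site.xml: which task scheduler the JobTracker loads -->
  <property>
    <name>mapred.jobtracker.taskScheduler</name>
    <!-- or org.apache.hadoop.mapred.CapacityTaskScheduler for the capacity scheduler -->
    <value>org.apache.hadoop.mapred.FairScheduler</value>
  </property>

  <!-- fair scheduler only: location of the pools/allocations file (example path) -->
  <property>
    <name>mapred.fairscheduler.allocation.file</name>
    <value>/etc/hadoop/conf/fair-scheduler.xml</value>
  </property>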

On Tue, Nov 27, 2012 at 3:51 PM, Amit Sela <am...@infolinks.com> wrote:
> So this is a FairScheduler problem ?
> We are using the default Hadoop scheduler. Is there a reason to use the Fair
> Scheduler if most of the time we don't have more than 4 jobs running
> simultaneously ?
>
>
> On Tue, Nov 27, 2012 at 12:00 PM, Harsh J <ha...@cloudera.com> wrote:
>>
>> Hi Amit,
>>
>> He means the mapred.fairscheduler.assignmultiple FairScheduler
>> property. It is true by default, which works well for most workloads
>> if not benchmark style workloads. I would not usually trust that as a
>> base perf. measure of everything that comes out of an upgrade.
>>
>> The other JIRA, MAPREDUCE-4451, has been resolved for 1.2.0.
>>
>> On Tue, Nov 27, 2012 at 3:20 PM, Amit Sela <am...@infolinks.com> wrote:
>> > Hi Jon,
>> >
>> > I recently upgraded our cluster from Hadoop 0.20.3-append to Hadoop
>> > 1.0.4
>> > and I haven't noticed any performance issues. By  "multiple assignment
>> > feature" do you mean speculative execution
>> > (mapred.map.tasks.speculative.execution and
>> > mapred.reduce.tasks.speculative.execution) ?
>> >
>> >
>> > On Mon, Nov 26, 2012 at 11:49 PM, Jon Allen <ja...@gmail.com> wrote:
>> >>
>> >> Problem solved, but worth warning others about.
>> >>
>> >> Before the upgrade the reducers for the terasort process had been
>> >> evenly
>> >> distributed around the cluster - one per task tracker in turn, looping
>> >> around the cluster until all tasks were allocated.  After the upgrade
>> >> all
>> >> reduce task had been submitted to small number of task trackers -
>> >> submit
>> >> tasks until the task tracker slots were full and then move onto the
>> >> next
>> >> task tracker.  Skewing the reducers like this quite clearly hit the
>> >> benchmark performance.
>> >>
>> >> The reason for this turns out to be the fair scheduler rewrite
>> >> (MAPREDUCE-2981) that appears to have subtly modified the behaviour of
>> >> the
>> >> assign multiple property. Previously this property caused a single map
>> >> and a
>> >> single reduce task to be allocated in a task tracker heartbeat (rather
>> >> than
>> >> the default of a map or a reduce).  After the upgrade it allocates as
>> >> many
>> >> tasks as there are available task slots.  Turning off the multiple
>> >> assignment feature returned the terasort to its pre-upgrade
>> >> performance.
>> >>
>> >> I can see potential benefits to this change and need to think through
>> >> the
>> >> consequences to real world applications (though in practice we're
>> >> likely to
>> >> move away from fair scheduler due to MAPREDUCE-4451).  Investigating
>> >> this
>> >> has been a pain so to warn other user is there anywhere central that
>> >> can be
>> >> used to record upgrade gotchas like this?
>> >>
>> >>
>> >> On Fri, Nov 23, 2012 at 12:02 PM, Jon Allen <ja...@gmail.com>
>> >> wrote:
>> >>>
>> >>> Hi,
>> >>>
>> >>> We've just upgraded our cluster from Hadoop 0.20.203 to 1.0.4 and have
>> >>> hit performance problems.  Before the upgrade a 15TB terasort took
>> >>> about 45
>> >>> minutes, afterwards it takes just over an hour.  Looking in more
>> >>> detail it
>> >>> appears the shuffle phase has increased from 20 minutes to 40 minutes.
>> >>> Does
>> >>> anyone have any thoughts about what's changed between these releases
>> >>> that
>> >>> may have caused this?
>> >>>
>> >>> The only change to the system has been to Hadoop.  We moved from a
>> >>> tarball install of 0.20.203 with all processes running as hadoop to an
>> >>> RPM
>> >>> deployment of 1.0.4 with processes running as hdfs and mapred.
>> >>> Nothing else
>> >>> has changed.
>> >>>
>> >>> As a related question, we're still running with a configuration that
>> >>> was
>> >>> tuned for version 0.20.1. Are there any recommendations for tuning
>> >>> properties that have been introduced in recent versions that are worth
>> >>> investigating?
>> >>>
>> >>> Thanks,
>> >>> Jon
>> >>
>> >>
>> >
>>
>>
>>
>> --
>> Harsh J
>
>



-- 
Harsh J

Re: Hadoop 1.0.4 Performance Problem

Posted by Jon Allen <ja...@gmail.com>.
Yes, it's a fair scheduler thing - I don't think it's right to call it a
problem; it's just a change in behaviour that came as a surprise.

My reason for using the fair scheduler is that we have multiple users
running simultaneous jobs and the default FIFO scheduler doesn't share the
cluster between them - jobs just queue up one behind another.  My choice of
the fair scheduler rather than the capacity scheduler was made a couple of
years ago and was entirely arbitrary.
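
For anyone setting this up from scratch, a minimal sketch of the sort of
configuration being discussed - the pool names and numbers are invented for
illustration, so check the element names against the fair scheduler docs:

  <!-- mapred-site.xml -->
  <property>
    <name>mapred.fairscheduler.assignmultiple</name>
    <!-- false = at most one task assigned per heartbeat; turning this off is
         what returned our terasort to its pre-upgrade performance -->
    <value>false</value>
  </property>

  <!-- fair-scheduler.xml (the allocation file): one pool per user group -->
  <allocations>
    <pool name="etl">
      <minMaps>20</minMaps>
      <minReduces>10</minReduces>
      <weight>2.0</weight>
    </pool>
    <pool name="adhoc">
      <maxRunningJobs>4</maxRunningJobs>
    </pool>
    <userMaxJobsDefault>3</userMaxJobsDefault>
  </allocations>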

On Tue, Nov 27, 2012 at 10:21 AM, Amit Sela <am...@infolinks.com> wrote:

> So this is a FairScheduler problem ?
> We are using the default Hadoop scheduler. Is there a reason to use the
> Fair Scheduler if most of the time we don't have more than 4 jobs
> running simultaneously ?
>
>
> On Tue, Nov 27, 2012 at 12:00 PM, Harsh J <ha...@cloudera.com> wrote:
>
>> Hi Amit,
>>
>> He means the mapred.fairscheduler.assignmultiple FairScheduler
>> property. It is true by default, which works well for most workloads
>> if not benchmark style workloads. I would not usually trust that as a
>> base perf. measure of everything that comes out of an upgrade.
>>
>> The other JIRA, MAPREDUCE-4451, has been resolved for 1.2.0.
>>
>> On Tue, Nov 27, 2012 at 3:20 PM, Amit Sela <am...@infolinks.com> wrote:
>> > Hi Jon,
>> >
>> > I recently upgraded our cluster from Hadoop 0.20.3-append to Hadoop
>> 1.0.4
>> > and I haven't noticed any performance issues. By  "multiple assignment
>> > feature" do you mean speculative execution
>> > (mapred.map.tasks.speculative.execution and
>> > mapred.reduce.tasks.speculative.execution) ?
>> >
>> >
>> > On Mon, Nov 26, 2012 at 11:49 PM, Jon Allen <ja...@gmail.com>
>> wrote:
>> >>
>> >> Problem solved, but worth warning others about.
>> >>
>> >> Before the upgrade the reducers for the terasort process had been
>> evenly
>> >> distributed around the cluster - one per task tracker in turn, looping
>> >> around the cluster until all tasks were allocated.  After the upgrade
>> all
>> >> reduce task had been submitted to small number of task trackers -
>> submit
>> >> tasks until the task tracker slots were full and then move onto the
>> next
>> >> task tracker.  Skewing the reducers like this quite clearly hit the
>> >> benchmark performance.
>> >>
>> >> The reason for this turns out to be the fair scheduler rewrite
>> >> (MAPREDUCE-2981) that appears to have subtly modified the behaviour of
>> the
>> >> assign multiple property. Previously this property caused a single map
>> and a
>> >> single reduce task to be allocated in a task tracker heartbeat (rather
>> than
>> >> the default of a map or a reduce).  After the upgrade it allocates as
>> many
>> >> tasks as there are available task slots.  Turning off the multiple
>> >> assignment feature returned the terasort to its pre-upgrade
>> performance.
>> >>
>> >> I can see potential benefits to this change and need to think through
>> the
>> >> consequences to real world applications (though in practice we're
>> likely to
>> >> move away from fair scheduler due to MAPREDUCE-4451).  Investigating
>> this
>> >> has been a pain so to warn other user is there anywhere central that
>> can be
>> >> used to record upgrade gotchas like this?
>> >>
>> >>
>> >> On Fri, Nov 23, 2012 at 12:02 PM, Jon Allen <ja...@gmail.com>
>> wrote:
>> >>>
>> >>> Hi,
>> >>>
>> >>> We've just upgraded our cluster from Hadoop 0.20.203 to 1.0.4 and have
>> >>> hit performance problems.  Before the upgrade a 15TB terasort took
>> about 45
>> >>> minutes, afterwards it takes just over an hour.  Looking in more
>> detail it
>> >>> appears the shuffle phase has increased from 20 minutes to 40
>> minutes.  Does
>> >>> anyone have any thoughts about what's changed between these releases
>> that
>> >>> may have caused this?
>> >>>
>> >>> The only change to the system has been to Hadoop.  We moved from a
>> >>> tarball install of 0.20.203 with all processes running as hadoop to
>> an RPM
>> >>> deployment of 1.0.4 with processes running as hdfs and mapred.
>>  Nothing else
>> >>> has changed.
>> >>>
>> >>> As a related question, we're still running with a configuration that
>> was
>> >>> tuned for version 0.20.1. Are there any recommendations for tuning
>> >>> properties that have been introduced in recent versions that are worth
>> >>> investigating?
>> >>>
>> >>> Thanks,
>> >>> Jon
>> >>
>> >>
>> >
>>
>>
>>
>> --
>> Harsh J
>>
>
>

Re: Hadoop 1.0.4 Performance Problem

Posted by Amit Sela <am...@infolinks.com>.
So this is a FairScheduler problem?
We are using the default Hadoop scheduler. Is there a reason to use the
Fair Scheduler if most of the time we don't have more than 4 jobs
running simultaneously?

On Tue, Nov 27, 2012 at 12:00 PM, Harsh J <ha...@cloudera.com> wrote:

> Hi Amit,
>
> He means the mapred.fairscheduler.assignmultiple FairScheduler
> property. It is true by default, which works well for most workloads
> if not benchmark style workloads. I would not usually trust that as a
> base perf. measure of everything that comes out of an upgrade.
>
> The other JIRA, MAPREDUCE-4451, has been resolved for 1.2.0.
>
> On Tue, Nov 27, 2012 at 3:20 PM, Amit Sela <am...@infolinks.com> wrote:
> > Hi Jon,
> >
> > I recently upgraded our cluster from Hadoop 0.20.3-append to Hadoop 1.0.4
> > and I haven't noticed any performance issues. By  "multiple assignment
> > feature" do you mean speculative execution
> > (mapred.map.tasks.speculative.execution and
> > mapred.reduce.tasks.speculative.execution) ?
> >
> >
> > On Mon, Nov 26, 2012 at 11:49 PM, Jon Allen <ja...@gmail.com> wrote:
> >>
> >> Problem solved, but worth warning others about.
> >>
> >> Before the upgrade the reducers for the terasort process had been evenly
> >> distributed around the cluster - one per task tracker in turn, looping
> >> around the cluster until all tasks were allocated.  After the upgrade
> all
> >> reduce task had been submitted to small number of task trackers - submit
> >> tasks until the task tracker slots were full and then move onto the next
> >> task tracker.  Skewing the reducers like this quite clearly hit the
> >> benchmark performance.
> >>
> >> The reason for this turns out to be the fair scheduler rewrite
> >> (MAPREDUCE-2981) that appears to have subtly modified the behaviour of
> the
> >> assign multiple property. Previously this property caused a single map
> and a
> >> single reduce task to be allocated in a task tracker heartbeat (rather
> than
> >> the default of a map or a reduce).  After the upgrade it allocates as
> many
> >> tasks as there are available task slots.  Turning off the multiple
> >> assignment feature returned the terasort to its pre-upgrade performance.
> >>
> >> I can see potential benefits to this change and need to think through
> the
> >> consequences to real world applications (though in practice we're
> likely to
> >> move away from fair scheduler due to MAPREDUCE-4451).  Investigating
> this
> >> has been a pain so to warn other user is there anywhere central that
> can be
> >> used to record upgrade gotchas like this?
> >>
> >>
> >> On Fri, Nov 23, 2012 at 12:02 PM, Jon Allen <ja...@gmail.com>
> wrote:
> >>>
> >>> Hi,
> >>>
> >>> We've just upgraded our cluster from Hadoop 0.20.203 to 1.0.4 and have
> >>> hit performance problems.  Before the upgrade a 15TB terasort took
> about 45
> >>> minutes, afterwards it takes just over an hour.  Looking in more
> detail it
> >>> appears the shuffle phase has increased from 20 minutes to 40 minutes.
>  Does
> >>> anyone have any thoughts about what's changed between these releases
> that
> >>> may have caused this?
> >>>
> >>> The only change to the system has been to Hadoop.  We moved from a
> >>> tarball install of 0.20.203 with all processes running as hadoop to an
> RPM
> >>> deployment of 1.0.4 with processes running as hdfs and mapred.
>  Nothing else
> >>> has changed.
> >>>
> >>> As a related question, we're still running with a configuration that
> was
> >>> tuned for version 0.20.1. Are there any recommendations for tuning
> >>> properties that have been introduced in recent versions that are worth
> >>> investigating?
> >>>
> >>> Thanks,
> >>> Jon
> >>
> >>
> >
>
>
>
> --
> Harsh J
>

Re: Hadoop 1.0.4 Performance Problem

Posted by Amit Sela <am...@infolinks.com>.
So this is a FairScheduler problem?
We are using the default Hadoop scheduler. Is there a reason to use the
Fair Scheduler if most of the time we don't have more than 4 jobs
running simultaneously?
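
(For reference - a sketch of a typical Hadoop 1.x setup, not something
quoted from this thread: the fair scheduler is only active when
mapred-site.xml on the JobTracker opts into it, roughly as below.  If the
property is absent the default FIFO scheduler,
org.apache.hadoop.mapred.JobQueueTaskScheduler, is used.)

  <!-- mapred-site.xml on the JobTracker: opting into the fair scheduler -->
  <property>
    <name>mapred.jobtracker.taskScheduler</name>
    <value>org.apache.hadoop.mapred.FairScheduler</value>
  </property>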

On Tue, Nov 27, 2012 at 12:00 PM, Harsh J <ha...@cloudera.com> wrote:

> Hi Amit,
>
> He means the mapred.fairscheduler.assignmultiple FairScheduler
> property. It is true by default, which works well for most workloads
> if not benchmark style workloads. I would not usually trust that as a
> base perf. measure of everything that comes out of an upgrade.
>
> The other JIRA, MAPREDUCE-4451, has been resolved for 1.2.0.
>
> On Tue, Nov 27, 2012 at 3:20 PM, Amit Sela <am...@infolinks.com> wrote:
> > Hi Jon,
> >
> > I recently upgraded our cluster from Hadoop 0.20.3-append to Hadoop 1.0.4
> > and I haven't noticed any performance issues. By  "multiple assignment
> > feature" do you mean speculative execution
> > (mapred.map.tasks.speculative.execution and
> > mapred.reduce.tasks.speculative.execution) ?
> >
> >
> > On Mon, Nov 26, 2012 at 11:49 PM, Jon Allen <ja...@gmail.com> wrote:
> >>
> >> Problem solved, but worth warning others about.
> >>
> >> Before the upgrade the reducers for the terasort process had been evenly
> >> distributed around the cluster - one per task tracker in turn, looping
> >> around the cluster until all tasks were allocated.  After the upgrade
> all
> >> reduce task had been submitted to small number of task trackers - submit
> >> tasks until the task tracker slots were full and then move onto the next
> >> task tracker.  Skewing the reducers like this quite clearly hit the
> >> benchmark performance.
> >>
> >> The reason for this turns out to be the fair scheduler rewrite
> >> (MAPREDUCE-2981) that appears to have subtly modified the behaviour of
> the
> >> assign multiple property. Previously this property caused a single map
> and a
> >> single reduce task to be allocated in a task tracker heartbeat (rather
> than
> >> the default of a map or a reduce).  After the upgrade it allocates as
> many
> >> tasks as there are available task slots.  Turning off the multiple
> >> assignment feature returned the terasort to its pre-upgrade performance.
> >>
> >> I can see potential benefits to this change and need to think through
> the
> >> consequences to real world applications (though in practice we're
> likely to
> >> move away from fair scheduler due to MAPREDUCE-4451).  Investigating
> this
> >> has been a pain so to warn other user is there anywhere central that
> can be
> >> used to record upgrade gotchas like this?
> >>
> >>
> >> On Fri, Nov 23, 2012 at 12:02 PM, Jon Allen <ja...@gmail.com>
> wrote:
> >>>
> >>> Hi,
> >>>
> >>> We've just upgraded our cluster from Hadoop 0.20.203 to 1.0.4 and have
> >>> hit performance problems.  Before the upgrade a 15TB terasort took
> about 45
> >>> minutes, afterwards it takes just over an hour.  Looking in more
> detail it
> >>> appears the shuffle phase has increased from 20 minutes to 40 minutes.
>  Does
> >>> anyone have any thoughts about what's changed between these releases
> that
> >>> may have caused this?
> >>>
> >>> The only change to the system has been to Hadoop.  We moved from a
> >>> tarball install of 0.20.203 with all processes running as hadoop to an
> RPM
> >>> deployment of 1.0.4 with processes running as hdfs and mapred.
>  Nothing else
> >>> has changed.
> >>>
> >>> As a related question, we're still running with a configuration that
> was
> >>> tuned for version 0.20.1. Are there any recommendations for tuning
> >>> properties that have been introduced in recent versions that are worth
> >>> investigating?
> >>>
> >>> Thanks,
> >>> Jon
> >>
> >>
> >
>
>
>
> --
> Harsh J
>

Re: Hadoop 1.0.4 Performance Problem

Posted by Harsh J <ha...@cloudera.com>.
Hi Amit,

He means the mapred.fairscheduler.assignmultiple FairScheduler
property. It is true by default, which works well for most workloads,
though not for benchmark-style workloads. I would not usually trust that
as a baseline performance measure of everything that comes out of an
upgrade.

The other JIRA, MAPREDUCE-4451, has been resolved for 1.2.0.

On Tue, Nov 27, 2012 at 3:20 PM, Amit Sela <am...@infolinks.com> wrote:
> Hi Jon,
>
> I recently upgraded our cluster from Hadoop 0.20.3-append to Hadoop 1.0.4
> and I haven't noticed any performance issues. By  "multiple assignment
> feature" do you mean speculative execution
> (mapred.map.tasks.speculative.execution and
> mapred.reduce.tasks.speculative.execution) ?
>
>
> On Mon, Nov 26, 2012 at 11:49 PM, Jon Allen <ja...@gmail.com> wrote:
>>
>> Problem solved, but worth warning others about.
>>
>> Before the upgrade the reducers for the terasort process had been evenly
>> distributed around the cluster - one per task tracker in turn, looping
>> around the cluster until all tasks were allocated.  After the upgrade all
>> reduce task had been submitted to small number of task trackers - submit
>> tasks until the task tracker slots were full and then move onto the next
>> task tracker.  Skewing the reducers like this quite clearly hit the
>> benchmark performance.
>>
>> The reason for this turns out to be the fair scheduler rewrite
>> (MAPREDUCE-2981) that appears to have subtly modified the behaviour of the
>> assign multiple property. Previously this property caused a single map and a
>> single reduce task to be allocated in a task tracker heartbeat (rather than
>> the default of a map or a reduce).  After the upgrade it allocates as many
>> tasks as there are available task slots.  Turning off the multiple
>> assignment feature returned the terasort to its pre-upgrade performance.
>>
>> I can see potential benefits to this change and need to think through the
>> consequences to real world applications (though in practice we're likely to
>> move away from fair scheduler due to MAPREDUCE-4451).  Investigating this
>> has been a pain so to warn other user is there anywhere central that can be
>> used to record upgrade gotchas like this?
>>
>>
>> On Fri, Nov 23, 2012 at 12:02 PM, Jon Allen <ja...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> We've just upgraded our cluster from Hadoop 0.20.203 to 1.0.4 and have
>>> hit performance problems.  Before the upgrade a 15TB terasort took about 45
>>> minutes, afterwards it takes just over an hour.  Looking in more detail it
>>> appears the shuffle phase has increased from 20 minutes to 40 minutes.  Does
>>> anyone have any thoughts about what's changed between these releases that
>>> may have caused this?
>>>
>>> The only change to the system has been to Hadoop.  We moved from a
>>> tarball install of 0.20.203 with all processes running as hadoop to an RPM
>>> deployment of 1.0.4 with processes running as hdfs and mapred.  Nothing else
>>> has changed.
>>>
>>> As a related question, we're still running with a configuration that was
>>> tuned for version 0.20.1. Are there any recommendations for tuning
>>> properties that have been introduced in recent versions that are worth
>>> investigating?
>>>
>>> Thanks,
>>> Jon
>>
>>
>



-- 
Harsh J

Re: Hadoop 1.0.4 Performance Problem

Posted by Amit Sela <am...@infolinks.com>.
Hi Jon,

I recently upgraded our cluster from Hadoop 0.20.3-append to Hadoop 1.0.4
and I haven't noticed any performance issues. By "multiple assignment
feature" do you mean speculative execution
(mapred.map.tasks.speculative.execution
and mapred.reduce.tasks.speculative.execution)?
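
(For clarity, a sketch of the per-job speculative execution switches
mentioned above - the property names are the ones in the question, and to
the best of my knowledge both default to true in Hadoop 1.x; they are
unrelated to the fair scheduler property discussed elsewhere in the
thread.)

  <property>
    <name>mapred.map.tasks.speculative.execution</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.reduce.tasks.speculative.execution</name>
    <value>true</value>
  </property>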


On Mon, Nov 26, 2012 at 11:49 PM, Jon Allen <ja...@gmail.com> wrote:

> Problem solved, but worth warning others about.
>
> Before the upgrade the reducers for the terasort process had been evenly
> distributed around the cluster - one per task tracker in turn, looping
> around the cluster until all tasks were allocated.  After the upgrade all
> reduce task had been submitted to small number of task trackers - submit
> tasks until the task tracker slots were full and then move onto the next
> task tracker.  Skewing the reducers like this quite clearly hit the
> benchmark performance.
>
> The reason for this turns out to be the fair scheduler rewrite
> (MAPREDUCE-2981) that appears to have subtly modified the behaviour of the
> assign multiple property. Previously this property caused a single map and
> a single reduce task to be allocated in a task tracker heartbeat (rather
> than the default of a map or a reduce).  After the upgrade it allocates as
> many tasks as there are available task slots.  Turning off the multiple
> assignment feature returned the terasort to its pre-upgrade performance.
>
> I can see potential benefits to this change and need to think through the
> consequences to real world applications (though in practice we're likely to
> move away from fair scheduler due to MAPREDUCE-4451).  Investigating this
> has been a pain so to warn other user is there anywhere central that can be
> used to record upgrade gotchas like this?
>
>
> On Fri, Nov 23, 2012 at 12:02 PM, Jon Allen <ja...@gmail.com> wrote:
>
>> Hi,
>>
>> We've just upgraded our cluster from Hadoop 0.20.203 to 1.0.4 and have
>> hit performance problems.  Before the upgrade a 15TB terasort took about 45
>> minutes, afterwards it takes just over an hour.  Looking in more detail it
>> appears the shuffle phase has increased from 20 minutes to 40 minutes.
>>  Does anyone have any thoughts about what's changed between these releases
>> that may have caused this?
>>
>> The only change to the system has been to Hadoop.  We moved from a
>> tarball install of 0.20.203 with all processes running as hadoop to an RPM
>> deployment of 1.0.4 with processes running as hdfs and mapred.  Nothing
>> else has changed.
>>
>> As a related question, we're still running with a configuration that was
>> tuned for version 0.20.1. Are there any recommendations for tuning
>> properties that have been introduced in recent versions that are worth
>> investigating?
>>
>> Thanks,
>> Jon
>
>
>

Re: Hadoop 1.0.4 Performance Problem

Posted by Jon Allen <ja...@gmail.com>.
Problem solved, but worth warning others about.

Before the upgrade the reducers for the terasort process had been evenly
distributed around the cluster - one per task tracker in turn, looping
around the cluster until all tasks were allocated.  After the upgrade all
reduce tasks had been submitted to a small number of task trackers -
tasks were submitted until a task tracker's slots were full before moving
on to the next task tracker.  Skewing the reducers like this quite
clearly hit the benchmark performance.

The reason for this turns out to be the fair scheduler rewrite
(MAPREDUCE-2981) that appears to have subtly modified the behaviour of the
assign multiple property. Previously this property caused a single map and
a single reduce task to be allocated in a task tracker heartbeat (rather
than the default of a map or a reduce).  After the upgrade it allocates as
many tasks as there are available task slots.  Turning off the multiple
assignment feature returned the terasort to its pre-upgrade performance.
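
(A sketch of what turning the feature off looks like - the property name,
mapred.fairscheduler.assignmultiple, is the one named elsewhere in this
thread; putting it in mapred-site.xml on the JobTracker and restarting the
JobTracker is an assumption about a typical Hadoop 1.x deployment rather
than something stated here.)

  <!-- mapred-site.xml on the JobTracker: disable multiple task assignment -->
  <property>
    <name>mapred.fairscheduler.assignmultiple</name>
    <value>false</value>
  </property>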

I can see potential benefits to this change and need to think through the
consequences for real-world applications (though in practice we're likely
to move away from the fair scheduler due to MAPREDUCE-4451).  Investigating
this has been a pain, so to warn other users: is there anywhere central
that can be used to record upgrade gotchas like this?


On Fri, Nov 23, 2012 at 12:02 PM, Jon Allen <ja...@gmail.com> wrote:

> Hi,
>
> We've just upgraded our cluster from Hadoop 0.20.203 to 1.0.4 and have hit
> performance problems.  Before the upgrade a 15TB terasort took about 45
> minutes, afterwards it takes just over an hour.  Looking in more detail it
> appears the shuffle phase has increased from 20 minutes to 40 minutes.
>  Does anyone have any thoughts about what's changed between these releases
> that may have caused this?
>
> The only change to the system has been to Hadoop.  We moved from a tarball
> install of 0.20.203 with all processes running as hadoop to an RPM
> deployment of 1.0.4 with processes running as hdfs and mapred.  Nothing
> else has changed.
>
> As a related question, we're still running with a configuration that was
> tuned for version 0.20.1. Are there any recommendations for tuning
> properties that have been introduced in recent versions that are worth
> investigating?
>
> Thanks,
> Jon
