Posted to user@hadoop.apache.org by Jie Li <ji...@cs.duke.edu> on 2012/12/14 02:46:25 UTC

Re: Hadoop 1.0.4 Performance Problem

Hi Jon,

Thanks for sharing these insights! Can't agree with you more!

Recently we released a tool called the Starfish Hadoop Log Analyzer for
analyzing job histories. I believe it would have quickly pointed out the
reduce problem you hit!

http://www.cs.duke.edu/starfish/release.html

Jie

On Wed, Nov 28, 2012 at 5:32 PM, Jon Allen <ja...@gmail.com> wrote:
> Jie,
>
> Simple answer - I got lucky (though obviously there are things you need to
> have in place to allow you to be lucky).
>
> Before running the upgrade I ran a set of tests to baseline the cluster
> performance, e.g. terasort, gridmix and some operational jobs.  Terasort by
> itself isn't very realistic as a cluster test but it's nice and simple to
> run and is good for regression testing things after a change.
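>
> For the record, the terasort baseline is just the stock example jobs - the
> jar name, row count and paths below are illustrative (teragen rows are 100
> bytes, so 15TB is roughly 150 billion rows):
>
>   hadoop jar hadoop-examples-1.0.4.jar teragen 150000000000 /bench/tera-in
>   hadoop jar hadoop-examples-1.0.4.jar terasort /bench/tera-in /bench/tera-out
>   hadoop jar hadoop-examples-1.0.4.jar teravalidate /bench/tera-out /bench/tera-report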
>
> After the upgrade the intention was to run the same tests and show that the
> performance hadn't degraded (improved would have been nice but not worse was
> the minimum).  When we ran the terasort we found that performance was about
> 50% worse - execution time had gone from 40 minutes to 60 minutes.  As I've
> said, terasort doesn't provide a realistic view of operational performance
> but this showed that something major had changed and we needed to understand
> it before going further.  So how to go about diagnosing this ...
>
> First rule - understand what you're trying to achieve.  It's very easy to
> say performance isn't good enough but performance can always be better so
> you need to know what's realistic and at what point you're going to stop
> tuning things.  I had a previous baseline that I was trying to match so I
> knew what I was trying to achieve.
>
> Next thing to do is profile your job and identify where the problem is.  We
> had the full job history from the before and after jobs and comparing these
> we saw that map performance was fairly consistent as were the reduce sort
> and reduce phases.  The problem was with the shuffle, which had gone from 20
> minutes pre-upgrade to 40 minutes afterwards.  The important thing here is
> to make sure you've got as much information as possible.  If we'd just kept
> the overall job time then there would have been a lot more areas to look at
> but knowing the problem was with shuffle allowed me to focus effort in this
> area.
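>
> As a rough sketch of that comparison, something like the following could be
> pointed at a 1.x job history file (assuming the usual ReduceAttempt records
> with START_TIME, SHUFFLE_FINISHED, SORT_FINISHED and FINISH_TIME key="value"
> pairs - the class name and parsing details here are just for illustration):
>
>   // PhaseTimes.java - average shuffle/sort/reduce durations per reduce attempt
>   import java.io.BufferedReader;
>   import java.io.FileReader;
>   import java.util.HashMap;
>   import java.util.Map;
>   import java.util.regex.Matcher;
>   import java.util.regex.Pattern;
>
>   public class PhaseTimes {
>     private static final Pattern KV = Pattern.compile("(\\w+)=\"([^\"]*)\"");
>
>     public static void main(String[] args) throws Exception {
>       long shuffle = 0, sort = 0, reduce = 0;
>       int attempts = 0;
>       Map<String, Long> starts = new HashMap<String, Long>();
>       BufferedReader in = new BufferedReader(new FileReader(args[0]));
>       for (String line; (line = in.readLine()) != null; ) {
>         if (!line.startsWith("ReduceAttempt")) continue;
>         Map<String, String> kv = new HashMap<String, String>();
>         Matcher m = KV.matcher(line);
>         while (m.find()) kv.put(m.group(1), m.group(2));
>         String id = kv.get("TASK_ATTEMPT_ID");
>         if (kv.containsKey("START_TIME")) {
>           starts.put(id, Long.parseLong(kv.get("START_TIME")));   // start record
>         } else if (kv.containsKey("SHUFFLE_FINISHED") && starts.containsKey(id)) {
>           long start = starts.get(id);                            // finish record
>           long shufEnd = Long.parseLong(kv.get("SHUFFLE_FINISHED"));
>           long sortEnd = Long.parseLong(kv.get("SORT_FINISHED"));
>           long finish = Long.parseLong(kv.get("FINISH_TIME"));
>           shuffle += shufEnd - start;
>           sort += sortEnd - shufEnd;
>           reduce += finish - sortEnd;
>           attempts++;
>         }
>       }
>       in.close();
>       int n = Math.max(1, attempts);
>       System.out.printf("attempts=%d  avg shuffle=%ds  sort=%ds  reduce=%ds%n",
>           attempts, shuffle / n / 1000, sort / n / 1000, reduce / n / 1000);
>     }
>   }
>
> Running it against the pre- and post-upgrade history files makes the phase
> that moved stand out immediately.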
>
> So what had changed in the shuffle that might have slowed things down?  The
> first thing we thought of was that we'd moved from a tarball deployment to
> using the RPM, so what effect might that have had on things?  Our operational
> configuration compresses the map output and in the past we've had problems
> with Java compression libraries being used rather than native ones and this
> has affected performance.  We knew the RPM deployment had moved the native
> library so spent some time confirming to ourselves that these were being
> used correctly (but this turned out to not be the problem).  We then spent
> time doing some process and server profiling - using dstat to look at the
> server bottlenecks and jstack/jmap to check what the task tracker and reduce
> processes were doing.  Although not directly relevant to this particular
> problem doing this was useful just to get my head around what Hadoop is
> doing at various points of the process.
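>
> As a sketch, this is a quick way to settle the native-library question - the
> NativeCodeLoader/ZlibFactory calls below are the 1.x API as I remember it,
> and the task logs also say "Loaded the native-hadoop library" when the
> natives are picked up:
>
>   // NativeCheck.java - is the native hadoop library / native zlib usable?
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.io.compress.zlib.ZlibFactory;
>   import org.apache.hadoop.util.NativeCodeLoader;
>
>   public class NativeCheck {
>     public static void main(String[] args) {
>       Configuration conf = new Configuration();
>       System.out.println("native hadoop lib loaded: "
>           + NativeCodeLoader.isNativeCodeLoaded());
>       System.out.println("native zlib usable:       "
>           + ZlibFactory.isNativeZlibLoaded(conf));
>     }
>   }
>
> Run it with the same java.library.path the task JVMs get; if either line
> prints false then the pure-Java codecs are in play.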
>
> The next bit was one place where I got lucky - I happened to be logged onto
> one of the worker nodes when a test job was running and I noticed that there
> weren't any reduce tasks running on the server.  This was odd as we'd
> submitted more reducers than we have servers so I'd expected at least one
> task to be running on each server.  Checking the job tracker log file it
> turned out that since the upgrade the job tracker had been submitting reduce
> tasks to only 10% of the available nodes.  A different 10% each time the job
> was run so clearly the individual task trackers were working OK but there
> was something odd going on with the task allocation.  Checking the job
> tracker log file showed that before the upgrade tasks had been fairly evenly
> distributed so something had changed.  After that it was a case of digging
> around the source code to find out which classes were available for task
> allocation and what inside them had changed.  This can be quite daunting but
> if you're comfortable with Java then it's just a case of following the calls
> through the code.  Once I found the cause it was just a case of working out
> what my options were for working around it (in this case turning off the
> multiple assignment option - I can work out whether I want to turn it back
> on in slower time).
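>
> For what it's worth, spotting that skew doesn't need anything clever - a
> rough sketch that counts reduce attempts per tracker straight out of the
> job tracker log (the "for tracker '...'" phrasing is what I recall the 1.x
> JobTracker logging on assignment, so adjust the regex to taste):
>
>   // TrackerSkew.java - reduce attempts assigned per task tracker
>   import java.io.BufferedReader;
>   import java.io.FileReader;
>   import java.util.Map;
>   import java.util.TreeMap;
>   import java.util.regex.Matcher;
>   import java.util.regex.Pattern;
>
>   public class TrackerSkew {
>     // reduce attempt ids contain "_r_"; map attempts ("_m_") are ignored
>     private static final Pattern ASSIGN = Pattern.compile(
>         "'(attempt_\\d+_\\d+_r_\\d+_\\d+)'.*for tracker '([^']+)'");
>
>     public static void main(String[] args) throws Exception {
>       Map<String, Integer> perTracker = new TreeMap<String, Integer>();
>       BufferedReader in = new BufferedReader(new FileReader(args[0]));
>       for (String line; (line = in.readLine()) != null; ) {
>         Matcher m = ASSIGN.matcher(line);
>         if (!m.find()) continue;                 // not a reduce assignment
>         String tracker = m.group(2);
>         Integer count = perTracker.get(tracker);
>         perTracker.put(tracker, count == null ? 1 : count + 1);
>       }
>       in.close();
>       for (Map.Entry<String, Integer> e : perTracker.entrySet()) {
>         System.out.println(e.getValue() + "\t" + e.getKey());
>       }
>     }
>   }
>
> A healthy run shows roughly even counts across trackers; our post-upgrade
> runs showed a handful of trackers with everything and the rest with nothing.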
>
> Where I think we got very lucky is that we hit this problem.  The
> configuration we use for the terasort has just over 1 reducer per worker
> node rather than maxing out the available reducer slots.  This decision was
> made several years ago and I can't remember the reasons for it.  If we'd been
> using a larger number of reducers then the number of worker nodes in use
> would have been similar regardless of the allocation algorithm and so the
> performance would have looked similar before and after the upgrade.  We
> would have hit this problem eventually but probably not until we started
> running user jobs and by then it would be too late to do the intrusive
> investigations that were possible now.
>
> Hope this has been useful.
>
> Regards,
> Jon
>
>
>
> On Tue, Nov 27, 2012 at 3:08 PM, Jie Li <ji...@cs.duke.edu> wrote:
>>
>> Jon:
>>
>> This is interesting and helpful! How did you figure out the cause? And how
>> much time did you spend? Could you share some experience of performance
>> diagnosis?
>>
>> Jie
>>
>> On Tuesday, November 27, 2012, Harsh J wrote:
>>>
>>> Hi Amit,
>>>
>>> The default scheduler is FIFO, and may not work for all forms of
>>> workloads. Read the multiple schedulers available to see if they have
>>> features that may benefit your workload:
>>>
>>> Capacity Scheduler:
>>> http://hadoop.apache.org/docs/stable/capacity_scheduler.html
>>> FairScheduler:
>>> http://hadoop.apache.org/docs/stable/fair_scheduler.html
>>>
>>> While there's a good overlap of features between them, there are a few
>>> differences that set them apart and make them each useful for
>>> different use-cases. If I had to summarize some of the differences,
>>> FairScheduler is better suited to SLA-style job execution situations
>>> due to its preemptive features (which make it useful in user and
>>> service mix scenarios), while CapacityScheduler provides
>>> manual-resource-request oriented scheduling for odd jobs with high
>>> memory workloads, etc. (which makes it useful for running certain
>>> specific kinds of jobs alongside the regular ones).
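>>>
>>> Both are enabled the same way - point the JobTracker at the scheduler class
>>> in mapred-site.xml (with the scheduler's contrib jar on its classpath if it
>>> isn't already) and restart it, roughly:
>>>
>>>   <property>
>>>     <name>mapred.jobtracker.taskScheduler</name>
>>>     <value>org.apache.hadoop.mapred.FairScheduler</value>
>>>   </property>
>>>
>>> or org.apache.hadoop.mapred.CapacityTaskScheduler for the Capacity
>>> Scheduler.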
>>>
>>> On Tue, Nov 27, 2012 at 3:51 PM, Amit Sela <am...@infolinks.com> wrote:
>>> > So this is a FairScheduler problem ?
>>> > We are using the default Hadoop scheduler. Is there a reason to use the
>>> > Fair
>>> > Scheduler if most of the time we don't have more than 4 jobs running
>>> > simultaneously ?
>>> >
>>> >
>>> > On Tue, Nov 27, 2012 at 12:00 PM, Harsh J <ha...@cloudera.com> wrote:
>>> >>
>>> >> Hi Amit,
>>> >>
>>> >> He means the mapred.fairscheduler.assignmultiple FairScheduler
>>> >> property. It is true by default, which works well for most workloads,
>>> >> though not for benchmark-style workloads. I would not usually trust that as a
>>> >> base perf. measure of everything that comes out of an upgrade.
>>> >>
>>> >> The other JIRA, MAPREDUCE-4451, has been resolved for 1.2.0.
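>>> >>
>>> >> Turning the multiple-assignment behaviour off is just a JobTracker-side
>>> >> property in mapred-site.xml (it defaults to true):
>>> >>
>>> >>   <property>
>>> >>     <name>mapred.fairscheduler.assignmultiple</name>
>>> >>     <value>false</value>
>>> >>   </property>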
>>> >>
>>> >> On Tue, Nov 27, 2012 at 3:20 PM, Amit Sela <am...@infolinks.com>
>>> >> wrote:
>>> >> > Hi Jon,
>>> >> >
>>> >> > I recently upgraded our cluster from Hadoop 0.20.3-append to Hadoop
>>> >> > 1.0.4
>>> >> > and I haven't noticed any performance issues. By  "multiple
>>> >> > assignment
>>> >> > feature" do you mean speculative execution
>>> >> > (mapred.map.tasks.speculative.execution and
>>> >> > mapred.reduce.tasks.speculative.execution) ?
>>> >> >
>>> >> >
>>> >> > On Mon, Nov 26, 2012 at 11:49 PM, Jon Allen <ja...@gmail.com>
>>> >> > wrote:
>>> >> >>
>>> >> >> Problem solved, but worth warning others about.
>>> >> >>
>>> >> >> Before the upgrade the reducers for the terasort process had been
>>> >> >> evenly
>>> >> >> distributed around the cluster - one per task tracker in turn,
>>> >> >> looping
>>> >> >> around the cluster until all tasks were allocated.  After the
>>> >> >> upgrade
>>> >> >> all
>> >> >> reduce tasks had been submitted to a small number of task trackers -
>>> >> >> submit
>>> >> >> tasks until the task tracker slots were full and then move onto the
>>> >> >> next
>>> >> >> task tracker.  Skewing the reducers like this quite clearly hit the
>>> >> >> benchmark performance.
>>> >> >>
>>> >> >> The reason for this turns out to be the fair scheduler rewrite
>>> >> >> (MAPREDUCE-2981) that appears to have subtly modified the behaviour
>>> >> >> of
>>> >> >> the
>>> >> >> assign multiple property. Previously this property caused a single
>>> >> >> map
>>> >> >> and a
>>> >> >> single reduce task to be allocated in a task tracker heartbeat
>>> >> >> (rather
>>> >> >> than
>>> >> >> the default of a map or a reduce).  After the upgrade it allocates
>>> >> >> as
>>> >> >> many
>>> >> >> tasks as there are available task slots.  Turning off the multiple
>>> >> >> assignment feature returned the terasort to its pre-upgrade
>>> >> >> performance.
>>> >> >>
>>> >> >> I can see potential benefits to this change and need to think
>>> >> >> through
>>> >> >> the
>>> >> >> consequences to real world applications (though in practice we're
>>> >> >> likely to
>>> >> >> move away from fair scheduler due to MAPREDUCE-4451).
>>> >> >> Investigating
>>> >> >> this
>> >> >> has been a pain, so to warn other users: is there anywhere central
>>> >> >> that
>>> >> >> can be
>>> >> >> used to record upgrade gotchas like this?
>>> >> >>
>>> >> >>
>>> >> >> On Fri, Nov 23, 2012 at 12:02 PM, Jon Allen <ja...@gmail.com>
>>> >> >> wrote:
>>> >> >>>
>>> >> >>> Hi,
>>> >> >>>
>>> >> >>> We've just upgraded our cluster from Hadoop 0.20.203 to 1.0.4 and
>>> >> >>> have
>>> >> >>> hit performance problems.  Before the upgrade a 15TB terasort took
>>> >> >>> about 45
>>> >> >>> minutes, afterwards it takes just over an hour.  Looking in more
>>> >> >>> detail it
>>> >> >>> appears the shuffle phase has increased from 20 minutes to 40
>>> >> >>> minutes.
>>> >> >>> Does
>> >> >>> anyone have any thoughts about what ...
>>> --
>>> Harsh J
>>>
>

Re: Hadoop 1.0.4 Performance Problem

Posted by Jie Li <ji...@cs.duke.edu>.
Hi Chris,

The standalone log analyzer was released in December and designed to
be easier to use.

Regarding the license, I think it's OK to use it in a commercial
environment for evaluation purposes, and your feedback would help
us to improve it.

Jie

On Tue, Dec 18, 2012 at 1:02 AM, Chris Smith <cs...@gmail.com> wrote:
> Jie,
>
> "Recently" was over 11 months ago.  :-)
>
> Unfortunately the software licence requires that most of us 'negotiate' a
> commercial use license before we trial the software in a commercial
> environment:
> http://www.cs.duke.edu/starfish/files/SOFTWARE_LICENSE_AGREEMENT.txt and as
> clarified here:  http://www.cs.duke.edu/starfish/previous.html
>
> Under that last URL was a note that you were soon to distribute the source
> code under the Apache Software License.  Last time I asked the reply was
> that this would not happen.  Perhaps it is time to update your web pages or
> your license arrangements.  :-)
>
> I like what I saw on my home 'cluster' but haven't had the time to sort out
> licensing to trial this in a commercial environment.
>
> Chris
>
>
>
>
>
Re: Hadoop 1.0.4 Performance Problem

Posted by Jie Li <ji...@cs.duke.edu>.
Hi Chris,

The standalone log analyzer was released in December and designed to
be easier to use.

Regarding the license, I think it's ok to use it in the commercial
environment for the evaluation purpose, and your feedback would help
us to improve it.

Jie

On Tue, Dec 18, 2012 at 1:02 AM, Chris Smith <cs...@gmail.com> wrote:
> Jie,
>
> Recent was over 11 months ago.  :-)
>
> Unfortunately the software licence requires that most of us 'negotiate' a
> commerical use license before we trial the software in a commercial
> environment:
> http://www.cs.duke.edu/starfish/files/SOFTWARE_LICENSE_AGREEMENT.txt and as
> clarified here:  http://www.cs.duke.edu/starfish/previous.html
>
> Under that last URL was a note that you were soon to distribute the source
> code under the Apache Software License.  Last time I asked the reply was
> that this would not happen.  Perhaps it is time to update your web pages or
> your license arrangements.  :-)
>
> I like what I saw on my home 'cluster' but have not the time to sort out
> licensing to trial this in a commercial environment.
>
> Chris
>
>
>
>
>
> On 14 December 2012 01:46, Jie Li <ji...@cs.duke.edu> wrote:
>>
>> Hi Jon,
>>
>> Thanks for sharing these insights! Can't agree with you more!
>>
>> Recently we released a tool called Starfish Hadoop Log Analyzer for
>> analyzing the job histories. I believe it can quickly point out this
>> reduce problem you met!
>>
>> http://www.cs.duke.edu/starfish/release.html
>>
>> Jie
>>
>> On Wed, Nov 28, 2012 at 5:32 PM, Jon Allen <ja...@gmail.com> wrote:
>> > Jie,
>> >
>> > Simple answer - I got lucky (though obviously there are thing you need
>> > to
>> > have in place to allow you to be lucky).
>> >
>> > Before running the upgrade I ran a set of tests to baseline the cluster
>> > performance, e.g. terasort, gridmix and some operational jobs.  Terasort
>> > by
>> > itself isn't very realistic as a cluster test but it's nice and simple
>> > to
>> > run and is good for regression testing things after a change.
>> >
>> > After the upgrade the intention was to run the same tests and show that
>> > the
>> > performance hadn't degraded (improved would have been nice but not worse
>> > was
>> > the minimum).  When we ran the terasort we found that performance was
>> > about
>> > 50% worse - execution time had gone from 40 minutes to 60 minutes.  As
>> > I've
>> > said, terasort doesn't provide a realistic view of operational
>> > performance
>> > but this showed that something major had changed and we needed to
>> > understand
>> > it before going further.  So how to go about diagnosing this ...
>> >
>> > First rule - understand what you're trying to achieve.  It's very easy
>> > to
>> > say performance isn't good enough but performance can always be better
>> > so
>> > you need to know what's realistic and at what point you're going to stop
>> > tuning things.  I had a previous baseline that I was trying to match so
>> > I
>> > knew what I was trying to achieve.
>> >
>> > Next thing to do is profile your job and identify where the problem is.
>> > We
>> > had the full job history from the before and after jobs and comparing
>> > these
>> > we saw that map performance was fairly consistent as were the reduce
>> > sort
>> > and reduce phases.  The problem was with the shuffle, which had gone
>> > from 20
>> > minutes pre-upgrade to 40 minutes afterwards.  The important thing here
>> > is
>> > to make sure you've got as much information as possible.  If we'd just
>> > kept
>> > the overall job time then there would have been a lot more areas to look
>> > at
>> > but knowing the problem was with shuffle allowed me to focus effort in
>> > this
>> > area.
>> >
>> > So what had changed in the shuffle that may have slowed things down.
>> > The
>> > first thing we thought of was that we'd moved from a tarball deployment
>> > to
>> > using the RPM so what effect might this have had on things.  Our
>> > operational
>> > configuration compresses the map output and in the past we've had
>> > problems
>> > with Java compression libraries being used rather than native ones and
>> > this
>> > has affected performance.  We knew the RPM deployment had moved the
>> > native
>> > library so spent some time confirming to ourselves that these were being
>> > used correctly (but this turned out to not be the problem).  We then
>> > spent
>> > time doing some process and server profiling - using dstat to look at
>> > the
>> > server bottlenecks and jstack/jmap to check what the task tracker and
>> > reduce
>> > processes were doing.  Although not directly relevant to this particular
>> > problem doing this was useful just to get my head around what Hadoop is
>> > doing at various points of the process.
>> >
>> > The next bit was one place where I got lucky - I happened to be logged
>> > onto
>> > one of the worker nodes when a test job was running and I noticed that
>> > there
>> > weren't any reduce tasks running on the server.  This was odd as we'd
>> > submitted more reducers than we have servers so I'd expected at least
>> > one
>> > task to be running on each server.  Checking the job tracker log file it
>> > turned out that since the upgrade the job tracker had been submitting
>> > reduce
>> > tasks to only 10% of the available nodes.  A different 10% each time the
>> > job
>> > was run so clearly the individual task trackers were working OK but
>> > there
>> > was something odd going on with the task allocation.  Checking the job
>> > tracker log file showed that before the upgrade tasks had been fairly
>> > evenly
>> > distributed so something had changed.  After that it was a case of
>> > digging
>> > around the source code to find out which classes were available for task
>> > allocation and what inside them had changed.  This can be quite daunting
>> > but
>> > if you're comfortable with Java then it's just a case of following the
>> > calls
>> > through the code.  Once I found the cause it was just a case of working
>> > out
>> > what my options were for working around it (in this case turning off the
>> > multiple assignment option - I can work out whether I want to turn it
>> > back
>> > on in slower time).
>> >
>> > Where I think we got very lucky is that we hit this problem.  The
>> > configuration we use for the terasort has just over 1 reducer per worker
>> > node rather than maxing out the available reducer slots.  This decision
>> > was
>> > made several years and I can't remember the reasons for it.  If we'd
>> > been
>> > using a larger number of reducers then the number of worker nodes in use
>> > would have been similar regardless of the allocation algorithm and so
>> > the
>> > performance would have looked similar before and after the upgrade.  We
>> > would have hit this problem eventually but probably not until we started
>> > running user jobs and by then it would be too late to do the intrusive
>> > investigations that were possible now.
>> >
>> > Hope this has been useful.
>> >
>> > Regards,
>> > Jon
>> >
>> >
>> >
>> > On Tue, Nov 27, 2012 at 3:08 PM, Jie Li <ji...@cs.duke.edu> wrote:
>> >>
>> >> Jon:
>> >>
>> >> This is interesting and helpful! How did you figure out the cause? And
>> >> how
>> >> much time did you spend? Could you share some experience of performance
>> >> diagnosis?
>> >>
>> >> Jie
>> >>
>> >> On Tuesday, November 27, 2012, Harsh J wrote:
>> >>>
>> >>> Hi Amit,
>> >>>
>> >>> The default scheduler is FIFO, and may not work for all forms of
>> >>> workloads. Read the multiple schedulers available to see if they have
>> >>> features that may benefit your workload:
>> >>>
>> >>> Capacity Scheduler:
>> >>> http://hadoop.apache.org/docs/stable/capacity_scheduler.html
>> >>> FairScheduler:
>> >>> http://hadoop.apache.org/docs/stable/fair_scheduler.html
>> >>>
>> >>> While there's a good overlap of features between them, there are a few
>> >>> differences that set them apart and make them each useful for
>> >>> different use-cases. If I had to summarize on some such differences,
>> >>> FairScheduler is better suited to SLA form of job execution situations
>> >>> due to its preemptive features (which make it useful in user and
>> >>> service mix scenarios), while CapacityScheduler provides
>> >>> manual-resource-request oriented scheduling for odd jobs with high
>> >>> memory workloads, etc. (which make it useful for running certain
>> >>> specific kind of jobs along side the regular ones).
>> >>>
>> >>> On Tue, Nov 27, 2012 at 3:51 PM, Amit Sela <am...@infolinks.com>
>> >>> wrote:
>> >>> > So this is a FairScheduler problem ?
>> >>> > We are using the default Hadoop scheduler. Is there a reason to use
>> >>> > the
>> >>> > Fair
>> >>> > Scheduler if most of the time we don't have more than 4 jobs running
>> >>> > simultaneously ?
>> >>> >
>> >>> >
>> >>> > On Tue, Nov 27, 2012 at 12:00 PM, Harsh J <ha...@cloudera.com>
>> >>> > wrote:
>> >>> >>
>> >>> >> Hi Amit,
>> >>> >>
>> >>> >> He means the mapred.fairscheduler.assignmultiple FairScheduler
>> >>> >> property. It is true by default, which works well for most
>> >>> >> workloads
>> >>> >> if not benchmark style workloads. I would not usually trust that as
>> >>> >> a
>> >>> >> base perf. measure of everything that comes out of an upgrade.
>> >>> >>
>> >>> >> The other JIRA, MAPREDUCE-4451, has been resolved for 1.2.0.
>> >>> >>
>> >>> >> On Tue, Nov 27, 2012 at 3:20 PM, Amit Sela <am...@infolinks.com>
>> >>> >> wrote:
>> >>> >> > Hi Jon,
>> >>> >> >
>> >>> >> > I recently upgraded our cluster from Hadoop 0.20.3-append to
>> >>> >> > Hadoop
>> >>> >> > 1.0.4
>> >>> >> > and I haven't noticed any performance issues. By  "multiple
>> >>> >> > assignment
>> >>> >> > feature" do you mean speculative execution
>> >>> >> > (mapred.map.tasks.speculative.execution and
>> >>> >> > mapred.reduce.tasks.speculative.execution) ?
>> >>> >> >
>> >>> >> >
>> >>> >> > On Mon, Nov 26, 2012 at 11:49 PM, Jon Allen <ja...@gmail.com>
>> >>> >> > wrote:
>> >>> >> >>
>> >>> >> >> Problem solved, but worth warning others about.
>> >>> >> >>
>> >>> >> >> Before the upgrade the reducers for the terasort process had
>> >>> >> >> been
>> >>> >> >> evenly
>> >>> >> >> distributed around the cluster - one per task tracker in turn,
>> >>> >> >> looping
>> >>> >> >> around the cluster until all tasks were allocated.  After the
>> >>> >> >> upgrade
>> >>> >> >> all
>> >>> >> >> reduce task had been submitted to small number of task trackers
>> >>> >> >> -
>> >>> >> >> submit
>> >>> >> >> tasks until the task tracker slots were full and then move onto
>> >>> >> >> the
>> >>> >> >> next
>> >>> >> >> task tracker.  Skewing the reducers like this quite clearly hit
>> >>> >> >> the
>> >>> >> >> benchmark performance.
>> >>> >> >>
>> >>> >> >> The reason for this turns out to be the fair scheduler rewrite
>> >>> >> >> (MAPREDUCE-2981) that appears to have subtly modified the
>> >>> >> >> behaviour
>> >>> >> >> of
>> >>> >> >> the
>> >>> >> >> assign multiple property. Previously this property caused a
>> >>> >> >> single
>> >>> >> >> map
>> >>> >> >> and a
>> >>> >> >> single reduce task to be allocated in a task tracker heartbeat
>> >>> >> >> (rather
>> >>> >> >> than
>> >>> >> >> the default of a map or a reduce).  After the upgrade it
>> >>> >> >> allocates
>> >>> >> >> as
>> >>> >> >> many
>> >>> >> >> tasks as there are available task slots.  Turning off the
>> >>> >> >> multiple
>> >>> >> >> assignment feature returned the terasort to its pre-upgrade
>> >>> >> >> performance.
>> >>> >> >>
>> >>> >> >> I can see potential benefits to this change and need to think
>> >>> >> >> through
>> >>> >> >> the
>> >>> >> >> consequences to real world applications (though in practice
>> >>> >> >> we're
>> >>> >> >> likely to
>> >>> >> >> move away from fair scheduler due to MAPREDUCE-4451).
>> >>> >> >> Investigating
>> >>> >> >> this
>> >>> >> >> has been a pain so to warn other user is there anywhere central
>> >>> >> >> that
>> >>> >> >> can be
>> >>> >> >> used to record upgrade gotchas like this?
>> >>> >> >>
>> >>> >> >>
>> >>> >> >> On Fri, Nov 23, 2012 at 12:02 PM, Jon Allen
>> >>> >> >> <ja...@gmail.com>
>> >>> >> >> wrote:
>> >>> >> >>>
>> >>> >> >>> Hi,
>> >>> >> >>>
>> >>> >> >>> We've just upgraded our cluster from Hadoop 0.20.203 to 1.0.4
>> >>> >> >>> and
>> >>> >> >>> have
>> >>> >> >>> hit performance problems.  Before the upgrade a 15TB terasort
>> >>> >> >>> took
>> >>> >> >>> about 45
>> >>> >> >>> minutes, afterwards it takes just over an hour.  Looking in
>> >>> >> >>> more
>> >>> >> >>> detail it
>> >>> >> >>> appears the shuffle phase has increased from 20 minutes to 40
>> >>> >> >>> minutes.
>> >>> >> >>> Does
>> >>> >> >>> anyone have any thoughts about what'--
>> >>> Harsh J
>> >>>
>> >
>
>

Re: Hadoop 1.0.4 Performance Problem

Posted by Jie Li <ji...@cs.duke.edu>.
Hi Chris,

The standalone log analyzer was released in December and designed to
be easier to use.

Regarding the license, I think it's ok to use it in the commercial
environment for the evaluation purpose, and your feedback would help
us to improve it.

Jie

On Tue, Dec 18, 2012 at 1:02 AM, Chris Smith <cs...@gmail.com> wrote:
> Jie,
>
> Recent was over 11 months ago.  :-)
>
> Unfortunately the software licence requires that most of us 'negotiate' a
> commerical use license before we trial the software in a commercial
> environment:
> http://www.cs.duke.edu/starfish/files/SOFTWARE_LICENSE_AGREEMENT.txt and as
> clarified here:  http://www.cs.duke.edu/starfish/previous.html
>
> Under that last URL was a note that you were soon to distribute the source
> code under the Apache Software License.  Last time I asked the reply was
> that this would not happen.  Perhaps it is time to update your web pages or
> your license arrangements.  :-)
>
> I like what I saw on my home 'cluster' but have not the time to sort out
> licensing to trial this in a commercial environment.
>
> Chris
>
>
>
>
>
> On 14 December 2012 01:46, Jie Li <ji...@cs.duke.edu> wrote:
>>
>> Hi Jon,
>>
>> Thanks for sharing these insights! Can't agree with you more!
>>
>> Recently we released a tool called Starfish Hadoop Log Analyzer for
>> analyzing the job histories. I believe it can quickly point out this
>> reduce problem you met!
>>
>> http://www.cs.duke.edu/starfish/release.html
>>
>> Jie
>>
>> On Wed, Nov 28, 2012 at 5:32 PM, Jon Allen <ja...@gmail.com> wrote:
>> > Jie,
>> >
>> > Simple answer - I got lucky (though obviously there are thing you need
>> > to
>> > have in place to allow you to be lucky).
>> >
>> > Before running the upgrade I ran a set of tests to baseline the cluster
>> > performance, e.g. terasort, gridmix and some operational jobs.  Terasort
>> > by
>> > itself isn't very realistic as a cluster test but it's nice and simple
>> > to
>> > run and is good for regression testing things after a change.
>> >
>> > After the upgrade the intention was to run the same tests and show that
>> > the
>> > performance hadn't degraded (improved would have been nice but not worse
>> > was
>> > the minimum).  When we ran the terasort we found that performance was
>> > about
>> > 50% worse - execution time had gone from 40 minutes to 60 minutes.  As
>> > I've
>> > said, terasort doesn't provide a realistic view of operational
>> > performance
>> > but this showed that something major had changed and we needed to
>> > understand
>> > it before going further.  So how to go about diagnosing this ...
>> >
>> > First rule - understand what you're trying to achieve.  It's very easy
>> > to
>> > say performance isn't good enough but performance can always be better
>> > so
>> > you need to know what's realistic and at what point you're going to stop
>> > tuning things.  I had a previous baseline that I was trying to match so
>> > I
>> > knew what I was trying to achieve.
>> >
>> > Next thing to do is profile your job and identify where the problem is.
>> > We
>> > had the full job history from the before and after jobs and comparing
>> > these
>> > we saw that map performance was fairly consistent as were the reduce
>> > sort
>> > and reduce phases.  The problem was with the shuffle, which had gone
>> > from 20
>> > minutes pre-upgrade to 40 minutes afterwards.  The important thing here
>> > is
>> > to make sure you've got as much information as possible.  If we'd just
>> > kept
>> > the overall job time then there would have been a lot more areas to look
>> > at
>> > but knowing the problem was with shuffle allowed me to focus effort in
>> > this
>> > area.
>> >
>> > So what had changed in the shuffle that may have slowed things down.
>> > The
>> > first thing we thought of was that we'd moved from a tarball deployment
>> > to
>> > using the RPM so what effect might this have had on things.  Our
>> > operational
>> > configuration compresses the map output and in the past we've had
>> > problems
>> > with Java compression libraries being used rather than native ones and
>> > this
>> > has affected performance.  We knew the RPM deployment had moved the
>> > native
>> > library so spent some time confirming to ourselves that these were being
>> > used correctly (but this turned out to not be the problem).  We then
>> > spent
>> > time doing some process and server profiling - using dstat to look at
>> > the
>> > server bottlenecks and jstack/jmap to check what the task tracker and
>> > reduce
>> > processes were doing.  Although not directly relevant to this particular
>> > problem doing this was useful just to get my head around what Hadoop is
>> > doing at various points of the process.
>> >
>> > The next bit was one place where I got lucky - I happened to be logged
>> > onto
>> > one of the worker nodes when a test job was running and I noticed that
>> > there
>> > weren't any reduce tasks running on the server.  This was odd as we'd
>> > submitted more reducers than we have servers so I'd expected at least
>> > one
>> > task to be running on each server.  Checking the job tracker log file it
>> > turned out that since the upgrade the job tracker had been submitting
>> > reduce
>> > tasks to only 10% of the available nodes.  A different 10% each time the
>> > job
>> > was run so clearly the individual task trackers were working OK but
>> > there
>> > was something odd going on with the task allocation.  Checking the job
>> > tracker log file showed that before the upgrade tasks had been fairly
>> > evenly
>> > distributed so something had changed.  After that it was a case of
>> > digging
>> > around the source code to find out which classes were available for task
>> > allocation and what inside them had changed.  This can be quite daunting
>> > but
>> > if you're comfortable with Java then it's just a case of following the
>> > calls
>> > through the code.  Once I found the cause it was just a case of working
>> > out
>> > what my options were for working around it (in this case turning off the
>> > multiple assignment option - I can work out whether I want to turn it
>> > back
>> > on in slower time).
>> >
>> > Where I think we got very lucky is that we hit this problem.  The
>> > configuration we use for the terasort has just over 1 reducer per worker
>> > node rather than maxing out the available reducer slots.  This decision
>> > was
>> > made several years and I can't remember the reasons for it.  If we'd
>> > been
>> > using a larger number of reducers then the number of worker nodes in use
>> > would have been similar regardless of the allocation algorithm and so
>> > the
>> > performance would have looked similar before and after the upgrade.  We
>> > would have hit this problem eventually but probably not until we started
>> > running user jobs and by then it would be too late to do the intrusive
>> > investigations that were possible now.
>> >
>> > Hope this has been useful.
>> >
>> > Regards,
>> > Jon
>> >
>> >
>> >
>> > On Tue, Nov 27, 2012 at 3:08 PM, Jie Li <ji...@cs.duke.edu> wrote:
>> >>
>> >> Jon:
>> >>
>> >> This is interesting and helpful! How did you figure out the cause? And
>> >> how
>> >> much time did you spend? Could you share some experience of performance
>> >> diagnosis?
>> >>
>> >> Jie
>> >>
>> >> On Tuesday, November 27, 2012, Harsh J wrote:
>> >>>
>> >>> Hi Amit,
>> >>>
>> >>> The default scheduler is FIFO, and may not work for all forms of
>> >>> workloads. Read the multiple schedulers available to see if they have
>> >>> features that may benefit your workload:
>> >>>
>> >>> Capacity Scheduler:
>> >>> http://hadoop.apache.org/docs/stable/capacity_scheduler.html
>> >>> FairScheduler:
>> >>> http://hadoop.apache.org/docs/stable/fair_scheduler.html
>> >>>
>> >>> While there's a good overlap of features between them, there are a few
>> >>> differences that set them apart and make them each useful for
>> >>> different use-cases. If I had to summarize on some such differences,
>> >>> FairScheduler is better suited to SLA form of job execution situations
>> >>> due to its preemptive features (which make it useful in user and
>> >>> service mix scenarios), while CapacityScheduler provides
>> >>> manual-resource-request oriented scheduling for odd jobs with high
>> >>> memory workloads, etc. (which make it useful for running certain
>> >>> specific kind of jobs along side the regular ones).
>> >>>
>> >>> On Tue, Nov 27, 2012 at 3:51 PM, Amit Sela <am...@infolinks.com>
>> >>> wrote:
>> >>> > So this is a FairScheduler problem ?
>> >>> > We are using the default Hadoop scheduler. Is there a reason to use
>> >>> > the
>> >>> > Fair
>> >>> > Scheduler if most of the time we don't have more than 4 jobs running
>> >>> > simultaneously ?
>> >>> >
>> >>> >
>> >>> > On Tue, Nov 27, 2012 at 12:00 PM, Harsh J <ha...@cloudera.com>
>> >>> > wrote:
>> >>> >>
>> >>> >> Hi Amit,
>> >>> >>
>> >>> >> He means the mapred.fairscheduler.assignmultiple FairScheduler
>> >>> >> property. It is true by default, which works well for most
>> >>> >> workloads
>> >>> >> if not benchmark style workloads. I would not usually trust that as
>> >>> >> a
>> >>> >> base perf. measure of everything that comes out of an upgrade.
>> >>> >>
>> >>> >> The other JIRA, MAPREDUCE-4451, has been resolved for 1.2.0.
>> >>> >>
>> >>> >> On Tue, Nov 27, 2012 at 3:20 PM, Amit Sela <am...@infolinks.com>
>> >>> >> wrote:
>> >>> >> > Hi Jon,
>> >>> >> >
>> >>> >> > I recently upgraded our cluster from Hadoop 0.20.3-append to
>> >>> >> > Hadoop
>> >>> >> > 1.0.4
>> >>> >> > and I haven't noticed any performance issues. By  "multiple
>> >>> >> > assignment
>> >>> >> > feature" do you mean speculative execution
>> >>> >> > (mapred.map.tasks.speculative.execution and
>> >>> >> > mapred.reduce.tasks.speculative.execution) ?
>> >>> >> >
>> >>> >> >
>> >>> >> > On Mon, Nov 26, 2012 at 11:49 PM, Jon Allen <ja...@gmail.com>
>> >>> >> > wrote:
>> >>> >> >>
>> >>> >> >> Problem solved, but worth warning others about.
>> >>> >> >>
>> >>> >> >> Before the upgrade the reducers for the terasort process had
>> >>> >> >> been
>> >>> >> >> evenly
>> >>> >> >> distributed around the cluster - one per task tracker in turn,
>> >>> >> >> looping
>> >>> >> >> around the cluster until all tasks were allocated.  After the
>> >>> >> >> upgrade
>> >>> >> >> all
>> >>> >> >> reduce task had been submitted to small number of task trackers
>> >>> >> >> -
>> >>> >> >> submit
>> >>> >> >> tasks until the task tracker slots were full and then move onto
>> >>> >> >> the
>> >>> >> >> next
>> >>> >> >> task tracker.  Skewing the reducers like this quite clearly hit
>> >>> >> >> the
>> >>> >> >> benchmark performance.
>> >>> >> >>
>> >>> >> >> The reason for this turns out to be the fair scheduler rewrite
>> >>> >> >> (MAPREDUCE-2981) that appears to have subtly modified the
>> >>> >> >> behaviour
>> >>> >> >> of
>> >>> >> >> the
>> >>> >> >> assign multiple property. Previously this property caused a
>> >>> >> >> single
>> >>> >> >> map
>> >>> >> >> and a
>> >>> >> >> single reduce task to be allocated in a task tracker heartbeat
>> >>> >> >> (rather
>> >>> >> >> than
>> >>> >> >> the default of a map or a reduce).  After the upgrade it
>> >>> >> >> allocates
>> >>> >> >> as
>> >>> >> >> many
>> >>> >> >> tasks as there are available task slots.  Turning off the
>> >>> >> >> multiple
>> >>> >> >> assignment feature returned the terasort to its pre-upgrade
>> >>> >> >> performance.
>> >>> >> >>
>> >>> >> >> I can see potential benefits to this change and need to think
>> >>> >> >> through
>> >>> >> >> the
>> >>> >> >> consequences to real world applications (though in practice
>> >>> >> >> we're
>> >>> >> >> likely to
>> >>> >> >> move away from fair scheduler due to MAPREDUCE-4451).
>> >>> >> >> Investigating
>> >>> >> >> this
>> >>> >> >> has been a pain so to warn other user is there anywhere central
>> >>> >> >> that
>> >>> >> >> can be
>> >>> >> >> used to record upgrade gotchas like this?
>> >>> >> >>
>> >>> >> >>
>> >>> >> >> On Fri, Nov 23, 2012 at 12:02 PM, Jon Allen
>> >>> >> >> <ja...@gmail.com>
>> >>> >> >> wrote:
>> >>> >> >>>
>> >>> >> >>> Hi,
>> >>> >> >>>
>> >>> >> >>> We've just upgraded our cluster from Hadoop 0.20.203 to 1.0.4
>> >>> >> >>> and
>> >>> >> >>> have
>> >>> >> >>> hit performance problems.  Before the upgrade a 15TB terasort
>> >>> >> >>> took
>> >>> >> >>> about 45
>> >>> >> >>> minutes, afterwards it takes just over an hour.  Looking in
>> >>> >> >>> more
>> >>> >> >>> detail it
>> >>> >> >>> appears the shuffle phase has increased from 20 minutes to 40
>> >>> >> >>> minutes.
>> >>> >> >>> Does
>> >>> >> >>> anyone have any thoughts about what'--
>> >>> Harsh J
>> >>>
>> >
>
>

Re: Hadoop 1.0.4 Performance Problem

Posted by Jie Li <ji...@cs.duke.edu>.
Hi Chris,

The standalone log analyzer was released in December and designed to
be easier to use.

Regarding the license, I think it's ok to use it in the commercial
environment for the evaluation purpose, and your feedback would help
us to improve it.

Jie

On Tue, Dec 18, 2012 at 1:02 AM, Chris Smith <cs...@gmail.com> wrote:
> Jie,
>
> Recent was over 11 months ago.  :-)
>
> Unfortunately the software licence requires that most of us 'negotiate' a
> commerical use license before we trial the software in a commercial
> environment:
> http://www.cs.duke.edu/starfish/files/SOFTWARE_LICENSE_AGREEMENT.txt and as
> clarified here:  http://www.cs.duke.edu/starfish/previous.html
>
> Under that last URL was a note that you were soon to distribute the source
> code under the Apache Software License.  Last time I asked the reply was
> that this would not happen.  Perhaps it is time to update your web pages or
> your license arrangements.  :-)
>
> I like what I saw on my home 'cluster' but have not the time to sort out
> licensing to trial this in a commercial environment.
>
> Chris
>
>
>
>
>
> On 14 December 2012 01:46, Jie Li <ji...@cs.duke.edu> wrote:
>>
>> Hi Jon,
>>
>> Thanks for sharing these insights! Can't agree with you more!
>>
>> Recently we released a tool called Starfish Hadoop Log Analyzer for
>> analyzing the job histories. I believe it can quickly point out this
>> reduce problem you met!
>>
>> http://www.cs.duke.edu/starfish/release.html
>>
>> Jie
>>
>> On Wed, Nov 28, 2012 at 5:32 PM, Jon Allen <ja...@gmail.com> wrote:
>> > Jie,
>> >
>> > Simple answer - I got lucky (though obviously there are thing you need
>> > to
>> > have in place to allow you to be lucky).
>> >
>> > Before running the upgrade I ran a set of tests to baseline the cluster
>> > performance, e.g. terasort, gridmix and some operational jobs.  Terasort
>> > by
>> > itself isn't very realistic as a cluster test but it's nice and simple
>> > to
>> > run and is good for regression testing things after a change.
>> >
>> > After the upgrade the intention was to run the same tests and show that
>> > the
>> > performance hadn't degraded (improved would have been nice but not worse
>> > was
>> > the minimum).  When we ran the terasort we found that performance was
>> > about
>> > 50% worse - execution time had gone from 40 minutes to 60 minutes.  As
>> > I've
>> > said, terasort doesn't provide a realistic view of operational
>> > performance
>> > but this showed that something major had changed and we needed to
>> > understand
>> > it before going further.  So how to go about diagnosing this ...
>> >
>> > First rule - understand what you're trying to achieve.  It's very easy
>> > to
>> > say performance isn't good enough but performance can always be better
>> > so
>> > you need to know what's realistic and at what point you're going to stop
>> > tuning things.  I had a previous baseline that I was trying to match so
>> > I
>> > knew what I was trying to achieve.
>> >
>> > Next thing to do is profile your job and identify where the problem is.
>> > We
>> > had the full job history from the before and after jobs and comparing
>> > these
>> > we saw that map performance was fairly consistent as were the reduce
>> > sort
>> > and reduce phases.  The problem was with the shuffle, which had gone
>> > from 20
>> > minutes pre-upgrade to 40 minutes afterwards.  The important thing here
>> > is
>> > to make sure you've got as much information as possible.  If we'd just
>> > kept
>> > the overall job time then there would have been a lot more areas to look
>> > at
>> > but knowing the problem was with shuffle allowed me to focus effort in
>> > this
>> > area.
>> >
>> > So what had changed in the shuffle that may have slowed things down.
>> > The
>> > first thing we thought of was that we'd moved from a tarball deployment
>> > to
>> > using the RPM so what effect might this have had on things.  Our
>> > operational
>> > configuration compresses the map output and in the past we've had
>> > problems
>> > with Java compression libraries being used rather than native ones and
>> > this
>> > has affected performance.  We knew the RPM deployment had moved the
>> > native
>> > library so spent some time confirming to ourselves that these were being
>> > used correctly (but this turned out to not be the problem).  We then
>> > spent
>> > time doing some process and server profiling - using dstat to look at
>> > the
>> > server bottlenecks and jstack/jmap to check what the task tracker and
>> > reduce
>> > processes were doing.  Although not directly relevant to this particular
>> > problem doing this was useful just to get my head around what Hadoop is
>> > doing at various points of the process.
>> >
>> > The next bit was one place where I got lucky - I happened to be logged
>> > onto
>> > one of the worker nodes when a test job was running and I noticed that
>> > there
>> > weren't any reduce tasks running on the server.  This was odd as we'd
>> > submitted more reducers than we have servers so I'd expected at least
>> > one
>> > task to be running on each server.  Checking the job tracker log file it
>> > turned out that since the upgrade the job tracker had been submitting
>> > reduce
>> > tasks to only 10% of the available nodes.  A different 10% each time the
>> > job
>> > was run so clearly the individual task trackers were working OK but
>> > there
>> > was something odd going on with the task allocation.  Checking the job
>> > tracker log file showed that before the upgrade tasks had been fairly
>> > evenly
>> > distributed so something had changed.  After that it was a case of
>> > digging
>> > around the source code to find out which classes were available for task
>> > allocation and what inside them had changed.  This can be quite daunting
>> > but
>> > if you're comfortable with Java then it's just a case of following the
>> > calls
>> > through the code.  Once I found the cause it was just a case of working
>> > out
>> > what my options were for working around it (in this case turning off the
>> > multiple assignment option - I can work out whether I want to turn it
>> > back
>> > on in slower time).
>> >
>> > Where I think we got very lucky is that we hit this problem.  The
>> > configuration we use for the terasort has just over 1 reducer per worker
>> > node rather than maxing out the available reducer slots.  This decision
>> > was
>> > made several years and I can't remember the reasons for it.  If we'd
>> > been
>> > using a larger number of reducers then the number of worker nodes in use
>> > would have been similar regardless of the allocation algorithm and so
>> > the
>> > performance would have looked similar before and after the upgrade.  We
>> > would have hit this problem eventually but probably not until we started
>> > running user jobs and by then it would be too late to do the intrusive
>> > investigations that were possible now.
>> >
>> > Hope this has been useful.
>> >
>> > Regards,
>> > Jon
>> >
>> >
>> >
>> > On Tue, Nov 27, 2012 at 3:08 PM, Jie Li <ji...@cs.duke.edu> wrote:
>> >>
>> >> Jon:
>> >>
>> >> This is interesting and helpful! How did you figure out the cause? And
>> >> how
>> >> much time did you spend? Could you share some experience of performance
>> >> diagnosis?
>> >>
>> >> Jie
>> >>
>> >> On Tuesday, November 27, 2012, Harsh J wrote:
>> >>>
>> >>> Hi Amit,
>> >>>
>> >>> The default scheduler is FIFO, and may not work for all forms of
>> >>> workloads. Read the multiple schedulers available to see if they have
>> >>> features that may benefit your workload:
>> >>>
>> >>> Capacity Scheduler:
>> >>> http://hadoop.apache.org/docs/stable/capacity_scheduler.html
>> >>> FairScheduler:
>> >>> http://hadoop.apache.org/docs/stable/fair_scheduler.html
>> >>>
>> >>> While there's a good overlap of features between them, there are a few
>> >>> differences that set them apart and make them each useful for
>> >>> different use-cases. If I had to summarize some of those differences,
>> >>> FairScheduler is better suited to SLA-style job execution situations
>> >>> due to its preemptive features (which make it useful in user and
>> >>> service mix scenarios), while CapacityScheduler provides
>> >>> manual-resource-request oriented scheduling for odd jobs with high
>> >>> memory workloads, etc. (which make it useful for running certain
>> >>> specific kinds of jobs alongside the regular ones).
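
For completeness, which scheduler the JobTracker uses is itself just
configuration. A minimal sketch for Hadoop 1.x, assuming the stock contrib
schedulers (the matching scheduler jar also has to be on the JobTracker
classpath):

  <!-- mapred-site.xml on the JobTracker -->
  <property>
    <name>mapred.jobtracker.taskScheduler</name>
    <value>org.apache.hadoop.mapred.FairScheduler</value>
    <!-- or org.apache.hadoop.mapred.CapacityTaskScheduler -->
  </property>
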
>> >>>
>> >>> On Tue, Nov 27, 2012 at 3:51 PM, Amit Sela <am...@infolinks.com>
>> >>> wrote:
>> >>> > So this is a FairScheduler problem ?
>> >>> > We are using the default Hadoop scheduler. Is there a reason to use
>> >>> > the
>> >>> > Fair
>> >>> > Scheduler if most of the time we don't have more than 4 jobs running
>> >>> > simultaneously ?
>> >>> >
>> >>> >
>> >>> > On Tue, Nov 27, 2012 at 12:00 PM, Harsh J <ha...@cloudera.com>
>> >>> > wrote:
>> >>> >>
>> >>> >> Hi Amit,
>> >>> >>
>> >>> >> He means the mapred.fairscheduler.assignmultiple FairScheduler
>> >>> >> property. It is true by default, which works well for most
>> >>> >> workloads,
>> >>> >> though not for benchmark-style workloads. I would not usually trust that as
>> >>> >> a
>> >>> >> base perf. measure of everything that comes out of an upgrade.
>> >>> >>
>> >>> >> The other JIRA, MAPREDUCE-4451, has been resolved for 1.2.0.
>> >>> >>
>> >>> >> On Tue, Nov 27, 2012 at 3:20 PM, Amit Sela <am...@infolinks.com>
>> >>> >> wrote:
>> >>> >> > Hi Jon,
>> >>> >> >
>> >>> >> > I recently upgraded our cluster from Hadoop 0.20.3-append to
>> >>> >> > Hadoop
>> >>> >> > 1.0.4
>> >>> >> > and I haven't noticed any performance issues. By  "multiple
>> >>> >> > assignment
>> >>> >> > feature" do you mean speculative execution
>> >>> >> > (mapred.map.tasks.speculative.execution and
>> >>> >> > mapred.reduce.tasks.speculative.execution) ?
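
Those speculative execution switches are separate mapred-site.xml properties,
both on by default in 1.x, and - as Harsh clarifies above - not the feature
in question. For reference:

  <property>
    <name>mapred.map.tasks.speculative.execution</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.reduce.tasks.speculative.execution</name>
    <value>true</value>
  </property>
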
>> >>> >> >
>> >>> >> >
>> >>> >> > On Mon, Nov 26, 2012 at 11:49 PM, Jon Allen <ja...@gmail.com>
>> >>> >> > wrote:
>> >>> >> >>
>> >>> >> >> Problem solved, but worth warning others about.
>> >>> >> >>
>> >>> >> >> Before the upgrade the reducers for the terasort process had
>> >>> >> >> been
>> >>> >> >> evenly
>> >>> >> >> distributed around the cluster - one per task tracker in turn,
>> >>> >> >> looping
>> >>> >> >> around the cluster until all tasks were allocated.  After the
>> >>> >> >> upgrade
>> >>> >> >> all
>> >> >> >> reduce tasks had been submitted to a small number of task trackers
>> >>> >> >> -
>> >>> >> >> submit
>> >>> >> >> tasks until the task tracker slots were full and then move onto
>> >>> >> >> the
>> >>> >> >> next
>> >>> >> >> task tracker.  Skewing the reducers like this quite clearly hit
>> >>> >> >> the
>> >>> >> >> benchmark performance.
>> >>> >> >>
>> >>> >> >> The reason for this turns out to be the fair scheduler rewrite
>> >>> >> >> (MAPREDUCE-2981) that appears to have subtly modified the
>> >>> >> >> behaviour
>> >>> >> >> of
>> >>> >> >> the
>> >>> >> >> assign multiple property. Previously this property caused a
>> >>> >> >> single
>> >>> >> >> map
>> >>> >> >> and a
>> >>> >> >> single reduce task to be allocated in a task tracker heartbeat
>> >>> >> >> (rather
>> >>> >> >> than
>> >>> >> >> the default of a map or a reduce).  After the upgrade it
>> >>> >> >> allocates
>> >>> >> >> as
>> >>> >> >> many
>> >>> >> >> tasks as there are available task slots.  Turning off the
>> >>> >> >> multiple
>> >>> >> >> assignment feature returned the terasort to its pre-upgrade
>> >>> >> >> performance.
>> >>> >> >>
>> >>> >> >> I can see potential benefits to this change and need to think
>> >>> >> >> through
>> >>> >> >> the
>> >>> >> >> consequences to real world applications (though in practice
>> >>> >> >> we're
>> >>> >> >> likely to
>> >>> >> >> move away from fair scheduler due to MAPREDUCE-4451).
>> >>> >> >> Investigating
>> >>> >> >> this
>> >> >> >> has been a pain, so to warn other users: is there anywhere central
>> >>> >> >> that
>> >>> >> >> can be
>> >>> >> >> used to record upgrade gotchas like this?
>> >>> >> >>
>> >>> >> >>
>> >>> >> >> On Fri, Nov 23, 2012 at 12:02 PM, Jon Allen
>> >>> >> >> <ja...@gmail.com>
>> >>> >> >> wrote:
>> >>> >> >>>
>> >>> >> >>> Hi,
>> >>> >> >>>
>> >>> >> >>> We've just upgraded our cluster from Hadoop 0.20.203 to 1.0.4
>> >>> >> >>> and
>> >>> >> >>> have
>> >>> >> >>> hit performance problems.  Before the upgrade a 15TB terasort
>> >>> >> >>> took
>> >>> >> >>> about 45
>> >>> >> >>> minutes, afterwards it takes just over an hour.  Looking in
>> >>> >> >>> more
>> >>> >> >>> detail it
>> >>> >> >>> appears the shuffle phase has increased from 20 minutes to 40
>> >>> >> >>> minutes.
>> >>> >> >>> Does
>> >>> >> >>> anyone have any thoughts about what'--
>> >>> Harsh J
>> >>>
>> >
>
>

Re: Hadoop 1.0.4 Performance Problem

Posted by Chris Smith <cs...@gmail.com>.
Jie,

"Recently" was over 11 months ago.  :-)

Unfortunately the software licence requires that most of us 'negotiate' a
commercial use license before we trial the software in a commercial
environment:
http://www.cs.duke.edu/starfish/files/SOFTWARE_LICENSE_AGREEMENT.txt and as
clarified here:  http://www.cs.duke.edu/starfish/previous.html

Under that last URL was a note that you were soon to distribute the source
code under the Apache Software License.  Last time I asked, the reply was
that this would not happen.  Perhaps it is time to update your web pages or
your license arrangements.  :-)

I liked what I saw on my home 'cluster' but haven't had the time to sort out
licensing to trial this in a commercial environment.

Chris