= Bristol Hadoop Workshop =

This was a little local workshop put together by Simon Metson of Bristol University and Steve Loughran of HP, to get some of the local Hadoop users in a room and talk about our ongoing work.

These presentations were intended to start discussion and thought:

  * [http://www.slideshare.net/steve_l/hadoop-futures Hadoop Futures] (Tom White, Cloudera)
  * [http://www.slideshare.net/steve_l/hadoop-hep Hadoop and High-Energy Physics] (Simon Metson, Bristol University)
  * [http://www.slideshare.net/steve_l/hdfs HDFS] (Johan Oskarsson, Last.fm)
  * [http://www.slideshare.net/steve_l/graphs-1848617 Graphs] (Paolo Castagna, HP)
  * [http://www.slideshare.net/steve_l/long-haul-hadoop Long Haul Hadoop] (Steve Loughran, HP)
  * [http://www.slideshare.net/steve_l/benchmarking-1840029 Benchmarking Hadoop] (Steve Loughran & Julio Guijarro, HP)

== Benchmarking ==

[:Terasort: Terasort], while a good way of regression-testing performance across Hadoop versions, isn't ideal for assessing which hardware is best for algorithms other than sort: workloads that are more iterative and CPU/memory-hungry may not behave as expected on a cluster that has good IO but not enough RAM for their algorithm.
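The regression-testing part, at least, is easy to script. A minimal sketch, assuming the !TeraGen/!TeraSort/!TeraValidate classes from the Hadoop 0.20 examples package (the package name, class names and argument forms are assumptions -check the examples source for the version under test):

{{{#!java
// Sketch of a terasort regression run driven from Java; assumes the
// Hadoop examples jar is on the classpath. Paths and row counts are
// illustrative only.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.examples.terasort.TeraGen;
import org.apache.hadoop.examples.terasort.TeraSort;
import org.apache.hadoop.examples.terasort.TeraValidate;
import org.apache.hadoop.util.ToolRunner;

public class TerasortRegression {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // 10^9 rows x 100 bytes is roughly 100GB of input; size to the cluster.
    ToolRunner.run(conf, new TeraGen(),
        new String[]{"1000000000", "/bench/in"});
    long start = System.currentTimeMillis();
    ToolRunner.run(conf, new TeraSort(),
        new String[]{"/bench/in", "/bench/out"});
    System.out.println("sort took "
        + (System.currentTimeMillis() - start) / 1000 + "s");
    // TeraValidate checks that the output really is globally sorted.
    ToolRunner.run(conf, new TeraValidate(),
        new String[]{"/bench/out", "/bench/validate"});
  }
}
}}}

Comparing the sort wall time for the same data size across releases gives the regression number; it just doesn't say much about the iterative, memory-hungry workloads.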

In the discussion, though, it became clear that a common need which isn't well addressed right now -and for which terasort is the best tool people have to date- is QA-ing a new cluster.

Here you have new hardware -any failure of which is an immediate replacement call to the vendor- on a new network -which may not be configured right- and with a new set of configuration parameters -all of which may be wrong, or at least suboptimal. You need something to run on the cluster that tests every node, makes sure it can see every other node's services, and reports problems in meaningful summaries. The work should test CPU, FPU and RAM too, just to make sure they are all valid, and at the end of the run it should generate some test numbers that can be compared with a spreadsheet-calculated estimate of performance and throughput.
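The CPU/FPU/RAM part of that check doesn't need Hadoop at all. A minimal per-node sketch of the idea -the loop sizes, the 512MB walk and the output format are all assumptions, not a real qualification suite- that times a fixed floating-point loop and a fixed memory walk, printing numbers that can be lined up against the spreadsheet estimate:

{{{#!java
// Per-node sanity micro-check: time a fixed FPU loop and a fixed RAM
// walk, and print the numbers. Outliers across the cluster are the
// boxes to talk to the vendor about. Needs enough heap for the array,
// e.g. -Xmx768m.
public class NodeSanity {
  public static void main(String[] args) {
    // FPU: a couple of hundred million multiplies and square roots.
    long t0 = System.currentTimeMillis();
    double acc = 0.0;
    for (int i = 1; i < 200 * 1000 * 1000; i++) {
      acc += Math.sqrt(i) * 1.000001;
    }
    long fpuMillis = System.currentTimeMillis() - t0;

    // RAM: allocate and touch 512MB so a short or broken node fails
    // here rather than halfway through a real job.
    t0 = System.currentTimeMillis();
    byte[] block = new byte[512 * 1024 * 1024];
    for (int i = 0; i < block.length; i += 4096) {
      block[i] = (byte) i;
    }
    long ramMillis = System.currentTimeMillis() - t0;

    // Print acc too, so the optimizer cannot discard the FPU loop.
    System.out.println("fpu-loop=" + fpuMillis + "ms ram-walk=" + ramMillis
        + "ms touched=" + block.length + " acc=" + acc);
  }
}
}}}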

When you bring up a cluster, even if every service has been asked whether it is healthy, the services still have the problem of talking to each other. The best check: push work through the system, wait for things to fail, and try to guess the problem. Having work to push through that is designed to stress the system's interconnections -and whose failures can be diagnosed with ease- would be nice.
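One way to make that pushed-through work diagnosable is to have the job itself name the broken links. A sketch, assuming the 0.20 {{{org.apache.hadoop.mapreduce}}} API and an assumed datanode port of 50010 (read the real ports from the cluster's own configuration): a map-only job takes the host list as input, probes each host from whichever node the task lands on, and emits a record naming both ends of any link that fails.

{{{#!java
// Sketch of a map-only "link check" job: input is one hostname per
// line; each map task probes the hosts it is given from wherever it
// happens to run, and writes "probing-host -> unreachable-host" records.
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LinkCheck {
  private static final int DATANODE_PORT = 50010; // assumed default

  public static class ProbeMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String target = value.toString().trim();
      if (target.length() == 0) return;
      String here = InetAddress.getLocalHost().getHostName();
      Socket s = new Socket();
      try {
        s.connect(new InetSocketAddress(target, DATANODE_PORT), 5000);
      } catch (IOException e) {
        ctx.write(new Text(here),
            new Text("cannot reach " + target + ":" + DATANODE_PORT
                + " - " + e.getMessage()));
      } finally {
        s.close();
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "linkcheck");
    job.setJarByClass(LinkCheck.class);
    job.setMapperClass(ProbeMapper.class);
    job.setNumReduceTasks(0); // map-only: the output is the failure list
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));  // host list
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
}}}

With one small input file all the probes may run from a single node; splitting the host list across many files, or using an input format that hands one line to each task, spreads the probing around the cluster.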

That is, for all those people asking for a !HappyHadoop JSP page: it isn't enough. A cluster may cope with some of the workers going down, but it is not actually functional unless every node that is up can talk to every other node that is up, nothing is coming up listening on IPv6, the TaskTracker hasn't decided to run only on localhost, and so on.
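Some of those checks are cheap enough to run standalone on each node before any daemon starts. A sketch, assuming nothing about TaskTracker internals -just the JDK and the standard {{{java.net.preferIPv4Stack}}} property:

{{{#!java
// Does this machine's own hostname resolve to something other than
// loopback, and is the JVM preferring IPv4? A daemon on a node whose
// hostname maps to 127.0.0.1 registers an address no other node can use.
import java.net.InetAddress;

public class AddressCheck {
  public static void main(String[] args) throws Exception {
    InetAddress local = InetAddress.getLocalHost();
    System.out.println("hostname=" + local.getHostName()
        + " address=" + local.getHostAddress());
    if (local.isLoopbackAddress()) {
      System.err.println("WARNING: hostname resolves to loopback; "
          + "services registered with this address are unreachable");
    }
    if (!"true".equals(System.getProperty("java.net.preferIPv4Stack"))) {
      System.err.println("NOTE: java.net.preferIPv4Stack is not set; "
          + "services may come up listening on IPv6");
    }
  }
}
}}}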