Posted to common-dev@hadoop.apache.org by Matei Zaharia <ma...@eecs.berkeley.edu> on 2007/12/01 06:48:05 UTC

A tracing framework for Hadoop

Hi,

We're grad students at UC Berkeley working on a project to instrument
Hadoop using an open-source path-based tracing framework called
X-Trace (www.x-trace.net/wiki). X-Trace captures causal dependencies
between events in addition to timings, letting developers analyze not  
just performance but also context and dependencies for various events.  
We have created a web-based trace analysis UI that shows performance  
of different IPC calls, DFS operations, and phases of a MapReduce job.  
The goal is to let users easily spot, from one central location, the
origin of unusual behavior in a running system. We believe that this kind
of tracing can be used for performance tuning and debugging in both  
development and production environments.
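
To make the idea of recording causal dependencies alongside timings
concrete, here is a minimal Java sketch. It is not the X-Trace API: the
class and method names (CausalTraceSketch, record, pathTo) and the event
labels are hypothetical, chosen only for illustration. Each event stores
the id of the event that caused it, so the chain leading to any event can
be recovered by following parent ids back to the root.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not the X-Trace API.
final class CausalTraceSketch {
    /** One logged event: its label, timestamp, and the id of the event that caused it. */
    static final class Event {
        final int parentId;      // -1 marks the root event of a trace
        final String label;
        final long timestampMs;

        Event(int parentId, String label, long timestampMs) {
            this.parentId = parentId;
            this.label = label;
            this.timestampMs = timestampMs;
        }
    }

    private final List<Event> log = new ArrayList<>();

    /** Records an event caused by parentId and returns the new event's id (its index in the log). */
    int record(int parentId, String label, long timestampMs) {
        log.add(new Event(parentId, label, timestampMs));
        return log.size() - 1;
    }

    /** Follows parent ids back to the root and returns the labels in causal order. */
    List<String> pathTo(int eventId) {
        List<String> path = new ArrayList<>();
        for (int cur = eventId; cur != -1; cur = log.get(cur).parentId) {
            path.add(0, log.get(cur).label);
        }
        return path;
    }

    public static void main(String[] args) {
        CausalTraceSketch trace = new CausalTraceSketch();
        int job = trace.record(-1, "submit job", 0L);
        int map = trace.record(job, "map task", 5L);
        int read = trace.record(map, "DFS read", 7L);
        System.out.println(trace.pathTo(read)); // prints [submit job, map task, DFS read]
    }
}
```

In a real deployment the parent id would have to travel with each IPC call
so that events logged on different machines can be linked; the sketch keeps
everything in one process to stay short.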

We'd like to get feedback on our work and suggestions on what trace  
analyses would be useful to Hadoop developers and users. Some of the  
reports we currently generate include machine utilization over time,  
relative performance of different tasks, and performance of DFS  
operations. You can see an example set of reports at
http://www.cs.berkeley.edu/~matei/xtrace_sample_task.html
(this is a trace of a Nutch indexing job). You can also read our
project journal here:
http://radlab.cs.berkeley.edu/wiki/Projects/Monitoring_Hadoop_through_Tracing
We've already spotted some interesting issues, like map tasks and
DFS reads/writes that are an order of magnitude slower than the  
average, and we are investigating possible causes for them. Most  
importantly, the UI lets a user easily see where the system is  
spending time and reason about how to tune it, and provides much more  
information than the progress data in the JobTracker UI. As a Hadoop
developer, what questions about running jobs would you like answered
that are hard to answer from process logs alone?
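
As a rough illustration of the "order of magnitude slower than the
average" check mentioned above, the following Java sketch flags durations
that are at least a given factor times the mean. It is an assumption about
how such a report could be computed, not a description of our analysis
code; SlowTaskSketch and its methods are hypothetical names, and the
durations in main are made-up numbers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; names and data are invented for this sketch.
final class SlowTaskSketch {
    /** Arithmetic mean of the durations, in milliseconds. */
    static double mean(long[] durationsMs) {
        if (durationsMs.length == 0) {
            throw new IllegalArgumentException("no durations");
        }
        double sum = 0;
        for (long d : durationsMs) {
            sum += d;
        }
        return sum / durationsMs.length;
    }

    /** Returns the indices of durations that are at least factor times the mean. */
    static List<Integer> slowerThanAverage(long[] durationsMs, double factor) {
        double threshold = factor * mean(durationsMs);
        List<Integer> slow = new ArrayList<>();
        for (int i = 0; i < durationsMs.length; i++) {
            if (durationsMs[i] >= threshold) {
                slow.add(i);
            }
        }
        return slow;
    }

    public static void main(String[] args) {
        long[] durationsMs = new long[20];
        Arrays.fill(durationsMs, 100L);   // 19 ordinary tasks at 100 ms each
        durationsMs[7] = 5000L;           // one very slow task
        // Mean is 345 ms, so the 10x threshold is 3450 ms.
        System.out.println(slowerThanAverage(durationsMs, 10.0)); // prints [7]
    }
}
```

Because the slow task is itself included in the mean, this check can only
fire when there are more than ten samples at a factor of 10; comparing
against the median instead would avoid that limitation.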

Once we've had a discussion on features for a trace analysis UI, we  
would like to contribute our work to the Hadoop codebase. We will
create a JIRA issue and patch adding this functionality. We're also  
interested in seeing if we can integrate X-Trace logging more tightly  
with the current Apache logging in Hadoop.

Finally, we are currently experimenting on relatively small clusters
(<50 nodes) here at Berkeley, but we would really like to try
tracing some large clusters (>1000 nodes). If there is someone
interested in evaluating performance on such a cluster, we would be  
very happy to talk about how to set up X-Trace and provide you with a  
patch.

Thanks,

Andy Konwinski and Matei Zaharia