Posted to common-user@hadoop.apache.org by Paul Smith <ps...@aconex.com> on 2009/09/25 09:04:38 UTC

3D Cluster Performance Visualization

Hi,

I'm still relatively new to Hadoop here, so bear with me.  We have a  
few ex-SGI staff with us, and one of the tools we now use at Aconex is  
Performance Co-Pilot (PCP), which is an open-source Performance  
Monitoring suite out of Silicon Graphics (see [1]).  SGI are a bit  
fond of large-scale problems and this toolset was built to support  
their own monster computers (see [2] for one of their clients, yep,  
that's one large single computer), and PCP was used to monitor and  
tune that, so I'm pretty confident it has the credentials to help with  
Hadoop.

Aconex has built a Java bridge to PCP and has open-sourced that as  
Parfait (see [3]).  We rely on this for real-time and post-problem  
retrospective analysis.  We would be dead in the water without it.  By  
being able to combine hardware and software metrics across multiple  
machines into a single warehouse of data we can correlate many  
interesting things and solve problems very quickly.

Now I want to unleash this on Hadoop.  I have written a MetricsContext
extension that uses the bridge, and I can export counters and values
to PCP for the namenode, datanode, jobtracker and tasktracker.  We are
building some small tool extensions to allow 3D visualization.  A
first fledgling view of what it looks like is here:

http://people.apache.org/~psmith/clustervis.png
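
(For the curious, the extension is roughly the shape of the sketch
below.  It's illustrative only: the real class leans on the Parfait
API, which I've replaced here with a hypothetical PcpBridge stub, and
all the names are made up.)

  import java.io.IOException;

  import org.apache.hadoop.metrics.ContextFactory;
  import org.apache.hadoop.metrics.spi.AbstractMetricsContext;
  import org.apache.hadoop.metrics.spi.OutputRecord;

  public class PcpContext extends AbstractMetricsContext {

      // Hypothetical stand-in for the Parfait/PCP bridge.
      private PcpBridge bridge;

      public void init(String contextName, ContextFactory factory) {
          super.init(contextName, factory);
          // e.g. agent details come from hadoop-metrics.properties
          bridge = new PcpBridge(getAttribute("agentName"));
      }

      // Called by the framework each period with a batch of values.
      protected void emitRecord(String contextName, String recordName,
                                OutputRecord record) throws IOException {
          for (String metric : record.getMetricNames()) {
              // Exported to PCP under a dotted name such as
              // "hadoop.dfs.datanode.bytes_read"
              bridge.setValue(contextName + "." + recordName + "." + metric,
                              record.getMetric(metric));
          }
      }
  }

  // Hypothetical stand-in for the real Parfait bridge API.
  class PcpBridge {
      PcpBridge(String agent) { /* connect to the PCP agent here */ }
      void setValue(String name, Number value) { /* push value to PCP */ }
  }

Wiring it up is then a line per daemon in hadoop-metrics.properties,
along the lines of dfs.class=com.aconex.metrics.PcpContext and
dfs.period=10.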

Yes, it's a pretty trivial cluster at the moment, but the toolset can
build the cluster view from a very simple configuration: you just
pass it the masters/slaves files.  Once the PCP tools connect to each
node through my PCP MetricsContext implementation, they can find out
whether it's a namenode, a jobtracker, etc., and display it
differently.  We hope to improve the tools to use the
DNSToSwitchMapping mechanism to visualize all the nodes within the
cluster as they would appear in their racks.  PCP already has support
for Cisco switches, so we can also integrate those into the picture
and display inter-rack networking volumes.  The real payoff here is
the retrospective analysis: all this PCP data is collected into
archives, so this view can be replayed at any time, and at any pace
you want.  Very interesting problems are found when you have that
sort of tool.
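
(To make the DNSToSwitchMapping idea concrete, here is a rough sketch
of how the tool could resolve each host from the slaves file to its
rack path, reusing the same topology script the namenode uses.  The
host names and the script path are made up.)

  import java.util.Arrays;
  import java.util.List;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.net.DNSToSwitchMapping;
  import org.apache.hadoop.net.ScriptBasedMapping;

  public class RackLayout {
      public static void main(String[] args) {
          Configuration conf = new Configuration();
          conf.set("topology.script.file.name", "/etc/hadoop/topology.sh");
          DNSToSwitchMapping mapping = new ScriptBasedMapping(conf);

          // Hosts as read from the masters/slaves files.
          List<String> hosts = Arrays.asList("node01", "node02", "node03");
          List<String> racks = mapping.resolve(hosts); // e.g. "/dc1/rack42"

          for (int i = 0; i < hosts.size(); i++) {
              System.out.println(racks.get(i) + "/" + hosts.get(i));
          }
      }
  }

And on the replay side, once pmlogger has the data in an archive you
can pull any metric back out at whatever pace you like, e.g. with
pmval -a <archive> <metric>.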

I guess my question is whether anyone else thinks this is going to be
of value to the wider Hadoop community?  Obviously we do, but we're
not exactly stretching Hadoop just yet, nor do we fully understand
some of the tricky performance problems large Hadoop cluster admins
face.  We'd love to contribute this to hadoop-contrib in the hope
that others might find it useful.

So if anyone is interested in asking questions or suggesting crucial
feature sets, we'd appreciate it.

cheers (and thanks for getting this far in the email.. :) )

Paul Smith
psmith at aconex.com
psmith at apache.org

[1] Performance Co-Pilot (PCP)
http://oss.sgi.com/projects/pcp/index.html

[2] NASA's 'Columbia' computer
http://www.nas.nasa.gov/News/Images/images.html

[3] Parfait
http://code.google.com/p/parfait/

Re: 3D Cluster Performance Visualization

Posted by Edward Capriolo <ed...@gmail.com>.
On Fri, Sep 25, 2009 at 10:06 AM, Brian Bockelman <bb...@cse.unl.edu> wrote:
> [Brian's message, quoted in full, trimmed; it appears as its own
> post below]

Brian,

I was half kidding, but if you can do it with OpenGL you could
probably do it with an applet; of course, as you mentioned, it would
need access to the logging source.  Maybe it could also run on each
DataNode web interface.  A while back people were also talking about
those map/reduce job status graphs that show map and reduce progress
over the course of a job.  That is something I think we could do
right from the JobTracker interface.  There is a lot of info there;
we should be able to jazz it up a bit :)

Ed

Re: 3D Cluster Performance Visualization

Posted by Steve Loughran <st...@apache.org>.
Brian Bockelman wrote:
> [Brian's message, quoted in full, trimmed; it appears as its own
> post below]


OK, so this is really an example of a datacentre back-end for log4j,
pushing out UDP packets to something else in the datacentre.  A nice
side-line to the classic Hadoop management displays.  Add something
about jobs executing and you are laughing.  Do it all in Java3D and
you even have cross-platformness.
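
If anyone wants to play with that idea, the receiving end really can
be tiny.  Here is a rough sketch of a UDP listener that takes the
syslog packets and classifies transfers as intra-cluster or external;
the line format and the address prefix are invented, so adapt them to
whatever your appender actually emits.

  import java.net.DatagramPacket;
  import java.net.DatagramSocket;

  public class VizLogServer {
      public static void main(String[] args) throws Exception {
          // Standard syslog port (binding below 1024 needs root).
          DatagramSocket socket = new DatagramSocket(514);
          byte[] buf = new byte[8192];
          while (true) {
              DatagramPacket packet = new DatagramPacket(buf, buf.length);
              socket.receive(packet);  // one packet per read()
              String msg = new String(packet.getData(), 0,
                                      packet.getLength(), "UTF-8");

              // Assumed format: "... src:/10.1.2.3:50010 dest:/10.1.2.4:41234"
              String src = extract(msg, "src:");
              String dest = extract(msg, "dest:");
              boolean internal =
                  src.startsWith("/10.1.") && dest.startsWith("/10.1.");

              // Hand off to the renderer clients here, e.g. over TCP.
              System.out.println((internal ? "INTRA " : "EXTERN ")
                                 + src + " -> " + dest);
          }
      }

      // Crude field extractor; a real parser would be more defensive.
      private static String extract(String msg, String key) {
          int i = msg.indexOf(key);
          if (i < 0) return "?";
          int end = msg.indexOf(' ', i + key.length());
          return msg.substring(i + key.length(),
                               end < 0 ? msg.length() : end);
      }
  }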

Re: 3D Cluster Performance Visualization

Posted by Brian Bockelman <bb...@cse.unl.edu>.
;) Unfortunately, I'm going to go out on a limb and guess that we
don't want to add OpenGL to the dependency list for the namenode...
The viz application actually doesn't depend on the namenode; it uses
the datanodes.

Here's the source:
svn://t2.unl.edu/brian/HadoopViz/trunk

The server portion is a bit hardcoded to our site (simply a Python
server); the client application is pretty cross-platform.  I actually
compile and display the application on my Mac.

Here's how it works:

1) Client issues read() request
2) Datanode services it.  Logs it with log4j
3) One of the log4j appenders is syslog pointing at a separate server
4) Separate log server receives UDP packets; one packet per read()
5) Log server parses packets and decides whether they are within the  
cluster or going to the internet
   - Currently a Pentium 4 throw-away machine; handles up to 4-5k  
packets per second before it starts dropping
6) Each client opens a TCP stream to the server and receives the  
transfer type, source, and dest, then renders appropriately

It's pretty danged close to real-time; the time the client issues the  
read() request to seeing something plotted is on the order of 1 second.
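
For reference, the log4j side is plain syslog appender configuration,
something like the following in the datanode's log4j.properties (the
host name here is invented, and the clienttrace logger name may
differ between Hadoop versions):

  log4j.appender.VIZ=org.apache.log4j.net.SyslogAppender
  log4j.appender.VIZ.SyslogHost=logserver.example.edu
  log4j.appender.VIZ.Facility=LOCAL0
  log4j.appender.VIZ.layout=org.apache.log4j.PatternLayout
  log4j.appender.VIZ.layout.ConversionPattern=%m%n
  # Send each per-read() trace event to the viz server
  log4j.logger.org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace=INFO,VIZ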

I'd really like to see this on a big (Yahoo, Facebook, any takers?)  
cluster.

Brian

On Sep 25, 2009, at 8:54 AM, Edward Capriolo wrote:

> [earlier quoted messages trimmed]
> Open up a Jira.  Let's get Hadoop viz on the namenode web interface
> in real time :)


Re: 3D Cluster Performance Visualization

Posted by Edward Capriolo <ed...@gmail.com>.
On Fri, Sep 25, 2009 at 9:25 AM, Brian Bockelman <bb...@cse.unl.edu> wrote:
> [Brian's message, quoted in full, trimmed; it appears as its own
> post below]

Open up a Jira.  Let's get Hadoop viz on the namenode web interface
in real time :)

Re: 3D Cluster Performance Visualization

Posted by Brian Bockelman <bb...@cse.unl.edu>.
Hey Paul,

Here's another visualization one can do with HDFS:

http://www.youtube.com/watch?v=qoBoEzOkeDQ

Each time data is moved from one host to another, it is plotted as a  
drop of water from one square representing the host to one square  
representing the destination.  The color of the node's square depends  
on the number of transfers per second.  Data transferred out of the  
cluster is represented by drops going in/out of the ceiling.

Hard to describe, easy to understand when you see it.  Absolutely  
mesmerizing for tour groups when you put it on a big-screen.

Brian

On Sep 25, 2009, at 2:04 AM, Paul Smith wrote:

> [Paul's original message, quoted in full, trimmed; see the top of
> this page]