Posted to common-user@hadoop.apache.org by Paul Smith <ps...@aconex.com> on 2009/09/25 09:04:38 UTC
3D Cluster Performance Visualization
Hi,
I'm still relatively new to Hadoop here, so bear with me. We have a
few ex-SGI staff with us, and one of the tools we now use at Aconex is
Performance Co-Pilot (PCP), which is an open-source Performance
Monitoring suite out of Silicon Graphics (see [1]). SGI are a bit
fond of large-scale problems and this toolset was built to support
their own monster computers (see [2] for one of their clients, yep,
that's one large single computer), and PCP was used to monitor and
tune that, so I'm pretty confident it has the credentials to help with
Hadoop.
Aconex has built a Java bridge to PCP and has open-sourced that as
Parfait (see [3]). We rely on this for real-time and post-problem
retrospective analysis. We would be dead in the water without it. By
being able to combine hardware and software metrics across multiple
machines into a single warehouse of data we can correlate many
interesting things and solve problems very quickly.
Now I want to unleash this on Hadoop. I have written a MetricContext
extension that uses the bridge, and I can export counters and values
to PCP for the namenode, datanode, jobtracker and tasktracker. We are
building some small tool extensions to allow 3D visualization. First
fledgling view of what it looks like is here:
http://people.apache.org/~psmith/clustervis.png
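The bridge code isn't shown here, but the shape of such an exporter is easy to sketch. The following is a schematic stand-in, not the actual Hadoop MetricContext or Parfait API: counters and values accumulate under dotted names, and each flush hands a snapshot to a sink, which in the real setup would be the Parfait bridge feeding PCP on a timer.

```python
class MetricExporter:
    """Schematic stand-in for a MetricContext-style bridge: counters and
    values accumulate under dotted names, and each flush hands a snapshot
    to a sink (in the real setup, the Parfait bridge feeding PCP)."""

    def __init__(self, sink, period_seconds=10):
        self.sink = sink                  # callable receiving {name: value}
        self.period = period_seconds      # real contexts flush on a timer
        self.metrics = {}

    def inc_counter(self, name, delta=1):
        self.metrics[name] = self.metrics.get(name, 0) + delta

    def set_value(self, name, value):
        self.metrics[name] = value

    def flush(self):
        self.sink(dict(self.metrics))     # snapshot, not a live reference

# Hypothetical usage: the metric names below are illustrative only.
snapshots = []
exporter = MetricExporter(sink=snapshots.append)
exporter.inc_counter("dfs.datanode.bytes_read", 4096)
exporter.set_value("jvm.heap_used_mb", 512)
exporter.flush()
```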
Yes, a pretty trivial cluster at the moment, but the toolset allows
pretty simple configurations to create the cluster view by passing it
the masters/slaves file. Once the PCP tools connect to each node
through my implementation of the PCP Metric Context, they can find out
whether it's a namenode, a jobtracker, etc., and display it
differently. We hope to improve the tools to utilise the
DNSToSwitchMapping style to then visualize all the nodes within the
cluster as they would appear in the rack. PCP already has support for
Cisco switches, so we can also integrate those into the picture and
display inter-rack networking volumes. The real payoff here is the
retrospective analysis: all this PCP data is collected into archives,
so this view can be replayed at any time, and at any pace you want.
Very interesting problems are found when you have that sort of tool.
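The DNSToSwitchMapping idea boils down to grouping hosts by the rack path a topology mapping assigns them. A minimal sketch of that grouping (the function and table names below are illustrative, not Hadoop's actual interface):

```python
def resolve_racks(hosts, topology):
    """DNSToSwitchMapping-style grouping (schematic): bucket each host
    under the rack path the topology table assigns it, which is what a
    rack-aware 3D view needs to lay nodes out rack by rack."""
    racks = {}
    for host in hosts:
        racks.setdefault(topology.get(host, "/default-rack"), []).append(host)
    return racks
```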
I guess my question is whether anyone else thinks this is going to be
of value to the wider Hadoop community? Obviously we do, but we're
not exactly stretching Hadoop just yet, nor do we fully understand
some of the tricky performance problems large Hadoop cluster admins
face. We'd love to add this to hadoop-contrib in the hope that others
might find it useful.
So if anyone is interested, questions or suggestions for crucial
feature sets would be appreciated.
cheers (and thanks for getting this far in the email.. :) )
Paul Smith
psmith at aconex.com
psmith at apache.org
[1] Performance Co-Pilot (PCP)
http://oss.sgi.com/projects/pcp/index.html
[2] NASA's 'Columbia' computer
http://www.nas.nasa.gov/News/Images/images.html
[3] Parfait
http://code.google.com/p/parfait/
Re: 3D Cluster Performance Visualization
Posted by Edward Capriolo <ed...@gmail.com>.
On Fri, Sep 25, 2009 at 10:06 AM, Brian Bockelman <bb...@cse.unl.edu> wrote:
Brian,
I was half kidding, but if you can do it with OpenGL you could
probably do it with an applet; of course, as you mentioned, it would
take access to the logging source. Maybe run it on each DataNode web
interface instead. Also, a while back people were talking about those
map/reduce job status graphs that show the map and reduce progress
over the course of a job. That is something I think we could do right
from the JobTracker interface. There is a lot of info there; we
should be able to jazz it up a bit :)
Ed
Re: 3D Cluster Performance Visualization
Posted by Steve Loughran <st...@apache.org>.
Brian Bockelman wrote:
OK, so this is really an example of a datacentre back-end for log4j,
pushing out UDP packets to something else in the datacentre. A nice
side-line to the classic Hadoop management displays. Add something
about jobs executing and you are laughing. Do it all in Java3D and
you even have cross-platformness.
Re: 3D Cluster Performance Visualization
Posted by Brian Bockelman <bb...@cse.unl.edu>.
;) Unfortunately, I'm going to go out on a limb and guess that we
don't want to add OpenGL to the dependency list for the namenode...
The viz application actually doesn't depend on the namenode, it uses
the datanodes.
Here's the source:
svn://t2.unl.edu/brian/HadoopViz/trunk
The server portion is a bit hardcoded to our site (simply a python
server); the client application is pretty cross-platform. I actually
compile and display the application on my Mac.
Here's how it works:
1) Client issues read() request
2) Datanode services it. Logs it with log4j
3) One of the log4j appenders is syslog pointing at a separate server
4) Separate log server receives UDP packets; one packet per read()
5) Log server parses packets and decides whether they are within the
cluster or going to the internet
- Currently a Pentium 4 throw-away machine; handles up to 4-5k
packets per second before it starts dropping
6) Each client opens a TCP stream to the server and receives the
transfer type, source, and dest, then renders appropriately
It's pretty danged close to real-time; the time the client issues the
read() request to seeing something plotted is on the order of 1 second.
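Step 5's in-cluster/internet decision can be sketched as a simple address-range check on the two peers in each parsed packet. The network ranges below are placeholders, not the actual site's:

```python
import ipaddress

# Assumed cluster address ranges -- placeholders for the real site's networks.
CLUSTER_NETS = [ipaddress.ip_network("10.0.0.0/8"),
                ipaddress.ip_network("172.16.0.0/12")]

def classify_transfer(src, dst):
    """Decide whether a logged read() stayed inside the cluster or went
    out to the internet, based on the two peer addresses (step 5)."""
    def internal(addr):
        ip = ipaddress.ip_address(addr)
        return any(ip in net for net in CLUSTER_NETS)
    return "intra-cluster" if internal(src) and internal(dst) else "internet"
```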
I'd really like to see this on a big (Yahoo, Facebook, any takers?)
cluster.
Brian
On Sep 25, 2009, at 8:54 AM, Edward Capriolo wrote:
> Open up a Jira. Lets get hadoop viz on the name node web interface for
> real time :)
Re: 3D Cluster Performance Visualization
Posted by Edward Capriolo <ed...@gmail.com>.
On Fri, Sep 25, 2009 at 9:25 AM, Brian Bockelman <bb...@cse.unl.edu> wrote:
Open up a Jira. Let's get Hadoop viz on the namenode web interface in
real time :)
Re: 3D Cluster Performance Visualization
Posted by Brian Bockelman <bb...@cse.unl.edu>.
Hey Paul,
Here's another visualization one can do with HDFS:
http://www.youtube.com/watch?v=qoBoEzOkeDQ
Each time data is moved from one host to another, it is plotted as a
drop of water from one square representing the host to one square
representing the destination. The color of the node's square depends
on the number of transfers per second. Data transferred out of the
cluster is represented by drops going in/out of the ceiling.
Hard to describe, easy to understand when you see it. Absolutely
mesmerizing for tour groups when you put it on a big-screen.
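The per-square coloring amounts to clamping the transfer rate and mapping it onto a color ramp. A guess at that mapping (the actual palette and scale in the video may differ):

```python
def rate_to_color(transfers_per_sec, max_rate=50.0):
    """Map a node's transfer rate onto a blue-to-red ramp: idle squares
    render cool, busy squares hot. Palette and max_rate are guesses."""
    t = min(transfers_per_sec / max_rate, 1.0)    # clamp to [0, 1]
    return (int(255 * t), 0, int(255 * (1 - t)))  # (R, G, B)
```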
Brian
On Sep 25, 2009, at 2:04 AM, Paul Smith wrote: