Posted to user@hadoop.apache.org by "Jones, Robert" <Ro...@asd-inc.com> on 2012/09/10 22:16:29 UTC

Which metrics to track?

Hello all, I am a sysadmin and do not know that much about Hadoop.  I run a stats/metrics tracking system that logs stats over time so you can look at historical and current data and perform some trend analysis.  I know I can access several hadoop metrics via jmx by going to http://localhost:50070/jmx?qry=hadoop:* and I've got a script that parses all that data so that I can stuff any of it that I want into our stats system.
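
For context, the parsing script is nothing fancy, roughly along these lines (a rough sketch, not the real thing; the flattening is just whatever a stats system will swallow):

#!/usr/bin/env python
# Rough sketch of the poller: fetch the namenode's /jmx servlet and
# flatten every numeric bean attribute into "bean.attribute value"
# lines for a stats system to ingest.
import json
import urllib.request

JMX_URL = "http://localhost:50070/jmx?qry=hadoop:*"

def poll(url=JMX_URL):
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    for bean in data.get("beans", []):
        name = bean.get("name", "unknown")
        for attr, value in bean.items():
            # Only numeric values are worth trending; skip strings and blobs.
            if attr != "name" and isinstance(value, (int, float)):
                print("{}.{} {}".format(name, attr, value))

if __name__ == "__main__":
    poll()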

So, with all that preamble, I have one question.  Which metrics are worth tracking?  There are a *lot* of metrics returned via the jmx query but I doubt all of them are critically important.  Which metrics are important to track if we want to watch for any trends or spikes in our hadoop cluster?

Please provide some education to this noob.  Thanks!

--
Bob Jones
Linux Systems/Network Engineer
ME Cloud Computing
Advanced Systems Development, Inc.
1 (434) 964-3156




Re: Which metrics to track?

Posted by Gulfie <gu...@haruko.grotto-group.com>.

On Mon, Sep 10, 2012 at 08:16:29PM +0000, Jones, Robert wrote:
> Hello all, I am a sysadmin and do not know that much about Hadoop.  I run a stats/metrics tracking system that logs stats over time so you can look at historical and current data and perform some trend analysis.  I know I can access several hadoop metrics via jmx by going to http://localhost:50070/jmx?qry=hadoop:* and I've got a script that parses all that data so that I can stuff any of it that I want into our stats system.
> 
> So, with all that preamble, I have one question.  Which metrics are worth tracking?  There are a *lot* of metrics returned via the jmx query but I doubt all of them are critically important.  Which metrics are important to track if we want to watch for any trends or spikes in our hadoop cluster?

	All of them?  Often it's the odd, ignored metrics that you'll want to look at, but only after something else triggers and you start looking into the causes.  At the very least, keep the gzipped raw files.  If need be, you can feed them into a hadoop job for later analysis.
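
Something along these lines is enough to keep the raw pulls around (a sketch; the archive path is made up, adjust to taste):

#!/usr/bin/env python
# Sketch: save each raw /jmx pull as a timestamped gzip so the full
# response survives for later analysis, e.g. as input to a hadoop job.
import gzip
import time
import urllib.request

JMX_URL = "http://localhost:50070/jmx?qry=hadoop:*"
ARCHIVE_DIR = "/var/spool/jmx-archive"  # hypothetical path

def archive_pull(url=JMX_URL):
    raw = urllib.request.urlopen(url).read()
    stamp = time.strftime("%Y%m%d-%H%M%S")
    path = "{}/jmx-{}.json.gz".format(ARCHIVE_DIR, stamp)
    with gzip.open(path, "wb") as out:
        out.write(raw)
    return path

if __name__ == "__main__":
    print(archive_pull())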

-gulfie



 


Re: Which metrics to track?

Posted by Adam Faris <af...@linkedin.com>.
From an operations perspective, Hadoop metrics are a bit different from watching hosts behind a load balancer: one needs to start thinking in terms of distributed systems rather than individual hosts.  The reason is that the Hadoop platform is fairly resilient against multiple node failures; if we lose an entire rack of datanodes to a switch failure, the platform might be degraded, but it isn't down.  It's still advisable to collect CPU, network, RAM, and disk usage on individual hosts using something like collectd or ganglia, but focus on the display aspect.  The ability to aggregate individual metrics into a global graph and understand what's going on with the platform while particular jobs are running is very helpful.  Regarding JMX, you may have to look at the Hadoop source for explanations, but there's not a lot of cruft in the JMX output.

Hadoop is good about detecting error conditions on datanodes, and that should be leveraged in monitoring and metrics solutions.  Like most monitoring and metrics systems, start slow and ramp up as you find new conditions.  Here's something to get you started on a 1.x grid.

Namenode:
'PercentUsed' or 'PercentRemaining': poll either value and start to worry about block corruption when HDFS usage hits 80% used. :)
'CorruptBlocks', 'UnderReplicatedBlocks', 'MissingBlocks', & 'FSState' will alert you to HDFS issues.
'LiveNodes' or 'DeadNodes': let hadoop monitor the datanode process for you, as it's a built-in freebie.
'FilesTotal': using the rule of thumb of 1 GB of namenode RAM for every million files on HDFS, this lets you track and tune your heap size.  (A sketch of this sort of namenode check follows this list.)
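
To make these concrete, here's a sketch of the kind of namenode check I mean.  The bean and attribute names are assumptions from a 1.x-era namenode (even the domain casing, hadoop: vs Hadoop:, varies by version), so verify everything against your own /jmx output first:

#!/usr/bin/env python
# Sketch: poll the namenode's /jmx servlet and flag the HDFS health
# values discussed above.  Bean names are assumptions from a 1.x-era
# namenode; domain casing and some attributes vary by version, so
# check your own /jmx output before trusting this.
import json
import urllib.request

NAMENODE = "http://localhost:50070"

def get_bean(qry):
    url = "{}/jmx?qry={}".format(NAMENODE, qry)
    with urllib.request.urlopen(url) as resp:
        beans = json.load(resp).get("beans", [])
    return beans[0] if beans else {}

def check_hdfs():
    state = get_bean("hadoop:service=NameNode,name=FSNamesystemState")
    info = get_bean("hadoop:service=NameNode,name=NameNodeInfo")
    alerts = []
    if info.get("PercentUsed", 0.0) > 80.0:
        alerts.append("HDFS is over 80% used")
    if state.get("UnderReplicatedBlocks", 0) > 0:
        alerts.append("under-replicated blocks present")
    if state.get("FSState", "Operational") != "Operational":
        alerts.append("FSState is {!r}".format(state["FSState"]))
    # FilesTotal vs. heap: roughly 1 GB of namenode RAM per million files.
    return alerts, state.get("FilesTotal", 0)

if __name__ == "__main__":
    alerts, files_total = check_hdfs()
    print("FilesTotal:", files_total)
    for a in alerts:
        print("ALERT:", a)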

Jobtracker:
'BlacklistedNodesInfoJson' & 'GraylistedNodesInfoJson': let hadoop monitor for broken tasktrackers.
"JVM heap counters": useful for heap tuning (see the heap sketch after this list).
"Queue counts": when are jobs running? How many mappers/reducers are used at a particular time?  You'll find the number of mappers/reducers active at one time more helpful than how many jobs are running.

Datanode:
"heartBeats_avg_time": if data node has high heartbeat, it could be network congestion or high load on local box.
"VolumeInfo": shows local filesystem sizes. (You could also use ganglia/collectd for this).  Note that unless you define separate partitions for spill space and HDFS blocks, when the task spills to disk you could fill your local datanode filesystem.
 
Tasktracker:
The tasktracker has stats that could be worth looking at, like the shuffle counters for knowing when a job spills to disk.  We aren't currently using any of these values, so I don't have recommendations.


-- Adam

