Posted to hdfs-user@hadoop.apache.org by Ivan Tretyakov <it...@griddynamics.com> on 2013/01/17 16:17:26 UTC

Problem sending metrics to multiple targets

Hi!

We have the following problem.

There are three target hosts to send metrics to: 192.168.1.111:8649,
192.168.1.113:8649, 192.168.1.115:8649 (node01, node03, node05).
But the datanode, for example (using
org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31), sends some metrics to
the first target host and others to the second and third.
So some metrics are missing on the second and third nodes, and when gmetad
collects metrics from one of those we cannot see certain metrics in Ganglia.

E.g. node07 runs only one process that sends metrics to Ganglia (the
datanode process), and we can see the following using tcpdump.

Dumping traffic for about three minutes:
$ sudo -i tcpdump dst port 8649 and src host node07 | tee tcpdump.out
...
$ head -n1 tcpdump.out
12:18:05.559719 IP node07.dom.local.43350 > node01.dom.local.8649: UDP,
length 180
$ tail -n1 tcpdump.out
12:20:59.575144 IP node

Then count packets and bytes sent to each target:
$ grep node01 tcpdump.out | wc -l
5972
$ grep node03 tcpdump.out | wc -l
3812
$ grep node05 tcpdump.out | wc -l
3811
$ grep node01 tcpdump.out | awk 'BEGIN{sum=0}{sum=sum+$8}END{print sum}'
1048272
$ grep node03 tcpdump.out | awk 'BEGIN{sum=0}{sum=sum+$8}END{print sum}'
731604
$ grep node05 tcpdump.out | awk 'BEGIN{sum=0}{sum=sum+$8}END{print sum}'
731532

We can also query the gmond daemons to see which metrics they have:

$ nc node01 8649 | grep ProcessName_DataNode | head -n1
<METRIC NAME="jvm.JvmMetrics.ProcessName_DataNode.LogFatal" VAL="0"
TYPE="float" UNITS="" TN="0" TMAX="60" DMAX="0" SLOPE="positive">
$ nc node03 8649 | grep ProcessName_DataNode | head -n1
$ nc node05 8649 | grep ProcessName_DataNode | head -n1
$ nc node01 8649 | grep ProcessName_DataNode | wc -l
100
$ nc node03 8649 | grep ProcessName_DataNode | wc -l
0
$ nc node05 8649 | grep ProcessName_DataNode | wc -l
0

We can see that only the first collector node in the list has certain
metrics.

Hadoop versions we use:
- MapReduce 2.0.0-mr1-cdh4.1.1
- HDFS 2.0.0-cdh4.1.1

hadoop-metrics2.properties content:

datanode.period=20
datanode.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
datanode.sink.ganglia.servers=192.168.1.111:8649,192.168.1.113:8649,192.168.1.115:8649
datanode.sink.ganglia.tagsForPrefix.jvm=*
datanode.sink.ganglia.tagsForPrefix.dfs=*
datanode.sink.ganglia.tagsForPrefix.rpc=*
datanode.sink.ganglia.tagsForPrefix.rpcdetailed=*
datanode.sink.ganglia.tagsForPrefix.metricssystem=*

-- 
Best Regards
Ivan Tretyakov

Re: Problem sending metrics to multiple targets

Posted by Ivan Tretyakov <it...@griddynamics.com>.
We investigated the problem and found the root cause. The Metrics2 framework
uses a different config parser than the first version (Metrics2 uses Apache
Commons, while Metrics uses Hadoop's own parser). That parser treats commas
as list separators by default, so when
org.apache.hadoop.metrics2.sink.ganglia.AbstractGangliaSink reads the
comma-separated server list, it gets back only the value up to the first
separator, i.e. only the first server in the list.
But we were able to find a workaround. The class that parses the server
list (org.apache.hadoop.metrics2.util.Servers) splits on both commas and
spaces. This means that if we provide a space-separated list of servers
instead of a comma-separated one, the new parser is able to read the whole
list. All servers are then registered as metrics receivers, and metrics are
sent to all of them.
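The parsing difference can be sketched with a toy Python script. This is
only an illustration of the behaviour described above, not the actual Hadoop
or Commons code; both helper functions are hypothetical stand-ins:

```python
import re

def commons_get_string(value):
    # Apache-Commons-style config parsing (as described above): a
    # comma-separated value is treated as a list, and reading it as a
    # single string yields only the first item.
    return value.split(",")[0].strip()

def parse_servers(value, default_port=8649):
    # Mimics the described behaviour of
    # org.apache.hadoop.metrics2.util.Servers: split the string on commas
    # and whitespace, defaulting the port when it is missing.
    result = []
    for spec in filter(None, re.split(r"[,\s]+", value)):
        host, _, port = spec.partition(":")
        result.append((host, int(port) if port else default_port))
    return result

comma_list = "192.168.1.111:8649,192.168.1.113:8649,192.168.1.115:8649"
space_list = "192.168.1.111:8649 192.168.1.113:8649 192.168.1.115:8649"

# Comma-separated: the config layer truncates at the first comma,
# so only one server reaches the sink.
print(parse_servers(commons_get_string(comma_list)))

# Space-separated: the value survives the config layer intact,
# and all three servers are registered.
print(parse_servers(commons_get_string(space_list)))
```

Applied to the config from the original mail, the workaround is simply:
datanode.sink.ganglia.servers=192.168.1.111:8649 192.168.1.113:8649 192.168.1.115:8649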


On Thu, Jan 17, 2013 at 7:17 PM, Ivan Tretyakov <itretyakov@griddynamics.com
> wrote:

> Hi!
>
> We have following problem.
>
> There are three target hosts to send metrics: 192.168.1.111:8649,
> 192.168.1.113:8649,192.168.1.115:8649 (node01, node03, node05).
> But for example datanode (using
> org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31) sends one metrics to
> first target host and the another to the second and third.
> So some metrics missed on second and third node. When gmetad collects
> metrics from one of these we could not see certain metrics in ganglia.
>
> E.g. on node07 running only one process which sends metrics to ganglia -
> datanode process and we could see following using tcpdump.
>
> Dumping traffic for about three minutes:
> $ sudo -i tcpdump dst port 8649 and src host node07 | tee tcpdump.out
> ...
> $ head -n1 tcpdump.out
> 12:18:05.559719 IP node07.dom.local.43350 > node01.dom.local.8649: UDP,
> length 180
> $ tail -n1 tcpdump.out
> 12:20:59.575144 IP node
>
> Then count packets and bytes sent to each target:
> $ grep node01 tcpdump.out | wc -l
> 5972
> $ grep node03 tcpdump.out | wc -l
> 3812
> $ grep node05 tcpdump.out | wc -l
> 3811
> $ grep node01 tcpdump.out | awk 'BEGIN{sum=0}{sum=sum+$8}END{print sum}'
> 1048272
> $ grep node03 tcpdump.out | awk 'BEGIN{sum=0}{sum=sum+$8}END{print sum}'
> 731604
> $ grep node05 tcpdump.out | awk 'BEGIN{sum=0}{sum=sum+$8}END{print sum}'
> 731532
>
> Also we could request gmond daemons which metrics do they have:
>
> $ nc node01 8649 | grep ProcessName_DataNode | head -n1
> <METRIC NAME="jvm.JvmMetrics.ProcessName_DataNode.LogFatal" VAL="0"
> TYPE="float" UNITS="" TN="0" TMAX="60" DMAX="0" SLOPE="positive">
> $ nc node03 8649 | grep ProcessName_DataNode | head -n1
> $ nc node05 8649 | grep ProcessName_DataNode | head -n1
> $ nc node01 8649 | grep ProcessName_DataNode | wc -l
> 100
> $ nc node03 8649 | grep ProcessName_DataNode | wc -l
> 0
> $ nc node05 8649 | grep ProcessName_DataNode | wc -l
> 0
>
> We could see that only first collector node from the list has certain
> metrics.
>
> Hadoop version we use:
> - MapReduce 2.0.0-mr1-cdh4.1.1
> - HDFS 2.0.0-cdh4.1.1
>
> hadoop-metrics2.properties content:
>
> datanode.period=20
>
> datanode.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
> datanode.sink.ganglia.servers=192.168.1.111:8649,192.168.1.113:8649,
> 192.168.1.115:8649
>  datanode.sink.ganglia.tagsForPrefix.jvm=*
> datanode.sink.ganglia.tagsForPrefix.dfs=*
> datanode.sink.ganglia.tagsForPrefix.rpc=*
> datanode.sink.ganglia.tagsForPrefix.rpcdetailed=*
> datanode.sink.ganglia.tagsForPrefix.metricssystem=*
>
> --
> Best Regards
> Ivan Tretyakov
>



-- 
Best Regards
Ivan Tretyakov
